{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "| DS 2000 | \n", "|:----------:|\n", "| Prof. Rachlin | \n", "| Miscellaneous Data Sources | \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading data with the csv module\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how we've been reading in this kind of data so far." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['NAME', 'AGE', 'LIKES_PYTHON', 'HOBBY'], ['Joe', '21', 'True', 'Fishing'], ['Ann', '20', 'True', 'Soccer'], ['Lee', '24', 'False', 'Painting'], ['Mary', '87', 'True', 'Motorcycles']]\n" ] } ], "source": [ "with open(\"clean.csv\") as f:\n", " data = [line.strip().split(\",\") for line in f]\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works reasonably well, but already we see a few problems. \n", "\n", "- We might want to discard the first line header, but what if there is no header\n", "- Everything is converted to a string\n", "- We are assuming that values are delimited by commas, but it could be a tab, semicolon or other character\n", "- We have to re-write this code for every file\n", "\n", "Let's create a more robust data reader." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['Joe', 21, True, 'Fishing'], ['Ann', 20, True, 'Soccer'], ['Lee', 24, True, 'Painting'], ['Mary', 87, True, 'Motorcycles']]\n" ] } ], "source": [ "def read_data(filename, header=False, sep=',', coltypes=None):\n", "    \"\"\" Read tabular data from a file.\n", "    header: Whether the file has a header\n", "    sep: The value separator\n", "    coltypes: The column data types \"\"\"\n", "    with open(filename, \"r\") as f:\n", "\n", "        # Convert file to a list of lists\n", "        data = [line.strip().split(sep) for line in f]\n", "\n", "        # Discard header if there is one\n", "        if header:\n", "            data = data[1:]\n", "\n", "        # Convert datatypes (if provided)\n", "        if coltypes:\n", "            for record in data:\n", "                for i, value in enumerate(record):\n", "                    record[i] = coltypes[i](value)\n", "\n", "        return data\n", "\n", "data = read_data(\"clean.csv\", header=True, sep=',', coltypes=[str, int, bool, str])\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is pretty nice, and will work fine when you have nicely formatted \"clean\" data. But data is rarely so kind. Data scientists often spend more than half their time preparing their data: cleaning, reformatting, filtering, and dealing with inconsistencies or missing values. This is called *data munging* and its importance should not be underestimated! Let's look at a messy version of this file:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "more messy.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some additional problems arise:\n", " \n", "* There may be missing values. They are designated here by an empty value, but data providers might alternatively use a special code: NA, N/A, ?, 999, -1, UNK, NIL, NULL, etc. 
Our data reader will choke trying to convert an empty string or a missing-value code to an integer, float, or boolean.\n", "* Some values may contain the character we are assuming is a delimiter (a comma in this case). The CSV format has a work-around: surround the entire value in quotes. Our fancy data reader won't handle this at all. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['NAME', 'AGE', 'LIKES_PYTHON', 'HOBBY'], ['Joe', '21', 'True', 'Fishing'], ['Ann', '20', 'True', ''], ['Lee', '', 'False', '\"Art', ' Acrylic Painting\"'], ['Mary', '87', 'True', 'Motorcycles']]\n" ] } ], "source": [ "data = read_data(\"messy.csv\")\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the value **\"Art, Acrylic Painting\"** has been split into two items, **'\"Art'** and **' Acrylic Painting\"'**, and notice, also, the double-quotes embedded inside the items. Trying to convert empty strings or missing-value codes to ints, floats, or booleans will also be problematic. A better way to read data formatted like this (i.e., delimited data) is the `csv` module, which is built into Python. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<_csv.reader object at 0x1078db3c0>\n", "[['NAME', 'AGE', 'LIKES_PYTHON', 'HOBBY'], ['Joe', '21', 'True', 'Fishing'], ['Ann', '20', 'True', ''], ['Lee', '', 'False', 'Art, Acrylic Painting'], ['Mary', '87', 'True', 'Motorcycles']]\n" ] } ], "source": [ "import csv\n", "with open(\"messy.csv\") as f:\n", "    data = csv.reader(f)\n", "    print(data)\n", "    print(list(data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `csv.reader` function takes optional parameters for different separators (e.g., `delimiter='\\t'`), and for files with headers the module also provides `csv.DictReader`. We still have to deal with type conversions on our own. 
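For example, a minimal sketch of per-column conversion that tolerates missing values (the helper `to_typed` is our own, not part of the `csv` module):\n", "\n", "```python\n", "def to_typed(rows, coltypes):\n", "    # Apply one converter per column, leaving empty strings as None\n", "    return [[None if v == '' else f(v) for f, v in zip(coltypes, row)]\n", "            for row in rows]\n", "\n", "print(to_typed([['Joe', '21'], ['Ann', '']], [str, int]))  # [['Joe', 21], ['Ann', None]]\n", "```\n", "\n", "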
The big advantages of using the **csv** module are:\n", "\n", "- Simplicity: less code is good!\n", "- Robustness: quoted values containing delimiters are parsed correctly, and missing values come through consistently as empty strings.\n", "\n", "How does pandas handle our messy data?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "|   | NAME | AGE  | LIKES_PYTHON | HOBBY                 |\n", "|---|------|------|--------------|-----------------------|\n", "| 0 | Joe  | 21.0 | True         | Fishing               |\n", "| 1 | Ann  | 20.0 | True         | NaN                   |\n", "| 2 | Lee  | NaN  | False        | Art, Acrylic Painting |\n", "| 3 | Mary | 87.0 | True         | Motorcycles           |\n", "