{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "| DS 2000 | \n", "|:----------:|\n", "| Prof. Rachlin | \n", "| Miscellaneous Data Sources | \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading data with the csv module\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how we've been reading in this kind of data so far." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works reasonably well, but already we see a few problems. \n", "\n", "- We might want to discard the first line header, but what if there is no header\n", "- Everything is converted to a string\n", "- We are assuming that values are delimited by commas, but it could be a tab, semicolon or other character\n", "- We have to re-write this code for every file\n", "\n", "Let's create a more robust data reader." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "def read_data(filename, header=False, sep=',', coltypes=[]):\n", " \"\"\" Read tabular data from a file.\n", " header: Whether the file has a header\n", " sep: The value separator \n", " coltypes The column data types \"\"\"\n", " pass\n", " \n", "data = read_data(\"clean.csv\", header=True, sep=',', coltypes=[str, int, bool, str])\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is pretty nice, and will work fine when you have nicely formatted \"clean\" data. But data is rarely so kind. Data Scientists often spend more than half their time preparing their data: cleaning, reformating, filtering, and dealing with inconsistencies or missing values. This is called *data munging* and its importance should not be under-estimated! Let's look at a messy version of this file:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "more messy.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some additional problems arise:\n", " \n", "* There may be missing values (designated here as an empty value, but data providers might alternatively use a special code: NA, N/A, ?, 999, -1, UNK, NIL, NULL, etc.) to indicate a missing value. Our data reader will choke trying to convert an empty string or a missing value code to an integer, float, or boolean.\n", "* Some values may contain what we are assuming is a delimiter - a comma in this case. CSV format has a work-around, surround the entire value in quotes. Our fancy data reader won't handle this at all. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the value **\"Art, Acrylic Painting\"** has been converted into two items, **'\"Art'** and **' Acrylic Painting\"'** and notice, also, the double-quotes embedded inside the items. Trying to convert empty strings or missing value codes to ints, floats, or booleans will also be problematic. A better way to read data formatted like this (i.e., delimited data) is the `csv` module, which is built-in to Python. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reader method has optional parameters for dealing with headers and different separators. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The reader accepts optional parameters for dealing with different separators (e.g., `delimiter='\\t'`), and we can skip a header line ourselves with `next(reader)`. We still have to deal with type conversions on our own. The big advantages of using the **csv** module are:\n", "\n", "- Simplicity: less code is good!\n", "- It automatically handles values containing delimiters (via quoting) and doesn't choke on missing values - they simply come back as empty strings.\n", "\n", "How does pandas handle our messy data?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# pandas parses the quoted value correctly and marks missing values as NaN\n", "df = pd.read_csv(\"messy.csv\")\n", "df" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a column of data (series)\n", "# (we select the first column by position, since the real column names\n", "# depend on the file; use df.columns to see them)\n", "df[df.columns[0]]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a row of data (also a series)\n", "df.iloc[0]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Excel Spreadsheets\n", "\n", "With the **pandas** module, we can even read Excel spreadsheets!" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# df is shorthand for \"DataFrame\" - a table of data\n", "# (the filename students.xlsx is hypothetical - substitute your own spreadsheet;\n", "# reading .xlsx files also requires the openpyxl package)\n", "df = pd.read_excel(\"students.xlsx\")\n", "\n", "# We could make the student column the 'index'\n", "# (assumes the spreadsheet has a column named 'student')\n", "df = df.set_index(\"student\")\n", "df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### JSON Files\n", "\n", "JSON stands for JavaScript Object Notation. It is a data-interchange format that is widely used on the web and for representing hierarchical data, something tabular flat files don't do so well.\n", "\n", "You can learn more about JSON [here](https://www.json.org), but from a Python point of view, JSON looks much like a dictionary." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "more example.json" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# json.load reads a file and returns the corresponding Python\n", "# data structure - typically nested dictionaries and lists\n", "with open(\"example.json\", \"r\") as infile:\n", "    data = json.load(infile)\n", "print(data)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Now, let's use the `json` and `requests` modules to call web-service APIs" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Application Programming Interfaces (APIs) provide a general means of interacting with other programs. More specifically, many websites provide data via a web API - this means we can make *requests* to remote applications / websites for data. Here's what this looks like schematically.\n", "\n", "![alt text](https://cdn-images-1.medium.com/max/2000/1*q9CRTmO258jWLsMZAd5JLw.png \"Image credit: http://www.robert-drummond.com/2013/05/08/how-to-build-a-restful-web-api-on-a-raspberry-pi-in-javascript-2/\")\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The way that we do this in Python is via the `requests` module. The idea is that the Response (in the above depiction) is nicely formatted `json`, which feels a lot like the dictionaries you now know and love. \n", "\n", "Here is an example of calling the Yelp API. It requires that we have registered our application with Yelp in order to obtain an API KEY. This is our way of telling Yelp who we are - it's how we authenticate our request. Yelp keeps track of how many requests we make and places a daily limit on our usage (5,000 API calls per day). According to the Yelp API documentation, we must pass our API_KEY using a request header (discussed below) of the form:\n", "\n", "**Authorization: Bearer API_KEY**\n", "\n", "There are also additional parameters we can specify, like the location (Boston) and the type of business we are looking for (Pizza). " ] },
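{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is a sketch of the request: we pass the key in the `Authorization` header and the search terms as query parameters. The parameter names `term` and `location` follow the Yelp search API documentation; adjust them for your own queries." ] },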
\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "API_KEY = 'k04EveojLDWtsNI9D_GcEAYSDLNThcspgKOAPuP3TgaCH7u97JdAtaoFni8FiD612pkEJRQyvkSI0iCMXbM8xVWe6e6N0_KWNB-e1zQw7JR1Qv-hg_R-Rwy0L7TMXXYx'\n", "\n", "API_URL = 'https://api.yelp.com/v3/businesses/search'\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the distribution of ratings?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not much of a correlation but it does show that most pizza places are averaging either 4.0 or 4.5 stars." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As another an example, reddit makes it possible to fetch content from sub-redit streams. The documentation for this API is found here. In particular, let's fetch some recent cat photos from reddit - because we haven't seen enough cat photos on the Internet, right?\n", "\n", "### A note about HTTP headers\n", "When we make an API call, we may have to supply additional metadata about our request in the form of key-value pairs. Which request headers we have to supply is API-specific. The reddit API requires that we supply a descriptive value for the 'User-Agent'. The User-Agent attribute is used to define the caller, the browser, or the application being used to make the API request. In otherwords, reddit is telling us that we can freely access their database via a documented API call, but they want to know who's calling! Additional rules for using the reddit API can be found here." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import requests\n", "import json\n", "\n", "# here's we're hitting the 'new' endpoint, see: https://www.reddit.com/dev/api/#GET_new\n", "\n", "# Make the request. BTW, the result you get is likely to change every few minutes.\n", "# Cat's are really really popular on reddit!\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# How many items did we receive?\n", "# We asked for at most 10.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So these are the top 10 posts. Let's look at the first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Let's see some cats!\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 2 }