#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Felix Muzny
11/29/2022
DS 2000
Lecture 22 - pandas and DataFrames

Logistics:
- Take the final quiz
- OH for the rest of the semester
    - 4 - 8pm
    - we're happy to help you with DS 2001 projects, AND
      expect to explain your project/goal to the TA a bit
      to get help to start with :)
- remote attendance (https://bit.ly/remote-ds2000-muzny)

Three ways to participate in multiple choice questions
1) via the PollEverywhere website: https://pollev.com/muzny
2) via text: text "muzny" to the number 22333 to join the session
3) via Poll Everywhere app (available for iOS or Android)
"""

"""
Warm-up 0
----
What is an API?

A. Apple Pie Incident
B. Android Program Intervention
C. Application Programming Interface   <--- (in the context of programming)
D. Abstract Python Information
E. Automated Port Issuer
"""

"""
Warm-up 1
----
What is an API?

A. A way to programmatically get information from servers on the internet  <---
B. A way to find the location of files on a server
C. A way to remotely boot your software so that it can be accessed online
D. A way to write programs that have no functions
E. A way to clean data files so that all the data is nice to work with
"""

# What did we do on Tuesday (11/22/22)?
# a little bit of how the internet works
# what is an API
# how to programmatically get info from the internet in python
#   use the requests library (import requests)
#   connect to a server and ask for information from an endpoint (url)
# Twitter provides an api to access tweets
# we wrote/updated a program that accessed:
#   pokemon statistics
#   pokeapi
# we also learned about JSON
#   JavaScript Object Notation
#   basically a way to encode complex (very complex)
#   data structures in plain strings
# load JSON responses from servers into python data structures
#   json library can help with this

"""
Warm-up 2
---
Please spend 10 minutes filling out your trace evaluations.
    -> you have *separate* evaluations for DS 2000 & DS 2001
    -> I do read these at the end of the semester!
    -> tell me specifically:
        -> what did you like?
        -> what did you wish had been different?

If you have already done your trace evaluations, consider
the below problem:

The time module allows you to ask your computer what time it
currently is.

How much time does it take to sort a list? Does it matter if the
list is already sorted or not?
"""
import time
import random

ls = list(range(1000000))
# to shuffle a list into a random order with a mutator
# random.shuffle(ls)

start = time.time()
# your code here
ls.sort()
end = time.time()
duration = end - start
print("That took:", duration)

# We'll work from here on Friday!
# we'll take a look at different data structures
# and functions and test out how fast/efficient they are/are not

"""
Data Science in the real world: a case study
---
All data science projects are motivated by a question.

We'll take a look at the question
"is this map unconstitutionally gerrymandered?"
"""

"""
pandas & DataFrames
---
what is it most useful for?
    -> if you're going on to DS 2500
        -> you'll likely be working with dataframes for ~80% of the semester
        -> https://course.ccs.neu.edu/ds2500/
    -> if you're doing other data science projects in the future and want
       a bit more analytical power
    -> if you're curious about working with big data sets efficiently
"""

"""
pandas & DataFrames
----
What is pandas?
    - unfortunately not a group of fluffy black and white bears
    - a python library
        - installed with your anaconda installation
        - for data management and analysis

What is a DataFrame?
    - an object to represent a table of data
        -> lists of dicts
        -> lists of lists
    - it will do some work for you
        -> figuring out what types different columns are
    - provide some convenient analytical power
        -> getting max, min, averages is very easy
    - provide some convenient data management syntax
        -> getting a column
        -> find a subset of the data
           (e.g. all rows for patients with heart disease)
"""
print()
# the standard nickname to give the library
import pandas as pd

# read in a file
# movies.csv, boston_earnings.csv, trips.csv
# return a DataFrame to you
df = pd.read_csv("trips.csv")
print(df)
print()

# ask the data frame what its shape is
# accessing the shape attribute
# tuple of (rows, columns)
# tuples are like a list except you
# can't modify them
print(df.shape)
print(df.shape[0])
print()

# what columns does it have
print(df.columns)
# print(df[0])
print(df["duration"])
print()

# get a summary of the data frame as a whole
# will print for you
df.info()
print()

# access a single column
# get the max, min, average of a certain column
print(df["duration"].max())
print(df["duration"].mean())
print()

# base stats for every column in the dataframe
print(df.describe())
print()

# find rows that meet a specific condition
target_duration = 10000
long_trips = df[df["duration"] > target_duration]
print(long_trips)

"""
DataFrames + Jupyter notebooks
---
Jupyter Notebooks will automatically display DataFrames nicely for us.

This makes Jupyter Notebooks a natural choice to use when doing
data science investigations that use DataFrames.
"""

# Next Time
# ---
# - Timing Experiments
# - More Data Science Applications
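
"""
Timing sketch (a possible starting point for the warm-up 2 question)
---
The warm-up asks whether it matters if the list is already sorted.
Below is a minimal sketch of one way to compare the two cases. It is
not the lecture's solution. Assumptions for illustration only:
    -> the helper name time_sort and the list size are made up here
    -> time.perf_counter is swapped in for time.time, since it is
       meant for measuring short durations
A single run is noisy, so treat the printed numbers as a rough
comparison and re-run a few times before drawing conclusions.
"""

```python
import random
import time


def time_sort(values):
    """Return the number of seconds it takes to sort a copy of values."""
    copy = list(values)  # sort a copy so the caller's list is left untouched
    start = time.perf_counter()
    copy.sort()
    end = time.perf_counter()
    return end - start


# assumed size: smaller than the warm-up's 1000000 so this runs quickly
size = 200000

already_sorted = list(range(size))

shuffled = list(range(size))
random.shuffle(shuffled)  # mutator: reorders the list in place

sorted_duration = time_sort(already_sorted)
shuffled_duration = time_sort(shuffled)

print("Already sorted took:", sorted_duration)
print("Shuffled took:", shuffled_duration)
```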