#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Felix Muzny
11/29/2022
DS 2000
Lecture 22 - pandas and DataFrames

Logistics:
    - Take the final quiz
        
    - OH for the rest of the semester 
        - 4 - 8pm
        - we're happy to help you with DS 2001
        projects, but expect to explain your project/goal
        to the TA a bit first so they can help :)
        
    - remote attendance (https://bit.ly/remote-ds2000-muzny)
    

Three ways to participate in multiple choice questions
 1) via the PollEverywhere website: https://pollev.com/muzny
 2) via text: text "muzny" to the number 22333 to join the session
 3) via Poll Everywhere app (available for iOS or Android)
"""

"""
Warm-up 0
----

What does API stand for?

A. Apple Pie Incident
B. Android Program Intervention
C. Application Programming Interface <--- (in the context of programming)
D. Abstract Python Information
E. Automated Port Issuer 

"""


"""
Warm-up 1
----

What is an API?

A. A way to programmatically get information from servers on the internet <---

B. A way to find the location of files on a server
C. A way to remotely boot your software so that it can be accessed online
D. A way to write programs that have no functions
E. A way to clean data files so that all the data is nice to work with

"""

# What did we do on Tuesday (11/22/22)?
# a little bit of how the internet works
# what is an API
# how to programmatically get info from the internet in python
# use the requests library (import requests)
# connect to a server and ask for information from an endpoint (url)

# Twitter provides an API to access tweets
# we wrote/updated a program that accessed:
# pokemon statistics
# pokeapi

# we also learned about JSON
# JavaScript Object Notation
# basically a way to encode complex (very complex)
# data structures in plain strings
# load JSON responses from servers into python data structures
# json library can help with this
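
# A minimal sketch of the JSON idea above, assuming a made-up response
# string (the text and field names below are hypothetical, not real
# pokeapi output). json.loads turns a JSON string into ordinary
# python dictionaries and lists.
import json

example_response_text = '{"name": "pikachu", "stats": [{"name": "hp", "value": 35}]}'
example_data = json.loads(example_response_text)

# the result is a regular dictionary that contains a list of dictionaries
print(type(example_data))
print(example_data["name"])
print(example_data["stats"][0]["value"])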


"""
Warm-up 2
---
Please spend 10 minutes filling out your TRACE evaluations.

-> you have *separate* evaluations for DS 2000 & DS 2001
-> I do read these at the end of the semester!
    -> tell me specifically:
        -> what did you like?
        -> what did you wish had been different?      

If you have already done your TRACE evaluations, consider the
below problem:

The time module allows you to ask your computer what time 
it currently is. How much time does it take to sort a list?
Does it matter if the list is already sorted or not?
"""
import time
import random

ls = list(range(1000000))
# random.shuffle is a mutator: it shuffles the list in place
# uncomment the line below to time sorting an unsorted list
# random.shuffle(ls)

start = time.time()

# your code here
ls.sort()

end = time.time()
duration = end - start
print("That took:", duration)
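
# One possible way to explore the question above: a sketch that times
# sorting an already sorted list against a shuffled copy of it.
# time_sort and the list size are choices made for this example only;
# exact timings will differ from computer to computer.
import random
import time


def time_sort(values):
    """ Return the number of seconds it takes to sort a copy of values """
    values_copy = list(values)
    sort_start = time.time()
    values_copy.sort()
    sort_end = time.time()
    return sort_end - sort_start


sorted_ls = list(range(100000))
shuffled_ls = list(sorted_ls)
random.shuffle(shuffled_ls)

print("sorted list took:", time_sort(sorted_ls))
print("shuffled list took:", time_sort(shuffled_ls))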

# We'll work from here on Friday!
# we'll take a look at different data structures
# and functions and test out how fast/efficient they are/are not


"""
Data Science in the real world: a case study
---

All data science projects are motivated by a question.

We'll take a look at the question "is this map
unconstitutionally gerrymandered?"

"""


"""
pandas & DataFrames
---

what is it most useful for?
    -> if you're going on to DS 2500
        -> you'll likely be working with dataframes for ~80% of the 
        semester 
        -> https://course.ccs.neu.edu/ds2500/
        
    -> if you're doing other data science projects in 
    the future and want a bit more analytical power
    
    -> if you're curious about working with big data 
    sets efficiently
"""

"""
pandas & DataFrames
----

What is pandas?
- unfortunately not a group of fluffy black and white 
bears
- a python library
- installed with your anaconda installation
- for data management and analysis

What is a DataFrame?
- an object to represent a table of data
    -> like the lists of dicts we've been using
    -> or lists of lists
- it will do some work for you
    -> figuring out what types different columns are
- provides some convenient analytical power
    -> getting max, min, averages is very easy
- provides some convenient data management syntax
    -> getting a column
    -> find a subset of the data
        (e.g. all rows for patients with heart disease)
"""
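
# A small sketch of the ideas above, using made-up trip data (the column
# names and values here are hypothetical). A DataFrame can be built from
# the same lists of dicts or lists of lists that we have been using.
import pandas as pd

example_rows = [{"rider": "a", "duration": 300},
                {"rider": "b", "duration": 12000}]
from_dicts = pd.DataFrame(example_rows)

example_lists = [["a", 300], ["b", 12000]]
from_lists = pd.DataFrame(example_lists, columns=["rider", "duration"])

# compare the two tables
print(from_dicts)
print(from_dicts.equals(from_lists))

# pandas picks a type for each column on its own
print(from_dicts.dtypes)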
print()

# the standard nickname to give the library
import pandas as pd

# read in a file
# movies.csv, boston_earnings.csv, trips.csv
# return a DataFrame to you
df = pd.read_csv("trips.csv")
print(df)
print()

# ask the data frame what its shape is
# accessing the shape attribute
# tuple of (rows, columns)
# tuples are like a list except you 
# can't modify them
print(df.shape) 
print(df.shape[0])
print()
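
# A quick sketch of the tuple idea above, assuming a tiny made-up table.
# A tuple can be indexed and unpacked like a list, but not changed.
import pandas as pd

shape_example = pd.DataFrame({"duration": [300, 450, 12000]})
num_rows, num_cols = shape_example.shape
print(num_rows, "rows and", num_cols, "columns")

try:
    shape_example.shape[0] = 10
except TypeError as error:
    print("can't modify a tuple:", error)
print()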

# what columns does it have
print(df.columns)
# columns are accessed by name, not by position, so this
# raises a KeyError:
# print(df[0])
print(df["duration"])
print()

# get a summary of the data frame as a whole
# info() prints the summary itself, so no print() is needed
df.info()
print()

# access a single column, then
# get the max, min, average of that column
print(df["duration"].max())
print(df["duration"].min())
print(df["duration"].mean())
print()

# basic stats (count, mean, min, max, ...) for every
# numeric column in the dataframe
print(df.describe())
print()

# find rows that meet a specific condition
target_duration = 10000

long_trips = df[df["duration"] > target_duration]
print(long_trips)
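
# A sketch of what the filtering syntax above does in two steps, using a
# small made-up table (these values are hypothetical, not from trips.csv).
import pandas as pd

filter_example = pd.DataFrame({"duration": [300, 12000, 450, 20000]})

# step 1: the comparison makes one True/False value per row
is_long = filter_example["duration"] > 10000
print(is_long)

# step 2: using those True/False values as an index keeps the True rows
print(filter_example[is_long])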

"""
DataFrames + Jupyter notebooks
---

Jupyter Notebooks will automatically display 
DataFrames nicely for us. This makes Jupyter Notebooks
a natural choice to use when doing data science
investigations that use DataFrames. 

"""


# Next Time
# ---
#     - Timing Experiments
#     - More Data Science Applications