pandas?
An open-source (https://pandas.pydata.org), high-performance set of objects and functions for data analysis in Python. These sit atop Python, making it easier to do basic and advanced data analysis tasks quickly (you can always do the same things in, or supplement them with, raw Python).
pandas by Example: Red Wine Quality
Source: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
# Let's import pandas - it's common to name it something short
# Note: it's not built into Python, but is distributed with Anaconda
import pandas as pd
# Now read in the CSV
wine = pd.read_csv("winequality-red.csv")
# Well that was easy!!
# But what do I have...?
print(type(wine))
There are two main classes in pandas:
DataFrame is like a spreadsheet (rows and columns)
Series is like a column of data
# Let's see what's in this CSV (the columns and first few rows)
wine.head()
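To make the two classes concrete, here is a minimal sketch on a tiny hand-made table. The values and the name toy are invented for illustration and do not come from the wine CSV.

```python
import pandas as pd

# Tiny made-up table; these numbers are illustrative, not from the wine dataset
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2],
    "quality": [5, 5, 6],
})

print(type(toy))             # the whole table is a DataFrame
print(type(toy["alcohol"]))  # a single column pulled out of it is a Series
```

The same two types should show up when you run type() on wine and on any one of its columns.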
# How do you get just the columns?
wine.columns.values
type(wine.columns.values)
pandas sits atop another module called numpy (http://www.numpy.org), which makes it very efficient to work with data inside Python. An ndarray is an $n$-dimensional array of values.
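As a rough illustration of what the ndarray underneath buys you, the sketch below pulls the array out of a small made-up Series and does arithmetic on every element at once. The numbers are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
alcohol = pd.Series([9.4, 9.8, 11.2], name="alcohol")

values = alcohol.values  # the underlying numpy ndarray
print(type(values), values.ndim, values.dtype)

# Arithmetic applies element-wise to the whole array, with no explicit loop
doubled = values * 2
print(doubled)
```

A single column is a 1-dimensional array, which is why ndim is 1 here.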
# Let's get the "shape" of the data
wine.shape
This means there are 1599 rows (wines) and 12 columns in this dataset.
# Now let's get a summary of each of the columns
wine.describe()
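The summary that describe() returns is itself a DataFrame, so individual statistics can be looked up by row label and column name. A small sketch on invented numbers (not the wine data):

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "quality": [5, 5, 6, 7],
})

summary = toy.describe()
print(summary.index.tolist())  # the row labels are the statistic names

# Look up one statistic for one column with .loc[row_label, column_name]
mean_alcohol = summary.loc["mean", "alcohol"]
max_quality = summary.loc["max", "quality"]
print(mean_alcohol, max_quality)
```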
# We can also get a visual summary
wine.hist()
You can get to columns of values similar to how you get to fields in a dictionary...
wine['alcohol']
type(wine['alcohol'])
Notice that pandas nicely cuts the output to be of reasonable length for long datasets :)
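The dictionary analogy goes a little further than square brackets. The sketch below, on a made-up three-column table, shows that a single key gives back a Series, a list of keys gives back a smaller DataFrame, and keys() and the in operator work on column names.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})

one_column = toy["alcohol"]                 # a string key gives a Series
two_columns = toy[["alcohol", "quality"]]   # a list of keys gives a DataFrame

print(list(toy.keys()))  # like a dict, keys() lists the column names
print("pH" in toy)       # membership tests check column names
```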
We can do many things with columns...
# Describe an individual column
wine['alcohol'].describe()
# Visualize a single column
wine['alcohol'].hist()
# Extract the values
wine['alcohol'].values
# You can do most things with an ndarray, but we could also extract a pure list
list(wine['alcohol'].values[:10])
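One detail worth knowing: list() on an ndarray keeps NumPy scalar elements, while tolist() converts them to built-in Python numbers. The sketch below uses made-up values; to_numpy() is the accessor that recent pandas versions suggest as an alternative to .values.

```python
import pandas as pd

# Made-up values for illustration only
alcohol = pd.Series([9.4, 9.8, 11.2, 9.5], name="alcohol")

as_array = alcohol.to_numpy()           # ndarray, same data as .values here
first_two = alcohol.head(2).tolist()    # plain Python list of Python floats
via_list = list(as_array[:2])           # list whose elements are NumPy scalars

print(type(first_two[0]), type(via_list[0]))
```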
A common pattern is to create data frames based upon selections from other data frames...
high_quality = wine[wine['quality'] > 7]
high_quality.shape
high_quality
high_quality_low_alcohol = wine[(wine['quality'] > 7) & (wine['alcohol'] < 10)]
high_quality_low_alcohol.shape
high_quality_low_alcohol
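It may help to see what is happening inside the square brackets in the selections above. Each comparison produces a Series of True/False values (a mask), and indexing with a mask keeps only the True rows. A minimal sketch on invented data:

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({
    "alcohol": [9.4, 12.8, 9.8, 11.2],
    "quality": [5, 8, 8, 6],
})

is_high_quality = toy["quality"] > 7   # Series of booleans, one per row
is_low_alcohol = toy["alcohol"] < 10

print(is_high_quality.tolist())

# & combines two masks element-wise. When the comparisons are written inline,
# each one needs its own parentheses because & binds more tightly than > and <.
both = toy[is_high_quality & is_low_alcohol]

# True counts as 1, so summing a mask counts the matching rows
n_high_quality = int(is_high_quality.sum())
print(both, n_high_quality)
```

Naming the masks first, as here, is optional but can make longer conditions easier to read.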
You can do SO much more, including changing values, joining datasets, ...
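As a taste of those two operations, here is a hedged sketch on made-up tables. The wine_id key, the label column, and the prices table are all hypothetical; the wine CSV used in this notebook has none of them.

```python
import pandas as pd

# Made-up table; wine_id is a hypothetical key, not a column in the wine CSV
toy = pd.DataFrame({
    "wine_id": [1, 2, 3],
    "quality": [5, 8, 6],
})

# Changing values: add a column, then overwrite selected cells with .loc
toy["label"] = "ordinary"
toy.loc[toy["quality"] > 7, "label"] = "excellent"

# Joining: a second made-up table that shares the wine_id key
prices = pd.DataFrame({
    "wine_id": [1, 2, 3],
    "price": [12.0, 30.0, 15.0],
})
combined = toy.merge(prices, on="wine_id")
print(combined)
```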
# Hmmm... are quality and alcohol related?
wine.plot(x='alcohol', y='quality', style='o')
# There's sort of a relationship there??
# smf is the conventional alias, and it avoids confusion with scipy.stats
import statsmodels.formula.api as smf
quality_v_alcohol = smf.ols(formula="quality ~ alcohol", data=wine).fit()
quality_v_alcohol.summary()
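For a quick sanity check that does not need statsmodels, a Pearson correlation and a straight-line least-squares fit give a rough picture of the same relationship. This is a swapped-in shortcut rather than the full regression summary, and the numbers below are invented, not taken from the wine data. With one predictor and an intercept, the slope from np.polyfit should match the alcohol coefficient an ordinary least-squares fit reports on the same data.

```python
import numpy as np
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "quality": [5, 5, 6, 6, 7],
})

# Pearson correlation between the two columns (between -1 and 1)
r = toy["alcohol"].corr(toy["quality"])

# Least-squares straight line: quality is approximated by intercept + slope * alcohol
slope, intercept = np.polyfit(toy["alcohol"], toy["quality"], deg=1)

print(f"r = {r:.2f}, quality = {intercept:.2f} + {slope:.2f} * alcohol")
```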