DS2000 (Spring 2019, NCH) :: Lecture 12a

0. Administrivia

  1. Final (numerical) grades available in Blackboard (just rounded overall average)
  2. This Friday (4/5) and next (4/12) are project work days -- e-mail for Derbinsky help (and we can video chat if that's useful)
  3. Please fill out TRACE: be specific, constructive if possible :)

1. What is pandas?

An open-source (https://pandas.pydata.org), high-performance set of objects/functions/etc for data analysis in Python. These sit atop Python, making it easier for you to do basic & advanced data analysis tasks quickly (but you can always do these things/supplement with raw Python).

2. pandas by Example: Red Wine Quality

Source: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

In [2]:
# Let's import pandas - it's common to name it something short
# Note: it's not built into Python, but is distributed with Anaconda
import pandas as pd
In [3]:
# Now read in the CSV
wine = pd.read_csv("winequality-red.csv")
In [4]:
# Well that was easy!!
# But what do I have...?

print(type(wine))
<class 'pandas.core.frame.DataFrame'>

There are two main classes in pandas:

  • A DataFrame is like a spreadsheet (rows and columns)
  • A Series is like a column of data
In [5]:
# Let's see what's in this CSV (the columns and first few rows)
wine.head()
Out[5]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [6]:
# How do you get just the columns?
wine.columns.values
Out[6]:
array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality'], dtype=object)
In [7]:
type(wine.columns.values)
Out[7]:
numpy.ndarray

pandas sits atop another module called numpy (http://www.numpy.org), which makes it very efficient to work with data inside Python. An ndarray is an $n$-dimensional array of values.

In [8]:
# Let's get the "shape" of the data
wine.shape
Out[8]:
(1599, 12)

This means there are 1599 rows (wines) and 12 columns in this dataset.

In [9]:
# Now let's get a summary of each of the columns
wine.describe()
Out[9]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000
In [23]:
# We can also get a visual summary
wine.hist()
Out[23]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1a2530ab38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a2586a278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25893470>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1a258bd8d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a258e5e48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25917400>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1a2593e978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25967f28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25967f60>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1a259bda20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a259e8f98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25a18550>]],
      dtype=object)

You can get to columns of values similar to how you get to fields in a dictionary...

In [11]:
wine['alcohol']
Out[11]:
0        9.4
1        9.8
2        9.8
3        9.8
4        9.4
5        9.4
6        9.4
7       10.0
8        9.5
9       10.5
10       9.2
11      10.5
12       9.9
13       9.1
14       9.2
15       9.2
16      10.5
17       9.3
18       9.0
19       9.2
20       9.4
21       9.7
22       9.5
23       9.4
24       9.7
25       9.3
26       9.5
27       9.5
28       9.4
29       9.8
        ... 
1569    11.5
1570    12.4
1571    11.1
1572     9.5
1573    12.5
1574    10.5
1575    11.8
1576    10.8
1577    11.9
1578    11.3
1579    11.3
1580    11.9
1581    11.3
1582    11.9
1583     9.8
1584    11.6
1585    11.5
1586    11.4
1587    10.9
1588    12.8
1589     9.2
1590    11.6
1591    11.6
1592    11.0
1593     9.5
1594    10.5
1595    11.2
1596    11.0
1597    10.2
1598    11.0
Name: alcohol, Length: 1599, dtype: float64
In [12]:
type(wine['alcohol'])
Out[12]:
pandas.core.series.Series

Notice that pandas nicely cuts the output to be of reasonable length for long datasets :)

We can do many things with columns...

In [13]:
# Describe an individual column
wine['alcohol'].describe()
Out[13]:
count    1599.000000
mean       10.422983
std         1.065668
min         8.400000
25%         9.500000
50%        10.200000
75%        11.100000
max        14.900000
Name: alcohol, dtype: float64
In [14]:
# Visualize a single column
wine['alcohol'].hist()
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1233fa198>
In [15]:
# Extract the values
wine['alcohol'].values
Out[15]:
array([ 9.4,  9.8,  9.8, ..., 11. , 10.2, 11. ])
In [16]:
# You can do most things with an ndarray, but we could also extract a pure list
list(wine['alcohol'].values[:10])
Out[16]:
[9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]

A common pattern is to create data frames based upon selections from other data frames...

In [17]:
high_quality = wine[wine['quality'] > 7]
high_quality.shape
Out[17]:
(18, 12)
In [18]:
high_quality
Out[18]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
267 7.9 0.35 0.46 3.6 0.078 15.0 37.0 0.99730 3.35 0.86 12.8 8
278 10.3 0.32 0.45 6.4 0.073 5.0 13.0 0.99760 3.23 0.82 12.6 8
390 5.6 0.85 0.05 1.4 0.045 12.0 88.0 0.99240 3.56 0.82 12.9 8
440 12.6 0.31 0.72 2.2 0.072 6.0 29.0 0.99870 2.88 0.82 9.8 8
455 11.3 0.62 0.67 5.2 0.086 6.0 19.0 0.99880 3.22 0.69 13.4 8
481 9.4 0.30 0.56 2.8 0.080 6.0 17.0 0.99640 3.15 0.92 11.7 8
495 10.7 0.35 0.53 2.6 0.070 5.0 16.0 0.99720 3.15 0.65 11.0 8
498 10.7 0.35 0.53 2.6 0.070 5.0 16.0 0.99720 3.15 0.65 11.0 8
588 5.0 0.42 0.24 2.0 0.060 19.0 50.0 0.99170 3.72 0.74 14.0 8
828 7.8 0.57 0.09 2.3 0.065 34.0 45.0 0.99417 3.46 0.74 12.7 8
1061 9.1 0.40 0.50 1.8 0.071 7.0 16.0 0.99462 3.21 0.69 12.5 8
1090 10.0 0.26 0.54 1.9 0.083 42.0 74.0 0.99451 2.98 0.63 11.8 8
1120 7.9 0.54 0.34 2.5 0.076 8.0 17.0 0.99235 3.20 0.72 13.1 8
1202 8.6 0.42 0.39 1.8 0.068 6.0 12.0 0.99516 3.35 0.69 11.7 8
1269 5.5 0.49 0.03 1.8 0.044 28.0 87.0 0.99080 3.50 0.82 14.0 8
1403 7.2 0.33 0.33 1.7 0.061 3.0 13.0 0.99600 3.23 1.10 10.0 8
1449 7.2 0.38 0.31 2.0 0.056 15.0 29.0 0.99472 3.23 0.76 11.3 8
1549 7.4 0.36 0.30 1.8 0.074 17.0 24.0 0.99419 3.24 0.70 11.4 8
In [19]:
high_quality_low_alcohol = wine[(wine['quality'] > 7) & (wine['alcohol'] < 10)]
high_quality_low_alcohol.shape
Out[19]:
(1, 12)
In [20]:
high_quality_low_alcohol
Out[20]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
440 12.6 0.31 0.72 2.2 0.072 6.0 29.0 0.9987 2.88 0.82 9.8 8

You can do SO much more, including changing values, joining datasets, ...

In [21]:
# Hmmm... are quality and alcohol related?
wine.plot(x='alcohol', y='quality', style='o')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1234563c8>
In [22]:
# There's sortof a relationship there??
import statsmodels.formula.api as stats

quality_v_alcohol = stats.ols(formula="quality ~ alcohol", data=wine).fit()
quality_v_alcohol.summary()
Out[22]:
OLS Regression Results
Dep. Variable: quality R-squared: 0.227
Model: OLS Adj. R-squared: 0.226
Method: Least Squares F-statistic: 468.3
Date: Tue, 26 Mar 2019 Prob (F-statistic): 2.83e-91
Time: 14:32:59 Log-Likelihood: -1721.1
No. Observations: 1599 AIC: 3446.
Df Residuals: 1597 BIC: 3457.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.8750 0.175 10.732 0.000 1.532 2.218
alcohol 0.3608 0.017 21.639 0.000 0.328 0.394
Omnibus: 38.501 Durbin-Watson: 1.748
Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.758
Skew: -0.154 Prob(JB): 2.62e-16
Kurtosis: 3.991 Cond. No. 104.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.