DS2000 (Spring 2019, NCH) :: Lecture 12a¶

0. Administrivia¶

Final (numerical) grades available in Blackboard (just rounded overall average)
This Friday (4/5) and next (4/12) are project work days -- e-mail for Derbinsky help (and we can video chat if that's useful)
Please fill out TRACE: be specific, constructive if possible :)

1. What is `pandas`?¶

An open-source (https://pandas.pydata.org), high-performance set of objects/functions/etc for data analysis in Python. These sit atop Python, making it easier for you to do basic & advanced data analysis tasks quickly (but you can always do these things/supplement with raw Python).

2. `pandas` by Example: Red Wine Quality¶

Source: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

# Let's import pandas - it's common to name it something short
# Note: it's not built into Python, but is distributed with Anaconda
import pandas as pd

# Now read in the CSV
wine = pd.read_csv("winequality-red.csv")

# Well that was easy!!
# But what do I have...?

print(type(wine))

<class 'pandas.core.frame.DataFrame'>

There are two main classes in pandas:

A DataFrame is like a spreadsheet (rows and columns)
A Series is like a column of data

# Let's see what's in this CSV (the columns and first few rows)
wine.head()

# How do you get just the columns?
wine.columns.values

array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality'], dtype=object)

type(wine.columns.values)

numpy.ndarray

pandas sits atop another module called numpy (http://www.numpy.org), which makes it very efficient to work with data inside Python. An ndarray is an $n$-dimensional array of values.

# Let's get the "shape" of the data
wine.shape

(1599, 12)

This means there are 1599 rows (wines) and 12 columns in this dataset.

# Now let's get a summary of each of the columns
wine.describe()

# We can also get a visual summary
wine.hist()

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1a2530ab38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a2586a278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25893470>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1a258bd8d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a258e5e48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25917400>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1a2593e978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25967f28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25967f60>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1a259bda20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a259e8f98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1a25a18550>]],
      dtype=object)

You can get to columns of values similar to how you get to fields in a dictionary...

wine['alcohol']

0        9.4
1        9.8
2        9.8
3        9.8
4        9.4
5        9.4
6        9.4
7       10.0
8        9.5
9       10.5
10       9.2
11      10.5
12       9.9
13       9.1
14       9.2
15       9.2
16      10.5
17       9.3
18       9.0
19       9.2
20       9.4
21       9.7
22       9.5
23       9.4
24       9.7
25       9.3
26       9.5
27       9.5
28       9.4
29       9.8
        ... 
1569    11.5
1570    12.4
1571    11.1
1572     9.5
1573    12.5
1574    10.5
1575    11.8
1576    10.8
1577    11.9
1578    11.3
1579    11.3
1580    11.9
1581    11.3
1582    11.9
1583     9.8
1584    11.6
1585    11.5
1586    11.4
1587    10.9
1588    12.8
1589     9.2
1590    11.6
1591    11.6
1592    11.0
1593     9.5
1594    10.5
1595    11.2
1596    11.0
1597    10.2
1598    11.0
Name: alcohol, Length: 1599, dtype: float64

type(wine['alcohol'])

pandas.core.series.Series

Notice that pandas nicely cuts the output to be of reasonable length for long datasets :)

We can do many things with columns...

# Describe an individual column
wine['alcohol'].describe()

count    1599.000000
mean       10.422983
std         1.065668
min         8.400000
25%         9.500000
50%        10.200000
75%        11.100000
max        14.900000
Name: alcohol, dtype: float64

# Visualize a single column
wine['alcohol'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1233fa198>

# Extract the values
wine['alcohol'].values

array([ 9.4,  9.8,  9.8, ..., 11. , 10.2, 11. ])

# You can do most things with an ndarray, but we could also extract a pure list
list(wine['alcohol'].values[:10])

[9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]

A common pattern is to create data frames based upon selections from other data frames...

high_quality = wine[wine['quality'] > 7]
high_quality.shape

(18, 12)

high_quality

high_quality_low_alcohol = wine[(wine['quality'] > 7) & (wine['alcohol'] < 10)]
high_quality_low_alcohol.shape

(1, 12)

high_quality_low_alcohol

You can do SO much more, including changing values, joining datasets, ...

# Hmmm... are quality and alcohol related?
wine.plot(x='alcohol', y='quality', style='o')

<matplotlib.axes._subplots.AxesSubplot at 0x1234563c8>

# There's sortof a relationship there??
import statsmodels.formula.api as stats

quality_v_alcohol = stats.ols(formula="quality ~ alcohol", data=wine).fit()
quality_v_alcohol.summary()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000
mean	8.319637	0.527821	0.270976	2.538806	0.087467	15.874922	46.467792	0.996747	3.311113	0.658149	10.422983	5.636023
std	1.741096	0.179060	0.194801	1.409928	0.047065	10.460157	32.895324	0.001887	0.154386	0.169507	1.065668	0.807569
min	4.600000	0.120000	0.000000	0.900000	0.012000	1.000000	6.000000	0.990070	2.740000	0.330000	8.400000	3.000000
25%	7.100000	0.390000	0.090000	1.900000	0.070000	7.000000	22.000000	0.995600	3.210000	0.550000	9.500000	5.000000
50%	7.900000	0.520000	0.260000	2.200000	0.079000	14.000000	38.000000	0.996750	3.310000	0.620000	10.200000	6.000000
75%	9.200000	0.640000	0.420000	2.600000	0.090000	21.000000	62.000000	0.997835	3.400000	0.730000	11.100000	6.000000
max	15.900000	1.580000	1.000000	15.500000	0.611000	72.000000	289.000000	1.003690	4.010000	2.000000	14.900000	8.000000

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
267	7.9	0.35	0.46	3.6	0.078	15.0	37.0	0.99730	3.35	0.86	12.8	8
278	10.3	0.32	0.45	6.4	0.073	5.0	13.0	0.99760	3.23	0.82	12.6	8
390	5.6	0.85	0.05	1.4	0.045	12.0	88.0	0.99240	3.56	0.82	12.9	8
440	12.6	0.31	0.72	2.2	0.072	6.0	29.0	0.99870	2.88	0.82	9.8	8
455	11.3	0.62	0.67	5.2	0.086	6.0	19.0	0.99880	3.22	0.69	13.4	8
481	9.4	0.30	0.56	2.8	0.080	6.0	17.0	0.99640	3.15	0.92	11.7	8
495	10.7	0.35	0.53	2.6	0.070	5.0	16.0	0.99720	3.15	0.65	11.0	8
498	10.7	0.35	0.53	2.6	0.070	5.0	16.0	0.99720	3.15	0.65	11.0	8
588	5.0	0.42	0.24	2.0	0.060	19.0	50.0	0.99170	3.72	0.74	14.0	8
828	7.8	0.57	0.09	2.3	0.065	34.0	45.0	0.99417	3.46	0.74	12.7	8
1061	9.1	0.40	0.50	1.8	0.071	7.0	16.0	0.99462	3.21	0.69	12.5	8
1090	10.0	0.26	0.54	1.9	0.083	42.0	74.0	0.99451	2.98	0.63	11.8	8
1120	7.9	0.54	0.34	2.5	0.076	8.0	17.0	0.99235	3.20	0.72	13.1	8
1202	8.6	0.42	0.39	1.8	0.068	6.0	12.0	0.99516	3.35	0.69	11.7	8
1269	5.5	0.49	0.03	1.8	0.044	28.0	87.0	0.99080	3.50	0.82	14.0	8
1403	7.2	0.33	0.33	1.7	0.061	3.0	13.0	0.99600	3.23	1.10	10.0	8
1449	7.2	0.38	0.31	2.0	0.056	15.0	29.0	0.99472	3.23	0.76	11.3	8
1549	7.4	0.36	0.30	1.8	0.074	17.0	24.0	0.99419	3.24	0.70	11.4	8

Dep. Variable:	quality	R-squared:	0.227
Model:	OLS	Adj. R-squared:	0.226
Method:	Least Squares	F-statistic:	468.3
Date:	Tue, 26 Mar 2019	Prob (F-statistic):	2.83e-91
Time:	14:32:59	Log-Likelihood:	-1721.1
No. Observations:	1599	AIC:	3446.
Df Residuals:	1597	BIC:	3457.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	1.8750	0.175	10.732	0.000	1.532	2.218
alcohol	0.3608	0.017	21.639	0.000	0.328	0.394

Omnibus:	38.501	Durbin-Watson:	1.748
Prob(Omnibus):	0.000	Jarque-Bera (JB):	71.758
Skew:	-0.154	Prob(JB):	2.62e-16
Kurtosis:	3.991	Cond. No.	104.

DS2000 (Spring 2019, NCH) :: Lecture 12a¶

0. Administrivia¶

1. What is pandas?¶

2. pandas by Example: Red Wine Quality¶

1. What is `pandas`?¶

2. `pandas` by Example: Red Wine Quality¶