{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DS2500 Lesson6\n",
"\n",
"Jan 31, 2023\n",
"\n",
"### Content:\n",
"- Pandas\n",
" - series\n",
" - dataframe\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# before you begin, make sure you can load data from seaborn\n",
"import seaborn as sns\n",
"df_penguin = sns.load_dataset('penguins')\n",
"df_titanic = sns.load_dataset('titanic')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Having trouble?\n",
"- see piazza for long-term solution\n",
" - [mac SSL error](https://piazza.com/class/lbxsbawi9yq2f9/post/55)\n",
"- use code below for today:\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# if these lines give you trouble, use the csvs available on the website\n",
"# (be sure they're adjacent to this .ipynb file on your machine)\n",
"import pandas as pd\n",
"df_penguin = pd.read_csv('penguin.csv')\n",
"df_titanic = pd.read_csv('titanic.csv')\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Defaulting to user installation because normal site-packages is not writeable\n",
"Requirement already satisfied: pandas in /home/matt/.local/lib/python3.10/site-packages (1.5.2)\n",
"Requirement already satisfied: numpy>=1.21.0 in /home/matt/.local/lib/python3.10/site-packages (from pandas) (1.24.1)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas) (2022.1)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /home/matt/.local/lib/python3.10/site-packages (from pandas) (2.8.2)\n",
"Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n"
]
}
],
"source": [
"!pip3 install pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Admin:\n",
"- lab1\n",
" - talk to friends\n",
" - lab digest\n",
" - part b (part c)\n",
"- hw0 due friday @ 11:59 PM\n",
" - .py and .ipynb\n",
" - see canvas announcement\n",
" - see piazza\n",
"- look at schedule together\n",
"- tutoring groups\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The value of talking-out-loud about programming\n",
"\n",
"... I learned 2 new ways to approach lab1's part B `get_win_set()` this morning!\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# if all items in an array are the same, then the std dev is 0\n",
"import numpy as np\n",
"\n",
"np.array([1, 1, 2]).std() == 0"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# if all items in an array are the same, \n",
"\n",
"# mysterious section 2 student\n",
"len(set(np.array([1, 1, 2]))) == 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Pandas\n",
"\n",
"Pandas is a python module which stores data in `pd.DataFrame` and `pd.Series` objects.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" Male | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" Female | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" Female | \n",
"
\n",
" \n",
" 3 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 19.3 | \n",
" 193.0 | \n",
" 3450.0 | \n",
" Female | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"3 Adelie Torgersen NaN NaN NaN \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"\n",
" body_mass_g sex \n",
"0 3750.0 Male \n",
"1 3800.0 Female \n",
"2 3250.0 Female \n",
"3 NaN NaN \n",
"4 3450.0 Female "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import seaborn as sns\n",
"\n",
"# Example DataFrame:\n",
"# df stands for dataframe. df_penguin is a dataframe of penguin data\n",
"df_penguin = sns.load_dataset('penguins')\n",
"df_penguin.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 39.1\n",
"1 39.5\n",
"2 40.3\n",
"3 NaN\n",
"4 36.7\n",
" ... \n",
"339 NaN\n",
"340 46.8\n",
"341 50.4\n",
"342 45.2\n",
"343 49.9\n",
"Name: bill_length_mm, Length: 344, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example series: the \"s_\" is a (personal) convention for variables which are series\n",
"s_bill_length_mm = df_penguin['bill_length_mm']\n",
"\n",
"s_bill_length_mm\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"### `pd.DataFrame` are two-dimensional, `pd.Series` are one-dimensional\n",
"\n",
"### If we already have `np.array()`, why do we need pandas?\n",
"- pandas supports non numeric data (strings for categorical data, for example)\n",
"- pandas supports reading / storing data from more formats\n",
" - csv (spreadsheets)\n",
"- pandas more elegantly deals with missing data\n",
"- pandas handles indexing woes\n",
"\n",
"You could do almost everything pandas does with numpy arrays ... but it'd be much more difficult to accomplish.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas Series\n",
"\n",
"### building:\n",
"- building: default index\n",
"- building: custom index\n",
"- building: from a dict\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" Male | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" Female | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" Female | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"\n",
" body_mass_g sex \n",
"0 3750.0 Male \n",
"1 3800.0 Female \n",
"2 3250.0 Female "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# look at first 3 rows of dataframe (for reference)\n",
"df_penguin.head(3)\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"species Adelie\n",
"island Torgersen\n",
"bill_length_mm 39.1\n",
"bill_depth_mm 18.7\n",
"flipper_length_mm 181.0\n",
"body_mass_g 3750.0\n",
"sex Male\n",
"Name: 0, dtype: object"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# each row, or column of dataframe is a series object\n",
"# below is first row of dataframe (more on iloc indexing later...)\n",
"# (remember: each row is a sample -> this is 1 penguin's data)\n",
"penguin0_series = df_penguin.iloc[0, :]\n",
"penguin0_series\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas series contain a sequence of labelled data elements:\n",
"- penguin0's `species` is `Adelie`\n",
"- penguin0's `island` is `Torgersen`\n",
"- penguin0's `bill_length_mm` is `39.1` ...\n",
"- penguin0's `` is ``\n",
"\n",
"A series is quite similar to a dictionary ...\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"penguin0_dict = {'species': 'Adelie',\n",
" 'island': 'Torgersen',\n",
" 'bill_length_mm': 39.1,\n",
" 'bill_depth_mm': 18.7,\n",
" 'flipper_length_mm': 181.0,\n",
" 'body_mass_g': 3750.0,\n",
" 'sex': 'Male'}\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"species Adelie\n",
"island Torgersen\n",
"bill_length_mm 39.1\n",
"bill_depth_mm 18.7\n",
"flipper_length_mm 181.0\n",
"body_mass_g 3750.0\n",
"sex Male\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# build a series from dict\n",
"penguin0_series = pd.Series(penguin0_dict)\n",
"penguin0_series\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"species Adelie\n",
"island Torgersen\n",
"bill_length_mm 39.1\n",
"bill_depth_mm 18.7\n",
"flipper_length_mm 181.0\n",
"body_mass_g 3750.0\n",
"sex Male\n",
"dtype: object"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# you can also pass two corresponding lists / tuples\n",
"index = ['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']\n",
"values = ['Adelie', 'Torgersen', 39.1, 18.7, 181.0, 3750.0, 'Male']\n",
"\n",
"penguin0_series = pd.Series(values, index=index)\n",
"penguin0_series\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 vanilla\n",
"1 chocolate\n",
"2 cherry garcia\n",
"3 oatmeal\n",
"dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sometimes your data has no meaningful index\n",
"# pandas will default to indexing things with integers\n",
"ice_cream_flavors = 'vanilla', 'chocolate', 'cherry garcia', 'oatmeal'\n",
"pd.Series(ice_cream_flavors)\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Adelie', 'Torgersen', 39.1, 18.7, 181.0, 3750.0, 'Male'],\n",
" dtype=object)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# you can access values as an array via .values\n",
"penguin0_series.values\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',\n",
" 'flipper_length_mm', 'body_mass_g', 'sex'],\n",
" dtype='object')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# you can access index (as a special pandas \"index\" object) via .index\n",
"penguin0_series.index\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### indexing into `pd.Series`: accessing / changing data\n",
"- accessing / setting using index:\n",
" - by name: `series.loc[name]`\n",
" - by position: `series.iloc[idx]`\n",
"- iterating: keys, items, iteritems (much like dict)\n",
"- deleting an entry\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matt 6\n",
"riva 7\n",
"eli 11\n",
"zeke 101\n",
"dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict_fav_num = {'matt': 6, 'riva': 7, 'eli': 11, 'zeke': 101}\n",
"series_fav_num = pd.Series(dict_fav_num)\n",
"series_fav_num"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# lookup by position: get value in position 2 (third)\n",
"series_fav_num.iloc[2]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"101"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# lookup by index (name): get value associated with index='matt'\n",
"series_fav_num.loc['zeke']"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# you can also address directly into the series object to lookup by index\n",
"# (my mild preference nobody follows: avoid this ... a bit more ambiguous)\n",
"series_fav_num['matt']"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matt 6\n",
"riva 7\n",
"eli 1000\n",
"zeke 101\n",
"dtype: int64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# each of these access methods can also set the value\n",
"series_fav_num.iloc[2] = 1000\n",
"series_fav_num"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check membership of item in index\n",
"'matt' in series_fav_num.index\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'bob' in series_fav_num.index\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1000 in series_fav_num.values\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iterating through elements of a `pd.Series`\n",
"\n",
"... pretty much the same as a dictionary except pandas uses an \"index\" while a dictionary has \"keys\".\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"matt\n",
"riva\n",
"eli\n",
"zeke\n"
]
}
],
"source": [
"# iterating through index (note: no parenthases around .index below)\n",
"for idx in series_fav_num.index:\n",
" print(idx)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"6\n",
"7\n",
"1000\n",
"101\n"
]
}
],
"source": [
"# iterating through values (notice: no parenthases on .values belwo)\n",
"for val in series_fav_num.values:\n",
" print(val)\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"matt 6\n",
"riva 7\n",
"eli 1000\n",
"zeke 101\n"
]
}
],
"source": [
"# iterating through index, value pairs (just like dict!)\n",
"for key, val in series_fav_num.items():\n",
" print(key, val)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Removing an element\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# removing a pair by its corresponding index (just like dict!)\n",
"del series_fav_num['matt']\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"riva 7\n",
"eli 1000\n",
"zeke 101\n",
"dtype: int64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series_fav_num\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Examining a `pd.Series`\n",
"\n",
"Just like numpy arrays:\n",
"- `Series.argmin()`\n",
" - which index has smallest value\n",
" - pandas gives the row number, not the index\n",
"- `Series.argmax()`\n",
" - which index has largest value\n",
" - pandas gives the row number, not the index\n",
"- `Series.mean()`\n",
"- `Series.min()`\n",
"- `Series.max()`\n",
"- `Series.std()`\n",
"- `Series.var()`\n",
"\n",
"But wait, there's more! These are in pandas objects but not numpy array\n",
"- `Series.count()`\n",
" - number of item pairs in series\n",
"- `Series.value_counts()`\n",
" - count of every unique value in series (like a histogram)\n",
" - (see example below please)\n",
"- `Series.describe()`\n",
" - summary statistics\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matt 6\n",
"riva 7\n",
"eli 11\n",
"zeke 101\n",
"sally 101\n",
"dtype: int64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict_fav_num = {'matt': 6, 'riva': 7, 'eli': 11, 'zeke': 101, 'sally': 101}\n",
"series_fav_num = pd.Series(dict_fav_num)\n",
"series_fav_num\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"### Our old friends from numpy\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matt 6\n",
"riva 7\n",
"eli 11\n",
"zeke 101\n",
"sally 101\n",
"dtype: int64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# for reference\n",
"series_fav_num\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(6, 101, 50.97254162782154, 2598.2)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# our familiar friends ...\n",
"series_fav_num.min(), series_fav_num.max(), series_fav_num.std(), series_fav_num.var()\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# notice: pandas gives the position of the row with smallest value\n",
"# (one might think they'd get index 'matt' here instead)\n",
"series_fav_num.argmin()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'matt'"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index 0 (first entry) has the lowest favorite number\n",
"idx_min = series_fav_num.argmin()\n",
"series_fav_num.index[idx_min]\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index 3 (last entry) has the highest favorite number\n",
"series_fav_num.argmax()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### New functionality, only in pandas\n"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matt 6\n",
"riva 7\n",
"eli 11\n",
"zeke 101\n",
"sally 101\n",
"dtype: int64"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series_fav_num\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of entries (rows)\n",
"series_fav_num.count()\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"101 2\n",
"6 1\n",
"7 1\n",
"11 1\n",
"dtype: int64"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# how many times did each of the favorite numbers occur?\n",
"# (101 occurs twice in series_fav_num, while all other values occur once)\n",
"series_fav_num.value_counts()\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Adelie 152\n",
"Gentoo 124\n",
"Chinstrap 68\n",
"Name: species, dtype: int64"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin['species'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"count 5.000000\n",
"mean 45.200000\n",
"std 50.972542\n",
"min 6.000000\n",
"25% 7.000000\n",
"50% 11.000000\n",
"75% 101.000000\n",
"max 101.000000\n",
"dtype: float64"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# describe is useful to get a sense of how values are distributed\n",
"# \"50%\" is equivilent to the median\n",
"# \"25%\"\" indicates that 25% of data is less than this value (and 75% is greater)\n",
"series_fav_num.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Extracting a `pd.DataFrame` column as a series\n",
"\n",
"A dataframe is a two dimensional table of data. Each row or column is a series object.\n"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" survived | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" embarked | \n",
" class | \n",
" who | \n",
" adult_male | \n",
" deck | \n",
" embark_town | \n",
" alive | \n",
" alone | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" C | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Cherbourg | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" yes | \n",
" True | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" S | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Southampton | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" survived pclass sex age sibsp parch fare embarked class \\\n",
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
"1 1 1 female 38.0 1 0 71.2833 C First \n",
"2 1 3 female 26.0 0 0 7.9250 S Third \n",
"3 1 1 female 35.0 1 0 53.1000 S First \n",
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
"\n",
" who adult_male deck embark_town alive alone \n",
"0 man True NaN Southampton no False \n",
"1 woman False C Cherbourg yes False \n",
"2 woman False NaN Southampton yes True \n",
"3 woman False C Southampton yes False \n",
"4 man True NaN Southampton no True "
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import seaborn as sns\n",
"\n",
"# may take a 15 sec on first run to download titanic data\n",
"df_titanic = sns.load_dataset('titanic')\n",
"df_titanic.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 22.0\n",
"1 38.0\n",
"2 26.0\n",
"3 35.0\n",
"4 35.0\n",
" ... \n",
"886 27.0\n",
"887 19.0\n",
"888 NaN\n",
"889 26.0\n",
"890 32.0\n",
"Name: age, Length: 891, dtype: float64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get the age column of dataframe as a series\n",
"df_titanic['age']\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## In Class Activity A\n",
"\n",
"- `.describe()` how much people paid to get aboard the titanic. \n",
"- count how many passengers of each age were on board\n",
"- each passenger corresponds to a row, what is the index of the passenger who paid the highest price?\n",
"- change the price paid of the passenger in row index 2 (the 3rd row) to `123`\n",
" - notice: does anything funny happen here? If so ... investigate\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" survived | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" embarked | \n",
" class | \n",
" who | \n",
" adult_male | \n",
" deck | \n",
" embark_town | \n",
" alive | \n",
" alone | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" C | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Cherbourg | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" yes | \n",
" True | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" S | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Southampton | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" survived pclass sex age sibsp parch fare embarked class \\\n",
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
"1 1 1 female 38.0 1 0 71.2833 C First \n",
"2 1 3 female 26.0 0 0 7.9250 S Third \n",
"3 1 1 female 35.0 1 0 53.1000 S First \n",
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
"\n",
" who adult_male deck embark_town alive alone \n",
"0 man True NaN Southampton no False \n",
"1 woman False C Cherbourg yes False \n",
"2 woman False NaN Southampton yes True \n",
"3 woman False C Southampton yes False \n",
"4 man True NaN Southampton no True "
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 891.000000\n",
"mean 32.204208\n",
"std 49.693429\n",
"min 0.000000\n",
"25% 7.910400\n",
"50% 14.454200\n",
"75% 31.000000\n",
"max 512.329200\n",
"Name: fare, dtype: float64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic['fare'].describe()\n"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24.00 30\n",
"22.00 27\n",
"18.00 26\n",
"19.00 25\n",
"28.00 25\n",
" ..\n",
"36.50 1\n",
"55.50 1\n",
"0.92 1\n",
"23.50 1\n",
"74.00 1\n",
"Name: age, Length: 88, dtype: int64"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# count how many passengers of each age were on board\n",
"df_titanic['age'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"258"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# each passenger corresponds to a row, what is the index of the passenger who paid the highest price?\n",
"df_titanic['fare'].argmax()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 7.2500\n",
"1 71.2833\n",
"2 123.0000\n",
"3 53.1000\n",
"4 8.0500\n",
" ... \n",
"886 13.0000\n",
"887 30.0000\n",
"888 23.4500\n",
"889 30.0000\n",
"890 7.7500\n",
"Name: fare, Length: 891, dtype: float64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# change the price paid of the passenger in row index 2 (the 3rd row) to 123\n",
"s_fare = df_titanic['fare']\n",
"s_fare"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_12591/3786600315.py:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" s_fare.iloc[2] = 123\n"
]
}
],
"source": [
"s_fare.iloc[2] = 123"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" survived | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" embarked | \n",
" class | \n",
" who | \n",
" adult_male | \n",
" deck | \n",
" embark_town | \n",
" alive | \n",
" alone | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" C | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Cherbourg | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 123.0000 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" yes | \n",
" True | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" S | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Southampton | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 886 | \n",
" 0 | \n",
" 2 | \n",
" male | \n",
" 27.0 | \n",
" 0 | \n",
" 0 | \n",
" 13.0000 | \n",
" S | \n",
" Second | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
" 887 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 19.0 | \n",
" 0 | \n",
" 0 | \n",
" 30.0000 | \n",
" S | \n",
" First | \n",
" woman | \n",
" False | \n",
" B | \n",
" Southampton | \n",
" yes | \n",
" True | \n",
"
\n",
" \n",
" 888 | \n",
" 0 | \n",
" 3 | \n",
" female | \n",
" NaN | \n",
" 1 | \n",
" 2 | \n",
" 23.4500 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 889 | \n",
" 1 | \n",
" 1 | \n",
" male | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 30.0000 | \n",
" C | \n",
" First | \n",
" man | \n",
" True | \n",
" C | \n",
" Cherbourg | \n",
" yes | \n",
" True | \n",
"
\n",
" \n",
" 890 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 32.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.7500 | \n",
" Q | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Queenstown | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
891 rows × 15 columns
\n",
"
"
],
"text/plain": [
" survived pclass sex age sibsp parch fare embarked class \\\n",
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
"1 1 1 female 38.0 1 0 71.2833 C First \n",
"2 1 3 female 26.0 0 0 123.0000 S Third \n",
"3 1 1 female 35.0 1 0 53.1000 S First \n",
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
".. ... ... ... ... ... ... ... ... ... \n",
"886 0 2 male 27.0 0 0 13.0000 S Second \n",
"887 1 1 female 19.0 0 0 30.0000 S First \n",
"888 0 3 female NaN 1 2 23.4500 S Third \n",
"889 1 1 male 26.0 0 0 30.0000 C First \n",
"890 0 3 male 32.0 0 0 7.7500 Q Third \n",
"\n",
" who adult_male deck embark_town alive alone \n",
"0 man True NaN Southampton no False \n",
"1 woman False C Cherbourg yes False \n",
"2 woman False NaN Southampton yes True \n",
"3 woman False C Southampton yes False \n",
"4 man True NaN Southampton no True \n",
".. ... ... ... ... ... ... \n",
"886 man True NaN Southampton no True \n",
"887 woman False B Southampton yes True \n",
"888 woman False NaN Southampton no False \n",
"889 man True C Cherbourg yes True \n",
"890 man True NaN Queenstown no True \n",
"\n",
"[891 rows x 15 columns]"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas: DataFrame\n",
"\n",
"Remember:\n",
"- `Series`: 1d data object\n",
"- `DataFrame`: 2d data object\n",
"\n",
"`DataFrame`s represent two-dimensional data, like the quiz scores from last class:\n",
"\n",
"| | Quiz 0 | Quiz 1 | Quiz 2 |\n",
"|-----------|--------|--------|--------|\n",
"| Student 0 | 80 | 90 | 50 |\n",
"| Student 1 | 87 | 92 | 80 |\n",
"\n",
"Each column or row above could be considered a `Series` object\n"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"quiz_array = np.array([[80, 90, 50],\n",
" [87, 92, 80]])\n",
"\n",
"df_quiz = pd.DataFrame(quiz_array, \n",
" columns=('quiz0', 'quiz1', 'quiz2'), \n",
" index=('student0', 'student1'))\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" asufdhsaidufhasiudfh | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"asufdhsaidufhasiudfh 87 92 80"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we construct a dataframe as a dictionary\n",
"# keys of the dictionary are columns of dataframe\n",
"# values are lists (or tuples) of the values in each column\n",
"quiz_dict = {'quiz0': [80, 87],\n",
" 'quiz1': [90, 92],\n",
" 'quiz2': [50, 80]}\n",
"pd.DataFrame(quiz_dict, index=('student0', 'asufdhsaidufhasiudfh'))\n"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" 1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
"0 80 90 50\n",
"1 87 92 80"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# can we make dataframe without labelling rows / columns?\n",
"df_quiz = pd.DataFrame(quiz_array)\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can just add the names in afterwards if you'd like to\n",
"df_quiz.columns = ['quiz0', 'quiz1', 'quiz2']\n",
"df_quiz.index = ('student0', 'student1')\n",
"df_quiz\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Describing a `pd.DataFrame`\n",
"\n",
"Just like numpy arrays:\n",
"- `DataFrame.argmin()`\n",
" - which index has smallest value\n",
" - pandas gives the row number, not the index\n",
"- `DataFrame.argmax()`\n",
" - which index has largest value\n",
" - pandas gives the row number, not the index\n",
"- `DataFrame.mean()`\n",
"- `DataFrame.min()`\n",
"- `DataFrame.max()`\n",
"- `DataFrame.std()`\n",
"- `DataFrame.var()`\n",
"\n",
"New to pandas:\n",
"- `DataFrame.count()`\n",
" - number of item pairs in series\n",
"- `DataFrame.describe()`\n",
" - summary statistics\n",
"- `DataFrame.value_counts()`\n",
" - count how many unique rows there are\n",
" - see falcon / dog / cat example below please\n"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"quiz0 83.5\n",
"quiz1 91.0\n",
"quiz2 65.0\n",
"dtype: float64"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# by default, each method applies operation to entire column of data\n",
"df_quiz.mean()\n"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"quiz0 83.5\n",
"quiz1 91.0\n",
"quiz2 65.0\n",
"dtype: float64"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can also pass axis parameter to specify if operation should be applied to row or column\n",
"# !remember!\n",
"# axis=0 -> apply operation across all rows (returns operation per col)\n",
"# axis=1 -> apply operation across all cols (returns operation per row)\n",
"df_quiz.mean(axis=0)\n"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student0 73.333333\n",
"student1 86.333333\n",
"dtype: float64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# applies each operation to entire column of data (row)\n",
"df_quiz.mean(axis=1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Take a moment to appreciate a panda:\n",
"Those labels on the pandas objects are super help in understanding the output immediately above, right?\n",
"\n",
"(The `axis=0` vs `axis=1` stuff was easy to get turned around with in numpy)\n"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 2.000000 | \n",
" 2.000000 | \n",
" 2.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 83.500000 | \n",
" 91.000000 | \n",
" 65.000000 | \n",
"
\n",
" \n",
" std | \n",
" 4.949747 | \n",
" 1.414214 | \n",
" 21.213203 | \n",
"
\n",
" \n",
" min | \n",
" 80.000000 | \n",
" 90.000000 | \n",
" 50.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 81.750000 | \n",
" 90.500000 | \n",
" 57.500000 | \n",
"
\n",
" \n",
" 50% | \n",
" 83.500000 | \n",
" 91.000000 | \n",
" 65.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 85.250000 | \n",
" 91.500000 | \n",
" 72.500000 | \n",
"
\n",
" \n",
" max | \n",
" 87.000000 | \n",
" 92.000000 | \n",
" 80.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"count 2.000000 2.000000 2.000000\n",
"mean 83.500000 91.000000 65.000000\n",
"std 4.949747 1.414214 21.213203\n",
"min 80.000000 90.000000 50.000000\n",
"25% 81.750000 90.500000 57.500000\n",
"50% 83.500000 91.000000 65.000000\n",
"75% 85.250000 91.500000 72.500000\n",
"max 87.000000 92.000000 80.000000"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# describe only works on columns (no axis param given)\n",
"df_quiz.describe()\n"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" num_legs | \n",
" num_wings | \n",
"
\n",
" \n",
" \n",
" \n",
" falcon | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" dog | \n",
" 4 | \n",
" 0 | \n",
"
\n",
" \n",
" cat | \n",
" 4 | \n",
" 0 | \n",
"
\n",
" \n",
" ant | \n",
" 6 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" num_legs num_wings\n",
"falcon 2 2\n",
"dog 4 0\n",
"cat 4 0\n",
"ant 6 0"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# borrowing from pandas documentation for new example\n",
"df = pd.DataFrame({'num_legs': [2, 4, 4, 6],\n",
" 'num_wings': [2, 0, 0, 0]},\n",
" index=['falcon', 'dog', 'cat', 'ant'])\n",
"df\n"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"num_legs num_wings\n",
"4 0 2\n",
"2 2 1\n",
"6 0 1\n",
"dtype: int64"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# notice that value_counts() gives \n",
"df.value_counts()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`value_counts()` on a `pd.DataFrame` tells us how many times we observed each full row. It tells us that `df` has:\n",
"- 2 row(s) in `df` with `num_legs=4, num_wings=0` \n",
"- 1 row(s) in `df` with `num_legs=2, num_wings=2`\n",
"- 1 row(s) in `df` with `num_legs=6, num_wings=0`\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Indexing / Accessing a DataFrame\n",
"- indexing: \n",
" - `.loc[]` indexing by name of row or column\n",
" - `.iloc[]` indexing by position integer (0, 1, 2, 3, 4 ...)\n",
" & slicing & subsets\n",
"- using the slice operator `:` to get full rows or columns\n"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quiz_dict = {'quiz0': [80, 87, 50, 89],\n",
" 'quiz1': [90, 92, 24, 85],\n",
" 'quiz2': [50, 80, 21, 40]}\n",
"df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1', 'student2', 'student3'))\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"80"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# indexing data by \"name\"\n",
"# remember: rows first, then columns ... \n",
"# 1st entry describes which row ('student0')\n",
"# 2nd entry describes which col ('quiz0')\n",
"\n",
"df_quiz.loc['student0', 'quiz0']"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"50"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index data by position\n",
"# 1st entry describes which row. 0 -> the 1st (topmost) row\n",
"# 2nd entry describes which col. 2 -> the 3rd (from the left) col\n",
"df_quiz.iloc[0, 2]\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### you can use same slicing syntaxes on both .loc and .iloc\n"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student0 90\n",
"student1 92\n",
"student2 24\n",
"student3 85\n",
"Name: quiz1, dtype: int64"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get the column with idx 1 (second col)\n",
"df_quiz.iloc[:, 1]\n"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"50"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1st row, last col\n",
"df_quiz.iloc[0, -1]\n"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student0 80\n",
"student1 87\n",
"student2 50\n",
"student3 89\n",
"Name: quiz0, dtype: int64"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# all rows, only quiz0\n",
"df_quiz.loc[:, 'quiz0']\n"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"quiz0 80\n",
"quiz1 90\n",
"quiz2 50\n",
"Name: student0, dtype: int64"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# slicing with named cols and rows\n",
"# you can get a range, by name of row/col\n",
"# note: this includes both start and stop columns (! unlike array / list)\n",
"df_quiz.loc['student0', 'quiz0':'quiz2' ]\n"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"quiz0 80\n",
"quiz1 90\n",
"Name: student0, dtype: int64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# watch out:\n",
"# when you get ranges indexed by position: include start idx, exclude stop idx)\n",
"df_quiz.iloc[0, 0:2 ]\n"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student0 80\n",
"student1 87\n",
"student2 50\n",
"student3 89\n",
"Name: quiz0, dtype: int64"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# if you access directly into dataframe, it will assume you're looking for a column\n",
"# (below is equivilent to df_quiz.loc[:, 'quiz0'])\n",
"# mild preference: avoid this\n",
"df_quiz['quiz0']\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### I've seen someone use `pd.DataFrame.ix` to index like above, what does that do?\n",
"\n",
"It was something of a hybrid between `.iloc` / `.loc` ... but it was weird to use.\n",
"\n",
"[Please don't use it.](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.ix.html)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Modifying a DataFrame\n",
"- updating values: single cell\n",
"- adding a new column or row\n",
" - good practice: use a `pd.Series` to add a new row / col\n"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quiz_dict = {'quiz0': [80, 87, 50, 89],\n",
" 'quiz1': [90, 92, 24, 85],\n",
" 'quiz2': [50, 80, 21, 40]}\n",
"df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1', 'student2', 'student3'))\n",
"df_quiz"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 123 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 123 50\n",
"student1 87 92 80\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# setting single entry in dataframe\n",
"df_quiz.loc['student0', 'quiz1'] = 123\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 123 | \n",
" 456 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 123 456\n",
"student1 87 92 80\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# setting multiple (contiguous) entries in dataframe\n",
"df_quiz.loc['student0', 'quiz1': 'quiz2'] = 123, 456\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
" overall grade | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 123 | \n",
" 456 | \n",
" b | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
" c | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
" d | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
" a | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2 overall grade\n",
"student0 80 123 456 b\n",
"student1 87 92 80 c\n",
"student2 50 24 21 d\n",
"student3 89 85 40 a"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# adding a new column (error prone handling of indexing ... which student got which grade?)\n",
"df_quiz['overall grade'] = 'b' , 'c', 'd', 'a'\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 123 | \n",
" 456 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 123 456\n",
"student1 87 92 80\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# delete a column\n",
"del df_quiz['overall grade']\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student1 b-\n",
"student2 f (no quizzes taken)\n",
"student1928918 asdf\n",
"dtype: object"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# adding a column (next 2 cells) robust way of handling indexing\n",
"# by explicilty labelling the index we're sure to match more explicitly\n",
"s_overgrade = pd.Series({'student1': 'b-', \n",
" 'student2': 'f (no quizzes taken)',\n",
" 'student1928918': 'asdf'})\n",
"s_overgrade\n"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
" overall grade | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 123 | \n",
" 456 | \n",
" NaN | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
" b- | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
" f (no quizzes taken) | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2 overall grade\n",
"student0 80 123 456 NaN\n",
"student1 87 92 80 b-\n",
"student2 50 24 21 f (no quizzes taken)\n",
"student3 89 85 40 NaN"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# notice how pandas helps us out in aligning our new column with proper row\n",
"df_quiz.loc[: , 'overall grade'] = s_overgrade\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
" overall grade | \n",
"
\n",
" \n",
" \n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
" b- | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
" f (no quizzes taken) | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2 overall grade\n",
"student1 87 92 80 b-\n",
"student2 50 24 21 f (no quizzes taken)\n",
"student3 89 85 40 NaN"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# how to 'drop' a row (returns a dataframe with row removed)\n",
"df_quiz_short = df_quiz.drop('student0')\n",
"df_quiz_short\n"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz1 | \n",
" quiz2 | \n",
" overall grade | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 123 | \n",
" 456 | \n",
" NaN | \n",
"
\n",
" \n",
" student1 | \n",
" 92 | \n",
" 80 | \n",
" b- | \n",
"
\n",
" \n",
" student2 | \n",
" 24 | \n",
" 21 | \n",
" f (no quizzes taken) | \n",
"
\n",
" \n",
" student3 | \n",
" 85 | \n",
" 40 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz1 quiz2 overall grade\n",
"student0 123 456 NaN\n",
"student1 92 80 b-\n",
"student2 24 21 f (no quizzes taken)\n",
"student3 85 40 NaN"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# you can drop a column too by specifying `axis=1`\n",
"# (by default it uses axis=0 to drop rows)\n",
"df_quiz.drop('quiz0', axis=1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# If slicing fails ... just pass a list\n"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quiz_dict = {'quiz0': [80, 87, 50, 89],\n",
" 'quiz1': [90, 92, 24, 85],\n",
" 'quiz2': [50, 80, 21, 40]}\n",
"df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1', 'student2', 'student3'))\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student2 | \n",
" 50 | \n",
" 24 | \n",
" 21 | \n",
"
\n",
" \n",
" student3 | \n",
" 89 | \n",
" 85 | \n",
" 40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student2 50 24 21\n",
"student3 89 85 40"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# how to get an arbitrary set of rows\n",
"df_quiz.loc[['student0', 'student2', 'student3'], :]\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## In Class Activity B\n",
"1. Build the following `df_grade`. Be sure to include the row and column names:\n",
"\n",
"| | StudentB | StudentA | StudentC |\n",
"|-------:|----:|------:|------:|\n",
"| Quiz 1 | 89 | 100 | 78 |\n",
"| Quiz 2 | 75 | 90 | 90 |\n",
"| Quiz 3 | 93 | 85 | 65 |\n",
"| Quiz 4 | 92 | 92 | 76 |\n",
"\n",
"1. index into this dataframe to build a `df_grade_subset`:\n",
" - only includes rows studentB and studentC\n",
" - only includes Quiz 2, Quiz 3, Quiz 4\n",
"1. Using the `df_grade_subset` from the step above:\n",
" * calculate mean scores of studentB and studentC from the selected quizes\n",
" * calculate mean score of each quiz \n",
" * (remember the `axis` parameter)\n",
" \n",
"Operating on `df_grade`:\n",
"1. Add a new column `'StudentD'` with grades `60, 70, 80, 90` for quizes 1, 2, 3, 4 respectively\n",
" * can you do this by adding a new `pd.Series` object (to be a bit more explicit)?\n",
"1. Add a new row, `quiz5`, with any grades\n",
"1. Delete StudentC's column\n"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentB | \n",
" StudentA | \n",
" StudentC | \n",
"
\n",
" \n",
" \n",
" \n",
" Quiz1 | \n",
" 89 | \n",
" 100 | \n",
" 78 | \n",
"
\n",
" \n",
" Quiz2 | \n",
" 75 | \n",
" 90 | \n",
" 90 | \n",
"
\n",
" \n",
" Quiz3 | \n",
" 93 | \n",
" 85 | \n",
" 65 | \n",
"
\n",
" \n",
" Quiz4 | \n",
" 92 | \n",
" 92 | \n",
" 76 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentB StudentA StudentC\n",
"Quiz1 89 100 78\n",
"Quiz2 75 90 90\n",
"Quiz3 93 85 65\n",
"Quiz4 92 92 76"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"student_grade_dict = {'StudentB': [89, 75, 93, 92],\n",
" 'StudentA': [100, 90, 85, 92],\n",
" 'StudentC': [78, 90, 65, 76]}\n",
"\n",
"df_quiz = pd.DataFrame(student_grade_dict, index=('Quiz1', 'Quiz2', 'Quiz3', 'Quiz4'))\n",
"df_quiz"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentB | \n",
" StudentC | \n",
"
\n",
" \n",
" \n",
" \n",
" Quiz2 | \n",
" 75 | \n",
" 90 | \n",
"
\n",
" \n",
" Quiz3 | \n",
" 93 | \n",
" 65 | \n",
"
\n",
" \n",
" Quiz4 | \n",
" 92 | \n",
" 76 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentB StudentC\n",
"Quiz2 75 90\n",
"Quiz3 93 65\n",
"Quiz4 92 76"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index into this dataframe to build a df_grade_subset:\n",
"# only includes rows studentB and studentC\n",
"# only includes Quiz 2, Quiz 3, Quiz 4\n",
"df_grade_subset = df_quiz.loc['Quiz2': , ['StudentB', 'StudentC']]\n",
"df_grade_subset"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"StudentB 86.666667\n",
"StudentC 77.000000\n",
"dtype: float64"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using the df_grade_subset from the step above:\n",
"# calculate mean scores of studentB and studentC from the selected quizes\n",
"df_grade_subset.mean()"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Quiz1 89.000000\n",
"Quiz2 85.000000\n",
"Quiz3 81.000000\n",
"Quiz4 86.666667\n",
"dtype: float64"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate mean score of each quiz\n",
"# (remember the axis parameter)\n",
"df_quiz.mean(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentB | \n",
" StudentA | \n",
" StudentC | \n",
" StudentD | \n",
"
\n",
" \n",
" \n",
" \n",
" Quiz1 | \n",
" 89 | \n",
" 100 | \n",
" 78 | \n",
" 60 | \n",
"
\n",
" \n",
" Quiz2 | \n",
" 75 | \n",
" 90 | \n",
" 90 | \n",
" 70 | \n",
"
\n",
" \n",
" Quiz3 | \n",
" 93 | \n",
" 85 | \n",
" 65 | \n",
" 80 | \n",
"
\n",
" \n",
" Quiz4 | \n",
" 92 | \n",
" 92 | \n",
" 76 | \n",
" 90 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentB StudentA StudentC StudentD\n",
"Quiz1 89 100 78 60\n",
"Quiz2 75 90 90 70\n",
"Quiz3 93 85 65 80\n",
"Quiz4 92 92 76 90"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Operating on df_grade:\n",
"# Add a new column 'StudentD' with grades 60, 70, 80, 90 for quizes 1, 2, 3, 4 respectively\n",
"# can you do this by adding a new pd.Series object (to be a bit more explicit)?\n",
"\n",
"# implicit (not great)\n",
"# df_quiz.loc[:, 'StudentD'] = 60, 70, 80, 90\n",
"\n",
"# a bit more error-proof (notice: order doesn't matter!)\n",
"s_studentd = pd.Series({'Quiz1': 60, 'Quiz4': 90, 'Quiz2': 70, 'Quiz3': 80})\n",
"df_quiz.loc[:, 'StudentD'] = s_studentd\n",
"df_quiz"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentB | \n",
" StudentA | \n",
" StudentD | \n",
"
\n",
" \n",
" \n",
" \n",
" Quiz1 | \n",
" 89.0 | \n",
" 100.0 | \n",
" 60.0 | \n",
"
\n",
" \n",
" Quiz2 | \n",
" 75.0 | \n",
" 90.0 | \n",
" 70.0 | \n",
"
\n",
" \n",
" Quiz3 | \n",
" 93.0 | \n",
" 85.0 | \n",
" 80.0 | \n",
"
\n",
" \n",
" Quiz4 | \n",
" 92.0 | \n",
" 92.0 | \n",
" 90.0 | \n",
"
\n",
" \n",
" Quiz5 | \n",
" 1.0 | \n",
" 2.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentB StudentA StudentD\n",
"Quiz1 89.0 100.0 60.0\n",
"Quiz2 75.0 90.0 70.0\n",
"Quiz3 93.0 85.0 80.0\n",
"Quiz4 92.0 92.0 90.0\n",
"Quiz5 1.0 2.0 NaN"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict_studentd = {'Quiz1': 60, 'Quiz4': 90, 'Quiz2': 70, 'Quiz3': 80}\n",
"df_quiz.loc[:, 'StudentD'] = dict_studentd\n",
"df_quiz"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentB | \n",
" StudentA | \n",
" StudentC | \n",
" StudentD | \n",
"
\n",
" \n",
" \n",
" \n",
" Quiz1 | \n",
" 89.0 | \n",
" 100.0 | \n",
" 78.0 | \n",
" 60.0 | \n",
"
\n",
" \n",
" Quiz2 | \n",
" 75.0 | \n",
" 90.0 | \n",
" 90.0 | \n",
" 70.0 | \n",
"
\n",
" \n",
" Quiz3 | \n",
" 93.0 | \n",
" 85.0 | \n",
" 65.0 | \n",
" 80.0 | \n",
"
\n",
" \n",
" Quiz4 | \n",
" 92.0 | \n",
" 92.0 | \n",
" 76.0 | \n",
" 90.0 | \n",
"
\n",
" \n",
" Quiz5 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 3.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentB StudentA StudentC StudentD\n",
"Quiz1 89.0 100.0 78.0 60.0\n",
"Quiz2 75.0 90.0 90.0 70.0\n",
"Quiz3 93.0 85.0 65.0 80.0\n",
"Quiz4 92.0 92.0 76.0 90.0\n",
"Quiz5 1.0 2.0 3.0 4.0"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Add a new row, quiz5, with any grades\n",
"\n",
"# implicit way\n",
"df_quiz.loc['Quiz5', :] = 1, 2, 3, 4\n",
"df_quiz"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" StudentB | \n",
" StudentA | \n",
" StudentD | \n",
"
\n",
" \n",
" \n",
" \n",
" Quiz1 | \n",
" 89.0 | \n",
" 100.0 | \n",
" 60.0 | \n",
"
\n",
" \n",
" Quiz2 | \n",
" 75.0 | \n",
" 90.0 | \n",
" 70.0 | \n",
"
\n",
" \n",
" Quiz3 | \n",
" 93.0 | \n",
" 85.0 | \n",
" 80.0 | \n",
"
\n",
" \n",
" Quiz4 | \n",
" 92.0 | \n",
" 92.0 | \n",
" 90.0 | \n",
"
\n",
" \n",
" Quiz5 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" StudentB StudentA StudentD\n",
"Quiz1 89.0 100.0 60.0\n",
"Quiz2 75.0 90.0 70.0\n",
"Quiz3 93.0 85.0 80.0\n",
"Quiz4 92.0 92.0 90.0\n",
"Quiz5 1.0 2.0 4.0"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Delete StudentC's row\n",
"del df_quiz['StudentC']\n",
"df_quiz"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Operating on DataFrame & Series Objects\n",
"\n",
"Your operators do pretty much what you'd expect them to.\n"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quiz_dict = {'quiz0': [80, 87],\n",
" 'quiz1': [90, 92],\n",
" 'quiz2': [50, 80]}\n",
"df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1'))\n",
"df_quiz"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80000 | \n",
" 90000 | \n",
" 50000 | \n",
"
\n",
" \n",
" student1 | \n",
" 87000 | \n",
" 92000 | \n",
" 80000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80000 90000 50000\n",
"student1 87000 92000 80000"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz * 1000\n"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 1000000000079 | \n",
" 1000000000089 | \n",
" 1000000000049 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 1000000000079 1000000000089 1000000000049\n",
"student1 87 92 80"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# thats some extra credit ...\n",
"df_quiz.loc['student0', :] += 999999999999\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 1000000000079 | \n",
" 1000000000089 | \n",
" 1000000000049 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 1000000000079 1000000000089 1000000000049\n",
"student1 87 92 80"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" student1 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 True True True\n",
"student1 False False False"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can also use comparison operators (super helpful, see boolean indexing next)\n",
"df_quiz > 100\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Boolean Indexing into DataFrame\n",
"\n",
"Sometimes we want to grab only the rows or columns which meet a particular condition.\n",
"\n",
"\"Get all students whose grade was higher than 85 on quiz 1\"\n"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 60 | \n",
" 60 | \n",
" 70 | \n",
"
\n",
" \n",
" student3 | \n",
" 30 | \n",
" 23 | \n",
" 64 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80\n",
"student2 60 60 70\n",
"student3 30 23 64"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quiz_dict = {'quiz0': [80, 87, 60, 30],\n",
" 'quiz1': [90, 92, 60, 23],\n",
" 'quiz2': [50, 80, 70, 64]}\n",
"df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1', 'student2', 'student3'))\n",
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student0 90\n",
"student1 92\n",
"student2 60\n",
"student3 23\n",
"Name: quiz1, dtype: int64"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# quiz 1 is a series object which contains every index's quiz 1 grade\n",
"s_quiz1 = df_quiz.loc[:, 'quiz1']\n",
"s_quiz1\n"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"student0 True\n",
"student1 True\n",
"student2 False\n",
"student3 False\n",
"Name: quiz1, dtype: bool"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we create a series of booleans which is True only in the positions we're interested in\n",
"s_bool = s_quiz1 > 85\n",
"s_bool\n"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 60 | \n",
" 60 | \n",
" 70 | \n",
"
\n",
" \n",
" student3 | \n",
" 30 | \n",
" 23 | \n",
" 64 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80\n",
"student2 60 60 70\n",
"student3 30 23 64"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# boolean indexing: using a boolean series as index returns only those entries which are True\n",
"# notice that since student2 & student3's quiz1 grade wasn't > 80 they aren't included below\n",
"df_quiz.loc[s_bool, :]\n"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 True\n",
"2 True\n",
"3 True\n",
"4 True\n",
" ... \n",
"339 False\n",
"340 False\n",
"341 False\n",
"342 False\n",
"343 False\n",
"Length: 344, dtype: bool"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s_bool = (df_penguin['species'] == 'Adelie') & (df_penguin['island'] == 'Torgersen')\n",
"s_bool"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" Male | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" Female | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" Female | \n",
"
\n",
" \n",
" 3 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 19.3 | \n",
" 193.0 | \n",
" 3450.0 | \n",
" Female | \n",
"
\n",
" \n",
" 5 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.3 | \n",
" 20.6 | \n",
" 190.0 | \n",
" 3650.0 | \n",
" Male | \n",
"
\n",
" \n",
" 6 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.9 | \n",
" 17.8 | \n",
" 181.0 | \n",
" 3625.0 | \n",
" Female | \n",
"
\n",
" \n",
" 7 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.2 | \n",
" 19.6 | \n",
" 195.0 | \n",
" 4675.0 | \n",
" Male | \n",
"
\n",
" \n",
" 8 | \n",
" Adelie | \n",
" Torgersen | \n",
" 34.1 | \n",
" 18.1 | \n",
" 193.0 | \n",
" 3475.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" Adelie | \n",
" Torgersen | \n",
" 42.0 | \n",
" 20.2 | \n",
" 190.0 | \n",
" 4250.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 10 | \n",
" Adelie | \n",
" Torgersen | \n",
" 37.8 | \n",
" 17.1 | \n",
" 186.0 | \n",
" 3300.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 11 | \n",
" Adelie | \n",
" Torgersen | \n",
" 37.8 | \n",
" 17.3 | \n",
" 180.0 | \n",
" 3700.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 12 | \n",
" Adelie | \n",
" Torgersen | \n",
" 41.1 | \n",
" 17.6 | \n",
" 182.0 | \n",
" 3200.0 | \n",
" Female | \n",
"
\n",
" \n",
" 13 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.6 | \n",
" 21.2 | \n",
" 191.0 | \n",
" 3800.0 | \n",
" Male | \n",
"
\n",
" \n",
" 14 | \n",
" Adelie | \n",
" Torgersen | \n",
" 34.6 | \n",
" 21.1 | \n",
" 198.0 | \n",
" 4400.0 | \n",
" Male | \n",
"
\n",
" \n",
" 15 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.6 | \n",
" 17.8 | \n",
" 185.0 | \n",
" 3700.0 | \n",
" Female | \n",
"
\n",
" \n",
" 16 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.7 | \n",
" 19.0 | \n",
" 195.0 | \n",
" 3450.0 | \n",
" Female | \n",
"
\n",
" \n",
" 17 | \n",
" Adelie | \n",
" Torgersen | \n",
" 42.5 | \n",
" 20.7 | \n",
" 197.0 | \n",
" 4500.0 | \n",
" Male | \n",
"
\n",
" \n",
" 18 | \n",
" Adelie | \n",
" Torgersen | \n",
" 34.4 | \n",
" 18.4 | \n",
" 184.0 | \n",
" 3325.0 | \n",
" Female | \n",
"
\n",
" \n",
" 19 | \n",
" Adelie | \n",
" Torgersen | \n",
" 46.0 | \n",
" 21.5 | \n",
" 194.0 | \n",
" 4200.0 | \n",
" Male | \n",
"
\n",
" \n",
" 68 | \n",
" Adelie | \n",
" Torgersen | \n",
" 35.9 | \n",
" 16.6 | \n",
" 190.0 | \n",
" 3050.0 | \n",
" Female | \n",
"
\n",
" \n",
" 69 | \n",
" Adelie | \n",
" Torgersen | \n",
" 41.8 | \n",
" 19.4 | \n",
" 198.0 | \n",
" 4450.0 | \n",
" Male | \n",
"
\n",
" \n",
" 70 | \n",
" Adelie | \n",
" Torgersen | \n",
" 33.5 | \n",
" 19.0 | \n",
" 190.0 | \n",
" 3600.0 | \n",
" Female | \n",
"
\n",
" \n",
" 71 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.7 | \n",
" 18.4 | \n",
" 190.0 | \n",
" 3900.0 | \n",
" Male | \n",
"
\n",
" \n",
" 72 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.6 | \n",
" 17.2 | \n",
" 196.0 | \n",
" 3550.0 | \n",
" Female | \n",
"
\n",
" \n",
" 73 | \n",
" Adelie | \n",
" Torgersen | \n",
" 45.8 | \n",
" 18.9 | \n",
" 197.0 | \n",
" 4150.0 | \n",
" Male | \n",
"
\n",
" \n",
" 74 | \n",
" Adelie | \n",
" Torgersen | \n",
" 35.5 | \n",
" 17.5 | \n",
" 190.0 | \n",
" 3700.0 | \n",
" Female | \n",
"
\n",
" \n",
" 75 | \n",
" Adelie | \n",
" Torgersen | \n",
" 42.8 | \n",
" 18.5 | \n",
" 195.0 | \n",
" 4250.0 | \n",
" Male | \n",
"
\n",
" \n",
" 76 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.9 | \n",
" 16.8 | \n",
" 191.0 | \n",
" 3700.0 | \n",
" Female | \n",
"
\n",
" \n",
" 77 | \n",
" Adelie | \n",
" Torgersen | \n",
" 37.2 | \n",
" 19.4 | \n",
" 184.0 | \n",
" 3900.0 | \n",
" Male | \n",
"
\n",
" \n",
" 78 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.2 | \n",
" 16.1 | \n",
" 187.0 | \n",
" 3550.0 | \n",
" Female | \n",
"
\n",
" \n",
" 79 | \n",
" Adelie | \n",
" Torgersen | \n",
" 42.1 | \n",
" 19.1 | \n",
" 195.0 | \n",
" 4000.0 | \n",
" Male | \n",
"
\n",
" \n",
" 80 | \n",
" Adelie | \n",
" Torgersen | \n",
" 34.6 | \n",
" 17.2 | \n",
" 189.0 | \n",
" 3200.0 | \n",
" Female | \n",
"
\n",
" \n",
" 81 | \n",
" Adelie | \n",
" Torgersen | \n",
" 42.9 | \n",
" 17.6 | \n",
" 196.0 | \n",
" 4700.0 | \n",
" Male | \n",
"
\n",
" \n",
" 82 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 18.8 | \n",
" 187.0 | \n",
" 3800.0 | \n",
" Female | \n",
"
\n",
" \n",
" 83 | \n",
" Adelie | \n",
" Torgersen | \n",
" 35.1 | \n",
" 19.4 | \n",
" 193.0 | \n",
" 4200.0 | \n",
" Male | \n",
"
\n",
" \n",
" 116 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.6 | \n",
" 17.0 | \n",
" 188.0 | \n",
" 2900.0 | \n",
" Female | \n",
"
\n",
" \n",
" 117 | \n",
" Adelie | \n",
" Torgersen | \n",
" 37.3 | \n",
" 20.5 | \n",
" 199.0 | \n",
" 3775.0 | \n",
" Male | \n",
"
\n",
" \n",
" 118 | \n",
" Adelie | \n",
" Torgersen | \n",
" 35.7 | \n",
" 17.0 | \n",
" 189.0 | \n",
" 3350.0 | \n",
" Female | \n",
"
\n",
" \n",
" 119 | \n",
" Adelie | \n",
" Torgersen | \n",
" 41.1 | \n",
" 18.6 | \n",
" 189.0 | \n",
" 3325.0 | \n",
" Male | \n",
"
\n",
" \n",
" 120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.2 | \n",
" 17.2 | \n",
" 187.0 | \n",
" 3150.0 | \n",
" Female | \n",
"
\n",
" \n",
" 121 | \n",
" Adelie | \n",
" Torgersen | \n",
" 37.7 | \n",
" 19.8 | \n",
" 198.0 | \n",
" 3500.0 | \n",
" Male | \n",
"
\n",
" \n",
" 122 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.2 | \n",
" 17.0 | \n",
" 176.0 | \n",
" 3450.0 | \n",
" Female | \n",
"
\n",
" \n",
" 123 | \n",
" Adelie | \n",
" Torgersen | \n",
" 41.4 | \n",
" 18.5 | \n",
" 202.0 | \n",
" 3875.0 | \n",
" Male | \n",
"
\n",
" \n",
" 124 | \n",
" Adelie | \n",
" Torgersen | \n",
" 35.2 | \n",
" 15.9 | \n",
" 186.0 | \n",
" 3050.0 | \n",
" Female | \n",
"
\n",
" \n",
" 125 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.6 | \n",
" 19.0 | \n",
" 199.0 | \n",
" 4000.0 | \n",
" Male | \n",
"
\n",
" \n",
" 126 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.8 | \n",
" 17.6 | \n",
" 191.0 | \n",
" 3275.0 | \n",
" Female | \n",
"
\n",
" \n",
" 127 | \n",
" Adelie | \n",
" Torgersen | \n",
" 41.5 | \n",
" 18.3 | \n",
" 195.0 | \n",
" 4300.0 | \n",
" Male | \n",
"
\n",
" \n",
" 128 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.0 | \n",
" 17.1 | \n",
" 191.0 | \n",
" 3050.0 | \n",
" Female | \n",
"
\n",
" \n",
" 129 | \n",
" Adelie | \n",
" Torgersen | \n",
" 44.1 | \n",
" 18.0 | \n",
" 210.0 | \n",
" 4000.0 | \n",
" Male | \n",
"
\n",
" \n",
" 130 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.5 | \n",
" 17.9 | \n",
" 190.0 | \n",
" 3325.0 | \n",
" Female | \n",
"
\n",
" \n",
" 131 | \n",
" Adelie | \n",
" Torgersen | \n",
" 43.1 | \n",
" 19.2 | \n",
" 197.0 | \n",
" 3500.0 | \n",
" Male | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"3 Adelie Torgersen NaN NaN NaN \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"5 Adelie Torgersen 39.3 20.6 190.0 \n",
"6 Adelie Torgersen 38.9 17.8 181.0 \n",
"7 Adelie Torgersen 39.2 19.6 195.0 \n",
"8 Adelie Torgersen 34.1 18.1 193.0 \n",
"9 Adelie Torgersen 42.0 20.2 190.0 \n",
"10 Adelie Torgersen 37.8 17.1 186.0 \n",
"11 Adelie Torgersen 37.8 17.3 180.0 \n",
"12 Adelie Torgersen 41.1 17.6 182.0 \n",
"13 Adelie Torgersen 38.6 21.2 191.0 \n",
"14 Adelie Torgersen 34.6 21.1 198.0 \n",
"15 Adelie Torgersen 36.6 17.8 185.0 \n",
"16 Adelie Torgersen 38.7 19.0 195.0 \n",
"17 Adelie Torgersen 42.5 20.7 197.0 \n",
"18 Adelie Torgersen 34.4 18.4 184.0 \n",
"19 Adelie Torgersen 46.0 21.5 194.0 \n",
"68 Adelie Torgersen 35.9 16.6 190.0 \n",
"69 Adelie Torgersen 41.8 19.4 198.0 \n",
"70 Adelie Torgersen 33.5 19.0 190.0 \n",
"71 Adelie Torgersen 39.7 18.4 190.0 \n",
"72 Adelie Torgersen 39.6 17.2 196.0 \n",
"73 Adelie Torgersen 45.8 18.9 197.0 \n",
"74 Adelie Torgersen 35.5 17.5 190.0 \n",
"75 Adelie Torgersen 42.8 18.5 195.0 \n",
"76 Adelie Torgersen 40.9 16.8 191.0 \n",
"77 Adelie Torgersen 37.2 19.4 184.0 \n",
"78 Adelie Torgersen 36.2 16.1 187.0 \n",
"79 Adelie Torgersen 42.1 19.1 195.0 \n",
"80 Adelie Torgersen 34.6 17.2 189.0 \n",
"81 Adelie Torgersen 42.9 17.6 196.0 \n",
"82 Adelie Torgersen 36.7 18.8 187.0 \n",
"83 Adelie Torgersen 35.1 19.4 193.0 \n",
"116 Adelie Torgersen 38.6 17.0 188.0 \n",
"117 Adelie Torgersen 37.3 20.5 199.0 \n",
"118 Adelie Torgersen 35.7 17.0 189.0 \n",
"119 Adelie Torgersen 41.1 18.6 189.0 \n",
"120 Adelie Torgersen 36.2 17.2 187.0 \n",
"121 Adelie Torgersen 37.7 19.8 198.0 \n",
"122 Adelie Torgersen 40.2 17.0 176.0 \n",
"123 Adelie Torgersen 41.4 18.5 202.0 \n",
"124 Adelie Torgersen 35.2 15.9 186.0 \n",
"125 Adelie Torgersen 40.6 19.0 199.0 \n",
"126 Adelie Torgersen 38.8 17.6 191.0 \n",
"127 Adelie Torgersen 41.5 18.3 195.0 \n",
"128 Adelie Torgersen 39.0 17.1 191.0 \n",
"129 Adelie Torgersen 44.1 18.0 210.0 \n",
"130 Adelie Torgersen 38.5 17.9 190.0 \n",
"131 Adelie Torgersen 43.1 19.2 197.0 \n",
"\n",
" body_mass_g sex \n",
"0 3750.0 Male \n",
"1 3800.0 Female \n",
"2 3250.0 Female \n",
"3 NaN NaN \n",
"4 3450.0 Female \n",
"5 3650.0 Male \n",
"6 3625.0 Female \n",
"7 4675.0 Male \n",
"8 3475.0 NaN \n",
"9 4250.0 NaN \n",
"10 3300.0 NaN \n",
"11 3700.0 NaN \n",
"12 3200.0 Female \n",
"13 3800.0 Male \n",
"14 4400.0 Male \n",
"15 3700.0 Female \n",
"16 3450.0 Female \n",
"17 4500.0 Male \n",
"18 3325.0 Female \n",
"19 4200.0 Male \n",
"68 3050.0 Female \n",
"69 4450.0 Male \n",
"70 3600.0 Female \n",
"71 3900.0 Male \n",
"72 3550.0 Female \n",
"73 4150.0 Male \n",
"74 3700.0 Female \n",
"75 4250.0 Male \n",
"76 3700.0 Female \n",
"77 3900.0 Male \n",
"78 3550.0 Female \n",
"79 4000.0 Male \n",
"80 3200.0 Female \n",
"81 4700.0 Male \n",
"82 3800.0 Female \n",
"83 4200.0 Male \n",
"116 2900.0 Female \n",
"117 3775.0 Male \n",
"118 3350.0 Female \n",
"119 3325.0 Male \n",
"120 3150.0 Female \n",
"121 3500.0 Male \n",
"122 3450.0 Female \n",
"123 3875.0 Male \n",
"124 3050.0 Female \n",
"125 4000.0 Male \n",
"126 3275.0 Female \n",
"127 4300.0 Male \n",
"128 3050.0 Female \n",
"129 4000.0 Male \n",
"130 3325.0 Female \n",
"131 3500.0 Male "
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin.loc[s_bool, :]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student0 | \n",
" 80 | \n",
" 90 | \n",
" 50 | \n",
"
\n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
" student2 | \n",
" 60 | \n",
" 60 | \n",
" 70 | \n",
"
\n",
" \n",
" student3 | \n",
" 30 | \n",
" 23 | \n",
" 64 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student0 80 90 50\n",
"student1 87 92 80\n",
"student2 60 60 70\n",
"student3 30 23 64"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz\n"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student0 False\n",
"student1 False\n",
"student2 True\n",
"student3 True\n",
"Name: quiz1, dtype: bool"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# what are all the students who get below a 70 on quiz1?\n",
"s_bool = df_quiz.loc[:, 'quiz1'] < 70\n",
"s_bool\n"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student2 | \n",
" 60 | \n",
" 60 | \n",
" 70 | \n",
"
\n",
" \n",
" student3 | \n",
" 30 | \n",
" 23 | \n",
" 64 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student2 60 60 70\n",
"student3 30 23 64"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz.loc[s_bool, :]\n"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"student0 False\n",
"student1 True\n",
"student2 False\n",
"student3 False\n",
"dtype: bool"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can build more complex conditions using \n",
"# & (and operator)\n",
"# | (or operator)\n",
"\n",
"# all students who got higher than 91 on quiz1 but didn't score higher than 90 on quiz2\n",
"s_bool = (df_quiz.loc[:, 'quiz1'] > 91) & (df_quiz.loc[:, 'quiz2'] <= 90)\n",
"s_bool\n"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" quiz0 | \n",
" quiz1 | \n",
" quiz2 | \n",
"
\n",
" \n",
" \n",
" \n",
" student1 | \n",
" 87 | \n",
" 92 | \n",
" 80 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" quiz0 quiz1 quiz2\n",
"student1 87 92 80"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_quiz.loc[s_bool, :]\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# One more thing, whats `pd.DataFrame.head()`?\n",
"\n",
"It grabs the \"head\" (the first few rows) of a dataframe. DataFrames can be so big that its overwhelming to look at the whole thing, sometimes a few rows is all thats needed.\n"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" Male | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" Female | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" Female | \n",
"
\n",
" \n",
" 3 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 19.3 | \n",
" 193.0 | \n",
" 3450.0 | \n",
" Female | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"3 Adelie Torgersen NaN NaN NaN \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"\n",
" body_mass_g sex \n",
"0 3750.0 Male \n",
"1 3800.0 Female \n",
"2 3250.0 Female \n",
"3 NaN NaN \n",
"4 3450.0 Female "
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin = sns.load_dataset('penguins')\n",
"df_penguin.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.1 | \n",
" 18.7 | \n",
" 181.0 | \n",
" 3750.0 | \n",
" Male | \n",
"
\n",
" \n",
" 1 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.5 | \n",
" 17.4 | \n",
" 186.0 | \n",
" 3800.0 | \n",
" Female | \n",
"
\n",
" \n",
" 2 | \n",
" Adelie | \n",
" Torgersen | \n",
" 40.3 | \n",
" 18.0 | \n",
" 195.0 | \n",
" 3250.0 | \n",
" Female | \n",
"
\n",
" \n",
" 3 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.7 | \n",
" 19.3 | \n",
" 193.0 | \n",
" 3450.0 | \n",
" Female | \n",
"
\n",
" \n",
" 5 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.3 | \n",
" 20.6 | \n",
" 190.0 | \n",
" 3650.0 | \n",
" Male | \n",
"
\n",
" \n",
" 6 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.9 | \n",
" 17.8 | \n",
" 181.0 | \n",
" 3625.0 | \n",
" Female | \n",
"
\n",
" \n",
" 7 | \n",
" Adelie | \n",
" Torgersen | \n",
" 39.2 | \n",
" 19.6 | \n",
" 195.0 | \n",
" 4675.0 | \n",
" Male | \n",
"
\n",
" \n",
" 8 | \n",
" Adelie | \n",
" Torgersen | \n",
" 34.1 | \n",
" 18.1 | \n",
" 193.0 | \n",
" 3475.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" Adelie | \n",
" Torgersen | \n",
" 42.0 | \n",
" 20.2 | \n",
" 190.0 | \n",
" 4250.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"3 Adelie Torgersen NaN NaN NaN \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"5 Adelie Torgersen 39.3 20.6 190.0 \n",
"6 Adelie Torgersen 38.9 17.8 181.0 \n",
"7 Adelie Torgersen 39.2 19.6 195.0 \n",
"8 Adelie Torgersen 34.1 18.1 193.0 \n",
"9 Adelie Torgersen 42.0 20.2 190.0 \n",
"\n",
" body_mass_g sex \n",
"0 3750.0 Male \n",
"1 3800.0 Female \n",
"2 3250.0 Female \n",
"3 NaN NaN \n",
"4 3450.0 Female \n",
"5 3650.0 Male \n",
"6 3625.0 Female \n",
"7 4675.0 Male \n",
"8 3475.0 NaN \n",
"9 4250.0 NaN "
]
},
"execution_count": 150,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DataFrame.head() takes an argument, the number of top rows to return\n",
"df_penguin.head(10)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## In Class Activity C\n",
"The `pclass` of a titanic ticket describes the passenger class. Its unclear if larger or smaller `pclass` are the fancy tickets. See if you can answer this question by:\n",
"\n",
"- `.describe()` the `fare` paid by passengers who bought `pclass=3` tickets\n",
"- `.describe()` the `fare` paid by passengers who bought `pclass=2` tickets\n",
"- `.describe()` the `fare` paid by passengers who bought `pclass=1` tickets\n",
"\n",
"(++) You can use this boolean indexing to compare groups to answer all sorts of interesting questions:\n",
"- Survival Effectiveness: Were people who travelled alone more or less likely to survive the titanic?\n",
"- Demographics of towns: Which town, among Cherbourg, Queenstown or Southampton, seems to have the most families?\n",
"- Layout of the boat: Does having a higher or lower cabin number suggest one is more likely to have a higher or lower ticket class?\n",
" - e.g. when `pclass=1` maybe these cabin numbers are all very large or small ...\n",
"\n",
"Data dictionary ([not the primary source, but a source](https://jkarakas.github.io/Exploratory-Analysis-of-the-Titanic-Dataset/Titanic_Dataset_Exploratory_Analysis_No_Code.html))\n",
"\n",
"| Variable | Definition | Key |\n",
"|----------|--------------------------------------------|-----------------------------------------------|\n",
"| Survived | Survival | 0 = No, 1 = Yes |\n",
"| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |\n",
"| Sex | Sex | |\n",
"| Age | Age in years | |\n",
"| Sibsp | # of siblings / spouses aboard the Titanic | |\n",
"| Parch | # of parents / children aboard the Titanic | |\n",
"| Ticket | Ticket number | |\n",
"| Fare | Passenger fare | |\n",
"| Cabin | Cabin number | |\n",
"| Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown,S = Southampton |\n"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" survived | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" embarked | \n",
" class | \n",
" who | \n",
" adult_male | \n",
" deck | \n",
" embark_town | \n",
" alive | \n",
" alone | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" C | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Cherbourg | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" yes | \n",
" True | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" S | \n",
" First | \n",
" woman | \n",
" False | \n",
" C | \n",
" Southampton | \n",
" yes | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" survived pclass sex age sibsp parch fare embarked class \\\n",
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
"1 1 1 female 38.0 1 0 71.2833 C First \n",
"2 1 3 female 26.0 0 0 7.9250 S Third \n",
"3 1 1 female 35.0 1 0 53.1000 S First \n",
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
"\n",
" who adult_male deck embark_town alive alone \n",
"0 man True NaN Southampton no False \n",
"1 woman False C Cherbourg yes False \n",
"2 woman False NaN Southampton yes True \n",
"3 woman False C Southampton yes False \n",
"4 man True NaN Southampton no True "
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic = sns.load_dataset('titanic')\n",
"df_titanic.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 158,
"metadata": {},
"outputs": [],
"source": [
"s_bool = df_titanic.loc[:, 'pclass'] == 3"
]
},
{
"cell_type": "code",
"execution_count": 160,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(891, 15)"
]
},
"execution_count": 160,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic.shape"
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" survived | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" embarked | \n",
" class | \n",
" who | \n",
" adult_male | \n",
" deck | \n",
" embark_town | \n",
" alive | \n",
" alone | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 123.0000 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" yes | \n",
" True | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" NaN | \n",
" 0 | \n",
" 0 | \n",
" 8.4583 | \n",
" Q | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Queenstown | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
" 7 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 2.0 | \n",
" 3 | \n",
" 1 | \n",
" 21.0750 | \n",
" S | \n",
" Third | \n",
" child | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 882 | \n",
" 0 | \n",
" 3 | \n",
" female | \n",
" 22.0 | \n",
" 0 | \n",
" 0 | \n",
" 10.5167 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
" 884 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 25.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.0500 | \n",
" S | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
" 885 | \n",
" 0 | \n",
" 3 | \n",
" female | \n",
" 39.0 | \n",
" 0 | \n",
" 5 | \n",
" 29.1250 | \n",
" Q | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Queenstown | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 888 | \n",
" 0 | \n",
" 3 | \n",
" female | \n",
" NaN | \n",
" 1 | \n",
" 2 | \n",
" 23.4500 | \n",
" S | \n",
" Third | \n",
" woman | \n",
" False | \n",
" NaN | \n",
" Southampton | \n",
" no | \n",
" False | \n",
"
\n",
" \n",
" 890 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 32.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.7500 | \n",
" Q | \n",
" Third | \n",
" man | \n",
" True | \n",
" NaN | \n",
" Queenstown | \n",
" no | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
491 rows × 15 columns
\n",
"
"
],
"text/plain": [
" survived pclass sex age sibsp parch fare embarked class \\\n",
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
"2 1 3 female 26.0 0 0 123.0000 S Third \n",
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
"5 0 3 male NaN 0 0 8.4583 Q Third \n",
"7 0 3 male 2.0 3 1 21.0750 S Third \n",
".. ... ... ... ... ... ... ... ... ... \n",
"882 0 3 female 22.0 0 0 10.5167 S Third \n",
"884 0 3 male 25.0 0 0 7.0500 S Third \n",
"885 0 3 female 39.0 0 5 29.1250 Q Third \n",
"888 0 3 female NaN 1 2 23.4500 S Third \n",
"890 0 3 male 32.0 0 0 7.7500 Q Third \n",
"\n",
" who adult_male deck embark_town alive alone \n",
"0 man True NaN Southampton no False \n",
"2 woman False NaN Southampton yes True \n",
"4 man True NaN Southampton no True \n",
"5 man True NaN Queenstown no True \n",
"7 child False NaN Southampton no False \n",
".. ... ... ... ... ... ... \n",
"882 woman False NaN Southampton no True \n",
"884 man True NaN Southampton no True \n",
"885 woman False NaN Queenstown no False \n",
"888 woman False NaN Southampton no False \n",
"890 man True NaN Queenstown no True \n",
"\n",
"[491 rows x 15 columns]"
]
},
"execution_count": 159,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# all rows with just pclass3\n",
"df_titanic.loc[s_bool, :]"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 1, 2])"
]
},
"execution_count": 154,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic['pclass'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pclass = 1\n",
" survived pclass age sibsp parch fare\n",
"count 216.000000 216.0 186.000000 216.000000 216.000000 216.000000\n",
"mean 0.629630 1.0 38.233441 0.416667 0.356481 84.154687\n",
"std 0.484026 0.0 14.802856 0.611898 0.693997 78.380373\n",
"min 0.000000 1.0 0.920000 0.000000 0.000000 0.000000\n",
"25% 0.000000 1.0 27.000000 0.000000 0.000000 30.923950\n",
"50% 1.000000 1.0 37.000000 0.000000 0.000000 60.287500\n",
"75% 1.000000 1.0 49.000000 1.000000 0.000000 93.500000\n",
"max 1.000000 1.0 80.000000 3.000000 4.000000 512.329200\n",
"pclass = 2\n",
" survived pclass age sibsp parch fare\n",
"count 184.000000 184.0 173.000000 184.000000 184.000000 184.000000\n",
"mean 0.472826 2.0 29.877630 0.402174 0.380435 20.662183\n",
"std 0.500623 0.0 14.001077 0.601633 0.690963 13.417399\n",
"min 0.000000 2.0 0.670000 0.000000 0.000000 0.000000\n",
"25% 0.000000 2.0 23.000000 0.000000 0.000000 13.000000\n",
"50% 0.000000 2.0 29.000000 0.000000 0.000000 14.250000\n",
"75% 1.000000 2.0 36.000000 1.000000 1.000000 26.000000\n",
"max 1.000000 2.0 70.000000 3.000000 3.000000 73.500000\n",
"pclass = 3\n",
" survived pclass age sibsp parch fare\n",
"count 491.000000 491.0 355.000000 491.000000 491.000000 491.000000\n",
"mean 0.242363 3.0 25.140620 0.615071 0.393075 13.909919\n",
"std 0.428949 0.0 12.495398 1.374883 0.888861 12.766894\n",
"min 0.000000 3.0 0.420000 0.000000 0.000000 0.000000\n",
"25% 0.000000 3.0 18.000000 0.000000 0.000000 7.750000\n",
"50% 0.000000 3.0 24.000000 0.000000 0.000000 8.050000\n",
"75% 0.000000 3.0 32.000000 1.000000 0.000000 15.500000\n",
"max 1.000000 3.0 74.000000 8.000000 6.000000 123.000000\n"
]
}
],
"source": [
"for pclass in sorted(df_titanic['pclass'].unique()):\n",
" print(f'pclass = {pclass}')\n",
" s_bool = df_titanic.loc[:, 'pclass'] == pclass\n",
" print(df_titanic.loc[s_bool, :].describe())"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}