{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# DS 2500 Day 12 \n",
"\n",
"Feb 21, 2023\n",
"\n",
"content:\n",
"- Ensuring meaningful distances in data\n",
"- K-NN classifier\n",
"\n",
"admin:\n",
"- hw due friday\n",
"- project proposal due next monday\n",
"- workshopping a student project\n",
"- install `plotly` & `sklearn` via pip (see below) for today's notes\n",
"\n",
" pip3 install plotly sklearn\n",
" \n",
" some mac / anaconda students had trouble in the first section via pip but were successful when using anaconda's own installation tool (see piazza for detail)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Representing data (samples & features)\n",
"To describe a collection of **samples** we record a set of **features** for each sample.\n",
"\n",
"For example, when describing penguins:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
species
\n",
"
island
\n",
"
bill_length_mm
\n",
"
bill_depth_mm
\n",
"
flipper_length_mm
\n",
"
body_mass_g
\n",
"
sex
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Adelie
\n",
"
Torgersen
\n",
"
39.1
\n",
"
18.7
\n",
"
181.0
\n",
"
3750.0
\n",
"
Male
\n",
"
\n",
"
\n",
"
1
\n",
"
Adelie
\n",
"
Torgersen
\n",
"
39.5
\n",
"
17.4
\n",
"
186.0
\n",
"
3800.0
\n",
"
Female
\n",
"
\n",
"
\n",
"
2
\n",
"
Adelie
\n",
"
Torgersen
\n",
"
40.3
\n",
"
18.0
\n",
"
195.0
\n",
"
3250.0
\n",
"
Female
\n",
"
\n",
"
\n",
"
4
\n",
"
Adelie
\n",
"
Torgersen
\n",
"
36.7
\n",
"
19.3
\n",
"
193.0
\n",
"
3450.0
\n",
"
Female
\n",
"
\n",
"
\n",
"
5
\n",
"
Adelie
\n",
"
Torgersen
\n",
"
39.3
\n",
"
20.6
\n",
"
190.0
\n",
"
3650.0
\n",
"
Male
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"0 Adelie Torgersen 39.1 18.7 181.0 \n",
"1 Adelie Torgersen 39.5 17.4 186.0 \n",
"2 Adelie Torgersen 40.3 18.0 195.0 \n",
"4 Adelie Torgersen 36.7 19.3 193.0 \n",
"5 Adelie Torgersen 39.3 20.6 190.0 \n",
"\n",
" body_mass_g sex \n",
"0 3750.0 Male \n",
"1 3800.0 Female \n",
"2 3250.0 Female \n",
"4 3450.0 Female \n",
"5 3650.0 Male "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import seaborn as sns\n",
"\n",
"df_penguin = sns.load_dataset('penguins')\n",
"\n",
"# discard all rows which are missing any data\n",
"df_penguin.dropna(axis=0, inplace=True)\n",
"\n",
"df_penguin.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each penguin is a sample for which we've observed 7 features:\n",
"\n",
"Quantitative:\n",
"- bill_length_mm\n",
"- bill_depth_mm\n",
"- flipper_length_mm\n",
"- body_mass_g\n",
"\n",
"Nominal:\n",
"- species\n",
"- island\n",
"- sex \n",
"\n",
"Let us represent the quantitative data as an array. \n",
"- We'll return to those Nominal features later"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Samples as vectors"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
bill_length_mm
\n",
"
bill_depth_mm
\n",
"
flipper_length_mm
\n",
"
body_mass_g
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
39.1
\n",
"
18.7
\n",
"
181.0
\n",
"
3750.0
\n",
"
\n",
"
\n",
"
1
\n",
"
40.2
\n",
"
17.9
\n",
"
194.0
\n",
"
3700.0
\n",
"
\n",
"
\n",
"
2
\n",
"
40.3
\n",
"
18.0
\n",
"
195.0
\n",
"
3250.0
\n",
"
\n",
"
\n",
"
4
\n",
"
36.7
\n",
"
19.3
\n",
"
193.0
\n",
"
3450.0
\n",
"
\n",
"
\n",
"
5
\n",
"
39.3
\n",
"
20.6
\n",
"
190.0
\n",
"
3650.0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n",
"0 39.1 18.7 181.0 3750.0\n",
"1 40.2 17.9 194.0 3700.0\n",
"2 40.3 18.0 195.0 3250.0\n",
"4 36.7 19.3 193.0 3450.0\n",
"5 39.3 20.6 190.0 3650.0"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# only focus on numerical features (for now)\n",
"col_num_list = 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'\n",
"df_penguin_num = df_penguin.loc[:, col_num_list]\n",
"\n",
"# for pedagogical reasons, we need penguin1 to have slightly different values\n",
"df_penguin_num.iloc[1, :] = [40.2, 17.9, 194.0, 3700]\n",
"\n",
"df_penguin_num.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Individual samples (penguins) are considered, mathematically, as vectors:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 39.1, 18.7, 181. , 3750. ])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin_num.iloc[0, :].values"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 39.1, 18.7, 181. , 3750. ])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"penguin0 = np.array(df_penguin_num.iloc[0, :])\n",
"penguin0"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Distances between samples\n",
"Many ML tools require that these vectors have meaningful distances between them. By \"meaningful\", we mean:\n",
"- large distances suggest samples are different\n",
"- small distances suggest samples are similar\n",
"\n",
"Computing distance between two vectors $x = \\begin{bmatrix} x_1 \\\\ x_2 \\end{bmatrix}$ and $x' = \\begin{bmatrix} x_1' \\\\ x_2' \\end{bmatrix}$:\n",
"\n",
"$$||x - x'||_2 = \\sqrt{\\sum_i (x_i - x_i')^2}$$\n",
"\n",
"In words, to compute the distance between two vectors:\n",
"- we square the differences of each element\n",
"- add these values together\n",
"- compute the square root of this sum\n",
"\n",
"How similar is penguin0 to penguin1?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bill_length_mm 39.1\n",
"bill_depth_mm 18.7\n",
"flipper_length_mm 181.0\n",
"body_mass_g 3750.0\n",
"Name: 0, dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguin0 = df_penguin_num.iloc[0, :]\n",
"penguin0"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bill_length_mm 40.2\n",
"bill_depth_mm 17.9\n",
"flipper_length_mm 194.0\n",
"body_mass_g 3700.0\n",
"Name: 1, dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"penguin1 = df_penguin_num.iloc[1, :]\n",
"penguin1"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"51.68026702717392"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sq_diff_per_feat = [(39.1 - 40.2) ** 2, (18.7 - 17.9) ** 2, (181 - 194) ** 2, (3750 - 3700) ** 2]\n",
"dist01_slow = sum(sq_diff_per_feat) ** .5\n",
"dist01_slow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In and of itself, this distance isn't too insightful ... the penguins are 50-ish (units?) apart? \n",
"\n",
"The value becomes more useful when compared to other distances: Is penguin 1 more similar to penguin 0 or penguin 2?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distance between penguin0 and penguin1: 51.680\n",
"distance between penguin1 and penguin2: 450.001\n"
]
}
],
"source": [
"vec_penguin0 = np.array(df_penguin_num.iloc[0, :])\n",
"vec_penguin1 = np.array(df_penguin_num.iloc[1, :])\n",
"vec_penguin2 = np.array(df_penguin_num.iloc[2, :])\n",
"\n",
"# a quicker, equivilent way to compute distance\n",
"dist01 = np.linalg.norm(vec_penguin0 - vec_penguin1)\n",
"dist12 = np.linalg.norm(vec_penguin1 - vec_penguin2)\n",
"\n",
"print(f'distance between penguin0 and penguin1: {dist01:.3f}')\n",
"print(f'distance between penguin1 and penguin2: {dist12:.3f}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Interpretting Distances\n",
"(And cleaning our inputs so they have an appropriate meaning to interpret)\n",
"\n",
"\n",
"Lets recap:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
bill_length_mm
\n",
"
bill_depth_mm
\n",
"
flipper_length_mm
\n",
"
body_mass_g
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
39.1
\n",
"
18.7
\n",
"
181.0
\n",
"
3750.0
\n",
"
\n",
"
\n",
"
1
\n",
"
40.2
\n",
"
17.9
\n",
"
194.0
\n",
"
3700.0
\n",
"
\n",
"
\n",
"
2
\n",
"
40.3
\n",
"
18.0
\n",
"
195.0
\n",
"
3250.0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n",
"0 39.1 18.7 181.0 3750.0\n",
"1 40.2 17.9 194.0 3700.0\n",
"2 40.3 18.0 195.0 3250.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin_num.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where penguin0 and penguin1 are more similar since we observed:\n",
"\n",
" distance between penguin0 and penguin1: 51.680\n",
" distance between penguin1 and penguin2: 450.001\n",
" \n",
"Is this satisfying or should penguin1 and penguin2 be considered more similar? Lets break it out by feature:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bill_length_mm 1.1\n",
"bill_depth_mm -0.8\n",
"flipper_length_mm 13.0\n",
"body_mass_g -50.0\n",
"dtype: float64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin_num.iloc[1, :] - df_penguin_num.iloc[0, :]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bill_length_mm -0.1\n",
"bill_depth_mm -0.1\n",
"flipper_length_mm -1.0\n",
"body_mass_g 450.0\n",
"dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin_num.iloc[1, :] - df_penguin_num.iloc[2, :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bills and flippers of penguin2 and penguin1 are just about identical ... but their difference in body mass is so large that it yields a large distance.\n",
"\n",
"### Big Idea 1: Distances assume that a change of 1 unit (in any feature) is equally significant\n",
"\n",
"What if we measured the body mass of the penguin in a different unit?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
bill_length_mm
\n",
"
bill_depth_mm
\n",
"
flipper_length_mm
\n",
"
body_mass_kg
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
39.1
\n",
"
18.7
\n",
"
181.0
\n",
"
3.750000e-13
\n",
"
\n",
"
\n",
"
1
\n",
"
40.2
\n",
"
17.9
\n",
"
194.0
\n",
"
3.700000e-13
\n",
"
\n",
"
\n",
"
2
\n",
"
40.3
\n",
"
18.0
\n",
"
195.0
\n",
"
3.250000e-13
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_kg\n",
"0 39.1 18.7 181.0 3.750000e-13\n",
"1 40.2 17.9 194.0 3.700000e-13\n",
"2 40.3 18.0 195.0 3.250000e-13"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# replace body_mass_g with body_mass_kg\n",
"df_penguin_num['body_mass_kg'] = df_penguin_num['body_mass_g'] / 10000000000000000\n",
"del df_penguin_num['body_mass_g']\n",
"\n",
"df_penguin_num.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"new distance between penguin0 and penguin1: 13.071\n",
"new distance between penguin1 and penguin2: 1.010\n"
]
}
],
"source": [
"vec_penguin0 = np.array(df_penguin_num.iloc[0, :])\n",
"vec_penguin1 = np.array(df_penguin_num.iloc[1, :])\n",
"vec_penguin2 = np.array(df_penguin_num.iloc[2, :])\n",
"\n",
"# a quicker way to compute distance\n",
"dist01 = np.linalg.norm(vec_penguin0 - vec_penguin1)\n",
"dist12 = np.linalg.norm(vec_penguin1 - vec_penguin2)\n",
"\n",
"print(f'new distance between penguin0 and penguin1: {dist01:.3f}')\n",
"print(f'new distance between penguin1 and penguin2: {dist12:.3f}')"
]
},
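{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough check of these values by hand, assuming the rescaled mass differences (on the order of $10^{-14}$ or smaller) are small enough to ignore:\n",
"\n",
"$$\\sqrt{1.1^2 + (-0.8)^2 + 13^2} = \\sqrt{170.85} \\approx 13.071$$\n",
"\n",
"$$\\sqrt{(-0.1)^2 + (-0.1)^2 + (-1)^2} = \\sqrt{1.02} \\approx 1.010$$\n",
"\n",
"Only the bill and flipper differences remain, which is why the ordering of the two distances flips."
]
},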
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These numbers aren't just different, they claim an opposite conclusion: penguin1 and penguin2 are more similar!\n",
"\n",
"- **Distances assume that a change of 1 unit (in any feature) is equally significant**\n",
"- **Distances implicitly weight how important each feature is relative to others according to its variance**\n",
" - a feature with a higher variance is responsible for more of the distances\n",
" \n",
"To wrap all the different features into a single distance we must say *something* about how important one feature is compared to another. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bill_length_mm 2.988886e+01\n",
"bill_depth_mm 3.879347e+00\n",
"flipper_length_mm 1.959126e+02\n",
"body_mass_kg 6.486477e-27\n",
"dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin_num.var()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Scale Normalization:\n",
"How to scale your features so that they're equally important in our distance metric:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bill_length_mm 2.988886e+01\n",
"bill_depth_mm 3.879347e+00\n",
"flipper_length_mm 1.959126e+02\n",
"body_mass_kg 6.486477e-27\n",
"dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df_penguin_num.var()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# by dividing each feature by the standard deviation, outputs will have same std dev\n",
"df_penguin_num_scaled = pd.DataFrame()\n",
"for feat in df_penguin_num.columns:\n",
" df_penguin_num_scaled[f'{feat}_scaled'] = df_penguin_num[feat] / df_penguin_num[feat].std()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bill_length_mm_scaled 1.0\n",
"bill_depth_mm_scaled 1.0\n",
"flipper_length_mm_scaled 1.0\n",
"body_mass_kg_scaled 1.0\n",
"dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_penguin_num_scaled.var()"
]
},
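{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why each scaled variance comes out as 1: a short derivation, writing $s$ for a feature's standard deviation (notation introduced here) and using the identity $\\mathrm{Var}(c x) = c^2 \\mathrm{Var}(x)$:\n",
"\n",
"$$\\mathrm{Var}\\left(\\frac{x}{s}\\right) = \\frac{1}{s^2} \\mathrm{Var}(x) = \\frac{s^2}{s^2} = 1$$"
]
},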
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that in doing so, our units are no longer valid:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n",
"65 41.6 18.0 192.0 3950.0 \n",
"276 43.8 13.9 208.0 4300.0 \n",
"186 49.7 18.6 195.0 3600.0 \n",
"198 50.1 17.9 190.0 3400.0 \n",
"293 46.5 14.8 217.0 5200.0 \n",
"\n",
" species_Adelie species_Chinstrap species_Gentoo island_Biscoe \\\n",
"65 1 0 0 1 \n",
"276 0 0 1 1 \n",
"186 0 1 0 0 \n",
"198 0 1 0 0 \n",
"293 0 0 1 1 \n",
"\n",
" island_Dream sex_Female sex_Male \n",
"65 0 0 1 \n",
"276 0 1 0 \n",
"186 1 0 1 \n",
"198 1 1 0 \n",
"293 0 1 0 "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# you can apply one hot encoding to multiple features\n",
"pd.get_dummies(df_penguin, columns=['species', 'island', 'sex'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notice:\n",
"One advantage of the \"one\"-hot-encoding is that a single sample can belong to multiple categories\n",
"- a penguin which lives on two islands\n",
" - (a penguin which heads to his warmer house in the winter)\n",
" \n",
"- consider a collection of boardgames, we can store their tags via one-hot encoding\n",
" - a single game (row) may have multiple tags:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
cooperative
\n",
"
includes element of luck
\n",
"
strains even good friendships
\n",
"
\n",
" \n",
" \n",
"
\n",
"
monopoly
\n",
"
0
\n",
"
1
\n",
"
1
\n",
"
\n",
"
\n",
"
pictionary
\n",
"
1
\n",
"
1
\n",
"
0
\n",
"
\n",
"
\n",
"
risk
\n",
"
0
\n",
"
1
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cooperative includes element of luck \\\n",
"monopoly 0 1 \n",
"pictionary 1 1 \n",
"risk 0 1 \n",
"\n",
" strains even good friendships \n",
"monopoly 1 \n",
"pictionary 0 \n",
"risk 1 "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_board_game = pd.DataFrame({'cooperative': [0, 1, 0], \n",
" 'includes element of luck': [1, 1, 1],\n",
" 'strains even good friendships': [1, 0, 1]}, \n",
" index=['monopoly', 'pictionary', 'risk'])\n",
"\n",
"df_board_game"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# K-Nearest Neighbors "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## ML overview\n",
"| | Input Features per sample | Output Features per sample | Supervised | Penguin Example |\n",
"|:------------------------:|:-------------------------:|:--------------------------:|:----------:|---------------------------------------------------------------------------------------|\n",
"| Classification | 1+ numerical features | one categorical feature | True | Given `body_weight_g`, `flipper_length_mm` estimate `species` |\n",
"| Regression | 1+ numerical features | one continuous feature | True | Given `body_weight_g`, `bill_depth_mm` estimate `flipper_length_mm` |\n",
"| Clustering | 1+ numerical features | one categorical feature | False | Identify k groups of penguins which have similar `body_weight_g`, `flipper_length_mm` |\n",
"| Dimensionality Reduction | N numerical features | < N numerical features | False | Find 2d vector which best represents all 4 of penguin's body/flipper/beak features |\n",
"\n",
"A **supervised** method is one whose output features are observed in some input data set. Notice:\n",
"- To build a penguin species **classifier**, we must observe the species of penguins in our data set\n",
"- To build a **clustering** of penguins, no output feature needs to be observed"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## K-Nearest Neighbors Classifier (Warm Up)\n",
"\n",
"#### Goal:\n",
"Make a function which estimates `species` from `bill_depth_mm` and `bill_length_mm`.\n",
"\n",
"#### Problem Statement (any classifier):\n",
"\n",
"Given an initial set of \"training\" penguins we observe:\n",
"- `bill_depth_mm`\n",
"- `bill_length_mm`\n",
"- `species` \n",
"\n",
"Given some new penguin, Gerald, who is not in the training set, we observe:\n",
"- `bill_depth_mm`\n",
"- `bill_length_mm`\n",
"\n",
"How can we estimate Gerald's `species`?\n",
"\n",
"#### K-Nearest Neighbors (k-NN) Approach:\n",
"1. We identify the penguins which are the Geradld's $k$ Nearest Neighbors:\n",
"- let us represent each penguin as a vector containing:\n",
" - `bill_depth_mm`\n",
" - `bill_length_mm`\n",
"- the **nearest neighbors** are the vectors which are closest to some target vector (Gerald)\n",
"2. We estimate Gerald's species as the most common species of these $k$ Nearest Neighbors."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"conf_mat_disp = ConfusionMatrixDisplay(conf_mat, display_labels=np.unique(y_true))\n",
"conf_mat_disp.plot()\n",
"\n",
"plt.gcf().set_size_inches(8, 8)\n",
"\n",
"# seaborn turns on grid by default ... looks best without it\n",
"plt.grid(False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In Class Exercise 2\n",
"\n",
"Build a K-NN classifier which estimates whether a passenger on the titanic `survived` given their `age`, `pclass` and `fare` features.\n",
"- Discard any passengers which are missing a feature\n",
"- Be mindful of scale normalization, you may need to adjust data a bit\n",
"- Show the output of your classification as a confusion matrix plot, as shown above"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1.2.1'"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sklearn\n",
"sklearn.__version__"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
survived
\n",
"
pclass
\n",
"
sex
\n",
"
age
\n",
"
sibsp
\n",
"
parch
\n",
"
fare
\n",
"
embarked
\n",
"
class
\n",
"
who
\n",
"
adult_male
\n",
"
deck
\n",
"
embark_town
\n",
"
alive
\n",
"
alone
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0
\n",
"
3
\n",
"
male
\n",
"
22.0
\n",
"
1
\n",
"
0
\n",
"
7.2500
\n",
"
S
\n",
"
Third
\n",
"
man
\n",
"
True
\n",
"
NaN
\n",
"
Southampton
\n",
"
no
\n",
"
False
\n",
"
\n",
"
\n",
"
1
\n",
"
1
\n",
"
1
\n",
"
female
\n",
"
38.0
\n",
"
1
\n",
"
0
\n",
"
71.2833
\n",
"
C
\n",
"
First
\n",
"
woman
\n",
"
False
\n",
"
C
\n",
"
Cherbourg
\n",
"
yes
\n",
"
False
\n",
"
\n",
"
\n",
"
2
\n",
"
1
\n",
"
3
\n",
"
female
\n",
"
26.0
\n",
"
0
\n",
"
0
\n",
"
7.9250
\n",
"
S
\n",
"
Third
\n",
"
woman
\n",
"
False
\n",
"
NaN
\n",
"
Southampton
\n",
"
yes
\n",
"
True
\n",
"
\n",
"
\n",
"
3
\n",
"
1
\n",
"
1
\n",
"
female
\n",
"
35.0
\n",
"
1
\n",
"
0
\n",
"
53.1000
\n",
"
S
\n",
"
First
\n",
"
woman
\n",
"
False
\n",
"
C
\n",
"
Southampton
\n",
"
yes
\n",
"
False
\n",
"
\n",
"
\n",
"
4
\n",
"
0
\n",
"
3
\n",
"
male
\n",
"
35.0
\n",
"
0
\n",
"
0
\n",
"
8.0500
\n",
"
S
\n",
"
Third
\n",
"
man
\n",
"
True
\n",
"
NaN
\n",
"
Southampton
\n",
"
no
\n",
"
True
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" survived pclass sex age sibsp parch fare embarked class \\\n",
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
"1 1 1 female 38.0 1 0 71.2833 C First \n",
"2 1 3 female 26.0 0 0 7.9250 S Third \n",
"3 1 1 female 35.0 1 0 53.1000 S First \n",
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
"\n",
" who adult_male deck embark_town alive alone \n",
"0 man True NaN Southampton no False \n",
"1 woman False C Cherbourg yes False \n",
"2 woman False NaN Southampton yes True \n",
"3 woman False C Southampton yes False \n",
"4 man True NaN Southampton no True "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic = sns.load_dataset('titanic')\n",
"df_titanic.head()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"\n",
"df_titanic.dropna(how='any', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',\n",
" 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',\n",
" 'alive', 'alone'],\n",
" dtype='object')"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_titanic.columns"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"k = 11\n",
"x_feat_list = ['age', 'pclass', 'fare']\n",
"y_feat = 'survived'\n",
"\n",
"df_titanic = sns.load_dataset('titanic')\n",
"df_titanic.dropna(how='any', inplace=True)\n",
"\n",
"# scale normalization (overwrites old data)\n",
"for feat in x_feat_list:\n",
" df_titanic[feat] = df_titanic[feat] / df_titanic[feat].std()\n",
"\n",
"# extract data into numpy format (for sklearn)\n",
"x = df_titanic.loc[:, x_feat_list].values\n",
"y_true = df_titanic.loc[:, y_feat].values\n",
"\n",
"# initialize a knn_classifier\n",
"knn_classifier = KNeighborsClassifier(n_neighbors=k)\n",
"\n",
"# fit happens \"inplace\", we modify the internal state of knn_classifier to remember all the training samples\n",
"knn_classifier.fit(x, y_true)\n",
"\n",
"# estimate each penguin's species\n",
"y_pred = knn_classifier.predict(x)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"conf_mat = confusion_matrix(y_true=y_true, y_pred=y_pred)\n",
"\n",
"conf_mat_disp = ConfusionMatrixDisplay(conf_mat, display_labels=np.unique(y_true))\n",
"conf_mat_disp.plot()\n",
"\n",
"plt.gcf().set_size_inches(7, 7)\n",
"\n",
"# seaborn turns on grid by default ... looks best without it\n",
"plt.grid(False)"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}