{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# DS 2500 Day 12 \n", "\n", "Feb 21, 2023\n", "\n", "content:\n", "- Ensuring meaningful distances in data\n", "- K-NN classifier\n", "\n", "admin:\n", "- hw due friday\n", "- project proposal due next monday\n", "- workshopping a student project\n", "- install `plotly` & `scikit-learn` (imported as `sklearn`) via pip (see below) for today's notes\n", "\n", " pip3 install plotly scikit-learn\n", " \n", " some mac / anaconda students had trouble in the first section via pip but were successful when using anaconda's own installation tool (see piazza for details)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Representing data (samples & features)\n", "To describe a collection of **samples** we record a set of **features** for each sample.\n", "\n", "For example, when describing penguins:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Adelie</td><td>Torgersen</td><td>39.1</td><td>18.7</td><td>181.0</td><td>3750.0</td><td>Male</td></tr>\n",
"<tr><th>1</th><td>Adelie</td><td>Torgersen</td><td>39.5</td><td>17.4</td><td>186.0</td><td>3800.0</td><td>Female</td></tr>\n",
"<tr><th>2</th><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td><td>Female</td></tr>\n",
"<tr><th>4</th><td>Adelie</td><td>Torgersen</td><td>36.7</td><td>19.3</td><td>193.0</td><td>3450.0</td><td>Female</td></tr>\n",
"<tr><th>5</th><td>Adelie</td><td>Torgersen</td><td>39.3</td><td>20.6</td><td>190.0</td><td>3650.0</td><td>Male</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", "5 Adelie Torgersen 39.3 20.6 190.0 \n", "\n", " body_mass_g sex \n", "0 3750.0 Male \n", "1 3800.0 Female \n", "2 3250.0 Female \n", "4 3450.0 Female \n", "5 3650.0 Male " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import seaborn as sns\n", "\n", "df_penguin = sns.load_dataset('penguins')\n", "\n", "# discard all rows which are missing any data\n", "df_penguin.dropna(axis=0, inplace=True)\n", "\n", "df_penguin.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each penguin is a sample for which we've observed 7 features:\n", "\n", "Quantitative:\n", "- bill_length_mm\n", "- bill_depth_mm\n", "- flipper_length_mm\n", "- body_mass_g\n", "\n", "Nominal:\n", "- species\n", "- island\n", "- sex \n", "\n", "Let us represent the quantitative data as an array. \n", "- We'll return to those Nominal features later" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Samples as vectors" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>39.1</td><td>18.7</td><td>181.0</td><td>3750.0</td></tr>\n",
"<tr><th>1</th><td>40.2</td><td>17.9</td><td>194.0</td><td>3700.0</td></tr>\n",
"<tr><th>2</th><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td></tr>\n",
"<tr><th>4</th><td>36.7</td><td>19.3</td><td>193.0</td><td>3450.0</td></tr>\n",
"<tr><th>5</th><td>39.3</td><td>20.6</td><td>190.0</td><td>3650.0</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n", "0 39.1 18.7 181.0 3750.0\n", "1 40.2 17.9 194.0 3700.0\n", "2 40.3 18.0 195.0 3250.0\n", "4 36.7 19.3 193.0 3450.0\n", "5 39.3 20.6 190.0 3650.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# only focus on numerical features (for now)\n", "col_num_list = 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'\n", "df_penguin_num = df_penguin.loc[:, col_num_list]\n", "\n", "# for pedagogical reasons, we need penguin1 to have slightly different values\n", "df_penguin_num.iloc[1, :] = [40.2, 17.9, 194.0, 3700]\n", "\n", "df_penguin_num.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual samples (penguins) are considered, mathematically, as vectors:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 39.1, 18.7, 181. , 3750. ])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num.iloc[0, :].values" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 39.1, 18.7, 181. , 3750. ])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "penguin0 = np.array(df_penguin_num.iloc[0, :])\n", "penguin0" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Distances between samples\n", "Many ML tools require that these vectors have meaningful distances between them. 
By \"meaningful\", we mean:\n", "- large distances suggest samples are different\n", "- small distances suggest samples are similar\n", "\n", "Computing distance between two vectors $x = \\begin{bmatrix} x_1 \\\\ x_2 \\end{bmatrix}$ and $x' = \\begin{bmatrix} x_1' \\\\ x_2' \\end{bmatrix}$:\n", "\n", "$$||x - x'||_2 = \\sqrt{\\sum_i (x_i - x_i')^2}$$\n", "\n", "In words, to compute the distance between two vectors:\n", "- we square the differences of each element\n", "- add these values together\n", "- compute the square root of this sum\n", "\n", "How similar is penguin0 to penguin1?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bill_length_mm 39.1\n", "bill_depth_mm 18.7\n", "flipper_length_mm 181.0\n", "body_mass_g 3750.0\n", "Name: 0, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguin0 = df_penguin_num.iloc[0, :]\n", "penguin0" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bill_length_mm 40.2\n", "bill_depth_mm 17.9\n", "flipper_length_mm 194.0\n", "body_mass_g 3700.0\n", "Name: 1, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguin1 = df_penguin_num.iloc[1, :]\n", "penguin1" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "51.68026702717392" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sq_diff_per_feat = [(39.1 - 40.2) ** 2, (18.7 - 17.9) ** 2, (181 - 194) ** 2, (3750 - 3700) ** 2]\n", "dist01_slow = sum(sq_diff_per_feat) ** .5\n", "dist01_slow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In and of itself, this distance isn't too insightful ... the penguins are 50-ish (units?) apart? 
\n", "\n", "The value becomes more useful when compared to other distances: Is penguin 1 more similar to penguin 0 or penguin 2?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "distance between penguin0 and penguin1: 51.680\n", "distance between penguin1 and penguin2: 450.001\n" ] } ], "source": [ "vec_penguin0 = np.array(df_penguin_num.iloc[0, :])\n", "vec_penguin1 = np.array(df_penguin_num.iloc[1, :])\n", "vec_penguin2 = np.array(df_penguin_num.iloc[2, :])\n", "\n", "# a quicker, equivalent way to compute distance\n", "dist01 = np.linalg.norm(vec_penguin0 - vec_penguin1)\n", "dist12 = np.linalg.norm(vec_penguin1 - vec_penguin2)\n", "\n", "print(f'distance between penguin0 and penguin1: {dist01:.3f}')\n", "print(f'distance between penguin1 and penguin2: {dist12:.3f}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Interpreting Distances\n", "(And cleaning our inputs so they have an appropriate meaning to interpret)\n", "\n", "\n", "Let's recap:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>39.1</td><td>18.7</td><td>181.0</td><td>3750.0</td></tr>\n",
"<tr><th>1</th><td>40.2</td><td>17.9</td><td>194.0</td><td>3700.0</td></tr>\n",
"<tr><th>2</th><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n", "0 39.1 18.7 181.0 3750.0\n", "1 40.2 17.9 194.0 3700.0\n", "2 40.3 18.0 195.0 3250.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By raw distance, penguin0 and penguin1 look more similar, since we observed:\n", "\n", " distance between penguin0 and penguin1: 51.680\n", " distance between penguin1 and penguin2: 450.001\n", " \n", "Is this satisfying or should penguin1 and penguin2 be considered more similar? Let's break it out by feature:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bill_length_mm 1.1\n", "bill_depth_mm -0.8\n", "flipper_length_mm 13.0\n", "body_mass_g -50.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num.iloc[1, :] - df_penguin_num.iloc[0, :]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bill_length_mm -0.1\n", "bill_depth_mm -0.1\n", "flipper_length_mm -1.0\n", "body_mass_g 450.0\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num.iloc[1, :] - df_penguin_num.iloc[2, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bills and flippers of penguin2 and penguin1 are just about identical ... but their difference in body mass is so large that it yields a large distance.\n", "\n", "### Big Idea 1: Distances assume that a change of 1 unit (in any feature) is equally significant\n", "\n", "What if we measured the body mass of the penguin in a different unit?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_kg</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>39.1</td><td>18.7</td><td>181.0</td><td>3.750000e-13</td></tr>\n",
"<tr><th>1</th><td>40.2</td><td>17.9</td><td>194.0</td><td>3.700000e-13</td></tr>\n",
"<tr><th>2</th><td>40.3</td><td>18.0</td><td>195.0</td><td>3.250000e-13</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_kg\n", "0 39.1 18.7 181.0 3.750000e-13\n", "1 40.2 17.9 194.0 3.700000e-13\n", "2 40.3 18.0 195.0 3.250000e-13" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace body_mass_g with a rescaled column named body_mass_kg\n", "# note: the divisor is 1e16, not the 1000 of a true g -> kg conversion,\n", "# which shrinks body mass until it barely affects the distance\n", "df_penguin_num['body_mass_kg'] = df_penguin_num['body_mass_g'] / 10000000000000000\n", "del df_penguin_num['body_mass_g']\n", "\n", "df_penguin_num.head(3)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "new distance between penguin0 and penguin1: 13.071\n", "new distance between penguin1 and penguin2: 1.010\n" ] } ], "source": [ "vec_penguin0 = np.array(df_penguin_num.iloc[0, :])\n", "vec_penguin1 = np.array(df_penguin_num.iloc[1, :])\n", "vec_penguin2 = np.array(df_penguin_num.iloc[2, :])\n", "\n", "# a quicker way to compute distance\n", "dist01 = np.linalg.norm(vec_penguin0 - vec_penguin1)\n", "dist12 = np.linalg.norm(vec_penguin1 - vec_penguin2)\n", "\n", "print(f'new distance between penguin0 and penguin1: {dist01:.3f}')\n", "print(f'new distance between penguin1 and penguin2: {dist12:.3f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These numbers aren't just different, they claim an opposite conclusion: penguin1 and penguin2 are more similar!\n", "\n", "- **Distances assume that a change of 1 unit (in any feature) is equally significant**\n", "- **Distances implicitly weight how important each feature is relative to others according to its variance**\n", " - a feature with a higher variance contributes more to the distance\n", " \n", "To wrap all the different features into a single distance we must say *something* about how important one feature is compared to another. 
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bill_length_mm 2.988886e+01\n", "bill_depth_mm 3.879347e+00\n", "flipper_length_mm 1.959126e+02\n", "body_mass_kg 6.486477e-27\n", "dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num.var()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Scale Normalization:\n", "How to scale your features so that they're equally important in our distance metric:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bill_length_mm 2.988886e+01\n", "bill_depth_mm 3.879347e+00\n", "flipper_length_mm 1.959126e+02\n", "body_mass_kg 6.486477e-27\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_penguin_num.var()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# by dividing each feature by the standard deviation, outputs will have same std dev\n", "df_penguin_num_scaled = pd.DataFrame()\n", "for feat in df_penguin_num.columns:\n", " df_penguin_num_scaled[f'{feat}_scaled'] = df_penguin_num[feat] / df_penguin_num[feat].std()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bill_length_mm_scaled 1.0\n", "bill_depth_mm_scaled 1.0\n", "flipper_length_mm_scaled 1.0\n", "body_mass_kg_scaled 1.0\n", "dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num_scaled.var()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that in doing so, our units are no longer valid:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>bill_length_mm_scaled</th><th>bill_depth_mm_scaled</th><th>flipper_length_mm_scaled</th><th>body_mass_kg_scaled</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>7.151911</td><td>9.494285</td><td>12.931456</td><td>4.656148</td></tr>\n",
"<tr><th>1</th><td>7.353115</td><td>9.088113</td><td>13.860235</td><td>4.594066</td></tr>\n",
"<tr><th>2</th><td>7.371407</td><td>9.138884</td><td>13.931679</td><td>4.035329</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " bill_length_mm_scaled bill_depth_mm_scaled flipper_length_mm_scaled \\\n", "0 7.151911 9.494285 12.931456 \n", "1 7.353115 9.088113 13.860235 \n", "2 7.371407 9.138884 13.931679 \n", "\n", " body_mass_kg_scaled \n", "0 4.656148 \n", "1 4.594066 \n", "2 4.035329 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num_scaled.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's remove the units from the column names (otherwise we might be tempted to draw inappropriate conclusions ...)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>bill_length_scaled</th><th>bill_depth_scaled</th><th>flippter_length_scaled</th><th>body_mass_scaled</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>7.151911</td><td>9.494285</td><td>12.931456</td><td>4.656148</td></tr>\n",
"<tr><th>1</th><td>7.353115</td><td>9.088113</td><td>13.860235</td><td>4.594066</td></tr>\n",
"<tr><th>2</th><td>7.371407</td><td>9.138884</td><td>13.931679</td><td>4.035329</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " bill_length_scaled bill_depth_scaled flippter_length_scaled \\\n", "0 7.151911 9.494285 12.931456 \n", "1 7.353115 9.088113 13.860235 \n", "2 7.371407 9.138884 13.931679 \n", "\n", " body_mass_scaled \n", "0 4.656148 \n", "1 4.594066 \n", "2 4.035329 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin_num_scaled.columns = ['bill_length_scaled',\n", " 'bill_depth_scaled',\n", " 'flippter_length_scaled',\n", " 'body_mass_scaled']\n", "df_penguin_num_scaled.head(3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### In Class Assignment 1\n", "\n", "Quantitatively, which pair of the following apartments is most similar?\n", "\n", "| | sq ft | bedrooms | bathrooms | toilets |\n", "|-------|------:|---------:|----------:|---------|\n", "| apt 0 | 850 | 2 | 1 | 1 |\n", "| apt 1 | 1000 | 2 | 2 | 2 |\n", "| apt 2 | 1300 | 3 | 2 | 2 |\n", "\n", "- Define and clearly explain how you quantify whether two apartments are similar or different\n", "- Build a dataframe and explicitly compute each pair's distance\n", "- Be warned, this example has a quirk we haven't yet seen in class. You can resolve it yourself with some careful thinking; do what makes sense to you!\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>sq ft</th><th>bedrooms</th><th>bathrooms</th><th>toilets</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>850</td><td>2</td><td>1</td><td>1</td></tr>\n",
"<tr><th>1</th><td>1000</td><td>2</td><td>2</td><td>2</td></tr>\n",
"<tr><th>2</th><td>1300</td><td>3</td><td>2</td><td>2</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " sq ft bedrooms bathrooms toilets\n", "0 850 2 1 1\n", "1 1000 2 2 2\n", "2 1300 3 2 2" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df_apt = pd.DataFrame({'sq ft': [850, 1000, 1300],\n", " 'bedrooms': [2, 2, 3],\n", " 'bathrooms': [1, 2, 2],\n", " 'toilets': [1, 2, 2]})\n", "df_apt" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "distance between apt0 and apt1: 150.007\n", "distance between apt1 and apt2: 300.002\n" ] } ], "source": [ "# before using scale normalization (bad idea, don't do this! ... we include it because it's educational to study)\n", "vec_apt0 = np.array(df_apt.iloc[0, :])\n", "vec_apt1 = np.array(df_apt.iloc[1, :])\n", "vec_apt2 = np.array(df_apt.iloc[2, :])\n", "\n", "# a quicker way to compute distance\n", "dist01 = np.linalg.norm(vec_apt0 - vec_apt1)\n", "dist12 = np.linalg.norm(vec_apt1 - vec_apt2)\n", "\n", "print(f'distance between apt0 and apt1: {dist01:.3f}')\n", "print(f'distance between apt1 and apt2: {dist12:.3f}')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sq ft 52500.000000\n", "bedrooms 0.333333\n", "bathrooms 0.333333\n", "toilets 0.333333\n", "dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_apt.var()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sq ft 1.0\n", "bedrooms 1.0\n", "bathrooms 1.0\n", "toilets 1.0\n", "dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# normalize scale\n", "for feat in df_apt.columns:\n", " df_apt[feat] = df_apt[feat] / df_apt[feat].std()\n", " \n", "df_apt.var()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# probably 
best to drop 'toilets' ... it's double counting with bathrooms!\n", "del df_apt['toilets']" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dist between apt0 / apt1: 1.8516401995451033\n", "dist between apt1 / apt2: 2.171240593367237\n", "dist between apt2 / apt0: 3.1396087108337016\n" ] } ], "source": [ "import numpy as np\n", "\n", "dist01 = np.linalg.norm(df_apt.iloc[1, :] - df_apt.iloc[0, :])\n", "dist12 = np.linalg.norm(df_apt.iloc[1, :] - df_apt.iloc[2, :])\n", "dist20 = np.linalg.norm(df_apt.iloc[2, :] - df_apt.iloc[0, :])\n", "\n", "print(f'dist between apt0 / apt1: {dist01}')\n", "print(f'dist between apt1 / apt2: {dist12}')\n", "print(f'dist between apt2 / apt0: {dist20}')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# feature engineering\n", "# note: this ratio is computed from the already scale-normalized columns\n", "df_apt['sq_ft/person'] = df_apt['sq ft'] / df_apt['bedrooms']" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>sq ft</th><th>bedrooms</th><th>bathrooms</th><th>sq_ft/person</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>3.709704</td><td>3.464102</td><td>1.732051</td><td>1.070899</td></tr>\n",
"<tr><th>1</th><td>4.364358</td><td>3.464102</td><td>3.464102</td><td>1.259882</td></tr>\n",
"<tr><th>2</th><td>5.673665</td><td>5.196152</td><td>3.464102</td><td>1.091897</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " sq ft bedrooms bathrooms sq_ft/person\n", "0 3.709704 3.464102 1.732051 1.070899\n", "1 4.364358 3.464102 3.464102 1.259882\n", "2 5.673665 5.196152 3.464102 1.091897" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_apt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## One hot encoding \n", "\n", "How can we include nominal information in these distance measurements? (species, sex, island)\n", "\n", "... we need a way of including nominal information in the vector representation of a penguin (i.e. one sample)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>65</th><td>Adelie</td><td>Biscoe</td><td>41.6</td><td>18.0</td><td>192.0</td><td>3950.0</td><td>Male</td></tr>\n",
"<tr><th>276</th><td>Gentoo</td><td>Biscoe</td><td>43.8</td><td>13.9</td><td>208.0</td><td>4300.0</td><td>Female</td></tr>\n",
"<tr><th>186</th><td>Chinstrap</td><td>Dream</td><td>49.7</td><td>18.6</td><td>195.0</td><td>3600.0</td><td>Male</td></tr>\n",
"<tr><th>198</th><td>Chinstrap</td><td>Dream</td><td>50.1</td><td>17.9</td><td>190.0</td><td>3400.0</td><td>Female</td></tr>\n",
"<tr><th>293</th><td>Gentoo</td><td>Biscoe</td><td>46.5</td><td>14.8</td><td>217.0</td><td>5200.0</td><td>Female</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "65 Adelie Biscoe 41.6 18.0 192.0 \n", "276 Gentoo Biscoe 43.8 13.9 208.0 \n", "186 Chinstrap Dream 49.7 18.6 195.0 \n", "198 Chinstrap Dream 50.1 17.9 190.0 \n", "293 Gentoo Biscoe 46.5 14.8 217.0 \n", "\n", " body_mass_g sex \n", "65 3950.0 Male \n", "276 4300.0 Female \n", "186 3600.0 Male \n", "198 3400.0 Female \n", "293 5200.0 Female " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_penguin = sns.load_dataset('penguins')\n", "\n", "# discard penguins with missing features\n", "df_penguin.dropna(axis=0, inplace=True)\n", "\n", "# shuffle order of rows (otherwise all same Species / Island)\n", "df_penguin = df_penguin.sample(frac=1, random_state=1)\n", "\n", "# grab only the first few rows\n", "df_penguin = df_penguin.head()\n", "\n", "df_penguin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One hot encoding: \n", "- replace a categorical column with a set of columns per each unique category\n", " - new columns have 1 where row belongs to category" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th><th>species_Adelie</th><th>species_Chinstrap</th><th>species_Gentoo</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>65</th><td>Biscoe</td><td>41.6</td><td>18.0</td><td>192.0</td><td>3950.0</td><td>Male</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>276</th><td>Biscoe</td><td>43.8</td><td>13.9</td><td>208.0</td><td>4300.0</td><td>Female</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>186</th><td>Dream</td><td>49.7</td><td>18.6</td><td>195.0</td><td>3600.0</td><td>Male</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th>198</th><td>Dream</td><td>50.1</td><td>17.9</td><td>190.0</td><td>3400.0</td><td>Female</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th>293</th><td>Biscoe</td><td>46.5</td><td>14.8</td><td>217.0</td><td>5200.0</td><td>Female</td><td>0</td><td>0</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n", "65 Biscoe 41.6 18.0 192.0 3950.0 \n", "276 Biscoe 43.8 13.9 208.0 4300.0 \n", "186 Dream 49.7 18.6 195.0 3600.0 \n", "198 Dream 50.1 17.9 190.0 3400.0 \n", "293 Biscoe 46.5 14.8 217.0 5200.0 \n", "\n", " sex species_Adelie species_Chinstrap species_Gentoo \n", "65 Male 1 0 0 \n", "276 Female 0 0 1 \n", "186 Male 0 1 0 \n", "198 Female 0 1 0 \n", "293 Female 0 0 1 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply one hot encoding to species column\n", "pd.get_dummies(df_penguin, columns=['species'])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>species_Adelie</th><th>species_Chinstrap</th><th>species_Gentoo</th><th>island_Biscoe</th><th>island_Dream</th><th>sex_Female</th><th>sex_Male</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>65</th><td>41.6</td><td>18.0</td><td>192.0</td><td>3950.0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>276</th><td>43.8</td><td>13.9</td><td>208.0</td><td>4300.0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th>186</th><td>49.7</td><td>18.6</td><td>195.0</td><td>3600.0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td></tr>\n",
"<tr><th>198</th><td>50.1</td><td>17.9</td><td>190.0</td><td>3400.0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td></tr>\n",
"<tr><th>293</th><td>46.5</td><td>14.8</td><td>217.0</td><td>5200.0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n", "65 41.6 18.0 192.0 3950.0 \n", "276 43.8 13.9 208.0 4300.0 \n", "186 49.7 18.6 195.0 3600.0 \n", "198 50.1 17.9 190.0 3400.0 \n", "293 46.5 14.8 217.0 5200.0 \n", "\n", " species_Adelie species_Chinstrap species_Gentoo island_Biscoe \\\n", "65 1 0 0 1 \n", "276 0 0 1 1 \n", "186 0 1 0 0 \n", "198 0 1 0 0 \n", "293 0 0 1 1 \n", "\n", " island_Dream sex_Female sex_Male \n", "65 0 0 1 \n", "276 0 1 0 \n", "186 1 0 1 \n", "198 1 1 0 \n", "293 0 1 0 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can apply one hot encoding to multiple features\n", "pd.get_dummies(df_penguin, columns=['species', 'island', 'sex'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notice:\n", "One advantage of the \"one\"-hot-encoding is that a single sample can belong to multiple categories\n", "- a penguin which lives on two islands\n", " - (a penguin which heads to his warmer house in the winter)\n", " \n", "- consider a collection of boardgames, we can store their tags via one-hot encoding\n", " - a single game (row) may have multiple tags:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>cooperative</th><th>includes element of luck</th><th>strains even good friendships</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>monopoly</th><td>0</td><td>1</td><td>1</td></tr>\n",
"<tr><th>pictionary</th><td>1</td><td>1</td><td>0</td></tr>\n",
"<tr><th>risk</th><td>0</td><td>1</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " cooperative includes element of luck \\\n", "monopoly 0 1 \n", "pictionary 1 1 \n", "risk 0 1 \n", "\n", " strains even good friendships \n", "monopoly 1 \n", "pictionary 0 \n", "risk 1 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_board_game = pd.DataFrame({'cooperative': [0, 1, 0], \n", " 'includes element of luck': [1, 1, 1],\n", " 'strains even good friendships': [1, 0, 1]}, \n", " index=['monopoly', 'pictionary', 'risk'])\n", "\n", "df_board_game" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# K-Nearest Neighbors " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## ML overview\n", "| | Input Features per sample | Output Features per sample | Supervised | Penguin Example |\n", "|:------------------------:|:-------------------------:|:--------------------------:|:----------:|---------------------------------------------------------------------------------------|\n", "| Classification | 1+ numerical features | one categorical feature | True | Given `body_mass_g`, `flipper_length_mm` estimate `species` |\n", "| Regression | 1+ numerical features | one continuous feature | True | Given `body_mass_g`, `bill_depth_mm` estimate `flipper_length_mm` |\n", "| Clustering | 1+ numerical features | one categorical feature | False | Identify k groups of penguins which have similar `body_mass_g`, `flipper_length_mm` |\n", "| Dimensionality Reduction | N numerical features | < N numerical features | False | Find a 2d vector which best represents all 4 of a penguin's body/flipper/bill features |\n", "\n", "A **supervised** method is one whose output features are observed in some input data set. 
Notice:\n", "- To build a penguin species **classifier**, we must observe the species of penguins in our data set\n", "- To build a **clustering** of penguins, no output feature needs to be observed" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## K-Nearest Neighbors Classifier (Warm Up)\n", "\n", "#### Goal:\n", "Make a function which estimates `species` from `bill_depth_mm` and `bill_length_mm`.\n", "\n", "#### Problem Statement (any classifier):\n", "\n", "Given an initial set of \"training\" penguins we observe:\n", "- `bill_depth_mm`\n", "- `bill_length_mm`\n", "- `species` \n", "\n", "Given some new penguin, Gerald, who is not in the training set, we observe:\n", "- `bill_depth_mm`\n", "- `bill_length_mm`\n", "\n", "How can we estimate Gerald's `species`?\n", "\n", "#### K-Nearest Neighbors (k-NN) Approach:\n", "1. We identify the penguins which are Gerald's $k$ Nearest Neighbors:\n", "- let us represent each penguin as a vector containing:\n", " - `bill_depth_mm`\n", " - `bill_length_mm`\n", "- the **nearest neighbors** are the vectors which are closest to some target vector (Gerald)\n", "2. We estimate Gerald's species as the most common species of these $k$ Nearest Neighbors." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Adelie</td><td>Torgersen</td><td>39.1</td><td>18.7</td><td>181.0</td><td>3750.0</td><td>Male</td></tr>\n",
"<tr><th>1</th><td>Adelie</td><td>Torgersen</td><td>39.5</td><td>17.4</td><td>186.0</td><td>3800.0</td><td>Female</td></tr>\n",
"<tr><th>2</th><td>Adelie</td><td>Torgersen</td><td>40.3</td><td>18.0</td><td>195.0</td><td>3250.0</td><td>Female</td></tr>\n",
"<tr><th>4</th><td>Adelie</td><td>Torgersen</td><td>36.7</td><td>19.3</td><td>193.0</td><td>3450.0</td><td>Female</td></tr>\n",
"<tr><th>5</th><td>Adelie</td><td>Torgersen</td><td>39.3</td><td>20.6</td><td>190.0</td><td>3650.0</td><td>Male</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 39.1 18.7 181.0 \n", "1 Adelie Torgersen 39.5 17.4 186.0 \n", "2 Adelie Torgersen 40.3 18.0 195.0 \n", "4 Adelie Torgersen 36.7 19.3 193.0 \n", "5 Adelie Torgersen 39.3 20.6 190.0 \n", "\n", " body_mass_g sex \n", "0 3750.0 Male \n", "1 3800.0 Female \n", "2 3250.0 Female \n", "4 3450.0 Female \n", "5 3650.0 Male " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import seaborn as sns\n", "\n", "df_penguin = sns.load_dataset('penguins')\n", "df_penguin.dropna(axis=0, inplace=True)\n", "df_penguin.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_65933/3859599772.py:10: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.\n", " df_penguin_gerald = df_penguin.append({'species': 'Unknown? (Gerald)',\n" ] }, { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hovertemplate": "species=Adelie
bill_depth_mm=%{x}
bill_length_mm=%{y}", "legendgroup": "Adelie", "marker": { "color": "#636efa", "symbol": "circle" }, "mode": "markers", "name": "Adelie", "orientation": "v", "showlegend": true, "type": "scatter", "x": [ 18.7, 17.4, 18, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2, 21.1, 17.8, 19, 20.7, 18.4, 21.5, 18.3, 18.7, 19.2, 18.1, 17.2, 18.9, 18.6, 17.9, 18.6, 18.9, 16.7, 18.1, 17.8, 18.9, 17, 21.1, 20, 18.5, 19.3, 19.1, 18, 18.4, 18.5, 19.7, 16.9, 18.8, 19, 17.9, 21.2, 17.7, 18.9, 17.9, 19.5, 18.1, 18.6, 17.5, 18.8, 16.6, 19.1, 16.9, 21.1, 17, 18.2, 17.1, 18, 16.2, 19.1, 16.6, 19.4, 19, 18.4, 17.2, 18.9, 17.5, 18.5, 16.8, 19.4, 16.1, 19.1, 17.2, 17.6, 18.8, 19.4, 17.8, 20.3, 19.5, 18.6, 19.2, 18.8, 18, 18.1, 17.1, 18.1, 17.3, 18.9, 18.6, 18.5, 16.1, 18.5, 17.9, 20, 16, 20, 18.6, 18.9, 17.2, 20, 17, 19, 16.5, 20.3, 17.7, 19.5, 20.7, 18.3, 17, 20.5, 17, 18.6, 17.2, 19.8, 17, 18.5, 15.9, 19, 17.6, 18.3, 17.1, 18, 17.9, 19.2, 18.5, 18.5, 17.6, 17.5, 17.5, 20.1, 16.5, 17.9, 17.1, 17.2, 15.5, 17, 16.8, 18.7, 18.6, 18.4, 17.8, 18.1, 17.1, 18.5 ], "xaxis": "x", "y": [ 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6, 34.6, 36.6, 38.7, 42.5, 34.4, 46, 37.8, 37.7, 35.9, 38.2, 38.8, 35.3, 40.6, 40.5, 37.9, 40.5, 39.5, 37.2, 39.5, 40.9, 36.4, 39.2, 38.8, 42.2, 37.6, 39.8, 36.5, 40.8, 36, 44.1, 37, 39.6, 41.1, 36, 42.3, 39.6, 40.1, 35, 42, 34.5, 41.4, 39, 40.6, 36.5, 37.6, 35.7, 41.3, 37.6, 41.1, 36.4, 41.6, 35.5, 41.1, 35.9, 41.8, 33.5, 39.7, 39.6, 45.8, 35.5, 42.8, 40.9, 37.2, 36.2, 42.1, 34.6, 42.9, 36.7, 35.1, 37.3, 41.3, 36.3, 36.9, 38.3, 38.9, 35.7, 41.1, 34, 39.6, 36.2, 40.8, 38.1, 40.3, 33.1, 43.2, 35, 41, 37.7, 37.8, 37.9, 39.7, 38.6, 38.2, 38.1, 43.2, 38.1, 45.6, 39.7, 42.2, 39.6, 42.7, 38.6, 37.3, 35.7, 41.1, 36.2, 37.7, 40.2, 41.4, 35.2, 40.6, 38.8, 41.5, 39, 44.1, 38.5, 43.1, 36.8, 37.5, 38.1, 41.1, 35.6, 40.2, 37, 39.7, 40.2, 40.6, 32.1, 40.7, 37.3, 39, 39.2, 36.6, 36, 37.8, 36, 41.5 ], "yaxis": "y" }, { "hovertemplate": "species=Chinstrap
bill_depth_mm=%{x}
bill_length_mm=%{y}", "legendgroup": "Chinstrap", "marker": { "color": "#EF553B", "symbol": "circle" }, "mode": "markers", "name": "Chinstrap", "orientation": "v", "showlegend": true, "type": "scatter", "x": [ 17.9, 19.5, 19.2, 18.7, 19.8, 17.8, 18.2, 18.2, 18.9, 19.9, 17.8, 20.3, 17.3, 18.1, 17.1, 19.6, 20, 17.8, 18.6, 18.2, 17.3, 17.5, 16.6, 19.4, 17.9, 19, 18.4, 19, 17.8, 20, 16.6, 20.8, 16.7, 18.8, 18.6, 16.8, 18.3, 20.7, 16.6, 19.9, 19.5, 17.5, 19.1, 17, 17.9, 18.5, 17.9, 19.6, 18.7, 17.3, 16.4, 19, 17.3, 19.7, 17.3, 18.8, 16.6, 19.9, 18.8, 19.4, 19.5, 16.5, 17, 19.8, 18.1, 18.2, 19, 18.7 ], "xaxis": "x", "y": [ 46.5, 50, 51.3, 45.4, 52.7, 45.2, 46.1, 51.3, 46, 51.3, 46.6, 51.7, 47, 52, 45.9, 50.5, 50.3, 58, 46.4, 49.2, 42.4, 48.5, 43.2, 50.6, 46.7, 52, 50.5, 49.5, 46.4, 52.8, 40.9, 54.2, 42.5, 51, 49.7, 47.5, 47.6, 52, 46.9, 53.5, 49, 46.2, 50.9, 45.5, 50.9, 50.8, 50.1, 49, 51.5, 49.8, 48.1, 51.4, 45.7, 50.7, 42.5, 52.2, 45.2, 49.3, 50.2, 45.6, 51.9, 46.8, 45.7, 55.8, 43.5, 49.6, 50.8, 50.2 ], "yaxis": "y" }, { "hovertemplate": "species=Gentoo
bill_depth_mm=%{x}
bill_length_mm=%{y}", "legendgroup": "Gentoo", "marker": { "color": "#00cc96", "symbol": "circle" }, "mode": "markers", "name": "Gentoo", "orientation": "v", "showlegend": true, "type": "scatter", "x": [ 13.2, 16.3, 14.1, 15.2, 14.5, 13.5, 14.6, 15.3, 13.4, 15.4, 13.7, 16.1, 13.7, 14.6, 14.6, 15.7, 13.5, 15.2, 14.5, 15.1, 14.3, 14.5, 14.5, 15.8, 13.1, 15.1, 15, 14.3, 15.3, 15.3, 14.2, 14.5, 17, 14.8, 16.3, 13.7, 17.3, 13.6, 15.7, 13.7, 16, 13.7, 15, 15.9, 13.9, 13.9, 15.9, 13.3, 15.8, 14.2, 14.1, 14.4, 15, 14.4, 15.4, 13.9, 15, 14.5, 15.3, 13.8, 14.9, 13.9, 15.7, 14.2, 16.8, 16.2, 14.2, 15, 15, 15.6, 15.6, 14.8, 15, 16, 14.2, 16.3, 13.8, 16.4, 14.5, 15.6, 14.6, 15.9, 13.8, 17.3, 14.4, 14.2, 14, 17, 15, 17.1, 14.5, 16.1, 14.7, 15.7, 15.8, 14.6, 14.4, 16.5, 15, 17, 15.5, 15, 16.1, 14.7, 15.8, 14, 15.1, 15.2, 15.9, 15.2, 16.3, 14.1, 16, 16.2, 13.7, 14.3, 15.7, 14.8, 16.1 ], "xaxis": "x", "y": [ 46.1, 50, 48.7, 50, 47.6, 46.5, 45.4, 46.7, 43.3, 46.8, 40.9, 49, 45.5, 48.4, 45.8, 49.3, 42, 49.2, 46.2, 48.7, 50.2, 45.1, 46.5, 46.3, 42.9, 46.1, 47.8, 48.2, 50, 47.3, 42.8, 45.1, 59.6, 49.1, 48.4, 42.6, 44.4, 44, 48.7, 42.7, 49.6, 45.3, 49.6, 50.5, 43.6, 45.5, 50.5, 44.9, 45.2, 46.6, 48.5, 45.1, 50.1, 46.5, 45, 43.8, 45.5, 43.2, 50.4, 45.3, 46.2, 45.7, 54.3, 45.8, 49.8, 49.5, 43.5, 50.7, 47.7, 46.4, 48.2, 46.5, 46.4, 48.6, 47.5, 51.1, 45.2, 45.2, 49.1, 52.5, 47.4, 50, 44.9, 50.8, 43.4, 51.3, 47.5, 52.1, 47.5, 52.2, 45.5, 49.5, 44.5, 50.8, 49.4, 46.9, 48.4, 51.1, 48.5, 55.9, 47.2, 49.1, 46.8, 41.7, 53.4, 43.3, 48.1, 50.5, 49.8, 43.5, 51.5, 46.2, 55.1, 48.8, 47.2, 46.8, 50.4, 45.2, 49.9 ], "yaxis": "y" }, { "hovertemplate": "species=Unknown? (Gerald)
bill_depth_mm=%{x}
bill_length_mm=%{y}", "legendgroup": "Unknown? (Gerald)", "marker": { "color": "#ab63fa", "symbol": "circle" }, "mode": "markers", "name": "Unknown? (Gerald)", "orientation": "v", "showlegend": true, "type": "scatter", "x": [ 17.42 ], "xaxis": "x", "y": [ 42.5 ], "yaxis": "y" } ], "layout": { "legend": { "title": { "text": "species" }, "tracegroupgap": 0 }, "margin": { "t": 60 }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], 
[ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { 
"colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 
0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": 
"" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "xaxis": { "anchor": "y", "domain": [ 0, 1 ], "title": { "text": "bill_depth_mm" } }, "yaxis": { "anchor": "x", "domain": [ 0, 1 ], "title": { "text": "bill_length_mm" } } } }, "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import plotly.express as px\n", "\n", "feat0 = 'bill_depth_mm'\n", "feat1 = 'bill_length_mm'\n", "\n", "sns.set(font_scale=1.2)\n", "\n", "# add Gerald to the dataframe\n", "df_penguin_gerald = df_penguin.append({'species': 'Unknown? (Gerald)', \n", " 'bill_depth_mm': 17.42, \n", " 'bill_length_mm': 42.5}, ignore_index=True)\n", "\n", "px.scatter(data_frame=df_penguin_gerald, x=feat0, y=feat1, color='species')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If K = 3, Gerald's K Nearest Neighbors are:\n", "- Chinstrap\n", "- Chinstrap\n", "- Adelie\n", "\n", "So we'd estimate Gerald is of species Chinstrap (most common Nearest Neighbor species)\n", "\n", "If K = 5, Gerald's K Nearest Neighbors are:\n", "- Chinstrap\n", "- chinstrap\n", "- Adelie\n", "- Adelie\n", "- Adelie\n", "\n", "So we'd estimate Gerald is of species Adelie (most common Nearest Neighbor species)\n", "\n", "If K=4, Gerald's K Nearest Neighbors are:\n", "- Chinstrap\n", "- chinstrap\n", "- Adelie\n", "- Adelie\n", "\n", "So we could either:\n", "- avoid outputting an estimate\n", "- discard furthest neighbor among k Nearest Neighbors\n", " - estimate \"recursively\" using K-1 Nearest Neighbors" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## K-Nearest Neighbors Demo\n", "\n", "http://vision.stanford.edu/teaching/cs231n-demos/knn/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Scikit Learn is a wonderful machine learning library in python\n", "\n", "- It has a K-Nearest Neighbors classifier\n", "- it has many other classifiers too\n", "- (they all have the same interface ...)\n", " - polymorphism!\n", "\n", "\n", "\n", "https://scikit-learn.org/stable/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## K-NN 
Classifier: prepping input for scikit-learn\n", "\n", "Scikit-learn operates on arrays. We must construct two arrays:\n", "- x\n", " - input feature matrix (the features we use to define distances)\n", " - each row is a sample (penguin)\n", " - each column is a numeric feature (bill depth/length)\n", "- y\n", " - target variable vector\n", " - vector length = number of samples (penguins)\n", " - each value is a categorical feature (species)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[18.7, 39.1],\n", " [17.4, 39.5],\n", " [18. , 40.3],\n", " [19.3, 36.7],\n", " [20.6, 39.3]])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "\n", "# build a matrix of input features\n", "x = df_penguin.loc[:, x_feat_list].values\n", "\n", "x[:5, :]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bill_depth_mm bill_length_mm\n", "0 18.7 39.1\n", "1 17.4 39.5\n", "2 18.0 40.3\n", "4 19.3 36.7\n", "5 20.6 39.3" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# see where x comes from?\n", "df_penguin.loc[:, x_feat_list].head(5)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie'], dtype=object)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_true = df_penguin.loc[:, 'species'].values\n", "y_true[:5]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## K-NN Classifier: sci-kit learn & confusion matrix" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "k = 9\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "y_feat = 'species'\n", "\n", "x = df_penguin.loc[:, x_feat_list].values\n", "y_true = df_penguin.loc[:, y_feat].values\n", "\n", "# initialize a knn_classifier\n", "knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# fit happens \"inplace\", we modify the internal state of knn_classifier (it remembers all the training samples)\n", "knn_classifier.fit(x, y_true)\n", "\n", "# estimate each penguin's species\n", "y_pred = knn_classifier.predict(x)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(1, 'a'), (2, 'b'), (3, 'c')]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(zip([1, 2, 3], ['a', 'b', 'c']))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Adelie', 'Adelie'),\n", " ('Adelie', 'Adelie'),\n", " ('Adelie', 'Adelie'),\n", " ('Adelie', 'Adelie'),\n", " ('Adelie', 
'Adelie')]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# zip together a list of (true, predicted) pairs\n", "true_pred_list = list(zip(y_true, y_pred))\n", "true_pred_list[:5]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({('Adelie', 'Adelie'): 142,\n", " ('Adelie', 'Chinstrap'): 4,\n", " ('Chinstrap', 'Chinstrap'): 61,\n", " ('Chinstrap', 'Gentoo'): 3,\n", " ('Chinstrap', 'Adelie'): 4,\n", " ('Gentoo', 'Gentoo'): 116,\n", " ('Gentoo', 'Chinstrap'): 3})" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "# one way of getting a sense of how well we did\n", "Counter(true_pred_list)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[142, 4, 0],\n", " [ 4, 61, 3],\n", " [ 0, 3, 116]])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "import numpy as np\n", "\n", "conf_mat = confusion_matrix(y_true=y_true, y_pred=y_pred)\n", "\n", "# examine confusion matrix (rows: true species, columns: predicted species)\n", "conf_mat" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conf_mat_disp = ConfusionMatrixDisplay(conf_mat, display_labels=np.unique(y_true))\n", "conf_mat_disp.plot()\n", "\n", "plt.gcf().set_size_inches(8, 8)\n", "\n", "# seaborn turns on grid by default ... looks best without it\n", "plt.grid(False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In Class Exercise 2\n", "\n", "Build a K-NN classifier which estimates whether a passenger on the titanic `survived` given their `age`, `pclass` and `fare` features.\n", "- Discard any passengers which are missing a feature\n", "- Be mindful of scale normalization, you may need to adjust data a bit\n", "- Show the output of your classification as a confusion matrix plot, as shown above" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.2.1'" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sklearn\n", "sklearn.__version__" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " survived pclass sex age sibsp parch fare embarked class \\\n", "0 0 3 male 22.0 1 0 7.2500 S Third \n", "1 1 1 female 38.0 1 0 71.2833 C First \n", "2 1 3 female 26.0 0 0 7.9250 S Third \n", "3 1 1 female 35.0 1 0 53.1000 S First \n", "4 0 3 male 35.0 0 0 8.0500 S Third \n", "\n", " who adult_male deck embark_town alive alone \n", "0 man True NaN Southampton no False \n", "1 woman False C Cherbourg yes False \n", "2 woman False NaN Southampton yes True \n", "3 woman False C Southampton yes False \n", "4 man True NaN Southampton no True " ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_titanic = sns.load_dataset('titanic')\n", "df_titanic.head()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "\n", "df_titanic.dropna(how='any', inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',\n", " 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',\n", " 'alive', 'alone'],\n", " dtype='object')" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_titanic.columns" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "k = 11\n", "x_feat_list = ['age', 'pclass', 'fare']\n", "y_feat = 'survived'\n", "\n", "df_titanic = sns.load_dataset('titanic')\n", "df_titanic.dropna(how='any', inplace=True)\n", "\n", "# scale normalization (overwrites old data)\n", "for feat in x_feat_list:\n", " df_titanic[feat] = df_titanic[feat] / df_titanic[feat].std()\n", "\n", "# extract data into numpy format (for sklearn)\n", "x = df_titanic.loc[:, 
x_feat_list].values\n", "y_true = df_titanic.loc[:, y_feat].values\n", "\n", "# initialize a knn_classifier\n", "knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# fit happens \"inplace\", we modify the internal state of knn_classifier to remember all the training samples\n", "knn_classifier.fit(x, y_true)\n", "\n", "# estimate whether each passenger survived\n", "y_pred = knn_classifier.predict(x)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conf_mat = confusion_matrix(y_true=y_true, y_pred=y_pred)\n", "\n", "conf_mat_disp = ConfusionMatrixDisplay(conf_mat, display_labels=np.unique(y_true))\n", "conf_mat_disp.plot()\n", "\n", "plt.gcf().set_size_inches(7, 7)\n", "\n", "# seaborn turns on grid by default ... looks best without it\n", "plt.grid(False)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }