{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DS2500 Day 13\n", "\n", "Content:\n", "- Cross Validation\n", "- Measuring Binary Classifier Performance\n", "\n", "Admin:\n", "- hw4 due tonight\n", "- proposal due next Monday\n", " - anybody want to workshop one live here at the start of class?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Claim: I know what your favorite number is\n", "\n", "(activity motivating cross validation)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# In Class Assignment 1\n", "\n", "Use the given k-nearest neighbor classifier which estimates a penguin's `species` by observing its `bill_depth_mm` and `bill_length_mm` to:\n", "- Plot a confusion matrix which shows the performance of your classifier\n", "- In a few sentences, explain whether this confusion matrix accurately represents the performance of the classifier on **new** penguins (those the classifier hasn't been trained on). Why or why not? How might you fix this issue?" 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "df_penguin = sns.load_dataset('penguins')\n", "df_penguin.dropna(how='any', inplace=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "k = 11\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "y_feat = 'species'\n", "\n", "x = df_penguin.loc[:, x_feat_list].values\n", "y_true = df_penguin.loc[:, y_feat].values\n", "\n", "# initialize a knn_classifier\n", "knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# fit happens \"inplace\", we modify the internal state of knn_classifier to remember all the training samples\n", "knn_classifier.fit(x, y_true)\n", "\n", "# estimate each penguin's species\n", "y_pred = knn_classifier.predict(x)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conf_mat = confusion_matrix(y_true=y_true, y_pred=y_pred)\n", "conf_mat_disp = ConfusionMatrixDisplay(conf_mat, display_labels=np.unique(y_true))\n", "\n", "sns.set(font_scale=2)\n", "conf_mat_disp.plot()\n", "plt.suptitle('K=11 NN Classifier\\nSeems a little too accurate ...')\n", "plt.gcf().set_size_inches(7, 7)\n", "plt.grid(False)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Cross Validation\n", "\n", "Our motivation behind the confusion matrix was to quantify how well our classifier performs on samples it **hasn't** seen yet. \n", "- We built the classifier estimate the group of a new samples, we already know the group of every sample in the training set!\n", "\n", "# To estimate classifier classifier performance on new samples...\n", "## ... we must measure its performance on samples it hasn't been trained on" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### K-Fold Cross Validation\n", "\n", "1. Partition the data into K distinct \"folds\" (subset of the data)\n", "1. For each fold i:\n", " - train the model on all but the i-th fold\n", " - test the model on the i-th fold\n", " \n", " \n", "\n", " \n", "Animated:\n", "\n", "http://assets.yihui.org/figures/animation/example/cv-ani/demo-a.mp4" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Cross Validation in `sklearn`\n", "\n", "Silly example dataset:\n", "- Using the `favorite number` and `letters in name` to predict which of the beatles `takes milk in coffee`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name favorite number letters in name takes milk in coffee\n", "0 john 0 4 True\n", "1 ringo 2 5 True\n", "2 paul 104 4 False\n", "3 george -5 6 False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import KFold\n", "import pandas as pd\n", "\n", "df_beatles = pd.DataFrame({'name': ['john', 'ringo', 'paul', 'george'],\n", " 'favorite number': [0, 2, 104, -5],\n", " 'letters in name': [4, 5, 4, 6],\n", " 'takes milk in coffee': [True, True, False, False]})\n", "df_beatles" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "in this iteration:\n", "we train on idx: [1 2 3] (i.e. ['ringo', 'paul', 'george'])\n", "we test on idx: [0] (i.e. ['john'])\n", "----------\n", "in this iteration:\n", "we train on idx: [0 2 3] (i.e. ['john', 'paul', 'george'])\n", "we test on idx: [1] (i.e. ['ringo'])\n", "----------\n", "in this iteration:\n", "we train on idx: [0 1 3] (i.e. ['john', 'ringo', 'george'])\n", "we test on idx: [2] (i.e. ['paul'])\n", "----------\n", "in this iteration:\n", "we train on idx: [0 1 2] (i.e. ['john', 'ringo', 'paul'])\n", "we test on idx: [3] (i.e. ['george'])\n", "----------\n" ] } ], "source": [ "x_feat_list = ['favorite number', 'letters in name']\n", "y_feat = 'takes milk in coffee'\n", "\n", "# extract the values as np.arrays\n", "x = df_beatles.loc[:, x_feat_list].values\n", "y = df_beatles.loc[:, y_feat].values\n", "\n", "# construction of kfold object\n", "kfold = KFold(n_splits=4)\n", "\n", "for train_idx, test_idx in kfold.split(x, y):\n", " print('in this iteration:') \n", " name_train = list(df_beatles.loc[train_idx, 'name'])\n", " name_test = list(df_beatles.loc[test_idx, 'name'])\n", " print(f'we train on idx: {train_idx} (i.e. {name_train})')\n", " print(f'we test on idx: {test_idx} (i.e. 
{name_test})')\n", " print('-'*10)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1. , 2. , 2.5, 5. , 10. ],\n", " [ 20. , 25. , 50. , 100. , 200. ]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# np.empty allocates a new array of given shape\n", "# (no guarantees about the values inside!)\n", "import numpy as np\n", "np.empty((2, 5))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import KFold\n", "\n", "# parameters of classifier\n", "k = 1\n", "x_feat_list = ['favorite number', 'letters in name']\n", "y_feat = 'takes milk in coffee'\n", "\n", "# extract the values as np.arrays\n", "x = df_beatles.loc[:, x_feat_list].values\n", "y_true = df_beatles.loc[:, y_feat].values\n", "\n", "# initialize classifier\n", "knn_class = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# initialize kfold object\n", "kfold = KFold(n_splits=4)\n", "\n", "# initialize an array of same shape as y_true\n", "# (np.empty defaults to float, so boolean predictions are stored as 1.0 / 0.0)\n", "y_pred = np.empty(y_true.shape)\n", "for train_idx, test_idx in kfold.split(x, y_true):\n", " # split into train and test sets\n", " x_train = x[train_idx, :]\n", " y_train = y_true[train_idx]\n", " x_test = x[test_idx, :]\n", " \n", " # fit classifier (on training set)\n", " knn_class.fit(x_train, y_train)\n", " \n", " # predict (on testing set)\n", " y_pred[test_idx] = knn_class.predict(x_test)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1., 1., 1., 1.])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# y pred contains a prediction for each sample ... 
but not from the same classifier\n", "y_pred" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applying Cross Validation to our penguin species classification..." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from copy import copy\n", "\n", "k = 1\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "y_feat = 'species'\n", "\n", "x = df_penguin.loc[:, x_feat_list].values\n", "y_true = df_penguin.loc[:, y_feat].values\n", "\n", "# initialize a knn_classifier\n", "knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# construction of kfold object\n", "kfold = KFold(n_splits=3)\n", "\n", "# allocate an array (same shape & dtype as y_true) to store predictions in\n", "# (it starts as a copy of y_true, every entry is overwritten in the loop below)\n", "y_pred = copy(y_true)\n", "\n", "for train_idx, test_idx in kfold.split(x, y_true):\n", " # build arrays which correspond to x, y train / test\n", " x_test = x[test_idx, :]\n", " x_train = x[train_idx, :]\n", " y_true_train = y_true[train_idx]\n", " \n", " # fit happens \"inplace\", we modify the internal state of knn_classifier to remember all the training samples\n", " knn_classifier.fit(x_train, y_true_train)\n", "\n", " # estimate each penguin's species\n", " y_pred[test_idx] = knn_classifier.predict(x_test)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# build and plot confusion matrix\n", "conf_mat = confusion_matrix(y_true=y_true, y_pred=y_pred)\n", "conf_mat_disp = ConfusionMatrixDisplay(conf_mat, display_labels=np.unique(y_true))\n", "\n", "sns.set(font_scale=1.5)\n", "conf_mat_disp.plot()\n", "plt.suptitle('K=1 NN Classifier')\n", "plt.gcf().set_size_inches(6, 6)\n", "plt.grid(False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How come we don't get any chinstraps correct?\n", "\n", "hint: count the `species` in each set of training data." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "trained on: Counter({'Gentoo': 119, 'Chinstrap': 68, 'Adelie': 35})\n", "tested on: Counter({'Adelie': 111})\n", "----------\n", "trained on: Counter({'Adelie': 111, 'Gentoo': 111})\n", "tested on: Counter({'Chinstrap': 68, 'Adelie': 35, 'Gentoo': 8})\n", "----------\n", "trained on: Counter({'Adelie': 146, 'Chinstrap': 68, 'Gentoo': 8})\n", "tested on: Counter({'Gentoo': 111})\n", "----------\n" ] } ], "source": [ "from copy import copy\n", "from collections import Counter\n", "\n", "k = 1\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "y_feat = 'species'\n", "\n", "x = df_penguin.loc[:, x_feat_list].values\n", "y_true = df_penguin.loc[:, y_feat].values\n", "\n", "# initialize a knn_classifier\n", "knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# construction of kfold object\n", "kfold = KFold(n_splits=3)\n", "\n", "# allocate an empty array to store predictions in\n", "y_pred = copy(y_true)\n", "\n", "for train_idx, test_idx in kfold.split(x, y_true):\n", " # build arrays which correspond to x, y train /test\n", " x_test = x[test_idx, :]\n", " x_train = x[train_idx, :]\n", " y_true_train = y_true[train_idx]\n", " \n", " print(f'trained on: {Counter(y_true_train)}')\n", " print(f'tested on: 
{Counter(y_true[test_idx])}')\n", " print('-' * 10)\n", " \n", " # fit happens \"inplace\", we modify the internal state of knn_classifier to remember all the training samples\n", " knn_classifier.fit(x_train, y_true_train)\n", "\n", " # estimate each penguin's species\n", " y_pred[test_idx] = knn_classifier.predict(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we don't always get a proper mix of species in each training set ... the second training set doesn't even have any Chinstrap penguins!\n", "\n", "Why is this?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array(['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 
'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',\n", " 'Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap', 'Chinstrap',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 
'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo',\n", " 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo', 'Gentoo'], dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Adelie penguins up front, then chinstrap, finally all gentoo\n", "y_true" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is sorted by species.\n", "\n", "## Ensuring that each training set represents all target groups:\n", "\n", "- we can pass the `shuffle=True` parameter to the constructor of `KFold` so that the data is shuffled before it is split into folds\n", "\n", "- alternatively, we could use a `StratifiedKFold` object in place of a plain old `KFold`:\n", "\n", "\n", " \"This cross-validation object is a variation of KFold that returns stratified folds. 
The folds are made by \n", " preserving the percentage of samples for each class.\" \n", " \n", "taken from [sklearn doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "using `shuffle=True` on a `KFold`" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({'Adelie': 97, 'Gentoo': 76, 'Chinstrap': 49})\n", "Counter({'Adelie': 99, 'Gentoo': 77, 'Chinstrap': 46})\n", "Counter({'Adelie': 96, 'Gentoo': 85, 'Chinstrap': 41})\n" ] } ], "source": [ "from copy import copy\n", "from collections import Counter\n", "\n", "k = 1\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "y_feat = 'species'\n", "\n", "x = df_penguin.loc[:, x_feat_list].values\n", "y_true = df_penguin.loc[:, y_feat].values\n", "\n", "# initialize a knn_classifier\n", "knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# construction of kfold object\n", "kfold = KFold(n_splits=3, shuffle=True)\n", "\n", "# allocate an empty array to store predictions in\n", "y_pred = copy(y_true)\n", "\n", "for train_idx, test_idx in kfold.split(x, y_true):\n", " # build arrays which correspond to x, y train /test\n", " x_test = x[test_idx, :]\n", " x_train = x[train_idx, :]\n", " y_true_train = y_true[train_idx]\n", " \n", " print(Counter(y_true_train))\n", " \n", " # fit happens \"inplace\", we modify the internal state of knn_classifier to remember all the training samples\n", " knn_classifier.fit(x_train, y_true_train)\n", "\n", " # estimate each penguin's species\n", " y_pred[test_idx] = knn_classifier.predict(x_test)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({'Adelie': 97, 'Gentoo': 80, 'Chinstrap': 45})\n", "Counter({'Adelie': 97, 'Gentoo': 79, 
'Chinstrap': 46})\n", "Counter({'Adelie': 98, 'Gentoo': 79, 'Chinstrap': 45})\n" ] } ], "source": [ "from sklearn.model_selection import StratifiedKFold\n", "\n", "from copy import copy\n", "from collections import Counter\n", "\n", "k = 1\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "y_feat = 'species'\n", "\n", "x = df_penguin.loc[:, x_feat_list].values\n", "y_true = df_penguin.loc[:, y_feat].values\n", "\n", "# initialize a knn_classifier\n", "knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "# construction of kfold object\n", "kfold = StratifiedKFold(n_splits=3)\n", "\n", "# allocate an empty array to store predictions in\n", "y_pred = copy(y_true)\n", "\n", "for train_idx, test_idx in kfold.split(x, y_true):\n", " # build arrays which correspond to x, y train /test\n", " x_test = x[test_idx, :]\n", " x_train = x[train_idx, :]\n", " y_true_train = y_true[train_idx]\n", " \n", " print(Counter(y_true_train))\n", " \n", " # fit happens \"inplace\", we modify the internal state of knn_classifier to remember all the training samples\n", " knn_classifier.fit(x_train, y_true_train)\n", "\n", " # estimate each penguin's species\n", " y_pred[test_idx] = knn_classifier.predict(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have a mild preference for the `StratifiedKFold` since it gets us as close as possible to an even splitting of target groups." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Note this quirk:\n", "the resulting `y_pred` contains predictions for all the samples ...\n", "\n", "... but samples in different folds were estimated by classifiers trained on different data\n", "\n", "### So which of these trained classifiers should we use as our \"final\" classifier?\n", "\n", "None of them, better to re-train on the whole dataset:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "KNeighborsClassifier(n_neighbors=1)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# discard previous training sets and train classifier on whole dataset\n", "# this is best for *truly new samples \n", "# (*penguins whose species is unknown, not just \"hidden\" for evaluation purposes)\n", "knn_classifier.fit(x, y_true)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Computing Accuracy\n", "\n", "use `accuracy_score`:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "y_true = [0, 0, 0, 2, 1]\n", "y_pred = [1, 1, 1, 2, 1]\n", "\n", "accuracy_score(y_true, y_pred)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# In Class Assignment 2\n", "\n", "One question we never answered: How do we pick the best K for a K-NN classifier?\n", "\n", "A common solution is to try many different k and then choose the one which works \"best\".\n", "\n", "\n", "\n", "In this ICA, make this plot of the **cross validated** accuracy of the k-NN classifier for k = 1 to 50.\n", "- in your cross validation, use `n_splits=10` folds of data\n", "- write a function `get_cv_acc_knn()` which:\n", " - accepts:\n", " - `x`, `y_true`, `k` (of k-NN) as defined above\n", " - stick with the same classification problem where we estimate `species` from `bill_depth_mm` and `bill_length_mm`\n", " - `n_splits=10` (defaults)\n", " - returns \n", " - the cross validated accuracy of k-NN on the dataset " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "from 
sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import ConfusionMatrixDisplay\n", "from sklearn.metrics import accuracy_score\n", "import numpy as np\n", "\n", "def get_cv_acc_knn(x, y_true, k, n_splits=10):\n", "    \"\"\" computes cross validated accuracy of a KNN classifier\n", "    \n", "    Args:\n", "        x (np.array): (n_sample, n_feat) features\n", "        y_true (np.array): (n_sample) target variable\n", "        k (int): number of nearest neighbors in k-NN classifier\n", "        n_splits (int): number of cross validation folds\n", "    \n", "    Returns:\n", "        acc (float): cross validated accuracy\n", "    \"\"\"\n", "    # initialize a knn_classifier\n", "    knn_classifier = KNeighborsClassifier(n_neighbors=k)\n", "\n", "    # \"Stratified\" ensures (roughly) same number of species across folds\n", "    # otherwise we could get funny results with all `Adelie` penguins in one fold...\n", "    kfold = StratifiedKFold(n_splits=n_splits)\n", "\n", "    # initialize an empty array same size & datatype as y_true\n", "    y_pred = np.empty_like(y_true)\n", "    for train_idx, test_idx in kfold.split(x, y_true):\n", "        # split test / training data\n", "        x_train = x[train_idx, :] \n", "        x_test = x[test_idx, :]\n", "        y_true_train = y_true[train_idx]\n", "\n", "        # train on training data\n", "        knn_classifier.fit(x_train, y_true_train)\n", "\n", "        # predict on the testing data\n", "        y_pred[test_idx] = knn_classifier.predict(x_test)\n", "    \n", "    return accuracy_score(y_true, y_pred)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "n_splits = 10\n", "k = 5\n", "x_feat_list = ['bill_depth_mm', 'bill_length_mm']\n", "y_feat = 'species'\n", "\n", "# extract data into matrix\n", "x = df_penguin.loc[:, x_feat_list].values\n", "y_true = df_penguin.loc[:, y_feat].values\n", "\n", "# compute cross validated accuracy of each k (k = 1 to 50, inclusive)\n", "k_all = np.arange(1, 51)\n", "acc = np.empty(k_all.shape, dtype=float)\n", "for idx, k in enumerate(k_all):\n", "    acc[idx] = get_cv_acc_knn(x, y_true, k, n_splits=n_splits)" ] }, { "cell_type": 
"code", "execution_count": 20, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(k_all, acc)\n", "plt.xlabel('k in k-NN')\n", "plt.ylabel('cross validated accuracy')\n", "plt.title('penguin classification optimization')\n", "plt.gcf().set_size_inches(10, 5)\n", "plt.savefig('best_k_penguin.png')\n", "\n", "# hint: occam's razor" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Measuring Binary Classification Performance\n", "- Accuracy\n", "- Sensitivity\n", "- Specificity" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## \"This Classifier is 90% accurate\"\n", "\n", "... is that a good thing?\n", "\n", "Lets take a look at a few 90% accurate classifications and see for ourselves:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " ## This classifier is 90% accurate: example 0\n", " \"title\"\n", " \n", " ... isn't there more than one way this classification can be 90% accurate?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " ## This classifier is 90% accurate: example 1\n", " \"title\"\n", " \n", " Assumes:\n", " - \"fair\" coin flip\n", " - equal accuracy in identifying heads / tails" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " ## This classifier is 90% accurate: example 2\n", " \"title\"\n", " \n", " Assumes:\n", " - \"fair\" coin flip\n", " - ~equal accuracy in identifying heads / tails~\n", "\n", "Problems with Accuracy:\n", "1. Doesn't describe how accuracy varies with each particular target \n", " - (e.g. 
tails more accurately predicted than heads)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " ## This classifier is 90% accurate: example 3\n", " \n", "Assumes:\n", " - ~\"fair\" coin flip~\n", " - equal accuracy in identifying heads / tails\n", " \n", "Problems with accuracy:\n", "1. Doesn't describe how accuracy varies with each particular target \n", " - (e.g. tails more accurately predicted than heads)\n", "1. Doesn't describe differences in distribution of our target variable\n", " - (e.g. heads occurs more often than tails does) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " ## This classifier is 90% accurate: example 4\n", " \n", " Regardless of how we quantify accuracy, doesn't it implicitly say something about the relative costs of each of the errors below?\n", " - given one has a disease, predicting they are healthy\n", " - given one is healthy, predicting they have a disease\n", " \n", "Problems with accuracy:\n", "1. Doesn't describe how accuracy varies with each particular target \n", " - (e.g. tails more accurately predicted than heads)\n", "1. Doesn't describe differences in distribution of our target variable\n", " - (e.g. heads occurs more often than tails does) \n", "1. Doesn't characterize the relative cost of each particular error\n", " - (e.g. missing a disease detection can harm someone's health, falsely predicting they have a disease may only inconvenience them)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " ## This classifier is 90% accurate: example 5\n", " \n", "Problems with accuracy:\n", "1. Doesn't describe how accuracy varies with each particular target \n", " - (e.g. tails more accurately predicted than heads)\n", "1. Doesn't describe differences in distribution of our target variable\n", " - (e.g. 
heads occurs more often than tails does) \n", "1. Doesn't characterize the relative cost of each particular error\n", " - (e.g. missing a disease detection can harm someone's health, falsely predicting they have a disease may only inconvenience them)\n", "1. Doesn't characterize the difficulty of the problem itself\n", " - (e.g. predicting coin flips is easier than predicting lotto numbers)\n", " - further reading: see use of 'entropy' in [my favorite book on info theory](http://www.inference.org.uk/mackay/itila/book.html) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What about our confusion matrices?\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Problems with accuracy:\n", "1. Doesn't describe how accuracy varies with each particular target \n", " - (e.g. tails more accurately predicted than heads)\n", "1. Doesn't describe differences in distribution of our target variable\n", " - (e.g. heads occurs more often than tails does) \n", "1. Doesn't characterize the relative cost of each particular error\n", " - (e.g. missing a disease detection can harm someone's health, falsely predicting they have a disease may only inconvenience them)\n", "1. Doesn't characterize the difficulty of the problem itself\n", " - (e.g. 
predicting coin flips is easier than predicting lotto numbers)\n", " - we won't say much further in DS2500 on this issue, see [a book on info theory](http://www.inference.org.uk/mackay/itila/book.html) \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "**Confusion Matrices, being a whole array of numbers, are not easily compared or ranked.**\n", "\n", "(We'll focus on scalar summary stats of performance today)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Warning: Gnarly Naming Conventions Ahead:\n", "\n", "https://en.wikipedia.org/wiki/Sensitivity_and_specificity\n", "\n", "Everyone uses these ideas and each field seems to have its own terms of interest.\n", "\n", "I'm sorry, it's [a tough problem to fix](https://xkcd.com/927/) :)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Binary classification: Names\n", "\n", "Let's examine the binary classification problem, where we're trying to determine if a sample is of class 0 or class 1.\n", "\n", "| | Predict: Class 0 | Predict: Class 1 |\n", "|----------------|-------------------|------------------|\n", "| Truth: Class 0 | True Negative (TN) | False Positive (FP) |\n", "| Truth: Class 1 | False Negative (FN) | True Positive (TP) |\n", "\n", "To remember:\n", "- True / False - True if correct, False otherwise\n", "- Positive / Negative - Positive if estimated as class 1, Negative otherwise\n", "\n", "Error Types:\n", "\n", "* **False Positive (False alarm / Type I error)** \n", " - sample belongs to class 0 and it's predicted as class 1\n", "* **False Negative (Missed detection / Type II error)** \n", " - sample belongs to class 1 and it's predicted as class 0" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Binary classification: Illness detection example\n", "\n", "Example: Illness 
Detection\n", "- Class 0: Healthy\n", "- Class 1: Illness\n", "\n", "| | Predict: Healthy (0) | Predict: Illness (1) |\n", "|--------------------|-------------------------------|---------------------------|\n", "| Truth: Healthy (0) | TN: Healthy predicted healthy | FP: Healthy predicted ill |\n", "| Truth: Illness (1) | FN: Ill predicted healthy | TP: Ill predicted ill |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How to describe binary classifier performance\n", "\n", "- Accuracy\n", "    - percentage of samples for which the prediction is correct\n", "    - illness detection example:\n", "        - If 100 people take this test, how often will our test be correct?\n", "$$ \\rm{Accuracy} = \\frac{\\rm{TP} + \\rm{TN}}{\\rm{TP} + \\rm{TN} + \\rm{FP} + \\rm{FN}} $$\n", "\n", "- Sensitivity (Recall)\n", "    - percentage of class 1 samples which are predicted as class 1\n", "    - illness detection example:\n", "        - If 100 ill people take this test, how many will we detect as ill?\n", "$$ \\rm{Sensitivity} = \\frac{\\rm{TP}}{\\rm{TP} + \\rm{FN}} $$\n", "\n", "- Specificity\n", "    - percentage of class 0 samples which are predicted as class 0\n", "    - illness detection example:\n", "        - If 100 healthy people take this test, how many will we correctly identify as healthy?\n", "\n", "$$ \\rm{Specificity} = \\frac{\\rm{TN}}{\\rm{TN} + \\rm{FP}} $$" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "def get_acc_sens_spec(y_true, y_pred, verbose=True):\n", "    \"\"\" computes accuracy, sensitivity & specificity (assumes binary inputs)\n", "\n", "    Args:\n", "        y_true (np.array): binary ground truth per trial\n", "        y_pred (np.array): binary prediction per trial\n", "\n", "    Returns:\n", "        acc (float): accuracy\n", "        sens (float): sensitivity\n", "        spec (float): specificity\n", "    \"\"\"\n", "    # line below stolen from sklearn confusion_matrix documentation\n", "    tn, fp, fn, tp = confusion_matrix(y_true.astype(bool),\n", "                                      y_pred.astype(bool),\n", "                                      labels=(0, 1)).ravel()\n", "\n", "    # compute sensitivity (nan if there are no class 1 samples)\n", "    if tp + fn:\n", "        sens = tp / (tp + fn)\n", "    else:\n", "        sens = np.nan\n", "\n", "    # compute specificity (nan if there are no class 0 samples)\n", "    if tn + fp:\n", "        spec = tn / (tn + fp)\n", "    else:\n", "        spec = np.nan\n", "\n", "    # compute acc\n", "    acc = (tp + tn) / (tn + fp + fn + tp)\n", "\n", "    return acc, sens, spec" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Unequal sample size happens (often):\n", "\n", "experimental design cost / danger\n", "- how does space travel impact plant growth?\n", "- how is the inside of this tornado different than a milder wind flow pattern?\n", "- are shark teeth sharper in a live shark's mouth?\n", "\n", "language modelling\n", "- given access to one's email (a relatively small amount of text compared to typical LM text libraries) how can we tune a generic language model to a particular user (increase prob of words they personally use often)\n", "\n", "many detection targets, by virtue of the fact that we bother building an AI system to detect them, are rare:\n", "- computer hacking\n", "- astronomy events (suns exploding etc)\n", "- accidents / crime\n", "- rare illness" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# 'secret' slide: generate icbm data from ground truth\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# total samples\n", "n = int(1e6)\n", "\n", "# prior prob of icbm event\n", "prior = .01\n", "\n", "# (false alarm rate, detection rate) per alarm; detection rate = sensitivity\n", "fa_detect = [(.1, .99),\n", "             (.5, .95),\n", "             (.8, 1),\n", "             (.07, .95)]\n", "\n", "# sample n icbm events\n", "rng = np.random.default_rng(seed=0)\n", "icbm = rng.random(n) < prior\n", "\n", "# init dataframe\n", "df_icbm = pd.DataFrame({'icbm': icbm})\n", "\n", "for test_idx, (fa, 
detect) in enumerate(fa_detect):\n", "    pred = np.empty_like(icbm)\n", "\n", "    # get predictions (depend on icbm state); draw from the seeded rng so the data is reproducible\n", "    pred[icbm] = rng.random(icbm.sum()) < detect\n", "    pred[~icbm] = rng.random((~icbm).sum()) < fa\n", "\n", "    # store predictions\n", "    df_icbm[f'alarm{test_idx}'] = pred\n", "\n", "df_icbm.to_csv('icbm.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Intercontinental Ballistic Missile (ICBM)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "  <thead>\n",
 "    <tr><th></th><th>icbm</th><th>alarm0</th><th>alarm1</th><th>alarm2</th><th>alarm3</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td></tr>\n",
 "    <tr><th>1</th><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td></tr>\n",
 "    <tr><th>2</th><td>False</td><td>False</td><td>True</td><td>True</td><td>True</td></tr>\n",
 "    <tr><th>3</th><td>False</td><td>False</td><td>True</td><td>True</td><td>False</td></tr>\n",
 "    <tr><th>4</th><td>False</td><td>False</td><td>True</td><td>False</td><td>False</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "
" ], "text/plain": [ " icbm alarm0 alarm1 alarm2 alarm3\n", "0 False False False True False\n", "1 False True False True False\n", "2 False False True True True\n", "3 False False True True False\n", "4 False False True False False" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_icbm = pd.read_csv('icbm.csv')\n", "df_icbm.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "  <thead>\n",
 "    <tr><th></th><th>accuracy</th><th>sensitivity</th><th>specificity</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>alarm0</th><td>0.901043</td><td>0.987695</td><td>0.900168</td></tr>\n",
 "    <tr><th>alarm1</th><td>0.504013</td><td>0.947679</td><td>0.499533</td></tr>\n",
 "    <tr><th>alarm2</th><td>0.208687</td><td>1.000000</td><td>0.200697</td></tr>\n",
 "    <tr><th>alarm3</th><td>0.930356</td><td>0.951281</td><td>0.930145</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "
" ], "text/plain": [ " accuracy sensitivity specificity\n", "alarm0 0.901043 0.987695 0.900168\n", "alarm1 0.504013 0.947679 0.499533\n", "alarm2 0.208687 1.000000 0.200697\n", "alarm3 0.930356 0.951281 0.930145" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_performance = pd.DataFrame()\n", "for idx in range(4):\n", " alarm = f'alarm{idx}'\n", " \n", " # get truth / predict for alarm\n", " truth = df_icbm.loc[:, 'icbm']\n", " pred = df_icbm.loc[:, alarm]\n", " \n", " # build dataframe of accuracy, sensitivity and specificity\n", " acc, sens, spec = get_acc_sens_spec(y_true=truth, y_pred=pred)\n", " df_performance.loc[alarm, 'accuracy'] = acc\n", " df_performance.loc[alarm, 'sensitivity'] = sens\n", " df_performance.loc[alarm, 'specificity'] = spec\n", " \n", "df_performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In Class Exercise 3\n", "\n", "Using the values above (and maybe other operations on the dataframe too) select which of the four alarm systems is most appropriate to detect ICBMs. Provide an explanation which is easily understood by a non-technical reader.\n", "\n", "Is there any other information you'd need to make this decision?" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }