{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DS3000 Final Project: Board Game Reccomendation\n",
"## Team -1 (example)\n",
"\n",
"- Piotr Sapiezynski (p.sapiezynski@northeastern.edu)\n",
"- Matt Higger (m.higger@ccs.neu.edu)\n",
"\n",
"# Executive Summary\n",
"We build a board game reccomendation system by collecting a data from [boardgamegeek.com's list of boardgames](https://boardgamegeek.com/browse/boardgame) and collecting [13 users preferences](#user_pref) about which games they do and don't enjoy. Our reccomender works by identifying the unrated game most similar to the user's top rated games. To validate our method, we estimate how closely the predicted user preferences match the observed user preferences under cross validation. The predicted user ratings do a [poor job](#validation) of matching actual user preferences. We [suggest](#discussion) that the model struggles because it fails to find a meaningful way of measuring whether two games are similar or not.\n",
"\n",
"# Ethical Considerations\n",
"Like any tool which reccomends products on might buy, this tool may be subject to bias from board game companies who wish to drive consumers to their products. We suggest that any product derived from this work be open-source to allow for people to easily audit its use for commercial bias. \n",
"\n",
"# Introduction\n",
"Finding the right board game to play is difficult. The time and money required to play test a game is considerable and media which describes gameplay can fail to capture the experience accurately. As a result, many players learn about new games by word of mouth. This situation leaves many great games \"undiscovered\" and hinders player enjoyment of gaming by only playing popular games. **This project aims to reccomend new board games to a player who submits their preferences on other games**.\n",
"\n",
"\n",
"# Data Description\n",
"\n",
"## Games\n",
"(Full details of game data can be found in `ex_game_clean.ipynb`, a summary of the relevant details is given here).\n",
"\n",
"We scrape a [list of boardgames](https://boardgamegeek.com/browse/boardgame) ranked by popularity from BoardGameGeek. \n",
"\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" description | \n",
" title | \n",
"
\n",
" \n",
" game_id | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 174430 | \n",
" Vanquish monsters with strategic cardplay. Ful... | \n",
" Gloomhaven | \n",
"
\n",
" \n",
" 161936 | \n",
" Mutating diseases are spreading around the wor... | \n",
" Pandemic Legacy: Season 1 | \n",
"
\n",
" \n",
" 224517 | \n",
" Build networks, grow industries, and navigate ... | \n",
" Brass: Birmingham | \n",
"
\n",
" \n",
" 167791 | \n",
" Compete with rival CEOs to make Mars habitable... | \n",
" Terraforming Mars | \n",
"
\n",
" \n",
" 233078 | \n",
" Build an intergalactic empire through trade, r... | \n",
" Twilight Imperium: Fourth Edition | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" description \\\n",
"game_id \n",
"174430 Vanquish monsters with strategic cardplay. Ful... \n",
"161936 Mutating diseases are spreading around the wor... \n",
"224517 Build networks, grow industries, and navigate ... \n",
"167791 Compete with rival CEOs to make Mars habitable... \n",
"233078 Build an intergalactic empire through trade, r... \n",
"\n",
" title \n",
"game_id \n",
"174430 Gloomhaven \n",
"161936 Pandemic Legacy: Season 1 \n",
"224517 Brass: Birmingham \n",
"167791 Terraforming Mars \n",
"233078 Twilight Imperium: Fourth Edition "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df_game = pd.read_csv('game_final.csv', index_col='game_id')\n",
"df_game.loc[:, ['description', 'title']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In particular, we collect the category tags associated with each individual game:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['cat: Adventure',\n",
" 'cat: Exploration',\n",
" 'cat: Fantasy',\n",
" 'cat: Fighting',\n",
" 'cat: Miniatures']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def is_feat(col, feat_prefix=('cat: ',)):\n",
" for prefix in feat_prefix:\n",
" if col.startswith(prefix):\n",
" return True\n",
" return False\n",
"\n",
"def strip_feat(col, feat_prefix=('cat: ',)): \n",
" for prefix in feat_prefix:\n",
" if col.startswith(prefix):\n",
" return col[len(prefix):]\n",
" raise Error('input column is not a feature') \n",
" \n",
"\n",
"# build x feature list (any category a game belongs to)\n",
"x_feat_list = list()\n",
"for col in df_game.columns:\n",
" if is_feat(col):\n",
" x_feat_list.append(col)\n",
" \n",
"x_feat_list[:5]"
]
},
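{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example (a small usage sketch), `strip_feat` recovers the bare category name from a prefixed column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# strip the 'cat: ' prefix from a feature column name\n",
"strip_feat('cat: Adventure')"
]
},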
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cat: Adventure | \n",
" cat: Exploration | \n",
" cat: Fantasy | \n",
" cat: Fighting | \n",
" cat: Miniatures | \n",
"
\n",
" \n",
" game_id | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 174430 | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 161936 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 224517 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 167791 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 233078 | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cat: Adventure cat: Exploration cat: Fantasy cat: Fighting \\\n",
"game_id \n",
"174430 True True True True \n",
"161936 False False False False \n",
"224517 False False False False \n",
"167791 False False False False \n",
"233078 False True False False \n",
"\n",
" cat: Miniatures \n",
"game_id \n",
"174430 True \n",
"161936 False \n",
"224517 False \n",
"167791 False \n",
"233078 False "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_game.loc[:, x_feat_list[:5]].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## User Preferences\n",
"\n",
"\n",
"\n",
"User preferences were collected by soliciting student responses via a google form. Students of the spring 2020 DS3000 class were solicited:\n",
"\n",
"
\n",
"\n",
"Each user is represented by an integer `alias`. Each column represents a game and the values are the responses to the question above. Missing values indicate that a user did not give their preference on a particular game."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 174430 | \n",
" 2398 | \n",
" 171 | \n",
" 178900 | \n",
" 188834 | \n",
" 105134 | \n",
" 2453 | \n",
" 12962 | \n",
" 2181 | \n",
" 278 | \n",
" ... | \n",
" 195162 | \n",
" 92415 | \n",
" 275467 | \n",
" 128882 | \n",
" 204305 | \n",
" 169786 | \n",
" 120677 | \n",
" 31627 | \n",
" 174785 | \n",
" 253284 | \n",
"
\n",
" \n",
" alias | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 7.0 | \n",
" 6.0 | \n",
" 5.0 | \n",
" 4.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 2.0 | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 6 | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
" 4.0 | \n",
" 5.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 4.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" NaN | \n",
" NaN | \n",
" 3.0 | \n",
" 7.0 | \n",
" NaN | \n",
" NaN | \n",
" 5.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 15 | \n",
" NaN | \n",
" NaN | \n",
" 7.0 | \n",
" NaN | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 17 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 18 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 19 | \n",
" NaN | \n",
" NaN | \n",
" 4.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 6.0 | \n",
" NaN | \n",
" 4.0 | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 20 | \n",
" NaN | \n",
" 5.0 | \n",
" 5.0 | \n",
" 7.0 | \n",
" 6.0 | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" 4.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 21 | \n",
" NaN | \n",
" NaN | \n",
" 7.0 | \n",
" 4.0 | \n",
" 5.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 22 | \n",
" NaN | \n",
" 3.0 | \n",
" 3.0 | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
" 2.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" 5.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 23 | \n",
" NaN | \n",
" NaN | \n",
" 7.0 | \n",
" 7.0 | \n",
" 7.0 | \n",
" NaN | \n",
" 4.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" 2.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 24 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 3.0 | \n",
" 4.0 | \n",
" 6.0 | \n",
" 5.0 | \n",
" 6.0 | \n",
"
\n",
" \n",
"
\n",
"
13 rows × 78 columns
\n",
"
"
],
"text/plain": [
" 174430 2398 171 178900 188834 105134 2453 12962 2181 278 ... \\\n",
"alias ... \n",
"1 1.0 7.0 6.0 5.0 4.0 4.0 5.0 1.0 5.0 2.0 ... \n",
"6 6.0 NaN NaN 4.0 5.0 NaN NaN NaN NaN NaN ... \n",
"7 NaN NaN NaN 4.0 NaN NaN NaN NaN NaN NaN ... \n",
"9 NaN NaN 3.0 7.0 NaN NaN 5.0 NaN NaN NaN ... \n",
"15 NaN NaN 7.0 NaN 6.0 NaN NaN NaN NaN NaN ... \n",
"17 NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN ... \n",
"18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n",
"19 NaN NaN 4.0 NaN NaN NaN 6.0 NaN 4.0 NaN ... \n",
"20 NaN 5.0 5.0 7.0 6.0 6.0 NaN NaN NaN NaN ... \n",
"21 NaN NaN 7.0 4.0 5.0 NaN NaN NaN NaN NaN ... \n",
"22 NaN 3.0 3.0 6.0 NaN NaN 2.0 NaN NaN NaN ... \n",
"23 NaN NaN 7.0 7.0 7.0 NaN 4.0 NaN NaN NaN ... \n",
"24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n",
"\n",
" 195162 92415 275467 128882 204305 169786 120677 31627 174785 \\\n",
"alias \n",
"1 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"6 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"7 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"9 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"15 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"17 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"18 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"19 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"20 4.0 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"21 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"22 NaN 5.0 NaN NaN NaN NaN NaN NaN NaN \n",
"23 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN \n",
"24 NaN NaN NaN NaN NaN 3.0 4.0 6.0 5.0 \n",
"\n",
" 253284 \n",
"alias \n",
"1 NaN \n",
"6 NaN \n",
"7 NaN \n",
"9 NaN \n",
"15 NaN \n",
"17 NaN \n",
"18 NaN \n",
"19 NaN \n",
"20 NaN \n",
"21 NaN \n",
"22 NaN \n",
"23 NaN \n",
"24 6.0 \n",
"\n",
"[13 rows x 78 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_pref = pd.read_csv('pref_final.csv', index_col='alias')\n",
"df_pref"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Users were asked to rank at least 9 games, though we include all users with at least 8 to include a few more users:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"alias\n",
"1 11\n",
"6 15\n",
"7 21\n",
"9 8\n",
"15 9\n",
"17 9\n",
"18 10\n",
"19 9\n",
"20 10\n",
"21 10\n",
"22 12\n",
"23 8\n",
"24 8\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# games ranked per user\n",
"(df_pref >= 0).sum(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Method\n",
"## 1 - NN Regressor\n",
"To reccomend games to users we use a 1-Nearest Neighbor Regressor. In essense, every game is given an estimated user preference as the preference score of the \"most similar\" game among all the games the user has rated.\n",
"\n",
"This approach requires that we are able to identify the \"most similar\" game to any other. To do so we build a distance metric which measures game similarity. The distance between similar games should be small while the distance between different games should be large. We choose the metric as the traditional squared distance:\n",
"\n",
"$$ d_{i, j} = || y_1 - y_0 ||_2^2 = \\sum_i (y_{1, i} - y_{0, i})^2 $$\n",
"\n",
"where vectors $x_i$ represent a board games tags:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# for example, for our first two games:\n",
"y = df_game.loc[:, x_feat_list].values.astype(int)\n",
"y0 = y[0, :]\n",
"y1 = y[1, :]\n",
"y0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vector `x0` above indicates that the first game has the first 5 tags (i.e. Adventure, Exploration, Fantasy, Fighting, Miniatures) but none of the others."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7.000000000000001"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"# compute distance between first and second games\n",
"d01 = np.linalg.norm(y1 - y0) ** 2\n",
"d01"
]
},
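{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a sketch reusing `y0`, `y1`, and `d01` from above): for binary tag vectors, this squared distance simply counts differing tags, an equivalence we note again below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# squared distance between binary tag vectors == number of differing tags\n",
"assert np.isclose(d01, np.sum(y0 != y1))"
]
},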
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the distance is equivilent to a count of how many category tags in `x_feat_list` which are different between two games.\n",
"\n",
"## Principle Component Analysis\n",
"There are two problems with the distance metric above.\n",
"1. **The scale of each feature is different:**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"cat: Vietnam War 0.001008\n",
"cat: Expansion for Base-game 0.001008\n",
"cat: Trivia 0.002014\n",
"cat: American Revolutionary War 0.002014\n",
"cat: World War I 0.002014\n",
" ... \n",
"cat: Science Fiction 0.110995\n",
"cat: Fighting 0.131274\n",
"cat: Economic 0.159312\n",
"cat: Fantasy 0.164704\n",
"cat: Card Game 0.183588\n",
"Length: 79, dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_game.loc[:, x_feat_list].var().sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Left uncorrected, all the difference among `Card Game` would dominate the differences scores and ignore features with lower variances (i.e. `Vietnam War`, `Expansion for Base-game`). \n",
"\n",
"2. **Even if each features were given identical variance, some features effectively \"double count\" the importance of a feature by being correlated**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cat: Adventure | \n",
" cat: Exploration | \n",
" cat: Fantasy | \n",
" cat: Fighting | \n",
" cat: Miniatures | \n",
"
\n",
" \n",
" \n",
" \n",
" cat: Adventure | \n",
" 1.000000 | \n",
" 0.426874 | \n",
" 0.353859 | \n",
" 0.320280 | \n",
" 0.293766 | \n",
"
\n",
" \n",
" cat: Exploration | \n",
" 0.426874 | \n",
" 1.000000 | \n",
" 0.177884 | \n",
" 0.154813 | \n",
" 0.155875 | \n",
"
\n",
" \n",
" cat: Fantasy | \n",
" 0.353859 | \n",
" 0.177884 | \n",
" 1.000000 | \n",
" 0.391303 | \n",
" 0.242340 | \n",
"
\n",
" \n",
" cat: Fighting | \n",
" 0.320280 | \n",
" 0.154813 | \n",
" 0.391303 | \n",
" 1.000000 | \n",
" 0.434097 | \n",
"
\n",
" \n",
" cat: Miniatures | \n",
" 0.293766 | \n",
" 0.155875 | \n",
" 0.242340 | \n",
" 0.434097 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cat: Adventure cat: Exploration cat: Fantasy \\\n",
"cat: Adventure 1.000000 0.426874 0.353859 \n",
"cat: Exploration 0.426874 1.000000 0.177884 \n",
"cat: Fantasy 0.353859 0.177884 1.000000 \n",
"cat: Fighting 0.320280 0.154813 0.391303 \n",
"cat: Miniatures 0.293766 0.155875 0.242340 \n",
"\n",
" cat: Fighting cat: Miniatures \n",
"cat: Adventure 0.320280 0.293766 \n",
"cat: Exploration 0.154813 0.155875 \n",
"cat: Fantasy 0.391303 0.242340 \n",
"cat: Fighting 1.000000 0.434097 \n",
"cat: Miniatures 0.434097 1.000000 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_game.loc[:, x_feat_list[:5]].corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that these first 5 tags are all positively correlated with each other (when one tag occurs any of the others is more likely to occur). In some sense, we can consider that each of these tags are redundant measurements of the same intrinsic game feature. We are effectively over-counting this feature by including it with each feature.\n",
"\n",
"To resolve both of these issues, we use a pre-processing step before applying our 1-NN regressor: Principle Component Analysis (PCA). PCA will:\n",
"- ensure output features each have equal variance\n",
"- ensure output features are all uncorrelated with each other"
]
},
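{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal sketch (reusing `df_game` and `x_feat_list` from above) of what whitening buys us: the whitened PCA features have roughly unit variance and are mutually uncorrelated:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"# fit a whitened 2-component PCA on the category tags (sketch)\n",
"pca_demo = PCA(n_components=2, whiten=True)\n",
"x_demo = pca_demo.fit_transform(df_game.loc[:, x_feat_list].values)\n",
"\n",
"# each output feature has ~unit variance ...\n",
"print(x_demo.var(axis=0, ddof=1))\n",
"# ... and the output features are uncorrelated (off-diagonal ~0)\n",
"print(np.corrcoef(x_demo, rowvar=False))"
]
},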
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Results\n",
"\n",
"## Estimation"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['cat: Adventure',\n",
" 'cat: Exploration',\n",
" 'cat: Fantasy',\n",
" 'cat: Fighting',\n",
" 'cat: Miniatures']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# reload data\n",
"df_game = pd.read_csv('game_final.csv', index_col='game_id')\n",
"df_pref = pd.read_csv('pref_final.csv', index_col='alias')\n",
"\n",
"# ensure column names are integers\n",
"df_pref.rename(int, axis=1, inplace=True)\n",
"\n",
"# build x feature list (any category a game belongs to)\n",
"x_feat_list = list()\n",
"for col in df_game.columns:\n",
" if is_feat(col):\n",
" x_feat_list.append(col)\n",
"\n",
"x_feat_list[:5]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"from sklearn.metrics import r2_score\n",
"\n",
"def get_x_y(alias, df_pref, df_game, x_feat_list):\n",
" \"\"\" gets the input / output features of regressor for one user\n",
" \n",
" The input features are the game categories (binary) and\n",
" the output features are the user preferences\n",
" \n",
" Args:\n",
" alias (int): alias given to a user (index of df_pref)\n",
" df_pref (pd.DatFrame): user preferences\n",
" df_game (pd.DataFrame): game stats\n",
" \n",
" Returns:\n",
" x (np.array): (n_samples, n_feat) corresponds to the\n",
" categories every game does / doesn't belong to\n",
" y (np.array): (n_samples) user preferences of corresponding\n",
" samples\n",
" game_id_list (list): game ids with ratings\n",
" \"\"\"\n",
" \n",
" # get non null preferences for a given alias\n",
" s_pref_alias = df_pref.loc[alias, :]\n",
" s_pref_alias.dropna(inplace=True)\n",
" \n",
" # get list of game_id which user submitted preferences about\n",
" game_id_list = list(s_pref_alias.index)\n",
"\n",
" # extract x, y\n",
" x = df_game.loc[game_id_list, x_feat_list].values\n",
" y = s_pref_alias.values\n",
" \n",
" return x, y, game_id_list"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"\n",
"def cv_train(x, y_true):\n",
" \"\"\" leave one out cross validation regression of x, y\n",
" \n",
" Args:\n",
" x (np.array): (n_samples, n_feat) input features\n",
" y_true (np.array): (n_samples) output feature\n",
" \n",
" Returns:\n",
" regressor (LinearRegression): model which predicts y\n",
" from x\n",
" r2 (float): percentage of variance of y which is \n",
" explained by the model under cross validation\n",
" (r2=1 is strongest possible model, r2 = 0 is\n",
" a non-helpful model)\n",
" \"\"\"\n",
" # initialize kfold\n",
" n_samples = x.shape[0]\n",
" kfold = KFold(n_splits=n_samples)\n",
" \n",
" # initialize regressor\n",
" reg = KNeighborsRegressor(n_neighbors=1)\n",
" \n",
" y_pred = np.empty_like(y_true)\n",
" for train_idx, test_idx in kfold.split(x):\n",
" # split data\n",
" x_train = x[train_idx, :]\n",
" y_train = y_true[train_idx]\n",
" x_test = x[test_idx, :]\n",
" \n",
" # fit regressor\n",
" reg.fit(x_train, y_train)\n",
" \n",
" # predict\n",
" y_pred[test_idx] = reg.predict(x_test)\n",
" \n",
" # compute r2\n",
" r2 = r2_score(y_true=y_true, y_pred=y_pred)\n",
" \n",
" # fit model on entire dataset (best for predicting new samples)\n",
" reg.fit(x, y_true)\n",
" \n",
" return reg, r2\n",
" "
]
},
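{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a usage sketch, we can cross-validate the 1-NN regressor for a single user (here `alias=1`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# extract the games user alias=1 rated, then cross-validate the 1-NN regressor\n",
"x, y, game_id_list = get_x_y(alias=1, df_pref=df_pref, df_game=df_game, x_feat_list=x_feat_list)\n",
"reg, r2 = cv_train(x, y)\n",
"print(f'alias=1: {len(game_id_list)} rated games, cross-validated r2 = {r2:.3f}')"
]
},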
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def predict_score(alias, df_pref, df_game, x_feat_list):\n",
" \"\"\" predicts scores on all games\n",
" \n",
" Args:\n",
" alias (int): integer alias of user\n",
" df_pref (pd.DataFrame): user preferences\n",
" df_game (pd.DataFrame): games stats\n",
" x_feat_list (list): features used to define distance\n",
" between games\n",
" \n",
" Returns:\n",
" df_predicted_pref (pd.DataFrame): estimated user preferences\n",
" (includes preferences for all games, not just the ones\n",
" the user has rated)\n",
" reg (KNeighborsRegressor): regressor which predicts user preferences\n",
" r2 (float): cross validated r2 value\n",
" \"\"\"\n",
" \n",
" # extract relevant data\n",
" x, y, game_id_list = get_x_y(alias, df_pref, df_game, x_feat_list)\n",
" \n",
" # cross validate & train model\n",
" reg, r2 = cv_train(x, y)\n",
" \n",
" # predict scores of all games (not just ones with observed preferrences)\n",
" x_all = df_game.loc[:, x_feat_list].values\n",
" y_predict = reg.predict(x_all)\n",
"\n",
" # collect / sort preferences in dataframe\n",
" df_predicted_pref = pd.DataFrame({'title': df_game['title'],\n",
" 'pref': y_predict,\n",
" 'url': df_game['url']},\n",
" index=df_game.index)\n",
" \n",
" # record whether preferences were observed (user supplied) or not\n",
" df_predicted_pref.loc[:, 'observed'] = False\n",
" df_predicted_pref.loc[game_id_list, 'observed'] = True\n",
" \n",
" # store x_feat in df_predcticted_pref (redundant but helpful to know\n",
" # which were used across multiple runs)\n",
" for x_feat_idx, x_feat in enumerate(x_feat_list):\n",
" df_predicted_pref.loc[:, x_feat] = x_all[:, x_feat_idx] \n",
" \n",
" \n",
" # sort by estimated rating\n",
" df_predicted_pref.sort_values('pref', inplace=True, ascending=False)\n",
" \n",
" return df_predicted_pref, reg, r2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validation\n",
"\n",
"\n",
"\n",
"To validate our model, we compute the cross-validated $r^2$ value among all the games a user has given ratings for. \n",
"- If this value is close to 1, then we can effectively predict user preferences\n",
"- If this value is close to zero, then we are effectively guessing user preferences blindly\n",
"- If this value is negative, then we are doing worse than guessing user preferences blindly\n",
"\n",
"### Without applying PCA:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def validate_all(df_pref, df_game, x_feat_list):\n",
" \"\"\" computes cross validated r2 for each alias\n",
" \n",
" Args:\n",
" df_pref (pd.DataFrame): user preferences\n",
" df_game (pd.DataFrame): games stats\n",
" x_feat_list (list): features used to define distance\n",
" between games\n",
" \n",
" Returns:\n",
" df_validate (pd.DataFrame): index is alias, contains\n",
" column `cv_r2` as well as `num_pref`, the number of\n",
" preferences available for a given user\n",
" \"\"\"\n",
" df_validate = pd.DataFrame()\n",
" for alias in df_pref.index:\n",
" # predict scores\n",
" df_predicted_pref, reg, r2 = predict_score(alias, df_pref, df_game, x_feat_list)\n",
"\n",
" # collect validation stats in one dataframe\n",
" row = dict(alias=alias, cv_r2=r2, num_pref=df_predicted_pref['observed'].sum())\n",
" df_validate = df_validate.append(row, ignore_index=True)\n",
"\n",
" # prep and display df_validate\n",
" df_validate.set_index('alias', inplace=True)\n",
" df_validate.sort_values('cv_r2', inplace=True)\n",
" \n",
" return df_validate"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cv_r2 | \n",
" num_pref | \n",
"
\n",
" \n",
" alias | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 9.0 | \n",
" -3.296296 | \n",
" 8.0 | \n",
"
\n",
" \n",
" 23.0 | \n",
" -2.692308 | \n",
" 8.0 | \n",
"
\n",
" \n",
" 1.0 | \n",
" -1.360236 | \n",
" 11.0 | \n",
"
\n",
" \n",
" 18.0 | \n",
" -1.128514 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 21.0 | \n",
" -0.875000 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 17.0 | \n",
" -0.660428 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 7.0 | \n",
" -0.625000 | \n",
" 21.0 | \n",
"
\n",
" \n",
" 20.0 | \n",
" -0.562500 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 15.0 | \n",
" -0.528302 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 22.0 | \n",
" -0.336709 | \n",
" 12.0 | \n",
"
\n",
" \n",
" 19.0 | \n",
" -0.170000 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 6.0 | \n",
" -0.097561 | \n",
" 15.0 | \n",
"
\n",
" \n",
" 24.0 | \n",
" 0.125000 | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cv_r2 num_pref\n",
"alias \n",
"9.0 -3.296296 8.0\n",
"23.0 -2.692308 8.0\n",
"1.0 -1.360236 11.0\n",
"18.0 -1.128514 10.0\n",
"21.0 -0.875000 10.0\n",
"17.0 -0.660428 9.0\n",
"7.0 -0.625000 21.0\n",
"20.0 -0.562500 10.0\n",
"15.0 -0.528302 9.0\n",
"22.0 -0.336709 12.0\n",
"19.0 -0.170000 9.0\n",
"6.0 -0.097561 15.0\n",
"24.0 0.125000 8.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# validate model (without pca)\n",
"df_validate = validate_all(df_pref, df_game, x_feat_list)\n",
"df_validate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Only user `alias=24` achieved any improvement in preference estimation from our method.\n",
"\n",
"### Applying PCA:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"# extract old x values\n",
"x = df_game.loc[:, x_feat_list].values\n",
"\n",
"# transform to new x values\n",
"n_components = 2\n",
"pca = PCA(n_components=n_components, whiten=True)\n",
"x_new = pca.fit_transform(x)\n",
"\n",
"# add pca features back into dataframe\n",
"x_feat_list_new = [f'pca{idx}' for idx in range(n_components)]\n",
"for idx, feat in enumerate(x_feat_list_new):\n",
" df_game.loc[:, feat] = x_new[:, idx]"
]
},
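{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (a sketch using the fitted `pca` above), we can see how much of the total tag variance the two retained components capture:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# fraction of total tag variance captured by each retained component\n",
"print(pca.explained_variance_ratio_)"
]
},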
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# validate using only first n_pca features\n",
"df_validate_pca = validate_all(df_pref, df_game, x_feat_list_new)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cv_r2 | \n",
" num_pref | \n",
"
\n",
" \n",
" alias | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 17.0 | \n",
" -2.633690 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 9.0 | \n",
" -2.481481 | \n",
" 8.0 | \n",
"
\n",
" \n",
" 1.0 | \n",
" -1.988189 | \n",
" 11.0 | \n",
"
\n",
" \n",
" 23.0 | \n",
" -1.961538 | \n",
" 8.0 | \n",
"
\n",
" \n",
" 21.0 | \n",
" -1.569444 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 18.0 | \n",
" -1.369478 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 15.0 | \n",
" -0.910377 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 7.0 | \n",
" -0.825000 | \n",
" 21.0 | \n",
"
\n",
" \n",
" 20.0 | \n",
" -0.687500 | \n",
" 10.0 | \n",
"
\n",
" \n",
" 22.0 | \n",
" -0.518987 | \n",
" 12.0 | \n",
"
\n",
" \n",
" 6.0 | \n",
" -0.219512 | \n",
" 15.0 | \n",
"
\n",
" \n",
" 19.0 | \n",
" 0.010000 | \n",
" 9.0 | \n",
"
\n",
" \n",
" 24.0 | \n",
" 0.125000 | \n",
" 8.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cv_r2 num_pref\n",
"alias \n",
"17.0 -2.633690 9.0\n",
"9.0 -2.481481 8.0\n",
"1.0 -1.988189 11.0\n",
"23.0 -1.961538 8.0\n",
"21.0 -1.569444 10.0\n",
"18.0 -1.369478 10.0\n",
"15.0 -0.910377 9.0\n",
"7.0 -0.825000 21.0\n",
"20.0 -0.687500 10.0\n",
"22.0 -0.518987 12.0\n",
"6.0 -0.219512 15.0\n",
"19.0 0.010000 9.0\n",
"24.0 0.125000 8.0"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_validate_pca"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PCA does improve results, though we are still not able to predict user preferences better than chance on the average user."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"def get_text(game_row):\n",
" # gets a string, in plotly format, of all tags a game contains\n",
" title = game_row['title']\n",
" tags = '
'.join([strip_feat(col) for col, val in game_row.items() if val and is_feat(col)])\n",
" return '
'.join([f'title: {title}',\n",
" f'{tags}'])\n",
"df_game['hovertext'] = df_game.apply(get_text, axis=1)\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"import plotly.graph_objects as go\n",
"import plotly.express as px\n",
"from plotly.subplots import make_subplots\n",
"\n",
"hovertemplate = '%{text}'\n",
"\n",
"def print_plotly_scatter(alias, df_pref, df_game, x_feat_list, f_html=None, x_feat_idx_horz=0, x_feat_idx_vert=1):\n",
" \n",
" if f_html is None:\n",
" f_html = f'user{alias}.html'\n",
" \n",
" # compute predicted scores\n",
" df_predicted_pref, reg, r2 = predict_score(alias, df_pref, df_game, x_feat_list_new)\n",
"\n",
" x_feat0 = x_feat_list[x_feat_idx_horz]\n",
" x_feat1 = x_feat_list[x_feat_idx_vert]\n",
" \n",
" # build scatter\n",
" fig = make_subplots()\n",
" for observed in [False, True]:\n",
" # select only relevant rows\n",
" row_bool = df_predicted_pref['observed'] == observed\n",
" df = df_predicted_pref.loc[row_bool, :]\n",
" \n",
" s_text = df_game.loc[df.index, 'hovertext']\n",
"\n",
" if observed:\n",
" marker_dict = dict(size=12, line=dict(width=2, color='black'), colorscale='viridis')\n",
" name = 'user-given'\n",
" else:\n",
" marker_dict = dict(colorscale='viridis', colorbar=dict(thickness=20, title='preference'))\n",
" name = 'estimated'\n",
"\n",
" trace = go.Scatter(x=df[x_feat0],\n",
" y=df[x_feat1],\n",
" mode='markers', \n",
" marker=marker_dict, \n",
" marker_color=df['pref'],\n",
" hovertemplate=hovertemplate,\n",
" text=s_text,\n",
" name=name)\n",
"\n",
" fig.add_trace(trace)\n",
"\n",
" legend_dict = legend=dict(yanchor=\"top\", y=0.99, xanchor=\"left\", x=0.01)\n",
" fig.update_layout(title=f'user {alias} preferences',\n",
" xaxis_title=x_feat0,\n",
" yaxis_title=x_feat1,\n",
" legend=legend_dict)\n",
"\n",
" fig.write_html(f_html)\n",
" \n",
" return f_html"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import IFrame\n",
"\n",
"f_html = print_plotly_scatter(alias=7, df_pref=df_pref, df_game=df_game, x_feat_list=x_feat_list_new)\n",
"\n",
"# allows us to embed html in jupyter (helpful if error in creation of plot)\n",
"IFrame(src=f_html, width=900, height=600)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Discussion\n",
"\n",
"\n",
"\n",
"The project did not succeed in being able to predict a user's preference any better than chance for the average user. (Cross validated $r^2 < 0$ for all users). This can be due to three reasons:\n",
"1. Our user preference data was insufficient:\n",
" - With only 8 to 20 games per user, we may not have enough data to accuractely characterize a single user's preferences among all the unique board games\n",
" - The user rating scale is somewhat subjective and was often biased towards games users enjoyed. This makes intuitive sense as the majority of time one is interacting with a game they're interacting with a game they've selected because they enjoy it. \n",
" - **We'd suggest future work collect only a list of games that a user enjoys**\n",
"1. Our distance metric, which defines which games are similar or different, was insufficient:\n",
" - After much experimenting, we couldn't identify an `x_feat_list` which significantly improved the cross validated $r^2$ metric.\n",
" - **We'd suggest future work do more feature engineering to identify which aspects of a game make is \"similar\" or \"different\".**\n",
" \n",
" Alternatively, one could define a metric of game similarity based on the correlation of user rankings:\n",
" - users typically rate both games high / low\n",
" - games are similar\n",
" - users typically rate one game high and the other low:\n",
" - games are different\n",
" \n",
"1. (Most significantly) Our 1-NN classifier was insufficient because:\n",
" - it gave identical scores to many games. This is not helpful in identifying a single best game to reccomend to a user\n",
" - it never synthesizes all the user preferences into its estimate. Instead, it relies exclusively on only the nearest neighbor. \n",
" - **We'd suggest future work discard the 1-NN classifier in favor of something which synthesizes all of a user's preferences (Regression, Density Estimation)**\n",
" \n",
"Not all results were negative, while the distance between games was not sufficient to reccomend games, it did provide some intuitive meaning:\n",
"- games in the lower left corner above are typically economic / negotiation games\n",
"- games in the upper right corner above are typically strategy / fighting / minature games\n",
" \n",
"## Takeaway:\n",
"Taken together, we do not think this work should be used to reccomend board games."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}