{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data and Analysis Plan: Board Game Recommendation\n", "## Team -1 (example)\n", "\n", "- Piotr Sapiezynski (p.sapiezynski@northeastern.edu)\n", "- Matt Higger (m.higger@ccs.neu.edu)\n", "\n", "\n", " \n", "### Notes:\n", "This is the example DS3000 project. We'll switch from our official \"Data and Analysis Plan\" to pointing out features which are helpful as you write your own by changing the font color to green. \n", "\n", "**Please use the section template given here to ensure you complete the necessary sections**\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "(.5%) Expresses the central motivation of the project in one or two sentences. This may evolve a bit through the project.\n", "\n", "\n", "## Project Goal:\n", "This work will scrape lists of board games from [BoardGameGeek.com](https://boardgamegeek.com/) to recommend a new board game to users who input a game they already enjoy. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data \n", "\n", "(1%) Gives a summary of the data processing pipeline so a technical expert can easily follow along.\n", " \n", " \n", "This section lets you give an overview of your data processing pipeline. \n", " \n", "Your pipeline should end by saving one or more csvs of data which will be loaded by the analysis portion of the notebook.\n", "\n", "\n", "### Overview \n", "We will scrape a [list of all boardgames](https://boardgamegeek.com/browse/boardgame) ranked by popularity from BoardGameGeek. \n", "\n", "\n", "\n", "\n", "From this list, for each game, we can obtain:\n", "- title\n", "- year published\n", "- url to specific game page\n", "- description (e.g. 
\"Vanquish monsters ...\")\n", "\n", "Upon visiting an [individual game's webpage](https://boardgamegeek.com/boardgame/174430/gloomhaven) we can use the top titlebar:\n", "\n", "\n", "\n", "to observe:\n", "- complexity rating\n", "- playtime (mins a typical game lasts)\n", "- min/max number of players required\n", "- recommended age range\n", "\n", "Most importantly, we seek to collect the category and mechanism tags associated with a particular game:\n", "\n", "\n", "Each game has multiple category tags. Our expectation is that games with similar tags are enjoyed by similar players. This tag data will be essential to recommend a new game to a player who inputs another game they enjoy. To simplify analysis, each tag will have its own column in our output DataFrame. For example:\n", "\n", "| | card game | mystery | social | communication |\n", "|-------|-----------|---------|--------|---------------|\n", "| game0 | True | True | False | False |\n", "| game1 | False | True | True | True |\n", "\n", "This indicates that:\n", "- game0 has tags 'card game' and 'mystery'\n", "- game1 has tags 'mystery', 'social' and 'communication'\n", "\n", "### Pipeline Overview\n", "\n", "We will accomplish this task with three functions:\n", "- `get_url()`\n", "    - returns html string of a given url\n", "- `clean_top_games()`\n", "    - builds dataframe of [a single page of top games](https://boardgamegeek.com/browse/boardgame/page/2) from html string \n", "- `clean_game_meta()`\n", "    - collects game meta data (tags, complexity, playtime, min/max players, age range ...) 
from an [individual game's webpage](https://boardgamegeek.com/boardgame/174430/gloomhaven)\n", " \n", "As well as two short scripts:\n", "- **Scrape list of games:** use `get_url()` and `clean_top_games()` in a loop to collect n pages of top games (100 * n games), populating a DataFrame `df_game`\n", "- **Get meta data per game:** loop through each row of `df_game`, query and process the individual game's webpage via `get_url()` and `clean_game_meta()` using the previously collected url and append the remaining features to `df_game`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pipeline\n", "\n", " \n", "(4%) Obtains, cleans, and merges all data sources involved in the project.\n", " \n", "Documentation counts!\n", " \n", "The majority of this section is code, but do make sure that one can do a quick sanity check that your pipeline worked by printing a few examples (e.g. call `DataFrame.head()` a few times).\n", "\n", "\n", "#### Scrape list of games" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import requests\n", "from bs4 import BeautifulSoup\n", "import json\n", "import time\n", "from tqdm import tqdm\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def get_url(url):\n", " \"\"\" gets the html of a given url\n", " \n", " Args:\n", " url (str): target url\n", " \n", " Returns:\n", " str_html (str): html of given url\n", " \"\"\"\n", " return requests.get(url).text" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def clean_top_games(str_html):\n", " \"\"\" gets DataFrame of games from page of \"All Boardgames\" list\n", " \n", " example page from \"All Boardgames\":\n", " https://boardgamegeek.com/browse/boardgame/page/2\n", " \n", " Args:\n", " str_html (str): string corresponding to html of page\n", " \n", " 
Returns:\n", " df_game (pd.DataFrame): dataframe where each row\n", " is one board game\n", " \"\"\" \n", " # build soup\n", " soup = BeautifulSoup(str_html)\n", "\n", " # collect one dict per game row (discard first row as it's the title row)\n", " # (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list)\n", " game_list = []\n", " for row_game in soup.find_all('tr')[1:]:\n", "     game_dict = dict()\n", "     \n", "     # each td tag corresponds to a column. we unpack by column\n", "     rank, image, title_year, rate_geek, rate_avg, num_vote, shop = row_game.find_all('td')\n", "     \n", "     # get game id and url from link in image\n", "     game_url = 'https://boardgamegeek.com' + image.a.attrs['href']\n", "     game_dict['url'] = game_url\n", "     game_dict['game_id'] = game_url.split('/')[-2]\n", "     \n", "     # get title\n", "     game_dict['title'] = title_year.a.text.strip()\n", "     \n", "     # get year\n", "     str_year = title_year.span.text\n", "     str_year = str_year.replace('(', '').replace(')', '')\n", "     game_dict['year'] = int(str_year)\n", "     \n", "     # try to get description (set empty if the row has no p tag)\n", "     try:\n", "         game_dict['description'] = title_year.p.text.strip()\n", "     except AttributeError:\n", "         game_dict['description'] = ''\n", "     \n", "     # add game to list of all games\n", "     game_list.append(game_dict)\n", " \n", " # build dataframe and set game_id as index\n", " df_game = pd.DataFrame(game_list)\n", " df_game.set_index('game_id', inplace=True)\n", " \n", " return df_game" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "scraping page: 100%|██████████| 10/10 [00:04<00:00, 2.14it/s]\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " description \\\n", "game_id \n", "174430 Vanquish monsters with strategic cardplay. Ful... \n", "161936 Mutating diseases are spreading around the wor... \n", "224517 Build networks, grow industries, and navigate ... \n", "167791 Compete with rival CEOs to make Mars habitable... \n", "233078 Build an intergalactic empire through trade, r... \n", "\n", " title \\\n", "game_id \n", "174430 Gloomhaven \n", "161936 Pandemic Legacy: Season 1 \n", "224517 Brass: Birmingham \n", "167791 Terraforming Mars \n", "233078 Twilight Imperium: Fourth Edition \n", "\n", " url year \n", "game_id \n", "174430 https://boardgamegeek.com/boardgame/174430/glo... 2017.0 \n", "161936 https://boardgamegeek.com/boardgame/161936/pan... 2015.0 \n", "224517 https://boardgamegeek.com/boardgame/224517/bra... 2018.0 \n", "167791 https://boardgamegeek.com/boardgame/167791/ter... 2016.0 \n", "233078 https://boardgamegeek.com/boardgame/233078/twi... 2017.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_page = 10\n", "\n", "# scrape the top 100 * num_page games and save to csv\n", "df_game_list = []\n", "for page_idx in tqdm(range(1, num_page + 1), desc='scraping page'):\n", " \n", " # get url of a given page_idx\n", " url = f'https://boardgamegeek.com/browse/boardgame/page/{page_idx}?sort=rank'\n", " str_html = get_url(url)\n", " \n", " # clean game data and store it in list\n", " df_game = clean_top_games(str_html)\n", " df_game_list.append(df_game)\n", " \n", " # pause so we don't overwhelm the website (may not respond if we query too quickly)\n", " time.sleep(1)\n", " \n", "# glue together all rows of all dataframes\n", "df_game = pd.concat(df_game_list)\n", "\n", "# discard games with same titles as others\n", "df_game.drop_duplicates(subset='title', inplace=True)\n", "\n", "# save / print\n", "df_game.to_csv('game.csv')\n", "df_game.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get metadata 
per game:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def clean_game_meta(str_html):\n", " \"\"\" given HTML source of a game page, extract game stats\n", " \n", " Args:\n", "     str_html (str): html of the page\n", " \n", " Returns:\n", "     game_dict (dict): dictionary representing metadata and stats about a given game\n", " \"\"\"\n", " soup = BeautifulSoup(str_html)\n", " \n", " # all the content is hidden as json in JavaScript rather than in html \n", " # so we need to hack a bit to get to it\n", " script = soup.find_all('script')[1]\n", " \n", " # the actual content of the page is stored in a JavaScript variable called\n", " # GEEK.geekitemPreload, so let's get its value\n", " for var in script.contents[0].split('\\n\\t'):\n", "     if var.startswith('GEEK.geekitemPreload'):\n", "         data = json.loads(var.split('GEEK.geekitemPreload = ')[1][:-1])['item']\n", " \n", " game_dict = dict()\n", " game_dict['player_age'] = data['polls']['playerage']\n", "\n", " # not all games have the info on best_players_min\n", " try:\n", "     game_dict['best_players_min'] = data['polls']['userplayers']['best'][0]['min']\n", " except (KeyError, IndexError, TypeError):\n", "     game_dict['best_players_min'] = None\n", " \n", " # not all games have the info on best_players_max\n", " try:\n", "     game_dict['best_players_max'] = data['polls']['userplayers']['best'][0]['max']\n", " except (KeyError, IndexError, TypeError):\n", "     game_dict['best_players_max'] = None\n", " \n", " game_dict['recomm_players_min'] = data['polls']['userplayers']['recommended'][0]['min']\n", " game_dict['recomm_players_max'] = data['polls']['userplayers']['recommended'][0]['max']\n", " game_dict['playtime_min'] = data['minplaytime']\n", " game_dict['playtime_max'] = data['maxplaytime']\n", " game_dict['awards'] = len(data['links']['boardgamehonor'])\n", " game_dict['difficulty'] = data['polls']['boardgameweight']['averageweight']\n", " game_dict['category'] = [cat['name'] for cat in 
data['links']['boardgamecategory']]\n", " game_dict['mechanic'] = [cat['name'] for cat in data['links']['boardgamemechanic']]\n", " \n", " return game_dict" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "scraping per game: 72%|███████▏ | 710/993 [03:05<01:00, 4.67it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "error getting meta data: Ascension: Storm of Souls (removing from df_game)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "scraping per game: 100%|██████████| 993/993 [04:16<00:00, 3.88it/s]\n" ] } ], "source": [ "for game_id, game_row in tqdm(df_game.iterrows(), desc='scraping per game', total=df_game.shape[0]):\n", "    # get game meta data from game specific page\n", "    html_str = get_url(game_row['url'])\n", "    try:\n", "        game_dict = clean_game_meta(html_str)\n", "    except Exception:\n", "        # failure to get game metadata (\"Ascension: Storm of Souls\")\n", "        # drop this game from the database\n", "        game_title = game_row['title']\n", "        print(f'error getting meta data: {game_title} (removing from df_game)')\n", "        df_game.drop(game_id, inplace=True)\n", "        continue\n", "    \n", "    # add this game data to our dataframe \n", "    for col, feat in game_dict.items():\n", "        if col in ('category', 'mechanic'):\n", "            # each game may have multiple 'category' or 'mechanic' groups it belongs to\n", "            # we build a unique column for each unique tag\n", "            \n", "            # get prefix (so we can distinguish category/mechanic tags later)\n", "            if col == 'category':\n", "                prefix = 'cat: '\n", "            else:\n", "                prefix = 'mech: '\n", "            \n", "            # save each group as its own column\n", "            for tag in feat:\n", "                tag = prefix + tag\n", "                df_game.loc[game_id, tag] = True\n", "        else:\n", "            # column has a single value, update df_game\n", "            df_game.loc[game_id, col] = feat\n", "    \n", "    # pause between requests so we don't overwhelm the website\n", "    time.sleep(1)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "for col 
in df_game.columns:\n", "    if ('cat: ' in col) or ('mech: ' in col):\n", "        df_game.fillna(value={col: False}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df_game.to_csv('game.csv')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " description \\\n", "game_id \n", "174430 Vanquish monsters with strategic cardplay. Ful... \n", "161936 Mutating diseases are spreading around the wor... \n", "224517 Build networks, grow industries, and navigate ... \n", "167791 Compete with rival CEOs to make Mars habitable... \n", "233078 Build an intergalactic empire through trade, r... \n", "\n", " title \\\n", "game_id \n", "174430 Gloomhaven \n", "161936 Pandemic Legacy: Season 1 \n", "224517 Brass: Birmingham \n", "167791 Terraforming Mars \n", "233078 Twilight Imperium: Fourth Edition \n", "\n", " url year player_age \\\n", "game_id \n", "174430 https://boardgamegeek.com/boardgame/174430/glo... 2017.0 14+ \n", "161936 https://boardgamegeek.com/boardgame/161936/pan... 2015.0 12+ \n", "224517 https://boardgamegeek.com/boardgame/224517/bra... 2018.0 14+ \n", "167791 https://boardgamegeek.com/boardgame/167791/ter... 2016.0 12+ \n", "233078 https://boardgamegeek.com/boardgame/233078/twi... 2017.0 14+ \n", "\n", " best_players_min best_players_max recomm_players_min \\\n", "game_id \n", "174430 NaN 3.0 1.0 \n", "161936 NaN 4.0 2.0 \n", "224517 NaN 4.0 2.0 \n", "167791 NaN 3.0 1.0 \n", "233078 NaN 6.0 3.0 \n", "\n", " recomm_players_max playtime_min ... cat: Trivia \\\n", "game_id ... \n", "174430 4.0 60 ... False \n", "161936 4.0 60 ... False \n", "224517 4.0 60 ... False \n", "167791 4.0 120 ... False \n", "233078 6.0 240 ... 
False \n", "\n", " mech: Map Deformation mech: Measurement Movement \\\n", "game_id \n", "174430 False False \n", "161936 False False \n", "224517 False False \n", "167791 False False \n", "233078 False False \n", "\n", " mech: Auction: Dutch Priority mech: Single Loser Game \\\n", "game_id \n", "174430 False False \n", "161936 False False \n", "224517 False False \n", "167791 False False \n", "233078 False False \n", "\n", " mech: Stacking and Balancing mech: Action Timer \\\n", "game_id \n", "174430 False False \n", "161936 False False \n", "224517 False False \n", "167791 False False \n", "233078 False False \n", "\n", " mech: Physical Removal mech: Induction \\\n", "game_id \n", "174430 False False \n", "161936 False False \n", "224517 False False \n", "167791 False False \n", "233078 False False \n", "\n", " mech: Ratio / Combat Results Table \n", "game_id \n", "174430 False \n", "161936 False \n", "224517 False \n", "167791 False \n", "233078 False \n", "\n", "[5 rows x 263 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_game.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizations (sanity check / data exploration)\n", "\n", " \n", "(2.5%) Builds two visualizations (graphs) from the data which characterize the distribution of the data itself in some interesting way. Your visualizations will be graded based on how much information they can effectively communicate with readers. Please make sure your visualizations are sufficiently distinct from each other.\n", "\n", "**Wherever a non-technical reader may misunderstand, write a few sentences which specify how to interpret the graph** It is expected that a non-technical reader can fully digest your graphs based on the images themselves as well as your explanatory text (they won't read your code).\n", " \n", "Notice that the pipeline above takes 20 mins or so to complete. 
We have manually renamed `game.csv` to `game_final.csv` so that:\n", "- we don't overwrite our precious dataset\n", "- we can load it below and continue on with our graphs\n", "\n", "You won't lose any credit for not using a `_final.csv` structure in your code but I'd encourage you to do so to avoid losing your data / time :)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Correlations between category tags" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df_game = pd.read_csv('game_final.csv')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def plot_corr_matrix(df_game, prefix, limit=10):\n", "    \"\"\" plots correlation heatmap of the most popular tags\n", "    \n", "    Args:\n", "        df_game (pd.DataFrame): one row per game, one boolean column per tag\n", "        prefix (str): prefix of the tag columns to use (e.g. 'cat: ')\n", "        limit (int): number of most popular tags to plot\n", "    \"\"\"\n", "    # find all columns with a given prefix\n", "    column_list = [col for col in df_game.columns if prefix in col]\n", "\n", "    # get dataframe corresponding to column_list\n", "    df = df_game.loc[:, column_list]\n", "    \n", "    # get series of tags, sorted in decreasing popularity\n", "    series_tag = df.sum(axis=0).sort_values(ascending=False)\n", "    \n", "    # get limit most popular tags from df\n", "    df = df_game.loc[:, series_tag.index[:limit]]\n", "\n", "    # strip prefix from column names\n", "    df.columns = [col[len(prefix):] for col in df.columns]\n", "\n", "    # plot\n", "    sns.heatmap(df.corr(), vmin=-1, vmax=1, cmap='coolwarm')\n", "    plt.gcf().set_size_inches(10, 10)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 0.98, 'Correlation of category tags')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_corr_matrix(df_game, prefix='cat: ')\n", "plt.suptitle('Correlation of category tags')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Higher values, shown in red, indicate that the categories tend to occur in the same game. Lower values, shown in blue, indicate categories that rarely occur together. For example, the final row (or column) describes games with miniatures. Miniatures (i.e. little plastic figurines) are often in games with:\n", "- Fantasy\n", "- Fighting\n", "- Science fiction\n", "- Adventure\n", "- Wargame\n", "\n", "However, games with miniatures tend not to belong to the following categories:\n", "- Card Game\n", "- Economic\n", "- City Building" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Duration of gameplay" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'count')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "\n", "sns.set(font_scale=1.3)\n", "\n", "fig, axs = plt.subplots(2, 1)\n", "fig.set_size_inches(12, 8)\n", "\n", "# plot histogram min playtime\n", "plt.sca(axs[0])\n", "bins = np.linspace(0, 250, 20)\n", "plt.hist(df_game['playtime_min'], bins=bins)\n", "plt.xlabel('min playtime (mins)')\n", "plt.ylabel('count')\n", "\n", "# plot histogram max playtime\n", "plt.sca(axs[1])\n", "plt.hist(df_game['playtime_max'], bins=bins)\n", "plt.xlabel('max playtime (mins)')\n", "plt.ylabel('count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis Plan\n", "\n", " \n", "(2%) Discuss what ML tools will be used and the relevant assumptions required to apply each. Either:\n", "\n", "- discuss why one algorithm may be chosen over the others\n", "\n", "- describe what subset of a whole suite of similar algorithms you’ll apply (it's ok to say, ‘we’re going to try all of these because we don’t have reason to think one should be better than another’)\n", "\n", "For example:\n", "\n", "- Project goal: Predicting the movement of stock prices based on how similar companies are doing via regression.\n", "    - This is observable: there is plenty of information you could collect about stock prices (though it will need some careful thought about what a \"similar\" company is)\n", "    - This is sound and simple: it seems intuitive that there'd be a relationship to be found between the companies' stock prices\n", " \n", "- Project goal: Recommend a movie based on the events of a person's dream last night.\n", "    - This is not observable: how will you get enough data on what people dreamed about?\n", "    - This is not sound or simple: what kind of movies should we suggest for people with nightmares of falling? 
Doesn't seem to be a clear relationship between dreams / movies a person would enjoy ....\n", " \n", " \n", "To date, we've covered data collection and processing extensively and are only starting ML this week (Mar 22). The grading will account for this:\n", "- be specific / clear about which data will be used\n", " - give plenty of examples to illustrate things\n", "- be as specific / clear as you can about the ML analysis\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our expectation is that games with similar categories would be enjoyed by similar board gamers. \n", "\n", "In each of the approaches below, a board game is characterized by a boolean vector $x$ which represents all the category and mechanics tags a board game could contain. For example, the fifth game can be represented by:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cat: Adventure False\n", "cat: Exploration True\n", "cat: Fantasy False\n", "cat: Fighting False\n", "cat: Miniatures False\n", " ... \n", "mech: Stacking and Balancing False\n", "mech: Action Timer False\n", "mech: Physical Removal False\n", "mech: Induction False\n", "mech: Ratio / Combat Results Table False\n", "Name: 4, Length: 250, dtype: bool" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feat_col = [col for col in df_game.columns if 'mech: ' in col or 'cat: ' in col]\n", "df_feat = df_game.loc[:, feat_col]\n", "df_feat.iloc[4, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will ask users to rate, on a 1-7 point scale, 10 or more board games. 
For example, a user might rate:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "example_user_rate = \\\n", "{'Gloomhaven': 7,\n", "'Pandemic Legacy: Season 1': 1,\n", "'Brass: Birmingham': 3,\n", "'Terraforming Mars': 2,\n", "'Twilight Imperium: Fourth Edition': 4,\n", "'Gloomhaven: Jaws of the Lion': 7,\n", "'Through the Ages: A New Story of Civilization': 5,\n", "'Gaia Project': 2,\n", "'Star Wars: Rebellion': 3,\n", "'Twilight Struggle': 4}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For every user, we will use regression to create a function which estimates these ratings from the $x$ vector above. This function can then be applied to the $x$ vector of every game, including the games for which the user has not indicated a rating. We will recommend the games with the highest estimated rating (i.e. the output of the regression function) which the user has not explicitly rated." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }