{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data and Analysis Plan: Board Game Reccomendation\n",
"## Team -1 (example)\n",
"\n",
"- Piotr Sapiezynski (p.sapiezynski@northeastern.edu)\n",
"- Matt Higger (m.higger@ccs.neu.edu)\n",
"\n",
"\n",
" \n",
"### Notes:\n",
"This is the example DS3000 project. We'll switch from our official \"Data and Analysis Plan\" to pointing out features which are helpful as you write your own by changing the font color to green. \n",
"\n",
"**Please use the section template given here to ensure you complete the necessary sections**\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"(.5%) Expresses the central motivation of the project in one or two sentences. This may evolve a bit through the project.\n",
"\n",
"\n",
"## Project Goal:\n",
"This work will scrape lists of board games from [BoardGameGeek.com](https://boardgamegeek.com/) to reccomend a new board game to users who input a game they already enjoy. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data \n",
"\n",
"(1%) Gives a summary of the data processing pipeline so a technical expert can easily follow along.\n",
" \n",
" \n",
"This overview section allows you to describe an overview of your data processing pipeline. \n",
" \n",
"Your pipeline should end by saving one or more csvs of data which will be loaded by the analysis portion of the notebook.\n",
"\n",
"\n",
"### Overview \n",
"We will scrape a [list of all boardgames](https://boardgamegeek.com/browse/boardgame) ranked by popularity from BoardGameGeek. \n",
"\n",
"\n",
"\n",
"\n",
"From this list, for each game, we can obtain:\n",
"- title\n",
"- year published\n",
"- url to specific game page\n",
"- description (e.g. \"Vanquish monsters ...\")\n",
"\n",
"Upon visiting an [individual game's webpage](https://boardgamegeek.com/boardgame/174430/gloomhaven) we can use the top titlebar:\n",
"\n",
"
\n",
"\n",
"to observe:\n",
"- complexity rating\n",
"- playtime (mins a typical game lasts)\n",
"- min/max number of player required\n",
"- reccomended age range\n",
"\n",
"Most importantly we seek to collect the category and mechanism tags associated with a particular game:\n",
"
\n",
"\n",
"Each game has multiple category tags. Our expectation is that games with similar tags are enjoyed by similar players. This tag data will be essential to reccomend a new game to a player who inputs another game they enjoy. To simplify analysis, each tag will have its own column in our output DataFrame. For example:\n",
"\n",
"| | card game | mystery | social | communication |\n",
"|-------|-----------|---------|--------|---------------|\n",
"| game0 | True | True | False | False |\n",
"| game1 | False | True | True | True |\n",
"\n",
"Indicates that \n",
"- game0 has tags 'card game' and 'mystery'\n",
"- game1 has tags ' mystery', 'social' and 'communication'\n",
"\n",
"### Pipeline Overview\n",
"\n",
"We will accomplish this task with three functions:\n",
"- `get_url()`\n",
" - returns html string of a given url\n",
"- `clean_top_games()`\n",
" - builds dataframe of [a single page of top games](https://boardgamegeek.com/browse/boardgame/page/2) from html string \n",
"- `clean_game_meta()`\n",
" - collects game meta data (tags, complexity, playtime, min/max players, age range ...) from an [individual game's webpage](https://boardgamegeek.com/boardgame/174430/gloomhaven)\n",
" \n",
"As well as two short scripts:\n",
"- **Scrape list of games:** use `get_url()` and `clean_top_games()` in a loop to collect n pages of top games (100 * n games), populating a DataFrame `df_game`\n",
"- **Get meta data per game:** loop through each row of `df_game`, query and process the individual game's webpage via `get_url()` and `clean_game_meta()` using the previously collected url and append the remaining features to `df_game`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pipeline\n",
"\n",
" \n",
"(4%) Obtains, cleans, and merges all data sources involved in the project.\n",
" \n",
"Documentation counts!\n",
" \n",
"The majority of this section is code, but do make sure that one can do a quick sanity check that your pipeline worked by printing a few examples (e.g. call `DataFrame.head()` a few times).\n",
"\n",
"\n",
"#### Scrape list of games"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import json\n",
"import time\n",
"from tqdm import tqdm\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def get_url(url):\n",
" \"\"\" gets the html of a given url\n",
" \n",
" Args:\n",
" url (str): target url\n",
" \n",
" Returns:\n",
" str_html (str): html of given url\n",
" \"\"\"\n",
" return requests.get(url).text"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def clean_top_games(str_html):\n",
" \"\"\" gets DataFrame of games from page of \"All Boardgames\" list\n",
" \n",
" example page from \"All Boardgames\":\n",
" https://boardgamegeek.com/browse/boardgame/page/2\n",
" \n",
" Args:\n",
" str_html (str): string corresponding to html of page\n",
" \n",
" Returns:\n",
" df_game (pd.DataFrame): dataframe where each row\n",
" is one board game\n",
" \"\"\" \n",
" # build soup\n",
" soup = BeautifulSoup(str_html)\n",
"\n",
" # get game data per row (discard first row as its the title)\n",
" df_game = pd.DataFrame()\n",
" for row_game in soup.find_all('tr')[1:]:\n",
" game_dict = dict()\n",
" \n",
" # each td tag corresponds to a column. we unpack by column\n",
" rank, image, title_year, rate_geek, rate_avg, num_vote, shop = row_game.find_all('td')\n",
" \n",
" # get game id and url from link in image\n",
" game_url = 'https://boardgamegeek.com' + image.a.attrs['href']\n",
" game_dict['url'] = game_url\n",
" game_dict['game_id'] = game_url.split('/')[-2]\n",
" \n",
" # get title \n",
" game_dict['title'] = title_year.a.text.strip()\n",
" \n",
" # get year\n",
" str_year = title_year.span.text\n",
" str_year = str_year.replace('(', '').replace(')', '')\n",
" game_dict['year'] = int(str_year)\n",
" \n",
" # try to get description (set empty if fail)\n",
" try:\n",
" game_dict['description'] = title_year.p.text.strip()\n",
" except:\n",
" game_dict['description'] = ''\n",
" \n",
" # add game to total dataframe\n",
" df_game = df_game.append(game_dict, ignore_index=True)\n",
" \n",
" # set game_id as index\n",
" df_game.set_index('game_id', inplace=True)\n",
" \n",
" return df_game"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"scraping page: 100%|██████████| 10/10 [00:04<00:00, 2.14it/s]\n"
]
},
{
"data": {
"text/html": [
"
\n", " | description | \n", "title | \n", "url | \n", "year | \n", "
---|---|---|---|---|
game_id | \n", "\n", " | \n", " | \n", " | \n", " |
174430 | \n", "Vanquish monsters with strategic cardplay. Ful... | \n", "Gloomhaven | \n", "https://boardgamegeek.com/boardgame/174430/glo... | \n", "2017.0 | \n", "
161936 | \n", "Mutating diseases are spreading around the wor... | \n", "Pandemic Legacy: Season 1 | \n", "https://boardgamegeek.com/boardgame/161936/pan... | \n", "2015.0 | \n", "
224517 | \n", "Build networks, grow industries, and navigate ... | \n", "Brass: Birmingham | \n", "https://boardgamegeek.com/boardgame/224517/bra... | \n", "2018.0 | \n", "
167791 | \n", "Compete with rival CEOs to make Mars habitable... | \n", "Terraforming Mars | \n", "https://boardgamegeek.com/boardgame/167791/ter... | \n", "2016.0 | \n", "
233078 | \n", "Build an intergalactic empire through trade, r... | \n", "Twilight Imperium: Fourth Edition | \n", "https://boardgamegeek.com/boardgame/233078/twi... | \n", "2017.0 | \n", "
\n", " | description | \n", "title | \n", "url | \n", "year | \n", "player_age | \n", "best_players_min | \n", "best_players_max | \n", "recomm_players_min | \n", "recomm_players_max | \n", "playtime_min | \n", "... | \n", "cat: Trivia | \n", "mech: Map Deformation | \n", "mech: Measurement Movement | \n", "mech: Auction: Dutch Priority | \n", "mech: Single Loser Game | \n", "mech: Stacking and Balancing | \n", "mech: Action Timer | \n", "mech: Physical Removal | \n", "mech: Induction | \n", "mech: Ratio / Combat Results Table | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
game_id | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
174430 | \n", "Vanquish monsters with strategic cardplay. Ful... | \n", "Gloomhaven | \n", "https://boardgamegeek.com/boardgame/174430/glo... | \n", "2017.0 | \n", "14+ | \n", "NaN | \n", "3.0 | \n", "1.0 | \n", "4.0 | \n", "60 | \n", "... | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "
161936 | \n", "Mutating diseases are spreading around the wor... | \n", "Pandemic Legacy: Season 1 | \n", "https://boardgamegeek.com/boardgame/161936/pan... | \n", "2015.0 | \n", "12+ | \n", "NaN | \n", "4.0 | \n", "2.0 | \n", "4.0 | \n", "60 | \n", "... | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "
224517 | \n", "Build networks, grow industries, and navigate ... | \n", "Brass: Birmingham | \n", "https://boardgamegeek.com/boardgame/224517/bra... | \n", "2018.0 | \n", "14+ | \n", "NaN | \n", "4.0 | \n", "2.0 | \n", "4.0 | \n", "60 | \n", "... | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "
167791 | \n", "Compete with rival CEOs to make Mars habitable... | \n", "Terraforming Mars | \n", "https://boardgamegeek.com/boardgame/167791/ter... | \n", "2016.0 | \n", "12+ | \n", "NaN | \n", "3.0 | \n", "1.0 | \n", "4.0 | \n", "120 | \n", "... | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "
233078 | \n", "Build an intergalactic empire through trade, r... | \n", "Twilight Imperium: Fourth Edition | \n", "https://boardgamegeek.com/boardgame/233078/twi... | \n", "2017.0 | \n", "14+ | \n", "NaN | \n", "6.0 | \n", "3.0 | \n", "6.0 | \n", "240 | \n", "... | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "False | \n", "
5 rows × 263 columns
\n", "