{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DS2500 Day 20\n", "\n", "Mar 28, 2023\n", "\n", "### Content\n", "- Web scraping (HTML parsing & string manipulations)\n", "\n", "### Admin\n", "- lab digest tomorrow\n", "- project\n", " - activate your mentor\n", " - sign up for a meeting slot with me next week\n", " \n", "### Lesson Credit\n", "\n", "Piotr Sapiezynski (https://www.sapiezynski.com/) originally wrote much of this lesson; I've modified it a bit (allrecipes.com has since changed ... arg!)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Web Scraping\n", "* Using programs or scripts to pretend to browse websites, examine their content, and retrieve and extract data from them\n", "* Why scrape?\n", " * if an API is available for a service, we will nearly always prefer the API to scraping\n", " * ... but not all services have APIs, or the available APIs are too expensive for our project\n", " * newly published information might not yet be available in ready-made datasets\n", "* Downsides of scraping:\n", " * no reference documentation (unlike APIs)\n", " * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper - this is why ETL is important!)\n", " * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)\n", " * legal and moral grey zone (even if the ToS does not disallow it, somebody has to pay for the traffic, and when you're scraping you're not looking at ads)\n", " * ... 
but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best case scenario\n", "Some webpages publish their data in the form of simple tables. In these (rare) cases we can just use pandas' `read_html` to scrape this data:\n", "\n", "https://www.espn.com/nba/team/stats/_/name/bos" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "# read_html extracts all the <table> elements from the HTML and returns a list of DataFrames created from them\n", "tables = pd.read_html('https://www.espn.com/nba/team/stats/_/name/bos')\n", "len(tables)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "" ], "text/plain": [ " Name\n", "0 Jayson Tatum SF\n", "1 Jaylen Brown SG\n", "2 Malcolm Brogdon PG\n", "3 Derrick White PG\n", "4 Marcus Smart PG\n", "5 Al Horford C\n", "6 Grant Williams PF\n", "7 Robert Williams III C\n", "8 Sam Hauser SF\n", "9 Mike Muscala C *\n", "10 Payton Pritchard PG\n", "11 Blake Griffin PF\n", "12 Luke Kornet C\n", "13 JD Davison SG\n", "14 Noah Vonleh PF\n", "15 Mfiondu Kabengele C\n", "16 Justin Jackson SF\n", "17 Total" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[0]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " GP GS MIN PTS OR DR REB AST STL BLK TO PF AST/TO\n", "0 69 69.0 37.3 30.1 1.1 7.8 8.9 4.7 1.0 0.7 3.0 2.1 1.6\n", "1 63 63.0 36.1 27.0 1.2 5.7 7.0 3.4 1.1 0.4 2.9 2.6 1.2\n", "2 62 0.0 25.8 14.6 0.6 3.6 4.2 3.7 0.6 0.3 1.5 1.6 2.5\n", "3 75 63.0 28.4 12.4 0.7 2.9 3.5 4.0 0.7 0.9 1.1 2.2 3.8\n", "4 57 57.0 32.3 11.4 0.8 2.4 3.2 6.4 1.5 0.4 2.4 2.8 2.6\n", "5 59 59.0 30.7 9.7 1.2 5.1 6.3 2.9 0.5 0.9 0.6 1.9 5.0\n", "6 72 22.0 26.5 8.3 1.1 3.6 4.7 1.7 0.6 0.4 1.1 2.6 1.6\n", "7 31 18.0 23.7 8.3 3.0 5.5 8.5 1.4 0.5 1.2 0.9 2.0 1.6\n", "8 73 5.0 15.8 6.1 0.5 2.1 2.5 0.8 0.3 0.3 0.3 1.3 2.3\n", "9 13 2.0 14.8 5.2 0.5 2.6 3.1 0.3 0.3 0.3 0.4 1.5 0.8\n", "10 45 2.0 12.5 4.7 0.5 1.0 1.5 1.0 0.3 0.0 0.7 0.8 1.5\n", "11 35 14.0 14.1 4.3 1.1 2.6 3.7 1.3 0.3 0.2 0.5 1.9 2.7\n", "12 62 0.0 11.5 3.8 1.3 1.5 2.8 0.7 0.2 0.7 0.4 1.2 1.8\n", "13 10 0.0 2.7 1.1 0.1 0.5 0.6 0.6 0.2 0.0 0.2 0.4 3.0\n", "14 23 1.0 7.5 1.1 0.8 1.3 2.1 0.3 0.1 0.3 0.5 1.5 0.6\n", "15 2 0.0 7.0 1.0 1.5 1.0 2.5 0.0 0.0 0.0 0.5 1.5 0.0\n", "16 23 0.0 4.7 0.9 0.1 0.7 0.7 0.4 0.2 0.2 0.1 0.3 4.5\n", "17 75 NaN NaN 118.1 9.6 35.7 45.3 26.5 6.4 5.2 12.7 19.2 2.1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[1]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name GP GS MIN PTS OR DR REB AST STL \\\n", "0 Jayson Tatum SF 69 69.0 37.3 30.1 1.1 7.8 8.9 4.7 1.0 \n", "1 Jaylen Brown SG 63 63.0 36.1 27.0 1.2 5.7 7.0 3.4 1.1 \n", "2 Malcolm Brogdon PG 62 0.0 25.8 14.6 0.6 3.6 4.2 3.7 0.6 \n", "3 Derrick White PG 75 63.0 28.4 12.4 0.7 2.9 3.5 4.0 0.7 \n", "4 Marcus Smart PG 57 57.0 32.3 11.4 0.8 2.4 3.2 6.4 1.5 \n", "5 Al Horford C 59 59.0 30.7 9.7 1.2 5.1 6.3 2.9 0.5 \n", "6 Grant Williams PF 72 22.0 26.5 8.3 1.1 3.6 4.7 1.7 0.6 \n", "7 Robert Williams III C 31 18.0 23.7 8.3 3.0 5.5 8.5 1.4 0.5 \n", "8 Sam Hauser SF 73 5.0 15.8 6.1 0.5 2.1 2.5 0.8 0.3 \n", "9 Mike Muscala C * 13 2.0 14.8 5.2 0.5 2.6 3.1 0.3 0.3 \n", "10 Payton Pritchard PG 45 2.0 12.5 4.7 0.5 1.0 1.5 1.0 0.3 \n", "11 Blake Griffin PF 35 14.0 14.1 4.3 1.1 2.6 3.7 1.3 0.3 \n", "12 Luke Kornet C 62 0.0 11.5 3.8 1.3 1.5 2.8 0.7 0.2 \n", "13 JD Davison SG 10 0.0 2.7 1.1 0.1 0.5 0.6 0.6 0.2 \n", "14 Noah Vonleh PF 23 1.0 7.5 1.1 0.8 1.3 2.1 0.3 0.1 \n", "15 Mfiondu Kabengele C 2 0.0 7.0 1.0 1.5 1.0 2.5 0.0 0.0 \n", "16 Justin Jackson SF 23 0.0 4.7 0.9 0.1 0.7 0.7 0.4 0.2 \n", "17 Total 75 NaN NaN 118.1 9.6 35.7 45.3 26.5 6.4 \n", "\n", " BLK TO PF AST/TO \n", "0 0.7 3.0 2.1 1.6 \n", "1 0.4 2.9 2.6 1.2 \n", "2 0.3 1.5 1.6 2.5 \n", "3 0.9 1.1 2.2 3.8 \n", "4 0.4 2.4 2.8 2.6 \n", "5 0.9 0.6 1.9 5.0 \n", "6 0.4 1.1 2.6 1.6 \n", "7 1.2 0.9 2.0 1.6 \n", "8 0.3 0.3 1.3 2.3 \n", "9 0.3 0.4 1.5 0.8 \n", "10 0.0 0.7 0.8 1.5 \n", "11 0.2 0.5 1.9 2.7 \n", "12 0.7 0.4 1.2 1.8 \n", "13 0.0 0.2 0.4 3.0 \n", "14 0.3 0.5 1.5 0.6 \n", "15 0.0 0.5 1.5 0.0 \n", "16 0.2 0.1 0.3 4.5 \n", "17 5.2 12.7 19.2 2.1 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# \"glue\" dataframes together (more to come on this later in the semester)\n", "player_stats = pd.concat(tables[:2], axis=1)\n", "player_stats" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## HTML\n", "Web 
pages are written in HTML.\n", "\n", "The keywords in `<>` brackets are called tags. They open with `<tag>` and close with `</tag>`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "s_html = \"\"\"\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

Heading 1

\n", "

This is what heading 2 looks like

\n", " \n", "

Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.

\n", "\n", "

Unlike in Python, indentation is only good practice; it doesn't change functionality. In fact, all of this HTML could be (and in real webpages often is) written as a single line.

\n", " \n", "

Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attribute of the a tag that specifies where the link points to.

\n", " \n", " \n", " \n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# write this string to a local file \"simple_page0.html\"\n", "with open('simple_page0.html', 'w') as f:\n", "    print(s_html, file=f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clicking the link below will open the HTML page we just wrote:\n", "\n", "[simple_page0.html](simple_page0.html)\n", "\n", "It opens in Jupyter, but your usual browser (Chrome, Safari, Firefox, etc.) will work too." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# HTML is organized as a tree\n", "\n", "(Note to self: write out tree structure below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```html\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "

Heading 1

\n", "

This is what heading 2 looks like

\n", " \n", "

Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.

\n", "\n", "

Unlike in Python, indentation is only good practice; it doesn't change functionality. In fact, all of this HTML could be (and in real webpages often is) written as a single line.

\n", " \n", "

Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attribute of the a tag that specifies where the link points to.

\n", " \n", " \n", " \n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# And now, the internet\n", "\n", "### Observing HTML in a browser\n", "You can see the actual html of a page by selecting \"inspect\" on a page via a right click. Try it out:\n", "\n", "[https://www.scrapethissite.com/pages/simple/](https://www.scrapethissite.com/pages/simple/)\n", "\n", "### Obtaining HTML from a url address\n", "Use `requests.get()` to get the html of a web page into python:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", " \n", " \n", " Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping\n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", " \n", "\n", "\n", "\n", "\n", " \n", "\n", " \n", " \n", "\n", " \n", "\n", " \n", "\n", "
\n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "

\n", " Countries of the World: A Simple Example\n", " 250 items\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " A single page that lists information about all the countries in the world. Good for those just get started with web scraping.\n", " Practice looking for patterns in the HTML that will allow you to extract information about each country. Then, build a simple web scraper that makes a request to this page, parses the HTML and prints out each country's name.\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

\n", " There are 4 video lessons that show you how to scrape this page.\n", "

\n", "
\n", "
\n", "
\n", "

\n", " \n", " Data via\n", " http://peric.github.io/GetCountries/\n", " \n", "

\n", "
\n", "
\n", "
\n", "\n", "
\n", " \n", "
\n", "

\n", " \n", " Andorra\n", "

\n", "
\n", " Capital: Andorra la Vella
\n", " Population: 84000
\n", " Area (km2): 468.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " United Arab Emirates\n", "

\n", "
\n", " Capital: Abu Dhabi
\n", " Population: 4975593
\n", " Area (km2): 82880.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Afghanistan\n", "

\n", "
\n", " Capital: Kabul
\n", " Population: 29121286
\n", " Area (km2): 647500.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Antigua and Barbuda\n", "

\n", "
\n", " Capital: St. John's
\n", " Population: 86754
\n", " Area (km2): 443.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Anguilla\n", "

\n", "
\n", " Capital: The Valley
\n", " Population: 13254
\n", " Area (km2): 102.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Albania\n", "

\n", "
\n", " Capital: Tirana
\n", " Population: 2986952
\n", " Area (km2): 28748.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Armenia\n", "

\n", "
\n", " Capital: Yerevan
\n", " Population: 2968000
\n", " Area (km2): 29800.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Angola\n", "

\n", "
\n", " Capital: Luanda
\n", " Population: 13068161
\n", " Area (km2): 1246700.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Antarctica\n", "

\n", "
\n", " Capital: None
\n", " Population: 0
\n", " Area (km2): 1.4E7
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Argentina\n", "

\n", "
\n", " Capital: Buenos Aires
\n", " Population: 41343201
\n", " Area (km2): 2766890.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " American Samoa\n", "

\n", "
\n", " Capital: Pago Pago
\n", " Population: 57881
\n", " Area (km2): 199.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Austria\n", "

\n", "
\n", " Capital: Vienna
\n", " Population: 8205000
\n", " Area (km2): 83858.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Australia\n", "

\n", "
\n", " Capital: Canberra
\n", " Population: 21515754
\n", " Area (km2): 7686850.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Aruba\n", "

\n", "
\n", " Capital: Oranjestad
\n", " Population: 71566
\n", " Area (km2): 193.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Åland\n", "

\n", "
\n", " Capital: Mariehamn
\n", " Population: 26711
\n", " Area (km2): 1580.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Azerbaijan\n", "

\n", "
\n", " Capital: Baku
\n", " Population: 8303512
\n", " Area (km2): 86600.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bosnia and Herzegovina\n", "

\n", "
\n", " Capital: Sarajevo
\n", " Population: 4590000
\n", " Area (km2): 51129.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Barbados\n", "

\n", "
\n", " Capital: Bridgetown
\n", " Population: 285653
\n", " Area (km2): 431.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bangladesh\n", "

\n", "
\n", " Capital: Dhaka
\n", " Population: 156118464
\n", " Area (km2): 144000.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Belgium\n", "

\n", "
\n", " Capital: Brussels
\n", " Population: 10403000
\n", " Area (km2): 30510.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Burkina Faso\n", "

\n", "
\n", " Capital: Ouagadougou
\n", " Population: 16241811
\n", " Area (km2): 274200.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bulgaria\n", "

\n", "
\n", " Capital: Sofia
\n", " Population: 7148785
\n", " Area (km2): 110910.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bahrain\n", "

\n", "
\n", " Capital: Manama
\n", " Population: 738004
\n", " Area (km2): 665.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Burundi\n", "

\n", "
\n", " Capital: Bujumbura
\n", " Population: 9863117
\n", " Area (km2): 27830.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Benin\n", "

\n", "
\n", " Capital: Porto-Novo
\n", " Population: 9056010
\n", " Area (km2): 112620.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saint Barthélemy\n", "

\n", "
\n", " Capital: Gustavia
\n", " Population: 8450
\n", " Area (km2): 21.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bermuda\n", "

\n", "
\n", " Capital: Hamilton
\n", " Population: 65365
\n", " Area (km2): 53.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Brunei\n", "

\n", "
\n", " Capital: Bandar Seri Begawan
\n", " Population: 395027
\n", " Area (km2): 5770.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bolivia\n", "

\n", "
\n", " Capital: Sucre
\n", " Population: 9947418
\n", " Area (km2): 1098580.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bonaire\n", "

\n", "
\n", " Capital: Kralendijk
\n", " Population: 18012
\n", " Area (km2): 328.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Brazil\n", "

\n", "
\n", " Capital: Brasília
\n", " Population: 201103330
\n", " Area (km2): 8511965.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bahamas\n", "

\n", "
\n", " Capital: Nassau
\n", " Population: 301790
\n", " Area (km2): 13940.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bhutan\n", "

\n", "
\n", " Capital: Thimphu
\n", " Population: 699847
\n", " Area (km2): 47000.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Bouvet Island\n", "

\n", "
\n", " Capital: None
\n", " Population: 0
\n", " Area (km2): 49.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Botswana\n", "

\n", "
\n", " Capital: Gaborone
\n", " Population: 2029307
\n", " Area (km2): 600370.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Belarus\n", "

\n", "
\n", " Capital: Minsk
\n", " Population: 9685000
\n", " Area (km2): 207600.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Belize\n", "

\n", "
\n", " Capital: Belmopan
\n", " Population: 314522
\n", " Area (km2): 22966.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Canada\n", "

\n", "
\n", " Capital: Ottawa
\n", " Population: 33679000
\n", " Area (km2): 9984670.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cocos [Keeling] Islands\n", "

\n", "
\n", " Capital: West Island
\n", " Population: 628
\n", " Area (km2): 14.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Democratic Republic of the Congo\n", "

\n", "
\n", " Capital: Kinshasa
\n", " Population: 70916439
\n", " Area (km2): 2345410.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Central African Republic\n", "

\n", "
\n", " Capital: Bangui
\n", " Population: 4844927
\n", " Area (km2): 622984.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Republic of the Congo\n", "

\n", "
\n", " Capital: Brazzaville
\n", " Population: 3039126
\n", " Area (km2): 342000.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Switzerland\n", "

\n", "
\n", " Capital: Bern
\n", " Population: 7581000
\n", " Area (km2): 41290.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Ivory Coast\n", "

\n", "
\n", " Capital: Yamoussoukro
\n", " Population: 21058798
\n", " Area (km2): 322460.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cook Islands\n", "

\n", "
\n", " Capital: Avarua
\n", " Population: 21388
\n", " Area (km2): 240.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Chile\n", "

\n", "
\n", " Capital: Santiago
\n", " Population: 16746491
\n", " Area (km2): 756950.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cameroon\n", "

\n", "
\n", " Capital: Yaoundé
\n", " Population: 19294149
\n", " Area (km2): 475440.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " China\n", "

\n", "
\n", " Capital: Beijing
\n", " Population: 1330044000
\n", " Area (km2): 9596960.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Colombia\n", "

\n", "
\n", " Capital: Bogotá
\n", " Population: 47790000
\n", " Area (km2): 1138910.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Costa Rica\n", "

\n", "
\n", " Capital: San José
\n", " Population: 4516220
\n", " Area (km2): 51100.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cuba\n", "

\n", "
\n", " Capital: Havana
\n", " Population: 11423000
\n", " Area (km2): 110860.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cape Verde\n", "

\n", "
\n", " Capital: Praia
\n", " Population: 508659
\n", " Area (km2): 4033.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Curacao\n", "

\n", "
\n", " Capital: Willemstad
\n", " Population: 141766
\n", " Area (km2): 444.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Christmas Island\n", "

\n", "
\n", " Capital: Flying Fish Cove
\n", " Population: 1500
\n", " Area (km2): 135.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cyprus\n", "

\n", "
\n", " Capital: Nicosia
\n", " Population: 1102677
\n", " Area (km2): 9250.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Czech Republic\n", "

\n", "
\n", " Capital: Prague
\n", " Population: 10476000
\n", " Area (km2): 78866.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Germany\n", "

\n", "
\n", " Capital: Berlin
\n", " Population: 81802257
\n", " Area (km2): 357021.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Djibouti\n", "

\n", "
\n", " Capital: Djibouti
\n", " Population: 740528
\n", " Area (km2): 23000.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Denmark\n", "

\n", "
\n", " Capital: Copenhagen
\n", " Population: 5484000
\n", " Area (km2): 43094.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Dominica\n", "

\n", "
\n", " Capital: Roseau
\n", " Population: 72813
\n", " Area (km2): 754.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Dominican Republic\n", "

\n", "
\n", " Capital: Santo Domingo
\n", " Population: 9823821
\n", " Area (km2): 48730.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Algeria\n", "

\n", "
\n", " Capital: Algiers
\n", " Population: 34586184
\n", " Area (km2): 2381740.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Ecuador\n", "

\n", "
\n", " Capital: Quito
\n", " Population: 14790608
\n", " Area (km2): 283560.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Estonia\n", "

\n", "
\n", " Capital: Tallinn
\n", " Population: 1291170
\n", " Area (km2): 45226.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Egypt\n", "

\n", "
\n", " Capital: Cairo
\n", " Population: 80471869
\n", " Area (km2): 1001450.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Western Sahara\n", "

\n", "
\n", " Capital: Laâyoune / El Aaiún
\n", " Population: 273008
\n", " Area (km2): 266000.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Eritrea\n", "

\n", "
\n", " Capital: Asmara
\n", " Population: 5792984
\n", " Area (km2): 121320.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Spain\n", "

\n", "
\n", " Capital: Madrid
\n", " Population: 46505963
\n", " Area (km2): 504782.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Ethiopia\n", "

\n", "
\n", " Capital: Addis Ababa
\n", " Population: 88013491
\n", " Area (km2): 1127127.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Finland\n", "

\n", "
\n", " Capital: Helsinki
\n", " Population: 5244000
\n", " Area (km2): 337030.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Fiji\n", "

\n", "
\n", " Capital: Suva
\n", " Population: 875983
\n", " Area (km2): 18270.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Falkland Islands\n", "

\n", "
\n", " Capital: Stanley
\n", " Population: 2638
\n", " Area (km2): 12173.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Micronesia\n", "

\n", "
\n", " Capital: Palikir
\n", " Population: 107708
\n", " Area (km2): 702.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Faroe Islands\n", "

\n", "
\n", " Capital: Tórshavn
\n", " Population: 48228
\n", " Area (km2): 1399.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " France\n", "

\n", "
\n", " Capital: Paris
\n", " Population: 64768389
\n", " Area (km2): 547030.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Gabon\n", "

\n", "
\n", " Capital: Libreville
\n", " Population: 1545255
\n", " Area (km2): 267667.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " United Kingdom\n", "

\n", "
\n", " Capital: London
\n", " Population: 62348447
\n", " Area (km2): 244820.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Grenada\n", "

\n", "
\n", " Capital: St. George's
\n", " Population: 107818
\n", " Area (km2): 344.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Georgia\n", "

\n", "
\n", " Capital: Tbilisi
\n", " Population: 4630000
\n", " Area (km2): 69700.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " French Guiana\n", "

\n", "
\n", " Capital: Cayenne
\n", " Population: 195506
\n", " Area (km2): 91000.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Guernsey\n", "

\n", "
\n", " Capital: St Peter Port
\n", " Population: 65228
\n", " Area (km2): 78.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Ghana\n", "

\n", "
\n", " Capital: Accra
\n", " Population: 24339838
\n", " Area (km2): 239460.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Gibraltar\n", "

\n", "
\n", " Capital: Gibraltar
\n", " Population: 27884
\n", " Area (km2): 6.5
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Greenland\n", "

\n", "
\n", " Capital: Nuuk
\n", " Population: 56375
\n", " Area (km2): 2166086.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Gambia\n", "

\n", "
\n", " Capital: Bathurst
\n", " Population: 1593256
\n", " Area (km2): 11300.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Guinea\n", "

\n", "
\n", " Capital: Conakry
\n", " Population: 10324025
\n", " Area (km2): 245857.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Guadeloupe\n", "

\n", "
\n", " Capital: Basse-Terre
\n", " Population: 443000
\n", " Area (km2): 1780.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Equatorial Guinea\n", "

\n", "
\n", " Capital: Malabo
\n", " Population: 1014999
\n", " Area (km2): 28051.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Greece\n", "

\n", "
\n", " Capital: Athens
\n", " Population: 11000000
\n", " Area (km2): 131940.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " South Georgia and the South Sandwich Islands\n", "

\n", "
\n", " Capital: Grytviken
\n", " Population: 30
\n", " Area (km2): 3903.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Guatemala\n", "

\n", "
\n", " Capital: Guatemala City
\n", " Population: 13550440
\n", " Area (km2): 108890.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Guam\n", "

\n", "
\n", " Capital: Hagåtña
\n", " Population: 159358
\n", " Area (km2): 549.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Guinea-Bissau\n", "

\n", "
\n", " Capital: Bissau
\n", " Population: 1565126
\n", " Area (km2): 36120.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Guyana\n", "

\n", "
\n", " Capital: Georgetown
\n", " Population: 748486
\n", " Area (km2): 214970.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Hong Kong\n", "

\n", "
\n", " Capital: Hong Kong
\n", " Population: 6898686
\n", " Area (km2): 1092.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Heard Island and McDonald Islands\n", "

\n", "
\n", " Capital: None
\n", " Population: 0
\n", " Area (km2): 412.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Honduras\n", "

\n", "
\n", " Capital: Tegucigalpa
\n", " Population: 7989415
\n", " Area (km2): 112090.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Croatia\n", "

\n", "
\n", " Capital: Zagreb
\n", " Population: 4491000
\n", " Area (km2): 56542.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Haiti\n", "

\n", "
\n", " Capital: Port-au-Prince
\n", " Population: 9648924
\n", " Area (km2): 27750.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Hungary\n", "

\n", "
\n", " Capital: Budapest
\n", " Population: 9982000
\n", " Area (km2): 93030.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Indonesia\n", "

\n", "
\n", " Capital: Jakarta
\n", " Population: 242968342
\n", " Area (km2): 1919440.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Ireland\n", "

\n", "
\n", " Capital: Dublin
\n", " Population: 4622917
\n", " Area (km2): 70280.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Israel\n", "

\n", "
\n", " Capital: None
\n", " Population: 7353985
\n", " Area (km2): 20770.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Isle of Man\n", "

\n", "
\n", " Capital: Douglas
\n", " Population: 75049
\n", " Area (km2): 572.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " India\n", "

\n", "
\n", " Capital: New Delhi
\n", " Population: 1173108018
\n", " Area (km2): 3287590.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " British Indian Ocean Territory\n", "

\n", "
\n", " Capital: None
\n", " Population: 4000
\n", " Area (km2): 60.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Iraq\n", "

\n", "
\n", " Capital: Baghdad
\n", " Population: 29671605
\n", " Area (km2): 437072.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Iran\n", "

\n", "
\n", " Capital: Tehran
\n", " Population: 76923300
\n", " Area (km2): 1648000.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Iceland\n", "

\n", "
\n", " Capital: Reykjavik
\n", " Population: 308910
\n", " Area (km2): 103000.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Italy\n", "

\n", "
\n", " Capital: Rome
\n", " Population: 60340328
\n", " Area (km2): 301230.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Jersey\n", "

\n", "
\n", " Capital: Saint Helier
\n", " Population: 90812
\n", " Area (km2): 116.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Jamaica\n", "

\n", "
\n", " Capital: Kingston
\n", " Population: 2847232
\n", " Area (km2): 10991.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Jordan\n", "

\n", "
\n", " Capital: Amman
\n", " Population: 6407085
\n", " Area (km2): 92300.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Japan\n", "

\n", "
\n", " Capital: Tokyo
\n", " Population: 127288000
\n", " Area (km2): 377835.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Kenya\n", "

\n", "
\n", " Capital: Nairobi
\n", " Population: 40046566
\n", " Area (km2): 582650.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Kyrgyzstan\n", "

\n", "
\n", " Capital: Bishkek
\n", " Population: 5776500
\n", " Area (km2): 198500.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cambodia\n", "

\n", "
\n", " Capital: Phnom Penh
\n", " Population: 14453680
\n", " Area (km2): 181040.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Kiribati\n", "

\n", "
\n", " Capital: Tarawa
\n", " Population: 92533
\n", " Area (km2): 811.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Comoros\n", "

\n", "
\n", " Capital: Moroni
\n", " Population: 773407
\n", " Area (km2): 2170.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saint Kitts and Nevis\n", "

\n", "
\n", " Capital: Basseterre
\n", " Population: 51134
\n", " Area (km2): 261.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " North Korea\n", "

\n", "
\n", " Capital: Pyongyang
\n", " Population: 22912177
\n", " Area (km2): 120540.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " South Korea\n", "

\n", "
\n", " Capital: Seoul
\n", " Population: 48422644
\n", " Area (km2): 98480.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Kuwait\n", "

\n", "
\n", " Capital: Kuwait City
\n", " Population: 2789132
\n", " Area (km2): 17820.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Cayman Islands\n", "

\n", "
\n", " Capital: George Town
\n", " Population: 44270
\n", " Area (km2): 262.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Kazakhstan\n", "

\n", "
\n", " Capital: Astana
\n", " Population: 15340000
\n", " Area (km2): 2717300.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Laos\n", "

\n", "
\n", " Capital: Vientiane
\n", " Population: 6368162
\n", " Area (km2): 236800.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Lebanon\n", "

\n", "
\n", " Capital: Beirut
\n", " Population: 4125247
\n", " Area (km2): 10400.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saint Lucia\n", "

\n", "
\n", " Capital: Castries
\n", " Population: 160922
\n", " Area (km2): 616.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Liechtenstein\n", "

\n", "
\n", " Capital: Vaduz
\n", " Population: 35000
\n", " Area (km2): 160.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Sri Lanka\n", "

\n", "
\n", " Capital: Colombo
\n", " Population: 21513990
\n", " Area (km2): 65610.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Liberia\n", "

\n", "
\n", " Capital: Monrovia
\n", " Population: 3685076
\n", " Area (km2): 111370.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Lesotho\n", "

\n", "
\n", " Capital: Maseru
\n", " Population: 1919552
\n", " Area (km2): 30355.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Lithuania\n", "

\n", "
\n", " Capital: Vilnius
\n", " Population: 2944459
\n", " Area (km2): 65200.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Luxembourg\n", "

\n", "
\n", " Capital: Luxembourg
\n", " Population: 497538
\n", " Area (km2): 2586.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Latvia\n", "

\n", "
\n", " Capital: Riga
\n", " Population: 2217969
\n", " Area (km2): 64589.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Libya\n", "

\n", "
\n", " Capital: Tripoli
\n", " Population: 6461454
\n", " Area (km2): 1759540.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Morocco\n", "

\n", "
\n", " Capital: Rabat
\n", " Population: 31627428
\n", " Area (km2): 446550.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Monaco\n", "

\n", "
\n", " Capital: Monaco
\n", " Population: 32965
\n", " Area (km2): 1.95
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Moldova\n", "

\n", "
\n", " Capital: Chişinău
\n", " Population: 4324000
\n", " Area (km2): 33843.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Montenegro\n", "

\n", "
\n", " Capital: Podgorica
\n", " Population: 666730
\n", " Area (km2): 14026.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saint Martin\n", "

\n", "
\n", " Capital: Marigot
\n", " Population: 35925
\n", " Area (km2): 53.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Madagascar\n", "

\n", "
\n", " Capital: Antananarivo
\n", " Population: 21281844
\n", " Area (km2): 587040.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Marshall Islands\n", "

\n", "
\n", " Capital: Majuro
\n", " Population: 65859
\n", " Area (km2): 181.3
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Macedonia\n", "

\n", "
\n", " Capital: Skopje
\n", " Population: 2062294
\n", " Area (km2): 25333.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Mali\n", "

\n", "
\n", " Capital: Bamako
\n", " Population: 13796354
\n", " Area (km2): 1240000.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Myanmar [Burma]\n", "

\n", "
\n", " Capital: Naypyitaw
\n", " Population: 53414374
\n", " Area (km2): 678500.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Mongolia\n", "

\n", "
\n", " Capital: Ulan Bator
\n", " Population: 3086918
\n", " Area (km2): 1565000.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Macao\n", "

\n", "
\n", " Capital: Macao
\n", " Population: 449198
\n", " Area (km2): 254.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Northern Mariana Islands\n", "

\n", "
\n", " Capital: Saipan
\n", " Population: 53883
\n", " Area (km2): 477.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Martinique\n", "

\n", "
\n", " Capital: Fort-de-France
\n", " Population: 432900
\n", " Area (km2): 1100.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Mauritania\n", "

\n", "
\n", " Capital: Nouakchott
\n", " Population: 3205060
\n", " Area (km2): 1030700.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Montserrat\n", "

\n", "
\n", " Capital: Plymouth
\n", " Population: 9341
\n", " Area (km2): 102.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Malta\n", "

\n", "
\n", " Capital: Valletta
\n", " Population: 403000
\n", " Area (km2): 316.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Mauritius\n", "

\n", "
\n", " Capital: Port Louis
\n", " Population: 1294104
\n", " Area (km2): 2040.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Maldives\n", "

\n", "
\n", " Capital: Malé
\n", " Population: 395650
\n", " Area (km2): 300.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Malawi\n", "

\n", "
\n", " Capital: Lilongwe
\n", " Population: 15447500
\n", " Area (km2): 118480.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Mexico\n", "

\n", "
\n", " Capital: Mexico City
\n", " Population: 112468855
\n", " Area (km2): 1972550.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Malaysia\n", "

\n", "
\n", " Capital: Kuala Lumpur
\n", " Population: 28274729
\n", " Area (km2): 329750.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Mozambique\n", "

\n", "
\n", " Capital: Maputo
\n", " Population: 22061451
\n", " Area (km2): 801590.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Namibia\n", "

\n", "
\n", " Capital: Windhoek
\n", " Population: 2128471
\n", " Area (km2): 825418.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " New Caledonia\n", "

\n", "
\n", " Capital: Noumea
\n", " Population: 216494
\n", " Area (km2): 19060.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Niger\n", "

\n", "
\n", " Capital: Niamey
\n", " Population: 15878271
\n", " Area (km2): 1267000.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Norfolk Island\n", "

\n", "
\n", " Capital: Kingston
\n", " Population: 1828
\n", " Area (km2): 34.6
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Nigeria\n", "

\n", "
\n", " Capital: Abuja
\n", " Population: 154000000
\n", " Area (km2): 923768.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Nicaragua\n", "

\n", "
\n", " Capital: Managua
\n", " Population: 5995928
\n", " Area (km2): 129494.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Netherlands\n", "

\n", "
\n", " Capital: Amsterdam
\n", " Population: 16645000
\n", " Area (km2): 41526.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Norway\n", "

\n", "
\n", " Capital: Oslo
\n", " Population: 5009150
\n", " Area (km2): 324220.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Nepal\n", "

\n", "
\n", " Capital: Kathmandu
\n", " Population: 28951852
\n", " Area (km2): 140800.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Nauru\n", "

\n", "
\n", " Capital: Yaren
\n", " Population: 10065
\n", " Area (km2): 21.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Niue\n", "

\n", "
\n", " Capital: Alofi
\n", " Population: 2166
\n", " Area (km2): 260.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " New Zealand\n", "

\n", "
\n", " Capital: Wellington
\n", " Population: 4252277
\n", " Area (km2): 268680.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Oman\n", "

\n", "
\n", " Capital: Muscat
\n", " Population: 2967717
\n", " Area (km2): 212460.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Panama\n", "

\n", "
\n", " Capital: Panama City
\n", " Population: 3410676
\n", " Area (km2): 78200.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Peru\n", "

\n", "
\n", " Capital: Lima
\n", " Population: 29907003
\n", " Area (km2): 1285220.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " French Polynesia\n", "

\n", "
\n", " Capital: Papeete
\n", " Population: 270485
\n", " Area (km2): 4167.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Papua New Guinea\n", "

\n", "
\n", " Capital: Port Moresby
\n", " Population: 6064515
\n", " Area (km2): 462840.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Philippines\n", "

\n", "
\n", " Capital: Manila
\n", " Population: 99900177
\n", " Area (km2): 300000.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Pakistan\n", "

\n", "
\n", " Capital: Islamabad
\n", " Population: 184404791
\n", " Area (km2): 803940.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Poland\n", "

\n", "
\n", " Capital: Warsaw
\n", " Population: 38500000
\n", " Area (km2): 312685.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saint Pierre and Miquelon\n", "

\n", "
\n", " Capital: Saint-Pierre
\n", " Population: 7012
\n", " Area (km2): 242.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Pitcairn Islands\n", "

\n", "
\n", " Capital: Adamstown
\n", " Population: 46
\n", " Area (km2): 47.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Puerto Rico\n", "

\n", "
\n", " Capital: San Juan
\n", " Population: 3916632
\n", " Area (km2): 9104.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Palestine\n", "

\n", "
\n", " Capital: None
\n", " Population: 3800000
\n", " Area (km2): 5970.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Portugal\n", "

\n", "
\n", " Capital: Lisbon
\n", " Population: 10676000
\n", " Area (km2): 92391.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Palau\n", "

\n", "
\n", " Capital: Melekeok
\n", " Population: 19907
\n", " Area (km2): 458.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Paraguay\n", "

\n", "
\n", " Capital: Asunción
\n", " Population: 6375830
\n", " Area (km2): 406750.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Qatar\n", "

\n", "
\n", " Capital: Doha
\n", " Population: 840926
\n", " Area (km2): 11437.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Réunion\n", "

\n", "
\n", " Capital: Saint-Denis
\n", " Population: 776948
\n", " Area (km2): 2517.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Romania\n", "

\n", "
\n", " Capital: Bucharest
\n", " Population: 21959278
\n", " Area (km2): 237500.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Serbia\n", "

\n", "
\n", " Capital: Belgrade
\n", " Population: 7344847
\n", " Area (km2): 88361.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Russia\n", "

\n", "
\n", " Capital: Moscow
\n", " Population: 140702000
\n", " Area (km2): 1.71E7
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Rwanda\n", "

\n", "
\n", " Capital: Kigali
\n", " Population: 11055976
\n", " Area (km2): 26338.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saudi Arabia\n", "

\n", "
\n", " Capital: Riyadh
\n", " Population: 25731776
\n", " Area (km2): 1960582.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Solomon Islands\n", "

\n", "
\n", " Capital: Honiara
\n", " Population: 559198
\n", " Area (km2): 28450.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Seychelles\n", "

\n", "
\n", " Capital: Victoria
\n", " Population: 88340
\n", " Area (km2): 455.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Sudan\n", "

\n", "
\n", " Capital: Khartoum
\n", " Population: 35000000
\n", " Area (km2): 1861484.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Sweden\n", "

\n", "
\n", " Capital: Stockholm
\n", " Population: 9828655
\n", " Area (km2): 449964.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Singapore\n", "

\n", "
\n", " Capital: Singapore
\n", " Population: 4701069
\n", " Area (km2): 692.7
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saint Helena\n", "

\n", "
\n", " Capital: Jamestown
\n", " Population: 7460
\n", " Area (km2): 410.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Slovenia\n", "

\n", "
\n", " Capital: Ljubljana
\n", " Population: 2007000
\n", " Area (km2): 20273.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Svalbard and Jan Mayen\n", "

\n", "
\n", " Capital: Longyearbyen
\n", " Population: 2550
\n", " Area (km2): 62049.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Slovakia\n", "

\n", "
\n", " Capital: Bratislava
\n", " Population: 5455000
\n", " Area (km2): 48845.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Sierra Leone\n", "

\n", "
\n", " Capital: Freetown
\n", " Population: 5245695
\n", " Area (km2): 71740.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " San Marino\n", "

\n", "
\n", " Capital: San Marino
\n", " Population: 31477
\n", " Area (km2): 61.2
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Senegal\n", "

\n", "
\n", " Capital: Dakar
\n", " Population: 12323252
\n", " Area (km2): 196190.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Somalia\n", "

\n", "
\n", " Capital: Mogadishu
\n", " Population: 10112453
\n", " Area (km2): 637657.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Suriname\n", "

\n", "
\n", " Capital: Paramaribo
\n", " Population: 492829
\n", " Area (km2): 163270.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " South Sudan\n", "

\n", "
\n", " Capital: Juba
\n", " Population: 8260490
\n", " Area (km2): 644329.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " São Tomé and Príncipe\n", "

\n", "
\n", " Capital: São Tomé
\n", " Population: 175808
\n", " Area (km2): 1001.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " El Salvador\n", "

\n", "
\n", " Capital: San Salvador
\n", " Population: 6052064
\n", " Area (km2): 21040.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Sint Maarten\n", "

\n", "
\n", " Capital: Philipsburg
\n", " Population: 37429
\n", " Area (km2): 21.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Syria\n", "

\n", "
\n", " Capital: Damascus
\n", " Population: 22198110
\n", " Area (km2): 185180.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Swaziland\n", "

\n", "
\n", " Capital: Mbabane
\n", " Population: 1354051
\n", " Area (km2): 17363.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Turks and Caicos Islands\n", "

\n", "
\n", " Capital: Cockburn Town
\n", " Population: 20556
\n", " Area (km2): 430.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Chad\n", "

\n", "
\n", " Capital: N'Djamena
\n", " Population: 10543464
\n", " Area (km2): 1284000.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " French Southern Territories\n", "

\n", "
\n", " Capital: Port-aux-Français
\n", " Population: 140
\n", " Area (km2): 7829.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Togo\n", "

\n", "
\n", " Capital: Lomé
\n", " Population: 6587239
\n", " Area (km2): 56785.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Thailand\n", "

\n", "
\n", " Capital: Bangkok
\n", " Population: 67089500
\n", " Area (km2): 514000.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Tajikistan\n", "

\n", "
\n", " Capital: Dushanbe
\n", " Population: 7487489
\n", " Area (km2): 143100.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Tokelau\n", "

\n", "
\n", " Capital: None
\n", " Population: 1466
\n", " Area (km2): 10.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " East Timor\n", "

\n", "
\n", " Capital: Dili
\n", " Population: 1154625
\n", " Area (km2): 15007.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Turkmenistan\n", "

\n", "
\n", " Capital: Ashgabat
\n", " Population: 4940916
\n", " Area (km2): 488100.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Tunisia\n", "

\n", "
\n", " Capital: Tunis
\n", " Population: 10589025
\n", " Area (km2): 163610.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Tonga\n", "

\n", "
\n", " Capital: Nuku'alofa
\n", " Population: 122580
\n", " Area (km2): 748.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Turkey\n", "

\n", "
\n", " Capital: Ankara
\n", " Population: 77804122
\n", " Area (km2): 780580.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Trinidad and Tobago\n", "

\n", "
\n", " Capital: Port of Spain
\n", " Population: 1228691
\n", " Area (km2): 5128.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Tuvalu\n", "

\n", "
\n", " Capital: Funafuti
\n", " Population: 10472
\n", " Area (km2): 26.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Taiwan\n", "

\n", "
\n", " Capital: Taipei
\n", " Population: 22894384
\n", " Area (km2): 35980.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Tanzania\n", "

\n", "
\n", " Capital: Dodoma
\n", " Population: 41892895
\n", " Area (km2): 945087.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Ukraine\n", "

\n", "
\n", " Capital: Kiev
\n", " Population: 45415596
\n", " Area (km2): 603700.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Uganda\n", "

\n", "
\n", " Capital: Kampala
\n", " Population: 33398682
\n", " Area (km2): 236040.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " U.S. Minor Outlying Islands\n", "

\n", "
\n", " Capital: None
\n", " Population: 0
\n", " Area (km2): 0.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " United States\n", "

\n", "
\n", " Capital: Washington
\n", " Population: 310232863
\n", " Area (km2): 9629091.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Uruguay\n", "

\n", "
\n", " Capital: Montevideo
\n", " Population: 3477000
\n", " Area (km2): 176220.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Uzbekistan\n", "

\n", "
\n", " Capital: Tashkent
\n", " Population: 27865738
\n", " Area (km2): 447400.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Vatican City\n", "

\n", "
\n", " Capital: Vatican City
\n", " Population: 921
\n", " Area (km2): 0.44
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Saint Vincent and the Grenadines\n", "

\n", "
\n", " Capital: Kingstown
\n", " Population: 104217
\n", " Area (km2): 389.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Venezuela\n", "

\n", "
\n", " Capital: Caracas
\n", " Population: 27223228
\n", " Area (km2): 912050.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " British Virgin Islands\n", "

\n", "
\n", " Capital: Road Town
\n", " Population: 21730
\n", " Area (km2): 153.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " U.S. Virgin Islands\n", "

\n", "
\n", " Capital: Charlotte Amalie
\n", " Population: 108708
\n", " Area (km2): 352.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Vietnam\n", "

\n", "
\n", " Capital: Hanoi
\n", " Population: 89571130
\n", " Area (km2): 329560.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Vanuatu\n", "

\n", "
\n", " Capital: Port Vila
\n", " Population: 221552
\n", " Area (km2): 12200.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Wallis and Futuna\n", "

\n", "
\n", " Capital: Mata-Utu
\n", " Population: 16025
\n", " Area (km2): 274.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Samoa\n", "

\n", "
\n", " Capital: Apia
\n", " Population: 192001
\n", " Area (km2): 2944.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Kosovo\n", "

\n", "
\n", " Capital: Pristina
\n", " Population: 1800000
\n", " Area (km2): 10908.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Yemen\n", "

\n", "
\n", " Capital: Sanaa
\n", " Population: 23495361
\n", " Area (km2): 527970.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Mayotte\n", "

\n", "
\n", " Capital: Mamoudzou
\n", " Population: 159042
\n", " Area (km2): 374.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " South Africa\n", "

\n", "
\n", " Capital: Pretoria
\n", " Population: 49000000
\n", " Area (km2): 1219912.0
\n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Zambia\n", "

\n", "
\n", " Capital: Lusaka
\n", " Population: 13460305
\n", " Area (km2): 752614.0
\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", "
\n", "

\n", " \n", " Zimbabwe\n", "

\n", "
\n", " Capital: Harare
\n", " Population: 11651858
\n", " Area (km2): 390580.0
\n", "
\n", "
\n", " \n", " \n", "
\n", " \n", " \n", "\n", "
\n", "
\n", "\n", "
\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", " Lessons and Videos © Hartley Brody 2023\n", "
\n", "
\n", "
\n", "
\n", " \n", "\n", " \n", " \n", "\n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", "\n" ] } ], "source": [ "# Getting the html content in Python\n", "# (commonly passed into BeautifulSoup, see the following slides)\n", "import requests\n", "\n", "response = requests.get('https://www.scrapethissite.com/pages/simple/')\n", "print(response.text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Tip: save that html file!\n", "\n", "Websites change over time. If your results depend on a page's exact content, consider storing the raw HTML source.\n", "\n", "(for example, allrecipes.com has changed since I last taught this lesson, arg!)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# BeautifulSoup allows us to make sense of this HTML mess\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Requirement already satisfied: bs4 in /home/matt/.local/lib/python3.10/site-packages (0.0.1)\n", "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from bs4) (4.11.2)\n", "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->bs4) (2.3.2.post1)\n" ] } ], "source": [ "!pip3 install bs4" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# name the parser explicitly so the result doesn't depend on which parsers are installed\n", "soup = BeautifulSoup(s_html, 'html.parser')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "

Heading 1

\n", "

This is what heading 2 looks like

\n", "

Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.

\n", "

Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just written as a single line.

\n", "

Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attirbute of the a tag that specify where the link points to.

\n", "\n", "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[

Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.

,\n", "

Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.

,\n", "

Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attirbute of the a tag that specify where the link points to.

]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## getting elements by their tag name:\n", "soup.find_all('p')\n", "\n", "# find_all returns a list, where each element is an instance of the specified tag" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.\n", "------\n", "Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.\n", "------\n", "Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attirbute of the a tag that specify where the link points to.\n", "------\n" ] } ], "source": [ "for paragraph in soup.find_all('p'):\n", " # text is a property of a soup object\n", " print(paragraph.text) \n", " print('------')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# `.find_all()` on subtrees of soup object\n", "\n", "Note to self: write out tree structure below\n", "\n", "```html\n", "\n", " \n", "

The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing

\n", " \n", "

The links in this paragraph point to Internet browsers, like Firefox, Chrome, Opera

.\n", " \n", "\n", "```\n", "\n", "# What if we only wanted links from the first paragraph?\n", "\n", "The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "s_html = \"\"\"\n", "\n", " \n", "

The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing

\n", " \n", "

The links in this paragraph point to Internet browsers, like Firefox, Chrome, Opera

.\n", " \n", "\n", "\"\"\"\n", "\n", "# write this to a webpage (to see what it looks like)\n", "with open('simple_page1.html', 'w') as f:\n", " print(s_html, file=f)\n", "\n", "# either way, you can parse the html with BeautifulSoup\n", "soup = BeautifulSoup(s_html)\n", "\n", "# finding all paragraphs:\n", "p_all = soup.find_all('p')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the webpage we just wrote:\n", "\n", "[simple_page1.html](simple_page1.html)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# getting the first paragraph\n", "p_first = p_all[0]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "

The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing

" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_first" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[DuckDuckGo, Google, Bing]\n" ] } ], "source": [ "# getting the links from the first paragraph:\n", "links_p_first = p_first.find_all('a')\n", "\n", "print(links_p_first)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### syntactic sugar: \n", "To get the first tag under a soup object, refer to it as an attribute" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "

The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing

" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# below is equivilent to soup.find_all('p')[0]\n", "soup.p" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[DuckDuckGo, Google, Bing]\n" ] } ], "source": [ "# so we can condense our code as\n", "plinks = soup.p.find_all('a')\n", "print(plinks)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DuckDuckGo\n", "Firefox\n" ] } ], "source": [ "# iterating over tags\n", "for par in soup.find_all('p'):\n", " print(par.a)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DuckDuckGo\n" ] } ], "source": [ "# and the first link in that paragraph can be accessed like this:\n", "link = soup.p.a\n", "print(link)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Identifying if tags exist" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "

The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing

\n", "

The links in this paragraph point to Internet browsers, like Firefox, Chrome, Opera

.\n", " \n", "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# note: there is no \"h3\" tag below\n", "soup" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# what if we're trying to access an element that doesn't exist?\n", "header = soup.h3\n", "header is None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test if a tag exists in a soup object by looking for the first instance of this tag and comparing it to `None`" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tag h3 doesnt exist in soup\n" ] } ], "source": [ "if soup.h3 is None:\n", " print(\"tag h3 doesnt exist in soup\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Putting it together:\n", "# Goal: get all cheese recipes!\n", "\n", "Just the recipe name & a link to its page now. Later, we'll visit the page to get more info on each. \n", "\n", "[https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# get soup\n", "url = 'https://www.allrecipes.com/search?q=cheese'\n", "response = requests.get(url)\n", "soup = BeautifulSoup(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our **goal** is to get a list of recipes. Each recipe on the page is a link, so maybe we should find all the `a` tags?"
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "238" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# that seems like too many recipes ...\n", "len(soup.find_all('a'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Finding tags by `class_`\n", "\n", "How to locate a particular part of a web page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A tag can belong to multiple \"classes\". For example, in [https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese) the first recipe is encapsulated in this html tag:\n", "\n", " \n", " \n", " \n", "So this particular `a` tag belongs to classes:\n", "- `comp`\n", "- `mntl-card-list-items`\n", "- `mntl-document-card`\n", "- `card`\n", "- `card--no-image`\n", " \n", "I suspect our target recipes belong to the `mntl-card-list-items` class (I'm guessing a bit). Let's find them all:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "24" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(soup.find_all('a', class_='mntl-card-list-items'))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'m!ss!ss!pp!'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# quick demo of str.replace(), which we use below to remove newlines\n", "'mississippi'.replace('i', '!')" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yellow Cheese vs. White Cheese: Why the Different Colors? 
\n", "Cheese Curds Make Amazing, Extra-Gooey Grilled Cheese Sandwiches\n", "Where Did American Cheese Come From (And Is It Even Cheese)?\n", "SaveSouthern Pimento Cheese1,020Ratings\n", "Chef John's Classic Cheese Fondue Is the Ultimate Cheese Lover's Recipe\n", "Hundreds of Pounds of Brie and Camembert Cheese Recalled Due to Possible Listeria Contamination\n", "Annie's Mac & Cheese and Smartfood Popcorn Have More in Common Than You Think\n", "SaveBasic Cream Cheese Frosting1,645Ratings\n", "SaveGrilled Cheese Sandwich855Ratings\n", "SaveHomemade Mac and Cheese2,642Ratings\n", "SaveBest Cheese Ball234Ratings\n", "SaveSimple Macaroni and Cheese965Ratings\n", "SaveBaked Mac and Cheese with Sour Cream and Cottage Cheese59Ratings\n", "SaveAbsolutely the BEST Rich and Creamy Blue Cheese Dressing Ever!536Ratings\n", "What Is Cottage Cheese and How Is It Made?\n", "SaveBaked Ham and Cheese Sliders973Ratings\n", "Kraft Is Giving Away Incense So Your Place Can Smell Like Grilled Cheese All the Time\n", "SaveJalapeño Popper Grilled Cheese Sandwich200Ratings\n", "Bread Cheese Is the Best Cheese You Haven't Tried Yet\n", "SaveCheese Sauce for Broccoli and Cauliflower473Ratings\n", "SaveNacho Cheese Sauce680Ratings\n", "SavePumpkin Bars with Cream Cheese Frosting148Ratings\n", "The Right Way To Wrap And Store Cheese\n", "SaveChef John's Creamy Blue Cheese Dressing127Ratings\n" ] } ], "source": [ "recipe_list = list()\n", "for tag in soup.find_all('a', class_='mntl-card-list-items'):\n", " # note: string processing methods reviewed / covered shortly\n", " print(tag.text.replace('\\n', ''))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A problem\n", "\n", "We're getting closer ... but \"The Right Way To Wrap And Store Cheese\" isn't really a recipe, is it?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# An insight (and solution)\n", "\n", "\n", "\n", "Only the recipes have ratings.\n", "\n", "In HTML-speak, only the recipes contain an `svg` tag whose class is \"icon-star\"" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "recipe_list = list()\n", "for tag in soup.find_all('a', class_='mntl-card-list-items'):\n", "    # search within tag to find all star icons\n", "    star_list = tag.find_all('svg', class_='icon-star')\n", "    if star_list:\n", "        # at least one star icon was found, so this is a real recipe\n", "        recipe_list.append(tag)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "['SaveSouthern Pimento Cheese1,020Ratings',\n", " 'SaveBasic Cream Cheese Frosting1,645Ratings',\n", " 'SaveGrilled Cheese Sandwich855Ratings',\n", " 'SaveHomemade Mac and Cheese2,642Ratings',\n", " 'SaveBest Cheese Ball234Ratings',\n", " 'SaveSimple Macaroni and Cheese965Ratings',\n", " 'SaveBaked Mac and Cheese with Sour Cream and Cottage Cheese59Ratings',\n", " 'SaveAbsolutely the BEST Rich and Creamy Blue Cheese Dressing Ever!536Ratings',\n", " 'SaveBaked Ham and Cheese Sliders973Ratings',\n", " 'SaveJalapeño Popper Grilled Cheese Sandwich200Ratings',\n", " 'SaveCheese Sauce for Broccoli and Cauliflower473Ratings',\n", " 'SaveNacho Cheese Sauce680Ratings',\n", " 'SavePumpkin Bars with Cream Cheese Frosting148Ratings',\n", " \"SaveChef John's Creamy Blue Cheese Dressing127Ratings\"]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# looks pretty good\n", "# (well ... 
the Save 1,020Ratings isn't great but at least they're all recipes below)\n", "[tag.text.replace('\\n', '') for tag in recipe_list]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Finding tags by `id`\n", "\n", "Nearly the same as finding by class, but you'll look for `id=` in the html and pass it to the `id` keyword of `soup.find_all()`.\n", "\n", "**Goal**: Get the footer from: https://www.scrapethissite.com/\n", "\n", "\n", "\n", "```html\n", "
\n", "
\n", "
\n", "
\n", " Lessons and Videos © Hartley Brody 2018\n", "
\n", "
\n", "
\n", "
\n", "```" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# get soup from url\n", "url = 'https://www.scrapethissite.com/'\n", "html = requests.get(url).text\n", "soup = BeautifulSoup(html)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
\n", "
\n", "
\n", "
\n", " Lessons and Videos © Hartley Brody 2023\n", "
\n", "
\n", "
\n", "
]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all(id='footer')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Note that you can combine all searches shown above:\n", "- tag\n", " - p (paragraph)\n", " - a (link)\n", " - div ...\n", "- tag class\n", "- tag id\n", "\n", "```python\n", "# finds all links (tag type = 'a'), with given class and id\n", "soup.find_all('a', class_='fancy-link', id='blue')\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# What if I don't like cheese?\n", "\n", "First off, really? It's delicious!\n", "\n", "But if you insist on searching for some other ingredient, try swapping out \"cheese\" in the url below:\n", "\n", "[https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## In Class Assignment 1\n", "\n", "**Goal:** Formalize a pipeline to scrape this site\n", "\n", "https://www.allrecipes.com/search?q=cheese\n", " \n", "1. Write `extract_recipes(s_query)` which:\n", " * takes the search phrase (e.g. 
'cheese') as input argument\n", " * builds the correct url that leads directly to the page that lists the recipes\n", " * uses `requests` to get the content of this page as an html string\n", " * builds a BeautifulSoup object out of that text \n", " * finds names of all recipes\n", " - to identify which tags / classes to `find_all()`, open the page in your browser and \"inspect\" \n", " - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself\n", " * returns a dataframe with a single column \"name\"\n", " * the names of the recipes might be a bit mangled for now, with \"Save\" and \"1,243Ratings\" stuck to them, that's ok \n", " * we'll want to add more features to this dataframe later, building it up as a list of dictionaries (one per row) allows us to extend to other features easily:\n", " \n", "```python\n", "row_list = list()\n", "for recipe in recipe_list:\n", "    # build a dictionary representing this recipe (row)\n", "    name = recipe.text.replace('\\n', '')\n", "    d = {'name': name}\n", "    row_list.append(d)\n", " \n", "df = pd.DataFrame(row_list)\n", "```" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "def extract_recipes(s_query):\n", " \"\"\" builds list of recipe names from allrecipes html\n", " \n", " Args:\n", " s_query (str): input query (e.g. 
\"cheese\")\n", " \n", " Returns:\n", " df_recipe (pd.DataFrame): each row is a recipe\n", " \"\"\"\n", " \n", " # build soup object from search query\n", " url = f'https://www.allrecipes.com/search?q={s_query}'\n", " s_html = requests.get(url).text\n", " soup = BeautifulSoup(s_html)\n", " \n", " # get a list of recipe tags\n", " recipe_list = list()\n", " for tag in soup.find_all('a', class_='mntl-card-list-items'):\n", " # search within tag to find all star icons\n", " star_list = tag.find_all('svg', class_='icon-star')\n", " if star_list:\n", " # some star icon is found, store this as its a real recipe\n", " recipe_list.append(tag)\n", " \n", " # extract features to build dataframe\n", " row_list = list()\n", " for recipe in recipe_list:\n", " name = recipe.text.replace('\\n', '')\n", " row_list.append({'name': name})\n", " \n", " return pd.DataFrame(row_list)\n", " " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
0SaveThe Best Caramel Apples140Ratings
1SaveSauteed Apples1,791Ratings
2SaveCaramel Apples270Ratings
3SaveGourmet Caramel Apples97Ratings
4SaveBaked Apples with Oatmeal Filling114Ratings
5SaveSouthern Fried Apples262Ratings
6SaveGrilled Sweet Potatoes with Apples128Ratings
7SaveGrilled Sausages with Caramelized Onions a...
8SaveRed Cabbage and Apples187Ratings
9SaveMicrowave Baked Apples157Ratings
10SaveBaked Apples305Ratings
11SaveCandied Apples162Ratings
12SavePork Chops with Apples and Raisins109Ratings
13SavePork Chops with Apples, Onions, and Sweet ...
14SaveHerbed Pork and Apples248Ratings
15SaveSmushed Apples and Sweet Potatoes253Ratings
16SaveChicken Salad with Apples, Grapes, and Wal...
17SaveNo-Bake Cheesecake with Cool Whip and Appl...
18SaveRoasted Butternut Squash Soup with Apples ...
\n", "
" ], "text/plain": [ " name\n", "0 SaveThe Best Caramel Apples140Ratings\n", "1 SaveSauteed Apples1,791Ratings\n", "2 SaveCaramel Apples270Ratings\n", "3 SaveGourmet Caramel Apples97Ratings\n", "4 SaveBaked Apples with Oatmeal Filling114Ratings\n", "5 SaveSouthern Fried Apples262Ratings\n", "6 SaveGrilled Sweet Potatoes with Apples128Ratings\n", "7 SaveGrilled Sausages with Caramelized Onions a...\n", "8 SaveRed Cabbage and Apples187Ratings\n", "9 SaveMicrowave Baked Apples157Ratings\n", "10 SaveBaked Apples305Ratings\n", "11 SaveCandied Apples162Ratings\n", "12 SavePork Chops with Apples and Raisins109Ratings\n", "13 SavePork Chops with Apples, Onions, and Sweet ...\n", "14 SaveHerbed Pork and Apples248Ratings\n", "15 SaveSmushed Apples and Sweet Potatoes253Ratings\n", "16 SaveChicken Salad with Apples, Grapes, and Wal...\n", "17 SaveNo-Bake Cheesecake with Cool Whip and Appl...\n", "18 SaveRoasted Butternut Squash Soup with Apples ..." ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract_recipes('apples')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Todo list\n", "\n", "- extract info from each recipe's page\n", " - get url of each recipe's own page from initial search:\n", " - e.g. 
[https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/](https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/)\n", " - get string of nutrition info on that page\n", " \n", "```\n", " 208\n", " Calories\n", " 20g\n", " Fat\n", " 2g\n", " Carbs\n", " 6g\n", " Protein\n", "```\n", " \n", "\n", "- string processing\n", " - clean up the name of each recipe: \"SaveSouthern Pimento Cheese1,020Ratings\"\n", " - process the string above so it yields clean numbers we can operate on" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Getting info from each recipe's own page:\n", "\n", "When we interact with the webpage in the browser, clicking on the header with the recipe name leads us to the actual recipe. Let's have a look at how it's done:\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "##### repeated from above ...\n", "\n", "# build soup object from search query\n", "url = 'https://www.allrecipes.com/search?q=cheese'\n", "s_html = requests.get(url).text\n", "soup = BeautifulSoup(s_html)\n", "\n", "# get a list of recipe tags\n", "recipe_list = list()\n", "for tag in soup.find_all('a', class_='mntl-card-list-items'):\n", "    # search within tag to find all star icons\n", "    star_list = tag.find_all('svg', class_='icon-star')\n", "    if star_list:\n", "        # at least one star icon was found, so this is a real recipe\n", "        recipe_list.append(tag)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# takeaway: tags have attributes, you can access them\n", "\n", "(including the link address for \"Southern Pimento Cheese\")" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'id': 'mntl-card-list-items_1-0-3',\n", " 'class': ['comp',\n", " 'mntl-card-list-items',\n", " 'mntl-document-card',\n", " 'mntl-card',\n", " 'card',\n", " 'card--no-image'],\n", " 
'data-doc-id': '6663961',\n", " 'data-tax-levels': '',\n", " 'href': 'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/',\n", " 'data-cta': '',\n", " 'data-ordinal': '4'}" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this is the \"a\" tag object shown in the image immediately above\n", "recipe_list[0].attrs" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "recipe_list[0].attrs['href']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Adding `href` to our dataframe of recipes\n", "\n", "Let's modify our `extract_recipes()` function so that, rather than returning just the names of the dishes, it returns a dataframe where each row has both the `name` and the `href` of a recipe:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "def extract_recipes(s_query):\n", " \"\"\" builds dataframe of recipe names and links from allrecipes html\n", " \n", " Args:\n", " s_query (str): input query (e.g. 
\"cheese\")\n", " \n", " Returns:\n", " df_recipe (pd.DataFrame): each row is a recipe\n", " \"\"\"\n", " \n", " # build soup object from search query\n", " url = f'https://www.allrecipes.com/search?q={s_query}'\n", " s_html = requests.get(url).text\n", " soup = BeautifulSoup(s_html)\n", " \n", " # get a list of recipe tags\n", " recipe_list = list()\n", " for tag in soup.find_all('a', class_='mntl-card-list-items'):\n", " # search within tag to find all star icons\n", " star_list = tag.find_all('svg', class_='icon-star')\n", " if star_list:\n", " # some star icon is found, store this as its a real recipe\n", " recipe_list.append(tag)\n", " \n", " # extract features to build dataframe\n", " row_list = list()\n", " for recipe in recipe_list:\n", " name = recipe.text.replace('\\n', '')\n", " row_list.append({'name': name,\n", " 'href': recipe.attrs['href']})\n", " \n", " \n", " return pd.DataFrame(row_list)\n", " " ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namehref
0SaveSouthern Pimento Cheese1,020Ratingshttps://www.allrecipes.com/recipe/189930/south...
1SaveBasic Cream Cheese Frosting1,645Ratingshttps://www.allrecipes.com/recipe/8379/basic-c...
2SaveGrilled Cheese Sandwich855Ratingshttps://www.allrecipes.com/recipe/23891/grille...
3SaveHomemade Mac and Cheese2,642Ratingshttps://www.allrecipes.com/recipe/11679/homema...
4SaveBest Cheese Ball234Ratingshttps://www.allrecipes.com/recipe/16600/herman...
5SaveSimple Macaroni and Cheese965Ratingshttps://www.allrecipes.com/recipe/238691/simpl...
6SaveBaked Mac and Cheese with Sour Cream and C...https://www.allrecipes.com/recipe/229815/baked...
7SaveAbsolutely the BEST Rich and Creamy Blue C...https://www.allrecipes.com/recipe/58745/absolu...
8SaveBaked Ham and Cheese Sliders973Ratingshttps://www.allrecipes.com/recipe/216756/baked...
9SaveJalapeño Popper Grilled Cheese Sandwich200...https://www.allrecipes.com/recipe/217267/jalap...
10SaveCheese Sauce for Broccoli and Cauliflower4...https://www.allrecipes.com/recipe/233481/chees...
11SaveNacho Cheese Sauce680Ratingshttps://www.allrecipes.com/recipe/24738/nacho-...
12SavePumpkin Bars with Cream Cheese Frosting148...https://www.allrecipes.com/recipe/229508/pumpk...
13SaveChef John's Creamy Blue Cheese Dressing127...https://www.allrecipes.com/recipe/232395/chef-...
\n", "
" ], "text/plain": [ " name \\\n", "0 SaveSouthern Pimento Cheese1,020Ratings \n", "1 SaveBasic Cream Cheese Frosting1,645Ratings \n", "2 SaveGrilled Cheese Sandwich855Ratings \n", "3 SaveHomemade Mac and Cheese2,642Ratings \n", "4 SaveBest Cheese Ball234Ratings \n", "5 SaveSimple Macaroni and Cheese965Ratings \n", "6 SaveBaked Mac and Cheese with Sour Cream and C... \n", "7 SaveAbsolutely the BEST Rich and Creamy Blue C... \n", "8 SaveBaked Ham and Cheese Sliders973Ratings \n", "9 SaveJalapeño Popper Grilled Cheese Sandwich200... \n", "10 SaveCheese Sauce for Broccoli and Cauliflower4... \n", "11 SaveNacho Cheese Sauce680Ratings \n", "12 SavePumpkin Bars with Cream Cheese Frosting148... \n", "13 SaveChef John's Creamy Blue Cheese Dressing127... \n", "\n", " href \n", "0 https://www.allrecipes.com/recipe/189930/south... \n", "1 https://www.allrecipes.com/recipe/8379/basic-c... \n", "2 https://www.allrecipes.com/recipe/23891/grille... \n", "3 https://www.allrecipes.com/recipe/11679/homema... \n", "4 https://www.allrecipes.com/recipe/16600/herman... \n", "5 https://www.allrecipes.com/recipe/238691/simpl... \n", "6 https://www.allrecipes.com/recipe/229815/baked... \n", "7 https://www.allrecipes.com/recipe/58745/absolu... \n", "8 https://www.allrecipes.com/recipe/216756/baked... \n", "9 https://www.allrecipes.com/recipe/217267/jalap... \n", "10 https://www.allrecipes.com/recipe/233481/chees... \n", "11 https://www.allrecipes.com/recipe/24738/nacho-... \n", "12 https://www.allrecipes.com/recipe/229508/pumpk... \n", "13 https://www.allrecipes.com/recipe/232395/chef-... " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_recipe = extract_recipes('cheese')\n", "df_recipe" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Todo list: complete\n", "\n", "- extract info from each recipe's page\n", " - get url of each recipe's own page from initial search:\n", " - e.g. 
[https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/](https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/)\n", " - get string of nutrition info on that page\n", " \n", "```\n", " 208\n", " Calories\n", " 20g\n", " Fat\n", " 2g\n", " Carbs\n", " 6g\n", " Protein\n", "```\n", " \n", "# Todo list: \n", "- string processing\n", " - clean up the name of each recipe: \"SaveSouthern Pimento Cheese1,020Ratings\"\n", " - process the string above so it yields clean numbers we can operate on" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## String Manipulations\n", "- `.split()` & `.join()`\n", "- `.strip()`\n", "- `.replace()`\n", "- `.upper()` & `.lower()`\n", "\n", "I find these the most useful, but there are a few more [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) which you might find useful too.\n", "\n", "(++) They're a bit more powerful (read: more complex to learn), but [regular expressions](https://docs.python.org/3/library/re.html) can handle cases where the built-in python string methods above don't work as well." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello!hello!'" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'\\n\\n\\n hello! \\n hello! \\n\\n \\n \\n'.replace('\\n', '').replace(' ', '')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello! \\n hello!'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# strip removes all leading and trailing whitespace (spaces and newlines)\n", "'\\n\\n\\n hello! \\n hello! 
\\n\\n \\n \\n'.strip()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello zeke zeke zeke'" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace swaps every occurrence of the first string for the second\n", "'hello matt matt matt'.replace('matt', 'zeke')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello '" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replacing with an empty string deletes every occurrence\n", "'hello matt'.replace('matt', '')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'DONT SHOUT!'" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# uppercase everything\n", "'dont shout!'.upper()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'dont shout'" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# lowercase everything\n", "'DONT shOUt'.lower()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['fat: 54 g', ' calories: 430 cal', ' sugar: 10g']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# split breaks a string apart at every occurrence of the given separator (',' below)\n", "'fat: 54 g, calories: 430 cal, sugar: 10g'.split(',')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'abcd'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# join glues a list of strings together, separated by the string it is called on ('' below)\n", "''.join(['a', 'b', 'c', 'd'])" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['here',\n", " 'is',\n", " 'some',\n", " 'text',\n", " 'with',\n", " 'a',\n", " 
'whole',\n", " 'bunch',\n", " 'of',\n", " 'spaces',\n", " 'in',\n", " 'the',\n", " 'middle']" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ICA 2 tip: split(), without argument, splits on whitespace (spaces and newlines)\n", "' here is some text with a whole bunch of spaces in the middle'.split()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['',\n", " 'here',\n", " 'is',\n", " 'some',\n", " 'text',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " 'with',\n", " 'a',\n", " 'whole',\n", " 'bunch',\n", " 'of',\n", " 'spaces',\n", " 'in',\n", " 'the',\n", " 'middle']" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# not equivalent: passing a space explicitly yields an empty string for every repeated space\n", "' here is some text with a whole bunch of spaces in the middle'.split(' ')" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('a', 1), ('b', 2), ('c', 3)]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# zip pairs up items from two sequences\n", "list(zip('abc', [1, 2, 3]))" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'last0, first0'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# split a comma-separated string into a list, then re-join the first two items\n", "name_list = 'last0, first0, last1, first1, last2, first2'.split(',')\n", "\n", "', '.join(name_list[:2])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{' first0': 'last0', ' first1': ' last1', ' first2': ' last2'}" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# odd-indexed items (first names) become keys, even-indexed items (last names) become values\n", "dict(zip(name_list[1::2], name_list[::2]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## In 
Class Assignment 2 - Getting Nutritional Information\n", "Write an `extract_nutrition()` function, which accepts the url of a particular recipe (see the example directly above) and returns a dictionary of nutritional information:\n", "\n", "```python\n", "url = 'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/'\n", "extract_nutrition(url)\n", "\n", "```\n", "\n", "yields:\n", "\n", "```python\n", "{'Calories': '208',\n", " 'Fat': '20g',\n", " 'Carbs': '2g',\n", " 'Protein': '6g'}\n", "\n", "```\n", "\n", "Once complete, incorporate `extract_nutrition()` into `extract_recipes()` (see the todo list above).\n" ] },
{ "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "def extract_nutrition(url):\n", "    \"\"\" returns a dictionary of nutrition info\n", "\n", "    Args:\n", "        url (str): url of a single allrecipes recipe page\n", "\n", "    Returns:\n", "        nutrit_dict (dict): keys are nutrient names ('Fat'),\n", "            vals are str of quantity ('20g')\n", "    \"\"\"\n", "    # get html, build soup\n", "    html = requests.get(url).text\n", "    soup = BeautifulSoup(html)\n", "\n", "    # extract nutrition info\n", "    str_nutrit = soup.find_all(class_='mntl-nutrition-facts-summary__table-body')[0].text\n", "\n", "    # make dictionary from alternating items (0 is first value, 1 is first key, 2 is second value ...)\n", "    nutrit_list = str_nutrit.split()\n", "    nutrit_dict = dict(zip(nutrit_list[1::2],\n", "                           nutrit_list[0::2]))\n", "    return nutrit_dict" ] },
{ "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(array([1, 1]), array([1, 2])), (array([0, 2]), array([3, 4]))]" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "from colorama import Fore, Style\n", "\n", "player = np.array([[1, 1], [0, 2]])\n", "size = np.array([[1, 2], [3, 4]])\n", "\n", "# build a string of the board, coloring each size by the player who owns it\n", "s_board = ''\n", "for row_idx in range(player.shape[0]):\n", "    _player = player[row_idx, :]\n", "    _size = size[row_idx, :]\n", "\n", "    for p, s in zip(_player, _size):\n", "        if p == 1:\n", "            s_board += f'{Fore.GREEN}{s}{Style.RESET_ALL}'\n", "        elif p == 2:\n", "            s_board += f'{Fore.RED}{s}{Style.RESET_ALL}'\n", "        else:\n", "            s_board += '0'\n", "\n", "    s_board += '\\n'\n", " " ] },
{ "cell_type": "code", "execution_count": 88, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# get soup from url\n", "url = 'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/'\n", "html = requests.get(url).text\n", "soup = BeautifulSoup(html)\n", "\n", "# extract nutrition info\n", "str_nutrit = soup.find_all(class_='mntl-nutrition-facts-summary__table-body')[0].text" ] },
{ "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Calories', 'Fat', 'Carbs', 'Protein']" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_nutrit.split()[1::2]" ] },
{ "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Requirement already satisfied: tqdm in /home/matt/.local/lib/python3.10/site-packages (4.64.1)\n" ] } ], "source": [ "# tqdm is a progress bar, not necessary, but fun to see once\n", "# (scraping often takes a moment, nice to get some updates)\n", "!pip3 install tqdm" ] },
{ "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm\n", "\n", "def extract_recipes(s_query):\n", "    \"\"\" builds dataframe of recipes (with nutrition info) from allrecipes search html\n", "\n", "    Args:\n", "        s_query (str): input query (e.g. \"cheese\")\n", "\n", "    Returns:\n", "        df_recipe (pd.DataFrame): each row is a recipe\n", "    \"\"\"\n", "    # build soup object from search query\n", "    url = f'https://www.allrecipes.com/search?q={s_query}'\n", "    s_html = requests.get(url).text\n", "    soup = BeautifulSoup(s_html)\n", "\n", "    # get a list of recipe tags\n", "    recipe_list = list()\n", "    for tag in soup.find_all('a', class_='mntl-card-list-items'):\n", "        # search within tag to find all star icons\n", "        star_list = tag.find_all('svg', class_='icon-star')\n", "        if star_list:\n", "            # some star icon is found, store this as it's a real recipe\n", "            recipe_list.append(tag)\n", "\n", "    # extract features to build dataframe\n", "    row_list = list()\n", "    for recipe in tqdm(recipe_list, desc='getting nutrition per recipe'):\n", "        # extract name & url\n", "        name = recipe.text.replace('\\n', '').replace('Save', '')\n", "        url = recipe.attrs['href']\n", "\n", "        # lookup nutrition info\n", "        d = extract_nutrition(url)\n", "        d['name'] = name\n", "        d['url'] = url\n", "\n", "        row_list.append(d)\n", "\n", "    return pd.DataFrame(row_list)\n", " " ] },
{ "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 100/100 [00:05<00:00, 19.74it/s]\n" ] } ], "source": [ "from tqdm import tqdm\n", "from time import sleep\n", "\n", "for idx in tqdm(range(100)):\n", "    sleep(.05)" ] },
{ "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "getting nutrition per recipe: 100%|██████████| 14/14 [00:04<00:00, 2.85it/s]\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Calories</th><th>Fat</th><th>Carbs</th><th>Protein</th><th>name</th><th>url</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>208</td><td>20g</td><td>2g</td><td>6g</td><td>Southern Pimento Cheese1,020Ratings</td><td>https://www.allrecipes.com/recipe/189930/south...</td></tr>\n",
"    <tr><th>1</th><td>292</td><td>14g</td><td>40g</td><td>2g</td><td>Basic Cream Cheese Frosting1,645Ratings</td><td>https://www.allrecipes.com/recipe/8379/basic-c...</td></tr>\n",
"    <tr><th>2</th><td>400</td><td>28g</td><td>26g</td><td>11g</td><td>Grilled Cheese Sandwich855Ratings</td><td>https://www.allrecipes.com/recipe/23891/grille...</td></tr>\n",
"    <tr><th>3</th><td>845</td><td>48g</td><td>65g</td><td>37g</td><td>Homemade Mac and Cheese2,642Ratings</td><td>https://www.allrecipes.com/recipe/11679/homema...</td></tr>\n",
"    <tr><th>4</th><td>413</td><td>39g</td><td>4g</td><td>15g</td><td>Best Cheese Ball234Ratings</td><td>https://www.allrecipes.com/recipe/16600/herman...</td></tr>\n",
"    <tr><th>5</th><td>630</td><td>34g</td><td>55g</td><td>27g</td><td>Simple Macaroni and Cheese965Ratings</td><td>https://www.allrecipes.com/recipe/238691/simpl...</td></tr>\n",
"    <tr><th>6</th><td>415</td><td>23g</td><td>30g</td><td>22g</td><td>Baked Mac and Cheese with Sour Cream and Cotta...</td><td>https://www.allrecipes.com/recipe/229815/baked...</td></tr>\n",
"    <tr><th>7</th><td>93</td><td>9g</td><td>1g</td><td>3g</td><td>Absolutely the BEST Rich and Creamy Blue Chees...</td><td>https://www.allrecipes.com/recipe/58745/absolu...</td></tr>\n",
"    <tr><th>8</th><td>208</td><td>14g</td><td>11g</td><td>10g</td><td>Baked Ham and Cheese Sliders973Ratings</td><td>https://www.allrecipes.com/recipe/216756/baked...</td></tr>\n",
"    <tr><th>9</th><td>528</td><td>34g</td><td>41g</td><td>17g</td><td>Jalapeño Popper Grilled Cheese Sandwich200Ratings</td><td>https://www.allrecipes.com/recipe/217267/jalap...</td></tr>\n",
"    <tr><th>10</th><td>178</td><td>14g</td><td>4g</td><td>9g</td><td>Cheese Sauce for Broccoli and Cauliflower473Ra...</td><td>https://www.allrecipes.com/recipe/233481/chees...</td></tr>\n",
"    <tr><th>11</th><td>282</td><td>23g</td><td>7g</td><td>14g</td><td>Nacho Cheese Sauce680Ratings</td><td>https://www.allrecipes.com/recipe/24738/nacho-...</td></tr>\n",
"    <tr><th>12</th><td>373</td><td>20g</td><td>48g</td><td>2g</td><td>Pumpkin Bars with Cream Cheese Frosting148Ratings</td><td>https://www.allrecipes.com/recipe/229508/pumpk...</td></tr>\n",
"    <tr><th>13</th><td>214</td><td>22g</td><td>3g</td><td>3g</td><td>Chef John's Creamy Blue Cheese Dressing127Ratings</td><td>https://www.allrecipes.com/recipe/232395/chef-...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Calories Fat Carbs Protein \\\n", "0 208 20g 2g 6g \n", "1 292 14g 40g 2g \n", "2 400 28g 26g 11g \n", "3 845 48g 65g 37g \n", "4 413 39g 4g 15g \n", "5 630 34g 55g 27g \n", "6 415 23g 30g 22g \n", "7 93 9g 1g 3g \n", "8 208 14g 11g 10g \n", "9 528 34g 41g 17g \n", "10 178 14g 4g 9g \n", "11 282 23g 7g 14g \n", "12 373 20g 48g 2g \n", "13 214 22g 3g 3g \n", "\n", " name \\\n", "0 Southern Pimento Cheese1,020Ratings \n", "1 Basic Cream Cheese Frosting1,645Ratings \n", "2 Grilled Cheese Sandwich855Ratings \n", "3 Homemade Mac and Cheese2,642Ratings \n", "4 Best Cheese Ball234Ratings \n", "5 Simple Macaroni and Cheese965Ratings \n", "6 Baked Mac and Cheese with Sour Cream and Cotta... \n", "7 Absolutely the BEST Rich and Creamy Blue Chees... \n", "8 Baked Ham and Cheese Sliders973Ratings \n", "9 Jalapeño Popper Grilled Cheese Sandwich200Ratings \n", "10 Cheese Sauce for Broccoli and Cauliflower473Ra... \n", "11 Nacho Cheese Sauce680Ratings \n", "12 Pumpkin Bars with Cream Cheese Frosting148Ratings \n", "13 Chef John's Creamy Blue Cheese Dressing127Ratings \n", "\n", " url \n", "0 https://www.allrecipes.com/recipe/189930/south... \n", "1 https://www.allrecipes.com/recipe/8379/basic-c... \n", "2 https://www.allrecipes.com/recipe/23891/grille... \n", "3 https://www.allrecipes.com/recipe/11679/homema... \n", "4 https://www.allrecipes.com/recipe/16600/herman... \n", "5 https://www.allrecipes.com/recipe/238691/simpl... \n", "6 https://www.allrecipes.com/recipe/229815/baked... \n", "7 https://www.allrecipes.com/recipe/58745/absolu... \n", "8 https://www.allrecipes.com/recipe/216756/baked... \n", "9 https://www.allrecipes.com/recipe/217267/jalap... \n", "10 https://www.allrecipes.com/recipe/233481/chees... \n", "11 https://www.allrecipes.com/recipe/24738/nacho-... \n", "12 https://www.allrecipes.com/recipe/229508/pumpk... \n", "13 https://www.allrecipes.com/recipe/232395/chef-... 
" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = extract_recipes('cheese')\n", "df" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "def strip_g(s):\n", " return float(s.replace('g', ''))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "# just playing\n", "x_feat_list = ['Fat', 'Carbs', 'Protein']\n", "for feat in x_feat_list:\n", " df[feat] = df[feat].map(strip_g)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "\n", "x_feat_list = ['Fat', 'Carbs', 'Protein']\n", "\n", "y = df['Calories'].values\n", "x = df.loc[:, x_feat_list]" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import r2_score\n", "\n", "lin_reg = LinearRegression()\n", "lin_reg.fit(x, y)\n", "y_pred = lin_reg.predict(x)\n", "r2 = r2_score(y_true=y, y_pred=y_pred)\n" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.999681369112089" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r2" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Fat': 8.560508216473371,\n", " 'Carbs': 4.079712522080758,\n", " 'Protein': 4.4359473499832385}" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict(zip(x_feat_list, lin_reg.coef_))" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 
"3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }