{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DS2500 Day 20\n", "\n", "Mar 28, 2023\n", "\n", "### Content\n", "- Web scraping (html parsing & string manipulations)\n", "\n", "### Admin\n", "- lab digest tomorrow\n", "- project\n", " - activate your mentor\n", " - sign up for a meeting slot with me next week\n", " \n", "### Lesson Credit\n", "\n", "Piotr Sapiezynski (https://www.sapiezynski.com/) originally wrote much of this lesson, I've modified it a bit (allrecipes.com has since changed ... arg!)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Web Scraping\n", "* Using programs or scripts to pretend to browse websites, examine the content on those websites, retrieve and extract data from those websites\n", "* Why scrape?\n", " * if an API is available for a service, we will nearly always prefer the API to scraping\n", " * ... but not all services have APIs or the available APIs are too expensive for our project\n", " * newly published information might not yet be available through ready datasets\n", "* Downsides of scraping:\n", " * no reference documentation (unlike APIs)\n", " * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper - this is why ETL is important!)\n", " * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)\n", " * legal and moral greyzone (even if the ToS does not disallow it, somebody has to pay for the traffic and when you're scraping you're not looking at ads)\n", " * ... but everbody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best case scenario\n", "Some webpages publish their data in the form of simple tables. In these (rare) cases we can just use pandas .read_html to scrape this data:\n", "\n", "https://www.espn.com/nba/team/stats/_/name/bos" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "# read html extracts all the
\n", " | Name | \n", "
---|---|
0 | \n", "Jayson Tatum SF | \n", "
1 | \n", "Jaylen Brown SG | \n", "
2 | \n", "Malcolm Brogdon PG | \n", "
3 | \n", "Derrick White PG | \n", "
4 | \n", "Marcus Smart PG | \n", "
5 | \n", "Al Horford C | \n", "
6 | \n", "Grant Williams PF | \n", "
7 | \n", "Robert Williams III C | \n", "
8 | \n", "Sam Hauser SF | \n", "
9 | \n", "Mike Muscala C * | \n", "
10 | \n", "Payton Pritchard PG | \n", "
11 | \n", "Blake Griffin PF | \n", "
12 | \n", "Luke Kornet C | \n", "
13 | \n", "JD Davison SG | \n", "
14 | \n", "Noah Vonleh PF | \n", "
15 | \n", "Mfiondu Kabengele C | \n", "
16 | \n", "Justin Jackson SF | \n", "
17 | \n", "Total | \n", "
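The table above is one of the four DataFrames that `read_html` returned. A minimal sketch of pulling individual tables out of that list; the indices here are an assumption based on the outputs shown (names first, per-player stats second):

```python
# pd.read_html returned a plain python list, so individual tables
# are just indexed out of it
df_names = df_list[0]   # player names (shown above)
df_stats = df_list[1]   # per-player stats (shown below)

df_names.head()
```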
\n", " | GP | \n", "GS | \n", "MIN | \n", "PTS | \n", "OR | \n", "DR | \n", "REB | \n", "AST | \n", "STL | \n", "BLK | \n", "TO | \n", "PF | \n", "AST/TO | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "69 | \n", "69.0 | \n", "37.3 | \n", "30.1 | \n", "1.1 | \n", "7.8 | \n", "8.9 | \n", "4.7 | \n", "1.0 | \n", "0.7 | \n", "3.0 | \n", "2.1 | \n", "1.6 | \n", "
1 | \n", "63 | \n", "63.0 | \n", "36.1 | \n", "27.0 | \n", "1.2 | \n", "5.7 | \n", "7.0 | \n", "3.4 | \n", "1.1 | \n", "0.4 | \n", "2.9 | \n", "2.6 | \n", "1.2 | \n", "
2 | \n", "62 | \n", "0.0 | \n", "25.8 | \n", "14.6 | \n", "0.6 | \n", "3.6 | \n", "4.2 | \n", "3.7 | \n", "0.6 | \n", "0.3 | \n", "1.5 | \n", "1.6 | \n", "2.5 | \n", "
3 | \n", "75 | \n", "63.0 | \n", "28.4 | \n", "12.4 | \n", "0.7 | \n", "2.9 | \n", "3.5 | \n", "4.0 | \n", "0.7 | \n", "0.9 | \n", "1.1 | \n", "2.2 | \n", "3.8 | \n", "
4 | \n", "57 | \n", "57.0 | \n", "32.3 | \n", "11.4 | \n", "0.8 | \n", "2.4 | \n", "3.2 | \n", "6.4 | \n", "1.5 | \n", "0.4 | \n", "2.4 | \n", "2.8 | \n", "2.6 | \n", "
5 | \n", "59 | \n", "59.0 | \n", "30.7 | \n", "9.7 | \n", "1.2 | \n", "5.1 | \n", "6.3 | \n", "2.9 | \n", "0.5 | \n", "0.9 | \n", "0.6 | \n", "1.9 | \n", "5.0 | \n", "
6 | \n", "72 | \n", "22.0 | \n", "26.5 | \n", "8.3 | \n", "1.1 | \n", "3.6 | \n", "4.7 | \n", "1.7 | \n", "0.6 | \n", "0.4 | \n", "1.1 | \n", "2.6 | \n", "1.6 | \n", "
7 | \n", "31 | \n", "18.0 | \n", "23.7 | \n", "8.3 | \n", "3.0 | \n", "5.5 | \n", "8.5 | \n", "1.4 | \n", "0.5 | \n", "1.2 | \n", "0.9 | \n", "2.0 | \n", "1.6 | \n", "
8 | \n", "73 | \n", "5.0 | \n", "15.8 | \n", "6.1 | \n", "0.5 | \n", "2.1 | \n", "2.5 | \n", "0.8 | \n", "0.3 | \n", "0.3 | \n", "0.3 | \n", "1.3 | \n", "2.3 | \n", "
9 | \n", "13 | \n", "2.0 | \n", "14.8 | \n", "5.2 | \n", "0.5 | \n", "2.6 | \n", "3.1 | \n", "0.3 | \n", "0.3 | \n", "0.3 | \n", "0.4 | \n", "1.5 | \n", "0.8 | \n", "
10 | \n", "45 | \n", "2.0 | \n", "12.5 | \n", "4.7 | \n", "0.5 | \n", "1.0 | \n", "1.5 | \n", "1.0 | \n", "0.3 | \n", "0.0 | \n", "0.7 | \n", "0.8 | \n", "1.5 | \n", "
11 | \n", "35 | \n", "14.0 | \n", "14.1 | \n", "4.3 | \n", "1.1 | \n", "2.6 | \n", "3.7 | \n", "1.3 | \n", "0.3 | \n", "0.2 | \n", "0.5 | \n", "1.9 | \n", "2.7 | \n", "
12 | \n", "62 | \n", "0.0 | \n", "11.5 | \n", "3.8 | \n", "1.3 | \n", "1.5 | \n", "2.8 | \n", "0.7 | \n", "0.2 | \n", "0.7 | \n", "0.4 | \n", "1.2 | \n", "1.8 | \n", "
13 | \n", "10 | \n", "0.0 | \n", "2.7 | \n", "1.1 | \n", "0.1 | \n", "0.5 | \n", "0.6 | \n", "0.6 | \n", "0.2 | \n", "0.0 | \n", "0.2 | \n", "0.4 | \n", "3.0 | \n", "
14 | \n", "23 | \n", "1.0 | \n", "7.5 | \n", "1.1 | \n", "0.8 | \n", "1.3 | \n", "2.1 | \n", "0.3 | \n", "0.1 | \n", "0.3 | \n", "0.5 | \n", "1.5 | \n", "0.6 | \n", "
15 | \n", "2 | \n", "0.0 | \n", "7.0 | \n", "1.0 | \n", "1.5 | \n", "1.0 | \n", "2.5 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.5 | \n", "1.5 | \n", "0.0 | \n", "
16 | \n", "23 | \n", "0.0 | \n", "4.7 | \n", "0.9 | \n", "0.1 | \n", "0.7 | \n", "0.7 | \n", "0.4 | \n", "0.2 | \n", "0.2 | \n", "0.1 | \n", "0.3 | \n", "4.5 | \n", "
17 | \n", "75 | \n", "NaN | \n", "NaN | \n", "118.1 | \n", "9.6 | \n", "35.7 | \n", "45.3 | \n", "26.5 | \n", "6.4 | \n", "5.2 | \n", "12.7 | \n", "19.2 | \n", "2.1 | \n", "
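The names and the stats line up row for row, so the two tables can be glued together side by side; the combined table below looks like the result of exactly that. A minimal sketch of one way to do it, assuming `pd.concat` along the column axis:

```python
# the two tables describe the same players in the same order,
# so concatenating column-wise (axis=1) pairs each name with its stats
df_bos = pd.concat([df_list[0], df_list[1]], axis=1)
df_bos
```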
\n", " | Name | \n", "GP | \n", "GS | \n", "MIN | \n", "PTS | \n", "OR | \n", "DR | \n", "REB | \n", "AST | \n", "STL | \n", "BLK | \n", "TO | \n", "PF | \n", "AST/TO | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "Jayson Tatum SF | \n", "69 | \n", "69.0 | \n", "37.3 | \n", "30.1 | \n", "1.1 | \n", "7.8 | \n", "8.9 | \n", "4.7 | \n", "1.0 | \n", "0.7 | \n", "3.0 | \n", "2.1 | \n", "1.6 | \n", "
1 | \n", "Jaylen Brown SG | \n", "63 | \n", "63.0 | \n", "36.1 | \n", "27.0 | \n", "1.2 | \n", "5.7 | \n", "7.0 | \n", "3.4 | \n", "1.1 | \n", "0.4 | \n", "2.9 | \n", "2.6 | \n", "1.2 | \n", "
2 | \n", "Malcolm Brogdon PG | \n", "62 | \n", "0.0 | \n", "25.8 | \n", "14.6 | \n", "0.6 | \n", "3.6 | \n", "4.2 | \n", "3.7 | \n", "0.6 | \n", "0.3 | \n", "1.5 | \n", "1.6 | \n", "2.5 | \n", "
3 | \n", "Derrick White PG | \n", "75 | \n", "63.0 | \n", "28.4 | \n", "12.4 | \n", "0.7 | \n", "2.9 | \n", "3.5 | \n", "4.0 | \n", "0.7 | \n", "0.9 | \n", "1.1 | \n", "2.2 | \n", "3.8 | \n", "
4 | \n", "Marcus Smart PG | \n", "57 | \n", "57.0 | \n", "32.3 | \n", "11.4 | \n", "0.8 | \n", "2.4 | \n", "3.2 | \n", "6.4 | \n", "1.5 | \n", "0.4 | \n", "2.4 | \n", "2.8 | \n", "2.6 | \n", "
5 | \n", "Al Horford C | \n", "59 | \n", "59.0 | \n", "30.7 | \n", "9.7 | \n", "1.2 | \n", "5.1 | \n", "6.3 | \n", "2.9 | \n", "0.5 | \n", "0.9 | \n", "0.6 | \n", "1.9 | \n", "5.0 | \n", "
6 | \n", "Grant Williams PF | \n", "72 | \n", "22.0 | \n", "26.5 | \n", "8.3 | \n", "1.1 | \n", "3.6 | \n", "4.7 | \n", "1.7 | \n", "0.6 | \n", "0.4 | \n", "1.1 | \n", "2.6 | \n", "1.6 | \n", "
7 | \n", "Robert Williams III C | \n", "31 | \n", "18.0 | \n", "23.7 | \n", "8.3 | \n", "3.0 | \n", "5.5 | \n", "8.5 | \n", "1.4 | \n", "0.5 | \n", "1.2 | \n", "0.9 | \n", "2.0 | \n", "1.6 | \n", "
8 | \n", "Sam Hauser SF | \n", "73 | \n", "5.0 | \n", "15.8 | \n", "6.1 | \n", "0.5 | \n", "2.1 | \n", "2.5 | \n", "0.8 | \n", "0.3 | \n", "0.3 | \n", "0.3 | \n", "1.3 | \n", "2.3 | \n", "
9 | \n", "Mike Muscala C * | \n", "13 | \n", "2.0 | \n", "14.8 | \n", "5.2 | \n", "0.5 | \n", "2.6 | \n", "3.1 | \n", "0.3 | \n", "0.3 | \n", "0.3 | \n", "0.4 | \n", "1.5 | \n", "0.8 | \n", "
10 | \n", "Payton Pritchard PG | \n", "45 | \n", "2.0 | \n", "12.5 | \n", "4.7 | \n", "0.5 | \n", "1.0 | \n", "1.5 | \n", "1.0 | \n", "0.3 | \n", "0.0 | \n", "0.7 | \n", "0.8 | \n", "1.5 | \n", "
11 | \n", "Blake Griffin PF | \n", "35 | \n", "14.0 | \n", "14.1 | \n", "4.3 | \n", "1.1 | \n", "2.6 | \n", "3.7 | \n", "1.3 | \n", "0.3 | \n", "0.2 | \n", "0.5 | \n", "1.9 | \n", "2.7 | \n", "
12 | \n", "Luke Kornet C | \n", "62 | \n", "0.0 | \n", "11.5 | \n", "3.8 | \n", "1.3 | \n", "1.5 | \n", "2.8 | \n", "0.7 | \n", "0.2 | \n", "0.7 | \n", "0.4 | \n", "1.2 | \n", "1.8 | \n", "
13 | \n", "JD Davison SG | \n", "10 | \n", "0.0 | \n", "2.7 | \n", "1.1 | \n", "0.1 | \n", "0.5 | \n", "0.6 | \n", "0.6 | \n", "0.2 | \n", "0.0 | \n", "0.2 | \n", "0.4 | \n", "3.0 | \n", "
14 | \n", "Noah Vonleh PF | \n", "23 | \n", "1.0 | \n", "7.5 | \n", "1.1 | \n", "0.8 | \n", "1.3 | \n", "2.1 | \n", "0.3 | \n", "0.1 | \n", "0.3 | \n", "0.5 | \n", "1.5 | \n", "0.6 | \n", "
15 | \n", "Mfiondu Kabengele C | \n", "2 | \n", "0.0 | \n", "7.0 | \n", "1.0 | \n", "1.5 | \n", "1.0 | \n", "2.5 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.5 | \n", "1.5 | \n", "0.0 | \n", "
16 | \n", "Justin Jackson SF | \n", "23 | \n", "0.0 | \n", "4.7 | \n", "0.9 | \n", "0.1 | \n", "0.7 | \n", "0.7 | \n", "0.4 | \n", "0.2 | \n", "0.2 | \n", "0.1 | \n", "0.3 | \n", "4.5 | \n", "
17 | \n", "Total | \n", "75 | \n", "NaN | \n", "NaN | \n", "118.1 | \n", "9.6 | \n", "35.7 | \n", "45.3 | \n", "26.5 | \n", "6.4 | \n", "5.2 | \n", "12.7 | \n", "19.2 | \n", "2.1 | \n", "
To understand what we're scraping, let's write a tiny webpage of our own, stored as a python string:

```python
s_html = """
<html>
 <body>
  <p>Text is usually in paragraphs.
     New lines and multiple consecutive whitespace characters are ignored.</p>

  <p>Unlike in Python, indentation is only a good practice here, but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) written as a single line.</p>

  <p>Links are created using the "a" tag:
     <a href="https://www.google.com">Click here to google</a>.
     href is an attribute of the "a" tag that specifies where the link points to.</p>
 </body>
</html>
"""
```

```python
# write this string to a local file "simple_page0.html"
with open('simple_page0.html', 'w') as f:
    print(s_html, file=f)
```

Clicking the link below will open the html page we just wrote:

[simple_page0.html](simple_page0.html)

While it opens in Jupyter, your usual browser will do the trick too (Chrome, Safari, Firefox, etc.)

# HTML is organized as a tree

(Note to self: write out tree structure below)

```html
<html>
 <body>
\n", "\n", "Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.
\n", " \n", "Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attirbute of the a tag that specify where the link points to.
\n", " \n", " \n", " \n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# And now, the internet\n", "\n", "### Observing HTML in a browser\n", "You can see the actual html of a page by selecting \"inspect\" on a page via a right click. Try it out:\n", "\n", "[https://www.scrapethissite.com/pages/simple/](https://www.scrapethissite.com/pages/simple/)\n", "\n", "### Obtaining HTML from a url address\n", "Use `requests.get()` to get the html of a web page into python:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", " \n", " \n", "\n", " A single page that lists information about all the countries in the world. Good for those just get started with web scraping.\n", " Practice looking for patterns in the HTML that will allow you to extract information about each country. Then, build a simple web scraper that makes a request to this page, parses the HTML and prints out each country's name.\n", "
\n", "\n", " There are 4 video lessons that show you how to scrape this page.\n", "
\n", "\n", " \n", " Data via\n", " http://peric.github.io/GetCountries/\n", " \n", "
\n", "Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.
\n", "Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.
\n", "Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attirbute of the a tag that specify where the link points to.
\n", "\n", "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.
,\n", "Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.
,\n", "Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attirbute of the a tag that specify where the link points to.
]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## getting elements by their tag name:\n", "soup.find_all('p')\n", "\n", "# find_all returns a list, where each element is an instance of the specified tag" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text is usually in paragraphs.\n", " New lines and multiple consecutive whitespace characters are ignored.\n", "------\n", "Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.\n", "------\n", "Links are created using the \"a\" tag: \n", " Click here to google.\n", " href is an attirbute of the a tag that specify where the link points to.\n", "------\n" ] } ], "source": [ "for paragraph in soup.find_all('p'):\n", " # text is a property of a soup object\n", " print(paragraph.text) \n", " print('------')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# `.find_all()` on subtrees of soup object\n", "\n", "Note to self: write out tree structure below\n", "\n", "```html\n", "\n", " \n", "The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing
\n", " \n", "The links in this paragraph point to Internet browsers, like Firefox, Chrome, Opera
.\n", " \n", "\n", "```\n", "\n", "# What if we only wanted links from the first paragraph?\n", "\n", "The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "s_html = \"\"\"\n", "\n", " \n", "The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing
\n", " \n", "The links in this paragraph point to Internet browsers, like Firefox, Chrome, Opera
.\n", " \n", "\n", "\"\"\"\n", "\n", "# write this to a webpage (to see what it looks like)\n", "with open('simple_page1.html', 'w') as f:\n", " print(s_html, file=f)\n", "\n", "# either way, you can parse the html with BeautifulSoup\n", "soup = BeautifulSoup(s_html)\n", "\n", "# finding all paragraphs:\n", "p_all = soup.find_all('p')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the webpage we just wrote:\n", "\n", "[simple_page1.html](simple_page1.html)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# getting the first paragraph\n", "p_first = p_all[0]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing
" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_first" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[DuckDuckGo, Google, Bing]\n" ] } ], "source": [ "# getting the links from the first paragraph:\n", "links_p_first = p_first.find_all('a')\n", "\n", "print(links_p_first)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### syntactic sugar: \n", "To get the first tag under a soup object, refer to it as an attribute" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing
" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# below is equivilent to soup.find_all('p')[0]\n", "soup.p" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[DuckDuckGo, Google, Bing]\n" ] } ], "source": [ "# so we can condense our code as\n", "plinks = soup.p.find_all('a')\n", "print(plinks)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DuckDuckGo\n", "Firefox\n" ] } ], "source": [ "# iterating over tags\n", "for par in soup.find_all('p'):\n", " print(par.a)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DuckDuckGo\n" ] } ], "source": [ "# and the first link in that paragraph can be accessed like this:\n", "link = soup.p.a\n", "print(link)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Identifying if tags exist" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "The links in this paragraph point to search engines, like DuckDuckGo, Google, Bing
\n", "The links in this paragraph point to Internet browsers, like Firefox, Chrome, Opera
.\n", " \n", "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# note: there is no \"h3\" tag below\n", "soup" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# what if we're trying to access an element that doesn't exist?\n", "header = soup.h3\n", "header is None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test if a tag exists in a soup object by looking for the first instance of this tag and comparing it to `None`" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tag h3 doesnt exist in soup\n" ] } ], "source": [ "if soup.h3 is None:\n", " print(\"tag h3 doesnt exist in soup\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Putting it together:\n", "# Goal: get all cheese recipes!\n", "\n", "Just the recipe name & a link to its page now. Later, we'll visit the page to get more info on each. \n", "\n", "[https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# get soup\n", "url = 'https://www.allrecipes.com/search?q=cheese'\n", "response = requests.get(url)\n", "soup = BeautifulSoup(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our **goal** is to get a list of recipes. Maybe we should find all the `div` tags?" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "238" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# that seems like too many recipes ...\n", "len(soup.find_all('a'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Finding tags by `class_`\n", "\n", "how to localize a particular part of a web page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tags can have multiple \"classes\" they belong to. For example, in [https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese) the first recipe is encapsulated in this html tag:\n", "\n", " \n", " \n", " \n", "So this particular div tag belongs to classes:\n", "- `comp`\n", "- `mntl-card-list-items`\n", "- `mntl-document-card`\n", "- `card`\n", "- `card--no-image`\n", " \n", "I suspect our target recipes belong to the `mntl-card-list-items` class (I'm guessing a bit). Lets find them all:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "24" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(soup.find_all('a', class_='mntl-card-list-items'))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'m!ss!ss!pp!'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'mississippi'.replace('i', '!')" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Yellow Cheese vs. White Cheese: Why the Different Colors? 
\n", "Cheese Curds Make Amazing, Extra-Gooey Grilled Cheese Sandwiches\n", "Where Did American Cheese Come From (And Is It Even Cheese)?\n", "SaveSouthern Pimento Cheese1,020Ratings\n", "Chef John's Classic Cheese Fondue Is the Ultimate Cheese Lover's Recipe\n", "Hundreds of Pounds of Brie and Camembert Cheese Recalled Due to Possible Listeria Contamination\n", "Annie's Mac & Cheese and Smartfood Popcorn Have More in Common Than You Think\n", "SaveBasic Cream Cheese Frosting1,645Ratings\n", "SaveGrilled Cheese Sandwich855Ratings\n", "SaveHomemade Mac and Cheese2,642Ratings\n", "SaveBest Cheese Ball234Ratings\n", "SaveSimple Macaroni and Cheese965Ratings\n", "SaveBaked Mac and Cheese with Sour Cream and Cottage Cheese59Ratings\n", "SaveAbsolutely the BEST Rich and Creamy Blue Cheese Dressing Ever!536Ratings\n", "What Is Cottage Cheese and How Is It Made?\n", "SaveBaked Ham and Cheese Sliders973Ratings\n", "Kraft Is Giving Away Incense So Your Place Can Smell Like Grilled Cheese All the Time\n", "SaveJalapeño Popper Grilled Cheese Sandwich200Ratings\n", "Bread Cheese Is the Best Cheese You Haven't Tried Yet\n", "SaveCheese Sauce for Broccoli and Cauliflower473Ratings\n", "SaveNacho Cheese Sauce680Ratings\n", "SavePumpkin Bars with Cream Cheese Frosting148Ratings\n", "The Right Way To Wrap And Store Cheese\n", "SaveChef John's Creamy Blue Cheese Dressing127Ratings\n" ] } ], "source": [ "recipe_list = list()\n", "for tag in soup.find_all('a', class_='mntl-card-list-items'):\n", " # note: string processing methods reviewed / covered shortly\n", " print(tag.text.replace('\\n', ''))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A problem\n", "\n", "We're getting closer ... but \"The Right Way To Wrap And Store Cheese\" isn't really a recipe, is it?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# An insight (and solution)\n", "\n", "\n", " | name | \n", "
|----|------|
| 0 | SaveThe Best Caramel Apples140Ratings |
| 1 | SaveSauteed Apples1,791Ratings |
| 2 | SaveCaramel Apples270Ratings |
| 3 | SaveGourmet Caramel Apples97Ratings |
| 4 | SaveBaked Apples with Oatmeal Filling114Ratings |
| 5 | SaveSouthern Fried Apples262Ratings |
| 6 | SaveGrilled Sweet Potatoes with Apples128Ratings |
| 7 | SaveGrilled Sausages with Caramelized Onions a... |
| 8 | SaveRed Cabbage and Apples187Ratings |
| 9 | SaveMicrowave Baked Apples157Ratings |
| 10 | SaveBaked Apples305Ratings |
| 11 | SaveCandied Apples162Ratings |
| 12 | SavePork Chops with Apples and Raisins109Ratings |
| 13 | SavePork Chops with Apples, Onions, and Sweet ... |
| 14 | SaveHerbed Pork and Apples248Ratings |
| 15 | SaveSmushed Apples and Sweet Potatoes253Ratings |
| 16 | SaveChicken Salad with Apples, Grapes, and Wal... |
| 17 | SaveNo-Bake Cheesecake with Cool Whip and Appl... |
| 18 | SaveRoasted Butternut Squash Soup with Apples ... |
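Each result card is an `a` tag, so it carries an `href` along with its text. A minimal sketch of collecting the name and link of every actual recipe into a DataFrame like the one below, filtering on the "Save" insight from above (variable names here are illustrative):

```python
rows = list()
for tag in soup.find_all('a', class_='mntl-card-list-items'):
    name = tag.text.replace('\n', '')
    # keep only actual recipes: their card text starts with "Save"
    if name.startswith('Save'):
        # a tag's attributes (like href) are accessed like dict entries
        rows.append({'name': name, 'href': tag['href']})

df_recipe = pd.DataFrame(rows)
df_recipe
```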
\n", " | name | \n", "href | \n", "
---|---|---|
0 | \n", "SaveSouthern Pimento Cheese1,020Ratings | \n", "https://www.allrecipes.com/recipe/189930/south... | \n", "
1 | \n", "SaveBasic Cream Cheese Frosting1,645Ratings | \n", "https://www.allrecipes.com/recipe/8379/basic-c... | \n", "
2 | \n", "SaveGrilled Cheese Sandwich855Ratings | \n", "https://www.allrecipes.com/recipe/23891/grille... | \n", "
3 | \n", "SaveHomemade Mac and Cheese2,642Ratings | \n", "https://www.allrecipes.com/recipe/11679/homema... | \n", "
4 | \n", "SaveBest Cheese Ball234Ratings | \n", "https://www.allrecipes.com/recipe/16600/herman... | \n", "
5 | \n", "SaveSimple Macaroni and Cheese965Ratings | \n", "https://www.allrecipes.com/recipe/238691/simpl... | \n", "
6 | \n", "SaveBaked Mac and Cheese with Sour Cream and C... | \n", "https://www.allrecipes.com/recipe/229815/baked... | \n", "
7 | \n", "SaveAbsolutely the BEST Rich and Creamy Blue C... | \n", "https://www.allrecipes.com/recipe/58745/absolu... | \n", "
8 | \n", "SaveBaked Ham and Cheese Sliders973Ratings | \n", "https://www.allrecipes.com/recipe/216756/baked... | \n", "
9 | \n", "SaveJalapeño Popper Grilled Cheese Sandwich200... | \n", "https://www.allrecipes.com/recipe/217267/jalap... | \n", "
10 | \n", "SaveCheese Sauce for Broccoli and Cauliflower4... | \n", "https://www.allrecipes.com/recipe/233481/chees... | \n", "
11 | \n", "SaveNacho Cheese Sauce680Ratings | \n", "https://www.allrecipes.com/recipe/24738/nacho-... | \n", "
12 | \n", "SavePumpkin Bars with Cream Cheese Frosting148... | \n", "https://www.allrecipes.com/recipe/229508/pumpk... | \n", "
13 | \n", "SaveChef John's Creamy Blue Cheese Dressing127... | \n", "https://www.allrecipes.com/recipe/232395/chef-... | \n", "
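Each recipe page lists Calories, Fat, Carbs, and Protein. A minimal sketch of visiting every link above and collecting those facts into the table below; it assumes the nutrition summary is the first html table on each recipe page, with one (value, label) pair per row, and it reuses the illustrative `df_recipe` from the sketch above:

```python
import time

rows = list()
for _, recipe in df_recipe.iterrows():
    time.sleep(1)  # be polite: pause between requests

    # assumption: the nutrition summary is the first <table> on the page,
    # one (value, label) pair per row, e.g. ("208", "Calories")
    nut = pd.read_html(recipe['href'])[0]
    row = {label: value for value, label in zip(nut.iloc[:, 0], nut.iloc[:, 1])}

    row['name'] = recipe['name'].removeprefix('Save')  # string cleanup
    row['url'] = recipe['href']
    rows.append(row)

df_nutrition = pd.DataFrame(rows)
df_nutrition
```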
\n", " | Calories | \n", "Fat | \n", "Carbs | \n", "Protein | \n", "name | \n", "url | \n", "
---|---|---|---|---|---|---|
0 | \n", "208 | \n", "20g | \n", "2g | \n", "6g | \n", "Southern Pimento Cheese1,020Ratings | \n", "https://www.allrecipes.com/recipe/189930/south... | \n", "
1 | \n", "292 | \n", "14g | \n", "40g | \n", "2g | \n", "Basic Cream Cheese Frosting1,645Ratings | \n", "https://www.allrecipes.com/recipe/8379/basic-c... | \n", "
2 | \n", "400 | \n", "28g | \n", "26g | \n", "11g | \n", "Grilled Cheese Sandwich855Ratings | \n", "https://www.allrecipes.com/recipe/23891/grille... | \n", "
3 | \n", "845 | \n", "48g | \n", "65g | \n", "37g | \n", "Homemade Mac and Cheese2,642Ratings | \n", "https://www.allrecipes.com/recipe/11679/homema... | \n", "
4 | \n", "413 | \n", "39g | \n", "4g | \n", "15g | \n", "Best Cheese Ball234Ratings | \n", "https://www.allrecipes.com/recipe/16600/herman... | \n", "
5 | \n", "630 | \n", "34g | \n", "55g | \n", "27g | \n", "Simple Macaroni and Cheese965Ratings | \n", "https://www.allrecipes.com/recipe/238691/simpl... | \n", "
6 | \n", "415 | \n", "23g | \n", "30g | \n", "22g | \n", "Baked Mac and Cheese with Sour Cream and Cotta... | \n", "https://www.allrecipes.com/recipe/229815/baked... | \n", "
7 | \n", "93 | \n", "9g | \n", "1g | \n", "3g | \n", "Absolutely the BEST Rich and Creamy Blue Chees... | \n", "https://www.allrecipes.com/recipe/58745/absolu... | \n", "
8 | \n", "208 | \n", "14g | \n", "11g | \n", "10g | \n", "Baked Ham and Cheese Sliders973Ratings | \n", "https://www.allrecipes.com/recipe/216756/baked... | \n", "
9 | \n", "528 | \n", "34g | \n", "41g | \n", "17g | \n", "Jalapeño Popper Grilled Cheese Sandwich200Ratings | \n", "https://www.allrecipes.com/recipe/217267/jalap... | \n", "
10 | \n", "178 | \n", "14g | \n", "4g | \n", "9g | \n", "Cheese Sauce for Broccoli and Cauliflower473Ra... | \n", "https://www.allrecipes.com/recipe/233481/chees... | \n", "
11 | \n", "282 | \n", "23g | \n", "7g | \n", "14g | \n", "Nacho Cheese Sauce680Ratings | \n", "https://www.allrecipes.com/recipe/24738/nacho-... | \n", "
12 | \n", "373 | \n", "20g | \n", "48g | \n", "2g | \n", "Pumpkin Bars with Cream Cheese Frosting148Ratings | \n", "https://www.allrecipes.com/recipe/229508/pumpk... | \n", "
13 | \n", "214 | \n", "22g | \n", "3g | \n", "3g | \n", "Chef John's Creamy Blue Cheese Dressing127Ratings | \n", "https://www.allrecipes.com/recipe/232395/chef-... | \n", "
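The scraped values are still strings ("20g", etc.). A minimal sketch of tidying them with the same string methods, assuming the illustrative `df_nutrition` from the sketch above and the column layout shown:

```python
# strip the trailing "g" from the gram columns and make them numeric
for col in ['Fat', 'Carbs', 'Protein']:
    df_nutrition[col] = df_nutrition[col].str.replace('g', '').astype(float)

# calories may arrive as strings too; drop any thousands separators, then convert
df_nutrition['Calories'] = (df_nutrition['Calories'].astype(str)
                                                    .str.replace(',', '')
                                                    .astype(float))

# now numeric questions are easy, e.g. the three most caloric recipes:
df_nutrition.sort_values('Calories', ascending=False).head(3)
```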