# DS2500 Day 20

Mar 28, 2023

### Content
- Web scraping (html parsing & string manipulations)

### Admin
- lab digest tomorrow
- project
    - activate your mentor
    - sign up for a meeting slot with me next week
    
### Lesson Credit

Piotr Sapiezynski (https://www.sapiezynski.com/) originally wrote much of this lesson; I've modified it a bit (allrecipes.com has since changed ... arg!).

## Web Scraping
* Using programs or scripts to pretend to browse websites, examine the content on those websites, retrieve and extract data from those websites
* Why scrape?
    * if an API is available for a service, we will nearly always prefer the API to scraping
    * ... but not all services have APIs or the available APIs are too expensive for our project
    * newly published information might not yet be available through ready datasets
* Downsides of scraping:
    * no reference documentation (unlike APIs)
    * no guarantee that a webpage we scrape will look and work the same way the next day (might need to rewrite the whole scraper - this is why ETL is important!)
    * if it violates the terms of service it might be seen as a felony (https://www.aclu.org/cases/sandvig-v-barr-challenge-cfaa-prohibition-uncovering-racial-discrimination-online)
    * legal and moral greyzone (even if the ToS does not disallow it, somebody has to pay for the traffic and when you're scraping you're not looking at ads)
    * ... but everybody does it anyway (https://www.hollywoodreporter.com/thr-esq/genius-says-it-caught-google-lyricfind-redhanded-stealing-lyrics-400m-suit-1259383)
    

## Best case scenario
Some webpages publish their data in the form of simple tables. In these (rare) cases we can just use pandas .read_html to scrape this data:

https://www.espn.com/nba/team/stats/_/name/bos

In [3]:
import pandas as pd
# read html extracts all the <table> elements from html and returns a list of DataFrames created from them
tables = pd.read_html('https://www.espn.com/nba/team/stats/_/name/bos')
len(tables)

4

In [4]:
tables[0]

Unnamed: 0,Name
0,Jayson Tatum SF
1,Jaylen Brown SG
2,Malcolm Brogdon PG
3,Derrick White PG
4,Marcus Smart PG
5,Al Horford C
6,Grant Williams PF
7,Robert Williams III C
8,Sam Hauser SF
9,Mike Muscala C *


In [3]:
tables[1]

Unnamed: 0,GP,GS,MIN,PTS,OR,DR,REB,AST,STL,BLK,TO,PF,AST/TO
0,69,69.0,37.3,30.1,1.1,7.8,8.9,4.7,1.0,0.7,3.0,2.1,1.6
1,63,63.0,36.1,27.0,1.2,5.7,7.0,3.4,1.1,0.4,2.9,2.6,1.2
2,62,0.0,25.8,14.6,0.6,3.6,4.2,3.7,0.6,0.3,1.5,1.6,2.5
3,75,63.0,28.4,12.4,0.7,2.9,3.5,4.0,0.7,0.9,1.1,2.2,3.8
4,57,57.0,32.3,11.4,0.8,2.4,3.2,6.4,1.5,0.4,2.4,2.8,2.6
5,59,59.0,30.7,9.7,1.2,5.1,6.3,2.9,0.5,0.9,0.6,1.9,5.0
6,72,22.0,26.5,8.3,1.1,3.6,4.7,1.7,0.6,0.4,1.1,2.6,1.6
7,31,18.0,23.7,8.3,3.0,5.5,8.5,1.4,0.5,1.2,0.9,2.0,1.6
8,73,5.0,15.8,6.1,0.5,2.1,2.5,0.8,0.3,0.3,0.3,1.3,2.3
9,13,2.0,14.8,5.2,0.5,2.6,3.1,0.3,0.3,0.3,0.4,1.5,0.8


In [5]:
# "glue" dataframes together (more to come on this later in the semester)
player_stats = pd.concat(tables[:2], axis=1)
player_stats

Unnamed: 0,Name,GP,GS,MIN,PTS,OR,DR,REB,AST,STL,BLK,TO,PF,AST/TO
0,Jayson Tatum SF,69,69.0,37.3,30.1,1.1,7.8,8.9,4.7,1.0,0.7,3.0,2.1,1.6
1,Jaylen Brown SG,63,63.0,36.1,27.0,1.2,5.7,7.0,3.4,1.1,0.4,2.9,2.6,1.2
2,Malcolm Brogdon PG,62,0.0,25.8,14.6,0.6,3.6,4.2,3.7,0.6,0.3,1.5,1.6,2.5
3,Derrick White PG,75,63.0,28.4,12.4,0.7,2.9,3.5,4.0,0.7,0.9,1.1,2.2,3.8
4,Marcus Smart PG,57,57.0,32.3,11.4,0.8,2.4,3.2,6.4,1.5,0.4,2.4,2.8,2.6
5,Al Horford C,59,59.0,30.7,9.7,1.2,5.1,6.3,2.9,0.5,0.9,0.6,1.9,5.0
6,Grant Williams PF,72,22.0,26.5,8.3,1.1,3.6,4.7,1.7,0.6,0.4,1.1,2.6,1.6
7,Robert Williams III C,31,18.0,23.7,8.3,3.0,5.5,8.5,1.4,0.5,1.2,0.9,2.0,1.6
8,Sam Hauser SF,73,5.0,15.8,6.1,0.5,2.1,2.5,0.8,0.3,0.3,0.3,1.3,2.3
9,Mike Muscala C *,13,2.0,14.8,5.2,0.5,2.6,3.1,0.3,0.3,0.3,0.4,1.5,0.8


## HTML
Web pages are written in HTML.

The keywords in `<>` brackets are called tags. They open with `<tag>` and close with `</tag>`.

In [1]:
s_html = """
<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
         <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>"""

In [2]:
# write this string to a local file "simple_page0.html"
with open('simple_page0.html', 'w') as f:
    print(s_html, file=f)

Clicking the link below will open the html page we just wrote:

[simple_page0.html](simple_page0.html)

It opens in Jupyter, but your usual browser (Chrome, Safari, Firefox, etc.) will do the trick too.

# HTML is organized as a tree

(Note to self: write out tree structure below)

```html
<html>
    <head>
        <!-- comments in HTML are marked like this -->
        
        <!-- the head tag contains the meta information not displayed but helps browsers render the page -->
    </head>
    <body>
        <!-- This is the body of the document that contains all the visible elements.-->
        <h1>Heading 1</h1>
        <h2>This is what heading 2 looks like</h2>
        
        <p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>

<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>   
        
        <p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
        
        
    </body>
</html>
```
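One way to write out the tree is to print each tag indented by its depth. The sketch below uses the standard library's `html.parser` (a swapped-in tool, not BeautifulSoup) on a shortened copy of the page above; `TagTree` and `s_html_small` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class TagTree(HTMLParser):
    """Collects one indented line per opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # record the tag at the current depth, then step one level deeper
        self.lines.append('    ' * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # closing a tag steps back up one level
        self.depth -= 1


# shortened version of the page above (assumption: tags are properly closed)
s_html_small = """
<html>
    <head></head>
    <body>
        <h1>Heading 1</h1>
        <p>Links use the "a" tag: <a href="https://www.google.com">Click here</a></p>
    </body>
</html>"""

tree = TagTree()
tree.feed(s_html_small)
print('\n'.join(tree.lines))
```

Each level of indentation in the printout is one step down the tree: `head` and `body` are children of `html`, and the `a` tag is a child of the `p` tag that contains it.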

# And now, the internet

### Observing HTML in a browser
You can see the actual html of a page by selecting "inspect" on a page via a right click.  Try it out:

[https://www.scrapethissite.com/pages/simple/](https://www.scrapethissite.com/pages/simple/)

### Obtaining HTML from a url address
Use `requests.get()` to get the html of a web page into python:

In [6]:
# Getting the html content in Python
# (commonly passed into beautiful soup, see following slide)
import requests

response = requests.get('https://www.scrapethissite.com/pages/simple/')
print(response.text)

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>
    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />

    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="description" content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping.">

    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">
    <link href='https://fonts.googleapis.com/css?family=Lato:400,700' rel='stylesheet' type='text/css'>
    <link rel="stylesheet" type="text/css" href="/static/css/styles.css">

    
<meta name=

# Tip: save that html file!

Websites change over time. If your analysis depends on a page looking a particular way, consider storing the raw HTML source.

(for example, allrecipes.com changed since I last taught this lesson, arg!)
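A minimal sketch of this tip, assuming we are happy to keep one local file per page. `load_or_fetch` is a hypothetical helper (it is not part of `requests` or BeautifulSoup), and the file name in the usage comment is made up.

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """ returns html from a local file, downloading it only if the file is missing

    Args:
        path (str): where the raw html is (or will be) stored
        fetch (function): takes no arguments, returns an html string

    Returns:
        html (str): raw html of the page
    """
    path = Path(path)
    if path.exists():
        # we already saved this page: reuse the stored copy
        return path.read_text(encoding='utf-8')

    # first visit: download, then store the raw html for next time
    html = fetch()
    path.write_text(html, encoding='utf-8')
    return html


# usage sketch (in this notebook, where requests is already imported):
# url = 'https://www.scrapethissite.com/pages/simple/'
# html = load_or_fetch('simple_countries.html', lambda: requests.get(url).text)
```

Passing the download step in as a function keeps the helper independent of how the page is fetched, and it means repeated runs of the notebook don't hit the website again.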

# BeautifulSoup allows us to make sense of this HTML mess


In [7]:
!pip3 install bs4

Defaulting to user installation because normal site-packages is not writeable


In [8]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(s_html)

In [9]:
soup

<html>
<head>
<!-- comments in HTML are marked like this -->
<!-- the head tag contains the meta information not displayed but helps browsers render the page -->
</head>
<body>
<!-- This is the body of the document that contains all the visible elements.-->
<h1>Heading 1</h1>
<h2>This is what heading 2 looks like</h2>
<p>Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.</p>
<p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>
<p>Links are created using the "a" tag: 
            <a href="https://www.google.com">Click here to google.</a>
            href is an attirbute of the a tag that specify where the link points to.</p>
</body>
</html>

In [12]:
## getting elements by their tag name:
soup.find_all('p')

# find_all returns a list, where each element is an instance of the specified tag

[<p>Text is usually in paragraphs.
             New lines and multiple consecutive whitespace characters are ignored.</p>,
 <p>Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.</p>,
 <p>Links are created using the "a" tag: 
             <a href="https://www.google.com">Click here to google.</a>
             href is an attirbute of the a tag that specify where the link points to.</p>]

In [14]:
for paragraph in soup.find_all('p'):
    # text is a property of a soup object
    print(paragraph.text) 
    print('------')

Text is usually in paragraphs.
            New lines and multiple consecutive whitespace characters are ignored.
------
Unlike in python indentation is only a good practice but it doesn't change functionality. In fact, all of this HTML could be (and often is in real webpages) just writen as a single line.
------
Links are created using the "a" tag: 
            Click here to google.
            href is an attirbute of the a tag that specify where the link points to.
------


# `.find_all()` on subtrees of soup object

Note to self: write out tree structure below

```html
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
```

# What if we only wanted links from the first paragraph?

The `.find_all()` method works not only on the whole `soup` object, but also on subtrees of the soup object.  

In [16]:
s_html = """
<html>
    <body>
        <p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
        
        <p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>
"""

# write this to a webpage (to see what it looks like)
with open('simple_page1.html', 'w') as f:
    print(s_html, file=f)

# either way, you can parse the html with BeautifulSoup
soup = BeautifulSoup(s_html)

# finding all paragraphs:
p_all = soup.find_all('p')

the webpage we just wrote:

[simple_page1.html](simple_page1.html)

In [17]:
# getting the first paragraph
p_first = p_all[0]

In [18]:
p_first

<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>

In [19]:
# getting the links from the first paragraph:
links_p_first = p_first.find_all('a')

print(links_p_first)

[<a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a>]


### syntactic sugar: 
To get the first tag under a soup object, refer to it as an attribute

In [20]:
# below is equivalent to soup.find_all('p')[0]
soup.p

<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>

In [21]:
# so we can condense our code as
plinks = soup.p.find_all('a')
print(plinks)

[<a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a>]


In [22]:
# iterating over tags
for par in soup.find_all('p'):
    print(par.a)

<a href="https://duckduckgo.com">DuckDuckGo</a>
<a href="https://firefox.com">Firefox</a>


In [23]:
# and the first link in that paragraph can be accessed like this:
link = soup.p.a
print(link)

<a href="https://duckduckgo.com">DuckDuckGo</a>


## Identifying if tags exist

In [21]:
# note: there is no "h3" tag below
soup

<html>
<body>
<p>The links in this paragraph point to search engines, like <a href="https://duckduckgo.com">DuckDuckGo</a>, <a href="https://google.com">Google</a>, <a href="https://bing.com">Bing</a></p>
<p>The links in this paragraph point to Internet browsers, like <a href="https://firefox.com">Firefox</a>, <a href="https://chrome.com">Chrome</a>, <a href="https://opera.com">Opera</a></p>.
    </body>
</html>

In [24]:
# what if we're trying to access an element that doesn't exist?
header = soup.h3
header is None

True

We can test if a tag exists in a soup object by looking for the first instance of this tag and comparing it to `None`

In [25]:
if soup.h3 is None:
    print("tag h3 doesn't exist in soup")

tag h3 doesn't exist in soup


# Putting it together:
# Goal: get all cheese recipes!

Just the recipe name & a link to its page now.  Later, we'll visit the page to get more info on each.  

[https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese)

In [26]:
# get soup
url = 'https://www.allrecipes.com/search?q=cheese'
response = requests.get(url)
soup = BeautifulSoup(response.text)

Our **goal** is to get a list of recipes.  Maybe we should find all the `a` tags?

In [27]:
# that seems like too many recipes ...
len(soup.find_all('a'))

238

## Finding tags by `class_`

How to locate a particular part of a web page:

Tags can have multiple "classes" they belong to.  For example, in [https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese) the first recipe is encapsulated in this html tag:

    <a id="mntl-card-list-items_1-0-4" class="comp mntl-card-list-items mntl-document-card mntl-card card card--no-image" data-doc-id="6663961" data-tax-levels="" href="https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/" data-cta="" data-ordinal="5">
    </a>
    
So this particular `a` tag belongs to classes:
- `comp`
- `mntl-card-list-items`
- `mntl-document-card`
- `mntl-card`
- `card`
- `card--no-image`
    
I suspect our target recipes belong to the `mntl-card-list-items` class (I'm guessing a bit).  Let's find them all:

In [29]:
len(soup.find_all('a', class_='mntl-card-list-items'))

24
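Why does searching for a single class find a tag that lists six of them? The `class` attribute is a space-separated string of class names, and (as far as I understand BeautifulSoup's behavior) `class_=` matches when the given name is one of them. The sketch below mimics that check with plain string methods on the attribute copied from the tag above; `s_class` and `class_list` are hypothetical names.

```python
# the class attribute copied from the "a" tag above
s_class = 'comp mntl-card-list-items mntl-document-card mntl-card card card--no-image'

# splitting on whitespace gives one entry per class
class_list = s_class.split()
print(len(class_list))

# a whole class name is a member of the list ...
print('mntl-card-list-items' in class_list)

# ... but a partial class name is not
print('mntl-card-list' in class_list)
```

So we need the whole class name as it appears in the html; a fragment of a name won't match.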

In [32]:
'mississippi'.replace('i', '!')

'm!ss!ss!pp!'

In [33]:
recipe_list = list()
for tag in soup.find_all('a', class_='mntl-card-list-items'):
    # note: string processing methods reviewed / covered shortly
    print(tag.text.replace('\n', ''))

Yellow Cheese vs. White Cheese: Why the Different Colors? 
Cheese Curds Make Amazing, Extra-Gooey Grilled Cheese Sandwiches
Where Did American Cheese Come From (And Is It Even Cheese)?
SaveSouthern Pimento Cheese1,020Ratings
Chef John's Classic Cheese Fondue Is the Ultimate Cheese Lover's Recipe
Hundreds of Pounds of Brie and Camembert Cheese Recalled Due to Possible Listeria Contamination
Annie's Mac & Cheese and Smartfood Popcorn Have More in Common Than You Think
SaveBasic Cream Cheese Frosting1,645Ratings
SaveGrilled Cheese Sandwich855Ratings
SaveHomemade Mac and Cheese2,642Ratings
SaveBest Cheese Ball234Ratings
SaveSimple Macaroni and Cheese965Ratings
SaveBaked Mac and Cheese with Sour Cream and Cottage Cheese59Ratings
SaveAbsolutely the BEST Rich and Creamy Blue Cheese Dressing Ever!536Ratings
What Is Cottage Cheese and How Is It Made?
SaveBaked Ham and Cheese Sliders973Ratings
Kraft Is Giving Away Incense So Your Place Can Smell Like Grilled Cheese All the Time
SaveJalapeño Popp

# A problem

We're getting closer ... but "The Right Way To Wrap And Store Cheese" isn't really a recipe, is it?


# An insight (and solution)

<img src="https://i.ibb.co/D5pZW5f/cheese-star.png" width=800>

Only the recipes have ratings.

In HTML-speak, only the recipes have some `svg` tag whose class is "icon-star"

In [34]:
recipe_list = list()
for tag in soup.find_all('a', class_='mntl-card-list-items'):
    # search within tag to find all star icons
    star_list = tag.find_all('svg', class_='icon-star')
    if star_list:
        # at least one star icon is found, store this tag as it's a real recipe
        recipe_list.append(tag)

In [37]:
# looks pretty good
# (well ... the Save 1,020Ratings isn't great but at least they're all recipes below)
[tag.text.replace('\n', '') for tag in recipe_list]

['SaveSouthern Pimento Cheese1,020Ratings',
 'SaveBasic Cream Cheese Frosting1,645Ratings',
 'SaveGrilled Cheese Sandwich855Ratings',
 'SaveHomemade Mac and Cheese2,642Ratings',
 'SaveBest Cheese Ball234Ratings',
 'SaveSimple Macaroni and Cheese965Ratings',
 'SaveBaked Mac and Cheese with Sour Cream and Cottage Cheese59Ratings',
 'SaveAbsolutely the BEST Rich and Creamy Blue Cheese Dressing Ever!536Ratings',
 'SaveBaked Ham and Cheese Sliders973Ratings',
 'SaveJalapeño Popper Grilled Cheese Sandwich200Ratings',
 'SaveCheese Sauce for Broccoli and Cauliflower473Ratings',
 'SaveNacho Cheese Sauce680Ratings',
 'SavePumpkin Bars with Cream Cheese Frosting148Ratings',
 "SaveChef John's Creamy Blue Cheese Dressing127Ratings"]

# Finding tags by `id`

Nearly the same as finding by class, but you'll look for `id=` in the html and pass it to the `id` keyword of `soup.find_all()`.

**Goal**: Get the footer from: https://www.scrapethissite.com/



```html
<section id="footer">
        <div class="container">
            <div class="row">
                <div class="col-md-12 text-center text-muted">
                    Lessons and Videos © Hartley Brody 2018
                </div><!--.col-->
            </div><!--.row-->
        </div><!--.container-->
    </section>
```

In [29]:
# get soup from url
url = 'https://www.scrapethissite.com/'
html = requests.get(url).text
soup = BeautifulSoup(html)

In [30]:
soup.find_all(id='footer')

[<section id="footer">
 <div class="container">
 <div class="row">
 <div class="col-md-12 text-center text-muted">
                     Lessons and Videos © Hartley Brody 2023
                 </div><!--.col-->
 </div><!--.row-->
 </div><!--.container-->
 </section>]

Note that you can combine all searches shown above:
- tag
    - p (paragraph)
    - a (link)
    - div ...
- tag class
- tag id

```python
# finds all links (tag type = 'a'), with given class and id
soup.find_all('a', class_='fancy-link', id='blue')

```

# What if I don't like cheese?

First off, really?  It's delicious!

But if you insist on searching for some other ingredient, try swapping out "cheese" in the url below:

[https://www.allrecipes.com/search?q=cheese](https://www.allrecipes.com/search?q=cheese)
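Swapping the ingredient by hand works for single words, but a query with spaces or punctuation needs to be encoded before it goes into a url. A minimal sketch using the standard library's `urllib.parse.urlencode`; `build_search_url` is a hypothetical helper, and I'm assuming the site accepts the standard `+` encoding for spaces.

```python
from urllib.parse import urlencode


def build_search_url(s_query):
    """ builds an allrecipes search url for a given query (hypothetical helper)

    Args:
        s_query (str): search phrase (e.g. "blue cheese")

    Returns:
        url (str): search url with the query safely encoded
    """
    # urlencode escapes spaces and special characters in the query
    return 'https://www.allrecipes.com/search?' + urlencode({'q': s_query})


print(build_search_url('cheese'))
print(build_search_url('blue cheese'))
```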

## In Class Assignment 1

**Goal:** Formalize a pipeline to scrape this site

https://www.allrecipes.com/search?q=cheese
    
1. Write `extract_recipes(s_query)` which:
    * takes the search phrase (e.g. 'cheese') as input argument
    * builds the url that leads directly to the page that lists the recipes
    * uses `requests` to get the html text of this page
    * builds a BeautifulSoup object out of that text 
    * finds the names of all recipes
        - to identify which tags / classes to `find_all()`, open the page in your browser and "inspect" 
        - start from the recipe object above, and call another `find_all()` to zoom into the recipe name itself
    * returns a dataframe with a single column "name"
        * the names of the recipes might be a bit mangled for now (e.g. a leading "Save" and a trailing "1,020Ratings"), that's ok
    * we'll want to add more features to this dataframe later; building it up as a list of dictionaries (one per row) allows us to extend to other features easily:
    
```python
row_list = list()
for recipe in recipe_list:
    # build a dictionary representing this recipe (row)
    name = recipe.text.replace('\n', '')
    d = {'name': name}
    row_list.append(d)
 
df = pd.DataFrame(row_list)
```

In [39]:
import pandas as pd

def extract_recipes(s_query):
    """ builds dataframe of recipe names from allrecipes html
    
    Args:
        s_query (str): input query (e.g. "cheese")
        
    Returns:
        df_recipe (pd.DataFrame): each row is a recipe
    """
    
    # build soup object from search query
    url = f'https://www.allrecipes.com/search?q={s_query}'
    s_html = requests.get(url).text
    soup = BeautifulSoup(s_html)
    
    # get a list of recipe tags
    recipe_list = list()
    for tag in soup.find_all('a', class_='mntl-card-list-items'):
        # search within tag to find all star icons
        star_list = tag.find_all('svg', class_='icon-star')
        if star_list:
            # some star icon is found, store this tag as it's a real recipe
            recipe_list.append(tag)
            
    # extract features to build dataframe
    row_list = list()
    for recipe in recipe_list:
        name = recipe.text.replace('\n', '')
        row_list.append({'name': name})
        
    return pd.DataFrame(row_list)
    

In [44]:
extract_recipes('apples')

Unnamed: 0,name
0,SaveThe Best Caramel Apples140Ratings
1,"SaveSauteed Apples1,791Ratings"
2,SaveCaramel Apples270Ratings
3,SaveGourmet Caramel Apples97Ratings
4,SaveBaked Apples with Oatmeal Filling114Ratings
5,SaveSouthern Fried Apples262Ratings
6,SaveGrilled Sweet Potatoes with Apples128Ratings
7,SaveGrilled Sausages with Caramelized Onions a...
8,SaveRed Cabbage and Apples187Ratings
9,SaveMicrowave Baked Apples157Ratings


# Todo list

- extract info from each recipe's page
    - get url of each recipe's own page from initial search:
        - e.g. [https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/](https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/)
    - get string of nutrition info on that page
    
```
        208
        Calories
        20g
        Fat
        2g
        Carbs
        6g
        Protein
```
        

- string processing
    - clean up the name of each recipe: "SaveSouthern Pimento Cheese1,020Ratings"
    - process the string above so it yields clean numbers we can operate on

## Getting info from each recipe's own page:

When we interact with the webpage in the browser, clicking on the header with the recipe name leads us to the actual recipe. Let's have a look at how it's done:

<img src="https://i.ibb.co/9384Qb4/Screenshot-from-2023-03-27-14-49-23.png" width=500>



In [45]:
##### repeated from above ...

# build soup object from search query
url = 'https://www.allrecipes.com/search?q=cheese'
s_html = requests.get(url).text
soup = BeautifulSoup(s_html)

# get a list of recipe tags
recipe_list = list()
for tag in soup.find_all('a', class_='mntl-card-list-items'):
    # search within tag to find all star icons
    star_list = tag.find_all('svg', class_='icon-star')
    if star_list:
        # some star icon is found, store this as its a real recipe
        recipe_list.append(tag)

# takeaway: tags have attributes, you can access them

(including the link address for "Southern Pimento Cheese")

In [46]:
# this is the "a" tag object shown in the image immediately above
recipe_list[0].attrs

{'id': 'mntl-card-list-items_1-0-3',
 'class': ['comp',
  'mntl-card-list-items',
  'mntl-document-card',
  'mntl-card',
  'card',
  'card--no-image'],
 'data-doc-id': '6663961',
 'data-tax-levels': '',
 'href': 'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/',
 'data-cta': '',
 'data-ordinal': '4'}

In [47]:
recipe_list[0].attrs['href']

'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/'
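Since `.attrs` is a dictionary, the usual dictionary tools apply. Indexing with `['href']` raises a `KeyError` when a tag has no such attribute, while `.get()` returns `None` instead. The sketch below uses a plain dictionary copied (in part) from the output above, so it runs without scraping anything.

```python
# a few of the attributes shown above, as a plain dictionary
attrs = {'id': 'mntl-card-list-items_1-0-3',
         'href': 'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/',
         'data-ordinal': '4'}

# present attribute: same result as attrs['href']
print(attrs.get('href'))

# missing attribute: None rather than a KeyError
print(attrs.get('title'))
```

This pairs nicely with the `is None` test from earlier when some tags on a page lack the attribute we're after.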

# Adding `href` to our dataframe of recipes

Let's modify our `extract_recipes()` function so that, rather than returning just the names of the dishes, it returns a dataframe with two columns, `name` and `href` (built from a list of dictionaries, one per recipe):

In [48]:
def extract_recipes(s_query):
    """ builds dataframe of recipe names & links from allrecipes html
    
    Args:
        s_query (str): input query (e.g. "cheese")
        
    Returns:
        df_recipe (pd.DataFrame): each row is a recipe
    """
    
    # build soup object from search query
    url = f'https://www.allrecipes.com/search?q={s_query}'
    s_html = requests.get(url).text
    soup = BeautifulSoup(s_html)
    
    # get a list of recipe tags
    recipe_list = list()
    for tag in soup.find_all('a', class_='mntl-card-list-items'):
        # search within tag to find all star icons
        star_list = tag.find_all('svg', class_='icon-star')
        if star_list:
            # some star icon is found, store this as its a real recipe
            recipe_list.append(tag)
            
    # extract features to build dataframe
    row_list = list()
    for recipe in recipe_list:
        name = recipe.text.replace('\n', '')
        row_list.append({'name': name,
                         'href': recipe.attrs['href']})
        
        
    return pd.DataFrame(row_list)
    

In [49]:
df_recipe = extract_recipes('cheese')
df_recipe

Unnamed: 0,name,href
0,"SaveSouthern Pimento Cheese1,020Ratings",https://www.allrecipes.com/recipe/189930/south...
1,"SaveBasic Cream Cheese Frosting1,645Ratings",https://www.allrecipes.com/recipe/8379/basic-c...
2,SaveGrilled Cheese Sandwich855Ratings,https://www.allrecipes.com/recipe/23891/grille...
3,"SaveHomemade Mac and Cheese2,642Ratings",https://www.allrecipes.com/recipe/11679/homema...
4,SaveBest Cheese Ball234Ratings,https://www.allrecipes.com/recipe/16600/herman...
5,SaveSimple Macaroni and Cheese965Ratings,https://www.allrecipes.com/recipe/238691/simpl...
6,SaveBaked Mac and Cheese with Sour Cream and C...,https://www.allrecipes.com/recipe/229815/baked...
7,SaveAbsolutely the BEST Rich and Creamy Blue C...,https://www.allrecipes.com/recipe/58745/absolu...
8,SaveBaked Ham and Cheese Sliders973Ratings,https://www.allrecipes.com/recipe/216756/baked...
9,SaveJalapeño Popper Grilled Cheese Sandwich200...,https://www.allrecipes.com/recipe/217267/jalap...


# Todo list: complete

- extract info from each recipe's page
    - get url of each recipe's own page from initial search:
        - e.g. [https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/](https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/)
    - get string of nutrition info on that page
    
```
        208
        Calories
        20g
        Fat
        2g
        Carbs
        6g
        Protein
```
        
# Todo list: 
- string processing
    - clean up the name of each recipe: "SaveSouthern Pimento Cheese1,020Ratings"
    - process the string above so it yields clean numbers we can operate on

## String Manipulations
- `.split()` & `.join()`
- `.strip()`
- `.replace()`
- `.upper()` & `.lower()`

I find these most useful, but there are a few more [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) which you might find useful too.

(++) They're more powerful (read: more complex to learn), but [regular expressions](https://docs.python.org/3/library/re.html) can handle cases where the built-in python string methods above don't work as well.
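As a taste of regular expressions, here is a minimal sketch that cleans up a mangled recipe name like "SaveSouthern Pimento Cheese1,020Ratings". It assumes every mangled name starts with "Save" and ends with a rating count followed by "Ratings", which is only what we observed in the search results; `PATTERN` and `clean_recipe_name` are hypothetical names.

```python
import re

# "Save", then the name (as short as possible), then digits/commas, then "Ratings"
PATTERN = re.compile(r'^Save(?P<name>.+?)(?P<count>[\d,]+)Ratings$')


def clean_recipe_name(s_raw):
    """ splits a mangled recipe name into a clean name and a rating count

    Args:
        s_raw (str): text of a recipe tag (e.g. "SaveBest Cheese Ball234Ratings")

    Returns:
        name (str): recipe name (unchanged input if the pattern doesn't match)
        n_rating (int): number of ratings (None if the pattern doesn't match)
    """
    match = PATTERN.match(s_raw.strip())
    if match is None:
        return s_raw, None

    # drop the thousands separator before converting to an integer
    n_rating = int(match.group('count').replace(',', ''))
    return match.group('name').strip(), n_rating


print(clean_recipe_name('SaveSouthern Pimento Cheese1,020Ratings'))
print(clean_recipe_name('What Is Cottage Cheese and How Is It Made?'))
```

A recipe whose real name ends in a digit would confuse this pattern, so treat it as a starting point rather than a finished cleaner.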

In [56]:
'\n\n\n hello!      \n    hello! \n\n    \n \n'.replace('\n', '').replace(' ', '')

'hello!hello!'

In [51]:
# strip removes all leading and trailing whitespace (spaces and newlines)
'\n\n\n hello!      \n    hello! \n\n    \n \n'.strip()

'hello!      \n    hello!'

In [53]:
# replace does just what you think it does
'hello matt matt matt'.replace('matt', 'zeke')

'hello zeke zeke zeke'

In [54]:
# replacing with an empty string deletes every occurrence
'hello matt'.replace('matt', '')

'hello '

In [57]:
# capitalize everything
'dont shout!'.upper()

'DONT SHOUT!'

In [58]:
# lowercase everything
'DONT shOUt'.lower()

'dont shout'

In [43]:
# split will split a string on every occurrence of the given string (',' below)
'fat: 54 g, calories: 430 cal, sugar: 10g'.split(',')

['fat: 54 g', ' calories: 430 cal', ' sugar: 10g']
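Notice the leading spaces left over in `' calories: 430 cal'`. A short sketch of how `.split()` and `.strip()` combine to turn this kind of string into a dictionary (`s_nutrit` and `d_nutrit` are hypothetical names, and the string is the made-up example from the cell above):

```python
s_nutrit = 'fat: 54 g, calories: 430 cal, sugar: 10g'

d_nutrit = dict()
for item in s_nutrit.split(','):
    # each item looks like ' calories: 430 cal'
    key, val = item.split(':')

    # strip removes the whitespace left around each piece
    d_nutrit[key.strip()] = val.strip()

print(d_nutrit)
```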

In [44]:
'<glue>'.join(['a', 'b', 'c', 'd'])

'a<glue>b<glue>c<glue>d'

In [59]:
# ICA 2 tip: split(), without argument, splits on whitespace (spaces and newlines)
' here is some text          with a whole bunch of spaces in the middle'.split()

['here',
 'is',
 'some',
 'text',
 'with',
 'a',
 'whole',
 'bunch',
 'of',
 'spaces',
 'in',
 'the',
 'middle']

In [60]:
# not equivalent: passing a space explicitly splits on every single space
' here is some text          with a whole bunch of spaces in the middle'.split(' ')

['',
 'here',
 'is',
 'some',
 'text',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'with',
 'a',
 'whole',
 'bunch',
 'of',
 'spaces',
 'in',
 'the',
 'middle']
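Combining the two: splitting on whitespace and joining with a single space is a handy way to tidy up text scraped from html, where runs of spaces and newlines are common. A minimal sketch (`s_messy` and `s_clean` are hypothetical names):

```python
s_messy = ' here is some text          with a whole bunch of spaces in the middle'

# split() drops all the whitespace, join() puts back exactly one space between words
s_clean = ' '.join(s_messy.split())
print(s_clean)
```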

In [67]:
list(zip('abc', [1, 2, 3]))

[('a', 1), ('b', 2), ('c', 3)]

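In [61]:
# split a comma-separated string into a list (note the leading spaces that remain)
name_list = 'last0, first0, last1, first1, last2, first2'.split(',')

', '.join(name_list[:2])

'last0,  first0'

In [68]:
# pair every second item (first names, as keys) with the item before it (last names, as values)
dict(zip(name_list[1::2], name_list[::2]))

{' first0': 'last0', ' first1': ' last1', ' first2': ' last2'}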

## In Class Assignment 2 - Getting Nutritional Information
Write an `extract_nutrition()` function, which accepts the url of a particular recipe page (like the `href` values we collected above) and returns a dictionary of nutritional information:

```python
url = 'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/'
extract_nutrition(url)

```

yields:

```python
{'Calories': '208',
 'Fat': '20g',
 'Carbs': '2g',
 'Protein': '6g'}

```

Once complete, incorporate `extract_nutrition()` into `extract_recipes()`.


In [70]:
def extract_nutrition(url):
    """ returns a dictionary of nutrition info 
    
    Args:
        url (str): url of a single allrecipes recipe page
        
    Returns:
        nutrit_dict (dict): keys are nutrient names ('Fat'), 
            vals are str of quantity ('20g')
    """
    # get html, build soup
    html = requests.get(url).text
    soup = BeautifulSoup(html)

    # extract nutrition info
    str_nutrit = soup.find_all(class_='mntl-nutrition-facts-summary__table-body')[0].text
    
    # make dictionary from alternating tokens (0 is first value, 1 is first key, 2 is second value ...)
    nutrit_list = str_nutrit.split()
    nutrit_dict = dict(zip(nutrit_list[1::2], 
                           nutrit_list[0::2]))
    return nutrit_dict


In [100]:
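import numpy as np
from colorama import Fore, Style

player = np.array([[1, 1], [0, 2]])
size = np.array([[1, 2], [3, 4]])

s_board = ''
# loop over the rows that actually exist (both arrays are 2 x 2)
for row_idx in range(player.shape[0]):
    _player = player[row_idx, :]
    _size = size[row_idx, :]
    
    # zip pairs each entry of the player row with the matching entry of the size row
    for p, s in zip(_player, _size):
        if p == 1:
            s_board += f'{Fore.GREEN}{s}{Style.RESET_ALL}'
        elif p == 2:
            s_board += f'{Fore.RED}{s}{Style.RESET_ALL}'
        else:
            s_board += '0'
        
    s_board += '\n'

# zipping the arrays themselves pairs up their rows
list(zip(player, size))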

[(array([1, 1]), array([1, 2])), (array([0, 2]), array([3, 4]))]

In [88]:
# get soup from url
url = 'https://www.allrecipes.com/recipe/189930/southern-pimento-cheese/'
html = requests.get(url).text
soup = BeautifulSoup(html)

# extract nutrition info
str_nutrit = soup.find_all(class_='mntl-nutrition-facts-summary__table-body')[0].text

In [92]:
str_nutrit.split()[1::2]

['Calories', 'Fat', 'Carbs', 'Protein']

In [50]:
# tqdm is a progress bar, not necessary, but fun to see once
# (scraping often takes a moment, nice to get some updates)
!pip3 install tqdm

Defaulting to user installation because normal site-packages is not writeable


In [97]:
from tqdm import tqdm 

def extract_recipes(s_query):
    """ builds list of recipe names from allrecipies html
    
    Args:
        s_query (str): input query (i.e. "cheese")
        
    Returns:
        df_recipe (pd.DataFrame): each row is a recipe
    """
    # build soup object from search query
    url = f'https://www.allrecipes.com/search?q={s_query}'
    s_html = requests.get(url).text
    soup = BeautifulSoup(s_html)
    
    # get a list of recipe tags
    recipe_list = list()
    for tag in soup.find_all('a', class_='mntl-card-list-items'):
        # search within tag to find all star icons
        star_list = tag.find_all('svg', class_='icon-star')
        if star_list:
            # a star icon was found, keep this tag since it's a real recipe
            recipe_list.append(tag)
            
    # extract features to build dataframe
    row_list = list()
    for recipe in tqdm(recipe_list, desc='getting nutrition per recipe'):
        # extract name & url
        name = recipe.text.replace('\n', '').replace('Save', '')
        url = recipe.attrs['href']
        
        # lookup nutrition info
        d = extract_nutrition(url)
        d['name'] = name
        d['url'] = url
        
        row_list.append(d)
        
        
    return pd.DataFrame(row_list)
    

In [96]:
from tqdm import tqdm
from time import sleep

for idx in tqdm(range(100)):
    sleep(.05)

100%|██████████| 100/100 [00:05<00:00, 19.74it/s]


In [98]:
df = extract_recipes('cheese')
df

getting nutrition per recipe: 100%|██████████| 14/14 [00:04<00:00,  2.85it/s]


| | Calories | Fat | Carbs | Protein | name | url |
|---|---|---|---|---|---|---|
| 0 | 208 | 20g | 2g | 6g | Southern Pimento Cheese1,020Ratings | https://www.allrecipes.com/recipe/189930/south... |
| 1 | 292 | 14g | 40g | 2g | Basic Cream Cheese Frosting1,645Ratings | https://www.allrecipes.com/recipe/8379/basic-c... |
| 2 | 400 | 28g | 26g | 11g | Grilled Cheese Sandwich855Ratings | https://www.allrecipes.com/recipe/23891/grille... |
| 3 | 845 | 48g | 65g | 37g | Homemade Mac and Cheese2,642Ratings | https://www.allrecipes.com/recipe/11679/homema... |
| 4 | 413 | 39g | 4g | 15g | Best Cheese Ball234Ratings | https://www.allrecipes.com/recipe/16600/herman... |
| 5 | 630 | 34g | 55g | 27g | Simple Macaroni and Cheese965Ratings | https://www.allrecipes.com/recipe/238691/simpl... |
| 6 | 415 | 23g | 30g | 22g | Baked Mac and Cheese with Sour Cream and Cotta... | https://www.allrecipes.com/recipe/229815/baked... |
| 7 | 93 | 9g | 1g | 3g | Absolutely the BEST Rich and Creamy Blue Chees... | https://www.allrecipes.com/recipe/58745/absolu... |
| 8 | 208 | 14g | 11g | 10g | Baked Ham and Cheese Sliders973Ratings | https://www.allrecipes.com/recipe/216756/baked... |
| 9 | 528 | 34g | 41g | 17g | Jalapeño Popper Grilled Cheese Sandwich200Ratings | https://www.allrecipes.com/recipe/217267/jalap... |
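The `name` column above has the rating count fused onto the recipe name (`'Southern Pimento Cheese1,020Ratings'`). A minimal sketch of one way to separate them with a regular expression, assuming every fused name ends in a number followed by `Ratings`; the helper `split_name_ratings` is hypothetical and not part of the lesson's code.

```python
import re

# lazy name, then a count such as '855' or '1,020', then the word 'Ratings'
RATING_SUFFIX = re.compile(r'^(?P<name>.*?)(?P<count>\d[\d,]*)\s*Ratings?$')

def split_name_ratings(raw_name):
    """ splits 'Some Recipe1,020Ratings' into ('Some Recipe', 1020) """
    raw_name = raw_name.strip()
    match = RATING_SUFFIX.match(raw_name)
    if match is None:
        # no rating suffix found, return the name unchanged
        return raw_name, None
    count = int(match.group('count').replace(',', ''))
    return match.group('name').strip(), count

print(split_name_ratings('Southern Pimento Cheese1,020Ratings'))
print(split_name_ratings('Grilled Cheese Sandwich855Ratings'))
```

This is ambiguous for a recipe whose name itself ends in a digit, so pulling the name and the rating count from separate tags in the html (if the page offers them) would be more robust.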


In [74]:
def strip_g(s):
    """ converts a gram string ('20g') to a float (20.0) """
    return float(s.replace('g', ''))

In [79]:
# just playing: convert gram strings ('20g') to floats so we can regress on them
x_feat_list =  ['Fat', 'Carbs', 'Protein']
for feat in x_feat_list:
    df[feat] = df[feat].map(strip_g)

In [81]:

x_feat_list =  ['Fat', 'Carbs', 'Protein']

y = df['Calories'].values
x = df.loc[:, x_feat_list]

In [83]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lin_reg = LinearRegression()
lin_reg.fit(x, y)
y_pred = lin_reg.predict(x)
r2 = r2_score(y_true=y, y_pred=y_pred)


In [84]:
r2

0.999681369112089

In [87]:
dict(zip(x_feat_list, lin_reg.coef_))

{'Fat': 8.560508216473371,
 'Carbs': 4.079712522080758,
 'Protein': 4.4359473499832385}
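The fitted coefficients are close to the commonly cited rule of thumb of roughly 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein, which is likely why the fit is so tight. A small sketch of that comparison, using the first three rows of the recipe table above (the names `KCAL_PER_GRAM` and `estimate_calories` are made up for this illustration):

```python
# rule-of-thumb energy densities (kcal per gram), not values from the lesson
KCAL_PER_GRAM = {'Fat': 9, 'Carbs': 4, 'Protein': 4}

def estimate_calories(fat, carbs, protein):
    """ estimates calories from grams of each macronutrient """
    return (KCAL_PER_GRAM['Fat'] * fat
            + KCAL_PER_GRAM['Carbs'] * carbs
            + KCAL_PER_GRAM['Protein'] * protein)

# (fat g, carbs g, protein g, reported calories) from the first three recipes
rows = [(20, 2, 6, 208),
        (14, 40, 2, 292),
        (28, 26, 11, 400)]

for fat, carbs, protein, reported in rows:
    est = estimate_calories(fat, carbs, protein)
    print(f'reported: {reported}, rule of thumb: {est}')
```

The small gaps are expected since the site rounds grams to whole numbers, and with only a handful of recipes the regression coefficients should not be read too precisely.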