# DS2000 (Spring 2019, NCH) :: Lecture 10b


## 0. Administrivia

1. Due today @ 9pm: HW9 (submit via Blackboard)
  - Last HW!!!
2. Due before Monday's lecture: pre-class quiz (via Blackboard; feel free to use book/notes/Python)
  - Look to the Topic Modeling reading on the course website
  - Last PCQ!!!
3. Next week...
  - From here on, no "default" office hours for Derbinsky (just e-mail to schedule a time)
  - "Case Study" week means no Derbinsky in lecture
4. Remainder of the semester...
  - Week 12: Monday=Pandas, Friday=Project Worktime
  - Week 13: Monday=ML Intro, Friday=Project Worktime
  - Week 14: Monday=Project Worktime
  - All done :'(

## Web Scraping
We've covered a variety of sources of data this semester (direct from a user, files, CSV, APIs). But sometimes there's data on a webpage -- *web scraping* is the process of automatically extracting information from a website.

Important notes:
1. Check the site's terms & conditions (i.e., don't steal or act unethically)
2. Assuming you have permission, you *could* just get the page as text and process it, but there is a better way...
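
As a minimal sketch of the "just get the page as text and process it" approach, the snippet below pulls a page title out of a string using only string methods. The page content is a made-up example written inline (no request is made), and the variable names are illustrative.

```python
# A tiny, hypothetical page written inline so no request is needed
page = '<html><head><title>Course Schedule</title></head><body><p>Hello</p></body></html>'

start_tag = '<title>'
end_tag = '</title>'

# str.find returns the index of the first match, or -1 if there is none
start = page.find(start_tag)
end = page.find(end_tag)

if start == -1 or end == -1:
    title = None
else:
    # slice out the text between the opening and closing tags
    title = page[start + len(start_tag):end]

print(title)
```

This works for the simple case, but it stops working as soon as the tag is written differently (e.g., `<TITLE>`, or a tag with extra attributes), which is why a dedicated parser tends to be the better way.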

## HTML
Webpages are (primarily) written in a language called [HTML](https://www.w3schools.com/html/), or "Hypertext Markup Language", which simply means a language used to "mark up" text, that is, to describe it using a formal language. We've already used Markdown this semester, which is like HTML-lite. HTML has "tags" that surround text and describe it. For example...

```
<table>
  <thead>
    <tr>
      <th>Key</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Foo</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Bar</td>
    </tr>
  </tbody>
</table>
```

is similar to

```
| Key        | Value  |
| ---------- |--------|
| 1          | Foo    |
| 2          | Bar    |
```

which looks like...

| Key        | Value  |
| ---------- |--------|
| 1          | Foo    |
| 2          | Bar    |

and could be interpreted in Python as...

```
{1:'Foo', 2:'Bar'}
```

By now you know enough to open a file containing HTML and pull this data out by hand, but there is a common Python module that makes this pretty easy...

## BeautifulSoup
The [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) module (`bs4`) makes it easy to parse HTML and then search & examine the results.

Some high-level guidance:
- First get your HTML (could be from a file, or the text from a request)
  - If doing this multiple times from a website, you might need a delay between subsequent requests
- Next, create a `bs4.BeautifulSoup` object using the `'html.parser'` parser
- The `find` method returns a single instance of the element/tag in the page (the first one found); `find_all` lets you loop through all such instances. Most basically you are looking for the name of the tag (e.g., `th`) but you could also look for "attributes" within it, such as the `href` of `<a href="http://google.com">link to google</a>`.
- Once you have an element, the `get_text` method allows you to get the text contained in it; `.contents` is a list of text/tags within the element; and `['attribute_name']` lets you access attribute values
- There's much more (see the documentation), but these should get you started :)
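
Here is a minimal sketch of those steps, run on the small Key/Value table from the HTML section plus the example link. The HTML is written inline as a string (so no file or request is needed), and the variable names are illustrative.

```python
import bs4

# The Key/Value table from the HTML section, plus one link
html = """
<table>
  <thead>
    <tr><th>Key</th><th>Value</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Foo</td></tr>
    <tr><td>2</td><td>Bar</td></tr>
  </tbody>
</table>
<a href="http://google.com">link to google</a>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# find_all returns every matching element; get_text gives the text inside each
header = [th.get_text() for th in soup.find_all('th')]

# build a dictionary from the two cells of each body row
table = {}
for row in soup.tbody.find_all('tr'):
    key_cell, value_cell = row.find_all('td')
    table[int(key_cell.get_text())] = value_cell.get_text()

# find returns the first matching element
link = soup.find('a')
link_url = link['href']        # attribute value
link_text = link.contents[0]   # first piece of text/tag inside the element

print(header)
print(table)
print(link_url, link_text)
```

The resulting `table` is the same dictionary shown at the end of the HTML section.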

### Example: Course Webpage
Let's take an easy example - the [course webpage schedule](https://course.ccs.neu.edu/ds2000sp19nch/sched.html). Let's say you want to import the schedule into useful Python (as a list of dictionaries)...

1. Look at the "source" of the webpage (go there in a browser, right click, view page source)

2. Find the schedule, and find the pattern of tags that will allow you to find and process it (Lucky for us, it's the only table!)

3. Now turn it into Python :)

In [2]:
import requests # used to get the HTML
import bs4 # beautiful soup

result = []

# Get the webpage
response = requests.get('https://course.ccs.neu.edu/ds2000sp19nch/sched.html')

# Make sure you got it
if response.status_code != 200:
    print("Error: {}".format(response.status_code))
else:
    # Instead of JSON, we are actually using the text, which is HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    
    # finds the only table on the page
    schedule = soup.find('table')
    
    # the "head" of the table has useful column names
    header = [element.get_text() for element in schedule.find_all('th')]
    
    # now we can loop to each row of the table
    for row in schedule.tbody.find_all('tr'):
        row_data = {}
        
        # we'll know which column we're in (via the header)
        for col_name, col_data in zip(header, row.find_all('td')):
            if col_name == 'Week':
                week_num = col_data.contents[0].strip()
                
                # Ignore Reading Week :)
                if not week_num:
                    break
                
                row_data['num'] = int(week_num)
                row_data['dates'] = col_data.small.get_text()
                
            elif col_name == 'Topics':
                # keep list items that are neither notes nor the practicum
                row_data['topics'] = [
                    el.get_text() for el in col_data.ul.find_all('li')
                    if 'Notes' not in el.get_text() and 'Practicum' not in el.get_text()
                ]
                row_data['extra'] = [el.get_text() for el in col_data.find_all('p')]
                
                for el in col_data.ul.find_all('li'):
                    if 'Practicum: ' in el.get_text():
                        row_data['practicum'] = el.get_text().split('Practicum: ')[1].split('\n')[0]
            
            elif 'Reading' in col_name:
                if col_data.find_all('a'):
                    row_data['readings'] = [{a.get_text():a['href']} for a in col_data.find_all('a')]
                else:
                    row_data['readings'] = col_data.get_text().replace(',', ' ').split()
                    
            elif 'Due' in col_name:
                row_data['due'] = [el.get_text() for el in col_data.find_all('li')]
                    
                    
        
        if row_data:
            result.append(row_data)
            
result

[{'num': 1,
  'dates': 'Jan 7 - 11',
  'topics': ['Administrivia: syllabus, websites',
   'What is programming? Why does it matter?',
   'What is a programming language? Why Python?',
   'The process of writing a program, code documentation',
   'Values, data types, variables',
   'Statements, expressions, functions',
   'Console input/output, formatted strings',
   'Handout',
   'Starter files'],
  'extra': ['Derbinsky @ NCH, 1/8-1/14'],
  'practicum': 'Install software, Hello, World!',
  'readings': ['1', '2', '3', '9.5.1', '10.1-10.6', '10.23-10.25'],
  'due': []},
 {'num': 2,
  'dates': 'Jan 14 - 18',
  'topics': ['Boolean variables/expressions',
   'Conditional statements',
   'for loops'],
  'extra': ['Pre-Class Quiz 1'],
  'practicum': 'In-Class Quiz 1',
  'readings': ['4.4-4.5', '7.1-7.7'],
  'due': ['Homework 1']},
 {'num': 3,
  'dates': 'Jan 21 - 25',
  'topics': ['range function', 'while loops'],
  'extra': ['Pre-Class Quiz 2'],
  'readings': ['4.7', '8'],
  'due': ['Homewor

### Example: Wikipedia Table
So there are sites out there that allow you to convert a table in a Wikipedia article to various formats...
- CSV: https://wikitable2csv.ggor.de
- JSON: https://www.wikitable2json.com

How do these work?

In [3]:
data = []

# Get the webpage
response = requests.get('https://en.wikipedia.org/wiki/List_of_cities_in_the_United_Kingdom')

# Make sure you got it
if response.status_code != 200:
    print("Error: {}".format(response.status_code))
else:
    # Instead of JSON, we are actually using the text, which is HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    
    # find returns the first table with class="wikitable" (assumed to be the one we want)
    table = soup.find('table', {'class':'wikitable'})
    
    # the "head" of the table has useful column names
    # some of them have issues... there are many ways to correct :)
    header = [element.get_text().strip().replace('[1]','').replace('grantedor', 'granted or') for element in table.find_all('th')]
    
    for row in table.find_all('tr')[1:]:
        row_data = {}
        
        # we'll know which column we're in (via the header)
        for col_name, col_data in zip(header, row.find_all('td')):
            if col_name == 'City':
                row_data['name'] = col_data.a.get_text()
        
            elif col_name == 'Population':
                row_data['population'] = int(col_data.contents[1].split()[0].replace(',',''))
                
        data.append(row_data)
        
data

[{'name': 'Aberdeen', 'population': 189120},
 {'name': 'Armagh', 'population': 59340},
 {'name': 'Bangor', 'population': 18808},
 {'name': 'Bath', 'population': 88859},
 {'name': 'Belfast', 'population': 333871},
 {'name': 'Birmingham', 'population': 1092330},
 {'name': 'Bradford', 'population': 522452},
 {'name': 'Brighton & Hove', 'population': 273369},
 {'name': 'Bristol', 'population': 428234},
 {'name': 'Cambridge', 'population': 123867},
 {'name': 'Canterbury', 'population': 151145},
 {'name': 'Cardiff', 'population': 346090},
 {'name': 'Carlisle', 'population': 107524},
 {'name': 'Chelmsford', 'population': 168310},
 {'name': 'Chester', 'population': 91733},
 {'name': 'Chichester', 'population': 26795},
 {'name': 'Coventry', 'population': 316915},
 {'name': 'Derby', 'population': 248752},
 {'name': 'Derry', 'population': 107877},
 {'name': 'Dundee', 'population': 153990},
 {'name': 'Durham', 'population': 94375},
 {'name': 'Edinburgh', 'population': 468720},
 {'name': 'Ely', 'po
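
To connect this back to the CSV converter sites, here is a rough sketch of writing a list of dictionaries like `data` out as CSV text using the standard `csv` module. The two sample rows are copied from the output above; this is one plausible approach, not necessarily how those sites work.

```python
import csv
import io

# Two rows copied from the output above; the full `data` list could be used instead
sample = [
    {'name': 'Aberdeen', 'population': 189120},
    {'name': 'Armagh', 'population': 59340},
]

# Write to an in-memory text buffer; an open file would work the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'population'])
writer.writeheader()      # first line: the column names
writer.writerows(sample)  # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

To save to disk instead, the buffer can be replaced with a file opened via `open('cities.csv', 'w', newline='')` (the filename here is just an example).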