DS2000 (Spring 2019, NCH) :: Lecture 10b

0. Administrivia

  1. Due today @ 9pm: HW9 (submit via Blackboard)
    • Last HW!!!
  2. Due before Monday's lecture: pre-class quiz (via Blackboard; feel free to use book/notes/Python)
    • See the Topic Modeling reading on the course website
    • Last PCQ!!!
  3. Next week...
    • No more "default" office hours for Derbinsky (just e-mail to schedule a time)
    • "Case Study" week means no Derbinsky in lecture
  4. Remainder of the semester...
    • Week 12: Monday=Pandas, Friday=Project Worktime
    • Week 13: Monday=ML Intro, Friday=Project Worktime
    • Week 14: Monday=Project Worktime
    • All done :'(

Web Scraping

We've covered a variety of data sources this semester (direct user input, files, CSV, APIs). But sometimes the data you want lives on a webpage -- web scraping is the process of automatically extracting information from a website.

Important notes:

  1. Check the site's terms & conditions (i.e., don't steal or act unethically)
  2. Assuming you have permission, you could just get the page as text and process it yourself (sketched below), but there is a better way...
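For instance, the "raw" approach might look like this minimal sketch (the URL is a placeholder; this kind of manual string surgery gets fragile quickly, which is exactly why a real parser helps):

import requests

# fetch a page and keep its raw HTML text (placeholder URL)
response = requests.get('https://example.com')
html = response.text

# crude string processing: pull out the page title by hand
# (assumes the page actually has a <title> tag)
start = html.find('<title>') + len('<title>')
end = html.find('</title>')
print(html[start:end])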

HTML

Webpages are (primarily) written in a language called HTML, or "Hypertext Markup Language" -- a language used to "mark up" text, i.e., describe it using a formal language. We've already used Markdown this semester, which is like HTML-lite. HTML has "tags" that surround text and describe it...

<table>
  <thead>
    <tr>
      <th>Key</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Foo</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Bar</td>
    </tr>
  </tbody>
</table>

is similar to

| Key        | Value  |
| ---------- |--------|
| 1          | Foo    |
| 2          | Bar    |

which looks like...

Key  Value
1    Foo
2    Bar

and could be interpreted in Python as...

{1:'Foo', 2:'Bar'}

By now you know enough to fetch a page and process its text yourself, but there is a common Python module that makes this much easier...

BeautifulSoup

The BeautifulSoup module (bs4) makes it easy to parse HTML and then search & examine the results.

Some high-level guidance:

  • First, get your HTML (could be from a file, or the text from a request)
    • If doing this multiple times from a website, you might need a delay between subsequent requests
  • Next, create a bs4.BeautifulSoup object using 'html.parser'
  • The find method returns the first instance of the element/tag in the page; find_all lets you loop through all such instances. Most basically, you search for the name of the tag (e.g., th), but you can also look for "attributes" within it, such as the href of <a href="http://google.com">link to google</a>.
  • Once you have an element, the get_text method allows you to get the text contained in it; .contents is a list of text/tags within the element; and ['attribute_name'] lets you access attribute values
  • There's much more (see the documentation), but these should get you started :) -- see the minimal sketch below
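To make that concrete, here's a minimal sketch of the whole workflow (the URL and tag names are placeholders; adapt them to whatever page you're scraping):

import time
import requests # used to get the HTML
import bs4      # beautiful soup

# get the HTML text (placeholder URL -- swap in your own target)
response = requests.get('https://example.com')
time.sleep(1)   # be polite: pause before any subsequent requests

# parse the text as HTML
soup = bs4.BeautifulSoup(response.text, 'html.parser')

# find returns the first matching element (or None if there isn't one)
link = soup.find('a')
if link is not None:
    print(link.get_text())  # the text inside the tag
    print(link['href'])     # the href attribute (assumes the link has one)

# find_all returns a list of every match, which you can loop over
for heading in soup.find_all('h1'):
    print(heading.get_text())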

Example: Course Webpage

Let's take an easy example - the course webpage schedule. Let's say you want to import the schedule into a useful Python structure (a list of dictionaries)...

  1. Look at the "source" of the webpage (go there in a browser, right click, view page source)

  2. Find the schedule, and identify the pattern of tags that will let you locate and process it (lucky for us, it's the only table!)

  3. Now turn it into Python :)

In [2]:
import requests # used to get the HTML
import bs4 # beautiful soup

result = []

# Get the webpage
response = requests.get('https://course.ccs.neu.edu/ds2000sp19nch/sched.html')

# Make sure you got it
if response.status_code != 200:
    print("Error: {}".format(response.status_code))
else:
    # Instead of JSON, we are actually using the text, which is HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    
    # finds the only table on the page
    schedule = soup.find('table')
    
    # the "head" of the table has useful column names
    header = [element.get_text() for element in schedule.find_all('th')]
    
    # now we can loop over each row of the table
    for row in schedule.tbody.find_all('tr'):
        row_data = {}
        
        # we'll know which column we're in (via the header)
        for col_name, col_data in zip(header, row.find_all('td')):
            if col_name == 'Week':
                week_num = col_data.contents[0].strip()
                
                # Ignore Reading Week :)
                if not week_num:
                    break
                
                row_data['num'] = int(week_num)
                row_data['dates'] = col_data.small.get_text()
                
            elif col_name == 'Topics':
                row_data['topics'] = [el.get_text() for el in col_data.ul.find_all('li')
                                      if ('Notes' not in el.get_text()) and ('Practicum' not in el.get_text())]
                row_data['extra'] = [el.get_text() for el in col_data.find_all('p')]
                
                for el in col_data.ul.find_all('li'):
                    if 'Practicum: ' in el.get_text():
                        row_data['practicum'] = el.get_text().split('Practicum: ')[1].split('\n')[0]
            
            elif 'Reading' in col_name:
                if col_data.find_all('a'):
                    row_data['readings'] = [{a.get_text():a['href']} for a in col_data.find_all('a')]
                else:
                    row_data['readings'] = col_data.get_text().replace(',', ' ').split()
                    
            elif 'Due' in col_name:
                row_data['due'] = [el.get_text() for el in col_data.find_all('li')]

        if row_data:
            result.append(row_data)
            
result
Out[2]:
[{'num': 1,
  'dates': 'Jan 7 - 11',
  'topics': ['Administrivia: syllabus, websites',
   'What is programming? Why does it matter?',
   'What is a programming language? Why Python?',
   'The process of writing a program, code documentation',
   'Values, data types, variables',
   'Statements, expressions, functions',
   'Console input/output, formatted strings',
   'Handout',
   'Starter files'],
  'extra': ['Derbinsky @ NCH, 1/8-1/14'],
  'practicum': 'Install software, Hello, World!',
  'readings': ['1', '2', '3', '9.5.1', '10.1-10.6', '10.23-10.25'],
  'due': []},
 {'num': 2,
  'dates': 'Jan 14 - 18',
  'topics': ['Boolean variables/expressions',
   'Conditional statements',
   'for loops'],
  'extra': ['Pre-Class Quiz 1'],
  'practicum': 'In-Class Quiz 1',
  'readings': ['4.4-4.5', '7.1-7.7'],
  'due': ['Homework 1']},
 {'num': 3,
  'dates': 'Jan 21 - 25',
  'topics': ['range function', 'while loops'],
  'extra': ['Pre-Class Quiz 2'],
  'readings': ['4.7', '8'],
  'due': ['Homework 2']},
 {'num': 4,
  'dates': 'Jan 28 - Feb 1',
  'topics': ['Creating functions', 'Variable scope', 'Tuples'],
  'extra': ['Pre-Class Quiz 3'],
  'practicum': 'In-Class Quiz 2',
  'readings': ['6.1-6.9', '7.8', '10.26-10.28'],
  'due': ['Homework 3']},
 {'num': 5,
  'dates': 'Feb 4 - 8',
  'topics': ['Modules',
   '__main__',
   'Useful string/list functions',
   'List comprehensions'],
  'extra': ['Pre-Class Quiz 4'],
  'readings': ['5', '6.79', '10'],
  'due': ['Homework 4']},
 {'num': 6,
  'dates': 'Feb 11 - 15',
  'topics': ['Files'],
  'extra': ['Pre-Class Quiz 5'],
  'practicum': 'In-Class Quiz 3',
  'readings': ['11'],
  'due': ['Homework 5']},
 {'num': 7,
  'dates': 'Feb 25 - Mar 1',
  'topics': ['Dictionaries'],
  'extra': ['Pre-Class Quiz 6', 'Derbinsky @ NCH, 3/2-3/8'],
  'practicum': 'In-Class Quiz 4',
  'readings': ['12'],
  'due': ['Homework 6']},
 {'num': 8,
  'dates': 'Mar 4 - 8',
  'topics': ['Object-Oriented Programming'],
  'extra': ['Pre-Class Quiz 7'],
  'readings': ['16', '17'],
  'due': ['Homework 7']},
 {'num': 9,
  'dates': 'Mar 11 - 15',
  'topics': ['Jupyter Notebooks', 'Visualization via Matplotlib'],
  'extra': ['Pre-Class Quiz 8'],
  'practicum': 'In-Class Quiz 5',
  'readings': [{'Jupyter Tutorial': 'https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook'},
   {'Notebook Gallery': 'https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks'},
   {'Markdown Tutorial': 'https://www.markdowntutorial.com'},
   {'Markdown Cheatsheet': 'https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet'},
   {'Pyplot Tutorial': 'https://matplotlib.org/users/pyplot_tutorial.html'}],
  'due': ['Homework 8']},
 {'num': 10,
  'dates': 'Mar 18 - 22',
  'topics': ['Navigating documentation', 'CSV Files', 'APIs'],
  'extra': ['Pre-Class Quiz 9'],
  'readings': [{'csv Module': 'https://docs.python.org/3/library/csv.html'},
   {'requests Module': 'http://docs.python-requests.org/en/master/'}],
  'due': ['Homework 9']},
 {'num': 11,
  'dates': 'Mar 25 - 29',
  'topics': ['Case Study'],
  'extra': ['Pre-Class Quiz 10'],
  'readings': [{'Topic Modeling': 'ssl/practicum/blei2013.pdf'}],
  'due': []},
 {'num': 12,
  'dates': 'Apr 1 - 5',
  'topics': ['Pandas', 'Project Worktime'],
  'extra': [],
  'readings': [{'10 Minutes to pandas': 'https://pandas.pydata.org/pandas-docs/stable/10min.html'},
   {'Why Jupyter is data scientists’ computational notebook of choice': 'https://www.nature.com/articles/d41586-018-07196-1'}],
  'due': []},
 {'num': 13,
  'dates': 'Apr 8 - 12',
  'topics': ['Machine Learning', 'Project Worktime'],
  'extra': [],
  'readings': [{'An introduction to machine learning with scikit-learn': 'http://scikit-learn.org/stable/tutorial/basic/tutorial.html'}],
  'due': []},
 {'num': 14,
  'dates': 'Apr 15 - 18',
  'topics': ['Project Worktime'],
  'extra': [],
  'readings': [],
  'due': []}]

Example: Wikipedia Table

So there are sites out there that will convert a table in a Wikipedia article into various formats...

How do these work?

In [3]:
data = []

# Get the webpage
response = requests.get('https://en.wikipedia.org/wiki/List_of_cities_in_the_United_Kingdom')

# Make sure you got it
if response.status_code != 200:
    print("Error: {}".format(response.status_code))
else:
    # Instead of JSON, we are actually using the text, which is HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    
    # finds the first table with class="wikitable" (we assume it's the one we want)
    table = soup.find('table', {'class':'wikitable'})
    
    # the "head" of the table has useful column names
    # some of them have issues... there are many ways to correct :)
    header = [element.get_text().strip().replace('[1]', '').replace('grantedor', 'granted or')
              for element in table.find_all('th')]
    
    for row in table.find_all('tr')[1:]:
        row_data = {}
        
        # we'll know which column we're in (via the header)
        for col_name, col_data in zip(header, row.find_all('td')):
            if col_name == 'City':
                row_data['name'] = col_data.a.get_text()
        
            elif col_name == 'Population':
                row_data['population'] = int(col_data.contents[1].split()[0].replace(',',''))
                
        data.append(row_data)
        
data
Out[3]:
[{'name': 'Aberdeen', 'population': 189120},
 {'name': 'Armagh', 'population': 59340},
 {'name': 'Bangor', 'population': 18808},
 {'name': 'Bath', 'population': 88859},
 {'name': 'Belfast', 'population': 333871},
 {'name': 'Birmingham', 'population': 1092330},
 {'name': 'Bradford', 'population': 522452},
 {'name': 'Brighton & Hove', 'population': 273369},
 {'name': 'Bristol', 'population': 428234},
 {'name': 'Cambridge', 'population': 123867},
 {'name': 'Canterbury', 'population': 151145},
 {'name': 'Cardiff', 'population': 346090},
 {'name': 'Carlisle', 'population': 107524},
 {'name': 'Chelmsford', 'population': 168310},
 {'name': 'Chester', 'population': 91733},
 {'name': 'Chichester', 'population': 26795},
 {'name': 'Coventry', 'population': 316915},
 {'name': 'Derby', 'population': 248752},
 {'name': 'Derry', 'population': 107877},
 {'name': 'Dundee', 'population': 153990},
 {'name': 'Durham', 'population': 94375},
 {'name': 'Edinburgh', 'population': 468720},
 {'name': 'Ely', 'population': 20256},
 {'name': 'Exeter', 'population': 117773},
 {'name': 'Glasgow', 'population': 603080},
 {'name': 'Gloucester', 'population': 121688},
 {'name': 'Hereford', 'population': 58896},
 {'name': 'Inverness', 'population': 79415},
 {'name': 'Kingston upon Hull', 'population': 256406},
 {'name': 'Lancaster', 'population': 138375},
 {'name': 'Leeds', 'population': 751485},
 {'name': 'Leicester', 'population': 329839},
 {'name': 'Lichfield', 'population': 32219},
 {'name': 'Lincoln', 'population': 93541},
 {'name': 'Lisburn', 'population': 120165},
 {'name': 'Liverpool', 'population': 466415},
 {'name': 'London', 'population': 7375},
 {'name': 'Manchester', 'population': 503127},
 {'name': 'Newcastle upon Tyne', 'population': 280177},
 {'name': 'Newport', 'population': 145736},
 {'name': 'Newry', 'population': 29946},
 {'name': 'Norwich', 'population': 132512},
 {'name': 'Nottingham', 'population': 305680},
 {'name': 'Oxford', 'population': 151906},
 {'name': 'Perth', 'population': 45770},
 {'name': 'Peterborough', 'population': 183631},
 {'name': 'Plymouth', 'population': 256384},
 {'name': 'Portsmouth', 'population': 205056},
 {'name': 'Preston', 'population': 140202},
 {'name': 'Ripon', 'population': 16702},
 {'name': 'St Albans', 'population': 140644},
 {'name': 'St Asaph', 'population': 3355},
 {'name': 'St Davids', 'population': 1841},
 {'name': 'Salford', 'population': 233933},
 {'name': 'Salisbury', 'population': 40302},
 {'name': 'Sheffield', 'population': 552698},
 {'name': 'Southampton', 'population': 236882},
 {'name': 'Stirling', 'population': 34790},
 {'name': 'Stoke-on-Trent', 'population': 249008},
 {'name': 'Sunderland', 'population': 275506},
 {'name': 'Swansea', 'population': 239023},
 {'name': 'Truro', 'population': 18766},
 {'name': 'Wakefield', 'population': 325837},
 {'name': 'Wells', 'population': 10536},
 {'name': 'Westminster', 'population': 219396},
 {'name': 'Winchester', 'population': 116595},
 {'name': 'Wolverhampton', 'population': 249470},
 {'name': 'Worcester', 'population': 98768},
 {'name': 'York', 'population': 198051}]
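Once the table is in plain Python, the usual tools apply -- for example, a quick sketch using the data list built above (which, as the output shows, has 'name' and 'population' for every row):

# the five most populous cities in the scraped table
for city in sorted(data, key=lambda c: c['population'], reverse=True)[:5]:
    print('{}: {:,}'.format(city['name'], city['population']))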