We've covered a variety of sources of data this semester (direct from a user, files, CSV, APIs). But sometimes the data you want lives on a webpage -- web scraping is the process of automatically extracting information from a website.
Important notes:
Webpages are (primarily) written in a language called HTML, or "Hypertext Markup Language", which simply means a language used to "mark up" text, i.e., describe it using a formal notation. We've already used Markdown this semester, which is like HTML-lite. HTML has "tags" that surround text and provide a description of it...
<table>
  <thead>
    <tr>
      <th>Key</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Foo</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Bar</td>
    </tr>
  </tbody>
</table>
is similar to
| Key | Value |
| --- | ----- |
| 1   | Foo   |
| 2   | Bar   |
which looks like...
Key    Value
-----  -----
1      Foo
2      Bar
and could be interpreted in Python as...
{1: 'Foo', 2: 'Bar'}
By now you know enough to fetch a page and pick through the raw HTML yourself, but there is a common Python module that makes this pretty easy...
The BeautifulSoup module (bs4) makes it easy to read in HTML and then search and examine the results.
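(If it isn't already installed, bs4 typically comes from the beautifulsoup4 package on PyPI: pip install beautifulsoup4.)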
Some high-level guidance:
- Create a `bs4.BeautifulSoup` object using the built-in `html.parser`
- The `find` method returns a single instance of the element/tag in the page; `find_all` lets you loop through all such instances. Most basically you are looking for the name of the tag (e.g., `th`), but you could also look for "attributes" within it, such as the `href` of `<a href="http://google.com">link to google</a>`
- Once you have an element, the `.get_text` method allows you to get the text contained in it; `.contents` is a list of the text/tags within the element; and `['attribute_name']` lets you access attribute values (see the sketch after this list)
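Here's a minimal sketch of those methods in action; the html string below is made up (the small table from earlier plus a link, so there's an attribute to grab):
import bs4

# A tiny, made-up HTML snippet: the table from above plus a link,
# so we have an attribute (href) to demonstrate
html = """
<table>
  <tbody>
    <tr><td>1</td><td>Foo</td></tr>
    <tr><td>2</td><td>Bar</td></tr>
  </tbody>
</table>
<a href="http://google.com">link to google</a>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

print(soup.find('td').get_text())                     # 1 -- find returns the first match only
print([td.get_text() for td in soup.find_all('td')])  # ['1', 'Foo', '2', 'Bar']
print(soup.find('tr').contents)                       # the text/tags inside the first row
print(soup.find('a')['href'])                         # http://google.com

# ...and the whole table as a dictionary, like the earlier example
print({int(row.find_all('td')[0].get_text()): row.find_all('td')[1].get_text()
       for row in soup.find_all('tr')})               # {1: 'Foo', 2: 'Bar'}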
Let's take an easy example: the course webpage schedule. Let's say you want to import the schedule into useful Python (as a list of dictionaries)...
1. Look at the "source" of the webpage (go there in a browser, right click, view page source)
2. Find the schedule, and find the pattern of tags that will allow you to find and process it (luckily for us, it's the only table!)
3. Now turn it into Python :)
import requests  # used to get the HTML
import bs4       # Beautiful Soup

result = []

# Get the webpage
response = requests.get('https://course.ccs.neu.edu/ds2000sp19nch/sched.html')

# Make sure you got it
if response.status_code != 200:
    print("Error: {}".format(response.status_code))
else:
    # Instead of JSON, we are actually using the text, which is HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # finds the only table on the page
    schedule = soup.find('table')
    # the "head" of the table has useful column names
    header = [element.get_text() for element in schedule.find_all('th')]
    # now we can loop through each row of the table
    for row in schedule.tbody.find_all('tr'):
        row_data = {}
        # zip pairs each cell with its column name, so we know which column we're in
        for col_name, col_data in zip(header, row.find_all('td')):
            if col_name == 'Week':
                week_num = col_data.contents[0].strip()
                # Ignore Reading Week :)
                if not week_num:
                    break
                row_data['num'] = int(week_num)
                row_data['dates'] = col_data.small.get_text()
            elif col_name == 'Topics':
                row_data['topics'] = [el.get_text()
                                      for el in col_data.ul.find_all('li')
                                      if ('Notes' not in el.get_text())
                                      and ('Practicum' not in el.get_text())]
                row_data['extra'] = [el.get_text() for el in col_data.find_all('p')]
                for el in col_data.ul.find_all('li'):
                    if 'Practicum: ' in el.get_text():
                        row_data['practicum'] = el.get_text().split('Practicum: ')[1].split('\n')[0]
            elif 'Reading' in col_name:
                if col_data.find_all('a'):
                    row_data['readings'] = [{a.get_text(): a['href']} for a in col_data.find_all('a')]
                else:
                    row_data['readings'] = col_data.get_text().replace(',', ' ').split()
            elif 'Due' in col_name:
                row_data['due'] = [el.get_text() for el in col_data.find_all('li')]
        if row_data:
            result.append(row_data)

result
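As a quick sanity check (a sketch assuming the request above succeeded and result got filled in), you could summarize what was scraped:
# Summarize each scraped week (keys as built in the loop above)
for week in result:
    print(week.get('num'), week.get('dates'), '->', ', '.join(week.get('topics', [])))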
So there are sites out there that allow you to convert a table in a Wikipedia article to various formats...
How do these work?
data = []

# Get the webpage
response = requests.get('https://en.wikipedia.org/wiki/List_of_cities_in_the_United_Kingdom')

# Make sure you got it
if response.status_code != 200:
    print("Error: {}".format(response.status_code))
else:
    # Instead of JSON, we are actually using the text, which is HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # Assumes only one table on the page with class="wikitable"
    table = soup.find('table', {'class': 'wikitable'})
    # the "head" of the table has useful column names...
    # some of them have issues; there are many ways to correct :)
    header = [element.get_text().strip().replace('[1]', '').replace('grantedor', 'granted or')
              for element in table.find_all('th')]
    # skip the first row (the header itself), then process each data row
    for row in table.find_all('tr')[1:]:
        row_data = {}
        # zip pairs each cell with its column name, so we know which column we're in
        for col_name, col_data in zip(header, row.find_all('td')):
            if col_name == 'City':
                row_data['name'] = col_data.a.get_text()
            elif col_name == 'Population':
                row_data['population'] = int(col_data.contents[1].split()[0].replace(',', ''))
        data.append(row_data)

data
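That's essentially what those converter sites do; once you have a list of dictionaries, producing one of the "various formats" is just file writing. A minimal sketch using the csv module (the filename uk_cities.csv is made up):
import csv

# Write the scraped rows as CSV; DictWriter fills any missing keys with ''
with open('uk_cities.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['name', 'population'])
    writer.writeheader()
    writer.writerows(data)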