Scraping the web is fun:
In [107]:
import requests
In [125]:
r = requests.get("http://berkeley.edu")
r
Out[125]:
In [110]:
dir(r)
Out[110]:
In [124]:
r.encoding
Out[124]:
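Before we scrape anything, a few Response attributes are worth knowing; these are all standard parts of the requests API:

r = requests.get("http://berkeley.edu")
print(r.status_code)              # 200 on success
print(r.headers["Content-Type"])  # what the server says it sent back
print(r.encoding)                 # the encoding requests guessed from the headers
html = r.text                     # the body, decoded to a str with that encoding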
Let's scrape the daily crime log from Cambridge, Massachusetts!
One of the most powerful Python libraries for web scraping is lxml...
In [8]:
import lxml.html as LH
url = "http://www.cambridgema.gov/cpd/newsandalerts/Archives/detail.aspx?path=%2fsitecore%2fcontent%2fhome%2fcpd%2fnewsandalerts%2fArchives%2f2015%2f10%2f10092015"
tree = LH.parse(url)  # lxml fetches the URL itself and parses the HTML
table = [td.text_content() for td in tree.xpath('//td')]  # text of every <td> on the page
In [9]:
table[:10]
Out[9]:
Wow, that's ugly: '//td' grabs every cell on the page as one flat, undifferentiated list.
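You can get something closer to actual rows by walking <tr> elements instead. A sketch, assuming the listing lives in ordinary table markup (not verified against the live page):

for tr in tree.xpath('//tr'):
    cells = [td.text_content().strip() for td in tr.xpath('./td')]
    if cells:
        print(cells)  # one list of cell strings per table row

But wait, can't pandas do all of this for us? Yes!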
In [91]:
import pandas
tables = pandas.read_html(url)
In [106]:
print(type(tables))
print(len(tables))
print(type(tables[0]))
tables[0]
Out[106]:
In [104]:
df = tables[0]
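An aside: read_html also accepts raw HTML, so if a site ever needs custom headers or cookies, you can fetch with requests and hand the body to pandas. A minimal sketch (the User-Agent value here is just an illustration):

import io
resp = requests.get(url, headers={"User-Agent": "crime-log-notebook/0.1"})
resp.raise_for_status()                            # fail loudly on HTTP errors
tables = pandas.read_html(io.StringIO(resp.text))  # same list of DataFrames as above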
Let's get rid of the first 2 rows.
In [98]:
df = df.iloc[2:]  # .ix is long gone from pandas; iloc slices rows by position
df
Out[98]:
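Equivalently, read_html can drop those rows at parse time, which makes this a one-liner; a sketch:

df = pandas.read_html(url, skiprows=2)[0].reset_index(drop=True)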
And now let's parse the text of the first column and pull it all together.
In [100]:
def parse_crime(text):
    # The first four whitespace-separated tokens are the date, time,
    # crime type, and incident ID; whatever is left is the summary.
    words = text.split()
    date, time, crime_type, id_code = words[:4]
    summary = " ".join(words[4:])
    return pandas.Series([date, time, crime_type, id_code, summary])
In [102]:
parsed = df[0].apply(parse_crime)  # one parsed row per incident
parsed["description"] = df[1]      # carry the second table column along
parsed.columns = ["date", "time", "crime_type", "id_code", "summary", "description"]
parsed
Out[102]:
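If you want real timestamps rather than strings, pandas can combine and parse the two columns; a sketch, assuming the scraped date and time strings are in a format to_datetime recognizes:

parsed["timestamp"] = pandas.to_datetime(
    parsed["date"] + " " + parsed["time"],
    errors="coerce",  # unparseable rows become NaT instead of raising
)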
So now we just need to encapsulate all of the above as a function and feed it a list of URLs from the crime log archive. I leave that as an exercise for the reader...
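If you want a head start, here is one possible skeleton; the archive_urls list is hypothetical, and building it from the archive pages is the actual exercise:

def scrape_crime_log(url):
    # Fetch one archive page, drop the junk rows, and parse it.
    df = pandas.read_html(url, skiprows=2)[0]
    parsed = df[0].apply(parse_crime)
    parsed["description"] = df[1]
    parsed.columns = ["date", "time", "crime_type", "id_code",
                      "summary", "description"]
    return parsed

# archive_urls = [...]  # a list you build from the crime log archive index
# crime_log = pandas.concat(scrape_crime_log(u) for u in archive_urls)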
PROBLEM SET
1. How many academic departments and programs does UC Berkeley have?
2. Which department or program offers the most diverse set of graduate degrees?
3. Which Berkeley student organization has the longest name?
4. Scrape the entire Berkeley campus directory by UID. (Hint: This is a person, and this is a person, but not this or this or this. Look at the URLs.)