In [107]:
import requests

In [125]:
r = requests.get("http://berkeley.edu")
r


Out[125]:
<Response [200]>

In [110]:
dir(r)


Out[110]:
['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [124]:
r.encoding


Out[124]:
'utf-8'
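Of the attributes listed by dir(r), the ones that tend to matter most when scraping are status_code, raise_for_status, encoding and text. Here is a minimal sketch of how they might fit together. The helper name fetch_html and the 10-second timeout are illustrative assumptions, not something the notebook defines.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, raising on HTTP errors.

    `fetch_html` and the 10-second default are hypothetical choices
    made for this sketch.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    if r.encoding is None:
        # requests found no encoding in the headers, so fall back to its guess
        r.encoding = r.apparent_encoding
    return r.text
```

Calling fetch_html("http://berkeley.edu") should then return the page source as a string, provided the request succeeds.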

Let's scrape the daily crime log from Cambridge, Massachusetts!

One of the most powerful Python libraries for parsing HTML is lxml...


In [8]:
import lxml.html as LH

url = "http://www.cambridgema.gov/cpd/newsandalerts/Archives/detail.aspx?path=%2fsitecore%2fcontent%2fhome%2fcpd%2fnewsandalerts%2fArchives%2f2015%2f10%2f10092015"
tree = LH.parse(url)
table = [td.text_content() for td in tree.xpath('//td')]

In [9]:
table[:10]


Out[9]:
['\r\n            Cambridge Police Daily Log: October 9th, 2015\r\n            ',
 '\r\n            Type #\r\n            Date & Time\r\n            ',
 '\r\n            Info\r\n            ',
 '\r\n            10/08/2015\xa007:02\r\n            TRAFFIC \xa015007586\r\n            LEAVE SCENE OF PROPERTY DAMAGE c90 S24 \r\n            ',
 "\r\n            WINDSOR ST\r\n            A Tewksbury resident called police to report that at approximately 9 a.m. on 10/7/15, a rental truck struck and damaged the driver's side mirror/housing, and then left the scene without making himself known and/or leaving any of his information. \r\n            ",
 '\r\n            10/08/2015\xa008:39\r\n            INCIDENT \xa015007587\r\n            ',
 '\r\n            MASSACHUSETTS AVE\r\n            Robert Mulcahy, 56, 2 Harrington Terrace in Cambridge, was placed under arrest for Vandalize Property and Threats to Commit a Crime. Mulcahy engaged in a verbal argument with a taxi driver, kicked the rear back side of the taxi causing minor damage and threatened to kill the taxi driver and passenger.\r\n            ',
 '\r\n            10/08/2015\xa009:57\r\n            INCIDENT \xa015007590\r\n            LARCENY OVER $250 c266 S30 \r\n            ',
 '\r\n            JFK ST\r\n            A Watertown resident reports that on 10/5/15, he caught one of his employees stealing at his place of business. The manager reports that this theft of money from the register has been going on since April. The manager states he came up short on Monday after counting the money exactly the night before and the morning of and the suspect and he were the only two people at the store.\r\n            ',
 '\r\n            10/08/2015\xa010:43\r\n            INCIDENT \xa015007592\r\n            TRAFFIC INVESTIGATIONS \r\n            ']
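If you stay with lxml, most of the noise in these strings is layout whitespace: carriage returns, indentation and non-breaking spaces (\xa0). One possible way to tidy a cell is sketched below. The name clean_cell is an illustrative assumption, and the two sample strings are copied from the output above.

```python
def clean_cell(text):
    """Collapse every run of whitespace (\\r\\n, indentation, \\xa0) into one space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space \xa0.
    return " ".join(text.split())


cells = [
    '\r\n            Info\r\n            ',
    '\r\n            10/08/2015\xa007:02\r\n            TRAFFIC \xa015007586\r\n            ',
]
cleaned = [clean_cell(c) for c in cells]
print(cleaned)  # ['Info', '10/08/2015 07:02 TRAFFIC 15007586']
```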

Wow that's ugly. Wait, can't pandas do this? Yes!


In [91]:
import pandas

tables = pandas.read_html(url)

In [106]:
print(type(tables))
print(len(tables))
print(type(tables[0]))
tables[0]


<class 'list'>
1
<class 'pandas.core.frame.DataFrame'>
Out[106]:
0 1
0 Cambridge Police Daily Log: October 9th, 2015 NaN
1 Type # Date & Time Info
2 10/08/2015 07:02 TRAFFIC 15007586 LEAVE SCEN... WINDSOR ST A Tewksbury resident called police...
3 10/08/2015 08:39 INCIDENT 15007587 MASSACHUSETTS AVE Robert Mulcahy, 56, 2 Harri...
4 10/08/2015 09:57 INCIDENT 15007590 LARCENY O... JFK ST A Watertown resident reports that on 1...
5 10/08/2015 10:43 INCIDENT 15007592 TRAFFIC I... THIRD ST A motor vehicle turning westbound on...
6 10/08/2015 10:52 INCIDENT 15007591 CRIMINAL ... PUTNAM GDNS A Cambridge resident reports that ...
7 10/08/2015 11:34 INCIDENT 15007593 CROSSWALK... RICHDALE AVE A Somerville woman operating a 2...
8 10/08/2015 13:31 INCIDENT 15007595 ASSAULT W... MASSACHUSETTS AVE Cambridge Police units were...
9 10/08/2015 13:38 INCIDENT 15007596 SHOPLIFTI... CAMBRIDGESIDE PL Adel Ouansa, 23, 25 Madison ...
10 10/08/2015 14:16 INCIDENT 15007597 MISC. REP... BROADWAY Cambridge Police responded to the pe...
11 10/08/2015 18:09 INCIDENT 15007604 CRIMINAL ... GORE ST A resident of Cambridge called the Ca...
12 10/08/2015 18:46 INCIDENT 15007601 LARCENY O... OTIS ST A resident of Weymouth walked into th...
13 10/08/2015 19:14 INCIDENT 15007603 ROBBERY, ... ALEWIFE BROOK PKWY A loss prevention officer ...
14 10/08/2015 20:05 INCIDENT 15007605 SHOPLIFTI... CAMBRIDGESIDE PL Three juvenile females were ...
15 10/08/2015 23:15 INCIDENT 15007607 A&B WITH ... CAMBRIDGE ST Cambridge Police responded to a ...

In [104]:
df = tables[0]

Let's get rid of the first 2 rows.


In [98]:
# .ix has been removed from pandas; .iloc slices by position instead
df = df.iloc[2:]
df


Out[98]:
0 1
2 10/08/2015 07:02 TRAFFIC 15007586 LEAVE SCEN... WINDSOR ST A Tewksbury resident called police...
3 10/08/2015 08:39 INCIDENT 15007587 MASSACHUSETTS AVE Robert Mulcahy, 56, 2 Harri...
4 10/08/2015 09:57 INCIDENT 15007590 LARCENY O... JFK ST A Watertown resident reports that on 1...
5 10/08/2015 10:43 INCIDENT 15007592 TRAFFIC I... THIRD ST A motor vehicle turning westbound on...
6 10/08/2015 10:52 INCIDENT 15007591 CRIMINAL ... PUTNAM GDNS A Cambridge resident reports that ...
7 10/08/2015 11:34 INCIDENT 15007593 CROSSWALK... RICHDALE AVE A Somerville woman operating a 2...
8 10/08/2015 13:31 INCIDENT 15007595 ASSAULT W... MASSACHUSETTS AVE Cambridge Police units were...
9 10/08/2015 13:38 INCIDENT 15007596 SHOPLIFTI... CAMBRIDGESIDE PL Adel Ouansa, 23, 25 Madison ...
10 10/08/2015 14:16 INCIDENT 15007597 MISC. REP... BROADWAY Cambridge Police responded to the pe...
11 10/08/2015 18:09 INCIDENT 15007604 CRIMINAL ... GORE ST A resident of Cambridge called the Ca...
12 10/08/2015 18:46 INCIDENT 15007601 LARCENY O... OTIS ST A resident of Weymouth walked into th...
13 10/08/2015 19:14 INCIDENT 15007603 ROBBERY, ... ALEWIFE BROOK PKWY A loss prevention officer ...
14 10/08/2015 20:05 INCIDENT 15007605 SHOPLIFTI... CAMBRIDGESIDE PL Three juvenile females were ...
15 10/08/2015 23:15 INCIDENT 15007607 A&B WITH ... CAMBRIDGE ST Cambridge Police responded to a ...

And now let's parse the text of the 1st column, and pull it all together.


In [100]:
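def parse_crime(text):
    # split() with no arguments also splits on the non-breaking spaces (\xa0)
    words = text.split()
    # the first four words are always date, time, type and ID
    date, time, crime_type, id_code = words[:4]
    # whatever is left (possibly nothing) is the charge summary
    description = " ".join(words[4:])
    return pandas.Series([date, time, crime_type, id_code, description])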

In [102]:
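# apply() returns one row per incident, with columns 0 to 4
parsed = df[0].apply(parse_crime)
# append the narrative from the second table column as a sixth column
parsed["description"] = df[1]
# name all six columns; column 4 (the charge text) becomes "summary"
parsed.columns = ["date", "time", "crime_type", "id_code", "summary", "description"]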
parsed


Out[102]:
date time crime_type id_code summary description
2 10/08/2015 07:02 TRAFFIC 15007586 LEAVE SCENE OF PROPERTY DAMAGE c90 S24 WINDSOR ST A Tewksbury resident called police...
3 10/08/2015 08:39 INCIDENT 15007587 MASSACHUSETTS AVE Robert Mulcahy, 56, 2 Harri...
4 10/08/2015 09:57 INCIDENT 15007590 LARCENY OVER $250 c266 S30 JFK ST A Watertown resident reports that on 1...
5 10/08/2015 10:43 INCIDENT 15007592 TRAFFIC INVESTIGATIONS THIRD ST A motor vehicle turning westbound on...
6 10/08/2015 10:52 INCIDENT 15007591 CRIMINAL HARASSMENT PUTNAM GDNS A Cambridge resident reports that ...
7 10/08/2015 11:34 INCIDENT 15007593 CROSSWALK VIOLATION * C89 S11 RICHDALE AVE A Somerville woman operating a 2...
8 10/08/2015 13:31 INCIDENT 15007595 ASSAULT W/DANGEROUS WEAPON c265 S15B MASSACHUSETTS AVE Cambridge Police units were...
9 10/08/2015 13:38 INCIDENT 15007596 SHOPLIFTING $100+ BY CONCEALING MDSE C266 S30A CAMBRIDGESIDE PL Adel Ouansa, 23, 25 Madison ...
10 10/08/2015 14:16 INCIDENT 15007597 MISC. REPORT TYPE BROADWAY Cambridge Police responded to the pe...
11 10/08/2015 18:09 INCIDENT 15007604 CRIMINAL HARASSMENT GORE ST A resident of Cambridge called the Ca...
12 10/08/2015 18:46 INCIDENT 15007601 LARCENY OVER $250 c266 S30 OTIS ST A resident of Weymouth walked into th...
13 10/08/2015 19:14 INCIDENT 15007603 ROBBERY, UNARMED c265 S19 ALEWIFE BROOK PKWY A loss prevention officer ...
14 10/08/2015 20:05 INCIDENT 15007605 SHOPLIFTING $100+ BY CONCEALING MDSE C266 S30A CAMBRIDGESIDE PL Three juvenile females were ...
15 10/08/2015 23:15 INCIDENT 15007607 A&B WITH DANGEROUS WEAPON c265 S15A CAMBRIDGE ST Cambridge Police responded to a ...

So now we just need to encapsulate all of the above as a function and feed it a list of URLs from the crime log archive. I leave that as an exercise for the reader...
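For readers who want a starting point, here is one possible shape for that function. It is a sketch under assumptions, not a tested scraper. It assumes every archive page has the same layout as the page above (a single table whose first two rows are a title and a header), and the names COLUMNS, tidy_log and scrape_logs are hypothetical.

```python
import pandas

# Hypothetical column names for the parsed first table column.
COLUMNS = ["date", "time", "crime_type", "id_code", "summary"]


def parse_crime(text):
    """Split one cell into date, time, type, ID and a free-text summary."""
    words = text.split()
    date, time, crime_type, id_code = words[:4]
    summary = " ".join(words[4:])
    return pandas.Series([date, time, crime_type, id_code, summary], index=COLUMNS)


def tidy_log(raw):
    """Turn the two-column table from read_html into one row per incident."""
    body = raw.iloc[2:]                  # assumes two leading title/header rows
    parsed = body[0].apply(parse_crime)  # one Series per row -> DataFrame
    parsed["description"] = body[1]
    return parsed.reset_index(drop=True)


def scrape_logs(urls):
    """Fetch each daily log in `urls` and stack the results into one DataFrame."""
    frames = [tidy_log(pandas.read_html(url)[0]) for url in urls]
    return pandas.concat(frames, ignore_index=True)
```

Keeping tidy_log separate from the download step means the parsing can be checked on a table you already have in memory, for example tidy_log(tables[0]), before pointing scrape_logs at the whole archive.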

PROBLEM SET

1. How many academic departments and programs does UC Berkeley have?

2. Which department or program offers the most diverse set of graduate degrees?

3. Which Berkeley student organization has the longest name?

4. Scrape the entire Berkeley campus directory by UID. (Hint: This is a person, and this is a person, but not this or this or this. Look at the URLs.)