Scraping the web is fun:
In [107]:
import requests
In [125]:
r = requests.get("http://berkeley.edu")
r
Out[125]:
In [110]:
dir(r)
Out[110]:
In [124]:
r.encoding
Out[124]:
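Before we scrape anything, a few Response attributes are worth knowing; these are all standard parts of the requests API:

r = requests.get("http://berkeley.edu")
print(r.status_code)              # 200 on success
print(r.headers["Content-Type"])  # what the server says it sent back
print(r.encoding)                 # the encoding requests guessed from the headers
html = r.text                     # the body, decoded to a str with that encoding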
Let's scrape the daily crime log from Cambridge, Massachusetts!
One of the most powerful Python libraries for web scraping is lxml...
In [8]:
import lxml.html as LH
url = "http://www.cambridgema.gov/cpd/newsandalerts/Archives/detail.aspx?path=%2fsitecore%2fcontent%2fhome%2fcpd%2fnewsandalerts%2fArchives%2f2015%2f10%2f10092015"
tree = LH.parse(url)  # lxml fetches the URL itself and parses the HTML
table = [td.text_content() for td in tree.xpath('//td')]  # text of every <td> on the page
In [9]:
table[:10]
Out[9]:
Wow, that's ugly: '//td' grabs every cell on the page as one flat, undifferentiated list.
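You can get something closer to actual rows by walking <tr> elements instead. A sketch, assuming the listing lives in ordinary table markup (not verified against the live page):

for tr in tree.xpath('//tr'):
    cells = [td.text_content().strip() for td in tr.xpath('./td')]
    if cells:
        print(cells)  # one list of cell strings per table row

But wait, can't pandas do all of this for us? Yes!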
In [91]:
import pandas
tables = pandas.read_html(url)
In [106]:
print(type(tables))
print(len(tables))
print(type(tables[0]))
tables[0]
Out[106]:
In [104]:
df = tables[0]
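An aside: read_html also accepts raw HTML, so if a site ever needs custom headers or cookies, you can fetch with requests and hand the body to pandas. A minimal sketch (the User-Agent value here is just an illustration):

import io
resp = requests.get(url, headers={"User-Agent": "crime-log-notebook/0.1"})
resp.raise_for_status()                            # fail loudly on HTTP errors
tables = pandas.read_html(io.StringIO(resp.text))  # same list of DataFrames as above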
Let's get rid of the first 2 rows.
In [98]:
df = df.iloc[2:]  # .ix is long gone from pandas; iloc slices rows by position
df
Out[98]:
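Equivalently, read_html can drop those rows at parse time, which makes this a one-liner; a sketch:

df = pandas.read_html(url, skiprows=2)[0].reset_index(drop=True)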
And now let's parse the text of the first column and pull it all together.
In [100]:
def parse_crime(text):
    # The first four whitespace-separated tokens are the date, time,
    # crime type, and incident ID; whatever is left is the summary.
    words = text.split()
    date, time, crime_type, id_code = words[:4]
    summary = " ".join(words[4:])
    return pandas.Series([date, time, crime_type, id_code, summary])
In [102]:
parsed = df[0].apply(parse_crime)  # one parsed row per incident
parsed["description"] = df[1]      # carry the second table column along
parsed.columns = ["date", "time", "crime_type", "id_code", "summary", "description"]
parsed
Out[102]:
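If you want real timestamps rather than strings, pandas can combine and parse the two columns; a sketch, assuming the scraped date and time strings are in a format to_datetime recognizes:

parsed["timestamp"] = pandas.to_datetime(
    parsed["date"] + " " + parsed["time"],
    errors="coerce",  # unparseable rows become NaT instead of raising
)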
So now we just need to encapsulate all of the above as a function and feed it a list of URLs from the crime log archive. I leave that as an exercise for the reader...
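If you want a head start, here is one possible skeleton; the archive_urls list is hypothetical, and building it from the archive pages is the actual exercise:

def scrape_crime_log(url):
    # Fetch one archive page, drop the junk rows, and parse it.
    df = pandas.read_html(url, skiprows=2)[0]
    parsed = df[0].apply(parse_crime)
    parsed["description"] = df[1]
    parsed.columns = ["date", "time", "crime_type", "id_code",
                      "summary", "description"]
    return parsed

# archive_urls = [...]  # a list you build from the crime log archive index
# crime_log = pandas.concat(scrape_crime_log(u) for u in archive_urls)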
PROBLEM SET
1. How many academic departments and programs does UC Berkeley have?
2. Which department or program offers the most diverse set of graduate degrees?
3. Which Berkeley student organization has the longest name?
4. Scrape the entire Berkeley campus directory by UID. (Hint: This is a person, and this is a person, but not this or this or this. Look at the URLs.)