Working our way up to web scraping

There's been some interest in web scraping. A full treatment is beyond us, but there are some simpler things we can do.

Note: requires internet access to run.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import packages, etc.


In [ ]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import sys                      # system module, used to get Python version 
import os                       # operating system tools (check files)
import datetime as dt           # date tools, used to note current date  

# these are new 
import requests, io             # internet and input tools  
from bs4 import BeautifulSoup   # website parsing

%matplotlib inline 

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())

Sometimes we get lucky

We sometimes find that we can access data straight from a web page with Pandas' read_html. It works just like read_csv or read_excel.

The first example is baseball-reference.com. The same people run similar sites for football and basketball. Many of their pages are collections of tables. See, for example, this one for Pittsburgh's Andrew McCutchen.


In [ ]:
# baseball reference
url = 'http://www.baseball-reference.com/players/m/mccutan01.shtml'
am  = pd.read_html(url)

print('Output has type', type(am), 'and length', len(am))
print('First element has type', type(am[0]))

Question. What do we have here? A list of length 10 whose elements are dataframes? Evidently read_html reads every table on the page into a dataframe and collects them in a list.


In [ ]:
am[4].head()
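
If it's not obvious which table is which, a quick loop over the list can help. A minimal sketch (the number and order of tables depend on the page's layout, so your output may differ):


In [ ]:
# print the position and shape of each table read_html found on the page
for i, table in enumerate(am):
    print(i, table.shape)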

Here's another one: Google's stock price from Yahoo finance.


In [ ]:
url = 'http://finance.yahoo.com/q/hp?s=GOOG+Historical+Prices'
ggl = pd.read_html(url)

In [ ]:
type(ggl)

In [ ]:
len(ggl)

In [ ]:
ggl[8]

In [ ]:
# try read_html on the course website
url = 'http://databootcamp.nyuecon.com/'
# url = 'google.com'   # note: this would override the line above and isn't a complete url (no http://)
db  = pd.read_html(url)
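
If the page doesn't contain any tables, read_html raises a ValueError. A minimal sketch of catching that, assuming we just want to print the error and move on:


In [ ]:
# read_html raises ValueError when it finds no tables on the page
try:
    db = pd.read_html(url)
except ValueError as e:
    print('read_html failed:', e)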

Scanning URLs


Itamar adds:

Walk through the following steps before running the code:

1) Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices.
2) Enter the dates you want and hit the Get Prices button.
3) Once the results are shown, look at the url in the address bar.
4) The new url includes several parameters, each separated by the & character.
5) Try to work out the meaning of each parameter (s, a, b, c, d, e, f, and g).
6) After some trial and error you'll see that each parameter represents something you entered as input: the day, month, and year, the stock symbol, and the frequency you chose (daily, weekly, etc.).
7) Scroll down to the bottom of the page. There is a link that allows downloading the data as a csv file. Click on it.
8) Open the csv file in Excel and look at the structure of the file.
9) Go back to the web page. Instead of clicking on the csv link, right-click on it and copy the link address.
10) Paste the address in a notebook. This is the url we can use to access the data from our coding environment (see the sketch below).
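
Once you have the link, you can hand it to pd.read_csv like any other csv address. Here is a minimal sketch; the host name and parameter values below are placeholders based on the scheme described above, so substitute the link you actually copied:


In [ ]:
# build the csv download link described above and read it with pd.read_csv
# NOTE: host and parameter values are placeholders -- use the link you copied
ticker = 'AAPL'
csv_url = ('http://ichart.finance.yahoo.com/table.csv?s=' + ticker +
           '&a=0&b=1&c=2015' +    # start: month (0-11), day, year
           '&d=11&e=31&f=2015' +  # end: month (0-11), day, year
           '&g=d')                # frequency: d = daily, w = weekly, m = monthly
aapl = pd.read_csv(csv_url)
aapl.head()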


Accessing web pages

Requests again...


In [ ]:
url = 'http://databootcamp.nyuecon.com/'
db = requests.get(url)

In [ ]:
db.headers

In [ ]:
db.url

In [ ]:
db.status_code
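
Before parsing a page, it's worth checking that the request actually succeeded. A status code of 200 means everything went fine; a quick sketch using the response above:


In [ ]:
# 200 means the request succeeded; raise_for_status() raises an error
# for any 4xx or 5xx response
print('Request OK?', db.status_code == 200)
db.raise_for_status()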

Extracting pieces of web pages

Use Beautiful Soup...


In [ ]:
bs = BeautifulSoup(db.content, 'lxml')

print('Type and length:  ', type(bs), ', ', len(bs), sep='')
print('Title: ', bs.title)
print('First 500 characters:\n', bs.prettify()[0:500], sep='')

In [ ]:
bs.find_all?

In [ ]:
bs.head

In [ ]:
kids = [c for c in bs.head.children]   # children of the page's head tag

In [ ]:
kids
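
Another useful tool is find_all, which collects every tag of a given kind. A minimal sketch that pulls the links ('a' tags) out of the page and lists where they point (the number and targets depend on the page):


In [ ]:
# find_all('a') returns every link tag; get('href') extracts the link target
links = bs.find_all('a')
print('Number of links:', len(links))
[link.get('href') for link in links[:10]]    # first ten link targets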
