There's been some interest in web scraping. A full treatment is beyond us, but there are some simple things we can do.
Note: requires internet access to run.
This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [ ]:
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import sys # system module, used to get Python version
import os # operating system tools (check files)
import datetime as dt # date tools, used to note current date
# these are new
import requests, io # internet and input tools
from bs4 import BeautifulSoup # website parsing
%matplotlib inline
print('\nPython version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())
We sometimes find that we can read data straight from a web page with Pandas' read_html. It works much like read_csv or read_excel.
The first example is baseball-reference.com. The same people run similar sites for football and basketball. Many of their pages are collections of tables. See, for example, this one for Pittsburgh's Andrew McCutchen.
In [ ]:
# baseball reference
url = 'http://www.baseball-reference.com/players/m/mccutan01.shtml'
am = pd.read_html(url)
print('Output has type', type(am), 'and length', len(am))
print('First element has type', type(am[0]))
Question. What do we have here? A list whose elements are dataframes? Evidently read_html reads every table on the page into a dataframe and collects them in a list.
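If we're not sure which element of the list holds the table we want, a quick loop can help. Here's a sketch that just prints each table's position and shape so we can pick out the one we're after.
In [ ]:
# which table is which? print the position and shape of each dataframe in the list
for i, df in enumerate(am):
    print(i, df.shape)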
In [ ]:
am[4].head()
Here's another one: Google's stock price from Yahoo Finance.
In [ ]:
url = 'http://finance.yahoo.com/q/hp?s=GOOG+Historical+Prices'
ggl = pd.read_html(url)
In [ ]:
type(ggl)
In [ ]:
len(ggl)
In [ ]:
ggl[8]
In [ ]:
url = 'http://databootcamp.nyuecon.com/'
# url = 'http://www.google.com'   # another page to try
db = pd.read_html(url)
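If the page has no tables, read_html typically fails with a ValueError rather than returning an empty list. Here's a sketch of catching that, so the notebook keeps running even when a page has nothing for us.
In [ ]:
# read_html typically raises a ValueError when a page has no tables; catch it and move on
try:
    db = pd.read_html(url)
    print('Found', len(db), 'tables')
except ValueError as e:
    print('No luck:', e)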
Itamar adds:
Walk through the following steps before running the code:
1) Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices.
2) Enter the dates you want and hit the "Get Prices" button.
3) Once the results are shown, look at the url in the address bar.
4) The new url includes several parameters, each one separated by the & character.
5) Try to explore the meaning of each parameter (s, a, b, c, d, e, f, and g).
6) After some trial and error you will realize that each parameter represents the data you entered as input: the day, month, and year, the stock symbol, and the frequency you chose (daily, weekly, etc.).
7) Scroll down to the bottom of the page. There is a link that allows downloading the data as a csv file. Click on it.
8) Open the csv in Excel and see the structure of the file.
9) Go back to the web page. Instead of clicking on the csv link, right-click on it and copy the link address.
10) Paste the address in a notebook. This is the url we can use to access the data from our coding environment, as sketched below.
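Putting the steps together, the copied link should look something like the url below: s is the symbol, a/b/c and d/e/f encode the start and end dates, and g is the frequency. The exact address is whatever you copied in step 10; the one here is only an illustration and may not work if Yahoo has changed its download link. Once we have it, read_csv can pull the data straight from the web.
In [ ]:
# hypothetical example of the copied csv link -- substitute the address from step 10
url = ('http://real-chart.finance.yahoo.com/table.csv'
       '?s=AAPL&a=0&b=1&c=2015&d=11&e=31&f=2015&g=d')
aapl = pd.read_csv(url)   # read_csv takes a url just like a file name
aapl.head()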
In [ ]:
url = 'http://databootcamp.nyuecon.com/'
db = requests.get(url)
In [ ]:
db.headers
In [ ]:
db.url
In [ ]:
db.status_code
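Before handing the response to a parser, it's worth checking that the request actually worked; a status code of 200 means success. A minimal check might look like this.
In [ ]:
# sanity check: a status code of 200 means the request succeeded
if db.status_code == 200:
    print('Request OK, downloaded', len(db.content), 'bytes')
else:
    print('Something went wrong, status code', db.status_code)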
In [ ]:
bs = BeautifulSoup(db.content, 'lxml')
print('Type and length: ', type(bs), ', ', len(bs), sep='')
print('Title: ', bs.title)
print('First n characters:\n', bs.prettify()[0:500], sep='')
In [ ]:
bs.find_all?
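For example, find_all can collect every tag of a given kind. Here's a sketch that grabs all the a (link) tags on the page and pulls out their href attributes; what comes back depends, of course, on the page.
In [ ]:
# collect every link tag on the page and pull out its address
links = bs.find_all('a')
hrefs = [link.get('href') for link in links]
hrefs[:10]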
In [ ]:
bs.head
In [ ]:
kids = [c for c in bs.head.children]   # direct children of the <head> tag
In [ ]:
kids
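We don't have to walk the children by hand, though; find_all works on any tag, not just the whole document. A small sketch, pulling the title text and the meta tags out of the head:
In [ ]:
print('Page title text:', bs.title.text)   # the title as plain text
bs.head.find_all('meta')                   # just the <meta> tags inside <head>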