There's been some interest in web scraping. Full-blown scraping is beyond us at this point, but this notebook works our way up to it.
Note: requires internet access to run.
This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [1]:
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import sys # system module, used to get Python version
import os # operating system tools (check files)
import datetime as dt # date tools, used to note current date
# these are new
import requests, io # internet and input tools
from bs4 import BeautifulSoup # website parsing
%matplotlib inline
print('\nPython version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())
We sometimes find that we can access data straight from a web page with Pandas' read_html. It works just like read_csv or read_excel.
The first example is baseball-reference.com. The same people run similar sites for football and basketball. Many of their pages are collections of tables. See, for example, this one for Pittsburgh's Andrew McCutchen.
In [2]:
pd.read_html?
In [3]:
# baseball reference
url = 'http://www.baseball-reference.com/players/m/mccutan01.shtml'
am = pd.read_html(url)
print('Output has type', type(am), 'and length', len(am))
print('First element has type', type(am[0]))
Question. What do we have here? A list of length 10 whose elements are dataframes? Evidently read_html reads every table on the page into a dataframe and collects them in a list.
In [4]:
am[4].head()
Out[4]:
It is important to think about what you are doing when you get data from the web. In particular, it helps to think about how to specify the kind of information you want from a website. This is done through the site's API.
We will look at two examples. In both, arguments are passed to a url after a ? and multiple arguments are separated by &. You will see what we mean below.
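As a quick warmup, here is a minimal sketch of how the ?key=value&key=value pattern gets built. The base url and parameter names here are made up purely for illustration.
In [ ]:
from urllib.parse import urlencode  # standard-library url tools

# hypothetical base url and parameters, just to show the query-string pattern
params = {'s': 'AAPL', 'g': 'd'}
url = 'http://example.com/data' + '?' + urlencode(params)
print(url)  # http://example.com/data?s=AAPL&g=d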
In [5]:
# let's see if we're lucky
dfs = pd.read_html("http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices")
In [6]:
len(dfs)
Out[6]:
In [7]:
dfs[9].head(10)
Out[7]:
Itamar's comments on Yahoo Finance
This code shows an example of a simple web scraping program. Walk through the following steps before running the code:
1) Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices
2) Enter the dates you want and hit the Get Prices button.
3) Once the results are shown, look at the url address.
4) The new url will include several parameters, each one separated by the & character.
5) Try to work out the meaning of each parameter (s, a, b, c, d, e, f, and g).
6) After some trial and error you will realize that each parameter represents the data you entered as input: the day, month, and year, the stock symbol, and the frequency you chose (daily, weekly, etc.).
7) Scroll down to the bottom of the page. There is a link which allows downloading the data as a csv file. Click on it.
8) Open the csv file in Excel and look at the structure of the file.
9) Go back to the web page. Instead of clicking on the csv link, right-click on it and copy the link address.
10) Paste the address into the notebook. This is the url we can use to access the data from our coding environment.
In [8]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
symbol='AAPL' ## change this to any stock symbol you'd like to get
begin_day='01'
begin_month='00' # January (note the months run from 0 to 11)
begin_year='2010'
end_day = '31'
end_month = '11' #December
end_year='2014'
freq='d' #d - daily, w - weekly, m - monthly etc..
# the following three lines will construct a url based on the parameters above:
url = 'http://real-chart.finance.yahoo.com/table.csv?s='+symbol
url+= '&a='+begin_month+'&b='+begin_day+'&c='+begin_year
url+= '&d='+end_month+'&e='+end_day+'&f='+end_year+'&g='+freq+'&ignore=.csv'
print(url) ## This should show a similar structure to the csv link found before.
# pandas allows us to read the csv file directly from the url
df=pd.read_csv(url)
# since all the data is read as a string, it'll be good to convert the date column to a datetime type
df['Date']=pd.to_datetime(df['Date'])
# now we can, for example, plot the Adj Close column vs. the date:
df.plot(x='Date',y='Adj Close')
Out[8]:
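An aside on the last two steps: read_csv can do the date conversion for us if we pass it parse_dates. A minimal variant of the same read, using the url constructed above:
In [ ]:
# parse_dates converts the Date column to datetimes as the csv is read
df = pd.read_csv(url, parse_dates=['Date'])
df.dtypes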
Two useful references for the Socrata API, which is behind the dataset we use next:
https://dev.socrata.com/consumers/examples/pandas-and-jupyter-notebook.html
https://dev.socrata.com/docs/queries/
Now that we've talked about APIs, let's look at a dataset and try some things.
In [9]:
base_url_json = "https://www.opendata.go.ke/resource/p452-xb7c.json"
df_json = pd.read_json(base_url_json)
df_json.head(3)
Out[9]:
In [10]:
base_url_csv = "https://www.opendata.go.ke/resource/p452-xb7c.csv"
df_csv = pd.read_csv(base_url_csv)
df_csv.columns = df_csv.columns.str.replace(" ", "_").str.lower()
df_csv.head(3)
Out[10]:
Notice that we can choose to interact with either the json or the csv version of the file. Both give us lots of columns (and not very many rows), and we might want to choose a smaller subset of the data. We can do this using the API described in the second link above.
In [11]:
# First let's select a subset of the columns
base_url_json += "?$select=boys_toilets,girls_toilets,geolocation,name_of_school,level_of_education"
df_json2 = pd.read_json(base_url_json)
df_json2.head()
Out[11]:
In [12]:
# Now let's raise the number of observations
base_url_json += "&$limit=5000"
df_json3 = pd.read_json(base_url_json)
In [13]:
df_json3.head(5)
Out[13]:
In [14]:
df_json.shape
Out[14]:
In [15]:
df_json2.shape
Out[15]:
In [16]:
df_json3.shape
Out[16]:
In [17]:
baseurl = 'https://www.opendata.go.ke/resource/p452-xb7c.json'
baseurl += '?$select=boys_toilets,girls_toilets,geolocation,name_of_school,level_of_education'
baseurl += '&$limit=50000'
In [18]:
df_big = pd.read_json(baseurl)
In [19]:
df_big.shape
Out[19]:
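The Socrata query language (SoQL, described in the second link above) has more operators than $select and $limit; for example, $where filters rows and $order sorts them on the server before anything is downloaded. A sketch, assuming the column names from the tables above:
In [ ]:
# filter and sort on the server, then download only what we need
query = '?$select=name_of_school,boys_toilets,girls_toilets'
query += '&$where=boys_toilets>10'  # keep rows with more than 10 boys' toilets
query += '&$order=boys_toilets'     # sort ascending by that column
query += '&$limit=100'
df_filtered = pd.read_json('https://www.opendata.go.ke/resource/p452-xb7c.json' + query)
df_filtered.head()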
The rest of the notebook uses requests to grab a page's raw html and BeautifulSoup to parse it. We use the Data Bootcamp course site as the example.
In [20]:
# download the raw html for the course site
url = 'http://databootcamp.nyuecon.com/'
db = requests.get(url)
In [22]:
db.headers
Out[22]:
In [23]:
db.url
Out[23]:
In [24]:
db.status_code
Out[24]:
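A status code of 200 means the request succeeded. Before parsing a response it can be worth checking this programmatically; one minimal way with requests:
In [ ]:
# raise an exception if the download failed; do nothing if it succeeded
db.raise_for_status()
db.ok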
In [30]:
db.content[0:500]
Out[30]:
In [26]:
bs = BeautifulSoup(db.content, 'lxml')
print('Type and length: ', type(bs), ', ', len(bs), sep='')
print('Title: ', bs.title)
print('First n characters:\n', bs.prettify()[0:250], sep='')
In [27]:
bs.head
Out[27]:
In [28]:
bs.title
Out[28]:
In [29]:
bs.find_all('a')
Out[29]:
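find_all returns a list of tag objects, and we can pull pieces out of each one, for example the link text and the href attribute. A minimal sketch:
In [ ]:
# collect (text, href) pairs for every link on the page
links = [(a.text, a.get('href')) for a in bs.find_all('a')]
links[0:5]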