Working our way up to web scraping

There's been some interest in web scraping. Full-blown scraping is beyond us at this point, but this notebook works our way up to it.

Note: requires internet access to run.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import packages, etc.


In [1]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import sys                      # system module, used to get Python version 
import os                       # operating system tools (check files)
import datetime as dt           # date tools, used to note current date  

# these are new 
import requests, io             # internet and input tools  
from bs4 import BeautifulSoup   # website parsing

%matplotlib inline 

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())


Python version:  3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.18.0
Requests version:  2.9.1
Today's date: 2016-05-05

Sometimes we get lucky

We sometimes find that we can access data straight from a web page with Pandas' read_html. It works just like read_csv or read_excel.

The first example is baseball-reference.com. The same people run similar sites for football and basketball. Many of their pages are collections of tables. See, for example, this one for Pittsburgh's Andrew McCutchen.


In [2]:
pd.read_html?

In [3]:
# baseball reference
url = 'http://www.baseball-reference.com/players/m/mccutan01.shtml'
am  = pd.read_html(url)

print('Output has type', type(am), 'and length', len(am))
print('First element has type', type(am[0]))


Output has type <class 'list'> and length 10
First element has type <class 'pandas.core.frame.DataFrame'>

Question. What do we have here? A list of length 10 whose elements are dataframes? Evidently read_html reads every table on the page into a dataframe and collects them in a list.
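If we only care about one of the tables, read_html's match argument (a string or regular expression that a table's text must contain) filters the list for us. A minimal sketch; the 'Standard Fielding' label is an assumption about how this page captions its fielding table:


In [ ]:
# hedged sketch: keep only tables whose text contains the pattern
# 'Standard Fielding' is a guess at this page's table labeling
fielding_tables = pd.read_html(url, match='Standard Fielding')
print('Number of matching tables:', len(fielding_tables))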


In [4]:
am[4].head()


Out[4]:
Year Tm Lg Age Pos G GS CG Inn Ch ... Unnamed: 105 Unnamed: 106 Unnamed: 107 Unnamed: 108 Unnamed: 109 Unnamed: 110 Unnamed: 111 Unnamed: 112 Unnamed: 113 Unnamed: 114
0 2009.0 PIT NL 22 CF 108 108.0 106.0 952.2 275.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2009.0 PIT NL 22 OF 108 108.0 106.0 952.2 275.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2010.0 PIT NL 23 CF 152 152.0 140.0 1290.1 386.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2010.0 PIT NL 23 OF 152 152.0 140.0 1290.1 386.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2011.0 PIT NL 24 CF 155 153.0 146.0 1353.2 430.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 115 columns


Scanning URLs

When you pull data from the web, it pays to think about how to specify exactly which information you want from a site. This is done through the site's API.

We will talk about two examples:

  • Yahoo Finance data (Itamar's suggestion)
  • Government Open data (through Socrata Open Data API SODA)

Arguments to a URL are passed after a ? and multiple arguments are separated by &. You will see what we mean below.
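Rather than gluing these strings together by hand, we can let Python's standard library build the query string. A minimal sketch; the base URL and parameter names here are placeholders, not a real API:


In [ ]:
from urllib.parse import urlencode   # builds 'key=value&key=value' strings

# hypothetical parameters, just to illustrate the ?key=value&key=value pattern
params = {'s': 'AAPL', 'g': 'd'}
print('http://example.com/data?' + urlencode(params))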


In [5]:
# let's see if we're lucky 
dfs = pd.read_html("http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices")

In [6]:
len(dfs)


Out[6]:
12

In [7]:
dfs[9].head(10)


Out[7]:
0 1 2 3 4 5 6 7 8 9 ... 463 464 465 466 467 468 469 470 471 472
0 DateOpenHighLowCloseVolumeAdj Close* May 4, 2... Date Open High Low Close Volume Adj Close* May 4, 2016 95.2 ... 37357200.0 93.97 Feb 1, 2016 96.47 96.71 95.4 96.43 40943500.0 95.91 * Close price adjusted for dividends and splits.
1 Date Open High Low Close Volume Adj Close* NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 May 4, 2016 95.20 95.90 93.82 94.19 40824800 94.19 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 May 3, 2016 94.20 95.74 93.68 95.18 56534000 95.18 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 May 2, 2016 93.97 94.08 92.40 93.64 47736900 93.64 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 Apr 29, 2016 93.99 94.72 92.51 93.74 68334400 93.74 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 Apr 28, 2016 97.61 97.88 94.25 94.83 81990700 94.83 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 Apr 27, 2016 96.00 98.71 95.68 97.82 113538400 97.82 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 Apr 26, 2016 103.91 105.30 103.91 104.35 40287500 104.35 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 Apr 25, 2016 105.00 105.65 104.51 105.08 27951000 105.08 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10 rows × 473 columns

Yahoo Finance

One place to get an idea of how the Yahoo Finance API works is the link below:

http://www.jarloo.com/yahoo_finance/

What follows is adapted from Itamar's comments on Yahoo Finance.


The code below is an example of a simple web-scraping program. Walk through the following steps before running it:

1) Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices

2) Enter the dates you want and hit the Get Prices button.

3) Once the results appear, look at the URL in the address bar.

4) The new URL includes several parameters, each separated by the & character.

5) Try to work out the meaning of each parameter (s, a, b, c, d, e, f, and g).

6) After some trial and error you will see that each parameter represents something you entered as input: the day, month, and year, the stock symbol, and the frequency you chose (daily, weekly, etc.).

7) Scroll down to the bottom of the page. There is a link for downloading the data as a CSV file. Click on it.

8) Open the CSV file in Excel and look at its structure.

9) Go back to the web page. This time, instead of clicking on the CSV link, right-click on it and copy the link address.

10) Paste the address into a notebook. This is the URL we can use to access the data from our coding environment.


In [8]:
# pandas and matplotlib were already imported at the top of the notebook

symbol = 'AAPL'      # change this to any stock symbol you'd like to get
begin_day = '01'
begin_month = '00'   # January (note: the months run from 0 to 11)
begin_year = '2010'
end_day = '31'
end_month = '11'     # December
end_year = '2014'
freq = 'd'           # d = daily, w = weekly, m = monthly, etc.

# the following three lines construct a url from the parameters above
url = 'http://real-chart.finance.yahoo.com/table.csv?s=' + symbol
url += '&a=' + begin_month + '&b=' + begin_day + '&c=' + begin_year
url += '&d=' + end_month + '&e=' + end_day + '&f=' + end_year + '&g=' + freq + '&ignore=.csv'

print(url)   # this should have the same structure as the csv link we found before

# pandas can read the csv file directly from the url
df = pd.read_csv(url)
# everything comes in as strings, so convert the Date column to datetime type
df['Date'] = pd.to_datetime(df['Date'])
# now we can, for example, plot the Adj Close column against the date
df.plot(x='Date', y='Adj Close')


http://real-chart.finance.yahoo.com/table.csv?s=AAPL&a=00&b=01&c=2010&d=11&e=31&f=2014&g=d&ignore=.csv
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xa67c828>

SODA API

https://dev.socrata.com/consumers/examples/pandas-and-jupyter-notebook.html

https://dev.socrata.com/docs/queries/

Now that we've talked about this, let's look at a dataset and try some things.


In [9]:
base_url_json = "https://www.opendata.go.ke/resource/p452-xb7c.json"

df_json = pd.read_json(base_url_json)
df_json.head(3)


Out[9]:
boys_toilets costituency county district division geolocation girls_toilets gok_tsc_female gok_tsc_male level_of_education ... school_institution_type_2 school_institution_type_3 sponsor_of_school status_of_school teachers_toilets total_boys total_enrolment total_girls total_number_of_classrooms total_toilets
0 0 MSAMBWENI KWALE KWALE MSAMBWENI {'latitude': '-4.248295', 'human_address': '{"... 0 0 0 PRIMARY SCHOOL ... DAY ONLY ORDINARY CENTRAL GOVERNMENT/DEB PUBLIC 0 0 839 0 1 0
1 4 KISAUNI MOMBASA MOMBASA KISAUNI {'latitude': '-3.99606', 'human_address': '{"a... 4 0 0 PRIMARY SCHOOL ... DAY ONLY ORDINARY RELIGIOUS ORGANIZATION PUBLIC 2 0 736 0 15 0
2 5 WEBUYE BUNGOMA BUNGOMA CHWELE {'latitude': '0.71988', 'human_address': '{"ad... 6 0 1 PRIMARY SCHOOL ... DAY ONLY ORDINARY RELIGIOUS ORGANIZATION PUBLIC 2 0 624 0 15 0

3 rows × 35 columns


In [10]:
base_url_csv = "https://www.opendata.go.ke/resource/p452-xb7c.csv"

df_csv = pd.read_csv(base_url_csv)
df_csv.columns = df_csv.columns.str.replace(" ", "_").str.lower()
df_csv.head(3)


Out[10]:
name_of_school level_of_education status_of_school sponsor_of_school school_institution_type_1 school_institution_type_2 school_institution_type_3 pupil_teacher_ratio pupil_classroom_ratio pupil_toilet_ratio ... non-teaching_staff_male non-teaching_staff_female province county district division location costituency geolocation kodi_county
0 MTUMWA PRY SCH PRIMARY SCHOOL PUBLIC CENTRAL GOVERNMENT/DEB MIXED DAY ONLY ORDINARY 839.0 839.0 0.0 ... 0 0 COAST KWALE KWALE MSAMBWENI MWERENI MSAMBWENI (-4.248295, 39.085658) NaN
1 KIEMBENI BAPTIST PRI SCH PRIMARY SCHOOL PUBLIC RELIGIOUS ORGANIZATION MIXED DAY ONLY ORDINARY 736.0 49.1 92.0 ... 1 0 COAST MOMBASA MOMBASA KISAUNI BAMBURI KISAUNI (-3.99606, 39.6977) NaN
2 MADISI ERSF PRI SCH PRIMARY SCHOOL PUBLIC RELIGIOUS ORGANIZATION MIXED DAY ONLY ORDINARY 624.0 41.6 56.7 ... 2 0 WESTERN BUNGOMA BUNGOMA CHWELE CHWELE WEBUYE (0.71988, 34.62895) NaN

3 rows × 36 columns

Notice that we can choose to interact with either the JSON or the CSV file. Both give us lots of columns (and not very many rows), though, so we might want a smaller slice of the data. We show below how to do this using the API described in the second link above.


In [11]:
# First let's select a subset of the columns
base_url_json += "?$select=boys_toilets,girls_toilets,geolocation,name_of_school,level_of_education"

df_json2 = pd.read_json(base_url_json)
df_json2.head()


Out[11]:
boys_toilets geolocation girls_toilets level_of_education name_of_school
0 1 {'latitude': '-1.66668', 'human_address': '{"a... 2 PRIMARY SCHOOL VOO SPECIAL UNIT
1 3 {'latitude': '-1.012145', 'human_address': '{"... 3 PRIMARY SCHOOL MASAANTARE PRI
2 1 {'latitude': '-1.159798', 'human_address': '{"... 1 PRIMARY SCHOOL NGOCHONI PRY SCH
3 1 {'latitude': '-0.99747', 'human_address': '{"a... 2 PRIMARY SCHOOL NEEMA JUNIOR ACADEMY
4 0 {'latitude': '-0.42783', 'human_address': '{"a... 0 PRIMARY SCHOOL LWANDA PRI SCH

In [12]:
# Now let's raise the cap on the number of rows returned (the default is 1,000)
base_url_json += "&$limit=5000"

df_json3 = pd.read_json(base_url_json)

In [13]:
df_json3.head(5)


Out[13]:
boys_toilets geolocation girls_toilets level_of_education name_of_school
0 1 {'latitude': '-1.66668', 'human_address': '{"a... 2 PRIMARY SCHOOL VOO SPECIAL UNIT
1 3 {'latitude': '-1.012145', 'human_address': '{"... 3 PRIMARY SCHOOL MASAANTARE PRI
2 1 {'latitude': '-1.159798', 'human_address': '{"... 1 PRIMARY SCHOOL NGOCHONI PRY SCH
3 1 {'latitude': '-0.99747', 'human_address': '{"a... 2 PRIMARY SCHOOL NEEMA JUNIOR ACADEMY
4 0 {'latitude': '-0.42783', 'human_address': '{"a... 0 PRIMARY SCHOOL LWANDA PRI SCH

In [14]:
df_json.shape


Out[14]:
(1000, 35)

In [15]:
df_json2.shape


Out[15]:
(1000, 5)

In [16]:
df_json3.shape


Out[16]:
(5000, 5)

In [17]:
baseurl = 'https://www.opendata.go.ke/resource/p452-xb7c.json'
baseurl += '?$select=boys_toilets,girls_toilets,geolocation,name_of_school,level_of_education'
baseurl += '&$limit=50000'

In [18]:
df_big = pd.read_json(baseurl)

In [19]:
df_big.shape


Out[19]:
(26197, 5)
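The SoQL docs in the second link above also describe a $where parameter, which filters rows on the server before anything is downloaded. A sketch, assuming the column names we saw above; requests and io were imported at the top of the notebook:


In [ ]:
# hedged sketch: $where filters rows server-side (see the SoQL query docs)
resp = requests.get('https://www.opendata.go.ke/resource/p452-xb7c.json',
                    params={'$select': 'name_of_school,boys_toilets,girls_toilets',
                            '$where': 'boys_toilets > 10'})
df_where = pd.read_json(io.StringIO(resp.text))
df_where.head()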

Accessing web pages

We use the requests package again, this time to grab a complete web page.


In [20]:
url = 'http://databootcamp.nyuecon.com/'
db = requests.get(url)

In [22]:
db.headers


Out[22]:
{'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Date': 'Thu, 05 May 2016 17:43:41 GMT', 'Content-Length': '3499', 'Age': '0', 'Expires': 'Thu, 05 May 2016 17:51:41 GMT', 'Last-Modified': 'Thu, 21 Apr 2016 23:57:03 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Cache-Control': 'max-age=600', 'Via': '1.1 varnish', 'Server': 'GitHub.com', 'X-Cache-Hits': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD', 'Accept-Ranges': 'bytes', 'X-Fastly-Request-ID': 'f7fc4a2f3875a07844929a45662b486778a0fe64', 'X-GitHub-Request-Id': '17EB2E2F:645D:28EA4FF:572B85D5', 'Vary': 'Accept-Encoding', 'X-Cache': 'MISS'}

In [23]:
db.url


Out[23]:
'http://databootcamp.nyuecon.com/'

In [24]:
db.status_code


Out[24]:
200
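A status code of 200 means the request succeeded. Before parsing a page it's good practice to check this; a small sketch using raise_for_status, which raises an exception on 4xx and 5xx codes:


In [ ]:
# raise_for_status() raises an HTTPError if the request failed
db.raise_for_status()
print('Success:', db.status_code == requests.codes.ok)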

In [30]:
db.content[0:500]


Out[30]:
b'<!DOCTYPE html>\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=utf-8">\n  <title>Data Bootcamp </title>\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="HandheldFriendly" content="True">\n<meta name="MobileOp'
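Note the leading b: content is raw bytes. The response's text attribute holds the same payload decoded into a string, which is often more convenient:


In [ ]:
# content is bytes; text is the decoded string version of the same page
print(type(db.content), type(db.text))
db.text[0:100]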

Extracting pieces of web pages

We use Beautiful Soup to parse the page's HTML into an object we can search.


In [26]:
bs = BeautifulSoup(db.content, 'lxml')

print('Type and length:  ', type(bs), ', ', len(bs), sep='')
print('Title: ', bs.title)
print('First n characters:\n', bs.prettify()[0:250], sep='')


Type and length:  <class 'bs4.BeautifulSoup'>, 7
Title:  <title>Data Bootcamp </title>
First n characters:
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js"

In [27]:
bs.head


Out[27]:
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title>Data Bootcamp </title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="" name="description"/>
<meta content="" name="keywords"/>
<meta content="Data Bootcamp " property="og:title"/>
<meta content="Data Bootcamp" property="og:site_name"/>
<meta content="http://databootcamp.nyuecon.com/" property="og:url"/>
<meta content="en-us" property="og:locale"/>
<meta content="website" property="og:type"/>
<link href="http://databootcamp.nyuecon.com/index.xml" rel="alternate" title="Data Bootcamp" type="application/rss+xml"/>
<link href="http://databootcamp.nyuecon.com/" rel="canonical"/>
<link href="http://databootcamp.nyuecon.com/touch-icon-144-precomposed.png" rel="apple-touch-icon-precomposed" sizes="144x144"/>
<link href="http://databootcamp.nyuecon.com/favicon.png" rel="icon"/>
<meta content="Hugo 0.16-DEV" name="generator"/>
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<link href="https://fonts.googleapis.com/css?family=Merriweather:300%7CRaleway%7COpen+Sans" rel="stylesheet" type="text/css"/>
<link href="http://databootcamp.nyuecon.com/css/font-awesome.min.css" rel="stylesheet"/>
<link href="http://databootcamp.nyuecon.com/css/style.css" rel="stylesheet"/>
<link href="http://databootcamp.nyuecon.com/css/highlight/default.css" rel="stylesheet"/>
</head>

In [28]:
bs.title


Out[28]:
<title>Data Bootcamp </title>

In [29]:
bs.find_all('a')


Out[29]:
[<a class="baselink" href="http://databootcamp.nyuecon.com/">
   Data Bootcamp
 
 </a>,
 <a class="homelink" href="http://databootcamp.nyuecon.com/">Home</a>,
 <a href="http://databootcamp.nyuecon.com/bootcamp_faq">Description &amp; FAQ</a>,
 <a href="http://databootcamp.nyuecon.com/bootcamp_outline">Topic outlines &amp; links</a>,
 <a href="http://databootcamp.nyuecon.com/bootcamp_duedates_ug">Due dates (UG)</a>,
 <a href="http://databootcamp.nyuecon.com/bootcamp_duedates_mba">Due dates (MBA)</a>,
 <a href="http://databootcamp.nyuecon.com/bootcamp_data">Data sources</a>,
 <a href="http://databootcamp.nyuecon.com/bootcamp_team">Our team</a>,
 <a href="http://databootcamp.nyuecon.com/bootcamp_faq">FAQ</a>,
 <a href="https://github.com/DaveBackus/Data_Bootcamp/blob/master/Documents/bootcamp_syllabus.pdf">Syllabus</a>,
 <a href="https://www.gitbook.com/book/davebackus/test/details">Book</a>,
 <a href="https://github.com/DaveBackus/Data_Bootcamp/blob/master/Documents/bootcamp_project.pdf">Project Guide</a>,
 <a href="https://groups.google.com/forum/#!forum/nyu_data_bootcamp_ug">UG</a>,
 <a href="https://groups.google.com/forum/#!forum/nyu_data_bootcamp_mba">MBA</a>,
 <a href="mailto:db3@nyu.edu">db3@nyu.edu</a>,
 <a href="http://pages.stern.nyu.edu/~dbackus/">directions</a>,
 <a href="mailto:sgl290@nyu.edu">sgl290@nyu.edu</a>,
 <a href="mailto:bl1741@nyu.edu">bl1741@nyu.edu</a>,
 <a href="mailto:cc7768@gmail.com">cc7768@gmail.com</a>,
 <a href="mailto:is1413@stern.nyu.edu">is1413@stern.nyu.edu</a>,
 <a href="https://github.com/DaveBackus/Data_Bootcamp">Everything</a>,
 <a href="https://github.com/DaveBackus/Data_Bootcamp/tree/master/Code/Python">Python programs</a>,
 <a href="https://github.com/DaveBackus/Data_Bootcamp/tree/master/Code/IPython">IPython notebooks</a>,
 <a href="https://github.com/DaveBackus/Data_Bootcamp/tree/master/Code/Lab">Experiments</a>,
 <a href="http://lmgtfy.com/?q=nyu+data+bootcamp">nyu data bootcamp</a>,
 <a href="https://docs.google.com/forms/d/1N7ugrqIQNHm_e1BLVAtaRMnC8SmSiTDMoYBy_0FYZic/viewform">entry poll</a>,
 <a href="https://groups.google.com/forum/#!forum/nyu_data_bootcamp_ug">UG</a>,
 <a href="https://groups.google.com/forum/#!forum/nyu_data_bootcamp_mba">MBA</a>,
 <a href="https://docs.google.com/document/d/18mDKqDO-48wbwYfFEGD0BAYJRsshRRRxR9IEKhekTWM/edit?usp=sharing">template</a>,
 <a href="https://w3.stern.nyu.edu/start/home">activate your Stern account</a>,
 <a href="mailto:helpdesk@stern.nyu.edu">helpdesk@stern.nyu.edu</a>,
 <a class="toplink" href="#">back to top</a>]
