In [1]:
import os
import urllib.request
import webbrowser
import pandas as pd
from bs4 import BeautifulSoup
In [2]:
url = 'http://www.pro-football-reference.com/years/2015/passing.htm'
webbrowser.open_new_tab(url)
Out[2]:
In [3]:
# The url we will be scraping
url_2015 = "http://www.pro-football-reference.com/years/2015/passing.htm"
# get the html
html = urllib.request.urlopen(url_2015)
# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
In [4]:
# Extract the necessary values for the column headers from the table
# and store them as a list
column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
column_headers = [s for s in column_headers if len(s) != 0]
column_headers = column_headers[1:]
column_headers
Out[4]:
In [5]:
len(column_headers)
Out[5]:
In [6]:
# The data is found within the table rows
# We want every row after the header row (the first row)
table_rows = soup.find_all("tr")[1:]
print(type(table_rows))
table_rows[0] # take a look at the first row
Out[6]:
The data we want for each player is found within the td (or table data) elements. Below I've created a function that extracts the data we want from table_rows. The comments should walk you through what each part of the function does.
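To see that extraction in isolation first, here is a minimal sketch on a made-up HTML snippet (not actual Pro-Football-Reference markup), just to illustrate how find_all("td") and get_text() pull the cell text out of a single row:

from bs4 import BeautifulSoup

# a made-up table row for illustration only
toy_html = "<table><tr><th>1</th><td>Drew Brees</td><td>NOR</td><td>36</td></tr></table>"
toy_row = BeautifulSoup(toy_html, "lxml").find("tr")
print([td.get_text() for td in toy_row.find_all("td")])
# ['Drew Brees', 'NOR', '36']

Rows that contain only th elements (like the repeated header rows in the table) produce an empty list, which is what the function below relies on to skip them.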
In [7]:
def extract_player_data(table_rows):
    """
    Extract and return the desired information from the td elements within
    the table rows.
    """
    # create the empty list to store the player data
    player_data = []

    for row in table_rows:  # for each row do the following
        # Get the text for each table data (td) element in the row
        player_list = [td.get_text() for td in row.find_all("td")]

        # there are some empty table rows, which are the repeated
        # column headers in the table; we skip over those rows
        # and continue the for loop
        if not player_list:
            continue

        # Now append the data to the list of player data
        player_data.append(player_list)

    return player_data
Now we can create a DataFrame with the 2015 passing data.
In [8]:
# extract the data we want
data = extract_player_data(table_rows)
# and then store it in a DataFrame
df_2015 = pd.DataFrame(data)
In [9]:
data[1]
Out[9]:
In [10]:
df_2015.columns = ['Player_name']+column_headers
In [11]:
df_2015.head()
Out[11]:
Scraping the passing data since 1932 is essentially the same process as above, just repeated for each year using a for loop.
As we loop over the years, we will create a DataFrame for each year of passing data and append it to a large list of DataFrames that contains all the passing data. We will also keep a separate list that will contain any errors and the url associated with each error. This will let us know if there are any issues with our scraper, and which url is causing the error.
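Just to make the loop concrete before running it, here is a quick look at the urls the template below generates (using the same template string as the scraping code):

url_template = "http://www.pro-football-reference.com/years/{year}/passing.htm"

for year in (1932, 1933, 2016):
    print(url_template.format(year=year))
# http://www.pro-football-reference.com/years/1932/passing.htm
# http://www.pro-football-reference.com/years/1933/passing.htm
# http://www.pro-football-reference.com/years/2016/passing.htm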
In [12]:
# Create an empty list that will contain all the dataframes
# (one dataframe for each year of passing data)
passing_dfs_list = []
# a list to store any errors that may come up while scraping
errors_list = []
In [13]:
# The url template into which we insert the year of the passing data
url_template = "http://www.pro-football-reference.com/years/{year}/passing.htm"
# for each year from 1932 to (and including) 2016
for year in range(1932, 2017):
    # Use try/except block to catch and inspect any urls that cause an error
    try:
        # get the passing data url
        url = url_template.format(year=year)
        # get the html
        html = urllib.request.urlopen(url)
        # create the BeautifulSoup object
        soup = BeautifulSoup(html, "lxml")
        # get the column headers
        column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
        column_headers = [s for s in column_headers if len(s) != 0]
        column_headers = column_headers[1:]
        # select the data from the table
        table_rows = soup.find_all("tr")[1:]
        # extract the player data from the table rows
        player_data = extract_player_data(table_rows)
        # create the dataframe for the current year's passing data
        year_df = pd.DataFrame(player_data)
        # add the year of the passing data to the dataframe
        year_df.insert(0, "Year", year)
        # append the current dataframe to the list of dataframes
        passing_dfs_list.append(year_df)
    except Exception as e:
        # Store the url and the error it causes in a list
        error = [url, e]
        # then append it to the list of errors
        errors_list.append(error)
In [14]:
print(len(errors_list))
errors_list
Out[14]:
We don't get any errors, so that's good.
Now we can concatenate all the DataFrames we scraped and create one large DataFrame containing all the passing data.
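As a quick side note on how pd.concat behaves, here is a small made-up example: by default the original row indexes are preserved, which is why each year's rows will still start at index 0 in the combined DataFrame (pass ignore_index=True if you would rather have a single 0..n-1 index):

import pandas as pd

df_a = pd.DataFrame({"name": ["A", "B"]})
df_b = pd.DataFrame({"name": ["C", "D"]})

print(pd.concat([df_a, df_b], axis=0).index.tolist())             # [0, 1, 0, 1]
print(pd.concat([df_a, df_b], ignore_index=True).index.tolist())  # [0, 1, 2, 3]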
In [15]:
type(passing_dfs_list)
Out[15]:
In [16]:
passing_dfs_list[0:1]
Out[16]:
In [17]:
column_headers.insert(0, "Player_name")
column_headers.insert(0, "Year")
print(column_headers)
print(len(column_headers))
In [18]:
# store all passing data in one DataFrame
passing_df = pd.concat(passing_dfs_list, axis=0)
passing_df.columns = column_headers
In [19]:
# Take a look at the rows with index label 0 (the first row from each year's data)
passing_df.loc[0,:]
Out[19]:
Now that we fixed up the necessary columns, let's write out the raw data to a CSV file.
In [20]:
if not os.path.exists('../data/raw_data'):
    os.makedirs('../data/raw_data')

if not os.path.exists('../data/clean_data'):
    os.makedirs('../data/clean_data')
In [21]:
passing_df.head()
Out[21]:
In [22]:
# Write out the raw passing data to the raw_data folder in the data folder
passing_df.to_csv("../data/raw_data/pfr_nfl_passing_data_RAW.csv", index=False)
In [23]:
passing_df_clean = pd.read_csv("../data/raw_data/pfr_nfl_passing_data_RAW.csv")
In [24]:
passing_columns = {
    'Year': 'year', 'Player_name': 'name', 'Tm': 'team', 'Age': 'age',
    'Pos': 'position', 'G': 'games_played', 'GS': 'games_started',
    'QBrec': 'record', 'Cmp': 'completions', 'Att': 'attempts',
    'Cmp%': 'completionPct', 'Yds': 'passing_yards', 'TD': 'passing_TD',
    'TD%': 'passing_TDPct', 'Int': 'passing_INT', 'Int%': 'passing_INTPct',
    'Lng': 'passing_long', 'Y/A': 'passing_ydsAtt', 'AY/A': 'passing_airydsAtt',
    'Y/C': 'passing_ydsComp', 'Y/G': 'passing_ydsGame', 'Rate': 'passing_rating',
    'QBR': 'QBR', 'Sk': 'passing_sacks', 'Yds.1': 'passing_sacksyds',
    'NY/A': 'netydsAtt', 'ANY/A': 'passing_airnetydsAtt',
    'Sk%': 'passing_sackPct', '4QC': 'FourthQtrComebacks',
    'GWD': 'gamewinningdrives'
}
ordered_columns = [
'year', 'name', 'team', 'age', 'position', 'wins',
'losses', 'games_played', 'games_started', 'completions',
'attempts', 'completionPct', 'passing_yards',
'passing_TD', 'passing_TDPct', 'passing_INT',
'passing_INTPct', 'passing_long', 'passing_ydsAtt',
'passing_airydsAtt', 'passing_ydsComp',
'passing_ydsGame', 'passing_rating', 'passing_sacks',
'passing_sacksyds', 'passing_airnetydsAtt',
'passing_sackPct', 'FourthQtrComebacks',
'gamewinningdrives'
]
In [25]:
passing_df_clean.rename(index=str,columns=passing_columns, inplace=True)
passing_df_clean.head()
Out[25]:
In [26]:
# make sure the record and position columns are strings
passing_df_clean['record'] = passing_df_clean['record'].astype('str')
passing_df_clean['position'] = passing_df_clean['position'].astype('str')
passing_df_clean['position'] = passing_df_clean['position'].str.upper()
# replace the stray header rows and missing records with a default "0-0-0" record
passing_df_clean.loc[passing_df_clean.record == 'QBrec', 'record'] = "0-0-0"
passing_df_clean.loc[passing_df_clean.record == 'nan', 'record'] = "0-0-0"
# split the win-loss-tie record into separate wins and losses columns
unformatted_record = passing_df_clean['record'].str.replace("-", "/")
passing_df_clean['wins'] = unformatted_record.str.split("/").str[0]
passing_df_clean['losses'] = unformatted_record.str.split("/").str[1]
# strip the "*" and "+" markers from the player names
passing_df_clean['name'] = passing_df_clean['name'].str.replace('[+|*]', "")
passing_df_clean['wins'] = passing_df_clean['wins'].astype('float')
# keep (and reorder) only the columns listed in ordered_columns
passing_df_clean = pd.DataFrame(data=passing_df_clean, columns=ordered_columns)
In [27]:
# convert the data to proper numeric types
passing_df_clean = passing_df_clean.apply(pd.to_numeric, errors="ignore")
passing_df_clean.info()
We are not done yet. A lot of our numeric columns are missing data because players didn't accumulate any of those stats. For example, some players didn't score a TD or even play a game. Let's select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.
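Here is a tiny illustration of the same idea on made-up data: select the numeric columns by dtype, then fill their NaNs with 0 while leaving the string columns alone:

import numpy as np
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"], "passing_TD": [3.0, np.nan]})
num_cols = toy.columns[toy.dtypes != object]
toy.loc[:, num_cols] = toy.loc[:, num_cols].fillna(0)
print(toy)
#   name  passing_TD
# 0    A         3.0
# 1    B         0.0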
In [28]:
# Get the column names for the numeric columns
num_cols = passing_df_clean.columns[passing_df_clean.dtypes != object]
# Replace all NaNs with 0
passing_df_clean.loc[:, num_cols] = passing_df_clean.loc[:, num_cols].fillna(0)
In [29]:
passing_df_clean.info()
We are finally done cleaning the data and now we can save it to a CSV file.
In [30]:
passing_df_clean = passing_df_clean[pd.notnull(passing_df_clean['name'])]
passing_df_clean.sort_values('wins', ascending=False, axis=0, inplace=True)
In [31]:
passing_df_clean.to_csv("../data/clean_data/pfr_nfl_passing_data_CLEAN.csv", index=False)
passing_df_clean.head(10)
Out[31]:
In [32]:
import sys
import bs4
print("System and module version information: \n")
print('Python version:', sys.version_info)
print('urllib.request version:', urllib.request.__version__)
print('pandas version:', pd.__version__)
print('Beautiful Soup version:', bs4.__version__)