Python for Webscraping

  • SOC 590: Big Data and Population Processes
  • 17th October 2016

Tutorial 3: Webscraping and pre-processing

Outline

  1. Import modules
  2. Examine html structure of webpage
  3. Use a function with Beautiful Soup to parse webpages into .csv
  4. Analyze .csv of webpage as a Pandas DataFrame
  5. Process the data with regular expression

In [1]:
import os
import urllib
import webbrowser
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
url = 'http://www.pro-football-reference.com/years/2015/passing.htm'
webbrowser.open_new_tab(url)


Out[2]:
True

In [3]:
# The url we will be scraping
url_2015 = "http://www.pro-football-reference.com/years/2015/passing.htm"

# get the html
html = urllib.request.urlopen(url_2015)

# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

Scraping the Column Headers

The column headers we need for our DataFrame are found in the second row of column headers PFR table. We will will scrape those and add two additional columns headers for the two additional player page links.


In [4]:
# Extract the necessary values for the column headers from the table
# and store them as a list
column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
column_headers = [s for s in column_headers if len(s) != 0]
column_headers = column_headers[1:]
column_headers


Out[4]:
['Tm',
 'Age',
 'Pos',
 'G',
 'GS',
 'QBrec',
 'Cmp',
 'Att',
 'Cmp%',
 'Yds',
 'TD',
 'TD%',
 'Int',
 'Int%',
 'Lng',
 'Y/A',
 'AY/A',
 'Y/C',
 'Y/G',
 'Rate',
 'QBR',
 'Sk',
 'Yds',
 'NY/A',
 'ANY/A',
 'Sk%',
 '4QC',
 'GWD']

In [5]:
len(column_headers)


Out[5]:
28

Scraping the Data

  • Note that table_rows is a list of tag elements.

In [6]:
# The data is found within the table rows
# We want the elements from the 3rd row and on
table_rows = soup.find_all("tr")[1:]
print(type(table_rows))
table_rows[0] # take a look at the first row


<class 'list'>
Out[6]:
<tr><th class="right " csk="1" data-stat="ranker" scope="row">1</th><td class="left " csk="Rivers,Philip" data-append-csv="RivePh00" data-stat="player"><a href="/players/R/RivePh00.htm">Philip Rivers</a></td><td class="left " data-stat="team"><a href="/teams/sdg/2015.htm" title="San Diego Chargers">SDG</a></td><td class="right " data-stat="age">34</td><td class="left " data-stat="pos">QB</td><td class="right " data-stat="g">16</td><td class="right " data-stat="gs">16</td><td class="right " csk="0.25000" data-stat="qb_rec">4-12-0</td><td class="right " data-stat="pass_cmp">437</td><td class="right " data-stat="pass_att">661</td><td class="right " data-stat="pass_cmp_perc">66.1</td><td class="right " data-stat="pass_yds">4792</td><td class="right " data-stat="pass_td">29</td><td class="right " data-stat="pass_td_perc">4.4</td><td class="right " data-stat="pass_int">13</td><td class="right " data-stat="pass_int_perc">2.0</td><td class="right " data-stat="pass_long">80</td><td class="right " data-stat="pass_yds_per_att">7.2</td><td class="right " data-stat="pass_adj_yds_per_att">7.2</td><td class="right " data-stat="pass_yds_per_cmp">11.0</td><td class="right " data-stat="pass_yds_per_g">299.5</td><td class="right " data-stat="pass_rating">93.8</td><td class="right " data-stat="qbr">59.44</td><td class="right " data-stat="pass_sacked">40</td><td class="right " data-stat="pass_sacked_yds">264</td><td class="right " data-stat="pass_net_yds_per_att">6.46</td><td class="right " data-stat="pass_adj_net_yds_per_att">6.45</td><td class="right " data-stat="pass_sacked_perc">5.7</td><td class="right " data-stat="comebacks">1</td><td class="right " data-stat="gwd">2</td></tr>

The data we want for each player is found within the the td (or table data) elements.

Below I've created a function that extracts the data we want from table_rows. The comments should walk you through what each part of the function does.


In [7]:
def extract_player_data(table_rows):
    """
    Extract and return the the desired information from the td elements within
    the table rows.
    """
    # create the empty list to store the player data
    player_data = []
    
    for row in table_rows:  # for each row do the following

        # Get the text for each table data (td) element in the row
        player_list = [td.get_text() for td in row.find_all("td")]
        
        # there are some empty table rows, which are the repeated 
        # column headers in the table
        # we skip over those rows and and continue the for loop
        if not player_list:
            continue
            
        # Now append the data to list of data
        player_data.append(player_list)
        
    return player_data

Now we can create a DataFrame with the 2016 passing data


In [8]:
# extract the data we want
data = extract_player_data(table_rows)
# and then store it in a DataFrame
df_2015  = pd.DataFrame(data)

In [9]:
data[1]


Out[9]:
['Drew Brees',
 'NOR',
 '36',
 'QB',
 '15',
 '15',
 '7-8-0',
 '428',
 '627',
 '68.3',
 '4870',
 '32',
 '5.1',
 '11',
 '1.8',
 '80',
 '7.8',
 '8.0',
 '11.4',
 '324.7',
 '101.0',
 '75.47',
 '31',
 '235',
 '7.04',
 '7.26',
 '4.7',
 '1',
 '2']

In [10]:
df_2015.columns = ['Player_name']+column_headers

In [11]:
df_2015.head()


Out[11]:
Player_name Tm Age Pos G GS QBrec Cmp Att Cmp% ... Y/G Rate QBR Sk Yds NY/A ANY/A Sk% 4QC GWD
0 Philip Rivers SDG 34 QB 16 16 4-12-0 437 661 66.1 ... 299.5 93.8 59.44 40 264 6.46 6.45 5.7 1 2
1 Drew Brees NOR 36 QB 15 15 7-8-0 428 627 68.3 ... 324.7 101.0 75.47 31 235 7.04 7.26 4.7 1 2
2 Tom Brady* NWE 38 QB 16 16 12-4-0 402 624 64.4 ... 298.1 102.2 64.42 38 225 6.87 7.48 5.7 2 2
3 Eli Manning* NYG 34 QB 16 16 6-10-0 387 618 62.6 ... 277.0 93.6 60.46 27 157 6.63 6.74 4.2 1 2
4 Matt Ryan ATL 30 QB 16 16 8-8-0 407 614 66.3 ... 286.9 89.0 61.79 30 203 6.81 6.35 4.7 4 4

5 rows × 29 columns

Scraping the Data for All Seasons Since 1932

Scraping the for passing data since 1932 follows is essentially the same process as above, just repeated for each year, using a for loop.

As we loop over the years, we will create a DataFrame for each year of passing data, and append it to a large list of DataFrames that contains all the passing data. We will also have a separate list that will contain any errors and the url associated with that error. This will let us know if there are any issues with our scraper, and which url is causing the error.


In [12]:
# Create an empty list that will contain all the dataframes
# (one dataframe for all passing dataframes)
passing_dfs_list = []

# a list to store any errors that may come up while scraping
errors_list = []

In [13]:
# The url template that we pass in the passing year info
url_template = "http://www.pro-football-reference.com/years/{year}/passing.htm"

# for each year from 1932 to (and including) 2016
for year in range(1932, 2017): 
    
    # Use try/except block to catch and inspect any urls that cause an error
    try:
        # get the passing data url
        url = url_template.format(year=year)

        # get the html
        html = urllib.request.urlopen(url)

        # create the BeautifulSoup object
        soup = BeautifulSoup(html, "lxml") 

        # get the column headers
        column_headers = [th.getText() for th in soup.findAll('th', limit=30)]

        column_headers = [s for s in column_headers if len(s) != 0]
        column_headers = column_headers[1:]
        
        # select the data from the table 
        table_rows = soup.find_all("tr")[1:]

        # extract the player data from the table rows
        player_data = extract_player_data(table_rows)
        
        # create the dataframe for the current year's passing data
        year_df = pd.DataFrame(player_data)

        # add the year of the pssing data to the dataframe
        year_df.insert(0, "Year", year)
   
        # append the current dataframe to the list of dataframes
        passing_dfs_list.append(year_df)
    
    except Exception as e:
        # Store the url and the error it causes in a list
        error =[url, e] 
        # then append it to the list of errors
        errors_list.append(error)

In [14]:
print(len(errors_list))
errors_list


0
Out[14]:
[]

We don't get any errors, so that's good.

Now we can concatenate all the DataFrames we scraped and create one large DataFrame containing all the passing data


In [15]:
type(passing_dfs_list)


Out[15]:
list

In [16]:
passing_dfs_list[0:1]


Out[16]:
[    Year                 0    1   2   3   4   5 6   7    8 ...    18    19  \
 0   1932     Arnie Herber+  GNB  22  tb  14   8    37  101 ...  17.3  45.6   
 1   1932       Walt Holmer  CRD  30   b  10   4    25   78 ...  18.0  44.9   
 2   1932    Benny Friedman  BKN  27  TB  11  10    23   74 ...  13.9  29.0   
 3   1932      Jack McBride  2TM  31      12   6    36   74 ...  10.1  30.3   
 4   1932  Keith Molesworth  CHI  27  QB  14  14    25   64 ...  13.8  24.7   
 5   1932   Honolulu Hughes  BOS  25  BB  10   9    11   57 ...  12.2  13.4   
 6   1932      Dutch Clark+  PRT  26  TB  11   9    17   52 ...  16.0  24.7   
 7   1932    Glenn Presnell  PRT  27  tb  12   7    17   46 ...  15.2  21.6   
 8   1932     Jack Grossman  BKN  22  FB  12   9    11   32 ...  16.2  14.8   
 9   1932      Jack Hagerty  NYG  29   b   4   4    17   32 ...   9.3  39.5   
 10  1932       Doug Wycoff  SIS  29  BB  12  11    10   31 ...  14.0  11.7   
 11  1932       Joe Lillard  CRD  27  TB   7   5     9   28 ...  11.4  14.7   
 12  1932     John Doehring  CHI  23       7   0    10   26 ...  20.3  29.0   
 13  1932  Bronko Nagurski+  CHI  24  FB  14  14    11   26 ...  13.6  10.7   
 14  1932        Ken Strong  SIS  26  FB  11   7     4   23 ...  23.5   8.5   
 15  1932     Cliff Battles  BOS  22  TB   8   8     2   20 ...  21.0   5.3   
 16  1932     Bob Campiglio  SIS      TB  11   5     7   20 ...  15.6   9.9   
 17  1932         Hap Moran  NYG  31  fb  11   5    12   20 ...   7.6   8.3   
 18  1932     Grassy Hinton  SIS  25  tb  12   6     5   19 ...   9.2   3.8   
 19  1932        Bo Molenda  2TM  27      12   7     7   15 ...  15.1   8.8   
 20  1932       Ray Novotny  BKN  25   b  10   4     5   15 ...  10.0   5.0   
 21  1932        Stu Clancy  2TM  26      12   2     2   13 ...  22.5   3.8   
 22  1932        Red Grange  CHI  29  LH  12  12     5   13 ...  19.2   8.0   
 23  1932    Verne Lewellen  GNB  31  TB  14   6     3   13 ...  22.3   4.8   
 24  1932      John Cavosie  PRT  24  FB  11   7     5   12 ...  13.2   6.0   
 25  1932        Les Molloy  CRD  25   b   8   3     2   11 ...  11.0   2.8   
 26  1932      Swede Hanson  SIS  25   b   3   2     2   10 ...  15.5  10.3   
 27  1932   Stumpy Thomason  BKN  26   b  10   4     1   10 ...  -4.0  -0.4   
 28  1932         Oran Pape  2TM  28       7   0     2    9 ...  15.5   4.4   
 29  1932    Carl Brumbaugh  CHI  26      13   0     3    7 ...  23.7   5.5   
 30  1932       Chris Cagle  NYG  28  FB  10   7     3    7 ...  22.7   6.8   
 31  1932   Shipwreck Kelly  NYG  22       5   0     3    7 ...   7.3   4.4   
 32  1932     Harry O'Boyle  GNB  28  BB  11   6     1    7 ...   8.0   0.7   
 33  1932        Stu Wilson  SIS  25  RE  11   7     1    7 ...  18.0   1.6   
 34  1932      Tiny Feather  NYG  30  wb   9   7     1    5 ...  15.0   1.7   
 35  1932   Paul Fitzgibbon  GNB  29   b   4   2     1    5 ...   9.0   2.3   
 36  1932         Gene Rose  CRD  28       1   0     2    5 ...  20.0  40.0   
 37  1932       Reggie Rust  BOS  23  tb   5   1     2    5 ...  12.5   5.0   
 38  1932       Ed Westfall  BOS  24       2   0     0    5 ...         0.0   
 39  1932       Hank Bruder  GNB  25  WB  14   5     1    4 ...  23.0   1.6   
 40  1932     Ray Flaherty+  NYG  29  RE  12  11     3    4 ...  11.0   2.8   
 41  1932      Dick Nesbitt  CHI  25  RH  14  14     2    4 ...   7.0   1.0   
 42  1932      Johnny Blood  GNB  29  tb  13   3     0    3 ...         0.0   
 43  1932    George Corbett  CHI  24   b  12   1     2    3 ...  30.5   5.1   
 44  1932      Ace Gutowsky  PRT  23  fb   8   4     1    3 ...  26.0   3.3   
 45  1932       Johnny Sisk  CHI  26  hb  13   1     1    3 ...   8.0   0.6   
 46  1932     Stud Stennett  CRD  25   b   2   1     1    3 ...  11.0   5.5   
 47  1932      Ossie Wiberg  BKN  28   b   6   3     1    3 ...  20.0   3.3   
 48  1932     Buck Halperin  BKN  24       2   0     1    2 ...  23.0  11.5   
 49  1932     Herb Joesting  CHI  27       4   0     1    2 ...  22.0   5.5   
 50  1932        Jim Musick  BOS  22  FB  10  10     1    2 ...   7.0   0.7   
 51  1932       Bernie Finn  2TM  25       3   2     1    1 ...   9.0   3.0   
 52  1932     Clarke Hinkle  GNB  23  FB  13   9     1    1 ...  23.0   1.8   
 53  1932      Deck Shelley  2TM  26       4   1     1    1 ...   3.0   0.8   
 54  1932       Dale Waters  BOS  23  RT   9   6     0    1 ...         0.0   
 
        20 21 22 23 24 25 26 27  
 0    51.5                 0  0  
 1    56.0                 0  0  
 2    28.9                 0  0  
 3    50.5                 0  0  
 4    46.7                 0  0  
 5     5.8                 0  0  
 6    24.4                 0  0  
 7    34.6                 0  0  
 8    14.8                 0  0  
 9    61.7                 0  0  
 10   20.9                 0  0  
 11    4.6                 0  0  
 12   76.3                 0  0  
 13   67.8                 0  0  
 14   19.0                 0  0  
 15    0.0                 0  0  
 16   54.0                 0  0  
 17   71.0                 0  0  
 18    0.0                 0  0  
 19   64.9                 0  0  
 20   16.0                 0  0  
 21    1.9                 0  0  
 22   64.9                 0  0  
 23    9.0                 0  0  
 24   99.3                 0  0  
 25   39.6                 0  0  
 26    0.4                 0  0  
 27    0.0                 0  0  
 28   38.9                 0  0  
 29   80.1                 0  0  
 30   38.7                 0  0  
 31   50.9                 0  0  
 32    0.0                 0  0  
 33    0.0                 0  0  
 34    0.0                 0  0  
 35   39.6                 0  0  
 36   68.7                 0  0  
 37   16.7                 0  0  
 38   39.6                 0  0  
 39   11.5                 0  0  
 40   99.0                 0  0  
 41   58.3                 0  0  
 42   39.6                 0  0  
 43  109.7                 0  0  
 44   66.0                 0  0  
 45   42.4                 0  0  
 46   45.1                 0  0  
 47   57.6                 0  0  
 48   91.7                 0  0  
 49   89.6                 0  0  
 50   18.7                 0  0  
 51  104.2                 0  0  
 52  118.7                 0  0  
 53   79.2                 0  0  
 54    0.0                 0  0  
 
 [55 rows x 29 columns]]

In [17]:
column_headers.insert(0, "Player_name")
column_headers.insert(0, "Year")
print(column_headers)
print(len(column_headers))


['Year', 'Player_name', 'Tm', 'Age', 'Pos', 'G', 'GS', 'QBrec', 'Cmp', 'Att', 'Cmp%', 'Yds', 'TD', 'TD%', 'Int', 'Int%', 'Lng', 'Y/A', 'AY/A', 'Y/C', 'Y/G', 'Rate', 'QBR', 'Sk', 'Yds', 'NY/A', 'ANY/A', 'Sk%', '4QC', 'GWD']
30

In [18]:
# store all passing data in one DataFrame
passing_df = pd.concat(passing_dfs_list, axis=0)
passing_df.columns = column_headers

In [19]:
# Take a look at the first few rows
passing_df.loc[0,:]


Out[19]:
Year Player_name Tm Age Pos G GS QBrec Cmp Att ... Y/G Rate QBR Sk Yds NY/A ANY/A Sk% 4QC GWD
0 1932 Arnie Herber+ GNB 22 tb 14 8 37 101 ... 45.6 51.5 0 0 NaN
0 1933 Harry Newman NYG 24 TB 14 12 53 136 ... 69.5 51.7 NaN
0 1934 Arnie Herber GNB 24 TB 11 7 42 115 ... 72.6 45.1 0 0 NaN
0 1935 Ed Danowski+ NYG 24 TB 12 8 57 113 ... 66.2 69.7 NaN
0 1936 Arnie Herber GNB 26 TB 12 5 77 173 ... 103.3 58.9 NaN
0 1937 Sammy Baugh+ WAS 23 TB 11 5 81 171 ... 102.5 50.5 NaN
0 1938 Ace Parker+ BKN 26 TB 11 11 63 148 ... 78.6 53.5 0 0 NaN
0 1939 Parker Hall*+ RAM 23 P/TB 11 11 106 208 ... 111.5 57.5 0 0 NaN
0 1940 Davey O'Brien PHI 23 QB 11 11 124 277 ... 117.3 39.2 0 0 NaN
0 1941 Cecil Isbell*+ GNB 26 TB 11 4 117 206 ... 134.5 81.4 NaN
0 1942 Bud Schwenk CRD 25 TB 11 9 126 295 ... 123.6 25.5 0 0 NaN
0 1943 Sammy Baugh+ WAS 29 P/TB 10 7 133 239 ... 175.4 78.0 NaN
0 1944 Irv Comp GNB 25 TB 10 8 80 177 ... 115.9 50.0 NaN
0 1945 Paul Christman CRD 27 QB 8 3 89 219 ... 143.4 42.6 0 0 NaN
0 1946 Bob Waterfield+ RAM 26 K/P/QB 11 4 127 251 ... 158.8 67.6 0 0 NaN
0 1947 Sammy Baugh WAS 33 P/QB 12 1 210 354 ... 244.8 92.0 0 0 NaN
0 1948 Sammy Baugh WAS 34 QB 12 3 185 315 ... 216.6 78.3 0 0 NaN
0 1949 Johnny Lujack CHI 24 K/QB 12 7 162 312 ... 221.5 76.0 0 0 NaN
0 1950 Bobby Layne DET 24 QB 12 12 6-6-0 152 336 ... 193.6 62.1 0 0 NaN
0 1951 Bobby Layne* DET 25 QB 12 12 7-4-1 152 332 ... 200.3 67.6 0 0 NaN
0 1952 Otto Graham* CLE 31 QB 12 12 8-4-0 181 364 ... 234.7 66.6 NaN
0 1953 George Blanda CHI 26 K/QB 12 12 3-8-1 169 362 ... 180.3 52.3 0 0 NaN
0 1954 Tobin Rote GNB 26 QB 12 12 4-8-0 180 382 ... 192.6 59.1 0 0 NaN
0 1955 Jim Finks PIT 28 QB 12 12 4-8-0 165 344 ... 189.2 47.7 0 0 NaN
0 1956 Tobin Rote* GNB 28 QB 12 11 4-7-0 146 308 ... 183.6 70.6 0 0 NaN
0 1957 Johnny Unitas* BAL 24 QB 12 12 7-5-0 172 301 ... 212.5 88.0 0 0 NaN
0 1958 Norm Van Brocklin* PHI 32 P/QB 12 12 2-9-1 198 374 ... 200.8 64.1 0 0 NaN
0 1959 Johnny Unitas*+ BAL 26 QB 12 12 9-3-0 193 367 ... 241.6 92.0 1 1 NaN
0 1960 Johnny Unitas* BAL 27 QB 12 12 6-6-0 190 378 ... 258.3 73.7 1 2 NaN
0 1961 Johnny Unitas* BAL 28 QB 14 14 8-6-0 229 420 ... 213.6 66.1 3 4 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
0 1987 Neil Lomax* STL 28 QB 12 12 6-6-0 275 463 ... 282.3 88.5 48 350 5.94 5.83 9.4 2 2 NaN
0 1988 Dan Marino MIA 27 QB 16 16 6-10-0 354 606 ... 277.1 80.8 6 31 7.19 6.42 1.0 1 2 NaN
0 1989 Don Majkowski* GNB 25 QB 16 16 10-6-0 353 599 ... 269.9 82.3 47 268 6.27 5.71 7.3 5 7 NaN
0 1990 Warren Moon* HOU 34 QB 15 15 8-7-0 362 584 ... 312.6 96.8 36 252 7.16 7.28 5.8 1 1 NaN
0 1991 Warren Moon* HOU 35 QB 16 16 11-5-0 404 655 ... 293.1 81.7 23 174 6.66 5.95 3.4 4 5 NaN
0 1992 Dan Marino* MIA 31 QB 16 16 11-5-0 330 554 ... 257.3 85.1 28 173 6.77 6.36 4.8 6 6 NaN
0 1993 John Elway* DEN 33 QB 16 16 9-7-0 348 551 ... 251.9 92.8 39 293 6.33 6.42 6.6 1 1 NaN
0 1994 Drew Bledsoe* NWE 22 QB 16 16 10-6-0 400 691 ... 284.7 73.6 22 139 6.19 5.19 3.1 3 3 NaN
0 1995 Drew Bledsoe NWE 23 QB 15 15 6-9-0 323 636 ... 233.8 63.7 23 170 5.06 4.37 3.5 2 2 NaN
0 1996 Drew Bledsoe* NWE 24 QB 16 16 11-5-0 373 623 ... 255.4 83.7 30 190 5.97 5.76 4.6 3 4 NaN
0 1997 Dan Marino MIA 36 QB 16 16 9-7-0 319 548 ... 236.3 80.7 20 132 6.42 6.11 3.5 1 3 NaN
0 1998 Peyton Manning IND 22 QB 16 16 3-13-0 326 575 ... 233.7 71.2 22 109 6.08 4.84 3.7 1 1 NaN
0 1999 Brett Favre GNB 30 QB 16 16 8-8-0 341 595 ... 255.7 74.7 35 223 6.14 5.20 5.6 3 3 NaN
0 2000 Vinny Testaverde NYJ 37 QB 16 16 9-7-0 328 590 ... 233.3 69.0 13 71 6.07 4.90 2.2 4 4 NaN
0 2001 Jon Kitna CIN 29 QB 16 15 6-9-0 313 581 ... 201.0 61.1 25 185 5.00 3.76 4.1 3 3 NaN
0 2002 Rich Gannon*+ OAK 37 QB 16 16 11-5-0 418 618 ... 293.1 97.3 36 214 6.84 6.95 5.5 1 1 NaN
0 2003 Brad Johnson TAM 35 QB 16 16 7-9-0 354 570 ... 238.2 81.5 20 111 6.27 5.55 3.4 NaN
0 2004 Trent Green KAN 34 QB 16 16 7-9-0 369 556 ... 286.9 95.2 32 227 7.42 7.04 5.4 3 3 NaN
0 2005 Brett Favre GNB 36 QB 16 16 4-12-0 372 607 ... 242.6 70.9 24 170 5.88 4.45 3.8 1 1 NaN
0 2006 Brett Favre GNB 37 QB 16 16 8-8-0 343 613 ... 242.8 72.7 39.85 21 134 5.92 5.21 3.3 1 1
0 2007 Drew Brees NOR 28 QB 16 16 7-9-0 440 652 ... 276.4 89.4 61.45 16 109 6.46 6.08 2.4 1 1
0 2008 Drew Brees* NOR 29 QB 16 16 8-8-0 413 635 ... 316.8 96.2 63.67 13 92 7.68 7.55 2.0 2 2
0 2009 Matt Schaub* HOU 28 QB 16 16 9-7-0 396 583 ... 298.1 98.6 69.45 25 149 7.60 7.44 4.1 2 4
0 2010 Peyton Manning* IND 34 QB 16 16 10-6-0 450 679 ... 293.8 91.9 73.42 16 91 6.63 6.48 2.3 0 2
0 2011 Matthew Stafford DET 23 QB 16 16 10-6-0 421 663 ... 314.9 97.2 60.93 36 257 6.84 6.98 5.2 3 4
0 2012 Matthew Stafford DET 24 QB 16 16 4-12-0 435 727 ... 310.4 79.8 54.36 29 212 6.29 5.81 3.8 3 3
0 2013 Peyton Manning*+ DEN 37 QB 16 16 13-3-0 450 659 ... 342.3 115.1 81.01 18 120 7.91 8.87 2.7 2 2
0 2014 Drew Brees* NOR 35 QB 16 16 7-9-0 456 659 ... 309.5 97.0 73.41 29 186 6.93 6.77 4.2 2 2
0 2015 Philip Rivers SDG 34 QB 16 16 4-12-0 437 661 ... 299.5 93.8 59.44 40 264 6.46 6.45 5.7 1 2
0 2016 Joe Flacco BAL 31 qb 6 6 3-3-0 164 264 ... 264.8 78.9 57.1 13 100 5.38 5.09 4.7 2 2

85 rows × 30 columns

Now that we fixed up the necessary columns, let's write out the raw data to a CSV file.


In [20]:
if not os.path.exists('../data/raw_data'):
    os.makedirs('../data/raw_data')

if not os.path.exists('../data/clean_data'):
    os.makedirs('../data/clean_data')

In [21]:
passing_df.head()


Out[21]:
Year Player_name Tm Age Pos G GS QBrec Cmp Att ... Y/G Rate QBR Sk Yds NY/A ANY/A Sk% 4QC GWD
0 1932 Arnie Herber+ GNB 22 tb 14 8 37 101 ... 45.6 51.5 0 0 NaN
1 1932 Walt Holmer CRD 30 b 10 4 25 78 ... 44.9 56.0 0 0 NaN
2 1932 Benny Friedman BKN 27 TB 11 10 23 74 ... 29.0 28.9 0 0 NaN
3 1932 Jack McBride 2TM 31 12 6 36 74 ... 30.3 50.5 0 0 NaN
4 1932 Keith Molesworth CHI 27 QB 14 14 25 64 ... 24.7 46.7 0 0 NaN

5 rows × 30 columns


In [22]:
# Write out the raw passing data to the raw_data fold in the data folder
passing_df.to_csv("../data/raw_data/pfr_nfl_passing_data_RAW.csv", index=False)

Cleaning the Data

Now that we have the raw passing data, we need to clean it up a bit in order to do some of the data exploration we want.


In [23]:
passing_df_clean = pd.read_csv("../data/raw_data/pfr_nfl_passing_data_RAW.csv")
  • create a dictionary for renaming passing columns
  • rename all columns with passing_columns dictionary

In [24]:
passing_columns = {
              'Year':'year', 'Player_name': 'name', 'Tm': 'team', 'Age': 'age',
              'Pos': 'position', 'G': 'games_played', 'GS':
              'games_started', 'QBrec': 'record', 'Cmp': 'completions',
              'Att': 'attempts', 'Cmp%': 'completionPct', 'Yds':
              'passing_yards', 'TD': 'passing_TD',  'TD%':
              'passing_TDPct', 'Int': 'passing_INT', 'Int%':
              'passing_INTPct', 'Lng': 'passing_long', 'Y/A':
              'passing_ydsAtt', 'AY/A': 'passing_airydsAtt', 'Y/C':
              'passing_ydsComp', 'Y/G': 'passing_ydsGame', 'Rate':
              'passing_rating', 'QBR': 'QBR','Sk': 'passing_sacks', 'Yds.1':
              'passing_sacksyds','NY/A': 'netydsAtt', 'ANY/A': 'passing_airnetydsAtt',
              'Sk%': 'passing_sackPct', '4QC': 'FourthQtrComebacks',
              'GWD': 'gamewinningdrives'
              }
ordered_columns = [
              'year', 'name', 'team', 'age', 'position', 'wins',
              'losses', 'games_played', 'games_started', 'completions',
              'attempts', 'completionPct', 'passing_yards',
              'passing_TD', 'passing_TDPct',  'passing_INT',
              'passing_INTPct', 'passing_long', 'passing_ydsAtt',
              'passing_airydsAtt', 'passing_ydsComp',
              'passing_ydsGame', 'passing_rating', 'passing_sacks',
              'passing_sacksyds', 'passing_airnetydsAtt',
              'passing_sackPct', 'FourthQtrComebacks',
              'gamewinningdrives'
    ]

In [25]:
passing_df_clean.rename(index=str,columns=passing_columns, inplace=True)
passing_df_clean.head()


Out[25]:
year name team age position games_played games_started record completions attempts ... passing_ydsGame passing_rating QBR passing_sacks passing_sacksyds netydsAtt passing_airnetydsAtt passing_sackPct FourthQtrComebacks gamewinningdrives
0 1932 Arnie Herber+ GNB 22.0 tb 14 8 NaN 37 101 ... 45.6 51.5 NaN NaN NaN NaN NaN 0.0 0.0 NaN
1 1932 Walt Holmer CRD 30.0 b 10 4 NaN 25 78 ... 44.9 56.0 NaN NaN NaN NaN NaN 0.0 0.0 NaN
2 1932 Benny Friedman BKN 27.0 TB 11 10 NaN 23 74 ... 29.0 28.9 NaN NaN NaN NaN NaN 0.0 0.0 NaN
3 1932 Jack McBride 2TM 31.0 NaN 12 6 NaN 36 74 ... 30.3 50.5 NaN NaN NaN NaN NaN 0.0 0.0 NaN
4 1932 Keith Molesworth CHI 27.0 QB 14 14 NaN 25 64 ... 24.7 46.7 NaN NaN NaN NaN NaN 0.0 0.0 NaN

5 rows × 30 columns

Cleaning Up the Rest of the Passing Data

  • process record columns into wins and losses

In [26]:
passing_df_clean['record'] = passing_df_clean['record'].astype('str')
passing_df_clean['position'] = passing_df_clean['position'].astype('str')
passing_df_clean['position'] = passing_df_clean['position'].str.upper()
passing_df_clean['record'].loc[passing_df_clean.record == 'QBrec'] = "0-0-0"
passing_df_clean['record'].loc[passing_df_clean.record == 'nan'] = "0-0-0"
unformatted_record = passing_df_clean['record'].str[:]
unformatted_record = unformatted_record.str.replace("-",  "/")
passing_df_clean['wins'] = unformatted_record.str.split("/").str[0]
passing_df_clean['losses'] = unformatted_record.str.split("/").str[1]
passing_df_clean['name'] = passing_df_clean['name'].str.replace('[+|*]', "")
passing_df_clean['wins'] = passing_df_clean['wins'].astype('float')
passing_df_clean = pd.DataFrame(data=passing_df_clean, columns=ordered_columns)


/Users/kivan/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:132: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)

In [27]:
# convert the data to proper numeric types
passing_df_clean = passing_df_clean.apply(pd.to_numeric, errors="ignore")
passing_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 6956 entries, 0 to 6955
Data columns (total 29 columns):
year                    6956 non-null int64
name                    6956 non-null object
team                    6956 non-null object
age                     6952 non-null float64
position                6956 non-null object
wins                    6956 non-null float64
losses                  6956 non-null int64
games_played            6956 non-null int64
games_started           6956 non-null int64
completions             6956 non-null int64
attempts                6956 non-null int64
completionPct           6956 non-null float64
passing_yards           6956 non-null int64
passing_TD              6956 non-null int64
passing_TDPct           6956 non-null float64
passing_INT             6956 non-null int64
passing_INTPct          6956 non-null float64
passing_long            6956 non-null int64
passing_ydsAtt          6956 non-null float64
passing_airydsAtt       6956 non-null float64
passing_ydsComp         5707 non-null float64
passing_ydsGame         6956 non-null float64
passing_rating          6956 non-null float64
passing_sacks           4770 non-null float64
passing_sacksyds        4770 non-null float64
passing_airnetydsAtt    4770 non-null float64
passing_sackPct         3836 non-null float64
FourthQtrComebacks      3223 non-null float64
gamewinningdrives       403 non-null float64
dtypes: float64(16), int64(10), object(3)
memory usage: 1.6+ MB

We are not done yet. A lot of out numeric columns are missing data because players didn't accumulate any of those stats. For example, some players didn't score a TD or even play a game. Let's select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.


In [28]:
# Get the column names for the numeric columns
num_cols = passing_df_clean.columns[passing_df_clean.dtypes != object]

# Replace all NaNs with 0
passing_df_clean.loc[:, num_cols] = passing_df_clean.loc[:, num_cols].fillna(0)

In [29]:
passing_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 6956 entries, 0 to 6955
Data columns (total 29 columns):
year                    6956 non-null int64
name                    6956 non-null object
team                    6956 non-null object
age                     6956 non-null float64
position                6956 non-null object
wins                    6956 non-null float64
losses                  6956 non-null int64
games_played            6956 non-null int64
games_started           6956 non-null int64
completions             6956 non-null int64
attempts                6956 non-null int64
completionPct           6956 non-null float64
passing_yards           6956 non-null int64
passing_TD              6956 non-null int64
passing_TDPct           6956 non-null float64
passing_INT             6956 non-null int64
passing_INTPct          6956 non-null float64
passing_long            6956 non-null int64
passing_ydsAtt          6956 non-null float64
passing_airydsAtt       6956 non-null float64
passing_ydsComp         6956 non-null float64
passing_ydsGame         6956 non-null float64
passing_rating          6956 non-null float64
passing_sacks           6956 non-null float64
passing_sacksyds        6956 non-null float64
passing_airnetydsAtt    6956 non-null float64
passing_sackPct         6956 non-null float64
FourthQtrComebacks      6956 non-null float64
gamewinningdrives       6956 non-null float64
dtypes: float64(16), int64(10), object(3)
memory usage: 1.6+ MB

We are finally done cleaning the data and now we can save it to a CSV file.


In [30]:
passing_df_clean = passing_df_clean[pd.notnull(passing_df_clean['name'])]
passing_df_clean.sort_values('wins', ascending=False, axis=0, inplace=True)

In [31]:
passing_df_clean.to_csv("../data/clean_data/pfr_nfl_passing_data_CLEAN.csv", index=False)
passing_df_clean.head(10)


Out[31]:
year name team age position wins losses games_played games_started completions ... passing_airydsAtt passing_ydsComp passing_ydsGame passing_rating passing_sacks passing_sacksyds passing_airnetydsAtt passing_sackPct FourthQtrComebacks gamewinningdrives
6039 2007 Tom Brady NWE 30.0 QB 16.0 0 16 16 398 ... 9.4 12.1 300.4 117.2 21.0 128.00 8.88 3.5 4.0 4.0
6833 2015 Cam Newton CAR 26.0 QB 15.0 1 16 16 296 ... 8.3 13.0 239.8 99.4 33.0 284.00 7.20 6.3 3.0 4.0
3038 1978 Terry Bradshaw PIT 30.0 QB 14.0 2 16 16 207 ... 7.0 14.1 182.2 84.7 222.0 6.92 5.40 2.0 2.0 0.0
5610 2003 Tom Brady NWE 26.0 QB 14.0 2 16 16 317 ... 6.7 11.4 226.3 85.9 219.0 6.08 5.70 4.0 7.0 0.0
5952 2006 Philip Rivers SDG 25.0 QB 14.0 2 16 16 284 ... 7.4 11.9 211.8 92.0 27.0 144.00 6.73 5.5 4.0 4.0
4317 1990 Joe Montana SFO 34.0 QB 14.0 1 15 15 321 ... 7.2 12.3 262.9 89.0 153.0 6.91 5.30 3.0 5.0 0.0
3655 1984 Joe Montana SFO 28.0 QB 14.0 1 16 15 279 ... 8.7 13.0 226.9 102.9 138.0 7.69 4.80 2.0 2.0 0.0
6360 2010 Tom Brady NWE 33.0 QB 14.0 2 16 16 324 ... 9.0 12.0 243.8 111.0 25.0 175.00 8.25 4.8 2.0 3.0
6243 2009 Peyton Manning IND 33.0 QB 14.0 2 16 16 393 ... 7.8 11.5 281.3 99.9 10.0 74.00 7.51 1.7 7.0 7.0
3646 1984 Dan Marino MIA 23.0 QB 14.0 2 16 16 362 ... 9.4 14.0 317.8 108.9 120.0 8.60 2.30 2.0 3.0 0.0

10 rows × 29 columns

Review

In this tutorial, we learned how to examine the html structure of webpage and use a function based on the Beautiful Soup module to parse tables on multiple webpage into .csv. After creating a .csv of all webpage tables, we analyzed the .csv using the Pandas module


In [32]:
import sys
import bs4
print("System and module version information: \n")
print('Python version:', sys.version_info)
print('urllib.request version:', urllib.request.__version__)
print('pandas version:', pd.__version__)
print('Beautiful Soup version:', bs4.__version__)


System and module version information: 

Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
urllib.request version: 3.5
pandas version: 0.18.1
Beautiful Soup version: 4.5.1