NBA MVP Comparisons

Part 1

  • 25 th December 2018

It's Christmas and that means a full slate of NBA games. This time of year also provokes some great NBA discussions and the best NBA discussions are comparative. Arguments like: would Jordan's '96 Bulls would beat the '17 Warriors?

Another comparison that is sure to create great debate is who had the best (or worst) Most Valuable Player (MVP) season in history? This question has a strong empirical dimension, in that we can observe quantifiable aspects across seasons, and can leverage programming to both gather and and analyze the data.

The MVP data will gather comes from basketball-reference.com. Basketball-reference is part of the Sports-Reference sites, "a group of sites providing both basic and sabermetric statistics and resources for sports fans everywhere. [Sports-Reference aims] to be the easiest-to-use, fastest, most complete sources for sports statistics anywhere" (sports-reference.com).

In this post, we will gather and pre-process all the data for the a multi-part series to determine:

  1. the MVP finalist with best case for winning their year
  2. predict the 2018-2019 MVP

Let the shade begin

Outline

  1. Import modules
  2. Examine html structure of webpage
  3. Use a function with Beautiful Soup to parse webpages into .csv
  4. Analyze .csv of webpage as a Pandas DataFrame
  5. Process the data

In [1]:
import os
import urllib
import webbrowser
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
  • Let's examine the webpage with all the MVP data from the 1966-1956 season to the 2017-2018 season

In [2]:
url = 'https://www.basketball-reference.com/awards/mvp.html'
webbrowser.open_new_tab(url)


Out[2]:
True

In [3]:
# get the html
html = urllib.request.urlopen(url)

# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

Scraping the Column Headers

The column headers we need for our DataFrame are found in the th element


In [4]:
# Extract the necessary values for the column headers from the table
# and store them as a list
column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
column_headers = [s for s in column_headers if len(s) != 0]
column_headers = column_headers[1:]
print(column_headers)
column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced')][:-7]
print(column_headers)


['Shooting', 'Advanced', 'Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '2017-18', '2016-17', '2015-16', '2014-15', '2013-14', '2012-13', '2011-12']
['Season', 'Lg', 'Player', 'Voting', 'Age', 'Tm', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']

In [5]:
len(column_headers)


Out[5]:
18

Scraping the Data

  • Note that table_rows is a list of tag elements.

In [6]:
# The data is found within the table rows
# We want the elements from the 3rd row and on
table_rows = soup.find_all("tr")[2:]
print(type(table_rows))
table_rows[0] # take a look at the first row


<class 'list'>
Out[6]:
<tr><th class="left " data-stat="season" scope="row"><a href="/leagues/NBA_2018.html">2017-18</a></th><td class="left " data-stat="lg_id"><a href="/leagues/NBA_2018.html">NBA</a></td><td class="left " csk="Harden,James" data-append-csv="hardeja01" data-stat="player"><a href="/players/h/hardeja01.html">James Harden</a></td><td class="center " data-stat="voting"> (<a href="/awards/awards_2018.html#mvp">V</a>)</td><td class="right " data-stat="age">28</td><td class="left " data-stat="team_id"><a href="/teams/HOU/2018.html">HOU</a></td><td class="right " data-stat="g">72</td><td class="right " data-stat="mp_per_g">35.4</td><td class="right " data-stat="pts_per_g">30.4</td><td class="right " data-stat="trb_per_g">5.4</td><td class="right " data-stat="ast_per_g">8.8</td><td class="right " data-stat="stl_per_g">1.8</td><td class="right " data-stat="blk_per_g">0.7</td><td class="right " data-stat="fg_pct">.449</td><td class="right " data-stat="fg3_pct">.367</td><td class="right " data-stat="ft_pct">.858</td><td class="right " data-stat="ws">15.4</td><td class="right " data-stat="ws_per_48">.289</td></tr>
  • The data we want for each player is found within the the td (or table data) elements.
  • Below I've created a function that extracts the data we want from table_rows. The comments should walk you through what each part of the function does.

In [7]:
def extract_mvp_data(table_rows):
    """
    Extract and return the the desired information from the td elements within the table rows.
    :param: table_rows:  list of soup `tr` elements
    :return: list of player-year MVP observation
    """
    # create the empty list to store the player data
    player_data = []
    
    for row in table_rows:  # for each row do the following

        # Get the text for each table data (td) element in the row
        player_year = [td.get_text() for td in row.find_all("th")]
        player_list = [td.get_text() for td in row.find_all("td")]
              
        # there are some empty table rows, which are the repeated 
        # column headers in the table
        # we skip over those rows and and continue the for loop
        if not player_list:
            continue
            
        # Now append the data to list of data
        
        player_info = player_year+player_list
        player_data.append(player_info)
        
    return player_data
  • now we can create a DataFrame with the MVP data

In [8]:
# extract the data we want
mvp_data = extract_mvp_data(table_rows)
# and then store it in a DataFrame
mvp_data_df  = pd.DataFrame(mvp_data)

In [9]:
mvp_data_df[0:1]


Out[9]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 2017-18 NBA James Harden (V) 28 HOU 72 35.4 30.4 5.4 8.8 1.8 0.7 .449 .367 .858 15.4 .289
  • rename the columns
  • view the data

In [10]:
mvp_data_df.columns = column_headers
print("the MVP dataframe has {} rows (player-year observations) and  {} columns".format(mvp_data_df.shape[0],
                                                                                        mvp_data_df.shape[1]))
mvp_data_df.head()


the MVP dataframe has 113 rows (player-year observations) and  18 columns
Out[10]:
Season Lg Player Voting Age Tm G MP PTS TRB AST STL BLK FG% 3P% FT% WS WS/48
0 2017-18 NBA James Harden (V) 28 HOU 72 35.4 30.4 5.4 8.8 1.8 0.7 .449 .367 .858 15.4 .289
1 2016-17 NBA Russell Westbrook (V) 28 OKC 81 34.6 31.6 10.7 10.4 1.6 0.4 .425 .343 .845 13.1 .224
2 2015-16 NBA Stephen Curry (V) 27 GSW 79 34.2 30.1 5.4 6.7 2.1 0.2 .504 .454 .908 17.9 .318
3 2014-15 NBA Stephen Curry (V) 26 GSW 80 32.7 23.8 4.3 7.7 2.0 0.2 .487 .443 .914 15.7 .288
4 2013-14 NBA Kevin Durant (V) 25 OKC 81 38.5 32.0 7.4 5.5 1.3 0.7 .503 .391 .873 19.2 .295
  • now we need the data on all the finalist for the question:
    • which finalist had the best argument for winning that year?

In [11]:
url_2018_mvp_finalist = 'https://www.basketball-reference.com/awards/awards_2018.html#mvp'

In [12]:
# get the html
# examine webpage
html_finalist = urllib.request.urlopen(url_2018_mvp_finalist)
webbrowser.open_new_tab(url)

# create the BeautifulSoup object
soup_finalist = BeautifulSoup(html_finalist, "lxml") 

# Extract the necessary values for the column headers from the table and store them as a list
column_headers_finalist = [th.getText() for th in soup_finalist.findAll('th', limit=30)]
column_headers_finalist = [s for s in column_headers_finalist if len(s) != 0]
column_headers_finalist = column_headers_finalist[1:]

print("raw column names in finalist table: {}".format(column_headers_finalist))
column_headers_finalist = [e for e in column_headers_finalist if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]
print("formatted column names in finalist table: {}".format(column_headers_finalist))
print("{} columns in finalist table".format(len(column_headers_finalist)))

# The data is found within the `tr` elements of the first `tbody` element
# We want the elements from the 3rd row and on
table_rows_finalist = soup_finalist.find("tbody").find_all("tr")
print(type(table_rows_finalist))
table_rows_finalist[-1] # take a look at the last row


raw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4']
formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
20 columns in finalist table
<class 'bs4.element.ResultSet'>
Out[12]:
<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>
  • create a function to extract MVP finalist data

In [13]:
def extract_player_data(table_rows):
    """
    Extract and return the the desired information from the td elements within the table rows.
    :param: table_rows:  list of soup `tr` elements
    :return: list of player-year MVP finalist observations
    """
    # create the empty list to store the player data
    player_data = []
    
    for row in table_rows:  # for each row do the following

        # Get the text for each table data (td) element in the row
        player_rank = [td.get_text() for td in row.find_all("th")]
        player_list = [td.get_text() for td in row.find_all("td")]
        
        
        # there are some empty table rows, which are the repeated 
        # column headers in the table
        # we skip over those rows and and continue the for loop
        if not player_list:
            continue
            
        # Now append the data to list of data
        
        player_info = player_rank+player_list
        player_data.append(player_info)
        
    return player_data

In [14]:
# extract the data we want
data = extract_player_data(table_rows_finalist)

# and then store it in a DataFrame
example_player_df  = pd.DataFrame(data)
example_player_df.columns = column_headers_finalist
print("the MVP finalist dataframe has {} rows (player-year observations) and  {} columns".format(example_player_df.shape[0],
                                                                                        example_player_df.shape[1]))
example_player_df.tail(6)


the MVP finalist dataframe has 13 rows (player-year observations) and  20 columns
Out[14]:
Rank Player Age Tm First Pts Won Pts Max Share G MP PTS TRB AST STL BLK FG% 3P% FT% WS WS/48
7 8 DeMar DeRozan 28 TOR 0.0 32.0 1010 0.032 80 33.9 23.0 3.9 5.2 1.1 0.3 .456 .310 .825 9.6 .170
8 9 LaMarcus Aldridge 32 SAS 0.0 6.0 1010 0.006 75 33.5 23.1 8.5 2.0 0.6 1.2 .510 .293 .837 10.9 .209
9 10T Jimmy Butler 28 MIN 0.0 5.0 1010 0.005 59 36.7 22.2 5.3 4.9 2.0 0.4 .474 .350 .854 8.9 .198
10 10T Stephen Curry 29 GSW 0.0 5.0 1010 0.005 51 32.0 26.4 5.1 6.1 1.6 0.2 .495 .423 .921 9.1 .267
11 12 Joel Embiid 23 PHI 0.0 4.0 1010 0.004 63 30.3 22.9 11.0 3.2 0.6 1.8 .483 .308 .769 6.2 .155
12 13 Victor Oladipo 25 IND 0.0 2.0 1010 0.002 75 34.0 23.1 5.2 4.3 2.4 0.8 .477 .371 .799 8.2 .155

Scraping the Data for All MVP Finalists Since 1956

Scraping the for finalist data since 1956 follows is essentially the same process as above, just repeated for each year, using a for loop.

  • As we loop over the years, we will create a DataFrame for each year of MVP finalist data, and append it to a large list of DataFrames that contains all the MVP finalists data. We will also have a separate list that will contain any errors and the url associated with that error. This will let us know if there are any issues with our scraper, and which url is causing the error.

In [15]:
# Create an empty list that will contain all the dataframes
# (one dataframe for all finalist dataframes)
mvp_finalist_list = []

# a list to store any errors that may come up while scraping
errors_list = []
  • let's time how long this loop takes

In [16]:
loop_start = datetime.now()
print(loop_start)


2019-05-25 15:10:57.567450

In [17]:
# The url template that we pass in the finalist year info
url_template = "https://www.basketball-reference.com/awards/awards_{year}.html#mvp"

# for each year from 1956 to (and including) 2018
for year in range(1956, 2019): 
    
    # Use try/except block to catch and inspect any urls that cause an error
    try:
        # get the MVP finalist data url
        url = url_template.format(year=year)

        # get the html
        html = urllib.request.urlopen(url)

        # create the BeautifulSoup object
        soup = BeautifulSoup(html, "lxml") 

        # get the column headers
        column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
        column_headers = [s for s in column_headers if len(s) != 0]
        column_headers = column_headers[1:]
        column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]
       
        # select the data from the table 
        table_rows = soup.find_all("tr")[2:]

        # extract the player data from the table rows
        player_data = extract_player_data(table_rows)
        
        # create the dataframe for the current year's mvp finalist data
        # subset to only include MVP finalists
        year_df = pd.DataFrame(player_data)
        year_df.columns = column_headers
        year_df = year_df.loc[year_df["Pts Max"]==year_df["Pts Max"].unique()[0]]

        # add the year of the MVP finalist data to the dataframe
        year_df.insert(0, "Year", year)
 
        # append the current dataframe to the list of dataframes
        mvp_finalist_list.append(year_df)
        
    except Exception as e:
        # Store the url and the error it causes in a list
        error =[url, e] 
        # then append it to the list of errors
        errors_list.append(error)

In [18]:
loop_stop = datetime.now()
print(loop_stop)
print("the loop took {}".format(loop_stop-loop_start))


2019-05-25 15:12:03.441050
the loop took 0:01:05.873600
  • the loop took ~ 1 minute

In [19]:
print(len(errors_list))
errors_list


0
Out[19]:
[]
  • We don't get any errors, so that's good.
  • Now we can concatenate all the DataFrames we scraped and create one large DataFrame containing all the finalist data

In [20]:
print(type(mvp_finalist_list))
print(len(mvp_finalist_list))


<class 'list'>
63

In [21]:
mvp_finalist_list[0:1]


Out[21]:
[   Year Rank          Player Age   Tm First Pts Won Pts Max  Share   G  ...   \
 0  1956    1      Bob Pettit  23  STL  33.0    33.0      80  0.413  72  ...    
 1  1956    2     Paul Arizin  27  PHW  21.0    21.0      80  0.263  72  ...    
 2  1956    3       Bob Cousy  27  BOS  11.0    11.0      80  0.138  72  ...    
 3  1956    4    Mel Hutchins  27  FTW   9.0     9.0      80  0.113  66  ...    
 4  1956   5T   Dolph Schayes  27  SYR   2.0     2.0      80  0.025  72  ...    
 5  1956   5T    Bill Sharman  29  BOS   2.0     2.0      80  0.025  72  ...    
 6  1956   7T        Tom Gola  23  PHW   1.0     1.0      80  0.013  68  ...    
 7  1956   7T  Maurice Stokes  22  ROC   1.0     1.0      80  0.013  67  ...    
 
     PTS   TRB  AST STL BLK   FG% 3P%   FT%    WS WS/48  
 0  25.7  16.2  2.6          .429      .736  13.8  .236  
 1  24.2   7.5  2.6          .448      .810  12.2  .214  
 2  18.8   6.8  8.9          .360      .844   6.8  .119  
 3  12.0   7.5  2.7          .425      .643   4.4  .095  
 4  20.4  12.4  2.8          .387      .858  11.8  .225  
 5  19.9   3.6  4.7          .438      .867   8.8  .157  
 6  10.8   9.1  5.9          .412      .733   6.5  .132  
 7  16.8  16.3  4.9          .354      .714   6.0  .125  
 
 [8 rows x 21 columns]]

In [22]:
column_headers_finalist.insert(0, "Year")
print(column_headers_finalist)
print(len(column_headers_finalist))


['Year', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
21

In [23]:
# store all finalist data in one DataFrame
mvp_finalist_df = pd.concat(mvp_finalist_list, axis=0)
print(mvp_finalist_df.shape)


(972, 21)

In [24]:
# Take a look at the first row
mvp_finalist_df.iloc[0]


Out[24]:
Year             1956
Rank                1
Player     Bob Pettit
Age                23
Tm                STL
First            33.0
Pts Won          33.0
Pts Max            80
Share           0.413
G                  72
MP               38.8
PTS              25.7
TRB              16.2
AST               2.6
STL                  
BLK                  
FG%              .429
3P%                  
FT%              .736
WS               13.8
WS/48            .236
Name: 0, dtype: object

Now that we fixed up the necessary columns, let's write out the raw data to a CSV file.


In [25]:
os.makedirs('../data/raw_data', exist_ok=True)
os.makedirs('../data/clean_data', exist_ok=True)

In [26]:
mvp_finalist_df.head()


Out[26]:
Year Rank Player Age Tm First Pts Won Pts Max Share G ... PTS TRB AST STL BLK FG% 3P% FT% WS WS/48
0 1956 1 Bob Pettit 23 STL 33.0 33.0 80 0.413 72 ... 25.7 16.2 2.6 .429 .736 13.8 .236
1 1956 2 Paul Arizin 27 PHW 21.0 21.0 80 0.263 72 ... 24.2 7.5 2.6 .448 .810 12.2 .214
2 1956 3 Bob Cousy 27 BOS 11.0 11.0 80 0.138 72 ... 18.8 6.8 8.9 .360 .844 6.8 .119
3 1956 4 Mel Hutchins 27 FTW 9.0 9.0 80 0.113 66 ... 12.0 7.5 2.7 .425 .643 4.4 .095
4 1956 5T Dolph Schayes 27 SYR 2.0 2.0 80 0.025 72 ... 20.4 12.4 2.8 .387 .858 11.8 .225

5 rows × 21 columns


In [27]:
# Write out the  MVP data and MVP finalist data to the raw_data folder in the data folder
mvp_data_df.to_csv("../data/raw_data/mvp_data_df_raw.csv", index=False)
mvp_finalist_df.to_csv("../data/raw_data/mvp_finalist_df_raw.csv", index=False)

Cleaning the Data

  • Now that we have the raw MVP data, we need to clean it up a bit for data exploration

In [28]:
mvp_data_df_clean = pd.read_csv("../data/raw_data/mvp_data_df_raw.csv", encoding = "Latin-1")
mvp_finalist_df_clean = pd.read_csv("../data/raw_data/mvp_finalist_df_raw.csv", encoding = "Latin-1")

In [29]:
mvp_data_df_clean.head()


Out[29]:
Season Lg Player Voting Age Tm G MP PTS TRB AST STL BLK FG% 3P% FT% WS WS/48
0 2017-18 NBA James Harden (V) 28.0 HOU 72.0 35.4 30.4 5.4 8.8 1.8 0.7 0.449 0.367 0.858 15.4 0.289
1 2016-17 NBA Russell Westbrook (V) 28.0 OKC 81.0 34.6 31.6 10.7 10.4 1.6 0.4 0.425 0.343 0.845 13.1 0.224
2 2015-16 NBA Stephen Curry (V) 27.0 GSW 79.0 34.2 30.1 5.4 6.7 2.1 0.2 0.504 0.454 0.908 17.9 0.318
3 2014-15 NBA Stephen Curry (V) 26.0 GSW 80.0 32.7 23.8 4.3 7.7 2.0 0.2 0.487 0.443 0.914 15.7 0.288
4 2013-14 NBA Kevin Durant (V) 25.0 OKC 81.0 38.5 32.0 7.4 5.5 1.3 0.7 0.503 0.391 0.873 19.2 0.295
  • create dictionaries for renaming columns
  • rename all columns with dictionaries

In [30]:
mvp_data_columns_dict = {'Season':'season', 'Lg':'league', 'Player':'player', 'Voting': 'voting', 'Tm': 'team', 'Age': 'age',
                         'G': 'games_played', 'MP': 'avg_minutes', 'PTS': 'avg_points',  'TRB': 'avg_rebounds', 
                         'AST': 'avg_assists', 'STL': 'avg_steals', 'BLK': 'avg_blocks', 'FG%': 'field_goal_pct',
                         '3P%': 'three_pt_pct', 'FT%': 'free_throw_pct', 'WS': 'win_shares',
                         'WS/48': 'win_shares_per_48'
                        }
mvp_finalist_columns_dict = {'Year':'year', 'Player':'player', 'Rank': 'rank', 'Tm': 'team', 'Age': 'age',
                             'First': 'first_place_votes', 'Pts Won': 'points_won', 'Pts Max': 'points_max',
                             'Share':'vote_share', 'G': 'games_played', 'MP': 'avg_minutes', 'PTS': 'avg_points',
                             'TRB': 'avg_rebounds', 'AST': 'avg_assists', 'STL': 'avg_steals', 'BLK': 'avg_blocks',
                             'FG%': 'field_goal_pct',  '3P%': 'three_pt_pct', 'FT%': 'free_throw_pct', 'WS': 'win_shares',
                             'WS/48': 'win_shares_per_48'
                              }

In [31]:
mvp_data_df_clean.rename(index=str,columns=mvp_data_columns_dict, inplace=True)
mvp_data_df_clean.head()


Out[31]:
season league player voting age team games_played avg_minutes avg_points avg_rebounds avg_assists avg_steals avg_blocks field_goal_pct three_pt_pct free_throw_pct win_shares win_shares_per_48
0 2017-18 NBA James Harden (V) 28.0 HOU 72.0 35.4 30.4 5.4 8.8 1.8 0.7 0.449 0.367 0.858 15.4 0.289
1 2016-17 NBA Russell Westbrook (V) 28.0 OKC 81.0 34.6 31.6 10.7 10.4 1.6 0.4 0.425 0.343 0.845 13.1 0.224
2 2015-16 NBA Stephen Curry (V) 27.0 GSW 79.0 34.2 30.1 5.4 6.7 2.1 0.2 0.504 0.454 0.908 17.9 0.318
3 2014-15 NBA Stephen Curry (V) 26.0 GSW 80.0 32.7 23.8 4.3 7.7 2.0 0.2 0.487 0.443 0.914 15.7 0.288
4 2013-14 NBA Kevin Durant (V) 25.0 OKC 81.0 38.5 32.0 7.4 5.5 1.3 0.7 0.503 0.391 0.873 19.2 0.295

In [32]:
mvp_finalist_df_clean.rename(index=str,columns=mvp_finalist_columns_dict, inplace=True)
mvp_finalist_df_clean.head()


Out[32]:
year rank player age team first_place_votes points_won points_max vote_share games_played ... avg_points avg_rebounds avg_assists avg_steals avg_blocks field_goal_pct three_pt_pct free_throw_pct win_shares win_shares_per_48
0 1956 1 Bob Pettit 23 STL 33.0 33.0 80 0.413 72 ... 25.7 16.2 2.6 NaN NaN 0.429 NaN 0.736 13.8 0.236
1 1956 2 Paul Arizin 27 PHW 21.0 21.0 80 0.263 72 ... 24.2 7.5 2.6 NaN NaN 0.448 NaN 0.810 12.2 0.214
2 1956 3 Bob Cousy 27 BOS 11.0 11.0 80 0.138 72 ... 18.8 6.8 8.9 NaN NaN 0.360 NaN 0.844 6.8 0.119
3 1956 4 Mel Hutchins 27 FTW 9.0 9.0 80 0.113 66 ... 12.0 7.5 2.7 NaN NaN 0.425 NaN 0.643 4.4 0.095
4 1956 5T Dolph Schayes 27 SYR 2.0 2.0 80 0.025 72 ... 20.4 12.4 2.8 NaN NaN 0.387 NaN 0.858 11.8 0.225

5 rows × 21 columns

Cleaning Up the Rest of the Data


In [33]:
# convert the data to proper numeric types
mvp_data_df_clean = mvp_data_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_data_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 113 entries, 0 to 112
Data columns (total 18 columns):
season               113 non-null object
league               113 non-null object
player               113 non-null object
voting               73 non-null object
age                  73 non-null float64
team                 73 non-null object
games_played         73 non-null float64
avg_minutes          73 non-null float64
avg_points           73 non-null float64
avg_rebounds         73 non-null float64
avg_assists          73 non-null float64
avg_steals           50 non-null float64
avg_blocks           50 non-null float64
field_goal_pct       73 non-null float64
three_pt_pct         48 non-null float64
free_throw_pct       73 non-null float64
win_shares           73 non-null float64
win_shares_per_48    73 non-null float64
dtypes: float64(13), object(5)
memory usage: 16.8+ KB

In [34]:
# convert the data to proper numeric types
mvp_finalist_df_clean = mvp_finalist_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_finalist_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year                 972 non-null int64
rank                 972 non-null object
player               972 non-null object
age                  972 non-null int64
team                 972 non-null object
first_place_votes    972 non-null float64
points_won           972 non-null float64
points_max           972 non-null int64
vote_share           972 non-null float64
games_played         972 non-null int64
avg_minutes          972 non-null float64
avg_points           972 non-null float64
avg_rebounds         972 non-null float64
avg_assists          972 non-null float64
avg_steals           756 non-null float64
avg_blocks           756 non-null float64
field_goal_pct       972 non-null float64
three_pt_pct         621 non-null float64
free_throw_pct       972 non-null float64
win_shares           972 non-null float64
win_shares_per_48    972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB

We are not done yet. A lot of out numeric columns are missing data because players didn't accumulate any of those stats. For example, the 3 point line is introduced in 1982 and all players in preceding seasons don't have this statistic. Additionally, we want to select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.


In [35]:
# Get the column names for the numeric columns
num_cols_mvp = mvp_data_df_clean.columns[mvp_data_df_clean.dtypes != object]
num_cols_finalist = mvp_finalist_df_clean.columns[mvp_finalist_df_clean.dtypes != object]

# Replace all NaNs with 0
mvp_data_df_clean.loc[:, num_cols_mvp] = mvp_data_df_clean.loc[:, num_cols_mvp].fillna(0)
mvp_finalist_df_clean.loc[:, num_cols_finalist] = mvp_finalist_df_clean.loc[:, num_cols_finalist].fillna(0)

In [36]:
mvp_finalist_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 971
Data columns (total 21 columns):
year                 972 non-null int64
rank                 972 non-null object
player               972 non-null object
age                  972 non-null int64
team                 972 non-null object
first_place_votes    972 non-null float64
points_won           972 non-null float64
points_max           972 non-null int64
vote_share           972 non-null float64
games_played         972 non-null int64
avg_minutes          972 non-null float64
avg_points           972 non-null float64
avg_rebounds         972 non-null float64
avg_assists          972 non-null float64
avg_steals           972 non-null float64
avg_blocks           972 non-null float64
field_goal_pct       972 non-null float64
three_pt_pct         972 non-null float64
free_throw_pct       972 non-null float64
win_shares           972 non-null float64
win_shares_per_48    972 non-null float64
dtypes: float64(14), int64(4), object(3)
memory usage: 167.1+ KB
  • remove ABA winners
  • remove MVP summary table

In [37]:
mvp_data_df_clean = mvp_data_df_clean.loc[mvp_data_df_clean["league"]=="NBA"]
mvp_data_df_clean = mvp_data_df_clean[pd.notnull(mvp_data_df_clean['team'])]

We are finally done cleaning the data and now we can save it to a CSV file.


In [38]:
mvp_data_df_clean = mvp_data_df_clean[pd.notnull(mvp_data_df_clean['player'])]
mvp_data_df_clean.sort_values(['season'], ascending=False, axis=0, inplace=True)


mvp_finalist_df_clean = mvp_finalist_df_clean[pd.notnull(mvp_finalist_df_clean['player'])]
mvp_finalist_df_clean.sort_values(['year', 'points_won'], ascending=False, axis=0, inplace=True)

In [39]:
mvp_data_df_clean.to_csv("../data/clean_data/mvp_data_df_clean.csv", index=False)
print(mvp_data_df_clean.shape)
mvp_data_df_clean.head()


(63, 18)
Out[39]:
season league player voting age team games_played avg_minutes avg_points avg_rebounds avg_assists avg_steals avg_blocks field_goal_pct three_pt_pct free_throw_pct win_shares win_shares_per_48
0 2017-18 NBA James Harden (V) 28.0 HOU 72.0 35.4 30.4 5.4 8.8 1.8 0.7 0.449 0.367 0.858 15.4 0.289
1 2016-17 NBA Russell Westbrook (V) 28.0 OKC 81.0 34.6 31.6 10.7 10.4 1.6 0.4 0.425 0.343 0.845 13.1 0.224
2 2015-16 NBA Stephen Curry (V) 27.0 GSW 79.0 34.2 30.1 5.4 6.7 2.1 0.2 0.504 0.454 0.908 17.9 0.318
3 2014-15 NBA Stephen Curry (V) 26.0 GSW 80.0 32.7 23.8 4.3 7.7 2.0 0.2 0.487 0.443 0.914 15.7 0.288
4 2013-14 NBA Kevin Durant (V) 25.0 OKC 81.0 38.5 32.0 7.4 5.5 1.3 0.7 0.503 0.391 0.873 19.2 0.295

In [40]:
mvp_finalist_df_clean.to_csv("../data/clean_data/mvp_finalist_df_clean.csv", index=False)
print(mvp_finalist_df_clean.shape)
mvp_finalist_df_clean.head()


(972, 21)
Out[40]:
year rank player age team first_place_votes points_won points_max vote_share games_played ... avg_points avg_rebounds avg_assists avg_steals avg_blocks field_goal_pct three_pt_pct free_throw_pct win_shares win_shares_per_48
959 2018 1 James Harden 28 HOU 86.0 965.0 1010 0.955 72 ... 30.4 5.4 8.8 1.8 0.7 0.449 0.367 0.858 15.4 0.289
960 2018 2 LeBron James 33 CLE 15.0 738.0 1010 0.731 82 ... 27.5 8.6 9.1 1.4 0.9 0.542 0.367 0.731 14.0 0.221
961 2018 3 Anthony Davis 24 NOP 0.0 445.0 1010 0.441 75 ... 28.1 11.1 2.3 1.5 2.6 0.534 0.340 0.828 13.7 0.241
962 2018 4 Damian Lillard 27 POR 0.0 207.0 1010 0.205 73 ... 26.9 4.5 6.6 1.1 0.4 0.439 0.361 0.916 12.6 0.227
963 2018 5 Russell Westbrook 29 OKC 0.0 76.0 1010 0.075 80 ... 25.4 10.1 10.3 1.8 0.3 0.449 0.298 0.737 10.1 0.166

5 rows × 21 columns

Review

  • In this tutorial, we learned how to
    • examine the html structure of webpage
    • use functions based on the Beautiful Soup module to parse tables on multiple webpage into .csv
    • analyzed a .csv file using the Pandas module

Download this notebook or see a static view here


In [41]:
import sys
import bs4
print(f'last updated: {datetime.now().strftime("%Y-%m-%d %H:%M")} \n')
print(f"System and module version information: \n")
print(f"Python version: {sys.version_info}")
print(f"urllib.request version: {urllib.request.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"Beautiful Soup version: {bs4.__version__}")


last updated: 2019-05-27 22:35 

System and module version information: 

Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
urllib.request version: 3.7
pandas version: 0.23.4
Beautiful Soup version: 4.6.3