NBA MVP Comparisons

Part 3

  • 6 th March 2019

In the last two blog posts we:

  1. Gathered the relevant data for all NBA MVPs
  2. Evaluated several machine learning models to select the MVP and chose the Latent Discriminant Analysis model as the best predictor.

The goal of this blog post is to gather the data necessary to predict the 2018-2019 MVP

We will look select from the MVP finalists in 2017-2018

  • MVP will likely be a finalist from the previous year
    • Is this a valid assumption?

Update

On May 17th, the NBA announced the 2018-2019 MVP finalists:

Outline

  1. Import modules
  2. Examine html structure of webpage
  3. Use a function with Beautiful Soup to parse webpages into .csv
  4. Analyze .csv of webpage as a Pandas DataFrame
  5. Process the data

In [1]:
import os
import urllib
import webbrowser
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup

In [2]:
url_2018_mvp_finalist = "https://www.basketball-reference.com/awards/awards_2018.html"
  • examine webpage

In [3]:
webbrowser.open_new_tab(url_2018_mvp_finalist)


Out[3]:
True
  • get the html
  • create the BeautifulSoup object

In [4]:
html_finalist = urllib.request.urlopen(url_2018_mvp_finalist)
soup_finalist = BeautifulSoup(html_finalist, "lxml")
  • Extract the necessary values for the column headers from the table and store them as a list

In [5]:
column_headers_finalist = [th.get_text() for th in soup_finalist.find_all('th', limit=30)]
column_headers_finalist = [header for header in column_headers_finalist if len(header) != 0]
column_headers_finalist = column_headers_finalist[1:]

In [6]:
print(f"raw column names in finalist table: {column_headers_finalist}")
column_headers_finalist = [header for header in column_headers_finalist if header not in ('Shooting', 'Advanced', 'Per Game')][:-4]
print(f"formatted column names in finalist table: {column_headers_finalist}")
print(f"{len(column_headers_finalist)} columns in finalist table")


raw column names in finalist table: ['Per Game', 'Shooting', 'Advanced', 'Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48', '1', '2', '3', '4']
formatted column names in finalist table: ['Rank', 'Player', 'Age', 'Tm', 'First', 'Pts Won', 'Pts Max', 'Share', 'G', 'MP', 'PTS', 'TRB', 'AST', 'STL', 'BLK', 'FG%', '3P%', 'FT%', 'WS', 'WS/48']
20 columns in finalist table
  • The data is found within the tr elements of the first tbody element
  • We want the elements from the 3rd row and on
  • take a look at the last row to examine data quality

In [7]:
table_rows_finalist = soup_finalist.find("tbody").find_all("tr")
print(f"the subset soup object is of type: {type(table_rows_finalist)}")
table_rows_finalist[-1]


the subset soup object is of type: <class 'bs4.element.ResultSet'>
Out[7]:
<tr><th class="right " data-stat="rank" scope="row">13</th><td class="left " csk="Oladipo,Victor" data-append-csv="oladivi01" data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></td><td class="right " data-stat="age">25</td><td class="left " data-stat="team_id"><a href="/teams/IND/2018.html">IND</a></td><td class="right " data-stat="votes_first">0.0</td><td class="right " data-stat="points_won">2.0</td><td class="right " data-stat="points_max">1010</td><td class="right " data-stat="award_share">0.002</td><td class="right " data-stat="g">75</td><td class="right " data-stat="mp_per_g">34.0</td><td class="right " data-stat="pts_per_g">23.1</td><td class="right " data-stat="trb_per_g">5.2</td><td class="right " data-stat="ast_per_g">4.3</td><td class="right " data-stat="stl_per_g">2.4</td><td class="right " data-stat="blk_per_g">0.8</td><td class="right " data-stat="fg_pct">.477</td><td class="right " data-stat="fg3_pct">.371</td><td class="right " data-stat="ft_pct">.799</td><td class="right " data-stat="ws">8.2</td><td class="right " data-stat="ws_per_48">.155</td></tr>
  • to get a player's 2018-2019 statistical information, we can extract the player's web page from the BeautifulSoup object

In [8]:
player_href = [href for href in table_rows_finalist[-1].find_all("a")][0]["href"]
player_href


Out[8]:
'/players/o/oladivi01.html'
  • however, the player link is just a stub url and needs a base url for basketball-reference.com appended to it to access the player's web page

In [9]:
base_basketball_ref_url = "https://www.basketball-reference.com"

In [10]:
player_link = base_basketball_ref_url + player_href
player_link


Out[10]:
'https://www.basketball-reference.com/players/o/oladivi01.html'

In [11]:
webbrowser.open_new_tab(player_link)


Out[11]:
True
  • create a function to extract MVP finalist data

In [12]:
def extract_finalist_data(table_rows):
    """
    Extract and return the the desired information from the td elements within the table rows.
    :param: table_rows:  list of soup `tr` elements
    :return: list of player-year MVP finalist observations
    """
    
    base_basketball_ref_url = "https://www.basketball-reference.com"
    
    # create the empty list to store the player data
    player_data = []
    
    for row in table_rows:  # for each row do the following

        # Get the text for each table data (td) element in the row
        player_rank = [td.get_text() for td in row.find_all("th")]
        player_list = [td.get_text() for td in row.find_all("td")]
        player_href = [href for href in row.find_all("a")][0]["href"]
        player_link = [base_basketball_ref_url + player_href]        
        
        # there are some empty table rows, which are the repeated column headers in the table
        # we skip over those rows and and continue the for loop
        if not player_list:
            continue
            
        # Now append the individual player data to list of all player data
        
        player_info = player_rank+player_list+player_link
        player_data.append(player_info)
        
    return player_data
  • extract the data we want
  • and then store it in a DataFrame

In [13]:
extracted_finalist_2018_data = extract_finalist_data(table_rows_finalist)

mvp_finalist_2018_data  = pd.DataFrame(extracted_finalist_2018_data)
mvp_finalist_2018_data.columns = column_headers_finalist+["player_link"]
print(f"the MVP finalist dataframe has {mvp_finalist_2018_data.shape[0]} rows (player-year observations) and {mvp_finalist_2018_data.shape[1]} columns")
mvp_finalist_2018_data.tail(6)


the MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
Out[13]:
Rank Player Age Tm First Pts Won Pts Max Share G MP ... TRB AST STL BLK FG% 3P% FT% WS WS/48 player_link
7 8 DeMar DeRozan 28 TOR 0.0 32.0 1010 0.032 80 33.9 ... 3.9 5.2 1.1 0.3 .456 .310 .825 9.6 .170 https://www.basketball-reference.com/players/d...
8 9 LaMarcus Aldridge 32 SAS 0.0 6.0 1010 0.006 75 33.5 ... 8.5 2.0 0.6 1.2 .510 .293 .837 10.9 .209 https://www.basketball-reference.com/players/a...
9 10T Jimmy Butler 28 MIN 0.0 5.0 1010 0.005 59 36.7 ... 5.3 4.9 2.0 0.4 .474 .350 .854 8.9 .198 https://www.basketball-reference.com/players/b...
10 10T Stephen Curry 29 GSW 0.0 5.0 1010 0.005 51 32.0 ... 5.1 6.1 1.6 0.2 .495 .423 .921 9.1 .267 https://www.basketball-reference.com/players/c...
11 12 Joel Embiid 23 PHI 0.0 4.0 1010 0.004 63 30.3 ... 11.0 3.2 0.6 1.8 .483 .308 .769 6.2 .155 https://www.basketball-reference.com/players/e...
12 13 Victor Oladipo 25 IND 0.0 2.0 1010 0.002 75 34.0 ... 5.2 4.3 2.4 0.8 .477 .371 .799 8.2 .155 https://www.basketball-reference.com/players/o...

6 rows × 21 columns

  • Extract the necessary values for the column headers from the table and store them as a list
  • create the BeautifulSoup object

In [14]:
player_test_table = "https://www.basketball-reference.com/players/h/hardeja01.html#per_game::none"
player_profile_request = urllib.request.urlopen(player_test_table)
player_profile_soup = BeautifulSoup(player_profile_request, "lxml")

In [15]:
column_headers_player = [th.get_text() for th in player_profile_soup.find_all('th', limit=30)]
column_headers_player = [header for header in column_headers_player if len(header) != 0]
column_headers_player = column_headers_player[1:]
print(f"the columns in the career data tables are: \n {column_headers_player} \n")
print(f"there are {len(column_headers_player)} total columns in the career table")


the columns in the career data tables are: 
 ['Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'] 

there are 29 total columns in the career table

In [16]:
player_name = player_profile_soup.find_all("h1")[0].text
player_years_active = [th.get_text() for th in player_profile_soup.find_all("tbody")[0].find_all("th")]
player_career_data = [td.get_text() for td in player_profile_soup.find_all("td")]
  • let's examine the data to make sure it conforms with our expectations

In [17]:
print(f"player name: {player_name}")
print(f"player years active: {player_years_active}")
print(f"player career data: {player_career_data}")


player name: James Harden
player years active: ['2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16', '2016-17', '2017-18', '2018-19']
player career data: ['20', 'OKC', 'NBA', 'SG', '76', '0', '22.9', '3.1', '7.6', '.403', '1.2', '3.3', '.375', '1.8', '4.3', '.424', '.484', '2.6', '3.2', '.808', '0.6', '2.6', '3.2', '1.8', '1.1', '0.3', '1.4', '2.6', '9.9', '21', 'OKC', 'NBA', 'SG', '82', '5', '26.7', '3.6', '8.3', '.436', '1.4', '4.0', '.349', '2.3', '4.4', '.514', '.518', '3.5', '4.2', '.843', '0.5', '2.6', '3.1', '2.1', '1.1', '0.3', '1.3', '2.5', '12.2', '22', 'OKC', 'NBA', 'SG', '62', '2', '31.4', '5.0', '10.1', '.491', '1.8', '4.7', '.390', '3.1', '5.4', '.579', '.582', '5.0', '6.0', '.846', '0.5', '3.6', '4.1', '3.7', '1.0', '0.2', '2.2', '2.4', '16.8', '23', 'HOU', 'NBA', 'SG', '78', '78', '38.3', '7.5', '17.1', '.438', '2.3', '6.2', '.368', '5.2', '10.9', '.477', '.504', '8.6', '10.2', '.851', '0.8', '4.1', '4.9', '5.8', '1.8', '0.5', '3.8', '2.3', '25.9', '24', 'HOU', 'NBA', 'SG', '73', '73', '38.0', '7.5', '16.5', '.456', '2.4', '6.6', '.366', '5.1', '9.9', '.515', '.529', '7.9', '9.1', '.866', '0.8', '3.9', '4.7', '6.1', '1.6', '0.4', '3.6', '2.4', '25.4', '25', 'HOU', 'NBA', 'SG', '81', '81', '36.8', '8.0', '18.1', '.440', '2.6', '6.9', '.375', '5.4', '11.3', '.480', '.511', '8.8', '10.2', '.868', '0.9', '4.7', '5.7', '7.0', '1.9', '0.7', '4.0', '2.6', '27.4', '26', 'HOU', 'NBA', 'SG', '82', '82', '38.1', '8.7', '19.7', '.439', '2.9', '8.0', '.359', '5.8', '11.7', '.494', '.512', '8.8', '10.2', '.860', '0.8', '5.3', '6.1', '7.5', '1.7', '0.6', '4.6', '2.8', '29.0', '27', 'HOU', 'NBA', 'PG', '81', '81', '36.4', '8.3', '18.9', '.440', '3.2', '9.3', '.347', '5.1', '9.6', '.530', '.525', '9.2', '10.9', '.847', '1.2', '7.0', '8.1', '11.2', '1.5', '0.5', '5.7', '2.7', '29.1', '28', 'HOU', 'NBA', 'SG', '72', '72', '35.4', '9.0', '20.1', '.449', '3.7', '10.0', '.367', '5.4', '10.1', '.531', '.541', '8.7', '10.1', '.858', '0.6', '4.8', '5.4', '8.8', '1.8', '0.7', '4.4', '2.3', '30.4', '29', 'HOU', 'NBA', 'PG', '78', '78', '36.8', '10.8', '24.5', '.442', '4.8', '13.2', '.368', '6.0', '11.3', '.528', '.541', '9.7', '11.0', '.879', '0.8', '5.8', '6.6', '7.5', '2.0', '0.7', '5.0', '3.1', '36.1', '', '', 'NBA', '', '765', '552', '34.1', '7.2', '16.2', '.443', '2.6', '7.3', '.365', '4.5', '9.0', '.506', '.525', '7.3', '8.5', '.857', '0.8', '4.5', '5.2', '6.2', '1.6', '0.5', '3.6', '2.6', '24.3', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'HOU', 'NBA', '', '545', '545', '37.1', '8.5', '19.3', '.443', '3.1', '8.6', '.364', '5.4', '10.7', '.506', '.524', '8.8', '10.2', '.861', '0.8', '5.1', '6.0', '7.7', '1.8', '0.6', '4.4', '2.6', '29.0', '', 'OKC', 'NBA', '', '220', '7', '26.7', '3.8', '8.6', '.444', '1.5', '3.9', '.370', '2.4', '4.7', '.506', '.529', '3.6', '4.3', '.835', '0.5', '2.9', '3.4', '2.5', '1.1', '0.3', '1.6', '2.5', '12.7']
  • the player career data is a list that includes every year the player was active
    • this list is not broken up by year
  • to create yearly player data from the player career data list, we need a function to seperate the list
  • we can use the following function is from StackOverflow

In [18]:
def slice_per(source, step):
    return [source[i::step] for i in range(step)]
  • every 29 items in the list represents a year of player data
    • we will split the player_career_data object every 29 steps (using the language of the slice_per function) to create a year of data
  • then we add the column headers to the dataframe as a sanity check that the information was extracted correctly

In [19]:
sliced_player_data = slice_per(player_career_data, 29)
print(type(sliced_player_data))
player_career_df = pd.DataFrame(sliced_player_data).transpose()
player_career_df.columns = column_headers_player
player_career_df


<class 'list'>
Out[19]:
Age Tm Lg Pos G GS MP FG FGA FG% ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 20 OKC NBA SG 76 0 22.9 3.1 7.6 .403 ... .808 0.6 2.6 3.2 1.8 1.1 0.3 1.4 2.6 9.9
1 21 OKC NBA SG 82 5 26.7 3.6 8.3 .436 ... .843 0.5 2.6 3.1 2.1 1.1 0.3 1.3 2.5 12.2
2 22 OKC NBA SG 62 2 31.4 5.0 10.1 .491 ... .846 0.5 3.6 4.1 3.7 1.0 0.2 2.2 2.4 16.8
3 23 HOU NBA SG 78 78 38.3 7.5 17.1 .438 ... .851 0.8 4.1 4.9 5.8 1.8 0.5 3.8 2.3 25.9
4 24 HOU NBA SG 73 73 38.0 7.5 16.5 .456 ... .866 0.8 3.9 4.7 6.1 1.6 0.4 3.6 2.4 25.4
5 25 HOU NBA SG 81 81 36.8 8.0 18.1 .440 ... .868 0.9 4.7 5.7 7.0 1.9 0.7 4.0 2.6 27.4
6 26 HOU NBA SG 82 82 38.1 8.7 19.7 .439 ... .860 0.8 5.3 6.1 7.5 1.7 0.6 4.6 2.8 29.0
7 27 HOU NBA PG 81 81 36.4 8.3 18.9 .440 ... .847 1.2 7.0 8.1 11.2 1.5 0.5 5.7 2.7 29.1
8 28 HOU NBA SG 72 72 35.4 9.0 20.1 .449 ... .858 0.6 4.8 5.4 8.8 1.8 0.7 4.4 2.3 30.4
9 29 HOU NBA PG 78 78 36.8 10.8 24.5 .442 ... .879 0.8 5.8 6.6 7.5 2.0 0.7 5.0 3.1 36.1
10 NBA 765 552 34.1 7.2 16.2 .443 ... .857 0.8 4.5 5.2 6.2 1.6 0.5 3.6 2.6 24.3
11 ...
12 HOU NBA 545 545 37.1 8.5 19.3 .443 ... .861 0.8 5.1 6.0 7.7 1.8 0.6 4.4 2.6 29.0
13 OKC NBA 220 7 26.7 3.8 8.6 .444 ... .835 0.5 2.9 3.4 2.5 1.1 0.3 1.6 2.5 12.7

14 rows × 29 columns

  • now we need a function to extract the player career data using the link to the player's profile as the input

In [20]:
def extract_career_data(player_link):
    """
    Extract and return the the desired information from the td elements within the table rows.
    :param: table_rows:  list of soup `tr` elements
    :return: list of player-year MVP finalist observations
    """
        
    player_profile_request = urllib.request.urlopen(player_link)

    # create the BeautifulSoup object
    player_profile_soup = BeautifulSoup(player_profile_request, "lxml") 
    extracted_player_data = [td for td in player_profile_soup.find_all("tbody")]

    player_name = player_profile_soup.find_all("h1")[0].text
    player_years_active = [th.get_text() for th in player_profile_soup.find_all("tbody")[0].find_all("th")]
    player_career_data = [td.get_text() for td in extracted_player_data[0].find_all("td")]
    
    # slice list of list (player_data into yearly lists)
    sliced_player_data = slice_per(player_career_data, 29)
    
    player_career_df = pd.DataFrame(sliced_player_data).transpose()
    player_career_df.insert(0, "Player", player_name)
    player_career_df.insert(1, "Year", player_years_active)
        
    return player_career_df
  • use list comprehension to extract career data for each player link in mvp_finalist_2018_data

In [21]:
all_player_career_data = [extract_career_data(player_link) for player_link in mvp_finalist_2018_data["player_link"]]
  • use Pandas concat function to store all finalist data in one DataFrame
  • store all finalist data in one DataFrame

In [22]:
mvp_finalist_2019_career_data = pd.concat(all_player_career_data, axis=0)
print(f"the MVP finalist dataframe has {mvp_finalist_2018_data.shape[0]} rows (player-year observations) and {mvp_finalist_2018_data.shape[1]} columns")

mvp_finalist_2019_career_data.head()


the MVP finalist dataframe has 13 rows (player-year observations) and 21 columns
Out[22]:
Player Year 0 1 2 3 4 5 6 7 ... 19 20 21 22 23 24 25 26 27 28
0 James Harden 2009-10 20 OKC NBA SG 76 0 22.9 3.1 ... .808 0.6 2.6 3.2 1.8 1.1 0.3 1.4 2.6 9.9
1 James Harden 2010-11 21 OKC NBA SG 82 5 26.7 3.6 ... .843 0.5 2.6 3.1 2.1 1.1 0.3 1.3 2.5 12.2
2 James Harden 2011-12 22 OKC NBA SG 62 2 31.4 5.0 ... .846 0.5 3.6 4.1 3.7 1.0 0.2 2.2 2.4 16.8
3 James Harden 2012-13 23 HOU NBA SG 78 78 38.3 7.5 ... .851 0.8 4.1 4.9 5.8 1.8 0.5 3.8 2.3 25.9
4 James Harden 2013-14 24 HOU NBA SG 73 73 38.0 7.5 ... .866 0.8 3.9 4.7 6.1 1.6 0.4 3.6 2.4 25.4

5 rows × 31 columns

  • rename columns by concatentating column_headers_player with the two new columns we added in our extract_career_data function

In [23]:
mvp_finalist_2019_career_data.columns = ["Player"] + ["Year"] + column_headers_player
mvp_finalist_2019_career_data.columns


Out[23]:
Index(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG',
       'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
       'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
       'PTS'],
      dtype='object')
  • examine the mvp_finalist_2019_career_data data

In [24]:
mvp_finalist_2019_career_data.tail()


Out[24]:
Player Year Age Tm Lg Pos G GS MP FG ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
1 Victor Oladipo 2014-15 22 ORL NBA SG 72 71 35.7 6.6 ... .819 0.7 3.5 4.2 4.1 1.7 0.3 2.8 2.6 17.9
2 Victor Oladipo 2015-16 23 ORL NBA SG 72 52 33.0 5.9 ... .830 0.7 4.1 4.8 3.9 1.6 0.8 2.1 2.4 16.0
3 Victor Oladipo 2016-17 24 OKC NBA SG 67 67 33.2 6.1 ... .753 0.6 3.8 4.3 2.6 1.2 0.3 1.8 2.3 15.9
4 Victor Oladipo 2017-18 25 IND NBA SG 75 75 34.0 8.5 ... .799 0.6 4.6 5.2 4.3 2.4 0.8 2.9 2.3 23.1
5 Victor Oladipo 2018-19 26 IND NBA SG 36 36 31.9 6.9 ... .730 0.6 5.0 5.6 5.2 1.7 0.3 2.3 2.0 18.8

5 rows × 31 columns

Now that we fixed up the necessary columns, let's write out the raw data to a CSV file.


In [25]:
os.makedirs('../data/raw_data', exist_ok=True)
os.makedirs('../data/clean_data', exist_ok=True)
  • Write out the career data for 2018 MVP finalists to the raw_data folder in the data folder

In [26]:
mvp_finalist_2019_career_data.to_csv("../data/raw_data/mvp_finalist_2019_career_data.csv", index=False)

Cleaning the Data

  • Now that we have the raw MVP data, we need to clean it up a bit for data exploration

In [27]:
mvp_finalist_2019_df_clean = pd.read_csv("../data/raw_data/mvp_finalist_2019_career_data.csv", encoding = "Latin-1")
mvp_finalist_2019_df_clean.head()


Out[27]:
Player Year Age Tm Lg Pos G GS MP FG ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 James Harden 2009-10 20 OKC NBA SG 76 0 22.9 3.1 ... 0.808 0.6 2.6 3.2 1.8 1.1 0.3 1.4 2.6 9.9
1 James Harden 2010-11 21 OKC NBA SG 82 5 26.7 3.6 ... 0.843 0.5 2.6 3.1 2.1 1.1 0.3 1.3 2.5 12.2
2 James Harden 2011-12 22 OKC NBA SG 62 2 31.4 5.0 ... 0.846 0.5 3.6 4.1 3.7 1.0 0.2 2.2 2.4 16.8
3 James Harden 2012-13 23 HOU NBA SG 78 78 38.3 7.5 ... 0.851 0.8 4.1 4.9 5.8 1.8 0.5 3.8 2.3 25.9
4 James Harden 2013-14 24 HOU NBA SG 73 73 38.0 7.5 ... 0.866 0.8 3.9 4.7 6.1 1.6 0.4 3.6 2.4 25.4

5 rows × 31 columns


In [28]:
mvp_finalist_2019_df_clean.columns


Out[28]:
Index(['Player', 'Year', 'Age', 'Tm', 'Lg', 'Pos', 'G', 'GS', 'MP', 'FG',
       'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT',
       'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF',
       'PTS'],
      dtype='object')
  • create dictionaries for renaming columns
  • rename all columns with dictionaries

In [29]:
mvp_finalist_columns_dict = {'Player':'player', 'Year':'year',  'Age': 'age', 'Tm': 'team', 'Lg': 'league',
                             'POS': 'position', 'G': 'games_played', 'GS': 'games_started', 'MP': 'avg_minutes',
                             'FG': 'field_goals_made_per_game', 'FGA': 'field_goals_attempted_per_game',
                             'FG%': 'field_goal_pct', '3P': 'three_pt_fg_made_per_game', '3PA': 'three_pt_fg_attempted_per_game',
                             '3P%': 'three_pt_pct', '2P': 'two_pt_fg_made_per_game', '2PA': 'two_pt_fg_attempted_per_game',
                             '2P%': 'two_pt_fg_pct', 'eFG%': 'effective_fg_pct',  
                             'FT': 'free_throws_made_per_game', 'FTA': 'free_throws_attempted_per_game',
                             'FT%': 'free_throw_pct', 'ORB': 'offensive_rebounds_per_game', 'DRB': 'defensive_rebounds_per_game',
                             'TRB': 'total_rebounds_per_game', 'AST': 'assists_per_game', 'STL': 'steals_per_game', 
                             'BLK': 'blocks_per_game', 'TOV': 'turnovers_per_game',  'PF': 'fouls_committed_per_game', 'PTS': 'points_per_game'
                              }

In [30]:
mvp_finalist_2019_df_clean.rename(index=str,columns=mvp_finalist_columns_dict, inplace=True)
mvp_finalist_2019_df_clean.head()


Out[30]:
player year age team league Pos games_played games_started avg_minutes field_goals_made_per_game ... free_throw_pct offensive_rebounds_per_game defensive_rebounds_per_game total_rebounds_per_game assists_per_game steals_per_game blocks_per_game turnovers_per_game fouls_committed_per_game points_per_game
0 James Harden 2009-10 20 OKC NBA SG 76 0 22.9 3.1 ... 0.808 0.6 2.6 3.2 1.8 1.1 0.3 1.4 2.6 9.9
1 James Harden 2010-11 21 OKC NBA SG 82 5 26.7 3.6 ... 0.843 0.5 2.6 3.1 2.1 1.1 0.3 1.3 2.5 12.2
2 James Harden 2011-12 22 OKC NBA SG 62 2 31.4 5.0 ... 0.846 0.5 3.6 4.1 3.7 1.0 0.2 2.2 2.4 16.8
3 James Harden 2012-13 23 HOU NBA SG 78 78 38.3 7.5 ... 0.851 0.8 4.1 4.9 5.8 1.8 0.5 3.8 2.3 25.9
4 James Harden 2013-14 24 HOU NBA SG 73 73 38.0 7.5 ... 0.866 0.8 3.9 4.7 6.1 1.6 0.4 3.6 2.4 25.4

5 rows × 31 columns


In [31]:
mvp_finalist_2019_df_clean.columns


Out[31]:
Index(['player', 'year', 'age', 'team', 'league', 'Pos', 'games_played',
       'games_started', 'avg_minutes', 'field_goals_made_per_game',
       'field_goals_attempted_per_game', 'field_goal_pct',
       'three_pt_fg_made_per_game', 'three_pt_fg_attempted_per_game',
       'three_pt_pct', 'two_pt_fg_made_per_game',
       'two_pt_fg_attempted_per_game', 'two_pt_fg_pct', 'effective_fg_pct',
       'free_throws_made_per_game', 'free_throws_attempted_per_game',
       'free_throw_pct', 'offensive_rebounds_per_game',
       'defensive_rebounds_per_game', 'total_rebounds_per_game',
       'assists_per_game', 'steals_per_game', 'blocks_per_game',
       'turnovers_per_game', 'fouls_committed_per_game', 'points_per_game'],
      dtype='object')

Cleaning Up the Rest of the Data

  • convert the data to proper numeric types

In [32]:
mvp_finalist_2019_df_clean = mvp_finalist_2019_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_finalist_2019_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 121 entries, 0 to 120
Data columns (total 31 columns):
player                            121 non-null object
year                              121 non-null object
age                               121 non-null int64
team                              121 non-null object
league                            121 non-null object
Pos                               121 non-null object
games_played                      121 non-null int64
games_started                     121 non-null int64
avg_minutes                       121 non-null float64
field_goals_made_per_game         121 non-null float64
field_goals_attempted_per_game    121 non-null float64
field_goal_pct                    121 non-null float64
three_pt_fg_made_per_game         121 non-null float64
three_pt_fg_attempted_per_game    121 non-null float64
three_pt_pct                      121 non-null float64
two_pt_fg_made_per_game           121 non-null float64
two_pt_fg_attempted_per_game      121 non-null float64
two_pt_fg_pct                     121 non-null float64
effective_fg_pct                  121 non-null float64
free_throws_made_per_game         121 non-null float64
free_throws_attempted_per_game    121 non-null float64
free_throw_pct                    121 non-null float64
offensive_rebounds_per_game       121 non-null float64
defensive_rebounds_per_game       121 non-null float64
total_rebounds_per_game           121 non-null float64
assists_per_game                  121 non-null float64
steals_per_game                   121 non-null float64
blocks_per_game                   121 non-null float64
turnovers_per_game                121 non-null float64
fouls_committed_per_game          121 non-null float64
points_per_game                   121 non-null float64
dtypes: float64(23), int64(3), object(5)
memory usage: 30.2+ KB
  • Get the column names for the numeric columns
  • Replace all NaNs with 0

In [33]:
num_cols_finalist = mvp_finalist_2019_df_clean.columns[mvp_finalist_2019_df_clean.dtypes != object]
mvp_finalist_2019_df_clean.loc[:, num_cols_finalist] = mvp_finalist_2019_df_clean.loc[:, num_cols_finalist].fillna(0)
mvp_finalist_2019_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 121 entries, 0 to 120
Data columns (total 31 columns):
player                            121 non-null object
year                              121 non-null object
age                               121 non-null int64
team                              121 non-null object
league                            121 non-null object
Pos                               121 non-null object
games_played                      121 non-null int64
games_started                     121 non-null int64
avg_minutes                       121 non-null float64
field_goals_made_per_game         121 non-null float64
field_goals_attempted_per_game    121 non-null float64
field_goal_pct                    121 non-null float64
three_pt_fg_made_per_game         121 non-null float64
three_pt_fg_attempted_per_game    121 non-null float64
three_pt_pct                      121 non-null float64
two_pt_fg_made_per_game           121 non-null float64
two_pt_fg_attempted_per_game      121 non-null float64
two_pt_fg_pct                     121 non-null float64
effective_fg_pct                  121 non-null float64
free_throws_made_per_game         121 non-null float64
free_throws_attempted_per_game    121 non-null float64
free_throw_pct                    121 non-null float64
offensive_rebounds_per_game       121 non-null float64
defensive_rebounds_per_game       121 non-null float64
total_rebounds_per_game           121 non-null float64
assists_per_game                  121 non-null float64
steals_per_game                   121 non-null float64
blocks_per_game                   121 non-null float64
turnovers_per_game                121 non-null float64
fouls_committed_per_game          121 non-null float64
points_per_game                   121 non-null float64
dtypes: float64(23), int64(3), object(5)
memory usage: 30.2+ KB
  • We are finally done cleaning the data and now we can save it to a CSV file.

In [34]:
mvp_finalist_2019_df_clean = mvp_finalist_2019_df_clean[pd.notnull(mvp_finalist_2019_df_clean['player'])]
mvp_finalist_2019_df_clean.sort_values(['year'], ascending=False, axis=0, inplace=True)

In [35]:
mvp_finalist_2019_df_clean.to_csv("../data/clean_data/mvp_finalist_2019_df_clean.csv", index=False)
print(f" the dimensions for the final data are: {mvp_finalist_2019_df_clean.shape} (rows, columns)")
mvp_finalist_2019_df_clean.head()


 the dimensions for the final data are: (121, 31) (rows, columns)
Out[35]:
player year age team league Pos games_played games_started avg_minutes field_goals_made_per_game ... free_throw_pct offensive_rebounds_per_game defensive_rebounds_per_game total_rebounds_per_game assists_per_game steals_per_game blocks_per_game turnovers_per_game fouls_committed_per_game points_per_game
120 Victor Oladipo 2018-19 26 IND NBA SG 36 36 31.9 6.9 ... 0.730 0.6 5.0 5.6 5.2 1.7 0.3 2.3 2.0 18.8
25 LeBron James 2018-19 34 LAL NBA SF 55 55 35.2 10.1 ... 0.665 1.0 7.4 8.5 8.3 1.3 0.6 3.6 1.7 27.4
91 LaMarcus Aldridge 2018-19 33 SAS NBA C 81 81 33.2 8.4 ... 0.847 3.1 6.1 9.2 2.4 0.5 1.3 1.8 2.2 21.3
39 Damian Lillard 2018-19 28 POR NBA PG 80 80 35.5 8.5 ... 0.912 0.9 3.8 4.6 6.9 1.1 0.4 2.7 1.9 25.8
32 Anthony Davis 2018-19 25 NOP NBA C 56 56 33.0 9.5 ... 0.794 3.1 8.9 12.0 3.9 1.6 2.4 2.0 2.4 25.9

5 rows × 31 columns

Review

  • In this tutorial, we learned how to:
    • examine the html structure of webpage
    • use functions based on the Beautiful Soup module to parse tables on multiple webpage into .csv
    • analyzed a .csv file using the Pandas module

Download this notebook or see a static view here


In [36]:
import sys
import bs4
print(f'last updated: {datetime.now().strftime("%Y-%m-%d %H:%M")} \n')
print(f"System and module version information: \n")
print(f"Python version: {sys.version_info}")
print(f"urllib.request version: {urllib.request.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"Beautiful Soup version: {bs4.__version__}")


last updated: 2019-07-01 09:13 

System and module version information: 

Python version: sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
urllib.request version: 3.7
pandas version: 0.23.4
Beautiful Soup version: 4.6.3