It's Christmas, and that means a full slate of NBA games. This time of year also provokes some great NBA discussions, and the best NBA discussions are comparative. Arguments like: would Jordan's '96 Bulls beat the '17 Warriors?
Another comparison that is sure to create great debate: who had the best (or worst) Most Valuable Player (MVP) season in history? This question has a strong empirical dimension, in that we can observe quantifiable aspects across seasons, and we can leverage programming to both gather and analyze the data.
The MVP data we will gather comes from basketball-reference.com. Basketball-reference is part of the Sports-Reference family of sites, "a group of sites providing both basic and sabermetric statistics and resources for sports fans everywhere. [Sports-Reference aims] to be the easiest-to-use, fastest, most complete sources for sports statistics anywhere" (sports-reference.com).
In this post, we will gather and pre-process all the data for a multi-part series to determine who had the best (and worst) MVP seasons in history.
Let the shade begin
In [1]:
import os
import urllib.request
import webbrowser
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
In [2]:
url = 'https://www.basketball-reference.com/awards/mvp.html'
webbrowser.open_new_tab(url)
Out[2]:
In [3]:
# get the html
html = urllib.request.urlopen(url)
# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
In [4]:
# Extract the necessary values for the column headers from the table
# and store them as a list
column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
# drop the empty strings
column_headers = [s for s in column_headers if len(s) != 0]
# drop the first entry, which is an over-header rather than a column name
column_headers = column_headers[1:]
print(column_headers)
# drop the 'Shooting' and 'Advanced' over-headers, plus the trailing
# headers that don't belong to the MVP table
column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced')][:-7]
print(column_headers)
In [5]:
len(column_headers)
Out[5]:
In [6]:
# The data is found within the table rows
# We want the elements from the 3rd row and on
table_rows = soup.find_all("tr")[2:]
print(type(table_rows))
table_rows[0] # take a look at the first row
Out[6]:
The player data we want is found within the td (or table data) elements of each row in table_rows. Below we create a function to extract that data. The comments should walk you through what each part of the function does.
In [7]:
def extract_mvp_data(table_rows):
    """
    Extract and return the desired information from the td elements
    within the table rows.

    :param table_rows: list of soup `tr` elements
    :return: list of player-year MVP observations
    """
    # create the empty list to store the player data
    player_data = []

    for row in table_rows:  # for each row do the following
        # Get the text for the table header (th) element, which holds the
        # season, and for each table data (td) element in the row
        player_year = [th.get_text() for th in row.find_all("th")]
        player_list = [td.get_text() for td in row.find_all("td")]

        # there are some empty table rows, which are the repeated
        # column headers in the table
        # we skip over those rows and continue the for loop
        if not player_list:
            continue

        # Now append the data to the list of data
        player_info = player_year + player_list
        player_data.append(player_info)

    return player_data
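To see how the function handles the rows, here is a minimal sketch run on a handcrafted table (the markup is invented for illustration; the middle row mimics the repeated-header rows found in the real table):

demo_html = """
<table>
  <tr><th>1</th><td>Player A</td><td>27</td></tr>
  <tr><th>Rank</th></tr>
  <tr><th>2</th><td>Player B</td><td>30</td></tr>
</table>
"""
# parse the fragment and grab its rows, just like we did for the real page
demo_rows = BeautifulSoup(demo_html, "lxml").find_all("tr")
# the middle row has no td elements, so the function skips it
print(extract_mvp_data(demo_rows))
# expected: [['1', 'Player A', '27'], ['2', 'Player B', '30']]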
Create a DataFrame with the MVP data
In [8]:
# extract the data we want
mvp_data = extract_mvp_data(table_rows)
# and then store it in a DataFrame
mvp_data_df = pd.DataFrame(mvp_data)
In [9]:
mvp_data_df[0:1]
Out[9]:
In [10]:
mvp_data_df.columns = column_headers
print("the MVP dataframe has {} rows (player-year observations) and {} columns".format(mvp_data_df.shape[0],
mvp_data_df.shape[1]))
mvp_data_df.head()
Out[10]:
In [11]:
url_2018_mvp_finalist = 'https://www.basketball-reference.com/awards/awards_2018.html#mvp'
In [12]:
# get the html
html_finalist = urllib.request.urlopen(url_2018_mvp_finalist)
# examine the webpage
webbrowser.open_new_tab(url_2018_mvp_finalist)
# create the BeautifulSoup object
soup_finalist = BeautifulSoup(html_finalist, "lxml")
# Extract the necessary values for the column headers from the table and store them as a list
column_headers_finalist = [th.getText() for th in soup_finalist.findAll('th', limit=30)]
column_headers_finalist = [s for s in column_headers_finalist if len(s) != 0]
column_headers_finalist = column_headers_finalist[1:]
print("raw column names in finalist table: {}".format(column_headers_finalist))
# drop the over-headers and the trailing headers that don't belong
# to the finalist voting table
column_headers_finalist = [e for e in column_headers_finalist if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]
print("formatted column names in finalist table: {}".format(column_headers_finalist))
print("{} columns in finalist table".format(len(column_headers_finalist)))
# The data is found within the `tr` elements of the first `tbody` element
table_rows_finalist = soup_finalist.find("tbody").find_all("tr")
print(type(table_rows_finalist))
table_rows_finalist[-1] # take a look at the last row
Out[12]:
In [13]:
def extract_player_data(table_rows):
    """
    Extract and return the desired information from the td elements
    within the table rows.

    :param table_rows: list of soup `tr` elements
    :return: list of player-year MVP finalist observations
    """
    # create the empty list to store the player data
    player_data = []

    for row in table_rows:  # for each row do the following
        # Get the text for the table header (th) element, which holds the
        # player's rank, and for each table data (td) element in the row
        player_rank = [th.get_text() for th in row.find_all("th")]
        player_list = [td.get_text() for td in row.find_all("td")]

        # there are some empty table rows, which are the repeated
        # column headers in the table
        # we skip over those rows and continue the for loop
        if not player_list:
            continue

        # Now append the data to the list of data
        player_info = player_rank + player_list
        player_data.append(player_info)

    return player_data
In [14]:
# extract the data we want
data = extract_player_data(table_rows_finalist)
# and then store it in a DataFrame
example_player_df = pd.DataFrame(data)
example_player_df.columns = column_headers_finalist
print("the MVP finalist dataframe has {} rows (player-year observations) and {} columns".format(example_player_df.shape[0],
example_player_df.shape[1]))
example_player_df.tail(6)
Out[14]:
Scraping the finalist data since 1956 is essentially the same process as above, just repeated for each year using a for loop. We will create a DataFrame for each year of MVP finalist data and append it to a list that will contain all the MVP finalist DataFrames. We will also have a separate list that will contain any errors and the url associated with each error. This will let us know if there are any issues with our scraper, and which url is causing the error.
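One thing the loop below does not do is pause between requests. If you scrape many more pages than we do here, it is polite (and safer against rate limiting) to add a short delay between fetches. A minimal sketch of a helper you could swap in for urllib.request.urlopen (the name polite_urlopen is mine):

import time

def polite_urlopen(url, pause=1.0):
    # fetch the url, then sleep briefly so we don't hammer the server
    html = urllib.request.urlopen(url)
    time.sleep(pause)
    return html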
In [15]:
# Create an empty list that will contain all the dataframes
# (one dataframe for each year of finalist data)
mvp_finalist_list = []
# a list to store any errors that may come up while scraping
errors_list = []
In [16]:
loop_start = datetime.now()
print(loop_start)
In [17]:
# The url template that we pass the finalist year info into
url_template = "https://www.basketball-reference.com/awards/awards_{year}.html#mvp"

# for each year from 1956 to (and including) 2018
for year in range(1956, 2019):
    # Use a try/except block to catch and inspect any urls that cause an error
    try:
        # get the MVP finalist data url
        url = url_template.format(year=year)
        # get the html
        html = urllib.request.urlopen(url)
        # create the BeautifulSoup object
        soup = BeautifulSoup(html, "lxml")

        # get the column headers
        column_headers = [th.getText() for th in soup.findAll('th', limit=30)]
        column_headers = [s for s in column_headers if len(s) != 0]
        column_headers = column_headers[1:]
        column_headers = [e for e in column_headers if e not in ('Shooting', 'Advanced', 'Per Game')][:-4]

        # select the data from the table
        table_rows = soup.find_all("tr")[2:]
        # extract the player data from the table rows
        player_data = extract_player_data(table_rows)

        # create the dataframe for the current year's voting data, then
        # subset it to only include MVP finalists by keeping the rows whose
        # "Pts Max" matches the first value found, i.e. the rows from the
        # MVP voting table at the top of the page
        year_df = pd.DataFrame(player_data)
        year_df.columns = column_headers
        year_df = year_df.loc[year_df["Pts Max"] == year_df["Pts Max"].unique()[0]]

        # add the year of the MVP finalist data to the dataframe
        year_df.insert(0, "Year", year)

        # append the current dataframe to the list of dataframes
        mvp_finalist_list.append(year_df)
    except Exception as e:
        # Store the url and the error it causes in a list
        error = [url, e]
        # then append it to the list of errors
        errors_list.append(error)
In [18]:
loop_stop = datetime.now()
print(loop_stop)
print("the loop took {}".format(loop_stop-loop_start))
In [19]:
print(len(errors_list))
errors_list
Out[19]:
Now we can combine all the DataFrames we scraped and create one large DataFrame containing all the finalist data.
In [20]:
print(type(mvp_finalist_list))
print(len(mvp_finalist_list))
In [21]:
mvp_finalist_list[0:1]
Out[21]:
In [22]:
column_headers_finalist.insert(0, "Year")
print(column_headers_finalist)
print(len(column_headers_finalist))
In [23]:
# store all finalist data in one DataFrame
mvp_finalist_df = pd.concat(mvp_finalist_list, axis=0)
print(mvp_finalist_df.shape)
In [24]:
# Take a look at the first row
mvp_finalist_df.iloc[0]
Out[24]:
Now that we have fixed up the necessary columns, let's write out the raw data to CSV files.
In [25]:
os.makedirs('../data/raw_data', exist_ok=True)
os.makedirs('../data/clean_data', exist_ok=True)
In [26]:
mvp_finalist_df.head()
Out[26]:
In [27]:
# Write out the MVP data and MVP finalist data to the raw_data folder in the data folder
mvp_data_df.to_csv("../data/raw_data/mvp_data_df_raw.csv", index=False)
mvp_finalist_df.to_csv("../data/raw_data/mvp_finalist_df_raw.csv", index=False)
In [28]:
mvp_data_df_clean = pd.read_csv("../data/raw_data/mvp_data_df_raw.csv", encoding="Latin-1")
mvp_finalist_df_clean = pd.read_csv("../data/raw_data/mvp_finalist_df_raw.csv", encoding="Latin-1")
In [29]:
mvp_data_df_clean.head()
Out[29]:
In [30]:
mvp_data_columns_dict = {'Season': 'season', 'Lg': 'league', 'Player': 'player', 'Voting': 'voting',
                         'Tm': 'team', 'Age': 'age', 'G': 'games_played', 'MP': 'avg_minutes',
                         'PTS': 'avg_points', 'TRB': 'avg_rebounds', 'AST': 'avg_assists',
                         'STL': 'avg_steals', 'BLK': 'avg_blocks', 'FG%': 'field_goal_pct',
                         '3P%': 'three_pt_pct', 'FT%': 'free_throw_pct', 'WS': 'win_shares',
                         'WS/48': 'win_shares_per_48'}

mvp_finalist_columns_dict = {'Year': 'year', 'Player': 'player', 'Rank': 'rank', 'Tm': 'team',
                             'Age': 'age', 'First': 'first_place_votes', 'Pts Won': 'points_won',
                             'Pts Max': 'points_max', 'Share': 'vote_share', 'G': 'games_played',
                             'MP': 'avg_minutes', 'PTS': 'avg_points', 'TRB': 'avg_rebounds',
                             'AST': 'avg_assists', 'STL': 'avg_steals', 'BLK': 'avg_blocks',
                             'FG%': 'field_goal_pct', '3P%': 'three_pt_pct', 'FT%': 'free_throw_pct',
                             'WS': 'win_shares', 'WS/48': 'win_shares_per_48'}
In [31]:
mvp_data_df_clean.rename(index=str, columns=mvp_data_columns_dict, inplace=True)
mvp_data_df_clean.head()
Out[31]:
In [32]:
mvp_finalist_df_clean.rename(index=str, columns=mvp_finalist_columns_dict, inplace=True)
mvp_finalist_df_clean.head()
Out[32]:
In [33]:
# convert the data to proper numeric types
mvp_data_df_clean = mvp_data_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_data_df_clean.info()
In [34]:
# convert the data to proper numeric types
mvp_finalist_df_clean = mvp_finalist_df_clean.apply(pd.to_numeric, errors="ignore")
mvp_finalist_df_clean.info()
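A quick aside: errors="ignore" is deprecated for pd.to_numeric in recent pandas releases. If you are running a newer version, a minimal sketch of an equivalent column-by-column conversion might look like this (the helper name to_numeric_where_possible is mine):

def to_numeric_where_possible(df):
    # convert each column that is fully numeric, leaving the rest as-is
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # non-numeric columns (e.g. player, team) stay untouched
    return out

# usage: mvp_finalist_df_clean = to_numeric_where_possible(mvp_finalist_df_clean)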
We are not done yet. A lot of our numeric columns are missing data because players didn't accumulate any of those stats. For example, the NBA introduced the 3 point line in the 1979-80 season, so players in preceding seasons don't have this statistic. We want to select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.
In [35]:
# Get the column names for the numeric columns
num_cols_mvp = mvp_data_df_clean.columns[mvp_data_df_clean.dtypes != object]
num_cols_finalist = mvp_finalist_df_clean.columns[mvp_finalist_df_clean.dtypes != object]
# Replace all NaNs with 0
mvp_data_df_clean.loc[:, num_cols_mvp] = mvp_data_df_clean.loc[:, num_cols_mvp].fillna(0)
mvp_finalist_df_clean.loc[:, num_cols_finalist] = mvp_finalist_df_clean.loc[:, num_cols_finalist].fillna(0)
In [36]:
mvp_finalist_df_clean.info()
In [37]:
# keep only NBA MVPs (the table also includes ABA winners)
mvp_data_df_clean = mvp_data_df_clean.loc[mvp_data_df_clean["league"]=="NBA"]
# drop any rows that are missing a team value
mvp_data_df_clean = mvp_data_df_clean[pd.notnull(mvp_data_df_clean['team'])]
After dropping any rows that are missing a player value and sorting the data, we are finally done cleaning and can save everything to CSV files.
In [38]:
mvp_data_df_clean = mvp_data_df_clean[pd.notnull(mvp_data_df_clean['player'])]
mvp_data_df_clean.sort_values(['season'], ascending=False, axis=0, inplace=True)
mvp_finalist_df_clean = mvp_finalist_df_clean[pd.notnull(mvp_finalist_df_clean['player'])]
mvp_finalist_df_clean.sort_values(['year', 'points_won'], ascending=False, axis=0, inplace=True)
In [39]:
mvp_data_df_clean.to_csv("../data/clean_data/mvp_data_df_clean.csv", index=False)
print(mvp_data_df_clean.shape)
mvp_data_df_clean.head()
Out[39]:
In [40]:
mvp_finalist_df_clean.to_csv("../data/clean_data/mvp_finalist_df_clean.csv", index=False)
print(mvp_finalist_df_clean.shape)
mvp_finalist_df_clean.head()
Out[40]:
In [41]:
import sys
import bs4
print(f'last updated: {datetime.now().strftime("%Y-%m-%d %H:%M")} \n')
print(f"System and module version information: \n")
print(f"Python version: {sys.version_info}")
print(f"urllib.request version: {urllib.request.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"Beautiful Soup version: {bs4.__version__}")