By: Nathan Ho
In sports, a 'clutch' play is typically one that occurs at a key moment, when the outcome of the game is in question. This could be a Hail Mary pass in football, an 11th-inning double play in baseball, or a save on a penalty kick in soccer.
In basketball, however, where the game often revolves around a select few superstars, 'clutch'ness describes the player more than the play. A 'clutch' player is one who thrives in high-pressure situations and becomes a more versatile and efficient player when the game is on the line. 'Clutch' plays are seen as the handiwork of 'clutch' players.
It's easy for most basketball fans and sports commentators to talk about 'clutch' qualitatively. Conversations about the strength of a player's mentality can get very opinionated, since mentality itself is hard to quantify. In this project, we will explore the quantitative side of 'clutch' by analyzing the career statistics of two players widely regarded as 'clutch': Kobe Bryant and Michael Jordan. Comparisons have often been drawn between the two, and it will be interesting to take a statistical approach to evaluating them.
We will specifically be looking at their career per-game averages, the consistency of those averages, the percent change in key statistics from the regular season to the playoffs (which we will call the 'clutch factor'), and their scoring efficiency in terms of points per minute.
We will begin by importing the relevant libraries:
In [1]:
#importing data organization and graphics packages
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import scipy.stats as stats
import datetime as dt
import sys
import seaborn as sb
import warnings
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')
warnings.filterwarnings('ignore')
The data for this project comes from Savvastj's nbashots module, which accesses the stats.nba.com API through specialized methods.
To install, please run on your Terminal/Command Prompt:
pip install nbashots
We will be importing this package for the project:
In [3]:
#NBA module
import nbashots as nba
The module organizes player statistics by Player IDs. Here we will take a brief look at Kobe's ID and game statistics from the most recent season, 2015-16:
In [4]:
kobe_id = nba.get_player_id("Bryant, Kobe")[0]
kobe_id
Out[4]:
In [5]:
kobe_gamelogs = nba.PlayerLog(kobe_id)
kobe_df = kobe_gamelogs.get_game_logs()
kobe_df.head()
Out[5]:
In [6]:
print(kobe_df.dtypes)
print()
print(kobe_df.shape)
Thankfully, the data returned is mostly in int and float form, which makes it easier to analyze. The dataframe shape shows that Kobe played 66 games during the 2015-16 regular season.
Now we take a brief look at Jordan's ID and statistics:
In [7]:
jordan_id = nba.get_player_id("Jordan, Michael")[0]
jordan_id
Out[7]:
In [8]:
jordan_gamelogs = nba.PlayerLog(jordan_id)
jordan_df = jordan_gamelogs.get_game_logs()
jordan_df
Out[8]:
The lack of data is due to the PlayerLog method defaulting to the 2015-16 season. Michael Jordan, who retired in 2003, obviously did not play then.
According to the module creator, Savvastj, the PlayerLog method has four parameters: player_id, league_id (default '00'), Season (default '2015-16'), and Season_Type (default 'Regular Season').
This can be corrected by changing the Season parameter to the 1995-96 season instead. 1995-96 is a very famous season for the Chicago Bulls, when Jordan's Bulls earned 72 wins and 10 losses, becoming the only team in NBA history to win over 70 games and the NBA title in the same season.
In [9]:
jordan_gamelogs = nba.PlayerLog(jordan_id, '00', '1995-96', 'Regular Season')
jordan_df = jordan_gamelogs.get_game_logs()
jordan_df.head()
Out[9]:
By changing the Season_Type parameter to 'Playoffs', we can pull Jordan's playoff games instead.
In [10]:
jordan_gamelogs1 = nba.PlayerLog(jordan_id, '00', '1995-96', 'Playoffs')
jordan_df1 = jordan_gamelogs1.get_game_logs()
jordan_df1.head()
Out[10]:
We can do some simple analysis on the data we pull by applying some pandas methods.
According to the shape and mean of the jordan_df dataframe, Jordan played 82 games, the full season, in the 1995-96 regular season, and averaged 30 points, 4 assists, 2 steals, and almost 7 rebounds per game:
In [11]:
print(jordan_df.shape)
In [12]:
jordan_df.mean()
Out[12]:
Next, we assign Kobe's active seasons to a variable:
In [13]:
k_active_seasons = ['1996-97','1997-98','1998-99','1999-00','2000-01','2001-02','2002-03','2003-04','2004-05','2005-06','2006-07','2007-08','2008-09','2009-10','2010-11','2011-12','2012-13','2013-14','2014-15','2015-16']
Then we loop through the seasons to aggregate all of his career games during the regular season:
In [14]:
def total_reg(player_id, active_seasons):
    # Aggregate a player's regular-season game logs across all given seasons.
    a = pd.DataFrame()
    for i in active_seasons:
        gamelogs = nba.PlayerLog(player_id, '00', i, 'Regular Season')
        df = gamelogs.get_game_logs()
        a = a.append(df, ignore_index = True)
    return a
In [15]:
agg_kobe_reg = total_reg(kobe_id, k_active_seasons)
agg_kobe_reg.shape
Out[15]:
Since any season in which no games were played simply returns an empty DataFrame, we can apply the same loop to Kobe's playoff games:
In [16]:
def total_po(player_id, active_seasons):
    # Aggregate a player's playoff game logs across all given seasons.
    a = pd.DataFrame()
    for i in active_seasons:
        gamelogs = nba.PlayerLog(player_id, '00', i, 'Playoffs')
        df = gamelogs.get_game_logs()
        a = a.append(df, ignore_index = True)
    return a
In [17]:
agg_kobe_po = total_po(kobe_id, k_active_seasons)
agg_kobe_po.shape
Out[17]:
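Incidentally, total_reg and total_po differ only in the Season_Type string they pass to nba.PlayerLog. A single helper could cover both cases; the sketch below is built on the same nbashots calls used above, and the season_type parameter name is our own:

# Sketch: aggregate game logs for a list of seasons, for either
# 'Regular Season' or 'Playoffs'.
def total_games(player_id, active_seasons, season_type='Regular Season'):
    frames = []
    for season in active_seasons:
        gamelogs = nba.PlayerLog(player_id, '00', season, season_type)
        frames.append(gamelogs.get_game_logs())
    return pd.concat(frames, ignore_index=True)

With this helper, agg_kobe_reg would be total_games(kobe_id, k_active_seasons) and agg_kobe_po would be total_games(kobe_id, k_active_seasons, 'Playoffs').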
We assign all of Kobe's game statistics, regular season and playoffs combined, to the variable agg_kobe.
In [18]:
agg_kobe = agg_kobe_reg.append(agg_kobe_po)
agg_kobe.shape
Out[18]:
By assigning Jordan's active seasons to a variable, we can aggregate Jordan's regular season games and playoff games as well:
In [19]:
j_active_seasons = ['1984-85','1985-86','1986-87','1987-88','1988-89', '1989-90','1990-91','1991-92','1992-93','1993-94','1994-95','1995-96','1996-97','1997-98','2001-02','2002-03']
In [20]:
agg_jordan_reg = total_reg(jordan_id, j_active_seasons)
agg_jordan_reg.shape
Out[20]:
In [21]:
agg_jordan_po = total_po(jordan_id, j_active_seasons)
agg_jordan_po.shape
Out[21]:
In [22]:
agg_jordan = agg_jordan_reg.append(agg_jordan_po, ignore_index = True)
agg_jordan.shape
Out[22]:
We can compute the career per-game averages for Kobe and Jordan as follows:
In [23]:
agg_kobe.mean()
Out[23]:
Going through the agg_jordan dataframe, I've noticed that there are 'None' entries where the value should be 0. We will use the .fillna(0) method to fix that:
In [24]:
agg_jordan.head()
Out[24]:
In [25]:
agg_jordan.fillna(0).mean()
Out[25]:
In [26]:
agg_disparity = agg_kobe.mean() - agg_jordan.fillna(0).mean()
agg_disparity
Out[26]:
According to the data, Jordan contributed more on a per-game basis than Kobe Bryant: he averaged around 5 more points, 1 more rebound, 0.62 more assists, and 0.71 more steals per game over his career.
For players of the calibre of Jordan and Kobe, such a difference is relatively insignificant. The disparity is likely due to Kobe playing alongside Hall of Fame center Shaquille O'Neal for a significant portion of his career, and thus having to share playing time and the ball more often than Jordan did.
We then look at how their per-game statistics are distributed over their careers. Focusing on these distributions is a good way to gauge how consistently each player performed on the court. We will look specifically at points scored, assists, rebounds, and steals, as those are typically the key stats for evaluating the guard position.
We can plot normalized histograms of their career game stats to see if there are any significant differences:
In [27]:
fig, ax = plt.subplots(2,4)
fig.set_size_inches(9.5, 5.5)
fig.suptitle('Statistic Distribution Comparison: Kobe vs. MJ ')
agg_kobe['PTS'].plot(ax = ax[0,0],
kind='hist',
bins = 50,
sharex= True,
title = 'Kobe Pt Dist.',
fontsize=8,
normed=True)
agg_jordan['PTS'].plot(ax = ax[1,0],
kind='hist',
bins = 50,
sharex = True,
sharey = True,
title = 'MJ Pt Dist.',
fontsize=8,
normed=True)
agg_kobe['AST'].plot(ax = ax[0,1],
kind='hist',
bins = 50,
sharey = True,
title = 'Kobe Assist Dist.',
fontsize=8,
normed=True)
agg_jordan['AST'].plot(ax = ax[1,1],
kind='hist',
bins = 50,
sharex= True,
sharey=True,
title = 'MJ Assist Dist.',
fontsize=8,
normed=True)
agg_kobe['REB'].plot(ax = ax[0,2],
kind='hist',
bins = 50,
sharey = True,
title = 'Kobe Reb Dist.',
fontsize=8,
normed=True)
agg_jordan['REB'].plot(ax = ax[1,2],
kind='hist',
bins = 50,
sharex= True,
sharey=True,
title = 'MJ Reb Dist.',
fontsize=8,
normed=True)
agg_kobe['STL'].plot(ax = ax[0,3],
kind='hist',
bins = 50,
sharey = True,
title = 'Kobe Stls Dist.',
fontsize=8,
normed=True)
agg_jordan['STL'].plot(ax = ax[1,3],
kind='hist',
bins = 50,
sharex= True,
sharey=True,
title = 'MJ Stls Dist.',
fontsize=8,
normed=True)
Out[27]:
Aside from Jordan's points distribution having a higher mean than Kobe's, and Jordan's career rebound distribution being somewhat tighter, we do not see any significant differences.
We then take a more numeric approach. Through the .std() method, we can look at the standard deviations of Kobe's and Jordan's career statistics to measure how much each player's per-game numbers fluctuate:
In [28]:
print(agg_kobe.std())
print()
print(agg_jordan.std())
In [29]:
print(agg_kobe.std()-agg_jordan.fillna(0).std())
We see that Jordan was more consistent in his scoring throughout his career than Kobe, with a standard deviation of 9.69 points compared to Kobe's 10.60, and slightly more consistent in assists, with a standard deviation of 2.76 for Jordan versus Kobe's 2.79. However, Kobe shows stronger career consistency in rebounds and steals.
These differences are not significant enough to draw any conclusions for comparing the two players.
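One caveat: because the two players' per-game means differ, raw standard deviations are not directly comparable. A scale-free alternative is the coefficient of variation (standard deviation divided by mean); the sketch below uses the aggregated dataframes already built and is only meant as a rough check:

# Coefficient of variation (std / mean) for the key guard stats.
key_stats = ['PTS', 'AST', 'REB', 'STL']
kobe_cv = agg_kobe[key_stats].std() / agg_kobe[key_stats].mean()
jordan_cv = agg_jordan[key_stats].fillna(0).std() / agg_jordan[key_stats].fillna(0).mean()
print(pd.DataFrame({'Kobe CV': kobe_cv, 'Jordan CV': jordan_cv}))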
We move on to evaluate how playing in the playoffs specifically affects each player's performance statistics, or their respective 'clutch'. We will call the percent difference between regular-season and postseason performance statistics the 'clutch factor'.
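Concretely, for each statistic the 'clutch factor' is 100 * (playoff mean - regular-season mean) / regular-season mean. A minimal sketch of that computation, assuming two aggregated game-log dataframes like the ones built above (the function name here is our own):

# Sketch: percent change in each per-game average from the regular
# season to the playoffs.
def clutch_pct(reg_df, po_df):
    reg_mean = reg_df.mean()
    po_mean = po_df.mean()
    return 100 * (po_mean - reg_mean) / reg_mean

The cells below compute the same quantity step by step so the intermediate dataframes can be inspected.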
First, we compile Kobe's statistical averages for the regular season and the postseason into one dataframe:
In [30]:
kobe_rmean_df = pd.DataFrame(agg_kobe_reg.mean())
kobe_rmean_df.columns = ['kobe_reg_mean']
kobe_rmean_df = kobe_rmean_df.reset_index()
kobe_rmean_df.tail()
Out[30]:
In [31]:
kobe_pmean_df = pd.DataFrame(agg_kobe_po.mean())
kobe_pmean_df.columns = ['kobe_po_mean']
kobe_pmean_df = kobe_pmean_df.reset_index()
kobe_pmean_df.tail()
Out[31]:
In [32]:
kobe_clutch = pd.merge(kobe_rmean_df, kobe_pmean_df, how = 'right')
kobe_clutch.tail()
Out[32]:
Then we add a new column showing the percent improvement from the regular season to the playoffs:
In [33]:
kobe_clutch['k_clutch_factor'] = 100*(kobe_clutch['kobe_po_mean'] - kobe_clutch['kobe_reg_mean'])/kobe_clutch['kobe_reg_mean']
kobe_clutch.tail()
Out[33]:
We apply the same to Jordan's data:
In [34]:
jordan_rmean_df = pd.DataFrame(agg_jordan_reg.fillna(0).mean())
jordan_rmean_df.columns = ['jordan_reg_mean']
jordan_rmean_df = jordan_rmean_df.reset_index()
jordan_pmean_df = pd.DataFrame(agg_jordan_po.mean())
jordan_pmean_df.columns = ['jordan_po_mean']
jordan_pmean_df = jordan_pmean_df.reset_index()
jordan_clutch = pd.merge(jordan_rmean_df, jordan_pmean_df, how = 'right')
jordan_clutch['j_clutch_factor'] = 100*(jordan_clutch['jordan_po_mean'] - jordan_clutch['jordan_reg_mean'])/jordan_clutch['jordan_reg_mean']
jordan_clutch.tail()
Out[34]:
We can then merge the two dataframes and drop all rows unrelated to points scored, assists, rebounds, and steals.
In [35]:
clutch_factor = pd.merge(kobe_clutch, jordan_clutch)
clutch_factor=clutch_factor.set_index('index')
clutch_factor.shape
Out[35]:
In [36]:
clutch_factor_t = clutch_factor.drop(clutch_factor.index[0:12])
clutch_factor_tt = clutch_factor_t.drop(clutch_factor_t.index[3:6])
clutch_factor_tt = clutch_factor_tt.drop(clutch_factor_tt.index[4])
clutch_factor_f = clutch_factor_tt
In [37]:
clutch_factor_f
Out[37]:
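As an aside, the positional drops above depend on the exact row order returned by the API. Assuming the merged index contains the stat names used earlier (PTS, REB, AST, STL), a label-based selection would be a more robust way to build the same table; this is a hypothetical alternative, not what was run above:

# Hypothetical label-based alternative to the positional drops.
clutch_factor_alt = clutch_factor.loc[['PTS', 'REB', 'AST', 'STL']]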
To make the differences in 'clutch factor' easier to compare, we can plot them for both players:
In [38]:
fig, ax = plt.subplots()
clutch_factor_f['k_clutch_factor'].plot(kind = 'line',
color = 'blue',
legend = True)
clutch_factor_f['j_clutch_factor'].plot(kind = 'line',
color = 'red',
legend = True)
clutch_factor_f['k_clutch_factor'].plot(kind = 'bar',
color = 'blue',
alpha = .5,
legend = True)
clutch_factor_f['j_clutch_factor'].plot(kind = 'bar',
color = 'red',
alpha = .5,
legend = True)
ax.set_ylabel('Clutch Factor%')
ax.set_xlabel('Statistic')
ax.set_title('Kobe vs. Jordan: Clutch Comparison')
Out[38]:
The difference between the two players is more obvious in this figure. In every statistic other than steals, Michael Jordan's numbers improve come playoff time. Although Kobe does perform better in the postseason in terms of assists and points scored, he rebounds less and makes fewer steals. Jordan's improvements in assists and points are also much larger than Kobe's.
We will now extend the definition of 'clutch' to cover efficiency improvements during the playoffs as well. Both Kobe and Jordan are shooting guards (SG), a role built heavily around the points they contribute to the game. To understand their scoring efficiency, we can look at points per minute: we will regress points scored on minutes played and compare the slopes of the regressions between the regular season and the postseason for the two players.
We will use NumPy's .polyfit method to fit a least-squares regression to the data. Some of Michael Jordan's games are missing minutes data, so we drop those data points for this regression.
In [39]:
fig, ax = plt.subplots(2)
fig.suptitle('OLS Regression Points by Minute', horizontalalignment= 'right', verticalalignment = 'bottom', size = 12)
agg_jordan_m = agg_jordan.drop(agg_jordan.index[agg_jordan['MIN']==0.0],axis = 0) #dropping missing or broken data
k_coeff = np.polyfit(x = agg_kobe['MIN'], y = agg_kobe['PTS'], deg =1)
k_ffit = np.poly1d(k_coeff)
k_linreg = k_ffit(agg_kobe['MIN'])
k_linreg = pd.DataFrame(k_linreg)
k_linreg.columns = ['VAL']
j_coeff = np.polyfit(x = agg_jordan_m['MIN'], y = agg_jordan_m['PTS'], deg =1)
j_ffit = np.poly1d(j_coeff)
j_linreg = j_ffit(agg_jordan_m['MIN'])
j_linreg = pd.DataFrame(j_linreg)
j_linreg.columns = ['VAL']
k_linreg.plot(ax=ax[0],
kind = 'line',
x = agg_kobe['MIN'],
legend = False,
color = 'red')
j_linreg.plot(ax=ax[1],
kind = 'line',
x = agg_jordan_m['MIN'],
legend = False,
color = 'red')
agg_kobe.plot(ax=ax[0],
kind = 'scatter',
x = 'MIN',
y = 'PTS',
color = 'blue',
alpha = 0.5,
title = 'Kobe Bryant',
sharex = True,
ylim = (0,85))
agg_jordan_m.plot(ax=ax[1],
kind = 'scatter',
x = 'MIN',
y = 'PTS',
color = 'blue',
alpha = 0.5,
title = 'Michael Jordan',
ylim = (0,85))
ax[0].set_xlim(0,60)
ax[1].set_xlim(0,60)
Out[39]:
According to the slope of the ordinary least-squares regression, the career points per minute for Kobe and Jordan are as follows:
In [40]:
kobe_career_ppm = k_ffit[1]
kobe_career_ppm
Out[40]:
In [41]:
jordan_career_ppm = j_ffit[1]
jordan_career_ppm
Out[41]:
Judging by the slope of the OLS regression, Jordan is overall more efficient than Kobe at producing points per minute.
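As a quick cross-check of these slopes, scipy.stats (imported at the top) provides linregress, which also reports a correlation coefficient. This is only a sketch using the same dataframes as the polyfit calls above:

# Cross-check the career points-per-minute slopes with scipy.stats.linregress.
k_slope, k_int, k_r, k_p, k_err = stats.linregress(agg_kobe['MIN'], agg_kobe['PTS'])
j_slope, j_int, j_r, j_p, j_err = stats.linregress(agg_jordan_m['MIN'], agg_jordan_m['PTS'])
print('Kobe slope:', k_slope, 'r^2:', k_r**2)
print('Jordan slope:', j_slope, 'r^2:', j_r**2)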
Now we can separate the regular-season and playoff statistics to see how the points-per-minute metric, or their production-efficiency 'clutch', changes in the postseason.
We fit a separate OLS regression to each subset of the data:
In [42]:
agg_jordan_reg1 = agg_jordan_reg.drop(agg_jordan_reg.index[agg_jordan_reg['MIN']==0.0],axis = 0)
k_reg_coeff = np.polyfit(x = agg_kobe_reg['MIN'], y = agg_kobe_reg['PTS'], deg =1)
k_reg_ffit = np.poly1d(k_reg_coeff)
k_reg_linreg = k_reg_ffit(agg_kobe_reg['MIN'])  # evaluate the regular-season fit
k_reg_linreg = pd.DataFrame(k_reg_linreg)
k_reg_linreg.columns = ['VAL']
k_po_coeff = np.polyfit(x = agg_kobe_po['MIN'], y = agg_kobe_po['PTS'], deg =1)
k_po_ffit = np.poly1d(k_po_coeff)
k_po_linreg = k_po_ffit(agg_kobe_po['MIN'])  # evaluate the playoff fit
k_po_linreg = pd.DataFrame(k_po_linreg)
k_po_linreg.columns = ['VAL']
j_reg_coeff = np.polyfit(x = agg_jordan_reg1['MIN'], y = agg_jordan_reg1['PTS'], deg =1)
j_reg_ffit = np.poly1d(j_reg_coeff)
j_reg_linreg = j_reg_ffit(agg_jordan_reg1['MIN'])
j_reg_linreg = pd.DataFrame(j_reg_linreg)
j_reg_linreg.columns = ['VAL']
j_po_coeff = np.polyfit(x = agg_jordan_po['MIN'], y = agg_jordan_po['PTS'], deg =1)
j_po_ffit = np.poly1d(j_po_coeff)
j_po_linreg = j_po_ffit(agg_jordan_po['MIN'])
j_po_linreg = pd.DataFrame(j_po_linreg)
j_po_linreg.columns = ['VAL']
Using the regressions and career statistics data we had previously compiled, we can plot the data into a 2 x 2 subplot:
In [43]:
fig, ax = plt.subplots(2,2)
agg_jordan_reg1.plot(ax=ax[0,1],
kind = 'scatter',
x = 'MIN',
y = 'PTS',
color = 'blue',
alpha = 0.5,
title = 'MJ Regular',
ylim = (0,85),
sharex = True,
sharey = True)
j_reg_linreg.plot(ax=ax[0,1],
kind = 'line',
x = agg_jordan_reg1['MIN'],
legend = False,
color = 'red')
agg_jordan_po.plot(ax=ax[1,1],
kind = 'scatter',
x = 'MIN',
y = 'PTS',
color = 'blue',
alpha = 0.5,
title = 'MJ Post',
ylim = (0,85))
j_po_linreg.plot(ax=ax[1,1],
kind = 'line',
x = agg_jordan_po['MIN'],
legend = False,
color = 'red')
agg_kobe_reg.plot(ax=ax[0,0],
kind = 'scatter',
x = 'MIN',
y = 'PTS',
color = 'blue',
alpha = 0.5,
title = 'Kobe Regular',
ylim = (0,85))
k_reg_linreg.plot(ax=ax[0,0],
kind = 'line',
x = agg_kobe_reg['MIN'],
legend = False,
color = 'red')
agg_kobe_po.plot(ax=ax[1,0],
kind = 'scatter',
x = 'MIN',
y = 'PTS',
color = 'blue',
alpha = 0.5,
title = 'Kobe Post',
ylim = (0,85))
k_po_linreg.plot(ax=ax[1,0],
kind = 'line',
x = agg_kobe_po['MIN'],
legend = False,
color = 'red')
ax[0,0].set_xlim(0,60)
ax[0,1].set_xlim(0,60)
ax[1,0].set_xlim(0,60)
ax[1,1].set_xlim(0,60)
Out[43]:
Looking at the scatter data, Jordan's postseason points-per-minute fit has a noticeably steeper slope than Kobe's, while their regular-season fits look relatively similar. The efficiency 'clutch' improvement is calculated as follows:
In [44]:
k_reg_ppm = k_reg_ffit[1]
k_po_ppm = k_po_ffit[1]
print('Kobe:',k_reg_ppm, 'points per minute during the Regular Season')
print('Kobe:',k_po_ppm, 'points per minute during the Post Season')
print('Production Efficiency Clutch:', ((k_po_ppm-k_reg_ppm)/k_reg_ppm)*100,'%')
In [45]:
j_reg_ppm = j_reg_ffit[1]
j_po_ppm = j_po_ffit[1]
print('Jordan:',j_reg_ppm, 'points per minute during the Regular Season')
print('Jordan:',j_po_ppm, 'points per minute during the Post Season')
print('Production Efficiency Clutch:', ((j_po_ppm-j_reg_ppm)/j_reg_ppm)*100,'%')
Looking at 'clutch' in terms of production efficiency, Jordan again does better than Kobe: Kobe actually scores over 10% fewer points per minute in the postseason than in the regular season. This suggests that our earlier finding of Kobe scoring more points in the playoffs may simply be a product of him playing more minutes in his career postseasons.
Looking at the results from Kobe's and Jordan's career stats, we see that Jordan holds the edge in overall improvement during the playoffs relative to the regular season, or what we defined as 'clutch'. Although there was not an enormous difference in their raw career stats in quantity or consistency, the disparity between the regular season and the postseason is observable both in key stats such as points scored, assists, rebounds, and steals, and in production efficiency in terms of points per minute.
This difference in 'clutch' provides some quantitative reasoning behind the difference in the awards and accolades Kobe and Jordan garnered over their respective careers. Both are obviously great players, but this may be part of the reason why many diehard fans, as well as mainstream sports analysts such as ESPN, consider Jordan to be the G.O.A.T. (Greatest of All Time) and Kobe just a legend in passing (check No.12).
Potential sources of error in this report include the limitations of the single-variable points-per-minute OLS regression, as well as minor distortions in the averages from Kobe playing 20 NBA seasons compared to Jordan's 15 (Jordan took a year off to pursue professional baseball, as well as a hiatus before joining the Washington Wizards).
This is the end of my project. I hope you had as much fun reading it as I had writing it.