Josh Ottensoser, Ben Rapaport, Alexander Stadtmauer, Jacob Sternberg
12/21/16
Professor Lyon & Professor Coleman
Data Bootcamp - UG Fall 2016
December 2016
As NBA fans, we have felt that the eye test in recent years points to a shift in player scoring, with an increased emphasis on the 3 Pointer. Stephen Curry, who obliterated the record for most 3 Pointers made in a single season, is the poster child of this shift. We wanted to find out whether the eye test holds up: is Steph part of a greater trend, or is he just an outlier? In this project, we therefore looked for evidence of scoring trends as they relate to shot distance.
We accomplished this by retrieving data from basketball-reference.com and ESPN's database on the top 5 scorers from the past 15 years (ever since such data has been recorded). We used data of each season’s top 5 scorers rather than the league average for a few reasons: 1. Only individual player data, rather than league-wide data, is readily available for these metrics, 2. We see the top 5 scorers as proxies for scoring trends, as these are the players around whom offensive game plans revolve; their attempts are indicative of what NBA offenses are trying to accomplish, and 3. Examining actual players rather than faceless averages allows us to see the drivers behind the data, making it easy to spot potential outliers rather than forcing us to guess whether certain players are driving the data.
After compiling the data and analyzing the basic points per game statistics, we delved deeper into the reasons behind the trends by organizing the data in various charts and graphs. The graphics are built on a few derived metrics, such as points per 36 minutes, HHI, and shot distance from the basket, which we use to determine where on the court the top NBA players are scoring from.
The results of our research can be found in the charts and explanations below. They portray the changes that the NBA game has undergone and what the Association might trend towards in the future; however, we understand that the league is always shifting and adapting to new strategies and changes so we acknowledge that the current trends may not prevail for long.
In [1]:
##imports
import sys # system module
import pandas as pd # data package
import matplotlib as mpl # graphics package
import matplotlib.pyplot as plt # pyplot module
%matplotlib inline
import datetime as dt # datetime module
import seaborn as sns # import seaborn module
##read the csv and save it as a dataframe
path = 'https://raw.githubusercontent.com/joshuaott3/DBProject/master/DBData.csv'
df= pd.read_csv(path)
##rename columns
df.columns=['Season','Name', 'Age','Team', 'League','Position','Games Played','Minutes Played', 'PPG','FGA','FG%','Average Shot Distance','%FGA 2P','%FGA 0-3','%FGA 3-10','%FGA 10-16','%FGA 16<3','%FGA 3P','FG% 2P','FG% 0-3','FG% 3-10','FG% 10-16','FG% 16<3','FG% 3P','%AST 2P','%FGA Dunks', 'Dunks Made','%ASTD 3P','%3PA Corner','3P% Corner', '3P Heaves Attempt','3P Heaves Made','OUT']
##drop column that isn't necessary
df = df.drop(columns='OUT')
##set season as the index
df = df.set_index('Season')
#drop the two rows that we do not need now (one row that is useless*, one redundant season row)
##*We got rid of this row by deleting the row that had the word 'Dunks' in the column 'Dunks Made' (the row we didn't want)
df=df.drop(['Season'])
df = df[df['Dunks Made'] != 'Dunks']
##Convert the appropriate columns from strings to floats
tofloats= ['Age','Games Played','Minutes Played', 'PPG','FGA','FG%','Average Shot Distance','%FGA 2P','%FGA 0-3','%FGA 3-10','%FGA 10-16','%FGA 16<3','%FGA 3P','FG% 2P','FG% 0-3','FG% 3-10','FG% 10-16','FG% 16<3','FG% 3P','%AST 2P','%FGA Dunks', 'Dunks Made','%ASTD 3P','%3PA Corner','3P% Corner']
for i in tofloats:
    df[i] = df[i].astype(float)
##Create variables that will be necessary for proper weighting
df['FGA 2P']=df['%FGA 2P'] * df['FGA']
df['FGA 0-3']=df['%FGA 0-3'] * df['FGA']
df['FGA 3-10']=df['%FGA 3-10'] * df['FGA']
df['FGA 10-16']=df['%FGA 10-16'] * df['FGA']
df['FGA 16 <3']=df['%FGA 16<3'] * df['FGA']
df['FGA 3P']=df['%FGA 3P'] * df['FGA']
##Creates Minute per Game variable
df['MPG'] = df['Minutes Played']/df['Games Played']
##Creates Points per Minute variable
df['PPM']= (df['PPG']*df['Games Played'])/df['Minutes Played']
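##Creates Points per 36 Minutes variable
df['PP36']=df['PPM']*36
##getting the average for each year, will be useful for PPG and MPG
##numeric_only=True skips the text columns (Name, Team, etc.), which cannot be averaged
mean_years=df.groupby(df.index).mean(numeric_only=True)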
In [2]:
##Checking to make sure the dataframe is there how we like it
df.head(5)
Out[2]:
In [3]:
##Making sure all of the data is in the format we need
##*There will be no need for 3-Point Heave data so we did not change it
df.dtypes
Out[3]:
In [4]:
##Making sure the mean dataframe is there and how we'd like it
mean_years.tail(5)
Out[4]:
We first seek to examine whether or not there has been a trend in scoring patterns over the course of our dataset. In order to do so, we take the average of each season and graph scoring statistics over time. The following graph shows the points per game trend for the top 5 scorers over our dataset.
In [5]:
##set seaborn
sns.set()
##create subplot
fig, ax = plt.subplots()
##plot out the PPG from mean years
mean_years['PPG'].plot(ax=ax,legend=None,color='teal', linewidth=5,ls='dashdot')
##title it, place the legend and decide the style
plt.title('PPG by Year', color='Navy',fontsize='18', fontweight='bold')
#sets label titles and style
ax.set_xlabel('NBA Season', fontsize='14', fontweight='bold')
ax.set_ylabel('Points Per Game', fontsize='14', fontweight='bold')
##gets labels
locs, labels = plt.xticks()
##sets rotation of the x labels
plt.setp(labels, rotation=90)
Out[5]:
In [6]:
##set seaborn
sns.set()
##create subplot
fig, ax = plt.subplots()
##plot Points per 36 Minutes from mean_years
mean_years['PP36'].plot(ax=ax,legend=None,color='green', linewidth=5, ls=':')
#title it
plt.title('Points per 36 Minutes Over the Years', color='Navy',fontsize='14', fontweight='bold')
##sets label titles and style
ax.set_xlabel('NBA Season', fontsize='14', fontweight='bold')
ax.set_ylabel('Points Per 36 Minutes', fontsize='14', fontweight='bold')
##gets labels
locs, labels = plt.xticks()
##sets rotation of the x labels
plt.setp(labels, rotation=90)
Out[6]:
Here, we see that while raw points per game decreased, points per 36 minutes among the top scorers increased. This suggests a trend towards more efficient scoring: top scorers are producing more in the minutes they play. Having established this trend, we seek to discover what is driving the increase in efficiency.
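To make the per-36 normalization concrete, here is a minimal sketch that mirrors how the PP36 column is computed in the first cell. The two players and all of their numbers are hypothetical, chosen only to show how a lower PPG can coexist with a higher per-36 rate.

```python
def points_per_36(ppg, games_played, minutes_played):
    """Scale a scoring average to a 36-minute basis.

    ppg            -- points per game
    games_played   -- games played in the season
    minutes_played -- total minutes played in the season
    """
    total_points = ppg * games_played
    return total_points / minutes_played * 36


# Hypothetical players, not taken from the dataset
heavy_minutes = points_per_36(ppg=30.0, games_played=80, minutes_played=80 * 42)
light_minutes = points_per_36(ppg=28.0, games_played=80, minutes_played=80 * 35)

print(round(heavy_minutes, 1), round(light_minutes, 1))
```

In this made-up case, the player with the lower PPG has the higher per-36 rate because he plays seven fewer minutes per game.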
We calculate HHI (the Herfindahl-Hirschman Index) to determine whether concentration of shot type (by location) is a factor in increased scoring efficiency.
HHI is a measure of how concentrated a distribution is: it is the sum of the squared shares of each shot location. With our formula, the maximum HHI is 1. If a player takes 100 shots and all of them are in the 3-10 foot range, his HHI is calculated as [(100 3-10 shots)/(100 total shots)]^2 = 1. Had this shooter taken 50 shots in the 3-10 foot range and 50 shots in the 10-16 foot range, his HHI would be calculated as [(50 3-10 shots)/(100 total shots)]^2 + [(50 10-16 shots)/(100 total shots)]^2 = .5.
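The two worked examples above can be checked with a small standalone function. This is a sketch that uses plain dictionaries rather than our dataframe; the function name `hhi` and the zone labels are illustrative only.

```python
def hhi(attempts_by_zone):
    """Sum of squared shot shares for a dict of zone -> number of attempts."""
    total = sum(attempts_by_zone.values())
    if total == 0:
        raise ValueError('at least one shot attempt is required')
    return sum((n / total) ** 2 for n in attempts_by_zone.values())


# The two examples from the text
one_zone = hhi({'3-10': 100})
two_zones = hhi({'3-10': 50, '10-16': 50})

# An even spread across all five zones gives the lowest possible value, 1/5
even_five = hhi({'0-3': 20, '3-10': 20, '10-16': 20, '16<3': 20, '3P': 20})

print(one_zone, two_zones, round(even_five, 3))
```

With five shot zones, HHI therefore ranges from .2 (perfectly even) to 1 (all shots from one zone), which is useful context when reading the y-axis of the HHI chart.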
In [6]:
##create sum_years table, will be of use for everything else
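##numeric_only=True keeps only the numeric columns (text columns such as Name would otherwise be concatenated)
sum_years=df.groupby(df.index).sum(numeric_only=True)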
##create HHI column
sum_years['HHI']=0
##create list that will be vital for HHI calculation
HHI_list=['FGA 0-3','FGA 3-10', 'FGA 10-16', 'FGA 16 <3', 'FGA 3P']
##Create a for loop to get the HHI for each year
for i in HHI_list:
    sum_years['HHI']+=(sum_years[i]/sum_years['FGA'])**2
#create list that will be vital to determine shot distribution
FGA_list=['%FGA 0-3','%FGA 3-10','%FGA 10-16','%FGA 16<3','%FGA 3P']
##for loop that will give accurate number for percentage of shots on each location
##before, it was a sum of all of the players %, which made it greater than 1. We want the % of all of the players shots.
for i in range(0,5):
    sum_years[FGA_list[i]]=sum_years[HHI_list[i]]/sum_years['FGA']
##create MoreyBall column that will be sum of % of shots that are 0-3 feet and 3P
sum_years['MoreyBall']=sum_years['%FGA 0-3']+sum_years['%FGA 3P']
In [7]:
##Making sure the sum dataframe is there and how we'd like it
sum_years.tail(5)
Out[7]:
In [8]:
##set seaborn
sns.set()
##create subplot
fig, ax = plt.subplots()
##plot out the HHI from sum_years
sum_years['HHI'].plot(ax=ax,legend=False,color='Red', linewidth=5,linestyle='--')
##title it and format
plt.title('HHI by Year', color='Navy',fontsize='16',fontweight='bold')
##Title y and x label and format
ax.set_ylabel('HHI Level (max of 1)',fontsize='14',fontweight='bold')
ax.set_xlabel('NBA Season',fontsize='14',fontweight='bold')
#gets labels
locs, labels = plt.xticks()
#sets rotation of the x labels
plt.setp(labels, rotation=90)
Out[8]:
As can be seen in the graph above, there has been a recent upward trend in HHI. This tells us that shots are becoming more concentrated in fewer locations, but not which locations. It is clear players are choosing to shoot from the same spots more often, but where from?
Having established a recent upward trend in HHI, we would like to examine more closely which shot distance is being favored and is therefore driving the increase in shot concentration. Our first test is to chart the average shot distance for the top 5 scorers over the years. The trend in how far from the basket these players are shooting may help us understand what has changed.
In [9]:
##set seaborn
sns.set()
##create subplot
fig, ax = plt.subplots()
##plot Average Shot Distance from mean_years
mean_years['Average Shot Distance'].plot(ax=ax,legend=None,color='purple', linewidth=5, linestyle = 'solid')
##title it and format
plt.title('Average Shot Distance by Year', color='Navy',fontsize='16', fontweight='bold')
##title x and y label and format
ax.set_ylabel('Average Shot Distance',fontsize='14', fontweight='bold')
ax.set_xlabel('NBA Season',fontsize='14', fontweight='bold')
##gets labels
locs, labels = plt.xticks()
##sets rotation of the x labels
plt.setp(labels, rotation=90)
Out[9]:
We had hypothesized that a shot concentration increase would be a result of a rise in concentration of 3 Pointers attempted, and that the average shot distance would therefore rise considerably. However, the data shows that the story is more complicated than that. Since the 2011-2012 season, the average shot distance has hardly risen. This means that if 3 Pointers attempted have increased, close-range shots may have also increased to counterbalance the effect of more 3 Pointers attempted.
We will further examine the breakdown of shots attempted by distance type in order to see why exactly the average shot distance has not changed despite our perception of an increase in 3 Pointers.
In [10]:
##Set seaborn
sns.set()
##create a subplot
fig, ax = plt.subplots()
##plot the %FGA from each distance (with separate linewidths for 16<3 and 3P)
sum_years[['%FGA 0-3','%FGA 3-10','%FGA 10-16']].plot(ax=ax,ls='-.')
sum_years[['%FGA 16<3', '%FGA 3P']].plot(ax=ax,linewidth=5)
##title it and format
plt.title('Shot Distribution by Year', color='Maroon',fontsize='16',fontweight='bold')
#title x and y label and format
ax.set_ylabel('% of Total Shots [.30 = 30%]',fontsize='14',fontweight='bold')
ax.set_xlabel('NBA Season',fontsize='14',fontweight='bold')
##place legend
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
##gets labels
locs, labels = plt.xticks()
##sets rotation of the x labels
plt.setp(labels, rotation=90)
Out[10]:
Looking closely at this graph, we can see that 3 Pointers and FGA 0-3 have both been on the rise, with these shot locations dominating in 2015-2016 (combined ~65% of total shots).
Additionally, this graph helps explain why the average shot distance has not changed much. The shot type that has declined the most is FGA 16 < 3. Because the effect we are seeing essentially replaces FGA 16 < 3 with 3 Pointers on a percentage basis, and because these two shot types are close in distance, the average distance does not change much (especially when coupled with the increase in FGA 0-3).
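The arithmetic behind this argument can be sketched with a weighted average. Everything in the block below is an assumption made for illustration: the representative distance for each zone and both sets of shot shares are hypothetical and are not taken from our data.

```python
# Hypothetical representative distance (feet) for each shot zone.
# These are illustrative assumptions, not values from the dataset.
ZONE_DISTANCE_FT = {'0-3': 1.5, '3-10': 6.5, '10-16': 13.0, '16<3': 19.0, '3P': 25.0}


def average_distance(shares):
    """Weighted average distance for a dict of zone -> share of attempts."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError('shares must sum to 1')
    return sum(ZONE_DISTANCE_FT[zone] * share for zone, share in shares.items())


# Hypothetical shot mixes: long twos are swapped for threes,
# while close-range attempts also rise.
before = {'0-3': 0.25, '3-10': 0.15, '10-16': 0.15, '16<3': 0.30, '3P': 0.15}
after = {'0-3': 0.30, '3-10': 0.15, '10-16': 0.10, '16<3': 0.15, '3P': 0.30}

dist_before = average_distance(before)
dist_after = average_distance(after)
print(dist_before, dist_after)
```

Under these assumed numbers, the 3 Pointer share doubles while the average distance moves by only about a third of a foot, which is the kind of offsetting effect described above.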
We'd like to crystallize this trend towards layups and 3 Pointers in a visual that focuses on the prominence of these two shot types.
In this graph, we examine the proportion of the total field goal attempts of the top 5 scorers that have come from 0-3 ft. away from the hoop and beyond the 3 point line. 'Moreyball' gets its name from Houston Rockets General Manager Daryl Morey, who popularized the theory that long-range two-pointers are the least efficient shots in the game, and therefore pushes his team to shoot from only 0-3 ft or 3P (~70% of their shots in recent years).
This graph shows a generally positive trend in Moreyball since 2000, and especially since 2011-2012, right around where we saw HHI take off.
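As a sketch of how the Moreyball share is built, the block below pools attempts across players before dividing, which is the same reason the earlier cell recomputes each percentage from summed attempts instead of adding up each player's percentages. The two players and their attempt counts are hypothetical.

```python
# Hypothetical season totals for two players (not from the dataset)
players = [
    {'FGA': 1600, 'FGA 0-3': 400, 'FGA 3P': 640},
    {'FGA': 1200, 'FGA 0-3': 480, 'FGA 3P': 120},
]


def moreyball_share(rows):
    """Share of all pooled attempts that come from 0-3 ft or beyond the arc."""
    total_attempts = sum(r['FGA'] for r in rows)
    moreyball_attempts = sum(r['FGA 0-3'] + r['FGA 3P'] for r in rows)
    return moreyball_attempts / total_attempts


pooled = moreyball_share(players)

# For comparison: the simple average of each player's own share,
# which ignores that one player takes more shots than the other
individual = [(r['FGA 0-3'] + r['FGA 3P']) / r['FGA'] for r in players]
simple_average = sum(individual) / len(individual)

print(round(pooled, 3), round(simple_average, 3))
```

The pooled figure weights each player by shot volume, so it differs from the simple average whenever the players take different numbers of shots.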
In [11]:
##set seaborn
sns.set()
##create subplot
fig, ax = plt.subplots()
##plot Moreyball from sum_years
sum_years['MoreyBall'].plot(ax=ax,legend=False, linewidth=5)
##title it
plt.title('Growth of MoreyBall', color='Blue', fontsize='16',fontweight='bold')
##title x and y label
ax.set_ylabel('% of FGA [.50 = 50%]',fontsize='14',fontweight='bold')
ax.set_xlabel('NBA Season',fontsize='14',fontweight='bold')
#gets labels
locs, labels = plt.xticks()
#sets rotation of the x labels
plt.setp(labels, rotation=90)
Out[11]:
To look further into this trend and how much the game has changed, we now show two pie charts that compare shot locations in the 2000-2001 season with those in the 2015-2016 season. Anyone who watches basketball knows that they are watching a different game today than they were 15 years ago, but how different has it become? The results below tell a striking tale.
In [12]:
##Set seaborn
sns.set()
##create subplots
fig, ax = plt.subplots(2)
##create an explode tuple so we can separate 0-3 and 3P from the rest of the pie chart
explode = (0.1, 0, 0, 0,0.1)
##create a Series that only has the data from 2000-2001
twothousand=mean_years.loc['2000-01']
##only keep the data we want here (using the HHI list we made earlier)
twothousand=twothousand[HHI_list]
##plot it as a pie chart on the top plot (and set a startangle that we prefer, and display percentage breakdown)
twothousand.plot(ax=ax[0],kind='pie',legend=None,autopct='%1.1f%%',explode=explode, startangle=180)
##supertitle it
fig.suptitle('Total Distributions: 2000-01 and 2015-16 Seasons', fontsize='16', fontweight='bold')
##create a Series that only has the data from 2015-2016
twothousandfifteen=mean_years.loc['2015-16']
##only keep the data we want here (using the HHI list we made earlier)
twothousandfifteen=twothousandfifteen[HHI_list]
##plot it as a pie chart (and set a startangle that we prefer)
twothousandfifteen.plot(ax=ax[1],kind='pie',autopct='%1.1f%%',explode=explode,startangle=180)
Out[12]:
As you can clearly see, over this time frame, the percentage of 3 Pointers attempted has more than doubled and FGA 16 < 3 has more than halved. Other 2-pointers - FGA 10-16 and FGA 3-10 - have also decreased while FGA 0-3 has risen and has continued to be a critical component of the game.
We now know that while HHI and Moreyball have been on a fairly steady upward trend since 2011-2012, the years of the new millennium before that season appear to have been far more volatile. This may mean that the trend has not been around as long as this side-by-side analysis might lead one to believe.
What is clear, however, is that the game of basketball has changed dramatically over the years, and the league is beginning to catch on to something.
Ultimately, after analyzing the results from the PPG, PP36, HHI, shot distance, shot distribution, and specialized shot distribution, we have come to the conclusion that while it has taken the league some time to figure out, the NBA has ultimately seen a clear shift away from shots in the midrange, a doubling of 3 Pointers attempted, and an increase in layups / dunks as well. This seems to strengthen Daryl Morey's argument.
Our findings from the HHI make it clear that fewer shot types are responsible for most of the total shots among the top scorers in the league. Our pie charts and distance graphs indicate that this comes from a spike in 3 Pointers at the expense of long 2 Pointers, along with an increase in layups / dunks. This comes together to strengthen our original hypothesis that the league truly is searching for an edge in efficiency with its shot selection. Top scorers are scoring fewer points per game than in the past, but more points per 36 minutes.
It is important to point out that our data set is limited, and one way of making this report more robust is to add data points. Taking the top 10 or 15 scorers per season, rather than the top 5, might yield different results, and we wonder how different our conclusion would be. For example, in the last season, Stephen Curry's immense 3-point tendency had a significant pull on the season’s data, which played a large role in our project. Increasing our data would reduce the dependency on a single player's statistics.
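One way to gauge how much a single player pulls on a five-player sample is a leave-one-out check: recompute the pooled 3 Pointer share with each player removed in turn. The sketch below uses hypothetical players and attempt counts, with one deliberately extreme shooter; it is not run on our data.

```python
# Hypothetical (FGA, 3PA) season totals; 'Player A' is the extreme shooter
season = {
    'Player A': (1600, 880),
    'Player B': (1500, 300),
    'Player C': (1500, 300),
    'Player D': (1400, 280),
    'Player E': (1000, 200),
}


def three_point_share(rows):
    """Pooled share of attempts that are 3 Pointers for a dict of players."""
    total_fga = sum(fga for fga, _ in rows.values())
    total_3pa = sum(tpa for _, tpa in rows.values())
    return total_3pa / total_fga


full_share = three_point_share(season)

# Recompute the share with each player left out in turn
leave_one_out = {
    name: three_point_share({k: v for k, v in season.items() if k != name})
    for name in season
}

# The player whose removal moves the share the most
most_influential = max(leave_one_out, key=lambda n: abs(leave_one_out[n] - full_share))
print(round(full_share, 3), most_influential, round(leave_one_out[most_influential], 3))
```

If removing one player changes a season's figure this much, the trend for that season rests largely on him, which is the concern a larger sample would address.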
So Moreyball is on the rise, but has it proven to be a winning strategy on a team level? As a preliminary step towards where our analysis may lead researchers next, the data seems to show that since Moreyball has been on the rise, the top scorers with the lowest Moreyball scores are increasingly showing up on worse teams:
-In 2015-2016, DeMarcus Cousins scored lowest on the Moreyball scale. His team had the fewest wins of the five, and he was the only one of the season's top 5 scorers who failed to make the playoffs.
-In 2014-2015, Cousins, Anthony Davis and Russell Westbrook scored the lowest and their teams also had the three fewest wins of the five.
-In 2013-2014, Carmelo Anthony's Moreyball score was the lowest and his team's wins were the lowest.
-In 2012-2013, the data is scattered.
-In 2011-2012 the lower Moreyball scores actually appear on teams with more wins, with the highest of the season, Kevin Love, appearing on by far the least winning team.
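The season-by-season comparison above could be quantified with a rank correlation between each scorer's Moreyball score and his team's wins. The sketch below uses the standard Spearman formula for data without ties; the five scores and win totals are hypothetical placeholders, not the actual figures for any season.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman rank correlation, valid only when neither list has ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical values for one season's top 5 scorers
moreyball_scores = [0.70, 0.62, 0.55, 0.50, 0.41]
team_wins = [73, 55, 41, 57, 33]

rho = spearman_no_ties(moreyball_scores, team_wins)
print(round(rho, 2))
```

With only five players per season, any single coefficient is weak evidence; it would be more informative to track how the value changes from season to season, or to widen the sample.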
We are ultimately left with the question of whether Moreyball is peaking or whether the league will catch on and reverse the trend outlined above. As technology and advanced statistics increasingly permeate the Association, we predict that this current flourishing of exciting and smart basketball will continue in the near term and evolve to account for counter-strategies and newly found evidence in the long term. We look forward to seeing where the leading basketball minds of our time take the game next.