In [1]:
import pandas as pd
%matplotlib inline
The goal of this worksheet is to provide practical examples of aggregating (with group by), plotting, and pivoting data with the Pandas Python package.
This worksheet is available as a jupyter notebook on github here: https://github.com/JBed/Pandas_Analysis_Worksheet
Get the data here: https://www.kaggle.com/dansbecker/nba-shot-logs
Finally, if you have any questions, comments, or believe that I did anything incorrectly feel free to email me here: jason@jbedford.net
In [2]:
df = pd.read_csv('nba-shot-logs.zip')
The data is structured so that each row corresponds to one shot taking during the 2014-2015 NBA season (We exclude free throws).
In [18]:
df.head(2)
Out[18]:
In [4]:
df.columns
Out[4]:
Most of the column names are self-explanatory. One thing that initially confused me was that there is no column telling us the team of the player taking the shot. It turns out that that information is hidden in the MATCHUP column.
In [24]:
df.set_index('GAME_ID').loc[21400899]['MATCHUP'].unique()
Out[24]:
We see that the name of the team of the player taking the shot is the first team listed after the date. It turns out that having things structured this way is actually very convenient.
In [ ]:
Here It makes sense restructure our data so that each row is referring to one team in one game and the columns give us the number of shots made or missed. This can be done with either DataFrame.groupby or pandas.pivot_table which I show below.
In [27]:
shot_result_by_matchup = df.groupby(['MATCHUP','SHOT_RESULT']).size().unstack()
In [28]:
shot_result_by_matchup.head()
Out[28]:
In [33]:
df.pivot_table(index='MATCHUP', columns='SHOT_RESULT', \
values='W',aggfunc=lambda x: len(x)).head()
Out[33]:
Personally I find the groupby operation to be more expressive.
In [ ]:
This is done by sorting the above on made
In [37]:
shot_result_by_matchup.sort_values(by='made').head()
Out[37]:
In [38]:
shot_result_by_matchup.sort_values(by='made', ascending=False).head()
Out[38]:
In [ ]:
We'll make a new column called total and sort on that.
In [34]:
shot_result_by_matchup['total'] = shot_result_by_matchup.sum(axis=1)
In [35]:
shot_result_by_matchup.sort_values(by='total').head()
Out[35]:
These are likely game that were canceled
In [36]:
shot_result_by_matchup.sort_values(by='total',ascending=False).head()
Out[36]:
In [ ]:
we'll make a derived column called make_percent and sort on that.
In [39]:
shot_result_by_matchup['make_percent'] = \
round((shot_result_by_matchup['made'] / shot_result_by_matchup['total'])*100,1)
In [40]:
shot_result_by_matchup.sort_values(by='make_percent').head()
Out[40]:
In [41]:
shot_result_by_matchup.sort_values(by='make_percent', ascending=False).head()
Out[41]:
In [ ]:
Here we’ll make a dataframe similar to the above but with 'W' as a column
In [53]:
shot_and_game_result_by_matchup = df.groupby(['MATCHUP','W','SHOT_RESULT']).size().unstack()
In [54]:
shot_and_game_result_by_matchup.head()
Out[54]:
We want 'W' to be a column not part of the index as it is now
In [55]:
shot_and_game_result_by_matchup = \
shot_and_game_result_by_matchup.reset_index().set_index('MATCHUP')
In [56]:
shot_and_game_result_by_matchup.head()
Out[56]:
In [ ]:
Again we'll make a derived column called make_percent
In [57]:
shot_and_game_result_by_matchup['make_percent'] = \
round((shot_and_game_result_by_matchup['made']/\
shot_and_game_result_by_matchup.sum(axis=1))*100,1)
In [58]:
shot_and_game_result_by_matchup[\
shot_and_game_result_by_matchup['W']=='W'].sort_values(by='make_percent').head()
Out[58]:
In [59]:
shot_and_game_result_by_matchup[\
shot_and_game_result_by_matchup['W']=='L'].sort_values(by='make_percent', \
ascending=False).head()
Out[59]:
In [ ]:
We'll explore this question with a group and a plot
In [60]:
shot_and_game_result_by_matchup.groupby('W')['make_percent'].describe().unstack()
Out[60]:
In [61]:
shot_and_game_result_by_matchup.boxplot(column='make_percent', by='W',figsize=(12,8))
Out[61]:
In [ ]:
In [95]:
shot_result_by_gameid_and_w = \
df.groupby(['GAME_ID','W','SHOT_RESULT']).size().unstack()
In [96]:
shot_result_by_matchup_gameid.head(4)
Out[96]:
In [97]:
shot_result_by_gameid_and_w['make_percent'] = \
round((shot_result_by_gameid_and_w['made']/\
shot_result_by_gameid_and_w.sum(axis=1))*100,1)
In [98]:
shot_result_by_gameid_and_w.head(4)
Out[98]:
In [101]:
shot_result_by_gameid_and_w = \
shot_result_by_gameid_and_w.reset_index()
In [84]:
shot_result_by_gameid_and_w.head(4)
Out[84]:
In [85]:
make_percent_by_gameid = \
shot_result_by_gameid_and_w.pivot(index='GAME_ID', columns='W')
In [106]:
make_percent_by_gameid.head()
Out[106]:
In [107]:
make_percent_by_gameid[\
make_percent_by_gameid['make_percent']['W']<\
make_percent_by_gameid['make_percent']['L']].head()
Out[107]:
In [108]:
len(make_percent_by_gameid[\
make_percent_by_gameid['make_percent']['W']<\
make_percent_by_gameid['make_percent']['L']])
Out[108]:
In [109]:
len(make_percent_by_gameid)
Out[109]:
In [110]:
(197./904)*100
Out[110]:
So about %22 of the time the game is won by the team that has the lower make percentage.
In [ ]: