In this notebook we will build a neural net to predict the positions of NBA players using the Keras library.
In [654]:
%load_ext autoreload
%autoreload 2
In [655]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
We will use the Kaggle dataset "NBA Players stats since 1950", which contains statistics for all players since 1950. We are especially interested in how the passage of time affects the position of each player, and the definition of the positions themselves (a small forward in the 1960s, for example, was very different from a small forward today).
In [656]:
stats = pd.read_csv(r'data/Seasons_Stats.csv', index_col=0)
The file Seasons_Stats.csv contains the statistics of all players since 1950. First, we drop a couple of blank columns and the "Tm" column, which contains the team.
In [830]:
stats = pd.read_csv(r'data/Seasons_Stats.csv', index_col=0)
stats_clean = stats.drop(['blanl', 'blank2', 'Tm'], axis=1)
In [831]:
stats_clean.head()
Out[831]:
A second file, players.csv, contains static information for each player, such as height and weight.
In [835]:
players = pd.read_csv(r'data/players.csv', index_col=0)
players.head(10)
Out[835]:
We merge both tables, and do some data cleaning:
In [839]:
data = pd.merge(stats_clean, players[['Player', 'height', 'weight']], left_on='Player', right_on='Player', right_index=False,
how='left', sort=False).fillna(value=0)
data = data[~(data['Pos']==0) & (data['MP'] > 200)]
data.reset_index(inplace=True, drop=True)
data['Player'] = data['Player'].str.replace('*', '', regex=False)  # strip the literal asterisk, not a regex
totals = ['PER', 'OWS', 'DWS', 'WS', 'OBPM', 'DBPM', 'BPM', 'VORP', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA',
          'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
for col in totals:
    data[col] = 36 * data[col] / data['MP']
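The per-36-minutes normalization applied in the loop above can be checked by hand. The following is a minimal sketch in plain Python with made-up numbers; the function name `per_36` is an assumption used only for illustration and is not part of the notebook.

```python
def per_36(total, minutes_played):
    """Scale a season total to a per-36-minutes rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return 36 * total / minutes_played

# Hypothetical season: 1800 points in 2700 minutes played.
points_per_36 = per_36(1800, 2700)
print(points_per_36)  # 24.0
```

Because rows with 200 minutes played or fewer are filtered out earlier in the cell, the division by `MP` never hits zero in the notebook itself.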
In [840]:
data.tail()
Out[840]:
We will train a neural network with this data, to try to predict the position of each player.
One alternative we did not pursue was to map the positions to numbers from 1 to 5 (1 for a PG, 2 for a SG, 1.5 for a PG-SG, and so on, up to 5 for a C) and to use the network for regression instead of classification. We wanted to see whether the network could predict labels such as "SG-PF", so we decided to work with the categorical labels. This choice also makes the study easier to port to other domains.
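To make the discarded regression encoding concrete, here is a small sketch in plain Python. The dictionary `POSITION_VALUE` and the helper `position_to_number` are hypothetical names, and averaging the parts of a combined label is our assumption about how values such as 1.5 for a PG-SG would be obtained.

```python
# Hypothetical numeric scale for the five basic positions.
POSITION_VALUE = {"PG": 1, "SG": 2, "SF": 3, "PF": 4, "C": 5}

def position_to_number(label):
    """Average the values of the parts of a label such as 'PG-SG'."""
    parts = label.split("-")
    return sum(POSITION_VALUE[p] for p in parts) / len(parts)

print(position_to_number("PG"))     # 1.0
print(position_to_number("PG-SG"))  # 1.5
print(position_to_number("SG-PF"))  # 3.0
print(position_to_number("SF"))     # 3.0
```

Under this assumed encoding a "SG-PF" and a pure "SF" collapse to the same number, which illustrates why a regression target could not distinguish such labels.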
We convert our DataFrame into a matrix X with the inputs, and a vector y with the labels. We scale the inputs and encode the outputs as dummy variables using the corresponding sklearn utilities.
Instead of a random split, we use the 2017 season as our test data, and all previous seasons as the training set.
In [842]:
# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
X = data.drop(['Player', 'Pos', 'G', 'GS', 'MP'], axis=1).values
y = data['Pos'].values
encoder = LabelBinarizer()
y_cat = encoder.fit_transform(y)
nlabels = len(encoder.classes_)
scaler = StandardScaler()
Xnorm = scaler.fit_transform(X)
stats2017 = (data['Year'] == 2017)
X_train = Xnorm[~stats2017]
y_train = y_cat[~stats2017]
X_test = Xnorm[stats2017]
y_test = y_cat[stats2017]
Using Keras (with TensorFlow as backend), we build a neural network with two hidden layers. We use ReLU activations everywhere except in the last layer, where a softmax yields a probability for each label. We hold out 20% of the training data as a validation set, to make sure we are not overfitting.
In [851]:
model = Sequential()
model.add(Dense(40, activation='relu', input_dim=46))
model.add(Dropout(0.5))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(nlabels, activation='softmax'))
In [853]:
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
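The softmax output and the categorical cross-entropy loss chosen above can be written out in a few lines. This is a plain-Python sketch of the textbook formulas with made-up scores for three classes; the function names are ours, and the Keras implementation may differ in numerical details.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

probs = softmax([2.0, 1.0, 0.1])                    # made-up scores
loss = categorical_crossentropy([1, 0, 0], probs)   # true class is the first one
print(probs, loss)
```

The loss shrinks towards zero as the probability of the true class approaches 1, which is what training tries to achieve.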
In [854]:
# x_train and y_train are Numpy arrays --just like in the Scikit-Learn API.
model.fit(X_train, y_train, epochs=200, batch_size=128, validation_split=0.2, verbose=1)
Out[854]:
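According to the Keras documentation, `validation_split` takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. If the rows are ordered by season, as they appear to be in this dataset, the validation set consists of the most recent training seasons rather than a random sample. A plain-Python sketch of that slicing follows; the helper name `tail_split` is hypothetical and the exact rounding used by Keras is an assumption.

```python
def tail_split(n_samples, validation_split):
    """Return (train_indices, validation_indices), taking validation from the end."""
    split_at = int(n_samples * (1 - validation_split))
    indices = list(range(n_samples))
    return indices[:split_at], indices[split_at:]

train_idx, val_idx = tail_split(10, 0.2)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_idx)    # [8, 9]
```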
In [855]:
model.test_on_batch(X_test, y_test, sample_weight=None)
Out[855]:
The model performs well on both the validation and the test sets. An accuracy of 65% might not seem like much, but it is satisfactory for our problem, where the labels are very subjective (was Larry Bird a "SF-PF" or a "PF-SF"? Nobody can tell).
We now train the model again using all the training data (the 2017 season is still held out). Note that calling fit on the same model continues from the weights learned above rather than starting from scratch.
In [856]:
# Production model, using all data
model.fit(X_train, y_train, epochs=200, batch_size=128, validation_split=0, verbose=1)
Out[856]:
In [857]:
first_team_members = ['Russell Westbrook', 'James Harden', 'Anthony Davis', 'LeBron James', 'Kawhi Leonard']
first_team_stats = data[data['Player'].isin(first_team_members) & (data['Year'] == 2017)]
first_team_stats
Out[857]:
In [858]:
pd.DataFrame(index=first_team_stats.loc[:, 'Player'].values, data={'Real': first_team_stats.loc[:, 'Pos'].values,
'Predicted':encoder.inverse_transform(model.predict(Xnorm[first_team_stats.index, :]))})
Out[858]:
The model gets four of the five right. Interestingly, the player it gets wrong, Anthony Davis, can play both the PF and C positions, and last season he played more as a power forward than as a center, as the model predicts.
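The decoding step used above, `encoder.inverse_transform` applied to the predicted probabilities, amounts for a multiclass problem to picking the label with the highest probability. A minimal plain-Python sketch follows; the class list and the probabilities are made up and are not model output.

```python
def decode(probabilities, classes):
    """Return the label whose predicted probability is highest."""
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best]

classes = ["C", "PF", "PG", "SF", "SG"]   # made-up label set, sorted alphabetically
row = [0.30, 0.45, 0.05, 0.15, 0.05]      # made-up softmax output for one player
print(decode(row, classes))  # PF
```

Only the top label survives the decoding. In a case like this invented one, looking at the whole probability row would show that the second choice (C) is not far behind.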
We will now use the model to predict the positions of all the NBA MVPs since the creation of the award in 1956.
In [859]:
mvp = [(1956, 'Bob Pettit'), (1957, 'Bob Cousy'), (1958, 'Bill Russell'), (1959, 'Bob Pettit'),
(1960, 'Wilt Chamberlain'), (1961, 'Bill Russell'), (1962, 'Bill Russell'), (1963, 'Bill Russell'),
(1964, 'Oscar Robertson'), (1965, 'Bill Russell'), (1966, 'Wilt Chamberlain'), (1967, 'Wilt Chamberlain'),
(1968, 'Wilt Chamberlain'), (1969, 'Wes Unseld'), (1970, 'Willis Reed'), (1971, 'Lew Alcindor'),
(1972, 'Kareem Abdul-Jabbar'), (1973, 'Dave Cowens'), (1974, 'Kareem Abdul-Jabbar'), (1975, 'Bob McAdoo'),
(1976, 'Kareem Abdul-Jabbar'), (1977, 'Kareem Abdul-Jabbar'), (1978, 'Bill Walton'), (1979, 'Moses Malone'),
(1980, 'Kareem Abdul-Jabbar'), (1981, 'Julius Erving'), (1982, 'Moses Malone'), (1983, 'Moses Malone'),
(1984, 'Larry Bird'), (1985, 'Larry Bird'), (1986, 'Larry Bird'), (1987, 'Magic Johnson'),
(1988, 'Michael Jordan'), (1989, 'Magic Johnson'), (1990, 'Magic Johnson'), (1991, 'Michael Jordan'),
(1992, 'Michael Jordan'), (1993, 'Charles Barkley'), (1994, 'Hakeem Olajuwon'), (1995, 'David Robinson'),
(1996, 'Michael Jordan'), (1997, 'Karl Malone'), (1998, 'Michael Jordan'), (1999, 'Karl Malone'),
(2000, 'Shaquille O\'Neal'), (2001, 'Allen Iverson'), (2002, 'Tim Duncan'), (2003, 'Tim Duncan'),
(2004, 'Kevin Garnett'), (2005, 'Steve Nash'), (2006, 'Steve Nash'), (2007, 'Dirk Nowitzki'),
(2008, 'Kobe Bryant'), (2009, 'LeBron James'), (2010, 'LeBron James'), (2011, 'Derrick Rose'),
(2012, 'LeBron James'), (2013, 'LeBron James'), (2014, 'Kevin Durant'), (2015, 'Stephen Curry'),
(2016, 'Stephen Curry')]
In [860]:
mvp_stats = pd.concat([data[(data['Player'] == x[1]) & (data['Year']==x[0])] for x in mvp], axis=0)
In [861]:
mvp_stats
Out[861]:
In [862]:
mvp_pred = pd.DataFrame(index=mvp_stats.loc[:, 'Player'].values, data={'Real': mvp_stats.loc[:, 'Pos'].values,
'Predicted':encoder.inverse_transform(model.predict(Xnorm[mvp_stats.index, :]))})
In [737]:
mvp_pred
Out[737]:
The model gets most of the players right, and its errors always fall on an adjacent position. This is interesting because the model was never given any information about the distances between the labels.
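One way to quantify how close a wrong prediction is would be to reuse the 1 to 5 scale mentioned earlier (PG = 1 up to C = 5) and measure the distance between the real and the predicted label. The sketch below is plain Python; the names are hypothetical and the label pairs are made up, not actual model output.

```python
POSITION_VALUE = {"PG": 1, "SG": 2, "SF": 3, "PF": 4, "C": 5}

def position_distance(real, predicted):
    """Distance between two labels on the PG=1 ... C=5 scale."""
    def value(label):
        parts = label.split("-")
        return sum(POSITION_VALUE[p] for p in parts) / len(parts)
    return abs(value(real) - value(predicted))

# Made-up (real, predicted) pairs.
pairs = [("C", "C"), ("PF", "C"), ("SG", "PG"), ("SF", "PF")]
distances = [position_distance(r, p) for r, p in pairs]
print(distances)  # [0.0, 1.0, 1.0, 1.0]
```

A distance of at most 1 corresponds to an adjacent position. A confusion between, say, PG and C would give 4.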
The definitions of a forward or a center are always changing. In recent years, for example, there has been a trend towards scoring point guards (such as Stephen Curry) and forwards who direct the game instead of the guard (such as LeBron James). The physical requirements are also increasing: a height that would have made you a center in the 1950s makes you a forward today.
We will take the last and the first MVPs on our list, Stephen Curry and Bob Pettit, and see where our model places them in different years of NBA history.
In [863]:
curry2017 = data[(data['Player'] == 'Stephen Curry') & (data['Year']==2017)]
pettit1956 = data[(data['Player'] == 'Bob Pettit') & (data['Year']==1956)]
In [864]:
# Copy Curry's 2017 row once per season and change only the year
time_travel_curry = pd.concat([curry2017 for year in range(1956, 2018)], axis=0)
time_travel_curry['Year'] = list(range(1956, 2018))
X = time_travel_curry.drop(['Player', 'Pos', 'G', 'GS', 'MP'], axis=1).values
y = time_travel_curry['Pos'].values
y_cat = encoder.transform(y)
Xnorm = scaler.transform(X)
time_travel_curry_pred = pd.DataFrame(index=time_travel_curry.loc[:, 'Year'].values,
                                      data={'Real': time_travel_curry.loc[:, 'Pos'].values,
                                            'Predicted': encoder.inverse_transform(model.predict(Xnorm))})
# Same experiment with Pettit's 1956 row
time_travel_pettit = pd.concat([pettit1956 for year in range(1956, 2018)], axis=0)
time_travel_pettit['Year'] = list(range(1956, 2018))
X = time_travel_pettit.drop(['Player', 'Pos', 'G', 'GS', 'MP'], axis=1).values
y = time_travel_pettit['Pos'].values
y_cat = encoder.transform(y)
Xnorm = scaler.transform(X)
time_travel_pettit_pred = pd.DataFrame(index=time_travel_pettit.loc[:, 'Year'].values,
                                       data={'Real': time_travel_pettit.loc[:, 'Pos'].values,
                                             'Predicted': encoder.inverse_transform(model.predict(Xnorm))})
In [865]:
pd.concat([time_travel_curry_pred,time_travel_pettit_pred],axis=1,keys=['Stephen Curry','Bob Pettit'])
Out[865]:
Curry is labeled as a point guard (his real position) from 1973 until today, and as a shooting guard before that, perhaps because of his height (191 cm), or perhaps because he is too much of a scorer. Bob Pettit is labeled as a center until 1967, and as a power forward after that. He played both roles, but nowadays he would have difficulty playing as a center and would certainly be a forward, perhaps even a small forward.
In [866]:
magic = data[(data['Player'] == 'Magic Johnson')]
jordan = data[(data['Player'] == 'Michael Jordan')]
In [867]:
# Magic
X = magic.drop(['Player', 'Pos', 'G', 'GS', 'MP'], axis=1).values
y = magic['Pos'].values
y_cat = encoder.transform(y)
Xnorm = scaler.transform(X)
magic_pred = pd.DataFrame(index=magic.loc[:, 'Age'].values,
                          data={'Real': magic.loc[:, 'Pos'].values,
                                'Predicted': encoder.inverse_transform(model.predict(Xnorm))})
# Jordan
X = jordan.drop(['Player', 'Pos', 'G', 'GS', 'MP'], axis=1).values
y = jordan['Pos'].values
y_cat = encoder.transform(y)
Xnorm = scaler.transform(X)
jordan_pred = pd.DataFrame(index=jordan.loc[:, 'Age'].values,
                           data={'Real': jordan.loc[:, 'Pos'].values,
                                 'Predicted': encoder.inverse_transform(model.predict(Xnorm))})
In [869]:
pd.concat([magic_pred,jordan_pred],axis=1,keys=['Magic Johnson','Michael Jordan'])
Out[869]:
The model is able to detect the conversion of Jordan into a forward at the end of his career, but not the return of Magic as a power forward. Also, in his rookie season, Jordan is classified as a small forward instead of as a shooting guard. Magic was clearly an outlier in the data, a 205 cm point guard who could easily play all five positions, so it is even surprising that he is properly labeled as a point guard during most of his career.
A concern we had before training the model was that it would use height and weight as the main criteria, and that it would mislabel players such as Magic Johnson (a 205 cm point guard) or Charles Barkley (a 196 cm power forward). Somewhat surprisingly, it works properly on these two players.
We will use the 2017 All-NBA First Team again and play with the heights and weights of the players. Keeping all other statistics constant, we change the height and weight and observe how the predicted positions change.
In [1]:
first_team_stats
In [871]:
multiplier = np.arange(0.8, 1.2, 0.02)
growing_predicted = []
for p in first_team_stats.iterrows():
    # One copy of the player's row per multiplier value
    growing = pd.concat([p[1].to_frame().T for x in multiplier], axis=0)
    growing['height'] = growing['height'] * multiplier
    growing['weight'] = growing['weight'] * (multiplier ** 3)
    X = growing.drop(['Player', 'Pos', 'G', 'GS', 'MP'], axis=1).values
    y = growing['Pos'].values
    y_cat = encoder.transform(y)
    Xnorm = scaler.transform(X)
    growing_predicted.append(pd.DataFrame(index=multiplier, data={'height': growing.loc[:, 'height'].values,
        'Real': growing.loc[:, 'Pos'].values, 'Predicted': encoder.inverse_transform(model.predict(Xnorm))}))
In [874]:
pd.concat(growing_predicted,axis=1,keys=first_team_stats['Player'])
Out[874]:
As we can see, height matters, but it is not enough. Any player can be classified as a center if he is tall enough (very tall: Kawhi Leonard would need to be 221 cm tall to be considered a center), but being short is not enough to be considered a guard: a 165 cm Anthony Davis would still be considered a power forward.
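The experiment above multiplies the height by a factor and the weight by the cube of that factor: if body proportions stay the same, volume (and roughly mass) grows with the cube of the length. A minimal plain-Python sketch of that scaling follows, for a hypothetical player; the function name `scale_body` and the numbers are assumptions for illustration.

```python
def scale_body(height_cm, weight_kg, multiplier):
    """Scale height linearly and weight with the cube of the multiplier."""
    return height_cm * multiplier, weight_kg * multiplier ** 3

# Hypothetical player: 200 cm and 100 kg, made 10% taller.
height, weight = scale_body(200, 100, 1.1)
print(round(height, 1), round(weight, 1))  # 220.0 133.1
```

A 10% increase in height therefore corresponds to about a 33% increase in weight under this assumption.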