Data Analysis of NBA Players

This iPython notebook is meant to challenge Georgetown Data Science Certificate students to ensure they have understood and can apply the technologies and techniques taught throughout the semester. We'll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here.

Read in a CSV File

In the first step, load the dataset into a dataframe.


In [1]:
# Load CSV into a dataframe named nba

Remember to import the pandas library to get access to DataFrames. A DataFrame is a two-dimensional, tabular data structure in which each column can have a different datatype.
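
As a minimal sketch, assuming the file has been saved locally as nba_2013.csv (the filename is an assumption, not part of the assignment):

import pandas as pd

# read_csv loads the season statistics into a DataFrame named nba
nba = pd.read_csv("nba_2013.csv")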

Find the number of Players

How many players are in the dataset?


In [2]:
# Print the number of rows and columns in the dataframe
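
One possible answer, assuming the DataFrame is named nba as above:

# shape returns a (rows, columns) tuple; each row is one player
print(nba.shape)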

Look at the First Row of Data

What does the first row look like?


In [3]:
# Print the first row of data
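
A short sketch, again assuming the DataFrame is named nba:

# iloc selects rows by integer position; position 0 is the first player
print(nba.iloc[0])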

Find the Average of Each Statistic

Find the average value for each statistic. The columns have names like PER (player efficiency rating) and GP (games played) that represent the season statistics for each player. For more on the various statistics, look here.


In [4]:
# Print the mean of each column
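
One way to do this, assuming a version of pandas that supports the numeric_only keyword:

# mean() averages each column; numeric_only skips text columns such as player names
print(nba.mean(numeric_only=True))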

Make Pairwise Scatterplots

One common way to explore a dataset is to see how different columns correlate with one another. We'll compare the ast, mpg, and to columns. Create a scatter matrix plot of these columns.


In [5]:
%matplotlib inline 

# Use seaborn or pandas to plot the scatter matrix

In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. You could also use the pandas scatter_matrix function for a similar result.
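
A minimal sketch using seaborn, assuming the columns are named ast, mpg, and to as above:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatterplots of assists, minutes per game, and turnovers
sns.pairplot(nba[["ast", "mpg", "to"]].dropna())
plt.show()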

Make Clusters of the Players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar. Use scikit-learn to cluster the data.


In [6]:
# Use a clustering model like K-means to cluster the players

We can use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. In order to cluster properly, we first remove any non-numeric columns and drop missing values (NA, NaN, etc.) using the _get_numeric_data and dropna methods.
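
One possible sketch, with the number of clusters (5) and the random seed chosen arbitrarily for illustration:

from sklearn.cluster import KMeans

# Keep only numeric columns, then drop rows that still contain missing values
good_columns = nba._get_numeric_data().dropna()

# Fit k-means and record each player's cluster label
kmeans = KMeans(n_clusters=5, random_state=1)
labels = kmeans.fit_predict(good_columns)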

Plot Players by Cluster

We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.


In [7]:
# Use PCA to plot the clusters in 2 dimensions

In Python, we can use the PCA class in the scikit-learn library to reduce the data to two dimensions, and matplotlib to create the plot.
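
A sketch of this step, reusing the good_columns and labels variables from the clustering sketch above:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the numeric data down to its first two principal components
pca = PCA(n_components=2)
plot_columns = pca.fit_transform(good_columns)

# Shade each point by its k-means cluster label
plt.scatter(plot_columns[:, 0], plot_columns[:, 1], c=labels)
plt.show()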

Split into Training and Testing Sets

If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.


In [8]:
# Create train (80%) and test (20%) splits

In Python, recent versions of pandas include a sample method that returns a given proportion of rows randomly sampled from a source dataframe, which makes the code much more concise. We could also use scikit-learn's KFold and train_test_split utilities for different kinds of shuffles and splits of the data set. In both cases, we set a random seed to make the results reproducible.
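
For example, using the pandas sample approach:

# Sample 80% of the rows for training; everything else becomes the test set
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]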

Univariate Linear Regression

Let's say we want to predict the number of assists per player from the number of turnovers that player made.


In [9]:
# Compute the univariate regression of TO to AST

Scikit-learn has a linear regression model that we can fit and generate predictions from. It also provides Lasso and Ridge regressions, though regularization is less relevant in the univariate case.
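
A minimal sketch, assuming the turnover and assist columns are named to and ast (the exact casing depends on the dataset's headers) and reusing the train/test split above:

from sklearn.linear_model import LinearRegression

# Drop rows with missing values in either column before fitting
train_lr = train[["to", "ast"]].dropna()
test_lr = test[["to", "ast"]].dropna()

# Fit assists as a function of turnovers and predict on the test set
lr = LinearRegression()
lr.fit(train_lr[["to"]], train_lr["ast"])
predictions = lr.predict(test_lr[["to"]])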

Calculate Summary Statistics for the Model

Evaluate the model above using your test set. How well does it perform?


In [10]:
# Compute the regression results

If we want summary statistics about the fit, such as the r-squared value, we can use the score method of the scikit-learn model. However, if we want more advanced regression statistics, we'll need to do a bit more work. The statsmodels package makes many statistical methods available in Python and is a good tool to know.
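
A sketch of both approaches, reusing the variables from the regression sketch above:

import statsmodels.formula.api as smf

# score() returns the r-squared of the model on the held-out test set
print(lr.score(test_lr[["to"]], test_lr["ast"]))

# statsmodels gives a fuller summary (coefficients, p-values, confidence intervals)
model = smf.ols(formula="ast ~ to", data=train_lr).fit()
print(model.summary())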

Fit a Random Forest Model

Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.


In [11]:
# Fit a random forest using the predictors "AGE", "MPG", "TO", "HT", "WT", "REBR" to predict the target, "AST"
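
One possible sketch, assuming the predictor columns exist under lowercase names and are numeric (height and weight may need cleaning first); the hyperparameter values are arbitrary choices for illustration:

from sklearn.ensemble import RandomForestRegressor

predictor_columns = ["age", "mpg", "to", "ht", "wt", "rebr"]

# Drop rows with missing values in the predictors or the target before fitting
train_rf = train[predictor_columns + ["ast"]].dropna()
test_rf = test[predictor_columns + ["ast"]].dropna()

# min_samples_leaf smooths the individual trees' predictions a little
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=1)
rf.fit(train_rf[predictor_columns], train_rf["ast"])
rf_predictions = rf.predict(test_rf[predictor_columns])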

Calculate Error

Now that we've fit two models, let's calculate their error on the test set. We'll use mean squared error (MSE).


In [12]:
# Compute the MSE of each model

The scikit-learn library has a variety of error metrics that we can use.
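
For example, reusing the predictions from the two model sketches above:

from sklearn.metrics import mean_squared_error

# Compare test-set MSE for the linear regression and the random forest
print(mean_squared_error(test_lr["ast"], predictions))
print(mean_squared_error(test_rf["ast"], rf_predictions))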

Download a Webpage

Now that we have data on NBA players from 2013-2014, let’s scrape some additional data to supplement it. We’ll just look at one box score from the NBA Finals here to save time.


In [19]:
# Download "http://www.basketball-reference.com/boxscores/201506140GSW.html"

Extract Player Box Scores

Now that we have the web page, we’ll need to parse it to extract scores for players.


In [13]:
# Use BeautifulSoup to parse the table from the web page

This will create a list containing two lists, the first with the box score for CLE, and the second with the box score for GSW. Both contain the headers, along with each player and their in-game stats. We won’t turn this into more training data now, but it could easily be transformed into a format that could be added to our nba dataframe.

BeautifulSoup is the most commonly used web scraping package in Python. It enables us to loop through the tags and construct a list of lists in a straightforward way.
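
A rough sketch of the parsing step, reusing the data variable from the download sketch above; exactly which tables correspond to the two basic box scores depends on the page's structure, so the filtering is left as a generic loop here:

from bs4 import BeautifulSoup

# Parse the HTML and collect every table as a list of rows,
# where each row is a list of cell text
soup = BeautifulSoup(data, "html.parser")
box_scores = []
for table in soup.find_all("table"):
    rows = []
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        rows.append(cells)
    box_scores.append(rows)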