This IPython notebook is meant to challenge Georgetown Data Science Certificate students to ensure they have understood the technologies and techniques taught throughout the semester. We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here.
In [1]:
# Load CSV into a dataframe named nba
In [2]:
# Print the number of rows and columns in the dataframe
In [3]:
# Print the first row of data
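One way to sketch these first three cells, using an in-memory stand-in for the downloaded CSV (the column names and values here are illustrative; with the real file you would pass its path to pd.read_csv):

```python
import io
import pandas as pd

# Stand-in for the downloaded file; in the notebook you would read the real CSV,
# e.g. pd.read_csv("nba_2013.csv") -- that filename is an assumption.
csv_data = io.StringIO(
    "Name,GP,MPG,TO,AST,PER\n"
    "Player A,82,36.5,3.1,7.2,25.0\n"
    "Player B,75,30.2,2.0,4.5,18.3\n"
)
nba = pd.read_csv(csv_data)

print(nba.shape)    # (number of rows, number of columns)
print(nba.iloc[0])  # first row of data
```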
Find the average value for each statistic. The columns have names like PER (player efficiency rating) and GP (games played) that represent the season statistics for each player. For more on the various statistics, look here.
In [4]:
# Print the mean of each column
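A minimal sketch of the column means, on a small stand-in frame (in the notebook, nba is the dataframe loaded from the CSV):

```python
import pandas as pd

# Stand-in frame; in the notebook, `nba` holds the full player data.
nba = pd.DataFrame({"Name": ["A", "B"], "GP": [82, 74], "AST": [7.0, 4.0]})

# numeric_only=True skips non-numeric columns like player names
print(nba.mean(numeric_only=True))
```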
In [5]:
%matplotlib inline
# Use seaborn or pandas to plot the scatter matrix
In Python, matplotlib is the primary plotting package, and seaborn is a widely used statistical visualization layer built on top of matplotlib. You could also use the pandas scatter_matrix function for a similar result.
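A sketch using the pandas scatter_matrix on stand-in numeric columns (in the notebook, %matplotlib inline handles the backend; the Agg backend here just lets the example run headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import pandas as pd
from pandas.plotting import scatter_matrix

# Stand-in numeric data; in the notebook, use the numeric columns of `nba`.
nba = pd.DataFrame({"AST": [7.0, 4.5, 6.1, 2.2],
                    "TO":  [3.1, 2.0, 2.7, 1.1],
                    "PER": [25.0, 18.3, 21.4, 12.8]})

# One subplot per pair of columns, histograms on the diagonal
axes = scatter_matrix(nba, figsize=(6, 6))
print(axes.shape)
```

With seaborn, sns.pairplot(nba) produces a similar grid.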
One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar. Use Scikit-Learn to cluster the data.
In [6]:
# Use a clustering model like K-means to cluster the players
We can use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. In order to cluster properly, we first remove non-numeric columns and rows with missing values (NA, NaN, etc.) using the _get_numeric_data and dropna methods.
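A minimal sketch on randomly generated stand-in stats (the column names and the choice of 5 clusters are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in player stats; in the notebook these are the columns of `nba`.
rng = np.random.default_rng(1)
nba = pd.DataFrame(rng.normal(size=(30, 4)), columns=["AST", "TO", "PER", "MPG"])

# Keep only numeric columns and drop rows with missing values before clustering
good_columns = nba._get_numeric_data().dropna(axis=0)

kmeans = KMeans(n_clusters=5, random_state=1, n_init=10)
labels = kmeans.fit_predict(good_columns)
print(labels[:10])
```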
We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.
In [7]:
# Use PCA to plot the clusters in 2 dimensions
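A sketch of the PCA projection and cluster-shaded scatter plot, on stand-in data (cluster count and data shape are assumptions; in the notebook, reuse the cleaned numeric data and the k-means labels from the previous step):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; %matplotlib inline covers this in the notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in numeric player stats
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))

labels = KMeans(n_clusters=5, random_state=1, n_init=10).fit_predict(X)

# Project to 2 dimensions and shade each player by cluster assignment
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()
print(coords.shape)
```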
In [8]:
# Create train (80%) and test (20%) splits
In Python, recent versions of pandas include a sample method that returns a given proportion of rows randomly drawn from a source dataframe, which makes the code much more concise. We could also use scikit-learn's KFold and train_test_split for other kinds of shuffles and splits of the data set. In both cases, we set a random seed to make the results reproducible.
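A sketch of the 80/20 split with the pandas sample method, on a stand-in frame:

```python
import pandas as pd

# Stand-in frame; in the notebook this is the full `nba` dataframe.
nba = pd.DataFrame({"TO": range(100), "AST": range(100)})

# Randomly sample 80% of rows for training; the remainder is the test set.
# A fixed random_state makes the split reproducible.
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]

print(len(train), len(test))
```

The scikit-learn equivalent is train_test_split(nba, test_size=0.2, random_state=1) from sklearn.model_selection.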
Let’s say we want to predict the number of assists per player from the number of turnovers per player.
In [9]:
# Compute the univariate regression of TO to AST
Scikit-learn has a linear regression model that we can fit and generate predictions from. It also provides the regularized Lasso and Ridge variants, though regularization offers little benefit in the univariate case.
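A minimal sketch of the univariate fit, on stand-in train/test frames (in the notebook these come from the 80/20 split above; the values here are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in splits; in the notebook, use the train/test frames from `nba`.
train = pd.DataFrame({"TO": [1.0, 2.0, 3.0, 4.0], "AST": [2.0, 4.0, 6.0, 8.0]})
test = pd.DataFrame({"TO": [2.5], "AST": [5.0]})

# Fit assists as a function of turnovers; scikit-learn expects a
# 2-dimensional predictor, hence the double brackets.
lr = LinearRegression()
lr.fit(train[["TO"]], train["AST"])
predictions = lr.predict(test[["TO"]])
print(predictions)
```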
Evaluate the above model using your test set. How well does it perform?
In [10]:
# Compute the regression results
If we want summary statistics about the fit, like the r-squared value, we can use the score method of the scikit-learn model. However, if we want more advanced regression statistics we’ll need to do a bit more. The statsmodels package makes many statistical methods available in Python and is a good tool to know.
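A sketch with the statsmodels formula API, on stand-in data (in the notebook you would fit on the training split; the values here are illustrative):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in training data roughly following AST ~ 2 * TO
train = pd.DataFrame({"TO": [1.0, 2.0, 3.0, 4.0, 5.0],
                      "AST": [2.1, 3.9, 6.2, 7.8, 10.1]})

# The formula API mirrors R-style model specification
fit = smf.ols(formula="AST ~ TO", data=train).fit()
print(fit.summary())  # full regression table: coefficients, r-squared, p-values
```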
Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.
In [11]:
# Compute random forest from the predictors "AGE", "MPG", "TO", "HT", "WT", "REBR" to the target, "AST"
In [12]:
# Compute the MSE of the random forest predictions on the test set
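A sketch covering both cells, fitting a random forest on the named predictors and scoring it with MSE. The synthetic data below is a stand-in (in the notebook, use the train/test split of nba), and the hyperparameters are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

predictor_columns = ["AGE", "MPG", "TO", "HT", "WT", "REBR"]

# Stand-in data where AST depends nonlinearly-ish on MPG plus noise
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.normal(size=(50, 6)), columns=predictor_columns)
train["AST"] = train["MPG"] * 2 + rng.normal(scale=0.1, size=50)
test = pd.DataFrame(rng.normal(size=(10, 6)), columns=predictor_columns)
test["AST"] = test["MPG"] * 2 + rng.normal(scale=0.1, size=10)

# min_samples_leaf keeps individual trees from overfitting tiny leaves
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=1)
model.fit(train[predictor_columns], train["AST"])
predictions = model.predict(test[predictor_columns])

mse = mean_squared_error(test["AST"], predictions)
print(mse)
```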
In [19]:
# Download "http://www.basketball-reference.com/boxscores/201506140GSW.html"
In [13]:
# Use BeautifulSoup to parse the table from the web page
This will create a list containing two lists, the first with the box score for CLE, and the second with the box score for GSW. Both contain the headers, along with each player and their in-game stats. We won’t turn this into more training data now, but it could easily be transformed into a format that could be added to our nba dataframe.
BeautifulSoup is the most commonly used web scraping package in Python. It enables us to loop through the tags and construct a list of lists in a straightforward way.
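A sketch of the parsing loop on an embedded stand-in table (in the notebook, you would first download the page, e.g. with requests, and parse the two box-score tables; the class name stats_table is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded box-score page
html = """
<table class="stats_table">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>32</td></tr>
  <tr><td>Player B</td><td>18</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One list per table; each inner list holds the cells of one row
box_scores = []
for table in soup.find_all("table", class_="stats_table"):
    rows = []
    for row in table.find_all("tr"):
        rows.append([cell.get_text() for cell in row.find_all(["th", "td"])])
    box_scores.append(rows)

print(box_scores)
```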