Given a dataset of NBA players' performance and salaries in 2014, use Python to load the dataset and compute summary statistics (mean, median, mode, minimum, and maximum) for the SALARY field.
You will need the csv module to load the data and interact with it. The computations require only simple arithmetic. (For this exercise, try to use pure Python and no third-party dependencies such as Pandas; you can compare and contrast the Pandas approach later.)
Bonus:
Determine the relationship of PER (Player Efficiency Rating) to Salary via a visualization of the data.
NBA 2014 Players Dataset: http://bit.ly/gtnbads
In [12]:
# Imports - you'll need some of these later, but it's traditional to put them all at the beginning.
import os
import csv
import json
import urllib2
from collections import Counter
from operator import itemgetter
You have a couple of options for fetching the data set to begin your analysis:
In either case, you'll have to be cognizant of where the CSV file lands. Here is a quick implementation of a function that downloads the contents of a URL and writes them to a file on disk. There are many other ways to do this, as outlined in "How do I download a file over HTTP using Python?".
In [4]:
def download(url, path):
    """
    Downloads a URL and writes it to the specified path. The "path"
    is like the mailing address for the file - it tells the function
    where on your computer to send it!
    Also note the use of "with" to automatically close files - this
    is a good standard practice to follow.
    """
    response = urllib2.urlopen(url)
    # Write in binary mode so the downloaded bytes land on disk unchanged.
    with open(path, 'wb') as f:
        f.write(response.read())
    response.close()
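The cell above targets Python 2, where urllib2 exists. As a rough sketch for readers on Python 3 (an assumption about your environment, not part of the original exercise), the same idea can be written with urllib.request. The name download_py3 is hypothetical and chosen only to avoid clashing with download above.

```python
import shutil
import urllib.request


def download_py3(url, path):
    """Python 3 variant of download(): copy the response body to path."""
    # urlopen returns a file-like object that also works as a context manager,
    # so the connection is closed even if writing fails.
    with urllib.request.urlopen(url) as response:
        # The body arrives as bytes, so the file is opened in binary mode.
        with open(path, 'wb') as f:
            shutil.copyfileobj(response, f)
```

It takes the same two arguments as download, so it can be called the same way.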
Your turn: use the above function to download the data!
In [6]:
## Write the Python to execute the function and download the file here:
Now that we have the CSV file that we're looking for, we need to be able to open the file and read it into memory. The trick is that we want to read only a single line at a time; consider really large CSV files. Python provides memory-efficient iteration in the form of generators and iterators, and the csv.reader function returns one such iterator, which reads the data from the CSV one row at a time. Moreover, we also want to parse our data so that we have specific access to the fields we're looking for. The csv.DictReader class will give you each row as a dictionary, where the keys are derived from the first (header) line of the file.
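To see this behaviour in isolation, here is a small Python 3 sketch that runs csv.DictReader over an in-memory string instead of a file. The two rows and the PLAYER column are made up for illustration; only SALARY and PER are column names taken from the exercise.

```python
import csv
import io

# Made-up rows standing in for the real file (PLAYER is a hypothetical column).
SAMPLE = "PLAYER,SALARY,PER\nPlayer A,1000000,15.2\nPlayer B,2500000,18.7\n"


def iter_rows(text):
    """Yield one dictionary per CSV row, parsed lazily from text."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield row


for row in iter_rows(SAMPLE):
    # Every value comes back as a string until it is converted explicitly.
    print(row['PLAYER'], type(row['SALARY']).__name__, row['PER'])
```

Note that SALARY is reported as a str, which is why numeric columns need an explicit int() or float() conversion before doing arithmetic.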
Here is a function that reads data from disk one line at a time and yields it to the user.
In [8]:
def read_csv(path):
    # First open the file
    with open(path, 'r') as f:
        # Create a DictReader to parse the CSV
        reader = csv.DictReader(f)
        for row in reader:
            # HINT: Convert SALARY column values into integers & PER column into floats.
            # Otherwise CSVs can turn ints into strs! You'll thank me later :D
            row['SALARY'] = int(row['SALARY'])
            row['PER'] = float(row['PER'])
            # Now yield each row one at a time.
            yield row
Your turn: use the above function to open the file and print out the first row of the CSV!
To do this, you'll need to do three things:
First, remember where you told the download function to store your file? Pass that same path into read_csv:
In [ ]:
## Write the Python to execute our read_csv function.
Next step: The read_csv function "returns" a generator. How can we access just the first row? Remember how to access the next row of a generator?
In [3]:
## Now write the Python to print the first row of the CSV here.
Are there different ways to print the first n rows of something? Sure! Try using break, which will stop a for loop from running. E.g. the code:
    for idx in xrange(100):
        if idx > 10:
            break
...will exit the for loop when idx reaches 11, after idx has run from 0 through 10.
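Another way to take the first n items, sketched here for Python 3, is itertools.islice, which stops pulling from a generator after n items without an explicit break. The count_up generator and the first_n helper are hypothetical stand-ins; any generator could take count_up's place.

```python
from itertools import islice


def count_up():
    """A stand-in generator that yields 0, 1, 2, ... forever."""
    n = 0
    while True:
        yield n
        n += 1


def first_n(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


print(first_n(count_up(), 3))
```

Because islice is lazy, this works even though count_up never finishes on its own.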
Next, write a for loop that can access and print every row.
In [4]:
## Write the Python to print *every* row of the CSV here.
In [ ]:
data = list(read_csv('fixtures/nba_players.csv')) #Put in your own path here.
data = sorted(data, key=itemgetter('SALARY'))
total = 0
count = 0
for row in data:
    count += 1
    total += row['SALARY']
# Total Count
print "There are %d total players." % count
# Write the Python to get the median
median =
print "The median salary is %d." % median
# Write the Python to get the minimum
minimum =
print "The minimum salary is %d." % minimum
# Write the Python to get the maximum
maximum =
print "The maximum salary is %d." % maximum
# Write the Python to get the mean
mean =
print "The mean salary is %d." % mean
Nice work! Now... calculating the mode is a bit different. Remember the Decorate-Sort-Undecorate pattern that we learned about in Think Python? That will work here!
In [ ]:
## Write the Python to get the mode of the salaries.
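As a reminder of how Decorate-Sort-Undecorate works, here is a minimal Python 3 sketch on a made-up list of numbers rather than the real salaries. The function name mode_dsu is hypothetical, and breaking ties toward the larger value is just a side effect of sorting tuples in reverse, not a rule from the exercise.

```python
def mode_dsu(values):
    """Return the most frequent value using Decorate-Sort-Undecorate."""
    # Count occurrences with a plain dictionary.
    freqs = {}
    for value in values:
        freqs[value] = freqs.get(value, 0) + 1
    # Decorate: pair each value with its count, count first.
    decorated = [(count, value) for value, count in freqs.items()]
    # Sort: tuples compare by count, then by value, so the top pair wins.
    decorated.sort(reverse=True)
    # Undecorate: strip the count from the top pair.
    return decorated[0][1]


sample = [500, 700, 500, 900, 700, 500]  # made-up values
print(mode_dsu(sample))
```

An empty list would raise an IndexError here, so a real version should decide how to handle that case.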
The "DSU" approach is a little inefficient. Instead of using a plain dictionary as our data type to solve the mode problem, we could use Counter from the collections module. Read more about Counter and try it out here:
In [ ]:
## Experiment with using Counter here.
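For orientation, a tiny Python 3 sketch of the counting class from the collections module on made-up values (not the real salary data) might look like this:

```python
from collections import Counter

sample = [500, 700, 500, 900, 700, 500]  # made-up values
freqs = Counter(sample)

# most_common(n) returns up to n (value, count) pairs, most frequent first.
print(freqs.most_common(2))
```

Each entry is a (value, count) tuple, so the mode itself is the first element of the first pair.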
In [19]:
def statistics(path):
    """
    Takes as input a path to pass to `read_csv` and computes
    the summary statistics for the SALARY field.
    """
    # Uncomment below to load the CSV into a list
    # data = list(read_csv(path))
    # Fill in the function here
    stats = {
        'maximum': data[-1]['SALARY'],
        'minimum': data[0]['SALARY'],
        'median': data[count / 2]['SALARY'], # Any potential problems here?
        'mode': freqs.most_common(2),
        'mean': total / count,
    }
    return stats
Keep playing with the above function to get it to work more efficiently or to reduce bad data in the computation - e.g. what are all those zero salaries?
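One potential problem with indexing at count / 2: when the number of rows is even, it picks the upper of the two middle values instead of averaging them, and under Python 3 the division yields a float that cannot be used as an index. A hedged Python 3 sketch of a more careful median follows, using made-up numbers. Treating zero salaries as bad data to filter out is an assumption here, not something the data set confirms.

```python
def median(sorted_values):
    """Median of an already-sorted, non-empty list of numbers."""
    count = len(sorted_values)
    mid = count // 2  # floor division keeps the index an integer
    if count % 2 == 1:
        return sorted_values[mid]
    # Even count: average the two middle values.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2.0


salaries = [0, 0, 400, 600, 800, 1000]  # made-up, already sorted
# Assumption: a zero salary means missing data rather than a real salary.
nonzero = [s for s in salaries if s > 0]
print(median(salaries), median(nonzero))
```

Dropping the zeros shifts the median upward, which shows how much a handful of placeholder values can distort a summary statistic.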
Congratulations if you've made it this far! It's time for the bonus round!
You now have some summary statistics about the salaries of NBA players, but what we're really interested in is the relationship between SALARY and the rest of the fields in the data set. PER (Player Efficiency Rating) is an aggregate score of all performance statistics; therefore, if we determine the relationship of PER to SALARY, we might learn a lot about how to model NBA salaries.
In order to explore this, let's create a scatter plot of SALARY to PER, where each point is an NBA player.
Visualization is going to require a third-party library. You probably already have matplotlib, so that might be the simplest choice if you're having trouble with installation. If you don't, pip install it now! Follow the documentation to create the scatter plot inline in the notebook in the following cells.
In [20]:
# Insert your Python to create the visualization here
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Makes the plot appear inline in your IPython Notebook.
# (Keep this comment on its own line; text after a magic command is read as an argument.)
%matplotlib inline

def read_data(path):
    # Pandas is an efficient way to wrangle the data quickly.
    # pd.read_csv already returns a DataFrame.
    return pd.read_csv(path)

def graph_data(path, xkey='PER', ykey='SALARY'):
    data = read_data(path)
    ## Fill this in yourself!
    plt.show()

graph_data('fixtures/nba_players.csv') # Or whatever your path is
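A scatter plot shows the shape of the relationship; if you also want a single number to go with it, one option (an addition here, not something the exercise asks for) is the Pearson correlation coefficient, which can be computed in pure Python 3. The sample values below are made up and perfectly linear, so they say nothing about the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Sum of products of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


per = [10.0, 15.0, 20.0, 25.0]                # made-up PER values
salary = [1000000, 2000000, 3000000, 4000000]  # made-up salaries
print(pearson(per, salary))
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. The function divides by zero if either column is constant, so a real version would need to guard against that.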
Nice work!! Matplotlib is pretty useful, but also kind of bare bones. Once you're ready to experiment with other libraries and take your visualizations to the next level, check out the following:
Our favorite is Bokeh - it's interactive!