py-Goldsberry is a Python package that makes it easy to interface with http://stats.nba.com and retrieve its data in a more analyzable format.
This is the first in a series of tutorials that walk through the different modules of the package and how to use each to get different types of data.
If you've made it this far, you're probably less interested in reading about the package and more interested in actually using it.
If you don't have the package installed, use pip install to get the latest version. If you already have it, pip install --upgrade brings it up to date.
pip install py-goldsberry
pip install --upgrade py-goldsberry
When you have py-goldsberry installed, you can load the package and check the version number.
In [1]:
import goldsberry
import pandas as pd
goldsberry.__version__
Out[1]:
py-goldsberry
is designed to work in conjuntion with Pandas. Each function within the package returns data in a format that is easily converted to a Pandas DataFrame.
To get started, let's get a list of all of the players who were on an NBA roster during the 2015-16 season.
Currently, the PlayerList() function defaults to the current season. We start by creating an object, players, that we will use to scrape player data.
In [2]:
players = goldsberry.PlayerList()
players2015 = pd.DataFrame(players.players())
players2015.head()
Out[2]:
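The conversion in this cell relies on pd.DataFrame accepting a list of dictionaries and turning each key into a column. Here is a minimal sketch of that step, assuming players.players() returns records shaped like the ones below. The rows are made-up placeholders, not real output from the website; only the two column names are taken from this tutorial.

```python
import pandas as pd

# Assumption: each record is a dict keyed by column name. These rows are
# placeholders for illustration, not data returned by stats.nba.com.
records = [
    {"DISPLAY_FIRST_LAST": "Player One", "ROSTERSTATUS": 1},
    {"DISPLAY_FIRST_LAST": "Player Two", "ROSTERSTATUS": 0},
]

# Each dict becomes a row; the dict keys become the column labels.
frame = pd.DataFrame(records)
print(frame.head())
```

If the records really are dictionaries, no extra arguments are needed; pandas infers the columns from the keys.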
We can manipulate the players object to get data from different seasons by changing the API parameters and then re-running the query of the website. For example, if we want to get a list of players who were on an NBA roster during the 1990-91 season, we set the Season parameter to 1990-91 using the .get_new_data() method of the players object as follows.
In [3]:
players.get_new_data(Season = '1990-91')
Once we get the raw data from the website, we need to save it as a DataFrame in a new object.
In [4]:
players1990 = pd.DataFrame(players.players())
players1990.head()
Out[4]:
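The two season strings used so far, 2015-16 and 1990-91, combine the starting year with the last two digits of the following year. If you want to loop over many seasons, a small helper can build these strings for you. The function below is a hypothetical convenience, not part of py-Goldsberry, and it assumes every season follows the same pattern as those two examples.

```python
def season_label(start_year: int) -> str:
    """Build a season string such as '1990-91' from the year the season starts."""
    # Keep only the last two digits of the following year.
    end_year = (start_year + 1) % 100
    return f"{start_year}-{end_year:02d}"


print(season_label(1990))
print(season_label(2015))
```

The result could then be passed along as players.get_new_data(Season=season_label(1990)).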
Each class in py-Goldsberry works in a similar fashion. When you instantiate a class, it makes some assumptions about the parameters to use to query the NBA website and executes the query. If you want to change the query after instantiation, you can change the query parameters and then re-query the database with .get_new_data(). Under the hood, the .get_new_data() method takes any number of keyword arguments that it then translates to API parameters. As a sanity check, it will raise an exception if you try to set a parameter that the specific query does not take.
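To make the keyword-argument behavior concrete, here is a rough sketch of the pattern described above. QuerySketch is a hypothetical stand-in, not the package's actual code: the default values, the choice of ValueError, and the error message are all assumptions made for illustration, and no request is sent to the website.

```python
class QuerySketch:
    """Hypothetical stand-in for a py-Goldsberry class; not the package's code."""

    # Assumed defaults; the parameter names are ones used in this tutorial.
    DEFAULTS = {"Season": "2015-16", "IsOnlyCurrentSeason": 1}

    def __init__(self):
        # Start from a copy so instances never share the class-level dict.
        self.params = dict(self.DEFAULTS)

    def get_new_data(self, **kwargs):
        # Reject any keyword that is not a known parameter for this query.
        unknown = sorted(set(kwargs) - set(self.params))
        if unknown:
            raise ValueError(f"Unknown parameter(s): {', '.join(unknown)}")
        self.params.update(kwargs)
        # A real implementation would re-send the request here.


query = QuerySketch()
query.get_new_data(Season="1990-91")
print(query.params)

try:
    query.get_new_data(Seson="1990-91")  # misspelled on purpose
except ValueError as err:
    print(err)
```

Checking the keywords before updating anything means a typo in a parameter name fails loudly instead of silently querying with the old value.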
Each class takes a specific set of parameters. py-Goldsberry is built to include a list of each parameter as well as a default value. I'm working on a dictionary of parameters and possible values each can take. Look for it to be posted in the near future. Until then, you can access the raw parameter dictionary by calling the .get_parameter_items() method of each class. This gives you the possible values that the query can take.
As you saw above, you can change a parameter's default value by passing a keyword argument, with the keyword being the parameter name and the argument being the desired value.
In [5]:
players.get_parameter_items()
Out[5]:
In the case of the PlayerList() class, you can get a historical list of players by changing the value of 'IsOnlyCurrentSeason' from 1 to 0.
In [6]:
players.get_new_data(IsOnlyCurrentSeason = 0)
playersAllTime = pd.DataFrame(players.players())
playersAllTime.head()
Out[6]:
By default, Goldsberry is set to pull data from the current year. If you are interested in alternative data from the get-go, you can set the default parameters to your desired values upon instantiation of the class. Let's check out an example of getting the all-time player list from a brand new object.
In [7]:
new_playersAllTime = pd.DataFrame(goldsberry.PlayerList(IsOnlyCurrentSeason=0).players())
new_playersAllTime.head()
Out[7]:
In [8]:
playersAllTime.equals(new_playersAllTime)
Out[8]:
Well, it looks like these data frames aren't quite identical. Why is that?
Take a look at the ROSTERSTATUS column. When we first asked for the all-time players, remember we had set the base year to 1990-91? Alaa Abdelnaby was actually on a roster during that season (Portland, to be specific), so he has a value of 1 in the ROSTERSTATUS column. Since he was not in the league during the current season, he has a 0 in that column for the second pull. Let's compare just the names and see if we get an exact match. That will further reinforce that we have the same data, but we are looking at it from different points in time.
In [9]:
playersAllTime.loc[:, 'DISPLAY_FIRST_LAST'].equals(new_playersAllTime.loc[:, 'DISPLAY_FIRST_LAST'])
Out[9]:
Success!
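When .equals() returns False, it can be quicker to ask pandas which columns differ than to inspect the frames by eye. The sketch below uses made-up placeholder rows rather than real player data, and it assumes both frames have the same columns and the same row order.

```python
import pandas as pd

# Placeholder rows: same names, different ROSTERSTATUS for the first player.
first_pull = pd.DataFrame(
    {"DISPLAY_FIRST_LAST": ["Player One", "Player Two"], "ROSTERSTATUS": [1, 0]}
)
second_pull = pd.DataFrame(
    {"DISPLAY_FIRST_LAST": ["Player One", "Player Two"], "ROSTERSTATUS": [0, 0]}
)

# Compare column by column; Series.equals also checks dtype and index.
differing = [
    column
    for column in first_pull.columns
    if not first_pull[column].equals(second_pull[column])
]
print(differing)
```

Applied to playersAllTime and new_playersAllTime, a check like this would list the columns responsible for the mismatch.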
This notebook outlines the general workflow for working with py-Goldsberry. I'll post additional notebooks outlining other data pulls and illustrating some of the other features of the package and possibilities with the data.