py-Goldsberry is a Python package that makes it easy to interface with http://stats.nba.com and retrieve the data in a more analyzable format.
This is the first in a series of tutorials that walk through the different modules of the package and how to use each to get different types of data.
If you've made it this far, you're probably less interested in reading about the package and more interested in actually using it.
If you don't have the package installed, use pip install to get the latest version. If you already have it, add --upgrade to update it:
pip install py-goldsberry
pip install --upgrade py-goldsberry
When you have py-goldsberry installed, you can load the package and check the version number.
In [1]:
import goldsberry
import pandas as pd
goldsberry.__version__
Out[1]:
In [2]:
players2014 = goldsberry.PlayerList(2014)
players2014 = pd.DataFrame(players2014)
players2014.head()
Out[2]:
If you want to get players who were on an NBA roster during the 1990-91 season, you can pass 1990 to goldsberry.PlayerList().
In [3]:
players1990 = goldsberry.PlayerList(1990)
players1990 = pd.DataFrame(players1990)
players1990.head()
Out[3]:
You can pass any year to the PlayerList() function to get the roster of players from that season. Alternatively, you may want a list of every player who has been on an NBA roster at any point in the history of the league. You can retrieve this list by passing AllTime=True to the PlayerList() function.
In [4]:
players_alltime = goldsberry.PlayerList(AllTime=True)
players_alltime = pd.DataFrame(players_alltime)
players_alltime.sample(10)
Out[4]:
I just sampled 10 random players from the all-time list to illustrate that it contains a combination of historic and current NBA players.
The PlayerList() function is critical to the usage of other parts of the package. If you are interested in player-level data, I highly recommend creating a list of players that you are interested in by using this function. You can refer to this list later.
One of the major modules of py-goldsberry is the game module. Within that module lies a set of classes that extract information at a game level. There are two key sub-types of data in the module: box score and play-by-play. To access this data, you will need a specific GameID.
These GameIDs are not straightforward to find through the stats.nba.com website. py-goldsberry has a built-in function that links to a table I have created containing all of the GameIDs from the first game in NBA history through the end of the 2014-15 season.
To access this table of GameIDs, use the GameIDs() function.
In [5]:
gameids = goldsberry.GameIDs()
gameids = pd.DataFrame(gameids)
gameids.sample(10)
Out[5]:
This table is fairly raw at this point. I'm still in the process of augmenting and making the data more easily searchable. For now, it may make sense to filter by a specific season or date. In the GAMECODE
column, the code breaks down into the date followed by the initials of the two teams involved.
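As a rough sketch of how the GAMECODE column could be split for filtering, the example below uses a small toy table rather than the real output of GameIDs(). The exact layout assumed here (an eight-digit date, a slash, then two three-letter team codes) and the column names GAME_DATE, TEAM_A, and TEAM_B are assumptions for illustration, so check them against your own table first.

```python
import pandas as pd

# Toy stand-in for the table returned by goldsberry.GameIDs().
# The "YYYYMMDD/AAABBB" layout is an assumption for illustration.
games = pd.DataFrame({
    "GAMECODE": ["20141028/ORLNOP", "20141028/DALSAS", "20150415/DENGSW"],
})

# Split each code into its date part and its two three-letter team codes.
parts = games["GAMECODE"].str.extract(
    r"^(?P<DATE>\d{8})/(?P<TEAM_A>[A-Z]{3})(?P<TEAM_B>[A-Z]{3})$"
)
games["GAME_DATE"] = pd.to_datetime(parts["DATE"], format="%Y%m%d")
games["TEAM_A"] = parts["TEAM_A"]
games["TEAM_B"] = parts["TEAM_B"]

# Filter to the games played on a single date.
opening_night = games[games["GAME_DATE"] == pd.Timestamp("2014-10-28")]
print(opening_night)
```

With the date stored as a real datetime column, filtering by a date range or by month becomes an ordinary pandas comparison.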
As with PlayerIDs, this table will likely be used fairly often. It is best to pull the list of games into an object at the very beginning of the analysis so that it is easy to filter later.
A third module, team, requires the use of unique teamIDs. I'm still in the process of building a simple way to arrive at a searchable table, but you can get a list of IDs (not matched to team names) by filtering the gameids table we just created.
In [6]:
filter_season = '2014'
# .ix has been removed from pandas; .loc does the same label-based selection
teamids = gameids.loc[gameids['SEASON'] == filter_season, 'HOME_TEAM_ID'].drop_duplicates()
teamids.head()
Out[6]:
You will need to make sure you pass the year you wish to filter by as a string or you will need to change the datatype of the season column to numeric before you filter.
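To make the string-versus-number point concrete, here is a minimal sketch on a toy table. The frame toy_games is a hypothetical stand-in for gameids, and it assumes the season values are stored as strings, as the note above implies.

```python
import pandas as pd

# Toy stand-in for gameids; SEASON holds Python strings (object dtype).
toy_games = pd.DataFrame({
    "SEASON": pd.Series(["2013", "2014", "2014"], dtype=object),
    "HOME_TEAM_ID": [1, 2, 3],
})

# Comparing a column of strings with an integer matches nothing.
no_match = toy_games[toy_games["SEASON"] == 2014]

# Option 1: filter with a string.
as_string = toy_games[toy_games["SEASON"] == "2014"]

# Option 2: convert the column to numeric, then filter with an integer.
toy_games["SEASON"] = pd.to_numeric(toy_games["SEASON"])
as_number = toy_games[toy_games["SEASON"] == 2014]

print(len(no_match), len(as_string), len(as_number))
```

The mismatched comparison does not raise an error; it just returns an empty result, which is why an unexpectedly empty teamids is worth checking against the column's dtype.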
While this list is comprehensive in terms of unique teamIDs for the 2014-15 season, it is not matched with team names, so it is not as useful as it could be. We can use one of the classes within the team module to get some additional information and, with a few lines of code, have a more descriptive table of teamIDs.
We'll start by getting information for a single team. Then we'll put together a loop that creates a searchable/sortable dataframe.
In [7]:
teaminfo = goldsberry.team.team_info(teamids.iloc[0])
In [8]:
pd.DataFrame(teaminfo.info())
Out[8]:
As you can see above, calling the team_info() class within the team module returns an object, which we saved as teaminfo. To get the actual data, we call the info() method, which is part of the teaminfo object that we created. This is the standard pattern for almost all of py-goldsberry. The package is built this way to minimize the number of calls that need to be made to the NBA servers while returning a maximum amount of data.
In general, all calls are classes. Each class has methods associated with the variety of data that is retrieved when a unique call is made to the NBA website. When you save each class as an object, you immediately make a call to the website and the data which is retrieved is stored within the object and accessible through the use of object specific methods. If that doesn't make sense, don't worry. Just keep following the tutorials and you'll get the hang of how to use it without necessarily needing to understand the underlying mechanics.
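If it helps, the pattern described above can be sketched with a small stand-alone class. This is not py-goldsberry's actual code: TeamInfoSketch, fake_fetch, and the season_ranks() method are hypothetical names, and the network request is replaced by a plain function so the example runs offline.

```python
class TeamInfoSketch:
    """Hypothetical illustration of the fetch-once pattern; not py-goldsberry code."""

    def __init__(self, team_id, fetch):
        # One request is made when the object is created. `fetch` stands in
        # for the HTTP call and returns a dict of named result sets.
        self._results = fetch(team_id)

    def info(self):
        # Accessor methods read from the stored response; no new request.
        return self._results["info"]

    def season_ranks(self):
        return self._results["season_ranks"]


calls = []


def fake_fetch(team_id):
    # Stand-in for a network request; records each call so we can count them.
    calls.append(team_id)
    return {
        "info": [{"TEAM_ID": team_id, "TEAM_NAME": "Example"}],
        "season_ranks": [{"TEAM_ID": team_id, "PTS_RANK": 1}],
    }


team = TeamInfoSketch(42, fake_fetch)
team.info()
team.season_ranks()
print(len(calls))  # prints 1: both methods reused the single stored response
```

The design choice is that creating the object costs one request, and every method after that is just a lookup into data you already have.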
After that brief digression, back to creating a table of teamIDs with rich information. We can create a nice table by implementing a simple loop that gathers information on each team and merges it into a single dataframe.
In [9]:
teamids_full = pd.DataFrame() # Create empty Data Frame
for i in teamids.values:
    # One call per team; append that team's info to the running table
    team = goldsberry.team.team_info(i)
    teamids_full = pd.concat([teamids_full, pd.DataFrame(team.info())])
In [10]:
teamids_full
Out[10]:
Now you have three tables of highly valuable information for utilizing the rest of the package: players_alltime, gameids, and teamids_full.
If you feel comfortable with what we have so far, go forth and collect data! If you want a bit more help, check out some of the other tutorials I've put together.