pygameday Example Notebook

This Jupyter notebook walks through basic pygameday usage.

First, import the GameDayClient class. We also import datetime, which we'll use to construct the dates we want to process.


In [ ]:
from pygameday import GameDayClient
from datetime import datetime

Now specify a URI for your database. This example uses SQLite, a file-based database that can exist locally on your system. SQLite is probably the easiest type of database to run if you haven't used one extensively before.

If you are familiar with databases, or have a different implementation already handy, you can specify a different type of URI. Here are some example URIs for popular open-source databases.

  • PostgreSQL: postgresql://user:password@host:port/database_name
  • MySQL: mysql://user:password@host/database_name

pygameday is built on SQLAlchemy, so you can use any database that SQLAlchemy supports, including Oracle and others.
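For example, a hypothetical PostgreSQL URI for a local server might look like the following. The user, password, host, port, and database name here are placeholders; substitute your own.


In [ ]:
# Hypothetical PostgreSQL URI -- the credentials, host, port, and database name are placeholders.
# The rest of this notebook works the same way; only the URI string changes.
postgres_uri = "postgresql://user:password@localhost:5432/gameday"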

The GameDayClient is instantiated with the database URI. You can optionally set the log_level keyword argument to control the verbosity of the output. The acceptable values for log_level, from most verbose to least verbose, are "DEBUG", "INFO", "WARN", "ERROR", and "CRITICAL".

A GameDayClient also has an ingest_spring_training keyword argument that controls whether spring training and exhibition games will be inserted.

To make processing go faster, you can set the n_workers keyword argument to a number greater than 1 (the default is 4). This kicks off multiple processes to handle game inserts in parallel. If you're using SQLite, you may want to set n_workers=1 to avoid attempts to write to a locked database.


In [ ]:
# Use an SQLite URI. A database file named `gameday.db` will be created in the current directory
# if it doesn't already exist
database_uri = "sqlite:///gameday.db" 

# Instantiate a GameDayClient with the above URI, a moderate log level, without ingesting exhibition games,
# and without processing games in parallel
client = GameDayClient(database_uri, log_level="INFO", ingest_spring_training=False, n_workers=1)

Before we do anything else, use the client's db_stats method to print the number of rows in each table.


In [ ]:
client.db_stats()

Scraping, Parsing, and Ingesting Data

To process a single day of games, use the process_date function. Note that if a game is already in the database, you'll get a WARNING notifying you, and the game won't be processed (provided that the client's log_level is set to WARN or a more verbose level).


In [ ]:
a_date = datetime(2018, 7, 1)
client.process_date(a_date)

To process a date range, specify a start and end date and use the process_date_range function. Data for all MLB games within the date range (including the start and end dates) will be processed.


In [ ]:
start_date = datetime(2015, 6, 12)  # June 12, 2015
end_date = datetime(2015, 6, 13)  # June 13, 2015
client.process_date_range(start_date, end_date)

That's it! You now have data in your database! Run db_stats() again to confirm.


In [ ]:
# Hopefully the output is different from when you ran this function above.
client.db_stats()

Using Your Database

This section gives a quick example of how to access the data you just retrieved using Python. You can of course use any tools of your choice to access your database.

This example uses SQLAlchemy and Pandas, a powerful data processing library.


In [ ]:
from sqlalchemy import create_engine
import pandas as pd

To access the database through Pandas, create an SQLAlchemy engine. We pass this engine to Pandas in order to execute queries. See the Pandas documentation for more details.


In [ ]:
engine = create_engine("sqlite:///gameday.db")

Now use the Pandas read_sql function to execute an SQL statement. Here, we'll select 10 rows from the pitches table.


In [ ]:
# Select 10 rows from the pitches table, using pitch_id as the DataFrame index
data = pd.read_sql("SELECT * FROM pitches LIMIT 10;", engine, index_col="pitch_id")

The data is now available as a Pandas DataFrame object. Execute the following cell to view it in a pretty table.


In [ ]:
data.head()

Now you can do all kinds of cool stuff.
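
For example, a quick way to explore what you've ingested is to load the full pitches table and summarize it. This sketch relies only on the pitches table and pitch_id column used above; inspect the resulting DataFrame's columns to see the full schema.


In [ ]:
# Load the full pitches table and summarize its numeric columns.
# Only the `pitches` table and `pitch_id` column from earlier cells are assumed here;
# check all_pitches.columns to see what else your schema contains.
all_pitches = pd.read_sql("SELECT * FROM pitches;", engine, index_col="pitch_id")
print(all_pitches.shape)  # number of rows and columns ingested
all_pitches.describe()    # summary statistics for the numeric columns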