SeekWell is a package for quickly and easily querying SQL databases in Python. It was made with data analysts in mind and plays well with Jupyter notebooks. This notebook is a little tutorial to get you started working with SQL databases in under 5 minutes.
SeekWell is a higher level library built on top of SQLAlchemy, an amazing library that does all the heavy lifting. SeekWell is designed to get our of your way and focus on retrieving the data you want from your database.
Query results are retrieved in a lazy manner. That is, they aren't returned until you ask for them. Records are cached once you get them, so you only ever run the query once. SeekWell also provides methods for inspecting the tables and columns in your database.
In [1]:
from seekwell import Database
SeekWell uses SQLAlchemy underneath to connect to literally any database you can throw at it. You just need the appropriate engines, for example, psychopg for PostGres. Make sure to check out SQLAlchemy's documentation for connecting to databases.
In SeekWell, you just need to import the Database
class and create a new Database object. The path required is defined by SQLAlchemy (link to documentation here).
Here I'll load a database of European soccer matches, teams, and players available from Kaggle Datasets.
In [2]:
db = Database('sqlite:///database.sqlite')
I've found database introspection to be really useful, that is, listing out tables and columns. When connected to a Database
, you can get a list of tables.
In [3]:
db.table_names
Out[3]:
And you can get a table to inspect it.
In [4]:
table = db['Team']
table
Out[4]:
In [5]:
table.column_names
Out[5]:
To check out the data in a table, use the head
method to print out the first few rows.
In [6]:
table.head()
Out[6]:
In a notebook, rows are printed out in an HTML table for nice viewing. In the terminal, rows are printed out as an ASCII table.
In [7]:
print(table.head())
If you're using a database with schemas, you can get a list of the schema names with db.schema_names
.
In [8]:
db.schema_names
Out[8]:
In [9]:
records = db.query('SELECT * from Player limit 50')
The data isn't returned immediately, only when you request it.
In [10]:
records
Out[10]:
Use fetch
to get the data. Calling fetch
without any arguments will return all the rows. Passing in a number will return that many rows.
In [11]:
records.fetch(10)
Out[11]:
The data is cached in records.rows
In [12]:
records.rows
Out[12]:
You can get rows using slices too.
In [13]:
records[5:15]
Out[13]:
In [14]:
records[-5:]
Out[14]:
Or get all the rows by calling fetch
with no arguments...
In [15]:
records.fetch()
Out[15]:
Your statements can of course be as complex as you want. Using parameters in statements is possible using keyword arguments. This uses the SQLAlchemy text
syntax, so read up on it here. Below is an example using :home_team
as a parameter to filter for the desired home team in the statement.
In [16]:
statement = """
SELECT Match.date,
Match.home_team_goal, Match.away_team_goal,
home_team.team_long_name AS home_team,
away_team.team_long_name AS away_team
FROM Match
JOIN Team AS home_team
ON Match.home_team_api_id=home_team.team_api_id
JOIN Team AS away_team
ON Match.away_team_api_id=away_team.team_api_id
WHERE home_team=:home_team
ORDER BY Match.date ASC
"""
records = db.query(statement, home_team='KRC Genk')
records.fetch()[:20]
Out[16]:
In [17]:
records.to_csv('KRC_Genk_games.csv')
In [18]:
df = records.to_pandas()
df
Out[18]:
In [19]:
df = df.assign(point_diff=(df['home_team_goal'] - df['away_team_goal']))
df.groupby('away_team')['point_diff'].agg({'wins': lambda x: sum(x>0),
'losses': lambda x: sum(x<0),
'ties': lambda x: sum(x==0)})
Out[19]: