This notebook provides some background on the various extractor queries. These queries are on the Submission
model which has foreign key relationships on the User
and Form
. For data extraction, we need data from all 3 models, but not necessarily all their properties at once.
In [1]:
import sys
import sqlalchemy as sa
import sqlalchemy.orm as sa_orm
import testing.postgresql
from app import models
from app.util import sqldebug
In [2]:
# open a test session
postgresql = testing.postgresql.Postgresql(base_dir='.test_db')
db_engine = sa.create_engine(postgresql.url())
models.init_database(db_engine)
sessionmaker = sa_orm.sessionmaker(db_engine)
session = sessionmaker()
If you query the Submission
model without any other options, it will use default SQLAlchemy behavior. SQLAlchemy lazy-loads related models by default (unless otherwise defined in the model).
As such, if you render the query to SQL, you'll see that no joins occur. The related models are loaded only when access (lazily) using separate SELECT queries.
In [3]:
simple_query = session.query(models.Submission)
sqldebug.pp_query(simple_query)
SQLAlchemy provides the ability to specify a "joined load" option. Passing a orm.joinedload() to Query.options() will emit a left join operation by default. So you need to set innerjoin=True
if required. Data can then be eager-loaded.
As an example, we will extend our query to only "join-load" the User
model. Subsequent accesses to the user
property in a Submission
instance will not emit SELECT
queries. But note that joined-loads will load all columns in the related model. This is fine for the User
model because it has relatively few columns which are expected to be short strings (first and last name).
In the example below, we don't do this for models.Form
. This is conscious decision as the Form.schema
column is a JSON field which, relative to other columns, can be quite large. SQLAlchemy will continue to use its default lazy loading behavior and load the form using separate SELECT
queries when form
property of a Submission
instance. This may actually be fine for relatively few forms because their schemas will remain in the Session
cache after loading and thus potentially avoiding repeated SELECT
queries.
In [5]:
current_joined_query = session.query(models.Submission)\
.options(
sa_orm.joinedload(models.Submission.user, innerjoin=True)
)
sqldebug.pp_query(current_joined_query)
It may be desirable to force data extraction to one single SELECT
query. This does require a bit more code but is possible using explicit joins and eager loads. This provides full control and avoids relying on lazy-loading or Session
cache behavior.
Our ETL transformation only requires the name
column from the Form
model. You can eager-load related tables more precisely as follows:
INNER JOIN
In the example below, we show explicit joins for both the user
and form
relations. However, we restrict eager loading the form to only the name
property.
The resulting query gives us precise control. Note that the primary keys (id
columns) are still loaded as part of the join which is default behavior in SQLAlchemy. This is acceptable as our primary aim was to avoid loading the largest column (schema
).
In [48]:
new_joined_query = session.query(models.Submission)\
.join(models.User)\
.join(models.Form)\
.options(
sa_orm.contains_eager(models.Submission.user),
sa_orm.contains_eager(models.Submission.form).load_only('name'),
)
sqldebug.pp_query(new_joined_query)
In [ ]: