Optimizing ORM Queries

Introduction

This notebook provides some background on the various extractor queries. These queries are on the Submission model which has foreign key relationships on the User and Form. For data extraction, we need data from all 3 models, but not necessarily all their properties at once.


In [1]:
import sys
import sqlalchemy as sa
import sqlalchemy.orm as sa_orm
import testing.postgresql

from app import models
from app.util import sqldebug

In [2]:
# open a test session
postgresql = testing.postgresql.Postgresql(base_dir='.test_db')
db_engine = sa.create_engine(postgresql.url())
models.init_database(db_engine)
sessionmaker = sa_orm.sessionmaker(db_engine)
session = sessionmaker()

Default behavior ORM Queries

If you query the Submission model without any other options, it will use default SQLAlchemy behavior. SQLAlchemy lazy-loads related models by default (unless otherwise defined in the model).

As such, if you render the query to SQL, you'll see that no joins occur. The related models are loaded only when access (lazily) using separate SELECT queries.


In [3]:
simple_query = session.query(models.Submission)
sqldebug.pp_query(simple_query)


SELECT clover_app.form_responses.id AS clover_app_form_responses_id,
       clover_app.form_responses.form_id AS clover_app_form_responses_form_id,
       clover_app.form_responses.user_id AS clover_app_form_responses_user_id,
       clover_app.form_responses.responses AS clover_app_form_responses_responses,
       clover_app.form_responses.date_created AS clover_app_form_responses_date_created
FROM clover_app.form_responses

Joined Loading

SQLAlchemy provides the ability to specify a "joined load" option. Passing a orm.joinedload() to Query.options() will emit a left join operation by default. So you need to set innerjoin=True if required. Data can then be eager-loaded.

As an example, we will extend our query to only "join-load" the User model. Subsequent accesses to the user property in a Submission instance will not emit SELECT queries. But note that joined-loads will load all columns in the related model. This is fine for the User model because it has relatively few columns which are expected to be short strings (first and last name).

In the example below, we don't do this for models.Form. This is conscious decision as the Form.schema column is a JSON field which, relative to other columns, can be quite large. SQLAlchemy will continue to use its default lazy loading behavior and load the form using separate SELECT queries when form property of a Submission instance. This may actually be fine for relatively few forms because their schemas will remain in the Session cache after loading and thus potentially avoiding repeated SELECT queries.


In [5]:
current_joined_query = session.query(models.Submission)\
    .options(
        sa_orm.joinedload(models.Submission.user, innerjoin=True)
    )
sqldebug.pp_query(current_joined_query)


SELECT clover_app.form_responses.id AS clover_app_form_responses_id,
       clover_app.form_responses.form_id AS clover_app_form_responses_form_id,
       clover_app.form_responses.user_id AS clover_app_form_responses_user_id,
       clover_app.form_responses.responses AS clover_app_form_responses_responses,
       clover_app.form_responses.date_created AS clover_app_form_responses_date_created,
       users_1.id AS users_1_id,
       users_1.given_name AS users_1_given_name,
       users_1.family_name AS users_1_family_name
FROM clover_app.form_responses
JOIN clover_app.users AS users_1 ON users_1.id = clover_app.form_responses.user_id

Explicit Join and Eager Load

It may be desirable to force data extraction to one single SELECT query. This does require a bit more code but is possible using explicit joins and eager loads. This provides full control and avoids relying on lazy-loading or Session cache behavior.

Our ETL transformation only requires the name column from the Form model. You can eager-load related tables more precisely as follows:

  • Chain call Query.join() for each model you wish to eager load via a INNER JOIN
  • Pass to to Query.options() a orm.contains_eager() for each related property you wish to eager-load
  • To restrict to only a subset of columns, extend your eager option by chaining to orm.load_only() with the column attribute string you want to restrict it to.

In the example below, we show explicit joins for both the user and form relations. However, we restrict eager loading the form to only the name property.

The resulting query gives us precise control. Note that the primary keys (id columns) are still loaded as part of the join which is default behavior in SQLAlchemy. This is acceptable as our primary aim was to avoid loading the largest column (schema).


In [48]:
new_joined_query = session.query(models.Submission)\
    .join(models.User)\
    .join(models.Form)\
    .options(
        sa_orm.contains_eager(models.Submission.user),
        sa_orm.contains_eager(models.Submission.form).load_only('name'),
    )
    
sqldebug.pp_query(new_joined_query)


SELECT clover_app.form_schemas.id AS clover_app_form_schemas_id,
       clover_app.form_schemas.name AS clover_app_form_schemas_name,
       clover_app.users.id AS clover_app_users_id,
       clover_app.users.given_name AS clover_app_users_given_name,
       clover_app.users.family_name AS clover_app_users_family_name,
       clover_app.form_responses.id AS clover_app_form_responses_id,
       clover_app.form_responses.form_id AS clover_app_form_responses_form_id,
       clover_app.form_responses.user_id AS clover_app_form_responses_user_id,
       clover_app.form_responses.responses AS clover_app_form_responses_responses,
       clover_app.form_responses.date_created AS clover_app_form_responses_date_created
FROM clover_app.form_responses
JOIN clover_app.users ON clover_app.users.id = clover_app.form_responses.user_id
JOIN clover_app.form_schemas ON clover_app.form_schemas.id = clover_app.form_responses.form_id

In [ ]: