This notebook demonstrates how to query hierarchical databases with our toy query language BQL.
We start with loading a demo JSON dataset with data on departments and employees of the city of Chicago (source). In general, we can treat any JSON document with a regular structure as a hierarchical database.
In [1]:
%cd -q ..
from citydb_json import citydb
This is the structure of the citydb
database:
{
"departments": [
{
"name": ...,
"employees": [
{
"name": ...,
"surname": ...,
"position": ...,
"salary": ...
},
... ]
},
... ]
}
The top-level City object has the following fields:
departments
: an array of department objects.Department objects have the following fields:
name
: the name of the department.employees
: an array of employee objects.Employee objects have the following fields:
name
: employee's first name.surname
: employee's last name.position
: employee's title.salary
: annual salary of the employee.Next, we import the BQL library.
In [2]:
from bql import *
The BQL query language is embedded in Python, which means any BQL query is a regular Python function which maps JSON input to JSON output. We call such functions JSON combinators.
Two trivial examples of JSON combinators are:
Const(val)
, which maps all input to the same output value;Here()
, which returns its input unchanged.
In [3]:
C = Const(42)
C(None), C(42), C([1, 2, 3])
Out[3]:
In [4]:
I = Here()
I(None), I(42), I([1, 2, 3])
Out[4]:
More impressive is combinator Field(name)
that extracts a field value from a JSON object.
In [5]:
F = Field('x')
F({'x': 24, 'y': 42})
Out[5]:
By composing two field extractors, we can build a query that produces the names of all departments.
In [6]:
Departments = Field('departments')
Name = Field('name')
Dept_Names = Departments >> Name
dept_names = Dept_Names(citydb)
dept_names [:5]
Out[6]:
What does the >>
operator do exactly? Fundamentally, (A >> B)
composes A
and B
by sending the output of A
to the input of B
.
However, if we directly apply this rule to evaluate the expression (Departments >> Name)(citydb)
, we will fail because citydb['departments']['name']
does not exist.
To make this work, we need to clarify the composition rule. Namely, expression (A >> B)(x)
, when A(x)
is an array, applies B
to each element of the array.
Moreover, when B
itself produces array values, all B
outputs are combined into one array, which becomes the output of (A >> B)
.
The last feature is used when we list the names of all employees.
In [7]:
Employees = Field('employees')
Empl_Names = Departments >> Employees >> Name
empl_names = Empl_Names(citydb)
empl_names [:5]
Out[7]:
Dual to Field(name)
, combinator Select(...)
constructs JSON objects. Parameters of Select(...)
are combinators that construct object fields. Here is a trivial example.
In [8]:
S = Select(x=Const(42), y=Here())
S(24)
Out[8]:
Let us use Select(...)
to generate the name and the number of employees for each department.
In [9]:
Depts_With_Size = Departments >> Select(name=Name, size=Count(Employees))
depts_with_size = Depts_With_Size(citydb)
depts_with_size [:5]
Out[9]:
Here, combinator Count(Employees)
returns the length of the employees
array. In general, Count(F)
lets F
process its input expecting the output of F
to be an array, then returns the length of the array.
(You may've expected Employees >> Count()
, but that'd make operator >>
non-associative).
Array combinators such as Count(...)
are called aggregate combinators. The following aggregate combinators are defined in BQL: Count()
, Min()
, Max()
, First()
.
In [10]:
Num_Depts = Count(Departments)
Num_Depts(citydb)
Out[10]:
In [11]:
Salary = Field('salary')
Top_Salary = Max(Departments >> Employees >> Salary)
Top_Salary(citydb)
Out[11]:
In [12]:
One_Empl = First(Departments >> Employees)
One_Empl(citydb)
Out[12]:
In [13]:
Three_Depts = First(Departments >> Name, Const(3))
Three_Depts(citydb)
Out[13]:
In [14]:
Half_Depts = First(Departments >> Name, Count(Departments)//2)
Half_Depts(citydb)
Out[14]:
Combinator Filter(P)
applies predicate P
to its input. If the predicate condition is not satisfied, the input is dropped, otherwise it is returned unchanged. Let us use Filter()
to find the departments with more than 1000 employees.
In [15]:
Size = Field('size')
Large_Depts = Depts_With_Size >> Filter(Size > 1000)
Large_Depts(citydb)
Out[15]:
Here, combinator Depts_With_Size
, which adds size
field to each department object, is composed with combinator Filter(Size > 1000)
, which gathers the departments that satisfy condition Size > 1000
.
In the following example, we use Filter()
to find the number of employees whose annual salary exceeds 200k.
In [16]:
Num_Well_Paid_Empls = \
Count(Departments >> Employees >> Filter(Salary >= 200000))
Num_Well_Paid_Empls(citydb)
Out[16]:
Now suppose we'd like to find the number of employees with salary in a certain range, but we don't know the range in advance. In this case, we can construct a parameterized query.
In [17]:
Min_Salary = Ref('min_salary')
Max_Salary = Ref('max_salary')
Num_Empls_By_Salary_Range = \
Count(Departments >> Employees >> Filter((Salary >= Min_Salary) & (Salary < Max_Salary)))
To run the Num_Empls_By_Salary_Range
query, we need to supply it with parameters min_salary
and max_salary
.
In [18]:
Num_Empls_By_Salary_Range(citydb, {'min_salary': 200000, 'max_salary': 1000000})
Out[18]:
In [19]:
Num_Empls_By_Salary_Range(citydb, {'min_salary': 100000, 'max_salary': 200000})
Out[19]:
In [20]:
Num_Empls_By_Salary_Range(citydb, {'min_salary': 0, 'max_salary': 100000})
Out[20]:
The query knows which parameters it needs.
In [21]:
Num_Empls_By_Salary_Range.refs()
Out[21]:
The last feature we discuss here is an ability to assign parameter values dynamically.
Consider a query: find the top salary for each department. It could be easily implemented using Max()
aggregate.
In [22]:
Depts_With_Max_Salary = \
Departments >> Select(name=Name, max_salary=Max(Employees >> Salary))
Depts_With_Max_Salary(citydb) [:5]
Out[22]:
Now let us ask a slightly different question: find the employees with the highest salary at their department. We may try to use the Filter()
combinator as follows.
In [23]:
Highest_Paid_Empls_By_Dept = \
Departments >> Employees >> Filter(Salary == Max_Salary)
But the filter condition (Salary == Max_Salary)
is problematic since we cannot supply max_salary
as a query parameter. Instead it must be calculated dynamically for each department. The Given(...)
combinator does exactly that.
In [24]:
Highest_Paid_Empls_By_Dept = \
Departments >> \
Given(
Employees >> Filter(Salary == Max_Salary),
max_salary=Max(Employees >> Salary))
Highest_Paid_Empls_By_Dept(citydb) [:5]
Out[24]:
Notably, Highest_Paid_Empls_By_Dept
requires no parameters despite the fact that its definition refers to max_salary
.
In [25]:
Highest_Paid_Empls_By_Dept.refs()
Out[25]: