We start with loading a sample hierarchical database. Our sample database is derived from the dataset of all employees of the city of Chicago (source).
In [1]:
ENV["LINES"] = 15
include("../citydb_json.jl")
citydb
Out[1]:
In hierarchical data model, data is organized in a tree-like structure. In this database, data is stored as a JSON document organized in a 2-level hierarchy:
"departments"
with an array of department objects;"name"
, the name of the department, and "employees"
, an array of employees;"name"
, "surname"
, "position"
, "salary"
describing the employee.Here is a fragment of the dataset:
{
"departments": [
{
"name": "WATER MGMNT",
"employees": [
{
"name": "ALVA",
"surname": "A",
"position": "WATER RATE TAKER",
"salary": 87228
},
...
]
},
...
]
}
We may want to ask some questions about the data. For example,
Even though the raw dataset does not immediately contain any answers to these questions, it has enough information so that the answers could be inferred from the data if we are willing to write some code (we use Julia programming language).
Take a relatively complicated problem:
For each department, find the number of employees with the salary higher than $100k.
It can be solved as follows:
In [2]:
Depts_With_Num_Well_Paid_Empls(data) =
map(d -> Dict(
"name" => d["name"],
"N100k" => length(filter(e -> e["salary"] > 100000, d["employees"]))),
data["departments"])
Depts_With_Num_Well_Paid_Empls(citydb)
Out[2]:
Is it a good solution? Possibly. It is certainly compact, due to our use of map
and filter
to traverse the structure of the database. On the other hand, to write or understand code like this, one needs solid understanding of non-trivial CS concepts such as high-order and anonymous functions. One needs to be a professional programmer.
Is there a way to write this query without use of map
and filter
(or, equivalently, nested loops)? Indeed, there is, but to show how to do it, we need to introduce some new primitives and operations. We start with the notion of JSON combinators.
A JSON combinator is simply a function that maps JSON input to JSON output. Two trivial examples of JSON combinators are:
Const(val)
, which maps each input value to constant value val
.This()
, which copies the input to the output without changes.
In [3]:
Const(val) = x -> val
C = Const(42)
C(true), C(42), C([1,2,3])
Out[3]:
In this example, Const(42)
creates a new constant combinator. It is then applied to various input JSON values, always producing the same output.
In [4]:
This() = x -> x
I = This()
I(true), I(42), I([1,2,3])
Out[4]:
Similarly, This()
creates a new identity combinator. We test it with different input values to assure ourselves that it does not change the input.
Notice the pattern:
In short, by designing a collection of useful combinators, we are creating a query language (embedded in the host language, but this is immaterial).
In [5]:
Field(name) = x -> x[name]
Out[5]:
In [6]:
Salary = Field("salary")
Salary(Dict("name" => "RAHM", "surname" => "E", "position" => "MAYOR", "salary" => 216210))
Out[6]:
Here, to demonstrate field extractors, we defined Salary
, a combinator that extracts value of field "salary"
from the input JSON object.
To build interesting queries, we need a way to construct complex combinators from primitives. Let us define composition (F >> G)
of combinators F
and G
that ties F
and G
by sending the output of F
to the input of G
.
Our first, naive attempt to implement composition is as follows:
In [7]:
import Base: >>
(F >> G) = x -> G(F(x))
Out[7]:
We can traverse the structure of hierarchical data by chaining field extractors with the composition operator.
$$ \textbf{departments}\gg \quad \begin{cases} \gg\textbf{name} \\ \text{employees} \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$For example, let us find the names of all departments. We can do it by composing extractors for fields "departments"
and "name"
.
In [8]:
Departments = Field("departments")
Name = Field("name")
Dept_Names = Departments >> Name
Dept_Names(citydb)
What is going on? We expected to get a list of department names, but instead we got an error.
Here is a problem. With the current definition of the >>
operator, expression
(Departments >> Name)(citydb)
is translated to
citydb["departments"]["name"]
But this fails because citydb["departments"]
is a array and thus doesn't have a field called "name"
.
Let us demonstrate the behavior of >>
on the duplicating combinator. Combinator Dup
duplicates its input, that is, for any input value x
, it produces an array [x, x]
. See what happens when we compose Dup
with itself, once or several times:
In [9]:
Dup = x -> Any[x, x]
Dup(0), (Dup >> Dup)(0), (Dup >> Dup >> Dup)(0)
Out[9]:
We need composition (F >> G)
to be smarter. When F
produces an array, the composition should apply G
to each element of the array. In addition, if G
also produces array values, (F >> G)
concatenates all outputs to produce a single array value.
In [10]:
(F >> G) = x -> _flat(_map(G, F(x)))
_flat(z) =
isa(z, Array) ? foldr(vcat, [], z) : z
_map(G, y) =
isa(y, Array) ? map(_expand, map(G, y)) : G(y)
_expand(z_i) =
isa(z_i, Array) ? z_i : z_i != nothing ? [z_i] : []
Out[10]:
Let us test the updated >>
operator with Dup
again. We see that the output arrays are now flattened:
In [11]:
Dup(0), (Dup >> Dup)(0), (Dup >> Dup >> Dup)(0)
Out[11]:
We can get back to our original example, finding the names of all departments. Now we get the result we expected.
In [12]:
Dept_Names = Departments >> Name
Dept_Names(citydb)
Out[12]:
Similarly, we can list values of any attribute in the hierarchy tree. For example, let us find the names of all employees.
$$ \textbf{departments}\gg \quad \begin{cases} \text{name} \\ \gg\textbf{employees}\gg \quad \begin{cases} \gg\textbf{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$We can do it by composing field extractors on the path from the root of the hierarchy to the "name"
attribute:
In [13]:
Employees = Field("employees")
Empl_Names = Departments >> Employees >> Name
Empl_Names(citydb)
Out[13]:
Field extractors and composition give us a way to traverse the hierarchy tree. We still need a way to summarize data.
Consider a query: find the number of departments. To write it down, we need a combinator that can count the number of elements in an array.
Here is our first attempt to implement it:
In [14]:
Count() = x -> length(x)
Out[14]:
We compose it with a combinator that generates an array of departments to calculate the number of departments:
In [15]:
Num_Depts = Departments >> Count()
Num_Depts(citydb)
Out[15]:
But that's not what we expected! Here is the problem: the composition operator does not let Count()
see the whole array. Instead, Departments >> Count()
submits each array element to Count()
one by one and then concatenates the outputs of Count()
. Count()
, when its input is a JSON object, returns the number of fields in the object (in this case, 2 fields for all department objects).
The right way to implement Count()
is to add an array-producing combinator as a parameter:
In [16]:
Count(F) = x -> length(F(x))
Out[16]:
In [17]:
Num_Depts = Count(Departments)
Num_Depts(citydb)
Out[17]:
How to use composition with Count()
correctly? Here is an example: show the number of employees for each department. Consider this: number of employees is a (derived) property of each department, which suggests us to compose two combinators: one generating department objects and the other calculating the number of employees for a given department. We get:
In [18]:
Num_Empls_Per_Dept = Departments >> Count(Employees)
Num_Empls_Per_Dept(citydb)
Out[18]:
On the other hand, if we'd like to calculate the total number of employees, the parameter of Count()
should be the combinator that generates all the employees:
In [19]:
Num_Empls = Count(Departments >> Employees)
Num_Empls(citydb)
Out[19]:
We could add other summarizing or aggregate combinators. For example, let us define a combinator that finds the maximum value in an array.
In [20]:
Max(F) = x -> maximum(F(x))
Out[20]:
Aggregate combinators could be combined to answer complex questions. For example, let us find the maximum number of employees per department. We already have a combinator that generates the number of employees for each department, all is left is to apply Max()
.
In [21]:
Max_Empls_Per_Dept = Max(Num_Empls_Per_Dept) # Max(Departments >> Count(Employees))
Max_Empls_Per_Dept(citydb)
Out[21]:
In [22]:
Select(fields...) =
x -> Dict(map(f -> f.first => f.second(x), fields))
Out[22]:
For each input, Select()
constructs a new JSON object with field values generated by field constructors applied to the input.
Here is a simple example of Select()
summarizing the input array:
In [23]:
S = Select("len" => Count(This()), "max" => Max(This()))
S([10, 20, 30])
Out[23]:
Similarly, we can summarize any hierarchical dataset. Let us modify the query that finds the number of employees for each department. Instead of a raw list of numbers, we will generate a table with the name of the department and its size (the number of employees):
In [24]:
Depts_With_Size =
Departments >> Select(
"name" => Name,
"size" => Count(Employees))
Depts_With_Size(citydb)
Out[24]:
This query could easily be expanded to add more information about the department. For that, we only need to add extra field definitions to the Select()
clause. Notably, change in one field constructor cannot in any way affect the values of the other fields.
Let us additionally determine the top salary for each department:
In [25]:
Depts_With_Size_And_Max_Salary =
Departments >> Select(
"name" => Name,
"size" => Count(Employees),
"max_salary" => Max(Employees >> Salary))
Depts_With_Size_And_Max_Salary(citydb)
Out[25]:
Remember the problem we stated in the beginning: find the number of employees with the salary higher than $100k. We have almost all pieces we need to construct a solution of this problem. One piece that appears to be missing is a way to refine data. We need a combinator that, given a set of values and a predicate, produces the values that satisfy the predicate condition.
Here is how we can implement it:
In [26]:
Sieve(P) = x -> P(x) ? x : nothing
Out[26]:
Combinator Sieve(P)
is parameterized with a predicate combinator P
. A predicate is a combinator that, for any input, returns true
or false
. For example, a predicate combinator (F < G)
with two parameters F
and G
returns, for any input x
, the result of comparison F(x) < G(x)
.
Let us implement common predicate (and also some arithmetic) combinators:
In [27]:
import Base: >, >=, <, <=, ==, !=, !, &, |, +, -, /, %
(>)(F::Function, G::Function) = x -> F(x) > G(x)
(>)(F::Function, n::Number) = F > Const(n)
(>=)(F::Function, G::Function) = x -> F(x) >= G(x)
(>=)(F::Function, n::Number) = F >= Const(n)
(<)(F::Function, G::Function) = x -> F(x) < G(x)
(<)(F::Function, n::Number) = F < Const(n)
(<=)(F::Function, G::Function) = x -> F(x) <= G(x)
(<=)(F::Function, n::Number) = F <= Const(n)
(==)(F::Function, G::Function) = x -> F(x) == G(x)
(==)(F::Function, n::Number) = F == Const(n)
(!=)(F::Function, G::Function) = x -> F(x) != G(x)
(!=)(F::Function, n::Number) = F != Const(n)
(!)(F::Function) = x -> !F(x)
(&)(F::Function, G::Function) = x -> F(x) && G(x)
(|)(F::Function, G::Function) = x -> F(x) || G(x)
(+)(F::Function, G::Function) = x -> F(x) + G(x)
(+)(F::Function, n::Number) = F + Const(n)
(-)(F::Function, G::Function) = x -> F(x) - G(x)
(-)(F::Function, n::Number) = F - Const(n)
(/)(F::Function, G::Function) = x -> F(x) / G(x)
(/)(F::Function, n::Number) = F / Const(n)
(%)(F::Function, G::Function) = x -> F(x) % G(x)
(%)(F::Function, n::Number) = F % Const(n)
Out[27]:
Sieve(P)
tests its input on the predicate condition P
. If the input satisfies the condition, it is returned without changes. Otherwise, nothing
is returned.
Here is a trivial example to demonstrate how Sieve()
works:
In [28]:
Take_Odd = Sieve(This() % 2 == 1)
Take_Odd(5), Take_Odd(10)
Out[28]:
When the composition operator accumulates values for array output, it drops nothing
values. Thus, in a composition (F >> Sieve(P))
with an array-generating combinator F
, Sieve(P)
filters the elements of the array.
Let us use this feature to list the departments with more than 1000 employees. We already defined a combinator producing departments with the number of employees, we just need to filter its output:
In [29]:
Size = Field("size")
Large_Depts = Depts_With_Size >> Sieve(Size > 1000)
Large_Depts(citydb)
Out[29]:
Similarly, we can list positions of employees with salary higher than 200k:
In [30]:
Position = Field("position")
Very_Well_Paid_Posns =
Departments >> Employees >> Sieve(Salary > 200000) >> Position
Very_Well_Paid_Posns(citydb)
Out[30]:
With Sieve()
defined, we are finally able to answer the original question using combinators:
For each department, find the number of employees with salary higher than 100k.
In [31]:
Better_Depts_With_Num_Well_Paid_Empls =
Departments >> Select(
"name" => Name,
"N100k" => Count(Employees >> Sieve(Salary > 100000)))
Better_Depts_With_Num_Well_Paid_Empls(citydb)
Out[31]:
Compare it with the original solution. The new one reads much better!
In [32]:
Depts_With_Num_Well_Paid_Empls(data) =
map(d -> Dict(
"name" => d["name"],
"N100k" => length(filter(e -> e["salary"] > 100000, d["employees"]))),
data["departments"])
Out[32]:
We achieved our goal of sketching (a prototype of) a query language for hierarchical databases. Let us explore how it could be developed further. One possible way to improve it is by adding query parameters.
Consider a problem: find the number of employees whose annual salary exceeds 200k. We have all the tools to solve it:
In [33]:
Num_Very_Well_Paid_Empls =
Count(Departments >> Employees >> Sieve(Salary >= 200000))
Num_Very_Well_Paid_Empls(citydb)
Out[33]:
Now, imagine that we'd like to find the number of employees with salary in a certain range, but we don't know the range at the time we construct the query. Instead, we want to specify the range when we execute the query.
Let us introduce a query context, a collection of parameters and their values. We'd like the query context to travel with the input, where each combinator could access it if necessary. Thus, we have an updated definition of a JSON combinator: a function that maps JSON input and query context to JSON output.
We need to update existing combinators to make them context-aware:
In [34]:
Const(val) = (x, ctx...) -> val
This() = (x, ctx...) -> x
(F >> G) = (x, ctx...) -> _flat(_map(G, F(x, ctx...), ctx...))
_map(G, y, ctx...) =
isa(y, Array) ? map(_expand, map(yi -> G(yi, ctx...), y)) : G(y, ctx...)
Field(name) = (x, ctx...) -> x[name]
Select(fields...) =
(x, ctx...) -> Dict(map(f -> f.first => f.second(x, ctx...), fields))
Count(F) = (x, ctx...) -> length(F(x, ctx...))
Max(F) = (x, ctx...) -> maximum(F(x, ctx...))
Sieve(P) = (x, ctx...) -> P(x, ctx...) ? x : nothing
(>)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) > G(x, ctx...)
(>=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) >= G(x, ctx...)
(<)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) < G(x, ctx...)
(<=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) <= G(x, ctx...)
(==)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) == G(x, ctx...)
(!=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) != G(x, ctx...)
(!)(F::Function) = (x, ctx...) -> !F(x, ctx...)
(&)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) && G(x, ctx...)
(|)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) || G(x, ctx...)
(+)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) + G(x, ctx...)
(-)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) - G(x, ctx...)
(/)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) / G(x, ctx...)
(%)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) % G(x, ctx...)
Out[34]:
Next, let us add combinator Var(name)
that extracts the value of a parameter from the query context.
In [35]:
Var(name) = (x, ctx...) -> Dict(ctx)[name]
Out[35]:
Now we can make parameterized queries. Find the number of employees with salary in a certain range:
In [36]:
Min_Salary = Var("min_salary")
Max_Salary = Var("max_salary")
Departments = Field("departments")
Employees = Field("employees")
Salary = Field("salary")
Num_Empls_By_Salary =
Count(
Departments >>
Employees >>
Sieve((Salary >= Min_Salary) & (Salary < Max_Salary)))
Num_Empls_By_Salary(citydb, "min_salary" => 100000, "max_salary" => 200000)
Out[36]:
Use of context is not limited to query parameters. We can also update the context dynamically.
Consider a problem: find the employee with the highest salary.
It can be solved in two queries. First, find the highest salary:
In [37]:
Max_Salary = Max(Departments >> Employees >> Salary)
Max_Salary(citydb)
Out[37]:
Second, find the employee with the given salary:
In [38]:
The_Salary = Var("salary")
Empl_With_Salary = Departments >> Employees >> Sieve(Salary == The_Salary)
Empl_With_Salary(citydb, "salary" => 260004)
Out[38]:
We need to automate this sequence of operations. Specifically, we take a value calculated by one combinator, assign it to some context parameter, and then evaluate the other combinator in the updated context. That's what Given()
combinator does:
In [39]:
Given(F, vars...) =
(x, ctx...) ->
let ctx = (ctx..., map(v -> v.first => v.second(x, ctx...), vars)...)
F(x, ctx...)
end
Out[39]:
Combining Max_Salary
and Empl_With_Salary
using Given
, we get:
In [40]:
Empl_With_Max_Salary = # Given(Empl_With_Salary, "salary" => Max_Salary)
Given(
Departments >> Employees >> Sieve(Salary == The_Salary),
"salary" => Max(Departments >> Employees >> Salary))
Empl_With_Max_Salary(citydb)
Out[40]:
This is not just a convenience feature. Indeed, let us change this query to find the highest paid employee for each department. To implement it, we need to pull Departments
from the Given()
clause:
In [41]:
Empls_With_Max_Salary_By_Dept =
Departments >> Given(
Employees >> Sieve(Salary == The_Salary),
"salary" => Max(Employees >> Salary))
Empls_With_Max_Salary_By_Dept(citydb)
Out[41]:
In [42]:
Max_Salary_By_Dept = Departments >> Max(Employees >> Salary)
Max_Salary_By_Dept(citydb)
Out[42]:
Now change it to: find the top salary for each position. We can't solve it with our current set of combinators. Why?
Look at the database hierarchy diagram:
$$ \text{departments} \quad \begin{cases} \text{name} \\ \text{employees} \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$The structure of the first query (top salary for each department) fits the structure of the database:
$$ \textbf{departments}\gg \quad \begin{cases} \text{name} \\ \gg\textbf{employees}\gg \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \gg\textbf{salary} \end{cases} \end{cases} $$The structure of the second query (top salary for each position) violates this structure:
$$ \text{departments} \quad \begin{cases} \text{name} \\ \textbf{employees}\lessgtr \quad \begin{cases} \text{name} \\ \text{surname} \\ \ll\textbf{position} \\ \gg\textbf{salary} \end{cases} \end{cases} $$This is not the only limitation. Let us not forget that real databases are decidedly non-hierarchical. For example, this is the database schema (designed by Charles Tirrell) of our flagship product RexStudy. No hierarchy in sight!
As a conclusion, combinators are awesome for querying data as long as:
Otherwise, we are out of luck...
... Or are we?