Querying hierarchical data

In this notebook, we explore how to query hierarchical databases.

The database

We start with loading a sample hierarchical database. Our sample database is derived from the dataset of all employees of the city of Chicago (source).


In [1]:
ENV["LINES"] = 15
include("../citydb_json.jl")

citydb


Out[1]:
Dict{AbstractString,Any} with 1 entry:
  "departments" => Any[Dict{AbstractString,Any}("name"=>"WATER MGMNT","employee…

In hierarchical data model, data is organized in a tree-like structure. In this database, data is stored as a JSON document organized in a 2-level hierarchy:

  • the top level object contains field "departments" with an array of department objects;
  • each department object has fields "name", the name of the department, and "employees", an array of employees;
  • each employee object has fields "name", "surname", "position", "salary" describing the employee.
$$ \text{departments} \quad \begin{cases} \text{name} \\ \text{employees} \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$

Here is a fragment of the dataset:

{
        "departments": [
            {
                "name": "WATER MGMNT",
                "employees": [
                    {
                        "name": "ALVA",
                        "surname": "A",
                        "position": "WATER RATE TAKER",
                        "salary": 87228
                    },
                    ...
                ]
            },
            ...
        ]
    }

Combinators

We may want to ask some questions about the data. For example,

  • What are the departments in the city of Chicago?
  • How many employees in each department?
  • What is the top salary among all the employees?
  • and so on...

Even though the raw dataset does not immediately contain any answers to these questions, it has enough information so that the answers could be inferred from the data if we are willing to write some code (we use Julia programming language).

Take a relatively complicated problem:

For each department, find the number of employees with the salary higher than $100k.

It can be solved as follows:


In [2]:
Depts_With_Num_Well_Paid_Empls(data) =
    map(d -> Dict(
            "name" => d["name"],
            "N100k" => length(filter(e -> e["salary"] > 100000, d["employees"]))),
        data["departments"])

Depts_With_Num_Well_Paid_Empls(citydb)


Out[2]:
35-element Array{Any,1}:
 Dict{ASCIIString,Any}("name"=>"WATER MGMNT","N100k"=>179)    
 Dict{ASCIIString,Any}("name"=>"POLICE","N100k"=>1493)        
 Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","N100k"=>79)
 Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","N100k"=>54)    
 Dict{ASCIIString,Any}("name"=>"STREETS & SAN","N100k"=>39)   
 ⋮                                                            
 Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","N100k"=>2)  
 Dict{ASCIIString,Any}("name"=>"POLICE BOARD","N100k"=>0)     
 Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","N100k"=>12)   
 Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","N100k"=>3)     
 Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","N100k"=>0)

Is it a good solution? Possibly. It is certainly compact, due to our use of map and filter to traverse the structure of the database. On the other hand, to write or understand code like this, one needs solid understanding of non-trivial CS concepts such as high-order and anonymous functions. One needs to be a professional programmer.

Is there a way to write this query without use of map and filter (or, equivalently, nested loops)? Indeed, there is, but to show how to do it, we need to introduce some new primitives and operations. We start with the notion of JSON combinators.

A JSON combinator is simply a function that maps JSON input to JSON output. Two trivial examples of JSON combinators are:

  • Const(val), which maps each input value to constant value val.
  • This(), which copies the input to the output without changes.

In [3]:
Const(val) = x -> val

C = Const(42)
C(true), C(42), C([1,2,3])


Out[3]:
(42,42,42)

In this example, Const(42) creates a new constant combinator. It is then applied to various input JSON values, always producing the same output.


In [4]:
This() = x -> x

I = This()
I(true), I(42), I([1,2,3])


Out[4]:
(true,42,[1,2,3])

Similarly, This() creates a new identity combinator. We test it with different input values to assure ourselves that it does not change the input.

Notice the pattern:

  • First, we create a combinator (construct a query) using combinator constructors.
  • Then, we apply the combinator (execute the query) against the data.

In short, by designing a collection of useful combinators, we are creating a query language (embedded in the host language, but this is immaterial).

Traversing the hierarchy

Now let us define a more interesting combinator. Field(name) extracts a field value from a JSON object.


In [5]:
Field(name) = x -> x[name]


Out[5]:
Field (generic function with 1 method)

In [6]:
Salary = Field("salary")
Salary(Dict("name" => "RAHM", "surname" => "E", "position" => "MAYOR", "salary" => 216210))


Out[6]:
216210

Here, to demonstrate field extractors, we defined Salary, a combinator that extracts value of field "salary" from the input JSON object.

To build interesting queries, we need a way to construct complex combinators from primitives. Let us define composition (F >> G) of combinators F and G that ties F and G by sending the output of F to the input of G.

Our first, naive attempt to implement composition is as follows:


In [7]:
import Base: >>

(F >> G) = x -> G(F(x))


Out[7]:
>> (generic function with 86 methods)

We can traverse the structure of hierarchical data by chaining field extractors with the composition operator.

$$ \textbf{departments}\gg \quad \begin{cases} \gg\textbf{name} \\ \text{employees} \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$

For example, let us find the names of all departments. We can do it by composing extractors for fields "departments" and "name".


In [8]:
Departments = Field("departments")
Name = Field("name")

Dept_Names = Departments >> Name
Dept_Names(citydb)


LoadError: indexing Array{Any,1} with types Tuple{ASCIIString} is not supported
while loading In[8], in expression starting on line 5

 in error at ./error.jl:21
 in getindex at abstractarray.jl:483
 in anonymous at In[5]:1
 in anonymous at In[7]:3
 [inlined code] from essentials.jl:114

What is going on? We expected to get a list of department names, but instead we got an error.

Here is a problem. With the current definition of the >> operator, expression

(Departments >> Name)(citydb)

is translated to

citydb["departments"]["name"]

But this fails because citydb["departments"] is a array and thus doesn't have a field called "name".

Let us demonstrate the behavior of >> on the duplicating combinator. Combinator Dup duplicates its input, that is, for any input value x, it produces an array [x, x]. See what happens when we compose Dup with itself, once or several times:


In [9]:
Dup = x -> Any[x, x]

Dup(0), (Dup >> Dup)(0), (Dup >> Dup >> Dup)(0)


Out[9]:
(Any[0,0],Any[Any[0,0],Any[0,0]],Any[Any[Any[0,0],Any[0,0]],Any[Any[0,0],Any[0,0]]])

We need composition (F >> G) to be smarter. When F produces an array, the composition should apply G to each element of the array. In addition, if G also produces array values, (F >> G) concatenates all outputs to produce a single array value.


In [10]:
(F >> G) = x -> _flat(_map(G, F(x)))

_flat(z) =
    isa(z, Array) ? foldr(vcat, [], z) : z
_map(G, y) =
    isa(y, Array) ? map(_expand, map(G, y)) : G(y)
_expand(z_i) =
    isa(z_i, Array) ? z_i : z_i != nothing ? [z_i] : []


Out[10]:
_expand (generic function with 1 method)

Let us test the updated >> operator with Dup again. We see that the output arrays are now flattened:


In [11]:
Dup(0), (Dup >> Dup)(0), (Dup >> Dup >> Dup)(0)


Out[11]:
(Any[0,0],Any[0,0,0,0],Any[0,0,0,0,0,0,0,0])

We can get back to our original example, finding the names of all departments. Now we get the result we expected.


In [12]:
Dept_Names = Departments >> Name
Dept_Names(citydb)


Out[12]:
35-element Array{Any,1}:
 "WATER MGMNT"      
 "POLICE"           
 "GENERAL SERVICES" 
 "CITY COUNCIL"     
 "STREETS & SAN"    
 ⋮                  
 "BOARD OF ETHICS"  
 "POLICE BOARD"     
 "BUDGET & MGMT"    
 "ADMIN HEARNG"     
 "LICENSE APPL COMM"

Similarly, we can list values of any attribute in the hierarchy tree. For example, let us find the names of all employees.

$$ \textbf{departments}\gg \quad \begin{cases} \text{name} \\ \gg\textbf{employees}\gg \quad \begin{cases} \gg\textbf{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$

We can do it by composing field extractors on the path from the root of the hierarchy to the "name" attribute:


In [13]:
Employees = Field("employees")

Empl_Names = Departments >> Employees >> Name
Empl_Names(citydb)


Out[13]:
32181-element Array{Any,1}:
 "ELVIA"     
 "VICENTE"   
 "MUHAMMAD"  
 "GIRLEY"    
 "DILAN"     
 ⋮           
 "NANCY"     
 "DARCI"     
 "THADDEUS"  
 "RACHENETTE"
 "MICHELLE"  

Summarizing data

Field extractors and composition give us a way to traverse the hierarchy tree. We still need a way to summarize data.

Consider a query: find the number of departments. To write it down, we need a combinator that can count the number of elements in an array.

Here is our first attempt to implement it:


In [14]:
Count() = x -> length(x)


Out[14]:
Count (generic function with 1 method)

We compose it with a combinator that generates an array of departments to calculate the number of departments:


In [15]:
Num_Depts = Departments >> Count()
Num_Depts(citydb)


Out[15]:
35-element Array{Any,1}:
 2
 2
 2
 2
 2
 ⋮
 2
 2
 2
 2
 2

But that's not what we expected! Here is the problem: the composition operator does not let Count() see the whole array. Instead, Departments >> Count() submits each array element to Count() one by one and then concatenates the outputs of Count(). Count(), when its input is a JSON object, returns the number of fields in the object (in this case, 2 fields for all department objects).

The right way to implement Count() is to add an array-producing combinator as a parameter:


In [16]:
Count(F) = x -> length(F(x))


Out[16]:
Count (generic function with 2 methods)

In [17]:
Num_Depts = Count(Departments)
Num_Depts(citydb)


Out[17]:
35

How to use composition with Count() correctly? Here is an example: show the number of employees for each department. Consider this: number of employees is a (derived) property of each department, which suggests us to compose two combinators: one generating department objects and the other calculating the number of employees for a given department. We get:


In [18]:
Num_Empls_Per_Dept = Departments >> Count(Employees)
Num_Empls_Per_Dept(citydb)


Out[18]:
35-element Array{Any,1}:
  1848
 13570
   924
   397
  2090
     ⋮
     9
     2
    43
    39
     1

On the other hand, if we'd like to calculate the total number of employees, the parameter of Count() should be the combinator that generates all the employees:


In [19]:
Num_Empls = Count(Departments >> Employees)
Num_Empls(citydb)


Out[19]:
32181

We could add other summarizing or aggregate combinators. For example, let us define a combinator that finds the maximum value in an array.


In [20]:
Max(F) = x -> maximum(F(x))


Out[20]:
Max (generic function with 1 method)

Aggregate combinators could be combined to answer complex questions. For example, let us find the maximum number of employees per department. We already have a combinator that generates the number of employees for each department, all is left is to apply Max().


In [21]:
Max_Empls_Per_Dept = Max(Num_Empls_Per_Dept) # Max(Departments >> Count(Employees))
Max_Empls_Per_Dept(citydb)


Out[21]:
13570

Constructing objects

We learned how to traverse and summarize data. Let us show how to create new structured data.

Combinator Select() constructs JSON objects. It is parameterized with a list of field names and constructors.


In [22]:
Select(fields...) =
    x -> Dict(map(f -> f.first => f.second(x), fields))


Out[22]:
Select (generic function with 1 method)

For each input, Select() constructs a new JSON object with field values generated by field constructors applied to the input.

Here is a simple example of Select() summarizing the input array:


In [23]:
S = Select("len" => Count(This()), "max" => Max(This()))
S([10, 20, 30])


Out[23]:
Dict{ASCIIString,Int64} with 2 entries:
  "max" => 30
  "len" => 3

Similarly, we can summarize any hierarchical dataset. Let us modify the query that finds the number of employees for each department. Instead of a raw list of numbers, we will generate a table with the name of the department and its size (the number of employees):


In [24]:
Depts_With_Size =
    Departments >> Select(
        "name" => Name,
        "size" => Count(Employees))

Depts_With_Size(citydb)


Out[24]:
35-element Array{Any,1}:
 Dict{ASCIIString,Any}("name"=>"WATER MGMNT","size"=>1848)    
 Dict{ASCIIString,Any}("name"=>"POLICE","size"=>13570)        
 Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","size"=>924)
 Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","size"=>397)    
 Dict{ASCIIString,Any}("name"=>"STREETS & SAN","size"=>2090)  
 ⋮                                                            
 Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","size"=>9)   
 Dict{ASCIIString,Any}("name"=>"POLICE BOARD","size"=>2)      
 Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","size"=>43)    
 Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","size"=>39)     
 Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","size"=>1) 

This query could easily be expanded to add more information about the department. For that, we only need to add extra field definitions to the Select() clause. Notably, change in one field constructor cannot in any way affect the values of the other fields.

Let us additionally determine the top salary for each department:


In [25]:
Depts_With_Size_And_Max_Salary =
    Departments >> Select(
        "name" => Name,
        "size" => Count(Employees),
        "max_salary" => Max(Employees >> Salary))

Depts_With_Size_And_Max_Salary(citydb)


Out[25]:
35-element Array{Any,1}:
 Dict{ASCIIString,Any}("name"=>"WATER MGMNT","max_salary"=>169512,"size"=>1848)    
 Dict{ASCIIString,Any}("name"=>"POLICE","max_salary"=>260004,"size"=>13570)        
 Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","max_salary"=>157092,"size"=>924)
 Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","max_salary"=>160248,"size"=>397)    
 Dict{ASCIIString,Any}("name"=>"STREETS & SAN","max_salary"=>157092,"size"=>2090)  
 ⋮                                                                                 
 Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","max_salary"=>131688,"size"=>9)   
 Dict{ASCIIString,Any}("name"=>"POLICE BOARD","max_salary"=>97728,"size"=>2)       
 Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","max_salary"=>169992,"size"=>43)    
 Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","max_salary"=>156420,"size"=>39)     
 Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","max_salary"=>69888,"size"=>1)  

Filtering data

Remember the problem we stated in the beginning: find the number of employees with the salary higher than $100k. We have almost all pieces we need to construct a solution of this problem. One piece that appears to be missing is a way to refine data. We need a combinator that, given a set of values and a predicate, produces the values that satisfy the predicate condition.

Here is how we can implement it:


In [26]:
Sieve(P) = x -> P(x) ? x : nothing


Out[26]:
Sieve (generic function with 1 method)

Combinator Sieve(P) is parameterized with a predicate combinator P. A predicate is a combinator that, for any input, returns true or false. For example, a predicate combinator (F < G) with two parameters F and G returns, for any input x, the result of comparison F(x) < G(x).

Let us implement common predicate (and also some arithmetic) combinators:


In [27]:
import Base: >, >=, <, <=, ==, !=, !, &, |, +, -, /, %

(>)(F::Function, G::Function) = x -> F(x) > G(x)
(>)(F::Function, n::Number) = F > Const(n)

(>=)(F::Function, G::Function) = x -> F(x) >= G(x)
(>=)(F::Function, n::Number) = F >= Const(n)

(<)(F::Function, G::Function) = x -> F(x) < G(x)
(<)(F::Function, n::Number) = F < Const(n)

(<=)(F::Function, G::Function) = x -> F(x) <= G(x)
(<=)(F::Function, n::Number) = F <= Const(n)

(==)(F::Function, G::Function) = x -> F(x) == G(x)
(==)(F::Function, n::Number) = F == Const(n)

(!=)(F::Function, G::Function) = x -> F(x) != G(x)
(!=)(F::Function, n::Number) = F != Const(n)

(!)(F::Function) = x -> !F(x)
(&)(F::Function, G::Function) = x -> F(x) && G(x)
(|)(F::Function, G::Function) = x -> F(x) || G(x)

(+)(F::Function, G::Function) = x -> F(x) + G(x)
(+)(F::Function, n::Number) = F + Const(n)

(-)(F::Function, G::Function) = x -> F(x) - G(x)
(-)(F::Function, n::Number) = F - Const(n)

(/)(F::Function, G::Function) = x -> F(x) / G(x)
(/)(F::Function, n::Number) = F / Const(n)

(%)(F::Function, G::Function) = x -> F(x) % G(x)
(%)(F::Function, n::Number) = F % Const(n)


Out[27]:
rem (generic function with 133 methods)

Sieve(P) tests its input on the predicate condition P. If the input satisfies the condition, it is returned without changes. Otherwise, nothing is returned.

Here is a trivial example to demonstrate how Sieve() works:


In [28]:
Take_Odd = Sieve(This() % 2 == 1)
Take_Odd(5), Take_Odd(10)


Out[28]:
(5,nothing)

When the composition operator accumulates values for array output, it drops nothing values. Thus, in a composition (F >> Sieve(P)) with an array-generating combinator F, Sieve(P) filters the elements of the array.

Let us use this feature to list the departments with more than 1000 employees. We already defined a combinator producing departments with the number of employees, we just need to filter its output:


In [29]:
Size = Field("size")

Large_Depts = Depts_With_Size >> Sieve(Size > 1000)
Large_Depts(citydb)


Out[29]:
7-element Array{Any,1}:
 Dict{ASCIIString,Any}("name"=>"WATER MGMNT","size"=>1848)  
 Dict{ASCIIString,Any}("name"=>"POLICE","size"=>13570)      
 Dict{ASCIIString,Any}("name"=>"STREETS & SAN","size"=>2090)
 Dict{ASCIIString,Any}("name"=>"AVIATION","size"=>1344)     
 Dict{ASCIIString,Any}("name"=>"FIRE","size"=>4875)         
 Dict{ASCIIString,Any}("name"=>"OEMC","size"=>1135)         
 Dict{ASCIIString,Any}("name"=>"TRANSPORTN","size"=>1200)   

Similarly, we can list positions of employees with salary higher than 200k:


In [30]:
Position = Field("position")

Very_Well_Paid_Posns =
    Departments >> Employees >> Sieve(Salary > 200000) >> Position

Very_Well_Paid_Posns(citydb)


Out[30]:
3-element Array{Any,1}:
 "SUPERINTENDENT OF POLICE"
 "FIRE COMMISSIONER"       
 "MAYOR"                   

With Sieve() defined, we are finally able to answer the original question using combinators:

For each department, find the number of employees with salary higher than 100k.


In [31]:
Better_Depts_With_Num_Well_Paid_Empls =
    Departments >> Select(
        "name" => Name,
        "N100k" => Count(Employees >> Sieve(Salary > 100000)))

Better_Depts_With_Num_Well_Paid_Empls(citydb)


Out[31]:
35-element Array{Any,1}:
 Dict{ASCIIString,Any}("name"=>"WATER MGMNT","N100k"=>179)    
 Dict{ASCIIString,Any}("name"=>"POLICE","N100k"=>1493)        
 Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","N100k"=>79)
 Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","N100k"=>54)    
 Dict{ASCIIString,Any}("name"=>"STREETS & SAN","N100k"=>39)   
 ⋮                                                            
 Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","N100k"=>2)  
 Dict{ASCIIString,Any}("name"=>"POLICE BOARD","N100k"=>0)     
 Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","N100k"=>12)   
 Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","N100k"=>3)     
 Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","N100k"=>0)

Compare it with the original solution. The new one reads much better!


In [32]:
Depts_With_Num_Well_Paid_Empls(data) =
    map(d -> Dict(
            "name" => d["name"],
            "N100k" => length(filter(e -> e["salary"] > 100000, d["employees"]))),
        data["departments"])


Out[32]:
Depts_With_Num_Well_Paid_Empls (generic function with 1 method)

Parameters

We achieved our goal of sketching (a prototype of) a query language for hierarchical databases. Let us explore how it could be developed further. One possible way to improve it is by adding query parameters.

Consider a problem: find the number of employees whose annual salary exceeds 200k. We have all the tools to solve it:


In [33]:
Num_Very_Well_Paid_Empls =
    Count(Departments >> Employees >> Sieve(Salary >= 200000))

Num_Very_Well_Paid_Empls(citydb)


Out[33]:
3

Now, imagine that we'd like to find the number of employees with salary in a certain range, but we don't know the range at the time we construct the query. Instead, we want to specify the range when we execute the query.

Let us introduce a query context, a collection of parameters and their values. We'd like the query context to travel with the input, where each combinator could access it if necessary. Thus, we have an updated definition of a JSON combinator: a function that maps JSON input and query context to JSON output.

We need to update existing combinators to make them context-aware:


In [34]:
Const(val) = (x, ctx...) -> val
This() = (x, ctx...) -> x

(F >> G) = (x, ctx...) -> _flat(_map(G, F(x, ctx...), ctx...))
_map(G, y, ctx...) =
    isa(y, Array) ? map(_expand, map(yi -> G(yi, ctx...), y)) : G(y, ctx...)

Field(name) = (x, ctx...) -> x[name]
Select(fields...) =
    (x, ctx...) -> Dict(map(f -> f.first => f.second(x, ctx...), fields))

Count(F) = (x, ctx...) -> length(F(x, ctx...))
Max(F) = (x, ctx...) -> maximum(F(x, ctx...))

Sieve(P) = (x, ctx...) -> P(x, ctx...) ? x : nothing
(>)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) > G(x, ctx...)
(>=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) >= G(x, ctx...)
(<)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) < G(x, ctx...)
(<=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) <= G(x, ctx...)
(==)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) == G(x, ctx...)
(!=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) != G(x, ctx...)
(!)(F::Function) = (x, ctx...) -> !F(x, ctx...)
(&)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) && G(x, ctx...)
(|)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) || G(x, ctx...)
(+)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) + G(x, ctx...)
(-)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) - G(x, ctx...)
(/)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) / G(x, ctx...)
(%)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) % G(x, ctx...)


Out[34]:
rem (generic function with 133 methods)

Next, let us add combinator Var(name) that extracts the value of a parameter from the query context.


In [35]:
Var(name) = (x, ctx...) -> Dict(ctx)[name]


Out[35]:
Var (generic function with 1 method)

Now we can make parameterized queries. Find the number of employees with salary in a certain range:


In [36]:
Min_Salary = Var("min_salary")
Max_Salary = Var("max_salary")

Departments = Field("departments")
Employees = Field("employees")
Salary = Field("salary")

Num_Empls_By_Salary =
    Count(
        Departments >>
        Employees >>
        Sieve((Salary >= Min_Salary) & (Salary < Max_Salary)))

Num_Empls_By_Salary(citydb, "min_salary" => 100000, "max_salary" => 200000)


Out[36]:
3916

Use of context is not limited to query parameters. We can also update the context dynamically.

Consider a problem: find the employee with the highest salary.

It can be solved in two queries. First, find the highest salary:


In [37]:
Max_Salary = Max(Departments >> Employees >> Salary)
Max_Salary(citydb)


Out[37]:
260004

Second, find the employee with the given salary:


In [38]:
The_Salary = Var("salary")

Empl_With_Salary = Departments >> Employees >> Sieve(Salary == The_Salary)
Empl_With_Salary(citydb, "salary" => 260004)


Out[38]:
1-element Array{Any,1}:
 Dict{AbstractString,Any}("name"=>"GARRY","surname"=>"M","position"=>"SUPERINTENDENT OF POLICE","salary"=>260004)

We need to automate this sequence of operations. Specifically, we take a value calculated by one combinator, assign it to some context parameter, and then evaluate the other combinator in the updated context. That's what Given() combinator does:


In [39]:
Given(F, vars...) =
    (x, ctx...) ->
        let ctx = (ctx..., map(v -> v.first => v.second(x, ctx...), vars)...)
            F(x, ctx...)
        end


Out[39]:
Given (generic function with 1 method)

Combining Max_Salary and Empl_With_Salary using Given, we get:


In [40]:
Empl_With_Max_Salary = # Given(Empl_With_Salary, "salary" => Max_Salary)
    Given(
        Departments >> Employees >> Sieve(Salary == The_Salary),
        "salary" => Max(Departments >> Employees >> Salary))

Empl_With_Max_Salary(citydb)


Out[40]:
1-element Array{Any,1}:
 Dict{AbstractString,Any}("name"=>"GARRY","surname"=>"M","position"=>"SUPERINTENDENT OF POLICE","salary"=>260004)

This is not just a convenience feature. Indeed, let us change this query to find the highest paid employee for each department. To implement it, we need to pull Departments from the Given() clause:


In [41]:
Empls_With_Max_Salary_By_Dept =
    Departments >> Given(
        Employees >> Sieve(Salary == The_Salary),
        "salary" => Max(Employees >> Salary))

Empls_With_Max_Salary_By_Dept(citydb)


Out[41]:
35-element Array{Any,1}:
 Dict{AbstractString,Any}("name"=>"THOMAS","surname"=>"P","position"=>"COMMISSIONER OF WATER MGMT","salary"=>169512)                
 Dict{AbstractString,Any}("name"=>"GARRY","surname"=>"M","position"=>"SUPERINTENDENT OF POLICE","salary"=>260004)                   
 Dict{AbstractString,Any}("name"=>"DAVID","surname"=>"R","position"=>"COMMISSIONER OF FLEET & FACILITY MANAGEMENT","salary"=>157092)
 Dict{AbstractString,Any}("name"=>"MARLA","surname"=>"K","position"=>"CHIEF ADMINISTRATIVE OFFICER","salary"=>160248)               
 Dict{AbstractString,Any}("name"=>"CHARLES","surname"=>"W","position"=>"COMMISSIONER OF STREETS AND SANITATION","salary"=>157092)   
 ⋮                                                                                                                                  
 Dict{AbstractString,Any}("name"=>"STEVEN","surname"=>"B","position"=>"EXECUTIVE DIR - BOARD OF ETHICS","salary"=>131688)           
 Dict{AbstractString,Any}("name"=>"MAX","surname"=>"C","position"=>"EXECUTIVE DIR - POLICE BOARD","salary"=>97728)                  
 Dict{AbstractString,Any}("name"=>"ALEXANDRA","surname"=>"H","position"=>"BUDGET DIR","salary"=>169992)                             
 Dict{AbstractString,Any}("name"=>"PATRICIA","surname"=>"J","position"=>"DIR OF ADMINISTRATIVE HEARINGS","salary"=>156420)          
 Dict{AbstractString,Any}("name"=>"MICHELLE","surname"=>"G","position"=>"STAFF ASST","salary"=>69888)                               

Limitations and conclusion

Consider a problem: find the top salary for each department. This is an easy one:


In [42]:
Max_Salary_By_Dept = Departments >> Max(Employees >> Salary)
Max_Salary_By_Dept(citydb)


Out[42]:
35-element Array{Any,1}:
 169512
 260004
 157092
 160248
 157092
      ⋮
 131688
  97728
 169992
 156420
  69888

Now change it to: find the top salary for each position. We can't solve it with our current set of combinators. Why?

Look at the database hierarchy diagram:

$$ \text{departments} \quad \begin{cases} \text{name} \\ \text{employees} \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$

The structure of the first query (top salary for each department) fits the structure of the database:

$$ \textbf{departments}\gg \quad \begin{cases} \text{name} \\ \gg\textbf{employees}\gg \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \gg\textbf{salary} \end{cases} \end{cases} $$

The structure of the second query (top salary for each position) violates this structure:

$$ \text{departments} \quad \begin{cases} \text{name} \\ \textbf{employees}\lessgtr \quad \begin{cases} \text{name} \\ \text{surname} \\ \ll\textbf{position} \\ \gg\textbf{salary} \end{cases} \end{cases} $$

This is not the only limitation. Let us not forget that real databases are decidedly non-hierarchical. For example, this is the database schema (designed by Charles Tirrell) of our flagship product RexStudy. No hierarchy in sight!

As a conclusion, combinators are awesome for querying data as long as:

  1. The data is hierarchical.
  2. The structure of the query respects the structure of the data.

Otherwise, we are out of luck...

... Or are we?