Querying JSON Databases

This notebook demonstrates how to query hierarchical databases with our toy query language BQL.

We start with loading a demo JSON dataset with data on departments and employees of the city of Chicago (source). In general, we can treat any JSON document with a regular structure as a hierarchical database.


In [1]:
ENV["LINES"] = 15
include("../citydb_json.jl")

citydb


Out[1]:
Dict{AbstractString,Any} with 1 entry:
  "departments" => Any[Dict{AbstractString,Any}("name"=>"WATER MGMNT","employee…

This is the structure of the citydb database:

{
        "departments": [
            {
                "name": ...,
                "employees": [
                    {
                        "name": ...,
                        "surname": ...,
                        "position": ...,
                        "salary": ...
                    },
                    ... ]
            },
            ... ]
    }

The top-level City object has the following fields:

  • departments: an array of department objects.

Department objects have the following fields:

  • name: the name of the department.
  • employees: an array of employee objects.

Employee objects have the following fields:

  • name: employee's first name.
  • surname: employee's last name.
  • position: employee's title.
  • salary: annual salary of the employee.

Next, we import the BQL library.


In [2]:
push!(LOAD_PATH, "..")

using BQL

The BQL query language is embedded in Julia, which means any BQL query is a regular Julia function which maps JSON input to JSON output. We call such functions JSON combinators.

Two trivial examples of JSON combinators are:

  • Const(val), which maps all input to the same output value;
  • This(), which returns its input unchanged.

In [3]:
C = Const(42)
C(nothing), C(42), C([1, 2, 3])


Out[3]:
(nothing,42,42)

In [4]:
I = This()
I(nothing), I(42), I([1, 2, 3])


Out[4]:
(nothing,42,[1,2,3])

More impressive is combinator Field(name) that extracts a field value from a JSON object.


In [5]:
F = Field(:x)
F(Dict("x" => 24, "y" => 42))


Out[5]:
24

By composing two field extractors, we can build a query that produces the names of all departments.


In [6]:
Departments = Field(:departments)
Name = Field(:name)

Dept_Names = Departments >> Name
dept_names = Dept_Names(citydb)

dept_names


Out[6]:
35-element Array{Any,1}:
 "WATER MGMNT"      
 "POLICE"           
 "GENERAL SERVICES" 
 "CITY COUNCIL"     
 "STREETS & SAN"    
 ⋮                  
 "BOARD OF ETHICS"  
 "POLICE BOARD"     
 "BUDGET & MGMT"    
 "ADMIN HEARNG"     
 "LICENSE APPL COMM"

What does the >> operator do exactly? Fundamentally, (A >> B) composes A and B by sending the output of A to the input of B.

$$ (A \gg B):\; x \;\overset{A}{\longmapsto}\; y \;\overset{B}{\longmapsto}\; z \quad \text{(where $y = A(x),\, z = B(y)$)} $$

However, if we directly apply this rule to evaluate the expression (Departments >> Name)(citydb), we will fail because citydb['departments']['name'] does not exist.

To make this work, we need to clarify the composition rule. Namely, expression (A >> B)(x), when A(x) is an array, applies B to each element of the array.

$$ (A \gg B):\; x \;\overset{A}{\longmapsto}\; [y_1,\, y_2,\, \ldots] \;\overset{B}{\longmapsto}\; [z_1,\, z_2,\, \ldots] \quad \text{(when $A(x) = [y_1,\, y_2\, \ldots],\, B(y_k) = z_k$)} $$

Moreover, when B itself produces array values, all B outputs are combined into one array, which becomes the output of (A >> B).

$$ (A \gg B):\; x \;\overset{A}{\longmapsto}\; [y_1,\, y_2,\, \ldots] \;\overset{B}{\longmapsto}\; [z_{11},\, z_{12},\, \ldots\, z_{21},\, z_{22},\, \ldots] \quad \text{(when also $B(y_k)$ are arrays $[z_{k1},\, z_{k2},\, \ldots]$)} $$

The last feature is used when we list the names of all employees.


In [7]:
Employees = Field(:employees)

Empl_Names = Departments >> Employees >> Name
empl_names = Empl_Names(citydb)

empl_names


Out[7]:
32181-element Array{Any,1}:
 "ELVIA"     
 "VICENTE"   
 "MUHAMMAD"  
 "GIRLEY"    
 "DILAN"     
 ⋮           
 "NANCY"     
 "DARCI"     
 "THADDEUS"  
 "RACHENETTE"
 "MICHELLE"  

Dual to Field(name), combinator Select(...) constructs JSON objects. Parameters of Select(...) are combinators that construct object fields. Here is a trivial example.


In [8]:
S = Select(:x => Const(42), :y => This())
S(24)


Out[8]:
Dict{AbstractString,Any} with 2 entries:
  "x" => 42
  "y" => 24

Let us use Select(...) to generate the name and the number of employees for each department.


In [9]:
Depts_With_Size = Departments >> Select(:name => Name, :size => Count(Employees))
Depts_With_Size(citydb)


Out[9]:
35-element Array{Any,1}:
 Dict{AbstractString,Any}("name"=>"WATER MGMNT","size"=>1848)    
 Dict{AbstractString,Any}("name"=>"POLICE","size"=>13570)        
 Dict{AbstractString,Any}("name"=>"GENERAL SERVICES","size"=>924)
 Dict{AbstractString,Any}("name"=>"CITY COUNCIL","size"=>397)    
 Dict{AbstractString,Any}("name"=>"STREETS & SAN","size"=>2090)  
 ⋮                                                               
 Dict{AbstractString,Any}("name"=>"BOARD OF ETHICS","size"=>9)   
 Dict{AbstractString,Any}("name"=>"POLICE BOARD","size"=>2)      
 Dict{AbstractString,Any}("name"=>"BUDGET & MGMT","size"=>43)    
 Dict{AbstractString,Any}("name"=>"ADMIN HEARNG","size"=>39)     
 Dict{AbstractString,Any}("name"=>"LICENSE APPL COMM","size"=>1) 

Here, combinator Count(Employees) returns the length of the employees array. In general, Count(F) lets F process its input expecting the output of F to be an array, then returns the length of the array.

$$ \operatorname{Count}(F):\; x \;\overset{F}{\longmapsto}\; [y_1,\, y_2,\, \ldots\, y_N] \;\overset{\operatorname{length}}{\longmapsto}\; N $$

(You may've expected Employees >> Count(), but that'd make operator >> non-associative).

Array combinators such as Count(...) are called aggregate combinators. The following aggregate combinators are defined in BQL: Count(), Min(), Max(), First().


In [10]:
Num_Depts = Count(Departments)
Num_Depts(citydb)


Out[10]:
35

In [11]:
Salary = Field(:salary)

Top_Salary = Max(Departments >> Employees >> Salary)
Top_Salary(citydb)


Out[11]:
260004

In [12]:
One_Empl = First(Departments >> Employees)
One_Empl(citydb)


Out[12]:
Dict{AbstractString,Any} with 4 entries:
  "name"     => "ELVIA"
  "surname"  => "A"
  "position" => "WATER RATE TAKER"
  "salary"   => 88968

In [13]:
Three_Depts = First(Departments >> Name, Const(3))
Three_Depts(citydb)


Out[13]:
3-element Array{Any,1}:
 "WATER MGMNT"     
 "POLICE"          
 "GENERAL SERVICES"

In [14]:
Half_Depts = First(Departments >> Name, Count(Departments)//2)
Half_Depts(citydb)


Out[14]:
17-element Array{Any,1}:
 "WATER MGMNT"     
 "POLICE"          
 "GENERAL SERVICES"
 "CITY COUNCIL"    
 "STREETS & SAN"   
 ⋮                 
 "OEMC"            
 "TRANSPORTN"      
 "HEALTH"          
 "MAYOR'S OFFICE"  
 "LAW"             

Combinator Sieve(P) applies predicate P to its input. If the predicate condition is not satisfied, the input is dropped, otherwise it is returned unchanged. Let us use Sieve() to find the departments with more than 1000 employees.


In [15]:
Size = Field(:size)

Large_Depts = Depts_With_Size >> Sieve(Size > 1000)
Large_Depts(citydb)


Out[15]:
7-element Array{Any,1}:
 Dict{AbstractString,Any}("name"=>"WATER MGMNT","size"=>1848)  
 Dict{AbstractString,Any}("name"=>"POLICE","size"=>13570)      
 Dict{AbstractString,Any}("name"=>"STREETS & SAN","size"=>2090)
 Dict{AbstractString,Any}("name"=>"AVIATION","size"=>1344)     
 Dict{AbstractString,Any}("name"=>"FIRE","size"=>4875)         
 Dict{AbstractString,Any}("name"=>"OEMC","size"=>1135)         
 Dict{AbstractString,Any}("name"=>"TRANSPORTN","size"=>1200)   

Here, combinator Depts_With_Size, which adds size field to each department object, is composed with combinator Sieve(Size > 1000), which gathers the departments that satisfy condition Size > 1000.

In the following example, we use Sieve() to find the number of employees whose annual salary exceeds 200k.


In [16]:
Num_Well_Paid_Empls =
    Count(Departments >> Employees >> Sieve(Salary >= 200000))
Num_Well_Paid_Empls(citydb)


Out[16]:
3

Now suppose we'd like to find the number of employees with salary in a certain range, but we don't know the range in advance. In this case, we can construct a parameterized query.


In [17]:
Min_Salary = Var(:min_salary)
Max_Salary = Var(:max_salary)

Num_Empls_By_Salary_Range =
    Count(Departments >> Employees >> Sieve((Salary >= Min_Salary) & (Salary < Max_Salary)));

To run the Num_Empls_By_Salary_Range query, we need to supply it with parameters min_salary and max_salary.


In [18]:
Num_Empls_By_Salary_Range(citydb, Dict("min_salary" => 200000, "max_salary" => 1000000))


Out[18]:
3

In [19]:
Num_Empls_By_Salary_Range(citydb, Dict("min_salary" => 100000, "max_salary" => 200000))


Out[19]:
3916

In [20]:
Num_Empls_By_Salary_Range(citydb, Dict("min_salary" => 0, "max_salary" => 100000))


Out[20]:
28262

The query knows which parameters it needs.


In [21]:
vars(Num_Empls_By_Salary_Range)


Out[21]:
Set(Any["min_salary","max_salary"])

The last feature we discuss here is an ability to assign parameter values dynamically.

Consider a query: find the top salary for each department. It could be easily implemented using Max() aggregate.


In [22]:
Depts_With_Max_Salary =
    Departments >> Select(:name => Name, :max_salary => Max(Employees >> Salary))
Depts_With_Max_Salary(citydb)


Out[22]:
35-element Array{Any,1}:
 Dict{AbstractString,Any}("name"=>"WATER MGMNT","max_salary"=>169512)     
 Dict{AbstractString,Any}("name"=>"POLICE","max_salary"=>260004)          
 Dict{AbstractString,Any}("name"=>"GENERAL SERVICES","max_salary"=>157092)
 Dict{AbstractString,Any}("name"=>"CITY COUNCIL","max_salary"=>160248)    
 Dict{AbstractString,Any}("name"=>"STREETS & SAN","max_salary"=>157092)   
 ⋮                                                                        
 Dict{AbstractString,Any}("name"=>"BOARD OF ETHICS","max_salary"=>131688) 
 Dict{AbstractString,Any}("name"=>"POLICE BOARD","max_salary"=>97728)     
 Dict{AbstractString,Any}("name"=>"BUDGET & MGMT","max_salary"=>169992)   
 Dict{AbstractString,Any}("name"=>"ADMIN HEARNG","max_salary"=>156420)    
 Dict{AbstractString,Any}("name"=>"LICENSE APPL COMM","max_salary"=>69888)

Now let us ask a slightly different question: find the employees with the highest salary at their department. We may try to use the Sieve() combinator as follows.


In [23]:
Highest_Paid_Empls_By_Dept =
    Departments >> Employees >> Sieve(Salary == Max_Salary);

But the filter condition (Salary == Max_Salary) is problematic since we cannot supply max_salary as a query parameter. Instead it must be calculated dynamically for each department. The Given(...) combinator does exactly that.


In [24]:
Highest_Paid_Empls_By_Dept =
    Departments >>
    Given(
        Employees >> Sieve(Salary == Max_Salary),
        :max_salary => Max(Employees >> Salary))
    
Highest_Paid_Empls_By_Dept(citydb)


Out[24]:
35-element Array{Any,1}:
 Dict{AbstractString,Any}("name"=>"THOMAS","surname"=>"P","position"=>"COMMISSIONER OF WATER MGMT","salary"=>169512)                
 Dict{AbstractString,Any}("name"=>"GARRY","surname"=>"M","position"=>"SUPERINTENDENT OF POLICE","salary"=>260004)                   
 Dict{AbstractString,Any}("name"=>"DAVID","surname"=>"R","position"=>"COMMISSIONER OF FLEET & FACILITY MANAGEMENT","salary"=>157092)
 Dict{AbstractString,Any}("name"=>"MARLA","surname"=>"K","position"=>"CHIEF ADMINISTRATIVE OFFICER","salary"=>160248)               
 Dict{AbstractString,Any}("name"=>"CHARLES","surname"=>"W","position"=>"COMMISSIONER OF STREETS AND SANITATION","salary"=>157092)   
 ⋮                                                                                                                                  
 Dict{AbstractString,Any}("name"=>"STEVEN","surname"=>"B","position"=>"EXECUTIVE DIR - BOARD OF ETHICS","salary"=>131688)           
 Dict{AbstractString,Any}("name"=>"MAX","surname"=>"C","position"=>"EXECUTIVE DIR - POLICE BOARD","salary"=>97728)                  
 Dict{AbstractString,Any}("name"=>"ALEXANDRA","surname"=>"H","position"=>"BUDGET DIR","salary"=>169992)                             
 Dict{AbstractString,Any}("name"=>"PATRICIA","surname"=>"J","position"=>"DIR OF ADMINISTRATIVE HEARINGS","salary"=>156420)          
 Dict{AbstractString,Any}("name"=>"MICHELLE","surname"=>"G","position"=>"STAFF ASST","salary"=>69888)                               

Notably, Highest_Paid_Empls_By_Dept requires no parameters despite the fact that its definition refers to max_salary.


In [25]:
vars(Highest_Paid_Empls_By_Dept)


Out[25]:
0-element Array{Any,1}