The basics of using JSON within Db2 were covered in a previous notebook. This notebook covers some of the advanced techniques required to access arrays and structures that may be nested within a JSON document. The first step (as with any notebook!) is to load the Db2 Jupyter magic commands.
In [ ]:
%run db2.ipynb
This section will work with a larger table that will give us some idea of the performance of using JSON within a Db2 database. A later step will generate 25,000 customer records that have a structure similar to this:
{
  "customerid": 100000,
  "identity": {
    "firstname": "Jacob",
    "lastname": "Hines",
    "birthdate": "1982-09-18"
  },
  "contact": {
    "street": "Main Street North",
    "city": "Amherst",
    "state": "OH",
    "zipcode": "44001",
    "email": "Ja.Hines@yahii.com",
    "phone": "813-689-8309"
  },
  "payment": {
    "card_type": "MCCD",
    "card_no": "4742-3005-2829-9227"
  },
  "purchases": [
    {
      "tx_date": "2018-02-14",
      "tx_no": 157972,
      "product_id": 1860,
      "product": "Ugliest Snow Blower",
      "quantity": 1,
      "item_cost": 51.86
    },...
  ]
}
The JSON record contains four distinct pieces of information about each customer:

- identity - the customer's name and date of birth
- contact - street, city, state, zip code, email, and phone number
- payment - the credit card type and number
- purchases - an array of the customer's purchases

The purchases structure contains information on the customer's purchases. For each purchased item there is the following information:

- tx_date - the date of the transaction
- tx_no - the transaction number
- product_id - the numeric identifier of the product
- product - the product name
- quantity - the number of items purchased
- item_cost - the cost of a single item

If this were a relational database you would probably split these fields up into different tables and use join techniques to bring the information back together. In a JSON document we are able to keep all of this information in one record, which makes retrieval of an individual customer's purchases easier, as the short sketch below illustrates.
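To make the nesting concrete, here is a short standalone Python sketch (using an abbreviated copy of the sample record above, with illustrative values) that shows how the nested structures and the purchases array are addressed once the document is parsed:

In [ ]:
import json

# An abbreviated customer document matching the structure shown above
record = '''{"customerid": 100000,
  "identity": {"firstname": "Jacob", "lastname": "Hines", "birthdate": "1982-09-18"},
  "contact": {"street": "Main Street North", "city": "Amherst", "state": "OH",
              "zipcode": "44001", "email": "Ja.Hines@yahii.com", "phone": "813-689-8309"},
  "payment": {"card_type": "MCCD", "card_no": "4742-3005-2829-9227"},
  "purchases": [{"tx_date": "2018-02-14", "tx_no": 157972, "product_id": 1860,
                 "product": "Ugliest Snow Blower", "quantity": 1, "item_cost": 51.86}]}'''

rec = json.loads(record)
print(rec['identity']['firstname'], rec['identity']['lastname'])    # nested structure
for item in rec['purchases']:                                       # nested array
    print(item['product'], item['quantity'] * item['item_cost'])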
In [ ]:
%sql connect
We need to create JSON records for loading into the table. This next command will run the generate_json.ipynb file, which contains code to generate JSON records. If you want to change the number of records created or the data used, edit the file to modify the defaults. The program will display its progress as it runs.
In [ ]:
%run generate_json.ipynb
The table that we create for JSON data will include a column with the actual JSON record and another column as an identifier for the row. The load statement in the next section of code will insert the customerid from the JSON record into the CUSTNO column.
In [ ]:
%%sql -q
DROP TABLE CUSTOMERS_BSON;
CREATE TABLE CUSTOMERS_BSON
(
CUSTNO INT NOT NULL,
DETAILS BLOB(2048) INLINE LENGTH 2048
);
The next piece of code will import the data into Db2. The Python code will read the customers.js file, parse it, and then insert it into the CUSTOMERS_BSON table. Note that the entire JSON record is stored in the DETAILS column, while the CUSTNO column is set to the customerid field found in the JSON record.
In [ ]:
import io
import json
import time

print("Starting Load")
start_time = time.time()
%sql autocommit off
x = %sql prepare INSERT INTO CUSTOMERS_BSON VALUES (?,SYSTOOLS.JSON2BSON(?))
if (x != False):
    i = 0
    with open("customers.js","r") as records:
        for record in records:
            rec = json.loads(record)
            custno = rec['customerid']        # CUSTNO comes from the customerid field
            i += 1
            rc = %sql execute :x using :custno, :record
            if (rc == False): break
            if ((i % 5000) == 0):             # Commit every 5000 rows
                print(str(i)+" rows read.")
                %sql commit hold
    %sql commit work
%sql autocommit on
end_time = time.time()
print('Total load time for {:d} records is {:.2f} seconds'.format(i,end_time-start_time))
customer_bson = end_time - start_time
To double-check the JSON format we retrieve one record from the table.
In [ ]:
%sql -j select systools.bson2json(details) from customers_bson fetch first 1 row only
The table does not currently have an index defined on it. We could have created the table with a primary key but chose not to in order to speed up the load step. The next statement will create the index for our table.
In [ ]:
%sql CREATE UNIQUE INDEX IX_CUST_BSON ON CUSTOMERS_BSON(CUSTNO)
The Db2 notebook on JSON features described how to retrieve individual fields from a record by using the JSON_VAL function. The following code retrieves the first and last name of customer number 100000.
In [ ]:
%%sql
SELECT JSON_VAL(DETAILS,'identity.firstname','s:32') AS FIRSTNAME,
JSON_VAL(DETAILS,'identity.lastname','s:32') AS LASTNAME
FROM CUSTOMERS
WHERE CUSTNO = 100000
In order to make JSON retrieval easier in Jupyter notebooks, the Db2 Jupyter %sql command allows for the creation of macros. Macros are used to expand text in a SQL statement: a macro does not execute any code, it simply inserts text into the SQL statement before it runs. An example of a macro is found below.
In [ ]:
%%sql macro helloworld
echo Everything you said is here: {1}
exit
In [ ]:
%sql %helloworld(Hello World)
A number of macros can be found in the db2json.ipynb file. We run that file in the next step to load these macros.
In [ ]:
%run db2json.ipynb
To use a macro, we precede it with the % character and enclose any parameters in parentheses:

SELECT %js(details,identity.firstname,32) FROM CUSTOMERS WHERE CUSTNO = 100000
The %js macro expands into the corresponding JSON_VAL(details, ...) call in the SQL. To echo the contents of what the macro produces, we need to include the -e flag in the SQL statement.
In [ ]:
%%sql -e
SELECT %js(details,identity.firstname,32)
FROM CUSTOMERS
WHERE CUSTNO = 100000
You will notice that a green box is placed around the generated code and that the SQL has been modified by the macro to include the full JSON_VAL function. This makes the SQL easier to create (and hopefully less error prone) when dealing with JSON objects. There are 5 macros that have been predefined for JSON queries, along with the -e flag:

- -e - This flag turns on the display of the generated SQL. This is useful if you want to cut and paste the final SQL into an application. If you do not use the -e flag, the generated code will not be displayed.
- %js - Return a string from a JSON field
- %ji - Return an integer from a JSON field
- %jd - Return a number (decimal) from a JSON field
- %jdate - Return a date from a JSON field
- %jsonarray - Expand a JSON array into a temporary table (described below)

The format for the %ji (integer) and %jdate (date) macros is:

%ji(json_column,json_field)
%jdate(json_column,json_field)
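For example, the following query (using fields from the sample document shown earlier) returns the zip code and birth date for one customer:

In [ ]:
%%sql
SELECT %ji(details,contact.zipcode) AS ZIPCODE,
       %jdate(details,identity.birthdate) AS BIRTHDATE
FROM CUSTOMERS
WHERE CUSTNO = 100000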
Note that you do not need to enclose the column or field names in quotes unless they contain special characters. Quotes are stripped out of the parameters, but the macros will add them back into the SQL function where necessary. Also, field names are case sensitive in JSON, so make sure you use the proper spelling for the field you are trying to retrieve. The name of the column that contains the JSON data is not case sensitive.
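As an illustration of the case sensitivity (a sketch: the mis-cased field name simply fails to match anything in the document and returns NULL):

In [ ]:
%%sql
SELECT %js(details,identity.firstname,32) AS CORRECT_CASE,
       %js(details,identity.FIRSTNAME,32) AS WRONG_CASE
FROM CUSTOMERS
WHERE CUSTNO = 100000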
The %jd macro works similarly to the %ji macro except that it accepts an additional decimal formatting operator:

%jd(json_column, json_field, 'digits,decimals')

A decimal number is formatted as x,y where x is the total number of digits and y is the number of digits after the decimal point. So a value of 5,2 means 5 digits in total with 2 of them after the decimal point, enough for values up to 999.99.
The %js (JSON string) macro has the following format:

%js(column,field,length)

The macro needs to know which JSON column to use, the field that is being searched for, and the length of the value to return. Note that you do not use the JSON_VAL specification ('s:255') here; instead you supply the length of the return string as an integer.
The next SQL command illustrates the use of these macros to return the customer's first name, last name, address, zip code, date of birth, and phone number from the JSON record.
In [ ]:
%%sql -e
SELECT
%js(details,identity.firstname,12) AS FIRSTNAME,
%js(details,identity.lastname,15) AS LASTNAME,
%js(details,contact.street,30) AS STREET,
%js(details,contact.city,20) AS CITY,
%js(details,contact.state,2) AS STATE,
%ji(details,contact.zipcode) AS ZIPCODE,
%jdate(details,identity.birthdate) AS BIRTHDATE,
%js(details,contact.phone,12) AS PHONE
FROM CUSTOMERS
WHERE CUSTNO = 100000
Using the -e flag is useful when debugging any SQL that uses JSON functions. In addition, it gives you the raw SQL that is generated so that you can cut and paste it into your own application. The examples with strings use the following format:

TRIM(JSON_VAL(...'s:12'))

If you use a specification that is too small to hold the field (s:12), JSON_VAL will return a NULL value rather than a truncated string. For this reason, you may want to use a larger size; the value will be trimmed down to its actual length.
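For example, assuming the stored street name is longer than 12 characters, the first column below comes back NULL while the second returns the trimmed value:

In [ ]:
%%sql
SELECT JSON_VAL(DETAILS,'contact.street','s:12') AS TOO_SMALL,
       TRIM(JSON_VAL(DETAILS,'contact.street','s:64')) AS FULL_VALUE
FROM CUSTOMERS
WHERE CUSTNO = 100000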
The final macro included in the list is the %jsonarray() macro. The macro call has the following format:

%jsonarray(table_name, pk, json_column, array, [optional where clause])

The fields are:

- table_name - the table that we are accessing
- pk - the primary key of the table
- json_column - the column that contains the JSON data
- array - the field that contains an array of items (which must be a structure and not individual values)
- where clause - an optional expression to limit the result set

The %jsonarray macro is used in situations where you want to retrieve the contents of an array within a JSON record. The macro will create a WITH clause specification where the resulting table is called JSONARRAY and has two columns: PK for the primary key of the array record and ITEM for the items returned. The following example shows how you would retrieve all of the purchases that a customer made using the JSON_TABLE function directly.
In [ ]:
%%sql
with purchases(items) as
(select systools.json2bson(items.value)
from customers,
table( systools.json_table(customers.details,'purchases','s:2048')) as items
where custno = 100000
)
select %ji(items,product_id),%js(items,product,32),%ji(items,quantity),%jd(items,item_cost,'7,2') from purchases
This code can be simplified by using the %jsonarray
macro instead.
In [ ]:
%%sql -e
WITH
%jsonarray(customers,custno,details,purchases,where="custno = 100000")
SELECT
%ji(item,product_id),%js(item,product,32),%ji(item,quantity),%jd(item,item_cost,'7,2') from jsonarray
If you display the SQL that was created, you will see the temporary table specification jsonarray(pk,item) that was created by the macro. You must start the SQL statement with the WITH clause; otherwise the syntax will be incorrect.
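Based on the manual JSON_TABLE query shown earlier, the text generated by the macro looks approximately like this (an approximation for illustration only; use the -e flag to see the exact expansion):

WITH JSONARRAY(PK, ITEM) AS
  (SELECT CUSTNO, SYSTOOLS.JSON2BSON(ITEMS.VALUE)
     FROM CUSTOMERS,
          TABLE( SYSTOOLS.JSON_TABLE(CUSTOMERS.DETAILS,'purchases','s:2048') ) AS ITEMS
    WHERE CUSTNO = 100000)
...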
Now that we can access the data using the JSON macros, we can start doing some analysis on the records. For each of these queries, remember that no indexes are being used to determine the answer. This first query shows the total sales per state. To determine this, we must take the array of purchases by each customer, calculate the total cost of their orders, and sum those up by state.
In [ ]:
%%sql -pb
WITH
%jsonarray(customers,custno,details,purchases),
SALES(state, amount) AS
(
SELECT %js(c.details,contact.state,2), sum(%ji(j.item,quantity) * %jd(j.item,item_cost,'7,2'))
from jsonarray j, customers c
where j.pk = c.custno
GROUP BY %js(c.details,contact.state,2)
)
SELECT state, sum(amount) from SALES
GROUP BY state
ORDER BY 2 DESC
This calculation gives us the number of times a product was ordered. It doesn't tell us the quantity sold, since we need to look at the individual purchases to see how many units of the product the customer actually ordered; the second query below sums the quantity field to get that number. Both SQL commands use the -a flag so that the output is displayed in a table, making it easier to scroll through the results.
In [ ]:
%%sql -a
WITH
%jsonarray(customers,custno,details,purchases),
SALES(product, quantity) AS (
SELECT %js(j.item,product,32),count(*)
from jsonarray j
GROUP BY %js(j.item,product,32)
)
SELECT product, quantity from SALES
ORDER BY 2 DESC
In [ ]:
%%sql -a
WITH
%jsonarray(customers,custno,details,purchases),
SALES(product, quantity) AS (
SELECT %js(j.item,product,32), %ji(j.item,quantity)
from jsonarray j
)
SELECT product, sum(quantity) from SALES
GROUP BY product
ORDER BY 2 DESC
Here we need to get the product counts by state, and then pick the maximum quantity found in the answer set. The steps we need to perform in the SQL are:

- Sum the quantity sold for each product, grouped by state
- Find the maximum quantity sold within each state
- Join the two result sets to find the product(s) whose sales match the state maximum

The next SQL statement will get us the results. Note that it is possible to have more than one product be the top seller in a state. Note: This SQL may take a while to execute.
In [ ]:
%%sql -a
WITH
%jsonarray(customers,custno,details,purchases),
SALES(product, state, quantity) AS (
SELECT %js(j.item,product,32), %js(c.details,contact.state,2), SUM(%ji(j.item,quantity))
from jsonarray j, customers c
where j.pk = c.custno
GROUP BY %js(j.item,product,32), %js(c.details,contact.state,2)
),
MAXSALES(state, total) AS (
SELECT STATE, MAX(quantity)
FROM SALES
GROUP BY STATE
)
SELECT s.state, s.product, m.total
FROM SALES s, MAXSALES m
WHERE
s.state = m.state and
s.quantity = m.total
ORDER BY s.product desc
As you can see from the previous examples, it is possible to run analytics against the JSON data. However, it requires some sophisticated techniques for manipulating (and converting) the JSON data. One approach is to use the JSON functions to move the data into a traditional relational table for analysis. The following SQL creates a "flat" relational table that allows further analysis without having to do joins. In addition, we create the table using column organization, which gives us the ability to run queries without having to worry about which indexes are needed.
In [ ]:
%%sql
DROP TABLE TRANSACTIONS;
CREATE TABLE TRANSACTIONS
(
TX_DATE DATE,
CITY VARCHAR(32),
STATE CHAR(2),
ZIPCODE INT,
PRODUCT_NAME VARCHAR(32),
PRODUCT_QTY INT,
PRODUCT_COST DEC(7,2),
CCARD CHAR(4)
)
ORGANIZE BY COLUMN;
Next we use the %jsonarray macro to generate the rows we need to populate the TRANSACTIONS table.
In [ ]:
%%sql
INSERT INTO TRANSACTIONS
WITH
%jsonarray(customers,custno,details,purchases),
TXS(tx_date, city, state, zipcode, product_name, product_qty, product_cost, ccard) AS
(
SELECT %jdate(j.item,tx_date),
%js(c.details,contact.city,32),
%js(c.details,contact.state,2),
%ji(c.details,contact.zipcode),
%js(j.item,product,32),
%ji(j.item,quantity),
%jd(j.item,item_cost,'7,2'),
%js(c.details,payment.card_type,4)
from jsonarray j, customers c
where j.pk = c.custno
)
SELECT TX_DATE, CITY, STATE, ZIPCODE, PRODUCT_NAME, PRODUCT_QTY, PRODUCT_COST, CCARD
FROM TXS
Next we can check the record count to see how many rows were inserted.
In [ ]:
%sql SELECT COUNT(*) FROM TRANSACTIONS
This code will produce a graph of the results of a SQL statement. The routine allows more flexibility over what is displayed on the chart, including changing the colors, the titles, and the axis names.

The format of the plot call is:

plotSQL(sql, "Title of the plot", "X Axis Title", "Y Axis Title", rotation, color palette)

If rotation is set to True, the x-axis labels are rotated by 30 degrees, placing the text at a slight angle to make longer labels easier to read. The color palette is one of m1, m2, m3, or m4 (defined in the code below).
In [ ]:
import matplotlib.pyplot as plt

# Four color palettes to choose from
m1 = ('#009926','#00994d','#009973','#009999','#007399','#004d99','#002699','#000099')
m2 = ("#00ff40","#00ff80","#00ffbf","#00ffff","#00bfff","#0080ff","#0040ff")
m3 = ("#9933ff","#cc33ff","#ff33ff","#ff33cc","#ff3399","#ff3366","#ff3333")
m4 = ("#ff00bf","#ff0080","#ff0040","#ff0000","#ff0000","#ff4000","#ff8000","#ffbf00","#ffff00")

def plotSQL(sql, title, xaxis, yaxis, rotation, colormap):
    # Rotate the x-axis labels by 30 degrees if requested
    if (rotation == True):
        rot = 30
    else:
        rot = 0
    df = %sql {sql}                     # Run the SQL and return the result as a dataframe
    if (df is None):
        print("SQL call failed")
        return
    xvalues = df.columns.values[0]      # First column supplies the x values
    yvalues = df.columns.values[1]      # Second column supplies the y values
    df.plot(kind='bar',x=xvalues,y=yvalues,color=colormap);
    _ = plt.xlabel(xaxis, fontsize=12);
    _ = plt.ylabel(yaxis, fontsize=12);
    _ = plt.suptitle(title, fontsize=20);
    _ = plt.xticks(rotation=rot);
    _ = plt.plot();
In [ ]:
sqlin = \
'''
WITH
SALES(state, amount) AS
(
SELECT STATE, SUM(PRODUCT_QTY * PRODUCT_COST)
from TRANSACTIONS
GROUP BY STATE
)
SELECT state, sum(amount) from SALES
GROUP BY state
ORDER BY 2 DESC FETCH FIRST 10 ROWS ONLY
WITH UR
'''
plotSQL(sqlin,
"Sales per State",
"State",
"Revenue",
False,
m1)
In [ ]:
sqlin = \
'''
WITH
SALES(product, quantity) AS (
SELECT PRODUCT_NAME, PRODUCT_QTY
from TRANSACTIONS
)
SELECT product, sum(quantity) from SALES
GROUP BY product
ORDER BY 2 DESC
FETCH FIRST 10 ROWS ONLY
WITH UR
'''
plotSQL(sqlin,
"Products Sold",
"Product",
"Quantity",
True,
m2)
In [ ]:
sqlin = \
"""
WITH TOTALPROD(TX_DATE, TX_TOTAL) AS
(
SELECT WEEK(TX_DATE), SUM(PRODUCT_QTY*PRODUCT_COST) FROM TRANSACTIONS
WHERE YEAR(TX_DATE) = 2017 AND WEEK(TX_DATE) <= 52
GROUP BY WEEK(TX_DATE)
)
SELECT TX_DATE, TX_TOTAL AS SALES,
AVG(TX_TOTAL) OVER (
ORDER BY TX_DATE
ROWS BETWEEN 8 PRECEDING AND CURRENT ROW) AS MOVING_AVG
FROM TOTALPROD
ORDER BY TX_DATE
"""
df = %sql {sqlin}
txdate= df['TX_DATE']
sales = df['SALES']
avg = df['MOVING_AVG']
plt.xlabel("Week of Year", fontsize=12);
plt.ylabel("Sales", fontsize=12);
plt.suptitle("Weekly Sales and Moving Average", fontsize=20);
plt.plot(txdate, sales, 'r');
plt.plot(txdate, avg, 'b');
plt.show();
In [ ]:
import seaborn as sns
cm = sns.light_palette("green", as_cmap=True)
sqlin = \
"""
WITH
SALES(product, state, quantity) AS (
SELECT PRODUCT_NAME, STATE, SUM(PRODUCT_QTY)
from TRANSACTIONS
GROUP BY PRODUCT_NAME, STATE
),
MAXSALES(state, total) AS (
SELECT STATE, MAX(quantity)
FROM SALES
GROUP BY STATE
)
SELECT s.state, s.product, m.total
FROM SALES s, MAXSALES m
WHERE
s.state = m.state and
s.quantity = m.total
ORDER BY m.total desc
"""
result = %sql {sqlin}
result.style.background_gradient(cmap=cm)