Db2 already has a variety of Statistical functions built in. In Db2 11.1, a number of new functions have been added including:
N refers to the size of the population and in many cases, we only have a sample, not the entire population of values. In this case, the formula needs to be adjusted to account for the sampling.
Set up the connection to the database.
In [ ]:
%run db2.ipynb
We populate the database with the EMPLOYEE and DEPARTMENT tables so that we can run the various examples.
In [ ]:
%sql -sampledata
In [ ]:
%%sql
SELECT COVARIANCE_SAMP(SALARY, BONUS)
FROM EMPLOYEE
WHERE WORKDEPT = 'A00'
In [ ]:
%%sql
SELECT STDDEV_SAMP(SALARY)
FROM EMPLOYEE
WHERE WORKDEPT = 'A00'
In [ ]:
%%sql
SELECT VARIANCE_SAMP(SALARY)
FROM EMPLOYEE
WHERE WORKDEPT = 'A00'
In [ ]:
%%sql
SELECT MEDIAN(SALARY) AS MEDIAN, AVG(SALARY) AS AVERAGE
FROM EMPLOYEE
WHERE WORKDEPT = 'E21'
In [ ]:
%%sql
SELECT CUME_DIST(47000) WITHIN GROUP (ORDER BY SALARY)
FROM EMPLOYEE
WHERE WORKDEPT = 'A00'
In [ ]:
%%sql
SELECT PERCENT_RANK(47000) WITHIN GROUP (ORDER BY SALARY)
FROM EMPLOYEE
WHERE WORKDEPT = 'A00'
In [ ]:
%%sql
SELECT PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY SALARY)
FROM EMPLOYEE
WHERE WORKDEPT = 'E21'
In [ ]:
%%sql
SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY SALARY)
FROM EMPLOYEE
WHERE WORKDEPT = 'E21'
In [ ]:
%%sql
SELECT EMPNO, SALARY, WIDTH_BUCKET(SALARY, 35000, 100000, 13)
FROM EMPLOYEE
ORDER BY EMPNO
We can plot this information by adding some more details to the bucket output.
In [ ]:
%%sql -a
WITH BUCKETS(EMPNO, SALARY, BNO) AS
(
SELECT EMPNO, SALARY,
WIDTH_BUCKET(SALARY, 35000, 100000, 9) AS BUCKET
FROM EMPLOYEE ORDER BY EMPNO
)
SELECT BNO, COUNT(*) AS COUNT FROM BUCKETS
GROUP BY BNO
ORDER BY BNO ASC
And here is a plot of the data to make sense of the histogram.
In [ ]:
%%sql -pb
WITH BUCKETS(EMPNO, SALARY, BNO) AS
(
SELECT EMPNO, SALARY,
WIDTH_BUCKET(SALARY, 35000, 100000, 9) AS BUCKET
FROM EMPLOYEE ORDER BY EMPNO
)
SELECT BNO, COUNT(*) AS COUNT FROM BUCKETS
GROUP BY BNO
ORDER BY BNO ASC