This week and the next few weeks we will be using our notebooks to connect to relational database systems. To do this, we need the ipython-sql library installed. If the %load_ext and %sql cells below don't work for you, you might need to install it yourself.
To check, run this:
In [ ]:
!pip list | grep ipython-sql
If you don't see ipython-sql (0.3.6) in the output of the above cell, execute the following cell to install ipython-sql.
If you are using datanotebook.org, it should be installed already.
In [ ]:
!pip install ipython-sql
Get the survey.db SQLite3 database file from the Software Carpentry lesson and connect to it.
In [ ]:
!wget http://files.software-carpentry.org/survey.db
In [ ]:
%load_ext sql
In [ ]:
%sql sqlite:///survey.db
Simple queries involve a projection of attributes and a product of relations (here just one relation):
In [ ]:
%%sql
SELECT personal, family
FROM Person
Note that we can change the order of projected attributes:
In [ ]:
%%sql
SELECT family, personal
FROM Person
We can both project all attributes with '*' and also change the ordering of the resulting relation.
In [ ]:
%%sql
SELECT *
FROM Person
ORDER BY family, personal
In [ ]:
%%sql
SELECT quant
FROM survey
By modifying our SELECT clause to use DISTINCT quant instead of just quant, we are assured of seeing each value only once.
In [ ]:
%%sql
SELECT DISTINCT quant
FROM Survey
Ordering is ASCending by default; DESCending is also easily specified, and each attribute can use either.
In [ ]:
%%sql
SELECT taken, person, quant
FROM Survey
ORDER BY taken ASC, person DESC;
We can use DISTINCT on one attribute yet not on others.
In [ ]:
%%sql
SELECT DISTINCT quant, person
FROM Survey
ORDER BY quant ASC;
Though it might sound confusing, the WHERE clause in SQL queries implements the selection operation.
In [ ]:
%%sql
SELECT *
FROM Visited
WHERE site='DR-1'
Above the site attribute is included implicitly with *; here we project it explicitly.
In [ ]:
%%sql
SELECT ident
FROM Visited
WHERE site='DR-1';
Keep in mind the broader picture of what's happening here. The following image, from the SW Carpentry lessons, illustrates the above query in terms of its fundamental select and project operations.
In [ ]:
from IPython.display import Image
Image(url="http://swcarpentry.github.io/sql-novice-survey/fig/sql-filter.svg")
In selections, we can use boolean operators like OR and AND to combine conditions logically.
In [ ]:
%%sql
SELECT *
FROM Survey
WHERE person='lake'
OR person='roe'
We can test set membership in selections as well.
In [ ]:
%%sql
SELECT *
FROM Survey
WHERE person IN ('lake', 'roe')
These boolean operations can nest into complex expressions using parentheses.
In [ ]:
%%sql
SELECT *
FROM Survey
WHERE quant='sal'
AND (person='lake' OR person='roe')
Wildcards like % and _ allow matching of substrings or single characters.
In [ ]:
%%sql
SELECT *
FROM Visited
WHERE site LIKE 'DR%';
We can start to put all of these together.
In [ ]:
%%sql
SELECT DISTINCT person, quant
FROM Survey
WHERE person='lake'
OR person='roe';
Mathematical transformations are easy to compute on projected attributes. Note also that we are here renaming an attribute with AS, creating a new alias.
In [ ]:
%%sql
SELECT reading, 1.05 * reading AS reading_multiplied
FROM Survey
WHERE quant='rad'
Sometimes we do this for convenience or clarity.
In [ ]:
%%sql
SELECT taken, reading AS reading_fahrenheit, round(5*(reading-32)/9, 2) AS reading_celsius
FROM Survey
WHERE quant='temp'
Sometimes we even do this to make data redundant for clarity. Here || is a concatenation operator.
In [ ]:
%%sql
SELECT personal || ' ' || family AS full_name, personal, family
FROM Person
ORDER BY family, personal
We can combine two relations with UNION.
In [ ]:
%%sql
SELECT *
FROM Person
WHERE ident='dyer'
UNION
SELECT *
FROM Person
WHERE ident='roe'
In [ ]:
%%sql
SELECT *
FROM Visited
Above we see NULL represented as None in the sqlite3 relation output.
Comparisons with NULL are NULL; thus boolean operations like the following will not match any NULL values.
In [ ]:
%%sql
SELECT *
FROM Visited
WHERE dated < '1930-01-01'
OR dated >= '1930-01-01'
We have to use the special IS NULL and IS NOT NULL to deal explicitly with the concept of NULL.
In [ ]:
%%sql
SELECT * FROM Visited WHERE dated IS NULL;
In [ ]:
%%sql
SELECT * FROM Visited WHERE dated IS NOT NULL;
Note the difference in the two following output results.
In [ ]:
%%sql
SELECT *
FROM Survey
WHERE quant = 'sal'
AND person != 'lake';
In [ ]:
%%sql
SELECT *
FROM Survey
WHERE quant = 'sal'
AND (person != 'lake' OR person IS NULL);
In [ ]:
%%sql
SELECT *
FROM Visited
WHERE dated IN ('1927-02-08', NULL)
Aggregation takes multiple input attribute values and returns a single output value based on the operation.
In [ ]:
%%sql
SELECT MIN(dated)
FROM Visited
This diagram, also from the SW Carpentry lessons, illustrates the process clearly.
In [ ]:
Image(url="http://swcarpentry.github.io/sql-novice-survey/fig/sql-aggregation.svg")
Each of these aggregators does what you think!
In [ ]:
%%sql
SELECT AVG(reading)
FROM Survey
WHERE quant='sal'
In [ ]:
%%sql
SELECT COUNT(reading)
FROM Survey
WHERE quant='sal'
In [ ]:
%%sql
SELECT SUM(reading)
FROM Survey
WHERE quant='sal'
We can use more than one aggregation with attributes.
In [ ]:
%%sql
SELECT MIN(reading), MAX(reading)
FROM Survey
WHERE quant = 'sal'
AND reading <= 1.0
In [ ]:
%%sql
SELECT person, COUNT(*)
FROM Survey
WHERE quant = 'sal'
AND reading <= 1.0
Aggregation allows the GROUP BY function that separates input attribute values into categories.
In [ ]:
%%sql
SELECT person, COUNT(reading), ROUND(AVG(reading), 2)
FROM Survey
WHERE quant = 'rad'
GROUP BY person
We can add a special form of selection to grouped category attributes by using HAVING with GROUP BY.
In [ ]:
%%sql
SELECT person, quant, COUNT(reading), ROUND(AVG(reading), 2)
FROM Survey
WHERE person IS NOT NULL
GROUP BY person, quant
HAVING quant = 'rad'
ORDER BY person, quant
ipython-sql is a wonderful jupyter plugin. It's not only how we're talking with SQLite, it also can connect with other databases - we'll use the support for another database backend engine next time.
Among other things, though, keep in mind that this works within a Python notebook, and the plugin allows you to pull data back and then work with straight Python.
In [ ]:
result = _
print(result)
In [ ]:
result.keys
In [ ]:
result[3]
In [ ]:
df = result.DataFrame()
df
In [ ]:
%matplotlib inline
In [ ]:
%%sql
SELECT *
FROM Survey
WHERE quant = 'rad'
In [ ]:
result = _
result.bar()