Week 4 lecture notes - Basic SQL DML and DDL

This week and the next few weeks we will be using our notebooks to connect to relational database systems. To do this, we need the ipython-sql library installed. If the %load_ext and %sql cells below don't work for you, you might need to install it yourself.

To check, run this:


In [ ]:
!pip list | grep ipython-sql

If you don't see ipython-sql (0.3.6) in the output of the above cell, execute the following cell to install ipython-sql.

If you are using datanotebook.org, it should be installed already.


In [ ]:
!pip install ipython-sql

Part 1 - Data manipulation (DML)

Get the survey.db SQLite3 database file from the Software Carpentry lesson and connect to it.


In [ ]:
!wget http://files.software-carpentry.org/survey.db

In [ ]:
%load_ext sql

In [ ]:
%sql sqlite:///survey.db

Basic queries: SELECT, DISTINCT, FROM, ORDER BY

Simple queries involve a projection of attributes and a product of relations (here just one relation):


In [ ]:
%%sql
SELECT personal, family
FROM Person

Note that we can change the order of projected attributes:


In [ ]:
%%sql
SELECT family, personal
FROM Person

We can both project all attributes with '*' and also change the ordering of the resulting relation.


In [ ]:
%%sql
SELECT *
FROM Person
ORDER BY family, personal

Handling duplicates

Note that the survey table has a number of repeating values for the quant attribute.


In [ ]:
%%sql
SELECT quant
FROM survey

By modifying our SELECT clause to use DISTINCT quant instead of just quant, we are assured of seeing each value only once.


In [ ]:
%%sql
SELECT DISTINCT quant
FROM Survey

Ordering is ASCending by default; DESCending is also easily specified, and each attribute can use either.


In [ ]:
%%sql
SELECT taken, person, quant
FROM Survey 
ORDER BY taken ASC, person DESC;

We can use DISTINCT on one attribute yet not on others.


In [ ]:
%%sql
SELECT DISTINCT quant, person 
FROM Survey 
ORDER BY quant ASC;

Filtering: WHERE

Though it might sound confusing, the WHERE clause in SQL queries implements the selection operation.


In [ ]:
%%sql
SELECT * 
FROM Visited 
WHERE site='DR-1'

Above the site attribute is included implicitly with *; here we project it explicitly.


In [ ]:
%%sql
SELECT ident 
FROM Visited 
WHERE site='DR-1';

Keep in mind the broader picture of what's happening here. The following image, from the SW Carpentry lessons, illustrates the above query in terms of its fundamental select and project operations.


In [ ]:
from IPython.display import Image
Image(url="http://swcarpentry.github.io/sql-novice-survey/fig/sql-filter.svg")

In selections, we can use boolean operators like OR and AND to combine conditions logically.


In [ ]:
%%sql
SELECT * 
FROM Survey 
WHERE person='lake' 
  OR person='roe'

We can test set membership in selections as well.


In [ ]:
%%sql
SELECT * 
FROM Survey 
WHERE person IN ('lake', 'roe')

These boolean operations can nest into complex expressions using parentheses.


In [ ]:
%%sql
SELECT * 
FROM Survey 
WHERE quant='sal' 
  AND (person='lake' OR person='roe')

Wildcards like % and _ allow matching of substrings or single characters.


In [ ]:
%%sql
SELECT * 
FROM Visited
WHERE site LIKE 'DR%';

We can start to put all of these together.


In [ ]:
%%sql
SELECT DISTINCT person, quant 
FROM Survey 
WHERE person='lake' 
   OR person='roe';

Mathematical transformations are easy to compute on projected attributes. Note also that we are here renaming an attribute with AS, creating a new alias.


In [ ]:
%%sql
SELECT reading, 1.05 * reading AS reading_multiplied
FROM Survey 
WHERE quant='rad'

Sometimes we do this for convenience or clarity.


In [ ]:
%%sql
SELECT taken, reading AS reading_fahrenheit, round(5*(reading-32)/9, 2) AS reading_celsius
FROM Survey 
WHERE quant='temp'

Sometimes we even do this to make data redundant for clarity. Here || is a concatenation operator.


In [ ]:
%%sql
SELECT personal || ' ' || family AS full_name, personal, family
FROM Person
ORDER BY family, personal

We can combine two relations with UNION.


In [ ]:
%%sql
SELECT * 
FROM Person 
WHERE ident='dyer' 

UNION 

SELECT * 
FROM Person 
WHERE ident='roe'

NULL values

NULLs can be confusing but they have important qualities to remember.


In [ ]:
%%sql
SELECT * 
FROM Visited

Above we see NULL represented as None in the sqlite3 relation output.

Comparisons with NULL are NULL; thus boolean operations like the following will not match any NULL values.


In [ ]:
%%sql
SELECT * 
FROM Visited
WHERE dated < '1930-01-01'
  OR dated >= '1930-01-01'

We have to use the special IS NULL and IS NOT NULL to deal explicitly with the concept of NULL.


In [ ]:
%%sql
SELECT * FROM Visited WHERE dated IS NULL;

In [ ]:
%%sql
SELECT * FROM Visited WHERE dated IS NOT NULL;

Note the difference in the two following output results.


In [ ]:
%%sql
SELECT * 
FROM Survey 
WHERE quant = 'sal' 
  AND person != 'lake';

In [ ]:
%%sql
SELECT * 
FROM Survey
WHERE quant = 'sal'
  AND (person != 'lake' OR person IS NULL);

In [ ]:
%%sql
SELECT * 
FROM Visited 
WHERE dated IN ('1927-02-08', NULL)

Aggregation: min(), max(), count(), avg()

Aggregation takes multiple input attribute values and returns a single output value based on the operation.


In [ ]:
%%sql
SELECT MIN(dated)
FROM Visited

This diagram, also from the SW Carpentry lessons, illustrates the process clearly.


In [ ]:
Image(url="http://swcarpentry.github.io/sql-novice-survey/fig/sql-aggregation.svg")

Each of these aggregators does what you think!


In [ ]:
%%sql
SELECT AVG(reading)
FROM Survey 
WHERE quant='sal'

In [ ]:
%%sql
SELECT COUNT(reading) 
FROM Survey 
WHERE quant='sal'

In [ ]:
%%sql
SELECT SUM(reading)
FROM Survey
WHERE quant='sal'

We can use more than one aggregation with attributes.


In [ ]:
%%sql
SELECT MIN(reading), MAX(reading)
FROM Survey 
WHERE quant = 'sal' 
  AND reading <= 1.0

In [ ]:
%%sql
SELECT person, COUNT(*) 
FROM Survey 
WHERE quant = 'sal' 
  AND reading <= 1.0

Aggregation: GROUP BY, HAVING

Aggregation allows the GROUP BY function that separates input attribute values into categories.


In [ ]:
%%sql
SELECT   person, COUNT(reading), ROUND(AVG(reading), 2)
FROM     Survey
WHERE    quant = 'rad'
GROUP BY person

We can add a special form of selection to grouped category attributes by using HAVING with GROUP BY.


In [ ]:
%%sql
SELECT   person, quant, COUNT(reading), ROUND(AVG(reading), 2)
FROM     Survey
WHERE    person IS NOT NULL
GROUP BY person, quant
HAVING   quant = 'rad'
ORDER BY person, quant

Part 2 - Using Python for queries

ipython-sql is a wonderful jupyter plugin. It's not only how we're talking with SQLite, it also can connect with other databases - we'll use the support for another database backend engine next time.

Among other things, though, keep in mind that this works within a Python notebook, and the plugin allows you to pull data back and then work with straight Python.


In [ ]:
result = _
print(result)

In [ ]:
result.keys

In [ ]:
result[3]

Pandas integration


In [ ]:
df = result.DataFrame()
df

Matplotlib integration


In [ ]:
%matplotlib inline

In [ ]:
%%sql
SELECT * 
FROM Survey
WHERE quant = 'rad'

In [ ]:
result = _
result.bar()