This tutorial will show you the basics of SQL. We are going to use Python because it comes with the popular SQLite (aka etilqs) relational database built in.
The following examples assume you have Python 3 installed on your computer. Any distribution will do. If you have Anaconda or Miniconda, you can create a new conda environment with the requirements for this tutorial from a terminal. For example, with Miniconda on Mac OS X, you might use the following to activate the root conda environment and create a new one called "py36-thw-sql".
~$ source ~/miniconda/bin/activate
(root) ~$ conda create -n py36-thw-sql python=3.6.3 psycopg2 django pandas
Or on Windows with Anaconda, you might use the following in a CMD terminal.
~> %LOCALAPPDATA%\Continuum\anaconda3\Scripts\activate
(root) ~> conda create -n py36-thw-sql python=3.6.3 psycopg2 django pandas
PEP 249 specifies a connection to a database, and a cursor from the connection to execute SQL.
In [1]:
# imports
import io # we'll need this way later
import os
import sqlite3 # this is the module that binds to SQLite
import numpy as np # never know when you might need NumPy, oh, right, always!
import pandas as pd # you'll see why we can use this later
DBFILE = 'sqlite3.db' # this will be our database
BASEDIR = %pwd # os.path.abspath(os.path.dirname(__file__))
DBPATH = os.path.join(BASEDIR, DBFILE)
# we may need to delete the existing file first
if os.path.exists(DBPATH):
    os.remove(DBPATH)
# open a connection to the database for this tutorial
conn = sqlite3.connect(DBPATH)
# get a cursor to the database
cur = conn.cursor()
If a file with the same name doesn't already exist, then this creates a new database; otherwise it connects to the existing database contained in the file. You can also use ':memory:' to create an "in-memory" database that has no file, but then you can't connect to it from another process.
We'll have to close the connection and cursor later. Next time we could use a with context to manage the transaction automatically.
with sqlite3.connect('sqlite3.db') as conn:
    cur = conn.execute('SQL QUERY ...') # e.g.: 'SELECT * FROM table_name;'
    output = cur.fetchall() # get the results
The with context commits the changes on success, or rolls back the changes and raises the exception otherwise. Note that it does not actually close the connection, so call conn.close() yourself; closing the connection automatically closes the cursor. Other bindings may offer similar context managers.
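If you want the connection closed automatically as well, one option is contextlib.closing from the standard library. A minimal sketch, assuming a throwaway 'example.db' file:

import sqlite3
from contextlib import closing

# closing() guarantees conn.close() on exit; the inner "with conn"
# commits on success or rolls back if an exception was raised
with closing(sqlite3.connect('example.db')) as conn:
    with conn:
        cur = conn.execute('SELECT 1') # any SQL query
        output = cur.fetchall() # get the results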
A relational database or SQL database is a tabular structure consisting of rows of data with columns of fields. The data definition language or DDL used to create the table is the same language used to query it, called SQL or Structured Query Language.
Although the basic SQL commands are nearly the same for other relational databases, the data types may be different. SQLite only has 5 data types: NULL, INTEGER, REAL, TEXT, and BLOB. For booleans, use integer zero for false and one for true. For dates and times, use text in ISO 8601 format, e.g.: "2018-02-21T17:05-0800". By comparison, PostgreSQL has too many to list here, including booleans, dates, times, arrays, JSON, etc.
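For example, here's a minimal sketch (using a throwaway in-memory database and a made-up table) of storing a boolean as an integer and a timestamp as ISO 8601 text:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE events (uv_on INTEGER, stamp TEXT)')
# booleans become 0/1 integers; timestamps become ISO 8601 strings
conn.execute('INSERT INTO events VALUES (?, ?)', (1, '2018-02-21T17:05-0800'))
uv_on, stamp = conn.execute('SELECT uv_on, stamp FROM events').fetchone()
print(bool(uv_on), stamp) # True 2018-02-21T17:05-0800
conn.close()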
CREATE
The basic SQL command to create a table is
CREATE TABLE <table_name> (<field_name> <TYPE> <CONSTRAINTS>, <field_name> <TYPE> <CONSTRAINTS>, <CONSTRAINTS>, ...);
Some languages enforce the semicolon, some don't. The syntax is nearly the same for other relational databases.
Constraints are optional and set conditions, limitations, or options for columns and the table. The most common constraints are: PRIMARY KEY, UNIQUE, NOT NULL, DEFAULT, FOREIGN KEY, REFERENCES, etc. The syntax is nearly the same for other relational databases.
PRIMARY KEY
The most important of these is PRIMARY KEY, which is equivalent to UNIQUE NOT NULL. A primary key is a unique reference that identifies each record in the table. Although it is not required, every table should have a primary key. Only one primary key is allowed, and it can be constructed from multiple columns, PRIMARY KEY (<field_A>, <field_B>), to create a unique-together, non-null identifier. In SQLite, if a primary key is missing, an integer primary key named rowid is created by default. Also in SQLite, any integer primary key is automatically incremented, so the AUTOINCREMENT command is usually not needed. In PostgreSQL the SERIAL type is used to create a corresponding sequence for the primary key.
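Here's a minimal sketch (in a throwaway in-memory database, with made-up tables) showing the implicit rowid and the auto-incrementing integer primary key:

import sqlite3

conn = sqlite3.connect(':memory:')
# no primary key declared, so SQLite supplies a hidden rowid column
conn.execute('CREATE TABLE pets (name TEXT)')
conn.execute("INSERT INTO pets VALUES ('Rex')")
print(conn.execute('SELECT rowid, name FROM pets').fetchall()) # [(1, 'Rex')]
# an INTEGER PRIMARY KEY auto-increments when omitted from the insert
conn.execute('CREATE TABLE toys (toy_id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany('INSERT INTO toys (name) VALUES (?)', [('ball',), ('bone',)])
print(conn.execute('SELECT * FROM toys').fetchall()) # [(1, 'ball'), (2, 'bone')]
conn.close()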
The other constraints and options are also important, but we'll discover those as we learn. Let's create some simple databases with fictitious data to practice. Imagine you are testing several different materials with different properties $\alpha$ and $\beta$ under different stresses, such as varying temperature, light intensity, and thickness. How would you organize this data? Take a moment to design a schema, or structure, for your data. The schema consists of the column names and data types and the column and table constraints.
In [2]:
# we can use Python triple-quoted strings to span multiple lines; we quote them with single quotes
# so the SQL inside can use double quotes (standard SQL reserves double quotes for identifiers and
# single quotes for string literals, but SQLite tolerates double-quoted literals)
# first create a materials table
cur.execute('''CREATE TABLE materials (
    material_id TEXT PRIMARY KEY,
    long_name TEXT UNIQUE NOT NULL,
    alpha REAL NOT NULL,
    beta REAL NOT NULL,
    material_type TEXT NOT NULL
)''')
conn.commit()
# if you don't commit the changes, they won't be written to the file, and won't be visible to other connections
# then create an experiments table
cur.execute('''CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY,
    temperature REAL DEFAULT 298.15,
    irradiance REAL DEFAULT 1000.0,
    uv_filter INTEGER DEFAULT 0,
    material_id TEXT NOT NULL REFERENCES materials ON UPDATE CASCADE ON DELETE CASCADE,
    thickness REAL DEFAULT 0.005,
    UNIQUE (temperature, irradiance, uv_filter, material_id)
)''')
conn.commit()
# and finally create a trials table
cur.execute('''CREATE TABLE trials (
    trial_id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments ON UPDATE CASCADE ON DELETE CASCADE,
    results BLOB NOT NULL,
    duration REAL NOT NULL,
    avg_temperature REAL NOT NULL,
    std_temperature REAL NOT NULL,
    avg_irradiance REAL NOT NULL,
    std_irradiance REAL NOT NULL,
    init_visible_transmittance REAL NOT NULL,
    final_visible_transmittance REAL NOT NULL,
    notes TEXT
)''')
conn.commit()
FOREIGN KEY
A foreign key constraint creates a relationship between two tables. The FOREIGN KEY is implied when the REFERENCES column constraint is applied. In the experiments table above, the column constraint on material_id is the same as adding this table constraint:
FOREIGN KEY (material_id) REFERENCES materials (material_id)
Specifying the referenced column in the table constraint isn't necessary, and if omitted defaults to the primary key of the referenced table. The syntax is nearly the same for other relational databases.
You can use the same name for the foreign key and its related field, but it may make joining tables more difficult because you will need to use the table name to avoid an ambiguous column name, e.g. trials.experiment_id and experiments.experiment_id to differentiate between them. You can also use AS to create a temporary name like trials.experiment_id AS experiment. Or you could just use different names for the foreign key and its related field, like FOREIGN KEY (material) REFERENCES materials (material_id), and then there's no ambiguity. Your call.
ON DELETE and ON UPDATE
What happens if the reference of a foreign key is deleted or updated? That's up to you: in SQLite the default is to do nothing, but typically you want the action to cascade. Add the desired ON DELETE or ON UPDATE action to the constraint.
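One caveat: SQLite parses foreign key constraints but does not enforce them unless you turn enforcement on for each connection. Here's a minimal sketch (a throwaway in-memory database with made-up tables) showing the PRAGMA and a cascading delete:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON') # enforcement is off by default in SQLite
conn.execute('CREATE TABLE parents (pid INTEGER PRIMARY KEY)')
conn.execute('''CREATE TABLE children (
    cid INTEGER PRIMARY KEY,
    pid INTEGER NOT NULL REFERENCES parents ON DELETE CASCADE
)''')
conn.execute('INSERT INTO parents VALUES (1)')
conn.execute('INSERT INTO children VALUES (1, 1)')
conn.execute('DELETE FROM parents WHERE pid = 1') # cascades to the children table
print(conn.execute('SELECT COUNT(*) FROM children').fetchone()) # (0,)
conn.close()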
INSERT
The basic SQL command to put data into a table is
INSERT INTO <table_name> (<field_name>, <field_name>, ...) VALUES (<value>, <value>, ...)
Other relational databases use the same SQL syntax.
Let's add some pretend data to the database.
In [3]:
# add EVA as a material
cur.execute('INSERT INTO materials VALUES ("EVA", "ethylene vinyl acetate", 0.123, 4.56, "polymer")')
conn.commit() # you must commit for it to become permanent
cur.rowcount # tells you how many rows were written (sometimes; rowcount can be quirky)
Out[3]:
You can use placeholders to loop over insert statements to add multiple records.
WARNING: Never use string formatters in lieu of placeholders or you may be subject to a SQL injection attack. SQLite uses ? but other relational databases may use %s or another placeholder.
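To see why, here's a sketch of the difference (the hostile input value is made up): string formatting pastes the input into the SQL text, so a crafted value can rewrite the query, while a placeholder always binds the input as plain data.

user_input = 'EVA"; DROP TABLE materials; --' # a hostile "material ID"
# BAD: the value becomes part of the SQL text itself (never run this!)
query = 'SELECT * FROM materials WHERE material_id = "%s"' % user_input
# GOOD: the ? placeholder binds the value; it can never change the SQL
cur.execute('SELECT * FROM materials WHERE material_id = ?', (user_input,))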
Also, in sqlite3, executemany is a convenient shortcut, but it may not be convenient for all database bindings.
In [4]:
# add some more fake materials
fake_materials = [
    ('PMMC', 'poly methyl methacrylate', 0.789, 10.11, 'polymer'),
    ('KBr', 'potassium bromide', 1.213, 14.15, 'crystal')
]
for mat in fake_materials:
    # must have the same number of placeholders as values
    cur.execute('INSERT INTO materials VALUES (?, ?, ?, ?, ?)', mat) # use placeholders
conn.commit() # you can commit all of the changes at the end of the loop
# use the executemany shortcut
fake_materials = [
    ('SiO2', 'silicon dioxide', 1.617, 18.19, 'crystal'),
    ('CaF2', 'calcium fluoride', 2.0, 21.22, 'crystal')
]
cur.executemany('INSERT INTO materials VALUES (?, ?, ?, ?, ?)', fake_materials)
conn.commit()
print('rowcount = %d' % cur.rowcount) # with executemany, cur.rowcount shows the total number of rows
DELETE and UPDATE
Oops, I made a mistake. How do I fix it? The opposite of INSERT is DELETE. But don't throw the baby out with the bathwater: you can also UPDATE a record. Other relational databases use the same SQL syntax to manipulate data.
In [5]:
cur.execute('DELETE FROM materials WHERE material_id = "SiO2"')
cur.execute('UPDATE materials SET alpha=1.23E-4, beta=8.910E+11 WHERE material_id = "CaF2"')
conn.commit()
SELECT
The way you select data is by executing queries. The language is the same for all relational databases. The star * means select all columns, or you can give the columns explicitly.
In [6]:
cur.execute('SELECT * FROM materials')
cur.fetchall() # fetch all the results of the query
Out[6]:
You can limit a query using WHERE and LIMIT. You can combine WHERE with a conditional expression, IN to check a set, or LIKE to compare with strings. Use AND and OR to combine conditions.
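The examples below demonstrate WHERE, LIMIT, and IN, but not LIKE, so here's a quick sketch against the materials table we just populated; in a LIKE pattern, % matches any run of characters and _ matches exactly one (the rows returned depend on your table's current contents).

cur.execute("SELECT long_name FROM materials WHERE long_name LIKE '%ide'")
print(cur.fetchall()) # e.g. [('potassium bromide',), ('calcium fluoride',)]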
The Python DB-API cursor can be used as an iterator, or you can call its fetch methods.
In [7]:
# limit the query using WHERE and LIMIT
cur.execute('SELECT material_id, long_name FROM materials WHERE alpha < 1 LIMIT 2')
for c in cur: print('{} is {}'.format(*c)) # use the cursor as an iterator
In [8]:
materials_list = ("EVA", "PMMC")
cur.execute('SELECT alpha, beta FROM materials WHERE material_id IN (?, ?)', materials_list)
[(mat, cur.fetchone()) for mat in materials_list] # use the cursor's fetchone() method to get the next item
Out[8]:
In [9]:
cur.execute('SELECT COUNT(*) FROM materials')
print(cur.fetchone())
In [10]:
cur.execute('SELECT material_type, COUNT(*), AVG(alpha), MAX(beta) FROM materials GROUP BY material_type')
cur.fetchmany(2) # use fetchmany() with size parameter, just for fun
Out[10]:
In [11]:
# use the defaults and let the primary key auto-increment; just supply the material ID
cur.execute('INSERT INTO experiments (material_id) VALUES ("EVA")')
conn.commit()
In [12]:
# set up a test matrix for EVA
temp = range(300, 400, 25)
irrad = range(400, 800, 100)
try:
    for T in temp:
        for E in irrad:
            cur.execute('INSERT INTO experiments (temperature, irradiance) VALUES (?, ?)', (T, E))
except sqlite3.IntegrityError as exc:
    print('sqlite3.IntegrityError: %s' % exc)
# Oops! We forgot to specify the material; there is no default, and it is constrained as NOT NULL!
conn.rollback() # undo any changes
try:
    for T in temp:
        for E in irrad:
            cur.execute('INSERT INTO experiments (temperature, irradiance, material_id) VALUES (?, ?, "EVA")', (T, E))
except sqlite3.IntegrityError as exc:
    print(exc)
conn.commit() # commit! commit! commit!
In [13]:
# this list is hard to read
list(cur.execute('SELECT * FROM experiments'))
Out[13]:
In [14]:
# not only is Pandas much nicer, it also executes queries!
pd.read_sql('SELECT * FROM experiments', conn, index_col='experiment_id')
Out[14]:
In [15]:
# Python's SQLite lets you use either '==' or '=', but standard SQL only allows '=', okay?
pd.read_sql('SELECT * FROM experiments WHERE irradiance = 700 ORDER BY temperature', conn, index_col='experiment_id')
Out[15]:
In [16]:
# descending order
pd.read_sql('SELECT * FROM experiments WHERE temperature = 375 ORDER BY irradiance DESC', conn, index_col='experiment_id')
Out[16]:
In [17]:
# Dr. Data
start_time, end_time = '2018-02-21T17:00-0800', '2018-02-21T18:30-0800'
timestamps = pd.date_range(start=start_time, end=end_time, freq='T')
# use http://poquitopicante.blogspot.com/2016/11/panda-pop.html to help you recall what offset alias to use
size = len(timestamps)
data = {
    'temperature': np.random.randn(size) + 298.15,
    'irradiance': np.random.randn(size) + 1000,
    'visible_transmittance': np.logspace(np.log10(0.9), np.log10(0.8), size) + np.random.randn(size) / 100
}
results = pd.DataFrame(data, index=timestamps)
duration = (results.index[-1] - results.index[0]).value # [ns]
avg_temperature = results.temperature.mean() # [K]
std_temperature = results.temperature.std() # [K]
avg_irradiance = results.irradiance.mean() # [W/m^2]
std_irradiance = results.irradiance.std() # [W/m^2]
init_visible_transmittance = results.visible_transmittance[start_time]
final_visible_transmittance = results.visible_transmittance[end_time]
values = (1, results.to_csv(), duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance, final_visible_transmittance,
'this is doctored data')
cur.execute('''INSERT INTO trials (
experiment_id, results, duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance,
final_visible_transmittance, notes
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', values)
conn.commit() # commit! commit! commit!
In [18]:
# check the blob, is it really there?
cur.execute('SELECT results FROM trials WHERE trial_id = 1')
trial1_results = cur.fetchone()
pd.read_csv(io.StringIO(trial1_results[0]), index_col=0)
# yay! it works!
Out[18]:
In [19]:
# add the results for experiment 17: T=375[K], E=700[W/m^2]
experiment_id, temperature, irradiance = list(cur.execute(
    'SELECT experiment_id, temperature, irradiance FROM experiments WHERE (temperature = 375 AND irradiance = 700)'
))[0]
start_time, end_time = '2018-02-28T17:00-0800', '2018-02-28T18:30-0800'
timestamps = pd.date_range(start=start_time, end=end_time, freq='T')
# use http://poquitopicante.blogspot.com/2016/11/panda-pop.html to help you recall what offset alias to use
size = len(timestamps)
data = {
    'temperature': np.random.randn(size) + temperature,
    'irradiance': np.random.randn(size) + irradiance,
    'visible_transmittance': np.logspace(np.log10(0.9), np.log10(0.7), size) + np.random.randn(size) / 100
}
results = pd.DataFrame(data, index=timestamps)
duration = (results.index[-1] - results.index[0]).value # [ns]
avg_temperature = results.temperature.mean() # [K]
std_temperature = results.temperature.std() # [K]
avg_irradiance = results.irradiance.mean() # [W/m^2]
std_irradiance = results.irradiance.std() # [W/m^2]
init_visible_transmittance = results.visible_transmittance[start_time]
final_visible_transmittance = results.visible_transmittance[end_time]
values = (experiment_id, results.to_csv(), duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance, final_visible_transmittance,
'this is doctored data')
cur.execute('''INSERT INTO trials (
experiment_id, results, duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance,
final_visible_transmittance, notes
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', values)
conn.commit() # commit! commit! commit!
In [20]:
pd.read_sql('''
    SELECT
        trial_id, trials.experiment_id AS experiment, init_visible_transmittance, final_visible_transmittance,
        experiments.material_id AS material, temperature, irradiance, alpha, beta, material_type
    FROM trials
    JOIN experiments ON experiments.experiment_id = trials.experiment_id
    JOIN materials ON materials.material_id = experiments.material_id
''', conn, index_col='trial_id')
Out[20]:
In [21]:
cur.close()
conn.close()
Unfortunately that's all we have time for, but we covered a lot, even if we didn't get this far. If there's anything I hope you learned from this, it's that:
But there's still so much to learn! Hopefully you will continue on your own and try some of these interesting topics.
Did you learn about everything you expected today? What do you wish we had covered? Leave your comments or improve this tutorial by sending a PR to The Hacker Within, Berkeley.
Thanks!