Intro to SQL - The Hacker Within (2018-02-21)

This tutorial will show you the basics of SQL:

  • creating a table
  • inserting rows
  • querying with conditions
  • grouping and ordering
  • joining tables with a common field
  • aggregating rows using COUNT, MAX, etc.
  • alternatives to SQL like pandas and Django
  • indexing frequently queried columns for performance
  • altering an existing table to add, drop, or rename a column
  • dropping a table
  • vacuuming a database

SQLite

We are going to use Python because it comes with the popular SQLite relational database built in.

Requirements

The following examples assume you have Python 3 installed on your computer. Any distribution will do. If you have Anaconda or Miniconda, you can create a new conda environment with the requirements for this tutorial from a terminal. For example, with Miniconda on Mac OS X, you might use the following to activate the root conda environment and create a new one called "py36-thw-sql".

~$ source ~/miniconda/bin/activate
(root) ~$ conda create -n py36-thw-sql python=3.6.3 psycopg2 django pandas

Or on Windows with Anaconda, you might use the following in a CMD terminal.

~> %LOCALAPPDATA%\Continuum\anaconda3\Scripts\activate
(root) ~> conda create -n py36-thw-sql python=3.6.3 psycopg2 django pandas

Python DB-API 2.0

PEP 249 specifies the Python DB-API 2.0: a connection to a database, and a cursor created from the connection to execute SQL.


In [1]:
# imports
import io  # we'll need this way later
import os
import sqlite3  # this is the module that binds to SQLite
import numpy as np  # never know when you might need NumPy, oh, right, always!
import pandas as pd  # you'll see why we can use this later

DBFILE = 'sqlite3.db'  # this will be our database
BASEDIR = %pwd  # os.path.abspath(os.path.dirname(__file__))
DBPATH = os.path.join(BASEDIR, DBFILE)

# we may need to delete the existing file first
if os.path.exists(DBPATH):
    os.remove(DBPATH)

# open a connection to the database for this tutorial
conn = sqlite3.connect(DBPATH)

# get a cursor to the database
cur = conn.cursor()

Notes

If a file with that name doesn't already exist, this creates a new database; otherwise it connects to the existing database contained in the file. You can also use ':memory:' to create an "in-memory" database that has no file, but then you can't connect to it from another process.

We'll have to close the connection and cursor later. Next time we could use a with context to manage the transaction automatically.

with sqlite3.connect('sqlite3.db') as conn:
    cur = conn.execute('SQL QUERY ...')  # e.g.: 'SELECT * FROM table_name;'
    output = cur.fetchall()  # get the results

In sqlite3, the with context commits the changes on success, or rolls them back and raises the exception on failure. Note that it does not close the connection, so we still close that explicitly; closing the connection also closes its cursors. Other bindings may offer similar context managers.
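For example, here's a minimal sketch that pairs contextlib.closing, to guarantee the close, with the connection's transaction context:

from contextlib import closing

with closing(sqlite3.connect('sqlite3.db')) as conn:  # closes the connection on exit
    with conn:  # commits on success, rolls back and re-raises on an exception
        cur = conn.execute('SQL QUERY ...')  # e.g.: 'SELECT * FROM table_name;'
        output = cur.fetchall()  # get the results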

Creating tables

A relational database or SQL database is a tabular structure consisting of rows of data with columns of fields. The data definition language or DDL used to create the table is the same language used to query it, called SQL or Structured Query Language.

Although the basic SQL commands are nearly the same for other relational databases, the data types may differ. SQLite only has five datatypes: NULL, INTEGER, REAL, TEXT, and BLOB. For booleans, use integer zero for false and one for true. For dates and times, use text in ISO 8601 format, e.g.: "2018-02-21T17:05-0800". By comparison, PostgreSQL has too many to list here, including booleans, dates, times, arrays, JSON, etc.
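For example, here's one way to prepare a boolean and a timestamp in Python before inserting them (the variable names are just for illustration):

import datetime

uv_filter = int(True)  # store booleans as INTEGER 1 (true) or 0 (false)
timestamp = datetime.datetime(2018, 2, 21, 17, 5).isoformat()  # '2018-02-21T17:05:00', stored as TEXT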

CREATE

The basic SQL command to create a table is

CREATE TABLE <table_name> (<field_name> <TYPE> <CONSTRAINTS>, <field_name> <TYPE> <CONSTRAINTS>, <CONSTRAINTS>, ...);

Some databases enforce the trailing semicolon, some don't. The syntax is nearly the same for other relational databases.

Constraints, Defaults, and Options

Constraints are optional and set conditions, limitations, or options for columns and the table. The most common constraints are: PRIMARY KEY, UNIQUE, NOT NULL, DEFAULT, FOREIGN KEY, REFERENCES, etc. The syntax is nearly the same for other relational databases.

PRIMARY KEY

The most important of these is PRIMARY KEY, which is equivalent to UNIQUE NOT NULL. A primary key is a unique reference that identifies each record in the table. Although it is not required, every table should have a primary key. Only one primary key is allowed, but it can be constructed from multiple columns, PRIMARY KEY (<field_A>, <field_B>), to create a unique-together, non-null identifier. In SQLite, if the primary key is missing, an integer primary key named rowid is created by default. Also in SQLite, any integer primary key is automatically incremented, so the AUTOINCREMENT command is usually not needed. In PostgreSQL, the SERIAL command is used to create a corresponding sequence for the primary key.
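For example, a hypothetical table keyed by two columns together might look like:

CREATE TABLE operators (
    first_name TEXT,
    last_name TEXT,
    PRIMARY KEY (first_name, last_name)
);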

Practice

The other constraints and options are also important, but we'll discover those as we learn. Let's create some simple databases with fictitious data to practice. Imagine you are testing several different materials with different properties $\alpha$ and $\beta$ under different stresses, like temperature and light intensity, and with changing thickness. How would you organize this data? Take a moment to design a schema, or structure, for your data. The schema consists of the column names and data types, and the column and table constraints.


In [2]:
# we can use Python triple-quoted strings to span multiple lines; using single quotes
# for the Python string leaves double quotes free for use inside the SQL

# first create a materials table
cur.execute('''CREATE TABLE materials (
    material_id TEXT PRIMARY KEY,
    long_name TEXT UNIQUE NOT NULL,
    alpha REAL NOT NULL,
    beta REAL NOT NULL,
    material_type TEXT NOT NULL
)''')
conn.commit()
# if you don't commit the changes, they won't be written to the file, and won't be visible to other connections

# then create an experiments table
cur.execute('''CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY,
    temperature REAL DEFAULT 298.15,
    irradiance REAL DEFAULT 1000.0,
    uv_filter INTEGER DEFAULT 0,
    material_id TEXT NOT NULL REFERENCES materials ON UPDATE CASCADE ON DELETE CASCADE,
    thickness REAL DEFAULT 0.005,
    UNIQUE (temperature, irradiance, uv_filter, material_id)
)''')
conn.commit()

# and finally create a trials table
cur.execute('''CREATE TABLE trials (
    trial_id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments ON UPDATE CASCADE ON DELETE CASCADE,
    results BLOB NOT NULL,
    duration REAL NOT NULL,
    avg_temperature REAL NOT NULL,
    std_temperature REAL NOT NULL,
    avg_irradiance REAL NOT NULL,
    std_irradiance REAL NOT NULL,
    init_visible_transmittance REAL NOT NULL,
    final_visible_transmittance REAL NOT NULL,
    notes TEXT
)''')
conn.commit()

FOREIGN KEY

A foreign key constraint creates a relationship between two tables. The FOREIGN KEY is implied when the REFERENCES column constraint is applied. In the experiments table above, the column constraint on material_id is the same as adding this table constraint:

FOREIGN KEY (material_id) REFERENCES materials (material_id)

Specifying the referenced column in the table constraint isn't necessary, and if omitted defaults to the primary key of the referenced table. The syntax is nearly the same for other relational databases.

You can use the same name for the foreign key and its related field, but it may make joining tables more difficult because you will need to use the table name to avoid an ambiguous column name, e.g. trials.experiment_id and experiments.experiment_id to differentiate between them. You can also use AS to create a temporary name like trials.experiment_id AS experiment. Or you could just use different names for the foreign key and its related field, like FOREIGN KEY (material) REFERENCES materials (material_id), and then there's no ambiguity. Your call.

DELETE and UPDATE

What happens if the reference of a foreign key is deleted or updated? That's up to you: in SQLite the default is to do nothing, but typically you want the action to cascade. Add the desired ON DELETE or ON UPDATE action to the constraint.
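One caveat: SQLite ships with foreign key enforcement turned off for backward compatibility, so REFERENCES and the cascade actions are ignored unless you enable them on each connection:

cur.execute('PRAGMA foreign_keys = ON')  # enable foreign key enforcement for this connection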

Bonus Questions

  1. What is the difference between a column constraint and a table constraint?
  2. What other table constraint is in the experiments table?
  3. What other constraints or defaults are applied in the tables?
  4. What part of the materials table schema is fragile and can be improved?

INSERT

The basic SQL command to put data into a table is

INSERT INTO <table_name> (<field_name>, <field_name>, ...) VALUES (<value>, <value>, ...);

Other relational databases use the same SQL syntax.

Let's add some pretend data to the database


In [3]:
# add EVA as a material
cur.execute('INSERT INTO materials VALUES ("EVA", "ethylene vinyl acetate", 0.123, 4.56, "polymer")')
conn.commit()  # you must commit for it to become permanent
cur.rowcount  # tells you how many rows were written; it can be quirky


Out[3]:
1

Placeholders

You can use placeholders to loop over insert statements to add multiple records.

WARNING: Never use string formatting in lieu of placeholders, or you may be subject to a SQL injection attack.

SQLite uses ? but other relational databases may use %s or another placeholder.
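sqlite3 also accepts named placeholders, matched by the keys of a dict, for example:

cur.execute('SELECT * FROM materials WHERE material_id = :mid', {'mid': 'EVA'})
cur.fetchone()  # the EVA row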

Also, in sqlite3, executemany is a convenient shortcut, but it may not be as efficient in other database bindings.


In [4]:
# add some more fake materials
fake_materials = [
    ('PMMC', 'poly methyl methacrylate', 0.789, 10.11, 'polymer'),
    ('KBr', 'potassium bromide', 1.213, 14.15, 'crystal')
]
for mat in fake_materials:
    # must have same number of place holders as values
    cur.execute('INSERT INTO materials VALUES (?, ?, ?, ?, ?)', mat)  # use place holders
conn.commit()  # you can commit all of the changes at the end of the loop

# use the executemany shortcut
fake_materials = [
    ('SiO2', 'silicon dioxide', 1.617, 18.19, 'crystal'),
    ('CaF2', 'calcium fluoride', 2.0, 21.22, 'crystal')
]
cur.executemany('INSERT INTO materials VALUES (?, ?, ?, ?, ?)', fake_materials)
conn.commit()
print('rowcount = %d' % cur.rowcount)  # with executemany, cur.rowcount shows total number of rows


rowcount = 2

DELETE and UPDATE

Oops, I made a mistake. How do I fix it? The opposite of INSERT is DELETE. But don't throw the baby out with the bathwater: you can also UPDATE a record. Other relational databases use the same SQL syntax to manipulate data.


In [5]:
cur.execute('DELETE FROM materials WHERE material_id = "SiO2"')
cur.execute('UPDATE materials SET alpha=1.23E-4, beta=8.910E+11 WHERE material_id = "CaF2"')
conn.commit()

Queries

The way you select data is by executing queries. The language is the same for all relational databases. The star * means select all columns, or you can give the columns explicitly.


In [6]:
cur.execute('SELECT * FROM materials')
cur.fetchall()  # fetch all the results of the query


Out[6]:
[('EVA', 'ethylene vinyl acetate', 0.123, 4.56, 'polymer'),
 ('PMMC', 'poly methyl methacrylate', 0.789, 10.11, 'polymer'),
 ('KBr', 'potassium bromide', 1.213, 14.15, 'crystal'),
 ('CaF2', 'calcium fluoride', 0.000123, 891000000000.0, 'crystal')]

Conditions

You can limit a query using WHERE and LIMIT. You can combine WHERE with a conditional expression, IN to check a set, or LIKE to compare with strings. Use AND and OR to combine conditions.
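For example, LIKE compares strings, with % as a wildcard for any run of characters; a quick sketch on our data:

cur.execute("SELECT long_name FROM materials WHERE long_name LIKE '%methyl%'")
cur.fetchall()  # only 'poly methyl methacrylate' should match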

Python DB-API Cursor Methods

The Python DB-API cursor can be used as an iterator, or you can call its fetch methods.


In [7]:
# limit the query using WHERE and LIMIT
cur.execute('SELECT material_id, long_name FROM materials WHERE alpha < 1 LIMIT 2')
for c in cur: print('{} is {}'.format(*c))  # use the cursor as an iterator


EVA is ethylene vinyl acetate
PMMC is poly methyl methacrylate

In [8]:
materials_list = ("EVA", "PMMC")
cur.execute('SELECT alpha, beta FROM materials WHERE material_id IN (?, ?)', materials_list)
[(mat, cur.fetchone()) for mat in materials_list]  # use the cursor fetchone() method to get next item


Out[8]:
[('EVA', (0.123, 4.56)), ('PMMC', (0.789, 10.11))]

Aggregates

Your query can aggregate results like AVG, SUM, COUNT, MAX, MIN, etc.


In [9]:
cur.execute('SELECT COUNT(*) FROM materials')
print(cur.fetchone())


(4,)

GROUP BY

If your selection is an aggregate, you can group queries by a column, or by a condition such as an expression, IN, or LIKE.


In [10]:
cur.execute('SELECT material_type, COUNT(*), AVG(alpha), MAX(beta) FROM materials GROUP BY material_type')
cur.fetchmany(2)  # use fetchmany() with size parameter, just for fun


Out[10]:
[('crystal', 2, 0.6065615000000001, 891000000000.0),
 ('polymer', 2, 0.456, 10.11)]

More Practice

Add a fictitious experiment schedule and doctor up some data!


In [11]:
# use defaults, let primary key auto-increment, just supply material ID
cur.execute('INSERT INTO experiments (material_id) VALUES ("EVA")')
conn.commit()

In [12]:
# set up a test matrix for EVA
temp = range(300, 400, 25)
irrad = range(400, 800, 100)
try:
    for T in temp:
        for E in irrad:
            cur.execute('INSERT INTO experiments (temperature, irradiance) VALUES (?, ?)', (T, E))
except sqlite3.IntegrityError as exc:
    print('sqlite3.IntegrityError: %s' % exc)

# Oops! We forgot to specify the material, there is no default, and it is constrained as NOT NULL!
conn.rollback()  # undo any changes
try:
    for T in temp:
        for E in irrad:
            cur.execute('INSERT INTO experiments (temperature, irradiance, material_id) VALUES (?, ?, "EVA")', (T, E))
except sqlite3.IntegrityError as exc:
    print(exc)
conn.commit()  # commit! commit! commit!


sqlite3.IntegrityError: NOT NULL constraint failed: experiments.material_id

In [13]:
# this list is hard to read
list(cur.execute('SELECT * FROM experiments'))


Out[13]:
[(1, 298.15, 1000.0, 0, 'EVA', 0.005),
 (2, 300.0, 400.0, 0, 'EVA', 0.005),
 (3, 300.0, 500.0, 0, 'EVA', 0.005),
 (4, 300.0, 600.0, 0, 'EVA', 0.005),
 (5, 300.0, 700.0, 0, 'EVA', 0.005),
 (6, 325.0, 400.0, 0, 'EVA', 0.005),
 (7, 325.0, 500.0, 0, 'EVA', 0.005),
 (8, 325.0, 600.0, 0, 'EVA', 0.005),
 (9, 325.0, 700.0, 0, 'EVA', 0.005),
 (10, 350.0, 400.0, 0, 'EVA', 0.005),
 (11, 350.0, 500.0, 0, 'EVA', 0.005),
 (12, 350.0, 600.0, 0, 'EVA', 0.005),
 (13, 350.0, 700.0, 0, 'EVA', 0.005),
 (14, 375.0, 400.0, 0, 'EVA', 0.005),
 (15, 375.0, 500.0, 0, 'EVA', 0.005),
 (16, 375.0, 600.0, 0, 'EVA', 0.005),
 (17, 375.0, 700.0, 0, 'EVA', 0.005)]

In [14]:
# not only is Pandas much nicer, it also executes queries!
pd.read_sql('SELECT * FROM experiments', conn, index_col='experiment_id')


Out[14]:
temperature irradiance uv_filter material_id thickness
experiment_id
1 298.15 1000.0 0 EVA 0.005
2 300.00 400.0 0 EVA 0.005
3 300.00 500.0 0 EVA 0.005
4 300.00 600.0 0 EVA 0.005
5 300.00 700.0 0 EVA 0.005
6 325.00 400.0 0 EVA 0.005
7 325.00 500.0 0 EVA 0.005
8 325.00 600.0 0 EVA 0.005
9 325.00 700.0 0 EVA 0.005
10 350.00 400.0 0 EVA 0.005
11 350.00 500.0 0 EVA 0.005
12 350.00 600.0 0 EVA 0.005
13 350.00 700.0 0 EVA 0.005
14 375.00 400.0 0 EVA 0.005
15 375.00 500.0 0 EVA 0.005
16 375.00 600.0 0 EVA 0.005
17 375.00 700.0 0 EVA 0.005

ORDER BY

Does what it says: orders the query results by a column. The default is ascending; use ASC or DESC to change the order.


In [15]:
# Python's SQLite lets you use either '==' or '=', but standard SQL only allows '='
pd.read_sql('SELECT * FROM experiments WHERE irradiance = 700 ORDER BY temperature', conn, index_col='experiment_id')


Out[15]:
temperature irradiance uv_filter material_id thickness
experiment_id
5 300.0 700.0 0 EVA 0.005
9 325.0 700.0 0 EVA 0.005
13 350.0 700.0 0 EVA 0.005
17 375.0 700.0 0 EVA 0.005

In [16]:
# descending order
pd.read_sql('SELECT * FROM experiments WHERE temperature = 375 ORDER BY irradiance DESC', conn, index_col='experiment_id')


Out[16]:
temperature irradiance uv_filter material_id thickness
experiment_id
17 375.0 700.0 0 EVA 0.005
16 375.0 600.0 0 EVA 0.005
15 375.0 500.0 0 EVA 0.005
14 375.0 400.0 0 EVA 0.005

In [17]:
# Dr. Data
start_time, end_time = '2018-02-21T17:00-0800', '2018-02-21T18:30-0800'
timestamps = pd.date_range(start=start_time, end=end_time, freq='T')
# use http://poquitopicante.blogspot.com/2016/11/panda-pop.html to help you recall what offset alias to use
size = len(timestamps)
data = {
    'temperature': np.random.randn(size) + 298.15,
    'irradiance': np.random.randn(size) + 1000,
    'visible_transmittance': np.logspace(np.log10(0.9), np.log10(0.8), size) + np.random.randn(size) / 100
}
results = pd.DataFrame(data, index=timestamps)
duration = (results.index[-1] - results.index[0]).value  # [ns]
avg_temperature = results.temperature.mean()  # [K]
std_temperature = results.temperature.std()  # [K]
avg_irradiance = results.irradiance.mean()  # [W/m^2]
std_irradiance = results.irradiance.std()  # [W/m^2]
init_visible_transmittance = results.visible_transmittance[start_time]
final_visible_transmittance = results.visible_transmittance[end_time]
values = (1, results.to_csv(), duration, avg_temperature, std_temperature,
          avg_irradiance, std_irradiance, init_visible_transmittance, final_visible_transmittance,
         'this is doctored data')
cur.execute('''INSERT INTO trials (
    experiment_id, results, duration, avg_temperature, std_temperature,
    avg_irradiance, std_irradiance, init_visible_transmittance,
    final_visible_transmittance, notes
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', values)
conn.commit()  # commit! commit! commit!

In [18]:
# check the blob, is it really there?
cur.execute('SELECT results FROM trials WHERE trial_id = 1')
trial1_results = cur.fetchone()
pd.read_csv(io.StringIO(trial1_results[0]), index_col=0)
# yay! it works!


Out[18]:
irradiance temperature visible_transmittance
2018-02-21 17:00:00-08:00 998.959655 298.918741 0.893836
2018-02-21 17:01:00-08:00 1001.346465 298.553568 0.888956
2018-02-21 17:02:00-08:00 1000.222570 296.760235 0.916820
2018-02-21 17:03:00-08:00 998.697424 297.998634 0.895405
2018-02-21 17:04:00-08:00 999.766542 298.686229 0.903194
2018-02-21 17:05:00-08:00 998.521437 298.325881 0.892011
2018-02-21 17:06:00-08:00 1000.991891 297.175668 0.896553
2018-02-21 17:07:00-08:00 1000.054543 297.946657 0.878752
2018-02-21 17:08:00-08:00 1002.449126 298.814267 0.887048
2018-02-21 17:09:00-08:00 998.311166 296.710237 0.883333
2018-02-21 17:10:00-08:00 1001.106463 297.809162 0.887446
2018-02-21 17:11:00-08:00 998.999252 298.670059 0.892166
2018-02-21 17:12:00-08:00 999.143503 297.965134 0.881345
2018-02-21 17:13:00-08:00 998.718983 299.050471 0.899170
2018-02-21 17:14:00-08:00 999.689110 298.396573 0.888901
2018-02-21 17:15:00-08:00 998.906026 298.081031 0.883191
2018-02-21 17:16:00-08:00 1001.441733 300.027797 0.889522
2018-02-21 17:17:00-08:00 999.874945 298.704170 0.880713
2018-02-21 17:18:00-08:00 999.763816 299.457996 0.866446
2018-02-21 17:19:00-08:00 999.328643 297.217506 0.869135
2018-02-21 17:20:00-08:00 999.746232 297.342810 0.900115
2018-02-21 17:21:00-08:00 1000.737338 297.912401 0.882068
2018-02-21 17:22:00-08:00 1000.274663 297.772786 0.878944
2018-02-21 17:23:00-08:00 1000.158990 297.938545 0.888503
2018-02-21 17:24:00-08:00 1000.922571 298.359145 0.877483
2018-02-21 17:25:00-08:00 1001.872280 298.087752 0.873660
2018-02-21 17:26:00-08:00 1000.656051 297.877323 0.866980
2018-02-21 17:27:00-08:00 1000.853954 296.797606 0.856736
2018-02-21 17:28:00-08:00 999.942882 299.841713 0.856718
2018-02-21 17:29:00-08:00 998.647786 298.861976 0.882361
... ... ... ...
2018-02-21 18:01:00-08:00 1000.933114 297.728089 0.836169
2018-02-21 18:02:00-08:00 999.142461 298.791704 0.842719
2018-02-21 18:03:00-08:00 998.136343 298.775994 0.813755
2018-02-21 18:04:00-08:00 1001.349441 297.735862 0.834132
2018-02-21 18:05:00-08:00 1000.251661 299.644198 0.817461
2018-02-21 18:06:00-08:00 1000.057671 298.579453 0.808162
2018-02-21 18:07:00-08:00 1000.744842 300.871544 0.819319
2018-02-21 18:08:00-08:00 1000.056866 296.959531 0.823236
2018-02-21 18:09:00-08:00 999.557475 298.471216 0.806808
2018-02-21 18:10:00-08:00 999.519186 298.048904 0.822102
2018-02-21 18:11:00-08:00 1000.148397 298.552681 0.806129
2018-02-21 18:12:00-08:00 999.258875 293.823340 0.827836
2018-02-21 18:13:00-08:00 1001.182033 299.571102 0.832395
2018-02-21 18:14:00-08:00 997.366496 297.647049 0.817866
2018-02-21 18:15:00-08:00 1000.163225 296.983521 0.818295
2018-02-21 18:16:00-08:00 1000.236309 297.914401 0.808314
2018-02-21 18:17:00-08:00 999.754813 298.566226 0.822418
2018-02-21 18:18:00-08:00 999.857364 299.251087 0.802571
2018-02-21 18:19:00-08:00 1000.809811 296.530036 0.810362
2018-02-21 18:20:00-08:00 999.982595 296.600307 0.831929
2018-02-21 18:21:00-08:00 998.040597 298.726982 0.809474
2018-02-21 18:22:00-08:00 1000.357286 296.702854 0.792977
2018-02-21 18:23:00-08:00 999.828010 296.561526 0.792677
2018-02-21 18:24:00-08:00 998.066797 298.541645 0.795909
2018-02-21 18:25:00-08:00 999.942954 300.276747 0.805300
2018-02-21 18:26:00-08:00 998.832891 297.323483 0.791858
2018-02-21 18:27:00-08:00 998.118673 297.802544 0.794021
2018-02-21 18:28:00-08:00 1000.149277 299.374961 0.811664
2018-02-21 18:29:00-08:00 998.827775 298.726997 0.796634
2018-02-21 18:30:00-08:00 1000.273102 299.743819 0.808884

91 rows × 3 columns

JOIN

The foreign keys relate tables, but how do we use this relation? By joining the tables.
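Following the pattern of the other commands, the basic syntax joins rows of two tables wherever the ON condition holds:

SELECT <field_name>, ... FROM <table_A> JOIN <table_B> ON <table_A>.<field_name> = <table_B>.<field_name>;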


In [19]:
# add the results for experiment 17: T=375[K], E=700[W/m^2]
experiment_id, temperature, irradiance = list(cur.execute(
    'SELECT experiment_id, temperature, irradiance FROM experiments WHERE (temperature = 375 AND irradiance = 700)'
))[0]
start_time, end_time = '2018-02-28T17:00-0800', '2018-02-28T18:30-0800'
timestamps = pd.date_range(start=start_time, end=end_time, freq='T')
# use http://poquitopicante.blogspot.com/2016/11/panda-pop.html to help you recall what offset alias to use
size = len(timestamps)
data = {
    'temperature': np.random.randn(size) + temperature,
    'irradiance': np.random.randn(size) + irradiance,
    'visible_transmittance': np.logspace(np.log10(0.9), np.log10(0.7), size) + np.random.randn(size) / 100
}
results = pd.DataFrame(data, index=timestamps)
duration = (results.index[-1] - results.index[0]).value  # [ns]
avg_temperature = results.temperature.mean()  # [K]
std_temperature = results.temperature.std()  # [K]
avg_irradiance = results.irradiance.mean()  # [W/m^2]
std_irradiance = results.irradiance.std()  # [W/m^2]
init_visible_transmittance = results.visible_transmittance[start_time]
final_visible_transmittance = results.visible_transmittance[end_time]
values = (experiment_id, results.to_csv(), duration, avg_temperature, std_temperature,
          avg_irradiance, std_irradiance, init_visible_transmittance, final_visible_transmittance,
         'this is doctored data')
cur.execute('''INSERT INTO trials (
    experiment_id, results, duration, avg_temperature, std_temperature,
    avg_irradiance, std_irradiance, init_visible_transmittance,
    final_visible_transmittance, notes
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', values)
conn.commit()  # commit! commit! commit!

In [20]:
pd.read_sql('''
SELECT
    trial_id, trials.experiment_id AS experiment, init_visible_transmittance, final_visible_transmittance,
    experiments.material_id AS material, temperature, irradiance, alpha, beta, material_type
FROM trials
JOIN experiments ON experiments.experiment_id = experiment
JOIN materials ON materials.material_id = material
''', conn, index_col='trial_id')


Out[20]:
experiment init_visible_transmittance final_visible_transmittance material temperature irradiance alpha beta material_type
trial_id
1 1 0.893836 0.808884 EVA 298.15 1000.0 0.123 4.56 polymer
2 17 0.917532 0.690272 EVA 375.00 700.0 0.123 4.56 polymer

In [21]:
cur.close()
conn.close()
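We never got to the last few items in the outline: indexing, altering, dropping, and vacuuming. As a parting gift, here is a minimal sketch of each; the index name and the added column are just examples, not part of the schema above.

with sqlite3.connect(DBPATH) as conn:
    # index a frequently queried column to speed up lookups
    conn.execute('CREATE INDEX idx_experiments_material ON experiments (material_id)')
    # alter an existing table, e.g. to add a column
    # (SQLite's ALTER TABLE is limited: it can add a column or rename a table, but not drop columns)
    conn.execute('ALTER TABLE materials ADD COLUMN color TEXT')
    # drop a table and all of its rows
    conn.execute('DROP TABLE trials')
conn.execute('VACUUM')  # rebuild the database file to reclaim space; must run outside a transaction
conn.close()  # remember: the with block committed, but didn't close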

Epilogue

Unfortunately that's all we have time for, but we covered a lot, even if we didn't get through everything in the live session. If there's anything I hope you learned from this, it's:

  1. You can read the SQLite and PostgreSQL manuals or use StackOverflow to teach yourself SQL. It's not rocket science. I was a mechanical engineer, and I learned it.
  2. The basics of making a table, inserting data, and conducting queries.
  3. How to interact programmatically with a database.

But there's still so much to learn! Hopefully you will continue on your own and try some of these interesting topics.

  1. Use what you've learned on a more advanced database management system like PostgreSQL, MySQL, or MS SQL Server.
  2. Use an Object Relational Mapper (ORM) like Django or SQLAlchemy to simplify your workflow by creating and manipulating data in native Python (see the sketch after this list).
  3. Explore a NoSQL database like Cassandra or MongoDB and see if the flexible structure and large scale cluster computing allow you to explore big data with tools like Spark and Hadoop.
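To give a taste of option 2, here's a minimal, hypothetical sketch of our materials table as a Django model; it assumes a configured Django project and isn't part of this notebook.

from django.db import models

class Material(models.Model):
    """Mirror of the materials table as a Django model."""
    material_id = models.TextField(primary_key=True)
    long_name = models.TextField(unique=True)
    alpha = models.FloatField()
    beta = models.FloatField()
    material_type = models.TextField()

# queries are then plain Python, e.g.:
# polymers = Material.objects.filter(material_type='polymer')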

Did you learn about everything you expected today? What did you wish we had covered? Leave your comments or improve this tutorial by sending a PR to The Hacker Within, Berkeley.

Thanks!