This tutorial will show you the basics of SQL. We are going to use Python because it comes with the popular SQLite (aka etilqs) relational database built in.
The following examples assume you have Python 3 installed on your computer. Any distribution will do. If you have Anaconda or Miniconda, you can create a new conda environment with the requirements for this tutorial from a terminal. For example, with Miniconda on Mac OS X, you might use the following to activate the root conda environment and create a new one called "py36-thw-sql".
~$ source ~/miniconda/bin/activate
(root) ~$ conda create -n py36-thw-sql python=3.6.3 psycopg2 django pandas
Or on Windows with Anaconda, you might use the following in a CMD terminal.
~> %LOCALAPPDATA%\Continuum\anaconda3\Scripts\activate
(root) ~> conda create -n py36-thw-sql python=3.6.3 psycopg2 django pandas
PEP 249 specifies a connection to a database, and a cursor from the connection to execute SQL.
In [1]:
# imports
import io # we'll need this way later
import os
import sqlite3 # this is the module that binds to SQLite
import numpy as np # never know when you might need NumPy, oh, right, always!
import pandas as pd # you'll see why we can use this later
DBFILE = 'sqlite3.db' # this will be our database
BASEDIR = %pwd # os.path.abspath(os.path.dirname(__file__))
DBPATH = os.path.join(BASEDIR, DBFILE)
# we may need to delete the existing file first
if os.path.exists(DBPATH):
    os.remove(DBPATH)
# open a connection to the database for this tutorial
conn = sqlite3.connect(DBPATH)
# get a cursor to the database
cur = conn.cursor()
If a file with the same name doesn't already exist, then this creates a new database; otherwise it connects to the existing database contained in the file. You can also use ':memory:' to create an "in-memory" database that has no file, but then you can't connect to it from another process.
We'll have to close the connection and cursor later. Next time we could use a with context to manage the transaction automatically.
with sqlite3.connect('sqlite3.db') as conn:
    cur = conn.execute('SQL QUERY ...') # e.g.: 'SELECT * FROM table_name;'
    output = cur.fetchall() # get the results
The with context commits the changes on success, or rolls back the changes and raises the exception otherwise. Note that it does not actually close the connection, so call conn.close() yourself; closing the connection automatically closes the cursor. Other bindings may offer similar context managers.
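If you want the connection closed automatically as well, one option is contextlib.closing from the standard library. A minimal sketch, assuming a throwaway 'example.db' file:

import sqlite3
from contextlib import closing

# closing() guarantees conn.close() on exit; the inner "with conn"
# commits on success or rolls back if an exception was raised
with closing(sqlite3.connect('example.db')) as conn:
    with conn:
        cur = conn.execute('SELECT 1') # any SQL query
        output = cur.fetchall() # get the results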
A relational database or SQL database is a tabular structure consisting of rows of data with columns of fields. The data definition language or DDL used to create the table is the same language used to query it, called SQL or Structured Query Language.
Although the basic SQL commands are nearly the same for other relational databases, the data types may be different. SQLite only has 5 data types: NULL, INTEGER, REAL, TEXT, and BLOB. For booleans, use integer zero for false and one for true. For dates and times, use text in ISO 8601 format, e.g.: "2018-02-21T17:05-0800". By comparison, PostgreSQL has too many to list here, including booleans, dates, times, arrays, JSON, etc.
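For example, here's a minimal sketch (using a throwaway in-memory database and a made-up table) of storing a boolean as an integer and a timestamp as ISO 8601 text:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE events (uv_on INTEGER, stamp TEXT)')
# booleans become 0/1 integers; timestamps become ISO 8601 strings
conn.execute('INSERT INTO events VALUES (?, ?)', (1, '2018-02-21T17:05-0800'))
uv_on, stamp = conn.execute('SELECT uv_on, stamp FROM events').fetchone()
print(bool(uv_on), stamp) # True 2018-02-21T17:05-0800
conn.close()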
CREATE
The basic SQL command to create a table is
CREATE TABLE <table_name> (<field_name> <TYPE> <CONSTRAINTS>, <field_name> <TYPE> <CONSTRAINTS>, <CONSTRAINTS>, ...);
Some languages enforce the semicolon, some don't. The syntax is nearly the same for other relational databases.
Constraints are optional and set conditions, limitations, or options for columns and the table. The most common constraints are: PRIMARY KEY, UNIQUE, NOT NULL, DEFAULT, FOREIGN KEY, REFERENCES, etc. The syntax is nearly the same for other relational databases.
PRIMARY KEY
The most important of these is PRIMARY KEY, which is equivalent to UNIQUE NOT NULL. A primary key is a unique reference that identifies each record in the table. Although it is not required, every table should have a primary key. Only one primary key is allowed, and it can be constructed from multiple columns, PRIMARY KEY (<field_A>, <field_B>), to create a unique-together, non-null identifier. In SQLite, if a primary key is missing, an integer primary key named rowid is created by default. Also in SQLite, any integer primary key is automatically incremented, so the AUTOINCREMENT command is usually not needed. In PostgreSQL the SERIAL type is used to create a corresponding sequence for the primary key.
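Here's a minimal sketch (in a throwaway in-memory database, with made-up tables) showing the implicit rowid and the auto-incrementing integer primary key:

import sqlite3

conn = sqlite3.connect(':memory:')
# no primary key declared, so SQLite supplies a hidden rowid column
conn.execute('CREATE TABLE pets (name TEXT)')
conn.execute("INSERT INTO pets VALUES ('Rex')")
print(conn.execute('SELECT rowid, name FROM pets').fetchall()) # [(1, 'Rex')]
# an INTEGER PRIMARY KEY auto-increments when omitted from the insert
conn.execute('CREATE TABLE toys (toy_id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany('INSERT INTO toys (name) VALUES (?)', [('ball',), ('bone',)])
print(conn.execute('SELECT * FROM toys').fetchall()) # [(1, 'ball'), (2, 'bone')]
conn.close()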
The other constraints and options are also important, but we'll discover those as we learn. Let's create some simple databases with fictitious data to practice. Imagine you are testing several different materials with different properties $\alpha$ and $\beta$ under different stresses, such as varying temperature, light intensity, and thickness. How would you organize this data? Take a moment to design a schema, or structure, for your data. The schema consists of the column names and data types and the column and table constraints.
In [2]:
# we can use Python triple-quoted strings to span multiple lines; we quote them with single quotes
# so the SQL inside can use double quotes (standard SQL reserves double quotes for identifiers and
# single quotes for string literals, but SQLite tolerates double-quoted literals)
# first create a materials table
cur.execute('''CREATE TABLE materials (
    material_id TEXT PRIMARY KEY,
    long_name TEXT UNIQUE NOT NULL,
    alpha REAL NOT NULL,
    beta REAL NOT NULL,
    material_type TEXT NOT NULL
)''')
conn.commit()
# if you don't commit the changes, they won't be written to the file, and won't be visible to other connections
# then create an experiments table
cur.execute('''CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY,
    temperature REAL DEFAULT 298.15,
    irradiance REAL DEFAULT 1000.0,
    uv_filter INTEGER DEFAULT 0,
    material_id TEXT NOT NULL REFERENCES materials ON UPDATE CASCADE ON DELETE CASCADE,
    thickness REAL DEFAULT 0.005,
    UNIQUE (temperature, irradiance, uv_filter, material_id)
)''')
conn.commit()
# and finally create a trials table
cur.execute('''CREATE TABLE trials (
    trial_id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments ON UPDATE CASCADE ON DELETE CASCADE,
    results BLOB NOT NULL,
    duration REAL NOT NULL,
    avg_temperature REAL NOT NULL,
    std_temperature REAL NOT NULL,
    avg_irradiance REAL NOT NULL,
    std_irradiance REAL NOT NULL,
    init_visible_transmittance REAL NOT NULL,
    final_visible_transmittance REAL NOT NULL,
    notes TEXT
)''')
conn.commit()
FOREIGN KEY
A foreign key constraint creates a relationship between two tables. The FOREIGN KEY is implied when the REFERENCES column constraint is applied. In the experiments table above, the column constraint on material_id is the same as adding this table constraint:
FOREIGN KEY (material_id) REFERENCES materials (material_id)
Specifying the referenced column in the table constraint isn't necessary, and if omitted defaults to the primary key of the referenced table. The syntax is nearly the same for other relational databases.
You can use the same name for the foreign key and its related field, but it may make joining tables more difficult because you will need to use the table name to avoid an ambiguous column name, e.g. trials.experiment_id and experiments.experiment_id to differentiate between them. You can also use AS to create a temporary name like trials.experiment_id AS experiment. Or you could just use different names for the foreign key and its related field, like FOREIGN KEY (material) REFERENCES materials (material_id), and then there's no ambiguity. Your call.
ON DELETE and ON UPDATE
What happens if the reference of a foreign key is deleted or updated? That's up to you: in SQLite the default is to do nothing, but typically you want the action to cascade. Add the desired ON DELETE or ON UPDATE action to the constraint.
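One caveat: SQLite parses foreign key constraints but does not enforce them unless you turn enforcement on for each connection. Here's a minimal sketch (a throwaway in-memory database with made-up tables) showing the PRAGMA and a cascading delete:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON') # enforcement is off by default in SQLite
conn.execute('CREATE TABLE parents (pid INTEGER PRIMARY KEY)')
conn.execute('''CREATE TABLE children (
    cid INTEGER PRIMARY KEY,
    pid INTEGER NOT NULL REFERENCES parents ON DELETE CASCADE
)''')
conn.execute('INSERT INTO parents VALUES (1)')
conn.execute('INSERT INTO children VALUES (1, 1)')
conn.execute('DELETE FROM parents WHERE pid = 1') # cascades to the children table
print(conn.execute('SELECT COUNT(*) FROM children').fetchone()) # (0,)
conn.close()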
INSERT
The basic SQL command to put data into a table is
INSERT INTO <table_name> (<field_name>, <field_name>, ...) VALUES (<value>, <value>, ...)
Other relational databases use the same SQL syntax.
Let's add some pretend data to the database.
In [3]:
# add EVA as a material
cur.execute('INSERT INTO materials VALUES ("EVA", "ethylene vinyl acetate", 0.123, 4.56, "polymer")')
conn.commit() # you must commit for it to become permanent
cur.rowcount # tells you how many rows were written (sometimes; rowcount can be quirky)
Out[3]:
You can use placeholders to loop over insert statements to add multiple records.
WARNING: Never use string formatters in lieu of placeholders or you may be subject to a SQL injection attack. SQLite uses ? but other relational databases may use %s or another placeholder.
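To see why, here's a sketch of the difference (the hostile input value is made up): string formatting pastes the input into the SQL text, so a crafted value can rewrite the query, while a placeholder always binds the input as plain data.

user_input = 'EVA"; DROP TABLE materials; --' # a hostile "material ID"
# BAD: the value becomes part of the SQL text itself (never run this!)
query = 'SELECT * FROM materials WHERE material_id = "%s"' % user_input
# GOOD: the ? placeholder binds the value; it can never change the SQL
cur.execute('SELECT * FROM materials WHERE material_id = ?', (user_input,))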
Also, in sqlite3, executemany is a convenient shortcut, but it may not be convenient for all database bindings.
In [4]:
# add some more fake materials
fake_materials = [
    ('PMMC', 'poly methyl methacrylate', 0.789, 10.11, 'polymer'),
    ('KBr', 'potassium bromide', 1.213, 14.15, 'crystal')
]
for mat in fake_materials:
    # must have the same number of placeholders as values
    cur.execute('INSERT INTO materials VALUES (?, ?, ?, ?, ?)', mat) # use placeholders
conn.commit() # you can commit all of the changes at the end of the loop
# use the executemany shortcut
fake_materials = [
    ('SiO2', 'silicon dioxide', 1.617, 18.19, 'crystal'),
    ('CaF2', 'calcium fluoride', 2.0, 21.22, 'crystal')
]
cur.executemany('INSERT INTO materials VALUES (?, ?, ?, ?, ?)', fake_materials)
conn.commit()
print('rowcount = %d' % cur.rowcount) # with executemany, cur.rowcount shows the total number of rows
DELETE and UPDATE
Oops, I made a mistake. How do I fix it? The opposite of INSERT is DELETE. But don't throw the baby out with the bathwater: you can also UPDATE a record. Other relational databases use the same SQL syntax to manipulate data.
In [5]:
cur.execute('DELETE FROM materials WHERE material_id = "SiO2"')
cur.execute('UPDATE materials SET alpha=1.23E-4, beta=8.910E+11 WHERE material_id = "CaF2"')
conn.commit()
SELECT
The way you select data is by executing queries. The language is the same for all relational databases. The star * means select all columns, or you can give the columns explicitly.
In [6]:
cur.execute('SELECT * FROM materials')
cur.fetchall() # fetch all the results of the query
Out[6]:
You can limit a query using WHERE and LIMIT. You can combine WHERE with a conditional expression, IN to check a set, or LIKE to compare with strings. Use AND and OR to combine conditions.
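The examples below demonstrate WHERE, LIMIT, and IN, but not LIKE, so here's a quick sketch against the materials table we just populated; in a LIKE pattern, % matches any run of characters and _ matches exactly one (the rows returned depend on your table's current contents).

cur.execute("SELECT long_name FROM materials WHERE long_name LIKE '%ide'")
print(cur.fetchall()) # e.g. [('potassium bromide',), ('calcium fluoride',)]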
The Python DB-API cursor can be used as an iterator, or you can call its fetch methods.
In [7]:
# limit the query using WHERE and LIMIT
cur.execute('SELECT material_id, long_name FROM materials WHERE alpha < 1 LIMIT 2')
for c in cur: print('{} is {}'.format(*c)) # use the cursor as an iterator
In [8]:
materials_list = ("EVA", "PMMC")
cur.execute('SELECT alpha, beta FROM materials WHERE material_id IN (?, ?)', materials_list)
[(mat, cur.fetchone()) for mat in materials_list] # use the cursor's fetchone() method to get the next item
Out[8]:
In [9]:
cur.execute('SELECT COUNT(*) FROM materials')
print(cur.fetchone())
In [10]:
cur.execute('SELECT material_type, COUNT(*), AVG(alpha), MAX(beta) FROM materials GROUP BY material_type')
cur.fetchmany(2) # use fetchmany() with size parameter, just for fun
Out[10]:
In [11]:
# use the defaults and let the primary key auto-increment; just supply the material ID
cur.execute('INSERT INTO experiments (material_id) VALUES ("EVA")')
conn.commit()
In [12]:
# set up a test matrix for EVA
temp = range(300, 400, 25)
irrad = range(400, 800, 100)
try:
    for T in temp:
        for E in irrad:
            cur.execute('INSERT INTO experiments (temperature, irradiance) VALUES (?, ?)', (T, E))
except sqlite3.IntegrityError as exc:
    print('sqlite3.IntegrityError: %s' % exc)
# Oops! We forgot to specify the material; there is no default, and it is constrained as NOT NULL!
conn.rollback() # undo any changes
try:
    for T in temp:
        for E in irrad:
            cur.execute('INSERT INTO experiments (temperature, irradiance, material_id) VALUES (?, ?, "EVA")', (T, E))
except sqlite3.IntegrityError as exc:
    print(exc)
conn.commit() # commit! commit! commit!
In [13]:
# this list is hard to read
list(cur.execute('SELECT * FROM experiments'))
Out[13]:
In [14]:
# not only is Pandas much nicer, it also executes queries!
pd.read_sql('SELECT * FROM experiments', conn, index_col='experiment_id')
Out[14]:
In [15]:
# Python's SQLite lets you use either '==' or '=', but standard SQL only allows '=', okay?
pd.read_sql('SELECT * FROM experiments WHERE irradiance = 700 ORDER BY temperature', conn, index_col='experiment_id')
Out[15]:
In [16]:
# descending order
pd.read_sql('SELECT * FROM experiments WHERE temperature = 375 ORDER BY irradiance DESC', conn, index_col='experiment_id')
Out[16]:
In [17]:
# Dr. Data
start_time, end_time = '2018-02-21T17:00-0800', '2018-02-21T18:30-0800'
timestamps = pd.date_range(start=start_time, end=end_time, freq='T')
# use http://poquitopicante.blogspot.com/2016/11/panda-pop.html to help you recall what offset alias to use
size = len(timestamps)
data = {
    'temperature': np.random.randn(size) + 298.15,
    'irradiance': np.random.randn(size) + 1000,
    'visible_transmittance': np.logspace(np.log10(0.9), np.log10(0.8), size) + np.random.randn(size) / 100
}
results = pd.DataFrame(data, index=timestamps)
duration = (results.index[-1] - results.index[0]).value # [ns]
avg_temperature = results.temperature.mean() # [K]
std_temperature = results.temperature.std() # [K]
avg_irradiance = results.irradiance.mean() # [W/m^2]
std_irradiance = results.irradiance.std() # [W/m^2]
init_visible_transmittance = results.visible_transmittance[start_time]
final_visible_transmittance = results.visible_transmittance[end_time]
values = (1, results.to_csv(), duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance, final_visible_transmittance,
'this is doctored data')
cur.execute('''INSERT INTO trials (
experiment_id, results, duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance,
final_visible_transmittance, notes
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', values)
conn.commit() # commit! commit! commit!
In [18]:
# check the blob, is it really there?
cur.execute('SELECT results FROM trials WHERE trial_id = 1')
trial1_results = cur.fetchone()
pd.read_csv(io.StringIO(trial1_results[0]), index_col=0)
# yay! it works!
Out[18]:
In [19]:
# add the results for experiment 17: T=375[K], E=700[W/m^2]
experiment_id, temperature, irradiance = list(cur.execute(
    'SELECT experiment_id, temperature, irradiance FROM experiments WHERE (temperature = 375 AND irradiance = 700)'
))[0]
start_time, end_time = '2018-02-28T17:00-0800', '2018-02-28T18:30-0800'
timestamps = pd.date_range(start=start_time, end=end_time, freq='T')
# use http://poquitopicante.blogspot.com/2016/11/panda-pop.html to help you recall what offset alias to use
size = len(timestamps)
data = {
    'temperature': np.random.randn(size) + temperature,
    'irradiance': np.random.randn(size) + irradiance,
    'visible_transmittance': np.logspace(np.log10(0.9), np.log10(0.7), size) + np.random.randn(size) / 100
}
results = pd.DataFrame(data, index=timestamps)
duration = (results.index[-1] - results.index[0]).value # [ns]
avg_temperature = results.temperature.mean() # [K]
std_temperature = results.temperature.std() # [K]
avg_irradiance = results.irradiance.mean() # [W/m^2]
std_irradiance = results.irradiance.std() # [W/m^2]
init_visible_transmittance = results.visible_transmittance[start_time]
final_visible_transmittance = results.visible_transmittance[end_time]
values = (experiment_id, results.to_csv(), duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance, final_visible_transmittance,
'this is doctored data')
cur.execute('''INSERT INTO trials (
experiment_id, results, duration, avg_temperature, std_temperature,
avg_irradiance, std_irradiance, init_visible_transmittance,
final_visible_transmittance, notes
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', values)
conn.commit() # commit! commit! commit!
In [20]:
pd.read_sql('''
    SELECT
        trial_id, trials.experiment_id AS experiment, init_visible_transmittance, final_visible_transmittance,
        experiments.material_id AS material, temperature, irradiance, alpha, beta, material_type
    FROM trials
    JOIN experiments ON experiments.experiment_id = trials.experiment_id
    JOIN materials ON materials.material_id = experiments.material_id
''', conn, index_col='trial_id')
Out[20]:
In [21]:
cur.close()
conn.close()
Unfortunately that's all we have time for, but we covered a lot, even if we didn't get this far. If there's anything I hope you learned from this, it's that:
But there's still so much to learn! Hopefully you will continue on your own and try some of these interesting topics.
Did you learn about everything you expected today? What do you wish we had covered? Leave your comments or improve this tutorial by sending a PR to The Hacker Within, Berkeley.
Thanks!