Lecture 17: Databases

Monday, November 6th 2017

Introduction

Why Learn Databases

You will see many databases in your career.

SQL (Structured Query Language) is still very popular and will remain so for a long time. Hence, you will need to code SQL.
- SQL is the language used to query a relational database.

You will have to deal with systematic storage of structured and unstructured data at some point.

More Database Motivation

It is very hard to implement a database well, but you must understand how they work.

Data storage/wrangling are not just database concerns; packages such as dplyr and pandas require a similar knowledge-base.

It is very important to make an informed decision on a storage engine that is sufficient for your program.

Important to understand query performance.

Transaction processing is not optimal for analytics.

What kind of data access do you need?

The answer depends on your problem and the resources you have available.

Database Genre	Examples
relational	SQL and its derivatives
document oriented	MongoDB, CouchDB
key-value	Riak, Memcached, leveldb
graph oriented	Neo4J
columnar	HBase

A Sampling of Database Genres

Relational Model

A relation (table) is a collection of tuples. Each tuple is called a row.

A collection of tables related to each other through common data values.

Items in a column are values of one attribute.

A cell is expected to be atomic

Tables are related to each other if they have columns (called keys) which represent the same values.

SQL (Structured Query Language) is a declarative model: a query optimizer decides how to execute the query.

Example from CS109:

Key-Value Model

like a dictionary
the database is the index

Document Model

stores nested records
bad for many-to-many
storage locality good for access, bad for writing

Components to a database

Client connection manager: what to do with incoming data
Transactional storage
- storage data structures and the log
- transactions and ACID: atomicity, consistency, isolation, durability
Process model: coroutines, threads, processes
Query model and language: query optimization

Working with Databases

Relational Grammar of Data

We want a language to help us easily query items in the database.
Provide simple verbs for simple things.
Pandas is a library for Python that allows users to work with data structures and relational databases.
The dplyr package offers a bunch of data manipulation tools including those for working with relational databases with the R programming lanuage.

The dplyr_pandas notebook by Tom Augspurger contains a table comparing dplyr and pandas. The following table is a modification to that table:

VERB	dplyr	pandas	SQL
QUERY/SELECTION	filter() (and slice())	query() (and loc[], iloc[])	SELECT WHERE
SORT	arrange()	sort()	ORDER BY
SELECT-COLUMNS/PROJECTION	select() (and rename())	[](__getitem__) (and rename())	SELECT COLUMN
SELECT-DISTINCT	distinct()	unique(),drop_duplicates()	SELECT DISTINCT COLUMN
ASSIGN	mutate() (and transmute())	assign	ALTER/UPDATE
AGGREGATE	summarise()	describe(), mean(), max()	None, AVG(),MAX()
SAMPLE	sample_n() and sample_frac()	sample()	implementation dep, use RAND()
GROUP-AGG	group_by/summarize	groupby/agg, count, mean	GROUP BY
DELETE	?	drop/masking	DELETE/WHERE

`SQLite`

NoSQL databases are gaining in popularity. However, we will stick with traditional relational databases in the course.

We need a way of querying a given relational database. There are several languages for such a purpose. We will focus on SQL (Structured Query Language).

SQL has a long history. Because of this (or in spite of it), there are many version of SQL available today.

We'll use SQLite. Here are some great references:

SQLite is built into Python.
- Python implements a standard database API for all databases called DBAPI2. Some references: DBAPI2 and PEP 249.

Note: There is an even higher level API available, called SQLAlchemy.

You can install SQLite if you need to: https://www.sqlite.org/download.html.
You may find the SQLite browser useful: http://sqlitebrowser.org.
You can access the command line interface by downloading the SQLite CLI:, SQLite CLI

`SQLite` Basics

The Plan

We're going to work with the sqlite3 package in Python.
This package will allow us to execute basic SQLite commands in Python to build and manipulate our database.
We'll start by creating a SQL database and work up from there.
Ultimately, we'd like to work with pandas to make our lives easier.
At least in the beginning, we'll just work directly with the SQLite commands to get the basics down.

The Essentials

Core Commands

SELECT --- Select a table name
INSERT --- Insert data into the table
UPDATE --- Change data values in the table
DELETE --- Delete data in the table

We can string these commands together to perform our basic operations on the database.

Structural Commands

These commands are used to modify the structure of the table in a database.

CREATE --- Create a table in a database
DROP --- Delete a table in the database
ALTER --- Add, delete, or modify columns in an existing table

Starting a Database

We'll start by using the simplest of the commands. Ultimately, we will create a database with tables of presidential candidates (from 2012) and their contributors.

We begin by importing sqlite3 into Python.



In [1]:

    
import sqlite3

Next, we create a connection to the database. A new one will be created if it doesn't already exist.
Then we'll place the "cursor" on the database.



In [2]:

    
db = sqlite3.connect('test_db.sqlite')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS candidates") # Convenient in case you want to start over
cursor.execute("DROP TABLE IF EXISTS contributors") # Convenient in case you want to start over









    Out[2]:





<sqlite3.Cursor at 0x10b66ef80>



In [3]:

    
# This strange command turns out FOREIGN KEY support in SQLite.
# By default, this support is off.
# We'll talk about foreign keys later on.  Just setting things up here.
cursor.execute("PRAGMA foreign_keys=1")









    Out[3]:





<sqlite3.Cursor at 0x10b66ef80>

Creating a First Table in the Database

We have a connection to a database, but that database doesn't have a table in it yet.

We'll create a table for the candidates with the following columns: id, first_name, last_name, middle_initial, and party.

We require id to be of type integer and the rest to be a string.



In [4]:

    
cursor.execute('''CREATE TABLE candidates (
               id INTEGER PRIMARY KEY NOT NULL, 
               first_name TEXT, 
               last_name TEXT, 
               middle_init TEXT, 
               party TEXT NOT NULL)''')

db.commit() # Commit changes to the database

Discussion

What did we just do? Some of the commands are obvious (like CREATE TABLE) but others are not (like PRIMARY KEY).

cursor.execute() runs the SQLite command, which we pass in as a string.
The id column:
- Has integer values.
- Is the PRIMARY KEY. This means that those field values are unique and cannot have NULL values. A table can only have one PRIMARY KEY.
- Prohibit NULL values.
first_name, last_name, middle_init are all TEXT fields of unlimited length.
- Note that SQL has other types such as VARCHAR(N) and CHAR(N). VARCHAR(N) allows variable text lengths up to N characters in length and CHAR(N) expects text of exactly N characters.
party is also entered as TEXT and cannot have NULL values.

We have followed a convention wherein SQL commands are issued in capital letters and table fields are written in lowercase text.

NOTE: Always commit changes to your database! If you don't, you will lose them when you close the database.

Adding Values



In [5]:

    
cursor.execute('''INSERT INTO candidates
               (id, first_name, last_name, middle_init, party)
               VALUES (?, ?, ?, ?, ?)''', 
                (16, "Mike", "Huckabee", "", "R"))
db.commit()

Discussion

We executed the INSERT command in order to insert some values into the candidates table.
This could be read as: "Insert into the candidates table into the columns (...) the values (?,?,?,?,?)"
Why all the question marks?
- The question marks are not necessary, but they are important. You can use Python's string formatting to insert parameters into the execute() method.
- However, there are two pretty good reasons for using the ? placeholders instead:
  1. They leave the burden of correctly encoding and escaping data items to the database module
  2. More importantly, they improve security because they prevent arbitrary SQL from being injected into a query.

Some Queries

First, let's add another entry or two.



In [6]:

    
cursor.execute('''INSERT INTO candidates
               (id, first_name, last_name, middle_init, party)
               VALUES (?, ?, ?, ?, ?)''', 
                (34, "Hillary", "Clinton", "R.", "D"))
db.commit()

Getting All Rows from a Table



In [7]:

    
cursor.execute("SELECT * FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)









    



[(16, 'Mike', 'Huckabee', '', 'R'), (34, 'Hillary', 'Clinton', 'R.', 'D')]

Selecting Specifics



In [8]:

    
cursor.execute("SELECT * FROM candidates WHERE first_name = 'Joseph'")
all_rows = cursor.fetchall()
print(all_rows)

[]



In [9]:

    
cursor.execute("SELECT first_name FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)









    



[('Mike',), ('Hillary',)]

An Exercise

Read data from the candidates.txt file and enter it into the candidates table that you just created.

Another Table

We'll add another table to our database and introduce a few new SQL commands and ideas.

The new table will contain data on contributors and their contributions to each candidate.



In [10]:

    
cursor.execute('''CREATE TABLE contributors (
          id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
          last_name TEXT, 
          first_name TEXT, 
          middle_name TEXT, 
          street_1 TEXT, 
          street_2 TEXT, 
          city TEXT, 
          state TEXT, 
          zip TEXT, 
          amount INTEGER, 
          date DATETIME, 
          candidate_id INTEGER NOT NULL, 
          FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')

db.commit()

Discussion

The creation of our new table includes a few new commands:

AUTOINCREMENT: The id for a new entry to the table will be automatically generated. No need to enter this explicitly. The increment starts at 1.
DATETIME: Gives the date in the format YYYY-MM-DD HH:MM:SS
FOREIGN KEY: This allows us to link the two tables (candidates and contributors). More below.

Discussion: FOREIGN KEY

We'd like to be able to keep track of the contributions to each candidate.
In order to accomplish this, we need to link the candidates and contributors tables somehow.
A reference between the two tables is created by a common field in each table.
We define the FOREIGN KEY in the second table and it references a unique key in the first table. Often, this unique key is the PRIMARY KEY.

So, in our example, we create a new field in the contributors table called candidate_id. We declare it to be a FOREIGN KEY that references the id field in the candidates table.

Let's add a contributor.

Adding Some Data to Contributors Table



In [11]:

    
contributors = [("Agee", "Steven", "", "549 Laurel Branch Road", "", "Floyd", "VA", int(24091), int(500), 2007-6-30, 16),
                ("Aaron", "Carole", "", "PO Box 1806", "", "Ogunqui", "ME", int(3907), int(70), 2008-2-7, 34)]

cursor.executemany('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', contributors)
db.commit()



In [12]:

    
cursor.execute("SELECT last_name FROM contributors where amount < 100")
for c in cursor.fetchall():
    print(c)









    



('Aaron',)

Now let's try to add a contributor who contributed to a candidate who's id is not in the candidates table.



In [13]:

    
cursor.execute('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', ("Buckler", "Steve", "", "24351 Armada Dr.", "", "Dana Point", "CA", int(926291), int(50), 2007-7-30, 20))









    



---------------------------------------------------------------------------
IntegrityError                            Traceback (most recent call last)
<ipython-input-13-3d4e7549dbaf> in <module>()
----> 1 cursor.execute('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', ("Buckler", "Steve", "", "24351 Armada Dr.", "", "Dana Point", "CA", int(926291), int(50), 2007-7-30, 20))

IntegrityError: FOREIGN KEY constraint failed

Observations

We got an error!

This is because the FOREIGN KEY constraint was violated.

Lecture 17: Databases

Monday, November 6th 2017

Introduction

Why Learn Databases

More Database Motivation

What kind of data access do you need?

A Sampling of Database Genres

Relational Model

Example from CS109:

Key-Value Model

Document Model

Components to a database

Working with Databases

Relational Grammar of Data

SQLite

SQLite Basics

The Plan

The Essentials

Core Commands

Structural Commands

Starting a Database

Creating a First Table in the Database

Discussion

Adding Values

Discussion

Some Queries

Getting All Rows from a Table

Selecting Specifics

An Exercise

Another Table

Discussion

Discussion: FOREIGN KEY

Adding Some Data to Contributors Table

Observations

`SQLite`

`SQLite` Basics