SQL
(Structured Query Language) is still very popular and will remain so for a long time. Hence, you will need to code SQL
.SQL
is the language used to query a relational database.dplyr
and pandas
require a similar knowledge-base.The answer depends on your problem and the resources you have available.
Database Genre | Examples |
---|---|
relational | SQL and its derivatives |
document oriented | MongoDB, CouchDB |
key-value | Riak, Memcached, leveldb |
graph oriented | Neo4J |
columnar | HBase |
Pandas
is a library for Python
that allows users to work with data structures and relational databases.dplyr
package offers a bunch of data manipulation tools including those for working with relational databases with the R
programming lanuage.The dplyr_pandas
notebook by Tom Augspurger contains a table comparing dplyr
and pandas
. The following table is a modification to that table:
VERB | dplyr | pandas | SQL |
---|---|---|---|
QUERY/SELECTION | filter() (and slice()) | query() (and loc[], iloc[]) | SELECT WHERE |
SORT | arrange() | sort() | ORDER BY |
SELECT-COLUMNS/PROJECTION | select() (and rename()) | [](__getitem__) (and rename()) | SELECT COLUMN |
SELECT-DISTINCT | distinct() | unique(),drop_duplicates() | SELECT DISTINCT COLUMN |
ASSIGN | mutate() (and transmute()) | assign | ALTER/UPDATE |
AGGREGATE | summarise() | describe(), mean(), max() | None, AVG(),MAX() |
SAMPLE | sample_n() and sample_frac() | sample() | implementation dep, use RAND() |
GROUP-AGG | group_by/summarize | groupby/agg, count, mean | GROUP BY |
DELETE | ? | drop/masking | DELETE/WHERE |
NoSQL
databases are gaining in popularity. However, we will stick with traditional relational databases in the course.
SQL
(Structured Query Language).SQL
has a long history. Because of this (or in spite of it), there are many version of SQL
available today.SQLite
. Here are some great references:
SQLite
if you need to: https://www.sqlite.org/download.html.SQLite
browser useful: http://sqlitebrowser.org.SQLite
CLI:, SQLite
CLIsqlite3
package in Python
.SQLite
commands in Python
to build and manipulate our database.SQL
database and work up from there.pandas
to make our lives easier.SQLite
commands to get the basics down.We begin by importing sqlite3
into Python
.
In [1]:
import sqlite3
In [2]:
db = sqlite3.connect('test_db.sqlite')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS candidates") # Convenient in case you want to start over
cursor.execute("DROP TABLE IF EXISTS contributors") # Convenient in case you want to start over
Out[2]:
In [3]:
# This strange command turns out FOREIGN KEY support in SQLite.
# By default, this support is off.
# We'll talk about foreign keys later on. Just setting things up here.
cursor.execute("PRAGMA foreign_keys=1")
Out[3]:
We have a connection to a database, but that database doesn't have a table in it yet.
We'll create a table for the candidates with the following columns: id
, first_name
, last_name
, middle_initial
, and party
.
We require id
to be of type integer and the rest to be a string.
In [4]:
cursor.execute('''CREATE TABLE candidates (
id INTEGER PRIMARY KEY NOT NULL,
first_name TEXT,
last_name TEXT,
middle_init TEXT,
party TEXT NOT NULL)''')
db.commit() # Commit changes to the database
What did we just do? Some of the commands are obvious (like CREATE TABLE
) but others are not (like PRIMARY KEY
).
cursor.execute()
runs the SQLite
command, which we pass in as a string.id
column:PRIMARY KEY
. This means that those field values are unique and cannot have NULL
values. A table can only have one PRIMARY KEY
.NULL
values.first_name
, last_name
, middle_init
are all TEXT
fields of unlimited length.SQL
has other types such as VARCHAR(N)
and CHAR(N)
. VARCHAR(N)
allows variable text lengths up to N
characters in length and CHAR(N)
expects text of exactly N
characters.party
is also entered as TEXT
and cannot have NULL
values.We have followed a convention wherein SQL
commands are issued in capital letters and table fields are written in lowercase text.
NOTE: Always commit changes to your database! If you don't, you will lose them when you close the database.
In [5]:
cursor.execute('''INSERT INTO candidates
(id, first_name, last_name, middle_init, party)
VALUES (?, ?, ?, ?, ?)''',
(16, "Mike", "Huckabee", "", "R"))
db.commit()
INSERT
command in order to insert some values into the candidates
table.candidates
table into the columns (...) the values (?,?,?,?,?)"Python's
string formatting to insert parameters into the execute()
method.SQL
from being injected into a query.
In [6]:
cursor.execute('''INSERT INTO candidates
(id, first_name, last_name, middle_init, party)
VALUES (?, ?, ?, ?, ?)''',
(34, "Hillary", "Clinton", "R.", "D"))
db.commit()
In [7]:
cursor.execute("SELECT * FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
In [8]:
cursor.execute("SELECT * FROM candidates WHERE first_name = 'Joseph'")
all_rows = cursor.fetchall()
print(all_rows)
In [9]:
cursor.execute("SELECT first_name FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
In [10]:
cursor.execute('''CREATE TABLE contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
last_name TEXT,
first_name TEXT,
middle_name TEXT,
street_1 TEXT,
street_2 TEXT,
city TEXT,
state TEXT,
zip TEXT,
amount INTEGER,
date DATETIME,
candidate_id INTEGER NOT NULL,
FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
db.commit()
The creation of our new table includes a few new commands:
AUTOINCREMENT
: The id
for a new entry to the table will be automatically generated. No need to enter this explicitly. The increment starts at 1.DATETIME
: Gives the date in the format YYYY-MM-DD HH:MM:SSFOREIGN KEY
: This allows us to link the two tables (candidates and contributors). More below.FOREIGN KEY
in the second table and it references a unique key in the first table. Often, this unique key is the PRIMARY KEY
.So, in our example, we create a new field in the contributors table called candidate_id
. We declare it to be a FOREIGN KEY
that references the id
field in the candidates table.
Let's add a contributor.
In [11]:
contributors = [("Agee", "Steven", "", "549 Laurel Branch Road", "", "Floyd", "VA", int(24091), int(500), 2007-6-30, 16),
("Aaron", "Carole", "", "PO Box 1806", "", "Ogunqui", "ME", int(3907), int(70), 2008-2-7, 34)]
cursor.executemany('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', contributors)
db.commit()
In [12]:
cursor.execute("SELECT last_name FROM contributors where amount < 100")
for c in cursor.fetchall():
print(c)
Now let's try to add a contributor who contributed to a candidate who's id is not in the candidates table.
In [13]:
cursor.execute('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', ("Buckler", "Steve", "", "24351 Armada Dr.", "", "Dana Point", "CA", int(926291), int(50), 2007-7-30, 20))