SQL (Structured Query Language) is still very popular and will remain so for a long time. Hence, you will need to code SQL.SQL is the language used to query a relational database.dplyr and pandas require a similar knowledge-base.The answer depends on your problem and the resources you have available.
| Database Genre | Examples |
|---|---|
| relational | SQL and its derivatives |
| document oriented | MongoDB, CouchDB |
| key-value | Riak, Memcached, leveldb |
| graph oriented | Neo4J |
| columnar | HBase |
Pandas is a library for Python that allows users to work with data structures and relational databases.dplyr package offers a bunch of data manipulation tools including those for working with relational databases with the R programming lanuage.The dplyr_pandas notebook by Tom Augspurger contains a table comparing dplyr and pandas. The following table is a modification to that table:
| VERB | dplyr | pandas | SQL |
|---|---|---|---|
| QUERY/SELECTION | filter() (and slice()) | query() (and loc[], iloc[]) | SELECT WHERE |
| SORT | arrange() | sort() | ORDER BY |
| SELECT-COLUMNS/PROJECTION | select() (and rename()) | [](__getitem__) (and rename()) | SELECT COLUMN |
| SELECT-DISTINCT | distinct() | unique(),drop_duplicates() | SELECT DISTINCT COLUMN |
| ASSIGN | mutate() (and transmute()) | assign | ALTER/UPDATE |
| AGGREGATE | summarise() | describe(), mean(), max() | None, AVG(),MAX() |
| SAMPLE | sample_n() and sample_frac() | sample() | implementation dep, use RAND() |
| GROUP-AGG | group_by/summarize | groupby/agg, count, mean | GROUP BY |
| DELETE | ? | drop/masking | DELETE/WHERE |
NoSQL databases are gaining in popularity. However, we will stick with traditional relational databases in the course.
SQL (Structured Query Language).SQL has a long history. Because of this (or in spite of it), there are many version of SQL available today.SQLite. Here are some great references:
SQLite if you need to: https://www.sqlite.org/download.html.SQLite browser useful: http://sqlitebrowser.org.SQLite CLI:, SQLite CLIsqlite3 package in Python.SQLite commands in Python to build and manipulate our database.SQL database and work up from there.pandas to make our lives easier.SQLite commands to get the basics down.We begin by importing sqlite3 into Python.
In [1]:
import sqlite3
In [2]:
db = sqlite3.connect('test_db.sqlite')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS candidates") # Convenient in case you want to start over
cursor.execute("DROP TABLE IF EXISTS contributors") # Convenient in case you want to start over
Out[2]:
In [3]:
# This strange command turns out FOREIGN KEY support in SQLite.
# By default, this support is off.
# We'll talk about foreign keys later on. Just setting things up here.
cursor.execute("PRAGMA foreign_keys=1")
Out[3]:
We have a connection to a database, but that database doesn't have a table in it yet.
We'll create a table for the candidates with the following columns: id, first_name, last_name, middle_initial, and party.
We require id to be of type integer and the rest to be a string.
In [4]:
cursor.execute('''CREATE TABLE candidates (
id INTEGER PRIMARY KEY NOT NULL,
first_name TEXT,
last_name TEXT,
middle_init TEXT,
party TEXT NOT NULL)''')
db.commit() # Commit changes to the database
What did we just do? Some of the commands are obvious (like CREATE TABLE) but others are not (like PRIMARY KEY).
cursor.execute() runs the SQLite command, which we pass in as a string.id column:PRIMARY KEY. This means that those field values are unique and cannot have NULL values. A table can only have one PRIMARY KEY.NULL values.first_name, last_name, middle_init are all TEXT fields of unlimited length.SQL has other types such as VARCHAR(N) and CHAR(N). VARCHAR(N) allows variable text lengths up to N characters in length and CHAR(N) expects text of exactly N characters.party is also entered as TEXT and cannot have NULL values.We have followed a convention wherein SQL commands are issued in capital letters and table fields are written in lowercase text.
NOTE: Always commit changes to your database! If you don't, you will lose them when you close the database.
In [5]:
cursor.execute('''INSERT INTO candidates
(id, first_name, last_name, middle_init, party)
VALUES (?, ?, ?, ?, ?)''',
(16, "Mike", "Huckabee", "", "R"))
db.commit()
INSERT command in order to insert some values into the candidates table.candidates table into the columns (...) the values (?,?,?,?,?)"Python's string formatting to insert parameters into the execute() method.SQL from being injected into a query.
In [6]:
cursor.execute('''INSERT INTO candidates
(id, first_name, last_name, middle_init, party)
VALUES (?, ?, ?, ?, ?)''',
(34, "Hillary", "Clinton", "R.", "D"))
db.commit()
In [7]:
cursor.execute("SELECT * FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
In [8]:
cursor.execute("SELECT * FROM candidates WHERE first_name = 'Joseph'")
all_rows = cursor.fetchall()
print(all_rows)
In [9]:
cursor.execute("SELECT first_name FROM candidates")
all_rows = cursor.fetchall()
print(all_rows)
In [10]:
cursor.execute('''CREATE TABLE contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
last_name TEXT,
first_name TEXT,
middle_name TEXT,
street_1 TEXT,
street_2 TEXT,
city TEXT,
state TEXT,
zip TEXT,
amount INTEGER,
date DATETIME,
candidate_id INTEGER NOT NULL,
FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
db.commit()
The creation of our new table includes a few new commands:
AUTOINCREMENT: The id for a new entry to the table will be automatically generated. No need to enter this explicitly. The increment starts at 1.DATETIME: Gives the date in the format YYYY-MM-DD HH:MM:SSFOREIGN KEY: This allows us to link the two tables (candidates and contributors). More below.FOREIGN KEY in the second table and it references a unique key in the first table. Often, this unique key is the PRIMARY KEY.So, in our example, we create a new field in the contributors table called candidate_id. We declare it to be a FOREIGN KEY that references the id field in the candidates table.
Let's add a contributor.
In [11]:
contributors = [("Agee", "Steven", "", "549 Laurel Branch Road", "", "Floyd", "VA", int(24091), int(500), 2007-6-30, 16),
("Aaron", "Carole", "", "PO Box 1806", "", "Ogunqui", "ME", int(3907), int(70), 2008-2-7, 34)]
cursor.executemany('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', contributors)
db.commit()
In [12]:
cursor.execute("SELECT last_name FROM contributors where amount < 100")
for c in cursor.fetchall():
print(c)
Now let's try to add a contributor who contributed to a candidate who's id is not in the candidates table.
In [13]:
cursor.execute('INSERT INTO contributors (last_name, first_name, middle_name, street_1, street_2, city, state, zip, amount, date, candidate_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', ("Buckler", "Steve", "", "24351 Armada Dr.", "", "Dana Point", "CA", int(926291), int(50), 2007-7-30, 20))