In [1]:
    
import sqlite3
import numpy as np
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
db = sqlite3.connect('L19DB_demo.sqlite')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS candidates")
cursor.execute("DROP TABLE IF EXISTS contributors")
cursor.execute("PRAGMA foreign_keys=1")
cursor.execute('''CREATE TABLE candidates (
               id INTEGER PRIMARY KEY NOT NULL, 
               first_name TEXT, 
               last_name TEXT, 
               middle_init TEXT, 
               party TEXT NOT NULL)''')
db.commit() # Commit changes to the database
cursor.execute('''CREATE TABLE contributors (
          id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
          last_name TEXT, 
          first_name TEXT, 
          middle_name TEXT, 
          street_1 TEXT, 
          street_2 TEXT, 
          city TEXT, 
          state TEXT, 
          zip TEXT, 
          amount REAL, 
          date DATETIME, 
          candidate_id INTEGER NOT NULL, 
          FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
db.commit()
with open ("candidates.txt") as candidates:
    next(candidates) # jump over the header
    for line in candidates.readlines():
        cid, first_name, last_name, middle_name, party = line.strip().split('|')
        vals_to_insert = (int(cid), first_name, last_name, middle_name, party)
        cursor.execute('''INSERT INTO candidates 
                  (id, first_name, last_name, middle_init, party)
                  VALUES (?, ?, ?, ?, ?)''', vals_to_insert)
with open ("contributors.txt") as contributors:
    next(contributors)
    for line in contributors.readlines():
        cid, last_name, first_name, middle_name, street_1, street_2, \
            city, state, zip_code, amount, date, candidate_id = line.strip().split('|')
        vals_to_insert = (last_name, first_name, middle_name, street_1, street_2, 
                          city, state, int(zip_code), amount, date, candidate_id)
        cursor.execute('''INSERT INTO contributors (last_name, first_name, middle_name, 
                           street_1, street_2, city, state, zip, amount, date, candidate_id) 
                           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', vals_to_insert)
candidate_cols = [col[1] for col in cursor.execute("PRAGMA table_info(candidates)")]
contributor_cols = [col[1] for col in cursor.execute("PRAGMA table_info(contributors)")]
def viz_tables(cols, query):
    q = cursor.execute(query).fetchall()
    framelist = []
    for i, col_name in enumerate(cols):
        framelist.append((col_name, [col[i] for col in q]))
    # DataFrame.from_items was removed in pandas 1.0; a dict preserves the column order.
    return pd.DataFrame(dict(framelist))
    
Last time, you played with a bunch of SQLite commands to query and update the tables in the database.
One thing we didn't get to was how to query the contributors table based off of a query in the candidates table.  For example, suppose you want to query which contributors donated to Obama. You could use a nested SELECT statement to accomplish that.
In [2]:
    
# Standard SQL uses single quotes for string literals; double quotes are meant for identifiers.
query = '''SELECT * FROM contributors WHERE candidate_id = (SELECT id FROM candidates WHERE last_name = 'Obama')'''
viz_tables(contributor_cols, query)
    
    Out[2]:
The last example involved querying data from multiple tables.
In particular, we used a value from one table to filter rows in the other, related table (related through the FOREIGN KEY).
This leads to the idea of joining multiple tables together.  SQL has a set of commands to handle different types of joins.  Older versions of SQLite (before 3.39.0) do not support the full suite of join commands offered by SQL, but you should still be able to get the main ideas from the limited command set.
We'll begin with the INNER JOIN.
The idea here is that you will combine the tables wherever the values of certain columns are the same between the two tables.  In our example, we will join the two tables based on the candidate id.  The result of the INNER JOIN will be a new table consisting of the columns we requested and containing the common data.  Since every contributor row has a candidate_id that refers to an existing candidate, joining on the candidate id will not exclude any contributor rows.
Here are two tables. Table A has the form:
| nA | attr | idA | 
|---|---|---|
| s1 | 23 | 0 | 
| s2 | 7 | 2 | 
and table B has the form:
| nB | attr | idB | 
|---|---|---|
| t1 | 60 | 0 | 
| t2 | 14 | 7 | 
| t3 | 22 | 2 | 
Table A is associated with Table B through a foreign key on the id column.
If we join the two tables by comparing the id columns and selecting the nA, nB, and attr columns then we'll get
| nA | A.attr | nB | B.attr | 
|---|---|---|---|
| s1 | 23 | t1 | 60 | 
| s2 | 7 | t3 | 22 | 
The SQLite code to do this join would be
SELECT nA, A.attr, nB, B.attr FROM A INNER JOIN B ON B.idB = A.idA
Notice that the second row in table B is gone because its idB value (7) does not match any idA value in table A.
What is SQL doing with this operation?  It may help to visualize this with a Venn diagram.  Table A has rows with values corresponding to the idA attribute.  Table B has rows with values corresponding to the idB attribute.  The INNER JOIN will combine the two tables such that rows with common entries in the id attributes are included.  In a Venn diagram of the two sets of id values, the INNER JOIN corresponds to the intersection.
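The toy INNER JOIN can be reproduced in a throwaway database. This is a minimal sketch that assumes an in-memory SQLite database and leaves out the foreign key for brevity; the table and column names follow the toy example, and the ORDER BY is added only to make the row order predictable.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy tables A and B from the text (foreign key omitted for brevity).
cur.execute("CREATE TABLE A (nA TEXT, attr INTEGER, idA INTEGER)")
cur.execute("CREATE TABLE B (nB TEXT, attr INTEGER, idB INTEGER)")
cur.executemany("INSERT INTO A VALUES (?, ?, ?)",
                [("s1", 23, 0), ("s2", 7, 2)])
cur.executemany("INSERT INTO B VALUES (?, ?, ?)",
                [("t1", 60, 0), ("t2", 14, 7), ("t3", 22, 2)])

# Only rows whose id values appear in both tables survive the INNER JOIN.
inner_rows = cur.execute(
    """SELECT nA, A.attr, nB, B.attr
       FROM A INNER JOIN B ON B.idB = A.idA
       ORDER BY nA"""
).fetchall()
print(inner_rows)  # [('s1', 23, 't1', 60), ('s2', 7, 't3', 22)]
conn.close()
```

The row for t2 never appears because no row in A has an id of 7.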
- Using an INNER JOIN, join the candidates and contributors tables by comparing the candidate_id column of contributors with the id column of candidates.  Display your joined table with the columns contributors.last_name, contributors.first_name, and candidates.last_name.
- Add a WHERE clause to select a specific candidate's last name.
LEFT JOIN or LEFT OUTER JOIN
There are many ways to combine two tables.  We just explored one possibility in which we combined the tables based upon the intersection of the two tables (the INNER JOIN).
Now we'll talk about the LEFT JOIN or LEFT OUTER JOIN.
In words, the LEFT JOIN is combining the tables based upon what is in the intersection of the two tables and what is in the "reference" table.
We can consider our toy example in two guises:
Let's do a LEFT JOIN of table B from table A.  That is, we'd like to make a new table by putting table B into table A.  In this case, we'll consider table A our "reference" table.  We're comparing by the id column again.  We know that these two tables share ids 0 and 2 and table A doesn't have anything else in it.  The resulting table is:
| nA | A.attr | nB | B.attr | 
|---|---|---|---|
| s1 | 23 | t1 | 60 | 
| s2 | 7 | t3 | 22 | 
That's not very exciting.  It's the same result as from the INNER JOIN.  We can do another example that may be more enlightening.
Let's do a LEFT JOIN of table A from table B.  That is, we'd like to make a new table by putting table A into table B.  In this case, we'll consider table B our "reference" table.  Again, we use the id column for comparison.  We know that these two tables share ids 0 and 2.  This time, table B also contains the id 7, which is not shared by table A.  The resulting table is:
| nA | A.attr | nB | B.attr | 
|---|---|---|---|
| s1 | 23 | t1 | 60 | 
| None | NaN | t2 | 14 | 
| s2 | 7 | t3 | 22 | 
Notice that the missing entries were filled in for us (SQLite fills them with NULL, which shows up as None or NaN here).  This is necessary for completion of the requested join.
The SQLite commands to accomplish all of this are:
SELECT nA, A.attr, nB, B.attr FROM A LEFT JOIN B ON B.idB = A.idA
and
SELECT nA, A.attr, nB, B.attr FROM B LEFT JOIN A ON A.idA = B.idB
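Both LEFT JOIN statements can be run against the toy tables. The following is a sketch under the same assumptions as before (in-memory SQLite database, no foreign key); the variable names `a_ref` and `b_ref` are made up for illustration, and the ORDER BY clauses are added only to fix the row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Rebuild the toy tables A and B from the text.
cur.execute("CREATE TABLE A (nA TEXT, attr INTEGER, idA INTEGER)")
cur.execute("CREATE TABLE B (nB TEXT, attr INTEGER, idB INTEGER)")
cur.executemany("INSERT INTO A VALUES (?, ?, ?)",
                [("s1", 23, 0), ("s2", 7, 2)])
cur.executemany("INSERT INTO B VALUES (?, ?, ?)",
                [("t1", 60, 0), ("t2", 14, 7), ("t3", 22, 2)])

# Table A is the reference: every row of A is kept.
a_ref = cur.execute(
    """SELECT nA, A.attr, nB, B.attr
       FROM A LEFT JOIN B ON B.idB = A.idA
       ORDER BY nA"""
).fetchall()

# Table B is the reference: every row of B is kept, and the A columns
# are NULL (None in Python) where no match exists.
b_ref = cur.execute(
    """SELECT nA, A.attr, nB, B.attr
       FROM B LEFT JOIN A ON A.idA = B.idB
       ORDER BY nB"""
).fetchall()

print(a_ref)  # [('s1', 23, 't1', 60), ('s2', 7, 't3', 22)]
print(b_ref)  # [('s1', 23, 't1', 60), (None, None, 't2', 14), ('s2', 7, 't3', 22)]
conn.close()
```

The sqlite3 module returns NULL as None for every column type; the NaN in the table above comes from displaying the numeric column through pandas.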
In a Venn diagram, the LEFT JOIN corresponds to the entire reference (left) set, including its intersection with the other set.
Use the following two tables to do the first two exercises in this section. Table A has the form:
| nA | attr | idA | 
|---|---|---|
| s1 | 23 | 0 | 
| s2 | 7 | 2 | 
| s3 | 15 | 2 | 
| s4 | 31 | 0 | 
and table B has the form:
| nB | attr | idB | 
|---|---|---|
| t1 | 60 | 0 | 
| t2 | 14 | 7 | 
| t3 | 22 | 2 | 
- Do a LEFT JOIN using table A as the reference and the id columns for comparison.
- Do a LEFT JOIN using table B as the reference and the id columns for comparison.
- Using the candidates and contributors tables, create a table with the following form:
| average contribution | number of contributors | candidate last name | 
|---|---|---|
| ... | ... | ... | 
The table should be created using the LEFT JOIN clause on the contributors table, joining the candidates table on the candidate id.  The average contribution column and number of contributors column should be obtained using the AVG and COUNT SQL functions.  Finally, you should use the GROUP BY clause on the candidate's last name.
We've been working with databases for the last few lectures and learning SQLite commands to work with and manipulate the databases.  There is a Python package called pandas that provides broad support for data structures.  It can be used to interact with relational databases through its own methods and even through SQL commands.
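To see how AVG, COUNT, and GROUP BY interact with a LEFT JOIN without giving the exercise away, here is a sketch on the toy tables A and B from this section rather than on the real data. The in-memory database and the variable name `summary` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The four-row table A and three-row table B from the exercise above.
cur.execute("CREATE TABLE A (nA TEXT, attr INTEGER, idA INTEGER)")
cur.execute("CREATE TABLE B (nB TEXT, attr INTEGER, idB INTEGER)")
cur.executemany("INSERT INTO A VALUES (?, ?, ?)",
                [("s1", 23, 0), ("s2", 7, 2), ("s3", 15, 2), ("s4", 31, 0)])
cur.executemany("INSERT INTO B VALUES (?, ?, ?)",
                [("t1", 60, 0), ("t2", 14, 7), ("t3", 22, 2)])

# One output row per row of B: the average A.attr and the number of
# matching A rows. COUNT(A.nA) skips NULLs, so unmatched rows count as 0.
summary = cur.execute(
    """SELECT B.nB, AVG(A.attr), COUNT(A.nA)
       FROM B LEFT JOIN A ON A.idA = B.idB
       GROUP BY B.nB
       ORDER BY B.nB"""
).fetchall()
print(summary)  # [('t1', 27.0, 2), ('t2', None, 0), ('t3', 11.0, 2)]
conn.close()
```

Counting a column from the non-reference table, rather than using COUNT(*), is what keeps the unmatched group at zero instead of one.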
In the last part of this lecture, you will get to redo a bunch of the database exercises using pandas.
We won't be able to cover pandas from the ground up, but it's a well-documented library and is fairly easy to get up and running.  Here's the website:  pandas.
In [3]:
    
# Using pandas naming convention
dfcand = pd.read_csv("candidates.txt", sep="|")
dfcand
    
    Out[3]:
In [4]:
    
dfcontr = pd.read_csv("contributors.txt", sep="|")
dfcontr
    
    Out[4]:
Reading things in is quite easy with pandas.
Notice that pandas populates empty fields with NaN values.
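The NaN behavior is easy to check on a tiny input. This sketch uses a made-up pipe-delimited sample held in memory (the names and values are hypothetical and are not taken from candidates.txt).

```python
import io
import pandas as pd

# Hypothetical sample data; the second field of the first row is empty.
sample = io.StringIO(
    "id|first_name|last_name|middle_name|party\n"
    "1|Ada|Lovelace||D\n"
    "2|Alan|Turing|M|R\n"
)

df_sample = pd.read_csv(sample, sep="|")
print(df_sample)

# The empty middle_name field was read in as NaN.
print(df_sample["middle_name"].isna().tolist())  # [True, False]
```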
The id column in the contributors dataset is superfluous.  Let's delete it.
In [5]:
    
del dfcontr['id']
dfcontr.head()
    
    Out[5]:
Very nice!  And we used the head method to print out the first five rows.
In [6]:
    
dbp = sqlite3.connect('L19_pandas_DB.sqlite')
csr = dbp.cursor()
csr.execute("DROP TABLE IF EXISTS candidates")
csr.execute("DROP TABLE IF EXISTS contributors")
csr.execute("PRAGMA foreign_keys=1")
csr.execute('''CREATE TABLE candidates (
               id INTEGER PRIMARY KEY NOT NULL, 
               first_name TEXT, 
               last_name TEXT, 
               middle_name TEXT, 
               party TEXT NOT NULL)''')
dbp.commit() # Commit changes to the database
csr.execute('''CREATE TABLE contributors (
          id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
          last_name TEXT, 
          first_name TEXT, 
          middle_name TEXT, 
          street_1 TEXT, 
          street_2 TEXT, 
          city TEXT, 
          state TEXT, 
          zip TEXT, 
          amount REAL, 
          date DATETIME, 
          candidate_id INTEGER NOT NULL, 
          FOREIGN KEY(candidate_id) REFERENCES candidates(id))''')
dbp.commit()
    
Last time, we opened the data files with Python and then manually used SQLite commands to populate the individual tables.  We can use pandas instead like so.
In [7]:
    
dfcand.to_sql("candidates", dbp, if_exists="append", index=False)
    
How big is our table?
In [8]:
    
dfcand.shape
    
    Out[8]:
We can visualize the data in our pandas-populated table.  No surprises here except that pandas did everything for us.
In [9]:
    
query = '''SELECT * FROM candidates'''
csr.execute(query).fetchall()
    
    Out[9]:
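pandas can also read the table back for us, which gives a DataFrame with column names instead of a list of tuples. A minimal sketch, assuming an in-memory database and a small made-up frame (`df_small` and its rows are hypothetical):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical rows standing in for the candidates data.
df_small = pd.DataFrame({
    "id": [1, 2],
    "last_name": ["Lovelace", "Turing"],
    "party": ["D", "R"],
})

# Write the frame to SQLite; the table is created if it does not exist yet.
df_small.to_sql("candidates", conn, if_exists="append", index=False)

# Read it back with a plain SQL query.
df_back = pd.read_sql_query("SELECT * FROM candidates", conn)
print(df_back)
conn.close()
```

In this sketch the table does not exist beforehand, so `to_sql` creates it and infers the column types; in the lecture code the table was created first, so the declared schema is kept.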
In [10]:
    
dfcand.query("first_name=='Mike' & party=='D'")
    
    Out[10]:
In [11]:
    
dfcand[(dfcand.first_name=="Mike") & (dfcand.party=="D")]
    
    Out[11]:
In [12]:
    
dfcand[dfcand.middle_name.notnull()]
    
    Out[12]:
In [13]:
    
dfcand[dfcand.first_name.isin(['Mike', 'Hillary'])]
    
    Out[13]:
Use pandas to populate the contributors table.
In [14]:
    
dfcand.sort_values(by='party')
    
    Out[14]:
In [15]:
    
dfcand.sort_values(by='party', ascending=False)
    
    Out[15]:
In [16]:
    
dfcand[['last_name', 'party']]
    
    Out[16]:
In [17]:
    
dfcand[['last_name', 'party']].count()
    
    Out[17]:
In [18]:
    
dfcand[['first_name']].drop_duplicates()
    
    Out[18]:
In [19]:
    
dfcand[['first_name']].drop_duplicates().count()
    
    Out[19]:
Creating a new column is quite easy with pandas.
In [20]:
    
dfcand['name'] = dfcand['last_name'] + ", " + dfcand['first_name']
dfcand
    
    Out[20]:
We can change an existing field as well.
In [21]:
    
dfcand.loc[dfcand.first_name == "Mike", "name"]
    
    Out[21]:
In [22]:
    
dfcand.loc[dfcand.first_name == "Mike", "name"] = "Mikey"
    
In [23]:
    
dfcand.query("first_name == 'Mike'")
    
    Out[23]:
You may recall that SQLite has historically lacked the functionality to drop a column (ALTER TABLE ... DROP COLUMN only arrived in version 3.35.0).  It's a one-liner with pandas.
In [24]:
    
del dfcand['name']
dfcand
    
    Out[24]:
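If you would rather not modify the DataFrame in place, `drop` returns a new frame and leaves the original alone. A small sketch with made-up data (`df_demo` and its contents are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first_name": ["Ada", "Alan"],
    "last_name": ["Lovelace", "Turing"],
})
df_demo["name"] = df_demo["last_name"] + ", " + df_demo["first_name"]

# drop returns a copy without the column; df_demo itself is unchanged.
df_without = df_demo.drop(columns=["name"])

print(df_demo.columns.tolist())     # ['first_name', 'last_name', 'name']
print(df_without.columns.tolist())  # ['first_name', 'last_name']
```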
In [25]:
    
dfcand.describe()
    
    Out[25]:
It's not very interesting with the candidates table because the candidates table only has one numeric column.
I'll use the contributors table to do some demos now.
In [26]:
    
dfcontr.amount.max()
    
    Out[26]:
In [27]:
    
dfcontr[dfcontr.amount==dfcontr.amount.max()]
    
    Out[27]:
In [28]:
    
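# Restrict the sum to numeric columns; recent pandas versions no longer drop text columns silently.
dfcontr.groupby("state").sum(numeric_only=True)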
    
    Out[28]:
In [29]:
    
dfcontr.groupby("state")["amount"].sum()
    
    Out[29]:
In [30]:
    
dfcontr.state.unique()
    
    Out[30]:
There is also a version of the LIMIT clause.  It's very intuitive with pandas.
In [31]:
    
dfcand[0:3]
    
    Out[31]:
The usual Python slicing works just fine!
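Slicing also covers the LIMIT ... OFFSET pattern. This sketch uses a made-up five-row frame (`df_demo` is hypothetical) and shows `head` and `iloc` as alternatives to plain slicing.

```python
import pandas as pd

df_demo = pd.DataFrame({"id": [10, 20, 30, 40, 50]})

# LIMIT 3: plain slicing and head(3) select the same rows.
first_three = df_demo[0:3]
also_first_three = df_demo.head(3)

# LIMIT 2 OFFSET 3: positional slicing with iloc.
next_two = df_demo.iloc[3:5]

print(first_three["id"].tolist())  # [10, 20, 30]
print(next_two["id"].tolist())     # [40, 50]
```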
pandas has some documentation on joins:  Merge, join, and concatenate.  If you want some more reinforcement on the concepts from earlier regarding JOIN, then the pandas documentation may be a good place to get it.
You may also be interested in a comparison with SQL.
To do joins with pandas, we use the merge command.
Here's an example of an explicit inner join:
In [32]:
    
cols_wanted = ['last_name_x', 'first_name_x', 'candidate_id', 'id', 'last_name_y']
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id")[cols_wanted]
    
    Out[32]:
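The `_x` and `_y` suffixes in `cols_wanted` come from `merge` itself: when both frames have a column with the same name, pandas renames the left one with `_x` and the right one with `_y` by default. Here is a sketch using the toy tables A and B from the SQL section, rebuilt as DataFrames (the variable names are chosen for illustration):

```python
import pandas as pd

# Toy tables A and B from the SQL section.
A = pd.DataFrame({"nA": ["s1", "s2"], "attr": [23, 7], "idA": [0, 2]})
B = pd.DataFrame({"nB": ["t1", "t2", "t3"], "attr": [60, 14, 22], "idB": [0, 7, 2]})

# Inner join on the id columns; both frames have an 'attr' column,
# so the result contains attr_x (from A) and attr_y (from B).
inner = A.merge(B, left_on="idA", right_on="idB", how="inner")

print(inner.columns.tolist())
# ['nA', 'attr_x', 'idA', 'nB', 'attr_y', 'idB']
print(inner[["nA", "attr_x", "nB", "attr_y"]])
```

The default for `how` is already "inner", so spelling it out here is only for readability.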
In [33]:
    
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id")[cols_wanted].groupby('last_name_y').describe()
    
    Out[33]:
Other joins with pandas
We didn't cover all possible joins because older versions of SQLite (before 3.39.0) can only handle the few that we did discuss.  As mentioned, there are workarounds for some things in SQLite, but not everything.  Fortunately, pandas can handle pretty much everything.  Here are a few joins that pandas can handle:
- LEFT OUTER (already discussed)
- RIGHT OUTER: Think of the "opposite" of a LEFT OUTER join (shade the intersection and right set in the Venn diagram).
- FULL OUTER: Combine everything from both tables (shade the entire Venn diagram).
In [34]:
    
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id", how="left")[cols_wanted]
    
    Out[34]:
In [36]:
    
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id", how="right")[cols_wanted]
    
    Out[36]:
In [37]:
    
dfcontr.merge(dfcand, left_on="candidate_id", right_on="id", how="outer")[cols_wanted]
    
    Out[37]:
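The differences between these joins are easier to see on the toy tables A and B from the SQL section. The sketch below rebuilds them as DataFrames and uses the `indicator=True` option of `merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

A = pd.DataFrame({"nA": ["s1", "s2"], "attr": [23, 7], "idA": [0, 2]})
B = pd.DataFrame({"nB": ["t1", "t2", "t3"], "attr": [60, 14, 22], "idB": [0, 7, 2]})

left = A.merge(B, left_on="idA", right_on="idB", how="left")    # keep all of A
right = A.merge(B, left_on="idA", right_on="idB", how="right")  # keep all of B
outer = A.merge(B, left_on="idA", right_on="idB", how="outer",
                indicator=True)                                 # keep everything

print(len(left), len(right), len(outer))  # 2 3 3

# The row for t2 has no partner in A, so its A columns are NaN.
print(right[["nA", "nB"]])

# _merge is 'both', 'left_only', or 'right_only' for each row.
print(outer[["nA", "nB", "_merge"]])
```

With these particular tables the right and full outer joins contain the same rows, because every row of A has a match in B.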