SQL Introduction

Using the Titanic dataset, perform the following exercises.

1 - Import the data into SQLite, removing the index column.

.mode csv
.import Titanic.csv titanic
CREATE TABLE temp AS SELECT Name, PClass, Age, Sex, Survived, SexCode FROM titanic;
DROP TABLE titanic;
ALTER TABLE temp RENAME TO titanic;

2 - Run a query that returns the number of passengers in each class under the age of 16.

SELECT PClass, COUNT(*)
FROM titanic
WHERE Age <= 16
GROUP BY PClass;

3 - Run a query that returns the average age of those who died (Survived = 0) and those who survived for male and female separately as well as the counts for each group.

SELECT Sex, Survived, AVG(Age) AS Age_Average, COUNT(*) AS Passengers_count
FROM titanic
GROUP BY Sex, Survived;

4 - Run a query that returns the portion of survivors in each class.

SELECT PClass, SUM(Survived) / CAST(COUNT(*) AS float) AS Survivor_portion
FROM titanic
GROUP BY PClass;

5 - Run a query that returns the portion of females and males in each class who survived with an age greater than the average.

Not really clear what is the desired output, let's try this:

SELECT PClass, Sex ,SUM(Survived) / CAST(COUNT(*) AS float) AS Survivor_portion
FROM titanic
WHERE Age > (SELECT AVG(Age) FROM titanic)
GROUP BY PClass, Sex;

6 - Download the USArrest and road accident data and perform the following.

  1. Import the data into your SQLite environment.
  2. What column should the data be joined on? After identifying it, JOIN on that column.
  3. Perform a LEFT_JOIN on the column identified in part 2.
  4. Are the results of your joins useful? If not, what would be required to fix them? (Don't actually do it!).
.import USArrests.csv usarrests
.import road.csv road
SELECT * FROM usarrests AS arr INNER JOIN road AS ro ON ro.State = arr.State;

SELECT * FROM usarrests AS arr LEFT JOIN road AS ro ON ro.State = arr.State;

The results are not useful for two reasons:

  1. the road dataset does not contain all US states
  2. the state are registered in different ways between the two dataset

To solve the latter problem we could modify the state names in order to have them match each other, to solve the former we need to get the missing data somewhere.

SQLite3 in Python


In [1]:
import sqlite3

sqlite_db = './myDB.db'

1 - Add a new column, of type char, called class, to the mtcars data set, containing low, med or high for cars with mpg < 15, mpg < 20, and mpg > 20 respectively. Be sure to properly open, commit, and close the connection before proceding.


In [8]:
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

c.execute('ALTER TABLE mtcars ADD class char;')
c.execute("UPDATE mtcars SET class = CASE WHEN mpg < 15 THEN 'low' WHEN mpg < 20 THEN 'med' ELSE 'high' END;")

conn.commit()
conn.close()

2 - What if you wanted to select cars with both class = 'high' or class = 'low' using the ? place holder like above? Consult the sqlite3 documentation and write a query, using ?, doing just that and print your results.


In [12]:
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

cl = ['high', 'low']
results = c.execute("SELECT * FROM mtcars WHERE class IN (?, ?)", cl)
for row in results:
    print(row)


('Mazda RX4', '21', '6', '160', '110', '3.9', '2.62', '16.46', '0', '1', '4', '4', 'high')
('Mazda RX4 Wag', '21', '6', '160', '110', '3.9', '2.875', '17.02', '0', '1', '4', '4', 'high')
('Datsun 710', '22.8', '4', '108', '93', '3.85', '2.32', '18.61', '1', '1', '4', '1', 'high')
('Hornet 4 Drive', '21.4', '6', '258', '110', '3.08', '3.215', '19.44', '1', '0', '3', '1', 'high')
('Duster 360', '14.3', '8', '360', '245', '3.21', '3.57', '15.84', '0', '0', '3', '4', 'low')
('Merc 240D', '24.4', '4', '146.7', '62', '3.69', '3.19', '20', '1', '0', '4', '2', 'high')
('Merc 230', '22.8', '4', '140.8', '95', '3.92', '3.15', '22.9', '1', '0', '4', '2', 'high')
('Cadillac Fleetwood', '10.4', '8', '472', '205', '2.93', '5.25', '17.98', '0', '0', '3', '4', 'low')
('Lincoln Continental', '10.4', '8', '460', '215', '3', '5.424', '17.82', '0', '0', '3', '4', 'low')
('Chrysler Imperial', '14.7', '8', '440', '230', '3.23', '5.345', '17.42', '0', '0', '3', '4', 'low')
('Fiat 128', '32.4', '4', '78.7', '66', '4.08', '2.2', '19.47', '1', '1', '4', '1', 'high')
('Honda Civic', '30.4', '4', '75.7', '52', '4.93', '1.615', '18.52', '1', '1', '4', '2', 'high')
('Toyota Corolla', '33.9', '4', '71.1', '65', '4.22', '1.835', '19.9', '1', '1', '4', '1', 'high')
('Toyota Corona', '21.5', '4', '120.1', '97', '3.7', '2.465', '20.01', '1', '0', '3', '1', 'high')
('Camaro Z28', '13.3', '8', '350', '245', '3.73', '3.84', '15.41', '0', '0', '3', '4', 'low')
('Fiat X1-9', '27.3', '4', '79', '66', '4.08', '1.935', '18.9', '1', '1', '4', '1', 'high')
('Porsche 914-2', '26', '4', '120.3', '91', '4.43', '2.14', '16.7', '0', '1', '5', '2', 'high')
('Lotus Europa', '30.4', '4', '95.1', '113', '3.77', '1.513', '16.9', '1', '1', '5', '2', 'high')
('Volvo 142E', '21.4', '4', '121', '109', '4.11', '2.78', '18.6', '1', '1', '4', '2', 'high')

3 - Like we did above, create a copy of the mtcars table, and insert 3 new cars into the table, using only a subset of the columns, with proper error checking using a try statement, verifying that they were properly inserted before commiting your changes.


In [13]:
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

c.execute('CREATE TABLE mtcars_copy AS SELECT * FROM mtcars;')

conn.commit()
conn.close()

In [19]:
import sys

conn = None

new_data = [('Panda','21','6'),
           ('Punto','15','5'),
           ('Uno','17','4')]

try:
    conn = sqlite3.connect(sqlite_db)
    c = conn.cursor()

    c.executemany('INSERT INTO mtcars_copy VALUES(?, ?, ?)', new_data)
    #c.executemany('INSERT INTO mtcars_copy(name, mpg, cyl) VALUES(?, ?, ?)', new_data) # correct statement
    
    conn.commit()
    conn.close()

except sqlite3.Error as e:
    print('Something went wrong:', e.args[0])

finally:
    if conn:
        conn.close()

4 - Create a Pandas dataframe from the mtcars dataset, containing only cars with 6 cylinder engines and then create a new table in the database from this dataframe. Verify that all operations worked.


In [25]:
import pandas as pd

conn = sqlite3.connect(sqlite_db)

mtcars = pd.read_sql_query('SELECT * FROM mtcars', conn)

conn.close()

mtcars.head()


Out[25]:
name mpg cyl disp hp drat wt qsec vs am gear carb class
0 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4 high
1 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 high
2 Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 high
3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 high
4 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 med

In [35]:
conn = sqlite3.connect(sqlite_db)

mtcars_6cyl = mtcars[mtcars['cyl'] == '6']
mtcars_6cyl.to_sql('mtcars_6cyl', conn)
pd.read_sql_query('SELECT * FROM mtcars_6cyl', conn)


Out[35]:
index name mpg cyl disp hp drat wt qsec vs am gear carb class
0 0 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4 high
1 1 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 high
2 3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 high
3 5 Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 med
4 9 Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 med
5 10 Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 med
6 29 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 med

In [36]:
conn.close()

Postgres and SQLAlchemy

1 - Now that the mtcars data is in our Postgres Database, all we need is the Titanic data. Do the same as above, bringing in the Titanic.csv we downloaded earlier. Hint: The data contains some NA values. You could just import the column as a character, but there is a proper way to do this...

2 - Print out the metadata for the columns of the mtcars table, including the data types for each column.


In [1]:
import sqlalchemy

def connect(user, password, db, schema, host='localhost', port = 5432):
    """Connects SQLAlchemy to a Postgres database
        
    Args:
        user: Postgres user
        password: User password
        db: Database name
        schema: Schema name
        host: Host default as localhost
        port: Postgres port, default 5432
            
    Returns:
        A connection and metadata object
    """
    # Generate the URL for our database
    url = 'postgresql://{}:{}@{}:{}/{}'
    url = url.format(user, password, host, port, db)

    # Get the connection object
    con = sqlalchemy.create_engine(url, client_encoding = 'utf8')

    # Get the metadata object
    meta = sqlalchemy.MetaData(bind = con)
    meta.reflect(bind = con, schema = schema)

    return con, meta

In [2]:
con, meta = connect(user = 'postgres', password = 'password', schema = 'tutorial', db = 'postgres')

In [3]:
for col in meta.tables['tutorial.mtcars'].c:
    print('{} is of type {}'.format(col, col.type))


mtcars.name is of type VARCHAR
mtcars.mpg is of type NUMERIC
mtcars.cyl is of type INTEGER
mtcars.disp is of type NUMERIC
mtcars.hp is of type NUMERIC
mtcars.drat is of type NUMERIC
mtcars.wt is of type NUMERIC
mtcars.qsec is of type NUMERIC
mtcars.vs is of type INTEGER
mtcars.am is of type INTEGER
mtcars.gear is of type INTEGER
mtcars.carb is of type INTEGER

3 - Write a query to return the average mpg for all rows with a hp greater than or equal to 115, grouped by cyl.


In [4]:
from sqlalchemy.sql import text

s = text('SELECT cyl, AVG(mpg) FROM tutorial.mtcars WHERE hp >= 115 GROUP BY cyl')
results = con.execute(s)

for row in results:
    print(row)


(8, Decimal('15.1000000000000000'))
(6, Decimal('18.9000000000000000'))

In [5]:
from sqlalchemy.sql import select, func

t = meta.tables['tutorial.mtcars']
s = select([t.c.cyl, func.avg(t.c.mpg)]).group_by(t.c.cyl)

results = con.execute(s.where(t.c.hp >= 115))

for row in results:
    print(row)


(8, Decimal('15.1000000000000000'))
(6, Decimal('18.9000000000000000'))

4 - Go to the R Dataset Repository and download the Iraq Vote and SAT datasets. After you have them, connect to your database using SQLAlchemy, insert the data, and verify that it worked correctly using SQLAlchemy.


In [6]:
from sqlalchemy import Table, Column, Integer, Numeric, Boolean, String, ForeignKey

In [7]:
SAT = Table('SAT', meta,
            Column('state', String, primary_key = True),
            Column('expend', Numeric),
            Column('ratio', Numeric),
            Column('salary', Numeric),
            Column('frac', Integer),
            Column('verbal', Integer),
            Column('math', Integer),
            Column('sat', Integer)
)

IraqVote = Table('IraqVote', meta,
                Column('id', Integer, primary_key = True),
                Column('y', Integer),
                Column('state_abb', String),
                Column('name', String),
                Column('rep', Boolean),
                Column('state_name', String, ForeignKey('SAT.state')),
                Column('gorevote', Numeric)
)

meta.create_all(con)

In [8]:
import pandas as pd

SAT_data = pd.read_csv('SAT.csv')

rows = []

for i in range(len(SAT_data)):
    rows.append({'state': SAT_data.iloc[i, 1],
           'expend': round(float(SAT_data.iloc[i, 2]), 3),
           'ratio': round(float(SAT_data.iloc[i, 3]), 1),
           'salary': round(float(SAT_data.iloc[i, 4]), 3),
           'frac': int(SAT_data.iloc[i, 5]),
           'verbal': int(SAT_data.iloc[i, 6]),
           'math': int(SAT_data.iloc[i, 7]),
           'sat': int(SAT_data.iloc[i, 8]),
           })

con.execute(meta.tables['SAT'].insert(), rows)


Out[8]:
<sqlalchemy.engine.result.ResultProxy at 0x23e61b01278>

In [9]:
t = meta.tables['SAT']
s = select([t])

results = con.execute(s.limit(5))

for row in results:
    print(row)


('Alabama', Decimal('4.405'), Decimal('17.2'), Decimal('31.144'), 8, 491, 538, 1029)
('Alaska', Decimal('8.963'), Decimal('17.6'), Decimal('47.951'), 47, 445, 489, 934)
('Arizona', Decimal('4.778'), Decimal('19.3'), Decimal('32.175'), 27, 448, 496, 944)
('Arkansas', Decimal('4.459'), Decimal('17.1'), Decimal('28.934'), 6, 482, 523, 1005)
('California', Decimal('4.992'), Decimal('24.0'), Decimal('41.078'), 45, 417, 485, 902)

In [10]:
IraqVote_data = pd.read_csv('IraqVote.csv')

rows = []

for i in range(len(IraqVote_data)):
    rows.append({'id': int(IraqVote_data.iloc[i, 0]),
            'y': int(IraqVote_data.iloc[i, 1]),
            'state_abb': IraqVote_data.iloc[i, 2],
            'name': IraqVote_data.iloc[i, 3],
            'rep': bool(IraqVote_data.iloc[i, 4]),
            'state_name': IraqVote_data.iloc[i, 5],
            'gorevote': round(float(IraqVote_data.iloc[i, 6]), 2)
           })

con.execute(meta.tables['IraqVote'].insert(), rows)


Out[10]:
<sqlalchemy.engine.result.ResultProxy at 0x23e61b38cf8>

In [11]:
t = meta.tables['IraqVote']
s = select([t])

results = con.execute(s.limit(5))

for row in results:
    print(row)


(1, 1, 'AL', 'SESSIONS (R AL)', True, 'Alabama', Decimal('41.59'))
(2, 1, 'AL', 'SHELBY (R AL)', True, 'Alabama', Decimal('41.59'))
(3, 1, 'AK', 'MURKOWSKI (R AK)', True, 'Alaska', Decimal('27.67'))
(4, 1, 'AK', 'STEVENS (R AK)', True, 'Alaska', Decimal('27.67'))
(5, 1, 'AZ', 'KYL (R AZ)', True, 'Arizona', Decimal('44.67'))

5 - Using SQLAlchemy, select all states which had a salary greater than 30.000 and a math score greater than 500.


In [13]:
from sqlalchemy.sql import and_

t = meta.tables['SAT']
s = select([t.c.state])

results = con.execute(s.where(and_(t.c.salary > 30, t.c.math > 500)))

for row in results:
    print(row)


('Alabama',)
('Colorado',)
('Illinois',)
('Iowa',)
('Kansas',)
('Kentucky',)
('Michigan',)
('Minnesota',)
('Missouri',)
('Nebraska',)
('Ohio',)
('Tennessee',)
('Wisconsin',)
('Wyoming',)

6 - Using SQLAlchemy, count the votes for the war, versus votes against the war for Republican senators, versus for the war not Republican senators.


In [36]:
t = meta.tables['IraqVote']
s1 = select([func.count().label('count')]).where(t.c.y == 1)
s2 = select([func.count().label('count')]).where(and_(t.c.rep == True, t.c.y == 0))
s3 = select([func.count().label('count')]).where(and_(t.c.rep == False, t.c.y == 1))

results1 = con.execute(s1)
results2 = con.execute(s2)
results3 = con.execute(s3)

for row in results1:
    print('Votes for war:', row[0])

for row in results2:
    print('Republican votes against war:', row[0])

for row in results3:
    print('Non republican votes against war:', row[0])


Votes for war: 77
Republican votes against war: 1
Non republican votes against war: 29

7 - Using SQLAlchemy, calculate the average SAT score for states that voted for the Iraq war, those that voted against, and those with split votes.


In [59]:
from sqlalchemy import case

sat = meta.tables['SAT']
vote = meta.tables['IraqVote']
t = vote.join(sat)

states = select([sat.c.state, sat.c.sat, func.sum(vote.c.y).label('NVotes')])\
        .select_from(t).group_by(sat.c.state, sat.c.sat).alias('st')

s = select([case({0: "Against", 1: "Split", 2: "For"}, value=states.c.NVotes), func.avg(states.c.sat).label('AvgSATScore')]).select_from(states).group_by(states.c.NVotes)

results = con.execute(s)

for row in results:
    print('{} states average SAT score: {:.2f}'.format(row[0], row[1]))


Against states average SAT score: 950.83
Split states average SAT score: 968.64
For states average SAT score: 967.76