SQL Introduction

Using the Titanic dataset, perform the following exercises.

1 - Import the data into SQLite, removing the index column.

.mode csv
.import Titanic.csv titanic

CREATE TABLE temp AS SELECT Name, PClass, Age, Sex, Survived, SexCode FROM titanic;
DROP TABLE titanic;
ALTER TABLE temp RENAME TO titanic;

2 - Run a query that returns the number of passengers in each class under the age of 16.

SELECT PClass, COUNT(*)
FROM titanic
WHERE Age <= 16
GROUP BY PClass;

3 - Run a query that returns the average age of those who died (Survived = 0) and those who survived for male and female separately as well as the counts for each group.

SELECT Sex, Survived, AVG(Age) AS Age_Average, COUNT(*) AS Passengers_count
FROM titanic
GROUP BY Sex, Survived;

4 - Run a query that returns the portion of survivors in each class.

SELECT PClass, SUM(Survived) / CAST(COUNT(*) AS float) AS Survivor_portion
FROM titanic
GROUP BY PClass;

5 - Run a query that returns the portion of females and males in each class who survived with an age greater than the average.

Not really clear what is the desired output, let's try this:

SELECT PClass, Sex ,SUM(Survived) / CAST(COUNT(*) AS float) AS Survivor_portion
FROM titanic
WHERE Age > (SELECT AVG(Age) FROM titanic)
GROUP BY PClass, Sex;

6 - Download the USArrest and road accident data and perform the following.

Import the data into your SQLite environment.
What column should the data be joined on? After identifying it, JOIN on that column.
Perform a LEFT_JOIN on the column identified in part 2.
Are the results of your joins useful? If not, what would be required to fix them? (Don't actually do it!).

.import USArrests.csv usarrests
.import road.csv road

SELECT * FROM usarrests AS arr INNER JOIN road AS ro ON ro.State = arr.State;

SELECT * FROM usarrests AS arr LEFT JOIN road AS ro ON ro.State = arr.State;

The results are not useful for two reasons:

the road dataset does not contain all US states
the state are registered in different ways between the two dataset

To solve the latter problem we could modify the state names in order to have them match each other, to solve the former we need to get the missing data somewhere.

SQLite3 in Python



In [1]:

    
import sqlite3

sqlite_db = './myDB.db'

1 - Add a new column, of type char, called class, to the mtcars data set, containing low, med or high for cars with mpg < 15, mpg < 20, and mpg > 20 respectively. Be sure to properly open, commit, and close the connection before proceding.



In [8]:

    
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

c.execute('ALTER TABLE mtcars ADD class char;')
c.execute("UPDATE mtcars SET class = CASE WHEN mpg < 15 THEN 'low' WHEN mpg < 20 THEN 'med' ELSE 'high' END;")

conn.commit()
conn.close()

2 - What if you wanted to select cars with both class = 'high' or class = 'low' using the ? place holder like above? Consult the sqlite3 documentation and write a query, using ?, doing just that and print your results.



In [12]:

    
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

cl = ['high', 'low']
results = c.execute("SELECT * FROM mtcars WHERE class IN (?, ?)", cl)
for row in results:
    print(row)









    



('Mazda RX4', '21', '6', '160', '110', '3.9', '2.62', '16.46', '0', '1', '4', '4', 'high')
('Mazda RX4 Wag', '21', '6', '160', '110', '3.9', '2.875', '17.02', '0', '1', '4', '4', 'high')
('Datsun 710', '22.8', '4', '108', '93', '3.85', '2.32', '18.61', '1', '1', '4', '1', 'high')
('Hornet 4 Drive', '21.4', '6', '258', '110', '3.08', '3.215', '19.44', '1', '0', '3', '1', 'high')
('Duster 360', '14.3', '8', '360', '245', '3.21', '3.57', '15.84', '0', '0', '3', '4', 'low')
('Merc 240D', '24.4', '4', '146.7', '62', '3.69', '3.19', '20', '1', '0', '4', '2', 'high')
('Merc 230', '22.8', '4', '140.8', '95', '3.92', '3.15', '22.9', '1', '0', '4', '2', 'high')
('Cadillac Fleetwood', '10.4', '8', '472', '205', '2.93', '5.25', '17.98', '0', '0', '3', '4', 'low')
('Lincoln Continental', '10.4', '8', '460', '215', '3', '5.424', '17.82', '0', '0', '3', '4', 'low')
('Chrysler Imperial', '14.7', '8', '440', '230', '3.23', '5.345', '17.42', '0', '0', '3', '4', 'low')
('Fiat 128', '32.4', '4', '78.7', '66', '4.08', '2.2', '19.47', '1', '1', '4', '1', 'high')
('Honda Civic', '30.4', '4', '75.7', '52', '4.93', '1.615', '18.52', '1', '1', '4', '2', 'high')
('Toyota Corolla', '33.9', '4', '71.1', '65', '4.22', '1.835', '19.9', '1', '1', '4', '1', 'high')
('Toyota Corona', '21.5', '4', '120.1', '97', '3.7', '2.465', '20.01', '1', '0', '3', '1', 'high')
('Camaro Z28', '13.3', '8', '350', '245', '3.73', '3.84', '15.41', '0', '0', '3', '4', 'low')
('Fiat X1-9', '27.3', '4', '79', '66', '4.08', '1.935', '18.9', '1', '1', '4', '1', 'high')
('Porsche 914-2', '26', '4', '120.3', '91', '4.43', '2.14', '16.7', '0', '1', '5', '2', 'high')
('Lotus Europa', '30.4', '4', '95.1', '113', '3.77', '1.513', '16.9', '1', '1', '5', '2', 'high')
('Volvo 142E', '21.4', '4', '121', '109', '4.11', '2.78', '18.6', '1', '1', '4', '2', 'high')

3 - Like we did above, create a copy of the mtcars table, and insert 3 new cars into the table, using only a subset of the columns, with proper error checking using a try statement, verifying that they were properly inserted before commiting your changes.



In [13]:

    
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

c.execute('CREATE TABLE mtcars_copy AS SELECT * FROM mtcars;')

conn.commit()
conn.close()



In [19]:

    
import sys

conn = None

new_data = [('Panda','21','6'),
           ('Punto','15','5'),
           ('Uno','17','4')]

try:
    conn = sqlite3.connect(sqlite_db)
    c = conn.cursor()

    c.executemany('INSERT INTO mtcars_copy VALUES(?, ?, ?)', new_data)
    #c.executemany('INSERT INTO mtcars_copy(name, mpg, cyl) VALUES(?, ?, ?)', new_data) # correct statement
    
    conn.commit()
    conn.close()

except sqlite3.Error as e:
    print('Something went wrong:', e.args[0])

finally:
    if conn:
        conn.close()

4 - Create a Pandas dataframe from the mtcars dataset, containing only cars with 6 cylinder engines and then create a new table in the database from this dataframe. Verify that all operations worked.



In [25]:

    
import pandas as pd

conn = sqlite3.connect(sqlite_db)

mtcars = pd.read_sql_query('SELECT * FROM mtcars', conn)

conn.close()

mtcars.head()









    Out[25]:






  
    
      
      name
      mpg
      cyl
      disp
      hp
      drat
      wt
      qsec
      vs
      am
      gear
      carb
      class
    
  
  
    
      0
      Mazda RX4
      21
      6
      160
      110
      3.9
      2.62
      16.46
      0
      1
      4
      4
      high
    
    
      1
      Mazda RX4 Wag
      21
      6
      160
      110
      3.9
      2.875
      17.02
      0
      1
      4
      4
      high
    
    
      2
      Datsun 710
      22.8
      4
      108
      93
      3.85
      2.32
      18.61
      1
      1
      4
      1
      high
    
    
      3
      Hornet 4 Drive
      21.4
      6
      258
      110
      3.08
      3.215
      19.44
      1
      0
      3
      1
      high
    
    
      4
      Hornet Sportabout
      18.7
      8
      360
      175
      3.15
      3.44
      17.02
      0
      0
      3
      2
      med



In [35]:

    
conn = sqlite3.connect(sqlite_db)

mtcars_6cyl = mtcars[mtcars['cyl'] == '6']
mtcars_6cyl.to_sql('mtcars_6cyl', conn)
pd.read_sql_query('SELECT * FROM mtcars_6cyl', conn)









    Out[35]:






  
    
      
      index
      name
      mpg
      cyl
      disp
      hp
      drat
      wt
      qsec
      vs
      am
      gear
      carb
      class
    
  
  
    
      0
      0
      Mazda RX4
      21
      6
      160
      110
      3.9
      2.62
      16.46
      0
      1
      4
      4
      high
    
    
      1
      1
      Mazda RX4 Wag
      21
      6
      160
      110
      3.9
      2.875
      17.02
      0
      1
      4
      4
      high
    
    
      2
      3
      Hornet 4 Drive
      21.4
      6
      258
      110
      3.08
      3.215
      19.44
      1
      0
      3
      1
      high
    
    
      3
      5
      Valiant
      18.1
      6
      225
      105
      2.76
      3.46
      20.22
      1
      0
      3
      1
      med
    
    
      4
      9
      Merc 280
      19.2
      6
      167.6
      123
      3.92
      3.44
      18.3
      1
      0
      4
      4
      med
    
    
      5
      10
      Merc 280C
      17.8
      6
      167.6
      123
      3.92
      3.44
      18.9
      1
      0
      4
      4
      med
    
    
      6
      29
      Ferrari Dino
      19.7
      6
      145
      175
      3.62
      2.77
      15.5
      0
      1
      5
      6
      med



In [36]:

    
conn.close()

Postgres and SQLAlchemy

1 - Now that the mtcars data is in our Postgres Database, all we need is the Titanic data. Do the same as above, bringing in the Titanic.csv we downloaded earlier. Hint: The data contains some NA values. You could just import the column as a character, but there is a proper way to do this...

2 - Print out the metadata for the columns of the mtcars table, including the data types for each column.



In [1]:

    
import sqlalchemy

def connect(user, password, db, schema, host='localhost', port = 5432):
    """Connects SQLAlchemy to a Postgres database
        
    Args:
        user: Postgres user
        password: User password
        db: Database name
        schema: Schema name
        host: Host default as localhost
        port: Postgres port, default 5432
            
    Returns:
        A connection and metadata object
    """
    # Generate the URL for our database
    url = 'postgresql://{}:{}@{}:{}/{}'
    url = url.format(user, password, host, port, db)

    # Get the connection object
    con = sqlalchemy.create_engine(url, client_encoding = 'utf8')

    # Get the metadata object
    meta = sqlalchemy.MetaData(bind = con)
    meta.reflect(bind = con, schema = schema)

    return con, meta



In [2]:

    
con, meta = connect(user = 'postgres', password = 'password', schema = 'tutorial', db = 'postgres')



In [3]:

    
for col in meta.tables['tutorial.mtcars'].c:
    print('{} is of type {}'.format(col, col.type))









    



mtcars.name is of type VARCHAR
mtcars.mpg is of type NUMERIC
mtcars.cyl is of type INTEGER
mtcars.disp is of type NUMERIC
mtcars.hp is of type NUMERIC
mtcars.drat is of type NUMERIC
mtcars.wt is of type NUMERIC
mtcars.qsec is of type NUMERIC
mtcars.vs is of type INTEGER
mtcars.am is of type INTEGER
mtcars.gear is of type INTEGER
mtcars.carb is of type INTEGER

3 - Write a query to return the average mpg for all rows with a hp greater than or equal to 115, grouped by cyl.



In [4]:

    
from sqlalchemy.sql import text

s = text('SELECT cyl, AVG(mpg) FROM tutorial.mtcars WHERE hp >= 115 GROUP BY cyl')
results = con.execute(s)

for row in results:
    print(row)









    



(8, Decimal('15.1000000000000000'))
(6, Decimal('18.9000000000000000'))



In [5]:

    
from sqlalchemy.sql import select, func

t = meta.tables['tutorial.mtcars']
s = select([t.c.cyl, func.avg(t.c.mpg)]).group_by(t.c.cyl)

results = con.execute(s.where(t.c.hp >= 115))

for row in results:
    print(row)









    



(8, Decimal('15.1000000000000000'))
(6, Decimal('18.9000000000000000'))

4 - Go to the R Dataset Repository and download the Iraq Vote and SAT datasets. After you have them, connect to your database using SQLAlchemy, insert the data, and verify that it worked correctly using SQLAlchemy.



In [6]:

    
from sqlalchemy import Table, Column, Integer, Numeric, Boolean, String, ForeignKey



In [7]:

    
SAT = Table('SAT', meta,
            Column('state', String, primary_key = True),
            Column('expend', Numeric),
            Column('ratio', Numeric),
            Column('salary', Numeric),
            Column('frac', Integer),
            Column('verbal', Integer),
            Column('math', Integer),
            Column('sat', Integer)
)

IraqVote = Table('IraqVote', meta,
                Column('id', Integer, primary_key = True),
                Column('y', Integer),
                Column('state_abb', String),
                Column('name', String),
                Column('rep', Boolean),
                Column('state_name', String, ForeignKey('SAT.state')),
                Column('gorevote', Numeric)
)

meta.create_all(con)



In [8]:

    
import pandas as pd

SAT_data = pd.read_csv('SAT.csv')

rows = []

for i in range(len(SAT_data)):
    rows.append({'state': SAT_data.iloc[i, 1],
           'expend': round(float(SAT_data.iloc[i, 2]), 3),
           'ratio': round(float(SAT_data.iloc[i, 3]), 1),
           'salary': round(float(SAT_data.iloc[i, 4]), 3),
           'frac': int(SAT_data.iloc[i, 5]),
           'verbal': int(SAT_data.iloc[i, 6]),
           'math': int(SAT_data.iloc[i, 7]),
           'sat': int(SAT_data.iloc[i, 8]),
           })

con.execute(meta.tables['SAT'].insert(), rows)









    Out[8]:





<sqlalchemy.engine.result.ResultProxy at 0x23e61b01278>



In [9]:

    
t = meta.tables['SAT']
s = select([t])

results = con.execute(s.limit(5))

for row in results:
    print(row)









    



('Alabama', Decimal('4.405'), Decimal('17.2'), Decimal('31.144'), 8, 491, 538, 1029)
('Alaska', Decimal('8.963'), Decimal('17.6'), Decimal('47.951'), 47, 445, 489, 934)
('Arizona', Decimal('4.778'), Decimal('19.3'), Decimal('32.175'), 27, 448, 496, 944)
('Arkansas', Decimal('4.459'), Decimal('17.1'), Decimal('28.934'), 6, 482, 523, 1005)
('California', Decimal('4.992'), Decimal('24.0'), Decimal('41.078'), 45, 417, 485, 902)



In [10]:

    
IraqVote_data = pd.read_csv('IraqVote.csv')

rows = []

for i in range(len(IraqVote_data)):
    rows.append({'id': int(IraqVote_data.iloc[i, 0]),
            'y': int(IraqVote_data.iloc[i, 1]),
            'state_abb': IraqVote_data.iloc[i, 2],
            'name': IraqVote_data.iloc[i, 3],
            'rep': bool(IraqVote_data.iloc[i, 4]),
            'state_name': IraqVote_data.iloc[i, 5],
            'gorevote': round(float(IraqVote_data.iloc[i, 6]), 2)
           })

con.execute(meta.tables['IraqVote'].insert(), rows)









    Out[10]:





<sqlalchemy.engine.result.ResultProxy at 0x23e61b38cf8>



In [11]:

    
t = meta.tables['IraqVote']
s = select([t])

results = con.execute(s.limit(5))

for row in results:
    print(row)









    



(1, 1, 'AL', 'SESSIONS (R AL)', True, 'Alabama', Decimal('41.59'))
(2, 1, 'AL', 'SHELBY (R AL)', True, 'Alabama', Decimal('41.59'))
(3, 1, 'AK', 'MURKOWSKI (R AK)', True, 'Alaska', Decimal('27.67'))
(4, 1, 'AK', 'STEVENS (R AK)', True, 'Alaska', Decimal('27.67'))
(5, 1, 'AZ', 'KYL (R AZ)', True, 'Arizona', Decimal('44.67'))

5 - Using SQLAlchemy, select all states which had a salary greater than 30.000 and a math score greater than 500.



In [13]:

    
from sqlalchemy.sql import and_

t = meta.tables['SAT']
s = select([t.c.state])

results = con.execute(s.where(and_(t.c.salary > 30, t.c.math > 500)))

for row in results:
    print(row)









    



('Alabama',)
('Colorado',)
('Illinois',)
('Iowa',)
('Kansas',)
('Kentucky',)
('Michigan',)
('Minnesota',)
('Missouri',)
('Nebraska',)
('Ohio',)
('Tennessee',)
('Wisconsin',)
('Wyoming',)

6 - Using SQLAlchemy, count the votes for the war, versus votes against the war for Republican senators, versus for the war not Republican senators.



In [36]:

    
t = meta.tables['IraqVote']
s1 = select([func.count().label('count')]).where(t.c.y == 1)
s2 = select([func.count().label('count')]).where(and_(t.c.rep == True, t.c.y == 0))
s3 = select([func.count().label('count')]).where(and_(t.c.rep == False, t.c.y == 1))

results1 = con.execute(s1)
results2 = con.execute(s2)
results3 = con.execute(s3)

for row in results1:
    print('Votes for war:', row[0])

for row in results2:
    print('Republican votes against war:', row[0])

for row in results3:
    print('Non republican votes against war:', row[0])









    



Votes for war: 77
Republican votes against war: 1
Non republican votes against war: 29

7 - Using SQLAlchemy, calculate the average SAT score for states that voted for the Iraq war, those that voted against, and those with split votes.



In [59]:

    
from sqlalchemy import case

sat = meta.tables['SAT']
vote = meta.tables['IraqVote']
t = vote.join(sat)

states = select([sat.c.state, sat.c.sat, func.sum(vote.c.y).label('NVotes')])\
        .select_from(t).group_by(sat.c.state, sat.c.sat).alias('st')

s = select([case({0: "Against", 1: "Split", 2: "For"}, value=states.c.NVotes), func.avg(states.c.sat).label('AvgSATScore')]).select_from(states).group_by(states.c.NVotes)

results = con.execute(s)

for row in results:
    print('{} states average SAT score: {:.2f}'.format(row[0], row[1]))









    



Against states average SAT score: 950.83
Split states average SAT score: 968.64
For states average SAT score: 967.76

	name	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	class
0	Mazda RX4	21	6	160	110	3.9	2.62	16.46	0	1	4	4	high
1	Mazda RX4 Wag	21	6	160	110	3.9	2.875	17.02	0	1	4	4	high
2	Datsun 710	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	high
3	Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	high
4	Hornet Sportabout	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	med