Using the Titanic dataset, perform the following exercises.
1 - Import the data into SQLite, removing the index column.
.mode csv
.import Titanic.csv titanic
CREATE TABLE temp AS SELECT Name, PClass, Age, Sex, Survived, SexCode FROM titanic;
DROP TABLE titanic;
ALTER TABLE temp RENAME TO titanic;
2 - Run a query that returns the number of passengers in each class under the age of 16.
SELECT PClass, COUNT(*)
FROM titanic
WHERE Age <= 16
GROUP BY PClass;
3 - Run a query that returns the average age of those who died (Survived = 0) and those who survived for male and female separately as well as the counts for each group.
SELECT Sex, Survived, AVG(Age) AS Age_Average, COUNT(*) AS Passengers_count
FROM titanic
GROUP BY Sex, Survived;
4 - Run a query that returns the portion of survivors in each class.
SELECT PClass, SUM(Survived) / CAST(COUNT(*) AS float) AS Survivor_portion
FROM titanic
GROUP BY PClass;
5 - Run a query that returns the portion of females and males in each class who survived with an age greater than the average.
Not really clear what is the desired output, let's try this:
SELECT PClass, Sex ,SUM(Survived) / CAST(COUNT(*) AS float) AS Survivor_portion
FROM titanic
WHERE Age > (SELECT AVG(Age) FROM titanic)
GROUP BY PClass, Sex;
6 - Download the USArrest and road accident data and perform the following.
JOIN on that column.LEFT_JOIN on the column identified in part 2..import USArrests.csv usarrests
.import road.csv road
SELECT * FROM usarrests AS arr INNER JOIN road AS ro ON ro.State = arr.State;
SELECT * FROM usarrests AS arr LEFT JOIN road AS ro ON ro.State = arr.State;
The results are not useful for two reasons:
To solve the latter problem we could modify the state names in order to have them match each other, to solve the former we need to get the missing data somewhere.
In [1]:
import sqlite3
sqlite_db = './myDB.db'
1 - Add a new column, of type char, called class, to the mtcars data set, containing low, med or high for cars with mpg < 15, mpg < 20, and mpg > 20 respectively. Be sure to properly open, commit, and close the connection before proceding.
In [8]:
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()
c.execute('ALTER TABLE mtcars ADD class char;')
c.execute("UPDATE mtcars SET class = CASE WHEN mpg < 15 THEN 'low' WHEN mpg < 20 THEN 'med' ELSE 'high' END;")
conn.commit()
conn.close()
2 - What if you wanted to select cars with both class = 'high' or class = 'low' using the ? place holder like above? Consult the sqlite3 documentation and write a query, using ?, doing just that and print your results.
In [12]:
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()
cl = ['high', 'low']
results = c.execute("SELECT * FROM mtcars WHERE class IN (?, ?)", cl)
for row in results:
print(row)
3 - Like we did above, create a copy of the mtcars table, and insert 3 new cars into the table, using only a subset of the columns, with proper error checking using a try statement, verifying that they were properly inserted before commiting your changes.
In [13]:
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()
c.execute('CREATE TABLE mtcars_copy AS SELECT * FROM mtcars;')
conn.commit()
conn.close()
In [19]:
import sys
conn = None
new_data = [('Panda','21','6'),
('Punto','15','5'),
('Uno','17','4')]
try:
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()
c.executemany('INSERT INTO mtcars_copy VALUES(?, ?, ?)', new_data)
#c.executemany('INSERT INTO mtcars_copy(name, mpg, cyl) VALUES(?, ?, ?)', new_data) # correct statement
conn.commit()
conn.close()
except sqlite3.Error as e:
print('Something went wrong:', e.args[0])
finally:
if conn:
conn.close()
4 - Create a Pandas dataframe from the mtcars dataset, containing only cars with 6 cylinder engines and then create a new table in the database from this dataframe. Verify that all operations worked.
In [25]:
import pandas as pd
conn = sqlite3.connect(sqlite_db)
mtcars = pd.read_sql_query('SELECT * FROM mtcars', conn)
conn.close()
mtcars.head()
Out[25]:
In [35]:
conn = sqlite3.connect(sqlite_db)
mtcars_6cyl = mtcars[mtcars['cyl'] == '6']
mtcars_6cyl.to_sql('mtcars_6cyl', conn)
pd.read_sql_query('SELECT * FROM mtcars_6cyl', conn)
Out[35]:
In [36]:
conn.close()
1 - Now that the mtcars data is in our Postgres Database, all we need is the Titanic data. Do the same as above, bringing in the Titanic.csv we downloaded earlier. Hint: The data contains some NA values. You could just import the column as a character, but there is a proper way to do this...
2 - Print out the metadata for the columns of the mtcars table, including the data types for each column.
In [1]:
import sqlalchemy
def connect(user, password, db, schema, host='localhost', port = 5432):
"""Connects SQLAlchemy to a Postgres database
Args:
user: Postgres user
password: User password
db: Database name
schema: Schema name
host: Host default as localhost
port: Postgres port, default 5432
Returns:
A connection and metadata object
"""
# Generate the URL for our database
url = 'postgresql://{}:{}@{}:{}/{}'
url = url.format(user, password, host, port, db)
# Get the connection object
con = sqlalchemy.create_engine(url, client_encoding = 'utf8')
# Get the metadata object
meta = sqlalchemy.MetaData(bind = con)
meta.reflect(bind = con, schema = schema)
return con, meta
In [2]:
con, meta = connect(user = 'postgres', password = 'password', schema = 'tutorial', db = 'postgres')
In [3]:
for col in meta.tables['tutorial.mtcars'].c:
print('{} is of type {}'.format(col, col.type))
3 - Write a query to return the average mpg for all rows with a hp greater than or equal to 115, grouped by cyl.
In [4]:
from sqlalchemy.sql import text
s = text('SELECT cyl, AVG(mpg) FROM tutorial.mtcars WHERE hp >= 115 GROUP BY cyl')
results = con.execute(s)
for row in results:
print(row)
In [5]:
from sqlalchemy.sql import select, func
t = meta.tables['tutorial.mtcars']
s = select([t.c.cyl, func.avg(t.c.mpg)]).group_by(t.c.cyl)
results = con.execute(s.where(t.c.hp >= 115))
for row in results:
print(row)
4 - Go to the R Dataset Repository and download the Iraq Vote and SAT datasets. After you have them, connect to your database using SQLAlchemy, insert the data, and verify that it worked correctly using SQLAlchemy.
In [6]:
from sqlalchemy import Table, Column, Integer, Numeric, Boolean, String, ForeignKey
In [7]:
SAT = Table('SAT', meta,
Column('state', String, primary_key = True),
Column('expend', Numeric),
Column('ratio', Numeric),
Column('salary', Numeric),
Column('frac', Integer),
Column('verbal', Integer),
Column('math', Integer),
Column('sat', Integer)
)
IraqVote = Table('IraqVote', meta,
Column('id', Integer, primary_key = True),
Column('y', Integer),
Column('state_abb', String),
Column('name', String),
Column('rep', Boolean),
Column('state_name', String, ForeignKey('SAT.state')),
Column('gorevote', Numeric)
)
meta.create_all(con)
In [8]:
import pandas as pd
SAT_data = pd.read_csv('SAT.csv')
rows = []
for i in range(len(SAT_data)):
rows.append({'state': SAT_data.iloc[i, 1],
'expend': round(float(SAT_data.iloc[i, 2]), 3),
'ratio': round(float(SAT_data.iloc[i, 3]), 1),
'salary': round(float(SAT_data.iloc[i, 4]), 3),
'frac': int(SAT_data.iloc[i, 5]),
'verbal': int(SAT_data.iloc[i, 6]),
'math': int(SAT_data.iloc[i, 7]),
'sat': int(SAT_data.iloc[i, 8]),
})
con.execute(meta.tables['SAT'].insert(), rows)
Out[8]:
In [9]:
t = meta.tables['SAT']
s = select([t])
results = con.execute(s.limit(5))
for row in results:
print(row)
In [10]:
IraqVote_data = pd.read_csv('IraqVote.csv')
rows = []
for i in range(len(IraqVote_data)):
rows.append({'id': int(IraqVote_data.iloc[i, 0]),
'y': int(IraqVote_data.iloc[i, 1]),
'state_abb': IraqVote_data.iloc[i, 2],
'name': IraqVote_data.iloc[i, 3],
'rep': bool(IraqVote_data.iloc[i, 4]),
'state_name': IraqVote_data.iloc[i, 5],
'gorevote': round(float(IraqVote_data.iloc[i, 6]), 2)
})
con.execute(meta.tables['IraqVote'].insert(), rows)
Out[10]:
In [11]:
t = meta.tables['IraqVote']
s = select([t])
results = con.execute(s.limit(5))
for row in results:
print(row)
5 - Using SQLAlchemy, select all states which had a salary greater than 30.000 and a math score greater than 500.
In [13]:
from sqlalchemy.sql import and_
t = meta.tables['SAT']
s = select([t.c.state])
results = con.execute(s.where(and_(t.c.salary > 30, t.c.math > 500)))
for row in results:
print(row)
6 - Using SQLAlchemy, count the votes for the war, versus votes against the war for Republican senators, versus for the war not Republican senators.
In [36]:
t = meta.tables['IraqVote']
s1 = select([func.count().label('count')]).where(t.c.y == 1)
s2 = select([func.count().label('count')]).where(and_(t.c.rep == True, t.c.y == 0))
s3 = select([func.count().label('count')]).where(and_(t.c.rep == False, t.c.y == 1))
results1 = con.execute(s1)
results2 = con.execute(s2)
results3 = con.execute(s3)
for row in results1:
print('Votes for war:', row[0])
for row in results2:
print('Republican votes against war:', row[0])
for row in results3:
print('Non republican votes against war:', row[0])
7 - Using SQLAlchemy, calculate the average SAT score for states that voted for the Iraq war, those that voted against, and those with split votes.
In [59]:
from sqlalchemy import case
sat = meta.tables['SAT']
vote = meta.tables['IraqVote']
t = vote.join(sat)
states = select([sat.c.state, sat.c.sat, func.sum(vote.c.y).label('NVotes')])\
.select_from(t).group_by(sat.c.state, sat.c.sat).alias('st')
s = select([case({0: "Against", 1: "Split", 2: "For"}, value=states.c.NVotes), func.avg(states.c.sat).label('AvgSATScore')]).select_from(states).group_by(states.c.NVotes)
results = con.execute(s)
for row in results:
print('{} states average SAT score: {:.2f}'.format(row[0], row[1]))