n this project, we will walk through how to normalize our single table into multiple tables and how to create relations between them.
The Academy Awards, also known as the Oscars, is an annual awards ceremony hosted to recognize the achievements in the film industry. There are many different awards categories and the members of the academy vote every year to decide which artist or film should get the award. Each row in our data represents a nomination for an award. Recall that our database file, nominations.db, contains just the nominations table. This table has the following schema:
Let's now set up our enviroment and spend some time getting familiar with the data before we start normalizing it.
In [1]:
import sqlite3
In [2]:
# Conect to nominations.db
conn = sqlite3.connect('../data/nominations.db')
In [3]:
# Return the schema using "pragma table_info()"
query = "pragma table_info(nominations);"
schema = conn.execute(query).fetchall()
schema
Out[3]:
In [4]:
# Return the first 10 rows using the SELECT and LIMIT statements
query = "SELECT * FROM nominations LIMIT 10;"
first_ten = conn.execute(query).fetchall()
In [5]:
# Since both schema and first_ten are lists, use a for loop to iterate
for item in schema:
print(item)
for row in first_ten:
print(row)
Let's now add information on the host for each awards ceremony. Instead of adding a Host column to the nominations table and having lots of redundant data, we'll create a separate table called ceremonies which contains data specific to the ceremony itself.
Let's create a ceremonies table that contains the Year and Host for each ceremony and then set up a one-to-many relationship between ceremonies and nominations. In this screen, we'll focus on creating the ceremonies table and inserting the data we need and in the next guided step, we'll focus on setting up the one-to-many relationship.
The ceremonies table will contain 3 fields:
Before we can create and insert into the ceremonies table, we need to look up the host for each ceremony from 2000 to 2010. While we could represent each row as a tuple and write a SQL query with an INSERT statement to add each row to the ceremonies table, this is incredibly cumbersome.
The Python sqlite3 library comes with an executemany method that let's us easily mass insert records into a table. The executemany method requires the records we want to insert to be represented as a list of tuples. We then just need to write a single INSERT query with placeholder elements and specify that we want the list of tuples to be dropped into the query.
Let's first create the list of tuples representing the data we want inserted and then we'll walk through the placeholder query we need to write. We'll skip over creating the ceremonies table for now since we've explored how to create a table earlier in the course.
We then need to write the INSERT query with placeholder values. Instead of having specific values in the query string, we use a question mark (?) to act as a placeholder in the values section of the query:
insert_query = "INSERT INTO ceremonies (Year, Host) VALUES (?,?);"
Since the placeholder elements (?) will be replaced by the values in years_hosts, you need to make sure the number of question marks matches the length of each tuple in years_hosts. Since each tuple has 2 elements, we need to have 2 question marks as the placeholder elements. We don't need to specify values for the id column since it's a primary key column. When inserting values, recall that SQLite automatically creates a unique primary key for each row.
We then call the executemany method and pass in insert_query as the first parameter and years_hosts as the second parameter:
conn.executemany(insert_query, years_hosts)
In [6]:
# Create the ceremonies table
create_ceremonies = "create table ceremonies (id integer primary key, year integer, host text);"
conn.execute(create_ceremonies)
Out[6]:
In [7]:
# Create the list of tuples, years_hosts
years_hosts = [(2010, "Steve Martin"),
(2009, "Hugh Jackman"),
(2008, "Jon Stewart"),
(2007, "Ellen DeGeneres"),
(2006, "Jon Stewart"),
(2005, "Chris Rock"),
(2004, "Billy Crystal"),
(2003, "Steve Martin"),
(2002, "Whoopi Goldberg"),
(2001, "Steve Martin"),
(2000, "Billy Crystal"),
]
In [8]:
# Use the Connection method executemany() to insert the values
insert_query = "insert into ceremonies (Year, Host) values (?,?);"
conn.executemany(insert_query, years_hosts)
Out[8]:
In [9]:
# Verify that the ceremonies table was created and populated correctly
print(conn.execute("select * from ceremonies limit 10;").fetchall())
print(conn.execute("pragma table_info(ceremonies);").fetchall())
Since we'll be creating relations using foreign keys, we need to turn on foreign key constraints. By default, if you insert a row into a table that contains one or multiple foreign key columns, the record will be successfully inserted even if the foreign key reference is incorrect.
For example, since the ceremonies table only contains the id values 1 to 10, inserting a row into nominations while specifying that the ceremony_id value be 11 will work and no error will be returned. This is problematic because if we try to actually join that row with the ceremonies table, the results set will be empty since the id value 11 doesn't map to any row in the ceremonies table. To prevent us from inserting rows with nonexisting foreign key values, we need to turn on foreign key constraints by running the following query:
PRAGMA foreign_keys = ON;
The above query needs to be run every time you connect to a database where you'll be inserting foreign keys. Whenever you try inserting a row into a table containing foreign key(s), SQLite will query the linked table to make sure that foreign key value exists. If it does, the transaction will continue as expected. If it doesn't, then an error will be returned and the transaction won't go through.
In [10]:
# Turn on foreign key constraints
conn.execute("PRAGMA foreign_keys = ON;")
Out[10]:
The next step is to remove the Year column from nominations and add a new column, ceremony_id, that contains the foreign key reference to the id column in the ceremonies table. Unfortunately, we can't remove columns from an existing table in SQLite or change its schema. The goal of SQLite is to create an incredibly lightweight, open source database that contains a common, but reduced, set of features. While this has allowed SQLite to become the most popular database in the world, SQLite doesn't have the ability to heavily modify an existing table to keep the code base lightweight.
The only alterations we can make to an existing table are renaming it or adding a new column. This means that we can't just remove the Year column from nominations and add the ceremony_id column. We need to instead:
For nominations_two, we want the following schema:
First, we need to select all the records from the original nominations table with the columns we want and use an INNER JOIN to add the id field from ceremonies for each row:
SELECT nominations.category, nominations.nominee, nominations.movie, nominations.character, nominations.won, ceremonies.id
FROM nominations
INNER JOIN ceremonies ON
nominations.year == ceremonies.year
;
Then we can write the placeholder insert query we need to insert these records into nominations_two. Let's create and populate the nominations_two table in this screen and we'll work through the rest in the next screen.
In [11]:
# Write and run the query to create the nominations_two table
create_nominations_two = '''create table nominations_two
(id integer primary key,
category text,
nominee text,
movie text,
character text,
won text,
ceremony_id integer,
foreign key(ceremony_id) references ceremonies(id));
'''
conn.execute(create_nominations_two)
Out[11]:
In [12]:
# Write and run the query that returns the records from nominations
nom_query = '''
select ceremonies.id as ceremony_id, nominations.category as category,
nominations.nominee as nominee, nominations.movie as movie,
nominations.character as character, nominations.won as won
from nominations
inner join ceremonies
on nominations.year == ceremonies.year
;
'''
joined_nominations = conn.execute(nom_query).fetchall()
In [13]:
# Write a placeholder insert query that can insert values into nom2
insert_nominations_two = '''insert into nominations_two
(ceremony_id, category, nominee, movie, character, won)
values (?,?,?,?,?,?);
'''
In [14]:
# Use the Connection method executemany() to insert records
conn.executemany(insert_nominations_two, joined_nominations)
Out[14]:
In [15]:
# Verify your work by returning the first 5 rows from nominations_two
print(conn.execute("select * from nominations_two limit 5;").fetchall())
We now need to delete the nominations table since we'll be using the nominations_two table moving forward. We can use the DROP TABLE statement to drop the original nominations table.
Once we drop this table, we can use the ALTER TABLE statement to rename nominations_two to nominations. Here's what the syntax looks like for that statement:
ALTER TABLE [current_table_name]
RENAME TO [future_table_name]
In [16]:
# Write and run the query that deletes the nominations table
drop_nominations = "drop table nominations;"
conn.execute(drop_nominations)
Out[16]:
In [17]:
# Write and run the query that renames nominations_two to nominations
rename_nominations_two = "alter table nominations_two rename to nominations;"
conn.execute(rename_nominations_two)
Out[17]:
Creating a join table is no different than creating a regular one. To create the movies_actors join table we need to declare both of the foreign key references when specifying the schema:
CREATE TABLE movies_actors (
id INTEGER PRIMARY KEY,
movie_id INTEGER REFERENCES movies(id),
actor_id INTEGER REFERENCES actors(id)
);
In [18]:
# Create the movies table
create_movies = "create table movies (id integer primary key,movie text);"
conn.execute(create_movies)
Out[18]:
In [19]:
# Create the actors table
create_actors = "create table actors (id integer primary key,actor text);"
conn.execute(create_actors)
Out[19]:
In [20]:
# Create the movie_actors join table
create_movies_actors = '''create table movies_actors (id INTEGER PRIMARY KEY,
movie_id INTEGER references movies(id), actor_id INTEGER references actors(id));
'''
conn.execute(create_movies_actors)
Out[20]:
In [21]:
insert_movies = "insert into movies (movie) select distinct movie from nominations;"
insert_actors = "insert into actors (actor) select distinct nominee from nominations;"
conn.execute(insert_movies)
conn.execute(insert_actors)
print(conn.execute("select * from movies limit 5;").fetchall())
print(conn.execute("select * from actors limit 5;").fetchall())
In [22]:
pairs_query = "select movie,nominee from nominations;"
movie_actor_pairs = conn.execute(pairs_query).fetchall()
join_table_insert = "insert into movies_actors (movie_id, actor_id) values ((select id from movies where movie == ?),(select id from actors where actor == ?));"
conn.executemany(join_table_insert,movie_actor_pairs)
print(conn.execute("select * from movies_actors limit 5;").fetchall())
In [23]:
# Close the database
conn.close()
In [ ]: