So far, we've learned how to write SQL queries to interact with existing databases. In this guided project, you'll learn how to clean a CSV dataset and add it to a SQLite database.
We'll work with data on Academy Award nominations, which can be downloaded here. The Academy Awards, also known as the Oscars, are an annual awards ceremony held to recognize achievements in the film industry. There are many different award categories, and the members of the Academy vote every year to decide which artist or film should receive each award. The awards categories have changed over the years, and you can learn more about when categories were added on Wikipedia.
Here are the columns in the dataset, academy_awards.csv: Year, Category, Nominee, Additional Info, Won?, and six extra columns (Unnamed: 5 through Unnamed: 10).
Read the dataset into a Dataframe and explore it to become more familiar with the data. Once you've cleaned the dataset, you'll use a Pandas helper method to export the data into a SQLite database.
In [1]:
# Import pandas and read the CSV file academy_awards.csv into a DataFrame
import pandas as pd
df = pd.read_csv('../data/academy_awards.csv', encoding="ISO-8859-1")
In [2]:
# Start exploring the data in Pandas and look for data quality issues
df.head()
Out[2]:
In [3]:
# There are 6 unnamed columns at the end. Do any of them contain valid values?
cols = df.columns
for column in cols:
    if column.startswith('Unnamed:'):
        print('\ncolumn {}\n:{}'.format(column, df[column].value_counts()))
In [4]:
# Additional Info column contains a few different formatting styles.
# Start brainstorming ways to clean this column up.
col_str = "Biutiful {'Uxbal'}"
col_split = col_str.split(' ')
movie_name = col_split[0]
character_name = col_split[1].split("'")[1]
print(movie_name)
print(character_name)
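One caveat with this approach: splitting on a plain space only works for single-word movie titles. Splitting on the " {'" separator instead keeps multi-word titles intact. Here's a minimal sketch, using the real nomination True Grit {'Rooster Cogburn'} as the example; the vectorized cleanup later in this project relies on the same separator:
# Splitting on " {'" keeps multi-word movie names in one piece
col_str = "True Grit {'Rooster Cogburn'}"
movie_name, remainder = col_str.split(" {'")
character_name = remainder.rstrip("'}")
print(movie_name)       # True Grit
print(character_name)   # Rooster Cogburn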
The dataset is incredibly messy, and you may have noticed many inconsistencies that make it hard to work with. Most columns don't have consistent formatting, which matters a great deal when we use SQL to query the data later on. Other columns vary in the information they convey depending on the awards category the row corresponds to.
In the SQL and Databases: Intermediate course, we worked with a subset of the same dataset. This subset contained only the nominations from the years 2001 to 2010 and only the following awards categories: Actor -- Leading Role, Actor -- Supporting Role, Actress -- Leading Role, and Actress -- Supporting Role.
Let's filter our Dataframe to the same subset so it's more manageable.
In [5]:
# Before we filter the data, let's clean up the Year column
import numpy as np
df["Year"] = df["Year"].str[0:4]
# Convert the Year column to the int64 data type
df["Year"] = df["Year"].astype(np.int64)
df.dtypes
Out[5]:
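The slicing is necessary because the raw Year values bundle extra text with the year itself. Assuming the raw values look something like the ones below (the exact formats are an assumption here), str[0:4] keeps just the leading 4-digit year:
# Hypothetical raw Year values; str[0:4] keeps the leading 4-digit year
raw_years = pd.Series(["2010 (83rd)", "1927/28 (1st)"])
print(raw_years.str[0:4])   # yields "2010" and "1927"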
In [6]:
# Use conditional filtering to select only the rows where Year > 2000
later_than_2000 = df[df["Year"] > 2000]
In [7]:
# Use conditional filtering to select rows where Category matches
award_categories = ['Actor -- Leading Role',
                    'Actor -- Supporting Role',
                    'Actress -- Leading Role',
                    'Actress -- Supporting Role']
# Use .copy() so later assignments modify a new DataFrame,
# not a view of df (this avoids SettingWithCopyWarning)
nominations = later_than_2000[later_than_2000["Category"].isin(award_categories)].copy()
nominations.head()
Out[7]:
Since SQLite uses the integers 0 and 1 to represent Boolean values, convert the Won? column to reflect this. Also rename the Won? column to Won so that it's consistent with the other column names. Finally, get rid of the 6 extra, unnamed columns, since they contain only null values in our filtered Dataframe nominations.
In [8]:
# Use Series method map to replace all NO values with 0 and YES with 1
replacements = {"NO": 0, "YES": 1}
nominations["Won?"] = nominations["Won?"].map(replacements)
nominations.head()
Out[8]:
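Note that Series.map replaces any value that isn't a key in the dictionary with NaN, so a quick sanity check confirms the column now contains only 0s and 1s:
# Count every distinct value in Won?, including NaN from unmapped values
print(nominations["Won?"].value_counts(dropna=False))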
In [9]:
# Create a new column Won that contains the values from the Won? column
nominations["Won"] = nominations["Won?"]
In [10]:
# Use the drop method to remove the extraneous columns
drop_cols = ['Won?', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7',
             'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10']
final_nominations = nominations.drop(drop_cols, axis=1)
final_nominations.head()
Out[10]:
Now clean up the Additional Info column, whose values are formatted like so: MOVIE {'CHARACTER'}
Here are some examples: Biutiful {'Uxbal'} and True Grit {'Rooster Cogburn'}
The values in this column contain the movie name and the character the nominee played. Instead of keeping these values in 1 column, split them up into 2 different columns for easier querying.
In [11]:
# Use vectorized string methods to clean up the Additional Info column
additional_info_one = final_nominations["Additional Info"].str.rstrip("'}")
additional_info_two = additional_info_one.str.split(" {'")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop("Additional Info", axis=1)
final_nominations.head()
Out[11]:
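Not every row is guaranteed to match the MOVIE {'CHARACTER'} pattern exactly, so it's worth checking whether the split left any missing values behind. A quick sketch:
# Rows where the pattern didn't match end up with NaN in Character
print(final_nominations["Character"].isnull().sum())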
Now that our Dataframe is cleaned up, let's write these records to a SQL database. We can use the Pandas Dataframe method to_sql to create a new table in a database we specify. This method has 2 required parameters: the name of the table we want to create, and the Connection instance for the database we want to write to.
Behind the scenes, Pandas creates a table and uses the first parameter to name it. Pandas uses the data types of each of the columns in the Dataframe to create a SQLite schema for this table. Since SQLite uses integer values to represent Booleans, it was important to convert the Won column to the integer values 0 and 1. We also converted the Year column to the integer data type, so that this column will have the appropriate type in our table.
After creating the table, Pandas creates a large INSERT query and runs it to insert the values into the table. We can customize the behavior of the to_sql method using its parameters. For example, if we wanted to append rows to an existing SQLite table instead of creating a new one, we can set the if_exists parameter to "append". By default, if_exists is set to "fail" and no rows will be inserted if we specify a table name that already exists. If we're inserting a large number of records into SQLite and we want to break up the inserting of records into chunks, we can use the chunksize parameter to set the number of rows we want inserted each time.
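For instance, here's a sketch of both variations, assuming a Connection object conn like the one we create below:
# Append rows to an existing table instead of failing
final_nominations.to_sql("nominations", conn, if_exists="append", index=False)
# Break the insert into chunks of 50 rows at a time
final_nominations.to_sql("nominations", conn, chunksize=50, index=False)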
Since we're creating a database from scratch, we need to create a database file first so we can connect to it and export our data. To create a new database file, we use the sqlite3 library to connect to a file path that doesn't exist yet. If Python can't find the file we specified, it will create it for us and treat it as a SQLite database file.
SQLite doesn't have a special file format, and you can use any file extension you'd like when creating a SQLite database. We generally use the .db extension, which isn't commonly used by other applications.
In [14]:
# Create the SQLite database "nominations.db" and connect to it
import sqlite3
conn = sqlite3.connect('../data/nominations.db')
In [15]:
# Use the DataFrame method "to_sql" to export final_nominations
final_nominations.to_sql("nominations", conn, index=False)
In [16]:
# Create a Connection using sqlite3.connect
conn = sqlite3.connect('../data/nominations.db')
In [17]:
# Explore the database to make sure the nominations table looks OK
# Return and print the schema using "pragma table_info()"
query_one = "pragma table_info(nominations);"
print(conn.execute(query_one).fetchall())
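Each tuple returned by pragma table_info describes one column in the table: the column id, its name, its declared type, whether it's NOT NULL, its default value, and whether it's part of the primary key.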
In [18]:
# Return and print the first 10 rows using the SELECT and LIMIT statements
query_two = "select * from nominations limit 10;"
print(conn.execute(query_two).fetchall())
In [19]:
# Once you're done, close the Connection
conn.close()
As a data scientist, it's important to learn many tools and how to use them together to accomplish your goals. As you do more guided projects, you'll become more familiar with the strengths and weaknesses of each tool. For example, you've probably noticed that data cleaning is much easier in Pandas than in SQL.