Introduction To The Data

So far, we've learned how to write SQL queries to interact with existing databases. In this guided project, you'll learn how to clean a CSV dataset and add it to a SQLite database.

We'll work with data on Academy Award nominations, which can be downloaded here. The Academy Awards, also known as the Oscars, is an annual awards ceremony hosted to recognize the achievements in the film industry. There are many different awards categories and the members of the academy vote every year to decide which artist or film should get the award. The awards categories have changed over the years, and you can learn more about when categories were added on Wikipedia.

Here are the columns in the dataset, academy_awards.csv:

  • Year - the year of the awards ceremony.
  • Category - the category of award the nominee was nominated for.
  • Nominee - the person nominated for the award.
  • Additional Info - this column contains additional info like:
    • the movie the nominee participated in.
    • the character the nominee played (for acting awards).
  • Won? - this column contains either YES or NO depending on if the nominee won the award.

Read in the dataset into a Dataframe and explore it to become more familiar witht he data. Once you've cleaned the dataset, you'll use a Pandas helper method to export the data into a SQLite database.


In [1]:
# Import pandas and read the CSV file academy_awards.csv into a DataFrame
import pandas as pd
df = pd.read_csv('../data/academy_awards.csv', encoding="ISO-8859-1")

In [2]:
# Start exploring the data in Pandas and look for data quality issues
df.head()


Out[2]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 (83rd) Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN
1 2010 (83rd) Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN NaN NaN NaN NaN
2 2010 (83rd) Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN NaN NaN NaN NaN
3 2010 (83rd) Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN NaN NaN NaN NaN
4 2010 (83rd) Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN NaN NaN NaN NaN

In [3]:
# There are 6 unnamed columns at the end.  Do any of them contain valid values?
cols = df.columns
for column in cols:
    if column.startswith('Unnamed:'):
        print('\ncolumn {}\n:{}'.format(column, df[column].value_counts()))


column Unnamed: 5
:*                                                                                                               7
 resilience                                                                                                     1
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 discoverer of stars                                                                                            1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
Name: Unnamed: 5, dtype: int64

column Unnamed: 6
:*                                                                   9
 flexibility and water resistance                                   1
 sympathetic                                                        1
 direct radiator bass style cinema loudspeaker systems. [Sound]"    1
Name: Unnamed: 6, dtype: int64

column Unnamed: 7
: kindly                                               1
*                                                     1
 while requiring no dangerous solvents. [Systems]"    1
Name: Unnamed: 7, dtype: int64

column Unnamed: 8
: understanding comedy genius - Mack Sennett.""    1
*                                                 1
Name: Unnamed: 8, dtype: int64

column Unnamed: 9
:*    1
Name: Unnamed: 9, dtype: int64

column Unnamed: 10
:*    1
Name: Unnamed: 10, dtype: int64

In [4]:
# Additional Info column contains a few different formatting styles.
# Start brainstorming ways to clean this column up.
col_str = "Biutiful {'Uxbal'}"
col_split = col_str.split(' ')
movie_name = col_split[0]
character_name = col_split[1].split("'")[1]
print(movie_name)
print(character_name)


Biutiful
Uxbal

Filtering the Data

The dataset is incredibly messy and you may have noticed many inconsistencies that make it hard to work with. Most columns don't have consistent formatting, which is incredibly important when we use SQL to query the data later on. Other columns vary in the information they convey based on the type of awards category that row corresponds to.

In the SQL and Databases: Intermediate course, we worked with a subset of the same dataset. This subset contained only the nominations from years 2001 to 2010 and only the following awards categories:

  • Actor -- Leading Role
  • Actor -- Supporting Role
  • Actress -- Leading Role
  • Actress -- Supporting Role

Let's filter our Dataframe to the same subset so it's more manageable.


In [5]:
# Before we filter the data, let's clean up the Year column
import numpy as np
df["Year"] = df["Year"].str[0:4]

# Convert the Year column to the int64 data type
df["Year"] = df["Year"].astype(np.int64)
df.dtypes


Out[5]:
Year                int64
Category           object
Nominee            object
Additional Info    object
Won?               object
Unnamed: 5         object
Unnamed: 6         object
Unnamed: 7         object
Unnamed: 8         object
Unnamed: 9         object
Unnamed: 10        object
dtype: object

In [6]:
# Use conditional filtering to select only the rows where Year > 2000
later_than_2000 = df[df["Year"] > 2000]

In [7]:
# Use conditional filtering to select rows where Category matches
award_categories = ['Actor -- Leading Role', 
                    'Actor -- Supporting Role', 
                    'Actress -- Leading Role', 
                    'Actress -- Supporting Role']
nominations = later_than_2000[later_than_2000["Category"].isin(award_categories)]
nominations.head()


Out[7]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN NaN NaN NaN NaN
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN NaN NaN NaN NaN
3 2010 Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN NaN NaN NaN NaN
4 2010 Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN NaN NaN NaN NaN

Cleaning Up the Won? and Unnamed Columns

Since SQLite uses the integers 0 and 1 to represent Boolean values, convert the Won? column to reflect this. Also rename the Won? column to Won so that it's consistent with the other column names. Finally, get rid of the 6 extra, unnamed columns, since they contain only null values in our filtered Dataframe nominations.


In [8]:
# Use Series method map to replace all NO values with 0 and YES with 1
replacements = {"NO": 0, "YES": 1}
nominations["Won?"] = nominations["Won?"].map(replacements)
nominations.head()


/Users/toddleonhardt/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
Out[8]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} 0 NaN NaN NaN NaN NaN NaN
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} 0 NaN NaN NaN NaN NaN NaN
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} 0 NaN NaN NaN NaN NaN NaN
3 2010 Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} 1 NaN NaN NaN NaN NaN NaN
4 2010 Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} 0 NaN NaN NaN NaN NaN NaN

In [9]:
# Create a new column Won that contains the values from the Won? column
nominations["Won"] = nominations["Won?"]


/Users/toddleonhardt/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [10]:
# Use the drop method to remove the extraneous columns
drop_cols = ['Won?', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 
             'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10']
final_nominations = nominations.drop(drop_cols, axis=1)
final_nominations.head()


Out[10]:
Year Category Nominee Additional Info Won
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} 0
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} 0
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} 0
3 2010 Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} 1
4 2010 Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} 0

Cleaning Up the Additional Info Column

Now clean up the Additional Info column, whose values are formatted like so: MOVIE {'CHARACTER'}

Here are some examples:

  • Biutiful {'Uxbal'} - Biutiful is the movie and Uxbal is the character this nominee played.
  • True Grit {'Rooster Cogburn'} - True Grit is the movie and Rooster Cogburn is the character this nominee played.
  • The Social Network {'Mark Zuckerberg'} - The Social Network is the movie and Mark Zuckerberg is the character this nominee played.

The values in this column contain the movie name and the character the nominee played. Instead of keeping these values in 1 column, split them up into 2 different columns for easier querying.


In [11]:
# Use vectorized string metods to clean up the Additional Info column
additional_info_one = final_nominations["Additional Info"].str.rstrip("'}")
additional_info_two = additional_info_one.str.split(" {'")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop("Additional Info", axis=1)
final_nominations.head()


Out[11]:
Year Category Nominee Won Movie Character
0 2010 Actor -- Leading Role Javier Bardem 0 Biutiful Uxbal
1 2010 Actor -- Leading Role Jeff Bridges 0 True Grit Rooster Cogburn
2 2010 Actor -- Leading Role Jesse Eisenberg 0 The Social Network Mark Zuckerberg
3 2010 Actor -- Leading Role Colin Firth 1 The King's Speech King George VI
4 2010 Actor -- Leading Role James Franco 0 127 Hours Aron Ralston

Exporting to SQLite

Now that our Dataframe is cleaned up, let's write these records to a SQL database. We can use the Pandas Dataframe method to_sql to create a new table in a database we specify. This method has 2 required parameters:

  • name - string corresponding to the name of the table we want created. The rows from our Dataframe will be added to this table after it's created.
  • conn - the Connection instance representing the database we want to add to.

Behind the scenes, Pandas creates a table and uses the first parameter to name it. Pandas uses the data types of each of the columns in the Dataframe to create a SQLite schema for this table. Since SQLite uses integer values to represent Booleans, it was important to convert the Won column to the integer values 0 and 1. We also converted the Year column to the integer data type, so that this column will have the appropriate type in our table.

After creating the table, Pandas creates a large INSERT query and runs it to insert the values into the table. We can customize the behavior of the to_sql method using its parameters. For example, if we wanted to append rows to an existing SQLite table instead of creating a new one, we can set the if_exists parameter to "append". By default, if_exists is set to "fail" and no rows will be inserted if we specify a table name that already exists. If we're inserting a large number of records into SQLite and we want to break up the inserting of records into chunks, we can use the chunksize parameter to set the number of rows we want inserted each time.

Since we're creating a database from scratch, we need to create a database file first so we can connect to it and export our data. To create a new database file, we use the sqlite3 library to connect to a file path that doesn't exist yet. If Python can't find the file we specified, it will create it for us and treat it as a SQLite database file.

SQLite doesn't have a special file format and you can use any file extension you'd like when creating a SQLite database. We generally use the .db extension, which isn't a file extension that's generally used for other applications.


In [14]:
# Create the SQLite database "nominations.db" and connect to it
import sqlite3
conn = sqlite3.connect('../data/nominations.db')

In [15]:
# Use the DataFrame method "to_sql" to export final_nominations
final_nominations.to_sql("nominations", conn, index=False)

Verifying In SQL

Let's now query the database to make sure everything worked as expected.


In [14]:
import sqlite3

In [16]:
# Create a Connection using sqlite3.connect
conn = sqlite3.connect('../data/nominations.db')

In [17]:
# Explore the database to make sure the nominations table looks OK

# Return and print the schema using "pragma table_info()"
query_one = "pragma table_info(nominations);"
print(conn.execute(query_one).fetchall())


[(0, 'Year', 'INTEGER', 0, None, 0), (1, 'Category', 'TEXT', 0, None, 0), (2, 'Nominee', 'TEXT', 0, None, 0), (3, 'Won', 'INTEGER', 0, None, 0), (4, 'Movie', 'TEXT', 0, None, 0), (5, 'Character', 'TEXT', 0, None, 0)]

In [18]:
# Return and print the first 10 rows using the SELECT and LIMIT statements
query_two = "select * from nominations limit 10;"
print(conn.execute(query_two).fetchall())


[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0, 'Biutiful', 'Uxbal'), (2010, 'Actor -- Leading Role', 'Jeff Bridges', 0, 'True Grit', 'Rooster Cogburn'), (2010, 'Actor -- Leading Role', 'Jesse Eisenberg', 0, 'The Social Network', 'Mark Zuckerberg'), (2010, 'Actor -- Leading Role', 'Colin Firth', 1, "The King's Speech", 'King George VI'), (2010, 'Actor -- Leading Role', 'James Franco', 0, '127 Hours', 'Aron Ralston'), (2010, 'Actor -- Supporting Role', 'Christian Bale', 1, 'The Fighter', 'Dicky Eklund'), (2010, 'Actor -- Supporting Role', 'John Hawkes', 0, "Winter's Bone", 'Teardrop'), (2010, 'Actor -- Supporting Role', 'Jeremy Renner', 0, 'The Town', 'James Coughlin'), (2010, 'Actor -- Supporting Role', 'Mark Ruffalo', 0, 'The Kids Are All Right', 'Paul'), (2010, 'Actor -- Supporting Role', 'Geoffrey Rush', 0, "The King's Speech", 'Lionel Logue')]

In [19]:
# Once you're done, close the Connection
conn.close()

Next Steps

As a data scientist, it's important to learn many tools and how to use them together to accomplish what you need to. As you do more guided projects, you'll become more familiar with the strengths and weaknesses of each tool. For example, you probably have noticed that data cleaning is much easier in Pandas than in SQL.

  • For next steps, explore the rest of our original dataset academy_awards.csv and brainstorm how to fix the rest of the dataset:
    • The awards categories in older ceremonies were different than the ones we have today. What relevant information should we keep from older ceremonies?
    • What are all the different formatting styles that the Additional Info column contains. Can we use tools like regular expressions to capture these patterns and clean them up?
      • The nominations for the Art Direction category have lengthy values for Additional Info. What information is useful and how do we extract it?
      • Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data? What tradeoffs do we make by doing so?
    • What's the best way to handle awards ceremonies that included movies from 2 years? *E.g. see 1927/28 (1st) in the Year column.

In [ ]: