GP09: Preparing Data For SQLite

1. Introduction to the data

a. Import pandas and the csv file


In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
df = pd.read_csv("../data/GP09/academy_awards.csv", encoding="ISO-8859-1")

b. Exploring the data

View the first 5 lines. Have a look at the last few unnamed comlumn for any values.


In [2]:
print(df.head(5))


          Year               Category          Nominee  \
0  2010 (83rd)  Actor -- Leading Role    Javier Bardem   
1  2010 (83rd)  Actor -- Leading Role     Jeff Bridges   
2  2010 (83rd)  Actor -- Leading Role  Jesse Eisenberg   
3  2010 (83rd)  Actor -- Leading Role      Colin Firth   
4  2010 (83rd)  Actor -- Leading Role     James Franco   

                          Additional Info Won? Unnamed: 5 Unnamed: 6  \
0                      Biutiful {'Uxbal'}   NO        NaN        NaN   
1           True Grit {'Rooster Cogburn'}   NO        NaN        NaN   
2  The Social Network {'Mark Zuckerberg'}   NO        NaN        NaN   
3    The King's Speech {'King George VI'}  YES        NaN        NaN   
4              127 Hours {'Aron Ralston'}   NO        NaN        NaN   

  Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10  
0        NaN        NaN        NaN         NaN  
1        NaN        NaN        NaN         NaN  
2        NaN        NaN        NaN         NaN  
3        NaN        NaN        NaN         NaN  
4        NaN        NaN        NaN         NaN  

In [3]:
result = df.loc[: , "Unnamed: 5":"Unnamed: 10"].apply(pd.value_counts);
print(result.head(5))


                                                    Unnamed: 5  Unnamed: 6  \
 D.B. "Don" Keele and Mark E. Engebretson has r...         1.0         NaN   
 direct radiator bass style cinema loudspeaker ...         NaN         1.0   
 discoverer of stars                                       1.0         NaN   
 error-prone measurements on sets. [Digital Ima...         1.0         NaN   
 flexibility and water resistance                          NaN         1.0   

                                                    Unnamed: 7  Unnamed: 8  \
 D.B. "Don" Keele and Mark E. Engebretson has r...         NaN         NaN   
 direct radiator bass style cinema loudspeaker ...         NaN         NaN   
 discoverer of stars                                       NaN         NaN   
 error-prone measurements on sets. [Digital Ima...         NaN         NaN   
 flexibility and water resistance                          NaN         NaN   

                                                    Unnamed: 9  Unnamed: 10  
 D.B. "Don" Keele and Mark E. Engebretson has r...         NaN          NaN  
 direct radiator bass style cinema loudspeaker ...         NaN          NaN  
 discoverer of stars                                       NaN          NaN  
 error-prone measurements on sets. [Digital Ima...         NaN          NaN  
 flexibility and water resistance                          NaN          NaN  

2. Filtering The Data

a. Clean up data

Extract the year and change to integer type. Assign the new series to the Year column.


In [4]:
df["Year"] = df["Year"].str[0:4]
df["Year"] = df["Year"].astype("int64")

b. Conditional FIltering: Later than year 2000

We are only inetersted in the nomination after the year 2000.


In [5]:
later_than_2000 = df[df["Year"] > 2000]

c. Conditional FIltering: Nomination

Filter out the categories we are interested in.


In [6]:
award_categories = ["Actor -- Leading Role","Actor -- Supporting Role", "Actress -- Leading Role", "Actress -- Supporting Role"]
nominations = later_than_2000[later_than_2000["Category"].isin(award_categories)]

3. Cleaning Up The Won? And Unnamed Columns

Mapping data to the SQL format and dropping unused columns.


In [7]:
replacements = { "NO": 0, "YES": 1 }
nominations["Won?"] = nominations["Won?"].map(replacements)
nominations["Won"] = nominations["Won?"]
drop_cols = ["Won?","Unnamed: 5", "Unnamed: 6","Unnamed: 7", "Unnamed: 8", "Unnamed: 9", "Unnamed: 10"]
final_nominations = nominations.drop(drop_cols, axis=1)
print(nominations["Won"].head(5))


0    0
1    0
2    0
3    1
4    0
Name: Won, dtype: int64

4. Cleaning Up The Additional Info Column


In [8]:
print(final_nominations.head(5))


   Year               Category          Nominee  \
0  2010  Actor -- Leading Role    Javier Bardem   
1  2010  Actor -- Leading Role     Jeff Bridges   
2  2010  Actor -- Leading Role  Jesse Eisenberg   
3  2010  Actor -- Leading Role      Colin Firth   
4  2010  Actor -- Leading Role     James Franco   

                          Additional Info  Won  
0                      Biutiful {'Uxbal'}    0  
1           True Grit {'Rooster Cogburn'}    0  
2  The Social Network {'Mark Zuckerberg'}    0  
3    The King's Speech {'King George VI'}    1  
4              127 Hours {'Aron Ralston'}    0  

In [9]:
additional_info_one = final_nominations["Additional Info"].str.rstrip("'}")
additional_info_two = additional_info_one.str.split(" {'")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop("Additional Info", axis=1)
final_nominations
print(final_nominations.head(5))


   Year               Category          Nominee  Won               Movie  \
0  2010  Actor -- Leading Role    Javier Bardem    0            Biutiful   
1  2010  Actor -- Leading Role     Jeff Bridges    0           True Grit   
2  2010  Actor -- Leading Role  Jesse Eisenberg    0  The Social Network   
3  2010  Actor -- Leading Role      Colin Firth    1   The King's Speech   
4  2010  Actor -- Leading Role     James Franco    0           127 Hours   

         Character  
0            Uxbal  
1  Rooster Cogburn  
2  Mark Zuckerberg  
3   King George VI  
4     Aron Ralston  

5. Exporting To SQLite

Write records to SQL Database. Create the SQLite database nominations.db and connect to it. Use the Dataframe method to_sql to export final_nominations to nominations.db.


In [10]:
import sqlite3
conn = sqlite3.connect("../data/GP09/nominations.db")
final_nominations.to_sql("nominations", conn, index=False)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-9d466d8987d1> in <module>()
      1 import sqlite3
      2 conn = sqlite3.connect("../data/GP09/nominations.db")
----> 3 final_nominations.to_sql("nominations", conn, index=False)

/home/jasonm_dev/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py in to_sql(self, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
   1360         sql.to_sql(self, name, con, flavor=flavor, schema=schema,
   1361                    if_exists=if_exists, index=index, index_label=index_label,
-> 1362                    chunksize=chunksize, dtype=dtype)
   1363 
   1364     def to_pickle(self, path, compression='infer'):

/home/jasonm_dev/anaconda3/lib/python3.5/site-packages/pandas/io/sql.py in to_sql(frame, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
    469     pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
    470                       index_label=index_label, schema=schema,
--> 471                       chunksize=chunksize, dtype=dtype)
    472 
    473 

/home/jasonm_dev/anaconda3/lib/python3.5/site-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype)
   1503                             if_exists=if_exists, index_label=index_label,
   1504                             dtype=dtype)
-> 1505         table.create()
   1506         table.insert(chunksize)
   1507 

/home/jasonm_dev/anaconda3/lib/python3.5/site-packages/pandas/io/sql.py in create(self)
    586         if self.exists():
    587             if self.if_exists == 'fail':
--> 588                 raise ValueError("Table '%s' already exists." % self.name)
    589             elif self.if_exists == 'replace':
    590                 self.pd_sql.drop_table(self.name, self.schema)

ValueError: Table 'nominations' already exists.

6. Verifying In SQL

Query database to check if it works.


In [ ]:
query_one = "pragma table_info(nominations);"
query_two = "select * from nominations limit 10;"
print(conn.execute(query_one).fetchall())
print(conn.execute(query_two).fetchall())
conn.close()

In [ ]: