In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
import re
%matplotlib inline
%autosave 120
%run helper_functions.py


Autosaving every 120 seconds

This notebook contains the final bits of code used to clean up my data before it was ready for analysis.

All commands have been commented out to avoid any data corruption/errors.

Data cleaning is an incredibly important skill. In some sense, I am thankful that this project was messy, as it pushed me to be creative in my cleaning techniques.

The entire scraping and cleaning process for the project took approximately 1 week.


In [713]:
# regex_runtime = r" minutes"
# subset = ""

In [714]:
# who

In [715]:
# movie_db = unpickle_object("movie_database_final.pkl")

In [716]:
# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame from the dict.
# movie_df = pd.DataFrame.from_dict(movie_db,
#                             orient='index',
#                             columns=['A','B','C','D',"E",'F',"G", 'H', "I", "J", "K", "L"])
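A dictionary that maps each title to a list of scraped fields can be turned into a DataFrame with `DataFrame.from_dict`. The older `DataFrame.from_items` was deprecated in pandas 0.23 and removed in 1.0. The sketch below uses a small made-up dictionary (`toy_db` and its values are hypothetical, not the real `movie_db`) to show the shape of the result.

```python
import pandas as pd

# Hypothetical stand-in for movie_db: title -> list of scraped fields.
toy_db = {
    "Movie A": ["1.", "8.5", "120"],
    "Movie B": ["2.", "7.9", "95"],
}

# orient="index" turns each key into a row label;
# `columns` names the positions of each list.
toy_df = pd.DataFrame.from_dict(toy_db, orient="index", columns=["A", "B", "C"])
print(toy_df)
```

Every value is still a string at this point, which is why the type conversions later in the notebook are needed.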

In [717]:
# del movie_df['F']

In [718]:
# movie_df.columns = ['Rank_in_genre', "rotton_rating_(/10)", "No._of_reviews_rotton","Tomato_Freshness_(%)", "audience_rating_(/5)", "Date", "Runtime", "Box_office", "Budget", "Country","Language" ]

In [719]:
# movie_df.head()

Alright, let's do some clean up!!


In [720]:
# movie_df['Rank_in_genre'] = movie_df['Rank_in_genre'].apply(lambda x: x.strip("."))
# movie_df['Rank_in_genre'] = movie_df['Rank_in_genre'].apply(lambda x: float(x))
# movie_df['rotton_rating_(/10)'] = movie_df['rotton_rating_(/10)'].apply(lambda x: float(x))
# movie_df["No._of_reviews_rotton"] = movie_df['No._of_reviews_rotton'].apply(lambda x: int(x))
# movie_df['Tomato_Freshness_(%)'] = movie_df['Tomato_Freshness_(%)'].apply(lambda x: x.strip("%"))
# movie_df['Tomato_Freshness_(%)'] = movie_df['Tomato_Freshness_(%)'].apply(lambda x: float(x))
# movie_df['audience_rating_(/5)'] = movie_df['audience_rating_(/5)'].apply(lambda x: x.strip("/5"))
# movie_df['audience_rating_(/5)'] = movie_df['audience_rating_(/5)'].apply(lambda x: float(x))
# movie_df['Date'] = movie_df['Date'].apply(lambda x: dt.strptime(x, '%Y-%m-%d'))
# movie_df['Month'] = movie_df["Date"].apply(lambda x: x.month)
# movie_df['Runtime'] = movie_df['Runtime'].apply(lambda x: str(x))
# movie_df['Runtime'] = movie_df['Runtime'].apply(lambda x: x.strip())
# movie_df['Runtime'] = movie_df['Runtime'].apply(lambda x: re.sub(regex_runtime, subset, x))
# movie_df['Runtime'] = movie_df['Runtime'].apply(lambda x: int(x))
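One thing to watch with the conversions above: `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal suffix. If a rating were scraped as `"4.5/5"` (an assumption about the raw format), `strip("/5")` would also remove the 5 after the decimal point. Below is a minimal sketch of a safer variant using anchored regular expressions and `astype`; the `raw` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values; the real scraped strings may look different.
raw = pd.DataFrame({
    "audience_rating_(/5)": ["4.5/5", "3.2/5"],
    "Tomato_Freshness_(%)": ["95%", "100%"],
    "Runtime": [" 120 minutes", "95 minutes "],
})

# strip() removes any of the characters '/' and '5' from both ends.
stripped = "4.5/5".strip("/5")
print(stripped)  # 4.

# Anchoring the pattern with $ removes only the literal suffix.
clean = pd.DataFrame(index=raw.index)
clean["audience_rating_(/5)"] = (
    raw["audience_rating_(/5)"].str.replace(r"/5$", "", regex=True).astype(float)
)
clean["Tomato_Freshness_(%)"] = (
    raw["Tomato_Freshness_(%)"].str.replace(r"%$", "", regex=True).astype(float)
)
clean["Runtime"] = (
    raw["Runtime"].str.strip().str.replace(r" minutes$", "", regex=True).astype(int)
)
print(clean)
```

The vectorized `.str` methods and `astype` do the same job as the `apply(lambda ...)` calls, one column at a time.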

In [721]:
# movie_df['Box_office'].unique()

In [722]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: str(x))

In [723]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip())

In [724]:
# billions

In [725]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" million"))

In [726]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("\xa0"))

In [727]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("$"))

In [728]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" million<"))

In [729]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("¥"))

In [730]:
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("$"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" b"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("\xa0"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" million USD"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("  billion\n$159.4 million)"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("¥"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" billion\n$28"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" million<"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(".682.627.806"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" million\n£'"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip("million\n\n$"))
# movie_df['Box_office'] = movie_df['Box_office'].apply(lambda x: x.strip(" billion toman (Ira"))
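The chain of `strip` calls above was tuned by hand to the values that `unique()` returned. Because `strip` removes a set of characters rather than a literal string, arguments that contain digits (such as `".682.627.806"`) can also trim digits from other rows. As an alternative, here is a rough sketch that pulls the first number and an optional magnitude word out of each string with one regular expression. `parse_amount` is a hypothetical helper, not part of `helper_functions.py`, and it assumes commas are thousands separators and ignores currency, so ¥ or £ amounts are not converted.

```python
import re

# Multipliers for the magnitude words seen in the box office strings.
_SCALE = {"million": 1e6, "billion": 1e9}

# First number (commas as thousands separators, optional decimals),
# optionally followed by a magnitude word.
_AMOUNT = re.compile(r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|billion)?", re.IGNORECASE)


def parse_amount(text):
    """Return the first amount found in `text` as a float, or None.

    Hypothetical helper: the currency symbol is ignored.
    """
    # Non-breaking spaces (\xa0) show up in scraped text; treat them as spaces.
    match = _AMOUNT.search(str(text).replace("\xa0", " "))
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    scale = _SCALE.get((match.group(2) or "").lower(), 1.0)
    return number * scale


print(parse_amount("$28 million"))         # 28000000.0
print(parse_amount("\xa0$1.5\xa0billion"))  # 1500000000.0
print(parse_amount("$2,500"))              # 2500.0
print(parse_amount("N/A"))                 # None
```

Applied to the column, this would look like `movie_df['Box_office'].apply(parse_amount)`. Rows with several amounts or with dots used as thousands separators would still need a manual check.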

At this point, our dataframe is cleaned up through the box office column! Let's continue!


In [17]:
# def replacer(array, expression, change):  # good helper function to quickly replace bad formatting
#     for index, value in enumerate(array):
#         if expression in value:
#             array[index] = array[index].replace(expression, change)
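For reference, here is how a helper like `replacer` behaves when uncommented. It edits the sequence in place and writes back by position, which is safe for a plain list. On a pandas Series indexed by title, integer positions and labels are not the same thing, so the sketch also shows the vectorized `Series.str.replace` as an alternative. All sample values are hypothetical.

```python
import pandas as pd


def replacer(array, expression, change):
    """Replace `expression` with `change` in every string of `array`, in place."""
    for index, value in enumerate(array):
        if expression in value:
            array[index] = value.replace(expression, change)


# On a plain list, enumerate() positions are valid indices.
countries = ["United States", "US", "United Kingdom"]  # hypothetical values
replacer(countries, "United States", "US")
print(countries)  # ['US', 'US', 'United Kingdom']

# On a Series with non-integer labels, the vectorized method avoids
# mixing up positions and labels. regex=False matches the literal text.
languages = pd.Series(
    ["English ", "French ", "English"],  # hypothetical values
    index=["Film A", "Film B", "Film C"],
)
languages = languages.str.replace(" ", "", regex=False)
print(languages.tolist())  # ['English', 'French', 'English']
```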

Country and language cleaned up! We can finally start analysis!