While waiting for Star Wars: The Force Awakens, the team at FiveThirtyEight was interested in answering some questions about Star Wars fans. One question that particularly interested the team was: Does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data before they could get started answering this question. They used SurveyMonkey, an online survey tool, to survey Star Wars fans. They received 835 responses total, which you can find here.
In this project, you'll be cleaning and exploring the dataset in Jupyter.
We need to specify an encoding because the dataset has some characters that aren't in the Python default utf-8 encoding. You can read more about character encodings here.
The data has several columns, including:
This dataset needs a lot of cleaning, which makes it a good place to practice the skills you've been learning so far. The first step you'll take is to remove invalid rows. RespondentID is supposed to be a unique ID for each respondent, but it's blank in some rows. You'll need to remove any rows with an invalid RespondentID.
In [1]:
# Import the necessary stuff
import pandas as pd
In [2]:
# Read the dataset into a Pandas DataFrame
star_wars = pd.read_csv("../data/star_wars.csv", encoding="ISO-8859-1")
In [3]:
# Explore the data by looking at the first 10 rows.
star_wars.head(10)
# NOTE: Lots of NaN, even for stuff like RespondentID. Needs cleaning.
Out[3]:
In [4]:
# Look at the column names
star_wars.columns
Out[4]:
In [5]:
# Remove any rows where RespondentId is NaN.
star_wars = star_wars[pd.notnull(star_wars["RespondentID"])]
star_wars.head()
Out[5]:
Take a look at the next two columns, which are:
You'll see that they are both Yes/No questions. There's another possible value they can take on, NaN, when a respondent chose not to answer that question. You can use the value_counts() method on a Series to see all the unique vlaues ina column, and the counts of each value.
Both columns are currently string types -- Yes or No. iN order to make it a bit easier to analyze down the line, convert each column to a Boolean, with only True, False, or NaN. Booleans are easier to work with because you can select the rows that are True or False without having to do string comparisons.
In [6]:
yes_no = {'Yes': True, 'No': False}
col1 = 'Have you seen any of the 6 films in the Star Wars franchise?'
col2 = 'Do you consider yourself to be a fan of the Star Wars film franchise?'
# Convert both these columns to bollean
star_wars[col1] = star_wars[col1].map(yes_no)
star_wars[col2] = star_wars[col2].map(yes_no)
# At the end, both columsn should only have True, False, or NaN values
star_wars.head()
Out[6]:
The next 6 columns represent a checkbox question. The respondent was asked 'Which of the following Star Wars films have you seen? Please select all that apply.", and then was able to check off a series of boxes indicating which movies they saw.
The columns that represent this data are:
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw it. If the value is NaN, the respondent either didn't answer, or didn't see the movie, but we'll assume that they didn't see the movie.
You need to convert each of these columns to a Boolean, then rename the column to have a more clear name
In [7]:
# For each column above, convert the column to only contain True and False
import numpy as np
movie_mapping = {
"Star Wars: Episode I The Phantom Menace": True,
np.nan: False,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True
}
for col in star_wars.columns[3:9]:
star_wars[col] = star_wars[col].map(movie_mapping)
star_wars.head()
Out[7]:
In [8]:
# Rename each of the columns above to have more clear names
star_wars = star_wars.rename(columns={
"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6"
})
star_wars.head()
Out[8]:
The next 6 columns ask the respondent to rank the Star Wars movies in order of least to most favorite. 1 means the film was their most favorite, and 6 means it was their least favorite. Each of the following columns can contain the values 1, 2, 3, 4, 5, 6, or NaN:
You don't need to do a ton of cleanup for these columns. You'll need to convert each column to a numeric type, then rename the columns so you can tell what they're for more easily.
In [9]:
# Convert each column above to a float type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
star_wars.head()
Out[9]:
In [10]:
# Rename each of the above columsn to a more descriptive name.
star_wars = star_wars.rename(columns={
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6"
})
star_wars.head()
Out[10]:
In [11]:
# Use the mean metod to compute the mean of each of the ranking columns
star_wars[star_wars.columns[9:15]].mean()
Out[11]:
In [12]:
# Make a bar char of each ranking
%matplotlib inline
import matplotlib.pyplot as plt
plt.bar(range(6), star_wars[star_wars.columns[9:15]].mean())
Out[12]:
Remember, here a lower ranking is better! So as suspected, overall the original movies are rated higher than the newer ones and Empire Strikes Back is rated best of all with Return of the Jedi coming in 2nd.
You cleaned up the seen columns earlier, and converted the values to the Boolean type. When you call methods like sum() or mean(), Booleans are treated like integers -- Trueis treated like a 1, and False is treated like a 0. This makes it easy to figure out how many people have seen each movie -- we just take the sum of the column.
In [13]:
# Uae the sum method to compute the sum of each of the seen columns
star_wars[star_wars.columns[3:9]].sum()
Out[13]:
In [14]:
# Plot the view counts
plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum())
Out[14]:
It appears that the original movies were seen more than the newer ones.
We've seen what the whole survey population thinks are the highest ranked movies, but we can break this down by segments. There are several columns that segment our data into two groups, including:
In [15]:
# Split the data into 2 groups based on one of the above columns
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
In [16]:
# Plot rankings separately for men and women
plt.bar(range(6), males[males.columns[9:15]].mean())
plt.show()
plt.bar(range(6), females[females.columns[9:15]].mean())
plt.show()
It looks like women liked the newer movies more than men, at least the first one or two.
In [17]:
# Plot view couns by gender
plt.bar(range(6), males[males.columns[3:9]].sum())
plt.show()
plt.bar(range(6), females[females.columns[3:9]].sum())
plt.show()
Interestingly, more males watches episodes 1-3, but males liked them far less than females did.
Here are some potential next steps:
In [ ]: