The Avengers are a well-known and widely loved team of superheroes in the Marvel universe, introduced in the 1960s in the original comic book series. They have since regained popularity through the recent Disney movies that are part of the Marvel Cinematic Universe.
The team at FiveThirtyEight wanted to dissect the deaths of the Avengers in the comics over the years. Because the writers were known to kill off and revive many of the superheroes, the team was curious to see what data they could gather from the Marvel Wikia site, a fan-driven community site. To learn how they collected their data, which is available in their GitHub repo, read the writeup they published on their site.
In [1]:
# %sh
# wget https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv
# ls -l
While the FiveThirtyEight team has done a wonderful job acquiring this data, the data still has some inconsistencies. Your mission, if you choose to accept it, is to clean up their dataset so it can be more useful for analysis in Pandas. Read our dataset into Pandas as a DataFrame and preview the first 5 rows to get a better sense of our data.
In [2]:
import pandas as pd
avengers = pd.read_csv("avengers.csv")
avengers.head(5)
Out[2]:
Since the data was collected from a community site, where most of the contributions came from individual users, there's room for errors to surface in the dataset. If you plot a histogram of the values in the Year column, which describes the year each Avenger was introduced, you'll immediately notice some oddities. Quite a few Avengers look like they were introduced in 1900, which is suspicious: the Avengers weren't introduced in the comic series until the 1960s!
This is clearly a mistake in the data, so you should remove every row with a Year before 1960 from the DataFrame.
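The oddity described above can also be spotted without a plot by tabulating the Year column. The following is a minimal sketch on a made-up frame; the name `toy` and its values are assumptions for illustration only and are not taken from avengers.csv.

```python
import pandas as pd

# Made-up rows for illustration only; the real values come from avengers.csv.
toy = pd.DataFrame({"Year": [1900, 1900, 1963, 1963, 1964, 1988, 2012]})

# Tabulate how many rows fall in each year, ordered chronologically.
year_counts = toy["Year"].value_counts().sort_index()
print(year_counts)

# Rows dated before the team's 1960s debut are the suspicious ones.
suspicious = toy[toy["Year"] < 1960]
print(len(suspicious))

# With matplotlib installed, toy["Year"].hist() would draw these counts as a plot.
```

On the real data, the same two steps would be applied to `avengers["Year"]` instead of the toy column.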
In [3]:
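# Keep only rows from 1960 onward; .copy() yields an independent DataFrame,
# so adding a column to it later does not trigger a SettingWithCopyWarning
true_avengers = avengers[avengers['Year'] >= 1960].copy()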
print('All: ' + str(len(avengers.index)))
print('After 1960: ' + str(len(true_avengers.index)))
We are interested in the total number of deaths each character experienced, and we'd like a single field containing that information. Right now, there are 5 fields (Death1 to Death5), each containing a YES/NO value that records whether a superhero experienced that death. For example, a superhero can experience Death1, then Death2, and so on, until the writers no longer bring them back to life.
We'd like to coalesce that information into just one field so we can do numerical analysis more easily.
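To make the idea concrete, here is a minimal sketch of one way to coalesce the five fields, using a vectorized comparison rather than a row-by-row function. The frame `toy`, its values, and the name `death_columns` are hypothetical, and treating blank cells as "no death" is an assumption of this sketch.

```python
import pandas as pd

death_columns = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# Made-up rows for illustration only; None stands in for a blank cell.
toy = pd.DataFrame(
    [
        ["YES", "YES", "NO", None, None],
        ["NO", None, None, None, None],
        ["YES", "NO", None, None, None],
    ],
    columns=death_columns,
)

# Comparing to "YES" gives a boolean frame; blanks and "NO" both become False.
# Summing across the columns (axis=1) counts the True values in each row.
toy["Deaths"] = (toy[death_columns] == "YES").sum(axis=1)
print(toy["Deaths"].tolist())
```

This avoids calling a Python function once per row, which tends to matter more as a dataset grows; for a small table either approach is reasonable.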
In [4]:
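columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

def death_count(row):
    # Count how many of the Death1..Death5 fields are marked 'YES' in this row
    death = 0
    for column in columns:
        if row[column] == 'YES':
            death += 1
    return death

# Apply the function row by row (axis=1) and store the total in a new column
true_avengers['Deaths'] = true_avengers[columns].apply(death_count, axis=1)
true_avengers['Deaths'].head()
# true_avengers[columns].head()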
Out[4]:
In [5]:
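# Consistency check: count the rows where 'Year' plus 'Years since joining'
# adds up to 2015, the reference year this check assumes
joined_accuracy_count = len(true_avengers[true_avengers['Year'] + true_avengers['Years since joining'] == 2015])
print('Total number of rows: ' + str(len(true_avengers.index)))
print('Accurate rows: ' + str(joined_accuracy_count))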