2: Life And Death Of Avengers

The Avengers are a well-known and widely loved team of superheroes in the Marvel universe who were introduced in the 1960s in the original comic book series. They have since been popularized again through the recent Disney movies that form part of the Marvel Cinematic Universe.

The team at FiveThirtyEight wanted to dissect the deaths of the Avengers in the comics over the years. Because the writers were known to kill off and revive many of the superheroes, the team was curious to see what data they could gather from the Marvel Wikia site, a fan-driven community site. To learn how they collected the data, which is available in their GitHub repo, read the write-up they published on their site.


In [1]:
# %sh

# wget https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv

# ls -l
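
The shell commands in the cell above are commented out. If you prefer to stay in Python, the same download could be sketched with the standard library. The URL is the one from the wget line; the helper name download_avengers and its default filename are assumptions made for this illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the commented-out wget command above.
AVENGERS_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/avengers/avengers.csv"
)


def download_avengers(target="avengers.csv"):
    """Download the CSV once and return its local path (hypothetical helper)."""
    path = Path(target)
    if not path.exists():
        # urlretrieve saves the response body to the given filename.
        urlretrieve(AVENGERS_URL, str(path))
    return path


# Usage (needs network access):
# download_avengers()
```

Skipping the download when the file already exists keeps repeated notebook runs from hitting the network each time.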

3: Exploring The Data

While the FiveThirtyEight team has done a wonderful job acquiring this data, the data still has some inconsistencies. Your mission, if you choose to accept it, is to clean up their dataset so it can be more useful for analysis in Pandas. Read our dataset into Pandas as a DataFrame and preview the first 5 rows to get a better sense of our data.


In [2]:
import pandas as pd

avengers = pd.read_csv("avengers.csv")
avengers.head(5)


Out[2]:
  | URL | Name/Alias | Appearances | Current? | Gender | Probationary Introl | Full/Reserve Avengers Intro | Year | Years since joining | Honorary | ... | Return1 | Death2 | Return2 | Death3 | Return3 | Death4 | Return4 | Death5 | Return5 | Notes
0 | http://marvel.wikia.com/Henry_Pym_(Earth-616) | Henry Jonathan "Hank" Pym | 1269 | YES | MALE | NaN | Sep-63 | 1963 | 52 | Full | ... | NO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Merged with Ultron in Rage of Ultron Vol. 1. A...
1 | http://marvel.wikia.com/Janet_van_Dyne_(Earth-... | Janet van Dyne | 1165 | YES | FEMALE | NaN | Sep-63 | 1963 | 52 | Full | ... | YES | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Dies in Secret Invasion V1:I8. Actually was se...
2 | http://marvel.wikia.com/Anthony_Stark_(Earth-616) | Anthony Edward "Tony" Stark | 3068 | YES | MALE | NaN | Sep-63 | 1963 | 52 | Full | ... | YES | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Death: "Later while under the influence of Imm...
3 | http://marvel.wikia.com/Robert_Bruce_Banner_(E... | Robert Bruce Banner | 2089 | YES | MALE | NaN | Sep-63 | 1963 | 52 | Full | ... | YES | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Dies in Ghosts of the Future arc. However "he ...
4 | http://marvel.wikia.com/Thor_Odinson_(Earth-616) | Thor Odinson | 2402 | YES | MALE | NaN | Sep-63 | 1963 | 52 | Full | ... | YES | YES | NO | NaN | NaN | NaN | NaN | NaN | NaN | Dies in Fear Itself brought back because that'...

5 rows × 21 columns
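
The preview above elides the middle columns behind "...". A minimal sketch of a few ways to see every column follows. The toy frame stands in for avengers, and its values are illustrative only, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for `avengers`; values are illustrative, not real data.
toy = pd.DataFrame({
    "Name/Alias": ["Hero A", "Hero B"],
    "Year": [1963, 1900],
    "Death1": ["YES", "NO"],
    "Return1": ["NO", None],
})

# Every column name, with nothing elided.
column_names = toy.columns.tolist()

# Count of missing values per column, useful for spotting sparse fields.
missing_per_column = toy.isna().sum()

# Transposing the preview shows each column as a row, which usually
# avoids the "..." truncation when there are only a few rows.
transposed_preview = toy.head(5).T

print(column_names)
print(missing_per_column)
print(transposed_preview)
```

The same three calls should work unchanged on the real avengers DataFrame.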

4: Filter Out The Bad Years

Since the data was collected from a community site, where most of the contributions came from individual users, there's room for errors to surface in the dataset. If you plot a histogram of the values in the Year column, which describes the year each Avenger was introduced, you'll immediately notice some oddities. Quite a few Avengers look like they were introduced in 1900, which we know is a little fishy. The Avengers weren't introduced in the comic series until the 1960s!

This is obviously a mistake in the data, and you should remove all Avengers introduced before 1960 from the DataFrame.
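
As a rough, plot-free stand-in for that histogram, the years can be bucketed by decade and counted. This is only a sketch: the years Series below is made up for illustration, and in the notebook you would use avengers['Year'] instead.

```python
import pandas as pd

# Illustrative years only; the real values come from avengers['Year'].
years = pd.Series([1900, 1900, 1963, 1965, 1988, 2013], name="Year")

# Bucket each year into its decade, then count rows per decade.
decade_counts = (years // 10 * 10).value_counts().sort_index()
print(decade_counts)

# Keep only values from 1960 onward, mirroring the filter described above.
plausible = years[years >= 1960]
print(len(years), len(plausible))
```

A cluster of rows in the 1900 bucket, far from every other decade, is the kind of oddity the histogram would reveal.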


In [3]:
true_avengers = avengers[avengers['Year'] >= 1960]

print('All: ' + str(len(avengers.index)))
print('After 1960: ' + str(len(true_avengers.index)))


All: 173
After 1960: 159

5: Consolidating Deaths

We are interested in the total number of deaths each character experienced, and we'd like a single field containing that distilled information. Right now, there are 5 fields (Death1 to Death5), each containing a YES or NO value (or nothing at all) that indicates whether a superhero experienced that death. For example, a superhero can experience Death1, then Death2, and so on, until the writers no longer brought them back to life.

We'd like to coalesce that information into just one field so we can do numerical analysis more easily.


In [4]:
columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']

def death_count(row):
    death = 0
    for column in columns:
        if row[column] == 'YES':
            death += 1
    return death
true_avengers['Deaths'] = true_avengers[columns].apply(death_count, axis=1)
true_avengers['Deaths'].head()
# true_avengers[columns].head()


C:\Users\IBM_ADMIN\Anaconda2\lib\site-packages\ipykernel\__main__.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[4]:
0    1
1    1
2    1
3    1
4    2
Name: Deaths, dtype: int64
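
The SettingWithCopyWarning above appears because true_avengers was created by filtering avengers, so pandas cannot tell whether the new column is meant for the original frame or for the slice. One common way around it, sketched here on a toy frame with illustrative values, is to take an explicit .copy() after filtering. The sketch also swaps the row-wise apply for a vectorized comparison, which is an alternative technique and not the one used in the cell above.

```python
import pandas as pd

# Toy stand-in for `avengers`; values are illustrative, not real data.
toy = pd.DataFrame({
    "Year": [1900, 1963, 1965, 1988],
    "Death1": ["YES", "YES", "NO", "YES"],
    "Death2": [None, "YES", None, "NO"],
    "Death3": [None, "NO", None, None],
    "Death4": [None, None, None, None],
    "Death5": [None, None, None, None],
})
death_columns = ["Death1", "Death2", "Death3", "Death4", "Death5"]

# .copy() makes the filtered frame independent of the original, so adding
# a column to it no longer triggers SettingWithCopyWarning.
filtered = toy[toy["Year"] >= 1960].copy()

# Comparing to 'YES' yields booleans (missing values compare as False);
# summing across each row counts the YES entries.
filtered["Deaths"] = (filtered[death_columns] == "YES").sum(axis=1)
print(filtered["Deaths"])
```

Because the copy is independent, the original frame is left without a Deaths column.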

6: Years Since Joining

For the final task, we want to know whether the Years since joining field accurately reflects the Year column. The values are relative to 2015, so if an Avenger was introduced in 1960, is the Years since joining value for that Avenger 55?


In [5]:
joined_accuracy_count = len(true_avengers[true_avengers['Year'] + true_avengers['Years since joining'] == 2015])

print('Total number of rows: ' + str(len(true_avengers.index)))
print('Accurate rows: ' + str(joined_accuracy_count))


Total number of rows: 159
Accurate rows: 159
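
Here every row passes, but if some had failed, a count alone would not say which ones. A minimal sketch of how mismatching rows could be listed follows. It uses a toy frame with illustrative values (one row is deliberately wrong), and the name REFERENCE_YEAR is an assumption introduced for this example.

```python
import pandas as pd

# Toy stand-in for `true_avengers`; values are illustrative, not real data.
toy = pd.DataFrame({
    "Name/Alias": ["Hero A", "Hero B", "Hero C"],
    "Year": [1963, 1988, 2013],
    "Years since joining": [52, 27, 3],
})

REFERENCE_YEAR = 2015  # the year the check in the cell above assumes

# Recompute the expected value and flag rows that disagree with it.
expected = REFERENCE_YEAR - toy["Year"]
is_accurate = toy["Years since joining"] == expected

print("Accurate rows:", int(is_accurate.sum()))
print(toy[~is_accurate])
```

Keeping the boolean mask around makes it easy to both count the accurate rows and inspect the inaccurate ones.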
