Matplotlib

As it turns out, people are poor at interpreting raw numerical data but can process visual information remarkably quickly, quite the opposite of computers. As such, you will nearly always want some sort of visual to accompany your analysis. In this exercise, we'll use Matplotlib, a plotting package from the SciPy ecosystem with a MATLAB-like syntax, to generate a variety of plots.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Basics Using Generated Data

1 - Generate three arrays of 500 values, $x$, $y_1$, $y_2$ such that $$\{x \mid -2\pi \le x \le 2\pi \}$$ $$y_1 = \sin(x)$$ $$y_2 = \cos(x)$$


In [2]:
x = np.linspace(-2*np.pi, 2*np.pi, 500)
y1 = np.sin(x)
y2 = np.cos(x)
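
A quick sanity check on the generated arrays can catch an off-by-one in the grid before any plotting. The sketch below rebuilds the same arrays and verifies the sample count, the endpoints, and the identity $\sin^2(x) + \cos^2(x) = 1$; it is only an illustration, not part of the exercise.

```python
import numpy as np

# Same grid as above: 500 evenly spaced points, both endpoints included
x = np.linspace(-2 * np.pi, 2 * np.pi, 500)
y1 = np.sin(x)
y2 = np.cos(x)

print(x.shape)        # number of samples
print(x[0], x[-1])    # first and last grid points
# sin^2 + cos^2 should equal 1 everywhere, up to floating-point error
print(np.allclose(y1**2 + y2**2, 1.0))
```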

2 - Using the default settings, use pyplot to plot $y_1$ and $y_2$ versus $x$, all on the same plot.


In [3]:
plt.plot(x, y1)
plt.plot(x, y2);


3 - Generate the same plots, but set the horizontal and vertical limits to be slightly smaller than the default settings. In other words, tighten up the plot a bit.


In [4]:
# Switch to the explicit (object-oriented) interface: create the figure and axes first
fig = plt.figure()
ax = plt.axes()

ax.plot(x, y1)
ax.plot(x, y2)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1);


4 - Generate the same plots using all settings from above, but now change the color and thickness of each from the defaults. Play around with the values a bit until you are satisfied with how they look.


In [5]:
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3)
ax.plot(x, y2, c='cyan', linewidth=3)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1);


5 - Generate the same plots using all settings from above, but now add some custom tickmarks with labels of your choosing. Which values would make sense given the functions we are using?


In [6]:
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3)
ax.plot(x, y2, c='cyan', linewidth=3)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)

# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']));
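
A fixed list of labels only lines up as long as the ticks stay exactly where the list expects them. As a hedged alternative, the label can be computed from the tick value itself. The helper below, `pi_label`, is a hypothetical name introduced here for illustration; it assumes every tick sits at a multiple of $\pi/2$.

```python
import math
from fractions import Fraction

def pi_label(value, pos=None):
    """Return a LaTeX label for a tick located at a multiple of pi/2."""
    # Express the tick position as a fraction of pi (e.g. -3/2, 1, 2)
    frac = Fraction(value / math.pi).limit_denominator(2)
    num, den = frac.numerator, frac.denominator
    if num == 0:
        return '$0$'
    sign = '-' if num < 0 else ''
    coeff = '' if abs(num) == 1 else str(abs(num))
    suffix = '' if den == 1 else '/' + str(den)
    return '$' + sign + coeff + r'\pi' + suffix + '$'

# Print the labels for a handful of tick positions
for tick in (-2 * math.pi, -math.pi / 2, 0.0, math.pi, 3 * math.pi / 2):
    print(pi_label(tick))
```

Because the function takes `(value, pos)`, it could be passed to `plt.FuncFormatter(pi_label)` and installed with `ax.xaxis.set_major_formatter`, in place of the fixed list.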


6 - Generate the same plots using all the settings from above, but now change your plot spines so that they are centered at the origin. In other words, change the plot area from a "box" to a "cross".


In [7]:
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5)
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5)

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)

# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))

# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None');


7 - Generate the same plots using all the settings from above, but now add a legend, with labels sine and cosine, to your plot in a position of your choosing.


In [8]:
fig = plt.figure()
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5, label='sine')
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5, label='cosine')

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-1, 1)

# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))

# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None')

# Add the legend in the lower left corner
ax.legend(loc='lower left');


8 - Now generate two more data sets, $$y_3 = \sin(x) + \sin(2x)$$ $$y_4 = \cos(x) + \cos(2x)$$ and add them to your plot, setting different colors and line styles (for example, dotted). Be sure to adjust your scales and legend as needed. Also add a title to your plot.


In [9]:
y3 = np.sin(x) + np.sin(2*x)
y4 = np.cos(x) + np.cos(2*x)

In [10]:
fig = plt.figure(figsize=(8,8))
ax = plt.axes()

# Set colors and thickness
ax.plot(x, y1, c='green', linewidth=3, alpha=.5, label=r'$\sin(x)$')
ax.plot(x, y2, c='cyan', linewidth=3, alpha=.5, label=r'$\cos(x)$')

# Add the new functions (raw strings keep the LaTeX backslashes intact)
ax.plot(x, y3, c='red', linewidth=3, alpha=.5, label=r'$\sin(x) + \sin(2x)$')
ax.plot(x, y4, c='blue', linewidth=3, alpha=.5, label=r'$\cos(x) + \cos(2x)$')

# Tighten the plot
ax.set_xlim(-2*np.pi, 2*np.pi)
ax.set_ylim(-3, 3)

# Set locators at multiples of π/2 and the respective labels using a list
# (raw strings keep the LaTeX backslashes from being read as escape sequences)
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['', r'$-2\pi$', r'$-3\pi/2$', r'$-\pi$', r'$-\pi/2$', '$0$', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']))

# Move the left and bottom spines to the center and make the other two invisible
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('None')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('None')

# Add the legend in the lower left corner
ax.legend(loc='lower left', frameon=False)

# Set the title
ax.set_title('Some trigonometric functions');


More Plots With Real Data

In this exercise we'll be using a real data set to test out the functionality of matplotlib.

1 - Go to the R Data Repository and download, or load directly, the Aircraft Crash data. Load it into a DataFrame and print the first few rows.


In [11]:
crash = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/gamclass/airAccs.csv')

In [12]:
# Change the first column name to 'id'
col = crash.columns.values
col[0] = 'id'
crash.columns = col

crash.head()


Out[12]:
id Date location operator planeType Dead Aboard Ground
0 1 1908-09-17 Fort Myer, Virginia Military - U.S. Army Wright Flyer III 1.0 2.0 0.0
1 2 1912-07-12 Atlantic City, New Jersey Military - U.S. Navy Dirigible 5.0 5.0 0.0
2 3 1913-08-06 Victoria, British Columbia, Canada Private Curtiss seaplane 1.0 1.0 0.0
3 4 1913-09-09 Over the North Sea Military - German Navy Zeppelin L-1 (airship) 14.0 20.0 0.0
4 5 1913-10-17 Near Johannisthal, Germany Military - German Navy Zeppelin L-2 (airship) 30.0 30.0 0.0

2 - Generate a histogram for the number of deaths, using bin sizes of your choice. Be sure to adjust the axis and to add a title to make your plot aesthetically appealing.

First, let's take a look at the summary statistics of the data:


In [13]:
crash.describe()


Out[13]:
id Dead Aboard Ground
count 5666.000000 5655.000000 5625.000000 5592.000000
mean 2833.500000 19.811848 27.375822 1.543455
std 1635.777644 32.520087 42.564764 52.301120
min 1.000000 0.000000 0.000000 0.000000
25% 1417.250000 3.000000 5.000000 0.000000
50% 2833.500000 9.000000 13.000000 0.000000
75% 4249.750000 22.000000 30.000000 0.000000
max 5666.000000 583.000000 644.000000 2750.000000

The number of deaths is strongly skewed to the right: the mean (19.8) is more than twice the median (9), and the maximum is 583. We therefore expect a plot with some isolated bars on the right:


In [14]:
with plt.style.context('seaborn-white'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # drop nans because they cause an error with the hist command
    n, bins, patches = ax.hist(crash['Dead'].dropna(), bins=50)
    # set x axis limits
    ax.set_xlim(0,600)
    # adjust the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(50))
    # add a title to the plot and the x axis
    ax.set_title('Deaths in plane crashes')
    ax.set_xlabel('Number of deaths')


Indeed, many bars on the right are almost invisible. Let's look at the bins computed by the hist command to see the values behind those bars:


In [15]:
print(bins)
print(n)


[   0.     11.66   23.32   34.98   46.64   58.3    69.96   81.62   93.28
  104.94  116.6   128.26  139.92  151.58  163.24  174.9   186.56  198.22
  209.88  221.54  233.2   244.86  256.52  268.18  279.84  291.5   303.16
  314.82  326.48  338.14  349.8   361.46  373.12  384.78  396.44  408.1
  419.76  431.42  443.08  454.74  466.4   478.06  489.72  501.38  513.04
  524.7   536.36  548.02  559.68  571.34  583.  ]
[  3.20400000e+03   1.12900000e+03   4.83000000e+02   2.69000000e+02
   1.50000000e+02   8.10000000e+01   7.80000000e+01   5.70000000e+01
   3.70000000e+01   3.80000000e+01   3.20000000e+01   1.50000000e+01
   1.30000000e+01   2.10000000e+01   8.00000000e+00   8.00000000e+00
   5.00000000e+00   1.00000000e+00   2.00000000e+00   6.00000000e+00
   2.00000000e+00   1.00000000e+00   5.00000000e+00   3.00000000e+00
   1.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
   1.00000000e+00   2.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.00000000e+00]

As you can see, more than half of the bins (27 of 50) have a count of zero or one, and every bin above roughly 187 deaths holds at most six crashes. Combined with the much larger counts in the first bins, this makes the last bars invisible.

We can try to solve this in a few ways. First, let's cut the y axis and use a different scale for the upper part of the taller bars:
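
The same diagnosis can be made without reading the printed arrays by eye. The sketch below uses `np.histogram`, which returns counts and bin edges without drawing anything. The `deaths` array is a synthetic right-skewed sample, an assumption standing in for `crash['Dead'].dropna()`.

```python
import numpy as np

# Synthetic right-skewed sample standing in for the 'Dead' column (assumption)
rng = np.random.default_rng(0)
deaths = np.concatenate([rng.exponential(scale=20, size=5000), [520.0, 583.0]])

# Compute counts and bin edges for 50 equal-width bins, without plotting
counts, edges = np.histogram(deaths, bins=50)

# Bins holding zero or one observation are the ones that vanish in the plot
sparse = counts <= 1
print(sparse.sum(), 'of', counts.size, 'bins have at most one count')
print('tallest bar / shortest non-empty bar:', counts.max() / counts[counts > 0].min())
```

The ratio printed last gives a rough idea of how many times taller the first bars are than the smallest visible ones, which is why a single linear axis cannot show both.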


In [16]:
with plt.style.context('seaborn-white'):
    # create two subplots, one for the higher bars and the other for the lower parts of every bar
    fig, ax = plt.subplots(2, 1, sharex='col', figsize=(16,8))
    # plot the lower part of bars by setting the limit of the y axis to 90
    ax[1].hist(crash['Dead'].dropna(), bins=50)
    ax[1].set_xlim(0,600)
    ax[1].set_ylim(0,90)
    ax[1].xaxis.set_major_locator(plt.MaxNLocator(50))
    # plot the higher part of bars by setting the limit of the y axis from 100 to 3500
    ax[0].hist(crash['Dead'].dropna(), bins=50)
    ax[0].set_ylim(100,3500)
    # add title and x axis label
    ax[0].set_title('Deaths in plane crashes')
    ax[1].set_xlabel('Number of deaths')
    # delete the spines between the plots
    ax[0].spines['bottom'].set_color('None')
    ax[1].spines['top'].set_color('None')
    # add dashes to indicate the cut in the y axis (shamelessly copying code from stackoverflow!)
    d = .01
    kwargs = dict(transform=ax[0].transAxes, color='k', clip_on=False)
    ax[0].plot((-d,+d),(-d,+d), **kwargs)
    ax[0].plot((1-d,1+d),(-d,+d), **kwargs)
    kwargs.update(transform=ax[1].transAxes)
    ax[1].plot((-d,+d),(1-d,1+d), **kwargs)
    ax[1].plot((1-d,1+d),(1-d,1+d), **kwargs)
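
A logarithmic count axis is another common way to keep very small and very large bars visible in a single panel. It is not used elsewhere in this notebook, so treat the sketch below as an optional variation. The `deaths` array is again synthetic, an assumption standing in for `crash['Dead'].dropna()`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed sample standing in for crash['Dead'].dropna() (assumption)
rng = np.random.default_rng(0)
deaths = np.concatenate([rng.exponential(scale=20, size=5000), [520.0, 583.0]])

fig, ax = plt.subplots(figsize=(16, 8))
# log=True puts the count axis on a logarithmic scale, so a bar of height 1
# stays visible next to a bar of height several thousand
n, bins, patches = ax.hist(deaths, bins=50, log=True)
ax.set_title('Deaths in plane crashes (log scale)')
ax.set_xlabel('Number of deaths')
ax.set_ylabel('Number of crashes');
```

The trade-off is that bar heights are no longer proportional to counts, so the plot is better for spotting where data exists than for comparing magnitudes.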


Another possible solution is to clip the values and use the last bin to represent all the values over a certain number of deaths:


In [17]:
# Function for formatting the label of the clipped bin
def hist_formatter(value, pos):
    if value == 100:
        return ''
    elif value == 98:
        return str(int(value)) + '+'
    else:
        return str(int(value))

In [18]:
with plt.style.context('seaborn-white'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # clip the values over 100 and create 50 bins
    ax.hist(np.clip(crash['Dead'].dropna(), 0, 100), bins=50);
    # set x axis limit to 100
    ax.set_xlim(0,100)
    # add 50 ticks (start and end of the bar)
    ax.xaxis.set_major_locator(plt.MaxNLocator(50))
    # format ticks so that the last bar is labelled 98+
    ax.xaxis.set_major_formatter(plt.FuncFormatter(hist_formatter))
    # add title and x axis label
    ax.set_title('Deaths in plane crashes')
    ax.set_xlabel('Number of deaths')
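
To see why the last bar is labelled "98+", it helps to run the clipping step on a handful of numbers. The values below are hypothetical and chosen only to show the effect.

```python
import numpy as np

# A few hypothetical death counts, including two far above the cutoff
deaths = np.array([3, 50, 98, 250, 583])

# np.clip leaves values inside [0, 100] untouched and replaces larger ones by 100
clipped = np.clip(deaths, 0, 100)
print(clipped)

# With 50 bins over [0, 100], each bin is 2 wide; the last one, [98, 100],
# collects the value 98 together with everything that was clipped down to 100
counts, edges = np.histogram(clipped, bins=50, range=(0, 100))
print(counts[-1], edges[-2], edges[-1])
```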


3 - Make some plots of total number of deaths with respect to time, making use of Pandas time series functionality. Again, be sure to make your plot aesthetically appealing.

First, let's convert the Date column to DateTime format and set it as the index of the DataFrame:


In [20]:
# Parse the whole column at once (pd.datetime is no longer available in recent pandas)
crash['Date'] = pd.to_datetime(crash['Date'], format='%Y-%m-%d')
crash.set_index(['Date'], inplace=True)
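
Once the index holds DateTime values, whole years can be selected with a plain year string, which is what the later `crash.loc['1985']` calls rely on. A minimal sketch on a toy frame (the name `toy` is introduced here for illustration):

```python
import pandas as pd

# Toy frame with three rows, mimicking the Date and Dead columns
toy = pd.DataFrame({
    'Date': ['1985-06-23', '1985-08-12', '1996-07-17'],
    'Dead': [329.0, 520.0, 230.0],
})

# Vectorised parsing of the whole column, then use it as the index
toy['Date'] = pd.to_datetime(toy['Date'], format='%Y-%m-%d')
toy = toy.set_index('Date')

# A year string selects every row that falls in that year
print(toy.loc['1985'])
```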

Now let's create a new DataFrame containing yearly aggregates of the data:


In [21]:
# Resample at the year start taking the sum and using 0 where the sum is NaN
yearly_crash = crash.resample('AS').sum().fillna(0)
yearly_crash.head()


Out[21]:
id Dead Aboard Ground
Date
1908-01-01 1.0 1.0 2.0 0.0
1909-01-01 0.0 0.0 0.0 0.0
1910-01-01 0.0 0.0 0.0 0.0
1911-01-01 0.0 0.0 0.0 0.0
1912-01-01 2.0 5.0 5.0 0.0
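
The zero rows for 1909 to 1911 above come from resampling: every calendar year between the first and last date gets a bin, even if no crash falls into it. The sketch below reproduces this on a toy series with hypothetical values. It uses the `'YS'` alias, assuming a pandas version that accepts it; the cell above uses the older `'AS'` spelling for the same year-start frequency.

```python
import pandas as pd

# Toy series with a gap: nothing recorded in 1909 (hypothetical values)
dates = pd.to_datetime(['1908-09-17', '1910-03-01', '1910-11-20'])
dead = pd.Series([1.0, 2.0, 5.0], index=dates, name='Dead')

# 'YS' bins by calendar year and labels each bin with its first day
yearly = dead.resample('YS').sum()
print(yearly)
```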

Last but not least, the plot:


In [22]:
with plt.style.context('seaborn'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot the Dead column
    ax.plot(yearly_crash['Dead'])
    # Add a title
    ax.set_title('Deaths in plane crashes by year')
    # Set limits for the axis
    ax.set_xlim(yearly_crash.index.values.min(), yearly_crash.index.values.max())
    ax.set_ylim(-100, 3200)
    # Increase the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(20));


Two years in the 1970s and 1980s peak at more than 2500 deaths. Let's determine which years they are:


In [23]:
yearly_crash[yearly_crash['Dead'] > 2500]


Out[23]:
id Dead Aboard Ground
Date
1972-01-01 288015.0 2946.0 3644.0 55.0
1985-01-01 278055.0 2670.0 3480.0 1.0

The years are 1972 and 1985. Let's take a closer look at individual years, starting with 1985:


In [24]:
crash.loc['1985'].describe()


Out[24]:
id Dead Aboard Ground
count 74.000000 74.000000 74.000000 74.000000
mean 3757.500000 36.081081 47.027027 0.013514
std 21.505813 80.429627 87.298727 0.116248
min 3721.000000 0.000000 1.000000 0.000000
25% 3739.250000 3.000000 4.000000 0.000000
50% 3757.500000 8.000000 11.000000 0.000000
75% 3775.750000 28.250000 44.750000 0.000000
max 3794.000000 520.000000 524.000000 1.000000

In [25]:
crash.loc['1985'].sort_values('Dead', ascending=False).head(10)


Out[25]:
id location operator planeType Dead Aboard Ground
Date
1985-08-12 3765 Mt. Osutaka, near Ueno Village, Japan Japan Air Lines Boeing B-747-SR46 520.0 524.0 0.0
1985-06-23 3757 Atlantic Ocean, 110 miles West of Ireland Air India Boeing B-747-237B 329.0 329.0 0.0
1985-12-12 3792 Gander, Newfoundland, Canada Arrow Airways McDonnell Douglas DC-8 Super 63PF 256.0 256.0 0.0
1985-07-10 3760 Near Uchuduk, Uzbekistan, USSR Aeroflot Tupolev TU-154B-2 200.0 200.0 0.0
1985-02-19 3736 Near Durango, Vizcaya, Spain Iberia Airlines Boeing B-727-256 148.0 148.0 0.0
1985-08-02 3763 Ft. Worth-Dallas, Texas Delta Air Lines Lockheed L-1011-1 TriStar 134.0 163.0 1.0
1985-07-24 3762 Leticia, Colombia Fuerza Aérea Colombiana Douglas DC-6B 80.0 80.0 0.0
1985-05-03 3749 Near L'vov, Ukraine, USSR Aeroflot / Soviet Air Force Tupolev TU-134A / Antonov An-26 76.0 76.0 0.0
1985-01-21 3726 Reno, Nevada Galaxy Airlines Lockheed L-188A Electra 70.0 71.0 0.0
1985-11-24 3790 Luqa, Malta EgyptAir Boeing B-737-266 60.0 103.0 0.0

In [26]:
crash.loc['1996'].describe()


Out[26]:
id Dead Aboard Ground
count 81.000000 81.000000 81.000000 81.000000
mean 4630.000000 29.456790 37.753086 3.530864
std 23.526581 57.069705 64.485760 25.275525
min 4590.000000 1.000000 1.000000 0.000000
25% 4610.000000 3.000000 6.000000 0.000000
50% 4630.000000 8.000000 14.000000 0.000000
75% 4650.000000 20.000000 28.000000 0.000000
max 4670.000000 349.000000 349.000000 225.000000

In [27]:
crash.loc['1996'].sort_values('Dead', ascending=False).head(10)


Out[27]:
id location operator planeType Dead Aboard Ground
Date
1996-11-12 4655 Near Charkhi Dadri, India Saudi Arabian Airlines / Kazakhstan Airlines Boeing B-747-168B / Ilyushin IL-76TD 349.0 349.0 0.0
1996-07-17 4631 Off East Moriches, New York Trans World Airlines Boeing B-747-131 230.0 230.0 0.0
1996-02-06 4598 Off Puerto Plata, Domincan Republic Alas Nacionales, leased from Birgen Air Boeing B-757-225 189.0 189.0 0.0
1996-11-07 4653 Lagos, Nigeria Aviation Development Corporation Boeing B-727-231 143.0 143.0 0.0
1996-08-29 4639 Spitsbergen, Norway Vnokovo Airlines Tupolev TU-154M 141.0 141.0 0.0
1996-11-23 4660 Off Moroni, Comoros Ethiopian Airlines Boeing B-767-200ER 127.0 175.0 0.0
1996-02-29 4605 Arequipa, Peru Compania de Aviacion Faucett SA (Peru) Boeing B-737-222 123.0 123.0 0.0
1996-05-11 4618 Everglades, Miami, Florida ValuJet McDonnell Douglas DC-9-32 110.0 110.0 0.0
1996-10-31 4651 Sao Paolo, Brazil TAM (Brazil) Fokker 100 95.0 95.0 3.0
1996-02-26 4603 Near Jabal Awliya, Sudan Military - Sudanese Air Force Lockheed C-130H 91.0 91.0 0.0

We can add some annotations to the plot to describe them:


In [28]:
with plt.style.context('seaborn'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot the Dead column
    ax.plot(yearly_crash['Dead'])
    # Add a title
    ax.set_title('Deaths in plane crashes by year')
    # Set limits for the axis
    ax.set_xlim(yearly_crash.index.values.min(), yearly_crash.index.values.max())
    ax.set_ylim(-100, 3200)
    # Increase the number of ticks
    ax.xaxis.set_major_locator(plt.MaxNLocator(20))
    # Add some annotations on the peaks (Timestamps match the datetime x axis)
    style = dict(size=10, color='black')
    ax.text(pd.Timestamp('1972-01-01'), 3020, '1972: 105 crashes with a mean of 28 deaths per crash', ha='center', **style)
    ax.text(pd.Timestamp('1985-01-01'), 2700, '1985: 520 deaths in Mt. Osutaka crash', ha='center', **style);


4 - We're now going to add in some data from a different source to take a look at the bigger picture in terms of number of passengers flying each year. Head over to the World Bank Webpage and download the .csv version of the data in the link. Clean it up and merge it with your original aircraft accident data above. Call this merged data set data_all.


In [29]:
# Load the data and set country name as the index
passengers = pd.read_csv('API_IS.AIR.PSGR_DS2_en_csv_v2.csv', skiprows=4)
passengers.set_index('Country Name',inplace=True)
passengers.head()


Out[29]:
Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 1966 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 Unnamed: 61
Country Name
Aruba ABW Air transport, passengers carried IS.AIR.PSGR NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Afghanistan AFG Air transport, passengers carried IS.AIR.PSGR NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 1999127.0 2.279341e+06 1.737962e+06 2044188.0 2209428.0 1929907.728 1917922.714 NaN
Angola AGO Air transport, passengers carried IS.AIR.PSGR NaN NaN NaN NaN NaN NaN NaN ... 283887.0 274869.0 1010194.0 9.877980e+05 1.132424e+06 1321872.0 1409952.0 1244491.000 1261671.262 NaN
Albania ALB Air transport, passengers carried IS.AIR.PSGR NaN NaN NaN NaN NaN NaN NaN ... 243691.0 231263.0 768533.0 8.297789e+05 8.143397e+05 865848.0 151632.0 NaN 26633.600 NaN
Andorra AND Air transport, passengers carried IS.AIR.PSGR NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 61 columns

We can ignore the first three columns and the last one. We also have to transpose the dataset and convert the years to DateTime values in order to merge the two DataFrames:


In [30]:
# Transpose dataset
passengers = passengers.iloc[:, 3:-1].transpose()

# Convert the year labels to DateTime (January 1st of each year) and use them as the new index
passengers.reset_index(inplace=True)
passengers['index'] = pd.to_datetime(passengers['index'], format='%Y')
passengers.set_index('index', inplace=True)

passengers.head()


Out[30]:
Country Name Aruba Afghanistan Angola Albania Andorra Arab World United Arab Emirates Argentina Armenia American Samoa ... Virgin Islands (U.S.) Vietnam Vanuatu World Samoa Kosovo Yemen, Rep. South Africa Zambia Zimbabwe
index
1960-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1961-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1962-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1963-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1964-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 264 columns

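Now we can sum all the columns to get a yearly passenger total and merge the two datasets. Note that the columns include aggregates such as World and Arab World alongside individual countries, so this sum counts many passengers more than once and overstates the true total: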


In [31]:
# Sum all the columns to get one total per year
total_passengers = pd.DataFrame(passengers.fillna(0).sum(axis=1), columns=['Passengers'])

# Merge datasets using the indexes and selecting only some columns
all_data = pd.merge(total_passengers, yearly_crash.iloc[:,1:], left_index=True, right_index=True, how='inner')

all_data


Out[31]:
Passengers Dead Aboard Ground
1960-01-01 0.000000e+00 1735.0 2121.0 37.0
1961-01-01 0.000000e+00 1521.0 2089.0 16.0
1962-01-01 0.000000e+00 2123.0 2540.0 5.0
1963-01-01 0.000000e+00 1379.0 1642.0 96.0
1964-01-01 0.000000e+00 1354.0 1717.0 0.0
1965-01-01 0.000000e+00 1824.0 2387.0 23.0
1966-01-01 0.000000e+00 1712.0 2001.0 159.0
1967-01-01 0.000000e+00 1814.0 2358.0 71.0
1968-01-01 0.000000e+00 2284.0 3058.0 5.0
1969-01-01 0.000000e+00 2101.0 2722.0 117.0
1970-01-01 1.866388e+09 2118.0 2790.0 9.0
1971-01-01 1.996079e+09 1951.0 2587.0 0.0
1972-01-01 1.835069e+09 2946.0 3644.0 55.0
1973-01-01 2.407371e+09 2477.0 3481.0 20.0
1974-01-01 2.886075e+09 2387.0 3123.0 0.0
1975-01-01 3.001390e+09 1659.0 2377.0 14.0
1976-01-01 3.292558e+09 2045.0 2717.0 144.0
1977-01-01 3.583457e+09 2173.0 2722.0 9.0
1978-01-01 4.015000e+09 1576.0 2740.0 23.0
1979-01-01 4.518941e+09 1992.0 2489.0 38.0
1980-01-01 4.523483e+09 1829.0 2754.0 1.0
1981-01-01 4.553121e+09 1245.0 1577.0 60.0
1982-01-01 4.660447e+09 1794.0 3163.0 15.0
1983-01-01 4.864865e+09 1612.0 2367.0 31.0
1984-01-01 5.187131e+09 1033.0 1495.0 72.0
1985-01-01 5.526388e+09 2670.0 3480.0 1.0
1986-01-01 5.922796e+09 1471.0 2567.0 51.0
1987-01-01 6.350715e+09 1723.0 2231.0 58.0
1988-01-01 6.711559e+09 2034.0 3037.0 83.0
1989-01-01 6.906911e+09 2283.0 3633.0 79.0
1990-01-01 7.201023e+09 1182.0 2265.0 80.0
1991-01-01 8.442310e+09 1839.0 2459.0 5.0
1992-01-01 8.649662e+09 2121.0 2967.0 61.0
1993-01-01 8.529488e+09 1571.0 2420.0 0.0
1994-01-01 9.182563e+09 1876.0 3108.0 24.0
1995-01-01 9.704725e+09 1593.0 2112.0 0.0
1996-01-01 1.032251e+10 2386.0 3058.0 286.0
1997-01-01 1.079187e+10 1672.0 2485.0 44.0
1998-01-01 1.086682e+10 1544.0 2011.0 48.0
1999-01-01 1.150838e+10 971.0 3003.0 36.0
2000-01-01 1.231302e+10 1429.0 2357.0 23.0
2001-01-01 1.234622e+10 1416.0 2129.0 5641.0
2002-01-01 1.214765e+10 1433.0 1798.0 170.0
2003-01-01 1.250480e+10 1279.0 1510.0 24.0
2004-01-01 1.429957e+10 728.0 937.0 2.0
2005-01-01 1.502382e+10 1317.0 2176.0 59.0
2006-01-01 1.594965e+10 1151.0 1431.0 4.0
2007-01-01 1.715771e+10 931.0 1364.0 57.0
2008-01-01 1.731750e+10 823.0 1466.0 60.0
2009-01-01 1.780159e+10 1095.0 1657.0 3.0
2010-01-01 2.125285e+10 1085.0 1514.0 9.0
2011-01-01 2.279443e+10 772.0 1041.0 27.0
2012-01-01 2.391928e+10 596.0 704.0 52.0
2013-01-01 2.534156e+10 311.0 827.0 5.0
2014-01-01 2.694625e+10 359.0 368.0 0.0

The years between 1960 and 1969 don't seem to have passenger data, despite being present in the passengers CSV, so we are going to skip them:


In [32]:
all_data = all_data.loc['1970-01-01':]

all_data.head()


Out[32]:
Passengers Dead Aboard Ground
1970-01-01 1.866388e+09 2118.0 2790.0 9.0
1971-01-01 1.996079e+09 1951.0 2587.0 0.0
1972-01-01 1.835069e+09 2946.0 3644.0 55.0
1973-01-01 2.407371e+09 2477.0 3481.0 20.0
1974-01-01 2.886075e+09 2387.0 3123.0 0.0

5 - Using data_all (named all_data in the code above), create two graphs to visualize how the number of deaths and passengers vary with time, and, as always, make your plots as visually appealing as possible.


In [33]:
# Function for formatting the passenger-count labels in millions
def million_formatter(value, pos):
    return str(int(value / 1e6)) + 'M'

In [34]:
with plt.style.context('seaborn'):
    # Make two subplots, the first with number of passengers and the second with number of deaths
    fig, ax = plt.subplots(2, 1, sharex='col', figsize=(16,8))
    ax[0].plot(all_data['Passengers'])
    ax[1].plot(all_data['Dead'])
    # Set the limits of the common x axis
    ax[0].set_xlim(all_data.index.values.min(), all_data.index.values.max())
    # Set the titles for each plot
    ax[0].set_title('Number of passengers by year')
    ax[1].set_title('Deaths in plane crashes by year')
    # Format the labels for the number of passengers
    ax[0].yaxis.set_major_formatter(plt.FuncFormatter(million_formatter));


6 - Make a pie chart representing the number of deaths for each decade. Consult the pyplot documentation to play around with the settings a bit.


In [35]:
# Add a new column with the decade; assign returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the sliced all_data
all_data = all_data.assign(Decade=all_data.index.year // 10 * 10)

In [36]:
with plt.style.context('ggplot'):
    fig = plt.figure(figsize=(16,8))
    ax = plt.axes()
    # Plot sum of Dead column grouped by decade
    ax.pie(all_data.groupby('Decade')['Dead'].sum(),
           labels=all_data['Decade'].unique(), # add labels
           counterclock=False, # change the order of the slices to clockwise
           startangle=90, # start from the top
           shadow=True, # add shadows
           autopct='%1.1f%%', # add percentage inside each slice
           explode=(0.1, 0, 0, 0, 0) # make the first slice pop out a bit
          )
    # Set a title
    ax.set_title('Number of deaths by decade');
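
The decade column and the grouped sum that feed the pie chart can be checked on a toy frame. The numbers below are hypothetical; the point is only to show how integer division maps each year to its decade.

```python
import pandas as pd

# Toy yearly death totals (hypothetical numbers)
toy = pd.DataFrame(
    {'Dead': [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime(['1979-01-01', '1980-01-01', '1989-01-01', '1990-01-01']),
)

# Integer-divide the year by 10 and multiply back: 1989 -> 198 -> 1980
toy = toy.assign(Decade=toy.index.year // 10 * 10)

# One total per decade, in ascending decade order
by_decade = toy.groupby('Decade')['Dead'].sum()
print(by_decade)
```

Since `groupby` returns the decades in sorted order, `by_decade.index` could also serve as the pie labels, keeping labels and slices aligned by construction.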