Plane Crash Info

This exercise is trying to make some more sense out of the data collected at http://PlaneCrashInfo.com, which contains way too few charts for this topic. Using this notebook you can explore the data and add your own charts.

N.B. This is work in progress… If you have any suggestion, please feel free to raise an issue or provide a pull request in the repo on GitHub!



In [1]:

    
%matplotlib inline

Raw Data



In [2]:

    
import pandas as pd

df = pd.read_csv('data/data.csv')



In [3]:

    
df.head()









    Out[3]:






  
    
      
      Unnamed: 0
      Unnamed: 0.1
      Date:
      Time:
      Location:
      Operator:
      Flight #:
      Route:
      AC  Type:
      Registration:
      cn / ln:
      Aboard:
      Fatalities:
      Ground:
      Summary:
    
  
  
    
      0
      0
      ACCIDENT DETAILS
      February 03, 1921
      ?
      Mendotta, Minnisota
      US Aerial Mail Service
      ?
      ?
      De Havilland DH-4
      130
      ?
      1 (passengers:0 crew:1)
      1 (passengers:0 crew:1)
      0
      ?
    
    
      1
      1
      ACCIDENT DETAILS
      February 09, 1921
      ?
      La Crosse, Wisconsin
      US Aerial Mail Service
      ?
      ?
      Junkers F-13
      301
      ?
      3 (passengers:2 crew:1)
      3 (passengers:2 crew:1)
      0
      ?
    
    
      2
      2
      ACCIDENT DETAILS
      February 15, 1921
      ?
      Off Gibraltar
      Aeropostale
      ?
      ?
      Breguet 14
      F-ALBA
      ?
      2 (passengers:? crew:?)
      2 (passengers:? crew:?)
      0
      ?
    
    
      3
      3
      ACCIDENT DETAILS
      February 22, 1921
      ?
      Elko, Nevada
      US Aerial Mail Service
      ?
      ?
      De Havilland DH-4
      67
      ?
      1 (passengers:0 crew:1)
      1 (passengers:0 crew:1)
      0
      ?
    
    
      4
      4
      ACCIDENT DETAILS
      April 06, 1921
      ?
      Point Cook, Australia
      Military - Royal Australian Air Force
      ?
      ?
      Avro 504
      H3021
      ?
      2 (passengers:? crew:?)
      1 (passengers:? crew:?)
      0
      ?

Clean(er) Data



In [5]:

    
import planecrashinfo_light as pci

df = pci.clean_database(df)



In [6]:

    
df.head()









    Out[6]:






  
    
      
      Time
      Location
      Operator
      Route
      AC  Type
      Aboard
      Fatalities
      Ground
      Summary
      Origin
      Destination
      Fatalities total
      Location Country
    
  
  
    
      1921-02-03
      NaN
      Mendotta, Minnisota
      US Aerial Mail Service
      NaN
      De Havilland DH-4
      1 (passengers:0 crew:1)
      1 (passengers:0 crew:1)
      0.0
      NaN
      NaN
      NaN
      1.0
      Minnisota
    
    
      1921-02-09
      NaN
      La Crosse, Wisconsin
      US Aerial Mail Service
      NaN
      Junkers F-13
      3 (passengers:2 crew:1)
      3 (passengers:2 crew:1)
      0.0
      NaN
      NaN
      NaN
      3.0
      USA
    
    
      1921-02-15
      NaN
      Off Gibraltar
      Aeropostale
      NaN
      Breguet 14
      2 (passengers:? crew:?)
      2 (passengers:? crew:?)
      0.0
      NaN
      NaN
      NaN
      2.0
      Off Gibraltar
    
    
      1921-02-22
      NaN
      Elko, Nevada
      US Aerial Mail Service
      NaN
      De Havilland DH-4
      1 (passengers:0 crew:1)
      1 (passengers:0 crew:1)
      0.0
      NaN
      NaN
      NaN
      1.0
      USA
    
    
      1921-04-06
      NaN
      Point Cook, Australia
      Military - Royal Australian Air Force
      NaN
      Avro 504
      2 (passengers:? crew:?)
      1 (passengers:? crew:?)
      0.0
      NaN
      NaN
      NaN
      1.0
      Australia

Numbers



In [7]:

    
s = df.groupby(df.index.year).size()
s.plot.line(title='#Accidents (total, by year)')









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x10debe128>



In [8]:

    
s = df.groupby(df.index.year).size().cumsum()
s.plot.line(title='#Accidents (cumulated, by year)')









    Out[8]:





<matplotlib.axes._subplots.AxesSubplot at 0x10e5c4dd8>



In [9]:

    
s = df.groupby(df.index.month).size()
s.plot.bar(title='#Accidents (total, by month)')









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x10e929b00>



In [10]:

    
s = df.groupby(df.index.day).size()
s.plot.bar(title='#Accidents (total, by day)')









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x111255208>



In [11]:

    
s = df.groupby((df.index.year // 10) * 10).size()
s.plot.bar(title='#Accidents (total, by decade)')









    Out[11]:





<matplotlib.axes._subplots.AxesSubplot at 0x111595c88>



In [12]:

    
s = df.groupby((df.index.quarter)).size()
s.plot.bar(title='#Accidents (total, by quarter)')









    Out[12]:





<matplotlib.axes._subplots.AxesSubplot at 0x1115e4748>

Fatalities



In [13]:

    
df['Fatalities total'].groupby(df.index.year).sum()









    Out[13]:





1921      67.0
1922      89.0
1923      76.0
1924      20.0
1925      39.0
1926      37.0
1927      62.0
1928     113.0
1929     133.0
1930     147.0
1931     102.0
1932      75.0
1933     205.0
1934     129.0
1935     234.0
1936     252.0
1937     227.0
1938     468.0
1939     160.0
1940     213.0
1941     246.0
1942     506.0
1943     676.0
1944     689.0
1945    1365.0
1946    1351.0
1947    1306.0
1948    1254.0
1949    1163.0
1950    1438.0
         ...  
1987    1723.0
1988    2034.0
1989    2283.0
1990    1182.0
1991    1839.0
1992    2121.0
1993    1571.0
1994    1876.0
1995    1593.0
1996    2386.0
1997    1672.0
1998    1544.0
1999     971.0
2000    1469.0
2001    1416.0
2002    1433.0
2003    1279.0
2004     728.0
2005    1317.0
2006    1161.0
2007     931.0
2008     824.0
2009    1095.0
2010    1087.0
2011     764.0
2012     596.0
2013     341.0
2014    1189.0
2015     770.0
2016     252.0
Name: Fatalities total, dtype: float64



In [14]:

    
df['Fatalities total'].groupby(df.index.year).sum().plot.line(title='#Fatalities (total)')









    Out[14]:





<matplotlib.axes._subplots.AxesSubplot at 0x111607438>



In [15]:

    
df['Ground'].groupby(df.index.year).sum().plot(title='#Ground fatalities (total)')









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x11195de48>



In [16]:

    
df.groupby(df.index.year).sum().plot(title='#Fatalities')









    Out[16]:





<matplotlib.axes._subplots.AxesSubplot at 0x111a59fd0>

Aircraft



In [17]:

    
s = df.groupby('AC  Type').size().sort_values(ascending=True)[-20:]
s.plot.barh(title='Top 20 aircraft with highest #accidents')









    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x111b59940>



In [18]:

    
doug = [op for op in df['AC  Type'].unique() if type(op) == str and op.startswith('Douglas')]
anto = [op for op in df['AC  Type'].unique() if type(op) == str and op.startswith('Antonov')]
airb = [op for op in df['AC  Type'].unique() if type(op) == str and op.startswith('Airbus')]
boei = [op for op in df['AC  Type'].unique() if type(op) == str and op.startswith('Boeing')]



In [19]:

    
airb









    Out[19]:





['Airbus A300',
 'Airbus A.300B4-203',
 'Airbus A320-111',
 'Airbus A300B2-203',
 'Airbus A320-231',
 'Airbus A310-304',
 'Airbus A300B4-203',
 'Airbus A320-211',
 'Airbus A300B4-622R',
 'Airbus A330-321',
 'Airbus A300B2-1C',
 'Airbus A310-324',
 'Airbus A300-B4-200',
 'Airbus A300-622R',
 'Airbus A.320-214',
 'Airbus A-310-204',
 'Airbus A300B2-101',
 'AirbusA310-304',
 'Airbus A.330-301',
 'Airbus A320-212',
 'Airbus A300-600R',
 'Airbus A-330-243',
 'Airbus A-300-605R',
 'Airbus A-340',
 'Airbus A-310-324ET',
 'Airbus A-320-233',
 'Airbus A-310-324',
 'Airbus A320-232',
 'Airbus A320-214',
 'Airbus A330-203',
 'Airbus A300B4-203F',
 'Airbus A330-202',
 'Airbus A321-231',
 'Airbus A300F-622R',
 'Airbus A-320-216',
 'Airbus A-310-304F',
 'Airbus A321-111']



In [20]:

    
s = df[df['AC  Type'].isin(airb)].groupby('AC  Type').size()
s.plot.bar(title='#Accidents for Airbus Aircraft')









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x111c40a20>



In [21]:

    
import re

def extract_airbus_type(ac_type):
    "Extract main Airbus type from 'AC Type', e.g. 'A300' from 'Airbus A-300-605R'."

    if type(ac_type) != str:
        return ''
    pat = 'Airbus ?(A[\.\-]?\d{3,3})'
    m = re.search(pat, ac_type)
    if m:
        return m.groups()[0].replace('-', '').replace('.', '')
    else:
        return ''



In [22]:

    
airbus_models = set([extract_airbus_type(a) for a in airb])
print(airbus_models)









    



{'A300', 'A330', 'A310', 'A321', 'A320', 'A340'}



In [23]:

    
df['Airbus Model'] = df['AC  Type'].apply(extract_airbus_type)



In [24]:

    
s = df[df['Airbus Model'].isin(airbus_models)].groupby('Airbus Model').size()
s.plot.bar(title='#Accidents for Airbus Aircraft by Model')









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x1121e9b00>



In [25]:

    
df2 = df[df['AC  Type'].isin(airb)]
s2 = df2.groupby(df2.index.year).size()
s2.plot.bar(title='#Accidents for Airbus Aircraft')









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x11221abe0>



In [26]:

    
df1 = df[df['AC  Type'].isin(doug)]
s1 = df1.groupby(df1.index.year).size()
s1.plot.line(title='#Accidents for Douglas Aircraft')









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x11258fd30>



In [27]:

    
df3 = df[df['AC  Type'].isin(anto)]
s3 = df3.groupby(df3.index.year).size()
s3.plot.line(title='#Accidents for Antonov Aircraft')









    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0x1126b3c88>



In [28]:

    
df4 = df[df['AC  Type'].isin(boei)]
s4 = df4.groupby(df4.index.year).size()
s4.plot.line(title='#Accidents for Boeing Aircraft')









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x1126deb70>



In [29]:

    
dfab = pd.DataFrame(data={'Douglas': s1, 'Airbus': s2, 'Antonov': s3, 'Boeing': s4})
dfab.plot.line()









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x1128a9908>

Operators



In [30]:

    
s = df.groupby('Operator').size().sort_values(ascending=True)[-20:]
s.plot.barh(title='Top 20 airlines with highest #accidents')









    Out[30]:





<matplotlib.axes._subplots.AxesSubplot at 0x112c03240>



In [31]:

    
mil_ops = sorted([op for op in df['Operator'].unique() if type(op) == str and 'Military' in op])



In [32]:

    
df4 = df[df['Operator'].isin(mil_ops)]
s4 = df4.groupby(df4.index.year).size()
s4.plot.line(title='#Accidents for Military Operators (total, by year)')









    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0x1128a5be0>



In [33]:

    
s = df.groupby(df.index.year).size()
s.plot.line(title='#Accidents (total, by year)')









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x112ea4208>

Routes



In [34]:

    
s = df.groupby('Origin').size().sort_values(ascending=True)
s[-20:].plot.barh(title='Top 20 most frequest origins (of %d)' % len(s))









    Out[34]:





<matplotlib.axes._subplots.AxesSubplot at 0x11302efd0>



In [35]:

    
s = df.groupby('Destination').size().sort_values(ascending=True)
s[-20:].plot.barh(title='Top 20 most frequent destinations (of %d)' % len(s))









    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x111fa7a58>

Locations



In [36]:

    
s = df.groupby('Location Country').size().sort_values(ascending=True)[-20:]
s.plot.barh(title='Top 20 countries with highest #accidents')









    Out[36]:





<matplotlib.axes._subplots.AxesSubplot at 0x112e889e8>

For respective maps see below...

Maps of Crash Locations

Without more data cleaning (removing prefixes like 'Near...' or 'Off...', etc.) we can plot ca. 53% of all crash locations below.



In [37]:

    
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

def show_map(positions, title='', proj='mill', lat_0=0, lon_0=0):
    "Show a world map."
    
    plt.figure(num=1, figsize=(20,10))
    plt.title(title)

    map = Basemap(projection=proj, lat_0=lat_0, lon_0=lon_0, resolution='c')
    map.drawcoastlines()
    map.drawcountries()
    map.drawparallels(np.arange(-90, 90, 30), labels=[1, 0, 0, 0])
    map.drawmeridians(np.arange(map.lonmin, map.lonmax + 30, 60), labels=[0, 0, 0, 1])
    map.drawmapboundary(fill_color='#aaddff')
    map.fillcontinents(color='#dddddd', lake_color='#aaddff')

    for pos in positions:
        lon, lat = pos['lon'], pos['lat']
        map.plot(lon, lat, 'ro', markersize=3, latlon=True)

    plt.show()



In [38]:

    
import json

locs = json.load(open('data/geolocs.json'))
positions = list(filter(None, locs.values()))
print('#locations found: %d' % len(locs))
print('#coordinates found: %d' % len(positions))









    



#locations found: 3738
#coordinates found: 3225



In [39]:

    
title = 'Approx. Plane Crash Locations (%d out of %d)'% (len(positions), len(locs))
show_map(positions, title=title, proj='mill')



In [40]:

    
title = 'Approx. Plane Crash Locations (%d out of %d)'% (len(positions), len(locs))
show_map(positions, title=title, proj='ortho', lat_0=30, lon_0=-40)









    



Warning: Cannot label parallels on Orthographic basemap'Warning: Cannot label meridians on full-disk
                Geostationary, Orthographic or Azimuthal equidistant basemap

	Unnamed: 0	Unnamed: 0.1	Date:	Time:	Location:	Operator:	Flight #:	Route:	AC Type:	Registration:	cn / ln:	Aboard:	Fatalities:	Summary:
0	0	ACCIDENT DETAILS	February 03, 1921	?	Mendotta, Minnisota	US Aerial Mail Service	?	?	De Havilland DH-4	130	?	1 (passengers:0 crew:1)	1 (passengers:0 crew:1)	?
1	1	ACCIDENT DETAILS	February 09, 1921	?	La Crosse, Wisconsin	US Aerial Mail Service	?	?	Junkers F-13	301	?	3 (passengers:2 crew:1)	3 (passengers:2 crew:1)	?
2	2	ACCIDENT DETAILS	February 15, 1921	?	Off Gibraltar	Aeropostale	?	?	Breguet 14	F-ALBA	?	2 (passengers:? crew:?)	2 (passengers:? crew:?)	?
3	3	ACCIDENT DETAILS	February 22, 1921	?	Elko, Nevada	US Aerial Mail Service	?	?	De Havilland DH-4	67	?	1 (passengers:0 crew:1)	1 (passengers:0 crew:1)	?
4	4	ACCIDENT DETAILS	April 06, 1921	?	Point Cook, Australia	Military - Royal Australian Air Force	?	?	Avro 504	H3021	?	2 (passengers:? crew:?)	1 (passengers:? crew:?)	?

	Time	Location	Operator	Route	AC Type	Aboard	Fatalities	Summary	Origin	Destination	Fatalities total	Location Country
1921-02-03	NaN	Mendotta, Minnisota	US Aerial Mail Service	NaN	De Havilland DH-4	1 (passengers:0 crew:1)	1 (passengers:0 crew:1)	NaN	NaN	NaN	1.0	Minnisota
1921-02-09	NaN	La Crosse, Wisconsin	US Aerial Mail Service	NaN	Junkers F-13	3 (passengers:2 crew:1)	3 (passengers:2 crew:1)	NaN	NaN	NaN	3.0	USA
1921-02-15	NaN	Off Gibraltar	Aeropostale	NaN	Breguet 14	2 (passengers:? crew:?)	2 (passengers:? crew:?)	NaN	NaN	NaN	2.0	Off Gibraltar
1921-02-22	NaN	Elko, Nevada	US Aerial Mail Service	NaN	De Havilland DH-4	1 (passengers:0 crew:1)	1 (passengers:0 crew:1)	NaN	NaN	NaN	1.0	USA
1921-04-06	NaN	Point Cook, Australia	Military - Royal Australian Air Force	NaN	Avro 504	2 (passengers:? crew:?)	1 (passengers:? crew:?)	NaN	NaN	NaN	1.0	Australia