In [3]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import re
import pdb
import time
import json
import datetime
import random

Question a: Setting up the dataframe


In [100]:
base_url = "https://en.wikipedia.org"
index_ref = "/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft"
index_html = urlopen(base_url + index_ref)
index = BeautifulSoup(index_html, "lxml")

Grab the <li> tags

From inspecting the source code, we can see that every incident is wrapped in <li> tags; some days, like September 11, 2001, span several of them.


In [101]:
result = index.find_all('li')

Handling special dates with multiple accidents

However, some elements in our list contain a nested list of incidents, which looks something like this:

<li><b><a href="/wiki/1950_Air_France_multiple_Douglas_DC-4_accidents" title="1950 Air France multiple Douglas DC-4 accidents">1950 Air France multiple Douglas DC-4 accidents</a></b>: <ul> <li>June 12 – An Air France Douglas DC-4 (F-BBDE) on a flight from Saigon to Paris crashes in the Arabian Sea while on approach to Bahrain Airport, killing 46 of 52 on board.</li> <li>June 14 – An Air France Douglas DC-4, F-BBDM, crashes in the Arabian Sea while on approach to Bahrain Airport, killing 40 of 53 on board. This aircraft was operating on the same flight route as F-BBDE.</li> </ul> </li>

We'll need to make sure we can handle these cases. Additionally, since there are <li> tags within this block, the <li>June 12...</li> and <li>June 14...</li> will appear again in our ResultSet. We should get rid of these duplicates.

Looking at the HTML above, the duplicated <li> entries won't have both links and date separators in them, so that's one way we can drop them. Then we'll save rows from the interior <li> tags, supplementing link information from the parent if necessary.

For example, the entry for September 11 is:


In [102]:
result[829]


Out[102]:
<li>September 11 – <b><a href="/wiki/September_11_attacks" title="September 11 attacks">September 11 attacks</a></b>
<ul>
<li><b><a href="/wiki/American_Airlines_Flight_11" title="American Airlines Flight 11">American Airlines Flight 11</a></b>, a Boeing 767-200ER with 92 people on board, is hijacked after taking off from Boston, and is flown into the north tower of the World Trade Center in New York City; all on board are killed as well as others on the ground and in the building.</li>
<li><b><a href="/wiki/United_Airlines_Flight_175" title="United Airlines Flight 175">United Airlines Flight 175</a></b>, a Boeing 767-200 with 65 people on board, is hijacked after taking off from Boston and is flown into the south tower of the World Trade Center in New York City; all on board are killed as well as others on the ground and in the building; the collapse of both towers brings the total death toll from the two crashes to at least 2,759, <b>the worst disaster involving commercial aircraft</b>.</li>
<li><b><a href="/wiki/American_Airlines_Flight_77" title="American Airlines Flight 77">American Airlines Flight 77</a></b>, a Boeing 757-200 with 64 people on board, is hijacked after taking off from Dulles International Airport and is flown into The Pentagon; all on board are killed as well as 125 people in the building and on the ground.</li>
<li><b><a href="/wiki/United_Airlines_Flight_93" title="United Airlines Flight 93">United Airlines Flight 93</a></b>, a Boeing 757-200 with 44 people on board, is hijacked after taking off from Newark, New Jersey; passengers struggle with the hijackers, and the aircraft crashes in a field near Shanksville, Pennsylvania, killing all on board.</li>
</ul>
</li>

But then there are also separate entries for each <li> in the list inside:


In [103]:
result[830:834]


Out[103]:
[<li><b><a href="/wiki/American_Airlines_Flight_11" title="American Airlines Flight 11">American Airlines Flight 11</a></b>, a Boeing 767-200ER with 92 people on board, is hijacked after taking off from Boston, and is flown into the north tower of the World Trade Center in New York City; all on board are killed as well as others on the ground and in the building.</li>,
 <li><b><a href="/wiki/United_Airlines_Flight_175" title="United Airlines Flight 175">United Airlines Flight 175</a></b>, a Boeing 767-200 with 65 people on board, is hijacked after taking off from Boston and is flown into the south tower of the World Trade Center in New York City; all on board are killed as well as others on the ground and in the building; the collapse of both towers brings the total death toll from the two crashes to at least 2,759, <b>the worst disaster involving commercial aircraft</b>.</li>,
 <li><b><a href="/wiki/American_Airlines_Flight_77" title="American Airlines Flight 77">American Airlines Flight 77</a></b>, a Boeing 757-200 with 64 people on board, is hijacked after taking off from Dulles International Airport and is flown into The Pentagon; all on board are killed as well as 125 people in the building and on the ground.</li>,
 <li><b><a href="/wiki/United_Airlines_Flight_93" title="United Airlines Flight 93">United Airlines Flight 93</a></b>, a Boeing 757-200 with 44 people on board, is hijacked after taking off from Newark, New Jersey; passengers struggle with the hijackers, and the aircraft crashes in a field near Shanksville, Pennsylvania, killing all on board.</li>]

Extracting data from the HTML

Let's take a look at one of the entries in our results, a single <li> element with no nested list, to see how to select each piece of data we want to save. We'll store these items in lists.

Here is how to extract each piece of data from the HTML:


In [104]:
result[0].find('a').get('href')


Out[104]:
'/wiki/1919_Verona_Caproni_Ca.48_crash'

In [105]:
result[0].text


Out[105]:
'August 2 – A Caproni Ca.48 crashes at Verona, Italy, during a flight from Venice to Taliedo, Milan, killing all on board (14, 15, or 17 people, according to different sources).'

In [106]:
result[0].text.split(' – ')


Out[106]:
['August 2',
 'A Caproni Ca.48 crashes at Verona, Italy, during a flight from Venice to Taliedo, Milan, killing all on board (14, 15, or 17 people, according to different sources).']

In [107]:
result[0].text.split(' – ')[0]


Out[107]:
'August 2'

In [108]:
result[0].text.split(' – ')[1]


Out[108]:
'A Caproni Ca.48 crashes at Verona, Italy, during a flight from Venice to Taliedo, Milan, killing all on board (14, 15, or 17 people, according to different sources).'

Iterating over all <li> elements and extracting information

We need a function that handles the fact that the date separator (a spaced en dash or a spaced hyphen) changes throughout the page.


In [109]:
def get_date_separator(html_fragment):
    # Date separator changes throughout the document, so let's handle both
    if ' – ' in html_fragment.text:
        return '–'
    elif ' - ' in html_fragment.text:
        return '-'
    else:
        return None
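
A quick check on the first entry, whose text (shown below in Out[105]) contains the spaced en dash:

get_date_separator(result[0])  # -> '–' (the en dash variant)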

Now let's write a function that takes an <li> element and extracts the crash details (date, link, and text description). It should also handle the cases where there are several crashes per <li>.


In [110]:
def extract_details(html_fragment):
    # these lists may have one or more elements when returned
    bdates, blinks, bdescrips = [], [], []
    
    if not html_fragment.find_all('li'):
        # Then there is only one crash for this bullet; split on the first
        # separator only, in case the description contains another dash
        separator = get_date_separator(html_fragment)
        blinks.append(html_fragment.find('a').get('href'))
        bdates.append(html_fragment.text.split(separator, 1)[0].strip())
        bdescrips.append(html_fragment.text.split(separator, 1)[1].strip())

    else:
        # Then there are multiple crashes for this bullet
        for bullet in html_fragment.find_all('li'):

            # Dates might appear in the current or the parent <li>
            separator = get_date_separator(bullet)
            if separator is not None:
                bdates.append(bullet.text.split(separator, 1)[0].strip())
                bdescrips.append(bullet.text.split(separator, 1)[1].strip())
            else:
                parent_separator = get_date_separator(html_fragment)
                bdates.append(html_fragment.text.split(parent_separator, 1)[0].strip())
                bdescrips.append(bullet.text.strip())

            # Relevant link might appear in the current or the parent <li>
            if bullet.find('a') is None:
                blinks.append(html_fragment.find('a').get('href'))
            else:
                blinks.append(bullet.find('a').get('href'))
                
    return bdates, blinks, bdescrips
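
As a quick check against the September 11 entry from earlier: the nested bullets carry no date separator of their own, so each row should inherit the parent's date while keeping its own link and description:

extract_details(result[829])
# -> (['September 11', 'September 11', 'September 11', 'September 11'],
#     ['/wiki/American_Airlines_Flight_11', '/wiki/United_Airlines_Flight_175',
#      '/wiki/American_Airlines_Flight_77', '/wiki/United_Airlines_Flight_93'],
#     [the four per-flight descriptions])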

In [111]:
dates_month_day, links, descriptions = [], [], []
for each_li in result:
    if (' – ' in each_li.text or ' - ' in each_li.text) and each_li.find('a') is not None:
        lis_dates, lis_links, lis_descrips = extract_details(each_li)
        dates_month_day += lis_dates
        links += lis_links
        descriptions += lis_descrips
    else:
        # An <li> lacking either a date separator or a link is a duplicate
        # from a nested list or a stray link elsewhere on the page, so skip it
        continue

Sanity check the lengths of each list:


In [112]:
len(dates_month_day), len(links), len(descriptions)


Out[112]:
(1066, 1066, 1066)

Looks good! Time to make the DataFrame, which we can do by passing a Python dict:


In [113]:
df = pd.DataFrame({'date': dates_month_day, 'link': links, 'description': descriptions})

Sanity check again:


In [18]:
df.head()


Out[18]:
date description link
0 August 2 A Caproni Ca.48 crashes at Verona, Italy, duri... /wiki/1919_Verona_Caproni_Ca.48_crash
1 April 7 In the first mid-air collision of airliners, a... /wiki/First_mid-air_collision_of_airliners
2 May 14 An Air Union Farman F.60 Goliath crashes near ... /wiki/May_1923_Air_Union_Farman_Goliath_crash
3 August 27 An Air Union Farman F.60 Goliath crashes near ... /wiki/August_1923_Air_Union_Farman_Goliath_crash
4 December 24 An Imperial Airways de Havilland DH.34 crashes... /wiki/1924_Imperial_Airways_de_Havilland_DH.34...

Let's check that we handled the weird cases correctly by looking at a date that had multiple crashes:


In [19]:
df[df.date == 'September 11']


Out[19]:
date description link
222 September 11 A President Airlines Douglas DC-6 crashes shor... /wiki/1961_President_Airlines_Douglas_DC-6_crash
323 September 11 Air France Flight 1611, a Sud Aviation SE-210 ... /wiki/Air_France_Flight_1611
417 September 11 JAT Airways Flight 769, a Sud Aviation Caravel... /wiki/JAT_Airways_Flight_769
429 September 11 Eastern Air Lines Flight 212, a McDonnell Doug... /wiki/Eastern_Air_Lines_Flight_212
680 September 11 Continental Express Flight 2574, an Embraer EM... /wiki/Continental_Express_Flight_2574
825 September 11 American Airlines Flight 11, a Boeing 767-200E... /wiki/American_Airlines_Flight_11
826 September 11 United Airlines Flight 175, a Boeing 767-200 w... /wiki/United_Airlines_Flight_175
827 September 11 American Airlines Flight 77, a Boeing 757-200 ... /wiki/American_Airlines_Flight_77
828 September 11 United Airlines Flight 93, a Boeing 757-200 wi... /wiki/United_Airlines_Flight_93

This looks exactly like what we expected. Now let's follow the links so we can add in the year; we have to visit each linked page anyway to extract the additional crash details, so we'll grab the years from there.


In [20]:
df.to_csv('crashes_question_starter.csv')

In [21]:
df[['description', 'link']].to_csv('crashes_no_extra_credit.csv')

Question b: Completing the dataframe with details

I'll rate-limit my requests using the function below:


In [22]:
def try_request(url):
    # Fetch the page, then pause for a second so we don't hammer the server
    html = urlopen(url)
    time.sleep(1)

    return BeautifulSoup(html, "lxml")
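
The function has no error handling despite its name. Here is a minimal sketch of a more defensive variant (reusing the imports from the top of the notebook), assuming we want to retry transient network errors a few times before giving up; the retry count and back-off are arbitrary choices, not part of the original:

from urllib.error import HTTPError, URLError

def try_request_with_retries(url, retries=3):
    # Hypothetical helper: retry transient failures with a longer pause
    for attempt in range(retries):
        try:
            html = urlopen(url)
            time.sleep(1)  # stay polite between successful requests
            return BeautifulSoup(html, "lxml")
        except (HTTPError, URLError):
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(5)  # back off before retrying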

Extracting elements from the summary page

We will write a function to extract elements from a list of table rows (defined by <tr>).


In [23]:
def extract_summary(trs):
    # Defaults used when a field is missing from the page's summary table.
    # The running passenger count is passed to extract_numbers so that cells
    # reading "all" resolve to the number of passengers (a simplification).
    date_w_year, passengers, crew, fatalities, survivors = '', 0, 0, 0, 0
    registration, origins, destination = 'No data', 'No data', 'No data'

    for each_tr in trs:
        if each_tr.find('th', text = re.compile('Destination')) != None:
            try:
                destination = each_tr.td.text
            except:
                pass
        elif each_tr.find('th', text = re.compile('Date')) != None:
            try:
                date_w_year = each_tr.td.text   
            except:
                pass
        elif each_tr.find('th', text = re.compile('Passengers')) != None:
            try:
                passengers = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Crew')) != None:
            try:
                crew = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Fatalities')) != None:
            try:
                fatalities = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Survivors')) != None:
            try:
                survivors = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Flight origin')) != None:
            try:
                origins = each_tr.td.text
            except:
                pass
        elif each_tr.find('th', text = re.compile('Registration')) != None:
            try:
                registration = each_tr.td.text
            except:
                pass
        else:
            pass
    
    return {'destination': destination,
            'date': date_w_year,
            'passengers': passengers,
            'crew': crew,
            'fatalities': fatalities,
            'survivors': survivors,
            'origins': origins,
            'registration': registration}

Inspecting the text of each summary shows that some cells contain, in addition to (or instead of) an integer, extraneous text or just a string like "all", "unknown", or "none". Let's handle these special cases in a dedicated number-extraction function:


In [89]:
def extract_numbers(td_text, passengers):
    """
    Function that handles table data rows to extract numbers.
    Handles special cases where there are strings like all, none, etc. in the text
    """

    number_regex = re.compile(r'\d+')
    # Partial patterns so both capitalized and lowercase variants match
    all_regex = re.compile('ll')
    none_regex = re.compile('one')
    unknown_regex = re.compile('nknown')
    
    try:
        data_element = int(number_regex.findall(td_text)[0])
    except:
        if len(all_regex.findall(td_text)) >= 1:
            data_element = passengers
        elif len(none_regex.findall(td_text)) >= 1:
            data_element = 0
        elif len(unknown_regex.findall(td_text)) >= 1:
            data_element = 0
        else:
            data_element = 0
    
    return data_element
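
A few deterministic checks, where the second argument is the passenger count used as the fallback for "all":

extract_numbers('46', 0)        # -> 46 (first integer in the text)
extract_numbers('All', 52)      # -> 52 (falls back to the passenger count)
extract_numbers('None', 52)     # -> 0
extract_numbers('Unknown', 52)  # -> 0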

Scraping each page

Now let's use these functions to scrape each link.


In [125]:
# Define lists we use to store our results
dates_w_year, passengers, crew, fatalities, survivors = [], [], [], [], []
registration, origins, destination = [], [], []

for row in links:
    # Get HTML of detail page
    summary_html = try_request(base_url + row)
    trs = summary_html.find_all('tr')

    # Extract data from summary HTML
    summary = extract_summary(trs)
    
    # Save the data for this page in our lists
    dates_w_year.append(summary['date'])   
    passengers.append(summary['passengers'])
    crew.append(summary['crew'])
    fatalities.append(summary['fatalities'])
    survivors.append(summary['survivors'])
    origins.append(summary['origins'])
    registration.append(summary['registration'])
    destination.append(summary['destination'])

Let's sanity check the lengths of these lists.


In [126]:
len(destination), len(origins), len(registration), len(dates_w_year), len(passengers), len(crew), len(fatalities), len(survivors)


Out[126]:
(1066, 1066, 1066, 1066, 1066, 1066, 1066, 1066)

In [127]:
df_full = pd.DataFrame({'date': dates_w_year, 'link': links, 'description': descriptions, 'passengers': passengers,
                       'crew': crew, 'fatalities': fatalities, 'survivors': survivors,
                       'registration': registration, 'flight origin': origins, 'destination': destination})
# save all this scraped stuff!
df_full.to_csv('all_data_rescraped.csv')

In [128]:
df_full = pd.read_csv('all_data_rescraped.csv')
dates_w_year = df_full['date']

In [129]:
df_full.columns  # read_csv brought back an extra 'Unnamed: 0' index column
df_full.drop(['Unnamed: 0'], axis=1, inplace=True)

Clean up dates and format them as datetimes

The formatting of the dates is not so great, so let's just clean that up.


In [130]:
dates_w_year[0:10]


Out[130]:
0    August 2, 1919 (1919-08-02)
1                   7 April 1922
2       14 May 1923 (1923-05-14)
3    27 August 1923 (1923-08-27)
4               24 December 1924
5                 18 August 1926
6                 2 October 1926
7                 22 August 1927
8                   13 July 1928
9                   17 June 1929
Name: date, dtype: object

Let's remove commas.


In [132]:
cleaned_dates = [str(d).replace(',', '') for d in dates_w_year]

Some dates have the month first, others the day first. Let's make them consistent while also dropping the extraneous information appended to the end of some dates (like the parenthesized ISO form or reference links). We'll write our own function to parse dates because we like to do fun and cool things like that.


In [24]:
import calendar
months = list(calendar.month_name)
days = list(calendar.day_name)
dates = [str(d) for d in list(range(1, 32))]
years = [str(y) for y in list(range(1900, 2017))]

In [25]:
def parse_date_strings(text):
    # Only the first four tokens matter; anything after them (references,
    # parenthesized ISO dates) is ignored. Note that 'day' holds a weekday
    # name while 'date' holds the day of the month.
    split_row = text.split()
    month, day, year, date = '', '', '', ''

    for each in split_row[0:4]:
        if each in months:
            month = each
        elif each in days:
            day = each
        elif each in years:
            year = each
        elif each in dates:
            date = each
        else:
            pass
    return {'month': month,
           'day': day,
           'year': year,
           'date': date}


def fix_dates(datecol):
    correctedcol = []

    for row in datecol:
        parsed_date = parse_date_strings(row)
        correctedcol.append('{} {} {}'.format(parsed_date['date'],
                                              parsed_date['month'],
                                              parsed_date['year']))

    return correctedcol
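
A quick, deterministic check on the first raw date we saw above; only the first four tokens are inspected, so the parenthesized ISO form is ignored:

parse_date_strings('August 2 1919 (1919-08-02)')
# -> {'month': 'August', 'day': '', 'year': '1919', 'date': '2'}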

In [135]:
datescol = fix_dates(cleaned_dates)

In [136]:
datescol[0:5]


Out[136]:
['2 August 1919',
 '7 April 1922',
 '14 May 1923',
 '27 August 1923',
 '24 December 1924']

Our dates are now consistently formatted, so we can convert them to datetime objects:


In [137]:
dates_datetime = pd.to_datetime(datescol, format='%d %B %Y', errors='coerce')
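
Because errors='coerce' turns anything that still fails to parse into NaT, it's worth counting the failures (a quick check; the count wasn't recorded in the original run):

pd.isna(dates_datetime).sum()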

In [138]:
df_full['date'] = dates_datetime

In [139]:
df_full.head()


Out[139]:
crew date description destination fatalities flight origin link passengers registration survivors
0 2 1919-08-02 A Caproni Ca.48 crashes at Verona, Italy, duri... Taliedo, Milan, Italy 14 Venice, Italy /wiki/1919_Verona_Caproni_Ca.48_crash 12 No data 0
1 2 1922-04-07 In the first mid-air collision of airliners, a... Le Bourget, Paris 2 Croydon /wiki/First_mid-air_collision_of_airliners 0 G-EAWO 0
2 2 1923-05-14 An Air Union Farman F.60 Goliath crashes near ... Croydon, Surrey, United Kingdom 6 Le Bourget, Paris, France /wiki/May_1923_Air_Union_Farman_Goliath_crash 4 F-AEBY 0
3 2 1923-08-27 An Air Union Farman F.60 Goliath crashes near ... Croydon Airport, Surrey, United Kingdom 1 Le Bourget Airport, Paris, France /wiki/August_1923_Air_Union_Farman_Goliath_crash 11 F-AECB 12
4 1 1924-12-24 An Imperial Airways de Havilland DH.34 crashes... Le Bourget, Paris, France 8 Croydon, Surrey, United Kingdom /wiki/1924_Imperial_Airways_de_Havilland_DH.34... 7 G-EBBX 0

In [155]:
df_full = pd.DataFrame({'date': dates_datetime, 'link': links, 'description': descriptions, 'passengers': passengers,
                       'crew': crew, 'fatalities': fatalities, 'survivors': survivors,
                       'registration': registration, 'flight origin': origins, 'destination': destination})
# save all this scraped stuff!
df_full.to_csv('final_dataframe.csv')

Optional: Doing part b while handling the special cases where there are multiple summaries per page

I told you to ignore this for the sake of simplifying this assignment, but a few pages do contain multiple summaries, for example:

https://en.wikipedia.org/wiki/1950_Air_France_multiple_Douglas_DC-4_accidents

These we can separate using the dates.

However, in addition to this, there are some cases where planes crash into one another. In these cases, the summaries are separated into two tables, one for each plane, for example:

https://en.wikipedia.org/wiki/1922_Picardie_mid-air_collision

We could handle this in two ways: sum the numbers for passengers, fatalities, etc., or instead create two rows for these crashes. We'll take the first approach and sum them.

We could handle both these cases by doing the following (pseudocode):

if there are multiple tables
        get the summary details from the table matching the relevant date
else if there is one table
        sum the repeated fields within that table (this is a collision)

We will create a new function extract_summaries() that will implement this approach.


In [147]:
%pdb


Automatic pdb calling has been turned ON

In [163]:
def extract_summaries(tables, relevant_date=None):
    # relevant_date is only needed when a page has multiple infobox tables
    if len(tables) == 1:
        result = extract_single_table_summary(tables[0])
    else:
        result = extract_relevant_table_summary(tables, relevant_date)
    
    return {'destination': result['destination'],
            'date': result['date'],
            'passengers': result['passengers'],
            'crew': result['crew'],
            'fatalities': result['fatalities'],
            'survivors': result['survivors'],
            'origins': result['origins'],
            'registration': result['registration']}


def pick_out_table(tables, relevant_date):
        
    for table in tables:
        trs = table.find_all('tr')
        for each_tr in trs:
            if each_tr.find('th', text = re.compile('Date')) != None:
                
                # Clean and parse date
                date = each_tr.td.text.replace(',', '')
                parsed_date = parse_date_strings(date)
                
                if (parsed_date['month'] == relevant_date.split()[0] 
                    and parsed_date['date'] == relevant_date.split()[1]):
                    return table

    # Fall back to the first table if no date matched
    return tables[0]


def extract_relevant_table_summary(tables, relevant_date):
    date_w_year, passengers, crew, fatalities, survivors = '', 0, 0, 0, 0
    registration, origins, destination = '', '', ''

    table = pick_out_table(tables, relevant_date)
    
    trs = table.find_all('tr')
        
    for each_tr in trs:
        if each_tr.find('th', text = re.compile('Destination')) != None:
            try:
                destination = each_tr.td.text
            except:
                pass
        elif each_tr.find('th', text = re.compile('Date')) != None:
            try:
                date_w_year = each_tr.td.text   
            except:
                pass
        elif each_tr.find('th', text = re.compile('Passengers')) != None:
            try:
                passengers = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Crew')) != None:
            try:
                crew = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Fatalities')) != None:
            try:
                fatalities = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Survivors')) != None:
            try:
                survivors = extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Flight origin')) != None:
            try:
                origins = each_tr.td.text
            except:
                pass
        elif each_tr.find('th', text = re.compile('Registration')) != None:
            try:
                registration = each_tr.td.text
            except:
                pass
        else:
            continue
        
    return {'destination': destination.strip(),
            'date': date_w_year,
            'passengers': passengers,
            'crew': crew,
            'fatalities': fatalities,
            'survivors': survivors,
            'origins': origins.strip(),
            'registration': registration.strip()}


def extract_single_table_summary(table):
    # Collision pages repeat each field once per aircraft within a single
    # infobox, so numbers are accumulated (+=) and strings concatenated
    date_w_year, passengers, crew, fatalities, survivors = '', 0, 0, 0, 0
    registration, origins, destination = '', '', ''

    trs = table.find_all('tr')
        
    for each_tr in trs:
        if each_tr.find('th', text = re.compile('Destination')) != None:
            try:
                destination += '    ' + each_tr.td.text
            except:
                pass
        elif each_tr.find('th', text = re.compile('Date')) != None:
            try:
                date_w_year = each_tr.td.text   
            except:
                pass
        elif each_tr.find('th', text = re.compile('Passengers')) != None:
            try:
                passengers += extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Crew')) != None:
            try:
                crew += extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Fatalities')) != None:
            try:
                fatalities += extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Survivors')) != None:
            try:
                survivors += extract_numbers(each_tr.td.text, passengers)
            except:
                pass
        elif each_tr.find('th', text = re.compile('Flight origin')) != None:
            try:
                origins += '    ' + each_tr.td.text
            except:
                pass
        elif each_tr.find('th', text = re.compile('Registration')) != None:
            try:
                registration += '    ' + each_tr.td.text
            except:
                pass
        else:
            continue
        
    return {'destination': destination.strip(),
            'date': date_w_year,
            'passengers': passengers,
            'crew': crew,
            'fatalities': fatalities,
            'survivors': survivors,
            'origins': origins.strip(),
            'registration': registration.strip()}

Let's test with the two URLs from earlier:


In [120]:
test_collision_url = 'https://en.wikipedia.org/wiki/1922_Picardie_mid-air_collision'
summary_html = try_request(test_collision_url)

In [59]:
tables = summary_html.find_all('table', {"class" : "infobox vcard vevent"})
result_updated = extract_summaries(tables)  # one infobox here, so no date needed

In [117]:
result_updated


Out[117]:
{'crew': 4,
 'date': '7 April 1922',
 'destination': 'Croydon    Le Bourget, Paris',
 'fatalities': 7,
 'origins': 'Le Bourget, Paris    Croydon',
 'passengers': 3,
 'registration': 'F-GEAD    G-EAWO',
 'survivors': 0}

Looks like we can correctly extract a summary table with multiple aircraft in it.

Now let's try it on a page that contains multiple crashes on different days.


In [150]:
test_multiple_dates_url = 'https://en.wikipedia.org/wiki/1950_Air_France_multiple_Douglas_DC-4_accidents'
summary_html = try_request(test_multiple_dates_url)

In [151]:
first_crash = 'June 12'
second_crash = 'June 14'

In [152]:
tables = summary_html.find_all('table', {"class" : "infobox vcard vevent"})
result_updated = extract_summaries(tables, first_crash)

In [153]:
result_updated


Out[153]:
{'crew': 8,
 'date': '12 June 1950',
 'destination': 'Paris, France',
 'fatalities': 46,
 'origins': 'Saigon, Vietnam',
 'passengers': 44,
 'registration': 'F-BBDE',
 'survivors': 6}

In [154]:
result_updated = extract_summaries(tables, second_crash)

In [155]:
result_updated


Out[155]:
{'crew': 8,
 'date': '14 June 1950',
 'destination': 'Paris, France',
 'fatalities': 40,
 'origins': 'Saigon, Vietnam',
 'passengers': 45,
 'registration': 'F-BBDM',
 'survivors': 13}

In [164]:
dates_w_year, passengers, crew, fatalities, survivors = [], [], [], [], []
registration, origins, destination = [], [], []

for num_row in range(len(links)):
    # Get HTML of detail page
    summary_html = try_request(base_url + links[num_row])
    
    # Get tables that are in these sidebars (mostly one, but sometimes multiple)
    tables = summary_html.find_all('table', {"class" : ["infobox", "vcard"]})

    # Extract data from summary HTML
    summary = extract_summaries(tables, dates_month_day[num_row])
    
    # Save the data for this page in our lists
    dates_w_year.append(summary['date'])   
    passengers.append(summary['passengers'])
    crew.append(summary['crew'])
    fatalities.append(summary['fatalities'])
    survivors.append(summary['survivors'])
    origins.append(summary['origins'])
    registration.append(summary['registration'])
    destination.append(summary['destination'])

In [165]:
# Clean dates
cleaned_dates = [str(d).replace(',', '') for d in dates_w_year]
datescol = fix_dates(cleaned_dates)
dates_datetime = pd.to_datetime(datescol, format='%d %B %Y', errors='coerce')

# Build and save the final summary DataFrame
df_summary = pd.DataFrame({'date': dates_datetime, 'link': links, 'description': descriptions, 'passengers': passengers,
                       'crew': crew, 'fatalities': fatalities, 'survivors': survivors,
                       'registration': registration, 'flight origin': origins, 'destination': destination})
df_summary.to_csv('final_dataframe_summary.csv')

Question c: Which were the top 5 most deadly aviation incidents? Report the number of fatalities and the flight origin for each.


In [140]:
top_5_crashes = df_full.sort_values('fatalities', ascending=False)[0:5]

Here are the top 5 crashes, with the number of fatalities and the flight origin for each:


In [141]:
top_5_crashes[['fatalities', 'flight origin']]


Out[141]:
fatalities flight origin
584 520 Haneda Airport, Tokyo
425 346 Yesilköy Int'l Airport\nIstanbul, Turkey
581 329 Toronto (as Flight 181) Montréal-Mirabel Int'l...
513 301 Quaid-e-Azam Int'l Airport\nKarachi, Pakistan
1045 298 Amsterdam Airport Schiphol

Let's see the descriptions:


In [154]:
top_5_crashes['description']


Out[154]:
584     Japan Airlines Flight 123, a Boeing 747, crash...
425     Turkish Airlines Flight 981, a McDonnell Dougl...
581     Air India Flight 182, a Boeing 747 en route fr...
513     Saudia Flight 163, a Lockheed L-1011 Tristar, ...
1045    Malaysia Airlines Flight 17, a Boeing 777 en r...
Name: description, dtype: object

Question d: Which flight origin has the highest number of aviation incidents in the last 25 years?

It's 2016, so let's take accidents from 1991 onward and see which flight origin is most common.

The crash at positional index 672 is the first one to occur in 1991:


In [148]:
df_full.date[672]


Out[148]:
Timestamp('1991-02-01 00:00:00')

In [149]:
recent_incidents = df_full[672:]
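
Positional slicing like this is fragile; a more robust sketch filters by date directly, assuming the date column has already been converted to datetimes as above:

recent_incidents = df_full[df_full.date >= '1991-01-01']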

In [150]:
recent_incidents['flight origin'].value_counts()[0:5]


Out[150]:
No data                                                 40
Bergen Airport, Norway                                   3
Ninoy Aquino International Airport                       3
Domodedovo International Airport, Moscow                 3
Nnamdi Azikiwe International Airport, Abuja, Nigeria     2
Name: flight origin, dtype: int64

Ignoring the 'No data' rows and without de-duplicating airport names, Bergen Airport, Ninoy Aquino International Airport, and Domodedovo International Airport in Moscow tie for the highest number of aviation incidents, with three each.

Out of curiosity, let's do this for the entire dataset:


In [151]:
df_full['flight origin'].value_counts()[0:10]


Out[151]:
No data                                              127
London Heathrow Airport                               10
Los Angeles International Airport                      6
Los Angeles Int'l Airport                              5
Miami International Airport                            5
Trondheim Airport, Værnes                              4
Cairo International Airport, Egypt                     4
Ninoy Aquino International Airport                     3
Orly Airport, Paris, France                            3
Juanda International Airport, Surabaya, Indonesia      3
Name: flight origin, dtype: int64

London Heathrow and LAX (entered twice as two slightly different strings) come out on top, which is not unexpected given the volume of traffic these airports handle.

Note that one way we could proceed with de-duplication would be to use the fact that the summary tables contain links to the airports' Wikipedia pages. Strings that point to the same page refer to the same airport, so we could merge them via their common link.
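
A minimal sketch of that idea, assuming we had also captured the href of the first link in each 'Flight origin' cell during scraping (the origin_links list below is hypothetical and aligned row-by-row with df_full):

# Hypothetical: origin_links[i] is the href from row i's 'Flight origin' cell,
# e.g. '/wiki/Los_Angeles_International_Airport', or None if there was no link
canonical = {}
for name, href in zip(df_full['flight origin'], origin_links):
    if href is not None and href not in canonical:
        canonical[href] = name  # first string seen becomes the canonical name

df_full['flight origin'] = [canonical.get(href, name)
                            for name, href in zip(df_full['flight origin'], origin_links)]

After this, value_counts() would merge entries like 'Los Angeles International Airport' and 'Los Angeles Int'l Airport' into a single count.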

Question e: Output as JSON


In [152]:
df_full.to_json('crashes.json')
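
By default DataFrame.to_json writes the 'columns' orientation; if a list of one-object-per-crash records is preferred, orient='records' is an alternative (the filename here is just an example):

df_full.to_json('crashes_records.json', orient='records')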