Data Collection - Crawling Flight Crash Data

This is the first step in our project. The code below is a crawler, written with BeautifulSoup, that fetches raw HTML from planecrashinfo.com, extracts the data from the HTML tables, and writes it to a MongoDB instance running on the same machine.

The entire data pipeline is shown below:

Import the required libraries



In [1]:

    
__author__ = 'shivam_gaur'

import requests
from bs4 import BeautifulSoup
import re
import os
import pymongo
from pymongo import MongoClient
import datetime

Declare the global variables

These variables configure the crawl. Change start_year and end_year to suit your needs.



In [2]:

    
# The URL 
rooturl = "http://www.planecrashinfo.com"
url = "http://www.planecrashinfo.com/database.htm"
#change start_year to 1920 to crawl the entire dataset
start_year = 2014
end_year = 2016
year_range = range(start_year,end_year+1,1)
newurl=''

Connect to the MongoDB instance running on the same machine

The connection settings must be changed if MongoDB runs on a separate machine; see the MongoDB documentation.



In [3]:

    
# Connecting to Mongo instance
client = MongoClient()
# specify the name of the db in  brackets
db = client['aircrashdb']
# specify the name of the collection in brackets
collection = db['crawled_data']
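
If MongoDB runs on a different machine, one option is to pass a connection URI to MongoClient. The sketch below is only an illustration: the host name db.example.com and the helper names mongo_uri and connect_to_collection are hypothetical and not part of the original notebook, and 27017 is MongoDB's default port.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a MongoDB connection URI (hypothetical helper)."""
    return "mongodb://{}:{}/".format(host, port)


def connect_to_collection(host="localhost", port=27017,
                          db_name="aircrashdb", collection_name="crawled_data"):
    """Return the crawl collection on the given host (hypothetical helper)."""
    # Imported here so that mongo_uri can be used without pymongo installed.
    from pymongo import MongoClient
    client = MongoClient(mongo_uri(host, port))
    return client[db_name][collection_name]


# Example: the URI for a MongoDB server on another machine (assumed host name).
print(mongo_uri("db.example.com"))
```

Calling connect_to_collection("db.example.com") would then stand in for the three assignments in the cell above.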

Helper function to convert month from text to a number [1-12]



In [4]:

    
def getMonth(month):
    Months = ['january','february','march','april','may','june','july','august','september','october','november','december']
    month = month.lower()
    for i,value in enumerate(Months):
        if value == month:
            return i+1
    return 0 # if it is not a valid month string
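
As an alternative to a hand-written month lookup, the standard library can parse the whole date string in one call. The sketch below assumes the Date: value looks like "December 28, 2014" (month name, day, comma, year), which is the shape the parsing code later in this notebook expects. The name parse_crash_date is hypothetical. Note that %B matches month names in the current locale, so this assumes an English locale.

```python
import datetime


def parse_crash_date(value):
    """Parse a date such as 'December 28, 2014'; return None if it does not match."""
    try:
        return datetime.datetime.strptime(value.strip(), "%B %d, %Y")
    except ValueError:
        # Covers unknown dates (for example '?') and unexpected formats.
        return None


print(parse_crash_date("December 28, 2014"))
```

Returning None for unparseable input avoids building a datetime with month 0, which getMonth's fallback value would otherwise trigger.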

Helper function that takes a URL (string) as input and returns a BeautifulSoup object for that page



In [5]:

    
def makeBeautifulSoupObject(url):
    # Use a `Session` instance to customize how `requests` handles making HTTP requests.
    session = requests.Session()
    # `mount` a custom adapter that retries failed connections for HTTP and HTTPS requests, in this case- 5 times
    session.mount("http://", requests.adapters.HTTPAdapter(max_retries=5))
    session.mount("https://", requests.adapters.HTTPAdapter(max_retries=5))
    source_code = session.get(url=url)
    plain_text = source_code.text.encode('utf8')
    soup = BeautifulSoup(plain_text, "lxml")
    return soup
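
The encoding handling deserves a small illustration, since the site is described further below as using the windows-1252 character set. The sketch assumes Python 3 and uses only the standard library: bytes in windows-1252 (Python codec name cp1252) decode correctly only when that codec is named, and the HTML entity &nbsp; shows up in parsed text as the character U+00A0 rather than as the literal string "&nbsp;". The sample text and the helper name clean_cell are made up for illustration.

```python
def clean_cell(raw_bytes, encoding="cp1252"):
    """Decode raw page bytes and normalise non-breaking spaces (hypothetical helper)."""
    text = raw_bytes.decode(encoding)
    # An HTML parser turns &nbsp; into U+00A0, so that is the character to replace.
    return text.replace("\u00a0", " ").strip()


# Made-up sample containing an accented letter and a non-breaking space.
sample = "Saint-Exup\u00e9ry\u00a0Airport".encode("cp1252")
print(clean_cell(sample))
```

With requests, a similar effect can be had by setting the response's encoding attribute to 'cp1252' before reading its text attribute.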

Helper function that pushes a BeautifulSoup object (an HTML table in this case) to a MongoDB collection

Open a crash record page in your browser and look at its HTML source for reference.
The function reads the table_ input row by row and parses each value according to its key (Date, Location, Aircraft Type, and so on).
The string.encode('utf-8') call is needed because the website uses the windows-1252 character set, which garbles some characters unless the encoding is changed explicitly.
This is what the HTML table looks like:



In [6]:

    
def push_record_to_mongo(table_):
    record = {}
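    # name the parser explicitly (same one as in makeBeautifulSoupObject) to avoid the "no parser specified" warning
    table = BeautifulSoup(str(table_[0]), "lxml")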
    for tr in table.find_all("tr")[1:]:
        tds = tr.find_all("td")
        
        # encoding the 'value' string to utf-8 and removing any non-breaking space (HTML Character)
        tmp_str = tds[1].string.encode('utf-8').replace("&nbsp;", "")
        value = str(tmp_str) # this is the value- In Column #2 of the HTML table
        key = tds[0].string           # this is the key- In Column #1 of the HTML table
        
        if key == "Date:":
            dat = str(value).replace(',','').split(' ')
            date = datetime.datetime(int(dat[2]),getMonth(dat[0]),int(dat[1]))
            record["date"] = date
            
        elif key == "Time:":
            if not value == '?':
                time = re.sub("[^0-9]", "",value)
                record["time"] = time
            else:
                record["time"] = "NULL"
                
        elif key == "Location:":
            if not value == '?':
                record["loc"] = str(value)
            else:
                record["loc"] = "NULL"
                
        elif key == "Operator:":
            if not value == '?':
                record["op"] = str(value)
            else:
                record["op"] = "NULL"
                
        elif key == "Flight#:":
            if not value == '?':
                record["flight"] = str(value)
            else:
                record["flight"] = "NULL"
                
        elif key == "Route:":
            if not value == '?':
                record["route"] = str(value)
            else:
                record["route"] = "NULL"
                
        elif key == "Registration:":
            if not value == '?':
                record["reg"] = str(value)
            else:
                record["reg"] = "NULL"
                
        elif key == "cn / ln:":
            if not value == '?':
                record["cnln"] = str(value)
            else:
                record["cnln"] = "NULL"
                
        elif key == "Aboard:":
            if not value == '?' :
               s = ' '.join(value.split())
               aboard_ = s.replace('(','').replace(')','').split(' ')

               if aboard_[0] != '?':
                   record["aboard_total"] = aboard_[0]
               else:
                   record["aboard_total"] = 'NULL'

               passengers = aboard_[1].replace("passengers:","")
               if passengers != '?':
                   record["aboard_passengers"] = passengers
               else:
                   record["aboard_passengers"] = 'NULL'

               crew = aboard_[2].replace("crew:","")
               if crew != '?':
                   record["aboard_crew"] = crew
               else:
                   record["aboard_crew"] = 'NULL'
            else:
                record["aboard_total"] = 'NULL'
                record["aboard_passengers"] = 'NULL'
                record["aboard_crew"] = 'NULL'
                
        elif key == "Fatalities:":
            if not value == '?':
               s = ' '.join(value.split())
               fatalities_ = s.replace('(','').replace(')','').split(' ')

               if fatalities_[0] != '?':
                   record["fatalities_total"] = fatalities_[0]
               else:
                   record["fatalities_total"] = 'NULL'

               passengers = fatalities_[1].replace("passengers:","")
               if passengers != '?':
                   record["fatalities_passengers"] = passengers
               else:
                   record["fatalities_passengers"] = 'NULL'

               crew = fatalities_[2].replace("crew:","")
               if crew != '?':
                   record["fatalities_crew"] = crew
               else:
                   record["fatalities_crew"] = 'NULL'
                    
            else:
                # unknown fatalities: fill the fatalities_* keys (not the aboard_* keys)
                record["fatalities_total"] = 'NULL'
                record["fatalities_passengers"] = 'NULL'
                record["fatalities_crew"] = 'NULL'
                
        elif key == "Ground:":
            if not value == '?':
                record["ground"] = str(value)
            else:
                record["ground"] = "NULL"
                
        elif key == "Summary:":
            if not value == '?':
                record["summary"] = str(value)
            else:
                record["summary"] = "NULL"
                
        else:
            st1 = ''.join(tds[0].string.split()).lower()
            if not value == '?':
                record[st1] = str(value)
            else:
                record[st1] = "NULL"
                
    collection.insert_one(record)
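
The Aboard: and Fatalities: branches above share the same parsing steps, so they could be factored into one helper. The sketch below assumes values shaped like "162 (passengers:155 crew:7)", with ? for unknown parts, which is what the splitting logic above implies. The name parse_counts is hypothetical, and unlike the original it returns integers and None rather than strings and 'NULL'.

```python
def parse_counts(value, prefix):
    """Split 'total (passengers:N crew:M)' into three keyed counts (hypothetical helper)."""
    keys = [prefix + "_total", prefix + "_passengers", prefix + "_crew"]
    # Collapse whitespace and drop the parentheses, as the original code does.
    tokens = " ".join(value.split()).replace("(", "").replace(")", "").split(" ")
    if len(tokens) != 3:
        # A bare '?' or an unexpected layout: every count is unknown.
        return dict.fromkeys(keys)
    raw = [tokens[0],
           tokens[1].replace("passengers:", ""),
           tokens[2].replace("crew:", "")]
    # Unknown parts ('?') become None instead of the string 'NULL'.
    return {key: (int(tok) if tok.isdigit() else None) for key, tok in zip(keys, raw)}


print(parse_counts("162 (passengers:155 crew:7)", "aboard"))
```

Inside push_record_to_mongo, record.update(parse_counts(value, "aboard")) and record.update(parse_counts(value, "fatalities")) could then replace the two long branches.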

Crawler- The Core

MAIN IDEA: Leverage the pattern in the site's URLs.
- The hostname is the same for all years, i.e. http://<-hostname->.
- The index page for each year is http://<-hostname->/<-year->/<-year->.htm, where year is a 4-digit year from 1920 to 2016.
- Each record page is http://<-hostname->/<-year->/<-year->-<-record_number->.htm, where record_number runs from 1 to the number of crashes listed for that year.
  - http://www.planecrashinfo.com/<-year->/<-year->-<-record_number->.htm

We iterate through all the years specified at the beginning of this notebook and send an HTTP request to each URL built from the pattern described above.
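
The URL pattern can be sketched as two small functions. The root URL is the one used in this notebook; the names ROOT_URL, year_index_url and record_url are hypothetical and introduced only for illustration.

```python
ROOT_URL = "http://www.planecrashinfo.com"


def year_index_url(year, root=ROOT_URL):
    """URL of the page that lists all crashes for one year."""
    return "{0}/{1}/{1}.htm".format(root, year)


def record_url(year, record_number, root=ROOT_URL):
    """URL of a single crash record; record_number starts at 1."""
    return "{0}/{1}/{1}-{2}.htm".format(root, year, record_number)


for url in [year_index_url(2014), record_url(2014, 1), record_url(2014, 2)]:
    print(url)
```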

The code could be parallelized with the IPython.parallel library (since split out into the ipyparallel package); this is not done here for the sake of simplicity.
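
For readers who want to experiment with parallelism, here is a minimal sketch that swaps in the standard library's concurrent.futures thread pool instead of IPython.parallel. It is not what this notebook does: crawl_year is a placeholder standing in for the per-year loop body, and crawl_years_in_parallel and max_workers=4 are arbitrary illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_year(year):
    """Placeholder for the per-year work (fetch the index page, then each record)."""
    return "crawled {}".format(year)


def crawl_years_in_parallel(years, worker=crawl_year, max_workers=4):
    """Run the worker once per year on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, years))


print(crawl_years_in_parallel(range(2014, 2017)))
```

Threads are a reasonable fit here because the crawl spends most of its time waiting on network responses, although a parallel crawl should still be polite about the request rate it puts on the site.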



In [7]:

    
program_start_time = datetime.datetime.utcnow() # start of the whole crawl; used below to report the total runtime

for i in year_range:
    year_start = datetime.datetime.utcnow()
    # appending the path (year) to the url hostname
    newurl = rooturl + "/" + str(i) + "/" + str(i) + ".htm"
    soup = makeBeautifulSoupObject(newurl)
    tables = soup.find_all('table')
    print (newurl)

    for table in tables:
        #finding the no. of records for the given year
        number_of_rows = len(table.findAll(lambda tag: tag.name == 'tr' and tag.findParent('table') == table)) 
        row_range = range(1,number_of_rows,1)
        
        for j in row_range:
            # appending the row number to sub-path of the url, and building the final url that will be used for sending http request
            accident_url = newurl.replace(".htm","") + "-" + str(j) + ".htm"
            web_record = makeBeautifulSoupObject(accident_url)
            # removing all the boilerplate html code except the data table
            table_ = web_record.find_all('table')
            push_record_to_mongo(table_)

    print("Time to crawl year " + str(i) + "-" + str(datetime.datetime.utcnow()-year_start))

program_end_time = datetime.datetime.utcnow()
print ("_____________________________________")
print ("Total program time - " + str(program_end_time-program_start_time))

http://www.planecrashinfo.com/2014/2014.htm
Time to crawl year 2014-0:00:14.998000
http://www.planecrashinfo.com/2015/2015.htm
Time to crawl year 2015-0:00:09.765000
http://www.planecrashinfo.com/2016/2016.htm
Time to crawl year 2016-0:00:02.748000
_____________________________________
Total program time - 0:00:27.511000