Predictive Delay Analytics



In [1]:

    
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random

1. Data acquisition

First, let's acquire the data formated in '02_data_preparation.ipynb'. The figure below gives a glimpse of what the data set looks like.



In [2]:

    
%%time
# reads all predefined months for a year and merge into one data frame
rawData2014 = pd.DataFrame.from_csv('cache/predictionData/complete2014Data.csv')









    



Wall time: 1min 49s



In [4]:

    
print rawData2014.columns
rawData2014.head(5)









    



Index([u'index', u'FL_DATE', u'UNIQUE_CARRIER', u'TAIL_NUM', u'FL_NUM',
       u'ORIGIN', u'DEST', u'CRS_DEP_TIME', u'DEP_TIME', u'DEP_DELAY',
       u'TAXI_OUT', u'WHEELS_OFF', u'WHEELS_ON', u'TAXI_IN', u'CRS_ARR_TIME',
       u'ARR_TIME', u'ARR_DELAY', u'CANCELLED', u'CANCELLATION_CODE',
       u'AIR_TIME', u'DISTANCE', u'CARRIER_DELAY', u'WEATHER_DELAY',
       u'NAS_DELAY', u'SECURITY_DELAY', u'LATE_AIRCRAFT_DELAY',
       u'AIRCRAFT_YEAR', u'AIRCRAFT_MFR', u'LAT', u'LONG'],
      dtype='object')






    Out[4]:






  
    
      
      index
      FL_DATE
      UNIQUE_CARRIER
      TAIL_NUM
      FL_NUM
      ORIGIN
      DEST
      CRS_DEP_TIME
      DEP_TIME
      DEP_DELAY
      ...
      DISTANCE
      CARRIER_DELAY
      WEATHER_DELAY
      NAS_DELAY
      SECURITY_DELAY
      LATE_AIRCRAFT_DELAY
      AIRCRAFT_YEAR
      AIRCRAFT_MFR
      LAT
      LONG
    
  
  
    
      0
      0
      2014-01-01
      AA
      N338AA
      1
      JFK
      LAX
      900
      914
      14
      ...
      2475
      NaN
      NaN
      NaN
      NaN
      NaN
      1987
      BOEING
      40.633333
      -73.783333
    
    
      1
      1
      2014-01-02
      AA
      N338AA
      1
      JFK
      LAX
      900
      857
      -3
      ...
      2475
      NaN
      NaN
      NaN
      NaN
      NaN
      1987
      BOEING
      40.633333
      -73.783333
    
    
      2
      2
      2014-01-03
      AA
      N323AA
      1
      JFK
      LAX
      900
      NaN
      NaN
      ...
      2475
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      40.633333
      -73.783333
    
    
      3
      3
      2014-01-04
      AA
      N327AA
      1
      JFK
      LAX
      900
      1005
      65
      ...
      2475
      0
      59
      0
      0
      0
      1986
      BOEING
      40.633333
      -73.783333
    
    
      4
      4
      2014-01-05
      AA
      N323AA
      1
      JFK
      LAX
      900
      1050
      110
      ...
      2475
      0
      110
      0
      0
      0
      NaN
      NaN
      40.633333
      -73.783333
    
  

5 rows × 30 columns

Cleaning the data

We will focus on flights between New York and Chicago. When cleaning the data set, we have to remove the following entries:

flights that have been cancelled or diverted. We focus on predicting the delay. As a result, we also remove the columns associated with diverted flights.
colmuns that give the answer. This is the case of many colmuns related to the arrival of the plane
rows where a value is missing
flights whose destination or origin do not correspond to our study aim

Note that data points have to be cleaned in this order because most flights have empty entries for the 'diverted' columns.



In [5]:

    
#entries to be dropped in the analysis
columns_dropped = ['index', 'TAIL_NUM', 'FL_NUM', 'DEP_TIME', 'DEP_DELAY', 'TAXI_OUT', 'WHEELS_OFF', \
                   'WHEELS_ON', 'TAXI_IN', 'ARR_TIME', 'CANCELLED', 'CANCELLATION_CODE', 'AIR_TIME', \
                   'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']



In [6]:

    
def clean(data, list_col):
    ''' 
    Creates a dataset by excluding undesirable columns contained in list_col

    Parameters:
    -----------

    data: pandas.DataFrame
       Flight dataframe  

    list_col: <list 'string'>
        Comumns to exclude from the data set
    '''
    
    data.drop(data[data.CANCELLED == 0].index, inplace=True)
    data.drop(list_col, axis=1, inplace=True)
    data.dropna(axis = 0, inplace = True)
    return



In [ ]:

    
%%time
data2014 = clean(rawData2014.copy(), columns_dropped)
print data2014.columns

We memorize the cleaned dataset. This will save us some processing time. The two boxes below save the dataset and recover it from a 'cache/predictionData/' folder.



In [ ]:

    
#%%time
# save the data to avoid computing them again
#file_path = "cache/predictionData/predictionData2014.csv"
#data2014.to_csv(path_or_buf= file_path)



In [2]:

    
#%%time
# recover data2014 from cache/predictionData folder
#file_path = "cache/predictionData/predictionData2014.csv"
#data2014 = pd.read_csv(file_path)
#data2014.drop('Unnamed: 0', axis= 1, inplace = True)
#data2014.head(3)



In [6]:

    
# test that clean did the job
print "size of raw data set: ", len(rawData2014)
print "number of cancelled: ", len(rawData2014[(rawData2014.CANCELLED == 1)])
print "size of data set: ", len(data2014)









    



size of raw data set:  5819811
number of cancelled:  126984
size of data set:  4103597

Restricting the dataset: Case study NYC - Chicago flights

The dataset has more than 4 millions entries, which makes any data manipulation extremely costly - let alone model fitting. We will therefore make some restrictions on the airports and the airlines considered.

One way to do it is to make a case study, an example that we will follow all along. Here, the example looks at a flight from New York to Chicago.

Comment : it is possible to work on a different data set or a different example. However, due to the high complexity of the data set - both by a large number of points and by a large number of high-dimensional categorical variables, it is easier to work on a shrinked version of it.



In [3]:

    
dexample = data2014[data2014.DEST == 'ORD'][(data2014.ORIGIN == 'JFK') | (data2014.ORIGIN == 'LGA') | (data2014.ORIGIN == 'EWR')].copy()
print len(dexample)                                            
dexample.head(3)









    



8095






    



C:\Users\Estienne\Anaconda\lib\site-packages\pandas\core\frame.py:1825: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)






    Out[3]:






  
    
      
      FL_DATE
      UNIQUE_CARRIER
      ORIGIN
      DEST
      CRS_DEP_TIME
      CRS_ARR_TIME
      ARR_DELAY
      DISTANCE
      AIRCRAFT_YEAR
      AIRCRAFT_MFR
      LAT
      LONG
    
  
  
    
      2033
      2014-01-01
      AA
      LGA
      ORD
      700
      850
      -17
      733
      1991
      MCDONNELL DOUGLAS
      40.766667
      -73.866667
    
    
      2034
      2014-01-04
      AA
      LGA
      ORD
      700
      850
      32
      733
      2007
      FRIEDEMANN JON
      40.766667
      -73.866667
    
    
      2035
      2014-01-05
      AA
      LGA
      ORD
      700
      850
      11
      733
      2007
      FRIEDEMANN JON
      40.766667
      -73.866667

We are only interested in the carrier that operates from New York to Chicago. Looking at the table, we also notice that Atlantic Southeast Airlines (airline code EV) is only marginally present. So we drop it from the list of carriers we will study in addition to the other carriers that do not operate on the line.



In [4]:

    
dexample.groupby('UNIQUE_CARRIER').size()









    Out[4]:





UNIQUE_CARRIER
AA     126
B6     955
EV       2
OO     191
UA    6821
dtype: int64



In [5]:

    
# In case we want to tackle a broader case, we can use this function. The droplist can be any list of carrier.
def restrict_carrier(data, droplist):
    ''' 
    Drop carriers from the data set according to broadlist

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    droplist: <list 'string'>
        List of carriers to be droppped
    '''        
    
    for item in droplist:
        data.drop(data[data.UNIQUE_CARRIER == item].index, inplace= True)
    return



In [6]:

    
%%time
drop_airline = [ 'AS','DL', 'EV', 'F9', 'FL', 'HA', 'MQ', 'US', 'VX', 'WN']
restrict_carrier(dexample, drop_airline)









    



Wall time: 80 ms

In case we're doing a general study, the second step after droppping airlines is to drop airport. This is the purpose of the function below.

WARNING \ RUN THIS CELL ONLY ONCE! Because airports are linked to one another, everytime the function restrict_airport is run, some entries are dropped. After some time, there is no more entries.

To understand this, let's imagine that Boston Logan airport has an annual flight with a small airport XYZ. When we run the function the fisrt time, airport XYZ is droppped. Because of that, the total count of flights for Boston Logan airport decreases as well. If we run the function a second time, removing airport XYZ and its flights with Boston Logan has put the number of flights associated with Boston Logan under the threshold!



In [7]:

    
#remove all airports that have an annual traffic under threshold

def restrict_airport(data, threshold):
    ''' 
    Drop carriers from the data set.

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    droplist: <list 'string'>
        List of carriers to be droppped
    '''     
    
    dict_count = data.groupby("DEST").agg(['count']).LAT.to_dict()['count']
    
    for key in dict_count:
        if dict_count[key] < threshold:
            data.drop(data[data.DEST == key].index, inplace=True)
            data.drop(data[data.ORIGIN == key].index, inplace=True)
    
    print data.groupby("DEST").agg(['count']).LAT.to_dict()['count']
    
    return

Extracting month and day

The date is given as a string in the format "Year-Month-Day". We exctract the month and the weekday in the following.

WARNING \ RUN THIS CELL ONLY ONCE! Because airports are linked to one another,



In [8]:

    
from time import strptime
days = {0:"Mon", 1:"Tues", 2:"Wed", 3:"Thurs", 4:"Fri", 5:"Sat", 6:"Sun"}
months = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"June", 7:"July", 8:"Aug", 9:"Sep", \
          10:"Oct", 11:"Nov", 12:"Dec"}



In [9]:

    
def adjust_time(data):
    ''' 
    Return the month list and the weekday list out of a DataFrame.

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe that contains a colum of flight dates
    ''' 
    
    monlist = np.empty(len(data), dtype = str)
    daylist = np.empty(len(data), dtype = str)
    
    for i in xrange(len(data)):
        date= strptime(data.FL_DATE.iloc[i], "%Y-%M-%d")
        monlist[i] = months[date.tm_min]
        daylist[i] = days[date.tm_wday]

    return monlist, daylist



In [10]:

    
%%time
# now add the weekday and the month columns created by the adjust_time
monlist, daylist = adjust_time(dexample)
print "OK"
dexample['MONTH'] = pd.Series(monlist, index=dexample.index)
dexample['DAY'] = pd.Series(daylist, index=dexample.index)
if 'FL_DATE' in dexample.columns:
    dexample.drop('FL_DATE', axis = 1, inplace= True)
print dexample.columns









    



OK
Index([u'UNIQUE_CARRIER', u'ORIGIN', u'DEST', u'CRS_DEP_TIME', u'CRS_ARR_TIME',
       u'ARR_DELAY', u'DISTANCE', u'AIRCRAFT_YEAR', u'AIRCRAFT_MFR', u'LAT',
       u'LONG', u'MONTH', u'DAY'],
      dtype='object')
Wall time: 511 ms

Adjusting numerical data

Let's change teh format of all time entries. Instead of having hour : minute, we convert everything in minutes.

WARNING \ RUN THIS CELL ONLY ONCE!



In [11]:

    
%%time
ti = lambda x : x/100 + x % 100
dexample['CRS_ARR_TIME_COR'] = dexample.CRS_ARR_TIME.map(ti)
dexample['CRS_DEP_TIME_COR'] = dexample.CRS_DEP_TIME.map(ti)
dexample.drop(['CRS_DEP_TIME', 'CRS_ARR_TIME'], axis = 1, inplace = True)









    



Wall time: 46 ms

We need to center and normalize all continuous data for better results. The idea is that if a feature X has a range of variation that is considerably higher than a feature Y, the variations of X may completely mask the variations of Y and therefore make Y useless.



In [12]:

    
def normalize(array):
    mean = np.mean(array)
    std = np.std(array)
    return [(x - mean)/std for x in array]



In [13]:

    
def normalize_data(data, feature_list):
    ''' 
    Normalize data.

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    feature_list: <list 'string'>
        List of features to be normalized
    '''           

    for feature in feature_list:
        if feature in data.columns:
            data[feature + '_NOR'] = normalize(data[feature].values)
            data.drop(feature, axis =1, inplace=True)
    return



In [14]:

    
dexample.drop(dexample[dexample.AIRCRAFT_YEAR =='    '].index, inplace = True)
dexample['AIRCRAFT_YEAR_COR'] = dexample.AIRCRAFT_YEAR.map(lambda x: int(x))
dexample.drop('AIRCRAFT_YEAR', axis = 1, inplace = True)



In [15]:

    
normalize_data(dexample, ['DISTANCE', 'LAT', 'LONG', 'CRS_ARR_TIME_COR', 'CRS_DEP_TIME_COR', 'AIRCRAFT_YEAR_COR'])



In [17]:

    
dexample.head(3)









    Out[17]:






  
    
      
      UNIQUE_CARRIER
      ORIGIN
      DEST
      ARR_DELAY
      AIRCRAFT_MFR
      MONTH
      DAY
      DISTANCE_NOR
      LAT_NOR
      LONG_NOR
      CRS_ARR_TIME_COR_NOR
      CRS_DEP_TIME_COR_NOR
      AIRCRAFT_YEAR_COR_NOR
    
  
  
    
      2033
      AA
      LGA
      ORD
      -17
      MCDONNELL DOUGLAS
      J
      W
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      -1.713677
    
    
      2034
      AA
      LGA
      ORD
      32
      FRIEDEMANN JON
      J
      S
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      1.129411
    
    
      2035
      AA
      LGA
      ORD
      11
      FRIEDEMANN JON
      J
      S
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      1.129411

Encode categorical variables

Many of the features are categorical - for example the origin airport feature contains strings which represent the code of each airport. To make a classification/regression possible, we need to encode these features as series of indicators. More specifically, if the origin airport feature has n possibles values A1, A2, ... , An, we create n indicator functions I(A1), ..., I(An). Indicator function I(Ak) takes the value 1 if the feature origin airport has value Ak.



In [18]:

    
encoded_list = ['UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'AIRCRAFT_MFR', 'MONTH','DAY']
dexample = pd.get_dummies(dexample, columns=encoded_list)



In [20]:

    
dexample.head(3)









    Out[20]:






  
    
      
      ARR_DELAY
      DISTANCE_NOR
      LAT_NOR
      LONG_NOR
      CRS_ARR_TIME_COR_NOR
      CRS_DEP_TIME_COR_NOR
      AIRCRAFT_YEAR_COR_NOR
      UNIQUE_CARRIER_AA
      UNIQUE_CARRIER_B6
      UNIQUE_CARRIER_OO
      ...
      MONTH_J
      MONTH_M
      MONTH_N
      MONTH_O
      MONTH_S
      DAY_F
      DAY_M
      DAY_S
      DAY_T
      DAY_W
    
  
  
    
      2033
      -17
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      -1.713677
      1
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      0
      0
      1
    
    
      2034
      32
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      1.129411
      1
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      1
      0
      0
    
    
      2035
      11
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      1.129411
      1
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      1
      0
      0
    
  

3 rows × 42 columns

Let's save all of our results.



In [21]:

    
%%time
# save the restricted data to avoid computing them again
file_path = "cache/predictionData/dexample.csv"
dexample.to_csv(path_or_buf= file_path)









    



Wall time: 311 ms



In [47]:

    
#%%time
# recover file
#file_path = "cache/predictionData/dexample.csv"
#dexample= pd.read_csv(file_path)
#dexample.drop('Unnamed: 0', axis= 1, inplace = True)
#print dexample.columns

1. Random Forest regressor

Our goal is to anticipate the delay of a flight given its features. One way to proceed is to train a random forest regressor. But first, let's look at the baseline predictor, that is the predictor that returns the mean of data set. The measure of accuracy of a predictor will be its mean square error against a test set distinct from the training set.



In [102]:

    
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error



In [106]:

    
def baseline_predictor(data):
    return np.mean(data.ARR_DELAY.values)

print "MSE of the mean predictor:" , \
mean_squared_error(dexample.ARR_DELAY.values, [baseline_predictor(dexample) for x in xrange(len(dexample))])









    



 MSE of the mean predictor: 2359.68904617

Split data into training/test sets

First, let's split the data set into a training set and a test set.



In [66]:

    
def split(data, list_drop, target, test_size):
    ''' 
    Splits the data into a training and a test set
    Separates the training and test sets according to a feature set and a target set
    Balance the features sets by retaining only fraction of its points

    Parameters:
    -----------

    data: pandas.DataFrame
       Flight dataframe  

    list_drop: <list 'string'>
        List of columns to exclude from the features set
        
    target: string
        target column along whch we make the target set
        
    test_size: float
        size of the test set
    
    '''    
    
    #split the dataset into a training set and a test set
    dtrain, dtest = train_test_split(data, test_size = 0.3)
    
    Xtrain = dtrain.drop(list_drop, axis=1).values
    ytrain = dtrain[target].values
    Xtest = dtest.drop(list_drop, axis=1).values
    ytest = dtest[target].values
    
    return Xtrain, ytrain, Xtest, ytest

Useful functions

The functions below satisfy various tasks. The first trains a classifier and estimates its scores on the test set and the training set. The second tries different parameter values for the regressor. The last one wraps everything.



In [69]:

    
def score_random_forest(Xtrain, ytrain, Xtest, ytest, n_trees=10, max_features='auto'):
    ''' 
    Fits a random forest with (Xtrain ,ytrain)
    Computes the score on (Xtest, ytest)

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: int
        number of trees in the forest
    
    max_features: string or int
        number of features used for every tree
        
    Outputs:
    --------
    
    score_train: float
        score on the train set
    
    score_test: float
        score on the test set
    
    clf.feature_importances_
        weights of each feature as used by the classifier
    
    ''' 

    clf= RandomForestRegressor(n_estimators=n_trees, max_features= max_features)
    clf.fit(Xtrain, ytrain)
    
    score_train = mean_squared_error(clf.predict(Xtrain), ytrain)
    score_test = mean_squared_error(clf.predict(Xtest), ytest)
    
    return  score_train, score_test, clf



In [70]:

    
def best_parameters(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features):
    ''' 
    Fits sequentially random forest classifiers
    Adds each test score in a pandas.DataFrame with the number of trees, the loss function, the train score,
    and the importance of each features
    Returns a DataFrame with all scores

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: <list int>
        list of numbers of trees in the forest
    
    nb_features: <list int>
        list of number of features in the forest
        
    Outputs:
    --------
    
    score_tab: pandas.DataFrame
        DataFrame of scores with associated parameters
    
    '''
    
    score_tab = pd.DataFrame(columns=['nb_trees', 'nb_features', 'test_score', 'train_score', 'classifier'])
    
    # counter will increment the index in score_tab
    counter = 0 

    for n_estimators in nb_trees:
        for max_features in nb_features:

            score_train, score_test, classifier = \
            score_random_forest(Xtrain, ytrain, Xtest, ytest, n_trees=n_estimators, max_features=max_features) 
            score_tab.loc[counter] = [n_estimators, max_features, score_test, score_train, classifier]
            counter += 1

    return score_tab



In [71]:

    
def classify_random_forest(data, list_drop, target, test_size=0.4, nb_trees=[10], nb_features = ['auto']):
    Xtrain, ytrain, Xtest, ytest = split(data, list_drop, target, test_size)
    scores =  best_parameters(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features)
    return scores

Parameter dashboard and run cell

We test the following parameters.

the number of trees in the forest
the number of features retained in each tree
the test size



In [107]:

    
nb_trees = [25, 50, 75, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
nb_features = ['auto', 'log2', 'sqrt']
test_size = 0.4



In [108]:

    
%%time
randomForest2014 =  classify_random_forest(dexample, ['ARR_DELAY'], 'ARR_DELAY', test_size=test_size, nb_trees=nb_trees, nb_features=nb_features)









    



Wall time: 15min 22s



In [125]:

    
randomForest2014.head()









    Out[125]:






  
    
      
      nb_trees
      nb_features
      test_score
      train_score
      classifier
    
  
  
    
      0
      25
      auto
      2921.001980
      644.331707
      (DecisionTreeRegressor(criterion='mse', max_de...
    
    
      1
      25
      log2
      2886.270747
      613.948677
      (DecisionTreeRegressor(criterion='mse', max_de...
    
    
      2
      25
      sqrt
      2836.819022
      622.683047
      (DecisionTreeRegressor(criterion='mse', max_de...
    
    
      3
      50
      auto
      2835.219505
      617.548415
      (DecisionTreeRegressor(criterion='mse', max_de...
    
    
      4
      50
      log2
      2845.566965
      605.315175
      (DecisionTreeRegressor(criterion='mse', max_de...



In [109]:

    
# save file to /data/ folder
file_path = "cache/predictionData/randomForest2014.csv"
randomForest2014.to_csv(path_or_buf= file_path)

MSE by size of the forest



In [127]:

    
import matplotlib.pyplot as plt
from matplotlib.figure import Figure



In [137]:

    
# MSE on test set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['auto', 'sqrt', 'log2'], ['b', 'g', 'y']):
    x = randomForest2014[randomForest2014.nb_features == key].nb_trees.values
    y = randomForest2014[randomForest2014.nb_features == key].test_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylabel("Mean Square Error")
plt.xlabel("Number of features")
plt.title("MSE on test set by nb of features and nb of trees")
plt.legend(loc = 1)

plt.show()



In [139]:

    
# MSE on train set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['auto', 'sqrt', 'log2'], ['b', 'g', 'y']):
    x = randomForest2014[randomForest2014.nb_features == key].nb_trees.values
    y = randomForest2014[randomForest2014.nb_features == key].train_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylabel("Mean Square Error")
plt.xlabel("Number of features")
plt.title("MSE on train set by nb of features and nb of trees")
plt.legend(loc = 1)

plt.show()

2. Random forest classifier

We will make prediction on the variable 'ARR_DEL15'. This variable takes the value 1 is the plane is more than 15 minutes late and 0 if not. Let's look at the baseline classifier, that is the classifiers that assign repectively 1 or 0 to 'ARR_DEL15' for every flight.

Here we look at the case whether a flight will be more than 15 minutes late. So we adjust the ARR_DELAY colum to an indicator.



In [22]:

    
%%time
dexample['ARR_DELAY_COR'] = dexample.ARR_DELAY.map(lambda x: (x >= 15))
dexample.drop('ARR_DELAY', axis = 1, inplace = True)
file_path = "cache/predictionData/dexample_class.csv"
dexample.to_csv(path_or_buf= file_path)
dexample_class = dexample.copy()









    



Wall time: 1.18 s



In [ ]:

    
#file_path = "cache/predictionData/dexample_class.csv"
#dexample_class = pd.read_csv(path_or_buf= file_path)
#dexample_class.drop('Unnamed: 0', axis= 1, inplace = True)



In [23]:

    
dexample_class.head(3)









    Out[23]:






  
    
      
      DISTANCE_NOR
      LAT_NOR
      LONG_NOR
      CRS_ARR_TIME_COR_NOR
      CRS_DEP_TIME_COR_NOR
      AIRCRAFT_YEAR_COR_NOR
      UNIQUE_CARRIER_AA
      UNIQUE_CARRIER_B6
      UNIQUE_CARRIER_OO
      UNIQUE_CARRIER_UA
      ...
      MONTH_M
      MONTH_N
      MONTH_O
      MONTH_S
      DAY_F
      DAY_M
      DAY_S
      DAY_T
      DAY_W
      ARR_DELAY_COR
    
  
  
    
      2033
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      -1.713677
      1
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      1
      False
    
    
      2034
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      1.129411
      1
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      1
      0
      0
      True
    
    
      2035
      0.68881
      1.058684
      0.786579
      0.88939
      -1.046771
      1.129411
      1
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      1
      0
      0
      False
    
  

3 rows × 42 columns



In [24]:

    
from __future__ import division

def baseline_class(data, target):
    ''' 
    Compute the baseline classifiers along a target variable for a data set data

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    target: string
        Column of data along wich we compute the baseline classifiers
    '''    
    
    
    score_baseline_1 = np.size(data[data[target] == 1][target].values) / np.size(data[target].values)
    score_baseline_0 = np.size(data[data[target] == 0][target].values) / np.size(data[target].values)
    
    print "baseline classifier everyone to 0: ", int(score_baseline_0*100) , "%"
    print "baseline classifier everyone to 1: ", int(score_baseline_1*100) , "%"
   
    return score_baseline_0, score_baseline_1



In [25]:

    
baseline_class(dexample_class, 'ARR_DELAY_COR')









    



baseline classifier everyone to 0:  70 %
baseline classifier everyone to 1:  29 %






    Out[25]:





(0.7088213306753053, 0.2911786693246947)

Prediction functions



In [34]:

    
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

Random forest functions for a classification task



In [27]:

    
def split(data, list_drop, target, test_size):
    ''' 
    Splits the data into a training and a test set
    Separates the training and test sets according to a feature set and a target set
    Balance the features sets by retaining only fraction of its points

    Parameters:
    -----------

    data: pandas.DataFrame
       Flight dataframe  

    list_drop: <list 'string'>
        List of columns to exclude from the features set
        
    target: string
        target column along whch we make the target set
        
    test_size: float
        size of the test set
    
    '''    
    
    #split the dataset into a training set and a test set
    dtrain, dtest = train_test_split(data, test_size = 0.3)
    
    Xtrain = dtrain.drop(list_drop, axis=1).values
    ytrain = dtrain[target].values
    Xtest = dtest.drop(list_drop, axis=1).values
    ytest = dtest[target].values
    
    return Xtrain, ytrain, Xtest, ytest



In [35]:

    
def score_random_forest_classifier(Xtrain, ytrain, Xtest, ytest, n_trees=10, max_features='auto'):
    ''' 
    Fits a random forest with (Xtrain ,ytrain)
    Computes the score on (Xtest, ytest)

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: int
        number of trees in the forest
    
    max_features: string or int
        number of features used for every tree
        
    Outputs:
    --------
    
    score_train: float
        score on the train set
    
    score_test: float
        score on the test set
    
    clf.feature_importances_
        weights of each feature as used by the classifier
    
    ''' 

    clf= RandomForestClassifier(n_estimators=n_trees, max_features= max_features)
    clf.fit(Xtrain, ytrain)
    
    score_train = clf.score(Xtrain, ytrain)
    score_test = clf.score(Xtest, ytest)
    
    return  score_train, score_test, clf



In [36]:

    
def best_parameters_classifier(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features):
    ''' 
    Fits sequentially random forest classifiers
    Adds each test score in a pandas.DataFrame with the number of trees, the loss function, the train score,
    and the importance of each features
    Returns a DataFrame with all scores

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: <list int>
        list of numbers of trees in the forest
    
    nb_features: <list int>
        list of number of features in the forest
        
    Outputs:
    --------
    
    score_tab: pandas.DataFrame
        DataFrame of scores with associated parameters
    
    '''
    
    score_tab = pd.DataFrame(columns=['nb_trees', 'nb_features', 'test_score', 'train_score', 'classifier'])
    
    # counter will increment the index in score_tab
    counter = 0 

    for n_estimators in nb_trees:
        for max_features in nb_features:

            score_train, score_test, classifier = \
            score_random_forest_classifier(Xtrain, ytrain, Xtest, ytest, n_trees=n_estimators, max_features=max_features) 
            score_tab.loc[counter] = [n_estimators, max_features, score_test, score_train, classifier]
            counter += 1

    return score_tab



In [37]:

    
def classify_random_forest_class(data, list_drop, target, test_size=0.4, nb_trees=[10], nb_features = ['auto']):
    Xtrain, ytrain, Xtest, ytest = split(data, list_drop, target, test_size)
    scores =  best_parameters_classifier(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features)
    return scores

Classification task

We look at various paramters for the classifier. Note that there are two loss functions available for the random forest classifier, a Gini loss and an entropy loss. Simulations show they behave similarly!



In [99]:

    
nb_trees = [25, 50, 75, 100, 150, 200, 300, 400, 500, 750, 1000, 2000, 5000]
nb_features = ['sqrt', 'log2']
test_size = 0.4



In [100]:

    
%%time
randomForest2014_class =  classify_random_forest_class(dexample, ['ARR_DELAY_COR'], 'ARR_DELAY_COR', test_size=test_size, nb_trees=nb_trees, nb_features=nb_features)









    



Wall time: 28min 59s



In [90]:

    
randomForest2014_class.head(9)









    Out[90]:






  
    
      
      nb_trees
      nb_features
      test_score
      train_score
      classifier
    
  
  
    
      0
      25
      sqrt
      0.652824
      0.934318
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      1
      25
      log2
      0.648671
      0.934140
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      2
      50
      sqrt
      0.656561
      0.937344
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      3
      50
      log2
      0.657392
      0.937344
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      4
      75
      sqrt
      0.660714
      0.937878
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      5
      75
      log2
      0.658223
      0.937878
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      6
      100
      sqrt
      0.661960
      0.937878
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      7
      100
      log2
      0.661545
      0.937700
      (DecisionTreeClassifier(class_weight=None, cri...
    
    
      8
      150
      sqrt
      0.654900
      0.937878
      (DecisionTreeClassifier(class_weight=None, cri...



In [101]:

    
# save file to /data/ folder
file_path = "cache/predictionData/randomForest2014_class.csv"
randomForest2014_class.to_csv(path_or_buf= file_path)

Analysis

Let's now look at the importance coefficients, that the average usage of each coefficients in the random forest. As we can see, the main factors in the delay of a flight are the age of the aircraft, the time of the departure and the time of the origin. The fact that the departure and arrival time confirm what we observed in the explorative analysis. However, the role of the age of the aircraft was not very marked in the explorative analysis. Lastly, the weekday has a huge influence.

Notice that the airline and the aircraft model play a very limited role in the delay according to this random forest classifier.



In [106]:

    
features = dexample_class.drop('ARR_DELAY_COR', axis =1).columns
coeffs_class = randomForest2014_class.classifier[len(randomForest2014_class)-1].feature_importances_
u = pd.Series(coeffs_class, index=features)
u.sort(inplace=True, ascending=False)
print u









    



AIRCRAFT_YEAR_COR_NOR                          0.285870
CRS_ARR_TIME_COR_NOR                           0.217206
CRS_DEP_TIME_COR_NOR                           0.212159
DAY_T                                          0.027670
DAY_S                                          0.026794
DAY_W                                          0.022723
DAY_F                                          0.021133
DAY_M                                          0.020281
MONTH_J                                        0.017575
MONTH_A                                        0.015214
MONTH_M                                        0.013477
MONTH_O                                        0.013426
AIRCRAFT_MFR_AIRBUS INDUSTRIE                  0.013183
AIRCRAFT_MFR_BOEING                            0.012232
MONTH_S                                        0.012192
MONTH_F                                        0.011634
MONTH_N                                        0.010826
MONTH_D                                        0.009118
AIRCRAFT_MFR_AIRBUS                            0.007589
AIRCRAFT_MFR_EMBRAER                           0.003948
LAT_NOR                                        0.003597
DISTANCE_NOR                                   0.003498
LONG_NOR                                       0.003483
ORIGIN_LGA                                     0.002964
ORIGIN_EWR                                     0.002883
UNIQUE_CARRIER_UA                              0.002150
AIRCRAFT_MFR_EMBRAER S A                       0.001437
UNIQUE_CARRIER_OO                              0.001044
UNIQUE_CARRIER_B6                              0.001036
ORIGIN_JFK                                     0.001016
AIRCRAFT_MFR_MCDONNELL DOUGLAS                 0.000818
UNIQUE_CARRIER_AA                              0.000802
AIRCRAFT_MFR_CIRRUS DESIGN CORP                0.000400
AIRCRAFT_MFR_CESSNA                            0.000220
AIRCRAFT_MFR_PIPER                             0.000207
AIRCRAFT_MFR_ROBINSON HELICOPTER CO            0.000111
AIRCRAFT_MFR_BELL                              0.000034
AIRCRAFT_MFR_LEBLANC GLENN T                   0.000029
AIRCRAFT_MFR_MARZ BARRY                        0.000021
DEST_ORD                                       0.000000
AIRCRAFT_MFR_FRIEDEMANN JON                    0.000000
dtype: float64

Score by size of the forest



In [93]:

    
import matplotlib.pyplot as plt
from matplotlib.figure import Figure



In [107]:

    
# Score on test set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['sqrt', 'log2'], ['b', 'g']):
    x = randomForest2014_class[randomForest2014_class.nb_features == key].nb_trees.values
    y = randomForest2014_class[randomForest2014_class.nb_features == key].test_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylabel("Score")
plt.xlabel("Number of features")
plt.title("Score on test set by nb of features and nb of trees")
plt.legend(loc = 4)

plt.show()



In [111]:

    
# Score on train set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['sqrt', 'log2'], ['b', 'g']):
    x = randomForest2014_class[randomForest2014_class.nb_features == key].nb_trees.values
    y = randomForest2014_class[randomForest2014_class.nb_features == key].train_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylim(0.935, 0.95)
plt.ylabel("Score")
plt.xlabel("Number of features")
plt.title("Score on train set by nb of features and nb of trees")
plt.legend(loc = 4)

plt.show()

3. Predicting flight delay time with Linear Regression

To allow users to enjoy more than a classfication experience we want to give them an expected delay time in minutes.



In [19]:

    
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random
import requests
import zipfile
import StringIO
import re
import json
import os
import matplotlib
import matplotlib.pyplot as plt

# sklearn functions used for the linear regression model
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_absolute_error

First, let us again establish a baseline which has to be beaten by our model. To get a feeling for a good baseline we pick flights from from New York(all airports) to Chicago(all airports).



In [30]:

    
# first step is to load the actual data and exclude rows that are unnecessary
# a script that produces the csv file can be found in the src folder
# (it might take some while to run, on my macbook up to one hour)
print('loading data...')
bigdf = pd.read_csv('cache/Big5FlightTable.csv')

years = ['2010', '2011', '2012', '2013', '2014']









    



loading data...



In [31]:

    
# specify here which cities should be investigated
city_from = 'New York, NY'
city_to = 'Chicago, IL'

# filter for cities
bigdf = bigdf[(bigdf.ORIGIN_CITY_NAME == city_from) & (bigdf.DEST_CITY_NAME == city_to)]

3.1. A first predictor

For a span of days we are interested in knoweledge, which flight we should take. Therefore let's choose the popular date for Christmas returning flights 21.12.2015 as example. As we want to use historical data, let's get first clear what data we have. Is it possible to compare flights over the years?



In [32]:

    
query_day = 21
query_month = 12

# how many flights do exist in all years?
flights = []
flightvalues = []
for y in years:
    query = list(bigdf[(bigdf.YEAR == int(y)) & (bigdf.MONTH == query_month) & (bigdf.DAY_OF_MONTH == query_day)].FL_NUM.astype(int).unique())
    flights.append(query)
    flightvalues += query
    
# build a matrix
data_matrix = np.zeros((len(flightvalues), len(years)))
# build dict
flightdict = dict(zip(flightvalues, np.arange(0, len(flightvalues))))

# fill datamatrix
for i in xrange(len(years)):
    for j in flights[i]:
        data_matrix[flightdict[j], i] = 1.
        
# plot matrix
plt.figure(figsize=(10, 6))
plt.imshow(data_matrix, extent=[0,data_matrix.shape[0],0,data_matrix.shape[1] * 5], \
           interpolation='none', cmap='gray')









    Out[32]:





<matplotlib.image.AxesImage at 0x234cb07d0>

As the plot shows, flight numbers are not stable over the years and it is not trivial to find a matching. Thus comparison by features for individual flights is not really possible. It seems as if airlines change their flight numbers on a yearly basis. Thus we need to come up with another idea and use a latent variable based approach instead. One of the easiest ideas is to average the delay time over each day as a base classifier and refine then.



In [34]:

    
dffordate = bigdf[bigdf.MONTH == query_month]
dffordate = dffordate[dffordate.DAY_OF_MONTH == query_day]
dffordate.head()

def predict_base_model(X):
    return np.array([dffordate.ARR_DELAY.mean()]*X.shape[0])

To test the quality of this model, we use the last year as test set and the previous as train data. The idea is, that we are always interested in predicting the next year somehow. Thus, if the match for 2014 is good, we expect it to be the same for 2015.



In [70]:

    
# build test/train set
df_train = dffordate[dffordate.YEAR != int(years[-1])]
df_test = dffordate[dffordate.YEAR == int(years[-1])]

y_train = df_train.ARR_DELAY
X_train = y_train # here dummy
y_test = df_test.ARR_DELAY
X_test = y_test # here dummy

In the base model, the prediction for 2014 is that we are going to be 27.72 minutes late



In [71]:

    
y_pred = predict_base_model(X_test)
y_pred[0]









    Out[71]:





27.728260869565219

How good did it perform comparing the actual arrival delay?



In [72]:

    
def rmse(y, y_pred):
    return np.sqrt(((y - y_pred)**2).mean())

def mas(y, y_pred):
    return (np.abs(y - y_pred)).mean()

MAS_base = mas(y_test, y_pred)
RMSE_base = rmse(y_test, y_pred)
RMSE_base









    Out[72]:





44.10756003440288

Using the root mean squared error as one (of many possible) measures, any model that we develop should beat the benchmark of 44.

3.2 Building a linear regression model for prediction

As flight delay changes over the time of day like explored in the data exploration part, we introduce a new feature which models in a categorical variable 10 minute time windows. I.e. for each window we introduce a latent variable that captures some sort of delay influence of this frame. This is done for both the departure and the arrival time. The model here is first developed for the reduced dataset containing the flights between New York / Chicago only.



In [41]:

    
%%time
bigdf['HOUR_OF_ARR'] = 0
bigdf['HOUR_OF_DEP'] = 0

for index, row in bigdf.iterrows():
    bigdf.set_value(index, 'HOUR_OF_ARR', int(row['ARR_TIME']) / 10)
    bigdf.set_value(index, 'HOUR_OF_DEP', int(row['DEP_TIME']) / 10)









    



CPU times: user 3.05 s, sys: 212 ms, total: 3.27 s
Wall time: 3.37 s

In the next step, before fitting the actual model categorical variables need to be encoded as binary features (they have no order!) and numerical features shall be normalized.



In [44]:

    
# split data into numerical and categorical features
numericalFeat = bigdf[['DISTANCE', 'AIRCRAFT_AGE']].astype('float') 
categoricalFeat = bigdf[['MONTH', 'DAY_OF_MONTH', 'ORIGIN', 'DEST', \
                         'HOUR_OF_ARR', 'HOUR_OF_DEP', 'UNIQUE_CARRIER', \
                         'DAY_OF_WEEK', 'AIRCRAFT_MFR']]

Dealing with categorical variables

Luckily, sklearn has a routine to do this for us. Yet, as it is only able to handle integers we first reindex all categorical features we create lookup tables for the values. This turned out to be one of the slowest step in processing.



In [46]:

    
%%time
# for the next step, all features need to be encoded as integers --> create lookup Tables!
def transformToID(df, col):
    vals = df[col].unique()
    LookupTable = dict(zip(vals, np.arange(len(vals))))
    for key in LookupTable.keys():
        df[df[col] == key] = LookupTable[key]
    return (LookupTable, df)

mfrDict, categoricalFeat = transformToID(categoricalFeat, 'AIRCRAFT_MFR')
originDict, categoricalFeat = transformToID(categoricalFeat, 'ORIGIN')
destDict, categoricalFeat = transformToID(categoricalFeat, 'DEST')
carrierDict, categoricalFeat = transformToID(categoricalFeat, 'UNIQUE_CARRIER')









    



/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/usr/local/lib/python2.7/site-packages/pandas/core/indexing.py:415: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s






    



CPU times: user 12.9 s, sys: 104 ms, total: 13 s
Wall time: 13.8 s



In [47]:

    
categoricalFeat.head()









    Out[47]:






  
    
      
      MONTH
      DAY_OF_MONTH
      ORIGIN
      DEST
      HOUR_OF_ARR
      HOUR_OF_DEP
      UNIQUE_CARRIER
      DAY_OF_WEEK
      AIRCRAFT_MFR
    
  
  
    
      921
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      922
      1
      1
      1
      1
      1
      1
      1
      1
      1
    
    
      923
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      924
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      925
      2
      2
      2
      2
      2
      2
      2
      2
      2

Using sklearn the variables are now encoded. As we have many variables (around 1200 in the final model for the whole year) using sparse matrices was critical for us to fit the model.



In [48]:

    
import sklearn
from sklearn import linear_model
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error



In [49]:

    
%%time 
encoder = OneHotEncoder() 
categoricals_encoded = encoder.fit_transform(categoricalFeat)









    



CPU times: user 48.2 ms, sys: 10 ms, total: 58.3 ms
Wall time: 59.6 ms

Recombining categorical and numerical features allows to start the usual Machine Learning routine (split into test/train, normalize, train/cross-validate, test)



In [51]:

    
numericals_sparse = sparse.csr_matrix(numericalFeat)
# get data matrix & response variable
X_all = sparse.hstack((numericals_sparse, categoricals_encoded))
y_all = bigdf['ARR_DELAY'].values

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.2, random_state = 42)

# we have 2 numerical features
X_train_numericals = X_train[:, 0:3].toarray()
X_test_numericals = X_test[:, 0:3].toarray()

# use sklearn tools to standardize numerical features...
scaler = StandardScaler() 
scaler.fit(X_train_numericals) # get std/mean from train set
X_train_numericals = sparse.csr_matrix(scaler.transform(X_train_numericals)) 
X_test_numericals = sparse.csr_matrix(scaler.transform(X_test_numericals))

# update sets
X_train[:, 0:3] = X_train_numericals
X_test[:, 0:3] = X_test_numericals









    



/usr/local/lib/python2.7/site-packages/scipy/sparse/compressed.py:739: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)

Ordinal least squares regression



In [54]:

    
clf = sklearn.linear_model.LinearRegression()

# fit the model!
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
rmse(y_test, y_pred)









    Out[54]:





48.323925698812779

In a first model, we use OLS to get a feeling of what is achievable using a reduced dataset. The motivation for this is mainly to see whether this might be used to speedup the whole process. As the RMSE reveals, we did not do better than the base classifier. What happened?



In [57]:

    
X_train.shape, clf.coef_.shape









    Out[57]:





((31881, 74), (74,))

Looking at the number of variables (74 here), it might be that either the fit is not good enough or we are ignoring the dependency of flight delays within different routes.

Ridge regression



In [62]:

    
%%time
# Use ridge regression (i.e. Gaussian prior) and vary the lambda parameter using Grid search
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV

SGD_params = {'alpha': 10.0 ** -np.arange(-2,8)}
SGD_model = GridSearchCV(SGDRegressor(random_state = 42), \
                         SGD_params, scoring = 'mean_absolute_error', cv = 6) # cross validate 6 times


# train the model, this might take some time...
SGD_model.fit(X_train, y_train)









    



CPU times: user 2.01 s, sys: 52 ms, total: 2.06 s
Wall time: 2.14 s



In [63]:

    
y_pred = SGD_model.predict(X_test)
rmse(y_test, y_pred)









    Out[63]:





48.389335267933092

Looking at the RMSE again is a bit disappointing. We did worse than OLS and the base classifier! This means, it is now time to tackle the unavoidable: Regressing the whole dataset!

As the data becomes rapidly huge (for 1 year ~ 700MB uncompressed CSV, 5 years ~ 3.5GB, 10 years ~ 7.2GB) the code to perform the actual regression has been developed first in an IPython notebook and then run separately. It can be found in the src folder.

Regression over the whole data

In the following 2 models based on 1, 5 years of data are used (saved in BigFlightTable.csv, Big5FlightTable.csv).



In [131]:

    
import json
import seaborn as sns
import matplotlib.ticker as ticker

def load_model(filename):
    mdl = None
    with open(filename, 'r') as f:
        mdl = json.load(f)
    return mdl

# load model results
mdl1 = load_model('results/models/2014model.json') # model with 2014 data (1 year)
mdl5 = load_model('results/models/2010_2014model.json') # model with 2010-2014 data ( 5 years)


labels = np.array(['base predictor', '1 year LM', '5 year LM'])
RMSEs = np.array([RMSE_base, mdl1['RMSE'], mdl5['RMSE']])
MASs = np.array([MAS_base, mdl1['MAS'], mdl5['MAS']])

width = .35
xx = np.arange(len(RMSEs))
plt.bar(xx-width, RMSEs, width, label='RMSE', color=sns.color_palette()[0])
plt.bar(xx , MASs, width, label='MAE', color=sns.color_palette()[2])
plt.legend(loc='best')

# Hide major tick labels
ax = plt.gca()
ax.xaxis.set_major_formatter(ticker.NullFormatter())

# Customize minor tick labels
ax.xaxis.set_minor_locator(ticker.FixedLocator([i for i in xx]))
ax.xaxis.set_minor_formatter(ticker.FixedFormatter([labels[i] for i in xx]))

Comparing the RMSEs and Mean absolute error we see that the linear regression model over the whole data clearly outperformed the baseline predictor. Also, more data did not really tremendously improve the mean absolute error but we can see a clear variance reduction in the RMSE. To further improve the model, additional variables should be included. One of the next ideas is weather data. As formatted historic weather data is unfortunately not for free available and scraping it for 100Ks entries per year is not feasible, the variables could not be added to the model. Another interesting idea might be to include some more meta data. I.e. an variable measuring the effect of holidays and variables accounting for winds which could be based by geographic location (i.e. a flight in western direction is longer than one in eastern direction b/c of the passat wind.

3.3 Application: Predicting the best flight for a given date and route



In [132]:

    
from datetime import timedelta, datetime, tzinfo
from pytz import timezone
import pytz

def convertToUTC(naive, zonestring="America/New_York"):
    local = pytz.timezone (zonestring)
    local_dt = local.localize(naive, is_dst=None)
    return local_dt.astimezone (pytz.utc)

In the following, we want to give an example on how to perform a prediction using the model from 2010-2014. We choose the route Chicago / New York.



In [133]:

    
# we are only interested in flights NY to Chicago!
city_to = 'Chicago, IL'
city_from = 'New York, NY'
zone_to = 'America/Chicago'
zone_from = 'America/New_York'

Especially before Christmas, it is a good question to ask on which day you should fly and which flight to take. Therefore, we consider only flights between 20.12-24.12. Our assumption is that many values we feed in our predictor do not change, i.e. we can use the data from 2014 (stored in BigFlightTable.csv). For a productive system, these queries should be performed using a database.



In [134]:

    
# need to create a lookup table for the values (i.e. flight numbers, city and so on and then drop all duplicates!)
db = pd.read_csv('cache/BigFlightTable.csv')
# remove all unnecessary columns
db = db[['ORIGIN_CITY_NAME', 'DEST_CITY_NAME', 'AIRCRAFT_AGE', 'DEST', 'ARR_TIME', \
         'DEP_TIME', 'UNIQUE_CARRIER', 'DAY_OF_WEEK', 'AIRCRAFT_MFR', 'FL_NUM', 'MONTH', \
         'DAY_OF_MONTH', 'DISTANCE', 'ORIGIN']]
print str(db.count()[0]) + ' entries'
db.head()

# drop everything except for the 5 days before christmas! i.e. 20.12, 21.12, 22.12, 23.12, 24.12.
db = db[db.MONTH == 12]
db = db[db.DAY_OF_MONTH <= 24]
db = db[db.DAY_OF_MONTH >= 20]

print '5 days have ' + str(db.count()[0]) + ' flights'

db = db[db.ORIGIN_CITY_NAME == city_from]
db = db[db.DEST_CITY_NAME == city_to]
db.reset_index(inplace=True)

print 'Found ' + str(db.count()[0]) + ' flights from ' + city_from + ' to ' + city_to + ' for 20.12 - 24.12'









    



4376277 entries
5 days have 58334 flights
Found 87 flights from New York, NY to Chicago, IL for 20.12 - 24.12

We know load our favourite model and setup the categorical variable encoder.



In [135]:

    
mdl = load_model('results/models/2014model.json')

# categorical feature encoder, fitted on the keys
encoder = OneHotEncoder(sparse=True, n_values=mdl['encoder']['values'])

Using the lookuptables, it is straight-forward to write the prediction function. Note that variabales need to be normalized according to the training data used in the model.



In [136]:

    
# input is a datarow
# prediction of day in the next year!
def predictDelayTime(row, mdl):
    
    s_mean, s_std, coeff, intercept = mdl['scaler_mean'], mdl['scaler_std'], mdl['coeff'], mdl['intercept']
    
    # read out tables
    carrierTable = mdl['CARRIER']
    mfrTable = mdl['MANUFACTURER']
    destTable = mdl['DEST']
    originTable = mdl['ORIGIN']
    
    distance = row['DISTANCE'] # <-- look this up!
    aircraft_age = row['AIRCRAFT_AGE'] # <-- look this up!
    
    # normalize numerical features according to scaler
    distance = (distance - s_mean[0]) / s_std[0]
    aircraft_age = (aircraft_age + 1 - s_mean[1]) / s_std[1]
    
    month = row['MONTH']
    day_of_month = row['DAY_OF_MONTH'] 
    origin = row['ORIGIN']
    dest = row['DEST']
    
    hour_of_arr = int(row['ARR_TIME']) / 10
    hour_of_dep = int(row['DEP_TIME']) / 10
    carrier = row['UNIQUE_CARRIER']
    day_of_week = datetime(year=2015, month=row.MONTH, day=row.DAY_OF_MONTH).weekday() # <-- get via datetimeobject
    mfr = row['AIRCRAFT_MFR']
    
    # for nonindexed categorical features, do lookup!
    origin = originTable[origin]
    dest = destTable[dest]
    mfr = mfrTable[mfr]
    carrier = carrierTable[carrier]

    # write into df
    df = {}
    df['MONTH'] = month
    df['DAY_OF_MONTH'] = day_of_month
    df['ORIGIN'] = origin
    df['DEST'] = dest
    df['HOUR_OF_ARR'] = hour_of_arr
    df['HOUR_OF_DEP'] = hour_of_dep
    df['UNIQUE_CARRIER'] = carrier
    df['DAY_OF_WEEK'] = day_of_week
    df['AIRCRAFT_MFR'] = mfr
    df = pd.DataFrame([df])

    # order here is important! make sure it is the same as in the model!
    categoricalFeat = df[['MONTH', 'DAY_OF_MONTH', 'ORIGIN', 
                    'DEST', 'HOUR_OF_ARR', 'HOUR_OF_DEP', 
                    'UNIQUE_CARRIER', 'DAY_OF_WEEK', 'AIRCRAFT_MFR']].copy() # Categorical features
    
    # construct the data vector for the linear model
    categoricals_encoded = encoder.fit_transform(categoricalFeat)
    num_features = np.array([distance, aircraft_age])
    cat_features = categoricals_encoded.toarray().T.ravel()
    w = np.hstack([num_features, cat_features])

    y_pred = np.dot(w, coeff) + intercept
    
    return y_pred[0]

We are now ready to predict the delay time on our flights!



In [137]:

    
# create for each day info
db['PREDICTED_DELAY'] = 0.
db['FLIGHT_TIME'] = 0
db['PREDICTED_FLIGHT_TIME'] = 0
for index, row in db.iterrows():
    print 'processing {idx}'.format(idx=index)
    y_pred = predictDelayTime(row, mdl)
    db.set_value(index, 'PREDICTED_DELAY', y_pred)

    arr_time = datetime(year=2015, month=row['MONTH'], day=row['DAY_OF_MONTH'], \
                        hour= int(row['ARR_TIME'] / 100), minute=int(row['ARR_TIME'] % 100))
    dep_time = datetime(year=2015, month=row['MONTH'], day=row['DAY_OF_MONTH'], \
                        hour= int(row['DEP_TIME'] / 100), minute=int(row['DEP_TIME'] % 100))
    
    flight_time_in_min =  (convertToUTC(arr_time) - convertToUTC(dep_time))
    flight_time_in_min = int(flight_time_in_min.total_seconds() / 60)
    
    db.set_value(index, 'FLIGHT_TIME', flight_time_in_min)
    db.set_value(index, 'PREDICTED_FLIGHT_TIME', y_pred + flight_time_in_min)









    



processing 0
processing 1
processing 2
processing 3
processing 4
processing 5
processing 6
processing 7
processing 8
processing 9
processing 10
processing 11
processing 12
processing 13
processing 14
processing 15
processing 16
processing 17
processing 18
processing 19
processing 20
processing 21
processing 22
processing 23
processing 24
processing 25
processing 26
processing 27
processing 28
processing 29
processing 30
processing 31
processing 32
processing 33
processing 34
processing 35
processing 36
processing 37
processing 38
processing 39
processing 40
processing 41
processing 42
processing 43
processing 44
processing 45
processing 46
processing 47
processing 48
processing 49
processing 50
processing 51
processing 52
processing 53
processing 54
processing 55
processing 56
processing 57
processing 58
processing 59
processing 60
processing 61
processing 62
processing 63
processing 64
processing 65
processing 66
processing 67
processing 68
processing 69
processing 70
processing 71
processing 72
processing 73
processing 74
processing 75
processing 76
processing 77
processing 78
processing 79
processing 80
processing 81
processing 82
processing 83
processing 84
processing 85
processing 86

What is the best flight on 22nd December?

Using this info, we can get the best flights for the 22nd of December



In [140]:

    
db2 = db[db.DAY_OF_MONTH == 22]
dbres2 = db2.sort('PREDICTED_FLIGHT_TIME')
dbres2.to_csv('data/best_flights.csv')

On what day is it best to fly in the evening?

Analogously, we now want our model to give us the answer which is the best flight during the busy evening hours (17:00-20:00)



In [141]:

    
db3 = db[(db.DEP_TIME > 1700) & (db.DEP_TIME < 2000)]
dbres3 = db2.sort('PREDICTED_FLIGHT_TIME')
dbres3.head()









    Out[141]:






  
    
      
      index
      ORIGIN_CITY_NAME
      DEST_CITY_NAME
      AIRCRAFT_AGE
      DEST
      ARR_TIME
      DEP_TIME
      UNIQUE_CARRIER
      DAY_OF_WEEK
      AIRCRAFT_MFR
      FL_NUM
      MONTH
      DAY_OF_MONTH
      DISTANCE
      ORIGIN
      PREDICTED_DELAY
      FLIGHT_TIME
      PREDICTED_FLIGHT_TIME
    
  
  
    
      19
      4156108
      New York, NY
      Chicago, IL
      2
      MDW
      738
      627
      WN
      1
      BOEING
      490
      12
      22
      725
      LGA
      -5.391739
      71
      65
    
    
      59
      4235745
      New York, NY
      Chicago, IL
      16
      ORD
      916
      757
      UA
      1
      AIRBUS
      463
      12
      22
      733
      LGA
      0.540536
      79
      79
    
    
      18
      4156107
      New York, NY
      Chicago, IL
      14
      MDW
      1423
      1304
      WN
      1
      BOEING
      333
      12
      22
      725
      LGA
      7.796969
      79
      86
    
    
      74
      4237736
      New York, NY
      Chicago, IL
      16
      ORD
      845
      714
      UA
      1
      BOEING
      1094
      12
      22
      733
      LGA
      0.609249
      91
      91
    
    
      66
      4237531
      New York, NY
      Chicago, IL
      16
      ORD
      746
      610
      UA
      1
      AIRBUS
      683
      12
      22
      733
      LGA
      -2.842361
      96
      93

Surprisingly, the best day to fly for in the evening is the 22nd!

	index	FL_DATE	UNIQUE_CARRIER	TAIL_NUM	FL_NUM	ORIGIN	DEST	CRS_DEP_TIME	DEP_TIME	DEP_DELAY	...	DISTANCE	CARRIER_DELAY	WEATHER_DELAY	NAS_DELAY	SECURITY_DELAY	LATE_AIRCRAFT_DELAY	AIRCRAFT_YEAR	AIRCRAFT_MFR	LAT	LONG
0	0	2014-01-01	AA	N338AA	1	JFK	LAX	900	914	14	...	2475	NaN	NaN	NaN	NaN	NaN	1987	BOEING	40.633333	-73.783333
1	1	2014-01-02	AA	N338AA	1	JFK	LAX	900	857	-3	...	2475	NaN	NaN	NaN	NaN	NaN	1987	BOEING	40.633333	-73.783333
2	2	2014-01-03	AA	N323AA	1	JFK	LAX	900	NaN	NaN	...	2475	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.633333	-73.783333
3	3	2014-01-04	AA	N327AA	1	JFK	LAX	900	1005	65	...	2475	0	59	0	0	0	1986	BOEING	40.633333	-73.783333
4	4	2014-01-05	AA	N323AA	1	JFK	LAX	900	1050	110	...	2475	0	110	0	0	0	NaN	NaN	40.633333	-73.783333

	FL_DATE	UNIQUE_CARRIER	ORIGIN	DEST	CRS_DEP_TIME	CRS_ARR_TIME	ARR_DELAY	DISTANCE	AIRCRAFT_YEAR	AIRCRAFT_MFR	LAT	LONG
2033	2014-01-01	AA	LGA	ORD	700	850	-17	733	1991	MCDONNELL DOUGLAS	40.766667	-73.866667
2034	2014-01-04	AA	LGA	ORD	700	850	32	733	2007	FRIEDEMANN JON	40.766667	-73.866667
2035	2014-01-05	AA	LGA	ORD	700	850	11	733	2007	FRIEDEMANN JON	40.766667	-73.866667

	UNIQUE_CARRIER	ORIGIN	DEST	ARR_DELAY	AIRCRAFT_MFR	MONTH	DAY	DISTANCE_NOR	LAT_NOR	LONG_NOR	CRS_ARR_TIME_COR_NOR	CRS_DEP_TIME_COR_NOR	AIRCRAFT_YEAR_COR_NOR
2033	AA	LGA	ORD	-17	MCDONNELL DOUGLAS	J	W	0.68881	1.058684	0.786579	0.88939	-1.046771	-1.713677
2034	AA	LGA	ORD	32	FRIEDEMANN JON	J	S	0.68881	1.058684	0.786579	0.88939	-1.046771	1.129411
2035	AA	LGA	ORD	11	FRIEDEMANN JON	J	S	0.68881	1.058684	0.786579	0.88939	-1.046771	1.129411

	nb_trees	nb_features	test_score	train_score	classifier
0	25	auto	2921.001980	644.331707	(DecisionTreeRegressor(criterion='mse', max_de...
1	25	log2	2886.270747	613.948677	(DecisionTreeRegressor(criterion='mse', max_de...
2	25	sqrt	2836.819022	622.683047	(DecisionTreeRegressor(criterion='mse', max_de...
3	50	auto	2835.219505	617.548415	(DecisionTreeRegressor(criterion='mse', max_de...
4	50	log2	2845.566965	605.315175	(DecisionTreeRegressor(criterion='mse', max_de...

	nb_trees	nb_features	test_score	train_score	classifier
0	25	sqrt	0.652824	0.934318	(DecisionTreeClassifier(class_weight=None, cri...
1	25	log2	0.648671	0.934140	(DecisionTreeClassifier(class_weight=None, cri...
2	50	sqrt	0.656561	0.937344	(DecisionTreeClassifier(class_weight=None, cri...
3	50	log2	0.657392	0.937344	(DecisionTreeClassifier(class_weight=None, cri...
4	75	sqrt	0.660714	0.937878	(DecisionTreeClassifier(class_weight=None, cri...
5	75	log2	0.658223	0.937878	(DecisionTreeClassifier(class_weight=None, cri...
6	100	sqrt	0.661960	0.937878	(DecisionTreeClassifier(class_weight=None, cri...
7	100	log2	0.661545	0.937700	(DecisionTreeClassifier(class_weight=None, cri...
8	150	sqrt	0.654900	0.937878	(DecisionTreeClassifier(class_weight=None, cri...

	MONTH	DAY_OF_MONTH	ORIGIN	DEST	HOUR_OF_ARR	HOUR_OF_DEP	UNIQUE_CARRIER	DAY_OF_WEEK	AIRCRAFT_MFR
921	0	0	0	0	0	0	0	0	0
922	1	1	1	1	1	1	1	1	1
923	0	0	0	0	0	0	0	0	0
924	0	0	0	0	0	0	0	0	0
925	2	2	2	2	2	2	2	2	2

	index	ORIGIN_CITY_NAME	DEST_CITY_NAME	AIRCRAFT_AGE	DEST	ARR_TIME	DEP_TIME	UNIQUE_CARRIER	DAY_OF_WEEK	AIRCRAFT_MFR	FL_NUM	MONTH	DAY_OF_MONTH	DISTANCE	ORIGIN	PREDICTED_DELAY	FLIGHT_TIME	PREDICTED_FLIGHT_TIME
19	4156108	New York, NY	Chicago, IL	2	MDW	738	627	WN	1	BOEING	490	12	22	725	LGA	-5.391739	71	65
59	4235745	New York, NY	Chicago, IL	16	ORD	916	757	UA	1	AIRBUS	463	12	22	733	LGA	0.540536	79	79
18	4156107	New York, NY	Chicago, IL	14	MDW	1423	1304	WN	1	BOEING	333	12	22	725	LGA	7.796969	79	86
74	4237736	New York, NY	Chicago, IL	16	ORD	845	714	UA	1	BOEING	1094	12	22	733	LGA	0.609249	91	91
66	4237531	New York, NY	Chicago, IL	16	ORD	746	610	UA	1	AIRBUS	683	12	22	733	LGA	-2.842361	96	93

	MONTH	DAY_OF_MONTH	ORIGIN	DEST	HOUR_OF_ARR	HOUR_OF_DEP	UNIQUE_CARRIER	DAY_OF_WEEK	AIRCRAFT_MFR
921	0	0	0	0	0	0	0	0	0
922	1	1	1	1	1	1	1	1	1
923	0	0	0	0	0	0	0	0	0
924	0	0	0	0	0	0	0	0	0
925	2	2	2	2	2	2	2	2	2

	MONTH	DAY_OF_MONTH	ORIGIN	DEST	HOUR_OF_ARR	HOUR_OF_DEP	UNIQUE_CARRIER	DAY_OF_WEEK	AIRCRAFT_MFR
921	0	0	0	0	0	0	0	0	0
922	1	1	1	1	1	1	1	1	1
923	0	0	0	0	0	0	0	0	0
924	0	0	0	0	0	0	0	0	0
925	2	2	2	2	2	2	2	2	2