Predictive Delay Analytics


In [1]:
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random

1. Data acquisition

First, let's acquire the data formated in '02_data_preparation.ipynb'. The figure below gives a glimpse of what the data set looks like.


In [2]:
%%time
# reads all predefined months for a year and merge into one data frame
rawData2014 = pd.DataFrame.from_csv('cache/predictionData/complete2014Data.csv')


Wall time: 1min 49s

In [4]:
print rawData2014.columns
rawData2014.head(5)


Index([u'index', u'FL_DATE', u'UNIQUE_CARRIER', u'TAIL_NUM', u'FL_NUM',
       u'ORIGIN', u'DEST', u'CRS_DEP_TIME', u'DEP_TIME', u'DEP_DELAY',
       u'TAXI_OUT', u'WHEELS_OFF', u'WHEELS_ON', u'TAXI_IN', u'CRS_ARR_TIME',
       u'ARR_TIME', u'ARR_DELAY', u'CANCELLED', u'CANCELLATION_CODE',
       u'AIR_TIME', u'DISTANCE', u'CARRIER_DELAY', u'WEATHER_DELAY',
       u'NAS_DELAY', u'SECURITY_DELAY', u'LATE_AIRCRAFT_DELAY',
       u'AIRCRAFT_YEAR', u'AIRCRAFT_MFR', u'LAT', u'LONG'],
      dtype='object')
Out[4]:
index FL_DATE UNIQUE_CARRIER TAIL_NUM FL_NUM ORIGIN DEST CRS_DEP_TIME DEP_TIME DEP_DELAY ... DISTANCE CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY AIRCRAFT_YEAR AIRCRAFT_MFR LAT LONG
0 0 2014-01-01 AA N338AA 1 JFK LAX 900 914 14 ... 2475 NaN NaN NaN NaN NaN 1987 BOEING 40.633333 -73.783333
1 1 2014-01-02 AA N338AA 1 JFK LAX 900 857 -3 ... 2475 NaN NaN NaN NaN NaN 1987 BOEING 40.633333 -73.783333
2 2 2014-01-03 AA N323AA 1 JFK LAX 900 NaN NaN ... 2475 NaN NaN NaN NaN NaN NaN NaN 40.633333 -73.783333
3 3 2014-01-04 AA N327AA 1 JFK LAX 900 1005 65 ... 2475 0 59 0 0 0 1986 BOEING 40.633333 -73.783333
4 4 2014-01-05 AA N323AA 1 JFK LAX 900 1050 110 ... 2475 0 110 0 0 0 NaN NaN 40.633333 -73.783333

5 rows × 30 columns

Cleaning the data

We will focus on flights between New York and Chicago. When cleaning the data set, we have to remove the following entries:

  • flights that have been cancelled or diverted. We focus on predicting the delay. As a result, we also remove the columns associated with diverted flights.
  • colmuns that give the answer. This is the case of many colmuns related to the arrival of the plane
  • rows where a value is missing
  • flights whose destination or origin do not correspond to our study aim

Note that data points have to be cleaned in this order because most flights have empty entries for the 'diverted' columns.


In [5]:
#entries to be dropped in the analysis
columns_dropped = ['index', 'TAIL_NUM', 'FL_NUM', 'DEP_TIME', 'DEP_DELAY', 'TAXI_OUT', 'WHEELS_OFF', \
                   'WHEELS_ON', 'TAXI_IN', 'ARR_TIME', 'CANCELLED', 'CANCELLATION_CODE', 'AIR_TIME', \
                   'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']

In [6]:
def clean(data, list_col):
    ''' 
    Creates a dataset by excluding undesirable columns contained in list_col

    Parameters:
    -----------

    data: pandas.DataFrame
       Flight dataframe  

    list_col: <list 'string'>
        Comumns to exclude from the data set
    '''
    
    data.drop(data[data.CANCELLED == 0].index, inplace=True)
    data.drop(list_col, axis=1, inplace=True)
    data.dropna(axis = 0, inplace = True)
    return

In [ ]:
%%time
data2014 = clean(rawData2014.copy(), columns_dropped)
print data2014.columns

We memorize the cleaned dataset. This will save us some processing time. The two boxes below save the dataset and recover it from a 'cache/predictionData/' folder.


In [ ]:
#%%time
# save the data to avoid computing them again
#file_path = "cache/predictionData/predictionData2014.csv"
#data2014.to_csv(path_or_buf= file_path)

In [2]:
#%%time
# recover data2014 from cache/predictionData folder
#file_path = "cache/predictionData/predictionData2014.csv"
#data2014 = pd.read_csv(file_path)
#data2014.drop('Unnamed: 0', axis= 1, inplace = True)
#data2014.head(3)

In [6]:
# test that clean did the job
print "size of raw data set: ", len(rawData2014)
print "number of cancelled: ", len(rawData2014[(rawData2014.CANCELLED == 1)])
print "size of data set: ", len(data2014)


size of raw data set:  5819811
number of cancelled:  126984
size of data set:  4103597

Restricting the dataset: Case study NYC - Chicago flights

The dataset has more than 4 millions entries, which makes any data manipulation extremely costly - let alone model fitting. We will therefore make some restrictions on the airports and the airlines considered.

One way to do it is to make a case study, an example that we will follow all along. Here, the example looks at a flight from New York to Chicago.

Comment : it is possible to work on a different data set or a different example. However, due to the high complexity of the data set - both by a large number of points and by a large number of high-dimensional categorical variables, it is easier to work on a shrinked version of it.


In [3]:
dexample = data2014[data2014.DEST == 'ORD'][(data2014.ORIGIN == 'JFK') | (data2014.ORIGIN == 'LGA') | (data2014.ORIGIN == 'EWR')].copy()
print len(dexample)                                            
dexample.head(3)


8095
C:\Users\Estienne\Anaconda\lib\site-packages\pandas\core\frame.py:1825: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[3]:
FL_DATE UNIQUE_CARRIER ORIGIN DEST CRS_DEP_TIME CRS_ARR_TIME ARR_DELAY DISTANCE AIRCRAFT_YEAR AIRCRAFT_MFR LAT LONG
2033 2014-01-01 AA LGA ORD 700 850 -17 733 1991 MCDONNELL DOUGLAS 40.766667 -73.866667
2034 2014-01-04 AA LGA ORD 700 850 32 733 2007 FRIEDEMANN JON 40.766667 -73.866667
2035 2014-01-05 AA LGA ORD 700 850 11 733 2007 FRIEDEMANN JON 40.766667 -73.866667

We are only interested in the carrier that operates from New York to Chicago. Looking at the table, we also notice that Atlantic Southeast Airlines (airline code EV) is only marginally present. So we drop it from the list of carriers we will study in addition to the other carriers that do not operate on the line.


In [4]:
dexample.groupby('UNIQUE_CARRIER').size()


Out[4]:
UNIQUE_CARRIER
AA     126
B6     955
EV       2
OO     191
UA    6821
dtype: int64

In [5]:
# In case we want to tackle a broader case, we can use this function. The droplist can be any list of carrier.
def restrict_carrier(data, droplist):
    ''' 
    Drop carriers from the data set according to broadlist

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    droplist: <list 'string'>
        List of carriers to be droppped
    '''        
    
    for item in droplist:
        data.drop(data[data.UNIQUE_CARRIER == item].index, inplace= True)
    return

In [6]:
%%time
drop_airline = [ 'AS','DL', 'EV', 'F9', 'FL', 'HA', 'MQ', 'US', 'VX', 'WN']
restrict_carrier(dexample, drop_airline)


Wall time: 80 ms

In case we're doing a general study, the second step after droppping airlines is to drop airport. This is the purpose of the function below.

WARNING \ RUN THIS CELL ONLY ONCE! Because airports are linked to one another, everytime the function restrict_airport is run, some entries are dropped. After some time, there is no more entries.

To understand this, let's imagine that Boston Logan airport has an annual flight with a small airport XYZ. When we run the function the fisrt time, airport XYZ is droppped. Because of that, the total count of flights for Boston Logan airport decreases as well. If we run the function a second time, removing airport XYZ and its flights with Boston Logan has put the number of flights associated with Boston Logan under the threshold!


In [7]:
#remove all airports that have an annual traffic under threshold

def restrict_airport(data, threshold):
    ''' 
    Drop carriers from the data set.

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    droplist: <list 'string'>
        List of carriers to be droppped
    '''     
    
    dict_count = data.groupby("DEST").agg(['count']).LAT.to_dict()['count']
    
    for key in dict_count:
        if dict_count[key] < threshold:
            data.drop(data[data.DEST == key].index, inplace=True)
            data.drop(data[data.ORIGIN == key].index, inplace=True)
    
    print data.groupby("DEST").agg(['count']).LAT.to_dict()['count']
    
    return

Extracting month and day

The date is given as a string in the format "Year-Month-Day". We exctract the month and the weekday in the following.

WARNING \ RUN THIS CELL ONLY ONCE! Because airports are linked to one another,


In [8]:
from time import strptime
days = {0:"Mon", 1:"Tues", 2:"Wed", 3:"Thurs", 4:"Fri", 5:"Sat", 6:"Sun"}
months = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"June", 7:"July", 8:"Aug", 9:"Sep", \
          10:"Oct", 11:"Nov", 12:"Dec"}

In [9]:
def adjust_time(data):
    ''' 
    Return the month list and the weekday list out of a DataFrame.

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe that contains a colum of flight dates
    ''' 
    
    monlist = np.empty(len(data), dtype = str)
    daylist = np.empty(len(data), dtype = str)
    
    for i in xrange(len(data)):
        date= strptime(data.FL_DATE.iloc[i], "%Y-%M-%d")
        monlist[i] = months[date.tm_min]
        daylist[i] = days[date.tm_wday]

    return monlist, daylist

In [10]:
%%time
# now add the weekday and the month columns created by the adjust_time
monlist, daylist = adjust_time(dexample)
print "OK"
dexample['MONTH'] = pd.Series(monlist, index=dexample.index)
dexample['DAY'] = pd.Series(daylist, index=dexample.index)
if 'FL_DATE' in dexample.columns:
    dexample.drop('FL_DATE', axis = 1, inplace= True)
print dexample.columns


OK
Index([u'UNIQUE_CARRIER', u'ORIGIN', u'DEST', u'CRS_DEP_TIME', u'CRS_ARR_TIME',
       u'ARR_DELAY', u'DISTANCE', u'AIRCRAFT_YEAR', u'AIRCRAFT_MFR', u'LAT',
       u'LONG', u'MONTH', u'DAY'],
      dtype='object')
Wall time: 511 ms

Adjusting numerical data

Let's change teh format of all time entries. Instead of having hour : minute, we convert everything in minutes.

WARNING \ RUN THIS CELL ONLY ONCE!


In [11]:
%%time
ti = lambda x : x/100 + x % 100
dexample['CRS_ARR_TIME_COR'] = dexample.CRS_ARR_TIME.map(ti)
dexample['CRS_DEP_TIME_COR'] = dexample.CRS_DEP_TIME.map(ti)
dexample.drop(['CRS_DEP_TIME', 'CRS_ARR_TIME'], axis = 1, inplace = True)


Wall time: 46 ms

We need to center and normalize all continuous data for better results. The idea is that if a feature X has a range of variation that is considerably higher than a feature Y, the variations of X may completely mask the variations of Y and therefore make Y useless.


In [12]:
def normalize(array):
    mean = np.mean(array)
    std = np.std(array)
    return [(x - mean)/std for x in array]

In [13]:
def normalize_data(data, feature_list):
    ''' 
    Normalize data.

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    feature_list: <list 'string'>
        List of features to be normalized
    '''           

    for feature in feature_list:
        if feature in data.columns:
            data[feature + '_NOR'] = normalize(data[feature].values)
            data.drop(feature, axis =1, inplace=True)
    return

In [14]:
dexample.drop(dexample[dexample.AIRCRAFT_YEAR =='    '].index, inplace = True)
dexample['AIRCRAFT_YEAR_COR'] = dexample.AIRCRAFT_YEAR.map(lambda x: int(x))
dexample.drop('AIRCRAFT_YEAR', axis = 1, inplace = True)

In [15]:
normalize_data(dexample, ['DISTANCE', 'LAT', 'LONG', 'CRS_ARR_TIME_COR', 'CRS_DEP_TIME_COR', 'AIRCRAFT_YEAR_COR'])

In [17]:
dexample.head(3)


Out[17]:
UNIQUE_CARRIER ORIGIN DEST ARR_DELAY AIRCRAFT_MFR MONTH DAY DISTANCE_NOR LAT_NOR LONG_NOR CRS_ARR_TIME_COR_NOR CRS_DEP_TIME_COR_NOR AIRCRAFT_YEAR_COR_NOR
2033 AA LGA ORD -17 MCDONNELL DOUGLAS J W 0.68881 1.058684 0.786579 0.88939 -1.046771 -1.713677
2034 AA LGA ORD 32 FRIEDEMANN JON J S 0.68881 1.058684 0.786579 0.88939 -1.046771 1.129411
2035 AA LGA ORD 11 FRIEDEMANN JON J S 0.68881 1.058684 0.786579 0.88939 -1.046771 1.129411

Encode categorical variables

Many of the features are categorical - for example the origin airport feature contains strings which represent the code of each airport. To make a classification/regression possible, we need to encode these features as series of indicators. More specifically, if the origin airport feature has n possibles values A1, A2, ... , An, we create n indicator functions I(A1), ..., I(An). Indicator function I(Ak) takes the value 1 if the feature origin airport has value Ak.


In [18]:
encoded_list = ['UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'AIRCRAFT_MFR', 'MONTH','DAY']
dexample = pd.get_dummies(dexample, columns=encoded_list)

In [20]:
dexample.head(3)


Out[20]:
ARR_DELAY DISTANCE_NOR LAT_NOR LONG_NOR CRS_ARR_TIME_COR_NOR CRS_DEP_TIME_COR_NOR AIRCRAFT_YEAR_COR_NOR UNIQUE_CARRIER_AA UNIQUE_CARRIER_B6 UNIQUE_CARRIER_OO ... MONTH_J MONTH_M MONTH_N MONTH_O MONTH_S DAY_F DAY_M DAY_S DAY_T DAY_W
2033 -17 0.68881 1.058684 0.786579 0.88939 -1.046771 -1.713677 1 0 0 ... 1 0 0 0 0 0 0 0 0 1
2034 32 0.68881 1.058684 0.786579 0.88939 -1.046771 1.129411 1 0 0 ... 1 0 0 0 0 0 0 1 0 0
2035 11 0.68881 1.058684 0.786579 0.88939 -1.046771 1.129411 1 0 0 ... 1 0 0 0 0 0 0 1 0 0

3 rows × 42 columns

Let's save all of our results.


In [21]:
%%time
# save the restricted data to avoid computing them again
file_path = "cache/predictionData/dexample.csv"
dexample.to_csv(path_or_buf= file_path)


Wall time: 311 ms

In [47]:
#%%time
# recover file
#file_path = "cache/predictionData/dexample.csv"
#dexample= pd.read_csv(file_path)
#dexample.drop('Unnamed: 0', axis= 1, inplace = True)
#print dexample.columns

1. Random Forest regressor

Our goal is to anticipate the delay of a flight given its features. One way to proceed is to train a random forest regressor. But first, let's look at the baseline predictor, that is the predictor that returns the mean of data set. The measure of accuracy of a predictor will be its mean square error against a test set distinct from the training set.


In [102]:
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [106]:
def baseline_predictor(data):
    return np.mean(data.ARR_DELAY.values)

print "MSE of the mean predictor:" , \
mean_squared_error(dexample.ARR_DELAY.values, [baseline_predictor(dexample) for x in xrange(len(dexample))])


 MSE of the mean predictor: 2359.68904617

Split data into training/test sets

First, let's split the data set into a training set and a test set.


In [66]:
def split(data, list_drop, target, test_size):
    ''' 
    Splits the data into a training and a test set
    Separates the training and test sets according to a feature set and a target set
    Balance the features sets by retaining only fraction of its points

    Parameters:
    -----------

    data: pandas.DataFrame
       Flight dataframe  

    list_drop: <list 'string'>
        List of columns to exclude from the features set
        
    target: string
        target column along whch we make the target set
        
    test_size: float
        size of the test set
    
    '''    
    
    #split the dataset into a training set and a test set
    dtrain, dtest = train_test_split(data, test_size = 0.3)
    
    Xtrain = dtrain.drop(list_drop, axis=1).values
    ytrain = dtrain[target].values
    Xtest = dtest.drop(list_drop, axis=1).values
    ytest = dtest[target].values
    
    return Xtrain, ytrain, Xtest, ytest

Useful functions

The functions below satisfy various tasks. The first trains a classifier and estimates its scores on the test set and the training set. The second tries different parameter values for the regressor. The last one wraps everything.


In [69]:
def score_random_forest(Xtrain, ytrain, Xtest, ytest, n_trees=10, max_features='auto'):
    ''' 
    Fits a random forest with (Xtrain ,ytrain)
    Computes the score on (Xtest, ytest)

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: int
        number of trees in the forest
    
    max_features: string or int
        number of features used for every tree
        
    Outputs:
    --------
    
    score_train: float
        score on the train set
    
    score_test: float
        score on the test set
    
    clf.feature_importances_
        weights of each feature as used by the classifier
    
    ''' 

    clf= RandomForestRegressor(n_estimators=n_trees, max_features= max_features)
    clf.fit(Xtrain, ytrain)
    
    score_train = mean_squared_error(clf.predict(Xtrain), ytrain)
    score_test = mean_squared_error(clf.predict(Xtest), ytest)
    
    return  score_train, score_test, clf

In [70]:
def best_parameters(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features):
    ''' 
    Fits sequentially random forest classifiers
    Adds each test score in a pandas.DataFrame with the number of trees, the loss function, the train score,
    and the importance of each features
    Returns a DataFrame with all scores

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: <list int>
        list of numbers of trees in the forest
    
    nb_features: <list int>
        list of number of features in the forest
        
    Outputs:
    --------
    
    score_tab: pandas.DataFrame
        DataFrame of scores with associated parameters
    
    '''
    
    score_tab = pd.DataFrame(columns=['nb_trees', 'nb_features', 'test_score', 'train_score', 'classifier'])
    
    # counter will increment the index in score_tab
    counter = 0 

    for n_estimators in nb_trees:
        for max_features in nb_features:

            score_train, score_test, classifier = \
            score_random_forest(Xtrain, ytrain, Xtest, ytest, n_trees=n_estimators, max_features=max_features) 
            score_tab.loc[counter] = [n_estimators, max_features, score_test, score_train, classifier]
            counter += 1

    return score_tab

In [71]:
def classify_random_forest(data, list_drop, target, test_size=0.4, nb_trees=[10], nb_features = ['auto']):
    Xtrain, ytrain, Xtest, ytest = split(data, list_drop, target, test_size)
    scores =  best_parameters(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features)
    return scores

Parameter dashboard and run cell

We test the following parameters.

  • the number of trees in the forest
  • the number of features retained in each tree
  • the test size

In [107]:
nb_trees = [25, 50, 75, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
nb_features = ['auto', 'log2', 'sqrt']
test_size = 0.4

In [108]:
%%time
randomForest2014 =  classify_random_forest(dexample, ['ARR_DELAY'], 'ARR_DELAY', test_size=test_size, nb_trees=nb_trees, nb_features=nb_features)


Wall time: 15min 22s

In [125]:
randomForest2014.head()


Out[125]:
nb_trees nb_features test_score train_score classifier
0 25 auto 2921.001980 644.331707 (DecisionTreeRegressor(criterion='mse', max_de...
1 25 log2 2886.270747 613.948677 (DecisionTreeRegressor(criterion='mse', max_de...
2 25 sqrt 2836.819022 622.683047 (DecisionTreeRegressor(criterion='mse', max_de...
3 50 auto 2835.219505 617.548415 (DecisionTreeRegressor(criterion='mse', max_de...
4 50 log2 2845.566965 605.315175 (DecisionTreeRegressor(criterion='mse', max_de...

In [109]:
# save file to /data/ folder
file_path = "cache/predictionData/randomForest2014.csv"
randomForest2014.to_csv(path_or_buf= file_path)

MSE by size of the forest


In [127]:
import matplotlib.pyplot as plt
from matplotlib.figure import Figure

In [137]:
# MSE on test set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['auto', 'sqrt', 'log2'], ['b', 'g', 'y']):
    x = randomForest2014[randomForest2014.nb_features == key].nb_trees.values
    y = randomForest2014[randomForest2014.nb_features == key].test_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylabel("Mean Square Error")
plt.xlabel("Number of features")
plt.title("MSE on test set by nb of features and nb of trees")
plt.legend(loc = 1)

plt.show()



In [139]:
# MSE on train set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['auto', 'sqrt', 'log2'], ['b', 'g', 'y']):
    x = randomForest2014[randomForest2014.nb_features == key].nb_trees.values
    y = randomForest2014[randomForest2014.nb_features == key].train_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylabel("Mean Square Error")
plt.xlabel("Number of features")
plt.title("MSE on train set by nb of features and nb of trees")
plt.legend(loc = 1)

plt.show()


2. Random forest classifier

We will make prediction on the variable 'ARR_DEL15'. This variable takes the value 1 is the plane is more than 15 minutes late and 0 if not. Let's look at the baseline classifier, that is the classifiers that assign repectively 1 or 0 to 'ARR_DEL15' for every flight.

Here we look at the case whether a flight will be more than 15 minutes late. So we adjust the ARR_DELAY colum to an indicator.


In [22]:
%%time
dexample['ARR_DELAY_COR'] = dexample.ARR_DELAY.map(lambda x: (x >= 15))
dexample.drop('ARR_DELAY', axis = 1, inplace = True)
file_path = "cache/predictionData/dexample_class.csv"
dexample.to_csv(path_or_buf= file_path)
dexample_class = dexample.copy()


Wall time: 1.18 s

In [ ]:
#file_path = "cache/predictionData/dexample_class.csv"
#dexample_class = pd.read_csv(path_or_buf= file_path)
#dexample_class.drop('Unnamed: 0', axis= 1, inplace = True)

In [23]:
dexample_class.head(3)


Out[23]:
DISTANCE_NOR LAT_NOR LONG_NOR CRS_ARR_TIME_COR_NOR CRS_DEP_TIME_COR_NOR AIRCRAFT_YEAR_COR_NOR UNIQUE_CARRIER_AA UNIQUE_CARRIER_B6 UNIQUE_CARRIER_OO UNIQUE_CARRIER_UA ... MONTH_M MONTH_N MONTH_O MONTH_S DAY_F DAY_M DAY_S DAY_T DAY_W ARR_DELAY_COR
2033 0.68881 1.058684 0.786579 0.88939 -1.046771 -1.713677 1 0 0 0 ... 0 0 0 0 0 0 0 0 1 False
2034 0.68881 1.058684 0.786579 0.88939 -1.046771 1.129411 1 0 0 0 ... 0 0 0 0 0 0 1 0 0 True
2035 0.68881 1.058684 0.786579 0.88939 -1.046771 1.129411 1 0 0 0 ... 0 0 0 0 0 0 1 0 0 False

3 rows × 42 columns


In [24]:
from __future__ import division

def baseline_class(data, target):
    ''' 
    Compute the baseline classifiers along a target variable for a data set data

    Parameters:
    -----------

    data: pandas.DataFrame
       dataframe  

    target: string
        Column of data along wich we compute the baseline classifiers
    '''    
    
    
    score_baseline_1 = np.size(data[data[target] == 1][target].values) / np.size(data[target].values)
    score_baseline_0 = np.size(data[data[target] == 0][target].values) / np.size(data[target].values)
    
    print "baseline classifier everyone to 0: ", int(score_baseline_0*100) , "%"
    print "baseline classifier everyone to 1: ", int(score_baseline_1*100) , "%"
   
    return score_baseline_0, score_baseline_1

In [25]:
baseline_class(dexample_class, 'ARR_DELAY_COR')


baseline classifier everyone to 0:  70 %
baseline classifier everyone to 1:  29 %
Out[25]:
(0.7088213306753053, 0.2911786693246947)

Prediction functions


In [34]:
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

Random forest functions for a classification task


In [27]:
def split(data, list_drop, target, test_size):
    ''' 
    Splits the data into a training and a test set
    Separates the training and test sets according to a feature set and a target set
    Balance the features sets by retaining only fraction of its points

    Parameters:
    -----------

    data: pandas.DataFrame
       Flight dataframe  

    list_drop: <list 'string'>
        List of columns to exclude from the features set
        
    target: string
        target column along whch we make the target set
        
    test_size: float
        size of the test set
    
    '''    
    
    #split the dataset into a training set and a test set
    dtrain, dtest = train_test_split(data, test_size = 0.3)
    
    Xtrain = dtrain.drop(list_drop, axis=1).values
    ytrain = dtrain[target].values
    Xtest = dtest.drop(list_drop, axis=1).values
    ytest = dtest[target].values
    
    return Xtrain, ytrain, Xtest, ytest

In [35]:
def score_random_forest_classifier(Xtrain, ytrain, Xtest, ytest, n_trees=10, max_features='auto'):
    ''' 
    Fits a random forest with (Xtrain ,ytrain)
    Computes the score on (Xtest, ytest)

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: int
        number of trees in the forest
    
    max_features: string or int
        number of features used for every tree
        
    Outputs:
    --------
    
    score_train: float
        score on the train set
    
    score_test: float
        score on the test set
    
    clf.feature_importances_
        weights of each feature as used by the classifier
    
    ''' 

    clf= RandomForestClassifier(n_estimators=n_trees, max_features= max_features)
    clf.fit(Xtrain, ytrain)
    
    score_train = clf.score(Xtrain, ytrain)
    score_test = clf.score(Xtest, ytest)
    
    return  score_train, score_test, clf

In [36]:
def best_parameters_classifier(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features):
    ''' 
    Fits sequentially random forest classifiers
    Adds each test score in a pandas.DataFrame with the number of trees, the loss function, the train score,
    and the importance of each features
    Returns a DataFrame with all scores

    Parameters:
    -----------

    Xtrain: numpy 2D array
       Feature training set

    ytrain: numpy 1D array
        Target training set
    
    Xtest: numpy 2D array
       Feature test set

    ytest: numpy 1D array
        Target test set
    
    n_trees: <list int>
        list of numbers of trees in the forest
    
    nb_features: <list int>
        list of number of features in the forest
        
    Outputs:
    --------
    
    score_tab: pandas.DataFrame
        DataFrame of scores with associated parameters
    
    '''
    
    score_tab = pd.DataFrame(columns=['nb_trees', 'nb_features', 'test_score', 'train_score', 'classifier'])
    
    # counter will increment the index in score_tab
    counter = 0 

    for n_estimators in nb_trees:
        for max_features in nb_features:

            score_train, score_test, classifier = \
            score_random_forest_classifier(Xtrain, ytrain, Xtest, ytest, n_trees=n_estimators, max_features=max_features) 
            score_tab.loc[counter] = [n_estimators, max_features, score_test, score_train, classifier]
            counter += 1

    return score_tab

In [37]:
def classify_random_forest_class(data, list_drop, target, test_size=0.4, nb_trees=[10], nb_features = ['auto']):
    Xtrain, ytrain, Xtest, ytest = split(data, list_drop, target, test_size)
    scores =  best_parameters_classifier(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features)
    return scores

Classification task

We look at various paramters for the classifier. Note that there are two loss functions available for the random forest classifier, a Gini loss and an entropy loss. Simulations show they behave similarly!


In [99]:
nb_trees = [25, 50, 75, 100, 150, 200, 300, 400, 500, 750, 1000, 2000, 5000]
nb_features = ['sqrt', 'log2']
test_size = 0.4

In [100]:
%%time
randomForest2014_class =  classify_random_forest_class(dexample, ['ARR_DELAY_COR'], 'ARR_DELAY_COR', test_size=test_size, nb_trees=nb_trees, nb_features=nb_features)


Wall time: 28min 59s

In [90]:
randomForest2014_class.head(9)


Out[90]:
nb_trees nb_features test_score train_score classifier
0 25 sqrt 0.652824 0.934318 (DecisionTreeClassifier(class_weight=None, cri...
1 25 log2 0.648671 0.934140 (DecisionTreeClassifier(class_weight=None, cri...
2 50 sqrt 0.656561 0.937344 (DecisionTreeClassifier(class_weight=None, cri...
3 50 log2 0.657392 0.937344 (DecisionTreeClassifier(class_weight=None, cri...
4 75 sqrt 0.660714 0.937878 (DecisionTreeClassifier(class_weight=None, cri...
5 75 log2 0.658223 0.937878 (DecisionTreeClassifier(class_weight=None, cri...
6 100 sqrt 0.661960 0.937878 (DecisionTreeClassifier(class_weight=None, cri...
7 100 log2 0.661545 0.937700 (DecisionTreeClassifier(class_weight=None, cri...
8 150 sqrt 0.654900 0.937878 (DecisionTreeClassifier(class_weight=None, cri...

In [101]:
# save file to /data/ folder
file_path = "cache/predictionData/randomForest2014_class.csv"
randomForest2014_class.to_csv(path_or_buf= file_path)

Analysis

Let's now look at the importance coefficients, that the average usage of each coefficients in the random forest. As we can see, the main factors in the delay of a flight are the age of the aircraft, the time of the departure and the time of the origin. The fact that the departure and arrival time confirm what we observed in the explorative analysis. However, the role of the age of the aircraft was not very marked in the explorative analysis. Lastly, the weekday has a huge influence.

Notice that the airline and the aircraft model play a very limited role in the delay according to this random forest classifier.


In [106]:
features = dexample_class.drop('ARR_DELAY_COR', axis =1).columns
coeffs_class = randomForest2014_class.classifier[len(randomForest2014_class)-1].feature_importances_
u = pd.Series(coeffs_class, index=features)
u.sort(inplace=True, ascending=False)
print u


AIRCRAFT_YEAR_COR_NOR                          0.285870
CRS_ARR_TIME_COR_NOR                           0.217206
CRS_DEP_TIME_COR_NOR                           0.212159
DAY_T                                          0.027670
DAY_S                                          0.026794
DAY_W                                          0.022723
DAY_F                                          0.021133
DAY_M                                          0.020281
MONTH_J                                        0.017575
MONTH_A                                        0.015214
MONTH_M                                        0.013477
MONTH_O                                        0.013426
AIRCRAFT_MFR_AIRBUS INDUSTRIE                  0.013183
AIRCRAFT_MFR_BOEING                            0.012232
MONTH_S                                        0.012192
MONTH_F                                        0.011634
MONTH_N                                        0.010826
MONTH_D                                        0.009118
AIRCRAFT_MFR_AIRBUS                            0.007589
AIRCRAFT_MFR_EMBRAER                           0.003948
LAT_NOR                                        0.003597
DISTANCE_NOR                                   0.003498
LONG_NOR                                       0.003483
ORIGIN_LGA                                     0.002964
ORIGIN_EWR                                     0.002883
UNIQUE_CARRIER_UA                              0.002150
AIRCRAFT_MFR_EMBRAER S A                       0.001437
UNIQUE_CARRIER_OO                              0.001044
UNIQUE_CARRIER_B6                              0.001036
ORIGIN_JFK                                     0.001016
AIRCRAFT_MFR_MCDONNELL DOUGLAS                 0.000818
UNIQUE_CARRIER_AA                              0.000802
AIRCRAFT_MFR_CIRRUS DESIGN CORP                0.000400
AIRCRAFT_MFR_CESSNA                            0.000220
AIRCRAFT_MFR_PIPER                             0.000207
AIRCRAFT_MFR_ROBINSON HELICOPTER CO            0.000111
AIRCRAFT_MFR_BELL                              0.000034
AIRCRAFT_MFR_LEBLANC GLENN T                   0.000029
AIRCRAFT_MFR_MARZ BARRY                        0.000021
DEST_ORD                                       0.000000
AIRCRAFT_MFR_FRIEDEMANN JON                    0.000000
dtype: float64

Score by size of the forest


In [93]:
import matplotlib.pyplot as plt
from matplotlib.figure import Figure

In [107]:
# Score on test set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['sqrt', 'log2'], ['b', 'g']):
    x = randomForest2014_class[randomForest2014_class.nb_features == key].nb_trees.values
    y = randomForest2014_class[randomForest2014_class.nb_features == key].test_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylabel("Score")
plt.xlabel("Number of features")
plt.title("Score on test set by nb of features and nb of trees")
plt.legend(loc = 4)

plt.show()



In [111]:
# Score on train set
fig = plt.gcf()
fig.set_size_inches(8, 5)

for key, co in zip(['sqrt', 'log2'], ['b', 'g']):
    x = randomForest2014_class[randomForest2014_class.nb_features == key].nb_trees.values
    y = randomForest2014_class[randomForest2014_class.nb_features == key].train_score.values
    plt.plot(x, y, alpha = 0.5, c =co, label = key)

plt.ylim(0.935, 0.95)
plt.ylabel("Score")
plt.xlabel("Number of features")
plt.title("Score on train set by nb of features and nb of trees")
plt.legend(loc = 4)

plt.show()


3. Predicting flight delay time with Linear Regression

To allow users to enjoy more than a classfication experience we want to give them an expected delay time in minutes.


In [19]:
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random
import requests
import zipfile
import StringIO
import re
import json
import os
import matplotlib
import matplotlib.pyplot as plt

# sklearn functions used for the linear regression model
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_absolute_error

First, let us again establish a baseline which has to be beaten by our model. To get a feeling for a good baseline we pick flights from from New York(all airports) to Chicago(all airports).


In [30]:
# first step is to load the actual data and exclude rows that are unnecessary
# a script that produces the csv file can be found in the src folder
# (it might take some while to run, on my macbook up to one hour)
print('loading data...')
bigdf = pd.read_csv('cache/Big5FlightTable.csv')

years = ['2010', '2011', '2012', '2013', '2014']


loading data...

In [31]:
# specify here which cities should be investigated
city_from = 'New York, NY'
city_to = 'Chicago, IL'

# filter for cities
bigdf = bigdf[(bigdf.ORIGIN_CITY_NAME == city_from) & (bigdf.DEST_CITY_NAME == city_to)]

3.1. A first predictor

For a span of days we are interested in knoweledge, which flight we should take. Therefore let's choose the popular date for Christmas returning flights 21.12.2015 as example. As we want to use historical data, let's get first clear what data we have. Is it possible to compare flights over the years?


In [32]:
query_day = 21
query_month = 12

# how many flights do exist in all years?
flights = []
flightvalues = []
for y in years:
    query = list(bigdf[(bigdf.YEAR == int(y)) & (bigdf.MONTH == query_month) & (bigdf.DAY_OF_MONTH == query_day)].FL_NUM.astype(int).unique())
    flights.append(query)
    flightvalues += query
    
# build a matrix
data_matrix = np.zeros((len(flightvalues), len(years)))
# build dict
flightdict = dict(zip(flightvalues, np.arange(0, len(flightvalues))))

# fill datamatrix
for i in xrange(len(years)):
    for j in flights[i]:
        data_matrix[flightdict[j], i] = 1.
        
# plot matrix
plt.figure(figsize=(10, 6))
plt.imshow(data_matrix, extent=[0,data_matrix.shape[0],0,data_matrix.shape[1] * 5], \
           interpolation='none', cmap='gray')


Out[32]:
<matplotlib.image.AxesImage at 0x234cb07d0>

As the plot shows, flight numbers are not stable over the years and it is not trivial to find a matching. Thus comparison by features for individual flights is not really possible. It seems as if airlines change their flight numbers on a yearly basis. Thus we need to come up with another idea and use a latent variable based approach instead. One of the easiest ideas is to average the delay time over each day as a base classifier and refine then.


In [34]:
dffordate = bigdf[bigdf.MONTH == query_month]
dffordate = dffordate[dffordate.DAY_OF_MONTH == query_day]
dffordate.head()

def predict_base_model(X):
    return np.array([dffordate.ARR_DELAY.mean()]*X.shape[0])

To test the quality of this model, we use the last year as test set and the previous as train data. The idea is, that we are always interested in predicting the next year somehow. Thus, if the match for 2014 is good, we expect it to be the same for 2015.


In [70]:
# build test/train set
df_train = dffordate[dffordate.YEAR != int(years[-1])]
df_test = dffordate[dffordate.YEAR == int(years[-1])]

y_train = df_train.ARR_DELAY
X_train = y_train # here dummy
y_test = df_test.ARR_DELAY
X_test = y_test # here dummy

In the base model, the prediction for 2014 is that we are going to be 27.72 minutes late


In [71]:
y_pred = predict_base_model(X_test)
y_pred[0]


Out[71]:
27.728260869565219

How good did it perform comparing the actual arrival delay?


In [72]:
def rmse(y, y_pred):
    return np.sqrt(((y - y_pred)**2).mean())

def mas(y, y_pred):
    return (np.abs(y - y_pred)).mean()

MAS_base = mas(y_test, y_pred)
RMSE_base = rmse(y_test, y_pred)
RMSE_base


Out[72]:
44.10756003440288

Using the root mean squared error as one (of many possible) measures, any model that we develop should beat the benchmark of 44.

3.2 Building a linear regression model for prediction

As flight delay changes over the time of day like explored in the data exploration part, we introduce a new feature which models in a categorical variable 10 minute time windows. I.e. for each window we introduce a latent variable that captures some sort of delay influence of this frame. This is done for both the departure and the arrival time. The model here is first developed for the reduced dataset containing the flights between New York / Chicago only.


In [41]:
%%time
bigdf['HOUR_OF_ARR'] = 0
bigdf['HOUR_OF_DEP'] = 0

for index, row in bigdf.iterrows():
    bigdf.set_value(index, 'HOUR_OF_ARR', int(row['ARR_TIME']) / 10)
    bigdf.set_value(index, 'HOUR_OF_DEP', int(row['DEP_TIME']) / 10)


CPU times: user 3.05 s, sys: 212 ms, total: 3.27 s
Wall time: 3.37 s

In the next step, before fitting the actual model categorical variables need to be encoded as binary features (they have no order!) and numerical features shall be normalized.


In [44]:
# split data into numerical and categorical features
numericalFeat = bigdf[['DISTANCE', 'AIRCRAFT_AGE']].astype('float') 
categoricalFeat = bigdf[['MONTH', 'DAY_OF_MONTH', 'ORIGIN', 'DEST', \
                         'HOUR_OF_ARR', 'HOUR_OF_DEP', 'UNIQUE_CARRIER', \
                         'DAY_OF_WEEK', 'AIRCRAFT_MFR']]

Dealing with categorical variables

Luckily, sklearn has a routine to do this for us. Yet, as it is only able to handle integers we first reindex all categorical features we create lookup tables for the values. This turned out to be one of the slowest step in processing.


In [46]:
%%time
# for the next step, all features need to be encoded as integers --> create lookup Tables!
def transformToID(df, col):
    vals = df[col].unique()
    LookupTable = dict(zip(vals, np.arange(len(vals))))
    for key in LookupTable.keys():
        df[df[col] == key] = LookupTable[key]
    return (LookupTable, df)

mfrDict, categoricalFeat = transformToID(categoricalFeat, 'AIRCRAFT_MFR')
originDict, categoricalFeat = transformToID(categoricalFeat, 'ORIGIN')
destDict, categoricalFeat = transformToID(categoricalFeat, 'DEST')
carrierDict, categoricalFeat = transformToID(categoricalFeat, 'UNIQUE_CARRIER')


/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/usr/local/lib/python2.7/site-packages/pandas/core/indexing.py:415: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
CPU times: user 12.9 s, sys: 104 ms, total: 13 s
Wall time: 13.8 s

In [47]:
categoricalFeat.head()


Out[47]:
MONTH DAY_OF_MONTH ORIGIN DEST HOUR_OF_ARR HOUR_OF_DEP UNIQUE_CARRIER DAY_OF_WEEK AIRCRAFT_MFR
921 0 0 0 0 0 0 0 0 0
922 1 1 1 1 1 1 1 1 1
923 0 0 0 0 0 0 0 0 0
924 0 0 0 0 0 0 0 0 0
925 2 2 2 2 2 2 2 2 2

Using sklearn the variables are now encoded. As we have many variables (around 1200 in the final model for the whole year) using sparse matrices was critical for us to fit the model.


In [48]:
import sklearn
from sklearn import linear_model
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error

In [49]:
%%time 
encoder = OneHotEncoder() 
categoricals_encoded = encoder.fit_transform(categoricalFeat)


CPU times: user 48.2 ms, sys: 10 ms, total: 58.3 ms
Wall time: 59.6 ms

Recombining categorical and numerical features allows to start the usual Machine Learning routine (split into test/train, normalize, train/cross-validate, test)


In [51]:
numericals_sparse = sparse.csr_matrix(numericalFeat)
# get data matrix & response variable
X_all = sparse.hstack((numericals_sparse, categoricals_encoded))
y_all = bigdf['ARR_DELAY'].values

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.2, random_state = 42)

# we have 2 numerical features
X_train_numericals = X_train[:, 0:3].toarray()
X_test_numericals = X_test[:, 0:3].toarray()

# use sklearn tools to standardize numerical features...
scaler = StandardScaler() 
scaler.fit(X_train_numericals) # get std/mean from train set
X_train_numericals = sparse.csr_matrix(scaler.transform(X_train_numericals)) 
X_test_numericals = sparse.csr_matrix(scaler.transform(X_test_numericals))

# update sets
X_train[:, 0:3] = X_train_numericals
X_test[:, 0:3] = X_test_numericals


/usr/local/lib/python2.7/site-packages/scipy/sparse/compressed.py:739: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)

Ordinal least squares regression


In [54]:
clf = sklearn.linear_model.LinearRegression()

# fit the model!
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
rmse(y_test, y_pred)


Out[54]:
48.323925698812779

In a first model, we use OLS to get a feeling of what is achievable using a reduced dataset. The motivation for this is mainly to see whether this might be used to speedup the whole process. As the RMSE reveals, we did not do better than the base classifier. What happened?


In [57]:
X_train.shape, clf.coef_.shape


Out[57]:
((31881, 74), (74,))

Looking at the number of variables (74 here), it might be that either the fit is not good enough or we are ignoring the dependency of flight delays within different routes.

Ridge regression


In [62]:
%%time
# Use ridge regression (i.e. Gaussian prior) and vary the lambda parameter using Grid search
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV

SGD_params = {'alpha': 10.0 ** -np.arange(-2,8)}
SGD_model = GridSearchCV(SGDRegressor(random_state = 42), \
                         SGD_params, scoring = 'mean_absolute_error', cv = 6) # cross validate 6 times


# train the model, this might take some time...
SGD_model.fit(X_train, y_train)


CPU times: user 2.01 s, sys: 52 ms, total: 2.06 s
Wall time: 2.14 s

In [63]:
y_pred = SGD_model.predict(X_test)
rmse(y_test, y_pred)


Out[63]:
48.389335267933092

Looking at the RMSE again is a bit disappointing. We did worse than OLS and the base classifier! This means, it is now time to tackle the unavoidable: Regressing the whole dataset!

As the data becomes rapidly huge (for 1 year ~ 700MB uncompressed CSV, 5 years ~ 3.5GB, 10 years ~ 7.2GB) the code to perform the actual regression has been developed first in an IPython notebook and then run separately. It can be found in the src folder.

Regression over the whole data

In the following 2 models based on 1, 5 years of data are used (saved in BigFlightTable.csv, Big5FlightTable.csv).


In [131]:
import json
import seaborn as sns
import matplotlib.ticker as ticker

def load_model(filename):
    mdl = None
    with open(filename, 'r') as f:
        mdl = json.load(f)
    return mdl

# load model results
mdl1 = load_model('results/models/2014model.json') # model with 2014 data (1 year)
mdl5 = load_model('results/models/2010_2014model.json') # model with 2010-2014 data ( 5 years)


labels = np.array(['base predictor', '1 year LM', '5 year LM'])
RMSEs = np.array([RMSE_base, mdl1['RMSE'], mdl5['RMSE']])
MASs = np.array([MAS_base, mdl1['MAS'], mdl5['MAS']])

width = .35
xx = np.arange(len(RMSEs))
plt.bar(xx-width, RMSEs, width, label='RMSE', color=sns.color_palette()[0])
plt.bar(xx , MASs, width, label='MAE', color=sns.color_palette()[2])
plt.legend(loc='best')

# Hide major tick labels
ax = plt.gca()
ax.xaxis.set_major_formatter(ticker.NullFormatter())

# Customize minor tick labels
ax.xaxis.set_minor_locator(ticker.FixedLocator([i for i in xx]))
ax.xaxis.set_minor_formatter(ticker.FixedFormatter([labels[i] for i in xx]))


Comparing the RMSEs and Mean absolute error we see that the linear regression model over the whole data clearly outperformed the baseline predictor. Also, more data did not really tremendously improve the mean absolute error but we can see a clear variance reduction in the RMSE. To further improve the model, additional variables should be included. One of the next ideas is weather data. As formatted historic weather data is unfortunately not for free available and scraping it for 100Ks entries per year is not feasible, the variables could not be added to the model. Another interesting idea might be to include some more meta data. I.e. an variable measuring the effect of holidays and variables accounting for winds which could be based by geographic location (i.e. a flight in western direction is longer than one in eastern direction b/c of the passat wind.

3.3 Application: Predicting the best flight for a given date and route


In [132]:
from datetime import timedelta, datetime, tzinfo
from pytz import timezone
import pytz

def convertToUTC(naive, zonestring="America/New_York"):
    local = pytz.timezone (zonestring)
    local_dt = local.localize(naive, is_dst=None)
    return local_dt.astimezone (pytz.utc)

In the following, we want to give an example on how to perform a prediction using the model from 2010-2014. We choose the route Chicago / New York.


In [133]:
# we are only interested in flights NY to Chicago!
city_to = 'Chicago, IL'
city_from = 'New York, NY'
zone_to = 'America/Chicago'
zone_from = 'America/New_York'

Especially before Christmas, it is a good question to ask on which day you should fly and which flight to take. Therefore, we consider only flights between 20.12-24.12. Our assumption is that many values we feed in our predictor do not change, i.e. we can use the data from 2014 (stored in BigFlightTable.csv). For a productive system, these queries should be performed using a database.


In [134]:
# need to create a lookup table for the values (i.e. flight numbers, city and so on and then drop all duplicates!)
db = pd.read_csv('cache/BigFlightTable.csv')
# remove all unnecessary columns
db = db[['ORIGIN_CITY_NAME', 'DEST_CITY_NAME', 'AIRCRAFT_AGE', 'DEST', 'ARR_TIME', \
         'DEP_TIME', 'UNIQUE_CARRIER', 'DAY_OF_WEEK', 'AIRCRAFT_MFR', 'FL_NUM', 'MONTH', \
         'DAY_OF_MONTH', 'DISTANCE', 'ORIGIN']]
print str(db.count()[0]) + ' entries'
db.head()

# drop everything except for the 5 days before christmas! i.e. 20.12, 21.12, 22.12, 23.12, 24.12.
db = db[db.MONTH == 12]
db = db[db.DAY_OF_MONTH <= 24]
db = db[db.DAY_OF_MONTH >= 20]

print '5 days have ' + str(db.count()[0]) + ' flights'

db = db[db.ORIGIN_CITY_NAME == city_from]
db = db[db.DEST_CITY_NAME == city_to]
db.reset_index(inplace=True)

print 'Found ' + str(db.count()[0]) + ' flights from ' + city_from + ' to ' + city_to + ' for 20.12 - 24.12'


4376277 entries
5 days have 58334 flights
Found 87 flights from New York, NY to Chicago, IL for 20.12 - 24.12

We know load our favourite model and setup the categorical variable encoder.


In [135]:
mdl = load_model('results/models/2014model.json')

# categorical feature encoder, fitted on the keys
encoder = OneHotEncoder(sparse=True, n_values=mdl['encoder']['values'])

Using the lookuptables, it is straight-forward to write the prediction function. Note that variabales need to be normalized according to the training data used in the model.


In [136]:
# input is a datarow
# prediction of day in the next year!
def predictDelayTime(row, mdl):
    
    s_mean, s_std, coeff, intercept = mdl['scaler_mean'], mdl['scaler_std'], mdl['coeff'], mdl['intercept']
    
    # read out tables
    carrierTable = mdl['CARRIER']
    mfrTable = mdl['MANUFACTURER']
    destTable = mdl['DEST']
    originTable = mdl['ORIGIN']
    
    distance = row['DISTANCE'] # <-- look this up!
    aircraft_age = row['AIRCRAFT_AGE'] # <-- look this up!
    
    # normalize numerical features according to scaler
    distance = (distance - s_mean[0]) / s_std[0]
    aircraft_age = (aircraft_age + 1 - s_mean[1]) / s_std[1]
    
    month = row['MONTH']
    day_of_month = row['DAY_OF_MONTH'] 
    origin = row['ORIGIN']
    dest = row['DEST']
    
    hour_of_arr = int(row['ARR_TIME']) / 10
    hour_of_dep = int(row['DEP_TIME']) / 10
    carrier = row['UNIQUE_CARRIER']
    day_of_week = datetime(year=2015, month=row.MONTH, day=row.DAY_OF_MONTH).weekday() # <-- get via datetimeobject
    mfr = row['AIRCRAFT_MFR']
    
    # for nonindexed categorical features, do lookup!
    origin = originTable[origin]
    dest = destTable[dest]
    mfr = mfrTable[mfr]
    carrier = carrierTable[carrier]

    # write into df
    df = {}
    df['MONTH'] = month
    df['DAY_OF_MONTH'] = day_of_month
    df['ORIGIN'] = origin
    df['DEST'] = dest
    df['HOUR_OF_ARR'] = hour_of_arr
    df['HOUR_OF_DEP'] = hour_of_dep
    df['UNIQUE_CARRIER'] = carrier
    df['DAY_OF_WEEK'] = day_of_week
    df['AIRCRAFT_MFR'] = mfr
    df = pd.DataFrame([df])

    # order here is important! make sure it is the same as in the model!
    categoricalFeat = df[['MONTH', 'DAY_OF_MONTH', 'ORIGIN', 
                    'DEST', 'HOUR_OF_ARR', 'HOUR_OF_DEP', 
                    'UNIQUE_CARRIER', 'DAY_OF_WEEK', 'AIRCRAFT_MFR']].copy() # Categorical features
    
    # construct the data vector for the linear model
    categoricals_encoded = encoder.fit_transform(categoricalFeat)
    num_features = np.array([distance, aircraft_age])
    cat_features = categoricals_encoded.toarray().T.ravel()
    w = np.hstack([num_features, cat_features])

    y_pred = np.dot(w, coeff) + intercept
    
    return y_pred[0]

We are now ready to predict the delay time on our flights!


In [137]:
# create for each day info
db['PREDICTED_DELAY'] = 0.
db['FLIGHT_TIME'] = 0
db['PREDICTED_FLIGHT_TIME'] = 0
for index, row in db.iterrows():
    print 'processing {idx}'.format(idx=index)
    y_pred = predictDelayTime(row, mdl)
    db.set_value(index, 'PREDICTED_DELAY', y_pred)

    arr_time = datetime(year=2015, month=row['MONTH'], day=row['DAY_OF_MONTH'], \
                        hour= int(row['ARR_TIME'] / 100), minute=int(row['ARR_TIME'] % 100))
    dep_time = datetime(year=2015, month=row['MONTH'], day=row['DAY_OF_MONTH'], \
                        hour= int(row['DEP_TIME'] / 100), minute=int(row['DEP_TIME'] % 100))
    
    flight_time_in_min =  (convertToUTC(arr_time) - convertToUTC(dep_time))
    flight_time_in_min = int(flight_time_in_min.total_seconds() / 60)
    
    db.set_value(index, 'FLIGHT_TIME', flight_time_in_min)
    db.set_value(index, 'PREDICTED_FLIGHT_TIME', y_pred + flight_time_in_min)


processing 0
processing 1
processing 2
processing 3
processing 4
processing 5
processing 6
processing 7
processing 8
processing 9
processing 10
processing 11
processing 12
processing 13
processing 14
processing 15
processing 16
processing 17
processing 18
processing 19
processing 20
processing 21
processing 22
processing 23
processing 24
processing 25
processing 26
processing 27
processing 28
processing 29
processing 30
processing 31
processing 32
processing 33
processing 34
processing 35
processing 36
processing 37
processing 38
processing 39
processing 40
processing 41
processing 42
processing 43
processing 44
processing 45
processing 46
processing 47
processing 48
processing 49
processing 50
processing 51
processing 52
processing 53
processing 54
processing 55
processing 56
processing 57
processing 58
processing 59
processing 60
processing 61
processing 62
processing 63
processing 64
processing 65
processing 66
processing 67
processing 68
processing 69
processing 70
processing 71
processing 72
processing 73
processing 74
processing 75
processing 76
processing 77
processing 78
processing 79
processing 80
processing 81
processing 82
processing 83
processing 84
processing 85
processing 86

What is the best flight on 22nd December?

Using this info, we can get the best flights for the 22nd of December


In [140]:
db2 = db[db.DAY_OF_MONTH == 22]
dbres2 = db2.sort('PREDICTED_FLIGHT_TIME')
dbres2.to_csv('data/best_flights.csv')

On what day is it best to fly in the evening?

Analogously, we now want our model to give us the answer which is the best flight during the busy evening hours (17:00-20:00)


In [141]:
db3 = db[(db.DEP_TIME > 1700) & (db.DEP_TIME < 2000)]
dbres3 = db2.sort('PREDICTED_FLIGHT_TIME')
dbres3.head()


Out[141]:
index ORIGIN_CITY_NAME DEST_CITY_NAME AIRCRAFT_AGE DEST ARR_TIME DEP_TIME UNIQUE_CARRIER DAY_OF_WEEK AIRCRAFT_MFR FL_NUM MONTH DAY_OF_MONTH DISTANCE ORIGIN PREDICTED_DELAY FLIGHT_TIME PREDICTED_FLIGHT_TIME
19 4156108 New York, NY Chicago, IL 2 MDW 738 627 WN 1 BOEING 490 12 22 725 LGA -5.391739 71 65
59 4235745 New York, NY Chicago, IL 16 ORD 916 757 UA 1 AIRBUS 463 12 22 733 LGA 0.540536 79 79
18 4156107 New York, NY Chicago, IL 14 MDW 1423 1304 WN 1 BOEING 333 12 22 725 LGA 7.796969 79 86
74 4237736 New York, NY Chicago, IL 16 ORD 845 714 UA 1 BOEING 1094 12 22 733 LGA 0.609249 91 91
66 4237531 New York, NY Chicago, IL 16 ORD 746 610 UA 1 AIRBUS 683 12 22 733 LGA -2.842361 96 93

Surprisingly, the best day to fly for in the evening is the 22nd!