In [1]:
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random
First, let's acquire the data formated in '02_data_preparation.ipynb'. The figure below gives a glimpse of what the data set looks like.
In [2]:
%%time
# reads all predefined months for a year and merge into one data frame
rawData2014 = pd.DataFrame.from_csv('cache/predictionData/complete2014Data.csv')
In [4]:
print rawData2014.columns
rawData2014.head(5)
Out[4]:
We will focus on flights between New York and Chicago. When cleaning the data set, we have to remove the following entries:
Note that data points have to be cleaned in this order because most flights have empty entries for the 'diverted' columns.
In [5]:
#entries to be dropped in the analysis
columns_dropped = ['index', 'TAIL_NUM', 'FL_NUM', 'DEP_TIME', 'DEP_DELAY', 'TAXI_OUT', 'WHEELS_OFF', \
'WHEELS_ON', 'TAXI_IN', 'ARR_TIME', 'CANCELLED', 'CANCELLATION_CODE', 'AIR_TIME', \
'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
In [6]:
def clean(data, list_col):
'''
Creates a dataset by excluding undesirable columns contained in list_col
Parameters:
-----------
data: pandas.DataFrame
Flight dataframe
list_col: <list 'string'>
Comumns to exclude from the data set
'''
data.drop(data[data.CANCELLED == 0].index, inplace=True)
data.drop(list_col, axis=1, inplace=True)
data.dropna(axis = 0, inplace = True)
return
In [ ]:
%%time
data2014 = clean(rawData2014.copy(), columns_dropped)
print data2014.columns
We memorize the cleaned dataset. This will save us some processing time. The two boxes below save the dataset and recover it from a 'cache/predictionData/' folder.
In [ ]:
#%%time
# save the data to avoid computing them again
#file_path = "cache/predictionData/predictionData2014.csv"
#data2014.to_csv(path_or_buf= file_path)
In [2]:
#%%time
# recover data2014 from cache/predictionData folder
#file_path = "cache/predictionData/predictionData2014.csv"
#data2014 = pd.read_csv(file_path)
#data2014.drop('Unnamed: 0', axis= 1, inplace = True)
#data2014.head(3)
In [6]:
# test that clean did the job
print "size of raw data set: ", len(rawData2014)
print "number of cancelled: ", len(rawData2014[(rawData2014.CANCELLED == 1)])
print "size of data set: ", len(data2014)
The dataset has more than 4 millions entries, which makes any data manipulation extremely costly - let alone model fitting. We will therefore make some restrictions on the airports and the airlines considered.
One way to do it is to make a case study, an example that we will follow all along. Here, the example looks at a flight from New York to Chicago.
Comment : it is possible to work on a different data set or a different example. However, due to the high complexity of the data set - both by a large number of points and by a large number of high-dimensional categorical variables, it is easier to work on a shrinked version of it.
In [3]:
dexample = data2014[data2014.DEST == 'ORD'][(data2014.ORIGIN == 'JFK') | (data2014.ORIGIN == 'LGA') | (data2014.ORIGIN == 'EWR')].copy()
print len(dexample)
dexample.head(3)
Out[3]:
We are only interested in the carrier that operates from New York to Chicago. Looking at the table, we also notice that Atlantic Southeast Airlines (airline code EV) is only marginally present. So we drop it from the list of carriers we will study in addition to the other carriers that do not operate on the line.
In [4]:
dexample.groupby('UNIQUE_CARRIER').size()
Out[4]:
In [5]:
# In case we want to tackle a broader case, we can use this function. The droplist can be any list of carrier.
def restrict_carrier(data, droplist):
'''
Drop carriers from the data set according to broadlist
Parameters:
-----------
data: pandas.DataFrame
dataframe
droplist: <list 'string'>
List of carriers to be droppped
'''
for item in droplist:
data.drop(data[data.UNIQUE_CARRIER == item].index, inplace= True)
return
In [6]:
%%time
drop_airline = [ 'AS','DL', 'EV', 'F9', 'FL', 'HA', 'MQ', 'US', 'VX', 'WN']
restrict_carrier(dexample, drop_airline)
In case we're doing a general study, the second step after droppping airlines is to drop airport. This is the purpose of the function below.
WARNING \ RUN THIS CELL ONLY ONCE! Because airports are linked to one another, everytime the function restrict_airport is run, some entries are dropped. After some time, there is no more entries.
To understand this, let's imagine that Boston Logan airport has an annual flight with a small airport XYZ. When we run the function the fisrt time, airport XYZ is droppped. Because of that, the total count of flights for Boston Logan airport decreases as well. If we run the function a second time, removing airport XYZ and its flights with Boston Logan has put the number of flights associated with Boston Logan under the threshold!
In [7]:
#remove all airports that have an annual traffic under threshold
def restrict_airport(data, threshold):
'''
Drop carriers from the data set.
Parameters:
-----------
data: pandas.DataFrame
dataframe
droplist: <list 'string'>
List of carriers to be droppped
'''
dict_count = data.groupby("DEST").agg(['count']).LAT.to_dict()['count']
for key in dict_count:
if dict_count[key] < threshold:
data.drop(data[data.DEST == key].index, inplace=True)
data.drop(data[data.ORIGIN == key].index, inplace=True)
print data.groupby("DEST").agg(['count']).LAT.to_dict()['count']
return
The date is given as a string in the format "Year-Month-Day". We exctract the month and the weekday in the following.
WARNING \ RUN THIS CELL ONLY ONCE! Because airports are linked to one another,
In [8]:
from time import strptime
days = {0:"Mon", 1:"Tues", 2:"Wed", 3:"Thurs", 4:"Fri", 5:"Sat", 6:"Sun"}
months = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"June", 7:"July", 8:"Aug", 9:"Sep", \
10:"Oct", 11:"Nov", 12:"Dec"}
In [9]:
def adjust_time(data):
'''
Return the month list and the weekday list out of a DataFrame.
Parameters:
-----------
data: pandas.DataFrame
dataframe that contains a colum of flight dates
'''
monlist = np.empty(len(data), dtype = str)
daylist = np.empty(len(data), dtype = str)
for i in xrange(len(data)):
date= strptime(data.FL_DATE.iloc[i], "%Y-%M-%d")
monlist[i] = months[date.tm_min]
daylist[i] = days[date.tm_wday]
return monlist, daylist
In [10]:
%%time
# now add the weekday and the month columns created by the adjust_time
monlist, daylist = adjust_time(dexample)
print "OK"
dexample['MONTH'] = pd.Series(monlist, index=dexample.index)
dexample['DAY'] = pd.Series(daylist, index=dexample.index)
if 'FL_DATE' in dexample.columns:
dexample.drop('FL_DATE', axis = 1, inplace= True)
print dexample.columns
Let's change teh format of all time entries. Instead of having hour : minute, we convert everything in minutes.
WARNING \ RUN THIS CELL ONLY ONCE!
In [11]:
%%time
ti = lambda x : x/100 + x % 100
dexample['CRS_ARR_TIME_COR'] = dexample.CRS_ARR_TIME.map(ti)
dexample['CRS_DEP_TIME_COR'] = dexample.CRS_DEP_TIME.map(ti)
dexample.drop(['CRS_DEP_TIME', 'CRS_ARR_TIME'], axis = 1, inplace = True)
We need to center and normalize all continuous data for better results. The idea is that if a feature X has a range of variation that is considerably higher than a feature Y, the variations of X may completely mask the variations of Y and therefore make Y useless.
In [12]:
def normalize(array):
mean = np.mean(array)
std = np.std(array)
return [(x - mean)/std for x in array]
In [13]:
def normalize_data(data, feature_list):
'''
Normalize data.
Parameters:
-----------
data: pandas.DataFrame
dataframe
feature_list: <list 'string'>
List of features to be normalized
'''
for feature in feature_list:
if feature in data.columns:
data[feature + '_NOR'] = normalize(data[feature].values)
data.drop(feature, axis =1, inplace=True)
return
In [14]:
dexample.drop(dexample[dexample.AIRCRAFT_YEAR ==' '].index, inplace = True)
dexample['AIRCRAFT_YEAR_COR'] = dexample.AIRCRAFT_YEAR.map(lambda x: int(x))
dexample.drop('AIRCRAFT_YEAR', axis = 1, inplace = True)
In [15]:
normalize_data(dexample, ['DISTANCE', 'LAT', 'LONG', 'CRS_ARR_TIME_COR', 'CRS_DEP_TIME_COR', 'AIRCRAFT_YEAR_COR'])
In [17]:
dexample.head(3)
Out[17]:
Many of the features are categorical - for example the origin airport feature contains strings which represent the code of each airport. To make a classification/regression possible, we need to encode these features as series of indicators. More specifically, if the origin airport feature has n possibles values A1, A2, ... , An, we create n indicator functions I(A1), ..., I(An). Indicator function I(Ak) takes the value 1 if the feature origin airport has value Ak.
In [18]:
encoded_list = ['UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'AIRCRAFT_MFR', 'MONTH','DAY']
dexample = pd.get_dummies(dexample, columns=encoded_list)
In [20]:
dexample.head(3)
Out[20]:
Let's save all of our results.
In [21]:
%%time
# save the restricted data to avoid computing them again
file_path = "cache/predictionData/dexample.csv"
dexample.to_csv(path_or_buf= file_path)
In [47]:
#%%time
# recover file
#file_path = "cache/predictionData/dexample.csv"
#dexample= pd.read_csv(file_path)
#dexample.drop('Unnamed: 0', axis= 1, inplace = True)
#print dexample.columns
Our goal is to anticipate the delay of a flight given its features. One way to proceed is to train a random forest regressor. But first, let's look at the baseline predictor, that is the predictor that returns the mean of data set. The measure of accuracy of a predictor will be its mean square error against a test set distinct from the training set.
In [102]:
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
In [106]:
def baseline_predictor(data):
return np.mean(data.ARR_DELAY.values)
print "MSE of the mean predictor:" , \
mean_squared_error(dexample.ARR_DELAY.values, [baseline_predictor(dexample) for x in xrange(len(dexample))])
First, let's split the data set into a training set and a test set.
In [66]:
def split(data, list_drop, target, test_size):
'''
Splits the data into a training and a test set
Separates the training and test sets according to a feature set and a target set
Balance the features sets by retaining only fraction of its points
Parameters:
-----------
data: pandas.DataFrame
Flight dataframe
list_drop: <list 'string'>
List of columns to exclude from the features set
target: string
target column along whch we make the target set
test_size: float
size of the test set
'''
#split the dataset into a training set and a test set
dtrain, dtest = train_test_split(data, test_size = 0.3)
Xtrain = dtrain.drop(list_drop, axis=1).values
ytrain = dtrain[target].values
Xtest = dtest.drop(list_drop, axis=1).values
ytest = dtest[target].values
return Xtrain, ytrain, Xtest, ytest
The functions below satisfy various tasks. The first trains a classifier and estimates its scores on the test set and the training set. The second tries different parameter values for the regressor. The last one wraps everything.
In [69]:
def score_random_forest(Xtrain, ytrain, Xtest, ytest, n_trees=10, max_features='auto'):
'''
Fits a random forest with (Xtrain ,ytrain)
Computes the score on (Xtest, ytest)
Parameters:
-----------
Xtrain: numpy 2D array
Feature training set
ytrain: numpy 1D array
Target training set
Xtest: numpy 2D array
Feature test set
ytest: numpy 1D array
Target test set
n_trees: int
number of trees in the forest
max_features: string or int
number of features used for every tree
Outputs:
--------
score_train: float
score on the train set
score_test: float
score on the test set
clf.feature_importances_
weights of each feature as used by the classifier
'''
clf= RandomForestRegressor(n_estimators=n_trees, max_features= max_features)
clf.fit(Xtrain, ytrain)
score_train = mean_squared_error(clf.predict(Xtrain), ytrain)
score_test = mean_squared_error(clf.predict(Xtest), ytest)
return score_train, score_test, clf
In [70]:
def best_parameters(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features):
'''
Fits sequentially random forest classifiers
Adds each test score in a pandas.DataFrame with the number of trees, the loss function, the train score,
and the importance of each features
Returns a DataFrame with all scores
Parameters:
-----------
Xtrain: numpy 2D array
Feature training set
ytrain: numpy 1D array
Target training set
Xtest: numpy 2D array
Feature test set
ytest: numpy 1D array
Target test set
n_trees: <list int>
list of numbers of trees in the forest
nb_features: <list int>
list of number of features in the forest
Outputs:
--------
score_tab: pandas.DataFrame
DataFrame of scores with associated parameters
'''
score_tab = pd.DataFrame(columns=['nb_trees', 'nb_features', 'test_score', 'train_score', 'classifier'])
# counter will increment the index in score_tab
counter = 0
for n_estimators in nb_trees:
for max_features in nb_features:
score_train, score_test, classifier = \
score_random_forest(Xtrain, ytrain, Xtest, ytest, n_trees=n_estimators, max_features=max_features)
score_tab.loc[counter] = [n_estimators, max_features, score_test, score_train, classifier]
counter += 1
return score_tab
In [71]:
def classify_random_forest(data, list_drop, target, test_size=0.4, nb_trees=[10], nb_features = ['auto']):
Xtrain, ytrain, Xtest, ytest = split(data, list_drop, target, test_size)
scores = best_parameters(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features)
return scores
We test the following parameters.
In [107]:
nb_trees = [25, 50, 75, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
nb_features = ['auto', 'log2', 'sqrt']
test_size = 0.4
In [108]:
%%time
randomForest2014 = classify_random_forest(dexample, ['ARR_DELAY'], 'ARR_DELAY', test_size=test_size, nb_trees=nb_trees, nb_features=nb_features)
In [125]:
randomForest2014.head()
Out[125]:
In [109]:
# save file to /data/ folder
file_path = "cache/predictionData/randomForest2014.csv"
randomForest2014.to_csv(path_or_buf= file_path)
In [127]:
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
In [137]:
# MSE on test set
fig = plt.gcf()
fig.set_size_inches(8, 5)
for key, co in zip(['auto', 'sqrt', 'log2'], ['b', 'g', 'y']):
x = randomForest2014[randomForest2014.nb_features == key].nb_trees.values
y = randomForest2014[randomForest2014.nb_features == key].test_score.values
plt.plot(x, y, alpha = 0.5, c =co, label = key)
plt.ylabel("Mean Square Error")
plt.xlabel("Number of features")
plt.title("MSE on test set by nb of features and nb of trees")
plt.legend(loc = 1)
plt.show()
In [139]:
# MSE on train set
fig = plt.gcf()
fig.set_size_inches(8, 5)
for key, co in zip(['auto', 'sqrt', 'log2'], ['b', 'g', 'y']):
x = randomForest2014[randomForest2014.nb_features == key].nb_trees.values
y = randomForest2014[randomForest2014.nb_features == key].train_score.values
plt.plot(x, y, alpha = 0.5, c =co, label = key)
plt.ylabel("Mean Square Error")
plt.xlabel("Number of features")
plt.title("MSE on train set by nb of features and nb of trees")
plt.legend(loc = 1)
plt.show()
We will make prediction on the variable 'ARR_DEL15'. This variable takes the value 1 is the plane is more than 15 minutes late and 0 if not. Let's look at the baseline classifier, that is the classifiers that assign repectively 1 or 0 to 'ARR_DEL15' for every flight.
Here we look at the case whether a flight will be more than 15 minutes late. So we adjust the ARR_DELAY colum to an indicator.
In [22]:
%%time
dexample['ARR_DELAY_COR'] = dexample.ARR_DELAY.map(lambda x: (x >= 15))
dexample.drop('ARR_DELAY', axis = 1, inplace = True)
file_path = "cache/predictionData/dexample_class.csv"
dexample.to_csv(path_or_buf= file_path)
dexample_class = dexample.copy()
In [ ]:
#file_path = "cache/predictionData/dexample_class.csv"
#dexample_class = pd.read_csv(path_or_buf= file_path)
#dexample_class.drop('Unnamed: 0', axis= 1, inplace = True)
In [23]:
dexample_class.head(3)
Out[23]:
In [24]:
from __future__ import division
def baseline_class(data, target):
'''
Compute the baseline classifiers along a target variable for a data set data
Parameters:
-----------
data: pandas.DataFrame
dataframe
target: string
Column of data along wich we compute the baseline classifiers
'''
score_baseline_1 = np.size(data[data[target] == 1][target].values) / np.size(data[target].values)
score_baseline_0 = np.size(data[data[target] == 0][target].values) / np.size(data[target].values)
print "baseline classifier everyone to 0: ", int(score_baseline_0*100) , "%"
print "baseline classifier everyone to 1: ", int(score_baseline_1*100) , "%"
return score_baseline_0, score_baseline_1
In [25]:
baseline_class(dexample_class, 'ARR_DELAY_COR')
Out[25]:
In [34]:
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
Random forest functions for a classification task
In [27]:
def split(data, list_drop, target, test_size):
'''
Splits the data into a training and a test set
Separates the training and test sets according to a feature set and a target set
Balance the features sets by retaining only fraction of its points
Parameters:
-----------
data: pandas.DataFrame
Flight dataframe
list_drop: <list 'string'>
List of columns to exclude from the features set
target: string
target column along whch we make the target set
test_size: float
size of the test set
'''
#split the dataset into a training set and a test set
dtrain, dtest = train_test_split(data, test_size = 0.3)
Xtrain = dtrain.drop(list_drop, axis=1).values
ytrain = dtrain[target].values
Xtest = dtest.drop(list_drop, axis=1).values
ytest = dtest[target].values
return Xtrain, ytrain, Xtest, ytest
In [35]:
def score_random_forest_classifier(Xtrain, ytrain, Xtest, ytest, n_trees=10, max_features='auto'):
'''
Fits a random forest with (Xtrain ,ytrain)
Computes the score on (Xtest, ytest)
Parameters:
-----------
Xtrain: numpy 2D array
Feature training set
ytrain: numpy 1D array
Target training set
Xtest: numpy 2D array
Feature test set
ytest: numpy 1D array
Target test set
n_trees: int
number of trees in the forest
max_features: string or int
number of features used for every tree
Outputs:
--------
score_train: float
score on the train set
score_test: float
score on the test set
clf.feature_importances_
weights of each feature as used by the classifier
'''
clf= RandomForestClassifier(n_estimators=n_trees, max_features= max_features)
clf.fit(Xtrain, ytrain)
score_train = clf.score(Xtrain, ytrain)
score_test = clf.score(Xtest, ytest)
return score_train, score_test, clf
In [36]:
def best_parameters_classifier(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features):
'''
Fits sequentially random forest classifiers
Adds each test score in a pandas.DataFrame with the number of trees, the loss function, the train score,
and the importance of each features
Returns a DataFrame with all scores
Parameters:
-----------
Xtrain: numpy 2D array
Feature training set
ytrain: numpy 1D array
Target training set
Xtest: numpy 2D array
Feature test set
ytest: numpy 1D array
Target test set
n_trees: <list int>
list of numbers of trees in the forest
nb_features: <list int>
list of number of features in the forest
Outputs:
--------
score_tab: pandas.DataFrame
DataFrame of scores with associated parameters
'''
score_tab = pd.DataFrame(columns=['nb_trees', 'nb_features', 'test_score', 'train_score', 'classifier'])
# counter will increment the index in score_tab
counter = 0
for n_estimators in nb_trees:
for max_features in nb_features:
score_train, score_test, classifier = \
score_random_forest_classifier(Xtrain, ytrain, Xtest, ytest, n_trees=n_estimators, max_features=max_features)
score_tab.loc[counter] = [n_estimators, max_features, score_test, score_train, classifier]
counter += 1
return score_tab
In [37]:
def classify_random_forest_class(data, list_drop, target, test_size=0.4, nb_trees=[10], nb_features = ['auto']):
Xtrain, ytrain, Xtest, ytest = split(data, list_drop, target, test_size)
scores = best_parameters_classifier(Xtrain, ytrain, Xtest, ytest, nb_trees, nb_features)
return scores
We look at various paramters for the classifier. Note that there are two loss functions available for the random forest classifier, a Gini loss and an entropy loss. Simulations show they behave similarly!
In [99]:
nb_trees = [25, 50, 75, 100, 150, 200, 300, 400, 500, 750, 1000, 2000, 5000]
nb_features = ['sqrt', 'log2']
test_size = 0.4
In [100]:
%%time
randomForest2014_class = classify_random_forest_class(dexample, ['ARR_DELAY_COR'], 'ARR_DELAY_COR', test_size=test_size, nb_trees=nb_trees, nb_features=nb_features)
In [90]:
randomForest2014_class.head(9)
Out[90]:
In [101]:
# save file to /data/ folder
file_path = "cache/predictionData/randomForest2014_class.csv"
randomForest2014_class.to_csv(path_or_buf= file_path)
Let's now look at the importance coefficients, that the average usage of each coefficients in the random forest. As we can see, the main factors in the delay of a flight are the age of the aircraft, the time of the departure and the time of the origin. The fact that the departure and arrival time confirm what we observed in the explorative analysis. However, the role of the age of the aircraft was not very marked in the explorative analysis. Lastly, the weekday has a huge influence.
Notice that the airline and the aircraft model play a very limited role in the delay according to this random forest classifier.
In [106]:
features = dexample_class.drop('ARR_DELAY_COR', axis =1).columns
coeffs_class = randomForest2014_class.classifier[len(randomForest2014_class)-1].feature_importances_
u = pd.Series(coeffs_class, index=features)
u.sort(inplace=True, ascending=False)
print u
In [93]:
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
In [107]:
# Score on test set
fig = plt.gcf()
fig.set_size_inches(8, 5)
for key, co in zip(['sqrt', 'log2'], ['b', 'g']):
x = randomForest2014_class[randomForest2014_class.nb_features == key].nb_trees.values
y = randomForest2014_class[randomForest2014_class.nb_features == key].test_score.values
plt.plot(x, y, alpha = 0.5, c =co, label = key)
plt.ylabel("Score")
plt.xlabel("Number of features")
plt.title("Score on test set by nb of features and nb of trees")
plt.legend(loc = 4)
plt.show()
In [111]:
# Score on train set
fig = plt.gcf()
fig.set_size_inches(8, 5)
for key, co in zip(['sqrt', 'log2'], ['b', 'g']):
x = randomForest2014_class[randomForest2014_class.nb_features == key].nb_trees.values
y = randomForest2014_class[randomForest2014_class.nb_features == key].train_score.values
plt.plot(x, y, alpha = 0.5, c =co, label = key)
plt.ylim(0.935, 0.95)
plt.ylabel("Score")
plt.xlabel("Number of features")
plt.title("Score on train set by nb of features and nb of trees")
plt.legend(loc = 4)
plt.show()
To allow users to enjoy more than a classfication experience we want to give them an expected delay time in minutes.
In [19]:
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random
import requests
import zipfile
import StringIO
import re
import json
import os
import matplotlib
import matplotlib.pyplot as plt
# sklearn functions used for the linear regression model
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_absolute_error
First, let us again establish a baseline which has to be beaten by our model. To get a feeling for a good baseline we pick flights from from New York(all airports) to Chicago(all airports).
In [30]:
# first step is to load the actual data and exclude rows that are unnecessary
# a script that produces the csv file can be found in the src folder
# (it might take some while to run, on my macbook up to one hour)
print('loading data...')
bigdf = pd.read_csv('cache/Big5FlightTable.csv')
years = ['2010', '2011', '2012', '2013', '2014']
In [31]:
# specify here which cities should be investigated
city_from = 'New York, NY'
city_to = 'Chicago, IL'
# filter for cities
bigdf = bigdf[(bigdf.ORIGIN_CITY_NAME == city_from) & (bigdf.DEST_CITY_NAME == city_to)]
For a span of days we are interested in knoweledge, which flight we should take. Therefore let's choose the popular date for Christmas returning flights 21.12.2015 as example. As we want to use historical data, let's get first clear what data we have. Is it possible to compare flights over the years?
In [32]:
query_day = 21
query_month = 12
# how many flights do exist in all years?
flights = []
flightvalues = []
for y in years:
query = list(bigdf[(bigdf.YEAR == int(y)) & (bigdf.MONTH == query_month) & (bigdf.DAY_OF_MONTH == query_day)].FL_NUM.astype(int).unique())
flights.append(query)
flightvalues += query
# build a matrix
data_matrix = np.zeros((len(flightvalues), len(years)))
# build dict
flightdict = dict(zip(flightvalues, np.arange(0, len(flightvalues))))
# fill datamatrix
for i in xrange(len(years)):
for j in flights[i]:
data_matrix[flightdict[j], i] = 1.
# plot matrix
plt.figure(figsize=(10, 6))
plt.imshow(data_matrix, extent=[0,data_matrix.shape[0],0,data_matrix.shape[1] * 5], \
interpolation='none', cmap='gray')
Out[32]:
As the plot shows, flight numbers are not stable over the years and it is not trivial to find a matching. Thus comparison by features for individual flights is not really possible. It seems as if airlines change their flight numbers on a yearly basis. Thus we need to come up with another idea and use a latent variable based approach instead. One of the easiest ideas is to average the delay time over each day as a base classifier and refine then.
In [34]:
dffordate = bigdf[bigdf.MONTH == query_month]
dffordate = dffordate[dffordate.DAY_OF_MONTH == query_day]
dffordate.head()
def predict_base_model(X):
return np.array([dffordate.ARR_DELAY.mean()]*X.shape[0])
To test the quality of this model, we use the last year as test set and the previous as train data. The idea is, that we are always interested in predicting the next year somehow. Thus, if the match for 2014 is good, we expect it to be the same for 2015.
In [70]:
# build test/train set
df_train = dffordate[dffordate.YEAR != int(years[-1])]
df_test = dffordate[dffordate.YEAR == int(years[-1])]
y_train = df_train.ARR_DELAY
X_train = y_train # here dummy
y_test = df_test.ARR_DELAY
X_test = y_test # here dummy
In the base model, the prediction for 2014 is that we are going to be 27.72 minutes late
In [71]:
y_pred = predict_base_model(X_test)
y_pred[0]
Out[71]:
How good did it perform comparing the actual arrival delay?
In [72]:
def rmse(y, y_pred):
return np.sqrt(((y - y_pred)**2).mean())
def mas(y, y_pred):
return (np.abs(y - y_pred)).mean()
MAS_base = mas(y_test, y_pred)
RMSE_base = rmse(y_test, y_pred)
RMSE_base
Out[72]:
Using the root mean squared error as one (of many possible) measures, any model that we develop should beat the benchmark of 44.
As flight delay changes over the time of day like explored in the data exploration part, we introduce a new feature which models in a categorical variable 10 minute time windows. I.e. for each window we introduce a latent variable that captures some sort of delay influence of this frame. This is done for both the departure and the arrival time. The model here is first developed for the reduced dataset containing the flights between New York / Chicago only.
In [41]:
%%time
bigdf['HOUR_OF_ARR'] = 0
bigdf['HOUR_OF_DEP'] = 0
for index, row in bigdf.iterrows():
bigdf.set_value(index, 'HOUR_OF_ARR', int(row['ARR_TIME']) / 10)
bigdf.set_value(index, 'HOUR_OF_DEP', int(row['DEP_TIME']) / 10)
In the next step, before fitting the actual model categorical variables need to be encoded as binary features (they have no order!) and numerical features shall be normalized.
In [44]:
# split data into numerical and categorical features
numericalFeat = bigdf[['DISTANCE', 'AIRCRAFT_AGE']].astype('float')
categoricalFeat = bigdf[['MONTH', 'DAY_OF_MONTH', 'ORIGIN', 'DEST', \
'HOUR_OF_ARR', 'HOUR_OF_DEP', 'UNIQUE_CARRIER', \
'DAY_OF_WEEK', 'AIRCRAFT_MFR']]
In [46]:
%%time
# for the next step, all features need to be encoded as integers --> create lookup Tables!
def transformToID(df, col):
vals = df[col].unique()
LookupTable = dict(zip(vals, np.arange(len(vals))))
for key in LookupTable.keys():
df[df[col] == key] = LookupTable[key]
return (LookupTable, df)
mfrDict, categoricalFeat = transformToID(categoricalFeat, 'AIRCRAFT_MFR')
originDict, categoricalFeat = transformToID(categoricalFeat, 'ORIGIN')
destDict, categoricalFeat = transformToID(categoricalFeat, 'DEST')
carrierDict, categoricalFeat = transformToID(categoricalFeat, 'UNIQUE_CARRIER')
In [47]:
categoricalFeat.head()
Out[47]:
Using sklearn the variables are now encoded. As we have many variables (around 1200 in the final model for the whole year) using sparse matrices was critical for us to fit the model.
In [48]:
import sklearn
from sklearn import linear_model
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error
In [49]:
%%time
encoder = OneHotEncoder()
categoricals_encoded = encoder.fit_transform(categoricalFeat)
Recombining categorical and numerical features allows to start the usual Machine Learning routine (split into test/train, normalize, train/cross-validate, test)
In [51]:
numericals_sparse = sparse.csr_matrix(numericalFeat)
# get data matrix & response variable
X_all = sparse.hstack((numericals_sparse, categoricals_encoded))
y_all = bigdf['ARR_DELAY'].values
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.2, random_state = 42)
# we have 2 numerical features
X_train_numericals = X_train[:, 0:3].toarray()
X_test_numericals = X_test[:, 0:3].toarray()
# use sklearn tools to standardize numerical features...
scaler = StandardScaler()
scaler.fit(X_train_numericals) # get std/mean from train set
X_train_numericals = sparse.csr_matrix(scaler.transform(X_train_numericals))
X_test_numericals = sparse.csr_matrix(scaler.transform(X_test_numericals))
# update sets
X_train[:, 0:3] = X_train_numericals
X_test[:, 0:3] = X_test_numericals
In [54]:
clf = sklearn.linear_model.LinearRegression()
# fit the model!
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
rmse(y_test, y_pred)
Out[54]:
In a first model, we use OLS to get a feeling of what is achievable using a reduced dataset. The motivation for this is mainly to see whether this might be used to speedup the whole process. As the RMSE reveals, we did not do better than the base classifier. What happened?
In [57]:
X_train.shape, clf.coef_.shape
Out[57]:
Looking at the number of variables (74 here), it might be that either the fit is not good enough or we are ignoring the dependency of flight delays within different routes.
In [62]:
%%time
# Use ridge regression (i.e. Gaussian prior) and vary the lambda parameter using Grid search
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV
SGD_params = {'alpha': 10.0 ** -np.arange(-2,8)}
SGD_model = GridSearchCV(SGDRegressor(random_state = 42), \
SGD_params, scoring = 'mean_absolute_error', cv = 6) # cross validate 6 times
# train the model, this might take some time...
SGD_model.fit(X_train, y_train)
In [63]:
y_pred = SGD_model.predict(X_test)
rmse(y_test, y_pred)
Out[63]:
Looking at the RMSE again is a bit disappointing. We did worse than OLS and the base classifier! This means, it is now time to tackle the unavoidable: Regressing the whole dataset!
As the data becomes rapidly huge (for 1 year ~ 700MB uncompressed CSV, 5 years ~ 3.5GB, 10 years ~ 7.2GB) the code to perform the actual regression has been developed first in an IPython notebook and then run separately. It can be found in the src
folder.
In [131]:
import json
import seaborn as sns
import matplotlib.ticker as ticker
def load_model(filename):
mdl = None
with open(filename, 'r') as f:
mdl = json.load(f)
return mdl
# load model results
mdl1 = load_model('results/models/2014model.json') # model with 2014 data (1 year)
mdl5 = load_model('results/models/2010_2014model.json') # model with 2010-2014 data ( 5 years)
labels = np.array(['base predictor', '1 year LM', '5 year LM'])
RMSEs = np.array([RMSE_base, mdl1['RMSE'], mdl5['RMSE']])
MASs = np.array([MAS_base, mdl1['MAS'], mdl5['MAS']])
width = .35
xx = np.arange(len(RMSEs))
plt.bar(xx-width, RMSEs, width, label='RMSE', color=sns.color_palette()[0])
plt.bar(xx , MASs, width, label='MAE', color=sns.color_palette()[2])
plt.legend(loc='best')
# Hide major tick labels
ax = plt.gca()
ax.xaxis.set_major_formatter(ticker.NullFormatter())
# Customize minor tick labels
ax.xaxis.set_minor_locator(ticker.FixedLocator([i for i in xx]))
ax.xaxis.set_minor_formatter(ticker.FixedFormatter([labels[i] for i in xx]))
Comparing the RMSEs and Mean absolute error we see that the linear regression model over the whole data clearly outperformed the baseline predictor. Also, more data did not really tremendously improve the mean absolute error but we can see a clear variance reduction in the RMSE. To further improve the model, additional variables should be included. One of the next ideas is weather data. As formatted historic weather data is unfortunately not for free available and scraping it for 100Ks entries per year is not feasible, the variables could not be added to the model. Another interesting idea might be to include some more meta data. I.e. an variable measuring the effect of holidays and variables accounting for winds which could be based by geographic location (i.e. a flight in western direction is longer than one in eastern direction b/c of the passat wind.
In [132]:
from datetime import timedelta, datetime, tzinfo
from pytz import timezone
import pytz
def convertToUTC(naive, zonestring="America/New_York"):
local = pytz.timezone (zonestring)
local_dt = local.localize(naive, is_dst=None)
return local_dt.astimezone (pytz.utc)
In the following, we want to give an example on how to perform a prediction using the model from 2010-2014. We choose the route Chicago / New York.
In [133]:
# we are only interested in flights NY to Chicago!
city_to = 'Chicago, IL'
city_from = 'New York, NY'
zone_to = 'America/Chicago'
zone_from = 'America/New_York'
Especially before Christmas, it is a good question to ask on which day you should fly and which flight to take. Therefore, we consider only flights between 20.12-24.12. Our assumption is that many values we feed in our predictor do not change, i.e. we can use the data from 2014 (stored in BigFlightTable.csv). For a productive system, these queries should be performed using a database.
In [134]:
# need to create a lookup table for the values (i.e. flight numbers, city and so on and then drop all duplicates!)
db = pd.read_csv('cache/BigFlightTable.csv')
# remove all unnecessary columns
db = db[['ORIGIN_CITY_NAME', 'DEST_CITY_NAME', 'AIRCRAFT_AGE', 'DEST', 'ARR_TIME', \
'DEP_TIME', 'UNIQUE_CARRIER', 'DAY_OF_WEEK', 'AIRCRAFT_MFR', 'FL_NUM', 'MONTH', \
'DAY_OF_MONTH', 'DISTANCE', 'ORIGIN']]
print str(db.count()[0]) + ' entries'
db.head()
# drop everything except for the 5 days before christmas! i.e. 20.12, 21.12, 22.12, 23.12, 24.12.
db = db[db.MONTH == 12]
db = db[db.DAY_OF_MONTH <= 24]
db = db[db.DAY_OF_MONTH >= 20]
print '5 days have ' + str(db.count()[0]) + ' flights'
db = db[db.ORIGIN_CITY_NAME == city_from]
db = db[db.DEST_CITY_NAME == city_to]
db.reset_index(inplace=True)
print 'Found ' + str(db.count()[0]) + ' flights from ' + city_from + ' to ' + city_to + ' for 20.12 - 24.12'
We know load our favourite model and setup the categorical variable encoder.
In [135]:
mdl = load_model('results/models/2014model.json')
# categorical feature encoder, fitted on the keys
encoder = OneHotEncoder(sparse=True, n_values=mdl['encoder']['values'])
Using the lookuptables, it is straight-forward to write the prediction function. Note that variabales need to be normalized according to the training data used in the model.
In [136]:
# input is a datarow
# prediction of day in the next year!
def predictDelayTime(row, mdl):
s_mean, s_std, coeff, intercept = mdl['scaler_mean'], mdl['scaler_std'], mdl['coeff'], mdl['intercept']
# read out tables
carrierTable = mdl['CARRIER']
mfrTable = mdl['MANUFACTURER']
destTable = mdl['DEST']
originTable = mdl['ORIGIN']
distance = row['DISTANCE'] # <-- look this up!
aircraft_age = row['AIRCRAFT_AGE'] # <-- look this up!
# normalize numerical features according to scaler
distance = (distance - s_mean[0]) / s_std[0]
aircraft_age = (aircraft_age + 1 - s_mean[1]) / s_std[1]
month = row['MONTH']
day_of_month = row['DAY_OF_MONTH']
origin = row['ORIGIN']
dest = row['DEST']
hour_of_arr = int(row['ARR_TIME']) / 10
hour_of_dep = int(row['DEP_TIME']) / 10
carrier = row['UNIQUE_CARRIER']
day_of_week = datetime(year=2015, month=row.MONTH, day=row.DAY_OF_MONTH).weekday() # <-- get via datetimeobject
mfr = row['AIRCRAFT_MFR']
# for nonindexed categorical features, do lookup!
origin = originTable[origin]
dest = destTable[dest]
mfr = mfrTable[mfr]
carrier = carrierTable[carrier]
# write into df
df = {}
df['MONTH'] = month
df['DAY_OF_MONTH'] = day_of_month
df['ORIGIN'] = origin
df['DEST'] = dest
df['HOUR_OF_ARR'] = hour_of_arr
df['HOUR_OF_DEP'] = hour_of_dep
df['UNIQUE_CARRIER'] = carrier
df['DAY_OF_WEEK'] = day_of_week
df['AIRCRAFT_MFR'] = mfr
df = pd.DataFrame([df])
# order here is important! make sure it is the same as in the model!
categoricalFeat = df[['MONTH', 'DAY_OF_MONTH', 'ORIGIN',
'DEST', 'HOUR_OF_ARR', 'HOUR_OF_DEP',
'UNIQUE_CARRIER', 'DAY_OF_WEEK', 'AIRCRAFT_MFR']].copy() # Categorical features
# construct the data vector for the linear model
categoricals_encoded = encoder.fit_transform(categoricalFeat)
num_features = np.array([distance, aircraft_age])
cat_features = categoricals_encoded.toarray().T.ravel()
w = np.hstack([num_features, cat_features])
y_pred = np.dot(w, coeff) + intercept
return y_pred[0]
We are now ready to predict the delay time on our flights!
In [137]:
# create for each day info
db['PREDICTED_DELAY'] = 0.
db['FLIGHT_TIME'] = 0
db['PREDICTED_FLIGHT_TIME'] = 0
for index, row in db.iterrows():
print 'processing {idx}'.format(idx=index)
y_pred = predictDelayTime(row, mdl)
db.set_value(index, 'PREDICTED_DELAY', y_pred)
arr_time = datetime(year=2015, month=row['MONTH'], day=row['DAY_OF_MONTH'], \
hour= int(row['ARR_TIME'] / 100), minute=int(row['ARR_TIME'] % 100))
dep_time = datetime(year=2015, month=row['MONTH'], day=row['DAY_OF_MONTH'], \
hour= int(row['DEP_TIME'] / 100), minute=int(row['DEP_TIME'] % 100))
flight_time_in_min = (convertToUTC(arr_time) - convertToUTC(dep_time))
flight_time_in_min = int(flight_time_in_min.total_seconds() / 60)
db.set_value(index, 'FLIGHT_TIME', flight_time_in_min)
db.set_value(index, 'PREDICTED_FLIGHT_TIME', y_pred + flight_time_in_min)
In [140]:
db2 = db[db.DAY_OF_MONTH == 22]
dbres2 = db2.sort('PREDICTED_FLIGHT_TIME')
dbres2.to_csv('data/best_flights.csv')
In [141]:
db3 = db[(db.DEP_TIME > 1700) & (db.DEP_TIME < 2000)]
dbres3 = db2.sort('PREDICTED_FLIGHT_TIME')
dbres3.head()
Out[141]:
Surprisingly, the best day to fly for in the evening is the 22nd!