FLIGHT TRUST

Summary

Simple python script that runs a regression to predict actual flight time and probability of delay of airlines by route. This document takes you through each of the steps to run the regression for both of these predictions. The document is divided into two main sections: the development, which show the development of the regression, and the Validations, which shows how we validated the data and created the ROC curves.

Development

1. Import

First, we start by importing all the neccesary libraries to the project



In [1]:

    
import pandas as pd
import statsmodels.api as sm
from sklearn.cross_validation import train_test_split
import math
import numpy as np
import matplotlib.pyplot as plt

2. Extract Data



In [2]:

    
gdf = pd.read_csv("./CSV/merged.csv")









    



/Users/gal/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (0,1,2,3,4,6,7,12,13,14,15,16,17,19,20,21,22,23,24,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

3. Select Data

We start by filtering the columns that were considered the most important based on a data analysis made prior to the development of the regression.

We filter by the relevant columns.



In [3]:

    
df1 = gdf[['AIRLINE_ID','ORIGIN', 'DEST', 'DEP_TIME','ARR_TIME','DEP_DELAY','ARR_DELAY','CANCELLED','DIVERTED','ACTUAL_ELAPSED_TIME']]

Afterwards, we filter by the important airports (ATL, DFW, JFK, LAX and ORD)



In [4]:

    
df2 = df1.query('(ORIGIN == "ATL" or ORIGIN == "DFW" or ORIGIN == "JFK" or ORIGIN == "LAX" or ORIGIN == "ORD") and (DEST == "ATL" or DEST == "DFW" or DEST == "JFK" or DEST == "LAX" or DEST == "ORD")')

4. Sample Data

We now get a random sample Data from the big data set



In [5]:

    
sampledf = df2.sample(n=10000)
print("Sample Size:" + str(len(sampledf.axes[0])))









    



Sample Size:10000

5. Clean Data

Having the sample data, we proceed to clean the data and make sure there are no invalid data in the dataset

First, we trim the string columns to avoid any unexpected error



In [6]:

    
sampledf["AIRLINE_ID"] = sampledf.apply(lambda row: str.strip(str(row.AIRLINE_ID)), axis=1)
sampledf["ORIGIN"] = sampledf.apply(lambda row: str.strip(str(row.ORIGIN)), axis=1)
sampledf["DEST"] = sampledf.apply(lambda row: str.strip(str(row.DEST)), axis=1)

Second, we delete any rows with null values



In [7]:

    
sampledf = sampledf.dropna()

Third, we change the Actual Elapse Time column to an integer



In [8]:

    
sampledf["ACTUAL_ELAPSED_TIME"] = sampledf.apply(lambda row: int(float(row.ACTUAL_ELAPSED_TIME)), axis=1)

Finally, we clean any invalid Data - any flight that has negative time



In [9]:

    
sampledf = sampledf[sampledf.ACTUAL_ELAPSED_TIME >= 0]

6. Add new Columns

We can now calculate the new columns we need for the regression. Based on a prior data analysis we found that the period of the day is relevant for the delay and flight time of a flight.

So, we first calculate the flight periods columns, where Morning is from 6 to 12 , Afternoon is from 12 to 19, Night is from 19 to 24, and Dawn is from 24 to 6



In [10]:

    
sampledf["Morning"]   = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 600 and int(row.DEP_TIME) < 1200) else 0, axis=1)
sampledf["Afternoon"] = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 1200 and int(row.DEP_TIME) < 1900) else 0, axis=1)
sampledf["Night"]     = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 1900 and int(row.DEP_TIME) < 2400) else 0, axis=1)
sampledf["Dawn"]      = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 2400 and int(row.DEP_TIME) < 600) else 0, axis=1)

We also calculate the delay column, which is one if a flight was cancelled, diverted, or has delayed over 30 min



In [11]:

    
sampledf["Delayed"] = sampledf.apply(lambda row: 1 if(row.CANCELLED or row.DIVERTED or row.ARR_DELAY > 30) else 0 , axis=1)
sampledf.columns









    Out[11]:





Index(['AIRLINE_ID', 'ORIGIN', 'DEST', 'DEP_TIME', 'ARR_TIME', 'DEP_DELAY',
       'ARR_DELAY', 'CANCELLED', 'DIVERTED', 'ACTUAL_ELAPSED_TIME', 'Morning',
       'Afternoon', 'Night', 'Dawn', 'Delayed'],
      dtype='object')

7. Dummy Variables

Now we need to create all of the dummy variables for each of the columns



In [12]:

    
originDummy = pd.get_dummies(sampledf["ORIGIN"], prefix="ORG", drop_first=True)
destDummy = pd.get_dummies(sampledf["DEST"], prefix="DST", drop_first=True)
airlineDummy = pd.get_dummies(sampledf["AIRLINE_ID"], prefix="AIRLN", drop_first=True)

Using the dummy variables, we create a table for the regression by concatenating all of the dummy columns and the dependant variable



In [13]:

    
dummyDf = pd.DataFrame()
dummyDf = pd.concat([originDummy,destDummy,airlineDummy,sampledf['Morning'], sampledf['Afternoon'], sampledf['Night'],sampledf['Delayed'],sampledf['ACTUAL_ELAPSED_TIME']], axis=1)
dummyDf.head()









    Out[13]:






  
    
      
      ORG_DFW
      ORG_JFK
      ORG_LAX
      ORG_ORD
      DST_DFW
      DST_JFK
      DST_LAX
      DST_ORD
      AIRLN_19790
      AIRLN_19805
      ...
      AIRLN_20398
      AIRLN_20409
      AIRLN_20416
      AIRLN_20436
      AIRLN_21171
      Morning
      Afternoon
      Night
      Delayed
      ACTUAL_ELAPSED_TIME
    
  
  
    
      412672
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0
      1
      0
      0
      168
    
    
      3104319
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      1.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0
      1
      0
      0
      372
    
    
      5416505
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      ...
      0.0
      1.0
      0.0
      0.0
      0.0
      1
      0
      0
      0
      165
    
    
      3953272
      0.0
      0.0
      1.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      1.0
      0.0
      0.0
      0.0
      0
      0
      1
      0
      303
    
    
      667530
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      1.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0
      0
      1
      0
      411
    
  

5 rows × 23 columns

8. Split Test & Learn Datasets

Then, we split the sample data in training and test data set -- Test size is 20% of the hole data set



In [14]:

    
trainingDF, testDF = train_test_split(dummyDf, test_size = 0.2)

And we make sure all variables are an integer for the regression



In [15]:

    
trainingDF = trainingDF.applymap(np.int)
testDF = testDF.applymap(np.int)

9. 50-50 Data

We now split the data set to make sure that 50% of the rows are delayed flights and 50% are non-delayed flights

We do it for the training data set



In [16]:

    
trainingDFDelayed =  trainingDF[trainingDF.Delayed == 1].head(500)
trainingDFNotDelayed = trainingDF[trainingDF.Delayed == 0].head(500)

allTraining = [trainingDFNotDelayed,trainingDFDelayed]
trainingDF = pd.concat(allTraining)

print("Training Size:" + str(len(trainingDF.axes[0])))









    



Training Size:1000

And for the testing data set



In [17]:

    
testDFDelayed = testDF[testDF.Delayed == 1].head(100)
testDFNotDelayed = testDF[testDF.Delayed == 0].head(100)

allTest = [testDFDelayed,testDFNotDelayed]
testDF = pd.concat(allTest)

print("Testing Size:" + str(len(testDF.axes[0])))









    



Testing Size:200

10. Regression - Delayed

We can finally run the regression for the Delayed dependent Variable



In [18]:

    
XValues = sm.add_constant(trainingDF[trainingDF.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])], prepend=False)
resultDelayed = sm.OLS(trainingDF['Delayed'], XValues).fit()
print(resultDelayed.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Delayed   R-squared:                       0.238
Model:                            OLS   Adj. R-squared:                  0.221
Method:                 Least Squares   F-statistic:                     14.52
Date:                Tue, 29 Nov 2016   Prob (F-statistic):           9.82e-45
Time:                        22:08:56   Log-Likelihood:                -590.07
No. Observations:                1000   AIC:                             1224.
Df Residuals:                     978   BIC:                             1332.
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
AIRLN_19790    -0.0547      0.145     -0.376      0.707        -0.340     0.231
AIRLN_19805     0.0715      0.150      0.478      0.633        -0.222     0.365
AIRLN_19977     0.0039      0.153      0.025      0.980        -0.296     0.304
AIRLN_20304     0.2678      0.165      1.623      0.105        -0.056     0.592
AIRLN_20366    -0.2970      0.347     -0.857      0.392        -0.977     0.383
AIRLN_20398     0.1778      0.265      0.672      0.502        -0.342     0.697
AIRLN_20409    -0.0414      0.162     -0.256      0.798        -0.359     0.277
AIRLN_20416     0.1618      0.158      1.027      0.304        -0.147     0.471
AIRLN_20436     0.0647      0.172      0.377      0.706        -0.272     0.402
AIRLN_21171     0.0978      0.166      0.590      0.555        -0.228     0.423
Afternoon      -0.3956      0.042     -9.405      0.000        -0.478    -0.313
DST_DFW        -0.0584      0.062     -0.937      0.349        -0.181     0.064
DST_JFK         0.1066      0.060      1.766      0.078        -0.012     0.225
DST_LAX        -0.0580      0.054     -1.071      0.284        -0.164     0.048
DST_ORD         0.0250      0.057      0.439      0.661        -0.087     0.137
Morning        -0.5878      0.043    -13.743      0.000        -0.672    -0.504
Night          -0.2878      0.047     -6.093      0.000        -0.380    -0.195
ORG_DFW         0.0556      0.063      0.882      0.378        -0.068     0.179
ORG_JFK         0.1305      0.063      2.080      0.038         0.007     0.254
ORG_LAX        -0.0749      0.056     -1.333      0.183        -0.185     0.035
ORG_ORD         0.1194      0.059      2.027      0.043         0.004     0.235
const           0.7654      0.151      5.082      0.000         0.470     1.061
==============================================================================
Omnibus:                      327.911   Durbin-Watson:                   0.478
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.119
Skew:                           0.128   Prob(JB):                     7.94e-12
Kurtosis:                       1.923   Cond. No.                         48.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

11. Regression - Predicted Total Time (Flight time + taxi)

We run the regression to predict the flight time



In [19]:

    
XValues = sm.add_constant(trainingDF[trainingDF.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])])
resultTime = sm.OLS(trainingDF['ACTUAL_ELAPSED_TIME'], XValues ).fit()
print(resultTime.summary())









    



                             OLS Regression Results                            
===============================================================================
Dep. Variable:     ACTUAL_ELAPSED_TIME   R-squared:                       0.854
Model:                             OLS   Adj. R-squared:                  0.851
Method:                  Least Squares   F-statistic:                     272.2
Date:                 Tue, 29 Nov 2016   Prob (F-statistic):               0.00
Time:                         22:09:16   Log-Likelihood:                -4929.8
No. Observations:                 1000   AIC:                             9904.
Df Residuals:                      978   BIC:                         1.001e+04
Df Model:                           21                                         
Covariance Type:             nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
const         162.7184     11.548     14.090      0.000       140.056   185.380
AIRLN_19790   -36.6926     11.154     -3.290      0.001       -58.581   -14.804
AIRLN_19805   -32.3481     11.472     -2.820      0.005       -54.861    -9.835
AIRLN_19977   -30.5088     11.728     -2.601      0.009       -53.524    -7.493
AIRLN_20304   -28.5561     12.654     -2.257      0.024       -53.388    -3.724
AIRLN_20366   -42.7696     26.577     -1.609      0.108       -94.924     9.385
AIRLN_20398   -54.4028     20.298     -2.680      0.007       -94.235   -14.571
AIRLN_20409   -47.0351     12.427     -3.785      0.000       -71.421   -22.649
AIRLN_20416   -34.3411     12.079     -2.843      0.005       -58.045   -10.637
AIRLN_20436   -27.6405     13.161     -2.100      0.036       -53.468    -1.813
AIRLN_21171   -28.4629     12.712     -2.239      0.025       -53.409    -3.517
Afternoon       2.0126      3.225      0.624      0.533        -4.317     8.342
DST_DFW         3.0961      4.781      0.648      0.517        -6.285    12.477
DST_JFK        73.3682      4.628     15.852      0.000        64.285    82.451
DST_LAX       139.8638      4.153     33.676      0.000       131.713   148.014
DST_ORD         8.9190      4.360      2.045      0.041         0.362    17.476
Morning         1.4039      3.280      0.428      0.669        -5.032     7.840
Night           1.2318      3.622      0.340      0.734        -5.876     8.340
ORG_DFW       -18.8795      4.833     -3.906      0.000       -28.364    -9.395
ORG_JFK        88.3285      4.810     18.362      0.000        78.889    97.768
ORG_LAX       102.3241      4.307     23.758      0.000        93.872   110.776
ORG_ORD        -6.8527      4.518     -1.517      0.130       -15.720     2.014
==============================================================================
Omnibus:                       47.866   Durbin-Watson:                   1.891
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              155.574
Skew:                          -0.011   Prob(JB):                     1.65e-34
Kurtosis:                       4.932   Cond. No.                         48.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

VALIDATION AND TESTING

Validate with Test Data -- Delayed Prediction

Using the regression, we predict the outcome for the testing data set for the delay prediction.



In [24]:

    
#Copy of the testing data set
validateDataDelay = testDF.copy()

#Get a subset of the data without the validation data 
subsetPredictDelay = validateDataDelay[validateDataDelay.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])]

#Predict the outcome with the regression and put the result in a new column
subsetPredictDelay['Calculated_Delay'] = subsetPredictDelay.apply(lambda row: (row * resultDelayed.params).sum(),axis=1)

#Add the real outcome in a new column
subsetPredictDelay["Real_Delayed"] = testDF["Delayed"]

subsetPredictDelay.head()[["Real_Delayed","Calculated_Delay"]]









    



/Users/gal/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






    Out[24]:






  
    
      
      Real_Delayed
      Calculated_Delay
    
  
  
    
      5350147
      1
      0.143945
    
    
      2506044
      1
      -0.399001
    
    
      1467895
      1
      -0.649615
    
    
      958060
      1
      0.132859
    
    
      2153379
      1
      -0.435742

Validate with Test Data -- Predicted Total Time (Flight time + taxi)

Using the regression, we predict the outcome for the testing data set for the flight time prediction.



In [23]:

    
#Copy of the testing data set
validateDataTime = testDF.copy()

subsetPredictTime = validateDataTime[validateDataTime.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])]
subsetPredictTime["const"] = 1
subsetPredictTime['Calculated'] = subsetPredictTime.apply(lambda row: (row * resultTime.params).sum(),axis=1)

subsetPredictTime["ACTUAL_ELAPSED_TIME"] = validateDataTime["ACTUAL_ELAPSED_TIME"]
subsetPredictTime["Difference"] = subsetPredictTime.apply(lambda row: abs(row.ACTUAL_ELAPSED_TIME - row.Calculated), axis=1)

subsetPredictTime.head()[["ACTUAL_ELAPSED_TIME","Calculated","Difference"]]









    



/Users/gal/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






    Out[23]:






  
    
      
      ACTUAL_ELAPSED_TIME
      Calculated
      Difference
    
  
  
    
      5350147
      367
      358.562598
      8.437402
    
    
      2506044
      249
      234.707071
      14.292929
    
    
      1467895
      203
      237.194454
      34.194454
    
    
      958060
      258
      263.381389
      5.381389
    
    
      2153379
      114
      121.813684
      7.813684

Predicted Total Time Graph (Testing set is used here)

Now we create a testing graph where we can see how much if the data is under a specific difference of time.



In [30]:

    
#Create dataframe with the difference ranges
roicTime = pd.DataFrame({"Values":range(int(subsetPredictTime["Difference"].min()),int(subsetPredictTime["Difference"].max()),10)})
roicTime["Percentage"] = roicTime.apply(lambda row: len(subsetPredictTime[subsetPredictTime.Difference < row.Values]["Difference"]) / len(subsetPredictTime["Difference"]) * 100, axis=1 )

plt.plot(roicTime.Values,roicTime.Percentage)
plt.xlabel("Time Difference between actual and predicted (min)")
plt.ylabel("% of Data covered")
plt.title('Predict Time Validation Graph')
plt.show()

Calculate ROC -- Predicted Delay (Testing set is used here)

We calculated the Truth Table



In [33]:

    
roicDelay = pd.DataFrame({"Values": np.arange(subsetPredictDelay["Calculated_Delay"].min(),subsetPredictDelay["Calculated_Delay"].max(),0.1)})

#True Positive
roicDelay["T_P"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay > row.Values) & (subsetPredictDelay.Real_Delayed == 1)]),axis=1)
#False Positive
roicDelay["F_P"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay > row.Values) & (subsetPredictDelay.Real_Delayed == 0)]),axis=1)
#True Negative
roicDelay["T_N"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay < row.Values) & (subsetPredictDelay.Real_Delayed == 0)]),axis=1)
#False Negative
roicDelay["F_N"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay < row.Values) & (subsetPredictDelay.Real_Delayed == 1)]),axis=1)

roicDelay

Finally, we graph the ROC line



In [35]:

    
#False Posive Ration 
roicDelay["F_P_R"] = roicDelay.apply(lambda row: row["F_P"]/(row["F_P"] + row["T_N"]),axis=1)
#Recall Ration
roicDelay["Recall"] = roicDelay.apply(lambda row: row["T_P"]/(row["T_P"] + row["F_N"]),axis=1)

plt.plot(roicDelay["F_P_R"],roicDelay["Recall"] )
plt.xlabel("False Positive Relation")
plt.ylabel("Recall")
plt.title('ROC Chart')
plt.show()



In [ ]:

	ORG_DFW	ORG_JFK	ORG_LAX	DST_JFK	DST_LAX	DST_ORD	AIRLN_19805	...	AIRLN_20409	Morning	Afternoon	Night	ACTUAL_ELAPSED_TIME
412672	1.0	0.0	0.0	0.0	1.0	0.0	0.0	...	0.0	0	1	0	168
3104319	0.0	1.0	0.0	0.0	1.0	0.0	1.0	...	0.0	0	1	0	372
5416505	0.0	1.0	0.0	0.0	0.0	1.0	0.0	...	1.0	1	0	0	165
3953272	0.0	0.0	1.0	1.0	0.0	0.0	0.0	...	1.0	0	0	1	303
667530	0.0	1.0	0.0	0.0	1.0	0.0	1.0	...	0.0	0	0	1	411

	Real_Delayed	Calculated_Delay
5350147	1	0.143945
2506044	1	-0.399001
1467895	1	-0.649615
958060	1	0.132859
2153379	1	-0.435742

	ACTUAL_ELAPSED_TIME	Calculated	Difference
5350147	367	358.562598	8.437402
2506044	249	234.707071	14.292929
1467895	203	237.194454	34.194454
958060	258	263.381389	5.381389
2153379	114	121.813684	7.813684

	Values	T_P	F_P	T_N	F_N
0	-0.717371	100	98	0	0
1	-0.617371	97	91	9	3
2	-0.517371	96	75	25	4
3	-0.417371	85	58	42	15
4	-0.317371	69	44	56	31
5	-0.217371	53	20	80	47
6	-0.117371	50	7	93	50
7	-0.017371	44	5	95	56
8	0.082629	30	2	98	70
9	0.182629	3	1	99	97