Simple python script that runs a regression to predict actual flight time and probability of delay of airlines by route. This document takes you through each of the steps to run the regression for both of these predictions. The document is divided into two main sections: the development, which show the development of the regression, and the Validations, which shows how we validated the data and created the ROC curves.
In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.cross_validation import train_test_split
import math
import numpy as np
import matplotlib.pyplot as plt
In [2]:
gdf = pd.read_csv("./CSV/merged.csv")
In [3]:
df1 = gdf[['AIRLINE_ID','ORIGIN', 'DEST', 'DEP_TIME','ARR_TIME','DEP_DELAY','ARR_DELAY','CANCELLED','DIVERTED','ACTUAL_ELAPSED_TIME']]
Afterwards, we filter by the important airports (ATL, DFW, JFK, LAX and ORD)
In [4]:
df2 = df1.query('(ORIGIN == "ATL" or ORIGIN == "DFW" or ORIGIN == "JFK" or ORIGIN == "LAX" or ORIGIN == "ORD") and (DEST == "ATL" or DEST == "DFW" or DEST == "JFK" or DEST == "LAX" or DEST == "ORD")')
In [5]:
sampledf = df2.sample(n=10000)
print("Sample Size:" + str(len(sampledf.axes[0])))
In [6]:
sampledf["AIRLINE_ID"] = sampledf.apply(lambda row: str.strip(str(row.AIRLINE_ID)), axis=1)
sampledf["ORIGIN"] = sampledf.apply(lambda row: str.strip(str(row.ORIGIN)), axis=1)
sampledf["DEST"] = sampledf.apply(lambda row: str.strip(str(row.DEST)), axis=1)
Second, we delete any rows with null values
In [7]:
sampledf = sampledf.dropna()
Third, we change the Actual Elapse Time column to an integer
In [8]:
sampledf["ACTUAL_ELAPSED_TIME"] = sampledf.apply(lambda row: int(float(row.ACTUAL_ELAPSED_TIME)), axis=1)
Finally, we clean any invalid Data - any flight that has negative time
In [9]:
sampledf = sampledf[sampledf.ACTUAL_ELAPSED_TIME >= 0]
We can now calculate the new columns we need for the regression. Based on a prior data analysis we found that the period of the day is relevant for the delay and flight time of a flight.
So, we first calculate the flight periods columns, where Morning is from 6 to 12 , Afternoon is from 12 to 19, Night is from 19 to 24, and Dawn is from 24 to 6
In [10]:
sampledf["Morning"] = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 600 and int(row.DEP_TIME) < 1200) else 0, axis=1)
sampledf["Afternoon"] = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 1200 and int(row.DEP_TIME) < 1900) else 0, axis=1)
sampledf["Night"] = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 1900 and int(row.DEP_TIME) < 2400) else 0, axis=1)
sampledf["Dawn"] = sampledf.apply(lambda row: 1 if(not row.CANCELLED and int(row.DEP_TIME) >= 2400 and int(row.DEP_TIME) < 600) else 0, axis=1)
We also calculate the delay column, which is one if a flight was cancelled, diverted, or has delayed over 30 min
In [11]:
sampledf["Delayed"] = sampledf.apply(lambda row: 1 if(row.CANCELLED or row.DIVERTED or row.ARR_DELAY > 30) else 0 , axis=1)
sampledf.columns
Out[11]:
In [12]:
originDummy = pd.get_dummies(sampledf["ORIGIN"], prefix="ORG", drop_first=True)
destDummy = pd.get_dummies(sampledf["DEST"], prefix="DST", drop_first=True)
airlineDummy = pd.get_dummies(sampledf["AIRLINE_ID"], prefix="AIRLN", drop_first=True)
Using the dummy variables, we create a table for the regression by concatenating all of the dummy columns and the dependant variable
In [13]:
dummyDf = pd.DataFrame()
dummyDf = pd.concat([originDummy,destDummy,airlineDummy,sampledf['Morning'], sampledf['Afternoon'], sampledf['Night'],sampledf['Delayed'],sampledf['ACTUAL_ELAPSED_TIME']], axis=1)
dummyDf.head()
Out[13]:
In [14]:
trainingDF, testDF = train_test_split(dummyDf, test_size = 0.2)
And we make sure all variables are an integer for the regression
In [15]:
trainingDF = trainingDF.applymap(np.int)
testDF = testDF.applymap(np.int)
In [16]:
trainingDFDelayed = trainingDF[trainingDF.Delayed == 1].head(500)
trainingDFNotDelayed = trainingDF[trainingDF.Delayed == 0].head(500)
allTraining = [trainingDFNotDelayed,trainingDFDelayed]
trainingDF = pd.concat(allTraining)
print("Training Size:" + str(len(trainingDF.axes[0])))
And for the testing data set
In [17]:
testDFDelayed = testDF[testDF.Delayed == 1].head(100)
testDFNotDelayed = testDF[testDF.Delayed == 0].head(100)
allTest = [testDFDelayed,testDFNotDelayed]
testDF = pd.concat(allTest)
print("Testing Size:" + str(len(testDF.axes[0])))
In [18]:
XValues = sm.add_constant(trainingDF[trainingDF.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])], prepend=False)
resultDelayed = sm.OLS(trainingDF['Delayed'], XValues).fit()
print(resultDelayed.summary())
In [19]:
XValues = sm.add_constant(trainingDF[trainingDF.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])])
resultTime = sm.OLS(trainingDF['ACTUAL_ELAPSED_TIME'], XValues ).fit()
print(resultTime.summary())
In [24]:
#Copy of the testing data set
validateDataDelay = testDF.copy()
#Get a subset of the data without the validation data
subsetPredictDelay = validateDataDelay[validateDataDelay.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])]
#Predict the outcome with the regression and put the result in a new column
subsetPredictDelay['Calculated_Delay'] = subsetPredictDelay.apply(lambda row: (row * resultDelayed.params).sum(),axis=1)
#Add the real outcome in a new column
subsetPredictDelay["Real_Delayed"] = testDF["Delayed"]
subsetPredictDelay.head()[["Real_Delayed","Calculated_Delay"]]
Out[24]:
In [23]:
#Copy of the testing data set
validateDataTime = testDF.copy()
subsetPredictTime = validateDataTime[validateDataTime.columns.difference(['Delayed','ACTUAL_ELAPSED_TIME'])]
subsetPredictTime["const"] = 1
subsetPredictTime['Calculated'] = subsetPredictTime.apply(lambda row: (row * resultTime.params).sum(),axis=1)
subsetPredictTime["ACTUAL_ELAPSED_TIME"] = validateDataTime["ACTUAL_ELAPSED_TIME"]
subsetPredictTime["Difference"] = subsetPredictTime.apply(lambda row: abs(row.ACTUAL_ELAPSED_TIME - row.Calculated), axis=1)
subsetPredictTime.head()[["ACTUAL_ELAPSED_TIME","Calculated","Difference"]]
Out[23]:
In [30]:
#Create dataframe with the difference ranges
roicTime = pd.DataFrame({"Values":range(int(subsetPredictTime["Difference"].min()),int(subsetPredictTime["Difference"].max()),10)})
roicTime["Percentage"] = roicTime.apply(lambda row: len(subsetPredictTime[subsetPredictTime.Difference < row.Values]["Difference"]) / len(subsetPredictTime["Difference"]) * 100, axis=1 )
plt.plot(roicTime.Values,roicTime.Percentage)
plt.xlabel("Time Difference between actual and predicted (min)")
plt.ylabel("% of Data covered")
plt.title('Predict Time Validation Graph')
plt.show()
We calculated the Truth Table
In [33]:
roicDelay = pd.DataFrame({"Values": np.arange(subsetPredictDelay["Calculated_Delay"].min(),subsetPredictDelay["Calculated_Delay"].max(),0.1)})
#True Positive
roicDelay["T_P"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay > row.Values) & (subsetPredictDelay.Real_Delayed == 1)]),axis=1)
#False Positive
roicDelay["F_P"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay > row.Values) & (subsetPredictDelay.Real_Delayed == 0)]),axis=1)
#True Negative
roicDelay["T_N"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay < row.Values) & (subsetPredictDelay.Real_Delayed == 0)]),axis=1)
#False Negative
roicDelay["F_N"] = roicDelay.apply(lambda row:len(subsetPredictDelay[(subsetPredictDelay.Calculated_Delay < row.Values) & (subsetPredictDelay.Real_Delayed == 1)]),axis=1)
roicDelay
Out[33]:
Finally, we graph the ROC line
In [35]:
#False Posive Ration
roicDelay["F_P_R"] = roicDelay.apply(lambda row: row["F_P"]/(row["F_P"] + row["T_N"]),axis=1)
#Recall Ration
roicDelay["Recall"] = roicDelay.apply(lambda row: row["T_P"]/(row["T_P"] + row["F_N"]),axis=1)
plt.plot(roicDelay["F_P_R"],roicDelay["Recall"] )
plt.xlabel("False Positive Relation")
plt.ylabel("Recall")
plt.title('ROC Chart')
plt.show()
In [ ]: