The code in this notebook is licensed under Apache 2.0.
This notebook is licensed under a Creative Commons Attribution 4.0 International License.
Goal: in this notebook we will learn how to use non-linear regression in GraphLab to build complex and accurate data models. We will cover factorization machines and matrix factorization with side features.
The airline on-time performance dataset has information about flight arrival and departure times for US flights spanning 1987-2008. Each year's data is recorded in a single CSV file with the following columns:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,
UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,
ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,
Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
The fields are rather self-explanatory. Each line represents a single flight and provides information about the date, carrier, airport, arrival and departure times, delays, cancellation status, etc. The most interesting fields are those providing information about flight duration.
As usual, we start by importing the graphlab module.
In [1]:
import graphlab
Now we load the first one million records of flight data from the year 2008.
In [2]:
# The airline on-time dataset is available from: http://stat-computing.org/dataexpo/2009/the-data.html
data_url = "http://stat-computing.org/dataexpo/2009/2008.csv.bz2"
data = graphlab.SFrame.read_csv('~/data/old/airline/2008.csv',
                                column_type_hints={"ActualElapsedTime": float, "Distance": float},
                                na_values=["NA"], nrows=1000000)
data = data.dropna(['ActualElapsedTime','CarrierDelay'])
In [3]:
data.head()
Out[3]:
To better understand the quantity we want to predict (actual flight time), let's plot it:
In [4]:
graphlab.canvas.set_target('ipynb')
data.show()
Next, we split the data into training and test subsets. The accuracy of the model is evaluated on the test subset.
In [5]:
# split the data randomly, keeping 80% for training and the rest for validation
(train, test) = data.random_split(0.8)
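Conceptually, `random_split(0.8)` assigns each row to the training set independently with probability 0.8, so the resulting sizes are only approximately 80/20. A plain-Python sketch of that behavior (illustrative only, not GraphLab's implementation):

```python
import random

def random_split(rows, fraction=0.8, seed=42):
    # Each row independently lands in `train` with probability `fraction`,
    # so the split sizes are approximate rather than exact.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < fraction else test).append(row)
    return train, test

train_rows, test_rows = random_split(list(range(10000)))
```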
We start with a simple yet powerful linear regression model to try to predict the actual flight times.
In [6]:
model = graphlab.linear_regression.create(train,
                                          target="ActualElapsedTime",
                                          validation_set=test)
In [7]:
print model.get('coefficients').topk('value')
In [8]:
print model.get('coefficients').topk('value',reverse=True)
In [9]:
airports = graphlab.SFrame.read_csv('http://stat-computing.org/dataexpo/2009/airports.csv')
In [10]:
airports.show()
In [11]:
# rename the IATA code column so it matches the coefficient category values
airports.rename({'iata':'Dest'})
result = model.get('coefficients').topk('value')
result = result[result['name'] == 'Dest']
result = result.join(airports, on={'index':'Dest'}).topk('value')
print result
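The join above matches each coefficient's category value (the `index` column) against the airport table's key column. A hypothetical plain-Python equivalent, with made-up airport names and coefficient values purely for illustration:

```python
# Toy stand-ins for the coefficients SFrame and the airports table.
coefficients = [
    {"name": "Dest", "index": "JFK", "value": 31.2},
    {"name": "Dest", "index": "SFO", "value": 28.7},
]
airports = {"JFK": "John F Kennedy Intl", "SFO": "San Francisco Intl"}

# Inner join: keep only coefficients whose category value has a match,
# and attach the airport metadata to each matching row.
joined = [dict(c, airport=airports[c["index"]])
          for c in coefficients if c["index"] in airports]
```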
Our task is to predict the actual flight time, which is affected by the airport load, weather, plane type, carrier and many other parameters. We can cast this problem as predicting a real-valued variable (flight time) for a pair of entities (source and destination airports). This can be solved easily using certain models in the recommender toolkit. First, let us try regular matrix factorization.
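Under the hood, a matrix factorization model scores each (user, item) pair as a global mean plus per-entity biases plus a dot product of latent factors. A toy numpy sketch of that scoring rule (all sizes and numbers here are made up; this is the general technique, not GraphLab's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n_flights, n_airports, k = 5, 4, 3        # "users", "items", latent dimension

mu = 120.0                                 # global mean flight time (minutes)
b_flight = rng.normal(0, 5, n_flights)     # per-flight-number bias
b_dest = rng.normal(0, 20, n_airports)     # per-destination bias
U = rng.normal(0, 0.1, (n_flights, k))     # latent flight factors
V = rng.normal(0, 0.1, (n_airports, k))    # latent airport factors

def predict(flight, dest):
    # score = mean + biases + interaction between the latent factors
    return mu + b_flight[flight] + b_dest[dest] + U[flight] @ V[dest]
```

During training, the biases and factor matrices are fit by minimizing squared error between these scores and the observed targets.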
In [12]:
# Train a matrix factorization model with default parameters
model = graphlab.recommender.factorization_recommender.create(train,
                                                              user_id="FlightNum",
                                                              item_id="Dest",
                                                              target="ActualElapsedTime",
                                                              side_data_factorization=False)
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))
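For reference, the root-mean-squared error that `graphlab.evaluation.rmse` reports is a simple quantity; a self-contained sketch:

```python
import math

def rmse(targets, predictions):
    # Square each error, average, and take the square root.
    errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(errors) / len(errors))
```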
In [13]:
# drop columns that leak the target: air time and the delay fields
# largely determine the actual elapsed time
train.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])
test.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])
Out[13]:
In [14]:
# Train a boosted trees regression model with default parameters
model = graphlab.boosted_trees_regression.create(train,
                                                 target="ActualElapsedTime",
                                                 max_iterations=50)
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))
The boosted decision trees model is our winner!
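As a reminder of why boosting works so well here: each stage fits a small tree to the residuals of the current ensemble and adds a shrunken copy of its predictions. A toy numpy sketch using single-split decision stumps on synthetic data (illustrative of the general technique only, not GraphLab's implementation; the learning rate 0.3 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + rng.normal(0, 0.5, 200)      # synthetic target

def fit_stump(x, r):
    # Find the single threshold split minimizing squared error on residuals r.
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

# Start from the mean prediction, then repeatedly fit stumps to residuals.
pred = np.full_like(y, y.mean())
for _ in range(50):                         # max_iterations=50, as in the cell above
    stump = fit_stump(x, y - pred)
    pred += 0.3 * stump(x)                  # shrinkage (learning rate)
```

Each iteration reduces the training residuals, which is why the ensemble's training RMSE keeps dropping across the 50 stages.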
In [15]:
print model.get_feature_importance()
In [16]:
model = graphlab.linear_regression.create(train, target="ActualElapsedTime")
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))