Predicting Airline On-Time Performance with Regression Models

The code in this notebook is licensed under Apache 2.0.
This notebook is licensed under a Creative Commons Attribution 4.0 International License.

Goal: in this notebook we will learn how to use non-linear regression in GraphLab Create to build more complex and accurate models. We will cover matrix factorization and boosted decision trees, and compare them against a linear regression baseline.

The airline on-time performance dataset contains flight arrival and departure records for US flights from 1987 through 2008. Each year's data is stored in a single CSV file with the following columns:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,
UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,
ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,
Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

The fields are rather self-explanatory. Each line represents a single flight and provides information about the date, carrier, airports, arrival and departure times, delays, cancellation status, and so on. The fields of most interest here are the ones describing flight duration.

As usual, we start by importing the graphlab module.


In [1]:
import graphlab

Now we load the first one million records of flight data from the year 2008, and drop rows with missing values in the columns we will need (ActualElapsedTime and CarrierDelay).


In [2]:
# The airline on-time dataset is available from: http://stat-computing.org/dataexpo/2009/the-data.html
data_url = "http://stat-computing.org/dataexpo/2009/2008.csv.bz2"

# read a local, decompressed copy of the 2008 file
data = graphlab.SFrame.read_csv('~/data/old/airline/2008.csv', 
                                 column_type_hints={"ActualElapsedTime":float,"Distance":float}, 
                                 na_values=["NA"], nrows=1000000)

# drop rows with missing values in the columns we will model and inspect
data = data.dropna(['ActualElapsedTime','CarrierDelay'])


[INFO] GraphLab Server Version: 1.8.1
PROGRESS: Finished parsing file /Users/bianca/data/old/airline/2008.csv
PROGRESS: Parsing completed. Parsed 100 lines in 2.16359 secs.
PROGRESS: Read 535634 lines. Lines per second: 221369
PROGRESS: Finished parsing file /Users/bianca/data/old/airline/2008.csv
PROGRESS: Parsing completed. Parsed 1000000 lines in 3.27019 secs.
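The cell above reads a local, decompressed copy of the 2008 file rather than downloading it on every run. If you do not have the file yet, a minimal sketch along the following lines (Python 2 standard library only; the local path simply mirrors the one used above) should fetch and decompress it first:

import bz2, os, urllib

local_bz2 = os.path.expanduser('~/data/old/airline/2008.csv.bz2')  # same location as the read_csv call above
local_csv = local_bz2[:-4]                                         # strip the .bz2 extension

if not os.path.isdir(os.path.dirname(local_bz2)):
    os.makedirs(os.path.dirname(local_bz2))
if not os.path.exists(local_csv):
    urllib.urlretrieve(data_url, local_bz2)       # download the compressed file
    with open(local_csv, 'wb') as out:
        out.write(bz2.BZ2File(local_bz2).read())  # decompress to a plain csv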

In [3]:
data.head()


Out[3]:
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum
2008 1 3 4 1829 1755 1959 1925 WN 3920
2008 1 3 4 1937 1830 2037 1940 WN 509
2008 1 3 4 1644 1510 1845 1725 WN 1333
2008 1 3 4 1452 1425 1640 1625 WN 675
2008 1 3 4 1323 1255 1526 1510 WN 4
2008 1 3 4 1416 1325 1512 1435 WN 54
2008 1 3 4 1657 1625 1754 1735 WN 623
2008 1 3 4 1422 1255 1657 1610 WN 188
2008 1 3 4 2107 1945 2334 2230 WN 362
2008 1 3 4 1812 1650 1927 1815 WN 422
TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
N464WN 90.0 90 77 34 34 IND BWI 515.0 3 10
N763SW 240.0 250 230 57 67 IND LAS 1591.0 3 7
N334SW 121.0 135 107 80 94 IND MCO 828.0 6 8
N286WN 228.0 240 213 15 27 IND PHX 1489.0 7 8
N674AA 123.0 135 110 16 28 IND TPA 838.0 4 9
N643SW 56.0 70 49 37 51 ISP BWI 220.0 2 5
N724SW 57.0 70 47 19 32 ISP BWI 220.0 5 5
N215WN 155.0 195 143 47 87 ISP FLL 1093.0 6 6
N798SW 147.0 165 134 64 82 ISP MCO 972.0 6 7
N779SW 135.0 145 118 72 82 ISP MDW 765.0 6 11
Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
0 0 2 0 0 0 32
0 0 10 0 0 0 47
0 0 8 0 0 0 72
0 0 3 0 0 0 12
0 0 0 0 0 0 16
0 0 12 0 0 0 25
0 0 7 0 0 0 12
0 0 40 0 0 0 7
0 0 5 0 0 0 59
0 0 3 0 0 0 69
[10 rows x 29 columns]
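As a quick sanity check, we can compare actual and scheduled block times with SArray arithmetic and sketch_summary. We deliberately do not store the result as a new column, so it cannot leak into the models trained below:

# minutes flown beyond (or under) the scheduled block time
overrun = data['ActualElapsedTime'] - data['CRSElapsedTime']
print overrun.sketch_summary()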

To better understand the quantity we want to predict (actual flight time), let's plot it:


In [4]:
graphlab.canvas.set_target('ipynb')
data.show()


Next, we split the data into training and test subsets. Model accuracy is evaluated on the test subset.


In [5]:
# split the data randomly, keeping 80% for training and the rest for testing
(train, test) = data.random_split(0.8)
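If you want the split (and hence the numbers below) to be reproducible across runs, random_split also accepts an optional seed:

# reproducible 80/20 split; the seed value itself is arbitrary
(train, test) = data.random_split(0.8, seed=1)
print len(train), len(test)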

Baseline approach: linear regression

We start with a simple yet powerful method, linear regression, to try to predict the actual flight times.


In [6]:
model = graphlab.linear_regression.create(train, 
                                          target="ActualElapsedTime", 
                                          validation_set=test)


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'Year', 'Cancelled', 'CancellationCode', 'Diverted' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 203646
PROGRESS: Number of features          : 28
PROGRESS: Number of unpacked features : 28
PROGRESS: Number of coefficients    : 5458
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000  | 2.453843     | 393.161187         | 388.410692           | 56.703708     | 56.556151       |
PROGRESS: | 2         | 9        | 5.000000  | 3.379473     | 289.691850         | 260.022925           | 35.150928     | 35.619872       |
PROGRESS: | 3         | 10       | 5.000000  | 3.818391     | 924.859171         | 824.920817           | 114.057888    | 113.701624      |
PROGRESS: | 4         | 12       | 1.000000  | 4.511037     | 216.301866         | 162.580190           | 24.722532     | 25.569694       |
PROGRESS: | 5         | 13       | 1.000000  | 4.958254     | 213.491934         | 156.350437           | 23.485937     | 24.336778       |
PROGRESS: | 6         | 14       | 1.000000  | 5.396186     | 173.306373         | 164.638066           | 15.232857     | 15.855991       |
PROGRESS: | 10        | 18       | 1.000000  | 7.196265     | 130.559602         | 115.701722           | 5.933299      | 6.116535        |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: TERMINATED: Iteration limit reached.
PROGRESS: This model may not be optimal. To improve it, consider increasing `max_iterations`.

In [7]:
print model.get('coefficients').topk('value')


+---------+--------+---------------+--------+
|   name  | index  |     value     | stderr |
+---------+--------+---------------+--------+
| TailNum | N193DN | 37.6967861682 |  None  |
|   Dest  |  ITO   | 33.5703602771 |  None  |
|   Dest  |  LIH   | 32.9720707672 |  None  |
|   Dest  |  OGG   | 30.1021630976 |  None  |
|   Dest  |  KOA   | 30.0556618572 |  None  |
|   Dest  |  HNL   | 29.0019005085 |  None  |
|   Dest  |  BRW   | 26.0296782114 |  None  |
|   Dest  |  YAK   | 23.4679842016 |  None  |
|   Dest  |  SGU   | 22.5361145994 |  None  |
| TailNum | N174DZ |  21.989889042 |  None  |
+---------+--------+---------------+--------+
[10 rows x 4 columns]


In [8]:
print model.get('coefficients').topk('value',reverse=True)


+---------+--------+----------------+--------+
|   name  | index  |     value      | stderr |
+---------+--------+----------------+--------+
| TailNum | N807NW | -42.2249278012 |  None  |
| TailNum | N67052 | -36.1001011292 |  None  |
| TailNum | N654BR | -34.2879774589 |  None  |
| TailNum | N655BR | -32.6992229896 |  None  |
| TailNum | N475HA | -31.9978931065 |  None  |
|  Origin |  ADK   | -31.9952694498 |  None  |
| TailNum | N693BR | -31.5397148221 |  None  |
| TailNum | N651BR | -31.5338393213 |  None  |
| TailNum | N66051 | -31.511416884  |  None  |
| TailNum | N76064 | -30.0822543526 |  None  |
+---------+--------+----------------+--------+
[10 rows x 4 columns]

Which destination airports add the most flight time according to our linear model? Let's attach airport metadata to the largest Dest coefficients.


In [9]:
airports = graphlab.SFrame.read_csv('http://stat-computing.org/dataexpo/2009/airports.csv')


PROGRESS: Downloading http://stat-computing.org/dataexpo/2009/airports.csv to /var/tmp/graphlab-Bianca/3027/ca86f580-111e-4b08-8b7a-5267217970e6.csv
PROGRESS: Finished parsing file http://stat-computing.org/dataexpo/2009/airports.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.031848 secs.
PROGRESS: Finished parsing file http://stat-computing.org/dataexpo/2009/airports.csv
PROGRESS: Parsing completed. Parsed 3376 lines in 0.017611 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,str,str,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [10]:
airports.show()



In [11]:
# rename the airport-code column so we can join on it
airports.rename({'iata':'Dest'})

# keep only the destination-airport coefficients among the top values,
# then attach the airport metadata
result = model.get('coefficients').topk('value')
result = result[result['name'] == 'Dest']
result = result.join(airports,on={'index':'Dest'}).topk('value')
print result


+------+-------+---------------+--------+-------------------------------+
| name | index |     value     | stderr |            airport            |
+------+-------+---------------+--------+-------------------------------+
| Dest |  ITO  | 33.5703602771 |  None  |       Hilo International      |
| Dest |  LIH  | 32.9720707672 |  None  |             Lihue             |
| Dest |  OGG  | 30.1021630976 |  None  |            Kahului            |
| Dest |  KOA  | 30.0556618572 |  None  | Kona International At Keahole |
| Dest |  HNL  | 29.0019005085 |  None  |     Honolulu International    |
| Dest |  BRW  | 26.0296782114 |  None  | Wiley Post Will Rogers Mem... |
| Dest |  YAK  | 23.4679842016 |  None  |            Yakutat            |
| Dest |  SGU  | 22.5361145994 |  None  |         St George Muni        |
+------+-------+---------------+--------+-------------------------------+
+-------------+-------+---------+-------------+--------------+
|     city    | state | country |     lat     |     long     |
+-------------+-------+---------+-------------+--------------+
|     Hilo    |   HI  |   USA   | 19.72026306 | -155.0484703 |
|    Lihue    |   HI  |   USA   | 21.97598306 | -159.3389581 |
|   Kahului   |   HI  |   USA   | 20.89864972 | -156.4304578 |
| Kailua/Kona |   HI  |   USA   | 19.73876583 | -156.0456314 |
|   Honolulu  |   HI  |   USA   | 21.31869111 | -157.9224072 |
|    Barrow   |   AK  |   USA   |  71.2854475 | -156.7660019 |
|   Yakutat   |   AK  |   USA   | 59.50336056 | -139.6602261 |
|  St George  |   UT  |   USA   | 37.09058333 | -113.5930556 |
+-------------+-------+---------+-------------+--------------+
[8 rows x 10 columns]

Non-linear regression: Traditional Matrix Factorization

Our task is to predict the actual flight time, which is affected by airport load, weather, plane type, carrier, and many other parameters. We can cast this problem as predicting a real-valued variable (flight time) for a pair of entities (here, the flight number and the destination airport). This can be solved easily with models from the recommender toolkit. First, let us try regular matrix factorization.
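Under the hood, a model like this predicts the target for a (FlightNum, Dest) pair roughly as a global bias, plus a per-flight bias, plus a per-destination bias, plus a dot product of learned latent factors (the actual GraphLab objective also includes terms for the other side columns). A toy numpy illustration of that prediction form, with made-up numbers:

import numpy as np

num_factors = 8                      # matches num_factors in the training log below
global_bias = 120.0                  # all values here are made up, for illustration only
flight_bias = {3920: -5.0}           # per-FlightNum bias
dest_bias = {'BWI': -20.0}           # per-Dest bias
flight_factor = {3920: 0.1 * np.random.randn(num_factors)}
dest_factor = {'BWI': 0.1 * np.random.randn(num_factors)}

def predict_elapsed_time(flight, dest):
    # prediction = global bias + flight bias + destination bias + <flight factors, destination factors>
    return (global_bias + flight_bias[flight] + dest_bias[dest]
            + flight_factor[flight].dot(dest_factor[dest]))

print predict_elapsed_time(3920, 'BWI')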


In [12]:
# Train a matrix factorization model; side_data_factorization=False keeps the
# remaining columns as linear side features rather than factorizing them
model = graphlab.recommender.factorization_recommender.create(train, 
                                    user_id="FlightNum", 
                                    item_id="Dest", 
                                    target="ActualElapsedTime", 
                                    side_data_factorization=False)

# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))


PROGRESS: Recsys training: model = factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 203646 observations with 6953 users and 285 items.
PROGRESS:     Data prepared in: 1.47788s
PROGRESS: Training factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 8        |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-08    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 50       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 25455 / 203646 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 1.78571           | 544.447                                  |
PROGRESS: | 1       | 0.892857          | 828.68                                   |
PROGRESS: | 2       | 0.446429          | 601.085                                  |
PROGRESS: | 3       | 0.223214          | 367.501                                  |
PROGRESS: | 4       | 0.111607          | 90.3471                                  |
PROGRESS: | 5       | 0.0558036         | 3.16539                                  |
PROGRESS: | 6       | 0.0279018         | 0.737575                                 |
PROGRESS: | 7       | 0.0139509         | 0.0883986                                |
PROGRESS: | 8       | 0.00697545        | 0.0751661                                |
PROGRESS: | 9       | 0.00348772        | 0.0226821                                |
PROGRESS: | 10      | 0.00174386        | 0.0410873                                |
PROGRESS: | 11      | 0.000871931       | 0.13238                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.00348772        | 0.0226821                                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 126us        | 5079.06           | 71.2675               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 181.283ms    | 1.48451           | 1.21839               | 0.00348772  |
PROGRESS: | 2       | 345.257ms    | 0.033811          | 0.18385               | 0.00207381  |
PROGRESS: | 3       | 502.231ms    | 0.00569068        | 0.0754149             | 0.00153003  |
PROGRESS: | 4       | 661.91ms     | 0.00403702        | 0.0635295             | 0.0012331   |
PROGRESS: | 5       | 817.785ms    | 0.00383535        | 0.0619276             | 0.00104307  |
PROGRESS: | 6       | 984.065ms    | 0.00287288        | 0.0535962             | 0.000909764 |
PROGRESS: | 10      | 1.54s        | 0.00166004        | 0.0407396             | 0.000620215 |
PROGRESS: | 11      | 1.70s        | 0.00153618        | 0.03919               | 0.000577428 |
PROGRESS: | 20      | 3.07s        | 0.000896867       | 0.0299423             | 0.000368782 |
PROGRESS: | 30      | 4.58s        | 0.000631951       | 0.0251322             | 0.000272083 |
PROGRESS: | 40      | 6.07s        | 0.000493172       | 0.0222001             | 0.000219279 |
PROGRESS: | 50      | 7.55s        | 0.000402397       | 0.0200517             | 0.000185487 |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 0.000409689
PROGRESS:        Final training RMSE: 0.0202327
Training RMSE 0.020232699617
Validation RMSE 0.0216987913757

The tiny RMSE above is too good to be true: columns such as AirTime, ArrDelay, DepDelay, and ArrTime essentially give away the answer, since they are only known once the flight has landed. Let's make the problem more realistic by removing them.


In [13]:
# drop columns that are only known after the flight has landed
train.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])
test.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])


Out[13]:
Year Month DayofMonth DayOfWeek DepTime CRSDepTime CRSArrTime UniqueCarrier FlightNum TailNum
2008 1 3 4 1829 1755 1925 WN 3920 N464WN
2008 1 3 4 1657 1625 1735 WN 623 N724SW
2008 1 3 4 1812 1650 1815 WN 422 N779SW
2008 1 3 4 948 925 940 WN 3430 N487WN
2008 1 3 4 1813 1735 1905 WN 54 N643SW
2008 1 3 4 1734 1650 1905 WN 23 N521SW
2008 1 3 4 1327 1230 1500 WN 1171 N682SW
2008 1 3 4 1824 1715 25 WN 2383 N290WN
2008 1 3 4 1818 1740 1840 WN 391 N608SW
2008 1 3 4 1726 1630 1740 WN 2284 N409WN
ActualElapsedTime CRSElapsedTime Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
90.0 90 IND BWI 515.0 3 10 0 0
57.0 70 ISP BWI 220.0 5 5 0 0
135.0 145 ISP MDW 765.0 6 11 0 0
71.0 75 JAX BHM 365.0 3 9 0 0
143.0 150 JAX HOU 816.0 6 12 0 0
127.0 135 JAX PHL 742.0 4 10 0 0
83.0 90 LAS ABQ 487.0 3 15 0 0
233.0 250 LAS BUF 1987.0 2 10 0 0
58.0 60 LAS BUR 223.0 2 10 0 0
66.0 70 LAS BUR 223.0 2 18 0 0
CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
2 0 0 0 32
7 0 0 0 12
3 0 0 0 69
0 0 0 0 19
11 0 0 0 20
3 0 0 0 33
50 0 0 0 0
48 0 0 0 4
20 0 0 0 16
1 0 0 0 51
[50950 rows x 25 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Non-linear regression: Boosted decision trees
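Boosted decision trees fit an additive ensemble: start from a constant prediction, then repeatedly add a small tree trained on the residuals of the ensemble built so far. The toy numpy sketch below (an illustration of the principle on synthetic 1-D data, not GraphLab's implementation) shows the idea with depth-1 trees:

# Illustration only: gradient boosting for squared loss builds an ensemble
# F_M(x) = F_0 + lr * sum_m tree_m(x), where each tree_m is fit to the residuals
# y - F_{m-1}(x). Here the trees are depth-1 stumps.
import numpy as np

def fit_stump(x, residual):
    # depth-1 regression tree: choose the split threshold minimizing squared error
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left prediction, right prediction)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# toy data: "flight time" as a noisy function of "distance"
rng = np.random.RandomState(0)
distance = rng.uniform(100.0, 2500.0, 500)
flight_time = 40.0 + 0.12 * distance + rng.normal(0.0, 10.0, 500)

learning_rate, n_trees = 0.3, 50
prediction = np.full(len(flight_time), flight_time.mean())  # F_0: predict the mean
stumps = []
for _ in range(n_trees):
    stump = fit_stump(distance, flight_time - prediction)    # fit the current residuals
    prediction += learning_rate * predict_stump(stump, distance)
    stumps.append(stump)

print 'toy training RMSE:', np.sqrt(((flight_time - prediction) ** 2).mean())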


In [14]:
# Train a boosted trees regression model with 50 iterations
model = graphlab.boosted_trees_regression.create(train, 
                                    target="ActualElapsedTime", max_iterations=50)

# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'Year', 'Cancelled', 'CancellationCode', 'Diverted' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Boosted trees regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 193634
PROGRESS: Number of features          : 24
PROGRESS: Number of unpacked features : 24
PROGRESS: +-----------+--------------+--------------------+---------------+----------------------+-----------------+
PROGRESS: | Iteration | Elapsed Time | Training-max_error | Training-rmse | Validation-max_error | Validation-rmse |
PROGRESS: +-----------+--------------+--------------------+---------------+----------------------+-----------------+
PROGRESS: | 1         | 0.624095     | 554.234            | 107.327       | 490.422              | 106.749         |
PROGRESS: | 2         | 0.835997     | 461.792            | 75.676        | 371.838              | 75.2357         |
PROGRESS: | 3         | 1.049662     | 399.715            | 53.6216       | 276.378              | 53.2747         |
PROGRESS: | 4         | 1.285606     | 333.673            | 38.3555       | 210.55               | 38.0989         |
PROGRESS: | 5         | 1.499359     | 296.046            | 27.8571       | 194.596              | 27.6415         |
PROGRESS: | 6         | 1.719405     | 289.982            | 20.7478       | 190.531              | 20.5923         |
PROGRESS: | 7         | 1.999013     | 287.32             | 16.0542       | 164.739              | 15.9223         |
PROGRESS: | 8         | 2.242998     | 286.383            | 13.0491       | 153.303              | 12.9629         |
PROGRESS: | 9         | 2.487664     | 285.923            | 11.1904       | 150.472              | 11.1693         |
PROGRESS: | 10        | 2.723796     | 284.811            | 10.0695       | 144.281              | 10.0854         |
PROGRESS: | 11        | 2.986373     | 283.336            | 9.40234       | 121.837              | 9.4224          |
PROGRESS: | 12        | 3.227736     | 283.038            | 8.98456       | 121.134              | 9.03835         |
PROGRESS: | 13        | 3.459151     | 282.559            | 8.73174       | 108.752              | 8.8069          |
PROGRESS: | 14        | 3.699813     | 277.177            | 8.53473       | 108.091              | 8.63382         |
PROGRESS: | 15        | 3.954697     | 277.113            | 8.40511       | 94.1261              | 8.49299         |
PROGRESS: | 16        | 4.191191     | 276.865            | 8.30892       | 94.2897              | 8.41412         |
PROGRESS: | 17        | 4.407972     | 276.972            | 8.23065       | 82.7015              | 8.34421         |
PROGRESS: | 18        | 4.631350     | 234.209            | 8.14703       | 82.5593              | 8.29985         |
PROGRESS: | 19        | 4.899781     | 199.077            | 8.09556       | 82.5363              | 8.27187         |
PROGRESS: | 20        | 5.151936     | 198.978            | 8.04158       | 81.7958              | 8.22154         |
PROGRESS: | 21        | 5.407611     | 169.131            | 7.99042       | 81.71                | 8.19026         |
PROGRESS: | 22        | 5.668808     | 169.017            | 7.95953       | 81.5966              | 8.16869         |
PROGRESS: | 23        | 5.918081     | 169.084            | 7.92954       | 81.5368              | 8.14878         |
PROGRESS: | 24        | 6.172503     | 169.038            | 7.88275       | 81.4912              | 8.11621         |
PROGRESS: | 25        | 6.401942     | 168.087            | 7.84154       | 81.51                | 8.08311         |
PROGRESS: | 26        | 6.609282     | 148.649            | 7.81972       | 80.9082              | 8.0645          |
PROGRESS: | 27        | 6.834229     | 148.633            | 7.78963       | 80.708               | 8.0465          |
PROGRESS: | 28        | 7.075659     | 148.596            | 7.77313       | 80.6713              | 8.03898         |
PROGRESS: | 29        | 7.406432     | 148.552            | 7.75813       | 79.4058              | 8.02248         |
PROGRESS: | 30        | 7.680434     | 148.579            | 7.73475       | 79.6334              | 8.01193         |
PROGRESS: | 31        | 7.912229     | 148.567            | 7.72169       | 79.6215              | 8.00017         |
PROGRESS: | 32        | 8.135928     | 148.557            | 7.70421       | 79.6116              | 7.98127         |
PROGRESS: | 33        | 8.369133     | 148.628            | 7.6615        | 79.6821              | 7.94616         |
PROGRESS: | 34        | 8.612277     | 136.715            | 7.63193       | 79.6611              | 7.91618         |
PROGRESS: | 35        | 8.844580     | 136.722            | 7.61954       | 79.6675              | 7.91141         |
PROGRESS: | 36        | 9.059120     | 136.699            | 7.6097        | 79.6442              | 7.90113         |
PROGRESS: | 37        | 9.262782     | 136.702            | 7.60155       | 79.6477              | 7.89701         |
PROGRESS: | 38        | 9.528778     | 126.335            | 7.58038       | 79.6622              | 7.88983         |
PROGRESS: | 39        | 9.770967     | 115.44             | 7.56052       | 79.6309              | 7.87944         |
PROGRESS: | 40        | 9.998261     | 115.605            | 7.52071       | 79.7277              | 7.83938         |
PROGRESS: | 41        | 10.374973    | 115.591            | 7.50877       | 80.6031              | 7.83244         |
PROGRESS: | 42        | 10.678682    | 115.74             | 7.4908        | 80.7515              | 7.81998         |
PROGRESS: | 43        | 10.999821    | 115.753            | 7.46465       | 80.7866              | 7.80895         |
PROGRESS: | 44        | 11.283004    | 115.751            | 7.45752       | 80.784               | 7.80477         |
PROGRESS: | 45        | 11.592528    | 115.764            | 7.41987       | 81.591               | 7.77682         |
PROGRESS: | 46        | 11.956417    | 115.773            | 7.4055        | 81.5995              | 7.76626         |
PROGRESS: | 47        | 12.243552    | 115.761            | 7.38938       | 81.3684              | 7.76103         |
PROGRESS: | 48        | 12.572483    | 115.753            | 7.38289       | 81.3603              | 7.7571          |
PROGRESS: | 49        | 12.867454    | 115.756            | 7.37603       | 81.3638              | 7.75459         |
PROGRESS: | 50        | 13.170697    | 98.3929            | 7.3636        | 81.3538              | 7.74491         |
PROGRESS: +-----------+--------------+--------------------+---------------+----------------------+-----------------+
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Training RMSE 7.36360089824
Validation RMSE 7.81836738719

Boosted decision trees are our winner! Which features does the model rely on most?


In [15]:
print model.get_feature_importance()


+----------------+-------+-------+
|      name      | index | count |
+----------------+-------+-------+
|    NASDelay    |  None |  497  |
|    TaxiOut     |  None |  367  |
| CRSElapsedTime |  None |  260  |
|    DepTime     |  None |  161  |
|     TaxiIn     |  None |  157  |
|    Distance    |  None |  109  |
|   DayofMonth   |  None |  100  |
|   FlightNum    |  None |   83  |
| UniqueCarrier  |   OO  |   59  |
|  CarrierDelay  |  None |   56  |
+----------------+-------+-------+
[5457 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [16]:
# retrain the linear regression baseline on the reduced feature set, for a fair comparison
model = graphlab.linear_regression.create(train, target="ActualElapsedTime")

# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'Year', 'Cancelled', 'CancellationCode', 'Diverted' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 193519
PROGRESS: Number of features          : 24
PROGRESS: Number of unpacked features : 24
PROGRESS: Number of coefficients    : 5453
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 6        | 0.000000  | 1.486708     | 410.308023         | 380.099370           | 57.632566     | 56.546443       |
PROGRESS: | 2         | 9        | 5.000000  | 2.277242     | 298.692060         | 231.144657           | 38.610478     | 38.453125       |
PROGRESS: | 3         | 10       | 5.000000  | 2.626242     | 753.997504         | 645.189949           | 77.906918     | 76.348224       |
PROGRESS: | 4         | 12       | 1.000000  | 3.199097     | 336.616223         | 210.391238           | 28.677100     | 29.251636       |
PROGRESS: | 5         | 13       | 1.000000  | 3.550872     | 328.556886         | 186.743475           | 26.845798     | 27.446559       |
PROGRESS: | 6         | 14       | 1.000000  | 3.889082     | 285.204517         | 136.018611           | 24.612232     | 25.210014       |
PROGRESS: | 10        | 18       | 1.000000  | 5.414460     | 240.962526         | 95.907329            | 11.098285     | 11.503368       |
PROGRESS: +-----------+----------+-----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: TERMINATED: Iteration limit reached.
PROGRESS: This model may not be optimal. To improve it, consider increasing `max_iterations`.
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Training RMSE 11.0982849696
Validation RMSE 11.4111782317
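To keep the results in one place, we can tabulate the validation RMSEs printed above (the numbers are copied from this run; rerunning the notebook will give slightly different values):

# summary of the validation RMSEs reported earlier in this notebook
comparison = graphlab.SFrame({'model': ['matrix factorization (leaky features)',
                                        'boosted trees (reduced features)',
                                        'linear regression (reduced features)'],
                              'validation_rmse': [0.022, 7.82, 11.41]})
comparison.print_rows()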

Conclusion

We have explored several methods for predicting flight times. It is always worth trying different methods to see which one performs best on your data. GraphLab Create makes it easy to swap methods and find the right one for your needs.

Further Reading

Machine Learning Foundations course (includes regression): https://www.coursera.org/learn/ml-foundations