The code in this notebook is licensed under Apache 2.0.
This notebook is licensed under a Creative Commons Attribution 4.0 International License.
Goal: in this notebook we will learn how to use non-linear regression in GraphLab to build complex and accurate data models. We will cover factorization machines and matrix factorization with side features.
The airline on-time performance dataset has information about flight arrival and departure times for US flights spanning 1987-2008. Each year's data is recorded in a single CSV file with the following columns:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,
UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,
ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,
Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
The fields are rather self-explanatory. Each line represents a single flight and provides information about the date, carrier, airport, arrival and departure times, delays, cancellation status, etc. The most interesting fields are those providing information about flight duration.
As usual, we start by importing the graphlab module.
In [1]:
import graphlab
Now we load the first one million records of flight data from the year 2008.
In [2]:
# The airline on-time dataset is available from: http://stat-computing.org/dataexpo/2009/the-data.html
data_url = "http://stat-computing.org/dataexpo/2009/2008.csv.bz2"
data = graphlab.SFrame.read_csv('~/data/old/airline/2008.csv',
                                column_type_hints={"ActualElapsedTime": float, "Distance": float},
                                na_values=["NA"], nrows=1000000)
data = data.dropna(['ActualElapsedTime','CarrierDelay'])
In [3]:
data.head()
Out[3]:
To better understand the quantity we want to predict (actual flight time), let's plot it:
In [4]:
graphlab.canvas.set_target('ipynb')
data.show()
Next, we split the data into training and test subsets. The accuracy of the model is evaluated on the test subset.
In [5]:
# split the data randomly, keeping 80% for training and the rest for validation
(train, test) = data.random_split(0.8)
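Conceptually, `random_split(0.8)` assigns each row to the training set independently with probability 0.8, so the resulting sizes are only approximately 80/20. A plain-Python sketch of that behavior (illustrative only, not GraphLab's implementation):

```python
import random

def random_split(rows, fraction=0.8, seed=42):
    # Each row independently lands in `train` with probability `fraction`,
    # so the split sizes are approximate rather than exact.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < fraction else test).append(row)
    return train, test

train_rows, test_rows = random_split(list(range(10000)))
```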
We start with a simple yet powerful linear regression model to try to predict the actual flight times.
In [6]:
model = graphlab.linear_regression.create(train,
                                          target="ActualElapsedTime",
                                          validation_set=test)
In [7]:
print model.get('coefficients').topk('value')
In [8]:
print model.get('coefficients').topk('value',reverse=True)
In [9]:
airports = graphlab.SFrame.read_csv('http://stat-computing.org/dataexpo/2009/airports.csv')
In [10]:
airports.show()
In [11]:
# rename the IATA code column so it matches the coefficient category values
airports.rename({'iata':'Dest'})
result = model.get('coefficients').topk('value')
result = result[result['name'] == 'Dest']
result = result.join(airports, on={'index':'Dest'}).topk('value')
print result
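The join above matches each coefficient's category value (the `index` column) against the airport table's key column. A hypothetical plain-Python equivalent, with made-up airport names and coefficient values purely for illustration:

```python
# Toy stand-ins for the coefficients SFrame and the airports table.
coefficients = [
    {"name": "Dest", "index": "JFK", "value": 31.2},
    {"name": "Dest", "index": "SFO", "value": 28.7},
]
airports = {"JFK": "John F Kennedy Intl", "SFO": "San Francisco Intl"}

# Inner join: keep only coefficients whose category value has a match,
# and attach the airport metadata to each matching row.
joined = [dict(c, airport=airports[c["index"]])
          for c in coefficients if c["index"] in airports]
```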
Our task is to predict the actual flight time, which is affected by the airport load, weather, plane type, carrier and many other parameters. We can cast this problem as predicting a real-valued variable (flight time) for a pair of entities (source and destination airports). This can be solved easily using certain models in the recommender toolkit. First, let us try regular matrix factorization.
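Under the hood, a matrix factorization model scores each (user, item) pair as a global mean plus per-entity biases plus a dot product of latent factors. A toy numpy sketch of that scoring rule (all sizes and numbers here are made up; this is the general technique, not GraphLab's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n_flights, n_airports, k = 5, 4, 3        # "users", "items", latent dimension

mu = 120.0                                 # global mean flight time (minutes)
b_flight = rng.normal(0, 5, n_flights)     # per-flight-number bias
b_dest = rng.normal(0, 20, n_airports)     # per-destination bias
U = rng.normal(0, 0.1, (n_flights, k))     # latent flight factors
V = rng.normal(0, 0.1, (n_airports, k))    # latent airport factors

def predict(flight, dest):
    # score = mean + biases + interaction between the latent factors
    return mu + b_flight[flight] + b_dest[dest] + U[flight] @ V[dest]
```

During training, the biases and factor matrices are fit by minimizing squared error between these scores and the observed targets.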
In [12]:
# Train a matrix factorization model with default parameters
model = graphlab.recommender.factorization_recommender.create(train,
                                                              user_id="FlightNum",
                                                              item_id="Dest",
                                                              target="ActualElapsedTime",
                                                              side_data_factorization=False)
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))
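For reference, the root-mean-squared error that `graphlab.evaluation.rmse` reports is a simple quantity; a self-contained sketch:

```python
import math

def rmse(targets, predictions):
    # Square each error, average, and take the square root.
    errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(errors) / len(errors))
```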
In [13]:
# drop columns that leak the target: air time and the delay fields
# largely determine the actual elapsed time
train.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])
test.remove_columns(['AirTime','ArrDelay','DepDelay','ArrTime'])
Out[13]:
In [14]:
# Train a boosted trees regression model with default parameters
model = graphlab.boosted_trees_regression.create(train,
                                                 target="ActualElapsedTime",
                                                 max_iterations=50)
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))
The boosted decision trees model is our winner!
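As a reminder of why boosting works so well here: each stage fits a small tree to the residuals of the current ensemble and adds a shrunken copy of its predictions. A toy numpy sketch using single-split decision stumps on synthetic data (illustrative of the general technique only, not GraphLab's implementation; the learning rate 0.3 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + rng.normal(0, 0.5, 200)      # synthetic target

def fit_stump(x, r):
    # Find the single threshold split minimizing squared error on residuals r.
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

# Start from the mean prediction, then repeatedly fit stumps to residuals.
pred = np.full_like(y, y.mean())
for _ in range(50):                         # max_iterations=50, as in the cell above
    stump = fit_stump(x, y - pred)
    pred += 0.3 * stump(x)                  # shrinkage (learning rate)
```

Each iteration reduces the training residuals, which is why the ensemble's training RMSE keeps dropping across the 50 stages.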
In [15]:
print model.get_feature_importance()
In [16]:
model = graphlab.linear_regression.create(train, target="ActualElapsedTime")
# check out the results of training and validation
print 'Training RMSE', model.get('training_rmse')
print 'Validation RMSE', graphlab.evaluation.rmse(test['ActualElapsedTime'], model.predict(test))