In this problem, we will use supervised learning techniques to
predict departure delays at Chicago O'Hare International Airport (ORD).
For simplicity, we will use only six attributes:
Month, DayofMonth, DayOfWeek, CRSDepTime, CRSArrTime, and Distance.
Of the four algorithms introduced in
Lesson 1,
you are only required to apply two:
$k$-NN and Decision Trees.
But scikit-learn has a unified estimator API,
so it should be easy to test other algorithms on your own.
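To illustrate that unified API, here is a small sketch on toy data (not the airline data); the particular estimators and values are my own choices, not part of the assignment. Every classifier is trained and queried with the exact same fit/predict calls:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy, clearly separable data (hypothetical values).
X = np.array([[0.0], [0.1], [10.0], [10.1]])
y = np.array([0, 0, 1, 1])
X_new = np.array([[0.05], [10.05]])

predictions = {}
for clf in (RandomForestClassifier(random_state=0), LogisticRegression()):
    clf.fit(X, y)                                  # identical call for every estimator
    predictions[type(clf).__name__] = clf.predict(X_new)

print(predictions)  # both classifiers predict [0 1] on this separable data
```

Swapping in a different algorithm is a one-line change to the estimator constructor; everything downstream stays the same.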
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
We use the 2001 on-time airline performance data set. We import the following columns:
In [ ]:
df = pd.read_csv('/data/airline/2001.csv', encoding='latin-1', usecols=(1, 2, 3, 5, 7, 15, 16, 18))
We use only the flights that departed from ORD. We define a flight to be delayed if its departure delay is 15 minutes or more, the same definition used by the FAA (source: Wikipedia).
In [ ]:
ohare = df[df.Origin == 'ORD']
ohare = ohare.drop('Origin', axis=1) # we don't need the Origin column anymore.
ohare['Delayed'] = (ohare.DepDelay >= 15).astype(int) # 1 if a flight was delayed, 0 if not. (np.int was removed from NumPy; use the built-in int.)
ohare = ohare.drop('DepDelay', axis=1) # we don't need the DepDelay column.
ohare = ohare.dropna()
In [ ]:
print(ohare.head(5))
As explained in Lesson 1, we need to build NumPy arrays because scikit-learn does not work natively with Pandas DataFrames.
Write a function named df_to_array() that takes a DataFrame
and returns a tuple of two NumPy arrays.
The first array should have all rows and columns except the Delayed column.
The second array holds the labels that will be used as truth values, i.e. the Delayed column.
In [ ]:
def df_to_array(df):
'''
Takes a DataFrame and returns a tuple of NumPy arrays.
Parameters
----------
df: A DataFrame. Has a column named 'Delayed'.
Returns
-------
data: A NumPy array. To be used as attributes.
labels: A NumPy array. To be used as truth labels.
'''
#### your code goes here
return data, labels
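For reference, here is one possible way to complete df_to_array() — a sketch, not necessarily the reference solution — demonstrated on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

def df_to_array(df):
    # Labels: the 'Delayed' column as a 1-D NumPy array.
    labels = df['Delayed'].values
    # Data: all rows and every other column, in their original order.
    data = df.drop('Delayed', axis=1).values
    return data, labels

# Tiny hypothetical frame, just to show the shapes involved.
toy = pd.DataFrame({'Month': [1, 1], 'Distance': [599, 599], 'Delayed': [0, 1]})
data, labels = df_to_array(toy)
print(data.shape, labels.shape)  # (2, 2) (2,)
```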
Here are some sample outputs from my code:
data, labels = df_to_array(ohare)
print(data[:5])
[[ 1 1 1 951 1235 599]
[ 1 2 2 951 1235 599]
[ 1 3 3 951 1235 599]
[ 1 4 4 951 1235 599]
[ 1 5 5 951 1235 599]]
print(labels[:5])
[0 0 0 1 0]
print(data.shape)
(341284, 6)
print(labels.shape)
(341284,)
In [ ]:
data, labels = df_to_array(ohare)
First, we need to split our data into training and testing sets. Thus, write a
split() function that takes two NumPy arrays.
The first array is the attributes, and the second the labels.
It returns a tuple of four NumPy arrays:
the training set portion of the first input array,
the testing set portion of the first input array,
the training set portion of the second input array,
and the testing set portion of the second input array.
IMPORTANT:
You must use the random_state parameter in the
train_test_split()
function to ensure
repeatability.
Also, don't forget to use the optional parameter frac, which specifies the fraction of the data reserved for the test set.
In [ ]:
from sklearn.utils import check_random_state
random_seed = 490
random_state = check_random_state(random_seed)
In [ ]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
def split(data, labels, frac=0.4, random_state=random_state):
'''
Splits `data` and `labels` into training and testing sets.
Parameters
----------
data: A NumPy array. Attributes.
labels: A NumPy array. Truth labels.
frac: Optional. A float. The fraction of test set.
random_state: Random number generator.
Returns
-------
A tuple of four NumPy arrays:
Training set portion of 'data', test set portion of 'data',
training set portion of 'labels', and test set portion of 'labels'.
'''
#### your code goes here
return a_train, a_test, b_train, b_test
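One possible completion of split() — a sketch, assuming train_test_split from sklearn.model_selection (older scikit-learn versions kept it in sklearn.cross_validation) — shown with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state

def split(data, labels, frac=0.4, random_state=None):
    # frac is passed through as test_size; random_state makes the split repeatable.
    return train_test_split(data, labels, test_size=frac, random_state=random_state)

# Toy demonstration: 10 samples, 40% reserved for testing.
data = np.arange(20).reshape(10, 2)
labels = np.arange(10)
a_train, a_test, b_train, b_test = split(data, labels,
                                         random_state=check_random_state(490))
print(a_train.shape, a_test.shape)  # (6, 2) (4, 2)
```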
In [ ]:
data_train, data_test, labels_train, labels_test = split(data, labels)
Write a function named learn_knn() that takes three NumPy arrays
and an integer. The first array is the training set attributes,
the second array is the training set labels, and
the third array is the test set attributes.
It should return a NumPy array that has predicted labels
for each data point in the test set.
There are some
parameters
you can adjust, but you should use only the n_neighbors parameter.
In [ ]:
from sklearn import neighbors
def learn_knn(data_train, labels_train, data_test, n_neighbors):
'''
Takes three NumPy arrays and an integer.
Trains a kNN algorithm where k = n_neighbors
and returns the predicted labels for each row in 'data_test'.
Parameters
----------
data_train: A NumPy array. Training set attributes.
labels_train: A NumPy array. Training set labels.
data_test: A NumPy array. Test set attributes.
n_neighbors: The number of neighbors for kNN queries.
Returns
-------
A NumPy array that has the predictions for each row in 'data_test'.
'''
#### your code goes here
return result
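A minimal way to fill in learn_knn(), using sklearn.neighbors.KNeighborsClassifier, demonstrated on toy separable data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def learn_knn(data_train, labels_train, data_test, n_neighbors):
    # Fit a k-nearest-neighbor classifier and predict on the test attributes.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(data_train, labels_train)
    return knn.predict(data_test)

# Tiny sanity check on clearly separable toy data.
data_train = np.array([[0.0], [0.1], [10.0], [10.1]])
labels_train = np.array([0, 0, 1, 1])
pred = learn_knn(data_train, labels_train, np.array([[0.05], [10.05]]), 1)
print(pred)  # [0 1]
```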
In [ ]:
labels_pred = learn_knn(data_train, labels_train, data_test, 5)
There are various
performance metrics
that you can use to evaluate the performance of a classifier.
For example, the score() method of the $k$-nearest neighbor classifier
that was demonstrated in Lesson 1 computes the
accuracy score.
from sklearn.metrics import accuracy_score
print("The accuracy score for kNN is {0:.4f}.".format(accuracy_score(labels_test, labels_pred)))
The accuracy score for kNN is 0.7633.
In [ ]:
from sklearn.metrics import accuracy_score
print("The accuracy score for kNN is {0:.4f}.".format(accuracy_score(labels_test, labels_pred)))
There are pros and cons for each metric. Another popular metric is the F1 score.
from sklearn.metrics import f1_score
print("The F1 score for kNN is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
The F1 score for kNN is 0.1885.
In [ ]:
from sklearn.metrics import f1_score
print("The F1 score for kNN is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
In [ ]:
#### your code goes here
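This cell is left open for further evaluation. Since the discussion at the end quotes per-class counts, one natural choice (an assumption on my part, not a stated requirement) is the confusion matrix. A minimal sketch with hypothetical labels; the real call would use labels_test and labels_pred from the cells above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, just to show the layout.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 0, 1])
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [1 2]]
# Rows are true classes, columns predicted classes:
# cm[0, 0] = non-delays predicted correctly, cm[1, 1] = delays predicted correctly.
```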
Write a function named learn_dt() that takes three NumPy arrays:
the attributes of the training set, the labels of the training set,
and the attributes of the testing set.
It should use the decision tree algorithm
(sklearn.tree.DecisionTreeClassifier)
to predict the labels for the testing data.
IMPORTANT:
You must use the random_state parameter of
DecisionTreeClassifier
to ensure
repeatability.
In [ ]:
# Next, let's try Decision Trees
from sklearn import tree
random_state = check_random_state(random_seed)
def learn_dt(data_train, labels_train, data_test, random_state=random_state):
'''
Takes three NumPy arrays.
Trains a Decision Trees algorithm
and returns the predicted labels for each row in 'data_test'.
Parameters
----------
data_train: A NumPy array. Training set attributes.
labels_train: A NumPy array. Training set labels.
data_test: A NumPy array. Test set attributes.
Returns
-------
A NumPy array that has the predictions for each row in 'data_test'.
'''
#### your code goes here
return result
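One way to complete learn_dt() — a sketch, not necessarily the reference solution — again checked on toy separable data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_random_state

def learn_dt(data_train, labels_train, data_test, random_state=None):
    # Fit a decision tree; random_state makes tie-breaking between splits repeatable.
    dtc = DecisionTreeClassifier(random_state=random_state)
    dtc.fit(data_train, labels_train)
    return dtc.predict(data_test)

# Tiny sanity check on clearly separable toy data.
data_train = np.array([[0.0], [0.1], [10.0], [10.1]])
labels_train = np.array([0, 0, 1, 1])
pred = learn_dt(data_train, labels_train, np.array([[0.05], [10.05]]),
                random_state=check_random_state(490))
print(pred)  # [0 1]
```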
In [ ]:
labels_pred = learn_dt(data_train, labels_train, data_test)
In [ ]:
print("The accuracy score for DT is {0:.4f}.".format(accuracy_score(labels_test, labels_pred)))
And we compute the F1 score.
print("The F1 score for DT is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
The F1 score for DT is 0.3590.
In [ ]:
print("The F1 score for DT is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
In [ ]:
#### your code goes here
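This open cell presumably visualizes or tabulates the per-class results; one common option (my assumption) is a seaborn heatmap of the confusion matrix. The matrix values below are hypothetical placeholders, not the real ORD counts:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripting
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical 2x2 confusion matrix (rows: true class, columns: predicted class).
cm = np.array([[50, 10], [20, 20]])

ax = sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                 xticklabels=['Not delayed', 'Delayed'],
                 yticklabels=['Not delayed', 'Delayed'])
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
plt.tight_layout()
```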
So it seems that the decision tree classifier is better at predicting delays (10,294 correct) than the $k$-nearest neighbor classifier (3,752). However, the $k$-nearest neighbor classifier classified more non-delays correctly (100,447) than the decision tree classifier (89,464).
We have also seen that the kNN classifier outperforms the DT in accuracy score, while the F1 score of DT is higher than that of kNN. In our case, neither classifier clearly outperformed the other; which classifier to use depends on the specific use case, e.g. whether we give more weight to correctly predicting non-delays or delays.
In [ ]: