In this problem, we will use supervised learning techniques to
predict departure delays at Chicago O'Hare International Airport (ORD).
For simplicity, we will use only six attributes:
Month, DayofMonth, DayOfWeek, CRSDepTime, CRSArrTime, and Distance.
Of the four algorithms introduced in
Lesson 1,
you are only required to apply two:
$k$-NN and Decision Trees.
But scikit-learn has a unified estimator API,
so it should be easy to test other algorithms on your own.
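To illustrate that unified API, here is a small sketch on toy data (not the airline data); the particular estimators and values are my own choices, not part of the assignment. Every classifier is trained and queried with the exact same fit/predict calls:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy, clearly separable data (hypothetical values).
X = np.array([[0.0], [0.1], [10.0], [10.1]])
y = np.array([0, 0, 1, 1])
X_new = np.array([[0.05], [10.05]])

predictions = {}
for clf in (RandomForestClassifier(random_state=0), LogisticRegression()):
    clf.fit(X, y)                                  # identical call for every estimator
    predictions[type(clf).__name__] = clf.predict(X_new)

print(predictions)  # both classifiers predict [0 1] on this separable data
```

Swapping in a different algorithm is a one-line change to the estimator constructor; everything downstream stays the same.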
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
We use the 2001 on-time airline performance data set. We import the following columns:
In [ ]:
df = pd.read_csv('/data/airline/2001.csv', encoding='latin-1', usecols=(1, 2, 3, 5, 7, 15, 16, 18))
We use only the flights that departed from ORD. We define a flight to be delayed if its departure delay is 15 minutes or more, the same definition used by the FAA (source: Wikipedia).
In [ ]:
ohare = df[df.Origin == 'ORD']
ohare = ohare.drop('Origin', axis=1) # we don't need the Origin column anymore.
ohare['Delayed'] = (ohare.DepDelay >= 15).astype(int) # 1 if a flight was delayed, 0 if not. (np.int was removed from NumPy; use the built-in int.)
ohare = ohare.drop('DepDelay', axis=1) # we don't need the DepDelay column.
ohare = ohare.dropna()
In [ ]:
print(ohare.head(5))
As explained in Lesson 1, we need to build NumPy arrays because scikit-learn does not work natively with Pandas DataFrames.
Write a function named df_to_array() that takes a DataFrame
and returns a tuple of two NumPy arrays.
The first array should have all rows and columns except the Delayed column.
The second array holds the labels that will be used as truth values, i.e. the Delayed column.
In [ ]:
def df_to_array(df):
'''
Takes a DataFrame and returns a tuple of NumPy arrays.
Parameters
----------
df: A DataFrame. Has a column named 'Delayed'.
Returns
-------
data: A NumPy array. To be used as attributes.
labels: A NumPy array. To be used as truth labels.
'''
#### your code goes here
return data, labels
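For reference, here is one possible way to complete df_to_array() — a sketch, not necessarily the reference solution — demonstrated on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

def df_to_array(df):
    # Labels: the 'Delayed' column as a 1-D NumPy array.
    labels = df['Delayed'].values
    # Data: all rows and every other column, in their original order.
    data = df.drop('Delayed', axis=1).values
    return data, labels

# Tiny hypothetical frame, just to show the shapes involved.
toy = pd.DataFrame({'Month': [1, 1], 'Distance': [599, 599], 'Delayed': [0, 1]})
data, labels = df_to_array(toy)
print(data.shape, labels.shape)  # (2, 2) (2,)
```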
Here are some sample outputs from my code:
data, labels = df_to_array(ohare)
print(data[:5])
[[ 1 1 1 951 1235 599]
[ 1 2 2 951 1235 599]
[ 1 3 3 951 1235 599]
[ 1 4 4 951 1235 599]
[ 1 5 5 951 1235 599]]
print(labels[:5])
[0 0 0 1 0]
print(data.shape)
(341284, 6)
print(labels.shape)
(341284,)
In [ ]:
data, labels = df_to_array(ohare)
First, we need to split our data into training and testing sets. Thus, write a
split() function that takes two NumPy arrays.
The first array is the attributes, and the second the labels.
It returns a tuple of four NumPy arrays:
the training set portion of the first input array,
the testing set portion of the first input array,
the training set portion of the second input array,
and the testing set portion of the second input array.
IMPORTANT:
You must use the random_state parameter in the
train_test_split()
function to ensure
repeatability.
Also, don't forget to use the optional parameter frac, which specifies the fraction of the data reserved for the test set.
In [ ]:
from sklearn.utils import check_random_state
random_seed = 490
random_state = check_random_state(random_seed)
In [ ]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
def split(data, labels, frac=0.4, random_state=random_state):
'''
Splits `data` and `labels` into training and testing sets.
Parameters
----------
data: A NumPy array. Attributes.
labels: A NumPy array. Truth labels.
frac: Optional. A float. The fraction of test set.
random_state: Random number generator.
Returns
-------
A tuple of four NumPy arrays:
Training set portion of 'data', test set portion of 'data',
training set portion of 'labels', and test set portion of 'labels'.
'''
#### your code goes here
return a_train, a_test, b_train, b_test
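One possible completion of split() — a sketch, assuming train_test_split from sklearn.model_selection (older scikit-learn versions kept it in sklearn.cross_validation) — shown with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state

def split(data, labels, frac=0.4, random_state=None):
    # frac is passed through as test_size; random_state makes the split repeatable.
    return train_test_split(data, labels, test_size=frac, random_state=random_state)

# Toy demonstration: 10 samples, 40% reserved for testing.
data = np.arange(20).reshape(10, 2)
labels = np.arange(10)
a_train, a_test, b_train, b_test = split(data, labels,
                                         random_state=check_random_state(490))
print(a_train.shape, a_test.shape)  # (6, 2) (4, 2)
```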
In [ ]:
data_train, data_test, labels_train, labels_test = split(data, labels)
Write a function named learn_knn() that takes three NumPy arrays
and an integer. The first array is the training set attributes,
the second array is the training set labels, and
the third array is the test set attributes.
It should return a NumPy array that has predicted labels
for each data point in the test set.
There are some
parameters
you can adjust, but you should use only the n_neighbors parameter.
In [ ]:
from sklearn import neighbors
def learn_knn(data_train, labels_train, data_test, n_neighbors):
'''
Takes three NumPy arrays and an integer.
Trains a kNN algorithm where k = n_neighbors
and returns the predicted labels for each row in 'data_test'.
Parameters
----------
data_train: A NumPy array. Training set attributes.
labels_train: A NumPy array. Training set labels.
data_test: A NumPy array. Test set attributes.
n_neighbors: The number of neighbors for kNN queries.
Returns
-------
A NumPy array that has the predictions for each row in 'data_test'.
'''
#### your code goes here
return result
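A minimal way to fill in learn_knn(), using sklearn.neighbors.KNeighborsClassifier, demonstrated on toy separable data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def learn_knn(data_train, labels_train, data_test, n_neighbors):
    # Fit a k-nearest-neighbor classifier and predict on the test attributes.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(data_train, labels_train)
    return knn.predict(data_test)

# Tiny sanity check on clearly separable toy data.
data_train = np.array([[0.0], [0.1], [10.0], [10.1]])
labels_train = np.array([0, 0, 1, 1])
pred = learn_knn(data_train, labels_train, np.array([[0.05], [10.05]]), 1)
print(pred)  # [0 1]
```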
In [ ]:
labels_pred = learn_knn(data_train, labels_train, data_test, 5)
There are various
performance metrics
that you can use to evaluate the performance of a classifier.
For example, the score() method of the $k$-nearest neighbor classifier
that was demonstrated in Lesson 1 computes the
accuracy score.
from sklearn.metrics import accuracy_score
print("The accuracy score for kNN is {0:.4f}.".format(accuracy_score(labels_test, labels_pred)))
The accuracy score for kNN is 0.7633.
In [ ]:
from sklearn.metrics import accuracy_score
print("The accuracy score for kNN is {0:.4f}.".format(accuracy_score(labels_test, labels_pred)))
There are pros and cons for each metric. Another popular metric is the F1 score.
from sklearn.metrics import f1_score
print("The F1 score for kNN is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
The F1 score for kNN is 0.1885.
In [ ]:
from sklearn.metrics import f1_score
print("The F1 score for kNN is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
In [ ]:
#### your code goes here
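This cell is left open for further evaluation. Since the discussion at the end quotes per-class counts, one natural choice (an assumption on my part, not a stated requirement) is the confusion matrix. A minimal sketch with hypothetical labels; the real call would use labels_test and labels_pred from the cells above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, just to show the layout.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 0, 1])
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [1 2]]
# Rows are true classes, columns predicted classes:
# cm[0, 0] = non-delays predicted correctly, cm[1, 1] = delays predicted correctly.
```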
Write a function named learn_dt() that takes three NumPy arrays:
the attributes of the training set, the labels of the training set,
and the attributes of the testing set.
It should use the decision tree algorithm
(sklearn.tree.DecisionTreeClassifier)
to predict the labels for the testing data.
IMPORTANT:
You must use the random_state parameter of
DecisionTreeClassifier
to ensure
repeatability.
In [ ]:
# Next, let's try Decision Trees
from sklearn import tree
random_state = check_random_state(random_seed)
def learn_dt(data_train, labels_train, data_test, random_state=random_state):
'''
Takes three NumPy arrays.
Trains a Decision Trees algorithm
and returns the predicted labels for each row in 'data_test'.
Parameters
----------
data_train: A NumPy array. Training set attributes.
labels_train: A NumPy array. Training set labels.
data_test: A NumPy array. Test set attributes.
Returns
-------
A NumPy array that has the predictions for each row in 'data_test'.
'''
#### your code goes here
return result
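One way to complete learn_dt() — a sketch, not necessarily the reference solution — again checked on toy separable data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_random_state

def learn_dt(data_train, labels_train, data_test, random_state=None):
    # Fit a decision tree; random_state makes tie-breaking between splits repeatable.
    dtc = DecisionTreeClassifier(random_state=random_state)
    dtc.fit(data_train, labels_train)
    return dtc.predict(data_test)

# Tiny sanity check on clearly separable toy data.
data_train = np.array([[0.0], [0.1], [10.0], [10.1]])
labels_train = np.array([0, 0, 1, 1])
pred = learn_dt(data_train, labels_train, np.array([[0.05], [10.05]]),
                random_state=check_random_state(490))
print(pred)  # [0 1]
```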
In [ ]:
labels_pred = learn_dt(data_train, labels_train, data_test)
In [ ]:
print("The accuracy score for DT is {0:.4f}.".format(accuracy_score(labels_test, labels_pred)))
And we compute the F1 score.
print("The F1 score for DT is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
The F1 score for DT is 0.3590.
In [ ]:
print("The F1 score for DT is {0:.4f}.".format(f1_score(labels_test, labels_pred)))
In [ ]:
#### your code goes here
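This open cell presumably visualizes or tabulates the per-class results; one common option (my assumption) is a seaborn heatmap of the confusion matrix. The matrix values below are hypothetical placeholders, not the real ORD counts:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripting
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical 2x2 confusion matrix (rows: true class, columns: predicted class).
cm = np.array([[50, 10], [20, 20]])

ax = sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                 xticklabels=['Not delayed', 'Delayed'],
                 yticklabels=['Not delayed', 'Delayed'])
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
plt.tight_layout()
```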
So it seems that the decision tree classifier is better at predicting delays (10,294 correct) than the $k$-nearest neighbor classifier (3,752). However, the $k$-nearest neighbor classifier classified more non-delays correctly (100,447) than the decision tree classifier (89,464).
We have also seen that the kNN classifier outperforms the DT in accuracy score, while the F1 score of DT is higher than that of kNN. In our case, neither classifier clearly outperformed the other; which classifier to use depends on the specific use case, e.g. whether we give more weight to correctly predicting non-delays or delays.
In [ ]: