Problems 12.2 and 12.3. Dimensionality Reduction and Clustering

This problem will give you a chance to practice dimensionality reduction (PCA) and clustering ($k$-means) by applying these machine learning techniques to Delta Airline's aircraft.

This is one continuous problem, but I have split it into two parts for easier grading.

Problem 12.2. Dimensionality Reduction


In [ ]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.utils import check_random_state

Delta Airline (and other major airlines) publishes data on all of its aircraft on its website. For example, the following shows the specifications of the Airbus A319 VIP.

Download delta.csv.

In this problem, we will use delta.csv, a CSV file with the aircraft data taken from the Delta Airline website. The first step is to download delta.csv from GitHub. You can use wget in the following code cell, or git pull the course repository and use that copy of the file instead.


In [ ]:
%%bash
# edit paths as necessary
wget https://rawgit.com/INFO490/spring2015/master/week13/delta.csv -O /data/airline/delta.csv

In [ ]:
# edit path to where your delta.csv is located.
df = pd.read_csv('/data/airline/delta.csv')

This data set has 34 columns (including the names of the aircraft) for 44 aircraft. It includes both quantitative measurements, such as cruising speed, accommodation, and range in miles, and categorical data, such as whether a particular aircraft has Wi-Fi or video. These binary attributes are assigned values of either 1 or 0, for yes or no respectively.


In [ ]:
print(df.head())

Function: df_to_array()

We need to do some preprocessing before we actually apply the machine learning techniques.

As explained in Lesson 1, we need to build NumPy arrays because scikit-learn does not work natively with Pandas DataFrame.

  • Write a function named df_to_array() that takes a DataFrame and returns a tuple of two NumPy arrays. The first array should contain every row and every column except the Aircraft column. The second array holds the labels that will be used as truth values, i.e. the Aircraft column. (A minimal sketch follows the test examples below.)

In [ ]:
def df_to_array(df):
    '''
    Takes a DataFrame and returns a tuple of NumPy arrays.
    
    Parameters
    ----------
    df: A DataFrame. Has a column named 'Aircraft'.
    
    Returns
    -------
    data: A NumPy array. To be used as attributes.
    labels: A NumPy array. To be used as truth labels.
    '''
    
    ### your code goes here
    
    return data, labels

Here are some examples that you can use to test your function.

print(data.shape)
(44, 33)
print(data[0])
[  0.00000000e+00   0.00000000e+00   0.00000000e+00   2.10000000e+01
   3.60000000e+01   1.20000000e+01   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.72000000e+01   3.40000000e+01   1.80000000e+01
   1.72000000e+01   3.05000000e+01   9.60000000e+01   1.26000000e+02
   5.17000000e+02   2.39900000e+03   2.00000000e+00   1.11830000e+02
   3.85830000e+01   1.11000000e+02   1.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.00000000e+00   0.00000000e+00   1.00000000e+00
   1.00000000e+00]
print(data[:, 0])
[  0.   19.4   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.    0.    0.    0.    0.    0.    0. ]
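
For reference, here is one minimal sketch of how df_to_array() could work. It assumes the only non-numeric column is Aircraft; the name df_to_array_sketch and the cast to float are illustrative choices, not the required solution.

def df_to_array_sketch(df):
    # Labels: the 'Aircraft' column as a NumPy array.
    labels = df['Aircraft'].values
    # Attributes: every other column. The float cast is optional and is
    # only assumed here to match the printed output above.
    data = df.drop('Aircraft', axis=1).values.astype(float)
    return data, labels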

In [ ]:
data, labels = df_to_array(df)

Standardization

PCA is a scale-dependent method. For example, if the range of one column is [-100, 100] while that of another column is [-0.1, 0.1], PCA will place more weight on the attribute with the larger values. One way to avoid this is to standardize the data set by scaling each feature so that the individual features all look like Gaussian distributions with zero mean and unit variance.

For further detail, see Preprocessing data. The function scale provides a quick and easy way to perform this operation on a single array-like dataset.


In [ ]:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the attributes, then standardize them to zero mean and unit variance.
scaler = StandardScaler().fit(data)
data_scaled = scaler.transform(data)
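
Equivalently, the scale function mentioned above performs the same standardization in a single call. A small sketch (the name data_scaled_alt is just illustrative); StandardScaler remains the better choice when you need to reuse the fitted scaler on new data:

from sklearn.preprocessing import scale

# scale() standardizes each column to zero mean and unit variance,
# like fitting and applying a StandardScaler in one step.
data_scaled_alt = scale(data)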

Function: perform_pca()

  • Write a function named perform_pca() that takes two NumPy arrays and returns a DataFrame with three columns: 'PCA1', 'PCA2', and 'Aircraft'. (A minimal sketch follows the code cell below.)

In [ ]:
from sklearn.decomposition import PCA

def perform_pca(data, labels):
    '''
    Takes two NumPy arrays and returns a DataFrame.
    Runs PCA to obtain the first and second principal components.
    
    Parameters
    ----------
    data: A NumPy array. Attributes.
    labels: A NumPy array. Aircraft types.
    
    Returns
    -------
    A DataFrame with three columns, 'PCA1', 'PCA2', and 'Aircraft'.
    '''
        
    #### your code goes here
    
    return data_pca

data_pca = perform_pca(data_scaled, labels)
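
One possible sketch of perform_pca(), assuming we keep only the first two principal components and attach the labels as the Aircraft column (the name perform_pca_sketch is illustrative, not the required solution):

def perform_pca_sketch(data, labels):
    # Project the scaled attributes onto the first two principal components.
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(data)
    # Package the projection and the aircraft names into one DataFrame.
    result = pd.DataFrame(reduced, columns=['PCA1', 'PCA2'])
    result['Aircraft'] = labels
    return result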

The PCA returned the following.

data_pca = perform_pca(data_scaled, labels)
print(data_pca.head(5))
PCA1      PCA2          Aircraft
0  2.656021 -1.382411       Airbus A319
1  6.766622  16.74373   Airbus A319 VIP
2  2.396654 -1.487692       Airbus A320
3  2.396654 -1.487692  Airbus A320 32-R
4 -4.862387  0.951892   Airbus A330-200

[5 rows x 3 columns]

In [ ]:
print(data_pca.head(5))

Plot: Scatter Plot Using Principal Components

  • Create a scatter plot using the first and second principal components of data_pca.


In [ ]:
#### your code goes here
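
One way this plot might be drawn with matplotlib, as a sketch (axis labels, title, and styling are your choice):

# Plot the aircraft in the space of the first two principal components.
fig, ax = plt.subplots()
ax.scatter(data_pca['PCA1'], data_pca['PCA2'])
ax.set_xlabel('First principal component (PCA1)')
ax.set_ylabel('Second principal component (PCA2)')
ax.set_title('Delta aircraft in principal component space')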

Problem 12.3. Clustering

In this problem, we will use the k-means algorithm to group similar aircraft into clusters.

Function: cluster()

  • Write a function named cluster() that takes two NumPy arrays. The first array has the scaled attributes, while the second array has the attributes expressed in terms of the first and second principal components. It should return a DataFrame with three columns: PCA1, PCA2, and Cluster. (A minimal sketch follows the code cell below.)

The number of clusters, n_clusters, is an adjustable parameter; each value of n_clusters corresponds to a different model. You may experiment with different values of n_clusters to find the one you think best fits the data. In the following, I use n_clusters=4.

IMPORTANT: You must use the random_state parameter in the KMeans() function to ensure repeatability. Also, don't forget to pass the optional parameter n_clusters.


In [ ]:
from sklearn.cluster import KMeans

random_seed = 490
random_state = check_random_state(random_seed)

def cluster(data_scaled, data_reduced, n_clusters=4, random_state=random_state):
    '''
    Takes two NumPy arrays and returns a DataFrame.
    
    Parameters
    ----------
    data_scaled: A NumPy array. Attributes scaled with e.g. StandardScaler().
    data_reduced: A NumPy array. Attributes in principal components.
    n_clusters: Optional. An integer. The number of clusters to form.
    random_state: Random number generator.
    
    Returns
    -------
    A DataFrame with three columns: 'PCA1', 'PCA2', and 'Cluster'.
    '''
    #### your code goes here
    
    return data_clust

data_clust = cluster(data_scaled, data_pca[['PCA1', 'PCA2']].values, n_clusters=4, random_state=random_state)
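
One reading of the specification, sketched below: fit k-means on the scaled attributes and attach the resulting cluster assignments to the principal-component coordinates. The name cluster_sketch is illustrative, and fitting on data_scaled (rather than on data_reduced) is an assumption about the intended design.

def cluster_sketch(data_scaled, data_reduced, n_clusters=4, random_state=random_state):
    # Fit k-means on the scaled attributes, not on the reduced data.
    model = KMeans(n_clusters=n_clusters, random_state=random_state)
    model.fit(data_scaled)
    # Report the cluster assignments in principal component coordinates.
    result = pd.DataFrame(data_reduced, columns=['PCA1', 'PCA2'])
    result['Cluster'] = model.labels_
    return result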

Here's what I got:

data_clust = cluster(data_scaled, data_pca[['PCA1', 'PCA2']].values)
print(data_clust.head(5))
PCA1      PCA2 Cluster
0  2.656021 -1.382411       2
1  6.766622  16.74373       3
2  2.396654 -1.487692       2
3  2.396654 -1.487692       2
4 -4.862387  0.951892       0

[5 rows x 3 columns]

In [ ]:
print(data_clust.head(5))

Plot: Scatter Plot of Clusters

  • Use the data_clust DataFrame to create a scatter plot of the clustered data in the first and second principal component axes.


In [ ]:
#### your code goes here
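
A sketch of one way to color the points by cluster with matplotlib (seaborn would work just as well):

# Color the points by their cluster assignment.
fig, ax = plt.subplots()
for label, group in data_clust.groupby('Cluster'):
    ax.scatter(group['PCA1'], group['PCA2'], label='Cluster {}'.format(label))
ax.set_xlabel('First principal component (PCA1)')
ax.set_ylabel('Second principal component (PCA2)')
ax.legend(loc='best')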

Discussion

You don't have to write any code in this section, but here's one interpretation of what we have done.

Let's take a closer look at each cluster.


In [ ]:
# Attach the cluster assignments to the original DataFrame.
df['Cluster'] = data_clust['Cluster']
# Average each attribute within each cluster.
df_grouped = df.groupby('Cluster').mean()
print(df_grouped.Accommodation)

In [ ]:
print(df_grouped['Length (ft)'])

Cluster 3 has only one aircraft:


In [ ]:
clust3 = data_pca[data_clust.Cluster == 3]
print(clust3.Aircraft)

The Airbus A319 VIP is not part of Delta Airline's regular fleet; it is one of Airbus's corporate jets.

Cluster 1 has four aircraft:


In [ ]:
clust1 = data_pca[data_clust.Cluster == 1]
print(clust1.Aircraft)

These are small aircraft that have only economy seats.


In [ ]:
cols_seat = ['First Class', 'Business', 'Eco Comfort', 'Economy']
print(df.loc[clust1.index, cols_seat])

Next, we look at Cluster 0:


In [ ]:
clust0 = data_pca[data_clust.Cluster == 0]
print(clust0.Aircraft)

These aircraft do not have first class seating:


In [ ]:
print(df.loc[clust0.index, cols_seat])

Finally, Cluster 2 has the following aircraft:


In [ ]:
clust2 = data_pca[data_clust.Cluster == 2]
print(clust2.Aircraft)

The aircraft in Cluster 2 have first class seating but no business class.


In [ ]:
print(df.loc[clust2.index, cols_seat])

In [ ]: