This problem will give you a chance to practice dimensionality reduction (PCA) and clustering ($k$-means) by applying these machine learning techniques to Delta Airline's aircraft.
This is one continuous problem, but I have split it into two parts for easier grading.
In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.utils import check_random_state
Delta Airline (and other major airlines) publishes data on all of its aircraft on its website. For example, the following shows the specifications of the AIRBUS A319 VIP.
In [ ]:
%%bash
# edit paths as necessary
wget https://rawgit.com/INFO490/spring2015/master/week13/delta.csv -O /data/airline/delta.csv
In [ ]:
# edit path to where your delta.csv is located.
df = pd.read_csv('/data/airline/delta.csv')
This data set has 34 columns (including the names of the aircraft) on 44 aircraft. It includes quantitative measurements, such as cruising speed, accommodation, and range in miles, as well as categorical data, such as whether a particular aircraft has Wi-Fi or video. These binary columns are assigned values of either 1 or 0, for yes or no respectively.
In [ ]:
print(df.head())
We need to do some preprocessing before we actually apply the machine learning techniques.
As explained in Lesson 1, we need to build NumPy arrays because scikit-learn does not work natively with Pandas DataFrames.
Write a function named df_to_array() that takes a DataFrame
and returns a tuple of two NumPy arrays.
The first array should have every row and column except the Aircraft column.
The second array is the labels that will be used as truth values, i.e. the Aircraft column.
In [ ]:
def df_to_array(df):
'''
Takes a DataFrame and returns a tuple of NumPy arrays.
Parameters
----------
df: A DataFrame. Has a column named 'Aircraft'.
Returns
-------
data: A NumPy array. To be used as attributes.
labels: A NumPy array. To be used as truth labels.
'''
### your code goes here
return data, labels
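For reference, here is one possible sketch of df_to_array() (not the only valid implementation). It is demonstrated on a tiny made-up DataFrame rather than the real delta.csv, so the column names other than 'Aircraft' are illustrative only:

```python
import numpy as np
import pandas as pd

def df_to_array(df):
    # Labels come from the 'Aircraft' column; attributes are everything else.
    labels = df['Aircraft'].values
    data = df.drop('Aircraft', axis=1).values
    return data, labels

# Tiny illustrative frame (made-up values, not the real delta.csv):
toy = pd.DataFrame({'Aircraft': ['A', 'B'],
                    'Wi-Fi': [1, 0],
                    'Range (mi)': [2399, 1200]})
data, labels = df_to_array(toy)
print(data.shape)  # (2, 2)
```

Using .values keeps the row order of the DataFrame, so data[i] and labels[i] still refer to the same aircraft.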
Here are some examples that you can use to test your function.
print(data.shape)
(44, 33)
print(data[0])
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 2.10000000e+01
3.60000000e+01 1.20000000e+01 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.72000000e+01 3.40000000e+01 1.80000000e+01
1.72000000e+01 3.05000000e+01 9.60000000e+01 1.26000000e+02
5.17000000e+02 2.39900000e+03 2.00000000e+00 1.11830000e+02
3.85830000e+01 1.11000000e+02 1.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
1.00000000e+00]
print(data[:, 0])
[ 0. 19.4 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. ]
In [ ]:
data, labels = df_to_array(df)
PCA is a scale-dependent method. For example, if the range of one column is [-100, 100], while that of another column is [-0.1, 0.1], PCA will place more weight on the attribute with larger values. One way to avoid this is to standardize a data set by scaling each feature so that the individual features all look like Gaussian distributions with zero mean and unit variance.
For further detail, see Preprocessing data. The function scale provides a quick and easy way to perform this operation on a single array-like dataset.
In [ ]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data)
data_scaled = scaler.transform(data)
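When you don't need to keep the fitted scaler around, the scale function mentioned above is a one-step equivalent of fit followed by transform. A small self-contained illustration on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import scale

# Two features with very different ranges, as in the example above.
X = np.array([[1.0, -100.0],
              [2.0,    0.0],
              [3.0,  100.0]])

X_scaled = scale(X)  # same result as StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

After scaling, both columns contribute on an equal footing to the PCA.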
In [ ]:
from sklearn.decomposition import PCA
def perform_pca(data, labels):
'''
Takes two NumPy arrays and returns a DataFrame.
Runs PCA to obtain the first and second principal components.
Parameters
----------
data: A NumPy array. Attributes.
labels: A NumPy array. Aircraft types.
Returns
-------
A DataFrame with three columns, 'PCA1', 'PCA2', and 'Aircraft'.
'''
#### your code goes here
return data_pca
data_pca = perform_pca(data_scaled, labels)
The PCA returned the following.
data_pca = perform_pca(data_scaled, labels)
print(data_pca.head(5))
PCA1 PCA2 Aircraft
0 2.656021 -1.382411 Airbus A319
1 6.766622 16.74373 Airbus A319 VIP
2 2.396654 -1.487692 Airbus A320
3 2.396654 -1.487692 Airbus A320 32-R
4 -4.862387 0.951892 Airbus A330-200
[5 rows x 3 columns]
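A sketch of how perform_pca() might be written, demonstrated on synthetic stand-in data since the real inputs come from delta.csv. Note that the signs of principal components can flip between scikit-learn versions, so exact values may differ from the sample output above:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def perform_pca(data, labels):
    # Project the attributes onto the first two principal components.
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(data)
    return pd.DataFrame({'PCA1': reduced[:, 0],
                         'PCA2': reduced[:, 1],
                         'Aircraft': labels})

# Synthetic stand-in inputs (made up for illustration):
rng = np.random.RandomState(0)
demo = perform_pca(rng.randn(5, 4), ['a', 'b', 'c', 'd', 'e'])
print(demo.columns.tolist())  # ['PCA1', 'PCA2', 'Aircraft']
```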
In [ ]:
print(data_pca.head(5))
In [ ]:
#### your code goes here
In this problem, we will use the $k$-means algorithm to group similar aircraft into clusters.
Write a function named cluster() that takes two NumPy arrays.
The first array has the scaled attributes,
while the second array has the attributes in terms of the first
and second principal components.
It should return a DataFrame with three columns:
PCA1, PCA2, and Cluster. The number of clusters n_clusters is an adjustable parameter.
Different values of n_clusters correspond to different models.
You may experiment with different values of n_clusters to
find what you think best fits the data.
In the following, I use n_clusters=4.
IMPORTANT:
You must use the random_state parameter in the
KMeans()
function to ensure
repeatability.
Also, don't forget to use the optional parameter n_clusters.
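One common way to experiment with n_clusters is to look at the model's inertia (within-cluster sum of squares) over a range of cluster counts and pick the "elbow" where the decrease levels off. A sketch on synthetic stand-in data (the real input would be data_scaled):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled attributes.
rng = np.random.RandomState(490)
X = rng.randn(40, 2)

for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=490, n_init=10).fit(X)
    # Inertia always decreases as k grows; look for where the drop flattens.
    print(k, model.inertia_)
```

The elbow is a heuristic, not a rule, so it is still worth eyeballing the resulting clusters before settling on a value.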
In [ ]:
from sklearn.cluster import KMeans
random_seed = 490
random_state = check_random_state(random_seed)
def cluster(data_scaled, data_reduced, n_clusters=4, random_state=random_state):
'''
Takes two NumPy arrays and returns a DataFrame.
Parameters
----------
data_scaled: A NumPy array. Attributes scaled with e.g. StandardScaler().
data_reduced: A NumPy array. Attributes in principal components.
n_clusters: Optional. An integer. The number of clusters to query.
random_state: Random number generator.
Returns
-------
A DataFrame with three columns: 'PCA1', 'PCA2', and 'Cluster'.
'''
#### your code goes here
return data_clust
data_clust = cluster(data_scaled, data_pca[['PCA1', 'PCA2']].values, n_clusters=4, random_state=random_state)
Here's what I got:
data_clust = cluster(data_scaled, data_pca[['PCA1', 'PCA2']].values)
print(data_clust.head(5))
PCA1 PCA2 Cluster
0 2.656021 -1.382411 2
1 6.766622 16.74373 3
2 2.396654 -1.487692 2
3 2.396654 -1.487692 2
4 -4.862387 0.951892 0
[5 rows x 3 columns]
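A possible sketch of cluster(), again demonstrated on synthetic stand-ins for data_scaled and the PCA columns. The key idea is that k-means is fit on the full scaled attributes, while the returned frame carries the two principal components for plotting:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def cluster(data_scaled, data_reduced, n_clusters=4, random_state=None):
    # Fit k-means on the scaled attributes, then attach the resulting
    # cluster labels to the two principal-component columns.
    model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    clusters = model.fit_predict(data_scaled)
    return pd.DataFrame({'PCA1': data_reduced[:, 0],
                         'PCA2': data_reduced[:, 1],
                         'Cluster': clusters})

# Synthetic demo inputs (made up for illustration):
rng = np.random.RandomState(490)
X = rng.randn(10, 5)
demo = cluster(X, X[:, :2], n_clusters=3, random_state=490)
print(demo.shape)  # (10, 3)
```

Which numeric label each cluster receives is arbitrary, so your cluster numbers may not match the sample output above even with the same random_state.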
In [ ]:
print(data_clust.head(5))
In [ ]:
#### your code goes here
In [ ]:
df['Cluster'] = data_clust['Cluster']
df_grouped = df.groupby('Cluster').mean()
print(df_grouped.Accommodation)
In [ ]:
print(df_grouped['Length (ft)'])
Cluster 3 has only one aircraft:
In [ ]:
clust3 = data_pca[data_clust.Cluster == 3]
print(clust3.Aircraft)
The Airbus A319 VIP is not part of Delta Airline's regular fleet; it is one of Airbus's corporate jets.
Cluster 1 has four aircraft:
In [ ]:
clust1 = data_pca[data_clust.Cluster == 1]
print(clust1.Aircraft)
These are small aircraft that only have economy seats.
In [ ]:
cols_seat = ['First Class', 'Business', 'Eco Comfort', 'Economy']
print(df.loc[clust1.index, cols_seat])
Next, we look at Cluster 0:
In [ ]:
clust0 = data_pca[data_clust.Cluster == 0]
print(clust0.Aircraft)
These aircraft do not have first class seating:
In [ ]:
print(df.loc[clust0.index, cols_seat])
Finally, cluster 2 has the following aircrafts:
In [ ]:
clust2 = data_pca[data_clust.Cluster == 2]
print(clust2.Aircraft)
The aircraft in cluster 2 have first class seating but no business class.
In [ ]:
print(df.loc[clust2.index, cols_seat])