Supervised Learning using scikit-learn

K-Nearest Neighbors (KNN) is one of the most basic classification algorithms and is a sensible first choice when you have little prior knowledge about the data.

It is based on the Euclidean distance between the test sample and the training samples. For any two points with p features, the Euclidean distance is the square root of the sum of the squared differences between the points in each dimension. For the mathematical explanation and theory, see the lecture notes: http://george1328.github.io/notes/DAT10_lec05_Regression_Regularization.pdf
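As a quick illustration, here is a minimal sketch of that distance computation with NumPy (the points a and b are made-up examples, not taken from the dataset):

import numpy as np

# two made-up points, each with p = 3 features
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# square root of the sum of squared per-feature differences
dist = np.sqrt(((a - b) ** 2).sum())
print(dist)   # 5.0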

We will use the Auto MPG dataset to practice predicting the fuel economy (MPG) of a car given its features.

Step 1: Get the data


In [1]:
import pandas as pd
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Column/feature labels are not included in the dataset, so we create a list of features using auto-mpg.names
features = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']

In [3]:
# Import the data directly into pandas from the URL; specify header=None as the column labels are not in the dataset
import urllib
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# The file is fixed-width-format, so we use read_fwf instead of read_csv
df = pd.read_fwf(urllib.urlopen(url), header=None)
df.columns = features
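# Note: this notebook runs on Python 2; in Python 3 the equivalent call is urllib.request.urlopen(url)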

In [4]:
# Alternatively, we can download the data
# We use the bang (!) within IPython Notebooks to run command-line statements directly from the notebook
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30286  100 30286    0     0  47352      0 --:--:-- --:--:-- --:--:-- 47321
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1660  100  1660    0     0   5404      0 --:--:-- --:--:-- --:--:--  5389

In [5]:
# Since this dataset has no column headings, we need to explicitly state names=features
df = pd.read_fwf("auto-mpg.data", names = features)
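# Passing na_values='?' to read_fwf would let pandas treat the missing horsepower entries as NaN up front,
# but we will handle them explicitly in the cleaning step below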

Step 2: Clean the data


In [6]:
# Use head, describe, info and unique to get a sense of the data
df.describe()


Out[6]:
mpg cylinders displacement weight acceleration model year origin
count 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000
mean 23.514573 5.454774 193.425879 2970.424623 15.568090 76.010050 1.572864
std 7.815984 1.701004 104.269838 846.841774 2.757689 3.697627 0.802055
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.250000 2223.750000 13.825000 73.000000 1.000000
50% 23.000000 4.000000 148.500000 2803.500000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3608.000000 17.175000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000

In [7]:
df.head()


Out[7]:
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18 8 307 130.0 3504 12.0 70 1 "chevrolet chevelle malibu"
1 15 8 350 165.0 3693 11.5 70 1 "buick skylark 320"
2 18 8 318 150.0 3436 11.0 70 1 "plymouth satellite"
3 16 8 304 150.0 3433 12.0 70 1 "amc rebel sst"
4 17 8 302 140.0 3449 10.5 70 1 "ford torino"

In [8]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(4), int64(3), object(2)

In [9]:
# Car name and horsepower are the only non-numeric fields. The name of a car is unlikely to influence MPG, so let's inspect horsepower.
df.horsepower.unique()


Out[9]:
array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)

In [10]:
(df.horsepower == '?').sum()


Out[10]:
6

In [11]:
# Let's convert horsepower to a numeric field so we can use it in our analysis
df['horsepower'] = df['horsepower'].convert_objects(convert_numeric = True)
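# Note: convert_objects is deprecated in newer pandas releases; pd.to_numeric(df['horsepower'], errors='coerce')
# is the modern equivalent (the '?' entries become NaN either way)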

In [12]:
# We will drop the 6 records that are missing horsepower. We could estimate (impute) these missing values,
# but for the sake of accuracy we will not; besides, it's only 6 missing values
df = df.dropna()

In [13]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null float64
weight          392 non-null float64
acceleration    392 non-null float64
model year      392 non-null int64
origin          392 non-null int64
car name        392 non-null object
dtypes: float64(5), int64(3), object(1)

Step 3: Get a sense of the data using Exploratory Data Analysis (EDA)


In [14]:
import seaborn as sns

In [15]:
sns.boxplot(df.mpg, df.cylinders)
# This is interesting: 4-cylinder vehicles have better mileage on average than 3-cylinder vehicles


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d489e90>

In [16]:
three_cyl = df[df.cylinders == 3]
print three_cyl['car name']
## Aha! Tiny Mazda roadsters...


71     "mazda rx2 coupe"
111          "maxda rx3"
243         "mazda rx-4"
334      "mazda rx-7 gs"
Name: car name, dtype: object

In [17]:
sns.violinplot(df.mpg, df['model year'])
# Fancy seaborn graphing


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d7669d0>

In [18]:
sns.barplot(df.mpg, df.horsepower)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d95ad10>

In [19]:
sns.barplot(df.mpg, df.weight)


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x10e25c1d0>

In [20]:
sns.boxplot(df.mpg, df.origin)
# Although the values of origin are not given, we can guess that 1=USA, 2=Europe and 3=Japan... Maybe...


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f538f90>

In [21]:
sns.boxplot(df.mpg, df.acceleration)
# Little cars have pretty good acceleration AND good mileage, so this is not a strong association


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x110210d50>

In [22]:
sns.kdeplot(df.mpg, df.cylinders)
# Showing different plot options in seaborn :-)


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1116ef090>

In [23]:
df.corr()


Out[23]:
mpg cylinders displacement horsepower weight acceleration model year origin
mpg 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329 0.580541 0.565209
cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683 -0.345647 -0.568932
displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800 -0.369855 -0.614535
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196 -0.416361 -0.455171
weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839 -0.309120 -0.585005
acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000 0.290316 0.212746
model year 0.580541 -0.345647 -0.369855 -0.416361 -0.309120 0.290316 1.000000 0.181528
origin 0.565209 -0.568932 -0.614535 -0.455171 -0.585005 0.212746 0.181528 1.000000
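Weight, displacement, horsepower and cylinders are all strongly (negatively) correlated with mpg, and strongly correlated with each other. A correlation heatmap makes this easier to scan; a minimal sketch, assuming a reasonably recent seaborn:

# annot=True prints the correlation coefficient inside each cell
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')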

Step 4: Re-engineer the data

Since MPG is a continuous variable and we want to treat this dataset as a classification problem, let's convert the MPG values into 5-mpg-wide bins: 5-9 mpg, 10-14 mpg, ..., 45-49 mpg.


In [24]:
# create bins. '0'=0-4mpg, '1'=5-9mpg, '2'=10-14mpg...'9'=45-49mpg.
df['mpg_bin'] = (df.mpg / 5).astype(int)

Ideally we should create dummy variables for the cylinders attribute, but we will introduce that in a future notebook.
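Although the rest of this notebook fits regression models to the continuous mpg values, the mpg_bin labels are exactly what a K-Nearest Neighbors classifier (the algorithm introduced at the top) would consume. A minimal sketch, with an illustrative feature list and k=5 (neither is tuned; the X_bin/y_bin names are just for this example):

from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# illustrative predictors, same columns used for the regression below
X_bin = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y_bin = df['mpg_bin'].values

Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_bin, y_bin, test_size=0.2)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xb_train, yb_train)
print(knn.score(Xb_test, yb_test))   # fraction of test cars assigned to the correct mpg bin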

Step 5: Prepare data for analysis


In [25]:
# Create numpy arrays X and y with the predictor variables and the target (mpg)
X = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y = df['mpg'].values

In [26]:
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
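# Note: sklearn.cross_validation was later renamed; in newer scikit-learn these helpers live in sklearn.model_selection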

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
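# No random_state is set, so the split (and every number below) will differ slightly from run to run;
# passing random_state (e.g. random_state=42) would make the results reproducible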

In [28]:
model = LinearRegression()
model.fit(X_train, y_train)


Out[28]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
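To see what the model actually learned, the fitted slope for each predictor and the intercept are available on the estimator. A quick sketch (the exact numbers depend on the random split above):

# the coefficients line up with the columns we put into X
for name, coef in zip(['weight', 'model year', 'horsepower', 'origin', 'displacement'], model.coef_):
    print '%s: %s' % (name, coef)
print 'intercept: %s' % model.intercept_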

In [29]:
predictions = model.predict(X_test)

In [30]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])


Predicted: 29.4308031249, Actual: 30.5
Predicted: 33.1952981092, Actual: 33.0
Predicted: 32.0181359381, Actual: 32.0
Predicted: 12.8002779599, Actual: 14.0
Predicted: 25.3061833209, Actual: 26.5
Predicted: 30.2339746281, Actual: 29.8
Predicted: 33.0487806024, Actual: 32.2
Predicted: 21.2135679363, Actual: 13.0
Predicted: 24.3574840944, Actual: 24.0
Predicted: 12.2274668373, Actual: 13.0
Predicted: 25.1870728772, Actual: 28.0
Predicted: 22.1411580969, Actual: 16.2
Predicted: 20.1410734066, Actual: 17.0
Predicted: 34.561472429, Actual: 37.2
Predicted: 23.8461972769, Actual: 23.0
Predicted: 36.0393892077, Actual: 37.0
Predicted: 24.7551353847, Actual: 24.5
Predicted: 17.2336762191, Actual: 17.0
Predicted: 19.4812922429, Actual: 18.0
Predicted: 24.8754183322, Actual: 23.8
Predicted: 25.6686880525, Actual: 19.0
Predicted: 29.830817277, Actual: 29.0
Predicted: 27.2672592875, Actual: 22.0
Predicted: 27.215946729, Actual: 20.0
Predicted: 11.639661399, Actual: 16.0
Predicted: 27.6098983166, Actual: 24.0
Predicted: 31.5006928976, Actual: 31.5
Predicted: 20.6940701595, Actual: 18.5
Predicted: 36.4312911981, Actual: 38.0
Predicted: 36.378747302, Actual: 31.0
Predicted: 26.8157281627, Actual: 28.8
Predicted: 23.1459743946, Actual: 23.9
Predicted: 29.2820020363, Actual: 27.2
Predicted: 11.4411525917, Actual: 14.0
Predicted: 9.5491124151, Actual: 13.0
Predicted: 13.4263078152, Actual: 13.0
Predicted: 5.98457756287, Actual: 13.0
Predicted: 25.9022740753, Actual: 21.6
Predicted: 22.0248296068, Actual: 19.0
Predicted: 8.93877839534, Actual: 11.0
Predicted: 34.0681885128, Actual: 34.1
Predicted: 6.8459693686, Actual: 12.0
Predicted: 20.3023752269, Actual: 15.0
Predicted: 20.9010794555, Actual: 20.0
Predicted: 21.3779477264, Actual: 19.0
Predicted: 23.9120067676, Actual: 22.0
Predicted: 20.4646029735, Actual: 17.5
Predicted: 32.6665228553, Actual: 31.9
Predicted: 29.8053168895, Actual: 21.1
Predicted: 36.2461867829, Actual: 38.0
Predicted: 31.2620442363, Actual: 36.1
Predicted: 11.4078090684, Actual: 14.0
Predicted: 28.4316996501, Actual: 33.5
Predicted: 31.0135434612, Actual: 35.7
Predicted: 12.1148872203, Actual: 14.0
Predicted: 25.3833115335, Actual: 24.3
Predicted: 20.8167184944, Actual: 20.2
Predicted: 35.065319018, Actual: 36.0
Predicted: 34.8101901299, Actual: 34.0
Predicted: 32.1628012083, Actual: 31.5
Predicted: 26.4670187118, Actual: 24.0
Predicted: 17.372988798, Actual: 18.0
Predicted: 23.9806431142, Actual: 17.0
Predicted: 19.1505446084, Actual: 22.0
Predicted: 28.3657297822, Actual: 31.0
Predicted: 24.5298420275, Actual: 25.5
Predicted: 10.1555970502, Actual: 14.0
Predicted: 34.5421360356, Actual: 36.1
Predicted: 20.3072004311, Actual: 17.7
Predicted: 23.1840191109, Actual: 23.0
Predicted: 26.3609303658, Actual: 27.9
Predicted: 12.0509238797, Actual: 14.0
Predicted: 22.9586054004, Actual: 24.0
Predicted: 26.3354950334, Actual: 24.0
Predicted: 20.9674225919, Actual: 20.5
Predicted: 33.0158374325, Actual: 39.0
Predicted: 28.8022864611, Actual: 27.0
Predicted: 16.7173825076, Actual: 16.0
Predicted: 23.0688222116, Actual: 18.1

In [31]:
model.score(X_test,y_test)


Out[31]:
0.82293456279632637
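For a regressor, model.score returns R², the fraction of the variance in mpg explained by the model on the test set. An error in the original units is often easier to interpret; a small sketch using sklearn's mean_squared_error (the square root is taken by hand):

from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print 'RMSE: %s mpg' % rmse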

In [32]:
sns.regplot(predictions, y_test)


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x111e67a10>

In [33]:
from sklearn.preprocessing import PolynomialFeatures

In [34]:
quad_model = PolynomialFeatures(degree=2)

In [35]:
quad_X_train = quad_model.fit_transform(X_train)
# use transform (not fit_transform) on the test set: the expander is already fitted on the training data
quad_X_test = quad_model.transform(X_test)

In [36]:
model.fit(quad_X_train, y_train)


Out[36]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
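The transform-then-fit steps above can also be chained into a single estimator with a scikit-learn Pipeline, which avoids transforming the train and test sets by hand. A sketch, assuming the same degree-2 expansion and the split created above (quad_pipeline is just an illustrative name):

from sklearn.pipeline import make_pipeline

quad_pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad_pipeline.fit(X_train, y_train)
print(quad_pipeline.score(X_test, y_test))   # should match the quadratic score computed below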

In [37]:
predictions = model.predict(quad_X_test)

In [38]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])


Predicted: 29.71347735, Actual: 30.5
Predicted: 33.4899146351, Actual: 33.0
Predicted: 32.362893245, Actual: 32.0
Predicted: 13.4082859132, Actual: 14.0
Predicted: 24.3829497547, Actual: 26.5
Predicted: 29.8749906141, Actual: 29.8
Predicted: 33.6948350224, Actual: 32.2
Predicted: 17.5600706525, Actual: 13.0
Predicted: 23.8637877057, Actual: 24.0
Predicted: 14.8968053527, Actual: 13.0
Predicted: 26.1080611058, Actual: 28.0
Predicted: 18.0564639908, Actual: 16.2
Predicted: 18.6731585909, Actual: 17.0
Predicted: 36.0656250472, Actual: 37.2
Predicted: 22.6938134494, Actual: 23.0
Predicted: 38.9293080217, Actual: 37.0
Predicted: 23.3073925207, Actual: 24.5
Predicted: 16.6326132369, Actual: 17.0
Predicted: 19.3480232633, Actual: 18.0
Predicted: 23.7547857787, Actual: 23.8
Predicted: 23.25694326, Actual: 19.0
Predicted: 31.0891028894, Actual: 29.0
Predicted: 25.3980068445, Actual: 22.0
Predicted: 25.2990155763, Actual: 20.0
Predicted: 10.0186085211, Actual: 16.0
Predicted: 27.9440881947, Actual: 24.0
Predicted: 32.7469694486, Actual: 31.5
Predicted: 19.1648743536, Actual: 18.5
Predicted: 39.4416044356, Actual: 38.0
Predicted: 39.2510147415, Actual: 31.0
Predicted: 23.876273622, Actual: 28.8
Predicted: 22.3229574106, Actual: 23.9
Predicted: 29.849009388, Actual: 27.2
Predicted: 13.820081633, Actual: 14.0
Predicted: 12.5526951652, Actual: 13.0
Predicted: 13.1434779351, Actual: 13.0
Predicted: 13.8083451654, Actual: 13.0
Predicted: 22.7683028754, Actual: 21.6
Predicted: 19.3613660338, Actual: 19.0
Predicted: 14.6846923328, Actual: 11.0
Predicted: 34.8785163439, Actual: 34.1
Predicted: 13.0789965183, Actual: 12.0
Predicted: 18.3477511496, Actual: 15.0
Predicted: 18.2219940558, Actual: 20.0
Predicted: 19.4376721247, Actual: 19.0
Predicted: 24.5357910022, Actual: 22.0
Predicted: 18.4981767768, Actual: 17.5
Predicted: 34.3948611477, Actual: 31.9
Predicted: 27.7151215546, Actual: 21.1
Predicted: 39.2665096613, Actual: 38.0
Predicted: 32.6600278312, Actual: 36.1
Predicted: 14.0845386878, Actual: 14.0
Predicted: 28.5754699889, Actual: 33.5
Predicted: 31.9966525014, Actual: 35.7
Predicted: 14.0454330029, Actual: 14.0
Predicted: 24.6972473406, Actual: 24.3
Predicted: 17.7208595876, Actual: 20.2
Predicted: 35.5326069675, Actual: 36.0
Predicted: 37.7768455848, Actual: 34.0
Predicted: 31.776165109, Actual: 31.5
Predicted: 24.6101626425, Actual: 24.0
Predicted: 16.3230510372, Actual: 18.0
Predicted: 20.7030081142, Actual: 17.0
Predicted: 19.0466418106, Actual: 22.0
Predicted: 29.4180151855, Actual: 31.0
Predicted: 23.1451225078, Actual: 25.5
Predicted: 14.265889449, Actual: 14.0
Predicted: 35.208092754, Actual: 36.1
Predicted: 13.8281300024, Actual: 17.7
Predicted: 22.6228167793, Actual: 23.0
Predicted: 24.523389664, Actual: 27.9
Predicted: 14.2653264212, Actual: 14.0
Predicted: 21.0913778325, Actual: 24.0
Predicted: 24.1214528607, Actual: 24.0
Predicted: 18.994029224, Actual: 20.5
Predicted: 36.4758511519, Actual: 39.0
Predicted: 29.6500911773, Actual: 27.0
Predicted: 15.8327425557, Actual: 16.0
Predicted: 18.7853727008, Actual: 18.1

In [39]:
model.score(quad_X_test,y_test)


Out[39]:
0.89217227801185983

In [40]:
sns.regplot(predictions, y_test)


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x111f61dd0>

In [41]:
from sklearn.cross_validation import cross_val_score

In [42]:
scores = cross_val_score(model, quad_X_train, y_train, cv=10)

In [43]:
scores


Out[43]:
array([ 0.89375634,  0.90110605,  0.87155007,  0.86614939,  0.71968059,
        0.8766902 ,  0.88787329,  0.7700916 ,  0.94054004,  0.80474785])
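The fold-to-fold scores vary quite a bit (roughly 0.72 to 0.94), so a compact summary is their mean and standard deviation:

print 'mean R^2: %s (+/- %s)' % (scores.mean(), scores.std())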

Using the attributes of a car to predict its MPG is not an exact science: manufacturers design cars around their marketing plans. Some cars are marketed as fuel efficient, some for comfort, some to appeal to a particular demographic (women, young adults, corporate buyers), and some for speed. Given those caveats, our model does a fairly good job.