Supervised Learning using scikit-learn

K-nearest neighbors (KNN) is one of the most basic classification algorithms and is a reasonable first choice when you have no prior knowledge about the data.

It is based on the Euclidean distance between the test sample and the training samples. For any two points $x_i$ and $x_j$ with $p$ features, the Euclidean distance is given by:

$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}$$
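As a quick sanity check of the formula, the same distance can be computed with numpy (the two points below are made up purely for illustration):

import numpy as np

# Two made-up points with p = 3 features (values chosen for illustration only)
x_i = np.array([18.0, 8.0, 307.0])
x_j = np.array([15.0, 8.0, 350.0])

# Euclidean distance, matching the formula above
d = np.sqrt(np.sum((x_i - x_j) ** 2))
print d  # equivalently: np.linalg.norm(x_i - x_j)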

We will use the Auto MPG dataset to practice using KNN to predict the fuel economy (MPG) of a car given its features.

Step 1: Get the data


In [1]:
import pandas as pd
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Column/feature labels are not available in the dataset, so we create a list of features using auto-mpg.names
features = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']

In [19]:
# Import the data directly into pandas from the URL; specify header=None since column labels are not in the dataset
import urllib
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# The file is fixed-width format, so we use read_fwf instead of read_csv
# (urllib.urlopen is Python 2; in Python 3 it is urllib.request.urlopen)
df = pd.read_fwf(urllib.urlopen(url), header=None)
df.columns = features

In [20]:
# Alternatively, we can download the data files
# We use the bang (!) prefix in IPython notebooks to run command-line statements directly from the notebook
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30286  100 30286    0     0  84766      0 --:--:-- --:--:-- --:--:-- 84597
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1660  100  1660    0     0   5339      0 --:--:-- --:--:-- --:--:--  5354

In [3]:
# Since this dataset has no column headings, we need to explicitly pass names=features
df = pd.read_fwf("auto-mpg.data", names=features)

Step 2: Clean the data


In [4]:
# Use head, describe, info and unique to get a sense of the data
# Note: horsepower is missing from describe() below because it was read in as a string (object) column
df.describe()


Out[4]:
              mpg   cylinders  displacement       weight  acceleration  model year      origin
count  398.000000  398.000000    398.000000   398.000000    398.000000  398.000000  398.000000
mean    23.514573    5.454774    193.425879  2970.424623     15.568090   76.010050    1.572864
std      7.815984    1.701004    104.269838   846.841774      2.757689    3.697627    0.802055
min      9.000000    3.000000     68.000000  1613.000000      8.000000   70.000000    1.000000
25%     17.500000    4.000000    104.250000  2223.750000     13.825000   73.000000    1.000000
50%     23.000000    4.000000    148.500000  2803.500000     15.500000   76.000000    1.000000
75%     29.000000    8.000000    262.000000  3608.000000     17.175000   79.000000    2.000000
max     46.600000    8.000000    455.000000  5140.000000     24.800000   82.000000    3.000000

In [6]:
df.head()


Out[6]:
   mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
0   18          8           307       130.0    3504          12.0          70       1  "chevrolet chevelle malibu"
1   15          8           350       165.0    3693          11.5          70       1  "buick skylark 320"
2   18          8           318       150.0    3436          11.0          70       1  "plymouth satellite"
3   16          8           304       150.0    3433          12.0          70       1  "amc rebel sst"
4   17          8           302       140.0    3449          10.5          70       1  "ford torino"

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(4), int64(3), object(2)

In [8]:
# car name and horsepower are the only non-numeric fields. The name of a car is unlikely to influence its MPG.
df.horsepower.unique()


Out[8]:
array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)

In [9]:
(df.horsepower == '?').sum()


Out[9]:
6

In [10]:
# Let's convert horsepower to a numeric field so we can use it in our analysis ('?' entries become NaN)
# Note: convert_objects is deprecated in newer pandas, where the equivalent is
# pd.to_numeric(df['horsepower'], errors='coerce')
df['horsepower'] = df['horsepower'].convert_objects(convert_numeric=True)

In [11]:
# We will drop the 6 records that are missing horsepower. We could estimate (impute) these
# missing values, but for the sake of accuracy we will not; besides, it's only 6 rows.
df = df.dropna()

In [12]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null float64
weight          392 non-null float64
acceleration    392 non-null float64
model year      392 non-null int64
origin          392 non-null int64
car name        392 non-null object
dtypes: float64(5), int64(3), object(1)

Step 3: Get a sense of the data using Exploratory Data Analysis (EDA)


In [13]:
import seaborn as sns

In [14]:
sns.boxplot(df.mpg, df.cylinders)
# This is interesting. 4 cylinder vehicles have better mileage on average than 3 cylinder vehicles


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c45af90>

In [49]:
three_cyl = df[df.cylinders == 3]
print three_cyl['car name']
## Aha! Tiny Mazda roadsters...


71     "mazda rx2 coupe"
111          "maxda rx3"
243         "mazda rx-4"
334      "mazda rx-7 gs"
Name: car name, dtype: object

In [55]:
sns.violinplot(df.mpg, df['model year'])
# Fancy seaborn graphing


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x118eea050>

In [42]:
sns.barplot(df.mpg, df.horsepower)


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x10e30dd50>

In [57]:
sns.barplot(df.mpg, df.weight)


Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x119280550>

In [61]:
sns.boxplot(df.mpg, df.origin)
# Although the values of origin are not given, we can guess that 1=USA, 2=Europe and 3=Japan... Maybe...


Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x11acf56d0>
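As an aside, if we wanted those guessed region labels in the plot itself, a hypothetical mapping would do it (the 1=USA, 2=Europe, 3=Japan assignment is our guess, not documented in the dataset, and origin_name is a column name we are inventing here):

# Hypothetical: label origins with the guessed regions, then plot by name
df['origin_name'] = df['origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
sns.boxplot(df.mpg, df['origin_name'])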

In [66]:
sns.boxplot(df.mpg, df.acceleration)
# Little cars have pretty good acceleration AND good mileage, so not a great association


Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ba1ec10>

In [53]:
sns.kdeplot(df.mpg, df.cylinders)
# Showing different plot options in seaborn :-)


Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x118bd4910>

In [5]:
df.corr()


Out[5]:
                   mpg  cylinders  displacement    weight  acceleration  model year    origin
mpg           1.000000  -0.775396     -0.804203 -0.831741      0.420289    0.579267  0.563450
cylinders    -0.775396   1.000000      0.950721  0.896017     -0.505419   -0.348746 -0.562543
displacement -0.804203   0.950721      1.000000  0.932824     -0.543684   -0.370164 -0.609409
weight       -0.831741   0.896017      0.932824  1.000000     -0.417457   -0.306564 -0.581024
acceleration  0.420289  -0.505419     -0.543684 -0.417457      1.000000    0.288137  0.205873
model year    0.579267  -0.348746     -0.370164 -0.306564      0.288137    1.000000  0.180662
origin        0.563450  -0.562543     -0.609409 -0.581024      0.205873    0.180662  1.000000

Step 4: Re-engineer the data

Since MPG is a continuous variable and we want to treat this as a classification problem, let's convert the MPG values into bins of width 7 mpg: 0-6.9, 7-13.9, ..., 42-48.9 (matching the integer division in the code below).


In [185]:
# create bins of width 7 mpg: bin 0 = 0-6.9, bin 1 = 7-13.9, ..., bin 6 = 42-48.9
df['mpg_bin'] = (df.mpg/7).astype(int)
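A quick way to sanity-check the new labels (output omitted here) is to look at how many cars fall into each bin:

# Count cars per MPG bin; with mpg ranging from 9.0 to 46.6, bins 1 through 6 should be populated
print df['mpg_bin'].value_counts().sort_index()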

Ideally we should create dummy variables for the cylinder attribute, but we will introduce that in a future notebook.
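As a preview, a minimal sketch of what that would look like with pandas (cyl_dummies and df_with_dummies are names we are inventing here; the dummy column names are whatever get_dummies generates):

# Hypothetical preview: one-hot encode the cylinders column
cyl_dummies = pd.get_dummies(df['cylinders'], prefix='cyl')
df_with_dummies = pd.concat([df, cyl_dummies], axis=1)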

Step 5: Prepare data for analysis


In [153]:
# Create numpy arrays X and y with the predictor and target variables
# (note: we use the continuous mpg here, so what follows is a regression, not classification)
X = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y = df['mpg'].values

In [154]:
from sklearn.cross_validation import train_test_split  # moved to sklearn.model_selection in scikit-learn >= 0.18
from sklearn.linear_model import LinearRegression

In [155]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [156]:
model = LinearRegression()
model.fit(X_train, y_train)


Out[156]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [157]:
predictions = model.predict(X_test)

In [158]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])


Predicted: 25.0499584293, Actual: 28.0
Predicted: 12.2337741045, Actual: 14.0
Predicted: 13.5117221241, Actual: 15.0
Predicted: 29.8047684458, Actual: 30.0
Predicted: 36.5495843042, Actual: 35.1
Predicted: 33.7803148003, Actual: 44.0
Predicted: 29.4892067573, Actual: 21.1
Predicted: 28.3266772716, Actual: 26.0
Predicted: 17.1721649263, Actual: 17.0
Predicted: 26.9742442075, Actual: 25.5
Predicted: 22.8982030145, Actual: 23.0
Predicted: 18.6824556906, Actual: 20.0
Predicted: 27.4579480406, Actual: 24.0
Predicted: 18.1935114123, Actual: 16.5
Predicted: 17.2452166247, Actual: 18.0
Predicted: 25.3507310101, Actual: 30.0
Predicted: 26.5892250387, Actual: 24.0
Predicted: 25.8033098351, Actual: 28.0
Predicted: 15.5254958754, Actual: 17.0
Predicted: 13.6588113618, Actual: 13.0
Predicted: 26.6372411144, Actual: 21.5
Predicted: 28.4243632949, Actual: 27.0
Predicted: 22.4945780377, Actual: 25.0
Predicted: 23.2198757119, Actual: 19.4
Predicted: 16.7031806199, Actual: 13.0
Predicted: 29.8520953136, Actual: 29.8
Predicted: 29.7578091856, Actual: 29.9
Predicted: 15.440452794, Actual: 16.0
Predicted: 29.6063283447, Actual: 35.0
Predicted: 27.4949069647, Actual: 25.0
Predicted: 34.1255234396, Actual: 37.2
Predicted: 15.1893036855, Actual: 15.0
Predicted: 33.6430236549, Actual: 34.1
Predicted: 19.9940034414, Actual: 19.4
Predicted: 10.343641653, Actual: 13.0
Predicted: 11.6095669583, Actual: 14.0
Predicted: 28.3365168426, Actual: 30.9
Predicted: 30.9567586975, Actual: 43.4
Predicted: 32.2215885099, Actual: 33.5
Predicted: 26.0113805683, Actual: 24.0
Predicted: 17.1558236578, Actual: 18.0
Predicted: 34.4586973164, Actual: 36.0
Predicted: 21.6888621028, Actual: 21.0
Predicted: 23.8982715961, Actual: 17.6
Predicted: 31.1658012091, Actual: 37.3
Predicted: 14.4339951257, Actual: 15.0
Predicted: 10.3558576811, Actual: 14.0
Predicted: 15.3080948414, Actual: 17.5
Predicted: 13.0770756461, Actual: 14.0
Predicted: 29.2050502491, Actual: 27.5
Predicted: 26.9075339977, Actual: 20.0
Predicted: 20.4942298647, Actual: 21.0
Predicted: 20.9318040835, Actual: 18.0
Predicted: 31.6706603205, Actual: 32.0
Predicted: 31.4780892732, Actual: 34.3
Predicted: 26.1496413314, Actual: 20.2
Predicted: 19.0487638874, Actual: 15.5
Predicted: 35.6120245639, Actual: 37.0
Predicted: 32.19260658, Actual: 39.4
Predicted: 28.8745384008, Actual: 28.0
Predicted: 11.9594603726, Actual: 13.0
Predicted: 16.5989957781, Actual: 16.0
Predicted: 30.8691241301, Actual: 23.7
Predicted: 28.849647851, Actual: 30.0
Predicted: 23.7064095351, Actual: 23.0
Predicted: 28.3164718062, Actual: 31.0
Predicted: 25.4881237355, Actual: 26.0
Predicted: 20.8088772719, Actual: 25.0
Predicted: 13.352144677, Actual: 14.0
Predicted: 30.7630822511, Actual: 29.0
Predicted: 25.4026222795, Actual: 19.0
Predicted: 24.9566311007, Actual: 27.2
Predicted: 20.3452226406, Actual: 21.0
Predicted: 30.0771940639, Actual: 29.0
Predicted: 30.9338158764, Actual: 35.7
Predicted: 20.8279963225, Actual: 17.6
Predicted: 11.1418183067, Actual: 13.0
Predicted: 25.4536131603, Actual: 30.0
Predicted: 14.13628628, Actual: 15.0

In [159]:
model.score(X_test, y_test)  # for regressors, score() returns the R^2 coefficient of determination


Out[159]:
0.80198530681866198

In [160]:
sns.regplot(predictions, y_test)


Out[160]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f75cf50>

In [161]:
from sklearn.preprocessing import PolynomialFeatures

In [162]:
# degree=2 expands the features to include all squared and pairwise product terms
quad_model = PolynomialFeatures(degree=2)

In [163]:
quad_X_train = quad_model.fit_transform(X_train)
# Use transform (not fit_transform) on the test set; for PolynomialFeatures the result
# is identical, but fitting on test data is bad practice in general
quad_X_test = quad_model.transform(X_test)

In [164]:
model.fit(quad_X_train, y_train)


Out[164]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [165]:
predictions = model.predict(quad_X_test)

In [166]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])


Predicted: 25.9783112969, Actual: 28.0
Predicted: 12.6892715936, Actual: 14.0
Predicted: 14.4327752407, Actual: 15.0
Predicted: 31.0008766782, Actual: 30.0
Predicted: 39.2188394315, Actual: 35.1
Predicted: 38.9328707672, Actual: 44.0
Predicted: 27.1279177032, Actual: 21.1
Predicted: 29.5978388997, Actual: 26.0
Predicted: 15.9382114215, Actual: 17.0
Predicted: 26.6713224499, Actual: 25.5
Predicted: 21.439639328, Actual: 23.0
Predicted: 17.3938204511, Actual: 20.0
Predicted: 27.9918847574, Actual: 24.0
Predicted: 15.0387238525, Actual: 16.5
Predicted: 16.206121657, Actual: 18.0
Predicted: 26.9659114946, Actual: 30.0
Predicted: 26.654314322, Actual: 24.0
Predicted: 27.0373283906, Actual: 28.0
Predicted: 14.6349416834, Actual: 17.0
Predicted: 13.3798905418, Actual: 13.0
Predicted: 22.4539243095, Actual: 21.5
Predicted: 29.3881481725, Actual: 27.0
Predicted: 23.8246531356, Actual: 25.0
Predicted: 22.1689016437, Actual: 19.4
Predicted: 15.558920914, Actual: 13.0
Predicted: 29.0123825459, Actual: 29.8
Predicted: 32.5308964937, Actual: 29.9
Predicted: 14.4780766658, Actual: 16.0
Predicted: 29.661690794, Actual: 35.0
Predicted: 27.9194639711, Actual: 25.0
Predicted: 35.6385798891, Actual: 37.2
Predicted: 14.3741297831, Actual: 15.0
Predicted: 34.5589782658, Actual: 34.1
Predicted: 17.6826205854, Actual: 19.4
Predicted: 12.4114683836, Actual: 13.0
Predicted: 13.2029596299, Actual: 14.0
Predicted: 29.3034690831, Actual: 30.9
Predicted: 34.7776677008, Actual: 43.4
Predicted: 32.0048767965, Actual: 33.5
Predicted: 23.5114591917, Actual: 24.0
Predicted: 16.1690356647, Actual: 18.0
Predicted: 37.4006329487, Actual: 36.0
Predicted: 19.2787987867, Actual: 21.0
Predicted: 25.6510003632, Actual: 17.6
Predicted: 32.8068598466, Actual: 37.3
Predicted: 13.0429484287, Actual: 15.0
Predicted: 13.2794206462, Actual: 14.0
Predicted: 14.671917348, Actual: 17.5
Predicted: 13.5418039014, Actual: 14.0
Predicted: 26.896762336, Actual: 27.5
Predicted: 25.5259798939, Actual: 20.0
Predicted: 19.5875304528, Actual: 21.0
Predicted: 18.7515286033, Actual: 18.0
Predicted: 31.2443447791, Actual: 32.0
Predicted: 32.7809016386, Actual: 34.3
Predicted: 26.7753600089, Actual: 20.2
Predicted: 18.2898805362, Actual: 15.5
Predicted: 38.1765411879, Actual: 37.0
Predicted: 32.290045207, Actual: 39.4
Predicted: 29.9052839263, Actual: 28.0
Predicted: 13.0505270208, Actual: 13.0
Predicted: 15.6922569176, Actual: 16.0
Predicted: 28.4824063568, Actual: 23.7
Predicted: 30.3846640217, Actual: 30.0
Predicted: 22.8378305952, Actual: 23.0
Predicted: 29.7604847862, Actual: 31.0
Predicted: 25.5695388068, Actual: 26.0
Predicted: 21.347400332, Actual: 25.0
Predicted: 12.8989678062, Actual: 14.0
Predicted: 31.4111770371, Actual: 29.0
Predicted: 24.285805945, Actual: 19.0
Predicted: 26.2911040467, Actual: 27.2
Predicted: 19.7936285744, Actual: 21.0
Predicted: 31.0775232674, Actual: 29.0
Predicted: 32.961373788, Actual: 35.7
Predicted: 19.2959364203, Actual: 17.6
Predicted: 13.1636109685, Actual: 13.0
Predicted: 28.0628469911, Actual: 30.0
Predicted: 13.7963394397, Actual: 15.0

In [167]:
model.score(quad_X_test, y_test)


Out[167]:
0.8702860590107111

In [168]:
sns.regplot(predictions, y_test)


Out[168]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f847650>
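A tidier way to couple the quadratic expansion with the regression, not used above but worth knowing, is scikit-learn's Pipeline. A sketch under the same train/test split (quad_pipeline is a name we are inventing):

from sklearn.pipeline import Pipeline

# Chain the feature expansion and the regression into a single estimator,
# so fit/transform bookkeeping is handled for us
quad_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('regression', LinearRegression()),
])
quad_pipeline.fit(X_train, y_train)
print quad_pipeline.score(X_test, y_test)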

In [169]:
from sklearn.cross_validation import cross_val_score

In [170]:
scores = cross_val_score(model, quad_X_train, y_train, cv=10)  # 10-fold cross-validated R^2 scores

In [171]:
scores


Out[171]:
array([ 0.78553428,  0.84203549,  0.91184909,  0.88585189,  0.63062841,
        0.84906748,  0.87562603,  0.94502662,  0.85083039,  0.90269688])
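The individual fold scores vary quite a bit (roughly 0.63 to 0.95), so it is worth summarizing them (output omitted here):

# Mean and spread of the 10-fold R^2 scores
print scores.mean(), scores.std()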

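Finally, the introduction promised KNN, and Step 4 created the mpg_bin labels for exactly this purpose. A minimal sketch of the classifier (n_neighbors=5 is scikit-learn's default, not a tuned choice, and the variable names are ours):

from sklearn.neighbors import KNeighborsClassifier

# Use the binned MPG labels from Step 4 as the class variable
y_bins = df['mpg_bin'].values
Xb_train, Xb_test, yb_train, yb_test = train_test_split(X, y_bins, test_size=0.2)

knn = KNeighborsClassifier(n_neighbors=5)  # default k; not tuned here
knn.fit(Xb_train, yb_train)
print knn.score(Xb_test, yb_test)  # mean classification accuracy

Because KNN is distance-based, the unscaled weight column (values in the thousands) will dominate the Euclidean distance; scaling the features first, for example with sklearn.preprocessing.StandardScaler, would be a sensible addition before trusting the accuracy.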