Supervised Learning using scikit-learn

K-Nearest Neighbors (KNN) is one of the most basic classification algorithms and is a sensible first choice when you have little prior knowledge about the data.

It is based on the Euclidean distance between the test sample and the training samples. For any two points with p features, the Euclidean distance is the square root of the sum of the squared differences between the points in each dimension. For the mathematical explanation and theory, see the lecture notes: http://george1328.github.io/notes/DAT10_lec05_Regression_Regularization.pdf
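As a quick illustration, here is a minimal sketch of that distance computation with NumPy (the points a and b are made-up examples, not taken from the dataset):

import numpy as np

# two made-up points, each with p = 3 features
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# square root of the sum of squared per-feature differences
dist = np.sqrt(((a - b) ** 2).sum())
print(dist)   # 5.0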

We will use the Auto MPG dataset to practice predicting the fuel economy (MPG) of a car given its features.

Step 1: Get the data


In [1]:
import pandas as pd
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Column/feature labels are not included in the dataset, so we create a list of features using auto-mpg.names
features = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']

In [3]:
# Import the data directly into pandas from the URL; specify header=None as the column labels are not in the dataset
import urllib
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# The file is fixed-width-format, so we use read_fwf instead of read_csv
df = pd.read_fwf(urllib.urlopen(url), header=None)
df.columns = features
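# Note: this notebook runs on Python 2; in Python 3 the equivalent call is urllib.request.urlopen(url)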

In [4]:
# Alternatively, we can download the data
# We use the bang (!) within IPython Notebooks to run command-line statements directly from the notebook
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30286  100 30286    0     0  47352      0 --:--:-- --:--:-- --:--:-- 47321
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1660  100  1660    0     0   5404      0 --:--:-- --:--:-- --:--:--  5389

In [5]:
# Since this dataset has no column headings, we need to explicitly state names=features
df = pd.read_fwf("auto-mpg.data", names = features)
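# Passing na_values='?' to read_fwf would let pandas treat the missing horsepower entries as NaN up front,
# but we will handle them explicitly in the cleaning step below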

Step 2: Clean the data


In [6]:
# Use head, describe, info and unique to get a sense of the data
df.describe()


Out[6]:
mpg cylinders displacement weight acceleration model year origin
count 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000
mean 23.514573 5.454774 193.425879 2970.424623 15.568090 76.010050 1.572864
std 7.815984 1.701004 104.269838 846.841774 2.757689 3.697627 0.802055
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.250000 2223.750000 13.825000 73.000000 1.000000
50% 23.000000 4.000000 148.500000 2803.500000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3608.000000 17.175000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000

In [7]:
df.head()


Out[7]:
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18 8 307 130.0 3504 12.0 70 1 "chevrolet chevelle malibu"
1 15 8 350 165.0 3693 11.5 70 1 "buick skylark 320"
2 18 8 318 150.0 3436 11.0 70 1 "plymouth satellite"
3 16 8 304 150.0 3433 12.0 70 1 "amc rebel sst"
4 17 8 302 140.0 3449 10.5 70 1 "ford torino"

In [8]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(4), int64(3), object(2)

In [9]:
# Car name and horsepower are the only non-numeric fields. The name of a car is unlikely to influence MPG, so let's inspect horsepower.
df.horsepower.unique()


Out[9]:
array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)

In [10]:
(df.horsepower == '?').sum()


Out[10]:
6

In [11]:
# Let's convert horsepower to a numeric field so we can use it in our analysis
df['horsepower'] = df['horsepower'].convert_objects(convert_numeric = True)
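# Note: convert_objects is deprecated in newer pandas releases; pd.to_numeric(df['horsepower'], errors='coerce')
# is the modern equivalent (the '?' entries become NaN either way)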

In [12]:
# We will drop the 6 records that are missing horsepower. We could estimate (impute) these missing values,
# but for the sake of accuracy we will not; besides, it's only 6 missing values
df = df.dropna()

In [13]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null float64
weight          392 non-null float64
acceleration    392 non-null float64
model year      392 non-null int64
origin          392 non-null int64
car name        392 non-null object
dtypes: float64(5), int64(3), object(1)

Step 3: Get a sense of the data using Exploratory Data Analysis (EDA)


In [14]:
import seaborn as sns

In [15]:
sns.boxplot(df.mpg, df.cylinders)
# This is interesting: 4-cylinder vehicles have better mileage on average than 3-cylinder vehicles


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d489e90>

In [16]:
three_cyl = df[df.cylinders == 3]
print three_cyl['car name']
## Aha! Tiny Mazda roadsters...


71     "mazda rx2 coupe"
111          "maxda rx3"
243         "mazda rx-4"
334      "mazda rx-7 gs"
Name: car name, dtype: object

In [17]:
sns.violinplot(df.mpg, df['model year'])
# Fancy seaborn graphing


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d7669d0>

In [18]:
sns.barplot(df.mpg, df.horsepower)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d95ad10>

In [19]:
sns.barplot(df.mpg, df.weight)


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x10e25c1d0>

In [20]:
sns.boxplot(df.mpg, df.origin)
# Although the values of origin are not given, we can guess that 1=USA, 2=Europe and 3=Japan... Maybe...


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f538f90>

In [21]:
sns.boxplot(df.mpg, df.acceleration)
# Little cars have pretty good acceleration AND good mileage, so this is not a strong association


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x110210d50>

In [22]:
sns.kdeplot(df.mpg, df.cylinders)
# Showing different plot options in seaborn :-)


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1116ef090>

In [23]:
df.corr()


Out[23]:
mpg cylinders displacement horsepower weight acceleration model year origin
mpg 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329 0.580541 0.565209
cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683 -0.345647 -0.568932
displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800 -0.369855 -0.614535
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196 -0.416361 -0.455171
weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839 -0.309120 -0.585005
acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000 0.290316 0.212746
model year 0.580541 -0.345647 -0.369855 -0.416361 -0.309120 0.290316 1.000000 0.181528
origin 0.565209 -0.568932 -0.614535 -0.455171 -0.585005 0.212746 0.181528 1.000000
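Weight, displacement, horsepower and cylinders are all strongly (negatively) correlated with mpg, and strongly correlated with each other. A correlation heatmap makes this easier to scan; a minimal sketch, assuming a reasonably recent seaborn:

# annot=True prints the correlation coefficient inside each cell
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')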

Step 4: Re-engineer the data

Since MPG is a continuous variable and we want to treat this dataset as a classification problem, let's convert the MPG values into 5-mpg-wide bins: 5-9 mpg, 10-14 mpg, ..., 45-49 mpg.


In [24]:
# create bins. '0'=0-4mpg, '1'=5-9mpg, '2'=10-14mpg...'9'=45-49mpg.
df['mpg_bin'] = (df.mpg / 5).astype(int)

Ideally we should create dummy variables for the cylinders attribute, but we will introduce that in a future notebook.
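Although the rest of this notebook fits regression models to the continuous mpg values, the mpg_bin labels are exactly what a K-Nearest Neighbors classifier (the algorithm introduced at the top) would consume. A minimal sketch, with an illustrative feature list and k=5 (neither is tuned; the X_bin/y_bin names are just for this example):

from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# illustrative predictors, same columns used for the regression below
X_bin = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y_bin = df['mpg_bin'].values

Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_bin, y_bin, test_size=0.2)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xb_train, yb_train)
print(knn.score(Xb_test, yb_test))   # fraction of test cars assigned to the correct mpg bin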

Step 5: Prepare data for analysis


In [25]:
# Create numpy arrays X and y with the predictor variables and the target (mpg)
X = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y = df['mpg'].values

In [26]:
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
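# Note: sklearn.cross_validation was later renamed; in newer scikit-learn these helpers live in sklearn.model_selection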

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
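# No random_state is set, so the split (and every number below) will differ slightly from run to run;
# passing random_state (e.g. random_state=42) would make the results reproducible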

In [28]:
model = LinearRegression()
model.fit(X_train, y_train)


Out[28]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
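To see what the model actually learned, the fitted slope for each predictor and the intercept are available on the estimator. A quick sketch (the exact numbers depend on the random split above):

# the coefficients line up with the columns we put into X
for name, coef in zip(['weight', 'model year', 'horsepower', 'origin', 'displacement'], model.coef_):
    print '%s: %s' % (name, coef)
print 'intercept: %s' % model.intercept_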

In [29]:
predictions = model.predict(X_test)

In [30]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])


Predicted: 29.4308031249, Actual: 30.5
Predicted: 33.1952981092, Actual: 33.0
Predicted: 32.0181359381, Actual: 32.0
Predicted: 12.8002779599, Actual: 14.0
Predicted: 25.3061833209, Actual: 26.5
Predicted: 30.2339746281, Actual: 29.8
Predicted: 33.0487806024, Actual: 32.2
Predicted: 21.2135679363, Actual: 13.0
Predicted: 24.3574840944, Actual: 24.0
Predicted: 12.2274668373, Actual: 13.0
Predicted: 25.1870728772, Actual: 28.0
Predicted: 22.1411580969, Actual: 16.2
Predicted: 20.1410734066, Actual: 17.0
Predicted: 34.561472429, Actual: 37.2
Predicted: 23.8461972769, Actual: 23.0
Predicted: 36.0393892077, Actual: 37.0
Predicted: 24.7551353847, Actual: 24.5
Predicted: 17.2336762191, Actual: 17.0
Predicted: 19.4812922429, Actual: 18.0
Predicted: 24.8754183322, Actual: 23.8
Predicted: 25.6686880525, Actual: 19.0
Predicted: 29.830817277, Actual: 29.0
Predicted: 27.2672592875, Actual: 22.0
Predicted: 27.215946729, Actual: 20.0
Predicted: 11.639661399, Actual: 16.0
Predicted: 27.6098983166, Actual: 24.0
Predicted: 31.5006928976, Actual: 31.5
Predicted: 20.6940701595, Actual: 18.5
Predicted: 36.4312911981, Actual: 38.0
Predicted: 36.378747302, Actual: 31.0
Predicted: 26.8157281627, Actual: 28.8
Predicted: 23.1459743946, Actual: 23.9
Predicted: 29.2820020363, Actual: 27.2
Predicted: 11.4411525917, Actual: 14.0
Predicted: 9.5491124151, Actual: 13.0
Predicted: 13.4263078152, Actual: 13.0
Predicted: 5.98457756287, Actual: 13.0
Predicted: 25.9022740753, Actual: 21.6
Predicted: 22.0248296068, Actual: 19.0
Predicted: 8.93877839534, Actual: 11.0
Predicted: 34.0681885128, Actual: 34.1
Predicted: 6.8459693686, Actual: 12.0
Predicted: 20.3023752269, Actual: 15.0
Predicted: 20.9010794555, Actual: 20.0
Predicted: 21.3779477264, Actual: 19.0
Predicted: 23.9120067676, Actual: 22.0
Predicted: 20.4646029735, Actual: 17.5
Predicted: 32.6665228553, Actual: 31.9
Predicted: 29.8053168895, Actual: 21.1
Predicted: 36.2461867829, Actual: 38.0
Predicted: 31.2620442363, Actual: 36.1
Predicted: 11.4078090684, Actual: 14.0
Predicted: 28.4316996501, Actual: 33.5
Predicted: 31.0135434612, Actual: 35.7
Predicted: 12.1148872203, Actual: 14.0
Predicted: 25.3833115335, Actual: 24.3
Predicted: 20.8167184944, Actual: 20.2
Predicted: 35.065319018, Actual: 36.0
Predicted: 34.8101901299, Actual: 34.0
Predicted: 32.1628012083, Actual: 31.5
Predicted: 26.4670187118, Actual: 24.0
Predicted: 17.372988798, Actual: 18.0
Predicted: 23.9806431142, Actual: 17.0
Predicted: 19.1505446084, Actual: 22.0
Predicted: 28.3657297822, Actual: 31.0
Predicted: 24.5298420275, Actual: 25.5
Predicted: 10.1555970502, Actual: 14.0
Predicted: 34.5421360356, Actual: 36.1
Predicted: 20.3072004311, Actual: 17.7
Predicted: 23.1840191109, Actual: 23.0
Predicted: 26.3609303658, Actual: 27.9
Predicted: 12.0509238797, Actual: 14.0
Predicted: 22.9586054004, Actual: 24.0
Predicted: 26.3354950334, Actual: 24.0
Predicted: 20.9674225919, Actual: 20.5
Predicted: 33.0158374325, Actual: 39.0
Predicted: 28.8022864611, Actual: 27.0
Predicted: 16.7173825076, Actual: 16.0
Predicted: 23.0688222116, Actual: 18.1

In [31]:
model.score(X_test,y_test)


Out[31]:
0.82293456279632637
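For a regressor, model.score returns R², the fraction of the variance in mpg explained by the model on the test set. An error in the original units is often easier to interpret; a small sketch using sklearn's mean_squared_error (the square root is taken by hand):

from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, predictions))
print 'RMSE: %s mpg' % rmse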

In [32]:
sns.regplot(predictions, y_test)


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x111e67a10>

In [33]:
from sklearn.preprocessing import PolynomialFeatures

In [34]:
quad_model = PolynomialFeatures(degree=2)

In [35]:
quad_X_train = quad_model.fit_transform(X_train)
# use transform (not fit_transform) on the test set: the expander is already fitted on the training data
quad_X_test = quad_model.transform(X_test)

In [36]:
model.fit(quad_X_train, y_train)


Out[36]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
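The transform-then-fit steps above can also be chained into a single estimator with a scikit-learn Pipeline, which avoids transforming the train and test sets by hand. A sketch, assuming the same degree-2 expansion and the split created above (quad_pipeline is just an illustrative name):

from sklearn.pipeline import make_pipeline

quad_pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad_pipeline.fit(X_train, y_train)
print(quad_pipeline.score(X_test, y_test))   # should match the quadratic score computed below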

In [37]:
predictions = model.predict(quad_X_test)

In [38]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])


Predicted: 29.71347735, Actual: 30.5
Predicted: 33.4899146351, Actual: 33.0
Predicted: 32.362893245, Actual: 32.0
Predicted: 13.4082859132, Actual: 14.0
Predicted: 24.3829497547, Actual: 26.5
Predicted: 29.8749906141, Actual: 29.8
Predicted: 33.6948350224, Actual: 32.2
Predicted: 17.5600706525, Actual: 13.0
Predicted: 23.8637877057, Actual: 24.0
Predicted: 14.8968053527, Actual: 13.0
Predicted: 26.1080611058, Actual: 28.0
Predicted: 18.0564639908, Actual: 16.2
Predicted: 18.6731585909, Actual: 17.0
Predicted: 36.0656250472, Actual: 37.2
Predicted: 22.6938134494, Actual: 23.0
Predicted: 38.9293080217, Actual: 37.0
Predicted: 23.3073925207, Actual: 24.5
Predicted: 16.6326132369, Actual: 17.0
Predicted: 19.3480232633, Actual: 18.0
Predicted: 23.7547857787, Actual: 23.8
Predicted: 23.25694326, Actual: 19.0
Predicted: 31.0891028894, Actual: 29.0
Predicted: 25.3980068445, Actual: 22.0
Predicted: 25.2990155763, Actual: 20.0
Predicted: 10.0186085211, Actual: 16.0
Predicted: 27.9440881947, Actual: 24.0
Predicted: 32.7469694486, Actual: 31.5
Predicted: 19.1648743536, Actual: 18.5
Predicted: 39.4416044356, Actual: 38.0
Predicted: 39.2510147415, Actual: 31.0
Predicted: 23.876273622, Actual: 28.8
Predicted: 22.3229574106, Actual: 23.9
Predicted: 29.849009388, Actual: 27.2
Predicted: 13.820081633, Actual: 14.0
Predicted: 12.5526951652, Actual: 13.0
Predicted: 13.1434779351, Actual: 13.0
Predicted: 13.8083451654, Actual: 13.0
Predicted: 22.7683028754, Actual: 21.6
Predicted: 19.3613660338, Actual: 19.0
Predicted: 14.6846923328, Actual: 11.0
Predicted: 34.8785163439, Actual: 34.1
Predicted: 13.0789965183, Actual: 12.0
Predicted: 18.3477511496, Actual: 15.0
Predicted: 18.2219940558, Actual: 20.0
Predicted: 19.4376721247, Actual: 19.0
Predicted: 24.5357910022, Actual: 22.0
Predicted: 18.4981767768, Actual: 17.5
Predicted: 34.3948611477, Actual: 31.9
Predicted: 27.7151215546, Actual: 21.1
Predicted: 39.2665096613, Actual: 38.0
Predicted: 32.6600278312, Actual: 36.1
Predicted: 14.0845386878, Actual: 14.0
Predicted: 28.5754699889, Actual: 33.5
Predicted: 31.9966525014, Actual: 35.7
Predicted: 14.0454330029, Actual: 14.0
Predicted: 24.6972473406, Actual: 24.3
Predicted: 17.7208595876, Actual: 20.2
Predicted: 35.5326069675, Actual: 36.0
Predicted: 37.7768455848, Actual: 34.0
Predicted: 31.776165109, Actual: 31.5
Predicted: 24.6101626425, Actual: 24.0
Predicted: 16.3230510372, Actual: 18.0
Predicted: 20.7030081142, Actual: 17.0
Predicted: 19.0466418106, Actual: 22.0
Predicted: 29.4180151855, Actual: 31.0
Predicted: 23.1451225078, Actual: 25.5
Predicted: 14.265889449, Actual: 14.0
Predicted: 35.208092754, Actual: 36.1
Predicted: 13.8281300024, Actual: 17.7
Predicted: 22.6228167793, Actual: 23.0
Predicted: 24.523389664, Actual: 27.9
Predicted: 14.2653264212, Actual: 14.0
Predicted: 21.0913778325, Actual: 24.0
Predicted: 24.1214528607, Actual: 24.0
Predicted: 18.994029224, Actual: 20.5
Predicted: 36.4758511519, Actual: 39.0
Predicted: 29.6500911773, Actual: 27.0
Predicted: 15.8327425557, Actual: 16.0
Predicted: 18.7853727008, Actual: 18.1

In [39]:
model.score(quad_X_test,y_test)


Out[39]:
0.89217227801185983

In [40]:
sns.regplot(predictions, y_test)


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x111f61dd0>

In [41]:
from sklearn.cross_validation import cross_val_score

In [42]:
scores = cross_val_score(model, quad_X_train, y_train, cv=10)

In [43]:
scores


Out[43]:
array([ 0.89375634,  0.90110605,  0.87155007,  0.86614939,  0.71968059,
        0.8766902 ,  0.88787329,  0.7700916 ,  0.94054004,  0.80474785])
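The fold-to-fold scores vary quite a bit (roughly 0.72 to 0.94), so a compact summary is their mean and standard deviation:

print 'mean R^2: %s (+/- %s)' % (scores.mean(), scores.std())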

Using the attributes of a car to predict its MPG is not an exact science: manufacturers design cars around their marketing plans. Some cars are marketed as fuel efficient, some for comfort, some to appeal to a particular demographic (women, young adults, corporate buyers), and some for speed. Given those caveats, our model does a fairly good job.