K-Nearest Neighbors (KNN) is one of the simplest classification algorithms and is a reasonable first choice when you have no prior knowledge about the data.
It classifies a test sample based on its Euclidean distance to the training samples. For any two points with p features, the Euclidean distance is given by:
$$d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 +(x_{i2}-x_{j2})^2+...+(x_{ip}-x_{jp})^2}$$
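As a quick illustration, here is a minimal sketch of that distance computation in numpy (the feature values below are made up purely for demonstration):
In [ ]:
import numpy as np

# Two hypothetical cars described by (cylinders, displacement, weight) -- illustrative values only
x_i = np.array([4.0, 120.0, 2200.0])
x_j = np.array([8.0, 350.0, 4100.0])

# Euclidean distance, as in the formula above (equivalent to np.linalg.norm(x_i - x_j))
np.sqrt(np.sum((x_i - x_j) ** 2))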
We will use the Auto MPG dataset to practice using KNN to predict the fuel economy (MPG) of a car given its features.
In [1]:
import pandas as pd
%pylab inline
In [2]:
# Column/feature labels are not included in the dataset, so we create a list of feature names from auto-mpg.names
features = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
In [19]:
# Import the data directly into pandas from the url, specify header=None as column labels are not in dataset
from urllib.request import urlopen
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# The file is in fixed-width format, so we use read_fwf instead of read_csv
df = pd.read_fwf(urlopen(url), header=None)
df.columns = features
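As an aside, recent versions of pandas can read the URL directly, so the urlopen step is optional (a minimal alternative sketch):
In [ ]:
# pandas can fetch the URL itself; names= assigns the column labels in one step
df = pd.read_fwf(url, names=features)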
In [20]:
# Alternatively, we can download the data
# We use the bang (!) within IPython/Jupyter notebooks to run command-line statements directly from the notebook
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
In [3]:
# Since this dataset has no column headings, we need to explicitly pass names=features
df = pd.read_fwf("auto-mpg.data", names=features)
In [4]:
# Use head, describe, info and unique to get a sense of the data
df.describe()
Out[4]:
In [6]:
df.head()
Out[6]:
In [7]:
df.info()
In [8]:
# 'car name' and 'horsepower' are the only non-numeric fields. The name of a car is unlikely to have an influence on MPG.
df.horsepower.unique()
Out[8]:
In [9]:
(df.horsepower == '?').sum()
Out[9]:
In [10]:
# Let's convert horsepower to a numeric field so we can use it in our analysis; the '?' entries become NaN
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
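A quick sanity check that the '?' placeholders have become NaN after the conversion (a minimal sketch):
In [ ]:
# Should match the count of '?' values found above
df.horsepower.isnull().sum()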
In [11]:
# We will drop the 6 records that are missing horsepower. We could estimate (impute) these missing values,
# but since there are only 6 of them we will simply drop those rows
df = df.dropna()
In [12]:
df.info()
In [13]:
import seaborn as sns
In [14]:
sns.boxplot(x=df.cylinders, y=df.mpg)
# This is interesting: 4-cylinder vehicles have better mileage on average than 3-cylinder vehicles
Out[14]:
In [49]:
three_cyl = df[df.cylinders == 3]
print(three_cyl['car name'])
## Aha! Tiny Mazda roadsters...
In [55]:
sns.violinplot(x=df['model year'], y=df.mpg)
# Fancy seaborn graphing
Out[55]:
In [42]:
sns.barplot(x=df.mpg, y=df.horsepower)
Out[42]:
In [57]:
sns.barplot(x=df.mpg, y=df.weight)
Out[57]:
In [61]:
sns.boxplot(x=df.origin, y=df.mpg)
# Although the meanings of the origin codes are not given, we can guess that 1=USA, 2=Europe and 3=Japan... Maybe...
Out[61]:
In [66]:
sns.boxplot(x=df.acceleration, y=df.mpg)
# Little cars have pretty good acceleration AND good mileage, so not a great association
Out[66]:
In [53]:
sns.kdeplot(x=df.mpg, y=df.cylinders)
# Showing different plot options in seaborn :-)
Out[53]:
In [5]:
df.corr()
Out[5]:
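The same correlation matrix is often easier to scan as a heatmap; here is a minimal sketch (dropping the non-numeric 'car name' column first):
In [ ]:
# Heatmap view of the pairwise correlations between the numeric columns
sns.heatmap(df.drop(columns=['car name']).corr(), annot=True, cmap='coolwarm')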
Since MPG is a continuous variable and we want to treat this dataset as a classification problem, let's convert the MPG values into bins of 5 mpg each: 0-4 mpg, 5-9 mpg, ..., 45-49 mpg.
In [185]:
# create bins. '0'=0-4mpg, '1'=5-9mpg, '2'=10-14mpg...'9'=45-49mpg.
df['mpg_bin'] = (df.mpg / 5).astype(int)
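A quick look at how many cars land in each bin (a minimal sanity check on the class distribution):
In [ ]:
# Number of cars per 5-mpg bin
df['mpg_bin'].value_counts().sort_index()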
Ideally we would create dummy variables for the cylinders attribute, but we will introduce that in a future notebook.
In [153]:
# Create numpy arrays X and y with the predictor and target variables
X = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y = df['mpg'].values
In [154]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In [155]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
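Note that the split is random; passing a random_state makes it reproducible across runs (a hedged aside, not required for what follows):
In [ ]:
# Same split, but reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)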
In [156]:
model = LinearRegression()
model.fit(X_train, y_train)
Out[156]:
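It can be helpful to look at what the linear model learned; coef_ holds one coefficient per predictor column and intercept_ the bias term (a minimal sketch):
In [ ]:
# One coefficient per column of X (weight, model year, horsepower, origin, displacement), plus the intercept
print(model.coef_)
print(model.intercept_)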
In [157]:
predictions = model.predict(X_test)
In [158]:
for i, prediction in enumerate(predictions):
    print('Predicted: %s, Actual: %s' % (prediction, y_test[i]))
In [159]:
model.score(X_test, y_test)
Out[159]:
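score here is the R² of the linear model on the held-out data. A complementary, more interpretable metric is the root-mean-squared error in mpg units (a minimal sketch using sklearn's metrics module):
In [ ]:
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the linear model on the test set, in mpg
np.sqrt(mean_squared_error(y_test, predictions))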
In [160]:
sns.regplot(x=predictions, y=y_test)
Out[160]:
In [161]:
from sklearn.preprocessing import PolynomialFeatures
In [162]:
quad_model = PolynomialFeatures(degree=2)
In [163]:
# Fit the polynomial expansion on the training data and apply the same mapping to the test data
quad_X_train = quad_model.fit_transform(X_train)
quad_X_test = quad_model.transform(X_test)
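With degree=2, PolynomialFeatures expands the 5 original predictors into a bias column, the 5 linear terms, and all 15 squares and pairwise products, i.e. 21 columns in total. A quick check (a minimal sketch):
In [ ]:
# Shapes of the expanded design matrices: 21 columns = 1 bias + 5 linear + 15 quadratic terms
print(quad_X_train.shape)
print(quad_X_test.shape)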
In [164]:
model.fit(quad_X_train, y_train)
Out[164]:
In [165]:
predictions = model.predict(quad_X_test)
In [166]:
for i, prediction in enumerate(predictions):
    print('Predicted: %s, Actual: %s' % (prediction, y_test[i]))
In [167]:
model.score(quad_X_test, y_test)
Out[167]:
In [168]:
sns.regplot(x=predictions, y=y_test)
Out[168]:
In [169]:
from sklearn.model_selection import cross_val_score
In [170]:
scores = cross_val_score(model, quad_X_train, y_train, cv=10)
In [171]:
scores
Out[171]:
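cross_val_score returns one R² score per fold; summarizing them gives a sense of how stable the quadratic model is (a minimal sketch):
In [ ]:
# Mean and standard deviation of the 10 fold scores
print(scores.mean())
print(scores.std())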
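Finally, since the introduction above frames this as a KNN problem, here is a minimal, hedged sketch of how the binned target created earlier could be used with scikit-learn's KNeighborsClassifier (the feature list and n_neighbors=5 are illustrative choices, not tuned):
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# KNN is distance-based, so the features are standardized before fitting
X_knn = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y_knn = df['mpg_bin'].values

X_tr, X_te, y_tr, y_te = train_test_split(X_knn, y_knn, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)

# Fraction of test cars assigned to the correct mpg bin
knn.score(X_te, y_te)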