Assignnement 2: Prediction and Classification

Due: Thursday, April 30, 2015 11:59 PM

Introduction

Problem 3 is optional - for extra credit! Problems 1 and 2 will be graded for the Lab 2.

In this assignment you will be using regression and classification to explore different data sets.

First: You will use data from before 2002 in the Sean Lahman's Baseball Database to create a metric for picking baseball players using linear regression. This database contains the "complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more". Documentation provided here.

http://saberseminar.com/wp-content/uploads/2012/01/saber-web.jpg

Second: You will use the famous iris data set to perform a $k$-neareast neighbor classification using cross validation. While it was introduced in 1936, it is still one of the most popular example data sets in the machine learning community. Wikipedia describes the data set as follows: "The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres." Here is an illustration what the four features measure:

http://sebastianraschka.com/Images/2014_python_lda/iris_petal_sepal.png

Third: You will investigate the influence of higher dimensional spaces on the classification using another standard data set in machine learning called the The cars data set.

Load Python modules


In [1]:
# prepare the notebook for matplotlib
%matplotlib inline 

import requests 
import StringIO
import zipfile
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting 

# If this module is not already installed, you may need to install it. 
# You can do this by typing 'pip install seaborn' in the command line
import seaborn as sns 

import sklearn
import sklearn.datasets
import sklearn.cross_validation
import sklearn.decomposition
import sklearn.grid_search
import sklearn.neighbors
import sklearn.metrics


/usr/local/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Problem 1: Sabermetrics

Using data preceding the 2002 season pick 10 offensive players keeping the payroll under $20 million (assign each player the median salary). Predict how many games this team would win in a 162 game season.

In this problem we will be returning to the Sean Lahman's Baseball Database. From this database, we will be extract five data sets containing information such as yearly stats and standing, batting statistics, fielding statistics, player names, player salaries and biographical information. You will explore the data in this database from before 2002 and create a metric for picking players.

Problem 1(a)

Load in these CSV files from the Sean Lahman's Baseball Database. For this assignment, we will use the 'Teams.csv', 'Batting.csv', 'Salaries.csv', 'Fielding.csv', 'Master.csv' tables. Read these tables into separate pandas DataFrames with the following names.

CSV file name Name of pandas DataFrame
Teams.csv teams
Batting.csv players
Salaries.csv salaries
Fielding.csv fielding
Master.csv master

In [2]:
teams = pd.read_csv("data/Teams.csv")
players = pd.read_csv("data/Batting.csv")
salaries = pd.read_csv("data/Salaries.csv")
fielding = pd.read_csv("data/Fielding.csv")
master = pd.read_csv("data/Master.csv")

In [ ]:

Problem 1(b)

Calculate the median salary for each player and create a pandas DataFrame called medianSalaries with four columns: (1) the player ID, (2) the first name of the player, (3) the last name of the player and (4) the median salary of the player. Show the head of the medianSalaries DataFrame.


In [3]:
medians = salaries.groupby(["playerID"]).median()
merged = pd.merge(medians, master, left_index=True, right_on="playerID")
medianSalaries=merged[["playerID", "nameFirst", "nameLast", "salary"]]
medianSalaries.reset_index(inplace=True, drop=True)
medianSalaries.head()


Out[3]:
playerID nameFirst nameLast salary
0 aardsda01 David Aardsma 419000
1 aasedo01 Don Aase 612500
2 abadan01 Andy Abad 327000
3 abadfe01 Fernando Abad 451500
4 abbotje01 Jeff Abbott 255000

Problem 1(c)

Now, consider only team/season combinations in which the teams played 162 Games. Exclude all data from before 1947. Compute the per plate appearance rates for singles, doubles, triples, HR, and BB. Create a new pandas DataFrame called stats that has the teamID, yearID, wins and these rates.

Hint: Singles are hits that are not doubles, triples, nor HR. Plate appearances are base on balls plus at bats.


In [4]:
g_filter = teams["G"] == 162
year_filter = teams["yearID"] > 1947
filtered_teams = teams[g_filter & year_filter]

stats = filtered_teams[["teamID", "yearID", "W", "2B", "3B", "HR", "BB", "H", "AB"]].copy()
stats["S"] = stats["H"] - (stats["2B"] + stats["3B"] + stats["HR"])

## Calculate PA = AB + BB
stats['PA'] = stats['AB'] + stats['BB']

## For each of singles, doubles, triples, HR and BB, calculate plate appearance rate
for i in ['S','2B','3B','HR','BB']:
   stats[i] = stats[i]/stats['PA']

stats.drop('H', axis=1, inplace=True)
stats.drop('PA', axis=1, inplace=True)
stats.drop('AB', axis=1, inplace=True)
stats.sort("yearID", inplace=True)
stats.head()


/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:18: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
Out[4]:
teamID yearID W 2B 3B HR BB S
1366 LAA 1961 70 0.035708 0.003604 0.030958 0.111548 0.147748
1367 KC1 1961 61 0.035982 0.007829 0.014993 0.096618 0.164751
1395 CHN 1962 59 0.032461 0.009275 0.020868 0.083471 0.168930
1394 HOU 1962 64 0.028095 0.007767 0.017353 0.081474 0.173195
1391 ML1 1962 86 0.033780 0.006292 0.029972 0.096208 0.157808

In [5]:
#the yearly mean for each metric across teams -- in the next step I make these values 0
stats.groupby('yearID')["S","2B","3B","HR","BB"].mean().head()


Out[5]:
S 2B 3B HR BB
yearID
1961 0.156249 0.035845 0.005717 0.022975 0.104083
1962 0.165632 0.035853 0.006777 0.023811 0.088590
1963 0.162467 0.034020 0.006896 0.021254 0.080336
1964 0.167251 0.036336 0.006748 0.021548 0.079152
1965 0.160042 0.035539 0.006534 0.022693 0.085745

Problem 1(d)

Is there a noticeable time trend in the rates computed computed in Problem 1(c)?


In [6]:
for col in ["S","2B","3B","HR","BB"]:
    plt.scatter(stats.yearID, stats[col], alpha=0.5, marker='x')
    plt.title(col)
    plt.xlabel('Year')
    plt.ylabel('Rate')
    plt.show()


Problem 1(e)

Using the stats DataFrame from Problem 1(c), adjust the singles per PA rates so that the average across teams for each year is 0. Do the same for the doubles, triples, HR, and BB rates.


In [7]:
years = stats['yearID'].unique()

year_stats_collection = []

for year in years:
    year_selector = stats["yearID"] == year
    year_stats = stats[year_selector]
    year_stats[['S', '2B', '3B', 'HR', 'BB']] = stats[year_selector][['S', '2B', '3B', 'HR', 'BB']].apply(lambda df: (df - df.mean()) / df.std())
    year_stats_collection.append(year_stats)

stats_adj = pd.concat(year_stats_collection)


/usr/local/lib/python2.7/site-packages/pandas/core/frame.py:2320: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]

In [8]:
#yearly averages should eq 0
stats_adj.groupby('yearID')["S","2B","3B","HR","BB"].mean().head()


Out[8]:
S 2B 3B HR BB
yearID
1961 0.000000e+00 0.000000e+00 5.551115e-17 0.000000e+00 6.106227e-16
1962 4.804238e-15 1.468522e-15 6.863197e-16 -3.229740e-16 2.977416e-16
1963 -1.015061e-15 -2.347329e-15 4.440892e-16 6.344132e-16 -9.595499e-16
1964 2.415997e-16 2.220446e-16 4.239033e-16 -2.485386e-16 -3.229740e-16
1965 2.581269e-15 -5.048045e-16 3.590878e-16 -3.330669e-16 3.330669e-16

Problem 1(f)

Build a simple linear regression model to predict the number of wins from the average adjusted singles, double, triples, HR, and BB rates. To decide which of these terms to include fit the model to data from 2002 and compute the average squared residuals from predictions to years past 2002. Use the fitted model to define a new sabermetric summary: offensive predicted wins (OPW). Hint: the new summary should be a linear combination of one to five of the five rates.


In [9]:
from sklearn import linear_model
from sklearn.cross_validation import train_test_split

In [10]:
# labels=stats_adj[['S', '2B', '3B', 'HR', 'BB']].values
training_selector = stats_adj['yearID'] < 2002
test_selector = stats_adj['yearID'] >= 2002
labels=["S", "2B", "3B", "HR", "BB"]

training_data = stats_adj[training_selector]
test_data = stats_adj[test_selector]
X_train = training_data[labels].values
y_train = training_data[['W']].values
X_test = test_data[labels].values
y_test = test_data[['W']].values
# (stats_adj.shape, X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [11]:
labels=["S", "2B", "3B", "HR", "BB"]
data = stats_adj[labels].values
target = stats_adj[['W']].values
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.4, random_state=9)

In [12]:
(X_test.shape, y_test.shape, X_train.shape)


Out[12]:
((394, 5), (394, 1), (590, 5))

In [13]:
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)


Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [14]:
#  The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))

# Plot outputs
# plt.scatter(X_test, y_test,  color='black')
# plt.plot(X_test, regr.predict(X_test), color='blue',linewidth=3)

# plt.xticks(())
# plt.yticks(())

# plt.show()


('Coefficients: \n', array([[ 3.97105936,  2.64799083,  1.17327139,  4.11374163,  4.82718991]]))
Residual sum of squares: 85.33
Variance score: 0.35

In [15]:
pd.DataFrame(regr.coef_)


Out[15]:
0 1 2 3 4
0 3.971059 2.647991 1.173271 4.113742 4.82719

In [16]:
plt.scatter(stats_adj[["HR"]].values, stats_adj[["W"]].values,  color='black', marker="x")
plt.scatter(stats_adj[["3B"]].values, stats_adj[["W"]].values,  color='red', marker="x")
plt.scatter(stats_adj[["2B"]].values, stats_adj[["W"]].values,  color='blue', marker="x")
plt.scatter(stats_adj[["S"]].values, stats_adj[["W"]].values,  color='green', marker="x")
plt.scatter(stats_adj[["BB"]].values, stats_adj[["W"]].values,  color='purple', marker="x")


Out[16]:
<matplotlib.collections.PathCollection at 0x1119159d0>

Your answer here:

Problem 1(g)

Now we will create a similar database for individual players. Consider only player/year combinations in which the player had at least 500 plate appearances. Consider only the years we considered for the calculations above (after 1947 and seasons with 162 games). For each player/year compute singles, doubles, triples, HR, BB per plate appearance rates. Create a new pandas DataFrame called playerstats that has the playerID, yearID and the rates of these stats. Remove the average for each year as for these rates as done in Problem 1(e).


In [17]:
players["PA"] = players["AB"] + players["BB"]
appearances_selector = players["PA"] >= 500 #assuming appearances = AB
year_selector = players["yearID"] > 1947
games_selector = players["G"] == 162


filtered_players = players[appearances_selector & year_selector & games_selector]

playerstats = filtered_players[["playerID", "yearID", "2B", "3B", "HR", "BB", "H", "AB"]].copy()
playerstats["S"] = playerstats["H"] - (playerstats["2B"] + playerstats["3B"] + playerstats["HR"])

## Calculate PA = AB + BB
playerstats['PA'] = playerstats['AB'] + playerstats['BB']

## For each of singles, doubles, triples, HR and BB, calculate plate appearance rate
for i in ['S','2B','3B','HR','BB']:
   playerstats[i] = playerstats[i]/playerstats['PA']

playerstats.drop('AB', axis=1, inplace=True)
playerstats.drop('PA', axis=1, inplace=True)
playerstats.drop('H', axis=1, inplace=True)
playerstats.index
playerstats.sort("yearID", inplace=True)
playerstats.tail()


/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:23: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
Out[17]:
playerID yearID 2B 3B HR BB S
14211 castrst01 2012 0.042522 0.017595 0.020528 0.052786 0.187683
97506 pencehu01 2013 0.051395 0.007342 0.039648 0.076358 0.162996
96755 butlebi03 2013 0.040847 0.000000 0.022693 0.119516 0.190620
96968 fieldpr01 2013 0.051502 0.000000 0.035765 0.107296 0.161660
97826 vottojo01 2013 0.041899 0.004190 0.033520 0.188547 0.167598

In [18]:
years = playerstats['yearID'].unique()

year_stats_collection = []

for year in years:
    year_selector = playerstats["yearID"]==year
    year_stats = playerstats[year_selector]
    year_stats[['S', '2B', '3B', 'HR', 'BB']] = playerstats[year_selector][['S', '2B', '3B', 'HR', 'BB']].apply(lambda df: (df - df.mean()) / df.std())
    year_stats_collection.append(year_stats)

playerstats_adj = pd.concat(year_stats_collection)

In [19]:
year_sel = playerstats_adj['yearID'] == 2002
playerstats_adj[year_sel].mean(0)


Out[19]:
yearID    2.002000e+03
2B       -3.700743e-16
3B        7.401487e-17
HR       -2.775558e-17
BB        3.515706e-16
S         5.736152e-16
dtype: float64

Show the head of the playerstats DataFrame.


In [20]:
playerstats_adj.head()


Out[20]:
playerID yearID 2B 3B HR BB S
72882 richabo01 1961 0.707107 -0.707107 -0.707107 -0.707107 0.707107
94941 woodja01 1961 -0.707107 0.707107 0.707107 0.707107 -0.707107
73959 robinbr01 1962 -0.165696 1.605472 -0.693658 -1.004793 0.967074
55480 mayswi01 1962 0.374270 0.142187 1.438874 0.351343 -1.227364
74049 robinfr02 1962 1.814792 -0.863346 0.660623 0.333819 -0.061390

Problem 1(h)

Using the playerstats DataFrame created in Problem 1(g), create a new DataFrame called playerLS containing the player's lifetime stats. This DataFrame should contain the playerID, the year the player's career started, the year the player's career ended and the player's lifetime average for each of the quantities (singles, doubles, triples, HR, BB). For simplicity we will simply compute the avaerage of the rates by year (a more correct way is to go back to the totals).


In [21]:
left = playerstats_adj.groupby(["playerID"]).mean().reset_index()
playerLS = master[["playerID","debut","finalGame"]].merge(left, how='inner', on="playerID")
playerLS["debut"] = playerLS.debut.apply(lambda x: int(x[0:4]))
playerLS["finalGame"] = playerLS.finalGame.apply(lambda x: int(x[0:4]))
playerLS.drop('yearID', axis=1, inplace=True)

Show the head of the playerLS DataFrame.


In [22]:
playerLS.head()


Out[22]:
playerID debut finalGame 2B 3B HR BB S
0 abreubo01 1996 2012 0.918221 -0.345258 -0.187583 1.431686 -0.954927
1 allendi01 1963 1977 1.289085 1.509213 0.830369 0.473426 -0.188239
2 alomasa01 1964 1978 -0.707107 -0.707107 -0.707107 -0.707107 0.707107
3 alouma01 1960 1974 1.649904 1.226444 -1.157587 -0.969189 1.395309
4 bagweje01 1991 2005 0.975857 -0.422082 0.751760 1.235499 -0.996528

Problem 1(i)

Compute the OPW for each player based on the average rates in the playerLS DataFrame. You can interpret this summary statistic as the predicted wins for a team with 9 batters exactly like the player in question. Add this column to the playerLS DataFrame. Call this colum OPW.


In [23]:
playerLS["OPW"] = regr.predict(playerLS[["S", "2B", "3B", "HR", "BB"]].values)
playerLS.head()


Out[23]:
playerID debut finalGame 2B 3B HR BB S OPW
0 abreubo01 1996 2012 0.918221 -0.345258 -0.187583 1.431686 -0.954927 85.435445
1 allendi01 1963 1977 1.289085 1.509213 0.830369 0.473426 -0.188239 91.199738
2 alomasa01 1964 1978 -0.707107 -0.707107 -0.707107 -0.707107 0.707107 74.845534
3 alouma01 1960 1974 1.649904 1.226444 -1.157587 -0.969189 1.395309 82.970071
4 bagweje01 1991 2005 0.975857 -0.422082 0.751760 1.235499 -0.996528 88.249913

Problem 1(j)

Add four columns to the playerLS DataFrame that contains the player's position (C, 1B, 2B, 3B, SS, LF, CF, RF, or OF), first name, last name and median salary.


In [25]:
pos_df = fielding.groupby(["playerID"])["POS"].agg(lambda x:x.value_counts().index[0])
pos_df = pd.DataFrame(pos_df).reset_index()

In [26]:
playerLS = pos_df.merge(playerLS, how='inner', on="playerID")

In [27]:
playerLS_merged = playerLS.merge(medianSalaries, how='inner', on=['playerID'])

Show the head of the playerLS DataFrame.


In [28]:
playerLS_merged.head()


Out[28]:
playerID POS debut finalGame 2B 3B HR BB S OPW nameFirst nameLast salary
0 abreubo01 RF 1996 2012 0.918221 -0.345258 -0.187583 1.431686 -0.954927 85.435445 Bobby Abreu 9000000.0
1 bagweje01 1B 1991 2005 0.975857 -0.422082 0.751760 1.235499 -0.996528 88.249913 Jeff Bagwell 6875000.0
2 bayja01 LF 2003 2013 0.778678 0.330081 0.426752 1.037695 -0.875178 86.800325 Jason Bay 4750000.0
3 baylodo01 DH 1970 1988 0.352534 -0.469633 1.436595 0.817097 -0.855471 87.901237 Don Baylor 695909.5
4 bellbu01 3B 1972 1989 1.464895 -0.470834 -0.363651 -0.959318 0.645826 80.826252 Buddy Bell 830454.5

Problem 1(k)

Subset the playerLS DataFrame for players active in 2002 and 2003 and played at least three years. Plot and describe the relationship bewteen the median salary (in millions) and the predicted number of wins.


In [29]:
_3_year_selector = playerLS_merged["finalGame"] - playerLS_merged["debut"] >= 3
start_selector = playerLS_merged["debut"] <= 2002
end_selector = playerLS_merged["finalGame"] >= 2003

active = playerLS_merged[_3_year_selector & start_selector & end_selector]
fig = plt.figure()
ax = fig.gca()
ax.scatter(active.salary, active.OPW)
ax.set_xlabel('Salary ')
ax.set_ylabel('OPW')
ax.set_title('Relationship between Salary and Predicted Number of Wins')
plt.show()


Problem 1(L)

Pick one players from one of each of these 10 position C, 1B, 2B, 3B, SS, LF, CF, RF, DH, or OF keeping the total median salary of all 10 players below 20 million. Report their averaged predicted wins and total salary.


In [78]:
positions = list(set(playerLS_merged['POS'].values))


team = playerLS_merged.groupby('POS').min().reset_index()
team


Out[78]:
POS playerID debut finalGame 2B 3B HR BB S OPW nameFirst nameLast salary
0 1B bagweje01 1969 1986 -1.590516 -1.360410 -1.405985 -1.062170 -1.147259 75.434428 Adrian Bagwell 370833.5
1 2B biggicr01 1970 1986 -0.707107 -0.707107 -0.803776 -0.827785 -1.270748 79.458850 Bill Biggio 600000.0
2 3B bellbu01 1972 1988 -0.759137 -1.205730 -1.698594 -1.389863 -1.262175 70.389584 Aaron Bell 297500.0
3 CF beltrca01 1976 1986 -0.967604 -0.400189 -1.537022 -0.943648 -1.396018 75.916727 Adam Beltran 400000.0
4 DH baylodo01 1968 1987 0.352534 -0.469633 -0.707107 -0.707107 -1.381041 80.249615 Don Baylor 695909.5
5 LF bayja01 1996 2012 0.388996 -0.838907 -0.601539 -1.047430 -0.875178 76.458745 Carlos Bay 1440000.0
6 OF cartejo01 1963 1986 -1.427923 -1.084184 -1.592864 -0.778285 -1.920573 74.133137 Al Carter 380000.0
7 RF abreubo01 1963 1985 -0.627936 -0.345258 -0.952054 -0.983216 -1.351377 76.665733 Bobby Abreu 300000.0
8 SS bowala01 1969 1985 -1.118849 -0.875001 -1.379656 -1.211343 -0.942530 72.461843 Alex Bowa 450000.0

In [72]:
team["salary"].sum()


Out[72]:
4934243.0

In [73]:
team["OPW"].mean()


Out[73]:
75.685406794999182

Problem 1(m)

What do these players outperform in? Singles, doubles, triples HR or BB?


In [81]:
attr = ['POS','2B', '3B', 'HR', 'BB', 'S']
team.sort_values(by='OPW', ascending=False)[attr].drop_duplicates().sum()


Out[81]:
POS    DH2BRFLFCF1BOFSS3B
2B               -6.45754
3B               -7.28642
HR               -10.6786
BB               -8.95085
S                -11.5469
dtype: object

Your answer here:

Our team excells in 2B

Use one of the classification methods to predict wheather a player will be an Allstar?


In [93]:
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

#Use AllstarFull.csv to determine which players are Allstars
allstar = pd.read_csv('Data/AllstarFull.csv')
allstar = allstar[['playerID','yearID','GP']].drop_duplicates()

players_allstar = pd.merge(players,allstar,on=['playerID'],how ='inner')
players_allstar['GP'] = players_allstar['GP'].fillna(0)


x = players_allstar.drop('GP',axis=1)._get_numeric_data().fillna(0)
y = players_allstar['GP']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train) 
z = pd.DataFrame(zip(rfc.predict(x_test),y_test),columns = ['predicted','actual'])
z['new'] = z['actual']-z['predicted']

print 'Score:','%.2f'%(100 - z[z['new']!=0]['new'].count()*100.0/z['new'].count()),'%'


Score: 100.00 %

Discussion for Problem 1

The combination of grid search and 10 fold cross validation had had helped us pick a more accurate value of k for our KNN classifier. This helped us create an accurate KNN classifier for the iris dataset.

Problem 2: $k$-Nearest Neighbors and Cross Validation

What is the optimal $k$ for predicting species using $k$-nearest neighbor classification on the four features provided by the iris dataset.

In this problem you will get to know the famous iris data set, and use cross validation to select the optimal $k$ for a $k$-nearest neighbor classification. This problem set makes heavy use of the sklearn library. In addition to Pandas, it is one of the most useful libraries for data scientists. For the Iris data set sklearn provides an extra function to load it - since it is one of the very commonly used data sets.


In [34]:
#load the iris data set
from sklearn.datasets import load_iris
iris = load_iris()

Problem 2(a)

Split the data into a train and a test set. Use a random selection of 33% of the samples as test data. Sklearn provides the train_test_split function for this purpose. Print the dimensions of all the train and test data sets you have created.


In [35]:
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=9)

Problem 2(b)

Use ten fold cross validation to estimate the optimal value for $k$ for the iris data set.

Note: For your convenience sklearn does not only include the KNN classifier, but also a grid search function. The function is called grid search, because if you have to optimize more than one parameter, it is common practice to define a range of possible values for each parameter. An exhaustive search then runs over the complete grid defined by all the possible parameter combinations. This can get very computation heavy, but luckily our KNN classifier only requires tuning of a single parameter for this problem set.


In [36]:
# use cross validation to find the optimal value for k
k = np.arange(20)+1

parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)


Out[36]:
GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

Problem 2(c)

Visualize the result by plotting the score results versus values for $k$.


In [37]:
a = clf.grid_scores_
scores = [b.cv_validation_scores for b in a]

score_means = np.mean(scores, axis=1)

sns.boxplot(scores)
plt.scatter(k,score_means, c='k', zorder=2)
plt.ylim(0.8, 1.1)
plt.title('Accuracy as a function of $k$')
plt.ylabel('Accuracy')
plt.xlabel('Choice of k')
plt.show()


/usr/local/lib/python2.7/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Verify that the grid search has indeed chosen the right parameter value for $k$.


In [38]:
clf.best_params_


Out[38]:
{'n_neighbors': 13}

Problem 2(d)

Test the performance of our tuned KNN classifier on the test set.


In [39]:
from sklearn.metrics import classification_report
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() * 2, params))
    
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = Y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))


0.930 (+/-0.145) for {'n_neighbors': 1}
0.920 (+/-0.180) for {'n_neighbors': 2}
0.930 (+/-0.177) for {'n_neighbors': 3}
0.910 (+/-0.143) for {'n_neighbors': 4}
0.950 (+/-0.127) for {'n_neighbors': 5}
0.950 (+/-0.127) for {'n_neighbors': 6}
0.950 (+/-0.127) for {'n_neighbors': 7}
0.930 (+/-0.123) for {'n_neighbors': 8}
0.950 (+/-0.130) for {'n_neighbors': 9}
0.940 (+/-0.127) for {'n_neighbors': 10}
0.950 (+/-0.130) for {'n_neighbors': 11}
0.950 (+/-0.130) for {'n_neighbors': 12}
0.960 (+/-0.129) for {'n_neighbors': 13}
0.940 (+/-0.127) for {'n_neighbors': 14}
0.940 (+/-0.127) for {'n_neighbors': 15}
0.930 (+/-0.123) for {'n_neighbors': 16}
0.940 (+/-0.128) for {'n_neighbors': 17}
0.930 (+/-0.123) for {'n_neighbors': 18}
0.930 (+/-0.123) for {'n_neighbors': 19}
0.920 (+/-0.152) for {'n_neighbors': 20}
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        20
          1       0.94      1.00      0.97        16
          2       1.00      0.93      0.96        14

avg / total       0.98      0.98      0.98        50


In [40]:
clf.best_score_


Out[40]:
0.95999999999999996

Discussion for Problem 2

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.


Problem 3: Supervised Learning - Estimating Boston house pricing using Linear Regression and Regression Trees

Import the Boston House Pricing Dataset; it comes with the scikit or you can dowload it FROM UCI ML cars dataset. (https://archive.ics.uci.edu/ml/datasets/Automobile)

In [94]:
header = ['symboling','normalized-losses','make','fuel-type','aspiration',
          'num-of-doors', 'body-style','drive-wheels','engine-location',
          'wheel-base','length','width','height','curb-weight','engine-type',
          'num-of-cylinders','engine-size','fuel-system','bore','stroke',
          'compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg',
          'price']

auto = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', names=header)
auto.head()


Out[94]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8 115 5500 18 22 17450

5 rows × 26 columns

Find the mostimportant features


In [95]:
#Replace '?' with NaN
auto = auto.replace(to_replace='?', value=np.nan)

#Convert relevant numeric data from object to float
change_num = ['bore','stroke','horsepower','peak-rpm','price']
auto[change_num]=auto[change_num].astype(float, inplace=True)

#Select columns with numbers
num_auto = auto._get_numeric_data().dropna()
num_auto.head()


Out[95]:
symboling wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9 111 5000 21 27 13495
1 3 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9 111 5000 21 27 16500
2 1 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9 154 5000 19 26 16500
3 2 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10 102 5500 24 30 13950
4 2 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8 115 5500 18 22 17450

In [97]:
import math
from math import isnan

def corr(list1, list2):
    '''Find correlation between 2 datasets. Removes NaNs'''
    #Remove Nan:
    new_list1 = []
    new_list2 = []
    for i1,i2 in zip(list1,list2):
        if (isnan(i1)==False) & (isnan(i2)==False):
            new_list1.append(float(i1))
            new_list2.append(float(i2))
    cov = np.cov([new_list1,new_list2])[0][1]
    std1 = np.std(new_list1)
    std2 = np.std(new_list2)
    corr = cov/(std1*std2)
    return corr

#Taking features that has an abolute correlation of 0.5+ against price
feature_list = num_auto.columns[0:-1]
price = num_auto['price']

for feature in feature_list:
    attr = num_auto[feature]
    c = corr(attr,price)
    if np.sqrt(c**2) > 0.7:
        print feature,':\t\t%.2f'%c


width :		0.76
curb-weight :		0.84
engine-size :		0.89
horsepower :		0.82
city-mpg :		-0.71
highway-mpg :		-0.72

Using 10-fold cross validation separate the test and training data sets


In [98]:
#Attributes with highest correlation
attr = ['width', 'curb-weight', 'engine-size', 'horsepower', 'highway-mpg','price','city-mpg']

#Picking important features
new_auto = auto[attr].dropna()
X = new_auto[new_auto.columns - ['price']]
y = new_auto['price']

k_fold = sklearn.cross_validation.KFold(len(X), n_folds=10, shuffle=True, random_state=44)


/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:6: FutureWarning: using '-' to provide set differences with Indexes is deprecated, use .difference()

In [99]:
#Normalize the data

X = (X - X.mean()) / (X.max() - X.min())
X.head()


Out[99]:
city-mpg curb-weight engine-size highway-mpg horsepower width
0 -0.116695 -0.003115 0.011984 -0.096932 0.035528 -0.152343
1 -0.116695 -0.003115 0.011984 -0.096932 0.035528 -0.152343
2 -0.172250 0.103557 0.095003 -0.123248 0.236463 -0.032685
3 -0.033361 -0.084961 -0.067261 -0.017985 -0.006528 0.027144
4 -0.200028 0.103945 0.034626 -0.228511 0.054220 0.044238

In [ ]:

Start with a lineal model and evaluate how well it can predict the price variable


In [100]:
scores = [] 

for train_index, test_index in k_fold:
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    lm = sklearn.linear_model.LinearRegression()
    lm.fit(X_train, y_train)
    scores.append(lm.score(X_test, y_test))
    
print 'Score:\t', '%.2f'%(np.mean(scores)*100),'%'


Score:	73.10 %

Try using Ridge regression and evaluate the result of the 10-fold cross-validation


In [101]:
### Your code here ###

scores = [] 

for train_index, test_index in k_fold:
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    rm = sklearn.linear_model.Ridge()
    rm.fit(X_train, y_train)
    predict = rm.predict(X_test)
    score = float(np.sum([i==j for i,j in zip(predict,y_test)])/len(y_test))
    scores.append(rm.score(X_test, y_test))
    
print 'Score:\t', '%.2f'%(np.mean(scores)*100),'%'


Score:	73.82 %

Train the Regression Tree and evaluate using 10-fold cross validation; Specify the parapmeters used and how you changed them to increase the accuracy;


In [102]:
### Your code here ###

dtr = sklearn.tree.DecisionTreeRegressor()
grid_search = sklearn.grid_search.GridSearchCV(dtr, 
                                               {'min_samples_split':list(range(1,10)), 
                                                'min_samples_leaf':list(range(1,10)), 
                                                'min_weight_fraction_leaf':np.linspace(0,0.5,10)}, 
                                               cv=10)
grid_search.fit(X_train, y_train)
print 'Best Parameters:'
grid_search.best_params_


Best Parameters:
Out[102]:
{'min_samples_leaf': 9,
 'min_samples_split': 1,
 'min_weight_fraction_leaf': 0.0}

In [103]:
score = grid_search.score(X_test, y_test)

print 'Score:\t','%.2f'%(score*100),'%'


Score:	87.07 %

Discussion for Problem 3 Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.Compare all three aproaches and discuss your findings in 100 words or less

Both the linear and ridge regression models give simialr accuracy scores. Therefore the attributes must have a high degree of independence. The Regression Tree model gave the highest accuracy.


In [ ]: