Assignment 2: Prediction and Classification
Due: Thursday, April 30, 2015 11:59 PM
Problem 3 is optional - for extra credit! Problems 1 and 2 will be graded for Lab 2.
In this assignment you will be using regression and classification to explore different data sets.
First: You will use data from before 2002 in Sean Lahman's Baseball Database to create a metric for picking baseball players using linear regression. This database contains the "complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more". Documentation provided here.
Second: You will use the famous iris data set to perform a $k$-nearest neighbor classification using cross validation. While it was introduced in 1936, it is still one of the most popular example data sets in the machine learning community. Wikipedia describes the data set as follows: "The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres." Here is an illustration of what the four features measure:
Third: You will investigate the influence of higher-dimensional spaces on classification using another standard data set in machine learning, the cars data set.
In [1]:
# prepare the notebook for matplotlib
%matplotlib inline
import requests
import StringIO
import zipfile
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting
# If this module is not already installed, you may need to install it.
# You can do this by typing 'pip install seaborn' in the command line
import seaborn as sns
import sklearn
import sklearn.datasets
import sklearn.cross_validation
import sklearn.decomposition
import sklearn.grid_search
import sklearn.neighbors
import sklearn.metrics
Using data preceding the 2002 season, pick 10 offensive players keeping the payroll under $20 million (assign each player his median salary). Predict how many games this team would win in a 162-game season.
In this problem we will be returning to Sean Lahman's Baseball Database. From this database we will extract five data sets containing information such as yearly team stats and standings, batting statistics, fielding statistics, player names, player salaries and biographical information. You will explore the data in this database from before 2002 and create a metric for picking players.
Load in these CSV files from Sean Lahman's Baseball Database. For this assignment, we will use the 'Teams.csv', 'Batting.csv', 'Salaries.csv', 'Fielding.csv' and 'Master.csv' tables. Read these tables into separate pandas DataFrames with the following names.
CSV file name | Name of pandas DataFrame |
---|---|
Teams.csv | teams |
Batting.csv | players |
Salaries.csv | salaries |
Fielding.csv | fielding |
Master.csv | master |
In [2]:
teams = pd.read_csv("data/Teams.csv")
players = pd.read_csv("data/Batting.csv")
salaries = pd.read_csv("data/Salaries.csv")
fielding = pd.read_csv("data/Fielding.csv")
master = pd.read_csv("data/Master.csv")
In [ ]:
In [3]:
medians = salaries.groupby(["playerID"]).median()
merged = pd.merge(medians, master, left_index=True, right_on="playerID")
medianSalaries=merged[["playerID", "nameFirst", "nameLast", "salary"]]
medianSalaries.reset_index(inplace=True, drop=True)
medianSalaries.head()
Out[3]:
Now, consider only team/season combinations in which the teams played 162 games. Exclude all data from before 1947. Compute the per-plate-appearance rates for singles, doubles, triples, HR, and BB. Create a new pandas DataFrame called stats
that has the teamID, yearID, wins and these rates.
Hint: Singles are hits that are not doubles, triples, nor HR. Plate appearances are base on balls plus at bats.
In [4]:
g_filter = teams["G"] == 162
year_filter = teams["yearID"] >= 1947  # keep 1947 and later
filtered_teams = teams[g_filter & year_filter]
stats = filtered_teams[["teamID", "yearID", "W", "2B", "3B", "HR", "BB", "H", "AB"]].copy()
stats["S"] = stats["H"] - (stats["2B"] + stats["3B"] + stats["HR"])
## Calculate PA = AB + BB
stats['PA'] = stats['AB'] + stats['BB']
## For each of singles, doubles, triples, HR and BB, calculate plate appearance rate
for i in ['S','2B','3B','HR','BB']:
    stats[i] = stats[i]/stats['PA']
stats.drop('H', axis=1, inplace=True)
stats.drop('PA', axis=1, inplace=True)
stats.drop('AB', axis=1, inplace=True)
stats.sort("yearID", inplace=True)
stats.head()
Out[4]:
In [5]:
#the yearly mean for each metric across teams -- in the next step I make these values 0
stats.groupby('yearID')["S","2B","3B","HR","BB"].mean().head()
Out[5]:
In [6]:
for col in ["S","2B","3B","HR","BB"]:
    plt.scatter(stats.yearID, stats[col], alpha=0.5, marker='x')
    plt.title(col)
    plt.xlabel('Year')
    plt.ylabel('Rate')
    plt.show()
In [7]:
years = stats['yearID'].unique()
year_stats_collection = []
for year in years:
    year_selector = stats["yearID"] == year
    year_stats = stats[year_selector]
    year_stats[['S', '2B', '3B', 'HR', 'BB']] = stats[year_selector][['S', '2B', '3B', 'HR', 'BB']].apply(lambda df: (df - df.mean()) / df.std())
    year_stats_collection.append(year_stats)
stats_adj = pd.concat(year_stats_collection)
In [8]:
#yearly averages should eq 0
stats_adj.groupby('yearID')["S","2B","3B","HR","BB"].mean().head()
Out[8]:
Build a simple linear regression model to predict the number of wins from the average adjusted singles, doubles, triples, HR, and BB rates. To decide which of these terms to include, fit the model to data from before 2002 and compute the average squared residuals of its predictions for the years 2002 and later. Use the fitted model to define a new sabermetric summary: offensive predicted wins (OPW). Hint: the new summary should be a linear combination of one to five of the five rates.
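The year-split model selection described above can be sketched as follows. This is a hedged illustration, not the approach actually taken in the code cells (which use a random train/test split); a small synthetic stand-in for `stats_adj` is built here so the snippet runs on its own.

```python
# Sketch: fit each candidate feature subset on pre-2002 seasons and score it
# by mean squared residual on 2002-and-later seasons. The DataFrame below is
# a synthetic stand-in for stats_adj; in the notebook use the real one.
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 200
stats_adj = pd.DataFrame({
    "yearID": rng.randint(1948, 2014, n),
    "S": rng.randn(n), "2B": rng.randn(n), "3B": rng.randn(n),
    "HR": rng.randn(n), "BB": rng.randn(n),
})
# toy wins driven mostly by singles and HR, plus noise
stats_adj["W"] = 81 + 5 * stats_adj["S"] + 4 * stats_adj["HR"] + rng.randn(n)

rates = ["S", "2B", "3B", "HR", "BB"]
train = stats_adj[stats_adj["yearID"] < 2002]
test = stats_adj[stats_adj["yearID"] >= 2002]

best_mse, best_subset = None, None
for r in range(1, len(rates) + 1):
    for subset in combinations(rates, r):
        lm = LinearRegression().fit(train[list(subset)], train["W"])
        mse = np.mean((lm.predict(test[list(subset)]) - test["W"]) ** 2)
        if best_mse is None or mse < best_mse:
            best_mse, best_subset = mse, subset

print(best_subset, round(best_mse, 2))
```

On the toy data the selected subset should contain the two features that actually drive wins; on the real `stats_adj` the winning subset defines OPW.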
In [9]:
from sklearn import linear_model
from sklearn.cross_validation import train_test_split
In [10]:
# labels=stats_adj[['S', '2B', '3B', 'HR', 'BB']].values
training_selector = stats_adj['yearID'] < 2002
test_selector = stats_adj['yearID'] >= 2002
labels=["S", "2B", "3B", "HR", "BB"]
training_data = stats_adj[training_selector]
test_data = stats_adj[test_selector]
X_train = training_data[labels].values
y_train = training_data[['W']].values
X_test = test_data[labels].values
y_test = test_data[['W']].values
# (stats_adj.shape, X_train.shape, y_train.shape, X_test.shape, y_test.shape)
In [11]:
labels=["S", "2B", "3B", "HR", "BB"]
data = stats_adj[labels].values
target = stats_adj[['W']].values
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.4, random_state=9)
In [12]:
(X_test.shape, y_test.shape, X_train.shape)
Out[12]:
In [13]:
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
Out[13]:
In [14]:
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))
# Plot outputs
# plt.scatter(X_test, y_test, color='black')
# plt.plot(X_test, regr.predict(X_test), color='blue',linewidth=3)
# plt.xticks(())
# plt.yticks(())
# plt.show()
In [15]:
pd.DataFrame(regr.coef_)
Out[15]:
In [16]:
plt.scatter(stats_adj[["HR"]].values, stats_adj[["W"]].values, color='black', marker="x")
plt.scatter(stats_adj[["3B"]].values, stats_adj[["W"]].values, color='red', marker="x")
plt.scatter(stats_adj[["2B"]].values, stats_adj[["W"]].values, color='blue', marker="x")
plt.scatter(stats_adj[["S"]].values, stats_adj[["W"]].values, color='green', marker="x")
plt.scatter(stats_adj[["BB"]].values, stats_adj[["W"]].values, color='purple', marker="x")
Out[16]:
Your answer here:
Now we will create a similar database for individual players. Consider only player/year combinations in which the player had at least 500 plate appearances. Consider only the years we considered for the calculations above (after 1947 and seasons with 162 games). For each player/year compute singles, doubles, triples, HR and BB per-plate-appearance rates. Create a new pandas DataFrame called playerstats
that has the playerID, yearID and the rates of these stats. Remove the yearly average from each of these rates, as done in Problem 1(e).
In [17]:
players["PA"] = players["AB"] + players["BB"]
appearances_selector = players["PA"] >= 500  # plate appearances = AB + BB
year_selector = players["yearID"] >= 1947  # keep 1947 and later
games_selector = players["G"] == 162
filtered_players = players[appearances_selector & year_selector & games_selector]
playerstats = filtered_players[["playerID", "yearID", "2B", "3B", "HR", "BB", "H", "AB"]].copy()
playerstats["S"] = playerstats["H"] - (playerstats["2B"] + playerstats["3B"] + playerstats["HR"])
## Calculate PA = AB + BB
playerstats['PA'] = playerstats['AB'] + playerstats['BB']
## For each of singles, doubles, triples, HR and BB, calculate plate appearance rate
for i in ['S','2B','3B','HR','BB']:
    playerstats[i] = playerstats[i]/playerstats['PA']
playerstats.drop('AB', axis=1, inplace=True)
playerstats.drop('PA', axis=1, inplace=True)
playerstats.drop('H', axis=1, inplace=True)
playerstats.index
playerstats.sort("yearID", inplace=True)
playerstats.tail()
Out[17]:
In [18]:
years = playerstats['yearID'].unique()
year_stats_collection = []
for year in years:
    year_selector = playerstats["yearID"] == year
    year_stats = playerstats[year_selector]
    year_stats[['S', '2B', '3B', 'HR', 'BB']] = playerstats[year_selector][['S', '2B', '3B', 'HR', 'BB']].apply(lambda df: (df - df.mean()) / df.std())
    year_stats_collection.append(year_stats)
playerstats_adj = pd.concat(year_stats_collection)
In [19]:
year_sel = playerstats_adj['yearID'] == 2002
playerstats_adj[year_sel].mean(0)
Out[19]:
Show the head of the playerstats
DataFrame.
In [20]:
playerstats_adj.head()
Out[20]:
Using the playerstats
DataFrame created in Problem 1(g), create a new DataFrame called playerLS
containing the player's lifetime stats. This DataFrame should contain the playerID, the year the player's career started, the year the player's career ended and the player's lifetime average for each of the quantities (singles, doubles, triples, HR, BB). For simplicity we will compute the average of the rates by year (a more correct way is to go back to the totals).
In [21]:
left = playerstats_adj.groupby(["playerID"]).mean().reset_index()
playerLS = master[["playerID","debut","finalGame"]].merge(left, how='inner', on="playerID")
playerLS["debut"] = playerLS.debut.apply(lambda x: int(x[0:4]))
playerLS["finalGame"] = playerLS.finalGame.apply(lambda x: int(x[0:4]))
playerLS.drop('yearID', axis=1, inplace=True)
Show the head of the playerLS
DataFrame.
In [22]:
playerLS.head()
Out[22]:
In [23]:
playerLS["OPW"] = regr.predict(playerLS[["S", "2B", "3B", "HR", "BB"]].values)
playerLS.head()
Out[23]:
In [25]:
pos_df = fielding.groupby(["playerID"])["POS"].agg(lambda x:x.value_counts().index[0])
pos_df = pd.DataFrame(pos_df).reset_index()
In [26]:
playerLS = pos_df.merge(playerLS, how='inner', on="playerID")
In [27]:
playerLS_merged = playerLS.merge(medianSalaries, how='inner', on=['playerID'])
Show the head of the playerLS
DataFrame.
In [28]:
playerLS_merged.head()
Out[28]:
In [29]:
_3_year_selector = playerLS_merged["finalGame"] - playerLS_merged["debut"] >= 3
start_selector = playerLS_merged["debut"] <= 2002
end_selector = playerLS_merged["finalGame"] >= 2003
active = playerLS_merged[_3_year_selector & start_selector & end_selector]
fig = plt.figure()
ax = fig.gca()
ax.scatter(active.salary, active.OPW)
ax.set_xlabel('Salary ')
ax.set_ylabel('OPW')
ax.set_title('Relationship between Salary and Predicted Number of Wins')
plt.show()
In [78]:
positions = list(set(playerLS_merged['POS'].values))
team = playerLS_merged.groupby('POS').min().reset_index()
team
Out[78]:
In [72]:
team["salary"].sum()
Out[72]:
In [73]:
team["OPW"].mean()
Out[73]:
In [81]:
attr = ['POS','2B', '3B', 'HR', 'BB', 'S']
team.sort_values(by='OPW', ascending=False)[attr].drop_duplicates().sum()
Out[81]:
Your answer here:
Our team excels in 2B.
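As an aside, the 10-player roster could also be assembled greedily by predicted wins per salary dollar, adding players while the payroll stays under $20 million. This is a hedged sketch, not the notebook's groupby-by-position approach; a synthetic stand-in for the `active` DataFrame (with `OPW` and `salary` columns as above) is built so the snippet runs on its own.

```python
# Greedy roster selection sketch: rank players by OPW per salary dollar
# and add them while the $20M budget allows. Synthetic stand-in data.
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
n = 50
active = pd.DataFrame({
    "playerID": ["p%02d" % i for i in range(n)],
    "OPW": 70 + 20 * rng.rand(n),
    "salary": rng.randint(300000, 5000000, n),
})

budget = 20000000
roster, payroll = [], 0
# most predicted wins per salary dollar first
ranked = active.assign(value=active["OPW"] / active["salary"]) \
               .sort_values("value", ascending=False)
for _, row in ranked.iterrows():
    if len(roster) == 10:
        break
    if payroll + row["salary"] <= budget:
        roster.append(row["playerID"])
        payroll += row["salary"]

print(len(roster), payroll)
```

Greedy selection is only a heuristic; an exact solution would treat this as a knapsack problem, but for this assignment's scale the heuristic is usually adequate.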
In [93]:
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
#Use AllstarFull.csv to determine which players are Allstars
allstar = pd.read_csv('data/AllstarFull.csv')
allstar = allstar[['playerID','yearID','GP']].drop_duplicates()
players_allstar = pd.merge(players,allstar,on=['playerID'],how ='inner')
players_allstar['GP'] = players_allstar['GP'].fillna(0)
x = players_allstar.drop('GP',axis=1)._get_numeric_data().fillna(0)
y = players_allstar['GP']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
z = pd.DataFrame(zip(rfc.predict(x_test),y_test),columns = ['predicted','actual'])
z['new'] = z['actual']-z['predicted']
print 'Score:','%.2f'%(100 - z[z['new']!=0]['new'].count()*100.0/z['new'].count()),'%'
What is the optimal $k$ for predicting species using $k$-nearest neighbor classification on the four features provided by the iris data set?
In this problem you will get to know the famous iris data set, and use cross validation to select the optimal $k$ for a $k$-nearest neighbor classification. This problem set makes heavy use of the sklearn library, which, along with pandas, is one of the most useful libraries for data scientists. Since the iris data set is so commonly used, sklearn provides a dedicated function to load it.
In [34]:
#load the iris data set
from sklearn.datasets import load_iris
iris = load_iris()
Split the data into a train and a test set. Use a random selection of 33% of the samples as test data. Sklearn provides the train_test_split
function for this purpose. Print the dimensions of all the train and test data sets you have created.
In [35]:
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=9)
Use ten fold cross validation to estimate the optimal value for $k$ for the iris data set.
Note: For your convenience, sklearn includes not only the KNN classifier but also a grid-search function. It is called grid search because, when you have to optimize more than one parameter, it is common practice to define a range of possible values for each parameter; an exhaustive search then runs over the complete grid defined by all possible parameter combinations. This can get computationally heavy, but luckily our KNN classifier only requires tuning a single parameter for this problem set.
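The same search can also be sketched by hand with cross_val_score, looping over candidate values of $k$. Note this sketch uses the sklearn.model_selection names from later sklearn releases; the notebook's sklearn.cross_validation module is the 2015-era equivalent.

```python
# Manual k selection sketch: 10-fold CV accuracy for each k in 1..20,
# then pick the k with the best mean score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=9)

ks = np.arange(1, 21)
mean_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=int(k)),
                               X_train, Y_train, cv=10).mean()
               for k in ks]
best_k = int(ks[int(np.argmax(mean_scores))])
print(best_k)
```

This is exactly what GridSearchCV does internally for a single parameter, so both approaches should agree on the chosen $k$ up to ties.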
In [36]:
# use cross validation to find the optimal value for k
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
Out[36]:
In [37]:
a = clf.grid_scores_
scores = [b.cv_validation_scores for b in a]
score_means = np.mean(scores, axis=1)
sns.boxplot(scores)
plt.scatter(k,score_means, c='k', zorder=2)
plt.ylim(0.8, 1.1)
plt.title('Accuracy as a function of $k$')
plt.ylabel('Accuracy')
plt.xlabel('Choice of k')
plt.show()
Verify that the grid search has indeed chosen the right parameter value for $k$.
In [38]:
clf.best_params_
Out[38]:
In [39]:
from sklearn.metrics import classification_report
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() * 2, params))
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = Y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
In [40]:
clf.best_score_
Out[40]:
In [94]:
header = ['symboling','normalized-losses','make','fuel-type','aspiration',
'num-of-doors', 'body-style','drive-wheels','engine-location',
'wheel-base','length','width','height','curb-weight','engine-type',
'num-of-cylinders','engine-size','fuel-system','bore','stroke',
'compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg',
'price']
auto = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', names=header)
auto.head()
Out[94]:
In [95]:
#Replace '?' with NaN
auto = auto.replace(to_replace='?', value=np.nan)
#Convert relevant numeric data from object to float
change_num = ['bore','stroke','horsepower','peak-rpm','price']
auto[change_num] = auto[change_num].astype(float)  # astype has no inplace option
#Select columns with numbers
num_auto = auto._get_numeric_data().dropna()
num_auto.head()
Out[95]:
In [97]:
import math
from math import isnan
def corr(list1, list2):
    '''Find the correlation between two sequences, ignoring NaNs.'''
    new_list1 = []
    new_list2 = []
    for i1, i2 in zip(list1, list2):
        if not (isnan(i1) or isnan(i2)):
            new_list1.append(float(i1))
            new_list2.append(float(i2))
    cov = np.cov([new_list1, new_list2])[0][1]
    # ddof=1 so the standard deviations match np.cov's default ddof=1
    std1 = np.std(new_list1, ddof=1)
    std2 = np.std(new_list2, ddof=1)
    return cov / (std1 * std2)
#Taking features that have an absolute correlation of 0.7+ against price
feature_list = num_auto.columns[0:-1]
price = num_auto['price']
for feature in feature_list:
    attr = num_auto[feature]
    c = corr(attr, price)
    if abs(c) > 0.7:
        print feature, ':\t\t%.2f' % c
In [98]:
#Attributes with highest correlation
attr = ['width', 'curb-weight', 'engine-size', 'horsepower', 'highway-mpg','price','city-mpg']
#Picking important features
new_auto = auto[attr].dropna()
X = new_auto.drop('price', axis=1)
y = new_auto['price']
k_fold = sklearn.cross_validation.KFold(len(X), n_folds=10, shuffle=True, random_state=44)
In [99]:
#Normalize the data
X = (X - X.mean()) / (X.max() - X.min())
X.head()
Out[99]:
In [ ]:
In [100]:
scores = []
for train_index, test_index in k_fold:
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    lm = sklearn.linear_model.LinearRegression()
    lm.fit(X_train, y_train)
    scores.append(lm.score(X_test, y_test))
print 'Score:\t', '%.2f'%(np.mean(scores)*100),'%'
In [101]:
### Your code here ###
scores = []
for train_index, test_index in k_fold:
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    rm = sklearn.linear_model.Ridge()
    rm.fit(X_train, y_train)
    # exact equality between float predictions and targets is meaningless,
    # so use the R^2 score as for the linear model above
    scores.append(rm.score(X_test, y_test))
print 'Score:\t', '%.2f'%(np.mean(scores)*100),'%'
In [102]:
### Your code here ###
import sklearn.tree  # not imported at the top of the notebook
dtr = sklearn.tree.DecisionTreeRegressor()
grid_search = sklearn.grid_search.GridSearchCV(
    dtr,
    {'min_samples_split': list(range(1, 10)),
     'min_samples_leaf': list(range(1, 10)),
     'min_weight_fraction_leaf': np.linspace(0, 0.5, 10)},
    cv=10)
grid_search.fit(X_train, y_train)
print 'Best Parameters:'
grid_search.best_params_
Out[102]:
In [103]:
score = grid_search.score(X_test, y_test)
print 'Score:\t','%.2f'%(score*100),'%'
Both the linear and ridge regression models give similar accuracy scores, so the attributes likely have a high degree of independence. The regression tree model gave the highest accuracy.
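The independence claim can be checked directly by inspecting the pairwise correlation matrix of the selected features (in the notebook, `new_auto.corr()`). A minimal sketch on toy stand-in data:

```python
# Sketch: pairwise feature correlations; values near 0 off the diagonal
# support the independence claim, values near +/-1 indicate collinearity.
# Toy stand-in data so the snippet runs on its own.
import numpy as np
import pandas as pd

rng = np.random.RandomState(3)
toy = pd.DataFrame(rng.randn(100, 3),
                   columns=["width", "curb-weight", "horsepower"])
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

On the real auto data these physical attributes may well be strongly correlated, so the matrix is worth inspecting before trusting the independence argument.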
In [ ]: