Assignment 2: Prediction and Classification
Due: Thursday, April 30, 2015 11:59 PM
Problem 3 is optional - for extra credit! Problems 1 and 2 will be graded for Lab 2.
In this assignment you will be using regression and classification to explore different data sets.
First: You will use data from before 2002 in Sean Lahman's Baseball Database to create a metric for picking baseball players using linear regression. This database contains the "complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more". Documentation provided here.
Second: You will use the famous iris data set to perform a $k$-nearest neighbor classification using cross validation. While it was introduced in 1936, it is still one of the most popular example data sets in the machine learning community. Wikipedia describes the data set as follows: "The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres." Here is an illustration of what the four features measure:
Third: You will investigate the influence of higher-dimensional spaces on classification using another standard data set in machine learning, the cars data set.
In [1]:
# prepare the notebook for matplotlib
%matplotlib inline
import requests
import StringIO
import zipfile
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting
# If this module is not already installed, you may need to install it.
# You can do this by typing 'pip install seaborn' in the command line
import seaborn as sns
import sklearn
import sklearn.datasets
import sklearn.cross_validation
import sklearn.decomposition
import sklearn.grid_search
import sklearn.neighbors
import sklearn.metrics
Using data preceding the 2002 season, pick 10 offensive players keeping the payroll under $20 million (assign each player his median salary). Predict how many games this team would win in a 162-game season.
In this problem we will be returning to Sean Lahman's Baseball Database. From this database we will extract five data sets containing information such as yearly team stats and standings, batting statistics, fielding statistics, player names, player salaries and biographical information. You will explore the data in this database from before 2002 and create a metric for picking players.
Load in these CSV files from Sean Lahman's Baseball Database. For this assignment, we will use the 'Teams.csv', 'Batting.csv', 'Salaries.csv', 'Fielding.csv' and 'Master.csv' tables. Read these tables into separate pandas DataFrames with the following names.
CSV file name | Name of pandas DataFrame |
---|---|
Teams.csv | teams |
Batting.csv | players |
Salaries.csv | salaries |
Fielding.csv | fielding |
Master.csv | master |
In [2]:
teams = pd.read_csv("data/Teams.csv")
players = pd.read_csv("data/Batting.csv")
salaries = pd.read_csv("data/Salaries.csv")
fielding = pd.read_csv("data/Fielding.csv")
master = pd.read_csv("data/Master.csv")
In [ ]:
In [3]:
medians = salaries.groupby(["playerID"]).median()
merged = pd.merge(medians, master, left_index=True, right_on="playerID")
medianSalaries=merged[["playerID", "nameFirst", "nameLast", "salary"]]
medianSalaries.reset_index(inplace=True, drop=True)
medianSalaries.head()
Out[3]:
Now, consider only team/season combinations in which the teams played 162 games. Exclude all data from before 1947. Compute the per-plate-appearance rates for singles, doubles, triples, HR, and BB. Create a new pandas DataFrame called stats
that has the teamID, yearID, wins and these rates.
Hint: Singles are hits that are not doubles, triples, nor HR. Plate appearances are base on balls plus at bats.
In [4]:
g_filter = teams["G"] == 162
year_filter = teams["yearID"] >= 1947  # keep 1947 and later
filtered_teams = teams[g_filter & year_filter]
stats = filtered_teams[["teamID", "yearID", "W", "2B", "3B", "HR", "BB", "H", "AB"]].copy()
stats["S"] = stats["H"] - (stats["2B"] + stats["3B"] + stats["HR"])
## Calculate PA = AB + BB
stats['PA'] = stats['AB'] + stats['BB']
## For each of singles, doubles, triples, HR and BB, calculate plate appearance rate
for i in ['S','2B','3B','HR','BB']:
    stats[i] = stats[i]/stats['PA']
stats.drop('H', axis=1, inplace=True)
stats.drop('PA', axis=1, inplace=True)
stats.drop('AB', axis=1, inplace=True)
stats.sort("yearID", inplace=True)
stats.head()
Out[4]:
In [5]:
#the yearly mean for each metric across teams -- in the next step I make these values 0
stats.groupby('yearID')["S","2B","3B","HR","BB"].mean().head()
Out[5]:
In [6]:
for col in ["S","2B","3B","HR","BB"]:
    plt.scatter(stats.yearID, stats[col], alpha=0.5, marker='x')
    plt.title(col)
    plt.xlabel('Year')
    plt.ylabel('Rate')
    plt.show()
In [7]:
years = stats['yearID'].unique()
year_stats_collection = []
for year in years:
    year_selector = stats["yearID"] == year
    year_stats = stats[year_selector]
    year_stats[['S', '2B', '3B', 'HR', 'BB']] = stats[year_selector][['S', '2B', '3B', 'HR', 'BB']].apply(lambda df: (df - df.mean()) / df.std())
    year_stats_collection.append(year_stats)
stats_adj = pd.concat(year_stats_collection)
In [8]:
#yearly averages should eq 0
stats_adj.groupby('yearID')["S","2B","3B","HR","BB"].mean().head()
Out[8]:
Build a simple linear regression model to predict the number of wins from the average adjusted singles, doubles, triples, HR, and BB rates. To decide which of these terms to include, fit the model to data from before 2002 and compute the average squared residuals of its predictions for the years 2002 and later. Use the fitted model to define a new sabermetric summary: offensive predicted wins (OPW). Hint: the new summary should be a linear combination of one to five of the five rates.
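The year-split model selection described above can be sketched as follows. This is a hedged illustration, not the approach actually taken in the code cells (which use a random train/test split); a small synthetic stand-in for `stats_adj` is built here so the snippet runs on its own.

```python
# Sketch: fit each candidate feature subset on pre-2002 seasons and score it
# by mean squared residual on 2002-and-later seasons. The DataFrame below is
# a synthetic stand-in for stats_adj; in the notebook use the real one.
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 200
stats_adj = pd.DataFrame({
    "yearID": rng.randint(1948, 2014, n),
    "S": rng.randn(n), "2B": rng.randn(n), "3B": rng.randn(n),
    "HR": rng.randn(n), "BB": rng.randn(n),
})
# toy wins driven mostly by singles and HR, plus noise
stats_adj["W"] = 81 + 5 * stats_adj["S"] + 4 * stats_adj["HR"] + rng.randn(n)

rates = ["S", "2B", "3B", "HR", "BB"]
train = stats_adj[stats_adj["yearID"] < 2002]
test = stats_adj[stats_adj["yearID"] >= 2002]

best_mse, best_subset = None, None
for r in range(1, len(rates) + 1):
    for subset in combinations(rates, r):
        lm = LinearRegression().fit(train[list(subset)], train["W"])
        mse = np.mean((lm.predict(test[list(subset)]) - test["W"]) ** 2)
        if best_mse is None or mse < best_mse:
            best_mse, best_subset = mse, subset

print(best_subset, round(best_mse, 2))
```

On the toy data the selected subset should contain the two features that actually drive wins; on the real `stats_adj` the winning subset defines OPW.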
In [9]:
from sklearn import linear_model
from sklearn.cross_validation import train_test_split
In [10]:
# labels=stats_adj[['S', '2B', '3B', 'HR', 'BB']].values
training_selector = stats_adj['yearID'] < 2002
test_selector = stats_adj['yearID'] >= 2002
labels=["S", "2B", "3B", "HR", "BB"]
training_data = stats_adj[training_selector]
test_data = stats_adj[test_selector]
X_train = training_data[labels].values
y_train = training_data[['W']].values
X_test = test_data[labels].values
y_test = test_data[['W']].values
# (stats_adj.shape, X_train.shape, y_train.shape, X_test.shape, y_test.shape)
In [11]:
labels=["S", "2B", "3B", "HR", "BB"]
data = stats_adj[labels].values
target = stats_adj[['W']].values
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.4, random_state=9)
In [12]:
(X_test.shape, y_test.shape, X_train.shape)
Out[12]:
In [13]:
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
Out[13]:
In [14]:
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))
# Plot outputs
# plt.scatter(X_test, y_test, color='black')
# plt.plot(X_test, regr.predict(X_test), color='blue',linewidth=3)
# plt.xticks(())
# plt.yticks(())
# plt.show()
In [15]:
pd.DataFrame(regr.coef_)
Out[15]:
In [16]:
plt.scatter(stats_adj[["HR"]].values, stats_adj[["W"]].values, color='black', marker="x")
plt.scatter(stats_adj[["3B"]].values, stats_adj[["W"]].values, color='red', marker="x")
plt.scatter(stats_adj[["2B"]].values, stats_adj[["W"]].values, color='blue', marker="x")
plt.scatter(stats_adj[["S"]].values, stats_adj[["W"]].values, color='green', marker="x")
plt.scatter(stats_adj[["BB"]].values, stats_adj[["W"]].values, color='purple', marker="x")
Out[16]:
Your answer here:
Now we will create a similar database for individual players. Consider only player/year combinations in which the player had at least 500 plate appearances. Consider only the years we considered for the calculations above (after 1947 and seasons with 162 games). For each player/year compute singles, doubles, triples, HR and BB per-plate-appearance rates. Create a new pandas DataFrame called playerstats
that has the playerID, yearID and the rates of these stats. Remove the yearly average from each of these rates, as done in Problem 1(e).
In [17]:
players["PA"] = players["AB"] + players["BB"]
appearances_selector = players["PA"] >= 500  # plate appearances = AB + BB
year_selector = players["yearID"] >= 1947  # keep 1947 and later
games_selector = players["G"] == 162
filtered_players = players[appearances_selector & year_selector & games_selector]
playerstats = filtered_players[["playerID", "yearID", "2B", "3B", "HR", "BB", "H", "AB"]].copy()
playerstats["S"] = playerstats["H"] - (playerstats["2B"] + playerstats["3B"] + playerstats["HR"])
## Calculate PA = AB + BB
playerstats['PA'] = playerstats['AB'] + playerstats['BB']
## For each of singles, doubles, triples, HR and BB, calculate plate appearance rate
for i in ['S','2B','3B','HR','BB']:
    playerstats[i] = playerstats[i]/playerstats['PA']
playerstats.drop('AB', axis=1, inplace=True)
playerstats.drop('PA', axis=1, inplace=True)
playerstats.drop('H', axis=1, inplace=True)
playerstats.index
playerstats.sort("yearID", inplace=True)
playerstats.tail()
Out[17]:
In [18]:
years = playerstats['yearID'].unique()
year_stats_collection = []
for year in years:
    year_selector = playerstats["yearID"] == year
    year_stats = playerstats[year_selector]
    year_stats[['S', '2B', '3B', 'HR', 'BB']] = playerstats[year_selector][['S', '2B', '3B', 'HR', 'BB']].apply(lambda df: (df - df.mean()) / df.std())
    year_stats_collection.append(year_stats)
playerstats_adj = pd.concat(year_stats_collection)
In [19]:
year_sel = playerstats_adj['yearID'] == 2002
playerstats_adj[year_sel].mean(0)
Out[19]:
Show the head of the playerstats
DataFrame.
In [20]:
playerstats_adj.head()
Out[20]:
Using the playerstats
DataFrame created in Problem 1(g), create a new DataFrame called playerLS
containing the player's lifetime stats. This DataFrame should contain the playerID, the year the player's career started, the year the player's career ended and the player's lifetime average for each of the quantities (singles, doubles, triples, HR, BB). For simplicity we will compute the average of the rates by year (a more correct way is to go back to the totals).
In [21]:
left = playerstats_adj.groupby(["playerID"]).mean().reset_index()
playerLS = master[["playerID","debut","finalGame"]].merge(left, how='inner', on="playerID")
playerLS["debut"] = playerLS.debut.apply(lambda x: int(x[0:4]))
playerLS["finalGame"] = playerLS.finalGame.apply(lambda x: int(x[0:4]))
playerLS.drop('yearID', axis=1, inplace=True)
Show the head of the playerLS
DataFrame.
In [22]:
playerLS.head()
Out[22]:
In [23]:
playerLS["OPW"] = regr.predict(playerLS[["S", "2B", "3B", "HR", "BB"]].values)
playerLS.head()
Out[23]:
In [25]:
pos_df = fielding.groupby(["playerID"])["POS"].agg(lambda x:x.value_counts().index[0])
pos_df = pd.DataFrame(pos_df).reset_index()
In [26]:
playerLS = pos_df.merge(playerLS, how='inner', on="playerID")
In [27]:
playerLS_merged = playerLS.merge(medianSalaries, how='inner', on=['playerID'])
Show the head of the playerLS
DataFrame.
In [28]:
playerLS_merged.head()
Out[28]:
In [29]:
_3_year_selector = playerLS_merged["finalGame"] - playerLS_merged["debut"] >= 3
start_selector = playerLS_merged["debut"] <= 2002
end_selector = playerLS_merged["finalGame"] >= 2003
active = playerLS_merged[_3_year_selector & start_selector & end_selector]
fig = plt.figure()
ax = fig.gca()
ax.scatter(active.salary, active.OPW)
ax.set_xlabel('Salary ')
ax.set_ylabel('OPW')
ax.set_title('Relationship between Salary and Predicted Number of Wins')
plt.show()
In [78]:
positions = list(set(playerLS_merged['POS'].values))
team = playerLS_merged.groupby('POS').min().reset_index()
team
Out[78]:
In [72]:
team["salary"].sum()
Out[72]:
In [73]:
team["OPW"].mean()
Out[73]:
In [81]:
attr = ['POS','2B', '3B', 'HR', 'BB', 'S']
team.sort_values(by='OPW', ascending=False)[attr].drop_duplicates().sum()
Out[81]:
Your answer here:
Our team excels in 2B.
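As an aside, the 10-player roster could also be assembled greedily by predicted wins per salary dollar, adding players while the payroll stays under $20 million. This is a hedged sketch, not the notebook's groupby-by-position approach; a synthetic stand-in for the `active` DataFrame (with `OPW` and `salary` columns as above) is built so the snippet runs on its own.

```python
# Greedy roster selection sketch: rank players by OPW per salary dollar
# and add them while the $20M budget allows. Synthetic stand-in data.
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
n = 50
active = pd.DataFrame({
    "playerID": ["p%02d" % i for i in range(n)],
    "OPW": 70 + 20 * rng.rand(n),
    "salary": rng.randint(300000, 5000000, n),
})

budget = 20000000
roster, payroll = [], 0
# most predicted wins per salary dollar first
ranked = active.assign(value=active["OPW"] / active["salary"]) \
               .sort_values("value", ascending=False)
for _, row in ranked.iterrows():
    if len(roster) == 10:
        break
    if payroll + row["salary"] <= budget:
        roster.append(row["playerID"])
        payroll += row["salary"]

print(len(roster), payroll)
```

Greedy selection is only a heuristic; an exact solution would treat this as a knapsack problem, but for this assignment's scale the heuristic is usually adequate.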
In [93]:
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
#Use AllstarFull.csv to determine which players are Allstars
allstar = pd.read_csv('data/AllstarFull.csv')
allstar = allstar[['playerID','yearID','GP']].drop_duplicates()
players_allstar = pd.merge(players,allstar,on=['playerID'],how ='inner')
players_allstar['GP'] = players_allstar['GP'].fillna(0)
x = players_allstar.drop('GP',axis=1)._get_numeric_data().fillna(0)
y = players_allstar['GP']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
z = pd.DataFrame(zip(rfc.predict(x_test),y_test),columns = ['predicted','actual'])
z['new'] = z['actual']-z['predicted']
print 'Score:','%.2f'%(100 - z[z['new']!=0]['new'].count()*100.0/z['new'].count()),'%'
What is the optimal $k$ for predicting species using $k$-nearest neighbor classification on the four features provided by the iris data set?
In this problem you will get to know the famous iris data set, and use cross validation to select the optimal $k$ for a $k$-nearest neighbor classification. This problem set makes heavy use of the sklearn library, which, along with pandas, is one of the most useful libraries for data scientists. Since the iris data set is so commonly used, sklearn provides a dedicated function to load it.
In [34]:
#load the iris data set
from sklearn.datasets import load_iris
iris = load_iris()
Split the data into a train and a test set. Use a random selection of 33% of the samples as test data. Sklearn provides the train_test_split
function for this purpose. Print the dimensions of all the train and test data sets you have created.
In [35]:
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=9)
Use ten fold cross validation to estimate the optimal value for $k$ for the iris data set.
Note: For your convenience, sklearn includes not only the KNN classifier but also a grid-search function. It is called grid search because, when you have to optimize more than one parameter, it is common practice to define a range of possible values for each parameter; an exhaustive search then runs over the complete grid defined by all possible parameter combinations. This can get computationally heavy, but luckily our KNN classifier only requires tuning a single parameter for this problem set.
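The same search can also be sketched by hand with cross_val_score, looping over candidate values of $k$. Note this sketch uses the sklearn.model_selection names from later sklearn releases; the notebook's sklearn.cross_validation module is the 2015-era equivalent.

```python
# Manual k selection sketch: 10-fold CV accuracy for each k in 1..20,
# then pick the k with the best mean score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=9)

ks = np.arange(1, 21)
mean_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=int(k)),
                               X_train, Y_train, cv=10).mean()
               for k in ks]
best_k = int(ks[int(np.argmax(mean_scores))])
print(best_k)
```

This is exactly what GridSearchCV does internally for a single parameter, so both approaches should agree on the chosen $k$ up to ties.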
In [36]:
# use cross validation to find the optimal value for k
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
Out[36]:
In [37]:
a = clf.grid_scores_
scores = [b.cv_validation_scores for b in a]
score_means = np.mean(scores, axis=1)
sns.boxplot(scores)
plt.scatter(k,score_means, c='k', zorder=2)
plt.ylim(0.8, 1.1)
plt.title('Accuracy as a function of $k$')
plt.ylabel('Accuracy')
plt.xlabel('Choice of k')
plt.show()
Verify that the grid search has indeed chosen the right parameter value for $k$.
In [38]:
clf.best_params_
Out[38]:
In [39]:
from sklearn.metrics import classification_report
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() * 2, params))
print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = Y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
In [40]:
clf.best_score_
Out[40]:
In [94]:
header = ['symboling','normalized-losses','make','fuel-type','aspiration',
'num-of-doors', 'body-style','drive-wheels','engine-location',
'wheel-base','length','width','height','curb-weight','engine-type',
'num-of-cylinders','engine-size','fuel-system','bore','stroke',
'compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg',
'price']
auto = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', names=header)
auto.head()
Out[94]:
In [95]:
#Replace '?' with NaN
auto = auto.replace(to_replace='?', value=np.nan)
#Convert relevant numeric data from object to float
change_num = ['bore','stroke','horsepower','peak-rpm','price']
auto[change_num] = auto[change_num].astype(float)  # astype has no inplace option
#Select columns with numbers
num_auto = auto._get_numeric_data().dropna()
num_auto.head()
Out[95]:
In [97]:
import math
from math import isnan
def corr(list1, list2):
    '''Find the correlation between two sequences, ignoring NaNs.'''
    new_list1 = []
    new_list2 = []
    for i1, i2 in zip(list1, list2):
        if not (isnan(i1) or isnan(i2)):
            new_list1.append(float(i1))
            new_list2.append(float(i2))
    cov = np.cov([new_list1, new_list2])[0][1]
    # ddof=1 so the standard deviations match np.cov's default ddof=1
    std1 = np.std(new_list1, ddof=1)
    std2 = np.std(new_list2, ddof=1)
    return cov / (std1 * std2)
#Taking features that have an absolute correlation of 0.7+ against price
feature_list = num_auto.columns[0:-1]
price = num_auto['price']
for feature in feature_list:
    attr = num_auto[feature]
    c = corr(attr, price)
    if abs(c) > 0.7:
        print feature, ':\t\t%.2f' % c
In [98]:
#Attributes with highest correlation
attr = ['width', 'curb-weight', 'engine-size', 'horsepower', 'highway-mpg','price','city-mpg']
#Picking important features
new_auto = auto[attr].dropna()
X = new_auto.drop('price', axis=1)
y = new_auto['price']
k_fold = sklearn.cross_validation.KFold(len(X), n_folds=10, shuffle=True, random_state=44)
In [99]:
#Normalize the data
X = (X - X.mean()) / (X.max() - X.min())
X.head()
Out[99]:
In [ ]:
In [100]:
scores = []
for train_index, test_index in k_fold:
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    lm = sklearn.linear_model.LinearRegression()
    lm.fit(X_train, y_train)
    scores.append(lm.score(X_test, y_test))
print 'Score:\t', '%.2f'%(np.mean(scores)*100),'%'
In [101]:
### Your code here ###
scores = []
for train_index, test_index in k_fold:
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    rm = sklearn.linear_model.Ridge()
    rm.fit(X_train, y_train)
    # exact equality between float predictions and targets is meaningless,
    # so use the R^2 score as for the linear model above
    scores.append(rm.score(X_test, y_test))
print 'Score:\t', '%.2f'%(np.mean(scores)*100),'%'
In [102]:
### Your code here ###
import sklearn.tree  # not imported at the top of the notebook
dtr = sklearn.tree.DecisionTreeRegressor()
grid_search = sklearn.grid_search.GridSearchCV(
    dtr,
    {'min_samples_split': list(range(1, 10)),
     'min_samples_leaf': list(range(1, 10)),
     'min_weight_fraction_leaf': np.linspace(0, 0.5, 10)},
    cv=10)
grid_search.fit(X_train, y_train)
print 'Best Parameters:'
grid_search.best_params_
Out[102]:
In [103]:
score = grid_search.score(X_test, y_test)
print 'Score:\t','%.2f'%(score*100),'%'
Both the linear and ridge regression models give similar accuracy scores, so the attributes likely have a high degree of independence. The regression tree model gave the highest accuracy.
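The independence claim can be checked directly by inspecting the pairwise correlation matrix of the selected features (in the notebook, `new_auto.corr()`). A minimal sketch on toy stand-in data:

```python
# Sketch: pairwise feature correlations; values near 0 off the diagonal
# support the independence claim, values near +/-1 indicate collinearity.
# Toy stand-in data so the snippet runs on its own.
import numpy as np
import pandas as pd

rng = np.random.RandomState(3)
toy = pd.DataFrame(rng.randn(100, 3),
                   columns=["width", "curb-weight", "horsepower"])
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

On the real auto data these physical attributes may well be strongly correlated, so the matrix is worth inspecting before trusting the independence argument.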
In [ ]: