Student Name: Shaival Dalal
Student Netid: sd3462
In this assignment we will be looking at data generated by particle physicists to test whether machine learning can help classify whether certain particle decay experiments identify the presence of a Higgs Boson. One does not need to know anything about particle physics to do well here, but if you are curious, full feature and data descriptions can be found here:
The goal of this assignment is to learn to use cross-validation for model selection as well as bootstrapping for error estimation. We’ll also use learning curve analysis to understand how well different algorithms make use of limited data. For more documentation on cross-validation with Python, you can consult the following:
Create a data preparation and cleaning function that does the following:
EventID is the index) into a pandas dataframeLabel to numeric (choose the minority class to be equal to 1)Y with numeric labelLabel-999): orig_var_name + _mv where orig_var_name is the name of the actual var with a missing value.replace() function useful.Hint: as a guide, this function can easily be done in less than 15 lines.
In [204]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import scale
def cleanBosonData(infile_name):
raw=pd.read_csv(infile_name,header=0,index_col=0,na_values=-999)
raw=raw.replace({'Label':{Counter(raw['Label']).most_common()[0][0]:0,Counter(raw['Label']).most_common()[1][0]:1}}).rename(columns={'Label':'Y'})
for col in raw.columns.values[:-1]:
if raw[col].isnull().any().any():
raw[(col+'_mv')]=np.where(raw[col].isnull(),1,0)
raw[col]=raw[col].fillna(raw[col].mean())
raw[col]=scale(raw[col])
return raw
1. Clean the two data files included in this assignment (data/boson_training_cut_2000.csv and data/boson_testing_cut.csv) and use them as training and testing data sets.
In [205]:
data_train = cleanBosonData("../Datasets/boson_training_cut_2000.csv")
data_test = cleanBosonData("../Datasets/boson_testing_cut.csv")
2. On the training set, build the following models:
linear_model.LogisticRegression(). For this model, use C=1e30.svm.svc(). For this model, specify that kernel="linear".For each model above, plot the ROC curve of both models on the same plot. Make sure to use the test set for computing and plotting. In the legend, also print out the Area Under the ROC (AUC) for reference.
In [206]:
#Importing basic libraries
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn import linear_model
from sklearn.svm import SVC
%matplotlib inline
#Creating a model using Logistic regression
logreg = linear_model.LogisticRegression(C = 1e30).fit(data_train.drop('Y', 1), data_train['Y'])
#Creating a model using Support Vector Machines
svm=SVC(kernel="linear").fit(data_train.drop('Y',1),data_train['Y'])
#We predict values using the above developed models
pred_logreg=logreg.predict(data_test.drop('Y',1))
pred_svm=svm.predict(data_test.drop('Y',1))
#Calculating the FPR and the TPR for our predicted values
roc_log=roc_curve(y_true=data_test['Y'],y_score=pred_logreg)
roc_svm=roc_curve(y_true=data_test['Y'],y_score=pred_svm)
#Plotting values on a graph
plt.figure(figsize=(10,5))
plt.plot(roc_log[0],roc_log[1],color='darkorange',lw=3,label='AUC for Logistic Regression: %.4f' %(roc_auc_score(y_true=data_test['Y'],y_score=pred_logreg)))
plt.plot(roc_svm[0],roc_svm[1],color='navy',lw=3,label='AUC for SVM: %.4f' %(roc_auc_score(y_true=data_test['Y'],y_score=pred_svm)))
plt.plot([0, 1], [0, 1], color='black', lw=1, linestyle='--') #We plot a base line to indicate 0.5 indicating random chance
plt.title('Area under the ROC')
plt.legend(loc='lower right')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
3. Which of the two models is generally better at ranking the test set? Are there any classification thresholds where the model identified above as "better" would underperform the other in a classification metric (such as TPR)?
In [207]:
from sklearn.metrics import confusion_matrix
from IPython.display import display, HTML
#Preparing confusion matrix to check different performance measures
conf_logreg=pd.DataFrame(confusion_matrix(y_true=data_test['Y'],y_pred=pred_logreg))
conf_svm=pd.DataFrame(confusion_matrix(y_true=data_test['Y'],y_pred=pred_svm))
display('Logistic')
display(conf_logreg)
display('SVM')
display(conf_svm)
**Answer**: The Logistic model performs better than the SVM model on an average. We judge them by using the AUC metric which is base rate invariant. There are specific thresholds for which one model outperforms the other and the final selection must be made on the basis of the utility of the model.
In Logistic:
In SVM:
1. Write a cross-validation function that does the following:
k), a sequence of values for $C$ (cs)for each f in range(k):data_train & data_validate according to cross-validation logicfor each c in cs:C=c, kernel="linear"c:[auc-c1,auc-c2,..auc-ck]
In [208]:
from sklearn.model_selection import train_test_split
def xValSVM(dataset, label_name, k, cs):
aucstore=dict()
for f in range(k):
data_train,data_validate=train_test_split(dataset)
for c in cs:
svm_temp=SVC(kernel="linear",C=c).fit(data_train.drop(label_name,1),data_train[label_name])
predsvm_temp=svm_temp.predict(data_test.drop(label_name,1))
svmroc_temp=roc_auc_score(data_test[label_name],predsvm_temp)
if c not in aucstore:
aucstore[c]=list()
aucstore[c].append(svmroc_temp)
return aucstore
2. Using the function written above, do the following:
[10^(-8), ..., 10^1] (i.e., do all powers of 10 from -8 to 1).Then answer the question: Did the model parameters selected beat the out-of-the-box model for SVM?
In [209]:
from scipy.stats import sem
import matplotlib.pyplot as plt
c_val = np.power(10,np.arange(-8.0,2.0))
aucs=xValSVM(data_train,"Y",10,c_val)
alldiff=list()
allmeans=list()
allstderr=list()
for i in aucs.keys():
mean=np.mean(aucs[i])
stderr=sem(aucs[i])
print("Mean for",i,":",mean," Std. Err: ",stderr)
allmeans.append(mean)
allstderr.append(stderr)
alldiff.append(mean-stderr)
print("Maximum difference:", max(alldiff)," for C value",list(aucs.keys())[(alldiff.index(max(alldiff)))])
plt.figure(figsize=(10,5))
plt.plot(np.log10(c_val),allmeans,label='Mean AUC')
plt.plot(np.log10(c_val),np.array(allmeans)-np.array(allstderr),'k+',label='Mean-Standard error')
plt.plot(np.log10(c_val),np.array(allmeans)+np.array(allstderr),'k--',label='Mean+Standard error')
plt.axhline(y=max(alldiff),color='r',label='Reference Line')
plt.legend(loc='lower right')
plt.show()
**Answer**:
Yes, our optimised model beats the out-of-box SVM model. However, we observe a only marginal increase in the AUC value. Out-of-box SVM with C=1.0 gives us an AUC of 0.5757 while our optimised SVM with C=10.0 gives us an AUC of 0.5768.
In this HW we are trying to find the best linear model to predict if a record represents the Higgs Boson. One of the drivers of the performance of a model is the sample size of the training set. As a data scientist, sometimes you have to decide if you have enough data or if you should invest in more. We can use learning curve analysis to determine if we have reached a performance plateau. This will inform us on whether or not we should invest in more data (in this case it would be by running more experiments).
Given a training set of size $N$, we test the performance of a model trained on a subsample of size $N_i$, where $N_i<=N$. We can plot how performance grows as we move $N_i$ from $0$ to $N$.
Because of the inherent randomness of subsamples of size $N_i$, we should expect that any single sample of size $N_i$ might not be representative of an algorithm’s performance at a given training set size. To quantify this variance and get a better generalization, we will also use bootstrap analysis. In bootstrap analysis, we pull multiple samples of size $N_i$, build a model, evaluate on a test set, and then take an average and standard error of the results.
1. Create a bootstrap function that can do the following:
def modBootstrapper(train, test, nruns, sampsize, lr, c):
Takes as input:
Runs a loop with (nruns) iterations, and within each loop:
Returns the mean(AUC) and Standard Error(mean(AUC)) across all bootstrap samples
In [210]:
from sklearn.metrics import auc
def modBootstrapper(train, test, nruns, sampsize, lr, **c):
aucstore2=dict()
means=list()
stds=list()
for i in np.arange(1,nruns):
for j in sampsize:
sampled=train.sample(n=j,replace=True)
if lr==0 and c is not None:
svm_temp=SVC(kernel="linear",C=list(c.values())[0]).fit(sampled.drop('Y',1),sampled['Y'])
predsvm_temp=svm_temp.predict(test.drop('Y',1))
auc_value=roc_auc_score(y_true=test['Y'],y_score=predsvm_temp)
elif lr==1:
logreg = linear_model.LogisticRegression(C = 1e30).fit(sampled.drop('Y', 1), sampled['Y'])
pred_logreg=logreg.predict(test.drop('Y',1))
auc_value=roc_auc_score(y_true=test['Y'],y_score=pred_logreg)
else:
print("Enter valid parameters for lr and c")
if j not in aucstore2:
aucstore2[j]=list()
aucstore2[j].append(auc_value)
for key,value in sorted(aucstore2.items()):
means.append(np.mean(value))
stds.append(np.std(value))
return means,stds
2. For both LR and SVM, run 20 bootstrap samples for each samplesize in the following list: samplesizes = [50, 100, 200, 500, 1000, 1500, 2000]. (Note, this might take 10-15 mins … feel free to go grab a drink or watch Youtube while this runs).
Generate a plot with the following:
In [211]:
samplesize=[50, 100, 200, 500, 1000, 1500, 2000]
logauc,logstd=modBootstrapper(data_train,data_test,20,samplesize,lr=1)
svmauc,svmstd=modBootstrapper(data_train,data_test,20,samplesize,lr=0,c=10)
fig=plt.figure(figsize=(15,5))
ax=fig.add_subplot(111)
plt.plot(np.log2(samplesize),svmauc,'g',label='SVM Mean')
plt.plot(np.log2(samplesize),np.array(svmauc)-np.array(svmstd),'g+',label='SVM StdErr Lower')
plt.plot(np.log2(samplesize),np.array(svmauc)+np.array(svmstd),'g--',label='SVM StdErr Upper')
plt.plot(np.log2(samplesize),logauc,'r',label='Logistic Mean')
plt.plot(np.log2(samplesize),np.array(logauc)-np.array(logstd),'r+',label='Logistic StdErr Upper')
plt.plot(np.log2(samplesize),np.array(logauc)+np.array(logstd),'r--',label='Logistic StdErr Lower')
plt.legend(loc='best')
ax.set_xlabel('Log2(samplesize)')
ax.set_ylabel('AUC')
plt.show()
3. Which of the two algorithms are more suitable for smaller sample sizes, given the set of features? If it costs twice the investment to run enough experiments to double the data, do you think it is a worthy investment?
**Answer**:
For smaller sample sizes, we prefer SVM over Logistic regression as by using SVM, we gain a chance to obtain a marginally better result than Logistic. The intervals that bind the average case of SVM cover a higher value of AUC but are also lower than the lower bound of Logistic (as seen in the graph). So by choosing SVM, there is a possibility that our model will yield a better result.
However, if it costs twice the investment to run experiments to double the data, we can safely choose Logistic regression as the accuracy gain by SVM is only marginal (~0.01%). For cases where accuracy is precious, such as in healthcare industry, we can adopt SVM as we need all the extra precision that we can get to prevent wrong predictions.
4. Is there a reason why cross-validation might be biased? If so, in what direction is it biased?
**Answer**:
Cross validation is used to remove bias from the dataset but it results in an increase in variance. To remove/ reduce bias, we take the value of K to be generally 3 more. In our case, we took the cross-validation as 10. This results in the development of a model that exhibits low bias due to repeated sampling and averaging of results. Due to this reason, I don't believe that our cross-validation is biased.