Introduction to Data Science

Homework 4

Student Name: Shaival Dalal

Student Netid: sd3462

In this assignment we will be looking at data generated by particle physicists to test whether machine learning can help classify whether certain particle decay experiments identify the presence of a Higgs Boson. One does not need to know anything about particle physics to do well here, but if you are curious, full feature and data descriptions can be found here:

The goal of this assignment is to learn to use cross-validation for model selection as well as bootstrapping for error estimation. We’ll also use learning curve analysis to understand how well different algorithms make use of limited data. For more documentation on cross-validation with Python, you can consult the following:

http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

Part 1: Data preparation (5 points)

Create a data preparation and cleaning function that does the following:

Has a single input that is a file name string
Reads data (the data is comma separated, has a row header and the first column EventID is the index) into a pandas dataframe
Cleans the data
- Convert the feature Label to numeric (choose the minority class to be equal to 1)
  - Create a feature Y with numeric label
  - Drop the feature Label
- If a feature has missing values (i.e., -999):
  - Create a dummy variable for the missing value
    - Call the variable orig_var_name + _mv where orig_var_name is the name of the actual var with a missing value
    - Give this new variable a 1 if the original variable is missing
  - Replace the missing value with the average of the feature (make sure to compute the mean on records where the value isn't missing). You may find pandas' .replace() function useful.
After the above is done, rescales the data so that each feature has zero mean and unit variance (hint: look up sklearn.preprocessing)
Returns the cleaned and rescaled dataset

Hint: as a guide, this function can easily be done in less than 15 lines.



In [204]:

    
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import scale

def cleanBosonData(infile_name):
    raw=pd.read_csv(infile_name,header=0,index_col=0,na_values=-999)
    raw=raw.replace({'Label':{Counter(raw['Label']).most_common()[0][0]:0,Counter(raw['Label']).most_common()[1][0]:1}}).rename(columns={'Label':'Y'})
    for col in raw.columns.values[:-1]:
        if raw[col].isnull().any().any():
            raw[(col+'_mv')]=np.where(raw[col].isnull(),1,0)
            raw[col]=raw[col].fillna(raw[col].mean())
        raw[col]=scale(raw[col])
    return raw

Part 2: Basic evaluations (5 points)

In this part you will build an out-of-the box logistic regression (LR) model and support vector machine (SVM). You will then plot ROC for the LR and SVM model.

1. Clean the two data files included in this assignment (data/boson_training_cut_2000.csv and data/boson_testing_cut.csv) and use them as training and testing data sets.



In [205]:

    
data_train = cleanBosonData("../Datasets/boson_training_cut_2000.csv")
data_test = cleanBosonData("../Datasets/boson_testing_cut.csv")

2. On the training set, build the following models:

A logistic regression using sklearn's linear_model.LogisticRegression(). For this model, use C=1e30.
An SVM using sklearn's svm.svc(). For this model, specify that kernel="linear".

For each model above, plot the ROC curve of both models on the same plot. Make sure to use the test set for computing and plotting. In the legend, also print out the Area Under the ROC (AUC) for reference.



In [206]:

    
#Importing basic libraries
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn import linear_model
from sklearn.svm import SVC

%matplotlib inline
#Creating a model using Logistic regression
logreg = linear_model.LogisticRegression(C = 1e30).fit(data_train.drop('Y', 1), data_train['Y'])
#Creating a model using Support Vector Machines
svm=SVC(kernel="linear").fit(data_train.drop('Y',1),data_train['Y'])

#We predict values using the above developed models
pred_logreg=logreg.predict(data_test.drop('Y',1))
pred_svm=svm.predict(data_test.drop('Y',1))

#Calculating the FPR and the TPR for our predicted values
roc_log=roc_curve(y_true=data_test['Y'],y_score=pred_logreg)
roc_svm=roc_curve(y_true=data_test['Y'],y_score=pred_svm)

#Plotting values on a graph
plt.figure(figsize=(10,5))
plt.plot(roc_log[0],roc_log[1],color='darkorange',lw=3,label='AUC for Logistic Regression: %.4f' %(roc_auc_score(y_true=data_test['Y'],y_score=pred_logreg)))
plt.plot(roc_svm[0],roc_svm[1],color='navy',lw=3,label='AUC for SVM: %.4f' %(roc_auc_score(y_true=data_test['Y'],y_score=pred_svm)))
plt.plot([0, 1], [0, 1], color='black', lw=1, linestyle='--') #We plot a base line to indicate 0.5 indicating random chance
plt.title('Area under the ROC')
plt.legend(loc='lower right')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

3. Which of the two models is generally better at ranking the test set? Are there any classification thresholds where the model identified above as "better" would underperform the other in a classification metric (such as TPR)?



In [207]:

    
from sklearn.metrics import confusion_matrix
from IPython.display import display, HTML

#Preparing confusion matrix to check different performance measures
conf_logreg=pd.DataFrame(confusion_matrix(y_true=data_test['Y'],y_pred=pred_logreg))
conf_svm=pd.DataFrame(confusion_matrix(y_true=data_test['Y'],y_pred=pred_svm))

display('Logistic')
display(conf_logreg)
display('SVM')
display(conf_svm)

**Answer**: The Logistic model performs better than the SVM model on an average. We judge them by using the AUC metric which is base rate invariant. There are specific thresholds for which one model outperforms the other and the final selection must be made on the basis of the utility of the model.

In Logistic:

We observe that the TPR (predicted 0 and actual 0) is higher than SvM. Logistic correctly classifies actual true values for 6597 cases as compared to SVM's 3251 values.
The False Negative Error rate is also less. Logistic has 10566 while SVM has 13912 false negatives.

In SVM:

It is better at classifying True Positives than Logistic. It correctly identifies positives for 31586 cases as compared to logistic's 29399 cases.
The False Positive rate (1251) is also lower than logistic (3438)

Part 3: Model selection with cross-validation (10 points)

We think we might be able to improve the performance of the SVM if we perform a grid search on the hyper-parameter $C$. Because we only have 1000 instances, we will have to use cross-validation to find the optimal $C$.

1. Write a cross-validation function that does the following:

Takes as inputs a dataset, a label name, # of splits/folds (k), a sequence of values for $C$ (cs)
Performs two loops
- Outer Loop: for each f in range(k):
  - Splits the data into data_train & data_validate according to cross-validation logic
- Inner Loop: for each c in cs:
  - Trains an SVM on training split with C=c, kernel="linear"
  - Computes AUC_c_k on validation data
  - Stores AUC_c_k in a dictionary of values
Returns a dictionary, where each key-value pair is: c:[auc-c1,auc-c2,..auc-ck]



In [208]:

    
from sklearn.model_selection import train_test_split

def xValSVM(dataset, label_name, k, cs):
    aucstore=dict()
    for f in range(k):
        data_train,data_validate=train_test_split(dataset)
        for c in cs:
            svm_temp=SVC(kernel="linear",C=c).fit(data_train.drop(label_name,1),data_train[label_name])
            predsvm_temp=svm_temp.predict(data_test.drop(label_name,1))
            svmroc_temp=roc_auc_score(data_test[label_name],predsvm_temp)
            if c not in aucstore:
                aucstore[c]=list()
            aucstore[c].append(svmroc_temp)
    return aucstore

2. Using the function written above, do the following:

Generate a sequence of 10 $C$ values in the interval [10^(-8), ..., 10^1] (i.e., do all powers of 10 from -8 to 1).

Call aucs = xValSVM(train, ‘Y’, 10, cs)
For each c in cs, get mean(AUC) and StdErr(AUC)
Compute the value for max(meanAUC-StdErr(AUC)) across all values of c.
Generate a plot with the following: a. Log10(c) on the x-axis b. 1 series with mean(AUC) for each c c. 1 series with mean(AUC)-stderr(AUC) for each c (use ‘k+’ as color pattern) d. 1 series with mean(AUC)+stderr(AUC) for each c (use ‘k--‘ as color pattern) e. a reference line for max(AUC-StdErr(AUC)) (use ‘r’ as color pattern)

Then answer the question: Did the model parameters selected beat the out-of-the-box model for SVM?



In [209]:

    
from scipy.stats import sem
import matplotlib.pyplot as plt
c_val = np.power(10,np.arange(-8.0,2.0))
aucs=xValSVM(data_train,"Y",10,c_val)

alldiff=list()
allmeans=list()
allstderr=list()
for i in aucs.keys():
    mean=np.mean(aucs[i])
    stderr=sem(aucs[i])
    print("Mean for",i,":",mean," Std. Err: ",stderr)
    allmeans.append(mean)
    allstderr.append(stderr)
    alldiff.append(mean-stderr)
print("Maximum difference:", max(alldiff)," for C value",list(aucs.keys())[(alldiff.index(max(alldiff)))])

plt.figure(figsize=(10,5))
plt.plot(np.log10(c_val),allmeans,label='Mean AUC')
plt.plot(np.log10(c_val),np.array(allmeans)-np.array(allstderr),'k+',label='Mean-Standard error')
plt.plot(np.log10(c_val),np.array(allmeans)+np.array(allstderr),'k--',label='Mean+Standard error')
plt.axhline(y=max(alldiff),color='r',label='Reference Line')
plt.legend(loc='lower right')
plt.show()









    



Mean for 1e-08 : 0.5  Std. Err:  0.0
Mean for 1e-07 : 0.5  Std. Err:  0.0
Mean for 1e-06 : 0.5  Std. Err:  0.0
Mean for 1e-05 : 0.5  Std. Err:  0.0
Mean for 0.0001 : 0.5  Std. Err:  0.0
Mean for 0.001 : 0.513331065817  Std. Err:  0.00202499133557
Mean for 0.01 : 0.557735427855  Std. Err:  0.00295342024579
Mean for 0.1 : 0.566470420758  Std. Err:  0.00263840183551
Mean for 1.0 : 0.568824065461  Std. Err:  0.00303295252131
Mean for 10.0 : 0.569677827001  Std. Err:  0.003277977188
Maximum difference: 0.566399849813  for C value 10.0

**Answer**:

Yes, our optimised model beats the out-of-box SVM model. However, we observe a only marginal increase in the AUC value. Out-of-box SVM with C=1.0 gives us an AUC of 0.5757 while our optimised SVM with C=10.0 gives us an AUC of 0.5768.

Part 4: Learning Curve with Bootstrapping

In this HW we are trying to find the best linear model to predict if a record represents the Higgs Boson. One of the drivers of the performance of a model is the sample size of the training set. As a data scientist, sometimes you have to decide if you have enough data or if you should invest in more. We can use learning curve analysis to determine if we have reached a performance plateau. This will inform us on whether or not we should invest in more data (in this case it would be by running more experiments).

Given a training set of size $N$, we test the performance of a model trained on a subsample of size $N_i$, where $N_i<=N$. We can plot how performance grows as we move $N_i$ from $0$ to $N$.

Because of the inherent randomness of subsamples of size $N_i$, we should expect that any single sample of size $N_i$ might not be representative of an algorithm’s performance at a given training set size. To quantify this variance and get a better generalization, we will also use bootstrap analysis. In bootstrap analysis, we pull multiple samples of size $N_i$, build a model, evaluate on a test set, and then take an average and standard error of the results.

1. Create a bootstrap function that can do the following:

def modBootstrapper(train, test, nruns, sampsize, lr, c):

Takes as input:
- A master training file (train)
- A master testing file (test)
- Number of bootstrap iterations (nruns)
- Size of a bootstrap sample (sampsize)
- An indicator variable to specific LR or SVM (lr=1)
- A c option (only applicable to SVM)
Runs a loop with (nruns) iterations, and within each loop:
- Sample (sampsize) instances from train, with replacement
- Fit either an SVM or LR (depending on options specified). For SVM, use the value of C identified using the 1 standard error method from part 3.
- Computes AUC on test data using predictions from model in above step
- Stores the AUC in a list
Returns the mean(AUC) and Standard Error(mean(AUC)) across all bootstrap samples



In [210]:

    
from sklearn.metrics import auc
def modBootstrapper(train, test, nruns, sampsize, lr, **c):
    aucstore2=dict()
    means=list()
    stds=list()
    for i in np.arange(1,nruns):
        for j in sampsize:
            sampled=train.sample(n=j,replace=True)
            if lr==0 and c is not None:
                svm_temp=SVC(kernel="linear",C=list(c.values())[0]).fit(sampled.drop('Y',1),sampled['Y'])
                predsvm_temp=svm_temp.predict(test.drop('Y',1))
                auc_value=roc_auc_score(y_true=test['Y'],y_score=predsvm_temp)
            elif lr==1:
                logreg = linear_model.LogisticRegression(C = 1e30).fit(sampled.drop('Y', 1), sampled['Y'])
                pred_logreg=logreg.predict(test.drop('Y',1))
                auc_value=roc_auc_score(y_true=test['Y'],y_score=pred_logreg)
            else:
                print("Enter valid parameters for lr and c")
            if j not in aucstore2:
                aucstore2[j]=list()
            aucstore2[j].append(auc_value)
    for key,value in sorted(aucstore2.items()):
        means.append(np.mean(value))
        stds.append(np.std(value))
    return means,stds

2. For both LR and SVM, run 20 bootstrap samples for each samplesize in the following list: samplesizes = [50, 100, 200, 500, 1000, 1500, 2000]. (Note, this might take 10-15 mins … feel free to go grab a drink or watch Youtube while this runs).

Generate a plot with the following:

Log2(samplesize) on the x-axis
2 sets of results lines, one for LR and one for SVM, the set should include
- 1 series with mean(AUC) for each sampsize (use the color options ‘g’ for svm, ‘r’ for lr)
- 1 series with mean(AUC)-stderr(AUC) for each c (use ‘+’ as color pattern, ‘g’,’r’ for SVM, LR respectively)
- 1 series with mean(AUC)+stderr(AUC) for each c (use ‘--‘ as color pattern ‘g’,’r’ for SVM, LR respectively)



In [211]:

    
samplesize=[50, 100, 200, 500, 1000, 1500, 2000]
logauc,logstd=modBootstrapper(data_train,data_test,20,samplesize,lr=1)
svmauc,svmstd=modBootstrapper(data_train,data_test,20,samplesize,lr=0,c=10)

fig=plt.figure(figsize=(15,5))
ax=fig.add_subplot(111)
plt.plot(np.log2(samplesize),svmauc,'g',label='SVM Mean')
plt.plot(np.log2(samplesize),np.array(svmauc)-np.array(svmstd),'g+',label='SVM StdErr Lower')
plt.plot(np.log2(samplesize),np.array(svmauc)+np.array(svmstd),'g--',label='SVM StdErr Upper')

plt.plot(np.log2(samplesize),logauc,'r',label='Logistic Mean')
plt.plot(np.log2(samplesize),np.array(logauc)-np.array(logstd),'r+',label='Logistic StdErr Upper')
plt.plot(np.log2(samplesize),np.array(logauc)+np.array(logstd),'r--',label='Logistic StdErr Lower')

plt.legend(loc='best')
ax.set_xlabel('Log2(samplesize)')
ax.set_ylabel('AUC')
plt.show()

3. Which of the two algorithms are more suitable for smaller sample sizes, given the set of features? If it costs twice the investment to run enough experiments to double the data, do you think it is a worthy investment?

**Answer**:

For smaller sample sizes, we prefer SVM over Logistic regression as by using SVM, we gain a chance to obtain a marginally better result than Logistic. The intervals that bind the average case of SVM cover a higher value of AUC but are also lower than the lower bound of Logistic (as seen in the graph). So by choosing SVM, there is a possibility that our model will yield a better result.

However, if it costs twice the investment to run experiments to double the data, we can safely choose Logistic regression as the accuracy gain by SVM is only marginal (~0.01%). For cases where accuracy is precious, such as in healthcare industry, we can adopt SVM as we need all the extra precision that we can get to prevent wrong predictions.

4. Is there a reason why cross-validation might be biased? If so, in what direction is it biased?

**Answer**:

Cross validation is used to remove bias from the dataset but it results in an increase in variance. To remove/ reduce bias, we take the value of K to be generally 3 more. In our case, we took the cross-validation as 10. This results in the development of a model that exhibits low bias due to repeated sampling and averaging of results. Due to this reason, I don't believe that our cross-validation is biased.