Machine learning using Regression

Read the data

Generate a few summary statistics

Data set 1: Rocks vs. Mines

  • Independent variables: sonar soundings at different frequencies
  • Dependent variable (target): Rock or Mine
  • 
    
    In [ ]:
    import pandas as pd
    from pandas import DataFrame
    url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
    df = pd.read_csv(url,header=None)
    df.describe()
    

    See all columns

    
    
    In [ ]:
    pd.options.display.max_columns=70
    df.describe()
    

    Examine the distribution of the data in column 4

  • Quartile 1: from .0067 to .03805
  • Quartile 2: from .03805 to .0625
  • Quartile 3: from .0625 to .100275
  • Quartile 4: from .100275 to .401
  • Quartile 4 covers a much wider range than the other quartiles. This raises the possibility of outliers (the boundaries can be computed directly, as in the sketch below)
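    These boundaries come straight from the quantiles of column 4. A minimal sketch of how to compute them with pandas (assuming df has already been loaded as above):


    In [ ]:
    # quartile boundaries for column 4: min, Q1, median, Q3, max
    df[4].quantile([0, 0.25, 0.5, 0.75, 1.0])
    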

    A quantile-quantile (Q-Q) plot can help identify outliers

  • the x-axis shows the theoretical quantiles of a standard normal distribution (roughly -3 to +3)
  • the y-axis shows the observed values, ordered from lowest to highest
  • if the data were normally distributed, the points would fall on the straight reference line; the closer the points are to the line, the closer the data are to a normal distribution
  • 
    
    In [ ]:
    import numpy as np 
    import pylab 
    import scipy.stats as stats
    import matplotlib
    import matplotlib.pyplot as plt
    matplotlib.style.use('ggplot')
    %matplotlib inline
       
    stats.probplot(df[4], dist="norm", plot=pylab)
    pylab.show()
    

    Examine the dependent variable

    
    
    In [ ]:
    df[60].unique()
    

    Examine correlations

    
    
    In [ ]:
    df.corr()
    
    
    
    In [ ]:
    import matplotlib.pyplot as plot
    plot.pcolor(df.corr())
    plot.show()
    
    
    
    In [ ]:
    df.corr()[0].plot()
    

    Highly correlated independent variables (multicollinearity) = not good!

    Low correlation among the independent variables = good

    High correlation between an independent variable and the target (dv) = good (high predictive power)
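    A minimal sketch of how to check both points numerically (assuming df is the sonar frame loaded above; the 0/1 encoding of column 60 is done here just for this check, and the notebook repeats the same conversion later before modeling):


    In [ ]:
    import numpy as np
    # temporarily encode the target so it can participate in the correlation matrix
    tmp = df.copy()
    tmp[60] = np.where(tmp[60] == 'R', 0, 1)
    corr = tmp.corr()
    # correlation of each independent variable with the target (column 60)
    print(corr[60].drop(60).sort_values(ascending=False).head())
    # count pairs of independent variables that are very highly correlated with each other
    high = corr.iloc[:60, :60].abs() > 0.9
    print((high.sum().sum() - 60) // 2, "feature pairs with |r| > 0.9")
    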

    Data Set 2: Wine data

  • Independent variables: Wine composition (alcohol content, sulphites, acidity, etc.)
  • Dependent variable (target): Taste score (average of a panel of 3 wine tasters)
  • 
    
    In [ ]:
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    import pandas as pd
    from pandas import DataFrame
    w_df = pd.read_csv(url,header=0,sep=';')
    w_df.describe()
    
    
    
    In [ ]:
    w_df['volatile acidity']
    
    
    
    In [ ]:
    w_df.corr()
    
    
    
    In [ ]:
    import matplotlib.pyplot as plot
    plot.pcolor(w_df.corr())
    plot.show()
    

    Examining the correlation of one variable with the others

    
    
    In [ ]:
    w_df.corr()['fixed acidity'].plot()
    

    The pandas scatter_matrix function helps visualize the relationships between features

    Use it with care, though, because it is processor intensive

    
    
    In [ ]:
    from pandas.plotting import scatter_matrix  # pandas.tools.plotting has been removed in newer pandas versions
    p=scatter_matrix(w_df, alpha=0.2, figsize=(12, 12), diagonal='kde')
    

    And we can examine quantile (Q-Q) plots as we did with the rocks and mines data

    
    
    In [ ]:
    import numpy as np 
    import pylab 
    import scipy.stats as stats
    %matplotlib inline
       
    stats.probplot(w_df['alcohol'], dist="norm", plot=pylab)
    pylab.show()
    

    Training a classifier on Rocks vs Mines

    
    
    In [ ]:
    import numpy as np
    import random
    from sklearn import datasets, linear_model
    from sklearn.metrics import roc_curve, auc
    import pylab as pl
    
    
    
    In [ ]:
    import pandas as pd
    from pandas import DataFrame
    url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
    df = pd.read_csv(url,header=None)
    df.describe()
    

    Convert labels R and M to 0 and 1

    
    
    In [ ]:
    df[60]=np.where(df[60]=='R',0,1)
    

    Divide the dataset into training and test samples

    Separate out the x and y variable frames for the train and test samples

    
    
    In [ ]:
    from sklearn.model_selection import train_test_split
    train, test = train_test_split(df, test_size = 0.3)
    x_train = train.iloc[0:,0:60]
    y_train = train[60]
    x_test = test.iloc[0:,0:60]
    y_test = test[60]
    y_train
    

    Build the model and fit the training data

    
    
    In [ ]:
    model = linear_model.LinearRegression()
    model.fit(x_train,y_train)
    

    Interpreting categorical prediction results

  • Precision
  • Recall
  • True positive rate
  • False positive rate
  • Precision-recall curve
  • ROC curve
  • F-score
  • Area under PR curve
  • Area under ROC curve
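    All of these metrics are derived from the same four confusion-matrix counts. The rest of this notebook builds them by hand, but as a reference, here is a sketch of how they map onto scikit-learn helpers (a sketch only, assuming the continuous predictions are thresholded at 0.5; the names preds and labels are introduced just for this illustration):


    In [ ]:
    from sklearn.metrics import (precision_score, recall_score, f1_score,
                                 roc_auc_score, average_precision_score)
    
    preds = model.predict(x_test)          # continuous scores from the regression
    labels = np.where(preds > 0.5, 1, 0)   # threshold at 0.5 to get 0/1 classes
    
    print('Precision :', precision_score(y_test, labels))        # tp / (tp + fp)
    print('Recall/TPR:', recall_score(y_test, labels))           # tp / (tp + fn)
    print('F-score   :', f1_score(y_test, labels))               # harmonic mean of precision and recall
    print('ROC AUC   :', roc_auc_score(y_test, preds))           # area under the ROC curve
    print('PR AUC    :', average_precision_score(y_test, preds)) # area under the precision-recall curve
    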

    
    
    
    

    Generate predictions and compute the in-sample error

    
    
    In [ ]:
    training_predictions = model.predict(x_train)
    print(np.mean((training_predictions - y_train) ** 2))
    
    
    
    In [ ]:
    print('Train R-Square:',model.score(x_train,y_train))
    print('Test R-Square:',model.score(x_test,y_test))
    

    These are horrible!

    But do we really care?

  • Focus on the problem
  • Do we need to recognize both rocks and mines correctly?
  • How do we interpret the predicted y-values?
  • 
    
    In [ ]:
    print(max(training_predictions),min(training_predictions),np.mean(training_predictions))
    

    We want to predict categories: Rocks or Mines

    But we're actually getting a continuous value

    Not the same thing. So R-Square probably doesn't mean a whole lot

    We need to convert the continuous values into categorical 1s and 0s. We can do this by fixing a threshold value between 0 and 1

    Values greater than the threshold are 1 (Mines). Values less than or equal to the threshold are 0 (Rocks)
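    For example, a minimal sketch of the thresholding step (0.5 is just an illustrative cutoff):


    In [ ]:
    # convert continuous predictions into 0/1 labels with an illustrative threshold of 0.5
    predicted_labels = np.where(training_predictions > 0.5, 1, 0)
    predicted_labels[:10]
    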

    Confusion matrix

  • Reports the number of cases in each of four categories:
    1. true positive: predicts mine and is a mine
    2. false positive: predicts mine and is not a mine
    3. true negative: predicts not mine and is not a mine
    4. false negative: predicts not mine but turns out to be a mine (BOOM!)
    5. 
      
      In [ ]:
      def confusion_matrix(predicted, actual, threshold):
          #returns the counts [tp, fn, fp, tn] for continuous predictions at a given threshold
          if len(predicted) != len(actual): return -1
          tp = 0.0
          fp = 0.0
          tn = 0.0
          fn = 0.0
          for i in range(len(actual)):
              if actual[i] > 0.5: #labels that are 1.0 (positive examples, i.e. mines)
                  if predicted[i] > threshold:
                      tp += 1.0 #correctly predicted positive
                  else:
                      fn += 1.0 #incorrectly predicted negative
              else:              #labels that are 0.0 (negative examples, i.e. rocks)
                  if predicted[i] <= threshold: #at or below the threshold counts as a rock
                      tn += 1.0 #correctly predicted negative
                  else:
                      fp += 1.0 #incorrectly predicted positive
          rtn = [tp, fn, fp, tn]

          return rtn
      
      
      
      In [ ]:
      testing_predictions = model.predict(x_test)
      
      
      
      In [ ]:
      testing_predictions = model.predict(x_test)
      confusion_matrix(testing_predictions,np.array(y_test),0.5)
      

      Misclassification rate = (fp + fn)/number of cases

      
      
      In [ ]:
      cm = confusion_matrix(testing_predictions,np.array(y_test),0.5)
      misclassification_rate = (cm[1] + cm[2])/len(y_test)
      misclassification_rate
      

      Precision and Recall

      
      
      In [ ]:
      [tp, fn, fp, tn] = confusion_matrix(testing_predictions,np.array(y_test),0.5)
      precision = tp/(tp+fp)
      recall = tp/(tp+fn)
      f_score = 2 * (precision * recall)/(precision + recall)
      print(precision,recall,f_score)
      

      The confusion matrix (and hence precision, recall, etc.) depends on the selected threshold

      As the threshold changes, we have to trade off precision against recall

      
      
      In [ ]:
      [tp, fn, fp, tn] = confusion_matrix(testing_predictions,np.array(y_test),0.9)
      precision = tp/(tp+fp)
      recall = tp/(tp+fn)
      f_score = 2 * (precision * recall)/(precision + recall)
      print(precision,recall,f_score)
      
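      A small sketch that sweeps several thresholds to make this trade-off visible (reusing the confusion_matrix helper and testing_predictions defined above):


      In [ ]:
      # sweep a few thresholds and watch precision rise while recall falls
      for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
          [tp, fn, fp, tn] = confusion_matrix(testing_predictions, np.array(y_test), t)
          precision = tp/(tp+fp) if (tp+fp) > 0 else float('nan')
          recall = tp/(tp+fn) if (tp+fn) > 0 else float('nan')
          print('threshold', t, 'precision', round(precision, 3), 'recall', round(recall, 3))
      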

      ROC: Receiver Operating Characteristic

      An ROC curve shows the performance of a binary classifier as the threshold varies. It is built from two series:

      1. False positive rate (FPR), also called fall-out or false-alarm rate = False Positives/(True Negatives + False Positives)
        • Or, what proportion of rocks are identified as mines

      2. True positive rate (TPR), also called sensitivity or recall = True Positives/(True Positives + False Negatives)
        • Or, what proportion of actual mines are identified as mines

        • Let's first plot the predictions against actuals

          The goal is to see if our classifier has discriminated at all

          
          
          In [ ]:
          positives = list()
          negatives = list()
          actual = np.array(y_train)
          for i in range(len(y_train)):
              
              if actual[i]:
                  positives.append(training_predictions[i])
              else:
                  negatives.append(training_predictions[i])
          
          
          
          In [ ]:
          df_p = pd.DataFrame(positives)
          df_n = pd.DataFrame(negatives)
          fig, ax = plt.subplots()
          a_heights, a_bins = np.histogram(df_p)
          b_heights, b_bins = np.histogram(df_n, bins=a_bins)
          width = (a_bins[1] - a_bins[0])/3
          ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
          ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
          

          Repeat for the holdout sample

          
          
          In [ ]:
          positives = list()
          negatives = list()
          actual = np.array(y_test)
          for i in range(len(y_test)):
              
              if actual[i]:
                  positives.append(testing_predictions[i])
              else:
                  negatives.append(testing_predictions[i])
          df_p = pd.DataFrame(positives)
          df_n = pd.DataFrame(negatives)
          fig, ax = plt.subplots()
          a_heights, a_bins = np.histogram(df_p)
          b_heights, b_bins = np.histogram(df_n, bins=a_bins)
          width = (a_bins[1] - a_bins[0])/3
          ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
          ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
          

          Drawing the ROC Curve

          sklearn has a function roc_curve that does this for us

          
          
          In [ ]:
          from sklearn.metrics import roc_curve, auc
          

          In-sample ROC Curve

          
          
          In [ ]:
          (fpr, tpr, thresholds) = roc_curve(y_train,training_predictions)
          area = auc(fpr,tpr)
          pl.clf() #Clear the current figure
          pl.plot(fpr,tpr,label="In-Sample ROC Curve with area = %1.2f"%area)
          
          pl.plot([0, 1], [0, 1], 'k') #This plots the random-guess (diagonal) line
          pl.xlim([0.0, 1.0])
          pl.ylim([0.0, 1.0])
          pl.xlabel('False Positive Rate')
          pl.ylabel('True Positive Rate')
          pl.title('In sample ROC rocks versus mines')
          pl.legend(loc="lower right")
          pl.show()
          

          Out-of-sample ROC curve

          
          
          In [ ]:
          (fpr, tpr, thresholds) = roc_curve(y_test,testing_predictions)
          area = auc(fpr,tpr)
          pl.clf() #Clear the current figure
          pl.plot(fpr,tpr,label="Out-Sample ROC Curve with area = %1.2f"%area)
          
          pl.plot([0, 1], [0, 1], 'k')
          pl.xlim([0.0, 1.0])
          pl.ylim([0.0, 1.0])
          pl.xlabel('False Positive Rate')
          pl.ylabel('True Positive Rate')
          pl.title('Out sample ROC rocks versus mines')
          pl.legend(loc="lower right")
          pl.show()
          
          
          
          In [ ]:
          (fpr, tpr, thresholds)
          

          So, what threshold should we actually use?

          ROC curves and AUC give you a sense of how good your classifier is and how sensitive it is to changes in the threshold

          A classifier whose performance is too sensitive to the threshold is not good

          Example: Let's say

        • Everything classified as a rock needs to be checked with a hand scanner, at a cost of 200 per scan
        • Everything classified as a mine needs to be defused, at a cost of 1000 if it is a real mine or 300 if it turns out to be a rock
        • 
          
          In [ ]:
          cm = confusion_matrix(testing_predictions,np.array(y_test),.1)
          cost1 = 1000*cm[0] + 300 * cm[2] + 200 * cm[1] + 200 * cm[3]
          cm = confusion_matrix(testing_predictions,np.array(y_test),.9)
          cost2 = 1000*cm[0] + 300 * cm[2] + 200 * cm[1] + 200 * cm[3]
          
          print(cost1,cost2)
          

          Example: Let's say

        • Everything classified as a rock will be assumed to be a rock; if that assumption is wrong, it will cost 5000 in injuries
        • Everything classified as a mine will be left as is (no one will walk on it!)
        • 
          
          In [ ]:
          cm = confusion_matrix(testing_predictions,np.array(y_test),.1)
          cost1 = 0*cm[0] + 0 * cm[2] + 5000 * cm[1] + 0 * cm[3]
          cm = confusion_matrix(testing_predictions,np.array(y_test),.9)
          cost2 = 0*cm[0] + 0 * cm[2] + 5000 * cm[1] + 0 * cm[3]
          print(cost1,cost2)
          
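          A small sketch that sweeps thresholds to find the cheapest one under a given cost structure (using the scan/defuse costs from the first example above):


          In [ ]:
          # find the cost-minimizing threshold for the scan/defuse cost structure
          costs = []
          for t in np.arange(0.1, 0.95, 0.05):
              [tp, fn, fp, tn] = confusion_matrix(testing_predictions, np.array(y_test), t)
              cost = 1000*tp + 300*fp + 200*fn + 200*tn
              costs.append((cost, round(t, 2)))
          print(min(costs))   # (lowest cost, threshold that achieves it)
          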

          Bottom line: the right threshold depends on factors from your domain

          
          