Machine learning using Regression

Read the data

Generate a few summary statistics

Data set 1: Rocks vs. Mines

  • Independent variables: sonar soundings at different frequencies
  • Dependent variable (target): Rock or Mine
  • 
    
    In [ ]:
    import pandas as pd
    from pandas import DataFrame
    url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
    df = pd.read_csv(url,header=None)
    df.describe()
    

    See all columns

    
    
    In [ ]:
    pd.options.display.max_columns=70
    df.describe()
    

    Examine the distribution of the data in column 4

  • Quartile 1: from .0067 to .03805
  • Quartile 2: from .03805 to .0625
  • Quartile 3: from .0625 to .100275
  • Quartile 4: from .100275 to .401
  • Quartile 4 covers a much wider range than the other quartiles. This raises the possibility of outliers (the boundaries can be computed directly, as in the sketch below)
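    These boundaries come straight from the quantiles of column 4. A minimal sketch of how to compute them with pandas (assuming df has already been loaded as above):


    In [ ]:
    # quartile boundaries for column 4: min, Q1, median, Q3, max
    df[4].quantile([0, 0.25, 0.5, 0.75, 1.0])
    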

    A quantile-quantile (Q-Q) plot can help identify outliers

  • the x-axis shows the theoretical quantiles of a standard normal distribution (roughly -3 to +3)
  • the y-axis shows the observed values, ordered from lowest to highest
  • if the data were normally distributed, the points would fall on the straight reference line; the closer the points are to the line, the closer the data are to a normal distribution
  • 
    
    In [ ]:
    import numpy as np 
    import pylab 
    import scipy.stats as stats
    import matplotlib
    import matplotlib.pyplot as plt
    matplotlib.style.use('ggplot')
    %matplotlib inline
       
    stats.probplot(df[4], dist="norm", plot=pylab)
    pylab.show()
    

    Examine the dependent variable

    
    
    In [ ]:
    df[60].unique()
    

    Examine correlations

    
    
    In [ ]:
    df.corr()
    
    
    
    In [ ]:
    import matplotlib.pyplot as plot
    plot.pcolor(df.corr())
    plot.show()
    
    
    
    In [ ]:
    df.corr()[0].plot()
    

    Highly correlated independent variables (multicollinearity) = not good!

    Low correlation among the independent variables = good

    High correlation between an independent variable and the target (dv) = good (high predictive power)
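    A minimal sketch of how to check both points numerically (assuming df is the sonar frame loaded above; the 0/1 encoding of column 60 is done here just for this check, and the notebook repeats the same conversion later before modeling):


    In [ ]:
    import numpy as np
    # temporarily encode the target so it can participate in the correlation matrix
    tmp = df.copy()
    tmp[60] = np.where(tmp[60] == 'R', 0, 1)
    corr = tmp.corr()
    # correlation of each independent variable with the target (column 60)
    print(corr[60].drop(60).sort_values(ascending=False).head())
    # count pairs of independent variables that are very highly correlated with each other
    high = corr.iloc[:60, :60].abs() > 0.9
    print((high.sum().sum() - 60) // 2, "feature pairs with |r| > 0.9")
    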

    Data Set 2: Wine data

  • Independent variables: Wine composition (alcohol content, sulphites, acidity, etc.)
  • Dependent variable (target): Taste score (average of a panel of 3 wine tasters)
  • 
    
    In [ ]:
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    import pandas as pd
    from pandas import DataFrame
    w_df = pd.read_csv(url,header=0,sep=';')
    w_df.describe()
    
    
    
    In [ ]:
    w_df['volatile acidity']
    
    
    
    In [ ]:
    w_df.corr()
    
    
    
    In [ ]:
    import matplotlib.pyplot as plot
    plot.pcolor(w_df.corr())
    plot.show()
    

    Examining the correlation of one variable with the others

    
    
    In [ ]:
    w_df.corr()['fixed acidity'].plot()
    

    The pandas scatter_matrix function helps visualize the relationships between features

    Use it with care, though, because it is processor intensive

    
    
    In [ ]:
    from pandas.plotting import scatter_matrix  # pandas.tools.plotting has been removed in newer pandas versions
    p=scatter_matrix(w_df, alpha=0.2, figsize=(12, 12), diagonal='kde')
    

    And we can examine quantile (Q-Q) plots as we did with the rocks and mines data

    
    
    In [ ]:
    import numpy as np 
    import pylab 
    import scipy.stats as stats
    %matplotlib inline
       
    stats.probplot(w_df['alcohol'], dist="norm", plot=pylab)
    pylab.show()
    

    Training a classifier on Rocks vs Mines

    
    
    In [ ]:
    import numpy as np
    import random
    from sklearn import datasets, linear_model
    from sklearn.metrics import roc_curve, auc
    import pylab as pl
    
    
    
    In [ ]:
    import pandas as pd
    from pandas import DataFrame
    url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
    df = pd.read_csv(url,header=None)
    df.describe()
    

    Convert labels R and M to 0 and 1

    
    
    In [ ]:
    df[60]=np.where(df[60]=='R',0,1)
    

    Divide the dataset into training and test samples

    Separate out the x and y variable frames for the train and test samples

    
    
    In [ ]:
    from sklearn.model_selection import train_test_split
    train, test = train_test_split(df, test_size = 0.3)
    x_train = train.iloc[0:,0:60]
    y_train = train[60]
    x_test = test.iloc[0:,0:60]
    y_test = test[60]
    y_train
    

    Build the model and fit the training data

    
    
    In [ ]:
    model = linear_model.LinearRegression()
    model.fit(x_train,y_train)
    

    Interpreting categorical prediction results

  • Precision
  • Recall
  • True positive rate
  • False positive rate
  • Precision-recall curve
  • ROC curve
  • F-score
  • Area under PR curve
  • Area under ROC curve
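    All of these metrics are derived from the same four confusion-matrix counts. The rest of this notebook builds them by hand, but as a reference, here is a sketch of how they map onto scikit-learn helpers (a sketch only, assuming the continuous predictions are thresholded at 0.5; the names preds and labels are introduced just for this illustration):


    In [ ]:
    from sklearn.metrics import (precision_score, recall_score, f1_score,
                                 roc_auc_score, average_precision_score)
    
    preds = model.predict(x_test)          # continuous scores from the regression
    labels = np.where(preds > 0.5, 1, 0)   # threshold at 0.5 to get 0/1 classes
    
    print('Precision :', precision_score(y_test, labels))        # tp / (tp + fp)
    print('Recall/TPR:', recall_score(y_test, labels))           # tp / (tp + fn)
    print('F-score   :', f1_score(y_test, labels))               # harmonic mean of precision and recall
    print('ROC AUC   :', roc_auc_score(y_test, preds))           # area under the ROC curve
    print('PR AUC    :', average_precision_score(y_test, preds)) # area under the precision-recall curve
    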

    
    
    
    

    Generate predictions and compute the in-sample error

    
    
    In [ ]:
    training_predictions = model.predict(x_train)
    print(np.mean((training_predictions - y_train) ** 2))
    
    
    
    In [ ]:
    print('Train R-Square:',model.score(x_train,y_train))
    print('Test R-Square:',model.score(x_test,y_test))
    

    These are horrible!

    But do we really care?

  • Focus on the problem
  • Do we need to recognize both rocks and mines correctly?
  • How do we interpret the predicted y-values?
  • 
    
    In [ ]:
    print(max(training_predictions),min(training_predictions),np.mean(training_predictions))
    

    We want to predict categories: Rocks or Mines

    But we're actually getting a continuous value

    Not the same thing. So R-Square probably doesn't mean a whole lot

    We need to convert the continuous values into categorical 1s and 0s. We can do this by fixing a threshold value between 0 and 1

    Values greater than the threshold are 1 (Mines). Values less than or equal to the threshold are 0 (Rocks)
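    For example, a minimal sketch of the thresholding step (0.5 is just an illustrative cutoff):


    In [ ]:
    # convert continuous predictions into 0/1 labels with an illustrative threshold of 0.5
    predicted_labels = np.where(training_predictions > 0.5, 1, 0)
    predicted_labels[:10]
    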

    Confusion matrix

  • Reports the number of cases in each of four categories:
    1. true positive: predicts mine and is a mine
    2. false positive: predicts mine and is not a mine
    3. true negative: predicts not mine and is not a mine
    4. false negative: predicts not mine but turns out to be a mine (BOOM!)
    5. 
      
      In [ ]:
      def confusion_matrix(predicted, actual, threshold):
          #returns the counts [tp, fn, fp, tn] for continuous predictions at a given threshold
          if len(predicted) != len(actual): return -1
          tp = 0.0
          fp = 0.0
          tn = 0.0
          fn = 0.0
          for i in range(len(actual)):
              if actual[i] > 0.5: #labels that are 1.0 (positive examples, i.e. mines)
                  if predicted[i] > threshold:
                      tp += 1.0 #correctly predicted positive
                  else:
                      fn += 1.0 #incorrectly predicted negative
              else:              #labels that are 0.0 (negative examples, i.e. rocks)
                  if predicted[i] <= threshold: #at or below the threshold counts as a rock
                      tn += 1.0 #correctly predicted negative
                  else:
                      fp += 1.0 #incorrectly predicted positive
          rtn = [tp, fn, fp, tn]

          return rtn
      
      
      
      In [ ]:
      testing_predictions = model.predict(x_test)
      
      
      
      In [ ]:
      testing_predictions = model.predict(x_test)
      confusion_matrix(testing_predictions,np.array(y_test),0.5)
      

      Misclassification rate = (fp + fn)/number of cases

      
      
      In [ ]:
      cm = confusion_matrix(testing_predictions,np.array(y_test),0.5)
      misclassification_rate = (cm[1] + cm[2])/len(y_test)
      misclassification_rate
      

      Precision and Recall

      
      
      In [ ]:
      [tp, fn, fp, tn] = confusion_matrix(testing_predictions,np.array(y_test),0.5)
      precision = tp/(tp+fp)
      recall = tp/(tp+fn)
      f_score = 2 * (precision * recall)/(precision + recall)
      print(precision,recall,f_score)
      

      The confusion matrix (and hence precision, recall, etc.) depends on the selected threshold

      As the threshold changes, we have to trade off precision against recall

      
      
      In [ ]:
      [tp, fn, fp, tn] = confusion_matrix(testing_predictions,np.array(y_test),0.9)
      precision = tp/(tp+fp)
      recall = tp/(tp+fn)
      f_score = 2 * (precision * recall)/(precision + recall)
      print(precision,recall,f_score)
      
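      A small sketch that sweeps several thresholds to make this trade-off visible (reusing the confusion_matrix helper and testing_predictions defined above):


      In [ ]:
      # sweep a few thresholds and watch precision rise while recall falls
      for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
          [tp, fn, fp, tn] = confusion_matrix(testing_predictions, np.array(y_test), t)
          precision = tp/(tp+fp) if (tp+fp) > 0 else float('nan')
          recall = tp/(tp+fn) if (tp+fn) > 0 else float('nan')
          print('threshold', t, 'precision', round(precision, 3), 'recall', round(recall, 3))
      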

      ROC: Receiver Operating Characteristic

      An ROC curve shows the performance of a binary classifier as the threshold varies. It is built from two series:

      1. False positive rate (FPR), also called fall-out or false-alarm rate = False Positives/(True Negatives + False Positives)
        • Or, what proportion of rocks are identified as mines

      2. True positive rate (TPR), also called sensitivity or recall = True Positives/(True Positives + False Negatives)
        • Or, what proportion of actual mines are identified as mines

        • Let's first plot the predictions against actuals

          The goal is to see if our classifier has discriminated at all

          
          
          In [ ]:
          positives = list()
          negatives = list()
          actual = np.array(y_train)
          for i in range(len(y_train)):
              
              if actual[i]:
                  positives.append(training_predictions[i])
              else:
                  negatives.append(training_predictions[i])
          
          
          
          In [ ]:
          df_p = pd.DataFrame(positives)
          df_n = pd.DataFrame(negatives)
          fig, ax = plt.subplots()
          a_heights, a_bins = np.histogram(df_p)
          b_heights, b_bins = np.histogram(df_n, bins=a_bins)
          width = (a_bins[1] - a_bins[0])/3
          ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
          ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
          

          Repeat for the holdout sample

          
          
          In [ ]:
          positives = list()
          negatives = list()
          actual = np.array(y_test)
          for i in range(len(y_test)):
              
              if actual[i]:
                  positives.append(testing_predictions[i])
              else:
                  negatives.append(testing_predictions[i])
          df_p = pd.DataFrame(positives)
          df_n = pd.DataFrame(negatives)
          fig, ax = plt.subplots()
          a_heights, a_bins = np.histogram(df_p)
          b_heights, b_bins = np.histogram(df_n, bins=a_bins)
          width = (a_bins[1] - a_bins[0])/3
          ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
          ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
          

          Drawing the ROC Curve

          sklearn has a function roc_curve that does this for us

          
          
          In [ ]:
          from sklearn.metrics import roc_curve, auc
          

          In-sample ROC Curve

          
          
          In [ ]:
          (fpr, tpr, thresholds) = roc_curve(y_train,training_predictions)
          area = auc(fpr,tpr)
          pl.clf() #Clear the current figure
          pl.plot(fpr,tpr,label="In-Sample ROC Curve with area = %1.2f"%area)
          
          pl.plot([0, 1], [0, 1], 'k') #This plots the random-guess (diagonal) line
          pl.xlim([0.0, 1.0])
          pl.ylim([0.0, 1.0])
          pl.xlabel('False Positive Rate')
          pl.ylabel('True Positive Rate')
          pl.title('In sample ROC rocks versus mines')
          pl.legend(loc="lower right")
          pl.show()
          

          Out-of-sample ROC curve

          
          
          In [ ]:
          (fpr, tpr, thresholds) = roc_curve(y_test,testing_predictions)
          area = auc(fpr,tpr)
          pl.clf() #Clear the current figure
          pl.plot(fpr,tpr,label="Out-Sample ROC Curve with area = %1.2f"%area)
          
          pl.plot([0, 1], [0, 1], 'k')
          pl.xlim([0.0, 1.0])
          pl.ylim([0.0, 1.0])
          pl.xlabel('False Positive Rate')
          pl.ylabel('True Positive Rate')
          pl.title('Out sample ROC rocks versus mines')
          pl.legend(loc="lower right")
          pl.show()
          
          
          
          In [ ]:
          (fpr, tpr, thresholds)
          

          So, what threshold should we actually use?

          ROC curves and AUC give you a sense of how good your classifier is and how sensitive it is to changes in the threshold

          A classifier whose performance is too sensitive to the threshold is not good

          Example: Let's say

        • Everything classified as a rock needs to be checked with a hand scanner, at a cost of 200 per scan
        • Everything classified as a mine needs to be defused, at a cost of 1000 if it is a real mine or 300 if it turns out to be a rock
        • 
          
          In [ ]:
          cm = confusion_matrix(testing_predictions,np.array(y_test),.1)
          cost1 = 1000*cm[0] + 300 * cm[2] + 200 * cm[1] + 200 * cm[3]
          cm = confusion_matrix(testing_predictions,np.array(y_test),.9)
          cost2 = 1000*cm[0] + 300 * cm[2] + 200 * cm[1] + 200 * cm[3]
          
          print(cost1,cost2)
          

          Example: Let's say

        • Everything classified as a rock will be assumed to be a rock; if that assumption is wrong, it will cost 5000 in injuries
        • Everything classified as a mine will be left as is (no one will walk on it!)
        • 
          
          In [ ]:
          cm = confusion_matrix(testing_predictions,np.array(y_test),.1)
          cost1 = 0*cm[0] + 0 * cm[2] + 5000 * cm[1] + 0 * cm[3]
          cm = confusion_matrix(testing_predictions,np.array(y_test),.9)
          cost2 = 0*cm[0] + 0 * cm[2] + 5000 * cm[1] + 0 * cm[3]
          print(cost1,cost2)
          
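          A small sketch that sweeps thresholds to find the cheapest one under a given cost structure (using the scan/defuse costs from the first example above):


          In [ ]:
          # find the cost-minimizing threshold for the scan/defuse cost structure
          costs = []
          for t in np.arange(0.1, 0.95, 0.05):
              [tp, fn, fp, tn] = confusion_matrix(testing_predictions, np.array(y_test), t)
              cost = 1000*tp + 300*fp + 200*fn + 200*tn
              costs.append((cost, round(t, 2)))
          print(min(costs))   # (lowest cost, threshold that achieves it)
          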

          Bottom line: the right threshold depends on factors from your domain

          
          