Project 1: Digit Classification with KNN and Naive Bayes

In this project, you'll implement your own image recognition system for classifying digits. Read through the code and the instructions carefully and add your own code where indicated. Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up (with your own code).

If you're interested, check out these links related to digit recognition:

Yann Lecun's MNIST benchmarks: http://yann.lecun.com/exdb/mnist/

Stanford Streetview research and data: http://ufldl.stanford.edu/housenumbers/



In [1]:

    
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Import a bunch of libraries.
import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_mldata
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

# Set the randomizer seed so results are the same each time.
np.random.seed(0)

Load the data. Notice that we are splitting the data into training, development, and test. We also have a small subset of the training data called mini_train_data and mini_train_labels that you should use in all the experiments below, unless otherwise noted.



In [2]:

    
# Load the digit data either from mldata.org, or once downloaded to data_home, from disk. The data is about 53MB so this cell
# should take a while the first time your run it.
mnist = fetch_mldata('MNIST original', data_home='~/datasets/mnist')
X, Y = mnist.data, mnist.target

# Rescale grayscale values to [0,1].
X = X / 255.0

# Shuffle the input: create a random permutation of the integers between 0 and the number of data points and apply this
# permutation to X and Y.
# NOTE: Each time you run this cell, you'll re-shuffle the data, resulting in a different ordering.
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

print 'data shape: ', X.shape
print 'label shape:', Y.shape

# Set some variables to hold test, dev, and training data.
test_data, test_labels = X[61000:], Y[61000:]
dev_data, dev_labels = X[60000:61000], Y[60000:61000]
train_data, train_labels = X[:60000], Y[:60000]
mini_train_data, mini_train_labels = X[:1000], Y[:1000]









    



data shape:  (70000, 784)
label shape: (70000,)

(1) Create a 10x10 grid to visualize 10 examples of each digit. Python hints:

plt.rc() for setting the colormap, for example to black and white
plt.subplot() for creating subplots
plt.imshow() for rendering a matrix
np.array.reshape() for reshaping a 1D feature vector into a 2D matrix (for rendering)



In [13]:

    
#def P1(num_examples=10):

### STUDENT START ###
# Credit where due... some inspiration drawn from:
# https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/mnist.py

# example_as_pixel_matrix():
#   transforms a 784 element pixel into a 28 x 28 pixel matrix
def example_as_pixel_matrix(example):
    return np.reshape(example, (-1, 28))

# add_example_to_figure():
#   given an existing figure, number of rows, columns, and position,
#   adds a subplot with the example to the figure
def add_example_to_figure(example, 
                     figure, 
                     subplot_rows, 
                     subplot_cols, 
                     subplot_number):
    matrix = example_as_pixel_matrix(example)

    subplot = figure.add_subplot(subplot_rows, subplot_cols, subplot_number)
    subplot.imshow(matrix, cmap='Greys', interpolation='Nearest')
    # disable tick marks
    subplot.set_xticks(np.array([]))
    subplot.set_yticks(np.array([]))

# plot_examples():
#   given a matrix of examples (digit, example#) => example, 
#   plots it with digits as rows and examples as columns
def plot_examples(examples):
    
    figure = plt.figure()
    
    shape = np.shape(examples)
    rows = shape[0]
    columns = shape[1]
    
    subplot_index = 1
    
    for digit, examples_for_digit in enumerate(examples):
        for example_index, example in enumerate(examples_for_digit):
            add_example_to_figure(example, 
                                  figure, 
                                  rows, 
                                  columns, 
                                  subplot_index
                                 )
            subplot_index = subplot_index + 1
    
    figure.tight_layout()
    plt.show()

# plot_one_example():
#   given an example, plots only that example, typically
#   for debugging or diagnostics
def plot_one_example(example):  
    examples = [ [ example ] ]
    plot_examples(examples)

# select_indices_of_digit():
#   given an array of digit lables, selects the indices of
#   labels that match a desired digit
def select_indices_of_digit(labels, digit):
    return [i for i, label in enumerate(labels) if label == digit]

# take_n_from():
#   code readability sugar for taking a number of elements from an array
def take_n_from(count, array):
    return array[:count]

# take_n_examples_by_digit():
#   given a data set of examples, a label set, and a parameter n,
#   creates a matrix where the rows are the digits 0-9, and the
#   columns are the first n examples of each digit
def take_n_examples_by_digit(data, labels, n):
    examples = [
        data[take_n_from(n, select_indices_of_digit(labels, digit))]
        for digit in range(10)
    ]
    return examples

def P1(num_examples=10):
    examples = take_n_examples_by_digit(mini_train_data, mini_train_labels, num_examples)
    plot_examples(examples)

P1(10)
### STUDENT END ###

#P1(10)

(2) Evaluate a K-Nearest-Neighbors model with k = [1,3,5,7,9] using the mini training set. Report accuracy on the dev set. For k=1, show precision, recall, and F1 for each label. Which is the most difficult digit?

KNeighborsClassifier() for fitting and predicting
classification_report() for producing precision, recall, F1 results



In [5]:

    
#def P2(k_values):

### STUDENT START ###

from sklearn.metrics import accuracy_score

# apply_k_nearest_neighbors():
#   given the parameter k, training data and labels, and development data and labels,
#   fit a k nearest neighbors classifier using the training data, 
#   test using development data, and output a report
def apply_k_nearest_neighbors(k,
                              training_data,
                              training_labels,
                              development_data,
                              development_labels):

    neigh = KNeighborsClassifier(n_neighbors = k)
    neigh.fit(training_data, training_labels)
    
    predicted_labels = neigh.predict(development_data)
    
    target_names = [ str(i) for i in range(10) ]
    
    print '============ Classification report for k = ' + str(k) + ' ============'
    print ''
    print(classification_report(
            development_labels, 
            predicted_labels, 
            target_names = target_names))
    
    return accuracy_score(development_labels, predicted_labels, normalize = True)

def P2(k_values):
    return [
        apply_k_nearest_neighbors(k,
                                  mini_train_data,
                                  mini_train_labels,
                                  dev_data,
                                  dev_labels)
        for k in k_values
    ]

k_values = [1, 3, 5, 7, 9]
P2(k_values)

### STUDENT END ###

#k_values = [1, 3, 5, 7, 9]
#P2(k_values)









    



============ Classification report for k = 1 ============

             precision    recall  f1-score   support

          0       0.91      0.98      0.94        99
          1       0.89      1.00      0.94       105
          2       0.99      0.79      0.88       102
          3       0.77      0.87      0.82        86
          4       0.89      0.82      0.85       104
          5       0.93      0.84      0.88        91
          6       0.94      0.96      0.95        98
          7       0.89      0.92      0.90       113
          8       0.94      0.88      0.91        96
          9       0.78      0.82      0.80       106

avg / total       0.89      0.89      0.89      1000

============ Classification report for k = 3 ============

             precision    recall  f1-score   support

          0       0.90      1.00      0.95        99
          1       0.81      1.00      0.89       105
          2       0.95      0.81      0.88       102
          3       0.69      0.84      0.75        86
          4       0.88      0.85      0.86       104
          5       0.94      0.79      0.86        91
          6       0.97      0.96      0.96        98
          7       0.92      0.88      0.90       113
          8       0.96      0.79      0.87        96
          9       0.84      0.85      0.85       106

avg / total       0.89      0.88      0.88      1000

============ Classification report for k = 5 ============

             precision    recall  f1-score   support

          0       0.92      0.98      0.95        99
          1       0.78      1.00      0.88       105
          2       0.98      0.80      0.88       102
          3       0.76      0.84      0.80        86
          4       0.87      0.83      0.85       104
          5       0.93      0.82      0.87        91
          6       0.94      0.95      0.94        98
          7       0.87      0.87      0.87       113
          8       0.96      0.77      0.86        96
          9       0.77      0.82      0.79       106

avg / total       0.88      0.87      0.87      1000

============ Classification report for k = 7 ============

             precision    recall  f1-score   support

          0       0.91      0.98      0.94        99
          1       0.77      1.00      0.87       105
          2       0.99      0.76      0.86       102
          3       0.79      0.87      0.83        86
          4       0.90      0.81      0.85       104
          5       0.95      0.81      0.88        91
          6       0.91      0.93      0.92        98
          7       0.83      0.88      0.85       113
          8       0.95      0.78      0.86        96
          9       0.77      0.81      0.79       106

avg / total       0.87      0.86      0.86      1000

============ Classification report for k = 9 ============

             precision    recall  f1-score   support

          0       0.91      0.98      0.94        99
          1       0.73      1.00      0.84       105
          2       0.97      0.75      0.85       102
          3       0.80      0.85      0.82        86
          4       0.91      0.81      0.86       104
          5       0.97      0.79      0.87        91
          6       0.92      0.93      0.92        98
          7       0.83      0.88      0.85       113
          8       0.94      0.79      0.86        96
          9       0.78      0.84      0.81       106

avg / total       0.88      0.86      0.86      1000







    Out[5]:





[0.88800000000000001,
 0.878,
 0.86899999999999999,
 0.86499999999999999,
 0.86299999999999999]

ANSWER: The most difficult digit is 9, as measured by f1-score

(3) Using k=1, report dev set accuracy for the training set sizes below. Also, measure the amount of time needed for prediction with each training size.

time.time() gives a wall clock value you can use for timing operations



In [5]:

    
#def P3(train_sizes, accuracies):

### STUDENT START ###
# k_nearest_neighbors_timed_accuracy():
#   given the parameter k, training data and labels, and development data and labels,
#   fit a k nearest neighbors classifier using the training data, 
#   test using development data, and return the number of examples, prediction time,
#   and accuracy as a Python dictionary
def k_nearest_neighbors_timed_accuracy(k,
                                       training_data,
                                       training_labels,
                                       development_data,
                                       development_labels):

    neigh = KNeighborsClassifier(n_neighbors = k)
    neigh.fit(training_data, training_labels)
    
    start = time.time()
    predicted_labels = neigh.predict(development_data)
    end = time.time()
    
    examples, dimensions = np.shape(training_data)
        
    accuracy = accuracy_score(development_labels, predicted_labels, normalize = True)
    
    return { 'examples' : examples, 'time' : end-start, 'accuracy' : accuracy }

def P3(train_sizes, accuracies):
    k = 1
    for train_size in train_sizes:
        # sample train_size examples from the training set
        current_train_data, current_train_labels = X[:train_size], Y[:train_size]
        
        results = k_nearest_neighbors_timed_accuracy(k,
                                                     current_train_data,
                                                     current_train_labels,
                                                     dev_data,
                                                     dev_labels)
        print(results)
        accuracies.append(results['accuracy'])

train_sizes = [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25000]
accuracies = [ ]
P3(train_sizes, accuracies)       
### STUDENT END ###

#train_sizes = [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25000]
#accuracies = []
#P3(train_sizes, accuracies)









    



{'accuracy': 0.71999999999999997, 'examples': 100, 'time': 0.13258910179138184}
{'accuracy': 0.78600000000000003, 'examples': 200, 'time': 0.24303507804870605}
{'accuracy': 0.84099999999999997, 'examples': 400, 'time': 0.5572900772094727}
{'accuracy': 0.88400000000000001, 'examples': 800, 'time': 1.0725040435791016}
{'accuracy': 0.90200000000000002, 'examples': 1600, 'time': 3.7206990718841553}
{'accuracy': 0.92600000000000005, 'examples': 3200, 'time': 4.9908530712127686}
{'accuracy': 0.93700000000000006, 'examples': 6400, 'time': 11.849493026733398}
{'accuracy': 0.95899999999999996, 'examples': 12800, 'time': 18.47224998474121}
{'accuracy': 0.96999999999999997, 'examples': 25000, 'time': 37.5604829788208}

(4) Fit a regression model that predicts accuracy from training size. What does it predict for n=60000? What's wrong with using regression here? Can you apply a transformation that makes the predictions more reasonable?

Remember that the sklearn fit() functions take an input matrix X and output vector Y. So each input example in X is a vector, even if it contains only a single value.



In [7]:

    
#def P4():

### STUDENT START ###

from sklearn.linear_model import LogisticRegression

# fit_linear_regression():
#   given arrays of training data sizes and corresponding accuracies,
#   train and return a linear regression model for predicting accuracies
def fit_linear_regression(train_sizes, accuracies):
    train_sizes_matrix = [ [ train_size ] for train_size in train_sizes ]
    
    linear = LinearRegression()
    linear.fit(train_sizes_matrix, accuracies)
    
    return linear

# fit_logistic_regression():
#   given arrays of training data sizes and corresponding accuracies,
#   train and return a logistic regression model for predicting accuracies
def fit_logistic_regression(train_sizes, accuracies):
    train_sizes_matrix = [ [ train_size ] for train_size in train_sizes ]
    
    logistic = LogisticRegression()
    logistic.fit(train_sizes_matrix, accuracies)
    
    return logistic

def P4():
    full_training_size = 60000
    
    linear = fit_linear_regression(train_sizes, accuracies)
    linear_prediction = linear.predict(full_training_size)
    print('Linear model prediction for ' 
          + str(full_training_size) + ' : ' + str(linear_prediction[0]))
    
    logistic = fit_logistic_regression(train_sizes, accuracies)
    logistic_prediction = logistic.predict(full_training_size)
    print('Logistic model prediction for ' 
          + str(full_training_size) + ' : ' + str(logistic_prediction[0]))
    
P4()

### STUDENT END ###

#P4()









    



Linear model prediction for 60000 : 1.24307226036
Logistic model prediction for 60000 : 0.97

ANSWER: OLS/Linear models aren't designed to respect probibility range (0,1) and can produce probabilities > 1 or < 0 (e.g. 1.24). A Logistic Regression model is a great straightforward fix as it produces predictions in valid probability range (0.0 - 1.0) by design.

Fit a 1-NN and output a confusion matrix for the dev data. Use the confusion matrix to identify the most confused pair of digits, and display a few example mistakes.

confusion_matrix() produces a confusion matrix



In [17]:

    
#def P5():

### STUDENT START ###

# train_k_nearest_neighbors():
#   given the parameter k, training data and labels, and development data and labels,
#   fit a k nearest neighbors classifier using the training data
def train_k_nearest_neighbors(k,
                              training_data,
                              training_labels):

    neigh = KNeighborsClassifier(n_neighbors = k)
    neigh.fit(training_data, training_labels)
    
    return neigh

# most_confused():
#   given a confusion matrix
#   returns a sequence that comprises the two most confused digits, and errors between them
def most_confused(confusion):
    rows, columns = np.shape(confusion)
    worst_row, worst_column, worst_errors = 0, 1, 0
    
    # iterate through the upper triangle, ignoring the diagonals
    # confused is the sum for each pair of indices
    for row in range(rows):
        for column in range(row + 1, columns):
            errors = confusion[row][column] + confusion[column][row]
            if errors > worst_errors:
                worst_row, worst_column, worst_errors = row, column, errors
    
    return ( worst_row, worst_column, worst_errors )

# select_pairwire_error_indices():
#   given a predictions vector, actual label vector, and the digits of interest
#   returns an array of indices where the digits were confused
def select_pairwire_error_indices(predictions, labels, confused_low, confused_high):
    error_indices = [ ]
    for i, prediction in enumerate(predictions):
        label = labels[i]
        if ((prediction == confused_low and label == confused_high) or
            (prediction == confused_high and label == confused_low)):
            
            error_indices.append(i)
            
    return error_indices

def P5():
    k = 1
    neigh = train_k_nearest_neighbors(k, train_data, train_labels)
    development_predicted = neigh.predict(dev_data)
    
    confusion = confusion_matrix(dev_labels, development_predicted)
    
    confused_low, confused_high, confusion_errors = most_confused(confusion)
    print('Most confused digits are: ' + str(confused_low) + ' and ' + str(confused_high)
          + ', with ' + str(confusion_errors) + ' total confusion errors')
    
    error_indices = select_pairwire_error_indices(
    development_predicted, dev_labels, confused_low, confused_high)
    error_examples = [ dev_data[error_indices] ]
    plot_examples(error_examples)
    
    return confusion
P5()

### STUDENT END ###

#P5()









    



Most confused digits are: 3 and 8, with 3 total confusion errors






    












    Out[17]:





array([[ 99,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0, 105,   0,   0,   0,   0,   0,   0,   0,   0],
       [  1,   0,  98,   2,   0,   0,   0,   1,   0,   0],
       [  0,   0,   0,  83,   0,   1,   0,   0,   1,   1],
       [  0,   0,   0,   0, 102,   0,   0,   0,   0,   2],
       [  1,   0,   0,   0,   0,  88,   0,   0,   1,   1],
       [  1,   0,   0,   0,   1,   0,  96,   0,   0,   0],
       [  0,   0,   1,   0,   0,   0,   0, 111,   0,   1],
       [  1,   0,   1,   2,   0,   2,   1,   0,  89,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0, 106]])

(6) A common image processing technique is to smooth an image by blurring. The idea is that the value of a particular pixel is estimated as the weighted combination of the original value and the values around it. Typically, the blurring is Gaussian -- that is, the weight of a pixel's influence is determined by a Gaussian function over the distance to the relevant pixel.

Implement a simplified Gaussian blur by just using the 8 neighboring pixels: the smoothed value of a pixel is a weighted combination of the original value and the 8 neighboring values. Try applying your blur filter in 3 ways:

preprocess the training data but not the dev data
preprocess the dev data but not the training data
preprocess both training and dev data

Note that there are Guassian blur filters available, for example in scipy.ndimage.filters. You're welcome to experiment with those, but you are likely to get the best results with the simplified version I described above.



In [18]:

    
import itertools

# blur():
#   blurs an image by averaging adjacent pixels
def blur(image):
    pixel_matrix = example_as_pixel_matrix(image)
    blurred_image = []
    rows, columns = np.shape(pixel_matrix)
    
    for row in range(rows):
        for column in range(columns):
            # take the mean of the 9-pixel neighborhood (in clause)
            # but guard against running off the edges of the matrix (if clause)
            value = np.mean(list( 
                pixel_matrix[i][j] 
                for i, j
                in itertools.product(
                    range(row - 1, row + 2), 
                    range(column - 1, column + 2)
                )
                if (i >= 0) and (j >= 0) and (i < rows) and (j < columns)
            ))
            
            blurred_image.append(value)
    
    return blurred_image

# blur_images():
#   blurs a collection of images
def blur_images(images): 
    blurred = [ blur(image) for image in images ]
    return blurred



In [7]:

    
# Do this in batches since iPythonNB seems to hang on large batches
train_data_0k = train_data[:10000]
blurred_train_data_0k = blur_images(train_data_0k)



In [8]:

    
train_data_1k = train_data[10000:20000]
blurred_train_data_1k = blur_images(train_data_1k)



In [9]:

    
train_data_2k = train_data[20000:30000]
blurred_train_data_2k = blur_images(train_data_2k)



In [10]:

    
train_data_3k = train_data[30000:40000]
blurred_train_data_3k = blur_images(train_data_3k)



In [11]:

    
train_data_4k = train_data[40000:50000]
blurred_train_data_4k = blur_images(train_data_4k)



In [12]:

    
train_data_5k = train_data[50000:60000]
blurred_train_data_5k = blur_images(train_data_5k)



In [13]:

    
blurred_dev_data = blur_images(dev_data)



In [15]:

    
blurred_train_data = (
    blurred_train_data_0k 
    + blurred_train_data_1k
    + blurred_train_data_2k
    + blurred_train_data_3k
    + blurred_train_data_4k
    + blurred_train_data_5k
)



In [19]:

    
#def P6():
    
### STUDENT START ###

def P6():
    k = 1
    neigh_blurred_train = train_k_nearest_neighbors(k, blurred_train_data, train_labels)
    neigh_unblurred_train = train_k_nearest_neighbors(k, train_data, train_labels)
    
    predicted_blurred_train_unblurred_dev = (
        neigh_blurred_train.predict(dev_data)
    )
    
    predicted_unblurred_train_blurred_dev = (
        neigh_unblurred_train.predict(blurred_dev_data)
    )
    
    predicted_blurred_train_blurred_dev = (
        neigh_blurred_train.predict(blurred_dev_data)
    )
    
    print 'Accuracy for blurred training, unblurred dev:'
    print(accuracy_score(
            dev_labels, predicted_blurred_train_unblurred_dev, normalize = True))
    
    print 'Accuracy for unblurred training, blurred dev:'
    print(accuracy_score(
            dev_labels, predicted_unblurred_train_blurred_dev, normalize = True))
    
    print 'Accuracy for blurred training, blurred dev:'
    print(accuracy_score(
            dev_labels, predicted_blurred_train_blurred_dev, normalize = True))

P6()

### STUDENT END ###

#P6()









    



Accuracy for blurred training, unblurred dev:
0.982
Accuracy for unblurred training, blurred dev:
0.962
Accuracy for blurred training, blurred dev:
0.979

ANSWER: Blurring the training but not the development data

(7) Fit a Naive Bayes classifier and report accuracy on the dev data. Remember that Naive Bayes estimates P(feature|label). While sklearn can handle real-valued features, let's start by mapping the pixel values to either 0 or 1. You can do this as a preprocessing step, or with the binarize argument. With binary-valued features, you can use BernoulliNB. Next try mapping the pixel values to 0, 1, or 2, representing white, grey, or black. This mapping requires MultinomialNB. Does the multi-class version improve the results? Why or why not?



In [4]:

    
#def P7():

### STUDENT START ###

from sklearn.metrics import accuracy_score

# binarize_example():
#   Turn all pixels below 0.5 (or threshold) -> 0, greater -> 1
def binarize_example(example, threshold = 0.5):
    binarized = [ 1 if value > threshold else 0 for value in example ]
    return binarized
    
# binarize_examples():
#   Apply binarization to a set of example
def binarize_examples(examples, threshold = 0.5):
    binarized = [ binarize_example(example, threshold) for example in examples ]
    return binarized

# ternarize_example():
#   Turn all pixels below 1/3 (or threshold) -> 0, 1/3 through 2/3 -> 1, greater -> 2
def ternarize_example(example, threshold_low = 0.33333333, threshold_high = 0.66666666):
    ternarized = [ 
        0 if value < threshold_low else 1 if value < threshold_high else 2
        for value in example
    ]
    return ternarized

# ternarize_examples():
#   Apply ternarization to a set of example
def ternarize_examples(examples, threshold_low = 0.33333333, threshold_high = 0.66666666):
    ternarized = [ 
        ternarize_example(example, threshold_low, threshold_high) 
        for example in examples 
    ]
    return ternarized

def P7():
    binarized_train_data = binarize_examples(train_data)
    
    binary_naive_bayes = BernoulliNB()
    binary_naive_bayes.fit(binarized_train_data, train_labels)

    binarized_dev_data = binarize_examples(dev_data)
    binary_naive_bayes_predicted = binary_naive_bayes.predict(binarized_dev_data)
    
    target_names = [ str(i) for i in range(10) ]
    
    print '============ Classification report for binarized ============'
    print ''
    print(classification_report(
            dev_labels, 
            binary_naive_bayes_predicted, 
            target_names = target_names))
    print ' Accuracy score: '
    print(accuracy_score(dev_labels, binary_naive_bayes_predicted, normalize = True))
    
    ternarized_train_data = ternarize_examples(train_data)
    
    ternary_naive_bayes = MultinomialNB()
    ternary_naive_bayes.fit(ternarized_train_data, train_labels)
    
    ternarized_dev_data = ternarize_examples(dev_data)
    
    ternary_naive_bayes_predicted = ternary_naive_bayes.predict(ternarized_dev_data)
    print '============ Classification report for ternarized ============'
    print ''
    print(classification_report(
            dev_labels, 
            ternary_naive_bayes_predicted, 
            target_names = target_names))
    print ' Accuracy score: '
    print(accuracy_score(dev_labels, ternary_naive_bayes_predicted, normalize = True))
    
P7()
    
### STUDENT END ###

#P7()









    



============ Classification report for binarized ============

             precision    recall  f1-score   support

          0       0.94      0.98      0.96        99
          1       0.86      0.95      0.90       105
          2       0.88      0.79      0.84       102
          3       0.70      0.76      0.73        86
          4       0.86      0.83      0.84       104
          5       0.88      0.77      0.82        91
          6       0.91      0.88      0.89        98
          7       0.93      0.82      0.87       113
          8       0.75      0.82      0.79        96
          9       0.76      0.83      0.79       106

avg / total       0.85      0.84      0.85      1000

 Accuracy score: 
0.845
============ Classification report for ternarized ============

             precision    recall  f1-score   support

          0       0.92      0.97      0.95        99
          1       0.89      0.93      0.91       105
          2       0.86      0.78      0.82       102
          3       0.75      0.76      0.75        86
          4       0.87      0.75      0.80       104
          5       0.88      0.69      0.77        91
          6       0.91      0.88      0.89        98
          7       0.96      0.79      0.86       113
          8       0.65      0.83      0.73        96
          9       0.68      0.86      0.76       106

avg / total       0.84      0.83      0.83      1000

 Accuracy score: 
0.826

ANSWER:

(8) Use GridSearchCV to perform a search over values of alpha (the Laplace smoothing parameter) in a Bernoulli NB model. What is the best value for alpha? What is the accuracy when alpha=0? Is this what you'd expect?

Note that GridSearchCV partitions the training data so the results will be a bit different than if you used the dev data for evaluation.



In [8]:

    
#def P8(alphas):

### STUDENT START ###

def P8(alphas):
    binarized_train_data = binarize_examples(train_data)
    
    bernoulli_naive_bayes = BernoulliNB()
    
    grid_search = GridSearchCV(bernoulli_naive_bayes, alphas, verbose = 3)
    grid_search.fit(binarized_train_data, train_labels)

    return grid_search

alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
nb = P8(alphas)
print nb.best_params_

### STUDENT END ###

#alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
#nb = P8(alphas)









    



Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] alpha=0.0 .......................................................
[CV] .............................. alpha=0.0, score=0.098940 -   3.2s
[CV] alpha=0.0 .......................................................
[CV] .............................. alpha=0.0, score=0.098900 -   3.3s
[CV] alpha=0.0 .......................................................
[CV] .............................. alpha=0.0, score=0.098910 -   3.2s
[CV] alpha=0.0001 ....................................................
[CV] ........................... alpha=0.0001, score=0.838966 -   3.3s
[CV] alpha=0.0001 ....................................................
[CV] ........................... alpha=0.0001, score=0.838650 -   3.0s
[CV] alpha=0.0001 ....................................................
[CV] ........................... alpha=0.0001, score=0.833083 -   3.1s
[CV] alpha=0.001 .....................................................
[CV] ............................ alpha=0.001, score=0.838666 -   2.9s
[CV] alpha=0.001 .....................................................
[CV] ............................ alpha=0.001, score=0.838500 -   3.1s
[CV] alpha=0.001 .....................................................
[CV] ............................ alpha=0.001, score=0.832933 -   4.0s
[CV] alpha=0.01 ......................................................
[CV] ............................. alpha=0.01, score=0.838566 -   3.8s
[CV] alpha=0.01 ......................................................
[CV] ............................. alpha=0.01, score=0.838250 -   3.0s
[CV] alpha=0.01 ......................................................
[CV] ............................. alpha=0.01, score=0.832733 -   3.0s
[CV] alpha=0.1 .......................................................
[CV] .............................. alpha=0.1, score=0.838216 -   3.1s
[CV] alpha=0.1 .......................................................
[CV] .............................. alpha=0.1, score=0.838000 -   2.8s
[CV] alpha=0.1 .......................................................
[CV] .............................. alpha=0.1, score=0.832283 -   3.2s
[CV] alpha=0.5 .......................................................
[CV] .............................. alpha=0.5, score=0.837516 -   2.9s
[CV] alpha=0.5 .......................................................
[CV] .............................. alpha=0.5, score=0.837200 -   2.9s
[CV] alpha=0.5 .......................................................
[CV] .............................. alpha=0.5, score=0.831533 -   3.0s
[CV] alpha=1.0 .......................................................
[CV] .............................. alpha=1.0, score=0.837316 -   2.9s
[CV] alpha=1.0 .......................................................
[CV] .............................. alpha=1.0, score=0.836900 -   2.9s
[CV] alpha=1.0 .......................................................
[CV] .............................. alpha=1.0, score=0.830983 -   3.0s
[CV] alpha=2.0 .......................................................
[CV] .............................. alpha=2.0, score=0.836516 -   2.9s
[CV] alpha=2.0 .......................................................
[CV] .............................. alpha=2.0, score=0.836900 -   2.9s
[CV] alpha=2.0 .......................................................
[CV] .............................. alpha=2.0, score=0.830283 -   2.9s
[CV] alpha=10.0 ......................................................
[CV] ............................. alpha=10.0, score=0.833417 -   3.0s
[CV] alpha=10.0 ......................................................
[CV] ............................. alpha=10.0, score=0.834200 -   2.9s
[CV] alpha=10.0 ......................................................
[CV] ............................. alpha=10.0, score=0.828283 -   3.1s





    



[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    3.2s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  1.4min finished






    



{'alpha': 0.0001}



In [14]:

    
#print nb.best_params_

ANSWER: The best value for alpha is 0.0001

When alpha is 0, the accuracy is about one tenth. The effect of alpha = 0 is to ignore the training data, so this leaves the model to just pick a class. Single there are 10 (0-9) values with roughly equal distributions among the bins, always picking the same class is expected to have about 1/10 chances of being right.

(9) Try training a model using GuassianNB, which is intended for real-valued features, and evaluate on the dev data. You'll notice that it doesn't work so well. Try to diagnose the problem. You should be able to find a simple fix that returns the accuracy to around the same rate as BernoulliNB. Explain your solution.

Hint: examine the parameters estimated by the fit() method, theta_ and sigma_.



In [12]:

    
#def P9():

### STUDENT END ###

def train_and_score_gaussian(
    training_data, training_labels, development_data, development_labels):
    
    model = GaussianNB().fit(training_data, training_labels)
    predictions = model.predict(development_data)
    print(accuracy_score(development_labels, predictions, normalize = True))
    
    return model

def P9():
    print 'Accuracy score of Gaussian Naive Bayes (uncorrected): '
    gaussian_naive_bayes = train_and_score_gaussian(
        train_data, train_labels, dev_data, dev_labels)

    theta = gaussian_naive_bayes.theta_

    for digit in range(10):
        theta_figure = plt.figure()
        theta_hist = plt.hist(theta[digit], bins = 100)
        theta_hist_title = plt.title('Theta distribution for the digit ' + str(digit))
        plt.show()
    sigma = gaussian_naive_bayes.sigma_

    for digit in range(10):
        sigma_figure = plt.figure()
        sigma_hist = plt.hist(theta[digit], bins = 100)
        sigma_hist_title = plt.title('Sigma distribution for the digit ' + str(digit))
        plt.show()

    return gaussian_naive_bayes

gnb = P9()

# Attempts to improve were unsuccessful, see attempts below

print('Issue: Many features have variance 0, ')
print('which means they "contribute" but the contribution is noise')

examples, pixels = np.shape(train_data)
def select_signal_pixel_indices(data):
    indices = [ ]
    examples, pixels = np.shape(data)
    
    for pixel in range(pixels):
        has_signal = False
        
        for example in range(examples):
            if data[example][pixel] > 0.0:
                has_signal = True
        
        if has_signal:
            indices.append(pixel)
    
    return indices
            
pixels_with_signal = select_signal_pixel_indices(train_data)

def select_pixels_with_signal(data, pixels_with_signal):
    examples, pixels = np.shape(data)
    selected = [
        data[example][pixels_with_signal]
        for example in range(examples)
    ]
    
    return selected

signal_train_data = select_pixels_with_signal(train_data, pixels_with_signal)

signal_dev_data = select_pixels_with_signal(dev_data, pixels_with_signal)

print('Attempt #0 : only select non-0 pixels ')
evaluate_gaussian(signal_train_data, train_labels, signal_dev_data, dev_labels)


def transform_attempt1(pixel):
    return np.log(0.1 + pixel)

vectorized_transform_attempt1 = np.vectorize(transform_attempt1)

mapped_train_data = vectorized_transform_attempt1(train_data)
mapped_dev_data = vectorized_transform_attempt1(dev_data)


print('Attempt #1 : transform each pixel with log(0.1 + pixel) ')
evaluate_gaussian(mapped_train_data, train_labels, mapped_dev_data, dev_labels)

def transform_attempt2(pixel):
    return 0.0 if pixel < 0.0001 else 1.0

vectorized_transform_attempt2 = np.vectorize(transform_attempt2)

mapped_train_data = vectorized_transform_attempt2(train_data)
mapped_dev_data = vectorized_transform_attempt2(dev_data)


print('Attempt #2 : binarize all pixels with a very low threshold ')
evaluate_gaussian(mapped_train_data, train_labels, mapped_dev_data, dev_labels)

### STUDENT END ###

#gnb = P9()









    



Accuracy score of Gaussian Naive Bayes (uncorrected): 
0.577






    












    












    












    












    












    












    












    












    












    












    












    












    












    












    












    












    












    












    












    












    



Issue: Many features have variance 0, 
which means they "contribute" but the contribution is noise
Attempt #0 : only select non-0 pixels 






    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-610b5b84dd1b> in <module>()
     74 
     75 print('Attempt #0 : only select non-0 pixels ')
---> 76 evaluate_gaussian(signal_train_data, train_labels, signal_dev_data, dev_labels)
     77 
     78 

NameError: name 'evaluate_gaussian' is not defined

ANSWER:

(10) Because Naive Bayes is a generative model, we can use the trained model to generate digits. Train a BernoulliNB model and then generate a 10x20 grid with 20 examples of each digit. Because you're using a Bernoulli model, each pixel output will be either 0 or 1. How do the generated digits compare to the training digits?

You can use np.random.rand() to generate random numbers from a uniform distribution
The estimated probability of each pixel is stored in feature_log_prob_. You'll need to use np.exp() to convert a log probability back to a probability.



In [14]:

    
#def P10(num_examples):

### STUDENT START ###

def generate_example(log_probabilities):
    pixels = [
        1.0 if np.random.rand() <= np.exp( log_probability ) else 0.0
        for log_probability in log_probabilities
    ]

    return pixels

# more than 10 x 10 gets scaled too small
def plot_10_examples(binary_naive_bayes):
    per_digit_log_probabilities = binary_naive_bayes.feature_log_prob_
    
    examples = [
        [ 
            generate_example(per_digit_log_probabilities[digit])
            for example in range(10)
        ]
        for digit in range(10)
    ]
    
    plot_examples(examples)

def P10(num_examples):
    binarized_train_data = binarize_examples(train_data)
    binary_naive_bayes = BernoulliNB().fit(binarized_train_data, train_labels)
    
    page = 0
    
    while page < num_examples:
        plot_10_examples(binary_naive_bayes)
        page = page + 10
    
P10(20)

### STUDENT END ###

#P10(20)

ANSWER: Many of the generated digits are recognizable. However, they lack the connected lines of handdrawn digits because each pixel is sampled independently.

(11) Remember that a strongly calibrated classifier is rougly 90% accurate when the posterior probability of the predicted class is 0.9. A weakly calibrated classifier is more accurate when the posterior is 90% than when it is 80%. A poorly calibrated classifier has no positive correlation between posterior and accuracy.

Train a BernoulliNB model with a reasonable alpha value. For each posterior bucket (think of a bin in a histogram), you want to estimate the classifier's accuracy. So for each prediction, find the bucket the maximum posterior belongs to and update the "correct" and "total" counters.

How would you characterize the calibration for the Naive Bayes model?



In [5]:

    
#def P11(buckets, correct, total):
    
### STUDENT START ###

buckets = [0.5, 0.9, 0.999, 0.99999, 0.9999999, 0.999999999, 0.99999999999, 0.9999999999999, 1.0]
correct = [0 for i in buckets]
total = [0 for i in buckets]

def train_binarized_bernoulli(training_data, training_labels, alpha = 0.0001):
    binarized_train_data = binarize_examples(training_data)
    binary_naive_bayes = BernoulliNB(alpha = alpha)
    binary_naive_bayes.fit(binarized_train_data, training_labels)
    
    return binary_naive_bayes

def find_bucket_index(buckets, posterior):
    index = None
    
    for i in range(len(buckets)):
        if index == None and posterior <= buckets[i]:
            index = i

    return index
    
def score_by_posterior_buckets(
    binary_naive_bayes, test_data, test_labels,
    buckets, correct, total):
    
    predictions = binary_naive_bayes.predict(test_data)
    posteriors = binary_naive_bayes.predict_proba(test_data)
    confidences = [
        posteriors[index][predictions[index]]
        for index in range(len(predictions))
    ]
    
    for index, confidence in enumerate(confidences):
        bucket_index = find_bucket_index(buckets, confidence)
        
        total[bucket_index] = total[bucket_index] + 1
        
        if predictions[index] == test_labels[index]:
            correct[bucket_index] = correct[bucket_index] + 1
    
def P11(buckets, correct, total):
    binary_naive_bayes = train_binarized_bernoulli(
        train_data, train_labels)
    
    binarized_dev_data = binarize_examples(dev_data)
    score_by_posterior_buckets(binary_naive_bayes, binarized_dev_data, dev_labels,
                               buckets, correct, total)

P11(buckets, correct, total)

for i in range(len(buckets)):
    if (total[i] > 0): accuracy = float(correct[i]) / float(total[i])
    print 'p(pred) <= %.13f    total = %3d    accuracy = %.3f' %(buckets[i], total[i], accuracy)    
                
### STUDENT END ###

#buckets = [0.5, 0.9, 0.999, 0.99999, 0.9999999, 0.999999999, 0.99999999999, 0.9999999999999, 1.0]
#correct = [0 for i in buckets]
#total = [0 for i in buckets]

#P11(buckets, correct, total)

#for i in range(len(buckets)):
#    accuracy = 0.0
#    if (total[i] > 0): accuracy = correct[i] / total[i]
#    print 'p(pred) <= %.13f    total = %3d    accuracy = %.3f' %(buckets[i], total[i], accuracy)









    



p(pred) <= 0.5000000000000    total =   3    accuracy = 0.333
p(pred) <= 0.9000000000000    total =  40    accuracy = 0.575
p(pred) <= 0.9990000000000    total =  97    accuracy = 0.495
p(pred) <= 0.9999900000000    total =  75    accuracy = 0.640
p(pred) <= 0.9999999000000    total =  63    accuracy = 0.714
p(pred) <= 0.9999999990000    total =  70    accuracy = 0.871
p(pred) <= 0.9999999999900    total =  75    accuracy = 0.893
p(pred) <= 0.9999999999999    total =  81    accuracy = 0.926
p(pred) <= 1.0000000000000    total = 496    accuracy = 0.972

ANSWER: The model is poorly calibrated - all probably buckets are over-confident, many drastically so.

(12) EXTRA CREDIT

Try designing extra features to see if you can improve the performance of Naive Bayes on the dev set. Here are a few ideas to get you started:

Try summing the pixel values in each row and each column.
Try counting the number of enclosed regions; 8 usually has 2 enclosed regions, 9 usually has 1, and 7 usually has 0.

Make sure you comment your code well!



In [18]:

    
#def P12():

### STUDENT START ###


### STUDENT END ###

#P12()



In [ ]: