Exploring precision and recall

The goal of this second notebook is to understand precision-recall in the context of classifiers.

  • Use Amazon review data in its entirety.
  • Train a logistic regression model.
  • Explore various evaluation metrics: accuracy, confusion matrix, precision, recall.
  • Explore how various metrics can be combined to produce a cost of making an error.
  • Explore precision and recall curves.

Because we are using the full Amazon review dataset (not a subset of words or reviews), in this assignment we return to using GraphLab Create for its efficiency. As usual, let's start by firing up GraphLab Create.

Make sure you have the latest version of GraphLab Create (1.8.3 or later). If you are running an older version, you will need to upgrade GraphLab Create using

   pip install graphlab-create --upgrade

See this page for detailed instructions on upgrading.


In [48]:
import graphlab
from __future__ import division
import numpy as np
graphlab.canvas.set_target('ipynb')

Load the Amazon review dataset


In [49]:
products = graphlab.SFrame('amazon_baby.gl/')

Extract word counts and sentiments

As in the first assignment of this course, we compute the word counts for individual words and extract positive and negative sentiments from ratings. To summarize, we perform the following:

  1. Remove punctuation.
  2. Remove reviews with "neutral" sentiment (rating 3).
  3. Set reviews with rating 4 or more to be positive and those with 2 or less to be negative.

In [50]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

# Remove punctuation.
review_clean = products['review'].apply(remove_punctuation)

# Count words
products['word_count'] = graphlab.text_analytics.count_words(review_clean)

# Drop neutral sentiment reviews.
products = products[products['rating'] != 3]

# Positive sentiment to +1 and negative sentiment to -1
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

Now, let's remember what the dataset looks like by taking a quick peek:


In [51]:
products


Out[51]:
name | review | rating | word_count | sentiment
Planetwise Wipe Pouch | it came early and was not disappointed. i love ... | 5.0 | {'and': 3, 'love': 1, 'it': 3, 'highly': 1, ... | 1
Annas Dream Full Quilt with 2 Shams ... | Very soft and comfortable and warmer than it ... | 5.0 | {'and': 2, 'quilt': 1, 'it': 1, 'comfortable': ... | 1
Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ... | 5.0 | {'and': 3, 'ingenious': 1, 'love': 2, 'what': 1, ... | 1
Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 | {'and': 2, 'all': 2, 'help': 1, 'cried': 1, ... | 1
Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ... | 5.0 | {'and': 2, 'this': 2, 'her': 1, 'help': 2, ... | 1
A Tale of Baby's Days with Peter Rabbit ... | Lovely book, it's bound tightly so you may no ... | 4.0 | {'shop': 1, 'noble': 1, 'is': 1, 'it': 1, 'as': ... | 1
Baby Tracker® - Daily Childcare Journal, ... | Perfect for new parents. We were able to keep ... | 5.0 | {'and': 2, 'all': 1, 'right': 1, 'had': 1, ... | 1
Baby Tracker® - Daily Childcare Journal, ... | A friend of mine pinned this product on Pinte ... | 5.0 | {'and': 1, 'fantastic': 1, 'help': 1, 'give': 1, ... | 1
Baby Tracker® - Daily Childcare Journal, ... | This has been an easy way for my nanny to record ... | 4.0 | {'all': 1, 'standarad': 1, 'another': 1, 'when': ... | 1
Baby Tracker® - Daily Childcare Journal, ... | I love this journal and our nanny uses it ... | 4.0 | {'all': 2, 'nannys': 1, 'just': 1, 'food': 1, ... | 1
[166752 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Split data into training and test sets

We split the data into an 80/20 split, with 80% in the training set and 20% in the test set.


In [52]:
train_data, test_data = products.random_split(.8, seed=1)

Train a logistic regression classifier

We will now train a logistic regression classifier with sentiment as the target and word_count as the features. We will set validation_set=None to make sure everyone gets exactly the same results.

Remember, even though we now know how to implement logistic regression, we will use GraphLab Create for its efficiency at processing this Amazon dataset in its entirety. The focus of this assignment is instead on the topic of precision and recall.


In [53]:
model = graphlab.logistic_classifier.create(train_data, target='sentiment',
                                            features=['word_count'],
                                            validation_set=None)


Logistic regression:
--------------------------------------------------------
Number of examples          : 133416
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 121712
Number of coefficients    : 121713
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy |
+-----------+----------+-----------+--------------+-------------------+
| 1         | 5        | 0.000002  | 1.039824     | 0.840754          |
| 2         | 9        | 3.000000  | 2.086944     | 0.931350          |
| 3         | 10       | 3.000000  | 2.475899     | 0.882046          |
| 4         | 11       | 3.000000  | 2.834446     | 0.954076          |
| 5         | 12       | 3.000000  | 3.187388     | 0.960964          |
| 6         | 13       | 3.000000  | 3.542964     | 0.975033          |
+-----------+----------+-----------+--------------+-------------------+
TERMINATED: Terminated due to numerical difficulties.
This model may not be ideal. To improve it, consider doing one of the following:
(a) Increasing the regularization.
(b) Standardizing the input data.
(c) Removing highly correlated features.
(d) Removing `inf` and `NaN` values in the training data.

Model Evaluation

We will explore the advanced model evaluation concepts that were discussed in the lectures.

Accuracy

One performance metric we will use for our more advanced exploration is accuracy, which we have seen many times in past assignments. Recall that the accuracy is given by

$$ \mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}} $$

To obtain the accuracy of our trained models using GraphLab Create, simply pass the option metric='accuracy' to the evaluate function. We compute the accuracy of our logistic regression model on the test_data as follows:


In [54]:
accuracy = model.evaluate(test_data, metric='accuracy')['accuracy']
print "Test Accuracy: %s" % accuracy


Test Accuracy: 0.914536837053

Baseline: Majority class prediction

Recall from an earlier assignment that we used the majority class classifier as a baseline (i.e., reference) model for comparison with a more sophisticated classifier. The majority class classifier simply predicts the majority class for all data points.

Typically, a good model should beat the majority class classifier. Since the majority class in this dataset is the positive class (i.e., there are more positive than negative reviews), the accuracy of the majority class classifier can be computed as follows:


In [55]:
baseline = len(test_data[test_data['sentiment'] == 1])/len(test_data)
print "Baseline accuracy (majority class classifier): %s" % baseline


Baseline accuracy (majority class classifier): 0.842782577394

Quiz Question: Using accuracy as the evaluation metric, was our logistic regression model better than the baseline (majority class classifier)?
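
A quick check (a sketch, not part of the original notebook) is to compare the two numbers directly, using the accuracy and baseline variables computed above:

print "Logistic regression accuracy: %s" % accuracy
print "Majority class accuracy     : %s" % baseline
print "Model beats baseline: %s" % (accuracy > baseline)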

Confusion Matrix

Accuracy, while convenient, does not tell the whole story. For a fuller picture, we turn to the confusion matrix. In the case of binary classification, the confusion matrix is a 2-by-2 matrix that lays out the correct and incorrect predictions for each label as follows:

              +---------------------------------------------+
              |                Predicted label              |
              +----------------------+----------------------+
              |          (+1)        |         (-1)         |
+-------+-----+----------------------+----------------------+
| True  |(+1) | # of true positives  | # of false negatives |
| label +-----+----------------------+----------------------+
|       |(-1) | # of false positives | # of true negatives  |
+-------+-----+----------------------+----------------------+

To print out the confusion matrix for a classifier, use metric='confusion_matrix':


In [56]:
confusion_matrix = model.evaluate(test_data, metric='confusion_matrix')['confusion_matrix']
confusion_matrix


Out[56]:
target_label predicted_label count
1 -1 1406
-1 -1 3798
-1 1 1443
1 1 26689
[4 rows x 3 columns]

Quiz Question: How many predicted values in the test set are false positives?


In [57]:
# From the confusion matrix above, the number of false positives
# (true label -1, predicted +1) is 1443. The expression below computes the
# fraction of positive predictions that are false positives, which is the
# quantity asked for in a later quiz question.
round(1443 / (26689 + 1443), 2)


Out[57]:
0.05

Computing the cost of mistakes

Put yourself in the shoes of a manufacturer that sells a baby product on Amazon.com and wants to monitor its product's reviews in order to respond to complaints. Even a few negative reviews may generate a lot of bad publicity about the product. So you don't want to miss any reviews with negative sentiments --- you'd rather put up with false alarms about potentially negative reviews than miss negative reviews entirely. In other words, false positives cost more than false negatives. (It may be the other way around for other scenarios, but let's stick with the manufacturer's scenario for now.)

Suppose you know the costs involved in each kind of mistake:

  1. \$100 for each false positive.
  2. \$1 for each false negative.
  3. Correctly classified reviews incur no cost.

Quiz Question: Given the stipulation, what is the cost associated with the logistic regression classifier's performance on the test set?


In [58]:
100*1443 + 1*1406


Out[58]:
145706
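
The same total can be read off the confusion_matrix SFrame obtained earlier instead of hard-coding the counts. Here is a minimal sketch, assuming the confusion_matrix variable from In [56] is still in scope; it should reproduce the 145706 computed above:

# False positives: truly negative reviews (-1) predicted as positive (+1).
fp = confusion_matrix[(confusion_matrix['target_label'] == -1) &
                      (confusion_matrix['predicted_label'] == +1)]['count'][0]
# False negatives: truly positive reviews (+1) predicted as negative (-1).
fn = confusion_matrix[(confusion_matrix['target_label'] == +1) &
                      (confusion_matrix['predicted_label'] == -1)]['count'][0]
print "Total cost of mistakes: $%s" % (100 * fp + 1 * fn)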

Precision and Recall

You may not have exact dollar amounts for each kind of mistake. Instead, you may simply prefer to reduce the percentage of false positives to be less than, say, 3.5% of all positive predictions. This is where precision comes in:

$$ [\text{precision}] = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all data points with positive predictions}]} = \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false positives}]} $$

So to keep the fraction of false positives below 3.5% of all positive predictions, we must raise the precision to 96.5% or higher, since that fraction is exactly one minus the precision.

First, let us compute the precision of the logistic regression classifier on the test_data.


In [59]:
precision = model.evaluate(test_data, metric='precision')['precision']
print "Precision on test data: %s" % precision


Precision on test data: 0.948706099815

Quiz Question: Out of all reviews in the test set that are predicted to be positive, what fraction of them are false positives? (Round to the second decimal place e.g. 0.25)


In [60]:
round(1 - precision, 2)


Out[60]:
0.05

Quiz Question: Based on what we learned in lecture, if we wanted to reduce this fraction of false positives to be below 3.5%, we would: (see the quiz)

A complementary metric is recall, which measures the ratio between the number of true positives and that of (ground-truth) positive reviews:

$$ [\text{recall}] = \frac{[\text{# positive data points with positive predictions}]}{[\text{# all positive data points}]} = \frac{[\text{# true positives}]}{[\text{# true positives}] + [\text{# false negatives}]} $$

Let us compute the recall on the test_data.


In [61]:
recall = model.evaluate(test_data, metric='recall')['recall']
print "Recall on test data: %s" % recall


Recall on test data: 0.949955508098

Quiz Question: What fraction of the positive reviews in the test_set were correctly predicted as positive by the classifier?

Quiz Question: What is the recall value for a classifier that predicts +1 for all data points in the test_data?
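
For the second question, one way to check the intuition (a sketch, not part of the original assignment) is to construct the constant +1 predictions explicitly and pass them to graphlab.evaluation.recall:

# A classifier that predicts +1 for every data point misses no true positives,
# so its recall should come out to 1.0 (at the cost of much lower precision).
all_positive_predictions = graphlab.SArray([+1] * len(test_data))
recall_always_positive = graphlab.evaluation.recall(test_data['sentiment'],
                                                    all_positive_predictions)
print "Recall (always predict +1): %s" % recall_always_positive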

Precision-recall tradeoff

In this part, we will explore the trade-off between precision and recall discussed in the lecture. We first examine what happens when we use a different threshold value for making class predictions. We then explore a range of threshold values and plot the associated precision-recall curve.

Varying the threshold

False positives are costly in our example, so we may want to be more conservative about making positive predictions. To achieve this, instead of thresholding class probabilities at 0.5, we can choose a higher threshold.

Write a function called apply_threshold that accepts two inputs:

  • probabilities (an SArray of probability values)
  • threshold (a float between 0 and 1).

The function should return an SArray, where each element is set to +1 if the corresponding probability is greater than or equal to threshold, and -1 otherwise.


In [62]:
def apply_threshold(probabilities, threshold):
    ### YOUR CODE GOES HERE
    # +1 if >= threshold and -1 otherwise.
    return probabilities.apply(lambda x: +1 if x >= threshold else -1)

Run prediction with output_type='probability' to get the list of probability values. Then use thresholds set at 0.5 (default) and 0.9 to make predictions from these probability values.


In [63]:
probabilities = model.predict(test_data, output_type='probability')
predictions_with_default_threshold = apply_threshold(probabilities, 0.5)
predictions_with_high_threshold = apply_threshold(probabilities, 0.9)

In [64]:
print "Number of positive predicted reviews (threshold = 0.5): %s" % (predictions_with_default_threshold == 1).sum()


Number of positive predicted reviews (threshold = 0.5): 28132

In [65]:
print "Number of positive predicted reviews (threshold = 0.9): %s" % (predictions_with_high_threshold == 1).sum()


Number of positive predicted reviews (threshold = 0.9): 25630

Quiz Question: What happens to the number of positive predicted reviews as the threshold is increased from 0.5 to 0.9?

Exploring the associated precision and recall as the threshold varies

By changing the probability threshold, it is possible to influence precision and recall. We can explore this as follows:


In [66]:
# Threshold = 0.5
precision_with_default_threshold = graphlab.evaluation.precision(test_data['sentiment'],
                                        predictions_with_default_threshold)

recall_with_default_threshold = graphlab.evaluation.recall(test_data['sentiment'],
                                        predictions_with_default_threshold)

# Threshold = 0.9
precision_with_high_threshold = graphlab.evaluation.precision(test_data['sentiment'],
                                        predictions_with_high_threshold)
recall_with_high_threshold = graphlab.evaluation.recall(test_data['sentiment'],
                                        predictions_with_high_threshold)

In [67]:
print "Precision (threshold = 0.5): %s" % precision_with_default_threshold
print "Recall (threshold = 0.5)   : %s" % recall_with_default_threshold


Precision (threshold = 0.5): 0.948706099815
Recall (threshold = 0.5)   : 0.949955508098

In [68]:
print "Precision (threshold = 0.9): %s" % precision_with_high_threshold
print "Recall (threshold = 0.9)   : %s" % recall_with_high_threshold


Precision (threshold = 0.9): 0.969527896996
Recall (threshold = 0.9)   : 0.884463427656

Quiz Question (variant 1): Does the precision increase with a higher threshold?

Quiz Question (variant 2): Does the recall increase with a higher threshold?

Precision-recall curve

Now, we will explore a range of threshold values, compute the precision and recall scores for each, and then plot the precision-recall curve.


In [69]:
threshold_values = np.linspace(0.5, 1, num=100)
print threshold_values


[ 0.5         0.50505051  0.51010101  0.51515152  0.52020202  0.52525253
  0.53030303  0.53535354  0.54040404  0.54545455  0.55050505  0.55555556
  0.56060606  0.56565657  0.57070707  0.57575758  0.58080808  0.58585859
  0.59090909  0.5959596   0.6010101   0.60606061  0.61111111  0.61616162
  0.62121212  0.62626263  0.63131313  0.63636364  0.64141414  0.64646465
  0.65151515  0.65656566  0.66161616  0.66666667  0.67171717  0.67676768
  0.68181818  0.68686869  0.69191919  0.6969697   0.7020202   0.70707071
  0.71212121  0.71717172  0.72222222  0.72727273  0.73232323  0.73737374
  0.74242424  0.74747475  0.75252525  0.75757576  0.76262626  0.76767677
  0.77272727  0.77777778  0.78282828  0.78787879  0.79292929  0.7979798
  0.8030303   0.80808081  0.81313131  0.81818182  0.82323232  0.82828283
  0.83333333  0.83838384  0.84343434  0.84848485  0.85353535  0.85858586
  0.86363636  0.86868687  0.87373737  0.87878788  0.88383838  0.88888889
  0.89393939  0.8989899   0.9040404   0.90909091  0.91414141  0.91919192
  0.92424242  0.92929293  0.93434343  0.93939394  0.94444444  0.94949495
  0.95454545  0.95959596  0.96464646  0.96969697  0.97474747  0.97979798
  0.98484848  0.98989899  0.99494949  1.        ]

For each of the values of threshold, we compute the precision and recall scores.


In [70]:
precision_all = []
recall_all = []

probabilities = model.predict(test_data, output_type='probability')
for threshold in threshold_values:
    predictions = apply_threshold(probabilities, threshold)
    
    precision = graphlab.evaluation.precision(test_data['sentiment'], predictions)
    recall = graphlab.evaluation.recall(test_data['sentiment'], predictions)
    
    precision_all.append(precision)
    recall_all.append(recall)

Now, let's plot the precision-recall curve to visualize the precision-recall tradeoff as we vary the threshold.


In [71]:
import matplotlib.pyplot as plt
%matplotlib inline

def plot_pr_curve(precision, recall, title):
    plt.rcParams['figure.figsize'] = 7, 5
    plt.locator_params(axis = 'x', nbins = 5)
    plt.plot(precision, recall, 'b-', linewidth=4.0, color = '#B0017F')
    plt.title(title)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
    plt.rcParams.update({'font.size': 16})
    
plot_pr_curve(precision_all, recall_all, 'Precision recall curve (all)')


Quiz Question: Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better? Round your answer to 3 decimal places.


In [72]:
for i, p in enumerate(precision_all):
    print str(i) + " -> " + str(p)


0 -> 0.948706099815
1 -> 0.94905908719
2 -> 0.949288256228
3 -> 0.949506819072
4 -> 0.949624140511
5 -> 0.949805711026
6 -> 0.950203324534
7 -> 0.950417648319
8 -> 0.950696677385
9 -> 0.950877694755
10 -> 0.951062459755
11 -> 0.951424684994
12 -> 0.951534907046
13 -> 0.951761459341
14 -> 0.952177656598
15 -> 0.952541642734
16 -> 0.952825782345
17 -> 0.952950902164
18 -> 0.953033408854
19 -> 0.953081711222
20 -> 0.953231323132
21 -> 0.953525236877
22 -> 0.953680340278
23 -> 0.953691347784
24 -> 0.954012200845
25 -> 0.95415959253
26 -> 0.954481362305
27 -> 0.954630969609
28 -> 0.954956912159
29 -> 0.955217391304
30 -> 0.955425794284
31 -> 0.955603150978
32 -> 0.955716205907
33 -> 0.955933682373
34 -> 0.95600756859
35 -> 0.956162388494
36 -> 0.956453611253
37 -> 0.956670800204
38 -> 0.956951949759
39 -> 0.957200292398
40 -> 0.95730904302
41 -> 0.957558224696
42 -> 0.957740800469
43 -> 0.958172812328
44 -> 0.958434310054
45 -> 0.958762128786
46 -> 0.959152130713
47 -> 0.959266352387
48 -> 0.95958553044
49 -> 0.959906966441
50 -> 0.959957149717
51 -> 0.960170118343
52 -> 0.96034655115
53 -> 0.960716006374
54 -> 0.960870855278
55 -> 0.961087182534
56 -> 0.961366847624
57 -> 0.962202659674
58 -> 0.962415603901
59 -> 0.9624873268
60 -> 0.962727546261
61 -> 0.963204278397
62 -> 0.963492362814
63 -> 0.963922783423
64 -> 0.964218170815
65 -> 0.964581991742
66 -> 0.964945559391
67 -> 0.965311550152
68 -> 0.965662948723
69 -> 0.965982762566
70 -> 0.966381418093
71 -> 0.966780205901
72 -> 0.966996320147
73 -> 0.96737626806
74 -> 0.96765996766
75 -> 0.967978395062
76 -> 0.968586792526
77 -> 0.968960968418
78 -> 0.969313939017
79 -> 0.969468923029
80 -> 0.969731336279
81 -> 0.969926286073
82 -> 0.970296640176
83 -> 0.970813586098
84 -> 0.971404775125
85 -> 0.97203187251
86 -> 0.972883121045
87 -> 0.973425672411
88 -> 0.974041226258
89 -> 0.974463571837
90 -> 0.974766393611
91 -> 0.97549325026
92 -> 0.976197472818
93 -> 0.976871731644
94 -> 0.977337354589
95 -> 0.978530031612
96 -> 0.980131852253
97 -> 0.981307971185
98 -> 0.984238628196
99 -> 0.991666666667

In [73]:
round(threshold_values[67], 3)


Out[73]:
0.838
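
Rather than scanning the printed list by eye, the smallest qualifying threshold can also be picked out programmatically. A short sketch, assuming the threshold_values and precision_all computed above; it should agree with the 0.838 found above:

# Thresholds are sorted in increasing order, so the first one whose precision
# reaches 0.965 is the smallest such threshold.
qualifying = [t for t, p in zip(threshold_values, precision_all) if p >= 0.965]
print "Smallest threshold with precision >= 0.965: %s" % round(qualifying[0], 3)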

Quiz Question: Using threshold = 0.98, how many false negatives do we get on the test_data? (Hint: You may use the graphlab.evaluation.confusion_matrix function implemented in GraphLab Create.)


In [74]:
predictions_with_98_threshold = apply_threshold(probabilities, 0.98)
cm = graphlab.evaluation.confusion_matrix(test_data['sentiment'],
                                        predictions_with_98_threshold)
cm


Out[74]:
target_label predicted_label count
-1 1 487
1 1 22269
1 -1 5826
-1 -1 4754
[4 rows x 3 columns]

This is the number of false negatives (i.e., the number of reviews we would end up looking at even though no response was needed) that we have to deal with using this classifier.

Evaluating specific search terms

So far, we have looked at the number of false positives for the entire test set. In this section, let's select reviews using a specific search term and optimize the precision on these reviews only. After all, a manufacturer would be interested in tuning the false positive rate just for their own products (the reviews they want to read) rather than that of the entire set of products on Amazon.

From the test set, select all the reviews for products whose name contains the word 'baby'.


In [36]:
baby_reviews =  test_data[test_data['name'].apply(lambda x: 'baby' in x.lower())]

Now, let's predict the probability of classifying these reviews as positive:


In [37]:
probabilities = model.predict(baby_reviews, output_type='probability')

Let's plot the precision-recall curve for the baby_reviews dataset.

First, let's consider the following threshold_values ranging from 0.5 to 1:


In [38]:
threshold_values = np.linspace(0.5, 1, num=100)

Second, as we did above, let's compute precision and recall for each value in threshold_values on the baby_reviews dataset. Complete the code block below.


In [39]:
precision_all = []
recall_all = []

for threshold in threshold_values:
    
    # Make predictions. Use the `apply_threshold` function 
    ## YOUR CODE HERE 
    predictions = apply_threshold(probabilities, threshold)

    # Calculate the precision.
    # YOUR CODE HERE
    precision = graphlab.evaluation.precision(baby_reviews['sentiment'], predictions)
    
    # YOUR CODE HERE
    recall = graphlab.evaluation.recall(baby_reviews['sentiment'], predictions)
    
    # Append the precision and recall scores.
    precision_all.append(precision)
    recall_all.append(recall)

Quiz Question: Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better for the reviews of data in baby_reviews? Round your answer to 3 decimal places.


In [43]:
round(threshold_values[72], 3)


Out[43]:
0.864

In [44]:
for i, p in enumerate(precision_all):
    print str(i) + " -> " + str(p)


0 -> 0.947656392486
1 -> 0.948165723672
2 -> 0.948319941563
3 -> 0.948474328522
4 -> 0.948638274538
5 -> 0.948792977323
6 -> 0.949487554905
7 -> 0.949459805896
8 -> 0.94998167827
9 -> 0.949954170486
10 -> 0.95011920044
11 -> 0.950816663608
12 -> 0.95080763583
13 -> 0.950964187328
14 -> 0.951793928243
15 -> 0.951951399116
16 -> 0.952082565426
17 -> 0.952407304925
18 -> 0.952363367799
19 -> 0.952345770225
20 -> 0.952336966562
21 -> 0.952856350527
22 -> 0.95282146161
23 -> 0.952795261014
24 -> 0.952901909883
25 -> 0.953035084463
26 -> 0.953212031192
27 -> 0.953354395094
28 -> 0.953683035714
29 -> 0.954020848846
30 -> 0.954172876304
31 -> 0.954164337619
32 -> 0.954291044776
33 -> 0.954248366013
34 -> 0.954214165577
35 -> 0.95436693473
36 -> 0.954715568862
37 -> 0.954852004496
38 -> 0.954997187324
39 -> 0.955355468017
40 -> 0.955492957746
41 -> 0.955622414442
42 -> 0.955768868812
43 -> 0.95627591406
44 -> 0.956259426848
45 -> 0.956579195771
46 -> 0.956924239562
47 -> 0.956883509834
48 -> 0.957188861527
49 -> 0.957321699545
50 -> 0.957471046136
51 -> 0.957596501236
52 -> 0.957912778518
53 -> 0.958245948522
54 -> 0.958778625954
55 -> 0.959105675521
56 -> 0.959586286152
57 -> 0.960655737705
58 -> 0.96098126328
59 -> 0.961121856867
60 -> 0.961084220716
61 -> 0.961419154711
62 -> 0.961545931249
63 -> 0.962062256809
64 -> 0.962378167641
65 -> 0.962724434036
66 -> 0.963064295486
67 -> 0.963194988254
68 -> 0.963711259317
69 -> 0.964194373402
70 -> 0.964722112732
71 -> 0.964863797868
72 -> 0.965019762846
73 -> 0.965510406343
74 -> 0.966037735849
75 -> 0.966155683854
76 -> 0.966839792249
77 -> 0.967935871743
78 -> 0.968072289157
79 -> 0.968014484007
80 -> 0.967911200807
81 -> 0.967859308672
82 -> 0.968154158215
83 -> 0.968470301058
84 -> 0.969139587165
85 -> 0.969771745836
86 -> 0.971050454921
87 -> 0.971226021685
88 -> 0.971476510067
89 -> 0.972245762712
90 -> 0.972418216806
91 -> 0.973411154345
92 -> 0.974202011369
93 -> 0.97500552975
94 -> 0.975100942127
95 -> 0.976659038902
96 -> 0.979048964218
97 -> 0.980103168755
98 -> 0.984425349087
99 -> 1.0

Quiz Question: Is this threshold value smaller or larger than the threshold used for the entire dataset to achieve the same specified precision of 96.5%?

Finally, let's plot the precision-recall curve.


In [45]:
plot_pr_curve(precision_all, recall_all, "Precision-Recall (Baby)")


