In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"
Different evaluation metrics encode different values and have different biases and other weaknesses. Thus, you should choose your metrics carefully, and motivate those choices when writing up and presenting your work.
This notebook reviews some of the most prominent evaluation metrics in NLP, seeking not only to define them, but also to articulate what values they encode and what their weaknesses are.
In your own work, you shouldn't feel confined to these metrics. Per the first point above, you should feel that you have the freedom to motivate new metrics and specific uses of existing metrics, depending on what your goals are.
If you're working on an established problem, then you'll feel pressure from readers (and referees) to use the metrics that have already been used for the problem. This might be a compelling pressure. However, you should always feel free to argue against those cultural norms and motivate new ones. Areas can stagnate due to poor metrics, so we must be vigilant!
This notebook discusses prominent metrics in NLP evaluations. I've had to be selective to keep the notebook from growing too long and complex. I think the measures and considerations here are fairly representative of the issues that arise in NLP evaluation.
The scikit-learn model evaluation usage guide is excellent as a source of implementations, definitions, and references for a wide range of metrics for classification, regression, ranking, and clustering.
This notebook is the first in a two-part series on evaluation. Part 2 is on evaluation methods.
In [2]:
%matplotlib inline
from nltk.metrics.distance import edit_distance
from nltk.translate import bleu_score
import numpy as np
import pandas as pd
import scipy.stats
from sklearn import metrics
import utils
In [3]:
# Set all the random seeds for reproducibility. Only the
# system seed is relevant for this notebook.
utils.fix_random_seeds()
For classifiers that predict real values (scores, probabilities), it is important to remember that a threshold was imposed to create these categorical predictions.
The position of this threshold can have a large impact on the overall assessment that uses the confusion matrix as an input. The default is to choose the class with the highest probability. This is so deeply ingrained that it is often not even mentioned. However, it might be inappropriate for the problem at hand.
Metrics like average precision explore this threshold as part of their evaluation procedure.
This function creates the toy confusion matrices that we will use for illustrative examples:
In [4]:
def illustrative_confusion_matrix(data):
    classes = ['pos', 'neg', 'neutral']
    ex = pd.DataFrame(
        data,
        columns=classes,
        index=classes)
    ex.index.name = "observed"
    return ex
In [5]:
ex1 = illustrative_confusion_matrix([
    [15, 10, 100],
    [10, 15, 10],
    [10, 100, 1000]])
Accuracy is the sum of the correct predictions divided by the sum of all predictions:
In [6]:
def accuracy(cm):
    return cm.values.diagonal().sum() / cm.values.sum()
Here's an illustrative confusion matrix:
ex1 =

| gold \ predicted | pos | neg | neutral |
|---|---|---|---|
| pos | 15 | 10 | 100 |
| neg | 10 | 15 | 10 |
| neutral | 10 | 100 | 1000 |
In [7]:
accuracy(ex1)
Out[7]:
Accuracy seems to directly encode a core value we have for classifiers – how often they are correct. In addition, the accuracy of a classifier on a test set will be negatively correlated with the negative log (logistic, cross-entropy) loss, which is a common loss for classifiers. In this sense, these classifiers are optimizing for accuracy.
In [8]:
ex2 = illustrative_confusion_matrix([
    [0, 0, 125],
    [0, 0, 35],
    [0, 0, 1110]])
In [9]:
ex2
Out[9]:
Intuitively, this is a worse classifier than the one that produced `ex1`. Whereas `ex1` does well at pos and neg despite their small size, this classifier doesn't even try to get them right – it always predicts neutral. However, its accuracy is higher!
In [10]:
print(accuracy(ex1))
print(accuracy(ex2))
Precision is the sum of the correct predictions divided by the sum of all guesses. This is a per-class notion; in our confusion matrices, it's the diagonal values divided by the column sums:
In [11]:
def precision(cm):
    return cm.values.diagonal() / cm.sum(axis=0)
ex1 =

| gold \ predicted | pos | neg | neutral |
|---|---|---|---|
| pos | 15 | 10 | 100 |
| neg | 10 | 15 | 10 |
| neutral | 10 | 100 | 1000 |
| precision | 0.43 | 0.12 | 0.90 |
In [12]:
precision(ex1)
Out[12]:
For our problematic all-neutral classifier above, precision is strictly speaking undefined for pos and neg:
In [13]:
ex2
Out[13]:
In [14]:
precision(ex2)
Out[14]:
It's common to see these `NaN` values mapped to 0.
Precision's dangerous edge case is that one can achieve very high precision for a category by rarely guessing it. Consider, for example, the following classifier's flawless predictions for pos and neg. These predictions are at the expense of neutral, but that is such a big class that it hardly matters to the precision for that class either.
In [15]:
ex3 = illustrative_confusion_matrix([
    [1, 0, 124],
    [0, 1, 24],
    [0, 0, 1110]])
In [16]:
ex3
Out[16]:
In [17]:
precision(ex3)
Out[17]:
These numbers mask the fact that this is a very poor classifier! Compare with our less imbalanced `ex1`; for "perfect" precision on `pos` and `neg`, we incurred only a small drop in `neutral` here:
In [18]:
ex1
Out[18]:
In [19]:
precision(ex1)
Out[19]:
Recall is the sum of the correct predictions divided by the sum of all true instances. This is a per-class notion; in our confusion matrices, it's the diagonal values divided by the row sums. Recall is sometimes called the "true positive rate".
In [20]:
def recall(cm):
    return cm.values.diagonal() / cm.sum(axis=1)
ex1 =

| gold \ predicted | pos | neg | neutral | recall |
|---|---|---|---|---|
| pos | 15 | 10 | 100 | 0.12 |
| neg | 10 | 15 | 10 | 0.43 |
| neutral | 10 | 100 | 1000 | 0.90 |
In [21]:
recall(ex1)
Out[21]:
Recall trades off against precision. For instance, consider again `ex3`, in which the classifier was very conservative with pos and neg:
ex3 =

| gold \ predicted | pos | neg | neutral | recall |
|---|---|---|---|---|
| pos | 1 | 0 | 124 | 0.008 |
| neg | 0 | 1 | 24 | 0.040 |
| neutral | 0 | 0 | 1110 | 1.000 |
| precision | 1.00 | 1.00 | 0.88 | |
Recall's dangerous edge case is that one can achieve very high recall for a category by always guessing it. This could mean a lot of incorrect guesses, but recall sees only the correct ones. You can see this in `ex3` above. The model did make some incorrect neutral predictions, but it missed none, so it achieved perfect recall for that category.
F scores combine precision and recall via their harmonic mean, with a value $\beta$ that can be used to emphasize one or the other. Like precision and recall, this is a per-category notion.
$$ (\beta^{2}+1) \cdot \frac{\textbf{precision} \cdot \textbf{recall}}{(\beta^{2} \cdot \textbf{precision}) + \textbf{recall}} $$

Where $\beta=1$, we have F1:
$$ 2 \cdot \frac{\textbf{precision} \cdot \textbf{recall}}{\textbf{precision} + \textbf{recall}} $$
In [22]:
def f_score(cm, beta):
    p = precision(cm)
    r = recall(cm)
    return (beta**2 + 1) * ((p * r) / ((beta**2 * p) + r))
In [23]:
def f1_score(cm):
    return f_score(cm, beta=1.0)
In [24]:
ex1
Out[24]:
In [25]:
f1_score(ex1)
Out[25]:
In [26]:
ex2
Out[26]:
In [27]:
f1_score(ex2)
Out[27]:
In [28]:
ex3
Out[28]:
In [29]:
f1_score(ex3)
Out[29]:
The F$_{\beta}$ score for a class $K$ is an attempt to summarize how well the classifier's $K$ predictions align with the true instances of $K$. Alignment brings in both missed cases and incorrect predictions. Intuitively, precision and recall keep each other in check in the calculation. This idea runs through almost all robust classification metrics.
There is no normalization for the size of the dataset within $K$ or outside of it.
For a given category $K$, the F$_{\beta}$ score for $K$ ignores all the values that are off the row and column for $K$, which might be the majority of the data. This means that the individual scores for a category can be very misleading about the overall performance of the system.
ex1 =

| gold \ predicted | pos | neg | neutral | F1 |
|---|---|---|---|---|
| pos | 15 | 10 | 100 | 0.187 |
| neg | 10 | 15 | 10 | 0.187 |
| neutral | 10 | 100 | 1,000 | 0.90 |

ex4 =

| gold \ predicted | pos | neg | neutral | F1 |
|---|---|---|---|---|
| pos | 15 | 10 | 100 | 0.187 |
| neg | 10 | 15 | 10 | 0.187 |
| neutral | 10 | 100 | 100,000 | 0.999 |
Dice similarity for binary vectors is sometimes used to assess how well a model has learned to identify a set of items. In this setting, it is equivalent to the per-token F1 score.
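As a quick check of that equivalence, here is a sketch on made-up binary indicator vectors (the vectors and variable names are purely illustrative):

```python
# Sketch: Dice similarity equals the positive-class F1 score for
# binary indicator vectors. The vectors here are made up.
gold = np.array([1, 1, 0, 1, 0, 0, 1, 0])
pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Dice: twice the overlap, divided by the total number of 1s.
dice = 2 * (gold & pred).sum() / (gold.sum() + pred.sum())

# F1 for the positive class, via sklearn:
f1_pos = metrics.f1_score(gold, pred, pos_label=1)

print(dice, f1_pos)  # These should be identical.
```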
The intuition behind F scores (balancing precision and recall) runs through many of the metrics discussed below.
The macro-averaged F$_{\beta}$ score (macro F$_{\beta}$) is the mean of the F$_{\beta}$ score for each category:
In [30]:
def macro_f_score(cm, beta):
    return f_score(cm, beta).mean(skipna=False)
In [31]:
ex1
Out[31]:
In [32]:
f1_score(ex1)
Out[32]:
In [33]:
macro_f_score(ex1, beta=1)
Out[33]:
In [34]:
ex2
Out[34]:
In [35]:
f1_score(ex2)
Out[35]:
In [36]:
macro_f_score(ex2, beta=1)
Out[36]:
In [37]:
ex3
Out[37]:
In [38]:
f1_score(ex3)
Out[38]:
In [39]:
macro_f_score(ex3, beta=1)
Out[39]:
In NLP, we typically care about modeling all of the classes well, so macro-F$_{\beta}$ scores often seem appropriate. However, this is also the source of their primary weaknesses:
If a model is doing really well on a small class $K$, its high macro F$_{\beta}$ score might mask the fact that it mostly makes incorrect predictions outside of $K$. So macro F$_{\beta}$ scoring will make this kind of classifier look better than it is.
Conversely, if a model does well on a very large class, its overall performance might be high even if it stumbles on some small classes. So macro F$_{\beta}$ scoring will make this kind of classifier look worse than it is, as measured by the sheer number of good predictions.
Weighted F$_{\beta}$ scores average the per-category F$_{\beta}$ scores, but it's a weighted average based on the size of the classes in the observed/gold data:
In [40]:
def weighted_f_score(cm, beta):
    scores = f_score(cm, beta=beta).values
    weights = cm.sum(axis=1)
    return np.average(scores, weights=weights)
In [41]:
weighted_f_score(ex3, beta=1.0)
Out[41]:
Weighted F$_{\beta}$ scores inherit the values of F$_{\beta}$ scores, and they additionally say that we want to weight the summary by the number of actual examples in each class (each class's support in the gold data). This will probably correspond well with how the classifier will perform, on a per-example basis, on data with the same class distribution as the training data.
Micro-averaged F$_{\beta}$ scores (micro F$_{\beta}$ scores) add up the 2 $\times$ 2 confusion matrices for each category versus the rest, and then they calculate the F$_{\beta}$ scores, with the convention being that the positive class's F$_{\beta}$ score is reported.
This function creates the 2 $\times$ 2 matrix for a category `cat` in a confusion matrix `cm`:
In [42]:
def cat_versus_rest(cm, cat):
    # True positives for `cat`:
    yes = cm.loc[cat, cat]
    # False negatives: gold `cat` predicted as some other class:
    yes_no = cm.loc[cat].sum() - yes
    # False positives: other classes predicted as `cat`:
    no_yes = cm[cat].sum() - yes
    # True negatives: everything else:
    no = cm.values.sum() - yes - yes_no - no_yes
    return pd.DataFrame(
        [[yes, yes_no],
         [no_yes, no]],
        columns=['yes', 'no'],
        index=['yes', 'no'])
In [43]:
display(ex1)
display(cat_versus_rest(ex1, 'pos'))
display(cat_versus_rest(ex1, 'neg'))
display(cat_versus_rest(ex1, 'neutral'))
In [44]:
sum([cat_versus_rest(ex1, cat) for cat in ex1.index])
Out[44]:
For the micro F$_{\beta}$ score, we just add up these per-category confusion matrices and calculate the F$_{\beta}$ score:
In [45]:
def micro_f_score(cm, beta):
    c = sum([cat_versus_rest(cm, cat) for cat in cm.index])
    return f_score(c, beta=beta).loc['yes']
In [46]:
micro_f_score(ex1, beta=1.0)
Out[46]:
Micro F$_{\beta}$ scores inherit the values of weighted F$_{\beta}$ scores. (The resulting scores tend to be very similar.)
For two-class problems, this has an intuitive interpretation in which precision and recall are defined in terms of correct and incorrect guesses ignoring the class.
The weaknesses too are the same as those of weighted F$_{\beta}$ scores, with the additional drawback that we actually get two potentially very different values, for the positive and negative classes, and we have to choose one to meet our goal of having a single summary number. (See the `'yes'` in the final line of `micro_f_score`.)
I noted above that confusion matrices hide a threshold for turning probabilities/scores into predicted labels. With precision–recall curves, we finally address this.
A precision–recall curve is a method for summarizing the relationship between precision and recall for a binary classifier.
The basis for this calculation is not the confusion matrix, but rather the raw scores or probabilities returned by the classifier. Normally, we use 0.5 as the threshold for saying that a prediction is positive. However, each distinct real value in the set of predictions is a potential threshold. The precision–recall curve explores this space.
Here's a basic implementation; the sklearn version is more flexible and so recommended for real experimental frameworks.
In [47]:
def precision_recall_curve(y, probs):
    """`y` is a list of labels, and `probs` is a list of predicted
    probabilities or predicted scores -- likely a column of the
    output of `predict_proba` using an `sklearn` classifier.
    """
    thresholds = sorted(set(probs))
    data = []
    for t in thresholds:
        # Use `t` to create labels:
        pred = [1 if p >= t else 0 for p in probs]
        # Precision/recall analysis as usual, focused on
        # the positive class:
        cm = pd.DataFrame(metrics.confusion_matrix(y, pred))
        prec = precision(cm)[1]
        rec = recall(cm)[1]
        data.append((t, prec, rec))
    # For intuitive graphs, always include this end-point:
    data.append((None, 1, 0))
    return pd.DataFrame(
        data, columns=['threshold', 'precision', 'recall'])
I'll illustrate with a hypothetical binary classification problem involving balanced classes:
In [48]:
y = np.random.choice((0, 1), size=1000, p=(0.5, 0.5))
Suppose our classifier is generally able to distinguish the two classes, but it never predicts a value above 0.4, so our usual methods of thresholding at 0.5 would make the classifier look very bad:
In [49]:
y_pred = [np.random.uniform(0.0, 0.3) if x == 0 else np.random.uniform(0.1, 0.4)
          for x in y]
The precision–recall curve can help us identify the optimal threshold given whatever our real-world goals happen to be:
In [50]:
prc = precision_recall_curve(y, y_pred)
In [51]:
def plot_precision_recall_curve(prc):
    ax1 = prc.plot.scatter(x='recall', y='precision', legend=False)
    ax1.set_xlim([0, 1])
    ax1.set_ylim([0, 1.1])
    ax1.set_ylabel("precision")
    ax2 = ax1.twiny()
    ax2.set_xticklabels(prc['threshold'].values[::100].round(3))
    _ = ax2.set_xlabel("threshold")
In [52]:
plot_precision_recall_curve(prc)
With precision–recall curves, we get a generalized perspective on F1 scores (and we could weight precision and recall differently to achieve the effects of `beta` for F scores more generally). These curves can be used, not only to assess a system, but also to identify an optimal decision boundary given external goals.
Most implementations are limited to binary problems. The basic concepts are defined for multi-class problems, but it's very difficult to understand the resulting hyperplanes.
There is no single statistic that does justice to the full curve, so this metric isn't useful on its own for guiding development and optimization. Indeed, opening up the decision threshold in this way really creates another hyperparameter that one has to worry about!
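For instance, here is one sketch of how the curve might be distilled into a decision: pick the threshold that maximizes F1 for the positive class. This is just one possible criterion (and the names `scored` and `best` are introduced here only for illustration); real applications might weight precision and recall differently.

```python
# Sketch: choose the threshold that maximizes F1 for the positive class,
# using the `prc` frame computed above.
scored = prc.dropna(subset=['threshold']).copy()
scored['f1'] = (2 * scored['precision'] * scored['recall']) / \
               (scored['precision'] + scored['recall'])
best = scored.loc[scored['f1'].idxmax()]
print(best)
```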
The Receiver Operating Characteristic (ROC) curve is superficially similar to the precision–recall curve, but it compares recall with the false positive rate.
Average precision, covered next, is a way of summarizing these curves with a single number.
Average precision is a method for summarizing the precision–recall curve. It does this by calculating the average precision weighted by the change in recall from step to step along the curve.
Here is the calculation in terms of the data structures returned by `precision_recall_curve` above, in which (as in sklearn) the largest recall value is first:

$$\textbf{average-precision}(p, r) = \sum_{i=1}^{n} (r_{i} - r_{i+1}) \cdot p_{i}$$

where $n$ is the number of thresholds (taken in increasing order) and the precision and recall vectors $p$ and $r$ are of length $n+1$. (We insert a final pair of values $p_{n+1}=1$ and $r_{n+1}=0$ in the precision–recall curve calculation, with no threshold for that point.)
In [53]:
def average_precision(p, r):
    total = 0.0
    for i in range(len(p)-1):
        total += (r[i] - r[i+1]) * p[i]
    return total
In [54]:
plot_precision_recall_curve(prc)
In [55]:
average_precision(prc['precision'].values, prc['recall'].values)
Out[55]:
An important weakness of this metric is cultural: it is often hard to tell whether a paper is reporting average precision or some interpolated variant thereof. The interpolated versions are meaningfully different and will tend to inflate scores. In any case, they are not comparable to the calculation defined above and implemented in sklearn as `sklearn.metrics.average_precision_score`.
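As a sanity check, here is a comparison of the simple implementation above with sklearn's uninterpolated version; the two should agree closely, up to minor differences in how the endpoints are handled:

```python
# Sketch: our simple average precision versus sklearn's uninterpolated
# `average_precision_score`, on the simulated data from above.
ours = average_precision(prc['precision'].values, prc['recall'].values)
theirs = metrics.average_precision_score(y, y_pred)
print(ours, theirs)
```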
Unlike for precision–recall curves, we aren't strictly speaking limited to binary classification here. Since we aren't trying to visualize anything, we can do these calculations for multi-class problems. However, then we have to decide on how the precision and recall values will be combined for each step: macro-averaged, weighted, or micro-averaged, just as with F$_{\beta}$ scores. This introduces another meaningful design choice.
There are interpolated versions of this score, and some tasks/communities have even settled on specific versions as their standard metrics. All such measures should be approached with skepticism, since all of them can inflate scores artificially in specific cases.
This blog post is an excellent discussion of the issues with linear interpolation. It proposes a step-wise interpolation procedure that is much less problematic. I believe the blog post and subsequent PR to `sklearn` led the `sklearn` developers to drop support for all interpolation mechanisms for this metric!
Average precision as defined above is a discrete approximation of the area under the precision–recall curve. This is a separate measure often referred to as "AUC". In calculating AUC for a precision–recall curve, some kind of interpolation will be done, and this will generally produce exaggerated scores for the same reasons that interpolated average precision does.
The Receiver Operating Characteristic (ROC) curve for a class $k$ depicts the false positive rate (FPR) for $k$ as a function of the recall for $k$. For instance, suppose we focus on $k$ as the positive class $A$:
$$ \begin{array}{r r r} \hline & A & B \\ \hline A & \text{TP}_{A} & \text{FN}_{A}\\ B & \text{FP}_{A} & \text{TN}_{A}\\ \hline \end{array} $$

The false positive rate is

$$ \textbf{fpr}(A) = \frac{\text{FP}_{A}}{\text{FP}_{A} + \text{TN}_{A}} $$

which is equivalent to 1 minus the recall for class $B$.
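Here is a quick check of that equivalence on a made-up 2 $\times$ 2 confusion matrix (rows are gold, columns are predicted), reusing the `recall` function defined earlier; the matrix and the name `roc_ex` are introduced only for illustration:

```python
# A made-up 2x2 confusion matrix for classes A and B (rows = gold).
roc_ex = pd.DataFrame(
    [[15, 10],
     [100, 50]],
    columns=['A', 'B'],
    index=['A', 'B'])

# False positive rate for A: FP_A / (FP_A + TN_A).
fpr_A = roc_ex.loc['B', 'A'] / roc_ex.loc['B'].sum()

# This should equal 1 minus the recall for B:
print(fpr_A, 1.0 - recall(roc_ex)['B'])
```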
ROC curves are implemented in sklearn.metrics.roc_curve.
The area under the ROC curve is often used as a summary statistic: see sklearn.metrics.roc_auc_score.
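As a sketch, here is how those two functions can be applied to the simulated `y` and `y_pred` from the precision–recall section above:

```python
# Sketch: ROC analysis of the simulated scores from the
# precision-recall section.
fpr, tpr, roc_thresholds = metrics.roc_curve(y, y_pred)

# Area under the ROC curve as a single summary number:
print(metrics.roc_auc_score(y, y_pred))
```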
ROC is limited to binary problems.
Recall that, for two classes $A$ and $B$,

$$ \begin{array}{r r r} \hline & A & B \\ \hline A & \text{TP}_{A} & \text{FN}_{A}\\ B & \text{FP}_{A} & \text{TN}_{A}\\ \hline \end{array} $$

we can express ROC as comparing $\textbf{recall}(A)$ with $1.0 - \textbf{recall}(B)$.
This reveals a point of contrast with scores based in precision and recall: the entire table is used, whereas precision and recall for a class $k$ ignore the $\text{TN}_{k}$ values. Thus, whereas precision and recall for a class $k$ will be insensitive to changes in $\text{TN}_{k}$, ROC will be affected by such changes. The following individual ROC calculations help to bring this out:
$$ \begin{array}{r r r r r} \hline & A & B & \textbf{F1} & \textbf{ROC}\\ \hline A & 15 & 10 & 0.21 & 0.90 \\ B & 100 & {\color{blue}{50}} & 0.48 & 0.83 \\ \hline \end{array} \qquad \begin{array}{r r r r r} \hline & A & B & \textbf{F1} & \textbf{ROC} \\ \hline A & 15 & 10 & 0.21 & 3.6 \\ B & 100 & {\color{blue}{500}} & 0.90 & 2.08 \\ \hline \end{array} $$

One might worry that the model on the right isn't better at identifying class $A$, even though its ROC value for $A$ is larger.
The mean squared error is a summary of the distance between predicted and actual values:
$$ \textbf{mse}(y, \widehat{y}) = \frac{1}{N}\sum_{i=1}^{N} (y_{i} - \hat{y_{i}})^{2} $$
In [56]:
def mean_squared_error(y_true, y_pred):
    diffs = (y_true - y_pred)**2
    return np.mean(diffs)
The raw distances `y_true - y_pred` are often called the residuals.
This measure seeks to summarize the errors made by a regression model. The smaller it is, the closer the model's predictions are to the truth. In this sense, it is intuitively like a counterpart to accuracy for classifiers.
Scikit-learn implements a variety of closely related measures: mean absolute error, mean squared logarithmic error, and median absolute error. I'd say that one should choose among these metrics based on how the output values are scaled and distributed. For instance, the logarithmic variant emphasizes relative rather than absolute differences (useful when the targets span orders of magnitude), and the median-based variant is more robust to outliers.
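Here is a quick, illustrative look at those related measures, using made-up regression values (nonnegative, since the squared logarithmic error is undefined for negative inputs); the variable names are introduced only for this sketch:

```python
# Sketch: related regression error metrics from sklearn, on made-up values.
y_true_reg = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_pred_reg = np.array([1.5, 2.5, 2.5, 5.0, 80.0])

print(metrics.mean_absolute_error(y_true_reg, y_pred_reg))
print(metrics.mean_squared_log_error(y_true_reg, y_pred_reg))
print(metrics.median_absolute_error(y_true_reg, y_pred_reg))
```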
The R$^{2}$ score is probably the most prominent method for summarizing regression model performance, in statistics, the social sciences, and ML/NLP. This is the value that `sklearn`'s regression models deliver with their `score` functions.

$$ \textbf{r2}(y, \widehat{y}) = 1.0 - \frac{ \sum_{i}^{N} (y_{i} - \hat{y_{i}})^{2} }{ \sum_{i}^{N} (y_{i} - \mu)^{2} } $$

where $\mu$ is the mean of the gold values $y$.
In [57]:
def r2(y_true, y_pred):
    mu = y_true.mean()
    # Total sum of squares:
    total = ((y_true - mu)**2).sum()
    # Sum of squared errors:
    res = ((y_true - y_pred)**2).sum()
    return 1.0 - (res / total)
The numerator in the R$^{2}$ calculation is the sum of squared errors, which is exactly the quantity that an ordinary least-squares regression model is trained to minimize. The denominator is the total sum of squares: the error of a baseline model that always predicts the mean $\mu$. Thus, R$^{2}$ is based in the ratio between the model's errors and those of this trivial baseline, which makes it a measure of the goodness of fit of the model.
R$^{2}$ is closely related to the squared Pearson correlation coefficient.
R$^{2}$ is closely related to the explained variance, which is also defined in terms of a ratio of the residuals and the variation in the gold data. For explained variance, the numerator is the variance of the residuals and the denominator is the variance of the gold values.
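To make those relationships concrete, here is a sketch on simulated data; the data and the `LinearRegression` fit are purely illustrative. For an in-sample ordinary least-squares fit with an intercept, R$^{2}$, explained variance, and the squared Pearson correlation between gold values and predictions should all coincide, up to floating-point noise.

```python
# Sketch: R^2, explained variance, and squared Pearson correlation
# coincide for an in-sample OLS fit with an intercept. The data here
# are simulated for illustration only.
from sklearn.linear_model import LinearRegression

X_sim = np.random.uniform(-1, 1, size=(500, 3))
y_sim = X_sim @ np.array([1.0, -2.0, 0.5]) + np.random.normal(0, 0.5, size=500)

ols = LinearRegression().fit(X_sim, y_sim)
preds = ols.predict(X_sim)

print(r2(y_sim, preds))
print(metrics.explained_variance_score(y_sim, preds))
print(scipy.stats.pearsonr(y_sim, preds)[0] ** 2)
```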
Adjusted R$^{2}$ seeks to take into account the number of predictors in the model, to reduce the incentive to simply add more features in the hope of lucking into a better score. In ML/NLP, relatively little attention is paid to model complexity in this sense. The attitude is like: if you can improve your model by adding features, you might as well do that!
The Pearson correlation coefficient $\rho$ between two vectors $y$ and $\widehat{y}$ of dimension $N$ is:
$$ \textbf{pearsonr}(y, \widehat{y}) = \frac{ \sum_{i}^{N} (y_{i} - \mu_{y}) \cdot (\widehat{y}_{i} - \mu_{\widehat{y}}) }{ \sqrt{\sum_{i}^{N} (y_{i} - \mu_{y})^{2}} \cdot \sqrt{\sum_{i}^{N} (\widehat{y}_{i} - \mu_{\widehat{y}})^{2}} } $$

where $\mu_{y}$ is the mean of $y$ and $\mu_{\widehat{y}}$ is the mean of $\widehat{y}$.
This is implemented as `scipy.stats.pearsonr`, which returns the coefficient and a p-value.
For comparing gold values $y$ and predicted values $\widehat{y}$, Pearson correlation is equivalent to a linear regression using $\widehat{y}$ and a bias term to predict $y$. See this great blog post for details.
As noted above, there is also a close relationship to R-squared values.
The Spearman rank correlation coefficient is the Pearson coefficient applied to the ranks of the values, as the following comparison illustrates:
In [58]:
corr_df = pd.DataFrame({
    'y1': np.random.uniform(-10, 10, size=1000),
    'y2': np.random.uniform(-10, 10, size=1000)})
In [59]:
scipy.stats.spearmanr(corr_df['y1'], corr_df['y2'])
Out[59]:
In [60]:
scipy.stats.pearsonr(corr_df['y1'].rank(), corr_df['y2'].rank())
Out[60]:
Unlike Pearson, Spearman is not sensitive to the magnitude of the differences. In fact, it's invariant under all monotonic rescaling, since the values are converted to ranks. This also makes it less sensitive to outliers than Pearson.
Of course, these strengths become weaknesses in domains where the raw differences do matter. That said, in most NLU contexts, Spearman will be a good conservative choice for system assessment.
For comparing gold values $y$ and predicted values $\widehat{y}$, Spearman correlation is equivalent to a linear regression using $\textbf{rank}(\widehat{y})$ and a bias term to predict $\textbf{rank}(y)$. See this great blog post for details.
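As a quick check of the invariance mentioned above, here is a sketch that applies a strictly increasing transformation (an arbitrary exponential rescaling, named `y3` here only for illustration) to one of the columns: the Spearman coefficient is unchanged, while the Pearson coefficient generally shifts.

```python
# A monotonic (strictly increasing) rescaling of y1:
y3 = np.exp(corr_df['y1'] / 4.0)

# Spearman is unchanged by the monotonic transformation:
print(scipy.stats.spearmanr(corr_df['y1'], corr_df['y2'])[0])
print(scipy.stats.spearmanr(y3, corr_df['y2'])[0])

# Pearson, by contrast, is sensitive to the raw magnitudes:
print(scipy.stats.pearsonr(corr_df['y1'], corr_df['y2'])[0])
print(scipy.stats.pearsonr(y3, corr_df['y2'])[0])
```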
Sequence prediction metrics all seek to summarize and quantify the extent to which a model has managed to reproduce, or accurately match, some gold standard sequences. Such problems arise throughout NLP, for example in speech-to-text transcription, machine translation, dialogue response generation, and image captioning.
Evaluation is very challenging because the relationships tend to be many-to-one: a given sentence might have multiple suitable translations; a given dialogue act will always have numerous felicitous responses; any scene can be described in multiple ways; and so forth. The most constrained of these problems is the speech-to-text case, but even that one has indeterminacy in real-world contexts (humans often disagree about how to transcribe spoken language).
The word error rate (WER) metric is a word-level, length-normalized measure of Levenshtein string-edit distance:
In [61]:
def wer(seq_true, seq_pred):
    d = edit_distance(seq_true, seq_pred)
    return d / len(seq_true)
In [62]:
wer(['A', 'B', 'C'], ['A', 'A', 'C'])
Out[62]:
In [63]:
wer(['A', 'B', 'C', 'D'], ['A', 'A', 'C', 'D'])
Out[63]:
To calculate this over an entire test set, one gets the edit distances for each gold–predicted pair and normalizes them by the total length of the gold examples, rather than normalizing each case separately:
In [64]:
def corpus_wer(y_true, y_pred):
    dists = [edit_distance(seq_true, seq_pred)
             for seq_true, seq_pred in zip(y_true, y_pred)]
    lengths = [len(seq) for seq in y_true]
    return sum(dists) / sum(lengths)
This gives a single summary value for the entire set of errors.
The value encoded reveals a potential weakness in certain domains. Roughly, the more semantic the task, the less appropriate WER is likely to be.
For example, adding a negation to a sentence will radically change its meaning but incur only a small WER penalty, whereas passivizing a sentence (Kim won the race → The race was won by Kim) will hardly change its meaning at all but incur a large WER penalty.
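Here is a small illustration of that point using the `wer` function above. The passive sentence is the one from the example; the negated variant is made up here for illustration:

```python
# Illustration: a meaning-reversing negation can incur a lower WER
# than a meaning-preserving passivization.
gold_sent = "Kim won the race".split()

negated = "Kim did not win the race".split()       # meaning reversed
passivized = "The race was won by Kim".split()     # meaning preserved

print(wer(gold_sent, negated))
print(wer(gold_sent, passivized))
```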
See also Liu et al. 2016 for similar arguments in the context of dialogue generation.
BLEU (Bilingual Evaluation Understudy) scores were originally developed in the context of machine translation, but they are applied in other generation tasks as well. For BLEU scoring, we require a set of gold outputs. The metric has two main components:
Modified n-gram precision: A direct application of precision would divide the number of correct n-grams in the predicted output (n-grams that appear in any of the gold outputs) by the total number of n-grams in the predicted output. This has a degenerate solution in which the predicted output consists of nothing but repetitions of a single n-gram that appears in the gold outputs. BLEU's modified version addresses this by clipping the count for each n-gram to the maximum number of times it appears in any single gold output.
Brevity penalty (BP): To avoid favoring outputs that are too short, a penalty is applied. Let $Y$ be the set of gold outputs, $\widehat{y}$ the predicted output, $c$ the length of $\widehat{y}$, and $r$ the length of the gold output in $Y$ whose length is closest to $c$. Then:

$$\textbf{BP}(Y, \widehat{y}) = \begin{cases} 1 & \textrm{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \textrm{otherwise} \end{cases}$$
The BLEU score itself is typically a combination of modified n-gram precision for various $n$ (usually up to 4):
$$\textbf{BLEU}(Y, \widehat{y}) = \textbf{BP}(Y, \widehat{y}) \cdot \exp\left(\sum_{n=1}^{N} w_{n} \cdot \log\left(\textbf{modified-precision}(Y, \widehat{y}, n)\right)\right)$$

where $Y$ is the set of gold outputs, $\widehat{y}$ is the predicted output, and $w_{n}$ is a weight for each $n$-gram level (usually set to $1/N$).
NLTK has implementations of BLEU scoring at the sentence level, as defined above, and at the corpus level (`nltk.translate.bleu_score.corpus_bleu`). At the corpus level, it is typical to do a kind of micro-averaging of the modified precision scores and use a cumulative version of the brevity penalty.
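As a sketch of the NLTK interface: the references and hypothesis below are made up, and the smoothing function is just one of the options NLTK provides (it helps avoid zero scores when higher-order n-grams are absent, which is common with short toy examples like these).

```python
# Sketch of NLTK's BLEU interface on made-up data.
smoother = bleu_score.SmoothingFunction().method1

refs = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split()]
hyp = "the cat sat on the mat".split()

# Sentence-level BLEU against multiple references:
print(bleu_score.sentence_bleu(refs, hyp, smoothing_function=smoother))

# Corpus-level BLEU: a list of reference sets, one per predicted output:
print(bleu_score.corpus_bleu([refs], [hyp], smoothing_function=smoother))
```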
BLEU scores attempt to achieve the same balance between precision and recall that runs through the majority of the metrics discussed here. BLEU has many affinities with word error rate, but it seeks to accommodate the fact that there are typically multiple suitable outputs for a given input.
Callison-Burch et al. (2006) criticize BLEU as a machine translation metric on the grounds that it fails to correlate with human scoring of translations. They highlight its insensitivity to n-gram order and its insensitivity to n-gram types (e.g., function vs. content words) as causes of this lack of correlation.
Liu et al. (2016) specifically argue against BLEU as a metric for assessing dialogue systems, based on a lack of correlation with human judgments about dialogue coherence.
There are many competitors/alternatives to BLEU, most proposed in the context of machine translation. Examples: ROUGE, METEOR, HyTER, Orange (smoothed Bleu).
Perplexity is a common metric for directly assessing generation models by calculating the probability that they assign to sequences in the test data. It is based in a measure of average surprisal:
$$H(P, x) = -\frac{1}{m}\log_{2} P(x)$$

where $P$ is a model assigning probabilities to sequences and $x$ is a sequence.
Perplexity is then the exponent of this:
$$\textbf{perplexity}(P, x) = 2^{H(P, x)}$$

Using any base $n$ both in defining $H$ and as the base in $\textbf{perplexity}$ will lead to identical results.
Minimizing perplexity is equivalent to maximizing probability.
It is common to report per-token perplexity; here the averaging should be done in log-space to deliver a geometric mean:
$$\textbf{token-perplexity}(P, x) = \exp\left(\frac{\log\textbf{perplexity}(P, x)}{\textbf{length}(x)}\right)$$

When averaging perplexity values obtained from all the sequences in a text corpus, one should again use the geometric mean:
$$\textbf{mean-perplexity}(P, X) = \exp\left(\frac{1}{m}\sum_{x\in X}\log(\textbf{token-perplexity}(P, x))\right)$$

for a set of $m$ examples $X$.
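To make these definitions concrete, here is a minimal sketch, assuming we already have the probability a model assigned to each token of each sequence; the probability values and function names below are made up for illustration:

```python
# Minimal sketch of per-token and corpus-level perplexity, given the
# per-token probabilities assigned by some model. Values are made up.

def token_perplexity(token_probs):
    """Geometric-mean-based perplexity for one sequence, given the
    probabilities the model assigned to its tokens."""
    log_probs = np.log(token_probs)
    return np.exp(-log_probs.mean())

def mean_perplexity(corpus_token_probs):
    """Average token-level perplexity over a corpus, computed in
    log-space to deliver a geometric mean."""
    log_ppls = [np.log(token_perplexity(p)) for p in corpus_token_probs]
    return np.exp(np.mean(log_ppls))

corpus = [
    np.array([0.2, 0.1, 0.4, 0.3]),
    np.array([0.5, 0.05, 0.2])]

print([token_perplexity(p) for p in corpus])
print(mean_perplexity(corpus))
```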
The guiding idea behind perplexity is that a good model will assign high probability to the sequences in the test data. This is an intuitive, expedient intrinsic evaluation, and it matches well with the objective for models trained with a cross-entropy or logistic objective.
Perplexity is heavily dependent on the nature of the underlying vocabulary in the following sense: one can artificially lower one's perplexity by having a lot of `UNK` tokens in the training and test sets. Consider the extreme case in which everything is mapped to `UNK` and perplexity is thus perfect on any test set. The more worrisome thing is that any amount of `UNK` usage side-steps the pervasive challenge of dealing with infrequent words.
As Hal Daumé discusses in this post, the perplexity metric imposes an artificial constraint that one's model outputs are probabilistic.
Perplexity is essentially the inverse of the probability the model assigns to the data (normalized for length) and, with some assumptions, its logarithm can be seen as an approximation of the cross-entropy between the model's predictions and the true underlying sequence probabilities.
The scikit-learn model evaluation usage guide is a great resource for metrics I didn't cover here. In particular:
- Clustering
- Ranking
- Inter-annotator agreement