On the viability of crowdsourcing NLP annotations in healthcare


In [1]:
__author__  = 'Bruno Godefroy'

TL;DR For most annotation tasks in healthcare, we must rely on trained experts, who are always in high demand, creating a bottleneck for system development. Can we address this limitation by crowdsourcing the easier annotation tasks in this space? We report on an experiment addressing this question, with a focus on how to identify problematic workers and infer reliable annotations from noisy crowd work.

Overview

Obtaining high-quality annotations is a bottleneck for all natural language processing applications. The challenges are especially severe in healthcare, where we rely on annotators who have expertise in the practice of medicine and in understanding medical texts, and who are authorized to access sensitive data. Such annotators are in high demand and thus command high rates. The question then arises whether we can ease the burden on these annotators by crowdsourcing the less demanding annotation tasks involving publicly-available data, reserving the experts for where they are truly needed.

This post reports on a crowdsourcing experiment we ran to explore this issue. We defined a reasonably nuanced span-identification task, and launched it on Figure Eight. As expected, the output was noisy, as a result of the highly variable pool of annotators we tapped into. To infer quality labels from this noisy process, we used a straightforward application of Expectation-Maximization, with quite good results, suggesting that crowdsourcing is an effective tool for obtaining annotations for at least some NLP problems in healthcare.

Crowdsourcing task definition

The publicly available FDA Drug Labels dataset is a rich source of information about how drugs relate to disease states (among other things). Since it's a public dataset, we don't have to address privacy concerns, which are of course another limiting factor when it comes to annotating data in healthcare.

For our pilot task, we decided to focus on developing annotations to facilitate automatic extraction of the core drug–disease relationships expressed in these labels, as exemplified in figure 1. This is a problem we've worked on before, so we have a good sense for what information should be extracted.

Figure 1: Annotated sentences from a drug label.

We need our NLP model to identify the disease mentions in these texts and determine the sense of each mention. Our own exploration of the data led us to prevents, treats, and is contraindicated for as the right set of high-level categories for these mentions. This annotation design is motivated by recent experiments which gave encouraging results on similar tasks [1, 2].

For crowdsourcing, we break this down into two separate tasks: identifying disease mention spans, and assigning categories to identified spans. (Breaking the task down in this way isn't strictly necessary, but it eases the burden on individual annotators.) Since the span identification task is more open-ended and challenging, it's the one we explore in this post.

We developed a short annotation manual with examples, had an in-house expert label 50 sentences for quality control, and used the Figure Eight platform to design a simple annotation interface (figure 2).

Figure 2: The interface for our first task. Crowdworkers are asked to select disease, symptom, and injury mentions in a sentence. Some drug and disease mentions are automatically underlined using fixed lexicons to help the workers understand the texts.

Who completed our task on Figure Eight?

We launched our task on Figure Eight with 10,000 sentences. It was completed within a few days. The job was done by 451 people from 45 countries, the majority from India and Venezuela. No special qualifications were imposed.

Most workers made just a few contributions in a short period of time, as figures 3 and 4 show. Half of the sessions (continuous periods of work) lasted less than 20 minutes; at the upper end, 7% lasted more than an hour and a half. This is expected given recent studies of crowdworkers' behavior [3, 4].

Figure 3: Number of work sessions per contributor.

Figure 4: Work session durations.

Assessment against gold labels

Given our large and diverse pool of workers, we expect some of them to be unreliable, perhaps due to a lack of expertise or a lack of attention. With Figure Eight, we can supply our own labeled examples for a subset of cases, to help identify and filter out unreliable workers. Figure 5 summarizes the work of 100 annotators who were rejected from our task based on their performance on this gold data.

Figure 5: Main reasons for failing our gold-label assessment. Results come from a manual review of the output from 100 workers who didn't pass the test.

It seems that several workers attempted cheating strategies, such as selecting no text segments at all. Our gold labels help us identify such work. We also see evidence that some workers used auxiliary tools to translate the full Figure Eight interface into another language, including the sentences to annotate, with the result that the responses they submitted were in that language!
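
Figure Eight applies this filtering for us, but the underlying check is simple to reproduce offline. Here is a minimal sketch, assuming judgments are keyed by (worker, sentence) and gold labels by sentence; the data layout and the accuracy threshold are illustrative, not the platform's actual settings.

```python
from collections import defaultdict

def flag_unreliable_workers(judgments, gold, min_accuracy=0.7):
    """Flag workers whose accuracy on gold-labeled sentences is below a threshold.

    judgments: dict mapping (worker_id, sentence_id) -> submitted annotation
    gold:      dict mapping sentence_id -> expert annotation
    """
    correct, total = defaultdict(int), defaultdict(int)
    for (worker, sentence), annotation in judgments.items():
        if sentence in gold:                      # only score the gold (test) sentences
            total[worker] += 1
            correct[worker] += int(annotation == gold[sentence])
    return {w for w in total if correct[w] / total[w] < min_accuracy}
```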

From noisy judgments to crowd truth

Should we blindly trust all the judgments of all the workers who passed our gold-label assessment? Probably not. Some errors are inevitable even for careful workers, and some malicious workers are likely to slip past our assessment against our gold examples. Furthermore, there are bound to be cases that are ambiguous or open to interpretation, leading to multiple right answers that we ourselves might not have fully appreciated, as in figure 6.

Figure 6: An ambiguous case. Should "feelings of sadness related to winter" be selected as a disease? Or a symptom?

The most basic step we can take to address these concerns is to collect multiple judgments from many different annotators, in the hope that a consensus emerges. We could define consensus in various ways – e.g., the majority label, or at least 70% agreement on a label (leaving some cases without a label at all). However, these approaches are clearly suboptimal because they make the implicit assumption that all workers are equally trustworthy, by giving equal weight to all of them. To address this, we use a simple application of the Expectation-Maximization (EM) method to our matrix of worker judgments.
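
For concreteness, here is a minimal sketch of these two baseline aggregation rules for a single item, assuming its judgments are collected as a list of booleans (the 70% threshold is the one mentioned above):

```python
def majority_vote(judgments):
    """True if strictly more than half of the judgments are True."""
    return sum(judgments) > len(judgments) / 2

def threshold_vote(judgments, threshold=0.7):
    """Return a label only when agreement reaches the threshold; otherwise abstain."""
    agreement = sum(judgments) / len(judgments)
    if agreement >= threshold:
        return True
    if agreement <= 1 - threshold:
        return False
    return None  # no consensus, leave the item unlabeled

# majority_vote([True, True, False])  -> True
# threshold_vote([True, True, False]) -> None (only 67% agreement)
```

Both rules give every judgment the same weight, which is exactly the assumption the EM approach below relaxes.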

Our guiding idea is that "disagreement is not noise but signal" [5]. Assuming the majority is usually right, it's probably worth giving more credit to workers who often agree with each other. Following this intuition, Dawid and Skene [6] propose a simple model which uses EM to estimate response confidence and workers' reliability in an efficient way. Figure 7 provides an overview of this algorithm, which we state in full in an appendix.

Figure 7: The Expectation-Maximization algorithm for inferring labels from crowd work.

Let's consider a small example where we want to find the responses to five boolean questions using a crowd of four workers. For each question, we collect three True/False (T/F) judgments.

| | Worker 1 | Worker 2 | Worker 3 | Worker 4 | True response |
| --- | --- | --- | --- | --- | --- |
| Question 1 | T | T | T | – | T |
| Question 2 | F | – | F | T | F |
| Question 3 | T | T | F | – | T |
| Question 4 | – | F | F | T | F |
| Question 5 | F | T | – | T | F |

Table 1: Dummy collected judgments and the true response (which we don't have) for each question. A dash marks a question the worker did not judge.

Worker 1 looks highly reliable: wherever they contributed judgments, they agree with the true response. Workers 2 and 4 have low precision, in the sense that they sometimes say T when the correct label is F. Conversely, worker 3 has low recall, in the sense that they sometimes say F when the correct label is T.

Of course, we crowdsource precisely because we don't have these true responses – we have to try to uncover them given only the matrix of worker judgments. With EM, we do this by jointly estimating the reliability of each worker and the response that maximizes the likelihood of the observed judgments.

Table 2 gives the output of the algorithm for this simple example. The derived crowd response is T when the maximum likelihood estimate is larger than 0.5 and F otherwise. Our implementation (available on GitHub) converges after 8 iterations.

| | Maximum likelihood estimate | Derived crowd response | True response |
| --- | --- | --- | --- |
| Question 1 | 0.99 | T | T |
| Question 2 | 0.01 | F | F |
| Question 3 | 0.98 | T | T |
| Question 4 | 0.08 | F | F |
| Question 5 | 0.28 | F | F |

Table 2: Estimated maximum likelihood responses.

The algorithm does well here and shows how weighting contributions by inter-worker agreement can be valuable for judgment aggregation. All the responses derived from the crowd are correct according to our gold labels. For the first four questions, the model has high confidence in its estimate. For question 5, though the majority voted T, the model correctly infers F as the label because worker 1's judgments are given more weight than the others' judgments.

The individual workers' performance estimates (table 3) look good too. Despite the small amount of data, the algorithm successfully figured out that worker 1 was very reliable, workers 2 and 4 had low precision, and worker 3 had low recall.

| | Estimated precision | True precision | Estimated recall | True recall |
| --- | --- | --- | --- | --- |
| Worker 1 | 0.99 | 1.0 | 0.84 | 1.0 |
| Worker 2 | 0.75 | 0.67 | 0.99 | 1.0 |
| Worker 3 | 0.99 | 1.0 | 0.48 | 0.5 |
| Worker 4 | 0.12 | 0.0 | 0.99 | / |

Table 3: Workers' performance estimates. The '/' marks an undefined value: worker 4 never judged a question whose true response is T.
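
To make the toy example concrete, here is a minimal, self-contained sketch of this kind of EM for binary judgments (it is not the implementation linked above: it models each worker with a sensitivity and a specificity, assumes a uniform prior over T and F, and adds a small smoothing term). Run on the judgments from table 1, it should recover the derived crowd responses of table 2, though the exact estimates depend on initialization and smoothing.

```python
import numpy as np

# Judgments from table 1: rows are questions, columns are workers.
# None marks a question the worker did not judge.
J = [
    [True,  True,  True,  None],   # Question 1
    [False, None,  False, True],   # Question 2
    [True,  True,  False, None],   # Question 3
    [None,  False, False, True],   # Question 4
    [False, True,  None,  True],   # Question 5
]
n_items, n_workers = len(J), len(J[0])

# Worker parameters: sensitivity p(T judgment | true T) and
# specificity p(F judgment | true F), both initialized optimistically.
sens = np.full(n_workers, 0.7)
spec = np.full(n_workers, 0.7)
prob_T = np.full(n_items, 0.5)     # estimate that each true response is T

for _ in range(50):                # a fixed iteration budget is enough here
    # Estimate the posterior over true responses given the current worker
    # parameters, assuming a uniform prior p(r=T) = p(r=F) as in the post.
    for i in range(n_items):
        like_T = like_F = 1.0
        for k in range(n_workers):
            if J[i][k] is None:
                continue
            like_T *= sens[k] if J[i][k] else 1 - sens[k]
            like_F *= 1 - spec[k] if J[i][k] else spec[k]
        prob_T[i] = like_T / (like_T + like_F)

    # Re-estimate each worker's rates from the soft labels,
    # with a small pseudo-count to avoid division by zero.
    for k in range(n_workers):
        tp = fp = tn = fn = 0.01
        for i in range(n_items):
            if J[i][k] is None:
                continue
            if J[i][k]:
                tp += prob_T[i]
                fp += 1 - prob_T[i]
            else:
                fn += prob_T[i]
                tn += 1 - prob_T[i]
        sens[k] = tp / (tp + fn)
        spec[k] = tn / (tn + fp)

print(np.round(prob_T, 2))                         # per-question estimates
print(["T" if p > 0.5 else "F" for p in prob_T])   # derived crowd responses
```

After convergence, the per-worker sensitivity and specificity estimates should broadly mirror table 3: worker 1 high on both, workers 2 and 4 with low specificity (low precision), and worker 3 with low sensitivity (low recall).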

Several extensions of this and other methods [7, 8, 9, 10] have been proposed for this problem. Zhang et al. [10], for example, address the issue of getting trapped in a local optimum with EM by computing initial parameter estimates with a spectral method, rather than initializing them randomly. Other approaches focus on modeling micro-task difficulty along with worker reliability [8].

The wisdom of our crowd

There is evidence that, in many settings, a crowd of non-experts can collectively offer estimates that match or exceed those of individual experts [11, 12, 13, 14]. Is this true of our crowd of Figure Eight workers?

To address this question, we applied the EM algorithm as described above. Since many of our disease spans are multi-word expressions, we work at the token level: for each token selected by at least one worker, we estimate the probability that it is part of a disease mention.
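
The exact data format isn't shown here, but the conversion from span selections to token-level judgments is mechanical. A minimal sketch, assuming each worker's selection is a list of character-offset spans and tokens are likewise given as character offsets (all names are illustrative):

```python
def token_level_judgments(tokens, selections):
    """Convert span selections into per-token True/False judgments.

    tokens:     list of (start, end) character offsets, one per token
    selections: dict mapping worker_id -> list of selected (start, end) spans
    Returns a dict mapping worker_id -> list of booleans, one per token,
    True when the worker's selection covers that token.
    """
    judgments = {}
    for worker, spans in selections.items():
        judgments[worker] = [
            any(s <= tok_start and tok_end <= e for s, e in spans)
            for tok_start, tok_end in tokens
        ]
    return judgments
```

The resulting matrix of token-level judgments is what EM aggregates; as noted above, we only keep tokens selected by at least one worker.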

Ideally, most tokens end up with a probability close to 0 or 1, that is, with high confidence about whether or not they are part of an entity of interest. Figure 8 shows that this is the case.

Figure 8: The distribution of probabilities estimated with EM, after 1, 2 and 10 iterations (convergence).

Next, we randomly selected 1,000 tokens from our task and presented them (in their linguistic context) to an in-house expert for labeling. We then compared these expert-provided labels with the ones inferred from the crowd work. Figure 9 summarizes this evaluation, which uses the standard classification metrics of precision, recall, and F1.

Figure 9: Precision, recall and F1 scores at various confidence thresholds for each aggregation method.

We see that the crowd does well. At a threshold of 0.5, precision and recall are above 0.8, which is comparable to the results reported by MacLean and Heer on a similar entity extraction task [1].

In this experiment, most workers have high precision, but recall varies widely (see the next section). As a result, weighting contributions with EM has little impact on the crowd's precision: majority voting (orange curve) and EM (blue curve) give similar results. For recall, however, where variance among contributors is high, giving more weight to high-recall workers significantly improves the crowd response.

If we assume that most workers contribute mostly noise, we could also consider, for each sentence, only the judgment from the most reliable worker according to EM (green curve). In this experiment, this approach still performs better than majority voting but does not beat the weighted aggregation of all available judgments (blue curve).
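
For reference, the threshold sweep behind figure 9 is straightforward to compute. Here is a minimal sketch with hypothetical stand-in data; the real evaluation uses the EM estimates and the expert labels for the 1,000 sampled tokens.

```python
import numpy as np

def precision_recall_f1(prob_T, expert, threshold):
    """Score crowd-derived token labels against expert labels at a given threshold."""
    predicted = prob_T >= threshold
    tp = np.sum(predicted & expert)
    precision = tp / max(np.sum(predicted), 1)
    recall = tp / max(np.sum(expert), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

# Hypothetical stand-ins for the real evaluation data: EM estimates for the
# sampled tokens and the corresponding boolean expert labels.
rng = np.random.default_rng(0)
expert = rng.random(1000) < 0.3
prob_T = np.clip(expert + rng.normal(0, 0.3, 1000), 0, 1)

for threshold in (0.3, 0.5, 0.7):
    p, r, f1 = precision_recall_f1(prob_T, expert, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```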

Facing the crowd

EM estimates individual worker reliability and can therefore help us understand individual behaviors in the crowd. To that end, figure 10 shows the timeline of each work session (a continuous period of work for one contributor).

Figure 10: Precision and recall against time during work sessions. Clusters are represented with distinct colors.

As we noted above, most workers have high precision, while recall varies widely. That is, quality issues mostly come from missing relevant mentions rather than from selecting wrong ones. This is evident in the figure: most of the timelines are near the top of the plot, but they are spread out along the recall axis.

One might think that aggregation could easily fix this poor recall: by combining multiple high-precision, low-recall judgments, we would end up with high-precision, high-recall annotations. We see, however, that workers' behavior is highly correlated: they tend to miss the same text segments. In our task, this has a likely explanation: to maximize profit, workers might choose to increase their judgment rate at the cost of work quality. By reading sentences superficially, they identify the most obvious relevant mentions while missing the more demanding ones. This strategy can go unnoticed during judgment collection because our gold-label assessment deliberately avoids difficult and ambiguous cases.

Using dynamic time warping (DTW) to measure similarity between these time series (see this blog post for a discussion of DTW), we identified four distinct clusters of work sessions, corresponding to the different colors in figure 10 (a simplified clustering sketch follows the list):

  • Light blue (1,050 timelines): The largest cluster, consisting mostly of short sessions.

  • Yellow (6 timelines): Very low precision and very low recall. These workers unfortunately slipped through our gold-label assessment.

  • Orange (15 timelines): High precision and very low recall, probably from workers who selected almost no text segments.

  • Dark blue (31 timelines): Long sessions consisting of reliable annotations.
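
Here is the simplified clustering sketch referenced above: a plain dynamic-programming DTW distance plus hierarchical clustering, run on hypothetical one-dimensional recall timelines. Our actual analysis tracks both precision and recall over time and finds four clusters; the data, cluster count, and library choices below are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series of possibly unequal length."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Hypothetical session timelines: each is a series of running recall estimates.
sessions = [
    np.array([0.9, 0.85, 0.9, 0.88]),
    np.array([0.2, 0.25, 0.2]),
    np.array([0.88, 0.9, 0.87, 0.9, 0.89]),
    np.array([0.3, 0.2, 0.25, 0.2]),
]

# Pairwise DTW distances, condensed for scipy's hierarchical clustering.
n = len(sessions)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_distance(sessions[i], sessions[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(labels)   # e.g. [1 2 1 2]: high-recall vs. low-recall sessions
```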

This analysis highlights the diversity of the crowd and strengthens our intuition that weighting contributions with individual reliability, estimated using inter-worker agreement, is valuable. A judgment from a dark blue session in figure 10 should be given way more credit than one from the orange cluster.

However, these findings also reveal that a large group of workers with similar biases could lead EM to misleading results. For example, a large group of consistently wrong workers would appear to be reliable and might even drown out the work of a smaller, more reliable group. Any unsupervised method will have weaknesses like this, which highlights the importance of having at least some expert annotators reviewing the work.

Looking ahead

The above findings are reassuring – we might in fact be able to relieve some of the annotation burden from trained experts by using crowdsourcing. It remains to be seen where the boundary lies between work the crowd can do and work that requires experts, especially as we find ways to break down very challenging tasks into simpler component parts that crowdworkers can succeed at.

Appendix: A closer look at Expectation-Maximization

Let's take a closer look at the EM algorithm. We consider an experiment where we want to find the response $r_i$ to each question $i$, using noisy judgments $J_i$ collected from workers indexed by $k$. For each question, we collect boolean judgments $j_{ik}$ (T/F, or no judgment at all if worker $k$ did not respond to question $i$). Worker reliability is modeled with parameters $\Theta = \{\theta_1, \ldots, \theta_K\}$.

We assume judgments are independent between questions for a given worker (no gain of experience) and between workers for a given question (no inter-worker communication).

We describe below the main steps of the algorithm (see figure 7 above).

Initialization

Since we have no prior knowledge of workers' trustworthiness, their reliability parameters $\Theta$ are initialized to reasonable values based on what we might expect from workers in pools like ours.

Expectation

During the expectation step, workers' performance is re-estimated given the output of the maximization step. Here, we model each worker's quality with their true positive, false positive, true negative, and false negative rates.

Maximization

In this step, we calculate, for each question $i$, the maximum likelihood estimate that the true response $r_i$ is positive, given the collected judgments $J_i$: $p(r_i = \text{T} \mid J_i, \Theta)$.

Using Bayes' theorem:

$$ \begin{align} p(r_i = \text{T} \mid J_i, \Theta) &= \frac{p(J_i \mid r_i = \text{T}, \Theta) p(r_i = \text{T})}{p(J_i \mid \Theta)} \end{align} $$

We know that

$$ \begin{align} p(J_i \mid \Theta) = p(J_i \mid r_i=\text{T}, \Theta)p(r_i=\text{T}) + p(J_i \mid r_i=\text{F}, \Theta)p(r_i=\text{F}) \end{align} $$

By substitution, assuming the T and F response outcomes are equiprobable, i.e. $p(r_i = \text{T}) = p(r_i = \text{F})$, we obtain

$$ \begin{align} p(r_i = \text{T} \mid J_i, \Theta) &= \frac{ p(J_i \mid r_i=\text{T}, \Theta) }{ p(J_i \mid r_i = \text{T}, \Theta) + p(J_i \mid r_i = \text{F}, \Theta) } \end{align} $$

Because we assume worker independence, the probability of the collected judgments for a given question is the product of the individual judgment probabilities:

$$ p(J_i \mid r_i, \Theta) = \prod_{k} p(j_{ik} \mid r_i, \theta_k) $$

All in all,

$$ \begin{align} p(r_i = \text{T} \mid J_i, \Theta) &= \frac{ \prod_{k} p(j_{ik} \mid r_i = \text{T}, \theta_k) }{ \prod_{k} p(j_{ik} \mid r_i = \text{T}, \theta_k) + \prod_{k} p(j_{ik} \mid r_i = \text{F}, \theta_k) } \end{align} $$

This probability can be computed from the workers' reliability parameters $\Theta$ (estimated in the expectation step). Indeed, $p(j_{ik} \mid r_i = \text{T}, \theta_k)$ is worker $k$'s true positive rate if $j_{ik} = \text{T}$ and their false negative rate if $j_{ik} = \text{F}$. Similarly, for $p(j_{ik} \mid r_i = \text{F}, \theta_k)$, we use the false positive rate if $j_{ik} = \text{T}$ and the true negative rate if $j_{ik} = \text{F}$.
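
In code, this step is just a product over each worker's rates. A minimal sketch, where `rates[k]` is a hypothetical dictionary of worker $k$'s estimated rates:

```python
def posterior_true(judgments, rates):
    """p(r_i = T | J_i, Theta), assuming p(r_i = T) = p(r_i = F).

    judgments: dict mapping worker k -> boolean judgment j_ik
    rates:     dict mapping worker k -> {'tp': ..., 'fp': ..., 'tn': ..., 'fn': ...}
    """
    like_T = like_F = 1.0
    for k, j in judgments.items():
        # p(j_ik | r_i = T): true positive rate if j_ik is T, false negative rate otherwise.
        like_T *= rates[k]["tp"] if j else rates[k]["fn"]
        # p(j_ik | r_i = F): false positive rate if j_ik is T, true negative rate otherwise.
        like_F *= rates[k]["fp"] if j else rates[k]["tn"]
    return like_T / (like_T + like_F)
```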

Since it maximizes a non-convex log-likelihood function, this approach has no theoretical guarantee of performance, though empirical studies show that it usually converges to good estimates.

References

[1] D. MacLean, J. Heer. 2013. Identifying medical terms in patient-authored text: A crowdsourcing-based approach. *Journal of the American Medical Informatics Association (JAMIA)*.

[2] J. Tenuto. 2015. [How scientists are using CrowdFlower to create a massive biomedical database](https://www.figure-eight.com/citizen-science/). Figure Eight's Artificial Intelligence Resource Center.

[3] H. Zhai, T. Lingren, L. Deleger, Q. Li, M. Kaiser, L. Stoutenborough, I. Solti. 2013. Web 2.0-based crowdsourcing for high-quality gold standard development in clinical natural language processing. *Journal of Medical Internet Research*.

[4] S. Amer-Yahia, S. Basu Roy. 2016. Toward worker-centric crowdsourcing. *IEEE Data Engineering Bulletin*.

[5] L. Aroyo, C. Welty. 2013. Measuring crowd truth for medical relation extraction. *AAAI Fall Symposium on Semantics for Big Data*.

[6] A. P. Dawid, A. M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. *Journal of the Royal Statistical Society, Series C*.

[7] V. Raykar et al. 2009. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. *International Conference on Machine Learning (ICML)*.

[8] J. Whitehill, T. Wu, J. Bergsma, J.R. Movellan, P.L. Ruvolo. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. *Neural Information Processing Systems (NIPS)*.

[9] D. R. Karger, S. Oh, D. Shah. 2011. Iterative learning for reliable crowdsourcing systems. *Neural Information Processing Systems (NIPS)*.

[10] Y. Zhang, X. Chen, D. Zhou, M. I. Jordan. 2014. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. *Neural Information Processing Systems (NIPS)*.

[11] J. Surowiecki. 2005. *The Wisdom of Crowds*. Anchor Books.

[12] W. Antweiler, M. Z. Frank. 2004. Is all that talk just noise? The information content of Internet stock message boards. *The Journal of Finance*.

[13] H. Chen, P. De, Y. Hu, B.-H. Hwang. 2014. Wisdom of crowds: The value of stock opinions transmitted through social media. *Review of Financial Studies*.

[14] M. Nofer, O. Hinz. 2014. Are crowds on the Internet wiser than experts? The case of a stock prediction community. *Journal of Business Economics*.