This tutorial explains how to use the CrowdTruth metrics to process data collected with crowdsourcing. For more information about the metrics and how they work, read the paper CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement.
In [ ]:
!pip install crowdtruth
To use the development version, download the source code from the GitHub repository and install it from the console using:
python setup.py install
Now you can load the CrowdTruth library:
In [1]:
import crowdtruth
In this tutorial, we will work with data from a crowdsourcing task on Medical Relation Extraction. This is a multiple-choice task that was executed on FigureEight. The workers were asked to read a medical sentence with two highlighted terms, then pick from a multiple-choice list the relations expressed between the two terms in the sentence. Below you can see the task template:
[task template screenshot]
A sample dataset for this task is available in this file, containing raw output from the crowd on FigureEight. Download the file and place it in the same folder as this notebook, then check your data:
In [2]:
import pandas as pd
test_data = pd.read_csv("relex_example.csv")
test_data.head()
Out[2]:
In [3]:
from crowdtruth.configuration import DefaultConfig
Our test class inherits the default configuration DefaultConfig, while also declaring some additional attributes that are specific to the Medical Relation Extraction task:

- inputColumns: list of input columns from the .csv file with the input data
- outputColumns: list of output columns from the .csv file with the answers from the workers
- annotation_separator: string that separates between the crowd annotations in outputColumns
- open_ended_task: boolean variable defining whether the task is open-ended (i.e. the possible crowd annotations are not known beforehand, like in the case of free text input); in the task that we are processing, workers pick the answers from a pre-defined list, therefore the task is not open-ended, and this variable is set to False
- annotation_vector: list of possible crowd answers, mandatory to declare when open_ended_task is False; for our task, this is the list of medical relations
- processJudgments: method that defines processing of the raw crowd data; for this task, we process the crowd answers to correspond to the values in annotation_vector
The complete configuration class is declared below:
In [4]:
class TestConfig(DefaultConfig):
    inputColumns = ["term1", "b1", "e1", "term2", "b2", "e2", "sentence"]
    outputColumns = ["relations"]
    annotation_separator = " "

    # processing of a closed task
    open_ended_task = False
    annotation_vector = [
        "causes", "manifestation", "treats", "prevents", "symptom", "diagnose_by_test_or_drug",
        "location", "side_effect", "contraindicates", "associated_with", "is_a", "part_of",
        "other", "none"]

    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # remove square brackets from annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('[', ''))
            judgments[col] = judgments[col].apply(lambda x: str(x).replace(']', ''))
            # transform to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
        return judgments
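To see what this pre-processing does, here is a minimal standalone sketch; the raw string "[TREATS CAUSES]" is an invented example of the FigureEight export format, not a value from the dataset:

In [ ]:
# invented example of a raw multiple-choice judgment string
raw = "[TREATS CAUSES]"
# remove the square brackets and lowercase, as processJudgments does
cleaned = raw.replace("[", "").replace("]", "").lower()
print(cleaned)             # treats causes
# the annotation_separator (" ") then splits this into individual relations
print(cleaned.split(" "))  # ['treats', 'causes']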
In [5]:
data, config = crowdtruth.load(
file = "relex_example.csv",
config = TestConfig()
)
data['judgments'].head()
Out[5]:
In [6]:
results = crowdtruth.run(data, config)
results is a dict object that contains the quality metrics for sentences, relations, and crowd workers.
The sentence metrics are stored in results["units"]:
In [7]:
results["units"].head()
Out[7]:
The uqs column in results["units"] contains the sentence quality scores, capturing the overall worker agreement over each sentence. Here we plot its histogram:
In [9]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(results["units"]["uqs"])
plt.xlabel("Sentence Quality Score")
plt.ylabel("Sentences")
Out[9]:
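Sentences with a high uqs are the ones the crowd agrees on most. A minimal sketch of filtering on this score (the 0.8 cutoff is an arbitrary value chosen for illustration):

In [ ]:
# keep only the sentences with high worker agreement (0.8 is an arbitrary cutoff)
high_agreement = results["units"][results["units"]["uqs"] > 0.8]
print("%d out of %d sentences above the threshold" % (len(high_agreement), len(results["units"])))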
The unit_annotation_score column in results["units"] contains the sentence-relation scores. For each sentence, we store a dict mapping each relation to its sentence-relation score:
In [14]:
results["units"]["unit_annotation_score"].head()
Out[14]:
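Because each entry is a dict of relation scores, you can, for instance, extract the highest-scoring relation for every sentence. A small sketch, assuming the dict values are numeric scores:

In [ ]:
# pick the relation with the highest score for each sentence
best_relation = results["units"]["unit_annotation_score"].apply(
    lambda scores: max(scores, key=scores.get))
best_relation.head()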
The worker metrics are stored in results["workers"]:
In [11]:
results["workers"].head()
Out[11]:
The wqs column in results["workers"] contains the worker quality scores, capturing the overall agreement between one worker and all the other workers.
In [12]:
plt.hist(results["workers"]["wqs"])
plt.xlabel("Worker Quality Score")
plt.ylabel("Workers")
Out[12]:
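Workers with a low wqs disagree systematically with the rest of the crowd, which can indicate spam or a misunderstanding of the task. A sketch for flagging them (the 0.1 threshold is an arbitrary example):

In [ ]:
# flag workers whose quality score falls below an arbitrary threshold
low_quality_workers = results["workers"][results["workers"]["wqs"] < 0.1]
low_quality_workers.head()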
The relation metrics are stored in results["annotations"]. The aqs column contains the relation quality scores, capturing the overall worker agreement over one relation.
In [13]:
results["annotations"]
Out[13]:
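For example, you can rank the relations by their quality score to see which ones the crowd annotates most reliably (a minimal sketch):

In [ ]:
# rank the relations from most to least reliably annotated
results["annotations"].sort_values(by="aqs", ascending=False)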