This analysis uses the data gathered in the "Event Annotation" crowdsourcing experiment described in: Rion Snow, Brendan O’Connor, Dan Jurafsky, and Andrew Y. Ng: Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. EMNLP 2008, pages 254–263.
Task Description: Given two events in a text, the crowd has to choose whether the first event happened "strictly before" or "strictly after" the second event. Below, we provide an example from the aforementioned publication:
Text: “It just blew up in the air, and then we saw two fireballs go down to the, to the water, and there was a big small, ah, smoke, from ah, coming up from that”.
Events: go/coming, or blew/saw
A screenshot of the task as it appeared to workers can be seen at the following repository.
The dataset for this task was downloaded from the following repository, which contains the raw output from the crowd on AMT. The processed input file can be found in the data folder. Besides the raw crowd annotations, the processed file also contains the sentence and the two events that were given as input to the crowd (for part of the dataset).
In [1]:
import pandas as pd
test_data = pd.read_csv("../data/temp.standardized.csv")
test_data.head()
Out[1]:
In [2]:
import crowdtruth
from crowdtruth.configuration import DefaultConfig
Our test class inherits the default configuration DefaultConfig, while also declaring some additional attributes that are specific to the Temporal Event Ordering task:
inputColumns: list of input columns from the .csv file with the input data
outputColumns: list of output columns from the .csv file with the answers from the workers
customPlatformColumns: list of columns from the .csv file that define a standard annotation task, in the following order - judgment id, unit id, worker id, started time, submitted time. This variable is used for input files that do not come from AMT or FigureEight (formerly known as CrowdFlower).
annotation_separator: string that separates the crowd annotations in outputColumns
open_ended_task: boolean variable defining whether the task is open-ended (i.e. the possible crowd annotations are not known beforehand, like in the case of free text input); in the task that we are processing, workers pick the answers from a pre-defined list, therefore the task is not open-ended and this variable is set to False
annotation_vector: list of possible crowd answers, mandatory to declare when open_ended_task is False; for our task, this is the list of relations
processJudgments: method that defines the processing of the raw crowd data; for this task, we lowercase the crowd answers so that they correspond to the values in annotation_vector
The complete configuration class is declared below:
In [3]:
class TestConfig(DefaultConfig):
    inputColumns = ["gold", "event1", "event2", "text"]
    outputColumns = ["response"]
    customPlatformColumns = ["!amt_annotation_ids", "orig_id", "!amt_worker_ids", "start", "end"]

    # processing of a closed task
    open_ended_task = False
    annotation_vector = ["before", "after"]

    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform the answers to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
        return judgments
In [4]:
data, config = crowdtruth.load(
file = "../data/temp.standardized.csv",
config = TestConfig()
)
data['judgments'].head()
Out[4]:
In [5]:
results = crowdtruth.run(data, config)
results is a dict object that contains the quality metrics for the sentences, annotations and crowd workers.
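For a quick overview of everything that is stored in it, you can list the keys of the dict:

# the metrics discussed below live under "units", "workers" and "annotations"
list(results.keys())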
The sentence metrics are stored in results["units"]:
In [6]:
results["units"].head()
Out[6]:
The uqs column in results["units"] contains the sentence quality scores, capturing the overall worker agreement over each sentence. Here we plot its histogram:
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 15, 5
plt.subplot(1, 2, 1)
plt.hist(results["units"]["uqs"])
plt.ylim(0,200)
plt.xlabel("Sentence Quality Score")
plt.ylabel("#Sentences")
plt.subplot(1, 2, 2)
plt.hist(results["units"]["uqs_initial"])
plt.ylim(0,200)
plt.xlabel("Sentence Quality Score Initial")
plt.ylabel("# Units")
Out[7]:
In [8]:
import numpy as np
sortUQS = results["units"].sort_values(['uqs'], ascending=[1])
sortUQS = sortUQS.reset_index()
plt.rcParams['figure.figsize'] = 15, 5
plt.plot(np.arange(sortUQS.shape[0]), sortUQS["uqs_initial"], 'ro', lw = 1, label = "Initial UQS")
plt.plot(np.arange(sortUQS.shape[0]), sortUQS["uqs"], 'go', lw = 1, label = "Final UQS")
plt.ylabel('Sentence Quality Score')
plt.xlabel('Sentence Index')
plt.legend()
Out[8]:
The unit_annotation_score column in results["units"] contains the sentence-annotation scores, capturing the likelihood that an annotation is expressed in a sentence. For each sentence, we store a dictionary mapping each annotation to its sentence-annotation score.
In [9]:
results["units"]["unit_annotation_score"].head()
Out[9]:
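Since each entry in this column is a dictionary, the score of a single annotation can be pulled out into its own series; a minimal sketch (the name after_scores is just illustrative):

# sentence-annotation score of the "after" relation, for every sentence
after_scores = results["units"]["unit_annotation_score"].apply(lambda scores: scores["after"])
after_scores.head()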
Save the unit metrics:
In [30]:
rows = []
header = ["orig_id", "gold", "text", "event1", "event2", "uqs", "uqs_initial", "before", "after", "before_initial", "after_initial"]

units = results["units"].reset_index()
for i in range(len(units.index)):
    row = [units["unit"].iloc[i], units["input.gold"].iloc[i], units["input.text"].iloc[i], units["input.event1"].iloc[i],
           units["input.event2"].iloc[i], units["uqs"].iloc[i], units["uqs_initial"].iloc[i],
           units["unit_annotation_score"].iloc[i]["before"], units["unit_annotation_score"].iloc[i]["after"],
           units["unit_annotation_score_initial"].iloc[i]["before"], units["unit_annotation_score_initial"].iloc[i]["after"]]
    rows.append(row)

rows = pd.DataFrame(rows, columns=header)
rows.to_csv("../data/results/crowdtruth_units_temp.csv", index=False)
The worker metrics are stored in results["workers"]:
In [24]:
results["workers"].head()
Out[24]:
The wqs column in results["workers"] contains the worker quality scores, capturing the overall agreement between one worker and all the other workers.
In [25]:
plt.rcParams['figure.figsize'] = 15, 5
plt.subplot(1, 2, 1)
plt.hist(results["workers"]["wqs"])
plt.ylim(0,30)
plt.xlabel("Worker Quality Score")
plt.ylabel("#Workers")
plt.subplot(1, 2, 2)
plt.hist(results["workers"]["wqs_initial"])
plt.ylim(0,30)
plt.xlabel("Worker Quality Score Initial")
plt.ylabel("#Workers")
Out[25]:
Save the worker metrics:
In [26]:
results["workers"].to_csv("../data/results/crowdtruth_workers_temp.csv", index=True)
The annotation metrics are stored in results["annotations"]. The aqs column contains the annotation quality scores, capturing the overall worker agreement over one relation.
In [27]:
results["annotations"]
Out[27]:
Below, we look at the sentences with the highest and the lowest quality score, together with the expert answer and the crowd answers computed with and without CrowdTruth. First, we sort the sentences by quality score and remove the units for which the events and the text are missing:
In [28]:
import numpy as np
sortedUQS = results["units"].sort_values(["uqs"])
# remove the units for which we don't have the events and the text
sortedUQS = sortedUQS.dropna()
The sentence with the highest quality score (i.e. the highest worker agreement):
In [17]:
sortedUQS.tail(1)
Out[17]:
In [18]:
print("Text: %s" % sortedUQS["input.text"].iloc[len(sortedUQS.index)-1])
print("\n Event1: %s" % sortedUQS["input.event1"].iloc[len(sortedUQS.index)-1])
print("\n Event2: %s" % sortedUQS["input.event2"].iloc[len(sortedUQS.index)-1])
print("\n Expert Answer: %s" % sortedUQS["input.gold"].iloc[len(sortedUQS.index)-1])
print("\n Crowd Answer with CrowdTruth: %s" % sortedUQS["unit_annotation_score"].iloc[len(sortedUQS.index)-1])
print("\n Crowd Answer without CrowdTruth: %s" % sortedUQS["unit_annotation_score_initial"].iloc[len(sortedUQS.index)-1])
The sentence with the lowest quality score (i.e. the lowest worker agreement):
In [19]:
sortedUQS.head(1)
Out[19]:
In [20]:
print("Text: %s" % sortedUQS["input.text"].iloc[0])
print("\n Event1: %s" % sortedUQS["input.event1"].iloc[0])
print("\n Event2: %s" % sortedUQS["input.event2"].iloc[0])
print("\n Expert Answer: %s" % sortedUQS["input.gold"].iloc[0])
print("\n Crowd Answer with CrowdTruth: %s" % sortedUQS["unit_annotation_score"].iloc[0])
print("\n Crowd Answer without CrowdTruth: %s" % sortedUQS["unit_annotation_score_initial"].iloc[0])
To compare against MACE, we first pre-processed the crowd results to create input files compatible with the MACE tool. Each row in the csv file corresponds to a unit in the dataset and each column to a worker; each cell contains the worker's answer for that unit (or remains empty if the worker did not annotate that unit).
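A minimal sketch of one way to build such a file with pandas, assuming the column names of the standardized input file loaded earlier ("orig_id", "!amt_worker_ids", "response"), could look like this:

import pandas as pd

raw = pd.read_csv("../data/temp.standardized.csv")

# one row per unit, one column per worker; each cell holds that worker's answer
mace_input = raw.pivot_table(
    index="orig_id",
    columns="!amt_worker_ids",
    values="response",
    aggfunc="first",
)

# MACE expects a headerless csv with empty cells for missing judgments
mace_input.to_csv("../data/mace_temp.standardized.csv", header=False, index=False)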
In [29]:
import numpy as np
test_data = pd.read_csv("../data/mace_temp.standardized.csv", header=None)
test_data = test_data.replace(np.nan, '', regex=True)
test_data.head()
Out[29]:
After running MACE on this input, we read back its results for the units and for the workers:
In [31]:
import pandas as pd
mace_data = pd.read_csv("../data/results/mace_units_temp.csv")
mace_data.head()
Out[31]:
In [32]:
mace_workers = pd.read_csv("../data/results/mace_workers_temp.csv")
mace_workers.head()
Out[32]:
Next, we compare the worker quality scores computed by CrowdTruth and MACE:
In [33]:
mace_workers = pd.read_csv("../data/results/mace_workers_temp.csv")
crowdtruth_workers = pd.read_csv("../data/results/crowdtruth_workers_temp.csv")
mace_workers = mace_workers.sort_values(["worker"])
crowdtruth_workers = crowdtruth_workers.sort_values(["worker"])
In [34]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.scatter(
mace_workers["competence"],
crowdtruth_workers["wqs"],
)
plt.title("Worker Quality Score")
plt.xlabel("MACE")
plt.ylabel("CrowdTruth")
Out[34]:
In [35]:
sortWQS = crowdtruth_workers.sort_values(['wqs'], ascending=[1])
sortWQS = sortWQS.reset_index()
worker_ids = list(sortWQS["worker"])
mace_workers = mace_workers.set_index('worker')
# re-order the MACE worker scores to follow the CrowdTruth worker ordering
mace_workers = mace_workers.loc[worker_ids]
plt.rcParams['figure.figsize'] = 15, 5
plt.plot(np.arange(sortWQS.shape[0]), sortWQS["wqs"], 'bo', lw = 1, label = "CrowdTruth Worker Score")
plt.plot(np.arange(mace_workers.shape[0]), mace_workers["competence"], 'go', lw = 1, label = "MACE Worker Score")
plt.ylabel('Worker Quality Score')
plt.xlabel('Worker Index')
plt.legend()
Out[35]:
In [46]:
mace = pd.read_csv("../data/results/mace_units_temp.csv")
crowdtruth = pd.read_csv("../data/results/crowdtruth_units_temp.csv")
Finally, we evaluate the aggregation methods (CrowdTruth, MACE and majority vote) against the expert (gold) annotations, using the F1 score with "after" as the positive class. For CrowdTruth and MACE we sweep the decision threshold over the per-unit "after" scores from 0.01 to 1.0 and look at the best-performing thresholds; the majority vote uses the non-weighted (initial) scores with a fixed 0.5 threshold:
In [39]:
def compute_F1_score(dataset):
    f1_scores = np.zeros(shape=(100, 2))
    for idx in range(0, 100):
        # decision threshold on the "after" score, from 0.01 to 1.0
        thresh = (idx + 1) / 100.0
        tp = 0.0
        fp = 0.0
        tn = 0.0
        fn = 0.0
        for gt_idx in range(0, len(dataset.index)):
            if dataset['after'].iloc[gt_idx] >= thresh:
                if dataset['gold'].iloc[gt_idx] == 'after':
                    tp = tp + 1.0
                else:
                    fp = fp + 1.0
            else:
                if dataset['gold'].iloc[gt_idx] == 'after':
                    fn = fn + 1.0
                else:
                    tn = tn + 1.0
        f1_scores[idx, 0] = thresh
        if tp != 0:
            f1_scores[idx, 1] = 2.0 * tp / (2.0 * tp + fp + fn)
        else:
            f1_scores[idx, 1] = 0
    return f1_scores

def compute_majority_vote(dataset):
    # majority vote on the non-weighted (initial) "after" score, with a 0.5 threshold
    tp = 0.0
    fp = 0.0
    tn = 0.0
    fn = 0.0
    for j in range(len(dataset.index)):
        if dataset['after_initial'].iloc[j] >= 0.5:
            if dataset['gold'].iloc[j] == 'after':
                tp = tp + 1.0
            else:
                fp = fp + 1.0
        else:
            if dataset['gold'].iloc[j] == 'after':
                fn = fn + 1.0
            else:
                tn = tn + 1.0
    return 2.0 * tp / (2.0 * tp + fp + fn)
In [43]:
F1_crowdtruth = compute_F1_score(crowdtruth)
print(F1_crowdtruth[F1_crowdtruth[:,1].argsort()][-10:])
In [47]:
F1_mace = compute_F1_score(mace)
print(F1_mace[F1_mace[:,1].argsort()][-10:])
In [48]:
F1_majority_vote = compute_majority_vote(crowdtruth)
F1_majority_vote
Out[48]: