In this tutorial, we will apply CrowdTruth metrics to a multiple choice crowdsourcing task for Person Type Annotation from video fragments. The workers were asked to watch a video of about 3-5 seconds and then select from a multiple choice list the types of people that appear in the video fragment. The task was executed on a custom platform. For more crowdsourcing annotation task examples, click here.
This is a screenshot of the task as it appeared to workers:
A sample dataset for this task is available in this file, containing the raw output from the crowd collected on the custom platform. Download the file and place it in a folder named data at the same level as this notebook. Now you can check your data:
In [1]:
import pandas as pd
test_data = pd.read_csv("../data/custom-platform-person-video-multiple-choice.csv")
test_data.head()
Out[1]:
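As a quick sanity check, you can also list the column names of the raw crowd data; these are the columns we will map to inputColumns, outputColumns and customPlatformColumns in the configuration below:

# list the column names in the raw crowd data
print(list(test_data.columns))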
In [2]:
import crowdtruth
from crowdtruth.configuration import DefaultConfig
Our test class inherits the default configuration DefaultConfig, while also declaring some additional attributes that are specific to the Person Type Annotation in Video task:

inputColumns: list of input columns from the .csv file with the input data
outputColumns: list of output columns from the .csv file with the answers from the workers
customPlatformColumns: list of columns from the .csv file that define a standard annotation task, in the following order - judgment id, unit id, worker id, started time, submitted time. This variable is used for input files that do not come from AMT or FigureEight (formerly known as CrowdFlower).
annotation_separator: string that separates between the crowd annotations in outputColumns
open_ended_task: boolean variable defining whether the task is open-ended (i.e. the possible crowd annotations are not known beforehand, like in the case of free text input); in the task that we are processing, workers pick the answers from a pre-defined list, therefore the task is not open-ended, and this variable is set to False
annotation_vector: list of possible crowd answers, mandatory to declare when open_ended_task is False; for our task, this is the list of person types
processJudgments: method that defines the processing of the raw crowd data; for this task, we process the crowd answers to correspond to the values in annotation_vector
The complete configuration class is declared below:
In [3]:
class TestConfig(DefaultConfig):
    inputColumns = ["videolocation", "subtitles", "imagetags", "subtitletags"]
    outputColumns = ["selected_answer"]
    customPlatformColumns = ["judgmentId", "unitId", "workerId", "startedAt", "submittedAt"]

    # processing of a closed task
    open_ended_task = False
    annotation_vector = ["archeologist", "architect", "artist", "astronaut", "athlete", "businessperson", "celebrity",
                         "chef", "criminal", "engineer", "farmer", "fictionalcharacter", "journalist", "judge",
                         "lawyer", "militaryperson", "model", "monarch", "philosopher", "politician", "presenter",
                         "producer", "psychologist", "scientist", "sportsmanager", "writer", "none", "other"]

    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
            # remove square brackets from annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('[', ''))
            judgments[col] = judgments[col].apply(lambda x: str(x).replace(']', ''))
            # remove the quotes around the annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('"', ''))
        return judgments
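To make the pre-processing concrete, here is a small sketch of what processJudgments does to a single made-up raw crowd answer; the exact formatting of the raw values in your file may differ:

# hypothetical raw value, before pre-processing
raw = pd.DataFrame({"selected_answer": ['["Politician","Presenter"]']})
print(TestConfig().processJudgments(raw)["selected_answer"][0])
# expected: politician,presenter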
In [4]:
data, config = crowdtruth.load(
    file = "../data/custom-platform-person-video-multiple-choice.csv",
    config = TestConfig()
)
data['judgments'].head()
Out[4]:
In [5]:
results = crowdtruth.run(data, config)
results is a dict object that contains the quality metrics for the video fragments, annotations and crowd workers.
The video fragment metrics are stored in results["units"]:
In [6]:
results["units"].head()
Out[6]:
The uqs column in results["units"] contains the video fragment quality scores, capturing the overall worker agreement over each video fragment. Here we plot its histogram:
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(results["units"]["uqs"])
plt.xlabel("Video Fragment Quality Score")
plt.ylabel("Video Fragments")
Out[7]:
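Since results["units"] supports the usual pandas operations (we already called .head() on it), you can also inspect the video fragments with the lowest agreement directly, for example by sorting on the uqs column described above:

# show the 10 video fragments with the lowest quality (agreement) scores
results["units"].sort_values("uqs").head(10)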
The unit_annotation_score column in results["units"] contains the video fragment-annotation scores, capturing the likelihood that an annotation is expressed in a video fragment. For each video fragment, we store a dictionary mapping each annotation to its video fragment-annotation score.
In [8]:
results["units"]["unit_annotation_score"].head()
Out[8]:
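Because each entry of unit_annotation_score maps annotations to scores, you can, for instance, extract the highest-scoring person type per video fragment (a minimal sketch, assuming the stored scores behave like a Python dict):

# pick the annotation with the highest video fragment-annotation score for each unit
best_annotation = results["units"]["unit_annotation_score"].apply(
    lambda scores: max(scores, key=scores.get)
)
best_annotation.head()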
The worker metrics are stored in results["workers"]:
In [9]:
results["workers"].head()
Out[9]:
The wqs column in results["workers"] contains the worker quality scores, capturing the overall agreement between one worker and all the other workers.
In [10]:
plt.hist(results["workers"]["wqs"])
plt.xlabel("Worker Quality Score")
plt.ylabel("Workers")
Out[10]:
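Workers with very low wqs values disagree most with the rest of the crowd; a simple way to inspect them (again assuming results["workers"] is a pandas DataFrame, as the .head() call above suggests) is:

# show the 10 workers with the lowest quality scores
results["workers"].sort_values("wqs").head(10)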
The annotation metrics are stored in results["annotations"]. The aqs column contains the annotation quality scores, capturing the overall worker agreement over one annotation.
In [11]:
results["annotations"]
Out[11]:
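To see which person types the crowd agreed on most, you can sort the annotations by their quality score (a small sketch based on the aqs column described above):

# rank annotations from highest to lowest quality score
results["annotations"].sort_values("aqs", ascending=False)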