Highlighting Task - Event Extraction from Text

In this tutorial, we will show how dimensionality reduction can be applied over both the media units and the annotations of a crowdsourcing task, and how this impacts the results of the CrowdTruth quality metrics. We start with an open-ended extraction task, where the crowd was asked to highlight words or phrases in a text that refer to events or actions. The task was executed on FigureEight. For more crowdsourcing annotation task examples, click here.

To replicate this experiment, the code used to design and implement this crowdsourcing annotation template is available here: template, css, javascript.

This is what the task looked like to the workers:

A sample dataset for this task is available in this file, containing raw output from the crowd on FigureEight. Download the file and place it in a folder named data that has the same root as this notebook. The answers from the crowd are stored in the tagged_events column.


In [17]:
import pandas as pd

test_data = pd.read_csv("../data/event-text-highlight.csv")
test_data["tagged_events"][0:30]


Out[17]:
0                                       ["income fell"]
1                                   ["net income fell"]
2     ["reported third-quarter net income fell 5.9 %...
3            ["reported third-quarter net income fell"]
4                            ["reported third-quarter"]
5                        ["reported","fell","year-ago"]
6                                   ["reported","fell"]
7                                   ["reported","fell"]
8                                   ["reported","fell"]
9                                   ["reported","fell"]
10                                  ["reported","fell"]
11                                  ["reported","fell"]
12                           ["reported","income fell"]
13                         ["reported","income","fell"]
14    ["reported","third-quarter","net","income","fe...
15                                         ["reported"]
16                                         ["reported"]
17    ["Separately","reported third-quarter net inco...
18                            ["Separately","reported"]
19                                    ["third-quarter"]
20       ["held","20","years","anti-abortion movement"]
21           ["held","anti-abortion","movement","said"]
22    ["jobs he has held over the past 20 years have...
23                                 ["jobs he has held"]
24     ["jobs","anti-abortion movement","news release"]
25                    ["jobs","anti-abortion movement"]
26    ["jobs","he","has","held","over","the","past",...
27                      ["Justice","Department","said"]
28           ["purports to be a devout Roman Catholic"]
29    ["purports","has held over","anti-abortion","m...
Name: tagged_events, dtype: object

Notice the diverse behavior of the crowd workers. While most annotated each word individually, the worker on row 2 annotated a chunk of the sentence as a single multi-word phrase. Also, when a worker picked no answer, the value in the cell is [NONE].

A basic pre-processing configuration

Our basic pre-processing configuration attempts to normalize the different ways of performing the crowd annotations.

We set remove_empty_rows = False to keep the empty rows from the crowd. This configuration option will set all empty cell values to correspond to a NONE token in the annotation vector.

We build the annotation vector to have one component for each word in the sentence. To do this, we break up multiple-word annotations into a list of single words in the processJudgments call:

judgments[self.outputColumns[0]] = judgments[self.outputColumns[0]].apply(
            lambda x: str(x).replace(' ',self.annotation_separator))

The final configuration class Config is this:


In [57]:
import crowdtruth
from crowdtruth.configuration import DefaultConfig

class Config(DefaultConfig):
    inputColumns = ["doc_id", "sentence_id", "events", "events_count", "original_sententce", "processed_sentence", "tokens"]
    outputColumns = ["tagged_events"]
    open_ended_task = True
    annotation_separator = ","

    remove_empty_rows = False
    
    def processJudgments(self, judgments):
        # build annotation vector just from words
        judgments[self.outputColumns[0]] = judgments[self.outputColumns[0]].apply(
            lambda x: str(x).replace(' ',self.annotation_separator))

        # normalize vector elements
        judgments[self.outputColumns[0]] = judgments[self.outputColumns[0]].apply(
            lambda x: str(x).replace('[',''))
        judgments[self.outputColumns[0]] = judgments[self.outputColumns[0]].apply(
            lambda x: str(x).replace(']',''))
        judgments[self.outputColumns[0]] = judgments[self.outputColumns[0]].apply(
            lambda x: str(x).replace('"',''))
        judgments[self.outputColumns[0]] = judgments[self.outputColumns[0]].apply(
            lambda x: str(x).replace(',,,',','))
        judgments[self.outputColumns[0]] = judgments[self.outputColumns[0]].apply(
            lambda x: str(x).replace(',,',','))
        return judgments
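
To make the effect of processJudgments concrete, the sketch below applies the same chain of string replacements to a few raw cell values copied from the output above. The helper name normalize_annotation is hypothetical and introduced only for illustration; it is not part of CrowdTruth.

```python
SEP = ","  # same value as annotation_separator in Config

def normalize_annotation(raw):
    # hypothetical helper mirroring the replace calls in Config.processJudgments
    value = str(raw).replace(" ", SEP)   # break multi-word phrases into single words
    for char in ("[", "]", '"'):         # drop the list and quote characters
        value = value.replace(char, "")
    value = value.replace(",,,", ",")    # collapse repeated separators
    value = value.replace(",,", ",")
    return value

examples = [
    '["income fell"]',
    '["reported","fell","year-ago"]',
    '["reported","income fell"]',
]
for raw in examples:
    print(raw, "->", normalize_annotation(raw))
```

The phrase "income fell" ends up as two separate vector components, income and fell, so it becomes comparable with the answers of workers who highlighted each word on its own.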

Now we can pre-process the data and run the CrowdTruth metrics:


In [58]:
data_with_stopwords, config_with_stopwords = crowdtruth.load(
    file = "../data/event-text-highlight.csv",
    config = Config()
)

processed_results_with_stopwords = crowdtruth.run(
    data_with_stopwords,
    config_with_stopwords
)

Removing stopwords from Media Units and Annotations

A more complex dimensionality reduction technique involves removing the stopwords from both the media units and the crowd annotations. Stopwords (i.e. words that are very common in the English language) do not usually contain much useful information. Also, the behavior of the crowd with respect to them is inconsistent: some workers omit them, some annotate them.

The first step is to build a function that removes stopwords from strings. We will use the stopwords corpus in the nltk package to get the list of words. We want to build a function that can be reused for both the text in the media units and in the annotations column. Also, we need to be careful about omitting punctuation.

The function remove_stop_words does all of these things:


In [59]:
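import nltk
from nltk.corpus import stopwords
import string

# the stopwords corpus must be available locally;
# if it is missing, run nltk.download('stopwords') once
stopword_set = set(stopwords.words('english'))
# also treat the single-letter token 's' as a stopword
stopword_set.update(['s'])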

def remove_stop_words(words_string, sep):
    '''
    words_string: string containing all words
    sep: separator character for the words in words_string
    '''
    words_list = words_string.split(sep)
    corrected_words_list = ""
    for word in words_list:
        if word not in stopword_set:
            if corrected_words_list != "":
                corrected_words_list += sep
            corrected_words_list += word
    return corrected_words_list
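
A quick way to check the behavior of this function is to run it on a few strings. The sketch below repeats the same logic so that it runs on its own, and replaces the NLTK list with a small hand-picked stopword set. That set, and the name remove_stop_words_demo, are assumptions made for illustration, so results with the real corpus may differ.

```python
# Stand-in for stopwords.words('english'); an assumption for illustration only.
demo_stopword_set = {"the", "to", "of", "a", "he", "has", "over", "s"}

def remove_stop_words_demo(words_string, sep, stopword_set=demo_stopword_set):
    # same logic as remove_stop_words, with the stopword set passed in
    kept = ""
    for word in words_string.split(sep):
        if word not in stopword_set:
            if kept != "":
                kept += sep
            kept += word
    return kept

# media unit text: words separated by spaces
print(remove_stop_words_demo("The president responded , to the press", " "))
# annotation vector: words separated by the annotation separator
print(remove_stop_words_demo("jobs,he,has,held,over,the,past", ","))
# an annotation made only of stopwords comes back empty
print(repr(remove_stop_words_demo("the,of", ",")))
```

In this sketch the capitalized "The" and the comma token are kept, because the membership test is case-sensitive and only looks for exact matches in the set. The last call returns an empty string, which is the case a caller has to handle separately.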

In the new configuration class ConfigDimRed, we apply the function we just built to both the column that contains the media unit text (inputColumns[2]), and the column containing the crowd annotations (outputColumns[0]):


In [60]:
import pandas as pd

class ConfigDimRed(Config):
    def processJudgments(self, judgments):
        judgments = Config.processJudgments(self, judgments)
        
        # remove stopwords from input sentence
        for idx in range(len(judgments[self.inputColumns[2]])):
            judgments.at[idx, self.inputColumns[2]] = remove_stop_words(
                judgments[self.inputColumns[2]][idx], " ")
        
        for idx in range(len(judgments[self.outputColumns[0]])):
            judgments.at[idx, self.outputColumns[0]] = remove_stop_words(
                judgments[self.outputColumns[0]][idx], self.annotation_separator)
            if judgments[self.outputColumns[0]][idx] == "":
                judgments.at[idx, self.outputColumns[0]] = self.none_token
        return judgments

Now we can pre-process the data and run the CrowdTruth metrics:


In [61]:
data_without_stopwords, config_without_stopwords = crowdtruth.load(
    file = "../data/event-text-highlight.csv",
    config = ConfigDimRed()
)

processed_results_without_stopwords = crowdtruth.run(
    data_without_stopwords,
    config_without_stopwords
)

Effect on CrowdTruth metrics

Finally, we can compare the effect of the stopword removal on the CrowdTruth sentence quality score.


In [62]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

plt.scatter(
    processed_results_with_stopwords["units"]["uqs"],
    processed_results_without_stopwords["units"]["uqs"],
)
plt.plot([0, 1], [0, 1], 'red', linewidth=1)
plt.title("Sentence Quality Score")
plt.xlabel("with stopwords")
plt.ylabel("without stopwords")


Out[62]:
Text(0,0.5,u'without stopwords')

The red line in the plot runs along the diagonal. All sentences above the line have a higher sentence quality score when the stopwords are removed.
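
The same comparison can be summarized numerically by counting the points on each side of the diagonal. The sketch below is one way to do this. The helper name compare_scores and the score values are made up for illustration and are not results from this dataset; with the real data, the two sequences would be the uqs columns plotted above, aligned on the same unit ids.

```python
def compare_scores(before, after, tol=1e-9):
    # hypothetical helper: count points above, below, and on the diagonal
    if len(before) != len(after):
        raise ValueError("both score sequences must cover the same units")
    improved = sum(1 for b, a in zip(before, after) if a - b > tol)
    decreased = sum(1 for b, a in zip(before, after) if b - a > tol)
    unchanged = len(before) - improved - decreased
    return {"improved": improved, "decreased": decreased, "unchanged": unchanged}

# toy values, not taken from the experiment
with_stopwords = [0.60, 0.33, 0.57, 0.55]
without_stopwords = [0.70, 0.30, 0.66, 0.55]
print(compare_scores(with_stopwords, without_stopwords))
```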

The plot shows that removing the stopwords improved the quality for a majority of the sentences. Surprisingly though, some sentences decreased in quality. This effect can be understood when plotting the worker quality scores.


In [63]:
plt.scatter(
    processed_results_with_stopwords["workers"]["wqs"],
    processed_results_without_stopwords["workers"]["wqs"],
)
plt.plot([0, 0.8], [0, 0.8], 'red', linewidth=1)
plt.title("Worker Quality Score")
plt.xlabel("with stopwords")
plt.ylabel("without stopwords")


Out[63]:
Text(0,0.5,u'without stopwords')

The quality of the majority of workers has also increased in the configuration where we removed the stopwords. However, because of the inter-linked nature of the CrowdTruth quality metrics, the annotations of these workers now have a greater weight when calculating the sentence quality score. So the stopword removal process had the effect of removing some of the noise in the annotations and therefore increasing the quality scores, but also of amplifying the true ambiguity in the sentences.
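
The link between worker quality and sentence-level scores can be illustrated with a worker-weighted annotation score: the share of total worker quality held by the workers who picked a given word. This is a simplified sketch of the idea, not the CrowdTruth implementation; the function name, the worker ids, and all weights are hypothetical.

```python
def weighted_annotation_score(picked_by, worker_quality):
    # hypothetical helper: quality-weighted share of workers who picked the word
    total = sum(worker_quality.values())
    picked = sum(worker_quality[w] for w in picked_by)
    return picked / total

picked_by = {"w1", "w2"}  # workers who highlighted the word

# all workers weighted equally: the score is the plain fraction of workers
equal = {"w1": 1.0, "w2": 1.0, "w3": 1.0, "w4": 1.0}
print(weighted_annotation_score(picked_by, equal))

# the same annotations, but w1 and w2 now carry more weight than w3 and w4
reweighted = {"w1": 0.75, "w2": 0.75, "w3": 0.25, "w4": 0.25}
print(weighted_annotation_score(picked_by, reweighted))
```

No annotation changes between the two calls, yet the score of the word moves from 0.5 to 0.75 because the workers who chose it gained weight. The same mechanism means that a disagreement between highly weighted workers counts for more, which fits the explanation above for the sentences whose quality score went down.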


In [64]:
data_with_stopwords["units"]


Out[64]:
duration input.doc_id input.events input.events_count input.processed_sentence input.sentence_id input.tokens job output.tagged_events output.tagged_events.annotations output.tagged_events.unique_annotations worker uqs unit_annotation_score uqs_initial unit_annotation_score_initial
unit
1893181917 57.15 wsj_1033.tml $ 10.1__129__135###reported__38__46###fell__72... 4 Separately , Esselte Business Systems reported... 11 39 ../data/event-text-highlight {u'cents': 2, u'share': 2, u'period': 2, u'in'... 79 24 20 0.608170 {u'cents': 0.0297850757146, u'share': 0.029785... 0.519530 {u'cents': 0.1, u'share': 0.1, u'period': 0.1,...
1893181918 47.20 APW19990607.0041.tml purports__5__13###be__17__19###said__179__183#... 5 Kopp purports to be a devout Roman Catholic , ... 14 39 ../data/event-text-highlight {u'': 2, u'Catholic': 3, u'anti-abortion': 9, ... 128 29 20 0.334970 {u'': 0.157986305381, u'Catholic': 0.026908423... 0.259531 {u'': 0.1, u'Catholic': 0.15, u'anti-abortion'...
1893181919 43.90 NYT19981025.0216.tml protect__45__52###murdered__77__85###said__97_... 7 `` We as Christians have a responsibility to p... 14 39 ../data/event-text-highlight {u'a': 3, u'protect': 24, u'said': 9, u'from':... 94 20 20 0.576942 {u'a': 0.0704736341848, u'protect': 1.47222699... 0.401958 {u'a': 0.15, u'protect': 1.2, u'said': 0.45, u...
1893181920 41.50 NYT19981026.0446.tml opposed__170__177###followed__122__130###was__... 5 Slepian 's death was among the first topics ra... 16 39 ../data/event-text-highlight {u'among': 1, u'raised': 12, u'followed': 11, ... 119 22 20 0.551484 {u'among': 0.0200950204081, u'raised': 0.67052... 0.406153 {u'among': 0.05, u'raised': 0.6, u'followed': ...
1893181921 43.05 NYT19981026.0446.tml exploit__109__116###murder__133__139###said__2... 5 `` It 's possible that New York politics has n... 43 39 ../data/event-text-highlight {u'own': 1, u'willingness': 4, u'steppingstone... 102 29 20 0.477072 {u'own': 0.0221168623273, u'willingness': 0.25... 0.358854 {u'own': 0.05, u'willingness': 0.2, u'stepping...
1893181922 35.90 NYT19981121.0173.tml returned__139__147###shot__54__58 2 Slepian , 52 , an obstetrician and gynecologis... 17 39 ../data/event-text-highlight {u'and': 4, u'shot': 18, u'an': 1, u'through':... 117 30 20 0.573296 {u'and': 0.1014974899, u'shot': 0.990028551248... 0.431335 {u'and': 0.2, u'shot': 0.9, u'an': 0.05, u'thr...
1893181923 33.95 NYT19990505.0443.tml wrote__151__156###murder__80__86###observed__9... 5 A jogger observed Kopp 's car at 6 a.m. near S... 20 39 ../data/event-text-highlight {u'': 2, u'plate': 1, u'number': 1, u'observed... 112 33 20 0.657856 {u'': 0.0396913152867, u'plate': 0.04870870448... 0.492269 {u'': 0.1, u'plate': 0.05, u'number': 0.05, u'...
1893181924 44.00 XIE19990313.0173.tml be__124__126###accession__100__109###claiming_... 5 But some other parties and social organization... 10 39 ../data/event-text-highlight {u'and': 2, u'burdens': 5, u'some': 1, u'acces... 120 27 20 0.373122 {u'and': 0.0825392628047, u'burdens': 0.344247... 0.290667 {u'and': 0.1, u'burdens': 0.25, u'some': 0.05,...
1893181925 40.65 XIE19990313.0229.tml disasters__141__150###been__101__105###stabili... 8 `` Extending membership to these three democra... 12 39 ../data/event-text-highlight {u'century': 2, u'three': 1, u'''': 1, u'disas... 128 29 20 0.431015 {u'century': 0.0685543621321, u'three': 0.0487... 0.328617 {u'century': 0.1, u'three': 0.05, u'''': 0.05,...
1893181926 39.95 AP900815-0044.tml thrust__186__192###guard__149__154###camped__1... 6 The U.S. military buildup in Saudi Arabia cont... 11 38 ../data/event-text-highlight {u'kingdom': 3, u'force': 4, u'guard': 14, u'a... 148 34 20 0.528953 {u'kingdom': 0.0994574494386, u'force': 0.1399... 0.400969 {u'kingdom': 0.15, u'force': 0.2, u'guard': 0....
1893181927 38.90 AP900815-0044.tml blocked__184__191###said__34__38###retaliate__... 6 The Iraqi ambassador to Venezuela said on Tues... 32 38 ../data/event-text-highlight {u'and': 1, u'ambassador': 1, u'Kuwait': 1, u'... 104 25 20 0.450525 {u'and': 0.0325544025466, u'ambassador': 0.020... 0.342065 {u'and': 0.05, u'ambassador': 0.05, u'Kuwait':...
1893181928 45.10 AP900815-0044.tml flows__156__161###barricade__54__63###extended... 4 Bush told a news conference on Tuesday that th... 56 38 ../data/event-text-highlight {u'and': 2, u'on': 1, u'force': 4, u'it': 1, u... 128 33 20 0.402053 {u'and': 0.14169536593, u'on': 0.0209061159915... 0.336127 {u'and': 0.1, u'on': 0.05, u'force': 0.2, u'it...
1893181929 36.50 AP900816-0139.tml come__106__110###said__86__90###withdraw__160_... 5 After a two-hour meeting at his Kennebunkport ... 4 38 ../data/event-text-highlight {u'Kuwait': 3, u'Bush': 3, u'at': 1, u'home': ... 108 29 20 0.475422 {u'Kuwait': 0.0748202295946, u'Bush': 0.064393... 0.360409 {u'Kuwait': 0.15, u'Bush': 0.15, u'at': 0.05, ...
1893181930 35.65 AP900816-0139.tml statement__95__104###truth__216__221###attempt... 7 Replied State Department deputy spokesman Rich... 25 38 ../data/event-text-highlight {u'Replied': 12, u'Boucher': 3, u'rhetoric': 4... 80 21 20 0.379997 {u'Replied': 0.659090232546, u'Boucher': 0.061... 0.305077 {u'Replied': 0.6, u'Boucher': 0.15, u'rhetoric...
1893181931 32.10 AP900816-0139.tml said__54__58###attempt__115__122###led__131__1... 6 Meanwhile , Egypt 's official Middle East News... 64 38 ../data/event-text-highlight {u'large-scale': 7, u'Agency': 2, u'Thursday':... 111 24 20 0.532090 {u'large-scale': 0.273694670853, u'Agency': 0.... 0.435103 {u'large-scale': 0.35, u'Agency': 0.1, u'Thurs...
1893181932 36.30 APW19980219.0476.tml presumed__153__161###kidnapped__139__148###rec... 6 The top commander of a Cambodian resistance fo... 6 38 ../data/event-text-highlight {u'and': 2, u'force': 2, u'almost': 1, u'recov... 157 35 20 0.539572 {u'and': 0.0830114770935, u'force': 0.06061338... 0.428880 {u'and': 0.1, u'force': 0.1, u'almost': 0.05, ...
1893181933 31.90 NYT19980206.0460.tml tumult__207__213###reported__83__91###disrupti... 6 The economy created jobs at a surprisingly rob... 8 38 ../data/event-text-highlight {u'financial': 7, u'caused': 2, u'surprisingly... 125 32 20 0.463548 {u'financial': 0.356021526749, u'caused': 0.08... 0.408510 {u'financial': 0.35, u'caused': 0.1, u'surpris...
1893181934 35.35 NYT19980402.0453.tml find__106__110###believed__182__190###identifi... 6 The police and prosecutors said they had ident... 5 38 ../data/event-text-highlight {u'and': 1, u'find': 5, u'linking': 3, u'in': ... 102 26 20 0.461942 {u'and': 0.0203175077078, u'find': 0.268739659... 0.405740 {u'and': 0.05, u'find': 0.25, u'linking': 0.15...
1893181935 29.90 NYT19980424.0421.tml upheld__26__32 1 By a 6-3 vote , the court upheld a discriminat... 4 38 ../data/event-text-highlight {u'the': 1, u'citizenship': 4, u'unmarried': 4... 104 26 20 0.446112 {u'the': 0.0199130729834, u'citizenship': 0.13... 0.382814 {u'the': 0.05, u'citizenship': 0.2, u'unmarrie...
1893181936 32.90 SJMN91-06338157.tml lamented__98__106###said__86__90###tightening_... 7 One GOP source , reporting on a call from the ... 7 38 ../data/event-text-highlight {u'and': 1, u'help': 5, u'is': 4, u'need': 7, ... 100 23 20 0.448201 {u'and': 0.0199130729834, u'help': 0.258170521... 0.392595 {u'and': 0.05, u'help': 0.25, u'is': 0.2, u'ne...
1893181937 39.20 WSJ900813-0157.tml confrontation__211__224###end__41__44###propos... 8 Iraq 's Saddam Hussein , his options for endin... 3 38 ../data/event-text-highlight {u'Gulf': 5, u'increasingly': 3, u'ending': 7,... 122 29 20 0.339267 {u'Gulf': 0.180771506315, u'increasingly': 0.0... 0.258450 {u'Gulf': 0.25, u'increasingly': 0.15, u'endin...
1893181938 31.65 WSJ900813-0157.tml confirmed__38__47###reports__48__55###head__16... 3 Over the weekend , Pentagon officials confirme... 60 38 ../data/event-text-highlight {u'and': 1, u'within': 2, u'powerful': 1, u'of... 99 32 20 0.554168 {u'and': 0.0194958678635, u'within': 0.0485962... 0.464803 {u'and': 0.05, u'within': 0.1, u'powerful': 0....
1893181939 33.25 WSJ910225-0066.tml become__103__109###casualties__88__98###entren... 7 Despite the early indications of success , the... 16 38 ../data/event-text-highlight {u'indications': 5, u'tough': 2, u'is': 2, u'd... 128 28 20 0.452174 {u'indications': 0.271023781306, u'tough': 0.0... 0.330208 {u'indications': 0.25, u'tough': 0.1, u'is': 0...
1893181940 30.15 ed980111.1130.0089.tml likely__69__75###out__15__18###off__35__38###c... 4 The lights are out and the heat is off and tho... 1 38 ../data/event-text-highlight {u'and': 5, u'Canada': 2, u'people': 1, u'is':... 125 31 20 0.336470 {u'and': 0.170866699369, u'Canada': 0.01983097... 0.296920 {u'and': 0.25, u'Canada': 0.1, u'people': 0.05...
1893181941 33.55 wsj_0026.tml said__16__20###produced__122__130###approved__... 4 The White House said President Bush has approv... 4 38 ../data/event-text-highlight {u'produced': 9, u'Islands': 1, u'said': 7, u'... 83 21 20 0.478675 {u'produced': 0.491280127077, u'said': 0.32528... 0.402183 {u'produced': 0.45, u'said': 0.35, u'for': 0.1...
1893181942 29.75 PRI19980213.2000.0313.tml reporting__36__45 1 For NPR news , I 'm Auncil Martinez reporting . 10 10 ../data/event-text-highlight {u'I': 1, u'reporting': 16, u'Martinez': 1, u'... 24 7 20 0.848991 {u'I': 0.0297608786779, u'reporting': 0.937529... 0.604289 {u'I': 0.05, u'reporting': 0.8, u'Martinez': 0...
1893181943 33.75 PRI19980306.2000.1675.tml gunfire__11__18 1 More heavy gunfire in the Serbian province of ... 2 10 ../data/event-text-highlight {u'heavy': 8, u'province': 2, u'NONE': 1, u'gu... 26 4 20 0.716488 {u'heavy': 0.340689482465, u'province': 0.0293... 0.524574 {u'heavy': 0.4, u'province': 0.1, u'NONE': 0.0...
1893181944 38.60 SJMN91-06338157.tml said__10__14###met__37__40 2 Officials said the president himself met with ... 13 10 ../data/event-text-highlight {u'NONE': 1, u'himself': 2, u'said': 10, u'met... 43 9 20 0.574096 {u'NONE': 0.0087320973195, u'himself': 0.08564... 0.418177 {u'NONE': 0.05, u'himself': 0.1, u'said': 0.5,...
1893181945 41.50 VOA19980305.1800.2603.tml become__11__17###support__27__34 2 Women have become the sole support of their fa... 14 10 ../data/event-text-highlight {u'NONE': 2, u'support': 14, u'sole': 6, u'hav... 36 6 20 0.566965 {u'NONE': 0.0213007712472, u'support': 0.75809... 0.464811 {u'NONE': 0.1, u'support': 0.7, u'sole': 0.3, ...
1893181946 31.00 VOA19980305.1800.2603.tml forbidden__23__32###work__41__45 2 Yet , the Taliban have forbidden them to work . 15 10 ../data/event-text-highlight {u'them': 4, u'forbidden': 17, u'work': 7, u'N... 38 6 20 0.679360 {u'them': 0.160573478252, u'forbidden': 0.9670... 0.535963 {u'them': 0.2, u'forbidden': 0.85, u'work': 0....
1893181947 44.25 WSJ900813-0157.tml responded__14__23 1 The president responded , " Everything , every... 14 10 ../data/event-text-highlight {u'president': 3, u'NONE': 1, u'responded': 17} 21 3 20 0.865400 {u'president': 0.0950187394689, u'NONE': 0.009... 0.703831 {u'president': 0.15, u'NONE': 0.05, u'responde...
1893181948 49.95 ed980111.1130.0089.tml grip__47__51###maintain__34__42###continues__2... 3 A powerful ice storm continues to maintain its... 2 10 ../data/event-text-highlight {u'grip': 7, u'NONE': 1, u'continues': 6, u'po... 48 10 20 0.478889 {u'grip': 0.516808569107, u'NONE': 7.967463463... 0.337153 {u'grip': 0.35, u'NONE': 0.05, u'continues': 0...
1893181949 47.35 wsj_0026.tml denied__32__38###treatment__54__63 2 Previously , watch imports were denied such du... 6 10 ../data/event-text-highlight {u'NONE': 1, u'imports': 7, u'were': 2, u'watc... 36 9 20 0.553471 {u'NONE': 7.96746346345e-07, u'imports': 0.374... 0.383764 {u'NONE': 0.05, u'imports': 0.35, u'were': 0.1...
1893181950 45.05 wsj_0027.tml expect__5__11###cut__19__22 2 They expect him to cut costs throughout the or... 13 10 ../data/event-text-highlight {u'NONE': 2, u'cut': 11, u'to': 2, u'costs': 9... 45 9 20 0.579558 {u'NONE': 0.0101591382774, u'cut': 0.829219536... 0.330809 {u'NONE': 0.1, u'cut': 0.55, u'to': 0.1, u'cos...
1893181951 50.10 wsj_0106.tml declined__3__11###discuss__15__22 2 He declined to discuss other terms of the issue . 7 10 ../data/event-text-highlight {u'NONE': 2, u'terms': 1, u'of': 1, u'to': 2, ... 38 9 20 0.677483 {u'NONE': 0.00980459272858, u'terms': 0.066722... 0.441666 {u'NONE': 0.1, u'terms': 0.05, u'of': 0.05, u'...
1893181952 48.05 wsj_0150.tml closed__10__16 1 Primerica closed at $ 28.25 , down 50 cents . 7 10 ../data/event-text-highlight {u'NONE': 1, u'Primerica': 2, u'$': 2, u'cents... 38 9 20 0.679813 {u'NONE': 8.17248045606e-07, u'Primerica': 0.1... 0.463861 {u'NONE': 0.05, u'Primerica': 0.1, u'$': 0.1, ...
1893181953 43.90 wsj_0175.tml dispute__38__45###settle__27__33###talks__18__23 3 Both sides are in talks to settle the dispute . 9 10 ../data/event-text-highlight {u'NONE': 1, u'to': 2, u'settle': 13, u'in': 4... 48 7 20 0.769411 {u'NONE': 8.60151861794e-07, u'to': 0.15607993... 0.509102 {u'NONE': 0.05, u'to': 0.1, u'settle': 0.65, u...
1893181954 42.20 wsj_0471.tml suit__37__41###said__4__8###comment__22__29 3 DPC said it could n't comment on the suit . 7 10 ../data/event-text-highlight {u'comment': 15, u'NONE': 1, u'said': 12, u'co... 54 8 20 0.629596 {u'comment': 0.900546927647, u'NONE': 8.545539... 0.454137 {u'comment': 0.75, u'NONE': 0.05, u'said': 0.6...
1893181955 38.20 wsj_0520.tml announced__7__16###closed__52__58###request__2... 4 Nashua announced the Reiss request after the m... 13 10 ../data/event-text-highlight {u'NONE': 1, u'Nashua': 1, u'after': 1, u'requ... 54 10 20 0.652620 {u'NONE': 2.06151867342e-06, u'Nashua': 0.0107... 0.424066 {u'NONE': 0.05, u'Nashua': 0.05, u'after': 0.0...
1893181956 27.65 wsj_0551.tml expected__19__27###close__31__36###transaction... 3 The transaction is expected to close around ye... 6 10 ../data/event-text-highlight {u'NONE': 2, u'transaction': 11, u'end': 4, u'... 55 11 20 0.623260 {u'NONE': 0.0104003935425, u'transaction': 0.7... 0.418970 {u'NONE': 0.1, u'transaction': 0.55, u'end': 0...
1893181957 62.90 NYT19990505.0443.tml decided__207__214###said__193__197###killing__... 9 Based on physical evidence _ including a rifle... 9 55 ../data/event-text-highlight {u'': 52, u'el': 1, u'his': 1, u'en': 2, u'Bas... 281 69 20 0.521132 {u'': 0.000116692296731, u'el': 2.24408262943e... 0.283726 {u'': 2.6, u'el': 0.05, u'his': 0.05, u'en': 0...
1893181958 46.85 NYT19980424.0421.tml use__209__212###invalidating__192__204###vindi... 6 The New York Times said in an editorial on Sat... 3 54 ../data/event-text-highlight {u'Court': 1, u'in': 3, u'it': 3, u'years': 3,... 154 41 20 0.433716 {u'Court': 0.0605714503761, u'in': 0.127326102... 0.224615 {u'Court': 0.05, u'in': 0.15, u'it': 0.15, u'y...
