Berlin vs. the Berlin-based startup), canceled vs. the cancellation) and

$tfidf_{t,d} = tf_{t,d} \cdot \log(\frac{|D|}{|D_t|})$,

where $|D|$ is the number of documents in the corpus and $|D_t|$ is the number of documents containing the term $t$. Partition the text before the NP to classify:

$tf_{t,\bar{d_k}} = tf_{t,d} - tf_{t,d_k}$
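As a quick illustration of the tf-idf formula above (the corpus, documents, and the helper tf_idf below are invented for this sketch, not taken from the original text), a minimal Python version using raw term counts:

import math

# toy corpus, purely illustrative
corpus = [
    "berlin is the capital of germany",
    "the berlin-based startup canceled the launch",
    "the cancellation surprised nobody",
]

def tf_idf(term, doc, corpus):
    # tf_{t,d}: raw count of the term in the document
    tf = doc.split().count(term)
    # |D_t|: number of documents containing the term
    df = sum(term in d.split() for d in corpus)
    return tf * math.log(len(corpus) / df) if df else 0.0

print(tf_idf("berlin", corpus[0], corpus))   # 1 * log(3/1)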
To calculate the sum across an NP, proceed in two passes over the corpus:
#1st pass
l := 4;                              #initialize term length l
D := 0;                              #initialize file counter D
for each document d_i in the corpus
    D++;                             #count document
    p := 1;                          #initialize character position p
    while p + l in d_i
        #sequentially cut into terms t of length l
        t := substring(d_i, p, l);
        #*insert string normalization (optional)*
        #initialize count array where necessary
        C(t, d_i) := 0 unless defined;
        #save number of previous mentions
        #(i.e. annotate t with C(t, d_i))
        A(t, d_i, p) := C(t, d_i);
        #count current mention
        C(t, d_i)++;
        #count documents containing t
        #(only on first mention of t)
        E(t)++ if (C(t, d_i) = 1);
        p++;
    end;                             #end while
end;                                 #end for each
#2nd pass
for each document d_i in the corpus
    for each noun phrase NP(s, e) in d_i
        sum := 0;                    #initialize sum
        #from NP's starting position...
        p := s;
        #...to start of last term
        while p <= e - l + 1
            t := substring(d_i, p, l);
            #*insert string normalization (optional)*
            #get annotation of t at p,
            #calculate tf-idf from it
            #and add it to the current sum
            sum += (get(t, d_i, p) / p) * log(D / E(t));
            #calculate sum of other measures
            ...
            #advance to next term
            p++;
        end;                         #end while
        #average by the number of terms in NP(s, e)
        a := sum / (e - s - l + 2);
        #annotate sum and mean to NP(s, e)
        S(d_i, s, e) := sum;
        M(d_i, s, e) := a;
    end;                             #end for each
end;                                 #end for each
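A possible Python rendering of the two passes above; this is only a sketch under a few assumptions that are not in the pseudocode: 0-based character offsets, noun phrases given per document as inclusive (s, e) spans of at least l characters, and the names TERM_LEN, first_pass, second_pass, and noun_phrases are invented here. The division by position mirrors get(t, d_i, p)/p from the pseudocode.

import math
from collections import defaultdict

TERM_LEN = 4                         # l: fixed character term length

def first_pass(corpus):
    # A[(i, p)]: number of earlier mentions of the term starting at position p of document i
    # E[t]:      number of documents containing term t
    A = {}
    E = defaultdict(int)
    for i, doc in enumerate(corpus):
        C = defaultdict(int)         # per-document mention counts
        for p in range(len(doc) - TERM_LEN + 1):
            t = doc[p:p + TERM_LEN]          # optional: normalize t here
            A[(i, p)] = C[t]                 # previous mentions before position p
            C[t] += 1
            if C[t] == 1:                    # first mention in this document
                E[t] += 1
    return A, E, len(corpus)             # D = number of documents

def second_pass(corpus, noun_phrases, A, E, D):
    # noun_phrases[i]: list of (s, e) inclusive character spans in corpus[i]
    scores = {}
    for i, doc in enumerate(corpus):
        for (s, e) in noun_phrases[i]:
            total = 0.0
            for p in range(s, e - TERM_LEN + 2):           # up to start of last term
                t = doc[p:p + TERM_LEN]
                # previous mentions / text seen so far (p + 1 is the 1-based position),
                # weighted by idf = log(D / E[t])
                total += (A[(i, p)] / (p + 1)) * math.log(D / E[t])
            n_terms = e - s - TERM_LEN + 2
            scores[(i, s, e)] = (total, total / n_terms)   # S and M in the pseudocode
    return scores

# usage with hypothetical inputs:
# A, E, D = first_pass(corpus)
# scores = second_pass(corpus, noun_phrases, A, E, D)

Keying A by (document, position) rather than by (term, document, position) is equivalent here, since exactly one term of length l starts at each position.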