Ritz, Julia (2010). Using tf-idf-related Measures for Determining the Anaphoricity of Noun Phrases.

Abstract

  • tests the suitability of tf-idf-based measures for
    classifying discourse-given vs. discourse-new NPs
    (a.k.a. anaphoricity)
  • test corpora: ... (English newswire)
  • results: significant improvements, without relying on
    language-specific resources

Motivation

  • tf-idf is a shallow semantic measure used, among other things, in
    information retrieval and text summarization
  • anaphoricity: an NP is anaphoric if it refers to a real-world entity
    mentioned in the preceding text
  • usage: anaphoricity classification as a first step in anaphora resolution
    (i.e. reducing the search space)

Challenges

  • matching anaphora-antecedent pairs is hard
  • previous attempts: matching complete NP strings or only their heads,
    first or last tokens, etc.
  • these methods all rely on token boundaries and therefore can't account for
    compounding (Berlin vs. the Berlin-based startup),
    derivation (e.g. event anaphors canceled vs. the cancellation) and
    name variations

proposed solution

  • tf-idf-based matching between NPs, where terms are character ngrams
    (here: 4-grams, as suggested by Taboada et al. (2009))
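As an illustration (my own, not from the paper), overlapping character 4-grams match across token boundaries where full-string or head matching fails:

```python
def ngrams(s, n=4):
    """All character n-grams of s (n = 4, following Taboada et al. 2009)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# "Berlin" and "the Berlin-based startup" share no token, but they do
# share character 4-grams, so 4-gram-based matching can link them.
shared = ngrams("Berlin") & ngrams("the Berlin-based startup")
print(sorted(shared))  # ['Berl', 'erli', 'rlin']
```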

tf-idf adapted for anaphoricity classification

  • for each NP, we calculate a set of tf-idf-based features:
  • $t$: term / character ngram
  • $d$: document
  • $D$: corpus / set of documents
  • $D_t$: set of documents containing term t
  • $tf_{t,d}$: number of times term t occurs in document d,
    normalized to the total number of terms in the document

tf-idf

$tf idf_{t,d} = tf_{t,d} * \log(\frac{|D|}{|D_{t}|})$
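A minimal Python sketch of this formula (function names are mine; terms are character 4-grams, and tf is normalized by the total number of terms in the document, as defined above):

```python
import math

def char_ngrams(text, n=4):
    """The document's terms: overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tf_idf(term, doc, corpus, n=4):
    """tf-idf as above: tf_{t,d} * log(|D| / |D_t|)."""
    terms = char_ngrams(doc, n)
    tf = terms.count(term) / len(terms)                            # tf_{t,d}
    df = sum(1 for d in corpus if term in set(char_ngrams(d, n)))  # |D_t|
    return tf * math.log(len(corpus) / df)
```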

anaphoricity classification

  • partition the text before the NP to classify

    • $d_1, ..., d_n$: characters in document $d$
    • if an NP starts at character $d_{k+1}$, then ...
    • $tf_{t,d_k}$: the relative frequency of term t in $d_k$
      (document d up to position k)
  • $tf_{t,\bar{d_k}} = tf_{t,d} - tf_{t,d_k}$

    • read: increase of tf after k = term count in doc - term count up to k
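Following the gloss above ("increase of tf after k = term count in doc - term count up to k"), a sketch using raw counts (normalization omitted; helper names are mine):

```python
def char_ngrams(text, n=4):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tf_split(term, doc, k, n=4):
    r"""Return (count in d, count in d up to position k, count after k),
    i.e. raw-count versions of tf_{t,d}, tf_{t,d_k}, tf_{t,\bar{d_k}}."""
    total = char_ngrams(doc, n).count(term)
    prefix = char_ngrams(doc[:k], n).count(term)
    return total, prefix, total - prefix
```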

we calculate the sums and means of these measures for each NP

  • $tf_{t,d}$: how often does t occur in d?
  • $tf_{t,d_k}$: how often does t occur in d up to position k?
  • $tf_{t,\bar{d_k}}$: how often does t occur in d after position k?
  • $tf idf_{t,d}$: "standard" tf-idf (for the whole document d)
  • $tf idf_{t,d_k}$: tf-idf for document d up to position k
  • $tf idf_{t,\bar{d_k}}$: tf-idf for document d after position k
  • $idf_t$: "standard" idf

To calculate the sum across an NP, do this:

  • $NP^{e}_s$: an NP starting at position s and ending at position e
  • $l$: term / character ngram length (here: 4)
$$S_{tf idf_{NP_s^{e}, d_s}} = \sum_{i=s}^{e-l+1} tf idf_{t_i, d_i}$$
  • tf-idf means across a span are used elsewhere too, e.g. to
    determine the relevance of a sentence in text summarization (Bieler and Dipper 2008)
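A sketch of both the sum and the mean over an NP spanning character positions s..e (0-based here; names are mine, not the paper's):

```python
import math

def char_ngrams(text, n=4):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tf_idf(term, doc, corpus, n=4):
    terms = char_ngrams(doc, n)
    tf = terms.count(term) / len(terms)
    df = sum(1 for d in corpus if term in set(char_ngrams(d, n)))
    return tf * math.log(len(corpus) / df)

def np_tfidf_sum_mean(doc, corpus, s, e, l=4):
    """Sum tf-idf over the terms starting at s .. e - l + 1; the mean
    divides by the number of terms in the NP, e - s - l + 2."""
    total = sum(tf_idf(doc[i:i + l], doc, corpus) for i in range(s, e - l + 2))
    return total, total / (e - s - l + 2)
```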

Classification Experiments

Corpus & Annotation Scheme

  • test corpus: WSJ section of OntoNotes
  • annotation scheme:
    • APPOS: linking attributive/appositive NPs
    • IDENT: (other) coreference links
  • extracted all NPs with IDENT relation
    • marked them as anaphoric (AN) if they had an antecedent,
      i.e. an expression to their left referring to the same ID
    • otherwise marked as non-anaphoric (NON)

Preprocessing and Settings

  • random split: training and evaluation set,
    controlled by number of NPs in each document
  • Train: 103,245 NPs
  • Test: 13,414 NPs
  • additionally: a subset of the data including only NNPs (proper names)
  • lemmatization using TreeTagger

features (to compare against tf-idf)

  • exact match (string identity): how many times has the exact same NP string
    been mentioned before?
  • matching head: how many times did the lemma of the NP's head occur in the
    lemmatized preceding context? (exact scope of "context" unclear in the paper)
    • head: rightmost token directly dominated by the NP node
  • NP's grammatical function (subj, adverbial NP etc)
  • NP's surface form (pronoun, def. det., indef. det, etc.)
    • why is this called surface form?
  • bool: does NP contain a name (NNP)
  • in case of pronouns: morph. features (person, num, gender, reflexivity)
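A rough sketch of the two matching features (names and simplifications are mine: head matching is approximated here by counting earlier head lemmas rather than all lemmas in the preceding context):

```python
from collections import Counter

def matching_features(nps, head_lemmas):
    """nps: NP strings in document order; head_lemmas: their head lemmas.
    Returns (exact_match_count, head_match_count) per NP, counting only
    material to the NP's left."""
    seen_nps, seen_heads = Counter(), Counter()
    feats = []
    for np, head in zip(nps, head_lemmas):
        feats.append((seen_nps[np], seen_heads[head]))
        seen_nps[np] += 1
        seen_heads[head] += 1
    return feats
```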

Classification Experiments and Results

  • C4.5 decision trees trained with WEKA (training set)
  • baseline: majority class (NON) $\rightarrow$ 86.37% acc
  • results: significant improvement in acc. with each added feature set
    • exact + head + tf-idf feats. + linguistic feats.

#1st pass
l:=4; #initialize term length l
D:=0; #initialize document counter D

for each document d_i in the corpus
    #count document
    D++;
    p:=1; #initialize character position p
    while p + l - 1 <= |d_i|
        #sequentially cut into terms t of length l
        t:=substring(d_i, p, l);
        #*insert string normalization (optional)*
        #initialize count array where necessary
        C(t, d_i):=0 unless defined;
        #save number of previous mentions
        #(i.e. annotate t with C(t, d_i))
        A(t, d_i, p):=C(t, d_i);
        #count current mention
        C(t, d_i)++;
        #count documents containing t
        #(only on first mention of t)
        E(t)++ if (C(t, d_i) = 1);
        p++;
    end; #end while
end; #end for each


#2nd pass
for each document d_i in the corpus
    for each noun phrase NP_s^e in d_i
        sum:=0; #initialize sum
        #from NP's starting position...
        p:=s;
        #...to start of last term
        while p <= e - l + 1
            t:=substring(d_i, p, l);
            #*insert string normalization (optional)*
            #get annotation of t at p,
            #calculate tf-idf from it
            #and add it to the current sum
            sum+=(get(t, d_i, p)/p)*log(D/E(t));
            #calculate sum of other measures
            ...
            p++;
        end; #end while

        #average by the number of terms in NP_s^e
        a:=sum/(e - s - l + 2);
        #annotate sum and means to NP_s^e
        S(d_i, s, e):=sum;
        M(d_i, s, e):=a;
    end; #end for each
end; #end for each
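A hedged Python translation of the two passes above (my own rendering: positions are 0-based, so the pseudocode's division by p is guarded at p = 0; the dictionary names follow the pseudocode's A, C, E, S, M):

```python
import math
from collections import defaultdict

def first_pass(corpus, l=4):
    """Annotate each term occurrence with its prior count in the document,
    and count, per term, the number of documents containing it."""
    A = {}                        # (doc_index, position) -> prior mentions of term
    E = defaultdict(int)          # term -> number of documents containing it
    for i, d in enumerate(corpus):
        C = defaultdict(int)      # per-document term counts
        for p in range(len(d) - l + 1):
            t = d[p:p + l]
            A[(i, p)] = C[t]      # mentions of t before position p
            C[t] += 1
            if C[t] == 1:         # first mention of t in this document
                E[t] += 1
    return A, E

def second_pass(corpus, nps, A, E, l=4):
    """For each NP (doc_index, s, e), sum and average the tf-idf of its
    terms, where tf is the prior-mention count at p normalized by p
    and idf = log(|D| / |D_t|), as in the pseudocode."""
    D = len(corpus)
    S, M = {}, {}
    for (i, s, e) in nps:
        d = corpus[i]
        total = 0.0
        for p in range(s, e - l + 2):
            t = d[p:p + l]
            tf = A[(i, p)] / p if p > 0 else 0.0
            total += tf * math.log(D / E[t])
        S[(i, s, e)] = total
        M[(i, s, e)] = total / (e - s - l + 2)
    return S, M
```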