Ch6 Learning to Classify Text

The goal of this chapter is to answer the following questions:

  1. How can we identify features of the data that are salient for classification?
  2. How can we build language models that perform text-processing tasks automatically?
  3. What can we learn from these models?

Supervised Classification

Classification means assigning the correct class label to an input. Supervised means that correctly labeled training data is available in advance.

For example, suppose we want to guess whether a name is male or female from how it ends:


In [1]:
import nltk
from nltk.corpus import names
import random

In [2]:
name = [(n,'M') for n in names.words('male.txt')] + [(n,'F') for n in names.words('female.txt')]
random.shuffle(name)
name[:10]


Out[2]:
[(u'Dannye', 'F'),
 (u'Alston', 'M'),
 (u'Conroy', 'M'),
 (u'Bret', 'M'),
 (u'Estele', 'F'),
 (u'Michaelina', 'F'),
 (u'Lurleen', 'F'),
 (u'Chevy', 'M'),
 (u'Carolynn', 'F'),
 (u'Blaine', 'M')]

Define a function that extracts features; here the feature is the last letter of the name.


In [3]:
def gender_feature(name): return {'last_letter': name[-1]}
featuresets = [(gender_feature(n), g) for (n,g) in name]
train, test = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train)

In [4]:
classifier.classify({'last_letter': 'a'})


Out[4]:
'F'

In [5]:
nltk.classify.accuracy(classifier, test)


Out[5]:
0.734

In [6]:
classifier.show_most_informative_features()


Most Informative Features
             last_letter = u'k'                M : F      =     44.9 : 1.0
             last_letter = u'a'                F : M      =     35.8 : 1.0
             last_letter = u'f'                M : F      =     15.2 : 1.0
             last_letter = u'p'                M : F      =     11.1 : 1.0
             last_letter = u'v'                M : F      =     10.5 : 1.0
             last_letter = u'd'                M : F      =     10.0 : 1.0
             last_letter = u'm'                M : F      =      9.1 : 1.0
             last_letter = u'o'                M : F      =      8.4 : 1.0
             last_letter = u'r'                M : F      =      6.8 : 1.0
             last_letter = u'g'                M : F      =      5.6 : 1.0

Try different features, for example adding the first letter, or the length of the name.


In [7]:
def gender_feature(name): return {'last_letter': name[-1], 'first_letter': name[0]}
featuresets = [(gender_feature(n), g) for (n,g) in name]
train, test = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train)

In [8]:
nltk.classify.accuracy(classifier, test)


Out[8]:
0.768

In [9]:
def gender_feature(name): return {'last_letter': name[-1], 'first_letter': name[0], 'len': len(name)}
featuresets = [(gender_feature(n), g) for (n,g) in name]
train, test = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train)

In [12]:
nltk.classify.accuracy(classifier, test)


Out[12]:
0.762

In [13]:
classifier.show_most_informative_features()


Most Informative Features
             last_letter = u'k'                M : F      =     44.9 : 1.0
             last_letter = u'a'                F : M      =     35.8 : 1.0
             last_letter = u'f'                M : F      =     15.2 : 1.0
             last_letter = u'p'                M : F      =     11.1 : 1.0
             last_letter = u'v'                M : F      =     10.5 : 1.0
             last_letter = u'd'                M : F      =     10.0 : 1.0
             last_letter = u'm'                M : F      =      9.1 : 1.0
             last_letter = u'o'                M : F      =      8.4 : 1.0
             last_letter = u'r'                M : F      =      6.8 : 1.0
             last_letter = u'g'                M : F      =      5.6 : 1.0

Document Classification

The movie_reviews corpus is already divided into positive and negative reviews; the goal is to predict, from its vocabulary, whether a review is positive or negative.


In [17]:
from nltk.corpus import movie_reviews

In [18]:
movie_reviews.categories()


Out[18]:
[u'neg', u'pos']

In [21]:
movie_reviews.fileids('neg')[:5], movie_reviews.fileids('pos')[:5]


Out[21]:
([u'neg/cv000_29416.txt',
  u'neg/cv001_19502.txt',
  u'neg/cv002_17424.txt',
  u'neg/cv003_12683.txt',
  u'neg/cv004_12641.txt'],
 [u'pos/cv000_29590.txt',
  u'pos/cv001_18431.txt',
  u'pos/cv002_15918.txt',
  u'pos/cv003_11664.txt',
  u'pos/cv004_11636.txt'])

In [22]:
documents = [(list(movie_reviews.words(f)), c)
             for c in movie_reviews.categories() for f in movie_reviews.fileids(c)]
random.shuffle(documents)

In [41]:
# use the 2000 most frequent words as features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, c) in all_words.most_common()[:2000]]
word_features[:5]


Out[41]:
[u',', u'the', u'.', u'a', u'and']

In [42]:
def document_features(doc):
    document_words = set(doc)  # the set automatically removes duplicate words
    features = {w: w in document_words for w in word_features}
    return features

In [44]:
tmp = document_features(['the', 'she'])
[key for key in tmp if tmp[key]]


Out[44]:
[u'the', u'she']

In [45]:
featuresets = [(document_features(d), c) for (d, c) in documents]
train, test = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train)
nltk.classify.accuracy(classifier, test)


Out[45]:
0.81

In [46]:
classifier.show_most_informative_features()


Most Informative Features
                  seagal = True              neg : pos    =     12.3 : 1.0
             outstanding = True              pos : neg    =     11.2 : 1.0
                   mulan = True              pos : neg    =      8.4 : 1.0
             wonderfully = True              pos : neg    =      6.9 : 1.0
                   damon = True              pos : neg    =      5.9 : 1.0
                   flynt = True              pos : neg    =      5.7 : 1.0
                  wasted = True              neg : pos    =      5.7 : 1.0
                   waste = True              neg : pos    =      5.4 : 1.0
                    jedi = True              pos : neg    =      5.3 : 1.0
                   awful = True              neg : pos    =      5.3 : 1.0

Part-of-Speech Tagging

Chapter 5 introduced the regular expression tagger, whose rules were written by hand; here we try to let a classifier discover the rules.


In [2]:
import nltk
from nltk.corpus import brown
fdist = nltk.FreqDist()
fdist.update([w[-1:] for w in brown.words()])
fdist.update([w[-2:] for w in brown.words()])
fdist.update([w[-3:] for w in brown.words()])

In [2]:
common_suf = [k for (k, c) in fdist.most_common()[:100]]
common_suf[:5]


Out[2]:
[u'e', u',', u'.', u's', u'd']

In [3]:
def pos_features(word):
    features = {suffix:word.lower().endswith(suffix) for suffix in common_suf}
    return features

In [4]:
featuresets = [(pos_features(n), pos) for (n, pos) in brown.tagged_words(categories='news')]
size = int(len(featuresets) * 0.1)
train, test = featuresets[size:], featuresets[:size]
size


Out[4]:
10055

In [5]:
classifier = nltk.DecisionTreeClassifier.train(train)
nltk.classify.accuracy(classifier, test)


Out[5]:
0.6248632521133765

In [11]:
# a DecisionTree can print out its learned structure
print(classifier.pretty_format(depth=10))


the=False? ............................................ .
  ,=False? ............................................ .
    s=False? .......................................... .
      .=False? ........................................ .
        of=False? ..................................... .
          and=False? .................................. .
            a=False? .................................. .
              in=False? ............................... .
                ed=False? ............................. .
                  to=False? ........................... .
                  to=True? ............................ TO
                ed=True? .............................. VBN
              in=True? ................................ IN
            a=True? ................................... AT
          and=True? ................................... CC
        of=True? ...................................... IN
      .=True? ......................................... .
    s=True? ........................................... PP$
      is=False? ....................................... PP$
        was=False? .................................... PP$
          as=False? ................................... PP$
            's=False? ................................. PP$
              ss=False? ............................... NNS
              ss=True? ................................ NN
            's=True? .................................. NP$
          as=True? .................................... CS
        was=True? ..................................... BEDZ
      is=True? ........................................ BEZ
        his=False? .................................... BEZ
        his=True? ..................................... PP$
  ,=True? ............................................. ,
the=True? ............................................. AT

Exploiting Context

Context means the words surrounding a word. For example, fly can be either a noun or a verb; if it is preceded by a or the, it is a noun.


In [1]:
def pos_features(sentence, i):
    features = {"suf_1": sentence[i][-1:], "suf_2": sentence[i][-2:], "suf_3": sentence[i][-3:]}
    if i == 0:
        features['prev'] = '*'
    else:
        features['prev'] = sentence[i-1]
    return features

In [3]:
pos_features(brown.sents()[0], 8)


Out[3]:
{'prev': u'an', 'suf_1': u'n', 'suf_2': u'on', 'suf_3': u'ion'}

In [6]:
brown.sents()[0][7:10]


Out[6]:
[u'an', u'investigation', u'of']

In [7]:
tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)


Out[7]:
0.7891596220785678

In [9]:
classifier.show_most_informative_features()


Most Informative Features
                   suf_1 = u'.'                . : NN     =   6950.8 : 1.0
                   suf_2 = u'he'              AT : NN     =   3296.2 : 1.0
                   suf_2 = u'ho'             WPS : NN     =   2982.4 : 1.0
                   suf_1 = u'r'              JJR : NNS    =   2252.6 : 1.0
                   suf_2 = u'to'              TO : JJ     =   2180.6 : 1.0
                   suf_1 = u'h'              ABX : NNS    =   2013.7 : 1.0
                   suf_2 = u'es'             NNS : IN     =   1676.3 : 1.0
                   suf_3 = u'hat'             CS : NN     =   1576.4 : 1.0
                   suf_1 = u"'"               '' : JJ     =   1502.2 : 1.0
                   suf_2 = u'ng'             VBG : VBN    =   1241.0 : 1.0

Sequence Classification

We can go one step further and use the tag predicted for one word to help predict the tag of the next word.


In [10]:
def pos_features(sentence, i, history):
    features = {"suf_1": sentence[i][-1:], "suf_2": sentence[i][-2:], "suf_3": sentence[i][-3:]}
    if i == 0:
        features['prev-word'] = '*'
        features['prev-tag'] = '*'
    else:
        features['prev-word'] = sentence[i-1]
        features['prev-tag'] = history[i-1]
    return features

In [11]:
# define our own tagger class
class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            # enumerate yields (index, value) pairs
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
 
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

In [12]:
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
tagger.evaluate(test_sents)


Out[12]:
0.7981455725544738

Evaluation

One way to create a test set is to shuffle the data with random.shuffle:


In [14]:
import random
from nltk.corpus import brown
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
# use 10% of the data as the test set
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

  • Accuracy: the proportion of inputs in the test set that are classified correctly.
  • Precision: of the items the classifier labels X, the proportion that really are X.
  • Recall: of all the items that really are X, the proportion that the classifier labels X.
  • F-Measure (F-Score): the harmonic mean of Precision and Recall = $\frac{2 \times Precision \times Recall}{Precision + Recall}$

Using a courtroom as an analogy: Precision is the probability that a person who is convicted really is guilty. If Precision is 0.95, then 5% of convictions are wrongful. Recall is the probability that a guilty person is convicted; if Recall is 0.85, then 15% of the guilty go free. The F-Score is then (2*0.85*0.95)/(0.85+0.95) ≈ 0.897.
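As a rough illustration, the sketch below computes these metrics with nltk.metrics for the movie-review model, assuming classifier and test still refer to the objects built in In [45] (later cells reuse these names, so re-run that cell first):

import collections
from nltk.metrics import precision, recall, f_measure

# For each label, collect the indices of the test items that carry it
# in the gold standard and in the classifier's output.
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test):
    refsets[label].add(i)
    testsets[classifier.classify(feats)].add(i)

print(precision(refsets['pos'], testsets['pos']))
print(recall(refsets['pos'], testsets['pos']))
print(f_measure(refsets['pos'], testsets['pos']))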

Confusion Matrices

A confusion matrix tabulates, for every pair of classes, how often an input of one class was labeled as the other, so it shows which classes the classifier confuses.
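A minimal sketch with nltk.ConfusionMatrix, assuming tagger and test_sents from the Sequence Classification section are still in scope:

# Flatten the gold-standard tags and the tags our tagger predicts.
gold = [tag for sent in test_sents for (word, tag) in sent]
predicted = [tag for sent in test_sents
             for (word, tag) in tagger.tag(nltk.tag.untag(sent))]
cm = nltk.ConfusionMatrix(gold, predicted)
# Show the nine most frequent tags, with cells as percentages.
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))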

Naive Bayes Classifier

The simplest way to classify with a probabilistic model is to always pick the most probable class; in a unigram model, for example, every word would be predicted to be 'the'. Once we have a model of $P(\text{label} \mid \text{features})$, we can classify a new input by choosing $\arg\max_{\ell} P(\ell \mid \text{features})$.

$$ P(\text{label} \mid \text{features}) = \frac{P(\text{features}, \text{label})}{P(\text{features})} = \frac{P(\text{features} \mid \text{label}) \times P(\text{label})}{P(\text{features})} = \frac{P(\text{label}) \times \prod_{f \in \text{features}} P(f \mid \text{label})}{P(\text{features})} $$

The last step relies on the naive Bayes assumption that the features are conditionally independent given the label.
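NLTK's NaiveBayesClassifier exposes this per-label distribution through prob_classify. A small sketch, assuming classifier still refers to the gender classifier from In [9] ('Katie' is just an example input; rebuild the classifier if a later cell has overwritten the variable):

# prob_classify returns a probability distribution over the labels.
dist = classifier.prob_classify(gender_feature('Katie'))
for label in dist.samples():
    print(label, dist.prob(label))
print(dist.max())   # the most probable label, which is what classify() returns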

In practice, $P(f \mid \text{label})$ is estimated as $count(f, \text{label}) / count(\text{label})$. In the POS tagging problem, for example, the estimate of P(suffix_es | NN) is count(suffix_es, NN) / count(NN).
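A minimal sketch of this count-based estimate, using the (name, gender) list name built in In [2] and a ConditionalFreqDist; the letter 'a' is just an illustration:

# count(last_letter, label) / count(label), computed per label.
cfd = nltk.ConditionalFreqDist((g, n[-1]) for (n, g) in name)
print(cfd['F'].freq('a'))   # estimate of P(last_letter='a' | label='F')
print(cfd['M'].freq('a'))   # estimate of P(last_letter='a' | label='M')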