The goal of this chapter is to answer these questions:

1. What are lexical categories, and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

Along the way, the chapter covers some basic NLP methods, such as sequence labeling, n-gram models, backoff, and evaluation.
The process of identifying and labeling each word's part of speech is called tagging, also known as part-of-speech tagging or POS tagging. In a typical NLP pipeline, tagging comes right after tokenization. Parts of speech are also called word classes or lexical categories, and the set of tags available to choose from is called a tagset.
In [2]:
import nltk
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
Out[2]:
In the example above, CC is a coordinating conjunction, RB an adverb, IN a preposition, NN a noun, and JJ an adjective. For a tag's full definition, query it with nltk.help.upenn_tagset('RB').
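The same help function also accepts a regular expression, so a whole family of related tags can be looked up at once (a quick illustration):

nltk.help.upenn_tagset('RB')    # definition and examples for adverbs
nltk.help.upenn_tagset('NN.*')  # matches NN, NNP, NNPS, NNS in one query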
In [3]:
nltk.tag.str2tuple('fly/NN') # convert a 'word/TAG' string into a (word, tag) tuple
Out[3]:
In [9]:
# tagged_words() returns the corpus as a list of (word, tag) tuples
nltk.corpus.brown.tagged_words()
Out[9]:
In [12]:
# the argument tagset='universal' switches to the simplified universal tagset
nltk.corpus.brown.tagged_words(tagset='universal')
Out[12]:
In [15]:
# use FreqDist to count how often each part of speech occurs
tag_fd = nltk.FreqDist(tag for (word, tag) in nltk.corpus.brown.tagged_words(tagset='universal'))
tag_fd.most_common()
Out[15]:
In [17]:
%matplotlib inline
tag_fd.plot()
In [61]:
tag_cd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words(tagset='universal'))
In [65]:
# look up the common POS tags of a given word
tag_cd['yield']
Out[65]:
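Since each condition in a ConditionalFreqDist is an ordinary FreqDist, the single most common tag for a word can be read off with max() (a small sketch):

tag_cd['yield'].max()  # the most frequent universal tag for 'yield'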
The corpus also contains tagged sentences:
In [69]:
nltk.corpus.brown.tagged_sents(tagset='universal')[0]
Out[69]:
In [72]:
pos = {} # the simplest way to define a dictionary in Python
pos['hello'] = 'world'
pos['right'] = 'here'
pos
Out[72]:
In [74]:
[w for w in pos] # iterating over a dict with for yields its keys
Out[74]:
In [75]:
pos.keys()
Out[75]:
In [76]:
pos.items()
Out[76]:
In [77]:
pos.values()
Out[77]:
In [80]:
pos = dict(hello='world', right='here') # another way to define a dict
pos
Out[80]:
In [81]:
f = nltk.defaultdict(int) # missing keys default to int(), i.e. 0
f['color'] = 4
f
Out[81]:
In [84]:
f['dream'] # 'dream' does not exist, but looking it up automatically adds it
Out[84]:
In [85]:
f # after the lookup, 'dream' has been added with the default value
Out[85]:
In [86]:
f = nltk.defaultdict(lambda: 'xxx') # missing keys now default to the string 'xxx'
f['hello'] = 'world'
f
Out[86]:
In [87]:
f['here'] = f['here'] + 'comment' # f['here'] starts as 'xxx', so this stores 'xxxcomment'
f
Out[87]:
In [6]:
old = dict(nltk.corpus.brown.tagged_words()[:100])
new = dict((value, key) for (key, value) in old.items()) # invert the dict: tag -> word
In [8]:
new['JJ'] # the inversion works, but each tag keeps only the last word inserted
Out[8]:
In [10]:
new2 = nltk.defaultdict(list) # missing keys are treated as an empty list
for (key, value) in old.items():
    new2[value].append(key)
In [11]:
new2['JJ']
Out[11]:
An even simpler approach: use NLTK's built-in nltk.Index.
In [12]:
new3 = nltk.Index((value, key) for (key, value) in old.items())
new3['JJ']
Out[12]:
Summary of common dictionary operations:

d = {}: create an empty dict
d[key] = value: assign value to key
d.keys(): return the keys of d
list(d): return the keys as a list
d.values(): return the values of d
sorted(d): return a sorted list of keys
key in d: return True if d contains key
for key in d: iterate over the keys
d1.update(d2): copy every item of d2 into d1
defaultdict(int): a dict whose default value is 0
defaultdict(list): a dict whose default value is []
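A few of these operations together (a minimal sketch; d1 and d2 are made-up example dicts):

d1 = {'fly': 'NN', 'run': 'VB'}
d2 = {'blue': 'JJ'}
d1.update(d2)   # copy every item of d2 into d1
sorted(d1)      # ['blue', 'fly', 'run']
'blue' in d1    # True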
In [13]:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
In [14]:
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nltk.FreqDist(tags).max()
Out[14]:
In [15]:
default_tagger = nltk.DefaultTagger('NN') # NN is the most frequent tag, so every unknown word is tagged NN
default_tagger.tag(nltk.word_tokenize('i like my mother and dog'))
Out[15]:
In [17]:
# as expected, accuracy is poor: only about 13% of the words really are NN
default_tagger.evaluate(brown_tagged_sents)
Out[17]:
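The 13% figure can be checked directly against the tag frequencies computed above (a quick check reusing the tags list from In [14]):

nltk.FreqDist(tags).freq('NN')  # relative frequency of 'NN', roughly 0.13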
In [19]:
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd-person singular present
    (r'.*ould$', 'MD'),                # modals: could, would, should
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                      # everything else defaults to NN
]
In [22]:
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(nltk.word_tokenize('i could be sleeping in 9 AM'))
Out[22]:
In [23]:
regexp_tagger.evaluate(brown_tagged_sents)
Out[23]:
In [25]:
unigram_tagger = nltk.UnigramTagger(brown.tagged_sents(categories='news')[:500])
In [26]:
unigram_tagger.tag(nltk.word_tokenize('i could be sleeping in 9 AM'))
Out[26]:
A unigram tagger records the most frequent tag for each word, so the larger the training data, the more accurate it becomes. But for a word it has never seen, it returns None. That is why we set a backoff: when the unigram tagger cannot decide, another tagger steps in.
In [32]:
unigram_tagger = nltk.UnigramTagger(brown.tagged_sents(categories='news')[:500],
                                    backoff=regexp_tagger)
In [33]:
unigram_tagger.evaluate(brown_tagged_sents[500:])
Out[33]:
In [38]:
unigram_tagger = nltk.UnigramTagger(brown.tagged_sents(categories='news')[:4000],
                                    backoff=regexp_tagger)
In [40]:
unigram_tagger.evaluate(brown_tagged_sents[4000:])
Out[40]:
Two key points: a larger training set makes the tagger more accurate, and evaluation must be done on sentences the tagger was not trained on.
In [41]:
bigram_tagger = nltk.BigramTagger(brown.tagged_sents(categories='news')[:4000])
In [42]:
bigram_tagger.tag(nltk.word_tokenize('i could be sleeping in 9 AM'))
Out[42]:
In [43]:
bigram_tagger = nltk.BigramTagger(brown.tagged_sents(categories='news')[:4000],
                                  backoff=unigram_tagger)
In [44]:
bigram_tagger.evaluate(brown_tagged_sents[4000:])
Out[44]:
In [46]:
from pickle import dump
output = open('t2.pkl', 'wb')
dump(bigram_tagger, output, -1)  # serialize the trained tagger; -1 selects the highest pickle protocol
output.close()
In [47]:
from pickle import load
infile = open('t2.pkl', 'rb')  # reload the saved tagger from disk
tagger = load(infile)
infile.close()
In [48]:
tagger.evaluate(brown_tagged_sents[4000:])
Out[48]:
In [55]:
brown_sents = brown.sents()
brown_tagged_sents = brown.tagged_sents(tagset='universal')
default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents[:4000], backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(brown_tagged_sents[:4000], backoff=unigram_tagger)
In [65]:
unigram_tagger.tag(nltk.word_tokenize('I like your mother'))
Out[65]:
In [66]:
test = [tag for sent in brown_sents[4000:] for (word, tag) in bigram_tagger.tag(sent)]
gold = [tag for sent in brown_tagged_sents[4000:] for (word, tag) in sent]
print(nltk.ConfusionMatrix(gold, test))
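Beyond inspecting the confusion matrix, per-tag precision and recall can be computed from the same gold and test lists (a minimal sketch; tag_scores is a hypothetical helper, not an NLTK function):

def tag_scores(gold, test, tag):
    # count true positives, false positives, and false negatives for one tag
    tp = sum(1 for g, t in zip(gold, test) if g == tag and t == tag)
    fp = sum(1 for g, t in zip(gold, test) if g != tag and t == tag)
    fn = sum(1 for g, t in zip(gold, test) if g == tag and t != tag)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

tag_scores(gold, test, 'NOUN')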