In [1]:
import nltk
nltk.corpus.gutenberg.fileids()
Out[1]:
For example, to read Jane Austen's "Emma", we can get it with the words() function.
In [4]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
emma[:10] # look at the first 10 words of the work
Out[4]:
Now let's look at some statistics for each work.
In [12]:
from nltk.corpus import gutenberg as gb
for book in gb.fileids():
    num_chars = len(gb.raw(book))
    num_words = len(gb.words(book))
    num_sents = len(gb.sents(book))
    print '{0:10d}{1:8d}{2:6d} {3}'.format(num_chars, num_words, num_sents, book)
In [19]:
from nltk.corpus import webtext as web
for f in web.fileids():
    print '{0:15}{1}'.format(f, web.words(f)[:7])
nltk.corpus.nps_chat contains 10,000 instant-messaging posts, with every name in the conversations replaced by "UserNNN". nps_chat is split into 15 file ids corresponding to different chat rooms: "20s" in a file id marks a room for people in their twenties, and "teen" marks a teenagers-only room. The date at the front of every file id is from 2006, and "???posts" in the filename tells how many posts the file contains.
In [22]:
from nltk.corpus import nps_chat as nps
for f in nps.fileids():
    print '{0:30}{1}'.format(f, nps.words(f)[:5])
In [25]:
from nltk.corpus import brown as br
br.categories()
Out[25]:
In [27]:
br.fileids(categories='news')[:5]
Out[27]:
In [28]:
br.fileids(categories='science_fiction')[:5]
Out[28]:
When using the Brown Corpus, every function can take categories=['xxx','yyy'] or fileids=['xxx','yyy'] to restrict its scope.
In [32]:
br.words(fileids=['cg22','cm04'])
Out[32]:
In [33]:
br.words(categories=['humor','lore'])
Out[33]:
The Brown Corpus is often used to study how word usage differs across categories of text. For example, suppose we want to know the relative frequencies of "can", "could", "may", "might", "must", and "will" in each category.
In [58]:
modals = ["can", "could", "may", "might", "must", "will"]
print '{0:15} {1:>6} {2:>6} {3:>6} {4:>6} {5:>6} {6:>6}'.format(
    'category', modals[0], modals[1], modals[2], modals[3], modals[4], modals[5])
for cat in br.categories():
    text = br.words(categories=cat)
    dist = nltk.FreqDist([w.lower() for w in text])
    num = len(text)
    freq = [float(dist[m]) / num * 10000 for m in modals]  # occurrences per 10,000 words
    print '{0:15} {1:6.2f} {2:6.2f} {3:6.2f} {4:6.2f} {5:6.2f} {6:6.2f}'.format(
        cat, freq[0], freq[1], freq[2], freq[3], freq[4], freq[5])
The built-in nltk.ConditionalFreqDist() can also do this. We give it all of the (genre, word) pairs; it groups the words according to conditions=genres and picks the words to display according to samples=modals. cfd['news'] returns a Counter, and most_common() sorts that Counter by frequency.
In [59]:
cfd = nltk.ConditionalFreqDist((genre, word) for genre in br.categories()
                               for word in br.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
In [79]:
cfd.tabulate(conditions=genres, samples=['love', 'hate', 'problems', 'forever'])
In [68]:
cfd['hobbies'].most_common()[:5]
Out[68]:
In [84]:
from nltk.corpus import reuters as rt
rt.fileids(['wheat', 'corn'])[:3]
Out[84]:
In [85]:
rt.categories([u'test/14832', u'test/14841', u'test/14858'])
Out[85]:
In [89]:
from nltk.corpus import inaugural as ina
ina.sents('2009-Obama.txt')
Out[89]:
In [122]:
cfd = nltk.ConditionalFreqDist((target, f[:4])  # f[:4] is the year of the address
                               for f in ina.fileids() for w in ina.words(f)
                               for target in ['america', 'right'] if w.lower().startswith(target))
In [123]:
%matplotlib inline
cfd.plot()
Common corpus reader methods:
fileids(): list of the files in the corpus
fileids([categories]): list of the files in the given categories
categories(): list of the categories
categories([fileids]): the categories that the given files belong to
raw(): the raw text
raw(fileids=[f1,f2,f3]): raw text of the given files
raw(categories=[c1,c2]): raw text of the given categories
words(): list of words
words(fileids=[f1,f2,f3]): words of the given files
words(categories=[c1,c2]): words of the given categories
sents(): list of sentences
sents(fileids=[f1,f2,f3]): sentences of the given files
sents(categories=[c1,c2]): sentences of the given categories
abspath(fileid): the location of the file on disk
encoding(fileid): the encoding of the file (if known)
open(fileid): open the file as a stream
root: the root directory of the corpus
readme(): information about the corpus
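As a quick sketch of how these methods look in practice (an illustrative example using the built-in gutenberg corpus; the paths, encoding, and README contents depend on your local NLTK data installation):
from nltk.corpus import gutenberg

gutenberg.fileids()[:3]                # first few file ids
gutenberg.abspath('austen-emma.txt')   # where the file lives on disk
gutenberg.encoding('austen-emma.txt')  # the file's encoding, if known
gutenberg.raw('austen-emma.txt')[:50]  # first 50 characters of the raw text
gutenberg.readme()[:60]                # beginning of the corpus README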
If you have plain text files that you want to treat as a corpus, you can use nltk.corpus.PlaintextCorpusReader.
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
If you have data in Penn Treebank bracketed parse format, you can use nltk.corpus.BracketParseCorpusReader.
>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
>>> file_pattern = r".*/wsj_.*\.mrg"
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']
In [11]:
import nltk
from nltk.corpus import gutenberg as gut
emma = gut.words(gut.fileids()[0])           # 'austen-emma.txt'
freq_uni = nltk.FreqDist(emma)               # unigram (single-word) frequencies
freq_bi = nltk.FreqDist(nltk.bigrams(emma))  # bigram (word-pair) frequencies
In [20]:
freq_uni.most_common()[:5]
Out[20]:
In [15]:
freq_bi.most_common()[:5]
Out[15]:
The main ConditionalFreqDist methods:
cfd = ConditionalFreqDist(pairs): build a conditional frequency distribution from pairs=(c,s)
cfd.conditions(): list every possible value of pairs[0]
cfd[cond]: the distribution conditioned on cond, i.e. all pairs with pairs[0]==cond
cfd[cond][sample]: the count of pairs with pairs[0]==cond and pairs[1]==sample
cfd.tabulate(conditions=cond, samples=samp): draw the counts for several conds and samples as a table
cfd.plot(): plot the frequency distributions
cfd.plot(samples, conditions): plot the frequency distributions for the given samples and conditions
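As a small, self-contained sketch of these methods (assuming the Brown Corpus is installed; the exact counts depend on your corpus version):
import nltk
from nltk.corpus import brown

# build a (genre, word) ConditionalFreqDist for just two genres
cfd = nltk.ConditionalFreqDist((genre, word.lower())
                               for genre in ['news', 'romance']
                               for word in brown.words(categories=genre))
cfd.conditions()                                 # the two genres used as conditions
cfd['news']                                      # FreqDist of all words in the news genre
cfd['news']['will']                              # how many times 'will' appears in news
cfd.tabulate(samples=['will', 'love', 'could'])  # counts as a table, one row per condition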
In [19]:
vocabulary = set(w.lower() for w in nltk.corpus.words.words())
austen = set(w.lower() for w in nltk.corpus.gutenberg.words('austen-sense.txt'))
list(austen.difference(vocabulary))[:10]
Out[19]:
The comparison above picks out words that do not appear in the standard English word list (unusual or misspelled words). A second kind of analysis is to find stopwords: words so common that they carry little analytical value, so we usually try to remove them.
In [22]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]
Out[22]:
In [32]:
stop = stopwords.words('english')
austen = nltk.corpus.gutenberg.words('austen-sense.txt')
aus = [w for w in austen if w.lower() not in stop]
nltk.FreqDist(aus).most_common()[:10]
Out[32]:
NLTK includes the CMU Pronouncing Dictionary for US English, which was designed for speech-synthesis programs. The phonetic symbols it uses are documented at: https://en.wikipedia.org/wiki/Arpabet
In [34]:
entry = nltk.corpus.cmudict.entries()
len(entry)
Out[34]:
In [36]:
entry[10000:10010]
Out[36]:
In [43]:
[w for w,pron in entry if pron[-5:] == [u'V', u'IH2', u'ZH', u'AH0', u'N']]
Out[43]:
The digits in the pronunciations mark stress: 1 is primary stress, 2 is secondary stress, and 0 is unstressed.
In [53]:
def stress(pron):
    return [c[-1] for c in pron if c[-1].isdigit()]
[w for w, pron in entry if stress(pron) == ['0','1','0','2','0','0']]
Out[53]:
Besides the tuple format, the data is also available as a dict.
In [54]:
pdict = nltk.corpus.cmudict.dict()
pdict['fire']
Out[54]:
If you come across a word that is not in the dictionary, you can add it by hand, but the addition is not saved; the word will still be missing the next time the dictionary is loaded.
In [55]:
pdict['blog']
In [56]:
pdict['blog'] = [['B','L','AA1','G']]
In [57]:
pdict['blog']
Out[57]:
In [60]:
from nltk.corpus import swadesh
swadesh.fileids()
Out[60]:
In [62]:
swadesh.words('it')[:5]
Out[62]:
In [63]:
swadesh.words('en')[:5]
Out[63]:
In [67]:
swadesh.entries(['fr','en'])[:5] # parallel entries for the two languages
Out[67]:
In [68]:
fr2en = dict(swadesh.entries(['fr','en']))
fr2en['nous']
Out[68]:
In [75]:
from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')[0]
Out[75]:
In [77]:
from nltk.corpus import wordnet as wn
wn.synsets('sleep')
# 'sleep' has 6 different senses, spread across 6 synsets
Out[77]:
In [87]:
# each synset may contain more than one lemma
[syn.lemma_names() for syn in wn.synsets('sleep')]
Out[87]:
In [90]:
# look up the meaning that each synset represents
[syn.definition() for syn in wn.synsets('sleep')]
Out[90]:
In [91]:
# example sentences for each synset
[syn.examples() for syn in wn.synsets('sleep')]
Out[91]:
In [118]:
# find hypernyms
syn = wn.synset('sleep.n.01')
syn.hypernym_paths()
Out[118]:
Logical relation: entailment. If action A is made up of actions B, C, D, ..., then we say that A entails B, C, D, ...
In [121]:
wn.synset('eat.v.01').entailments()
Out[121]:
To find words of a particular part of speech among the synsets, use the second argument.
In [123]:
wn.synsets('sleep', wn.NOUN)
Out[123]:
Antonymy is the relation between opposite words.
In [134]:
lemma = wn.synsets('vertical')[2].lemmas()[0]
lemma.antonyms()
Out[134]:
To measure how close two synsets are, we can look at how high up the tree their common ancestor sits. If the common ancestor is entity, they are practically unrelated, because entity is already the root. For example, let's find the closest pair among the senses of 'left' and 'right'.
In [170]:
l = wn.synsets('left', wn.NOUN)
r = wn.synsets('right', wn.NOUN)
for left in l:
    for right in r:
        ancestor = left.common_hypernyms(right)[0]
        sim = left.path_similarity(right)
        if sim > 0.1:
            print '{0:18}{1:18}{2:25}{3}'.format(left.name(), right.name(), ancestor.name(), sim)