Chapter 2: Accessing Text Corpora and Lexical Resources

2.1 Text Corpora

Gutenberg Corpus

Project Gutenberg is a volunteer effort to digitize works of literature. It is named after Gutenberg, the inventor of the movable-type printing press in Europe, to signify the free flow of knowledge. The Gutenberg Corpus contains a small selection of these works, which can be listed with nltk.corpus.gutenberg.fileids().


In [1]:
import nltk
nltk.corpus.gutenberg.fileids()


Out[1]:
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

For example, to read Jane Austen's "Emma", we can fetch its words with the words() function.


In [4]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
emma[:10]  # the first 10 tokens of the work


Out[4]:
[u'[',
 u'Emma',
 u'by',
 u'Jane',
 u'Austen',
 u'1816',
 u']',
 u'VOLUME',
 u'I',
 u'CHAPTER']

Now let's look at some statistics for each work: the number of characters, words, and sentences.


In [12]:
from nltk.corpus import gutenberg as gb
for book in gb.fileids():
    num_chars = len(gb.raw(book))
    num_words = len(gb.words(book))
    num_sents = len(gb.sents(book))
    print '{0:10d}{1:8d}{2:6d} {3}'.format(num_chars, num_words, num_sents, book)


    887071  192427  7752 austen-emma.txt
    466292   98171  3747 austen-persuasion.txt
    673022  141576  4999 austen-sense.txt
   4332554 1010654 30103 bible-kjv.txt
     38153    8354   438 blake-poems.txt
    249439   55563  2863 bryant-stories.txt
     84663   18963  1054 burgess-busterbrown.txt
    144395   34110  1703 carroll-alice.txt
    457450   97004  4779 chesterton-ball.txt
    406629   86063  3806 chesterton-brown.txt
    320525   69213  3742 chesterton-thursday.txt
    935158  210663 10230 edgeworth-parents.txt
   1242990  260819 10059 melville-moby_dick.txt
    468220   96825  1851 milton-paradise.txt
    112310   25837  2163 shakespeare-caesar.txt
    162881   37360  3106 shakespeare-hamlet.txt
    100351   23140  1907 shakespeare-macbeth.txt
    711215  154883  4250 whitman-leaves.txt

Web and Chat Text

nltk.corpus.webtext contains content from a Firefox discussion forum, the script of Pirates of the Caribbean, conversations overheard in public, and more.


In [19]:
from nltk.corpus import webtext as web
for f in web.fileids():
    print '{0:15}{1}'.format(f, web.words(f)[:7])


firefox.txt    [u'Cookie', u'Manager', u':', u'"', u'Don', u"'", u't']
grail.txt      [u'SCENE', u'1', u':', u'[', u'wind', u']', u'[']
overheard.txt  [u'White', u'guy', u':', u'So', u',', u'do', u'you']
pirates.txt    [u'PIRATES', u'OF', u'THE', u'CARRIBEAN', u':', u'DEAD', u'MAN']
singles.txt    [u'25', u'SEXY', u'MALE', u',', u'seeks', u'attrac', u'older']
wine.txt       [u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine', u'.']

nltk.corpus.nps_chat contains over 10,000 instant-messaging posts in which every participant's name has been replaced with "UserNNN". The corpus is divided into 15 file ids corresponding to different chat rooms: "20s" in the middle of a file id marks a room for people in their twenties, and "teen" marks a teens-only room. The dates at the start of the file ids are all from 2006, and the "???posts" part of the filename tells how many posts the file contains.


In [22]:
from nltk.corpus import nps_chat as nps
for f in nps.fileids():
    print '{0:30}{1}'.format(f, nps.words(f)[:5])


10-19-20s_706posts.xml        [u'now', u'im', u'left', u'with', u'this']
10-19-30s_705posts.xml        [u'U11', u'lol', u'lol', u'U11', u'wb']
10-19-40s_686posts.xml        [u'hi', u'U23', u'love', u'me', u'like']
10-19-adults_706posts.xml     [u'Hello', u'U24', u',', u'welcome', u'to']
10-24-40s_706posts.xml        [u'tc', u'U9', u'Tell', u'me', u'why']
10-26-teens_706posts.xml      [u'I', u'have', u'a', u'problem', u'with']
11-06-adults_706posts.xml     [u'Hello', u'U57', u',', u'welcome', u'to']
11-08-20s_705posts.xml        [u'U110', u',', u'i', u"'ll", u'stalk']
11-08-40s_706posts.xml        [u'lol', u'U2', u'me', u'too', u'U7']
11-08-adults_705posts.xml     [u'JOIN', u'U13', u',', u'welcome', u'to']
11-08-teens_706posts.xml      [u'JOIN', u'JOIN', u'.', u'wz', u'73042']
11-09-20s_706posts.xml        [u'LoL', u'im', u'like', u'five', u'seconds']
11-09-40s_706posts.xml        [u'lol', u'h', u'U17', u'U4', u'..']
11-09-adults_706posts.xml     [u'wisconsin', u'?', u'.', u'ACTION', u'yawns']
11-09-teens_706posts.xml      [u'PART', u'PART', u'PART', u'sup', u'yoll']

Brown Corpus

The Brown Corpus was created at Brown University in 1961 and contains text from 500 sources, categorized by genre such as news, editorial, and so on. The category can be inferred from the file id: ids starting with "ca" are news, "cd" are religion, and "cm" are science fiction.


In [25]:
from nltk.corpus import brown as br
br.categories()


Out[25]:
[u'adventure',
 u'belles_lettres',
 u'editorial',
 u'fiction',
 u'government',
 u'hobbies',
 u'humor',
 u'learned',
 u'lore',
 u'mystery',
 u'news',
 u'religion',
 u'reviews',
 u'romance',
 u'science_fiction']

In [27]:
br.fileids(categories='news')[:5]


Out[27]:
[u'ca01', u'ca02', u'ca03', u'ca04', u'ca05']

In [28]:
br.fileids(categories='science_fiction')[:5]


Out[28]:
[u'cm01', u'cm02', u'cm03', u'cm04', u'cm05']

When using the Brown Corpus, every function accepts categories=['xxx','yyy'] or fileids=['xxx','yyy'] to restrict what is read.


In [32]:
br.words(fileids=['cg22','cm04'])


Out[32]:
[u'Does', u'our', u'society', u'have', u'a', ...]

In [33]:
br.words(categories=['humor','lore'])


Out[33]:
[u'In', u'American', u'romance', u',', u'almost', ...]

The Brown Corpus is often used to study how word usage differs across genres. For example, suppose we want to know how often "can", "could", "may", "might", "must", and "will" occur in each genre (the code below reports counts per 10,000 words).


In [58]:
modals = ["can", "could", "may", "might", "must", "will"]
print '{0:15} {1:>6} {2:>6} {3:>6} {4:>6} {5:>6} {6:>6}'.format \
            ('category', modals[0], modals[1], modals[2], modals[3], modals[4], modals[5])
for cat in br.categories():
    text = br.words(categories=cat)
    dist = nltk.FreqDist([w.lower() for w in text])
    num = len(text)
    freq = [float(dist[m]) / num * 10000 for m in modals]
    print '{0:15} {1:6.2f} {2:6.2f} {3:6.2f} {4:6.2f} {5:6.2f} {6:6.2f}'.format \
            (cat, freq[0], freq[1], freq[2], freq[3], freq[4], freq[5])


category           can  could    may  might   must   will
adventure         6.92  22.21   1.01   8.51   3.89   7.35
belles_lettres   14.39  12.48  12.77   6.53   9.88  14.21
editorial        20.13   9.25  12.82   6.33   8.93  38.15
fiction           5.69  24.53   1.46   6.42   8.03   8.18
government       16.97   5.42  25.53   1.85  14.55  34.80
hobbies          33.52   7.16  17.37   2.67  10.20  32.67
humor             7.84  15.21   3.69   3.69   4.15   5.99
learned          20.18   8.74  18.47   7.04  11.16  18.69
lore             15.41  12.87  15.41   4.53   8.70  16.14
mystery           7.87  25.36   2.62   9.97   5.42   4.37
news              9.35   8.65   9.25   3.78   5.27  38.69
religion         21.32  14.97  20.05   3.05  13.71  18.27
reviews          11.06   9.83  11.55   6.39   4.67  14.99
romance          11.28  27.85   1.57   7.28   6.57   7.00
science_fiction  11.06  33.86   2.76   8.29   5.53  11.75

The built-in nltk.ConditionalFreqDist() can do the same thing. We pass it every (genre, word) pair; it groups the words by genre, and tabulate() then selects the rows and columns to display with conditions=genres and samples=modals.

cfd['news'] returns a FreqDist (a subclass of Counter), and most_common() sorts its entries by frequency.


In [59]:
cfd = nltk.ConditionalFreqDist((genre, word) for genre in br.categories()
                               for word in br.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)


                 can could  may might must will 
           news   93   86   66   38   50  389 
       religion   82   59   78   12   54   71 
        hobbies  268   58  131   22   83  264 
science_fiction   16   49    4   12    8   16 
        romance   74  193   11   51   45   43 
          humor   16   30    8    8    9   13 

In [79]:
cfd.tabulate(conditions=genres, samples=['love', 'hate', 'problems', 'forever'])


                love hate problems forever 
           news    3    1   19    1 
       religion   13    3    3    3 
        hobbies    6    0   26    1 
science_fiction    3    0    0    3 
        romance   32    9    2   10 
          humor    4    0    4    3 

In [68]:
cfd['hobbies'].most_common()[:5]


Out[68]:
[(u'the', 4300), (u',', 3849), (u'.', 3453), (u'of', 2390), (u'and', 2144)]

Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words, organized into 90 topics. To make machine learning convenient, every file id starts with either "training/" or "test/"; for example, "test/14826" is a document in the test set.
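
As a minimal sketch (assuming the reuters corpus has been downloaded), the file ids can be split by prefix into the two sets:

from nltk.corpus import reuters
train_ids = [f for f in reuters.fileids() if f.startswith('training/')]
test_ids = [f for f in reuters.fileids() if f.startswith('test/')]
print len(train_ids), len(test_ids)  # the two counts sum to 10788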

Unlike the Brown Corpus, the Reuters categories overlap: a single article can belong to several categories at once. An article's categories can be looked up with categories(fileids), and the file ids for a category with fileids(categories).


In [84]:
from nltk.corpus import reuters as rt
rt.fileids(['wheat', 'corn'])[:3]


Out[84]:
[u'test/14832', u'test/14841', u'test/14858']

In [85]:
rt.categories([u'test/14832', u'test/14841', u'test/14858'])


Out[85]:
[u'carcass',
 u'corn',
 u'grain',
 u'livestock',
 u'oilseed',
 u'rice',
 u'rubber',
 u'soybean',
 u'sugar',
 u'tin',
 u'trade',
 u'wheat']

Inaugural Address Corpus

This corpus contains the inaugural addresses of U.S. presidents. Each filename includes the year and the president's name, e.g. "2009-Obama.txt".


In [89]:
from nltk.corpus import inaugural as ina
ina.sents('2009-Obama.txt')


Out[89]:
[[u'My', u'fellow', u'citizens', u':'], [u'I', u'stand', u'here', u'today', u'humbled', u'by', u'the', u'task', u'before', u'us', u',', u'grateful', u'for', u'the', u'trust', u'you', u'have', u'bestowed', u',', u'mindful', u'of', u'the', u'sacrifices', u'borne', u'by', u'our', u'ancestors', u'.'], ...]

In [122]:
# For each address, count the words beginning with 'america' or 'right',
# grouped by the year taken from the first four characters of the file id
cfd = nltk.ConditionalFreqDist((target, f[:4]) for f in ina.fileids() for w in ina.words(f)
                        for target in ['america', 'right'] if w.lower().startswith(target))

In [123]:
%matplotlib inline
cfd.plot()


APIs Common to All Corpora

  • fileids(): the files of the corpus
  • fileids([categories]): the files belonging to the given categories
  • categories(): the categories of the corpus
  • categories([fileids]): the categories the given files belong to
  • raw(): the raw content of the corpus
  • raw(fileids=[f1,f2,f3]): the raw content of the given files
  • raw(categories=[c1,c2]): the raw content of the given categories
  • words(): the words of the whole corpus, as a list
  • words(fileids=[f1,f2,f3]): the words of the given files
  • words(categories=[c1,c2]): the words of the given categories
  • sents(): the sentences of the whole corpus
  • sents(fileids=[f1,f2,f3]): the sentences of the given files
  • sents(categories=[c1,c2]): the sentences of the given categories
  • abspath(fileid): the location of the given file on disk
  • encoding(fileid): the encoding of the file (if known)
  • open(fileid): open the given corpus file as a stream
  • root: the path to the root of the locally installed corpus (a property, not a method)
  • readme(): the contents of the corpus README file
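
A quick sketch exercising a few of these shared methods on the Gutenberg corpus loaded earlier; the exact path and encoding depend on your local installation:

from nltk.corpus import gutenberg
print gutenberg.readme()[:60]                # beginning of the corpus README
print gutenberg.abspath('austen-emma.txt')   # where the file lives on disk
print gutenberg.encoding('austen-emma.txt')  # file encoding, if known
print len(gutenberg.raw('austen-emma.txt'))  # 887071, matching the table above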

Loading Your Own Corpus

If you have plain-text files that you want to treat as a corpus, you can use nltk.corpus.PlaintextCorpusReader.

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict' 
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') 
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

If you have bracketed parse trees in Penn Treebank format, you can use nltk.corpus.BracketParseCorpusReader.

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj" 
>>> file_pattern = r".*/wsj_.*\.mrg" 
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()
['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']

2.2 Conditional Frequency Distributions

To build a conditional frequency distribution, we convert the text into (condition, event) pairs. In the Brown Corpus, for example, there are 15 conditions (one per category) and 1,161,192 events (one per word token).

FreqDist() is a plain frequency distribution and takes a list as its argument; ConditionalFreqDist() is a conditional frequency distribution and takes a list of pairs.


In [11]:
import nltk
from nltk.corpus import gutenberg as gut
emma = gut.words(gut.fileids()[0])
freq_uni = nltk.FreqDist(emma)
freq_bi = nltk.FreqDist(nltk.bigrams(emma))

In [20]:
freq_uni.most_common()[:5]


Out[20]:
[(u',', 11454), (u'.', 6928), (u'to', 5183), (u'the', 4844), (u'and', 4672)]

In [15]:
freq_bi.most_common()[:5]


Out[15]:
[((u',', u'and'), 1879),
 ((u'Mr', u'.'), 1153),
 ((u"'", u's'), 932),
 ((u';', u'and'), 866),
 ((u'."', u'"'), 757)]

Common ConditionalFreqDist Methods

  • cfd = ConditionalFreqDist(pairs): build a conditional frequency distribution from pairs = (condition, sample)
  • cfd.conditions(): list all the conditions (every distinct pairs[0])
  • cfd[cond]: the frequency distribution for this condition, i.e. the samples whose pairs[0] == cond
  • cfd[cond][sample]: the count of pairs with pairs[0] == cond and pairs[1] == sample
  • cfd.tabulate(conditions=cond, samples=samp): tabulate the counts for the given conditions and samples
  • cfd.plot(): plot the frequency distributions
  • cfd.plot(samples, conditions): plot only the given samples and conditions
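
A brief sketch of these methods, rebuilding a small conditional distribution over two Brown genres (the counts follow from the tabulate output shown earlier):

import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist((genre, word)
                               for genre in ['news', 'romance']
                               for word in brown.words(categories=genre))
print cfd.conditions()            # the two conditions: 'news' and 'romance'
print cfd['news']['will']         # 389, as in the table above
print cfd['news'].most_common(3)  # most frequent tokens in the news genre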

Lexical Resources

Lexical resources provide additional information attached to words or sentences, such as a word's part of speech, sense, or frequency.

Wordlist Corpora

The first kind of analysis compares a text against a wordlist to find the words that do not appear in the wordlist.


In [19]:
vocabulary = set(w.lower() for w in nltk.corpus.words.words())
austen = set(w.lower() for w in nltk.corpus.gutenberg.words('austen-sense.txt'))
list(austen.difference(vocabulary))[:10]


Out[19]:
[u'legacies',
 u'saves',
 u'woods',
 u'hating',
 u'consists',
 u'oldest',
 u'assembled',
 u'sashes',
 u'patches',
 u'sweetest']

The second kind of analysis uses stopwords: words so common that they carry little value for analysis, so they are usually filtered out.


In [22]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]


Out[22]:
[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your']

In [32]:
stop = stopwords.words('english')
austen = nltk.corpus.gutenberg.words('austen-sense.txt')
aus = [w for w in austen if w.lower() not in stop]
nltk.FreqDist(aus).most_common()[:10]


Out[32]:
[(u',', 9397),
 (u'.', 3975),
 (u'"', 1506),
 (u';', 1419),
 (u"'", 883),
 (u'."', 721),
 (u'Elinor', 684),
 (u'could', 568),
 (u'Marianne', 566),
 (u'Mrs', 530)]

A Pronouncing Dictionary

NLTK includes the CMU Pronouncing Dictionary for US English, which was designed for speech synthesis programs. The phonetic symbols it uses are documented at: https://en.wikipedia.org/wiki/Arpabet


In [34]:
entry = nltk.corpus.cmudict.entries()
len(entry)


Out[34]:
133737

In [36]:
entry[10000:10010]


Out[36]:
[(u'belford', [u'B', u'EH1', u'L', u'F', u'ER0', u'D']),
 (u'belfry', [u'B', u'EH1', u'L', u'F', u'R', u'IY0']),
 (u'belgacom', [u'B', u'EH1', u'L', u'G', u'AH0', u'K', u'AA0', u'M']),
 (u'belgacom', [u'B', u'EH1', u'L', u'JH', u'AH0', u'K', u'AA0', u'M']),
 (u'belgard', [u'B', u'EH0', u'L', u'G', u'AA1', u'R', u'D']),
 (u'belgarde', [u'B', u'EH0', u'L', u'G', u'AA1', u'R', u'D', u'IY0']),
 (u'belge', [u'B', u'EH1', u'L', u'JH', u'IY0']),
 (u'belger', [u'B', u'EH1', u'L', u'G', u'ER0']),
 (u'belgian', [u'B', u'EH1', u'L', u'JH', u'AH0', u'N']),
 (u'belgians', [u'B', u'EH1', u'L', u'JH', u'AH0', u'N', u'Z'])]

In [43]:
[w for w,pron in entry if pron[-5:] == [u'V', u'IH2', u'ZH', u'AH0', u'N']]


Out[43]:
[u'activision',
 u'cablevision',
 u'computervision',
 u'coopervision',
 u'exploravision',
 u'macrovision',
 u'spectravision',
 u'subdivision',
 u'television',
 u'valuevision',
 u'worldvision']

The digits in the pronunciations mark stress: 1 is primary stress, 2 is secondary stress, and 0 is no stress.


In [53]:
def stress(pron):
    return [c[-1] for c in pron if c[-1].isdigit()]
[w for w, pron in entry if stress(pron) == ['0','1','0','2','0','0']]


Out[53]:
[u'accumulatively',
 u'appreciatively',
 u'environmentalists',
 u'environmentalists',
 u'environmentalists',
 u'environmentalists',
 u'identifiable',
 u'identifiable',
 u'irreconcilable',
 u'unhesitatingly',
 u'unnecessarily',
 u'unprecedentedly']

Besides the list-of-tuples format, the data is also available as a dict.


In [54]:
pdict = nltk.corpus.cmudict.dict()
pdict['fire']


Out[54]:
[[u'F', u'AY1', u'ER0'], [u'F', u'AY1', u'R']]

If a word is missing from the dictionary, you can add it manually, but the addition is not persisted; the next time the dictionary is loaded the word will be missing again.


In [55]:
pdict['blog']


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-55-482d977dfcbf> in <module>()
----> 1 pdict['blog']

KeyError: 'blog'

In [56]:
pdict['blog'] = [['B','L','AA1','G']]

In [57]:
pdict['blog']


Out[57]:
[['B', 'L', 'AA1', 'G']]

Comparative Wordlists

The comparative wordlists, also known as Swadesh wordlists, give about 200 common words in a number of languages.


In [60]:
from nltk.corpus import swadesh
swadesh.fileids()


Out[60]:
[u'be',
 u'bg',
 u'bs',
 u'ca',
 u'cs',
 u'cu',
 u'de',
 u'en',
 u'es',
 u'fr',
 u'hr',
 u'it',
 u'la',
 u'mk',
 u'nl',
 u'pl',
 u'pt',
 u'ro',
 u'ru',
 u'sk',
 u'sl',
 u'sr',
 u'sw',
 u'uk']

In [62]:
swadesh.words('it')[:5]


Out[62]:
[u'io', u'tu, Lei', u'lui, egli', u'noi', u'voi']

In [63]:
swadesh.words('en')[:5]


Out[63]:
[u'I', u'you (singular), thou', u'he', u'we', u'you (plural)']

In [67]:
swadesh.entries(['fr','en'])[:5]  # aligned word pairs for two languages


Out[67]:
[(u'je', u'I'),
 (u'tu, vous', u'you (singular), thou'),
 (u'il', u'he'),
 (u'nous', u'we'),
 (u'vous', u'you (plural)')]

In [68]:
fr2en = dict(swadesh.entries(['fr','en']))
fr2en['nous']


Out[68]:
u'we'

Shoebox and Toolbox Lexicons

Toolbox (formerly known as Shoebox) is one of the tools most commonly used by linguists; it is covered in more detail in Chapter 11.


In [75]:
from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')[0]


Out[75]:
(u'kaa',
 [(u'ps', u'V'),
  (u'pt', u'A'),
  (u'ge', u'gag'),
  (u'tkp', u'nek i pas'),
  (u'dcsv', u'true'),
  (u'vx', u'1'),
  (u'sc', u'???'),
  (u'dt', u'29/Oct/2005'),
  (u'ex', u'Apoka ira kaaroi aioa-ia reoreopaoro.'),
  (u'xp', u'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
  (u'xe', u'Apoka is gagging from food while talking.')])

WordNet

WordNet is a semantic dictionary; you can think of it as a dictionary designed to be read by machines. When two words share a meaning, such as motorcar and automobile, they are called synonyms, and together they form a synset. A synset always stands for exactly one sense, but it can contain many words.


In [77]:
from nltk.corpus import wordnet as wn
wn.synsets('sleep')
# 'sleep' has six different senses, one synset per sense


Out[77]:
[Synset('sleep.n.01'),
 Synset('sleep.n.02'),
 Synset('sleep.n.03'),
 Synset('rest.n.05'),
 Synset('sleep.v.01'),
 Synset('sleep.v.02')]

In [87]:
# each synset may contain several lemmas (words)
[syn.lemma_names() for syn in wn.synsets('sleep')]


Out[87]:
[[u'sleep', u'slumber'],
 [u'sleep', u'sopor'],
 [u'sleep', u'nap'],
 [u'rest', u'eternal_rest', u'sleep', u'eternal_sleep', u'quietus'],
 [u'sleep', u'kip', u'slumber', u"log_Z's", u"catch_some_Z's"],
 [u'sleep']]

In [90]:
# the definition of each synset
[syn.definition() for syn in wn.synsets('sleep')]


Out[90]:
[u'a natural and periodic state of rest during which consciousness of the world is suspended',
 u'a torpid state resembling deep sleep',
 u'a period of time spent sleeping',
 u'euphemisms for death (based on an analogy between lying in a bed and in a tomb)',
 u'be asleep',
 u'be able to accommodate for sleeping']

In [91]:
# example sentences for each synset
[syn.examples() for syn in wn.synsets('sleep')]


Out[91]:
[[u"he didn't get enough sleep last night",
  u'calm as a child in dreamless slumber'],
 [],
 [u'he felt better after a little sleep', u"there wasn't time for a nap"],
 [u'she was laid to rest beside her husband',
  u'they had to put their family pet to sleep'],
 [],
 [u'This tent sleeps six people']]

WordNet Hierarchy

A distinctive feature of WordNet is that all senses are organized into a tree whose nodes are synsets. The descendants of a synset are its hyponyms (more specific terms), and its ancestors are its hypernyms (more general terms).


In [118]:
# the hypernym path(s) from the root (entity) down to this synset
syn = wn.synset('sleep.n.01')
syn.hypernym_paths()


Out[118]:
[[Synset('entity.n.01'),
  Synset('abstraction.n.06'),
  Synset('attribute.n.02'),
  Synset('state.n.02'),
  Synset('condition.n.01'),
  Synset('physical_condition.n.01'),
  Synset('sleep.n.01')]]
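
hypernym_paths() walks all the way to the root; hyponyms() and hypernyms() move a single level at a time. A minimal sketch (the exact hyponym list depends on the installed WordNet version):

from nltk.corpus import wordnet as wn
syn = wn.synset('sleep.n.01')
print syn.hypernyms()  # one step up the tree: [Synset('physical_condition.n.01')]
print syn.hyponyms()   # the more specific synsets below sleep.n.01, if any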

Logical relation: entailment. If performing action A involves performing B, C, D, ..., we say that A entails B, C, D, ...


In [121]:
wn.synset('eat.v.01').entailments()


Out[121]:
[Synset('chew.v.01'), Synset('swallow.v.01')]

To restrict synsets() to a particular part of speech, pass the part of speech as the second argument.


In [123]:
wn.synsets('sleep', wn.NOUN)


Out[123]:
[Synset('sleep.n.01'),
 Synset('sleep.n.02'),
 Synset('sleep.n.03'),
 Synset('rest.n.05')]

Antonymy is the relation between words with opposite meanings; in WordNet, antonyms are attached to lemmas rather than to synsets, which is why the example below goes through lemmas().


In [134]:
lemma = wn.synsets('vertical')[2].lemmas()[0]
lemma.antonyms()


Out[134]:
[Lemma('horizontal.a.01.horizontal'), Lemma('inclined.a.02.inclined')]

To measure how closely related two synsets are, we can look at how high up the tree their lowest common ancestor sits. If the common ancestor is entity, the two senses are barely related at all, because entity is the root. path_similarity() gives a complementary score between 0 and 1 based on the length of the path connecting the two senses. For example, let's find the most closely related senses of left and right.


In [170]:
l = wn.synsets('left', wn.NOUN)
r = wn.synsets('right', wn.NOUN)
for left in l:
    for right in r:
        ancestor = left.common_hypernyms(right)[0]
        sim = left.path_similarity(right)
        if sim > 0.1:
            print '{0:18}{1:18}{2:25}{3}'.format(left.name(), right.name(), ancestor.name(), sim)


left.n.01         right.n.02        position.n.01            0.333333333333
left.n.01         right_field.n.01  object.n.01              0.125
left.n.02         right.n.04        clique.n.01              0.333333333333
left.n.03         right.n.05        physical_entity.n.01     0.333333333333
left_field.n.01   right.n.02        object.n.01              0.125
left_field.n.01   right_field.n.01  region.n.03              0.333333333333
left.n.05         right.n.06        change_of_direction.n.01 0.333333333333
