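Questions

  • What are lexical categories, and how are they used?
  • Which Python data structure is suited to storing words and their categories?
  • How can words be tagged with their part of speech automatically?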

In [9]:
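import nltk
# word_tokenize splits a sentence into a list of word and punctuation tokens
from nltk.tokenize import word_tokenize
# pos_tag assigns a part-of-speech tag to each token; map_tag converts between tagsets
from nltk.tag import pos_tag, map_tag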

Meaning of the tags

  • CC: coordinating conjunction
  • RB: adverb
  • IN: preposition or subordinating conjunction
  • NN: noun (singular or mass)
  • JJ: adjective

In [6]:
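# Split the sentence into tokens; word_tokenize returns a list of strings
text = word_tokenize("And now for something completely different")
print(text)
print(type(text))
# Tag each token with its Penn Treebank part-of-speech tag
nltk.pos_tag(text)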


['And', 'now', 'for', 'something', 'completely', 'different']
<class 'list'>
Out[6]:
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
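
One of the opening questions asks which Python data structure is suited to storing words and their categories. A minimal sketch, assuming a dictionary is sufficient for this purpose: the (word, tag) list from Out[6] is copied in as a literal so the cell runs without NLTK, and the names `tag_of` and `words_by_tag` are chosen here for illustration only.

```python
from collections import defaultdict

# Tagged output copied from Out[6], so this sketch needs no NLTK data.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

# Word -> tag lookup (assumes each word carries a single tag in this sentence).
tag_of = dict(tagged)

# Tag -> list of words, kept in sentence order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(tag_of['something'])   # NN
print(words_by_tag['RB'])    # ['now', 'completely']
```

The plain `dict` answers "what is this word's tag?", while the `defaultdict(list)` answers "which words have this tag?". For a larger corpus, where one word can take several tags, a mapping from word to a list or counter of tags may be the better fit.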

In [14]:
# tagset='universal' reports the coarser universal tags instead of Penn Treebank tags
pos_tag(word_tokenize("John's big idea is not all that bad."), tagset='universal')


Out[14]:
[('John', 'NOUN'),
 ("'s", 'PRT'),
 ('big', 'ADJ'),
 ('idea', 'NOUN'),
 ('is', 'VERB'),
 ('not', 'ADV'),
 ('all', 'DET'),
 ('that', 'ADP'),
 ('bad', 'ADJ'),
 ('.', '.')]
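
The universal tags in Out[14] are coarser than the Penn Treebank tags in Out[6]. The sketch below shows roughly how the two relate, using a hand-written table that covers only the five tags listed earlier. `PTB_TO_UNIVERSAL` is an illustrative assumption, not NLTK's own mapping table; the imported `map_tag` function is the library's route to the full mapping.

```python
# Hand-written subset of the Penn Treebank -> universal correspondence,
# limited to the five tags described in this notebook (illustrative only).
PTB_TO_UNIVERSAL = {
    'CC': 'CONJ',
    'RB': 'ADV',
    'IN': 'ADP',
    'NN': 'NOUN',
    'JJ': 'ADJ',
}

# Penn Treebank output copied from Out[6].
ptb_tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
              ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

# Tags missing from the subset fall back to 'X', the universal "other" tag.
universal_tagged = [(word, PTB_TO_UNIVERSAL.get(tag, 'X'))
                    for word, tag in ptb_tagged]
print(universal_tagged)
```

Because several Penn Treebank tags collapse into one universal tag, the conversion only goes one way: the original fine-grained tag cannot be recovered from the universal one.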
