Natural Language Toolkit, also known as NLTK – a set of libraries and programs for symbolic and statistical natural language processing. NLTK comes with graphical demonstrations and sample data.
Authors: Steven Bird, Edward Loper, Ewan Klein
Official project page
Installation and downloading the data:
sudo apt-get install python-pip        # Debian/Ubuntu
sudo yum install python-pip            # RHEL/CentOS/Fedora
sudo pip install -U nltk
sudo pip install -U numpy
python -c "import nltk"                # quick check that the package imports
sudo python -m nltk.downloader -d /usr/share/nltk_data all    # download all NLTK data system-wide
python -c "import nltk;nltk.download()"                        # or use the interactive downloader
Importing the whole NLTK:
In [1]:
import nltk
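Instead of downloading everything, you can also fetch only the resources used in the examples below (a sketch; the exact resource names depend on your NLTK version, e.g. older releases ship maxent_treebank_pos_tagger instead of averaged_perceptron_tagger):
nltk.download('punkt')                       # sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('maxent_ne_chunker')           # named-entity chunker
nltk.download('words')                       # word list used by the chunker
nltk.download('wordnet')                     # WordNet data for the lemmatizer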
In [2]:
text = "Natural language processing (NLP) is a field of e.g. computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation."
one_sentence = "I eat breakfast at 8:00 a.m."
Tokenization into words (with no option to choose a language) produces a simple list of words and punctuation marks:
In [3]:
tokens = nltk.word_tokenize(text)
tokens
Out[3]:
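A quick way to inspect the resulting token list is a frequency distribution (a minimal sketch; most_common is available in NLTK 3, where FreqDist behaves like collections.Counter):
fdist = nltk.FreqDist(tokens)    # count how often each token occurs
fdist.most_common(10)            # the ten most frequent tokens in the paragraph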
For splitting into sentences, load tokenizers/punkt/english.pickle (for English):
In [4]:
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
sent_tokenize.tokenize(text)
Out[4]:
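The same split can be obtained through the convenience wrapper nltk.sent_tokenize, which loads the English Punkt model behind the scenes (a sketch; newer releases also accept a language argument):
nltk.sent_tokenize(text)    # uses the English Punkt model by default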
Using a simple POS tagger:
| Tag | Name | Examples |
|---|---|---|
| ADJ | adjective | new, good, high, special, big, local |
| ADV | adverb | really, already, still, early, now |
| CNJ | conjunction | and, or, but, if, while, although |
| DET | determiner | the, a, some, most, every, no |
| EX | existential | there, there's |
| FW | foreign word | dolce, ersatz, esprit, quo, maitre |
| MOD | modal verb | will, can, would, may, must, should |
| N | noun | year, home, costs, time, education |
| NP | proper noun | Alison, Africa, April, Washington |
| NUM | number | twenty-four, fourth, 1991, 14:24 |
| PRO | pronoun | he, their, her, its, my, I, us |
| P | preposition | on, of, at, with, by, into, under |
| TO | the word to | to |
| UH | interjection | ah, bang, ha, whee, hmpf, oops |
| V | verb | is, has, get, do, make, see, run |
| VD | past tense | said, took, told, made, asked |
| VG | present participle | making, going, playing, working |
| VN | past participle | given, taken, begun, sung |
| WH | wh determiner | who, which, when, what, where, how |
In [5]:
tagged = nltk.pos_tag(tokens)
tagged
Out[5]:
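Note that nltk.pos_tag returns Penn Treebank tags (NN, VBZ, ...), not the simplified tags listed in the table above. In NLTK 3 you can ask for a coarser tag set instead (a sketch; it assumes the tagset parameter is available and the universal_tagset mapping data has been downloaded):
nltk.download('universal_tagset')          # mapping data for the coarse tags
nltk.pos_tag(tokens, tagset='universal')   # NOUN, VERB, ADJ, ... instead of NN, VBZ, JJ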
Graphically (named-entity chunking on a single sentence):
In [6]:
one_sentence_pos_tags = nltk.pos_tag(nltk.word_tokenize(one_sentence))
entities = nltk.chunk.ne_chunk(one_sentence_pos_tags)
entities
Out[6]:
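The result is an nltk.Tree; to pull out any labelled chunks programmatically you can walk its children (a minimal sketch assuming NLTK 3, where subtrees expose label(); older versions use the .node attribute instead):
named = []
for node in entities:
    # named-entity chunks are subtrees, ordinary tokens are (word, tag) tuples
    if isinstance(node, nltk.tree.Tree):
        named.append((" ".join(word for word, tag in node.leaves()), node.label()))
named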
The lemmatizer is based on WordNet.
Example words (input form = lemma):
In [7]:
wnl = nltk.stem.WordNetLemmatizer()
print "aardwolves \t=", wnl.lemmatize("aardwolves")
print "dogs \t=", wnl.lemmatize("dogs")
print "abaci \t=", wnl.lemmatize("abaci")
German abbreviations:
In [8]:
text_de = u"Haben Sie Lust mit zu mir zu kommen und alles das zu tun, z.B. was ich allen anderen morgen sowieso erzählen werde? Ich weigere mich, in dem alten u. verspukten Schloss zu schlafen. ich habe Angst vor Geistern!"
sent_tokenize_de = nltk.data.load('tokenizers/punkt/german.pickle')
sents_de = sent_tokenize_de.tokenize(text_de)
for i in sents_de:
print "'" + i + "'"
An online demo is also available at: Python NLTK Demos