NLTK

PYTHON

Table of contents:

What is NLTK?
Requirements
Installing NLTK
Examples for English
Online version
Notes

What is NLTK?

The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data.
Authors: Steven Bird, Edward Loper, Ewan Klein
Official project website

↑ Back to table of contents ↑

Requirements

Basic knowledge of Natural Language Processing
Basic knowledge of Python
Python 2.7 or 3.2+

Installing NLTK

Install python-pip:
1. Debian/Ubuntu:
  - sudo apt-get install python-pip
2. Fedora:
  - sudo yum install python-pip
Install NLTK:
- sudo pip install -U nltk
(OPTIONAL) Install NumPy:
- sudo pip install -U numpy
Check that NLTK can be imported:
- python -c "import nltk"
Download the external data:
1. Automatically (downloads the data to /usr/share/nltk_data):
  - sudo python -m nltk.downloader -d /usr/share/nltk_data all
2. Graphically:
  - python -c "import nltk;nltk.download()"
    - choose the installation location in the Download Directory field
    - select the packages you need and confirm with the Download button
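
Whether the installation and the data download succeeded can also be checked from Python. The sketch below is not part of the original notebook and assumes Python 3: the helper names `nltk_available` and `missing_resources` are hypothetical, and the listed resource paths are only a guess at what the later examples need (the names differ between NLTK versions). It relies on `nltk.data.find`, which raises `LookupError` when a resource cannot be located.

```python
import importlib.util


def nltk_available():
    """Return True if the nltk package can be imported."""
    return importlib.util.find_spec("nltk") is not None


def missing_resources(resources):
    """Return the resource paths that nltk.data.find cannot locate."""
    import nltk  # imported lazily so nltk_available() works without NLTK

    missing = []
    for resource in resources:
        try:
            nltk.data.find(resource)
        except LookupError:
            missing.append(resource)
    return missing


# Assumed resource paths for the tokenizer, tagger and lemmatizer examples.
RESOURCES = [
    "tokenizers/punkt",
    "taggers/averaged_perceptron_tagger",
    "corpora/wordnet",
]

if nltk_available():
    print("Missing resources:", missing_resources(RESOURCES))
else:
    print("NLTK is not installed")
```

An empty list means every listed resource was found in one of the directories in `nltk.data.path`.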


Import NLTK

Import the entire NLTK package:



In [1]:

    
import nltk

English

Examples for English:



In [2]:

    
text = "Natural language processing (NLP) is a field of e.g. computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation."
one_sentence = "I eat breakfast at 8:00 a.m."

Word Tokenize

Splitting into tokens (with no option to choose the language) produces a flat list of words and punctuation marks:



In [3]:

    
tokens = nltk.word_tokenize(text)
tokens









    Out[3]:





['Natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'field',
 'of',
 'e.g',
 '.',
 'computer',
 'science',
 ',',
 'artificial',
 'intelligence',
 ',',
 'and',
 'computational',
 'linguistics',
 'concerned',
 'with',
 'the',
 'interactions',
 'between',
 'computers',
 'and',
 'human',
 '(',
 'natural',
 ')',
 'languages',
 '.',
 'As',
 'such',
 ',',
 'NLP',
 'is',
 'related',
 'to',
 'the',
 'area',
 'of',
 'human-computer',
 'interaction',
 '.',
 'Many',
 'challenges',
 'in',
 'NLP',
 'involve',
 'natural',
 'language',
 'understanding',
 ',',
 'that',
 'is',
 ',',
 'enabling',
 'computers',
 'to',
 'derive',
 'meaning',
 'from',
 'human',
 'or',
 'natural',
 'language',
 'input',
 ',',
 'and',
 'others',
 'involve',
 'natural',
 'language',
 'generation',
 '.']
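
The token list mixes words with punctuation marks, so a common next step is to keep only the words and count them. The following is a minimal sketch in plain Python 3 using an excerpt of the output above; the helper `is_word` and its rule (a token counts as a word if it contains a letter or digit) are assumptions made for illustration. NLTK's own `nltk.FreqDist` offers a similar counting interface.

```python
from collections import Counter

# Excerpt of the token list shown above (end of the last sentence).
tokens = ['human', 'or', 'natural', 'language', 'input', ',', 'and',
          'others', 'involve', 'natural', 'language', 'generation', '.']


def is_word(token):
    """Treat a token as a word if it contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# Drop pure punctuation tokens and normalise case before counting.
words = [token.lower() for token in tokens if is_word(token)]
counts = Counter(words)

print(words)
print(counts.most_common(2))
```

Tokens such as 'e.g' or 'human-computer' pass this filter because they contain letters, while ',' and '.' are dropped.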

Sentence Tokenize

To split text into sentences, load tokenizers/punkt/english.pickle (for English):



In [4]:

    
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
sent_tokenize.tokenize(text)









    Out[4]:





['Natural language processing (NLP) is a field of e.g.',
 'computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.',
 'As such, NLP is related to the area of human-computer interaction.',
 'Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.']
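
In the output above the first sentence is cut after "e.g.", which suggests the loaded model does not treat it as an abbreviation. One possible workaround, sketched here as an assumption rather than as the notebook's method, is to build a Punkt tokenizer with an explicit abbreviation list through `PunktParameters`. The tokenizer below is untrained, so it knows nothing beyond the abbreviations listed; the sample text is made up for the example, and Python 3 is assumed.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Abbreviations are given in lowercase and without the trailing period.
params = PunktParameters()
params.abbrev_types = {"e.g"}

# An untrained tokenizer that only knows the abbreviations listed above.
custom_tokenizer = PunktSentenceTokenizer(params)

sample = ("NLP is a field of e.g. computer science and linguistics. "
          "It is related to human-computer interaction.")
sentences = custom_tokenizer.tokenize(sample)

for sentence in sentences:
    print(repr(sentence))
```

This needs no downloaded data, but it also lacks everything the pre-trained english.pickle model has learned, so it is best seen as a way to experiment with abbreviation handling.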

POS Tagger

Using a simple POS tagger. The table lists a simplified tagset; nltk.pos_tag itself returns Penn Treebank tags such as JJ, NN and VBZ.

Tag	Meaning	Examples
ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CNJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential	there, there's
FW	foreign word	dolce, ersatz, esprit, quo, maitre
MOD	modal verb	will, can, would, may, must, should
N	noun	year, home, costs, time, education
NP	proper noun	Alison, Africa, April, Washington
NUM	number	twenty-four, fourth, 1991, 14:24
PRO	pronoun	he, their, her, its, my, I, us
P	preposition	on, of, at, with, by, into, under
TO	the word to	to
UH	interjection	ah, bang, ha, whee, hmpf, oops
V	verb	is, has, get, do, make, see, run
VD	past tense	said, took, told, made, asked
VG	present participle	making, going, playing, working
VN	past participle	given, taken, begun, sung
WH	wh determiner	who, which, when, what, where, how




In [5]:

    
tagged = nltk.pos_tag(tokens)
tagged









    Out[5]:





[('Natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('(', '('),
 ('NLP', 'NNP'),
 (')', ')'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('field', 'NN'),
 ('of', 'IN'),
 ('e.g', 'NN'),
 ('.', '.'),
 ('computer', 'NN'),
 ('science', 'NN'),
 (',', ','),
 ('artificial', 'JJ'),
 ('intelligence', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('computational', 'JJ'),
 ('linguistics', 'NNS'),
 ('concerned', 'VBN'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('interactions', 'NNS'),
 ('between', 'IN'),
 ('computers', 'NNS'),
 ('and', 'CC'),
 ('human', 'JJ'),
 ('(', '('),
 ('natural', 'JJ'),
 (')', ')'),
 ('languages', 'VBZ'),
 ('.', '.'),
 ('As', 'IN'),
 ('such', 'JJ'),
 (',', ','),
 ('NLP', 'NNP'),
 ('is', 'VBZ'),
 ('related', 'VBN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('area', 'NN'),
 ('of', 'IN'),
 ('human-computer', 'JJ'),
 ('interaction', 'NN'),
 ('.', '.'),
 ('Many', 'JJ'),
 ('challenges', 'NNS'),
 ('in', 'IN'),
 ('NLP', 'NNP'),
 ('involve', 'VBP'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('understanding', 'NN'),
 (',', ','),
 ('that', 'WDT'),
 ('is', 'VBZ'),
 (',', ','),
 ('enabling', 'VBG'),
 ('computers', 'NNS'),
 ('to', 'TO'),
 ('derive', 'VB'),
 ('meaning', 'NN'),
 ('from', 'IN'),
 ('human', 'NN'),
 ('or', 'CC'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('input', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('others', 'NNS'),
 ('involve', 'VBP'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('generation', 'NN'),
 ('.', '.')]
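
The tags in this output (JJ, NN, VBZ, ...) are Penn Treebank tags, not the simplified tags from the table. To relate the two, the sketch below maps tag prefixes to simplified tags and counts them. The mapping `PREFIX_TO_SIMPLE` is hypothetical, rough and incomplete (for example, IN also covers subordinating conjunctions), and the code is plain Python 3 working on an excerpt of the output above.

```python
from collections import Counter

# Excerpt of the tagged output shown above.
tagged = [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
          ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'),
          ('a', 'DT'), ('field', 'NN'), ('of', 'IN')]

# Hypothetical mapping from Penn Treebank tag prefixes to the simplified
# tags in the table. Longer prefixes come first so they match first.
PREFIX_TO_SIMPLE = [
    ('NNP', 'NP'),
    ('NN', 'N'),
    ('JJ', 'ADJ'),
    ('RB', 'ADV'),
    ('VBD', 'VD'),
    ('VBG', 'VG'),
    ('VBN', 'VN'),
    ('VB', 'V'),
    ('DT', 'DET'),
    ('IN', 'P'),
    ('CC', 'CNJ'),
    ('CD', 'NUM'),
]


def simplify(tag):
    """Return the simplified tag, or the tag itself if nothing matches."""
    for prefix, simple in PREFIX_TO_SIMPLE:
        if tag.startswith(prefix):
            return simple
    return tag


simple_counts = Counter(simplify(tag) for _word, tag in tagged)
print(simple_counts.most_common())
```

Tags without an entry, such as the bracket tags, are passed through unchanged.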

Graphically:



In [6]:

    
one_sentence_pos_tags = nltk.pos_tag(nltk.word_tokenize(one_sentence))
entities = nltk.chunk.ne_chunk(one_sentence_pos_tags)
entities









    Out[6]:
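
The result of ne_chunk is an `nltk.Tree`, so besides the graphical rendering it can also be printed or traversed as text. The sketch below shows this on a hand-built tree so that it runs without any downloaded models; the sentence, tags and entity labels in `example_tree` are made up for illustration, and `labelled_chunks` is a hypothetical helper name.

```python
from nltk import Tree

# Hand-built stand-in for an ne_chunk result (illustrative data only).
example_tree = Tree('S', [
    Tree('PERSON', [('Alison', 'NNP')]),
    ('visited', 'VBD'),
    Tree('GPE', [('Washington', 'NNP')]),
    ('.', '.'),
])


def labelled_chunks(tree):
    """Return (label, text) pairs for every subtree below the root."""
    return [
        (subtree.label(), ' '.join(word for word, _tag in subtree.leaves()))
        for subtree in tree.subtrees()
        if subtree is not tree
    ]


print(example_tree)                   # plain-text rendering of the tree
print(labelled_chunks(example_tree))
```

The same helper can be applied to the `entities` object from the cell above; a sentence without recognised entities simply yields an empty list.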

Lemmatizer

The lemmatizer is based on WordNet.

Words (EN = PL):

aardwolves = protel, protel grzywiasty, hiena grzywiasta
dogs = pies
abacuses, abaci = abakus (liczydło)



In [7]:

    
wnl = nltk.stem.WordNetLemmatizer()
print("aardwolves \t= " + wnl.lemmatize("aardwolves"))
print("dogs \t= " + wnl.lemmatize("dogs"))
print("abaci \t= " + wnl.lemmatize("abaci"))









    



aardwolves 	= aardwolf
dogs 	= dog
abaci 	= abacus
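
All three words above are nouns, which matches the default of `lemmatize`: its `pos` argument is 'n' unless stated otherwise. For other parts of speech the tag has to be passed explicitly. The sketch below compares the noun and verb readings of "took"; with the WordNet data installed, the verb reading would be expected to reduce to "take". The wrapper `lemmatize_as` is a hypothetical helper that returns None when the WordNet data is missing, and Python 3 is assumed.

```python
from nltk.stem import WordNetLemmatizer


def lemmatize_as(word, pos):
    """Lemmatize word as the given part of speech ('n', 'v', 'a' or 'r').

    Returns None when the WordNet data has not been downloaded.
    """
    try:
        return WordNetLemmatizer().lemmatize(word, pos=pos)
    except LookupError:
        return None


# Compare the default noun reading with the verb reading.
for pos in ("n", "v"):
    print("took as", pos, "=", lemmatize_as("took", pos))
```

Combined with a POS tagger, the tag of each token can be translated to one of these letters before lemmatizing.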

Sentence Tokenize DE

German abbreviations:

z.B. (zum Beispiel) = e.g. (for example)
u. (und) = the conjunction "and"



In [8]:

    
text_de = u"Haben Sie Lust mit zu mir zu kommen und alles das zu tun, z.B. was ich allen anderen morgen sowieso erzählen werde? Ich weigere mich, in dem alten u. verspukten Schloss zu schlafen. ich habe Angst vor Geistern!"
sent_tokenize_de = nltk.data.load('tokenizers/punkt/german.pickle')
sents_de = sent_tokenize_de.tokenize(text_de)
for sentence in sents_de:
    print("'" + sentence + "'")









    



'Haben Sie Lust mit zu mir zu kommen und alles das zu tun, z.B. was ich allen anderen morgen sowieso erzählen werde?'
'Ich weigere mich, in dem alten u. verspukten Schloss zu schlafen.'
'ich habe Angst vor Geistern!'

NLTK Online

An online version is available at: Python NLTK Demos


Notes

Created during the course: Natural Language Processing
The installation instructions cover only Linux distributions (Debian/Ubuntu, Fedora) and their derivatives!
NLTK was installed on Python 2
For better interactivity, run the notebook with the command: ipython notebook
