Introduction to the NLTK Natural Language Processing Package

The NLTK (Natural Language Toolkit) package is a Python package for natural language processing and document analysis that was originally developed for educational purposes. It provides a wide range of features and examples and is also widely used in practice and in research.

The main features provided by the NLTK package are as follows.

  • sample corpora and dictionaries
  • tokenizing
  • morphological analysis (stemming/lemmatizing)
  • part-of-speech tagging
  • syntax parsing

Sample corpora

A corpus is a collection of sample documents used for analysis work. Some corpora are simply collections of documents such as novels and newspaper articles, but most are annotated with auxiliary information such as part-of-speech tags and morphemes, and organized into a structured form to make analysis easier.
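
For example, the Brown corpus is distributed in a structured form in which every word carries a part-of-speech tag. A minimal example of accessing it (assuming the brown resource has been fetched with the download command described below):

import nltk

nltk.download("brown")                        # fetch the Brown corpus (first run only)
print(nltk.corpus.brown.tagged_words()[:5])   # a list of (word, POS tag) pairs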

The corpus subpackage of the NLTK package provides a variety of research corpora, such as the following. This list is only a small part of the full collection.

  • averaged_perceptron_tagger: Averaged Perceptron Tagger
  • book_grammars: Grammars from NLTK Book
  • brown: Brown Corpus
  • chat80: Chat-80 Data Files
  • city_database: City Database
  • comparative_sentences: Comparative Sentence Dataset
  • dependency_treebank: Dependency Parsed Treebank
  • gutenberg: Project Gutenberg Selections
  • hmm_treebank_pos_tagger: Treebank Part of Speech Tagger (HMM)
  • inaugural: C-Span Inaugural Address Corpus
  • large_grammars: Large context-free and feature-based grammars for parser comparison
  • mac_morpho: MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
  • masc_tagged: MASC Tagged Corpus
  • maxent_ne_chunker: ACE Named Entity Chunker (Maximum entropy)
  • maxent_treebank_pos_tagger: Treebank Part of Speech Tagger (Maximum entropy)
  • movie_reviews: Sentiment Polarity Dataset Version 2.0
  • names: Names Corpus, Version 1.3 (1994-03-29)
  • nps_chat: NPS Chat
  • omw: Open Multilingual Wordnet
  • opinion_lexicon: Opinion Lexicon
  • pros_cons: Pros and Cons
  • ptb: Penn Treebank
  • punkt: Punkt Tokenizer Models
  • reuters: The Reuters-21578 benchmark corpus, ApteMod version
  • sample_grammars: Sample Grammars
  • sentence_polarity: Sentence Polarity Dataset v1.0
  • sentiwordnet: SentiWordNet
  • snowball_data: Snowball Data
  • stopwords: Stopwords Corpus
  • subjectivity: Subjectivity Dataset v1.0
  • tagsets: Help on Tagsets
  • treebank: Penn Treebank Sample
  • twitter_samples: Twitter Samples
  • unicode_samples: Unicode Samples
  • universal_tagset: Mappings to the Universal Part-of-Speech Tagset
  • universal_treebanks_v20: Universal Treebanks Version 2.0
  • verbnet: VerbNet Lexicon, Version 2.1
  • webtext: Web Text Corpus
  • word2vec_sample: Word2Vec Sample
  • wordnet: WordNet
  • words: Word Lists

These corpus files are not included with the installation itself; the user must download them with the download command.


In [173]:
import nltk

nltk.download("averaged_perceptron_tagger")
nltk.download("gutenberg")
nltk.download("punkt")
nltk.download("reuters")
nltk.download("stopwords")
nltk.download("webtext")
nltk.download("wordnet")


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/joel/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package gutenberg to /home/joel/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /home/joel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package reuters to /home/joel/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package stopwords to /home/joel/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package webtext to /home/joel/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package wordnet to /home/joel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[173]:
True

In [9]:
nltk.corpus.gutenberg.fileids()


Out[9]:
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

In [142]:
emma_raw = nltk.corpus.gutenberg.raw("austen-emma.txt")
print(emma_raw[:1302])


[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.
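
Besides raw(), which returns the plain text, corpus readers such as gutenberg also provide words() and sents() methods that return pre-tokenized views of the same document. A short example:

from nltk.corpus import gutenberg

print(gutenberg.words("austen-emma.txt")[:8])   # word tokens
print(gutenberg.sents("austen-emma.txt")[1])    # one sentence, as a list of tokens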

Tokenizing

Before a document can be analyzed, the long input string must first be broken into smaller units suitable for analysis. These string units are called tokens, and producing them is called tokenizing. NLTK provides several tokenizers, including word_tokenize for splitting text into words, RegexpTokenizer for tokenizing with a regular expression, and sent_tokenize for splitting text into sentences.


In [73]:
from nltk.tokenize import word_tokenize
word_tokenize(emma_raw[50:100])


Out[73]:
[u'Emma',
 u'Woodhouse',
 u',',
 u'handsome',
 u',',
 u'clever',
 u',',
 u'and',
 u'rich',
 u',',
 u'with',
 u'a']

In [81]:
from nltk.tokenize import RegexpTokenizer
t = RegexpTokenizer(r"[\w]+")
t.tokenize(emma_raw[50:100])


Out[81]:
[u'Emma', u'Woodhouse', u'handsome', u'clever', u'and', u'rich', u'with', u'a']

In [154]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(emma_raw[:1000])[3])


Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
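
Note that the period after the abbreviation Mr. is not treated as a sentence boundary: the punkt models downloaded above include trained knowledge of common abbreviations.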

Morphological analysis

Morphological analysis is the task of identifying the structure of a word's various linguistic properties, such as its stem, prefixes/suffixes, and part of speech (POS). Concretely, it divides into the following tasks.

  • stemming (extracting word stems)
  • lemmatizing (restoring dictionary base forms)
  • POS tagging (part-of-speech tagging)

Stemming and lemmatizing


In [82]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
st.stem("eating")


Out[82]:
u'eat'

In [89]:
from nltk.stem import LancasterStemmer
st = LancasterStemmer()
st.stem("shopping")


Out[89]:
'shop'

In [90]:
from nltk.stem import RegexpStemmer
st = RegexpStemmer("ing")
st.stem("cooking")


Out[90]:
u'cook'

In [95]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
print(lm.lemmatize("cooking"))
print(lm.lemmatize("cooking", pos="v"))
print(lm.lemmatize("cookbooks"))


cooking
cook
cookbook

In [99]:
print(WordNetLemmatizer().lemmatize("believes"))
print(LancasterStemmer().stem("believes"))


belief
believ
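
Note the difference: the WordNet lemmatizer maps "believes" to the dictionary form "belief", while the Lancaster stemmer merely strips the suffix, producing "believ", which is not an actual English word. Stemmers are fast but crude; lemmatizers are slower but return valid base forms.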

POS tagging


In [171]:
from nltk.tag import pos_tag
tagged_list = pos_tag(word_tokenize(emma_raw[:100]))
tagged_list


Out[171]:
[(u'[', 'NNS'),
 (u'Emma', 'NNP'),
 (u'by', 'IN'),
 (u'Jane', 'NNP'),
 (u'Austen', 'NNP'),
 (u'1816', 'CD'),
 (u']', 'NNP'),
 (u'VOLUME', 'NNP'),
 (u'I', 'PRP'),
 (u'CHAPTER', 'VBP'),
 (u'I', 'PRP'),
 (u'Emma', 'NNP'),
 (u'Woodhouse', 'NNP'),
 (u',', ','),
 (u'handsome', 'NN'),
 (u',', ','),
 (u'clever', 'NN'),
 (u',', ','),
 (u'and', 'CC'),
 (u'rich', 'JJ'),
 (u',', ','),
 (u'with', 'IN'),
 (u'a', 'DT')]
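
The meaning of each Penn Treebank tag (NNP, VBP, CC, and so on) can be looked up with nltk.help.upenn_tagset, which uses the tagsets resource from the corpus list above. A minimal example, assuming tagsets has been downloaded:

import nltk

nltk.download("tagsets")        # tag documentation (the tagsets resource above)
nltk.help.upenn_tagset("NNP")   # prints the definition and examples for the NNP tag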

In [172]:
from nltk.tag import untag
untag(tagged_list)


Out[172]:
[u'[',
 u'Emma',
 u'by',
 u'Jane',
 u'Austen',
 u'1816',
 u']',
 u'VOLUME',
 u'I',
 u'CHAPTER',
 u'I',
 u'Emma',
 u'Woodhouse',
 u',',
 u'handsome',
 u',',
 u'clever',
 u',',
 u'and',
 u'rich',
 u',',
 u'with',
 u'a']
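
Putting the pieces together, the following is a minimal preprocessing sketch, not part of the original notebook, that combines the tokenizer, stopword list, and lemmatizer shown above. It assumes the stopwords and wordnet resources downloaded earlier and reuses the emma_raw string.

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Keep only word characters, drop English stopwords, and lemmatize the rest.
tokenizer = RegexpTokenizer(r"[\w]+")
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = tokenizer.tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop]

print(preprocess(emma_raw[50:100]))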