Let's go over the most common techniques for preparing text to be used with machine learning algorithms later on.
As a case study, we will use the text of Hamlet, found in the Gutenberg corpus of the NLTK package.
1. Downloading the Gutenberg corpus
In [1]:
import nltk
nltk.download("gutenberg")
Out[1]:
2. Displaying the text of "Hamlet"
In [2]:
hamlet_raw = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
print(hamlet_raw[:1000])
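If you want to check which other texts ship with this corpus, the fileids() method lists them (a quick optional check):
In [ ]:
# List all plain-text files available in the Gutenberg corpus
print(nltk.corpus.gutenberg.fileids())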
3. Sentence segmentation and word tokenization
In [3]:
nltk.download('punkt')  # tokenizer models required by sent_tokenize and word_tokenize
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(hamlet_raw)
print(sentences[:10])
In [4]:
from nltk.tokenize import word_tokenize
words = word_tokenize(sentences[0])
print(words)
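To tokenize every sentence at once, a simple list comprehension over the segmented sentences is enough (a minimal sketch; tokenized_sentences is a name introduced here for illustration):
In [ ]:
# Tokenize each sentence of Hamlet into a list of word tokens
tokenized_sentences = [word_tokenize(s) for s in sentences]
print(tokenized_sentences[1])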
4. Removing stopwords and punctuation
In [5]:
nltk.download('stopwords')  # stopword lists required by nltk.corpus.stopwords
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
print(stopwords_list)
In [6]:
non_stopwords = [w for w in words if not w.lower() in stopwords_list]
print(non_stopwords)
In [7]:
import string
punctuation = string.punctuation
print(punctuation)
In [8]:
non_punctuation = [w for w in non_stopwords if not w in punctuation]
print(non_punctuation)
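Both filters can also be applied in a single pass; the set below is just an optional optimization for the membership tests (clean_words is a name introduced here):
In [ ]:
# Remove stopwords and punctuation in one list comprehension
to_remove = set(stopwords_list) | set(punctuation)
clean_words = [w for w in words if w.lower() not in to_remove]
print(clean_words)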
5. Part of Speech (POS) Tags
In [9]:
nltk.download('averaged_perceptron_tagger')  # tagger model required by pos_tag
from nltk import pos_tag
pos_tags = pos_tag(words)
print(pos_tags)
The tags indicate the syntactic class of each word in the text. See https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a complete list.
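NLTK also ships a small help utility to look up the meaning of a Penn Treebank tag without leaving the notebook (it may require downloading the 'tagsets' resource first):
In [ ]:
nltk.download('tagsets')
# Show the description and examples for the VBZ tag
nltk.help.upenn_tagset('VBZ')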
6. Stemming and Lemmatization
Stemming obtains the "root" (stem) of a word, for example by stripping suffixes.
In [10]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
sample_sentence = "He has already gone"
sample_words = word_tokenize(sample_sentence)
stems = [stemmer.stem(w) for w in sample_words]
print(stems)
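Note that a stem is not necessarily a valid dictionary word; it is often just a truncated form. A minimal illustration with a made-up sentence, reusing the stemmer above:
In [ ]:
# Stems are frequently truncated forms rather than dictionary words
print([stemmer.stem(w) for w in word_tokenize("studies studying easily")])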
Lemmatization, on the other hand, goes beyond simply removing suffixes and recovers the linguistic root (lemma) of the word. We will use the POS tags obtained earlier to improve the lemmatizer's results.
In [12]:
nltk.download('wordnet')
Out[12]:
In [13]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
pos_tags = nltk.pos_tag(sample_words)
lemmas = []
for w in pos_tags:
    # Map the Penn Treebank tag to the corresponding WordNet POS constant
    # (wn_tag avoids shadowing the pos_tag function imported earlier)
    if w[1].startswith('J'):
        wn_tag = wordnet.ADJ
    elif w[1].startswith('V'):
        wn_tag = wordnet.VERB
    elif w[1].startswith('N'):
        wn_tag = wordnet.NOUN
    elif w[1].startswith('R'):
        wn_tag = wordnet.ADV
    else:
        wn_tag = wordnet.NOUN  # default when there is no clear mapping
    lemmas.append(lemmatizer.lemmatize(w[0], wn_tag))
print(lemmas)
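To see why the POS tag matters, compare lemmatizing the same word with and without the verb tag (a minimal illustration reusing the lemmatizer above):
In [ ]:
# Without a POS tag the lemmatizer assumes a noun and leaves 'gone' unchanged;
# with the verb tag it should map it back to 'go'
print(lemmatizer.lemmatize('gone'))
print(lemmatizer.lemmatize('gone', wordnet.VERB))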
7. N-grams
Besides the Bag-of-Words technique, another option is to use n-grams (where "n" can vary).
In [14]:
non_punctuation = [w for w in words if not w.lower() in punctuation]
n_grams_3 = ["%s %s %s"%(non_punctuation[i], non_punctuation[i+1], non_punctuation[i+2]) for i in range(0, len(non_punctuation)-2)]
print(n_grams_3)
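NLTK also provides a helper for this in nltk.util.ngrams, which avoids the manual indexing (a brief equivalent sketch):
In [ ]:
from nltk.util import ngrams
# ngrams() yields tuples; joining them reproduces the string trigrams above
n_grams_3_nltk = [' '.join(g) for g in ngrams(non_punctuation, 3)]
print(n_grams_3_nltk[:10])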
We can also use scikit-learn's CountVectorizer class:
In [15]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(ngram_range=(3,3))
import numpy as np
arr = np.array([sentences[0]])
print(arr)
n_gram_counts = count_vect.fit_transform(arr)
print(n_gram_counts)
print(count_vect.vocabulary_)
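To inspect the result in a more readable form, the sparse count matrix can be densified and the learned trigrams listed (get_feature_names_out is available in recent scikit-learn versions; older releases use get_feature_names):
In [ ]:
# Dense view of the counts and the corresponding trigram features
print(n_gram_counts.toarray())
print(count_vect.get_feature_names_out())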
Now let's count the n-grams (in our case, trigrams) of all the sentences in the text:
In [16]:
arr = np.array(sentences)
n_gram_counts = count_vect.fit_transform(arr)
print(n_gram_counts[:20])
print([k for k in count_vect.vocabulary_.keys()][:20])
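Summing the counts over all sentences gives the most frequent trigrams in the play (a minimal sketch; totals, vocab and top are names introduced here, and get_feature_names_out again assumes a recent scikit-learn):
In [ ]:
# Total count of each trigram across all sentences, sorted in descending order
totals = np.asarray(n_gram_counts.sum(axis=0)).ravel()
vocab = count_vect.get_feature_names_out()
top = sorted(zip(vocab, totals), key=lambda x: x[1], reverse=True)[:10]
print(top)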
In [17]:
from nltk import word_tokenize
frase = 'o cachorro correu atrás do gato'
# Tokenize once and build the trigrams from the token list
tokens = word_tokenize(frase)
ngrams = ["%s %s %s" % (tokens[i], tokens[i+1], tokens[i+2]) for i in range(len(tokens)-2)]
print(ngrams)