In [1]:
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
%matplotlib inline
In [3]:
df = pd.read_csv('data_tau_days.csv')
In [4]:
df.head()
Out[4]:
In [5]:
import nltk
from wordcloud import WordCloud
In [6]:
sentence = df["title"][0]
sentence
Out[6]:
In [7]:
tokens = nltk.wordpunct_tokenize(sentence)
tokens
Out[7]:
Let us take all the sentences in the dataframe, tokenize them into words, and get a frequency count of each word
In [8]:
frequency_words = {}
In [9]:
for data in df['title']:
    tokens = nltk.wordpunct_tokenize(data)
    for token in tokens:
        # Increment the running count for tokens we have seen before
        if token in frequency_words:
            count = frequency_words[token]
            count = count + 1
            frequency_words[token] = count
        else:
            frequency_words[token] = 1
In [10]:
# Let us see the count for each word occurring
frequency_words
Out[10]:
In [11]:
# Creating a Wordcloud
wordcloud = WordCloud()
In [12]:
# Newer versions of wordcloud expect a dict of word -> count rather than .items()
wordcloud.generate_from_frequencies(frequency_words)
Out[12]:
In [13]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Question - What are the two issues with this wordcloud?
In [14]:
# Convert the dict to a dataframe
freq = pd.DataFrame.from_dict(frequency_words, orient = 'index')
In [15]:
# Let us sort them in descending order
freq.sort_values(by = 0, ascending=False).head(10)
Out[15]:
Stop words are words which are filtered out before or after processing of natural language data. Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
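As a minimal sketch of the idea on a single sentence (assuming the NLTK stopwords corpus has already been fetched with nltk.download('stopwords')):
In [ ]:
# A minimal stop word filtering sketch on one sentence
# (assumes nltk.download('stopwords') has been run).
from nltk.corpus import stopwords

english_stop = set(stopwords.words('english'))
words = nltk.wordpunct_tokenize("This is a simple example of removing stop words")
print([w for w in words if w.lower() not in english_stop])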
In [19]:
from nltk.corpus import stopwords
# nltk.download()
In [20]:
stop = stopwords.words('english')
In [21]:
stop[0:10]
Out[21]:
We will recreate the word frequencies with two additional steps: lowercasing each token for the comparison and skipping stop words
In [23]:
frequency_words_wo_stop = {}
for data in df['title']:
    tokens = nltk.wordpunct_tokenize(data)
    for token in tokens:
        # Compare the lowercased token against the stop list and skip stop words
        if token.lower() not in stop:
            if token in frequency_words_wo_stop:
                count = frequency_words_wo_stop[token]
                count = count + 1
                frequency_words_wo_stop[token] = count
            else:
                frequency_words_wo_stop[token] = 1
In [24]:
wordcloud.generate_from_frequencies(frequency_words_wo_stop)
Out[24]:
In [25]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
We can also extend the stop word list with common punctuation marks to remove those as well
In [28]:
stop.extend(('.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}','/','-'))
In [29]:
frequency_words_wo_stop = {}
In [30]:
def generate_word_frequency(row):
    # Tokenize the title, skip stop words (comparing in lowercase) and
    # count each remaining lowercased token across the whole dataframe.
    data = row['title']
    tokens = nltk.wordpunct_tokenize(data)
    token_list = []
    for token in tokens:
        if token.lower() not in stop:
            token_list.append(token.lower())
            if token.lower() in frequency_words_wo_stop:
                count = frequency_words_wo_stop[token.lower()]
                count = count + 1
                frequency_words_wo_stop[token.lower()] = count
            else:
                frequency_words_wo_stop[token.lower()] = 1
    return ','.join(token_list)
The apply function takes a function as its input and applies it across all the rows or columns of the DataFrame.
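As a quick hedged illustration on a toy DataFrame (separate from our df), axis=1 passes each row to the function:
In [ ]:
# Toy illustration of DataFrame.apply: axis=1 passes each row to the
# function; axis=0 would pass each column instead.
toy = pd.DataFrame({'title': ['Hello world', 'Calvin harris is a great musician']})
toy['word_count'] = toy.apply(lambda row: len(row['title'].split()), axis=1)
toy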
In [31]:
df['tokens'] = df.apply(generate_word_frequency,axis=1)
In [32]:
df.head()
Out[32]:
In [33]:
wordcloud.generate_from_frequencies(frequency_words_wo_stop)
Out[33]:
In [34]:
plt.figure(figsize=(14,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Exercise: Find the frequency count for each word without stop words (one possible approach is sketched below)
In [ ]:
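One possible approach, sketched here, reuses the earlier pattern of converting the dictionary to a DataFrame and sorting it:
In [ ]:
# A sketch of one possible solution: convert frequency_words_wo_stop to a
# DataFrame and sort by the count column, as we did for frequency_words.
pd.DataFrame.from_dict(frequency_words_wo_stop, orient='index') \
    .sort_values(by=0, ascending=False).head(10)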
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
Stemming words is another common NLP technique to reduce topically similar words to their root. For example, “stemming,” “stemmer,” “stemmed,” all have similar meanings; stemming reduces those terms to “stem.” This is important for topic modeling, which would otherwise view those terms as separate entities and reduce their importance in the model. Stemming programs are commonly referred to as stemming algorithms or stemmers.
Like stopping, stemming is flexible and some methods are more aggressive. The Porter stemming algorithm is the most widely used method. To implement a Porter stemming algorithm, import the Porter Stemmer module from NLTK:
In [35]:
from nltk.stem.porter import PorterStemmer
In [36]:
porter_stemmer = PorterStemmer()
In [37]:
porter_stemmer.stem('dividing')
Out[37]:
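A quick sketch with the porter_stemmer defined above: related inflected forms typically collapse to the same stem, and the stem need not be a dictionary word.
In [ ]:
# Related inflections generally map to the same stem; the stem itself
# may not be a valid English word.
for word in ['dividing', 'divided', 'divides', 'stemming', 'stemmed']:
    print(word, '->', porter_stemmer.stem(word))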
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.
In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word.
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
We will use a corpus to do the lemmatization. Let us download the wordnet corpus using nltk.download()
In [38]:
from nltk.stem import WordNetLemmatizer
In [39]:
wordnet_lemmatizer = WordNetLemmatizer()
In [40]:
wordnet_lemmatizer.lemmatize('are')
Out[40]:
In [41]:
wordnet_lemmatizer.lemmatize('is')
Out[41]:
But we know that the root of are and is is be. The reason are and is come back unchanged is that we have to tell the lemmatizer to treat them as verbs by passing the part of speech
In [45]:
wordnet_lemmatizer.lemmatize('dividing', pos = "v")
Out[45]:
In [42]:
wordnet_lemmatizer.lemmatize('is',pos='v')
Out[42]:
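A brief side-by-side sketch using the stemmer and lemmatizer defined above; note that the lemmatizer only returns the base verb form when we pass pos='v'.
In [ ]:
# Compare Porter stemming with WordNet lemmatization on the same words.
for word in ['are', 'is', 'dividing', 'studies']:
    print(word,
          'stem:', porter_stemmer.stem(word),
          'lemma (verb):', wordnet_lemmatizer.lemmatize(word, pos='v'))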
In [46]:
def stem_title(data):
    return porter_stemmer.stem(data['title'])
In [47]:
def lemmatize_title(data):
    return wordnet_lemmatizer.lemmatize(data['title'])
In [48]:
df['stem'] = df.apply(stem_title,axis=1)
In [49]:
df.head()
Out[49]:
In [50]:
df['lemma'] = df.apply(lemmatize_title,axis=1)
In [51]:
df.head()
Out[51]:
In [52]:
df.tail()
Out[52]:
Note: stemming and lemmatization matter in the context of recall: collapsing inflected forms lets a search for one form also match documents containing the others.
https://displacy.spacy.io/displacy/index.html?full=Click+the+button+to+see+this+sentence+in+displaCy.
Let us go back to school. Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.
Part-of-speech tagging is one of the most important text analysis tasks: it classifies words into their parts of speech and labels them according to a tagset, the collection of tags used for POS tagging. Parts of speech are also known as word classes or lexical categories. Here is the definition from Wikipedia:
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.
In [53]:
text = 'Calvin harris is a great musician'
In [54]:
text_tokens = nltk.wordpunct_tokenize(text)
In [55]:
text_tokens
Out[55]:
We will download the averaged perceptron tagger using nltk.download() to do POS tagging
In [56]:
nltk.pos_tag(text_tokens)
Out[56]:
Tag | Meaning | English Examples |
---|---|---|
ADJ | adjective | new, good, high, special, big, local |
ADP | adposition | on, of, at, with, by, into, under |
ADV | adverb | really, already, still, early, now |
CONJ | conjunction | and, or, but, if, while, although |
DET | determiner | the, a, some, most, every, no, which |
NOUN | noun | year, home, costs, time, Africa |
NUM | numeral | twenty-four, fourth, 1991, 14:24 |
PRT | particle | at, on, out, over, per, that, up, with |
PRON | pronoun | he, their, her, its, my, I, us |
VERB | verb | is, say, told, given, playing, would |
. | punctuation marks | . , ; ! |
X | other | ersatz, esprit, dunno, gr8, univeristy |
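The table above lists the universal tagset, while nltk.pos_tag defaults to the finer-grained Penn Treebank tags (NNP, VBZ, ...). As a sketch (assuming the universal_tagset resource has been downloaded via nltk.download), passing tagset='universal' maps the output onto the tags in the table:
In [ ]:
# Map the default Penn Treebank tags onto the universal tagset shown above
# (assumes nltk.download('universal_tagset') has been run).
nltk.pos_tag(text_tokens, tagset='universal')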
In [57]:
def get_pos_tags(data):
    return nltk.pos_tag(nltk.wordpunct_tokenize(data['title']))
In [58]:
df['pos_tags'] = df.apply(get_pos_tags,axis=1)
In [59]:
df.head()
Out[59]:
The basic technique we will use for entity detection is chunking, which segments and labels multi-token sequences. At the word level we have tokenization and part-of-speech tagging; chunking groups sequences of these tagged tokens into higher-level units, and each such unit is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens, and the pieces produced by a chunker do not overlap in the source text.
Named Entity-Type | Examples |
---|---|
ORGANIZATION | Georgia-Pacific Corp., WHO |
PERSON | Eddy Bonte, President Obama |
LOCATION | Murray River, Mount Everest |
DATE | June, 2008-06-29 |
TIME | two fifty a m, 1:30 p.m. |
MONEY | 175 million Canadian Dollars, GBP 10.40 |
PERCENT | twenty pct, 18.75 % |
FACILITY | Washington Monument, Stonehenge |
GPE | South East Asia, Midlothian |
To do the entity identification, we will download the maxent_ne_chunker and words corpora using nltk.download()
In [60]:
df.pos_tags[0]
Out[60]:
In [61]:
ne_tree = nltk.ne_chunk(df.pos_tags[0],binary=True)
# ne_tree
In [ ]:
# for x in ne_tree:
# print(x)
In [ ]:
# We want only the NE chunks; printing the type shows that each chunk is a tree,
# so we need to iterate over the tree and pick out the nodes labelled NE
In [62]:
for x in ne_tree:
    print(type(x), x)
    # Named entity chunks are nltk Tree objects labelled 'NE' (since binary=True)
    if type(x) == nltk.tree.Tree:
        if x.label() == 'NE':
            print(x)
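The loop above only prints chunks labelled NE because we passed binary=True. As a hedged sketch, leaving binary at its default makes nltk.ne_chunk assign the typed labels from the table above (PERSON, GPE, ORGANIZATION, ...):
In [ ]:
# Typed entity labels: without binary=True, ne_chunk labels chunks with
# entity types such as PERSON, GPE or ORGANIZATION instead of plain NE.
typed_tree = nltk.ne_chunk(df.pos_tags[0], binary=False)
for node in typed_tree:
    if type(node) == nltk.tree.Tree:
        print(node.label(), ' '.join(word for word, tag in node.leaves()))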
In [63]:
def get_entities(row):
    entities = []
    chunked_tree = nltk.ne_chunk(row.pos_tags, binary=True)
    for nodes in chunked_tree:
        if type(nodes) == nltk.tree.Tree:
            if nodes.label() == 'NE':
                print("Before zip", nodes.leaves())
                # leaves() is a list of (word, tag) tuples; zip(*...) separates
                # the words from the tags so we can join the words back together
                zipped_list = list(zip(*nodes.leaves()))
                print("After zip", zipped_list)
                entities.append(' '.join(zipped_list[0]))
    return entities
In [64]:
df['named_entities'] = df.apply(get_entities,axis=1)
In [65]:
df.head()
Out[65]:
Now that we have entities, we can understand the statements better
In [ ]:
df.tail()
In [ ]:
df.to_csv('data_tau_ta.csv',index=False)
In [ ]:
In [ ]: