This notebook guides you through the basic concepts to start working with Natural Language Processing, including how to set up your environment, create and analyze data sets, and work with data files.
This notebook uses NLTK, a Python framework for Natural Language Processing. Some knowledge of Python is recommended.
If you are new to notebooks, see Parts of a notebook for an overview of how the user interface works.
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.
Text is messy data: various types of noise are present in it, and it is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.
It predominantly comprises three steps:
Any piece of text which is not relevant to the context of the data and the end output can be considered noise.
For example – language stopwords (commonly used words of a language – is, am, the, of, in, etc.), URLs or links, social media entities (mentions, hashtags), punctuation and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating the tokens that appear in the noise dictionary.
The following Python code does exactly that.
In [ ]:
noise_list = ["is", "a", "this", "..."]

def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

_remove_noise("this is a sample text")
In [ ]:
import re

def _remove_regex(input_text, regex_pattern):
    # Find every match of the pattern and remove it from the text.
    urls = re.finditer(regex_pattern, input_text)
    for i in urls:
        input_text = re.sub(i.group().strip(), '', input_text)
    return input_text

regex_pattern = r"#[\w]*"
_remove_regex("remove this #FloridaBlue from tweet text", regex_pattern)
Another type of textual noise comes from the multiple representations exhibited by a single word.
For example – “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they look different, contextually they are all similar. This step converts all such disparities of a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) into a low-dimensional space (1 feature), which is ideal for any ML model.
The most common lexicon normalization practices are:
Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
Below is sample code that performs lemmatization and stemming using Python’s popular NLP library – NLTK.
In [ ]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "multiplying"
print(lem.lemmatize(word, "v"))  # lemmatize as a verb -> "multiply"
print(stem.stem(word))           # rule-based suffix stripping -> "multipli"
Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.
Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed; the code below uses a dictionary lookup to replace social media slang in a text.
In [ ]:
translation_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
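# A minimal sketch (not part of the original notebook) of the lookup step described above:
# split the text, replace any token found in translation_dict, and rejoin the words.
def _lookup_words(input_text, lookup_dict=translation_dict):
    new_words = []
    for word in input_text.split():
        new_words.append(lookup_dict.get(word.lower(), word))
    return " ".join(new_words)

_lookup_words("rt this is an awsm tweet")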
Syntactical parsing involves the analysis of the words in a sentence for grammar and of their arrangement in a manner that shows the relationships among the words. Dependency grammar and part-of-speech tags are the important attributes of text syntax.
Dependency Trees – Sentences are composed of words stitched together. The relationships among the words in a sentence are determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented in the form of a triplet (relation, governor, dependent). For example, consider the sentence – “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.” The relationships among the words can be observed in the form of a tree representation as shown:
The tree shows that “submitted” is the root word of this sentence and is linked to two sub-trees (the subject and object subtrees). Each subtree is itself a dependency tree with relations such as (“Bills” <-> “ports”).
This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output, which can be used as features for many NLP problems such as entity-wise sentiment analysis, actor and entity identification, and text classification. The Python wrapper for Stanford CoreNLP (by the Stanford NLP Group; GPL-licensed, with a separate commercial license available) and NLTK dependency grammars can be used to generate dependency trees.
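As an illustration of the NLTK option, the toy grammar below (the classic elephant-in-pajamas example from the NLTK book; the grammar and sentence are illustrative assumptions, not the Stanford parser) shows how a hand-written dependency grammar can produce dependency trees.
In [ ]:
import nltk

# Toy dependency grammar: each rule lists the dependents a head word may govern.
dep_grammar = nltk.DependencyGrammar.fromstring("""
'shot' -> 'I' | 'elephant' | 'in'
'elephant' -> 'an' | 'in'
'in' -> 'pajamas'
'pajamas' -> 'my'
""")

# The projective dependency parser enumerates the trees consistent with the grammar.
parser = nltk.ProjectiveDependencyParser(dep_grammar)
for tree in parser.parse("I shot an elephant in my pajamas".split()):
    print(tree)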
In [ ]:
import nltk
from nltk import sent_tokenize, word_tokenize
from IPython.display import Image

# Download the specific NLTK resources used in this notebook
# (instead of opening the interactive downloader with nltk.download()).
nltk.download(['punkt', 'averaged_perceptron_tagger', 'wordnet',
               'maxent_ne_chunker', 'words', 'movie_reviews'])
In [ ]:
sentences = sent_tokenize("Our mission is to help people and communities achieve better health declares our purpose " \
"as a company and it serves as the standard against which we weigh our actions and our decisions. " \
"Our Vision is to be a leading innovator enabling healthy communities is both the inspirational and " \
"aspirational description of the future state of our company. It is our framework and guides every " \
"aspect of our business. By broadening our scope and continuing to evolve, we have more flexibility " \
"to make a greater impact on as many people as possible. Our core values are timeless. They " \
"the core principles that distinguish our culture and serve as a compass for our actions and describe " \
"how we behave in the world.")
sentences
In [ ]:
tokens = word_tokenize(sentences[2])
In [ ]:
tokens
Apart from the grammar relations, every word in a sentence is also associated with a part-of-speech (POS) tag (noun, verb, adjective, adverb, etc.). The POS tag defines the usage and function of a word in the sentence. The Penn Treebank project at the University of Pennsylvania defines the standard list of possible POS tags. The following code uses NLTK to perform POS tagging on input text (NLTK provides several implementations; the default one is the averaged perceptron tagger).
In [ ]:
from nltk import pos_tag
# pos_tag is a classifier: given a token, it assigns a POS class.
# It comes pre-trained in the library, but we can also train our own tagger.
In [ ]:
tags = pos_tag(tokens)
In [ ]:
text = "I am using Data Science Experience at Florida Blue for Natural Language Processing"
tokens = word_tokenize(text)
print(pos_tag(tokens))
In [ ]:
# Let's apply this to our sample text from our website.
tags
Part of Speech tagging is used for many important purposes in NLP:
Some language words have multiple meanings according to their usage. For example, in the two sentences below:
I. “Please book my flight for Delhi”
II. “I am going to read this book in the flight”
“Book” is used in a different context in each sentence, and the part-of-speech tag for the two cases is different: in sentence I, the word “book” is used as a verb, while in II it is used as a noun. (The Lesk algorithm is also used for similar word sense disambiguation purposes.)
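As a quick illustration, NLTK ships an implementation of the Lesk algorithm that picks the WordNet sense of an ambiguous word best matching its context; the sketch below simply applies it to the two sentences above.
In [ ]:
from nltk import word_tokenize
from nltk.wsd import lesk

# Lesk chooses the WordNet synset whose definition overlaps most with the context words.
print(lesk(word_tokenize("Please book my flight for Delhi"), "book"))
print(lesk(word_tokenize("I am going to read this book in the flight"), "book"))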
A learning model could learn the different contexts of a word when the words themselves are used as features; however, if the part-of-speech tag is linked with them, the context is preserved, making the features stronger. For example:
Sentence -“book my flight, I will read this book”
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
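A minimal sketch of how such POS-augmented tokens could be built with NLTK (the helper name tokens_with_pos is ours, not part of the library):
In [ ]:
from nltk import word_tokenize, pos_tag

# Append the POS tag to each token so that "book"/VB and "book"/NN stay distinct features.
def tokens_with_pos(text):
    return ["{}_{}".format(word, tag) for word, tag in pos_tag(word_tokenize(text))]

tokens_with_pos("book my flight, I will read this book")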
POS tags are the basis of the lemmatization process for converting a word to its base form (lemma).
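One common (illustrative) pattern is to map Penn Treebank tags to WordNet POS constants so the lemmatizer picks the right base form; the penn_to_wordnet helper below is our own sketch, not an NLTK function.
In [ ]:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

# Map a Penn Treebank tag to the corresponding WordNet POS constant (defaulting to noun).
def penn_to_wordnet(tag):
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lem = WordNetLemmatizer()
[lem.lemmatize(word, penn_to_wordnet(tag)) for word, tag in pos_tag(word_tokenize("he was reading better books"))]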
POS tags are also useful in efficient removal of stopwords.
There are some tags which typically mark the low-frequency / less important words of a language, for example: (IN – “within”, “upon”, “except”), (CD – “one”, “two”, “hundred”), (MD – “may”, “must”, etc.).
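A small sketch of tag-based filtering (here we simply assume the IN, CD and MD tags mark removable tokens; the example sentence is our own):
In [ ]:
from nltk import word_tokenize, pos_tag

# Drop tokens whose POS tag is in the set of low-information tags.
stop_tags = {"IN", "CD", "MD"}
tagged = pos_tag(word_tokenize("I must submit one report within two days"))
[word for word, tag in tagged if tag not in stop_tags]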
In linguistics, a word sense is one of the meanings of a word. Until now, we have worked with tokens and POS tags. For instance, in "the man sat down on the bank near the river.", the token [bank] could be a financial institution, or the natural slope of land where the river meets the shore.
Let's see some functions for handling the meanings of tokens. WordNet provides the concept of synsets: sets of synonyms that serve as semantic units for tokens.
In [ ]:
from nltk.corpus import wordnet as wn #loading wordnet module
wn.synsets('human')
In [ ]:
wn.synsets('human')[0].definition()
In [ ]:
wn.synsets('human')[1].definition()
In [ ]:
human = wn.synsets('Human',pos=wn.NOUN)[0]
human
In [ ]:
human.hypernyms()
In [ ]:
human.hyponyms()
In [ ]:
bike = wn.synsets('bicycle')[0]
bike
In [ ]:
girl = wn.synsets('girl')[1]
girl
In [ ]:
bike.wup_similarity(human)
In [ ]:
girl.wup_similarity(human)
Chunking is the process of grouping patterns of part-of-speech tags together so that they represent some meaning: an analysis of a sentence which identifies its constituents (noun groups – "[The red tree] grows near the river" – verbs, verb groups, etc.).
Our goal is to detect, in digital text, things like "where different entities are located" or "which person is employed by what organization". It is the way in which we extract structured data (entities and relations) from unstructured text.
In [ ]:
from nltk import word_tokenize, pos_tag
from nltk.chunk import RegexpParser

# Chunk grammar: an NP chunk starts at a determiner followed by a noun and runs up to a final noun;
# the chink rule }<VB.*>{ removes any verbs from inside a chunk.
chunker = RegexpParser(r'''
    NP:
        {<DT><NN.*><.*>*<NN.*>}
        }<VB.*>{
    ''')
In [ ]:
print(tags)
print(chunker.parse(tags))
In [ ]:
from nltk.chunk import ne_chunk
In [ ]:
sentence = "Daryl A. is the head of the coworking place Commoncode Corp. from where many people work in Melbourne, Australia."
pos_tags = pos_tag(word_tokenize(sentence))
pos_tags
In [ ]:
from IPython.display import display
display(pos_tags)
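ne_chunk, imported above, groups the POS-tagged tokens into a tree whose subtrees are labelled named entities such as PERSON, ORGANIZATION and GPE. A minimal sketch (it requires the maxent_ne_chunker and words NLTK resources downloaded earlier):
In [ ]:
# Group the POS-tagged tokens into named-entity chunks and print the resulting tree.
tree = ne_chunk(pos_tags)
print(tree)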
Is a phrase expressing a positive opinion or a negative one? How can we measure that? We will decompose sentences into their smallest units, tokens, and measure how they are distributed across positive and negative sentences in the text.
In [ ]:
# We will use movie reviews that are already separated into positive and negative. Of special interest
# are bigrams, pairs of words (word1, word2) that appear consecutively in positive or negative reviews.
from nltk.corpus import movie_reviews
In [ ]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

# Given a list of words, return a dict {word: True}. These dicts are our features for the classifier.
def word_feats(words):
    return dict([(word, True) for word in words])

# neg_ids and pos_ids hold the file ids of the negative and positive reviews respectively.
neg_ids = movie_reviews.fileids('neg')
pos_ids = movie_reviews.fileids('pos')

# Take the words of each review, build its feature dict, and label it 'neg' or 'pos'.
neg_feats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
pos_feats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]

# Use 3/4 of the labeled reviews for training and 1/4 for testing.
neg_len_train = len(neg_feats) * 3 // 4
pos_len_train = len(pos_feats) * 3 // 4
train_feats = neg_feats[:neg_len_train] + pos_feats[:pos_len_train]
test_feats = neg_feats[neg_len_train:] + pos_feats[pos_len_train:]

# Train a Naive Bayes classifier on the training features.
classifier = NaiveBayesClassifier.train(train_feats)

# Let's check accuracy.
print('accuracy: ', nltk.classify.util.accuracy(classifier, test_feats))

# Let's see which words are most informative for each class.
classifier.show_most_informative_features()
In [ ]:
# So we trained a classifier for movie reviews. For every word seen in training, it knows how
# probable that word is in negative reviews, P(w|neg), and in positive reviews, P(w|pos) (Bayes' theorem).
In [ ]:
sentence = "Florida Blue, movie is incredible!"
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
pos_tags
In [ ]:
feats = word_feats( [word for (word,_) in pos_tags] )
feats
In [ ]:
classifier.classify(feats)
In [ ]:
sentence = "This is a miserable experience, and I just want to leave and be a lumberjack."
tokens = [word for word in word_tokenize(sentence)]
pos_tags = [pos for pos in pos_tag(tokens) if pos[1] == 'JJ']
pos_tags
In [ ]:
feats = word_feats( [word for (word,_) in pos_tags] )
feats
In [ ]:
classifier.classify(feats)