In [ ]:
import nltk
nltk.download()

Working with text!

Sentiment Analysis

Identify entities and emotions in a sentence and use these to determine if the entity is being viewed positively or negatively

Easy examples

  • I had an excellent souffle at the restaurant Cavity Maker
  • Excellent is a positive word for both the souffle and the restaurant

Not so easy examples

    Often, looking at words alone is not enough to figure out the sentiment

  • The Girl on the Train is an excellent book for a ‘stuck at home’ snow day
  • This one is easy since it includes an explicit positive opinion using a positive word

  • The Girl on the Train is an excellent book for using as a liner for your cat’s litter box
  • Not so simple! The positive word "excellent" is used with a negative connotation.

  • The Girl on the Train is better than Gone Girl
  • Here the positive word is used as a comparator: whether the writer likes The Girl on the Train depends on what he or she thinks of Gone Girl. A plain count of positive words misses all of these distinctions (see the sketch below)
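
    To make this concrete, here is a minimal sketch (with a toy one-word lexicon, not one of the real sources listed later) showing that a bare positive-word counter gives the complimentary and the sarcastic review of The Girl on the Train exactly the same score:

    In [ ]:
    #Toy lexicon for illustration only
    toy_positive_words = {'excellent'}
    reviews = ["The Girl on the Train is an excellent book for a 'stuck at home' snow day",
               "The Girl on the Train is an excellent book for using as a liner for your cat's litter box"]
    for review in reviews:
        score = sum(word in toy_positive_words for word in review.lower().split())
        print(score, review)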

    Bottom line

    Sentiment analysis is generally a starting point in analyzing a text and is then coupled with other techniques (e.g., topic analysis)

    Sentiment analysis is usually done using a corpus of positive and negative words

  • Some sources compile lists of positive and negative words
  • Others include the polarity - the degree of positivity or negativity - of each word
  • Sources of sentiment coded words

    1. Hu and Liu's sentiment analysis lexicon: words coded as either positive or negative
      • http://ptrckprry.com/course/ssd/data/positive-words.txt
      • http://ptrckprry.com/course/ssd/data/negative-words.txt
    2. NRC Emotion Lexicon: words coded into emotional categories (many languages)
      • http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
    3. SentiWordNet: Lists of words weighted by positive or negative sentiment. Includes guidance on how to use the words
      • http://sentiwordnet.isti.cnr.it/
    4. VADER sentiment tool: roughly 7,500 words and emoticons weighted by positive or negative polarity
      • Included with Python's nltk (a quick example follows this list)
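
    As a quick illustration of item 4 (a minimal sketch; it assumes the bundled vader_lexicon resource has been downloaded), the copy of VADER that ships with nltk scores a sentence directly and returns polarity-weighted results rather than a simple positive/negative label:

    In [ ]:
    import nltk
    nltk.download('vader_lexicon')   #one-time download of the bundled lexicon
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    #'neg', 'neu', and 'pos' are proportions; 'compound' is a normalized score in [-1, 1]
    print(sia.polarity_scores("I had an excellent souffle at the restaurant Cavity Maker"))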

    Our examples

  • Compiled set of 15 reviews each of four neighborhood restaurants
  • Presidential inaugural addresses (from Washington to Trump)
  • Some data from Yelp (very limited!)

    Simple sentiment analysis

    Compute the proportion of positive and negative words in a text

    
    
    In [ ]:
    def get_words(url):
        #Download a word list and drop comment lines (those containing ';') and blank lines
        import requests
        words = requests.get(url).content.decode('latin-1')
        word_list = words.split('\n')
        index = 0
        while index < len(word_list):
            word = word_list[index]
            if ';' in word or not word:
                word_list.pop(index)
            else:
                index+=1
        return word_list
    
    #Get lists of positive and negative words
    p_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
    n_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
    positive_words = get_words(p_url)
    negative_words = get_words(n_url)
    

    Read the text being analyzed and count the proportion of positive and negative words in the text

    
    
    In [ ]:
    with open('data/community.txt','r') as f:
        community = f.read()
    with open('data/le_monde.txt','r') as f:
        le_monde = f.read()
    

    Compute sentiment by looking at the proportion of positive and negative words in the text

    
    
    In [ ]:
    from nltk import word_tokenize
    #Tokenize each review collection once, then count hits against the positive and negative lists
    community_words = word_tokenize(community)
    le_monde_words = word_tokenize(le_monde)
    cpos = cneg = lpos = lneg = 0
    for word in community_words:
        if word in positive_words:
            cpos+=1
        if word in negative_words:
            cneg+=1
    for word in le_monde_words:
        if word in positive_words:
            lpos+=1
        if word in negative_words:
            lneg+=1
    #Report the percentage of positive words, negative words, and the net (positive - negative)
    print("community {0:1.2f}%\t {1:1.2f}%\t {2:1.2f}%".format(cpos/len(community_words)*100,
                                                            cneg/len(community_words)*100,
                                                            (cpos-cneg)/len(community_words)*100))
    print("le monde  {0:1.2f}%\t {1:1.2f}%\t {2:1.2f}%".format(lpos/len(le_monde_words)*100,
                                                            lneg/len(le_monde_words)*100,
                                                            (lpos-lneg)/len(le_monde_words)*100))
    

    Simple sentiment analysis using NRC data

  • The NRC data codes each word with its associated emotions and sentiments
  • 14,182 words are coded into 2 sentiments and 8 emotions
  • For example, the word abandoned is associated with anger, fear, and sadness, and has a negative sentiment

  • abandoned anger 1
  • abandoned anticipation 0
  • abandoned disgust 0
  • abandoned fear 1
  • abandoned joy 0
  • abandoned negative 1
  • abandoned positive 0
  • abandoned sadness 1
  • abandoned surprise 0
  • abandoned trust 0
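
    Each line of the lexicon file is tab separated, so a single row parses like this (a minimal sketch of the format shown above):

    In [ ]:
    line = "abandoned\tanger\t1"
    word, emotion, flag = line.strip().split('\t')
    print(word, emotion, int(flag))   #-> abandoned anger 1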
    Read the NRC sentiment data

    
    
    In [ ]:
    nrc = "data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
    count=0
    emotion_dict=dict()
    with open(nrc,'r') as f:
        for line in f:
            if count < 46:                      #skip the documentation lines at the top of the file
                count+=1
                continue
            line = line.strip().split('\t')     #each data line is: word <TAB> emotion <TAB> 0/1
            if int(line[2]) == 1:               #keep only the emotions flagged for this word
                if emotion_dict.get(line[0]):
                    emotion_dict[line[0]].append(line[1])
                else:
                    emotion_dict[line[0]] = [line[1]]
    

    Functionalize this

    
    
    In [ ]:
    def get_nrc_data():
        #Build a word -> [emotions/sentiments] dictionary from the NRC lexicon file
        nrc = "data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
        count=0
        emotion_dict=dict()
        with open(nrc,'r') as f:
            for line in f:
                if count < 46:                  #skip the documentation lines at the top of the file
                    count+=1
                    continue
                line = line.strip().split('\t')
                if int(line[2]) == 1:
                    if emotion_dict.get(line[0]):
                        emotion_dict[line[0]].append(line[1])
                    else:
                        emotion_dict[line[0]] = [line[1]]
        return emotion_dict
    
    
    
    In [ ]:
    emotion_dict = get_nrc_data()
    emotion_dict['abandoned']
    

    Analyzing yelp reviews

    Caveat: We're only looking at one review snippet for each restaurant

    1. Install the yelp Python library: pip install yelp
    2. Register with yelp https://www.yelp.com/developers/manage_api_keys (use anything for the host)
    3. Copy the various keys into variables as below

    We'll see what we can figure out about reviews of restaurants close to Columbia

    First let's read in the yelp keys

    
    
    In [ ]:
    CONSUMER_KEY = ""
    CONSUMER_SECRET = ""
    TOKEN = ""
    TOKEN_SECRET = ""
    

    I've saved my keys in a file and will use those. Don't run the next cell!

    
    
    In [ ]:
    with open('yelp_keys.txt','r') as f:
        count = 0
        for line in f:
            if count == 0:
                CONSUMER_KEY = line.strip()
            if count == 1:
                CONSUMER_SECRET = line.strip()
            if count == 2:
                TOKEN = line.strip()
            if count == 3:
                TOKEN_SECRET = line.strip()
            count+=1
    

    We need to do a few things

    • Get the latitude and longitude of our location
    • Set up the parameters for what data we want from yelp
    • Query yelp by passing authentication info as well as our parameters
    • Extract review snippets
    • And append them to a list containing (restaurant, review snippet) tuples
    
    
    In [ ]:
    #We'll use the get_lat_lng function we wrote way back in week 3
    def get_lat_lng(address):
        url = 'https://maps.googleapis.com/maps/api/geocode/json?address='
        url += address
        import requests
        response = requests.get(url)
        if not (response.status_code == 200):
            return None
        data = response.json()
        if not( data['status'] == 'OK'):
            return None
        main_result = data['results'][0]
        geometry = main_result['geometry']
        latitude = geometry['location']['lat']
        longitude = geometry['location']['lng']
        return latitude,longitude
    
    
    
    In [ ]:
    lat,long = get_lat_lng("Columbia University")
    
    
    
    In [ ]:
    #Now set up our search parameters
    def set_search_parameters(lat,long,radius):
      #See the Yelp API for more details
        params = {}
        params["term"] = "restaurant"
        params["ll"] = "{},{}".format(str(lat),str(long))
        params["radius_filter"] = str(radius) #The distance around our point in metres
        params["limit"] = "10" #Limit ourselves to 10 results
     
        return params
    
    
    
    In [ ]:
    set_search_parameters(lat,long,200)
    

    Write the function that queries yelp. We'll use the rauth library to handle authentication

    !pip install rauth

    
    
    In [ ]:
    def get_results(params):
        import rauth
        consumer_key = CONSUMER_KEY
        consumer_secret = CONSUMER_SECRET
        token = TOKEN
        token_secret = TOKEN_SECRET
    
        session = rauth.OAuth1Session(
        consumer_key = consumer_key
        ,consumer_secret = consumer_secret
        ,access_token = token
        ,access_token_secret = token_secret)
    
        request = session.get("http://api.yelp.com/v2/search",params=params)
        #Transforms the JSON API response into a Python dictionary
        data = request.json()
        session.close()
    
        return data
    
    
    
    In [ ]:
    #Get the results (geocode the restaurant once, then search within 200 metres of it)
    lat,long = get_lat_lng("Community Food and Juice")
    response = get_results(set_search_parameters(lat,long,200))
    

    Extract snippets

    
    
    In [ ]:
    all_snippets = list()
    for business in response['businesses']:
        name = business['name']
        snippet = business['snippet_text']
        id = business['id']
        all_snippets.append((id,name,snippet))
    all_snippets
    

    Functionalize this

    
    
    In [ ]:
    def get_snippets(response):
        all_snippets = list()
        for business in response['businesses']:
            name = business['name']
            snippet = business['snippet_text']
            id = business['id']
            all_snippets.append((id,name,snippet))
        return all_snippets
    

    A function that analyzes emotions

    
    
    In [ ]:
    def emotion_analyzer(text,emotion_dict=emotion_dict):
        #Set up the result dictionary with every emotion/sentiment label that appears in the lexicon
        emotions = {x for y in emotion_dict.values() for x in y}
        emotion_count = dict()
        for emotion in emotions:
            emotion_count[emotion] = 0
    
        #Analyze the text and normalize by the total number of words
        total_words = len(text.split())
        for word in text.split():
            if emotion_dict.get(word):
                for emotion in emotion_dict.get(word):
                    emotion_count[emotion] += 1/total_words
        return emotion_count
    

    Now we can analyze the emotional content of the review snippets

    
    
    In [ ]:
    print("%-12s %1s\t%1s %1s %1s %1s   %1s %1s %1s %1s"%(
            "restaurant","fear","trust","negative","positive","joy","disgust","anticip",
            "sadness","surprise"))
            
    for snippet in all_snippets:
        text = snippet[2]
        result = emotion_analyzer(text)
        print("%-12s %1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f"%(
            snippet[1][0:10],result['fear'],result['trust'],
              result['negative'],result['positive'],result['joy'],result['disgust'],
              result['anticipation'],result['sadness'],result['surprise']))
    

    Let's functionalize this

    
    
    In [ ]:
    def comparative_emotion_analyzer(text_tuples):
        print("%-20s %1s\t%1s %1s %1s %1s   %1s %1s %1s %1s"%(
                "restaurant","fear","trust","negative","positive","joy","disgust","anticip",
                "sadness","surprise"))
            
        for text_tuple in text_tuples:
            text = text_tuple[2] 
            result = emotion_analyzer(text)
            print("%-20s %1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f"%(
                text_tuple[1][0:20],result['fear'],result['trust'],
                  result['negative'],result['positive'],result['joy'],result['disgust'],
                  result['anticipation'],result['sadness'],result['surprise']))
            
    #And test it        
    comparative_emotion_analyzer(all_snippets)
    

    And let's functionalize the yelp stuff as well

    
    
    In [ ]:
    def analyze_nearby_restaurants(address,radius):
        lat,long = get_lat_lng(address)
        params = set_search_parameters(lat,long,radius)
        response = get_results(params)
        snippets = get_snippets(response)
        comparative_emotion_analyzer(snippets)
    
    #And test it    
    analyze_nearby_restaurants("Community Food and Juice",200)
    
    
    
    In [ ]:
    #Test it on some other place
    analyze_nearby_restaurants("221 Baker Street",200)
    

    Simple analysis: Word Clouds

    Let's see what sort of words the snippets use

  • First we'll combine all snippets into one string
  • Then we'll generate a word cloud using the words in the string
  • You may need to install wordcloud using pip
  • pip install wordcloud
    
    In [ ]:
    all_snippets
    
    
    
    In [ ]:
    text=''
    for snippet in all_snippets:
        text+=snippet[2]
    text
    
    
    
    In [ ]:
    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    wordcloud = WordCloud(stopwords=STOPWORDS,background_color='white',width=3000,height=3000).generate(text)
    
    
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
    

    Let's do a detailed comparison of local restaurants

    I've saved a few reviews for each restaurant in four directories

    We'll use the PlaintextCorpusReader to read these directories

  • PlaintextCorpusReader reads all matching files in a directory and makes them accessible by file ids
    
    In [ ]:
    import nltk
    from nltk.corpus import PlaintextCorpusReader
    community_root = "data/community"
    le_monde_root = "data/le_monde"
    community_files = "community.*"
    le_monde_files = "le_monde.*"
    heights_root = "data/heights"
    heights_files = "heights.*"
    amigos_root = "data/amigos"
    amigos_files = "amigos.*"
    community_data = PlaintextCorpusReader(community_root,community_files)
    le_monde_data = PlaintextCorpusReader(le_monde_root,le_monde_files)
    heights_data = PlaintextCorpusReader(heights_root,heights_files)
    amigos_data = PlaintextCorpusReader(amigos_root,amigos_files)
    
    
    
    In [ ]:
    amigos_data.fileids()
    
    
    
    In [ ]:
    amigos_data.raw()
    

    We need to modify comparative_emotion_analyzer to tell it where the restaurant name and the text are in the tuple

    
    
    In [ ]:
    def comparative_emotion_analyzer(text_tuples,name_location=1,text_location=2):
        print("%-20s %1s\t%1s %1s %1s %1s   %1s %1s %1s %1s"%(
                "restaurant","fear","trust","negative","positive","joy","disgust","anticip",
                "sadness","surprise"))
            
        for text_tuple in text_tuples:
            text = text_tuple[text_location] 
            result = emotion_analyzer(text)
            print("%-20s %1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f\t%1.2f"%(
                text_tuple[name_location][0:20],result['fear'],result['trust'],
                  result['negative'],result['positive'],result['joy'],result['disgust'],
                  result['anticipation'],result['sadness'],result['surprise']))
            
    #And test it        
    comparative_emotion_analyzer(all_snippets)
    
    
    
    In [ ]:
    restaurant_data = [('community',community_data.raw()),('le monde',le_monde_data.raw())
                      ,('heights',heights_data.raw()), ('amigos',amigos_data.raw())]
    comparative_emotion_analyzer(restaurant_data,0,1)
    

    Simple Analysis: Complexity

    We'll look at a few complexity measures

  • average word length: longer words add to complexity
  • average sentence length: longer sentences are more complex (unless the text is rambling!)
  • vocabulary: the ratio of unique words used to the total number of words (more variety, more complexity)

    A token is a sequence (or group) of characters of interest; in the analysis below, a token is a word. Generally, a token is the base unit of analysis, so the first step is to convert the text into tokens and an nltk Text object (see the short example below).
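
    For a single snippet, the tokenizers return something like this (a minimal sketch; the exact splits come from nltk's default sentence and word tokenizers):

    In [ ]:
    from nltk import sent_tokenize, word_tokenize
    sample = "I can't wait to go back. The souffle was excellent!"
    print(sent_tokenize(sample))   #two sentence tokens
    print(word_tokenize(sample))   #word tokens; punctuation and the "n't" contraction become separate tokens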
    
    In [ ]:
    #Construct tokens (words/sentences) from the text
    text = le_monde_data.raw()
    import nltk
    from nltk import sent_tokenize,word_tokenize 
    sentences = nltk.Text(sent_tokenize(text))
    print(len(sentences))
    words = nltk.Text(word_tokenize(text))
    print(len(words))
    
    
    
    In [ ]:
    num_chars=len(text)
    num_words=len(word_tokenize(text))
    num_sentences=len(sent_tokenize(text))
    vocab = {x.lower() for x in word_tokenize(text)}
    print(num_chars,int(num_chars/num_words),int(num_words/num_sentences),(len(vocab)/num_words))
    

    Functionalize this

    
    
    In [ ]:
    def get_complexity(text):
        num_chars=len(text)
        num_words=len(word_tokenize(text))
        num_sentences=len(sent_tokenize(text))
        vocab = {x.lower() for x in word_tokenize(text)}
        return len(vocab),int(num_chars/num_words),int(num_words/num_sentences),len(vocab)/num_words
    
    
    
    In [ ]:
    get_complexity(le_monde_data.raw())
    
    
    
    In [ ]:
    for text in restaurant_data:
        (vocab,word_size,sent_size,vocab_to_text) = get_complexity(text[1])
        print("{0:15s}\t{1:1.2f}\t{2:1.2f}\t{3:1.2f}\t{4:1.2f}".format(text[0],vocab,word_size,sent_size,vocab_to_text))
    

    We could do a word cloud comparison

    We'll remove short words and look only at words longer than 6 letters

    
    
    In [ ]:
    texts = restaurant_data
    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt
    %matplotlib inline
    #Remove unwanted words
    #As we look at the cloud, we can get rid of words that don't make sense by adding them to this variable
    DELETE_WORDS = []
    def remove_words(text_string,DELETE_WORDS=DELETE_WORDS):
        for word in DELETE_WORDS:
            text_string = text_string.replace(word,' ')
        return text_string
    
    #Remove short words
    MIN_LENGTH = 7 #filter out words shorter than 7 letters, i.e., keep only words longer than 6
    def remove_short_words(text_string,min_length = MIN_LENGTH):
        word_list = text_string.split()
        for word in word_list:
            if len(word) < min_length:
                text_string = text_string.replace(' '+word+' ',' ',1)
        return text_string
    
    
    #Set up side by side clouds
    COL_NUM = 2
    ROW_NUM = 2
    fig, axes = plt.subplots(ROW_NUM, COL_NUM, figsize=(12,12))
    
    for i in range(0,len(texts)):
        text_string = remove_words(texts[i][1])
        text_string = remove_short_words(text_string)
        #ax = axes[i%2] #Use this indexing if ROW_NUM == 1
        ax = axes[i//2, i%2] #axes is a 2-D array when ROW_NUM >= 2
        ax.set_title(texts[i][0])
        wordcloud = WordCloud(stopwords=STOPWORDS,background_color='white',width=1200,height=1000,max_words=20).generate(text_string)
        ax.imshow(wordcloud)
        ax.axis('off')
    plt.show()
    

    Comparing complexity of restaurant reviews won't get us anything useful

    Let's look at something more useful

    nltk: Python's natural language toolkit

    nltk ships with a large collection of pre-tokenized corpora

    Load it using the command:

    nltk.download()
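
    If you would rather not use the interactive downloader, you can fetch specific resources by name (a minimal sketch; 'book' is the collection used by nltk.book and 'punkt' is the sentence tokenizer model):

    In [ ]:
    import nltk
    nltk.download('book')    #corpora and models used by nltk.book
    nltk.download('punkt')   #Punkt sentence tokenizer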

    Import the corpora

    
    
    In [ ]:
    from nltk.book import *
    

    Often, a comparative analysis helps us understand text better

    Let's look at US Presidential inaugural speeches

    Copy the files 2013-Obama.txt and 2017-Trump.txt to the nltk_data/corpora/inaugural directory. nltk_data is usually under your home directory

    
    
    In [ ]:
    inaugural.fileids()
    
    
    
    In [ ]:
    inaugural.raw('1861-Lincoln.txt')
    

    Let's look at the complexity of the speeches by four presidents

    
    
    In [ ]:
    texts = [('trump',inaugural.raw('2017-Trump.txt')),
             ('obama',inaugural.raw('2009-Obama.txt')+inaugural.raw('2013-Obama.txt')),
             ('jackson',inaugural.raw('1829-Jackson.txt')+inaugural.raw('1833-Jackson.txt')),
             ('washington',inaugural.raw('1789-Washington.txt')+inaugural.raw('1793-Washington.txt'))]
    for text in texts:
        (vocab,word_size,sent_size,vocab_to_text) = get_complexity(text[1])
        print("{0:15s}\t{1:1.2f}\t{2:1.2f}\t{3:1.2f}\t{4:1.2f}".format(text[0],vocab,word_size,sent_size,vocab_to_text))
    

    Analysis over time

    The files are arranged over time so we can analyze how complexity has changed between Washington and Trump

    
    
    In [ ]:
    from nltk.corpus import inaugural
    sentence_lengths = list()
    for fileid in inaugural.fileids():
        sentence_lengths.append(get_complexity(' '.join(inaugural.words(fileid)))[2])
    plt.plot(sentence_lengths)
    

    dispersion plots

    Dispersion plots show where (and how often) each word occurs across the running text

    Let's see how the frequency of some words has changed over the course of the republic

    That should give us some idea of how the focus of the nation has changed

    
    
    In [ ]:
    text4.dispersion_plot(["government", "citizen", "freedom", "duties", "America",'independence','God','patriotism'])
    

    We may want to use word stems rather than the different inflected forms

  • For example: patriot, patriotic, patriotism all express roughly the same idea
  • nltk has a stemmer that implements the "Porter Stemming Algorithm" (https://tartarus.org/martin/PorterStemmer/)
  • We'll push everything to lowercase as well
    
    In [ ]:
    from nltk.stem.porter import PorterStemmer
    p_stemmer = PorterStemmer()
    text = inaugural.raw()
    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')
    sentences = sent_tokenize(striptext)
    words = word_tokenize(striptext)
    text = nltk.Text([p_stemmer.stem(i).lower() for i in words])
    text.dispersion_plot(["govern", "citizen", "free", "america",'independ','god','patriot'])
    

    Weighted word analysis using VADER

    VADER contains a list of about 7,500 lexical features weighted by how positive or negative they are

    It uses these features to calculate how positive, negative, and neutral a passage is

    And it combines these results into a compound sentiment score (higher = more positive) for the passage

    It was trained on human-rated Twitter data and is generally considered good for informal communication

    10 human raters scored each feature, in context, on a scale from -4 to +4

    It calculates the sentiment of a sentence using word-order-sensitive heuristics

  • "marginally good" will get a lower positive score than "extremely good"

    Computes a "compound" score based on heuristics (between -1 and +1)

    Includes sentiment of emoticons, punctuation, and other 'social media' lexicon elements

    
    In [ ]:
    !pip install vaderSentiment
    
    
    
    In [ ]:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
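
    To see the intensity handling in action, compare a dampened and a boosted phrase (a minimal sketch; the exact numbers depend on the lexicon version installed):

    In [ ]:
    analyzer = SentimentIntensityAnalyzer()
    #"marginally" dampens the positive score while "extremely" boosts it
    print(analyzer.polarity_scores("The food was marginally good"))
    print(analyzer.polarity_scores("The food was extremely good"))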
    
    
    
    In [ ]:
    headers = ['pos','neg','neu','compound']
    texts = restaurant_data
    analyzer = SentimentIntensityAnalyzer()
    for i in range(len(texts)):
        name = texts[i][0]
        sentences = sent_tokenize(texts[i][1])
        pos=compound=neu=neg=0
        for sentence in sentences:
            vs = analyzer.polarity_scores(sentence)
            pos+=vs['pos']/(len(sentences))
            compound+=vs['compound']/(len(sentences))
            neu+=vs['neu']/(len(sentences))
            neg+=vs['neg']/(len(sentences))
        print(name,pos,neg,neu,compound)
    

    And functionalize this as well

    
    
    In [ ]:
    def vader_comparison(texts):
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        headers = ['pos','neg','neu','compound']
        print("Name\t",'  pos\t','neg\t','neu\t','compound')
        analyzer = SentimentIntensityAnalyzer()
        for i in range(len(texts)):
            name = texts[i][0]
            sentences = sent_tokenize(texts[i][1])
            pos=compound=neu=neg=0
            for sentence in sentences:
                vs = analyzer.polarity_scores(sentence)
                pos+=vs['pos']/(len(sentences))
                compound+=vs['compound']/(len(sentences))
                neu+=vs['neu']/(len(sentences))
                neg+=vs['neg']/(len(sentences))
            print('%-10s'%name,'%1.2f\t'%pos,'%1.2f\t'%neg,'%1.2f\t'%neu,'%1.2f\t'%compound)
    
    
    
    In [ ]:
    vader_comparison(restaurant_data)
    

    Named Entities

    People, places, organizations

    Named entities are often the subject of sentiments so identifying them can be very useful

    Named entity detection is based on Part-of-speech tagging of words and chunks (groups of words)

  • Start with sentences (using a sentence tokenizer)
  • Tokenize the words in each sentence and tag them with their parts of speech
  • Chunk them: ne_chunk identifies likely named-entity candidates (ne = named entity)
  • Finally, collect the chunks along with nltk's guess at what each one represents (person, place, organization)
    
    In [ ]:
    en={}   #maps each named entity string to [its label, the POS tags of its words]
    try:
        sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = sent_detector.tokenize(community_data.raw().strip())
        for sentence in sentences:
                tokenized = nltk.word_tokenize(sentence)
                tagged = nltk.pos_tag(tokenized)
                chunked = nltk.ne_chunk(tagged)
                for tree in chunked:
                    if hasattr(tree, 'label'):
                        ne = ' '.join(c[0] for c in tree.leaves())
                        en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
    except Exception as e:
        print(str(e))
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(en)
    

    Assuming we've done a good job of identifying named entities, we can get an affect score on entities

    
    
    In [ ]:
    meaningful_sents = list()
    i=0
    for sentence in sentences:
        if 'service' in sentence:
            i+=1
            meaningful_sents.append((i,sentence))
    
    vader_comparison(meaningful_sents)
    

    We could also develop an affect calculator for common terms in our domain (e.g., food items)

    
    
    In [ ]:
    def get_affect(text,word,lower=False):
        import nltk
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()
        sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = sent_detector.tokenize(text.strip())
        sentence_count = 0
        running_total = 0
        for sentence in sentences:
            if lower: sentence = sentence.lower()
            if word in sentence:
                vs = analyzer.polarity_scores(sentence) 
                running_total += vs['compound']
                sentence_count += 1
        if sentence_count == 0: return 0
        return running_total/sentence_count
    
    
    
    In [ ]:
    get_affect(community_data.raw(),'service',True)
    

    The nltk function concordance returns text fragments around a word

    
    
    In [ ]:
    nltk.Text(community_data.words()).concordance('service',100)
    

    Text summarization

    Text summarization is useful because you can generate a short summary of a large piece of text automatically

    Then, these summaries can serve as an input into a topic analyzer to figure out what the main topic of the text is

    A naive form of summarization is to identify the most frequent words in a piece of text and use the occurrence of these words in sentences to rate the importance of a sentence.

    First the imports

    
    
    In [ ]:
    from nltk.tokenize import word_tokenize
    from nltk.tokenize import sent_tokenize
    from nltk.probability import FreqDist
    from nltk.corpus import stopwords
    from collections import OrderedDict
    import pprint
    

    Then prep the text. Get rid of end-of-line characters

    
    
    In [ ]:
    text = community_data.raw()
    summary_sentences = []
    candidate_sentences = {}
    candidate_sentence_counts = {}
    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')
    

    Construct a list of lowercased words after getting rid of stopwords and non-alphabetic tokens (numbers, punctuation)

    
    
    In [ ]:
    words = word_tokenize(striptext)
    lowercase_words = [word.lower() for word in words
                      if word.lower() not in stopwords.words() and word.isalpha()]
    

    Construct word frequencies and choose the most common n (20)

    
    
    In [ ]:
    word_frequencies = FreqDist(lowercase_words)
    most_frequent_words = word_frequencies.most_common(20)
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(most_frequent_words)
    

    lowercase the sentences

    candidate_sentences is a dictionary with the original sentence as the key, and its lowercase version as the value

    
    
    In [ ]:
    sentences = sent_tokenize(striptext)
    for sentence in sentences:
        candidate_sentences[sentence] = sentence.lower()
    candidate_sentences
    
    
    
    In [ ]:
    for long, short in candidate_sentences.items():
        count = 0
        for freq_word, frequency_score in most_frequent_words:
            if freq_word in short:
                count += frequency_score
                candidate_sentence_counts[long] = count
    
    
    
    In [ ]:
    sorted_sentences = OrderedDict(sorted(
                        candidate_sentence_counts.items(),
                        key = lambda x: x[1],   #sort by each sentence's frequency score
                        reverse = True)[:4])
    pp.pprint(sorted_sentences)

    Packaging all this into a function

    
    
    In [ ]:
    def build_naive_summary(text):
        from nltk.tokenize import word_tokenize
        from nltk.tokenize import sent_tokenize
        from nltk.probability import FreqDist
        from nltk.corpus import stopwords
        from collections import OrderedDict
        summary_sentences = []
        candidate_sentences = {}
        candidate_sentence_counts = {}
        striptext = text.replace('\n\n', ' ')
        striptext = striptext.replace('\n', ' ')
        words = word_tokenize(striptext)
        lowercase_words = [word.lower() for word in words
                          if word.lower() not in stopwords.words() and word.isalpha()]
        word_frequencies = FreqDist(lowercase_words)
        most_frequent_words = word_frequencies.most_common(20)
        sentences = sent_tokenize(striptext)
        for sentence in sentences:
            candidate_sentences[sentence] = sentence.lower()
        for long, short in candidate_sentences.items():
            count = 0
            for freq_word, frequency_score in most_frequent_words:
                if freq_word in short:
                    count += frequency_score
                    candidate_sentence_counts[long] = count   
        sorted_sentences = OrderedDict(sorted(
                            candidate_sentence_counts.items(),
                            key = lambda x: x[1],
                            reverse = True)[:4])
        return sorted_sentences
    
    
    
    In [ ]:
    summary = '\n'.join(build_naive_summary(community_data.raw()))
    print(summary)
    
    
    
    In [ ]:
    summary = '\n'.join(build_naive_summary(le_monde_data.raw()))
    print(summary)
    

    We can summarize George Washington's first inaugural speech

    
    
    In [ ]:
    build_naive_summary(inaugural.raw('1789-Washington.txt'))
    

    gensim: another text summarizer

    Gensim uses a network with sentences as nodes and 'lexical similarity' as weights on the arcs between nodes

    
    
    In [ ]:
    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt
    %matplotlib inline
    import nltk
    from nltk.corpus import PlaintextCorpusReader
    from nltk import sent_tokenize,word_tokenize 
    from nltk.book import *
    
    
    
    In [ ]:
    import nltk
    from nltk.corpus import PlaintextCorpusReader
    community_root = "data/community"
    le_monde_root = "data/le_monde"
    community_files = "community.*"
    le_monde_files = "le_monde.*"
    heights_root = "data/heights"
    heights_files = "heights.*"
    amigos_root = "data/amigos"
    amigos_files = "amigos.*"
    community_data = PlaintextCorpusReader(community_root,community_files)
    le_monde_data = PlaintextCorpusReader(le_monde_root,le_monde_files)
    heights_data = PlaintextCorpusReader(heights_root,heights_files)
    amigos_data = PlaintextCorpusReader(amigos_root,amigos_files)
    
    
    
    In [ ]:
    type(community_data)
    
    
    
    In [ ]:
    text = community_data.raw()
    summary_sentences = []
    candidate_sentences = {}
    candidate_sentence_counts = {}
    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')
    
    
    
    In [ ]:
    #!pip install gensim
    import gensim.summarization
    
    
    
    In [ ]:
    summary = gensim.summarization.summarize(striptext, word_count=100) 
    print(summary)
    
    
    
    In [ ]:
    print(gensim.summarization.keywords(striptext,words=10))
    
    
    
    In [ ]:
    summary = '\n'.join(build_naive_summary(community_data.raw()))
    print(summary)
    
    
    
    In [ ]:
    text = le_monde_data.raw()
    summary_sentences = []
    candidate_sentences = {}
    candidate_sentence_counts = {}
    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')
    summary = gensim.summarization.summarize(striptext, word_count=100) 
    print(summary)
    #print(gensim.summarization.keywords(striptext,words=10))
    

    Topic modeling

    The goal of topic modeling is to identify the major concepts underlying a piece of text

    Topic modeling uses "unsupervised learning": no a priori knowledge is necessary

  • Though it is helpful in cleaning up results!

    LDA: the Latent Dirichlet Allocation model

  • Identifies potential topics using pruning techniques like 'upward closure'
  • Computes conditional probabilities for topic word sets
  • Identifies the most likely topics
  • Does this over multiple passes, probabilistically picking topics in each pass
  • A good intuitive explanation: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
    
    In [ ]:
    from gensim import corpora
    from gensim.models.ldamodel import LdaModel
    from gensim.parsing.preprocessing import STOPWORDS
    import pprint
    

    Prepare the text

    
    
    In [ ]:
    text = PlaintextCorpusReader("data/","Nikon_coolpix_4300.txt").raw()
    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')
    sentences = sent_tokenize(striptext)
    #words = word_tokenize(striptext)
    #tokenize each sentence into word tokens
    texts = [[word for word in sentence.lower().split()
            if word not in STOPWORDS and word.isalnum()]
            for sentence in sentences]
    len(texts)
    

    Create a dictionary that maps each word to an integer id, and a bag-of-words corpus of (word id, count) pairs for each sentence

    
    
    In [ ]:
    print(text)
    
    
    
    In [ ]:
    text
    
    
    
    In [ ]:
    dictionary = corpora.Dictionary(texts) #maps each word to an integer id
    corpus = [dictionary.doc2bow(text) for text in texts] #(word_id,count) pairs for each sentence
    #print(dictionary.token2id)
    #print(dictionary.keys())
    #print(corpus[9])
    #print(texts[9])
    #print(dictionary[73])
    #dictionary[4]
    

    Do the LDA

    Parameters:

  • Number of topics: the number of topics you want generated. The larger the corpus, the more topics it can usually support
  • Passes: the number of passes the LDA model makes through the documents. More passes means a slower (but usually more stable) analysis
    
    In [ ]:
    #Set parameters
    num_topics = 5 #The number of topics that should be generated
    passes = 10
    
    
    
    In [ ]:
    lda = LdaModel(corpus,
                  id2word=dictionary,
                  num_topics=num_topics,
                  passes=passes)
    
    
    
    
    

    See results

    
    
    In [ ]:
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(lda.print_topics(num_words=3))
    

    Matching topics to documents

    Sort topics by probability

    We're using sentences as documents here, so this is less than ideal

    
    
    In [ ]:
    from operator import itemgetter
    lda.get_document_topics(corpus[0],minimum_probability=0.05,per_word_topics=False)
    sorted(lda.get_document_topics(corpus[0],minimum_probability=0,per_word_topics=False),key=itemgetter(1),reverse=True)
    

    Making sense of the topics

    Draw wordclouds

    
    
    In [ ]:
    def draw_wordcloud(lda,topicnum,min_size=0,STOPWORDS=[]):
        word_list=[]
        prob_total = 0
        for word,prob in lda.show_topic(topicnum,topn=50):
            prob_total +=prob
        for word,prob in lda.show_topic(topicnum,topn=50):
            if word in STOPWORDS or  len(word) < min_size:
                continue
            freq = int(prob/prob_total*1000)
            alist=[word]
            word_list.extend(alist*freq)
    
        from wordcloud import WordCloud, STOPWORDS
        import matplotlib.pyplot as plt
        %matplotlib inline
        text = ' '.join(word_list)
        wordcloud = WordCloud(stopwords=STOPWORDS,background_color='white',width=3000,height=3000).generate(' '.join(word_list))
    
    
        plt.imshow(wordcloud)
        plt.axis('off')
        plt.show()
    
    
    
    In [ ]:
    draw_wordcloud(lda,2)
    

    Roughly,

  • LDA looks for candidate topics, assuming that there are many such candidates
  • looks for words related to the candidate topics
  • assigns probabilities to those words

    Let's look at Presidential addresses to see what sorts of topics emerge from there

  • Each document will be analyzed for topic
  • The corpus will consist of 58 documents, one per presidential address
    
    In [ ]:
    REMOVE_WORDS = {'shall','generally','spirit','country','people','nation','nations','great','better'}
    
    #Create a corpus of documents, one per inaugural address, keeping only informative words
    text_list = list()
    for fileid in inaugural.fileids():
        text = inaugural.words(fileid)
        doc=list()
        for word in text:
            if word in STOPWORDS or word in REMOVE_WORDS or not word.isalpha() or len(word) <5:
                continue
            doc.append(word)
        text_list.append(doc)
    
    #Create a word dictionary (id, word) from the addresses themselves
    dictionary = corpora.Dictionary(text_list)
    by_address_corpus = [dictionary.doc2bow(text) for text in text_list]
    

    Create the model

    
    
    In [ ]:
    lda = LdaModel(by_address_corpus,
                  id2word=dictionary,
                  num_topics=20,
                  passes=10)
    
    
    
    In [ ]:
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(lda.print_topics(num_words=10))
    

    We can now compare presidential addresses by topic

    
    
    In [ ]:
    len(by_address_corpus)
    
    
    
    In [ ]:
    from operator import itemgetter
    sorted(lda.get_document_topics(by_address_corpus[0],minimum_probability=0,per_word_topics=False),key=itemgetter(1),reverse=True)
    
    
    
    In [ ]:
    draw_wordcloud(lda,18)
    
    
    
    In [ ]:
    print(lda.show_topic(12,topn=5))
    print(lda.show_topic(18,topn=5))
    

    Similarity

    Given a corpus of documents, when a new document arrives, find the document that is the most similar

    
    
    In [ ]:
    doc_list = [community_data,le_monde_data,amigos_data,heights_data]
    all_text = community_data.raw() + le_monde_data.raw() + amigos_data.raw() + heights_data.raw()
    
    documents = [doc.raw() for doc in doc_list]
    texts = [[word for word in document.lower().split()
            if word not in STOPWORDS and word.isalnum()]
            for document in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    
    
    
    In [ ]:
    from gensim.similarities.docsim import Similarity
    from gensim import corpora, models, similarities
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
    doc = """
    Many, many years ago, I used to frequent this place for their amazing french toast. 
    It's been a while since then and I've been hesitant to review a place I haven't been to in 7-8 years... 
    but I passed by French Roast and, feeling nostalgic, decided to go back.
    
    It was a great decision.
    
    Their Bloody Mary is fantastic and includes bacon (which was perfectly cooked!!), olives, 
    cucumber, and celery. The Irish coffee is also excellent, even without the cream which is what I ordered.
    
    Great food, great drinks, a great ambiance that is casual yet familiar like a tiny little French cafe. 
    I highly recommend coming here, and will be back whenever I'm in the area next.
    
    Juan, the bartender, is great!! One of the best in any brunch spot in the city, by far.
    """
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    index = similarities.MatrixSimilarity(lsi[corpus])
    sims = index[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    
    
    
    In [ ]:
    sims
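
    The entries in sims are (document index, cosine similarity) pairs; mapping the indices back to the corpora in doc_list makes the result easier to read (a small helper, assuming the doc_list ordering above):

    In [ ]:
    names = ['community', 'le monde', 'amigos', 'heights']   #same order as doc_list
    for doc_index, score in sims:
        print(names[doc_index], round(float(score), 3))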
    
    
    
    In [ ]:
    doc="""
    I went to Mexican Festival Restaurant for Cinco De Mayo because I had been there years 
    prior and had such a good experience. This time wasn't so good. The food was just 
    mediocre and it wasn't hot when it was brought to our table. They brought my friends food out 
    10 minutes before everyone else and it took forever to get drinks. We let it slide because the place was 
    packed with people and it was Cinco De Mayo. Also, the margaritas we had were slamming! Pure tequila. 
    
    But then things took a turn for the worst. As I went to get something out of my purse which was on 
    the back of my chair, I looked down and saw a huge water bug. I had to warn the lady next to me because 
    it was so close to her chair. We called the waitress over and someone came with a broom and a dustpan and 
    swept it away like it was an everyday experience. No one seemed phased.
    
    Even though our waitress was very nice, I do not think we will be returning to Mexican Festival again. 
    It seems the restaurant is a shadow of its former self.
    """
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    index = similarities.MatrixSimilarity(lsi[corpus])
    sims = index[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    sims
    
    
    