In [ ]:
from pprint import pprint
In [ ]:
# we'll use data from a job that collected tweets about parenting
tweet_bodies = [line.strip() for line in open('tweet_bodies.txt')]
In [ ]:
# sanity checks
pprint(len(tweet_bodies))
In [ ]:
# sanity checks
pprint(tweet_bodies[:10])
In [ ]:
# let's do some quick deduplication
from duplicate_filter import duplicateFilter
## set the similarity threshold at 90%
dup_filter = duplicateFilter(0.9)
deduped_tweet_bodies = []
for tweet_id, tweet_body in enumerate(tweet_bodies):
    if not dup_filter.isDup(tweet_id, tweet_body):
        deduped_tweet_bodies.append(tweet_body)
pprint(deduped_tweet_bodies[:10])
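The duplicate_filter module isn't listed here; the cell below is only a rough sketch of how such a filter might behave, flagging a tweet as a duplicate when its token-set Jaccard similarity to any previously seen tweet reaches the threshold. It's an assumption, not the actual implementation.
In [ ]:
# Hypothetical stand-in for duplicate_filter.duplicateFilter (assumed behavior,
# not the real module): compare each new tweet's token set against every tweet
# seen so far and call it a duplicate if the Jaccard overlap meets the threshold.
class JaccardDuplicateFilter:
    def __init__(self, threshold):
        self.threshold = threshold
        self.seen = {}  # id -> set of lowercased tokens

    def isDup(self, doc_id, text):
        tokens = set(text.lower().split())
        for other_tokens in self.seen.values():
            union = tokens | other_tokens
            if union and len(tokens & other_tokens) / len(union) >= self.threshold:
                return True
        self.seen[doc_id] = tokens
        return False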
In [ ]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
tokenized_deduped_tweet_bodies = [tt.tokenize(body) for body in deduped_tweet_bodies]
In [ ]:
# sanity checks
len(tokenized_deduped_tweet_bodies)
In [ ]:
pprint(tokenized_deduped_tweet_bodies[:2])
The default configuration of NLTK's pos_tag is the greedy averaged perceptron tagger (https://explosion.ai/blog/part-of-speech-pos-tagger-in-python)
In [ ]:
from nltk.tag import pos_tag as pos_tagger
tagged_tokenized_deduped_tweet_bodies = [ pos_tagger(tokens) for tokens in tokenized_deduped_tweet_bodies]
In [ ]:
pprint(tagged_tokenized_deduped_tweet_bodies[:2])
In [ ]:
# let's look at the taxonomy of tags; in our case derived from the Penn Treebank project
# (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8216&rep=rep1&type=pdf)
import nltk
nltk.help.upenn_tagset()
In [ ]:
# let's peek at the tag dictionary for our tagger
from nltk.tag.perceptron import PerceptronTagger
t = PerceptronTagger()
pprint(list(t.tagdict.items())[:10])
We must choose which parts of speech to evaluate. Let's focus on adjectives, which are useful for sentiment analysis, and proper nouns, which provide a set of potential events and topics.
In [ ]:
adjective_tags = ['JJ','JJR','JJS']
pn_tags = ['NNP','NNPS']
tag_types = [('adj',adjective_tags),('PN',pn_tags)]
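As a quick illustrative aggregate (not needed for the rest of the notebook), we can count how many tagged tokens fall into each of these tag types:
In [ ]:
# count tagged tokens per tag type (adjectives vs. proper nouns)
from collections import Counter
tag_type_counts = Counter()
for tagged_tokens in tagged_tokenized_deduped_tweet_bodies:
    for token, tag in tagged_tokens:
        for type_name, tags in tag_types:
            if tag in tags:
                tag_type_counts[type_name] += 1
pprint(tag_type_counts)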
In [ ]:
# print format: "POS: TOKEN --> TWEET TEXT"
for body, tweet_tokens, tagged_tokens in zip(deduped_tweet_bodies, tokenized_deduped_tweet_bodies, tagged_tokenized_deduped_tweet_bodies):
    for token, tag in tagged_tokens:
        if tag in adjective_tags:
        #if tag in pn_tags:
            print_str = '{}: {} --> {}'.format(tag, token, body)
            print(print_str)
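The per-tweet printout is hard to scan; a frequency count of the adjective tokens (again, just an illustration) gives a quicker overview:
In [ ]:
# most frequent adjective tokens according to the NLTK tagger
from collections import Counter
adj_counts = Counter(token.lower()
                     for tagged_tokens in tagged_tokenized_deduped_tweet_bodies
                     for token, tag in tagged_tokens
                     if tag in adjective_tags)
pprint(adj_counts.most_common(20))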
These seem like dreadful results. Let's try a different NLP engine.
Download:
http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
Then unzip. Start up the server from the unzipped directory:
$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
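Before wrapping the server with a client library, it's worth confirming it is reachable. The cell below is an optional check (assuming the server is running on localhost:9000 as started above) that posts one tweet directly to the HTTP endpoint:
In [ ]:
# optional: confirm the CoreNLP server is up by posting a single tweet
import json
import requests
props = {'annotators': 'pos', 'outputFormat': 'json'}
resp = requests.post('http://localhost:9000',
                     params={'properties': json.dumps(props)},
                     data=deduped_tweet_bodies[0].encode('utf-8'))
pprint(resp.status_code)
pprint(list(resp.json().keys()))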
In [ ]:
from corenlp_pywrap import pywrap
cn = pywrap.CoreNLP(url='http://localhost:9000', annotator_list=["pos"])
In [ ]:
corenlp_results = []
for tweet_body in deduped_tweet_bodies:
    try:
        corenlp_results.append( cn.basic(tweet_body, out_format='json').json() )
    except UnicodeEncodeError:
        corenlp_results.append( {'sentences': []} )
In [ ]:
# pull out the tokens and tags
corenlp_tagged_tokenized_deduped_tweet_bodies = [
    [(token['word'], token['pos'])
     for sentence in result['sentences']
     for token in sentence['tokens']]
    for result in corenlp_results
]
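The comprehension above relies on each result having sentences, each with a list of tokens carrying 'word' and 'pos' fields; a quick peek confirms the extraction looks like the NLTK output from earlier.
In [ ]:
# sanity checks
pprint(corenlp_tagged_tokenized_deduped_tweet_bodies[:2])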
In [ ]:
# print format: "POS: TOKEN --> TWEET TEXT"
for body, tagged_tokens in zip(deduped_tweet_bodies, corenlp_tagged_tokenized_deduped_tweet_bodies):
    for token, tag in tagged_tokens:
        #if tag in pn_tags:
        if tag in adjective_tags:
            print_str = '{}: {} --> {}'.format(tag, token, body)
            print(print_str)
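To put the two taggers side by side, a rough tally (illustrative only) of how many adjective hits each produced:
In [ ]:
# rough comparison: number of adjective tokens found by each tagger
nltk_adj = sum(1 for tagged in tagged_tokenized_deduped_tweet_bodies
               for token, tag in tagged if tag in adjective_tags)
corenlp_adj = sum(1 for tagged in corenlp_tagged_tokenized_deduped_tweet_bodies
                  for token, tag in tagged if tag in adjective_tags)
pprint({'nltk_adjectives': nltk_adj, 'corenlp_adjectives': corenlp_adj})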
For Tweet bodies:
Next steps:
In [ ]: