Tweet sentiment analysis

In this section we will see how to extract features from tweets and use a classifier to label each tweet as positive or negative.

We will use pandas DataFrames (http://pandas.pydata.org/) to store and process the tweets. Pandas DataFrames are very powerful Python data structures, like Excel spreadsheets with the power of Python.


In [4]:
# Let's create a DataFrame with each tweet using pandas
import pandas as pd
import json
import numpy as np


def getTweetID(tweet):
    """ If properly included, get the ID of the tweet """
    return tweet.get('id')
    
def getUserIDandScreenName(tweet):
    """ If properly included, get the tweet 
        user ID and Screen Name """
    user = tweet.get('user')
    if user is not None:
        uid = user.get('id')
        screen_name = user.get('screen_name')
        return uid, screen_name
    else:
        return (None, None)
    

    
filename = 'AI2.txt'

# create a list of dictionaries with the data that interests us
tweet_data_list = []
with open(filename, 'r') as fopen:
    # each line corresponds to a tweet
    for line in fopen:
        if line != '\n':
            tweet = json.loads(line.strip('\n'))
            tweet_id = getTweetID(tweet)
            user_id = getUserIDandScreenName(tweet)[0]
            text = tweet.get('text')
            if tweet_id is not None:
                tweet_data_list.append({'tweet_id' : tweet_id,
                           'user_id' : user_id,
                           'text' : text})

# put everything in a dataframe
tweet_df = pd.DataFrame.from_dict(tweet_data_list)

In [5]:
print(tweet_df.shape)
print(tweet_df.columns)

# print the first 5 elements of one of the columns
print(tweet_df.text.iloc[:5])
# or
print(tweet_df['text'].iloc[:5])


(5012, 3)
Index(['text', 'tweet_id', 'user_id'], dtype='object')
0                                      'This Isn't AI'
1    RT @IoTRecruiting: How will Cognitive Computin...
2                        Trou* https://t.co/FlQdwMFbmh
3    RT @InvestorIdeas: https://t.co/ZFdyj2RNXV - #...
4    DeepStack: Expert-level artificial intelligenc...
Name: text, dtype: object
0                                      'This Isn't AI'
1    RT @IoTRecruiting: How will Cognitive Computin...
2                        Trou* https://t.co/FlQdwMFbmh
3    RT @InvestorIdeas: https://t.co/ZFdyj2RNXV - #...
4    DeepStack: Expert-level artificial intelligenc...
Name: text, dtype: object

In [6]:
# show the first 10 rows
tweet_df.head(10)


Out[6]:
text tweet_id user_id
0 'This Isn't AI' 860217132763754497 211638860
1 RT @IoTRecruiting: How will Cognitive Computin... 860217137058635776 3226000831
2 Trou* https://t.co/FlQdwMFbmh 860217138908475397 3070524046
3 RT @InvestorIdeas: https://t.co/ZFdyj2RNXV - #... 860217141827518464 837339093214318593
4 DeepStack: Expert-level artificial intelligenc... 860217143756857344 766378262259986437
5 RT @IoTRecruiting: How will Cognitive Computin... 860217149192785920 2390579695
6 RT @IoTRecruiting: Honored to be ranked Top In... 860217150237270022 860148473873739779
7 I want Barnier to hand over the cheque with a ... 860217155568242693 740545840838774784
8 Iraqi refugee 'used asylum seekers to stage bu... 860217170948612096 14101483
9 RT @cabroncita: $BKD $BKDCD / $BKD.V Artificia... 860217174077706242 835220550209396736

Extracting features from the tweets

1) Tokenize the tweet into a list of words

This part uses concepts from Natural Language Processing. We will use a tweet tokenizer I built based on the TweetTokenizer from NLTK (http://www.nltk.org/). You can see how it works by opening the file TwSentiment.py. The goal is to process any tweet and extract a list of words, taking into account usernames, hashtags, urls, emoticons and all the informal text we can find in tweets. We also want to reduce the number of features by applying some transformations, such as putting all the words in lowercase.
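
As a rough idea of what happens inside, here is a minimal sketch of such a normalizing tokenizer built on top of NLTK's TweetTokenizer (hypothetical code; the actual CustomTweetTokenizer in TwSentiment.py may differ in its details):

import re
from nltk.tokenize import TweetTokenizer

_base = TweetTokenizer(preserve_case=True, reduce_len=True)

def tokenize_tweet(text):
    """Sketch: normalize mentions and urls, keep all-caps words, lowercase the rest."""
    out = []
    for tok in _base.tokenize(text):
        if tok.startswith('@') and len(tok) > 1:
            out.append('@USER')   # normalize mentions
        elif re.match(r'https?://', tok):
            out.append('URL')     # normalize urls
        elif tok.isupper():
            out.append(tok)       # keep all-uppercase words (e.g. 'AI')
        else:
            out.append(tok.lower())
    return out

# tokenize_tweet("Hey! This is SO cooooooool! :)")
# -> ['hey', '!', 'this', 'is', 'SO', 'coool', '!', ':)']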


In [7]:
from TwSentiment import CustomTweetTokenizer

In [8]:
tokenizer = CustomTweetTokenizer(preserve_case=False, # lowercase all the words
                                 reduce_len=True, # reduce repetitions of a letter to a maximum of three
                                 strip_handles=False, # keep usernames (@mentions)
                                 normalize_usernames=True, # replace all mentions with "@USER"
                                 normalize_urls=True, # replace all urls with "URL"
                                 keep_allupper=True) # keep uppercase for words that are all in uppercase

In [9]:
# example
tweet_df.text.iloc[0]


Out[9]:
"'This Isn't AI'"

In [10]:
tokenizer.tokenize(tweet_df.text.iloc[0])


Out[10]:
["'", 'this', "isn't", 'AI', "'"]

In [9]:
# other examples
tokenizer.tokenize('Hey! This is SO cooooooooooooooooool! :)')


Out[9]:
['hey', '!', 'this', 'is', 'SO', 'coool', '!', ':)']

In [11]:
tokenizer.tokenize('Hey! This is so cooooooool! :)')


Out[11]:
['hey', '!', 'this', 'is', 'so', 'coool', '!', ':)']

2) Define the features that will represent the tweet

We will use the occurrence of words and pairs of words (bigrams) as features.

This corresponds to a bag-of-words representation (https://en.wikipedia.org/wiki/Bag-of-words_model): we simply count each word (or n-gram) without taking its order into account. For document classification, the frequency of occurrence of each word is usually taken as a feature. Tweets, however, are so short that we can just count each word once.

Using pairs of words allows us to capture some of the context in which each word appears. This helps capture the correct meaning of words.
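
For reference, such a feature extractor fits in a few lines. Here is a sketch (the actual bag_of_words_and_bigrams in TwSentiment.py may differ):

def bag_of_words_and_bigrams_sketch(tokens):
    # mark each word, and each pair of consecutive words, as present;
    # counts and word order are ignored
    features = {tok: True for tok in tokens}
    for pair in zip(tokens, tokens[1:]):
        features[pair] = True
    return features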


In [12]:
from TwSentiment import bag_of_words_and_bigrams

# this will return a dictionary of features,
# we just list the features present in this tweet
bag_of_words_and_bigrams(tokenizer.tokenize(tweet_df.text.iloc[0]))


Out[12]:
{"'": True,
 'this': True,
 "isn't": True,
 'AI': True,
 ("'", 'this'): True,
 ('this', "isn't"): True,
 ("isn't", 'AI'): True,
 ('AI', "'"): True}

Download the logistic regression classifier

https://www.dropbox.com/s/09rw6a85f7ezk31/sklearn_SGDLogReg_.pickle.zip?dl=1

I trained this classifier on this dataset: http://help.sentiment140.com/for-students/, following the approach from this paper: http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

This is a set of 14 million tweets with emoticons. Tweets containing "sad" emoticons (7 million) are considered negative and tweets with "happy" emoticons (7 million) are considered positive.

I used a Logistic Regression classifier with L2 regularization that I optimized with a 10-fold cross-validation, using the $F_1$ score as the metric.
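
For the curious, such a pipeline could be set up and evaluated roughly as follows (a sketch with hypothetical training data X_train and y_train; the saved classifier below was trained with an older scikit-learn version, so details may differ):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

train_pipeline = Pipeline([
    ('feat_vectorizer', DictVectorizer(sparse=True, sort=False)),
    ('classifier', SGDClassifier(loss='log', penalty='l2', random_state=42)),
])
# X_train is a list of feature dicts, y_train holds the labels 0 (neg) / 1 (pos)
# scores = cross_val_score(train_pipeline, X_train, y_train, cv=10, scoring='f1')
# train_pipeline.fit(X_train, y_train)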


In [14]:
# the classifier is saved in a "pickle" file
import pickle

with open('sklearn_SGDLogReg_.pickle', 'rb') as fopen:
    classifier_dict = pickle.load(fopen)


D:\Anaconda3\lib\site-packages\sklearn\base.py:315: UserWarning: Trying to unpickle estimator DictVectorizer from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
D:\Anaconda3\lib\site-packages\sklearn\base.py:315: UserWarning: Trying to unpickle estimator SGDClassifier from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
D:\Anaconda3\lib\site-packages\sklearn\base.py:315: UserWarning: Trying to unpickle estimator Pipeline from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)

In [15]:
# classifier_dict contains the classifier and the label mappers
# that I added so that we remember how the classes are
# encoded
classifier_dict


Out[15]:
{'label_inv_mapper': {0: 'neg', 1: 'pos'},
 'label_mapper': {'neg': 0, 'pos': 1},
 'sklearn_pipeline': Pipeline(steps=[('feat_vectorizer', DictVectorizer(dtype=<class 'numpy.int8'>, separator='=', sort=False,
         sparse=True)), ('classifier', SGDClassifier(alpha=7.847599703514622e-06, average=False, class_weight=None,
        epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
        learning_rate='optimal', loss='log', n_iter=10, n_jobs=1,
        penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
        warm_start=False))])}

The classifier is in fact contained in a pipeline. A sklearn Pipeline allows you to assemble several transformations of your data (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).


In [16]:
pipeline = classifier_dict['sklearn_pipeline']

In [17]:
pipeline.steps


Out[17]:
[('feat_vectorizer',
  DictVectorizer(dtype=<class 'numpy.int8'>, separator='=', sort=False,
          sparse=True)),
 ('classifier',
  SGDClassifier(alpha=7.847599703514622e-06, average=False, class_weight=None,
         epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
         learning_rate='optimal', loss='log', n_iter=10, n_jobs=1,
         penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
         warm_start=False))]

In [18]:
# this is the step that transforms a dictionary of textual features into a vector of zeros and ones
dict_vect = pipeline.steps[0][1]

In [19]:
dict_vect.feature_names_


Out[19]:
[':}',
 ('so', 'glad'),
 'was',
 ('that', 'was'),
 ('to', 'make'),
 'that',
 ('skool', 'this'),
 'for',
 'to',
 ('hafta', 'get'),
 'but',
 ('!', 'but'),
 'get',
 ('a', 'summer'),
 ('gay', '!'),
 ("don't", 'have'),
 ('paper', '!'),
 'this',
 'so',
 ('summer', 'skool'),
 ('!', 'I'),
 'paper',
 ('was', 'gay'),
 'make',
 ('wow', 'that'),
 ('!', ':}'),
 'glad',
 ('need', 'to'),
 ('make', 'that'),
 'have',
 ('job', 'for'),
 ('I', "don't"),
 ('!', 'wow'),
 'I',
 ('this', 'summer'),
 'a',
 ('have', 'summer'),
 'gay',
 ('get', 'a'),
 'wow',
 ('glad', 'I'),
 'summer',
 ('summer', '!'),
 ('summer', 'job'),
 'job',
 'skool',
 ('but', 'I'),
 ('sure', '!'),
 'sure',
 ('I', 'hafta'),
 'need',
 '!',
 "don't",
 'hafta',
 ('for', 'sure'),
 ('I', 'need'),
 ('that', 'paper'),
 ('only', 'on'),
 ('hot', 'milo'),
 'before',
 ('5', 'minutes'),
 'about',
 'the',
 ',',
 'heading',
 ('yup', ','),
 ('to', 'the'),
 'nice',
 ('couch', 'for'),
 ('nice', 'hot'),
 ('the', 'couch'),
 ('minutes', 'and'),
 ('for', 'about'),
 'minutes',
 'yup',
 ('@USER', 'yup'),
 'on',
 'hot',
 'then',
 'couch',
 'and',
 (',', 'only'),
 ('before', 'a'),
 ('for', 'a'),
 'only',
 ('then', "i'm"),
 '@USER',
 ("i'm", 'heading'),
 ('a', 'nice'),
 "i'm",
 ('early', 'night'),
 'night',
 '5',
 ('nice', 'early'),
 'early',
 ('about', '5'),
 'milo',
 ('milo', 'before'),
 ('and', 'then'),
 ('heading', 'to'),
 ('on', 'for'),
 ('@USER', 'drinks'),
 ('with', 'mates'),
 'promise',
 'meet',
 ('a', 'long'),
 'up',
 ('&', 'a'),
 'in',
 '&',
 ('long', 'promise'),
 'with',
 ('in', 'london'),
 ('drinks', 'with'),
 'mates',
 'drinks',
 ('london', '&'),
 ('up', 'with'),
 ('mates', 'in'),
 ('meet', 'up'),
 ('with', '@USER'),
 ('promise', 'meet'),
 'london',
 'long',
 'out',
 'gotta',
 '...',
 ('on', "jason's"),
 ('NYC', '!'),
 ('gotta', 'love'),
 'NYC',
 'rooftop',
 ('laying', 'out'),
 'laying',
 ("jason's", 'rooftop'),
 ('rooftop', '...'),
 'love',
 "jason's",
 ('love', 'NYC'),
 ('out', 'on'),
 ('...', 'gotta'),
 ('/', 'better'),
 'better',
 'online',
 '/',
 ('shopping', 'I'),
 ('easier', '/'),
 ('better', 'than'),
 '.',
 'irl',
 ('than', 'shopping'),
 ('irl', '.'),
 ('love', 'shopping'),
 ('-', 'so'),
 'much',
 ('so', 'much'),
 ('I', 'love'),
 'holiday',
 '-',
 ('holiday', 'shopping'),
 ('online', '-'),
 'than',
 ('shopping', 'irl'),
 'easier',
 ('shopping', 'online'),
 ('much', 'easier'),
 'shopping',
 'lie',
 ('off', '/'),
 ('!', 'tomorrows'),
 ('golf', 'today'),
 ('it', '!'),
 ('lie', 'in'),
 'tomorrows',
 ('-', 'what'),
 'golf',
 'it',
 ('what', 'a'),
 ('..', "i'm"),
 ('/', 'lie'),
 'off',
 'lovely',
 ('running', 'and'),
 ('in', 'for'),
 'first',
 'running',
 ('!', '!'),
 ('a', 'lovely'),
 'ages',
 'what',
 ("i'm", 'gonna'),
 ('day', '!'),
 ('enjoy', 'it'),
 ('tomorrows', 'my'),
 ('day', 'off'),
 ('for', 'ages'),
 ('ages', '..'),
 'day',
 ('first', 'day'),
 'gonna',
 ('today', '-'),
 ('lovely', 'day'),
 ('my', 'first'),
 'today',
 ('gonna', 'enjoy'),
 'my',
 'enjoy',
 ('and', 'golf'),
 '..',
 ('hope', 'to'),
 ('business', 'someday'),
 'sounds',
 ('do', 'business'),
 ('good', ','),
 'hope',
 'good',
 ('@USER', 'sounds'),
 'do',
 (',', 'hope'),
 'someday',
 ('to', 'do'),
 ('sounds', 'good'),
 'business',
 ('LOL', '...'),
 'course',
 'LOL',
 ("it's", 'hotter'),
 ("it's", 'out'),
 ('out', '!'),
 'XD',
 ('-', 'LOL'),
 ('than', 'yesterday'),
 'of',
 ('XD', '&'),
 ('!', 'XD'),
 ('of', 'course'),
 'yesterday',
 ('@USER', '-'),
 ('&', "it's"),
 ('course', "it's"),
 "it's",
 ('hotter', 'than'),
 ('...', 'of'),
 'hotter',
 "isn't",
 '?',
 'bitch',
 ('it', '?'),
 ('bitch', "isn't"),
 ("payback's", 'a'),
 ('a', 'bitch'),
 ("isn't", 'it'),
 "payback's",
 'by',
 'we',
 'cows',
 ('just', 'passed'),
 ('some', 'dumb'),
 'just',
 ('@USER', 'we'),
 ('cows', '!'),
 ('we', 'just'),
 'dumb',
 ('passed', 'by'),
 ('dumb', 'cows'),
 'some',
 'passed',
 ('by', 'some'),
 'dinner',
 'besties',
 'malta',
 ('the', 'besties'),
 ('should', 'be'),
 ('off', 'to'),
 ('for', 'dinner'),
 ('the', 'malta'),
 ('with', 'the'),
 'should',
 ('malta', 'for'),
 ('good', '!'),
 ('besties', 'should'),
 'be',
 ('be', 'good'),
 ('dinner', 'with'),
 ('freshman', 'on'),
 'freshman',
 ("i'm", 'a'),
 ('on', 'twitter'),
 ('twitter', '.'),
 'twitter',
 ('a', 'freshman'),
 'well.im',
 'graduation',
 'haha',
 'someones',
 'lets',
 'well',
 'saturday',
 ('do.hows', 'everyones'),
 ('someones', 'graduation'),
 'lol.something',
 ('graduation', 'part'),
 ('well', 'well'),
 ('tonight', 'lol.something'),
 ('lol.something', 'to'),
 ('part', 'tonight'),
 'do.hows',
 ('well.im', 'crashing'),
 ('talk', 'haha'),
 'talk',
 ('crashing', 'someones'),
 'everyones',
 'tonight',
 ('saturday', '?'),
 ('well', 'well.im'),
 ('to', 'do.hows'),
 ('lets', 'talk'),
 'crashing',
 'part',
 ('?', 'lets'),
 ('everyones', 'saturday'),
 'your',
 ('the', 'father'),
 ('.', 'quite'),
 'has',
 'father',
 ('much', 'more'),
 'were',
 ('honestly', ','),
 ('more', 'than'),
 'more',
 'you',
 ('has', 'even'),
 'amazing',
 ('amazing', '.'),
 ('your', 'friend'),
 (',', 'you'),
 ('quite', 'honestly'),
 ('you', 'and'),
 'attempted',
 'friend',
 ('do', '.'),
 'quite',
 ('and', 'your'),
 'did',
 ('what', 'the'),
 ('did', 'so'),
 ('than', 'what'),
 ('father', 'has'),
 ('even', 'attempted'),
 ('friend', 'did'),
 'even',
 ('@USER', 'you'),
 ('you', 'were'),
 ('attempted', 'to'),
 'honestly',
 ('were', 'amazing'),
 ('is', 'my'),
 ('follow', '@USER'),
 ('dinner', 'and'),
 ('my', 'cupcake'),
 ('great', 'dinner'),
 ('@USER', '!'),
 'great',
 'cupcake',
 'follow',
 ('she', 'is'),
 ('great', 'friends'),
 ('!', 'follow'),
 ('!', 'she'),
 'is',
 ('friends', '!'),
 ('and', 'great'),
 'she',
 'friends',
 ('...', 'and'),
 ('forget', 'you'),
 ('@USER', 'hope'),
 'how',
 'anyone',
 ('?', '?'),
 ('soon', '...'),
 'pic',
 ('hope', 'your'),
 'soon',
 'forget',
 ('how', 'can'),
 ('your', 'pic'),
 'comes',
 ('pic', 'comes'),
 ('can', 'anyone'),
 ('and', 'how'),
 ('comes', 'back'),
 ('anyone', 'forget'),
 'back',
 ('you', '?'),
 'can',
 ('back', 'soon'),
 ('lovely', '!'),
 ('hi', 'lovely'),
 ('@USER', 'hi'),
 'hi',
 ('not', 'even'),
 ('think', 'new'),
 'new',
 ('!', 'lol'),
 'not',
 ('a', 'car'),
 'think',
 ('I', 'did'),
 ('owned', 'a'),
 ('new', 'yorkers'),
 ('did', 'not'),
 ('@USER', 'I'),
 'car',
 'yorkers',
 'owned',
 'lol',
 ('even', 'think'),
 ('yorkers', 'owned'),
 ('car', '!'),
 'knowing',
 ('bar', 'I'),
 ('15', '+'),
 ('at', 'a'),
 ('people', 'at'),
 'randomly',
 ('randomly', 'showed'),
 ('love', 'that'),
 'also',
 ('showed', 'up'),
 ('I', 'also'),
 ('knowing', '15'),
 '+',
 ('that', 'coll'),
 'at',
 'bar',
 ('a', 'bar'),
 'coll',
 ('+', 'people'),
 '15',
 ('also', 'love'),
 ('just', 'randomly'),
 ('up', '!'),
 'showed',
 ('coll', 'just'),
 ('love', 'knowing'),
 'people',
 ('was', 'a'),
 ('lol', 'that'),
 'movie',
 ('@USER', 'lol'),
 ('great', 'movie'),
 ('a', 'great'),
 'or',
 ('sexier', 'with'),
 ('thingy', '...'),
 ('me', 'or'),
 ('vick', 'even'),
 'house',
 ('him', '!'),
 ('michael', 'vick'),
 ('with', 'his'),
 'luv',
 'thingy',
 'vick',
 'sexier',
 'iz',
 ('even', 'sexier'),
 ('iz', 'it'),
 ('house', 'arrest'),
 ('it', 'me'),
 'ooh',
 ('arrest', 'thingy'),
 'his',
 'arrest',
 'him',
 'michael',
 ('his', 'house'),
 ('ooh', 'I'),
 ('luv', 'him'),
 ('...', 'ooh'),
 ('or', 'iz'),
 'me',
 ('iz', 'michael'),
 ('I', 'luv'),
 ('sure', 'what'),
 ('it', 'is'),
 'know',
 ('is', "it's"),
 ('-', 'not'),
 ('all', 'I'),
 ('is', '..'),
 ('..', 'all'),
 ('I', 'know'),
 'all',
 'funny',
 ("it's", 'funny'),
 ('what', 'it'),
 ('know', 'is'),
 ('not', 'sure'),
 'always',
 ('the', 'world'),
 'talent',
 'amazed',
 ('at', 'the'),
 ('always', 'amazed'),
 ('talent', 'in'),
 ('amount', 'of'),
 ('world', '.'),
 ('the', 'amount'),
 ('of', 'musical'),
 ('musical', 'talent'),
 ('support', 'your'),
 ('local', 'musician'),
 ('your', 'local'),
 ('in', 'the'),
 ('musician', '!'),
 'amount',
 'local',
 'world',
 'support',
 ('amazed', 'at'),
 ('.', 'support'),
 'musician',
 'musical',
 'like',
 'read',
 ('I', 'should'),
 ('like', 'a'),
 ('book', 'I'),
 ('sounds', 'like'),
 'book',
 ('a', 'book'),
 ('should', 'read'),
 ('falls', 'away'),
 ('listening', 'to'),
 'favs',
 ('*', 'one'),
 ('my', 'favs'),
 'listening',
 ('favs', '@USER'),
 'away',
 ('to', '*'),
 '*',
 'falls',
 ('away', '*'),
 ('one', 'of'),
 ('of', 'my'),
 'one',
 ('*', 'falls'),
 ('loves', 'youtube'),
 'youtube',
 'loves',
 'luck',
 ('@USER', 'good'),
 ('good', 'luck'),
 'andy',
 ('luck', 'andy'),
 ('@USER', 'on'),
 ('with', 'it'),
 (',', "i'm"),
 (',', 'on'),
 ('!', 'and'),
 'controversy',
 ('on', 'the'),
 'sitting',
 ('my', 'tweet'),
 ('tweet', ','),
 ('welcome', '!'),
 ('it', 'then'),
 ('your', 'welcome'),
 ('sitting', 'on'),
 ('then', ','),
 ('edge', 'of'),
 ('controversy', '!'),
 'edge',
 'tweet',
 'welcome',
 ("i'm", 'sitting'),
 ('the', 'controversy'),
 ('the', 'edge'),
 ('on', 'with'),
 ('alright', '@USER'),
 ('@USER', '.'),
 'alright',
 ('do', 'this'),
 ('.', "let's"),
 "let's",
 ("let's", 'do'),
 ('enjoy', 'your'),
 ('@USER', 'awww'),
 ('awww', 'well'),
 ('well', 'enjoy'),
 ('your', "'"),
 ("'", 'tini'),
 'awww',
 'tini',
 "'",
 ('bak', 'from'),
 ('from', 'construction'),
 'bak',
 'from',
 'construction',
 'doll',
 ('URL', ','),
 'URL',
 ('doll', 'URL'),
 (',', 'URL'),
 ('what', 'do'),
 ('think', '?'),
 ('do', 'you'),
 ('you', 'think'),
 'bed',
 'im',
 'bella',
 (',', 'what'),
 ('im', 'off'),
 ('?', 'im'),
 ('bed', 'now'),
 'now',
 ('to', 'bed'),
 ('bella', 'doll'),
 ('@USER', 'all'),
 ('all', 'about'),
 ('the', 'trust'),
 'trust',
 ('about', 'the'),
 ('a', 'softy'),
 ('I', 'feel'),
 ('softy', 'now'),
 'feel',
 ('@USER', 'LOL'),
 'thanks',
 ('LOL', 'thanks'),
 ('...', 'I'),
 ('thanks', '...'),
 'softy',
 ('feel', 'like'),
 ('I', 'appreciate'),
 ('am', 'glad'),
 'kind',
 ('resonate', 'with'),
 ('the', 'words'),
 'appreciate',
 ('words', 'resonate'),
 'resonate',
 'am',
 ('your', 'kind'),
 ('that', 'the'),
 ('words', '.'),
 ('glad', 'that'),
 ('kind', 'words'),
 ('appreciate', 'your'),
 ('I', 'am'),
 ('.', 'I'),
 'words',
 ('with', 'you'),
 ('good', 'night'),
 ('sleeping', 'now'),
 ('now', '!'),
 ('will', 'be'),
 'sleeping',
 ('be', 'sleeping'),
 'everyone',
 ('!', 'good'),
 ('everyone', '!'),
 'will',
 ('night', 'everyone'),
 ('the', 'girl'),
 'mom',
 ('I', 'want'),
 ('s', 'mom'),
 ('time', '..'),
 ('please', '?'),
 'please',
 ('..', 'and'),
 ('answer', 'me'),
 'want',
 ('carly', '�'),
 'answer',
 ('to', 'see'),
 ('and', 'sam'),
 'any',
 'time',
 ('?', 'can'),
 ('?', 'xoxo'),
 ('who', 'is'),
 ('@USER', ':'),
 'sam',
 ('sam', '�'),
 ('girl', '?'),
 '�',
 ('me', 'please'),
 ('see', 'carly'),
 ('mom', 'at'),
 ('any', 'time'),
 ('is', 'the'),
 'see',
 ('s', 'who'),
 (':', 'I'),
 ('you', 'answer'),
 ('�', 's'),
 ':',
 'girl',
 ('URL', '?'),
 ('can', 'you'),
 'who',
 'carly',
 ('at', 'any'),
 'xoxo',
 's',
 ('xoxo', 'URL'),
 ('want', 'to'),
 'actually',
 'cousin',
 ('@USER', 'was'),
 ('my', 'cousin'),
 'done',
 ('actually', 'done'),
 ('was', 'actually'),
 ('done', 'by'),
 ('by', 'my'),
 ('A', 'new'),
 ('there', 'was'),
 ('comments', ','),
 'comments',
 ('!', 'very'),
 ('yanks', '&'),
 'A',
 ('harsh', 'banter'),
 ('check', 'the'),
 ('video', 'from'),
 'yanks',
 ('&', 'brits'),
 'there',
 ('very', 'stupid'),
 ('the', 'comments'),
 'between',
 ('and', 'check'),
 ('kasabian', '.'),
 'check',
 'kasabian',
 'video',
 ('from', 'kasabian'),
 ('brits', '!'),
 'very',
 ('between', 'yanks'),
 ('new', 'video'),
 'brits',
 ('a', 'harsh'),
 ('stupid', '!'),
 (',', 'there'),
 ('banter', 'between'),
 'harsh',
 'banter',
 'stupid',
 ('!', 'URL'),
 ('.', 'and'),
 'fine',
 'fits',
 ('you', 'just'),
 'black',
 ("i'm", 'sure'),
 ('black', 'fits'),
 ('@USER', "i'm"),
 ('sure', 'black'),
 ('fits', 'you'),
 ('just', 'fine'),
 'mee',
 ('for', 'mee'),
 'vote',
 ('mee', '.'),
 ('.', 'URL'),
 'xxxx',
 ('URL', 'xxxx'),
 ('vote', 'for'),
 'second',
 ('the', 'second'),
 ('@USER', 'the'),
 ('one', '.'),
 ('second', 'one'),
 ('smile', 'again'),
 ("i'm", 'wearing'),
 ('playing', 'up'),
 ('up', 'at'),
 'wearing',
 ('now', 'have'),
 ('wearing', 'my'),
 ('been', 'playing'),
 ('.', '@USER'),
 'again',
 ('so', 'life'),
 ('my', 'smile'),
 ('as', 'per'),
 'life',
 'brighter',
 'work',
 ('much', 'brighter'),
 'usual',
 ('brighter', '.'),
 'as',
 ('again', 'now'),
 'playing',
 ('usual', 'so'),
 ('at', 'work'),
 ('life', 'is'),
 'smile',
 ('have', 'been'),
 'per',
 ('is', 'much'),
 ('per', 'usual'),
 'been',
 ('work', 'as'),
 'giveaway',
 'i',
 (',', 'i'),
 ('much', 'for'),
 ('!', 'come'),
 ('entering', 'my'),
 ('thanks', 'so'),
 'entering',
 ('back', 'tomorrow'),
 ('lily', 'giveaway'),
 ('giveaway', '!'),
 ('pearl', 'necklace'),
 ('&', 'lily'),
 'necklace',
 ('tomorrow', ','),
 ('come', 'back'),
 'come',
 ('have', 'a'),
 ('i', 'have'),
 ('for', 'entering'),
 'scheduled',
 'tomorrow',
 ('jack', '&'),
 ('my', 'jack'),
 ('giveaway', 'scheduled'),
 ('@USER', 'thanks'),
 'lily',
 'pearl',
 ('a', 'pearl'),
 ('necklace', 'giveaway'),
 'jack',
 ('sounds', 'fun'),
 ('wish', 'i'),
 'fun',
 (',', 'wish'),
 ('fun', ','),
 ('have', 'gone'),
 'could',
 ('could', 'have'),
 'wish',
 'gone',
 ('gone', '!'),
 ('i', 'could'),
 ('I', 'guessed'),
 'guessed',
 'right',
 ('guessed', 'right'),
 'prize',
 ('YAY', '!'),
 ('prize', '?'),
 ('I', 'get'),
 ('do', 'I'),
 ('a', 'prize'),
 ('right', '...'),
 'YAY',
 ('...', 'do'),
 ('@USER', 'YAY'),
 'interwebz',
 ('hay', ','),
 'gunna',
 (',', 'goodnight'),
 ('the', 'hay'),
 ('interwebz', '!'),
 'hit',
 ('gunna', 'hit'),
 'goodnight',
 ('goodnight', 'interwebz'),
 'hay',
 ('hit', 'the'),
 'x',
 ('on', 'my'),
 ('!', 'x'),
 ('-', 'lovely'),
 ('your', 'comment'),
 ('my', 'blog'),
 ('comment', 'on'),
 ('thanks', 'for'),
 ('blog', '-'),
 ('for', 'your'),
 'blog',
 'comment',
 ('familyyy', '..'),
 ('out', 'with'),
 'familyyy',
 'text',
 ('exhausteddd', '..'),
 'exhausteddd',
 ('is', 'exhausteddd'),
 ('..', 'text'),
 ('with', 'familyyy'),
 ('..', 'out'),
 ("i've", 'miss'),
 'miss',
 "you're",
 ('miss', 'u'),
 ('!', "i've"),
 'hereee',
 ('!', "you're"),
 "i've",
 'heey',
 ("you're", 'hereee'),
 ('heey', '!'),
 'u',
 ('hereee', '!'),
 ('@USER', 'heey'),
 ('u', '!'),
 'finished',
 ('finished', 'pass'),
 'pass',
 'assignment',
 ('pass', 'assignment'),
 ('yay', 'finished'),
 'yay',
 ('comes', 'quik'),
 ('quik', '!'),
 'lucky',
 ('night', 'comes'),
 ('aaahhh', 'lucky'),
 ('hopefully', 'tomorrow'),
 'aaahhh',
 ('@USER', 'aaahhh'),
 ('tomorrow', 'night'),
 ('!', 'hopefully'),
 ('lucky', 'u'),
 'quik',
 'hopefully',
 ('good', 'isht'),
 ('.', 'good'),
 'isht',
 ('easy', 'shawty'),
 ('shawty', '.'),
 ('be', 'easy'),
 ('@USER', 'be'),
 'shawty',
 'easy',
 'flowin',
 'visitor',
 ('got', 'a'),
 ('-', '-'),
 ('a', 'visitor'),
 ('-', 'got'),
 ('flowin', 'today'),
 'got',
 ('visitor', '!'),
 'go',
 ('sneek', 'out'),
 ('lex', '<3'),
 ('go', 'with'),
 "i'll",
 ('bahah', 'lex'),
 ('you', 'bahah'),
 'lex',
 ('and', 'go'),
 ("i'll", 'sneek'),
 ('out', 'and'),
 '<3',
 'sneek',
 'bahah',
 ('hanging', 'with'),
 ('with', 'amber'),
 ('amber', '.'),
 'hanging',
 'amber',
 ('best', 'buy'),
 ('@USER', 'of'),
 ('to', 'best'),
 ('after', 'work'),
 'straight',
 'best',
 'buy',
 ('!', "i'll"),
 ('be', 'heading'),
 ('course', '!'),
 'after',
 ('buy', 'straight'),
 ('straight', 'after'),
 ("i'll", 'be'),
 ('can', 'move'),
 ('flight', 'so'),
 ('please', 'confirm'),
 ('move', 'on'),
 ('so', 'i'),
 'flight',
 ('zest', 'air'),
 'move',
 ('confirm', 'my'),
 ...]

In [20]:
# number of features
len(dict_vect.feature_names_)


Out[20]:
3887274

In [21]:
# a little example
text = 'Hi all, I am very happy today'
# first tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# list features
features = bag_of_words_and_bigrams(tokens)
print(features)

# vectorize features
X = dict_vect.transform(features)

print(X.shape)


['hi', 'all', ',', 'I', 'am', 'very', 'happy', 'today']
{'hi': True, 'all': True, ',': True, 'I': True, 'am': True, 'very': True, 'happy': True, 'today': True, ('hi', 'all'): True, ('all', ','): True, (',', 'I'): True, ('I', 'am'): True, ('am', 'very'): True, ('very', 'happy'): True, ('happy', 'today'): True}
(1, 3887274)

In [22]:
# X is a scipy sparse matrix: because it is extremely sparse,
# it can be encoded to take less space in memory.
# if we want to see it fully, we can use .toarray()

# number of non-zero values in X:
X.toarray().sum()


Out[22]:
15

The mapping between the list of features and the vector of zeros and ones is done when you train the pipeline with its .fit method.
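
To see what .fit does to the vectorizer, here is a self-contained toy example (hypothetical data):

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=True, sort=False)
train_feats = [{'good': True, ('very', 'good'): True},
               {'bad': True}]
dv.fit(train_feats)          # the feature -> column mapping is learned here
x = dv.transform({'good': True, 'unseen_word': True})
print(dv.feature_names_)     # ['good', ('very', 'good'), 'bad']
print(x.toarray())           # [[ 1.  0.  0.]] ; unseen features are dropped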

Classifying the tweets

Now that we have a vector representing the presence of features in a tweet, we can apply our logistic regression classifier to compute the probability that the tweet belongs to the "sad" or "happy" category.
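
Under the hood, the probability is just the logistic (sigmoid) function applied to the linear score $w \cdot x + b$. A sketch of the computation, which should match predict_proba for a classifier trained with log loss:

import numpy as np

def manual_predict_proba(classifier, X):
    # linear score w . x + b, then the sigmoid;
    # matches predict_proba for SGDClassifier with loss='log'
    score = X.dot(classifier.coef_.T) + classifier.intercept_
    p_pos = 1.0 / (1.0 + np.exp(-score))
    return np.hstack([1.0 - p_pos, p_pos])   # columns: [P(neg), P(pos)]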


In [23]:
classifier = pipeline.steps[1][1]

In [24]:
classifier


Out[24]:
SGDClassifier(alpha=7.847599703514622e-06, average=False, class_weight=None,
       epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=10, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False)

In [25]:
# access the weights of the logistic regression
classifier.coef_


Out[25]:
array([[-0.03320511,  0.28887428, -0.29384334, ..., -0.00785218,
        -0.00785218, -0.00218135]])

In [26]:
# we have as many weights as features
classifier.coef_.shape


Out[26]:
(1, 3887274)

In [27]:
# plus the intercept
classifier.intercept_


Out[27]:
array([ 0.20885619])

In [28]:
# let's check the weight associated with a given feature
x = dict_vect.transform({'bad': True})
_, ind = np.where(x.todense())
classifier.coef_[0,ind]


Out[28]:
array([-1.09908401])

In [29]:
# find the probability for a specific tweet
classifier.predict_proba(X)


Out[29]:
array([[ 0.06994618,  0.93005382]])

Using the sklearn pipeline to group the last two steps:


In [30]:
pipeline.predict_proba(features)


Out[30]:
array([[ 0.06994618,  0.93005382]])

We see two numbers: the first one is the probability of the tweet being sad, the second one is the probability of the tweet being happy.


In [31]:
# note that:
pipeline.predict_proba(features).sum()


Out[31]:
1.0

Putting it all together:

We will use the class TweetClassifier from TwSentiment.py that puts this whole process together for us:
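
In essence, such a wrapper only chains the tokenizer, the feature extractor and the pipeline's predict_proba. A sketch of what it could look like (the actual TweetClassifier may differ):

import numpy as np

class TweetClassifierSketch:
    def __init__(self, pipeline, tokenizer, feature_extractor):
        self.pipeline = pipeline
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def classify_text(self, text):
        # accept a single string or a list of strings
        texts = [text] if isinstance(text, str) else text
        feats = [self.feature_extractor(self.tokenizer.tokenize(t))
                 for t in texts]
        probs = self.pipeline.predict_proba(feats)
        labels = np.where(probs[:, 1] > 0.5, 'pos', 'neg')
        return labels, probs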


In [32]:
from TwSentiment import TweetClassifier

In [33]:
twClassifier = TweetClassifier(pipeline,
                              tokenizer=tokenizer,
                              feature_extractor=bag_of_words_and_bigrams)

In [34]:
# example
text = 'Hi all, I am very happy today'
twClassifier.classify_text(text)


Out[34]:
('pos', array([ 0.06994618,  0.93005382]))

In [35]:
# the classify_text method also accepts a list of texts as input
twClassifier.classify_text(['great day today!', 'bad day today...'])
# try also:
# twClassifier.classify_text(['not sad', 'not happy'])


Out[35]:
(array(['pos', 'neg'], 
       dtype='<U3'), array([[ 0.14707178,  0.85292822],
        [ 0.90656453,  0.09343547]]))

We can now classify our tweets:


In [36]:
emo_class, prob = twClassifier.classify_text(tweet_df.text.tolist())

In [37]:
# add the result to the dataframe

In [38]:
tweet_df['pos_class'] = (emo_class == 'pos')
tweet_df['pos_prob'] = prob[:,1]

In [39]:
tweet_df.head()


Out[39]:
text tweet_id user_id pos_class pos_prob
0 'This Isn't AI' 860217132763754497 211638860 False 0.316765
1 RT @IoTRecruiting: How will Cognitive Computin... 860217137058635776 3226000831 True 0.814357
2 Trou* https://t.co/FlQdwMFbmh 860217138908475397 3070524046 True 0.631905
3 RT @InvestorIdeas: https://t.co/ZFdyj2RNXV - #... 860217141827518464 837339093214318593 False 0.416576
4 DeepStack: Expert-level artificial intelligenc... 860217143756857344 766378262259986437 True 0.818437

In [40]:
# plot the distribution of probabilities
import matplotlib.pyplot as plt
%matplotlib inline
h = plt.hist(tweet_df.pos_prob, bins=50)


We want to classify users based on the class of their tweets. Pandas allows us to easily group tweets per user using the groupby method of DataFrames:


In [41]:
user_group = tweet_df.groupby('user_id')

In [42]:
print(type(user_group))


<class 'pandas.core.groupby.DataFrameGroupBy'>

In [43]:
# let's look at one of the groups
groups = user_group.groups
uid = list(groups.keys())[5]
user_group.get_group(uid)


Out[43]:
text tweet_id user_id pos_class pos_prob
85 RT @hackaday: Google AIY: Artificial Intellige... 860217395314647040 787345 True 0.885622

In [44]:
# we need to make a function that takes the group of tweets of a single user and returns the class of that user
def get_user_emo(group):
    num_pos = group.pos_class.sum()
    num_tweets = group.pos_class.size
    if num_pos/num_tweets > 0.5:
        return 'pos'
    elif num_pos/num_tweets < 0.5:
        return 'neg'
    else:
        return 'NA'

In [45]:
# apply the function to each group
user_df = user_group.apply(get_user_emo)

In [46]:
# This is a pandas Series where the index is the user_id
user_df.head(10)


Out[46]:
user_id
785        pos
12774      pos
63433      pos
685063     neg
779302     pos
787345     pos
994761     pos
1246421    neg
1294621    pos
1308181    pos
dtype: object
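
Note that the same result can be obtained more compactly with a groupby aggregation, along these lines (a sketch):

# fraction of positive tweets per user, mapped to a class
frac_pos = tweet_df.groupby('user_id')['pos_class'].mean()
user_emo = frac_pos.apply(
    lambda f: 'pos' if f > 0.5 else ('neg' if f < 0.5 else 'NA'))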

Let's add this information to the graph we created earlier.


In [48]:
import networkx as nx

G = nx.read_graphml('twitter_lcc_AI2.graphml', node_type=int)

for n in G.nodes_iter():
    if n in user_df.index:
        # here we look at the value of the user_df series at the position where the index 
        # is equal to the user_id of the node
        G.node[n]['emotion'] = user_df.loc[user_df.index == n].values[0]

In [49]:
# nodes whose user_id appears in user_df now have an 'emotion' attribute
# (this particular node was not classified, so it only has its 'name')
G.node[n]


Out[49]:
{'name': 'Kasparov63'}

In [50]:
# save the graph to open it with Gephi
nx.write_graphml(G, 'twitter_lcc_emo_AI2.graphml')

In [ ]: