In this section we will see how to extract features from tweets and use a classifier to classify each tweet as positive or negative.
We will use pandas DataFrames (http://pandas.pydata.org/) to store and process the tweets. Pandas DataFrames are very powerful Python data structures, a bit like Excel spreadsheets with the power of Python.
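As a quick standalone illustration (with made-up data, not part of the tweet processing below), a DataFrame can be built from a list of dictionaries, one row per dictionary, and then queried column by column:

import pandas as pd

# toy example: two rows, two columns
toy_df = pd.DataFrame([{'name': 'alice', 'followers': 10},
                       {'name': 'bob', 'followers': 25}])
print(toy_df)
print(toy_df['followers'].mean())  # column-wise operations, like a spreadsheet formula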
In [4]:
# Let's create a DataFrame with each tweet using pandas
import pandas as pd
import json
import numpy as np

def getTweetID(tweet):
    """ If properly included, get the ID of the tweet """
    return tweet.get('id')

def getUserIDandScreenName(tweet):
    """ If properly included, get the tweet
        user ID and Screen Name """
    user = tweet.get('user')
    if user is not None:
        uid = user.get('id')
        screen_name = user.get('screen_name')
        return uid, screen_name
    else:
        return (None, None)

filename = 'AI2.txt'
# create a list of dictionaries with the data that interests us
tweet_data_list = []
with open(filename, 'r') as fopen:
    # each line corresponds to a tweet
    for line in fopen:
        if line != '\n':
            tweet = json.loads(line.strip('\n'))
            tweet_id = getTweetID(tweet)
            user_id = getUserIDandScreenName(tweet)[0]
            text = tweet.get('text')
            if tweet_id is not None:
                tweet_data_list.append({'tweet_id': tweet_id,
                                        'user_id': user_id,
                                        'text': text})

# put everything in a DataFrame
tweet_df = pd.DataFrame.from_dict(tweet_data_list)
In [5]:
print(tweet_df.shape)
print(tweet_df.columns)
# print the first 5 elements of one of the columns
print(tweet_df.text.iloc[:5])
# or
print(tweet_df['text'].iloc[:5])
In [6]:
# show the first 10 rows
tweet_df.head(10)
Out[6]:
This part uses concepts from Natural Language Processing. We will use a tweet tokenizer I built based on the TweetTokenizer from NLTK (http://www.nltk.org/). You can see how it works by opening the file TwSentiment.py. The goal is to process any tweet and extract a list of words, taking into account usernames, hashtags, URLs, emoticons and all the informal text we can find in tweets. We also want to reduce the number of features by applying some transformations, such as putting all the words in lower case.
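If you are curious about what the underlying NLTK TweetTokenizer does on its own, here is a minimal sketch (the extra normalization of mentions and URLs is added on top of it in TwSentiment.py):

from nltk.tokenize import TweetTokenizer

# NLTK's tokenizer already treats emoticons, hashtags and @mentions as single tokens
nltk_tok = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)
print(nltk_tok.tokenize('@someone this is SO cooooool!!! #AI http://example.com :-D'))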
In [7]:
from TwSentiment import CustomTweetTokenizer
In [8]:
tokenizer = CustomTweetTokenizer(preserve_case=False, # do not keep upper case letters
                                 reduce_len=True, # reduce repetitions of a letter to a maximum of three
                                 strip_handles=False, # do not remove usernames (@mentions)
                                 normalize_usernames=True, # replace all mentions with "@USER"
                                 normalize_urls=True, # replace all URLs with "URL"
                                 keep_allupper=True) # keep upper case for words that are entirely in upper case
In [9]:
# example
tweet_df.text.iloc[0]
Out[9]:
In [10]:
tokenizer.tokenize(tweet_df.text.iloc[0])
Out[10]:
In [9]:
# other examples
tokenizer.tokenize('Hey! This is SO cooooooooooooooooool! :)')
Out[9]:
In [11]:
tokenizer.tokenize('Hey! This is so cooooooool! :)')
Out[11]:
We will use the occurrence of words and pairs of words (bigrams) as features.
This corresponds to a bag-of-words representation (https://en.wikipedia.org/wiki/Bag-of-words_model): we simply count each word (or n-gram) without taking its order into account. For document classification, the frequency of occurrence of each word is usually taken as a feature. Tweets are so short that we can just count each word once.
Using pairs of words allows us to capture some of the context in which each word appears. This helps capture the correct meaning of words.
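To make this concrete, a feature extractor of this kind could look roughly like the following (a minimal sketch, not the exact implementation of bag_of_words_and_bigrams used below):

def simple_bag_of_words_and_bigrams(tokens):
    # mark each word, and each pair of consecutive words, as present (True),
    # ignoring word order and repetitions
    features = {word: True for word in tokens}
    for bigram in zip(tokens, tokens[1:]):
        features[bigram] = True
    return features

print(simple_bag_of_words_and_bigrams(['not', 'happy', 'today']))
# {'not': True, 'happy': True, 'today': True, ('not', 'happy'): True, ('happy', 'today'): True}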
In [12]:
from TwSentiment import bag_of_words_and_bigrams
# this will return a dictionary of features,
# we just list the features present in this tweet
bag_of_words_and_bigrams(tokenizer.tokenize(tweet_df.text.iloc[0]))
Out[12]:
You can download the classifier I trained here: https://www.dropbox.com/s/09rw6a85f7ezk31/sklearn_SGDLogReg_.pickle.zip?dl=1
I trained this classifier on this dataset: http://help.sentiment140.com/for-students/, following the approach from this paper: http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
This is a set of 14 million tweets with emoticons. Tweets containing "sad" emoticons (7 million) are considered negative and tweets with "happy" emoticons (7 million) are considered positive.
I used a Logistic Regression classifier with L2 regularization that I optimized with a 10-fold cross-validation, using the $F_1$ score as a metric.
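For reference, training a similar classifier yourself could look roughly like this (a sketch only: train_texts and train_labels are hypothetical names for the labelled tweets and their 0/1 classes, and the exact settings I used may differ):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# extract and vectorize the features of the training tweets
train_features = [bag_of_words_and_bigrams(tokenizer.tokenize(t)) for t in train_texts]
X_train = DictVectorizer().fit_transform(train_features)

# logistic regression trained by SGD with an L2 penalty
# (in recent scikit-learn versions the loss is called 'log_loss' instead of 'log')
logreg = SGDClassifier(loss='log', penalty='l2')

# choose the regularization strength alpha by 10-fold cross-validation on the F1 score
grid = GridSearchCV(logreg, {'alpha': [1e-6, 1e-5, 1e-4]}, scoring='f1', cv=10)
grid.fit(X_train, train_labels)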
In [14]:
# the classifier is saved in a "pickle" file
import pickle

with open('sklearn_SGDLogReg_.pickle', 'rb') as fopen:
    classifier_dict = pickle.load(fopen)
In [15]:
# classifier_dict contains the classifier and label mappers
# that I added so that we remember how the classes are
# encoded
classifier_dict
Out[15]:
The classifier is in fact contained in a pipeline. A sklearn pipeline allows you to assemble several transformations of your data (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
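For illustration, a two-step pipeline like this one could be assembled as follows (a sketch, not the code that produced the saved classifier):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

# calling .fit / .predict_proba on the pipeline applies the steps in order:
# first vectorize the feature dictionaries, then classify the resulting vectors
example_pipeline = Pipeline([('vectorizer', DictVectorizer()),
                             ('classifier', SGDClassifier(loss='log'))])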
In [16]:
pipeline = classifier_dict['sklearn_pipeline']
In our case we have two steps:
In [17]:
pipeline.steps
Out[17]:
In [18]:
# this is the step that transforms a list of textual features into a vector of zeros and ones
dict_vect = pipeline.steps[0][1]
In [19]:
dict_vect.feature_names_
Out[19]:
In [20]:
# number of features
len(dict_vect.feature_names_)
Out[20]:
In [21]:
# a little example
text = 'Hi all, I am very happy today'
# first tokenize
tokens = tokenizer.tokenize(text)
print(tokens)
# list features
features = bag_of_words_and_bigrams(tokens)
print(features)
# vectorize features
X = dict_vect.transform(features)
print(X.shape)
In [22]:
# X is a special, sparse kind of array: because it contains mostly zeros,
# it can be encoded to take much less space in memory.
# if we want to see it fully, we can use .toarray()
# number of non-zero values in X:
X.toarray().sum()
Out[22]:
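You can also inspect the sparse encoding directly, without converting X to a dense array:

print(type(X))  # a scipy sparse matrix
print(X.nnz)    # number of stored (non-zero) entries
print(X)        # (row, column) coordinates of the non-zero entries and their values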
The mapping between the list of features and the vector of zeros and ones is done when you train the pipeline with its .fit
method.
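A small standalone illustration of this fit-time mapping, with toy feature dictionaries:

from sklearn.feature_extraction import DictVectorizer

toy_vect = DictVectorizer()
# .fit_transform learns the feature -> column mapping and applies it
toy_X = toy_vect.fit_transform([{'good': True, 'day': True},
                                {'bad': True, 'day': True}])
print(toy_vect.feature_names_)  # ['bad', 'day', 'good']
print(toy_X.toarray())          # [[0., 1., 1.], [1., 1., 0.]]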
In [23]:
classifier = pipeline.steps[1][1]
In [24]:
classifier
Out[24]:
In [25]:
# access the weights of the logistic regression
classifier.coef_
Out[25]:
In [26]:
# we have as many weights as features
classifier.coef_.shape
Out[26]:
In [27]:
# plus the intercept
classifier.intercept_
Out[27]:
In [28]:
# let's check the weight associated with a given feature
x = dict_vect.transform({'bad': True})
_, ind = np.where(x.todense())
classifier.coef_[0,ind]
Out[28]:
In [29]:
# find the probability for a specific tweet
classifier.predict_proba(X)
Out[29]:
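To connect the weights with these probabilities: for a binary logistic regression, the probability of the positive class is the logistic sigmoid of the linear score $w \cdot x + b$. A quick sanity check with the X vector computed above (assuming the saved classifier is a plain logistic regression):

# linear score w.x + b, then the logistic sigmoid
score = X.dot(classifier.coef_.T) + classifier.intercept_
prob_pos = 1.0 / (1.0 + np.exp(-score))
print(prob_pos)  # should match the second column of classifier.predict_proba(X)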
Using the sklearn pipeline to chain the last two steps:
In [30]:
pipeline.predict_proba(features)
Out[30]:
We see two numbers: the first one is the probability of the tweet being negative (sad), the second one is the probability of it being positive (happy).
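To double-check which column corresponds to which class, you can look at the classes_ attribute of the classifier (the label mappers stored in classifier_dict are there to remember how these class values are encoded):

print(classifier.classes_)  # column order used by predict_proba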
In [31]:
# note that:
pipeline.predict_proba(features).sum()
Out[31]:
In [32]:
from TwSentiment import TweetClassifier
In [33]:
twClassifier = TweetClassifier(pipeline,
                               tokenizer=tokenizer,
                               feature_extractor=bag_of_words_and_bigrams)
In [34]:
# example
text = 'Hi all, I am very happy today'
twClassifier.classify_text(text)
Out[34]:
In [35]:
# the classify_text method also accepts a list of texts as input
twClassifier.classify_text(['great day today!', 'bad day today...'])
# try also:
# twClassifier.classify_text(['not sad', 'not happy'])
Out[35]:
In [36]:
emo_clas, prob = twClassifier.classify_text(tweet_df.text.tolist())
In [37]:
# add the results to the DataFrame
In [38]:
tweet_df['pos_class'] = (emo_clas == 'pos')
tweet_df['pos_prob'] = prob[:,1]
In [39]:
tweet_df.head()
Out[39]:
In [40]:
# plot the distribution of probability
import matplotlib.pyplot as plt
%matplotlib inline
h = plt.hist(tweet_df.pos_prob, bins=50)
We want to classify users based on the class of their tweets. Pandas allows us to easily group tweets per user using the groupby method of DataFrames:
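As a small aside, the grouped DataFrame also gives, for instance, the number of tweets per user directly:

# count how many tweets each user posted
tweets_per_user = tweet_df.groupby('user_id').size().sort_values(ascending=False)
print(tweets_per_user.head())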
In [41]:
user_group = tweet_df.groupby('user_id')
In [42]:
print(type(user_group))
In [43]:
# let's look at one of the group
groups = user_group.groups
uid = list(groups.keys())[5]
user_group.get_group(uid)
Out[43]:
In [44]:
# we need a function that takes the DataFrame of tweets grouped by user
# and returns the class of the user
def get_user_emo(group):
    num_pos = group.pos_class.sum()
    num_tweets = group.pos_class.size
    if num_pos/num_tweets > 0.5:
        return 'pos'
    elif num_pos/num_tweets < 0.5:
        return 'neg'
    else:
        return 'NA'
In [45]:
# apply the function to each group
user_df = user_group.apply(get_user_emo)
In [46]:
# This is a pandas Series where the index is the user_id
user_df.head(10)
Out[46]:
In [48]:
import networkx as nx

G = nx.read_graphml('twitter_lcc_AI2.graphml', node_type=int)
for n in G.nodes_iter():
    if n in user_df.index:
        # here we look at the value of the user_df Series at the position
        # where the index is equal to the user_id of the node
        G.node[n]['emotion'] = user_df.loc[user_df.index == n].values[0]
In [49]:
# we have added an attribute 'emotion' to the nodes
G.node[n]
Out[49]:
In [50]:
# save the graph to open it with Gephi
nx.write_graphml(G, 'twitter_lcc_emo_AI2.graphml')
In [ ]: