Sentiment Analysis on Facebook Posts

This project uses sentiment analysis to study how people react to particular controversial Facebook posts. To do this, several classifiers were trained on labelled datasets, and the one with the best performance was chosen for the study.

1. Data Acquisition

We used two main datasets for training:

  1. Sentiment Analysis Dataset: 1.6 million labelled sentences, with '0' for negative and '1' for positive. This dataset is used to train a 2-class classifier.

    Sample:

         [['0' 'I missed the New Moon trailer...']
         ['1' 'omg its already 7:30 :O']
         ['0' ".. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just 
         get a crown put on (30mins)..."]
         ['0' '         i think mi bf is cheating on me!!!       T_T']
         ['0' '         or i just worry too much?        ']]
  2. Amazon Dataset: 30k Amazon product reviews from users, together with the rating each user gave to the product. This dataset is used to train a multi-class classifier and regressors.

    Sample:

           [[2, "It is good if you have internet than you can download the stuff, else, you can't  "],
    
           [5, "The RIO rocks! It is so great that Diamond Multimedia prevailed in their fight against the forces of pure evil in the music industry and allowed us, the public, to have the RIO! This little baby holds your MP3's and plays  them with outrageous quality and no moving parts! You simply cannot make  the music "skip". Take it jogging, bob sledding, whatever! The  Rio is cute and compact, battery lasts forever, runs great and is really  simple to use. Works well with the PC linkup, etc. A hot item!  "],
    
           [4, 'I had high hopes for the Diamond Rio and it certainly lived up 2 the hype. Lightweight and excellent quality with some good connecting software. My only gripe can be with memory. You definately need another 32Mb to store  your music. If you want one, my advice is to wait for the new upgraded  version with 64Mb and a graphic equaliser!  '], [5, "Diamond's RIO is the current, silicon-state nightmare for monopolistic entertainment industries. When the first audio recording-devices entered the consumer market decades ago, the idea of a controlled  "charge-per-copy" business model in the music industry was  doomed. Although traditional copyrights could never be totally enforced,  the record and music-producing industry neglected these threads existing in  the shades of multi-billion profits.<p>Mp3 and the Internet raided the  existing markets with their "natural" power like a cruise-missile  against a frying-pan.  Like other good examples for  "Killer-Apps", the RIO is designed to Web-Specs (mp3). This makes  its use and performance comparable with other CD-quality playing devices  while putting the power of the internet into the palm of your hand. If you  want to make a statement of "being digital", the RIO is a must!  "],
    
           [5, "Remember when the Walkman hit the market years ago?  At first it was fad, then it became a craze before finally becoming as ubiquitous as taxi cabs in Manhattan.  The Diamond Rio is the first product with the potential to  eventually REPLACE the Walkman.  Think about it - portable music that  sounds as clear as a compact disc from a device that looks and feels like a  pager.  Unshockable, no matter how much you jump around or how bumpy the  bus ride.  This is the perfect gift for your favorite "gadget  junkie", as long as they have a PC to download from.  Pssst...here's a  secret:  you can record your own CD's into MP3 format and download the  songs into your RIO - it isn't just for internet music!!  Great product.  "]]
  3. Facebook Dataset: In order to study the reactions of the Facebook audience to a post, we needed to scrape the comments of that post.

Here are parts of the 3 functions used to retrieve the information from Facebook.

postReactions(postId, filename): Takes the post id and the name of the dataset we want to generate. Returns a table with: the post message, the post id, the post time, the number of likes of the post, the number of comments, and, for each comment, its id, message, time, and number of likes.

lastPostsReactions(page, filename, n): Takes the page id, the name of the dataset we want to generate, and the minimum number of posts we want from the page. It returns the same output as postReactions(), but for several posts.

getAllComments(postId, nb_comments_per_post, serie, fb): This function is used by the 2 functions above to get all the comments of a post. Indeed, if there are more than 25 comments on a post, we need to paginate through the comments to recover all of them.


In [ ]:
def postReactions(postId, filename):
    ...
    fields = 'id,created_time,message,likes.limit(0).summary(1),comments.limit(0).summary(1)'
    url = 'https://graph.facebook.com/{}?fields={}&access_token={}'.format(postId, fields, token)
    
    fb = pd.DataFrame(columns=['post message','post id','post time','post likes','nb of comments','comment id', 'comment message', 'comment time', 'comment likes'])# 'user name']) #'age', 'gender','location','political','religion','education'])
    serie={'post message':[],'post id':[],'post time':[],'post likes':[],'nb of comments':[],'comment id':[],'comment message':[],'comment time':[], 'comment likes':[]}#,'user name':[]} # 'age':[], 'gender':[],'location':[],'political':[],'religion':[],'education':[]};
  
    post = requests.get(url).json()
    ...      
        try:
            # Only work with posts with comments which have text.
            nb_comments_per_post=post['comments']['summary']['total_count']
             
            #IndexError if no comment on the page, only work with posts
            # which have at least 1 comment
            x= post['comments']['data'][0]['message']
        
            serie['post message']=post_message
            serie['post time']=post['created_time'] 
            serie['post likes']=post['likes']['summary']['total_count']
            serie['nb of comments']= post['comments']['summary']['total_count']
            serie['post id']=post['id']
        
            fb = getAllComments(postId,nb_comments_per_post,serie,fb)
                                     
        except (IndexError, KeyError):
            # Skip posts whose comments carry no text.
            pass
    except KeyError:
        # The post is missing one of the requested fields.
        pass
    ...
        
    fb['post time'] = fb['post time'].apply(convert_time)
    fb['comment time'] = fb['comment time'].apply(convert_time)
    
    ...
    return fb
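
convert_time() is not shown in the excerpt above. A minimal sketch of what it could look like, assuming the ISO 8601 timestamps the Graph API returns (the real implementation may differ):


In [ ]:
from datetime import datetime

def convert_time(fb_time):
    # Hypothetical sketch: the Graph API returns created_time strings such as
    # '2016-12-20T18:01:25+0000'; parse them into timezone-aware datetimes.
    return datetime.strptime(fb_time, '%Y-%m-%dT%H:%M:%S%z')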

In [ ]:
def lastPostsReactions(page,filename,n):
    ...
    i=0
    while i < n: #len(fb) < n:
    
        posts = requests.get(url).json()
        
        # extract information for each of the received post
        for post in posts['data']:
           
           [...]
        try:
            url = posts['paging']['next']
            #print('next')
        except KeyError:
            #print('no next')
            break  

    print("Number of posts: ",i)
                            
    return fb

In [ ]:
def getAllComments(postId,nb_comments_per_post,serie,fb):
    fields_comments = 'comment_count,like_count,created_time,message'
    url_comments = 'https://graph.facebook.com/{}/comments/?fields={}&access_token={}'.format(postId, fields_comments, token)
    
    comment = 0
    # Walk the comment pages until every comment has been collected.
    while comment <= nb_comments_per_post + 1:
        post_comments=requests.get(url_comments).json()
        i=0
        for com in post_comments['data']:
            i=i+1
            comment_message=com['message']
            serie['comment message'] = comment_message
            serie['comment time'] = com['created_time']
            serie['comment likes'] =  com['like_count']
            serie['comment id']=com['id']
            fb = fb.append(serie, ignore_index=True)
            comment=comment+1
        try:
            url_comments = post_comments['paging']['next']
        except KeyError:
            
            break
    
    return fb
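
Following the 'paging' → 'next' URL until it disappears is how the Graph API exposes result pages, since each response only carries one page of comments (25 by default).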

2. Feature Extraction

For the classification, the comments are transformed into vectors that represent the features of the text. This is the bag-of-words model: each word is mapped to an integer index, and the vector counts how many times that word appears in the document.
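
As a toy illustration of the model (not taken from the project code):


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny documents; each column of the result counts one vocabulary word.
docs = ['I love this post', 'I hate hate this post']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names())  # ['hate', 'love', 'post', 'this']
print(vectors.toarray())               # [[0 1 1 1], [2 0 1 1]]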

  • Removing stopwords: using the English stopword list.

  • SelectKBest: selects the K features most related to the labels, scored with the chi-squared (chi2) statistic.


In [ ]:
def compute_bag_of_words(text, stopwords, vocab=None):
    vectorizer = CountVectorizer(stop_words=stopwords, vocabulary=vocab)
    vectors = vectorizer.fit_transform(text)
    vocabulary = vectorizer.get_feature_names()
    return vectors, vocabulary

KBestModel = SelectKBest(chi2, k=1000).fit(bow, Y_sat)  
indices = KBestModel.get_support(True)
bow_transformed = KBestModel.transform(bow)
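
The informative features listed below can be read off the selector's support indices. A minimal sketch, assuming vocabulary is the list returned by compute_bag_of_words() alongside bow:


In [ ]:
# Hypothetical sketch: map the selected column indices back to their words.
selected_features = sorted(vocabulary[i] for i in indices)
print(selected_features[:20])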

Some of the informative features of our dataset:

  • 'adorable'
  • 'afraid'
  • 'agree'
  • 'attacked'
  • 'awesome'
  • 'awesomeness'
  • 'awful'
  • 'aww'
  • 'badly'
  • 'beautiful'
  • 'celebrate'
  • 'depressed'
  • 'failed'
  • 'fucked'
  • 'happiness'
  • 'jealous'
  • 'peace'
  • 'success'
  • 'suck'
  • 'waste'

In [ ]:
def Transform_To_Input_Format_SAT_Classifiers(X):
    with codecs.open('Best_Features_SAT.txt','r',encoding='utf8') as f:
        features = f.readlines()
    features = [x.strip("\n") for x in features]
    X_transformed,vocab = compute_bag_of_words(X, stopwords.words(),features)
    return X_transformed,features

def Transform_To_Input_Format_Amazon(X):
    with open('Best_Features_Amazon.txt') as f:
        features = f.readlines()
    features = [x.strip("\n") for x in features]
    X_transformed,vocab = compute_bag_of_words(X, stopwords.words(),features)
    return X_transformed
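
A usage sketch, assuming comments holds a list of scraped Facebook comment strings:


In [ ]:
# Hypothetical usage: vectorize the same comments for both model families.
X_sat_input, sat_features = Transform_To_Input_Format_SAT_Classifiers(comments)
X_amazon_input = Transform_To_Input_Format_Amazon(comments)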

Classifiers


In [ ]:
partial_fit_classifiers = {
    'SGD': SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
        eta0=0.0, fit_intercept=True, l1_ratio=0,
        learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
        penalty='l1', power_t=0.5, random_state=None, shuffle=True,
        verbose=0, warm_start=False),
    'Perceptron': Perceptron(),
    'NB Multinomial': MultinomialNB(alpha=0.01),
    'Passive-Aggressive': PassiveAggressiveClassifier(C=1.0, n_iter=50, shuffle=True, 
                                                      verbose=0, loss='hinge',
                                                      warm_start=False),
    'NB Bernoulli': BernoulliNB(alpha=0.01),
}
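
All of these estimators, like the two regressors below, implement partial_fit, which is what makes the incremental learning described later possible.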

Regressors


In [ ]:
partial_fit_Regressors = {
    'SGD Regressor':SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001, l1_ratio=0, 
                                 fit_intercept=True, n_iter=1000, shuffle=True, verbose=0, epsilon=0.01, random_state=None,
                                 learning_rate='invscaling', eta0=0.01, power_t=0.25, warm_start=False, average=False),
    'Passive-Aggressive Regressor' : PassiveAggressiveRegressor(),
    }

Incremental learning

The dataset is too large to train the classifiers on in a single pass (doing so raises a MemoryError), which is why we used incremental learning; a sketch combining the steps is shown after the learning process below.

Libraries used:

1. scikit-learn
2. NumPy
3. SciPy


Learning Process:

  1. Divide Data into Batches

     minibatch_size = 10000
     batch = bow_transformed[start:start+minibatch_size]
  2. Partially train the classifier on the batch

     cls.partial_fit(X_train, Y_train, classes = y_all)
  3. Cross-Validation

     kf = KFold(n_splits = 10)
     for train_index,test_index in kf.split(X):
         X_train,X_test = X[train_index],X[test_index]
         Y_train,Y_test = Y[train_index],Y[test_index]
  4. Accuracy Check:

    Classifiers:

     100*sklearn.metrics.accuracy_score(Y_test, train_pred)
    

    Regressors:

     sklearn.metrics.mean_squared_error(Y_test, train_pred)
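
Putting the steps together, a minimal sketch of one training pass (assuming bow_transformed and Y_sat from the feature-extraction step are NumPy/SciPy arrays, and taking a single KFold split for brevity):


In [ ]:
import numpy as np
import sklearn.metrics
from sklearn.model_selection import KFold

minibatch_size = 10000
y_all = np.unique(Y_sat)  # every class label, required by the first partial_fit

kf = KFold(n_splits=10)
train_index, test_index = next(kf.split(bow_transformed))
X_train, X_test = bow_transformed[train_index], bow_transformed[test_index]
Y_train, Y_test = Y_sat[train_index], Y_sat[test_index]

for name, cls in partial_fit_classifiers.items():
    # Feed the training fold to the classifier one mini-batch at a time.
    for start in range(0, X_train.shape[0], minibatch_size):
        X_batch = X_train[start:start + minibatch_size]
        Y_batch = Y_train[start:start + minibatch_size]
        cls.partial_fit(X_batch, Y_batch, classes=y_all)
    test_pred = cls.predict(X_test)
    print(name, 100 * sklearn.metrics.accuracy_score(Y_test, test_pred))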

Results on the SAT dataset

Results on the Amazon dataset

Analysis on Facebook data

Analysis on posts

Theme: Terrorism

The Orlando shooting

CNN article about Orlando shooting victims suing Twitter and Facebook


In [23]:
#CNN article about orlando shooting's victims suing twitter and facebook

#page='cnninternational'
#postId= '10154812470324641'
#fb= postReactions(postId,1,'CNN_OS1')
filename='facebookCNN_OS1.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  1.0

CNN article about John McCain saying that Obama is responsible for the Orlando shooting


In [25]:
#postId= '10154219265779641'
#fb= postReactions(postId,1,'CNN_OS2')
filename='facebookCNN_OS2.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  3.0

CNN article about gun sales after the Orlando shooting


In [26]:
#postId= '10154210843679641'
#fb= postReactions(postId,1,'CNN_OS3')
filename='facebookCNN_OS3.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  1.0

Theme: Abortion

CNN article about Lena Dunham's abortion remarks


In [28]:
#postId= '10154817313449641'
#fb= postReactions(postId,1,'CNN_Ab1')
filename='facebookCNN_Ab1.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  5.0

CNN article about the Texas fetus burial obligation


In [30]:
#postId= '10154738564069641'
#fb= postReactions(postId,1,'CNN_Ab2')
filename='facebookCNN_Ab2.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  5.0

CNN article about the Pope's forgiveness of abortion


In [32]:
#postId= '10154708071624641'
#fb= postReactions(postId,1,'CNN_Ab3')
filename='facebookCNN_Ab3.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  5.0

Theme: Political

Brexit

David Cameron speaking before the Brexit vote


In [33]:
#page = 'DavidCameronOfficial'
#postId= '1216426805048302'
#fb= postReactions(postId,1,'DC1')
filename='facebookDC1.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  4.0

David Cameron's 'thank you' note to the voters


In [34]:
#postId= '1218229621534687'
#fb= postReactions(postId,0,'DC2')
filename='facebookDC2.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  4.0

US presidential election

Hillary Clinton asking for votes (before election)


In [35]:
#page = 'hillaryclinton'
#postId= '1322830897773436'
#fb= postReactions(postId,1,'HC3')
filename='facebookHC3.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  5.0

Donald Trump speaks about the defeat of the Democrats


In [45]:
#page = 'DonaldTrump'
#postId= '10158423167515725'
#fb= postReactions(postId,0,'DT3')
filename='facebookDT3.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  5.0

Donald Trump speaks about Obamacare


In [48]:
#postId= '10158417912085725'
#fb= postReactions(postId,0,'DT4')
filename='facebookDT4.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  3.0

Theme: Environmental

Dakota Pipeline

AJ+ post on Native Americans fighting against the Dakota Access pipeline


In [49]:
#page='ajplusenglish'
#postId='843345759140266'
#fb= postReactions(postId,1,'ENV1')
filename='facebookENV1.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  5.0

AJ+ post on celebrations after the U.S. Army Corps of Engineers refused to grant an easement allowing the Dakota Access Pipeline to go under Lake Oahe


In [50]:
#postId='852012618273580'
#fb= postReactions(postId,0,'ENV2')
filename='facebookENV2.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  5.0

Donald Trump and environmental issues

AJ+ post on the hackers who are preserving environmental data before Trump takes office


In [51]:
#postId='864548553686653'
#fb= postReactions(postId,1,'ENV3')
filename='facebookENV3.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  5.0

AJ+ post on Donald Trump's pick to head the Environmental Protection Agency not believing in man-made climate change.


In [52]:
#postId='855199717954870'
#fb= postReactions(postId,1,'ENV4')
filename='facebookENV4.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  5.0

Theme: Homosexuality

CNN post on the Pope saying Christians should apologize to gay people


In [53]:
#page='cnninternational'
#postId='10154249259664641'
#fb= postReactions(postId,1,'HM1')
filename='facebookHM1.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  1
Post sentiment:  5.0

CNN post on the fact that more than half of British Muslims think homosexuality should be illegal


In [54]:
#postId='10154056181289641'
#fb= postReactions(postId,1,'HM2')
filename='facebookHM2.sqlite'
getAllGraphsForPost(filename,50)


Post sentiment:  0
Post sentiment:  5.0

Analysis on pages

CNN


In [37]:
#page='cnninternational'
#lastPostsReactions(page,'CNN',20)
filename='facebookCNN.sqlite'
fb = getTable(filename)
getAllGraphsForPage(filename,50)
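
getTable() is not shown in this notebook. A minimal sketch of what it could look like, assuming the scraped DataFrames were written to each SQLite file under a table named 'facebook' (the real table name may differ):


In [ ]:
import sqlite3
import pandas as pd

def getTable(filename):
    # Hypothetical sketch: read the stored comment table back into pandas.
    with sqlite3.connect(filename) as conn:
        return pd.read_sql_query('SELECT * FROM facebook', conn)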