Sentiment Analysis

Jose Manuel Vera Aray

Import libraries to be used



In [2]:

    
import numpy as np
import itertools
import math
import pandas as pd
import csv
import time
from sklearn.cross_validation import train_test_split, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import learning_curve
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, confusion_matrix
from sklearn.grid_search import GridSearchCV
from sklearn.utils import shuffle
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)









    



/resources/common/.virtualenv/python3/lib/python3.4/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/resources/common/.virtualenv/python3/lib/python3.4/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Import training data



In [3]:

    
with open("/resources/data/classified_tweets.txt", "r",encoding="utf8") as myfile:
     data = myfile.readlines()

Separate tweets into two sets: tweet data and its respective classification



In [4]:

    
X=[]
y=[]
for x in data:
    X.append(x[1:])
    y.append(x[0])

Split the data into the training set and test set for crossvalidation



In [5]:

    
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size = 0.3)

Create a pipeline for each classifier algorithm

.....vectorizer => transformer => classifier



In [6]:

    
#Logistic Regression
Log_clf=Pipeline([('vect', CountVectorizer(analyzer='word')), ('tfidf', TfidfTransformer()), ('clf', LogisticRegression())])
#Multinomial Naive Bayes
MNB_clf=Pipeline([('vect', CountVectorizer(analyzer='word')), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

Parameter tuning

We will use GridSearchCV, which exhaustively considers all parameter combinations, to find the best model for the data. A search consists of:

an estimator (regressor or classifier )
a parameter space;
a method for searching or sampling candidates;
a cross-validation scheme;
a score function, such as accurracy_score()

Set parameters we will test for each algorithm



In [7]:

    
parameters_log = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False), 'clf__penalty': ['l1','l2'], 'clf__solver':['liblinear']}
parameters_mnb = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1,1e-2, 1e-3)}

Set models and do the search of the parameter



In [8]:

    
#Set models
acc_scorer = make_scorer(accuracy_score)
gs_log_clf = GridSearchCV(Log_clf, parameters_log, n_jobs=-1, scoring=acc_scorer)
gs_mnb_clf = GridSearchCV(MNB_clf, parameters_mnb, n_jobs=-1, scoring=acc_scorer)
# Grid search of best parameters
print ("-----Tunning of parameters-----")
start = time.time()
gs_log_clf = gs_log_clf.fit(X_train, y_train)
end = time.time()
print ("Logistic Regression -"," Running Time:", end - start,"s")
start = time.time()
gs_mnb_clf= gs_mnb_clf.fit(X_train, y_train)
end = time.time()
print ("Multinomial Naive Bayes -"," Running Time:", end - start,"s")









    



-----Tunning of parameters-----
Logistic Regression -  Running Time: 50.627103328704834 s
Multinomial Naive Bayes -  Running Time: 47.09324789047241 s

Set the classifiers to the have the best combination of parameters



In [9]:

    
Log_clf= gs_log_clf .best_estimator_
MNB_clf= gs_mnb_clf .best_estimator_

Train algorithms



In [10]:

    
start = time.time()
Log_clf = Log_clf.fit(X_train, y_train)
end = time.time()
print ("Logistic Regression -"," Running Time:", end - start,"s")
start = time.time()
MNB_clf = MNB_clf.fit(X_train, y_train)
end = time.time()
print ("Multinomial Naive Bayes -"," Running Time:", end - start,"s")









    



Logistic Regression -  Running Time: 18.095815181732178 s
Multinomial Naive Bayes -  Running Time: 13.500214576721191 s

Predict on the test set and check metrics



In [11]:

    
predicted_Log = Log_clf.predict(X_test)
predicted_MNB =MNB_clf.predict(X_test)
dec_Log = Log_clf.decision_function(X_test)
dec_MNB =MNB_clf.predict_proba(X_test)

Perform k-fold



In [12]:

    
def run_kfold(clf):
    
    #run KFold with 10 folds instead of the default 3
    #on the 200000 records in the training_data
    kf = KFold(200000, n_folds=10,shuffle=True)
    X_new=np.array(X)
    y_new=np.array(y)
    outcomes = []
    fold = 0
    
    for train_index, test_index in kf:
        fold += 1
        X1_train, X1_test = X_new[train_index], X_new[test_index]
        y1_train, y1_test = y_new[train_index], y_new[test_index]
        
        clf.fit(X1_train, y1_train)
        predictions = clf.predict(X1_test)
        
        accuracy = accuracy_score(y1_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))   
        
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome))



In [13]:

    
run_kfold(Log_clf)









    



Fold 1 accuracy: 0.8042
Fold 2 accuracy: 0.80495
Fold 3 accuracy: 0.80105
Fold 4 accuracy: 0.80675
Fold 5 accuracy: 0.80745
Fold 6 accuracy: 0.80215
Fold 7 accuracy: 0.80555
Fold 8 accuracy: 0.7991
Fold 9 accuracy: 0.80035
Fold 10 accuracy: 0.79935
Mean Accuracy: 0.8030900000000001



In [14]:

    
run_kfold(MNB_clf)









    



Fold 1 accuracy: 0.78955
Fold 2 accuracy: 0.78545
Fold 3 accuracy: 0.78645
Fold 4 accuracy: 0.79135
Fold 5 accuracy: 0.78725
Fold 6 accuracy: 0.78975
Fold 7 accuracy: 0.7871
Fold 8 accuracy: 0.7897
Fold 9 accuracy: 0.7837
Fold 10 accuracy: 0.7938
Mean Accuracy: 0.7884099999999999

Plot Confusion Matrix



In [15]:

    
Log_matrix=confusion_matrix(y_test, predicted_Log)
Log_matrix=Log_matrix[::-1, ::-1]
MNB_matrix=confusion_matrix(y_test, predicted_MNB)
MNB_matrix=MNB_matrix[::-1, ::-1]



In [16]:

    
def plot_confusion_matrix(cm, classes,  normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i,round(cm[i, j],2), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')



In [17]:

    
plot_confusion_matrix(Log_matrix, classes=["Positive","Negative"], normalize=True, title='Normalized confusion matrix')









    



Normalized confusion matrix
[[ 0.7897111   0.2102889 ]
 [ 0.18944089  0.81055911]]



In [18]:

    
plot_confusion_matrix(MNB_matrix, classes=["Positive","Negative"], normalize=True,  title='Normalized confusion matrix')









    



Normalized confusion matrix
[[ 0.7197917   0.2802083 ]
 [ 0.14853774  0.85146226]]

Print the metrics of the performance of the algorithms



In [19]:

    
X_test_new_Log=[]
X_test_new_MNB=[]
y_new=[]
Log_list=predicted_Log.tolist()
MNB_list=predicted_MNB.tolist()
for x in Log_list:
    X_test_new_Log.append(int(x))
for x in MNB_list:
    X_test_new_MNB.append(int(x))
for x in y_test:
    y_new.append(int(x))


Log_list_prediction=[1 if x==4 else x for x in X_test_new_Log]
MNB_list_prediction=[1 if x==4 else x for x in X_test_new_MNB]
target_new=[1 if x==4 else x for x in y_new]

print('Metrics Logistic Regression')
print('-------------------------------------')
print("Accuracy:",accuracy_score(target_new, Log_list_prediction))
print("Recall:",recall_score(target_new,Log_list_prediction))
print("Precision:",precision_score(target_new, Log_list_prediction))
print("F1 Score:",f1_score(target_new, Log_list_prediction))
print('' '')
print('Metrics Multinomial Naive Bayes')
print('-------------------------------------')
print("Accuracy:",accuracy_score(target_new, MNB_list_prediction))
print("Recall:",recall_score(target_new,MNB_list_prediction))
print("Precision:",precision_score(target_new, MNB_list_prediction))
print("F1 Score:",f1_score(target_new, MNB_list_prediction))









    



Metrics Logistic Regression
-------------------------------------
Accuracy: 0.800083333333
Recall: 0.789711101529
Precision: 0.808070866142
F1 Score: 0.798785499807

Metrics Multinomial Naive Bayes
-------------------------------------
Accuracy: 0.7853
Recall: 0.719791701217
Precision: 0.83034245265
F1 Score: 0.771125008884

ROC Curves



In [20]:

    
predicted_Log_new=[]
y_actual=[]
for x in dec_Log:
    predicted_Log_new.append(int(x))
for x in y_test:
    y_actual.append(int(x))

Log_list_prediction=[1 if x==4 else x for x in predicted_Log_new]
target_new=[1 if x==4 else x for x in  y_actual]

fpr, tpr, thresholds=roc_curve(target_new, Log_list_prediction, pos_label=1)
roc_auc= auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.3f)' % roc_auc )
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.legend(loc="lower right")
plt.show()
plt.savefig('Log_roc.jpg')









    












    





<matplotlib.figure.Figure at 0x7f62e1682e80>



In [21]:

    
predicted_MNB_new=[]
y_actual=[]
for x in range(0,60000):
    predicted_MNB_new.append(dec_MNB[x][1])
#for x in dec_MNB:
#    predicted_MNB_new.append(int(x))
for x in y_test:
    y_actual.append(int(x))

#MNB_list_prediction=[1 if x==4 else x for x in predicted_MNB_new]
target_new=[1 if x==4 else x for x in  y_actual]

fpr, tpr, thresholds=roc_curve(target_new, predicted_MNB_new)
roc_auc= auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.3f)' % roc_auc )
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multinomial Naive Bayes ROC Curve')
plt.legend(loc="lower right")
plt.show()
plt.savefig('MNB_roc.jpg')









    












    





<matplotlib.figure.Figure at 0x7f62e3e9d940>

Predict on unclassified data



In [22]:

    
with open("/resources/data/unclassified_tweets.txt", "r",encoding="utf8") as myfile:
     unclass_data = myfile.readlines()

Predict sentiment



In [23]:

    
MNB_clf = MNB_clf.fit(X_train, y_train)
predicted_MNB_unclass =MNB_clf.predict(unclass_data)

Categorize data by political party



In [24]:

    
def party(tw):
    """For NDP is chosen the name of the candidate of that party in various forms and its campaign slogan
        For Liberals is chosen the name of the candidate of that party in various forms and its campaign slogan
        For Conservatives was chosen the name of the candidate of that partym and associations related with the party (tcot,ccot),
            nickname used by them (like tory) and the bill that was inroduced by the conservative government (c51)"""
    tw_clean=tw.split()
    hashtags=[]
    NDP_list=['tommulcair','mulcair','ndp','tm4pm', 'ready4change','thomasmulcair']
    Lib_list=['justin', 'trudeau\'s','lpc','trudeau','realchange','liberal','liberals','justintrudeau','teamtrudeau']
    Cons_list=['c51','harper','cpc','conservative', 'tory','tcot','stephenharper','ccot','harper','conservatives']
    for x in range(0,len(tw_clean)):
        if tw_clean[x].find('#')!= -1:
            hashtags.append(tw_clean[x].replace('#',''))
    result=''
    if hashtags:
        for x in hashtags:
            if x in NDP_list:
                result= 'NDP'
                return result
            elif x in Lib_list:
                result= 'Liberal'
                return result
            elif x in Cons_list:
                result= 'Conservative'
                return result
    if result=='':
        result='Other'
        return result



In [25]:

    
party_af=[]
for x in range(0,len(unclass_data)):
    party_af.append(party(unclass_data[x]))

Put data into a dataframe



In [26]:

    
predictions=[]
for x in range(0,len(unclass_data)):
    predictions.append((unclass_data[x],party_af[x],predicted_MNB_unclass[x]))
tweets=pd.DataFrame(predictions, columns=["Tweet","Party","Classification - MNB"])
tweets.head()









    Out[26]:






  
    
      
      Tweet
      Party
      Classification - MNB
    
  
  
    
      0
      living the dream. #cameraman #camera #camerac...
      Other
      0
    
    
      1
      justin #trudeau's reasons for thanksgiving. to...
      Liberal
      4
    
    
      2
      @themadape   butt…..butt…..we’re allergic to l...
      Other
      4
    
    
      3
      2 massive explosions at peace march in #turkey...
      Other
      4
    
    
      4
      #mulcair suggests there’s bad blood between hi...
      NDP
      4



In [27]:

    
def get_sent(tweet):
    classf = tweet
    return 'Positive' if classf=='4' else 'Negative'
tweets_clean= tweets[tweets.Party !="Other"]
tweets_clean['Sentiment'] = tweets_clean['Classification - MNB'].apply(get_sent)
tweets_clean.drop(labels=['Classification - MNB'], axis=1, inplace=True)
tweets_clean.head()









    



/usr/local/lib/python3.4/dist-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/usr/local/lib/python3.4/dist-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






    Out[27]:






  
    
      
      Tweet
      Party
      Sentiment
    
  
  
    
      1
      justin #trudeau's reasons for thanksgiving. to...
      Liberal
      Positive
    
    
      4
      #mulcair suggests there’s bad blood between hi...
      NDP
      Positive
    
    
      5
      #polqc on se sort de la marde avec #harper et ...
      Conservative
      Positive
    
    
      16
      #ready4change #ndp #tm4pm fb.me/53nxi25ue \n
      NDP
      Negative
    
    
      17
      can you believe the guy who vows to spiral can...
      Liberal
      Negative



In [28]:

    
sns.countplot(x='Sentiment', hue="Party", data=tweets_clean)









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f62e5dd8940>

Get some information about the predictions



In [29]:

    
print ("Number of tweets of classfied for each party")
tweets_clean.Party.value_counts().head()









    



Number of tweets of classfied for each party






    Out[29]:





Liberal         520
Conservative    368
NDP             343
Name: Party, dtype: int64



In [30]:

    
print ("Number of tweets of classfied for each party")
tweets_clean[tweets_clean.Sentiment=='Positive'].Party.value_counts().head()









    



Number of tweets of classfied for each party






    Out[30]:





Liberal         398
Conservative    239
NDP             236
Name: Party, dtype: int64



In [31]:

    
print ("Number of tweets of classfied for each party")
tweets_clean[tweets_clean.Sentiment=='Negative'].Party.value_counts().head()









    



Number of tweets of classfied for each party






    Out[31]:





Conservative    129
Liberal         122
NDP             107
Name: Party, dtype: int64

	Tweet	Party	Classification - MNB
0	living the dream. #cameraman #camera #camerac...	Other	0
1	justin #trudeau's reasons for thanksgiving. to...	Liberal	4
2	@themadape butt…..butt…..we’re allergic to l...	Other	4
3	2 massive explosions at peace march in #turkey...	Other	4
4	#mulcair suggests there’s bad blood between hi...	NDP	4

	Tweet	Party	Sentiment
1	justin #trudeau's reasons for thanksgiving. to...	Liberal	Positive
4	#mulcair suggests there’s bad blood between hi...	NDP	Positive
5	#polqc on se sort de la marde avec #harper et ...	Conservative	Positive
16	#ready4change #ndp #tm4pm fb.me/53nxi25ue \n	NDP	Negative
17	can you believe the guy who vows to spiral can...	Liberal	Negative