This notebook implements an English-language tweet sentiment classifier.we are going to get a goodaccuracy on the test data containing positive and negative sentiment tweets.Training and test data was downloaded from here : http://help.sentiment140.com/for-students/


In [46]:
#from matplotlib import pyplot as plt
#import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

from sklearn.pipeline import Pipeline

In [47]:
# name out col's
columns = ['polarity', 'tweetid', 'date', 'query_name', 'user', 'text']

dftrain = pd.read_csv(r'datasets\trainingandtestdata\training.1600000.processed.noemoticon.csv',
                      header = None,
                      encoding ='ISO-8859-1')
dftest = pd.read_csv(r'datasets\trainingandtestdata\testdata.manual.2009.06.14.csv',
                     header = None,
                     encoding ='ISO-8859-1')
dftrain.columns = columns
dftest.columns = columns

dftrain


Out[47]:
polarity tweetid date query_name user text
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....
5 0 1467811372 Mon Apr 06 22:20:00 PDT 2009 NO_QUERY joy_wolf @Kwesidei not the whole crew
6 0 1467811592 Mon Apr 06 22:20:03 PDT 2009 NO_QUERY mybirch Need a hug
7 0 1467811594 Mon Apr 06 22:20:03 PDT 2009 NO_QUERY coZZ @LOLTrish hey long time no see! Yes.. Rains a...
8 0 1467811795 Mon Apr 06 22:20:05 PDT 2009 NO_QUERY 2Hood4Hollywood @Tatiana_K nope they didn't have it
9 0 1467812025 Mon Apr 06 22:20:09 PDT 2009 NO_QUERY mimismo @twittera que me muera ?
10 0 1467812416 Mon Apr 06 22:20:16 PDT 2009 NO_QUERY erinx3leannexo spring break in plain city... it's snowing
11 0 1467812579 Mon Apr 06 22:20:17 PDT 2009 NO_QUERY pardonlauren I just re-pierced my ears
12 0 1467812723 Mon Apr 06 22:20:19 PDT 2009 NO_QUERY TLeC @caregiving I couldn't bear to watch it. And ...
13 0 1467812771 Mon Apr 06 22:20:19 PDT 2009 NO_QUERY robrobbierobert @octolinz16 It it counts, idk why I did either...
14 0 1467812784 Mon Apr 06 22:20:20 PDT 2009 NO_QUERY bayofwolves @smarrison i would've been the first, but i di...
15 0 1467812799 Mon Apr 06 22:20:20 PDT 2009 NO_QUERY HairByJess @iamjazzyfizzle I wish I got to watch it with ...
16 0 1467812964 Mon Apr 06 22:20:22 PDT 2009 NO_QUERY lovesongwriter Hollis' death scene will hurt me severely to w...
17 0 1467813137 Mon Apr 06 22:20:25 PDT 2009 NO_QUERY armotley about to file taxes
18 0 1467813579 Mon Apr 06 22:20:31 PDT 2009 NO_QUERY starkissed @LettyA ahh ive always wanted to see rent lov...
19 0 1467813782 Mon Apr 06 22:20:34 PDT 2009 NO_QUERY gi_gi_bee @FakerPattyPattz Oh dear. Were you drinking ou...
20 0 1467813985 Mon Apr 06 22:20:37 PDT 2009 NO_QUERY quanvu @alydesigns i was out most of the day so didn'...
21 0 1467813992 Mon Apr 06 22:20:38 PDT 2009 NO_QUERY swinspeedx one of my friend called me, and asked to meet ...
22 0 1467814119 Mon Apr 06 22:20:40 PDT 2009 NO_QUERY cooliodoc @angry_barista I baked you a cake but I ated it
23 0 1467814180 Mon Apr 06 22:20:40 PDT 2009 NO_QUERY viJILLante this week is not going as i had hoped
24 0 1467814192 Mon Apr 06 22:20:41 PDT 2009 NO_QUERY Ljelli3166 blagh class at 8 tomorrow
25 0 1467814438 Mon Apr 06 22:20:44 PDT 2009 NO_QUERY ChicagoCubbie I hate when I have to call and wake people up
26 0 1467814783 Mon Apr 06 22:20:50 PDT 2009 NO_QUERY KatieAngell Just going to cry myself to sleep after watchi...
27 0 1467814883 Mon Apr 06 22:20:52 PDT 2009 NO_QUERY gagoo im sad now Miss.Lilly
28 0 1467815199 Mon Apr 06 22:20:56 PDT 2009 NO_QUERY abel209 ooooh.... LOL that leslie.... and ok I won't ...
29 0 1467815753 Mon Apr 06 22:21:04 PDT 2009 NO_QUERY BaptisteTheFool Meh... Almost Lover is the exception... this t...
... ... ... ... ... ... ...
1599970 4 2193578196 Tue Jun 16 08:38:54 PDT 2009 NO_QUERY adbillingsley Thanks @eastwestchic & @wangyip Thanks! Th...
1599971 4 2193578237 Tue Jun 16 08:38:54 PDT 2009 NO_QUERY gekkko @marttn thanks Martin. not the most imaginativ...
1599972 4 2193578269 Tue Jun 16 08:38:54 PDT 2009 NO_QUERY millerslab @MikeJonesPhoto Congrats Mike Way to go!
1599973 4 2193578319 Tue Jun 16 08:38:55 PDT 2009 NO_QUERY luckygeorgeblog http://twitpic.com/7jp4n - OMG! Office Space.....
1599974 4 2193578345 Tue Jun 16 08:38:55 PDT 2009 NO_QUERY Kristah_Diggs @yrclndstnlvr ahaha nooo you were just away fr...
1599975 4 2193578347 Tue Jun 16 08:38:55 PDT 2009 NO_QUERY CoachChic @BizCoachDeb Hey, I'm baack! And, thanks so m...
1599976 4 2193578348 Tue Jun 16 08:38:55 PDT 2009 NO_QUERY serianna @mattycus Yeah, my conscience would be clear i...
1599977 4 2193578386 Tue Jun 16 08:38:55 PDT 2009 NO_QUERY TeamUKskyvixen @MayorDorisWolfe Thats my girl - dishing out t...
1599978 4 2193578395 Tue Jun 16 08:38:55 PDT 2009 NO_QUERY LaurenMoo10 @shebbs123 i second that
1599979 4 2193578576 Tue Jun 16 08:38:57 PDT 2009 NO_QUERY angel_sammy04 In the garden
1599980 4 2193578679 Tue Jun 16 08:38:56 PDT 2009 NO_QUERY puchal_ek @myheartandmind jo jen by nemuselo zrovna té ...
1599981 4 2193578716 Tue Jun 16 08:38:57 PDT 2009 NO_QUERY youtubelatest Another Commenting Contest! [;: Yay!!! http:/...
1599982 4 2193578739 Tue Jun 16 08:38:57 PDT 2009 NO_QUERY Mandi_Davenport @thrillmesoon i figured out how to see my twee...
1599983 4 2193578758 Tue Jun 16 08:38:57 PDT 2009 NO_QUERY xoAurixo @oxhot theri tomorrow, drinking coffee, talkin...
1599984 4 2193578847 Tue Jun 16 08:38:57 PDT 2009 NO_QUERY RobFoxKerr You heard it here first -- We're having a girl...
1599985 4 2193578982 Tue Jun 16 08:38:58 PDT 2009 NO_QUERY LISKFEST if ur the lead singer in a band, beware fallin...
1599986 4 2193579087 Tue Jun 16 08:38:58 PDT 2009 NO_QUERY marhgil @tarayqueen too much ads on my blog.
1599987 4 2193579092 Tue Jun 16 08:38:58 PDT 2009 NO_QUERY cathriiin @La_r_a NEVEER I think that you both will get...
1599988 4 2193579191 Tue Jun 16 08:38:59 PDT 2009 NO_QUERY tellman @Roy_Everitt ha- good job. that's right - we g...
1599989 4 2193579211 Tue Jun 16 08:38:59 PDT 2009 NO_QUERY jazzstixx @Ms_Hip_Hop im glad ur doing well
1599990 4 2193579249 Tue Jun 16 08:38:59 PDT 2009 NO_QUERY razzberry5594 WOOOOO! Xbox is back
1599991 4 2193579284 Tue Jun 16 08:38:59 PDT 2009 NO_QUERY AgustinaP @rmedina @LaTati Mmmm That sounds absolutely ...
1599992 4 2193579434 Tue Jun 16 08:39:00 PDT 2009 NO_QUERY sdancingsteph ReCoVeRiNg FrOm ThE lOnG wEeKeNd
1599993 4 2193579477 Tue Jun 16 08:39:00 PDT 2009 NO_QUERY ChloeAmisha @SCOOBY_GRITBOYS
1599994 4 2193579489 Tue Jun 16 08:39:00 PDT 2009 NO_QUERY EvolveTom @Cliff_Forster Yeah, that does work better tha...
1599995 4 2193601966 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY AmandaMarie1028 Just woke up. Having no school is the best fee...
1599996 4 2193601969 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY TheWDBoards TheWDB.com - Very cool to hear old Walt interv...
1599997 4 2193601991 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY bpbabe Are you ready for your MoJo Makeover? Ask me f...
1599998 4 2193602064 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY tinydiamondz Happy 38th Birthday to my boo of alll time!!! ...
1599999 4 2193602129 Tue Jun 16 08:40:50 PDT 2009 NO_QUERY RyanTrevMorris happy #charitytuesday @theNSPCC @SparksCharity...

1600000 rows × 6 columns


In [48]:
class RegexPreprocess(object):
   
    """Create a preprocessing module for a tweet or data structure of tweets.
    1) replace username, e.g., @crawles -> USERNAME
    2) replace http links -> URL
    3) replace repeated letters to two letters
    """
    
    user_pat = '(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z]+[A-Za-z0-9]+)'
    http_pat = '(https?:\/\/(?:www\.|(?!www))[^\s\.]+\.[^\s]{2,}|www\.[^\s]+\.[^\s]{2,})'
    repeat_pat, repeat_repl = "(.)\\1\\1+",'\\1\\1'
    
    def __init__(self):
        pass
    
    def transform(self, X):
        is_pd_series = isinstance(X, pd.core.frame.Series)
        if not is_pd_series:
            pp_text = pd.Series(X)
        else:
            pp_text = X
        pp_text = pp_text.str.replace(pat = self.user_pat, repl = 'USERNAME')
        pp_text = pp_text.str.replace(pat = self.http_pat, repl = 'URL')
        pp_text.str.replace(pat = self.repeat_pat, repl = self.repeat_repl)
        return pp_text
        
    def fit(self, X, y=None):
        return self

Training and testing the model


In [51]:
sentiment_lr = Pipeline([('regex_preprocess', RegexPreprocess()),
                         ('count_vect', CountVectorizer(min_df = 100,
                                                        ngram_range = (1,1),
                                                        stop_words = 'english')), 
                         ('lr', LogisticRegression())])
sentiment_lr.fit(dftrain.text, dftrain.polarity)


Out[51]:
Pipeline(steps=[('regex_preprocess', <__main__.RegexPreprocess object at 0x00000116EEC4B7F0>), ('count_vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, mi...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [52]:
cv=CountVectorizer()
xtraintestcv=cv.fit_transform(["hi man how are you,you are cazy","i hope you are good"])
xtraintestcv.toarray()


Out[52]:
array([[2, 1, 0, 1, 0, 1, 1, 2],
       [1, 0, 1, 0, 1, 0, 0, 1]], dtype=int64)

In [50]:
cv.get_feature_names()


Out[50]:
['are', 'cazy', 'good', 'hi', 'hope', 'how', 'man', 'you']

In [53]:
Xtest, ytest = dftest.text[dftest.polarity!=2], dftest.polarity[dftest.polarity!=2]

#Xtest,ytest=dftest.text,dftest.polarity

print(classification_report(ytest,sentiment_lr.predict(Xtest)))


             precision    recall  f1-score   support

          0       0.86      0.81      0.83       177
          4       0.82      0.87      0.85       182

avg / total       0.84      0.84      0.84       359


In [54]:
import dill
f = open('twitter_sentiment_model.pkl','wb')
r = RegexPreprocess()
dill.dump(sentiment_lr, f)
f.close()

In [56]:
# test
f = open('twitter_sentiment_model.pkl','rb')
cl = dill.load(f)
#print(classification_report(ytest,cl.predict(Xtest)))

print(cl.predict_proba("Hello big beautiful world"))

print(cl.predict_proba("you are too bad man"))
f.close()


[[ 0.07064946  0.92935054]]
[[ 0.78401794  0.21598206]]

In [ ]: