Businesses and organizations around the world know that the first requirement for success is a happy customer base. For identifying customer sentiment, the microblogging service Twitter, with its enormous base of active users, is a fount of knowledge. The most recent release of SAS Viya adds recurrent neural network (RNN) layers to the DeepLearning action set, which makes this kind of analysis possible. This recipe shows a pipeline for analyzing sentiment in Twitter data using word embeddings and RNNs in SAS. The overall structure of this document and a small amount of the text come from [1].
[1] https://github.com/sassoftware/sas-viya-programming/tree/master/deeplearning/fashion-mnist
In [1]:
import swat
from IPython.display import display

# Raise a Python exception whenever a CAS action reports an error
swat.options.cas.exception_on_severity = 2

# Connect to CAS (update the host, port, and authinfo path for your site)
s = swat.CAS('rdcgrd075.unx.sas.com', 3217, authinfo=r'/u/saleem/.authinfo')
s.loadactionset('deeplearn')
Out[1]:
Semantic word embeddings, the vector encodings of the meaning of words, are the basis of deep learning for text analytics.
In this recipe, we use the public domain GloVe embeddings trained on Twitter, available at [2]. We have made a few changes to the format of the GloVe file for this work.
We remove all words containing non-ASCII characters to make the file more lightweight, as the tweets themselves are ASCII.
We also remove the double-quote character ("), which is special in SAS, and change the delimiter from spaces to tabs:
cat glove.twitter.100d.txt | grep -v \" | grep -Pv "[^\x00-\x7F]" | tr ' ' '\t' > glove.twitter.100d.clean.txt
We include the modified GloVe file as part of this recipe.
[2] https://nlp.stanford.edu/projects/glove/
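To get an intuition for what these vectors encode, here is a minimal sketch, separate from the pipeline, that computes the cosine similarity between two embedding vectors. It assumes the tab-delimited miniglove.tsv included with this recipe, and that the example words appear in it:

import numpy as np
import pandas as pd

# Token in the first column, the feature values after it
emb = pd.read_csv('miniglove.tsv', sep='\t', header=None, index_col=0)

def cosine(a, b):
    # Cosine similarity: near 1.0 for vectors pointing the same way
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Related words should score noticeably higher than unrelated ones
print(cosine(emb.loc['good'].values, emb.loc['great'].values))
print(cosine(emb.loc['good'].values, emb.loc['monday'].values))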
In [2]:
import os
# An example embeddings file (a subset of the GloVe Twitter vectors)
GLOVE_PATH = 'miniglove.tsv'
DELIMITER = "\t"
dims = 100  # dimensionality of each word vector
In [3]:
# Upload the embeddings to CAS: one varchar token column followed by doubles
glove = s.CASTable('glove', replace=True)
glove = s.upload_file(GLOVE_PATH,
casout=glove,
importoptions=dict(fileType='csv',
delimiter="\t",
varChars=True,
getNames=False,
vars=[dict(type='varchar')]+[dict(type='double')]*dims))
Direct distribution of Twitter text violates the Twitter terms of service [3]. The appropriate approach is to distribute the data in dehydrated form: we may distribute the tweet ids along with our annotations, but not the text. Using [4], you can download the text yourself through the Twitter API with the command below, right from this notebook. The download takes about twelve hours on our machine.
[3] https://twitter.com/en/tos
[4] https://github.com/aritter/twitter_download
In [4]:
!git clone https://github.com/aritter/twitter_download.git
The Twitter download tool requires an access token. You can get one by applying for a Twitter developer account [5]. Once you have an account, register an app to obtain your consumer key and secret key, then update twitter_download/download_tweets_api.py with them and run it. The script will open a web browser for you to log in with your Twitter credentials, and it saves a file with your private keys so you only need to log in once. Now you can download the data.
[5] https://developer.twitter.com
In [5]:
!python twitter_download/download_tweets_api.py --dist emoji_sentiment_data_dehydrated.tsv --output emoji_sentiment_data_rehydrated.tsv
Once we've downloaded the data, we clean it. To create this dataset, we collected 20,000 tweets containing the ":)" emoticon and 20,000 containing ":(", and labeled them positive and negative respectively. We then removed tweets containing foul language, which left roughly 37,500 tweets. This is a noisy way to label data, and you would likely get more accurate labels by sampling all tweets and labeling them manually; nevertheless, it's an excellent way to get a lot of sentiment data quickly under an unrestrictive license. Since we don't want our sentiment analysis tool to simply learn to detect the presence of a smiley or frowny face, we scrub the data of these two emoticons. We also normalize whitespace runs to single spaces and remove all non-ASCII characters and quotation marks to avoid confusing the software.
In [6]:
import re
import pandas as pd
path = "emoji_sentiment_data_rehydrated.tsv"
df = pd.read_csv(path,
delimiter="\t",
names=['_Document_', '_Target_', 'slice', 'text'],
skiprows=1)
def clean(tweet):
    # Strip non-ASCII characters
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
    # Collapse whitespace runs to a single space
    tweet = re.sub(r"\s+", ' ', tweet)
    # Drop a leading quotation mark, then remove apostrophes
    tweet = re.sub(r"^[\"']", "", tweet)
    tweet = tweet.replace("'", "")
    # Scrub the ":)" and ":(" emoticons so the model can't cheat
    return re.sub(r"(:\)+)|:\(+", "", tweet)
df['text'] = df['text'].apply(clean)
When a user deletes a tweet or makes it private, it can no longer be downloaded, so publicly distributed Twitter data decays over time. Fortunately, it's simple to get a quick measure of how much data has been lost.
In [7]:
import html
count = len(df)
# Tweets that could not be rehydrated come back as "Not Available"
lost_count = len(df[df['text'] == "Not Available"])
print("{:.1%} of data deleted or made private".format(lost_count/count))
df = df[df['text'] != "Not Available"]
for slize in ['train', 'dev', 'test']:
print("{} datapoints in {}".format(len(df[df['slice'] == slize]), slize))
The second preprocessing step coerces the data into the format used by the GloVe Twitter embeddings. Here we use an included Python Twitter normalization tool based on the Ruby script provided by the GloVe team [4].
In [8]:
from twitter_glove import normalize
def preprocess(tweet):
    # Lowercase and apply the GloVe-style Twitter normalization
    return normalize(tweet).lower()
df['text'] = df['text'].apply(html.unescape)
df['text'] = df['text'].apply(preprocess)
df = df.drop_duplicates(subset='_Document_')
reviews_train = s.CASTable('reviews_train.csv', replace=True)
reviews_train = s.upload_frame(df, casout=reviews_train)
In [9]:
for slize in ['train', 'dev']:
print(slize)
print(df[df['slice'] == slize]['_Target_'].value_counts(True))
print()
This step separates the text into tokens, then presents the result in a table that applyWordVector can use. The table has three columns:
_Term_: the token.
_Start_: the position of the token in the document. This is used to sort the terms, as applyWordVector is designed for use in a parallel environment and cannot rely on inputs coming to it in order.
_Document_: the document id. Because the input is given as a single table, this is what separates one document from another.
In [10]:
print(len(df))
df_cleaned = df
df_cols = {
    "_Term_": [],
    "_Start_": [],
    "_Document_": []
}
for i in df_cleaned.index:
    # Split on spaces and drop empty strings
    term = list(filter(None, df_cleaned['text'].loc[i].split(" ")))
    df_cols["_Term_"].extend(term)
    df_cols["_Start_"].extend(range(len(term)))
    df_cols["_Document_"].extend([df_cleaned['_Document_'].loc[i]] * len(term))
tokenized_df = pd.DataFrame.from_dict(df_cols)[['_Term_',
                                                '_Start_',
                                                '_Document_']]
out_offset = s.CASTable('out_offset', replace=True)
out_offset = s.upload_frame(tokenized_df, casout=out_offset)
In [11]:
tokenized_df.head()
Out[11]:
In [12]:
# How miniglove.tsv was generated: keep only the GloVe rows whose token
# appears in the corpus, to keep the recipe's example file small
# vocab = set(tokenized_df['_Term_'].values)
# glove = pd.read_csv(GLOVE_PATH, sep=DELIMITER, header=None)
# miniglove = glove[glove[0].isin(vocab)]
# len(miniglove)/len(glove)
# miniglove.to_csv('miniglove.tsv', sep=DELIMITER, index=False, header=False)
A simple out-of-vocabulary test is a good sanity check for mismatches with the GloVe embedding file. The rate will change due to Twitter decay, but it should stay roughly between 0.01 and 0.03.
In [13]:
import pandas as pd
vocab = set([item[0] for item in pd.read_csv(
GLOVE_PATH, sep=DELIMITER, header=None, usecols=[0]).values])
tokenized_df['_Term_'].apply(lambda word: word not in vocab).mean()
Out[13]:
In [14]:
s.loadactionset('textparse')
embedded = s.CASTable('embedded', replace=True)
# Swap each token for its GloVe vector; the output has one row per document,
# with the embedded sequence flattened across the _F columns
s.textparse.applyWordVector(
    model=glove,
    offset=out_offset,
    casout=embedded
)
embedded.head()
Out[14]:
Column naming: _Fn_e holds the eth feature of the nth token in the sequence.
In [15]:
embedding_columns = [column for column in embedded.columns if column.startswith('_F')]
len(embedding_columns)
Out[15]:
This step merges the target data, which applyWordVector does not carry through, back onto the newly embedded sentences. The applyWordVector action's output metadata does not match that of the original CAS table, so we start by clearing the column formats. Then we simply call SWAT's merge to combine the two tables.
In [16]:
import time
# Clear the column formats so the merge does not trip over mismatched metadata
format_clearer = [dict(name=column, format="") for column in embedded.columns]
embedded.table.alterTable(columns=format_clearer)
start_time = time.time()
embedded_with_additional_data = s.CASTable('embedded_with_additional_data', replace=True)
reviews_train.merge(
embedded,
on="_Document_",
casout=embedded_with_additional_data
)
print(time.time() - start_time)
In [17]:
embedded_with_additional_data[['text', '_Target_']].head()
Out[17]:
To understand a recurrent neural network, imagine a single fully connected neural network that is applied to each step of a sequence. The difference is that the hidden layer's output at time t is also fed in as input at time t+1.
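Before looking at the SAS layers, here is a minimal NumPy sketch of that idea. It is a conceptual illustration only, not how SAS implements the layer; the weights and the 12-token sentence are random stand-ins:

import numpy as np

rng = np.random.default_rng(0)
dims, hidden = 100, 25  # matches the embedding size and layer width below

# One set of weights, reused at every time step
W = rng.normal(scale=0.1, size=(hidden, dims))    # input -> hidden
U = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden at t-1 -> hidden at t
b = np.zeros(hidden)

def rnn_step(x_t, h_prev):
    # The hidden state at time t sees the current input and the previous state
    return np.tanh(W @ x_t + U @ h_prev + b)

sequence = rng.normal(size=(12, dims))  # a stand-in 12-token sentence
h = np.zeros(hidden)
for x_t in sequence:
    h = rnn_step(x_t, h)
# h now carries information from the whole sequence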
One challenge of the recurrent neural network in its simplest form is that it tends to forget prior input quickly. There are a couple of different approaches to this issue, Long Short-Term Memory networks (LSTMs) being the best known [6]. We use another approach, the Gated Recurrent Unit (GRU) [7]. The GRU requires fewer parameters than the LSTM, making it conceptually simpler and less computationally expensive to train, and it tends to give comparable results [8]. Both approaches use "gates" to explicitly control the rate at which old information is forgotten and new information is incorporated.
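Here is a NumPy sketch of a single GRU step, following the standard formulation with biases omitted for brevity. Again, this is an illustration under those assumptions, not SAS's internal implementation:

import numpy as np

rng = np.random.default_rng(1)
dims, hidden = 100, 25

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Three weight pairs: one each for the update gate, reset gate, and candidate
p = {name: rng.normal(scale=0.1, size=shape)
     for name, shape in [('Wz', (hidden, dims)), ('Uz', (hidden, hidden)),
                         ('Wr', (hidden, dims)), ('Ur', (hidden, hidden)),
                         ('Wh', (hidden, dims)), ('Uh', (hidden, hidden))]}

def gru_step(x_t, h_prev):
    z = sigmoid(p['Wz'] @ x_t + p['Uz'] @ h_prev)              # update gate
    r = sigmoid(p['Wr'] @ x_t + p['Ur'] @ h_prev)              # reset gate
    h_tilde = np.tanh(p['Wh'] @ x_t + p['Uh'] @ (r * h_prev))  # candidate
    # z controls how much old state is kept versus overwritten
    return (1 - z) * h_prev + z * h_tilde

h = np.zeros(hidden)
for x_t in rng.normal(size=(12, dims)):
    h = gru_step(x_t, h)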
When we generate a representation for each word in a sentence to represent words in their context, a recurrent neural network that reads from left to right will only include context from the left of each word. A bidirectional recurrent neural network (BiRNN) resolves this by using two recurrent neural networks, one operating from left to right, the other from right to left. The final output is the concatenation of these two representations.
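Sketched the same way, the bidirectional idea is just two passes and a concatenation. The weights are shared across directions here purely for brevity; a real BiRNN trains separate weights for each direction:

import numpy as np

rng = np.random.default_rng(2)
dims, hidden = 100, 25
W = rng.normal(scale=0.1, size=(hidden, dims))
U = rng.normal(scale=0.1, size=(hidden, hidden))

def run(sequence):
    # Any recurrent cell works here; a vanilla RNN step keeps it short
    h = np.zeros(hidden)
    states = []
    for x_t in sequence:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.array(states)  # one hidden state per token

sentence = rng.normal(size=(12, dims))
forward = run(sentence)                # context from the left only
backward = run(sentence[::-1])[::-1]   # context from the right, realigned
bidirectional = np.concatenate([forward, backward], axis=1)
print(bidirectional.shape)  # (12, 50): both directions for every token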
For sentiment analysis, we import the embedded data, then use a variable number of BiRNN layers to generate a contextualized representation of the word sequence, which we then feed into a forward RNN and take the last hidden state as a summary of the sentence. We feed this to a fully connected neural network to get the output.
[6] http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[7] https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
In [18]:
# Hyperparameters
settings = dict(
    n=25,                     # hidden units in each recurrent layer
    init='msra',              # weight initialization scheme
    bidirectional_layers=1,
    learning_rate=0.0005,
    step_size=20,             # epochs between learning-rate reductions
    thread_minibatch_size=1,
    max_epochs=40,
    fc_dropout=0.0,
    output_dropout=0.0,
    recurrent_dropout=0.0
)
In [19]:
sentiment = s.CASTable('sentiment', replace=True)
# Generate the model shell
s.buildmodel(model=sentiment, type='RNN')
# Drop the replace parameter so later actions reuse the table
# instead of trying to recreate it
del sentiment.params.replace
# Add the input layer
s.addlayer(model=sentiment, name='data', layer=dict(type='input'))
# Generate some number of bidirectional layers
# This loop will generate however many bidirectional layers are specified in settings
output = ['data']
for i in range(settings['bidirectional_layers']):
forward_birnn = 'birnn{}'.format(i)
backward_birnn = forward_birnn+'r'
s.addlayer(model=sentiment, name=forward_birnn, srclayers=output,
layer=dict(type='recurrent',
n=settings['n'],
init=settings['init'],
rnnType='GRU',
outputType='samelength',
dropout=settings['recurrent_dropout'],
reverse=False))
s.addlayer(model=sentiment, name=backward_birnn, srclayers=output,
layer=dict(type='recurrent',
n=settings['n'],
init=settings['init'],
rnnType='GRU',
outputType='samelength',
dropout=settings['recurrent_dropout'],
reverse=True))
output = [forward_birnn, backward_birnn]
# summary layer
s.addlayer(model=sentiment, name='frnn1', srclayers=output,
layer=dict(type='recurrent',
n=settings['n'],
init=settings['init'],
rnnType='GRU',
dropout=settings['recurrent_dropout'],
outputType='encoding'))
# output fully connected layer
s.addlayer(model=sentiment,
name='outlayer',
srclayers=['frnn1'],
layer=dict(type='output'))
Out[19]:
In [20]:
trained_weights = s.CASTable('trainedWeights', replace=True)
best_weights = s.CASTable('bestWeights', replace=True)
shuffled_embedded = s.CASTable('shuffled_embedded', replace=True)
# Shuffle so each training minibatch sees a mix of positive and negative tweets
s.shuffle(embedded_with_additional_data, casout=shuffled_embedded)
embedded_with_additional_data = shuffled_embedded
r = embedded_with_additional_data.query("slice EQ 'train'").dlTrain(
    model=sentiment,
    dataspecs=[
        # Feed the flattened embedding columns to the input layer: each token
        # is a block of `dims` values, with the true sequence length given
        # by the _sequence_length_ column
        dict(type='numericnominal',
             layer='data',
             data=embedding_columns,
             numnomParms=dict(
                 tokenSize=dims, length='_sequence_length_')),
        # The output layer predicts the nominal _Target_ column
        dict(type='numericnominal',
             layer='outlayer',
             data='_Target_',
             nominals='_Target_')
    ],
validtable=embedded_with_additional_data.query("slice EQ 'dev'"),
modelWeights=trained_weights,
bestWeights=best_weights,
optimizer=dict(
miniBatchSize=settings['thread_minibatch_size'],
maxEpochs=settings['max_epochs'],
loglevel=2,
algorithm=dict(method='adam',
beta1=0.9,
beta2=0.999,
gamma=0.5,
learningRate=settings['learning_rate'],
clipGradMax=100,
clipGradMin=-100,
stepSize=settings['step_size'],
lrPolicy='step'),
dropout=settings['output_dropout']),
seed=12345)
In [21]:
sentiment_scored = s.CASTable('sentiment_scored', replace=True)
# Score the held-out test slice with the best weights found during training
r = embedded_with_additional_data.query("slice EQ 'test'").dlScore(
modelTable=sentiment,
initWeights=best_weights,
copyVars=['_Target_', 'text'],
casOut=sentiment_scored,
bufferSize=2)
r
Out[21]:
Now that we have scored our data, we can analyze its errors. We can generate a confusion matrix in CAS using the crosstab action.
In [22]:
cmr = sentiment_scored.crosstab(row='_Target_', col='_DL_PredName_')
cmr.Crosstab
Out[22]:
We see here that negative tweets were mistaken for positive at roughly the same rate as positive for negative. To examine the errors in more detail, we can look at the misclassified tweets. Remember that our distant-supervision approach is noisy, so the ground-truth label may not always agree with your intuition. Here are a few negative tweets that were falsely classified as positive.
In [23]:
sentiment_scored.query("_Target_ EQ 'negative' AND _DL_PredName_ EQ 'positive'")['text'].head()
Out[23]:
And here are some positive tweets that were falsely classified as negative.
In [24]:
# Show the full tweet text instead of truncating long strings
pd.set_option('display.max_colwidth', None)
sentiment_scored.query("_Target_ EQ 'positive' AND _DL_PredName_ EQ 'negative'")['text'].head()
Out[24]:
Some work that a user of this recipe could do to further improve the results:
Collect more data. The advantage of the emoticon approach is that it's cheap. We used a simple version of it, but to get more tweets you could also match image emoji such as 😁 and other text variants such as :c) and (O8 (a hypothetical matcher for these is sketched after this list). Or, if you have the means, simply collect more tweets with the same smiles and frowns. We intentionally limited our dataset to make it quick to download.
We could also look for a cleaner way to collect data, since, especially among the misclassified negative tweets, we see some that would probably have received a different sentiment label from a human tagger. Keep in mind that human-labeled data is expensive.
Tweak the hyperparameters, build custom sentiment-aware embeddings, or change how the normalization is done. Keep in mind that if you change the normalization, you will likely have to generate your own embedding file.
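As a hypothetical starting point for the first suggestion above, the patterns below are illustrative only; they are not what we used to build this dataset, and the :c( variant is our own invention:

import re

# Illustrative emoticon patterns: classic smiles and frowns plus a few of
# the variants mentioned above (:c), (O8, and the 😁 emoji)
POSITIVE = re.compile(r"(:-?\)+)|(:c\))|(\(O8)|(😁)")
NEGATIVE = re.compile(r"(:-?\(+)|(:c\()")

print(bool(POSITIVE.search("made it home early :c)")))    # True
print(bool(NEGATIVE.search("missed the bus again :-(")))  # True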
Finally, remember that it's good manners to end your session when you are done with it.
In [25]:
s.terminate()