Businesses and organizations around the world know that the first requirement for success is a happy customer base. For identifying customer sentiment, the microblogging service Twitter, with its enormous base of active users, is a fount of knowledge. The most recent release of SAS Viya adds recurrent neural network (RNN) layers to the DeepLearning action set, which makes this kind of analysis possible. This recipe shows a pipeline for analyzing sentiment in Twitter data using word embeddings and RNNs in SAS. The overall structure of this document and a small amount of the text come from [1].
[1] https://github.com/sassoftware/sas-viya-programming/tree/master/deeplearning/fashion-mnist
In [1]:
import swat
from IPython.display import display

# Raise a Python exception whenever a CAS action reports an error
swat.options.cas.exception_on_severity = 2

# Connect to CAS (update the host, port, and authinfo path for your site)
s = swat.CAS('rdcgrd075.unx.sas.com', 3217, authinfo=r'/u/saleem/.authinfo')
s.loadactionset('deeplearn')
Out[1]:
Semantic word embeddings, the vector encodings of the meaning of words, are the basis of deep learning for text analytics.
In this recipe, we use the public domain GloVe embeddings trained on Twitter, available at [2]. We have made a few changes to the format of the GloVe file for this work.
We remove all words containing non-ASCII characters to make the file more lightweight, as the tweets themselves are ASCII.
We also remove the double-quote character ("), which is special in SAS, and change the delimiter from spaces to tabs:
cat glove.twitter.100d.txt | grep -v \" | grep -Pv "[^\x00-\x7F]" | tr ' ' '\t' > glove.twitter.100d.clean.txt
We include the modified GloVe file as part of this recipe.
[2] https://nlp.stanford.edu/projects/glove/
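To get an intuition for what these vectors encode, here is a minimal sketch, separate from the pipeline, that computes the cosine similarity between two embedding vectors. It assumes the tab-delimited miniglove.tsv included with this recipe, and that the example words appear in it:

import numpy as np
import pandas as pd

# Token in the first column, the feature values after it
emb = pd.read_csv('miniglove.tsv', sep='\t', header=None, index_col=0)

def cosine(a, b):
    # Cosine similarity: near 1.0 for vectors pointing the same way
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Related words should score noticeably higher than unrelated ones
print(cosine(emb.loc['good'].values, emb.loc['great'].values))
print(cosine(emb.loc['good'].values, emb.loc['monday'].values))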
In [2]:
import os
# An example embeddings file (a subset of the GloVe Twitter vectors)
GLOVE_PATH = 'miniglove.tsv'
DELIMITER = "\t"
dims = 100  # dimensionality of each word vector
In [3]:
# Upload the embeddings to CAS: one varchar token column followed by doubles
glove = s.CASTable('glove', replace=True)
glove = s.upload_file(GLOVE_PATH,
casout=glove,
importoptions=dict(fileType='csv',
delimiter="\t",
varChars=True,
getNames=False,
vars=[dict(type='varchar')]+[dict(type='double')]*dims))
Direct distribution of Twitter text violates the Twitter terms of service [3]. The appropriate approach is to distribute the data in dehydrated form: we may distribute the tweet ids along with our annotations, but not the text. Using [4], you can download the text yourself through the Twitter API with the command below, right from this notebook. The download takes about twelve hours on our machine.
[3] https://twitter.com/en/tos
[4] https://github.com/aritter/twitter_download
In [4]:
!git clone https://github.com/aritter/twitter_download.git
The Twitter download tool requires an access token. You can get one by applying for a Twitter developer account [5]. Once you have an account, register an app to obtain your consumer key and secret key, then update twitter_download/download_tweets_api.py with them and run it. The script will open a web browser for you to log in with your Twitter credentials, and it saves a file with your private keys so you only need to log in once. Now you can download the data.
[5] https://developer.twitter.com
In [5]:
!python twitter_download/download_tweets_api.py --dist emoji_sentiment_data_dehydrated.tsv --output emoji_sentiment_data_rehydrated.tsv
Once we've downloaded the data, we clean it. To create this dataset, we collected 20,000 tweets containing the ":)" emoticon and 20,000 containing ":(", and labeled them positive and negative respectively. We then removed tweets containing foul language, which left roughly 37,500 tweets. This is a noisy way to label data, and you would likely get more accurate labels by sampling all tweets and labeling them manually; nevertheless, it's an excellent way to get a lot of sentiment data quickly under an unrestrictive license. Since we don't want our sentiment analysis tool to simply learn to detect the presence of a smiley or frowny face, we scrub the data of these two emoticons. We also normalize whitespace runs to single spaces and remove all non-ASCII characters and quotation marks to avoid confusing the software.
In [6]:
import re
import pandas as pd
path = "emoji_sentiment_data_rehydrated.tsv"
df = pd.read_csv(path,
delimiter="\t",
names=['_Document_', '_Target_', 'slice', 'text'],
skiprows=1)
def clean(tweet):
    # Strip non-ASCII characters
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
    # Collapse whitespace runs to a single space
    tweet = re.sub(r"\s+", ' ', tweet)
    # Drop a leading quotation mark, then remove apostrophes
    tweet = re.sub(r"^[\"']", "", tweet)
    tweet = tweet.replace("'", "")
    # Scrub the ":)" and ":(" emoticons so the model can't cheat
    return re.sub(r"(:\)+)|:\(+", "", tweet)
df['text'] = df['text'].apply(clean)
When a user deletes a tweet or makes it private, it can no longer be downloaded, so publicly distributed Twitter data decays over time. Fortunately, it's simple to get a quick measure of how much data has been lost.
In [7]:
import html
count = len(df)
# Tweets that could not be rehydrated come back as "Not Available"
lost_count = len(df[df['text'] == "Not Available"])
print("{:.1%} of data deleted or made private".format(lost_count/count))
df = df[df['text'] != "Not Available"]
for slize in ['train', 'dev', 'test']:
print("{} datapoints in {}".format(len(df[df['slice'] == slize]), slize))
The second preprocessing step coerces the data into the format used by the GloVe Twitter embeddings. Here we use an included Python Twitter normalization tool based on the Ruby script provided by the GloVe team [4].
In [8]:
from twitter_glove import normalize
def preprocess(tweet):
    # Lowercase and apply the GloVe-style Twitter normalization
    return normalize(tweet).lower()
df['text'] = df['text'].apply(html.unescape)
df['text'] = df['text'].apply(preprocess)
df = df.drop_duplicates(subset='_Document_')
reviews_train = s.CASTable('reviews_train.csv', replace=True)
reviews_train = s.upload_frame(df, casout=reviews_train)
In [9]:
for slize in ['train', 'dev']:
print(slize)
print(df[df['slice'] == slize]['_Target_'].value_counts(True))
print()
This step separates the text into tokens, then presents the result in a table that applyWordVector can use. The table has three columns:
_Term_: the token.
_Start_: the position of the token in the document. This is used to sort the terms, as applyWordVector is designed for use in a parallel environment and cannot rely on inputs coming to it in order.
_Document_: the document id. Because the input is given as a single table, this is what separates one document from another.
In [10]:
print(len(df))
df_cleaned = df
df_cols = {
    "_Term_": [],
    "_Start_": [],
    "_Document_": []
}
for i in df_cleaned.index:
    # Split on spaces and drop empty strings
    term = list(filter(None, df_cleaned['text'].loc[i].split(" ")))
    df_cols["_Term_"].extend(term)
    df_cols["_Start_"].extend(range(len(term)))
    df_cols["_Document_"].extend([df_cleaned['_Document_'].loc[i]] * len(term))
tokenized_df = pd.DataFrame.from_dict(df_cols)[['_Term_',
                                                '_Start_',
                                                '_Document_']]
out_offset = s.CASTable('out_offset', replace=True)
out_offset = s.upload_frame(tokenized_df, casout=out_offset)
In [11]:
tokenized_df.head()
Out[11]:
In [12]:
# How miniglove.tsv was generated: keep only the GloVe rows whose token
# appears in the corpus, to keep the recipe's example file small
# vocab = set(tokenized_df['_Term_'].values)
# glove = pd.read_csv(GLOVE_PATH, sep=DELIMITER, header=None)
# miniglove = glove[glove[0].isin(vocab)]
# len(miniglove)/len(glove)
# miniglove.to_csv('miniglove.tsv', sep=DELIMITER, index=False, header=False)
A simple out-of-vocabulary test is a good sanity check for mismatches with the GloVe embedding file. The rate will change due to Twitter decay, but it should stay roughly between 0.01 and 0.03.
In [13]:
import pandas as pd
vocab = set([item[0] for item in pd.read_csv(
GLOVE_PATH, sep=DELIMITER, header=None, usecols=[0]).values])
tokenized_df['_Term_'].apply(lambda word: word not in vocab).mean()
Out[13]:
In [14]:
s.loadactionset('textparse')
embedded = s.CASTable('embedded', replace=True)
# Swap each token for its GloVe vector; the output has one row per document,
# with the embedded sequence flattened across the _F columns
s.textparse.applyWordVector(
    model=glove,
    offset=out_offset,
    casout=embedded
)
embedded.head()
Out[14]:
Column naming: _Fn_e holds the eth feature of the nth token in the sequence.
In [15]:
embedding_columns = [column for column in embedded.columns if column.startswith('_F')]
len(embedding_columns)
Out[15]:
This step merges the target data, which applyWordVector does not carry through, back onto the newly embedded sentences. The applyWordVector action's output metadata does not match that of the original CAS table, so we start by clearing the column formats. Then we simply call SWAT's merge to combine the two tables.
In [16]:
import time
# Clear the column formats so the merge does not trip over mismatched metadata
format_clearer = [dict(name=column, format="") for column in embedded.columns]
embedded.table.alterTable(columns=format_clearer)
start_time = time.time()
embedded_with_additional_data = s.CASTable('embedded_with_additional_data', replace=True)
reviews_train.merge(
embedded,
on="_Document_",
casout=embedded_with_additional_data
)
print(time.time() - start_time)
In [17]:
embedded_with_additional_data[['text', '_Target_']].head()
Out[17]:
To understand a recurrent neural network, imagine a single fully connected neural network that is applied to each step of a sequence. The difference is that the hidden layer's output at time t is also fed in as input at time t+1.
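Before looking at the SAS layers, here is a minimal NumPy sketch of that idea. It is a conceptual illustration only, not how SAS implements the layer; the weights and the 12-token sentence are random stand-ins:

import numpy as np

rng = np.random.default_rng(0)
dims, hidden = 100, 25  # matches the embedding size and layer width below

# One set of weights, reused at every time step
W = rng.normal(scale=0.1, size=(hidden, dims))    # input -> hidden
U = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden at t-1 -> hidden at t
b = np.zeros(hidden)

def rnn_step(x_t, h_prev):
    # The hidden state at time t sees the current input and the previous state
    return np.tanh(W @ x_t + U @ h_prev + b)

sequence = rng.normal(size=(12, dims))  # a stand-in 12-token sentence
h = np.zeros(hidden)
for x_t in sequence:
    h = rnn_step(x_t, h)
# h now carries information from the whole sequence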
One challenge of the recurrent neural network in its simplest form is that it tends to forget prior input quickly. There are a couple of different approaches to this issue, Long Short-Term Memory networks (LSTMs) being the best known [6]. We use another approach, the Gated Recurrent Unit (GRU) [7]. The GRU requires fewer parameters than the LSTM, making it conceptually simpler and less computationally expensive to train, and it tends to give comparable results [8]. Both approaches use "gates" to explicitly control the rate at which old information is forgotten and new information is incorporated.
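Here is a NumPy sketch of a single GRU step, following the standard formulation with biases omitted for brevity. Again, this is an illustration under those assumptions, not SAS's internal implementation:

import numpy as np

rng = np.random.default_rng(1)
dims, hidden = 100, 25

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Three weight pairs: one each for the update gate, reset gate, and candidate
p = {name: rng.normal(scale=0.1, size=shape)
     for name, shape in [('Wz', (hidden, dims)), ('Uz', (hidden, hidden)),
                         ('Wr', (hidden, dims)), ('Ur', (hidden, hidden)),
                         ('Wh', (hidden, dims)), ('Uh', (hidden, hidden))]}

def gru_step(x_t, h_prev):
    z = sigmoid(p['Wz'] @ x_t + p['Uz'] @ h_prev)              # update gate
    r = sigmoid(p['Wr'] @ x_t + p['Ur'] @ h_prev)              # reset gate
    h_tilde = np.tanh(p['Wh'] @ x_t + p['Uh'] @ (r * h_prev))  # candidate
    # z controls how much old state is kept versus overwritten
    return (1 - z) * h_prev + z * h_tilde

h = np.zeros(hidden)
for x_t in rng.normal(size=(12, dims)):
    h = gru_step(x_t, h)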
When we generate a representation for each word in a sentence to represent words in their context, a recurrent neural network that reads from left to right will only include context from the left of each word. A bidirectional recurrent neural network (BiRNN) resolves this by using two recurrent neural networks, one operating from left to right, the other from right to left. The final output is the concatenation of these two representations.
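Sketched the same way, the bidirectional idea is just two passes and a concatenation. The weights are shared across directions here purely for brevity; a real BiRNN trains separate weights for each direction:

import numpy as np

rng = np.random.default_rng(2)
dims, hidden = 100, 25
W = rng.normal(scale=0.1, size=(hidden, dims))
U = rng.normal(scale=0.1, size=(hidden, hidden))

def run(sequence):
    # Any recurrent cell works here; a vanilla RNN step keeps it short
    h = np.zeros(hidden)
    states = []
    for x_t in sequence:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.array(states)  # one hidden state per token

sentence = rng.normal(size=(12, dims))
forward = run(sentence)                # context from the left only
backward = run(sentence[::-1])[::-1]   # context from the right, realigned
bidirectional = np.concatenate([forward, backward], axis=1)
print(bidirectional.shape)  # (12, 50): both directions for every token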
For sentiment analysis, we import the embedded data, then use a variable number of BiRNN layers to generate a contextualized representation of the word sequence, which we then feed into a forward RNN and take the last hidden state as a summary of the sentence. We feed this to a fully connected neural network to get the output.
[6] http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[7] https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
In [18]:
# Hyperparameters
settings = dict(
    n=25,                     # hidden units in each recurrent layer
    init='msra',              # weight initialization scheme
    bidirectional_layers=1,
    learning_rate=0.0005,
    step_size=20,             # epochs between learning-rate reductions
    thread_minibatch_size=1,
    max_epochs=40,
    fc_dropout=0.0,
    output_dropout=0.0,
    recurrent_dropout=0.0
)
In [19]:
sentiment = s.CASTable('sentiment', replace=True)
# Generate the model shell
s.buildmodel(model=sentiment, type='RNN')
# Drop the replace parameter so later actions reuse the table
# instead of trying to recreate it
del sentiment.params.replace
# Add the input layer
s.addlayer(model=sentiment, name='data', layer=dict(type='input'))
# Generate some number of bidirectional layers
# This loop will generate however many bidirectional layers are specified in settings
output = ['data']
for i in range(settings['bidirectional_layers']):
forward_birnn = 'birnn{}'.format(i)
backward_birnn = forward_birnn+'r'
s.addlayer(model=sentiment, name=forward_birnn, srclayers=output,
layer=dict(type='recurrent',
n=settings['n'],
init=settings['init'],
rnnType='GRU',
outputType='samelength',
dropout=settings['recurrent_dropout'],
reverse=False))
s.addlayer(model=sentiment, name=backward_birnn, srclayers=output,
layer=dict(type='recurrent',
n=settings['n'],
init=settings['init'],
rnnType='GRU',
outputType='samelength',
dropout=settings['recurrent_dropout'],
reverse=True))
output = [forward_birnn, backward_birnn]
# summary layer
s.addlayer(model=sentiment, name='frnn1', srclayers=output,
layer=dict(type='recurrent',
n=settings['n'],
init=settings['init'],
rnnType='GRU',
dropout=settings['recurrent_dropout'],
outputType='encoding'))
# output fully connected layer
s.addlayer(model=sentiment,
name='outlayer',
srclayers=['frnn1'],
layer=dict(type='output'))
Out[19]:
In [20]:
trained_weights = s.CASTable('trainedWeights', replace=True)
best_weights = s.CASTable('bestWeights', replace=True)
shuffled_embedded = s.CASTable('shuffled_embedded', replace=True)
# Shuffle so each training minibatch sees a mix of positive and negative tweets
s.shuffle(embedded_with_additional_data, casout=shuffled_embedded)
embedded_with_additional_data = shuffled_embedded
r = embedded_with_additional_data.query("slice EQ 'train'").dlTrain(
    model=sentiment,
    dataspecs=[
        # Feed the flattened embedding columns to the input layer: each token
        # is a block of `dims` values, with the true sequence length given
        # by the _sequence_length_ column
        dict(type='numericnominal',
             layer='data',
             data=embedding_columns,
             numnomParms=dict(
                 tokenSize=dims, length='_sequence_length_')),
        # The output layer predicts the nominal _Target_ column
        dict(type='numericnominal',
             layer='outlayer',
             data='_Target_',
             nominals='_Target_')
    ],
validtable=embedded_with_additional_data.query("slice EQ 'dev'"),
modelWeights=trained_weights,
bestWeights=best_weights,
optimizer=dict(
miniBatchSize=settings['thread_minibatch_size'],
maxEpochs=settings['max_epochs'],
loglevel=2,
algorithm=dict(method='adam',
beta1=0.9,
beta2=0.999,
gamma=0.5,
learningRate=settings['learning_rate'],
clipGradMax=100,
clipGradMin=-100,
stepSize=settings['step_size'],
lrPolicy='step'),
dropout=settings['output_dropout']),
seed=12345)
In [21]:
sentiment_scored = s.CASTable('sentiment_scored', replace=True)
# Score the held-out test slice with the best weights found during training
r = embedded_with_additional_data.query("slice EQ 'test'").dlScore(
modelTable=sentiment,
initWeights=best_weights,
copyVars=['_Target_', 'text'],
casOut=sentiment_scored,
bufferSize=2)
r
Out[21]:
Now that we have scored our data, we can analyze its errors. We can generate a confusion matrix in CAS using the crosstab action.
In [22]:
cmr = sentiment_scored.crosstab(row='_Target_', col='_DL_PredName_')
cmr.Crosstab
Out[22]:
We see here that negative tweets were mistaken for positive at roughly the same rate as positive for negative. To examine the errors in more detail, we can look at the misclassified tweets. Remember that our distant-supervision approach is noisy, so the ground-truth label may not always agree with your intuition. Here are a few negative tweets that were falsely classified as positive.
In [23]:
sentiment_scored.query("_Target_ EQ 'negative' AND _DL_PredName_ EQ 'positive'")['text'].head()
Out[23]:
And here are some positive tweets that were falsely classified as negative.
In [24]:
# Show the full tweet text instead of truncating long strings
pd.set_option('display.max_colwidth', None)
sentiment_scored.query("_Target_ EQ 'positive' AND _DL_PredName_ EQ 'negative'")['text'].head()
Out[24]:
Some work that a user of this recipe could do to further improve the results:
Collect more data. The advantage of the emoticon approach is that it's cheap. We used a simple version of it, but to get more tweets you could also match image emoji such as 😁 and other text variants such as :c) and (O8 (a hypothetical matcher for these is sketched after this list). Or, if you have the means, simply collect more tweets with the same smiles and frowns. We intentionally limited our dataset to make it quick to download.
We could also look for a cleaner way to collect data, since, especially among the misclassified negative tweets, we see some that would probably have received a different sentiment label from a human tagger. Keep in mind that human-labeled data is expensive.
Tweak the hyperparameters, build custom sentiment-aware embeddings, or change how the normalization is done. Keep in mind that if you change the normalization, you will likely have to generate your own embedding file.
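As a hypothetical starting point for the first suggestion above, the patterns below are illustrative only; they are not what we used to build this dataset, and the :c( variant is our own invention:

import re

# Illustrative emoticon patterns: classic smiles and frowns plus a few of
# the variants mentioned above (:c), (O8, and the 😁 emoji)
POSITIVE = re.compile(r"(:-?\)+)|(:c\))|(\(O8)|(😁)")
NEGATIVE = re.compile(r"(:-?\(+)|(:c\()")

print(bool(POSITIVE.search("made it home early :c)")))    # True
print(bool(NEGATIVE.search("missed the bus again :-(")))  # True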
Finally, remember that it's good manners to end your session when you are done with it.
In [25]:
s.terminate()