Imagine you're organizing a big tech conference and you want to understand what attendees thought of it, so you can run it even better next year.
A good place to look would be what people posted on social media about your conference; it's a tech event, after all. But there's a problem: if your event is as popular as the Web Summit, you would have to go through hundreds of thousands of tweets, which is simply not practical.
One common solution is to use social media monitoring and analysis tools, which aggregate all of these posts in one place and crunch them into more digestible metrics: total number of posts, most frequently used hashtags, average sentiment, and so on. This is what we did in our 2014 report of the Web Summit according to Twitter.
But what if we wanted to go a level deeper and understand in more depth what people thought and said about our conference? What if we wanted to build a map that shows all the topics that were discussed, the sentiment around them, and how both evolved over time? That's what we're going to explore in this series of blogs.
Over a 5-day period, we collected about 77,000 tweets about Web Summit 2015 using the Twitter Streaming API. In this blog series, we're going to explore them and see what we can extract.
Note: this is a step-by-step guide that contains everything you need to reproduce (almost) the same results, so feel free to try it for yourself.
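If you'd like to gather a similar dataset yourself, here's a minimal sketch of one way to do it with the tweepy library and the Streaming API. This isn't the script we used: the credentials, the tracked keyword, and the output path are all placeholders.
import tweepy

# Placeholder credentials -- register a Twitter app to get your own.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class DumpListener(tweepy.StreamListener):
    """Append every incoming tweet (raw JSON) to a dump file."""
    def __init__(self, path):
        super(DumpListener, self).__init__()
        self.path = path

    def on_data(self, data):
        with open(self.path, "a") as f:
            f.write(data)
        return True

    def on_error(self, status_code):
        # stop streaming if Twitter signals rate limiting (HTTP 420)
        return status_code != 420

stream = tweepy.Stream(auth, DumpListener("websummit_dump.jsonl"))
stream.filter(track=["websummit"])
This sketch writes one raw tweet JSON object per line; our own dump is stored as a single JSON array (loaded with json.load below), so adjust the loading code to match however you store the stream.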
In the first part of this series, we're going to build a 2D map of the tweets we've collected, to get a high-level sense of what people talked about. We'll try to place similar tweets close together on this map, and dissimilar ones far apart.
To group similar tweets together, we'll try out a few different methods: classical document vectors with tf-idf scoring, K-means clustering, Latent Dirichlet Allocation (LDA), averaged word vectors (using GloVe word embeddings), Paragraph Vectors (using gensim's doc2vec implementation), and finally Skip-Thought Vectors. We'll then use a dimensionality reduction algorithm called t-SNE to give each tweet an (x, y) coordinate in a 2D space, which allows us to put the tweets on a scatter plot.
In [1]:
import json
tweets_file = '/Users/parsa/Desktop/WebSummit 2015/websummit_dump_20151106155110'
with open(tweets_file) as f:
    tweets = json.load(f)

print('# of tweets:', len(tweets))

for tweet in tweets[:5]:
    print(tweet["text"])
Next we're going to do some pre-processing to remove stopwords such as "the" and "a", convert the tweets to lower case, and normalize the URLs, user names, smiley faces, hashtags, elongated words, and numbers. We define the pre-processor class as follows:
In [2]:
import re
class TweetPreprocessor(object):
    """Normalizes Twitter-specific tokens (URLs, mentions, hashtags, emoticons,
    numbers, elongated and all-caps words) and lower-cases the text."""

    def __init__(self):
        self.FLAGS = re.MULTILINE | re.DOTALL
        # placeholder tokens that replace Twitter-specific entities
        self.ALLCAPS = '<allcaps>'
        self.HASHTAG = '<hashtag>'
        self.URL = '<url>'
        self.USER = '<user>'
        self.SMILE = '<smile>'
        self.LOLFACE = '<lolface>'
        self.SADFACE = '<sadface>'
        self.NEUTRALFACE = '<neutralface>'
        self.HEART = '<heart>'
        self.NUMBER = '<number>'
        self.REPEAT = '<repeat>'
        self.ELONG = '<elong>'

    def _hashtag(self, text):
        # replace a hashtag with <hashtag> plus its body, splitting CamelCase bodies on capitals
        text = text.group()
        hashtag_body = text[1:]
        if hashtag_body.isupper():
            result = (self.HASHTAG + " {} " + self.ALLCAPS).format(hashtag_body)
        else:
            result = " ".join([self.HASHTAG] + re.split(r"(?=[A-Z])", hashtag_body, flags=self.FLAGS))
        return result

    def _allcaps(self, text):
        # lower-case an all-caps word and mark it with <allcaps>
        text = text.group()
        return text.lower() + ' ' + self.ALLCAPS

    def preprocess(self, text):
        eyes, nose = r"[8:=;]", r"['`\-]?"
        re_sub = lambda pattern, repl: re.sub(pattern, repl, text, flags=self.FLAGS)

        # normalize URLs, user mentions, emoticons, numbers and hashtags
        text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", self.URL)
        text = re_sub(r"/", " / ")
        text = re_sub(r"@\w+", self.USER)
        text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), self.SMILE)
        text = re_sub(r"{}{}p+".format(eyes, nose), self.LOLFACE)
        text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), self.SADFACE)
        text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), self.NEUTRALFACE)
        text = re_sub(r"<3", self.HEART)
        text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", self.NUMBER)
        text = re_sub(r"#\S+", self._hashtag)

        # mark repeated punctuation, elongated words and all-caps words
        text = re_sub(r"([!?.]){2,}", r"\1 " + self.REPEAT)
        text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 " + self.ELONG)
        text = re_sub(r"([A-Z]){2,}", self._allcaps)

        return text.lower()
Let's create an instance of the pre-processor, and see how it works with an example:
In [3]:
tweet_processor = TweetPreprocessor()
# an example:
tweet = "@sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV"
print("Before: " + tweet + "\n")
print("After: " + tweet_processor.preprocess(tweet))
We remove common words such as "the", "a", "an", "this", etc. using NLTK's list of English stopwords, and add a few extra items to remove tokens that don't contribute much to the meaning of a tweet.
We also iterate over the list of tweets and add a new processed attribute to each tweet, which contains the cleaned, tokenized text.
In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
nltk.download('stopwords')
stop = stopwords.words('english')
stop += ['<hashtag>', '<url>', '<allcaps>', '<number>', '<user>', '<repeat>', '<elong>', 'websummit']
for tweet in tweets:
    parts = tknzr.tokenize(tweet_processor.preprocess(tweet["text"]))
    clean = [i for i in parts if i not in stop]
    tweet["processed"] = clean
And here's the end result of our pre-processing pipeline:
In [5]:
print("Before: " + tweets[1]["text"] + "\n")
print("After: " + str(tweets[1]["processed"]))
With the tweets cleaned up, we can build our first document representation: we use scikit-learn's TfidfVectorizer to turn each tweet into a tf-idf weighted bag-of-words vector, keeping at most 10,000 features and ignoring terms that appear in fewer than 4 tweets.
In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tweet_texts = [tweet["text"] for tweet in tweets]  # list of all tweet texts
tweet_texts_processed = [" ".join(tweet["processed"]) for tweet in tweets]  # list of pre-processed tweet texts

vectorizer = TfidfVectorizer(min_df=4, max_features=10000)
vz = vectorizer.fit_transform(tweet_texts_processed)
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
Let's check the tf-idf scores for a few random words:
In [7]:
print("dublin: " + str(tfidf["dublin"]))
print("catmull: " + str(tfidf["catmull"]))
print("cosgrave: " + str(tfidf["cosgrave"]))
print("enda: " + str(tfidf["enda"]))
print("sheeps: " + str(tfidf["sheeps"]))
We're going to create our first map of tweets by plotting them on a 2D plane, based on pairwise similarity between their corresponding document vectors. But as you might remember, we have created 10,000 features for each document, and would therefore need a 10,000-dimensional space to visualize their corresponding vectors!
That doesn't sound very practical or intuitive, so we have to reduce the dimensionality of these vectors first: from 10,000 features down to 2 or 3, depending on whether we want to visualize them in a 2- or 3-dimensional space.
Two common dimensionality reduction techniques are principal component analysis (PCA) and singular value decomposition (SVD). Here, we're going to use SVD to reduce the dimensionality to 50 first, and then use another technique called t-SNE, which is particularly well suited to visualizing high-dimensional datasets, to reduce it further to 2.
In order to still be able to inspect individual points in the visualization, we visualize only part of our data: the first 10,000 tweets. Applying TruncatedSVD to this 10,000 x 10,000 matrix reduces it to the specified number of components, yielding a 10,000 x 50 matrix that we then feed into t-SNE to produce a 10,000 x 2 matrix we can visualize.
In [8]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50, random_state=0)
svd_tfidf = svd.fit_transform(vz[:10000])
In [9]:
svd_tfidf.shape
Out[9]: (10000, 50)
In [10]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)
In [11]:
tsne_tfidf.shape
Out[11]: (10000, 2)
In [12]:
tsne_tfidf[0]
Out[12]:
We are left with 10,000 (x,y) coordinates that we're going to visualize as a scatter plot. We also create tooltips to show the actual and processed text of each tweet (hover your mouse over the blue dots to see this).
In [13]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook
output_notebook()
plot_tfidf = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (tf-idf)",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
x_axis_type=None, y_axis_type=None, min_border=1)
plot_tfidf.scatter(x=tsne_tfidf[:,0], y=tsne_tfidf[:,1],
source=bp.ColumnDataSource({
"tweet": tweet_texts[:10000],
"processed": tweet_texts_processed[:10000]
}))
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\")"}
show(plot_tfidf)