Web Summit 2015 meets Natural Language Processing – Part 1: A Map of the Social Media Landscape

Introduction

Imagine you're organizing a big tech conference, and you want to understand what people thought of your conference, so you can run it even better next year.

A good resource to look at would be what people posted on social media about your conference; it's a tech event, after all. But there's a problem: if your event is as popular as the Web Summit, you're going to have to go through hundreds of thousands of tweets, which is simply not practical.

One common solution is to use social media monitoring and analysis tools, which aggregate all of these posts in one place and crunch them into more digestible metrics: total number of posts, most frequently used hashtags, average sentiment, and so on. This is exactly what we did in our 2014 report of the Web Summit according to Twitter.

But what if we wanted to go a level deeper and understand in a bit more depth what people thought and said about our conference? What if we wanted to build a map that shows all the topics that were discussed, the sentiment around them, and how both evolved over time? That's what we're going to explore in this series of blog posts.

Over a 5-day period, we collected about 77,000 tweets about Web Summit 2015 using the Twitter Streaming API. In this blog series, we're going to explore them and see what we can extract.
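
If you'd like to collect a similar dataset yourself, below is a minimal sketch of how this can be done with the tweepy library (using its pre-4.0 StreamListener API). This isn't the exact script we used, and the credentials, output filename, and track terms are placeholders you'd need to fill in:

import json
import tweepy

# placeholder credentials for a Twitter app
CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_TOKEN_SECRET = "...", "..."

class DumpListener(tweepy.StreamListener):
    # append the raw JSON of every incoming tweet to a file
    def __init__(self, path):
        super(DumpListener, self).__init__()
        self.out = open(path, "a")

    def on_status(self, status):
        self.out.write(json.dumps(status._json) + "\n")

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

stream = tweepy.Stream(auth, DumpListener("websummit_stream.jsonl"))
# track the conference hashtags and the organisers' handle
stream.filter(track=["#websummit", "#WebSummit2015", "WebSummitHQ"])

Note that this writes one JSON object per line, whereas the dump we load below is a single JSON array, so you may need to adapt the loading step if you collect your own data.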

Note: This is a step-by-step guide that contains everything you need to reproduce (almost) the same results, so feel free to try it for yourself.

Objective

In the first part of this series, we're going to build a 2D map of the tweets that we've collected, to get a better sense of what people talked about at a high level. We're going to try to place similar tweets closer together on this map, and dissimilar ones further apart.

We'll try out a few different methods for grouping together similar tweets: classical document vectors with tf-idf scoring, K-means clustering, Latent Dirichlet Allocation (LDA), averaged word vectors (using GloVe word embeddings), Paragraph Vectors (using gensim's doc2vec implementation), and finally Skip-Thought Vectors. We'll then use a dimensionality reduction algorithm called t-SNE to give each tweet an (x, y) coordinate in a 2D space, which allows us to put the tweets on a scatter plot.

Tools and Libraries

Data

  • GloVe vectors trained on tweets [download]
  • About 77,000 tweets collected between Nov 2nd and Nov 6th 2015, that mention one of the official Web Summit hashtags, and/or handles of a few hand-picked notable speakers [download bzipped JSON]
  • Skip-Thought model files, trained on the BookCorpus dataset [download]

Pre-processing

Let's start by reading the tweets from a JSON file and printing the first few tweets of our corpus:


In [1]:
import json

tweets_file = '/Users/parsa/Desktop/WebSummit 2015/websummit_dump_20151106155110'
with open(tweets_file) as f:
    tweets = json.load(f)
    
print('# of tweets:', len(tweets))
for tweet in tweets[:5]:
    print(tweet["text"])


# of tweets: 77111
@sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV
Start-ups from every continent heading to #websummit, including #LendInvest! https://t.co/7zvTM5ihcH @WebSummitHQ
I'm at the #WebSummit2015 this week. On ali(at)goss(dot)ie if anyone wants to say hi! 👋🏻🙂
@jalak What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2vwDJdWIJJ
#websummit is about to kickoff in #dublin! What are you looking forward to the most?? @WebSummitHQ

Next, we're going to do some pre-processing: we'll convert the tweets to lowercase and normalize the URLs, user names, smiley faces, hashtags, elongated words, and numbers (stopwords such as "the" and "a" are removed in a later step). We define the pre-processor class as follows:


In [2]:
import re

class TweetPreprocessor(object):

    def __init__(self):
        self.FLAGS = re.MULTILINE | re.DOTALL
        self.ALLCAPS = '<allcaps>'
        self.HASHTAG = '<hashtag>'
        self.URL = '<url>'
        self.USER = '<user>'
        self.SMILE = '<smile>'
        self.LOLFACE = '<lolface>'
        self.SADFACE = '<sadface>'
        self.NEUTRALFACE = '<neutralface>'
        self.HEART = '<heart>'
        self.NUMBER = '<number>'
        self.REPEAT = '<repeat>'
        self.ELONG = '<elong>'

    def _hashtag(self, text):
        # replace a #hashtag with the <hashtag> token, splitting CamelCase
        # hashtags into words and marking all-caps ones
        text = text.group()
        hashtag_body = text[1:]
        if hashtag_body.isupper():
            result = (self.HASHTAG + " {} " + self.ALLCAPS).format(hashtag_body)
        else:
            result = " ".join([self.HASHTAG] + re.split(r"(?=[A-Z])", hashtag_body, flags=self.FLAGS))
        return result

    def _allcaps(self, text):
        # lowercase an ALL-CAPS word and append the <allcaps> marker
        text = text.group()
        return text.lower() + ' ' + self.ALLCAPS

    def preprocess(self, text):
        eyes, nose = r"[8:=;]", r"['`\-]?"

        re_sub = lambda pattern, repl: re.sub(pattern, repl, text, flags=self.FLAGS)

        text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", self.URL)                       # URLs
        text = re_sub(r"/"," / ")                                                           # space out slashes
        text = re_sub(r"@\w+", self.USER)                                                   # @mentions
        text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), self.SMILE)  # :) :D etc.
        text = re_sub(r"{}{}p+".format(eyes, nose), self.LOLFACE)                           # :p
        text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), self.SADFACE)      # :( etc.
        text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), self.NEUTRALFACE)                  # :/ :| etc.
        text = re_sub(r"<3", self.HEART)                                                    # hearts
        text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", self.NUMBER)                             # numbers
        text = re_sub(r"#\S+", self._hashtag)                                               # hashtags
        text = re_sub(r"([!?.]){2,}", r"\1 " + self.REPEAT)                                 # repeated punctuation, e.g. "!!!"
        text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 " + self.ELONG)                        # elongated words, e.g. "heyyy"

        text = re_sub(r"([A-Z]){2,}", self._allcaps)                                        # ALL-CAPS words

        return text.lower()

Let's create an instance of the pre-processor, and see how it works with an example:


In [3]:
tweet_processor = TweetPreprocessor()

# an example:
tweet = "@sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV"
print("Before: " + tweet + "\n")
print("After: " + tweet_processor.preprocess(tweet))


Before: @sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV

After: <user> what <hashtag> musthave <hashtag> tech gadget can you not travel without? stop by stand d<number> on wed at <hashtag> websummit <hashtag> dublin <url>

We remove common words such as "the", "a", "an", "this", etc. using NLTK's list of English stopwords, and add a few extra items to remove tokens that don't contribute much to the meaning of a tweet.

We also iterate over the list of tweets and add a new attribute, "processed", to each tweet, which contains the processed and cleaned text.


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
nltk.download('stopwords')
stop = stopwords.words('english')
stop += ['<hashtag>', '<url>', '<allcaps>', '<number>', '<user>', '<repeat>', '<elong>', 'websummit']

for tweet in tweets:
    parts = tknzr.tokenize(tweet_processor.preprocess(tweet["text"]))
    clean = [i for i in parts if i not in stop]
    tweet["processed"] = clean


[nltk_data] Downloading package stopwords to /Users/parsa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

And here's the end result of our pre-processing pipeline:


In [5]:
print("Before: " + tweets[1]["text"] + "\n")
print("After: " + str(tweets[1]["processed"]))


Before: Start-ups from every continent heading to #websummit, including #LendInvest! https://t.co/7zvTM5ihcH @WebSummitHQ

After: ['start-ups', 'every', 'continent', 'heading', ',', 'including', 'lendinvest', '!']

Tf-idf scores

Now that we have cleaned up our tweets, we can create sparse bag-of-words vectors for our documents, with tf-idf weights that indicate how common or rare each word in a document is with respect to the entire corpus:


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tweet_texts = [tweet["text"] for tweet in tweets] # list of all tweet texts
tweet_texts_processed = [str.join(" ", tweet["processed"]) for tweet in tweets] # list of pre-processed tweet texts

vectorizer = TfidfVectorizer(min_df=4, max_features = 10000)
vz = vectorizer.fit_transform(tweet_texts_processed)
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
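
Before moving on, it's worth taking a quick (optional) look at what we've just built: a sparse document-term matrix with one row per tweet and one column per vocabulary term, where most entries are zero because tweets are short:

# vz is a sparse (n_tweets x n_features) matrix of tf-idf weights
print(vz.shape)                      # should be (77111, 10000) for this corpus
print(vz.nnz / float(vz.shape[0]))   # average number of non-zero terms per tweet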

Let's check the idf scores for a few random words (these come straight from the vectorizer's idf_ attribute, so rarer words get higher values):


In [7]:
print("dublin: " + str(tfidf["dublin"]))
print("catmull: " + str(tfidf["catmull"]))
print("cosgrave: " + str(tfidf["cosgrave"]))
print("enda: " + str(tfidf["enda"]))
print("sheeps: " + str(tfidf["sheeps"]))


dublin: 3.51716751201
catmull: 6.89642791479
cosgrave: 6.78495404833
enda: 7.77567737499
sheeps: 8.34099118404
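
Where do these numbers come from? With its default settings (smooth_idf=True), scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents containing the term t, so rarer terms such as "sheeps" get higher scores than common ones such as "dublin". As a quick optional sanity check, we can reproduce one of the values above by hand:

import numpy as np

# reproduce scikit-learn's smoothed idf for a single term:
# idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs = vz.shape[0]
col = vectorizer.vocabulary_["dublin"]
df = vz[:, col].nnz                                 # number of tweets containing "dublin"
print(np.log((1.0 + n_docs) / (1.0 + df)) + 1.0)    # should match tfidf["dublin"]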

SVD and t-SNE

We're going to create our first map of tweets by plotting them on a 2D plane, based on the pair-wise similarity between their corresponding document vectors. But as you might remember, we have created 10,000 features for each document, so we would need a 10,000-dimensional space to visualize their corresponding vectors!

That doesn't sound very practical or intuitive, so we first have to reduce the dimensionality of these vectors: from 10,000 features down to 2 or 3, depending on whether we want to visualize them in a 2- or 3-dimensional space.

Two common dimensionality reduction techniques are principal component analysis (PCA) and singular value decomposition (SVD). Here, we're going to use SVD to reduce the dimensionality to 50 first, and then apply another dimensionality reduction technique called t-SNE, which is particularly well suited to visualizing high-dimensional datasets, to bring it further down to 2.

In order to still be able to inspect individual points in the visualization, we only visualize part of our data: the first 10,000 tweets. Applying TruncatedSVD to this 10,000 x 10,000 matrix reduces it to the specified number of components, yielding a 10,000 x 50 matrix that we then feed into t-SNE to produce the 10,000 x 2 matrix we can visualize.


In [8]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50, random_state=0)
svd_tfidf = svd.fit_transform(vz[:10000])

In [9]:
svd_tfidf.shape


Out[9]:
(10000, 50)

In [10]:
from sklearn.manifold import TSNE

tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)


[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.000000
[t-SNE] Error after 100 iterations with early exaggeration: 1.188186
[t-SNE] Error after 425 iterations: 1.123668

In [11]:
tsne_tfidf.shape


Out[11]:
(10000, 2)

In [12]:
tsne_tfidf[0]


Out[12]:
array([ 1.81195777,  3.29304416])

We are left with 10,000 (x,y) coordinates that we're going to visualize as a scatter plot. We also create tooltips to show the actual and processed text of each tweet (hover your mouse over the blue dots to see this).


In [13]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

# render the plot inline in the notebook
output_notebook()
plot_tfidf = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (tf-idf)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

# one point per tweet; the data source carries the raw and processed text
# so that the hover tooltip can display them
plot_tfidf.scatter(x=tsne_tfidf[:,0], y=tsne_tfidf[:,1],
                    source=bp.ColumnDataSource({
                        "tweet": tweet_texts[:10000], 
                        "processed": tweet_texts_processed[:10000]
                    }))

hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\")"}
show(plot_tfidf)