Web Summit 2015 meets Natural Language Processing – Part 1: A Map of the Social Media Lands

Introduction

Imagine you're organizing a big tech conference, and you want to understand what people thought of your conference, so you can run it even better next year.

A good resource to look at, would be what people posted on Social Media about your conference, because well, it's a tech event after all. But there's a problem here: if your event is as popular as the Web Summit, you're going to have to go through hundreds of thousands of tweets, which is simply not practical.

One common solution would be to use Social Media monitoring and analysis tools, which try to aggregate all of these posts in one place, and crunch them into more understandable metrics: total number of posts, most frequently used hashtags, average sentiment, etc., and this is indeed what we did in our 2014 report of the Web Summit according to Twitter.

But what if we wanted to go a level deeper, to understand a little bit more in depth, what people thought and said about our conference? What if we wanted to build a map, that shows all the topics that were discussed, the sentiment around them, and how these two things evolved with time? That's what we're going to explore in this series of blogs.

Over a 5 day period, we collected about 77,000 tweets about the Web Summit 2015 using the Twitter Streaming API and in this blog series, we're going to explore them and see what we can extract.

Note: This is a step by step guide which contains all you need to accomplish (almost) the same results, so feel free to try it for yourself.

Objective

In the first part of this series, we're going to build a 2D map of the tweets that we've collected, to get a better sense of what people talked about at a high level. We're going to try and put the similar tweets closer together on this map, and the dissimilar ones further apart.

We'll try out a few different methods such as classical document vectors with tf-idf scoring, K-means clustering, Latent Dirichlet Allocation (LDA), averaged Word Vectors (using GloVe word embeddings), Paragraph Vectors (using gensim's doc2vec implementation) and finally Skip-Thought Vectors to group together similar tweets, and we will then use a dimensionality reduction algorithm called t-SNE to give each tweet an (x,y) co-ordinate in a 2D space, that allows us to put the tweets on a scatter plot.

Tools and Libraries

Data

  • GloVe vectors trained on tweets [download]
  • About 77,000 tweets collected between Nov 2nd and Nov 6th 2015, that mention one of the official Web Summit hashtags, and/or handles of a few hand-picked notable speakers [download bzipped JSON]
  • Skip-Thought model files, trained on the BookCorpus dataset [download]

Pre-processing

Let's start by reading the tweets from a json file, and printing the first few tweets of our corpus:


In [1]:
import json

tweets_file = '/Users/parsa/Desktop/WebSummit 2015/websummit_dump_20151106155110'
with open(tweets_file) as f:
    tweets = json.load(f)
    
print('# of tweets:', len(tweets))
for tweet in tweets[:5]:
    print(tweet["text"])


# of tweets: 77111
@sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV
Start-ups from every continent heading to #websummit, including #LendInvest! https://t.co/7zvTM5ihcH @WebSummitHQ
I'm at the #WebSummit2015 this week. On ali(at)goss(dot)ie if anyone wants to say hi! 👋🏻🙂
@jalak What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2vwDJdWIJJ
#websummit is about to kickoff in #dublin! What are you looking forward to the most?? @WebSummitHQ

Next we're going to do some pre-processing to remove stopwords such as "the" and "a", convert the tweets to lower case, and normalize the URLs, user names, smiley faces, hashtags, elongated words, and numbers. We define the pre-processor class as follows:


In [2]:
import re

class TweetPreprocessor(object):

    def __init__(self):
        self.FLAGS = re.MULTILINE | re.DOTALL
        self.ALLCAPS = '<allcaps>'
        self.HASHTAG = '<hashtag>'
        self.URL = '<url>'
        self.USER = '<user>'
        self.SMILE = '<smile>'
        self.LOLFACE = '<lolface>'
        self.SADFACE = '<sadface>'
        self.NEUTRALFACE = '<neutralface>'
        self.HEART = '<heart>'
        self.NUMBER = '<number>'
        self.REPEAT = '<repeat>'
        self.ELONG = '<elong>'

    def _hashtag(self, text):
        text = text.group()
        hashtag_body = text[1:]
        if hashtag_body.isupper():
            result = (self.HASHTAG + " {} " + self.ALLCAPS).format(hashtag_body)
        else:
            result = " ".join([self.HASHTAG] + re.split(r"(?=[A-Z])", hashtag_body, flags=self.FLAGS))
        return result

    def _allcaps(self, text):
        text = text.group()
        return text.lower() + ' ' + self.ALLCAPS

    def preprocess(self, text):
        eyes, nose = r"[8:=;]", r"['`\-]?"

        re_sub = lambda pattern, repl: re.sub(pattern, repl, text, flags=self.FLAGS)

        text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", self.URL)
        text = re_sub(r"/"," / ")
        text = re_sub(r"@\w+", self.USER)
        text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), self.SMILE)
        text = re_sub(r"{}{}p+".format(eyes, nose), self.LOLFACE)
        text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), self.SADFACE)
        text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), self.NEUTRALFACE)
        text = re_sub(r"<3", self.HEART)
        text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", self.NUMBER)
        text = re_sub(r"#\S+", self._hashtag)
        text = re_sub(r"([!?.]){2,}", r"\1 " + self.REPEAT)
        text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 " + self.ELONG)

        text = re_sub(r"([A-Z]){2,}", self._allcaps)

        return text.lower()

Let's create an instance of the pre-processor, and see how it works with an example:


In [3]:
tweet_processor = TweetPreprocessor()

# an example:
tweet = "@sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV"
print("Before: " + tweet + "\n")
print("After: " + tweet_processor.preprocess(tweet))


Before: @sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV

After: <user> what <hashtag> musthave <hashtag> tech gadget can you not travel without? stop by stand d<number> on wed at <hashtag> websummit <hashtag> dublin <url>

We remove common words such as "the", "a", "an", "this", etc. using NLTK's list of English stopwords, and add a few extra items to remove tokens that don't contribute much to the meaning of a tweet.

We also iterate over the list of tweets and add a new attribute processed to each tweet, which contains the processed and cleaned text.


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
nltk.download('stopwords')
stop = stopwords.words('english')
stop += ['<hashtag>', '<url>', '<allcaps>', '<number>', '<user>', '<repeat>', '<elong>', 'websummit']

for tweet in tweets:
    parts = tknzr.tokenize(tweet_processor.preprocess(tweet["text"]))
    clean = [i for i in parts if i not in stop]
    tweet["processed"] = clean


[nltk_data] Downloading package stopwords to /Users/parsa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

And here's the end result of our pre-processing pipeline:


In [5]:
print("Before: " + tweets[1]["text"] + "\n")
print("After: " + str(tweets[1]["processed"]))


Before: Start-ups from every continent heading to #websummit, including #LendInvest! https://t.co/7zvTM5ihcH @WebSummitHQ

After: ['start-ups', 'every', 'continent', 'heading', ',', 'including', 'lendinvest', '!']

Tf-idf scores

Now that we have cleaned up our tweets, we can create "one hot" vectors for our documents, with tf-idf values that indicate how common or rare a word in a document is, with respect to the entire corpus:


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

tweet_texts = [tweet["text"] for tweet in tweets] # list of all tweet texts
tweet_texts_processed = [str.join(" ", tweet["processed"]) for tweet in tweets] # list of pre-processed tweet texts

vectorizer = TfidfVectorizer(min_df=4, max_features = 10000)
vz = vectorizer.fit_transform(tweet_texts_processed)
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

Let's check the tf-idf scores for a few random words:


In [7]:
print("dublin: " + str(tfidf["dublin"]))
print("catmull: " + str(tfidf["catmull"]))
print("cosgrave: " + str(tfidf["cosgrave"]))
print("enda: " + str(tfidf["enda"]))
print("sheeps: " + str(tfidf["sheeps"]))


dublin: 3.51716751201
catmull: 6.89642791479
cosgrave: 6.78495404833
enda: 7.77567737499
sheeps: 8.34099118404

SVD and t-SNE

We're going to create our first map of tweets, by plotting them on a 2D plane, based on pair-wise similarity between their corresponding document vectors. But as you might remember, we have created 10,000 features for each document, and therefore would need a 10,000-dimensional space to visualize their corresponding vectors!

That doesn't sound very practical or intuitive, so we have to reduce the dimensionality of these vectors first: from 10,000 features down to 2 or 3, depending if we want to visualize them in a 2- or 3-dimensional space.

Two common dimensionality reduction techniques are principal component analysis (PCA) and singular value decomposition (SVD). Here, we are going to use SVD to reduce the dimensionality to 50 dimensions first, and then use another dimensionality reduction technique called t-SNE that is particularly suited to visualizing high-dimensional datasets, to further reduce the dimensionality to 2.

In order to still be able to inspect individual points in the visualization, we visualize only part of our data: the first 10,000 tweets. Applying TruncatedSVD on this 10,000 x 10,000 matrix reduces it to the number of components specified and thus yields a 10,000 x 50 dimensional matrix that we then feed into t-SNE to produce a 10,000 x 2 matrix that we're able to visualize.


In [8]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50, random_state=0)
svd_tfidf = svd.fit_transform(vz[:10000])

In [9]:
svd_tfidf.shape


Out[9]:
(10000, 50)

In [10]:
from sklearn.manifold import TSNE

tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)


[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.000000
[t-SNE] Error after 100 iterations with early exaggeration: 1.188186
[t-SNE] Error after 425 iterations: 1.123668

In [11]:
tsne_tfidf.shape


Out[11]:
(10000, 2)

In [12]:
tsne_tfidf[0]


Out[12]:
array([ 1.81195777,  3.29304416])

We are left with 10,000 (x,y) coordinates that we're going to visualize as a scatter plot. We also create tooltips to show the actual and processed text of each tweet (hover your mouse over the blue dots to see this).


In [13]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

output_notebook()
plot_tfidf = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (tf-idf)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_tfidf.scatter(x=tsne_tfidf[:,0], y=tsne_tfidf[:,1],
                    source=bp.ColumnDataSource({
                        "tweet": tweet_texts[:10000], 
                        "processed": tweet_texts_processed[:10000]
                    }))

hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\")"}
show(plot_tfidf)


BokehJS successfully loaded.

Not bad for a first attempt: the tweets are separated nicely, and have formed a couple of homogenous 'islands'. For instance, there's an island of tweets about "pub crawls" on the top left, and one about the "future of Ireland" on the right hand side. But there are two fundamental problems with this chart:

  1. There's no rigid notion of grouping and
  2. The separation seems to be focused around keywords rather than concepts.

K-means clustering

K-means is a popular clustering algorithm that tries to distribute a predefined number of points (K) in a way that they end up in the center of our clusters, close to the mean.

We're going to create 10 clusters using MiniBatchKMeans from scikit-learn, which is a fast implementation of k-means that processes examples in small batches instead of individually.


In [14]:
from sklearn.cluster import MiniBatchKMeans

num_clusters = 10
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1, 
                         init_size=1000, batch_size=1000, verbose=False, max_iter=1000)
kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

The k-means algorithm runs for a few hundred iterations until the centroids don't improve much any more, and then for each tweet, it provides us with the closest centroid and the distance to each cluster centroid.

Let's see which cluster the first five tweets – that we saw earlier – have ended up in:


In [15]:
for i, tweet in enumerate(tweets):
    if(i < 5):
        print("Cluster " + str(kmeans_clusters[i]) + ": " + tweet["text"] + "(distance: " + str(kmeans_distances[i][kmeans_clusters[i]]) + ")")


Cluster 9: @sarahtavel What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2wFLAVpGiV(distance: 0.995767476421)
Cluster 6: Start-ups from every continent heading to #websummit, including #LendInvest! https://t.co/7zvTM5ihcH @WebSummitHQ(distance: 0.999239920902)
Cluster 6: I'm at the #WebSummit2015 this week. On ali(at)goss(dot)ie if anyone wants to say hi! 👋🏻🙂(distance: 0.99857342472)
Cluster 9: @jalak What #MustHave #tech gadget can you not travel without? Stop by stand D131 on Wed at #WebSummit #Dublin https://t.co/2vwDJdWIJJ(distance: 0.995767476421)
Cluster 9: #websummit is about to kickoff in #dublin! What are you looking forward to the most?? @WebSummitHQ(distance: 0.969860857782)

To better understand what's in each cluster, let's get the top 10 features (word) for each of our 10 clusters:


In [16]:
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(num_clusters):
    print("Cluster %d:" % i, end='')
    for j in sorted_centroids[i, :10]:
        print(' %s' % terms[j], end='')
    print()


Cluster 0: audipitch vote pitch unklerikoe support retweet websummithq best voting goes
Cluster 1: people talk smartwatch world need blocks way technology personal modular
Cluster 2: thanks great hosting server misconceptions blogging common site linux windows
Cluster 3: interested offer attend special maybe would startups trial sageone get
Cluster 4: ll dell michael we you love tomorrow it stage see
Cluster 5: day us meet come stand today visit dublin village green
Cluster 6: smile today see get live it like tech startup new
Cluster 7: great good day talk see dublin time looking today meet
Cluster 8: stage re we centre you marketing main come machine dublin
Cluster 9: dublin summit web ireland tech live day blog night startup

As you can see there's some meaningful consolidation in some of these clusters:

  • Cluster 0 seems to be dealing with startups and pitch competitions
  • Cluster 1 encompasses technology-related topics
  • Cluster 2 seems to be about server technologies
  • Cluster 8 and cluster 9 seem to be about the conference itself

Now let's try and visualize the tweets again, according to their distance from each centroid in our K clusters. We are going to use t-SNE again to reduce the dimensionality from 10 (we had 10 clusters/centroids) down to 2.


In [17]:
tsne_kmeans = tsne_model.fit_transform(kmeans_distances[:10000])


[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.000000
[t-SNE] Error after 100 iterations with early exaggeration: 1.357103
[t-SNE] Error after 400 iterations: 1.296135

To make the separation stand out more clearly, in addition to plotting the tweets, we are going to colorize each tweet according to the cluster it belongs to:


In [18]:
import numpy as np

colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

plot_kmeans = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (k-means)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_kmeans.scatter(x=tsne_kmeans[:,0], y=tsne_kmeans[:,1], 
                    color=colormap[kmeans_clusters][:10000], 
                    source=bp.ColumnDataSource({
                        "tweet": tweet_texts[:10000], 
                        "processed": tweet_texts_processed[:10000],
                        "cluster": kmeans_clusters[:10000]
                    }))
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\" - cluster: @cluster)"}
show(plot_kmeans)


The separation seems to have improved, but there are some overlaps between different clusters, and the islands are still formed around keywords and keyphrases.

Latent Dirichlet Allocation (LDA)

Next we're going to use a well-known topic modeling algorithm called LDA, to uncover the latent topics in the tweets, and then wet're going to use the topic distributions for each document as a measure to group similar tweets together.

First we vectorize our data by representing each tweet as a 10k dimensional vector whose indices correspond to the 10k most frequent terms in our corpus. We then feed this 77,000 x 10,000 feature matrix into LDA to detect the latent topics in our data. While there are non-parametric alternatives available, the classic version of LDA allows us to specify the number of topics that we want to discover upfront. We run LDA for 2,000 iterations to identify 15 topics and receive a 77k x 15 matrix of topic distributions of our tweets.


In [19]:
import lda
from sklearn.feature_extraction.text import CountVectorizer

cvectorizer = CountVectorizer(min_df=4, max_features=10000, stop_words='english')
cvz = cvectorizer.fit_transform(tweet_texts_processed)

n_topics = 15
n_iter = 2000
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)


WARNING:lda:all zero row in document-term matrix found

Let us now look at those topics in more detail. Specifically, we can inspect the words that are most relevant to a topic. We save these words as topic summaries for later.


In [20]:
n_top_words = 8
topic_summaries = []

topic_word = lda_model.topic_word_  # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))


Topic 0: year lisbon tinder paddy free interview ireland like
Topic 1: like smile sheep ve cool love know got
Topic 2: vr world technology reality talk future drones robots
Topic 3: summit dublin web day ready start ireland morning
Topic 4: come stand booth today visit say hi tomorrow
Topic 5: mobile dell michael platform apps talk come fintech
Topic 6: stage live centre ceo dublin periscope nov tuesday
Topic 7: audipitch pitch vote smartwatch blocks support best open
Topic 8: dublin nightsummit tonight smile free night day party
Topic 9: great looking day thanks forward today good amazing
Topic 10: meet let startup today love dublin week great
Topic 11: data stage future talk design google machine iot
Topic 12: content marketing social digital media facebook video new
Topic 13: people don make iew ebstaff want need like
Topic 14: tech startup startups great business ireland sagementorhours stage

Most of these topics look somewhat coherent:

  • Topic 0/3: the conference itself
  • Topic 2 seems to be about Virtual Reality and the future of technology
  • Topic 4: networking
  • Topic 7: the pitch contests and wearable technology
  • Topic 8: the 'fun' part of the conference (night outs, food, partying)
  • Topic 9: appreciation, gratefulness
  • Topic 11: the talks and speakers
  • Topic 12: marketing
  • Topic 14: business and startups

To visualize the tweets according to their topic distributions, we first need to reduce the dimensionality from 15 down to 2 using t-SNE:


In [21]:
tsne_lda = tsne_model.fit_transform(X_topics[:10000])


[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.000000
[t-SNE] Error after 100 iterations with early exaggeration: 1.265818
[t-SNE] Error after 325 iterations: 1.188688

Let's get the main topic for each tweet, which we'll use to colorize them later on:


In [22]:
doc_topic = lda_model.doc_topic_
lda_keys = []
for i, tweet in enumerate(tweets):
    lda_keys += [doc_topic[i].argmax()]

In [23]:
plot_lda = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (LDA)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_lda.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], 
                 color=colormap[lda_keys][:10000], 
                 source=bp.ColumnDataSource({
                    "tweet": tweet_texts[:10000], 
                    "processed": tweet_texts_processed[:10000],
                    "topic_key": lda_keys[:10000]
                }))
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\" - topic: @topic_key)"}
show(plot_lda)


This seems to be our best chart so far: there are entire islands dedicated to marketing or partying, without those words getting explicitly mentioned, so we've took a solid step towards a concept or topic-based representation.

Moreover, the chart not only puts similar tweets together, but also brings similar topics closer. Pay attention to how close business (grey) and marketing (dark pink) are, and how far innovation/future (orange) is pushed away from fun and partying (purple).

Averaged GloVe Embeddings

Word Embeddings are everywhere these days, and as the flagship of Deep Learning-based techniques for processing language, they have dominated the world of NLP research since 2014, and for good reasons.

In layperson terms, these are dense vectors of typically between 25 and 300 dimensions that capture a lot of information about the word as well as its surrounding context. In this respect, they can be seen as another form of dimensionality reduction (such as PCA, SVD, LDA, etc.) that has been shown to be very effective in a wide array of NLP tasks.

These vectors have a set of characteristics that make them very special. To begin with, the vectors of similar words are closer to each other, and conversely, the vectors of dissimilar words are far from each other.

Moreover, the vectors can be combined to find answers to queries such as "king is to man as queen is to ???" by forming simple arithmetic equations such as:

$$v_{king} - v_{man} ~= v_{queen} - v_{woman}$$

Let's see if we can exploit these characteristics to enhance our map.


Our first step is to read the pretrained GloVe vectors into a list:


In [24]:
print("loading glove model...")
embedding_size = 50
glove_file = '/Users/parsa/Downloads/glove.twitter.27B/glove.twitter.27B.50d.txt'
glove = {}
with open(glove_file) as f:
    for line in f.readlines():
        line = line.replace("\n","").split(" ")
        glove[line[0]] = np.array(line[1:],dtype='float64')


loading glove model...

The model is loaded. Let's look at the vector for the word "book":


In [25]:
print("book:", glove["book"])


book: [ -1.15900000e-01   3.01330000e-02  -2.13950000e-02   9.47920000e-02
   1.06020000e+00  -6.15770000e-01   1.43900000e+00  -2.95730000e-01
   1.72540000e-01  -1.84670000e-01   1.73100000e-01  -1.11780000e-01
  -4.13480000e+00   2.44770000e-01  -6.08460000e-01  -1.13420000e-01
   4.28910000e-01   5.54900000e-01  -1.05920000e-01   7.49320000e-01
  -1.07730000e+00   1.68330000e-02  -3.17150000e-01  -2.26790000e-01
  -3.69700000e-01   1.92180000e-02   5.11020000e-02  -3.39400000e-01
   9.18050000e-01   1.39370000e-02   6.78850000e-01  -2.76930000e-01
  -2.80580000e-01  -3.39460000e-01   9.13070000e-04  -1.27950000e-01
   2.95180000e-01  -1.56730000e-01   8.63490000e-01  -9.87520000e-01
  -7.19650000e-01   3.17540000e-01   1.73410000e-01   3.89070000e-01
   9.12310000e-01   1.72690000e-01  -2.32080000e-01  -5.95720000e-01
  -1.06630000e+00  -3.66940000e-01]

In order to get a vector representation for an entire tweet, we're going to combine the vectors of individual words in it, by doing a weighted average using tf-idf values ($t$), to give more importance to vectors of more significant words:

$$v_{tweet} = \frac{1}{|T|}\sum_{i \in T}{v_i.t_i}$$

In [26]:
def tweetVector(processed_tweet):
    l = float(len(processed_tweet) | 1)
    sum = np.zeros(embedding_size)
    for part in processed_tweet:
        sum += glove.get(part, np.zeros(embedding_size)) * tfidf.get(part, 1)
    return sum/l

The tweetVector function takes a tweet, and returns a 50-dimensional vector by averaging the individual word vectors. Next we're going to run through our tweets, and calculate the tweet vector for each one of them:


In [27]:
for tweet in tweets:
    tweet["vector"] = tweetVector(tweet["processed"])

Let's take a look at the vector for our first tweet:


In [28]:
tweets[0]["vector"]


Out[28]:
array([  0.67558973,   1.57907176,  -1.97438477,  -1.42001073,
         0.42001246,   0.15403742,   3.89938014,  -1.63873213,
         3.01999621,  -0.52678554,   0.87098459,   0.35202588,
       -14.97741569,   0.75753689,  -0.9268416 ,   1.08553229,
         0.05827361,   0.55365144,  -0.42338387,   2.73923924,
        -1.5773125 ,  -1.80355323,  -0.88586389,  -0.04644073,
        -0.33820551,   2.93505561,  -0.19981562,   1.30924813,
        -0.82639449,  -1.41393207,   0.71033069,  -1.81304577,
         0.46166968,  -0.85571835,   3.63464606,  -2.07778157,
         0.79176347,   1.29247613,   0.54891076,   0.18781912,
        -3.08049038,   1.48118074,   0.5526418 ,   0.47584854,
         2.20498931,   0.44194745,  -0.11832405,   0.48159421,
        -1.35909449,   0.30931496])

Note that a caveat of this approach is that we lose word order, and in that sense we're doing something similar to a bag-of-words model. For instance, the following sentences have the same tweet vectors:


In [29]:
np.allclose(tweetVector(["we", "are", "totally", "equal"]), tweetVector(["equal", "totally", "we", "are"]))


Out[29]:
True

Now that we've calculated the tweet vectors, let's release the GloVe embeddings and free up some memory:


In [30]:
glove = None

In [31]:
sample_size = 10000
filtered = [tweet for tweet in tweets if "<url>" not in tweet["processed"]]
sorted_ = sorted(filtered, key=lambda tweet: len(tweet["text"]), reverse=True)
tweets_sample = sorted_[:sample_size]

The tweet vectors that we've created are 50-dimensional, which means we now have a matrix of 77,000 x 50 that we need to plot. We're going to use t-SNE again to reduce the dimensionality down to 2:


In [32]:
tsne_glove = tsne_model.fit_transform([tweet["vector"] for tweet in tweets[:10000]])


[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.000000
[t-SNE] Error after 100 iterations with early exaggeration: 1.119460
[t-SNE] Error after 325 iterations: 1.096473

Time to plot the tweet vectors on a 2D surface. Note that since we were happy with the LDA topics, we're going to use the topic keys again to colorize the tweets, and check if they correlate with the spatial separation:


In [33]:
plot_glove = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (averaged GloVe vectors)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_glove.scatter(x=tsne_glove[:,0], y=tsne_glove[:,1], 
                   color=colormap[lda_keys][:10000], 
                   source=bp.ColumnDataSource({
                       "tweet": tweet_texts[:10000], 
                       "processed": tweet_texts_processed[:10000]
                   }))
hover = plot_glove.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\")"}
show(plot_glove)


No clear separation is evident in this representation, though there are areas that are clearly richer in one or two colors. The word vectors have worked their magic though, and neighboring tweets are a lot closer to each other semantically, without necessarily sharing the same keywords.

Paragraph Vectors (doc2vec)

Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. The algorithm represents each document by a dense vector which is trained to predict words in the document.

We are going to use the implementation available in gensim, called doc2vec, to created paragraph vectors for each tweet in our dataset:


In [34]:
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

Let's extract the documents and tokenize them (without removing the stopwords):


In [35]:
docs = [TaggedDocument(tknzr.tokenize(tweet_processor.preprocess(tweet["text"])), [i]) for i, tweet in enumerate(tweets)]

Here's what a tokenized document looks like:


In [36]:
docs[0]


Out[36]:
TaggedDocument(words=['<user>', 'what', '<hashtag>', 'musthave', '<hashtag>', 'tech', 'gadget', 'can', 'you', 'not', 'travel', 'without', '?', 'stop', 'by', 'stand', 'd', '<number>', 'on', 'wed', 'at', '<hashtag>', 'websummit', '<hashtag>', 'dublin', '<url>'], tags=[0])

We train our doc2vec model with 100 dimensions, a window of size 8, a minimum word count of 5 and with 4 workers:


In [37]:
doc2vec_model = Doc2Vec(docs, size=100, window=8, min_count=5, workers=4)

We can do some fun things with the trained model. Let's see if we can find out who's Dell's CEO, by knowing Pixar's CEO:


In [38]:
doc2vec_model.most_similar(positive=["pixar", "dell"], negative=["catmull"])


Out[38]:
[('instagram', 0.6922492980957031),
 ('oculus', 0.6137598752975464),
 ('michael', 0.5988619923591614),
 ('kirt', 0.5902463793754578),
 ('azure', 0.5881903171539307),
 ('creativity', 0.5847536325454712),
 ('tinder', 0.5744156837463379),
 ('pebble', 0.5689749717712402),
 ('stripe', 0.5666184425354004),
 ('facebook', 0.5621976852416992)]

The second entry is "michael" which refers to Michael Dell, CEO of Dell. Note that the model picked this up from the 77k tweets alone, with no additional knowledge source.

Let' see what each paragraph vector looks like:


In [39]:
doc2vec_model.docvecs[0]


Out[39]:
array([ -3.44759109e-03,   3.07281315e-03,  -3.29531799e-03,
         2.88680848e-03,  -1.61511789e-03,  -8.95729579e-04,
        -2.39316514e-03,   2.49383855e-03,  -3.39507335e-03,
        -3.14996345e-03,   4.88198036e-03,   3.14503442e-03,
         4.32997709e-03,   1.02909515e-03,   1.66431861e-03,
         3.51064838e-03,   4.37828247e-03,  -1.99072761e-04,
         1.82044157e-03,  -2.24329508e-03,   2.24335259e-03,
        -6.26384863e-04,   2.87131499e-03,   9.46638058e-04,
        -9.78242606e-04,  -3.60067515e-03,   2.50579114e-03,
         2.23646895e-03,  -4.08810610e-03,   1.20351062e-04,
        -4.48790705e-03,   4.77654999e-03,   1.97079475e-03,
         2.79729301e-03,  -3.83013790e-03,   3.49228294e-03,
        -3.26741952e-03,  -2.05332087e-03,  -1.66271708e-03,
         4.62410739e-03,   2.97313603e-03,  -1.12211669e-03,
         4.33133356e-03,   4.94921580e-04,  -2.90978327e-03,
         2.44172616e-03,   2.96942773e-03,  -3.63966497e-03,
        -3.64260748e-03,  -4.71649552e-03,   7.62001844e-04,
        -1.19290268e-03,  -1.84922130e-03,   2.93004734e-04,
         2.96249455e-05,   1.51816988e-03,  -2.59471592e-04,
         3.76018556e-03,  -2.82596541e-03,   1.65637094e-03,
         3.04812216e-03,   9.14487871e-04,   1.11530324e-04,
        -2.13162671e-03,   4.46353946e-03,   2.58666137e-03,
        -1.03072671e-03,   9.98745789e-04,   4.35710885e-03,
        -4.34886059e-03,   3.82006913e-03,   4.95229941e-03,
         3.23724328e-03,   4.41262871e-03,  -1.98566797e-03,
         2.36772909e-03,   3.24898213e-03,   2.41492628e-04,
        -1.90383266e-03,   2.11429098e-04,  -2.54405808e-04,
         3.31444060e-03,  -2.82705179e-03,  -3.90899414e-03,
        -2.32245005e-03,   3.15536046e-03,  -1.59297895e-03,
         3.51237785e-03,  -9.50365677e-04,   6.99797587e-04,
        -2.89310282e-03,  -5.97024162e-04,  -1.25862064e-03,
         1.93469576e-03,   1.78720069e-03,  -2.20232154e-03,
        -8.24157498e-04,  -2.28217826e-03,   3.65720852e-03,
        -4.93786857e-03], dtype=float32)

Now we're going to use t-SNE to reduce dimensionality and plot the tweets:


In [40]:
doc_vectors = [doc2vec_model.docvecs[i] for i, t in enumerate(tweets[:10000])]

In [41]:
tsne_d2v = tsne_model.fit_transform(doc_vectors)


[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.008812
[t-SNE] Error after 100 iterations with early exaggeration: 1.146624
[t-SNE] Error after 375 iterations: 1.140521

Let's put the tweets on a 2D plane:


In [42]:
plot_d2v = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (doc2vec)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_d2v.scatter(x=tsne_d2v[:,0], y=tsne_d2v[:,1],
                    color=colormap[lda_keys][:10000],
                    source=bp.ColumnDataSource({
                        "tweet": tweet_texts[:10000],
                        "processed": tweet_texts_processed[:10000]
                    }))
hover = plot_d2v.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\")"}
show(plot_d2v)


This is a bit disappointing both in terms of separation, as well as semantic similarity of neighboring tweets. Perhaps it can be improved by plugging in word embeddings trained on an external corpus such as Wikipedia. This can be done using the intersect_word2vec_format method in the model object but we're not going to do that here.

Skip-Thought Vectors

Another Deep Learning-based representational model that has gained popularity recently is Skip-Thought Vectors, also called Sent2Vec. It's described as an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, an encoder-decoder model is trained, that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations.

We have used the implementation and models made available by the author, to create Skip-Thought Vectors for our tweets:


In [43]:
import pickle

skipthought_vectors = pickle.load(open("/Users/parsa/Desktop/WebSummit 2015/skip-thought/skip-thoughts/tweet_skipthought_vectors.pk", "rb"), encoding='latin1')

In [44]:
skipthought_vectors.shape


Out[44]:
(77111, 4800)

We're now going to use t-SNE to reduce the dimensionality from 4800 to 2:


In [45]:
tsne_st = tsne_model.fit_transform(skipthought_vectors[:10000])


[t-SNE] Computing pairwise distances...
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.000000
[t-SNE] Error after 100 iterations with early exaggeration: 1.275823
[t-SNE] Error after 350 iterations: 1.217282

Let's plot these vectors:


In [46]:
plot_st = bp.figure(plot_width=900, plot_height=700, title="Web Summit 2015 tweets (Skip-Thought Vectors)",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_st.scatter(x=tsne_st[:,0], y=tsne_st[:,1],
                    color=colormap[lda_keys][:10000],
                    source=bp.ColumnDataSource({
                        "tweet": tweet_texts[:10000],
                        "processed": tweet_texts_processed[:10000]
                    }))
hover = plot_st.select(dict(type=HoverTool))
hover.tooltips={"tweet": "@tweet (processed: \"@processed\")"}
show(plot_st)


There are some nice consolidations here, such as one where most tweets are about excitement and anticipation, but we couldn't find a general pattern for topical similarity. Perhaps this is due to the difference between the two contexts: the model is trained on books, whereas our documents are tweets, and may include words and phrases that don't exist in our model's vocabulary.

Conclusion

We tried several different methods to build a semantic map of our 77k tweets. Each method had its own weaknesses and advantages, but the LDA one looks the most intuitive to us so far.

Here are a few things to try next:

  • Trying a different topic model such as the Hierarchical Dirichlet Process (HDP)
  • Trying the Word Mover's Distance similarity metric, together with the GloVe vectors
  • Improving the doc2vec model by intersecting the model with word vectors obtained from an external, larger corpus
  • Training a new Skip-Thought model on tweets

What's next?

In our next blog in the series we're going to explore in more detail, how the topics evolved over the course of the event, along with the sentiment towards each topic.

Credits

Special thanks to:

  • Hamed Ramezanian and Mike Waldron for collecting the tweets,
  • Sebastian Ruder and Pasquale Minervini for contributing to this report,
  • Amir Saied for proof-reading the code.

We got inspiration for this report from: