DenseTweets

Josh Montague, 2017-09-29

How language embedding models might offer an opportunity for more robust and insightful modeling of short texts.


In [ ]:
import gzip
import json
import logging
import re
import sys

from gensim.models import Word2Vec
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from scipy.spatial import distance
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.datasets import make_blobs
import seaborn as sns

from tweet_parser.tweet import Tweet

%load_ext autoreload
%autoreload 2

In [ ]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    stream=sys.stderr, level=logging.DEBUG)

To motivate a new approach to text modeling, consider a typical approach to modeling text data...

A classic approach to language modeling

The standard approach is to:

  • acquire data
  • tokenize and count tokens for each document
  • optionally transform those counts and end up with a document-term matrix

This matrix is our representation of the observed data. We can use it to surface patterns in data we've already seen (e.g. clustering), or apply it to new data (e.g. assigning a cluster label to a new document).

Note for readers: to get this to work with your own data, replace the input file string below with a file of your own newline-delimited Tweets from one of the Twitter APIs. Later, there will be additional file paths and files that you will need to modify or create.


In [ ]:
input = []
with gzip.open('/mnt3/archives/twitter/2017/09/30/13/twitter_2017-09-30_1304.gz','r') as infile:
    for i,line in enumerate(infile):
        try:
            tw = Tweet(json.loads(line.decode()))
        except json.JSONDecodeError:
            continue
        # strip URLs and numbers for demo purposes
        # https://bit.ly/PyURLre
        text = re.sub(r'(https?://)?(\w*[.]\w+)+([/?=&]+\w+)*', ' ', tw.text)
        text = re.sub(r'\b[0-9]+\b', ' ', text)
        input.append(text)
        # grab just a handful of tweets
        if i == 1000:
            break

In [ ]:
input[:10]

The vectorization step (specifically, the .fit_transform() step) is when we create a model of our corpus. The fit part creates the model and the transform part returns a new representation of the corpus according to that model. The arguments we choose in the vectorizer dictate things like tokenization choices, as well as (optionally) the final dimensionality of the space.

Later, we can (and will!) use the vectorizer to transform new text, and the document term matrix to measure and group observations.

If we use the defaults of the vectorizer, we'll get things like lowercasing and "word-boundary" tokenization, and we'll keep every 1-gram feature that is observed in the corpus.


In [ ]:
# default settings
vec = CountVectorizer()
dtm = vec.fit_transform(input)

# what does the input space look like?
dtm

In [ ]:
print('data matrix is {:.1%} non-zero values'.format(dtm.count_nonzero()/(dtm.shape[0] * dtm.shape[1])))

That low % above is what people mean when they say a feature matrix or the space of the data is sparse.

As an aside, note that the data structure is also technically called a "sparse matrix," which is a little confusing. This particular data structure is usually an efficient optimization for models. So, while that's fine, the content sparsity of the matrix is not! Models can learn poor representations of data when the dimensionality of the feature space is high and the amount of data is low (see also: this session).
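
As a quick illustration of that data-structure point, we can compare the memory used by the CSR sparse structure against a dense copy of the same matrix (this uses standard scipy.sparse.csr_matrix attributes on the dtm from above):


In [ ]:
# the CSR structure stores only the non-zero values plus two index arrays,
# while the dense array stores every cell explicitly
dense_bytes = dtm.toarray().nbytes
sparse_bytes = dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes

print('dense: {:,} bytes, sparse: {:,} bytes ({:.0f}x smaller)'.format(
    dense_bytes, sparse_bytes, dense_bytes / sparse_bytes))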

Nevertheless, let's get a better mental model of the document-term matrix...


In [ ]:
# dataframes have nice reprs
dtm_df = pd.DataFrame(dtm.todense(), columns=vec.get_feature_names())
tweet_count = len(dtm_df)
dtm_df.head()

For now, put aside that we didn't strip stopwords or do much of the other preprocessing that we sometimes do. The general approach is the same; the specific features would just vary a bit.

In this labeled matrix (technically a dataframe), the vector representation of each tweet (row) is the linear combination of the corresponding set of word features, each with a coefficient that is the number of occurrences of the word.

Note that this representation has no sense of word ordering - "the dog saw a cat" has the same vector representation as "the cat saw a dog".
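
A tiny, self-contained illustration of that point (toy sentences, not from our corpus):


In [ ]:
# both toy sentences contain the same tokens, just in a different order,
# so their bag-of-words vectors come out identical
toy_vec = CountVectorizer()
toy_dtm = toy_vec.fit_transform(['the dog saw a cat', 'the cat saw a dog'])

print(toy_vec.get_feature_names())
print(toy_dtm.toarray())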

For example, our first tweet:


In [ ]:
input[0]

has a vector representation with word coefficients (counts) that look like:


In [ ]:
dtm_df.head(1).values[0][:500]

Note: in this illustration, I'm suggesting that the Tweet vector is the linear combination of all the token vectors. That is one of many ways you can do this. For example, another common approach is for the Tweet vector to be the (arithmetic) mean of the token vectors.
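
In the count-based picture, that "linear combination" just means adding up one one-hot token vector per occurrence; a tiny numpy sketch with a hypothetical three-word vocabulary makes that concrete:


In [ ]:
# hypothetical 3-token vocabulary: ['cat', 'dog', 'saw']
# each token is a one-hot vector, and the document "dog saw dog" is their sum
cat = np.array([1, 0, 0])
dog = np.array([0, 1, 0])
saw = np.array([0, 0, 1])

dog + saw + dog   # -> the count vector [0, 2, 1]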

Back to our real tweet vector: most of the token coefficients are zero! This is really unhelpful. There isn't much useful information in a zero - we can't depend on a model to pick up meaning based on small changes around zero. And if many unrelated things are all "zero," then we may end up calling them related by accident, simply because they sit at the same point.

Ultimately that means the only information we have about this tweet vector is encoded in the non-zero dimensions (and linear combination) below:


In [ ]:
# how does this represent some of our tweets?
# compare sampled text array to sampled dtm to make sure they're aligned

# cache the vocabulary once instead of re-building it on every lookup
feature_names = vec.get_feature_names()

for i, doc in enumerate(dtm.toarray()):
    idx = np.nonzero(doc)[0]
    print('vector: ({})'.format(', '.join(feature_names[x] for x in idx)))
    print('[doc: {}]'.format(input[i].replace('\n', ' ')))
    print()

This situation gets even worse if we apply the "100x" rule of thumb (roughly 100 observations per feature) to limit our feature count (pdf, Rule #21).


In [ ]:
# 100x obs for each feature
feats = dtm.shape[0] // 100

# the max_features kwarg keeps only the most frequent terms
vec_small = CountVectorizer(max_features=feats)
dtm_small = vec_small.fit_transform(input)

dtm_small

In [ ]:
dtm_small_df = pd.DataFrame(dtm_small.todense(), columns=vec_small.get_feature_names())

dtm_small_df.head()

We can already see that this is going to turn out poorly. By keeping only the most frequent terms in our feature engineering, we've reduced the document vectors to approximately stopwords only. This is an exaggerated example, but the principle is still correct.


In [ ]:
# how does this represent some of our tweets?
# compare sampled text array to sampled dtm to make sure they're aligned

# cache the vocabulary once instead of re-building it on every lookup
feature_names_small = vec_small.get_feature_names()

for i, doc in enumerate(dtm_small.toarray()):
    idx = np.nonzero(doc)[0]
    print('vector: ({})'.format(', '.join(feature_names_small[x] for x in idx)))
    print('[doc: {}]'.format(input[i].replace('\n', ' ')))
    print()

Note that the order of tokens has no impact on the ultimate tweet vector - only the tokens and their counts matter (or, if we used a TfidfVectorizer, the re-weighted counts).
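
A quick example of that tf-idf variant (we imported TfidfVectorizer above but haven't used it yet); the structure is the same bag of words, only the weighting changes:


In [ ]:
# same bag-of-words structure, but the counts are re-weighted by how rare
# each term is across the corpus (and the rows are L2-normalized by default)
tfidf_vec = TfidfVectorizer()
tfidf_dtm = tfidf_vec.fit_transform(input)

tfidf_dtm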

This approach has two problems for unsupervised learning like clustering, and both are caused by the high sparsity of our observed document-term matrix:

  1. the model of the observed data is not robust
    • a) small changes in coefficients lead to very different vectors
    • b) we will typically further truncate the feature space, which loses some of the already small amount of information
  2. the model doesn't apply well to new observations
    • a) the majority of our newly observed tokens will not be present in the model

Visualizing sparsity

We can highlight this issue (and motivate the next steps) by appealing to our visual sense and intuition.

Imagine we want to use clustering to simplify the representation of our observations - instead of coordinate pairs for each observation, we want a single label.

Intuitively, when grouping a set of real-world data points into clusters we expect there to be some local variation of non-zero values within higher-density regions, and then some sort of gap between these high-density regions. In two dimensions, we can visualize this well. Imagine that each point below represents an observation of something (like a weather measurement) in two dimensions (like temperature and humidity).


In [ ]:
scatter_kwargs=dict(s=100, alpha=0.5)

In [ ]:
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)
X = pd.DataFrame(X, columns=['temp','humidity'])

# might as well use some made up labels for our made up data!
X.plot.scatter(x='temp', y='humidity', **scatter_kwargs);

The specific algorithm we apply in an attempt to discover these clusters isn't too important. The point is that for data that looks like that above, most algorithms would identify those two blobs as unique clusters.

In our language model, the analogous plot to humidity vs. temp would be word1 vs. word2. But as we saw above, those observations appear mostly at (0,0) for any pair of word1 and word2.


In [ ]:
print('total tweets: {}'.format(tweet_count))

# choose other slices for fun
s = slice(500,505)

sns.pairplot(dtm_df.iloc[:,s], diag_kind='kde', plot_kws=dict(s=200,alpha=0.5))

That most of these data points are at (0,0) means that most documents don't include either of the two words in a given pair. Further, none of these words co-occur in a document (that would put a point at (1,1) or (n,n)), and none of them appear more than once.

If we truly believe that there are meaningful patterns in these Tweets ("similarity," "communities," "patterns," however you might describe it), then we need a way to look at - and compare - these data points in a different space.

Effectively, what we're seeking with a new language model is a way to smear out those observations that are currently mostly sitting at (0,0) in a way that has meaning (i.e. it won't be helpful to randomly distribute them in space).

This is the goal of #DenseTweets.

#DenseTweets

The alternative approach proposed here makes the following assumption: we can use (or create) a new feature space for model training and inference where we'll have less risk of spurious results ("curse of dimensionality"), as well as obtain richer semantic structure. This should allow our model results to be more robust to variations in input, as well as be more robust in tasks such as unsupervised clustering.

How do we make this new feature space?

  • First of all, we use as large a training corpus as we can (to see all of the words and uses of those words). This, by itself, doesn't solve the issues we highlighted above, however.

  • Second, instead of representing the model in the feature space of tokens (words), we use a new, abstract space and iteratively train a model that ultimately positions tokens (words) within that space in such a way as to encode their semantic relationship in terms of their distance and position relative to other words.

For example, dog and puppy should be relatively "near" each other in the lower-dimensional space, while dog and cabinet should not necessarily be near each other. Similarly, the positional difference between e.g. man and woman should be similar to that of boy and girl because they represent the same comparison (for a binary view of gender that is likely common in text), while differing only in age. These are the things we mean by semantic relationships, the relative "meaning in language".
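
As a geometric picture of what "near" means here, consider a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and the numbers below are invented purely for illustration); we compare them with cosine distance, just as we will with real Tweet vectors later:


In [ ]:
# made-up vectors, only to show how "semantic similarity" cashes out as geometry:
# similar words should sit at a small cosine distance, unrelated words at a larger one
dog     = np.array([0.9, 0.1, 0.3])
puppy   = np.array([0.8, 0.2, 0.35])
cabinet = np.array([0.1, 0.9, 0.6])

print('dog vs puppy  : {:.3f}'.format(distance.cosine(dog, puppy)))
print('dog vs cabinet: {:.3f}'.format(distance.cosine(dog, cabinet)))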

Sure, but how do we do that, really?

How word vectors work should be its own separate RST, but this is my favorite write-up of it. For now, sit with this inadequate explanation: we assume that all our tokens (words) can be represented as a distribution of weights over a fixed-dimensionality space (typically a few hundred dimensions); that vector of weights is the word's distributed representation. The model is a shallow, dense neural network (a small multi-layer perceptron), optimized (e.g. with SGD) against a loss function that roughly seeks to maximize the conditional probability of observing a specific word given the surrounding (or preceding) words as input.
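
To make the "predict a word from its surrounding words" idea slightly more concrete, here is a toy sketch of generating skip-gram-style (center, context) training pairs from one tokenized sentence; the actual training then nudges the word vectors so that observed pairs become likely under the model. (This is just the pair-generation step, not the optimization.)


In [ ]:
# toy sketch: generate (center, context) pairs with a window of 2 -
# these pairs are the raw material a skip-gram model learns to predict
toy_tokens = 'the quick brown fox jumps'.split()
window = 2

pairs = []
for i, center in enumerate(toy_tokens):
    for j in range(max(0, i - window), min(len(toy_tokens), i + window + 1)):
        if j != i:
            pairs.append((center, toy_tokens[j]))

pairs[:8]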

What does it produce?

In the end, you basically have a giant matrix that represents the distributed weights of (most of) the observed tokens. If we previously thought of our data matrix representation as being Tweets (rows) by word features (columns), you can think of the new matrix representation as being words (rows) by learned feature dimensions (columns).

To create the vector representation of a sentence (or Tweet), you now combine the individual word vectors in some way (like summing them, or taking their vector mean).
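
A minimal sketch of that combination step, using made-up token vectors (the library's actual helper for this is the .get_summary_vector() method shown later):


In [ ]:
# stack the (made-up) dense token vectors for one tweet, then combine them;
# summing and averaging are the two simplest choices
token_vectors = np.array([[0.2, 0.7, 0.1],
                          [0.4, 0.5, 0.3],
                          [0.1, 0.9, 0.2]])

token_vectors.sum(axis=0), token_vectors.mean(axis=0)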

This lower-dimensionality representation of word relationships is called a "word embedding." I'm intentionally glossing over a ton of detail here! Again, if you're curious, I recommend reading this edition of the morning paper.

A library in two acts

Project #DenseTweets works toward this goal in two phases:

  1. use a pre-trained language model from another data source (to project new tweets into a new dense space)
  2. create our own language model from a twitter data corpus that we can then use for both additional modeling and inference on new data

We'll walk through the 0.0.1 version of this code by demonstrating those two phases.

1. Use an existing embedding model (on observed data)

Google News corpus

The original paper on the word2vec family of algorithms included a pre-trained model based on Google News data (100 billion words) for 3 million words and phrases. You can simply download this file.

Importantly, we must remember that the way language is used in news articles is different than that of Tweets. We'll work on this in step 2. Nevertheless, we think the overlap of the two languages is non-zero, so there should be some value in this approach.

Because the GNews model is so easy to get, densetweets includes a helper function to load it.


In [ ]:
import densetweets as dent

In [ ]:
# this takes ~1 min to load
gn_model = dent.load_GNews_model()

In [ ]:
gn_model
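
For reference, a helper like this presumably wraps gensim's standard loader for the word2vec binary format; loading the downloaded file directly would look roughly like the sketch below (the file path is an assumption - point it at wherever you saved the GoogleNews vectors):


In [ ]:
# minimal sketch of loading the pre-trained vectors directly with gensim
# (hypothetical path; not necessarily how densetweets does it internally)
from gensim.models import KeyedVectors

gn_model_direct = KeyedVectors.load_word2vec_format(
    'rdata/GoogleNews-vectors-negative300.bin.gz', binary=True)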

Once we have this representation of language, we can take our previously observed Tweets and project them into this new space. The method below that does this (.get_summary_vector()) makes some choices about how to split a Tweet into words, and how to combine those word vectors into a Tweet summary vector. These are configurable but have sensible defaults.

With verbose logging, this also outputs some useful insight into the construction of our Tweet summary vector. In particular, we can see which words aren't in the model (remember this model wasn't created from a Twitter corpus), and also the fraction of tokens in the Tweet that contribute to the summary vector.


In [ ]:
for i,tw in enumerate(input[:10]):
    # tokenize input text
    tokens = dent.nltk_tweet_tokenizer(tw)
    # show first 5 dimensions of each summary vector
    summary = dent.get_summary_vector(model=gn_model, token_list=tokens)[:5]
    print('tweet #{}: {} ...'.format(i, summary))

The current implementation uses either the KeyedVectors or Word2Vec model from gensim, so those docs are the best reference for methods and attributes. The main highlights are token lookup and similarity measurement.


In [ ]:
# single-word vector lookup is dict-like (only showing first 20 of 300 dimensions)
gn_model['colorado'][:20]

In [ ]:
# we have to account for any words that were never seen in the GNews corpus
try:
    gn_model['adsfaksjdfhlakjshf']
except KeyError as e:
    print(e)

The model exposes a .most_similar() method, which returns the topn terms that are closest to the given word in the 300-dimensional feature space. These should be words that carry similar semantics to the given word.


In [ ]:
gn_model.most_similar('robot', topn=5)

The densetweets library also provides access to a number of useful internal methods for passing data around. Most can be overridden, but they have sensible defaults and can be used without modification.


In [ ]:
# use a minimal fake tweet
tiny_tw_s = """
{"postedTime": "1999-07-18T23:25:04.000Z", 
"body": "N) A Tweet with explicit geo coordinates https://t.co/d7d7d7d7d7", 
"actor": {"displayName": "jk no"}, 
"id": "tag:search.twitter.com,2005:111111111111111"}
"""

# tweet parsing is format agnostic and managed by a dedicated library
tw = dent.parse_tweet(tiny_tw_s)

type(tw)

In [ ]:
# for now, we only parse the tweet text
print(tw.text)
print('-'*10)

# the default tokenizer is NLTK's TweetTokenizer
tokens = dent.extract_tokens(tw)
tokens

The .get_summary_vector() method encodes the specific mapping of tokens to summary. It's currently just the arithmetic vector mean.


In [ ]:
# calculate a summary vector in the dimensionality of the pretrained model
# (only display the first few dimensions)
dent.get_summary_vector(gn_model, tokens)[:10]
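
For intuition, a rough, simplified equivalent of that summary-vector step (skip any out-of-vocabulary tokens and average what's left; naive_summary_vector is a hypothetical stand-in, and the real method has more options) might look like:


In [ ]:
# simplified, hypothetical sketch of a token-list -> summary-vector mapping:
# look up each in-vocabulary token and average the resulting word vectors
def naive_summary_vector(model, token_list):
    vectors = [model[t] for t in token_list if t in model]
    return np.mean(vectors, axis=0) if vectors else None

naive_summary_vector(gn_model, tokens)[:10]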

Modeling on existing data

Once we can calculate summary vectors for any Tweet, we can also apply various models to the data we have on hand.

For example, we can calculate similarity between Tweets. In this tiny collection of data, the similarity metric often ends up acting like a language classifier!

In this calculation, we use scipy's cosine distance - "small distance" is "more similar."


In [ ]:
print('* calculating cosine distance from: \n\n{}\n'.format(input[0]))
print('='*100 + '\n')
print('dist. -- tweet')

v1 = dent.get_summary_vector(gn_model, input[0].split())    
for text in input[:10]:    
    v2 = dent.get_summary_vector(gn_model, text.split())    
    print('{:.3f} -- {}\n'.format(distance.cosine(v1, v2), text.replace('\n',' ')))

Great! Now that we have this different representation there are many explorations to consider, like differences in user or tweet clustering.

While we're setting up some tools for that work, that's not the goal for right now. First, we want to continue working on our points above.

2. Training a new embedding model from our data

It's great to get up and running with a pre-trained model. But, we also want to make our own.

The specific reason here is that we have a strong hunch that the way language is used on twitter is somewhat different than in news articles. If we can build a language model that incorporates the colloquialisms and nuance of twitter, our application of such a model to new data should lead to more robust results.

Training this kind of word embedding model is a relatively slow process, but it's achievable. For example, the last model I trained during hackweek was on about half a day of the 10% stream (in the ballpark of 25M Tweets) and it took about 12 hours to run. In the current implementation, I believe the main bottleneck is the JSON parsing and string tokenization. I'll work on making these faster in a later version :)

To demonstrate, we'll train a model on a very small sample of 10,000 Tweets (the file used below). Note that this will be a pretty bad model for any task to which we might want to apply it! Ideally, we want a lot more data. But this will do for a demo.


In [ ]:
# 10k newline-delimited tweet records from the api
tweet_file = 'rdata/10000-tweets.json'

sm_model = dent.create_model(tweet_file)
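
For a sense of what a training step like that involves, here is a minimal sketch of a tweet-file-to-Word2Vec pipeline (my guess at the general shape, not the library's actual implementation; the gensim parameters shown use the pre-4.0 names):


In [ ]:
# hypothetical sketch: stream a newline-delimited tweet file into gensim's Word2Vec
# (dent.create_model's real signature and defaults may differ)
from nltk.tokenize import TweetTokenizer

def iter_tweet_tokens(path):
    tokenizer = TweetTokenizer()
    with open(path) as f:
        for line in f:
            try:
                tw = Tweet(json.loads(line))
            except json.JSONDecodeError:
                continue
            yield tokenizer.tokenize(tw.text)

# materialize the corpus for this small demo; gensim iterates over it multiple times
# (note: in gensim 4+, `size` is renamed to `vector_size`)
sketch_model = Word2Vec(sentences=list(iter_tweet_tokens(tweet_file)),
                        size=100, window=5, min_count=5)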

Now, we have another word embedding model for which we can use all the tools shown previously. Remember that this model isn't going to be very good - in particular, it won't have a very large vocabulary.


In [ ]:
# stopwords gonna stopword...
sm_model.most_similar('the')

In [ ]:
# can serialize it for later
sm_model.save('rdata/small.model')

In [ ]:
# can reinstantiate it from disk
sm_model_2 = dent.load_model('rdata/small.model')

In [ ]:
sm_model_2.most_similar('the')

But, the real utility of this technique is in training such a model on a large corpus. Due to some technical challenges, this is the largest model I was able to train during hackweek. It's about half a day of the 10% stream.


In [ ]:
model = dent.load_model('rdata/2017-08-25-1_2_hr.model')

In [ ]:
model

In [ ]:
for x in ['dog', 'cat', 'man', 'woman', '#MAGA', '#BLM', 'baseball', 'hockey']:
    print('-- {} --'.format(x))
    print(model.most_similar_cosmul(x, topn=5))
    print()

Wrap up

This session doesn't include any mind-blowing results, but I'm betting that language modeling based on word (and other) embeddings will enable us to do new and valuable things with our text data. densetweets is hopefully a first, tooling-focused step in that direction! I plan to continue work on this code, and will open-source a pip-installable version soon. Stay tuned.

