Introduction

There is a perception that Twitter data can be used to surface insights: unexpected features of the data that have business value. In this tutorial, I will explore some of the difficulties and opportunities of turning that perception into reality.

We will focus exclusively on text analysis, and on insights represented by textual differences between documents and corpora. We will start by constructing a small, simple data set that represents a few notions of what insights should be surfaced. We can then examine which techniques uncover which insights.

Next, we will move to real data, where we don't know what we might surface. We will have to address data cleaning and curation, both at the beginning and iteratively, as our insight generation surfaces artifacts of insufficient curation. We will finish by developing and evaluating a variety of tools and techniques for comparing text-based data.

Resources

Good further reading, and the source of some of the ideas here: https://de.dariah.eu/tatom/feature_selection.html

Setup

Requires Python 3.6 or greater
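
The notebook also depends on a few third-party packages. Here is a minimal setup sketch; the PyPI package names below are assumptions, so adjust them for your environment:


In [ ]:
# install dependencies if needed (package names assumed to be the usual PyPI names)
!pip install nltk scikit-learn numpy searchtweets tweet_parser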


In [ ]:
import itertools
import nltk
import operator
import numpy as np

In [ ]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
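
The NLTK stopword list used later requires its corpus data. If it isn't already present on your machine, this optional cell downloads it:


In [ ]:
# one-time download of the NLTK stopword corpus used for filtering below
nltk.download('stopwords')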

A Synthetic Example

Let's build some intuition by creating two artificial documents, which represent textual differences that we might intend to surface.


In [ ]:
doc0,doc1 = ('bun cat cat dog bird','bun cat dog dog dog')

In terms of unigram frequency, here are 3 differences:

  • 1 more "cat" in doc0 than in doc1
  • 2 more "dog" in doc1 than in doc0
  • "bird" only exists in doc0

Let's throw together a function that prints out the differences in term frequencies:


In [ ]:
def func(doc0,doc1,vectorizer):
    """
    print the per-unigram difference in coefficients (term frequencies, for a count vectorizer) between doc0 and doc1
    """
    tf = vectorizer.fit_transform([doc0,doc1])
    # this is a 2-row matrix: one row per document (doc0, doc1), one column per token
    tfa = tf.toarray()
    # make tuples of the tokens and the difference of their doc0 and doc1 coefficients
    # if we use a basic token count vectorizer, this is the term frequency difference 
    tup = zip(vectorizer.get_feature_names(),tfa[0] - tfa[1])
    # print the top-10 tokens ranked by the difference measure
    for token,score in list(reversed(sorted(tup,key=operator.itemgetter(1))))[:10]:
        print(token,score)

In [ ]:
func(doc0,doc1,CountVectorizer())

Observations:

  • positive numbers are more "doc0-like"
  • the "dog" score is higher in absolute value than the bird score
  • "bird" and "cat" are indistinguishable

Let's try TF-IDF, which weights term frequency by inverse document frequency.


In [ ]:
func(doc0,doc1,TfidfVectorizer())

Observations:

  • "bird" now has a larger coefficient that "cat"
  • "dog is still most significant that "cat"

How does this scale?

Let's construct:

  • doc0 is +1 "cat"
  • doc0 is +40 "bun"
  • doc0 is +1 "bird"

In [ ]:
doc0 = 'cat '*5 + 'dog '*3 + 'bun '*350 + 'bird '
doc1 = 'cat '*4 + 'dog '*3 + 'bun '*310

In [ ]:
func(doc0,doc1,CountVectorizer())

In [ ]:
func(doc0,doc1,TfidfVectorizer())

Observations:

  • "bird" stands out strongly
  • "cat" and "dog" are similar in absolute value
  • "bun" is the least significant token

What about including 2-grams?


In [ ]:
func(doc0,doc1,TfidfVectorizer(ngram_range=(1,2)))

That's impossible to read. Let's build better formatting into our function.


In [ ]:
def func(doc0,doc1,vectorizer):
    tf = vectorizer.fit_transform([doc0,doc1])
    tfa = tf.toarray()
    tup = zip(vectorizer.get_feature_names(),tfa[0] - tfa[1])
    
    # rank tokens by the difference measure and keep the top 10
    output_tuples = list(reversed(sorted(tup,key=operator.itemgetter(1))))[:10]
    # find the longest token so the printed scores line up
    max_token_length = 0

    for token,score in output_tuples:
        if max_token_length < len(token):
            max_token_length = len(token)
    for token,score in output_tuples:
        print(f"{token:{max_token_length}s} {score:.3e}")

In [ ]:
func(doc0,doc1,TfidfVectorizer(ngram_range=(1,2)))

Observations:

  • grams with "bird" still stand out
  • scores are getting hard to interpret

Let's get some real data.


In [ ]:
import string
from tweet_parser.tweet import Tweet
from searchtweets import (ResultStream,
                           collect_results,
                           gen_rule_payload,
                           load_credentials)

search_args = load_credentials(filename="~/.twitter_keys.yaml",
                               account_type="enterprise")

In [ ]:
_pats_rule = "#patriots OR @patriots"

In [ ]:
_eagles_rule = "#eagles OR @eagles"

In [ ]:
from_date="2018-01-28"
to_date="2018-01-29"
max_results = 3000

pats_rule = gen_rule_payload(_pats_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )
eagles_rule = gen_rule_payload(_eagles_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )

In [ ]:
eagles_results_list = collect_results(eagles_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

In [ ]:
pats_results_list = collect_results(pats_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

Join all tweet bodies in a corpus into one space-delimited document.


In [ ]:
eagles_body_text = [tweet['body'] for tweet in eagles_results_list]
eagles_doc = ' '.join(eagles_body_text)

In [ ]:
pats_body_text = [tweet['body'] for tweet in pats_results_list]
pats_doc = ' '.join(pats_body_text)

Let's have a look at the data (AS YOU ALWAYS SHOULD).


In [ ]:
eagles_body_text[:10]

Whew...this is gonna take some cleaning.

Let's start with a tokenizer and a stopword list.


In [ ]:
tokenizer = nltk.tokenize.TweetTokenizer()
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(string.punctuation)

In [ ]:
vectorizer = TfidfVectorizer(
    tokenizer=tokenizer.tokenize,
    stop_words=stopwords,
    ngram_range=(1,2)
)

Here are the top 10 1- and 2-grams for the Eagles corpus/document.


In [ ]:
func(eagles_doc,pats_doc,vectorizer)

Add the ability to specify n in top-n.


In [ ]:
def compare_docs(doc0,doc1,vectorizer,n_to_display=10):
    tfm_sparse = vectorizer.fit_transform([doc0,doc1])
    tfm = tfm_sparse.toarray()
    tup = zip(vectorizer.get_feature_names(),tfm[0] - tfm[1])
    
    # rank tokens by the difference measure and keep the top n
    output_tuples = list(reversed(sorted(tup,key=operator.itemgetter(1))))[:n_to_display]
    # find the longest token so the printed scores line up
    max_token_length = 0

    for token,score in output_tuples:
        if max_token_length < len(token):
            max_token_length = len(token)
    for token,score in output_tuples:
        print(f"{token:{max_token_length}s} {score:.3e}")

In [ ]:
compare_docs(eagles_doc,pats_doc,vectorizer,n_to_display=30)

In [ ]:
compare_docs(pats_doc,eagles_doc,vectorizer,n_to_display=30)

We can't really evaluate more sophisticated text-comparison techniques without doing better filtering on the data.


In [ ]:
# add token filtering to the TweetTokenizer
def filter_tokens(token):
    if len(token) < 2:
        return False
    if token.startswith('http'):
        return False
    if '’' in token:
        return False
    if '…' in token or '...' in token:
        return False
    return True
def custom_tokenizer(doc):
    initial_tokens = tokenizer.tokenize(doc)
    return [token for token in initial_tokens if filter_tokens(token)]

In [ ]:
vectorizer = TfidfVectorizer(
    tokenizer=custom_tokenizer,
    stop_words=stopwords,
    ngram_range=(1,2),
)

In [ ]:
compare_docs(eagles_doc,pats_doc,vectorizer,n_to_display=20)

In [ ]:
compare_docs(pats_doc,eagles_doc,vectorizer,n_to_display=20)

Retweets make a mess of a term-frequency analysis on documents consisting of concatenated tweet bodies. Remove them for now.


In [ ]:
eagles_body_text_noRT = [tweet['body'] for tweet in eagles_results_list if tweet['verb'] == 'post']
eagles_doc_noRT = ' '.join(eagles_body_text_noRT)

pats_body_text_noRT = [tweet['body'] for tweet in pats_results_list if tweet['verb'] == 'post']
pats_doc_noRT = ' '.join(pats_body_text_noRT)

vectorizer = TfidfVectorizer(
    tokenizer=custom_tokenizer,
    stop_words=stopwords,
    ngram_range=(1,2),
)

compare_docs(eagles_doc_noRT,pats_doc_noRT,vectorizer,n_to_display=20)
print("\n")
compare_docs(pats_doc_noRT,eagles_doc_noRT,vectorizer,n_to_display=20)

Well, now we have clear evidence of the political usage of "#patriots" that the hashtag clause in our rule pulled in. Let's simplify things by removing the hashtags from the rules.


In [ ]:
_pats_rule = "@patriots"

In [ ]:
_eagles_rule = "@eagles"

In [ ]:
from_date="2018-01-28"
to_date="2018-01-29"
max_results = 20000

pats_rule = gen_rule_payload(_pats_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )
eagles_rule = gen_rule_payload(_eagles_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )

In [ ]:
eagles_results_list = collect_results(eagles_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

In [ ]:
pats_results_list = collect_results(pats_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

In [ ]:
eagles_body_text_noRT = [tweet['body'] for tweet in eagles_results_list if tweet['verb'] == 'post']
eagles_doc_noRT = ' '.join(eagles_body_text_noRT)

pats_body_text_noRT = [tweet['body'] for tweet in pats_results_list if tweet['verb'] == 'post']
pats_doc_noRT = ' '.join(pats_body_text_noRT)

vectorizer = TfidfVectorizer(
    tokenizer=custom_tokenizer,
    stop_words=stopwords,
    ngram_range=(1,2),
)

compare_docs(eagles_doc_noRT,pats_doc_noRT,vectorizer,n_to_display=20)
print("\n")
compare_docs(pats_doc_noRT,eagles_doc_noRT,vectorizer,n_to_display=20)

Things we could do:

  • vectorize tweets as documents, and summarize or aggregate the coefficients
  • select "distinct" tokens, for which the mean coefficient within one of the corpora is zero
  • look at the difference in mean coefficients for the remaining tokens

Let's start by going back to simple corpora, and account for individual docs this time.


In [ ]:
corpus0 = ["cat","cat dog"]
corpus1 = ["bun","dog","cat"]

In [ ]:
# basic unigram vectorizer with Twitter-specific tokenization and stopwords
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)

In [ ]:
# get the term-frequency matrix
m = vectorizer.fit_transform(corpus0+corpus1)
vocab = np.array(vectorizer.get_feature_names())
print(vocab)

m = m.toarray()
print(m)

In [ ]:
# get TF matrices for each corpus
corpus0_indices = range(len(corpus0))
corpus1_indices = range(len(corpus0),len(corpus0)+len(corpus1))
m0 = m[corpus0_indices,:]
m1 = m[corpus1_indices,:]
print(m0)

In [ ]:
# calculate the average term frequency within each corpus
c0_means = np.mean(m0,axis=0)
c1_means = np.mean(m1,axis=0)
print(c0_means)

In [ ]:
# calculate the indices of the distinct tokens, which only occur in a single corpus
distinct_indices = c0_means * c1_means == 0
print(vocab[distinct_indices])

In [ ]:
# now remove the distinct tokens' columns from the term-frequency matrix
print(m[:, np.invert(distinct_indices)])

In [ ]:
# recalculate things
m0_non_distinct = m[:, np.invert(distinct_indices)][corpus0_indices,:]
m1_non_distinct = m[:, np.invert(distinct_indices)][corpus1_indices,:]
c0_non_distinct_means = np.mean(m0_non_distinct,axis=0)
c1_non_distinct_means = np.mean(m1_non_distinct,axis=0)
# and take the difference
print(c0_non_distinct_means - c1_non_distinct_means)

This difference in averages is sometimes called "keyness".
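
Concretely, for the toy corpora above (a quick hand-check, not part of the pipeline): "bun" is distinct and gets dropped, "cat" appears in both docs of corpus0 but in only one of the three docs of corpus1, and "dog" appears in one doc of each corpus.


In [ ]:
# hand-check of the keyness values printed above (sketch)
# "cat": mean count 1.0 in corpus0, 1/3 in corpus1
# "dog": mean count 0.5 in corpus0, 1/3 in corpus1
print("cat keyness:", 1.0 - 1/3)
print("dog keyness:", 0.5 - 1/3)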

Now let's do it on real data.


In [ ]:
# build and identify the corpora
docs = eagles_body_text_noRT + pats_body_text_noRT
eagles_indices = range(len(eagles_body_text_noRT))
pats_indices = range(len(eagles_body_text_noRT),len(eagles_body_text_noRT) + len(pats_body_text_noRT))

In [ ]:
# use a single vectorizer because we care about the joint vocabulary
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)

dtm = vectorizer.fit_transform(docs).toarray()
vocab = np.array(vectorizer.get_feature_names())


eagles_dtm = dtm[eagles_indices, :]
pats_dtm = dtm[pats_indices, :]

Take the average coefficient for each vocab element, for each corpus.


In [ ]:
# columns for every token in the vocab; rows for tweets in the corpus
eagles_means = np.mean(eagles_dtm,axis=0)
pats_means = np.mean(pats_dtm,axis=0)

Start by looking for distinct tokens, which only exist in one corpus.


In [ ]:
# get indices for any column with zero mean in either corpus
distinct_indices = eagles_means * pats_means == 0

In [ ]:
print(str(np.count_nonzero(distinct_indices)) + " distinct tokens out of " + str(len(vocab)))

In [ ]:
eagles_ranking = np.argsort(eagles_means[distinct_indices])[::-1]
pats_ranking = np.argsort(pats_means[distinct_indices])[::-1]
total_ranking = np.argsort(eagles_means[distinct_indices] + pats_means[distinct_indices])[::-1]

In [ ]:
vocab[distinct_indices][total_ranking]

In [ ]:
print("Top distinct Eagles tokens by average term count in Eagles corpus")
for token in vocab[distinct_indices][eagles_ranking][:10]:
    print_str = f"{token:30s} {eagles_means[vectorizer.vocabulary_[token]]:.3g}"
    print(print_str)

In [ ]:
print("Top distinct Patriots tokens by average term count in Patriots corpus")
for token in vocab[distinct_indices][pats_ranking][:10]:
    print_str = f"{token:30s} {pats_means[vectorizer.vocabulary_[token]]:.3g}"
    print(print_str)

How does this change if we account for inverse document frequency?

Let's build a function and encapsulate this.


In [ ]:
def compare_corpora(corpus0,corpus1,vectorizer,n_to_display=10):
    corpus0_indices = range(len(corpus0))
    corpus1_indices = range(len(corpus0), len(corpus0) + len(corpus1))
    m_sparse = vectorizer.fit_transform(corpus0 + corpus1)
    m = m_sparse.toarray()

    vocab = np.array(vectorizer.get_feature_names())
    m_corpus0 = m[corpus0_indices,:]
    m_corpus1 = m[corpus1_indices,:]
    
    corpus0_means = np.mean(m_corpus0,axis=0)
    corpus1_means = np.mean(m_corpus1,axis=0)
    
    distinct_indices = corpus0_means * corpus1_means == 0
    print(str(np.count_nonzero(distinct_indices)) + " distinct tokens out of " + str(len(vocab)) + '\n')    
    
    corpus0_ranking = np.argsort(corpus0_means[distinct_indices])[::-1]
    corpus1_ranking = np.argsort(corpus1_means[distinct_indices])[::-1]

    print("Top distinct tokens from corpus0 by average term count in corpus")
    for token in vocab[distinct_indices][corpus0_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus0_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)
    print()
    print("Top distinct tokens from corpus1 by average term count in corpus")
    for token in vocab[distinct_indices][corpus1_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus1_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)

In [ ]:
#vectorizer = TfidfVectorizer(
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)
compare_corpora(eagles_body_text_noRT,pats_body_text_noRT,vectorizer)

Now let's remove the distinct tokens and rank the remaining tokens by the difference in means (the "keyness").


In [ ]:
def compare_corpora(corpus0,corpus1,vectorizer,n_to_display=10):
    
    # get corpus indices
    corpus0_indices = range(len(corpus0))
    corpus1_indices = range(len(corpus0), len(corpus0) + len(corpus1))
    m_sparse = vectorizer.fit_transform(corpus0 + corpus1)
    m = m_sparse.toarray()

    # get vocab and TF matrices for each corpus
    vocab = np.array(vectorizer.get_feature_names())
    m_corpus0 = m[corpus0_indices,:]
    m_corpus1 = m[corpus1_indices,:]
    
    corpus0_means = np.mean(m_corpus0,axis=0)
    corpus1_means = np.mean(m_corpus1,axis=0)
    
    distinct_indices = corpus0_means * corpus1_means == 0
    print(str(np.count_nonzero(distinct_indices)) + " distinct tokens out of " + str(len(vocab)) + '\n')    
    
    corpus0_ranking = np.argsort(corpus0_means[distinct_indices])[::-1]
    corpus1_ranking = np.argsort(corpus1_means[distinct_indices])[::-1]

    print("Top distinct tokens from corpus0 by average term count in corpus")
    for token in vocab[distinct_indices][corpus0_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus0_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)
    print()
    print("Top distinct tokens from corpus1 by average term count in corpus")
    for token in vocab[distinct_indices][corpus1_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus1_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)    
    
    # remove distinct tokens
    m = m[:, np.invert(distinct_indices)]
    vocab = vocab[np.invert(distinct_indices)]
    
    # recalculate stuff
    m_corpus0 = m[corpus0_indices,:]
    m_corpus1 = m[corpus1_indices,:]
    corpus0_means = np.mean(m_corpus0,axis=0)
    corpus1_means = np.mean(m_corpus1,axis=0)
    
    # get "keyness"
    keyness = corpus0_means - corpus1_means
    # order token indices by keyness
    ranking = np.argsort(keyness)[::-1]
    
    print()
    print("Top tokens by keyness from corpus0 by average term count in corpus")
    for rank in ranking[:n_to_display]:
        token = vocab[rank]
        print_str = f"{token:30s} {keyness[rank]:.3g}"
        print(print_str)       
   
    print()
    print("Top tokens by keyness from corpus1 by average term count in corpus")
    for rank in ranking[-n_to_display:]:
        token = vocab[rank]
        print_str = f"{token:30s} {keyness[rank]:.3g}"
        print(print_str)

In [ ]:
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)
compare_corpora(eagles_body_text_noRT,pats_body_text_noRT,vectorizer)
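
To see how inverse document frequency changes these rankings (the question posed earlier), we can swap in a TfidfVectorizer; a sketch reusing the same tokenizer and stopword list:


In [ ]:
# the same comparison, but with TF-IDF weighting instead of raw term counts
tfidf_vectorizer = TfidfVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)
compare_corpora(eagles_body_text_noRT,pats_body_text_noRT,tfidf_vectorizer)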

Observations

  • Distinct tokens are frequently players' handles or institutions specific to one team.
  • Term counts and IDF surface different top keyness terms; this would need further investigation.
  • Taking the mean occurrence across the docs in a corpus might not be the best aggregation for a space as sparse as Twitter; see the sketch below for one alternative.
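
One alternative aggregation worth trying (a sketch, not something explored above): compare the fraction of tweets in each corpus that contain a token at least once, which is less sensitive to a handful of tweets repeating a token many times. CountVectorizer's binary option gives exactly this once we take column means.


In [ ]:
# document-frequency-style aggregation: each tweet contributes 0 or 1 per token
binary_vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1),
                    binary=True
)
compare_corpora(eagles_body_text_noRT,pats_body_text_noRT,binary_vectorizer)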

Despite common intuition around surfacing insights through text analysis and token counting, there isn't One Way to do this.

