Comparing collections (Part Two)

  • Motivation
  • Review Part I
  • Describe what we want
  • Kendall's tau
  • Rank-biased overlap
  • Apply to data

Inspired by "A Similarity Measure for Indefinite Rankings", http://www.williamwebber.com/research/papers/wmz10_tois.pdf

Motivation

We often want to programatically describe how two Twitter corpora are difference or the same. A common method for doing this is to count things and rank them by their counts.


In [82]:
import yaml
import time
import operator
import string
import re
import csv
import random

import nltk.tokenize
from sklearn.feature_extraction import text
import twitter
import scipy

Review Part I

Set comparision

  • Looked at intersection and union

List comparision

  • Looked at Pearson's correlation coefficient

Ordinal (rank) comparison

  • Pulled tweets from users with 'mom' in bio.
  • created exact and approximate term frequency distributions of 1-grams in tweet bodies
  • Kendall's tau coefficient compares exact and approximate rankings

Over/under indexing

  • Pulled tweets from users with 'dad' in bio
  • Created exact term frequency distributions for 'mom' and 'dad' tweet corpora
  • Attemped to find a function that de-emphasizes common terms and emphasizes un-shared, highly ranked terms

So what do we want to do?

We want to compare two lists of objects and their counts. This is a common need when comparing counts of hashtags, n-grams, locations, etc.

We want the ability to act on lists that are:

  • non-conjoint: lists may contain different elements
  • incomplete: all elements in the list are not present or not analyzed
  • indefinite: the cutoff for the incomplete list is essentially arbitrary

We also want to apply weighting: where the comparison emphasizes elements with highest counts.

For output, we want to:

  • calculate a similarity score
  • highlight differences (save for next time)

Kendall's tau correlation coefficient, again

Kendall's tau is rank correlation coefficient that compares two sequences of rankings. Specifically, it is a function of the number of concordant and the number of discordant pairs.

In our case, the data might look something like:

ngram rank in corpus 1 rank in corpus 2
dad 1 2
parent 2 3
lol 3 1
know 4 4

The 'dad-parent' pair is concordant (C) because 1>2 and 2>3, while the 'parent-lol' pair is discordant (D), because 2>3 but 3<1. The 'dad-lol' pair is discordant, while the 'lol-know', 'parent-lol', and 'parent-know' pairs are concordant.

The un-normalized tau coefficient is C - D, which is 2 in this case. The normalized version is:

$\tau$ = (C - D)/n(n-1)/2,

where C(D) is the number of concordant(discordant) pairs, and n is the length of the ranking list(s). This gives us $\tau$ = 0.3.

The scipy implementation of this coefficient accounts for ties (multiple entries with the same rank).

Let's look at some common test points for the measure.


In [83]:
## self-correlation
a = [i for i in range(20)]

scipy.stats.kendalltau(a,a).correlation


Out[83]:
1.0

In [84]:
## remember that the rows need not be ordered

# shuffle in place
random.shuffle(a)

scipy.stats.kendalltau(a,a).correlation


Out[84]:
1.0

In [85]:
## anti-correlation
a = [i for i in range(20)]
b = list(reversed(a))

scipy.stats.kendalltau(a,b).correlation


Out[85]:
-1.0

In [88]:
## random case
# correlation will average 0 and get closer to 0 for large lists
a = [i for i in range(1000)]
b = random.sample(a, k=len(a))

scipy.stats.kendalltau(a,b).correlation


Out[88]:
0.0086966966966966971

In [91]:
## ties

# scipy implementation uses:
# https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient#Tau-b

a = [i for i in range(10)]
b = [i for i in range(10)]

# items in list b at indices 0 and 1 will both have rank 1 (zero-based)
b[5] = 1

print(a)
print(b)

scipy.stats.kendalltau(a,b).correlation


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 1, 6, 7, 8, 9]
Out[91]:
0.85398649245343994

Finally:

Kendall's tau is not defined for non-conjoint lists, meaning that it won't work for most incomplete lists.

Rank-biased Overlap

Rank-biased overlap (RBO) is based on the idea the the overlap (or size of the intersection) of two sets is a good, simple starting point for similarity measures. We apply this to ordinal lists by calculating the overlap at varying depths and cleverly aggregating the results. Importantly, this method does not depend on elements being in both lists.

For ordinal lists S and T, the agreement (A) at depth k is given in terms of the overlap (X, the size of the intersection) between S and T at depth k.

$A_{S,T,k} = \frac{X_{S,T,k}}{k}$

The average overlap for 1 <= k <= d gives decent similarity measure.

If you make it a weighted average and choose your weights to be elements of a geometric series on parameter p, you can take d --> infinity and you have a distance measure r bounded by 0 and 1 and controlled by a single parameter, p. Values of p close to 0 emphasize agreement between highly ranked elements, while larger values of p emphasize a broader range of agreement.

For truncated lists, you can calculate exactly the minimum (min) and maximum value that r can take on, given the unknown values lost in truncation. This is usually quoted in terms of min and the residual difference between the minimum and maximum (res).

For truncated lists, the base score (r) is a function of the cutoff depth (d) and can not actually reach 1. We can instead extrapolate from the visible lists and calculate $r_{ext}$ such that it has the range [0-1].


In [96]:
from rbo import rbo

In [97]:
# elements in the list can be any object
rbo(['c', 'b', 'd'], ['a', 'c', 'd'], p=.5)


Out[97]:
{'ext': 0.29166666666666663,
 'min': 0.2612943611198906,
 'res': 0.07203897221344269}

In [105]:
# self-similarity
a = [i for i in range(20)]

rbo(a,a,p=0.9)


Out[105]:
{'ext': 0.9999999999999998,
 'min': 0.965613032847319,
 'res': 0.03438696715268043}

In [106]:
# order doesn't matter
random.shuffle(a)

rbo(a,a,p=0.9)


Out[106]:
{'ext': 0.9999999999999998,
 'min': 0.965613032847319,
 'res': 0.03438696715268043}

In [107]:
# we are comparing ordered lists of objects, not rankings
a = [i for i in string.punctuation]

rbo(a,a,p=0.9)


Out[107]:
{'ext': 0.9999999999999998,
 'min': 0.9928676862776876,
 'res': 0.007132313722311028}

In [109]:
# reversed case
a = [i for i in string.punctuation]
b = list(reversed(a))
print(a)
print(b)
rbo(a,b,p=0.9)


['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
['~', '}', '|', '{', '`', '_', '^', ']', '\\', '[', '@', '?', '>', '=', '<', ';', ':', '/', '.', '-', ',', '+', '*', ')', '(', "'", '&', '%', '$', '#', '"', '!']
Out[109]:
{'ext': 0.11254679714651897,
 'min': 0.10541448342420914,
 'res': 0.007132313722311028}

In [110]:
# random comparison
a = [i for i in string.punctuation]
b = random.sample(a, k=len(a))

rbo(a,b,p=0.9)


Out[110]:
{'ext': 0.3650820454635307,
 'min': 0.357949731741221,
 'res': 0.007132313722311028}

Apply it!

Lesson: most vocabularies on Twitter (1-grams, 2-grams, hashtags) for a location, date, etc. are sparsely populated, meaning that the rankings are largely non-conjoint. When comparing 2 rankings with the similarity measurements described above, it's hard to know when small similarity differences are due to statistics, platform differences, or real textual differences.

Two paths forward:

  • draw random samples from a single corpus to create a distribution for the null hypothesis
  • create a small vocabulary

Get tweets

Let's collect 3 data sets matching 3 keywords: "mom", "dad", and "mum", hypothesizing that a good similarity measurment will be smaller for "mom" vs "mum" than for "mom" vs "dad".


In [ ]:
"""
Get Tweets from the Twitter public API
"""

"""
# Get your Twitter API tokens
# this is specific to my computer; modify for yours
my_creds_file = '/Users/jkolb/.twitter_api_creds'
creds = yaml.load(open(my_creds_file))
consumer_key = creds['audience']['consumer_key']
consumer_secret = creds['audience']['consumer_secret']
access_token_key = creds['audience']['token']
access_token_secret = creds['audience']['token_secret']
"""

In [ ]:
"""
api = twitter.Api(consumer_key=consumer_key,
                 consumer_secret=consumer_secret,
                 access_token_key=access_token_key,
                 access_token_secret=access_token_secret
                 )

mom_tweets = []
for _ in range(20):
    mom_tweets.extend( api.GetSearch("mom",count=100) )
    time.sleep(1)

dad_tweets = []
for _ in range(20):
    dad_tweets.extend( api.GetSearch("dad",count=100) )
    time.sleep(1)

mum_tweets = []
for _ in range(20):
    mum_tweets.extend( api.GetSearch("mom",count=100) )
    time.sleep(1)
"""

In [111]:
"""
Get Tweets from the Gnip Search API
"""

from search.api import Query
import json
import yaml
creds = yaml.load(open('/Users/jkolb/.creds.yaml'))

# set up a query to the Gnip Search API
q = Query(creds['username'],
          creds['password'],
          creds['search_endpoint'],
          paged=True,
          hard_max = 2000, ## <--- control tweet volume here
          )

# query parameters
start_date = '2017-06-01T00:00'
end_date = '2017-06-03T00:00'

# get the tweet data
rule = 'mom'
rule += ' -is:retweet'
q.execute(rule,start=start_date,end=end_date)
mom_tweets = list(q.get_activity_set())

rule = 'dad'
rule += ' -is:retweet'
q.execute(rule,start=start_date,end=end_date)
dad_tweets = list(q.get_activity_set())

rule = 'mum'
rule += ' -is:retweet'
q.execute(rule,start=start_date,end=end_date)
mum_tweets = list(q.get_activity_set())


[    4608 bytes]   500 total activities retrieved...
Fetching page 2...
[    9112 bytes]  1000 total activities retrieved...
Fetching page 3...
[   13608 bytes]  1500 total activities retrieved...
Fetching page 4...
[    4608 bytes]   500 total activities retrieved...
Fetching page 2...
[    9112 bytes]  1000 total activities retrieved...
Fetching page 3...
[   13608 bytes]  1500 total activities retrieved...
Fetching page 4...
[    4608 bytes]   500 total activities retrieved...
Fetching page 2...
[    9112 bytes]  1000 total activities retrieved...
Fetching page 3...
[   13608 bytes]  1500 total activities retrieved...
Fetching page 4...

Do better n-gram extraction

  • better stopword list removes unimportant words from list of top-ranked n-grams
  • better lemmatization and normalization removes duplicates

In [112]:
## get tweet bodies
dad_bodies = [tweet['body'] for tweet in dad_tweets]
mom_bodies = [tweet['body'] for tweet in mom_tweets]
mum_bodies = [tweet['body'] for tweet in mum_tweets]

In [113]:
## create a tweet tokenizer and stopword list
my_additional_stop_words = ['https','rt']
my_additional_stop_words.extend(string.punctuation)
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=False, reduce_len=True)

In [114]:
## make vectorizers
dad_ngram_vectorizer = text.CountVectorizer(lowercase=True,
                             stop_words=stop_words,
                             ngram_range=(1,2),
                             tokenizer = tokenizer.tokenize,
                             min_df = 2,
                            )
dad_ngram_vectorizer_idf = text.TfidfVectorizer(lowercase=True,
                             stop_words=stop_words,
                             ngram_range=(1,2),
                             tokenizer = tokenizer.tokenize,
                             min_df = 2,
                            )
mom_ngram_vectorizer = text.CountVectorizer(lowercase=True,
                             stop_words=stop_words,
                             ngram_range=(1,2),
                             tokenizer = tokenizer.tokenize,
                             min_df = 2,
                            )
mom_ngram_vectorizer_idf = text.TfidfVectorizer(lowercase=True,
                             stop_words=stop_words,
                             ngram_range=(1,2),
                             tokenizer = tokenizer.tokenize,
                             min_df = 2,
                            )
mum_ngram_vectorizer = text.CountVectorizer(lowercase=True,
                             stop_words=stop_words,
                             ngram_range=(1,2),
                             tokenizer = tokenizer.tokenize,
                             min_df = 2,
                            )
mum_ngram_vectorizer_idf = text.TfidfVectorizer(lowercase=True,
                             stop_words=stop_words,
                             ngram_range=(1,2),
                             tokenizer = tokenizer.tokenize,
                             min_df = 2,
                            )

In [115]:
# helper functions
def ngram_freq_from_dtmatrix(dtmatrix,col_names):
    return dict([(ngram,dtmatrix.getcol(icol).toarray().sum()) for icol,ngram in enumerate(col_names)])
def ranked_tuples_from_ngram_freq(term_freq_dict):
    return list(reversed(sorted(term_freq_dict.items(),key=operator.itemgetter(1))))

In [116]:
## get top ranked ngrams for 'dad' tweets 

dad_dtmatrix = dad_ngram_vectorizer.fit_transform(dad_bodies)
dad_ngrams = dad_ngram_vectorizer.get_feature_names()

dad_tf_dict = ngram_freq_from_dtmatrix(dad_dtmatrix,dad_ngrams)
dad_ngrams_ranked = ranked_tuples_from_ngram_freq(dad_tf_dict)

In [117]:
## get top ranked ngrams for 'mom' tweets 

mom_dtmatrix = mom_ngram_vectorizer.fit_transform(mom_bodies)
mom_ngrams = mom_ngram_vectorizer.get_feature_names()

mom_tf_dict = ngram_freq_from_dtmatrix(mom_dtmatrix,mom_ngrams)
mom_ngrams_ranked = ranked_tuples_from_ngram_freq(mom_tf_dict)

In [118]:
## get top ranked ngrams for 'mum' tweets 

mum_dtmatrix = mum_ngram_vectorizer.fit_transform(mum_bodies)
mum_ngrams = mum_ngram_vectorizer.get_feature_names()

mum_tf_dict = ngram_freq_from_dtmatrix(mum_dtmatrix,mum_ngrams)
mum_ngrams_ranked = ranked_tuples_from_ngram_freq(mum_tf_dict)

In [120]:
# sanity check
dad_ngrams_ranked[:20]


Out[120]:
[('dad', 1887),
 ('…', 281),
 ('...', 183),
 ('just', 182),
 ('😂', 178),
 ("i'm", 156),
 ('like', 139),
 ('mom', 118),
 ('got', 92),
 ("dad's", 91),
 ('day', 83),
 ('️', 82),
 ("don't", 75),
 ('love', 70),
 ('😂 😂', 67),
 ("he's", 66),
 ('said', 62),
 ("it's", 60),
 ('mom dad', 57),
 ('❤', 55)]

Simply by looking at this list, we can see other avenues for improving ngram extraction.

  • do we include handles?
  • should we remove RTs?
  • what about emoji?
  • minimum token length?

For now, we won't go down these paths.

Try Kendall's tau


In [121]:
## now let's extract the rankings and compare

# probably want to cut off the rankings somewhere
cutoff = 10000
final_cutoff = 300

# get the (ngram,rank) lists
dad_ngram_ranks = {ngram:rank for rank,(ngram,count) in enumerate(dad_ngrams_ranked[:cutoff])}
mom_ngram_ranks = {ngram:rank for rank,(ngram,count) in enumerate(mom_ngrams_ranked[:cutoff])}
mum_ngram_ranks = {ngram:rank for rank,(ngram,count) in enumerate(mum_ngrams_ranked[:cutoff])}

# get the rank lists
# NB: if cutoff lists are not conjoint (they probably aren't), 
# you'll have to choose one list as a reference
dad_ranks = []
mom_ranks = []
mum_ranks = []

data = []
for ngram,mom_rank in mom_ngram_ranks.items():
    try:
        dad_rank = dad_ngram_ranks[ngram]
    except KeyError:
        # for elements not in list, rank them last
        dad_rank = cutoff 
    try:
        # for elements not in list, rank them last
        mum_rank = mum_ngram_ranks[ngram]
    except KeyError:
        mum_rank = cutoff 
        
    if mom_rank < final_cutoff:
        dad_ranks.append(dad_rank)
        mom_ranks.append(mom_rank)
        mum_ranks.append(mum_rank)

        data.append((ngram,mom_rank,mum_rank,dad_rank))

In [122]:
dad_mom_tau = scipy.stats.kendalltau(dad_ranks,mom_ranks).correlation
mum_mom_tau = scipy.stats.kendalltau(mum_ranks,mom_ranks).correlation

print('Tau')
print('cutoff = ' + str(final_cutoff))
print('mom-dad: ' + str(dad_mom_tau))
print('mom-mum: ' + str(mum_mom_tau))


Tau
cutoff = 300
mom-dad: 0.450678709625
mom-mum: 0.402961287131

Try RBO


In [123]:
mom_top_ngrams = [ngram for ngram,ct in mom_ngrams_ranked][:final_cutoff]
mum_top_ngrams = [ngram for ngram,ct in mum_ngrams_ranked][:final_cutoff]
dad_top_ngrams = [ngram for ngram,ct in dad_ngrams_ranked][:final_cutoff]

mum_mom_rbo = rbo(mom_top_ngrams,mum_top_ngrams,p=0.9)['ext']
dad_mom_rbo = rbo(mom_top_ngrams,dad_top_ngrams,p=0.9)['ext']

print('RBO')
print('cutoff = ' + str(cutoff))
print('mom-dad: ' + str(dad_mom_rbo))
print('mom-mum: ' + str(mum_mom_rbo))


RBO
cutoff = 10000
mom-dad: 0.528924497023988
mom-mum: 0.5984254668237872

Results

It's hard to interpret similarity scores without carefully selecting the data.

Next steps

  • Find corpora for comparison that have more comparable vocabularies
    • "mom" vs "dad" in bio
    • finer slicing to get more similar audiences
      • "mom" vs "dad" in bio for profile location in California
    • event driven data: #nbafinals for users from Cleveland vs Bay Area
  • Make more careful hypothesis tests
    • looked at intra-corpus differences to produce a distribution of the measurement under the null hypothesis

In [ ]: