Inspired by "A Similarity Measure for Indefinite Rankings", http://www.williamwebber.com/research/papers/wmz10_tois.pdf
In [ ]:
import yaml
import time
import operator
import string
import re
import csv
import random
import nltk.tokenize
from sklearn.feature_extraction import text
import twitter
import scipy.stats
We want to compare two lists of objects and their counts. This is a common need when comparing counts of hashtags, n-grams, locations, etc.
We want the ability to act on lists that are incomplete (truncated) and that may be non-conjoint (elements can appear in one list but not the other).
We also want the option to apply weighting, so that the comparison emphasizes the elements with the highest counts.
For output, we want a single, bounded similarity score.
Kendall's tau is a rank correlation coefficient that compares two sequences of rankings. Specifically, it is a function of the number of concordant and discordant pairs.
In our case, the data might look something like:
| ngram | rank in corpus 1 | rank in corpus 2 |
|---|---|---|
| dad | 1 | 2 |
| parent | 2 | 3 |
| lol | 3 | 1 |
| know | 4 | 4 |
The 'dad-parent' pair is concordant (C) because 1 < 2 and 2 < 3: both corpora rank 'dad' above 'parent'. The 'parent-lol' pair is discordant (D) because 2 < 3 but 3 > 1: the two corpora order the pair differently. The 'dad-lol' pair is also discordant, while the 'dad-know', 'parent-know', and 'lol-know' pairs are concordant.
The un-normalized tau coefficient is $C - D$, which is $4 - 2 = 2$ in this case. The normalized version is:

$\tau = \frac{C - D}{n(n-1)/2}$,

where $C$ ($D$) is the number of concordant (discordant) pairs and $n$ is the length of the ranking list(s). This gives us $\tau = 2/6 \approx 0.33$.
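As a quick sanity check, here is a brute-force count of the concordant and discordant pairs for the four ngrams in the table above (a small sketch using the table's ranks, not real corpus data):
In [ ]:
## brute-force tau for the example table above
from itertools import combinations
rank_1 = {'dad': 1, 'parent': 2, 'lol': 3, 'know': 4}
rank_2 = {'dad': 2, 'parent': 3, 'lol': 1, 'know': 4}
concordant = 0
discordant = 0
for x, y in combinations(rank_1, 2):
    # a pair is concordant if both rankings order it the same way
    if (rank_1[x] - rank_1[y]) * (rank_2[x] - rank_2[y]) > 0:
        concordant += 1
    else:
        discordant += 1
n = len(rank_1)
print(concordant, discordant)                         # 4 2
print((concordant - discordant) / (n * (n - 1) / 2))  # 0.333...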
The `scipy` implementation of this coefficient accounts for ties (multiple entries with the same rank).
Let's look at some common test points for the measure.
In [ ]:
## self-correlation
a = [i for i in range(20)]
scipy.stats.kendalltau(a,a).correlation
In [ ]:
## remember that the rows need not be ordered
# shuffle in place
random.shuffle(a)
scipy.stats.kendalltau(a,a).correlation
In [ ]:
## anti-correlation
a = [i for i in range(20)]
b = list(reversed(a))
scipy.stats.kendalltau(a,b).correlation
In [ ]:
## random case
# correlation will average 0 and get closer to 0 for large lists
a = [i for i in range(1000)]
b = random.sample(a, k=len(a))
scipy.stats.kendalltau(a,b).correlation
In [ ]:
## ties
# scipy implementation uses:
# https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient#Tau-b
a = [i for i in range(10)]
b = [i for i in range(10)]
# the values at indices 0 and 1 of b are now both 1, producing a tie in the ranking
b[0] = 1
print(a)
print(b)
scipy.stats.kendalltau(a,b).correlation
Finally:
Kendall's tau is not defined for non-conjoint lists, meaning that it won't work for most incomplete lists.
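To see the problem concretely, consider two hypothetical, partially overlapping top-3 hashtag lists. `kendalltau` needs a rank for every element in both lists, so we first have to choose a convention for the missing elements, for example restricting to the shared elements (lossy, shown below) or assigning missing elements a sentinel rank (the approach used later in this notebook).
In [ ]:
## hypothetical, partially overlapping top-3 lists
list_1 = ['#mothersday', '#love', '#family']
list_2 = ['#love', '#fathersday', '#family']
# one (lossy) workaround: keep only the shared elements and compare their ranks
shared = [tag for tag in list_1 if tag in list_2]
ranks_1 = [list_1.index(tag) for tag in shared]
ranks_2 = [list_2.index(tag) for tag in shared]
scipy.stats.kendalltau(ranks_1, ranks_2).correlation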
Rank-biased overlap (RBO) is based on the idea that the overlap (or size of the intersection) of two sets is a good, simple starting point for similarity measures. We apply this to ordinal lists by calculating the overlap at varying depths and cleverly aggregating the results. Importantly, this method does not depend on elements being in both lists.
For ordinal lists S and T, the agreement (A) at depth k is given in terms of the overlap (X, the size of the intersection) between S and T at depth k.
$A_{S,T,k} = \frac{X_{S,T,k}}{k}$
The average overlap for $1 \leq k \leq d$ gives a decent similarity measure.
If you make it a weighted average and choose your weights to be elements of a geometric series in a parameter p, you can take $d \to \infty$ and you have a similarity measure r bounded by 0 and 1 and controlled by the single parameter p. Values of p close to 1 spread the emphasis over agreement at a broad range of depths, while smaller values of p concentrate it on the most highly ranked elements.
For truncated lists, you can calculate exactly the minimum and maximum values that r can take on, given the unknown values lost in truncation. This is usually quoted in terms of the minimum (min) and the residual difference between the minimum and maximum (res).
For truncated lists, the base score (r) is a function of the cutoff depth (d) and cannot actually reach 1. We can instead extrapolate from the visible lists and calculate $r_{ext}$ such that it has the range $[0, 1]$.
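To make the definitions above concrete, here is a minimal sketch of the truncated RBO sum for two equal-length lists (assuming no ties and skipping the min/res/extrapolation bookkeeping; the `rbo` package used below handles those details):
In [ ]:
## minimal sketch: truncated RBO sum for two equal-length, tie-free lists
def rbo_sketch(S, T, p=0.9):
    depth = min(len(S), len(T))
    total = 0.0
    for k in range(1, depth + 1):
        # overlap X at depth k: size of the intersection of the depth-k prefixes
        X_k = len(set(S[:k]) & set(T[:k]))
        # agreement A = X/k, weighted by the geometric term p^(k-1)
        total += (X_k / k) * p ** (k - 1)
    # (1 - p) normalizes the infinite geometric series of weights;
    # truncating the sum at depth d gives a lower bound on the full RBO
    return (1 - p) * total

rbo_sketch(['a', 'b', 'c', 'd'], ['a', 'c', 'b', 'd'], p=0.9)
With p = 0.9 and only four elements, most of the weight lies beyond the truncation depth, which is why the extrapolated score ('ext') from the `rbo` package is the quantity we use below.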
In [ ]:
from rbo import rbo
In [ ]:
# elements in the list can be any object
rbo([{'c', 'a'}, 'b', 'd'], ['a', {'c', 'b'}, 'd'], p=.9)
In [ ]:
# self-similarity
a = [i for i in range(20)]
rbo(a,a,p=0.9)
In [ ]:
# order doesn't matter
random.shuffle(a)
rbo(a,a,p=0.9)
In [ ]:
# we are comparing ordered lists of objects, not rankings
a = [i for i in string.punctuation]
rbo(a,a,p=0.9)
In [ ]:
# reversed case
a = [i for i in string.punctuation]
b = list(reversed(a))
rbo(a,b,p=0.9)
In [ ]:
# random comparison
a = [i for i in string.punctuation]
b = random.sample(a, k=len(a))
rbo(a,b,p=0.9)
Lesson: most vocabularies on Twitter (1-grams, 2-grams, hashtags) for a location, date, etc. are sparsely populated, meaning that the rankings are largely non-conjoint. When comparing 2 rankings with the similarity measurements described above, it's hard to know when small similarity differences are due to statistics, platform differences, or real textual differences.
Two paths forward:
In [ ]:
"""
Get Tweets from the Twitter public API
"""
"""
# Get your Twitter API tokens
# this is specific to my computer; modify for yours
my_creds_file = '/Users/jkolb/.twitter_api_creds'
creds = yaml.safe_load(open(my_creds_file))
consumer_key = creds['audience']['consumer_key']
consumer_secret = creds['audience']['consumer_secret']
access_token_key = creds['audience']['token']
access_token_secret = creds['audience']['token_secret']
"""
In [ ]:
"""
api = twitter.Api(consumer_key=consumer_key,
consumer_secret=consumer_secret,
access_token_key=access_token_key,
access_token_secret=access_token_secret
)
mom_tweets = []
for _ in range(20):
mom_tweets.extend( api.GetSearch("mom",count=100) )
time.sleep(1)
dad_tweets = []
for _ in range(20):
dad_tweets.extend( api.GetSearch("dad",count=100) )
time.sleep(1)
mum_tweets = []
for _ in range(20):
mum_tweets.extend( api.GetSearch("mom",count=100) )
time.sleep(1)
"""
In [ ]:
"""
Get Tweets from the Gnip Search API
"""
from search.api import Query
import json
import yaml
creds = yaml.safe_load(open('/Users/jkolb/.creds.yaml'))
# set up a query to the Gnip Search API
q = Query(creds['username'],
          creds['password'],
          creds['search_endpoint'],
          paged=True,
          hard_max=2000,  ## <--- control tweet volume here
          )
# query parameters
start_date = '2017-06-01T00:00'
end_date = '2017-06-03T00:00'
# get the tweet data
rule = 'mom'
rule += ' -is:retweet'
q.execute(rule,start=start_date,end=end_date)
mom_tweets = list(q.get_activity_set())
rule = 'dad'
rule += ' -is:retweet'
q.execute(rule,start=start_date,end=end_date)
dad_tweets = list(q.get_activity_set())
rule = 'mum'
rule += ' -is:retweet'
q.execute(rule,start=start_date,end=end_date)
mum_tweets = list(q.get_activity_set())
In [ ]:
## get tweet bodies
dad_bodies = [tweet['body'] for tweet in dad_tweets]
mom_bodies = [tweet['body'] for tweet in mom_tweets]
mum_bodies = [tweet['body'] for tweet in mum_tweets]
In [ ]:
## create a tweet tokenizer and stopword list
my_additional_stop_words = ['https','rt']
my_additional_stop_words.extend(string.punctuation)
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)
tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=False, reduce_len=True)
In [ ]:
## make vectorizers
dad_ngram_vectorizer = text.CountVectorizer(
    lowercase=True,
    stop_words=stop_words,
    ngram_range=(1, 2),
    tokenizer=tokenizer.tokenize,
    min_df=2,
)
dad_ngram_vectorizer_idf = text.TfidfVectorizer(
    lowercase=True,
    stop_words=stop_words,
    ngram_range=(1, 2),
    tokenizer=tokenizer.tokenize,
    min_df=2,
)
mom_ngram_vectorizer = text.CountVectorizer(
    lowercase=True,
    stop_words=stop_words,
    ngram_range=(1, 2),
    tokenizer=tokenizer.tokenize,
    min_df=2,
)
mom_ngram_vectorizer_idf = text.TfidfVectorizer(
    lowercase=True,
    stop_words=stop_words,
    ngram_range=(1, 2),
    tokenizer=tokenizer.tokenize,
    min_df=2,
)
mum_ngram_vectorizer = text.CountVectorizer(
    lowercase=True,
    stop_words=stop_words,
    ngram_range=(1, 2),
    tokenizer=tokenizer.tokenize,
    min_df=2,
)
mum_ngram_vectorizer_idf = text.TfidfVectorizer(
    lowercase=True,
    stop_words=stop_words,
    ngram_range=(1, 2),
    tokenizer=tokenizer.tokenize,
    min_df=2,
)
In [ ]:
# helper functions
def ngram_freq_from_dtmatrix(dtmatrix, col_names):
    # map each ngram to its total count, summed over all documents
    return dict([(ngram, dtmatrix.getcol(icol).toarray().sum()) for icol, ngram in enumerate(col_names)])

def ranked_tuples_from_ngram_freq(term_freq_dict):
    # return (ngram, count) tuples sorted by count, highest first
    return list(reversed(sorted(term_freq_dict.items(), key=operator.itemgetter(1))))
In [ ]:
## get top ranked ngrams for 'dad' tweets
dad_dtmatrix = dad_ngram_vectorizer.fit_transform(dad_bodies)
dad_ngrams = dad_ngram_vectorizer.get_feature_names()
dad_tf_dict = ngram_freq_from_dtmatrix(dad_dtmatrix,dad_ngrams)
dad_ngrams_ranked = ranked_tuples_from_ngram_freq(dad_tf_dict)
In [ ]:
## get top ranked ngrams for 'mom' tweets
mom_dtmatrix = mom_ngram_vectorizer.fit_transform(mom_bodies)
mom_ngrams = mom_ngram_vectorizer.get_feature_names()
mom_tf_dict = ngram_freq_from_dtmatrix(mom_dtmatrix,mom_ngrams)
mom_ngrams_ranked = ranked_tuples_from_ngram_freq(mom_tf_dict)
In [ ]:
## get top ranked ngrams for 'mum' tweets
mum_dtmatrix = mum_ngram_vectorizer.fit_transform(mum_bodies)
mum_ngrams = mum_ngram_vectorizer.get_feature_names()
mum_tf_dict = ngram_freq_from_dtmatrix(mum_dtmatrix,mum_ngrams)
mum_ngrams_ranked = ranked_tuples_from_ngram_freq(mum_tf_dict)
In [ ]:
# sanity check
dad_ngrams_ranked[:20]
Simply by looking at this list, we can see other avenues for improving ngram extraction.
For now, we won't go down these paths.
In [ ]:
## now let's extract the rankings and compare
# probably want to cut off the rankings somewhere
cutoff = 10000
final_cutoff = 300
# get the (ngram,rank) lists
dad_ngram_ranks = {ngram:rank for rank,(ngram,count) in enumerate(dad_ngrams_ranked[:cutoff])}
mom_ngram_ranks = {ngram:rank for rank,(ngram,count) in enumerate(mom_ngrams_ranked[:cutoff])}
mum_ngram_ranks = {ngram:rank for rank,(ngram,count) in enumerate(mum_ngrams_ranked[:cutoff])}
# get the rank lists
# NB: if cutoff lists are not conjoint (they probably aren't),
# you'll have to choose one list as a reference
dad_ranks = []
mom_ranks = []
mum_ranks = []
data = []
for ngram, mom_rank in mom_ngram_ranks.items():
    try:
        dad_rank = dad_ngram_ranks[ngram]
    except KeyError:
        # for elements not in the 'dad' list, rank them last
        dad_rank = cutoff
    try:
        mum_rank = mum_ngram_ranks[ngram]
    except KeyError:
        # for elements not in the 'mum' list, rank them last
        mum_rank = cutoff
    if mom_rank < final_cutoff:
        dad_ranks.append(dad_rank)
        mom_ranks.append(mom_rank)
        mum_ranks.append(mum_rank)
        data.append((ngram, mom_rank, mum_rank, dad_rank))
In [ ]:
dad_mom_tau = scipy.stats.kendalltau(dad_ranks,mom_ranks).correlation
mum_mom_tau = scipy.stats.kendalltau(mum_ranks,mom_ranks).correlation
print('Tau')
print('cutoff = ' + str(final_cutoff))
print('mom-dad: ' + str(dad_mom_tau))
print('mom-mum: ' + str(mum_mom_tau))
In [ ]:
mom_top_ngrams = [ngram for ngram,ct in mom_ngrams_ranked][:final_cutoff]
mum_top_ngrams = [ngram for ngram,ct in mum_ngrams_ranked][:final_cutoff]
dad_top_ngrams = [ngram for ngram,ct in dad_ngrams_ranked][:final_cutoff]
mum_mom_rbo = rbo(mom_top_ngrams,mum_top_ngrams,p=0.9)['ext']
dad_mom_rbo = rbo(mom_top_ngrams,dad_top_ngrams,p=0.9)['ext']
print('RBO')
print('cutoff = ' + str(final_cutoff))
print('mom-dad: ' + str(dad_mom_rbo))
print('mom-mum: ' + str(mum_mom_rbo))