Estimating the number of classes in a population

A notebook to evalute various estimators of the number of distinct classes in a population.

Intended to supplement the readings On the Estimation of the Number of Classes in a Population by Leo A Goodman and The Population Frequencies of Species and the Estimation of Population Parameters by I. J. Good

Notebook structure

#### Section 1
- 1.1 Overview Goodman's notation & explain the relationship between Goodman's problem statement in 1949 and a Twitter data problem
#### Section 2
- 2.1 Make a test population with class frequencies that follow a) Poisson distribution or b) uniform distribution
- 2.2 Calculate the unbiased estimator suggested in Goodman's paper
- 2.3 Calculate and discuss other (biased) estimators suggested by Goodman
#### Section 3
- 3.1 Load real data from a Twitter dataset (all Tweets corresponding to some PowerTrack query)
- 3.2 Pull a random sample of size n of this data
- 3.3 Calculate the estimators again for Twitter data
- 3.4 Exploration and suggestions for future work

Section 1: Notation & problem statement

Goodman's problem:

The 1949 paper refers to "estimating the number of classes in a population." Imagine a population of a large number (N) of differently colored balls in an urn. Given some random draw of n balls from the urn, from the distribution of those colors estimate the total number of different colors (classes, K) in the urn.

The problem can be extended to social data. Given that there are N total Tweets about cats, select a much smaller (and more manageable) sample of n cat-related Tweets. Call the user who created a Tweet the "class" of that Tweet. Now use the distribution of Tweets per user in your n-sized sample to estimate the total number of unique users Tweeting about cats. I'll continue to refer to the problem this way.

Notation:

N: The total size of the population, or the total number of cat Tweets on Twitter
n: The size of the sample, the number of cat Tweets that we are actually going to inspect
- We can think about n as a vector with K elements, $\vec{n}$ s.t. $n_k$ = the number of elements of class k in the sample, $\sum_{k = 1}^K n_k = n$ (the number of Tweets from user k in the sample)
K: The total number of classes in the population, or the total number of users on Twitter talking about cats
- now this of $\vec{K}$ as the vetor with elements $K_j$ s.t. $K_j$ = the number of classes in the population with j elements (the number of users with j Tweets about cats)
$\mathbf{\vec{x}}$: $\vec{x}$ s.t. $x_i$ = the total number of classes with i elements in the sample. $\sum_{i = 1}^n x_i = $ the total number of classes in the sample; x is the observable statistic that we will use to estimate the number of classes in the population. This is a vector related to how many users have exactly i Tweets in the sample. One way to think about $\vec{x}$ is that we are trying to estimate $x_0$, or the toal number of classes (users) who do not appear in the sample at all.
q: While I don't talk about this here, it is worth noting that Goodman's estimator can only be shown to be unbiased if q < n, where q is the total number of Tweets in the population (not the sample) from the most common class (most active user, if you like). I.e., q is the total number of Tweets generated by @CrazyCatLadyXxxX. In our case, we won't always know q exactly, but we can make an educated guess about whether or not it is smaller than the size of our sample.
S: The estimator for K. This is what we are solving for. $S_j$ is analogous to $K_j$, an estimator of the total number of classes with j elements in the population.

Section 2

2.1: Make the test populations (choose either Uniform or Poisson before proceeding)

Feel free to change "pop_size" and "percent_sample" or mess with the number of classes/ distribution parameters



In [ ]:

    
# Imports
from __future__ import division
import numpy as np
from scipy.misc import comb
from numpy import random as np_random
from collections import Counter
from random import sample, choice
from math import factorial, log, exp
import fileinput
import matplotlib.pyplot as plt
%matplotlib inline



In [ ]:

    
# make a population with classes that are Poisson distributed
l = 10
pop_size = 10000
population = list(np_random.poisson(lam = l, size = pop_size))
percent_sample = 1

print "For a {}% sample of a Poisson({}) population of size {}: \n".format(percent_sample, l, pop_size)



In [ ]:

    
# OR: make a population with classes that are Uniform distributed
pop_size = 10000
population = [choice(range(0,100)) for x in range(0,pop_size)]
percent_sample = 1

print "For a {}% sample of a Uniform (integers 1 through 100) population of size {}: \n".format(percent_sample, l, pop_size)

Find $\vec{x}$, look at the properties of the population and the sample



In [ ]:

    
# generate samples and print out info about the population
print "\n The exact population demographics:"
pop_demographic = sorted(Counter(population).items(), key=lambda x: x[0])
for key, value in pop_demographic:
    print "    {} items of class {}".format(str(value).ljust(5), key)

print "\n The demographics of the hypergeometric sample population:"
exactly_10_percent_sample = sample(population,int(pop_size*percent_sample/100))
n = len(exactly_10_percent_sample)
exactly_10_demographic_sample_only = Counter(exactly_10_percent_sample)
exactly_10_demographic = Counter(exactly_10_percent_sample)
for key in [x[0] for x in pop_demographic]:
    if key not in exactly_10_demographic.keys():
        exactly_10_demographic[key] = 0
exactly_10_demographic = sorted(exactly_10_demographic.items(), key=lambda x: x[0])
for key, value in exactly_10_demographic:
    print "    {} items of class {}".format(str(value).ljust(5), key)
print " If the statistic x_{i} is the number of classes with i elements in the sample:"
x_counts_exact = sorted(Counter([x[1] for x in exactly_10_demographic]).items(), key=lambda x: x[0])
for count, val in x_counts_exact:
    print "    {} classes appear {} times in the sample, x_{} = {}".format(val, count, count, val)
print " So the observable vector x for this sample is " 
print " (recall that the number of classes appearing 0 times in the sample is not observable):"
x_vec = [0 for x in range(0,max([x[0] for x in x_counts_exact]))]
for x in x_counts_exact:
    if x[0] != 0:
        x_vec[x[0]-1] = x[1]
x_padded = np.pad(np.array(x_vec),(0,n-len(x_vec)),mode='constant',constant_values=0)
print x_vec

$\vec{x}$ as a bar chart, a pdf. I've shown $x_0$ here in gray (because we know the exact composition of the population) even though $x_0$ is really unobservable.



In [ ]:

    
fig, ax = plt.subplots()
ax.bar(range(1,len(x_vec) + 1), x_vec, .45)
ax.bar([0.05], [x_counts_exact[0][1]], .45, color = 'gray')
ax.set_xticks([t + .22 for t in range(0,len(x_vec) + 1)])
ax.set_xticklabels(range(0,len(x_vec) + 1))
plt.ylabel("# of classes appearing x times in the sample")
plt.xlabel("x")

2.2: Evaluating estimators

In his paper, Goodman derives the estimator S (I won't go through the entire derivation here), except to note that S is the solution to the system of equations where:

$$ x_i = \sum_{j = 1}^{n} Pr(i \; | \; j,N,n)\,S_j$$

Note that the probabiltiy of having i elements of some class in the sample where that class has j elements in the total population is $$ Pr(i \; | \; j,N,n) = \frac{ {j \choose i} { N - j \choose n - i} }{ {N \choose n}} $$ which can be explained as:

$ { j \choose i} $: from a class with j elements, choose i of them
$ { N-j \choose n-i} $: from all other classes, choose n-i elements to make up an n-element sample
$ { N \choose n} $: thte total number of ways to choose an n-element sample

Then we multiply by $S_j$, the number of classes with j elements to get the expected value for $x_i$



In [ ]:

    
# This would calculate the unbiased estimator by explicitly solving for S in the system of equations.
# We don't do this because integer overflows. Not good. You can try it if you like for small populations/ samples
'''
eq_matrix = np.zeros((n,n))
for i in xrange(0,n+1):
    for j in xrange(0,n+1):
        eq_matrix[i,j] = comb(i,j)*comb((pop_size - j), (n - i))/comb(pop_size, n)

s_vec = np.linalg.solve(eq_matrix,x_padded)
sum(s_vec)
'''

There's a closed form solution for $S_j$, presented in Goodman's work. Once again, I won't go through tthe derivation. I have modified the calculation to use logs to avoid integer overflows, but the calculation is the same.



In [ ]:

    
# calculate the unbiased estimator S
S = 0
for i in xrange(1, n + 1):
    x_i = x_padded[i - 1]
    if x_i != 0:
        sign = (-1)**(i+1)
        log_numerator = sum([log(h) for h in range(pop_size - n, pop_size - n + i - 1 + 1)])
        log_denominator = sum([log(k) for k in range(n - i, n - i + 1)])
        # print x_i + sign*exp(log_numerator/log_denominator)*x_i
        S += x_i + sign*exp(log_numerator/log_denominator)*x_i
print "The estimated number of total classes in the population is {}".format(S)

2.3

As we have just observed, the unbiased estimator can have a very, very large error, to the point where it isn't even necessarily positive.

Becuase of this, Goodman suggests some other biased estimators that have smaller error for some distributions of the population.

$$ S' = N - \frac{{N \choose 2}}{{n \choose 2}} x_2 $$

gives a resonable lower bound when $(N)(x_2) \leq n^2$ (not necessarily the case, see below)



In [ ]:

    
# calculate the biased estimator S'
S_1 = pop_size - ((pop_size)*(pop_size - 1))/((n)*(n - 1))*x_padded[2 - 1]
print "The estimator S' gives {} as the total number of classes in the population".format(S_1)

Or assume that the fraction $\frac{n}{N}$ is equal to the fraction of classes that were sampled $$ S'' = \frac{N}{n} \sum{x} $$

(not useful if we sampled most of the classes)



In [ ]:

    
# calculate S''
S_2 = (pop_size/n) * sum(x_padded)
print "The estimator S'' gives {} as the total number of classes in the population".format(S_2)

Or assume that we sampled most of the classes (this is a hard lower bound for the total number of classes, there cannot be fewer classes than we have already observed):

$$ \sum_{i = 1}^n x_i $$



In [ ]:

    
# calculate S'' (2)
S_3 = sum(x_padded)
print "The estimator S'' gives {} as the total number of classes in the population".format(S_3)

Or, as is suggested in I.J. Good's paper, we can estimate $x_0$ as $x_1$ (assuming that there are the same number of unsampled classes as there are classes sampled exactly once, this holds for populations where there is not a large tail of very rare classes), and then estimate the proportion of classes that do appear in the sample as $\left(1 - \frac{x_1}{n}\right)$ $$ \left(1 - \frac{x_1}{n}\right)^{-1} \sum_{i = 1}^n x_i $$



In [ ]:

    
# From I.J. Good's paper
S_4 = (1/(1 - x_padded[0]/n))*(sum(x_padded))
print "The estimator suggested by I.J. Good gives {} as the number of classes in the population".format(S_4)



In [ ]:

    
print "Reminder: the actual number of classes in the population is {}".format(len(set(population)))

Section 3

3.1: Load real Twitter data



In [ ]:

    
tweet_pop = []
for line in fileinput.FileInput("craft_users_tweeting_soda_usernames.txt"):
    tweet_pop.append(line.strip())
tweet_pop_size = len(tweet_pop)
num_uniq_users_in_pop = len(set(tweet_pop))
tweet_pop_demographic = sorted(Counter(tweet_pop).items(), key=lambda x: x[0])



In [ ]:

    
#print tweet_pop_demographic

3.2: Take a sample

Feel free to change the sample size "tweet_n"



In [ ]:

    
## Sample size, generate a new sample
tweet_n = 3000
sample_tweet_pop = sample(tweet_pop, tweet_n)



In [ ]:

    
sample_tweet_pop_demographic = Counter(sample_tweet_pop)
for key in [x[0] for x in tweet_pop_demographic]:
    if key not in sample_tweet_pop_demographic.keys():
        sample_tweet_pop_demographic[key] = 0
sample_tweet_pop_demographic = sorted(sample_tweet_pop_demographic.items(), key=lambda x: x[0])

print " If the statistic x_{i} is the number of classes with i elements in the sample:"
x_counts_tweets = sorted(Counter([x[1] for x in sample_tweet_pop_demographic]).items(), key=lambda x: x[0])
for count, val in x_counts_tweets:
    print "    {} classes appear {} times in the sample, x_{} = {}".format(val, count, count, val)
print " So the observable vector x for this sample is " 
print " (recall that the number of classes appearing 0 times in the sample is not observable):"
x_tweets_sample = [0 for x in range(0,max([x[0] for x in x_counts_tweets]))]
for x in x_counts_tweets:
    if x[0] != 0:
        x_tweets_sample[x[0]-1] = x[1]
x_tweets = np.pad(np.array(x_tweets_sample),(0,tweet_n-len(x_tweets_sample)),mode='constant',constant_values=0)
print x_tweets_sample



In [ ]:

    
fig, ax = plt.subplots(figsize=(10,6))
ax.bar(range(0,len(x_counts_tweets)), [t[1] for t in x_counts_tweets], .45)
#ax.bar([0.05], [x_counts_exact[0][1]], .45, color = 'gray')
ax.set_xticks([t + .22 for t in range(0,len(x_tweets_sample) + 1)])
ax.set_xticklabels(range(0,len(x_tweets_sample) + 1))
plt.ylabel("# of classes appearing x times in the sample")
plt.xlabel("x (unobservable, because x_0, the number of unobserved classes, is included)")
plt.title = ("Histogram of x")

fig, ax = plt.subplots(figsize=(10,6))
ax.bar(range(0,len(x_tweets_sample)), x_tweets_sample, .45)
#ax.bar([0.05], [x_counts_exact[0][1]], .45, color = 'gray')
ax.set_xticks([t + .22 for t in range(0,len(x_tweets_sample) + 1)])
ax.set_xticklabels(range(1,len(x_tweets_sample) + 1))
plt.ylabel("# of classes appearing x times in the sample")
plt.xlabel("x oberseved")
plt.title = ("Histogram of x")

3.2: Evaluating estimators (same content as 2.2, different data)

In his paper, Goodman derives the estimator S (I won't go through the entire derivation here), except to note that S is the solution to the system of equations where:

$$ x_i = \sum_{j = 1}^{n} Pr(i \; | \; j,N,n)\,S_j$$

$ { j \choose i} $: from a class with j elements, choose i of them
$ { N-j \choose n-i} $: from all other classes, choose n-i elements to make up an n-element sample
$ { N \choose n} $: thte total number of ways to choose an n-element sample

Then we multiply by $S_j$, the number of classes with j elements to get the expected value for $x_i$



In [ ]:

    
# calculate the unbiased estimator S
tweet_S = 0
test = []
for i in xrange(1, tweet_n + 1):
    x_i = x_tweets[i - 1]
    if x_i != 0:
        sign = (-1)**(i+1)
        log_numerator = sum([log(h) for h in range(tweet_pop_size - tweet_n, tweet_pop_size - tweet_n + i - 1 + 1)])
        log_denominator = sum([log(k) for k in range(tweet_n - i, tweet_n - i + 1)])
        test.append(x_i + sign*exp(log_numerator/log_denominator)*x_i)
        tweet_S += x_i + sign*exp(log_numerator/log_denominator)*x_i
print "The estimated number of total classes in the population is {}".format(tweet_S)

$$ S' = N - \frac{{N \choose 2}}{{n \choose 2}} x_2 $$

gives a resonable lower bound when $(N)(x_2) \leq n^2$ (not necessarily the case, see below)



In [ ]:

    
# calculate the biased estimator S'
tweet_S_1 = tweet_pop_size - ((tweet_pop_size)*(tweet_pop_size - 1))/((tweet_n)*(tweet_n - 1))*x_tweets[2 - 1]
print "The estimator S' gives {} as the total number of classes in the population".format(tweet_S_1)

Or assume that the fraction $\frac{n}{N}$ is equal to the fraction of classes that were sampled $$ S'' = \frac{N}{n} \sum{x} $$

(not useful if we sampled most of the classes)



In [ ]:

    
# calculate S''
tweet_S_2 = (tweet_pop_size/tweet_n) * sum(x_tweets)
print "The estimator S'' gives {} as the total number of classes in the population".format(tweet_S_2)

Or assume that we sampled most of the classes (this is a hard lower bound for the total number of classes, there cannot be fewer classes than we have already observed):

$$ \sum_{i = 1}^n x_i $$



In [ ]:

    
# calculate S'' (2)
tweet_S_3 = sum(x_tweets)
print "The estimator S'' gives {} as the total number of classes in the population".format(tweet_S_3)



In [ ]:

    
# From I.J. Good's paper
tweet_S_4 = (1/(1 - x_tweets[0]/tweet_n))*(sum(x_tweets))
print "The estimator suggested by I.J. Good gives {} as the number of classes in the population".format(tweet_S_4)



In [ ]:

    
print "Reminder: the actual number of classes (unique users) in the population is: {}".format(num_uniq_users_in_pop)

3.3: Future work

Disclaimer: it's completely possible that none of this makes any sense. Just poking around.

Try cutting off $\vec{x}$ at the first zero (to remove a long tail) and treating very common classes (outliers) differently from the less common ones.



In [ ]:

    
first_cutoff = np.where(x_tweets<=0)[0][0]
x_tweets_ = x_tweets[0:first_cutoff]
outliers = []
for i,k in enumerate(x_tweets[first_cutoff:]):
    if k != 0:
        outliers.append((i + first_cutoff + 1, k))
# assume that these more common classes were sampled pretty evenly
num_outliers_in_sample = sum([x[0]*x[1] for x in outliers])
num_outliers_in_pop = sum([x[0]*x[1]*tweet_pop_size/tweet_n for x in outliers])
num_outlier_classes = sum([x[1] for x in outliers])

tweet_pop_size_ = int(tweet_pop_size - num_outliers_in_pop)
tweet_n_ = int(tweet_n - num_outliers_in_sample)



In [ ]:

    
# calculate the unbiased estimator S
tweet_S = 0
for i in xrange(1, len(x_tweets_) + 1):
    x_i = x_tweets_[i - 1]
    sign = (-1)**(i+1)
    log_numerator = sum([log(h) for h in range(tweet_pop_size_ - tweet_n_, tweet_pop_size_ - tweet_n_ + i - 1 + 1)])
    log_denominator = sum([log(k) for k in range(tweet_n_ - i, tweet_n_ - i + 1)])
    # print x_i + sign*exp(log_numerator/log_denominator)*x_i
    tweet_S += x_i + sign*exp(log_numerator/log_denominator)*x_i
print "The estimated number of total classes in the population is {}".format(int(tweet_S + num_outlier_classes))

Look at the difference in frequency distribution between classes that do appear in the sample and classes that don't.



In [ ]:

    
# Analysis of the classes that *do not* appear in the sample
users_not_in_sample = set([x[0] for x in sample_tweet_pop_demographic if x[1]==0])
users_not_in_sample_demographic = []
users_in_sample_demographic = []
for u in tweet_pop_demographic:
    if u[0] in users_not_in_sample:
        users_not_in_sample_demographic.append(u)
    else:
        users_in_sample_demographic.append(u)
counts_users_in_sample = Counter([x[1] for x in users_in_sample_demographic]).items()
counts_users_not_in_sample = Counter([x[1] for x in users_not_in_sample_demographic]).items()

x_users_in_sample = [0 for x in range(0,max([x[0] for x in counts_users_in_sample]))]
for x in counts_users_in_sample:
    if x[0] != 0:
        x_users_in_sample[x[0]-1] = x[1]

x_users_not_in_sample = [0 for x in range(0,max([x[0] for x in counts_users_not_in_sample]))]
for x in counts_users_not_in_sample:
    if x[0] != 0:
        x_users_not_in_sample[x[0]-1] = x[1]

print x_users_in_sample
print x_users_not_in_sample



In [ ]:

    
fig, ax = plt.subplots(figsize=(10,6))
ax.bar(range(0,len(x_users_not_in_sample)), x_users_not_in_sample, .45)
ax.set_xticks([t + .22 for t in range(0,len(x_users_not_in_sample) + 1)])
ax.set_xticklabels(range(0,len(x_users_not_in_sample) + 1))
plt.ylabel("# of classes appearing x times in the pop and at least once in the sample")
plt.xlabel("x ")
plt.title = ("Histogram of x")



In [ ]:

    
x_users_not_in_sample[0]/sum(x_users_not_in_sample)



In [ ]:

    
x_users_in_sample[0]/sum(x_users_in_sample)

This seems to give a resonable estimate occasionally

Estimate $x_0$ as an extrapolation of the ratio between $x_1$ and $x_2$ (as opposed to simply estimating $x_0 = x_1$)

$$ \frac{x_1}{x_2} = \frac{x_0}{x_1} $$



In [ ]:

    
print "The total number of classes in the population: {}".format(int((x_tweets[0]*x_tweets[0]/x_tweets[1]) + sum(x_tweets)))



In [ ]:

    
print "Reminder: the actual number of classes (unique users) in the population is: {}".format(num_uniq_users_in_pop)



In [ ]:

    
test_estimator = []
test_estimator_2 = []
test_lower_bound = []
test_upper_bound = []
test_goodman_s1 = []
test_goodman_s2 = []
test_ijgood = []
for test_tweet_n in range(400,int(38000/2),100):
    for t in range(0,10):
        ## Sample size, generate a new sample
        test_sample_tweet_pop = sample(tweet_pop, test_tweet_n)

        test_sample_tweet_pop_demographic = sorted(Counter(test_sample_tweet_pop).items(), key=lambda x: x[0])

        test_x_counts_tweets = sorted(Counter([x[1] for x in test_sample_tweet_pop_demographic]).items(), key=lambda x: x[0])

        test_x_tweets_sample = [0 for x in range(0,max([x[0] for x in test_x_counts_tweets]))]
        for x in test_x_counts_tweets:
            if x[0] != 0:
                test_x_tweets_sample[x[0]-1] = x[1]
        test_x_tweets = np.pad(np.array(test_x_tweets_sample),(0,test_tweet_n-len(test_x_tweets_sample)),mode='constant',constant_values=0)

        sum_test_x_tweets = sum(test_x_tweets)
        
        est_1 = ((test_x_tweets[0]*test_x_tweets[0]/test_x_tweets[1]) + sum_test_x_tweets)
        est_2 = (1/(1 - test_x_tweets[0]/test_tweet_n))*(sum_test_x_tweets)
        est_3 = sum_test_x_tweets*(len(tweet_pop)/test_tweet_n)
        est_4 = len(tweet_pop) - ((len(tweet_pop))*(len(tweet_pop) - 1))/((test_tweet_n)*(test_tweet_n - 1))*test_x_tweets[2 - 1]
        
        test_estimator.append((test_tweet_n, est_1/num_uniq_users_in_pop))
        test_ijgood.append((test_tweet_n, est_2/num_uniq_users_in_pop))
        test_goodman_s2.append((test_tweet_n, est_3/num_uniq_users_in_pop))
        test_goodman_s1.append((test_tweet_n, est_4/num_uniq_users_in_pop))
        test_lower_bound.append((test_tweet_n, sum_test_x_tweets/num_uniq_users_in_pop))
        test_upper_bound.append((test_tweet_n, (len(tweet_pop) - test_tweet_n + sum_test_x_tweets)/num_uniq_users_in_pop))



In [ ]:

    
fig, ax = plt.subplots(figsize=(10,6))
ax.set_ylim([-.5,2.25])
plt.plot([x[0]/len(tweet_pop)*100 for x in test_estimator], [x[1] for x in test_estimator],'r.')
plt.plot([x[0]/len(tweet_pop)*100 for x in test_ijgood], [x[1] for x in test_ijgood],'b.')
plt.plot([x[0]/len(tweet_pop)*100 for x in test_goodman_s2], [x[1] for x in test_goodman_s2],'g.')
plt.plot([x[0]/len(tweet_pop)*100 for x in test_goodman_s1], [x[1] for x in test_goodman_s1],'m.')
plt.plot([x[0]/len(tweet_pop)*100 for x in test_lower_bound], [x[1] for x in test_lower_bound],'k')
plt.plot([x[0]/len(tweet_pop)*100 for x in test_upper_bound], [x[1] for x in test_upper_bound],'k')
plt.plot([x[0]/len(tweet_pop)*100 for x in test_upper_bound], [1 for x in test_upper_bound],'k-')
plt.xlabel("Sample is x% of the total pop")
plt.ylabel("Fraction of the true number of classes")



In [ ]:



In [ ]:

Estimating the number of classes in a population

Notebook structure

Section 1: Notation & problem statement

Goodman's problem:

Social data:

Notation:

Section 2

2.1: Make the test populations (choose either Uniform or Poisson before proceeding)

2.2: Evaluating estimators

2.3

Section 3

3.1: Load real Twitter data

3.2: Take a sample

3.2: Evaluating estimators (same content as 2.2, different data)

3.3: Future work

This seems to give a resonable estimate occasionally