Sampling correction for large choice sets

Sam Maurer, Dec. 1, 2016

  1. Replicate synthetic data from Guevara & Ben-Akiva 2013
  2. Do MNL with and without sampling correction
  3. Check whether parameter estimates deviate from true values
  4. Extend to Mixed Logit

In [ ]:

1. Generate synthetic data set

  • numobs (N) = 1000 observations
  • numalts (J) = 1000 alternatives for all observations (choiceset_n = choiceset, C_n = C)
  • X = single attribute distributed Uniform(-2,1) for the first 500 alternatives and Uniform(-1,2) for the second half
  • beta = generic linear taste coefficient, distributed Normal(mu=1.5, sigma=0.8) across the 1000 observations
  • systematic utility = beta * X
  • epsilon = error term distributed ExtremeValue(0,1)
  • random utility = beta * X + epsilon

Utility of alternative i for agent n: $$ U_{in} = V_{in} + \varepsilon_{in} = \beta_n x_{i} + \varepsilon_{in} $$

Probability that agent n will choose alternative i: $$ L_n(i \mid \beta_n, x_n, C_n) = \frac {e^{V_{in}}} {\sum_{j \in C_n} e^{V_{jn}}} $$
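To make the probability formula concrete, here is a minimal sketch computing MNL choice probabilities for a hypothetical three-alternative choice set (illustrative only, not part of the replication):


In [ ]:
import numpy as np

# Hypothetical example: one agent with beta = 1.5 and three alternatives
beta_ex = 1.5
x_ex = np.array([0.5, -1.0, 1.2])

# Systematic utilities and MNL (softmax) choice probabilities
v = beta_ex * x_ex
probs = np.exp(v) / np.exp(v).sum()

print probs  # probabilities sum to 1; the highest-x alternative is most likely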


In [ ]:


In [43]:
import numpy as np
import pandas as pd

# Set a seed so that the random numbers will be reproducible
np.random.seed(12)

Generate attribute x for each of numalts (J) alternatives


In [134]:
# For now, J << 1000 alternatives to speed up runtimes
numalts = 50

def rand(n, low, high):
    """ Generate `n` random floats uniformly distributed from `low` to `high` """
    return (high - low) * np.random.rand(n) + low

# Attribute x is uniformly distributed over [-2, 1] for half the alternatives
# and over [-1, 2] for the other half, as in Guevara & Ben-Akiva

# X = np.concatenate((rand(numalts/2, -2, 1), rand(numalts/2, -1, 2)))

# Or, attribute x is uniformly distributed over [0, 10] for half the alternatives
# and over [100, 110] for the other half, to induce bias in estimation

X = np.concatenate((rand(numalts/2, 0, 10), rand(numalts/2, 100, 110)))

In [135]:
print pd.DataFrame(X[:numalts/2]).describe()
print pd.DataFrame(X[numalts/2:]).describe()


               0
count  25.000000
mean    5.782730
std     2.903692
min     0.233499
25%     3.436553
50%     6.561771
75%     7.527760
max     9.956289
                0
count   25.000000
mean   105.373296
std      3.202724
min    100.563383
25%    102.109451
50%    105.497276
75%    108.062617
max    109.884905

Generate taste coefficient beta for each of numobs (N) agents


In [57]:
# For regular MNL, use a single beta value instead of the distribution
# that Guevara & Ben-Akiva used for the mixture model

numobs = 1000  # agents/observations

beta = np.zeros(numobs) + 1.5
# beta = 0.8 * np.random.randn(numobs) + 1.5

In [ ]:
print pd.DataFrame(beta).describe()

Simulate a choice from numalts (J) alternatives for each of numobs (N) agents


In [136]:
# Generate a utility matrix for N agents choosing among J alternatives

U = [[beta[n]*x + np.random.gumbel() for x in X] for n in range(numobs)]
   
print len(U)
print len(U[0])


1000
50

In [137]:
# Each agent chooses the alternative with highest utility

choices = [np.argmax(a) for a in U]

print len(choices)
print choices[:10]


1000
[28, 32, 33, 33, 33, 43, 32, 43, 32, 37]
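For larger J, the two simulation steps above can be vectorized with numpy (an optional sketch; it draws new random numbers, so the results are statistically equivalent but not identical):


In [ ]:
# Vectorized version of the utility simulation and choice step
U_vec = np.outer(beta, X) + np.random.gumbel(size=(numobs, numalts))
choices_vec = U_vec.argmax(axis=1)

print U_vec.shape
print choices_vec[:10]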

Now we have data:

  • N agents/observations with true taste coefficients in array "beta"
  • J alternatives with single attributes in array "X"
  • N choice outcomes in array "choices"

In [ ]:


In [ ]:

2. Estimate beta without sampling, using PyLogit MNL


In [49]:
import pylogit
from collections import OrderedDict

In [174]:
# Set up the estimation dataset in long format

d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in range(numalts)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])

In [97]:
print df.head(), '\n'
print df.describe()


   obs_id  alt_id  chosen         x
0       0       0       0  1.699728
1       0       1       0  2.530486
2       0       2       0  7.104747
3       0       3       0  5.721117
4       0       4       0  6.125892 

             obs_id        alt_id        chosen             x
count  50000.000000  50000.000000  50000.000000  50000.000000
mean     499.500000     24.500000      0.020000     55.555265
std      288.677877     14.431014      0.140001     50.140919
min        0.000000      0.000000      0.000000      1.611696
25%      249.750000     12.000000      0.000000      6.127162
50%      499.500000     24.500000      0.000000     54.888537
75%      749.250000     37.000000      0.000000    105.748896
max      999.000000     49.000000      1.000000    109.778827

In [168]:
# Set up reusable model spec

spec = OrderedDict([('x', 'all_same')])
labels = OrderedDict([('x', 'beta_x')])

In [172]:
# Set up reusable code to estimate a model

def estimate_model(init_val):
    """
    Initialize and fit a model, returning it as an object. Will use the 
    current values of `df`, `spec`, and `labels`.
    """
    m = pylogit.create_choice_model(data = df, 
                                    alt_id_col = 'alt_id', 
                                    obs_id_col = 'obs_id', 
                                    choice_col = 'chosen', 
                                    specification = spec, 
                                    model_type = "MNL", 
                                    names = labels)

    m.fit_mle(init_vals = np.array([init_val]))
    return m

In [175]:
%%time
m = estimate_model(init_val = 1.2)
print m.get_statsmodels_summary()


Log-likelihood at zero: -3,912.0230
Initial Log-likelihood: -1,823.3647
Estimation Time: 0.17 seconds.
Final log-likelihood: -1,813.8248
                     Multinomial Logit Model Regression Results                    
===================================================================================
Dep. Variable:                      chosen   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      999
Method:                                MLE   Df Model:                            1
Date:                     Sun, 11 Dec 2016   Pseudo R-squ.:                   0.536
Time:                             19:42:20   Pseudo R-bar-squ.:               0.536
converged:                            True   Log-Likelihood:             -1,813.825
                                             LL-Null:                    -3,912.023
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
beta_x         1.4422      0.060     24.067      0.000         1.325     1.560
==============================================================================
CPU times: user 8.37 s, sys: 15.4 s, total: 23.7 s
Wall time: 15.1 s

This looks good: the estimate of 1.44 is very close to the true beta of 1.5


In [ ]:


In [ ]:

3a. Estimate beta with random sampling of alternatives

This should produce an unbiased estimate of beta


In [ ]:
# In the estimation dataset, for each observation include a row for the
# chosen alternative, plus K-1 other alternatives sampled randomly
# without replacement, where K < J.

# Some more notation:
# - true choice set C = range(J)
# - restricted choice set D_n is a subset of C, where len(D_n) = K

In [154]:
# TO DO - rewrite to use sampling weights

def alts(obs_id, C, K):
    """
    This function generates a restricted choice set D for a particular
    observation. Expects list `C` of alternatives to sample from (either
    the full choice set or a stratrum), int `K` alternatives to sample,
    and list `choices` of the alt_id chosen for each obs_id. Returns list 
    of K alt_id's including the chosen one.
    """
    chosen = choices[obs_id]  # id of chosen alternative
    unchosen = [i for i in C if chosen != i]  # id's of unchosen alts
    sample_unchosen = np.random.choice(unchosen, size=K-1, replace=False).tolist()
    return np.sort([chosen] + sample_unchosen)
    
print alts(0, range(numalts), 5)


[ 3 12 13 28 38]
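One possible way to address the TO DO note above is to pass sampling weights to np.random.choice; a hypothetical sketch, not used in the estimations below:


In [ ]:
def alts_weighted(obs_id, C, K, weights):
    """
    Like `alts`, but samples the K-1 unchosen alternatives with unequal
    probabilities. `weights` holds a sampling weight for each alt_id and
    is renormalized over the unchosen alternatives.
    """
    chosen = choices[obs_id]
    unchosen = [i for i in C if i != chosen]
    w = np.array([weights[i] for i in unchosen], dtype=float)
    sample_unchosen = np.random.choice(unchosen, size=K-1, replace=False,
                                       p=w/w.sum()).tolist()
    return np.sort([chosen] + sample_unchosen)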

In [176]:
# Set up the estimation dataset, which can use the same spec as earlier

C = range(numalts)  # choice set to sample from
K = 10

d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in alts(n, C, K)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])

In [146]:
print df.head(), '\n'
print df.describe()


   obs_id  alt_id  chosen           x
0       0       2       0    8.730367
1       0      11       0    2.855760
2       0      18       0    9.956289
3       0      28       1  108.045363
4       0      30       0  105.386425 

             obs_id       alt_id        chosen             x
count  10000.000000  10000.00000  10000.000000  10000.000000
mean     499.500000     25.51300      0.100000     60.170736
std      288.689425     14.29645      0.300015     50.009368
min        0.000000      0.00000      0.000000      0.233499
25%      249.750000     13.00000      0.000000      6.797005
50%      499.500000     27.00000      0.000000    100.970300
75%      749.250000     37.00000      0.000000    106.140668
max      999.000000     49.00000      1.000000    109.884905

In [177]:
%%time
m = estimate_model(init_val = 1.2)
print m.get_statsmodels_summary()


Log-likelihood at zero: -2,302.5851
Initial Log-likelihood: -585.5314
Estimation Time: 0.01 seconds.
Final log-likelihood: -578.0528
                     Multinomial Logit Model Regression Results                    
===================================================================================
Dep. Variable:                      chosen   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      999
Method:                                MLE   Df Model:                            1
Date:                     Sun, 11 Dec 2016   Pseudo R-squ.:                   0.749
Time:                             19:42:57   Pseudo R-bar-squ.:               0.749
converged:                            True   Log-Likelihood:               -578.053
                                             LL-Null:                    -2,302.585
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
beta_x         1.4855      0.081     18.243      0.000         1.326     1.645
==============================================================================
CPU times: user 594 ms, sys: 380 ms, total: 974 ms
Wall time: 657 ms

Run 100x with different samples of alternatives


In [178]:
%%time
%%capture

beta = []  # re-using the name `beta` for the estimates; the true coefficients are no longer needed
C = range(numalts)
K = 10

for run in range(100):
    d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in alts(n, C, K)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])
    m = estimate_model(init_val = 1.2)
    beta.append(m.params.beta_x)


CPU times: user 1min 33s, sys: 38 s, total: 2min 11s
Wall time: 1min 14s

In [179]:
print pd.Series(beta).describe()


count    100.000000
mean       1.453900
std        0.042923
min        1.371329
25%        1.426357
50%        1.450013
75%        1.484759
max        1.557182
dtype: float64

Looks unbiased, as expected: the mean estimate of 1.45 is close to the true beta of 1.5 and to the full-choice-set estimate of 1.44


In [ ]:


In [ ]:

3b. Estimate beta with over-sampling of irrelevant alternatives

This should produce a biased estimate of beta, until we add a correction to the estimation procedure


In [187]:
# Recall that half the values of x are in the range [0, 10] and half are
# in the range [100, 110]. The taste coefficient is positive, so the first
# set of alternatives is much less relevant than the second set. 

C = range(numalts/2)  # alternatives to sample from
K = 10

d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in alts(n, C, K)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])

In [188]:
print df.head(), '\n'
print df.describe()


   obs_id  alt_id  chosen         x
0       0       1       0  7.427010
1       0       3       0  9.930585
2       0       5       0  3.436553
3       0       9       0  6.558043
4       0      11       0  2.855760 

             obs_id        alt_id        chosen             x
count  10000.000000  10000.000000  10000.000000  10000.000000
mean     499.500000     14.463900      0.100000     16.177211
std      288.689425     10.195033      0.300015     31.248750
min        0.000000      0.000000      0.000000      0.233499
25%      249.750000      6.000000      0.000000      3.436553
50%      499.500000     13.000000      0.000000      6.797005
75%      749.250000     20.000000      0.000000      8.371036
max      999.000000     45.000000      1.000000    109.884905

In [189]:
%%time
m = estimate_model(init_val = 1.5)
print m.get_statsmodels_summary()


Log-likelihood at zero: -2,302.5851
Initial Log-likelihood: 0.0000
Estimation Time: 0.00 seconds.
Final log-likelihood: 0.0000
                     Multinomial Logit Model Regression Results                    
===================================================================================
Dep. Variable:                      chosen   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      999
Method:                                MLE   Df Model:                            1
Date:                     Sun, 11 Dec 2016   Pseudo R-squ.:                   1.000
Time:                             20:00:33   Pseudo R-bar-squ.:               1.000
converged:                            True   Log-Likelihood:                  0.000
                                             LL-Null:                    -2,302.585
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
beta_x         1.5000        nan        nan        nan           nan       nan
==============================================================================
CPU times: user 635 ms, sys: 373 ms, total: 1.01 s
Wall time: 674 ms
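The degenerate output above (final log-likelihood of zero, undefined standard errors) appears to arise because the chosen alternative, which almost always comes from the high-x half, completely dominates a restricted set drawn only from the low-x half. A quick diagnostic sketch for one observation (not part of the original analysis):


In [ ]:
# Probability of the chosen alternative within one restricted set drawn
# from the low-x stratum, evaluated at the initial beta of 1.5
D_0 = alts(0, range(numalts/2), 10)
v = 1.5 * X[D_0]
p_chosen = np.exp(v[D_0 == choices[0]]) / np.exp(v).sum()

print p_chosen  # essentially 1, so the observation contributes ~0 to the log-likelihood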

In [ ]:


In [ ]:

5. MNL with sampling correction

Utility of alternative j: $$ V_{j} = \beta x_{j} $$

With sampling, we have to account for the restricted choice set (from Eq 6 in Guevara & Ben-Akiva 2013):

$$ V_j = \beta x_j + \ln \pi(D \mid j) $$

where $\pi(D \mid j)$ is the conditional probability that we would construct the choice set D given that alternative j was chosen. This term goes into the likelihood function in both the numerator and denominator.

$$ L_n = \frac {\exp(\beta x_i + \ln \pi(D_n \mid i))} {\sum_{j \in D_n} \exp(\beta x_j + \ln \pi(D_n \mid j))} $$
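As a small numeric illustration of this expression (made-up utilities, purely illustrative): when $\pi(D_n \mid j)$ is identical for every $j$, as under simple random sampling, the correction term cancels between the numerator and denominator and the probabilities are unchanged.


In [ ]:
# Illustrative check with made-up utilities: adding the same log-probability
# correction to every alternative leaves the MNL probabilities unchanged
v = np.array([1.0, 2.0, 3.0])   # systematic utilities beta * x_j
log_pi = np.log(0.2)            # same pi(D|j) for every j

p_plain = np.exp(v) / np.exp(v).sum()
p_corrected = np.exp(v + log_pi) / np.exp(v + log_pi).sum()

print np.allclose(p_plain, p_corrected)  # True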

How to calculate pi? From the original formulation of this in McFadden 1978: "Suppose D is comprised of i plus a sample of alternatives from the set C\{i}, obtained by considering each element of this set independently, and including it with probability p. Then, the probability of D will depend solely on the number of elements K it contains."

$$ \pi(D) = p^{K-1} (1 - p)^{J-K} $$

(?? Since we sample without replacement, I think $\pi(D)$ should instead be the reciprocal of the binomial coefficient $\binom{J-1}{K-1}$, i.e. the probability of drawing one particular set of K-1 unchosen alternatives from the J-1 available:)

$$ \pi(D) = {J-1 \choose K-1}^{-1} = \frac {(K-1)!\,(J-K)!} {(J-1)!} $$
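For the values used in this notebook (J = numalts = 50, K = 10), the two formulas give different constants; either way the value is the same for every alternative j. A quick comparison, mirroring the commented-out lines in the estimation cell further below:


In [ ]:
# Compare the two candidate formulas for pi(D), assuming J = numalts and K = 10
K = 10
p = float(K-1) / (numalts-1)

# McFadden's independent-inclusion version
pi_indep = p**(K-1) * (1-p)**(numalts-K)

# Probability of one particular set of K-1 unchosen alternatives, without replacement
pi_noreplace = (float(np.math.factorial(K-1) * np.math.factorial(numalts-K))
                / np.math.factorial(numalts-1))

print np.log(pi_indep), np.log(pi_noreplace)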

In [ ]:
# Add a column in the estimation data for the constant

C = range(numalts)  # random sampling from the full choice set, as in section 3a
K = 10

d = [[n, i, int(choices[n]==i), X[i], 1] for n in range(numobs) for i in alts(n, C, K)]

df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])

print df.head(), '\n'
print df.describe()

In [ ]:
spec2 = OrderedDict([
        ('x', [range(numalts)]),
        ('const', [range(numalts)])
    ])

labels2 = OrderedDict([
        ('x', ['beta_x']),
        ('const', ['constant'])
    ])

In [ ]:
# Try the binomial formula on a small example: with J=3 alternatives and K=2
# sampled, each possible restricted set should have probability 1/2

j = 3
k = 2

fact = np.math.factorial

float(fact(k-1)*fact(j-k))/fact(j-1)

In [ ]:
%%time
m = pylogit.create_choice_model(data = df, 
                                alt_id_col = 'alt_id', 
                                obs_id_col = 'obs_id', 
                                choice_col = 'choice', 
                                specification = spec2, 
                                model_type = "MNL", 
                                names = labels2)

# p = float(K-1)/(numalts-1)
# const = np.log(p**(K-1) * (1-p)**(numalts-K))

const = np.log(float(fact(K-1)*fact(numalts-K))/fact(numalts-1))

# Add an initial value for the constant and constrain it to that
m.fit_mle(init_vals = np.array([0, const]), 
          constrained_pos=[1])

print m.get_statsmodels_summary()

Run 1000x with different samples


In [ ]:
# try binomial formula
const = np.log(float(fact(K-1)*fact(numalts-K))/fact(numalts-1))

def estimate_beta_with_correction():
    d = [[n, i, int(choices[n]==i), X[i], 1] for n in range(numobs) for i in alts(n, C, K)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])
    m = pylogit.create_choice_model(df, 'alt_id', 'obs_id', 'choice', spec2, 'MNL', names=labels2)
    m.fit_mle(init_vals = np.array([0, const]), constrained_pos=[1])
    return m.params.beta_x

In [ ]:
%%time
%%capture

beta = []
for i in range(1000):
    beta.append(estimate_beta_with_correction())

In [ ]:
print pd.Series(beta).describe()

In [ ]: