Sampling correction for large choice sets

Sam Maurer, Nov. 21, 2016 (updated Dec. 6 to remove errors)

  1. Replicate synthetic data from Guevara & Ben-Akiva 2013
  2. Do MNL with and without sampling correction
  3. Check whether parameter estimates deviate from true values
  4. Extend to Mixed Logit

In [ ]:

1. Generate synthetic data set

  • N = 1000 observations
  • J = 1000 alternatives for all observations (C_n = C)
  • X = single attribute distributed Uniform(-2,1) for the first 500 alternatives and Uniform(-1,2) for the second half
  • beta = generic linear taste coefficient, distributed Normal(mu=1.5, sigma=0.8) across the 1000 observations
  • systematic utility = beta * X
  • epsilon = error term distributed ExtremeValue(0,1)
  • random utility = beta * X + epsilon

Utility of alternative i for agent n: $$ U_{in} = V_{in} + \varepsilon_{in} = \beta_n x_{i} + \varepsilon_{in} $$

Probability that agent n will choose alternative i: $$ L_n(i \mid \beta_n, x_n, C_n) = \frac {e^{V_{in}}} {\sum_{j \in C_n} e^{V_{jn}}} $$


In [ ]:


In [2]:
import numpy as np
import pandas as pd

In [187]:
# Generate attribute x for each of J alternatives

# Set a seed for reproducibility
np.random.seed(12)

# Start with J << 1000 to speed up runtimes

J = 50  # alternatives

Xa = 3 * np.random.rand(J/2) - 2  # uniform distribution over [-2, 1]
Xb = 3 * np.random.rand(J/2) - 1  # uniform distribution over [-1, 2]

X = np.concatenate((Xa, Xb))

print len(X)
print X[:5]


50
[-1.53751147  0.22014909 -1.21005495 -0.39878182 -1.95627511]

In [188]:
# Generate taste coefficient beta for each of N agents 

# For regular MNL, I think we need to use a single value, instead of a 
# distribution as Guevara & Ben-Akiva used for the mixture model

N = 1000  # agents/observations

beta = np.zeros(N) + 1.5
# beta = 0.8 * np.random.randn(N) + 1.5

print len(beta)
print beta[:5]


1000
[ 1.5  1.5  1.5  1.5  1.5]

In [ ]:
print pd.DataFrame(beta).describe()

In [190]:
# Generate probability matrix for N agents choosing among J alternatives

def probs(n):
    ''' 
    Return list of J probabilities for agent n
    '''
    b = beta[n]
    exps = [np.exp(b*x) for x in X]
    sum_exps = np.sum(exps)
    return [exp/sum_exps for exp in exps]

P = np.array([probs(n) for n in range(N)])
    
print P.shape


(1000, 50)

In [191]:
# Check that each row sums to 1

print np.sum(P, axis=1)[:10]


[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]

In [192]:
# Simulate a choice from J alternatives for each of N agents

C = [np.random.choice(range(J), p=p) for p in P]

print len(C)
print C[:10]


1000
[12, 41, 37, 5, 30, 27, 8, 35, 33, 6]

Now we have data:

  • N agents/observations with true taste coefficients in array "beta"
  • J alternatives with single attributes in array "X"
  • N choice outcomes in array "C"

In [ ]:

2. Estimate beta without sampling, using PyLogit MNL


In [8]:
import pylogit
from collections import OrderedDict

In [233]:
# Set up an estimation dataset in long format

d = [[n, i, int(C[n]==i), X[i]] for n in range(N) for i in range(J)]

print len(d)


50000

In [234]:
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x'])

print df.head(), '\n'
print df.describe()


   obs_id  alt_id  choice         x
0       0       0       0 -1.537511
1       0       1       0  0.220149
2       0       2       0 -1.210055
3       0       3       0 -0.398782
4       0       4       0 -1.956275 

             obs_id        alt_id        choice             x
count  50000.000000  50000.000000  50000.000000  50000.000000
mean     499.500000     24.500000      0.020000      0.014570
std      288.677877     14.431014      0.140001      1.116965
min        0.000000      0.000000      0.000000     -1.993222
25%      249.750000     12.000000      0.000000     -0.894495
50%      499.500000     24.500000      0.000000      0.220035
75%      749.250000     37.000000      0.000000      0.832675
max      999.000000     49.000000      1.000000      1.985414

In [235]:
# Set up model spec: one generic coefficient for x, shared across all J alternatives

spec = OrderedDict([
        ('x', [range(J)])
    ])

labels = OrderedDict([
        ('x', ['beta_x'])
    ])

In [236]:
%%time
m = pylogit.create_choice_model(data = df, 
                                alt_id_col = 'alt_id', 
                                obs_id_col = 'obs_id', 
                                choice_col = 'choice', 
                                specification = spec, 
                                model_type = "MNL", 
                                names = labels)

m.fit_mle(init_vals = np.array([0]))
print m.get_statsmodels_summary()


Log-likelihood at zero: -3,912.0230
Initial Log-likelihood: -3,912.0230
Estimation Time: 0.07 seconds.
Final log-likelihood: -3,065.1983
                     Multinomial Logit Model Regression Results                    
===================================================================================
Dep. Variable:                      choice   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      999
Method:                                MLE   Df Model:                            1
Date:                     Mon, 21 Nov 2016   Pseudo R-squ.:                   0.216
Time:                             13:52:41   Pseudo R-bar-squ.:               0.216
converged:                            True   Log-Likelihood:             -3,065.198
                                             LL-Null:                    -3,912.023
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
beta_x         1.5324      0.046     33.649      0.000         1.443     1.622
==============================================================================
CPU times: user 7.7 s, sys: 14.1 s, total: 21.8 s
Wall time: 14.4 s

In [ ]:


In [ ]:

3. Estimate beta using UrbanSim MNL, for comparison


In [13]:
from urbansim.models import MNLDiscreteChoiceModel

In [178]:
# Choosers should be a DataFrame of characteristics, with index as identifier

d = [[n, C[n]] for n in range(N)]

choosers = pd.DataFrame(d, columns=['id', 'choice']).set_index('id')

print len(choosers)


1000

In [179]:
# Alternatives should be a DataFrame of characteristics, with index as identifier

d = [[i, X[i]] for i in range(J)]

alts = pd.DataFrame(d, columns=['id', 'x']).set_index('id')

print len(alts)


50

In [180]:
%%time

# It seems like this implementation *requires* us to sample the alternatives, 
# so here I'm estimating the model with J-1 alts

m = MNLDiscreteChoiceModel(model_expression = 'x',
                           sample_size = J-1)

m.fit(choosers = choosers,
      alternatives = alts,
      current_choice = 'choice')

m.report_fit()


Null Log-liklihood: -3891.820
Log-liklihood at convergence: -3077.869
Log-liklihood Ratio: 0.209

+-----------+-------------+------------+---------+
| Component | Coefficient | Std. Error | T-Score |
+-----------+-------------+------------+---------+
| x         |    1.527    |   0.022    |  69.267 |
+-----------+-------------+------------+---------+
CPU times: user 104 ms, sys: 9.03 ms, total: 113 ms
Wall time: 89.4 ms

In [ ]:

4. MNL, sampling alternatives without correction

(NB - with simple random sampling that always includes the chosen alternative, no correction is needed)
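
To see why (a quick derivation, not from the paper): with this sampling scheme, the probability of constructing the sampled set D_n is the same regardless of which of its members is the chosen alternative, so the ln pi(D_n | j) correction term introduced in section 5 below is a constant that cancels out of the choice probability:

$$ L_n = \frac {\exp(\beta x_i + \ln \pi)} {\sum_{j \in D_n} \exp(\beta x_j + \ln \pi)} = \frac {\exp(\beta x_i)} {\sum_{j \in D_n} \exp(\beta x_j)} $$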


In [ ]:
# In the estimation dataset, for each observation include a row for the
# chosen alternative, plus K-1 other alternatives sampled randomly
# without replacement, where K < J.

In [194]:
K = 3

def alts(obs_id):
    """
    Sample alternatives for observation `obs_id`. Expects `J` total
    alts, `K` sampled alts, list `C` of choice outcomes. Returns list 
    of K alt id's including the chosen one.
    """
    chosen = C[obs_id]  # id of chosen alternative
    unchosen = [i for i in range(J) if chosen != i]  # id's of J-1 unchosen alts
    sample_unchosen = np.random.choice(unchosen, size=K-1, replace=False)
    return [chosen] + sample_unchosen.tolist()
    
print alts(0)


[12, 23, 7]

In [195]:
# Set up the estimation dataset

d = [[n, i, int(C[n]==i), X[i]] for n in range(N) for i in alts(n)]

print len(d)


3000

In [196]:
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x'])

print df.head(), '\n'
print df.describe()


   obs_id  alt_id  choice         x
0       0      12       1  0.832675
1       0       3       0 -0.398782
2       0      35       0  1.850941
3       1      41       1  1.985414
4       1      45       0  0.272157 

            obs_id       alt_id       choice            x
count  3000.000000  3000.000000  3000.000000  3000.000000
mean    499.500000    26.898000     0.333333     0.446787
std     288.723115    13.974669     0.471483     1.170677
min       0.000000     0.000000     0.000000    -1.993222
25%     249.750000    14.000000     0.000000    -0.181750
50%     499.500000    29.000000     0.000000     0.413689
75%     749.250000    38.000000     1.000000     1.448505
max     999.000000    49.000000     1.000000     1.985414

In [150]:
# Same model spec as before

spec = OrderedDict([
        ('x', [range(J)])
    ])

labels = OrderedDict([
        ('x', ['beta_x'])
    ])

In [151]:
%%time
m = pylogit.create_choice_model(data = df, 
                                alt_id_col = 'alt_id', 
                                obs_id_col = 'obs_id', 
                                choice_col = 'choice', 
                                specification = spec, 
                                model_type = "MNL", 
                                names = labels)

m.fit_mle(init_vals = np.array([0]))
print m.get_statsmodels_summary()


Log-likelihood at zero: -1,098.6123
Initial Log-likelihood: -1,098.6123
Estimation Time: 0.02 seconds.
Final log-likelihood: -585.7551
                     Multinomial Logit Model Regression Results                    
===================================================================================
Dep. Variable:                      choice   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      999
Method:                                MLE   Df Model:                            1
Date:                     Sun, 20 Nov 2016   Pseudo R-squ.:                   0.467
Time:                             14:37:24   Pseudo R-bar-squ.:               0.466
converged:                            True   Log-Likelihood:               -585.755
                                             LL-Null:                    -1,098.612
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
beta_x         1.6151      0.077     20.888      0.000         1.464     1.767
==============================================================================
CPU times: user 303 ms, sys: 41.3 ms, total: 344 ms
Wall time: 226 ms

Run 1000x with different samples


In [152]:
def estimate_beta():
    d = [[n, i, int(C[n]==i), X[i]] for n in range(N) for i in alts(n)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x'])
    m = pylogit.create_choice_model(df, 'alt_id', 'obs_id', 'choice', spec, 'MNL', names=labels)
    m.fit_mle(init_vals = np.array([0]))
    return m.params.beta_x

In [218]:
%%capture

# Store the estimates in a new list so we don't overwrite the true
# taste coefficients in `beta`
beta_estimates = []
for i in range(1000):
    beta_estimates.append(estimate_beta())

In [219]:
print pd.Series(beta_estimates).describe()


count    1000.000000
mean        1.508913
std         0.052854
min         1.322523
25%         1.471155
50%         1.507724
75%         1.545232
max         1.674443
dtype: float64

In [ ]:


In [ ]:

5. MNL with sampling correction

Utility of alternative j: $$ V_{j} = \beta x_{j} $$

With sampling, we have to account for the restricted choice set (from Eq 6 in Guevara & Ben-Akiva 2013):

$$ V_j = \beta x_j + \ln \pi(D \mid j) $$

Where pi is the conditional probability that we would construct the choice set D given that alternative j was chosen. This goes into the likelihood function in both the numerator and denominator.

$$ L_n = \frac {\exp(\beta x_i + \ln \pi(D_n \mid i))} {\sum_{j \in D_n} \exp(\beta x_j + \ln \pi(D_n \mid j))} $$

How to calculate pi? From the original formulation of this in McFadden 1978: "Suppose D is comprized of i plus a sample of alternatives from the set C\{i}, obtained by considering each element of this set independently, and including it with probability p. Then, the probability of D will depend solely on the number of elements K it contains."

$$ \pi(D) = p^{K-1} (1 - p)^{J-K} $$

(?? Without replacement, I think it should be the reciprocal of the n-choose-k binomial coefficient, with n = J-1 and k = K-1)

$$ \pi(D) = {n \choose k}^{-1} = \frac {(K-1)!\,(J-K)!} {(J-1)!} $$
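
As a quick numeric comparison of the two candidate formulas (a sketch using the J and K values defined above, not an executed cell from the original notebook): either version is a single constant given J and K, so under simple random sampling it is identical for every alternative and observation.

import math
import numpy as np

# Independent-inclusion version (McFadden 1978): each unchosen alternative
# enters the sample with probability p
p = float(K - 1) / (J - 1)
log_pi_indep = (K - 1) * np.log(p) + (J - K) * np.log(1 - p)

# Without-replacement version: each particular set of K-1 unchosen
# alternatives has probability 1 / C(J-1, K-1)
n_sets = math.factorial(J - 1) // (math.factorial(K - 1) * math.factorial(J - K))
log_pi_wor = -np.log(n_sets)

print log_pi_indep, log_pi_wor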

In [197]:
# Add a column in the estimation data for the constant

d = [[n, i, int(C[n]==i), X[i], 1] for n in range(N) for i in alts(n)]

df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])

print df.head(), '\n'
print df.describe()


   obs_id  alt_id  choice         x  const
0       0      12       1  0.832675      1
1       0      24       0 -1.070307      1
2       0       4       0 -1.956275      1
3       1      41       1  1.985414      1
4       1      26       0  0.413689      1 

            obs_id       alt_id       choice            x   const
count  3000.000000  3000.000000  3000.000000  3000.000000  3000.0
mean    499.500000    26.777667     0.333333     0.438763     1.0
std     288.723115    13.883149     0.471483     1.180510     0.0
min       0.000000     0.000000     0.000000    -1.993222     1.0
25%     249.750000    15.000000     0.000000    -0.343887     1.0
50%     499.500000    29.000000     0.000000     0.413689     1.0
75%     749.250000    38.000000     1.000000     1.448505     1.0
max     999.000000    49.000000     1.000000     1.985414     1.0

In [198]:
spec2 = OrderedDict([
        ('x', [range(J)]),
        ('const', [range(J)])
    ])

labels2 = OrderedDict([
        ('x', ['beta_x']),
        ('const', ['constant'])
    ])

In [232]:
# Sanity-check the without-replacement formula with small numbers
# (j=3, k=2 should give 0.5)

j=3
k=2

fact = np.math.factorial

float(fact(k-1)*fact(j-k))/fact(j-1)


Out[232]:
0.5

In [225]:
%%time
m = pylogit.create_choice_model(data = df, 
                                alt_id_col = 'alt_id', 
                                obs_id_col = 'obs_id', 
                                choice_col = 'choice', 
                                specification = spec2, 
                                model_type = "MNL", 
                                names = labels2)

# p = float(K-1)/(J-1)
# const = np.log(p**(K-1) * (1-p)**(J-K))

const = np.log(float(fact(K-1)*fact(J-K))/fact(J-1))

# Add an initial value for the constant and constrain it to that
m.fit_mle(init_vals = np.array([0, const]), 
          constrained_pos=[1])

print m.get_statsmodels_summary()


Log-likelihood at zero: -1,098.6123
Initial Log-likelihood: -1,098.6123
Estimation Time: 0.02 seconds.
Final log-likelihood: -613.3560
                     Multinomial Logit Model Regression Results                    
===================================================================================
Dep. Variable:                      choice   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      998
Method:                                MLE   Df Model:                            2
Date:                     Mon, 21 Nov 2016   Pseudo R-squ.:                   0.442
Time:                             13:47:43   Pseudo R-bar-squ.:               0.440
converged:                            True   Log-Likelihood:               -613.356
                                             LL-Null:                    -1,098.612
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
beta_x         1.5376      0.075     20.586      0.000         1.391     1.684
constant      -7.0699   1.31e+07  -5.39e-07      1.000     -2.57e+07  2.57e+07
==============================================================================
CPU times: user 325 ms, sys: 29.7 ms, total: 355 ms
Wall time: 237 ms

Run 1000x with different samples


In [213]:
# Log of the without-replacement sampling probability (same for every alternative)
const = np.log(float(fact(K-1)*fact(J-K))/fact(J-1))

def estimate_beta_with_correction():
    d = [[n, i, int(C[n]==i), X[i], 1] for n in range(N) for i in alts(n)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])
    m = pylogit.create_choice_model(df, 'alt_id', 'obs_id', 'choice', spec2, 'MNL', names=labels2)
    m.fit_mle(init_vals = np.array([0, const]), constrained_pos=[1])
    return m.params.beta_x

In [216]:
%%capture

beta_estimates = []
for i in range(1000):
    beta_estimates.append(estimate_beta_with_correction())

In [217]:
print pd.Series(beta_estimates).describe()


count    1000.000000
mean        1.513490
std         0.051725
min         1.354507
25%         1.477341
50%         1.512756
75%         1.548081
max         1.736557
dtype: float64

NB - the correction isn't needed for the random sampling case, but the same code can be adapted for stratified or otherwise non-uniform sampling, where the correction does matter (rough sketch below)
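
A rough sketch of what that adaptation might look like (my own extension, not from the notebook or the paper): suppose each unchosen alternative is included in the sample independently with a known probability q_j that varies across alternatives, with the chosen alternative always included. Then the only part of ln pi(D_n | j) that varies across the alternatives in D_n is -ln q_j, so instead of constraining a single constant we would include those values as a correction column and constrain that column's coefficient to 1 in fit_mle.

# Hypothetical inclusion probabilities for each alternative -- purely
# illustrative values standing in for a stratified sampling design
q = np.where(np.arange(J) < J // 2, 0.05, 0.10)

def sample_alts_nonuniform(obs_id):
    """Include each unchosen alternative independently with probability q[i];
    the chosen alternative is always included (so the sample size varies)."""
    chosen = C[obs_id]
    draws = np.random.rand(J) < q
    return [chosen] + [i for i in range(J) if draws[i] and i != chosen]

# The correction term -ln(q_j) differs across alternatives, so it no longer
# cancels out of the likelihood; include it as a column with coefficient
# constrained to 1
d = [[n, i, int(C[n] == i), X[i], -np.log(q[i])]
     for n in range(N) for i in sample_alts_nonuniform(n)]
df_strat = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'correction'])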


In [ ]:


In [ ]:


In [ ]: