Sam Maurer, Dec. 1, 2016
In [ ]:
Utility of alternative i for agent n: $$ U_{in} = V_{in} + \varepsilon_{in} = \beta_n x_{i} + \varepsilon_{in} $$
Probability that agent n will choose alternative i: $$ L_n(i \mid \beta_n, x_n, C_n) = \frac {e^{V_{in}}} {\sum_{j \in C_n} e^{V_{jn}}} $$
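As a quick illustration of that formula (this cell is not part of the original workflow; the helper name and example utilities are made up), the choice probability is just a softmax over the systematic utilities:
In [ ]:
import numpy as np
def mnl_probabilities(v):
    """ MNL choice probabilities: P(i) = exp(v_i) / sum_j exp(v_j) """
    expv = np.exp(v - v.max())  # subtract the max for numerical stability
    return expv / expv.sum()
# e.g. systematic utilities V_jn for one agent facing three alternatives
print mnl_probabilities(np.array([1.0, 0.5, -0.2]))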
In [ ]:
In [43]:
import numpy as np
import pandas as pd
# Set a seed so that the random numbers will be reproducible
np.random.seed(12)
In [134]:
# For now, J << 1000 alternatives to speed up runtimes
numalts = 50
def rand(len, min, max):
    """ Generate `len` random floats uniformly distributed from `min` to `max` """
    return (max - min) * np.random.rand(len) + min
# Attribute x is uniformly distributed over [-2, 1] for half the alternatives
# and over [-1, 2] for the other half, as in Guevara & Ben-Akiva
# X = np.concatenate((rand(numalts/2, -2, 1), rand(numalts/2, -1, 2)))
# Or, attribute x is uniformly distributed over [0, 10] for half the alternatives
# and over [100, 110] for the other half, to induce bias in estimation
X = np.concatenate((rand(numalts/2, 0, 10), rand(numalts/2, 100, 110)))
In [135]:
print pd.DataFrame(X[:numalts/2]).describe()
print pd.DataFrame(X[numalts/2:]).describe()
In [57]:
# For regular MNL, use a single taste coefficient for all agents, rather than
# the distribution that Guevara & Ben-Akiva used for the mixture model
numobs = 1000 # agents/observations
beta = np.zeros(numobs) + 1.5
# beta = 0.8 * np.random.randn(numobs) + 1.5
In [ ]:
print pd.DataFrame(beta).describe()
In [136]:
# Generate a utility matrix for N agents choosing among J alternatives
U = [[beta[n]*x + np.random.gumbel() for x in X] for n in range(numobs)]
print len(U)
print len(U[0])
In [137]:
# Each agent chooses the alternative with highest utility
choices = [np.argmax(a) for a in U]
print len(choices)
print choices[:10]
In [49]:
import pylogit
from collections import OrderedDict
In [174]:
# Set up the estimation dataset in long format
d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in range(numalts)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])
In [97]:
print df.head(), '\n'
print df.describe()
In [168]:
# Set up reusable model spec
spec = OrderedDict([('x', 'all_same')])
labels = OrderedDict([('x', 'beta_x')])
In [172]:
# Set up reusable code to estimate a model
def estimate_model(init_val):
    """
    Initialize and fit a model, returning it as an object. Will use the
    current values of `df`, `spec`, and `labels`.
    """
    m = pylogit.create_choice_model(data = df,
                                    alt_id_col = 'alt_id',
                                    obs_id_col = 'obs_id',
                                    choice_col = 'chosen',
                                    specification = spec,
                                    model_type = "MNL",
                                    names = labels)
    m.fit_mle(init_vals = np.array([init_val]))
    return m
In [175]:
%%time
m = estimate_model(init_val = 1.2)
print m.get_statsmodels_summary()
This looks good: it's very close to the true beta of 1.5
In [ ]:
# In the estimation dataset, for each observation include a row for the
# chosen alternative, plus K-1 other alternatives sampled randomly
# without replacement, where K < J.
# Some more notation:
# - true choice set C = range(J)
# - restricted choice set D_n is a subset of C, where len(D_n) = K
In [154]:
# TO DO - rewrite to use sampling weights
def alts(obs_id, C, K):
    """
    This function generates a restricted choice set D for a particular
    observation. Expects list `C` of alternatives to sample from (either
    the full choice set or a stratum) and int `K` alternatives to sample,
    and uses the global list `choices` of the alt_id chosen for each obs_id.
    Returns a sorted array of K alt_id's including the chosen one.
    """
    chosen = choices[obs_id]  # id of chosen alternative
    unchosen = [i for i in C if chosen != i]  # id's of unchosen alts
    sample_unchosen = np.random.choice(unchosen, size=K-1, replace=False).tolist()
    return np.sort([chosen] + sample_unchosen)
print alts(0, range(numalts), 5)
In [176]:
# Set up the estimation dataset, which can use the same spec as earlier
C = range(numalts) # choice set to sample from
K = 10
d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in alts(n, C, K)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])
In [146]:
print df.head(), '\n'
print df.describe()
In [177]:
%%time
m = estimate_model(init_val = 1.2)
print m.get_statsmodels_summary()
In [178]:
%%time
%%capture
beta = []
C = range(numalts)
K = 10
for trial in range(100):
    d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in alts(n, C, K)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])
    m = estimate_model(init_val = 1.2)
    beta.append(m.params.beta_x)
In [179]:
print pd.Series(beta).describe()
Looks unbiased, as expected. It's very close to the true beta of 1.5
In [187]:
# Recall that half the values of x are in the range [0, 10] and half are
# in the range [100, 110]. The taste coefficient is positive, so the first
# set of alternatives is much less relevant than the second set.
C = range(numalts/2) # alternatives to sample from
K = 10
d = [[n, i, int(choices[n]==i), X[i]] for n in range(numobs) for i in alts(n, C, K)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'chosen', 'x'])
In [188]:
print df.head(), '\n'
print df.describe()
In [189]:
%%time
m = estimate_model(init_val = 1.5)
print m.get_statsmodels_summary()
Utility of alternative j: $$ V_{j} = \beta x_{j} $$
With sampling, we have to account for the restricted choice set (from Eq. 6 in Guevara & Ben-Akiva 2013):
$$ V_j = \beta x_j + \ln \pi(D \mid j) $$
where $\pi(D \mid j)$ is the conditional probability that we would construct the choice set $D$ given that alternative $j$ was chosen. This correction goes into the likelihood function in both the numerator and the denominator:
$$ L_n(i) = \frac {\exp(\beta x_i + \ln \pi(D_n \mid i))} {\sum_{j \in D_n} \exp(\beta x_j + \ln \pi(D_n \mid j))} $$
How to calculate $\pi$? From the original formulation in McFadden 1978: "Suppose D is comprized of i plus a sample of alternatives from the set C\{i}, obtained by considering each element of this set independently, and including it with probability p. Then, the probability of D will depend solely on the number of elements K it contains."
$$ \pi(D) = p^{K-1} (1 - p)^{J-K} $$
(?? Without replacement, I think it should be the reciprocal of the n-choose-k binomial coefficient, with n = J-1 and k = K-1:)
$$ \pi(D) = {J-1 \choose K-1}^{-1} = \frac {(K-1)!\,(J-K)!} {(J-1)!} $$
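As a rough sanity check on those two expressions (this cell is mine, not part of the original workflow), here are the log-correction values for J = 50 and K = 10, the sizes used in the sampling experiments above. Note that neither quantity depends on which alternative j was chosen, which is why the correction can enter the model below as a single constrained constant.
In [ ]:
from math import factorial, log
J, K = 50, 10  # sizes assumed from the sampling experiments above
# Without replacement: pi(D) = 1 / C(J-1, K-1)
log_pi_wor = log(float(factorial(K-1) * factorial(J-K)) / factorial(J-1))
# McFadden's independent-inclusion version, with p = (K-1)/(J-1)
p = float(K-1) / (J-1)
log_pi_wr = (K-1)*log(p) + (J-K)*log(1-p)
print log_pi_wor, log_pi_wr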
In [ ]:
# Add a column in the estimation data for the constant
J = numalts  # total number of alternatives in the true choice set (used below)
d = [[n, i, int(choices[n]==i), X[i], 1] for n in range(numobs) for i in alts(n, C, K)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])
print df.head(), '\n'
print df.describe()
In [ ]:
spec2 = OrderedDict([
    ('x', [range(J)]),
    ('const', [range(J)])
])
labels2 = OrderedDict([
    ('x', ['beta_x']),
    ('const', ['constant'])
])
In [ ]:
# Try binomial formula
j=3
k=2
fact = np.math.factorial
float(fact(k-1)*fact(j-k))/fact(j-1)
In [ ]:
%%time
m = pylogit.create_choice_model(data = df,
                                alt_id_col = 'alt_id',
                                obs_id_col = 'obs_id',
                                choice_col = 'choice',
                                specification = spec2,
                                model_type = "MNL",
                                names = labels2)
# p = float(K-1)/(J-1)
# const = np.log(p**(K-1) * (1-p)**(J-K))
const = np.log(float(fact(K-1)*fact(J-K))/fact(J-1))
# Add an initial value for the constant and constrain it to that
m.fit_mle(init_vals = np.array([0, const]),
          constrained_pos=[1])
print m.get_statsmodels_summary()
In [ ]:
# try binomial formula
const = np.log(float(fact(K-1)*fact(J-K))/fact(J-1))
def estimate_beta_with_correction():
    d = [[n, i, int(choices[n]==i), X[i], 1] for n in range(numobs) for i in alts(n, C, K)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])
    m = pylogit.create_choice_model(df, 'alt_id', 'obs_id', 'choice', spec2, 'MNL', names=labels2)
    m.fit_mle(init_vals = np.array([0, const]), constrained_pos=[1])
    return m.params.beta_x
In [ ]:
%%time
%%capture
beta = []
for i in range(1000):
    beta.append(estimate_beta_with_correction())
In [ ]:
print pd.Series(beta).describe()