Sam Maurer, Nov. 21, 2016 (updated Dec. 6 to remove errors)
In [ ]:
Utility of alternative i for agent n: $$ U_{in} = V_{in} + \varepsilon_{in} = \beta_n x_{i} + \varepsilon_{in} $$
Probability that agent n will choose alternative i: $$ L_n(i \mid \beta_n, x_n, C_n) = \frac{e^{V_{in}}}{\sum_{j \in C_n} e^{V_{jn}}} $$
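For intuition, here's a minimal sketch (made-up numbers, not part of the original analysis) of what the probability formula does for a single agent facing three alternatives with one attribute x and a single taste coefficient beta:

import numpy as np

# Toy example: one agent, three alternatives, one attribute per alternative
beta = 1.5                       # assumed taste coefficient
x = np.array([-0.5, 0.2, 1.0])   # attribute values of the three alternatives

V = beta * x                      # systematic utilities V_j
P = np.exp(V) / np.exp(V).sum()   # logit choice probabilities

print(P)  # sums to 1; the highest-x alternative gets the largest probability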
In [ ]:
In [2]:
import numpy as np
import pandas as pd
In [187]:
# Generate attribute x for each of J alternatives
# Set a seed for reproducibility
np.random.seed(12)
# Start with J << 1000 to speed up runtimes
J = 50 # alternatives
Xa = 3 * np.random.rand(J/2) - 2 # uniform distribution over [-2, 1]
Xb = 3 * np.random.rand(J/2) - 1 # uniform distribution over [-1, 2]
X = np.concatenate((Xa, Xb))
print len(X)
print X[:5]
In [188]:
# Generate taste coefficient beta for each of N agents
# For regular MNL, I think we need to use a single value, instead of a
# distribution as Guevara & Ben-Akiva used for the mixture model
N = 1000 # agents/observations
beta = np.zeros(N) + 1.5
# beta = 0.8 * np.random.randn(N) + 1.5
print len(beta)
print beta[:5]
In [ ]:
print pd.DataFrame(beta).describe()
In [190]:
# Generate probability matrix for N agents choosing among J alternatives
def probs(n):
    '''
    Return list of J probabilities for agent n
    '''
    b = beta[n]
    exps = [np.exp(b*x) for x in X]
    sum_exps = np.sum(exps)
    return [exp/sum_exps for exp in exps]
P = np.array([probs(n) for n in range(N)])
print P.shape
In [191]:
# Check that each row sums to 1
print np.sum(P, axis=1)[:10]
In [192]:
# Simulate a choice from J alternatives for each of N agents
C = [np.random.choice(range(J), p=p) for p in P]
print len(C)
print C[:10]
In [ ]:
In [8]:
import pylogit
from collections import OrderedDict
In [233]:
# Set up an estimation dataset in long format
d = [[n, i, int(C[n]==i), X[i]] for n in range(N) for i in range(J)]
print len(d)
In [234]:
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x'])
print df.head(), '\n'
print df.describe()
In [235]:
# Set up model spec
spec = OrderedDict([
    ('x', [range(J)])
])
labels = OrderedDict([
    ('x', ['beta_x'])
])
In [236]:
%%time
m = pylogit.create_choice_model(data = df,
                                alt_id_col = 'alt_id',
                                obs_id_col = 'obs_id',
                                choice_col = 'choice',
                                specification = spec,
                                model_type = "MNL",
                                names = labels)
m.fit_mle(init_vals = np.array([0]))
print m.get_statsmodels_summary()
In [ ]:
In [ ]:
Model class: https://github.com/UDST/urbansim/blob/master/urbansim/models/dcm.py
Estimation algorithms: https://github.com/UDST/urbansim/blob/master/urbansim/urbanchoice/mnl.py
In [13]:
from urbansim.models import MNLDiscreteChoiceModel
In [178]:
# Choosers should be a DataFrame of characteristics, with index as identifier
d = [[n, C[n]] for n in range(N)]
choosers = pd.DataFrame(d, columns=['id', 'choice']).set_index('id')
print len(choosers)
In [179]:
# Alternatives should be a DataFrame of characteristics, with index as identifier
d = [[i, X[i]] for i in range(J)]
alts = pd.DataFrame(d, columns=['id', 'x']).set_index('id')
print len(alts)
In [180]:
%%time
# It seems like this implementation *requires* us to sample the alternatives,
# so here I'm estimating the model with J-1 alts
m = MNLDiscreteChoiceModel(model_expression = 'x',
                           sample_size = J-1)
m.fit(choosers = choosers,
      alternatives = alts,
      current_choice = 'choice')
m.report_fit()
In [ ]:
In [ ]:
# In the estimation dataset, for each observation include a row for the
# chosen alternative, plus K-1 other alternatives sampled randomly
# without replacement, where K < J.
In [194]:
K = 3
def alts(obs_id):
    """
    Sample alternatives for observation `obs_id`. Expects `J` total
    alts, `K` sampled alts, list `C` of choice outcomes. Returns list
    of K alt id's including the chosen one.
    """
    chosen = C[obs_id]  # id of chosen alternative
    unchosen = [i for i in range(J) if chosen != i]  # id's of J-1 unchosen alts
    sample_unchosen = np.random.choice(unchosen, size=K-1, replace=False)
    return [chosen] + sample_unchosen.tolist()
print alts(0)
In [195]:
# Set up the estimation dataset
d = [[n, i, int(C[n]==i), X[i]] for n in range(N) for i in alts(n)]
print len(d)
In [196]:
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x'])
print df.head(), '\n'
print df.describe()
In [150]:
# Same model spec as before
spec = OrderedDict([
    ('x', [range(J)])
])
labels = OrderedDict([
    ('x', ['beta_x'])
])
In [151]:
%%time
m = pylogit.create_choice_model(data = df,
                                alt_id_col = 'alt_id',
                                obs_id_col = 'obs_id',
                                choice_col = 'choice',
                                specification = spec,
                                model_type = "MNL",
                                names = labels)
m.fit_mle(init_vals = np.array([0]))
print m.get_statsmodels_summary()
In [152]:
def estimate_beta():
    d = [[n, i, int(C[n]==i), X[i]] for n in range(N) for i in alts(n)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x'])
    m = pylogit.create_choice_model(df, 'alt_id', 'obs_id', 'choice', spec, 'MNL', names=labels)
    m.fit_mle(init_vals = np.array([0]))
    return m.params.beta_x
In [218]:
%%capture
beta = []
for i in range(1000):
    beta.append(estimate_beta())
In [219]:
print pd.Series(beta).describe()
In [ ]:
In [ ]:
Utility of alternative j: $$ V_{j} = \beta x_{j} $$
With sampling, we have to account for the restricted choice set (from Eq. 6 in Guevara & Ben-Akiva 2013):
$$ V_j = \beta x_j + \ln \pi(D \mid j) $$
where $\pi$ is the conditional probability that we would construct the choice set $D$ given that alternative $j$ was chosen. This term goes into the likelihood function in both the numerator and the denominator.
$$ L_n = \frac{\exp(\beta x_i + \ln \pi(D_n \mid i))}{\sum_{j \in D_n} \exp(\beta x_j + \ln \pi(D_n \mid j))} $$
How do we calculate $\pi$? From the original formulation in McFadden 1978: "Suppose D is comprised of i plus a sample of alternatives from the set C\{i}, obtained by considering each element of this set independently, and including it with probability p. Then, the probability of D will depend solely on the number of elements K it contains."
$$ \pi(D) = p^{K-1} (1 - p)^{J-K} $$
(Since we sample without replacement, I think it should instead be the reciprocal of the n-choose-k binomial coefficient, where n = J-1 and k = K-1:)
$$ \pi(D) = {J-1 \choose K-1}^{-1} = \frac{(K-1)!\,(J-K)!}{(J-1)!} $$
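As a quick numeric check (my own sketch, not from the notebook), here are the two candidate correction terms for the J = 50, K = 3 setup used here; the variable names are just for illustration:

from math import factorial, log

J, K = 50, 3   # same values as in this notebook

# Independent-inclusion version (McFadden 1978), with p = (K-1)/(J-1)
p = float(K - 1) / (J - 1)
log_pi_indep = log(p**(K - 1) * (1 - p)**(J - K))

# Without-replacement version: pi(D) = 1 / C(J-1, K-1)
log_pi_noreplace = log(float(factorial(K - 1) * factorial(J - K)) / factorial(J - 1))

print(log_pi_indep)
print(log_pi_noreplace)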
In [197]:
# Add a column in the estimation data for the constant
d = [[n, i, int(C[n]==i), X[i], 1] for n in range(N) for i in alts(n)]
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])
print df.head(), '\n'
print df.describe()
In [198]:
spec2 = OrderedDict([
    ('x', [range(J)]),
    ('const', [range(J)])
])
labels2 = OrderedDict([
    ('x', ['beta_x']),
    ('const', ['constant'])
])
In [232]:
# Try binomial formula
j=3
k=2
fact = np.math.factorial
float(fact(k-1)*fact(j-k))/fact(j-1)
Out[232]: 0.5
In [225]:
%%time
m = pylogit.create_choice_model(data = df,
                                alt_id_col = 'alt_id',
                                obs_id_col = 'obs_id',
                                choice_col = 'choice',
                                specification = spec2,
                                model_type = "MNL",
                                names = labels2)
# p = float(K-1)/(J-1)
# const = np.log(p**(K-1) * (1-p)**(J-K))
const = np.log(float(fact(K-1)*fact(J-K))/fact(J-1))
# Add an initial value for the constant and constrain it to that
m.fit_mle(init_vals = np.array([0, const]),
          constrained_pos=[1])
print m.get_statsmodels_summary()
In [213]:
# try binomial formula
const = np.log(float(fact(K-1)*fact(J-K))/fact(J-1))
def estimate_beta_with_correction():
    d = [[n, i, int(C[n]==i), X[i], 1] for n in range(N) for i in alts(n)]
    df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x', 'const'])
    m = pylogit.create_choice_model(df, 'alt_id', 'obs_id', 'choice', spec2, 'MNL', names=labels2)
    m.fit_mle(init_vals = np.array([0, const]), constrained_pos=[1])
    return m.params.beta_x
return m.params.beta_x
In [216]:
%%capture
beta = []
for i in range(1000):
    beta.append(estimate_beta_with_correction())
In [217]:
print pd.Series(beta).describe()
NB - the correction isn't needed for the random sampling case, but we can adapt this code for stratified sampling later on
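For intuition on why the correction drops out under simple random sampling: it adds the same constant to every alternative's utility, and a uniform shift cancels out of the logit probabilities. Here's a minimal sketch with made-up utilities (the softmax helper is just for illustration); only non-uniform sampling probabilities, as in stratified sampling, would actually shift the estimates:

import numpy as np

V = np.array([0.2, -1.0, 0.7])   # made-up systematic utilities for a sampled choice set
const = -7.0                     # a log-probability correction that is the same for every alternative

def softmax(v):
    e = np.exp(v - v.max())      # subtract the max for numerical stability
    return e / e.sum()

print(softmax(V))
print(softmax(V + const))        # identical to the line above: a uniform correction cancels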