Exploring destination choice models

Sam Maurer, June 2017

Python 3.6

Plan

  • Set up a simple MNL destination choice model using the urbansim.urbanchoice interface

  • Refactor the code, using this notebook for ad-hoc testing

  • Set up more complex models as needed


In [1]:
import numpy as np
import pandas as pd

from patsy import dmatrix
from urbansim.urbanchoice import interaction, mnl

from choicemodels import MultinomialLogit

In [2]:
# Suppress deprecation warnings

import warnings; warnings.simplefilter('ignore')

In [ ]:

Load estimation data from disk


In [3]:
# Suppress scientific notation in the Pandas display output

pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [4]:
tracts = pd.read_csv('../data/tracts.csv').set_index('full_tract_id')

print(tracts.shape[0])
print(tracts.head())


1583
                    city  home_density  work_density  school_density
full_tract_id                                                       
6001008309.000   TIJUANA         0.000         0.000           0.000
6001400100.000  BERKELEY        13.438        13.131          13.512
6001400200.000   OAKLAND        11.090         4.249           0.895
6001400300.000   OAKLAND        28.878         7.672           0.000
6001400400.000   OAKLAND        16.885         4.064           8.150

In [5]:
trips = pd.read_csv('../data/trips.csv').set_index('place_id')

print(trips.shape[0])
print(trips.head())


36765
                 full_tract_id  mode  trip_distance_miles
place_id                                                 
10319850102.000 6095252108.000 6.000               13.428
10319850202.000 6095251902.000 5.000                5.126
10335860102.000 6085511915.000 6.000              156.371
10335860103.000 6085512027.000 6.000                1.616
10335860104.000 6085512027.000 6.000                0.376

In [ ]:

MNL destination choice using urbansim.urbanchoice


In [6]:
# - each trip is a realized choice of a particular census tract
# - we can randomly sample alternative census tracts and build a model
#   of destination choice

In [7]:
# `interaction.mnl_interaction_dataset()` is not documented very well, but 
# this is how it seems to work

# Takes following input:
# - choosers: pandas.DataFrame with unique index
# - alternatives: pandas.DataFrame with unique index
# - SAMPLE_SIZE: number of alternatives for each choice scenario
# - chosenalts: list containing the alternative id chosen by each chooser?

# Returns following output:
# - full list of alternatives that were sampled
# - long-format DataFrame merging the two tables
# - numchoosers X SAMPLE_SIZE matrix representing chosen alternatives

Start with a sample of ~500 trips for easier computation


In [6]:
choosers = trips.loc[np.random.choice(trips.index, 500, replace=False)]
choosers = choosers.loc[choosers.trip_distance_miles.notnull()]

print(choosers.shape[0])
print(choosers.head())


490
                 full_tract_id  mode  trip_distance_miles
place_id                                                 
71720050203.000 6055201402.000 6.000                3.080
19678330204.000 6095253404.000 6.000               15.400
30057980204.000 6001408600.000 6.000                7.070
30002610307.000 6001433400.000 5.000                1.371
30208410103.000 6085503601.000 5.000                7.498

Sample 100 alternatives for each and set up a long-format data table


In [7]:
numalts = 100

_, merged, chosen = interaction.mnl_interaction_dataset(
    choosers=choosers, alternatives=tracts, SAMPLE_SIZE=numalts, 
    chosenalts=choosers.full_tract_id)

print(merged.shape[0])
print(chosen.shape)


49000
(490, 100)

Use Patsy to generate the design matrix


In [8]:
model_expression = "home_density + work_density + school_density"

model_design = dmatrix(model_expression, data=merged, return_type='dataframe')

print(model_design.head())


                Intercept  home_density  work_density  school_density
full_tract_id                                                        
6055201402.000      1.000        13.406         1.692           0.000
6013308001.000      1.000         8.448         0.828           2.252
6085500901.000      1.000         6.060        32.747         110.417
6085503712.000      1.000        16.097         6.792           0.000
6097153801.000      1.000        48.146         3.061           8.313

Fit the model using mnl_estimate()


In [9]:
log_likelihoods, fit_parameters = mnl.mnl_estimate(
    model_design.as_matrix(), chosen, numalts=numalts)

print(log_likelihoods)
print(fit_parameters)


{'convergence': -2209.5185606064615, 'null': -2256.5333911341672, 'ratio': 0.02083498108755011}
   Coefficient  Std. Error  T-Score
0       -0.000       0.084   -0.000
1        0.013       0.004    3.049
2        0.012       0.001    9.855
3        0.011       0.005    2.170

In [ ]:

NEW -- Same process in ChoiceModels


In [10]:
from choicemodels import MultinomialLogit
from choicemodels.tools import MergedChoiceTable

In [11]:
# Start with the same sample of trips

print(choosers.shape[0])


490

Merge choosers and alternatives using a new ChoiceModels interface


In [12]:
merged = MergedChoiceTable(observations = choosers, 
                           alternatives = tracts, 
                           chosen_alternatives = choosers.full_tract_id, 
                           sample_size = numalts)

print(type(merged))
print(merged.to_frame().shape[0])


<class 'choicemodels.tools.interaction.MergedChoiceTable'>
49000

Estimate a model using the ChoiceModels engine


In [17]:
%%time
model_expression = "home_density + work_density + school_density - 1"

model = MultinomialLogit(data = merged.to_frame(), 
                         observation_id_col = merged.observation_id_col, 
                         choice_col = merged.choice_col,
                         model_expression = model_expression)

results = model.fit()
print(results)


                  CHOICEMODELS ESTIMATION RESULTS                  
===================================================================
Dep. Var.:                chosen   No. Observations:               
Model:         Multinomial Logit   Df Residuals:                   
Method:       Maximum Likelihood   Df Model:                       
Date:                              Pseudo R-squ.:                  
Time:                              Pseudo R-bar-squ.:              
AIC:                               Log-Likelihood:       -2,206.414
BIC:                               LL-Null:              -2,256.533
===================================================================
                    coef   std err         z     P>|z|   Conf. Int.
-------------------------------------------------------------------
home_density      0.0123     0.003     4.574                       
work_density      0.0128     0.001    10.993                       
school_density    0.0097     0.005     2.018                       
===================================================================
CPU times: user 125 ms, sys: 34.8 ms, total: 160 ms
Wall time: 110 ms

In [14]:
print(type(results))


<class 'choicemodels.mnl.MultinomialLogitResults'>

Estimate a model using the PyLogit engine

Usage is the same, but with an OrderedDict model expression


In [15]:
from collections import OrderedDict

In [16]:
%%time
model_expression = OrderedDict([('home_density', 'all_same'),
                                ('work_density', 'all_same'),
                                ('school_density', 'all_same')])

model = MultinomialLogit(data = merged.to_frame(),
                         observation_id_col = merged.observation_id_col,
                         alternative_id_col = merged.alternative_id_col,
                         choice_col = merged.choice_col,
                         model_expression = model_expression)

results = model.fit()
print(results)


Log-likelihood at zero: -2,256.5334
Initial Log-likelihood: -2,256.5334
Estimation Time: 0.15 seconds.
Final log-likelihood: -2,206.4141
                     Multinomial Logit Model Regression Results                    
===================================================================================
Dep. Variable:                      chosen   No. Observations:                  490
Model:             Multinomial Logit Model   Df Residuals:                      487
Method:                                MLE   Df Model:                            3
Date:                     Tue, 27 Jun 2017   Pseudo R-squ.:                   0.022
Time:                             19:51:07   Pseudo R-bar-squ.:               0.021
converged:                            True   Log-Likelihood:             -2,206.414
                                             LL-Null:                    -2,256.533
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
home_density       0.0123      0.004      2.942      0.003       0.004       0.020
work_density       0.0128      0.001     11.104      0.000       0.011       0.015
school_density     0.0097      0.004      2.191      0.028       0.001       0.018
==================================================================================
CPU times: user 15.5 s, sys: 13.7 s, total: 29.2 s
Wall time: 21 s

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: