In [1]:
from collections import OrderedDict    # For recording the model specification 

import pandas as pd                    # For file input/output
import numpy as np                     # For vectorized math operations

import pylogit as pl                   # For MNL model estimation and for
                                       # converting from wide to long format

1. Load the Swissmetro Dataset


In [2]:
# Load the raw swiss metro data
# Note the .dat files are tab delimited text files
swissmetro_wide = pd.read_table("../data/swissmetro.dat", sep='\t')

2. Clean the dataset

Note that the 01Logit.py file is an example provided with Python Biogeme (see: http://biogeme.epfl.ch/examples_swissmetro.html). See http://www.strc.ch/conferences/2001/bierlaire1.pdf for a detailed explanation of the variables. The 01Logit.py file excludes observations meeting the following criteria:

exclude = (( PURPOSE != 1 ) * (  PURPOSE   !=  3  ) + ( CHOICE == 0 )) > 0
As a result, their dataset has 6,768 observations. Below, I make the same exclusions.
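Biogeme's expression uses `*` and `+` as logical AND/OR on 0/1 indicators, then keeps rows where the result is 0. In pandas the same filter reads more naturally with `&` and `|`. A minimal sketch on toy data (column names mirror the real file) shows the two forms agree:

```python
import pandas as pd

# Toy rows standing in for the real swissmetro file
df = pd.DataFrame({"PURPOSE": [1, 2, 3, 1],
                   "CHOICE":  [1, 1, 0, 0]})

# Biogeme-style exclusion, with * / + read as logical AND / OR
exclude = (((df.PURPOSE != 1) & (df.PURPOSE != 3)) |
           (df.CHOICE == 0))

# Equivalent pandas inclusion filter (as used in the next cell)
include = df.PURPOSE.isin([1, 3]) & (df.CHOICE != 0)
```

Only the first toy row survives the filter: its purpose is 1 and its choice is known.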


In [3]:
# Select observations whose choice is known (i.e. CHOICE != 0)
# **AND** whose PURPOSE is either 1 or 3
include_criteria = (swissmetro_wide.PURPOSE.isin([1, 3]) &
                    (swissmetro_wide.CHOICE != 0))

# Use ".copy()" so that later on, we avoid performing operations 
# on a view of a dataframe as opposed to on an actual dataframe
clean_sm_wide = swissmetro_wide.loc[include_criteria].copy()

# Look at how many observations we have after removing unwanted
# observations
final_num_obs = clean_sm_wide.shape[0]
num_obs_statement = "The cleaned number of observations is {:,.0f}."
print (num_obs_statement.format(final_num_obs))


The cleaned number of observations is 6,768.

3. Create an id column that ignores the repeat observations per individual

In the simple example given on the Python Biogeme website for 01Logit.py, the repeated observations per individual are treated as separate and independent observations. We will do the same here.
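To make the "repeated observations per person" point concrete, here is a minimal sketch on toy data (the `ID` column name mirrors the real file): each person contributes several stated-choice rows, and every row will simply be treated as its own observation.

```python
import pandas as pd

# Toy stand-in for clean_sm_wide: person 1 answered three
# choice scenarios, person 2 answered two
toy = pd.DataFrame({"ID": [1, 1, 1, 2, 2],
                    "CHOICE": [2, 2, 1, 3, 2]})

# Rows per person; the panel structure is visible here...
obs_per_person = toy.groupby("ID").size()

# ...but ignored: the observation count is just the row count
total_obs = toy.shape[0]
```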


In [4]:
# Create a custom id column that ignores the fact that this is a 
# panel/repeated-observations dataset, and start the "custom_id" from 1
clean_sm_wide["custom_id"] = np.arange(clean_sm_wide.shape[0], dtype=int) + 1

4. Convert the data from 'wide' format to 'long' format

4a. Determine the 'type' of each column in the dataset.


In [5]:
# Look at the columns of the swissmetro data
clean_sm_wide.columns


Out[5]:
Index([u'GROUP', u'SURVEY', u'SP', u'ID', u'PURPOSE', u'FIRST', u'TICKET',
       u'WHO', u'LUGGAGE', u'AGE', u'MALE', u'INCOME', u'GA', u'ORIGIN',
       u'DEST', u'TRAIN_AV', u'CAR_AV', u'SM_AV', u'TRAIN_TT', u'TRAIN_CO',
       u'TRAIN_HE', u'SM_TT', u'SM_CO', u'SM_HE', u'SM_SEATS', u'CAR_TT',
       u'CAR_CO', u'CHOICE', u'custom_id'],
      dtype='object')

In [6]:
# Create the list of individual specific variables
ind_variables = clean_sm_wide.columns.tolist()[:15]

# Specify the variables that vary across individuals **AND** 
# across some or all alternatives
alt_varying_variables = {u'travel_time': dict([(1, 'TRAIN_TT'),
                                               (2, 'SM_TT'),
                                               (3, 'CAR_TT')]),
                          u'travel_cost': dict([(1, 'TRAIN_CO'),
                                                (2, 'SM_CO'),
                                                (3, 'CAR_CO')]),
                          u'headway': dict([(1, 'TRAIN_HE'),
                                            (2, 'SM_HE')]),
                          u'seat_configuration': dict([(2, "SM_SEATS")])}

# Specify the availability variables
availability_variables = dict(zip(range(1, 4), ['TRAIN_AV', 'SM_AV', 'CAR_AV']))

# Determine the columns that will denote the
# new column of alternative ids, and the columns
# that denote the custom observation ids and the 
# choice column
new_alt_id = "mode_id"
obs_id_column = "custom_id"
choice_column = "CHOICE"

4b. Actually perform the conversion from wide to long formats


In [7]:
# Perform the desired conversion
long_swiss_metro = pl.convert_wide_to_long(clean_sm_wide, 
                                           ind_variables, 
                                           alt_varying_variables, 
                                           availability_variables, 
                                           obs_id_column, 
                                           choice_column,
                                           new_alt_id_name=new_alt_id)

# Look at the first 9 rows of the long-format dataframe
long_swiss_metro.head(9).T


Out[7]:
                      0    1    2    3    4    5    6    7    8
custom_id             1    1    1    2    2    2    3    3    3
mode_id               1    2    3    1    2    3    1    2    3
CHOICE                0    1    0    0    1    0    0    1    0
GROUP                 2    2    2    2    2    2    2    2    2
SURVEY                0    0    0    0    0    0    0    0    0
SP                    1    1    1    1    1    1    1    1    1
ID                    1    1    1    1    1    1    1    1    1
PURPOSE               1    1    1    1    1    1    1    1    1
FIRST                 0    0    0    0    0    0    0    0    0
TICKET                1    1    1    1    1    1    1    1    1
WHO                   1    1    1    1    1    1    1    1    1
LUGGAGE               0    0    0    0    0    0    0    0    0
AGE                   3    3    3    3    3    3    3    3    3
MALE                  0    0    0    0    0    0    0    0    0
INCOME                2    2    2    2    2    2    2    2    2
GA                    0    0    0    0    0    0    0    0    0
ORIGIN                2    2    2    2    2    2    2    2    2
DEST                  1    1    1    1    1    1    1    1    1
seat_configuration    0    0    0    0    0    0    0    0    0
travel_time         112   63  117  103   60  117  130   67  117
headway             120   20    0   30   10    0   60   30    0
travel_cost          48   52   65   48   49   84   48   58   52
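A useful sanity check on the conversion (sketched here on toy data rather than the real frame): each wide-format row contributes one long-format row per *available* alternative, so the long row count should equal the sum of the availability flags.

```python
import pandas as pd

# Toy wide-format data: two observations, with the car
# unavailable in the second one (column names mirror the
# swissmetro availability columns)
wide = pd.DataFrame({"TRAIN_AV": [1, 1],
                     "SM_AV":    [1, 1],
                     "CAR_AV":   [1, 0]})

# Expected number of rows in the long-format frame:
# 3 alternatives for the first observation + 2 for the second
expected_long_rows = int(wide[["TRAIN_AV", "SM_AV", "CAR_AV"]].sum().sum())
```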

5. Create the variables used in the Python Biogeme Logit Model Example

In 01Logit.py, the travel time and travel cost variables are scaled for ease of numeric optimization. We will do the same such that our estimated coefficients are comparable.


In [8]:
# Scale both the travel time and travel cost by 100
long_swiss_metro["travel_time_hundredth"] = (long_swiss_metro["travel_time"] /
                                             100.0)

# Figure out which rows correspond to train or swiss metro 
# alternatives for individuals with GA passes. These individuals face no 
# marginal costs for a trip
train_pass_train_alt = ((long_swiss_metro["GA"] == 1) &
                        (long_swiss_metro["mode_id"].isin([1, 2]))).astype(int)
# Note that the (train_pass_train_alt == 0) term accounts for the
# fact that those with a GA pass have no marginal cost for the trip
long_swiss_metro["travel_cost_hundredth"] = (long_swiss_metro["travel_cost"] *
                                             (train_pass_train_alt == 0) /
                                             100.0)
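The `cost * (indicator == 0)` trick used above can be seen in isolation in this small sketch (toy values): multiplying by the negated indicator zeroes the cost exactly where the pass applies.

```python
import pandas as pd

# Toy costs and a GA-pass indicator (1 -> pass holder,
# whose marginal trip cost should be zero)
cost = pd.Series([48, 52, 65])
has_pass = pd.Series([1, 0, 0])

# Multiplying by the negated indicator (a boolean Series)
# zeroes the cost for pass holders and keeps it otherwise
adjusted_cost = cost * (has_pass == 0)
```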

6. Specify and Estimate the Python Biogeme Logit Model Example

6a. Specify the Model


In [9]:
# Create the model's specification dictionary and variable names dictionary
# NOTE: - Keys should be variables within the long format dataframe.
#         The sole exception to this is the "intercept" key.
#       - For the specification dictionary, the values should be lists
#         or lists of lists. Within a list, or within the inner-most
#         list should be the alternative ID's of the alternative whose
#         utility specification the explanatory variable is entering.

example_specification = OrderedDict()
example_names = OrderedDict()

# Note that 1 is the id for the Train and 3 is the id for the Car.
# The next two lines are placing alternative specific constants in
# the utility equations for the Train and for the Car. The order
# in which these variables are placed is chosen so the summary
# dataframe which is returned will match that shown in the HTML
# file of the python biogeme example.
example_specification["intercept"] = [3, 1]
example_names["intercept"] = ['ASC Car', 'ASC Train']

# Note that the names used below are simply for consistency with
# the coefficient names given in the Python Biogeme example.

example_specification["travel_cost_hundredth"] = [[1, 2, 3]]
example_names["travel_cost_hundredth"] = ['B_COST']

example_specification["travel_time_hundredth"] = [[1, 2, 3]]
example_names["travel_time_hundredth"] = ['B_TIME']
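As a sketch of the two value forms the specification dictionary accepts (toy keys only, not the full model): a flat list yields one coefficient per listed alternative, while a nested list yields a single coefficient shared by every alternative in the inner list.

```python
from collections import OrderedDict

# Flat list: one coefficient per alternative id (here, ids 3 and 1),
# i.e. two alternative-specific constants.
# Nested list: one shared (generic) coefficient for ids 1, 2, and 3.
spec = OrderedDict()
spec["intercept"] = [3, 1]
spec["travel_time_hundredth"] = [[1, 2, 3]]

# Each top-level entry in a value list corresponds to one estimated
# coefficient, so this toy spec implies three coefficients in total
num_coefficients = sum(len(value) for value in spec.values())
```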

6b. Estimate the model


In [10]:
# Provide the module with the needed input arguments to create
# an instance of the MNL model class
example_mnl = pl.create_choice_model(data=long_swiss_metro,
                                     alt_id_col=new_alt_id,
                                     obs_id_col=obs_id_column,
                                     choice_col=choice_column,
                                     specification=example_specification,
                                     model_type="MNL",
                                     names=example_names)

# Start the model estimation from initial values of all zeros
# i.e. 4 zeros for the 4 coefficients being estimated
example_mnl.fit_mle(np.zeros(4))


Log-likelihood at zero: -6,964.6630
Initial Log-likelihood: -6,964.6630
Estimation Time: 0.05 seconds.
Final log-likelihood: -5,331.2520
/Users/timothyb0912/anaconda/lib/python2.7/site-packages/scipy/optimize/_minimize.py:382: RuntimeWarning: Method BFGS does not use Hessian information (hess).
  RuntimeWarning)

6.c Compare the model output with that of Python Biogeme


In [11]:
# Look at the estimated coefficients and goodness-of-fit statistics
example_mnl.get_statsmodels_summary()


Out[11]:
                 Multinomial Logit Model Regression Results
==============================================================================
Dep. Variable:                  CHOICE   No. Observations:              6,768
Model:         Multinomial Logit Model   Df Residuals:                  6,764
Method:                            MLE   Df Model:                          4
Date:                 Mon, 14 Mar 2016   Pseudo R-squ.:                 0.235
Time:                         18:51:51   Pseudo R-bar-squ.:             0.234
converged:                        True   Log-Likelihood:           -5,331.252
                                         LL-Null:                  -6,964.663
==============================================================================
                 coef    std err          z      P>|z|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
ASC Car       -0.1546      0.043     -3.577      0.000        -0.239    -0.070
ASC Train     -0.7012      0.055    -12.778      0.000        -0.809    -0.594
B_COST        -1.0838      0.052    -20.910      0.000        -1.185    -0.982
B_TIME        -1.2779      0.057    -22.465      0.000        -1.389    -1.166
==============================================================================

In [12]:
# Look at robust p-values in case one wants to see them
example_mnl.summary


Out[12]:
           parameters   std_err     t_stats       p_values  robust_std_err  robust_t_stats  robust_p_values
ASC Car     -0.154632  0.043235   -3.576518   3.482020e-04        0.066932       -2.310308     2.087112e-02
ASC Train   -0.701187  0.054874  -12.778138   2.172063e-37        0.130980       -5.353404     8.631472e-08
B_COST      -1.083791  0.051830  -20.910412   4.305015e-97        0.204736       -5.293592     1.199367e-07
B_TIME      -1.277860  0.056883  -22.464576  9.218676e-112        0.229761       -5.561684     2.671842e-08

Summary

My estimation results match those of Python Biogeme.
The Python Biogeme log-likelihood is -5,331.252 and their estimated parameters are:

ASC Car:    -0.155
ASC Train:  -0.701
B_COST:     -1.08
B_TIME:     -1.28

As shown above, my log-likelihood is -5,331.252, and my estimated parameters are:

ASC Car:    -0.1546
ASC Train:  -0.7012
B_COST:     -1.0838 
B_TIME:     -1.2779
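The agreement claimed above can also be checked mechanically. A sketch comparing the pylogit estimates reported in Out[11] against the rounded values published by Python Biogeme, to the precision Biogeme reports:

```python
import numpy as np

# Estimates reported above (pylogit) and by Python Biogeme (rounded)
pylogit_est = {"ASC Car": -0.1546, "ASC Train": -0.7012,
               "B_COST": -1.0838, "B_TIME": -1.2779}
biogeme_est = {"ASC Car": -0.155, "ASC Train": -0.701,
               "B_COST": -1.08, "B_TIME": -1.28}

# Each pair should agree within half a unit of Biogeme's last
# reported decimal place (here, an absolute tolerance of 5e-3)
all_match = all(np.isclose(pylogit_est[k], biogeme_est[k], atol=5e-3)
                for k in pylogit_est)
```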