Mlogit Benchmark 1

The purpose of this notebook is to:

  1. Demonstrate the use of pyLogit to estimate conditional logit models.
  2. Benchmark the results reported by pyLogit against those reported by the mlogit package.

The models estimated in this notebook will be as follows:

  1. The "Train" model described on page 22 of the mlogit documentation.
     ml.Train <- mlogit(choice ~ price + time + change + comfort | -1, Tr) 
  2. The "Fishing" model described on pages 23-24 of the mlogit documentation.
     ml.Fish <- mlogit(mode ~ price | income | catch, Fishing, shape = "wide", varying = 2:9) 

1. Import Needed libraries


In [1]:
from collections import OrderedDict    # For recording the model specification 

import pandas as pd                    # For file input/output
import numpy as np                     # For vectorized math operations

import pylogit as pl                   # For MNL model estimation and for
                                       # converting from wide to long format

2. Load and look at the required datasets


In [2]:
# Load the Train data, noting that the data is in wide data format
wide_train_df = pd.read_csv("../data/train_data_r.csv")

# Load the Fishing data, noting that the data is in wide data format
wide_fishing_df = pd.read_csv("../data/fishing_data_r.csv")

In [3]:
# Look at the raw Train data
wide_train_df.head().T


Out[3]:
          0        1        2        3        4
id        1        1        1        1        1
choiceid  1        2        3        4        5
choice    choice1  choice1  choice1  choice2  choice2
price1    2400     2400     2400     4000     2400
time1     150      150      115      130      150
change1   0        0        0        0        0
comfort1  1        1        1        1        1
price2    4000     3200     4000     3200     3200
time2     150      130      115      150      150
change2   0        0        0        0        0
comfort2  1        1        0        0        0

In [4]:
# Look at the raw Fishing data
wide_fishing_df.head().T


Out[4]:
               0         1         2         3         4
mode           charter   charter   boat      pier      boat
price.beach    157.93    15.114    161.874   15.134    106.93
price.pier     157.93    15.114    161.874   15.134    106.93
price.boat     157.93    10.534    24.334    55.93     41.514
price.charter  182.93    34.534    59.334    84.93     71.014
catch.beach    0.0678    0.1049    0.5333    0.0678    0.0678
catch.pier     0.0503    0.0451    0.4522    0.0789    0.0503
catch.boat     0.2601    0.1574    0.2413    0.1643    0.1082
catch.charter  0.5391    0.4671    1.0266    0.5391    0.324
income         7083.332  1250      3750      2083.333  4583.332

3. Convert the wide format dataframes to long format

3a. Perform needed data cleaning

The Train dataset is a panel dataset, but our estimated MNL model will not take the panel nature of the data into account, so we need a new id column that identifies each individual choice situation.

In a similar fashion, the Fishing data needs an observation id column, even though it is not a panel dataset. Every dataset used with pyLogit needs an "observation" id column that identifies the unit of observation being modeled.
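The reshaping that pyLogit's `convert_wide_to_long` performs later in this notebook can be illustrated with a pandas-only sketch (the toy data and column names here are my own; pyLogit additionally handles the choice, availability, and observation-specific columns):

```python
import pandas as pd

# Toy wide-format data: one row per choice situation,
# one price column per alternative (made-up numbers)
wide = pd.DataFrame({"obs_id": [1, 2],
                     "price1": [2400, 3200],
                     "price2": [4000, 2400]})

# Long format: one row per (choice situation, alternative) pair
long_df = pd.wide_to_long(wide, stubnames="price",
                          i="obs_id", j="alt_id").reset_index()
print(long_df.sort_values(["obs_id", "alt_id"]))
```

Each wide row becomes one long row per alternative, which is the format the estimation routines below expect.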


In [5]:
# Note that we start the ids for the choice situations at 1.
wide_train_df["choice_situation"] = wide_train_df.index.values + 1

wide_fishing_df["observation"] = wide_fishing_df.index.values + 1

The columns denoting the choice for both the Train and the Fishing data are string objects, so we need to convert the choice columns into integer-based columns.


In [6]:
# Convert the choice column for the Train data into integers
# Note that we will use a 1 to denote 'choice1' and a 2 to 
# represent 'choice2'
wide_train_df["choice"] = wide_train_df["choice"].map({'choice1': 1,
                                                       'choice2': 2})

# Convert the "mode" column for the Fishing data into an
# integer based column. Use the following mapping:
mode_name_to_id = dict(zip(["beach", "pier", "boat", "charter"],
                           range(1, 5)))
wide_fishing_df["mode"] = wide_fishing_df["mode"].map(mode_name_to_id)

For both the Train and the Fishing data, all of the alternatives are available in every choice situation. Note that, in general, this is not the case for choice data. We therefore need columns that denote the availability of each alternative for each individual.

These columns will all be filled with ones for each row in the wide format dataframes because all of the alternatives are always available for each individual.


In [7]:
# Create the needed availability columns for the Train data
# where each choice is a binary decision
for i in [1, 2]:
    wide_train_df["availability_{}".format(i)] = 1
    
# Create the needed availability columns for the Fishing data
# where each choice has four available alternatives
for i in range(1, 5):
    wide_fishing_df["availability_{}".format(i)] = 1

3b. Convert the Train dataset to long format


In [8]:
# Look at the columns that we need to account for when converting from
# the wide data format to the long data format.
wide_train_df.columns


Out[8]:
Index([u'id', u'choiceid', u'choice', u'price1', u'time1', u'change1',
       u'comfort1', u'price2', u'time2', u'change2', u'comfort2',
       u'choice_situation', u'availability_1', u'availability_2'],
      dtype='object')

In [9]:
##########
# Define lists of the variables pertaining to each variable type
# that we need to account for in the data format transformation
##########
# Determine the name for the alternative ids in the long format 
# data frame
train_alt_id = "alt_id"
# Determine the column that denotes the id of what we're treating
# as individual observations, i.e. the choice situations.
train_obs_id_col = "choice_situation"
# Determine what column denotes the choice that was made
train_choice_column = "choice"


# Create the list of observation specific variables
train_ind_variables = ["id", "choiceid"]

# Specify the variables that vary across individuals and some or all alternatives.
# Note that each "main" key should be the desired name of the column in the long
# data format. The inner keys should be the alternative ids that have some
# value for the "main" key variable.
train_alt_varying_variables = {"price": {1: "price1",
                                         2: "price2"},
                               "time": {1: "time1",
                                        2: "time2"},
                               "change": {1: "change1",
                                          2: "change2"},
                               "comfort": {1: "comfort1",
                                           2: "comfort2"}
                               }

# Specify the availability variables
train_availability_variables = OrderedDict()
for alt_id, var in zip([1, 2], ["availability_1", "availability_2"]):
    train_availability_variables[alt_id] = var

In [10]:
##########
# Actually perform the conversion to long format
##########
long_train_df = pl.convert_wide_to_long(wide_data=wide_train_df,
                                        ind_vars=train_ind_variables,
                                        alt_specific_vars=train_alt_varying_variables,
                                        availability_vars=train_availability_variables,
                                        obs_id_col=train_obs_id_col,
                                        choice_col=train_choice_column,
                                        new_alt_id_name=train_alt_id)

# Look at the long format train data
long_train_df.head()


Out[10]:
   choice_situation  alt_id  choice  id  choiceid  price  comfort  change  time
0                 1       1       1   1         1   2400        1       0   150
1                 1       2       0   1         1   4000        1       0   150
2                 2       1       1   1         2   2400        1       0   150
3                 2       2       0   1         2   3200        1       0   130
4                 3       1       1   1         3   2400        1       0   115

3c. Convert the Fishing data to long format


In [11]:
# Look at the columns that we need to account for when converting from
# the wide data format to the long data format.
wide_fishing_df.columns


Out[11]:
Index([u'mode', u'price.beach', u'price.pier', u'price.boat', u'price.charter',
       u'catch.beach', u'catch.pier', u'catch.boat', u'catch.charter',
       u'income', u'observation', u'availability_1', u'availability_2',
       u'availability_3', u'availability_4'],
      dtype='object')

In [12]:
##########
# Define lists of the variables pertaining to each variable type
# that we need to account for in the data format transformation
##########
# Determine the name for the alternative ids in the long format 
# data frame
fishing_alt_id = "alt_id"
# Determine the column that denotes the id of what we're treating
# as individual observations, i.e. the choice situations.
fishing_obs_id_col = "observation"
# Determine what column denotes the choice that was made
fishing_choice_column = "mode"


# Create the list of observation specific variables
fishing_ind_variables = ["income"]

# Specify the variables that vary across individuals and some or all alternatives.
# Note that each "main" key should be the desired name of the column in the long
# data format. The inner keys should be the alternative ids that have some
# value for the "main" key variable.
fishing_alt_varying_variables = {"price": {1: "price.beach",
                                           2: "price.pier",
                                           3: "price.boat",
                                           4: "price.charter"},
                                 "catch": {1: "catch.beach",
                                           2: "catch.pier",
                                           3: "catch.boat",
                                           4: "catch.charter"},
                                 }

# Specify the availability variables
fishing_availability_variables = OrderedDict()
for alt_id, var in zip(range(1, 5), ["availability_{}".format(x) for x in range(1, 5)]):
    fishing_availability_variables[alt_id] = var

In [13]:
##########
# Actually perform the conversion to long format
##########
long_fishing_df = pl.convert_wide_to_long(wide_data=wide_fishing_df,
                                          ind_vars=fishing_ind_variables,
                                          alt_specific_vars=fishing_alt_varying_variables,
                                          availability_vars=fishing_availability_variables,
                                          obs_id_col=fishing_obs_id_col,
                                          choice_col=fishing_choice_column,
                                          new_alt_id_name=fishing_alt_id)

# Look at the long format fishing data
long_fishing_df.head()


Out[13]:
   observation  alt_id  mode     income   catch    price
0            1       1     0  7083.3317  0.0678  157.930
1            1       2     0  7083.3317  0.0503  157.930
2            1       3     0  7083.3317  0.2601  157.930
3            1       4     1  7083.3317  0.5391  182.930
4            2       1     0  1249.9998  0.1049   15.114

4. Create desired variables


In [14]:
# For the Train data, scale the price and time variables so the units
# are meaningful, namely hours and euros.
long_train_df["price_euros"] = long_train_df["price"] / 100.0 * 2.20371
long_train_df["time_hours"] = long_train_df["time"] / 60.0

For numerical stability, it is advisable to scale one's variables so that the estimated coefficients are similar in absolute magnitude and, if possible, close to 1 in absolute value (in other words, neither terribly tiny nor extremely large). This is done for the Fishing data below.


In [15]:
# Scale the income data
long_fishing_df["income_thousandth"] = long_fishing_df["income"] / 1000.0
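Scaling does not change the model, only the units of the coefficients: dividing a variable by c multiplies its estimated coefficient by c. A quick arithmetic check (a sketch using the income estimates reported in the estimation summaries later in this notebook):

```python
# Coefficients on raw income (from mlogit's summary below) and on
# income / 1000 (from pyLogit's summary below). Dividing the variable
# by 1000 multiplies its estimated coefficient by 1000; nothing else changes.
raw_income_betas = {"pier": -1.3550e-04, "boat": 5.5428e-05, "charter": -7.2337e-05}
scaled_income_betas = {"pier": -0.1355, "boat": 0.0554, "charter": -0.0723}

for alt in raw_income_betas:
    assert abs(raw_income_betas[alt] * 1000.0 - scaled_income_betas[alt]) < 1e-3
```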

5. Specify and estimate the desired models needed for benchmarking

5a. Train Model

Note that this dataset is a stated-preference dataset with unlabeled alternatives. Because the unobserved factors affecting a person's choice are assumed to be the same for both alternatives, the means of the error terms are expected to be equal, so the two alternative specific constants (ASCs) would be identical and their difference would be zero. Because of this, no ASCs are estimated for the Train model.
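To see why only utility differences matter, here is a minimal numpy sketch (the `logit_probs` helper is mine, not part of pyLogit): adding the same constant to every alternative's systematic utility leaves the logit choice probabilities unchanged, so a full set of ASCs is not identified.

```python
import numpy as np

def logit_probs(utilities):
    """Softmax of the systematic utilities for one choice situation."""
    exp_v = np.exp(utilities - utilities.max())  # subtract max for stability
    return exp_v / exp_v.sum()

v = np.array([1.2, -0.4])   # hypothetical utilities for the two train options
shifted = v + 5.0           # add the same constant (an ASC) to both

print(logit_probs(v))       # identical probabilities
print(logit_probs(shifted))
```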


In [16]:
# Look at the columns available for use in specifying the model
long_train_df.columns


Out[16]:
Index([u'choice_situation', u'alt_id', u'choice', u'id', u'choiceid', u'price',
       u'comfort', u'change', u'time', u'price_euros', u'time_hours'],
      dtype='object')

In [17]:
# Create the model specification
train_spec = OrderedDict()
train_names = OrderedDict()

# Note that for the specification dictionary, the
# keys should be the column names from the long format
# dataframe and the values should be a list with a combination
# of alternative id's and/or lists of alternative id's. There 
# should be one element for each beta that will be estimated 
# in relation to the given column. Lists of alternative id's
# mean that all of the alternatives in the list will get a
# single beta for them, for the given variable.
# The names dictionary should contain one name for each
# element (that is each alternative id or list of alternative 
# ids) in the specification dictionary value for the same 
# variable
for col, display_name in [("price_euros", "price"), 
                          ("time_hours", "time"), 
                          ("change", "change"), 
                          ("comfort", "comfort")]:
    train_spec[col] = [[1, 2]]
    train_names[col] = [display_name]

In [18]:
# Create an instance of the MNL model class
train_model = pl.create_choice_model(data=long_train_df,
                                     alt_id_col=train_alt_id,
                                     obs_id_col=train_obs_id_col,
                                     choice_col=train_choice_column,
                                     specification=train_spec,
                                     model_type="MNL",
                                     names=train_names)

# Estimate the given model, starting from a point of all zeros
# as the initial values.
train_model.fit_mle(np.zeros(4))

# Look at the estimation summaries
train_model.get_statsmodels_summary()


Log-likelihood at zero: -2,030.2281
Initial Log-likelihood: -2,030.2281
Estimation Time: 0.02 seconds.
Final log-likelihood: -1,724.1500
/Users/timothyb0912/anaconda/lib/python2.7/site-packages/scipy/optimize/_minimize.py:382: RuntimeWarning: Method BFGS does not use Hessian information (hess).
  RuntimeWarning)
Out[18]:
Multinomial Logit Model Regression Results
Dep. Variable: choice No. Observations: 2,929
Model: Multinomial Logit Model Df Residuals: 2,925
Method: MLE Df Model: 4
Date: Mon, 14 Mar 2016 Pseudo R-squ.: 0.151
Time: 18:50:31 Pseudo R-bar-squ.: 0.149
converged: True Log-Likelihood: -1,724.150
LL-Null: -2,030.228
coef std err z P>|z| [95.0% Conf. Int.]
price -0.0674 0.003 -19.851 0.000 -0.074 -0.061
time -1.7206 0.160 -10.730 0.000 -2.035 -1.406
change -0.3263 0.059 -5.486 0.000 -0.443 -0.210
comfort -0.9457 0.065 -14.562 0.000 -1.073 -0.818

Look at the corresponding results from mlogit:

Call:
mlogit(formula = choice ~ price + time + change + comfort | -1,
    data = Tr, method = "nr", print.level = 0)
Frequencies of alternatives:
      1       2
0.50324 0.49676
nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.00014
successive function values within tolerance limits
Coefficients :
          Estimate Std. Error  t-value  Pr(>|t|)
price   -0.0673580  0.0033933 -19.8506 < 2.2e-16 ***
time    -1.7205514  0.1603517 -10.7299 < 2.2e-16 ***
change  -0.3263409  0.0594892  -5.4857 4.118e-08 ***
comfort -0.9457256  0.0649455 -14.5618 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -1724.2

In terms of differences between mlogit's estimation output and my own, the differences seem mainly to be with the p-values. My p-values are calculated with respect to an asymptotic normal distribution, whereas mlogit's p-values are based on the t-distribution. This accounts for the p-value difference.

There is a very slight difference between mlogit's value for the time parameter and my own time parameter estimate, but this may simply be due to the convergence criteria that the two packages use.
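The two p-value conventions can be compared directly (a scipy sketch using the "change" statistic from the summaries above; note that with roughly 2,925 residual degrees of freedom the t-distribution is nearly normal, so the gap is small at this sample size):

```python
from scipy import stats

z = 5.4857       # |test statistic| for the "change" coefficient above
df = 2929 - 4    # residual degrees of freedom from the Train summary

# Two-sided p-value under each convention; both are ~4e-8 here,
# with the t-based value only slightly larger than the normal-based one.
p_normal = 2 * stats.norm.sf(z)  # asymptotic normal (pyLogit)
p_t = 2 * stats.t.sf(z, df)      # t-distribution (mlogit)

print(p_normal)
print(p_t)
```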

5b. Fishing model


In [19]:
# Look at the columns available for use in specifying the model
long_fishing_df.columns


Out[19]:
Index([u'observation', u'alt_id', u'mode', u'income', u'catch', u'price',
       u'income_thousandth'],
      dtype='object')

In [20]:
# Create the model specification
fishing_spec = OrderedDict()
fishing_names = OrderedDict()

# Note that for the specification dictionary, the
# keys should be the column names from the long format
# dataframe and the values should be a list with a combination
# of alternative id's and/or lists of alternative id's. There 
# should be one element for each beta that will be estimated 
# in relation to the given column. Lists of alternative id's
# mean that all of the alternatives in the list will get a
# single beta for them, for the given variable.
# The names dictionary should contain one name for each
# element (that is each alternative id or list of alternative 
# ids) in the specification dictionary value for the same 
# variable

# Note the intercept for beach is constrained to zero for identification
fishing_spec["intercept"] = range(2, 5)
fishing_names["intercept"] = ["ASC: pier",
                              "ASC: boat",
                              "ASC: charter"]

fishing_spec["price"] = [[1, 2, 3, 4]]
fishing_names["price"] = ["price"]

# Note the income coefficient for beach is constrained to zero for identification
# Note also that we use the scaled variables because they numerically perform better

# fishing_spec["income"] = range(2, 5)
# fishing_names["income"] = ["income_{}".format(x) 
#                            for x in ["pier", "boat", "charter"]]

fishing_spec["income_thousandth"] = range(2, 5)
fishing_names["income_thousandth"] = ["income_{} / 1000".format(x) 
                                      for x in ["pier", "boat", "charter"]]

fishing_spec["catch"] = range(1, 5)
fishing_names["catch"] = ["catch_{}".format(x) for x in
                          ["beach", "pier", "boat", "charter"]]

In [21]:
# Create an instance of the MNL model class
fishing_model = pl.create_choice_model(data=long_fishing_df,
                                       alt_id_col=fishing_alt_id,
                                       obs_id_col=fishing_obs_id_col,
                                       choice_col=fishing_choice_column,
                                       specification=fishing_spec,
                                       model_type="MNL",
                                       names=fishing_names)

# Estimate the given model, starting from a point of all zeros
# as the initial values.
fishing_model.fit_mle(np.zeros(11))

# Look at the estimation summaries
fishing_model.get_statsmodels_summary()


Log-likelihood at zero: -1,638.5999
Initial Log-likelihood: -1,638.5999
Estimation Time: 0.02 seconds.
Final log-likelihood: -1,199.1434
Out[21]:
Multinomial Logit Model Regression Results
Dep. Variable: mode No. Observations: 1,182
Model: Multinomial Logit Model Df Residuals: 1,171
Method: MLE Df Model: 11
Date: Mon, 14 Mar 2016 Pseudo R-squ.: 0.268
Time: 18:50:31 Pseudo R-bar-squ.: 0.261
converged: False Log-Likelihood: -1,199.143
LL-Null: -1,638.600
coef std err z P>|z| [95.0% Conf. Int.]
ASC: pier 1.0430 0.295 3.531 0.000 0.464 1.622
ASC: boat 0.8418 0.300 2.807 0.005 0.254 1.430
ASC: charter 2.1549 0.297 7.244 0.000 1.572 2.738
price -0.0253 0.002 -14.405 0.000 -0.029 -0.022
income_pier / 1000 -0.1355 0.051 -2.648 0.008 -0.236 -0.035
income_boat / 1000 0.0554 0.052 1.063 0.288 -0.047 0.158
income_charter / 1000 -0.0723 0.053 -1.376 0.169 -0.175 0.031
catch_beach 3.1177 0.713 4.372 0.000 1.720 4.515
catch_pier 2.8512 0.775 3.681 0.000 1.333 4.369
catch_boat 2.5425 0.523 4.864 0.000 1.518 3.567
catch_charter 0.7595 0.154 4.925 0.000 0.457 1.062

Look at the results from mlogit:

Call:
mlogit(formula = mode ~ price | income | catch, data = Fishing, 
    shape = "wide", varying = 2:9, method = "nr", print.level = 0)

Frequencies of alternatives:
  beach    boat charter    pier 
0.11337 0.35364 0.38240 0.15059 

nr method
7 iterations, 0h:0m:0s 
g'(-H)^-1g = 2.54E-05 
successive function values within tolerance limits 

Coefficients :
                       Estimate  Std. Error  t-value  Pr(>|t|)    
boat:(intercept)     8.4184e-01  2.9996e-01   2.8065 0.0050080 ** 
charter:(intercept)  2.1549e+00  2.9746e-01   7.2443 4.348e-13 ***
pier:(intercept)     1.0430e+00  2.9535e-01   3.5315 0.0004132 ***
price               -2.5281e-02  1.7551e-03 -14.4046 < 2.2e-16 ***
boat:income          5.5428e-05  5.2130e-05   1.0633 0.2876612    
charter:income      -7.2337e-05  5.2557e-05  -1.3764 0.1687088    
pier:income         -1.3550e-04  5.1172e-05  -2.6480 0.0080977 ** 
beach:catch          3.1177e+00  7.1305e-01   4.3724 1.229e-05 ***
boat:catch           2.5425e+00  5.2274e-01   4.8638 1.152e-06 ***
charter:catch        7.5949e-01  1.5420e-01   4.9254 8.417e-07 ***
pier:catch           2.8512e+00  7.7464e-01   3.6807 0.0002326 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1199.1
McFadden R^2:  0.19936 
Likelihood ratio test : chisq = 597.16 (p.value = < 2.22e-16)

The coefficient estimates, standard errors, and t-values are exactly equal. The p-value differences are, as already noted, because mlogit uses the t-distribution whereas pyLogit uses an asymptotic normal distribution. The log-likelihood of our final model is also the same.

Note that the McFadden R^2 values are different. I am not sure how the mlogit value is calculated. From "Coefficients of Determination for Multiple Logistic Regression Analysis" by Scott Menard (2000), The American Statistician, 54:1, 17-24, we have the following equation:

$\textrm{McFadden's } R^2 = 1 - \frac{\mathcal{L}_M}{\mathcal{L}_0}$

which evaluates to 0.2681902, not 0.19936, provided that my null log-likelihood is correct. The next cell shows that it is and verifies pyLogit's calculated McFadden's R^2.


In [22]:
##########
# Make sure that pyLogit's null log-likelihood 
# and McFadden's R^2 are correct
##########
# Note that every observation in the Fishing dataset
# has 4 available alternatives, therefore the null 
# probability is 0.25
null_prob = 0.25

# Calculate how many observations are in the Fishing 
# dataset
num_fishing_obs = wide_fishing_df.shape[0]

# Calculate the Fishing dataset's null log-likelihood 
null_fishing_log_likelihood = (num_fishing_obs * 
                               np.log(null_prob))

# Determine whether pyLogit's null log-likelihood is correct
correct_null_ll = (null_fishing_log_likelihood == 
                   fishing_model.null_log_likelihood)
print "pyLogit's null log-likelihood is correct:", correct_null_ll

# Calculate McFadden's R^2
mcfaddens_r2 = 1 - (fishing_model.log_likelihood / fishing_model.null_log_likelihood)
print "McFadden's R^2 is {:.5f}".format(mcfaddens_r2)


pyLogit's null log-likelihood is correct: True
McFadden's R^2 is 0.26819
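One possible explanation for mlogit's 0.19936 (an assumption on my part, not verified against mlogit's source code): mlogit may compute McFadden's R^2 against the log-likelihood of an intercepts-only (market shares) model rather than the equal-shares null used above. Using the alternative frequencies from mlogit's own output:

```python
import numpy as np

# Alternative shares as printed in mlogit's summary above (rounded values)
shares = np.array([0.11337, 0.35364, 0.38240, 0.15059])
num_obs = 1182          # number of observations in the Fishing data
final_ll = -1199.1434   # final log-likelihood from both packages

# Log-likelihood of a model with intercepts only (reproducing market shares)
ll_market_shares = num_obs * np.sum(shares * np.log(shares))

# McFadden's R^2 computed against this market-shares null
r_squared = 1.0 - final_ll / ll_market_shares
print(round(r_squared, 5))  # ~0.19936, i.e. mlogit's reported value
```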