The purpose of this notebook is twofold. First, it demonstrates the basic functionality of PyLogit for estimating Mixed Logit models. Second, it compares the estimation results for a Mixed Logit model from PyLogit and mlogit on a common dataset and model specification. The dataset is the "Electricity" dataset, and both the dataset and the model specification come from Kenneth Train's exercises. See the mlogit vignette.


In [1]:
from collections import OrderedDict    # For recording the model specification 

import pandas as pd                    # For file input/output
import numpy as np                     # For vectorized math operations

import pylogit as pl                   # For choice model estimation

1. Load the Electricity Dataset


In [2]:
# Load the raw electricity data
long_electricity = pd.read_csv("../data/electricity_r_data_long.csv")
long_electricity.head().T


Out[2]:
            0      1      2      3      4
choice  False  False  False   True  False
id          1      1      1      1      1
alt         1      2      3      4      1
pf          7      9      0      0      7
cl          5      1      0      5      0
loc         0      1      0      0      0
wk          1      0      0      1      1
tod         0      0      0      1      0
seas        0      0      1      0      0
chid        1      1      1      1      2

2. Clean the Dataset


In [3]:
# Make sure that the choice column contains ones and zeros as opposed
# to true and false
long_electricity["choice"] = long_electricity["choice"].astype(int)

# List the variables that are the index variables
index_var_names = ["pf", "cl", "loc", "wk", "tod", "seas"]
# Transform all of the index variable columns to have float dtypes
for col in index_var_names:
    long_electricity[col] = long_electricity[col].astype(float)

6. Specify and Estimate the Mixed Logit Model

6a. Specify the Model


In [4]:
# Create the model's specification dictionary and variable names dictionary
# NOTE: - Keys should be variables within the long-format dataframe.
#         The sole exception to this is the "intercept" key.
#       - For the specification dictionary, the values should be lists
#         or lists of lists. Each list (or inner-most list) should
#         contain the alternative IDs of the alternatives whose utility
#         specifications the explanatory variable enters.

example_specification = OrderedDict()
example_names = OrderedDict()

# Note that the names used below are simply for consistency with
# the coefficient names given in the mlogit vignette.
for col in index_var_names:
    example_specification[col] = [[1, 2, 3, 4]]
    example_names[col] = [col]
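To make the "lists or lists of lists" convention concrete, here is a small hypothetical specification (the alternative-specific groupings and coefficient names below are illustrative only, not part of the model estimated in this notebook):

```python
from collections import OrderedDict

# Hypothetical specification: "pf" gets one generic coefficient shared
# by all four alternatives, while "cl" gets separate coefficients for
# alternatives 1 and 2 and a shared coefficient for alternatives 3 and 4.
alt_spec = OrderedDict()
alt_spec["pf"] = [[1, 2, 3, 4]]   # one entry  -> one coefficient
alt_spec["cl"] = [1, 2, [3, 4]]   # three entries -> three coefficients

alt_names = OrderedDict()
alt_names["pf"] = ["pf"]
alt_names["cl"] = ["cl_alt_1", "cl_alt_2", "cl_alt_3_4"]

# Each specification entry must have a matching coefficient name.
for key in alt_spec:
    assert len(alt_spec[key]) == len(alt_names[key])

# Total number of index coefficients implied by this specification.
total_coefs = sum(len(v) for v in alt_names.values())
print(total_coefs)  # -> 4
```

Note that the model in this notebook uses only the first pattern: one generic coefficient per variable, shared by all four alternatives.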

6b. Estimate the model

Compared to estimating a Multinomial Logit model, creating a Mixed Logit model requires declaring more arguments.

In particular, we have to specify the identification column over which the coefficients will be randomly distributed. Usually, this is the "observation id" of our dataset. While the argument is unfortunately named (for now), "obs_id_col" should be used to specify the id of each choice situation, and "mixing_id_col" should be used to specify the id of each unit of observation over which the coefficients are believed to be randomly distributed.

At the moment, PyLogit does not support estimation of Mixed Logit models where some coefficients are mixed over one unit of observation (such as individuals) and other coefficients are mixed over a different unit of observation (such as households).

Beyond specifying the mixing_id_col, one must also specify the variables that will be treated as random. This is done by passing a list containing the names of those coefficients. Note that the strings in the passed list must be present in one of the lists within "names.values()".
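As a quick illustration of that constraint, one can check (with hypothetical names and mixing variables) that every requested mixing variable appears among the coefficient names:

```python
from collections import OrderedDict

# Hypothetical names dictionary and list of mixing variables.
names = OrderedDict([("pf", ["pf"]), ("cl", ["cl"]), ("loc", ["loc"])])
mixing_vars = ["pf", "cl"]

# Flatten names.values() and verify that each mixing variable is present.
all_coef_names = [n for name_list in names.values() for n in name_list]
missing = [v for v in mixing_vars if v not in all_coef_names]
assert not missing, "mixing_vars not found in names: %s" % missing
```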

When estimating the Mixed Logit model, we must specify the number of draws to be taken from the distributions of the random coefficients. This is done through the "num_draws" argument of the "fit_mle()" function. Additionally, we can specify a random seed so the results of our estimation are reproducible, via the "seed" argument of "fit_mle()". Finally, the "init_vals" argument must contain enough initial values for both the means of the index coefficients and the standard deviations of the coefficients being treated as random variables.


In [5]:
# Provide the module with the needed input arguments to create
# an instance of the Mixed Logit model class.

# Note that "chid" is used as the obs_id_col because "chid" is
# the choice situation id.

# Currently, the obs_id_col argument name is unfortunate because
# in the most general of senses, it refers to the situation id.
# In panel data settings, the mixing_id_col argument is what one
# would generally think of as an "observation id".

# For mixed logit models, the "mixing_id_col" argument specifies
# the units of observation that the coefficients are randomly
# distributed over.
example_mixed = pl.create_choice_model(data=long_electricity,
                                       alt_id_col="alt",
                                       obs_id_col="chid",
                                       choice_col="choice",
                                       specification=example_specification,
                                       model_type="Mixed Logit",
                                       names=example_names,
                                       mixing_id_col="id",
                                       mixing_vars=index_var_names)

# Note 2 * len(index_var_names) is used because we are estimating
# both the mean and standard deviation of each of the random coefficients
# for the listed index variables.
example_mixed.fit_mle(init_vals=np.zeros(2 * len(index_var_names)),
                      num_draws=600,
                      seed=123)

# Look at the estimated results
example_mixed.get_statsmodels_summary()


Log-likelihood at zero: -5,972.1561
Initial Log-likelihood: -5,972.1561
/Users/timothyb0912/anaconda/lib/python2.7/site-packages/scipy/optimize/_minimize.py:385: RuntimeWarning: Method BFGS does not use Hessian information (hess).
  RuntimeWarning)
Estimation Time: 3.20 minutes.
Final log-likelihood: -3,903.8028
Out[5]:
Mixed Logit Model Regression Results
Dep. Variable: choice No. Observations: 4,308
Model: Mixed Logit Model Df Residuals: 4,296
Method: MLE Df Model: 12
Date: Tue, 30 Aug 2016 Pseudo R-squ.: 0.346
Time: 13:26:15 Pseudo R-bar-squ.: 0.344
converged: False Log-Likelihood: -3,903.803
LL-Null: -5,972.156
coef std err z P>|z| [95.0% Conf. Int.]
pf -0.9919 0.029 -34.436 0.000 -1.048 -0.935
cl -0.1952 0.022 -9.064 0.000 -0.237 -0.153
loc 2.2735 0.117 19.503 0.000 2.045 2.502
wk 1.6136 0.086 18.716 0.000 1.445 1.783
tod -9.3508 0.245 -38.127 0.000 -9.831 -8.870
seas -9.6250 0.226 -42.506 0.000 -10.069 -9.181
Sigma pf -0.2218 0.016 -14.265 0.000 -0.252 -0.191
Sigma cl 0.4005 0.022 18.377 0.000 0.358 0.443
Sigma loc -1.8138 0.117 -15.512 0.000 -2.043 -1.585
Sigma wk -1.1761 0.079 -14.809 0.000 -1.332 -1.020
Sigma tod -2.5438 0.161 -15.840 0.000 -2.859 -2.229
Sigma seas -1.2916 0.164 -7.874 0.000 -1.613 -0.970

6c. Compare the Model Output with that of mlogit

The mlogit call and output are given below:

> set.seed(123)
> electricity_mixl = mlogit(choice~pf+cl+loc+wk+tod+seas|0,
+                           electricity_long,
+                           rpar=c(pf="n",
+                                  cl="n",
+                                  loc="n",
+                                  wk="n",
+                                  tod="n",
+                                  seas="n"),
+                           R=600,
+                           halton=NULL,
+                           print.level=0,
+                           panel=TRUE)
> summary(electricity_mixl)

Call:
mlogit(formula = choice ~ pf + cl + loc + wk + tod + seas | 0, 
    data = electricity_long, rpar = c(pf = "n", cl = "n", loc = "n", 
        wk = "n", tod = "n", seas = "n"), R = 600, halton = NULL, 
    panel = TRUE, print.level = 0)

Frequencies of alternatives:
      1       2       3       4 
0.22702 0.26393 0.23816 0.27089 

bfgs method
23 iterations, 0h:1m:48s 
g'(-H)^-1g = 1.66E-07 
gradient close to zero 

Coefficients :
         Estimate Std. Error t-value  Pr(>|t|)    
pf      -0.980049   0.035852 -27.336 < 2.2e-16 ***
cl      -0.214234   0.014205 -15.082 < 2.2e-16 ***
loc      2.339543   0.088591  26.408 < 2.2e-16 ***
wk       1.596299   0.069847  22.854 < 2.2e-16 ***
tod     -9.325374   0.304802 -30.595 < 2.2e-16 ***
seas    -9.643796   0.310437 -31.065 < 2.2e-16 ***
sd.pf    0.207022   0.011939  17.340 < 2.2e-16 ***
sd.cl    0.393249   0.019352  20.321 < 2.2e-16 ***
sd.loc   1.770551   0.099411  17.811 < 2.2e-16 ***
sd.wk    1.183782   0.083063  14.252 < 2.2e-16 ***
sd.tod   2.282927   0.128163  17.813 < 2.2e-16 ***
sd.seas  1.578440   0.124485  12.680 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -3904.7

random coefficients
     Min.     1st Qu.     Median       Mean     3rd Qu. Max.
pf   -Inf  -1.1196828 -0.9800489 -0.9800489 -0.84041499  Inf
cl   -Inf  -0.4794766 -0.2142339 -0.2142339  0.05100865  Inf
loc  -Inf   1.1453244  2.3395430  2.3395430  3.53376158  Inf
wk   -Inf   0.7978500  1.5962985  1.5962985  2.39474707  Inf
tod  -Inf -10.8651849 -9.3253739 -9.3253739 -7.78556285  Inf
seas -Inf -10.7084372 -9.6437959 -9.6437959 -8.57915462  Inf
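Since the two summaries above use different layouts, a small sketch can line the point estimates up side by side. The values are hard-coded from the outputs above; PyLogit's "Sigma" estimates are entered in absolute value, since the sign of a simulated standard deviation is not identified:

```python
import pandas as pd

# Point estimates copied from the PyLogit summary above
# ("Sigma" values entered as absolute values).
pylogit_est = {"pf": -0.9919, "cl": -0.1952, "loc": 2.2735,
               "wk": 1.6136, "tod": -9.3508, "seas": -9.6250,
               "sd.pf": 0.2218, "sd.cl": 0.4005, "sd.loc": 1.8138,
               "sd.wk": 1.1761, "sd.tod": 2.5438, "sd.seas": 1.2916}

# Point estimates copied from the mlogit summary above.
mlogit_est = {"pf": -0.980049, "cl": -0.214234, "loc": 2.339543,
              "wk": 1.596299, "tod": -9.325374, "seas": -9.643796,
              "sd.pf": 0.207022, "sd.cl": 0.393249, "sd.loc": 1.770551,
              "sd.wk": 1.183782, "sd.tod": 2.282927, "sd.seas": 1.578440}

comparison = pd.DataFrame({"pylogit": pd.Series(pylogit_est),
                           "mlogit": pd.Series(mlogit_est)})
comparison["abs_diff"] = (comparison["pylogit"] - comparison["mlogit"]).abs()
print(comparison.round(3))
```

The largest absolute differences are in the standard deviation estimates for "tod" and "seas"; all differences are under 0.3.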

Summary

Clearly, there are differences between the estimation results of mlogit and those of PyLogit. However, these differences are not large, and it is not clear where the differences between the two estimation routines come from.

The results are close enough to suggest that the estimation differences stem from simulation noise: one cannot force the randomly generated draws to be identical across programming languages (R versus Python).

The log-likelihood values are quite close to each other, and none of the parameters appear meaningfully different from each other: each routine's point estimate falls within the other's 95% confidence interval. Note that PyLogit's "Sigma" estimates and mlogit's "sd" estimates should be compared in absolute value, since the standard deviation of a simulated normal coefficient is identified only up to sign.
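The overlap claim can be checked directly for, say, the "pf" coefficient, using the estimates and standard errors reported in the two summaries above (and a normal approximation for the intervals):

```python
# Estimates and standard errors for "pf" from the two summaries above.
pylogit_pf, pylogit_se = -0.9919, 0.029
mlogit_pf, mlogit_se = -0.980049, 0.035852

def conf_int(est, se, z=1.96):
    """Return a normal-approximation 95% confidence interval."""
    return (est - z * se, est + z * se)

py_low, py_high = conf_int(pylogit_pf, pylogit_se)
m_low, m_high = conf_int(mlogit_pf, mlogit_se)

# Each point estimate lies inside the other routine's interval.
print(py_low <= mlogit_pf <= py_high)  # -> True
print(m_low <= pylogit_pf <= m_high)   # -> True
```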


In [ ]: