The purpose of this notebook is to benchmark pyLogit's multinomial logit estimation against the mlogit package for R, using Kenneth Train's Heating data. The models estimated in this notebook are the same models detailed in "Kenneth Train's exercises using the mlogit package for R." In particular, the following models will be estimated:
1. `mlogit(depvar ~ ic + oc | 0, H)`
2. `H$lcc = H$ic + H$oc / 0.12; mlcc <- mlogit(depvar ~ lcc | 0, H)`
3. `mc <- mlogit(depvar ~ ic + oc, H, reflevel = 'hp')`
4. `mi <- mlogit(depvar ~ oc + I(ic/income), H, reflevel = 'hp')`
5. `mi2 <- mlogit(depvar ~ oc + ic | income, H, reflevel = "hp")`
In [1]:
from collections import OrderedDict # For recording the model specification
import pandas as pd # For file input/output
import numpy as np # For vectorized math operations
import pylogit as pl # For MNL model estimation and
# conversion from wide to long format
In [2]:
# Load the Heating data, noting that the data is in wide data format
wide_heating_df = pd.read_csv("../data/heating_data_r.csv")
In [3]:
# Look at the raw Heating data
wide_heating_df.head().T
Out[3]:
Since the column denoting the choice (`depvar`) contains string objects, we need to convert it into an integer-based column.
In [4]:
# Convert the choice column for the Heating data into integers.
# Note that we use the integers 1 through 5 to denote the
# alternatives 'gc', 'gr', 'ec', 'er', and 'hp', respectively.
wide_heating_df["choice"] =\
    wide_heating_df["depvar"].map(dict(zip(['gc', 'gr', 'ec', 'er', 'hp'],
                                           range(1, 6))))
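As a quick sanity check (not part of the original benchmark), we can confirm that the mapping covered every row, since `map` silently produces missing values for any `depvar` label absent from the dictionary:

```python
# Hypothetical sanity check: a zero count of missing values confirms
# that all five fuel-type labels were converted to integers.
assert wide_heating_df["choice"].isnull().sum() == 0
# Inspect the distribution of the integer-coded choices.
print wide_heating_df["choice"].value_counts()
```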
For the Heating data, all of the alternatives are available in every choice situation. Note that, in general, this is not the case for choice data. As such, we need columns that denote the availability of each alternative for each individual.
These columns will all be filled with ones in the wide format dataframe because every alternative is always available to every individual.
In [5]:
# Create the needed availability columns for the Heating data
for i in range(1, 6):
wide_heating_df["availability_{}".format(i)] = 1
In [6]:
# Look at the columns that we need to account for when converting from
# the wide data format to the long data format.
wide_heating_df.columns
Out[6]:
In [7]:
##########
# Define lists of the variables pertaining to each variable type
# that we need to account for in the data format transformation
##########
# Determine the name for the alternative ids in the long format
# data frame
heating_alt_id = "alt_id"
# Determine the column that denotes the id of what we're treating
# as individual observations, i.e. the choice situations.
heating_obs_id_col = "idcase"
# Determine what column denotes the choice that was made
heating_choice_column = "choice"
# Create the list of observation specific variables
heating_ind_variables = ["depvar", "income", "agehed", "rooms", "region"]
# Specify the variables that vary across individuals and some or all alternatives.
# Note that each "main" key should be the desired name of the column in the long
# data format. The inner keys should be the alternative ids that have some
# value for the "main" key variable.
heating_alt_varying_variables = {"installation_costs": {1: "ic.gc",
2: "ic.gr",
3: "ic.ec",
4: "ic.er",
5: "ic.hp"},
"operating_costs": {1: "oc.gc",
2: "oc.gr",
3: "oc.ec",
4: "oc.er",
5: "oc.hp"},
}
# Specify the availability variables
heating_availability_variables = OrderedDict()
for alt_id, var in zip(range(1, 6),
["availability_{}".format(i) for i in range(1, 6)]):
heating_availability_variables[alt_id] = var
In [8]:
##########
# Actually perform the conversion to long format
##########
long_heating_df = pl.convert_wide_to_long(wide_data=wide_heating_df,
ind_vars=heating_ind_variables,
alt_specific_vars=heating_alt_varying_variables,
availability_vars=heating_availability_variables,
obs_id_col=heating_obs_id_col,
choice_col=heating_choice_column,
new_alt_id_name=heating_alt_id)
# Look at the long format Heating data
long_heating_df.head()
Out[8]:
In [9]:
# Create the life-cycle cost variable needed for model 2 where
# we assume the discount rate, r, is 0.12.
long_heating_df["life_cycle_cost"] = (long_heating_df["installation_costs"] +
long_heating_df["operating_costs"] / 0.12)
# Create the installation cost divided by income variable
long_heating_df["installation_cost_burden"] = (long_heating_df["installation_costs"] /
long_heating_df["income"])
For numeric stability, it is advisable to scale one's variables so that the estimated coefficients are similar in absolute magnitude and, if possible, close to 1 in absolute value (in other words, neither terribly tiny nor extremely large). We leave the Heating cost variables unscaled below so that the coefficients can be compared directly against mlogit's output.
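For illustration only, here is a minimal sketch of such a rescaling, using hypothetical column names; none of the models below use these scaled variables:

```python
# Hypothetical rescaling: express the costs in hundreds of dollars so
# the estimated cost coefficients would be roughly 100 times larger in
# magnitude. The fitted probabilities and log-likelihood are unchanged;
# only the scale of the coefficients differs.
long_heating_df["installation_costs_100s"] =\
    long_heating_df["installation_costs"] / 100.0
long_heating_df["operating_costs_100s"] =\
    long_heating_df["operating_costs"] / 100.0
```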
In [10]:
# Create the model specification
model_1_spec = OrderedDict()
model_1_names = OrderedDict()
# Note that for the specification dictionary, the
# keys should be the column names from the long format
# dataframe, and the values should be lists containing a
# combination of alternative ids and/or lists of
# alternative ids. There should be one element for each
# beta that will be estimated in relation to the given
# column. A list of alternative ids means that all of the
# alternatives in the list will share a single beta for
# the given variable.
# The names dictionary should contain one name for each
# element (that is, each alternative id or list of
# alternative ids) in the specification dictionary's value
# for the same variable. See the illustrative example
# after this cell.
model_1_spec["installation_costs"] = [range(1, 6)]
model_1_names["installation_costs"] = ["installation_costs"]
model_1_spec["operating_costs"] = [range(1, 6)]
model_1_names["operating_costs"] = ["operating_costs"]
In [11]:
# Create an instance of the MNL model class
model_1 = pl.create_choice_model(data=long_heating_df,
alt_id_col=heating_alt_id,
obs_id_col=heating_obs_id_col,
choice_col=heating_choice_column,
specification=model_1_spec,
model_type="MNL",
names=model_1_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
model_1.fit_mle(np.zeros(2), method='newton-cg')
# Look at the estimation summaries
model_1.get_statsmodels_summary()
Out[11]:
In [12]:
# Look at the 'standard summary' since it includes robust p-values
model_1.print_summaries()
The call from mlogit was as follows:

```
Call:
mlogit(formula = depvar ~ ic + oc | 0, data = H, method = "nr", print.level = 0)

Frequencies of alternatives:
      ec       er       gc       gr       hp
0.071111 0.093333 0.636667 0.143333 0.055556

nr method
4 iterations, 0h:0m:0s
g'(-H)^-1g = 1.56E-07
gradient close to zero

Coefficients :
      Estimate  Std. Error  t-value  Pr(>|t|)
ic -0.00623187  0.00035277  -17.665 < 2.2e-16 ***
oc -0.00458008  0.00032216  -14.217 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1095.2
```

As can be seen, the estimates, standard errors, t-values, and log-likelihood agree. The p-values differ, but this is because mlogit calculates its p-values based on a t-distribution whereas pyLogit uses an asymptotic normal distribution.
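To see that the discrepancy is purely a matter of the reference distribution, the two p-values can be recomputed from the reported t-values. The sketch below assumes scipy is available and takes the t-distribution's degrees of freedom to be the number of observations minus the number of estimated parameters, which is an assumption about mlogit's convention:

```python
from scipy import stats

# t-value reported for 'ic' in the mlogit output above.
t_value = -17.665
# Assumed degrees of freedom: number of choice situations minus
# the two estimated parameters.
deg_free = wide_heating_df.shape[0] - 2

# Two-sided p-value based on the asymptotic normal distribution (pyLogit).
print "normal-based p-value:", 2 * stats.norm.sf(abs(t_value))
# Two-sided p-value based on the t-distribution (mlogit).
print "t-based p-value:", 2 * stats.t.sf(abs(t_value), deg_free)
```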
In [13]:
# Create the model specification
model_2_spec = OrderedDict()
model_2_names = OrderedDict()
model_2_spec["life_cycle_cost"] = [range(1, 6)]
model_2_names["life_cycle_cost"] = ["life_cycle_cost"]
# Create an instance of the MNL model class
model_2 = pl.create_choice_model(data=long_heating_df,
alt_id_col=heating_alt_id,
obs_id_col=heating_obs_id_col,
choice_col=heating_choice_column,
specification=model_2_spec,
model_type="MNL",
names=model_2_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
model_2.fit_mle(np.zeros(1), method='newton-cg')
# Look at the estimation summaries
model_2.get_statsmodels_summary()
Out[13]:
Look at the corresponding results from mlogit:
```
Call:
mlogit(formula = depvar ~ lcc | 0, data = H, method = "nr", print.level = 0)

Frequencies of alternatives:
      ec       er       gc       gr       hp
0.071111 0.093333 0.636667 0.143333 0.055556

nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 9.32E-05
successive function values within tolerance limits

Coefficients :
       Estimate  Std. Error  t-value  Pr(>|t|)
lcc -7.1585e-04  4.2761e-05  -16.741 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1248.7
```
As before, all computed values agree except for the p-values, which we already know to be different due to the distribution being used to compute the p-values (t-distribution vs normal distribution).
In [14]:
# Create the model specification
model_3_spec = OrderedDict()
model_3_names = OrderedDict()
model_3_spec["intercept"] = range(1, 5)
model_3_names["intercept"] = ["ASC: {}".format(x)
for x in ["gc", "gr", "ec", "er"]]
model_3_spec["installation_costs"] = [range(1, 6)]
model_3_names["installation_costs"] = ["installation_costs"]
model_3_spec["operating_costs"] = [range(1, 6)]
model_3_names["operating_costs"] = ["operating_costs"]
# Create an instance of the MNL model class
model_3 = pl.create_choice_model(data=long_heating_df,
alt_id_col=heating_alt_id,
obs_id_col=heating_obs_id_col,
choice_col=heating_choice_column,
specification=model_3_spec,
model_type="MNL",
names=model_3_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
model_3.fit_mle(np.zeros(6))
# Look at the estimation summaries
model_3.get_statsmodels_summary()
Out[14]:
Look at the corresponding results from mlogit:
```
Call:
mlogit(formula = depvar ~ ic + oc, data = H, reflevel = "hp", method = "nr", print.level = 0)

Frequencies of alternatives:
      hp       ec       er       gc       gr
0.055556 0.071111 0.093333 0.636667 0.143333

nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 9.58E-06
successive function values within tolerance limits

Coefficients :
                  Estimate  Std. Error  t-value  Pr(>|t|)
ec:(intercept)  1.65884594  0.44841936   3.6993 0.0002162 ***
er:(intercept)  1.85343697  0.36195509   5.1206 3.045e-07 ***
gc:(intercept)  1.71097930  0.22674214   7.5459 4.485e-14 ***
gr:(intercept)  0.30826328  0.20659222   1.4921 0.1356640
ic             -0.00153315  0.00062086  -2.4694 0.0135333 *
oc             -0.00699637  0.00155408  -4.5019 6.734e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1008.2
McFadden R^2:  0.013691
Likelihood ratio test : chisq = 27.99 (p.value = 8.3572e-07)
```
Again, all calculated values except for the p-values and McFadden's $R^2$ seem to agree.
As noted in the mlogit benchmark #1 notebook, the mlogit values for McFadden's $R^2$ appear to be incorrect. Based on the formula
$$\begin{aligned} \textrm{McFadden's } R^2 &= 1 - \frac{\mathscr{L}_M}{\mathscr{L}_0} \\ \textrm{where } \mathscr{L}_M &= \textrm{the fitted log-likelihood} \\ \mathscr{L}_0 &= \textrm{the null log-likelihood}\end{aligned}$$
from "Coefficients of Determination for Multiple Logistic Regression Analysis" by Scott Menard (2000), The American Statistician, 54:1, 17-24, the calculated value of McFadden's $R^2$ should be 0.303947, as reported by pyLogit.
Note that the null log-likelihood and McFadden's $R^2$ are recomputed below to verify their correctness.
In [15]:
# Note that every observation in the Heating dataset
# has 5 available alternatives, therefore the null
# probability is 0.20
null_prob = 0.20
# Calculate how many observations are in the Heating
# dataset
num_heating_obs = wide_heating_df.shape[0]
# Calculate the Heating dataset's null log-likelihood
null_heating_log_likelihood = (num_heating_obs *
np.log(null_prob))
# Determine whether pyLogit's null log-likelihood is correct
correct_null_ll = np.allclose(null_heating_log_likelihood,
model_3.null_log_likelihood)
print "pyLogit's null log-likelihood is correct:", correct_null_ll
# Calculate McFadden's R^2
mcfaddens_r2 = 1 - (model_3.log_likelihood / model_3.null_log_likelihood)
print "McFadden's R^2 is {:.5f}".format(mcfaddens_r2)
In [16]:
# Create the model specification
model_4_spec = OrderedDict()
model_4_names = OrderedDict()
model_4_spec["intercept"] = range(1, 5)
model_4_names["intercept"] = ["ASC: {}".format(x)
for x in ["gc", "gr", "ec", "er"]]
model_4_spec["installation_cost_burden"] = [range(1, 6)]
model_4_names["installation_cost_burden"] = ["installation_cost_burden"]
model_4_spec["operating_costs"] = [range(1, 6)]
model_4_names["operating_costs"] = ["operating_costs"]
# Create an instance of the MNL model class
model_4 = pl.create_choice_model(data=long_heating_df,
alt_id_col=heating_alt_id,
obs_id_col=heating_obs_id_col,
choice_col=heating_choice_column,
specification=model_4_spec,
model_type="MNL",
names=model_4_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
model_4.fit_mle(np.zeros(6))
# Look at the estimation summaries
model_4.get_statsmodels_summary()
Out[16]:
Look at the results from mlogit:
```
Call:
mlogit(formula = depvar ~ oc + I(ic/income), data = H, reflevel = "hp", method = "nr", print.level = 0)

Frequencies of alternatives:
      hp       ec       er       gc       gr
0.055556 0.071111 0.093333 0.636667 0.143333

nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 1.03E-05
successive function values within tolerance limits

Coefficients :
                 Estimate  Std. Error  t-value  Pr(>|t|)
ec:(intercept)  1.8700773   0.4364248   4.2850 1.827e-05 ***
er:(intercept)  1.9340707   0.3599991   5.3724 7.768e-08 ***
gc:(intercept)  1.9264254   0.2034031   9.4710 < 2.2e-16 ***
gr:(intercept)  0.4047710   0.2011694   2.0121   0.04421 *
oc             -0.0071066   0.0015518  -4.5797 4.657e-06 ***
I(ic/income)   -0.0027658   0.0018944  -1.4600   0.14430
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1010.2
McFadden R^2:  0.011765
Likelihood ratio test : chisq = 24.052 (p.value = 5.9854e-06)
```
Again, all calculated values except for the p-values and McFadden's $R^2$ seem to agree.
In [17]:
# Create the model specification
model_5_spec = OrderedDict()
model_5_names = OrderedDict()
model_5_spec["intercept"] = range(1, 5)
model_5_names["intercept"] = ["ASC: {}".format(x)
for x in ["gc", "gr", "ec", "er"]]
model_5_spec["installation_costs"] = [range(1, 6)]
model_5_names["installation_costs"] = ["installation_costs"]
model_5_spec["operating_costs"] = [range(1, 6)]
model_5_names["operating_costs"] = ["operating_costs"]
model_5_spec["income"] = range(1, 5)
model_5_names["income"] = ["income_{}".format(x)
for x in ["gc", "gr", "ec", "er"]]
# Create an instance of the MNL model class
model_5 = pl.create_choice_model(data=long_heating_df,
alt_id_col=heating_alt_id,
obs_id_col=heating_obs_id_col,
choice_col=heating_choice_column,
specification=model_5_spec,
model_type="MNL",
names=model_5_names)
# Estimate the given model, starting from a point of all zeros
# as the initial values.
model_5.fit_mle(np.zeros(10))
# Look at the estimation summaries
model_5.get_statsmodels_summary()
Out[17]:
Look at the output from mlogit:
```
Call:
mlogit(formula = depvar ~ oc + ic | income, data = H, reflevel = "hp", method = "nr", print.level = 0)

Frequencies of alternatives:
      hp       ec       er       gc       gr
0.055556 0.071111 0.093333 0.636667 0.143333

nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 6.27E-06
successive function values within tolerance limits

Coefficients :
                   Estimate  Std. Error  t-value  Pr(>|t|)
ec:(intercept)   1.95445797  0.70353833   2.7780 0.0054688 **
er:(intercept)   2.30560852  0.62390478   3.6954 0.0002195 ***
gc:(intercept)   2.05517018  0.48639682   4.2253 2.386e-05 ***
gr:(intercept)   1.14158139  0.51828845   2.2026 0.0276231 *
oc              -0.00696000  0.00155383  -4.4792 7.491e-06 ***
ic              -0.00153534  0.00062251  -2.4664 0.0136486 *
ec:income       -0.06362917  0.11329865  -0.5616 0.5743846
er:income       -0.09685787  0.10755423  -0.9005 0.3678281
gc:income       -0.07178917  0.08878777  -0.8085 0.4187752
gr:income       -0.17981159  0.10012691  -1.7958 0.0725205 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -1005.9
McFadden R^2:  0.01598
Likelihood ratio test : chisq = 32.67 (p.value = 1.2134e-05)
```
As before, pyLogit and mlogit agree on all calculated values except for the p-values and McFadden's $R^2$.