The purpose of this notebook is to demonstrate they key functionalities of pyLogit:
The dataset being used for this example is the "Swissmetro" dataset used in the Python Biogeme examples. The data can be downloaded at http://biogeme.epfl.ch/examples_swissmetro.html, and a detailed explanation of the variables and data-collection procedure can be found at http://www.strc.ch/conferences/2001/bierlaire1.pdf.
Relevant information about this dataset is that it is from a stated preference survey about whether or not individuals would use a new underground Magnetic-Levetation train system called the Swissmetro.
The overall set of possible choices in this dataset was "Train", "Swissmetro", and "Car." However, the choice set faced by each individual is not constant. An individual's choice set was partially based on the alternatives that he/she was capable of using at the moment. For instance, people who did not own cars did not receive a stated preference question where car was an alternative that they could choose. Note that because the choice set varies across choice situations, mlogit and statsmodels could not be used with this dataset.
Also, each individual responded to multiple choice situations. Thus the choice observations are not truly independent of all other choice observations (they are correlated accross choices made by the same individual). However, for the purposes of this example, the effect of repeat-observations on the typical i.i.d. assumptions will be ignored.
Based on the Swissmetro data, we will build a travel mode choice model for individuals who are commuting or going on a business trip.
In [1]:
from collections import OrderedDict # For recording the model specification
import pandas as pd # For file input/output
import numpy as np # For vectorized math operations
import pylogit as pl # For MNL model estimation and
# conversion from wide to long format
In [2]:
# Note that the .dat files used by python biogeme are tab delimited text files
wide_swiss_metro = pd.read_table("../data/swissmetro.dat", sep="\t")
# Select obervations whose choice is known (i.e. CHOICE != 0)
# **AND** whose PURPOSE is either 1 (commute) or 3 (business)
include_criteria = (wide_swiss_metro.PURPOSE.isin([1, 3]) &
(wide_swiss_metro.CHOICE != 0))
# Note that the .copy() ensures that any later changes are made
# to a copy of the data and not to the original data
wide_swiss_metro = wide_swiss_metro.loc[include_criteria].copy()
In [3]:
# Look at the first 5 rows of the data
wide_swiss_metro.head().T
Out[3]:
pyLogit only estimates models using data that is in "long" format.
Long format has 1 row per individual per available alternative, and wide format has 1 row per individual or observation. Long format is useful because it permits one to directly use matrix dot products to calculate the index, $V_{ij} = x_{ij} \beta$, for each individual $\left(i \right)$ for each alternative $\left(j \right)$. In applications where one creates one's own dataset, the dataset can usually be created in long format from the very beginning. However, in situations where a dataset is provided to you in wide format (as in the case of the Swiss Metro dataset), it will be necesssary to convert the data from wide format to long format.
To convert the raw swiss metro data to long format, we need to specify:
In [4]:
# Look at the columns of the swiss metro dataset
wide_swiss_metro.columns
Out[4]:
In [5]:
# Create the list of individual specific variables
ind_variables = wide_swiss_metro.columns.tolist()[:15]
# Specify the variables that vary across individuals and some or all alternatives
# The keys are the column names that will be used in the long format dataframe.
# The values are dictionaries whose key-value pairs are the alternative id and
# the column name of the corresponding column that encodes that variable for
# the given alternative. Examples below.
alt_varying_variables = {u'travel_time': dict([(1, 'TRAIN_TT'),
(2, 'SM_TT'),
(3, 'CAR_TT')]),
u'travel_cost': dict([(1, 'TRAIN_CO'),
(2, 'SM_CO'),
(3, 'CAR_CO')]),
u'headway': dict([(1, 'TRAIN_HE'),
(2, 'SM_HE')]),
u'seat_configuration': dict([(2, "SM_SEATS")])}
# Specify the availability variables
# Note that the keys of the dictionary are the alternative id's.
# The values are the columns denoting the availability for the
# given mode in the dataset.
availability_variables = {1: 'TRAIN_AV',
2: 'SM_AV',
3: 'CAR_AV'}
##########
# Determine the columns for: alternative ids, the observation ids and the choice
##########
# The 'custom_alt_id' is the name of a column to be created in the long-format data
# It will identify the alternative associated with each row.
custom_alt_id = "mode_id"
# Create a custom id column that ignores the fact that this is a
# panel/repeated-observations dataset. Note the +1 ensures the id's start at one.
obs_id_column = "custom_id"
wide_swiss_metro[obs_id_column] = np.arange(wide_swiss_metro.shape[0],
dtype=int) + 1
# Create a variable recording the choice column
choice_column = "CHOICE"
In [6]:
# Perform the conversion to long-format
long_swiss_metro = pl.convert_wide_to_long(wide_swiss_metro,
ind_variables,
alt_varying_variables,
availability_variables,
obs_id_column,
choice_column,
new_alt_id_name=custom_alt_id)
# Look at the resulting long-format dataframe
long_swiss_metro.head(10).T
Out[6]:
Before estimating a model, one needs to pre-compute all of the variables that one wants to use. This is different from the functionality of other packages such as mlogit or statsmodels that use formula strings to create new variables "on-the-fly." This is also somewhat different from Python Biogeme where new variables can be defined in the script but not actually created by the user before model estimation. pyLogit does not perform variable creation. It only estimates models using variables that already exist.
Below, we pre-compute the variables needed for this example's model:
In [7]:
##########
# Create scaled variables so the estimated coefficients are of similar magnitudes
##########
# Scale the travel time column by 60 to convert raw units (minutes) to hours
long_swiss_metro["travel_time_hrs"] = long_swiss_metro["travel_time"] / 60.0
# Scale the headway column by 60 to convert raw units (minutes) to hours
long_swiss_metro["headway_hrs"] = long_swiss_metro["headway"] / 60.0
# Figure out who doesn't incur a marginal cost for the ticket
# This can be because he/she owns an annial season pass (GA == 1)
# or because his/her employer pays for the ticket (WHO == 2).
# Note that all the other complexity in figuring out ticket costs
# have been accounted for except the GA pass (the annual season
# ticket). Make sure this dummy variable is only equal to 1 for
# the rows with the Train or Swissmetro
long_swiss_metro["free_ticket"] = (((long_swiss_metro["GA"] == 1) |
(long_swiss_metro["WHO"] == 2)) &
long_swiss_metro[custom_alt_id].isin([1,2])).astype(int)
# Scale the travel cost by 100 so estimated coefficients are of similar magnitude
# and acccount for ownership of a season pass
long_swiss_metro["travel_cost_hundreth"] = (long_swiss_metro["travel_cost"] *
(long_swiss_metro["free_ticket"] == 0) /
100.0)
##########
# Create various dummy variables to describe the choice context of a given
# invidual for each choice task.
##########
# Create a dummy variable for whether a person has a single piece of luggage
long_swiss_metro["single_luggage_piece"] = (long_swiss_metro["LUGGAGE"] == 1).astype(int)
# Create a dummy variable for whether a person has multiple pieces of luggage
long_swiss_metro["multiple_luggage_pieces"] = (long_swiss_metro["LUGGAGE"] == 3).astype(int)
# Create a dummy variable indicating that a person is NOT first class
long_swiss_metro["regular_class"] = 1 - long_swiss_metro["FIRST"]
# Create a dummy variable indicating that the survey was taken aboard a train
# Note that such passengers are a-priori imagined to be somewhat partial to train modes
long_swiss_metro["train_survey"] = 1 - long_swiss_metro["SURVEY"]
The model specification being used in this example is the following: $$ \begin{aligned} V_{i, \textrm{Train}} &= \textrm{ASC Train} + \\ &\quad \beta _{ \textrm{tt_transit} } \textrm{Travel Time} _{ \textrm{Train}} * \frac{1}{60} + \\ &\quad \beta _{ \textrm{tc_train} } \textrm{Travel Cost}_{\textrm{Train}} * \left( GA == 0 \right) * 0.01 + \\ &\quad \beta _{ \textrm{headway_train} } \textrm{Headway} _{\textrm{Train}} * \frac{1}{60} + \\ &\quad \beta _{ \textrm{survey} } \left( \textrm{Train Survey} == 1 \right) \\ \\ V_{i, \textrm{Swissmetro}} &= \textrm{ASC Swissmetro} + \\ &\quad \beta _{ \textrm{tt_transit} } \textrm{Travel Time} _{ \textrm{Swissmetro}} * \frac{1}{60} + \\ &\quad \beta _{ \textrm{tc_sm} } \textrm{Travel Cost}_{\textrm{Swissmetro}} * \left( GA == 0 \right) * 0.01 + \\ &\quad \beta _{ \textrm{headway_sm} } \textrm{Heaway} _{\textrm{Swissmetro}} * \frac{1}{60} + \\ &\quad \beta _{ \textrm{seat} } \left( \textrm{Seat Configuration} == 1 \right) \\ &\quad \beta _{ \textrm{survey} } \left( \textrm{Train Survey} == 1 \right) \\ &\quad \beta _{ \textrm{first_class} } \left( \textrm{First Class} == 0 \right) \\ \\ V_{i, \textrm{Car}} &= \beta _{ \textrm{tt_car} } \textrm{Travel Time} _{ \textrm{Car}} * \frac{1}{60} + \\ &\quad \beta _{ \textrm{tc_car}} \textrm{Travel Cost}_{\textrm{Car}} * 0.01 + \\ &\quad \beta _{\textrm{luggage}=1} \left( \textrm{Luggage} == 1 \right) + \\ &\quad \beta _{\textrm{luggage}>1} \left( \textrm{Luggage} > 1 \right) \end{aligned} $$
Note that packages such as mlogit and statsmodels do not, by default, handle coefficients that vary over some alternatives but not all, such as the travel time coefficient that is specified as being the same for "Train" and "Swissmetro" but different for "Car."
In [8]:
# NOTE: - Specification and variable names must be ordered dictionaries.
# - Keys should be variables within the long format dataframe.
# The sole exception to this is the "intercept" key.
# - For the specification dictionary, the values should be lists
# of integers or or lists of lists of integers. Within a list,
# or within the inner-most list, the integers should be the
# alternative ID's of the alternative whose utility specification
# the explanatory variable is entering. Lists of lists denote
# alternatives that will share a common coefficient for the variable
# in question.
basic_specification = OrderedDict()
basic_names = OrderedDict()
basic_specification["intercept"] = [1, 2]
basic_names["intercept"] = ['ASC Train',
'ASC Swissmetro']
basic_specification["travel_time_hrs"] = [[1, 2,], 3]
basic_names["travel_time_hrs"] = ['Travel Time, units:hrs (Train and Swissmetro)',
'Travel Time, units:hrs (Car)']
basic_specification["travel_cost_hundreth"] = [1, 2, 3]
basic_names["travel_cost_hundreth"] = ['Travel Cost * (Annual Pass == 0), units: 0.01 CHF (Train)',
'Travel Cost * (Annual Pass == 0), units: 0.01 CHF (Swissmetro)',
'Travel Cost, units: 0.01 CHF (Car)']
basic_specification["headway_hrs"] = [1, 2]
basic_names["headway_hrs"] = ["Headway, units:hrs, (Train)",
"Headway, units:hrs, (Swissmetro)"]
basic_specification["seat_configuration"] = [2]
basic_names["seat_configuration"] = ['Airline Seat Configuration, base=No (Swissmetro)']
basic_specification["train_survey"] = [[1, 2]]
basic_names["train_survey"] = ["Surveyed on a Train, base=No, (Train and Swissmetro)"]
basic_specification["regular_class"] = [1]
basic_names["regular_class"] = ["First Class == False, (Swissmetro)"]
basic_specification["single_luggage_piece"] = [3]
basic_names["single_luggage_piece"] = ["Number of Luggage Pieces == 1, (Car)"]
basic_specification["multiple_luggage_pieces"] = [3]
basic_names["multiple_luggage_pieces"] = ["Number of Luggage Pieces > 1, (Car)"]
In [9]:
# Estimate the multinomial logit model (MNL)
swissmetro_mnl = pl.create_choice_model(data=long_swiss_metro,
alt_id_col=custom_alt_id,
obs_id_col=obs_id_column,
choice_col=choice_column,
specification=basic_specification,
model_type="MNL",
names=basic_names)
# Specify the initial values and method for the optimization.
swissmetro_mnl.fit_mle(np.zeros(14))
# Look at the estimation results
swissmetro_mnl.get_statsmodels_summary()
Out[9]:
You can view all of the results simply by using print_summaries(). This will simply print the various summary dataframes.
In [10]:
# Look at other all results at the same time
swissmetro_mnl.print_summaries()
In [11]:
# Look at the general and goodness of fit statistics
swissmetro_mnl.fit_summary
Out[11]:
In [12]:
# Look at the parameter estimation results, and round the results for easy viewing
np.round(swissmetro_mnl.summary, 3)
Out[12]: