107: Latent Class Models

In this example, we will replicate the latent class example model from Biogeme.


In [1]:
import larch
import pandas
from larch.roles import P,X

The swissmetro dataset used in this example is conveniently bundled with Larch, accessible using the data_warehouse module. We'll load this file using the pandas read_csv command.


In [2]:
from larch import data_warehouse
raw = pandas.read_csv(data_warehouse.example_file('swissmetro.csv.gz'))

We can inspect a few rows of data to see what we have using the head method.


In [3]:
raw.head()


Out[3]:
GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE ... TRAIN_TT TRAIN_CO TRAIN_HE SM_TT SM_CO SM_HE SM_SEATS CAR_TT CAR_CO CHOICE
0 2 0 1 1 1 0 1 1 0 3 ... 112 48 120 63 52 20 0 117 65 2
1 2 0 1 1 1 0 1 1 0 3 ... 103 48 30 60 49 10 0 117 84 2
2 2 0 1 1 1 0 1 1 0 3 ... 130 48 60 67 58 30 0 117 52 2
3 2 0 1 1 1 0 1 1 0 3 ... 103 40 30 63 52 20 0 72 52 2
4 2 0 1 1 1 0 1 1 0 3 ... 130 36 60 63 42 20 0 90 84 2

5 rows × 28 columns

The Biogeme code includes a variety of commands to manipulate the data and create new variables. Because Larch sits on top of pandas, a reasonable way to create new variables is simply to add new columns to the source pandas.DataFrame, in the usual manner for any DataFrame.


In [4]:
raw['SM_COST'] = raw['SM_CO'] * (raw["GA"]==0)

You can also use the eval method of pandas DataFrames. This method takes an expression as a string and evaluates it in a namespace where the DataFrame's column names are available as variables.


In [5]:
raw['TRAIN_COST'] = raw.eval("TRAIN_CO * (GA == 0)")

This allows data expressions to be written more succinctly, as long as all your column names are also valid Python identifiers. If that isn't the case (e.g., if any column names contain spaces), you'll be better off staying away from this feature.
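Hypothetically, if you did need to reference such a column, recent versions of pandas (0.25 and later) let you wrap the name in backticks inside the expression. A minimal sketch with a made-up column name:

df = pandas.DataFrame({"travel time": [10, 20, 30]})  # hypothetical column name containing a space
df.eval("`travel time` * 2")  # backticks make the awkward name usable inside eval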

We can mix and match between these two methods to create new columns in any DataFrame as needed.


In [6]:
raw['TRAIN_COST_SCALED'] = raw['TRAIN_COST'] / 100
raw['TRAIN_TT_SCALED'] = raw['TRAIN_TT'] / 100

raw['SM_COST_SCALED'] = raw.eval('SM_COST / 100')
raw['SM_TT_SCALED'] = raw['SM_TT'] / 100

raw['CAR_CO_SCALED'] = raw['CAR_CO'] / 100
raw['CAR_TT_SCALED'] = raw['CAR_TT'] / 100

In [7]:
raw['CAR_AV_SP'] = raw.eval("CAR_AV * (SP!=0)")
raw['TRAIN_AV_SP'] = raw.eval("TRAIN_AV * (SP!=0)")

Removing some observations can also be done directly using pandas. Here we identify a subset of observations that we want to keep.


In [8]:
keep = raw.eval("PURPOSE in (1,3) and CHOICE != 0")

You may note that we don't assign this value to a column within the raw DataFrame. This is perfectly acceptable, as the output of the eval method is just a normal pandas.Series, like any other single-column output you might get from a pandas method.
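For instance, because keep is an ordinary boolean Series, the usual pandas operations apply to it; summing it counts the observations we will retain, which should match the n_cases figure reported below:

keep.sum()  # number of True values, i.e. observations kept (6768 here)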

When you've created the data you need, you can pass the dataframe to the larch.DataFrames constructor. Since the swissmetro data is in idco format, we'll need to explicitly identify the alternative codes as well.


In [9]:
dfs = larch.DataFrames(raw[keep], alt_codes=[1,2,3])

The info method of the DataFrames object gives a short summary of the contents.


In [10]:
dfs.info()


larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co: 38 variables

A longer summary is available by setting verbose to True.


In [11]:
dfs.info(verbose=True)


larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co:
    - GROUP
    - SURVEY
    - SP
    - ID
    - PURPOSE
    - FIRST
    - TICKET
    - WHO
    - LUGGAGE
    - AGE
    - MALE
    - INCOME
    - GA
    - ORIGIN
    - DEST
    - TRAIN_AV
    - CAR_AV
    - SM_AV
    - TRAIN_TT
    - TRAIN_CO
    - TRAIN_HE
    - SM_TT
    - SM_CO
    - SM_HE
    - SM_SEATS
    - CAR_TT
    - CAR_CO
    - CHOICE
    - SM_COST
    - TRAIN_COST
    - TRAIN_COST_SCALED
    - TRAIN_TT_SCALED
    - SM_COST_SCALED
    - SM_TT_SCALED
    - CAR_CO_SCALED
    - CAR_TT_SCALED
    - CAR_AV_SP
    - TRAIN_AV_SP

You may have noticed that the info summary notes that this data is "not computation-ready". That's because some of the data columns are stored as integers, which can be observed by inspecting the info on the data_co dataframe.


In [12]:
dfs.data_co.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6768 entries, 0 to 8450
Data columns (total 38 columns):
GROUP                6768 non-null int64
SURVEY               6768 non-null int64
SP                   6768 non-null int64
ID                   6768 non-null int64
PURPOSE              6768 non-null int64
FIRST                6768 non-null int64
TICKET               6768 non-null int64
WHO                  6768 non-null int64
LUGGAGE              6768 non-null int64
AGE                  6768 non-null int64
MALE                 6768 non-null int64
INCOME               6768 non-null int64
GA                   6768 non-null int64
ORIGIN               6768 non-null int64
DEST                 6768 non-null int64
TRAIN_AV             6768 non-null int64
CAR_AV               6768 non-null int64
SM_AV                6768 non-null int64
TRAIN_TT             6768 non-null int64
TRAIN_CO             6768 non-null int64
TRAIN_HE             6768 non-null int64
SM_TT                6768 non-null int64
SM_CO                6768 non-null int64
SM_HE                6768 non-null int64
SM_SEATS             6768 non-null int64
CAR_TT               6768 non-null int64
CAR_CO               6768 non-null int64
CHOICE               6768 non-null int64
SM_COST              6768 non-null int64
TRAIN_COST           6768 non-null int64
TRAIN_COST_SCALED    6768 non-null float64
TRAIN_TT_SCALED      6768 non-null float64
SM_COST_SCALED       6768 non-null float64
SM_TT_SCALED         6768 non-null float64
CAR_CO_SCALED        6768 non-null float64
CAR_TT_SCALED        6768 non-null float64
CAR_AV_SP            6768 non-null int64
TRAIN_AV_SP          6768 non-null int64
dtypes: float64(6), int64(32)
memory usage: 2.0 MB

When computations are run, we'll need all the data to be in float format, but Larch knows this and will handle it for you later.
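If you'd rather control the conversion yourself up front, a plain pandas astype call accomplishes the same thing; this is just an optional sketch, since Larch will otherwise convert what it needs automatically when the data is loaded later:

raw_float = raw[keep].astype('float64')  # manual conversion; Larch does the equivalent on its own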

Class Model Setup

Having prepped our data, we're ready to set up the discrete choice models for each class in the latent class model. We'll reproduce the Biogeme example exactly here, as a technology demonstration. Each of the two classes will be set up with a simple MNL model.


In [13]:
m1 = larch.Model(dataservice=dfs)
m1.availability_co_vars = {
    1: "TRAIN_AV_SP",
    2: "SM_AV",
    3: "CAR_AV_SP",
}
m1.choice_co_code = 'CHOICE'

m1.utility_co[1] = P("ASC_TRAIN") + X("TRAIN_COST_SCALED") * P("B_COST")
m1.utility_co[2] = X("SM_COST_SCALED") * P("B_COST")
m1.utility_co[3] = P("ASC_CAR") + X("CAR_CO_SCALED") * P("B_COST")

In [14]:
m2 = larch.Model(dataservice=dfs)
m2.availability_co_vars = {
    1: "TRAIN_AV_SP",
    2: "SM_AV",
    3: "CAR_AV_SP",
}
m2.choice_co_code = 'CHOICE'

m2.utility_co[1] = P("ASC_TRAIN") + X("TRAIN_TT_SCALED") * P("B_TIME") + X("TRAIN_COST_SCALED") * P("B_COST")
m2.utility_co[2] = X("SM_TT_SCALED") * P("B_TIME") + X("SM_COST_SCALED") * P("B_COST")
m2.utility_co[3] = P("ASC_CAR") + X("CAR_TT_SCALED") * P("B_TIME") + X("CAR_CO_SCALED") * P("B_COST")

Class Membership Model

For Larch, the class membership model will be set up as yet another discrete choice model. In this case, the choices are not the ultimate choices, but instead are the latent classes. To remain consistent with the Biogeme example, we'll set up this model with only a single constant that determines class membership. Unlike Biogeme, this class membership will be represented with an MNL model, not a simple direct probability.


In [15]:
mk = larch.Model()
mk.utility_co[2] = P("W_OTHER")

The utility function of the first class isn't written here, which means it is implicitly set to 0.

Latent Class Model

Now we're ready to create the latent class model itself, by assembling the components we created above. The constructor for the LatentClassModel takes two arguments: a class membership model, and a dictionary of class models, where the keys in the dictionary correspond to the identifying codes from the utility functions we wrote for the class membership model.


In [16]:
from larch.model.latentclass import LatentClassModel
m = LatentClassModel(mk, {1:m1, 2:m2})

Then we'll load the data needed for our models using the load_data method. This step assembles the required data and converts it to floating point format as needed.


In [17]:
m.load_data()


converting data_ch to <class 'numpy.float64'>

In [18]:
m.dataframes.info(verbose=1)


larch.DataFrames:
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co:
    - CAR_CO_SCALED
    - CAR_TT_SCALED
    - SM_COST_SCALED
    - SM_TT_SCALED
    - TRAIN_COST_SCALED
    - TRAIN_TT_SCALED
  data_av: <populated>
  data_ch: <populated>

Only the data actually needed by the models has been converted, which may help keep memory usage down on larger models. You may also note that the loaded DataFrames object no longer reports that it is "not computation-ready".

To estimate the model, we'll use the maximize_loglike method. When run in Jupyter, a live-view report of the parameters and log likelihood is displayed.


In [19]:
result = m.maximize_loglike()


Iteration 056 [Converged]

LL = -5208.498065961454

               value  initvalue  nullvalue  minimum  maximum  holdfast  note       best
ASC_CAR     0.124611        0.0        0.0     -inf      inf         0         0.124611
ASC_TRAIN  -0.397611        0.0        0.0     -inf      inf         0        -0.397611
B_COST     -1.263567        0.0        0.0     -inf      inf         0        -1.263567
B_TIME     -2.797798        0.0        0.0     -inf      inf         0        -2.797798
W_OTHER     1.094291        0.0        0.0     -inf      inf         0         1.094291

In [20]:
result


Out[20]:
key               value
loglike           -5208.498065961454
x                 ASC_CAR      0.124611
                  ASC_TRAIN   -0.397611
                  B_COST      -1.263567
                  B_TIME      -2.797798
                  W_OTHER      1.094291
tolerance         8.736059377268807e-06
steps             array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
message           'Optimization terminated successfully.'
elapsed_time      0:00:32.584511
method            'bhhh'
n_cases           6768
iteration_number  56
logloss           0.7695771374056521
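Because the class membership model contains only the single constant W_OTHER, the implied class shares follow the familiar binary logit formula. A quick back-of-the-envelope check using the estimate above:

import numpy
# Class 2 probability under a binary logit, with the class 1 utility fixed at zero.
p_class2 = numpy.exp(1.094291) / (1 + numpy.exp(1.094291))
p_class1 = 1 - p_class2
(p_class1, p_class2)  # roughly (0.25, 0.75)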

To complete our analysis, we can compute the log likelihood at "null" parameters.


In [21]:
m.loglike_null()


Out[21]:
-6964.662979191462
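With the converged and null log likelihoods in hand, we can verify by hand the rho-squared statistic that appears in the estimation statistics later:

1 - (-5208.498065961454 / -6964.662979191462)  # rho squared w.r.t. null parameters, about 0.252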

And the parameter covariance matrices.


In [22]:
m.calculate_parameter_covariance()

In [23]:
m.covariance_matrix


Out[23]:
ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.002549 0.001935 0.000252 -0.004538 -0.000080
ASC_TRAIN 0.001935 0.003702 -0.000143 -0.004348 0.001439
B_COST 0.000252 -0.000143 0.003742 0.002839 0.000620
B_TIME -0.004538 -0.004348 0.002839 0.030840 0.012869
W_OTHER -0.000080 0.001439 0.000620 0.012869 0.013563

In [24]:
m.robust_covariance_matrix


Out[24]:
ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.010295 0.008356 0.001083 -0.018769 -0.001150
ASC_TRAIN 0.008356 0.015392 -0.001147 -0.017526 0.006384
B_COST 0.001083 -0.001147 0.029295 0.014484 -0.000933
B_TIME -0.018770 -0.017526 0.014483 0.117864 0.047962
W_OTHER -0.001150 0.006384 -0.000933 0.047962 0.053560
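The standard errors reported in the parameter frame below are simply the square roots of the diagonal entries of these matrices. For example, for B_COST:

import numpy
numpy.sqrt(0.003742)  # about 0.0612, the "std err" for B_COST
numpy.sqrt(0.029295)  # about 0.1712, the "robust std err" for B_COST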

Reporting Results

We can then generate a report of the estimation statistics. Larch includes a Reporter class to help you assemble a report containing the relevant output you want.


In [25]:
report = larch.Reporter("Latent Class Example")

Pipe section headers into the report in Markdown format (use one hash for top level headings, two hashes for lower levels, and so on).


In [26]:
report << "# Parameter Estimates"


Out[26]:

Parameter Estimates

You can also pipe in dataframes directly, including the pf parameter frame from the model.


In [27]:
report << m.pf


Out[27]:
               value  initvalue  nullvalue  minimum  maximum  holdfast  note       best   std err      t stat  robust std err  robust t stat
ASC_CAR     0.124611        0.0        0.0     -inf      inf         0         0.124611  0.050483    2.468380        0.101465       1.228109
ASC_TRAIN  -0.397611        0.0        0.0     -inf      inf         0        -0.397611  0.060847   -6.534638        0.124063      -3.204911
B_COST     -1.263567        0.0        0.0     -inf      inf         0        -1.263567  0.061170  -20.656769        0.171159      -7.382413
B_TIME     -2.797798        0.0        0.0     -inf      inf         0        -2.797798  0.175613  -15.931614        0.343313      -8.149405
W_OTHER     1.094291        0.0        0.0     -inf      inf         0         1.094291  0.116461    9.396203        0.231430       4.728383

And a selection of pre-formatted summary sections.


In [28]:
report << "# Estimation Statistics"
report << m.estimation_statistics()


Out[28]:

Estimation Statistics

Statistic                            Aggregate    Per Case
Number of Cases                      6768
Log Likelihood at Convergence        -5208.50     -0.77
Log Likelihood at Null Parameters    -6964.66     -1.03
Rho Squared w.r.t. Null Parameters   0.252

In [29]:
report << "# Parameter Covariance"
report << "## Typical Parameter Covariance"
report << m.covariance_matrix
report << "## Robust Parameter Covariance"
report << m.robust_covariance_matrix


Out[29]:

Parameter Covariance

Typical Parameter Covariance

ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.002549 0.001935 0.000252 -0.004538 -0.000080
ASC_TRAIN 0.001935 0.003702 -0.000143 -0.004348 0.001439
B_COST 0.000252 -0.000143 0.003742 0.002839 0.000620
B_TIME -0.004538 -0.004348 0.002839 0.030840 0.012869
W_OTHER -0.000080 0.001439 0.000620 0.012869 0.013563

Robust Parameter Covariance

ASC_CAR ASC_TRAIN B_COST B_TIME W_OTHER
ASC_CAR 0.010295 0.008356 0.001083 -0.018769 -0.001150
ASC_TRAIN 0.008356 0.015392 -0.001147 -0.017526 0.006384
B_COST 0.001083 -0.001147 0.029295 0.014484 -0.000933
B_TIME -0.018770 -0.017526 0.014483 0.117864 0.047962
W_OTHER -0.001150 0.006384 -0.000933 0.047962 0.053560

In [30]:
report << "# Utility Functions"
report << "## Class 1"
report << "### Formulae"
report << m1.utility_functions(resolve_parameters=False)
report << "### Final Estimated Values"
report << m1.utility_functions(resolve_parameters=True)


Out[30]:

Utility Functions

Class 1

Formulae

alt  formula
1    P.ASC_TRAIN + P.B_COST * X.TRAIN_COST_SCALED
2    P.B_COST * X.SM_COST_SCALED
3    P.ASC_CAR + P.B_COST * X.CAR_CO_SCALED

Final Estimated Values

alt  formula
1    -0.398 + -1.26 * X.TRAIN_COST_SCALED
2    -1.26 * X.SM_COST_SCALED
3    0.125 + -1.26 * X.CAR_CO_SCALED

In [31]:
report << "## Class 2"
report << "### Formulae"
report << m2.utility_functions(resolve_parameters=False)
report << "### Final Estimated Values"
report << m2.utility_functions(resolve_parameters=True)


Out[31]:

Class 2

Formulae

alt  formula
1    P.ASC_TRAIN + P.B_TIME * X.TRAIN_TT_SCALED + P.B_COST * X.TRAIN_COST_SCALED
2    P.B_TIME * X.SM_TT_SCALED + P.B_COST * X.SM_COST_SCALED
3    P.ASC_CAR + P.B_TIME * X.CAR_TT_SCALED + P.B_COST * X.CAR_CO_SCALED

Final Estimated Values

alt  formula
1    -0.398 + -2.80 * X.TRAIN_TT_SCALED + -1.26 * X.TRAIN_COST_SCALED
2    -2.80 * X.SM_TT_SCALED + -1.26 * X.SM_COST_SCALED
3    0.125 + -2.80 * X.CAR_TT_SCALED + -1.26 * X.CAR_CO_SCALED

In addition to reviewing report sections in a Jupyter notebook, the entire report can be saved to an HTML file.


In [32]:
report.save('latent-class-example-report.html', overwrite=True)


Out[32]:
'latent-class-example-report.html'