In [1]:
import pandas as pd
import larch
from larch.data_warehouse import example_file
There are two standard example datasets included with Larch. The MTC example demonstrates working with data that is (originally) in idca format, while the swissmetro example demonstrates working with data that is in idco format.
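To make the two layouts concrete, here is a minimal sketch in plain pandas using a made-up toy table; it does not use the Larch example files. The column names caseid, altid and cost are hypothetical. It assumes that idca means one row per case-alternative pair and idco means one row per case.

```python
import pandas as pd

# idca layout: one row per (case, alternative) pair.
idca = pd.DataFrame(
    {
        "caseid": [1, 1, 2, 2],
        "altid": [1, 2, 1, 2],
        "cost": [2.5, 1.0, 3.0, 1.5],
    }
).set_index(["caseid", "altid"])

# idco layout: one row per case, with one column per alternative.
idco = idca["cost"].unstack("altid").add_prefix("cost_")

print(idca)
print(idco)
```

The same four cost values appear in both tables; only the indexing differs.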
In [2]:
mtc_raw = pd.read_csv(example_file("MTCwork.csv.gz"), index_col=['casenum', 'altnum'])
mtc_raw.head(15)
To prepare this data for use with Larch, we'll load it into a larch.DataFrames object.
In [3]:
mtc = larch.DataFrames(mtc_raw)
mtc.info()
Because this data has a row for each available alternative, and omits rows for unavailable alternatives, Larch has stored it in the sparse data_ce attribute. It has also used this information to populate the data_av attribute.
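The idea of reading availability off the rows that are present can be sketched in plain pandas. This is an illustration on hypothetical toy data, not Larch's internal implementation, and the column names are assumptions.

```python
import pandas as pd

# Toy sparse data: case 2 has no row for alternative 3, so it is unavailable.
sparse = pd.DataFrame(
    {
        "caseid": [1, 1, 1, 2, 2],
        "altid": [1, 2, 3, 1, 2],
        "cost": [2.5, 1.0, 4.0, 3.0, 1.5],
    }
).set_index(["caseid", "altid"])

# An existing row marks an available alternative; a missing row marks an
# unavailable one. Unstacking with fill_value=0 makes the gaps explicit.
avail = pd.Series(1, index=sparse.index, name="avail").unstack(
    "altid", fill_value=0
)

print(avail)
```

The result has one row per case and one column per alternative, with 0 where the source data had no row.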
The "not computation-ready" note indicates that not all of the stored data uses the standard computational dtype (float64), so this dataframe isn't ready to use for model estimation (yet). Larch can fix that itself later, so there's no need to worry.
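As a rough illustration of what such a dtype check and conversion involve, the following sketch uses plain pandas on a hypothetical two-column table. It is not how Larch performs the conversion internally; it only shows the float64 requirement described above.

```python
import pandas as pd

# Hypothetical data with mixed dtypes: one integer column, one float column.
df = pd.DataFrame({"hhinc": [50, 80], "cost": [2.5, 1.0]})

# The table is only "ready" if every column is already float64.
ready_before = bool((df.dtypes == "float64").all())

# Casting every column to float64 produces a uniform computational dtype.
converted = df.astype("float64")
ready_after = bool((converted.dtypes == "float64").all())

print(ready_before, ready_after)
```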
You might notice that data_co says "not populated". That is because we are starting with data in idce (sparse idca) format.
If we want to pre-process it to crack the data into separate idco and idce parts, we can use the crack argument.
This will find all the data columns that have no within-case variance, and move them to the data_co attribute.
In [4]:
mtc = larch.DataFrames(mtc_raw, crack=True)
mtc.info()
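The cracking step described above can be approximated in plain pandas. This is a minimal sketch on hypothetical toy data, assuming "no within-case variance" means a single distinct value per case. It is not Larch's actual algorithm, and the column names are invented for illustration.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "caseid": [1, 1, 2, 2],
        "altid": [1, 2, 1, 2],
        "income": [50, 50, 80, 80],    # constant within each case
        "cost": [2.5, 1.0, 3.0, 1.5],  # varies across alternatives
    }
).set_index(["caseid", "altid"])

# A column has no within-case variance if each case holds one distinct value.
distinct = long_df.groupby(level="caseid").nunique()
co_cols = [c for c in long_df.columns if (distinct[c] <= 1).all()]

# Case-level part: one row per case. Alternative-level part: what remains.
part_co = long_df[co_cols].groupby(level="caseid").first()
part_ce = long_df.drop(columns=co_cols)

print(co_cols)
```

Here income moves to the case-level table, while cost stays in the case-alternative table.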
If we want, we can also identify which data is the "choice" at this stage.
(We can also leave that up to the Model object to be defined later.)
To do so here, we can identify the data column that includes the choices.
In [5]:
mtc = larch.DataFrames(mtc_raw, crack=True, ch='chose')
mtc.info()
In [6]:
raw = pd.read_csv(example_file('swissmetro.csv.gz')).query("PURPOSE in (1,3) and CHOICE != 0")
raw.head()
We can create a simple DataFrames object by giving this raw data to the constructor.
In [7]:
sm = larch.DataFrames(raw)
sm.info()
When we loaded the idca example data above, Larch automatically detected the set of alternatives based on the data. With idco data, we cannot infer the alternatives without some additional context. One way to provide that context is to give the alternative id codes explicitly.
In [8]:
sm = larch.DataFrames(raw, alt_codes=[1,2,3])
sm.info()
Larch can also infer the alternative codes if we identify the column containing the choices. (Note that this only works if every alternative is chosen at least once in the data; otherwise the inferred alternative codes will be incomplete.)
In [9]:
sm = larch.DataFrames(raw, ch="CHOICE")
sm.info()
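The caveat about unchosen alternatives is easy to see on a toy table. The sketch below uses plain pandas and hypothetical data. It assumes the codes are taken from the distinct values of the choice column, which may differ from how Larch does it internally.

```python
import pandas as pd

# Hypothetical idco data in which alternative 2 exists but is never chosen.
wide = pd.DataFrame({"CHOICE": [1, 3, 3, 1]})

# Inferring codes from observed choices only finds alternatives 1 and 3.
inferred = sorted(wide["CHOICE"].unique().tolist())

print(inferred)  # [1, 3]
```

In a case like this, passing the codes explicitly (as with alt_codes=[1,2,3] earlier) avoids the gap.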