DataFrames

Examples


In [1]:
import pandas as pd
import larch
from larch.data_warehouse import example_file

Larch includes two standard example datasets. The MTC example demonstrates working with data that is (originally) in idca format, with one row per case and alternative, while the swissmetro example demonstrates working with data in idco format, with one row per case.

idca

To start, we'll load the MTC example data using pandas to create a normal DataFrame, setting a two-level MultiIndex from the case and alternative identifier columns.


In [2]:
mtc_raw = pd.read_csv(example_file("MTCwork.csv.gz"),index_col=['casenum','altnum'])
mtc_raw.head(15)


Out[2]:
                chose   ivtt  ovtt  tottime  totcost  hhid  perid  numalts   dist  wkzone  ...  numadlt  nmlt5  nm5to11  nm12to16  wkccbd  wknccbd  corredis  vehbywrk  vocc  wgt
casenum altnum
1       1           1  13.38   2.0    15.38    70.63     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        2           0  18.38   2.0    20.38    35.32     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        3           0  20.38   2.0    22.38    20.18     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        4           0  25.90  15.2    41.10   115.64     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        5           0  40.50   2.0    42.50     0.00     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
2       1           0  29.92  10.0    39.92   390.81     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        2           0  34.92  10.0    44.92   195.40     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        3           0  21.92  10.0    31.92    97.97     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        4           1  22.96  14.2    37.16   185.00     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        5           0  58.95  10.0    68.95     0.00     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
3       1           1   8.60   6.0    14.60    37.76     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
        2           0  13.60   6.0    19.60    18.88     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
        3           0  15.60   6.0    21.60    10.79     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
        4           0  16.87  21.4    38.27   105.00     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
4       1           0  30.60   8.5    39.10   417.32     6      1        2  14.58     665  ...        2      1        0         0       1        0         0      1.00     0    1

15 rows × 36 columns
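
As a minimal sketch of the same loading pattern, the block below reads a small idca-style table with pandas and sets a two-level MultiIndex. The values are invented for illustration and are not taken from the MTC data; only the column names casenum, altnum, chose and ivtt are borrowed from the output above.

```python
import io
import pandas as pd

# Toy idca-style data: one row per (case, available alternative).
# These values are invented for illustration only.
toy_csv = io.StringIO(
    "casenum,altnum,chose,ivtt\n"
    "1,1,1,13.0\n"
    "1,2,0,18.0\n"
    "1,3,0,20.0\n"
    "2,1,0,30.0\n"
    "2,3,1,22.0\n"  # alternative 2 is unavailable for case 2, so it has no row
)

# Same pattern as the cell above: the case and alternative columns become the index.
toy = pd.read_csv(toy_csv, index_col=["casenum", "altnum"])

# The two index levels identify the case and the alternative.
print(list(toy.index.names))

# Number of rows (available alternatives) present for each case.
rows_per_case = toy.groupby(level="casenum").size()
print(rows_per_case)
```

Cases can have different numbers of rows here, which is the same pattern visible in the MTC output, where case 3 has four rows and cases 1 and 2 have five.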

To prepare this data for use with Larch, we'll load it into a larch.DataFrames object.


In [3]:
mtc = larch.DataFrames(mtc_raw)
mtc.info()


larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 36 variables, 22033 rows
  data_co: <not populated>
  data_av: <populated>

Because this data has a row for each available alternative and omits rows for unavailable alternatives, Larch has stored it in the sparse data_ce attribute, and has used the same information to populate the data_av attribute. The "not computation-ready" note indicates that the stored data is not all in the standard computational dtype (float64), so this DataFrames object isn't ready to use for model estimation yet. Larch can fix that itself later, so there's no need to worry.
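
One way to picture how an availability table can follow from sparse rows, and what the dtype remark refers to, is the plain-pandas sketch below. It is a conceptual illustration on invented toy data, not Larch's internal implementation; the names toy, avail and toy_float are hypothetical.

```python
import pandas as pd

# Toy sparse idca data (invented values): case 2 has no row for alternative 2.
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3)],
    names=["casenum", "altnum"],
)
toy = pd.DataFrame(
    {"chose": [1, 0, 0, 0, 1], "ivtt": [13.0, 18.0, 20.0, 30.0, 22.0]},
    index=index,
)

# Treat an alternative as available when its row is present.
# Unstacking a column of ones gives a cases-by-alternatives 0/1 table,
# with 0 filled in wherever a (case, alternative) row is missing.
avail = pd.Series(1, index=toy.index).unstack("altnum", fill_value=0)
print(avail)

# The integer column is not float64; casting gives a uniform float64 frame.
print(toy.dtypes)
toy_float = toy.astype("float64")
print(toy_float.dtypes)
```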

You might notice that data_co says "not populated", because we are starting with data in idce (sparse idca) format. To pre-process the data by cracking it into separate idco and idce parts, we can use the crack argument. This finds all the data columns that have no within-case variance and moves them to the data_co attribute.


In [4]:
mtc = larch.DataFrames(mtc_raw, crack=True)
mtc.info()


larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 5 variables, 22033 rows
  data_co: 31 variables
  data_av: <populated>
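
The idea of "no within-case variance" can be sketched in plain pandas as follows. This is only an illustration of the concept on invented toy data and is not a description of how Larch implements the crack argument; the column hhinc and the names toy_co and toy_ce are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3)],
    names=["casenum", "altnum"],
)
toy = pd.DataFrame(
    {
        "ivtt": [13.0, 18.0, 20.0, 30.0, 22.0],   # varies within a case
        "hhinc": [42.5, 42.5, 42.5, 17.5, 17.5],  # constant within each case
    },
    index=index,
)

# Count the distinct values of every column within each case.
distinct = toy.groupby(level="casenum").nunique()

# A column with at most one distinct value in every case has no within-case variance.
co_columns = [c for c in toy.columns if (distinct[c] <= 1).all()]
ce_columns = [c for c in toy.columns if c not in co_columns]

# Case-level part: one row per case. Case-alternative part: the remaining columns.
toy_co = toy[co_columns].groupby(level="casenum").first()
toy_ce = toy[ce_columns]

print(co_columns, ce_columns)
print(toy_co)
```

In the MTC output above, the same split moves 31 of the 36 columns to data_co and leaves 5 in data_ce.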

If we want, we can also identify which data is the "choice" at this stage, by naming the data column that contains the choices. (We can also leave that to be defined later by the Model object.)


In [5]:
mtc = larch.DataFrames(mtc_raw, crack=True, ch='chose')
mtc.info()


larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 5 variables, 22033 rows
  data_co: 31 variables
  data_av: <populated>
  data_ch: chose
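
Before naming a choice column, it can be useful to check that it looks like a choice indicator. The sketch below assumes a 0/1 column such as chose in the output above, with exactly one chosen alternative per case; that rule is an assumption for this toy example, and the check is plain pandas rather than anything Larch does itself.

```python
import pandas as pd

# Toy idca data (invented values) with a 0/1 choice indicator.
index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3)],
    names=["casenum", "altnum"],
)
toy = pd.DataFrame({"chose": [1, 0, 0, 0, 1]}, index=index)

# Total of the 0/1 indicator within each case.
chosen_per_case = toy["chose"].groupby(level="casenum").sum()

# Cases that do not have exactly one chosen alternative (empty if all is well).
problem_cases = chosen_per_case[chosen_per_case != 1]
print(problem_cases)

# The chosen alternative code for each case, indexed by case.
chosen_alt = toy[toy["chose"] == 1].reset_index("altnum")["altnum"]
print(chosen_alt)
```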

idco

By contrast, we'll now load the swissmetro example data, which is in idco format. Again we'll start by using pandas to load a normal DataFrame.


In [6]:
raw = pd.read_csv(example_file('swissmetro.csv.gz')).query("PURPOSE in (1,3) and CHOICE != 0")
raw.head()


Out[6]:
   GROUP  SURVEY  SP  ID  PURPOSE  FIRST  TICKET  WHO  LUGGAGE  AGE  ...  TRAIN_TT  TRAIN_CO  TRAIN_HE  SM_TT  SM_CO  SM_HE  SM_SEATS  CAR_TT  CAR_CO  CHOICE
0      2       0   1   1        1      0       1    1        0    3  ...       112        48       120     63     52     20         0     117      65       2
1      2       0   1   1        1      0       1    1        0    3  ...       103        48        30     60     49     10         0     117      84       2
2      2       0   1   1        1      0       1    1        0    3  ...       130        48        60     67     58     30         0     117      52       2
3      2       0   1   1        1      0       1    1        0    3  ...       103        40        30     63     52     20         0      72      52       2
4      2       0   1   1        1      0       1    1        0    3  ...       130        36        60     63     42     20         0      90      84       2

5 rows × 28 columns
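
The query call in the cell above keeps only the rows that satisfy both conditions. A minimal sketch of the same filter on an invented toy table (the values are not from the swissmetro data) shows which rows survive:

```python
import pandas as pd

# Toy idco data (invented values): one row per case.
toy = pd.DataFrame({"PURPOSE": [1, 2, 3, 1], "CHOICE": [2, 1, 0, 3]})

# Keep rows where PURPOSE is 1 or 3 and CHOICE is not 0.
kept = toy.query("PURPOSE in (1,3) and CHOICE != 0")
print(kept)

# The surviving rows keep their original index labels.
print(kept.index.tolist())  # [0, 3]
```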

We can create a basic DataFrames object by giving this raw data to the constructor.


In [7]:
sm = larch.DataFrames(raw)
sm.info()


larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 0
  data_ca: <not populated>
  data_co: 28 variables

When we loaded the idca example data above, Larch automatically detected the set of alternatives based on the data. With idco data, we cannot infer the alternatives without some additional context. One way to do that is to give alternative id codes explicitly.


In [8]:
sm = larch.DataFrames(raw, alt_codes=[1,2,3])
sm.info()


larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co: 28 variables

Larch can also infer the alternative codes if we identify the column containing the choices. (Note that this only works if every alternative is chosen at least once in the data; otherwise the inferred alternative codes will be incomplete.)


In [9]:
sm = larch.DataFrames(raw, ch="CHOICE")
sm.info()


larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co: 28 variables
  data_ch: CHOICE
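
The caveat about inferring alternative codes from the choice column can be illustrated with a small plain-pandas sketch. The data and the list known_codes are invented for this example; the sketch only shows why observed choices alone can miss an alternative and does not reproduce Larch's own inference.

```python
import pandas as pd

# Toy idco data (invented values): one row per case, three possible alternatives,
# but alternative 1 is never chosen in this sample.
toy = pd.DataFrame({"CHOICE": [2, 3, 2, 2, 3]})

# Codes that could be inferred from the observed choices alone.
inferred_codes = sorted(int(c) for c in toy["CHOICE"].unique())
print(inferred_codes)  # [2, 3]

# Codes known from the survey design (an assumption for this toy example).
known_codes = [1, 2, 3]

# Alternatives that inference from choices would miss.
missing = sorted(set(known_codes) - set(inferred_codes))
print(missing)  # [1]
```

When a check like this finds missing codes, supplying the codes explicitly, as in the earlier alt_codes=[1,2,3] example, avoids relying on the inference.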