DataFrames

Examples



In [1]:

    
import pandas as pd
import larch
from larch.data_warehouse import example_file

There are two standard example datasets included with Larch. The MTC example demonstrates working with data that is (originally) in idca format, while the swissmetro example demonstrates working with data that is in idco format.

idca

To start, we'll load the MTC example data with pandas to create a normal DataFrame, and we'll give it a two-level MultiIndex built from the case and alternative identifiers (casenum and altnum).



In [2]:

    
mtc_raw = pd.read_csv(example_file("MTCwork.csv.gz"), index_col=['casenum', 'altnum'])
mtc_raw.head(15)









    Out[2]:







  
    
      
      
                chose   ivtt  ovtt  tottime  totcost  hhid  perid  numalts   dist  wkzone  ...  numadlt  nmlt5  nm5to11  nm12to16  wkccbd  wknccbd  corredis  vehbywrk  vocc  wgt
casenum altnum
1       1           1  13.38   2.0    15.38    70.63     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        2           0  18.38   2.0    20.38    35.32     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        3           0  20.38   2.0    22.38    20.18     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        4           0  25.90  15.2    41.10   115.64     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
        5           0  40.50   2.0    42.50     0.00     2      1        2   7.69     664  ...        1      0        0         0       0        0         0      4.00     1    1
2       1           0  29.92  10.0    39.92   390.81     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        2           0  34.92  10.0    44.92   195.40     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        3           0  21.92  10.0    31.92    97.97     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        4           1  22.96  14.2    37.16   185.00     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
        5           0  58.95  10.0    68.95     0.00     3      1        2  11.62     738  ...        1      0        0         0       1        0         1      1.00     0    1
3       1           1   8.60   6.0    14.60    37.76     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
        2           0  13.60   6.0    19.60    18.88     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
        3           0  15.60   6.0    21.60    10.79     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
        4           0  16.87  21.4    38.27   105.00     5      1        2   4.10     696  ...        3      2        0         0       0        1         0      0.33     1    1
4       1           0  30.60   8.5    39.10   417.32     6      1        2  14.58     665  ...        2      1        0         0       1        0         0      1.00     0    1

15 rows × 36 columns
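
The index structure is easier to see on a tiny table. The following is a minimal sketch using an in-memory CSV with made-up values: the column names casenum, altnum, chose and ivtt mirror the MTC data, but the numbers are illustrative only. It is plain pandas and does not involve Larch.

```python
import io

import pandas as pd

# Made-up idca-style data: one row per (case, available alternative).
csv_text = """casenum,altnum,chose,ivtt
1,1,1,13.4
1,2,0,18.4
2,1,0,29.9
2,3,1,21.9
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col=["casenum", "altnum"])

# The two index levels identify the case and the alternative.
print(toy.index.names)

# Selecting on the first level returns all rows for one case,
# indexed by the remaining level (altnum).
print(toy.loc[1])
```

Selecting with toy.loc[1] shows why the MultiIndex is convenient: each case carries its own set of alternative rows.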

To prepare this data for use with Larch, we'll load it into a larch.DataFrames object.



In [3]:

    
mtc = larch.DataFrames(mtc_raw)
mtc.info()









    



larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 36 variables, 22033 rows
  data_co: <not populated>
  data_av: <populated>

Because this data has a row for each available alternative and omits rows for unavailable alternatives, Larch has stored it in the sparse data_ce attribute. It has also used this information to populate the data_av attribute. The "not computation-ready" flag indicates that not all of the stored data uses the standard computational dtype (float64), so this DataFrames object isn't ready for model estimation yet. Larch can fix that itself later, so there's no need to worry.
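
As a rough illustration of the two ideas in the previous paragraph, the sketch below derives an availability table from which rows are present, and casts columns to float64. It uses plain pandas on made-up data and is only an assumption about the general idea, not Larch's actual implementation.

```python
import pandas as pd

# Made-up sparse idca (idce) data: case 2 has no row for alternative 2.
idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3)],
    names=["casenum", "altnum"],
)
idce = pd.DataFrame(
    {"chose": [1, 0, 0, 0, 1], "ivtt": [13.4, 18.4, 20.4, 29.9, 21.9]},
    index=idx,
)

# Treat an alternative as available for a case exactly when a row exists.
avail = pd.Series(1, index=idce.index).unstack("altnum", fill_value=0)
print(avail)

# The integer column is not float64 yet; a cast makes every column float64.
as_float = idce.astype("float64")
print(idce.dtypes.tolist(), "->", as_float.dtypes.tolist())
```

The zero in the row for case 2 marks the alternative whose row was omitted from the sparse data.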

You might notice that data_co says "not populated". That is because we are starting with data in idce (sparse idca) format. If we want to pre-process the data and crack it into separate idco and idce parts, we can use the crack argument. This finds all the data columns that have no within-case variance and moves them to the data_co attribute.



In [4]:

    
mtc = larch.DataFrames(mtc_raw, crack=True)
mtc.info()









    



larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 5 variables, 22033 rows
  data_co: 31 variables
  data_av: <populated>
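
The idea of "no within-case variance" can be sketched in plain pandas. The example below uses made-up data, and the names data_co and data_ce are ordinary local variables chosen to echo the Larch attributes; this is an illustration of the concept under those assumptions, not the code Larch runs when crack=True.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (2, 1), (2, 2)], names=["casenum", "altnum"]
)
# Made-up data: ivtt varies by alternative; hhid and dist are fixed per case.
idce = pd.DataFrame(
    {
        "ivtt": [13.4, 18.4, 29.9, 34.9],
        "hhid": [2, 2, 3, 3],
        "dist": [7.7, 7.7, 11.6, 11.6],
    },
    index=idx,
)

# Count the distinct values of each column within each case.
n_distinct = idce.groupby(level="casenum").nunique()

# A column has no within-case variance if that count never exceeds 1.
co_cols = [c for c in idce.columns if (n_distinct[c] <= 1).all()]

data_co = idce[co_cols].groupby(level="casenum").first()  # one row per case
data_ce = idce.drop(columns=co_cols)  # one row per case-alternative pair
print(co_cols)
```

Here hhid and dist end up in the per-case table, while ivtt stays in the per-alternative table.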

If we want, we can also identify which data column holds the "choice" at this stage. (We can also leave that to be defined later on the Model object.) To do so here, we pass the name of the data column that contains the choices.



In [5]:

    
mtc = larch.DataFrames(mtc_raw, crack=True, ch='chose')
mtc.info()









    



larch.DataFrames:  (not computation-ready)
  n_cases: 5029
  n_alts: 6
  data_ce: 5 variables, 22033 rows
  data_co: 31 variables
  data_av: <populated>
  data_ch: chose

idco

By contrast, we'll now load the swissmetro example data, which is in idco format. Again we'll start by using pandas to load a normal DataFrame.



In [6]:

    
raw = pd.read_csv(example_file('swissmetro.csv.gz')).query("PURPOSE in (1,3) and CHOICE != 0")
raw.head()









    Out[6]:







  
    
      
   GROUP  SURVEY  SP  ID  PURPOSE  FIRST  TICKET  WHO  LUGGAGE  AGE  ...  TRAIN_TT  TRAIN_CO  TRAIN_HE  SM_TT  SM_CO  SM_HE  SM_SEATS  CAR_TT  CAR_CO  CHOICE
0      2       0   1   1        1      0       1    1        0    3  ...       112        48       120     63     52     20         0     117      65       2
1      2       0   1   1        1      0       1    1        0    3  ...       103        48        30     60     49     10         0     117      84       2
2      2       0   1   1        1      0       1    1        0    3  ...       130        48        60     67     58     30         0     117      52       2
3      2       0   1   1        1      0       1    1        0    3  ...       103        40        30     63     52     20         0      72      52       2
4      2       0   1   1        1      0       1    1        0    3  ...       130        36        60     63     42     20         0      90      84       2

5 rows × 28 columns

We can create a basic DataFrames object by passing this raw data to the constructor.



In [7]:

    
sm = larch.DataFrames(raw)
sm.info()









    



larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 0
  data_ca: <not populated>
  data_co: 28 variables

When we loaded the idca example data above, Larch automatically detected the set of alternatives based on the data. With idco data, we cannot infer the alternatives without some additional context. One way to do that is to give alternative id codes explicitly.



In [8]:

    
sm = larch.DataFrames(raw, alt_codes=[1,2,3])
sm.info()









    



larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co: 28 variables

Larch can also infer the alternative codes if we identify the column containing the choices. (Note that this only works if every alternative is chosen at least once in the data; otherwise the inferred alternative codes will be incomplete.)



In [9]:

    
sm = larch.DataFrames(raw, ch="CHOICE")
sm.info()









    



larch.DataFrames:  (not computation-ready)
  n_cases: 6768
  n_alts: 3
  data_ca: <not populated>
  data_co: 28 variables
  data_ch: CHOICE
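
The caveat about unchosen alternatives can be seen in a small sketch. The data below is made up, and collecting the unique values of the choice column is only an assumed stand-in for whatever Larch does internally.

```python
import pandas as pd

# Made-up idco data with three alternatives (codes 1, 2, 3),
# but nobody in this sample chose alternative 1.
toy = pd.DataFrame({"CHOICE": [2, 3, 2, 2, 3]})

# Infer the alternative codes from the observed choices alone.
inferred = sorted(int(code) for code in toy["CHOICE"].unique())
print(inferred)  # alternative 1 is missing from the inferred set
```

In a situation like this, giving the codes explicitly with alt_codes=[1,2,3] avoids the gap.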
