The MultiIndex object

View vs copy


In [1]:
import pandas as pd

In [2]:
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)

In [3]:
mlo.head()


Out[3]:
Decimal Date Average Interpolated Trend Number of Days
Date
1958-03-01 1958.208 315.71 315.71 314.62 -1
1958-04-01 1958.292 317.45 317.45 315.29 -1
1958-05-01 1958.375 317.50 317.50 314.71 -1
1958-06-01 1958.458 NaN 317.10 314.85 -1
1958-07-01 1958.542 315.86 315.86 314.98 -1

In [4]:
s = mlo['Interpolated']

In [5]:
mlo.assign(smooth=s.rolling(12).mean()).tail()


Out[5]:
Decimal Date Average Interpolated Trend Number of Days smooth
Date
2016-08-01 2016.625 402.25 402.25 404.09 23 403.157500
2016-09-01 2016.708 401.03 401.03 404.52 24 403.440833
2016-10-01 2016.792 401.57 401.57 404.93 29 403.714167
2016-11-01 2016.875 403.53 403.53 405.57 27 403.995000
2016-12-01 2016.958 404.48 404.48 405.25 29 404.214167

A copy is returned.


In [6]:
mlo.head()


Out[6]:
Decimal Date Average Interpolated Trend Number of Days
Date
1958-03-01 1958.208 315.71 315.71 314.62 -1
1958-04-01 1958.292 317.45 317.45 315.29 -1
1958-05-01 1958.375 317.50 317.50 314.71 -1
1958-06-01 1958.458 NaN 317.10 314.85 -1
1958-07-01 1958.542 315.86 315.86 314.98 -1

In [7]:
s2 = mlo.loc[:'1958-05', 'Average']
s2


Out[7]:
Date
1958-03-01    315.71
1958-04-01    317.45
1958-05-01    317.50
Name: Average, dtype: float64

A view is returned.


In [8]:
s2[:] = 313

In [9]:
s2


Out[9]:
Date
1958-03-01    313.0
1958-04-01    313.0
1958-05-01    313.0
Name: Average, dtype: float64

In [10]:
mlo.head()


Out[10]:
Decimal Date Average Interpolated Trend Number of Days
Date
1958-03-01 1958.208 313.00 315.71 314.62 -1
1958-04-01 1958.292 313.00 317.45 315.29 -1
1958-05-01 1958.375 313.00 317.50 314.71 -1
1958-06-01 1958.458 NaN 317.10 314.85 -1
1958-07-01 1958.542 315.86 315.86 314.98 -1

Hands-on exercise

  1. How could you create a series equal to s2 while preserving the original mlo DataFrame (Hint: Remember the NumPy lesson.)

Chained indexing


In [11]:
mlo['Average']['1958-03']


Out[11]:
Date
1958-03-01    313.0
Name: Average, dtype: float64

In [12]:
mlo['Average']['1958-03'] = 312


/home/marianne/python/miniconda3/envs/advanced-pandas/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Generally speaking, chained indexing is not a good practice. To set a new value, use mlo.loc[row_indexer, col_indexer] because mlo.loc is guaranteed to be mlo itself.


In [13]:
mlo.loc['1958-03', 'Average']


Out[13]:
Date
1958-03-01    312.0
Name: Average, dtype: float64

Hierarchical indexing


In [14]:
h_index = pd.MultiIndex.from_product([['first', 'second'], ['A', 'B']])
h_index


Out[14]:
MultiIndex(levels=[['first', 'second'], ['A', 'B']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [15]:
x = pd.Series(range(4), index=h_index)
x


Out[15]:
first   A    0
        B    1
second  A    2
        B    3
dtype: int64

In [16]:
x['first']


Out[16]:
A    0
B    1
dtype: int64

In [17]:
x['first']['B']


Out[17]:
1

In the above, there are two selection operations.


In [18]:
x.loc[('first', 'B')]


Out[18]:
1

In the above, there is a single selection operation.

We can end up with a hierarchical index when stacking records.


In [19]:
gl = pd.read_csv('../data/co2-mm-gl.csv', na_values=-99.99, index_col='Date', parse_dates=True)

gl = gl[['Average']]
gl.columns = ['Average_gl']
gl.head()


Out[19]:
Average_gl
Date
1980-01-01 338.45
1980-02-01 339.14
1980-03-01 339.46
1980-04-01 339.86
1980-05-01 340.30

In [20]:
ml = mlo[['Average']]
ml.columns = ['Average_mlo']
ml.head()


Out[20]:
Average_mlo
Date
1958-03-01 312.00
1958-04-01 313.00
1958-05-01 313.00
1958-06-01 NaN
1958-07-01 315.86

In [21]:
ml = ml[ml.index >= '1980-01']

gl = gl.head()
ml = ml.head()

In [22]:
multi = pd.concat([ml, gl], axis=1).stack()
multi


Out[22]:
Date                   
1980-01-01  Average_mlo    337.90
            Average_gl     338.45
1980-02-01  Average_mlo    338.34
            Average_gl     339.14
1980-03-01  Average_mlo    340.01
            Average_gl     339.46
1980-04-01  Average_mlo    340.93
            Average_gl     339.86
1980-05-01  Average_mlo    341.48
            Average_gl     340.30
dtype: float64

In [23]:
multi.index


Out[23]:
MultiIndex(levels=[[1980-01-01 00:00:00, 1980-02-01 00:00:00, 1980-03-01 00:00:00, 1980-04-01 00:00:00, 1980-05-01 00:00:00], ['Average_mlo', 'Average_gl']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['Date', None])

In [24]:
multi.index.get_level_values('Date')


Out[24]:
DatetimeIndex(['1980-01-01', '1980-01-01', '1980-02-01', '1980-02-01',
               '1980-03-01', '1980-03-01', '1980-04-01', '1980-04-01',
               '1980-05-01', '1980-05-01'],
              dtype='datetime64[ns]', name='Date', freq='MS')

In [25]:
multi.loc[multi.index.get_level_values('Date') < '1980-03']


Out[25]:
Date                   
1980-01-01  Average_mlo    337.90
            Average_gl     338.45
1980-02-01  Average_mlo    338.34
            Average_gl     339.14
dtype: float64

Hands-on exercise

  1. Select out all values of the multi series for the Average_mlo variable.

Reshaping

The stack() function compressed a level in the DataFrame’s columns to produce a Series (as a reminder, multi = pd.concat([ml, gl], axis=1).stack()).


In [26]:
pd.concat([ml, gl], axis=1)


Out[26]:
Average_mlo Average_gl
Date
1980-01-01 337.90 338.45
1980-02-01 338.34 339.14
1980-03-01 340.01 339.46
1980-04-01 340.93 339.86
1980-05-01 341.48 340.30

In [27]:
multi


Out[27]:
Date                   
1980-01-01  Average_mlo    337.90
            Average_gl     338.45
1980-02-01  Average_mlo    338.34
            Average_gl     339.14
1980-03-01  Average_mlo    340.01
            Average_gl     339.46
1980-04-01  Average_mlo    340.93
            Average_gl     339.86
1980-05-01  Average_mlo    341.48
            Average_gl     340.30
dtype: float64

The inverse function is unstack(); it is designed to work with a hierarchical index.


In [28]:
multi.unstack()


Out[28]:
Average_mlo Average_gl
Date
1980-01-01 337.90 338.45
1980-02-01 338.34 339.14
1980-03-01 340.01 339.46
1980-04-01 340.93 339.86
1980-05-01 341.48 340.30

Hands-on exercises

  1. Unstack x.
  2. What does x.unstack(0) return?
  3. What is another term for 'unstacking' (which you may have heard in the context of spreadsheets)?

Pivoting


In [29]:
rec = pd.concat([ml, gl], axis=1).stack().reset_index()
rec.columns = ['date', 'variable', 'value']
rec


Out[29]:
date variable value
0 1980-01-01 Average_mlo 337.90
1 1980-01-01 Average_gl 338.45
2 1980-02-01 Average_mlo 338.34
3 1980-02-01 Average_gl 339.14
4 1980-03-01 Average_mlo 340.01
5 1980-03-01 Average_gl 339.46
6 1980-04-01 Average_mlo 340.93
7 1980-04-01 Average_gl 339.86
8 1980-05-01 Average_mlo 341.48
9 1980-05-01 Average_gl 340.30

The above data is in 'stacked' or 'record' format.


In [30]:
rec


Out[30]:
date variable value
0 1980-01-01 Average_mlo 337.90
1 1980-01-01 Average_gl 338.45
2 1980-02-01 Average_mlo 338.34
3 1980-02-01 Average_gl 339.14
4 1980-03-01 Average_mlo 340.01
5 1980-03-01 Average_gl 339.46
6 1980-04-01 Average_mlo 340.93
7 1980-04-01 Average_gl 339.86
8 1980-05-01 Average_mlo 341.48
9 1980-05-01 Average_gl 340.30

In [31]:
rec[rec.variable == 'Average_mlo']


Out[31]:
date variable value
0 1980-01-01 Average_mlo 337.90
2 1980-02-01 Average_mlo 338.34
4 1980-03-01 Average_mlo 340.01
6 1980-04-01 Average_mlo 340.93
8 1980-05-01 Average_mlo 341.48

In [32]:
pivot_table = rec.pivot(index='date', columns='variable', values='value')
pivot_table


Out[32]:
variable Average_gl Average_mlo
date
1980-01-01 338.45 337.90
1980-02-01 339.14 338.34
1980-03-01 339.46 340.01
1980-04-01 339.86 340.93
1980-05-01 340.30 341.48

The pivoted data is more suitable for timeseries analysis.


In [33]:
pivot_table['Average_gl']


Out[33]:
date
1980-01-01    338.45
1980-02-01    339.14
1980-03-01    339.46
1980-04-01    339.86
1980-05-01    340.30
Name: Average_gl, dtype: float64

In [34]:
pivot_table.index


Out[34]:
DatetimeIndex(['1980-01-01', '1980-02-01', '1980-03-01', '1980-04-01',
               '1980-05-01'],
              dtype='datetime64[ns]', name='date', freq=None)

Hands-on exercises

  1. How could you transform x so that you could apply the .pivot() method on it?
  2. Consider the (x, y, z) 3D physical space. We can have altitude (z) values for given spatial coordinates (x, y). In stacked (or triplet?) format, this could be alt = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [0, 1, 0, 1], 'z': [0.5, 0.8, 0.6, 0.3]}). Pivot this DataFrame.
  3. Rewrite alt as a Series (of z values) with a multi-index. Unstack it. How does it compare with the pivot table above?
  4. In this geographically inspired example, what would you use the pivoted / unstacked table for?