Advanced Indexing in pandas

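pandas provides the following special indexer attributes to support NumPy-style multi-axis indexing, where a comma separates the row selector from the column selector.

  • ix : multi-axis indexing that accepts labels and integer positions together (deprecated in pandas 0.20 and removed in 1.0)
  • loc : label-based multi-axis indexing
  • iloc : integer-position-based multi-axis indexing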
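
As a quick orientation, the sketch below contrasts label-based and position-based access to the same cell. The frame and its names (`frame`, `r1`..`r3`, `x`, `y`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy frame, not taken from the notebook cells.
frame = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.0, 2.0, 3.0]},
    index=["r1", "r2", "r3"],
)

# loc: row and column are both given as labels, separated by a comma.
by_label = frame.loc["r2", "y"]

# iloc: row and column are both given as integer positions.
by_position = frame.iloc[1, 1]

print(by_label, by_position)
```

Both expressions address the same cell; only the way the row and column are named differs.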

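ix indexer

  • Label indexing, integer indexing, and boolean indexing (rows only) can be combined on both rows and columns
    • A single integer index is allowed
    • Columns can also be indexed by integer position instead of label
    • Columns can also be sliced by label (label slicing)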
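
Because ix is no longer available from pandas 1.0 onward, the mixed selection described above needs a different spelling there. One possible approach, shown as a sketch rather than the only way, is to chain loc (labels) and iloc (positions). The frame mirrors the `df2` used in the cells that follow; the name `mixed` is an assumption for this example.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
df2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop"],
    index=["one", "two", "three", "four", "five"],
)

# Rows by label first (loc), then columns by position (iloc).
# This plays the role of df2.ix[["two", "three"], :2] in old pandas.
mixed = df2.loc[["two", "three"]].iloc[:, :2]
print(mixed)
```

Chaining is fine for reading data. For assignment, a single loc or iloc call is usually safer, since a chained selection may operate on a copy.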

In [1]:
import numpy as np
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
}

In [2]:
df = pd.DataFrame(data)
df


Out[2]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

In [14]:
# Equivalent to selecting the rows and then the columns in sequence.
# The index holds integers, so 1:3 is a label slice and row 3 is included.
df.ix[1:3, ["state", "pop"]]


Out[14]:
    state  pop
1    Ohio  1.7
2    Ohio  3.6
3  Nevada  2.4

In [3]:
df2 = pd.DataFrame(data,
                  columns=['year', 'state', 'pop'],
                  index=['one', 'two', 'three', 'four', 'five'])
df2


Out[3]:
       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9

In [5]:
# Using a comma to separate the row and column selectors
df2.ix[["two", "three"], ["state", "pop"]]


Out[5]:
      state  pop
two    Ohio  1.7
three  Ohio  3.6

In [6]:
# Integer-based indexing also works on columns
df2.ix[["two", "three"], :2]


Out[6]:
       year state
two    2001  Ohio
three  2002  Ohio

In [7]:
# Label slicing also works on columns (the end label is included)
df2.ix[["two", "three"], "state":"pop"]


Out[7]:
      state  pop
two    Ohio  1.7
three  Ohio  3.6

In [8]:
# Using `:` to select every row
df2.ix[:, ["state", "pop"]]


Out[8]:
        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9

In [9]:
# Using `:` to select every column
df2.ix[["two", "five"], :]


Out[9]:
      year   state  pop
two   2001    Ohio  1.7
five  2002  Nevada  2.9

Caveat when the index has no labels

  • When no index labels are specified (the default integer index), ix treats an integer slice as a label slice, so the end value is included
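
The same inclusive/exclusive distinction can be observed with loc and iloc on a default integer index. The sketch below uses a hypothetical frame filled with `np.arange` instead of random numbers so the shapes are easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the default integer index 0..4.
frame = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["c1", "c2", "c3"])

# Label slice: 0:2 means labels 0, 1 and 2 (end included).
label_rows = frame.loc[0:2]

# Position slice: 0:2 means positions 0 and 1 (end excluded).
position_rows = frame.iloc[0:2]

print(len(label_rows), len(position_rows))
```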

In [10]:
df = pd.DataFrame(np.random.randn(5, 3))
df


Out[10]:
          0         1         2
0 -0.108509 -0.733949 -0.111357
1 -0.025895 -0.621490  0.193022
2  0.857554 -0.186033  0.268976
3  1.942993 -0.371014  0.022745
4  1.220884 -0.178527 -0.743444

In [11]:
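df.columns = ["c1", "c2", "c3"]
# Rows: the index holds integers, so 0:2 is a label slice (end included).
# Columns: the labels are strings, so 1:2 is a position slice (end excluded).
df.ix[0:2, 1:2]


Out[11]:
         c2
0 -0.733949
1 -0.621490
2 -0.186033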

loc indexer

  • Label-based indexing

    • Integers are interpreted as labels, not positions
    • Lists of labels are allowed
    • Label slicing is allowed
    • Boolean arrays are allowed
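
The four loc behaviours listed above can be sketched on a small frame whose index holds integers that are deliberately not 0, 1, 2, ... so that labels and positions cannot be confused. The frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical frame: the index labels are the integers 10, 20, 30, 40.
frame = pd.DataFrame(
    {"A": [6, 6, 2, 10], "B": [9, 1, 8, 3]},
    index=[10, 20, 30, 40],
)

# An integer is looked up as a label: 20 is the second row, not position 20.
row = frame.loc[20]

# List of labels on each axis.
picked = frame.loc[[10, 40], ["B"]]

# Label slices include the end label on both axes.
sliced = frame.loc[20:30, "A":"B"]

# Boolean mask for the rows, label for the column.
mask = frame["A"] > 5
filtered = frame.loc[mask, "B"]

print(row, picked, sliced, filtered, sep="\n\n")
```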

iloc indexer

  • Integer-position-based indexing

    • String labels are not allowed
    • Lists of integers are allowed
    • Integer slicing is allowed
    • Boolean arrays are allowed
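
A matching sketch for iloc follows, again on a hypothetical frame. One detail is an assumption worth checking against your pandas version: iloc is given a plain NumPy boolean array here, because to my knowledge it rejects a boolean Series as a mask.

```python
import pandas as pd

# Hypothetical frame with string labels on both axes.
frame = pd.DataFrame(
    {"A": [6, 6, 2, 10], "B": [9, 1, 8, 3], "C": [10, 1, 7, 5]},
    index=["a", "b", "c", "d"],
)

# Lists of integer positions on each axis.
picked = frame.iloc[[0, 2], [1, 2]]

# Position slices exclude the end: rows 1 and 2, columns 0 and 1.
sliced = frame.iloc[1:3, :2]

# Boolean mask converted to a NumPy array before passing it to iloc.
mask = (frame["A"] > 5).to_numpy()
filtered = frame.iloc[mask, 0]

print(picked, sliced, filtered, sep="\n\n")
```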

In [12]:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 11, size=(4,3)), 
                  columns=["A", "B", "C"], index=["a", "b", "c", "d"])
df


Out[12]:
    A  B   C
a   6  9  10
b   6  1   1
c   2  8   7
d  10  3   5

In [13]:
# ix with labels: a label list for the rows, a label slice for the columns
df.ix[["a", "c"], "B":"C"]


Out[13]:
   B   C
a  9  10
c  8   7

In [14]:
# ix with integers: both axes carry string labels, so integers act as positions
df.ix[[0, 2], 1:3]


Out[14]:
   B   C
a  9  10
c  8   7

In [15]:
# loc with labels returns the same selection
df.loc[["a", "c"], "B":"C"]


Out[15]:
   B   C
a  9  10
c  8   7

In [17]:
# ix with integer slices on string-labeled axes is positional (end excluded)
df.ix[2:4, 1:3]


Out[17]:
   B  C
c  8  7
d  3  5

In [16]:
# loc treats integers as labels; the axes hold strings, so an integer slice raises TypeError
df.loc[2:4, 1:3]


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-5f06278145c0> in <module>()
----> 1 df.loc[2:4, 1:3]

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1292 
   1293         if type(key) is tuple:
-> 1294             return self._getitem_tuple(key)
   1295         else:
   1296             return self._getitem_axis(key, axis=0)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
    802                 continue
    803 
--> 804             retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
    805 
    806         return retval

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1435         if isinstance(key, slice):
   1436             self._has_valid_type(key, axis)
-> 1437             return self._get_slice_axis(key, axis=axis)
   1438         elif is_bool_indexer(key):
   1439             return self._getbool_axis(key, axis=axis)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1316         labels = obj._get_axis(axis)
   1317         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1318                                        slice_obj.step, kind=self.name)
   1319 
   1320         if isinstance(indexer, slice):

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_indexer(self, start, end, step, kind)
   2783         """
   2784         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2785                                                  kind=kind)
   2786 
   2787         # return a slice

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_locs(self, start, end, step, kind)
   2962         start_slice = None
   2963         if start is not None:
-> 2964             start_slice = self.get_slice_bound(start, 'left', kind)
   2965         if start_slice is None:
   2966             start_slice = 0

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind)
   2901         # For datetime indices label may be a string that has to be converted
   2902         # to datetime boundary according to its resolution.
-> 2903         label = self._maybe_cast_slice_bound(label, side, kind)
   2904 
   2905         # we need to look up the label

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in _maybe_cast_slice_bound(self, label, side, kind)
   2859         # this is rejected (generally .loc gets you here)
   2860         elif is_integer(label):
-> 2861             self._invalid_indexer('slice', label)
   2862 
   2863         return label

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in _invalid_indexer(self, form, key)
   1123                         "indexers [{key}] of {kind}".format(
   1124                             form=form, klass=type(self), key=key,
-> 1125                             kind=type(key)))
   1126 
   1127     def get_duplicates(self):

TypeError: cannot do slice indexing on <class 'pandas.indexes.base.Index'> with these indexers [2] of <class 'int'>
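
The TypeError above arises because loc looks up 2 and 4 as labels, and this index only contains strings. If positions are what you have but loc is what you want to call, one option is to translate the positions into labels first by slicing the index objects. This is a sketch under the assumption that the frame holds the values shown in Out[12], typed in by hand here; `result` is a name introduced for the example.

```python
import pandas as pd

# Values copied from Out[12] so the example does not depend on the random seed.
df = pd.DataFrame(
    [[6, 9, 10], [6, 1, 1], [2, 8, 7], [10, 3, 5]],
    columns=["A", "B", "C"],
    index=["a", "b", "c", "d"],
)

# df.index[2:4] -> labels 'c', 'd'; df.columns[1:3] -> labels 'B', 'C'.
result = df.loc[df.index[2:4], df.columns[1:3]]
print(result)
```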

In [18]:
# iloc accepts the same integer slices as positions (end excluded)
df.iloc[2:4, 1:3]


Out[18]:
   B  C
c  8  7
d  3  5

In [19]:
# iloc accepts only integer positions, so string labels raise TypeError
df.iloc[["a", "c"], "B":"C"]


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-25cc466bce07> in <module>()
----> 1 df.iloc[["a", "c"], "B":"C"]

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1292 
   1293         if type(key) is tuple:
-> 1294             return self._getitem_tuple(key)
   1295         else:
   1296             return self._getitem_axis(key, axis=0)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1542     def _getitem_tuple(self, tup):
   1543 
-> 1544         self._has_valid_tuple(tup)
   1545         try:
   1546             return self._getitem_lowerdim(tup)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    140             if i >= self.obj.ndim:
    141                 raise IndexingError('Too many indexers')
--> 142             if not self._has_valid_type(k, i):
    143                 raise ValueError("Location based indexing can only have [%s] "
    144                                  "types" % self._valid_types)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_type(self, key, axis)
   1512             return self._is_valid_integer(key, axis)
   1513         elif is_list_like_indexer(key):
-> 1514             return self._is_valid_list_like(key, axis)
   1515         return False
   1516 

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _is_valid_list_like(self, key, axis)
   1535         ax = self.obj._get_axis(axis)
   1536         l = len(ax)
-> 1537         if len(arr) and (arr.max() >= l or arr.min() < -l):
   1538             raise IndexError("positional indexers are out-of-bounds")
   1539 

C:\Anaconda3\lib\site-packages\numpy\core\_methods.py in _amax(a, axis, out, keepdims)
     24 # small reductions
     25 def _amax(a, axis=None, out=None, keepdims=False):
---> 26     return umr_maximum(a, axis, None, out, keepdims)
     27 
     28 def _amin(a, axis=None, out=None, keepdims=False):

TypeError: cannot perform reduce with flexible type
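
The reverse translation is also possible: if labels are what you have but iloc is required, the index objects can convert labels into positions. The sketch below uses `Index.get_indexer` and `Index.get_loc`, assumes the labels are unique, and again types in the values from Out[12] by hand. The names `rows`, `cols` and `result` are introduced for the example.

```python
import pandas as pd

# Values copied from Out[12] so the example does not depend on the random seed.
df = pd.DataFrame(
    [[6, 9, 10], [6, 1, 1], [2, 8, 7], [10, 3, 5]],
    columns=["A", "B", "C"],
    index=["a", "b", "c", "d"],
)

# Row labels -> integer positions (array([0, 2]) for 'a' and 'c').
rows = df.index.get_indexer(["a", "c"])

# Label slice "B":"C" is inclusive, so add 1 to the stop position.
cols = slice(df.columns.get_loc("B"), df.columns.get_loc("C") + 1)

result = df.iloc[rows, cols]
print(result)
```

The `+ 1` on the stop position reproduces the end-inclusive behaviour of a label slice within the end-exclusive rules of iloc.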