Advanced Indexing in Pandas

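To support comma-separated multi-axis indexing in the style of NumPy arrays, pandas provides the following special indexer attributes.

ix : multi-axis indexing that accepts labels and integer positions together (deprecated in pandas 0.20.0 and removed in 1.0.0)
loc : label-based multi-axis indexing
iloc : integer-position-based multi-axis indexing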

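The ix indexer

Label indexing, integer indexing, and boolean indexing (rows only) can be combined on both the row and column axes.
- A single integer can be used as an index
- Columns can also be indexed by integer position instead of by label
- Columns can also be selected by label slicing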



In [1]:

    
import numpy as np
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
}



In [2]:

    
df = pd.DataFrame(data)
df



In [14]:

    
# Same as sequential (chained) indexing
df.ix[1:3, ["state", "pop"]]



In [3]:

    
df2 = pd.DataFrame(data,
                  columns=['year', 'state', 'pop'],
                  index=['one', 'two', 'three', 'four', 'five'])
df2



In [5]:

    
# Using a comma
df2.ix[["two", "three"], ["state", "pop"]]









    Out[5]:

          state  pop
    two    Ohio  1.7
    three  Ohio  3.6


In [6]:

    
# Integer-based indexing also works on columns
df2.ix[["two", "three"], :2]









    Out[6]:

           year state
    two    2001  Ohio
    three  2002  Ohio



In [7]:

    
# Label slicing also works on columns
df2.ix[["two", "three"], "state":"pop"]









    Out[7]:

          state  pop
    two    Ohio  1.7
    three  Ohio  3.6



In [8]:

    
# Using `:` to select every row
df2.ix[:, ["state", "pop"]]









    Out[8]:

            state  pop
    one      Ohio  1.5
    two      Ohio  1.7
    three    Ohio  3.6
    four   Nevada  2.4
    five   Nevada  2.9



In [9]:

    
# Using `:` to select every column
df2.ix[["two", "five"], :]









    Out[9]:

          year   state  pop
    two   2001    Ohio  1.7
    five  2002  Nevada  2.9
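
Because ix is not available in recent pandas releases (it was removed in 1.0.0), the ix selections above can be reproduced with loc. The following is a minimal sketch rather than the author's code: it rebuilds df2 from the data shown earlier, and the names by_label, mixed, and label_slice are illustrative only.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
df2 = pd.DataFrame(data,
                   columns=["year", "state", "pop"],
                   index=["one", "two", "three", "four", "five"])

# Labels on both axes, like df2.ix[["two", "three"], ["state", "pop"]]
by_label = df2.loc[["two", "three"], ["state", "pop"]]

# Row labels with positional columns: turn the positions into labels first
mixed = df2.loc[["two", "three"], df2.columns[:2]]

# Label slice on the columns; the end label "pop" is included
label_slice = df2.loc[["two", "three"], "state":"pop"]

print(by_label)
print(mixed)
print(label_slice)
```

Converting positions to labels with df2.columns[:2] keeps the whole selection label-based, which is one way to mix the two styles without ix.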

Caveat when the index has no labels

When no index labels are specified, so the index is the default integer range, ix treats an integer slice as a label slice and includes the end value.



In [10]:

    
df = pd.DataFrame(np.random.randn(5, 3))
df



In [11]:

    
df.columns = ["c1", "c2", "c3"]
df.ix[0:2, 1:2]
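
The same inclusive versus exclusive difference shows up between loc and iloc on a default integer index. This is a small sketch under the assumption that fixed values (np.arange) are acceptable in place of the random data; the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

# Default integer index 0..4, string column labels
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["c1", "c2", "c3"])

# loc reads 0:2 as labels, so the row labelled 2 is included
by_label = df.loc[0:2, "c2":"c2"]

# iloc reads 0:2 as positions, so position 2 is excluded
by_position = df.iloc[0:2, 1:2]

print(by_label.index.tolist())     # [0, 1, 2]
print(by_position.index.tolist())  # [0, 1]
```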

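The loc indexer

Label-based indexing
- Integers are interpreted as labels, not positions
- A list of labels can be used
- Label slicing can be used
- A boolean array can be used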
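
The four points above can be sketched on a small frame whose integer index is deliberately out of order, so that labels and positions disagree. The frame and the variable names are made up for illustration.

```python
import pandas as pd

# Integer labels in a shuffled order: label 0 sits in the last position
df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
                  index=[3, 1, 2, 0])

# An integer given to loc is a label: this is the row labelled 0
row = df.loc[0]

# A list of labels
picked = df.loc[[1, 2], ["A"]]

# A label slice follows the index order and includes both end labels
sliced = df.loc[3:2]

# A boolean array (here a boolean Series aligned on the index)
filtered = df.loc[df["A"] > 20, "B"]

print(row["A"])               # 40
print(sliced.index.tolist())  # [3, 1, 2]
print(filtered.tolist())      # [3, 4]
```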

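The iloc indexer

Integer-position-based indexing
- String labels cannot be used
- A list of integers can be used
- Integer slicing can be used
- A boolean array can be used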
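
A corresponding sketch for iloc, again on a made-up frame with illustrative names. One assumption worth flagging: the boolean mask is converted to a NumPy array first, because iloc does not accept a boolean Series as a mask.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
                  index=["a", "b", "c", "d"])

# A list of integer positions
rows = df.iloc[[0, 2]]

# Integer slices exclude the end position
block = df.iloc[1:3, 0:1]

# A boolean array; convert the Series to a plain array for iloc
mask = (df["A"] > 20).to_numpy()
filtered = df.iloc[mask]

print(rows.index.tolist())      # ['a', 'c']
print(block.index.tolist())     # ['b', 'c']
print(filtered.index.tolist())  # ['c', 'd']
```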



In [12]:

    
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 11, size=(4,3)), 
                  columns=["A", "B", "C"], index=["a", "b", "c", "d"])
df



In [13]:

    
df.ix[["a", "c"], "B":"C"]



In [14]:

    
df.ix[[0, 2], 1:3]



In [15]:

    
df.loc[["a", "c"], "B":"C"]



In [17]:

    
df.ix[2:4, 1:3]



In [16]:

    
df.loc[2:4, 1:3]









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-5f06278145c0> in <module>()
----> 1 df.loc[2:4, 1:3]

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1292 
   1293         if type(key) is tuple:
-> 1294             return self._getitem_tuple(key)
   1295         else:
   1296             return self._getitem_axis(key, axis=0)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
    802                 continue
    803 
--> 804             retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
    805 
    806         return retval

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1435         if isinstance(key, slice):
   1436             self._has_valid_type(key, axis)
-> 1437             return self._get_slice_axis(key, axis=axis)
   1438         elif is_bool_indexer(key):
   1439             return self._getbool_axis(key, axis=axis)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1316         labels = obj._get_axis(axis)
   1317         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1318                                        slice_obj.step, kind=self.name)
   1319 
   1320         if isinstance(indexer, slice):

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_indexer(self, start, end, step, kind)
   2783         """
   2784         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2785                                                  kind=kind)
   2786 
   2787         # return a slice

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_locs(self, start, end, step, kind)
   2962         start_slice = None
   2963         if start is not None:
-> 2964             start_slice = self.get_slice_bound(start, 'left', kind)
   2965         if start_slice is None:
   2966             start_slice = 0

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind)
   2901         # For datetime indices label may be a string that has to be converted
   2902         # to datetime boundary according to its resolution.
-> 2903         label = self._maybe_cast_slice_bound(label, side, kind)
   2904 
   2905         # we need to look up the label

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in _maybe_cast_slice_bound(self, label, side, kind)
   2859         # this is rejected (generally .loc gets you here)
   2860         elif is_integer(label):
-> 2861             self._invalid_indexer('slice', label)
   2862 
   2863         return label

C:\Anaconda3\lib\site-packages\pandas\indexes\base.py in _invalid_indexer(self, form, key)
   1123                         "indexers [{key}] of {kind}".format(
   1124                             form=form, klass=type(self), key=key,
-> 1125                             kind=type(key)))
   1126 
   1127     def get_duplicates(self):

TypeError: cannot do slice indexing on <class 'pandas.indexes.base.Index'> with these indexers [2] of <class 'int'>
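
The error above comes from passing integer slices to loc when the index holds string labels. A minimal sketch of the failure and two ways around it follows; the try/except wrapper and the variable names are additions for illustration, and the exact error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 11, size=(4, 3)),
                  columns=["A", "B", "C"], index=["a", "b", "c", "d"])

# loc expects labels, and 2 and 4 are not labels of this string index
raised = False
try:
    df.loc[2:4, 1:3]
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Positions belong to iloc
by_position = df.iloc[2:4, 1:3]

# With loc, slice by label instead; both end labels are included
by_label = df.loc["c":"d", "B":"C"]

print(by_position.equals(by_label))
```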



In [18]:

    
df.iloc[2:4, 1:3]



In [19]:

    
df.iloc[["a", "c"], "B":"C"]









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-25cc466bce07> in <module>()
----> 1 df.iloc[["a", "c"], "B":"C"]

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1292 
   1293         if type(key) is tuple:
-> 1294             return self._getitem_tuple(key)
   1295         else:
   1296             return self._getitem_axis(key, axis=0)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1542     def _getitem_tuple(self, tup):
   1543 
-> 1544         self._has_valid_tuple(tup)
   1545         try:
   1546             return self._getitem_lowerdim(tup)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    140             if i >= self.obj.ndim:
    141                 raise IndexingError('Too many indexers')
--> 142             if not self._has_valid_type(k, i):
    143                 raise ValueError("Location based indexing can only have [%s] "
    144                                  "types" % self._valid_types)

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_type(self, key, axis)
   1512             return self._is_valid_integer(key, axis)
   1513         elif is_list_like_indexer(key):
-> 1514             return self._is_valid_list_like(key, axis)
   1515         return False
   1516 

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _is_valid_list_like(self, key, axis)
   1535         ax = self.obj._get_axis(axis)
   1536         l = len(ax)
-> 1537         if len(arr) and (arr.max() >= l or arr.min() < -l):
   1538             raise IndexError("positional indexers are out-of-bounds")
   1539 

C:\Anaconda3\lib\site-packages\numpy\core\_methods.py in _amax(a, axis, out, keepdims)
     24 # small reductions
     25 def _amax(a, axis=None, out=None, keepdims=False):
---> 26     return umr_maximum(a, axis, None, out, keepdims)
     27 
     28 def _amin(a, axis=None, out=None, keepdims=False):

TypeError: cannot perform reduce with flexible type
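
iloc rejects string labels, so a selection that starts from labels has to be translated into positions first. The sketch below does this with Index.get_indexer and Index.get_loc; the names row_pos, col_start, and col_stop are illustrative, and the approach assumes the labels are unique.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 11, size=(4, 3)),
                  columns=["A", "B", "C"], index=["a", "b", "c", "d"])

# Translate row labels into integer positions
row_pos = df.index.get_indexer(["a", "c"])

# Translate the column label slice "B":"C" into a positional slice;
# add 1 because positional slices exclude the end
col_start = df.columns.get_loc("B")
col_stop = df.columns.get_loc("C") + 1

by_position = df.iloc[row_pos, col_start:col_stop]
by_label = df.loc[["a", "c"], "B":"C"]

print(row_pos.tolist())               # [0, 2]
print(by_position.equals(by_label))   # True
```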

Output of In [10] (random values, so they differ on every run):

          0         1         2
0 -0.108509 -0.733949 -0.111357
1 -0.025895 -0.621490  0.193022
2  0.857554 -0.186033  0.268976
3  1.942993 -0.371014  0.022745
4  1.220884 -0.178527 -0.743444

Output of In [11]:

         c2
0 -0.733949
1 -0.621490
2 -0.186033

Output of In [2]:

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002