Indexing and Selecting Data

This section shows how to index into and select data from pandas objects.



In [1]:

    
import numpy as np
import pandas as pd

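The Python and NumPy indexing operator [] and attribute operator '.' give quick access to pandas data.

However, these two approaches are not necessarily the most efficient in pandas. We recommend the optimized pandas data access methods introduced in this section.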

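Different Choices for Indexing

pandas supports three kinds of indexing:

.loc is label based, and can also be used with a boolean array. '.loc' accepts:
- a single label, such as 5 or 'a'. Note that 5 here is an index label, not an integer position
- a list or array of labels, such as ['a', 'b', 'c']
.iloc is primarily integer-position based (from 0 to length-1 of the axis), and can also be used with a boolean array. It raises IndexError when a requested indexer is out of bounds, except for slice indexers, which are allowed to run past the end.
.ix supports mixed label and integer-position access. It is label based by default and is the most general indexer: it accepts every input that .loc and .iloc accept. When the index is purely label based or purely integer based, prefer .loc or .iloc. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0.

Taking .loc as an example, the usage is: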

Object Type | Indexers
Series | s.loc[indexer]
DataFrame | df.loc[row_indexer, column_indexer]
Panel | p.loc[item_indexer, major_indexer, minor_indexer]
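The indexer forms for Series and DataFrame can be tried on a small object whose integer labels deliberately differ from the positions. This is a minimal sketch; the names s_demo and df_demo and their data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the integer labels run in the opposite
# direction to the positions, so label and position lookups differ.
s_demo = pd.Series([10, 20, 30], index=[2, 1, 0])
df_demo = pd.DataFrame(np.arange(6).reshape(3, 2),
                       index=[2, 1, 0], columns=['x', 'y'])

by_label = s_demo.loc[0]       # label 0 is the last element
by_position = s_demo.iloc[0]   # position 0 is the first element
cell = df_demo.loc[1, 'y']     # row label 1, column label 'y'
```

With .loc the integer is looked up in the index, so by_label and by_position refer to different elements.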

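Basics

The most basic way to select data is the [] operator:

Object Type | Selection | Return Value Type
Series | series[label], where label is an index label | scalar value
DataFrame | frame[colname], using a column name | Series corresponding to colname
Panel | panel[itemname] | DataFrame corresponding to itemname

The examples below illustrate this.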



In [62]:

    
dates = pd.date_range('1/1/2000', periods=8)
dates









    Out[62]:





DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')



In [63]:

    
df  = pd.DataFrame(np.random.randn(8,4), index=dates, columns=list('ABCD'))
df



In [64]:

    
panel = pd.Panel({'one':df, 'two':df-df.mean()})
panel









    Out[64]:





<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 8 (major_axis) x 4 (minor_axis)
Items axis: one to two
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-08 00:00:00
Minor_axis axis: A to D



Using the basic [] operator:



In [5]:

    
s = df['A']  # select by column name
s  # the result is a Series









    Out[5]:





2000-01-01   -0.562650
2000-01-02   -1.062558
2000-01-03   -1.194126
2000-01-04    0.936506
2000-01-05   -1.196422
2000-01-06    1.436726
2000-01-07    0.329280
2000-01-08    0.857815
Freq: D, Name: A, dtype: float64

A Series is indexed by its index labels:



In [6]:

    
s[dates[5]]  # select by index label









    Out[6]:





1.436726247472784



In [7]:

    
panel['two']

You can also pass a list of column names to [], as in df[[col1, col2]]. If any of the listed column names does not exist, a KeyError is raised.



In [8]:

    
df



In [9]:

    
df[['B', 'A']] = df[['A', 'B']]
df
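The failure case for a missing column name is easy to check. The sketch below uses a hypothetical two-column frame, df_cols, rather than the df from the cells above.

```python
import pandas as pd

# Hypothetical frame for illustration only.
df_cols = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# A list of existing column names returns them in the requested order.
subset = df_cols[['B', 'A']]

# A list containing a name that does not exist raises KeyError.
try:
    df_cols[['A', 'Z']]
    missing_raised = False
except KeyError:
    missing_raised = True
```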



Attribute Access

A Series index label, a DataFrame column, or a Panel item can be accessed directly as an attribute of the object:



In [10]:

    
sa = pd.Series([1,2,3],index=list('abc'))
dfa = df.copy()



In [11]:

    
sa









    Out[11]:





a    1
b    2
c    3
dtype: int64



In [12]:

    
sa.b  # access an index label as an attribute









    Out[12]:





2



In [13]:

    
dfa



In [14]:

    
dfa.A









    Out[14]:





2000-01-01   -1.226827
2000-01-02    0.772811
2000-01-03    0.502868
2000-01-04   -0.758176
2000-01-05   -1.276918
2000-01-06   -0.048258
2000-01-07    0.368375
2000-01-08   -1.992648
Freq: D, Name: A, dtype: float64



In [15]:

    
panel.one




In [16]:

    
sa









    Out[16]:





a    1
b    2
c    3
dtype: int64



In [17]:

    
sa.a = 5
sa









    Out[17]:





a    5
b    2
c    3
dtype: int64






In [19]:

    
dfa.A=list(range(len(dfa.index))) # ok if A already exists



In [20]:

    
dfa



In [21]:

    
dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column
dfa



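Note: attribute access and [] differ in one respect:

A new column can only be created with [].

An attribute refers to something that already exists, so a column name that does not exist yet is not an attribute.

You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it fails silently, creating a new attribute rather than a new column.

Caveats for attribute access:

If a column name is the same as an existing method name (for example min), the attribute resolves to the method, not the column.
In short, attribute access covers fewer cases than [].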
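Both caveats can be reproduced on a tiny frame. This is a sketch with made-up names (df_attr, total); the warning filter is there because newer pandas versions emit a warning when a list-like is assigned to a new attribute.

```python
import warnings
import pandas as pd

# Hypothetical frame; one column is deliberately named like a method.
df_attr = pd.DataFrame({'A': [1, 2, 3], 'min': [7, 8, 9]})

# Assigning to a new attribute does NOT create a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df_attr.total = [0, 0, 0]
created_column = 'total' in df_attr.columns

# A column that shares its name with a method is shadowed by the method.
is_method = callable(df_attr.min)
col = df_attr['min']           # [] still reaches the column
```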



Slicing Ranges

Ranges can be sliced with [] or with .iloc. This part covers [] first.

For a Series, slicing with [] works just like slicing an ndarray:



In [22]:

    
s









    Out[22]:





2000-01-01   -1.226827
2000-01-02    0.772811
2000-01-03    0.502868
2000-01-04   -0.758176
2000-01-05   -1.276918
2000-01-06   -0.048258
2000-01-07    0.368375
2000-01-08   -1.992648
Freq: D, Name: A, dtype: float64



In [23]:

    
s[:5]









    Out[23]:





2000-01-01   -1.226827
2000-01-02    0.772811
2000-01-03    0.502868
2000-01-04   -0.758176
2000-01-05   -1.276918
Freq: D, Name: A, dtype: float64



In [24]:

    
s[::2]









    Out[24]:





2000-01-01   -1.226827
2000-01-03    0.502868
2000-01-05   -1.276918
2000-01-07    0.368375
Freq: 2D, Name: A, dtype: float64




In [25]:

    
s[::-1]









    Out[25]:





2000-01-08   -1.992648
2000-01-07    0.368375
2000-01-06   -0.048258
2000-01-05   -1.276918
2000-01-04   -0.758176
2000-01-03    0.502868
2000-01-02    0.772811
2000-01-01   -1.226827
Freq: -1D, Name: A, dtype: float64

[] can be used for assignment as well as selection:



In [65]:

    
s2 = s.copy()



In [66]:

    
s2[:5] = 0  # assign through a slice



In [67]:

    
s2









    Out[67]:





0    0
1    0
2    0
3    0
4    0
5    5
dtype: int64

For a DataFrame, slicing inside [] slices the rows, which is very convenient.



In [68]:

    
df[:3]



In [69]:

    
df[::-1]



Selection By Label

Warning:

.loc is strict about the type of the indexer: when a slice bound is not compatible with the type of the index, a TypeError is raised.



In [70]:

    
df1 = pd.DataFrame(np.random.rand(5,4), columns=list('ABCD'), index=pd.date_range('20160101',periods=5))
df1



In [71]:

    
df1.loc[2:3]









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-71-f18421f8c4a7> in <module>()
----> 1 df1.loc[2:3]

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1284             return self._getitem_tuple(key)
   1285         else:
-> 1286             return self._getitem_axis(key, axis=0)
   1287 
   1288     def _getitem_axis(self, key, axis=0):

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1398         if isinstance(key, slice):
   1399             self._has_valid_type(key, axis)
-> 1400             return self._get_slice_axis(key, axis=axis)
   1401         elif is_bool_indexer(key):
   1402             return self._getbool_axis(key, axis=axis)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _get_slice_axis(self, slice_obj, axis)
   1306         labels = obj._get_axis(axis)
   1307         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1308                                        slice_obj.step, kind=self.name)
   1309 
   1310         if isinstance(indexer, slice):

c:\python27\lib\site-packages\pandas\tseries\index.pyc in slice_indexer(self, start, end, step, kind)
   1503 
   1504         try:
-> 1505             return Index.slice_indexer(self, start, end, step, kind=kind)
   1506         except KeyError:
   1507             # For historical reasons DatetimeIndex by default supports

c:\python27\lib\site-packages\pandas\indexes\base.pyc in slice_indexer(self, start, end, step, kind)
   2698         """
   2699         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2700                                                  kind=kind)
   2701 
   2702         # return a slice

c:\python27\lib\site-packages\pandas\indexes\base.pyc in slice_locs(self, start, end, step, kind)
   2877         start_slice = None
   2878         if start is not None:
-> 2879             start_slice = self.get_slice_bound(start, 'left', kind)
   2880         if start_slice is None:
   2881             start_slice = 0

c:\python27\lib\site-packages\pandas\indexes\base.pyc in get_slice_bound(self, label, side, kind)
   2816         # For datetime indices label may be a string that has to be converted
   2817         # to datetime boundary according to its resolution.
-> 2818         label = self._maybe_cast_slice_bound(label, side, kind)
   2819 
   2820         # we need to look up the label

c:\python27\lib\site-packages\pandas\tseries\index.pyc in _maybe_cast_slice_bound(self, label, side, kind)
   1458 
   1459         if is_float(label) or isinstance(label, time) or is_integer(label):
-> 1460             self._invalid_indexer('slice', label)
   1461 
   1462         if isinstance(label, compat.string_types):

c:\python27\lib\site-packages\pandas\indexes\base.pyc in _invalid_indexer(self, form, key)
   1115                         "indexers [{key}] of {kind}".format(
   1116                             form=form, klass=type(self), key=key,
-> 1117                             kind=type(key)))
   1118 
   1119     def get_duplicates(self):

TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <type 'int'>

Slicing with strings works fine:



In [72]:

    
df1.loc['20160102':'20160104']

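As you may have noticed, the row with index '20160104' is included as well. With loc, a slice covers the closed interval [start, end].

Integers are valid labels, but remember that they then refer to labels, not to positions in the index.

.loc is the primary label-based access method. The following inputs are all valid:

A single label, such as 5 or 'a'. Remember that 5 is interpreted as a label of the index, not as a position in it.
A list or array of labels, such as ['a', 'b', 'c']
A slice with labels, such as 'a':'f'. Note that both endpoints are included.
A boolean array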
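The four kinds of input can be seen side by side on a Series with integer labels. This is a minimal sketch; s_loc and its values are made up for illustration, and the last line contrasts the closed label slice with a half-open positional slice.

```python
import pandas as pd

# Hypothetical Series whose integer labels are not positions.
s_loc = pd.Series(['a', 'b', 'c', 'd'], index=[10, 20, 30, 40])

single = s_loc.loc[10]                 # a single label
several = s_loc.loc[[40, 10]]          # a list of labels, in that order
closed = s_loc.loc[10:30]              # label slice, both ends included
masked = s_loc.loc[s_loc.index > 20]   # boolean array

half_open = s_loc.iloc[0:2]            # positional slice excludes the end
```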




In [73]:

    
s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
s1









    Out[73]:





a    1.270268
b    1.015481
c    0.380879
d    0.965170
e   -0.218055
f    0.224802
dtype: float64



In [74]:

    
s1.loc['c':]









    Out[74]:





c    0.380879
d    0.965170
e   -0.218055
f    0.224802
dtype: float64



In [75]:

    
s1.loc['b']









    Out[75]:





1.0154808822674235



loc supports assignment as well:



In [76]:

    
s1.loc['c':]=0
s1









    Out[76]:





a    1.270268
b    1.015481
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64



Now a DataFrame example:



In [77]:

    
df1 = pd.DataFrame(np.random.randn(6,4), index=list('abcdef'),columns=list('ABCD'))
df1



In [78]:

    
df1.loc[['a','b','c','d'],:]



In [79]:

    
df1.loc[['a','b','c','d']]  # the trailing ':' can be omitted



Selecting with slices:



In [80]:

    
df1.loc['d':,'A':'C']  # both endpoints are included



In [81]:

    
df1.loc['a']









    Out[81]:





A    0.500635
B    2.515980
C    0.968653
D   -0.764951
Name: a, dtype: float64

Selecting with a boolean array:



In [82]:

    
df1.loc['a']>0









    Out[82]:





A     True
B     True
C     True
D    False
Name: a, dtype: bool



In [83]:

    
df1.loc[:,df1.loc['a']>0]

Getting a single value, equivalent to df1.get_value('a','A') (get_value was removed in pandas 1.0; df1.at['a','A'] is the equivalent there):



In [84]:

    
df1.loc['a','A']









    Out[84]:





0.50063542438780895



In [85]:

    
df1.get_value('a','A')









    Out[85]:





0.50063542438780895

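Selection By Position

pandas provides a suite of methods for purely integer-based indexing. The semantics closely follow Python and NumPy slicing: positions are 0-based, and a slice covers the half-open interval [start, end). Using a non-integer, even a valid label, as a position raises an error (IndexError according to the pandas documentation of this version).

Valid inputs for .iloc are:

An integer, such as 5
A list or array of integers, such as [4, 3, 0]
A slice with integers, such as 1:7
A boolean array

A Series example using iloc: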



In [86]:

    
s1 = pd.Series(np.random.randn(5),index=list(range(0,10,2)))
s1









    Out[86]:





0   -0.280654
2   -0.687606
4   -1.195345
6   -0.384770
8   -0.590466
dtype: float64



In [87]:

    
s1.iloc[:3]  # half-open interval: position 3 is excluded









    Out[87]:





0   -0.280654
2   -0.687606
4   -1.195345
dtype: float64



In [88]:

    
s1.iloc[3]









    Out[88]:





-0.38477022333948063

iloc supports assignment too:



In [89]:

    
s1.iloc[:3]=0
s1









    Out[89]:





0    0.000000
2    0.000000
4    0.000000
6   -0.384770
8   -0.590466
dtype: float64



A DataFrame example:



In [90]:

    
df1 = pd.DataFrame(np.random.randn(6,4),index=list(range(0,12,2)), columns=list(range(0,8,2)))
df1



In [91]:

    
df1.iloc[:3]

Selecting rows and columns together:



In [92]:

    
df1.iloc[1:5,2:4]



In [93]:

    
df1.iloc[[1,3,5],[1,2]]



In [94]:

    
df1.iloc[1:3,:]



In [95]:

    
df1.iloc[:,1:3]



In [96]:

    
df1.iloc[1,1]  # select a single element









    Out[96]:





-0.36625813479137037

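Note the difference between the next two examples: a single integer returns a Series, while a one-row slice returns a DataFrame.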



In [97]:

    
df1.iloc[1]









    Out[97]:





0   -1.256285
2   -0.366258
4   -0.980229
6   -1.377265
Name: 2, dtype: float64



In [98]:

    
df1.iloc[1:2]



Out-of-range slice indexes are handled gracefully, just as in Python/NumPy, provided the pandas version is at least v0.14.0.

Note: this applies to slices only.



In [99]:

    
x = list('abcdef')
x









    Out[99]:





['a', 'b', 'c', 'd', 'e', 'f']



In [100]:

    
x[4:10]  # x has length 6









    Out[100]:





['e', 'f']



In [101]:

    
x[8:10]









    Out[101]:





[]




In [102]:

    
s = pd.Series(x)



In [103]:

    
s









    Out[103]:





0    a
1    b
2    c
3    d
4    e
5    f
dtype: object



In [104]:

    
s.iloc[4:10]









    Out[104]:





4    e
5    f
dtype: object



In [105]:

    
s.iloc[8:10]









    Out[105]:





Series([], dtype: object)




In [106]:

    
df1 = pd.DataFrame(np.random.randn(5,2), columns=list('AB'))
df1



In [107]:

    
df1.iloc[:,2:3]



In [108]:

    
df1.iloc[:,1:3]



In [109]:

    
df1.iloc[4:6]

As noted above, this graceful handling is limited to slices. An out-of-bounds list or integer raises IndexError:



In [110]:

    
df1.iloc[[4,5,6]]









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-110-496782e5248f> in <module>()
----> 1 df1.iloc[[4,5,6]]

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1284             return self._getitem_tuple(key)
   1285         else:
-> 1286             return self._getitem_axis(key, axis=0)
   1287 
   1288     def _getitem_axis(self, key, axis=0):

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1558 
   1559                 # validate list bounds
-> 1560                 self._is_valid_list_like(key, axis)
   1561 
   1562                 # force an actual list

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _is_valid_list_like(self, key, axis)
   1497         l = len(ax)
   1498         if len(arr) and (arr.max() >= l or arr.min() < -l):
-> 1499             raise IndexError("positional indexers are out-of-bounds")
   1500 
   1501         return True

IndexError: positional indexers are out-of-bounds

Mixing a slice with an out-of-bounds integer fails in the same way:



In [111]:

    
df1.iloc[:,4]









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-111-a87dc3221dcf> in <module>()
----> 1 df1.iloc[:,4]

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1282     def __getitem__(self, key):
   1283         if type(key) is tuple:
-> 1284             return self._getitem_tuple(key)
   1285         else:
   1286             return self._getitem_axis(key, axis=0)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
   1503     def _getitem_tuple(self, tup):
   1504 
-> 1505         self._has_valid_tuple(tup)
   1506         try:
   1507             return self._getitem_lowerdim(tup)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _has_valid_tuple(self, key)
    136             if i >= self.obj.ndim:
    137                 raise IndexingError('Too many indexers')
--> 138             if not self._has_valid_type(k, i):
    139                 raise ValueError("Location based indexing can only have [%s] "
    140                                  "types" % self._valid_types)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _has_valid_type(self, key, axis)
   1471             return True
   1472         elif is_integer(key):
-> 1473             return self._is_valid_integer(key, axis)
   1474         elif is_list_like_indexer(key):
   1475             return self._is_valid_list_like(key, axis)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _is_valid_integer(self, key, axis)
   1485         l = len(ax)
   1486         if key >= l or key < -l:
-> 1487             raise IndexError("single positional indexer is out-of-bounds")
   1488         return True
   1489 

IndexError: single positional indexer is out-of-bounds

Selecting Random Samples

The sample() method randomly selects rows or columns from a Series, DataFrame, or Panel. By default it samples rows, and it accepts either a number of rows (n) or a fraction of rows (frac).



In [112]:

    
s = pd.Series([0,1,2,3,4,5])
s









    Out[112]:





0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64



In [113]:

    
s.sample()









    Out[113]:





0    0
dtype: int64



In [114]:

    
s.sample(n=6)









    Out[114]:





5    5
1    1
3    3
0    0
2    2
4    4
dtype: int64



In [115]:

    
s.sample(3)  # a bare integer is taken as n









    Out[115]:





4    4
1    1
0    0
dtype: int64

A fraction can be given instead. sample() then selects N*frac items, rounded to the nearest integer:



In [116]:

    
s.sample(frac=0.5)









    Out[116]:





5    5
2    2
4    4
dtype: int64



In [117]:

    
s.sample(0.5)  # fails: a fraction must be passed as frac=0.5









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-117-24c34e460c2e> in <module>()
----> 1 s.sample(0.5)  # fails: a fraction must be passed as frac=0.5

c:\python27\lib\site-packages\pandas\core\generic.pyc in sample(self, n, frac, replace, weights, random_state, axis)
   2555             n = 1
   2556         elif n is not None and frac is None and n % 1 != 0:
-> 2557             raise ValueError("Only integers accepted as `n` values")
   2558         elif n is None and frac is not None:
   2559             n = int(round(frac * axis_length))

ValueError: Only integers accepted as `n` values



In [118]:

    
s.sample(frac=0.8)  # 6*0.8 = 4.8, rounds to 5









    Out[118]:





5    5
4    4
0    0
2    2
1    1
dtype: int64



In [119]:

    
s.sample(frac=0.7)  # 6*0.7 = 4.2, rounds to 4









    Out[119]:





0    0
5    5
4    4
2    2
dtype: int64

sample() samples without replacement by default. Pass replace=True to sample with replacement.



In [120]:

    
s









    Out[120]:





0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64



In [121]:

    
s.sample(n=6,replace=False)









    Out[121]:





1    1
3    3
5    5
4    4
2    2
0    0
dtype: int64



In [122]:

    
s.sample(6,replace=True)









    Out[122]:





5    5
5    5
5    5
5    5
3    3
1    1
dtype: int64

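By default each row/column has an equal probability of being selected. To give rows different sampling probabilities, pass them through the weights parameter.

Note: if the weights do not sum to 1, pandas normalizes them so that they do.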
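The normalization can be checked with weights that clearly do not sum to 1. This is a sketch with a hypothetical Series s_w; random_state is fixed only to make the draw repeatable, and the outcomes below are forced by the zero weights rather than by the seed.

```python
import pandas as pd

s_w = pd.Series([0, 1, 2, 3, 4, 5])

# Weights sum to 10, not 1. After normalization all of the
# probability mass sits on the last row, so it is always chosen.
picked = s_w.sample(n=1, weights=[0, 0, 0, 0, 0, 10], random_state=0)

# Two equal non-zero weights and n=2 without replacement:
# both weighted rows must be drawn, in some order.
pair = s_w.sample(n=2, weights=[1, 1, 0, 0, 0, 0], random_state=0)
```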



In [123]:

    
s = pd.Series([0,1,2,3,4,5])
s









    Out[123]:





0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64



In [124]:

    
example_weights=[0,0,0.2,0.2,0.2,0.4]



In [125]:

    
s.sample(n=3,weights=example_weights)









    Out[125]:





2    2
5    5
3    3
dtype: int64



In [126]:

    
example_weights2 = [0.5, 0, 0, 0, 0, 0]



In [127]:

    
s.sample(n=1, weights=example_weights2)









    Out[127]:





0    0
dtype: int64



In [128]:

    
s.sample(n=2, weights=example_weights2)  # n > 1 fails: only one weight is non-zero









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-128-933f0356437e> in <module>()
----> 1 s.sample(n=2, weights=example_weights2)  # n > 1 fails: only one weight is non-zero

c:\python27\lib\site-packages\pandas\core\generic.pyc in sample(self, n, frac, replace, weights, random_state, axis)
   2567                              "provide positive value.")
   2568 
-> 2569         locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
   2570         return self.take(locs, axis=axis, is_copy=False)
   2571 

mtrand.pyx in mtrand.RandomState.choice (numpy\random\mtrand\mtrand.c:16370)()

ValueError: Fewer non-zero entries in p than size

Note: because sample() draws without replacement by default, n must not exceed the number of rows unless replace=True.



In [129]:

    
s









    Out[129]:





0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64



In [130]:

    
s.sample(7)  # fails: only 6 rows are available









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-130-d71dc28cbaa3> in <module>()
----> 1 s.sample(7)  # fails: only 6 rows are available

c:\python27\lib\site-packages\pandas\core\generic.pyc in sample(self, n, frac, replace, weights, random_state, axis)
   2567                              "provide positive value.")
   2568 
-> 2569         locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
   2570         return self.take(locs, axis=axis, is_copy=False)
   2571 

mtrand.pyx in mtrand.RandomState.choice (numpy\random\mtrand\mtrand.c:16292)()

ValueError: Cannot take a larger sample than population when 'replace=False'



In [131]:

    
s.sample(7,replace=True)









    Out[131]:





3    3
1    1
4    4
2    2
2    2
1    1
1    1
dtype: int64

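For weighted sampling from a DataFrame, a simple approach is to store the weight of each row in a column and pass the name of that column as weights: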



In [134]:

    
df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
df2









    Out[134]:






  
    
      
   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1
3     6            0.0



In [135]:

    
df2.sample(n=3,weights='weight_column')









    Out[135]:






  
    
      
   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

Sampling columns instead of rows, with axis=1:



In [136]:

    
df3 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df3



In [137]:

    
df3.sample(1,axis=1)

The random_state parameter seeds the random number generator used inside sample().



In [138]:

    
df4 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df4

Note that the next two examples produce the same output, because they use the same seed.



In [139]:

    
df4.sample(n=2, random_state=2)



In [140]:

    
df4.sample(n=2,random_state=2)



In [141]:

    
df4.sample(n=2,random_state=3)



Setting With Enlargement

Assigning to a key that does not exist through .loc/.ix/[] adds a new element to the object under that key.

For a Series, this is effectively an append operation.



In [143]:

    
se = pd.Series([1,2,3])
se









    Out[143]:





0    1
1    2
2    3
dtype: int64



In [144]:

    
se[5]=5
se









    Out[144]:





0    1
1    2
2    3
5    5
dtype: int64

A DataFrame can be enlarged along either axis:



In [145]:

    
dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
dfi



In [148]:

    
dfi.loc[:,'C'] = dfi.loc[:,'A']  # enlarge along the columns



In [149]:

    
dfi



In [152]:

    
dfi.loc[3] = 5  # enlarge along the rows



In [153]:

    
dfi

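Fast Scalar Value Getting and Setting

[] has to handle many cases, so it carries overhead when all you want is a single element. pandas provides the at and iat accessors for fast scalar access on Series, DataFrame, and Panel.

Like loc, at takes labels; like iloc, iat takes integer positions.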



In [154]:

    
s.iat[5]









    Out[154]:





5



In [155]:

    
df.at[dates[5],'A']









    Out[155]:





0.24985887518963862



In [156]:

    
df.iat[3,0]









    Out[156]:





-0.79862582029106743

Assignment works too:



In [157]:

    
df.at[dates[-1] + pd.Timedelta(days=1), 0] = 7
df

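Boolean Indexing

Another common operation is filtering data with boolean vectors. The operators are | (or), & (and), and ~ (not).

Note: each operand must be wrapped in parentheses, because | and & bind more tightly than comparisons such as < and >.

Using a boolean vector to index a Series works exactly as with a NumPy ndarray.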



In [159]:

    
s = pd.Series(range(-3, 4))
s









    Out[159]:





0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64



In [160]:

    
s[s>0]









    Out[160]:





4    1
5    2
6    3
dtype: int64



In [161]:

    
s[(s<-1) | (s>0.5)]









    Out[161]:





0   -3
1   -2
4    1
5    2
6    3
dtype: int64



In [162]:

    
s[~(s<0)]









    Out[162]:





3    0
4    1
5    2
6    3
dtype: int64
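The need for parentheses can be made concrete. In the sketch below (s_bool is a made-up name), the unparenthesized form is parsed as a chained comparison around -2 & s_bool, which forces pandas to evaluate the truth value of a whole Series and fails.

```python
import pandas as pd

s_bool = pd.Series(range(-3, 4))

# Correct: each comparison is parenthesized before combining with &.
ok = s_bool[(s_bool > -2) & (s_bool < 2)]

# Incorrect: & is applied first, i.e. s_bool > (-2 & s_bool) < 2,
# and the chained comparison cannot reduce a Series to one bool.
try:
    s_bool[s_bool > -2 & s_bool < 2]
    failed = False
except (ValueError, TypeError):
    failed = True
```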

A DataFrame example:



In [165]:

    
df[df['A'] > 0]

List comprehensions and the map method of Series can produce more complex criteria.



In [169]:

    
df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                     'c' : np.random.randn(7)})
df2



In [170]:

    
criterion = df2['a'].map(lambda x:x.startswith('t'))



In [171]:

    
df2[criterion]



In [172]:

    
df2[[x.startswith('t') for x in df2['a']]]



In [173]:

    
df2[criterion & (df2['b'] == 'x')]

Combined with loc, iloc, and similar accessors, boolean vectors can select along more than one axis:



In [174]:

    
df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']

Indexing with isin

The name isin stands for "is in".

The isin method of a Series takes a list and returns a boolean vector that is True wherever the Series element is present in the list. An example makes this clearer:



In [175]:

    
s = pd.Series(np.arange(5), index=np.arange(5)[::-1],dtype='int64')



In [176]:

    
s









    Out[176]:





4    0
3    1
2    2
1    3
0    4
dtype: int64



In [177]:

    
s.isin([2,4,6])









    Out[177]:





4    False
3    False
2     True
1    False
0     True
dtype: bool



In [178]:

    
s[s.isin([2,4,6])]









    Out[178]:





2    2
0    4
dtype: int64

Index objects have an isin method as well.



In [179]:

    
s[s.index.isin([2,4,6])]









    Out[179]:





4    0
2    2
dtype: int64



In [180]:

    
s[[2,4,6]]









    Out[180]:





2    2.0
4    0.0
6    NaN
dtype: float64



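DataFrame also has an isin method, which accepts an array or a dict. With an array, every column is checked against the same values; with a dict, each key names a column and its value lists the items to check in that column. The examples show the difference: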



In [182]:

    
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
                   'ids2':['a', 'n', 'c', 'n']})
df



In [183]:

    
values=['a', 'b', 1, 3]



In [184]:

    
df.isin(values)









    Out[184]:






  
    
      
     ids   ids2   vals
0   True   True   True
1   True  False  False
2  False  False   True
3  False  False  False

Passing a dict:



In [185]:

    
values = {'ids': ['a', 'b'], 'vals': [1, 3]}



In [186]:

    
df.isin(values)









    Out[186]:






  
    
      
     ids   ids2   vals
0   True  False   True
1   True  False  False
2  False  False   True
3  False  False  False

Combining isin with any() and all() lets you quickly select subsets of a DataFrame, for example the rows in which every column meets its own criterion:



In [187]:

    
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}



In [188]:

    
row_mark = df.isin(values).all(axis=1)



In [189]:

    
df[row_mark]



In [198]:

    
row_mark = df.isin(values).any(axis=1)



In [199]:

    
df[row_mark]



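The where() Method and Masking

Selecting from a Series with a boolean vector generally returns a subset of the data. To get a result with the same shape as the original, use the where method.

Selecting from a DataFrame with a boolean criterion already preserves the shape of the input, because it is implemented with where under the hood.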



In [205]:

    
s[s>0]









    Out[205]:





3    1
2    2
1    3
0    4
dtype: int64

Using the where method:



In [206]:

    
s.where(s>0)









    Out[206]:





4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64



In [207]:

    
df[df<0]



In [208]:

    
df.where(df<0)

where also takes an optional other argument that replaces the values where the condition is False. The original object is not modified.



In [214]:

    
df.where(df<0, 2)



In [211]:

    
df



In [215]:

    
df.where(df<0, df)  # pass df itself as other

You may want to set values based on a boolean criterion. An intuitive way to do this is:



In [216]:

    
s2 = s.copy()
s2









    Out[216]:





4    0
3    1
2    2
1    3
0    4
dtype: int64



In [217]:

    
s2[s2<0]=0



In [218]:

    
s2









    Out[218]:





4    0
3    1
2    2
1    3
0    4
dtype: int64

By default where returns a modified copy and leaves the original object untouched. To modify the original object in place, set the inplace parameter to True.



In [223]:

    
df = pd.DataFrame(np.random.randn(6,5), index=list('abcdef'), columns=list('ABCDE'))
df_orig = df.copy()



In [226]:

    
df_orig.where(df < 0, -df, inplace=True);



In [227]:

    
df_orig

Alignment

where aligns the boolean condition with the data, which allows partial selection when setting values.



In [231]:

    
df2 = df.copy()



In [232]:

    
df2[df2[1:4] >0]=3



In [233]:

    
df2



In [234]:

    
df2 = df.copy()



In [235]:

    
df2.where(df2>0, df2['A'], axis='index')

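mask

mask is the inverse boolean operation of where: it replaces the values where the condition is True.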



In [236]:

    
s.mask(s>=0)









    Out[236]:





4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64



In [237]:

    
df.mask(df >= 0)
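The relationship between where and mask can be verified directly: masking with a condition gives the same result as where with the negated condition. This is a small sketch on a hypothetical Series s_m.

```python
import pandas as pd

s_m = pd.Series([-2, -1, 0, 1, 2])

kept_by_where = s_m.where(s_m > 0)     # NaN where the condition is False
kept_by_mask = s_m.mask(~(s_m > 0))    # NaN where the condition is True

# Like where, mask accepts a replacement value as its second argument.
replaced = s_m.mask(s_m < 0, 0)        # negatives become 0
```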

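The query() Method (Experimental)

DataFrame objects have a query method that selects rows with an expression.

For example, you can select the rows in which the value of column 'b' lies between the values of columns 'a' and 'c'.

Note: numexpr needs to be installed.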



In [7]:

    
n = 10



In [8]:

    
df = pd.DataFrame(np.random.randn(n, 3), columns=list('abc'))



In [9]:

    
df



In [10]:

    
df[(df.a<df.b) & (df.b<df.c)]



In [11]:

    
df.query('(a < b) & (b < c)')  # the same selection written as a query

MultiIndex query() Syntax

For a DataFrame with a MultiIndex, the index levels can be used in a query as if they were column names.



In [14]:

    
n = 10



In [15]:

    
colors = np.random.choice(['red', 'green'], size=n)



In [16]:

    
foods = np.random.choice(['eggs', 'ham'], size=n)



In [17]:

    
colors









    Out[17]:





array(['green', 'red', 'red', 'red', 'green', 'red', 'green', 'red',
       'green', 'red'], 
      dtype='|S5')



In [18]:

    
foods









    Out[18]:





array(['ham', 'eggs', 'ham', 'ham', 'ham', 'ham', 'ham', 'eggs', 'ham',
       'eggs'], 
      dtype='|S4')



In [19]:

    
index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])



In [20]:

    
df = pd.DataFrame(np.random.randn(n,2), index=index)



In [21]:

    
df



In [22]:

    
df.query('color == "red"')

If the levels of the index are unnamed, you can refer to them with special names:



In [23]:

    
df.index.names = [None, None]



In [24]:

    
df



In [25]:

    
df.query('ilevel_0 == "red"')

ilevel_0 means "index level 0".

query() Use Cases

One use case for query() is a collection of DataFrame objects that share a subset of column names: the same query can be applied to every frame in the collection.



In [26]:

    
df = pd.DataFrame(np.random.randn(n, 3), columns=list('abc'))
df



In [28]:

    
df2 = pd.DataFrame(np.random.randn(n+2, 3), columns=df.columns)
df2



In [29]:

    
expr = '0.0 <= a <= c <= 0.5'



In [30]:

    
list(map(lambda frame: frame.query(expr), [df, df2]))  # list() is needed on Python 3, where map is lazy









    Out[30]:





[Empty DataFrame
 Columns: [a, b, c]
 Index: [], Empty DataFrame
 Columns: [a, b, c]
 Index: []]

query() Python versus pandas Syntax Comparison



In [31]:

    
df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
df



In [32]:

    
df.query('(a<b) &(b<c)')



In [33]:

    
df[(df.a < df.b) & (df.b < df.c)]

Inside query() the parentheses can be dropped, and the word and can replace the & operator:



In [35]:

    
df.query('a < b & b < c')



In [36]:

    
df.query('a<b and b<c')

The in and not in operators

query() also supports the Python in and not in operators, which call isin under the hood.



In [37]:

    
df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
                  'c': np.random.randint(5, size=12),
                  'd': np.random.randint(9, size=12)})



In [38]:

    
df



In [39]:

    
df.query('a in b')



In [40]:

    
df[df.a.isin(df.b)]



In [41]:

    
df[~df.a.isin(df.b)]



In [42]:

    
df.query('a in b and c < d')  # a more complex example



In [43]:

    
df[df.a.isin(df.b) & (df.c < df.d)]  # pure Python syntax

Special use of the == operator with list objects

Comparing a column with a list using ==/!= works like in/not in.

The three forms are functionally equivalent: ==/!=, in/not in, and isin()/~isin().



In [49]:

    
df.query('b==["a", "b", "c"]')



In [50]:

    
df[df.b.isin(["a", "b", "c"])]  # pure Python syntax



In [51]:

    
df.query('c == [1, 2]')



In [52]:

    
df.query('c != [1, 2]')



In [53]:

    
df.query('[1, 2] in c')  # using in



In [54]:

    
df.query('[1, 2] not in c')



In [55]:

    
df[df.c.isin([1, 2])]  # pure Python syntax

Boolean Operators

A boolean expression can be negated with not or ~.



In [56]:

    
df = pd.DataFrame(np.random.randn(n, 3), columns=list('abc'))
df



In [57]:

    
df['bools']=np.random.randn(len(df))>0.5



In [58]:

    
df



In [59]:

    
df.query('bools')



In [60]:

    
df.query('not bools')



In [61]:

    
df.query('not bools') == df[~df.bools]









    Out[61]:






  
    
      
      a     b     c  bools
3  True  True  True   True
4  True  True  True   True
5  True  True  True   True
6  True  True  True   True
7  True  True  True   True
8  True  True  True   True
9  True  True  True   True

Expressions can be arbitrarily complex:



In [62]:

    
shorter = df.query('a<b<c and (not bools) or bools>2')
shorter



In [63]:

    
longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]
longer



In [64]:

    
shorter == longer









    Out[64]:






  
    
      
      a     b     c  bools
4  True  True  True   True

Performance of query()

DataFrame.query() uses numexpr under the hood, so it can be faster than plain Python boolean indexing, especially when the DataFrame is very large.
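Whether query() is actually faster depends on the frame size, the expression, and whether numexpr is installed, so it is worth measuring on your own data. The sketch below is one way to do that under assumed conditions (a hypothetical 100,000-row frame named big); it first confirms that both forms select the same rows and then times each, without asserting which one wins.

```python
import timeit
import numpy as np
import pandas as pd

# Hypothetical test frame; the seed only makes the run repeatable.
rng = np.random.RandomState(0)
big = pd.DataFrame(rng.randn(100000, 3), columns=list('abc'))

via_query = big.query('a < b and b < c')
via_mask = big[(big.a < big.b) & (big.b < big.c)]
same = via_query.equals(via_mask)      # both forms select the same rows

# Total seconds for 5 runs of each form; compare them on your machine.
t_query = timeit.timeit(lambda: big.query('a < b and b < c'), number=5)
t_mask = timeit.timeit(
    lambda: big[(big.a < big.b) & (big.b < big.c)], number=5)
```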



In [ ]:

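Duplicate Data

If you want to identify and remove duplicate rows in a DataFrame, pandas provides two methods: duplicated and drop_duplicates. Both accept the column names to check as an argument.

duplicated returns a boolean vector whose length equals the number of rows and which indicates whether each row is a duplicate
drop_duplicates removes the duplicate rows

By default, the first occurrence of a row is treated as unique and any later row with the same content is treated as a duplicate. Both methods accept a keep parameter that controls which occurrence is kept.

keep='first' (default): mark/drop duplicates except for the first occurrence
keep='last': mark/drop duplicates except for the last occurrence
keep=False: mark/drop all duplicates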
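The three keep settings can be compared side by side in a short sketch. The frame below reuses the string columns from the example that follows but drops the random column, so the results are deterministic.

```python
import pandas as pd

df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x']})

# Which rows repeat a value already seen in column a?
first = df2.duplicated(subset='a').tolist()               # keep='first' is the default
last = df2.duplicated(subset='a', keep='last').tolist()
every = df2.duplicated(subset='a', keep=False).tolist()

print(first)   # [False, True, False, True, True, False, False]
print(last)    # [True, False, True, True, False, False, False]
print(every)   # [True, True, True, True, True, False, False]

# drop_duplicates removes exactly the rows that duplicated marks as True.
only_unique = df2.drop_duplicates(subset='a', keep=False)
print(only_unique['a'].tolist())   # ['three', 'four']
```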



In [66]:

    
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                        'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                      'c': np.random.randn(7)})
   
df2



In [69]:

    
df2.duplicated('a')  # only check whether values in column a are duplicated









    Out[69]:





0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool



In [68]:

    
df2.duplicated('a', keep='last')









    Out[68]:





0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool



In [70]:

    
df2.drop_duplicates('a')



In [71]:

    
df2.drop_duplicates('a', keep='last')



In [72]:

    
df2.drop_duplicates('a', keep=False)

You can also pass a list of column names.



In [76]:

    
df2.duplicated(['a', 'b'])  # each (a, b) pair is now the unit compared for duplicates









    Out[76]:





0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool



In [77]:

    
df2

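You can also drop rows whose index value is duplicated: call Index.duplicated and use the result for boolean indexing (Index.duplicated returns a boolean array). The keep parameter works as described above.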
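A deterministic sketch of this pattern follows. It uses the same index labels as the example below, with integer values in place of the random column so the outcome can be read off directly.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'a': np.arange(6)},
                   index=['a', 'a', 'b', 'c', 'b', 'a'])

# ~ inverts the boolean array, so the rows NOT flagged as duplicates are kept.
keep_first = df3[~df3.index.duplicated()]
keep_last = df3[~df3.index.duplicated(keep='last')]
keep_none = df3[~df3.index.duplicated(keep=False)]

print(keep_first['a'].tolist())   # [0, 2, 3]
print(keep_last['a'].tolist())    # [3, 4, 5]
print(keep_none['a'].tolist())    # [3]
```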



In [89]:

    
df3 = pd.DataFrame({'a': np.arange(6),
                       'b': np.random.randn(6)},
                       index=['a', 'a', 'b', 'c', 'b', 'a'])



In [90]:

    
df3



In [91]:

    
df3.index.duplicated()  # returns a boolean array









    Out[91]:





array([False,  True, False, False,  True,  True], dtype=bool)



In [92]:

    
df3[~df3.index.duplicated()]



In [93]:

    
df3[~df3.index.duplicated(keep='last')]



In [94]:

    
df3[~df3.index.duplicated(keep=False)]

Dictionary-like get() method

Series, DataFrame, and Panel each have a get method that returns a default value when the requested key is missing.
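A short sketch of get() on a Series and a DataFrame (Panel is left out here; the frame and its column names are made up for the example):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
present = s.get('a')                # the key exists, so its value is returned
fallback = s.get('x', default=-1)   # the key is missing, so the default is returned
missing = s.get('x')                # without a default, a missing key gives None

# On a DataFrame, get() looks up columns.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
col = df.get('A')
no_col = df.get('Z', default='no such column')

print(present, fallback, missing)
print(col.tolist(), no_col)
```

Unlike s['x'], which raises KeyError for a missing label, get() never raises for a missing key.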



In [95]:

    
s = pd.Series([1,2,3], index=['a', 'b', 'c'])
s









    Out[95]:





a    1
b    2
c    3
dtype: int64



In [96]:

    
s.get('a')









    Out[96]:





1



In [97]:

    
s.get('x', default=-1)









    Out[97]:





-1



In [99]:

    
s.get('b')









    Out[99]:





2

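The select() Method

Series, DataFrame, and Panel all have a select() method for retrieving data. It is a last resort, normally used only when no other method works. select takes a function that operates on the labels along an axis and returns a boolean.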
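select() was deprecated in pandas 0.21 and removed in 1.0, so the call shown in this section only runs on older versions. On newer versions, one way to get the same effect is to build a boolean mask from the labels and pass it to .loc. This is a sketch of that replacement, not the method this section describes, and the sample frame is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))

# Apply a predicate to each column label, then select with the resulting mask.
column_mask = [label == 'A' for label in df.columns]
only_a = df.loc[:, column_mask]

# The same idea works along the rows.
row_mask = [label % 2 == 0 for label in df.index]
even_rows = df.loc[row_mask]

print(only_a.columns.tolist())   # ['A']
print(even_rows.index.tolist())  # [0, 2]
```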



In [101]:

    
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))



In [102]:

    
df.select(lambda x: x=='A', axis=1)

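The lookup() Method

Given a sequence of row labels and a sequence of column labels of the same length, lookup returns a NumPy array holding the value at each (row, column) pair.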
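DataFrame.lookup() was deprecated in pandas 1.2 and removed in 2.0. The sketch below reproduces the pairwise lookup with NumPy integer indexing instead. It is an assumed replacement rather than the method shown in this section, and it uses a small deterministic frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list('ABCD'))

rows = [0, 2, 4]          # row labels
cols = ['B', 'C', 'A']    # column labels, paired with rows position by position

# Translate labels to integer positions, then index the underlying array pairwise.
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)
values = df.to_numpy()[row_pos, col_pos]

print(values)   # [ 1 10 16]
```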



In [103]:

    
dflookup = pd.DataFrame(np.random.randn(20, 4), columns=list('ABCD'))
dflookup



In [104]:

    
dflookup.lookup(list(range(0,10,2)), ['B','C','A','B','D'])









    Out[104]:





array([ 0.26172828,  0.56197743, -1.07614596,  0.28491916, -0.35354466])

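Index objects

The pandas Index class and its subclasses can be viewed as an ordered multiset: duplicate values are allowed. However, converting an Index that contains duplicate values into a set is not supported. The simplest way to create an Index is to pass a list or another sequence.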
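A small sketch of these properties, using a made-up Index that contains one repeated label:

```python
import pandas as pd

index = pd.Index(['e', 'd', 'a', 'b', 'a'], name='letters')

has_d = 'd' in index                 # membership test, like a set
is_unique = index.is_unique          # False, because 'a' appears twice
dup_mask = index.duplicated()        # flags the second 'a'
deduped = index.drop_duplicates()    # order of first occurrences is preserved

print(has_d, is_unique)
print(list(deduped))   # ['e', 'd', 'a', 'b']
```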



In [105]:

    
index = pd.Index(['e', 'd', 'a', 'b'])
index









    Out[105]:





Index([u'e', u'd', u'a', u'b'], dtype='object')



In [106]:

    
'd' in index









    Out[106]:





True

You can also give an Index a name.



In [107]:

    
index = pd.Index(['e', 'd', 'a', 'b'], name='something')
index.name









    Out[107]:





'something'



In [108]:

    
index = pd.Index(list(range(5)), name='rows')



In [110]:

    
columns = pd.Index(['A', 'B', 'C'], name='cols')



In [111]:

    
df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)



In [112]:

    
df



In [113]:

    
df['A']









    Out[113]:





rows
0    2.344614
1    0.085694
2   -0.487108
3    0.124834
4   -0.531960
Name: A, dtype: float64

Returning a view versus a copy

When assigning values to a pandas object, take care to avoid chained indexing. Consider the following example:



In [114]:

    
dfmi = pd.DataFrame([list('abcd'),
                        list('efgh'),
                        list('ijkl'),
                        list('mnop')],
                       columns=pd.MultiIndex.from_product([['one','two'],
                                                          ['first','second']]))
dfmi

Compare the following two ways of accessing the data:



In [119]:

    
dfmi['one']['second']









    Out[119]:





0    b
1    f
2    j
3    n
Name: second, dtype: object



In [118]:

    
dfmi.loc[:,('one','second')]









    Out[118]:





0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

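Both methods return the same result, so which one should you use? We recommend the second.

dfmi['one'] selects the first level of the columns and returns a DataFrame. A second Python operation, dfmi_with_one['second'] (where dfmi_with_one stands for that intermediate DataFrame), then selects the Series labelled 'second'. pandas sees these as two separate operations executed one after the other. The .loc call, in contrast, receives the nested tuple (slice(None),('one','second')) in a single call, so pandas can handle it as one operation, which is also faster.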
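The following sketch rebuilds the same frame and confirms that the two access paths return the same values. Only the name of the resulting Series differs.

```python
import pandas as pd

dfmi = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
                    columns=pd.MultiIndex.from_product([['one', 'two'],
                                                        ['first', 'second']]))

chained = dfmi['one']['second']            # two separate __getitem__ calls
direct = dfmi.loc[:, ('one', 'second')]    # one .loc call with a nested tuple

print(chained.tolist() == direct.tolist())   # True
print(chained.name, direct.name)
```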

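Why does assignment with chained indexing fail?

The previous section discouraged chained indexing for performance reasons. This section looks at the problem from the point of view of assignment. First, consider how Python interprets the following code: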



In [ ]:

    
dfmi.loc[:,('one','second')]=value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But the following code is interpreted differently:



In [ ]:

    
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

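See the __getitem__ in there? Outside of the simplest cases, it is very hard to predict whether it returns a view or a copy (this depends on the memory layout of the array, about which pandas makes no guarantees), so assignment through chained indexing is not recommended.

dfmi.loc.__setitem__, by contrast, operates on dfmi directly.

Sometimes a SettingWithCopy warning appears even though there is no obvious chained indexing. These are exactly the bugs that the warning is designed to catch.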



In [123]:

    
def do_something(df):
   foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
   # ... many lines here ...
   foo['quux'] = value       # We don't know whether this will modify df or not!
   return foo
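One common way to remove the ambiguity in a function like this is to take an explicit copy, so that later assignments clearly target a new object. The sketch below assumes that a copy is what the caller wants. The helper name add_quux, the sample frame, and the computed column are hypothetical.

```python
import pandas as pd

def add_quux(df):
    # .copy() makes foo an independent DataFrame, so the assignment below
    # can never modify the caller's df.
    foo = df[['bar', 'baz']].copy()
    foo['quux'] = foo['bar'] + foo['baz']
    return foo

original = pd.DataFrame({'bar': [1, 2], 'baz': [10, 20], 'other': [0, 0]})
result = add_quux(original)

print(result['quux'].tolist())         # [11, 22]
print('quux' in original.columns)      # False
```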

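Evaluation order matters in chained indexing

In addition, the order of the steps in a chained expression can change the result. Order here means whether the rows or the columns are selected first.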



In [124]:

    
dfb = pd.DataFrame({'a' : ['one', 'one', 'two',
                               'three', 'two', 'one', 'six'],
                       'c' : np.arange(7)})
dfb



In [126]:

    
dfb['c'][dfb.a.str.startswith('o')] = 42  # triggers SettingWithCopyWarning, but still gives the intended result here









    



c:\python27\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':



In [128]:

    
pd.set_option('mode.chained_assignment','warn')
dfb[dfb.a.str.startswith('o')]['c'] = 42  # this actually assigns to a copy!









    



c:\python27\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

The correct approach is to use .loc.



In [133]:

    
dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})
dfc



In [134]:

    
dfc.loc[0,'A'] = 11
dfc
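Applied to the dfb example from the previous section, the difference looks like this. The sketch silences the chained-assignment warning only to keep the output short. Selecting rows with a boolean mask returns a copy, so the first assignment leaves dfb unchanged.

```python
import warnings

import numpy as np
import pandas as pd

dfb = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})
mask = dfb.a.str.startswith('o')

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # hide the chained-assignment warning for this demo
    dfb[mask]['c'] = 42               # rows first: the value lands in a temporary copy
after_chained = dfb['c'].tolist()

dfb.loc[mask, 'c'] = 42               # a single .loc call modifies dfb itself
after_loc = dfb['c'].tolist()

print(after_chained)   # [0, 1, 2, 3, 4, 5, 6]
print(after_loc)       # [42, 42, 2, 3, 4, 42, 6]
```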

	A	B	C	D
2000-01-01	0.177302	-1.546113	-0.952839	0.008122
2000-01-02	-1.049451	0.137660	1.987125	1.246595
2000-01-03	-0.978006	0.532418	-0.847118	2.461356
2000-01-04	-0.798626	-1.377614	0.776687	0.342846
2000-01-05	-0.095023	-1.861150	0.134254	0.661625
2000-01-06	0.249859	-0.664076	0.881349	1.945613
2000-01-07	0.995785	0.674858	-0.809902	-0.543680
2000-01-08	0.295963	-0.698917	0.932431	1.440893

	A	B	C	D
2000-01-01	-0.505721	-0.769480	0.857238	2.124492
2000-01-02	-1.005630	1.230158	-0.849252	-0.512871
2000-01-03	-1.137198	0.960215	1.531157	0.940112
2000-01-04	0.993435	-0.300830	1.046579	1.142567
2000-01-05	-1.139493	-0.819572	0.168454	-0.310352
2000-01-06	1.493655	0.409088	-1.424400	-1.870575
2000-01-07	0.386208	0.825722	-0.845220	-0.275965
2000-01-08	0.914743	-1.535301	-0.484555	-1.237407

	A	B	C	D
2000-01-01	0.177302	-1.546113	-0.952839	0.008122
2000-01-02	-1.049451	0.137660	1.987125	1.246595
2000-01-03	-0.978006	0.532418	-0.847118	2.461356

	A	B	C	D
2000-01-08	0.295963	-0.698917	0.932431	1.440893
2000-01-07	0.995785	0.674858	-0.809902	-0.543680
2000-01-06	0.249859	-0.664076	0.881349	1.945613
2000-01-05	-0.095023	-1.861150	0.134254	0.661625
2000-01-04	-0.798626	-1.377614	0.776687	0.342846
2000-01-03	-0.978006	0.532418	-0.847118	2.461356
2000-01-02	-1.049451	0.137660	1.987125	1.246595
2000-01-01	0.177302	-1.546113	-0.952839	0.008122

	A	B	C	D
2016-01-01	0.478902	0.234646	0.628231	0.480590
2016-01-02	0.744357	0.234170	0.555582	0.117715
2016-01-03	0.612064	0.104215	0.674296	0.842351
2016-01-04	0.823353	0.829003	0.501923	0.388439
2016-01-05	0.810892	0.192622	0.606018	0.581612

	A	B	C	D
2000-01-01	-0.562650	-1.226827	0.149550	2.333782
2000-01-02	-1.062558	0.772811	-1.556939	-0.303581
2000-01-03	-1.194126	0.502868	0.823470	1.149401
2000-01-04	0.936506	-0.758176	0.338892	1.351857
2000-01-05	-1.196422	-1.276918	-0.539233	-0.101062
2000-01-06	1.436726	-0.048258	-2.132088	-1.661286
2000-01-07	0.329280	0.368375	-1.552907	-0.066676
2000-01-08	0.857815	-1.992648	-1.192243	-1.028118

	A	B	C	D
a	0.500635	2.515980	0.968653	-0.764951
b	0.911650	-2.208888	0.389002	0.296063
c	-0.326533	-0.548483	-0.225515	0.561847
d	0.061768	-0.299833	-1.081881	-1.389517
e	0.440465	-0.332527	1.633278	-0.096852
f	1.023751	-0.562649	-0.284983	1.629945

	0	2	4	6
0	0.357496	0.007987	-0.373388	0.713999
2	-1.256285	-0.366258	-0.980229	-1.377265
4	-0.256183	-0.405019	-0.740001	0.854734
6	-1.122789	-1.652925	-2.109178	0.714779
8	-1.105426	0.183194	-0.418197	1.454595
10	1.287264	0.318804	0.532221	1.124164

	A	B
0	-0.372023	-1.074524
1	0.061010	-0.205115
2	0.346113	0.256906
3	2.575367	1.030279
4	1.305473	-0.973037

	a	b	c
0	one	x	1.952594
1	one	y	0.964196
2	two	y	0.752335
3	three	x	0.897060
4	two	y	-0.120268
5	one	x	-0.114347
6	six	x	-0.690658

	ids	ids2	vals
0	True	False	True
1	True	False	False
2	False	False	True
3	False	False	False

	A	B	C	D	E
a	-1.196045	1.488123	-1.258859	-1.072124	1.186134
b	-0.981695	-0.895986	-0.342973	1.387443	-1.254906
c	-1.051246	1.700564	-1.647643	0.569339	-0.292161
d	-0.668880	-1.762922	0.886166	-0.057681	-0.106489
e	-1.049452	-0.427708	0.111594	0.524640	0.372802
f	-0.320219	0.179891	0.638859	-1.704587	0.199829

	A	B	C	D	E
a	1.196045	-1.488123	1.258859	1.072124	-1.186134
b	3.000000	3.000000	3.000000	-1.387443	3.000000
c	3.000000	-1.700564	3.000000	-0.569339	3.000000
d	3.000000	3.000000	-0.886166	3.000000	3.000000
e	1.049452	0.427708	-0.111594	-0.524640	-0.372802
f	0.320219	-0.179891	-0.638859	1.704587	-0.199829

	a	b	c
0	0.122199	0.930233	0.032165
1	0.701562	-0.264389	-0.218722
2	1.249641	-0.491504	-0.505626
3	0.737489	-1.021851	-0.133149
4	0.824726	0.610772	-0.618181
5	1.776072	0.179174	-0.221257
6	0.203382	0.170864	0.311583
7	-0.184993	2.168807	-0.525213
8	-1.569373	-0.422874	0.025034
9	-0.250066	-0.523624	-0.068660

		0	1
color	food
green	ham	-0.302117	-0.329496
red	eggs	1.511928	0.516101
	ham	-0.496161	-2.188031
	ham	-0.675945	-1.174039
green	ham	-0.045403	1.232689
red	ham	-0.629616	0.898270
green	ham	-1.107152	-0.608575
red	eggs	-2.190136	0.267058
green	ham	0.463844	0.753210
red	eggs	-2.202780	-0.489497

	a	b	c
0	0.165333	-1.033593	-0.350963
1	1.194190	-1.150226	0.394567
2	2.596311	0.731163	-0.076441
3	-0.581372	-0.181338	-0.537066
4	-0.400650	-0.557857	0.899903
5	-0.953411	-0.339584	-0.326526
6	-0.624137	2.367149	1.713632
7	-1.082154	0.656102	-0.085876
8	0.403841	0.049296	-1.450854
9	-1.545236	-0.277590	0.670381

	a	b	c
0	-1.094352	-0.298453	-1.033678
1	0.433829	0.453419	0.025845
2	-0.073075	-0.807078	-0.301511
3	-0.349793	1.206156	-0.409708
4	-1.182058	-0.751007	-1.877473
5	0.073020	0.615864	1.196358
6	-0.359860	1.054296	0.174546
7	-1.139869	-0.005851	0.100370
8	-0.250069	-0.380308	-0.632730
9	-1.041732	0.077047	0.368696
10	0.769995	0.593472	1.192260
11	-0.863597	-0.064093	1.033223

	a	b	c	d
0	a	a	2	8
1	a	a	2	2
2	b	a	1	1
3	b	a	1	3
4	c	b	0	8
5	c	b	1	2
6	d	b	0	1
7	d	b	1	7
8	e	c	4	2
9	e	c	1	7
10	f	c	4	1
11	f	c	3	0

	a	b	c
0	0.809874	0.825521	1.029453
1	-0.051787	0.918937	1.154500
2	0.335353	0.231090	-1.512497
3	1.176560	-0.966830	2.052055
4	-0.074463	0.166296	0.576796
5	-0.082201	-0.900843	-0.374039
6	1.519903	0.041034	-0.642189
7	-0.483423	-0.845009	0.190998
8	0.822515	-0.926675	-0.165761
9	-0.884488	1.118452	-1.248411

	a	b	c
0	one	x	2.126953
1	one	y	0.570685
2	two	x	-0.718881
3	two	y	0.044910
4	two	x	0.376090
5	three	x	-0.205828
6	four	x	0.336854

	a	b
a	0	0.370944
a	1	0.392321
b	2	-0.999154
c	3	-0.236476
b	4	-0.318244
a	5	1.164510

	A
0	0.768992
1	-0.126865
2	-0.508768
3	-0.847265
4	-1.537149
5	0.610245
6	-0.626082
7	0.813772
8	0.300097
9	1.368777

	A	B	C	D
0	1.456863	0.261728	1.313254	0.105114
1	-1.041107	1.067823	1.156849	-1.306408
2	-1.186713	-1.078201	0.561977	-0.107848
3	0.496749	0.246163	0.496875	0.334775
4	-1.076146	-0.459081	-0.646699	-0.143237
5	-0.337840	1.284264	-1.327627	0.139834
6	0.161458	0.284919	-1.969045	-0.129893
7	-0.361869	-1.292803	0.204441	0.066561
8	-1.518187	-1.247654	-0.988123	-0.353545
9	0.391405	0.097482	-0.190093	-0.074410
10	1.023540	-1.345222	0.537438	-1.357927
11	-0.108617	-0.888801	-0.142176	-0.143029
12	0.996520	0.569428	0.078876	-0.645631
13	-0.845302	-0.642925	1.089828	0.645551
14	2.205784	-0.763532	1.763455	-0.873189
15	0.645096	0.828053	-1.405876	0.974612
16	0.444146	1.831544	0.439983	-0.108334
17	-0.126371	-0.196340	-0.411644	0.414911
18	-0.721539	-1.932596	-1.595068	-1.966388
19	0.060839	0.731949	-0.082693	1.665486