[TOC]

10 Minutes to pandas

This is a short introduction to pandas, geared mainly toward new users.

Before getting started, import the following libraries:


In [129]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Object Creation

Create a Series by passing a list of values; pandas creates a default integer index:


In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s


Out[2]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
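
If you want labels other than the default integers, you can pass an index explicitly. A minimal sketch (the string labels below are just an illustration, not part of the original example):

s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=list('abcdef'))  # explicit string labels
s2['c']  # look up by label instead of integer position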

Create a DataFrame by passing a numpy array, a datetime index, and labeled columns:


In [3]:
dates = pd.date_range('20160101',periods=6)
dates


Out[3]:
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df


Out[4]:
A B C D
2016-01-01 0.928369 0.333578 -0.105632 -0.712419
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336
2016-01-05 -1.279960 0.279432 1.300774 -0.698074
2016-01-06 0.669912 0.172428 0.980341 1.248861

Create a DataFrame by passing a dict of objects:


In [5]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20160102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3]*4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2


Out[5]:
A B C D E F
0 1.0 2016-01-02 1.0 3 test foo
1 1.0 2016-01-02 1.0 3 train foo
2 1.0 2016-01-02 1.0 3 test foo
3 1.0 2016-01-02 1.0 3 train foo

Check the dtype of each column:


In [6]:
df2.dtypes


Out[6]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing Data

View the top and bottom rows of the DataFrame:


In [7]:
df.head()


Out[7]:
A B C D
2016-01-01 0.928369 0.333578 -0.105632 -0.712419
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336
2016-01-05 -1.279960 0.279432 1.300774 -0.698074

In [8]:
df.tail(3)


Out[8]:
A B C D
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336
2016-01-05 -1.279960 0.279432 1.300774 -0.698074
2016-01-06 0.669912 0.172428 0.980341 1.248861

Display the index, columns, and the underlying numpy data:


In [9]:
df.index


Out[9]:
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df.columns


Out[10]:
Index([u'A', u'B', u'C', u'D'], dtype='object')

In [11]:
df.values


Out[11]:
array([[ 0.92836868,  0.33357795, -0.10563178, -0.71241949],
       [-0.5233758 , -0.15095921,  0.17848586, -0.17826755],
       [-0.54105741, -0.39732389, -2.58003191, -0.77831501],
       [ 1.28208414, -0.20509759, -0.66533112, -0.68633603],
       [-1.27995972,  0.27943192,  1.3007739 , -0.69807398],
       [ 0.66991176,  0.17242801,  0.98034059,  1.24886146]])

describe() shows a quick statistical summary of the data:


In [12]:
df.describe()


Out[12]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.089329 0.005343 -0.148566 -0.300758
std 1.011086 0.297199 1.390172 0.789985
min -1.279960 -0.397324 -2.580032 -0.778315
25% -0.536637 -0.191563 -0.525406 -0.708833
50% 0.073268 0.010734 0.036427 -0.692205
75% 0.863754 0.252681 0.779877 -0.305285
max 1.282084 0.333578 1.300774 1.248861

Transposing the data:


In [13]:
df.T


Out[13]:
2016-01-01 00:00:00 2016-01-02 00:00:00 2016-01-03 00:00:00 2016-01-04 00:00:00 2016-01-05 00:00:00 2016-01-06 00:00:00
A 0.928369 -0.523376 -0.541057 1.282084 -1.279960 0.669912
B 0.333578 -0.150959 -0.397324 -0.205098 0.279432 0.172428
C -0.105632 0.178486 -2.580032 -0.665331 1.300774 0.980341
D -0.712419 -0.178268 -0.778315 -0.686336 -0.698074 1.248861

Sorting by an axis:


In [14]:
df.sort_index(axis=1,ascending=False)


Out[14]:
D C B A
2016-01-01 -0.712419 -0.105632 0.333578 0.928369
2016-01-02 -0.178268 0.178486 -0.150959 -0.523376
2016-01-03 -0.778315 -2.580032 -0.397324 -0.541057
2016-01-04 -0.686336 -0.665331 -0.205098 1.282084
2016-01-05 -0.698074 1.300774 0.279432 -1.279960
2016-01-06 1.248861 0.980341 0.172428 0.669912

Sorting by values:


In [15]:
df.sort_values(by='B')


Out[15]:
A B C D
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268
2016-01-06 0.669912 0.172428 0.980341 1.248861
2016-01-05 -1.279960 0.279432 1.300774 -0.698074
2016-01-01 0.928369 0.333578 -0.105632 -0.712419

Selection

Note: While standard Python/NumPy expressions for selecting and setting work fine, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc (.ix is deprecated and has been removed in recent versions).
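
As a quick illustration (a hedged sketch using the df created above): chained indexing such as df['A'][dates[0]] is fine for reading, but for assignment a single .at/.loc call is preferred because chained indexing may trigger SettingWithCopyWarning:

df['A'][dates[0]]        # chained indexing: two separate lookups
df.at[dates[0], 'A']     # recommended: one optimized scalar lookup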

Getting

Selecting a single column returns a Series, equivalent to df.A:


In [16]:
df['A']


Out[16]:
2016-01-01    0.928369
2016-01-02   -0.523376
2016-01-03   -0.541057
2016-01-04    1.282084
2016-01-05   -1.279960
2016-01-06    0.669912
Freq: D, Name: A, dtype: float64

Use [] to slice rows.

Note that [] must be given a range, not a single value; otherwise an error is raised:


In [24]:
df[0:3] # half-open range: the start is included, the end is excluded


Out[24]:
A B C D
2016-01-01 0.928369 0.333578 -0.105632 -0.712419
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315

In [25]:
df['20160102':'20160104'] # label slice: both endpoints are included


Out[25]:
A B C D
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336

In [40]:
# [] must be given a range, not a single value; otherwise an error is raised

# df[1]  # raises an error

# df['20160102'] # raises an error
# df[dates[0]] # raises an error
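
A hedged sketch of what actually happens: a single value inside [] is treated as a column label, so the lookups above raise KeyError, which can be caught as usual:

try:
    df[1]  # interpreted as a request for a column named 1
except KeyError as e:
    print('no such column:', e)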

Selection by Label

For a Series the labels are its index; for a DataFrame the labels are its index and columns.

Selecting by label is much more intuitive than selecting by integer position. Simply call .loc[], passing either a single label or a range (slice) of labels. Note that label slices include both endpoints.

Getting a cross section using a label:


In [19]:
df.loc[dates[0]]


Out[19]:
A    0.928369
B    0.333578
C   -0.105632
D   -0.712419
Name: 2016-01-01 00:00:00, dtype: float64

Selecting on multiple axes by label:


In [32]:
df.loc[:, ['A', 'B']]


Out[32]:
A B
2016-01-01 0.928369 0.333578
2016-01-02 -0.523376 -0.150959
2016-01-03 -0.541057 -0.397324
2016-01-04 1.282084 -0.205098
2016-01-05 -1.279960 0.279432
2016-01-06 0.669912 0.172428

Slicing by label; note that both endpoints are included:


In [34]:
df.loc['20160102':'20160104',['A','B']]


Out[34]:
A B
2016-01-02 -0.523376 -0.150959
2016-01-03 -0.541057 -0.397324
2016-01-04 1.282084 -0.205098

In [36]:
# .loc[] only accepts labels; passing integer positions raises an error
# df.loc[0:3, ['A','B']]

In [43]:
df.loc['20160102',['A','B']]


Out[43]:
A   -0.523376
B   -0.150959
Name: 2016-01-02 00:00:00, dtype: float64

Getting a scalar value:


In [44]:
df.loc[dates[0],'A']


Out[44]:
0.9283686787182388

In [45]:
df.loc['20160101', 'A']


Out[45]:
0.9283686787182388

Getting a scalar quickly (same result as above):


In [46]:
df.at[dates[0],'A']


Out[46]:
0.9283686787182388

Selection by Position

A position is an integer offset, a familiar concept from other programming languages.

pandas data structures can also be indexed by position, treating a DataFrame like a two-dimensional array.

Selection by position is done with .iloc[].

.iloc[] accepts a single integer position, a slice of positions, or a list of positions. Note that position slices are half-open: the start is included and the end is excluded.

Passing a single integer position:


In [48]:
df.iloc[3]


Out[48]:
A    1.282084
B   -0.205098
C   -0.665331
D   -0.686336
Name: 2016-01-04 00:00:00, dtype: float64

Passing integer slices, which behave like in python/numpy:


In [50]:
df.iloc[3:5,0:2]


Out[50]:
A B
2016-01-04 1.282084 -0.205098
2016-01-05 -1.279960 0.279432

Passing a list of integer positions:


In [51]:
df.iloc[[1,2,4], [0,2]]


Out[51]:
A C
2016-01-02 -0.523376 0.178486
2016-01-03 -0.541057 -2.580032
2016-01-05 -1.279960 1.300774

Slicing rows only:


In [52]:
df.iloc[1:3,:]


Out[52]:
A B C D
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315

Slicing columns only:


In [53]:
df.iloc[:,1:3]


Out[53]:
B C
2016-01-01 0.333578 -0.105632
2016-01-02 -0.150959 0.178486
2016-01-03 -0.397324 -2.580032
2016-01-04 -0.205098 -0.665331
2016-01-05 0.279432 1.300774
2016-01-06 0.172428 0.980341

Getting a single value:


In [55]:
df.iloc[1,1]


Out[55]:
-0.15095920811674263

Getting a single value quickly (equivalent to the method above):


In [56]:
df.iat[1,1]


Out[56]:
-0.15095920811674263

Boolean Indexing

Using a single column's values to select data:


In [57]:
df[df.A > 0]


Out[57]:
A B C D
2016-01-01 0.928369 0.333578 -0.105632 -0.712419
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336
2016-01-06 0.669912 0.172428 0.980341 1.248861

Selecting data with a where operation:


In [59]:
df[df > 0]


Out[59]:
A B C D
2016-01-01 0.928369 0.333578 NaN NaN
2016-01-02 NaN NaN 0.178486 NaN
2016-01-03 NaN NaN NaN NaN
2016-01-04 1.282084 NaN NaN NaN
2016-01-05 NaN 0.279432 1.300774 NaN
2016-01-06 0.669912 0.172428 0.980341 1.248861
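
Boolean masks can also be combined with & (and), | (or) and ~ (not), wrapping each condition in parentheses. A minimal sketch:

df[(df.A > 0) & (df.B < 0)]  # rows where A is positive and B is negative
df[~(df.D > 0)]              # rows where D is not positive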

Using the isin() method for filtering:


In [60]:
df2 = df.copy()

In [61]:
df2['E'] = ['one','one','two','three','four','three']
df2


Out[61]:
A B C D E
2016-01-01 0.928369 0.333578 -0.105632 -0.712419 one
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268 one
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315 two
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336 three
2016-01-05 -1.279960 0.279432 1.300774 -0.698074 four
2016-01-06 0.669912 0.172428 0.980341 1.248861 three

In [62]:
df2[df2['E'].isin(['two','four'])]


Out[62]:
A B C D E
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315 two
2016-01-05 -1.279960 0.279432 1.300774 -0.698074 four

In [63]:
df2[df2['E'].isin(['one','one','two'])]


Out[63]:
A B C D E
2016-01-01 0.928369 0.333578 -0.105632 -0.712419 one
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268 one
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315 two

Setting

Setting a new column automatically aligns the data by index:


In [65]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20160102',periods=6))

In [66]:
s1


Out[66]:
2016-01-02    1
2016-01-03    2
2016-01-04    3
2016-01-05    4
2016-01-06    5
2016-01-07    6
Freq: D, dtype: int64

In [67]:
df['F'] = s1
s1


Out[67]:
2016-01-02    1
2016-01-03    2
2016-01-04    3
2016-01-05    4
2016-01-06    5
2016-01-07    6
Freq: D, dtype: int64

In [69]:
df # note column F


Out[69]:
A B C D F
2016-01-01 0.928369 0.333578 -0.105632 -0.712419 NaN
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268 1.0
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315 2.0
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336 3.0
2016-01-05 -1.279960 0.279432 1.300774 -0.698074 4.0
2016-01-06 0.669912 0.172428 0.980341 1.248861 5.0

Setting values by label:


In [70]:
df.at[dates[0],'A'] = 0
df


Out[70]:
A B C D F
2016-01-01 0.000000 0.333578 -0.105632 -0.712419 NaN
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268 1.0
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315 2.0
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336 3.0
2016-01-05 -1.279960 0.279432 1.300774 -0.698074 4.0
2016-01-06 0.669912 0.172428 0.980341 1.248861 5.0

Setting values by position:


In [71]:
df.iat[0,1] = 0
df


Out[71]:
A B C D F
2016-01-01 0.000000 0.000000 -0.105632 -0.712419 NaN
2016-01-02 -0.523376 -0.150959 0.178486 -0.178268 1.0
2016-01-03 -0.541057 -0.397324 -2.580032 -0.778315 2.0
2016-01-04 1.282084 -0.205098 -0.665331 -0.686336 3.0
2016-01-05 -1.279960 0.279432 1.300774 -0.698074 4.0
2016-01-06 0.669912 0.172428 0.980341 1.248861 5.0

Setting values by assigning a numpy array:


In [72]:
df.loc[:,'D'] = np.array([5] * len(df))

In [73]:
df


Out[73]:
A B C D F
2016-01-01 0.000000 0.000000 -0.105632 5 NaN
2016-01-02 -0.523376 -0.150959 0.178486 5 1.0
2016-01-03 -0.541057 -0.397324 -2.580032 5 2.0
2016-01-04 1.282084 -0.205098 -0.665331 5 3.0
2016-01-05 -1.279960 0.279432 1.300774 5 4.0
2016-01-06 0.669912 0.172428 0.980341 5 5.0

In [74]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2


Out[74]:
A B C D F
2016-01-01 0.000000 0.000000 -0.105632 -5 NaN
2016-01-02 -0.523376 -0.150959 -0.178486 -5 -1.0
2016-01-03 -0.541057 -0.397324 -2.580032 -5 -2.0
2016-01-04 -1.282084 -0.205098 -0.665331 -5 -3.0
2016-01-05 -1.279960 -0.279432 -1.300774 -5 -4.0
2016-01-06 -0.669912 -0.172428 -0.980341 -5 -5.0

Missing Data

pandas uses np.nan to represent missing data; by default, missing values are excluded from computations.

reindex() lets you change/add/delete the index on a specified axis and returns a copy of the data:


In [75]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1


Out[75]:
A B C D F E
2016-01-01 0.000000 0.000000 -0.105632 5 NaN 1.0
2016-01-02 -0.523376 -0.150959 0.178486 5 1.0 1.0
2016-01-03 -0.541057 -0.397324 -2.580032 5 2.0 NaN
2016-01-04 1.282084 -0.205098 -0.665331 5 3.0 NaN

Dropping any rows that contain missing data:


In [76]:
df1.dropna(how='any')


Out[76]:
A B C D F E
2016-01-02 -0.523376 -0.150959 0.178486 5 1.0 1.0

Filling missing data:


In [77]:
df1.fillna(value=5)


Out[77]:
A B C D F E
2016-01-01 0.000000 0.000000 -0.105632 5 5.0 1.0
2016-01-02 -0.523376 -0.150959 0.178486 5 1.0 1.0
2016-01-03 -0.541057 -0.397324 -2.580032 5 2.0 5.0
2016-01-04 1.282084 -0.205098 -0.665331 5 3.0 5.0

Getting a boolean mask of where values are nan:


In [78]:
pd.isnull(df1)


Out[78]:
A B C D F E
2016-01-01 False False False False True False
2016-01-02 False False False False False False
2016-01-03 False False False False False True
2016-01-04 False False False False False True

Operations

Stats

Operations in general exclude missing data.

Performing a descriptive statistic:


In [79]:
df


Out[79]:
A B C D F
2016-01-01 0.000000 0.000000 -0.105632 5 NaN
2016-01-02 -0.523376 -0.150959 0.178486 5 1.0
2016-01-03 -0.541057 -0.397324 -2.580032 5 2.0
2016-01-04 1.282084 -0.205098 -0.665331 5 3.0
2016-01-05 -1.279960 0.279432 1.300774 5 4.0
2016-01-06 0.669912 0.172428 0.980341 5 5.0

In [80]:
df.mean()


Out[80]:
A   -0.065400
B   -0.050253
C   -0.148566
D    5.000000
F    3.000000
dtype: float64

In [82]:
df.sum()


Out[82]:
A    -0.392397
B    -0.301521
C    -0.891394
D    30.000000
F    15.000000
dtype: float64
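
To see that the NaN in column F really is excluded, a hedged check with skipna=False, which lets the missing value propagate instead:

df.sum(skipna=False)  # F becomes NaN because the missing value is no longer skipped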

Performing the same operation on the other axis:


In [81]:
df.mean(1)


Out[81]:
2016-01-01    1.223592
2016-01-02    1.100830
2016-01-03    0.696317
2016-01-04    1.682331
2016-01-05    1.860049
2016-01-06    2.364536
Freq: D, dtype: float64

When operating on objects of different dimensionality, pandas first aligns them; beyond that, it automatically broadcasts along the specified dimension.


In [83]:
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
s


Out[83]:
2016-01-01    NaN
2016-01-02    NaN
2016-01-03    1.0
2016-01-04    3.0
2016-01-05    5.0
2016-01-06    NaN
Freq: D, dtype: float64

In [84]:
df.sub(s,axis='index')


Out[84]:
A B C D F
2016-01-01 NaN NaN NaN NaN NaN
2016-01-02 NaN NaN NaN NaN NaN
2016-01-03 -1.541057 -1.397324 -3.580032 4.0 1.0
2016-01-04 -1.717916 -3.205098 -3.665331 2.0 0.0
2016-01-05 -6.279960 -4.720568 -3.699226 0.0 -1.0
2016-01-06 NaN NaN NaN NaN NaN

Apply

apply() is a higher-order function: it takes a function as an argument and, much like map(), applies that function to the data.

By default the function is applied to each column.


In [88]:
df.apply(np.cumsum)


Out[88]:
A B C D F
2016-01-01 0.000000 0.000000 -0.105632 5 NaN
2016-01-02 -0.523376 -0.150959 0.072854 10 1.0
2016-01-03 -1.064433 -0.548283 -2.507178 15 3.0
2016-01-04 0.217651 -0.753381 -3.172509 20 6.0
2016-01-05 -1.062309 -0.473949 -1.871735 25 10.0
2016-01-06 -0.392397 -0.301521 -0.891394 30 15.0

In [90]:
df.apply(lambda x: x.max()-x.min())


Out[90]:
A    2.562044
B    0.676756
C    3.880806
D    0.000000
F    4.000000
dtype: float64
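
apply() works column-wise by default (axis=0); pass axis=1 to apply the function to each row instead. A minimal sketch:

df.apply(lambda x: x.max() - x.min(), axis=1)  # per-row range across columns A..F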

Histogramming


In [91]:
s = pd.Series(np.random.randint(0,7,size=10))
s


Out[91]:
0    0
1    6
2    0
3    0
4    6
5    6
6    3
7    1
8    2
9    2
dtype: int64

In [92]:
s.value_counts()


Out[92]:
6    3
0    3
2    2
3    1
1    1
dtype: int64

String Methods

Series is equipped with a set of string processing methods in its str attribute.

As in the snippet below, they make it easy to operate on each element of the array. Note that pattern matching in str methods uses regular expressions by default.


In [93]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()


Out[93]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
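
Because pattern matching uses regular expressions by default, methods such as str.contains() accept a regex directly. A small sketch:

s.str.contains('^[AB]')  # True where the string starts with A or B; NaN stays NaN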

Merge

Concat

pandas provides various facilities for easily combining Series, DataFrame, and Panel objects, with several kinds of set logic for the indexes and relational-algebra functionality for merge-type operations.

Concatenating pandas objects with pd.concat():


In [96]:
df = pd.DataFrame(np.random.randn(10,4))
df


Out[96]:
0 1 2 3
0 0.247905 -2.345166 1.252221 0.670462
1 -0.314595 0.619986 2.421725 0.357492
2 1.826127 -0.483145 0.655893 -0.627891
3 0.071736 0.601667 -0.995181 -0.268664
4 -0.477940 -1.112391 -0.374552 0.058009
5 -0.781169 1.171458 -0.071877 -0.352357
6 -0.286301 -0.252232 0.342850 1.783261
7 1.670887 1.295367 -0.720624 -0.795947
8 0.174607 0.357461 1.990180 -0.521559
9 -1.561951 -1.539058 -0.640498 1.416332

In [97]:
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces


Out[97]:
[          0         1         2         3
 0  0.247905 -2.345166  1.252221  0.670462
 1 -0.314595  0.619986  2.421725  0.357492
 2  1.826127 -0.483145  0.655893 -0.627891,
           0         1         2         3
 3  0.071736  0.601667 -0.995181 -0.268664
 4 -0.477940 -1.112391 -0.374552  0.058009
 5 -0.781169  1.171458 -0.071877 -0.352357
 6 -0.286301 -0.252232  0.342850  1.783261,
           0         1         2         3
 7  1.670887  1.295367 -0.720624 -0.795947
 8  0.174607  0.357461  1.990180 -0.521559
 9 -1.561951 -1.539058 -0.640498  1.416332]

In [98]:
pd.concat(pieces)


Out[98]:
0 1 2 3
0 0.247905 -2.345166 1.252221 0.670462
1 -0.314595 0.619986 2.421725 0.357492
2 1.826127 -0.483145 0.655893 -0.627891
3 0.071736 0.601667 -0.995181 -0.268664
4 -0.477940 -1.112391 -0.374552 0.058009
5 -0.781169 1.171458 -0.071877 -0.352357
6 -0.286301 -0.252232 0.342850 1.783261
7 1.670887 1.295367 -0.720624 -0.795947
8 0.174607 0.357461 1.990180 -0.521559
9 -1.561951 -1.539058 -0.640498 1.416332

Join

SQL-style merges:


In [99]:
left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
left


Out[99]:
key lval
0 foo 1
1 foo 2

In [100]:
right = pd.DataFrame({'key':['foo','foo'],'lval':[4,5]})
right


Out[100]:
key lval
0 foo 4
1 foo 5

In [101]:
pd.merge(left, right, on='key')


Out[101]:
key lval_x lval_y
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

Another example of pd.merge():


In [102]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left


Out[102]:
key lval
0 foo 1
1 bar 2

In [103]:
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right


Out[103]:
key rval
0 foo 4
1 bar 5

In [104]:
pd.merge(left, right, on='key')


Out[104]:
key lval rval
0 foo 1 4
1 bar 2 5

Append

Appending rows to a DataFrame:


In [105]:
df = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'])
df


Out[105]:
A B C D
0 0.222672 0.443049 -1.257127 0.736807
1 -0.375083 -0.024848 -0.976872 -0.605654
2 -0.685383 0.229636 -0.970260 2.517385
3 -0.255183 -0.482105 0.703707 0.069610
4 -0.128585 2.153157 -1.720873 -0.598394
5 0.629814 -0.916473 0.264643 -0.412704
6 0.636748 0.450428 -0.665539 -0.608706
7 0.767546 -0.706155 0.594495 -0.213454

In [106]:
s = df.iloc[3]
s


Out[106]:
A   -0.255183
B   -0.482105
C    0.703707
D    0.069610
Name: 3, dtype: float64

In [107]:
df.append(s,ignore_index=True)


Out[107]:
A B C D
0 0.222672 0.443049 -1.257127 0.736807
1 -0.375083 -0.024848 -0.976872 -0.605654
2 -0.685383 0.229636 -0.970260 2.517385
3 -0.255183 -0.482105 0.703707 0.069610
4 -0.128585 2.153157 -1.720873 -0.598394
5 0.629814 -0.916473 0.264643 -0.412704
6 0.636748 0.450428 -0.665539 -0.608706
7 0.767546 -0.706155 0.594495 -0.213454
8 -0.255183 -0.482105 0.703707 0.069610
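
Note that DataFrame.append() was later deprecated and removed in pandas 2.0; the same result with pd.concat() (a hedged sketch) is:

pd.concat([df, s.to_frame().T], ignore_index=True)  # equivalent to df.append(s, ignore_index=True)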

Grouping

A "group by" operation involves the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a new data structure

In [108]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three', 
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df


Out[108]:
A B C D
0 foo one 1.389706 0.194864
1 bar one 1.068122 0.548799
2 foo two -1.772096 0.656627
3 bar three -0.784895 -0.971999
4 foo two -1.601806 -0.986536
5 bar two -0.778508 -3.408366
6 foo one 0.251064 -0.144214
7 bar three 0.180176 0.669016

Grouping by a column and applying sum to each group:


In [109]:
df.groupby('A').sum()


Out[109]:
C D
A
bar -0.315105 -3.162549
foo -1.733133 -0.279259

Grouping by multiple columns forms a hierarchical index; then apply sum:


In [110]:
df.groupby(['A','B']).sum()


Out[110]:
C D
A B
bar one 1.068122 0.548799
three -0.604719 -0.302983
two -0.778508 -3.408366
foo one 1.640770 0.050650
two -3.373903 -0.329909
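
groupby is not limited to sum; agg() applies different aggregations per column. A minimal sketch (the particular functions are just an illustration):

df.groupby('A').agg({'C': 'mean', 'D': 'sum'})  # mean of C and sum of D within each group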

Reshaping

Stack


In [112]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
tuples


Out[112]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [113]:
index = pd.MultiIndex.from_tuples(tuples,names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]
df2


Out[113]:
A B
first second
bar one -0.678641 0.196709
two 0.414739 -1.874639
baz one 0.526168 -0.586098
two -0.794330 0.621687

The stack() method "compresses" a level in the DataFrame's columns:


In [114]:
stacked = df2.stack()
stacked


Out[114]:
first  second   
bar    one     A   -0.678641
               B    0.196709
       two     A    0.414739
               B   -1.874639
baz    one     A    0.526168
               B   -0.586098
       two     A   -0.794330
               B    0.621687
dtype: float64

With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:


In [115]:
stacked.unstack()


Out[115]:
A B
first second
bar one -0.678641 0.196709
two 0.414739 -1.874639
baz one 0.526168 -0.586098
two -0.794330 0.621687

In [116]:
stacked.unstack(1)


Out[116]:
second one two
first
bar A -0.678641 0.414739
B 0.196709 -1.874639
baz A 0.526168 -0.794330
B -0.586098 0.621687

In [117]:
stacked.unstack(0)


Out[117]:
first bar baz
second
one A -0.678641 0.526168
B 0.196709 -0.586098
two A 0.414739 -0.794330
B -1.874639 0.621687

Pivot Tables


In [118]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df


Out[118]:
A B C D E
0 one A foo -0.484209 1.372559
1 one B foo 1.229735 1.155902
2 two C foo 0.540843 -0.244282
3 three A bar 1.639626 1.056057
4 one B bar -1.323951 -0.729839
5 one C bar 0.855990 0.498600
6 two A foo -1.298377 -0.219432
7 three B foo -0.363005 -0.471282
8 one C foo -1.430104 -0.466728
9 one A bar 1.216707 0.439948
10 two B bar -2.140101 -1.813999
11 three C bar 1.620069 1.107080

In [119]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])


Out[119]:
C bar foo
A B
one A 1.216707 -0.484209
B -1.323951 1.229735
C 0.855990 -1.430104
three A 1.639626 NaN
B NaN -0.363005
C 1.620069 NaN
two A NaN -1.298377
B -2.140101 NaN
C NaN 0.540843
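
pivot_table aggregates with the mean by default; aggfunc and fill_value change the aggregation and fill the holes shown above. A hedged sketch:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum', fill_value=0)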

Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g. converting secondly data into 5-minutely data).


In [86]:
rng = pd.date_range('1/1/2012',periods=100,freq='S')
rng


Out[86]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 00:00:00, ..., 2012-01-01 00:01:39]
Length: 100, Freq: S, Timezone: None

In [87]:
ts = pd.Series(np.random.randint(0,500,len(rng)),index=rng)
ts


Out[87]:
2012-01-01 00:00:00    114
2012-01-01 00:00:01     62
2012-01-01 00:00:02    370
2012-01-01 00:00:03    379
2012-01-01 00:00:04    360
2012-01-01 00:00:05    234
2012-01-01 00:00:06    198
2012-01-01 00:00:07     55
2012-01-01 00:00:08    458
2012-01-01 00:00:09    241
2012-01-01 00:00:10    117
2012-01-01 00:00:11    110
2012-01-01 00:00:12     57
2012-01-01 00:00:13    293
2012-01-01 00:00:14    150
...
2012-01-01 00:01:25    346
2012-01-01 00:01:26    355
2012-01-01 00:01:27     27
2012-01-01 00:01:28     36
2012-01-01 00:01:29     47
2012-01-01 00:01:30    394
2012-01-01 00:01:31    295
2012-01-01 00:01:32    329
2012-01-01 00:01:33     48
2012-01-01 00:01:34    296
2012-01-01 00:01:35     18
2012-01-01 00:01:36    373
2012-01-01 00:01:37    328
2012-01-01 00:01:38    494
2012-01-01 00:01:39    457
Freq: S, Length: 100

In [88]:
ts.resample('5Min',how='sum')


Out[88]:
2012-01-01    24374
Freq: 5T, dtype: int64
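
The how= argument shown above comes from an older pandas API and has since been removed; in current versions the same resampling is written as a method call (a hedged sketch):

ts.resample('5Min').sum()  # equivalent to the how='sum' form above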

Time zone representation:


In [89]:
rng = pd.date_range('3/6/2012',periods=5,freq='D')
rng


Out[89]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-06, ..., 2012-03-10]
Length: 5, Freq: D, Timezone: None

In [90]:
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts


Out[90]:
2012-03-06   -1.062647
2012-03-07   -0.988973
2012-03-08    0.018998
2012-03-09    0.882671
2012-03-10   -0.566935
Freq: D, dtype: float64

In [91]:
ts_utc = ts.tz_localize('UTC')
ts_utc


Out[91]:
2012-03-06 00:00:00+00:00   -1.062647
2012-03-07 00:00:00+00:00   -0.988973
2012-03-08 00:00:00+00:00    0.018998
2012-03-09 00:00:00+00:00    0.882671
2012-03-10 00:00:00+00:00   -0.566935
Freq: D, dtype: float64

Converting to another time zone:


In [92]:
ts_utc.tz_convert('US/Eastern')


Out[92]:
2012-03-05 19:00:00-05:00   -1.062647
2012-03-06 19:00:00-05:00   -0.988973
2012-03-07 19:00:00-05:00    0.018998
2012-03-08 19:00:00-05:00    0.882671
2012-03-09 19:00:00-05:00   -0.566935
Freq: D, dtype: float64

Converting between time span representations:


In [93]:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng


Out[93]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-31, ..., 2012-05-31]
Length: 5, Freq: M, Timezone: None

In [94]:
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts


Out[94]:
2012-01-31   -0.632847
2012-02-29   -0.983099
2012-03-31    1.646944
2012-04-30   -2.136451
2012-05-31   -0.048257
Freq: M, dtype: float64

In [96]:
ps = ts.to_period()
ps


Out[96]:
2012-01   -0.632847
2012-02   -0.983099
2012-03    1.646944
2012-04   -2.136451
2012-05   -0.048257
Freq: M, dtype: float64

In [97]:
ps.to_timestamp()


Out[97]:
2012-01-01   -0.632847
2012-02-01   -0.983099
2012-03-01    1.646944
2012-04-01   -2.136451
2012-05-01   -0.048257
Freq: MS, dtype: float64

Converting between period and timestamp allows some convenient arithmetic functions to be used. In the following example, we convert quarterly data with a year ending in November into timestamps at 9am on the first day of the last month of each quarter:


In [98]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng


Out[98]:
<class 'pandas.tseries.period.PeriodIndex'>
[1990Q1, ..., 2000Q4]
Length: 44, Freq: Q-NOV

In [99]:
ts = pd.Series(np.random.randn(len(prng)), index = prng)
ts


Out[99]:
1990Q1   -0.529455
1990Q2    0.570384
1990Q3   -1.544862
1990Q4   -0.297993
1991Q1   -0.817410
1991Q2   -0.409854
1991Q3   -2.555214
1991Q4    1.032389
1992Q1   -0.378980
1992Q2   -0.373298
1992Q3   -0.290478
1992Q4   -1.106722
1993Q1   -1.118070
1993Q2    1.683016
1993Q3    2.278607
1993Q4    1.070247
1994Q1    0.911208
1994Q2    0.632467
1994Q3   -0.234640
1994Q4   -1.545572
1995Q1    0.707218
1995Q2   -0.386264
1995Q3    1.440627
1995Q4    0.751621
1996Q1   -1.640999
1996Q2    0.125822
1996Q3   -0.781528
1996Q4    1.900030
1997Q1    0.031440
1997Q2   -1.521395
1997Q3   -0.350913
1997Q4   -1.335841
1998Q1   -0.506609
1998Q2    0.609221
1998Q3   -0.092327
1998Q4    0.742343
1999Q1   -0.490511
1999Q2    1.018211
1999Q3    1.817720
1999Q4    0.560937
2000Q1   -1.012795
2000Q2   -1.070557
2000Q3    0.704798
2000Q4    0.353916
Freq: Q-NOV, dtype: float64

In [100]:
ts.index = prng.asfreq('M', 'end').asfreq('H', 'start') + 9
ts


Out[100]:
1990-02-01 09:00   -0.529455
1990-05-01 09:00    0.570384
1990-08-01 09:00   -1.544862
1990-11-01 09:00   -0.297993
1991-02-01 09:00   -0.817410
1991-05-01 09:00   -0.409854
1991-08-01 09:00   -2.555214
1991-11-01 09:00    1.032389
1992-02-01 09:00   -0.378980
1992-05-01 09:00   -0.373298
1992-08-01 09:00   -0.290478
1992-11-01 09:00   -1.106722
1993-02-01 09:00   -1.118070
1993-05-01 09:00    1.683016
1993-08-01 09:00    2.278607
1993-11-01 09:00    1.070247
1994-02-01 09:00    0.911208
1994-05-01 09:00    0.632467
1994-08-01 09:00   -0.234640
1994-11-01 09:00   -1.545572
1995-02-01 09:00    0.707218
1995-05-01 09:00   -0.386264
1995-08-01 09:00    1.440627
1995-11-01 09:00    0.751621
1996-02-01 09:00   -1.640999
1996-05-01 09:00    0.125822
1996-08-01 09:00   -0.781528
1996-11-01 09:00    1.900030
1997-02-01 09:00    0.031440
1997-05-01 09:00   -1.521395
1997-08-01 09:00   -0.350913
1997-11-01 09:00   -1.335841
1998-02-01 09:00   -0.506609
1998-05-01 09:00    0.609221
1998-08-01 09:00   -0.092327
1998-11-01 09:00    0.742343
1999-02-01 09:00   -0.490511
1999-05-01 09:00    1.018211
1999-08-01 09:00    1.817720
1999-11-01 09:00    0.560937
2000-02-01 09:00   -1.012795
2000-05-01 09:00   -1.070557
2000-08-01 09:00    0.704798
2000-11-01 09:00    0.353916
Freq: H, dtype: float64

Categoricals

Since version 0.15, a DataFrame can contain categorical data.


In [120]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'e', 'e']})
df


Out[120]:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 e
5 6 e

Converting raw_grade to a categorical dtype:


In [121]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]


Out[121]:
0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Renaming the categories:


In [122]:
df["grade"].cat.categories = ["very good","good","very bad"]

Reordering the categories and adding the missing categories:


In [108]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]


Out[108]:
0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]

Sorting is per order in the categories, not lexical order:


In [109]:
df.sort_values(by="grade")  # sort_index(by=...) is deprecated; sort by values instead


Out[109]:
id raw_grade grade
4 5 e very bad
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good

Grouping by a categorical column also shows empty categories:


In [110]:
df.groupby("grade").size()


Out[110]:
grade
very bad      2
bad         NaN
medium      NaN
good          2
very good     2
dtype: float64

Plotting


In [125]:
ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
ts = ts.cumsum()
ts.plot()


Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x1069af0d0>

On a DataFrame, plot() conveniently plots all of the columns with labels:


In [126]:
df = pd.DataFrame(np.random.randn(1000,4),index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()


Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x106ae2b50>

Getting Data In/Out

CSV

Writing to a csv file:


In [127]:
df.to_csv('foo.csv')

Reading from a csv file:


In [128]:
pd.read_csv('foo.csv')


Out[128]:
Unnamed: 0 A B C D
0 2000-01-01 0.000241 -0.062672 -1.256108 -0.276911
1 2000-01-02 1.185635 0.278415 0.111969 1.471304
2 2000-01-03 0.181744 0.862553 2.168972 1.217986
3 2000-01-04 1.025786 0.444376 1.946407 1.781225
4 2000-01-05 1.156284 1.556331 2.972193 1.485344
5 2000-01-06 2.647098 1.406498 2.667766 1.806171
6 2000-01-07 2.838413 1.148143 4.251299 1.178629
7 2000-01-08 3.264215 2.052171 4.695244 0.635274
8 2000-01-09 4.837534 1.668099 2.820105 -1.641457
9 2000-01-10 4.785927 1.949996 3.937043 -1.326241
10 2000-01-11 4.370393 -0.773381 5.357964 -2.355775
11 2000-01-12 4.817681 -1.828862 5.009137 -2.723674
12 2000-01-13 4.056709 -2.298854 5.661288 -2.848931
13 2000-01-14 2.358674 -1.586434 5.171422 -3.659978
14 2000-01-15 2.286226 -1.309240 6.083846 -3.033132
15 2000-01-16 2.214300 0.991229 6.179765 -3.977637
16 2000-01-17 1.667437 1.482464 6.586782 -4.733009
17 2000-01-18 -0.704198 1.114435 8.273148 -6.013564
18 2000-01-19 0.521272 0.855078 10.667356 -7.072008
19 2000-01-20 0.453339 1.491088 11.326491 -6.707565
20 2000-01-21 2.120191 1.624046 12.256566 -6.946169
21 2000-01-22 2.653590 1.515170 13.133247 -7.462825
22 2000-01-23 3.766077 1.367578 12.350256 -7.643055
23 2000-01-24 5.390681 1.771788 10.443662 -7.819640
24 2000-01-25 6.333766 2.907701 7.415541 -8.294551
25 2000-01-26 5.745656 3.138093 6.264761 -7.738085
26 2000-01-27 4.820698 3.079208 8.175576 -7.785478
27 2000-01-28 2.902801 4.318009 9.259691 -7.447738
28 2000-01-29 2.562628 5.182195 9.737713 -7.622801
29 2000-01-30 2.493323 5.992592 9.364377 -6.604847
... ... ... ... ... ...
970 2002-08-28 -13.787008 10.349694 20.639172 -16.897246
971 2002-08-29 -13.961866 9.608443 19.893805 -16.674283
972 2002-08-30 -14.407201 10.131890 22.030059 -15.462973
973 2002-08-31 -15.661060 10.139718 22.063381 -16.127232
974 2002-09-01 -15.893622 9.579937 20.869018 -16.426683
975 2002-09-02 -14.326294 9.843613 21.372123 -16.396319
976 2002-09-03 -13.222234 11.503247 21.673593 -17.067178
977 2002-09-04 -13.874599 11.029781 22.177487 -17.584621
978 2002-09-05 -14.366766 10.929471 23.749803 -18.250348
979 2002-09-06 -14.301658 11.352843 25.016192 -16.140891
980 2002-09-07 -14.765577 11.554946 21.596048 -16.378724
981 2002-09-08 -14.295516 11.904146 21.838288 -16.458944
982 2002-09-09 -14.869276 11.892539 22.105776 -16.813928
983 2002-09-10 -14.431293 12.287799 23.208907 -16.393767
984 2002-09-11 -15.550096 13.010101 22.782674 -17.215624
985 2002-09-12 -15.301565 13.504414 21.390448 -16.545548
986 2002-09-13 -15.723424 13.094349 19.957580 -15.729522
987 2002-09-14 -14.264989 14.834233 18.490564 -15.643701
988 2002-09-15 -13.551818 14.704450 19.560484 -15.715354
989 2002-09-16 -13.792679 14.822606 20.930214 -15.642744
990 2002-09-17 -13.583718 15.564811 23.432460 -16.040966
991 2002-09-18 -12.819410 17.180775 20.185889 -15.677880
992 2002-09-19 -12.801132 17.250503 19.255585 -14.618826
993 2002-09-20 -11.267237 18.258379 19.270546 -15.589998
994 2002-09-21 -13.001507 18.570462 20.733814 -15.013118
995 2002-09-22 -14.067549 18.896325 21.719062 -14.114297
996 2002-09-23 -12.348151 18.025382 20.937040 -12.093957
997 2002-09-24 -12.693599 17.232992 20.656773 -12.760468
998 2002-09-25 -11.666343 18.446673 19.570311 -12.059097
999 2002-09-26 -12.161885 18.318993 18.654749 -11.719645

1000 rows × 5 columns
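
As the output shows, the saved index comes back as an unnamed column; passing index_col and parse_dates restores the DatetimeIndex (a minimal sketch):

pd.read_csv('foo.csv', index_col=0, parse_dates=True)  # the first column becomes the DatetimeIndex again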

HDF5

Reading and writing to HDFStores.

Writing to an HDF5 store:


In [120]:
df.to_hdf('foo.h5','df')

Reading from an HDF5 store:


In [121]:
pd.read_hdf('foo.h5','df')


Out[121]:
A B C D
2000-01-01 -0.425461 -1.403269 -0.044436 0.219559
2000-01-02 0.308576 -2.057619 0.634288 -0.452682
2000-01-03 0.619000 -2.869349 1.237154 -0.308545
2000-01-04 -1.165187 -2.040167 1.047757 0.847036
2000-01-05 -0.214774 -4.301137 0.571458 2.713687
2000-01-06 1.639842 -3.369141 0.753631 2.160821
2000-01-07 1.910771 -4.025835 -0.773456 1.821734
2000-01-08 1.563340 -4.742109 2.518628 0.588485
2000-01-09 0.132218 -4.341313 1.862238 -0.697081
2000-01-10 0.373116 -4.632546 1.397640 -0.845373
2000-01-11 1.112101 -5.740170 1.803410 0.575382
2000-01-12 0.310180 -5.740034 1.422615 -0.070605
2000-01-13 -0.496999 -6.002253 1.836140 -1.737801
2000-01-14 0.162170 -4.867653 1.883516 -1.377816
2000-01-15 -0.219185 -6.045063 2.179289 0.404574
2000-01-16 -0.492499 -8.239633 2.340800 2.083196
2000-01-17 -1.906515 -7.611583 2.096817 2.474113
2000-01-18 -1.479602 -7.924990 3.659178 1.732523
2000-01-19 -1.877567 -7.676786 2.962439 0.808766
2000-01-20 0.446554 -8.478972 2.770583 1.005155
2000-01-21 0.815523 -7.639916 4.375444 0.975386
2000-01-22 1.232161 -8.011159 3.948103 -0.071826
2000-01-23 0.809309 -10.975806 3.579664 -0.193101
2000-01-24 2.980972 -10.672872 4.323559 -0.837766
2000-01-25 3.361422 -12.640693 5.046673 -0.518537
2000-01-26 1.927294 -11.242640 3.801135 0.896068
2000-01-27 1.249261 -10.931013 2.837784 1.464897
2000-01-28 1.281549 -11.003295 3.951254 1.179706
2000-01-29 0.512112 -10.761742 2.233165 0.174430
2000-01-30 0.262344 -9.672089 4.887404 0.712865
... ... ... ... ...
2002-08-28 4.625208 -1.794282 -8.259543 -1.588357
2002-08-29 4.666648 -1.795497 -8.238837 0.252956
2002-08-30 5.607705 -1.428826 -7.712760 1.868046
2002-08-31 4.946251 -0.112299 -7.518860 1.777350
2002-09-01 3.864572 -0.545530 -6.292253 1.586294
2002-09-02 5.555515 -0.372099 -6.675474 -0.593894
2002-09-03 7.088676 -1.012956 -7.077376 -0.086973
2002-09-04 7.750305 -1.040951 -6.940778 -1.275356
2002-09-05 7.037225 -2.321507 -7.430603 -2.204701
2002-09-06 7.219238 -1.307758 -7.643952 -1.942661
2002-09-07 5.001612 -2.704933 -7.839517 -2.209851
2002-09-08 6.266836 -2.599260 -7.889995 -1.813327
2002-09-09 4.921057 -2.086468 -8.432188 -1.590568
2002-09-10 5.409754 -0.320361 -11.189269 -2.413758
2002-09-11 5.011992 -0.950910 -11.590882 -3.230179
2002-09-12 4.742224 -0.815597 -9.546448 -4.055350
2002-09-13 6.101983 -0.774610 -9.722567 -6.045751
2002-09-14 7.365744 -1.116305 -9.630651 -5.592564
2002-09-15 5.084383 -2.143574 -9.798200 -6.216111
2002-09-16 5.535138 -2.309296 -10.247144 -5.912922
2002-09-17 6.727144 -1.383390 -10.538517 -6.811851
2002-09-18 6.381736 -1.541663 -11.300122 -7.795385
2002-09-19 5.797594 -2.066551 -10.043998 -5.295954
2002-09-20 5.727980 -2.694832 -10.937890 -6.940196
2002-09-21 6.101493 -3.295493 -10.662621 -8.503858
2002-09-22 6.281263 -3.543243 -9.669077 -8.634091
2002-09-23 6.193919 -3.517716 -11.120515 -9.019951
2002-09-24 6.032230 -3.171396 -11.000641 -10.376975
2002-09-25 5.692565 -4.190530 -11.027112 -11.042980
2002-09-26 5.337877 -3.505361 -11.775212 -10.298722

1000 rows × 4 columns

Excel

Reading and writing to Excel files.

Writing to an excel file:


In [122]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from an excel file:


In [123]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])


Out[123]:
A B C D
2000-01-01 -0.425461 -1.403269 -0.044436 0.219559
2000-01-02 0.308576 -2.057619 0.634288 -0.452682
2000-01-03 0.619000 -2.869349 1.237154 -0.308545
2000-01-04 -1.165187 -2.040167 1.047757 0.847036
2000-01-05 -0.214774 -4.301137 0.571458 2.713687
2000-01-06 1.639842 -3.369141 0.753631 2.160821
2000-01-07 1.910771 -4.025835 -0.773456 1.821734
2000-01-08 1.563340 -4.742109 2.518628 0.588485
2000-01-09 0.132218 -4.341313 1.862238 -0.697081
2000-01-10 0.373116 -4.632546 1.397640 -0.845373
2000-01-11 1.112101 -5.740170 1.803410 0.575382
2000-01-12 0.310180 -5.740034 1.422615 -0.070605
2000-01-13 -0.496999 -6.002253 1.836140 -1.737801
2000-01-14 0.162170 -4.867653 1.883516 -1.377816
2000-01-15 -0.219185 -6.045063 2.179289 0.404574
2000-01-16 -0.492499 -8.239633 2.340800 2.083196
2000-01-17 -1.906515 -7.611583 2.096817 2.474113
2000-01-18 -1.479602 -7.924990 3.659178 1.732523
2000-01-19 -1.877567 -7.676786 2.962439 0.808766
2000-01-20 0.446554 -8.478972 2.770583 1.005155
2000-01-21 0.815523 -7.639916 4.375444 0.975386
2000-01-22 1.232161 -8.011159 3.948103 -0.071826
2000-01-23 0.809309 -10.975806 3.579664 -0.193101
2000-01-24 2.980972 -10.672872 4.323559 -0.837766
2000-01-25 3.361422 -12.640693 5.046673 -0.518537
2000-01-26 1.927294 -11.242640 3.801135 0.896068
2000-01-27 1.249261 -10.931013 2.837784 1.464897
2000-01-28 1.281549 -11.003295 3.951254 1.179706
2000-01-29 0.512112 -10.761742 2.233165 0.174430
2000-01-30 0.262344 -9.672089 4.887404 0.712865
... ... ... ... ...
2002-08-28 4.625208 -1.794282 -8.259543 -1.588357
2002-08-29 4.666648 -1.795497 -8.238837 0.252956
2002-08-30 5.607705 -1.428826 -7.712760 1.868046
2002-08-31 4.946251 -0.112299 -7.518860 1.777350
2002-09-01 3.864572 -0.545530 -6.292253 1.586294
2002-09-02 5.555515 -0.372099 -6.675474 -0.593894
2002-09-03 7.088676 -1.012956 -7.077376 -0.086973
2002-09-04 7.750305 -1.040951 -6.940778 -1.275356
2002-09-05 7.037225 -2.321507 -7.430603 -2.204701
2002-09-06 7.219238 -1.307758 -7.643952 -1.942661
2002-09-07 5.001612 -2.704933 -7.839517 -2.209851
2002-09-08 6.266836 -2.599260 -7.889995 -1.813327
2002-09-09 4.921057 -2.086468 -8.432188 -1.590568
2002-09-10 5.409754 -0.320361 -11.189269 -2.413758
2002-09-11 5.011992 -0.950910 -11.590882 -3.230179
2002-09-12 4.742224 -0.815597 -9.546448 -4.055350
2002-09-13 6.101983 -0.774610 -9.722567 -6.045751
2002-09-14 7.365744 -1.116305 -9.630651 -5.592564
2002-09-15 5.084383 -2.143574 -9.798200 -6.216111
2002-09-16 5.535138 -2.309296 -10.247144 -5.912922
2002-09-17 6.727144 -1.383390 -10.538517 -6.811851
2002-09-18 6.381736 -1.541663 -11.300122 -7.795385
2002-09-19 5.797594 -2.066551 -10.043998 -5.295954
2002-09-20 5.727980 -2.694832 -10.937890 -6.940196
2002-09-21 6.101493 -3.295493 -10.662621 -8.503858
2002-09-22 6.281263 -3.543243 -9.669077 -8.634091
2002-09-23 6.193919 -3.517716 -11.120515 -9.019951
2002-09-24 6.032230 -3.171396 -11.000641 -10.376975
2002-09-25 5.692565 -4.190530 -11.027112 -11.042980
2002-09-26 5.337877 -3.505361 -11.775212 -10.298722

1000 rows × 4 columns