Pandas is a Python data analysis library. It ships with Anaconda, so installing Anaconda also installs pandas.

1 Basic data structures

Pandas provides two main data structures:

  • Series
  • DataFrame

A Series stores one-dimensional data, while a DataFrame stores more complex, multi-dimensional data.

Series

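A Series represents one-dimensional data and has a simple internal structure: two associated arrays. The main array holds the values, and each element of the main array has an associated label stored in the index array.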

index value
0 12
1 -4
2 22
3 0

In [1]:
import numpy as np
import pandas as pd
a = pd.Series([12,-4,22,0])
a


Out[1]:
0    12
1    -4
2    22
3     0
dtype: int64

In [2]:
# Replace the default labels with meaningful ones
a = pd.Series([12,-4,22,0],index=['a','b','c','d'])
print(a)


a    12
b    -4
c    22
d     0
dtype: int64

In [3]:
print(a.values)
print(a.index)


[12 -4 22  0]
Index([u'a', u'b', u'c', u'd'], dtype='object')

Selecting data


In [4]:
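print(a.iloc[2])     # by position; plain a[2] is deprecated for non-integer indexes in recent pandas
print(a['b'])        # by label
print(a[0:2])        # slice by position
print(a[['b','d']])  # by a list of labels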


22
-4
a    12
b    -4
dtype: int64
b   -4
d    0
dtype: int64

Assigning values


In [5]:
a.iloc[2]=10  # assign by position
a['d']=-5     # assign by label

Creating a Series from a NumPy array


In [6]:
arr =np.array([1,2,3,4])
s = pd.Series(arr)
s


Out[6]:
0    1
1    2
2    3
3    4
dtype: int64

In [7]:
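# In the pandas version used here, pd.Series(s) does not copy the data,
# so modifying s1 also modifies s
s1 = pd.Series(s)
s1[0]=-10
print(s)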


0   -10
1     2
2     3
3     4
dtype: int64
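The output above shows that s changed although only s1 was assigned to, because the two objects shared their data. A minimal sketch of avoiding this, assuming an independent copy is what you want (the names original and independent are illustrative only): call copy() before modifying.

```python
import numpy as np
import pandas as pd

original = pd.Series(np.array([1, 2, 3, 4]))

# copy() returns a new Series that owns its own data
independent = original.copy()
independent[0] = -10

print(original[0])     # the original value is untouched
print(independent[0])  # only the copy was modified
```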

Filtering data


In [8]:
s[s>2]


Out[8]:
2    3
3    4
dtype: int64

Mathematical operations


In [9]:
s/2


Out[9]:
0   -5.0
1    1.0
2    1.5
3    2.0
dtype: float64

In [10]:
np.log(s)


Out[10]:
0         NaN
1    0.693147
2    1.098612
3    1.386294
dtype: float64

NaN

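When a field is empty or a value is not a valid number (such as the logarithm of a negative number above), pandas returns NaN (Not a Number).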


In [1]:
import numpy as np
import pandas as pd
s2 = pd.Series([5,-3,np.nan,14])  # np.nan; the np.NaN alias was removed in NumPy 2.0
s2


Out[1]:
0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [2]:
s2.isnull()


Out[2]:
0    False
1    False
2     True
3    False
dtype: bool

In [4]:
s2.notnull()


Out[4]:
0     True
1     True
2    False
3     True
dtype: bool

In [5]:
# Use the boolean mask as a filter condition
s2[s2.notnull()]


Out[5]:
0     5.0
1    -3.0
3    14.0
dtype: float64

In [6]:
s2[s2.isnull()]


Out[6]:
2   NaN
dtype: float64

Creating a Series from a dict


In [8]:
mydic = {'red':200,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydic)
myseries


Out[8]:
blue      1000
orange    1000
red        200
yellow     500
dtype: int64

In [9]:
colors = ['red','yellow','orange','blue','green']
mySeries = pd.Series(mydic, index=colors)
mySeries


Out[9]:
red        200.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

Operations between Series objects


In [10]:
mydict2 = {'red':400,'yellow':1000,'black':700}
mySeries2 = pd.Series(mydict2)
mySeries + mySeries2


Out[10]:
black        NaN
blue         NaN
green        NaN
orange       NaN
red        600.0
yellow    1500.0
dtype: float64
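In the result above, every label that appears in only one of the two Series becomes NaN. As a sketch of one alternative, assuming missing labels should count as zero, the add() method accepts a fill_value argument. The Series names below are made up for illustration.

```python
import pandas as pd

stock_a = pd.Series({'red': 200, 'blue': 1000})
stock_b = pd.Series({'red': 400, 'black': 700})

# fill_value=0 treats a label missing from one side as 0 instead of NaN
total = stock_a.add(stock_b, fill_value=0)
print(total)
```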

DataFrame

A DataFrame is a tabular structure similar to an Excel spreadsheet. It was designed to extend the Series to multiple dimensions.

index color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6

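Unlike a Series, a DataFrame has two index arrays:

  1. The row index, similar to the index of a Series
  2. The column index, a set of labels in which each label is associated with one column of data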

In [3]:
import numpy as np
import pandas as pd
data = {'color':['blue','green','yellow'],'object':['ball','pen','pencil'],'price':[1.2,1.0,0.6]}
frame = pd.DataFrame(data)
frame


Out[3]:
color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6

You can also select only the columns you are interested in


In [3]:
frame2 = pd.DataFrame(data,columns=['object','price'])
frame2


Out[3]:
object price
0 ball 1.2
1 pen 1.0
2 pencil 0.6

Setting the index values


In [4]:
frame2 = pd.DataFrame(data,index=['one','two','three'])
frame2


Out[4]:
color object price
one blue ball 1.2
two green pen 1.0
three yellow pencil 0.6

Other ways to create a DataFrame


In [7]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame3


Out[7]:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15

Selecting elements


In [9]:
frame.columns


Out[9]:
Index([u'color', u'object', u'price'], dtype='object')

In [10]:
frame.index


Out[10]:
RangeIndex(start=0, stop=3, step=1)

In [11]:
frame.values


Out[11]:
array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6]], dtype=object)

In [12]:
frame['price']


Out[12]:
0    1.2
1    1.0
2    0.6
Name: price, dtype: float64

In [13]:
frame.loc[2]  # .ix was removed in pandas 1.0; use .loc (label) or .iloc (position)


Out[13]:
color     yellow
object    pencil
price        0.6
Name: 2, dtype: object

In [16]:
# Select several rows
frame.loc[[0,2]]


Out[16]:
color object price
0 blue ball 1.2
2 yellow pencil 0.6

In [17]:
# A slice selects rows by position
frame[0:1]


Out[17]:
color object price
0 blue ball 1.2

In [18]:
frame[0:3]


Out[18]:
color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6

In [19]:
# Select a single element
frame['object'][2]


Out[19]:
'pencil'
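The chained form frame['object'][2] first selects a column and then an element. As a hedged sketch of the single-step alternatives that pandas offers, the example below rebuilds the same frame and reads the same cell with loc, iloc and at. The variable names on the left are illustrative.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow'],
        'object': ['ball', 'pen', 'pencil'],
        'price': [1.2, 1.0, 0.6]}
frame = pd.DataFrame(data)

by_label = frame.loc[2, 'object']   # row label 2, column label 'object'
by_position = frame.iloc[2, 1]      # third row, second column
scalar = frame.at[2, 'price']       # single value looked up by label
print(by_label, by_position, scalar)
```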

Assignment


In [23]:
frame.index.name='id'
frame.columns.name='item'
frame


Out[23]:
item color object price
id
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6

In [24]:
# Add a new column
frame['new']=12 # the scalar is repeated for every row
frame


Out[24]:
item color object price new
id
0 blue ball 1.2 12
1 green pen 1.0 12
2 yellow pencil 0.6 12

In [25]:
frame['new'] = [3.0,1.3,2.2]
frame


Out[25]:
item color object price new
id
0 blue ball 1.2 3.0
1 green pen 1.0 1.3
2 yellow pencil 0.6 2.2

In [26]:
# Update the column from a Series
ser = pd.Series(np.arange(3))
frame['new']=ser
frame


Out[26]:
item color object price new
id
0 blue ball 1.2 0
1 green pen 1.0 1
2 yellow pencil 0.6 2

Membership of elements


In [30]:
frame.isin([1.0,'pen'])


Out[30]:
item color object price new
id
0 False False False False
1 False True True True
2 False False False False

In [31]:
frame[frame.isin([1.0,'pen'])]


Out[31]:
item color object price new
id
0 NaN NaN NaN NaN
1 NaN pen 1.0 1.0
2 NaN NaN NaN NaN
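Applying isin() to the whole frame masks individual cells, as shown above. A small sketch of the row-oriented variant, assuming the goal is to keep complete rows whose 'object' value is in a list: call isin() on one column and use the resulting boolean Series as the filter.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow'],
                      'object': ['ball', 'pen', 'pencil'],
                      'price': [1.2, 1.0, 0.6]})

# isin() on a single column gives a boolean Series, which selects whole rows
mask = frame['object'].isin(['pen', 'ball'])
selected = frame[mask]
print(selected)
```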

Deleting a column


In [32]:
del frame['new']
frame


Out[32]:
item color object price
id
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6

Filtering elements


In [33]:
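# The output below is from Python 2; on Python 3, comparing the string columns with a number may raise TypeError
frame[frame<3]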


Out[33]:
item color object price
id
0 NaN NaN 1.2
1 NaN NaN 1.0
2 NaN NaN 0.6
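The comparison above is applied to every cell, so the text columns turn into NaN. A minimal sketch of the more common pattern, assuming you want the rows whose price is below a threshold (the value 1.1 is arbitrary): compare a single numeric column and filter with the result.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow'],
                      'object': ['ball', 'pen', 'pencil'],
                      'price': [1.2, 1.0, 0.6]})

# The condition is evaluated on one column, then used to keep matching rows
cheap = frame[frame['price'] < 1.1]
print(cheap)
```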

Creating a DataFrame from a nested dict

pandas interprets the outer keys as column names and the inner keys as index labels.


In [34]:
nestdict = {'red':{2012:22,2013:33},'white':{2011:13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:18}}
frame2 = pd.DataFrame(nestdict)
frame2


Out[34]:
blue red white
2011 17 NaN 13
2012 27 22.0 22
2013 18 33.0 16

Transposing a DataFrame


In [36]:
frame2.T


Out[36]:
2011 2012 2013
blue 17.0 27.0 18.0
red NaN 22.0 33.0
white 13.0 22.0 16.0

The Index object


In [3]:
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser.index


Out[3]:
Index([u'red', u'blue', u'yellow', u'white', u'green'], dtype='object')

Index-related methods: idxmin() and idxmax() return the labels of the smallest and largest values


In [4]:
ser.idxmin()


Out[4]:
'blue'

In [5]:
ser.idxmax()


Out[5]:
'white'

An index with duplicate labels


In [6]:
serd = pd.Series(range(6),index=['white','white','blue','green','green','yellow'])
serd


Out[6]:
white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

In [7]:
serd['white']


Out[7]:
white    0
white    1
dtype: int64

In [10]:
serd.index.is_unique


Out[10]:
False
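Once is_unique reports False, a typical next step is to decide what to do with the repeated labels. The sketch below shows two options that pandas supports, keeping the first value per label or summing the values per label. Which one is appropriate depends on the data, and the variable names are illustrative.

```python
import pandas as pd

serd = pd.Series(range(6),
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# Keep only the first occurrence of each label
first_only = serd[~serd.index.duplicated(keep='first')]

# Or combine the values that share a label
summed = serd.groupby(level=0).sum()

print(first_only)
print(summed)
```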

Other uses of the index

  • Reindexing
  • Dropping
  • Alignment

Reindexing


In [11]:
ser = pd.Series([2,5,7,4],index=['one','two','three','four'])
ser


Out[11]:
one      2
two      5
three    7
four     4
dtype: int64

In [13]:
ser.reindex(['three','four','five','one'])


Out[13]:
three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64

The reindex() call above dropped the label 'two' and added the label 'five', whose value is NaN.

Filling labels automatically


In [5]:
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3


Out[5]:
0    1
3    5
5    6
6    3
dtype: int64

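The index above is incomplete: 1, 2 and 4 are missing. A common requirement is to fill the gaps and obtain a complete sequence, which is controlled by the method option of reindex().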


In [6]:
ser3.reindex(range(6),method='ffill')


Out[6]:
0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

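With method='ffill', every missing index takes the value of the nearest smaller index. To take the value of the next larger index instead, change the method option to 'bfill'.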


In [7]:
ser3.reindex(range(6),method='bfill')


Out[7]:
0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64
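Besides copying a neighboring value, reindex() can also insert a constant. A short sketch, assuming 0 is a sensible default for the missing labels (that choice is an assumption of this example):

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# Labels absent from ser3 receive the constant 0
filled = ser3.reindex(range(6), fill_value=0)
print(filled)
```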

Dropping


In [8]:
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser


Out[8]:
red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [9]:
ser.drop('yellow')


Out[9]:
red      0.0
blue     1.0
white    3.0
dtype: float64

Dropping from a DataFrame


In [11]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
frame


Out[11]:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15

In [12]:
# Drop rows
frame.drop(['blue','yellow'])


Out[12]:
ball pen pencil paper
red 0 1 2 3
white 12 13 14 15

To drop columns, pass the column names and also set axis=1


In [14]:
frame.drop(['pen','pencil'],axis=1)


Out[14]:
ball paper
red 0 3
blue 4 7
yellow 8 11
white 12 15

Arithmetic and data alignment


In [15]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1+s2


Out[15]:
black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

In [16]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index=['blue','green','white','yellow'],
                     columns=['mug','pen','ball'])
frame1+frame2


Out[16]:
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN

Operations between data structures

Arithmetic methods

  • add()
  • sub()
  • div()
  • mul()

In [17]:
frame1.add(frame2)


Out[17]:
ball mug paper pen pencil
blue 6.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 20.0 NaN NaN 20.0 NaN
yellow 19.0 NaN NaN 19.0 NaN

In [18]:
frame1.sub(frame2)


Out[18]:
ball mug paper pen pencil
blue 2.0 NaN NaN 4.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 4.0 NaN NaN 6.0 NaN
yellow -3.0 NaN NaN -1.0 NaN

In [19]:
frame1.div(frame2)


Out[19]:
ball mug paper pen pencil
blue 2.000000 NaN NaN 5.000000 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 1.500000 NaN NaN 1.857143 NaN
yellow 0.727273 NaN NaN 0.900000 NaN

In [20]:
frame1.mul(frame2)


Out[20]:
ball mug paper pen pencil
blue 8.0 NaN NaN 5.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 96.0 NaN NaN 91.0 NaN
yellow 88.0 NaN NaN 90.0 NaN
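One practical difference between these methods and the plain operators is that the methods accept extra arguments. As a hedged sketch, the example below passes fill_value=0 to add(), assuming that a cell present in only one frame should be combined with zero.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# A cell present in only one frame is combined with 0;
# a cell missing from both frames is still NaN
total = frame1.add(frame2, fill_value=0)
print(total)
```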

Operations between a DataFrame and a Series


In [21]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
frame


Out[21]:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15

In [22]:
ser = pd.Series(np.arange(4),index=['ball','pen','pencil','paper'])
ser


Out[22]:
ball      0
pen       1
pencil    2
paper     3
dtype: int32

In [23]:
frame - ser


Out[23]:
ball pen pencil paper
red 0 0 0 0
blue 4 4 4 4
yellow 8 8 8 8
white 12 12 12 12

In [24]:
ser['mug']=9
frame - ser


Out[24]:
ball mug paper pen pencil
red 0 NaN 0 0 0
blue 4 NaN 4 4 4
yellow 8 NaN 8 8 8
white 12 NaN 12 12 12
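In the examples above the Series labels are matched against the column names, so the same Series is subtracted from every row. A minimal sketch of the other direction, assuming you want to subtract one value per row: use sub() with axis=0 so that the Series is matched against the row index.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# axis=0 aligns the Series with the row index,
# so each row has its own 'ball' value subtracted
relative = frame.sub(frame['ball'], axis=0)
print(relative)
```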

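Function application and mapping

Element-wise functions

Like NumPy arrays, pandas objects work with the many universal functions (ufuncs) that operate element by element.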


In [25]:
frame


Out[25]:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15

In [26]:
np.sqrt(frame)


Out[26]:
ball pen pencil paper
red 0.000000 1.000000 1.414214 1.732051
blue 2.000000 2.236068 2.449490 2.645751
yellow 2.828427 3.000000 3.162278 3.316625
white 3.464102 3.605551 3.741657 3.872983

Functions applied by row or by column


In [27]:
f = lambda x:x.max()-x.min()
frame.apply(f)


Out[27]:
ball      12
pen       12
pencil    12
paper     12
dtype: int64

In [28]:
# Apply the function to each row
frame.apply(f, axis=1)


Out[28]:
red       3
blue      3
yellow    3
white     3
dtype: int64

The function passed to apply() does not have to return a scalar; it can also return a Series


In [29]:
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)


Out[29]:
ball pen pencil paper
min 0 1 2 3
max 12 13 14 15

In [30]:
frame.apply(f,axis=1)


Out[30]:
min max
red 0 3
blue 4 7
yellow 8 11
white 12 15
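apply() hands a whole row or column to the function. For element-by-element mapping, a Series offers map(). The sketch below is only an illustration of that method; the 'item-%d' label format is made up for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# map() calls the function once for each element of the column
labels = frame['pen'].map(lambda v: 'item-%d' % v)
print(labels)
```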

Statistical functions


In [31]:
frame.sum()


Out[31]:
ball      24
pen       28
pencil    32
paper     36
dtype: int64

In [33]:
frame.mean()


Out[33]:
ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

In [34]:
frame.describe()


Out[34]:
ball pen pencil paper
count 4.000000 4.000000 4.000000 4.000000
mean 6.000000 7.000000 8.000000 9.000000
std 5.163978 5.163978 5.163978 5.163978
min 0.000000 1.000000 2.000000 3.000000
25% 3.000000 4.000000 5.000000 6.000000
50% 6.000000 7.000000 8.000000 9.000000
75% 9.000000 10.000000 11.000000 12.000000
max 12.000000 13.000000 14.000000 15.000000

Sorting

Series

A Series has only one index, so sort_index() sorts by that index


In [38]:
ser =pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser


Out[38]:
red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [39]:
ser.sort_index()


Out[39]:
blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

Sorting a DataFrame

A DataFrame can be sorted along either axis. To sort by the row index, call sort_index() directly; to sort by the column labels, pass axis=1.


In [40]:
frame


Out[40]:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15

In [41]:
frame.sort_index()


Out[41]:
ball pen pencil paper
blue 4 5 6 7
red 0 1 2 3
white 12 13 14 15
yellow 8 9 10 11

In [42]:
frame.sort_index(axis=1)


Out[42]:
ball paper pen pencil
red 0 3 1 2
blue 4 7 5 6
yellow 8 11 9 10
white 12 15 13 14

Sorting by values


In [46]:
ser.sort_values()


Out[46]:
blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

In [47]:
frame.sort_values(by='pen')


Out[47]:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15

In [48]:
# Sort by several columns
frame.sort_values(by=['pen','pencil'])


Out[48]:
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
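Because this frame is already in ascending order, the two calls above return it unchanged. A short sketch that makes the effect of sort_values() visible, assuming a descending sort on the 'pen' column:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# ascending=False puts the largest 'pen' value first
by_pen_desc = frame.sort_values(by='pen', ascending=False)
print(by_pen_desc)
```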

Ranking is closely related to sorting: it assigns each element of the sequence a position, starting from 1


In [49]:
ser.rank()


Out[49]:
red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [50]:
ser.rank(method='first')


Out[50]:
red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64
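The two outputs above are identical because ser contains no repeated values. The method option only matters when there are ties. A minimal sketch with a made-up Series that contains a tie:

```python
import pandas as pd

tied = pd.Series([7, 3, 7, 1])

# Default method='average': tied values share the mean of their positions
average_rank = tied.rank()

# method='first': ties are ranked in the order they appear
first_rank = tied.rank(method='first')

print(average_rank.tolist())
print(first_rank.tolist())
```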

In [52]:
# Rank in descending order
ser.rank(ascending=False)


Out[52]:
red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64

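Correlation and covariance

Correlation and covariance are two important statistics; pandas computes them with corr() and cov().

  • Variance $$D(X)=E([X - E(X)]^2)$$
  • Covariance $$COV(X,Y)=E([X-E(X)][Y-E(Y)])$$
  • Correlation coefficient $$\rho(X,Y)=\frac{COV(X,Y)}{\sqrt{D(X)} \times \sqrt{D(Y)}}$$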
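As a worked check of these formulas, the sketch below computes the covariance and the correlation coefficient by hand and compares them with cov() and corr(). One assumption to note: pandas estimates both quantities from a sample and divides by n - 1, whereas the expectation in the formulas corresponds to dividing by n. The ratio that defines the correlation is the same either way.

```python
import numpy as np
import pandas as pd

seq1 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2])
seq2 = pd.Series([1, 2, 3, 4, 4, 3, 2, 1])

n = len(seq1)
dx = seq1 - seq1.mean()   # X - E(X)
dy = seq2 - seq2.mean()   # Y - E(Y)

# Sample covariance, using the n - 1 normalization that pandas applies
cov_manual = (dx * dy).sum() / (n - 1)

# Correlation = covariance / (sqrt(D(X)) * sqrt(D(Y)))
corr_manual = cov_manual / (np.sqrt(seq1.var()) * np.sqrt(seq2.var()))

print(cov_manual, seq1.cov(seq2))
print(corr_manual, seq1.corr(seq2))
```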

In [53]:
seq1 = pd.Series([3,4,3,4,5,4,3,2],index=['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2 = pd.Series([1,2,3,4,4,3,2,1],index=['2006','2007','2008','2009','2010','2011','2012','2013'])
print(seq1.corr(seq2))
print(seq1.cov(seq2))


0.774596669241
0.857142857143

In [54]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame2


Out[54]:
ball pen pencil paper
red 1 4 3 6
blue 4 5 6 1
yellow 3 3 1 5
white 4 1 6 4

In [61]:
frame2.corr()


Out[61]:
ball pen pencil paper
ball 1.000000 -0.276026 0.577350 -0.763763
pen -0.276026 1.000000 -0.079682 -0.361403
pencil 0.577350 -0.079682 1.000000 -0.692935
paper -0.763763 -0.361403 -0.692935 1.000000

In [62]:
frame2.cov()


Out[62]:
ball pen pencil paper
ball 2.000000 -0.666667 2.000000 -2.333333
pen -0.666667 2.916667 -0.333333 -1.333333
pencil 2.000000 -0.333333 6.000000 -3.666667
paper -2.333333 -1.333333 -3.666667 4.666667

In [64]:
frame2.corrwith(ser)


Out[64]:
ball     -0.140028
pen      -0.869657
pencil    0.080845
paper     0.595854
dtype: float64

In [65]:
frame2.corrwith(frame)


Out[65]:
ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

NaN data


In [66]:
ser = pd.Series([0,1,2,np.nan,9],index=['red','blue','yellow','white','green'])
ser


Out[66]:
red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

In [69]:
ser['green']=None
ser


Out[69]:
red       0.0
blue      1.0
yellow    2.0
white     NaN
green     NaN
dtype: float64

Filtering out NaN values


In [70]:
ser.dropna()


Out[70]:
red       0.0
blue      1.0
yellow    2.0
dtype: float64

In [71]:
ser[ser.notnull()]


Out[71]:
red       0.0
blue      1.0
yellow    2.0
dtype: float64

In [72]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                     index=['blue','green','red'],
                     columns=['ball','mug','pen'])
frame3


Out[72]:
ball mug pen
blue 6.0 NaN 6.0
green NaN NaN NaN
red 2.0 NaN 5.0

With its default arguments, dropna() on a DataFrame removes every row that contains at least one NaN (pass axis=1 to remove columns instead)


In [73]:
frame3.dropna()


Out[73]:
ball mug pen

In [76]:
# Improvement: drop only the rows in which every value is NaN
frame3.dropna(how='all')


Out[76]:
ball mug pen
blue 6.0 NaN 6.0
red 2.0 NaN 5.0
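The same how='all' rule can be applied to columns. A small sketch, assuming the goal is to remove the 'mug' column because it holds no data at all:

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# axis=1 examines columns; how='all' drops a column only if every value is NaN
no_empty_cols = frame3.dropna(axis=1, how='all')
print(no_empty_cols)
```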

Filling NaN values


In [75]:
frame3.fillna(0)


Out[75]:
ball mug pen
blue 6.0 0.0 6.0
green 0.0 0.0 0.0
red 2.0 0.0 5.0

In [77]:
frame3.fillna({'ball':1,'mug':0,'pen':99})


Out[77]:
ball mug pen
blue 6.0 0.0 6.0
green 1.0 0.0 99.0
red 2.0 0.0 5.0

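Hierarchical indexing and leveling

Hierarchical indexing is an important pandas feature: a single axis can have several index levels, which lets you work with multi-dimensional data as if it were a two-dimensional structure.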


In [79]:
mser = pd.Series(np.random.rand(8),
                index=[['white','white','white','blue','blue','red','red','red'],
                      ['up','down','right','up','down','up','down','left']])
mser


Out[79]:
white  up       0.913087
       down     0.055399
       right    0.629899
blue   up       0.310707
       down     0.638752
red    up       0.472957
       down     0.211078
       left     0.984104
dtype: float64

In [81]:
mser.index


Out[81]:
MultiIndex(levels=[[u'blue', u'red', u'white'], [u'down', u'left', u'right', u'up']],
           labels=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])

In [82]:
mser['white']


Out[82]:
up       0.913087
down     0.055399
right    0.629899
dtype: float64

In [83]:
mser[:,'up']


Out[83]:
white    0.913087
blue     0.310707
red      0.472957
dtype: float64

In [85]:
mser['white','up']


Out[85]:
0.91308667237103303

Use unstack() to convert it into a DataFrame, with the second index level becoming the columns


In [86]:
mser.unstack()


Out[86]:
down left right up
blue 0.638752 NaN NaN 0.310707
red 0.211078 0.984104 NaN 0.472957
white 0.055399 NaN 0.629899 0.913087

The inverse operation, stack(), converts a DataFrame into a Series


In [87]:
frame.stack()


Out[87]:
red     ball       0
        pen        1
        pencil     2
        paper      3
blue    ball       4
        pen        5
        pencil     6
        paper      7
yellow  ball       8
        pen        9
        pencil    10
        paper     11
white   ball      12
        pen       13
        pencil    14
        paper     15
dtype: int32
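To make the inverse relationship concrete, the sketch below stacks a frame and then unstacks the result again. It assumes the same 4x4 frame used throughout this section; the names stacked and restored are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

stacked = frame.stack()        # Series indexed by (row label, column label)
restored = stacked.unstack()   # the inner level becomes the columns again

print(stacked.loc[('blue', 'pen')])
print(restored)
```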

For a DataFrame, hierarchical indexes can be defined on both the rows and the columns


In [88]:
mframe = pd.DataFrame(np.random.randn(16).reshape((4,4)),
                     index=[['white','white','red','red'],['up','down','up','down']],
                     columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe


Out[88]:
                 pen               paper
                   1         2         1         2
white up   -1.729715  1.571391 -1.139209  1.289390
      down  0.560778 -2.164030  0.162912 -0.286249
red   up    0.828617 -1.035014 -1.417486  1.044125
      down  0.416866 -0.513328 -0.439312 -0.119890

Reordering and sorting levels


In [91]:
# Name the levels of the index and of the columns
mframe.columns.names=['object','id']
mframe.index.names=['colors','status']
mframe


Out[91]:
object              pen               paper
id                    1         2         1         2
colors status
white  up     -1.729715  1.571391 -1.139209  1.289390
       down    0.560778 -2.164030  0.162912 -0.286249
red    up      0.828617 -1.035014 -1.417486  1.044125
       down    0.416866 -0.513328 -0.439312 -0.119890

In [92]:
mframe.swaplevel('colors','status')


Out[92]:
object              pen               paper
id                    1         2         1         2
status colors
up     white  -1.729715  1.571391 -1.139209  1.289390
down   white   0.560778 -2.164030  0.162912 -0.286249
up     red     0.828617 -1.035014 -1.417486  1.044125
down   red     0.416866 -0.513328 -0.439312 -0.119890

In [93]:
mframe.sort_index(level='colors')  # sortlevel() is no longer available in current pandas


Out[93]:
object              pen               paper
id                    1         2         1         2
colors status
red    down    0.416866 -0.513328 -0.439312 -0.119890
       up      0.828617 -1.035014 -1.417486  1.044125
white  down    0.560778 -2.164030  0.162912 -0.286249
       up     -1.729715  1.571391 -1.139209  1.289390

Summary statistics by level


In [94]:
mframe.groupby(level='colors').sum()  # sum(level=...) was removed in pandas 2.0


Out[94]:
object       pen               paper
id             1         2         1         2
colors
red     1.245484 -1.548342 -1.856798  0.924236
white  -1.168937 -0.592638 -0.976297  1.003141

To aggregate across the columns by one of the column levels instead


In [95]:
mframe.T.groupby(level='id').sum().T  # replaces sum(level='id', axis=1), removed in pandas 2.0


Out[95]:
id                    1         2
colors status
white  up     -2.868924  2.860781
       down    0.723690 -2.450278
red    up     -0.588869  0.009111
       down   -0.022446 -0.633218
