Pandas 数据处理

1 数据合并

类似SQL语句的join操作


In [4]:
import numpy as np
import pandas as pd
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
                      'price':[12.33,11.44,33.21,13.23,33.62]})
frame1


Out[4]:
id price
0 ball 12.33
1 pencil 11.44
2 pen 33.21
3 mug 13.23
4 ashtray 33.62

In [6]:
frame2 = pd.DataFrame({'id':['ball','pencil','pen','ashtray'],
                     'color':['white','red','red','black']})
frame2


Out[6]:
color id
0 white ball
1 red pencil
2 red pen
3 black ashtray

In [7]:
pd.merge(frame1,frame2)


Out[7]:
id price color
0 ball 12.33 white
1 pencil 11.44 red
2 pen 33.21 red
3 ashtray 33.62 black

也可以通过制定on关键词,使用条件join


In [8]:
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
                      'color':['white','red','red','black','green'],
                      'brand':['OMG','ABC','ABC','POD','POD']})
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
                      'brand':['OMG','POD','ABC','POD']})
pd.merge(frame1,frame2,on='id')


Out[8]:
brand_x color id brand_y
0 OMG white ball ABC
1 ABC red pencil OMG
2 ABC red pencil POD
3 ABC red pen POD

In [9]:
pd.merge(frame1,frame2,on='brand')


Out[9]:
brand color id_x id_y
0 OMG white ball pencil
1 ABC red pencil ball
2 ABC red pen ball
3 POD black mug pencil
4 POD black mug pen
5 POD green ashtray pencil
6 POD green ashtray pen

使用left_onright_on进行有选择的连接


In [10]:
pd.merge(frame1,frame2,left_on='id',right_on='id')


Out[10]:
brand_x color id brand_y
0 OMG white ball ABC
1 ABC red pencil OMG
2 ABC red pencil POD
3 ABC red pen POD

使用类似left_join,right_join和out_join等方式


In [11]:
pd.merge(frame1,frame2,on='id')


Out[11]:
brand_x color id brand_y
0 OMG white ball ABC
1 ABC red pencil OMG
2 ABC red pencil POD
3 ABC red pen POD

In [12]:
pd.merge(frame1,frame2,on='id',how='outer')


Out[12]:
brand_x color id brand_y
0 OMG white ball ABC
1 ABC red pencil OMG
2 ABC red pencil POD
3 ABC red pen POD
4 POD black mug NaN
5 POD green ashtray NaN

In [13]:
pd.merge(frame1,frame2,on='id',how='left')


Out[13]:
brand_x color id brand_y
0 OMG white ball ABC
1 ABC red pencil OMG
2 ABC red pencil POD
3 ABC red pen POD
4 POD black mug NaN
5 POD green ashtray NaN

In [15]:
pd.merge(frame1,frame2,on='id',how='right')


Out[15]:
brand_x color id brand_y
0 OMG white ball ABC
1 ABC red pencil OMG
2 ABC red pencil POD
3 ABC red pen POD

In [16]:
# 合并多个键
pd.merge(frame1,frame2,on=['id','brand'],how='outer')


Out[16]:
brand color id
0 OMG white ball
1 ABC red pencil
2 ABC red pen
3 POD black mug
4 POD green ashtray
5 OMG NaN pencil
6 POD NaN pencil
7 ABC NaN ball
8 POD NaN pen

根据索引合并


In [18]:
pd.merge(frame1,frame2,right_index=True,left_index=True)


Out[18]:
brand_x color id_x brand_y id_y
0 OMG white ball OMG pencil
1 ABC red pencil POD pencil
2 ABC red pen ABC ball
3 POD black mug POD pen

join 函数更适合根据索引进行合并


In [23]:
#重命名frame2列的名称
frame2.columns=['brand2','id2']
# frame1.join(frame2)


Out[23]:
brand color id brand2 id2
0 OMG white ball OMG pencil
1 ABC red pencil POD pencil
2 ABC red pen ABC ball
3 POD black mug POD pen
4 POD green ashtray NaN NaN

2 拼接

与另外一种数据整合操作叫拼接(concatenation)

numpy 数组拼接


In [1]:
import numpy as np
import pandas as pd
array1 = np.arange(9).reshape((3,3))
array1


Out[1]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [4]:
array2 = np.arange(9).reshape((3,3))+6
array2


Out[4]:
array([[ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [5]:
# 拼接
# axis=1 行拼接
np.concatenate([array1,array2],axis=1)


Out[5]:
array([[ 0,  1,  2,  6,  7,  8],
       [ 3,  4,  5,  9, 10, 11],
       [ 6,  7,  8, 12, 13, 14]])

In [8]:
# axis = 0 列拼接
np.concatenate([array1,array2],axis=0)


Out[8]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

Series 和DataFrame拼接


In [11]:
ser1 = pd.Series(np.random.rand(4),index=[1,2,3,4])
ser1


Out[11]:
1    0.158543
2    0.360468
3    0.804781
4    0.415040
dtype: float64

In [16]:
ser2 = pd.Series(np.random.rand(4),index=[5,6,7,8])
ser2


Out[16]:
5    0.355089
6    0.287637
7    0.744526
8    0.114241
dtype: float64

In [17]:
pd.concat([ser1,ser2])


Out[17]:
1    0.158543
2    0.360468
3    0.804781
4    0.415040
5    0.355089
6    0.287637
7    0.744526
8    0.114241
dtype: float64

concat()函数默认是按照axis=0这条轴拼接,返回Series对象,如果指定axis=1,那么将会返回DataFrame对象


In [18]:
pd.concat([ser1,ser2],axis=1)


Out[18]:
0 1
1 0.158543 NaN
2 0.360468 NaN
3 0.804781 NaN
4 0.415040 NaN
5 NaN 0.355089
6 NaN 0.287637
7 NaN 0.744526
8 NaN 0.114241

上述结果返回的相当于'out_join'操作,可以吧join选项修改为'inner'


In [20]:
pd.concat([ser1,ser1],axis=1,join='inner')


Out[20]:
0 1
1 0.158543 0.158543
2 0.360468 0.360468
3 0.804781 0.804781
4 0.415040 0.415040

创建等级索引,借助keys完成


In [24]:
pd.concat([ser1,ser2],keys=[1,2])


Out[24]:
1  1    0.158543
   2    0.360468
   3    0.804781
   4    0.415040
2  5    0.355089
   6    0.287637
   7    0.744526
   8    0.114241
dtype: float64

使用axis=1的话,将会DataFrame的各列对象


In [25]:
pd.concat([ser1,ser2],keys=[1,2],axis=1)


Out[25]:
1 2
1 0.158543 NaN
2 0.360468 NaN
3 0.804781 NaN
4 0.415040 NaN
5 NaN 0.355089
6 NaN 0.287637
7 NaN 0.744526
8 NaN 0.114241

DataFrame 对象拼接


In [29]:
frame1 = pd.DataFrame(np.random.rand(9).reshape(3,3),index=[1,2,3],columns=['A','B','C'])
frame2 = pd.DataFrame(np.random.rand(9).reshape(3,3),index=[4,5,6],columns=['A','B','C'])
pd.concat([frame1,frame2])


Out[29]:
A B C
1 0.811092 0.404428 0.754960
2 0.167322 0.918865 0.662690
3 0.369280 0.450852 0.537238
4 0.576948 0.743344 0.677287
5 0.491251 0.144968 0.389611
6 0.904182 0.472398 0.647457

In [30]:
pd.concat([frame1,frame2],axis=1)


Out[30]:
A B C A B C
1 0.811092 0.404428 0.754960 NaN NaN NaN
2 0.167322 0.918865 0.662690 NaN NaN NaN
3 0.369280 0.450852 0.537238 NaN NaN NaN
4 NaN NaN NaN 0.576948 0.743344 0.677287
5 NaN NaN NaN 0.491251 0.144968 0.389611
6 NaN NaN NaN 0.904182 0.472398 0.647457

组合数据

无法通过合并和拼接的方法组合数据,例如数据结构索引相同


In [31]:
ser1 = pd.Series(np.random.rand(5),index=[1,2,3,4,5])
ser2 = pd.Series(np.random.rand(4),index=[2,4,5,6])
ser1.combine_first(ser2)


Out[31]:
1    0.647531
2    0.577483
3    0.597037
4    0.335341
5    0.301346
6    0.209635
dtype: float64

In [32]:
ser2.combine_first(ser1)


Out[32]:
1    0.647531
2    0.322013
3    0.597037
4    0.023025
5    0.155955
6    0.209635
dtype: float64

In [34]:
ser1[:3].combine_first(ser2[:3])


Out[34]:
1    0.647531
2    0.577483
3    0.597037
4    0.023025
5    0.155955
dtype: float64

轴向旋转

按等级索引旋转

  • stacking

旋转数据,把列转换成行

  • unstacking

旋转数据,把行转换成列


In [36]:
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
                     index=['white','black','red'],
                     columns=['ball','pen','pencil'])
frame1


Out[36]:
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8

In [39]:
ser5 = frame1.stack()
ser5


Out[39]:
white  ball      0
       pen       1
       pencil    2
black  ball      3
       pen       4
       pencil    5
red    ball      6
       pen       7
       pencil    8
dtype: int64

In [40]:
ser5.unstack()


Out[40]:
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8

In [41]:
ser5.unstack(0)


Out[41]:
white black red
ball 0 3 6
pen 1 4 7
pencil 2 5 8

从“长”格式和“宽”格式旋转


In [43]:
longframe = pd.DataFrame({'color':['white','white','white',
                                   'red','red','red',
                                  'black','black','black'],
                         'item':['ball','pen','mug',
                                'ball','pen','mug',
                                'ball','pen','mug'],
                         'value':np.random.rand(9)})
longframe


Out[43]:
color item value
0 white ball 0.047070
1 white pen 0.527669
2 white mug 0.167968
3 red ball 0.766549
4 red pen 0.728938
5 red mug 0.944948
6 black ball 0.999640
7 black pen 0.930060
8 black mug 0.505569

In [44]:
longframe.pivot('color','item')


Out[44]:
value
item ball mug pen
color
black 0.999640 0.505569 0.930060
red 0.766549 0.944948 0.728938
white 0.047070 0.167968 0.527669

删除


In [55]:
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
                     index=['white','black','red'],
                     columns=['ball','pen','pencil'])
frame1


Out[55]:
ball pen pencil
white 0 1 2
black 3 4 5
red 6 7 8

In [56]:
del frame1['ball']
frame1


Out[56]:
pen pencil
white 1 2
black 4 5
red 7 8

In [57]:
frame1.drop('white')


Out[57]:
pen pencil
black 4 5
red 7 8

3 数据转换

删除重复元素


In [60]:
dframe = pd.DataFrame({'color':['white','white','red','red','white'],
                      'value':[2,1,3,3,2]})
dframe


Out[60]:
color value
0 white 2
1 white 1
2 red 3
3 red 3
4 white 2

使用duplicate()函数,返回Series对象,数值类型为bool类型


In [62]:
dframe.duplicated()


Out[62]:
0    False
1    False
2    False
3     True
4     True
dtype: bool

筛选数据


In [64]:
dframe[dframe.duplicated()]


Out[64]:
color value
3 red 3
4 white 2

In [66]:
dframe.drop_duplicates()


Out[66]:
color value
0 white 2
1 white 1
2 red 3

映射

用映射代替元素


In [69]:
frame = pd.DataFrame({'item':['ball','mug','pen','pencil','ashtray'],
                     'color':['white','rosso','verdo','black','yellow'],
                     'price':[5.56,4.2,1.3,0.56,2.75]})
frame


Out[69]:
color item price
0 white ball 5.56
1 rosso mug 4.20
2 verdo pen 1.30
3 black pencil 0.56
4 yellow ashtray 2.75

如果DataFrame中,有一些元素不正确,在映射关系中,使用旧元素为key,新元素作为value


In [71]:
newcolor = {
    'rosso':'red',
    'verde':'green'
}
frame.replace(newcolor)


Out[71]:
color item price
0 white ball 5.56
1 red mug 4.20
2 verdo pen 1.30
3 black pencil 0.56
4 yellow ashtray 2.75

将NaN转换成0


In [73]:
ser = pd.Series([1,3,np.nan,4,6,np.nan,4])
ser.replace(np.nan,0)


Out[73]:
0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    4.0
dtype: float64

映射添加元素


In [74]:
frame = pd.DataFrame({'item':['ball','mug','pen','pencil','ashtray'],
                     'color':['white','red','green','black','yellow']})
frame


Out[74]:
color item
0 white ball
1 red mug
2 green pen
3 black pencil
4 yellow ashtray

In [78]:
price={
    'ball':5.56,
    'mug':4.2,
    'bottle':1.3,
    'scissors':3.41,
    'pen':1.3,
    'pencil':0.56,
    'ashtray':2.75
}
frame['price']=frame['item'].map(price)
frame


Out[78]:
color item price
0 white ball 5.56
1 red mug 4.20
2 green pen 1.30
3 black pencil 0.56
4 yellow ashtray 2.75

重命名轴索引


In [83]:
# 重命名 index 索引
reindex = {
    0:'first',
    1:'second',
    2:'third',
    3:'fourth',
    4:'fifth'
}
frame.rename(reindex)


Out[83]:
color item price
first white ball 5.56
second red mug 4.20
third green pen 1.30
fourth black pencil 0.56
fifth yellow ashtray 2.75

In [85]:
# 重命名列明,添加columns选项
recolumn = {
    'item':'object',
    'price':'value'
}
frame.rename(index=reindex,columns=recolumn)


Out[85]:
color object value
first white ball 5.56
second red mug 4.20
third green pen 1.30
fourth black pencil 0.56
fifth yellow ashtray 2.75

In [87]:
frame.rename(index={1:'first'},columns={'item':'object'},inplace=True)
frame


Out[87]:
color object price
0 white ball 5.56
first red mug 4.20
2 green pen 1.30
3 black pencil 0.56
4 yellow ashtray 2.75

4 离散化和面元划分

将数据打散为几个类别,把取值范围分割成一个个小区间


In [92]:
results = np.random.randint(0,100,200)
bins=[0,25,50,75,100]
cat = pd.cut(results,bins)
cat


Out[92]:
[(25, 50], (0, 25], (0, 25], (50, 75], (25, 50], ..., (50, 75], (50, 75], (0, 25], (50, 75], (50, 75]]
Length: 200
Categories (4, object): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]

In [95]:
cat.categories


Out[95]:
Index([u'(0, 25]', u'(25, 50]', u'(50, 75]', u'(75, 100]'], dtype='object')

In [100]:
cat.codes


Out[100]:
array([ 1,  0,  0,  2,  1,  2,  0,  2,  0,  1,  0,  3,  1,  3,  3,  2,  0,
        2,  3,  1,  1,  3,  3,  2,  1,  1,  2,  2,  3,  0,  1,  0,  0,  3,
        3,  1,  1,  1,  2,  1,  2,  2,  3,  3,  3,  1,  0,  0,  2,  2,  2,
        0,  0,  2,  1,  2,  0,  1,  3,  3,  2,  0,  0,  2,  0,  0,  2,  2,
        2,  0,  0,  3,  2,  1,  2,  0,  2,  0,  3,  0,  0,  0,  2,  3,  0,
        3,  3,  3,  1,  1,  0,  1,  1,  2,  3,  0,  0,  2,  0,  0,  2,  2,
        1,  2,  3,  0,  1,  0,  0,  1,  2,  1,  3,  0,  2,  0,  0,  2,  2,
        2,  2,  0,  1,  1,  3,  2,  1,  2,  1,  0,  2,  1,  2,  1,  2,  1,
        1,  3,  3,  0,  2,  1,  3,  0,  0,  0,  0,  1,  2,  0,  2,  1,  2,
        3,  0,  1, -1,  0,  0,  0,  3,  1,  2,  1,  2,  3,  1,  1,  0,  1,
        1,  3,  0,  1,  1,  1,  3,  0,  1,  3,  3,  3,  1,  3,  2,  1,  0,
        2,  2,  0,  3,  2,  2,  3,  1,  2,  2,  0,  2,  2], dtype=int8)

In [101]:
pd.value_counts(cat)


Out[101]:
(50, 75]     56
(0, 25]      55
(25, 50]     50
(75, 100]    38
dtype: int64

In [102]:
bin_names = ['unlikely','less likely','likely','high likely']
pd.cut(result,bins,labels=bin_names)


Out[102]:
[less likely, likely, likely, unlikely, unlikely, ..., less likely, high likely, likely, likely, high likely]
Length: 200
Categories (4, object): [unlikely < less likely < likely < high likely]

In [103]:
pd.cut(result,5)


Out[103]:
[(39.6, 59.4], (39.6, 59.4], (59.4, 79.2], (19.8, 39.6], (19.8, 39.6], ..., (19.8, 39.6], (59.4, 79.2], (59.4, 79.2], (59.4, 79.2], (79.2, 99]]
Length: 200
Categories (5, object): [(-0.099, 19.8] < (19.8, 39.6] < (39.6, 59.4] < (59.4, 79.2] < (79.2, 99]]

In [104]:
pd.qcut(result,5)


Out[104]:
[(39, 61.4], (39, 61.4], (61.4, 78.2], [0, 20.8], [0, 20.8], ..., (20.8, 39], (61.4, 78.2], (61.4, 78.2], (61.4, 78.2], (78.2, 99]]
Length: 200
Categories (5, object): [[0, 20.8] < (20.8, 39] < (39, 61.4] < (61.4, 78.2] < (78.2, 99]]

异常值检测


In [106]:
# 随机创建一个只有三列的DataFrame对象,每一列包含了一个100个数据
randFrame = pd.DataFrame(np.random.randn(100,3))
randFrame


Out[106]:
0 1 2
0 0.042855 0.125821 -1.369209
1 0.313007 1.466830 -2.638114
2 -1.043194 -0.010528 -0.848267
3 1.033055 -0.845470 -0.810793
4 1.151180 -0.108188 1.944497
5 0.930607 -0.506807 -0.048236
6 -0.746800 -1.045934 -0.742140
7 -0.685199 1.453668 0.984025
8 -0.326833 -1.358878 -0.170905
9 0.870439 0.893227 -0.167956
10 -1.095913 -2.077859 0.283047
11 -0.390669 -0.134235 1.477105
12 0.665446 -0.740633 -0.375786
13 -0.068157 1.452425 0.373986
14 0.924698 1.075485 0.121636
15 -1.435241 -0.486356 -1.385073
16 -0.548415 0.453212 1.050809
17 -0.247208 2.414842 1.181692
18 0.177077 1.106947 0.078635
19 -0.493999 0.388532 -0.012207
20 -0.241711 1.578980 -0.328364
21 0.934230 0.279991 -0.498294
22 -1.093486 -1.077672 -0.062265
23 1.223079 -0.331191 0.404422
24 0.173034 -0.530792 -0.423373
25 2.239950 0.685327 -1.453176
26 -2.412055 1.007259 0.629088
27 1.730295 0.633144 0.085556
28 -1.899829 -0.594600 1.015551
29 -1.234743 -0.808033 -0.439566
... ... ... ...
70 -0.747051 0.198963 -1.080476
71 -0.612576 -0.399977 0.160355
72 0.427011 1.827715 -0.260126
73 -1.696423 -0.777880 1.022526
74 1.301100 0.763960 2.469218
75 -0.880183 -0.955752 -1.292819
76 1.784311 0.143099 -0.083054
77 0.226177 2.604407 -1.024551
78 -1.360976 2.070420 0.428620
79 0.724338 -0.642064 -1.530405
80 -0.471835 0.822025 0.210076
81 -0.129122 0.003055 0.746299
82 0.565931 0.369160 -0.469432
83 1.179386 -0.897671 -1.276313
84 0.815241 -0.361745 -0.509388
85 0.736177 -1.168714 0.227846
86 -1.715851 -1.325452 -0.694898
87 -0.834178 -0.266720 -0.258379
88 0.273636 -0.105868 1.607285
89 -0.241522 -1.182669 2.458979
90 -1.386777 1.149516 0.133334
91 0.910392 -0.002882 1.165293
92 0.963135 0.532747 -0.868203
93 -0.093653 1.949422 0.888465
94 -0.864895 -0.319510 0.930115
95 -1.032895 1.999221 0.761788
96 0.435177 0.317515 0.651005
97 -0.312506 1.275258 -0.021342
98 1.751284 0.143092 -0.994560
99 1.359067 -2.321510 0.980623

100 rows × 3 columns


In [107]:
randFrame.describe()


Out[107]:
0 1 2
count 100.000000 100.000000 100.000000
mean -0.021385 0.272741 0.075937
std 0.985184 1.120187 0.994919
min -2.412055 -2.321510 -2.638114
25% -0.766489 -0.537447 -0.601746
50% -0.009040 0.171031 0.073825
75% 0.821547 1.083351 0.779704
max 2.239950 3.601513 2.469218

假定比标准差3倍中误差为异常数据


In [108]:
randFrame.std()


Out[108]:
0    0.985184
1    1.120187
2    0.994919
dtype: float64

In [109]:
randFrame[(np.abs(randFrame) > 3 * randFrame.std()).any(1)]


Out[109]:
0 1 2
48 -1.69756 3.601513 1.956437

5 排序


In [3]:
nframe = pd.DataFrame(np.arange(25).reshape(5,5))
nframe


Out[3]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24

In [6]:
new_order = np.random.permutation(5)
new_order


Out[6]:
array([4, 3, 0, 1, 2])

In [7]:
nframe.take(new_order)


Out[7]:
0 1 2 3 4
4 20 21 22 23 24
3 15 16 17 18 19
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14

In [8]:
new_order = [3,4,2]
nframe.take(new_order)


Out[8]:
0 1 2 3 4
3 15 16 17 18 19
4 20 21 22 23 24
2 10 11 12 13 14

随机抽样


In [10]:
sample = np.random.randint(0,len(nframe),size=3)
sample


Out[10]:
array([0, 0, 1])

In [11]:
nframe.take(sample)


Out[11]:
0 1 2 3 4
0 0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9

6 字符串处理

Python内置的字符串处理方法


In [12]:
text = '16 bolton Avenue , Boston'
text.split(',')


Out[12]:
['16 bolton Avenue ', ' Boston']

In [14]:
tokens = [s.strip() for s in text.split(',')]
tokens


Out[14]:
['16 bolton Avenue', 'Boston']

In [15]:
strings =['A+','A','A+','B','BB','BBB','C+']
';'.join(strings)


Out[15]:
'A+;A;A+;B;BB;BBB;C+'

In [16]:
'Boston' in text


Out[16]:
True

In [17]:
text.index('Boston')


Out[17]:
19

In [18]:
text.find('Boston')


Out[18]:
19

In [20]:
text.index('China')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-ce548e4afe75> in <module>()
----> 1 text.index('China')

ValueError: substring not found

In [21]:
text.find('China')


Out[21]:
-1

In [22]:
text.count('e')


Out[22]:
2

In [23]:
text.replace('Boston','China')


Out[23]:
'16 bolton Avenue , China'

In [24]:
text.replace('1','')


Out[24]:
'6 bolton Avenue , Boston'

正则表达式


In [26]:
import re
text = 'This is   an \t odd \n text'
re.split('\s+',text)


Out[26]:
['This', 'is', 'an', 'odd', 'text']

In [28]:
regex = re.compile('\s+')
regex.split(text)


Out[28]:
['This', 'is', 'an', 'odd', 'text']

In [31]:
# findall 函数匹配文本中所有符合条件的正则表达式
text ='This is my address: 16 bolton Avenue, boston'
re.findall('a\w+',text)


Out[31]:
['address']

In [32]:
re.findall('[A,a]\w+',text)


Out[32]:
['address', 'Avenue']

与findall函数相关的函数还有两个,match()和search(),findall()函数返回所有符合条件的模式的子串,而search()函数返回第一个符合条件的子串


In [33]:
re.search('[A,a]\w+',text)


Out[33]:
<_sre.SRE_Match at 0x111f6b988>

In [34]:
search = re.search('[A,a]\w+',text)
search.start()


Out[34]:
11

match 从字符串的开始匹配,如果第一个字符不匹配,则不再搜索字符串内部。


In [36]:
re.match('[A,a]\w+',text)

In [37]:
re.match('T\w+',text)


Out[37]:
<_sre.SRE_Match at 0x111f6bcc8>

In [41]:
match = re.match('T\w+',text)
text[match.start():match.end()]


Out[41]:
'This'

7 数据聚合

Groupby

内部机制:SPLIT-APPLY-COMBINE

  • 分组
    将数据集分成多个组

  • 用函数处理
    用函数处理每一组数据

  • 合并
    将不同组得到的数据合并起来


In [42]:
frame = pd.DataFrame({'color':['white','red','green','red','green'],
                     'object':['pen','pencil','pencil','ashtray','pen'],
                     'price1':[5.56,4.20,1.30,0.56,2.75],
                     'price2':[4.75,4.12,1.60,0.75,3.15]})
frame


Out[42]:
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15

In [50]:
group=frame['price1'].groupby(frame['color'])
group.groups


Out[50]:
{'green': [2, 4], 'red': [1, 3], 'white': [0]}

In [51]:
group.mean()


Out[51]:
color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [53]:
group.sum()


Out[53]:
color
green    4.05
red      4.76
white    5.56
Name: price1, dtype: float64

等级分组

使用多列,使用多个关键字来进行分组


In [54]:
ggroup = frame['price1'].groupby([frame['color'],frame['object']])
ggroup.groups


Out[54]:
{('green', 'pen'): [4],
 ('green', 'pencil'): [2],
 ('red', 'ashtray'): [3],
 ('red', 'pencil'): [1],
 ('white', 'pen'): [0]}

In [55]:
ggroup.sum()


Out[55]:
color  object 
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64

In [56]:
frame[['price1','price2']].groupby(frame['color']).mean()


Out[56]:
price1 price2
color
green 2.025 2.375
red 2.380 2.435
white 5.560 4.750

In [57]:
frame.groupby(frame['color']).mean()


Out[57]:
price1 price2
color
green 2.025 2.375
red 2.380 2.435
white 5.560 4.750

8 组迭代

groupby 还支持迭代操作


In [58]:
for name,group in frame.groupby('color'):
    print name
    print group


green
   color  object  price1  price2
2  green  pencil    1.30    1.60
4  green     pen    2.75    3.15
red
  color   object  price1  price2
1   red   pencil    4.20    4.12
3   red  ashtray    0.56    0.75
white
   color object  price1  price2
0  white    pen    5.56    4.75

链式转换


In [60]:
frame['price1'].groupby(frame['color']).mean()


Out[60]:
color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [61]:
frame.groupby(frame['color'])['price1'].mean()


Out[61]:
color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

分组函数


In [62]:
group = frame.groupby('color')
group['price1'].quantile(0.5)


Out[62]:
color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [64]:
def range(series):
    return series.max() - series.min()
group['price1'].agg(range)


Out[64]:
color
green    1.45
red      3.64
white    0.00
Name: price1, dtype: float64

In [65]:
group.agg(range)


Out[65]:
price1 price2
color
green 1.45 1.55
red 3.64 3.37
white 0.00 0.00

In [66]:
group['price1'].agg(['mean','std',range])


Out[66]:
mean std range
color
green 2.025 1.025305 1.45
red 2.380 2.573869 3.64
white 5.560 NaN 0.00

9 高级数据聚合


In [67]:
frame


Out[67]:
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15

In [68]:
sums = frame.groupby('color').sum().add_prefix('tot_')
sums


Out[68]:
tot_price1 tot_price2
color
green 4.05 4.75
red 4.76 4.87
white 5.56 4.75

In [70]:
pd.merge(frame,sums,left_on='color',right_index=True)


Out[70]:
color object price1 price2 tot_price1 tot_price2
0 white pen 5.56 4.75 5.56 4.75
1 red pencil 4.20 4.12 4.76 4.87
3 red ashtray 0.56 0.75 4.76 4.87
2 green pencil 1.30 1.60 4.05 4.75
4 green pen 2.75 3.15 4.05 4.75

In [71]:
frame.groupby('color').transform(np.sum).add_prefix('tot_')


Out[71]:
tot_object tot_price1 tot_price2
0 pen 5.56 4.75
1 pencil 4.20 4.12
2 pencil 1.30 1.60
3 ashtray 0.56 0.75
4 pen 2.75 3.15

In [73]:
frame = pd.DataFrame({'color':['white','black','white','white','black','black'],
                     'status':['up','up','down','down','down','up'],
                     'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
                     'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})
frame


Out[73]:
color status value1 value2
0 white up 12.33 11.23
1 black up 14.55 31.80
2 white down 22.34 29.99
3 white down 27.84 31.18
4 black down 23.40 18.25
5 black up 18.33 22.44

In [74]:
frame.groupby(['color','status']).apply(lambda x:x.max())


Out[74]:
color status value1 value2
color status
black down black down 23.40 18.25
up black up 18.33 31.80
white down white down 27.84 31.18
up white up 12.33 11.23