5. Pandas - GroupBy


In [2]:
%pylab inline
from pandas import Series, DataFrame
import pandas as pd


Populating the interactive namespace from numpy and matplotlib

GroupBy技術

分組運算是一種 Split-Apply-Combine的過程,類似於MapReduce的模式

![Pandas GroupBy](http://i.stack.imgur.com/sgCn1.jpg)

In [3]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)}, columns = ['key1', 'key2', 'data1', 'data2'])
df


Out[3]:
key1 key2 data1 data2
0 a one -1.678498 -0.692768
1 a two -0.192626 1.369141
2 b one 0.550083 -1.905869
3 b two 1.820455 -0.270374
4 a one -0.428296 -0.602598

In [4]:
# 使用 groupby方法
grouped = df.data1.groupby(df.key1)
grouped
# 產生一個 SeriesGroupBy物件


Out[4]:
<pandas.core.groupby.SeriesGroupBy object at 0x0000000005FDD978>

In [5]:
grouped.size()


Out[5]:
key1
a    3
b    2
dtype: int64

In [6]:
# 用GroupBy物件的 mean()方法
# mean()方法是一種 聚合運算
grouped.mean()


Out[6]:
key1
a   -0.766473
b    1.185269
Name: data1, dtype: float64

分組所依據的鍵,可以是任何長度的數組,且可以有多層


In [7]:
# 也可以建立多層次的分組
grouped = df.data1.groupby([df.key1, df.key2])
grouped.size()


Out[7]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

In [8]:
grouped.mean()


Out[8]:
key1  key2
a     one    -1.053397
      two    -0.192626
b     one     0.550083
      two     1.820455
Name: data1, dtype: float64

In [9]:
grouped.mean().unstack('key1')


Out[9]:
key1 a b
key2
one -1.053397 0.550083
two -0.192626 1.820455

In [10]:
# 也可以對多個 columns同時做分組統計運算
df.groupby(df.key1).mean()


Out[10]:
data1 data2
key1
a -0.766473 0.024592
b 1.185269 -1.088121

In [11]:
# 也可以直接以 column索引的名稱來指定分組
df.groupby(['key1', 'key2']).mean()


Out[11]:
data1 data2
key1 key2
a one -1.053397 -0.647683
two -0.192626 1.369141
b one 0.550083 -1.905869
two 1.820455 -0.270374

In [12]:
# GroupBy 的 size()方法,傳回各分組的大小
df.groupby(['key1', 'key2']).size()


Out[12]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

對分組進行迭代


In [13]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# 所以分組的結果,是拆分為多個 DataFrame


a
  key1 key2     data1     data2
0    a  one -1.678498 -0.692768
1    a  two -0.192626  1.369141
4    a  one -0.428296 -0.602598
b
  key1 key2     data1     data2
2    b  one  0.550083 -1.905869
3    b  two  1.820455 -0.270374

In [14]:
# 依照多重鍵分組,groupby元素元組的第一個元素是 多重鍵的 元組
for name, group in df.groupby(['key1', 'key2']):
    print(name)
    print(group)


('a', 'one')
  key1 key2     data1     data2
0    a  one -1.678498 -0.692768
4    a  one -0.428296 -0.602598
('a', 'two')
  key1 key2     data1     data2
1    a  two -0.192626  1.369141
('b', 'one')
  key1 key2     data1     data2
2    b  one  0.550083 -1.905869
('b', 'two')
  key1 key2     data1     data2
3    b  two  1.820455 -0.270374

選取一個或一組columns


In [15]:
df


Out[15]:
key1 key2 data1 data2
0 a one -1.678498 -0.692768
1 a two -0.192626 1.369141
2 b one 0.550083 -1.905869
3 b two 1.820455 -0.270374
4 a one -0.428296 -0.602598

In [16]:
df.groupby('key1')['data1']
# 等同於
df['data1'].groupby(df['key1'])

df.groupby('key1')['data2']
# 等同於
df[['data2']].groupby(df['key1'])


Out[16]:
<pandas.core.groupby.DataFrameGroupBy object at 0x00000000089E91D0>

In [17]:
# 有時候只需要對部分的資料列進行聚合
df.groupby(['key1', 'key2'])[['data2']].mean()
# 傳回 DataFrame


Out[17]:
data2
key1 key2
a one -0.647683
two 1.369141
b one -1.905869
two -0.270374

In [18]:
df.groupby(['key1', 'key2'])['data2'].mean()
# 傳回 Series


Out[18]:
key1  key2
a     one    -0.647683
      two     1.369141
b     one    -1.905869
      two    -0.270374
Name: data2, dtype: float64

通過字典或Series進行分組


In [19]:
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.ix[2:3, ['b', 'c']] = np.nan
people


Out[19]:
a b c d e
Joe 0.304629 0.533158 2.076010 -0.644645 -0.872860
Steve 0.242612 -1.249073 0.932613 0.148341 -1.324835
Wes 0.134158 NaN NaN -1.259198 -0.276365
Jim 0.913586 1.198580 1.886465 0.839542 0.379609
Travis 0.468621 1.850853 -1.813160 -0.852540 -0.039789

In [20]:
# 已經知道 列的分組關係
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}
#只需要將mapping關係的字典傳給 groupby()
grouped_by_column = people.groupby(mapping, axis = 1)
grouped_by_column.sum()


Out[20]:
blue red
Joe 1.431365 -0.035073
Steve 1.080954 -2.331295
Wes -1.259198 -0.142207
Jim 2.726007 2.491775
Travis -2.665700 2.279685

In [21]:
map_series = Series(mapping)
map_series


Out[21]:
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [22]:
# 也可以將mapping關係的Series物件傳給 groupby()
grouped_by_column = people.groupby(map_series, axis = 1)
grouped_by_column.sum()


Out[22]:
blue red
Joe 1.431365 -0.035073
Steve 1.080954 -2.331295
Wes -1.259198 -0.142207
Jim 2.726007 2.491775
Travis -2.665700 2.279685

透過函數進行分組


In [23]:
people


Out[23]:
a b c d e
Joe 0.304629 0.533158 2.076010 -0.644645 -0.872860
Steve 0.242612 -1.249073 0.932613 0.148341 -1.324835
Wes 0.134158 NaN NaN -1.259198 -0.276365
Jim 0.913586 1.198580 1.886465 0.839542 0.379609
Travis 0.468621 1.850853 -1.813160 -0.852540 -0.039789

In [24]:
# 被當作分組鍵的函數都會在各個索引值上被調用一次,返回值就被當作分組名稱
people.groupby(len).mean()


Out[24]:
a b c d e
3 0.450791 0.865869 1.981237 -0.354767 -0.256539
5 0.242612 -1.249073 0.932613 0.148341 -1.324835
6 0.468621 1.850853 -1.813160 -0.852540 -0.039789

In [25]:
# 函數、列表、字典、Series都可以混用,因為最後都會被轉換為數組
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()


Out[25]:
a b c d e
3 one 0.134158 0.533158 2.076010 -1.259198 -0.872860
two 0.913586 1.198580 1.886465 0.839542 0.379609
5 one 0.242612 -1.249073 0.932613 0.148341 -1.324835
6 two 0.468621 1.850853 -1.813160 -0.852540 -0.039789

In [26]:
# 根據索引級別分組
# 要依據層次化索引來分組聚合,只需要透過 level參數即可
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]], names=['cty', 'tenor'])

hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df


Out[26]:
cty US JP
tenor 1 3 5 1 3
0 -0.251274 0.988204 1.978977 0.038430 -1.044804
1 1.623105 0.107912 0.278087 1.018542 -0.056278
2 -0.894753 2.681102 1.129964 -0.439178 -2.457484
3 0.255927 1.065645 -0.635942 0.324136 -0.311153

In [27]:
hier_df.groupby(level = 'cty', axis = 1).count()


Out[27]:
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3

數據聚合 (pandas.core.groupby.DataFrameGroupBy.aggregate() )


In [28]:
# 可以自訂一聚合方法。聚合方法會對每一個分組之後的group操作一次
df


Out[28]:
key1 key2 data1 data2
0 a one -1.678498 -0.692768
1 a two -0.192626 1.369141
2 b one 0.550083 -1.905869
3 b two 1.820455 -0.270374
4 a one -0.428296 -0.602598

In [29]:
grouped = df.groupby('key1')
for name, group in grouped:
    print(name)
    print(group)


a
  key1 key2     data1     data2
0    a  one -1.678498 -0.692768
1    a  two -0.192626  1.369141
4    a  one -0.428296 -0.602598
b
  key1 key2     data1     data2
2    b  one  0.550083 -1.905869
3    b  two  1.820455 -0.270374

In [30]:
type(grouped)


Out[30]:
pandas.core.groupby.DataFrameGroupBy

In [31]:
for name, group in grouped['data1']:
    print(name)
    print(group)


a
0   -1.678498
1   -0.192626
4   -0.428296
Name: data1, dtype: float64
b
2    0.550083
3    1.820455
Name: data1, dtype: float64

In [32]:
# Series, DataFrame的方法都可以施加在 group上
# quantile 是 Series的方法
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)


Out[32]:
key1
a   -0.239760
b    1.693418
Name: data1, dtype: float64

In [33]:
# 透過 aggregate()方法,可以使用自訂函式
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.aggregate(peak_to_peak)

# 會對每一個 pandas.core.groupby.DataFrameGroupBy(grouped)中的 DataFrame 中的 Series 做一次指定的 aggregate (在這邊是 peak_to_peak()) 運算


Out[33]:
data1 data2
key1
a 1.485873 2.061909
b 1.270372 1.635495

In [34]:
# describe 也可以用
grouped.describe()


Out[34]:
data1 data2
key1
a count 3.000000 3.000000
mean -0.766473 0.024592
std 0.798578 1.165286
min -1.678498 -0.692768
25% -1.053397 -0.647683
50% -0.428296 -0.602598
75% -0.310461 0.383272
max -0.192626 1.369141
b count 2.000000 2.000000
mean 1.185269 -1.088121
std 0.898288 1.156470
min 0.550083 -1.905869
25% 0.867676 -1.496995
50% 1.185269 -1.088121
75% 1.502862 -0.679248
max 1.820455 -0.270374

In [35]:
grouped.mean()
# 會對每一個 pandas.core.groupby.DataFrameGroupBy中的 DataFrame 中的 Series 做一次指定的 aggregate (在這邊是 mean()) 運算


Out[35]:
data1 data2
key1
a -0.766473 0.024592
b 1.185269 -1.088121

面向列的多函數應用


In [36]:
tips = pd.read_csv('../data/tips.csv')
tips['tip_total_ratio'] = tips['tip']  / tips['total_bill'] 
tips[:5]


Out[36]:
total_bill tip sex smoker day time size tip_total_ratio
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808

In [37]:
# 對不同的列使用不同的聚合函數
grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_total_ratio']
for name, group in grouped_pct:
    print(name)
    print(group.tail(3))


('Female', 'No')
223    0.187735
238    0.130338
243    0.159744
Name: tip_total_ratio, dtype: float64
('Female', 'Yes')
226    0.198216
229    0.130199
240    0.073584
Name: tip_total_ratio, dtype: float64
('Male', 'No')
235    0.124131
239    0.203927
242    0.098204
Name: tip_total_ratio, dtype: float64
('Male', 'Yes')
236    0.079365
237    0.035638
241    0.088222
Name: tip_total_ratio, dtype: float64

In [38]:
grouped_pct.agg('mean')


Out[38]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_total_ratio, dtype: float64

In [39]:
grouped_pct.aggregate('mean')


Out[39]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_total_ratio, dtype: float64

In [40]:
# 傳入一組函數或函數名,得到的DataFrame的列就會以相應的函數命名
grouped_pct.agg(['mean', 'std', peak_to_peak])


Out[40]:
mean std peak_to_peak
sex smoker
Female No 0.156921 0.036421 0.195876
Yes 0.182150 0.071595 0.360233
Male No 0.160669 0.041849 0.220186
Yes 0.152771 0.090588 0.674707

In [41]:
# 如果傳入一個由(name, function)的元組列表,則各元組的第一個元素就會被當作DataFrame的 column名稱
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])


Out[41]:
foo bar
sex smoker
Female No 0.156921 0.036421
Yes 0.182150 0.071595
Male No 0.160669 0.041849
Yes 0.152771 0.090588

In [42]:
# 對於 DataFrame,還可以定義使用多個函數
functions = ['count', 'mean', 'max']
result = grouped['tip_total_ratio', 'total_bill'].agg(functions)
result


Out[42]:
tip_total_ratio total_bill
count mean max count mean max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81

In [43]:
result['tip_total_ratio']


Out[43]:
count mean max
sex smoker
Female No 54 0.156921 0.252672
Yes 33 0.182150 0.416667
Male No 97 0.160669 0.291990
Yes 60 0.152771 0.710345

In [44]:
# 自訂一結果的列名稱
functions = [('Counts', 'count'), ('Mean', 'mean'), ('Max', 'max')]
result = grouped['tip_total_ratio', 'total_bill'].agg(functions)
result


Out[44]:
tip_total_ratio total_bill
Counts Mean Max Counts Mean Max
sex smoker
Female No 54 0.156921 0.252672 54 18.105185 35.83
Yes 33 0.182150 0.416667 33 17.977879 44.30
Male No 97 0.160669 0.291990 97 19.791237 48.33
Yes 60 0.152771 0.710345 60 22.284500 50.81

In [45]:
# 對於 DataFrame,還可以定義不同列使用不同的函數
# 傳入一個名稱與函數的字典
functions = {'tip_total_ratio':  np.max, 'total_bill': np.min}
result = grouped.agg(functions)
result


Out[45]:
tip_total_ratio total_bill
sex smoker
Female No 0.252672 7.25
Yes 0.416667 3.07
Male No 0.291990 7.51
Yes 0.710345 7.25

In [46]:
# 對於 DataFrame,還可以定義不同列使用不同的函數
functions = {'tip_total_ratio': (np.max,  np.min), 
             'size': ['sum', 'min']}
result = grouped.agg(functions)
result


Out[46]:
tip_total_ratio size
amax amin sum min
sex smoker
Female No 0.252672 0.056797 140 1
Yes 0.416667 0.056433 74 1
Male No 0.291990 0.071804 263 2
Yes 0.710345 0.035638 150 1

以 無索引 的形式返回聚合數據


In [47]:
# 透過 as_index = False,分組鍵不要成為索引
tips.groupby(['sex', 'smoker'], as_index = False).mean()


Out[47]:
sex smoker total_bill tip size tip_total_ratio
0 Female No 18.105185 2.773519 2.592593 0.156921
1 Female Yes 17.977879 2.931515 2.242424 0.182150
2 Male No 19.791237 3.113402 2.711340 0.160669
3 Male Yes 22.284500 3.051167 2.500000 0.152771

In [48]:
tips.groupby(['sex', 'smoker']).mean()


Out[48]:
total_bill tip size tip_total_ratio
sex smoker
Female No 18.105185 2.773519 2.592593 0.156921
Yes 17.977879 2.931515 2.242424 0.182150
Male No 19.791237 3.113402 2.711340 0.160669
Yes 22.284500 3.051167 2.500000 0.152771

分組級運算和轉換


In [49]:
# 聚合運算 是數據轉換的一種特例
# 為df增加一列 用於存放各索引分組平均值
df


Out[49]:
key1 key2 data1 data2
0 a one -1.678498 -0.692768
1 a two -0.192626 1.369141
2 b one 0.550083 -1.905869
3 b two 1.820455 -0.270374
4 a one -0.428296 -0.602598

In [50]:
# 計算分組mean
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means


Out[50]:
mean_data1 mean_data2
key1
a -0.766473 0.024592
b 1.185269 -1.088121

In [51]:
# merge
pd.merge(df, k1_means, left_on = 'key1', right_index = True)


Out[51]:
key1 key2 data1 data2 mean_data1 mean_data2
0 a one -1.678498 -0.692768 -0.766473 0.024592
1 a two -0.192626 1.369141 -0.766473 0.024592
4 a one -0.428296 -0.602598 -0.766473 0.024592
2 b one 0.550083 -1.905869 1.185269 -1.088121
3 b two 1.820455 -0.270374 1.185269 -1.088121

In [52]:
# 使用 transform()
people


Out[52]:
a b c d e
Joe 0.304629 0.533158 2.076010 -0.644645 -0.872860
Steve 0.242612 -1.249073 0.932613 0.148341 -1.324835
Wes 0.134158 NaN NaN -1.259198 -0.276365
Jim 0.913586 1.198580 1.886465 0.839542 0.379609
Travis 0.468621 1.850853 -1.813160 -0.852540 -0.039789

In [53]:
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()


Out[53]:
a b c d e
one 0.302469 1.192005 0.131425 -0.918794 -0.396338
two 0.578099 -0.025247 1.409539 0.493941 -0.472613

In [54]:
# 使用 transform(),將分組結果又放到各個row中(使用廣播的方式)
people.groupby(key).transform(np.mean)


Out[54]:
a b c d e
Joe 0.302469 1.192005 0.131425 -0.918794 -0.396338
Steve 0.578099 -0.025247 1.409539 0.493941 -0.472613
Wes 0.302469 1.192005 0.131425 -0.918794 -0.396338
Jim 0.578099 -0.025247 1.409539 0.493941 -0.472613
Travis 0.302469 1.192005 0.131425 -0.918794 -0.396338

In [55]:
# 可以套用各種自訂函式
# 距平均化函數
def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned


Out[55]:
a b c d e
Joe 0.002160 -0.658847 1.944585 0.274150 -0.476522
Steve -0.335487 -1.223826 -0.476926 -0.345600 -0.852222
Wes -0.168312 NaN NaN -0.340404 0.119974
Jim 0.335487 1.223826 0.476926 0.345600 0.852222
Travis 0.166152 0.658847 -1.944585 0.066255 0.356549

In [56]:
demeaned.groupby(key).transform(np.mean).applymap(lambda x: '{0:.5f}'.format(x))


Out[56]:
a b c d e
Joe -0.00000 0.00000 0.00000 -0.00000 0.00000
Steve -0.00000 0.00000 0.00000 0.00000 0.00000
Wes -0.00000 0.00000 0.00000 -0.00000 0.00000
Jim -0.00000 0.00000 0.00000 0.00000 0.00000
Travis -0.00000 0.00000 0.00000 -0.00000 0.00000

apply: 一般性的 '拆分-應用-合併'

Difference between map, applymap and apply methods in Pandas
apply: 對 整個DataFrame(單一group) 實施一次
applymap: 對 DataFrame 的每個 儲存格 實施一次
map: 是 Series 的 function,對 Series 的每個 數值 實施一次


In [57]:
# apply 會將資料拆分成多個片段,對各個片段調用函式,最後再組合各個結果
def top(df, n = 5, column = 'tip_total_ratio'):
    return df.sort_values(by = column)[-n:]

In [58]:
tips.tail()


Out[58]:
total_bill tip sex smoker day time size tip_total_ratio
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744

In [59]:
top(tips, n = 6)


Out[59]:
total_bill tip sex smoker day time size tip_total_ratio
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345

In [60]:
# 使用 apply() 來施加 自訂函式
tips.groupby('smoker').apply(top)


Out[60]:
total_bill tip sex smoker day time size tip_total_ratio
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345

In [61]:
# 自訂函式所需要的參數,可以放在後面一起傳入
tips.groupby(['smoker', 'day']).apply(top, n = 1, column = 'total_bill')


Out[61]:
total_bill tip sex smoker day time size tip_total_ratio
smoker day
No Fri 94 22.75 3.25 Female No Fri Dinner 2 0.142857
Sat 212 48.33 9.00 Male No Sat Dinner 4 0.186220
Sun 156 48.17 5.00 Male No Sun Dinner 6 0.103799
Thur 142 41.19 5.00 Male No Thur Lunch 5 0.121389
Yes Fri 95 40.17 4.73 Male Yes Fri Dinner 4 0.117750
Sat 170 50.81 10.00 Male Yes Sat Dinner 3 0.196812
Sun 182 45.35 3.50 Male Yes Sun Dinner 3 0.077178
Thur 197 43.11 5.00 Female Yes Thur Lunch 4 0.115982

In [62]:
result = tips.groupby(['smoker',])['total_bill'].describe()
result


Out[62]:
smoker       
No      count    151.000000
        mean      19.188278
        std        8.255582
        min        7.250000
        25%       13.325000
        50%       17.590000
        75%       22.755000
        max       48.330000
Yes     count     93.000000
        mean      20.756344
        std        9.832154
        min        3.070000
        25%       13.420000
        50%       17.920000
        75%       26.860000
        max       50.810000
dtype: float64

In [63]:
result.unstack('smoker')


Out[63]:
smoker No Yes
count 151.000000 93.000000
mean 19.188278 20.756344
std 8.255582 9.832154
min 7.250000 3.070000
25% 13.325000 13.420000
50% 17.590000 17.920000
75% 22.755000 26.860000
max 48.330000 50.810000

禁止分組鍵 (group_keys = False)


In [64]:
# 設定 group_keys = False,不讓分組鍵成為row索引
tips.groupby('smoker', group_keys = False).apply(top)


Out[64]:
total_bill tip sex smoker day time size tip_total_ratio
88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345

In [65]:
tips.groupby('smoker', group_keys = True).apply(top)


Out[65]:
total_bill tip sex smoker day time size tip_total_ratio
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345

範例: 分組加權平均數和相關係數


In [66]:
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df


Out[66]:
category data weights
0 a -0.465269 0.433269
1 a -0.460243 0.155179
2 a 1.467560 0.964604
3 a -0.568058 0.743647
4 b -0.437733 0.142860
5 b -0.124299 0.392970
6 b -0.745835 0.962753
7 b -2.072258 0.285318

In [67]:
# 計算分組加權平均數
get_wavg = lambda g: np.average(g.data * g.weights)

# 每個分組施以 get_wavg
df.groupby('category').apply(get_wavg)


Out[67]:
category
a    0.180043
b   -0.355172
dtype: float64

In [68]:
# Yahoo Finance 
close_px = pd.read_csv('../data/stock_px.csv',
                       parse_dates = True, index_col = 0)
close_px[:6]


Out[68]:
AAPL MSFT XOM SPX
2003-01-02 7.40 21.11 29.22 909.03
2003-01-03 7.45 21.14 29.24 908.59
2003-01-06 7.45 21.52 29.96 929.01
2003-01-07 7.43 21.93 28.95 922.93
2003-01-08 7.28 21.31 28.83 909.93
2003-01-09 7.34 21.93 29.44 927.57

In [69]:
# 計算 日收益率 與 SPX之間的年度相關係數組成的DataFrame
rets = close_px.pct_change().dropna()
rets[:6]


Out[69]:
AAPL MSFT XOM SPX
2003-01-03 0.006757 0.001421 0.000684 -0.000484
2003-01-06 0.000000 0.017975 0.024624 0.022474
2003-01-07 -0.002685 0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09 0.008242 0.029094 0.021159 0.019386
2003-01-10 0.002725 0.001824 -0.013927 0.000000

In [70]:
# 與 SPX之間的相關係數
spx_corr = lambda g: g.corrwith(g.SPX)

# 以年度區分
by_year = rets.groupby(lambda x: x.year)

# 計算分組與 SPX的 corr
by_year.apply(spx_corr)


Out[70]:
AAPL MSFT XOM SPX
2003 0.541124 0.745174 0.661265 1.0
2004 0.374283 0.588531 0.557742 1.0
2005 0.467540 0.562374 0.631010 1.0
2006 0.428267 0.406126 0.518514 1.0
2007 0.508118 0.658770 0.786264 1.0
2008 0.681434 0.804626 0.828303 1.0
2009 0.707103 0.654902 0.797921 1.0
2010 0.710105 0.730118 0.839057 1.0
2011 0.691931 0.800996 0.859975 1.0

In [71]:
# 也可以計算 列與列之間的相關係數
by_year.apply(lambda g: g.AAPL.corr(g.MSFT))


Out[71]:
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

透視表(pivot table)和交叉表(cross-tabulation, 或稱 crosstab)


In [72]:
tips = pd.read_csv('../data/tips.csv')
tips['tip_pct'] = tips['tip']  / tips['total_bill'] 
tips[:5]


Out[72]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808

In [73]:
# DataFrame 本身就有 pivot_table()方法,預設的 aggregate function 是 average
tips.pivot_table(index = ['sex', 'smoker'])


Out[73]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500

In [74]:
# 只聚合 tip_pct, size,而且想根據day來分組
# margins = True , 添加分項小計
tips.pivot_table(values = ['tip_pct', 'size'], index = ['sex', 'day'], columns = 'smoker', margins = True)


Out[74]:
tip_pct size
smoker No Yes All No Yes All
sex day
Female Fri 0.165296 0.209129 0.199388 2.500000 2.000000 2.111111
Sat 0.147993 0.163817 0.156470 2.307692 2.200000 2.250000
Sun 0.165710 0.237075 0.181569 3.071429 2.500000 2.944444
Thur 0.155971 0.163073 0.157525 2.480000 2.428571 2.468750
Male Fri 0.138005 0.144730 0.143385 2.000000 2.125000 2.100000
Sat 0.162132 0.139067 0.151577 2.656250 2.629630 2.644068
Sun 0.158291 0.173964 0.162344 2.883721 2.600000 2.810345
Thur 0.165706 0.164417 0.165276 2.500000 2.300000 2.433333
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672

In [75]:
tips.pivot_table(values = ['tip_pct'], index = ['sex', 'smoker'], columns = 'day', margins = True)


Out[75]:
tip_pct
day Fri Sat Sun Thur All
sex smoker
Female No 0.165296 0.147993 0.165710 0.155971 0.156921
Yes 0.209129 0.163817 0.237075 0.163073 0.182150
Male No 0.138005 0.162132 0.158291 0.165706 0.160669
Yes 0.144730 0.139067 0.173964 0.164417 0.152771
All 0.169913 0.153152 0.166897 0.161276 0.160803

In [76]:
# 也可傳入指定的 aggregate function (參數 aggfunc)
tips.pivot_table(values = ['tip_pct'], index = ['sex', 'smoker'], columns = 'day', margins = True, aggfunc = len)


Out[76]:
tip_pct
day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0

In [77]:
# 如果存在空的組合(NA),可以指定 fill_value參數,自動填入空缺值
tips.pivot_table(values = ['size'], index = ['time', 'sex', 'smoker'], columns = 'day', margins = True, aggfunc = sum, fill_value = 0)


Out[77]:
size
day Fri Sat Sun Thur All
time sex smoker
Dinner Female No 2.0 30.0 43.0 2.0 77.0
Yes 8.0 33.0 10.0 0.0 51.0
Male No 4.0 85.0 124.0 0.0 213.0
Yes 12.0 71.0 39.0 0.0 122.0
Lunch Female No 3.0 0.0 0.0 60.0 63.0
Yes 6.0 0.0 0.0 17.0 23.0
Male No 0.0 0.0 0.0 50.0 50.0
Yes 5.0 0.0 0.0 23.0 28.0
All 40.0 219.0 216.0 152.0 627.0

交叉表(crosstab)

用於計算 分組頻率 的特殊 透視表(pivot)


In [78]:
data = DataFrame(
                {'Sample': list(range(1, 11)),
                 'Gender': [random.choice(['Female', 'Male']) for i in range(10)],
                 'Handedness': [random.choice(['Right-handed', 'Left-handed']) for i in range(10)]
                }, 
                columns = ['Sample', 'Gender', 'Handedness'])
data


Out[78]:
Sample Gender Handedness
0 1 Female Left-handed
1 2 Male Left-handed
2 3 Female Left-handed
3 4 Male Right-handed
4 5 Female Left-handed
5 6 Female Left-handed
6 7 Female Left-handed
7 8 Male Right-handed
8 9 Female Left-handed
9 10 Female Left-handed

In [79]:
# 用 crosstab() 方法
pd.crosstab(data.Gender, data.Handedness, margins = True)


Out[79]:
Handedness Left-handed Right-handed All
Gender
Female 7 0 7
Male 1 2 3
All 8 2 10

In [80]:
# crosstab()方法的參數值可以是 數組
pd.crosstab(index = [tips.time, tips.day], columns = tips.smoker, margins = True)


Out[80]:
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244