5. Pandas - GroupBy

Group By: split-apply-combine



In [2]:

    
%pylab inline
from pandas import Series, DataFrame
import pandas as pd









    



Populating the interactive namespace from numpy and matplotlib

GroupBy技術

分組運算是一種 Split-Apply-Combine的過程，類似於MapReduce的模式

![Pandas GroupBy](http://i.stack.imgur.com/sgCn1.jpg)



In [3]:

    
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)}, columns = ['key1', 'key2', 'data1', 'data2'])
df



In [4]:

    
# 使用 groupby方法
grouped = df.data1.groupby(df.key1)
grouped
# 產生一個 SeriesGroupBy物件









    Out[4]:





<pandas.core.groupby.SeriesGroupBy object at 0x0000000005FDD978>



In [5]:

    
grouped.size()









    Out[5]:





key1
a    3
b    2
dtype: int64



In [6]:

    
# 用GroupBy物件的 mean()方法
# mean()方法是一種 聚合運算
grouped.mean()









    Out[6]:





key1
a   -0.766473
b    1.185269
Name: data1, dtype: float64

分組所依據的鍵，可以是任何長度的數組，且可以有多層



In [7]:

    
# 也可以建立多層次的分組
grouped = df.data1.groupby([df.key1, df.key2])
grouped.size()









    Out[7]:





key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64



In [8]:

    
grouped.mean()









    Out[8]:





key1  key2
a     one    -1.053397
      two    -0.192626
b     one     0.550083
      two     1.820455
Name: data1, dtype: float64



In [9]:

    
grouped.mean().unstack('key1')



In [10]:

    
# 也可以對多個 columns同時做分組統計運算
df.groupby(df.key1).mean()



In [11]:

    
# 也可以直接以 column索引的名稱來指定分組
df.groupby(['key1', 'key2']).mean()



In [12]:

    
# GroupBy 的 size()方法，傳回各分組的大小
df.groupby(['key1', 'key2']).size()









    Out[12]:





key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

對分組進行迭代



In [13]:

    
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# 所以分組的結果，是拆分為多個 DataFrame









    



a
  key1 key2     data1     data2
0    a  one -1.678498 -0.692768
1    a  two -0.192626  1.369141
4    a  one -0.428296 -0.602598
b
  key1 key2     data1     data2
2    b  one  0.550083 -1.905869
3    b  two  1.820455 -0.270374



In [14]:

    
# 依照多重鍵分組，groupby元素元組的第一個元素是 多重鍵的 元組
for name, group in df.groupby(['key1', 'key2']):
    print(name)
    print(group)









    



('a', 'one')
  key1 key2     data1     data2
0    a  one -1.678498 -0.692768
4    a  one -0.428296 -0.602598
('a', 'two')
  key1 key2     data1     data2
1    a  two -0.192626  1.369141
('b', 'one')
  key1 key2     data1     data2
2    b  one  0.550083 -1.905869
('b', 'two')
  key1 key2     data1     data2
3    b  two  1.820455 -0.270374

選取一個或一組columns



In [15]:

    
df



In [16]:

    
df.groupby('key1')['data1']
# 等同於
df['data1'].groupby(df['key1'])

df.groupby('key1')['data2']
# 等同於
df[['data2']].groupby(df['key1'])









    Out[16]:





<pandas.core.groupby.DataFrameGroupBy object at 0x00000000089E91D0>



In [17]:

    
# 有時候只需要對部分的資料列進行聚合
df.groupby(['key1', 'key2'])[['data2']].mean()
# 傳回 DataFrame



In [18]:

    
df.groupby(['key1', 'key2'])['data2'].mean()
# 傳回 Series









    Out[18]:





key1  key2
a     one    -0.647683
      two     1.369141
b     one    -1.905869
      two    -0.270374
Name: data2, dtype: float64

通過字典或Series進行分組



In [19]:

    
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.ix[2:3, ['b', 'c']] = np.nan
people



In [20]:

    
# 已經知道 列的分組關係
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}
#只需要將mapping關係的字典傳給 groupby()
grouped_by_column = people.groupby(mapping, axis = 1)
grouped_by_column.sum()



In [21]:

    
map_series = Series(mapping)
map_series









    Out[21]:





a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object



In [22]:

    
# 也可以將mapping關係的Series物件傳給 groupby()
grouped_by_column = people.groupby(map_series, axis = 1)
grouped_by_column.sum()

透過函數進行分組



In [23]:

    
people



In [24]:

    
# 被當作分組鍵的函數都會在各個索引值上被調用一次，返回值就被當作分組名稱
people.groupby(len).mean()



In [25]:

    
# 函數、列表、字典、Series都可以混用，因為最後都會被轉換為數組
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()



In [26]:

    
# 根據索引級別分組
# 要依據層次化索引來分組聚合，只需要透過 level參數即可
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]], names=['cty', 'tenor'])

hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df



In [27]:

    
hier_df.groupby(level = 'cty', axis = 1).count()

數據聚合 (pandas.core.groupby.DataFrameGroupBy.aggregate() )



In [28]:

    
# 可以自訂一聚合方法。聚合方法會對每一個分組之後的group操作一次
df



In [29]:

    
grouped = df.groupby('key1')
for name, group in grouped:
    print(name)
    print(group)









    



a
  key1 key2     data1     data2
0    a  one -1.678498 -0.692768
1    a  two -0.192626  1.369141
4    a  one -0.428296 -0.602598
b
  key1 key2     data1     data2
2    b  one  0.550083 -1.905869
3    b  two  1.820455 -0.270374



In [30]:

    
type(grouped)









    Out[30]:





pandas.core.groupby.DataFrameGroupBy



In [31]:

    
for name, group in grouped['data1']:
    print(name)
    print(group)









    



a
0   -1.678498
1   -0.192626
4   -0.428296
Name: data1, dtype: float64
b
2    0.550083
3    1.820455
Name: data1, dtype: float64



In [32]:

    
# Series, DataFrame的方法都可以施加在 group上
# quantile 是 Series的方法
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)









    Out[32]:





key1
a   -0.239760
b    1.693418
Name: data1, dtype: float64



In [33]:

    
# 透過 aggregate()方法，可以使用自訂函式
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.aggregate(peak_to_peak)

# 會對每一個 pandas.core.groupby.DataFrameGroupBy(grouped)中的 DataFrame 中的 Series 做一次指定的 aggregate (在這邊是 peak_to_peak()) 運算



In [34]:

    
# describe 也可以用
grouped.describe()



In [35]:

    
grouped.mean()
# 會對每一個 pandas.core.groupby.DataFrameGroupBy中的 DataFrame 中的 Series 做一次指定的 aggregate (在這邊是 mean()) 運算

面向列的多函數應用



In [36]:

    
tips = pd.read_csv('../data/tips.csv')
tips['tip_total_ratio'] = tips['tip']  / tips['total_bill'] 
tips[:5]









    Out[36]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_total_ratio
    
  
  
    
      0
      16.99
      1.01
      Female
      No
      Sun
      Dinner
      2
      0.059447
    
    
      1
      10.34
      1.66
      Male
      No
      Sun
      Dinner
      3
      0.160542
    
    
      2
      21.01
      3.50
      Male
      No
      Sun
      Dinner
      3
      0.166587
    
    
      3
      23.68
      3.31
      Male
      No
      Sun
      Dinner
      2
      0.139780
    
    
      4
      24.59
      3.61
      Female
      No
      Sun
      Dinner
      4
      0.146808



In [37]:

    
# 對不同的列使用不同的聚合函數
grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_total_ratio']
for name, group in grouped_pct:
    print(name)
    print(group.tail(3))









    



('Female', 'No')
223    0.187735
238    0.130338
243    0.159744
Name: tip_total_ratio, dtype: float64
('Female', 'Yes')
226    0.198216
229    0.130199
240    0.073584
Name: tip_total_ratio, dtype: float64
('Male', 'No')
235    0.124131
239    0.203927
242    0.098204
Name: tip_total_ratio, dtype: float64
('Male', 'Yes')
236    0.079365
237    0.035638
241    0.088222
Name: tip_total_ratio, dtype: float64

What is the difference between pandas agg and apply function?



In [38]:

    
grouped_pct.agg('mean')









    Out[38]:





sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_total_ratio, dtype: float64



In [39]:

    
grouped_pct.aggregate('mean')









    Out[39]:





sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_total_ratio, dtype: float64



In [40]:

    
# 傳入一組函數或函數名，得到的DataFrame的列就會以相應的函數命名
grouped_pct.agg(['mean', 'std', peak_to_peak])









    Out[40]:






  
    
      
      
      mean
      std
      peak_to_peak
    
    
      sex
      smoker
      
      
      
    
  
  
    
      Female
      No
      0.156921
      0.036421
      0.195876
    
    
      Yes
      0.182150
      0.071595
      0.360233
    
    
      Male
      No
      0.160669
      0.041849
      0.220186
    
    
      Yes
      0.152771
      0.090588
      0.674707



In [41]:

    
# 如果傳入一個由(name, function)的元組列表，則各元組的第一個元素就會被當作DataFrame的 column名稱
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])



In [42]:

    
# 對於 DataFrame，還可以定義使用多個函數
functions = ['count', 'mean', 'max']
result = grouped['tip_total_ratio', 'total_bill'].agg(functions)
result









    Out[42]:






  
    
      
      
      tip_total_ratio
      total_bill
    
    
      
      
      count
      mean
      max
      count
      mean
      max
    
    
      sex
      smoker
      
      
      
      
      
      
    
  
  
    
      Female
      No
      54
      0.156921
      0.252672
      54
      18.105185
      35.83
    
    
      Yes
      33
      0.182150
      0.416667
      33
      17.977879
      44.30
    
    
      Male
      No
      97
      0.160669
      0.291990
      97
      19.791237
      48.33
    
    
      Yes
      60
      0.152771
      0.710345
      60
      22.284500
      50.81



In [43]:

    
result['tip_total_ratio']



In [44]:

    
# 自訂一結果的列名稱
functions = [('Counts', 'count'), ('Mean', 'mean'), ('Max', 'max')]
result = grouped['tip_total_ratio', 'total_bill'].agg(functions)
result









    Out[44]:






  
    
      
      
      tip_total_ratio
      total_bill
    
    
      
      
      Counts
      Mean
      Max
      Counts
      Mean
      Max
    
    
      sex
      smoker
      
      
      
      
      
      
    
  
  
    
      Female
      No
      54
      0.156921
      0.252672
      54
      18.105185
      35.83
    
    
      Yes
      33
      0.182150
      0.416667
      33
      17.977879
      44.30
    
    
      Male
      No
      97
      0.160669
      0.291990
      97
      19.791237
      48.33
    
    
      Yes
      60
      0.152771
      0.710345
      60
      22.284500
      50.81



In [45]:

    
# 對於 DataFrame，還可以定義不同列使用不同的函數
# 傳入一個名稱與函數的字典
functions = {'tip_total_ratio':  np.max, 'total_bill': np.min}
result = grouped.agg(functions)
result









    Out[45]:






  
    
      
      
      tip_total_ratio
      total_bill
    
    
      sex
      smoker
      
      
    
  
  
    
      Female
      No
      0.252672
      7.25
    
    
      Yes
      0.416667
      3.07
    
    
      Male
      No
      0.291990
      7.51
    
    
      Yes
      0.710345
      7.25



In [46]:

    
# 對於 DataFrame，還可以定義不同列使用不同的函數
functions = {'tip_total_ratio': (np.max,  np.min), 
             'size': ['sum', 'min']}
result = grouped.agg(functions)
result









    Out[46]:






  
    
      
      
      tip_total_ratio
      size
    
    
      
      
      amax
      amin
      sum
      min
    
    
      sex
      smoker
      
      
      
      
    
  
  
    
      Female
      No
      0.252672
      0.056797
      140
      1
    
    
      Yes
      0.416667
      0.056433
      74
      1
    
    
      Male
      No
      0.291990
      0.071804
      263
      2
    
    
      Yes
      0.710345
      0.035638
      150
      1

以無索引的形式返回聚合數據



In [47]:

    
# 透過 as_index = False，分組鍵不要成為索引
tips.groupby(['sex', 'smoker'], as_index = False).mean()









    Out[47]:






  
    
      
      sex
      smoker
      total_bill
      tip
      size
      tip_total_ratio
    
  
  
    
      0
      Female
      No
      18.105185
      2.773519
      2.592593
      0.156921
    
    
      1
      Female
      Yes
      17.977879
      2.931515
      2.242424
      0.182150
    
    
      2
      Male
      No
      19.791237
      3.113402
      2.711340
      0.160669
    
    
      3
      Male
      Yes
      22.284500
      3.051167
      2.500000
      0.152771



In [48]:

    
tips.groupby(['sex', 'smoker']).mean()









    Out[48]:






  
    
      
      
      total_bill
      tip
      size
      tip_total_ratio
    
    
      sex
      smoker
      
      
      
      
    
  
  
    
      Female
      No
      18.105185
      2.773519
      2.592593
      0.156921
    
    
      Yes
      17.977879
      2.931515
      2.242424
      0.182150
    
    
      Male
      No
      19.791237
      3.113402
      2.711340
      0.160669
    
    
      Yes
      22.284500
      3.051167
      2.500000
      0.152771

分組級運算和轉換



In [49]:

    
# 聚合運算 是數據轉換的一種特例
# 為df增加一列 用於存放各索引分組平均值
df



In [50]:

    
# 計算分組mean
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means



In [51]:

    
# merge
pd.merge(df, k1_means, left_on = 'key1', right_index = True)



In [52]:

    
# 使用 transform()
people



In [53]:

    
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()



In [54]:

    
# 使用 transform()，將分組結果又放到各個row中(使用廣播的方式)
people.groupby(key).transform(np.mean)



In [55]:

    
# 可以套用各種自訂函式
# 距平均化函數
def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned



In [56]:

    
demeaned.groupby(key).transform(np.mean).applymap(lambda x: '{0:.5f}'.format(x))

apply: 一般性的 '拆分-應用-合併'

Difference between map, applymap and apply methods in Pandas
apply: 對整個DataFrame(單一group) 實施一次
applymap: 對 DataFrame 的每個儲存格實施一次
map: 是 Series 的 function，對 Series 的每個數值實施一次



In [57]:

    
# apply 會將資料拆分成多個片段，對各個片段調用函式，最後再組合各個結果
def top(df, n = 5, column = 'tip_total_ratio'):
    return df.sort_values(by = column)[-n:]



In [58]:

    
tips.tail()









    Out[58]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_total_ratio
    
  
  
    
      239
      29.03
      5.92
      Male
      No
      Sat
      Dinner
      3
      0.203927
    
    
      240
      27.18
      2.00
      Female
      Yes
      Sat
      Dinner
      2
      0.073584
    
    
      241
      22.67
      2.00
      Male
      Yes
      Sat
      Dinner
      2
      0.088222
    
    
      242
      17.82
      1.75
      Male
      No
      Sat
      Dinner
      2
      0.098204
    
    
      243
      18.78
      3.00
      Female
      No
      Thur
      Dinner
      2
      0.159744



In [59]:

    
top(tips, n = 6)









    Out[59]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_total_ratio
    
  
  
    
      109
      14.31
      4.00
      Female
      Yes
      Sat
      Dinner
      2
      0.279525
    
    
      183
      23.17
      6.50
      Male
      Yes
      Sun
      Dinner
      4
      0.280535
    
    
      232
      11.61
      3.39
      Male
      No
      Sat
      Dinner
      2
      0.291990
    
    
      67
      3.07
      1.00
      Female
      Yes
      Sat
      Dinner
      1
      0.325733
    
    
      178
      9.60
      4.00
      Female
      Yes
      Sun
      Dinner
      2
      0.416667
    
    
      172
      7.25
      5.15
      Male
      Yes
      Sun
      Dinner
      2
      0.710345



In [60]:

    
# 使用 apply() 來施加 自訂函式
tips.groupby('smoker').apply(top)









    Out[60]:






  
    
      
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_total_ratio
    
    
      smoker
      
      
      
      
      
      
      
      
      
    
  
  
    
      No
      88
      24.71
      5.85
      Male
      No
      Thur
      Lunch
      2
      0.236746
    
    
      185
      20.69
      5.00
      Male
      No
      Sun
      Dinner
      5
      0.241663
    
    
      51
      10.29
      2.60
      Female
      No
      Sun
      Dinner
      2
      0.252672
    
    
      149
      7.51
      2.00
      Male
      No
      Thur
      Lunch
      2
      0.266312
    
    
      232
      11.61
      3.39
      Male
      No
      Sat
      Dinner
      2
      0.291990
    
    
      Yes
      109
      14.31
      4.00
      Female
      Yes
      Sat
      Dinner
      2
      0.279525
    
    
      183
      23.17
      6.50
      Male
      Yes
      Sun
      Dinner
      4
      0.280535
    
    
      67
      3.07
      1.00
      Female
      Yes
      Sat
      Dinner
      1
      0.325733
    
    
      178
      9.60
      4.00
      Female
      Yes
      Sun
      Dinner
      2
      0.416667
    
    
      172
      7.25
      5.15
      Male
      Yes
      Sun
      Dinner
      2
      0.710345



In [61]:

    
# 自訂函式所需要的參數，可以放在後面一起傳入
tips.groupby(['smoker', 'day']).apply(top, n = 1, column = 'total_bill')









    Out[61]:






  
    
      
      
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_total_ratio
    
    
      smoker
      day
      
      
      
      
      
      
      
      
      
    
  
  
    
      No
      Fri
      94
      22.75
      3.25
      Female
      No
      Fri
      Dinner
      2
      0.142857
    
    
      Sat
      212
      48.33
      9.00
      Male
      No
      Sat
      Dinner
      4
      0.186220
    
    
      Sun
      156
      48.17
      5.00
      Male
      No
      Sun
      Dinner
      6
      0.103799
    
    
      Thur
      142
      41.19
      5.00
      Male
      No
      Thur
      Lunch
      5
      0.121389
    
    
      Yes
      Fri
      95
      40.17
      4.73
      Male
      Yes
      Fri
      Dinner
      4
      0.117750
    
    
      Sat
      170
      50.81
      10.00
      Male
      Yes
      Sat
      Dinner
      3
      0.196812
    
    
      Sun
      182
      45.35
      3.50
      Male
      Yes
      Sun
      Dinner
      3
      0.077178
    
    
      Thur
      197
      43.11
      5.00
      Female
      Yes
      Thur
      Lunch
      4
      0.115982



In [62]:

    
result = tips.groupby(['smoker',])['total_bill'].describe()
result









    Out[62]:





smoker       
No      count    151.000000
        mean      19.188278
        std        8.255582
        min        7.250000
        25%       13.325000
        50%       17.590000
        75%       22.755000
        max       48.330000
Yes     count     93.000000
        mean      20.756344
        std        9.832154
        min        3.070000
        25%       13.420000
        50%       17.920000
        75%       26.860000
        max       50.810000
dtype: float64



In [63]:

    
result.unstack('smoker')

禁止分組鍵 (group_keys = False)



In [64]:

    
# 設定 group_keys = False，不讓分組鍵成為row索引
tips.groupby('smoker', group_keys = False).apply(top)









    Out[64]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_total_ratio
    
  
  
    
      88
      24.71
      5.85
      Male
      No
      Thur
      Lunch
      2
      0.236746
    
    
      185
      20.69
      5.00
      Male
      No
      Sun
      Dinner
      5
      0.241663
    
    
      51
      10.29
      2.60
      Female
      No
      Sun
      Dinner
      2
      0.252672
    
    
      149
      7.51
      2.00
      Male
      No
      Thur
      Lunch
      2
      0.266312
    
    
      232
      11.61
      3.39
      Male
      No
      Sat
      Dinner
      2
      0.291990
    
    
      109
      14.31
      4.00
      Female
      Yes
      Sat
      Dinner
      2
      0.279525
    
    
      183
      23.17
      6.50
      Male
      Yes
      Sun
      Dinner
      4
      0.280535
    
    
      67
      3.07
      1.00
      Female
      Yes
      Sat
      Dinner
      1
      0.325733
    
    
      178
      9.60
      4.00
      Female
      Yes
      Sun
      Dinner
      2
      0.416667
    
    
      172
      7.25
      5.15
      Male
      Yes
      Sun
      Dinner
      2
      0.710345



In [65]:

    
tips.groupby('smoker', group_keys = True).apply(top)









    Out[65]:






  
    
      
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_total_ratio
    
    
      smoker
      
      
      
      
      
      
      
      
      
    
  
  
    
      No
      88
      24.71
      5.85
      Male
      No
      Thur
      Lunch
      2
      0.236746
    
    
      185
      20.69
      5.00
      Male
      No
      Sun
      Dinner
      5
      0.241663
    
    
      51
      10.29
      2.60
      Female
      No
      Sun
      Dinner
      2
      0.252672
    
    
      149
      7.51
      2.00
      Male
      No
      Thur
      Lunch
      2
      0.266312
    
    
      232
      11.61
      3.39
      Male
      No
      Sat
      Dinner
      2
      0.291990
    
    
      Yes
      109
      14.31
      4.00
      Female
      Yes
      Sat
      Dinner
      2
      0.279525
    
    
      183
      23.17
      6.50
      Male
      Yes
      Sun
      Dinner
      4
      0.280535
    
    
      67
      3.07
      1.00
      Female
      Yes
      Sat
      Dinner
      1
      0.325733
    
    
      178
      9.60
      4.00
      Female
      Yes
      Sun
      Dinner
      2
      0.416667
    
    
      172
      7.25
      5.15
      Male
      Yes
      Sun
      Dinner
      2
      0.710345

範例: 分組加權平均數和相關係數



In [66]:

    
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df



In [67]:

    
# 計算分組加權平均數
get_wavg = lambda g: np.average(g.data * g.weights)

# 每個分組施以 get_wavg
df.groupby('category').apply(get_wavg)









    Out[67]:





category
a    0.180043
b   -0.355172
dtype: float64



In [68]:

    
# Yahoo Finance 
close_px = pd.read_csv('../data/stock_px.csv',
                       parse_dates = True, index_col = 0)
close_px[:6]



In [69]:

    
# 計算 日收益率 與 SPX之間的年度相關係數組成的DataFrame
rets = close_px.pct_change().dropna()
rets[:6]



In [70]:

    
# 與 SPX之間的相關係數
spx_corr = lambda g: g.corrwith(g.SPX)

# 以年度區分
by_year = rets.groupby(lambda x: x.year)

# 計算分組與 SPX的 corr
by_year.apply(spx_corr)



In [71]:

    
# 也可以計算 列與列之間的相關係數
by_year.apply(lambda g: g.AAPL.corr(g.MSFT))









    Out[71]:





2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

透視表(pivot table)和交叉表(cross-tabulation, 或稱 crosstab)



In [72]:

    
tips = pd.read_csv('../data/tips.csv')
tips['tip_pct'] = tips['tip']  / tips['total_bill'] 
tips[:5]









    Out[72]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tip_pct
    
  
  
    
      0
      16.99
      1.01
      Female
      No
      Sun
      Dinner
      2
      0.059447
    
    
      1
      10.34
      1.66
      Male
      No
      Sun
      Dinner
      3
      0.160542
    
    
      2
      21.01
      3.50
      Male
      No
      Sun
      Dinner
      3
      0.166587
    
    
      3
      23.68
      3.31
      Male
      No
      Sun
      Dinner
      2
      0.139780
    
    
      4
      24.59
      3.61
      Female
      No
      Sun
      Dinner
      4
      0.146808



In [73]:

    
# DataFrame 本身就有 pivot_table()方法，預設的 aggregate function 是 average
tips.pivot_table(index = ['sex', 'smoker'])



In [74]:

    
# 只聚合 tip_pct, size，而且想根據day來分組
# margins = True , 添加分項小計
tips.pivot_table(values = ['tip_pct', 'size'], index = ['sex', 'day'], columns = 'smoker', margins = True)



In [75]:

    
tips.pivot_table(values = ['tip_pct'], index = ['sex', 'smoker'], columns = 'day', margins = True)



In [76]:

    
# 也可傳入指定的 aggregate function (參數 aggfunc)
tips.pivot_table(values = ['tip_pct'], index = ['sex', 'smoker'], columns = 'day', margins = True, aggfunc = len)



In [77]:

    
# 如果存在空的組合(NA)，可以指定 fill_value參數，自動填入空缺值
tips.pivot_table(values = ['size'], index = ['time', 'sex', 'smoker'], columns = 'day', margins = True, aggfunc = sum, fill_value = 0)

交叉表(crosstab)

用於計算分組頻率的特殊透視表(pivot)



In [78]:

    
data = DataFrame(
                {'Sample': list(range(1, 11)),
                 'Gender': [random.choice(['Female', 'Male']) for i in range(10)],
                 'Handedness': [random.choice(['Right-handed', 'Left-handed']) for i in range(10)]
                }, 
                columns = ['Sample', 'Gender', 'Handedness'])
data









    Out[78]:






  
    
      
      Sample
      Gender
      Handedness
    
  
  
    
      0
      1
      Female
      Left-handed
    
    
      1
      2
      Male
      Left-handed
    
    
      2
      3
      Female
      Left-handed
    
    
      3
      4
      Male
      Right-handed
    
    
      4
      5
      Female
      Left-handed
    
    
      5
      6
      Female
      Left-handed
    
    
      6
      7
      Female
      Left-handed
    
    
      7
      8
      Male
      Right-handed
    
    
      8
      9
      Female
      Left-handed
    
    
      9
      10
      Female
      Left-handed



In [79]:

    
# 用 crosstab() 方法
pd.crosstab(data.Gender, data.Handedness, margins = True)









    Out[79]:






  
    
      Handedness
      Left-handed
      Right-handed
      All
    
    
      Gender
      
      
      
    
  
  
    
      Female
      7
      0
      7
    
    
      Male
      1
      2
      3
    
    
      All
      8
      2
      10



In [80]:

    
# crosstab()方法的參數值可以是 數組
pd.crosstab(index = [tips.time, tips.day], columns = tips.smoker, margins = True)

	a	b	c	d	e
Joe	0.304629	0.533158	2.076010	-0.644645	-0.872860
Steve	0.242612	-1.249073	0.932613	0.148341	-1.324835
Wes	0.134158	NaN	NaN	-1.259198	-0.276365
Jim	0.913586	1.198580	1.886465	0.839542	0.379609
Travis	0.468621	1.850853	-1.813160	-0.852540	-0.039789

	blue	red
Joe	1.431365	-0.035073
Steve	1.080954	-2.331295
Wes	-1.259198	-0.142207
Jim	2.726007	2.491775
Travis	-2.665700	2.279685

	blue	red
Joe	1.431365	-0.035073
Steve	1.080954	-2.331295
Wes	-1.259198	-0.142207
Jim	2.726007	2.491775
Travis	-2.665700	2.279685

	a	b	c	d	e
Joe	0.304629	0.533158	2.076010	-0.644645	-0.872860
Steve	0.242612	-1.249073	0.932613	0.148341	-1.324835
Wes	0.134158	NaN	NaN	-1.259198	-0.276365
Jim	0.913586	1.198580	1.886465	0.839542	0.379609
Travis	0.468621	1.850853	-1.813160	-0.852540	-0.039789

	a	b	c	d	e
3	0.450791	0.865869	1.981237	-0.354767	-0.256539
5	0.242612	-1.249073	0.932613	0.148341	-1.324835
6	0.468621	1.850853	-1.813160	-0.852540	-0.039789

	key1	key2	data1	data2
0	a	one	-1.678498	-0.692768
1	a	two	-0.192626	1.369141
2	b	one	0.550083	-1.905869
3	b	two	1.820455	-0.270374
4	a	one	-0.428296	-0.602598

		data1	data2
key1	key2
a	one	-1.053397	-0.647683
a	two	-0.192626	1.369141
b	one	0.550083	-1.905869
b	two	1.820455	-0.270374

cty	US			JP
tenor	1	3	5	1	3
0	-0.251274	0.988204	1.978977	0.038430	-1.044804
1	1.623105	0.107912	0.278087	1.018542	-0.056278
2	-0.894753	2.681102	1.129964	-0.439178	-2.457484
3	0.255927	1.065645	-0.635942	0.324136	-0.311153

		data1	data2
key1
a	count	3.000000	3.000000
	mean	-0.766473	0.024592
	std	0.798578	1.165286
	min	-1.678498	-0.692768
	25%	-1.053397	-0.647683
	50%	-0.428296	-0.602598
	75%	-0.310461	0.383272
	max	-0.192626	1.369141
b	count	2.000000	2.000000
	mean	1.185269	-1.088121
	std	0.898288	1.156470
	min	0.550083	-1.905869
	25%	0.867676	-1.496995
	50%	1.185269	-1.088121
	75%	1.502862	-0.679248
	max	1.820455	-0.270374

	total_bill	tip	sex	smoker	day	time	size	tip_total_ratio
0	16.99	1.01	Female	No	Sun	Dinner	2	0.059447
1	10.34	1.66	Male	No	Sun	Dinner	3	0.160542
2	21.01	3.50	Male	No	Sun	Dinner	3	0.166587
3	23.68	3.31	Male	No	Sun	Dinner	2	0.139780
4	24.59	3.61	Female	No	Sun	Dinner	4	0.146808

		mean	std	peak_to_peak
sex	smoker
Female	No	0.156921	0.036421	0.195876
Female	Yes	0.182150	0.071595	0.360233
Male	No	0.160669	0.041849	0.220186
Male	Yes	0.152771	0.090588	0.674707

		tip_total_ratio			total_bill
		Counts	Mean	Max	Counts	Mean	Max
sex	smoker
Female	No	54	0.156921	0.252672	54	18.105185	35.83
Female	Yes	33	0.182150	0.416667	33	17.977879	44.30
Male	No	97	0.160669	0.291990	97	19.791237	48.33
Male	Yes	60	0.152771	0.710345	60	22.284500	50.81

		tip_total_ratio		size
		amax	amin	sum	min
sex	smoker
Female	No	0.252672	0.056797	140	1
Female	Yes	0.416667	0.056433	74	1
Male	No	0.291990	0.071804	263	2
Male	Yes	0.710345	0.035638	150	1

	sex	smoker	total_bill	tip	size	tip_total_ratio
0	Female	No	18.105185	2.773519	2.592593	0.156921
1	Female	Yes	17.977879	2.931515	2.242424	0.182150
2	Male	No	19.791237	3.113402	2.711340	0.160669
3	Male	Yes	22.284500	3.051167	2.500000	0.152771

	a	b	c	d	e
one	0.302469	1.192005	0.131425	-0.918794	-0.396338
two	0.578099	-0.025247	1.409539	0.493941	-0.472613

	a	b	c	d	e
Joe	0.002160	-0.658847	1.944585	0.274150	-0.476522
Steve	-0.335487	-1.223826	-0.476926	-0.345600	-0.852222
Wes	-0.168312	NaN	NaN	-0.340404	0.119974
Jim	0.335487	1.223826	0.476926	0.345600	0.852222
Travis	0.166152	0.658847	-1.944585	0.066255	0.356549

	a	d
Joe	-0.00000	-0.00000
Steve	-0.00000	0.00000
Wes	-0.00000	-0.00000
Jim	-0.00000	0.00000
Travis	-0.00000	-0.00000

	total_bill	tip	sex	smoker	day	time	size	tip_total_ratio
239	29.03	5.92	Male	No	Sat	Dinner	3	0.203927
240	27.18	2.00	Female	Yes	Sat	Dinner	2	0.073584
241	22.67	2.00	Male	Yes	Sat	Dinner	2	0.088222
242	17.82	1.75	Male	No	Sat	Dinner	2	0.098204
243	18.78	3.00	Female	No	Thur	Dinner	2	0.159744

	total_bill	tip	sex	smoker	day	time	size	tip_total_ratio
109	14.31	4.00	Female	Yes	Sat	Dinner	2	0.279525
183	23.17	6.50	Male	Yes	Sun	Dinner	4	0.280535
232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990
67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733
178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667
172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345

		total_bill	tip	sex	smoker	day	time	size	tip_total_ratio
smoker
No	88	24.71	5.85	Male	No	Thur	Lunch	2	0.236746
	185	20.69	5.00	Male	No	Sun	Dinner	5	0.241663
	51	10.29	2.60	Female	No	Sun	Dinner	2	0.252672
	149	7.51	2.00	Male	No	Thur	Lunch	2	0.266312
	232	11.61	3.39	Male	No	Sat	Dinner	2	0.291990
Yes	109	14.31	4.00	Female	Yes	Sat	Dinner	2	0.279525
	183	23.17	6.50	Male	Yes	Sun	Dinner	4	0.280535
	67	3.07	1.00	Female	Yes	Sat	Dinner	1	0.325733
	178	9.60	4.00	Female	Yes	Sun	Dinner	2	0.416667
	172	7.25	5.15	Male	Yes	Sun	Dinner	2	0.710345

			total_bill	tip	sex	smoker	day	time	size	tip_total_ratio
smoker	day
No	Fri	94	22.75	3.25	Female	No	Fri	Dinner	2	0.142857
	Sat	212	48.33	9.00	Male	No	Sat	Dinner	4	0.186220
	Sun	156	48.17	5.00	Male	No	Sun	Dinner	6	0.103799
	Thur	142	41.19	5.00	Male	No	Thur	Lunch	5	0.121389
Yes	Fri	95	40.17	4.73	Male	Yes	Fri	Dinner	4	0.117750
	Sat	170	50.81	10.00	Male	Yes	Sat	Dinner	3	0.196812
	Sun	182	45.35	3.50	Male	Yes	Sun	Dinner	3	0.077178
	Thur	197	43.11	5.00	Female	Yes	Thur	Lunch	4	0.115982

smoker	No	Yes
count	151.000000	93.000000
mean	19.188278	20.756344
std	8.255582	9.832154
min	7.250000	3.070000
25%	13.325000	13.420000
50%	17.590000	17.920000
75%	22.755000	26.860000
max	48.330000	50.810000

	category	data	weights
0	a	-0.465269	0.433269
1	a	-0.460243	0.155179
2	a	1.467560	0.964604
3	a	-0.568058	0.743647
4	b	-0.437733	0.142860
5	b	-0.124299	0.392970
6	b	-0.745835	0.962753
7	b	-2.072258	0.285318

	AAPL	MSFT	XOM	SPX
2003-01-02	7.40	21.11	29.22	909.03
2003-01-03	7.45	21.14	29.24	908.59
2003-01-06	7.45	21.52	29.96	929.01
2003-01-07	7.43	21.93	28.95	922.93
2003-01-08	7.28	21.31	28.83	909.93
2003-01-09	7.34	21.93	29.44	927.57

	AAPL	MSFT	XOM	SPX
2003-01-03	0.006757	0.001421	0.000684	-0.000484
2003-01-06	0.000000	0.017975	0.024624	0.022474
2003-01-07	-0.002685	0.019052	-0.033712	-0.006545
2003-01-08	-0.020188	-0.028272	-0.004145	-0.014086
2003-01-09	0.008242	0.029094	0.021159	0.019386
2003-01-10	0.002725	0.001824	-0.013927	0.000000