In [ ]:

    
%run setup_env.py
%matplotlib inline

Group By split-apply-combine

Splitting data into groups
Applying a function to each group independently
Combining the results into a data structure

Splitting

a python function, can be called on each of axis labels
a list or NumPy array of the same length as the selected asix
a dict or Series, providing a label - group name mapping
for DataFrame objects, give selected columns or a column name

GroupBy object attributes

groups
sum()
mean()

GroupBy with MultiIndex

s.groupby(level=0)
s.groupby(level=1)
s.groupby(level=[0, 1]).sum()
s.sum(level=0)
s.sum(level=1)

Iteraing through groups

for name, group in grouped:
    print(name)
    print(group)

Aggregation

DataFrameGroupBy.aggregate(np.sum)
DataFrameGroupBy.sum()

Applying multiple functions at once

DataFrameGroupBy.agg([np.sum, np.mean])
DataFrameGroupBy.agg({'total': np.sum, 'avg': np.mean})

Applying different functions to columns

grouped = df.groupby('PK')
grouped.agg({
    'A': np.sum,
    'B': 'first',
    'C': 'last',
    'D': 'mean',
})



In [ ]:

    
df = pd.DataFrame(np.random.randn(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df['A'] = [np.random.choice(['T', 'A', 'N']) for _ in range(10)]
df['B'] = [np.random.choice(['T', 'A', 'N']) for _ in range(10)]
display_html(df)
df.groupby('A').agg({
    'B': 'first',
    'C': np.sum,
    'D': 'last',
})

Transformation



In [ ]:

    
df = pd.DataFrame(np.random.randn(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df['A'] = [np.random.choice(['T', 'A', 'N']) for _ in range(10)]
df['B'] = [np.random.choice(['T', 'A', 'N']) for _ in range(10)]
display_html(df)
zscore = lambda x: (x - x.mean()) / x.std()
grouped = df.groupby('A')
grouped.transform(zscore)



In [ ]:

    
log.info('compare transformed with original data')
ts = pd.Series(np.random.normal(0.5, 2, 1100), pd.date_range('2000/1/1', periods=1100))
transformed = ts.groupby(lambda x: x.year).transform(lambda x: (x - x.mean()) / x.std())
pd.DataFrame({'original': ts, 'transformed': transformed}).plot()

Filtration



In [ ]:

    
df = pd.DataFrame({'A': np.arange(10), 'B': list('abbbccdddd'), 'C': np.random.randint(0, 10, 10)})
df2 = df.groupby('B').filter(lambda x: x['C'].sum() > 10)
df3 = df.groupby('B').head(2)
display_html(df, df2, df3)

Flexible apply

Other useful features

Automatic exclusion of nuisance columns

NA group handling

Grouping with ordered factors



In [ ]:

    
s = pd.Series(np.random.randn(100))
s.groupby(pd.qcut(s, [0, .25, .5, .75, 1.])).mean()

Grouping with a Grouper specification



In [ ]:

    
from datetime import datetime
df = pd.DataFrame({
    'Branch' : 'A A A A A A A B'.split(),
    'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
    'Quantity': [1,3,5,1,8,1,9,3],
    'Date' : [
        datetime(2013,1,1,13,0),
        datetime(2013,1,1,13,5),
        datetime(2013,10,1,20,0),
        datetime(2013,10,2,10,0),
        datetime(2013,10,1,20,0),
        datetime(2013,10,2,10,0),
        datetime(2013,12,2,12,0),
        datetime(2013,12,2,14,0),
    ]})
df2 = df.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()
display_html(df2)
df = df.set_index('Date')
df['Date'] = df.index + pd.offsets.MonthBegin(3)
df3 = df.groupby([pd.Grouper(freq='2M',key='Date'),'Buyer']).sum()
display_html(df3)

Taking the first row of each group

Taking nth row of each group

df.groupby('B').nth(1)

Group By split-apply-combine

Splitting

GroupBy object attributes

GroupBy with MultiIndex

Iteraing through groups

Aggregation

Applying multiple functions at once

Applying different functions to columns

Transformation

Filtration

Flexible apply

Other useful features

Automatic exclusion of nuisance columns

NA group handling

Grouping with ordered factors

Grouping with a Grouper specification

Taking the first row of each group

Taking nth row of each group

Enumerate group items

Plotting