Title: Apply Operations To Groups In Pandas
Slug: pandas_apply_operations_to_groups
Summary: Apply Operations To Groups In Pandas
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

Preliminaries



In [1]:

    
# import modules
import pandas as pd



In [2]:

    
# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df









    Out[2]:







  
    
      
      regiment
      company
      name
      preTestScore
      postTestScore
    
  
  
    
      0
      Nighthawks
      1st
      Miller
      4
      25
    
    
      1
      Nighthawks
      1st
      Jacobson
      24
      94
    
    
      2
      Nighthawks
      2nd
      Ali
      31
      57
    
    
      3
      Nighthawks
      2nd
      Milner
      2
      62
    
    
      4
      Dragoons
      1st
      Cooze
      3
      70
    
    
      5
      Dragoons
      1st
      Jacon
      4
      25
    
    
      6
      Dragoons
      2nd
      Ryaner
      24
      94
    
    
      7
      Dragoons
      2nd
      Sone
      31
      57
    
    
      8
      Scouts
      1st
      Sloan
      2
      62
    
    
      9
      Scouts
      1st
      Piger
      3
      70
    
    
      10
      Scouts
      2nd
      Riani
      2
      62
    
    
      11
      Scouts
      2nd
      Ali
      3
      70



In [3]:

    
# Create a groupby variable that groups preTestScores by regiment
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment









    Out[3]:





<pandas.core.groupby.SeriesGroupBy object at 0x113ddb550>

"This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups." - Python for Data Analysis

View a grouping

Use list() to show what a grouping looks like



In [4]:

    
list(df['preTestScore'].groupby(df['regiment']))









    Out[4]:





[('Dragoons', 4     3
  5     4
  6    24
  7    31
  Name: preTestScore, dtype: int64), ('Nighthawks', 0     4
  1    24
  2    31
  3     2
  Name: preTestScore, dtype: int64), ('Scouts', 8     2
  9     3
  10    2
  11    3
  Name: preTestScore, dtype: int64)]

Descriptive statistics by group



In [5]:

    
df['preTestScore'].groupby(df['regiment']).describe()

Mean of each regiment's preTestScore



In [6]:

    
groupby_regiment.mean()









    Out[6]:





regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64

Mean preTestScores grouped by regiment and company



In [7]:

    
df['preTestScore'].groupby([df['regiment'], df['company']]).mean()









    Out[7]:





regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

Mean preTestScores grouped by regiment and company without heirarchical indexing



In [8]:

    
df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()









    Out[8]:







  
    
      company
      1st
      2nd
    
    
      regiment
      
      
    
  
  
    
      Dragoons
      3.5
      27.5
    
    
      Nighthawks
      14.0
      16.5
    
    
      Scouts
      2.5
      2.5

Group the entire dataframe by regiment and company



In [9]:

    
df.groupby(['regiment', 'company']).mean()









    Out[9]:







  
    
      
      
      preTestScore
      postTestScore
    
    
      regiment
      company
      
      
    
  
  
    
      Dragoons
      1st
      3.5
      47.5
    
    
      2nd
      27.5
      75.5
    
    
      Nighthawks
      1st
      14.0
      59.5
    
    
      2nd
      16.5
      59.5
    
    
      Scouts
      1st
      2.5
      66.0
    
    
      2nd
      2.5
      66.0

Number of observations in each regiment and company



In [10]:

    
df.groupby(['regiment', 'company']).size()









    Out[10]:





regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64

Iterate an operations over groups



In [11]:

    
# Group the dataframe by regiment, and for each regiment,
for name, group in df.groupby('regiment'): 
    # print the name of the regiment
    print(name)
    # print the data of that regiment
    print(group)









    



Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70

Group by columns

Specifically in this case: group by the data types of the columns (i.e. axis=1) and then use list() to view what that grouping looks like



In [12]:

    
list(df.groupby(df.dtypes, axis=1))









    Out[12]:





[(dtype('int64'),     preTestScore  postTestScore
  0              4             25
  1             24             94
  2             31             57
  3              2             62
  4              3             70
  5              4             25
  6             24             94
  7             31             57
  8              2             62
  9              3             70
  10             2             62
  11             3             70),
 (dtype('O'),       regiment company      name
  0   Nighthawks     1st    Miller
  1   Nighthawks     1st  Jacobson
  2   Nighthawks     2nd       Ali
  3   Nighthawks     2nd    Milner
  4     Dragoons     1st     Cooze
  5     Dragoons     1st     Jacon
  6     Dragoons     2nd    Ryaner
  7     Dragoons     2nd      Sone
  8       Scouts     1st     Sloan
  9       Scouts     1st     Piger
  10      Scouts     2nd     Riani
  11      Scouts     2nd       Ali)]

In the dataframe "df", group by "regiments, take the mean values of the other variables for those groups, then display them with the prefix_mean



In [13]:

    
df.groupby('regiment').mean().add_prefix('mean_')









    Out[13]:







  
    
      
      mean_preTestScore
      mean_postTestScore
    
    
      regiment
      
      
    
  
  
    
      Dragoons
      15.50
      61.5
    
    
      Nighthawks
      15.25
      59.5
    
    
      Scouts
      2.50
      66.0

Create a function to get the stats of a group



In [14]:

    
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

Create bins and bin up postTestScore by those pins



In [15]:

    
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)

Apply the get_stats() function to each postTestScore bin



In [16]:

    
df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()

	count	mean	std	min	25%	50%	75%	max
regiment
Dragoons	4.0	15.50	14.153916	3.0	3.75	14.0	25.75	31.0
Nighthawks	4.0	15.25	14.453950	2.0	3.50	14.0	25.75	31.0
Scouts	4.0	2.50	0.577350	2.0	2.00	2.5	3.00	3.0

	count	max	mean	min
categories
Good	8.0	70.0	63.75	57.0
Great	2.0	94.0	94.00	94.0
Low	2.0	25.0	25.00	25.0
Okay	0.0	NaN	NaN	NaN

	regiment	company	name	preTestScore	postTestScore
0	Nighthawks	1st	Miller	4	25
1	Nighthawks	1st	Jacobson	24	94
2	Nighthawks	2nd	Ali	31	57
3	Nighthawks	2nd	Milner	2	62
4	Dragoons	1st	Cooze	3	70
5	Dragoons	1st	Jacon	4	25
6	Dragoons	2nd	Ryaner	24	94
7	Dragoons	2nd	Sone	31	57
8	Scouts	1st	Sloan	2	62
9	Scouts	1st	Piger	3	70
10	Scouts	2nd	Riani	2	62
11	Scouts	2nd	Ali	3	70

		preTestScore	postTestScore
regiment	company
Dragoons	1st	3.5	47.5
Dragoons	2nd	27.5	75.5
Nighthawks	1st	14.0	59.5
Nighthawks	2nd	16.5	59.5
Scouts	1st	2.5	66.0
Scouts	2nd	2.5	66.0