What this notebook is about

Goal:

to learn about the basics of Pandas groupby operations using the census information about states

References:

Group By: split-apply-combine — pandas 0.13.1 documentation
Python for Data Analysis > 9. Data Aggregation and Group Operations > GroupBy Mechanics : Safari Books Online

Note the split-apply-combine framework of thinking

Version of pandas



In [1]:

    
# note version of pandas
import pandas
pandas.__version__









    Out[1]:





'0.16.2'



In [2]:

    
import us
import census
import settings

import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice

c = census.Census(settings.CENSUS_KEY)

def states(variables='NAME'):
    geo={'for':'state:*'}
    
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r



In [3]:

    
# make a dataframe from the total populations of states in the 2010 Census

df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df.head()



In [4]:

    
# check that that we have the right total population

df.P0010001.sum() == 308745538









    Out[4]:





True



In [5]:

    
# add a column with the first letter 
# we'll be grouping states based on the first letter of the state NAME

df['first_letter'] = df.NAME.apply(lambda s:s[0])
df.head()









    Out[5]:






  
    
      
      NAME
      P0010001
      state
      first_letter
    
  
  
    
      0
      Alabama
      4779736
      01
      A
    
    
      1
      Alaska
      710231
      02
      A
    
    
      2
      Arizona
      6392017
      04
      A
    
    
      3
      Arkansas
      2915918
      05
      A
    
    
      4
      California
      37253956
      06
      C

Group By: split-apply-combine — pandas 0.13.1 documentation

Possible to do grouping in many ways. "The mapping can be specified many different ways":

A Python function, to be called on each of the axis labels
A list or NumPy array of the same length as the selected axis
A dict or Series, providing a label -> group name mapping
For DataFrame objects, a string indicating a column to be used to group. Of course df.groupby('A') is just syntactic sugar for df.groupby(df['A']), but it makes life simpler
A list of any of the above things

We can also group by columns, axis=1, but such functionality is not demonstrated in this notebook.



In [6]:

    
# we can explicitly name df.first_letter

grouped = df.groupby(df.first_letter)
grouped









    Out[6]:





<pandas.core.groupby.DataFrameGroupBy object at 0x106e76d68>



In [7]:

    
# shorthand for df.first_letter
# same thing as df.groupby(df.first_letter)

grouped = df.groupby('first_letter')
grouped









    Out[7]:





<pandas.core.groupby.DataFrameGroupBy object at 0x106e76f28>



In [8]:

    
# count the number of states with each first letter -- look at top of the resulting DataFrame

grouped.count().head()









    Out[8]:






  
    
      
      NAME
      P0010001
      state
    
    
      first_letter
      
      
      
    
  
  
    
      A
      4
      4
      4
    
    
      C
      3
      3
      3
    
    
      D
      2
      2
      2
    
    
      F
      1
      1
      1
    
    
      G
      1
      1
      1



In [9]:

    
# count the number of states with each first letter -- look at bottom of the resulting DataFrame

grouped.count().tail()









    Out[9]:






  
    
      
      NAME
      P0010001
      state
    
    
      first_letter
      
      
      
    
  
  
    
      S
      2
      2
      2
    
    
      T
      2
      2
      2
    
    
      U
      1
      1
      1
    
    
      V
      2
      2
      2
    
    
      W
      4
      4
      4



In [10]:

    
# we didn't have to explicitly create a new column -- we could groupby on a dynamically generated Series
# note the use of str operations on df.NAME: 
# http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods

df.groupby(df.NAME.str.slice(0,1)).sum()



In [11]:

    
# we can get groups of indexes

df.groupby('first_letter').groups









    Out[11]:





{'A': [0, 1, 2, 3],
 'C': [4, 5, 6],
 'D': [7, 8],
 'F': [9],
 'G': [10],
 'H': [11],
 'I': [12, 13, 14, 15],
 'K': [16, 17],
 'L': [18],
 'M': [19, 20, 21, 22, 23, 24, 25, 26],
 'N': [27, 28, 29, 30, 31, 32, 33, 34],
 'O': [35, 36, 37],
 'P': [38],
 'R': [39],
 'S': [40, 41],
 'T': [42, 43],
 'U': [44],
 'V': [45, 46],
 'W': [47, 48, 49, 50]}



In [12]:

    
# loop through all groups --> here just the first one

for name, group in islice(df.groupby('first_letter'),1):
    print(name)
    print (type(group)) # yes -- a DataFrame
    print (group.index)
    print((group),)









    



A
<class 'pandas.core.frame.DataFrame'>
Int64Index([0, 1, 2, 3], dtype='int64')
       NAME  P0010001 state first_letter
0   Alabama   4779736    01            A
1    Alaska    710231    02            A
2   Arizona   6392017    04            A
3  Arkansas   2915918    05            A



In [13]:

    
# how about accessing group 'C'?
# http://stackoverflow.com/a/14734627/7782
# http://stackoverflow.com/questions/19804282/in-pandas-is-there-something-like-a-groupby-get-group-but-with-an-optional-defa
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.core.groupby.GroupBy.get_group.html

grouped = df.groupby('first_letter')
grouped.get_group('C')









    Out[13]:






  
    
      
      NAME
      P0010001
      state
      first_letter
    
  
  
    
      4
      California
      37253956
      06
      C
    
    
      5
      Colorado
      5029196
      08
      C
    
    
      6
      Connecticut
      3574097
      09
      C



In [14]:

    
# total population of states starting with 'C'

grouped.get_group('C').P0010001.sum()









    Out[14]:





45857249

Aggregation



In [15]:

    
# generate a Series of total populations by first letter

grouped = df.groupby('first_letter')
s = grouped['P0010001'].sum()
s









    Out[15]:





first_letter
A    14797902
C    45857249
D     1499657
F    18801310
G     9687653
H     1360301
I    23928371
K     7192485
L     4533372
M    38782746
N    46280611
O    19118929
P    12702379
R     1052567
S     5439544
T    31491666
U     2763885
V     8626765
W    14828146
Name: P0010001, dtype: int64



In [16]:

    
# sort the list to get the most populous groups
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort.html
# note sorting in place

s.sort(ascending=False)
s









    Out[16]:





first_letter
N    46280611
C    45857249
M    38782746
T    31491666
I    23928371
O    19118929
F    18801310
W    14828146
A    14797902
P    12702379
G     9687653
V     8626765
K     7192485
S     5439544
L     4533372
U     2763885
D     1499657
H     1360301
R     1052567
Name: P0010001, dtype: int64



In [17]:

    
s.index









    Out[17]:





Index(['N', 'C', 'M', 'T', 'I', 'O', 'F', 'W', 'A', 'P', 'G', 'V', 'K', 'S',
       'L', 'U', 'D', 'H', 'R'],
      dtype='object', name='first_letter')



In [18]:

    
# first pass at pulling together the letter, the total population, and the list of corresponding states

for k in s.index:
    print k, s[k], list(grouped.get_group(k).NAME)









    



  File "<ipython-input-18-46eede2925bd>", line 4
    print k, s[k], list(grouped.get_group(k).NAME)
          ^
SyntaxError: invalid syntax



In [ ]:



In [ ]:

    
# let's try this again and try to do this in a more idiomatic Pandas fashion
# ideally, generate a DataFrame with a NAME column that is a list of all states with the same first letter

# It turns out that apply can come to the rescue
# here' calculate the list of NAMEs

# http://pandas.pydata.org/pandas-docs/dev/groupby.html#flexible-apply
# http://stackoverflow.com/questions/19530568/can-pandas-groupby-aggregate-into-a-list-rather-than-sum-mean-etc

df.groupby("first_letter").apply(lambda x: list(x['NAME']))



In [ ]:

    
# apply can be used to add up the populations by group

df.groupby("first_letter").apply(lambda x: np.sum(x['P0010001']))



In [ ]:

    
# make a tuple out of the list of names and the population

df.groupby("first_letter").apply(lambda x:( list(x['NAME']), np.sum(x['P0010001'])))



In [ ]:

    
# remind ourselves on how to turn a tuple into a Series with a small example

Series(([1,2],2), index=['one','two'])



In [ ]:

    
# we're ready to make a new DataFrame

df.groupby("first_letter").apply(lambda x:Series((list(x['NAME']), np.sum(x['P0010001'])), 
                                                 index=['states','total_pop']))



In [ ]:

    
df2 = df.groupby("first_letter").apply(lambda x:Series((list(x['NAME']), np.sum(x['P0010001'])), 
                                                 index=['states','total_pop'])).sort_index(by='total_pop',ascending=False)



In [ ]:

    
# make sure you understand the syntax here:
# .ix: http://pandas.pydata.org/pandas-docs/dev/indexing.html#advanced-indexing-with-ix
# ability to grab columns by name to return a new DataFrame

df2.ix['C'][['states','total_pop']]



In [ ]:

    
print "states that start with 'C'", df2.ix['C']['states']



In [ ]:

    
print "total population of states that start with 'C'", df2.ix['C']['total_pop']



In [ ]:

	NAME	P0010001	state
0	Alabama	4779736	01
1	Alaska	710231	02
2	Arizona	6392017	04
3	Arkansas	2915918	05
4	California	37253956	06

	P0010001
NAME
A	14797902
C	45857249
D	1499657
F	18801310
G	9687653
H	1360301
I	23928371
K	7192485
L	4533372
M	38782746
N	46280611
O	19118929
P	12702379
R	1052567
S	5439544
T	31491666
U	2763885
V	8626765
W	14828146

	NAME	P0010001	state	first_letter
4	California	37253956	06	C
5	Colorado	5029196	08	C
6	Connecticut	3574097	09	C

What this notebook is about

Version of pandas

Setup for grabbing state-related census data

Aggregation