What this notebook is about

Goal:

  • to learn about the basics of Pandas groupby operations using the census information about states

References:

Note the split-apply-combine framework of thinking

Version of pandas


In [1]:
# note version of pandas
import pandas
pandas.__version__


Out[1]:
'0.16.2'

Setup for grabbing state-related census data


In [2]:
import us
import census
import settings

import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice

c = census.Census(settings.CENSUS_KEY)

def states(variables='NAME'):
    geo={'for':'state:*'}
    
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r

In [3]:
# make a dataframe from the total populations of states in the 2010 Census

df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df.head()


Out[3]:
NAME P0010001 state
0 Alabama 4779736 01
1 Alaska 710231 02
2 Arizona 6392017 04
3 Arkansas 2915918 05
4 California 37253956 06

In [4]:
# check that that we have the right total population

df.P0010001.sum() == 308745538


Out[4]:
True

In [5]:
# add a column with the first letter 
# we'll be grouping states based on the first letter of the state NAME

df['first_letter'] = df.NAME.apply(lambda s:s[0])
df.head()


Out[5]:
NAME P0010001 state first_letter
0 Alabama 4779736 01 A
1 Alaska 710231 02 A
2 Arizona 6392017 04 A
3 Arkansas 2915918 05 A
4 California 37253956 06 C

Possible to do grouping in many ways. "The mapping can be specified many different ways":

  • A Python function, to be called on each of the axis labels
  • A list or NumPy array of the same length as the selected axis
  • A dict or Series, providing a label -> group name mapping
  • For DataFrame objects, a string indicating a column to be used to group. Of course df.groupby('A') is just syntactic sugar for df.groupby(df['A']), but it makes life simpler
  • A list of any of the above things

We can also group by columns, axis=1, but such functionality is not demonstrated in this notebook.


In [6]:
# we can explicitly name df.first_letter

grouped = df.groupby(df.first_letter)
grouped


Out[6]:
<pandas.core.groupby.DataFrameGroupBy object at 0x106e76d68>

In [7]:
# shorthand for df.first_letter
# same thing as df.groupby(df.first_letter)

grouped = df.groupby('first_letter')
grouped


Out[7]:
<pandas.core.groupby.DataFrameGroupBy object at 0x106e76f28>

In [8]:
# count the number of states with each first letter -- look at top of the resulting DataFrame

grouped.count().head()


Out[8]:
NAME P0010001 state
first_letter
A 4 4 4
C 3 3 3
D 2 2 2
F 1 1 1
G 1 1 1

In [9]:
# count the number of states with each first letter -- look at bottom of the resulting DataFrame

grouped.count().tail()


Out[9]:
NAME P0010001 state
first_letter
S 2 2 2
T 2 2 2
U 1 1 1
V 2 2 2
W 4 4 4

In [10]:
# we didn't have to explicitly create a new column -- we could groupby on a dynamically generated Series
# note the use of str operations on df.NAME: 
# http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods

df.groupby(df.NAME.str.slice(0,1)).sum()


Out[10]:
P0010001
NAME
A 14797902
C 45857249
D 1499657
F 18801310
G 9687653
H 1360301
I 23928371
K 7192485
L 4533372
M 38782746
N 46280611
O 19118929
P 12702379
R 1052567
S 5439544
T 31491666
U 2763885
V 8626765
W 14828146

In [11]:
# we can get groups of indexes

df.groupby('first_letter').groups


Out[11]:
{'A': [0, 1, 2, 3],
 'C': [4, 5, 6],
 'D': [7, 8],
 'F': [9],
 'G': [10],
 'H': [11],
 'I': [12, 13, 14, 15],
 'K': [16, 17],
 'L': [18],
 'M': [19, 20, 21, 22, 23, 24, 25, 26],
 'N': [27, 28, 29, 30, 31, 32, 33, 34],
 'O': [35, 36, 37],
 'P': [38],
 'R': [39],
 'S': [40, 41],
 'T': [42, 43],
 'U': [44],
 'V': [45, 46],
 'W': [47, 48, 49, 50]}

In [12]:
# loop through all groups --> here just the first one

for name, group in islice(df.groupby('first_letter'),1):
    print(name)
    print (type(group)) # yes -- a DataFrame
    print (group.index)
    print((group),)


A
<class 'pandas.core.frame.DataFrame'>
Int64Index([0, 1, 2, 3], dtype='int64')
       NAME  P0010001 state first_letter
0   Alabama   4779736    01            A
1    Alaska    710231    02            A
2   Arizona   6392017    04            A
3  Arkansas   2915918    05            A

In [13]:
# how about accessing group 'C'?
# http://stackoverflow.com/a/14734627/7782
# http://stackoverflow.com/questions/19804282/in-pandas-is-there-something-like-a-groupby-get-group-but-with-an-optional-defa
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.core.groupby.GroupBy.get_group.html

grouped = df.groupby('first_letter')
grouped.get_group('C')


Out[13]:
NAME P0010001 state first_letter
4 California 37253956 06 C
5 Colorado 5029196 08 C
6 Connecticut 3574097 09 C

In [14]:
# total population of states starting with 'C'

grouped.get_group('C').P0010001.sum()


Out[14]:
45857249

Aggregation


In [15]:
# generate a Series of total populations by first letter

grouped = df.groupby('first_letter')
s = grouped['P0010001'].sum()
s


Out[15]:
first_letter
A    14797902
C    45857249
D     1499657
F    18801310
G     9687653
H     1360301
I    23928371
K     7192485
L     4533372
M    38782746
N    46280611
O    19118929
P    12702379
R     1052567
S     5439544
T    31491666
U     2763885
V     8626765
W    14828146
Name: P0010001, dtype: int64

In [16]:
# sort the list to get the most populous groups
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort.html
# note sorting in place

s.sort(ascending=False)
s


Out[16]:
first_letter
N    46280611
C    45857249
M    38782746
T    31491666
I    23928371
O    19118929
F    18801310
W    14828146
A    14797902
P    12702379
G     9687653
V     8626765
K     7192485
S     5439544
L     4533372
U     2763885
D     1499657
H     1360301
R     1052567
Name: P0010001, dtype: int64

In [17]:
s.index


Out[17]:
Index(['N', 'C', 'M', 'T', 'I', 'O', 'F', 'W', 'A', 'P', 'G', 'V', 'K', 'S',
       'L', 'U', 'D', 'H', 'R'],
      dtype='object', name='first_letter')

In [18]:
# first pass at pulling together the letter, the total population, and the list of corresponding states

for k in s.index:
    print k, s[k], list(grouped.get_group(k).NAME)


  File "<ipython-input-18-46eede2925bd>", line 4
    print k, s[k], list(grouped.get_group(k).NAME)
          ^
SyntaxError: invalid syntax

In [ ]:


In [ ]:
# let's try this again and try to do this in a more idiomatic Pandas fashion
# ideally, generate a DataFrame with a NAME column that is a list of all states with the same first letter

# It turns out that apply can come to the rescue
# here' calculate the list of NAMEs

# http://pandas.pydata.org/pandas-docs/dev/groupby.html#flexible-apply
# http://stackoverflow.com/questions/19530568/can-pandas-groupby-aggregate-into-a-list-rather-than-sum-mean-etc

df.groupby("first_letter").apply(lambda x: list(x['NAME']))

In [ ]:
# apply can be used to add up the populations by group

df.groupby("first_letter").apply(lambda x: np.sum(x['P0010001']))

In [ ]:
# make a tuple out of the list of names and the population

df.groupby("first_letter").apply(lambda x:( list(x['NAME']), np.sum(x['P0010001'])))

In [ ]:
# remind ourselves on how to turn a tuple into a Series with a small example

Series(([1,2],2), index=['one','two'])

In [ ]:
# we're ready to make a new DataFrame

df.groupby("first_letter").apply(lambda x:Series((list(x['NAME']), np.sum(x['P0010001'])), 
                                                 index=['states','total_pop']))

In [ ]:
df2 = df.groupby("first_letter").apply(lambda x:Series((list(x['NAME']), np.sum(x['P0010001'])), 
                                                 index=['states','total_pop'])).sort_index(by='total_pop',ascending=False)

In [ ]:
# make sure you understand the syntax here:
# .ix: http://pandas.pydata.org/pandas-docs/dev/indexing.html#advanced-indexing-with-ix
# ability to grab columns by name to return a new DataFrame

df2.ix['C'][['states','total_pop']]

In [ ]:
print "states that start with 'C'", df2.ix['C']['states']

In [ ]:
print "total population of states that start with 'C'", df2.ix['C']['total_pop']

In [ ]: