Solution: Hands-on with Pandas

This notebook will walk you through some exercises to get practice using Pandas for data manipulation.

As you use this, feel free to make ample use of the Pandas Documentation, the Pandas StackOverflow Channel, and your favorite search engine. For example, if you search phrases like "Pandas sum all columns", you're very likely to find an answer to the question you have in mind.

Also, if it comes down to it, note that solutions are available in the Git repository.


In [1]:
# Start with our normal batch of imports and settings
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Following is optional: set plotting styles
import seaborn; seaborn.set()

Diving Deeper into Baby Names

In the lecture, we looked at the US Social Security Baby Names data. Here let's dive a little bit deeper into this.

Try to do the following with the Baby Names data.

0. Load the baby names data

(Here you can copy the code from the other notebook; make sure you understand what it's doing!)


In [2]:
def load_year(year):
    data = pd.read_csv('../data/names/yob{0}.txt'.format(year),
                       names=['name', 'gender', 'births'])
    data['year'] = year
    return data

names = pd.concat([load_year(year) for year in range(1880, 2014)])
names.head()


Out[2]:
name gender births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

1. Find your own name within the data.

  • How many babies per year are born with your name?
  • What fraction of births each year have your name?

Note: there are multiple ways to do this, but the first part will use masking and pivot tables, while the second part might also throw-in a groupby.


In [3]:
# Plot the prevalence of my name over time
my_name = 'Jacob'
subset = names[names.name == my_name]

births = subset.pivot_table('births', index='year',
                            columns='gender', aggfunc='sum')
births.plot();



In [4]:
# Whoah... there are some female "Jacob"s? Let's look at this:
births['F'].fillna(0).plot();



In [5]:
# Now we'll normalize the births against the total for each year
def add_birth_frac(group):
    group['birth_frac'] = group.births / group.births.sum()
    return group

names = names.groupby(['year', 'gender']).apply(add_birth_frac)
names.head()


Out[5]:
name gender births year birth_frac
0 Mary F 7065 1880 0.077643
1 Anna F 2604 1880 0.028618
2 Emma F 2003 1880 0.022013
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188

In [6]:
names[names.name == my_name].pivot_table('birth_frac', index='year',
                                         columns='gender', aggfunc='sum').plot();


My name (Jacob) seems to have dipped in popularity in the 1960s, only to make a profound rise around the time I was born (1981). I'll choose to presume that I was the trend-setter.

2. Find names which have switched genders.

This is a bit tricky: you might be tempted to use a groupby and apply over the multiple indices ['year', 'gender', 'name'], but if you try this you'll find that it's very computationally intensive.

I'd suggest doing the following:

  • Use a pivot table, and find the total number of births for each name before some early date (say, 1920) and after some later date (say, 1980).
  • Compute the percentage of males for each name within those groups.
  • Use masking to find which names have transitioned from a low percentage to a high percentage, and vice versa.

Is a name more likely to transition from female to male, or from male to female?


In [7]:
births = names.pivot_table('births', index=['year', 'name'], columns='gender')
births = births.fillna(0)
births.head()


Out[7]:
gender F M
year name
1880 Aaron 0 102
Ab 0 5
Abbie 71 0
Abbott 0 5
Abby 6 0

In [8]:
early_period = births[:1920].fillna(0).sum(axis=0, level=1)
late_period = births[1980:].fillna(0).sum(axis=0, level=1)

early_period.head()


Out[8]:
gender F M
name
Aaron 0 196
Ab 0 5
Abbie 152 0
Abbott 0 5
Abby 13 0

In [9]:
early_pct_M = 100 * early_period['M'] / early_period.sum(1)
late_pct_M = 100 * late_period['M'] / late_period.sum(1)
fractions = pd.DataFrame({'early_pct_M': early_pct_M,
                         'late_pct_M': late_pct_M})
fractions.head()


Out[9]:
early_pct_M late_pct_M
Aaban NaN 100
Aabha NaN 0
Aabid NaN 100
Aabriella NaN 0
Aadam NaN 100

Names which switched from female to male:


In [10]:
fractions[(fractions.early_pct_M < 40) & (fractions.late_pct_M > 60)]


Out[10]:
early_pct_M late_pct_M
Byrd 0 74.163180
Donnie 0 89.382491
Texas 0 61.595547

Names which switched from male to female:


In [11]:
fractions[(fractions.late_pct_M < 40) & (fractions.early_pct_M > 60)]


Out[11]:
early_pct_M late_pct_M
Addison 100.000000 11.834706
Allison 100.000000 1.339702
Ashley 100.000000 1.829521
Aubrey 100.000000 27.222843
Avery 100.000000 35.322797
Bailey 100.000000 19.046721
Beverly 100.000000 1.213195
Billie 100.000000 23.801506
Dana 100.000000 21.765366
Dee 76.923077 31.270896
Edie 100.000000 1.747844
Hollie 100.000000 7.608124
Holly 100.000000 0.772741
Ivey 100.000000 38.790821
Jodie 100.000000 11.670106
Kelly 100.000000 14.757141
Leslie 90.804598 29.935140
Lindsay 100.000000 3.829148
Lindsey 100.000000 4.732257
Lonie 100.000000 38.973799
Lynn 85.714286 22.320630
Madison 100.000000 2.156303
Marion 62.171053 27.595220
Merida 100.000000 0.000000
Morgan 100.000000 16.627275
Nova 100.000000 7.898499
Orla 100.000000 29.629630
Ossie 64.285714 36.361646
Pat 100.000000 39.964689
Payton 100.000000 33.960323
Presley 100.000000 20.217981
Reese 100.000000 38.903715
Sandy 100.000000 13.159606
Shelby 100.000000 10.215817
Stacy 100.000000 11.978544
Sydney 100.000000 5.343434
Taylor 100.000000 26.055568
Tracy 100.000000 19.573779

Apparently it's much more common for a name to transition from male to female than the other way around!