Solution: Hands-on with Pandas

This notebook will walk you through some exercises to get practice using Pandas for data manipulation.

As you use this, feel free to make ample use of the Pandas Documentation, the Pandas StackOverflow Channel, and your favorite search engine. For example, if you search phrases like "Pandas sum all columns", you're very likely to find an answer to the question you have in mind.

Also, if it comes down to it, note that solutions are available in the Git repository.



In [1]:

    
# Start with our normal batch of imports and settings
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Following is optional: set plotting styles
import seaborn; seaborn.set()

Diving Deeper into Baby Names

In the lecture, we looked at the US Social Security Baby Names data. Here let's dive a little bit deeper into this.

Try to do the following with the Baby Names data.

0. Load the baby names data

(Here you can copy the code from the other notebook; make sure you understand what it's doing!)



In [2]:

    
def load_year(year):
    data = pd.read_csv('../data/names/yob{0}.txt'.format(year),
                       names=['name', 'gender', 'births'])
    data['year'] = year
    return data

names = pd.concat([load_year(year) for year in range(1880, 2014)])
names.head()

1. Find your own name within the data.

How many babies per year are born with your name?
What fraction of births each year have your name?

Note: there are multiple ways to do this, but the first part will use masking and pivot tables, while the second part might also throw-in a groupby.



In [3]:

    
# Plot the prevalence of my name over time
my_name = 'Jacob'
subset = names[names.name == my_name]

births = subset.pivot_table('births', index='year',
                            columns='gender', aggfunc='sum')
births.plot();



In [4]:

    
# Whoah... there are some female "Jacob"s? Let's look at this:
births['F'].fillna(0).plot();



In [5]:

    
# Now we'll normalize the births against the total for each year
def add_birth_frac(group):
    group['birth_frac'] = group.births / group.births.sum()
    return group

names = names.groupby(['year', 'gender']).apply(add_birth_frac)
names.head()



In [6]:

    
names[names.name == my_name].pivot_table('birth_frac', index='year',
                                         columns='gender', aggfunc='sum').plot();

My name (Jacob) seems to have dipped in popularity in the 1960s, only to make a profound rise around the time I was born (1981). I'll choose to presume that I was the trend-setter.

2. Find names which have switched genders.

This is a bit tricky: you might be tempted to use a groupby and apply over the multiple indices ['year', 'gender', 'name'], but if you try this you'll find that it's very computationally intensive.

I'd suggest doing the following:

Use a pivot table, and find the total number of births for each name before some early date (say, 1920) and after some later date (say, 1980).
Compute the percentage of males for each name within those groups.
Use masking to find which names have transitioned from a low percentage to a high percentage, and vice versa.

Is a name more likely to transition from female to male, or from male to female?



In [7]:

    
births = names.pivot_table('births', index=['year', 'name'], columns='gender')
births = births.fillna(0)
births.head()



In [8]:

    
early_period = births[:1920].fillna(0).sum(axis=0, level=1)
late_period = births[1980:].fillna(0).sum(axis=0, level=1)

early_period.head()



In [9]:

    
early_pct_M = 100 * early_period['M'] / early_period.sum(1)
late_pct_M = 100 * late_period['M'] / late_period.sum(1)
fractions = pd.DataFrame({'early_pct_M': early_pct_M,
                         'late_pct_M': late_pct_M})
fractions.head()









    Out[9]:






  
    
      
      early_pct_M
      late_pct_M
    
  
  
    
      Aaban
      NaN
       100
    
    
      Aabha
      NaN
         0
    
    
      Aabid
      NaN
       100
    
    
      Aabriella
      NaN
         0
    
    
      Aadam
      NaN
       100

Names which switched from female to male:



In [10]:

    
fractions[(fractions.early_pct_M < 40) & (fractions.late_pct_M > 60)]









    Out[10]:






  
    
      
      early_pct_M
      late_pct_M
    
  
  
    
      Byrd
       0
       74.163180
    
    
      Donnie
       0
       89.382491
    
    
      Texas
       0
       61.595547

Names which switched from male to female:



In [11]:

    
fractions[(fractions.late_pct_M < 40) & (fractions.early_pct_M > 60)]









    Out[11]:






  
    
      
      early_pct_M
      late_pct_M
    
  
  
    
      Addison
       100.000000
       11.834706
    
    
      Allison
       100.000000
        1.339702
    
    
      Ashley
       100.000000
        1.829521
    
    
      Aubrey
       100.000000
       27.222843
    
    
      Avery
       100.000000
       35.322797
    
    
      Bailey
       100.000000
       19.046721
    
    
      Beverly
       100.000000
        1.213195
    
    
      Billie
       100.000000
       23.801506
    
    
      Dana
       100.000000
       21.765366
    
    
      Dee
        76.923077
       31.270896
    
    
      Edie
       100.000000
        1.747844
    
    
      Hollie
       100.000000
        7.608124
    
    
      Holly
       100.000000
        0.772741
    
    
      Ivey
       100.000000
       38.790821
    
    
      Jodie
       100.000000
       11.670106
    
    
      Kelly
       100.000000
       14.757141
    
    
      Leslie
        90.804598
       29.935140
    
    
      Lindsay
       100.000000
        3.829148
    
    
      Lindsey
       100.000000
        4.732257
    
    
      Lonie
       100.000000
       38.973799
    
    
      Lynn
        85.714286
       22.320630
    
    
      Madison
       100.000000
        2.156303
    
    
      Marion
        62.171053
       27.595220
    
    
      Merida
       100.000000
        0.000000
    
    
      Morgan
       100.000000
       16.627275
    
    
      Nova
       100.000000
        7.898499
    
    
      Orla
       100.000000
       29.629630
    
    
      Ossie
        64.285714
       36.361646
    
    
      Pat
       100.000000
       39.964689
    
    
      Payton
       100.000000
       33.960323
    
    
      Presley
       100.000000
       20.217981
    
    
      Reese
       100.000000
       38.903715
    
    
      Sandy
       100.000000
       13.159606
    
    
      Shelby
       100.000000
       10.215817
    
    
      Stacy
       100.000000
       11.978544
    
    
      Sydney
       100.000000
        5.343434
    
    
      Taylor
       100.000000
       26.055568
    
    
      Tracy
       100.000000
       19.573779

Apparently it's much more common for a name to transition from male to female than the other way around!

	name	gender	births	year
0	Mary	F	7065	1880
1	Anna	F	2604	1880
2	Emma	F	2003	1880
3	Elizabeth	F	1939	1880
4	Minnie	F	1746	1880

	name	gender	births	year	birth_frac
0	Mary	F	7065	1880	0.077643
1	Anna	F	2604	1880	0.028618
2	Emma	F	2003	1880	0.022013
3	Elizabeth	F	1939	1880	0.021309
4	Minnie	F	1746	1880	0.019188

	gender	F	M
year	name
1880	Aaron	0	102
	Ab	0	5
	Abbie	71	0
	Abbott	0	5
	Abby	6	0

gender	F	M
name
Aaron	0	196
Ab	0	5
Abbie	152	0
Abbott	0	5
Abby	13	0

	early_pct_M	late_pct_M
Addison	100.000000	11.834706
Allison	100.000000	1.339702
Ashley	100.000000	1.829521
Aubrey	100.000000	27.222843
Avery	100.000000	35.322797
Bailey	100.000000	19.046721
Beverly	100.000000	1.213195
Billie	100.000000	23.801506
Dana	100.000000	21.765366
Dee	76.923077	31.270896
Edie	100.000000	1.747844
Hollie	100.000000	7.608124
Holly	100.000000	0.772741
Ivey	100.000000	38.790821
Jodie	100.000000	11.670106
Kelly	100.000000	14.757141
Leslie	90.804598	29.935140
Lindsay	100.000000	3.829148
Lindsey	100.000000	4.732257
Lonie	100.000000	38.973799
Lynn	85.714286	22.320630
Madison	100.000000	2.156303
Marion	62.171053	27.595220
Merida	100.000000	0.000000
Morgan	100.000000	16.627275
Nova	100.000000	7.898499
Orla	100.000000	29.629630
Ossie	64.285714	36.361646
Pat	100.000000	39.964689
Payton	100.000000	33.960323
Presley	100.000000	20.217981
Reese	100.000000	38.903715
Sandy	100.000000	13.159606
Shelby	100.000000	10.215817
Stacy	100.000000	11.978544
Sydney	100.000000	5.343434
Taylor	100.000000	26.055568
Tracy	100.000000	19.573779