US Baby Names

This example is taken and adapted from the book "Python for Data Analysis" by Wes McKinney.

The United States Social Security Administration (SSA) supplies data about the frequency of baby names from 1880 through the present. http://www.ssa.gov/oact/babynames/

That's one reason to love IPython: you can run shell commands directly from a cell. (This might not work under Windows; there, the `more` command may serve as a substitute.)


In [ ]:
!head -n 10 data/names/yob1880.txt
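If shell commands are unavailable (for example on Windows), the same preview can be done in plain Python. A minimal sketch; since `data/names/yob1880.txt` may not be on your machine, a throwaway file stands in for it here:

```python
from itertools import islice

def preview(path, n=10):
    """Return the first n lines of a text file, like `head -n`."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]

# Throwaway file standing in for data/names/yob1880.txt
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n')

print(preview('demo.txt', 2))  # → ['Mary,F,7065', 'Anna,F,2604']
```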

In [38]:
import pandas as pd
import numpy as np
%matplotlib inline

names1880 = pd.read_csv('data/names/yob1880.txt', names=['name', 'sex', 'births'])

In [ ]:
names1880

In [8]:
names1880.groupby('sex').births.sum()

The data set is split into one file per year, so we first need to concatenate the files into a single DataFrame.


In [10]:
years = range(1880, 2011)

In [ ]:
years

In [14]:
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'data/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year 
    pieces.append(frame)

In [15]:
names = pd.concat(pieces, ignore_index=True)

In [25]:
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
# On very old pandas versions, replace index= with rows= and columns= with cols=
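The same table can be built without pivot_table, using groupby plus unstack; a sketch on a tiny stand-in frame:

```python
import pandas as pd

# Tiny stand-in for the full names DataFrame
names_demo = pd.DataFrame({
    'year':   [1880, 1880, 1881, 1881],
    'sex':    ['F', 'M', 'F', 'M'],
    'births': [100, 110, 120, 130],
})

via_pivot = names_demo.pivot_table('births', index='year',
                                   columns='sex', aggfunc='sum')
# Equivalent: sum births per (year, sex), then move sex into the columns
via_groupby = names_demo.groupby(['year', 'sex']).births.sum().unstack('sex')

print(via_pivot.equals(via_groupby))  # → True
```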

In [ ]:
total_births.tail()

In [ ]:
total_births.plot(title='Total births by sex and year')

Next, let’s insert a column prop with the fraction of babies given each name relative to the total number of births. A prop value of 0.02 indicates that 2 out of every 100 babies were given a particular name. Thus, we group the data by year and sex, then add the new column to each group:


In [35]:
def add_prop(group):
    # Cast to float so the division is true division, not integer division (Python 2)
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
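A vectorized alternative avoids the per-group apply: groupby(...).transform('sum') broadcasts each group's total back onto the original rows, so the division can be done in one step. Sketched on a small stand-in frame:

```python
import pandas as pd

names_demo = pd.DataFrame({
    'year':   [1880, 1880, 1880, 1880],
    'sex':    ['F', 'F', 'M', 'M'],
    'name':   ['Mary', 'Anna', 'John', 'William'],
    'births': [60, 40, 30, 70],
})

# Each row's births divided by its (year, sex) group total
group_totals = names_demo.groupby(['year', 'sex']).births.transform('sum')
names_demo['prop'] = names_demo.births / group_totals

print(names_demo.prop.tolist())  # → [0.6, 0.4, 0.3, 0.7]
```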

In [ ]:
names

When performing a group operation like this, it's often valuable to do a sanity check, like verifying that the prop column sums to 1 within all the groups. Since this is floating point data, use np.allclose to check that the group sums are sufficiently close to (but perhaps not exactly equal to) 1.


In [ ]:
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
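The reason for np.allclose rather than a plain equality check is ordinary floating-point rounding; a minimal illustration:

```python
import numpy as np

total = 0.1 + 0.2                # mathematically 0.3, but not stored exactly
print(total == 0.3)              # → False
print(np.allclose(total, 0.3))   # → True
```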

In [40]:
def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]

In [41]:
grouped = names.groupby(['year', 'sex']) 
top1000 = grouped.apply(get_top1000)
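An equivalent way to take the top rows per group, without writing a helper function, is to sort the whole frame once and take head(n) within each group. A sketch on a stand-in frame, taking the top 1 per group in place of the top 1000:

```python
import pandas as pd

names_demo = pd.DataFrame({
    'year':   [1880] * 4,
    'sex':    ['F', 'F', 'M', 'M'],
    'name':   ['Mary', 'Anna', 'John', 'William'],
    'births': [60, 40, 30, 70],
})

# Sort once by births, then keep the first row of each (year, sex) group
top = (names_demo.sort_values('births', ascending=False)
       .groupby(['year', 'sex'])
       .head(1))

print(sorted(top['name'].tolist()))  # → ['Mary', 'William']
```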

Analyzing Name Trends


In [44]:
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

In [45]:
total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')

In [ ]:
total_births

In [46]:
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]

In [ ]:
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")

In [48]:
#Measuring diversity

In [50]:
table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc='sum')

In [ ]:
table.plot(title='Sum of table1000.prop by year and sex', yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))

Another interesting metric is the number of distinct names, taken in order of popularity from highest to lowest, in the top 50% of births. This number is a bit trickier to compute. After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%.


In [56]:
df = boys[boys.year == 2010]
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:10]
prop_cumsum.searchsorted(0.5)


Out[56]:
array([116])

In [59]:
df = boys[boys.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.searchsorted(0.5) + 1


Out[59]:
array([25])
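Why the + 1? searchsorted returns the zero-based position at which the quantile would be inserted into the sorted cumulative sums, so the count of names is that position plus one. A toy example with three names whose props, sorted descending, are 0.5, 0.3, and 0.2:

```python
import numpy as np

# Cumulative proportions of the three most popular names
cumprops = np.array([0.5, 0.8, 1.0])

pos = cumprops.searchsorted(0.5)  # zero-based insertion point
print(pos)        # → 0
print(pos + 1)    # → 1: the top name alone already covers 50% of births
```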

In [73]:
def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)

In [ ]:
diversity = diversity.unstack('sex')
diversity.head()

In [ ]:
diversity.plot(title="Number of popular names in top 50%")
