This example is taken and adapted from the book "Python for Data Analysis" by Wes McKinney.
The United States Social Security Administration (SSA) supplies data about the frequency of baby names from 1880 through the present. http://www.ssa.gov/oact/babynames/
This is one reason to love IPython: you can run shell commands directly from a cell. (This might not work under Windows; the `more` command may work there instead of `head`.)
In [ ]:
!head -n 10 data/names/yob1880.txt
In [38]:
import pandas as pd
import numpy as np
%matplotlib inline
names1880 = pd.read_csv('data/names/yob1880.txt', names=['name', 'sex', 'births'])
In [ ]:
names1880
In [8]:
names1880.groupby('sex').births.sum()
The data set is split into one file per year, so we first need to concatenate them into a single DataFrame.
In [10]:
years = range(1880, 2011)
In [ ]:
years
In [14]:
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'data/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
In [15]:
names = pd.concat(pieces, ignore_index=True)
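A quick toy illustration (not the SSA data) of what `pd.concat` with `ignore_index=True` does here: the yearly frames are stacked vertically, each frame's own row labels are discarded, and a fresh 0..n-1 index is built, so there are no duplicate labels in the combined frame.

```python
import pandas as pd

# two small frames standing in for two yearly files
a = pd.DataFrame({'name': ['Mary', 'Anna'], 'births': [7065, 2604]})
b = pd.DataFrame({'name': ['John'], 'births': [9655]})

combined = pd.concat([a, b], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]
```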
In [25]:
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
# on very old pandas versions, replace index= with rows= and columns= with cols= if you get an error
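To see what this pivot does on a small scale, here is a toy sketch: rows come from `year`, columns from `sex`, and the births are summed within each (year, sex) cell.

```python
import pandas as pd

df = pd.DataFrame({'year': [1880, 1880, 1881, 1881],
                   'sex': ['F', 'M', 'F', 'M'],
                   'births': [3, 5, 4, 6]})

# one row per year, one column per sex, births summed per cell
tbl = df.pivot_table('births', index='year', columns='sex', aggfunc='sum')
print(tbl.loc[1880, 'F'])  # 3
```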
In [ ]:
total_births.tail()
In [ ]:
total_births.plot(title='Total births by sex and year')
Next, let’s insert a column prop with the fraction of babies given each name relative to the total number of births. A prop value of 0.02 indicates that 2 out of every 100 babies were given a particular name. We therefore group the data by year and sex, then add the new column to each group:
In [35]:
def add_prop(group):
    # cast to float so the division yields fractions
    # (under Python 2, integer division would floor to zero)
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
names = names.groupby(['year', 'sex']).apply(add_prop)
In [ ]:
names
When performing a group operation like this, it's often valuable to do a sanity check, like verifying that the prop column sums to 1 within all the groups. Since this is floating point data, use np.allclose to check that the group sums are sufficiently close to (but perhaps not exactly equal to) 1.
In [ ]:
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
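As a side note on why `np.allclose` is the right tool here, a minimal standalone sketch: binary floating point cannot represent decimal fractions like 0.1 exactly, so exact equality can fail even when the arithmetic is correct.

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in binary floating point,
# so an exact comparison fails...
print(0.1 + 0.2 == 0.3)             # False

# ...but allclose tolerates the tiny rounding error
print(np.allclose(0.1 + 0.2, 0.3))  # True
```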
In [40]:
def get_top1000(group):
    return group.sort_values(by='births', ascending=False)[:1000]
In [41]:
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
In [44]:
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
In [45]:
total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')
In [ ]:
total_births
In [46]:
subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
In [ ]:
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")
In [48]:
#Measuring diversity
In [50]:
table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc='sum')
In [ ]:
table.plot(title='Sum of top1000.prop by year and sex', yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
Another interesting metric is the number of distinct names, taken in order of popularity from highest to lowest, that make up the top 50% of births. This number is a bit trickier to compute: after sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%.
In [56]:
df = boys[boys.year == 2010]
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:10]
prop_cumsum.searchsorted(0.5)
Out[56]:
In [59]:
df = boys[boys.year == 1900]
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.searchsorted(0.5) + 1
Out[59]:
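To make the cumsum/searchsorted trick concrete, a toy sketch with four made-up names whose props sum to 1: `searchsorted` returns the 0-based position where 0.5 would be inserted into the cumulative sums, so adding 1 gives the count of names needed to cover half of all births.

```python
import pandas as pd

props = pd.Series([0.4, 0.3, 0.2, 0.1])  # already sorted descending
cum = props.cumsum()                     # running totals toward 1.0

# 0.5 first falls between the 1st and 2nd cumulative values,
# so 2 names cover the top 50% of births
print(int(cum.searchsorted(0.5)) + 1)    # 2
```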
In [73]:
def get_quantile_count(group, q=0.5):
    group = group.sort_values(by='prop', ascending=False)
    # searchsorted gives the 0-based position where q would land,
    # so add 1 to count the names needed to reach the quantile
    return int(group.prop.cumsum().searchsorted(q)) + 1
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
In [ ]:
diversity = diversity.unstack('sex')
diversity.head()
In [ ]:
diversity.plot(title="Number of popular names in top 50%")