Overview

This notebook is an attempt to recreate the dataset used in FiveThirtyEight's article "How to Tell Someone’s Age When All You Know Is Her Name".

The gist of the article is that by using a combination of

  1. Social Security Administration (SSA) Baby Name Data
  2. National Center for Health Statistics (NCHS) Birth Data
  3. SSA Actuarial Tables

you can generate a distribution of a name's popularity over time. The end goal is to estimate the likely median age, or the relative popularity, of a name after somebody with that name comes into the spotlight.

First, we import everything we'll end up using. (These are also defined in the requirements.txt file in the repository.)


In [1]:
import csv
import json
import os
from collections import namedtuple, defaultdict
from glob import glob

Collect Data

We use this convenience function to convert the CSV rows to namedtuples, which makes the data much easier to query: fields are accessed by column name, so you don't have to memorize the indexes of the columns you want.


In [2]:
def namedtuples_from_csv(filename, name=None, header=None, prefix=''):
    with open(filename, newline='') as f:  # newline='' is what the csv module expects
        reader = csv.reader(f)
        
        if header is None:
            header = next(reader)
            header = [prefix + col.lower().strip().replace(' ', '_') for col in header]
        
        if name is None:
            name = os.path.splitext(os.path.split(filename)[1])[0]
        
        schema = namedtuple(name, header)
        
        return [schema(*row) for row in reader]
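
As a rough illustration of what the helper produces, the sketch below applies the same header normalization to an in-memory CSV. The column names and values are made up for demonstration and are not taken from the real data files.

```python
import csv
import io
from collections import namedtuple

# Made-up sample data; the real files live under ./data/
sample = io.StringIO("Year,Birth Number\n2001,100\n2002,200\n")
reader = csv.reader(sample)

# Same normalization as namedtuples_from_csv: lowercase, strip, spaces -> underscores
header = [col.lower().strip().replace(' ', '_') for col in next(reader)]
schema = namedtuple('sample', header)
rows = [schema(*row) for row in reader]

# Fields are read by name instead of by index; values stay strings
first_birth_number = rows[0].birth_number
```

Note that every value is still a string at this point, which is why later cells wrap fields in int() or float() before doing arithmetic.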

Load Name Data


In [3]:
national_name_data = []
for filename in glob('./data/nationaldata/*.txt'):
    national_name_data.append(namedtuples_from_csv(filename, header=['name', 'gender', 'count']))

Now, let's drop any data from before 1910 (we assume we only care about people who may still be living) and convert the rest to a unified dict structure keyed by year. We will only use data for the years 1910-2014.

Rows that come from the national-level SSA files are tagged with the state 'NATIONAL'.


In [4]:
data_holder = namedtuple('data_holder', ['state', 'name', 'gender', 'count'])
yearly_name_data = defaultdict(list)

for year_of_data in national_name_data:
    # The namedtuple type is named after its source file, whose name ends in the year
    year = type(year_of_data[0]).__name__[-4:]
    year = int(year)
    if year < 1910:
        continue
    
    for row in year_of_data:
        yearly_name_data[year].append(data_holder('NATIONAL', row.name, row.gender, int(row.count)))
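
The year lookup in this cell relies on each namedtuple type being named after its source file. Here is a minimal sketch of that derivation, assuming the SSA files follow a yobYYYY.txt naming pattern (the example path is hypothetical):

```python
import os

filename = './data/nationaldata/yob1910.txt'  # hypothetical example path

# Same expression namedtuples_from_csv uses to build the default type name
type_name = os.path.splitext(os.path.split(filename)[1])[0]

# The last four characters of the type name are taken to be the year
year = int(type_name[-4:])
```

If a file name did not end in a four-digit year, int() would raise a ValueError, so this approach depends on the naming convention holding for every file in the directory.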

Load Birth Data


In [5]:
birth_data = namedtuples_from_csv('./data/births.csv')

Cheat and use 2013 birth data for 2014.


In [6]:
births2013 = [x for x in birth_data if x.year == '2013'][0]
birth_data.append(births2013._replace(year='2014'))  # copy the 2013 row, changing only the year

Load Death Data

These data cover 1900-2010. Each column gives the probability that a person born in the year row[0] will not survive to see the next year. Thus, row[1] represents the infant mortality rate, row[2] represents the probability that a 1-year-old won't reach age 2, etc.


In [7]:
female_death_data = namedtuples_from_csv('./data/death/DeathProbsE_F_Hist_TR2014.csv', prefix='year')
male_death_data = namedtuples_from_csv('./data/death/DeathProbsE_M_Hist_TR2014.csv', prefix='year')
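
The prefix argument is what makes these headers usable: namedtuple field names must be valid identifiers, so purely numeric column names are rejected. The sketch below assumes the header has the shape Year, 0, 1, ... (an assumption about the file layout, inferred from the yearyear and year0 field names used in the next cell) and uses made-up values.

```python
from collections import namedtuple

header = ['Year', '0', '1']  # assumed header shape

# Without a prefix, a numeric column name is not a valid field name
try:
    namedtuple('death_probs', [col.lower() for col in header])
    bare_header_ok = True
except ValueError:
    bare_header_ok = False

# With prefix='year', 'Year' becomes 'yearyear' and '0' becomes 'year0'
fields = ['year' + col.lower().strip().replace(' ', '_') for col in header]
schema = namedtuple('death_probs', fields)
row = schema('1910', '0.1', '0.02')  # made-up values
```

This also explains the slightly odd-looking data.yearyear attribute: it is simply the Year column with the prefix applied.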

In [8]:
death_data_holder = namedtuple('death_data_holder', ['gender'] + ['year{}'.format(y) for y in range(0, 119+1)])
yearly_death_data = defaultdict(list)
for data in female_death_data:
    yearly_death_data[int(data.yearyear)].append(death_data_holder(*['F'] + list(data)[1:]))
for data in male_death_data:
    yearly_death_data[int(data.yearyear)].append(death_data_holder(*['M'] + list(data)[1:]))

In addition, let's cheat a bit, since we don't have data for 2011-2014: we reuse the 2010 data for these years as well.


In [9]:
for fake_year in range(2011, 2014+1):
    yearly_death_data[fake_year] = yearly_death_data[2010]

Interpolate Age Data Based on Birth Counts

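Motivation: SSA registration was not mandatory until the 1930s in some states, and even now it is spotty. To compensate, each year's name counts are scaled so that they add up to the total number of births recorded for that year.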


In [10]:
birth_data_dict = {}
for year_data in birth_data:
    birth_data_dict[int(year_data.year)] = int(year_data.birth_number)

In [11]:
prorated_data_holder = namedtuple('prorated_data_holder', ['state', 'name', 'gender', 'count', 'count_unadj'])
prorated_data = defaultdict(list)
for year, values in yearly_name_data.items():
    if year not in birth_data_dict or year > 2013:
        continue
    
    total_named_births = sum(x.count for x in values if x.state == 'NATIONAL')
    total_births = birth_data_dict.get(year)
    multiply_factor = total_births / total_named_births
    
    for row in values:
        if row.state != 'NATIONAL':
            continue
        prorated_data[year].append(prorated_data_holder('NATIONAL',
                                                        row.name,
                                                        row.gender,
                                                        round(multiply_factor * row.count),
                                                        row.count))
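
To make the scaling concrete, here is a small worked example with made-up numbers: if the name files account for 80 births in a year for which 100 births were recorded, every name count is multiplied by 1.25.

```python
# Made-up counts for a single year
name_counts = {'Alice': 48, 'Bob': 32}
total_births = 100  # hypothetical number of recorded births

total_named_births = sum(name_counts.values())
multiply_factor = total_births / total_named_births

# Scale each name's count and round to a whole number of people
prorated = {name: round(multiply_factor * count)
            for name, count in name_counts.items()}
```

Because each count is rounded independently, the prorated counts for a real year will not necessarily sum exactly to the recorded number of births.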

Map Actuarial Data to SSA Name Data

Transform the data to a dict keyed on name and gender, where each value maps a year from 1910-2013 to a tuple of (adjusted count, unadjusted count).


In [12]:
transform_key = namedtuple('transform_key', ['name', 'gender'])
transformed_data = defaultdict(dict)
for year, values in prorated_data.items():
    for row in values:
        key = transform_key(row.name, row.gender)
        transformed_data[key][year] = (row.count, row.count_unadj)

For every name/gender combination:
    For every <year> from 1910 to 2014 (not necessarily in order), get the number of births in that year.
        Then, for each year from <year> up to the target year, apply the death probabilities to estimate how many people with that name/gender are still alive.

This calculation will return the number of people alive at the start of 2016.
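
The survival step is a running product: the chance of reaching the target year is the product of (1 - death probability) over each age lived through. Here is a minimal sketch with made-up probabilities (the helper name survival_chance_for is hypothetical and not part of this notebook):

```python
def survival_chance_for(death_probs, years_lived):
    """Multiply the yearly survival probabilities for the first years_lived ages."""
    chance = 1.0
    for death_chance in death_probs[:years_lived]:
        chance *= 1 - death_chance
    return chance

# Made-up death probabilities for ages 0, 1 and 2
death_probs = [0.1, 0.5, 0.0]

# Chance of surviving ages 0 and 1: 0.9 * 0.5
chance = survival_chance_for(death_probs, 2)

# Expected number still alive out of 1000 births
alive = round(1000 * chance)
```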


In [13]:
target_year = 2016
yearly_alive_data = defaultdict(dict)

for compound_key, yearly_birth_values in transformed_data.items():
    name = compound_key.name
    gender = compound_key.gender
    
    for birth_year, births_tuple in yearly_birth_values.items():
        births = births_tuple[0]
        births_unadjusted = births_tuple[1]
        # Get death data for people born in birth_year
        death_data = None
        for gender_value in yearly_death_data[birth_year]:
            if gender_value.gender == gender:
                death_data = gender_value

        # Calculate number born in birth_year still alive in target_year
        survival_chance = 1
        for year_offset in range(target_year-birth_year): # exclusive end point
            death_chance = float(death_data[year_offset+1]) # +1 skips the leading gender field
            survival_chance *= 1 - death_chance
    
        # Thing to export
        yearly_alive_data[compound_key][birth_year] = {
            'born': births,
            'born_unadjusted': births_unadjusted,
            'alive': round(births * survival_chance),
            'alive_chance': survival_chance,
        }

Round Values & Save to JSON

We want to save the data so that the following access pattern is possible:

var x = j['data']['Oliver']['M']['2014']
var born2014 = x['born']
var alive2016 = x['alive']
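
One detail of this access pattern is that the year is looked up as a string. Python's json module converts integer dict keys to strings when serializing, so birth years that are ints in the notebook come back as strings after a round trip. A small sketch with made-up values:

```python
import json
from collections import defaultdict

data = defaultdict(dict)
data['Oliver']['M'] = {2013: {'born': 10, 'alive': 9}}  # made-up values

# Integer keys are serialized as strings
text = json.dumps({'data': data})
loaded = json.loads(text)

# After loading, the year must be looked up as '2013', not 2013
born = loaded['data']['Oliver']['M']['2013']['born']
```

This is also why the output is nested by name and then gender: JSON object keys must be strings, so the (name, gender) namedtuple keys used earlier could not be written directly.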

In [14]:
to_write = {
    'start_year': 1910,
    'end_year': 2014,
    'alive_year': target_year,
    'data': defaultdict(dict),
    'count': 0,
}

for compound_key, data in yearly_alive_data.items():
    to_write['data'][compound_key.name][compound_key.gender] = data
to_write['count'] = len(to_write['data'])

In [15]:
with open('alive_data.json', 'w') as f:
    json.dump(to_write, f)

In [16]:
csv_schema = ['name', 'gender', 'year', 'born', 'born_unadjusted', 'alive', 'alive_chance']
with open('alive_data.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(csv_schema)
    for compound_key, data in yearly_alive_data.items():
        for year, v in data.items():
            writer.writerow([compound_key.name, compound_key.gender, year,
                             v['born'], v['born_unadjusted'], v['alive'], v['alive_chance']])
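
With the per-year 'alive' counts in hand, the median age mentioned in the overview can be estimated as a weighted median. The sketch below is not part of the original notebook: the function name median_age is hypothetical, the counts are made up, and age is approximated as target year minus birth year.

```python
def median_age(alive_by_birth_year, target_year):
    """Return the approximate median age of the living, or None if nobody is alive."""
    total = sum(alive_by_birth_year.values())
    if total == 0:
        return None

    running = 0
    # Walk from the youngest cohort (latest birth year) to the oldest
    for birth_year in sorted(alive_by_birth_year, reverse=True):
        running += alive_by_birth_year[birth_year]
        # Stop once at least half of the living population has been covered
        if running * 2 >= total:
            return target_year - birth_year

# Made-up 'alive' counts keyed by birth year
example = {1990: 10, 2000: 30, 2010: 10}
age = median_age(example, 2016)
```

For a real name, alive_by_birth_year could be built from yearly_alive_data by taking the 'alive' value of each birth year.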