This notebook is an attempt to recreate the dataset used in FiveThirtyEight's article How to Tell Someone’s Age When All You Know Is Her Name.
The gist of the article is that, by using a combination of the datasets listed in the table below, you can generate a distribution of the popularity of a name. The end goal is to find the likely median age or relative popularity of a name after somebody with that name comes into the spotlight.
First, we import all the stuff we'll end up using. (These are also defined in the requirements.txt file in the repository.)
In [1]:
import csv
import json
import os
from collections import namedtuple, defaultdict
from glob import glob
Name | Source | Data Range | URL | License |
---|---|---|---|---|
National name data | SSA | 1910-2014 | https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data | cc-zero |
Death probabilities | SSA | 1900-2010 | https://catalog.data.gov/dataset/death-probabilities-for-males-1900-2010 and https://catalog.data.gov/dataset/death-probabilities-for-females-1900-2010 | cc-zero |
Birth rates | NCHS | 1909-2013 | https://catalog.data.gov/dataset/births-and-general-fertility-rates-united-states-1909-2013 | public domain |
We use this convenience function to convert the CSV data to namedtuples, which makes it a lot easier to handle queries on the data. Using this method, you don't have to memorize the indexes of the columns you want.
In [2]:
def namedtuples_from_csv(filename, name=None, header=None, prefix=''):
    with open(filename) as f:
        reader = csv.reader(f)
        if header is None:
            header = next(reader)
        header = [prefix + col.lower().strip().replace(' ', '_') for col in header]
        if name is None:
            name = os.path.splitext(os.path.split(filename)[1])[0]
        schema = namedtuple(name, header)
        return [schema(*row) for row in reader]
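As a quick illustration of why named fields help, here is a minimal sketch that applies the same header normalization to an in-memory CSV. The sample rows and the `sample_births` type name are made up for the example; they are not taken from the real data files.

```python
import csv
import io
from collections import namedtuple

# Made-up sample with a header row; the values are placeholders.
sample = io.StringIO("Year,Birth Number\n1910,100\n1911,200\n")

reader = csv.reader(sample)
# Same normalization as namedtuples_from_csv: lowercase, strip, spaces -> underscores.
header = [col.lower().strip().replace(' ', '_') for col in next(reader)]
SampleRow = namedtuple('sample_births', header)
rows = [SampleRow(*row) for row in reader]

first = rows[0]
# Access by field name instead of by positional index.
print(first.year, first.birth_number)
```

Note that every value stays a string, which is why the later cells call `int(...)` and `float(...)` explicitly.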
In [3]:
national_name_data = []
for filename in glob('./data/nationaldata/*.txt'):
    national_name_data.append(namedtuples_from_csv(filename, header=['name', 'gender', 'count']))
Now, let's rearrange to get rid of any data pre-1910 (we assume we only care about the living) and convert this data to a unified dict structure. We will only use data for the years 1910-2014. Rows from the national-level SSA file are designated 'NATIONAL'.
In [4]:
data_holder = namedtuple('data_holder', ['state', 'name', 'gender', 'count'])
yearly_name_data = defaultdict(list)
for year_of_data in national_name_data:
    year = type(year_of_data[0]).__name__[-4:]
    year = int(year)
    if year < 1910:
        continue
    for row in year_of_data:
        yearly_name_data[year].append(data_holder('NATIONAL', row.name, row.gender, int(row.count)))
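The year lookup in the cell above relies on the namedtuple type name, which `namedtuples_from_csv` derives from the file's base name. The sketch below shows that round trip in isolation. It assumes the national files follow a `yobYYYY.txt` naming pattern, and the row values are made up.

```python
import os
from collections import namedtuple

# Assumption: the national files follow a yobYYYY.txt naming pattern.
filename = './data/nationaldata/yob1910.txt'

# namedtuples_from_csv derives the type name from the file's base name.
typename = os.path.splitext(os.path.split(filename)[1])[0]
Row = namedtuple(typename, ['name', 'gender', 'count'])
row = Row('Mary', 'F', '100')  # made-up count

# The last four characters of the type name are the year.
year = int(type(row).__name__[-4:])
print(typename, year)
```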
In [5]:
birth_data = namedtuples_from_csv('./data/births.csv')
Cheat and use 2013 birth data for 2014.
In [6]:
births2013 = [x for x in birth_data if x.year == '2013'][0]
birth_data.append(type(births2013)('2014', births2013.birth_number, births2013.general_fertility_rate, births2013.crude_birth_rate))
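An alternative way to build the 2014 row, shown here only as a sketch, is the namedtuple `_replace` method, which copies a row and changes just the named fields. The `Births` type and its values below are hypothetical stand-ins for a row of `birth_data`.

```python
from collections import namedtuple

# Hypothetical stand-in for one row of birth_data; values are made up.
Births = namedtuple('births', ['year', 'birth_number',
                               'general_fertility_rate', 'crude_birth_rate'])
births2013 = Births('2013', '100', '1.0', '2.0')

# _replace returns a copy with only the named fields changed.
births2014 = births2013._replace(year='2014')
print(births2014)
```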
These are data from 1900-2010. The columns represent the probability that a person born in the year row[0] will not survive to see the next year. Thus, row[1] represents the infant mortality rate, row[2] represents the probability that a 1-year-old won't turn 2, etc.
In [7]:
female_death_data = namedtuples_from_csv('./data/death/DeathProbsE_F_Hist_TR2014.csv', prefix='year')
male_death_data = namedtuples_from_csv('./data/death/DeathProbsE_M_Hist_TR2014.csv', prefix='year')
In [8]:
death_data_holder = namedtuple('death_data_holder', ['gender'] + ['year{}'.format(y) for y in range(0, 119+1)])
yearly_death_data = defaultdict(list)
for data in female_death_data:
    yearly_death_data[int(data.yearyear)].append(death_data_holder(*['F'] + list(data)[1:]))
for data in male_death_data:
    yearly_death_data[int(data.yearyear)].append(death_data_holder(*['M'] + list(data)[1:]))
In addition, let's cheat a bit: since we don't have death data for 2011-2014, we reuse the 2010 data for those years.
In [9]:
for fake_year in range(2011, 2014+1):
    yearly_death_data[fake_year] = yearly_death_data[2010]
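One thing worth knowing about the assignment above is that the filled-in years share the same list object as 2010. That is harmless here because the lists are only read afterwards, but the sketch below (with placeholder strings instead of real rows) shows the difference between sharing and copying.

```python
from collections import defaultdict

yearly = defaultdict(list)
yearly[2010].append('row-for-2010')  # placeholder for a death_data_holder row

# Plain assignment shares the same list object between years.
yearly[2011] = yearly[2010]
shared = yearly[2011] is yearly[2010]

# list(...) makes an independent shallow copy instead.
yearly[2012] = list(yearly[2010])
yearly[2010].append('another-row')

# The shared list sees the new row; the copy does not.
print(shared, len(yearly[2011]), len(yearly[2012]))
```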
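Motivation: SSA registration was not mandatory until the 1930s in some states, and even now coverage is spotty. To compensate, we scale each year's name counts so that they sum to the total number of births recorded for that year.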
In [10]:
birth_data_dict = {}
for year_data in birth_data:
    birth_data_dict[int(year_data.year)] = int(year_data.birth_number)
In [11]:
prorated_data_holder = namedtuple('prorated_data_holder', ['state', 'name', 'gender', 'count', 'count_unadj'])
prorated_data = defaultdict(list)
for year, values in yearly_name_data.items():
    if year not in birth_data_dict or year > 2013:
        continue
    total_named_births = sum(x.count for x in values if x.state == 'NATIONAL')
    total_births = birth_data_dict.get(year)
    multiply_factor = total_births / total_named_births
    for row in values:
        if row.state != 'NATIONAL':
            continue
        prorated_data[year].append(prorated_data_holder('NATIONAL',
                                                        row.name,
                                                        row.gender,
                                                        round(multiply_factor * row.count),
                                                        row.count))
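To make the prorating step concrete, here is a small worked sketch with made-up numbers: if 1000 births were recorded but only 500 appear in the name data, every name count is scaled by 2.

```python
# Made-up numbers: 1000 recorded births, but only 500 appear in the name data.
total_births = 1000
name_counts = {('Mary', 'F'): 300, ('John', 'M'): 200}

total_named_births = sum(name_counts.values())
multiply_factor = total_births / total_named_births

# Scale each name so the counts add up to the recorded total.
prorated = {key: round(multiply_factor * count)
            for key, count in name_counts.items()}
print(multiply_factor, prorated)
```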
Transform the data to a dict keyed on name and gender, where each value maps a year (1910-2013) to a tuple of the adjusted and unadjusted counts.
In [12]:
transform_key = namedtuple('transform_key', ['name', 'gender'])
transformed_data = defaultdict(dict)
for year, values in prorated_data.items():
    for row in values:
        key = transform_key(row.name, row.gender)
        transformed_data[key][year] = (row.count, row.count_unadj)
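The resulting structure can be pictured with a tiny stand-in. The sketch below builds the same shape from made-up rows and looks up one name/gender pair; the `rows` list is hypothetical.

```python
from collections import namedtuple, defaultdict

transform_key = namedtuple('transform_key', ['name', 'gender'])
transformed = defaultdict(dict)

# Made-up (year, name, gender, adjusted, unadjusted) rows.
rows = [
    (1910, 'Mary', 'F', 600, 300),
    (1911, 'Mary', 'F', 640, 320),
    (1910, 'John', 'M', 400, 200),
]
for year, name, gender, count, count_unadj in rows:
    transformed[transform_key(name, gender)][year] = (count, count_unadj)

mary = transformed[transform_key('Mary', 'F')]
print(sorted(mary))  # years seen for this name/gender
print(mary[1910])    # (adjusted, unadjusted)
```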
For every name/gender combination:
- For every birth year recorded for that combination (not necessarily in order), get the number of births in that year.
- Then, for each year from the birth year up to the target year, apply the death probability to estimate how many people with that name/gender are still alive.

This calculation returns the number of people alive at the start of 2016.
In [13]:
target_year = 2016
yearly_alive_data = defaultdict(dict)
for compound_key, yearly_birth_values in transformed_data.items():
    name = compound_key.name
    gender = compound_key.gender
    for birth_year, births_tuple in yearly_birth_values.items():
        births = births_tuple[0]
        births_unadjusted = births_tuple[1]
        # Get death data for people born in birth_year
        death_data = None
        for gender_value in yearly_death_data[birth_year]:
            if gender_value.gender == gender:
                death_data = gender_value
        # Calculate number born in birth_year still alive in target_year
        survival_chance = 1
        for year_offset in range(target_year-birth_year): # exclusive end point
            death_chance = float(death_data[year_offset+1]) # +1 skips the leading gender column
            survival_chance *= 1 - death_chance
        # Thing to export
        yearly_alive_data[compound_key][birth_year] = {
            'born': births,
            'born_unadjusted': births_unadjusted,
            'alive': round(births * survival_chance),
            'alive_chance': survival_chance,
        }
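The survival calculation in the cell above is a running product of yearly survival probabilities. Here is a minimal sketch with made-up death probabilities for three years of life; the numbers are illustrative only and do not come from the SSA tables.

```python
# Made-up yearly death probabilities for ages 0, 1 and 2.
death_probs = [0.1, 0.01, 0.005]
births = 1000

survival_chance = 1.0
for death_chance in death_probs:
    # Surviving the year means not dying during it.
    survival_chance *= 1 - death_chance

alive = round(births * survival_chance)
print(survival_chance, alive)
```

With these inputs the survival chance is 0.9 × 0.99 × 0.995 ≈ 0.8865, so about 887 of the 1000 are estimated to be alive.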
We want to save the data so that this API is possible:
var x = j['data']['Oliver']['M']['2014']
var born2014 = x['born']
var alive2016 = x['alive']
In [14]:
to_write = {
    'start_year': 1910,
    'end_year': 2014,
    'alive_year': target_year,
    'data': defaultdict(dict),
    'count': 0,
}
for compound_key, data in yearly_alive_data.items():
    to_write['data'][compound_key.name][compound_key.gender] = data
to_write['count'] = len(to_write['data'])
In [15]:
with open('alive_data.json', 'w') as f:
    json.dump(to_write, f)
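One detail to keep in mind when reading the file back: JSON object keys are always strings, so the integer year keys come back as strings. The sketch below round-trips a made-up miniature of the structure in memory to show this.

```python
import json

# Made-up miniature of the exported structure.
to_write = {'data': {'Oliver': {'M': {2013: {'born': 10, 'alive': 9}}}}}

# json.dumps converts the integer year key to a string.
loaded = json.loads(json.dumps(to_write))
years = loaded['data']['Oliver']['M']
entry = years['2013']
print(entry['born'], entry['alive'])
```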
In [16]:
csv_schema = ['name', 'gender', 'year', 'born', 'born_unadjusted', 'alive', 'alive_chance']
with open('alive_data.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(csv_schema)
    for compound_key, data in yearly_alive_data.items():
        for year, v in data.items():
            writer.writerow([compound_key.name, compound_key.gender, year,
                             v['born'], v['born_unadjusted'], v['alive'], v['alive_chance']])
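As a sketch of how the exported CSV might be consumed, the example below reads one made-up row in the same column layout with `csv.DictReader`. It uses an in-memory string rather than the real `alive_data.csv`.

```python
import csv
import io

# Made-up row in the same column layout as alive_data.csv.
text = (
    "name,gender,year,born,born_unadjusted,alive,alive_chance\n"
    "Oliver,M,2013,10,8,9,0.9\n"
)

rows = list(csv.DictReader(io.StringIO(text)))
row = rows[0]
# Every CSV value is read back as a string, so convert explicitly.
born = int(row['born'])
alive_chance = float(row['alive_chance'])
print(row['name'], born, alive_chance)
```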