Goals

  • learn from how PfDA treats the baby names database
  • focus on a specific problem with the baby names database: compute the change in the sex of persons with given names

In [1]:
%pylab --no-import-all inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import matplotlib.pyplot as plt
import numpy as np

from pylab import figure, show

from pandas import DataFrame, Series
import pandas as pd

Preliminaries: Assumed location of pydata-book files

To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from

https://github.com/pydata/pydata-book

in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"

and then symbolically linked (ln -s) to the the pydata-book from the root directory of the working-open-data folder. i.e., on OS X

cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book

That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.

With this arrangment, I should then be able to drop your notebook into my own notebooks directory and run them without having to mess around with paths.


In [3]:
import os

NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")

assert os.path.exists(NAMES_DIR)

Please make sure the above assertion works.

Baby names dataset

discussed in p. 35 of PfDA book

To download all the data, including that for 2011 and 2012: Popular Baby Names --> includes state by state data.

Start with one year 1880


In [4]:
# take a look at some lines from each of the names files

import os
os.path.join(NAMES_DIR,'yob1880.txt')
yob1880_path = os.path.join(NAMES_DIR,'yob1880.txt')
!head $yob1880_path












In [5]:
# create a DataFrame for 1880 data

import pandas as pd
import codecs

names1880_file = codecs.open(os.path.join(NAMES_DIR,'yob1880.txt'), encoding='iso-8859-1')
names1880 = pd.read_csv(names1880_file, names=['name', 'sex', 'births'])

names1880.head()


Out[5]:
name sex births
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746

5 rows × 3 columns


In [6]:
# how many females represented in the 1880 data?

names1880[names1880.sex=='F']['births'].sum()


Out[6]:
90993

In [7]:
# number of names in 1880 file

len(names1880.groupby('name'))


Out[7]:
1889

In [8]:
# group by name to find out which names are both M and F -- by looking at which names have more 
# than one sex represented.

name_count = names1880.groupby('name').apply(lambda s: len(s))
set(name_count[name_count > 1].index)


Out[8]:
{'Addie',
 'Allie',
 'Alma',
 'Alpha',
 'Alva',
 'Anna',
 'Annie',
 'Arthur',
 'Artie',
 'Augusta',
 'Augustine',
 'Bennie',
 'Bertie',
 'Bird',
 'Carey',
 'Carrie',
 'Cecil',
 'Charles',
 'Clair',
 'Clara',
 'Clare',
 'Clarence',
 'Claude',
 'Cleo',
 'Clifford',
 'Clyde',
 'Connie',
 'Cora',
 'Daisy',
 'Dee',
 'Dell',
 'Dora',
 'Eddie',
 'Edgar',
 'Edith',
 'Eliza',
 'Elizabeth',
 'Elsie',
 'Emma',
 'Erie',
 'Ethel',
 'Fay',
 'Flora',
 'Florence',
 'Frances',
 'Francis',
 'Frank',
 'Fred',
 'Freddie',
 'George',
 'Glenn',
 'Grace',
 'Hallie',
 'Harry',
 'Hattie',
 'Henry',
 'Hope',
 'Ida',
 'Ira',
 'Ivy',
 'James',
 'Jean',
 'Jennie',
 'Jesse',
 'Jessie',
 'Jewel',
 'Jimmie',
 'Joe',
 'John',
 'Johnnie',
 'Joseph',
 'June',
 'Katherine',
 'Lacy',
 'Lee',
 'Leo',
 'Leslie',
 'Lillie',
 'Lonnie',
 'Lou',
 'Louie',
 'Louis',
 'Lynn',
 'Marion',
 'Mary',
 'Mattie',
 'Merle',
 'Minnie',
 'Myrtle',
 'Nellie',
 'Nora',
 'Oda',
 'Ollie',
 'Ora',
 'Ossie',
 'Pearl',
 'Ray',
 'Robert',
 'Rose',
 'Ruby',
 'Sammie',
 'Shirley',
 'Sidney',
 'Theo',
 'Thomas',
 'Tommie',
 'Vernie',
 'Vivian',
 'Walter',
 'William',
 'Willie'}

In [9]:
# number of births by sex for 1880

names1880.groupby('sex').sum()


Out[9]:
births
sex
F 90993
M 110493

2 rows × 1 columns


In [10]:
# total number of births in 1880

names1880['births'].sum()


Out[10]:
201486

In [11]:
# sort by number of births to get most popular names 

names1880.sort('births', ascending=False)[:10]


Out[11]:
name sex births
942 John M 9655
943 William M 9533
0 Mary F 7065
944 James M 5927
945 Charles M 5348
946 George M 5126
947 Frank M 3242
948 Joseph M 2632
1 Anna F 2604
949 Thomas M 2534

10 rows × 3 columns


In [12]:
# most popular female names

names1880[names1880.sex == 'F'].sort('births', ascending=False)[:10]


Out[12]:
name sex births
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746
5 Margaret F 1578
6 Ida F 1472
7 Alice F 1414
8 Bertha F 1320
9 Sarah F 1288

10 rows × 3 columns


In [16]:
# try out seaborn if you want
#import seaborn as sns

In [14]:
num_names1880 = len(names1880['births'].order(ascending=False))
plt.plot(np.arange(num_names1880), names1880['births'].order(ascending=False), 'ro', ms=1)
plt.yscale('log')
plt.xlabel('order of name')
plt.ylabel('number of babies')


Out[14]:
<matplotlib.text.Text at 0x793f7b0>

Loading all data into Pandas


In [17]:
!ls $NAMES_DIR


NationalReadMe.pdf yob1896.txt        yob1913.txt        yob1930.txt        yob1947.txt        yob1964.txt        yob1981.txt        yob1998.txt
yob1880.txt        yob1897.txt        yob1914.txt        yob1931.txt        yob1948.txt        yob1965.txt        yob1982.txt        yob1999.txt
yob1881.txt        yob1898.txt        yob1915.txt        yob1932.txt        yob1949.txt        yob1966.txt        yob1983.txt        yob2000.txt
yob1882.txt        yob1899.txt        yob1916.txt        yob1933.txt        yob1950.txt        yob1967.txt        yob1984.txt        yob2001.txt
yob1883.txt        yob1900.txt        yob1917.txt        yob1934.txt        yob1951.txt        yob1968.txt        yob1985.txt        yob2002.txt
yob1884.txt        yob1901.txt        yob1918.txt        yob1935.txt        yob1952.txt        yob1969.txt        yob1986.txt        yob2003.txt
yob1885.txt        yob1902.txt        yob1919.txt        yob1936.txt        yob1953.txt        yob1970.txt        yob1987.txt        yob2004.txt
yob1886.txt        yob1903.txt        yob1920.txt        yob1937.txt        yob1954.txt        yob1971.txt        yob1988.txt        yob2005.txt
yob1887.txt        yob1904.txt        yob1921.txt        yob1938.txt        yob1955.txt        yob1972.txt        yob1989.txt        yob2006.txt
yob1888.txt        yob1905.txt        yob1922.txt        yob1939.txt        yob1956.txt        yob1973.txt        yob1990.txt        yob2007.txt
yob1889.txt        yob1906.txt        yob1923.txt        yob1940.txt        yob1957.txt        yob1974.txt        yob1991.txt        yob2008.txt
yob1890.txt        yob1907.txt        yob1924.txt        yob1941.txt        yob1958.txt        yob1975.txt        yob1992.txt        yob2009.txt
yob1891.txt        yob1908.txt        yob1925.txt        yob1942.txt        yob1959.txt        yob1976.txt        yob1993.txt        yob2010.txt
yob1892.txt        yob1909.txt        yob1926.txt        yob1943.txt        yob1960.txt        yob1977.txt        yob1994.txt
yob1893.txt        yob1910.txt        yob1927.txt        yob1944.txt        yob1961.txt        yob1978.txt        yob1995.txt
yob1894.txt        yob1911.txt        yob1928.txt        yob1945.txt        yob1962.txt        yob1979.txt        yob1996.txt
yob1895.txt        yob1912.txt        yob1929.txt        yob1946.txt        yob1963.txt        yob1980.txt        yob1997.txt

In [18]:
# 2010 is the last available year in the pydata-book repo
import os

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

# why floats?  I'm not sure.
names.describe()


Out[18]:
births year
count 1690784.000000 1690784.000000
mean 190.682386 1969.454384
std 1615.899711 32.823526
min 5.000000 1880.000000
25% 7.000000 1946.000000
50% 12.000000 1979.000000
75% 32.000000 1997.000000
max 99651.000000 2010.000000

8 rows × 2 columns


In [19]:
# how many people, names, males and females  represented in names?

names.births.sum()


Out[19]:
322402727

In [20]:
# F vs M

names.groupby('sex')['births'].sum()


Out[20]:
sex
F      159990140
M      162412587
Name: births, dtype: int64

In [21]:
# total number of names

len(names.groupby('name'))


Out[21]:
88496

In [22]:
# use pivot_table to collect records by year (rows) and sex (columns)

total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births.head()


Out[22]:
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445

5 rows × 2 columns


In [23]:
# You can use groupy to get equivalent pivot_table calculation

names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births']


Out[23]:
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445
1885 133056 107802
1886 144538 110785
1887 145983 101412
1888 178631 120857
1889 178369 110590
1890 190377 111026
1891 185486 101198
1892 212350 122038
1893 212908 112319
1894 222923 115775
1895 233632 117398
1896 237924 119575
1897 234199 112760
1898 258771 122703
1899 233022 106218
1900 299873 150554
1901 239351 106478
1902 264079 122660
1903 261976 119240
1904 275375 128129
1905 291641 132319
1906 295301 133159
1907 318558 146838
1908 334277 154339
1909 347191 163983
1910 396416 194198
1911 418180 225936
1912 557939 429926
1913 624317 512482
1914 761376 654746
1915 983824 848647
1916 1044249 890142
1917 1081194 925512
1918 1157585 1013720
1919 1130149 980215
1920 1198214 1064468
1921 1232845 1101374
1922 1200796 1088380
1923 1206239 1096227
1924 1248821 1132671
1925 1217217 1115798
1926 1185078 1110440
1927 1192207 1126259
1928 1152836 1107113
1929 1116284 1074833
1930 1125521 1096663
1931 1064233 1038586
1932 1066930 1043512
1933 1007523 990677
1934 1043879 1031962
1935 1048264 1040649
1936 1040068 1036662
1937 1063722 1065964
1938 1103173 1108480
1939 1096394 1106328
... ...

131 rows × 2 columns


In [24]:
# how to calculate the total births / year

names.groupby('year').sum().plot(title="total births by year")


Out[24]:
<matplotlib.axes.AxesSubplot at 0xf5e8b50>

In [25]:
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].plot(title="births (M/F) by year")


Out[25]:
<matplotlib.axes.AxesSubplot at 0xd731c30>

In [26]:
# number of names in 1880
#names.groupby('year').apply(lambda s: s.groupby('name').agg('count'))
#len(names1880.groupby('name').agg('count'))
len(names1880.groupby('name'))


Out[26]:
1889

In [27]:
# can groupby more than one column
# 131 years x 2 sexes

len(names.groupby(['year', 'sex']))


Out[27]:
262

In [28]:
# how many combo of name x year

len(names.groupby(['name','year']))


Out[28]:
1540017

In [ ]:
# from book: add prop to names

def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)

In [ ]:
# verify prop --> all adds up to 1

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)

In [ ]:
# number of records in full names dataframe

len(names)

In [ ]:
#  from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply

names.groupby(['year', 'sex']).apply(lambda g: g.sort_index(by='births', ascending=False)[:1000])

In [ ]:
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()

In [ ]:
# Do pivot table: row: year and cols= names for top 1000

top_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=np.sum)
top_births.tail()

In [ ]:
# is your name in the top_births list?

top_births['Raymond'].plot(title='plot for Raymond')

In [ ]:
# for Aaden, which shows up at the end

top_births.Aaden.plot(xlim=[1880,2010])

In [ ]:
# number of names represented in top_births

len(top_births.columns)

In [ ]:
# how to get the most popular name of all time in top_births?

most_common_names = top_births.sum()
most_common_names.sort(ascending=False)

most_common_names.head()

In [ ]:
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))

Starts and Ends of Names

We go from 1880 to 2010.

It might be helpful to calculate a cumulative sum for all names...

  • For a name start to happen --> look for the the first non-zero value
  • For a name death to happen, the cumulative sum has to reach max.

In [ ]:
# replace n/a with 0 -- might not actually need to do this

top_births = top_births.fillna(0)

In [ ]:
top_births_cumsum = top_births.apply(lambda s: s.cumsum(), axis=0)

In [ ]:
def start_year(s):
    active_years = s.index[s > 0]
    if len(active_years):
        return active_years[0]
    else:
        return None
    
def end_year(s):
    max_years = s.index[s == s.irow(-1)]
    return max_years[0]

def start_end_years(s):
    active_years = s.index[s > 0]
    max_years = s.index[s == s.irow(-1)]
    return Series({'start': active_years[0] if len(active_years) else None,
                   'end': max_years[0] })
    
    
top_births_cumsum.apply(start_end_years)

start/end calc with whole data set


In [ ]:
# instead of top_birth -- get all_births

all_births = names.pivot_table('births', rows='year', cols='name', aggfunc=sum)

In [ ]:
all_births = all_births.fillna(0)
all_births.tail()

In [ ]:
# set up to do start/end calculation

all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)

In [ ]:
all_births_cumsum.tail()

In [ ]:
def start_end_years(s):
    active_years = s.index[s > 0]
    max_years = s.index[s == s.irow(-1)]
    return Series({'start': active_years[0] if len(active_years) else None,
                   'end': max_years[0] })
    
    
all_start_end = all_births_cumsum.apply(start_end_years)

In [ ]:
# all_start_end.to_pickle('Day_12_Baby_Names_all_start_end.pickle')

In [ ]:
all_start_end.tail()

In [ ]:
vc_start = all_start_end.ix['start'].value_counts()
vc_end = all_start_end.ix['end'].value_counts()

fig = plt.figure()
ax1 = fig.add_subplot(111)
plt.ylim(0,2000)
ax1.scatter(vc_start.index, vc_start, c='b')
ax1.scatter(vc_end.index, vc_end, c='r')
ax1.set_xlabel('year')
ax1.set_xlabel('number of starts/ends')

plt.tight_layout()

plt.show()

Experiments -- don't need to look at closely


In [ ]:
# max / min totals and when? -- awkward -- must be a better way
total_births_sum = names.groupby('year').sum()
max_value = list(total_births_sum.max())[0]
min_value = list(total_births_sum.min())[0]

is_max = total_births_sum.births == max_value
is_min = total_births_sum.births == min_value
is_max[is_max], is_min[is_min]

In [ ]:
# a "derivative" -- when is there great population rate change

total_births_sum.diff().plot()

In [ ]:
# plot multiple names on same plot or as multiple axes

def name_sex_count_in_year(name,sex):
    return names[(names.name==name) & (names.sex==sex)][['year', 'births']].set_index(keys='year')

def name_sex_prop_in_year(name,sex):
    return names[(names.name==name) & (names.sex==sex)][['year', 'prop']].set_index(keys='year')

name_df = DataFrame(index=np.arange(1880,2010))

name_df['Raymond'] = name_sex_count_in_year('Raymond','M')
name_df['Laura'] = name_sex_count_in_year('Laura','F')

name_df.plot()

In [ ]:
# plot proportion instead of absolute births

name_df = DataFrame(index=np.arange(1880,2010))

name_df['Raymond'] = name_sex_prop_in_year('Raymond','M')
name_df['Laura'] = name_sex_prop_in_year('Laura','F')

name_df.plot()

In [ ]:
total_births.plot(title='Total births by sex and year')

In [ ]:
# http://en.wikipedia.org/wiki/Human_sex_ratio
# make an agg figure
fig = figure()

# meaning of 111: http://stackoverflow.com/a/3584933/7782
ax = fig.add_subplot(111)
ax.set_title('Ratio of M to F births')

cum_ratio_by_sex = total_births.M.cumsum() / total_births.F.cumsum()
cum_ratio_by_sex.plot(ax=ax, label="cumulative", color="red")

# add instantaneous ratio

annual_ratio_by_sex = total_births.M / total_births.F
annual_ratio_by_sex.plot(ax=ax, label="annual", color="green")

ax.legend(loc='best')

fig.canvas.draw()

In [ ]:
# number of names over time

names.groupby('year').count()[['name']].plot()

In [ ]:
# first attempt to calculate entropy of names

fig = figure()

# meaning of 111: http://stackoverflow.com/a/3584933/7782
ax = fig.add_subplot(111)
ax.set_title('Entropy of names')

S_male = names[names.sex=='M'].groupby('year').prop.agg(lambda x: sum([-j*np.log(j) for j in x])) # apply(lambda x: -x*log(x))
S_male.plot(ax=ax, label="M", color="blue")

S_female = names[names.sex=='F'].groupby('year').prop.agg(lambda x: sum([-j*np.log(j) for j in x])) # apply(lambda x: -x*log(x))
S_female.plot(ax=ax, label="F", color="red")

ax.legend(loc='best')
ax.set_ylim(0)

fig.canvas.draw()

Names that are both M and F

Goal: start to explore names that have been given to both male and female babies. Is there a general trend to feminization of names? (That is, is it more likely that names start as male names become feminine names than vice versa?)