Goals

learn from how PfDA treats the baby names database
focus on a specific problem with the baby names database: compute the change in the sex of persons with given names



In [1]:

    
%pylab --no-import-all inline









    



Populating the interactive namespace from numpy and matplotlib



In [2]:

    
import matplotlib.pyplot as plt
import numpy as np

from pylab import figure, show

from pandas import DataFrame, Series
import pandas as pd

Preliminaries: Assumed location of pydata-book files

To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from

https://github.com/pydata/pydata-book

in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"

and then symbolically linked (ln -s) to the the pydata-book from the root directory of the working-open-data folder. i.e., on OS X

cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book

That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.

With this arrangment, I should then be able to drop your notebook into my own notebooks directory and run them without having to mess around with paths.



In [3]:

    
import os

NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")

assert os.path.exists(NAMES_DIR)

Please make sure the above assertion works.

Baby names dataset

discussed in p. 35 of PfDA book

To download all the data, including that for 2011 and 2012: Popular Baby Names --> includes state by state data.

Start with one year 1880



In [4]:

    
# take a look at some lines from each of the names files

import os
os.path.join(NAMES_DIR,'yob1880.txt')
yob1880_path = os.path.join(NAMES_DIR,'yob1880.txt')
!head $yob1880_path



In [5]:

    
# create a DataFrame for 1880 data

import pandas as pd
import codecs

names1880_file = codecs.open(os.path.join(NAMES_DIR,'yob1880.txt'), encoding='iso-8859-1')
names1880 = pd.read_csv(names1880_file, names=['name', 'sex', 'births'])

names1880.head()









    Out[5]:






  
    
      
      name
      sex
      births
    
  
  
    
      0
            Mary
       F
       7065
    
    
      1
            Anna
       F
       2604
    
    
      2
            Emma
       F
       2003
    
    
      3
       Elizabeth
       F
       1939
    
    
      4
          Minnie
       F
       1746
    
  

5 rows × 3 columns



In [6]:

    
# how many females represented in the 1880 data?

names1880[names1880.sex=='F']['births'].sum()









    Out[6]:





90993



In [7]:

    
# number of names in 1880 file

len(names1880.groupby('name'))









    Out[7]:





1889



In [8]:

    
# group by name to find out which names are both M and F -- by looking at which names have more 
# than one sex represented.

name_count = names1880.groupby('name').apply(lambda s: len(s))
set(name_count[name_count > 1].index)









    Out[8]:





{'Addie',
 'Allie',
 'Alma',
 'Alpha',
 'Alva',
 'Anna',
 'Annie',
 'Arthur',
 'Artie',
 'Augusta',
 'Augustine',
 'Bennie',
 'Bertie',
 'Bird',
 'Carey',
 'Carrie',
 'Cecil',
 'Charles',
 'Clair',
 'Clara',
 'Clare',
 'Clarence',
 'Claude',
 'Cleo',
 'Clifford',
 'Clyde',
 'Connie',
 'Cora',
 'Daisy',
 'Dee',
 'Dell',
 'Dora',
 'Eddie',
 'Edgar',
 'Edith',
 'Eliza',
 'Elizabeth',
 'Elsie',
 'Emma',
 'Erie',
 'Ethel',
 'Fay',
 'Flora',
 'Florence',
 'Frances',
 'Francis',
 'Frank',
 'Fred',
 'Freddie',
 'George',
 'Glenn',
 'Grace',
 'Hallie',
 'Harry',
 'Hattie',
 'Henry',
 'Hope',
 'Ida',
 'Ira',
 'Ivy',
 'James',
 'Jean',
 'Jennie',
 'Jesse',
 'Jessie',
 'Jewel',
 'Jimmie',
 'Joe',
 'John',
 'Johnnie',
 'Joseph',
 'June',
 'Katherine',
 'Lacy',
 'Lee',
 'Leo',
 'Leslie',
 'Lillie',
 'Lonnie',
 'Lou',
 'Louie',
 'Louis',
 'Lynn',
 'Marion',
 'Mary',
 'Mattie',
 'Merle',
 'Minnie',
 'Myrtle',
 'Nellie',
 'Nora',
 'Oda',
 'Ollie',
 'Ora',
 'Ossie',
 'Pearl',
 'Ray',
 'Robert',
 'Rose',
 'Ruby',
 'Sammie',
 'Shirley',
 'Sidney',
 'Theo',
 'Thomas',
 'Tommie',
 'Vernie',
 'Vivian',
 'Walter',
 'William',
 'Willie'}



In [9]:

    
# number of births by sex for 1880

names1880.groupby('sex').sum()









    Out[9]:






  
    
      
      births
    
    
      sex
      
    
  
  
    
      F
        90993
    
    
      M
       110493
    
  

2 rows × 1 columns



In [10]:

    
# total number of births in 1880

names1880['births'].sum()









    Out[10]:





201486



In [11]:

    
# sort by number of births to get most popular names 

names1880.sort('births', ascending=False)[:10]









    Out[11]:






  
    
      
      name
      sex
      births
    
  
  
    
      942
          John
       M
       9655
    
    
      943
       William
       M
       9533
    
    
      0  
          Mary
       F
       7065
    
    
      944
         James
       M
       5927
    
    
      945
       Charles
       M
       5348
    
    
      946
        George
       M
       5126
    
    
      947
         Frank
       M
       3242
    
    
      948
        Joseph
       M
       2632
    
    
      1  
          Anna
       F
       2604
    
    
      949
        Thomas
       M
       2534
    
  

10 rows × 3 columns



In [12]:

    
# most popular female names

names1880[names1880.sex == 'F'].sort('births', ascending=False)[:10]









    Out[12]:






  
    
      
      name
      sex
      births
    
  
  
    
      0
            Mary
       F
       7065
    
    
      1
            Anna
       F
       2604
    
    
      2
            Emma
       F
       2003
    
    
      3
       Elizabeth
       F
       1939
    
    
      4
          Minnie
       F
       1746
    
    
      5
        Margaret
       F
       1578
    
    
      6
             Ida
       F
       1472
    
    
      7
           Alice
       F
       1414
    
    
      8
          Bertha
       F
       1320
    
    
      9
           Sarah
       F
       1288
    
  

10 rows × 3 columns



In [16]:

    
# try out seaborn if you want
#import seaborn as sns



In [14]:

    
num_names1880 = len(names1880['births'].order(ascending=False))
plt.plot(np.arange(num_names1880), names1880['births'].order(ascending=False), 'ro', ms=1)
plt.yscale('log')
plt.xlabel('order of name')
plt.ylabel('number of babies')









    Out[14]:





<matplotlib.text.Text at 0x793f7b0>

Loading all data into Pandas



In [17]:

    
!ls $NAMES_DIR









    



NationalReadMe.pdf yob1896.txt        yob1913.txt        yob1930.txt        yob1947.txt        yob1964.txt        yob1981.txt        yob1998.txt
yob1880.txt        yob1897.txt        yob1914.txt        yob1931.txt        yob1948.txt        yob1965.txt        yob1982.txt        yob1999.txt
yob1881.txt        yob1898.txt        yob1915.txt        yob1932.txt        yob1949.txt        yob1966.txt        yob1983.txt        yob2000.txt
yob1882.txt        yob1899.txt        yob1916.txt        yob1933.txt        yob1950.txt        yob1967.txt        yob1984.txt        yob2001.txt
yob1883.txt        yob1900.txt        yob1917.txt        yob1934.txt        yob1951.txt        yob1968.txt        yob1985.txt        yob2002.txt
yob1884.txt        yob1901.txt        yob1918.txt        yob1935.txt        yob1952.txt        yob1969.txt        yob1986.txt        yob2003.txt
yob1885.txt        yob1902.txt        yob1919.txt        yob1936.txt        yob1953.txt        yob1970.txt        yob1987.txt        yob2004.txt
yob1886.txt        yob1903.txt        yob1920.txt        yob1937.txt        yob1954.txt        yob1971.txt        yob1988.txt        yob2005.txt
yob1887.txt        yob1904.txt        yob1921.txt        yob1938.txt        yob1955.txt        yob1972.txt        yob1989.txt        yob2006.txt
yob1888.txt        yob1905.txt        yob1922.txt        yob1939.txt        yob1956.txt        yob1973.txt        yob1990.txt        yob2007.txt
yob1889.txt        yob1906.txt        yob1923.txt        yob1940.txt        yob1957.txt        yob1974.txt        yob1991.txt        yob2008.txt
yob1890.txt        yob1907.txt        yob1924.txt        yob1941.txt        yob1958.txt        yob1975.txt        yob1992.txt        yob2009.txt
yob1891.txt        yob1908.txt        yob1925.txt        yob1942.txt        yob1959.txt        yob1976.txt        yob1993.txt        yob2010.txt
yob1892.txt        yob1909.txt        yob1926.txt        yob1943.txt        yob1960.txt        yob1977.txt        yob1994.txt
yob1893.txt        yob1910.txt        yob1927.txt        yob1944.txt        yob1961.txt        yob1978.txt        yob1995.txt
yob1894.txt        yob1911.txt        yob1928.txt        yob1945.txt        yob1962.txt        yob1979.txt        yob1996.txt
yob1895.txt        yob1912.txt        yob1929.txt        yob1946.txt        yob1963.txt        yob1980.txt        yob1997.txt



In [18]:

    
# 2010 is the last available year in the pydata-book repo
import os

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

# why floats?  I'm not sure.
names.describe()









    Out[18]:






  
    
      
      births
      year
    
  
  
    
      count
       1690784.000000
       1690784.000000
    
    
      mean
           190.682386
          1969.454384
    
    
      std
          1615.899711
            32.823526
    
    
      min
             5.000000
          1880.000000
    
    
      25%
             7.000000
          1946.000000
    
    
      50%
            12.000000
          1979.000000
    
    
      75%
            32.000000
          1997.000000
    
    
      max
         99651.000000
          2010.000000
    
  

8 rows × 2 columns



In [19]:

    
# how many people, names, males and females  represented in names?

names.births.sum()









    Out[19]:





322402727



In [20]:

    
# F vs M

names.groupby('sex')['births'].sum()









    Out[20]:





sex
F      159990140
M      162412587
Name: births, dtype: int64



In [21]:

    
# total number of names

len(names.groupby('name'))









    Out[21]:





88496



In [22]:

    
# use pivot_table to collect records by year (rows) and sex (columns)

total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births.head()









    Out[22]:






  
    
      sex
      F
      M
    
    
      year
      
      
    
  
  
    
      1880
        90993
       110493
    
    
      1881
        91955
       100748
    
    
      1882
       107851
       113687
    
    
      1883
       112322
       104632
    
    
      1884
       129021
       114445
    
  

5 rows × 2 columns



In [23]:

    
# You can use groupy to get equivalent pivot_table calculation

names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births']









    Out[23]:






  
    
      sex
      F
      M
    
    
      year
      
      
    
  
  
    
      1880
         90993
        110493
    
    
      1881
         91955
        100748
    
    
      1882
        107851
        113687
    
    
      1883
        112322
        104632
    
    
      1884
        129021
        114445
    
    
      1885
        133056
        107802
    
    
      1886
        144538
        110785
    
    
      1887
        145983
        101412
    
    
      1888
        178631
        120857
    
    
      1889
        178369
        110590
    
    
      1890
        190377
        111026
    
    
      1891
        185486
        101198
    
    
      1892
        212350
        122038
    
    
      1893
        212908
        112319
    
    
      1894
        222923
        115775
    
    
      1895
        233632
        117398
    
    
      1896
        237924
        119575
    
    
      1897
        234199
        112760
    
    
      1898
        258771
        122703
    
    
      1899
        233022
        106218
    
    
      1900
        299873
        150554
    
    
      1901
        239351
        106478
    
    
      1902
        264079
        122660
    
    
      1903
        261976
        119240
    
    
      1904
        275375
        128129
    
    
      1905
        291641
        132319
    
    
      1906
        295301
        133159
    
    
      1907
        318558
        146838
    
    
      1908
        334277
        154339
    
    
      1909
        347191
        163983
    
    
      1910
        396416
        194198
    
    
      1911
        418180
        225936
    
    
      1912
        557939
        429926
    
    
      1913
        624317
        512482
    
    
      1914
        761376
        654746
    
    
      1915
        983824
        848647
    
    
      1916
       1044249
        890142
    
    
      1917
       1081194
        925512
    
    
      1918
       1157585
       1013720
    
    
      1919
       1130149
        980215
    
    
      1920
       1198214
       1064468
    
    
      1921
       1232845
       1101374
    
    
      1922
       1200796
       1088380
    
    
      1923
       1206239
       1096227
    
    
      1924
       1248821
       1132671
    
    
      1925
       1217217
       1115798
    
    
      1926
       1185078
       1110440
    
    
      1927
       1192207
       1126259
    
    
      1928
       1152836
       1107113
    
    
      1929
       1116284
       1074833
    
    
      1930
       1125521
       1096663
    
    
      1931
       1064233
       1038586
    
    
      1932
       1066930
       1043512
    
    
      1933
       1007523
        990677
    
    
      1934
       1043879
       1031962
    
    
      1935
       1048264
       1040649
    
    
      1936
       1040068
       1036662
    
    
      1937
       1063722
       1065964
    
    
      1938
       1103173
       1108480
    
    
      1939
       1096394
       1106328
    
    
      
      ...
      ...
    
  

131 rows × 2 columns



In [24]:

    
# how to calculate the total births / year

names.groupby('year').sum().plot(title="total births by year")









    Out[24]:





<matplotlib.axes.AxesSubplot at 0xf5e8b50>



In [25]:

    
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].plot(title="births (M/F) by year")









    Out[25]:





<matplotlib.axes.AxesSubplot at 0xd731c30>



In [26]:

    
# number of names in 1880
#names.groupby('year').apply(lambda s: s.groupby('name').agg('count'))
#len(names1880.groupby('name').agg('count'))
len(names1880.groupby('name'))









    Out[26]:





1889



In [27]:

    
# can groupby more than one column
# 131 years x 2 sexes

len(names.groupby(['year', 'sex']))









    Out[27]:





262



In [28]:

    
# how many combo of name x year

len(names.groupby(['name','year']))









    Out[28]:





1540017



In [ ]:

    
# from book: add prop to names

def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)



In [ ]:

    
# verify prop --> all adds up to 1

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)



In [ ]:

    
# number of records in full names dataframe

len(names)



In [ ]:

    
#  from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply

names.groupby(['year', 'sex']).apply(lambda g: g.sort_index(by='births', ascending=False)[:1000])



In [ ]:

    
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()



In [ ]:

    
# Do pivot table: row: year and cols= names for top 1000

top_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=np.sum)
top_births.tail()



In [ ]:

    
# is your name in the top_births list?

top_births['Raymond'].plot(title='plot for Raymond')



In [ ]:

    
# for Aaden, which shows up at the end

top_births.Aaden.plot(xlim=[1880,2010])



In [ ]:

    
# number of names represented in top_births

len(top_births.columns)



In [ ]:

    
# how to get the most popular name of all time in top_births?

most_common_names = top_births.sum()
most_common_names.sort(ascending=False)

most_common_names.head()



In [ ]:

    
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))

Starts and Ends of Names

We go from 1880 to 2010.

It might be helpful to calculate a cumulative sum for all names...

For a name start to happen --> look for the the first non-zero value
For a name death to happen, the cumulative sum has to reach max.



In [ ]:

    
# replace n/a with 0 -- might not actually need to do this

top_births = top_births.fillna(0)



In [ ]:

    
top_births_cumsum = top_births.apply(lambda s: s.cumsum(), axis=0)



In [ ]:

    
def start_year(s):
    active_years = s.index[s > 0]
    if len(active_years):
        return active_years[0]
    else:
        return None
    
def end_year(s):
    max_years = s.index[s == s.irow(-1)]
    return max_years[0]

def start_end_years(s):
    active_years = s.index[s > 0]
    max_years = s.index[s == s.irow(-1)]
    return Series({'start': active_years[0] if len(active_years) else None,
                   'end': max_years[0] })
    
    
top_births_cumsum.apply(start_end_years)

start/end calc with whole data set



In [ ]:

    
# instead of top_birth -- get all_births

all_births = names.pivot_table('births', rows='year', cols='name', aggfunc=sum)



In [ ]:

    
all_births = all_births.fillna(0)
all_births.tail()



In [ ]:

    
# set up to do start/end calculation

all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)



In [ ]:

    
all_births_cumsum.tail()



In [ ]:

    
def start_end_years(s):
    active_years = s.index[s > 0]
    max_years = s.index[s == s.irow(-1)]
    return Series({'start': active_years[0] if len(active_years) else None,
                   'end': max_years[0] })
    
    
all_start_end = all_births_cumsum.apply(start_end_years)



In [ ]:

    
# all_start_end.to_pickle('Day_12_Baby_Names_all_start_end.pickle')



In [ ]:

    
all_start_end.tail()



In [ ]:

    
vc_start = all_start_end.ix['start'].value_counts()
vc_end = all_start_end.ix['end'].value_counts()

fig = plt.figure()
ax1 = fig.add_subplot(111)
plt.ylim(0,2000)
ax1.scatter(vc_start.index, vc_start, c='b')
ax1.scatter(vc_end.index, vc_end, c='r')
ax1.set_xlabel('year')
ax1.set_xlabel('number of starts/ends')

plt.tight_layout()

plt.show()

Experiments -- don't need to look at closely



In [ ]:

    
# max / min totals and when? -- awkward -- must be a better way
total_births_sum = names.groupby('year').sum()
max_value = list(total_births_sum.max())[0]
min_value = list(total_births_sum.min())[0]

is_max = total_births_sum.births == max_value
is_min = total_births_sum.births == min_value
is_max[is_max], is_min[is_min]



In [ ]:

    
# a "derivative" -- when is there great population rate change

total_births_sum.diff().plot()



In [ ]:

    
# plot multiple names on same plot or as multiple axes

def name_sex_count_in_year(name,sex):
    return names[(names.name==name) & (names.sex==sex)][['year', 'births']].set_index(keys='year')

def name_sex_prop_in_year(name,sex):
    return names[(names.name==name) & (names.sex==sex)][['year', 'prop']].set_index(keys='year')

name_df = DataFrame(index=np.arange(1880,2010))

name_df['Raymond'] = name_sex_count_in_year('Raymond','M')
name_df['Laura'] = name_sex_count_in_year('Laura','F')

name_df.plot()



In [ ]:

    
# plot proportion instead of absolute births

name_df = DataFrame(index=np.arange(1880,2010))

name_df['Raymond'] = name_sex_prop_in_year('Raymond','M')
name_df['Laura'] = name_sex_prop_in_year('Laura','F')

name_df.plot()



In [ ]:

    
total_births.plot(title='Total births by sex and year')



In [ ]:

    
# http://en.wikipedia.org/wiki/Human_sex_ratio
# make an agg figure
fig = figure()

# meaning of 111: http://stackoverflow.com/a/3584933/7782
ax = fig.add_subplot(111)
ax.set_title('Ratio of M to F births')

cum_ratio_by_sex = total_births.M.cumsum() / total_births.F.cumsum()
cum_ratio_by_sex.plot(ax=ax, label="cumulative", color="red")

# add instantaneous ratio

annual_ratio_by_sex = total_births.M / total_births.F
annual_ratio_by_sex.plot(ax=ax, label="annual", color="green")

ax.legend(loc='best')

fig.canvas.draw()



In [ ]:

    
# number of names over time

names.groupby('year').count()[['name']].plot()



In [ ]:

    
# first attempt to calculate entropy of names

fig = figure()

# meaning of 111: http://stackoverflow.com/a/3584933/7782
ax = fig.add_subplot(111)
ax.set_title('Entropy of names')

S_male = names[names.sex=='M'].groupby('year').prop.agg(lambda x: sum([-j*np.log(j) for j in x])) # apply(lambda x: -x*log(x))
S_male.plot(ax=ax, label="M", color="blue")

S_female = names[names.sex=='F'].groupby('year').prop.agg(lambda x: sum([-j*np.log(j) for j in x])) # apply(lambda x: -x*log(x))
S_female.plot(ax=ax, label="F", color="red")

ax.legend(loc='best')
ax.set_ylim(0)

fig.canvas.draw()

Names that are both M and F

Goal: start to explore names that have been given to both male and female babies. Is there a general trend to feminization of names? (That is, is it more likely that names start as male names become feminine names than vice versa?)

	name	sex	births
942	John	M	9655
943	William	M	9533
0	Mary	F	7065
944	James	M	5927
945	Charles	M	5348
946	George	M	5126
947	Frank	M	3242
948	Joseph	M	2632
1	Anna	F	2604
949	Thomas	M	2534

	births	year
count	1690784.000000	1690784.000000
mean	190.682386	1969.454384
std	1615.899711	32.823526
min	5.000000	1880.000000
25%	7.000000	1946.000000
50%	12.000000	1979.000000
75%	32.000000	1997.000000
max	99651.000000	2010.000000

sex	F	M
year
1880	90993	110493
1881	91955	100748
1882	107851	113687
1883	112322	104632
1884	129021	114445