In this assignment, I analyzed baby names from two main perspectives:
Note: to show that Day_13_C_Baby_Names_MF_Completed is running properly, I added my work at the bottom of this notebook.
In [1]:
%matplotlib inline
In [1]:
import matplotlib.pyplot as plt
import numpy as np
from pylab import figure, show
from pandas import DataFrame, Series
import pandas as pd
In [2]:
try:
    import mpld3
    from mpld3 import enable_notebook
    from mpld3 import plugins
    enable_notebook()
except Exception as e:
    print "Attempt to import and enable mpld3 failed", e
In [3]:
# what would seaborn do?
try:
    import seaborn as sns
except Exception as e:
    print "Attempt to import and enable seaborn failed", e
To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from
https://github.com/pydata/pydata-book
in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"
and then symbolically linked (ln -s) to the pydata-book directory from the root directory of the working-open-data folder, i.e., on OS X:
cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book
That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.
With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths.
In [5]:
import os
NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")
#NAMES_DIR ---> '../pydata-book/ch02/names'
assert os.path.exists(NAMES_DIR)
Please make sure the above assertion works.
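If the assertion fails on your machine, a minimal workaround (a sketch -- the path below is only a placeholder, not a real location) is to point NAMES_DIR directly at your copy of the data:

if not os.path.exists(NAMES_DIR):
    # hypothetical absolute path -- replace with wherever your pydata-book checkout lives
    NAMES_DIR = "/path/to/pydata-book/ch02/names"
    assert os.path.exists(NAMES_DIR)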
This data set is discussed on p. 35 of the PfDA book.
To download all the data, including that for 2011 and 2012, see the Popular Baby Names page, which also includes state-by-state data.
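For reference, a rough sketch of how one might fetch the full national data set programmatically; the URL is my assumption about where the SSA hosts the zip file, so verify it before relying on it:

import io
import urllib2
import zipfile

SSA_NAMES_URL = "http://www.ssa.gov/oact/babynames/names.zip"  # assumed location of the national data
raw = urllib2.urlopen(SSA_NAMES_URL).read()
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    zf.extractall("names_all_years")  # writes the yobYYYY.txt files into a local directory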
In [6]:
# show the first five files in the NAMES_DIR
import glob
glob.glob(NAMES_DIR + "/*")[:5]
Out[6]:
In [7]:
# 2010 is the last available year in the pydata-book repo
import os
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
    # print path
    frame = pd.read_csv(path, names=columns)
    # print frame
    frame['year'] = year
    # print frame
    pieces.append(frame)
# print pieces
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
# why floats? describe() reports summary statistics (mean, std, quartiles), which come back as floats even for integer columns
names.describe()
# names = pd.concat(pieces)
# len(names) -->1690784
Out[7]:
In [8]:
names.head()
Out[8]:
In [9]:
names.births
Out[9]:
In [10]:
# how many people, names, males and females are represented in names?
names.births.sum()
Out[10]:
In [11]:
names.groupby('sex').head()
Out[11]:
In [12]:
# F vs M
names.groupby('sex')['births'].sum()
Out[12]:
In [13]:
grp = names.groupby('name')
In [111]:
#experimenting with groups
# from itertools import islice
# for key, g_df in islice(grp,5):
# print key, type(g_df), g_df.columns, g_df, g_df.sex
In [15]:
# total number of names
len(names.groupby('name'))
Out[15]:
In [16]:
# use pivot_table to collect records by year (rows) and sex (columns)
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births.head()
Out[16]:
In [17]:
names.groupby('year').head()
Out[17]:
In [18]:
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).head()
Out[18]:
In [19]:
# You can use groupby to get the equivalent pivot_table calculation
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].head()
Out[19]:
In [20]:
# how to calculate the total births / year
names.groupby('year').sum().plot(title="total births by year")
Out[20]:
In [21]:
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].plot(title="births (M/F) by year")
Out[21]:
In [110]:
#some more experimentation with groups
# from itertools import islice
# for key, g_df in islice(names.groupby(['year', 'sex']),5):
# print key,g_df
# print key, type(g_df), g_df.columns, g_df, g_df.sex
In [23]:
# from book: add prop to names
def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)
    # print births
    group['prop'] = births / births.sum()
    return group
propped_names = names.groupby(['year', 'sex']).apply(add_prop)
propped_names.head()
Out[23]:
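A sketch of an equivalent way to compute the proportions using transform, which returns a Series aligned with names rather than re-assembling the groups; prop2 is just an illustrative name and not part of the assignment:

births_float = names.births.astype(float)  # cast to float to avoid Python 2 integer division
prop2 = births_float / names.groupby(['year', 'sex']).births.transform(np.sum)
np.allclose(prop2.groupby([names.year, names.sex]).sum(), 1)  # should also be True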
In [24]:
# verify prop --> all adds up to 1
# np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
np.allclose(propped_names.groupby(['year', 'sex']).prop.sum(), 1)
Out[24]:
In [25]:
# number of records in full names dataframe
# len(names) --> 1690784
len(propped_names)
Out[25]:
This section on the top1000 calculation is kept here to provide some inspiration for how to work with the baby names data.
In [26]:
# from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply
names.groupby(['year', 'sex']).apply(lambda g: g.sort_index(by='births', ascending=False)[:1000]).head()
Out[26]:
In [27]:
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()
Out[27]:
In [28]:
# do a pivot table: rows = year, cols = name, for the top 1000
top_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=np.sum)
top_births.tail()
Out[28]:
In [29]:
#instead of pivot, I used groupby here
grp_top_births = top1000.groupby('year').apply(lambda s:s.groupby('name').agg('sum')).unstack()['births']
grp_top_births.tail()
# grp_top_births['Raymond'].plot()
#
Out[29]:
In [30]:
"""my name "Prabha" or "Matta" is missing in the database :("""
# top_births['Matta'].plot(title='plot for Matta')
Out[30]:
In [31]:
# is your name in the top_births list?
top_births['Raymond'].plot(title='plot for Raymond')
Out[31]:
In [32]:
# for Aaden, which shows up at the end
top_births.Aaden.plot(xlim=[1880,2010])
Out[32]:
In [33]:
# number of names represented in top_births
len(top_births.columns)
Out[33]:
In [34]:
top_births.head()
Out[34]:
In [118]:
# how to get the most popular name of all time in top_births?
most_common_names = top_births.sum()
# print most_common_names
most_common_names.sort(ascending=False)
# most_common_names.head()
# # James 5071647
# # John 5060953
# # Robert 4787187
# # Michael 4263083
# # Mary 4117746
# temp=grp_top_births.sum()
# temp.sort()
In [117]:
# most_common_names = top_births.sum()
# # print type(most_common_names)
# most_common_names.sort(ascending=False)
# most_common_names.head()
# # # James 5071647
# # # John 5060953
# # # Robert 4787187
# # # Michael 4263083
# # # Mary 4117746
# temp=grp_top_births.sum()
# print type(temp)
# temp.sort(ascending=False)
# temp.head()
In [37]:
# as of mpld3 v0.1 (2014.03.04), the name labeling doesn't work -- so disable mpld3 for this figure
mpld3.disable_notebook()
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))
Out[37]:
In [38]:
# turn mpld3 back on
mpld3.enable_notebook()
In [39]:
#using groupby
names.groupby('year').apply(lambda s: s.groupby('name').agg('sum')).unstack()['births'].tail()
Out[39]:
In [40]:
# instead of top_births -- get all_births
all_births = names.pivot_table('births', rows='year', cols='name', aggfunc=sum)
all_births.tail()
Out[40]:
In [41]:
all_births = all_births.fillna(0)
all_births.tail()
Out[41]:
In [42]:
# set up to do start/end calculation
all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)
In [43]:
all_births_cumsum.tail()
Out[43]:
In [44]:
all_births_cumsum['Raymond'].plot()
Out[44]:
In [45]:
# remind ourselves of what's in names
names.head()
Out[45]:
In [46]:
# columns in names
names.columns
Out[46]:
In [112]:
# for key, g_df in islice(names.groupby('sex'),5):
# print key,g_df
In [48]:
# calculate set of male_only, female_only, ambigender names
def calc_of_sex_of_names():
    k = names.groupby('sex').apply(lambda s: set(list(s['name'])))
    print k
    male_only_names = k['M'] - k['F']
    female_only_names = k['F'] - k['M']
    ambi_names = k['F'] & k['M']  # intersection of two
    return {'male_only_names': male_only_names,
            'female_only_names': female_only_names,
            'ambi_names': ambi_names}
names_by_sex = calc_of_sex_of_names()
ambi_names_array = np.array(list(names_by_sex['ambi_names']))
[(k, len(v)) for (k,v) in names_by_sex.items()]
Out[48]:
In [49]:
# total number of people in names
names.births.sum()
Out[49]:
In [50]:
#learning in1d
# >>> test = np.array([0, 1, 2, 5, 0])
# >>> states = [0, 2]
# >>> mask = np.in1d(test, states)
# >>> mask
# array([ True, False, True, False, True], dtype=bool)
# >>> test[mask]
# array([0, 2, 0])
# >>> mask = np.in1d(test, states, invert=True)
# >>> mask
# array([False, True, False, True, False], dtype=bool)
# >>> test[mask]
# array([1, 5])
# pivot table of ambigendered names to aggregate
names_ambi = names[np.in1d(names.name, ambi_names_array)]
ambi_names_pt = names_ambi.pivot_table('births',
                                       rows='year',
                                       cols=['name', 'sex'],
                                       aggfunc='sum')
ambi_names_pt.tail()
Out[50]:
In [51]:
ambi_names_pt['Raymond'].plot()
Out[51]:
In [52]:
# total number of people with ambigendered names -- almost everyone!
ambi_names_pt.sum().sum()
Out[52]:
In [53]:
# fill n/a with 0 and look at the table at the end
ambi_names_pt=ambi_names_pt.fillna(0L)
ambi_names_pt.tail()
Out[53]:
In [54]:
ambi_names_pt.T.head()
Out[54]:
In [55]:
# plot M, F in ambigender_names over time
ambi_names_pt.T.xs('M',level='sex').sum().cumsum().plot()
Out[55]:
In [56]:
ambi_names_pt.T.xs('F',level='sex').sum().cumsum().plot()
Out[56]:
In [57]:
# not sure why the pivot table has dtype float
# https://github.com/pydata/pandas/issues/3283
ambi_names_pt['Raymond', 'M'].dtype
Out[57]:
In [58]:
# calculate proportion of males for given name
def prop_male(name):
    return (ambi_names_pt[name]['M'] /
            (ambi_names_pt[name]['M'] + ambi_names_pt[name]['F']))

def prop_c_male(name):
    return (ambi_names_pt[name]['M'].cumsum() /
            (ambi_names_pt[name]['M'].cumsum() + ambi_names_pt[name]['F'].cumsum()))
In [59]:
prop_c_male('Leslie').plot()
Out[59]:
In [61]:
# I couldn't figure out how to iterate over the names (rather than name/sex combos)
# in a vectorized way; one possible approach is sketched after the next cell.
from itertools import islice
names_to_calc = list(islice(list(ambi_names_pt.T.index.levels[0]), None))
m = [(name_, ambi_names_pt[name_]['M'] / (ambi_names_pt[name_]['F'] + ambi_names_pt[name_]['M']))
     for name_ in names_to_calc]
p_m_instant = DataFrame(dict(m))
p_m_instant.tail()
Out[61]:
In [62]:
# similar calculation, except instead of looking at the proportions for a given year only,
# we look at the cumulative number of male/female babies for a given name
from itertools import islice
names_to_calc = list(islice(list(ambi_names_pt.T.index.levels[0]), None))
m = [(name_, ambi_names_pt[name_]['M'].cumsum() / (ambi_names_pt[name_]['F'].cumsum() + ambi_names_pt[name_]['M'].cumsum()))
     for name_ in names_to_calc]
p_m_cum = DataFrame(dict(m))
p_m_cum.tail()
Out[62]:
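As noted above, the per-name loop can be avoided. Here is one possible vectorized sketch that computes the cumulative male proportion for all ambigendered names at once (p_m_cum_vec is just an illustrative name):

cum = ambi_names_pt.fillna(0).cumsum()                         # cumulative births per (name, sex)
cum_totals = cum.groupby(level='name', axis=1).sum()           # cumulative M + F births per name
p_m_cum_vec = cum.xs('M', level='sex', axis=1) / cum_totals    # cumulative proportion male
p_m_cum_vec.tail()                                             # should agree with p_m_cum above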
In [63]:
p_m_cum['Donnie'].plot()
Out[63]:
In [64]:
# some metrics that attempt to measure how a time series s has changed
def min_max_range(s):
    """range of s, signed -- positive if the global max occurs after the global min,
    negative otherwise; 0 if they coincide"""
    # note: np.argmax, np.argmin return the position of the first occurrence of the global max, min
    sign = np.sign(np.argmax(s) - np.argmin(s))
    if sign == 0:
        return 0.0
    else:
        return sign * (np.max(s) - np.min(s))

def last_first_diff(s):
    """difference between the latest and earliest value"""
    s0 = s.dropna()
    return s0.iloc[-1] - s0.iloc[0]
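A quick sanity check of these metrics on a made-up toy Series (a sketch, not data from the assignment):

toy = Series([0.1, 0.9, 0.4], index=[1880, 1900, 1920])
min_max_range(toy)    # 0.8 -- the max (0.9) occurs after the min (0.1), so the sign is positive
last_first_diff(toy)  # ~0.3 -- last value (0.4) minus first value (0.1)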
In [65]:
# population distributions of ambigendered names
# might want to remove from consideration names whose overall M/F ratio is too lopsided
# or whose name/sex combo exists for too short a span of years
total_pop_ambiname = all_births.sum()[np.in1d(all_births.sum().index, ambi_names_array)]
total_pop_ambiname.sort(ascending=False)
total_pop_ambiname.plot(logy=True)
Out[65]:
In [66]:
# now calculate a DataFrame to visualize results
# calculate the total population, the change in p_m from last to first appearance,
# the change from max to min in p_m, and the percentage of males overall for name
df = DataFrame()
df['total_pop'] = total_pop_ambiname
df['last_first_diff'] = p_m_cum.apply(last_first_diff)
df['min_max_range'] = p_m_cum.apply(min_max_range)
df['abs_min_max_range'] = np.abs(df.min_max_range)
df['p_m'] = p_m_cum.iloc[-1]
# distance from full ambigender -- p_m=0.5 leads to 1, p_m=1 or 0 -> 0
df['ambi_index'] = df.p_m.apply(lambda p: 1 - 2* np.abs(p-0.5))
df.head()
Out[66]:
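To see the ambi_index formula in isolation, a tiny check on a few hand-picked values of p_m:

[1 - 2 * abs(p - 0.5) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]   # -> [0.0, 0.5, 1.0, 0.5, 0.0]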
In [67]:
# plot: x -> log10 of total population, y->how p_m has changed from first to last
# turn off d3 for this plot
mpld3.disable_notebook()
plt.scatter(np.log10(df.total_pop), df.last_first_diff, s=1)
Out[67]:
In [ ]:
# turn d3 back on
mpld3.enable_notebook()
plt.scatter(np.log10(df.total_pop), df.last_first_diff, s=1)
In [68]:
# general directionality counts -- looking for overall asymmetry
df.groupby(np.sign(df.last_first_diff)).count()
Out[68]:
In [69]:
# let's concentrate on more populous names that have seen big swings in the cumulative p_m
# you can play with the population and range filter
popular_names_with_shifts = df[(df.total_pop>5000) & (df.abs_min_max_range >0.7)]
popular_names_with_shifts.sort_index(by="abs_min_max_range", ascending=False)
Out[69]:
In [70]:
popular_names_with_shifts.groupby(np.sign(df.last_first_diff)).count()
Out[70]:
In [ ]:
#popular_names_with_shifts.to_pickle('popular_names_with_shifts.pickle')
In [71]:
fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'))
x = np.log10(popular_names_with_shifts.total_pop)
y = popular_names_with_shifts.min_max_range
scatter = ax.scatter(x, y)
ax.grid(color='white', linestyle='solid')
ax.set_title("Populous Names with Major Sex Shift", size=20)
ax.set_xlabel('log10(total_pop)')
ax.set_ylabel('min_max_range')
#labels = ['point {0}'.format(i + 1) for i in range(len(x))]
labels = list(popular_names_with_shifts.index)
tooltip = plugins.PointLabelTooltip(scatter, labels=labels)
plugins.connect(fig, tooltip)
In [ ]:
prop_c_male('Leslie').plot()
In [72]:
get_first_letter = lambda x:x[0]
first_letters = names.name.map(get_first_letter)
first_letters.name = 'first_letter'
first_letter_trend = names.pivot_table('births', rows='year', cols=[first_letters,'sex'], aggfunc=sum)
first_letter_trend.head()
Out[72]:
In [73]:
first_letter_trend['A'].plot()
Out[73]:
In [74]:
yearwise_first_letter_trend = names.pivot_table('births', rows=first_letters, cols=['sex','year'], aggfunc=sum)
#trending of names starting with 'A'
# first_letter_trend.plot()
yearwise_first_letter_trend.head()
Out[74]:
In [113]:
#plotting for all the first_letters and years
yearwise_first_letter_trend.plot(legend=False)
Out[113]:
In [108]:
# yearwise_first_letter_trend.sum()
In [77]:
# let us analyze the trend at four representative years: 1880, 1930, 1970, 2010
interval_yearwise_first_letter_trend = yearwise_first_letter_trend.reindex(columns = [1880,1930,1970, 2010], level = 'year')
interval_yearwise_first_letter_trend.head()
letter_prop = interval_yearwise_first_letter_trend/interval_yearwise_first_letter_trend.sum().astype(float)
In [78]:
mpld3.disable_notebook()
import matplotlib.pyplot as plt
# letter_prop = yearwise_first_letter_trend/yearwise_first_letter_trend.sum().astype(float)
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
# yearwise_first_letter_trend[2010].plot()
Out[78]:
In [80]:
# some metrics that attempt to measure how a time series s has changed
def min_max_range(s):
    """range of s, signed -- positive if the global max occurs after the global min,
    negative otherwise; 0 if they coincide"""
    # note: np.argmax, np.argmin return the position of the first occurrence of the global max, min
    sign = np.sign(np.argmax(s) - np.argmin(s))
    if sign == 0:
        return 0.0
    else:
        return sign * (np.max(s) - np.min(s))

def last_first_diff(s):
    """difference between the latest and earliest value"""
    s0 = s.dropna()
    return s0.iloc[-1] - s0.iloc[0]
In [81]:
total_pop_ambiname = all_births.sum()[np.in1d(all_births.sum().index, ambi_names_array)]
total_pop_ambiname.sort(ascending=False)
In [102]:
top5_ambi_data = DataFrame()
top5_ambi_data['total_pop'] = total_pop_ambiname
top5_ambi_data['last_first_diff'] = p_m_cum.apply(last_first_diff)
top5_ambi_data['min_max_range'] = p_m_cum.apply(min_max_range)
top5_ambi_data['abs_min_max_range'] = np.abs(df.min_max_range)
top5_ambi_data['p_m'] = p_m_cum.iloc[-1]
# distance from full ambigender -- p_m=0.5 leads to 1, p_m=1 or 0 -> 0
top5_ambi_data['ambi_index'] = df.p_m.apply(lambda p: 1 - 2* np.abs(p-0.5))
In [84]:
# sort the ambigendered names by the largest last-year-to-first-year difference
top5_ambi_data.sort_index(by='last_first_diff', ascending=False).head()
Out[84]:
In [92]:
# we see that Krish has changed the most
# let us analyze Krish
names_ambi = names[np.in1d(names.name, ambi_names_array)]
ambi_names_pt = names_ambi.pivot_table('births',
                                       rows='year',
                                       cols=['name', 'sex'],
                                       aggfunc='sum')
ambi_names_pt = ambi_names_pt.fillna(0L)
In [93]:
# normalize each year's counts by that year's total across all ambigendered name/sex combos
normalized_ambi_names = ambi_names_pt.div(ambi_names_pt.sum(1), axis=0)
In [94]:
normalized_ambi_names.tail()
Out[94]:
In [97]:
#plotting for Krish
normalized_ambi_names['Krish'].plot()
"""Observation for 'Krish'
It is interesting to note that though the name has changed from female to male. Apparantly, Krish has become popular only after 1980's
"""
Out[97]:
In [98]:
#plotting for Lydell
normalized_ambi_names['Lydell'].plot()
"""Observation for 'Lydell'
we see that the name has completely transformed from female to male.
"""
Out[98]:
In [99]:
top5_ambi_data.sort_index(by='last_first_diff', ascending=False).tail()
Out[99]:
In [109]:
#plotting for Hailey
normalized_ambi_names['Hailey'].plot()
"""Observation for 'Hailey'
we see that the name has increasing given to female babies.
"""
Out[109]:
In [101]:
top5_ambi_data.sort_index(by='ambi_index', ascending=False).head()
Out[101]:
In [103]:
#plotting for Challie
normalized_ambi_names['Challie'].plot()
"""Observation for 'Challie'
Very interesting plot
"""
Out[103]:
In [104]:
top5_ambi_data.sort_index(by='ambi_index', ascending=False).tail()
Out[104]:
In [107]:
#plotting for Annabelle
normalized_ambi_names['Annabelle'].plot()
"""Observation for 'Annabelle'
Annabelle is a hot name now :) very trending
"""
Out[107]: