Baby Names Assignment

In this assignment, I analyzed baby names from two main perspectives:

  1. Analyzing the variation/trend in names by first letter: a) across 50-year gaps, b) between male and female names
  2. Measuring sex shift on various dimensions in top ambi-names (using proportional/normalized populations)

Note: to show that Day_13_C_Baby_Names_MF_Completed is running properly, I added my work at the bottom of this notebook


In [1]:
%matplotlib inline

In [1]:
import matplotlib.pyplot as plt
import numpy as np

from pylab import figure, show

from pandas import DataFrame, Series
import pandas as pd

In [2]:
try:
    import mpld3
    from mpld3 import enable_notebook
    from mpld3 import plugins
    enable_notebook()
except Exception as e:
    print "Attempt to import and enable mpld3 failed", e

In [3]:
# what would seaborn do?
try:
    import seaborn as sns
except Exception as e:
    print "Attempt to import and enable seaborn failed", e


Attempt to import and enable seaborn failed No module named formula.api

Preliminaries: Assumed location of pydata-book files

To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from

https://github.com/pydata/pydata-book

in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"

and then symbolically linked (ln -s) to pydata-book from the root directory of the working-open-data folder, i.e., on OS X:

cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book

That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.

With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths.


In [5]:
import os

NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")

#NAMES_DIR ---> '../pydata-book/ch02/names'
assert os.path.exists(NAMES_DIR)

Please make sure the above assertion works.
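
If the assertion fails, it helps to see which path was actually checked. A minimal sketch, assuming the relative layout described above; `describe_names_dir` is a hypothetical helper and not part of the original notebook.

```python
import os

def describe_names_dir(base=os.pardir):
    """Return (path, exists) for the assumed names directory under `base`."""
    path = os.path.join(base, "pydata-book", "ch02", "names")
    return path, os.path.isdir(path)

# Report the absolute path that was checked, whether or not it exists.
path, found = describe_names_dir()
status = "found" if found else "MISSING"
print("names directory %s: %s" % (os.path.abspath(path), status))
```

The helper only reports; it does not raise, so it can be run before the assertion to diagnose a misplaced symlink.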

Baby names dataset

Discussed on p. 35 of the PfDA book (Python for Data Analysis).

To download all the data, including that for 2011 and 2012: Popular Baby Names --> includes state by state data.

Loading all data into Pandas


In [6]:
# show the first five files in the NAMES_DIR

import glob
glob.glob(NAMES_DIR + "/*")[:5]


Out[6]:
['../pydata-book/ch02/names/NationalReadMe.pdf',
 '../pydata-book/ch02/names/yob1880.txt',
 '../pydata-book/ch02/names/yob1881.txt',
 '../pydata-book/ch02/names/yob1882.txt',
 '../pydata-book/ch02/names/yob1883.txt']

In [7]:
# 2010 is the last available year in the pydata-book repo
import os

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
#     print path
    frame = pd.read_csv(path, names=columns)
#     print frame

    frame['year'] = year
#     print frame
    pieces.append(frame)
#     print pieces

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

# why floats?  I'm not sure.
names.describe()
# names = pd.concat(pieces)
# len(names) -->1690784


Out[7]:
births year
count 1690784.000000 1690784.000000
mean 190.682386 1969.454384
std 1615.899711 32.823526
min 5.000000 1880.000000
25% 7.000000 1946.000000
50% 12.000000 1979.000000
75% 32.000000 1997.000000
max 99651.000000 2010.000000

8 rows × 2 columns
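
On the "why floats?" question in the cell above: a likely explanation is that `describe()` reports statistics such as the mean and standard deviation, so the whole summary is stored as floats while the underlying column keeps its integer dtype. A small sketch with three birth counts from the 1880 data, assuming a recent pandas and Python 3:

```python
import pandas as pd

# Three birth counts from the 1880 data.
births = pd.Series([7065, 2604, 2003], name="births")

summary = births.describe()

print(births.dtype)   # the column itself stays an integer type
print(summary.dtype)  # the summary is float because mean/std are not integers
```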


In [8]:
names.head()


Out[8]:
name sex births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

5 rows × 4 columns


In [9]:
names.births


Out[9]:
0     7065
1     2604
2     2003
3     1939
4     1746
5     1578
6     1472
7     1414
8     1320
9     1288
10    1258
11    1226
12    1156
13    1063
14    1045
...
1690769    5
1690770    5
1690771    5
1690772    5
1690773    5
1690774    5
1690775    5
1690776    5
1690777    5
1690778    5
1690779    5
1690780    5
1690781    5
1690782    5
1690783    5
Name: births, Length: 1690784, dtype: int64

In [10]:
# how many people, names, males and females  represented in names?
names.births.sum()


Out[10]:
322402727

In [11]:
names.groupby('sex').head()


Out[11]:
name sex births year
sex
F 0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
M 942 John M 9655 1880
943 William M 9533 1880
944 James M 5927 1880
945 Charles M 5348 1880
946 George M 5126 1880

10 rows × 4 columns


In [12]:
# F vs M

names.groupby('sex')['births'].sum()


Out[12]:
sex
F      159990140
M      162412587
Name: births, dtype: int64

In [13]:
grp = names.groupby('name')

In [111]:
#experimenting with groups
# from itertools import islice

# for key, g_df in islice(grp,5):
#     print key, type(g_df), g_df.columns, g_df, g_df.sex

In [15]:
# total number of names

len(names.groupby('name'))


Out[15]:
88496

In [16]:
# use pivot_table to collect records by year (rows) and sex (columns)

total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births.head()


Out[16]:
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445

5 rows × 2 columns
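
The `rows=` and `cols=` keywords used above come from the pandas version this notebook was written against; later pandas releases name them `index=` and `columns=`. A minimal sketch of the same pivot on a made-up five-row sample (the counts are illustrative, not from the dataset), assuming a recent pandas and Python 3:

```python
import pandas as pd

# Tiny stand-in for `names`; the counts are invented for illustration.
sample = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "Mary", "John"],
    "sex":    ["F", "F", "M", "F", "M"],
    "births": [10, 5, 12, 8, 9],
    "year":   [1880, 1880, 1880, 1881, 1881],
})

# index= / columns= play the role of rows= / cols= in the cell above
totals = sample.pivot_table(values="births", index="year",
                            columns="sex", aggfunc="sum")
print(totals)
```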


In [17]:
names.groupby('year').head()


Out[17]:
name sex births year
year
1880 0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
1881 2000 Mary F 6919 1881
2001 Anna F 2698 1881
2002 Emma F 2034 1881
2003 Elizabeth F 1852 1881
2004 Margaret F 1658 1881
1882 3935 Mary F 8149 1882
3936 Anna F 3143 1882
3937 Emma F 2303 1882
3938 Elizabeth F 2187 1882
3939 Minnie F 2004 1882
1883 6062 Mary F 8012 1883
6063 Anna F 3306 1883
6064 Emma F 2367 1883
6065 Elizabeth F 2255 1883
6066 Minnie F 2035 1883
1884 8146 Mary F 9217 1884
8147 Anna F 3860 1884
8148 Emma F 2587 1884
8149 Elizabeth F 2549 1884
8150 Minnie F 2243 1884
1885 10443 Mary F 9128 1885
10444 Anna F 3994 1885
10445 Emma F 2728 1885
10446 Elizabeth F 2582 1885
10447 Margaret F 2204 1885
1886 12737 Mary F 9891 1886
12738 Anna F 4283 1886
12739 Emma F 2764 1886
12740 Elizabeth F 2680 1886
12741 Minnie F 2372 1886
1887 15129 Mary F 9888 1887
15130 Anna F 4227 1887
15131 Elizabeth F 2681 1887
15132 Emma F 2647 1887
15133 Margaret F 2419 1887
1888 17502 Mary F 11754 1888
17503 Anna F 4982 1888
17504 Elizabeth F 3224 1888
17505 Emma F 3087 1888
17506 Margaret F 2904 1888
1889 20153 Mary F 11649 1889
20154 Anna F 5062 1889
20155 Elizabeth F 3058 1889
20156 Margaret F 2917 1889
20157 Emma F 2884 1889
1890 22743 Mary F 12078 1890
22744 Anna F 5233 1890
22745 Elizabeth F 3112 1890
22746 Margaret F 3100 1890
22747 Emma F 2980 1890
1891 25438 Mary F 11704 1891
25439 Anna F 5099 1891
25440 Margaret F 3066 1891
25441 Elizabeth F 3059 1891
25442 Emma F 2884 1891
... ... ... ...

655 rows × 4 columns


In [18]:
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).head()


Out[18]:
births year
year sex
1880 F 90993 1770960
M 110493 1989040
1881 F 91955 1764378
M 100748 1875357
1882 F 107851 1934696

5 rows × 2 columns


In [19]:
# You can use groupby to get the equivalent pivot_table calculation

names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].head()


Out[19]:
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445

5 rows × 2 columns


In [20]:
# how to calculate the total births / year

names.groupby('year').sum().plot(title="total births by year")


Out[20]:
<matplotlib.axes.AxesSubplot at 0x11263f490>

In [21]:
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].plot(title="births (M/F) by year")


Out[21]:
<matplotlib.axes.AxesSubplot at 0x116d92e50>

In [110]:
#some more experimentation with groups
# from itertools import islice

# for key, g_df in islice(names.groupby(['year', 'sex']),5):
#     print key,g_df
#     print key, type(g_df), g_df.columns, g_df, g_df.sex

In [23]:
# from book: add prop to names

def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)
#     print births
    group['prop'] = births / births.sum()
    return group

propped_names = names.groupby(['year', 'sex']).apply(add_prop)
propped_names.head()


Out[23]:
name sex births year prop
0 Mary F 7065 1880 0.077643
1 Anna F 2604 1880 0.028618
2 Emma F 2003 1880 0.022013
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188

5 rows × 5 columns


In [24]:
# verify prop --> all adds up to 1

# np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
np.allclose(propped_names.groupby(['year', 'sex']).prop.sum(), 1)


Out[24]:
True
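
The same proportion column can also be computed without a custom function by dividing by a per-group total. This is a sketch of an alternative, not the book's method; it assumes a recent pandas and uses a made-up sample:

```python
import numpy as np
import pandas as pd

# Invented counts; one year, two sexes.
sample = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "James"],
    "sex":    ["F", "F", "M", "M"],
    "births": [10, 5, 12, 4],
    "year":   [1880, 1880, 1880, 1880],
})

# transform("sum") returns one total per row, aligned with the original index
group_total = sample.groupby(["year", "sex"])["births"].transform("sum")
sample["prop"] = sample["births"] / group_total

# each (year, sex) group should sum to 1
sums = sample.groupby(["year", "sex"])["prop"].sum()
print(np.allclose(sums, 1))
```

Because `transform` keeps the original row order and index, the result can be assigned straight back as a column.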

In [25]:
# number of records in full names dataframe

# len(names) --> 1690784
len(propped_names)


Out[25]:
1690784

How to do top1000 calculation

This section on the top1000 calculation is kept in here to provide some inspiration on how to work with baby names


In [26]:
#  from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply

names.groupby(['year', 'sex']).apply(lambda g: g.sort_index(by='births', ascending=False)[:1000]).head()


Out[26]:
name sex births year
year sex
1880 F 0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

5 rows × 4 columns


In [27]:
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()


Out[27]:
name sex births year
year sex
1880 F 0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

5 rows × 4 columns
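
In later pandas releases `sort_index(by=...)` is no longer available and `sort_values` takes its place. A minimal sketch of the top-N idea on a made-up sample, with N=2 instead of 1000, assuming a recent pandas; sorting once and then taking `head` per group is one alternative to `apply`:

```python
import pandas as pd

# Invented counts; three female and two male names in one year.
sample = pd.DataFrame({
    "name":   ["Mary", "Anna", "Emma", "John", "James"],
    "sex":    ["F", "F", "F", "M", "M"],
    "births": [10, 5, 3, 12, 7],
    "year":   [1880] * 5,
})

# sort all rows by births, then keep the first 2 rows of each (year, sex) group
top2 = (sample.sort_values(by="births", ascending=False)
              .groupby(["year", "sex"])
              .head(2))
print(sorted(top2["name"]))
```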


In [28]:
# Do pivot table: row: year and cols= names for top 1000

top_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=np.sum)
top_births.tail()


Out[28]:
name Aaden Aaliyah Aarav Aaron Aarush Ab Abagail Abb Abbey Abbie Abbigail Abbott Abby Abdiel Abdul Abdullah Abe Abel Abelardo Abigail
year
2006 NaN 3737 NaN 8279 NaN NaN 297 NaN 404 440 630 NaN 1682 NaN NaN 219 NaN 922 NaN 15615 ...
2007 NaN 3941 NaN 8914 NaN NaN 313 NaN 349 468 651 NaN 1573 NaN NaN 224 NaN 939 NaN 15447 ...
2008 955 4028 219 8511 NaN NaN 317 NaN 344 400 608 NaN 1328 199 NaN 210 NaN 863 NaN 15045 ...
2009 1265 4352 270 7936 NaN NaN 296 NaN 307 369 675 NaN 1274 229 NaN 256 NaN 960 NaN 14342 ...
2010 448 4628 438 7374 226 NaN 277 NaN 295 324 585 NaN 1140 264 NaN 225 NaN 1119 NaN 14124 ...

5 rows × 6865 columns


In [29]:
#instead of pivot, I used groupby here

grp_top_births = top1000.groupby('year').apply(lambda s:s.groupby('name').agg('sum')).unstack()['births']
grp_top_births.tail()
# grp_top_births['Raymond'].plot()
#


Out[29]:
name Aaden Aaliyah Aarav Aaron Aarush Ab Abagail Abb Abbey Abbie Abbigail Abbott Abby Abdiel Abdul Abdullah Abe Abel Abelardo Abigail
year
2006 NaN 3737 NaN 8279 NaN NaN 297 NaN 404 440 630 NaN 1682 NaN NaN 219 NaN 922 NaN 15615 ...
2007 NaN 3941 NaN 8914 NaN NaN 313 NaN 349 468 651 NaN 1573 NaN NaN 224 NaN 939 NaN 15447 ...
2008 955 4028 219 8511 NaN NaN 317 NaN 344 400 608 NaN 1328 199 NaN 210 NaN 863 NaN 15045 ...
2009 1265 4352 270 7936 NaN NaN 296 NaN 307 369 675 NaN 1274 229 NaN 256 NaN 960 NaN 14342 ...
2010 448 4628 438 7374 226 NaN 277 NaN 295 324 585 NaN 1140 264 NaN 225 NaN 1119 NaN 14124 ...

5 rows × 6865 columns


In [30]:
"""my name "Prabha" or "Matta" is missing in the database :("""
# top_births['Matta'].plot(title='plot for Matta')


Out[30]:
'my name "Prabha" or "Matta" is missing in the database :('

In [31]:
# is your name in the top_births list?

top_births['Raymond'].plot(title='plot for Raymond')


Out[31]:
<matplotlib.axes.AxesSubplot at 0x110a13b50>

In [32]:
# for Aaden, which shows up at the end

top_births.Aaden.plot(xlim=[1880,2010])


Out[32]:
<matplotlib.axes.AxesSubplot at 0x113eeef10>

In [33]:
# number of names represented in top_births

len(top_births.columns)


Out[33]:
6865

In [34]:
top_births.head()


Out[34]:
name Aaden Aaliyah Aarav Aaron Aarush Ab Abagail Abb Abbey Abbie Abbigail Abbott Abby Abdiel Abdul Abdullah Abe Abel Abelardo Abigail
year
1880 NaN NaN NaN 102 NaN NaN NaN NaN NaN 71 NaN NaN 6 NaN NaN NaN 50 9 NaN 12 ...
1881 NaN NaN NaN 94 NaN NaN NaN NaN NaN 81 NaN NaN 7 NaN NaN NaN 36 12 NaN 8 ...
1882 NaN NaN NaN 85 NaN NaN NaN NaN NaN 80 NaN NaN 11 NaN NaN NaN 50 10 NaN 14 ...
1883 NaN NaN NaN 105 NaN NaN NaN NaN NaN 79 NaN NaN NaN NaN NaN NaN 43 12 NaN 11 ...
1884 NaN NaN NaN 97 NaN NaN NaN NaN NaN 98 NaN NaN 6 NaN NaN NaN 45 14 NaN 13 ...

5 rows × 6865 columns


In [118]:
# how to get the most popular name of all time in top_births?

most_common_names = top_births.sum()
# print most_common_names
most_common_names.sort(ascending=False)

# most_common_names.head()

# # James      5071647
# # John       5060953
# # Robert     4787187
# # Michael    4263083
# # Mary       4117746

# temp=grp_top_births.sum()
# temp.sort()

In [117]:
# most_common_names = top_births.sum()
# # print type(most_common_names)
# most_common_names.sort(ascending=False)

# most_common_names.head()

# # # James      5071647
# # # John       5060953
# # # Robert     4787187
# # # Michael    4263083
# # # Mary       4117746

# temp=grp_top_births.sum()
# print type(temp)
# temp.sort(ascending=False)
# temp.head()

In [37]:
# as of mpl v 0.1 (2014.03.04), the name labeling doesn't work -- so disable mpld3 for this figure

mpld3.disable_notebook()
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))


Out[37]:
<matplotlib.axes.AxesSubplot at 0x110a143d0>

In [38]:
# turn mpld3 back on

mpld3.enable_notebook()

all_births pivot table


In [39]:
#using groupby
names.groupby('year').apply(lambda s: s.groupby('name').agg('sum')).unstack()['births'].tail()


Out[39]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 NaN NaN NaN 9 NaN 14 55 NaN 5 NaN NaN 74 11 NaN NaN 17 NaN NaN 42 7 ...
2007 5 NaN NaN 8 8 13 155 NaN NaN NaN 10 72 15 10 NaN 31 7 NaN 43 10 ...
2008 NaN NaN 5 6 22 13 955 NaN NaN NaN 9 76 20 22 NaN 24 5 NaN 51 10 ...
2009 6 NaN NaN 9 23 16 1270 5 5 NaN 18 76 17 25 6 12 NaN NaN 38 23 ...
2010 9 NaN NaN 7 11 NaN 448 NaN 13 5 19 54 11 18 NaN 23 NaN 5 37 NaN ...

5 rows × 88496 columns


In [40]:
# instead of top_birth -- get all_births

all_births = names.pivot_table('births', rows='year', cols='name', aggfunc=sum)
all_births.tail()


Out[40]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 NaN NaN NaN 9 NaN 14 55 NaN 5 NaN NaN 74 11 NaN NaN 17 NaN NaN 42 7 ...
2007 5 NaN NaN 8 8 13 155 NaN NaN NaN 10 72 15 10 NaN 31 7 NaN 43 10 ...
2008 NaN NaN 5 6 22 13 955 NaN NaN NaN 9 76 20 22 NaN 24 5 NaN 51 10 ...
2009 6 NaN NaN 9 23 16 1270 5 5 NaN 18 76 17 25 6 12 NaN NaN 38 23 ...
2010 9 NaN NaN 7 11 NaN 448 NaN 13 5 19 54 11 18 NaN 23 NaN 5 37 NaN ...

5 rows × 88496 columns


In [41]:
all_births = all_births.fillna(0)
all_births.tail()


Out[41]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 0 0 0 9 0 14 55 0 5 0 0 74 11 0 0 17 0 0 42 7 ...
2007 5 0 0 8 8 13 155 0 0 0 10 72 15 10 0 31 7 0 43 10 ...
2008 0 0 5 6 22 13 955 0 0 0 9 76 20 22 0 24 5 0 51 10 ...
2009 6 0 0 9 23 16 1270 5 5 0 18 76 17 25 6 12 0 0 38 23 ...
2010 9 0 0 7 11 0 448 0 13 5 19 54 11 18 0 23 0 5 37 0 ...

5 rows × 88496 columns


In [42]:
# set up to do start/end calculation

all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)

In [43]:
all_births_cumsum.tail()


Out[43]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 0 5 0 103 5 67 149 5 11 0 0 171 175 10 0 67 5 0 153 18 ...
2007 5 5 0 111 13 80 304 5 11 0 10 243 190 20 0 98 12 0 196 28 ...
2008 5 5 5 117 35 93 1259 5 11 0 19 319 210 42 0 122 17 0 247 38 ...
2009 11 5 5 126 58 109 2529 10 16 0 37 395 227 67 6 134 17 0 285 61 ...
2010 20 5 5 133 69 109 2977 10 29 5 56 449 238 85 6 157 17 5 322 61 ...

5 rows × 88496 columns


In [44]:
all_births_cumsum['Raymond'].plot()


Out[44]:
<matplotlib.axes.AxesSubplot at 0x12f9c8410>

Names that are both M and F


In [45]:
# remind ourselves of what's in names

names.head()


Out[45]:
name sex births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880

5 rows × 4 columns


In [46]:
# columns in names

names.columns


Out[46]:
Index([u'name', u'sex', u'births', u'year'], dtype='object')

In [112]:
# for key, g_df in islice(names.groupby('sex'),5):
#     print key,g_df

Calculating ambigendered names


In [48]:
# calculate set of male_only, female_only, ambigender names

def calc_of_sex_of_names():

    k = names.groupby('sex').apply(lambda s: set(list(s['name'])))
    print k
    male_only_names = k['M'] - k['F']
    female_only_names = k['F'] - k['M']
    ambi_names = k['F'] & k['M'] # intersection of two 
    return {'male_only_names': male_only_names, 
            'female_only_names': female_only_names,
            'ambi_names': ambi_names }
    
names_by_sex = calc_of_sex_of_names() 
ambi_names_array = np.array(list(names_by_sex['ambi_names']))

[(k, len(v)) for (k,v) in names_by_sex.items()]


sex
F      set([Satara, Britiney, Arely, Charelle, Dejama...
M      set([Antal, Elizeo, Dago, Jhase, Jamesson, Lar...
dtype: object
Out[48]:
[('female_only_names', 51754),
 ('male_only_names', 27090),
 ('ambi_names', 9652)]
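
The set arithmetic behind these three counts can be seen on a handful of made-up (name, sex) pairs. This sketch uses only built-in sets and is independent of pandas:

```python
# Invented records; duplicates are allowed, as in the real data.
records = [("Leslie", "F"), ("Leslie", "M"), ("Mary", "F"),
           ("John", "M"), ("Mary", "F")]

# Collect the distinct names seen for each sex.
names_by = {"F": set(), "M": set()}
for name, sex in records:
    names_by[sex].add(name)

ambi = names_by["F"] & names_by["M"]         # intersection: used for both sexes
female_only = names_by["F"] - names_by["M"]  # difference: never recorded as M
male_only = names_by["M"] - names_by["F"]    # difference: never recorded as F

print(sorted(ambi), sorted(female_only), sorted(male_only))
```

Note that a name counts as ambigendered here as soon as it appears once under each sex, which matches how the notebook's `ambi_names` is built.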

In [49]:
# total number of people in names
names.births.sum()


Out[49]:
322402727

In [50]:
#learning in1d
# >>> test = np.array([0, 1, 2, 5, 0])
# >>> states = [0, 2]
# >>> mask = np.in1d(test, states)
# >>> mask
# array([ True, False,  True, False,  True], dtype=bool)
# >>> test[mask]
# array([0, 2, 0])
# >>> mask = np.in1d(test, states, invert=True)
# >>> mask
# array([False,  True, False,  True, False], dtype=bool)
# >>> test[mask]
# array([1, 5])

# pivot table of ambigendered names to aggregate 

names_ambi = names[np.in1d(names.name,ambi_names_array)]
ambi_names_pt = names_ambi.pivot_table('births',
                            rows='year', 
                            cols=['name','sex'], 
                            aggfunc='sum')
ambi_names_pt.tail()


Out[50]:
name Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin
sex F M F M F M F M F M F M F M F M F M F M
year
2006 NaN 55 5 69 NaN 16 5 5 3737 NaN 5 NaN NaN 26 NaN NaN NaN NaN 10 12 ...
2007 NaN 155 NaN 72 NaN 27 8 10 3941 NaN 10 10 NaN 26 NaN 5 NaN 6 6 20 ...
2008 NaN 955 NaN 76 9 56 5 15 4028 NaN 5 9 NaN 29 NaN NaN NaN NaN 9 16 ...
2009 5 1265 NaN 76 7 76 7 12 4352 NaN NaN 8 NaN 28 NaN 6 6 7 NaN 19 ...
2010 NaN 448 NaN 54 NaN 38 NaN 15 4628 6 8 8 5 30 NaN 11 NaN 5 7 21 ...

5 rows × 19304 columns


In [51]:
ambi_names_pt['Raymond'].plot()


Out[51]:
<matplotlib.axes.AxesSubplot at 0x13f64ba90>

In [52]:
# total number of people with ambigendered names (ambi_names_pt) -- almost everyone!

ambi_names_pt.sum().sum()


Out[52]:
299879378.0

In [53]:
# fill n/a with 0 and look at the table at the end

ambi_names_pt=ambi_names_pt.fillna(0L)
ambi_names_pt.tail()


Out[53]:
name Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin
sex F M F M F M F M F M F M F M F M F M F M
year
2006 0 55 5 69 0 16 5 5 3737 0 5 0 0 26 0 0 0 0 10 12 ...
2007 0 155 0 72 0 27 8 10 3941 0 10 10 0 26 0 5 0 6 6 20 ...
2008 0 955 0 76 9 56 5 15 4028 0 5 9 0 29 0 0 0 0 9 16 ...
2009 5 1265 0 76 7 76 7 12 4352 0 0 8 0 28 0 6 6 7 0 19 ...
2010 0 448 0 54 0 38 0 15 4628 6 8 8 5 30 0 11 0 5 7 21 ...

5 rows × 19304 columns


In [54]:
ambi_names_pt.T.head()


Out[54]:
year 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899
name sex
Aaden F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Aadi F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Aadyn F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 131 columns


In [55]:
# plot M, F in ambigender_names over time
ambi_names_pt.T.xs('M',level='sex').sum().cumsum().plot()


Out[55]:
<matplotlib.axes.AxesSubplot at 0x11db77710>

In [56]:
ambi_names_pt.T.xs('F',level='sex').sum().cumsum().plot()


Out[56]:
<matplotlib.axes.AxesSubplot at 0x1128467d0>

In [57]:
# don't know why pivot table has type float
# https://github.com/pydata/pandas/issues/3283
ambi_names_pt['Raymond', 'M'].dtype


Out[57]:
dtype('float64')
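
A likely reason for the float dtype: name/sex/year combinations with no records become NaN in the pivot table, and NaN can only be stored in a float column. A sketch on made-up data, assuming a recent pandas (which spells the keywords `index=` and `columns=`):

```python
import pandas as pd

# Invented counts; Leslie appears only in 1900, Mary in both years.
sample = pd.DataFrame({
    "name":   ["Leslie", "Leslie", "Mary", "Mary"],
    "sex":    ["F", "M", "F", "F"],
    "births": [3, 7, 9, 4],
    "year":   [1900, 1900, 1900, 1901],
})

pt = sample.pivot_table(values="births", index="year",
                        columns=["name", "sex"], aggfunc="sum")

# Leslie has no 1901 records, so those cells are NaN and the column is float
print(pt[("Leslie", "M")].dtype)

# once the gaps are filled, the table can be cast back to integers
filled = pt.fillna(0).astype("int64")
print(filled.loc[1901, ("Leslie", "M")])
```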

In [58]:
# calculate proportion of males for given name

def prop_male(name):
    return (ambi_names_pt[name]['M']/ \
    ((ambi_names_pt[name]['M'] + ambi_names_pt[name]['F'])))

def prop_c_male(name):
    return (ambi_names_pt[name]['M'].cumsum()/ \
    ((ambi_names_pt[name]['M'].cumsum() + ambi_names_pt[name]['F'].cumsum())))
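
The difference between the two functions is easiest to see on a short made-up series: `prop_male` looks at each year in isolation, while `prop_c_male` accumulates births first, so it changes more slowly. A plain-Python sketch (Python 3, no pandas) with invented counts:

```python
from itertools import accumulate

# Invented yearly counts for one name drifting from mostly male to all female.
male = [8, 6, 2, 0]
female = [2, 4, 8, 10]

# per-year proportion of males
instant = [m / (m + f) for m, f in zip(male, female)]

# proportion of males among all births up to and including each year
cum_m = list(accumulate(male))
cum_f = list(accumulate(female))
cumulative = [m / (m + f) for m, f in zip(cum_m, cum_f)]

print(instant)
print(cumulative)
```

In the last year the per-year proportion has dropped to zero, while the cumulative one still reflects the earlier male births.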

In [59]:
prop_c_male('Leslie').plot()


Out[59]:
<matplotlib.axes.AxesSubplot at 0x12f9ad410>

In [61]:
# I couldn't figure out a way of iterating over the names rather than names/sex combo in
# a vectorized way.  

from itertools import islice

names_to_calc = list(islice(list(ambi_names_pt.T.index.levels[0]),None))

m = [(name_, ambi_names_pt[name_]['M']/(ambi_names_pt[name_]['F'] + ambi_names_pt[name_]['M']))  \
     for name_ in names_to_calc]
p_m_instant = DataFrame(dict(m))
p_m_instant.tail()


Out[61]:
Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin Aarion Aaris Aaron Aarya Aaryn Aba Abba Abbey Abbie Abbigail
year
2006 1.000000 0.932432 1.000000 0.500000 0.000000 0.000000 1.000000 NaN NaN 0.545455 0.730769 1.000000 0.997109 0.481481 0.595745 1 NaN 0 0 0 ...
2007 1.000000 1.000000 1.000000 0.555556 0.000000 0.500000 1.000000 1 1.000000 0.769231 0.794118 0.454545 0.997426 0.240506 0.518519 NaN NaN 0 0 0 ...
2008 1.000000 1.000000 0.861538 0.750000 0.000000 0.642857 1.000000 NaN NaN 0.640000 0.666667 NaN 0.996604 0.213333 0.480519 NaN NaN 0 0 0 ...
2009 0.996063 1.000000 0.915663 0.631579 0.000000 1.000000 1.000000 1 0.538462 1.000000 0.750000 NaN 0.995984 0.247312 0.406250 NaN NaN 0 0 0 ...
2010 1.000000 1.000000 1.000000 1.000000 0.001295 0.500000 0.857143 1 1.000000 0.750000 1.000000 NaN 0.996891 0.265306 0.340000 NaN NaN 0 0 0 ...

5 rows × 9652 columns
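
One vectorized possibility, offered as a sketch rather than a tested drop-in for the cell above: take cross-sections of the column MultiIndex with `DataFrame.xs` and divide the resulting frames. The tiny table below stands in for `ambi_names_pt` with invented counts, and assumes a recent pandas:

```python
import pandas as pd

# Columns are (name, sex) pairs, rows are years, as in ambi_names_pt.
cols = pd.MultiIndex.from_product([["Leslie", "Mary"], ["F", "M"]],
                                  names=["name", "sex"])
pt = pd.DataFrame([[2.0, 8.0, 9.0, 1.0],
                   [6.0, 4.0, 10.0, 0.0]],
                  index=pd.Index([1900, 1950], name="year"),
                  columns=cols)

# xs drops the 'sex' level, leaving one column per name in each frame
males = pt.xs("M", axis=1, level="sex")
females = pt.xs("F", axis=1, level="sex")

# element-wise division aligns on year and name
p_m = males / (males + females)
print(p_m)
```

The cumulative variant would follow the same pattern with `males.cumsum()` and `females.cumsum()` in place of the raw frames.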


In [62]:
# similar calculation except instead of looking at the proportions for a given year only,
# we look at the cumulative number of male/female babies for given name

from itertools import islice

names_to_calc = list(islice(list(ambi_names_pt.T.index.levels[0]),None))

m = [(name_, ambi_names_pt[name_]['M'].cumsum()/(ambi_names_pt[name_]['F'].cumsum() + ambi_names_pt[name_]['M'].cumsum()))  \
     for name_ in names_to_calc]
p_m_cum = DataFrame(dict(m))
p_m_cum.tail()


Out[62]:
Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin Aarion Aaris Aaron Aarya Aaryn Aba Abba Abbey Abbie Abbigail
year
2006 1.000000 0.970760 1.000000 0.461538 0.001677 0.289474 0.650694 0.500000 0.238095 0.500000 0.714667 0.52381 0.991825 0.481818 0.391437 0.185185 0.666667 0.002404 0.017656 0.000761 ...
2007 1.000000 0.979424 1.000000 0.477064 0.001494 0.362069 0.661783 0.600000 0.407407 0.512727 0.721271 0.50000 0.991925 0.418060 0.398068 0.185185 0.666667 0.002348 0.017220 0.000693 ...
2008 1.000000 0.984326 0.934783 0.519380 0.001344 0.416667 0.673349 0.600000 0.407407 0.518261 0.718245 0.50000 0.992003 0.377005 0.403777 0.185185 0.666667 0.002295 0.016863 0.000639 ...
2009 0.998023 0.987342 0.927602 0.533784 0.001213 0.475000 0.683790 0.677419 0.450000 0.533670 0.719647 0.50000 0.992064 0.351178 0.403912 0.185185 0.666667 0.002250 0.016547 0.000588 ...
2010 0.998320 0.988864 0.938224 0.576687 0.001221 0.479167 0.690450 0.761905 0.511111 0.543408 0.729787 0.50000 0.992131 0.336283 0.401305 0.185185 0.666667 0.002208 0.016280 0.000550 ...

5 rows × 9652 columns


In [63]:
p_m_cum['Donnie'].plot()


Out[63]:
<matplotlib.axes.AxesSubplot at 0x11de6ca10>

In [64]:
# some metrics that attempt to measure how a time series s has changed

def min_max_range(s):
    """signed range of s -- positive if the max occurs after the min, negative
    if the max occurs before the min; 0 if they coincide"""
    # note np.argmax, np.argmin return the position of the first occurrence of the global max, min
    sign = np.sign(np.argmax(s) - np.argmin(s))
    if sign == 0:
        return 0.0
    else:
        return sign*(np.max(s) - np.min(s))

def last_first_diff(s):
    """difference between latest and earliest value"""
    s0 = s.dropna()
    return (s0.iloc[-1] - s0.iloc[0])
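
To make the sign convention of `min_max_range` concrete, here is a plain-Python sketch on lists. `signed_range` is a hypothetical stand-in that uses list positions where the notebook relies on `np.argmax`/`np.argmin`:

```python
def signed_range(values):
    """Max minus min, negated when the max occurs before the min."""
    i_max = values.index(max(values))  # position of first occurrence of the max
    i_min = values.index(min(values))  # position of first occurrence of the min
    if i_max == i_min:
        return 0.0
    sign = 1 if i_max > i_min else -1
    return sign * (max(values) - min(values))

print(signed_range([0.75, 0.5, 0.25]))  # falling series -> negative
print(signed_range([0.25, 0.75]))       # rising series -> positive
print(signed_range([0.5, 0.5]))         # flat series -> 0.0
```

For a cumulative male proportion, a negative value therefore means the name drifted toward female usage over time.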

In [65]:
# population distributions of ambinames 
# might want to remove from consideration instances when total ratio is too great
# or range of existence of a name/sex combo too short

total_pop_ambiname = all_births.sum()[np.in1d(all_births.sum().index, ambi_names_array)]
total_pop_ambiname.sort(ascending=False)
total_pop_ambiname.plot(logy=True)


Out[65]:
<matplotlib.axes.AxesSubplot at 0x11e06cf50>

In [66]:
# now calculate a DataFrame to visualize results

# calculate the total population, the change in p_m from last to first appearance, 
# the change from max to min in p_m, and the percentage of males overall for name

df = DataFrame()
df['total_pop'] = total_pop_ambiname
df['last_first_diff'] = p_m_cum.apply(last_first_diff)
df['min_max_range'] = p_m_cum.apply(min_max_range)
df['abs_min_max_range'] = np.abs(df.min_max_range)
df['p_m'] = p_m_cum.iloc[-1]

# distance from full ambigender -- p_m=0.5 leads to 1, p_m=1 or 0 -> 0
df['ambi_index'] = df.p_m.apply(lambda p: 1 - 2* np.abs(p-0.5))

df.head()


Out[66]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
name
James 5072771 -0.000845 -0.002123 0.002123 0.995457 0.009085
John 5061897 0.000479 -0.001921 0.001921 0.995737 0.008526
Robert 4788050 0.000344 0.002027 0.002027 0.995811 0.008377
Michael 4265373 -0.005034 -0.006425 0.006425 0.994966 0.010067
Mary 4119074 -0.000132 -0.000829 0.000829 0.003675 0.007351

5 rows × 6 columns
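
The `ambi_index` column maps the male proportion `p_m` to a score that is 1 when a name is split evenly and 0 when it is used by one sex only. A small sketch of the formula on exact values:

```python
def ambi_index(p_m):
    """1 at p_m = 0.5, falling linearly to 0 at p_m = 0 or p_m = 1."""
    return 1 - 2 * abs(p_m - 0.5)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, ambi_index(p))
```

The score is symmetric, so a name that is 25% male and one that is 75% male get the same value.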


In [67]:
# plot: x -> log10 of total population, y->how p_m has changed from first to last
# turn off d3 for this plot

mpld3.disable_notebook()
plt.scatter(np.log10(df.total_pop), df.last_first_diff, s=1)


Out[67]:
<matplotlib.collections.PathCollection at 0x111f42ed0>

In [ ]:
# turn d3 back on

mpld3.enable_notebook()
plt.scatter(np.log10(df.total_pop), df.last_first_diff, s=1)

In [68]:
# general directionality counts -- looking for overall asymmetry

df.groupby(np.sign(df.last_first_diff)).count()


Out[68]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
last_first_diff
-1 4890 4890 4890 4890 4890 4890
0 24 24 24 24 24 24
1 4738 4738 4738 4738 4738 4738

3 rows × 6 columns


In [69]:
# let's concentrate on more populous names that have seen big swings in the cumulative p_m

# you can play with the population and range filter
popular_names_with_shifts = df[(df.total_pop>5000) & (df.abs_min_max_range >0.7)]
popular_names_with_shifts.sort_index(by="abs_min_max_range", ascending=False)


Out[69]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
name
Hailey 123318 -0.998151 -0.998151 0.998151 0.001849 0.003698
Abbey 15854 -0.997792 -0.997802 0.997802 0.002208 0.004415
Summer 64702 -0.997002 -0.997002 0.997002 0.002998 0.005997
Raegan 9744 -0.990148 -0.995873 0.995873 0.009852 0.019704
Bria 11160 -0.995072 -0.995072 0.995072 0.004928 0.009857
Fallon 7476 -0.972311 -0.994122 0.994122 0.027689 0.055377
Chanel 14087 -0.993966 -0.993966 0.993966 0.006034 0.012068
Star 6684 -0.983543 -0.993738 0.993738 0.016457 0.032914
Holly 196587 -0.992161 -0.992161 0.992161 0.007839 0.015678
Nigel 10501 0.991906 0.991906 0.991906 0.991906 0.016189
Michele 225226 -0.988372 -0.991136 0.991136 0.011628 0.023257
Nova 6899 -0.930135 -0.990991 0.990991 0.069865 0.139730
Ronda 34628 -0.989633 -0.989633 0.989633 0.010367 0.020735
Paige 122569 -0.989198 -0.989198 0.989198 0.010802 0.021604
Brooke 173658 -0.988489 -0.988489 0.988489 0.011511 0.023022
Beverly 380492 -0.987824 -0.987824 0.987824 0.012176 0.024353
Lauren 450853 -0.987302 -0.987302 0.987302 0.012698 0.025396
Alexus 17835 -0.987272 -0.987286 0.987286 0.012728 0.025456
Allison 262727 -0.985826 -0.985826 0.985826 0.014174 0.028349
Cordell 9464 0.984362 0.984362 0.984362 0.984362 0.031276
Lauri 11199 -0.983302 -0.983302 0.983302 0.016698 0.033396
Joy 131572 -0.981827 -0.981862 0.981862 0.018173 0.036345
Ashley 832350 -0.981496 -0.981496 0.981496 0.018504 0.037008
Lyric 8899 -0.838409 -0.980916 0.980916 0.161591 0.323182
Christy 99452 -0.980734 -0.980760 0.980760 0.019266 0.038531
Kenna 7979 -0.980323 -0.980659 0.980659 0.019677 0.039353
Tyrese 7582 0.980480 0.980480 0.980480 0.980480 0.039040
Robby 10399 0.979229 0.979229 0.979229 0.979229 0.041542
Mallory 48990 -0.977648 -0.977648 0.977648 0.022352 0.044703
Madison 308970 -0.976341 -0.976341 0.976341 0.023659 0.047319
Jermaine 39286 0.975386 0.975386 0.975386 0.975386 0.049229
Shelly 86081 -0.974570 -0.974570 0.974570 0.025430 0.050859
Carley 14885 -0.972455 -0.972455 0.972455 0.027545 0.055089
Lacey 47635 -0.969770 -0.969770 0.969770 0.030230 0.060460
Ainsley 8817 -0.968357 -0.968357 0.968357 0.031643 0.063287
Santana 7399 0.422760 0.966667 0.966667 0.422760 0.845520
Kelsey 144166 -0.964811 -0.964811 0.964811 0.035189 0.070377
Ansley 7202 -0.964315 -0.964315 0.964315 0.035685 0.071369
Ronnie 186260 0.960781 0.964046 0.964046 0.960781 0.078439
Kay 101704 -0.962479 -0.962605 0.962605 0.037521 0.075041
Delaney 27608 -0.962402 -0.962402 0.962402 0.037598 0.075196
Lindsay 131956 -0.961351 -0.961351 0.961351 0.038649 0.077298
Lesly 12407 -0.959942 -0.959942 0.959942 0.040058 0.080116
Marquise 9308 0.957886 0.957886 0.957886 0.957886 0.084229
Kenzie 8793 -0.956443 -0.956443 0.956443 0.043557 0.087115
Hillary 28763 -0.956159 -0.956159 0.956159 0.043841 0.087682
Mckenzie 44315 -0.953988 -0.953988 0.953988 0.046012 0.092023
Linsey 5138 -0.953095 -0.953095 0.953095 0.046905 0.093811
Lindsey 159977 -0.952274 -0.952274 0.952274 0.047726 0.095451
Shamar 5093 0.951109 0.951109 0.951109 0.951109 0.097781
Kinsey 5800 -0.951034 -0.951034 0.951034 0.048966 0.097931
Sydney 156602 -0.943922 -0.943922 0.943922 0.056078 0.112157
Kimber 5455 -0.943538 -0.943538 0.943538 0.056462 0.112924
Raven 37100 -0.927143 -0.943274 0.943274 0.072857 0.145714
Meredith 73898 -0.942502 -0.942502 0.942502 0.057498 0.114996
Cassidy 49871 -0.941349 -0.941349 0.941349 0.058651 0.117303
Whitney 98164 -0.940701 -0.940701 0.940701 0.059299 0.118597
Richie 6540 0.938532 0.938532 0.938532 0.938532 0.122936
Diamond 32377 -0.936776 -0.936776 0.936776 0.063224 0.126448
Gay 19363 -0.928678 -0.928678 0.928678 0.071322 0.142643
... ... ... ... ... ...

150 rows × 6 columns


In [70]:
popular_names_with_shifts.groupby(np.sign(df.last_first_diff)).count()


Out[70]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
last_first_diff
-1 116 116 116 116 116 116
1 34 34 34 34 34 34

2 rows × 6 columns


In [ ]:
#popular_names_with_shifts.to_pickle('popular_names_with_shifts.pickle')

In [71]:
fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'))
x = np.log10(popular_names_with_shifts.total_pop)
y = popular_names_with_shifts.min_max_range 

scatter = ax.scatter(x, y)

ax.grid(color='white', linestyle='solid')
ax.set_title("Populous Names with Major Sex Shift", size=20)
ax.set_xlabel('log10(total_pop)')
ax.set_ylabel('min_max_range')

#labels = ['point {0}'.format(i + 1) for i in range(len(x))]
labels = list(popular_names_with_shifts.index)
tooltip = plugins.PointLabelTooltip(scatter, labels=labels)
plugins.connect(fig, tooltip)



In [ ]:
prop_c_male('Leslie').plot()

Prabha's Experimentation with Baby Names

Analyzing the variation/trend in names by first letter


In [72]:
get_first_letter = lambda x:x[0]
first_letters = names.name.map(get_first_letter)
first_letters.name = 'first_letter'


first_letter_trend = names.pivot_table('births', rows='year', cols=[first_letters,'sex'], aggfunc=sum)
first_letter_trend.head()


Out[72]:
first_letter A B C D E F G H I J
sex F M F M F M F M F M F M F M F M F M F M
year
1880 9334 7406 3874 2115 5868 9949 2218 2488 11444 6894 2957 6529 2463 6274 2743 7599 2480 947 3801 22272 ...
1881 9405 6852 4013 1993 5661 9047 2299 2168 11742 6505 2875 5697 2621 5718 2630 7018 2456 955 3813 20313 ...
1882 11001 7789 4824 2280 6454 10211 2557 2461 13771 7478 3512 6526 3054 6412 3192 7847 2788 932 4491 22419 ...
1883 11632 7199 5194 2091 6857 9727 2709 2324 14449 6907 3614 6032 3210 5837 3373 7482 2890 860 4612 20428 ...
1884 13324 7574 6005 2362 7919 10344 3060 2388 16465 7548 4196 6625 3790 6926 3973 8003 3389 1030 5239 22175 ...

5 rows × 52 columns
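
Note on pandas versions: the `rows`/`cols` keywords used above belong to older pandas releases; newer releases spell them `index`/`columns`. The following is a minimal sketch of the same first-letter pivot with the newer keywords. The `toy` frame and its values are made up for illustration and are not taken from the names dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the `names` frame
toy = pd.DataFrame({
    'name':   ['Anna', 'Adam', 'Beth', 'Anna', 'Bob'],
    'sex':    ['F', 'M', 'F', 'F', 'M'],
    'births': [10, 7, 4, 12, 3],
    'year':   [1880, 1880, 1880, 1881, 1881],
})

# First letter of each name, as a named Series aligned with `toy`
first_letter = toy['name'].str[0].rename('first_letter')

# Years down the rows, (first_letter, sex) pairs across the columns
toy_trend = toy.pivot_table('births', index='year',
                            columns=[first_letter, 'sex'], aggfunc='sum')
print(toy_trend)
```

Combinations that never occur in a given year (for example, no 'A'/'M' births in 1881 in the toy data) show up as NaN.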


In [73]:
first_letter_trend['A'].plot()


Out[73]:
<matplotlib.axes.AxesSubplot at 0x110b23710>

Finding the trend of starting letters by year:


In [74]:
yearwise_first_letter_trend = names.pivot_table('births', rows=first_letters, cols=['sex','year'], aggfunc=sum)

#trending of names starting with 'A' 
# first_letter_trend.plot()
yearwise_first_letter_trend.head()


Out[74]:
sex F
year 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899
first_letter
A 9334 9405 11001 11632 13324 13437 14416 14836 17729 17689 18601 17964 20241 20150 20983 21720 22008 21265 23067 20950 ...
B 3874 4013 4824 5194 6005 6340 6990 7110 8775 8744 9111 8917 10085 9985 10243 10923 10882 10699 11864 10484 ...
C 5868 5661 6454 6857 7919 8164 8412 8605 10412 10257 10670 10033 11490 11304 11693 12027 11945 11727 13035 11769 ...
D 2218 2299 2557 2709 3060 3031 3231 3144 3852 3732 3995 3923 4427 4474 4930 5118 5458 5531 6365 5741 ...
E 11444 11742 13771 14449 16465 17379 18825 19140 23258 23244 24489 24258 27331 28128 29335 30630 31026 30126 32512 28845 ...

5 rows × 262 columns


In [113]:
#plotting for all the first_letters and years
yearwise_first_letter_trend.plot(legend=False)


Out[113]:
<matplotlib.axes.AxesSubplot at 0x121f9f590>

In [108]:
# yearwise_first_letter_trend.sum()

Analyzing the trend at 40- to 50-year intervals: 1880, 1930, 1970, 2010


In [77]:
# Let us analyze the trend at 40- to 50-year intervals: 1880, 1930, 1970, 2010
interval_yearwise_first_letter_trend = yearwise_first_letter_trend.reindex(columns = [1880,1930,1970, 2010], level = 'year')
interval_yearwise_first_letter_trend.head()
letter_prop = interval_yearwise_first_letter_trend/interval_yearwise_first_letter_trend.sum().astype(float)
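
The last line above divides each column by its own total, so every (sex, year) column of `letter_prop` sums to 1. A small sketch of that normalization on a made-up frame (the `counts` values are hypothetical):

```python
import pandas as pd

# Hypothetical counts: rows are first letters, columns are years
counts = pd.DataFrame({1880: [30, 10], 2010: [20, 60]}, index=['A', 'B'])

# counts.sum() gives one total per column; dividing aligns on column labels
prop = counts / counts.sum().astype(float)
print(prop)
```

Each column of `prop` now holds the share of births per first letter for that year, which makes years with very different total births comparable.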

In [78]:
mpld3.disable_notebook()
import matplotlib.pyplot as plt
# letter_prop = yearwise_first_letter_trend/yearwise_first_letter_trend.sum().astype(float)
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
# yearwise_first_letter_trend[2010].plot()


Out[78]:
<matplotlib.axes.AxesSubplot at 0x117064510>

Measuring sex shift on various dimensions in top ambi-names (using proportional/normalized populations)


In [80]:
# some metrics that attempt to measure how a time series s has changed

def min_max_range(s):
    """signed range of s -- positive if the max occurs after the min, negative
    if it occurs before; 0 if both fall at the same position"""
    # note np.argmax, np.argmin return the position of the first occurrence of the global max, min
    sign = np.sign(np.argmax(s) - np.argmin(s))
    if sign == 0:
        return 0.0
    else:
        return sign*(np.max(s) - np.min(s))

def last_first_diff(s):
    """difference between latest and earliest value"""
    s0 = s.dropna()
    return (s0.iloc[-1] - s0.iloc[0])
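
To see how the two metrics behave, here is a small sketch on made-up series. It restates the functions so that it runs on its own; converting to a NumPy array inside `min_max_range` is an assumption added here so that `argmax`/`argmin` return positions regardless of pandas version. The toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def min_max_range(s):
    """Signed range: positive if the max comes after the min, negative if before."""
    a = np.asarray(s, dtype=float)  # assumption: work on positions, not labels
    sign = np.sign(np.argmax(a) - np.argmin(a))
    if sign == 0:
        return 0.0
    return sign * (np.max(a) - np.min(a))

def last_first_diff(s):
    """Difference between the latest and earliest non-missing value."""
    s0 = s.dropna()
    return s0.iloc[-1] - s0.iloc[0]

rising = pd.Series([0.1, 0.2, 0.9])        # min first, max last
falling = pd.Series([0.8, 0.9, 0.2, 0.3])  # max before min
flat = pd.Series([0.5, 0.5, 0.5])          # max and min at the same position
gappy = pd.Series([np.nan, 0.2, 0.6, np.nan])

print(min_max_range(rising))    # about 0.8
print(min_max_range(falling))   # about -0.7
print(min_max_range(flat))      # 0.0
print(last_first_diff(falling)) # about -0.5
print(last_first_diff(gappy))   # about 0.4, NaNs are dropped first
```

For `falling`, the two metrics disagree in magnitude: `min_max_range` reports the full swing, while `last_first_diff` only compares the endpoints.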

In [81]:
total_pop_ambiname = all_births.sum()[np.in1d(all_births.sum().index, ambi_names_array)]
total_pop_ambiname.sort(ascending=False)

In [102]:
top5_ambi_data = DataFrame()
top5_ambi_data['total_pop'] = total_pop_ambiname
top5_ambi_data['last_first_diff'] = p_m_cum.apply(last_first_diff)
top5_ambi_data['min_max_range'] = p_m_cum.apply(min_max_range)
top5_ambi_data['abs_min_max_range'] = np.abs(top5_ambi_data.min_max_range)
top5_ambi_data['p_m'] = p_m_cum.iloc[-1]

# closeness to full ambigender -- p_m=0.5 gives 1; p_m=1 or 0 gives 0
top5_ambi_data['ambi_index'] = top5_ambi_data.p_m.apply(lambda p: 1 - 2* np.abs(p-0.5))
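
The ambi_index formula is 1 - 2*|p_m - 0.5|. A quick worked check, using p_m values that appear in the output tables of this notebook (the helper name `ambi_index` is introduced here only for illustration):

```python
import numpy as np

def ambi_index(p):
    """1 when p = 0.5 (evenly split), 0 when p is 0 or 1 (single sex)."""
    return 1 - 2 * np.abs(p - 0.5)

# p_m values taken from the tables in this notebook
for name, p_m in [('Challie', 0.5), ('Krish', 0.995169), ('Hailey', 0.001849)]:
    print(name, round(ambi_index(p_m), 6))
```

Because the index is symmetric around 0.5, a strongly male name such as Krish and a strongly female name such as Hailey both land close to 0.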

Analyzing the ambi-names with the largest last-year-to-first-year difference


In [84]:
# sorting the ambi-names by last-year-to-first-year difference, largest first
top5_ambi_data.sort_index(by='last_first_diff', ascending=False).head()


Out[84]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
name
Krish 2070 0.995169 0.995169 0.995169 0.995169 0.009662
Lydell 1757 0.994308 0.994308 0.994308 0.994308 0.011383
Nicco 638 0.992163 0.992163 0.992163 0.992163 0.015674
Nigel 10501 0.991906 0.991906 0.991906 0.991906 0.016189
Jawan 1329 0.990971 0.990971 0.990971 0.990971 0.018059

5 rows × 6 columns


In [92]:
# we see that Krish has the largest last-to-first difference.
# let us analyze Krish


names_ambi = names[np.in1d(names.name,ambi_names_array)]
ambi_names_pt = names_ambi.pivot_table('births',
                            rows='year',
                            cols=['name','sex'],
                            aggfunc='sum')

ambi_names_pt= ambi_names_pt.fillna(0L)

In [93]:
normalized_ambi_names =  ambi_names_pt.div(ambi_names_pt.sum(1),axis=0)
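
The `div(..., axis=0)` call divides each row by that row's total, so every value is a share of all ambi-name births in that year (not a share within a single name). A minimal sketch on a made-up pivot table; the names and counts below are hypothetical:

```python
import pandas as pd

# Hypothetical pivot: years as rows, (name, sex) pairs as columns
cols = pd.MultiIndex.from_tuples(
    [('Ann', 'F'), ('Ann', 'M'), ('Lee', 'F'), ('Lee', 'M')],
    names=['name', 'sex'])
toy_pt = pd.DataFrame([[6, 0, 1, 3],
                       [8, 2, 5, 5]], index=[2009, 2010], columns=cols)

# Row totals, then divide each row by its own total
toy_norm = toy_pt.div(toy_pt.sum(axis=1), axis=0)
print(toy_norm)

# Selecting one name keeps only its F and M columns
print(toy_norm['Lee'])
```

Each row of `toy_norm` sums to 1, which is why the values for any single name in the real table are very small.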

In [94]:
normalized_ambi_names.tail()


Out[94]:
name Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin
sex F M F M F M F M F M F M F M F M F M F M
year
2006 0.000000 0.000016 0.000001 0.000021 0.000000 0.000005 0.000001 0.000001 0.001121 0.000000 0.000001 0.000000 0.000000 0.000008 0 0.000000 0.000000 0.000000 0.000003 0.000004 ...
2007 0.000000 0.000046 0.000000 0.000022 0.000000 0.000008 0.000002 0.000003 0.001180 0.000000 0.000003 0.000003 0.000000 0.000008 0 0.000001 0.000000 0.000002 0.000002 0.000006 ...
2008 0.000000 0.000293 0.000000 0.000023 0.000003 0.000017 0.000002 0.000005 0.001238 0.000000 0.000002 0.000003 0.000000 0.000009 0 0.000000 0.000000 0.000000 0.000003 0.000005 ...
2009 0.000002 0.000404 0.000000 0.000024 0.000002 0.000024 0.000002 0.000004 0.001389 0.000000 0.000000 0.000003 0.000000 0.000009 0 0.000002 0.000002 0.000002 0.000000 0.000006 ...
2010 0.000000 0.000149 0.000000 0.000018 0.000000 0.000013 0.000000 0.000005 0.001542 0.000002 0.000003 0.000003 0.000002 0.000010 0 0.000004 0.000000 0.000002 0.000002 0.000007 ...

5 rows × 19304 columns


In [97]:
#plotting for Krish
normalized_ambi_names['Krish'].plot()
"""Observation for 'Krish'
    It is interesting to note that the name has shifted from female to male. Apparently, Krish became popular only after the 1980s.
"""


Out[97]:
"Observation for 'Krish'\n    It is interesting to note that the name has shifted from female to male. Apparently, Krish became popular only after the 1980s.\n"

In [98]:
#plotting for Lydell
normalized_ambi_names['Lydell'].plot()
"""Observation for 'Lydell'
    We see that the name has shifted completely from female to male.
"""


Out[98]:
"Observation for 'Lydell'\n    We see that the name has shifted completely from female to male.\n"

In [99]:
top5_ambi_data.sort_index(by='last_first_diff', ascending=False).tail()


Out[99]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
name
Sherell 1479 -0.996619 -0.996619 0.996619 0.003381 0.006761
Summer 64702 -0.997002 -0.997002 0.997002 0.002998 0.005997
Lindsy 2039 -0.997548 -0.997548 0.997548 0.002452 0.004904
Abbey 15854 -0.997792 -0.997802 0.997802 0.002208 0.004415
Hailey 123318 -0.998151 -0.998151 0.998151 0.001849 0.003698

5 rows × 6 columns


In [109]:
#plotting for Hailey
normalized_ambi_names['Hailey'].plot()
"""Observation for 'Hailey'
    We see that the name has increasingly been given to female babies.
"""


Out[109]:
"Observation for 'Hailey'\n    We see that the name has increasingly been given to female babies.\n"

Analyzing the ambi-names with the highest ambi_index


In [101]:
top5_ambi_data.sort_index(by='ambi_index', ascending=False).head()


Out[101]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
name
Challie 10 -0.5 -0.500000 0.500000 0.5 1
Keagyn 22 -0.5 -0.500000 0.500000 0.5 1
Ashdyn 12 0.5 0.500000 0.500000 0.5 1
Callaway 196 0.5 0.507143 0.507143 0.5 1
Mizan 12 0.5 0.500000 0.500000 0.5 1

5 rows × 6 columns


In [103]:
#plotting for Challie
normalized_ambi_names['Challie'].plot()
"""Observation for 'Challie'
    Very interesting plot
"""


Out[103]:
"Observation for 'Challie'\n    Very interesting plot\n"

In [104]:
top5_ambi_data.sort_index(by='ambi_index', ascending=False).tail()


Out[104]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
name
Glenna 25493 0.000196 0.000412 0.000412 0.000196 0.000392
Dorothea 33626 0.000178 0.000210 0.000210 0.000178 0.000357
Leila 35283 0.000170 0.000453 0.000453 0.000170 0.000340
Therese 34628 0.000144 0.000712 0.000712 0.000144 0.000289
Annabelle 34998 0.000143 0.000154 0.000154 0.000143 0.000286

5 rows × 6 columns


In [107]:
#plotting for Annabelle
normalized_ambi_names['Annabelle'].plot()
"""Observation for 'Annabelle'
    Annabelle is a hot name now :) very trending
"""


Out[107]:
"Observation for 'Annabelle'\n    Annabelle is a hot name now :) very trending\n"