Goals

This notebook focuses on baby names that have been given to both males and females.


In [1]:
%matplotlib inline

In [2]:
import matplotlib.pyplot as plt
import numpy as np

from pylab import figure, show

from pandas import DataFrame, Series
import pandas as pd

In [3]:
try:
    import mpld3
    from mpld3 import enable_notebook
    from mpld3 import plugins
    enable_notebook()
except Exception as e:
    print "Attempt to import and enable mpld3 failed", e

In [4]:
# what would seaborn do?
try:
    import seaborn as sns
except Exception as e:
    print "Attempt to import and enable seaborn failed", e



Preliminaries: Assumed location of pydata-book files

To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from

https://github.com/pydata/pydata-book

in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"

and then symbolically linked (ln -s) to the pydata-book directory from the root directory of the working-open-data folder, i.e., on OS X

cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book

That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.

With this arrangement, I should be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths.


In [5]:
import os

NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")

assert os.path.exists(NAMES_DIR)

Please make sure the above assertion works.
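If the assertion fails, it helps to see which absolute path was actually tried. The following is a minimal sketch using only the standard library; the helper name `require_dir` is hypothetical and not part of the original notebook, and the demonstration runs against a temporary directory rather than the real NAMES_DIR.

```python
import os
import tempfile


def require_dir(path):
    """Return the absolute form of `path`, or raise if it is not a directory."""
    resolved = os.path.abspath(path)
    if not os.path.isdir(resolved):
        # Report the resolved path so a broken symlink is easy to spot.
        raise IOError("expected a directory at %s" % resolved)
    return resolved


# Demonstrate on a throwaway directory so the example runs anywhere.
demo_dir = tempfile.mkdtemp()
checked = require_dir(demo_dir)
```

In the notebook, this would be called as `require_dir(NAMES_DIR)` in place of the bare assert.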

Baby names dataset

Discussed on p. 35 of the PfDA book.

To download all the data, including 2011 and 2012, see Popular Baby Names, which also includes state-by-state data.

Loading all data into Pandas


In [6]:
# show the first five files in the NAMES_DIR

import glob
glob.glob(NAMES_DIR + "/*")[:5]


Out[6]:
['../pydata-book/ch02/names/NationalReadMe.pdf',
 '../pydata-book/ch02/names/yob1880.txt',
 '../pydata-book/ch02/names/yob1881.txt',
 '../pydata-book/ch02/names/yob1882.txt',
 '../pydata-book/ch02/names/yob1883.txt']

In [7]:
# 2010 is the last available year in the pydata-book repo
import os

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

# describe() reports floats even for integer columns, because statistics such as mean and std are not integral
names.describe()


Out[7]:
               births            year
count  1690784.000000  1690784.000000
mean       190.682386     1969.454384
std       1615.899711       32.823526
min          5.000000     1880.000000
25%          7.000000     1946.000000
50%         12.000000     1979.000000
75%         32.000000     1997.000000
max      99651.000000     2010.000000

8 rows × 2 columns
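The read-and-concatenate pattern above can be tried without the pydata-book files. This is a small sketch under stated assumptions: the two in-memory "files" and their rows are made-up stand-ins for yob1880.txt and yob1881.txt, and `toy_names` is a hypothetical name. It also shows that the stored columns stay integers while describe() reports floats.

```python
import io

import pandas as pd

# Made-up stand-ins for two yob<year>.txt files (name,sex,births per line).
fake_files = {
    1880: "Mary,F,70\nAnna,F,26\nJohn,M,96\n",
    1881: "Mary,F,69\nJohn,M,87\n",
}

pieces = []
for year, text in sorted(fake_files.items()):
    frame = pd.read_csv(io.StringIO(text), names=["name", "sex", "births"])
    frame["year"] = year  # the files do not carry the year, so add it
    pieces.append(frame)

# ignore_index=True discards the per-file row numbers and renumbers from 0.
toy_names = pd.concat(pieces, ignore_index=True)

# The stored column is integer; the summary statistics are floats.
stored_dtype = toy_names["births"].dtype
summary_dtype = toy_names.describe()["births"].dtype
```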


In [8]:
# how many people, names, males and females  represented in names?

names.births.sum()


Out[8]:
322402727

In [9]:
# F vs M

names.groupby('sex')['births'].sum()


Out[9]:
sex
F      159990140
M      162412587
Name: births, dtype: int64

In [10]:
# total number of names

len(names.groupby('name'))


Out[10]:
88496

In [11]:
# use pivot_table to collect records by year (rows) and sex (columns)

total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births.head()


Out[11]:
sex        F       M
year
1880   90993  110493
1881   91955  100748
1882  107851  113687
1883  112322  104632
1884  129021  114445

5 rows × 2 columns


In [12]:
# You can use groupby to get the equivalent pivot_table calculation

names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births']


Out[12]:
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445
1885 133056 107802
1886 144538 110785
1887 145983 101412
1888 178631 120857
1889 178369 110590
1890 190377 111026
1891 185486 101198
1892 212350 122038
1893 212908 112319
1894 222923 115775
1895 233632 117398
1896 237924 119575
1897 234199 112760
1898 258771 122703
1899 233022 106218
1900 299873 150554
1901 239351 106478
1902 264079 122660
1903 261976 119240
1904 275375 128129
1905 291641 132319
1906 295301 133159
1907 318558 146838
1908 334277 154339
1909 347191 163983
1910 396416 194198
1911 418180 225936
1912 557939 429926
1913 624317 512482
1914 761376 654746
1915 983824 848647
1916 1044249 890142
1917 1081194 925512
1918 1157585 1013720
1919 1130149 980215
1920 1198214 1064468
1921 1232845 1101374
1922 1200796 1088380
1923 1206239 1096227
1924 1248821 1132671
1925 1217217 1115798
1926 1185078 1110440
1927 1192207 1126259
1928 1152836 1107113
1929 1116284 1074833
1930 1125521 1096663
1931 1064233 1038586
1932 1066930 1043512
1933 1007523 990677
1934 1043879 1031962
1935 1048264 1040649
1936 1040068 1036662
1937 1063722 1065964
1938 1103173 1108480
1939 1096394 1106328
... ...

131 rows × 2 columns
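Later pandas releases renamed the `rows=` and `cols=` keywords of pivot_table to `index=` and `columns=`. The sketch below uses the newer spelling on made-up toy data (the frame `toy` and its values are assumptions for illustration) and compares it with a single groupby over both keys followed by unstack, which avoids the nested groupby inside apply.

```python
import pandas as pd

# Toy data with the same columns as `names` (values are made up).
toy = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John", "Anna"],
    "sex": ["F", "M", "F", "M", "F"],
    "births": [10, 20, 30, 40, 5],
    "year": [1880, 1880, 1881, 1881, 1881],
})

# pivot_table with the newer keyword names (index/columns instead of rows/cols).
by_pivot = toy.pivot_table(values="births", index="year", columns="sex",
                           aggfunc="sum")

# The same table from one groupby over both keys, then moving sex to columns.
by_groupby = toy.groupby(["year", "sex"])["births"].sum().unstack()
```

Both results have years as rows and F/M as columns; on the toy data the 1881 F entry is 35 (Mary 30 plus Anna 5).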


In [13]:
import seaborn

In [14]:
# how to calculate the total births / year

names.groupby('year').sum().plot(title="total births by year")


Out[14]:
<matplotlib.axes.AxesSubplot at 0x114842450>

In [15]:
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].plot(title="births (M/F) by year")


Out[15]:
<matplotlib.axes.AxesSubplot at 0x110af8ed0>

In [16]:
# from book: add prop to names

def add_prop(group):
    # Cast to float so the division below does not floor (Python 2 integer division)
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)

In [17]:
# verify prop --> all adds up to 1

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)


Out[17]:
True
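An alternative to the apply-based add_prop is groupby transform, which returns the group total aligned to each original row. This is a sketch on made-up data (the frame `toy` is an assumption), not the book's method.

```python
import numpy as np
import pandas as pd

# Toy data with the same columns as `names` (values are made up).
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [30, 10, 20, 50, 50],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# transform("sum") yields one group total per row, aligned with toy's index.
group_totals = toy.groupby(["year", "sex"])["births"].transform("sum")
toy["prop"] = toy["births"] / group_totals

# Same check as in the notebook: proportions sum to 1 within each group.
sums_to_one = np.allclose(toy.groupby(["year", "sex"])["prop"].sum(), 1)
```

For the first row, Mary has 30 of the 40 female births in 1880, so her prop is 0.75.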

In [18]:
# number of records in full names dataframe

len(names)


Out[18]:
1690784

How to do top1000 calculation

This section on the top1000 calculation is kept in here to provide some inspiration on how to work with baby names


In [19]:
#  from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply

names.groupby(['year', 'sex']).apply(lambda g: g.sort_index(by='births', ascending=False)[:1000])


Out[19]:
name sex births year prop
year sex
1880 F 0 Mary F 7065 1880 0.077643
1 Anna F 2604 1880 0.028618
2 Emma F 2003 1880 0.022013
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188
5 Margaret F 1578 1880 0.017342
6 Ida F 1472 1880 0.016177
7 Alice F 1414 1880 0.015540
8 Bertha F 1320 1880 0.014507
9 Sarah F 1288 1880 0.014155
10 Annie F 1258 1880 0.013825
11 Clara F 1226 1880 0.013474
12 Ella F 1156 1880 0.012704
13 Florence F 1063 1880 0.011682
14 Cora F 1045 1880 0.011484
15 Martha F 1040 1880 0.011429
16 Laura F 1012 1880 0.011122
17 Nellie F 995 1880 0.010935
18 Grace F 982 1880 0.010792
19 Carrie F 949 1880 0.010429
20 Maude F 858 1880 0.009429
21 Mabel F 808 1880 0.008880
22 Bessie F 794 1880 0.008726
23 Jennie F 793 1880 0.008715
24 Gertrude F 787 1880 0.008649
25 Julia F 783 1880 0.008605
26 Hattie F 769 1880 0.008451
27 Edith F 768 1880 0.008440
28 Mattie F 704 1880 0.007737
29 Rose F 700 1880 0.007693
30 Catherine F 688 1880 0.007561
31 Lillian F 672 1880 0.007385
32 Ada F 652 1880 0.007165
33 Lillie F 647 1880 0.007110
34 Helen F 636 1880 0.006990
35 Jessie F 635 1880 0.006979
36 Louise F 635 1880 0.006979
37 Ethel F 633 1880 0.006957
38 Lula F 621 1880 0.006825
39 Myrtle F 615 1880 0.006759
40 Eva F 614 1880 0.006748
41 Frances F 605 1880 0.006649
42 Lena F 603 1880 0.006627
43 Lucy F 591 1880 0.006495
44 Edna F 588 1880 0.006462
45 Maggie F 582 1880 0.006396
46 Pearl F 569 1880 0.006253
47 Daisy F 564 1880 0.006198
48 Fannie F 560 1880 0.006154
49 Josephine F 544 1880 0.005978
50 Dora F 524 1880 0.005759
51 Rosa F 507 1880 0.005572
52 Katherine F 502 1880 0.005517
53 Agnes F 473 1880 0.005198
54 Marie F 471 1880 0.005176
55 Nora F 471 1880 0.005176
56 May F 462 1880 0.005077
57 Mamie F 436 1880 0.004792
58 Blanche F 427 1880 0.004693
59 Stella F 414 1880 0.004550
... ... ... ... ...

261877 rows × 5 columns


In [20]:
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()


Out[20]:
                 name sex  births  year      prop
year sex
1880 F   0       Mary   F    7065  1880  0.077643
         1       Anna   F    2604  1880  0.028618
         2       Emma   F    2003  1880  0.022013
         3  Elizabeth   F    1939  1880  0.021309
         4     Minnie   F    1746  1880  0.019188

5 rows × 5 columns
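In later pandas releases `sort_index(by=...)` was replaced by `sort_values`. The sketch below shows one way to take the top N rows per (year, sex) group with the newer call; the toy frame and N = 2 are assumptions for illustration, and the result keeps a flat index rather than the hierarchical one produced by apply.

```python
import pandas as pd

# Toy data with the same columns as `names` (values are made up).
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "Will", "Mary", "Anna"],
    "sex": ["F", "F", "F", "M", "M", "F", "F"],
    "births": [70, 26, 20, 96, 95, 69, 27],
    "year": [1880, 1880, 1880, 1880, 1880, 1881, 1881],
})

N = 2  # the notebook uses 1000

# Sort once by births, keep the first N rows of each (year, sex) group,
# then restore a readable year/sex/births ordering.
top_n = (toy.sort_values("births", ascending=False)
            .groupby(["year", "sex"])
            .head(N)
            .sort_values(["year", "sex", "births"],
                         ascending=[True, True, False]))
```

Emma, the third most common female name in the toy 1880 data, is dropped because only two rows per group are kept.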


In [21]:
# Do pivot table: row: year and cols= names for top 1000

top_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=np.sum)
top_births.tail()


Out[21]:
name Aaden Aaliyah Aarav Aaron Aarush Ab Abagail Abb Abbey Abbie Abbigail Abbott Abby Abdiel Abdul Abdullah Abe Abel Abelardo Abigail
year
2006 NaN 3737 NaN 8279 NaN NaN 297 NaN 404 440 630 NaN 1682 NaN NaN 219 NaN 922 NaN 15615 ...
2007 NaN 3941 NaN 8914 NaN NaN 313 NaN 349 468 651 NaN 1573 NaN NaN 224 NaN 939 NaN 15447 ...
2008 955 4028 219 8511 NaN NaN 317 NaN 344 400 608 NaN 1328 199 NaN 210 NaN 863 NaN 15045 ...
2009 1265 4352 270 7936 NaN NaN 296 NaN 307 369 675 NaN 1274 229 NaN 256 NaN 960 NaN 14342 ...
2010 448 4628 438 7374 226 NaN 277 NaN 295 324 585 NaN 1140 264 NaN 225 NaN 1119 NaN 14124 ...

5 rows × 6865 columns


In [22]:
# is your name in the top_births list?

top_births['Raymond'].plot(title='plot for Raymond')


Out[22]:
<matplotlib.axes.AxesSubplot at 0x113383910>

In [23]:
# for Aaden, which shows up at the end

top_births.Aaden.plot(xlim=[1880,2010])


Out[23]:
<matplotlib.axes.AxesSubplot at 0x112435690>

In [24]:
# number of names represented in top_births

len(top_births.columns)


Out[24]:
6865

In [25]:
# how to get the most popular name of all time in top_births?

most_common_names = top_births.sum()
most_common_names.sort(ascending=False)

most_common_names.head()


Out[25]:
name
James      5071647
John       5060953
Robert     4787187
Michael    4263083
Mary       4117746
dtype: float64
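The in-place `Series.sort` used above is not available in later pandas releases; `sort_values` returns a new sorted Series instead. The toy table below (made-up values, hypothetical name `toy_births`) also illustrates why the result is float64: the NaN entries for years in which a name is outside the top list force a float dtype.

```python
import numpy as np
import pandas as pd

# Toy year-by-name table like top_births; NaN marks a year outside the top list.
toy_births = pd.DataFrame(
    {"James": [50.0, 60.0], "Mary": [70.0, 20.0], "Aaden": [np.nan, 5.0]},
    index=pd.Index([2009, 2010], name="year"),
)

# sum() skips NaN by default; sort_values returns a new, sorted Series.
most_common = toy_births.sum().sort_values(ascending=False)
```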

In [26]:
# as of mpld3 v0.1 (2014.03.04), the name labeling doesn't work -- so disable mpld3 for this figure

mpld3.disable_notebook()
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))


Out[26]:
<matplotlib.axes.AxesSubplot at 0x112a25b90>

In [27]:
# turn mpld3 back on

mpld3.enable_notebook()

all_births pivot table


In [28]:
# instead of top_births -- get all_births

all_births = names.pivot_table('births', rows='year', cols='name', aggfunc=sum)

In [29]:
all_births = all_births.fillna(0)
all_births.tail()


Out[29]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 0 0 0 9 0 14 55 0 5 0 0 74 11 0 0 17 0 0 42 7 ...
2007 5 0 0 8 8 13 155 0 0 0 10 72 15 10 0 31 7 0 43 10 ...
2008 0 0 5 6 22 13 955 0 0 0 9 76 20 22 0 24 5 0 51 10 ...
2009 6 0 0 9 23 16 1270 5 5 0 18 76 17 25 6 12 0 0 38 23 ...
2010 9 0 0 7 11 0 448 0 13 5 19 54 11 18 0 23 0 5 37 0 ...

5 rows × 88496 columns


In [31]:
# set up to do start/end calculation

all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)

In [32]:
all_births_cumsum.tail()


Out[32]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 0 5 0 103 5 67 149 5 11 0 0 171 175 10 0 67 5 0 153 18 ...
2007 5 5 0 111 13 80 304 5 11 0 10 243 190 20 0 98 12 0 196 28 ...
2008 5 5 5 117 35 93 1259 5 11 0 19 319 210 42 0 122 17 0 247 38 ...
2009 11 5 5 126 58 109 2529 10 16 0 37 395 227 67 6 134 17 0 285 61 ...
2010 20 5 5 133 69 109 2977 10 29 5 56 449 238 85 6 157 17 5 322 61 ...

5 rows × 88496 columns
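The notebook does not show the start/end calculation itself. One possible reading, which is an assumption since the original does not spell it out, is finding the first and last year in which each name was recorded. The sketch below does that from the cumulative sums on a made-up table; note that `DataFrame.cumsum()` gives the same result as applying `s.cumsum()` column by column. It assumes every name has at least one birth.

```python
import pandas as pd

# Toy year-by-name table like all_births after fillna(0) (values are made up).
toy_births = pd.DataFrame(
    {"Aaden": [0, 0, 55, 155], "Mary": [70, 69, 0, 0]},
    index=pd.Index([2004, 2005, 2006, 2007], name="year"),
)
toy_cumsum = toy_births.cumsum()

# First year with a birth: the cumulative sum becomes positive there.
# idxmax on a boolean table returns the label of the first True per column.
first_year = (toy_cumsum > 0).idxmax()

# Last year with a birth: the first year at which the cumulative sum
# reaches its final total.
last_year = (toy_cumsum == toy_cumsum.iloc[-1]).idxmax()
```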

Names that are both M and F


In [33]:
# remind ourselves of what's in names

names.head()


Out[33]:
        name sex  births  year      prop
0       Mary   F    7065  1880  0.077643
1       Anna   F    2604  1880  0.028618
2       Emma   F    2003  1880  0.022013
3  Elizabeth   F    1939  1880  0.021309
4     Minnie   F    1746  1880  0.019188

5 rows × 5 columns


In [34]:
# columns in names

names.columns


Out[34]:
Index([u'name', u'sex', u'births', u'year', u'prop'], dtype='object')

Calculating ambigendered names


In [35]:
# calculate set of male_only, female_only, ambigender names

def calc_of_sex_of_names():

    k = names.groupby('sex').apply(lambda s: set(list(s['name'])))
    male_only_names = k['M'] - k['F']
    female_only_names = k['F'] - k['M']
    ambi_names = k['F'] & k['M']  # intersection of the two sets
    return {'male_only_names': male_only_names, 
            'female_only_names': female_only_names,
            'ambi_names': ambi_names }
    
names_by_sex = calc_of_sex_of_names() 
ambi_names_array = np.array(list(names_by_sex['ambi_names']))

[(k, len(v)) for (k,v) in names_by_sex.items()]


Out[35]:
[('female_only_names', 51754),
 ('male_only_names', 27090),
 ('ambi_names', 9652)]
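The same classification can be cross-checked by counting how many distinct sexes are recorded for each name. This is a sketch on made-up data (the frame `toy` and its values are assumptions), compared against plain Python sets built the way the notebook does.

```python
import pandas as pd

# Toy data with the same columns as `names` (values are made up).
toy = pd.DataFrame({
    "name": ["Leslie", "Leslie", "Mary", "John", "Mary"],
    "sex": ["M", "F", "F", "M", "F"],
    "births": [8, 12, 70, 96, 69],
    "year": [1880, 1880, 1880, 1880, 1881],
})

# Count the distinct sexes recorded for each name; 2 means both M and F.
sexes_per_name = toy.groupby("name")["sex"].nunique()
ambi = set(sexes_per_name[sexes_per_name == 2].index)

# Set-based version, as in the notebook: names seen as F, names seen as M.
female = set(toy.loc[toy["sex"] == "F", "name"])
male = set(toy.loc[toy["sex"] == "M", "name"])
ambi_by_sets = female & male
```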

In [36]:
# total number of people in names
names.births.sum()


Out[36]:
322402727

In [37]:
# pivot table of ambigendered names to aggregate 

names_ambi = names[np.in1d(names.name,ambi_names_array)]
ambi_names_pt = names_ambi.pivot_table('births',
                            rows='year', 
                            cols=['name','sex'], 
                            aggfunc='sum')

In [38]:
# total number of people in ambi_names_pt -- almost everyone!

ambi_names_pt.sum().sum()


Out[38]:
299879378.0

In [40]:
# fill n/a with 0 and look at the table at the end

ambi_names_pt=ambi_names_pt.fillna(0L)
ambi_names_pt.tail()


Out[40]:
name Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin
sex F M F M F M F M F M F M F M F M F M F M
year
2006 0 55 5 69 0 16 5 5 3737 0 5 0 0 26 0 0 0 0 10 12 ...
2007 0 155 0 72 0 27 8 10 3941 0 10 10 0 26 0 5 0 6 6 20 ...
2008 0 955 0 76 9 56 5 15 4028 0 5 9 0 29 0 0 0 0 9 16 ...
2009 5 1265 0 76 7 76 7 12 4352 0 0 8 0 28 0 6 6 7 0 19 ...
2010 0 448 0 54 0 38 0 15 4628 6 8 8 5 30 0 11 0 5 7 21 ...

5 rows × 19304 columns


In [41]:
# cumulative number of males with ambigendered names over time
ambi_names_pt.T.xs('M',level='sex').sum().cumsum()


Out[41]:
year
1880     106651
1881     204087
1882     313916
1883     415179
1884     525828
1885     630369
1886     737903
1887     836292
1888     953442
1889    1060938
1890    1168749
1891    1267012
1892    1385329
1893    1494578
1894    1606974
...
1996    130749651
1997    132512323
1998    134292862
1999    136076209
2000    137891972
2001    139680202
2002    141462692
2003    143272468
2004    145083717
2005    146897504
2006    148753457
2007    150617962
2008    152440177
2009    154199909
2010    155887704
Length: 131, dtype: float64

In [42]:
ambi_names_pt.T.xs('F',level='sex').sum().cumsum()


Out[42]:
year
1880      85843
1881     172815
1882     274572
1883     380536
1884     501868
1885     626787
1886     762777
1887     899953
1888    1067632
1889    1235155
1890    1413925
1891    1588115
1892    1787380
1893    1987334
1894    2196434
...
1996    123666074
1997    125136589
1998    126618849
1999    128096077
2000    129593542
2001    131064314
2002    132524950
2003    133994760
2004    135461519
2005    136920409
2006    138398341
2007    139872628
2008    141304691
2009    142678345
2010    143991674
Length: 131, dtype: float64

In [43]:
# don't know why pivot table has type float
# https://github.com/pydata/pandas/issues/3283
ambi_names_pt['Raymond', 'M'].dtype


Out[43]:
dtype('float64')

In [44]:
# calculate proportion of males for given name

def prop_male(name):
    return (ambi_names_pt[name]['M']/ \
    ((ambi_names_pt[name]['M'] + ambi_names_pt[name]['F'])))

def prop_c_male(name):
    return (ambi_names_pt[name]['M'].cumsum()/ \
    ((ambi_names_pt[name]['M'].cumsum() + ambi_names_pt[name]['F'].cumsum())))

In [45]:
prop_c_male('Leslie').plot()


Out[45]:
<matplotlib.axes.AxesSubplot at 0x112a0c7d0>

In [46]:
# I couldn't figure out a way of iterating over the names rather than names/sex combo in
# a vectorized way.  

from itertools import islice

names_to_calc = list(islice(list(ambi_names_pt.T.index.levels[0]),None))

m = [(name_, ambi_names_pt[name_]['M']/(ambi_names_pt[name_]['F'] + ambi_names_pt[name_]['M']))  \
     for name_ in names_to_calc]
p_m_instant = DataFrame(dict(m))
p_m_instant.tail()


Out[46]:
Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin Aarion Aaris Aaron Aarya Aaryn Aba Abba Abbey Abbie Abbigail
year
2006 1.000000 0.932432 1.000000 0.500000 0.000000 0.000000 1.000000 NaN NaN 0.545455 0.730769 1.000000 0.997109 0.481481 0.595745 1 NaN 0 0 0 ...
2007 1.000000 1.000000 1.000000 0.555556 0.000000 0.500000 1.000000 1 1.000000 0.769231 0.794118 0.454545 0.997426 0.240506 0.518519 NaN NaN 0 0 0 ...
2008 1.000000 1.000000 0.861538 0.750000 0.000000 0.642857 1.000000 NaN NaN 0.640000 0.666667 NaN 0.996604 0.213333 0.480519 NaN NaN 0 0 0 ...
2009 0.996063 1.000000 0.915663 0.631579 0.000000 1.000000 1.000000 1 0.538462 1.000000 0.750000 NaN 0.995984 0.247312 0.406250 NaN NaN 0 0 0 ...
2010 1.000000 1.000000 1.000000 1.000000 0.001295 0.500000 0.857143 1 1.000000 0.750000 1.000000 NaN 0.996891 0.265306 0.340000 NaN NaN 0 0 0 ...

5 rows × 9652 columns


In [47]:
# similar calculation except instead of looking at the proportions for a given year only,
# we look at the cumulative number of male/female babies for given name

from itertools import islice

names_to_calc = list(islice(list(ambi_names_pt.T.index.levels[0]),None))

m = [(name_, ambi_names_pt[name_]['M'].cumsum()/(ambi_names_pt[name_]['F'].cumsum() + ambi_names_pt[name_]['M'].cumsum()))  \
     for name_ in names_to_calc]
p_m_cum = DataFrame(dict(m))
p_m_cum.tail()


Out[47]:
Aaden Aadi Aadyn Aalijah Aaliyah Aamari Aaren Aareon Aarian Aarin Aarion Aaris Aaron Aarya Aaryn Aba Abba Abbey Abbie Abbigail
year
2006 1.000000 0.970760 1.000000 0.461538 0.001677 0.289474 0.650694 0.500000 0.238095 0.500000 0.714667 0.52381 0.991825 0.481818 0.391437 0.185185 0.666667 0.002404 0.017656 0.000761 ...
2007 1.000000 0.979424 1.000000 0.477064 0.001494 0.362069 0.661783 0.600000 0.407407 0.512727 0.721271 0.50000 0.991925 0.418060 0.398068 0.185185 0.666667 0.002348 0.017220 0.000693 ...
2008 1.000000 0.984326 0.934783 0.519380 0.001344 0.416667 0.673349 0.600000 0.407407 0.518261 0.718245 0.50000 0.992003 0.377005 0.403777 0.185185 0.666667 0.002295 0.016863 0.000639 ...
2009 0.998023 0.987342 0.927602 0.533784 0.001213 0.475000 0.683790 0.677419 0.450000 0.533670 0.719647 0.50000 0.992064 0.351178 0.403912 0.185185 0.666667 0.002250 0.016547 0.000588 ...
2010 0.998320 0.988864 0.938224 0.576687 0.001221 0.479167 0.690450 0.761905 0.511111 0.543408 0.729787 0.50000 0.992131 0.336283 0.401305 0.185185 0.666667 0.002208 0.016280 0.000550 ...

5 rows × 9652 columns
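One way to avoid looping over names is to take a cross-section of the column MultiIndex on the sex level, which selects every name's M (or F) column at once. The sketch below assumes a table shaped like ambi_names_pt, with (name, sex) column pairs for every name; the toy values and the names `toy_pt`, `p_m_instant_toy`, and `p_m_cum_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy table shaped like ambi_names_pt: rows are years, columns are (name, sex).
columns = pd.MultiIndex.from_product(
    [["Aaren", "Leslie"], ["F", "M"]], names=["name", "sex"])
toy_pt = pd.DataFrame(
    [[0.0, 0.0, 10.0, 30.0],
     [5.0, 15.0, 30.0, 10.0]],
    index=pd.Index([1880, 1881], name="year"),
    columns=columns,
)

# Cross-sections on the sex level: each result has one column per name.
males = toy_pt.xs("M", axis=1, level="sex")
females = toy_pt.xs("F", axis=1, level="sex")

# Per-year proportion of males; 0/0 yields NaN for years with no births.
p_m_instant_toy = males / (males + females)

# Cumulative proportion of males up to and including each year.
p_m_cum_toy = males.cumsum() / (males.cumsum() + females.cumsum())
```

Because the division is done on whole tables, names are aligned by column label and no explicit iteration is needed.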


In [48]:
p_m_cum['Donnie'].plot()


Out[48]:
<matplotlib.axes.AxesSubplot at 0x11bdbd490>

In [54]:
# some metrics that attempt to measure how a time series s has changed

def min_max_range(s):
    """signed range of s -- positive if the global max occurs after the global min,
    negative if it occurs before; 0 if the two coincide"""
    # note np.argmax, np.argmin return the position of the first occurrence of the global max, min
    sign = np.sign(np.argmax(s) - np.argmin(s))
    if sign == 0:
        return 0.0
    else:
        return sign*(np.max(s) - np.min(s))

def last_first_diff(s):
    """difference between latest and earliest value"""
    s0 = s.dropna()
    return (s0.iloc[-1] - s0.iloc[0])
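To make the two metrics concrete, here is a worked sketch on made-up series. The functions `signed_range` and `last_minus_first` are hypothetical re-statements of the definitions above, written to drop missing values first and to work on plain positions, because the behavior of np.argmax on a Series with NaN differs between pandas versions.

```python
import numpy as np
import pandas as pd


def signed_range(s):
    """Max minus min of s, negated when the maximum comes before the minimum."""
    values = s.dropna().to_numpy()
    # np.argmax/np.argmin give the position of the first global max/min.
    sign = np.sign(int(np.argmax(values)) - int(np.argmin(values)))
    return float(sign * (values.max() - values.min()))


def last_minus_first(s):
    """Difference between the latest and earliest non-missing values."""
    s0 = s.dropna()
    return float(s0.iloc[-1] - s0.iloc[0])


# A made-up cumulative proportion-male series that drifts from male to female.
falling = pd.Series([np.nan, 1.0, 0.75, 0.25], index=[1880, 1881, 1882, 1883])
# A made-up series that rises, then dips slightly.
rising = pd.Series([0.125, 0.5, 0.375])
```

For `falling`, both metrics give -0.75. For `rising`, the signed range is 0.375 (max after min) while the last-minus-first difference is only 0.25, which is why the two measures can disagree for names whose proportion overshoots and then recovers.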

In [55]:
# population distributions of ambinames 
# might want to remove from consideration instances when total ratio is too great
# or range of existence of a name/sex combo too short

total_pop_ambiname = all_births.sum()[np.in1d(all_births.sum().index, ambi_names_array)]
total_pop_ambiname.sort(ascending=False)
total_pop_ambiname.plot(logy=True)


Out[55]:
<matplotlib.axes.AxesSubplot at 0x138c1c290>

In [56]:
# now calculate a DataFrame to visualize results

# calculate the total population, the change in p_m from first to last appearance,
# the change from max to min in p_m, and the percentage of males overall for name

df = DataFrame()
df['total_pop'] = total_pop_ambiname
df['last_first_diff'] = p_m_cum.apply(last_first_diff)
df['min_max_range'] = p_m_cum.apply(min_max_range)
df['abs_min_max_range'] = np.abs(df.min_max_range)
df['p_m'] = p_m_cum.iloc[-1]

# distance from full ambigender -- p_m=0.5 leads to 1, p_m=1 or 0 -> 0
df['ambi_index'] = df.p_m.apply(lambda p: 1 - 2* np.abs(p-0.5))

df.head()


Out[56]:
         total_pop  last_first_diff  min_max_range  abs_min_max_range       p_m  ambi_index
name
James      5072771        -0.000845      -0.002123           0.002123  0.995457    0.009085
John       5061897         0.000479      -0.001921           0.001921  0.995737    0.008526
Robert     4788050         0.000344       0.002027           0.002027  0.995811    0.008377
Michael    4265373        -0.005034      -0.006425           0.006425  0.994966    0.010067
Mary       4119074        -0.000132      -0.000829           0.000829  0.003675    0.007351

5 rows × 6 columns
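The ambi_index column maps the overall proportion of males p_m onto a 0-to-1 scale via 1 - 2|p_m - 0.5|. A few hand-picked values (chosen for illustration, not taken from the data) show how it behaves; the function below simply restates the lambda from the cell above.

```python
def ambi_index(p_m):
    """1 when a name is split evenly (p_m = 0.5), 0 when it is single-sex."""
    return 1 - 2 * abs(p_m - 0.5)


# Evaluate the index at a few illustrative proportions of males.
examples = {p: ambi_index(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

The index is symmetric: a name that is 25% male scores the same 0.5 as one that is 75% male, so it measures how evenly a name is shared without saying which sex dominates.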


In [57]:
# plot: x -> log10 of total population, y->how p_m has changed from first to last
# turn off d3 for this plot

mpld3.disable_notebook()
plt.scatter(np.log10(df.total_pop), df.last_first_diff, s=1)


Out[57]:
<matplotlib.collections.PathCollection at 0x111e57f50>

In [58]:
# turn d3 back on

mpld3.enable_notebook()

In [59]:
# general directionality counts -- looking for overall asymmetry

df.groupby(np.sign(df.last_first_diff)).count()


Out[59]:
                 total_pop  last_first_diff  min_max_range  abs_min_max_range   p_m  ambi_index
last_first_diff
-1                    4890             4890           4890               4890  4890        4890
 0                      24               24             24                 24    24          24
 1                    4738             4738           4738               4738  4738        4738

3 rows × 6 columns


In [60]:
# let's concentrate on more populous names that have seen big swings in the cumulative p_m

# you can play with the population and range filter
popular_names_with_shifts = df[(df.total_pop>5000) & (df.abs_min_max_range >0.7)]
popular_names_with_shifts.sort_index(by="abs_min_max_range", ascending=False)


Out[60]:
total_pop last_first_diff min_max_range abs_min_max_range p_m ambi_index
name
Hailey 123318 -0.998151 -0.998151 0.998151 0.001849 0.003698
Abbey 15854 -0.997792 -0.997802 0.997802 0.002208 0.004415
Summer 64702 -0.997002 -0.997002 0.997002 0.002998 0.005997
Raegan 9744 -0.990148 -0.995873 0.995873 0.009852 0.019704
Bria 11160 -0.995072 -0.995072 0.995072 0.004928 0.009857
Fallon 7476 -0.972311 -0.994122 0.994122 0.027689 0.055377
Chanel 14087 -0.993966 -0.993966 0.993966 0.006034 0.012068
Star 6684 -0.983543 -0.993738 0.993738 0.016457 0.032914
Holly 196587 -0.992161 -0.992161 0.992161 0.007839 0.015678
Nigel 10501 0.991906 0.991906 0.991906 0.991906 0.016189
Michele 225226 -0.988372 -0.991136 0.991136 0.011628 0.023257
Nova 6899 -0.930135 -0.990991 0.990991 0.069865 0.139730
Ronda 34628 -0.989633 -0.989633 0.989633 0.010367 0.020735
Paige 122569 -0.989198 -0.989198 0.989198 0.010802 0.021604
Brooke 173658 -0.988489 -0.988489 0.988489 0.011511 0.023022
Beverly 380492 -0.987824 -0.987824 0.987824 0.012176 0.024353
Lauren 450853 -0.987302 -0.987302 0.987302 0.012698 0.025396
Alexus 17835 -0.987272 -0.987286 0.987286 0.012728 0.025456
Allison 262727 -0.985826 -0.985826 0.985826 0.014174 0.028349
Cordell 9464 0.984362 0.984362 0.984362 0.984362 0.031276
Lauri 11199 -0.983302 -0.983302 0.983302 0.016698 0.033396
Joy 131572 -0.981827 -0.981862 0.981862 0.018173 0.036345
Ashley 832350 -0.981496 -0.981496 0.981496 0.018504 0.037008
Lyric 8899 -0.838409 -0.980916 0.980916 0.161591 0.323182
Christy 99452 -0.980734 -0.980760 0.980760 0.019266 0.038531
Kenna 7979 -0.980323 -0.980659 0.980659 0.019677 0.039353
Tyrese 7582 0.980480 0.980480 0.980480 0.980480 0.039040
Robby 10399 0.979229 0.979229 0.979229 0.979229 0.041542
Mallory 48990 -0.977648 -0.977648 0.977648 0.022352 0.044703
Madison 308970 -0.976341 -0.976341 0.976341 0.023659 0.047319
Jermaine 39286 0.975386 0.975386 0.975386 0.975386 0.049229
Shelly 86081 -0.974570 -0.974570 0.974570 0.025430 0.050859
Carley 14885 -0.972455 -0.972455 0.972455 0.027545 0.055089
Lacey 47635 -0.969770 -0.969770 0.969770 0.030230 0.060460
Ainsley 8817 -0.968357 -0.968357 0.968357 0.031643 0.063287
Santana 7399 0.422760 0.966667 0.966667 0.422760 0.845520
Kelsey 144166 -0.964811 -0.964811 0.964811 0.035189 0.070377
Ansley 7202 -0.964315 -0.964315 0.964315 0.035685 0.071369
Ronnie 186260 0.960781 0.964046 0.964046 0.960781 0.078439
Kay 101704 -0.962479 -0.962605 0.962605 0.037521 0.075041
Delaney 27608 -0.962402 -0.962402 0.962402 0.037598 0.075196
Lindsay 131956 -0.961351 -0.961351 0.961351 0.038649 0.077298
Lesly 12407 -0.959942 -0.959942 0.959942 0.040058 0.080116
Marquise 9308 0.957886 0.957886 0.957886 0.957886 0.084229
Kenzie 8793 -0.956443 -0.956443 0.956443 0.043557 0.087115
Hillary 28763 -0.956159 -0.956159 0.956159 0.043841 0.087682
Mckenzie 44315 -0.953988 -0.953988 0.953988 0.046012 0.092023
Linsey 5138 -0.953095 -0.953095 0.953095 0.046905 0.093811
Lindsey 159977 -0.952274 -0.952274 0.952274 0.047726 0.095451
Shamar 5093 0.951109 0.951109 0.951109 0.951109 0.097781
Kinsey 5800 -0.951034 -0.951034 0.951034 0.048966 0.097931
Sydney 156602 -0.943922 -0.943922 0.943922 0.056078 0.112157
Kimber 5455 -0.943538 -0.943538 0.943538 0.056462 0.112924
Raven 37100 -0.927143 -0.943274 0.943274 0.072857 0.145714
Meredith 73898 -0.942502 -0.942502 0.942502 0.057498 0.114996
Cassidy 49871 -0.941349 -0.941349 0.941349 0.058651 0.117303
Whitney 98164 -0.940701 -0.940701 0.940701 0.059299 0.118597
Richie 6540 0.938532 0.938532 0.938532 0.938532 0.122936
Diamond 32377 -0.936776 -0.936776 0.936776 0.063224 0.126448
Gay 19363 -0.928678 -0.928678 0.928678 0.071322 0.142643
... ... ... ... ... ...

150 rows × 6 columns


In [61]:
popular_names_with_shifts.groupby(np.sign(df.last_first_diff)).count()


Out[61]:
                 total_pop  last_first_diff  min_max_range  abs_min_max_range  p_m  ambi_index
last_first_diff
-1                     116              116            116                116  116         116
 1                      34               34             34                 34   34          34

2 rows × 6 columns


In [ ]:
#popular_names_with_shifts.to_pickle('popular_names_with_shifts.pickle')

In [62]:
fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'))
x = np.log10(popular_names_with_shifts.total_pop)
y = popular_names_with_shifts.min_max_range 

scatter = ax.scatter(x, y)

ax.grid(color='white', linestyle='solid')
ax.set_title("Populous Names with Major Sex Shift", size=20)
ax.set_xlabel('log10(total_pop)')
ax.set_ylabel('min_max_range')

#labels = ['point {0}'.format(i + 1) for i in range(len(x))]
labels = list(popular_names_with_shifts.index)
tooltip = plugins.PointLabelTooltip(scatter, labels=labels)
plugins.connect(fig, tooltip)



In [63]:
prop_c_male('Ronnie').plot()


Out[63]:
<matplotlib.axes.AxesSubplot at 0x138c2c910>