Goals

The focus of this notebook is on baby names that have been given to both males and females.


In [1]:
%matplotlib inline

In [2]:
import matplotlib.pyplot as plt
import numpy as np

from pylab import figure, show

from pandas import DataFrame, Series
import pandas as pd

In [3]:
try:
    import mpld3
    from mpld3 import enable_notebook
    from mpld3 import plugins
    enable_notebook()
except Exception as e:
    print("Attempt to import and enable mpld3 failed: %s" % e)

In [4]:
# what would seaborn do?
try:
    import seaborn as sns
except Exception as e:
    print("Attempt to import and enable seaborn failed: %s" % e)



Preliminaries: Assumed location of pydata-book files

To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from

https://github.com/pydata/pydata-book

in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"

and then symbolically linked (ln -s) to the pydata-book directory from the root directory of the working-open-data folder, i.e., on OS X:

cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book

That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.

With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths.


In [5]:
import os

NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")

assert os.path.exists(NAMES_DIR)

Please make sure the above assertion passes.

Baby names dataset

Discussed on p. 35 of the PfDA book (Python for Data Analysis).

To download all the data, including that for 2011 and 2012, see Popular Baby Names, which also includes state-by-state data.

Loading all data into Pandas


In [6]:
# show the first five files in the NAMES_DIR

import glob
glob.glob(NAMES_DIR + "/*")[:5]


Out[6]:
['../pydata-book/ch02/names/NationalReadMe.pdf',
 '../pydata-book/ch02/names/yob1880.txt',
 '../pydata-book/ch02/names/yob1881.txt',
 '../pydata-book/ch02/names/yob1882.txt',
 '../pydata-book/ch02/names/yob1883.txt']

In [7]:
# 2010 is the last available year in the pydata-book repo
import os

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

# describe() reports floats because statistics such as mean and std are generally non-integer
names.describe()


Out[7]:
               births            year
count  1690784.000000  1690784.000000
mean       190.682386     1969.454384
std       1615.899711       32.823526
min          5.000000     1880.000000
25%          7.000000     1946.000000
50%         12.000000     1979.000000
75%         32.000000     1997.000000
max      99651.000000     2010.000000

8 rows × 2 columns
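The floats in the summary above come from describe() itself, not from the data. The sketch below checks this on a tiny stand-in frame; the frame and its values are made up for illustration and are not the real names data.

```python
import pandas as pd

# Tiny stand-in for the names frame (illustrative values only)
sample = pd.DataFrame({
    "name": ["Mary", "Anna", "John"],
    "sex": ["F", "F", "M"],
    "births": [7065, 2604, 9655],
    "year": [1880, 1880, 1880],
})

# The stored column is still an integer column
births_dtype = sample["births"].dtype

# describe() returns a frame of summary statistics; mean and std are
# usually fractional, so each summary column is stored as float64
summary = sample.describe()
summary_dtypes = summary.dtypes

print(births_dtype)
print(summary_dtypes)
```

So the births and year columns keep their integer dtype; only the summary table is floating point.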


In [8]:
# how many people, names, males, and females are represented in names?

names.births.sum()


Out[8]:
322402727

In [9]:
# F vs M

names.groupby('sex')['births'].sum()


Out[9]:
sex
F      159990140
M      162412587
Name: births, dtype: int64

In [10]:
# total number of names

len(names.groupby('name'))


Out[10]:
88496

In [11]:
# use pivot_table to collect records by year (rows) and sex (columns)

total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
total_births.head()


Out[11]:
sex        F       M
year
1880   90993  110493
1881   91955  100748
1882  107851  113687
1883  112322  104632
1884  129021  114445

5 rows × 2 columns
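A note for anyone running this on a newer pandas: the rows= and cols= keywords used above were later replaced by index= and columns=. A minimal sketch of the same pivot with the newer spelling, on a small made-up frame (the sample values are assumptions, not the real data):

```python
import pandas as pd

# Small made-up frame with the same columns as names
sample = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John"],
    "sex": ["F", "M", "F", "M"],
    "births": [10, 20, 30, 40],
    "year": [1880, 1880, 1881, 1881],
})

# Same pivot as total_births, using the index=/columns= keywords
total_births_sample = sample.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum")

print(total_births_sample)
```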


In [12]:
# You can use groupby to get the equivalent pivot_table calculation

names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births']


Out[12]:
sex F M
year
1880 90993 110493
1881 91955 100748
1882 107851 113687
1883 112322 104632
1884 129021 114445
1885 133056 107802
1886 144538 110785
1887 145983 101412
1888 178631 120857
1889 178369 110590
1890 190377 111026
1891 185486 101198
1892 212350 122038
1893 212908 112319
1894 222923 115775
1895 233632 117398
1896 237924 119575
1897 234199 112760
1898 258771 122703
1899 233022 106218
1900 299873 150554
1901 239351 106478
1902 264079 122660
1903 261976 119240
1904 275375 128129
1905 291641 132319
1906 295301 133159
1907 318558 146838
1908 334277 154339
1909 347191 163983
1910 396416 194198
1911 418180 225936
1912 557939 429926
1913 624317 512482
1914 761376 654746
1915 983824 848647
1916 1044249 890142
1917 1081194 925512
1918 1157585 1013720
1919 1130149 980215
1920 1198214 1064468
1921 1232845 1101374
1922 1200796 1088380
1923 1206239 1096227
1924 1248821 1132671
1925 1217217 1115798
1926 1185078 1110440
1927 1192207 1126259
1928 1152836 1107113
1929 1116284 1074833
1930 1125521 1096663
1931 1064233 1038586
1932 1066930 1043512
1933 1007523 990677
1934 1043879 1031962
1935 1048264 1040649
1936 1040068 1036662
1937 1063722 1065964
1938 1103173 1108480
1939 1096394 1106328
... ...

131 rows × 2 columns
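The nested groupby/apply above is not the only way to reproduce the pivot table. A shorter route, sketched here on a small made-up frame, is to group by both keys at once, sum the births, and unstack the sex level into columns. The sample data is an assumption for illustration.

```python
import pandas as pd

# Small made-up frame with the same columns as names
sample = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 5, 20, 30, 40],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# Group on both keys, sum births, then move 'sex' from the index to the columns
by_group = sample.groupby(["year", "sex"])["births"].sum().unstack()

# The pivot_table version, for comparison
by_pivot = sample.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum")

print(by_group)
```

Selecting the births column before summing also avoids aggregating the non-numeric name column.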


In [13]:
# how to calculate the total births / year

names.groupby('year')['births'].sum().plot(title="total births by year")


Out[13]:
<matplotlib.axes.AxesSubplot at 0x115fe4e90>

In [14]:
names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].plot(title="births (M/F) by year")


Out[14]:
<matplotlib.axes.AxesSubplot at 0x10ea330d0>

In [15]:
# from book: add prop to names

def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)

In [16]:
# verify prop --> all adds up to 1

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)


Out[16]:
True
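As an alternative to the add_prop function from the book, the same proportion can be computed without apply by using groupby(...).transform, which returns the group total aligned to each original row. This is only a sketch on made-up sample data, not the book's method.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same columns as names
sample = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "Mary", "John"],
    "sex": ["F", "F", "M", "F", "M"],
    "births": [10, 5, 20, 30, 40],
    "year": [1880, 1880, 1880, 1881, 1881],
})

# Total births of each (year, sex) group, repeated on every row of that group
group_totals = sample.groupby(["year", "sex"])["births"].transform("sum")

# Cast to float so the division is not floored under Python 2
sample["prop"] = sample["births"].astype(float) / group_totals

# Same sanity check as above: proportions sum to 1 within each group
check = sample.groupby(["year", "sex"])["prop"].sum()
print(np.allclose(check, 1))
```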

In [17]:
# number of records in full names dataframe

len(names)


Out[17]:
1690784

How to do top1000 calculation

This section on the top1000 calculation is kept here as inspiration for how to work with the baby names data.


In [18]:
#  from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply

names.groupby(['year', 'sex']).apply(lambda g: g.sort_index(by='births', ascending=False)[:1000])


Out[18]:
name sex births year prop
year sex
1880 F 0 Mary F 7065 1880 0.077643
1 Anna F 2604 1880 0.028618
2 Emma F 2003 1880 0.022013
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188
5 Margaret F 1578 1880 0.017342
6 Ida F 1472 1880 0.016177
7 Alice F 1414 1880 0.015540
8 Bertha F 1320 1880 0.014507
9 Sarah F 1288 1880 0.014155
10 Annie F 1258 1880 0.013825
11 Clara F 1226 1880 0.013474
12 Ella F 1156 1880 0.012704
13 Florence F 1063 1880 0.011682
14 Cora F 1045 1880 0.011484
15 Martha F 1040 1880 0.011429
16 Laura F 1012 1880 0.011122
17 Nellie F 995 1880 0.010935
18 Grace F 982 1880 0.010792
19 Carrie F 949 1880 0.010429
20 Maude F 858 1880 0.009429
21 Mabel F 808 1880 0.008880
22 Bessie F 794 1880 0.008726
23 Jennie F 793 1880 0.008715
24 Gertrude F 787 1880 0.008649
25 Julia F 783 1880 0.008605
26 Hattie F 769 1880 0.008451
27 Edith F 768 1880 0.008440
28 Mattie F 704 1880 0.007737
29 Rose F 700 1880 0.007693
30 Catherine F 688 1880 0.007561
31 Lillian F 672 1880 0.007385
32 Ada F 652 1880 0.007165
33 Lillie F 647 1880 0.007110
34 Helen F 636 1880 0.006990
35 Jessie F 635 1880 0.006979
36 Louise F 635 1880 0.006979
37 Ethel F 633 1880 0.006957
38 Lula F 621 1880 0.006825
39 Myrtle F 615 1880 0.006759
40 Eva F 614 1880 0.006748
41 Frances F 605 1880 0.006649
42 Lena F 603 1880 0.006627
43 Lucy F 591 1880 0.006495
44 Edna F 588 1880 0.006462
45 Maggie F 582 1880 0.006396
46 Pearl F 569 1880 0.006253
47 Daisy F 564 1880 0.006198
48 Fannie F 560 1880 0.006154
49 Josephine F 544 1880 0.005978
50 Dora F 524 1880 0.005759
51 Rosa F 507 1880 0.005572
52 Katherine F 502 1880 0.005517
53 Agnes F 473 1880 0.005198
54 Marie F 471 1880 0.005176
55 Nora F 471 1880 0.005176
56 May F 462 1880 0.005077
57 Mamie F 436 1880 0.004792
58 Blanche F 427 1880 0.004693
59 Stella F 414 1880 0.004550
... ... ... ... ...

261877 rows × 5 columns


In [19]:
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
top1000.head()


Out[19]:
                   name sex  births  year      prop
year sex
1880 F   0         Mary   F    7065  1880  0.077643
         1         Anna   F    2604  1880  0.028618
         2         Emma   F    2003  1880  0.022013
         3    Elizabeth   F    1939  1880  0.021309
         4       Minnie   F    1746  1880  0.019188

5 rows × 5 columns
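In later pandas releases, sort_index(by=...) was replaced by sort_values(by=...). The sketch below shows one way to get the top N rows per year/sex group with the newer call: sort once, then take the head of each group. It uses made-up sample data and N=2 instead of 1000, and unlike the apply version it keeps the original flat row index rather than building a hierarchical one.

```python
import pandas as pd

# Small made-up frame with the same columns as names
sample = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "William", "James"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [7065, 2604, 2003, 9655, 9532, 5927],
    "year": [1880] * 6,
})

# Sort all rows by births (descending), then keep the first 2 rows of each group
top2 = (sample.sort_values(by="births", ascending=False)
              .groupby(["year", "sex"])
              .head(2))

print(top2)
```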


In [20]:
# Do pivot table: row: year and cols= names for top 1000

top_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=np.sum)
top_births.tail()


Out[20]:
name Aaden Aaliyah Aarav Aaron Aarush Ab Abagail Abb Abbey Abbie Abbigail Abbott Abby Abdiel Abdul Abdullah Abe Abel Abelardo Abigail
year
2006 NaN 3737 NaN 8279 NaN NaN 297 NaN 404 440 630 NaN 1682 NaN NaN 219 NaN 922 NaN 15615 ...
2007 NaN 3941 NaN 8914 NaN NaN 313 NaN 349 468 651 NaN 1573 NaN NaN 224 NaN 939 NaN 15447 ...
2008 955 4028 219 8511 NaN NaN 317 NaN 344 400 608 NaN 1328 199 NaN 210 NaN 863 NaN 15045 ...
2009 1265 4352 270 7936 NaN NaN 296 NaN 307 369 675 NaN 1274 229 NaN 256 NaN 960 NaN 14342 ...
2010 448 4628 438 7374 226 NaN 277 NaN 295 324 585 NaN 1140 264 NaN 225 NaN 1119 NaN 14124 ...

5 rows × 6865 columns


In [21]:
# is your name in the top_births list?

top_births['Raymond'].plot(title='plot for Raymond')


Out[21]:
<matplotlib.axes.AxesSubplot at 0x113ac1390>

In [22]:
# for Aaden, which shows up at the end

top_births.Aaden.plot(xlim=[1880,2010])


Out[22]:
<matplotlib.axes.AxesSubplot at 0x113aca5d0>

In [23]:
# number of names represented in top_births

len(top_births.columns)


Out[23]:
6865

In [24]:
# how to get the most popular name of all time in top_births?

most_common_names = top_births.sum()
most_common_names.sort(ascending=False)

most_common_names.head()


Out[24]:
name
James      5071647
John       5060953
Robert     4787187
Michael    4263083
Mary       4117746
dtype: float64
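The in-place Series.sort used above was removed in later pandas releases. A minimal sketch of the same ranking with sort_values, which returns a new sorted Series, on made-up totals:

```python
import pandas as pd

# Made-up totals standing in for top_births.sum()
totals = pd.Series({"Mary": 3.0, "James": 5.0, "John": 4.0})

# sort_values returns a new Series instead of sorting in place
most_common = totals.sort_values(ascending=False)

print(most_common.head())
```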

In [25]:
# as of mpld3 v 0.1 (2014.03.04), the name labeling doesn't work -- so disable mpld3 for this figure

mpld3.disable_notebook()
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))


Out[25]:
<matplotlib.axes.AxesSubplot at 0x112c5cc10>

In [26]:
# turn mpld3 back on

mpld3.enable_notebook()

all_births pivot table


In [27]:
# instead of top_births -- get all_births

all_births = names.pivot_table('births', rows='year', cols='name', aggfunc=sum)

In [28]:
all_births = all_births.fillna(0)
all_births.tail()


Out[28]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 0 0 0 9 0 14 55 0 5 0 0 74 11 0 0 17 0 0 42 7 ...
2007 5 0 0 8 8 13 155 0 0 0 10 72 15 10 0 31 7 0 43 10 ...
2008 0 0 5 6 22 13 955 0 0 0 9 76 20 22 0 24 5 0 51 10 ...
2009 6 0 0 9 23 16 1270 5 5 0 18 76 17 25 6 12 0 0 38 23 ...
2010 9 0 0 7 11 0 448 0 13 5 19 54 11 18 0 23 0 5 37 0 ...

5 rows × 88496 columns


In [29]:
# set up to do start/end calculation

all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)

In [30]:
all_births_cumsum.tail()


Out[30]:
name Aaban Aabid Aabriella Aadam Aadan Aadarsh Aaden Aadesh Aadhav Aadhavan Aadhya Aadi Aadil Aadin Aadison Aadit Aadith Aaditri Aaditya Aadon
year
2006 0 5 0 103 5 67 149 5 11 0 0 171 175 10 0 67 5 0 153 18 ...
2007 5 5 0 111 13 80 304 5 11 0 10 243 190 20 0 98 12 0 196 28 ...
2008 5 5 5 117 35 93 1259 5 11 0 19 319 210 42 0 122 17 0 247 38 ...
2009 11 5 5 126 58 109 2529 10 16 0 37 395 227 67 6 134 17 0 285 61 ...
2010 20 5 5 133 69 109 2977 10 29 5 56 449 238 85 6 157 17 5 322 61 ...

5 rows × 88496 columns
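One possible reading of the "start/end calculation" is finding the first and last year in which each name appears. The sketch below does that from the cumulative sums, on a made-up frame; that interpretation, the sample values, and the variable names first_year and last_year are all assumptions.

```python
import pandas as pd

# Made-up stand-in for all_births: rows are years, columns are names
all_births_sample = pd.DataFrame(
    {"Aaden": [0, 0, 955, 1270], "Mary": [10, 0, 5, 0]},
    index=pd.Index([2006, 2007, 2008, 2009], name="year"))

# DataFrame.cumsum() works column by column, like apply(lambda s: s.cumsum(), axis=0)
cumsum_sample = all_births_sample.cumsum()

# First year: the first row where the running total becomes positive
first_year = (cumsum_sample > 0).astype(int).idxmax()

# Last year: the first row where the running total reaches its final value
last_year = (cumsum_sample == cumsum_sample.iloc[-1]).astype(int).idxmax()

print(first_year)
print(last_year)
```

A name with no births at all would report the first row for both values, so such columns would need to be filtered out first.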

Names that are both M and F


In [31]:
# remind ourselves of what's in names

names.head()


Out[31]:
        name sex  births  year      prop
0       Mary   F    7065  1880  0.077643
1       Anna   F    2604  1880  0.028618
2       Emma   F    2003  1880  0.022013
3  Elizabeth   F    1939  1880  0.021309
4     Minnie   F    1746  1880  0.019188

5 rows × 5 columns


In [32]:
# columns in names

names.columns


Out[32]:
Index([u'name', u'sex', u'births', u'year', u'prop'], dtype='object')

Approach to exploring ambigendered names

Some things to think about:

  • calculate a set of ambi_names -- names that are both M and F in the database: names_ambi
  • calculate a pivot table ambi_names_pt that uses a hierarchical index (name/sex) versus years
  • for a specific name, make a plot of male vs female population to validate your approach
  • think of using cumulative vs year-by-year instantaneous populations
  • think about metrics for measuring the sex shift of names
  • think about how to calculate how ambigendered a name is
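To make the first few ideas concrete, here is a rough sketch on a small made-up frame. It is one possible starting point, not a reference solution: the sample data, the variable prop_female, and the choice of "female share of births per year" as a metric are assumptions for illustration.

```python
import pandas as pd

# Made-up data: Leslie is given to both sexes, Mary and John are not
sample = pd.DataFrame({
    "name": ["Leslie", "Leslie", "Mary", "John", "Leslie", "Leslie"],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "births": [10, 90, 100, 100, 60, 40],
    "year": [1900, 1900, 1900, 1900, 2000, 2000],
})

# Names recorded under both sexes somewhere in the data
sexes_per_name = sample.groupby("name")["sex"].nunique()
ambi_names = set(sexes_per_name[sexes_per_name == 2].index)

# Pivot table with a hierarchical (name, sex) index versus year
ambi_names_pt = sample[sample["name"].isin(ambi_names)].pivot_table(
    values="births", index=["name", "sex"], columns="year",
    aggfunc="sum", fill_value=0)

# Hypothetical metric: female share of a name's births in each year
leslie = ambi_names_pt.loc["Leslie"]
prop_female = leslie.loc["F"] / leslie.sum()

print(ambi_names)
print(prop_female)
```

A share near 0.5 suggests a name that is evenly split in that year, and a share that moves over time hints at a sex shift.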

Exercise

Submit a notebook that describes what you've learned about the nature of ambigendered names in the baby names database. (Due date: Monday, March 10 at 11:5pm --> bCourses assignment to come.) I'm interested in seeing what you do with the data set in this regard. At the minimum, show that you are able to run Day_13_C_Baby_Names_MF_Completed. Be creative and have fun.

