Goals

The focus of this notebook is on baby names that have been given to both males and females.



In [ ]:

    
%matplotlib inline



In [ ]:

    
import matplotlib.pyplot as plt
import numpy as np

from pylab import figure, show

from pandas import DataFrame, Series
import pandas as pd



In [ ]:

    
try:
    import mpld3
    from mpld3 import enable_notebook
    from mpld3 import plugins
    enable_notebook()
except Exception as e:
    print("Attempt to import and enable mpld3 failed", e)



In [ ]:

    
# what would seaborn do?
try:
    import seaborn as sns
except Exception as e:
    print("Attempt to import and enable seaborn failed", e)

Preliminaries: Assumed location of pydata-book files

To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from

https://github.com/pydata/pydata-book

in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/"

and then symbolically linked (ln -s) to the pydata-book directory from the root directory of the working-open-data folder, i.e., on OS X

cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book

That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.

With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths.



In [ ]:

    
import os

NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")

assert os.path.exists(NAMES_DIR)

Please make sure the above assertion works.

Baby names dataset

discussed on p. 35 of the PfDA book

To download all the data, including the data for 2011 and 2012, see Popular Baby Names, which also includes state-by-state data.

Loading all data into Pandas



In [ ]:

    
# show the first five files in the NAMES_DIR

import glob
glob.glob(NAMES_DIR + "/*")[:5]



In [ ]:

    
# 2010 is the last available year in the pydata-book repo
import os

years = range(1880, 2011)

pieces = []
columns = ['name', 'sex', 'births']

for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
    frame = pd.read_csv(path, names=columns)

    frame['year'] = year
    pieces.append(frame)

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)

# describe() reports floats because summary statistics such as mean and std are floating-point values
names.describe()



In [ ]:

    
# how many people, names, males and females  represented in names?

names.births.sum()



In [ ]:

    
# F vs M

names.groupby('sex')['births'].sum()



In [ ]:

    
# total number of names

len(names.groupby('name'))



In [ ]:

    
# use pivot_table to collect records by year (rows) and sex (columns)

total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')
total_births.head()



In [ ]:

    
# You can use groupby to get the equivalent of the pivot_table calculation

names.groupby(['year', 'sex'])['births'].sum().unstack()
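
To see why the two calculations agree, here is a minimal sketch on a toy frame shaped like `names`. The values are made up for illustration, and the keyword names `index` and `columns` assume a reasonably recent pandas.

```python
import pandas as pd

# Toy records in the same shape as `names` (values are invented for illustration)
toy = pd.DataFrame({
    "name":   ["Mary", "John", "Leslie", "Leslie", "Mary", "John"],
    "sex":    ["F", "M", "F", "M", "F", "M"],
    "births": [10, 8, 3, 2, 12, 9],
    "year":   [1880, 1880, 1880, 1880, 1881, 1881],
})

# pivot_table: years as rows, sexes as columns, births summed within each cell
by_pivot = toy.pivot_table("births", index="year", columns="sex", aggfunc="sum")

# groupby on both keys, then move the inner key (sex) into the columns
by_groupby = toy.groupby(["year", "sex"])["births"].sum().unstack()

print(by_pivot)
print(by_groupby)
```

Both frames have one row per year and one column per sex; the pivot table is essentially a shorthand for the groupby-then-unstack pattern.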



In [ ]:

    
# how to calculate the total births / year

names.groupby('year')['births'].sum().plot(title="total births by year")



In [ ]:

    
names.groupby(['year', 'sex'])['births'].sum().unstack().plot(title="births (M/F) by year")



In [ ]:

    
# from book: add prop to names

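def add_prop(group):
    # Cast to float so the division below is true division
    # (under Python 2, integer division floors)
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group

# group_keys=False keeps the original row index instead of prepending the group keys
names = names.groupby(['year', 'sex'], group_keys=False).apply(add_prop)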
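
As an alternative to the book's `apply` approach, the same proportion can probably be computed with `transform`, which returns the group total aligned to each original row. This is a sketch on invented toy data, not the book's method.

```python
import numpy as np
import pandas as pd

# Toy records in the same shape as `names` (values are invented for illustration)
toy = pd.DataFrame({
    "name":   ["Mary", "John", "Leslie", "Leslie", "Mary", "John"],
    "sex":    ["F", "M", "F", "M", "F", "M"],
    "births": [10, 8, 3, 2, 12, 9],
    "year":   [1880, 1880, 1880, 1880, 1881, 1881],
})

# Total births of each (year, sex) group, repeated on every row of that group
group_totals = toy.groupby(["year", "sex"])["births"].transform("sum")

# Share of each name within its (year, sex) group
toy["prop"] = toy["births"] / group_totals

# Sanity check: the proportions within every group should add up to 1
check = toy.groupby(["year", "sex"])["prop"].sum()
print(np.allclose(check, 1))
```

Because `transform` keeps the original index, no reshaping of the result is needed before assigning the new column.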



In [ ]:

    
# verify prop --> all adds up to 1

np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)



In [ ]:

    
# number of records in full names dataframe

len(names)

How to do top1000 calculation

This section on the top1000 calculation is kept in here to provide some inspiration on how to work with baby names



In [ ]:

    
#  from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply

names.groupby(['year', 'sex']).apply(lambda g: g.sort_values(by='births', ascending=False)[:1000])



In [ ]:

    
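def get_top1000(group):
    # keep the 1000 rows with the most births in this year/sex group
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
# drop the group index; year and sex remain available as columns
top1000.reset_index(inplace=True, drop=True)
top1000.head()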
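
The same top-N-per-group idea can be tried on a small frame. The sketch below uses invented data and a hypothetical `TOP_N` of 1 in place of 1000, and it sorts once before grouping rather than sorting inside each group.

```python
import pandas as pd

# Toy records in the same shape as `names` (values are invented for illustration)
toy = pd.DataFrame({
    "name":   ["Mary", "John", "Leslie", "Leslie", "Mary", "John"],
    "sex":    ["F", "M", "F", "M", "F", "M"],
    "births": [10, 8, 3, 2, 12, 9],
    "year":   [1880, 1880, 1880, 1880, 1881, 1881],
})

TOP_N = 1  # hypothetical cutoff; the notebook uses 1000

# Sort all rows by births (largest first), then keep the first TOP_N rows
# of every (year, sex) group; head() preserves the sorted order within groups
top_n = (toy.sort_values("births", ascending=False)
            .groupby(["year", "sex"])
            .head(TOP_N)
            .sort_values(["year", "sex"]))

print(top_n)
```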



In [ ]:

    
# Do pivot table: row: year and cols= names for top 1000

top_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')
top_births.tail()



In [ ]:

    
# is your name in the top_births list?

top_births['Raymond'].plot(title='plot for Raymond')



In [ ]:

    
# for Aaden, which shows up at the end

top_births.Aaden.plot(xlim=[1880,2010])



In [ ]:

    
# number of names represented in top_births

len(top_births.columns)



In [ ]:

    
# how to get the most popular name of all time in top_births?

most_common_names = top_births.sum().sort_values(ascending=False)

most_common_names.head()



In [ ]:

    
# as of mpld3 v0.1 (2014.03.04), the name labeling doesn't work -- so disable mpld3 for this figure

mpld3.disable_notebook()
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))



In [ ]:

    
# turn mpld3 back on

mpld3.enable_notebook()

all_births pivot table



In [ ]:

    
# instead of top_births -- get all_births

all_births = names.pivot_table('births', index='year', columns='name', aggfunc='sum')



In [ ]:

    
all_births = all_births.fillna(0)
all_births.tail()



In [ ]:

    
# set up to do start/end calculation

all_births_cumsum = all_births.cumsum()



In [ ]:

    
all_births_cumsum.tail()
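
One way the cumulative sums might be used for a start/end calculation: a name starts in the first year its running total is above zero, and ends in the first year its running total reaches the final total. The sketch below assumes that reading and uses an invented year-by-name table (`toy_births` is a hypothetical stand-in for `all_births`).

```python
import pandas as pd

# Invented year-by-name table standing in for all_births (zeros mean no births)
toy_births = pd.DataFrame(
    {"Ada": [5, 3, 0, 0], "Bo": [0, 0, 4, 6], "Cy": [0, 2, 2, 0]},
    index=pd.Index([1880, 1881, 1882, 1883], name="year"),
)

toy_cumsum = toy_births.cumsum()

# First year with any births: first row where the running total is positive
first_year = (toy_cumsum > 0).idxmax()

# Last year with any births: first row where the running total hits its final value
last_year = (toy_cumsum == toy_cumsum.iloc[-1]).idxmax()

print(first_year)
print(last_year)
```

Note that `idxmax` returns the first row label when a column is never true, so a name with no births at all would need separate handling.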

Names that are both M and F



In [ ]:

    
# remind ourselves of what's in names

names.head()



In [ ]:

    
# columns in names

names.columns

Approach to exploring ambigendered names

Some things to think about:

calculate a set of ambi_names -- names that are both M and F in the database: names_ambi
calculate a pivot table ambi_names_pt that uses a hierarchical index name/sex vs years
for a specific name, make a plot of male vs female population to validate your approach
think of using cumulative vs year-by-year instantaneous populations
think about metrics for measuring the sex shift of names
think about how to calculate how ambigendered a name is
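
As a starting point for the items above, here is a rough sketch on invented toy data. The names `names_ambi` and `ambi_names_pt` follow the list; the `minority_share` metric is only one assumed way to score how ambigendered a name is, not a prescribed answer.

```python
import pandas as pd

# Toy records in the same shape as `names` (values are invented for illustration)
toy = pd.DataFrame({
    "name":   ["Mary", "John", "Leslie", "Leslie", "Mary", "John", "Leslie", "Leslie"],
    "sex":    ["F", "M", "F", "M", "F", "M", "F", "M"],
    "births": [10, 8, 1, 4, 12, 9, 3, 2],
    "year":   [1880, 1880, 1880, 1880, 1881, 1881, 1881, 1881],
})

# Names recorded under more than one sex anywhere in the data
sexes_per_name = toy.groupby("name")["sex"].nunique()
names_ambi = set(sexes_per_name[sexes_per_name > 1].index)

# Pivot table with a hierarchical (name, sex) row index and years as columns
ambi_rows = toy[toy["name"].isin(names_ambi)]
ambi_names_pt = ambi_rows.pivot_table(
    "births", index=["name", "sex"], columns="year", aggfunc="sum", fill_value=0
)

# Assumed metric: the minority sex's share of all births for the name
# (0 means entirely one sex, 0.5 means an even split)
totals = ambi_names_pt.sum(axis=1).unstack("sex")  # rows: name, columns: F and M
minority_share = totals.min(axis=1) / totals.sum(axis=1)

print(names_ambi)
print(ambi_names_pt)
print(minority_share)
```

This score uses totals over all years, so it hides any shift over time; comparing it year by year, or on cumulative counts, would address the sex-shift question in the list.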

Exercise

Submit a notebook that describes what you've learned about the nature of ambigendered names in the baby names database. (Due date: Monday, March 10 at 11:5pm --> bCourses assignment to come.) I'm interested in seeing what you do with the data set in this regard. At a minimum, show that you are able to run Day_13_C_Baby_Names_MF_Completed. Be creative and have fun.


