The focus of this notebook is on baby names that have been given to both males and females.
In [1]:
%matplotlib inline
In [2]:
import matplotlib.pyplot as plt
import numpy as np
from pylab import figure, show
from pandas import DataFrame, Series
import pandas as pd
In [3]:
try:
    import mpld3
    from mpld3 import enable_notebook
    from mpld3 import plugins
    enable_notebook()
except Exception as e:
    # %-formatting keeps this line valid in both Python 2 and Python 3
    print("Attempt to import and enable mpld3 failed: %s" % e)
In [4]:
# what would seaborn do?
try:
    import seaborn as sns
except Exception as e:
    print("Attempt to import and enable seaborn failed: %s" % e)
To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. I placed the files from
https://github.com/pydata/pydata-book
in a local directory, which in my case is "/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/",
and then symbolically linked (ln -s) to the pydata-book directory from the root directory of the working-open-data folder, i.e., on OS X:
cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data
ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book
That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.
With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths.
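Here is a minimal sketch of the same arrangement in Python, built in a throwaway temporary directory so it does not touch any real files. The directory layout (a notebooks folder inside working-open-data) is my assumption, inferred from the os.pardir-based path used in the next cell. os.symlink may need extra permissions on Windows.

```python
import os
import tempfile

# Build a scratch copy of the layout (all paths here are illustrative)
base = tempfile.mkdtemp()
real_dir = os.path.join(base, "pydata-book")
work_dir = os.path.join(base, "working-open-data")
os.makedirs(os.path.join(real_dir, "ch02", "names"))
os.makedirs(os.path.join(work_dir, "notebooks"))

# Equivalent of: ln -s <real_dir> pydata-book (run from working-open-data)
link = os.path.join(work_dir, "pydata-book")
os.symlink(real_dir, link)

# The path a notebook in working-open-data/notebooks would compute
names_dir = os.path.join(work_dir, "notebooks", os.pardir,
                         "pydata-book", "ch02", "names")
print(os.path.islink(link), os.path.exists(names_dir))
```

The names directory is reachable through the link even though it physically lives under the separate pydata-book directory.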
In [5]:
import os
NAMES_DIR = os.path.join(os.pardir, "pydata-book", "ch02", "names")
assert os.path.exists(NAMES_DIR)
Please make sure the above assertion works.
The data set is discussed on p. 35 of the PfDA book.
To download all the data, including that for 2011 and 2012: Popular Baby Names --> includes state-by-state data.
In [6]:
# show the first five files in the NAMES_DIR
import glob
glob.glob(NAMES_DIR + "/*")[:5]
Out[6]:
In [7]:
# 2010 is the last available year in the pydata-book repo
import os
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
# why floats? I'm not sure.
names.describe()
Out[7]:
In [8]:
# how many people, names, males and females represented in names?
names.births.sum()
Out[8]:
In [9]:
# F vs M
names.groupby('sex')['births'].sum()
Out[9]:
In [10]:
# total number of names
len(names.groupby('name'))
Out[10]:
In [11]:
# use pivot_table to collect records by year (rows) and sex (columns)
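# (pivot_table's rows=/cols= keywords were renamed to index=/columns= in later pandas releases)
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')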
total_births.head()
Out[11]:
In [12]:
# You can use groupby to get the equivalent pivot_table calculation
names.groupby(['year', 'sex'])['births'].sum().unstack()
Out[12]:
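To see that the two approaches really agree, here is a small self-contained sketch on a toy frame. The numbers are made up for illustration and are not SSA data; the index=/columns= keywords assume a reasonably recent pandas.

```python
import pandas as pd

# Toy stand-in for the `names` frame (made-up numbers)
toy = pd.DataFrame({
    "name":   ["Mary", "John", "Mary", "John", "Leslie", "Leslie"],
    "sex":    ["F", "M", "F", "M", "F", "M"],
    "births": [10, 8, 12, 9, 3, 4],
    "year":   [1880, 1880, 1881, 1881, 1881, 1881],
})

# Route 1: pivot_table with years as rows and sex as columns
via_pivot = toy.pivot_table(values="births", index="year",
                            columns="sex", aggfunc="sum")

# Route 2: group by both keys, sum, then move sex into the columns
via_groupby = toy.groupby(["year", "sex"])["births"].sum().unstack()

same = bool((via_pivot.values == via_groupby.values).all())
print(via_groupby)
print("same totals:", same)
```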
In [13]:
# how to calculate the total births / year
names.groupby('year')['births'].sum().plot(title="total births by year")
Out[13]:
In [14]:
names.groupby(['year', 'sex'])['births'].sum().unstack().plot(title="births (M/F) by year")
Out[14]:
In [15]:
# from book: add prop to names
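def add_prop(group):
    # Cast to float first: integer division floors in Python 2
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group
# group_keys=False keeps the original row index instead of adding year/sex index levels
names = names.groupby(['year', 'sex'], group_keys=False).apply(add_prop)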
In [16]:
# verify prop --> all adds up to 1
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
Out[16]:
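As an aside, the same proportion can be computed without apply. The sketch below swaps in groupby(...).transform('sum'), which is a different technique from the book's add_prop function, and runs it on a toy frame with made-up numbers so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `names` (made-up numbers, one year only)
toy = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "William"],
    "sex":    ["F", "F", "M", "M"],
    "births": [30, 10, 15, 5],
    "year":   [1880, 1880, 1880, 1880],
})

# transform('sum') returns, for every row, the total of that row's year/sex group
group_totals = toy.groupby(["year", "sex"])["births"].transform("sum")
toy["prop"] = toy["births"] / group_totals

# Same sanity check as above: proportions add up to 1 within each group
prop_sums = toy.groupby(["year", "sex"])["prop"].sum()
print(toy)
print(np.allclose(prop_sums, 1))
```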
In [17]:
# number of records in full names dataframe
len(names)
Out[17]:
This section on the top1000 calculation is kept here to provide some inspiration on how to work with baby names.
In [18]:
# from book: useful to work with top 1000 for each year/sex combo
# can use groupby/apply
names.groupby(['year', 'sex']).apply(lambda g: g.sort_values(by='births', ascending=False)[:1000])
Out[18]:
In [19]:
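def get_top1000(group):
    # sort_values replaces the older sort_index(by=...) spelling
    return group.sort_values(by='births', ascending=False)[:1000]
# group_keys=False avoids year/sex appearing both as index levels and as columns
grouped = names.groupby(['year', 'sex'], group_keys=False)
top1000 = grouped.apply(get_top1000)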
top1000.head()
Out[19]:
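For comparison, here is a minimal sketch of a top-N-per-group selection that avoids apply: sort once, then take the head of each group. It uses a toy frame with made-up numbers and N = 2 instead of 1000; this is an alternative technique, not the book's approach.

```python
import pandas as pd

# Toy stand-in for `names` (made-up numbers)
toy = pd.DataFrame({
    "name":   ["Mary", "Anna", "Emma", "John", "William", "James"],
    "sex":    ["F", "F", "F", "M", "M", "M"],
    "births": [30, 10, 20, 15, 5, 25],
    "year":   [1880] * 6,
})

TOP_N = 2  # the notebook uses 1000

# Sort by births once, keep the first TOP_N rows of every year/sex group,
# then re-sort so the result reads group by group
top = (toy.sort_values("births", ascending=False)
          .groupby(["year", "sex"])
          .head(TOP_N)
          .sort_values(["year", "sex", "births"],
                       ascending=[True, True, False]))
print(top)
```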
In [20]:
# Do pivot table: row: year and cols= names for top 1000
top_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')
top_births.tail()
Out[20]:
In [21]:
# is your name in the top_births list?
top_births['Raymond'].plot(title='plot for Raymond')
Out[21]:
In [22]:
# for Aaden, which shows up at the end
top_births.Aaden.plot(xlim=[1880,2010])
Out[22]:
In [23]:
# number of names represented in top_births
len(top_births.columns)
Out[23]:
In [24]:
# how to get the most popular name of all time in top_births?
# Series.sort (in place) was removed from pandas; sort_values returns a sorted copy
most_common_names = top_births.sum().sort_values(ascending=False)
most_common_names.head()
Out[24]:
In [25]:
# as of mpld3 v 0.1 (2014.03.04), the name labeling doesn't work -- so disable mpld3 for this figure
mpld3.disable_notebook()
plt.figure()
most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))
Out[25]:
In [26]:
# turn mpld3 back on
mpld3.enable_notebook()
In [27]:
# instead of top_births -- get all_births
all_births = names.pivot_table('births', index='year', columns='name', aggfunc='sum')
In [28]:
all_births = all_births.fillna(0)
all_births.tail()
Out[28]:
In [29]:
# set up to do start/end calculation
all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)
In [30]:
all_births_cumsum.tail()
Out[30]:
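One way the cumulative sums could be used for a start/end calculation is sketched below. This is my guess at the intent, not something the notebook spells out: the first year a name appears is the first year its cumulative count is positive, and the last year it appears is the first year its cumulative count reaches its final total. The toy table and its numbers are made up.

```python
import pandas as pd

# rows = years, columns = names, like all_births (made-up numbers)
toy_births = pd.DataFrame(
    {"Alpha": [0, 5, 3, 0], "Beta": [2, 0, 0, 0], "Gamma": [0, 0, 1, 4]},
    index=pd.Index([1880, 1881, 1882, 1883], name="year"),
)
toy_cumsum = toy_births.cumsum()

# First year with any births: first row where the running total is > 0
first_year = (toy_cumsum > 0).astype(int).idxmax()

# Last year with any births: first row where the running total
# already equals the final total (nothing is added afterwards)
last_year = (toy_cumsum == toy_cumsum.iloc[-1]).astype(int).idxmax()

print(pd.DataFrame({"first": first_year, "last": last_year}))
```

idxmax returns the first row label at which the maximum occurs, which is why it picks out the first True in each column. A name with zero births in every year would need separate handling.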
In [31]:
# remind ourselves of what's in names
names.head()
Out[31]:
In [32]:
# columns in names
names.columns
Out[32]:
Some things to think about:
names_ambi and ambi_names_pt, which use a hierarchical index of name/sex vs. years.
Submit a notebook that describes what you've learned about the nature of ambigendered names in the baby names database. (Due date: Monday, March 10 at 11:5pm --> bCourses assignment to come.) I'm interested in seeing what you do with the data set in this regard. At a minimum, show that you are able to run Day_13_C_Baby_Names_MF_Completed. Be creative and have fun.
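To get you started, here is one possible sketch of such structures on a toy frame. The definitions are my assumptions for illustration only: names_ambi is taken to be the rows of names whose name occurs with both sexes, and ambi_names_pt a pivot table with a hierarchical (name, sex) row index and years as columns. The numbers are made up, and you may well prefer a stricter definition of "ambigendered" (for example, requiring both sexes in the same year).

```python
import pandas as pd

# Toy stand-in for `names` (made-up numbers)
toy = pd.DataFrame({
    "name":   ["Leslie", "Leslie", "Mary", "John",
               "Leslie", "Leslie", "Mary", "John"],
    "sex":    ["F", "M", "F", "M", "F", "M", "F", "M"],
    "births": [3, 7, 50, 40, 6, 4, 45, 42],
    "year":   [1880, 1880, 1880, 1880, 1881, 1881, 1881, 1881],
})

# Assumed definition: a name is ambigendered if it occurs with both sexes
sexes_per_name = toy.groupby("name")["sex"].nunique()
ambi = sexes_per_name[sexes_per_name == 2].index

# Hypothetical names_ambi: only the rows for ambigendered names
names_ambi = toy[toy["name"].isin(ambi)]

# Hypothetical ambi_names_pt: (name, sex) rows vs. year columns
ambi_names_pt = names_ambi.pivot_table(
    values="births", index=["name", "sex"], columns="year", aggfunc="sum"
)
print(ambi_names_pt)

# Share of births that are female, per name and year
totals = ambi_names_pt.groupby(level="name").sum()
female_share = ambi_names_pt.xs("F", level="sex") / totals
print(female_share)
```

Plotting a row of female_share over the years is one way to see whether a name drifts from one sex to the other.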