ReproduceIt is a series of articles that reproduces the results of data analysis articles, with a focus on open data and open code.

Today, as a small return of the ReproduceIt series, I try to reproduce a simple but nice data analysis and web app that braid.io did, called "Most Beyonces are 14 years old and most Kanyes are about 11".

The article analyses the trend of the names of some music artists (Beyonce, Kanye and Madonna) in the US. It also offers some nice possible explanations for the ups and downs over time; it's a quick read. The data comes from the Social Security Administration and can be downloaded from the SSA website: Beyond the Top 1000 Names

The data is very small, so loading it into pandas and plotting it with bokeh was very easy.


In [1]:
%matplotlib inline

In [2]:
import pandas as pd

In [3]:
import os

In [4]:
data_dir = os.path.expanduser("~/data/names/names")

In [5]:
files = os.listdir(data_dir)

In [6]:
data = pd.DataFrame(columns=["year", "name", "sex", "occurrences"])

In [7]:
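for fname in files:
    if fname.endswith(".txt"):
        fpath = os.path.join(data_dir, fname)
        # The files have no header row, so the column names are supplied here
        df = pd.read_csv(fpath, header=None, names=["name", "sex", "occurrences"])
        # Characters 3 to 6 of the file name hold the four-digit year
        df["year"] = int(fname[3:7])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        data = pd.concat([data, df])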

In [8]:
data.year = data.year.astype(int)

In [9]:
data.head()


Out[9]:
name occurrences sex year
0 Mary 7065 F 1880
1 Anna 2604 F 1880
2 Emma 2003 F 1880
3 Elizabeth 1939 F 1880
4 Minnie 1746 F 1880

In [10]:
data.shape


Out[10]:
(1825433, 4)

In [11]:
data.dtypes


Out[11]:
name            object
occurrences    float64
sex             object
year             int64
dtype: object
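
As a side note, the loading loop above grows the DataFrame one file at a time. A minimal sketch of an alternative, assuming the same file layout, is to collect the per-year frames in a list and concatenate once at the end. The `load_names` helper and the in-memory `sample_files` mapping are hypothetical, used only so the sketch runs without the real download, and the 1881 counts are made up.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the yearly files; the 1881 counts are made up.
sample_files = {
    "yob1880.txt": "Mary,F,7065\nAnna,F,2604\n",
    "yob1881.txt": "Mary,F,100\nAnna,F,50\n",
    "README.pdf": "",  # non-.txt entries are skipped, as in the loop above
}


def load_names(files):
    """Build one DataFrame from a mapping of file name -> CSV text."""
    frames = []
    for fname, text in sorted(files.items()):
        if not fname.endswith(".txt"):
            continue
        df = pd.read_csv(io.StringIO(text), header=None,
                         names=["name", "sex", "occurrences"])
        # Same slicing as above: characters 3 to 6 of the file name are the year
        df["year"] = int(fname[3:7])
        frames.append(df)
    # One concat at the end instead of one per file
    return pd.concat(frames, ignore_index=True)


data = load_names(sample_files)
print(data.dtypes)
```

Built this way, `occurrences` and `year` keep the integer dtype that `read_csv` infers, so the later `astype(int)` step would not be needed.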

Beyonce

Now that the data is in a simple DataFrame, we can just filter by the name we want and make a bar chart.


In [12]:
beyonce = data[data["name"] == "Beyonce"][["year", "occurrences"]]
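
The filter above is a boolean mask followed by a column selection. Since the files list each name once per sex per year, a name can have two rows for the same year. A small sketch of the same filter plus a per-year sum, on a made-up toy frame (all values here are illustrative, not real counts):

```python
import pandas as pd

# Toy data with made-up counts; note the two "Beyonce" rows for 2001 (one per sex)
data = pd.DataFrame({
    "name": ["Beyonce", "Kanye", "Beyonce", "Beyonce", "Madonna"],
    "sex": ["F", "M", "F", "M", "F"],
    "occurrences": [10, 5, 20, 5, 7],
    "year": [1999, 2004, 2001, 2001, 1985],
})

# Boolean mask: True for the rows whose name matches
mask = data["name"] == "Beyonce"

# Keep only the matching rows and the two columns needed for the chart
beyonce = data.loc[mask, ["year", "occurrences"]]

# Optional: collapse to one value per year by summing over both sexes
per_year = beyonce.groupby("year")["occurrences"].sum()
print(per_year)
```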

In [13]:
from bokeh.charts import ColumnDataSource, Bar, output_notebook, show

In [14]:
from bokeh.models import HoverTool

In [15]:
output_notebook()


Loading BokehJS ...

In [16]:
p = Bar(data=beyonce, label="year", values="occurrences", title="No. Babies named Beyoncé",
        color="#0277BD", ylabel='', tools="save,reset")
show(p)


Out[16]:

<Bokeh Notebook handle for In[16]>
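
The `bokeh.charts` import used above matches the Bokeh version this notebook was written with; later Bokeh releases no longer ship that module. A minimal sketch of a comparable bar chart using `bokeh.plotting`, assuming a recent Bokeh install and a made-up stand-in for the `beyonce` frame (the counts are illustrative only):

```python
import pandas as pd
from bokeh.plotting import figure

# Made-up stand-in for the filtered frame built earlier
beyonce = pd.DataFrame({
    "year": [1999, 2000, 2001],
    "occurrences": [10, 20, 30],
})

# Same title, colour and toolbar as the Bar chart above
p = figure(title="No. Babies named Beyoncé", tools="save,reset",
           x_axis_label="year")

# One vertical bar per year; width is given in x-axis units (years)
p.vbar(x=beyonce["year"].tolist(), top=beyonce["occurrences"].tolist(),
       width=0.8, color="#0277BD")

# In a notebook, call output_notebook() and then show(p) to render the chart.
```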

And that's it! Nothing crazy or big data this time, but a nice example of how to get something done in Python in 30 minutes. Go to the article page and you can search for your own name in a nice web app.