ReproduceIt is a series of articles that reproduce the results of published data analyses, with a focus on open data and open code.

Today, as a small return of the ReproduceIt series, I try to reproduce a simple but nice data analysis and webapp by braid.io called "Most Beyonces are 14 years old and most Kanyes are about 11".

The article analyses the trend in the popularity of the names of some music artists (Beyonce, Kanye and Madonna) in the US. It also offers some plausible explanations for the ups and downs over time, and it's a quick read. The data comes from the US Social Security Administration and can be downloaded from the SSA website: Beyond the Top 1000 Names

The data is very small, so loading it into pandas and plotting it with bokeh was very easy.



In [1]:

    
%matplotlib inline



In [2]:

    
import pandas as pd



In [3]:

    
import os



In [4]:

    
data_dir = os.path.expanduser("~/data/names/names")



In [5]:

    
files = os.listdir(data_dir)



In [6]:

    
data = pd.DataFrame(columns=["year", "name", "sex", "occurrences"])



In [7]:

    
for fname in files:
    if fname.endswith(".txt"):
        fpath = os.path.join(data_dir, fname)
        df = pd.read_csv(fpath, header=None, names=["name", "sex", "occurrences"])
        df["year"] = int(fname[3:7])
        data = data.append(df)
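
On recent pandas releases the loop above needs a small change, because `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. Here is a minimal sketch of the same loading step using `pd.concat`. The helper name `load_names` is hypothetical, and the sketch assumes the files are named like `yob1880.txt`, which is what the `fname[3:7]` slice implies.

```python
from pathlib import Path

import pandas as pd


def load_names(data_dir):
    """Read every yobYYYY.txt file in data_dir into one DataFrame."""
    frames = []
    # Assumption: only files matching yob*.txt hold name data.
    for path in sorted(Path(data_dir).glob("yob*.txt")):
        df = pd.read_csv(path, header=None, names=["name", "sex", "occurrences"])
        # For a stem such as "yob1880", characters 3 to 6 hold the year.
        df["year"] = int(path.stem[3:7])
        frames.append(df)
    # One concat at the end avoids re-copying a growing frame on every iteration.
    # Note: pd.concat raises ValueError if no matching files were found.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_names(os.path.expanduser("~/data/names/names"))` should return the same rows as the loop, but with a fresh 0..n-1 index and the columns in the order name, sex, occurrences, year.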



In [8]:

    
data.year = data.year.astype(int)



In [9]:

    
data.head()









    Out[9]:

            name  occurrences  sex  year
    0       Mary         7065    F  1880
    1       Anna         2604    F  1880
    2       Emma         2003    F  1880
    3  Elizabeth         1939    F  1880
    4     Minnie         1746    F  1880



In [10]:

    
data.shape









    Out[10]:





(1825433, 4)



In [11]:

    
data.dtypes









    Out[11]:





name            object
occurrences    float64
sex             object
year             int64
dtype: object

Beyonce

Now that the data is in a simple dataframe, we can just filter by the name we want and make a bar chart.



In [12]:

    
beyonce = data[data["name"] == "Beyonce"][["year", "occurrences"]]
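
The SSA files list each name once per sex, so a filter on the name alone can return two rows for the same year. The sketch below sums the counts per year and then finds the peak year, which is the kind of figure behind a headline about how old most bearers of a name are. The helper `counts_by_year` and the toy frame are hypothetical, and the numbers are invented for illustration only.

```python
import pandas as pd


def counts_by_year(data, name):
    """Return total occurrences of `name` per year, summed over both sexes."""
    selected = data[data["name"] == name]
    return selected.groupby("year")["occurrences"].sum().sort_index()


# Tiny made-up frame, for illustration only (not real SSA figures).
toy = pd.DataFrame({
    "name": ["Alex", "Alex", "Alex", "Sam"],
    "sex": ["F", "M", "M", "F"],
    "occurrences": [5, 7, 20, 9],
    "year": [2000, 2000, 2001, 2000],
})

alex = counts_by_year(toy, "Alex")
peak_year = alex.idxmax()  # year in which the name was given most often
```

Subtracting `peak_year` from the current year gives a rough estimate of the most common age for people with that name.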



In [13]:

    
from bokeh.charts import ColumnDataSource, Bar, output_notebook, show



In [14]:

    
from bokeh.models import HoverTool



In [15]:

    
output_notebook()









    





    
        



In [16]:

    
p = Bar(data=beyonce, label="year", values="occurrences", title="No. Babies named Beyoncé",
        color="#0277BD", ylabel='', tools="save,reset")
show(p)
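
The `bokeh.charts` module used above is not available in recent Bokeh releases, so the `Bar` call will not run there. A rough equivalent with `bokeh.plotting` is sketched below. This is not the original notebook's code: the helper `make_bar_chart` is hypothetical, the hover tooltips are an addition, and the toy numbers are invented.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure


def make_bar_chart(df, title):
    """Build a vertical bar chart of occurrences per year with hover tooltips."""
    source = ColumnDataSource(df)
    p = figure(title=title, tools="save,reset")
    # One bar per row: x position from "year", bar height from "occurrences".
    p.vbar(x="year", top="occurrences", width=0.8, color="#0277BD", source=source)
    p.add_tools(HoverTool(tooltips=[("year", "@year"), ("babies", "@occurrences")]))
    return p


# Made-up numbers, for illustration only (not real SSA figures).
toy = pd.DataFrame({"year": [2000, 2001, 2002], "occurrences": [10, 25, 18]})
chart = make_bar_chart(toy, "Babies per year (toy data)")
```

In a notebook, calling `output_notebook()` once and then `show(chart)` from `bokeh.io` should render the figure inline.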









    






    







    Out[16]:




<Bokeh Notebook handle for In[16]>

And that's it! Nothing crazy or big data this time, but a nice example of how to get something done in Python in 30 minutes. Go to the article page and you can search for your own name in a nice webapp.
