ReproduceIt is a series of articles that reproduces the results of data analysis articles, with a focus on open data and open code.
Today, as a small return of the ReproduceIt series, I try to reproduce a simple but nice data analysis and webapp by braid.io called "Most Beyonces are 14 years old and most Kanyes are about 11".
The article analyses the trend in the names of some music artists (Beyonce, Kanye and Madonna) in the US. It also offers some plausible explanations for the ups and downs over time, and it's a quick read. The data comes from the Social Security Administration and can be downloaded from the SSA website: Beyond the Top 1000 Names
The data is very small, so loading it into pandas and plotting it with bokeh was very easy.
In [1]:
%matplotlib inline
In [2]:
import pandas as pd
In [3]:
import os
In [4]:
data_dir = os.path.expanduser("~/data/names/names")
In [5]:
files = os.listdir(data_dir)
In [6]:
data = pd.DataFrame(columns=["year", "name", "sex", "occurrences"])
In [7]:
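for fname in files:
    if fname.endswith(".txt"):
        fpath = os.path.join(data_dir, fname)
        df = pd.read_csv(fpath, header=None, names=["name", "sex", "occurrences"])
        # The year is taken from characters 3-6 of the file name
        df["year"] = int(fname[3:7])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        data = pd.concat([data, df])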
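As a side note, growing a DataFrame inside a loop copies the accumulated data on every iteration. A minimal sketch of the more common idiom is to collect the per-file frames in a list and concatenate once at the end. The helper name `load_names` is hypothetical, and the `yobYYYY.txt` file naming is an assumption inferred from the `fname[3:7]` slice above.

```python
import os

import pandas as pd


def load_names(data_dir):
    """Read every .txt file in data_dir into one DataFrame (hypothetical helper)."""
    frames = []
    for fname in sorted(os.listdir(data_dir)):
        if not fname.endswith(".txt"):
            continue
        df = pd.read_csv(
            os.path.join(data_dir, fname),
            header=None,
            names=["name", "sex", "occurrences"],
        )
        # Assumes names like "yob1999.txt", so characters 3-6 are the year
        df["year"] = int(fname[3:7])
        frames.append(df)
    # One concatenation at the end; raises ValueError if no file matched
    return pd.concat(frames, ignore_index=True)
```

With this version the `year` column is already an integer, so the later `astype(int)` step would not be needed.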
In [8]:
data.year = data.year.astype(int)
In [9]:
data.head()
In [10]:
data.shape
In [11]:
data.dtypes
In [12]:
beyonce = data[data["name"] == "Beyonce"][["year", "occurrences"]]
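The same filter extends to the other names from the article. The sketch below is one possible way to do it, not the original article's method: the function name `counts_by_year` is hypothetical, and it sums over `sex` on the assumption that a name can be recorded for both sexes in the same year, which would otherwise give two rows per year.

```python
import pandas as pd


def counts_by_year(data, names):
    """Yearly totals for the given names, one column per name (hypothetical helper)."""
    subset = data[data["name"].isin(names)]
    return (
        subset.groupby(["year", "name"])["occurrences"]
        .sum()  # adds up the rows for both sexes within a year
        .unstack("name", fill_value=0)  # years as rows, names as columns
    )
```

A call such as `counts_by_year(data, ["Beyonce", "Kanye", "Madonna"])` would return one row per year, with 0 where a name has no rows for that year.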
In [13]:
from bokeh.charts import ColumnDataSource, Bar, output_notebook, show
In [14]:
from bokeh.models import HoverTool
In [15]:
output_notebook()
In [16]:
p = Bar(data=beyonce, label="year", values="occurrences", title="No. Babies named Beyoncé",
        color="#0277BD", ylabel='', tools="save,reset")
show(p)
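The `bokeh.charts` module used above is no longer part of recent bokeh releases, so the `Bar` call will not import there. A rough equivalent with `bokeh.plotting` might look like the sketch below. The function name `name_bar_chart` and the tooltip labels are hypothetical, and the result is not guaranteed to look exactly like the original chart. It also puts the `HoverTool` import from earlier to use.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure


def name_bar_chart(df, title):
    """Bar chart of occurrences per year (hypothetical helper).

    Expects a DataFrame with "year" and "occurrences" columns.
    """
    source = ColumnDataSource(df)
    p = figure(title=title, tools="save,reset")
    # One vertical bar per year, same colour as the original chart
    p.vbar(x="year", top="occurrences", width=0.8, color="#0277BD", source=source)
    # Show the year and the count when hovering over a bar
    p.add_tools(HoverTool(tooltips=[("year", "@year"), ("babies", "@occurrences")]))
    return p
```

In the notebook, after `output_notebook()`, this would be displayed with `show(name_bar_chart(beyonce, "No. Babies named Beyoncé"))`, where `show` is imported from `bokeh.plotting`.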
And that's it! Nothing crazy and no big data this time, but a nice example of how to get something done in Python in 30 minutes. Go to the article page and you can search for your own name in a nice webapp.