This example is drawn from Wes McKinney's excellent book on the Pandas library, O'Reilly's Python for Data Analysis.
We'll be taking a look at a freely available dataset: the Social Security Administration's database of names given to babies in the United States since 1880.
First things first, we need to download the data, which can be found at http://www.ssa.gov/oact/babynames/limits.html. If you uncomment the following commands, they will do this automatically (note that these are Linux shell commands; they will not work on Windows):
In [59]:
# !curl -O http://www.ssa.gov/oact/babynames/names.zip
In [60]:
# !mkdir -p data/names
# !mv names.zip data/names/
# !cd data/names/ && unzip names.zip
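If you're on Windows, or simply prefer to stay in Python, a rough cross-platform equivalent using only the standard library might look like this (a sketch, assuming Python 3; the URL and paths match the shell commands above):

import os
import zipfile
import urllib.request

# create the target directory, like `mkdir -p data/names`
os.makedirs('data/names', exist_ok=True)

# download the zip archive from the SSA site
url = 'http://www.ssa.gov/oact/babynames/names.zip'
zip_path = 'data/names/names.zip'
urllib.request.urlretrieve(url, zip_path)

# extract the per-year text files, like `unzip names.zip`
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall('data/names')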
Now we should have a data/names directory which contains a number of text files, one for each year of data:
In [61]:
!ls data/names
Let's take a quick look at one of these files:
In [62]:
!head data/names/yob1880.txt
Each file is just a comma-separated list of names, genders, and counts of babies given that name in that year; the first line of yob1880.txt, for instance, reads something like Mary,F,7065. We can load these files using pd.read_csv, which is specifically designed for this:
In [63]:
names1880 = pd.read_csv('data/names/yob1880.txt')
names1880.head()
Out[63]:
Oops! Something went wrong: read_csv tried to be smart, and used the first line of the file as column headers. Let's fix this by specifying the column names manually (passing names tells read_csv that the file has no header row):
In [64]:
names1880 = pd.read_csv('data/names/yob1880.txt',
                        names=['name', 'gender', 'births'])
names1880.head()
Out[64]:
That looks better. Now we can start playing with the data a bit.
First let's think about how we might count the total number of females and males born in the US in 1880. If you're used to NumPy, you might be tempted to use masking: first get a mask over all females & males, and then use it to select a subset of the data:
In [65]:
males = names1880[names1880.gender == 'M']
females = names1880[names1880.gender == 'F']
Now we can take the sum of the births for each of these:
In [66]:
males.births.sum(), females.births.sum()
Out[66]:
But there's an easier way to do this, using one of Pandas' very powerful features: groupby:
In [67]:
grouped = names1880.groupby('gender')
grouped
Out[67]:
This grouped object is now an abstract representation of the data, where the data is split on the given column. In order to actually do something with this data, we need to specify an aggregation operation to apply across each group. In this case, what we want is the sum:
In [68]:
grouped.sum()
Out[68]:
We can do other aggregations as well:
In [69]:
grouped.size()
Out[69]:
In [70]:
grouped.mean()
Out[70]:
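If we want several aggregates at once, the agg method accepts a list of operations; a quick sketch:

# sum, mean, and max of the births column, per gender
grouped.births.agg(['sum', 'mean', 'max'])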
Or, if we wish, we can get a description of the grouping:
In [71]:
grouped.describe()
Out[71]:
But here we've just been looking at a single year. Let's try to put together all the data from all the years. To do this, we'll have to use Pandas' concat function to concatenate all the data together. First we'll create a function which loads a single year of data, the same way we loaded the data above:
In [72]:
def load_year(year):
    data = pd.read_csv('data/names/yob{0}.txt'.format(year),
                       names=['name', 'gender', 'births'])
    data['year'] = year
    return data
Now let's load all the data into a list, and call pd.concat on that list:
In [73]:
names = pd.concat([load_year(year) for year in range(1880, 2014)])
names.head()
Out[73]:
It looks like we've done it!
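One subtlety worth knowing: each yearly file carries its own integer index starting from zero, so the concatenated frame has repeated index labels. If that ever matters for your analysis, pd.concat takes an ignore_index flag that builds a fresh index instead (a variation on the cell above):

names = pd.concat([load_year(year) for year in range(1880, 2014)],
                  ignore_index=True)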
Let's start with something easy: we'll use groupby again to see the total number of births per year:
In [74]:
births = names.groupby('year').births.sum()
births.head()
Out[74]:
We can use the plot() method to see a quick plot of these (note that because we used the %matplotlib inline magic at the start of the notebook, the resulting plot will be shown inline within the notebook).
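If you're running this section on its own rather than as part of the full notebook, you'll need a setup cell along these lines first (a guess at the original preamble):

import pandas as pd
%matplotlib inline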
In [75]:
births.plot();
The so-called "baby boom" generation after the Second World War is abundantly clear!
We can also use other aggregates: let's see how many names are used each year:
In [76]:
names.groupby('year').births.count().plot();
Apparently there's been a huge increase in the diversity of names over time!
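Note that count() here counts rows, i.e. distinct (name, gender) entries per year. If we wanted the number of truly distinct names, groupby's nunique method would do it (a sketch):

names.groupby('year')['name'].nunique().plot();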
groupby can also be used to add columns to the data: think of it as a view of the data that you're modifying. Let's add a column giving the frequency of each name within each year & gender:
In [77]:
def add_frequency(group):
    group['birth_freq'] = group.births / group.births.sum()
    return group
names = names.groupby(['year', 'gender']).apply(add_frequency)
names.head()
Out[77]:
Notice that the apply() function iterates over each group and calls a function which modifies the group. The result is then reconstructed into a container which looks like the original dataframe.
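An equivalent way to build the same column, without writing a custom function, is groupby's transform method, which returns a result aligned to the original index (a sketch, doing the same computation as add_frequency above):

# total births for each (year, gender) group, aligned to the original rows
totals = names.groupby(['year', 'gender']).births.transform('sum')
names['birth_freq'] = names.births / totals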
In [78]:
men = names[names.gender == 'M']
women = names[names.gender == 'F']   # the data codes female as 'F'
And then we could proceed as above, using groupby to group on the year. But we would end up with two different views of the data. A better way to do this is to use a pivot_table, which is essentially a groupby in multiple dimensions at once:
In [79]:
births = names.pivot_table('births',
                           index='year', columns='gender',
                           aggfunc='sum')
births.head()
Out[79]:
Note that this has grouped the index by the value of year, and grouped the columns by the value of gender.
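To see why a pivot table is "a groupby in multiple dimensions", note that we could have built the same table with a groupby over both columns followed by unstack, which moves an index level into the columns (a sketch):

# equivalent to the pivot_table above: group on both keys, sum,
# then pivot the gender level of the index into the columns
names.groupby(['year', 'gender']).births.sum().unstack('gender')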
Let's plot the results now:
In [80]:
births.plot(title='Total Births');
Pivot tables also make it easy to ask more pointed questions. For example, how has the gender of babies named Allison (or Alison) shifted over time?
In [81]:
names_to_check = ['Allison', 'Alison']
# filter on just the names we're interested in
births = names[names.name.isin(names_to_check)]
# pivot table to get year vs. gender
births = births.pivot_table('births', index='year', columns='gender')
# fill all NaNs with zeros
births = births.fillna(0)
# normalize each row (year) so the gender fractions sum to 1
births = births.div(births.sum(1), axis=0)
births.plot(title='Fraction of babies named Allison');
We can see that prior to about 1905, all babies named Allison were male. Over the 20th century this reversed, until by the end of the century nearly all Allisons were female!
There's some noise in this data: we can smooth it out a bit by using a 5-year rolling mean:
In [82]:
births.rolling(5).mean().plot(title="Allisons: 5-year moving average");
This gives a smoother picture of the transition, and is an example of the bias/variance tradeoff that we'll often see in modeling: a smoother model has less variance (variation due to sampling or other noise) but at the expense of more bias (the model systematically mis-represents the data slightly).
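To see the tradeoff concretely, we might compare a few window sizes on the female fraction (a sketch; the explicit matplotlib import is included in case it wasn't already loaded):

import matplotlib.pyplot as plt

# larger windows give smoother (higher-bias, lower-variance) curves
for window in [1, 5, 20]:
    births['F'].rolling(window).mean().plot(label='window = {0}'.format(window))
plt.legend()
plt.title('Female fraction of Allisons: effect of window size');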
We'll discuss this type of tradeoff more in coming sessions.
We've just scratched the surface of what can be done with Pandas, but we'll get a chance to play with this more in the breakout session coming up.
For more information on using Pandas, check out the pandas documentation or the book Python for Data Analysis by Pandas creator Wes McKinney.