R data-munging idioms and their equivalents in pandas/Python:
%in%: df.query('name in ["Andrew", "Andre"]') via link
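As a minimal sketch of the %in% idiom, assuming a small made-up frame (the `toy` data and the `wanted` list below are hypothetical, not the babynames file), both `query` with `in` and `Series.isin` should select the same rows:

```python
import pandas as pd

# Hypothetical toy data standing in for the real babynames table.
toy = pd.DataFrame({
    "name": ["Andrew", "Andre", "Mary", "Joe"],
    "n": [10, 5, 20, 7],
})

wanted = ["Andrew", "Andre"]

# R: toy[toy$name %in% wanted, ]
via_query = toy.query("name in @wanted")   # @ refers to a local Python variable
via_isin = toy[toy["name"].isin(wanted)]   # boolean-mask form of the same filter

print(via_query)
```

The `query` form reads closer to the R original, while the `isin` form is easier to combine with other boolean masks.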
In [1]:
%qtconsole
In [2]:
%matplotlib inline
In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from ggplot import *
In [6]:
df = pd.read_csv("data/babynames.csv")
Looking at the first samples:
In [7]:
df.head()
Out[7]:
How many unique names are collected?
In [8]:
(df['name'].nunique(), df['name'].size)
Out[8]:
In [9]:
df['name'].nunique() / float(df['name'].size)
Out[9]:
So on average there are approx. 20 entries per name. Should that be equal to the length of the period in years?
In [10]:
df['year'].max() - df['year'].min()
Out[10]:
Not really. The period is much longer, which means that many names have no recorded entry (effectively n = 0) for many of the years.
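One way to check this, sketched on a hypothetical toy frame (the column names `name`, `year` and `n` follow the data above, the values are made up): count the distinct years recorded per name, then reindex one name over the full year range so that missing years show up as explicit zeros.

```python
import pandas as pd

# Hypothetical toy data: "Joe" has no row for 2002, "Ann" none for 2000 and 2002.
toy = pd.DataFrame({
    "name": ["Joe", "Joe", "Joe", "Ann", "Ann"],
    "year": [2000, 2001, 2003, 2001, 2003],
    "n": [5, 7, 6, 9, 8],
})

# Number of distinct years recorded for each name.
years_per_name = toy.groupby("name")["year"].nunique()

# Full range of years covered by the data (inclusive).
all_years = list(range(int(toy["year"].min()), int(toy["year"].max()) + 1))

# Make the gaps explicit for one name: years without a row get n = 0.
joe = (toy[toy["name"] == "Joe"]
       .set_index("year")["n"]
       .reindex(all_years, fill_value=0))

print(years_per_name)
print(joe)
```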
In [11]:
df['name'].isin(['Andrew']).value_counts()
Out[11]:
In [12]:
df.query('name == "Andrew"').shape
Out[12]:
In [13]:
ggplot(df.query('name == "Joe"'), aes(x = 'year', y = 'n')) + geom_point() + ggtitle("name: Joe")
Out[13]:
Don't forget that the names are recorded for both sexes.
In [14]:
ggplot(df.query('name == "Joe"'), aes(x = 'year', y = 'n', color = 'sex')) +\
geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Joe")
Out[14]:
Joe as a name for girls seems to be OK. What about Mary?
In [15]:
ggplot(df.query('name == "Mary"'), aes(x = 'year', y = 'n', color = 'sex')) +\
geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Mary")
Out[15]:
Mary as a name for boys? Let's subset a bit further with an n < 500 filter.
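A minimal sketch of this kind of combined filter, on made-up data: `query` accepts several conditions joined with `&`, while the boolean-mask form needs parentheses around each condition.

```python
import pandas as pd

# Hypothetical toy data, not the real babynames values.
toy = pd.DataFrame({
    "name": ["Mary", "Mary", "Mary", "Joe"],
    "sex": ["F", "M", "F", "M"],
    "n": [7000, 120, 450, 300],
})

# R/dplyr: filter(toy, name == "Mary", n < 500)
by_query = toy.query('name == "Mary" & n < 500')
by_mask = toy[(toy["name"] == "Mary") & (toy["n"] < 500)]

print(by_query)
```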
In [16]:
ggplot(df.query('name == "Mary" & n < 500'), aes(x = 'year', y = 'n', color = 'sex')) +\
geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Mary, n < 500")
Out[16]:
Now let's get the record where the number of boys named Mary was at its maximum.
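A sketch of looking up the maximum within a filtered subset, on hypothetical toy data: `Series.idxmax` returns the index label of the largest value, and because `query` keeps the original index, that label can be used directly with `.loc` on the full frame.

```python
import pandas as pd

# Hypothetical toy data, not the real babynames values.
toy = pd.DataFrame({
    "name": ["Mary", "Mary", "Joe", "Mary"],
    "sex": ["F", "M", "M", "M"],
    "year": [1950, 1950, 1950, 1960],
    "n": [60000, 200, 9000, 350],
})

# Filter first; the subset keeps the row labels of the original frame.
subset = toy.query('name == "Mary" & sex == "M"')

# Index label (not position) of the row with the largest n in the subset.
label = subset["n"].idxmax()

# Passing a list to .loc returns a one-row DataFrame rather than a Series.
record = toy.loc[[label]]
print(record)
```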
In [17]:
# idxmax returns the index label of the maximum; np.argmax returns a position in recent pandas
ind = df.query('name == "Mary" & sex == "M"')['n'].idxmax()
ind
Out[17]:
In [18]:
df.query('index == @ind')
Out[18]:
In [21]:
df.query('name in ["Andrew", "Andrey"]').shape
Out[21]:
In [31]:
anames = ['Andrew', 'Andrey', 'Andres', 'Andre', 'And']
In [32]:
df.query('name in @anames').shape
Out[32]:
In [33]:
ggplot(df.query('name in @anames'), aes(x = 'year', y = 'n', color = 'name')) +\
geom_point(size = 10) + geom_smooth(span = 0.1) + facet_wrap("sex")
Out[33]:
In [43]:
ggplot(df.query('name in @anames & sex == "M"'), aes(x = 'year', y = 'n', color = 'name')) +\
geom_point(size = 10) + geom_smooth(span = 0.1, se = False) + scale_y_log(10) +\
ggtitle('Andre* male names ')
Out[43]:
In [75]:
(df.query('name in @anames')
.groupby(['name', 'sex'])
[['n']].sum())
Out[75]:
In [76]:
(df.query('name in @anames & sex == "M"')
.groupby(['name'])
[['n']].sum())
Out[76]:
In [90]:
# Nested-dict renaming in .agg() was removed in pandas 1.0; named aggregation replaces it
sf = (df.query('name in @anames & sex == "M"')
      .groupby(['name'])
      .agg(n_total=('n', 'sum'),
           n_max=('n', 'max'),
           prop_max=('prop', 'max')))
sf
Out[90]:
In [99]:
sf = (df.query('name in @anames & sex == "M"')
.groupby(['name'])
.apply(lambda x: sum(x.n)))
sf
Out[99]:
In [109]:
sf = (df.query('name in @anames & sex == "M"')
.groupby(['name'])
.apply(lambda x: pd.DataFrame({
'min': min(x.n),
'total': sum(x.n)}, index = x.index)))
sf
Out[109]:
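The last two cells differ in shape: the first returns one value per group, the second repeats the group statistics on every row. As a hedged sketch on made-up data, these correspond roughly to dplyr's `summarise` and `mutate` after `group_by`; in pandas, named aggregation in `agg` (pandas 0.25 or later) and `transform` are one way to express them. The column names `n_min` and `n_total` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not the real babynames values.
toy = pd.DataFrame({
    "name": ["Andre", "Andre", "Andrew", "Andrew", "Andrew"],
    "n": [10, 30, 100, 300, 200],
})

# summarise-like: one row per group, with explicitly named output columns.
summary = toy.groupby("name").agg(n_min=("n", "min"), n_total=("n", "sum"))

# mutate-like: group statistics broadcast back to every original row.
with_stats = toy.assign(
    n_min=toy.groupby("name")["n"].transform("min"),
    n_total=toy.groupby("name")["n"].transform("sum"),
)

print(summary)
print(with_stats)
```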