About

R data-munging idioms and their equvalents in pandas/python:

  • Subset with multiple-choise %in%:
    • R: `subset(df, name %in% c("Andrew", "Andre"))
    • python: df.query('name in ["Andrew", "Andre"]') via link

Set up


In [1]:
%qtconsole


/home/aziyatdinov/anaconda2/lib/python2.7/site-packages/IPython/parallel.py:13: ShimWarning: The `IPython.parallel` package has been deprecated. You should import from ipyparallel instead.
  "You should import from ipyparallel instead.", ShimWarning)

In [2]:
%matplotlib inline

Imports


In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from ggplot import *

Read data


In [6]:
df = pd.read_csv("data/babynames.csv")

Basic statistics

Looking at the first samples:


In [7]:
df.head()


Out[7]:
year sex name n prop
0 1880 F Mary 7065 0.072384
1 1880 F Anna 2604 0.026679
2 1880 F Emma 2003 0.020521
3 1880 F Elizabeth 1939 0.019866
4 1880 F Minnie 1746 0.017888

How many unique names are collected?


In [8]:
(df['name'].nunique(), df['name'].size)


Out[8]:
(92600, 1792091)

In [9]:
df['name'].nunique() / float(df['name'].size)


Out[9]:
0.05167148320035087

We might thhink that approx. 20 entries per name are collected. Should it be equal to the length of the years period?


In [10]:
df['year'].max() - df['year'].min()


Out[10]:
133

Not really. That means there are many zero entries in n column for many names.

Filter

By a single name


In [11]:
df['name'].isin(['Andrew']).value_counts()


Out[11]:
False    1791850
True         241
Name: name, dtype: int64

In [12]:
df.query('name == "Andrew"').shape


Out[12]:
(241, 5)

In [13]:
ggplot(df.query('name == "Joe"'), aes(x = 'year', y = 'n')) + geom_point() + ggtitle("name: Joe")


/home/aziyatdinov/anaconda2/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
Out[13]:
<ggplot: (-894627974)>

Don't forget the names are given for two genders.


In [14]:
ggplot(df.query('name == "Joe"'), aes(x = 'year', y = 'n', color = 'sex')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Joe")


/home/aziyatdinov/anaconda2/lib/python2.7/site-packages/ggplot/stats/stat_smooth.py:22: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  data = data.sort(['x'])
Out[14]:
<ggplot: (-894628044)>

Joe as a name for girls seems to be OK. What about Mary?


In [15]:
ggplot(df.query('name == "Mary"'), aes(x = 'year', y = 'n', color = 'sex')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Mary")


Out[15]:
<ggplot: (-895067122)>

Name Mary for boys? Let's do a bit more subsetting by n < 500 filter.


In [16]:
ggplot(df.query('name == "Mary" & n < 500'), aes(x = 'year', y = 'n', color = 'sex')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Mary, n < 500")


Out[16]:
<ggplot: (-895083044)>

Now let's get a record, where the number of female names Mary was the maximum.


In [17]:
ind = np.argmax(df.query('name == "Mary" & sex == "M"')['n'])
ind


Out[17]:
280385

In [18]:
df.query('index == @ind')


Out[18]:
year sex name n prop
280385 1930 M Mary 340 0.000301

By multiple names


In [21]:
df.query('name in ["Andrew", "Andrey"]').shape


Out[21]:
(317, 5)

In [31]:
anames = ['Andrew', 'Andrey', 'Andres', 'Andre', 'And']

In [32]:
df.query('name in @anames').shape


Out[32]:
(677, 5)

In [33]:
ggplot(df.query('name in @anames'), aes(x = 'year', y = 'n', color = 'name')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + facet_wrap("sex")


Out[33]:
<ggplot: (-894665920)>

In [43]:
ggplot(df.query('name in @anames & sex == "M"'), aes(x = 'year', y = 'n', color = 'name')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1, se = False) + scale_y_log(10) +\
    ggtitle('Andre* male names ')


Out[43]:
<ggplot: (-895057484)>

In [75]:
(df.query('name in @anames')
    .groupby(['name', 'sex'])
    [['n']].sum())


Out[75]:
n
name sex
Andre F 2191
M 116253
Andres F 439
M 75133
Andrew F 4835
M 1244667
Andrey F 100
M 1917

In [76]:
(df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    [['n']].sum())


Out[76]:
n
name
Andre 116253
Andres 75133
Andrew 1244667
Andrey 1917

In [90]:
sf = (df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    .agg({'n': {'total': sum, 'max': lambda x: x.max()},
        'prop': {'max': max}}))
sf


Out[90]:
prop n
max max total
name
Andre 0.001329 2509 116253
Andres 0.001259 2717 75133
Andrew 0.018572 36194 1244667
Andrey 0.000049 99 1917

In [99]:
sf = (df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    .apply(lambda x: sum(x.n)))
sf


Out[99]:
name
Andre      116253
Andres      75133
Andrew    1244667
Andrey       1917
dtype: int64

In [109]:
sf = (df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    .apply(lambda x: pd.DataFrame({
        'min': min(x.n), 
        'total': sum(x.n)}, index = x.index)))
sf


Out[109]:
min total
965 480 1244667
1426 5 75133
1757 5 116253
2961 480 1244667
3724 5 75133
4986 480 1244667
5659 5 75133
7139 480 1244667
7761 5 75133
9346 480 1244667
10172 5 75133
11669 480 1244667
12341 5 75133
14047 480 1244667
14626 5 75133
16467 480 1244667
17120 5 75133
19008 480 1244667
21664 480 1244667
22357 5 75133
24310 480 1244667
25017 5 75133
25282 5 116253
27003 480 1244667
29790 480 1244667
30718 5 75133
30832 5 116253
32710 480 1244667
33439 5 75133
35586 480 1244667
... ... ...
1539023 5 116253
1540443 5 1917
1573398 480 1244667
1573540 5 75133
1573602 5 116253
1575056 5 1917
1608215 480 1244667
1608361 5 75133
1608423 5 116253
1609923 5 1917
1642980 480 1244667
1643137 5 75133
1643184 5 116253
1644615 5 1917
1677280 480 1244667
1677447 5 75133
1677494 5 116253
1679016 5 1917
1711051 480 1244667
1711213 5 75133
1711274 5 116253
1712651 5 1917
1744829 480 1244667
1745002 5 75133
1745062 5 116253
1746393 5 1917
1778154 480 1244667
1778344 5 75133
1778400 5 116253
1779767 5 1917

434 rows × 2 columns


In [ ]: