About

R data-munging idioms and their equvalents in pandas/python:

Subset with multiple-choise %in%:
- R: `subset(df, name %in% c("Andrew", "Andre"))
- python: df.query('name in ["Andrew", "Andre"]') via link

Set up



In [1]:

    
%qtconsole









    



/home/aziyatdinov/anaconda2/lib/python2.7/site-packages/IPython/parallel.py:13: ShimWarning: The `IPython.parallel` package has been deprecated. You should import from ipyparallel instead.
  "You should import from ipyparallel instead.", ShimWarning)



In [2]:

    
%matplotlib inline

Imports



In [3]:

    
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from ggplot import *

Read data



In [6]:

    
df = pd.read_csv("data/babynames.csv")

Basic statistics

Looking at the first samples:



In [7]:

    
df.head()

How many unique names are collected?



In [8]:

    
(df['name'].nunique(), df['name'].size)









    Out[8]:





(92600, 1792091)



In [9]:

    
df['name'].nunique() / float(df['name'].size)









    Out[9]:





0.05167148320035087

We might thhink that approx. 20 entries per name are collected. Should it be equal to the length of the years period?



In [10]:

    
df['year'].max() - df['year'].min()









    Out[10]:





133

Not really. That means there are many zero entries in n column for many names.

Filter

By a single name



In [11]:

    
df['name'].isin(['Andrew']).value_counts()









    Out[11]:





False    1791850
True         241
Name: name, dtype: int64



In [12]:

    
df.query('name == "Andrew"').shape









    Out[12]:





(241, 5)



In [13]:

    
ggplot(df.query('name == "Joe"'), aes(x = 'year', y = 'n')) + geom_point() + ggtitle("name: Joe")









    



/home/aziyatdinov/anaconda2/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))






    












    Out[13]:





<ggplot: (-894627974)>

Don't forget the names are given for two genders.



In [14]:

    
ggplot(df.query('name == "Joe"'), aes(x = 'year', y = 'n', color = 'sex')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Joe")









    



/home/aziyatdinov/anaconda2/lib/python2.7/site-packages/ggplot/stats/stat_smooth.py:22: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  data = data.sort(['x'])






    












    Out[14]:





<ggplot: (-894628044)>

Joe as a name for girls seems to be OK. What about Mary?



In [15]:

    
ggplot(df.query('name == "Mary"'), aes(x = 'year', y = 'n', color = 'sex')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Mary")









    












    Out[15]:





<ggplot: (-895067122)>

Name Mary for boys? Let's do a bit more subsetting by n < 500 filter.



In [16]:

    
ggplot(df.query('name == "Mary" & n < 500'), aes(x = 'year', y = 'n', color = 'sex')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + ggtitle("name: Mary, n < 500")









    












    Out[16]:





<ggplot: (-895083044)>

Now let's get a record, where the number of female names Mary was the maximum.



In [17]:

    
ind = np.argmax(df.query('name == "Mary" & sex == "M"')['n'])
ind









    Out[17]:





280385



In [18]:

    
df.query('index == @ind')

By multiple names



In [21]:

    
df.query('name in ["Andrew", "Andrey"]').shape









    Out[21]:





(317, 5)



In [31]:

    
anames = ['Andrew', 'Andrey', 'Andres', 'Andre', 'And']



In [32]:

    
df.query('name in @anames').shape









    Out[32]:





(677, 5)



In [33]:

    
ggplot(df.query('name in @anames'), aes(x = 'year', y = 'n', color = 'name')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1) + facet_wrap("sex")









    












    Out[33]:





<ggplot: (-894665920)>



In [43]:

    
ggplot(df.query('name in @anames & sex == "M"'), aes(x = 'year', y = 'n', color = 'name')) +\
    geom_point(size = 10) + geom_smooth(span = 0.1, se = False) + scale_y_log(10) +\
    ggtitle('Andre* male names ')









    












    Out[43]:





<ggplot: (-895057484)>



In [75]:

    
(df.query('name in @anames')
    .groupby(['name', 'sex'])
    [['n']].sum())



In [76]:

    
(df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    [['n']].sum())



In [90]:

    
sf = (df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    .agg({'n': {'total': sum, 'max': lambda x: x.max()},
        'prop': {'max': max}}))
sf



In [99]:

    
sf = (df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    .apply(lambda x: sum(x.n)))
sf









    Out[99]:





name
Andre      116253
Andres      75133
Andrew    1244667
Andrey       1917
dtype: int64



In [109]:

    
sf = (df.query('name in @anames  & sex == "M"')
    .groupby(['name'])
    .apply(lambda x: pd.DataFrame({
        'min': min(x.n), 
        'total': sum(x.n)}, index = x.index)))
sf









    Out[109]:






  
    
      
      min
      total
    
  
  
    
      965
      480
      1244667
    
    
      1426
      5
      75133
    
    
      1757
      5
      116253
    
    
      2961
      480
      1244667
    
    
      3724
      5
      75133
    
    
      4986
      480
      1244667
    
    
      5659
      5
      75133
    
    
      7139
      480
      1244667
    
    
      7761
      5
      75133
    
    
      9346
      480
      1244667
    
    
      10172
      5
      75133
    
    
      11669
      480
      1244667
    
    
      12341
      5
      75133
    
    
      14047
      480
      1244667
    
    
      14626
      5
      75133
    
    
      16467
      480
      1244667
    
    
      17120
      5
      75133
    
    
      19008
      480
      1244667
    
    
      21664
      480
      1244667
    
    
      22357
      5
      75133
    
    
      24310
      480
      1244667
    
    
      25017
      5
      75133
    
    
      25282
      5
      116253
    
    
      27003
      480
      1244667
    
    
      29790
      480
      1244667
    
    
      30718
      5
      75133
    
    
      30832
      5
      116253
    
    
      32710
      480
      1244667
    
    
      33439
      5
      75133
    
    
      35586
      480
      1244667
    
    
      ...
      ...
      ...
    
    
      1539023
      5
      116253
    
    
      1540443
      5
      1917
    
    
      1573398
      480
      1244667
    
    
      1573540
      5
      75133
    
    
      1573602
      5
      116253
    
    
      1575056
      5
      1917
    
    
      1608215
      480
      1244667
    
    
      1608361
      5
      75133
    
    
      1608423
      5
      116253
    
    
      1609923
      5
      1917
    
    
      1642980
      480
      1244667
    
    
      1643137
      5
      75133
    
    
      1643184
      5
      116253
    
    
      1644615
      5
      1917
    
    
      1677280
      480
      1244667
    
    
      1677447
      5
      75133
    
    
      1677494
      5
      116253
    
    
      1679016
      5
      1917
    
    
      1711051
      480
      1244667
    
    
      1711213
      5
      75133
    
    
      1711274
      5
      116253
    
    
      1712651
      5
      1917
    
    
      1744829
      480
      1244667
    
    
      1745002
      5
      75133
    
    
      1745062
      5
      116253
    
    
      1746393
      5
      1917
    
    
      1778154
      480
      1244667
    
    
      1778344
      5
      75133
    
    
      1778400
      5
      116253
    
    
      1779767
      5
      1917
    
  

434 rows × 2 columns



In [ ]:

	year	sex	name	n	prop
0	1880	F	Mary	7065	0.072384
1	1880	F	Anna	2604	0.026679
2	1880	F	Emma	2003	0.020521
3	1880	F	Elizabeth	1939	0.019866
4	1880	F	Minnie	1746	0.017888

		n
name	sex
Andre	F	2191
Andre	M	116253
Andres	F	439
Andres	M	75133
Andrew	F	4835
Andrew	M	1244667
Andrey	F	100
Andrey	M	1917

	prop	n
	max	max	total
name
Andre	0.001329	2509	116253
Andres	0.001259	2717	75133
Andrew	0.018572	36194	1244667
Andrey	0.000049	99	1917

	min	total
965	480	1244667
1426	5	75133
1757	5	116253
2961	480	1244667
3724	5	75133
4986	480	1244667
5659	5	75133
7139	480	1244667
7761	5	75133
9346	480	1244667
10172	5	75133
11669	480	1244667
12341	5	75133
14047	480	1244667
14626	5	75133
16467	480	1244667
17120	5	75133
19008	480	1244667
21664	480	1244667
22357	5	75133
24310	480	1244667
25017	5	75133
25282	5	116253
27003	480	1244667
29790	480	1244667
30718	5	75133
30832	5	116253
32710	480	1244667
33439	5	75133
35586	480	1244667
...	...	...
1539023	5	116253
1540443	5	1917
1573398	480	1244667
1573540	5	75133
1573602	5	116253
1575056	5	1917
1608215	480	1244667
1608361	5	75133
1608423	5	116253
1609923	5	1917
1642980	480	1244667
1643137	5	75133
1643184	5	116253
1644615	5	1917
1677280	480	1244667
1677447	5	75133
1677494	5	116253
1679016	5	1917
1711051	480	1244667
1711213	5	75133
1711274	5	116253
1712651	5	1917
1744829	480	1244667
1745002	5	75133
1745062	5	116253
1746393	5	1917
1778154	480	1244667
1778344	5	75133
1778400	5	116253
1779767	5	1917