dfply example gallery

Working features only.


head and tail


In [1]:
from dfply import *

In [2]:
diamonds >> head(5)


Out[2]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

In [3]:
diamonds >> tail(3)


Out[3]:
carat cut color clarity depth table price x y z
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

groupby


In [4]:
diamonds >> groupby(X.cut) >> head(2)


Out[4]:
carat cut color clarity depth table price x y z
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
91 0.86 Fair E SI2 55.1 69.0 2757 6.45 6.33 3.52
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
11 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
6 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47

Ungrouping is performed with ungroup(). Operations prior to the ungrouping are done on groups.


In [5]:
diamonds >> groupby(X.cut) >> head(2) >> ungroup() >> head(5)


Out[5]:
carat cut color clarity depth table price x y z
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
91 0.86 Fair E SI2 55.1 69.0 2757 6.45 6.33 3.52
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43

Selecting and dropping

Selection functions also have corresponding drop functions.

Selection works with mixed symbolics (X.cut for example), indices, and strings.


In [6]:
diamonds >> select(X.carat, X.cut) >> head(2)


Out[6]:
carat cut
0 0.23 Ideal
1 0.21 Premium

In [7]:
diamonds >> select(0, X.color, 'depth') >> head(2)


Out[7]:
carat color depth
0 0.23 E 61.5
1 0.21 E 59.8

Arguments will be "flattened" in selection functions.


In [8]:
diamonds >> select(0, [X.color, [[X.depth]]]) >> head(2)


Out[8]:
carat color depth
0 0.23 E 61.5
1 0.21 E 59.8

Dropping:


In [9]:
diamonds >> drop(0,1,[4,5]) >> head(2)


Out[9]:
color clarity price x y z
0 E SI2 326 3.95 3.98 2.43
1 E SI1 326 3.89 3.84 2.31

Selection and dropping containing strings:


In [10]:
diamonds >> select_containing('c') >> head(2)


Out[10]:
carat cut color clarity price
0 0.23 Ideal E SI2 326
1 0.21 Premium E SI1 326

In [11]:
diamonds >> drop_containing('c') >> head(2)


Out[11]:
depth table x y z
0 61.5 55.0 3.95 3.98 2.43
1 59.8 61.0 3.89 3.84 2.31

Selection and dropping starting with and ending with strings


In [12]:
diamonds >> select_startswith('c') >> head(1)


Out[12]:
carat cut color clarity
0 0.23 Ideal E SI2

In [13]:
diamonds >> drop_startswith('c') >> head(1)


Out[13]:
depth table price x y z
0 61.5 55.0 326 3.95 3.98 2.43

In [14]:
diamonds >> select_endswith('t') >> head(1)


Out[14]:
carat cut
0 0.23 Ideal

In [15]:
diamonds >> drop_endswith('t') >> head(1)


Out[15]:
color clarity depth table price x y z
0 E SI2 61.5 55.0 326 3.95 3.98 2.43

Selecting ranges:

  • select_between
  • select_to
  • select_through

And their drop equivalents


In [16]:
diamonds >> select_between(X.depth, 'price') >> head(1)


Out[16]:
depth table price
0 61.5 55.0 326

In [17]:
diamonds >> select_to(X.x) >> head(1)


Out[17]:
carat cut color clarity depth table price
0 0.23 Ideal E SI2 61.5 55.0 326

In [18]:
diamonds >> select_through(X.x) >> head(1)


Out[18]:
carat cut color clarity depth table price x
0 0.23 Ideal E SI2 61.5 55.0 326 3.95

Transformation/variable creation

mutate creates variables


In [19]:
diamonds >> mutate(price_shift = X.price.shift(1)) >> head(5)


Out[19]:
carat cut color clarity depth table price x y z price_shift
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 NaN
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 326.0
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 326.0
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 327.0
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 334.0

Mutate can create multiple variables at once


In [20]:
diamonds >> mutate(price_shift = X.price.shift(1), depth_shift2 = X.depth.shift(2)) >> head(5)


Out[20]:
carat cut color clarity depth table price x y z depth_shift2 price_shift
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 NaN NaN
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 NaN 326.0
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 61.5 326.0
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 59.8 327.0
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 56.9 334.0

Mutate works with grouping


In [21]:
diamonds >> groupby(X.cut) >> mutate(price_shift = X.price.shift(1)) >> head(2)


Out[21]:
carat cut color clarity depth table price x y z price_shift
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49 NaN
91 0.86 Fair E SI2 55.1 69.0 2757 6.45 6.33 3.52 337.0
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 NaN
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 327.0
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 NaN
11 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46 326.0
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 NaN
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 326.0
5 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48 NaN
6 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47 336.0

transmute creates and selects


In [22]:
diamonds >> transmute(x_times_y=X.x*X.y, y_times_z=X.y*X.z) >> head(5)


Out[22]:
y_times_z x_times_y
0 9.6714 15.7210
1 8.8704 14.9376
2 9.4017 16.4835
3 11.1249 17.7660
4 11.9625 18.8790

In [23]:
diamonds >> groupby(X.cut) >> transmute(price_shift=X.price.shift(1)) >> head(5)


Out[23]:
cut price_shift
8 Fair NaN
91 Fair 337.0
97 Fair 2757.0
123 Fair 2759.0
124 Fair 2762.0
2 Good NaN
4 Good 327.0
10 Good 335.0
17 Good 339.0
18 Good 351.0
0 Ideal NaN
11 Ideal 326.0
13 Ideal 340.0
16 Ideal 344.0
39 Ideal 348.0
1 Premium NaN
3 Premium 326.0
12 Premium 334.0
14 Premium 342.0
15 Premium 345.0
5 Very Good NaN
6 Very Good 336.0
7 Very Good 336.0
9 Very Good 337.0
19 Very Good 338.0

Summarization

summarize creates summary variables


In [24]:
(diamonds >> 
 groupby(X.cut) >> 
 summarize(price_mean=np.mean(X.price),
           price_first=X.price.values[0]) >>
 head(2))


Out[24]:
cut price_first price_mean
0 Fair 337 4358.757764
1 Good 327 3928.864452
2 Ideal 326 3457.541970
3 Premium 326 4584.257704
4 Very Good 336 3981.759891

summarize_each applies summary functions to columns


In [25]:
import numpy as np

(diamonds >>
 groupby(X.color) >>
 summarize_each([np.mean, np.std, np.var], X.price, X.depth, X.x))


Out[25]:
color price_mean price_std price_var depth_mean depth_std depth_var x_mean x_std x_var
0 D 3169.954096 3356.343207 1.126504e+07 61.698125 1.410926 1.990711 5.417051 0.939261 0.882212
1 E 3076.752475 3343.988008 1.118226e+07 61.662090 1.444189 2.085681 5.411580 0.961054 0.923624
2 F 3724.886397 3784.793668 1.432466e+07 61.694582 1.437438 2.066228 5.614961 1.008615 1.017305
3 G 3999.135671 4050.923462 1.640998e+07 61.757111 1.369950 1.876762 5.677543 1.083054 1.173005
4 H 4486.669196 4215.690313 1.777204e+07 61.836850 1.443095 2.082522 5.983335 1.196928 1.432637
5 I 5091.874954 4721.952100 2.229683e+07 61.846385 1.459486 2.130100 6.222826 1.253941 1.572367
6 J 5323.818020 4437.396905 1.969049e+07 61.887215 1.549680 2.401507 6.519338 1.202455 1.445898

Subsetting

sample does the same thing as DataFrame.sample()


In [26]:
diamonds >> sample(n=4, replace=False)


Out[26]:
carat cut color clarity depth table price x y z
17458 1.01 Very Good F VS1 63.7 56.0 7001 6.27 6.33 4.01
17795 1.32 Very Good G SI2 61.6 57.0 7173 6.98 7.05 4.32
47840 0.53 Ideal F SI1 60.0 57.0 1908 5.30 5.27 3.17
16747 1.06 Premium E VS1 59.1 58.0 6669 6.66 6.71 3.95

In [27]:
diamonds >> sample(frac=0.0001, replace=True)


Out[27]:
carat cut color clarity depth table price x y z
6504 0.90 Very Good H VS2 62.1 60.0 4064 6.11 6.20 3.82
38693 0.37 Premium E VS2 61.1 57.0 1041 4.66 4.63 2.84
1083 0.70 Premium D VS2 60.9 57.0 2909 5.71 5.69 3.47
47222 0.51 Very Good E VS1 61.1 59.0 1839 5.14 5.20 3.16
33412 0.32 Good F VS2 63.6 57.0 828 4.37 4.34 2.77

distinct selects unique values


In [28]:
(diamonds >> select('depth') >> distict()).shape


Out[28]:
(184, 1)

row_slice selects rows


In [29]:
diamonds >> groupby(X.color) >> row_slice([1,7])


Out[29]:
carat cut color clarity depth table price x y z
34 0.23 Very Good D VS1 61.9 58.0 402 3.92 3.96 2.44
62 0.30 Ideal D SI1 62.5 57.0 552 4.29 4.32 2.69
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
32 0.23 Very Good E VS1 60.7 59.0 402 3.97 4.01 2.42
29 0.23 Very Good F VS1 60.9 57.0 357 3.96 3.99 2.42
75 0.26 Very Good F VVS2 59.2 60.0 554 4.19 4.22 2.49
51 0.23 Ideal G VS1 61.9 54.0 404 3.93 3.95 2.44
94 0.78 Very Good G SI2 63.8 56.0 2759 5.81 5.85 3.72
9 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
49 0.29 Very Good H SI2 60.7 60.0 404 4.33 4.37 2.64
6 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
52 0.32 Ideal I SI1 60.9 55.0 404 4.45 4.48 2.72
5 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
19 0.30 Very Good J SI1 62.7 59.0 351 4.21 4.27 2.66

mask filters with boolean arrays


In [30]:
diamonds >> mask(X.cut == 'Ideal') >> head(4)


Out[30]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
11 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46
13 0.31 Ideal J SI2 62.2 54.0 344 4.35 4.37 2.71
16 0.30 Ideal I SI2 62.0 54.0 348 4.31 4.34 2.68

In [31]:
diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)


Out[31]:
carat cut color clarity depth table price x y z
26683 0.33 Ideal E SI2 62.2 54.0 427 4.44 4.46 2.77
32297 0.34 Ideal E SI2 62.4 54.0 454 4.49 4.52 2.81
40928 0.30 Ideal E SI1 61.6 54.0 499 4.32 4.35 2.67
50623 0.30 Ideal E SI2 62.1 54.0 401 4.32 4.35 2.69
50625 0.30 Ideal E SI2 62.0 54.0 401 4.33 4.35 2.69

Reshaping

rename


In [32]:
diamonds >> rename(CUT=X.cut, CLARITY=X.clarity) >> head(2)


Out[32]:
carat CUT color CLARITY depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31

In [33]:
diamonds >> rename(CuT='cut') >> head(2)


Out[33]:
carat CuT color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31

arrange


In [34]:
diamonds >> arrange(X.table, X.price) >> head(10)


Out[34]:
carat cut color clarity depth table price x y z
11368 1.04 Ideal I VS1 62.9 43.0 4997 6.45 6.41 4.04
35633 0.29 Very Good E VS1 62.8 44.0 474 4.20 4.24 2.65
22701 0.30 Fair E SI1 64.5 49.0 630 4.28 4.25 2.75
5979 1.00 Fair I VS1 64.0 49.0 3951 6.43 6.39 4.10
7418 1.02 Fair F SI1 61.8 50.0 4227 6.59 6.51 4.05
25179 2.00 Fair H SI1 61.2 50.0 13764 8.17 8.08 4.97
3238 0.94 Fair H SI2 66.0 50.1 3353 6.13 6.17 4.06
33586 0.37 Premium F VS1 62.7 51.0 833 4.65 4.57 2.89
45798 0.51 Fair E VS2 65.5 51.0 1709 5.06 5.01 3.30
46040 0.57 Good H VS1 63.7 51.0 1728 5.36 5.29 3.39

In [35]:
diamonds >> arrange(X.table, X.price, ascending=False) >> head(10)


Out[35]:
carat cut color clarity depth table price x y z
24932 2.01 Fair F SI1 58.6 95.0 13387 8.32 8.31 4.87
50773 0.81 Fair F SI2 68.8 79.0 2301 5.26 5.20 3.58
51342 0.79 Fair G SI1 65.3 76.0 2362 5.52 5.13 3.35
52860 0.50 Fair E VS2 79.0 73.0 2579 5.21 5.18 4.09
52861 0.50 Fair E VS2 79.0 73.0 2579 5.21 5.18 4.09
51391 0.71 Fair D VS2 55.6 73.0 2368 6.01 5.96 3.33
49375 0.70 Fair H VS1 62.0 73.0 2100 5.65 5.54 3.47
44938 0.68 Fair G SI1 58.0 71.0 1633 5.85 5.70 3.35
23067 1.51 Fair E SI1 58.4 70.0 11102 7.55 7.39 4.36
777 0.75 Fair F VS1 55.8 70.0 2859 6.09 5.98 3.37

In [36]:
diamonds >> groupby(X.cut) >> arrange(X.depth) >> head(2)


Out[36]:
carat cut color clarity depth table price x y z
4518 1.00 Fair G SI1 43.0 59.0 3634 6.32 6.27 3.97
6341 1.00 Fair G VS2 44.0 53.0 4032 6.31 6.24 4.12
42804 0.53 Good D SI2 54.3 65.0 1352 5.46 5.51 2.98
36958 0.28 Good H VVS1 56.0 64.0 480 4.38 4.41 2.46
10377 1.09 Ideal J VS2 43.0 54.0 4778 6.53 6.55 4.12
32598 0.30 Ideal F VVS1 58.0 60.0 802 4.34 4.37 2.64
4360 1.01 Premium D SI2 58.0 60.0 3604 6.58 6.53 3.80
23911 1.70 Premium I VS1 58.0 60.0 12030 7.88 7.84 4.56
26650 2.03 Very Good H SI2 56.8 61.0 16442 8.52 8.42 4.81
49882 0.70 Very Good F SI2 56.9 62.0 2177 5.87 5.90 3.35

gather


In [37]:
diamonds >> gather('variable', 'value', ['price', 'depth','x','y','z']) >> head(5)


Out[37]:
carat cut color clarity table variable value
0 0.23 Ideal E SI2 55.0 price 326.0
1 0.21 Premium E SI1 61.0 price 326.0
2 0.23 Good E VS1 65.0 price 327.0
3 0.29 Premium I VS2 58.0 price 334.0
4 0.31 Good J SI2 58.0 price 335.0

In [38]:
diamonds >> gather('variable', 'value') >> head(5)


Out[38]:
variable value
0 carat 0.23
1 carat 0.21
2 carat 0.23
3 carat 0.29
4 carat 0.31

In [39]:
elongated = diamonds >> gather('variable', 'value', add_id=True)
elongated >> head(5)


Out[39]:
_ID variable value
0 0 carat 0.23
1 1 carat 0.21
2 2 carat 0.23
3 3 carat 0.29
4 4 carat 0.31

spread


In [40]:
widened = elongated >> spread(X.variable, X.value)
widened >> head(5)


Out[40]:
_ID carat clarity color cut depth price table x y z
0 0 0.23 SI2 E Ideal 61.5 326 55 3.95 3.98 2.43
1 1 0.21 SI1 E Premium 59.8 326 61 3.89 3.84 2.31
2 10 0.3 SI1 J Good 64 339 55 4.25 4.28 2.73
3 100 0.75 SI1 D Very Good 63.2 2760 56 5.8 5.75 3.65
4 1000 0.75 SI1 D Ideal 62.3 2898 55 5.83 5.8 3.62

In [41]:
widened.dtypes


Out[41]:
_ID         int64
carat      object
clarity    object
color      object
cut        object
depth      object
price      object
table      object
x          object
y          object
z          object
dtype: object

Convert keyword argument attempts to convert types


In [42]:
widened = elongated >> spread(X.variable, X.value, convert=True)
widened >> head(5)


Out[42]:
_ID carat clarity color cut depth price table x y z
0 0 0.23 SI2 E Ideal 61.5 326 55.0 3.95 3.98 2.43
1 1 0.21 SI1 E Premium 59.8 326 61.0 3.89 3.84 2.31
2 10 0.30 SI1 J Good 64.0 339 55.0 4.25 4.28 2.73
3 100 0.75 SI1 D Very Good 63.2 2760 56.0 5.80 5.75 3.65
4 1000 0.75 SI1 D Ideal 62.3 2898 55.0 5.83 5.80 3.62

In [43]:
widened.dtypes


Out[43]:
_ID          int64
carat      float64
clarity     object
color       object
cut         object
depth      float64
price        int64
table      float64
x          float64
y          float64
z          float64
dtype: object

Joining

inner_join


In [44]:
a = pd.DataFrame({
        'x1':['A','B','C'],
        'x2':[1,2,3]
    })
b = pd.DataFrame({
    'x1':['A','B','D'],
    'x3':[True,False,True]
})

In [45]:
a >> inner_join(b, by='x1')


Out[45]:
x1 x2 x3
0 A 1 True
1 B 2 False

outer_join aka full_join


In [46]:
a >> outer_join(b, by='x1')


Out[46]:
x1 x2 x3
0 A 1.0 True
1 B 2.0 False
2 C 3.0 NaN
3 D NaN True

In [47]:
a >> full_join(b, by='x1')


Out[47]:
x1 x2 x3
0 A 1.0 True
1 B 2.0 False
2 C 3.0 NaN
3 D NaN True

left_join


In [48]:
a >> left_join(b, by='x1')


Out[48]:
x1 x2 x3
0 A 1 True
1 B 2 False
2 C 3 NaN

right_join


In [49]:
a >> right_join(b, by='x1')


Out[49]:
x1 x2 x3
0 A 1.0 True
1 B 2.0 False
2 D NaN True

In [ ]: