In [3]:
import pandas as pd
import numpy as np
from dfply import *
Pandas has a function pd.crosstab which can generate a cross-tabulation of factors. Let's say we wanted to build a pipe function that wrapped around this. The docstring of the Pandas function is below:
Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed
Parameters
----------
index : array-like, Series, or list of arrays/Series
Values to group by in the rows
columns : array-like, Series, or list of arrays/Series
Values to group by in the columns
values : array-like, optional
Array of values to aggregate according to the factors.
Requires `aggfunc` be specified.
aggfunc : function, optional
If specified, requires `values` be specified as well
rownames : sequence, default None
If passed, must match number of row arrays passed
colnames : sequence, default None
If passed, must match number of column arrays passed
margins : boolean, default False
Add row/column margins (subtotals)
dropna : boolean, default True
Do not include columns whose entries are all NaN
normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
Normalize by dividing all values by the sum of values.
- If passed 'all' or `True`, will normalize over all values.
- If passed 'index' will normalize over each row.
- If passed 'columns' will normalize over each column.
- If margins is `True`, will also normalize margin values.
To keep it simple, let's build a reduced version of this that takes only:
- index
- columns
- values
- aggfunc

Below is a function that wraps around the call to pd.crosstab.
In [5]:
def crosstab(index, columns, values=None, aggfunc=None):
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)
In [6]:
diamonds.head(2)
Out[6]:
In [7]:
crosstab(diamonds.cut, diamonds.color)
Out[7]:
If you want your function to be part of a dfply pipe chain, the first argument must be a dataframe, which is implicitly passed through during the evaluation of the chain! We will need to redefine the function to have the implicit df passed in as the first argument.
The most common and straightforward way to convert a custom function to a dfply piping function is to use the @dfpipe decorator.
Note: the @dfpipe decorator is in fact a convenience decorator that stacks three dfply decorators together:
@pipe
@group_delegation
@symbolic_evaluation
@pipe ensures that the function will work in the dfply piping syntax and take an implicit DataFrame, @group_delegation makes the function work with groupings applied prior in the chain, and @symbolic_evaluation enables you to use and evaluate symbolic arguments like X.cut that are placeholders for incoming data.
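For reference, here is a sketch of what the stacked form would look like if you applied the three decorators yourself (functionally equivalent to the @dfpipe version defined next):

@pipe
@group_delegation
@symbolic_evaluation
def crosstab(df, index, columns, values=None, aggfunc=None):
    # df is the implicit DataFrame passed through the chain;
    # index/columns/values may arrive as symbolic X references
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)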
In [8]:
@dfpipe
def crosstab(df, index, columns, values=None, aggfunc=None):
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)
In [9]:
diamonds >> crosstab(X.cut, X.color)
Out[9]:
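Since the wrapper also passes along values and aggfunc, the same pipe can produce an aggregated cross-tabulation instead of a frequency table. A sketch (assuming you want, say, the mean price for each cut/color combination):

diamonds >> crosstab(X.cut, X.color, values=X.price, aggfunc=np.mean)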
Many tasks are simpler and do not require a function to work as a full pipe. The dfply window functions are common examples of this: functions that take a Series (or symbolic Series) and return a modified version.
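For instance, dfply's lag window function takes a symbolic Series and returns a shifted copy, so it can be used directly inside mutate; a quick sketch:

diamonds >> mutate(prev_price=lag(X.price, 1))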
Let's say we had a dataframe with dates represented by strings that we wanted to convert to pandas datetime objects using the pd.to_datetime function. Below is a tiny example dataframe with this issue.
In [10]:
sales = pd.DataFrame(dict(date=['7/10/17','7/11/17','7/12/17','7/13/17','7/14/17'],
                          sales=[1220, 1592, 908, 1102, 1395]))
sales
Out[10]:
In [11]:
sales.dtypes
Out[11]:
In pandas we would use the pd.to_datetime function to convert the strings to date objects, and add it as a new column like so:
In [12]:
sales['pd_date'] = pd.to_datetime(sales['date'], infer_datetime_format=True)
sales
Out[12]:
In [13]:
sales.drop('pd_date', axis=1, inplace=True)
What if you tried to use the pd.to_datetime function inside of a call to mutate, like so?
sales >> mutate(pd_date=pd.to_datetime(X.date, infer_datetime_format=True))
This will unfortunately break. The dfply functions are special in that they "know" to delay their evaluation until the data reaches that point in the chain. pd.to_datetime is not such a function: it will immediately try to evaluate X.date, and since that argument is a symbolic Intention object rather than a Series, it fails because it does not know what to do with it.
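You can verify this yourself: the symbolic reference is an Intention object, not a Series (the exact repr varies by dfply version):

type(X.date)
# -> dfply Intention placeholder; pd.to_datetime has nothing
#    concrete to convert until a dataframe is supplied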
Instead, we will need to make a wrapper around pd.to_datetime that can handle these symbolic arguments and delay evaluation until the right time.
This is quite simple: all you need to do is decorate a function with the @make_symbolic decorator, like so:
In [14]:
@make_symbolic
def to_datetime(series, infer_datetime_format=True):
    return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)
In [15]:
sales >> mutate(pd_date=to_datetime(X.date))
Out[15]:
And there you go: the evaluation is delayed until the dataframe reaches that point in the chain.
What's particularly nice about the @make_symbolic decorator is that it has no trouble working with non-symbolic arguments too. If we were to pass in the series itself, the function evaluates without a problem:
In [16]:
to_datetime(sales.date)
Out[16]:
Keep in mind, though, that if any of the arguments or keyword arguments are symbolic Intention objects, the return will itself be an Intention object representing the function awaiting evaluation by a dataframe:
In [17]:
to_datetime(X.date)
Out[17]:
In [19]:
awaiting = to_datetime(X.date)
awaiting.evaluate(sales)
Out[19]:
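Because awaiting is just an Intention object, it can also be dropped straight into a pipe chain, where the incoming dataframe triggers its evaluation; a sketch equivalent to the earlier mutate call:

sales >> mutate(pd_date=awaiting)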