In [3]:
import pandas as pd
import numpy as np

from dfply import *

Case #1: A custom pipe function


Pandas has a function pd.crosstab which can generate a cross-tabulation of factors. Let's say we want to build a pipe function that wraps around it. The docstring of the pandas function is below:

Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed

Parameters
----------
index : array-like, Series, or list of arrays/Series
    Values to group by in the rows
columns : array-like, Series, or list of arrays/Series
    Values to group by in the columns
values : array-like, optional
    Array of values to aggregate according to the factors.
    Requires `aggfunc` be specified.
aggfunc : function, optional
    If specified, requires `values` be specified as well
rownames : sequence, default None
    If passed, must match number of row arrays passed
colnames : sequence, default None
    If passed, must match number of column arrays passed
margins : boolean, default False
    Add row/column margins (subtotals)
dropna : boolean, default True
    Do not include columns whose entries are all NaN
normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
    Normalize by dividing all values by the sum of values.

    - If passed 'all' or `True`, will normalize over all values.
    - If passed 'index' will normalize over each row.
    - If passed 'columns' will normalize over each column.
    - If margins is `True`, will also normalize margin values.


To keep it simple, let's build a reduced version of this that takes only:

  • index
  • columns
  • values
  • aggfunc

Below is a function that wraps the call to pd.crosstab. We'll try it out on the diamonds dataset, which ships with dfply and is available via the wildcard import above.


In [5]:
def crosstab(index, columns, values=None, aggfunc=None):
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)

In [6]:
diamonds.head(2)


Out[6]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31

In [7]:
crosstab(diamonds.cut, diamonds.color)


Out[7]:
color D E F G H I J
cut
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Ideal 2834 3903 3826 4884 3115 2093 896
Premium 1603 2337 2331 2924 2360 1428 808
Very Good 1513 2400 2164 2299 1824 1204 678

For a function to work in a dfply pipe chain, its first argument must be a dataframe, which is passed in implicitly as the chain is evaluated. We therefore need to redefine the function so that the implicit df arrives as its first argument.

The most common and straightforward way to convert a custom function to a dfply piping function is to use the @dfpipe decorator.

Note: the @dfpipe decorator is in fact a convenience decorator that stacks three dfply decorators together:

@pipe
@group_delegation
@symbolic_evaluation

  • @pipe ensures the function works in the dfply piping syntax and receives the implicit DataFrame.
  • @group_delegation makes the function respect any groupings applied earlier in the chain.
  • @symbolic_evaluation lets you use symbolic arguments like X.cut that act as placeholders for the incoming data.
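The piping itself is plain Python operator overloading: a DataFrame has no >> handler of its own, so Python falls back to the right-hand object's reflected method. Below is a toy sketch of that idea — not dfply's actual implementation, just the minimal mechanism:

```python
import pandas as pd

class Pipe:
    """Toy stand-in for dfply's @pipe machinery: stores a function and
    its arguments, and overloads >> so the DataFrame on the left-hand
    side is injected as the first argument."""
    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, df):
        # df >> Pipe(func, ...) ends up calling func(df, ...)
        return self.func(df, *self.args, **self.kwargs)

def head(df, n):
    return df.head(n)

df = pd.DataFrame({'a': [1, 2, 3, 4]})
result = df >> Pipe(head, 2)   # equivalent to head(df, 2)
```

This is why the implicit DataFrame must be the first parameter: the reflected >> has only the left-hand object to hand over, and it hands it over in that position.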


In [8]:
@dfpipe
def crosstab(df, index, columns, values=None, aggfunc=None):
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)

In [9]:
diamonds >> crosstab(X.cut, X.color)


Out[9]:
color D E F G H I J
cut
Fair 163 224 312 314 303 175 119
Good 662 933 909 871 702 522 307
Ideal 2834 3903 3826 4884 3115 2093 896
Premium 1603 2337 2331 2924 2360 1428 808
Very Good 1513 2400 2164 2299 1824 1204 678
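The values/aggfunc pair we kept in the wrapper behaves just as it does in pd.crosstab itself: instead of counting rows per cell, it aggregates a value column. Here is a plain-pandas illustration on a tiny made-up dataframe (in the piped version you would write the same thing with symbolic arguments, e.g. X.price):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cut':   ['Fair', 'Fair', 'Good', 'Good', 'Good'],
    'color': ['D', 'E', 'D', 'D', 'E'],
    'price': [300, 350, 400, 500, 450],
})

# Default behavior: a frequency table of the two factors
counts = pd.crosstab(df.cut, df.color)

# With values/aggfunc: mean price per (cut, color) cell
means = pd.crosstab(df.cut, df.color, values=df.price, aggfunc=np.mean)
```

Here counts.loc['Good', 'D'] is 2 (two rows fall in that cell), while means.loc['Good', 'D'] is 450.0 (the mean of 400 and 500).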

Case #2: A function that works with symbolic arguments


Many tasks are simpler and do not require a full pipe function. The dfply window functions are common examples: functions that take a Series (or symbolic Series) and return a modified version of it.

Let's say we had a dataframe with dates represented by strings that we wanted to convert to pandas datetime objects using the pd.to_datetime function. Below is a tiny example dataframe with this issue.


In [10]:
sales = pd.DataFrame(dict(date=['7/10/17','7/11/17','7/12/17','7/13/17','7/14/17'],
                          sales=[1220, 1592, 908, 1102, 1395]))
sales


Out[10]:
date sales
0 7/10/17 1220
1 7/11/17 1592
2 7/12/17 908
3 7/13/17 1102
4 7/14/17 1395

In [11]:
sales.dtypes


Out[11]:
date     object
sales     int64
dtype: object

In pandas we would use the pd.to_datetime function to convert the strings to date objects, and add it as a new column like so:


In [12]:
sales['pd_date'] = pd.to_datetime(sales['date'], infer_datetime_format=True)
sales


Out[12]:
date sales pd_date
0 7/10/17 1220 2017-07-10
1 7/11/17 1592 2017-07-11
2 7/12/17 908 2017-07-12
3 7/13/17 1102 2017-07-13
4 7/14/17 1395 2017-07-14

In [13]:
sales.drop('pd_date', axis=1, inplace=True)

What if you tried to use the pd.to_datetime function inside of a call to mutate, like so?

sales >> mutate(pd_date=pd.to_datetime(X.date, infer_datetime_format=True))

This will unfortunately break. The dfply functions are special in that they "know" to delay their evaluation until the data reaches that point in the chain. pd.to_datetime is not such a function: it tries to evaluate X.date immediately, and since X.date is a symbolic Intention object rather than a Series, it fails.
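To see why delaying helps, it's worth picturing what a symbolic placeholder does. The toy classes below sketch the idea — a much-simplified stand-in for dfply's X and Intention, not the real implementation:

```python
import pandas as pd

class SymbolicAttr:
    """Toy placeholder: records an attribute name now, and looks it up
    on a real dataframe later, when .evaluate(df) is called."""
    def __init__(self, attr):
        self.attr = attr

    def evaluate(self, df):
        return getattr(df, self.attr)

class Symbolic:
    """Toy stand-in for X: any attribute access returns a placeholder
    instead of touching real data."""
    def __getattr__(self, attr):
        return SymbolicAttr(attr)

X_toy = Symbolic()
placeholder = X_toy.date            # no dataframe exists yet; nothing is evaluated

df = pd.DataFrame({'date': ['7/10/17', '7/11/17']})
series = placeholder.evaluate(df)   # only now is the column looked up
```

Passing `placeholder` straight into pd.to_datetime would fail for the same reason the mutate call above does: it is a placeholder object, not a Series, and pd.to_datetime has no idea it should call evaluate first.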

Instead, we will need to make a wrapper around pd.to_datetime that can handle these symbolic arguments and delay evaluation until the right time.

This is quite simple: all you need to do is decorate a function with the @make_symbolic decorator, like so:


In [14]:
@make_symbolic
def to_datetime(series, infer_datetime_format=True):
    return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)

In [15]:
sales >> mutate(pd_date=to_datetime(X.date))


Out[15]:
date sales pd_date
0 7/10/17 1220 2017-07-10
1 7/11/17 1592 2017-07-11
2 7/12/17 908 2017-07-12
3 7/13/17 1102 2017-07-13
4 7/14/17 1395 2017-07-14

And there you go: the evaluation is delayed until the dataframe reaches that point in the chain.

What's particularly nice about the @make_symbolic decorator is that it has no trouble with non-symbolic arguments either. If we pass in the series itself, the function evaluates immediately without a problem:


In [16]:
to_datetime(sales.date)


Out[16]:
0   2017-07-10
1   2017-07-11
2   2017-07-12
3   2017-07-13
4   2017-07-14
Name: date, dtype: datetime64[ns]

Keep in mind, though, that if any of the arguments or keyword arguments are symbolic Intention objects, the return value will itself be an Intention object representing the delayed function call, awaiting evaluation against a dataframe:


In [17]:
to_datetime(X.date)


Out[17]:
<dfply.base.Intention at 0x1199570f0>

In [19]:
awaiting = to_datetime(X.date)
awaiting.evaluate(sales)


Out[19]:
0   2017-07-10
1   2017-07-11
2   2017-07-12
3   2017-07-13
4   2017-07-14
Name: date, dtype: datetime64[ns]
