In [3]:
import pandas as pd
import numpy as np
from dfply import *
Pandas has a function pd.crosstab which can generate a cross-tabulation of factors. Let's say we wanted to build a pipe function that wrapped around this. The docstring of the Pandas function is below:
Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed
Parameters
----------
index : array-like, Series, or list of arrays/Series
Values to group by in the rows
columns : array-like, Series, or list of arrays/Series
Values to group by in the columns
values : array-like, optional
Array of values to aggregate according to the factors.
Requires `aggfunc` be specified.
aggfunc : function, optional
If specified, requires `values` be specified as well
rownames : sequence, default None
If passed, must match number of row arrays passed
colnames : sequence, default None
If passed, must match number of column arrays passed
margins : boolean, default False
Add row/column margins (subtotals)
dropna : boolean, default True
Do not include columns whose entries are all NaN
normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
Normalize by dividing all values by the sum of values.
- If passed 'all' or `True`, will normalize over all values.
- If passed 'index' will normalize over each row.
- If passed 'columns' will normalize over each column.
- If margins is `True`, will also normalize margin values.
To keep it simple, let's build a reduced version of this that takes only:
- index
- columns
- values
- aggfunc

Below is a function that wraps around the call to pd.crosstab.
In [5]:
def crosstab(index, columns, values=None, aggfunc=None):
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)
In [6]:
diamonds.head(2)
Out[6]:
In [7]:
crosstab(diamonds.cut, diamonds.color)
Out[7]:
If you want your function to be part of a dfply pipe chain, the first argument must be a dataframe, which is implicitly passed through during the evaluation of the chain! We will need to redefine the function to have the implicit df passed in as the first argument.
The most common and straightforward way to convert a custom function to a dfply piping function is to use the @dfpipe decorator.
Note: the @dfpipe decorator is in fact a convenience decorator that stacks three dfply decorators together:
@pipe
@group_delegation
@symbolic_evaluation
@pipe ensures that the function will work in the dfply piping syntax and take an implicit DataFrame, @group_delegation makes the function work with groupings applied prior in the chain, and @symbolic_evaluation enables you to use and evaluate symbolic arguments like X.cut that are placeholders for incoming data.
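For reference, here is a sketch of what the stacked form would look like if you applied the three decorators yourself (functionally equivalent to the @dfpipe version defined next):

@pipe
@group_delegation
@symbolic_evaluation
def crosstab(df, index, columns, values=None, aggfunc=None):
    # df is the implicit DataFrame passed through the chain;
    # index/columns/values may arrive as symbolic X references
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)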
In [8]:
@dfpipe
def crosstab(df, index, columns, values=None, aggfunc=None):
    return pd.crosstab(index, columns, values=values, aggfunc=aggfunc)
In [9]:
diamonds >> crosstab(X.cut, X.color)
Out[9]:
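Since the wrapper also passes along values and aggfunc, the same pipe can produce an aggregated cross-tabulation instead of a frequency table. A sketch (assuming you want, say, the mean price for each cut/color combination):

diamonds >> crosstab(X.cut, X.color, values=X.price, aggfunc=np.mean)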
Many tasks are simpler and do not require a function to work as a full pipe. The dfply window functions are common examples of this: functions that take a Series (or symbolic Series) and return a modified version.
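For instance, dfply's lag window function takes a symbolic Series and returns a shifted copy, so it can be used directly inside mutate; a quick sketch:

diamonds >> mutate(prev_price=lag(X.price, 1))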
Let's say we had a dataframe with dates represented by strings that we wanted to convert to pandas datetime objects using the pd.to_datetime function. Below is a tiny example dataframe with this issue.
In [10]:
sales = pd.DataFrame(dict(date=['7/10/17','7/11/17','7/12/17','7/13/17','7/14/17'],
                          sales=[1220, 1592, 908, 1102, 1395]))
sales
Out[10]:
In [11]:
sales.dtypes
Out[11]:
In pandas we would use the pd.to_datetime function to convert the strings to date objects, and add it as a new column like so:
In [12]:
sales['pd_date'] = pd.to_datetime(sales['date'], infer_datetime_format=True)
sales
Out[12]:
In [13]:
sales.drop('pd_date', axis=1, inplace=True)
What if you tried to use the pd.to_datetime function inside of a call to mutate, like so?
sales >> mutate(pd_date=pd.to_datetime(X.date, infer_datetime_format=True))
This will unfortunately break. The dfply functions are special in that they "know" to delay their evaluation until the data reaches that point in the chain. pd.to_datetime is not such a function: it will immediately try to evaluate X.date, and since that argument is a symbolic Intention object rather than a Series, it fails because it does not know what to do with it.
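You can verify this yourself: the symbolic reference is an Intention object, not a Series (the exact repr varies by dfply version):

type(X.date)
# -> dfply Intention placeholder; pd.to_datetime has nothing
#    concrete to convert until a dataframe is supplied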
Instead, we will need to make a wrapper around pd.to_datetime that can handle these symbolic arguments and delay evaluation until the right time.
This is quite simple: all you need to do is decorate a function with the @make_symbolic decorator, like so:
In [14]:
@make_symbolic
def to_datetime(series, infer_datetime_format=True):
    return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)
In [15]:
sales >> mutate(pd_date=to_datetime(X.date))
Out[15]:
And there you go: the evaluation is delayed until the dataframe reaches that point in the chain.
What's particularly nice about the @make_symbolic decorator is that it has no trouble working with non-symbolic arguments too. If we were to pass in the series itself, the function evaluates without a problem:
In [16]:
to_datetime(sales.date)
Out[16]:
Keep in mind, though, that if any of the arguments or keyword arguments are symbolic Intention objects, the return will itself be an Intention object representing the function awaiting evaluation by a dataframe:
In [17]:
to_datetime(X.date)
Out[17]:
In [19]:
awaiting = to_datetime(X.date)
awaiting.evaluate(sales)
Out[19]:
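Because awaiting is just an Intention object, it can also be dropped straight into a pipe chain, where the incoming dataframe triggers its evaluation; a sketch equivalent to the earlier mutate call:

sales >> mutate(pd_date=awaiting)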