While waiting for an airplane I made a small tool: a wrapper around pandas that gives dataframe manipulation a slightly more functional api. I wrote it for personal use and it sort of works, so I'd like to demonstrate it here. It may not be a package that I'll offer a lot of support for, but it contains some ideas that may be worth sharing.
Data should be a noun and any manipulations on it should be described with verbs. In R it is convenient to have those verbs be functions, because the language allows you to write global operators that chain functions together. In python it makes more sense to wrap them as methods on an object instead.
The idea behind the tool is to have a more minimal api that accommodates 80-90% of typical dataframe manipulations by attaching a few very useful composable verbs to a wrapped dataframe. This work is not meant to replace pandas, nor is it meant as a python port of popular R packages (though I'll gladly admit that a lot of ideas are bluntly copied from tidyr and dplyr).
The goal of this work is to show a proof of concept that demonstrates extra composability and readability for python based data manipulation; some performance is sacrificed as a result.
To explain what the tool does, we first need a dataset to work with.
In [1]:
import numpy as np
import pandas as pd
import kadro as kd
%matplotlib inline
np.random.seed(42)
n = 20
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': ['foo' if x > 0.5 else 'bar' for x in np.random.rand(n)],
    'd': ['fizz' if x > 0.6 else 'bo' for x in np.random.rand(n)]
})
df = df.sort_values(['c', 'd'])
print(df)
This is the data that we'll work with. We won't change the dataframe or its api; rather, we'll wrap it in an object that contains extra methods.
In [2]:
kf = kd.Frame(df)
kf
Out[2]:
This new object contains the original pandas dataframe as a child, which means that you can always access the original data via <obj_name>.df. You will always be able to work with pure pandas if need be, but a few methods are added to enhance readability and composability.
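As a minimal sketch (using the kf object created above; nothing beyond the .df attribute mentioned here is assumed):

original = kf.df             # the plain pandas dataframe inside the wrapper
print(type(original))        # a regular pandas.DataFrame
original.describe()          # from here on, any pandas method applies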
The added methods are: mutate, filter, slice, head, tail, select, rename, set_names, drop, group_by, ungroup, agg, sort, sample_n, pipe, gather, left_join, inner_join and plot.
If you're used to dplyr or Spark then you may recognize some of these functions. We'll demonstrate how they work one by one.
tibble.mutate
Mutate overwrites a column or creates a new one. You specify the name of the column via the name of the keyword argument, and you use a lambda function to describe its contents. The input of the function will be the dataframe contained in the wrapped object, which means that you can refer to columns but also use numpy to describe changes.
In [3]:
kf.mutate(e = lambda _: _['a'] + _['b']*2)
Out[3]:
The column is added via a lambda function that accepts the original dataframe, which allows you to refer to columns and to apply any function you'd like. You can also create multiple columns in a single mutate statement.
In [4]:
(kf
 .mutate(e = lambda _: _['a'] + _['b']*2,
         f = lambda _: np.sqrt(_['e']),
         a = lambda _: _['a'] / 2))
Out[4]:
You might appreciate that this method is somewhat lazy: you can refer to the first new column you've created when creating your second one, without needing to call mutate again. Notice that any numpy warnings you may trigger will still appear, but they won't cause the operation to fail.
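As a small illustration of that claim (the column name g is just a hypothetical example; column a contains negative values in our frame):

kf.mutate(g = lambda _: np.sqrt(_['a']))   # numpy emits a RuntimeWarning,
                                           # the result contains NaN, no exception is raised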
In [5]:
(kf
 .filter(lambda _: _['a'] > 0,
         lambda _: _['b'] > 0))
Out[5]:
Again you should notice a lazy structure: there is no need to call .filter multiple times, as you can apply multiple filters in a single step.
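For comparison, a rough plain pandas equivalent of the same two filters, using ordinary boolean indexing (this is not part of kadro):

df[(df['a'] > 0) & (df['b'] > 0)]   # both predicates squeezed into one expression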
In [6]:
kf.slice(2, 3, 10)
Out[6]:
In [7]:
kf.slice([2, 3, 10])
Out[7]:
In [8]:
kf.head(5)
Out[8]:
In [9]:
kf.head(5).tail(3)
Out[9]:
In [10]:
kf.select('b', 'c')
Out[10]:
In [11]:
kf.select(['b', 'c'])
Out[11]:
In [12]:
kf
Out[12]:
In [13]:
kf.rename({"aa":"a", "bb":"b"})
Out[13]:
In [14]:
kf.set_names(["a", "b", "c", "omg_d"])
Out[14]:
In [15]:
kf.drop("a", "b")
Out[15]:
In [16]:
kf.drop(["a", "b"])
Out[16]:
In [17]:
kf.group_by("c", "d")
Out[17]:
In [18]:
kf.agg(m_a = lambda _: np.mean(_['a']),
       v_b = lambda _: np.var(_['b']),
       cov_ab = lambda _: np.cov(_['a'], _['b'])[1,1])
Out[18]:
In [19]:
(kf
 .group_by("c", "d")
 .agg(m_a = lambda _: np.mean(_['a']),
      v_b = lambda _: np.var(_['b']),
      cov_ab = lambda _: np.cov(_['a'], _['b'])[1,1]))
Out[19]:
A few things to note: you can pass the .agg method any function that accepts a dataframe, and a single .agg call can produce multiple output columns that depend on different columns.
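For comparison, a rough plain pandas sketch of the grouped aggregation above, using an ordinary groupby/apply (the exact output layout may differ from what kadro gives you):

(df.groupby(['c', 'd'])
   .apply(lambda g: pd.Series({
       'm_a': np.mean(g['a']),
       'v_b': np.var(g['b']),
       'cov_ab': np.cov(g['a'], g['b'])[1, 1]}))
   .reset_index())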
In [20]:
kf.sort("a")
Out[20]:
Note that grouping a data structure has an effect on how it is sorted.
In [21]:
kf.group_by("c", "d").sort("a")
Out[21]:
In [22]:
kf.group_by("a")
Out[22]:
In [23]:
kf.group_by("a").ungroup()
Out[23]:
In [24]:
kf.sample_n(10)
Out[24]:
In [25]:
kf.sample_n(1000, replace=True).sort("a").head(5)
Out[25]:
In [26]:
def heavy_func(frame, colname, multy):
    # multiply a single column by a constant and hand the frame back;
    # .pipe passes the wrapped dataframe through such arbitrary functions
    frame[colname] = frame[colname] * multy
    return frame

(kf
 .pipe(heavy_func, colname = "b", multy = 2)
 .pipe(heavy_func, colname = "a", multy = 10))
Out[26]:
In [27]:
kf.gather('key', 'value')
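The gather verb reshapes the frame from a wide to a long format, much like tidyr's gather. Roughly, and only as an approximation (kadro's exact column handling may differ), this resembles a pandas melt:

pd.melt(df, var_name='key', value_name='value')   # every column becomes a key/value pair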
In [28]:
df_age = pd.DataFrame({
    'name': ['vincent', 'tim', 'anna'],
    'age': [28, 30, 25]
})
df_length = pd.DataFrame({
    'name': ['vincent', 'tim'],
    'length': [188, 172]
})
kd_age = kd.Frame(df_age)
kd_length = kd.Frame(df_length)
In [29]:
kd_age.left_join(kd_length)
Out[29]:
In [30]:
kd_age.inner_join(kd_length)
Out[30]:
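For reference, the corresponding plain pandas merges, assuming the join happens on the shared name column as the example suggests:

df_age.merge(df_length, on='name', how='left')    # like left_join
df_age.merge(df_length, on='name', how='inner')   # like inner_join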
In [31]:
kf.plot('a', 'b', kind = 'scatter')
Out[31]:
The nice thing about plotting is that this doesn't break your flow in kadro.
In [32]:
(kf
 .mutate(a = lambda _: _['a'] + _['b'])
 .filter(lambda _: _['a'] < 1)
 .plot('a', 'b', kind = 'scatter', title = 'foobar'))
Out[32]: