In [1]:
from dataframe import DataFrame
from dataframe import GroupedDataFrame
For demonstration purposes we also include some datasets (and regex
for parsing):
In [2]:
from sklearn import datasets
import re
iris_data = datasets.load_iris()
This loads the iris dataset from sklearn, a dataset that goes back to Ronald Fisher's 1936 paper. From the iris dataset, we take the feature names and the measured values for each feature and put them into a dictionary.
In [3]:
features = [re.sub(r"\s|cm|\(|\)", "", x) for x in iris_data.feature_names]
print(features)
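To see what the regular expression does without loading sklearn, here is a minimal sketch. The four names are written out by hand in the form sklearn reports them for the iris dataset; raw_names and clean are names chosen only for this illustration.

```python
import re

# Feature names as sklearn reports them for iris (typed in by hand here).
raw_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]


def clean(name):
    # Drop whitespace, the unit "cm" and both parentheses.
    return re.sub(r"\s|cm|\(|\)", "", name)


features = [clean(name) for name in raw_names]
print(features)
```

The raw string (r"...") keeps Python from interpreting the backslashes before the pattern reaches the re module.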
In [4]:
data = {name: iris_data.data[:, i] for i, name in enumerate(features)}
We also add the species of each sample:
In [5]:
data["target"] = iris_data.target
Now we can use the dictionary to create a DataFrame object:
In [6]:
frame = DataFrame(**data)
Notice that we use the **kwargs syntax to pass the dictionary entries as keyword arguments to the constructor. Alternatively you can call the constructor with explicit keyword arguments:
In [7]:
frame_expl = DataFrame(sepallength=iris_data.data[:, 0],
                       sepalwidth=iris_data.data[:, 1],
                       petallength=iris_data.data[:, 2],
                       petalwidth=iris_data.data[:, 3],
                       target=iris_data.target)
The results are the same; the second approach is just more verbose because we have to enter the arguments manually.
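The equivalence of the two call styles is plain Python behaviour and can be checked without the library. In this sketch make_columns is a hypothetical stand-in for any constructor that accepts columns as keyword arguments, and the column names and values are made up.

```python
def make_columns(**columns):
    # Hypothetical stand-in for a constructor taking keyword arguments.
    return dict(columns)


data = {"width": [3.5, 3.0], "length": [5.1, 4.9]}

# Unpacking the dictionary ...
unpacked = make_columns(**data)
# ... is the same as spelling out every keyword by hand.
explicit = make_columns(width=[3.5, 3.0], length=[5.1, 4.9])

print(unpacked == explicit)
```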
In [8]:
print("Frame kwargs:")
print(frame)
print("Frame verbose:")
print(frame_expl)
Note that upon instantiation the column names are sorted alphabetically.
In [9]:
sub_frame = frame.subset("target")
print(sub_frame)
aggregate takes one or multiple columns and computes an aggregation function on them. The aggregated values are returned in a new DataFrame object. Make sure that your aggregation function returns a scalar, e.g. a float. First we need to write a class that extends Callable and overrides __call__. Some basic functions are already implemented. For the sake of illustration, let's write a class that calculates the mean of a column:
In [10]:
from dataframe import Callable
import numpy

class Mean(Callable):
    def __call__(self, *args):
        # args[0] is the selected column; reduce its values to one scalar
        vals = args[0].values
        return numpy.mean(vals)
Now you can aggregate the frame like this:
In [11]:
print(frame)
agg_frame = frame.aggregate(Mean, "mean", "petallength")
print(agg_frame)
Note that all other columns are discarded here, because the DataFrame cannot know what you want to do with them.
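The contract for aggregation (a callable class that reduces a column to one scalar, with all other columns dropped) can be sketched in plain Python. This is not the library's implementation: Column, MeanOf and aggregate_columns are hypothetical stand-ins, and the only thing taken from the text is that the callable reads the values attribute of its first argument.

```python
class Column:
    # Hypothetical stand-in for a column object exposing .values
    def __init__(self, values):
        self.values = list(values)


class MeanOf:
    # Reduces one column to a single float.
    def __call__(self, *args):
        vals = args[0].values
        return sum(vals) / len(vals)


def aggregate_columns(table, func_class, new_name, *colnames):
    # Apply the callable to the selected columns; only the result is kept.
    cols = [Column(table[name]) for name in colnames]
    return {new_name: [func_class()(*cols)]}


table = {"petallength": [1.5, 1.5, 1.0, 2.0], "target": [0, 0, 0, 0]}
result = aggregate_columns(table, MeanOf, "mean", "petallength")
print(result)
```

The result holds only the new column, which mirrors why the unrelated columns disappear after aggregation.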
Similar to aggregate, we can modify one or several columns, too. To do that, we again have to write a class extending Callable. Beware that, unlike aggregation, modification has to return a list of the same length as the original column, i.e. your class has to return a list and not a scalar. For example:
In [12]:
print(len(frame["target"].values))
So if we call modify on a column in our frame, the result has to be of length 150.
As an example, let's standardize the column petallength.
In [13]:
import scipy.stats as sps

class Zscore(Callable):
    def __call__(self, *args):
        # Return one standardized value per row, so the length is unchanged
        vals = args[0].values
        return sps.zscore(vals).tolist()

mod_frame = frame.modify(Zscore, "zscore", "petallength")
print(mod_frame)
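Note that scipy computes different values than R does when standardizing. scipy.stats.zscore uses the population standard deviation by default (ddof=0), whereas R's sd and scale divide by n - 1; passing ddof=1 to zscore reproduces the R values.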
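The size of that gap can be checked with a sketch that uses only the standard library. The values are made up, and zscore below is a helper written for this illustration, not the scipy function; its ddof argument mirrors the meaning it has in scipy (0 divides by n, 1 divides by n - 1, as R's sd does).

```python
import statistics


def zscore(values, ddof=0):
    # ddof=0: population standard deviation; ddof=1: sample standard deviation.
    mean = statistics.mean(values)
    if ddof == 0:
        sd = statistics.pstdev(values)
    else:
        sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
population = zscore(values, ddof=0)
sample = zscore(values, ddof=1)

# Both results keep the length of the input, as modify requires.
print(len(population), len(sample))
print(population[0], sample[0])
```

The sample variant yields slightly smaller absolute z-scores because its standard deviation is larger; the difference shrinks as the number of rows grows.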
In [14]:
grouped_frame = frame.group("target")
print(grouped_frame)
In the table above, we created several groups. Visually you can distinguish a DataFrame from a GroupedDataFrame by the dashes when printing. We'll discuss using the GroupedDataFrame class in the next section.
Basically GroupedDataFrame has the same features as DataFrame, since both inherit from the same superclass ADataFrame. So the routines do the same things, only on every group and not on the whole DataFrame object. We start out with a plain DataFrame and work through all the important methods. Since they are the same methods as in DataFrame, I just show some examples.
In [15]:
sub_grouped_frame = grouped_frame.subset("petallength", "target")
print(sub_grouped_frame)
In [16]:
agg_grouped_frame = grouped_frame.aggregate(Mean, "mean", "petalwidth")
print(agg_grouped_frame)
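What aggregating per group amounts to can be sketched without the library. The snippet below is not how GroupedDataFrame is implemented; group_mean is a hypothetical helper and the six values are made up.

```python
from collections import defaultdict


def group_mean(keys, values):
    # Collect the values of each group, then reduce every group to its mean.
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


target = [0, 0, 1, 1, 2, 2]
petalwidth = [0.25, 0.75, 1.0, 2.0, 2.0, 3.0]
means = group_mean(target, petalwidth)
print(means)
```

Each group yields exactly one scalar, so the aggregated result has one row per group instead of one row overall.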
In [17]:
mod_grouped_frame = grouped_frame.modify(Zscore, "zscore", "petallength")
print(mod_grouped_frame)
In [18]:
twice_grouped_frame = grouped_frame.group("petallength")
print(twice_grouped_frame)
One of the many great features of the Unix command line is piping. For example:
grep -i "^daemon" /etc/passwd | sed 's/:/ /g' | cut -f1 -d' ' | tr -s 'dae' 'si'
(This is rather inefficient, but it works for the sake of demonstration.) To support a similar style in Python, we overloaded the >> operator such that instead of calling
frame.method(*args)
you can now alternatively write frame >> method(*args), or call the method as a function:
method(frame, *args)
So far this only works for the four main methods for dataframes (subset, ...). Below are a few examples.
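One way such an operator can be made to work in Python is the reflected-shift hook __rrshift__. The following is a sketch under that assumption and not the library's actual implementation; Table, Pipe, pipeable and this subset are hypothetical names written for the illustration.

```python
class Table:
    # Hypothetical minimal table: column names mapped to lists.
    def __init__(self, **columns):
        self.columns = columns


class Pipe:
    # Holds a function and its arguments until a table arrives on the left.
    def __init__(self, func, args):
        self.func = func
        self.args = args

    def __rrshift__(self, table):
        # Runs for `table >> pipe`, because Table defines no __rshift__.
        return self.func(table, *self.args)


def pipeable(func):
    def wrapper(*args):
        if args and isinstance(args[0], Table):
            return func(*args)   # direct call: subset(table, "a")
        return Pipe(func, args)  # deferred call: table >> subset("a")
    return wrapper


@pipeable
def subset(table, *names):
    return Table(**{name: table.columns[name] for name in names})


table = Table(a=[1, 2], b=[3, 4])
piped = table >> subset("a")
direct = subset(table, "a")
print(piped.columns, direct.columns)
```

Because every step returns a new Table, several such calls can be chained with >> from left to right.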
In [19]:
print(frame)
>> is implemented for the four dataframe methods group, subset, aggregate and modify. Let's first just subset the frame.
In [20]:
from dataframe import group, modify, subset, aggregate
obj = frame >> subset("target")
print(obj)
Or you can pass the frame directly to the function.
In [21]:
obj = subset(frame, "target")
print(obj)
Of course we can chain multiple times, too. Here we first group the data by the target column and then aggregate the groups using the mean:
In [22]:
obj = frame >> \
    group("target") >> \
    aggregate(Mean, "m", "sepallength")
print(obj)
Group the data again and then modify it by taking Z-scores:
In [23]:
obj = frame >> \
    group("target") >> \
    modify(Zscore, "zs", "petalwidth")
print(obj)
Finally a last example using all the methods:
In [24]:
obj = frame >> \
    subset("target", "petalwidth") >> \
    group("target") >> \
    modify(Zscore, "zs", "petalwidth") >> \
    aggregate(Mean, "m", "zs")
print(obj)