a dataframe for C++, based on xtensor and xtl

In [1]:
#include <string>
#include <iostream>

#include "xtensor/xrandom.hpp"
#include "xtensor/xmath.hpp"

#include "xframe/xio.hpp"
#include "xframe/xvariable.hpp"
#include "xframe/xvariable_view.hpp"
#include "xframe/xvariable_masked_view.hpp"
#include "xframe/xreindex_view.hpp"

Let's first define some useful type aliases to reduce the amount of typing.


In [2]:
using coordinate_type = xf::xcoordinate<xf::fstring>;
using variable_type = xf::xvariable<double, coordinate_type>;
using data_type = variable_type::data_type;

1. Variables

1.1. Creating a variable

In the following we define a 2D variable called dry_temperature. A variable in xframe combines tensor data with a coordinate system; it is the equivalent of DataArray in xarray. The tensor data can be any valid xtensor expression whose value_type is xoptional. Common types are xarray_optional, xtensor_optional and xoptional_assembly, the last of which builds an optional expression from existing regular tensor expressions.


In [3]:
data_type dry_temperature_data = xt::eval(xt::random::rand({6, 3}, 15., 25.));
dry_temperature_data(0, 0).has_value() = false;
dry_temperature_data(2, 1).has_value() = false;

In [4]:
dry_temperature_data


Out[4]:
    N/A  23.3501  24.6887
17.2103  18.0817  20.4722
16.8838      N/A  24.9646
24.6769  22.2584  24.8111
16.0986  22.9811  17.9703
15.0478  16.1246  21.3976
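
As a rough illustration of what an optional-valued tensor stores, the sketch below uses only the C++ standard library instead of xtensor: each element is a std::optional<double>, and resetting an element plays the role of has_value() = false. The names optional_grid, make_grid and count_missing are hypothetical and are not part of xframe or xtensor.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for an optional-valued 2D tensor (not an xtensor type).
using optional_grid = std::vector<std::vector<std::optional<double>>>;

// Build a rows x cols grid in which every element holds `fill`.
optional_grid make_grid(std::size_t rows, std::size_t cols, double fill)
{
    return optional_grid(rows, std::vector<std::optional<double>>(cols, fill));
}

// Count the elements that are flagged as missing.
std::size_t count_missing(const optional_grid& grid)
{
    std::size_t n = 0;
    for (const auto& row : grid)
    {
        for (const auto& value : row)
        {
            if (!value.has_value())
            {
                ++n;
            }
        }
    }
    return n;
}
```

With a 6 x 3 grid, calling reset() on elements (0, 0) and (2, 1) mirrors the two N/A entries shown in Out[4].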

Once the data is defined, we can define the coordinate system. A coordinate system maps dimension names to label axes. It is possible to create an axis from a vector of labels, then the coordinate system from a map of dimension names to axes, and finally the variable from this coordinate system and the previously created data. However, xframe also supports initializer-list syntax, so everything can be created in place in a very expressive way:


In [5]:
auto time_axis = xf::axis({"2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05", "2018-01-06"});

In [6]:
auto dry_temperature = variable_type(
    dry_temperature_data,
    {
        {"date", time_axis},
        {"city", xf::axis({"London", "Paris", "Brussels"})}
    }
);

In [7]:
dry_temperature


Out[7]:
             London    Paris  Brussels
2018-01-01      N/A  23.3501   24.6887
2018-01-02  17.2103  18.0817   20.4722
2018-01-03  16.8838      N/A   24.9646
2018-01-04  24.6769  22.2584   24.8111
2018-01-05  16.0986  22.9811   17.9703
2018-01-06  15.0478  16.1246   21.3976

1.2. Indexing and selecting data

Like xarray, xframe supports four different kinds of indexing as described below:

Dimension lookup: Positional - Index lookup: By integer


In [8]:
dry_temperature(3, 0)


Out[8]:
24.6769

Dimension lookup: Positional - Index lookup: By label


In [9]:
dry_temperature.locate("2018-01-04", "London")


Out[9]:
24.6769

Dimension lookup: By name - Index lookup: By integer


In [10]:
dry_temperature.iselect({{"date", 3}, {"city", 0}})


Out[10]:
24.6769

Dimension lookup: By name - Index lookup: By label


In [11]:
dry_temperature.select({{"date", "2018-01-04"}, {"city", "London"}})


Out[11]:
24.6769

Unlike their xarray counterparts, these methods return a single value; they cannot create views of the variable by selecting many data points. xframe does support such views through the free-function counterparts of the methods described above, which are covered in section 2.
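
The four lookups differ only in how the dimension and the position along it are identified. The following sketch shows that on a toy 2D container written with the standard library; toy_variable and its members are hypothetical and only borrow the method names used above, they are not xframe's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical 2D labelled container, stored row-major.
struct toy_variable
{
    std::vector<std::string> dims;                 // dimension names, in storage order
    std::vector<std::vector<std::string>> labels;  // labels[d] = labels of dimension d
    std::vector<double> data;                      // row-major values

    // Position of the dimension called `name`.
    std::size_t dim_pos(const std::string& name) const
    {
        auto it = std::find(dims.begin(), dims.end(), name);
        if (it == dims.end()) throw std::out_of_range("unknown dimension: " + name);
        return static_cast<std::size_t>(it - dims.begin());
    }

    // Position of `label` along dimension d.
    std::size_t label_pos(std::size_t d, const std::string& label) const
    {
        auto it = std::find(labels[d].begin(), labels[d].end(), label);
        if (it == labels[d].end()) throw std::out_of_range("unknown label: " + label);
        return static_cast<std::size_t>(it - labels[d].begin());
    }

    // Positional dimensions, integer indices.
    double at(std::size_t i, std::size_t j) const
    {
        return data[i * labels[1].size() + j];
    }

    // Positional dimensions, labels.
    double locate(const std::string& l0, const std::string& l1) const
    {
        return at(label_pos(0, l0), label_pos(1, l1));
    }

    // Named dimensions, integer indices.
    double iselect(const std::map<std::string, std::size_t>& idx) const
    {
        std::size_t pos[2] = {0, 0};
        for (const auto& kv : idx) pos[dim_pos(kv.first)] = kv.second;
        return at(pos[0], pos[1]);
    }

    // Named dimensions, labels.
    double select(const std::map<std::string, std::string>& sel) const
    {
        std::size_t pos[2] = {0, 0};
        for (const auto& kv : sel)
        {
            const std::size_t d = dim_pos(kv.first);
            pos[d] = label_pos(d, kv.second);
        }
        return at(pos[0], pos[1]);
    }
};
```

Because iselect and select identify dimensions by name, the order in which the dimensions are listed in the call does not matter in this sketch.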

1.3. Maths and broadcasting

Variables support all the common mathematical operations and functions; like in xtensor, these operations are lazy and return expressions. xframe supports operations on variables with different dimensions and labels thanks to broadcasting, which is performed according to dimension names rather than dimension positions, as shown below.

Let's first define a variable containing the relative humidity for cities:


In [12]:
data_type relative_humidity_data = xt::eval(xt::random::rand({3}, 50.0, 70.0));

auto relative_humidity = variable_type(
    relative_humidity_data,
    {
        {"city", xf::axis({"Paris", "London", "Brussels"})}
    }
);

relative_humidity


Out[12]:
Paris     67.5686
London    60.0733
Brussels  65.9586

We will use it and the previously defined dry_temperature variable (shown again below) to compute the water_vapour_pressure:


In [13]:
dry_temperature


Out[13]:
             London    Paris  Brussels
2018-01-01      N/A  23.3501   24.6887
2018-01-02  17.2103  18.0817   20.4722
2018-01-03  16.8838      N/A   24.9646
2018-01-04  24.6769  22.2584   24.8111
2018-01-05  16.0986  22.9811   17.9703
2018-01-06  15.0478  16.1246   21.3976

In [14]:
auto water_vapour_pressure = 0.01 * relative_humidity * 6.1 * xt::exp((17.27 * dry_temperature) / (237.7 + dry_temperature));

In [15]:
water_vapour_pressure


Out[15]:
              Paris   London  Brussels
2018-01-01  19.3174      N/A   20.4322
2018-01-02  13.9728  11.7596   15.8251
2018-01-03      N/A  11.5192   20.7708
2018-01-04   18.083  18.5961   20.5819
2018-01-05  18.8922  10.9587   13.5448
2018-01-06  12.3464   10.246   16.7499
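
To check one cell of this table by hand, the expression from In [14] can be applied to a single (humidity, temperature) pair. The sketch below does so with the standard library only; vapour_pressure is a hypothetical helper, and using std::optional to represent N/A is an assumption made for illustration rather than the way xframe propagates missing values.

```cpp
#include <cmath>
#include <optional>

// Scalar version of the expression used in In [14].
// A missing input yields a missing result, as in the N/A cells above.
std::optional<double> vapour_pressure(std::optional<double> relative_humidity,
                                      std::optional<double> dry_temperature)
{
    if (!relative_humidity.has_value() || !dry_temperature.has_value())
    {
        return std::nullopt;
    }
    const double rh = relative_humidity.value();
    const double t = dry_temperature.value();
    return 0.01 * rh * 6.1 * std::exp((17.27 * t) / (237.7 + t));
}
```

For Paris on 2018-01-01, vapour_pressure(67.5686, 23.3501) gives about 19.317, in agreement with the corresponding entry of Out[15]. Note that the Paris humidity is paired with the Paris temperature even though the two variables list the cities in a different order.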

The relative humidity has been broadcast, so its values are repeated for each date. When the labels of the variables involved in an operation are not the same, the result contains the intersection of the label sets:


In [16]:
data_type coeff_data = xt::eval(xt::random::rand({6, 3}, 0.7, 0.9));

auto coeff = variable_type(
    coeff_data,
    {
        {"date", time_axis},
        {"city", xf::axis({"London", "New York", "Brussels"})}
    }
);
coeff


Out[16]:
              London  New York  Brussels
2018-01-01  0.772259  0.742385  0.836272
2018-01-02  0.779748  0.848129  0.794952
2018-01-03  0.784418  0.734773  0.760383
2018-01-04  0.859456   0.76331  0.874486
2018-01-05  0.729823  0.898814  0.864381
2018-01-06  0.725037   0.85275  0.798118

In [17]:
auto res = coeff * dry_temperature;
res


Out[17]:
             London  Brussels
2018-01-01      N/A   20.6464
2018-01-02  13.4197   16.2744
2018-01-03   13.244   18.9827
2018-01-04  21.2088    21.697
2018-01-05  11.7491   15.5332
2018-01-06  10.9102   17.0778

1.4. Higher dimension variables

The following code creates and displays a three-dimensional variable.


In [18]:
data_type pressure_data = {{{ 1.,  2., 3. },
                            { 4.,  5., 6. },
                            { 7.,  8., 9. }},
                           {{ 1.3, 1.5, 1.},
                            { 2., 2.3, 2.4},
                            { 3.1, 3.8, 3.}},
                           {{ 8.5, 8.2, 8.6},
                            { 7.5, 8.6, 9.7},
                            { 4.5, 4.4, 4.3}}};

In [19]:
auto pressure = variable_type(
    pressure_data,
    {
        {"x", xf::axis(3)},
        {"y", xf::axis(3, 6, 1)},
        {"z", xf::axis(3)},
    }
);

In [20]:
pressure


Out[20]:
x  y    0    1    2
0  3    1    2    3
   4    4    5    6
   5    7    8    9
1  3  1.3  1.5    1
   4    2  2.3  2.4
   5  3.1  3.8    3
2  3  8.5  8.2  8.6
   4  7.5  8.6  9.7
   5  4.5  4.4  4.3

2. Views

2.1. Multiselection

Views can be used to select many data points in a variable. The syntax is similar to the one used for selecting a single data point, except that it uses free functions instead of methods of the variable.


In [21]:
dry_temperature


Out[21]:
             London    Paris  Brussels
2018-01-01      N/A  23.3501   24.6887
2018-01-02  17.2103  18.0817   20.4722
2018-01-03  16.8838      N/A   24.9646
2018-01-04  24.6769  22.2584   24.8111
2018-01-05  16.0986  22.9811   17.9703
2018-01-06  15.0478  16.1246   21.3976

Dimension lookup: Positional - Index lookup: By integer


In [22]:
auto v1 = ilocate(dry_temperature, xf::irange(0, 5, 2), xf::irange(1, 3));
v1


Out[22]:
              Paris  Brussels
2018-01-01  23.3501   24.6887
2018-01-03      N/A   24.9646
2018-01-05  22.9811   17.9703

Dimension lookup: Positional - Index lookup: By label


In [23]:
auto v2 = locate(dry_temperature, xf::range("2018-01-01", "2018-01-06", 2), xf::range("Paris", "Brussels"));
v2


Out[23]:
              Paris  Brussels
2018-01-01  23.3501   24.6887
2018-01-03      N/A   24.9646
2018-01-05  22.9811   17.9703

Dimension lookup: By name - Index lookup: By integer


In [24]:
auto v3 = iselect(dry_temperature, {{"city", xf::irange(1, 3)}, {"date", xf::irange(0, 5, 2)}});
v3


Out[24]:
              Paris  Brussels
2018-01-01  23.3501   24.6887
2018-01-03      N/A   24.9646
2018-01-05  22.9811   17.9703

Dimension lookup: By name - Index lookup: By label


In [25]:
auto v4 = select(dry_temperature, 
                 {{"city", xf::range("Paris", "Brussels")},
                  {"date", xf::range("2018-01-01", "2018-01-06", 2)}});
v4


Out[25]:
              Paris  Brussels
2018-01-01  23.3501   24.6887
2018-01-03      N/A   24.9646
2018-01-05  22.9811   17.9703
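
Judging from the outputs above, the integer ranges exclude their upper bound while the label ranges include their last label. The sketch below reproduces that behaviour with the standard library so the selected positions can be checked; index_range and label_range are hypothetical helpers and do not reflect how xf::irange and xf::range are implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Positions selected by a half-open integer range [first, last) with a step.
std::vector<std::size_t> index_range(std::size_t first, std::size_t last, std::size_t step = 1)
{
    std::vector<std::size_t> out;
    if (step == 0)
    {
        return out;  // a zero step would never terminate
    }
    for (std::size_t i = first; i < last; i += step)
    {
        out.push_back(i);
    }
    return out;
}

// Positions selected by a label range that includes both end labels.
// Returns an empty selection if either label is not on the axis.
std::vector<std::size_t> label_range(const std::vector<std::string>& axis,
                                     const std::string& first,
                                     const std::string& last,
                                     std::size_t step = 1)
{
    auto b = std::find(axis.begin(), axis.end(), first);
    auto e = std::find(axis.begin(), axis.end(), last);
    if (b == axis.end() || e == axis.end())
    {
        return {};
    }
    const auto begin_pos = static_cast<std::size_t>(b - axis.begin());
    const auto end_pos = static_cast<std::size_t>(e - axis.begin());
    return index_range(begin_pos, end_pos + 1, step);
}
```

On the date axis, both index_range(0, 5, 2) and label_range(dates, "2018-01-01", "2018-01-06", 2) select positions 0, 2 and 4, which is why v1 to v4 display the same three dates.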

2.2. Keeping and dropping labels

The previous selections made use of ranges (label range from xframe and index range from xtensor). It is also possible to select data points by explicitly specifying a list of labels to keep or to drop.

Dimension lookup: Positional - Index lookup: By integer


In [ ]:
auto v5 = ilocate(dry_temperature, xf::ikeep(0, 2, 4), xf::idrop(0));
v5

Dimension lookup: Positional - Index lookup: By label


In [ ]:
auto v6 = locate(dry_temperature, xf::keep("2018-01-01", "2018-01-03", "2018-01-05"), xf::drop("London"));
v6

Dimension lookup: By name - Index lookup: By integer


In [ ]:
auto v7 = iselect(dry_temperature, {{"city", xf::idrop(0)}, {"date", xf::ikeep(0, 2, 4)}});
v7

Dimension lookup: By name - Index lookup: By label


In [ ]:
auto v8 = select(dry_temperature,
                 {{"city", xf::drop("London")},
                  {"date", xf::keep("2018-01-01", "2018-01-03", "2018-01-05")}});
v8
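
Since the cells above have no recorded output, the following standard-library sketch shows which labels a keep or a drop would leave on an axis. keep_labels and drop_labels are hypothetical helpers; returning the labels in axis order is an assumption of this sketch, not a statement about xf::keep and xf::drop.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// True if `label` appears in `list`.
bool has_label(const std::vector<std::string>& list, const std::string& label)
{
    return std::find(list.begin(), list.end(), label) != list.end();
}

// Labels of `axis` that are listed in `wanted`, in axis order.
std::vector<std::string> keep_labels(const std::vector<std::string>& axis,
                                     const std::vector<std::string>& wanted)
{
    std::vector<std::string> out;
    for (const auto& label : axis)
    {
        if (has_label(wanted, label))
        {
            out.push_back(label);
        }
    }
    return out;
}

// Labels of `axis` that are not listed in `unwanted`, in axis order.
std::vector<std::string> drop_labels(const std::vector<std::string>& axis,
                                     const std::vector<std::string>& unwanted)
{
    std::vector<std::string> out;
    for (const auto& label : axis)
    {
        if (!has_label(unwanted, label))
        {
            out.push_back(label);
        }
    }
    return out;
}
```

Under these assumptions, dropping "London" from the city axis leaves Paris and Brussels, and keeping the three listed dates leaves 2018-01-01, 2018-01-03 and 2018-01-05, the same selection as in section 2.1.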

2.3. Masking views

Masking views select data points based on conditions expressed on labels. These conditions can be complicated boolean expressions.


In [ ]:
pressure

In [ ]:
auto masked_pressure = xf::where(
    pressure,
    not_equal(pressure.axis<int>("x"), 2) && pressure.axis<int>("z") < 2
);

In [ ]:
masked_pressure

When assigning to a masking view, masked values are not changed. Like other views, a masking view is a proxy on its underlying expression: no copy is made, so changing an unmasked value actually changes the corresponding value in the underlying expression.


In [ ]:
masked_pressure = 1.;
masked_pressure

In [ ]:
pressure
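
The assignment semantics described in this section can be sketched with the standard library: values are written in place, and only where the condition holds. The cube alias and assign_where are hypothetical; positions are used directly as labels, which matches the x and z axes of pressure (labels 0, 1, 2) but is otherwise an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 3 x 3 x 3 block of values, indexed as data[x][y][z].
using cube = std::array<std::array<std::array<double, 3>, 3>, 3>;

// Assign `value` in place to every element for which keep(x, y, z) is true.
// Elements for which the predicate is false are left untouched.
// Returns the number of elements written.
template <class Pred>
std::size_t assign_where(cube& data, Pred keep, double value)
{
    std::size_t written = 0;
    for (std::size_t x = 0; x < 3; ++x)
    {
        for (std::size_t y = 0; y < 3; ++y)
        {
            for (std::size_t z = 0; z < 3; ++z)
            {
                if (keep(x, y, z))
                {
                    data[x][y][z] = value;
                    ++written;
                }
            }
        }
    }
    return written;
}
```

With the predicate x != 2 && z < 2, this writes 2 x 3 x 2 = 12 of the 27 elements; everything with x equal to 2 or z equal to 2 keeps its previous value.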

2.4. Reindexing views

Reindexing views give a variable a new set of coordinates for the corresponding dimensions. Like other views, no copy is involved. Asking for values corresponding to new labels not found in the original set of coordinates returns missing values. In the next example, we reindex the city dimension.


In [ ]:
dry_temperature

In [ ]:
auto temp = reindex(dry_temperature, {{"city", xf::axis({"London", "New York", "Brussels"})}});
temp
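
The lookup rule behind reindexing can be sketched on a single row with the standard library. Unlike an xframe reindexing view, this hypothetical reindex_row helper copies the values instead of acting as a proxy; it only illustrates that labels missing from the original axis map to missing values.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Look up every label of new_axis in old_axis and return the matching values.
// Labels that are not found in old_axis give a missing value.
// Assumes `values` has one entry per label of `old_axis`.
std::vector<std::optional<double>> reindex_row(const std::vector<std::string>& old_axis,
                                               const std::vector<std::optional<double>>& values,
                                               const std::vector<std::string>& new_axis)
{
    std::vector<std::optional<double>> out;
    out.reserve(new_axis.size());
    for (const auto& label : new_axis)
    {
        auto it = std::find(old_axis.begin(), old_axis.end(), label);
        if (it == old_axis.end())
        {
            out.push_back(std::nullopt);  // new label: no value available
        }
        else
        {
            out.push_back(values[static_cast<std::size_t>(it - old_axis.begin())]);
        }
    }
    return out;
}
```

Reindexing the 2018-01-02 row (London 17.2103, Paris 18.0817, Brussels 20.4722) on the axis London, New York, Brussels gives 17.2103, a missing value, and 20.4722 in this sketch.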

reindex_like is a shortcut that reindexes a variable using the set of coordinates of another variable:


In [ ]:
auto dry_temp2 = variable_type(
    dry_temperature_data,
    {
        {"date", time_axis},
        {"city", xf::axis({"London", "New York", "Brussels"})}
    }
);
auto temp2 = reindex_like(dry_temperature, dry_temp2);
temp2
