While vaex.ml does not yet implement its own predictive models, we provide wrappers to powerful libraries (e.g. Scikit-learn, xgboost) and make them work efficiently with vaex. vaex.ml does implement a variety of standard data transformers (e.g. PCA, numerical scalers, categorical encoders) and a very efficient KMeans algorithm, all of which take full advantage of vaex.
The following is a simple example of the use of vaex.ml. We will be using the well known Iris dataset, and we will use it to build a model which distinguishes between the three Iris species (Iris setosa, Iris virginica and Iris versicolor).
Let's start by importing the common libraries, loading the data, and inspecting it.
In [1]:
import vaex
import vaex.ml
import matplotlib.pyplot as plt
df = vaex.ml.datasets.load_iris()
df
Out[1]:
Splitting the data into train and test sets should be done immediately, before any manipulation is done on the data. vaex.ml contains a train_test_split method which creates shallow copies of the main DataFrame, meaning that no extra memory is used when defining the train and test sets. Note that the train_test_split method does an ordered split of the main DataFrame to create the two sets, so in some cases one may need to shuffle the data first.
If shuffling is required, we recommend the following:
df.export("shuffled", shuffle=True)
df = vaex.open("shuffled.hdf5)
df_train, df_test = df.ml.train_test_split(test_size=0.2)
In the present scenario, the dataset is already shuffled, so we can simply do the split right away.
In [2]:
# Ordered split into train and test sets
df_train, df_test = df.ml.train_test_split(test_size=0.2)
As this is a very simple tutorial, we will just use the columns already provided as features for training the model.
In [3]:
features = df_train.column_names[:4]
features
Out[3]:
The vaex.ml module contains several classes for dataset transformations that are commonly used to pre-process data prior to building a model. These include numerical feature scalers, category encoders, and PCA transformations. We have adopted the scikit-learn API, meaning that all transformers have the .fit and .transform methods.
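For illustration, here is a minimal sketch of that API using the StandardScaler transformer (an aside, not used in the rest of this tutorial):
# Minimal sketch: scale the features with the StandardScaler transformer,
# following the same .fit/.transform pattern as scikit-learn
scaler = vaex.ml.StandardScaler(features=features)
scaler.fit(df_train)
df_scaled = scaler.transform(df_train)  # adds the scaled features as virtual columns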
Let's apply a PCA transformation to the training set. There is no need to scale the data beforehand, since the PCA also normalizes the data.
In [4]:
pca = vaex.ml.PCA(features=features, n_components=4)
df_train = pca.fit_transform(df_train)
df_train
Out[4]:
The result of the .fit_transform method is a shallow copy of the DataFrame which contains the resulting columns of the transformation, in this case the PCA components, as virtual columns. This means that the transformed DataFrame takes up no extra memory at all! So while this example is made with only 120 samples, it would work in the same way even for millions or billions of samples.
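To convince yourself of this, you can inspect the virtual columns of the transformed DataFrame (an illustrative aside):
# Virtual columns store expressions rather than data;
# this shows the expression behind each PCA component
df_train.virtual_columns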
Now let's train a gradient boosting model. While vaex.ml does not currently include this type of model, we support the popular boosted tree libraries xgboost, lightgbm, and catboost. In this tutorial we will use the lightgbm classifier.
In [9]:
import lightgbm
import vaex.ml.sklearn
# Features on which to train the model
train_features = df_train.get_column_names(regex='PCA_.*')
# The target column
target = 'class_'
# Instantiate the LightGBM Classifier
booster = lightgbm.sklearn.LGBMClassifier(num_leaves=5,
                                          max_depth=5,
                                          n_estimators=100,
                                          random_state=42)
# Make it a vaex transformer (for the automagic pipeline and lazy predictions)
model = vaex.ml.sklearn.SKLearnPredictor(features=train_features,
                                         target=target,
                                         model=booster,
                                         prediction_name='prediction')
# Train and predict
model.fit(df=df_train)
df_train = model.transform(df=df_train)
df_train
Out[9]:
Notice that after training the model, we use the .transform method to obtain a shallow copy of the DataFrame which contains the predictions of the model in the form of a virtual column. This makes it easy to evaluate the model and to create various diagnostic plots. If required, one can call the .predict method, which will result in an in-memory numpy.array housing the predictions.
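For example, a minimal sketch of materializing the predictions on the training set (assuming .predict takes a DataFrame, just like the wrapper's .fit and .transform):
# Evaluate the model and return the predictions as an in-memory numpy array
predictions = model.predict(df_train)
predictions[:5]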
Assuming we are happy with the performance of the model, we can continue and apply our transformations and model to the test set. Unlike with other libraries, we do not need to explicitly create a pipeline here in order to propagate the transformations. In fact, with vaex and vaex.ml, a pipeline is automatically created as one explores the data. Each vaex DataFrame contains a state, which is a (serializable) object containing information on all transformations applied to the DataFrame (filtering, creation of new virtual columns, transformations).
Recall that the outputs of both the PCA transformation and the boosted model are in fact virtual columns, and thus are stored in the state of df_train. All we need to do is apply this state to another similar DataFrame (e.g. the test set), and all the changes will be propagated.
In [6]:
state = df_train.state_get()
df_test.state_set(state)
df_test
Out[6]:
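Because the state is serializable, it can also be written to disk and applied in a completely different session. A minimal sketch, where 'my_state.json' is just an example file name:
# Write the state (virtual columns, transformations, model) to disk ...
df_train.state_write('my_state.json')
# ... and apply it to a similar DataFrame, e.g. in another session
df_test.state_load('my_state.json')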
In [7]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true=df_test.class_.values, y_pred=df_test.prediction.values)
acc *= 100.
print(f'Test set accuracy: {acc}%')
The model gets a perfect accuracy of 100%. This is not surprising, as this problem is rather easy: applying a PCA transformation on the features nicely separates the three flower species. Plotting the first two PCA components and colouring the samples according to their class already shows an almost perfect separation.
In [8]:
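# Scatter plot of the first two PCA components, coloured by class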
plt.figure(figsize=(8, 4))
df_test.scatter(df_test.PCA_0, df_test.PCA_1, c_expr=df_test.class_, s=50)
plt.show()