Vaex is a Python library for lazy out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion ($10^9$) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero memory copy policy, and lazy computations for best performance (no memory wasted).
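To make the memory-mapping idea concrete, here is a minimal sketch using only the Python standard library. It is not how vaex is implemented; the file name `column.bin` and the variable names are made up for the illustration. It writes a column of doubles to disk and then reads a few values through a memory map, so only the pages that are touched are brought into memory.

```python
import array
import mmap
import os
import tempfile

# Write a column of 100,000 float64 values to a temporary file
# (a stand-in for a column stored on disk; the file name is hypothetical).
n = 100_000
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    array.array("d", range(n)).tofile(f)

with open(path, "rb") as f:
    # Map the whole file read-only; no data is read up front.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # View the mapped bytes as doubles without copying them.
    column = memoryview(mapped).cast("d")
    n_rows = len(column)
    first_rows = column[:3].tolist()  # only these values are copied out
    last_value = column[-1]
    # Release the view before closing the map.
    column.release()
    mapped.close()

os.remove(path)
```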
ds.mean<tab>, feels very similar to Pandas.
Lean: separated into multiple packages:
vaex-core: DataFrame and core algorithms, takes numpy arrays as input columns.
vaex-hdf5: Provides memory mapped numpy arrays to a DataFrame.
vaex-arrow: Arrow support for cross language data sharing.
vaex-viz: Visualization based on matplotlib.
vaex-jupyter: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet.
vaex-astro: Astronomy related transformations and FITS file support.
vaex-server: Provides a server to access a DataFrame remotely.
vaex-distributed: (Proof of concept) combines multiple servers / cluster into a single DataFrame for distributed computations.
vaex-qt: Program written using Qt GUI.
vaex: Meta package that installs all of the above.
vaex-ml: Machine learning.
Jupyter integration: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and Jupyter lab.
Using conda:
conda install -c conda-forge vaex
Using pip:
pip install --upgrade vaex
Or read the detailed instructions
We assume that you have installed vaex, and are running a Jupyter notebook server. We start by importing vaex and asking it to give us an example dataset.
In [1]:
import vaex
df = vaex.example() # open the example dataset provided with vaex
Instead, you can download some larger datasets, or read in your own CSV file.
In [2]:
df # will pretty print the DataFrame
Out[2]:
Using square brackets [], we can easily filter or get different views on the DataFrame.
In [3]:
df_negative = df[df.x < 0] # easily filter your DataFrame, without making a copy
df_negative[:5][['x', 'y']] # take the first five rows, and only the 'x' and 'y' columns (no memory copy!)
Out[3]:
When dealing with huge datasets, say a billion rows ($10^9$), computations with the data can waste memory: a single new column of 8-byte floats takes 8 GB. Instead, vaex uses lazy computation, storing only a representation of the computation, and computations are done on the fly when needed. You can use many of the numpy functions, as if the column were a normal array.
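The idea of storing a computation instead of its result can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not vaex's implementation: the `Column` and `Lazy` classes and the `mean` helper are hypothetical names, and plain lists stand in for arrays. Adding two columns creates a small object that remembers the operation, and the values are produced chunk by chunk only when a statistic asks for them.

```python
import operator


class Column:
    """A real column: holds its data in memory (a plain list here)."""

    def __init__(self, values):
        self.values = values

    def evaluate(self, start, stop):
        return self.values[start:stop]

    def __add__(self, other):
        # Build a description of the sum; nothing is computed here.
        return Lazy(operator.add, self, other)


class Lazy(Column):
    """Stores only the operation and its operands, never a result array."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, start, stop):
        a = self.left.evaluate(start, stop)
        b = self.right.evaluate(start, stop)
        return [self.op(p, q) for p, q in zip(a, b)]


def mean(expr, length, chunk_size=4):
    """Evaluate the expression one chunk at a time and average the values."""
    total = 0.0
    for start in range(0, length, chunk_size):
        stop = min(start + chunk_size, length)
        total += sum(expr.evaluate(start, stop))
    return total / length


x = Column([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
z = Column([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
r = x + z                    # an expression; no sums exist yet
r_mean = mean(r, length=6)   # values are computed here, chunk by chunk
```

In this sketch, at most `chunk_size` results exist at any moment, so the memory cost of `r` does not grow with the number of rows.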
In [4]:
import numpy as np
# creates an expression (nothing is computed)
some_expression = df.x + df.z
some_expression # for convenience, we print out some values
Out[4]:
These expressions can be added to a DataFrame, creating what we call a virtual column. These virtual columns are similar to normal columns, except they do not waste memory.
In [5]:
df['r'] = some_expression # add a (virtual) column that will be computed on the fly
df.mean(df.x), df.mean(df.r) # calculate statistics on normal and virtual columns
Out[5]:
One of the core features of vaex is its ability to calculate statistics on a regular (N-dimensional) grid. The dimensions of the grid are specified by the binby argument (analogous to SQL's GROUP BY), together with the shape and limits arguments.
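As a rough picture of what a statistic on a regular grid means, the sketch below computes a binned mean in one dimension with plain Python. The function name `binned_mean` is hypothetical, and its handling of edge cases (rows outside the limits are skipped, empty bins are reported as NaN) is an assumption of this sketch rather than a statement about vaex's behavior.

```python
import math


def binned_mean(values, by, shape, limits):
    """Mean of `values`, grouped into `shape` equal-width bins of `by`."""
    lo, hi = limits
    sums = [0.0] * shape
    counts = [0] * shape
    for v, b in zip(values, by):
        if not (lo <= b < hi):
            continue  # assumption: rows outside the limits are skipped
        # Map b to a bin index on the regular grid.
        index = min(int((b - lo) / (hi - lo) * shape), shape - 1)
        sums[index] += v
        counts[index] += 1
    # Assumption: empty bins are reported as NaN.
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


x = [-9.0, -8.5, 0.5, 1.0, 9.9, 12.0]
r = [1.0, 3.0, 10.0, 20.0, 7.0, 100.0]

# Four bins of width 5 covering [-10, 10); the row with x = 12.0 is ignored.
grid = binned_mean(r, by=x, shape=4, limits=[-10, 10])
```

Each row is visited once and only `shape` accumulators are kept, which is why this kind of statistic can be computed in a single pass over a very large column.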
In [6]:
df.mean(df.r, binby=df.x, shape=32, limits=[-10, 10]) # create statistics on a regular grid (1d)
Out[6]:
In [7]:
df.mean(df.r, binby=[df.x, df.y], shape=32, limits=[-10, 10]) # or 2d
df.count(df.r, binby=[df.x, df.y], shape=32, limits=[-10, 10]) # or 2d counts/histogram
Out[7]:
These one- and two-dimensional grids can be visualized using any plotting library, such as matplotlib, but the setup can be tedious. For convenience, we can use plot1d or plot, or see the list of plotting commands.
In [8]:
df.plot(df.x, df.y, show=True); # make a plot quickly
Continue the tutorial here or check the examples