Tutorial for vaex as a library

Introduction

This tutorial briefly introduces how to use vaex from the IPython notebook. It assumes you have vaex installed as a library; you can run python -c 'import vaex' to check this. This document, although not an IPython notebook itself, is generated from a notebook, and you should be able to reproduce all examples.

Run IPython notebook

From the IPython notebook website:

The IPython Notebook is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media

To start it, run $ ipython notebook in your shell, and it should automatically open the main webpage. Start a new notebook by clicking new.

Starting

Start your notebook by importing the relevant packages. For this tutorial, we will be using vaex itself, numpy, and matplotlib for plotting. We also configure matplotlib to show the plots in the notebook itself.


In [1]:
import vaex as vx
import numpy as np
import matplotlib.pylab as plt # simpler interface for matplotlib
# next line configures matplotlib to show the plots in the notebook, other option is qt to open a dialog
%matplotlib inline

Open a dataset

To open a dataset from a local file, call vx.open. See the documentation of vaex.open for the arguments, or get direct help in the notebook by pressing shift-tab (once or twice) or running vx.open?. For this tutorial we use vx.example(), which opens a dataset provided with vaex. (Note that ds is short for dataset.)


In [2]:
ds = vx.example()
# ds = vx.open('yourfile.hdf5') # in case you want to load a different dataset

You can get information about the dataset, such as its columns, by simply typing ds as the last command in a cell.


In [3]:
ds


Out[3]:
<class 'vaex.file.other.Hdf5MemoryMapped'> - helmi-dezeeuw-2000-10p (length=330000)
x             float64
y             float64
z             float64
vx            float64
vy            float64
vz            float64
E             float64
FeH           float64
L             float64
Lz            float64
random_index  int64

To get a list of all column names, use the Dataset's get_column_names method. Note that tab completion should work: typing ds.get_c and then pressing tab should help you complete it.


In [4]:
ds.get_column_names()


Out[4]:
['x', 'y', 'z', 'vx', 'vy', 'vz', 'E', 'FeH', 'L', 'Lz', 'random_index']

Calculating statistics

Vaex can calculate statistics for columns, but also for an expression built from columns.


In [5]:
ds.mean("x"), ds.std("x"), ds.correlation("vx**2+vy**2+vz**2", "E")


Out[5]:
(-0.067131491264005971, 7.3174597654824751, array(0.00676355917633636))

Since column names can sometimes be difficult to remember, and to take advantage of the autocomplete features of the Notebook, column names can be accessed using the .col property, for instance:


In [6]:
print(ds.col.x)


x

In [7]:
ds.mean(ds.col.x)


Out[7]:
-0.067131491264005971

Dataset contains many methods to compute statistics, as well as plotting routines; see the API documentation for more details.

Most of the statistics can also be calculated on a grid, which can also be visualized using for instance matplotlib.


In [8]:
ds.mean("E", binby=["x", "y"], shape=(2,2), limits=[[-10,10], [-10, 10]])


Out[8]:
array([[-119166.43858099, -118291.18402363],
       [-117650.31604966, -119542.86139539]])
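To make clear what a statistic "on a grid" means, here is a minimal sketch in plain numpy, not vaex's own implementation. The helper name binned_mean and the toy arrays are made up for illustration, and the axis order of the resulting array is an assumption that may differ from what vaex returns.

```python
import numpy as np

def binned_mean(x, y, values, shape, limits):
    """Mean of `values` in each cell of a regular 2d grid over (x, y)."""
    # number of rows that fall in each grid cell
    counts, _, _ = np.histogram2d(x, y, bins=shape, range=limits)
    # sum of `values` over the rows in each grid cell
    sums, _, _ = np.histogram2d(x, y, bins=shape, range=limits, weights=values)
    # empty cells give 0/0, which numpy turns into NaN
    with np.errstate(invalid="ignore"):
        return sums / counts

# toy data: one point in three of the quadrants, two in the fourth
x = np.array([-5.0, -5.0, 5.0, 5.0, 5.0])
y = np.array([-5.0, 5.0, -5.0, 5.0, 5.0])
E = np.array([1.0, 2.0, 3.0, 4.0, 6.0])
grid = binned_mean(x, y, E, shape=(2, 2), limits=[[-10, 10], [-10, 10]])
print(grid)
```

The first index of grid runs over the x bins and the second over the y bins, which is the convention of np.histogram2d.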

In [9]:
mean_energy = ds.mean("E", binby=["x", "y"], shape=(128,128), limits=[[-10,10], [-10, 10]])
plt.imshow(mean_energy)


Out[9]:
<matplotlib.image.AxesImage at 0x114f28a90>

Plotting

Instead of using "bare" matplotlib to plot, using the .plot method is more convenient. It sets axes limits, labels (with units when known), and adds a colorbar. Learn more using the docstring, by typing ds.plot? or using shift-tab, or opening Dataset.plot.


In [10]:
ds.plot("x", "y", limits=[[-10,10], [-10, 10]]);


Instead of plotting the counts, the mean of an expression can be plotted. (Other options are sum, std, var, correlation, covar, min, max)


In [11]:
ds.plot("x", "y", what="mean(vx)", limits=[[-10,10], [-10, 10]], vmin=-200, vmax=200, shape=128);


/Users/maartenbreddels/anaconda3/envs/vaex-forge2/lib/python3.5/site-packages/matplotlib/colors.py:581: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)

More panels can be plotted by giving a list of pairs of expressions as the first argument (each pair is what we call a subspace).


In [12]:
ds.plot([["x", "y"], ["x", "z"]], limits=[[-10, 10], [-10, 10]], figsize=(10,5), shape=128);


The same can be done for the what argument. Note that the f argument is the transformation that will be applied to the values, for instance "log", "log10", "abs", or None for no transformation. If given as a single argument, it will apply to all plots; otherwise it should be a list of the same length as the what argument.


In [13]:
ds.plot("x", "y", what=["count(*)", "mean(vx)"], f=["log", None],
        limits=[[-10, 10], [-10, 10]], figsize=(10,5), shape=128, vmin=[0, -200], vmax=[4, 200]);


/Users/maartenbreddels/anaconda3/envs/vaex-forge2/lib/python3.5/site-packages/matplotlib/colors.py:581: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)

When both are combined, the what values form the columns of the subplot grid, while the rows are the different subspaces.


In [14]:
ds.plot([["x", "y"], ["x", "z"]],  f=["log", None, None, None],
        what=["count(*)", "mean(vx)", "mean(vy)", "correlation(vx,vy)"],
        colormap=["afmhot", "afmhot", "afmhot", "bwr"],
        limits=[[-10, 10], [-10, 10]], figsize=(14,8), shape=128);


/Users/maartenbreddels/anaconda3/envs/vaex-forge2/lib/python3.5/site-packages/matplotlib/colors.py:581: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)

Selections

To work with a part of the data, we use what we call selections. When a selection is applied to a dataset, it keeps a boolean in memory for each row, indicating whether the row is selected or not. All statistical methods take a selection argument, which can be None or False for no selection, True or "default" for the default selection, or a string referring to a named selection (corresponding to the name argument of the Dataset.select method). It is also possible to pass an expression as the selection, but such selections are not cached and are recomputed every time they are needed.


In [15]:
# the following plots are all identical
ds.select("y > x")
ds.plot("x", "y", selection=True, show=True)
ds.plot("x", "y", selection="default", show=True) # same as the previous
ds.plot("x", "y", selection="y > x", show=True); # similar, but selection will be recomputed every time


Multiple selections can be overplotted, where None means no selection, and True is an alias for the default selection name of "default". The selections will be overplotted, with the background faded. (Note that because the log is taken of zero, this results in NaN, which is shown as transparent pixels.)


In [16]:
ds.plot("x", "y", selection=[None, True], f="log");


/Users/maartenbreddels/anaconda3/envs/vaex-forge2/lib/python3.5/site-packages/matplotlib/colors.py:581: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)
/Users/maartenbreddels/vaex/src/vaex/vaex/image.py:96: RuntimeWarning: invalid value encountered in true_divide
  result = ((1.-aB) * aA * xA  + (1.-aA) * aB * xB + aA * aB * f) / aR
/Users/maartenbreddels/vaex/src/vaex/vaex/image.py:99: RuntimeWarning: invalid value encountered in true_divide
  result = (np.minimum(aA, 1-aB)*xA + aB*xB)/aR

Selections can be made more complicated, or can be logically combined using a boolean operator through the mode argument. The default mode, "replace", replaces the current selection; the other possibilities are "and", "or", "xor" and "subtract".


In [17]:
ds.select("y > x")
ds.select("y > -x", mode="or")
# this next line has the same effect as the above two
# ds.select("(y > x) | (y > -x)")
# |, & and ^ are used for 'or', 'and' and 'xor'
ds.select("x > 5", mode="subtract")
ds.plot("x", "y", selection=[None, True], f="log");


/Users/maartenbreddels/anaconda3/envs/vaex-forge2/lib/python3.5/site-packages/matplotlib/colors.py:581: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)
/Users/maartenbreddels/vaex/src/vaex/vaex/image.py:96: RuntimeWarning: invalid value encountered in true_divide
  result = ((1.-aB) * aA * xA  + (1.-aA) * aB * xB + aA * aB * f) / aR
/Users/maartenbreddels/vaex/src/vaex/vaex/image.py:99: RuntimeWarning: invalid value encountered in true_divide
  result = (np.minimum(aA, 1-aB)*xA + aB*xB)/aR
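Since a selection is a boolean per row, the modes can be pictured as operations on numpy boolean masks. The following is only a sketch of that idea: the function combine and the toy arrays are hypothetical, and reading "subtract" as "current and not new" is an assumption, although it is consistent with the example above, where no selected row has x > 5.

```python
import numpy as np

def combine(current, new, mode="replace"):
    """Combine the current boolean mask with a new one (illustrative only)."""
    if mode == "replace":
        return new
    if mode == "and":
        return current & new
    if mode == "or":
        return current | new
    if mode == "xor":
        return current ^ new
    if mode == "subtract":
        # assumed meaning: keep rows that were selected and are not in `new`
        return current & ~new
    raise ValueError("unknown mode: %r" % mode)

x = np.array([-1.0, 2.0, 6.0, 7.0, -3.0])
y = np.array([3.0, 1.0, 8.0, -9.0, -4.0])

mask = combine(None, y > x)                    # like ds.select("y > x")
mask = combine(mask, y > -x, mode="or")        # like ds.select("y > -x", mode="or")
mask = combine(mask, x > 5, mode="subtract")   # like ds.select("x > 5", mode="subtract")
print(mask)
```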

Using the visual argument, it is possible to show the selections as columns instead, see Dataset.plot for more details.


In [18]:
ds.select("x - 5> y", name="other")
ds.plot("x", "y", selection=[None, True, "other", "other | default"],
        f="log", visual=dict(column="selection"), figsize=(12,4));


/Users/maartenbreddels/anaconda3/envs/vaex-forge2/lib/python3.5/site-packages/matplotlib/colors.py:581: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)

Besides making plots, statistics can also be computed for selections.


In [19]:
ds.max("x", selection=True)


Out[19]:
array(4.99998713)

In [20]:
ds.max("x", selection=[None, True])


Out[20]:
array([ 271.365997  ,    4.99998713])

In [21]:
ds.max(["x", "y"], selection=[None, True])


Out[21]:
array([[ 271.365997  ,    4.99998713],
       [ 146.465836  ,  146.465836  ]])

In [22]:
ds.mean(["x", "y"], selection=[None, True, "other", "x > y"])


Out[22]:
array([[-0.06713149, -2.98854513,  5.90555941,  3.59256693],
       [-0.05358987,  2.99097581, -6.92724312, -4.19886827]])
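The layout of these results, one row per expression and one column per selection, can be reproduced with plain numpy masks. This is a sketch with made-up data, only meant to illustrate the shape of the output; it does not use vaex.

```python
import numpy as np

x = np.array([1.0, 4.0, -2.0, 9.0])
y = np.array([3.0, 0.0, 5.0, 2.0])

everything = np.ones(len(x), dtype=bool)  # plays the role of selection=None
selected = y > x                          # plays the role of selection=True

expressions = [x, y]
selections = [everything, selected]

# rows follow the expressions, columns follow the selections
result = np.array([[expr[mask].max() for mask in selections]
                   for expr in expressions])
print(result)
```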

Virtual columns

If a particular expression occurs often, it may be convenient to create a virtual column. It behaves exactly like a normal column, but it is calculated on the fly, without taking up the memory of a full column, since the calculation is done in chunks.


In [23]:
ds.add_virtual_column("r", "sqrt(x**2+y**2+z**2)")
ds.add_virtual_column("v", "sqrt(vx**2+vy**2+vz**2)")
ds.plot("log(r)", "log(v)", f="log10");


/Users/maartenbreddels/anaconda3/envs/vaex-forge2/lib/python3.5/site-packages/matplotlib/colors.py:581: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)
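The idea of evaluating an expression in chunks can be sketched with plain numpy. This is not how vaex is implemented internally; chunked_mean, the chunk size and the random data are all assumptions made for illustration.

```python
import numpy as np

def chunked_mean(expression, columns, chunk_size=1000):
    """Mean of expression(**columns), evaluated one chunk at a time."""
    length = len(next(iter(columns.values())))
    total = 0.0
    for start in range(0, length, chunk_size):
        stop = start + chunk_size
        # slices are views, so only the expression result of one chunk
        # is ever held in memory
        chunk = {name: data[start:stop] for name, data in columns.items()}
        total += expression(**chunk).sum()
    return total / length

def r(x, y, z):
    # same expression as the virtual column "r" above
    return np.sqrt(x**2 + y**2 + z**2)

rng = np.random.RandomState(42)
columns = dict(x=rng.normal(size=10000),
               y=rng.normal(size=10000),
               z=rng.normal(size=10000))
mean_r = chunked_mean(r, columns)
print(mean_r)
```

The result agrees with r(**columns).mean(), but the full array of r values is never created.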

More about the dataset

Vaex works best with hdf5 and fits files, but can import from other sources as well. File formats are recognized by the extension. For .vot a VOTable is assumed, and astropy is used for reading it. For .asc, astropy's ascii reader is used. However, these formats require the dataset to fit into memory, and exporting them in hdf5 or fits format may lead to better performance and faster read times. Datasets can also be made from numpy arrays using vaex.from_arrays, or imported for convenience from pandas using vaex.from_pandas.

In the next example we create a dataset from arrays, and export it to disk.


In [24]:
# Create a 6d gaussian clump
q = np.random.normal(10, 2, (6, 10000))
dataset_clump_arrays = vx.from_arrays(x=q[0], y=q[1], z=q[2], vx=q[3], vy=q[4], vz=q[5])
dataset_clump_arrays.add_virtual_column("r", "sqrt(x**2+y**2+z**2)")
dataset_clump_arrays.add_virtual_column("v", "sqrt(vx**2+vy**2+vz**2)")

# create a temporary file
import tempfile
filename = tempfile.mktemp(suffix=".hdf5")

# when exporting takes long, progress=True will give a progress bar
# here, we don't want to export virtual columns, which is the default
dataset_clump_arrays.export_hdf5(filename, progress=True, virtual=False)
print("Exported to: %s" % filename)


Exported to: /var/folders/vn/_rmzj8jd0215_g9yfrn8pmgm0000gn/T/tmps5efbvfh.hdf5
exporting: 100% |####################################################################################################################################| Time: 0:00:00 CPU Usage:     0%

In [25]:
ds_clump = vx.open(filename)
print("Columns: %r" % ds_clump.get_column_names())


Columns: ['x', 'y', 'z', 'vx', 'vy', 'vz']

Concatenating tables

Using the .concat method, datasets can be concatenated to form one big dataset (without copying the data).


In [26]:
ds2 = ds.concat(ds_clump)
ds2.plot("x", "y", f="log1p", limits=[[-20, 20], [-20, 20]]);
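What "without copying the data" can look like is sketched below with a small, hypothetical class that only keeps references to its parts. It is not vaex's implementation; the class name ConcatenatedColumn and its methods are invented for this illustration.

```python
import numpy as np

class ConcatenatedColumn:
    """Hypothetical helper: presents several arrays as one logical column."""

    def __init__(self, *parts):
        self.parts = parts  # references only, nothing is copied

    def __len__(self):
        return sum(len(part) for part in self.parts)

    def chunks(self):
        # hand out the underlying arrays one after the other
        for part in self.parts:
            yield part

    def mean(self):
        # statistics are accumulated over the parts
        total = sum(float(chunk.sum()) for chunk in self.chunks())
        return total / len(self)

a = np.arange(5.0)        # 0, 1, 2, 3, 4
b = np.arange(5.0, 8.0)   # 5, 6, 7
column = ConcatenatedColumn(a, b)
print(len(column), column.mean())
```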


Shuffling

TODO

Efficient use of multiple calculations

Imagine you want to calculate the correlation coefficient for a few subspaces. First we calculate it for E and Lz.


In [27]:
ds.correlation("E", "Lz")


Out[27]:
array(-0.09404020895356191)

In the process, all the data for the columns E and Lz was processed. If we now calculate the correlation coefficient for E and L, we go over the data for column E again. Especially if the data does not fit into memory, this is quite inefficient.


In [28]:
ds.correlation("E", "L")


Out[28]:
array(0.6890619164898808)

If instead we call the correlation method with a list of subspaces, there is only one pass over the data, which can be much more efficient.


In [29]:
ds.correlation([["E", "Lz"], ["E", "L"]])


Out[29]:
array([-0.09404021,  0.68906192])
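Why one pass is enough can be seen from a sketch in plain numpy: the sums needed for every pair are accumulated while each chunk is visited once. This is an illustration under assumptions, not vaex's code; correlations_single_pass, the chunk size and the random columns are invented here, and the naive sum formula is chosen for brevity rather than numerical robustness.

```python
import numpy as np

def correlations_single_pass(columns, pairs, chunk_size=1000):
    """Pearson correlation for several pairs of columns in one pass."""
    length = len(next(iter(columns.values())))
    # per pair: sum(a), sum(b), sum(a*a), sum(b*b), sum(a*b)
    sums = np.zeros((len(pairs), 5))
    for start in range(0, length, chunk_size):
        stop = start + chunk_size
        # each chunk is sliced once and shared by all pairs
        chunk = {name: data[start:stop] for name, data in columns.items()}
        for i, (a_name, b_name) in enumerate(pairs):
            a, b = chunk[a_name], chunk[b_name]
            sums[i] += [a.sum(), b.sum(), (a * a).sum(), (b * b).sum(), (a * b).sum()]
    mean_a, mean_b, mean_aa, mean_bb, mean_ab = (sums / length).T
    covariance = mean_ab - mean_a * mean_b
    return covariance / np.sqrt((mean_aa - mean_a**2) * (mean_bb - mean_b**2))

rng = np.random.RandomState(0)
E = rng.normal(size=5000)
columns = {"E": E, "Lz": rng.normal(size=5000), "L": E + rng.normal(size=5000)}
pairs = [["E", "Lz"], ["E", "L"]]
result = correlations_single_pass(columns, pairs)
print(result)
```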

The gain is largest when many subspaces are used, as in the following example.


In [30]:
subspaces = ds.combinations()
correlations = ds.correlation(subspaces)
mutual_informations = ds.mutual_information(subspaces)
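The table printed further down has 55 rows, which is the number of unordered pairs that can be formed from the 11 columns. As a sketch of how such a list of subspaces could be built by hand, the standard library's itertools.combinations gives the same pairs in the same order; whether ds.combinations() is implemented this way is an assumption.

```python
from itertools import combinations

column_names = ['x', 'y', 'z', 'vx', 'vy', 'vz', 'E', 'FeH', 'L', 'Lz', 'random_index']

# all unordered pairs of distinct columns, in the order of column_names
pairs = list(combinations(column_names, 2))
names = ["_".join(pair) for pair in pairs]
print(len(pairs), names[:3])
```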

In [31]:
from astropy.io import ascii
import sys
names = ["_".join(subspace) for subspace in subspaces]
ascii.write([names, correlations, mutual_informations], sys.stdout,
            names=["names", "correlation", "mutual_information"])
# replace sys.stdout by a filename such as "example.asc"
filename_asc = tempfile.mktemp(suffix=".asc")
ascii.write([names, correlations, mutual_informations], filename_asc,
            names=["names", "correlation", "mutual_information"])

print("--------")
# or write it as a latex table
ascii.write([names, correlations, mutual_informations],
            sys.stdout, names=["names", "correlation", "mutual information"], Writer=ascii.Latex)


names correlation mutual_information
x_y -0.066913086088751 0.1511814526380327
x_z -0.026563129089248065 0.18439180585071951
x_vx -0.0077917898183534 0.10435586691547903
x_vy 0.0001401879823959935 0.15943598551987362
x_vz 0.020449779578494472 0.10991350641870239
x_E -0.012435764665535712 0.37575458684670193
x_FeH 0.005261856198512363 0.1249204442224574
x_L -0.02566286245858961 0.19727462018198172
x_Lz -0.00030294055309957344 0.2150162106163776
x_random_index 0.002157476491340153 0.32861028043895546
y_z 0.030838572698652564 0.21418760688854802
y_vx 0.01804910998078914 0.17013399253827877
y_vy -0.004114980900371909 0.1097919284981679
y_vz -0.028477638600608927 0.11538952339280187
y_E -0.006099113309545572 0.43174233264307804
y_FeH 0.015003295277717208 0.13276137709114438
y_L -0.00838158892350927 0.21606618030384148
y_Lz 0.027260049760350104 0.23708979843321443
y_random_index -0.002740986556550819 0.36853287435535315
z_vx -0.021753308878140573 0.11626543575085833
z_vy 0.029883551266368533 0.11377910176974865
z_vz -0.009658004899831468 0.10902906677577343
z_E 0.01244518987212551 0.39227851480701037
z_FeH -0.024137983404556557 0.1438360275619597
z_L 0.003231034609821104 0.21072426604103467
z_Lz -0.06334896485239119 0.24951081671985298
z_random_index 0.028387450432431596 0.5119671679312902
vx_vy -0.03524604328853534 0.11105372656186498
vx_vz 0.005550990948008108 0.12708618558232798
vx_E -0.006280672311820793 0.14699427054830352
vx_FeH 0.010488839427892981 0.10342994503122623
vx_L -0.007520910755859279 0.11295633234855083
vx_Lz 0.02359219180674478 0.12115009113844397
vx_random_index -0.00522915568828874 0.13209120343292602
vy_vz 0.009916570683825747 0.1316782544117304
vy_E 0.01786299906409399 0.16308541508821278
vy_FeH -0.011055183715440823 0.10261812705059323
vy_L 0.023488436893893464 0.1120822135394338
vy_Lz -0.02312324972293643 0.11578395258810391
vy_random_index -0.0007787184679206849 0.13081608199020858
vz_E 0.01921099148010763 0.14587270420692017
vz_FeH 0.00375742600931867 0.11145188427951819
vz_L 0.031076571360133275 0.1322073313358699
vz_Lz 0.03296464183711297 0.12410177399196129
vz_random_index -0.011321762177823203 0.18483519919555708
E_FeH -0.014068223053940808 0.45424186792187254
E_L 0.6890619164898808 0.7404061337881687
E_Lz -0.09404020895356191 1.0706737929496781
E_random_index -0.1294438804260704 1.7132853532556342
FeH_L -0.08144257446458827 0.3087854784476569
FeH_Lz 0.4653258482938841 0.67903425399271
FeH_random_index 0.2150690111880752 1.435098879037007
L_Lz -0.1289411770767984 1.0311950571530903
L_random_index -0.01195286572778252 0.9863646953020592
Lz_random_index -0.22159810290993964 1.8807524566430187
--------
\begin{table}
\begin{tabular}{ccc}
names & correlation & mutual information \\
x_y & -0.0669130860888 & 0.151181452638 \\
x_z & -0.0265631290892 & 0.184391805851 \\
x_vx & -0.00779178981835 & 0.104355866915 \\
x_vy & 0.000140187982396 & 0.15943598552 \\
x_vz & 0.0204497795785 & 0.109913506419 \\
x_E & -0.0124357646655 & 0.375754586847 \\
x_FeH & 0.00526185619851 & 0.124920444222 \\
x_L & -0.0256628624586 & 0.197274620182 \\
x_Lz & -0.0003029405531 & 0.215016210616 \\
x_random_index & 0.00215747649134 & 0.328610280439 \\
y_z & 0.0308385726987 & 0.214187606889 \\
y_vx & 0.0180491099808 & 0.170133992538 \\
y_vy & -0.00411498090037 & 0.109791928498 \\
y_vz & -0.0284776386006 & 0.115389523393 \\
y_E & -0.00609911330955 & 0.431742332643 \\
y_FeH & 0.0150032952777 & 0.132761377091 \\
y_L & -0.00838158892351 & 0.216066180304 \\
y_Lz & 0.0272600497604 & 0.237089798433 \\
y_random_index & -0.00274098655655 & 0.368532874355 \\
z_vx & -0.0217533088781 & 0.116265435751 \\
z_vy & 0.0298835512664 & 0.11377910177 \\
z_vz & -0.00965800489983 & 0.109029066776 \\
z_E & 0.0124451898721 & 0.392278514807 \\
z_FeH & -0.0241379834046 & 0.143836027562 \\
z_L & 0.00323103460982 & 0.210724266041 \\
z_Lz & -0.0633489648524 & 0.24951081672 \\
z_random_index & 0.0283874504324 & 0.511967167931 \\
vx_vy & -0.0352460432885 & 0.111053726562 \\
vx_vz & 0.00555099094801 & 0.127086185582 \\
vx_E & -0.00628067231182 & 0.146994270548 \\
vx_FeH & 0.0104888394279 & 0.103429945031 \\
vx_L & -0.00752091075586 & 0.112956332349 \\
vx_Lz & 0.0235921918067 & 0.121150091138 \\
vx_random_index & -0.00522915568829 & 0.132091203433 \\
vy_vz & 0.00991657068383 & 0.131678254412 \\
vy_E & 0.0178629990641 & 0.163085415088 \\
vy_FeH & -0.0110551837154 & 0.102618127051 \\
vy_L & 0.0234884368939 & 0.112082213539 \\
vy_Lz & -0.0231232497229 & 0.115783952588 \\
vy_random_index & -0.000778718467921 & 0.13081608199 \\
vz_E & 0.0192109914801 & 0.145872704207 \\
vz_FeH & 0.00375742600932 & 0.11145188428 \\
vz_L & 0.0310765713601 & 0.132207331336 \\
vz_Lz & 0.0329646418371 & 0.124101773992 \\
vz_random_index & -0.0113217621778 & 0.184835199196 \\
E_FeH & -0.0140682230539 & 0.454241867922 \\
E_L & 0.68906191649 & 0.740406133788 \\
E_Lz & -0.0940402089536 & 1.07067379295 \\
E_random_index & -0.129443880426 & 1.71328535326 \\
FeH_L & -0.0814425744646 & 0.308785478448 \\
FeH_Lz & 0.465325848294 & 0.679034253993 \\
FeH_random_index & 0.215069011188 & 1.43509887904 \\
L_Lz & -0.128941177077 & 1.03119505715 \\
L_random_index & -0.0119528657278 & 0.986364695302 \\
Lz_random_index & -0.22159810291 & 1.88075245664 \\
\end{tabular}
\end{table}

In [32]:
# reading it back in
table = ascii.read(filename_asc)
print("this is an astropy table:\n", table)
correlations = table["correlation"]
print("this is an astropy column:\n", correlations)
print("this is the numpy data:\n", correlations.data)
# short: table["correlation"].data


this is an astropy table:
      names          correlation    mutual_information
---------------- ----------------- ------------------
             x_y  -0.0669130860888     0.151181452638
             x_z  -0.0265631290892     0.184391805851
            x_vx -0.00779178981835     0.104355866915
            x_vy 0.000140187982396      0.15943598552
            x_vz   0.0204497795785     0.109913506419
             x_E  -0.0124357646655     0.375754586847
           x_FeH  0.00526185619851     0.124920444222
             x_L  -0.0256628624586     0.197274620182
            x_Lz  -0.0003029405531     0.215016210616
  x_random_index  0.00215747649134     0.328610280439
             ...               ...                ...
 vz_random_index  -0.0113217621778     0.184835199196
           E_FeH  -0.0140682230539     0.454241867922
             E_L     0.68906191649     0.740406133788
            E_Lz  -0.0940402089536      1.07067379295
  E_random_index   -0.129443880426      1.71328535326
           FeH_L  -0.0814425744646     0.308785478448
          FeH_Lz    0.465325848294     0.679034253993
FeH_random_index    0.215069011188      1.43509887904
            L_Lz   -0.128941177077      1.03119505715
  L_random_index  -0.0119528657278     0.986364695302
 Lz_random_index    -0.22159810291      1.88075245664
Length = 55 rows
this is an astropy column:
    correlation   
-----------------
 -0.0669130860888
 -0.0265631290892
-0.00779178981835
0.000140187982396
  0.0204497795785
 -0.0124357646655
 0.00526185619851
 -0.0256628624586
 -0.0003029405531
 0.00215747649134
              ...
 -0.0113217621778
 -0.0140682230539
    0.68906191649
 -0.0940402089536
  -0.129443880426
 -0.0814425744646
   0.465325848294
   0.215069011188
  -0.128941177077
 -0.0119528657278
   -0.22159810291
Length = 55 rows
this is the numpy data:
 [ -6.69130861e-02  -2.65631291e-02  -7.79178982e-03   1.40187982e-04
   2.04497796e-02  -1.24357647e-02   5.26185620e-03  -2.56628625e-02
  -3.02940553e-04   2.15747649e-03   3.08385727e-02   1.80491100e-02
  -4.11498090e-03  -2.84776386e-02  -6.09911331e-03   1.50032953e-02
  -8.38158892e-03   2.72600498e-02  -2.74098656e-03  -2.17533089e-02
   2.98835513e-02  -9.65800490e-03   1.24451899e-02  -2.41379834e-02
   3.23103461e-03  -6.33489649e-02   2.83874504e-02  -3.52460433e-02
   5.55099095e-03  -6.28067231e-03   1.04888394e-02  -7.52091076e-03
   2.35921918e-02  -5.22915569e-03   9.91657068e-03   1.78629991e-02
  -1.10551837e-02   2.34884369e-02  -2.31232497e-02  -7.78718468e-04
   1.92109915e-02   3.75742601e-03   3.10765714e-02   3.29646418e-02
  -1.13217622e-02  -1.40682231e-02   6.89061916e-01  -9.40402090e-02
  -1.29443880e-01  -8.14425745e-02   4.65325848e-01   2.15069011e-01
  -1.28941177e-01  -1.19528657e-02  -2.21598103e-01]

Where to go from here?

This tutorial covers the basics; more can be learned by reading the API documentation. Note that every docstring can also be read from the notebook using shift-tab, or by running for instance ds.plot?.

If you think a particular topic should be addressed here, please open an issue on GitHub.

