This tutorial briefly introduces how to use vaex from the IPython notebook. It assumes you have vaex installed as a library; you can run python -c 'import vaex'
to check this.
This document, although not an IPython notebook itself, is generated from a notebook, and you should be able to reproduce all examples.
From the IPython notebook website:
The IPython Notebook is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media
To start it, run $ ipython notebook
in your shell, and it should automatically open the main webpage. Start a new notebook by clicking new.
Start your notebook by importing the relevant packages. For this tutorial we will be using vaex itself, numpy, and matplotlib for plotting. We also configure matplotlib to show the plots in the notebook itself.
In [1]:
import vaex as vx
import numpy as np
import matplotlib.pylab as plt # simpler interface for matplotlib
# next line configures matplotlib to show the plots in the notebook, other option is qt to open a dialog
%matplotlib inline
To open a dataset, we can call vx.open to open local files. See the documentation of vaex.open for the arguments; hit shift-tab (1 or 2 times) or run vx.open?
in the notebook for direct help. For this tutorial we use vx.example(),
which opens a dataset provided with vaex. (Note that ds is short for dataset.)
In [2]:
ds = vx.example()
# ds = vx.open('yourfile.hdf5') # in case you want to load a different dataset
You can get information about the dataset, such as the columns by simply typing ds
as the last command in a cell.
In [3]:
ds
Out[3]:
To get a list with all column names, use the Dataset's get_column_names method. Note that tab completion should work: typing ds.get_c
and then pressing tab should help you complete it.
In [4]:
ds.get_column_names()
Out[4]:
Basic statistics can be computed directly from expressions involving the columns:
In [5]:
ds.mean("x"), ds.std("x"), ds.correlation("vx**2+vy**2+vz**2", "E")
Out[5]:
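For reference, the correlation reported above is essentially the Pearson correlation coefficient. A minimal numpy sketch of the same quantity, on small toy in-memory arrays (which vaex avoids materializing), could look like this:

```python
import numpy as np

# toy in-memory columns standing in for the dataset's vx, vy, vz
rng = np.random.default_rng(0)
vx_, vy_, vz_ = rng.normal(size=(3, 1000))
speed2 = vx_**2 + vy_**2 + vz_**2
E = 0.5 * speed2  # a toy "energy", exactly proportional to the squared speed

# the Pearson correlation coefficient between the two expressions
corr = np.corrcoef(speed2, E)[0, 1]
print(corr)  # 1.0, since E is proportional to speed2 here
```

Since the toy E is a linear function of speed2, the coefficient comes out as exactly 1; on the real dataset the relation is weaker and the value lies between -1 and 1.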
Since column names can sometimes be difficult to remember, and to take advantage of the autocomplete features of the notebook, column names can be accessed using the .col property, for instance:
In [6]:
print(ds.col.x)
In [7]:
ds.mean(ds.col.x)
Out[7]:
The Dataset class contains many methods to compute statistics, as well as plotting routines; see the API documentation for more details.
Most of the statistics can also be calculated on a grid, which can then be visualized using, for instance, matplotlib.
In [8]:
ds.mean("E", binby=["x", "y"], shape=(2,2), limits=[[-10,10], [-10, 10]])
Out[8]:
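What binby computes is a binned statistic: each cell of the grid holds the statistic over the rows falling in that cell. A rough numpy equivalent of the 2x2 grid of means above, on toy in-memory data (vaex does this out of core), would be:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-10, 10, 10_000)
y = rng.uniform(-10, 10, 10_000)
E = x**2 + y**2  # toy "energy" column

# binned mean = binned sum / binned count, over a 2x2 grid on [-10, 10]^2
limits = [[-10, 10], [-10, 10]]
sums, _, _ = np.histogram2d(x, y, bins=2, range=limits, weights=E)
counts, _, _ = np.histogram2d(x, y, bins=2, range=limits)
mean_E = sums / counts
print(mean_E.shape)  # (2, 2)
```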
In [9]:
mean_energy = ds.mean("E", binby=["x", "y"], shape=(128,128), limits=[[-10,10], [-10, 10]])
plt.imshow(mean_energy)
Out[9]:
Instead of using "bare" matplotlib to plot, using the .plot method is more convenient. It sets axes limits and labels (with units when known), and adds a colorbar. Learn more from the docstring by typing ds.plot?,
using shift-tab, or opening the documentation of Dataset.plot.
In [10]:
ds.plot("x", "y", limits=[[-10,10], [-10, 10]]);
Instead of plotting the counts, the mean of an expression can be plotted. (Other options are sum, std, var, correlation, covar, min, and max.)
In [11]:
ds.plot("x", "y", what="mean(vx)", limits=[[-10,10], [-10, 10]], vmin=-200, vmax=200, shape=128);
More panels can be plotted by giving a list of pairs of expressions as the first argument (each pair is what we call a subspace).
In [12]:
ds.plot([["x", "y"], ["x", "z"]], limits=[[-10, 10], [-10, 10]], figsize=(10,5), shape=128);
And the same can be done for the what
argument. Note that the f argument is the transformation that will be applied to the values, for instance "log", "log10", "abs", or None for no transformation. If given as a single argument, it will apply to all plots; otherwise it should be a list of the same length as the what argument.
In [13]:
ds.plot("x", "y", what=["count(*)", "mean(vx)"], f=["log", None],
limits=[[-10, 10], [-10, 10]], figsize=(10,5), shape=128, vmin=[0, -200], vmax=[4, 200]);
When both are combined, the what
argument forms the columns of the subplot grid, while the rows are the different subspaces.
In [14]:
ds.plot([["x", "y"], ["x", "z"]], f=["log", None, None, None],
what=["count(*)", "mean(vx)", "mean(vy)", "correlation(vx,vy)"],
colormap=["afmhot", "afmhot", "afmhot", "bwr"],
limits=[[-10, 10], [-10, 10]], figsize=(14,8), shape=128);
For working with a part of the data, we use what we call selections. When a selection is applied to a dataset, it keeps a boolean in memory for each row, indicating whether it is selected or not. All statistical methods take a selection argument, which can be None
or False
for no selection, True
or "default"
for the default selection, or a string referring to a named selection (corresponding to the name argument of the Dataset.select method). It is also possible to pass an expression as a selection, but such selections are not cached and will be recomputed every time they are needed.
In [15]:
# the following plots are all identical
ds.select("y > x")
ds.plot("x", "y", selection=True, show=True)
ds.plot("x", "y", selection="default", show=True) # same as the previous
ds.plot("x", "y", selection="y > x", show=True); # similar, but selection will be recomputed every time
Multiple selections can be overplotted, where None
means no selection, and True
is an alias for the default selection name of "default". The selections are overplotted on top of a faded background. (Note that taking the log of zero counts produces invalid values, which are shown as transparent pixels.)
In [16]:
ds.plot("x", "y", selection=[None, True], f="log");
A selection can be made more complicated, or logically combined with the current one using a boolean operator. The default is to replace the current selection; other possibilities are: "replace", "and", "or", "xor", "subtract".
In [17]:
ds.select("y > x")
ds.select("y > -x", mode="or")
# the next line has the same effect as the above two
# ds.select("(y > x) | (x > -y)")
# |, & and ^ are used for 'or', 'and' and 'xor'
ds.select("x > 5", mode="subtract")
ds.plot("x", "y", selection=[None, True], f="log");
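The select calls in the cell above amount to combining boolean masks with the corresponding operators; roughly, on toy arrays:

```python
import numpy as np

x = np.array([-6.0, -1.0, 2.0, 6.0])
y = np.array([ 1.0,  2.0, 1.0, 1.0])

sel = y > x            # select("y > x") replaces the current selection
sel = sel | (y > -x)   # mode="or" unions with the new condition
sel = sel & ~(x > 5)   # mode="subtract" removes rows matching it
print(sel.tolist())    # [True, True, True, False]
```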
Using the visual argument, it is possible to show the selections as columns instead, see Dataset.plot for more details.
In [18]:
ds.select("x - 5> y", name="other")
ds.plot("x", "y", selection=[None, True, "other", "other | default"],
f="log", visual=dict(column="selection"), figsize=(12,4));
Besides making plots, statistics can also be computed for selections:
In [19]:
ds.max("x", selection=True)
Out[19]:
In [20]:
ds.max("x", selection=[None, True])
Out[20]:
In [21]:
ds.max(["x", "y"], selection=[None, True])
Out[21]:
In [22]:
ds.mean(["x", "y"], selection=[None, True, "other", "x > y"])
Out[22]:
Virtual columns behave like regular columns, but are defined by an expression and computed on the fly, costing no extra memory:
In [23]:
ds.add_virtual_column("r", "sqrt(x**2+y**2+z**2)")
ds.add_virtual_column("v", "sqrt(vx**2+vy**2+vz**2)")
ds.plot("log(r)", "log(v)", f="log10");
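Conceptually, a virtual column is a stored expression that is evaluated on demand, rather than an extra array held in memory; a rough in-memory analogue (not vaex internals):

```python
import numpy as np

x = np.full(5, 1.0)
y = np.full(5, 2.0)
z = np.full(5, 2.0)

# the "column" is an expression, not data; nothing is computed or stored yet
virtual = {"r": lambda: np.sqrt(x**2 + y**2 + z**2)}

r = virtual["r"]()  # evaluated only when asked for
print(r[0])         # 3.0, i.e. sqrt(1 + 4 + 4)
```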
Vaex also offers extra methods for creating common virtual columns; see the API documentation.
Don't be afraid to look at the source (click the green link [source]).
Vaex works best with hdf5 and fits files, but can import from other sources as well. File formats are recognized by the extension: for .vot a VOTable is assumed, and astropy is used to read it; for .asc astropy's ascii reader is used. However, these formats require the dataset to fit into memory, and exporting to hdf5 or fits format may lead to better performance and faster read times. Datasets can also be made from numpy arrays using vaex.from_arrays, or imported for convenience from pandas using vaex.from_pandas.
In the next example we create a dataset from arrays, and export it to disk.
In [24]:
# Create a 6d gaussian clump
q = np.random.normal(10, 2, (6, 10000))
dataset_clump_arrays = vx.from_arrays(x=q[0], y=q[1], z=q[2], vx=q[3], vy=q[4], vz=q[5])
dataset_clump_arrays.add_virtual_column("r", "sqrt(x**2+y**2+z**2)")
dataset_clump_arrays.add_virtual_column("v", "sqrt(vx**2+vy**2+vz**2)")
# create a temporary file name (tempfile.mktemp is deprecated; mkstemp is the safe variant)
import tempfile, os
fd, filename = tempfile.mkstemp(suffix=".hdf5")
os.close(fd)  # vaex only needs the name; it will open the file itself
# when exporting takes long, progress=True will give a progress bar
# here, we don't want to export virtual columns, which is the default
dataset_clump_arrays.export_hdf5(filename, progress=True, virtual=False)
print("Exported to: %s" % filename)
In [25]:
ds_clump = vx.open(filename)
print("Columns: %r" % ds_clump.get_column_names())
In [26]:
ds2 = ds.concat(ds_clump)
ds2.plot("x", "y", f="log1p", limits=[[-20, 20], [-20, 20]]);
Computing the correlation between two expressions, such as E and Lz, requires a full pass over the data:
In [27]:
ds.correlation("E", "Lz")
Out[27]:
In the process, all the data for the columns E and Lz was processed. If we now calculate the correlation coefficient for E and L, we go over the data for column E again. Especially if the data does not fit into memory, this is quite inefficient.
In [28]:
ds.correlation("E", "L")
Out[28]:
If, instead, we call the correlation method with a list of subspaces, there is only one pass over the data, which can be much more efficient.
In [29]:
ds.correlation([["E", "Lz"], ["E", "L"]])
Out[29]:
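To see why combining subspaces helps, here is a toy sketch (not vaex internals) that tallies how many times each column is scanned when correlations are computed pair by pair: the shared column E is read once per pair, whereas a single combined pass would read it once in total.

```python
import numpy as np

rng = np.random.default_rng(2)
data = {"E": rng.normal(size=1000),
        "Lz": rng.normal(size=1000),
        "L": rng.normal(size=1000)}

scans = {name: 0 for name in data}
def column(name):
    scans[name] += 1   # tally every scan of a column
    return data[name]

# pair-by-pair: the shared column E is scanned once per pair
for a, b in [("E", "Lz"), ("E", "L")]:
    np.corrcoef(column(a), column(b))
print(scans["E"])  # 2
```

For in-memory arrays the repeated scan is cheap, but when each scan means re-reading the data from disk, the difference becomes significant.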
This matters especially when many subspaces are used, as in the following example.
In [30]:
subspaces = ds.combinations()
correlations = ds.correlation(subspaces)
mutual_informations = ds.mutual_information(subspaces)
In [31]:
from astropy.io import ascii
import sys
names = ["_".join(subspace) for subspace in subspaces]
ascii.write([names, correlations, mutual_informations], sys.stdout,
names=["names", "correlation", "mutual_information"])
# replace sys.stdout by a filename such as "example.asc"
# mkstemp is the safe replacement for the deprecated tempfile.mktemp
import os
fd_asc, filename_asc = tempfile.mkstemp(suffix=".asc")
os.close(fd_asc)
ascii.write([names, correlations, mutual_informations], filename_asc,
names=["names", "correlation", "mutual_information"])
print("--------")
# or write it as a latex table
ascii.write([names, correlations, mutual_informations],
sys.stdout, names=["names", "correlation", "mutual information"], Writer=ascii.Latex)
In [32]:
# reading it back in
table = ascii.read(filename_asc)
print("this is an astropy table:\n", table)
correlations = table["correlation"]
print()
print("this is an astropy column:\n", correlations)
print()
print("this is the numpy data:\n", correlations.data)
# short: table["correlation"].data
This tutorial covers the basics; more can be learned by reading the API documentation. Note that every docstring can be read from the notebook using shift-tab, or by typing for instance ds.plot?.
If you think a particular topic should be addressed here, please open an issue on github.