In [6]:
# First things first, set up holoviews for visualizations
import holoviews as hv
hv.notebook_extension('bokeh')
The fundamental purpose of the Catalog object is to provide an interface to access desired columns from data tables on demand, without necessarily having to keep the whole contents in memory. This is currently implemented primarily in the ParquetCatalog object, which uses the dask interface to parquet tables. Initialize a ParquetCatalog with the path to a parquet file, or with a list of multiple such files.
In [7]:
from explorer.catalog import ParquetCatalog
path = '/datasets/hsc/repo/rerun/RC/w_2018_04/DM-13256/plots/HSC-I/tract-9813/forced.parq'
catalog = ParquetCatalog(path)
The workhorse method of a Catalog is get_columns, which returns one or more columns (passed as a list) as a dask DataFrame.
In [8]:
catalog.get_columns(['coord_ra', 'coord_dec'])
Out[8]:
Note that dask data structures stay out of memory until explicitly computed with .compute(), at which point a pandas DataFrame is returned. This step takes a few seconds, as it is the first time data is read from disk. Note also that only the requested columns are loaded into memory, not the entire table block, as would be necessary with FITS or HDF. This is the beauty of the parquet column-store table format, and a key feature enabling QA at scale.
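The lazy, column-selective access pattern can be sketched with a toy columnar store in plain Python (an illustration of the idea only; this is not the actual dask/parquet machinery, and the class and column names here are made up):

```python
class LazyColumns:
    """Toy stand-in for a dask DataFrame: holds a recipe, not data."""
    def __init__(self, store, names):
        self.store = store
        self.names = names

    def compute(self):
        # Only now are the requested columns actually materialized
        return {name: self.store[name] for name in self.names}


class ToyColumnStore:
    """Columns are stored independently, so reading one never touches the others."""
    def __init__(self, columns):
        self.columns = columns

    def get_columns(self, names):
        # Returns immediately; no data is read yet
        return LazyColumns(self.columns, names)


store = ToyColumnStore({'coord_ra': [150.1, 150.2],
                        'coord_dec': [2.1, 2.2],
                        'base_PsfFlux_flux': [10.0, 12.0]})
lazy = store.get_columns(['coord_ra', 'coord_dec'])
result = lazy.compute()  # only ra/dec are materialized; the flux column is untouched
```

In a row-oriented format like FITS, reading any column would force reading whole table rows; the column-store layout is what makes the selective read cheap.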
In [9]:
catalog.get_columns(['coord_ra', 'coord_dec']).compute().head() # .head() to limit cell output
Out[9]:
Aside from get_columns, there are a few other convenience attributes that can be useful:
In [10]:
# Names of all the available columns in the catalog
catalog.columns
Out[10]:
In [11]:
# All columns that have bool datatype
catalog.flags
Out[11]:
In [12]:
# in-memory dataframe of RA/Dec in degrees:
catalog.coords.head()
Out[12]:
Let's now use holoviews to plot the ra-dec positions of the 2 million sources in this catalog. This demonstrates some of the basic holoviews tools that are used throughout the qa_tools package. First, we define the holoviews data containers, which are wrappers around the data (no copies made).
In [13]:
dset = hv.Dataset(catalog.coords)
pts = hv.Points(dset, kdims=['ra', 'dec'])
If there were fewer than ~$10^4$ points in this dataset, at this point we could just execute pts and a scatter plot with all the points would be displayed. However, trying that with $10^6$ points would bring the browser to its knees. This is where the datashader library comes in. Datashader renders an on-the-fly aggregation of the underlying data, displaying the resulting image rather than passing individual points to the browser, and recomputing as you pan and zoom. The %%opts magic function at the top of the cell is how you change the display defaults (the default plots in holoviews are pretty small).
In [14]:
%%opts RGB [width=500, height=500]
from holoviews.operation.datashader import datashade
datashade(pts)
Out[14]:
The above image essentially shows a density map, as the default aggregation is count, and there are only two columns in this dataset anyway. However, you can do more interesting things, such as having the aggregation be the mean of a different column.
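The core of what datashade computes, binning points onto a fixed pixel grid, can be sketched in plain Python (a toy count aggregation for illustration; the real datashader implementation is vectorized and recomputes as you pan and zoom):

```python
def toy_datashade(xs, ys, width=4, height=4):
    """Count the points falling into each cell of a width x height pixel grid."""
    xmin, xmax = min(xs), max(xs)
    ymin, ymax = min(ys), max(ys)
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Map each point to a pixel; clamp the maximum edge into the last bin
        i = min(int((y - ymin) / (ymax - ymin) * height), height - 1)
        j = min(int((x - xmin) / (xmax - xmin) * width), width - 1)
        grid[i][j] += 1
    return grid


# Four points: two near one corner, two near the opposite corner
counts = toy_datashade([0.0, 0.1, 0.9, 1.0], [0.0, 0.1, 0.9, 1.0])
```

The browser only ever sees the small `width x height` image of aggregated values, never the individual points; swapping `count` for a per-pixel `mean` of some column is what the next cell demonstrates with the real library.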
In [15]:
# Add another column onto the coords dataframe
import pandas as pd
import datashader # for the different aggregator
import colorcet as cc # For colormaps
other_cols = ['base_InputCount_value', 'base_PsfFlux_flux']
df = pd.concat([catalog.coords, catalog.get_columns(other_cols).compute()], axis=1)
dset = hv.Dataset(df)
pts = hv.Points(dset, kdims=['ra', 'dec'], vdims=other_cols)
# To illustrate some more holoviews stuff
map1 = datashade(pts, aggregator=datashader.mean('base_InputCount_value')).relabel('Input Count')
map2 = datashade(pts, aggregator=datashader.mean('base_PsfFlux_flux'), cmap=cc.palette['fire']).relabel('Mean flux')
# Plot two side by side; axes automatically linked because they share dimension
map1 + map2
Out[15]:
Having demonstrated this basic visualization, let's do the same thing, but load multiple tracts into a single catalog object.
In [16]:
ls /datasets/hsc/repo/rerun/RC/w_2018_04/DM-13256/plots/HSC-I/
In [17]:
import glob
# From PDR1
paths = glob.glob('/project/tmorton/DM-12043/SSP_WIDE_VVDS/plots/HSC-I/tract*/forced.parq')
vvds = ParquetCatalog(paths)
print(len(vvds.coords))
In [18]:
%%opts RGB [width=1200, height=400]
datashade(hv.Points(vvds.coords, kdims=['ra', 'dec']))
Out[18]:
There are also some convenience subclasses of ParquetCatalog that allow you to access catalogs using dataIds.
In [4]:
from lsst.daf.persistence import Butler
repo_path = '/datasets/hsc/repo/rerun/RC/w_2018_04/DM-13256/'
butler = Butler(repo_path)
In [5]:
from explorer.catalog import CoaddCatalog, VisitCatalog
coaddId = {'tract':9813, 'filter':'HSC-I'}
coaddCat = CoaddCatalog(butler, coaddId)
print(len(coaddCat.coords))
visitId = {'tract':9813, 'filter':'HSC-I', 'visit':1228}
visitCat = VisitCatalog(butler, visitId)
print(len(visitCat.coords))
There's even a convenience function to find out which visits have parquet tables available in a given repo, for a given tract and filter:
In [16]:
from explorer.utils import get_visits
get_visits(butler, 9813, 'HSC-I')
Out[16]:
(Note that all these dataId-based functions currently abuse the butler, using pipe_analysis.utils.Filenamer.)
While one can always call get_columns to get the desired columns and then do calculations on them, it is usually more convenient to use a Functor object on a catalog. There are several advantages to this: a Functor knows what columns it needs to request from a catalog, and multiple functors can be combined into a single CompositeFunctor.
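The mechanism can be sketched as follows (a simplified toy, not the actual explorer.functors implementation; the classes, flux-column naming convention, and catalog interface here are illustrative assumptions):

```python
import math


class ToyCatalog:
    """Minimal catalog: hands back the requested columns as a dict of lists."""
    def __init__(self, data):
        self.data = data

    def get_columns(self, names):
        return {n: self.data[n] for n in names}


class ToyMag:
    """A functor knows which columns it needs and how to transform them."""
    def __init__(self, col):
        self.col = col

    @property
    def columns(self):
        # Declares the required columns without touching the catalog
        return [self.col + '_flux']

    def __call__(self, catalog):
        flux = catalog.get_columns(self.columns)[self.columns[0]]
        return [-2.5 * math.log10(f) for f in flux]


class ToyComposite:
    """Gathers the union of columns needed by several functors."""
    def __init__(self, funcs):
        self.funcs = funcs

    @property
    def columns(self):
        return sorted({c for f in self.funcs.values() for c in f.columns})

    def __call__(self, catalog):
        return {name: f(catalog) for name, f in self.funcs.items()}


cat = ToyCatalog({'base_PsfFlux_flux': [100.0, 1000.0],
                  'modelfit_CModel_flux': [110.0, 900.0]})
psfmag = ToyMag('base_PsfFlux')
mags = psfmag(cat)
funcs = ToyComposite({'psf': psfmag, 'cmodel': ToyMag('modelfit_CModel')})
out = funcs(cat)
```

The point is that the functor, not the user, tracks which columns must be pulled from disk, so a composite can fetch exactly the union of what its members need.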
In [4]:
from explorer.functors import Mag, MagDiff, Seeing, DeconvolvedMoments, Column, CustomFunctor
psfmag = Mag('base_PsfFlux')
print(psfmag.columns)
psfmag_df = psfmag(coaddCat)
psfmag_df.head()
Out[4]:
In [5]:
seeing = Seeing()
print(seeing.columns)
seeing_df = seeing(coaddCat)
seeing_df.head()
Out[5]:
In [6]:
magdiff = MagDiff('modelfit_CModel', 'base_PsfFlux')
print(magdiff.columns)
magdiff_df = magdiff(coaddCat)
magdiff_df.head()
Out[6]:
There are also functors for labelling:
In [21]:
from explorer.functors import StarGalaxyLabeller
labeller = StarGalaxyLabeller()
print(labeller.columns)
label_df = labeller(coaddCat)
label_df.head()
Out[21]:
Or, we can do all at once:
In [22]:
from explorer.functors import CompositeFunctor
funcs = CompositeFunctor({'psf':psfmag, 'seeing':seeing, 'magdiff':magdiff, 'label':labeller})
print(funcs.columns)
df = funcs(coaddCat)
df.head()
Out[22]:
And we can look at the distribution of these quantities (just to show how to do this using holoviews):
In [21]:
from holoviews.operation import histogram
dset = hv.Dataset(df)
hv.Layout([histogram(dset, dimension=d, num_bins=40) for d in ['psf', 'magdiff', 'seeing']])
Out[21]:
In [6]:
from explorer.catalog import MatchedCatalog
matchedCat = MatchedCatalog(coaddCat, visitCat)
The catalogs are matched using the KD-tree implementation in explorer.match.match_lists. Like everything else, the matching is lazy, so it is only done when needed. When you request columns from a MatchedCatalog, you get back a tuple of two dataframes. However, as mentioned above, get_columns usually does not need to be used directly; rather, computations should be done through a Functor. For example:
In [23]:
psfmag(matchedCat).head()
Out[23]:
These numbers may look surprising, as they do not look like PSF magnitudes. This is because, by default, a MatchedCatalog computes a functor as the difference between the two catalogs (first minus second). If something else is desired, the behavior can be made explicit:
In [8]:
print(psfmag(matchedCat, how='first').head())
print(psfmag(matchedCat, how='second').head())
print(psfmag(matchedCat, how='all').head())
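Under the hood, the match step pairs each source in one catalog with its nearest neighbor in the other. A brute-force sketch of that operation in plain Python (the real code uses the KD-tree in explorer.match.match_lists for speed, and works on sky coordinates rather than this flat toy distance):

```python
def toy_match(coords1, coords2):
    """For each point in coords1, return the index of the closest point in coords2."""
    def dist2(p, q):
        # Squared Euclidean distance; good enough for ranking neighbors
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    return [min(range(len(coords2)), key=lambda j: dist2(p, coords2[j]))
            for p in coords1]


# Toy (ra, dec)-like pairs; values are made up for illustration
cat1 = [(150.00, 2.00), (150.10, 2.10)]
cat2 = [(150.11, 2.09), (150.01, 2.01)]
idx = toy_match(cat1, cat2)  # -> [1, 0]
```

A KD-tree turns each nearest-neighbor query from O(n) into roughly O(log n), which is what makes matching million-row catalogs feasible.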
In [35]:
funcs = CompositeFunctor({'psf':psfmag, 'seeing':seeing, 'label':labeller})
funcs(matchedCat, how='all').head()
Out[35]:
Note how calling a composite functor on a MatchedCatalog gives you a dataframe with a multi-level index. Also note that we did not initialize the catalogs we are using with names, so they are given names made of random character strings, seeded by the hash of the catalog filename(s) so that a given catalog always gets the same "random" name.
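The multi-level column structure can be reproduced with plain pandas: pd.concat with keys builds exactly this kind of index. The catalog names and values below are made up for illustration:

```python
import pandas as pd

# Two per-catalog frames of functor outputs (illustrative values)
first = pd.DataFrame({'psf': [20.1, 21.3], 'seeing': [0.60, 0.70]})
second = pd.DataFrame({'psf': [20.2, 21.2], 'seeing': [0.65, 0.72]})

# keys= become the outer level of a column MultiIndex, analogous to
# calling a composite functor on a MatchedCatalog with how='all'
df = pd.concat([first, second], axis=1, keys=['first', 'second'])
```

Selecting one outer key, e.g. `df['first']`, recovers an ordinary single-catalog dataframe, which is what makes per-catalog comparisons convenient.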
There is also a matched catalog implementation that allows for multiple catalogs to all match to a single one; this is particularly useful for multi-visit -> coadd matching, to investigate things like photometric or astrometric repeatability, or to see how quantities vary from visit to visit.
In [7]:
from explorer.utils import get_visits
from explorer.catalog import MultiMatchedCatalog
visits = get_visits(butler, 9813, 'HSC-I')[:3]
multiMatchedCat = MultiMatchedCatalog(coaddCat, [VisitCatalog(butler, {'tract':9813, 'filter':'HSC-I', 'visit':v}, name=v) for v in visits])
In [36]:
funcs(multiMatchedCat).dropna(how='any').sample(10)
Out[36]:
In [8]:
from explorer.catalog import MultiBandCatalog
filters = ['HSC-G', 'HSC-R', 'HSC-I']
multiBandCat = MultiBandCatalog({f:CoaddCatalog(butler, {'tract':9813, 'filter':f}) for f in filters})
In [26]:
funcs(multiBandCat).sample(10)
Out[26]:
In [9]:
from explorer.functors import MagDiff, Column, Seeing
from explorer.dataset import QADataset
magdiff = MagDiff('modelfit_CModel', 'base_PsfFlux')
magdiff_Gauss = MagDiff('base_GaussianFlux', 'base_PsfFlux')
funcs = {'cmodel':magdiff, 'count': Column('base_InputCount_value'), 'seeing':Seeing()}
visitFuncs = {'gauss':magdiff_Gauss, 'seeing':Seeing()}
coaddData = QADataset(coaddCat, funcs)
visitData = QADataset(visitCat, visitFuncs)
matchedData = QADataset(matchedCat, visitFuncs)
multiData = QADataset(multiMatchedCat, visitFuncs)
multiBandData = QADataset(multiBandCat, funcs)
The .df attribute (lazily computed) is a pandas dataframe that holds all the columns of the requested functors, plus columns for the coordinates, ccd/patch information, a label, and an x column, which by default is the PSF magnitude. The .ds attribute is an hv.Dataset wrapper of that data. Here is how these look for all the different kinds of catalogs used:
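The "lazily computed" behavior of .df can be sketched with a cached property (a generic illustration of the pattern, not QADataset's actual code; the names and toy functor are assumptions):

```python
from functools import cached_property


class ToyQADataset:
    """Compute the expensive table only on first access, then cache it."""
    def __init__(self, catalog, funcs):
        self.catalog = catalog
        self.funcs = funcs
        self.n_computes = 0  # counter just to demonstrate the caching

    @cached_property
    def df(self):
        self.n_computes += 1
        # The real object would evaluate every functor against the catalog here
        return {name: f(self.catalog) for name, f in self.funcs.items()}


data = ToyQADataset(catalog=[1.0, 4.0, 9.0],
                    funcs={'sqrt': lambda cat: [x ** 0.5 for x in cat]})
first = data.df   # triggers the computation
second = data.df  # returns the cached result; nothing is recomputed
```

This is why constructing the QADataset objects above was instantaneous: no functor is evaluated until .df (or .ds, which wraps it) is first accessed.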
In [14]:
print(coaddData.ds)
coaddData.df.head()
Out[14]:
In [15]:
print(visitData.ds)
visitData.df.head()
Out[15]:
In [12]:
print(matchedData.ds)
matchedData.df.head()
Out[12]:
In [13]:
print(multiData.ds)
multiData.df.head()
Out[13]:
In [16]:
print(multiBandData.ds)
multiBandData.df.head()
Out[16]: