This tutorial focuses on using the DM stack in Python. Some of the things we'll be doing are more commonly done on the command-line, via executable scripts the stack also provides. A complete tutorial for the command-line functionality can be found in DM Tech Note 23.
More notebook examples can be found here: https://github.com/RobertLuptonTheGood/notebooks/tree/master/Demos
Instead of operating directly on files and directories, we interact with on-disk data products via an abstraction layer called the data butler. The butler operates on data repositories, and our first task is to set up a repository with some raw data, master calibration files, and an external reference catalog. All of these are from a self-contained test dataset we call ci_hsc. The full ci_hsc dataset includes just enough data to run the full (current) LSST pipeline, which extends through processing coadds from multiple bands together. In this tutorial we'll focus on processing an individual image, and that's all this particular subset will support. We also won't go into the details of how to build master calibration files or reference catalogs here.
These first few steps to set up a data repository are best performed on the command-line, but we'll use a Jupyter trick to do that within the notebook. You're also welcome to copy and paste these lines (minus the "%%script bash" line, of course) into a JupyterLab terminal window and run them individually instead, if you want to pay close attention to what we're doing.
In [ ]:
%%script bash
export DATA_DIR=$HOME/DATA
export CI_HSC_DIR=$DATA_DIR/ci_hsc_small
mkdir -p $DATA_DIR
cd $DATA_DIR
if ! [ -d $CI_HSC_DIR ]; then
    curl -O http://lsst-web.ncsa.illinois.edu/~krughoff/data/small_demo.tar.gz
    tar zxvf small_demo.tar.gz
fi
export WORK_DIR=$HOME/WORK
mkdir -p $WORK_DIR
if ! [ -f $WORK_DIR/_mapper ]; then
    echo "lsst.obs.hsc.HscMapper" > $WORK_DIR/_mapper
    ingestImages.py $WORK_DIR $CI_HSC_DIR/raw/*.fits --mode=link
    cd $WORK_DIR
    ln -s $CI_HSC_DIR/CALIB .
    mkdir ref_cats
    cd ref_cats
    ln -s $CI_HSC_DIR/ps1_pv3_3pi_20170110 .
fi
In [ ]:
import os
DATA_DIR = os.path.join(os.environ['HOME'], "DATA")
CI_HSC_DIR = os.path.join(DATA_DIR, "ci_hsc_small")
WORK_DIR = os.path.join(os.environ['HOME'], "WORK")
Before we can start doing interesting things, we need some minimally processed images (i.e. flat-fielded, bias-corrected, etc.). Because the HSC team has spent a lot of time characterizing the instrument, we really want to run this step with the default configuration they've provided. That's also actually much easier to do from the command-line; while we could do it from Python, that would involve a lot of little irrelevant workarounds we'd rather not get bogged down in.
ISR is implemented as a subclass of lsst.pipe.base.Task. Nearly all of our high-level algorithms are implemented as Tasks, which are essentially just callable objects that can be composed (a high-level Task can hold one or more lower-level "subtasks", to which it can delegate work) and configured (every task takes an instance of a configuration class that controls what it does in detail). ISR is actually a CmdLineTask, a special kind of task that can be run from the command-line and use the data butler for all of its inputs and outputs (regular Tasks generally do not use the butler directly). Unlike virtually every other algorithm, there is a different ISR Task for each major camera (though there's also a simple default one), reflecting the specialized processing that's needed at this level.
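Before going further, it may help to see the general pattern for configuring and running any Task from Python. This is a minimal sketch using SubtractBackgroundTask, a real Task we'll run for real later in this tutorial:
In [ ]:
from lsst.meas.algorithms import SubtractBackgroundTask
# Every Task exposes its configuration class as Task.ConfigClass.
config = SubtractBackgroundTask.ConfigClass()
# Construct the Task with an (optionally modified) config...
task = SubtractBackgroundTask(config=config)
# ...and then call its run() method on in-memory objects, e.g.
# task.run(exposure), once we have an Exposure to process.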
But (for uninteresting, historical reasons) it's not currently possible to run IsrTask from the command-line. Instead, what we can do is run a parent CmdLineTask, lsst.pipe.tasks.ProcessCcdTask, which will run IsrTask as well as a few other steps. By default, it doesn't actually save the image directly after ISR is run - it performs a few more operations first, and then saves that image. But we can tell it to save the post-ISR image by modifying the task's configuration when we run it.
The full command-line for running ProcessCcdTask is below. Note that $HOME/WORK is just the WORK_DIR variable we've defined above, but we have to redefine it here because environment variables in one %%script environment don't propagate to the next. If you've been running these from a terminal tab instead, you can just use $WORK_DIR.
In [ ]:
%%script bash
processCcd.py $HOME/WORK --rerun isr --id visit=903334 ccd=16 --config isr.doWrite=True
There are a few features of this command-line that bear explaining:
- We've run processCcd.py, not ProcessCcdTask. There's a similar driver script for all CmdLineTasks, with the name formed by making the first word lowercase and removing the Task suffix. These are added to your PATH when you set up the LSST package in which they're defined (which happens automatically in the JupyterLab environment).
- We've used the --rerun argument to set the location of the output repository, in this case $HOME/WORK/rerun/isr. You can also use --output to set the path more directly, but we recommend --rerun because it enforces a nice convention for where to put outputs that helps with discoverability.
- The --id argument sets the data ID(s) to be processed, in this case a single CCD from a single visit. All CmdLineTasks share a fairly sophisticated syntax for expressions that match multiple data IDs, which you can learn more about by running any CmdLineTask with --help.
- We've used the --config option to override an individual configuration value, in this case to make sure the just-after-ISR image file is written. Running a CmdLineTask also automatically applies configuration overrides that customize the task for the kind of data you're processing (i.e. which camera it comes from), and that's how the task knows to run the custom ISR task for HSC rather than the generic default. You can see all of the config options for a CmdLineTask by running with --show config, though the results can be a bit overwhelming.

The rest of this tutorial is focused on using LSST software as a Python library, so this will be the last thing we run from the command-line. Again, for more information about how to run LSST's existing processing scripts from the command-line, check out DM Tech Note 23.
The outputs of CmdLineTasks, like their inputs, are organized into data repositories, which are managed by an object called Butler. To retrieve a dataset from the Butler, we start by constructing one pointing to the output repository from the processing run (which is now an input repository for this Butler; it won't have an output repository, since we won't be writing any more files):
In [ ]:
from lsst.daf.persistence import Butler
butler = Butler(inputs=os.path.join(WORK_DIR, "rerun/isr"))
We can then call get with the name and data ID of the dataset. The name of the image that's saved directly after ISR is postISRCCD.
In [ ]:
exposure = butler.get("postISRCCD", visit=903334, ccd=16)
A full 2k x 4k HSC CCD is a pretty big image to display when you don't have specialized display code. The DM stack does have specialized display code, but it either requires DS9 (which requires some ssh tunnels to use with data living on a server) or a Firefly server installation. For this tutorial, we'll just throw together a naive matplotlib display function, and create a view to a subimage that we'll display instead of the full image.
This section features a few of our most important class objects:

lsst.afw.image.Exposure is an image object that actually holds three image planes: the science image (Exposure.image), an image of the variance in every pixel (Exposure.variance), and an integer bit mask (Exposure.mask). It also holds a lot of more complex objects that characterize the image, such as a point-spread function (lsst.afw.detection.Psf) and world-coordinate system (lsst.afw.image.Wcs). Most of these objects aren't filled in yet, because all we've run so far is ISR. It doesn't generally make sense to perform mathematical operations (i.e. addition) on Exposures, because those operations aren't always well-defined on the more complex objects. You can get a MaskedImage object with the same image, mask, and variance planes that does support mathematical operations but doesn't contain Psfs and Wcss (etc.) with Exposure.maskedImage.
The Exposure.image and Exposure.variance properties return lsst.afw.image.Image objects. These have a .array property that returns a numpy.ndarray view to the Image's pixels. Conceptually, you should think of an Image as just a numpy.ndarray with a possibly nonzero origin.
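We can check the view semantics directly. This is a small sketch that temporarily modifies one pixel of the exposure we loaded above and then undoes the change:
In [ ]:
arr = exposure.image.array
arr[0, 0] += 1.0
print(exposure.image.array[0, 0])  # the Exposure's pixels changed too: .array is a view
arr[0, 0] -= 1.0                   # undo the change
print(exposure.image.getXY0())     # the origin, which is (0, 0) for a full CCD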
The Exposure.mask property returns a lsst.afw.image.Mask object, which behaves like an Image with a dictionary-like object that relates string labels to bit numbers.
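We can peek at those plane definitions directly (a quick sketch; the exact set of planes depends on the processing that has been run so far):
In [ ]:
mask = exposure.mask
print(mask.getMaskPlaneDict())      # dictionary-like mapping from plane name to bit number
print(mask.getPlaneBitMask("SAT"))  # the integer bit mask for a single plane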
All of these image-like objects have a getBBox() method, which returns a lsst.afw.geom.Box2I. The minimum and maximum points of a Box2I are specified in integers that correspond to the centers of the lower-left and upper-right pixels in the box, but the box conceptually contains the entirety of those pixels. To get a box with a floating-point representation of the same boundary for the extent argument to imshow below, we construct a Box2D from the Box2I.
Point2I and Extent2I are used to represent absolute positions and offsets between positions as integers (respectively). These have floating-point counterparts, Point2D and Extent2D.
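Here's a small, standalone illustration of these geometry classes (the values are arbitrary):
In [ ]:
from lsst.afw.geom import Box2D, Box2I, Point2I, Extent2I
box = Box2I(Point2I(2, 3), Extent2I(4, 5))  # integer box from minimum corner + dimensions
print(box.getMin(), box.getMax(), box.getDimensions())
print(Box2D(box))                      # floating-point box covering the same pixels
print(Point2I(2, 3) + Extent2I(1, 1))  # Point + Extent yields a new Point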
In [ ]:
from lsst.afw.geom import Box2D, Box2I, Point2I, Extent2I
from lsst.afw.image import Exposure
In [ ]:
# Execute this cell (and the one below) to re-load the post-ISR Exposure from disk after
# modifying it.
exposure = butler.get("postISRCCD", visit=903334, ccd=16)
In [ ]:
bbox = exposure.getBBox()
bbox.grow(-bbox.getDimensions()//3) # box containing the central third (in each dimension)
bbox.grow(-Extent2I(0, 400)) # make it a bit smaller in y
# exposure[bbox] would also work here because exposure.getXY0() == (0, 0),
# but it's dangerous in general because it ignores that origin.
sub = Exposure(exposure, bbox=bbox, dtype=exposure.dtype, deep=False)
In [ ]:
import numpy as np
import matplotlib
%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (8, 6)
matplotlib.rcParams["font.size"] = 12
In [ ]:
def display(image, mask=None, colors=None, alpha=0.40, **kwds):
    box = Box2D(image.getBBox())
    extent = (box.getMinX(), box.getMaxX(), box.getMinY(), box.getMaxY())
    kwds.setdefault("extent", extent)
    kwds.setdefault("origin", "lower")
    kwds.setdefault("interpolation", "nearest")
    matplotlib.pyplot.imshow(image.array, **kwds)
    # Scaling arguments apply only to the science image, not the mask overlays.
    kwds.pop("vmin", None)
    kwds.pop("vmax", None)
    kwds.pop("norm", None)
    kwds.pop("cmap", None)
    if mask is not None:
        # colors maps mask plane names (e.g. "CR") to matplotlib color names.
        for plane, color in colors.items():
            array = np.zeros(mask.array.shape + (4,), dtype=float)
            rgba = np.array(matplotlib.colors.hex2color(matplotlib.colors.cnames[color]) + (alpha,),
                            dtype=float)
            np.multiply.outer((mask.array & mask.getPlaneBitMask(plane)).astype(bool), rgba, out=array)
            matplotlib.pyplot.imshow(array, **kwds)
And now here's the cutout of the detrended image. I've cheated in setting the scale by looking at the background level in advance.
In [ ]:
display(sub.image, vmin=175, vmax=300, cmap=matplotlib.cm.gray)
The next step we usually take is to estimate and subtract the background, using lsst.meas.algorithms.SubtractBackgroundTask. This is a regular Task, not a CmdLineTask, and hence we'll just pass it our Exposure object (it operates in-place) instead of a Butler.
In [ ]:
from lsst.meas.algorithms import SubtractBackgroundTask
In [ ]:
bkgConfig = SubtractBackgroundTask.ConfigClass()
In [ ]:
# Execute this cell to get fun & terrible results!
bkgConfig.useApprox = False
bkgConfig.binSize = 20
The pattern for configuration here is the same as it was for SubaruIsrTask, but here we're setting values directly instead of loading a configuration file from the obs_subaru camera-specialization package. The config object here is an instance of a class that inherits from lsst.pex.config.Config and contains a set of lsst.pex.config.Field objects that define the options that can be modified. Each Field behaves more or less like a Python property, and you can get information on all of the fields in a config object either by using help or by using IPython's ? syntax on an individual field:
In [ ]:
help(bkgConfig)
In [ ]:
SubtractBackgroundTask.ConfigClass.algorithm?
In [ ]:
bkgTask = SubtractBackgroundTask(config=bkgConfig)
In [ ]:
bkgResult = bkgTask.run(exposure)
In [ ]:
display(sub.image, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
If you've run through all of these steps after executing the cell that warns about terrible results, you should notice that the galaxy in the upper right has been oversubtracted.
Exercise: Before continuing on, re-load the exposure from disk, reset the configuration and Task instances, and re-run without executing the cell that applies bad values to the config, all by just re-executing the right cells above. You should end up with an image in which the upper-right galaxy looks essentially the same as it does in the image before we subtracted the background.
Most later processing steps require a PSF model, which is represented by a Psf object that's attached to the Exposure. For now, we'll just make a Gaussian PSF with some guess at the seeing.
In [ ]:
from lsst.meas.algorithms import SingleGaussianPsf
In [ ]:
FWHM_TO_SIGMA = 1.0/(2*np.sqrt(2*np.log(2)))
PIXEL_SCALE = 0.168 # arcsec/pixel
SEEING = 0.7 # FWHM in arcsec
sigma = FWHM_TO_SIGMA*SEEING/PIXEL_SCALE
width = int(sigma*3)*2 + 1
psf = SingleGaussianPsf(width, width, sigma=sigma)
exposure.setPsf(psf)
A Psf object can basically just do one thing: it can return an image of itself at a point. SingleGaussianPsf represents a constant PSF, so it always returns the same image, regardless of the point you give it.
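We can verify that constancy directly (a quick check; the two positions are arbitrary):
In [ ]:
from lsst.afw.geom import Point2D
img1 = psf.computeKernelImage(Point2D(0.0, 0.0))
img2 = psf.computeKernelImage(Point2D(1000.0, 1000.0))
print(np.allclose(img1.array, img2.array))  # True: the PSF image is the same everywhere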
But there are two ways to evaluate a Psf at a point. If you want an image centered on the middle pixel, with that middle pixel as the origin - what you'd usually want if you're going to convolve the PSF with another model - use computeKernelImage(x, y):
In [ ]:
from lsst.afw.geom import Point2D
display(psf.computeKernelImage(Point2D(60.5, 7.2)))
If you want to compare the PSF to a star at the exact same position, use computeImage(x, y). That will shift the image returned by computeKernelImage(x, y) to the right sub-pixel offset, and update the origin of the image to take care of the rest, so you end up with a postage stamp in the same coordinate system as the original image where the star is.
In [ ]:
display(psf.computeImage(Point2D(60.5, 7.2)))
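The difference between the two methods shows up in the image origins. Here's a quick comparison at the same (arbitrary) point used above:
In [ ]:
kernelImage = psf.computeKernelImage(Point2D(60.5, 7.2))
starImage = psf.computeImage(Point2D(60.5, 7.2))
print(kernelImage.getXY0())  # origin chosen so the PSF is centered on pixel (0, 0)
print(starImage.getXY0())    # origin chosen so the stamp overlays the star's position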
Cosmic rays are detected and interpolated by RepairTask, which also sets mask planes to indicate where the cosmic rays were ("CR") and which pixels were interpolated ("INTERP"; this may happen due to saturation or bad pixels as well). Because we're just using the default configuration, we can skip creating a config object and just construct the Task with no arguments.
In [ ]:
from lsst.pipe.tasks.repair import RepairTask
In [ ]:
repairTask = RepairTask()
In [ ]:
repairTask.run(exposure)
In [ ]:
display(sub.image, mask=sub.mask, colors={"CR": "red"},
vmin=-0.5, vmax=100, alpha=0.8, cmap=matplotlib.cm.gray)
Unlike the other Tasks we've dealt with so far, SourceDetectionTask creates a SourceCatalog in addition to updating the image (all it does to the image is add a "DETECTED" mask plane). All Tasks that work with catalogs need to be initialized with an lsst.afw.table.Schema object, to which the Task will add the fields necessary to store its outputs. A SourceCatalog's Schema cannot be modified after the SourceCatalog has been constructed, which means it's necessary to construct all Schema-using Tasks before actually running any of them.
Each record in the catalog returned by SourceDetectionTask has a Footprint object attached to it. A Footprint represents the approximate region covered by a source in a run-length-encoded data structure. It also contains a list of peaks found within that region. The "DETECTED" mask plane is set to exactly the pixels covered by any Footprint in the returned catalog.
In [ ]:
from lsst.meas.algorithms import SourceDetectionTask
from lsst.afw.table import SourceTable, SourceCatalog
In [ ]:
schema = SourceTable.makeMinimalSchema()
In [ ]:
detectTask = SourceDetectionTask(schema=schema)
In [ ]:
# A SourceTable is really just a factory object for records; don't confuse it with SourceCatalog, which is
# usually what you want. But a SourceTable *is* what SourceDetectionTask wants here.
table = SourceTable.make(schema)
In [ ]:
detectResult = detectTask.run(table, exposure)
In [ ]:
display(sub.image, mask=sub.mask, colors={"DETECTED": "blue"}, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
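Each detected source now carries a Footprint we can inspect directly. Here's a quick look at the first one (a minimal sketch; the record index is arbitrary):
In [ ]:
fp = detectResult.sources[0].getFootprint()
print(fp.getBBox())        # bounding box of the detected region
print(fp.getArea())        # number of pixels actually covered by the Footprint
print(len(fp.getPeaks()))  # number of peaks found within the region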
Deblending attempts to separate detections with multiple peaks into separate objects. We keep all of the original sources in the SourceCatalog (called parents) when we deblend, but for each parent source that contains more than one peak, we create a new record (called a child) for each of those peaks. The Footprints attached to the child objects are instances of a subclass called HeavyFootprint, which include new deblended pixel values as well as the region description. These can be used by calling insert to replace an Image's pixels with the HeavyFootprint's pixels.
EXERCISE: This section will not run if the cells are executed naively in order. At some point you'll have to go re-execute one or more cells in the previous section to get the right behavior. Which one(s)? Why? Copy those cells here (in the right places) when you figure it out.
In [ ]:
from lsst.meas.deblender import SourceDeblendTask
In [ ]:
deblendTask = SourceDeblendTask(schema=schema)
In [ ]:
catalog = detectResult.sources
In [ ]:
deblendTask.run(exposure, catalog)
To inspect some deblender outputs, we'll start by finding some parent objects that were deblended into multiple children by looking at the deblend_nChild field (which was added to the Schema when we constructed the SourceDeblendTask, and populated when we called run).
In [ ]:
# Find some blended sources inside the subimage:
blendParents = []
for record in catalog:
if record.get("deblend_nChild") > 0 and bbox.contains(record.getFootprint().getBBox()):
blendParents.append(record)
# Sort by peak brightness so we can look at something with decent S/N
blendParents.sort(key=lambda r: -r.getFootprint().getPeaks()[0].getPeakValue())
In [ ]:
from lsst.afw.image import Image
The image of the parent object is just the original image, but we'll cut out just the region inside its Footprint:
In [ ]:
blendParentImage = Image(exposure.image, bbox=blendParents[0].getFootprint().getBBox(),
deep=True, dtype=np.float32)
Now we'll insert the deblended child pixels into blank images of the same size:
In [ ]:
blendChildImages = []
for blendChild in catalog.getChildren(blendParents[0].getId()):
image = Image(blendParentImage.getBBox(), dtype=np.float32)
blendChild.getFootprint().insert(image)
blendChildImages.append(image)
In [ ]:
nSubPlots = len(blendChildImages) + 1
nCols = 3
nRows = nSubPlots//nCols + 1
matplotlib.pyplot.subplot(nRows, nCols, 1)
display(blendParentImage, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
for n, image in enumerate(blendChildImages):
matplotlib.pyplot.subplot(nRows, nCols, n + 2)
display(image, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
SingleFrameMeasurementTask is typically responsible for adding most fields to a SourceCatalog. It runs a series of plugins that make different measurements (you can configure them with the .plugins dictionary-like field on its config object, and control which are run with .names). If the deblender has been run first, it will measure child objects using their deblended pixels.
EXERCISE: Like the Deblending section, you'll have to re-execute some previous cells somewhere in this section to get the right behavior. Copy those cells into the right places here once you've gotten it working.
In [ ]:
from lsst.meas.base import SingleFrameMeasurementTask
In [ ]:
measureConfig = SingleFrameMeasurementTask.ConfigClass()
In [ ]:
# What measurements are configured to run
print(measureConfig.plugins.names)
In [ ]:
# Import an extension module that adds a new measurement
import lsst.meas.extensions.photometryKron
In [ ]:
# What measurements *could* be configured to run
print(list(measureConfig.plugins.keys()))
In [ ]:
# Configure the new measurement to run
measureConfig.plugins.names.add("ext_photometryKron_KronFlux")
In [ ]:
measureTask = SingleFrameMeasurementTask(schema=schema, config=measureConfig)
In [ ]:
measureTask.run(catalog, exposure)
We'll show some of the results of measurement by overlaying the measured ellipses on the image.

The shapes and centroids we use here (by calling record.getX(), record.getY(), record.getShape()) are aliases (called "slots") to fields with longer names that are our recommended measurements for these quantities. You can see the set of aliases by printing the schema (see next section).
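You can also query the slot aliases programmatically. A small sketch, assuming AliasMap.get returns the field-name prefix a slot points at:
In [ ]:
aliases = catalog.getSchema().getAliasMap()
print(aliases.get("slot_Centroid"))  # the fields behind record.getX()/getY()
print(aliases.get("slot_Shape"))     # the field behind record.getShape()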
In [ ]:
from lsst.afw.geom.ellipses import Axes
In [ ]:
display(sub.image, mask=sub.mask, colors={"DETECTED": "blue"}, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
for record in catalog:
if record.get("deblend_nChild") != 0:
continue
axes = Axes(record.getShape()) # convert to A, B, THETA parameterization
axes.scale(2.0) # matplotlib uses diameters, not radii
patch = matplotlib.patches.Ellipse((record.getX(), record.getY()),
axes.getA(), axes.getB(), axes.getTheta() * 180.0 / np.pi,
fill=False, edgecolor="green")
matplotlib.pyplot.gca().add_patch(patch)
matplotlib.pyplot.show()
Print the schema:
In [ ]:
print(catalog.getSchema())
Get arrays of columns (requires the catalog to be contiguous in memory, which we can guarantee with a deep copy):
In [ ]:
catalog = catalog.copy(deep=True)
In [ ]:
psfFlux = catalog["base_PsfFlux_flux"]
Note that boolean values are stored in Flag columns, which are packed into bits. Unlike other column types, when you get an array of a Flag column, you get a copy, not a view.
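For example (assuming base_PsfFlux_flag is present, which it should be after the measurement step above):
In [ ]:
flags = catalog["base_PsfFlux_flag"]  # bits unpacked into a fresh bool array
print(flags.dtype, flags.sum())
flags[0] = True  # modifies only the copy, not the catalog itself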
Use Key objects instead of strings to do fast repeated access to fields when iterating over records:
In [ ]:
key = catalog.getSchema().find("deblend_nChild").key
deblended = [record for record in catalog if record.get(key) == 0]
You can also get a dict version of a subset of a Schema, a Catalog, or a Record by calling their extract methods with a glob:
In [ ]:
catalog[0].extract("base_PsfFlux_*") # or regex='...'
For Records, the dict values are just the values of the fields, and for Catalogs, they're numpy.ndarray columns. For Schemas, they're SchemaItems, which behave like a named tuple containing a Key and a Field, which contains more descriptive information.
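For example, calling extract on the catalog and on its schema with the same glob used above:
In [ ]:
columns = catalog.extract("base_PsfFlux_*")  # dict of str -> numpy.ndarray
print(list(columns.keys()))
items = catalog.getSchema().extract("base_PsfFlux_*")  # dict of str -> SchemaItem
print(items["base_PsfFlux_flux"].field.getDoc())       # the Field carries documentation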
Get an Astropy view of the catalog (from which you can make a Pandas view):
In [ ]:
table = catalog.asAstropy()
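And from the Astropy view, a pandas DataFrame. A minimal sketch: to_pandas() requires all columns to be one-dimensional, so any array-valued columns would need to be dropped first:
In [ ]:
df = table.to_pandas()
df.head()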
You can find some reference documentation for the catalog library here.
Make a scatter plot of Kron Flux vs. Psf Flux.
Write a single function that performs all of the above steps on a post-ISR Exposure object, modifying the Exposure in-place and returning a new SourceCatalog with a complete set of measurements.
Add PSF modeling to the end of that function, delegating most of the work to lsst.pipe.tasks.MeasurePsfTask. You may want to use a higher threshold (e.g. 50-sigma) for detection, since PSF modeling should only use bright stars.
Make images of the PSF stars, the PSF model at the position of those stars, and the difference between them.
Add another detect-deblend-measure sequence after PSF modeling at a deeper threshold.
Rewrite the function as a class that constructs all of the Tasks that it uses in __init__ and processes a single Exposure when you call its run method. Make sure it will behave properly if run is called multiple times with different Exposure objects.