This tutorial focuses on using the DM stack in Python. Some of the things we'll be doing are more commonly done on the command-line, via executable scripts the stack also provides. A complete tutorial for the command-line functionality can be found in DM Tech Note 23.
More notebook examples can be found here: https://github.com/RobertLuptonTheGood/notebooks/tree/master/Demos
Instead of operating directly on files and directories, we interact with on-disk data products via an abstraction layer called the data butler. The butler operates on data repositories, and our first task is to set up a repository with some raw data, master calibration files, and an external reference catalog. All of these are from a self-contained test dataset we call ci_hsc. The full ci_hsc dataset includes just enough data to run the full (current) LSST pipeline, which extends through processing coadds from multiple bands together. In this tutorial we'll focus on processing an individual image, and that's all this particular subset will support. We also won't go into the details of how to build master calibration files or reference catalogs here.
These first few steps to set up a data repository are best performed on the command-line, but we'll use a Jupyter trick to do that within the notebook. You're also welcome to copy and paste these lines (minus the "%%script bash" line, of course) into a JupyterLab terminal window and run them individually instead, if you want to pay close attention to what we're doing.
In [ ]:
%%script bash
export DATA_DIR=$HOME/DATA
export CI_HSC_DIR=$DATA_DIR/ci_hsc_small
mkdir -p $DATA_DIR
cd $DATA_DIR
if ! [ -d $CI_HSC_DIR ]; then
    curl -O http://lsst-web.ncsa.illinois.edu/~krughoff/data/small_demo.tar.gz
    tar zxvf small_demo.tar.gz
fi
export WORK_DIR=$HOME/WORK
mkdir -p $WORK_DIR
if ! [ -f $WORK_DIR/_mapper ]; then
    echo "lsst.obs.hsc.HscMapper" > $WORK_DIR/_mapper
    ingestImages.py $WORK_DIR $CI_HSC_DIR/raw/*.fits --mode=link
    cd $WORK_DIR
    ln -s $CI_HSC_DIR/CALIB .
    mkdir ref_cats
    cd ref_cats
    ln -s $CI_HSC_DIR/ps1_pv3_3pi_20170110 .
fi
In [ ]:
import os
DATA_DIR = os.path.join(os.environ['HOME'], "DATA")
CI_HSC_DIR = os.path.join(DATA_DIR, "ci_hsc_small")
WORK_DIR = os.path.join(os.environ['HOME'], "WORK")
Before we can start doing interesting things, we need some minimally processed images (i.e. flat-fielded, bias-corrected, etc.). Because the HSC team has spent a lot of time characterizing the instrument, we really want to run this step with the default configuration they've provided. That's also actually much easier to do from the command-line; while we could do it from Python, that would involve a lot of little irrelevant workarounds we'd rather not get bogged down in.
ISR is implemented as a subclass of lsst.pipe.base.Task. Nearly all of our high-level algorithms are implemented as Tasks, which are essentially just callable objects that can be composed (a high-level Task can hold one or more lower-level "subtasks", to which it can delegate work) and configured (every task takes an instance of a configuration class that controls what it does in detail). ISR is actually a CmdLineTask, a special kind of task that can be run from the command-line and use the data butler for all of its inputs and outputs (regular Tasks generally do not use the butler directly). Unlike virtually every other algorithm, there is a different ISR Task for each major camera (though there's also a simple default one), reflecting the specialized processing that's needed at this level.
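Before going further, it may help to see the general pattern for configuring and running any Task from Python. This is a minimal sketch using SubtractBackgroundTask, a real Task we'll run for real later in this tutorial:
In [ ]:
from lsst.meas.algorithms import SubtractBackgroundTask
# Every Task exposes its configuration class as Task.ConfigClass.
config = SubtractBackgroundTask.ConfigClass()
# Construct the Task with an (optionally modified) config...
task = SubtractBackgroundTask(config=config)
# ...and then call its run() method on in-memory objects, e.g.
# task.run(exposure), once we have an Exposure to process.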
But (for uninteresting, historical reasons) it's not currently possible to run IsrTask from the command-line. Instead, what we can do is run a parent CmdLineTask, lsst.pipe.tasks.ProcessCcdTask, which will run IsrTask as well as a few other steps. By default, it doesn't actually save the image directly after ISR is run - it performs a few more operations first, and then saves that image. But we can tell it to save the post-ISR image by modifying the task's configuration when we run it.
The full command-line for running ProcessCcdTask is below. Note that $HOME/WORK is just the WORK_DIR variable we've defined above, but we have to redefine it here because environment variables in one %%script environment don't propagate to the next. If you've been running these from a terminal tab instead, you can just use $WORK_DIR.
In [ ]:
%%script bash
processCcd.py $HOME/WORK --rerun isr --id visit=903334 ccd=16 --config isr.doWrite=True
There are a few features of this command-line that bear explaining:
- We've run processCcd.py, not ProcessCcdTask. There's a similar driver script for all CmdLineTasks, with the name formed by making the first word lowercase and removing the Task suffix. These are added to your PATH when you set up the LSST package in which they're defined (which happens automatically in the JupyterLab environment).
- We've used the --rerun argument to set the location of the output repository, in this case $HOME/WORK/rerun/isr. You can also use --output to set the path more directly, but we recommend --rerun because it enforces a nice convention for where to put outputs that helps with discoverability.
- The --id argument sets the data ID(s) to be processed, in this case a single CCD from a single visit. All CmdLineTasks share a fairly sophisticated syntax for expressions that match multiple data IDs, which you can learn more about by running any CmdLineTask with --help.
- We've used the --config option to override an individual configuration value, in this case to make sure the just-after-ISR image file is written. Running a CmdLineTask also automatically applies configuration overrides that customize the task for the kind of data you're processing (i.e. which camera it comes from), and that's how the task knows to run the custom ISR task for HSC rather than the generic default. You can see all of the config options for a CmdLineTask by running with --show config, though the results can be a bit overwhelming.

The rest of this tutorial is focused on using LSST software as a Python library, so this will be the last thing we run from the command-line. Again, for more information about how to run LSST's existing processing scripts from the command-line, check out DM Tech Note 23.
The outputs of CmdLineTasks, like their inputs, are organized into data repositories, which are managed by an object called Butler. To retrieve a dataset from the Butler, we start by constructing one pointing to the output repository from the processing run (which is now an input repository for this Butler; it won't have an output repository, since we won't be writing any more files):
In [ ]:
from lsst.daf.persistence import Butler
butler = Butler(inputs=os.path.join(WORK_DIR, "rerun/isr"))
We can then call get with the name and data ID of the dataset. The name of the image that's saved directly after ISR is postISRCCD.
In [ ]:
exposure = butler.get("postISRCCD", visit=903334, ccd=16)
A full 2k x 4k HSC CCD is a pretty big image to display when you don't have specialized display code. The DM stack does have specialized display code, but it either requires DS9 (which requires some ssh tunnels to use with data living on a server) or a Firefly server installation. For this tutorial, we'll just throw together a naive matplotlib display function, and create a view to a subimage that we'll display instead of the full image.
This section features a few of our most important class objects:

lsst.afw.image.Exposure is an image object that actually holds three image planes: the science image (Exposure.image), an image of the variance in every pixel (Exposure.variance), and an integer bit mask (Exposure.mask). It also holds a lot of more complex objects that characterize the image, such as a point-spread function (lsst.afw.detection.Psf) and world-coordinate system (lsst.afw.image.Wcs). Most of these objects aren't filled in yet, because all we've run so far is ISR. It doesn't generally make sense to perform mathematical operations (i.e. addition) on Exposures, because those operations aren't always well-defined on the more complex objects. You can get a MaskedImage object with the same image, mask, and variance planes that does support mathematical operations but doesn't contain Psfs and Wcss (etc.) with Exposure.maskedImage.
The Exposure.image and Exposure.variance properties return lsst.afw.image.Image objects. These have a .array property that returns a numpy.ndarray view to the Image's pixels. Conceptually, you should think of an Image as just a numpy.ndarray with a possibly nonzero origin.
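We can check the view semantics directly. This is a small sketch that temporarily modifies one pixel of the exposure we loaded above and then undoes the change:
In [ ]:
arr = exposure.image.array
arr[0, 0] += 1.0
print(exposure.image.array[0, 0])  # the Exposure's pixels changed too: .array is a view
arr[0, 0] -= 1.0                   # undo the change
print(exposure.image.getXY0())     # the origin, which is (0, 0) for a full CCD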
The Exposure.mask property returns a lsst.afw.image.Mask object, which behaves like an Image with a dictionary-like object that relates string labels to bit numbers.
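We can peek at those plane definitions directly (a quick sketch; the exact set of planes depends on the processing that has been run so far):
In [ ]:
mask = exposure.mask
print(mask.getMaskPlaneDict())      # dictionary-like mapping from plane name to bit number
print(mask.getPlaneBitMask("SAT"))  # the integer bit mask for a single plane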
All of these image-like objects have a getBBox() method, which returns a lsst.afw.geom.Box2I. The minimum and maximum points of a Box2I are specified in integers that correspond to the centers of the lower-left and upper-right pixels in the box, but the box conceptually contains the entirety of those pixels. To get a box with a floating-point representation of the same boundary for the extent argument to imshow below, we construct a Box2D from the Box2I.
Point2I and Extent2I are used to represent absolute positions and offsets between positions as integers (respectively). These have floating-point counterparts, Point2D and Extent2D.
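Here's a small, standalone illustration of these geometry classes (the values are arbitrary):
In [ ]:
from lsst.afw.geom import Box2D, Box2I, Point2I, Extent2I
box = Box2I(Point2I(2, 3), Extent2I(4, 5))  # integer box from minimum corner + dimensions
print(box.getMin(), box.getMax(), box.getDimensions())
print(Box2D(box))                      # floating-point box covering the same pixels
print(Point2I(2, 3) + Extent2I(1, 1))  # Point + Extent yields a new Point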
In [ ]:
from lsst.afw.geom import Box2D, Box2I, Point2I, Extent2I
from lsst.afw.image import Exposure
In [ ]:
# Execute this cell (and the one below) to re-load the post-ISR Exposure from disk after
# modifying it.
exposure = butler.get("postISRCCD", visit=903334, ccd=16)
In [ ]:
bbox = exposure.getBBox()
bbox.grow(-bbox.getDimensions()//3) # box containing the central third (in each dimension)
bbox.grow(-Extent2I(0, 400)) # make it a bit smaller in y
# exposure[bbox] would also work here because exposure.getXY0() == (0, 0),
# but it's dangerous in general because it ignores that origin.
sub = Exposure(exposure, bbox=bbox, dtype=exposure.dtype, deep=False)
In [ ]:
import numpy as np
import matplotlib
%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (8, 6)
matplotlib.rcParams["font.size"] = 12
In [ ]:
def display(image, mask=None, colors=None, alpha=0.40, **kwds):
    box = Box2D(image.getBBox())
    extent = (box.getMinX(), box.getMaxX(), box.getMinY(), box.getMaxY())
    kwds.setdefault("extent", extent)
    kwds.setdefault("origin", "lower")
    kwds.setdefault("interpolation", "nearest")
    matplotlib.pyplot.imshow(image.array, **kwds)
    # Scaling arguments apply only to the science image, not the mask overlays.
    kwds.pop("vmin", None)
    kwds.pop("vmax", None)
    kwds.pop("norm", None)
    kwds.pop("cmap", None)
    if mask is not None:
        # colors maps mask plane names (e.g. "CR") to matplotlib color names.
        for plane, color in colors.items():
            array = np.zeros(mask.array.shape + (4,), dtype=float)
            rgba = np.array(matplotlib.colors.hex2color(matplotlib.colors.cnames[color]) + (alpha,),
                            dtype=float)
            np.multiply.outer((mask.array & mask.getPlaneBitMask(plane)).astype(bool), rgba, out=array)
            matplotlib.pyplot.imshow(array, **kwds)
And now here's the cutout of the detrended image. I've cheated in setting the scale by looking at the background level in advance.
In [ ]:
display(sub.image, vmin=175, vmax=300, cmap=matplotlib.cm.gray)
The next step we usually take is to estimate and subtract the background, using lsst.meas.algorithms.SubtractBackgroundTask. This is a regular Task, not a CmdLineTask, and hence we'll just pass it our Exposure object (it operates in-place) instead of a Butler.
In [ ]:
from lsst.meas.algorithms import SubtractBackgroundTask
In [ ]:
bkgConfig = SubtractBackgroundTask.ConfigClass()
In [ ]:
# Execute this cell to get fun & terrible results!
bkgConfig.useApprox = False
bkgConfig.binSize = 20
The pattern for configuration here is the same as it was for SubaruIsrTask, but here we're setting values directly instead of loading a configuration file from the obs_subaru camera-specialization package. The config object here is an instance of a class that inherits from lsst.pex.config.Config and contains a set of lsst.pex.config.Field objects that define the options that can be modified. Each Field behaves more or less like a Python property, and you can get information on all of the fields in a config object either by using help or by using IPython's ? syntax on an individual field:
In [ ]:
help(bkgConfig)
In [ ]:
SubtractBackgroundTask.ConfigClass.algorithm?
In [ ]:
bkgTask = SubtractBackgroundTask(config=bkgConfig)
In [ ]:
bkgResult = bkgTask.run(exposure)
In [ ]:
display(sub.image, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
If you've run through all of these steps after executing the cell that warns about terrible results, you should notice that the galaxy in the upper right has been oversubtracted.
Exercise: Before continuing on, re-load the exposure from disk, reset the configuration and Task instances, and re-run without executing the cell that applies bad values to the config, all by just re-executing the right cells above. You should end up with an image in which the upper-right galaxy looks essentially the same as it does in the image before we subtracted the background.
Most later processing steps require a PSF model, which is represented by a Psf object that's attached to the Exposure. For now, we'll just make a Gaussian PSF with some guess at the seeing.
In [ ]:
from lsst.meas.algorithms import SingleGaussianPsf
In [ ]:
FWHM_TO_SIGMA = 1.0/(2*np.sqrt(2*np.log(2)))
PIXEL_SCALE = 0.168 # arcsec/pixel
SEEING = 0.7 # FWHM in arcsec
sigma = FWHM_TO_SIGMA*SEEING/PIXEL_SCALE
width = int(sigma*3)*2 + 1
psf = SingleGaussianPsf(width, width, sigma=sigma)
exposure.setPsf(psf)
A Psf object can basically just do one thing: it can return an image of itself at a point. SingleGaussianPsf represents a constant PSF, so it always returns the same image, regardless of the point you give it.
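We can verify that constancy directly (a quick check; the two positions are arbitrary):
In [ ]:
from lsst.afw.geom import Point2D
img1 = psf.computeKernelImage(Point2D(0.0, 0.0))
img2 = psf.computeKernelImage(Point2D(1000.0, 1000.0))
print(np.allclose(img1.array, img2.array))  # True: the PSF image is the same everywhere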
But there are two ways to evaluate a Psf at a point. If you want an image centered on the middle pixel, with that middle pixel as the origin - what you'd usually want if you're going to convolve the PSF with another model - use computeKernelImage(x, y):
In [ ]:
from lsst.afw.geom import Point2D
display(psf.computeKernelImage(Point2D(60.5, 7.2)))
If you want to compare the PSF to a star at the exact same position, use computeImage(x, y). That will shift the image returned by computeKernelImage(x, y) to the right sub-pixel offset, and update the origin of the image to take care of the rest, so you end up with a postage stamp in the same coordinate system as the original image where the star is.
In [ ]:
display(psf.computeImage(Point2D(60.5, 7.2)))
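The difference between the two methods shows up in the image origins. Here's a quick comparison at the same (arbitrary) point used above:
In [ ]:
kernelImage = psf.computeKernelImage(Point2D(60.5, 7.2))
starImage = psf.computeImage(Point2D(60.5, 7.2))
print(kernelImage.getXY0())  # origin chosen so the PSF is centered on pixel (0, 0)
print(starImage.getXY0())    # origin chosen so the stamp overlays the star's position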
Cosmic rays are detected and interpolated by RepairTask, which also sets mask planes to indicate where the cosmic rays were ("CR") and which pixels were interpolated ("INTERP"; this may happen due to saturation or bad pixels as well). Because we're just using the default configuration, we can skip creating a config object and just construct the Task with no arguments.
In [ ]:
from lsst.pipe.tasks.repair import RepairTask
In [ ]:
repairTask = RepairTask()
In [ ]:
repairTask.run(exposure)
In [ ]:
display(sub.image, mask=sub.mask, colors={"CR": "red"},
vmin=-0.5, vmax=100, alpha=0.8, cmap=matplotlib.cm.gray)
Unlike the other Tasks we've dealt with so far, SourceDetectionTask creates a SourceCatalog in addition to updating the image (all it does to the image is add a "DETECTED" mask plane). All Tasks that work with catalogs need to be initialized with an lsst.afw.table.Schema object, to which the Task will add the fields necessary to store its outputs. A SourceCatalog's Schema cannot be modified after the SourceCatalog has been constructed, which means it's necessary to construct all Schema-using Tasks before actually running any of them.
Each record in the catalog returned by SourceDetectionTask has a Footprint object attached to it. A Footprint represents the approximate region covered by a source in a run-length-encoded data structure. It also contains a list of peaks found within that region. The "DETECTED" mask plane is set to exactly the pixels covered by any Footprint in the returned catalog.
In [ ]:
from lsst.meas.algorithms import SourceDetectionTask
from lsst.afw.table import SourceTable, SourceCatalog
In [ ]:
schema = SourceTable.makeMinimalSchema()
In [ ]:
detectTask = SourceDetectionTask(schema=schema)
In [ ]:
# A SourceTable is really just a factory object for records; don't confuse it with SourceCatalog, which is
# usually what you want. But a SourceTable *is* what SourceDetectionTask wants here.
table = SourceTable.make(schema)
In [ ]:
detectResult = detectTask.run(table, exposure)
In [ ]:
display(sub.image, mask=sub.mask, colors={"DETECTED": "blue"}, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
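Each detected source now carries a Footprint we can inspect directly. Here's a quick look at the first one (a minimal sketch; the record index is arbitrary):
In [ ]:
fp = detectResult.sources[0].getFootprint()
print(fp.getBBox())        # bounding box of the detected region
print(fp.getArea())        # number of pixels actually covered by the Footprint
print(len(fp.getPeaks()))  # number of peaks found within the region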
Deblending attempts to separate detections with multiple peaks into separate objects. We keep all of the original sources in the SourceCatalog (called parents) when we deblend, but for each parent source that contains more than one peak, we create a new record (called a child) for each of those peaks. The Footprints attached to the child objects are instances of a subclass called HeavyFootprint, which include new deblended pixel values as well as the region description. These can be used by calling insert to replace an Image's pixels with the HeavyFootprint's pixels.
EXERCISE: This section will not run if the cells are executed naively in order. At some point you'll have to go re-execute one or more cells in the previous section to get the right behavior. Which one(s)? Why? Copy those cells here (in the right places) when you figure it out.
In [ ]:
from lsst.meas.deblender import SourceDeblendTask
In [ ]:
deblendTask = SourceDeblendTask(schema=schema)
In [ ]:
catalog = detectResult.sources
In [ ]:
deblendTask.run(exposure, catalog)
To inspect some deblender outputs, we'll start by finding some parent objects that were deblended into multiple children by looking at the deblend_nChild field (which was added to the Schema when we constructed the SourceDeblendTask, and populated when we called run).
In [ ]:
# Find some blended sources inside the subimage:
blendParents = []
for record in catalog:
if record.get("deblend_nChild") > 0 and bbox.contains(record.getFootprint().getBBox()):
blendParents.append(record)
# Sort by peak brightness so we can look at something with decent S/N
blendParents.sort(key=lambda r: -r.getFootprint().getPeaks()[0].getPeakValue())
In [ ]:
from lsst.afw.image import Image
The image of the parent object is just the original image, but we'll cut out just the region inside its Footprint:
In [ ]:
blendParentImage = Image(exposure.image, bbox=blendParents[0].getFootprint().getBBox(),
deep=True, dtype=np.float32)
Now we'll insert the deblended child pixels into blank images of the same size:
In [ ]:
blendChildImages = []
for blendChild in catalog.getChildren(blendParents[0].getId()):
image = Image(blendParentImage.getBBox(), dtype=np.float32)
blendChild.getFootprint().insert(image)
blendChildImages.append(image)
In [ ]:
nSubPlots = len(blendChildImages) + 1
nCols = 3
nRows = nSubPlots//nCols + 1
matplotlib.pyplot.subplot(nRows, nCols, 1)
display(blendParentImage, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
for n, image in enumerate(blendChildImages):
matplotlib.pyplot.subplot(nRows, nCols, n + 2)
display(image, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
SingleFrameMeasurementTask is typically responsible for adding most fields to a SourceCatalog. It runs a series of plugins that make different measurements (you can configure them with the .plugins dictionary-like field on its config object, and control which are run with .names). If the deblender has been run first, it will measure child objects using their deblended pixels.
EXERCISE: Like the Deblending section, you'll have to re-execute some previous cells somewhere in this section to get the right behavior. Copy those cells into the right places here once you've gotten it working.
In [ ]:
from lsst.meas.base import SingleFrameMeasurementTask
In [ ]:
measureConfig = SingleFrameMeasurementTask.ConfigClass()
In [ ]:
# What measurements are configured to run
print(measureConfig.plugins.names)
In [ ]:
# Import an extension module that adds a new measurement
import lsst.meas.extensions.photometryKron
In [ ]:
# What measurements *could* be configured to run
print(list(measureConfig.plugins.keys()))
In [ ]:
# Configure the new measurement to run
measureConfig.plugins.names.add("ext_photometryKron_KronFlux")
In [ ]:
measureTask = SingleFrameMeasurementTask(schema=schema, config=measureConfig)
In [ ]:
measureTask.run(catalog, exposure)
We'll show some of the results of measurement by overlaying the measured ellipses on the image.

The shapes and centroids we use here (by calling record.getX(), record.getY(), record.getShape()) are aliases (called "slots") to fields with longer names that are our recommended measurements for these quantities. You can see the set of aliases by printing the schema (see next section).
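You can also query the slot aliases programmatically. A small sketch, assuming AliasMap.get returns the field-name prefix a slot points at:
In [ ]:
aliases = catalog.getSchema().getAliasMap()
print(aliases.get("slot_Centroid"))  # the fields behind record.getX()/getY()
print(aliases.get("slot_Shape"))     # the field behind record.getShape()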
In [ ]:
from lsst.afw.geom.ellipses import Axes
In [ ]:
display(sub.image, mask=sub.mask, colors={"DETECTED": "blue"}, vmin=-0.5, vmax=100, cmap=matplotlib.cm.gray)
for record in catalog:
if record.get("deblend_nChild") != 0:
continue
axes = Axes(record.getShape()) # convert to A, B, THETA parameterization
axes.scale(2.0) # matplotlib uses diameters, not radii
patch = matplotlib.patches.Ellipse((record.getX(), record.getY()),
axes.getA(), axes.getB(), axes.getTheta() * 180.0 / np.pi,
fill=False, edgecolor="green")
matplotlib.pyplot.gca().add_patch(patch)
matplotlib.pyplot.show()
Print the schema:
In [ ]:
print(catalog.getSchema())
Get arrays of columns (requires the catalog to be contiguous in memory, which we can guarantee with a deep copy):
In [ ]:
catalog = catalog.copy(deep=True)
In [ ]:
psfFlux = catalog["base_PsfFlux_flux"]
Note that boolean values are stored in Flag columns, which are packed into bits. Unlike other column types, when you get an array of a Flag column, you get a copy, not a view.
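For example (assuming base_PsfFlux_flag is present, which it should be after the measurement step above):
In [ ]:
flags = catalog["base_PsfFlux_flag"]  # bits unpacked into a fresh bool array
print(flags.dtype, flags.sum())
flags[0] = True  # modifies only the copy, not the catalog itself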
Use Key objects instead of strings to do fast repeated access to fields when iterating over records:
In [ ]:
key = catalog.getSchema().find("deblend_nChild").key
deblended = [record for record in catalog if record.get(key) == 0]
You can also get a dict version of a subset of a Schema, a Catalog, or a Record by calling their extract methods with a glob:
In [ ]:
catalog[0].extract("base_PsfFlux_*") # or regex='...'
For Records, the dict values are just the values of the fields, and for Catalogs, they're numpy.ndarray columns. For Schemas, they're SchemaItems, which behave like a named tuple containing a Key and a Field, which contains more descriptive information.
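For example, calling extract on the catalog and on its schema with the same glob used above:
In [ ]:
columns = catalog.extract("base_PsfFlux_*")  # dict of str -> numpy.ndarray
print(list(columns.keys()))
items = catalog.getSchema().extract("base_PsfFlux_*")  # dict of str -> SchemaItem
print(items["base_PsfFlux_flux"].field.getDoc())       # the Field carries documentation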
Get an Astropy view of the catalog (from which you can make a Pandas view):
In [ ]:
table = catalog.asAstropy()
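And from the Astropy view, a pandas DataFrame. A minimal sketch: to_pandas() requires all columns to be one-dimensional, so any array-valued columns would need to be dropped first:
In [ ]:
df = table.to_pandas()
df.head()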
You can find some reference documentation for the catalog library here.
Make a scatter plot of Kron Flux vs. Psf Flux.
Write a single function that performs all of the above steps on a post-ISR Exposure object, modifying the Exposure in-place and returning a new SourceCatalog with a complete set of measurements.
Add PSF modeling to the end of that function, delegating most of the work to lsst.pipe.tasks.MeasurePsfTask. You may want to use a higher threshold (e.g. 50-sigma) for detection, since PSF modeling should only use bright stars.
Make images of the PSF stars, the PSF model at the position of those stars, and the difference between them.
Add another detect-deblend-measure sequence after PSF modeling at a deeper threshold.
Rewrite the function as a class that constructs all of the Tasks that it uses in __init__ and processes a single Exposure when you call its run method. Make sure it will behave properly if run is called multiple times with different Exposure objects.