Data Standardisation Pipeline (version 0.1.0)

In this notebook, I detail the process of importing data into crowdastro.

Input data sources

The input data sources are:

- Radio Galaxy Zoo
  - Subjects (radio_subjects.json)
  - Classifications (radio_classifications.json)
- ATLAS
  - Catalogue (ATLASDR3_cmpcat_23July2015.dat)
  - FITS images of the radio sky (cdfs & elais)
- SWIRE
  - SWIRE Catalogue (SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl)
  - FITS images of the infrared sky (cdfs & elais)

Paths to these should be specified in crowdastro.json. Radio Galaxy Zoo data should be imported into a database in MongoDB, specified by radio_galaxy_zoo_db in crowdastro.json. Following is an example crowdastro.json:

{
    "data_sources": {
        "atlas_catalogue": "data/ATLASDR3_cmpcat_23July2015.dat",
        "cdfs_fits": "data/cdfs",
        "elais_s1_fits": "data/elais",
        "radio_galaxy_zoo_db": "radio",
        "swire_catalogue": "data/SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl"
    },

    "mongo": {
        "host": "localhost",
        "port": 27017
    }
}
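
As a minimal sketch of how such a file might be read and sanity-checked, the snippet below parses the JSON and confirms that the data source keys from the example above are present. The function names parse_config and load_config and the REQUIRED_SOURCES tuple are hypothetical, introduced here for illustration; they are not part of crowdastro itself.

```python
import json

# Keys taken from the example crowdastro.json above.
REQUIRED_SOURCES = (
    "atlas_catalogue",
    "cdfs_fits",
    "elais_s1_fits",
    "radio_galaxy_zoo_db",
    "swire_catalogue",
)


def parse_config(text):
    """Parse config text and check that every expected data source is listed.

    Hypothetical helper: raises KeyError naming any missing entries.
    """
    config = json.loads(text)
    sources = config.get("data_sources", {})
    missing = [key for key in REQUIRED_SOURCES if key not in sources]
    if missing:
        raise KeyError("missing data_sources entries: " + ", ".join(missing))
    return config


def load_config(path="crowdastro.json"):
    """Read the config file at `path` and validate it with parse_config."""
    with open(path) as f:
        return parse_config(f.read())
```

Failing early on a missing key gives a clearer error than a lookup failing halfway through the import.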

Output data format

This pipeline converts the input data into two output files: crowdastro.h5 and crowdastro.csv.

The HDF5 file (crowdastro.h5) contains numeric data, including all FITS images (both radio and infrared), classifications, and subject metadata. The structure is as follows; the innermost entries are datasets and the entries above them are groups:

/
- atlas
  - cdfs
    - images_2x2
    - images_5x5
    - classification_positions
    - classification_combinations
    - positions
    - training_indices
    - validation_indices
    - testing_indices
  - elais-s1
    - images_2x2
    - images_5x5
    - classification_positions
    - classification_combinations
    - positions
    - training_indices
    - validation_indices
    - testing_indices
- swire
  - cdfs
    - images_2x2
    - images_5x5
    - catalogue
    - training_indices
    - validation_indices
    - testing_indices
  - elais-s1
    - images_2x2
    - images_5x5
    - catalogue
    - training_indices
    - validation_indices
    - testing_indices

I'm only using ATLAS and SWIRE for now, but this is easily generalised to FIRST and WISE or EMU and MIGHTEE.
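
To make the layout above concrete, here is a small sketch that enumerates the dataset paths a complete output file would be expected to contain. The dataset and field names are copied from the tree; the function expected_dataset_paths and the constants are hypothetical helpers, not part of the pipeline.

```python
# Dataset names copied from the structure listed above.
ATLAS_DATASETS = [
    "images_2x2",
    "images_5x5",
    "classification_positions",
    "classification_combinations",
    "positions",
    "training_indices",
    "validation_indices",
    "testing_indices",
]
SWIRE_DATASETS = [
    "images_2x2",
    "images_5x5",
    "catalogue",
    "training_indices",
    "validation_indices",
    "testing_indices",
]
FIELDS = ["cdfs", "elais-s1"]


def expected_dataset_paths():
    """Return every /survey/field/dataset path in the documented layout."""
    layout = {"atlas": ATLAS_DATASETS, "swire": SWIRE_DATASETS}
    return [
        "/{}/{}/{}".format(survey, field, dataset)
        for survey, datasets in layout.items()
        for field in FIELDS
        for dataset in datasets
    ]
```

Supporting another survey pair would then amount to adding entries to the layout dictionary. With an HDF5 library such as h5py, each path could be checked against an opened file to confirm the import completed.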

training_indices, testing_indices, and validation_indices are non-overlapping subsets of the indices of the relevant dataset. They are used to partition the data consistently into training, testing, and validation sets. The partitioning ratios are specified in crowdastro.json as test_size and validation_size. Note that test_size is applied first to split off the testing set, and validation_size is then applied to the remaining data.
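
The two-stage split can be sketched as follows. This assumes test_size and validation_size are fractions between 0 and 1, which the text does not state explicitly; the function name partition_indices, the seed parameter, and the rounding rule are all assumptions made for illustration.

```python
import random


def partition_indices(n, test_size, validation_size, seed=0):
    """Split range(n) into disjoint training/validation/testing index lists.

    test_size is the fraction of all indices used for testing;
    validation_size is the fraction of the *remaining* indices used for
    validation. Both are assumed to lie in [0, 1].
    """
    indices = list(range(n))
    # A fixed seed keeps the partition reproducible between runs.
    random.Random(seed).shuffle(indices)

    n_test = int(round(n * test_size))
    testing = indices[:n_test]
    remaining = indices[n_test:]

    n_validation = int(round(len(remaining) * validation_size))
    validation = remaining[:n_validation]
    training = remaining[n_validation:]

    return sorted(training), sorted(validation), sorted(testing)
```

For example, with 100 objects, test_size 0.2, and validation_size 0.25, the testing set takes 20 indices, the validation set takes a quarter of the remaining 80 (20 indices), and 60 are left for training.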

The CSV file (crowdastro.csv) contains textual data such as Zooniverse IDs. It is essentially a lookup table; the reason for using CSV instead of HDF5 tables is partly human readability and partly that dealing with textual data in HDF5 is unpleasant. The columns (examples parenthesised) are:

index (1)
survey (atlas)
field (cdfs)
zooniverse_id (ARG0003r18)
name (ATLAS3_J033403.6-282423C)
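
A rough sketch of how this lookup table might be used: the snippet below maps each (survey, field, name) triple to its index in the HDF5 file. It assumes the CSV begins with a header row naming the five columns above, which is not confirmed here; build_lookup is a hypothetical helper.

```python
import csv
import io


def build_lookup(csv_file):
    """Map (survey, field, name) to the integer index stored in the CSV.

    Assumes a header row with the columns index, survey, field,
    zooniverse_id, and name.
    """
    lookup = {}
    for row in csv.DictReader(csv_file):
        key = (row["survey"], row["field"], row["name"])
        lookup[key] = int(row["index"])
    return lookup


# Example using the sample values from the column list above.
sample = io.StringIO(
    "index,survey,field,zooniverse_id,name\n"
    "1,atlas,cdfs,ARG0003r18,ATLAS3_J033403.6-282423C\n"
)
lookup = build_lookup(sample)
```

The key uses the name rather than the Zooniverse ID because SWIRE objects have no associated Zooniverse object.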

Importing ATLAS data

The ATLAS dataset consists of images of the radio sky in two fields, CDFS and ELAIS-S1, as well as a catalogue of objects in these fields. Unfortunately, the identifiers of the images and the identifiers in the catalogue are different and do not match at all, so we cannot use them to cross-reference the two. The ATLAS catalogue identifiers correspond to nothing I've found so far, but the CDFS/ELAIS-S1 identifiers correspond to the image names, the Galaxy Zoo: Radio Talk survey IDs, and the radio subjects' source field in the Radio Galaxy Zoo dataset. There's a catalogue associated with the images, but it doesn't include the ATLAS names.

Each object in the dataset is called a "radio component" and the $2' \times 2'$ patch of radio sky centred on each component is called a "radio subject". There are $2460$ CDFS components and $1935$ ELAIS-S1 components, but the ELAIS-S1 components do not appear in Radio Galaxy Zoo at all.

The origin of each image is in the top left corner. Each image is $201$ pixels in width and height.

The RA/DEC of each component is stored in the Radio Galaxy Zoo dataset.

Components are sorted by Zooniverse ID in all output data. The images are stored in /atlas/{cdfs,elais-s1}/images. The RA/DEC of the components are stored in /atlas/{cdfs,elais-s1}/positions in decimal degrees. The name and Zooniverse ID of each component are stored directly in the CSV along with the index the component has in the HDF5 file.

ATLAS components are only imported if they have a corresponding Zooniverse ID (i.e., if they are in the Radio Galaxy Zoo dataset) and if they are within a threshold radius of a Zooniverse object. The threshold is set as radio_location_threshold in crowdastro.json.
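
The distance test can be sketched with a standard haversine separation on the sky. This is an illustration only: it assumes positions are (RA, DEC) pairs in decimal degrees and that the threshold is also expressed in degrees, neither of which is confirmed for radio_location_threshold. The function names are hypothetical.

```python
import math


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, DEC) positions.

    All inputs are in decimal degrees; uses the haversine formula.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def within_threshold(component, zooniverse_object, threshold):
    """True if two (ra, dec) tuples lie within `threshold` degrees."""
    return angular_separation(*component, *zooniverse_object) <= threshold
```

The haversine form is used because it stays numerically stable for the very small separations expected between a component and its matching subject.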

Importing SWIRE data

The SWIRE dataset consists of images of the infrared sky in two fields, CDFS and ELAIS-S1, as well as a catalogue of objects in these fields. The catalogue objects have no associated Zooniverse object, so we import them all.

There are images from SWIRE, but these are actually associated with ATLAS objects, so they are imported alongside the ATLAS data. We thus only have to import the catalogue into /swire/{cdfs,elais-s1}/catalogue and into the CSV. All coordinates are RA/DEC in decimal degrees. Objects are sorted by SWIRE name. The index in the CSV corresponds to the index in the HDF5 catalogue.

Importing Radio Galaxy Zoo classifications

Radio Galaxy Zoo (RGZ) classifications consist of a number of annotations of a subject. We only care about the combination of radio contours selected by a volunteer, and the corresponding pixel locations, as well as the Zooniverse ID associated with the subject.

For each subject in the ATLAS dataset, all classifications are parsed. All valid classifications are then stored in /atlas/{cdfs,elais-s1}/{classification_positions,classification_combinations}, which hold the position of the classification and the corresponding radio combination (and corresponding full radio combination) respectively. These datasets also contain the ATLAS index each classification is associated with, and are sorted by this index.


