The input data sources are:

- radio_subjects.json
- radio_classifications.json
- ATLASDR3_cmpcat_23July2015.dat
- cdfs & elais
- SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl
- cdfs & elais
Paths to these should be specified in crowdastro.json. Radio Galaxy Zoo data should be imported into a database in MongoDB, specified by radio_galaxy_zoo_db in crowdastro.json. Following is an example crowdastro.json:
{
    "data_sources": {
        "atlas_catalogue": "data/ATLASDR3_cmpcat_23July2015.dat",
        "cdfs_fits": "data/cdfs",
        "elais_s1_fits": "data/elais",
        "radio_galaxy_zoo_db": "radio",
        "swire_catalogue": "data/SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl"
    },
    "mongo": {
        "host": "localhost",
        "port": 27017
    }
}
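As a rough sketch of how this configuration might be consumed (the collection name radio_subjects and the helper code here are illustrative assumptions, not part of crowdastro):

import json

import pymongo

# Load the crowdastro.json configuration.
with open('crowdastro.json') as config_file:
    config = json.load(config_file)

# Connect to the MongoDB instance holding the imported Radio Galaxy Zoo data.
client = pymongo.MongoClient(config['mongo']['host'], config['mongo']['port'])
db = client[config['data_sources']['radio_galaxy_zoo_db']]

# e.g. fetch one radio subject document (collection name assumed).
subject = db.radio_subjects.find_one()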
The input data is converted into the output data by this pipeline. There are two output files: crowdastro.h5 and crowdastro.csv.

The .h5 file contains numeric data, including all FITS images (both radio and infrared), classifications, and subject metadata. The structure is as follows (the leaf entries are datasets):
/
  atlas
    cdfs
      images_2x2
      images_5x5
      classification_positions
      classification_combinations
      positions
      training_indices
      validation_indices
      testing_indices
    elais-s1
      images_2x2
      images_5x5
      classification_positions
      classification_combinations
      positions
      training_indices
      validation_indices
      testing_indices
  swire
    cdfs
      images_2x2
      images_5x5
      catalogue
      training_indices
      validation_indices
      testing_indices
    elais-s1
      images_2x2
      images_5x5
      catalogue
      training_indices
      validation_indices
      testing_indices
I'm only using ATLAS and SWIRE for now, but this is easily generalised to FIRST and WISE or EMU and MIGHTEE.
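For example, the HDF5 output can be inspected with h5py, following the structure above (the shapes printed here are illustrative):

import h5py

with h5py.File('crowdastro.h5', 'r') as f:
    # Top-level groups: 'atlas' and 'swire'.
    print(list(f.keys()))

    # The 2' x 2' radio images of the CDFS components.
    images = f['/atlas/cdfs/images_2x2']
    print(images.shape)  # e.g. (n_components, 201, 201)

    # Component positions in decimal degrees (RA, DEC).
    print(f['/atlas/cdfs/positions'][0])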
training_indices, testing_indices, and validation_indices are non-overlapping subsets of the indices of the relevant dataset, used to consistently partition it into training/testing/validation sets. The partitioning ratios are specified in crowdastro.json as test_size and validation_size. Note that test_size is used first to split off the testing set, and validation_size is then used to partition the remaining data.
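A minimal sketch of this two-step split, assuming test_size and validation_size are fractions and sit at the top level of crowdastro.json (scikit-learn is used purely for illustration; crowdastro may implement this differently):

import json

import numpy as np
from sklearn.model_selection import train_test_split

with open('crowdastro.json') as config_file:
    config = json.load(config_file)

indices = np.arange(2460)  # e.g. the CDFS components

# First split off the testing set...
remaining_indices, testing_indices = train_test_split(
    indices, test_size=config['test_size'], random_state=0)

# ...then split the remainder into training and validation sets.
training_indices, validation_indices = train_test_split(
    remaining_indices, test_size=config['validation_size'], random_state=0)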
The .csv file contains textual data such as Zooniverse IDs. It is pretty much a lookup table; the reason for using CSV instead of HDF5 tables is partly for human readability and partly because dealing with textual data in HDF5 is unpleasant. The columns (examples parenthesised) are:

- index (1)
- survey (atlas)
- field (cdfs)
- zooniverse_id (ARG0003r18)
- name (ATLAS3_J033403.6-282423C)

The ATLAS dataset consists of images of the radio sky in two fields, CDFS and ELAIS-S1, as well as a catalogue of objects in these fields. Unfortunately, the identifiers of the images and the identifiers in the catalogue are different, so we cannot rely on them matching (in fact, they don't match at all). The ATLAS catalogue identifiers correspond to nothing I've found so far, but the CDFS/ELAIS-S1 identifiers correspond to the image names, the Galaxy Zoo: Radio Talk survey IDs, and the radio subjects' source field in the Radio Galaxy Zoo dataset. There's a catalogue associated with the images, but it doesn't include the ATLAS names.
Each object in the dataset is called a "radio component" and the $2' \times 2'$ patch of radio sky centred on each component is called a "radio subject". There are $2460$ CDFS components and $1935$ ELAIS-S1 components — but the ELAIS-S1 components do not appear in Radio Galaxy Zoo at all.
The origin of each image is in the top left corner. Each image is $201$ pixels in width and height.
The RA/DEC of each component is stored in the Radio Galaxy Zoo dataset.
Components are sorted by Zooniverse ID in all output data. The images are stored in /atlas/{cdfs,elais-s1}/images_{2x2,5x5}, and the RA/DEC of the components are stored in /atlas/{cdfs,elais-s1}/positions in decimal degrees. The names and Zooniverse IDs of each component are stored directly in the CSV along with the index they have in the HDF5 file.
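For instance, to look up one component by its Zooniverse ID and pull its position and image out of the HDF5 file (column and dataset names as above; the example ID is the one from the CSV column listing):

import h5py
import pandas as pd

lookup = pd.read_csv('crowdastro.csv')

# Find the row for a given Zooniverse ID in the ATLAS/CDFS subset.
row = lookup[(lookup['survey'] == 'atlas') &
             (lookup['field'] == 'cdfs') &
             (lookup['zooniverse_id'] == 'ARG0003r18')].iloc[0]

# The CSV index is the index into the corresponding HDF5 datasets.
with h5py.File('crowdastro.h5', 'r') as f:
    ra, dec = f['/atlas/cdfs/positions'][row['index']]
    image = f['/atlas/cdfs/images_2x2'][row['index']]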
ATLAS components are only imported if they have a corresponding Zooniverse ID (i.e., if they are in the Radio Galaxy Zoo dataset), and if they are within a threshold radius from a Zooniverse object. The threshold is set as radio_location_threshold in crowdastro.json.
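A sketch of that distance check using astropy; the coordinate values are placeholders and the threshold is assumed to be in degrees (the actual unit depends on how radio_location_threshold is defined in crowdastro.json):

import astropy.units as u
from astropy.coordinates import SkyCoord

# Placeholder values; in the pipeline these come from the ATLAS catalogue,
# the matched Radio Galaxy Zoo subject, and crowdastro.json respectively.
component = SkyCoord(ra=53.515 * u.deg, dec=-28.406 * u.deg)
subject = SkyCoord(ra=53.516 * u.deg, dec=-28.407 * u.deg)
radio_location_threshold = (1 / 60) * u.deg  # assumed to be in degrees

# Only import the component if it lies within the threshold radius of the subject.
within_threshold = component.separation(subject) <= radio_location_threshold
print(within_threshold)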
The SWIRE dataset consists of images of the infrared sky in two fields, CDFS and ELAIS-S1, as well as a catalogue of objects in these fields. SWIRE objects have no associated Zooniverse object, so we're just going to import them all.
There are images from SWIRE, but these are actually associated with ATLAS objects, so they are imported alongside the ATLAS data. We thus only have to import the catalogue into /swire/{cdfs,elais-s1}/catalogue and into the CSV. All coordinates are RA/DEC in decimal degrees. Objects are sorted by SWIRE name, and the index in the CSV corresponds to the index in the HDF5 catalogue.
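A rough sketch of that import, assuming the SWIRE .tbl file is an IPAC-format table with name/ra/dec columns (the real column names, and the other catalogue columns written to HDF5, may differ):

from astropy.io import ascii
import h5py
import numpy as np

# Read the SWIRE catalogue and sort it by SWIRE name.
swire = ascii.read('data/SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl', format='ipac')
swire.sort('name')  # assumed column name

with h5py.File('crowdastro.h5', 'a') as f:
    # Only the coordinates (decimal degrees) are written here for brevity;
    # the real catalogue dataset would carry the other numeric columns too.
    positions = np.stack([swire['ra'], swire['dec']], axis=1)
    f.create_dataset('/swire/cdfs/catalogue', data=positions)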
Radio Galaxy Zoo (RGZ) classifications consist of a number of annotations of a subject. We only care about the combination of radio contours selected by a volunteer, and the corresponding pixel locations, as well as the Zooniverse ID associated with the subject.
For each subject in the ATLAS dataset, parse all classifications. All valid classifications are then stored in /atlas/{cdfs,elais-s1}/{classification_positions,classification_combinations}, holding the position of each classification and the corresponding radio combination (along with the corresponding full radio combination) respectively. Both also record the ATLAS index they are associated with, and are sorted by this index.
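The per-classification parsing looks roughly like the sketch below. The annotation schema (the 'radio' and 'ir' keys and the 'No Contours'/'No Sources' sentinels) is assumed from typical Radio Galaxy Zoo exports and may not match crowdastro's actual parsing exactly:

def parse_classification(classification):
    """Extract (radio combination, IR click position) pairs from one RGZ classification."""
    pairs = []
    for annotation in classification.get('annotations', []):
        # Skip metadata annotations that don't describe a radio/IR match.
        if 'radio' not in annotation or 'ir' not in annotation:
            continue

        # The radio combination is the set of radio contours grouped by the volunteer.
        if annotation['radio'] == 'No Contours':
            radio_combination = frozenset()
        else:
            radio_combination = frozenset(annotation['radio'].keys())

        # The IR position is where the volunteer clicked the host galaxy, in pixels.
        if annotation['ir'] == 'No Sources':
            position = None
        else:
            click = annotation['ir']['0']
            position = (float(click['x']), float(click['y']))

        pairs.append((radio_combination, position))

    return pairs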