FITS Format + Consensus Calculations

This notebook loads and displays FITS data from the clean samples (see #8 and #7). It also uses the rgz-analysis consensus code to find peaks.

The end goal here is to generate a set of training data for the classification problem.


In [170]:
import os.path
import pprint
import sys

import astropy.io.fits
import matplotlib.colors
import matplotlib.pyplot
import numpy
import pymongo
import requests

sys.path.insert(1, '..')
import crowdastro.rgz_analysis.consensus

%matplotlib inline
matplotlib.pyplot.rcParams['image.cmap'] = 'gray'

HOST = 'localhost'
PORT = 27017
DB_NAME = 'radio'

DATA_PATH = os.path.join('..', 'data')
ATLAS_CATALOGUE_PATH = os.path.join(DATA_PATH, 'ATLASDR3_cmpcat_23July2015.dat')
TILE_SIZE = '2x2'

FITS_IMAGE_WIDTH = 200
FITS_IMAGE_HEIGHT = 200

CLICK_IMAGE_WIDTH = 500
CLICK_IMAGE_HEIGHT = 500

CLICK_TO_FITS_X = FITS_IMAGE_WIDTH / CLICK_IMAGE_WIDTH
CLICK_TO_FITS_Y = FITS_IMAGE_HEIGHT / CLICK_IMAGE_HEIGHT

# Setup Mongo DB.
client = pymongo.MongoClient(HOST, PORT)
db = client[DB_NAME]

The FITS images have filenames of the form CID_{ir, radio}.fits. The mapping between IAU name and CID (component ID) is in the ATLAS catalogue file. Where in the database do the IAU names appear?


In [25]:
db.radio_subjects.find_one()


Out[25]:
{'_id': ObjectId('52af7d53eb9a9b05ef000001'),
 'activated_at': datetime.datetime(2013, 12, 17, 17, 45, 13, 844000),
 'classification_count': 20,
 'coords': [206.419375, 23.38236111111111],
 'created_at': datetime.datetime(2013, 12, 17, 9, 16, 38, 435000),
 'group': {'_id': ObjectId('55f6f8723ae740d851000002'),
  'name': 'first',
  'zooniverse_id': 'GRG0000002'},
 'group_id': ObjectId('55f6f8723ae740d851000002'),
 'location': {'contours': 'http://radio.galaxyzoo.org/subjects/contours/52af7d53eb9a9b05ef000001.json',
  'radio': 'http://radio.galaxyzoo.org/subjects/radio/52af7d53eb9a9b05ef000001.jpg',
  'standard': 'http://radio.galaxyzoo.org/subjects/standard/52af7d53eb9a9b05ef000001.jpg'},
 'metadata': {'bmaj': '5.40000004694',
  'bmin': '5.40000004694',
  'bpa': '0.0',
  'contour_count': 2,
  'dec_dms': '23.0 22.0 56.5',
  'ra_hms': '13.0 45.0 40.65',
  'rms': '0.000178',
  'source': 'FIRSTJ134540.6+232256',
  'survey': 'first'},
 'project_id': ObjectId('52afdb804d69636532000001'),
 'random': 0.5988090089044151,
 'state': 'complete',
 'updated_at': datetime.datetime(2013, 12, 17, 9, 16, 38, 468000),
 'workflow_ids': [ObjectId('52afdb804d69636532000002')],
 'zooniverse_id': 'ARG000255t'}

Looks like the subject we found is part of the FIRST survey rather than the ATLAS survey. Let's try and find an ATLAS image.


In [26]:
db.radio_subjects.find_one({'metadata.survey': 'atlas'})


Out[26]:
{'_id': ObjectId('54b7f9ee0136916b75000002'),
 'activated_at': datetime.datetime(2015, 9, 15, 17, 1, 22, 115000),
 'classification_count': 20,
 'coords': [51.89172, -28.772626],
 'created_at': datetime.datetime(2015, 1, 15, 19, 35, 24, 105000),
 'group': {'_id': ObjectId('55f6f8723ae740d851000001'),
  'name': 'atlas',
  'zooniverse_id': 'GRG0000001'},
 'group_id': ObjectId('55f6f8723ae740d851000001'),
 'location': {'contours': 'http://radio.galaxyzoo.org/subjects/contours/54b7f9ee0136916b75000002.json',
  'radio': 'http://radio.galaxyzoo.org/subjects/radio/54b7f9ee0136916b75000002.png',
  'standard': 'http://radio.galaxyzoo.org/subjects/standard/54b7f9ee0136916b75000002.png'},
 'metadata': {'contour_count': 2,
  'dec_dms': '-28 46 21.453600',
  'ra_hms': '03 27 34.012800',
  'rms': '0.031',
  'source': 'CI0002',
  'survey': 'atlas'},
 'project_id': ObjectId('52afdb804d69636532000001'),
 'random': 0.1151774575179044,
 'state': 'complete',
 'updated_at': datetime.datetime(2015, 10, 20, 10, 30, 23, 973000),
 'workflow_ids': [ObjectId('52afdb804d69636532000002')],
 'zooniverse_id': 'ARG0003r17'}

It seems that the database itself actually contains the CID for the ATLAS survey (but not for the FIRST survey). Knowing this, we can write a function to load FITS images.


In [211]:
def open_fits(zid, field, wavelength):
    """Opens a FITS image.
    
    zid: Zooniverse ID.
    field: 'elais' or 'cdfs'.
    wavelength: 'ir' or 'radio'.
    -> FITS image file handle.
    """
    if field not in {'elais', 'cdfs'}:
        raise ValueError('field must be either "elais" or "cdfs".')

    if wavelength not in {'ir', 'radio'}:
        raise ValueError('wavelength must be either "ir" or "radio".')
    
    subject = db.radio_subjects.find_one({'zooniverse_id': zid})
    assert subject['metadata']['survey'] == 'atlas', 'Subject not from ATLAS survey.'

    cid = subject['metadata']['source']
    filename = '{}_{}.fits'.format(cid, wavelength)
    path = os.path.join(DATA_PATH, field, TILE_SIZE, filename)
    
    return astropy.io.fits.open(path, ignore_blank=True)

def imshow(im, contrast=0.05):
    """Helper function for showing an image."""
    im = im - im.min() + contrast
    return matplotlib.pyplot.imshow(im,
            origin='lower',
            norm=matplotlib.colors.LogNorm(
                    vmin=im.min(),
                    vmax=im.max(),
            ),
    )

In [212]:
# Let's try it out.
with open_fits('ARG0003r17', 'cdfs', 'ir') as fits_file:
    ir = fits_file[0].data
    imshow(ir)
    matplotlib.pyplot.show()

with open_fits('ARG0003r17', 'cdfs', 'radio') as fits_file:
    radio = fits_file[0].data
    imshow(radio)
    matplotlib.pyplot.show()


Now, let's try and get the plurality click using the consensus module from willettk/rgz-analysis. I've updated this to Python 3, extracted all the consensus-related code, rewritten the main code to use ATLAS instead of FIRST, and assembled it into a module.


In [227]:
cons = crowdastro.rgz_analysis.consensus.consensus('ARG0003r17')
pprint.pprint(cons)


{'answer': {121.757: {'bbox': [('121.7565906430343',
                                '158.85337033885054',
                                '82.07730463613342',
                                '59.63945198384692')],
                      'ind': 0,
                      'ir_peak': (252.0, 251.0),
                      'ir_x': [248.23113207547172,
                               247.05188679245285,
                               248.23113207547172,
                               251.7688679245283,
                               251.7688679245283,
                               244.96477954792527,
                               250.91308017946642,
                               254.12735849056605],
                      'ir_y': [252.35849056603774,
                               245.28301886792454,
                               253.53773584905662,
                               250.00000000000003,
                               252.35849056603774,
                               220.30447114188718,
                               247.49818837867593,
                               254.7169811320755],
                      'peak_data': {'X': array([[   1.,    1.,    1., ...,    1.,    1.,    1.],
       [   2.,    2.,    2., ...,    2.,    2.,    2.],
       [   3.,    3.,    3., ...,    3.,    3.,    3.],
       ..., 
       [ 497.,  497.,  497., ...,  497.,  497.,  497.],
       [ 498.,  498.,  498., ...,  498.,  498.,  498.],
       [ 499.,  499.,  499., ...,  499.,  499.,  499.]]),
                                    'Y': array([[   1.,    2.,    3., ...,  497.,  498.,  499.],
       [   1.,    2.,    3., ...,  497.,  498.,  499.],
       [   1.,    2.,    3., ...,  497.,  498.,  499.],
       ..., 
       [   1.,    2.,    3., ...,  497.,  498.,  499.],
       [   1.,    2.,    3., ...,  497.,  498.,  499.],
       [   1.,    2.,    3., ...,  497.,  498.,  499.]]),
                                    'Z': array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]]),
                                    'npeaks': 2},
                      'xmax': [121.7565906430343]},
            126.439: {'bbox': [('126.4391877329964',
                                '60.26833839421603',
                                '118.96630142128373',
                                '40.69775954610628')],
                      'ind': 1,
                      'ir': (308.372641509434, 130.89622641509436),
                      'xmax': [126.4391877329964]}},
 'n_total': 19,
 'n_users': 8,
 'source': 'CI0002',
 'zid': 'ARG0003r17'}

In [228]:
with open_fits('ARG0003r17', 'cdfs', 'ir') as fits_file:
    ir = fits_file[0].data
    imshow(ir)
    matplotlib.pyplot.scatter([CLICK_TO_FITS_X * numpy.array(cons['answer'][121.757]['ir_x'])],
                              [CLICK_TO_FITS_Y * numpy.array(cons['answer'][121.757]['ir_y'])],
                              c='w', marker='+')
    matplotlib.pyplot.scatter([CLICK_TO_FITS_X * cons['answer'][121.757]['ir_peak'][0]],
                              [CLICK_TO_FITS_Y * cons['answer'][121.757]['ir_peak'][1]],
                              c='r', marker='x')
    matplotlib.pyplot.scatter([CLICK_TO_FITS_X * cons['answer'][126.439]['ir'][0]],
                              [CLICK_TO_FITS_Y * cons['answer'][126.439]['ir'][1]],
                              c='g', marker='x')
    matplotlib.pyplot.show()


There are two peaks in the output. I've plotted them in red and green. I think red is the plurality peak; I have no idea what the other peak is. I think that the white marks represent places people clicked.

Let's try doing this again on a few more subjects.

Note: For some reason, the contours and scatter plots flip to whatever the opposite orientation of the image is. I've manually flipped them back. There should be a better way, but all resources I've found tell me that setting origin = 'lower' on the imshow call should flip the image correctly...


In [220]:
def plot_contours(contours, colour='gray'):
    for row in contours:
        for col in row:
            xs = []
            ys = []
            for pair in col['arr']:
                xs.append(pair['x'])
                ys.append(pair['y'])
            matplotlib.pyplot.plot(xs, FITS_IMAGE_HEIGHT - numpy.array(ys), c=colour)

In [221]:
limit = 10

for subject in db.radio_subjects.find({'metadata.survey': 'atlas'}).limit(limit):
    matplotlib.pyplot.figure(figsize=(15, 15))

    zid = subject['zooniverse_id']
    cons = crowdastro.rgz_analysis.consensus.consensus(zid)
    
    contours = requests.get(subject['location']['contours']).json()['contours']

    print('=' * 40, zid, '=' * 40)

    # Not all answers seem to contain valid peaks.
    n_answers = sum(1 for answer in cons['answer'].values() if 'ir_peak' in answer)
    
    with open_fits(zid, 'cdfs', 'ir') as fits_file:
        ir = fits_file[0].data

    with open_fits(zid, 'cdfs', 'radio') as fits_file:
        radio = fits_file[0].data

    for answer_index, (checksum, answer) in enumerate(cons['answer'].items()):
        if 'ir_peak' not in answer:
            continue

        matplotlib.pyplot.subplot(n_answers, 2, answer_index * 2 + 1)
        matplotlib.pyplot.title('{} (IR): ind = {}'.format(zid, answer['ind']))
        imshow(ir)
        plot_contours(contours, colour='green')
        matplotlib.pyplot.scatter([CLICK_TO_FITS_X * numpy.array(answer['ir_peak'][0])],
                                  [FITS_IMAGE_HEIGHT - CLICK_TO_FITS_Y * numpy.array(answer['ir_peak'][1])],
                                  c='r', marker='o')
        matplotlib.pyplot.subplot(n_answers, 2, answer_index * 2 + 2)
        matplotlib.pyplot.title('{} (Radio): ind = {}'.format(zid, answer['ind']))
        imshow(radio)
        plot_contours(contours, colour='green')
        matplotlib.pyplot.scatter([CLICK_TO_FITS_X * numpy.array(answer['ir_peak'][0])],
                                  [FITS_IMAGE_HEIGHT - CLICK_TO_FITS_Y * numpy.array(answer['ir_peak'][1])],
                                  c='r', marker='o')
    
    matplotlib.pyplot.show()


======================================== ARG0003r17 ========================================
======================================== ARG0003r18 ========================================
======================================== ARG0003r19 ========================================
======================================== ARG0003r1a ========================================
======================================== ARG0003r1b ========================================
======================================== ARG0003r1c ========================================
======================================== ARG0003r1d ========================================
======================================== ARG0003r1f ========================================
<matplotlib.figure.Figure at 0xfa1e3c8>
======================================== ARG0003r1e ========================================
======================================== ARG0003r1g ========================================

The extra indices refer to extra sets of contours. How many objects in the dataset are "simple", i.e., have just one set of contours? I'll just run a counter on a subset.


In [225]:
limit = 1000

n_one = 0

for subject in db.radio_subjects.find({'metadata.survey': 'atlas'}).limit(limit):
    zid = subject['zooniverse_id']
    cons = crowdastro.rgz_analysis.consensus.consensus(zid)
    n_answers = sum(1 for answer in cons['answer'].values() if 'ir_peak' in answer)
    if n_answers == 1:
        n_one += 1

print('Number of subjects with one set of contours:', n_one)


Number of subjects with one set of contours: 308

So about 30%. I think that's a reasonable first attempt at a training data set. I'll try and assemble that in another notebook.


In [ ]: