Labels

We want to use the Radio Galaxy Zoo click data as training data. To do this, we first need to convert the raw click data to a useful label — most likely the $(x, y)$ coordinate of the host galaxy for a given set of radio emissions.

Let's start by analysing what we're looking at in the raw data.


In [79]:
import collections
import operator
from pprint import pprint
import sqlite3
import sys
import warnings

import matplotlib.pyplot
import numpy
import scipy.stats
import sklearn.mixture

%matplotlib inline

sys.path.insert(1, '..')
import crowdastro.data
import crowdastro.show

warnings.simplefilter('ignore', UserWarning)  # astropy always raises warnings on Windows.

Reading raw classification data

We'll grab a random subject (complete, from the ATLAS survey) and look at its classifications.


In [4]:
subject = crowdastro.data.db.radio_subjects.find_one({'metadata.survey': 'atlas', 'state': 'complete',
                                                      'zooniverse_id': 'ARG0003r18'})

In [5]:
crowdastro.show.subject(subject)
matplotlib.pyplot.show()


This is a nice subject to look at because it contains multiple radio sources, which is often the case for Radio Galaxy Zoo subjects. Let's look at a classification.


In [10]:
classification = crowdastro.data.db.radio_classifications.find_one({'subject_ids': subject['_id']})
pprint(classification)


{'_id': ObjectId('54b85550bb7b563218000002'),
 'annotations': [{'ir': {'0': {'x': '215.44094848632812',
                               'y': '211.6597137451172'}},
                  'radio': {'0': {'scale_height': '2.1144278606965172',
                                  'scale_width': '2.1144278606965172',
                                  'xmax': '122.85618152302268',
                                  'xmin': '80.9806809782321',
                                  'ymax': '148.59797461973457',
                                  'ymin': '50.36754313888382'}}},
                 {'ir': {'0': {'x': '402.4409484863281',
                               'y': '48.65971374511719'}},
                  'radio': {'0': {'scale_height': '2.1144278606965172',
                                  'scale_width': '2.1144278606965172',
                                  'xmax': '198.21655859377296',
                                  'xmin': '178.38630228611456',
                                  'ymax': '44.87593150595179',
                                  'ymin': '1'}}},
                 {'finished_at': 'Fri, 16 Jan 2015 01:00:53 GMT',
                  'started_at': 'Fri, 16 Jan 2015 00:59:45 GMT'},
                 {'user_agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                'AppleWebKit/537.36 (KHTML, like Gecko) '
                                'Chrome/39.0.2171.95 Safari/537.36'},
                 {'lang': 'ru'}],
 'created_at': datetime.datetime(2015, 1, 16, 0, 3, 28),
 'project_id': ObjectId('52afdb804d69636532000001'),
 'subject_ids': [ObjectId('54b7f9ee0136916b75000003')],
 'subjects': [{'id': ObjectId('54b7f9ee0136916b75000003'),
               'location': {'contours': 'http://radio.galaxyzoo.org/subjects/contours/54b7f9ee0136916b75000003.json',
                            'radio': 'http://radio.galaxyzoo.org/subjects/radio/54b7f9ee0136916b75000003.png',
                            'standard': 'http://radio.galaxyzoo.org/subjects/standard/54b7f9ee0136916b75000003.png'},
               'zooniverse_id': 'ARG0003r18'}],
 'tutorial': False,
 'updated_at': datetime.datetime(2015, 1, 16, 0, 3, 36, 883000),
 'user_id': ObjectId('52961b669e32605fe3000009'),
 'user_name': 'antikodon',
 'workflow_id': ObjectId('52afdb804d69636532000002')}

We only really care about the annotations, so I've pulled those out. Some of them aren't useful: the language, user agent, and start/end times here, and other classifications may contain other kinds of metadata. The useful ones contain both 'ir' and 'radio' keys, and I'll use "annotations" to refer to these exclusively from now on.

Each annotation contains an IR component and a radio component. The IR component is where people clicked, though the format is a little unusual: it allows for multiple clicks, but I can't come up with a situation in which more than one would arise, so I think we can safely ignore all but the first IR coordinate. The radio component describes which contours are included in this annotation. Volunteers may click on multiple contours that they think were emitted by the same radio source, and these are grouped here. The combination of radio contours is important because it affects what the volunteers believe they are classifying (see Banfield et al. 2015 for more details).

We need to pull out something describing which combination of radio contours were selected, as well as the actual IR coordinate associated with each one.


In [7]:
def make_radio_combination_signature(radio_annotation):
    """Generates a unique signature for a radio annotation.
    
    radio_annotation: 'radio' dictionary from a classification.
    -> Something immutable
    """
    # My choice of immutable object will be a tuple of the xmax values,
    # sorted to ensure determinism, and rounded to nix floating point errors.
    # Note that the x scale is not the same as the IR scale, but the scale factor is
    # included in the annotation, so I have multiplied this out here for consistency.
    # Sometimes, there's no scale, so I've included a default scale.
    xmaxes = [round(float(c['xmax']) * float(c.get('scale_width', '2.1144278606965172')), 14)
              for c in radio_annotation.values()]
    return tuple(sorted(xmaxes))

def read_classification(classification):
    """Converts a raw RGZ classification into radio combinations and IR locations.
    
    classification: RGZ classification dictionary.
    -> dict mapping radio combination signatures to IR locations.
    """
    result = {}

    for annotation in classification['annotations']:
        if 'radio' not in annotation:
            # This is a metadata annotation and we can ignore it.
            continue
        
        radio_signature = make_radio_combination_signature(annotation['radio'])
        
        if annotation['ir'] == 'No Sources':
            ir_location = None
        else:
            ir_location = (float(annotation['ir']['0']['x']), float(annotation['ir']['0']['y']))

        result[radio_signature] = ir_location

    return result

In [8]:
read_classification(classification)


Out[8]:
{(259.77053307106786,): (215.44094848632812, 211.6597137451172),
 (419.1146139420572,): (402.4409484863281, 48.65971374511719)}

Now let's try running that over a few classifications, to see if anything breaks.


In [9]:
for classification in crowdastro.data.db.radio_classifications.find().limit(100):
    print(read_classification(classification))


{}
{}
{}
{}
{(183.4725288692084,): (112.5, 381.0), (335.4714873147084, 374.2027114828553): (213.5, 212.0)}
{}
{}
{(183.4725288692084,): (116.0, 379.0), (335.4714873147084, 374.2027114828553): (216.0, 211.0)}
{(183.4725288692084,): (103.0, 378.0), (335.4714873147084, 374.2027114828553): None}
{(183.4725288692084,): (114.0, 374.0), (335.4714873147084, 374.2027114828553): None}
{(183.4725288692084,): (111.5, 383.0), (335.4714873147084, 374.2027114828553): (213.5, 209.0)}
{}
{}
{}
{}
{(183.4725288692084,): (112.0, 380.0), (335.4714873147084, 374.2027114828553): (214.0, 211.0)}
{(183.4725288692084,): (103.0, 378.0), (335.4714873147084, 374.2027114828553): None}
{(183.4725288692084,): (114.0, 383.0), (335.4714873147084, 374.2027114828553): (214.0, 211.0)}
{(335.4714873147084, 374.2027114828553): (213.0, 211.0)}
{(183.4725288692084,): (112.0, 381.0)}
{(183.4725288692084,): (112.0, 378.0), (335.4714873147084, 374.2027114828553): (212.0, 211.0)}
{(183.4725288692084,): (111.0, 380.0), (335.4714873147084, 374.2027114828553): (212.0, 212.0)}
{(183.4725288692084,): (111.0, 379.0), (335.4714873147084, 374.2027114828553): (212.0, 210.0)}
{(183.4725288692084,): (111.0, 379.0), (335.4714873147084, 374.2027114828553): (212.0, 210.0)}
{}
{(335.4714873147084, 374.2027114828553): (214.0, 214.0)}
{}
{(335.4714873147084, 374.2027114828553): (208.0, 207.0)}
{}
{}
{}
{(183.4725288692084,): (110.0, 382.0), (335.4714873147084, 374.2027114828553): (213.0, 211.0)}
{(183.4725288692084,): (113.5, 379.0), (335.4714873147084, 374.2027114828553): (213.5, 211.0)}
{}
{}
{}
{}
{}
{}
{(183.4725288692084,): (111.0, 379.0), (335.4714873147084, 374.2027114828553): (212.0, 211.0)}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{(183.4725288692084,): (116.0, 390.0), (335.4714873147084,): (212.0, 207.0)}
{(183.4725288692084,): (111.5, 382.0), (335.4714873147084, 374.2027114828553): (217.5, 214.0)}
{}
{}
{}
{(251.56309657864517,): (367.5, 63.0), (190.5251974827036,): (238.5, 211.0)}
{}
{(149.85340527269625,): (210.0, 212.0)}
{(149.90900501614814,): (213.0, 212.0)}
{}
{(145.58968414346245,): None}
{}
{(160.3076353982779, 202.64618440126293): (253.5, 224.0)}
{(147.22731586095435,): (207.5, 214.0)}
{(114.67696788767692, 152.67806740158596): (176.5, 210.0), (245.04081185199112,): (361.5, 282.0)}
{(147.45551692997657,): None}
{(146.84022929062598,): (208.5, 213.0)}
{(98.35427575123786,): (127.0, 144.0), (160.53480053766592, 216.17601605243718): (264.0, 265.0)}
{(146.98494586244377,): (212.5, 212.0)}
{(186.10914941532886,): (239.5, 205.0)}
{(183.4725288692084,): (113.0, 382.0), (335.4714873147084, 374.2027114828553): (213.0, 209.0)}
{(153.782571981706,): (216.5, 223.0)}
{(148.8403788634242, 160.32400010228102): (221.5, 206.0)}
{}
{(165.05868145664812,): (217.0, 209.0)}
{(153.28251412098925,): (218.0, 229.0)}
{}
{(147.17018067299549,): None}
{(152.43009507651794,): (207.5, 211.0)}
{(148.42931353083614,): (207.5, 214.0)}
{(183.4725288692084, 335.4714873147084, 361.747355205209, 374.2027114828553): (111.0, 380.0)}
{(156.5623242623957,): None}
{(148.97928067463164,): None}
{(153.13090383739134,): None}
{(149.28448566386885,): None}
{(152.68077462201506,): (212.0, 208.0)}
{(146.37237203225556,): (214.5, 214.0)}
{(157.76858358655076,): (213.0, 223.0)}
{(111.68408067541347,): (162.5, 248.0), (148.74231448450922,): (221.5, 210.0)}
{(154.3201727448506,): (215.0, 213.0)}
{(146.76458344710306,): (213.5, 212.0)}
{(145.89426411930782,): (217.5, 219.0)}
{(151.43420712422292,): None}
{(152.97128039115492,): (197.5, 201.0)}
{(147.54003598713743,): None}
{(122.0652074923785,): None, (155.59876780715894,): (210.5, 219.0)}
{(194.00878043034882,): (246.5, 213.0)}
{}

This seems to work pretty well! The next step is to freeze the whole classification database in this simplified form. I'll do this in a script in crowdastro.__main__.
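
For illustration, here's a minimal sketch of what that freezing step might look like, assuming a classifications table shaped like the one queried in the next section. The hash_signature helper and the exact column layout are placeholders (the signatures appear later as short byte strings, so presumably the real script hashes them); the actual schema lives in the crowdastro.__main__ script.

import hashlib

def hash_signature(signature):
    # Hypothetical helper: reduce a signature (a tuple, or a tuple of tuples)
    # to a short byte string for storage.
    return hashlib.md5(repr(signature).encode('ascii')).digest()[:4]

conn = sqlite3.connect('../crowdastro-data/processed.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS classifications '
            '(subject_id TEXT, full_radio_signature BLOB, part_radio_signature BLOB, '
            'source_x REAL, source_y REAL)')

for classification in crowdastro.data.db.radio_classifications.find():
    parsed = read_classification(classification)
    # The full signature identifies the whole set of radio combinations in this
    # classification; each part signature identifies one combination.
    full_signature = hash_signature(tuple(sorted(parsed.keys())))
    for radio_signature, ir_location in parsed.items():
        x, y = ir_location if ir_location is not None else (None, None)
        cur.execute('INSERT INTO classifications VALUES (?, ?, ?, ?, ?)',
                    [str(classification['subject_ids'][0]), full_signature,
                     hash_signature(radio_signature), x, y])

conn.commit()
conn.close()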

This section is summarised in crowdastro.labels.parse_classification.

Now we can move on to using these to find the consensus for each subject.

Finding consensus radio/location mapping

Given these simplified data points, it should be a lot easier to figure out the consensus for a given subject: pull up all of its classifications, find the most common full radio signature, and collect the clicks associated with each part of that signature.

I've frozen the previous results into a neat little database, so we should be good to go.


In [61]:
conn = sqlite3.connect('../crowdastro-data/processed.db')
conn.row_factory = sqlite3.Row
cur = conn.cursor()

classifications = list(cur.execute('SELECT full_radio_signature, part_radio_signature, source_x, source_y '
                                   'FROM classifications WHERE subject_id=?', [str(subject['_id'])]))
frs_counter = collections.Counter([c['full_radio_signature'] for c in classifications])
most_common_frs = frs_counter.most_common(1)[0][0]
radio_consensus_classifications = collections.defaultdict(list)

for classification in classifications:
    if classification['full_radio_signature'] == most_common_frs:
        radio_consensus_classifications[classification['part_radio_signature']].append((classification['source_x'],
                                                                                        classification['source_y']))

conn.close()

In [62]:
radio_consensus_classifications


Out[62]:
defaultdict(list,
            {b'\xb4\xd7\x1c?': [(215.44094848632812, 211.6597137451172),
              (227.25, 215.0),
              (212.5, 213.0),
              (211.20001220703125, 211.8000030517578),
              (212.79998779296875, 210.4166717529297),
              (213.5, 211.0),
              (212.5, 212.0),
              (210.5, 202.0),
              (191.88540649414062, 153.4409637451172),
              (212.5, 218.0),
              (None, None),
              (210.5999755859375, 210.60000610351562),
              (212.5, 211.0)],
             b'\xe3\x0c}?': [(402.4409484863281, 48.65971374511719),
              (393.25, 46.0),
              (396.5, 53.0),
              (401.20001220703125, 51.399993896484375),
              (400.79998779296875, 53.41667175292969),
              (400.5, 57.0),
              (400.5, 56.0),
              (382.5, 45.0),
              (384.8854064941406, 44.44096374511719),
              (398.5, 46.0),
              (401.0, 48.010009765625),
              (401.5999755859375, 47.599998474121094),
              (399.5, 46.0)]})

Pulling out the coordinates for each radio source was reasonably straightforward. Now we want to use these to figure out which location the volunteers agree on. Let's just focus on \xe3\x0c}? for now and plot it to get an idea of what we have:


In [76]:
crowdastro.show.ir(subject)
xs = [a[0] * crowdastro.config.get('click_to_fits_x') for a in radio_consensus_classifications[b'\xe3\x0c}?']]
ys = [crowdastro.config.get('fits_image_height') - a[1] * crowdastro.config.get('click_to_fits_y')
      for a in radio_consensus_classifications[b'\xe3\x0c}?']]
matplotlib.pyplot.xlim((170, 200))
matplotlib.pyplot.ylim((170, 185))
matplotlib.pyplot.scatter(xs, ys, marker='+')
matplotlib.pyplot.show()


A single Gaussian assumption is not reasonable: maybe the volunteers think there are two galaxies that the radio source could belong to. A Gaussian mixture assumption does seem reasonable, though, so I will use PG-means (Feng & Hamerly, 2006) to find the number of components: fit a Gaussian mixture model with k components, project both the data and the model onto random one-dimensional subspaces, and run a Kolmogorov-Smirnov test on each projection; if any projection rejects the model at the chosen significance level, increment k and repeat.


In [146]:
def pg_means(points, significance=0.01, projections=24):
    """Cluster points with the PG-means algorithm.

    points: (N, 2) array of points to cluster.
    significance: Significance level for rejecting a k-component model.
    projections: Number of random 1D projections to test for each k.
    -> sklearn.mixture.GMM, or None if no model could be fitted.
    """
    k = 1

    while True:
        # Fit a Gaussian mixture model with k components.
        gmm = sklearn.mixture.GMM(n_components=k)
        try:
            gmm.fit(points)
        except ValueError:
            # Can't fit k components to this few points.
            return None

        for _ in range(projections):
            # Project the data to one dimension.
            projection_vector = numpy.random.random(size=(2,))
            projected_points = points @ projection_vector
            # Project the model to one dimension. We need the CDF in one
            # dimension, so we'll sample some points from the model and project them.
            n_samples = 1000
            samples = gmm.sample(n_samples) @ projection_vector
            samples.sort()

            def cdf(x):
                # Empirical CDF of the projected model samples.
                for sample, y in zip(samples, numpy.arange(n_samples) / n_samples):
                    if sample >= x:
                        break
                return y

            _, p_value = scipy.stats.kstest(projected_points, numpy.vectorize(cdf))
            if p_value < significance:
                # Reject the null hypothesis that the data were drawn from the
                # fitted k-component model; we need more components.
                break
        else:
            # No projection rejected the model, so k components suffice.
            return gmm

        k += 1

Let's try that.


In [114]:
xs = [a[0] * crowdastro.config.get('click_to_fits_x') for a in radio_consensus_classifications[b'\xe3\x0c}?']]
ys = [crowdastro.config.get('fits_image_height') - a[1] * crowdastro.config.get('click_to_fits_y')
      for a in radio_consensus_classifications[b'\xe3\x0c}?']]
points = numpy.vstack([numpy.array(xs), numpy.array(ys)])
crowdastro.show.ir(subject)
matplotlib.pyplot.xlim((170, 200))
matplotlib.pyplot.ylim((170, 185))
matplotlib.pyplot.scatter(xs, ys, marker='+')
matplotlib.pyplot.scatter(*pg_means(points.T, significance=0.02).means_.T)
matplotlib.pyplot.show()


That seems alright. Let's try the other radio source.


In [145]:
xs = [a[0] * crowdastro.config.get('click_to_fits_x')
      for a in radio_consensus_classifications[b'\xb4\xd7\x1c?']
      if a[0] is not None]
ys = [crowdastro.config.get('fits_image_height') - a[1] * crowdastro.config.get('click_to_fits_y')
      for a in radio_consensus_classifications[b'\xb4\xd7\x1c?']
      if a[1] is not None]
points = numpy.vstack([numpy.array(xs), numpy.array(ys)])
crowdastro.show.ir(subject)
matplotlib.pyplot.xlim((80, 120))
matplotlib.pyplot.ylim((80, 150))
matplotlib.pyplot.scatter(xs, ys, marker='+')
matplotlib.pyplot.scatter(*pg_means(points.T, significance=0.02, projections=24).means_.T)
matplotlib.pyplot.show()


Note that we get different results for significance = 0.01 and significance = 0.02. This is unfortunately a hyperparameter that will need to be tweaked. Let's bundle this up into a function.
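
But first, as a rough check of that sensitivity, we can refit PG-means at a few significance levels and compare how many components each run settles on. Note that the counts can also vary between runs at a fixed level, since the projections are random; this snippet just reuses the points and pg_means defined above.

# Refit at a few significance levels and report the fitted component count.
for significance in (0.01, 0.02, 0.05):
    gmm = pg_means(points.T, significance=significance, projections=24)
    n_components = None if gmm is None else gmm.n_components
    print('significance =', significance, '->', n_components, 'components')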


In [150]:
def get_subject_consensus(subject, conn=None, significance=0.02):
    """Finds the consensus IR location for each radio combination in a subject.

    subject: RGZ subject dictionary.
    conn: SQLite connection to the processed classification database (required).
    significance: Significance level for PG-means.
    -> dict mapping radio signatures to (x, y) NumPy arrays.
    """
    conn.row_factory = sqlite3.Row
    cur = conn.cursor()

    classifications = list(cur.execute('SELECT full_radio_signature, part_radio_signature, source_x, source_y '
                                       'FROM classifications WHERE subject_id=?', [str(subject['_id'])]))
    frs_counter = collections.Counter([c['full_radio_signature'] for c in classifications])
    most_common_frs = frs_counter.most_common(1)[0][0]
    radio_consensus_classifications = collections.defaultdict(list)

    for classification in classifications:
        if classification['full_radio_signature'] == most_common_frs:
            radio_consensus_classifications[classification['part_radio_signature']].append((classification['source_x'],
                                                                                            classification['source_y']))
    
    consensus = {}  # Maps radio signatures to (x, y) NumPy arrays.
    for radio_signature in radio_consensus_classifications:
        xs = [a[0] * crowdastro.config.get('click_to_fits_x')
              for a in radio_consensus_classifications[radio_signature]
              if a[0] is not None]
        ys = [crowdastro.config.get('fits_image_height') - a[1] * crowdastro.config.get('click_to_fits_y')
              for a in radio_consensus_classifications[radio_signature]
              if a[1] is not None]
        points = numpy.vstack([xs, ys])
        gmm = pg_means(points.T, significance=significance, projections=24)
        consensus[radio_signature] = gmm.means_[gmm.weights_.argmax()]
    
    return consensus

In [152]:
conn = sqlite3.connect('../crowdastro-data/processed.db')
consensus = get_subject_consensus(subject, conn=conn)
conn.close()

In [161]:
matplotlib.pyplot.figure(figsize=(10, 10))
matplotlib.pyplot.axis('off')
crowdastro.show.subject(subject)
matplotlib.pyplot.scatter(*numpy.array(list(consensus.values())).T)
matplotlib.pyplot.show()


Similarly, I'll run this over the whole dataset and freeze the results. The consensus algorithm is located in crowdastro.labels.find_subject_consensus and I have also added the freezing script to crowdastro.__main__.
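
For reference, here is a rough sketch of what that freezing loop might look like. The consensuses table name and columns are placeholders for whatever the real script uses, and there's no handling here for subjects where PG-means fails to fit a model.

conn = sqlite3.connect('../crowdastro-data/processed.db')
cur = conn.cursor()
# Hypothetical output table; the real schema lives in crowdastro.__main__.
cur.execute('CREATE TABLE IF NOT EXISTS consensuses '
            '(subject_id TEXT, part_radio_signature BLOB, source_x REAL, source_y REAL)')

for subject in crowdastro.data.db.radio_subjects.find({'metadata.survey': 'atlas',
                                                       'state': 'complete'}):
    consensus = get_subject_consensus(subject, conn=conn)
    for radio_signature, (x, y) in consensus.items():
        cur.execute('INSERT INTO consensuses VALUES (?, ?, ?, ?)',
                    [str(subject['_id']), radio_signature, float(x), float(y)])

conn.commit()
conn.close()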