Documentation of the Radio Galaxy Zoo database

Kyle Willett, University of Minnesota

The following is a brief description of the data structures and organization for the Radio Galaxy Zoo (RGZ) citizen science project, organized as part of the Zooniverse.

The data for the project is stored on Amazon Web Services using MongoDB. MongoDB is a "NoSQL"-type database, meaning that it does not operate on traditional joins and tabular relations such as those used in SQL. Individual records are stored as documents in BSON, a binary encoding of JSON-like data.

The examples below show how MongoDB can be queried using Python and the pymongo module.

The live version of the database is stored on the Amazon servers and is not designed to be directly queried by the science team, since that can potentially slow the response of the system for the volunteers. Access for the science team to do analysis should be done on the backup copies, which are obtained through email links sent out weekly. Contact Chris Snyder at Zooniverse (cs@zooniverse.org) if you want to be put on the email list for this.

There are three collections for RGZ data (loosely referred to as databases below): radio_classifications, radio_subjects, and radio_users. Each is stored in a BSON file of the same name, which you can find after downloading a backup copy and unpacking the archive locally.


In [168]:
from pymongo import MongoClient

# Load the Mongo database so we can show examples of each data type. In this case, I have already restored the MongoDB files
# to my machine and am running a local instance of mongod on port 27017. 
client = MongoClient("localhost",27017)

# Select the default database name (ouroboros) for RGZ classifications
db = client['ouroboros']

Database #1: radio_subjects

The radio_subjects collection contains the information and metadata for each subject (in this case, a radio source from the FIRST survey) being classified as part of RGZ. As of the project launch in late December 2013, this comprises 175,001 images (175,000 galaxies + 1 tutorial subject). Let's look at what data is being stored for a subject.


In [171]:
# If all the RGZ data has been loaded in, there should be three collections available. Let's first look at the subjects.
subjects = db['radio_subjects']

# Extract a sample subject from the collection and print the data to the screen.
import pprint
sample_subject = subjects.find_one()     # In MongoDB, the data is stored similar to a JSON file; 
                                         # in Python, it is a nested dictionary.

pprint.pprint(sample_subject)


{u'_id': ObjectId('52af7d53eb9a9b05ef000001'),
 u'activated_at': datetime.datetime(2013, 12, 17, 17, 45, 13, 844000),
 u'classification_count': 20,
 u'coords': [206.419375, 23.382361111111113],
 u'created_at': datetime.datetime(2013, 12, 17, 9, 16, 38, 435000),
 u'location': {u'contours': u'http://radio.galaxyzoo.org/subjects/contours/52af7d53eb9a9b05ef000001.json',
               u'radio': u'http://radio.galaxyzoo.org/subjects/radio/52af7d53eb9a9b05ef000001.jpg',
               u'standard': u'http://radio.galaxyzoo.org/subjects/standard/52af7d53eb9a9b05ef000001.jpg'},
 u'metadata': {u'dec_dms': u'23.0 22.0 56.5',
               u'ra_hms': u'13.0 45.0 40.65',
               u'rms': u'0.000178',
               u'source': u'FIRSTJ134540.6+232256'},
 u'project_id': ObjectId('52afdb804d69636532000001'),
 u'random': 0.5988090089044151,
 u'state': u'complete',
 u'updated_at': datetime.datetime(2013, 12, 17, 9, 16, 38, 468000),
 u'workflow_ids': [ObjectId('52afdb804d69636532000002')],
 u'zooniverse_id': u'ARG000255t'}

The subject contains lots of data associated with the galaxy, as well as several IDs that can be used as keys to match this against the other databases.

Every document contains a unique ID which acts as the primary key. In the collection itself, this is always designated as '_id'. If you're trying to find matches for this in other collections, the key is renamed; for example, 'subject_ids' in the classifications collection is matched on '_id' in the subjects collection.


In [143]:
print sample_subject['_id']


52af7d53eb9a9b05ef000001

The subject stores the dates and times when the object was first inserted into the database (created_at), activated as a subject that could be classified (activated_at), and last updated (updated_at). The last of these should be either the date of its most recent classification on the site or the date the metadata was changed for some reason. These should all be in Universal Time (UT).


In [144]:
print sample_subject['activated_at'];
print sample_subject['created_at'];
print sample_subject['updated_at'];


2013-12-17 17:45:13.844000
2013-12-17 09:16:38.435000
2013-12-17 09:16:38.468000
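
Because the timestamps come back as Python datetime objects, they can be subtracted directly. The short sketch below copies two values from the sample subject above so that it runs without a database connection; the variable names are only illustrative.

```python
from datetime import datetime

# Timestamps copied from the sample subject above (naive datetimes, assumed to be UT).
created_at = datetime(2013, 12, 17, 9, 16, 38, 435000)
activated_at = datetime(2013, 12, 17, 17, 45, 13, 844000)

# Subtracting two datetimes gives a timedelta; total_seconds() includes any whole days.
delay = activated_at - created_at
delay_hours = delay.total_seconds() / 3600.0
print('Subject was activated %.2f hours after it was created' % delay_hours)
```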

The subject also stores astronomical metadata on the source. This includes the coordinates in decimal degrees (RA and dec, under 'coords') and in sexagesimal form, the constructed source name, and an rms value (the meaning of the last one has not been confirmed).


In [145]:
print sample_subject['metadata']['source']
print 'RA  [hms]: %s' % sample_subject['metadata']['ra_hms']
print 'dec [dms]: %s' % sample_subject['metadata']['dec_dms']
print 'RA, dec (decimal degrees): %.2f,%.2f' % (float(sample_subject['coords'][0]),float(sample_subject['coords'][1]));
print 'rms: %.3e' % float(sample_subject['metadata']['rms'])


FIRSTJ134540.6+232256
RA  [hms]: 13.0 45.0 40.65
dec [dms]: 23.0 22.0 56.5
RA, dec (decimal degrees): 206.42,23.38
rms: 1.780e-04
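
The sexagesimal strings and the decimal 'coords' values should describe the same position. As a rough consistency check, the sketch below converts the strings from the sample subject and compares them with the stored decimal values. The helper name is hypothetical, and the handling of negative declinations (a leading minus sign on the first field) is an assumption, since the sample only shows a positive declination.

```python
def sexagesimal_to_degrees(value, is_hours=False):
    """Convert a space-separated sexagesimal string to decimal degrees."""
    fields = value.split()
    # Assumption: a negative value carries its sign on the first field.
    # Reading the sign from the text keeps "-0.0 30.0 0.0" negative.
    sign = -1.0 if fields[0].startswith('-') else 1.0
    first, minutes, seconds = [abs(float(f)) for f in fields]
    total = first + minutes / 60.0 + seconds / 3600.0
    if is_hours:
        total *= 15.0  # 24 hours of right ascension span 360 degrees
    return sign * total

# Values copied from the sample subject above.
metadata = {'ra_hms': '13.0 45.0 40.65', 'dec_dms': '23.0 22.0 56.5'}
coords = [206.419375, 23.382361111111113]

ra_deg = sexagesimal_to_degrees(metadata['ra_hms'], is_hours=True)
dec_deg = sexagesimal_to_degrees(metadata['dec_dms'])
print('RA, dec (decimal degrees): %.6f, %.6f' % (ra_deg, dec_deg))
```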

The subject also records its classification status. Once the object reaches 20 classifications, it is marked as complete and retired from active classification.


In [146]:
print sample_subject['classification_count'];
print sample_subject['state']


20
complete
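
To see how many subjects are in each state, the documents can be tallied in Python. This is only a sketch: it works on any iterable of subject dictionaries (a pymongo cursor such as subjects.find() should behave the same way), and the state name other than 'complete' is a placeholder, since only 'complete' appears in the sample above.

```python
from collections import Counter

def summarize_states(subject_docs):
    """Count how many subject documents are in each 'state'."""
    return Counter(doc.get('state', 'unknown') for doc in subject_docs)

# Stand-in documents; only the first one is taken from the sample above.
# 'placeholder_state' is NOT a real RGZ state name, just an illustration.
docs = [
    {'zooniverse_id': 'ARG000255t', 'classification_count': 20, 'state': 'complete'},
    {'zooniverse_id': 'hypothetical_1', 'classification_count': 3, 'state': 'placeholder_state'},
    {'zooniverse_id': 'hypothetical_2', 'classification_count': 20, 'state': 'complete'},
]

state_counts = summarize_states(docs)
for state, count in sorted(state_counts.items()):
    print('%s: %i' % (state, count))
```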

Other IDs in the system include the project ID, which tells the system that this object is associated with RGZ (should be the same for all subjects), the Zooniverse ID (which can be used to find the object in Talk), and the workflow ID (which designates the workflows within a project that can be applied to this subject). At the moment, we have only a single workflow for all of RGZ.


In [147]:
print sample_subject['project_id'];
print sample_subject['workflow_ids'][0];
print sample_subject['zooniverse_id']


52afdb804d69636532000001
52afdb804d69636532000002
ARG000255t

The URLs for the raw data are also given in the document. The radio contour information is stored as a series of coordinates (in pixel space) in JSON format, and the radio and IR images are stored as JPGs. These URLs are handy if you ever want to grab the raw data for a subject.


In [148]:
print sample_subject['location']['contours'];     # FIRST radio contours
print sample_subject['location']['radio'];        # FIRST radio image at full opacity
print sample_subject['location']['standard'];     # WISE infrared image at full opacity


http://radio.galaxyzoo.org/subjects/contours/52af7d53eb9a9b05ef000001.json
http://radio.galaxyzoo.org/subjects/radio/52af7d53eb9a9b05ef000001.jpg
http://radio.galaxyzoo.org/subjects/standard/52af7d53eb9a9b05ef000001.jpg

Database #2: radio_classifications

The radio_classifications collection contains the actual annotations made by the users on our subjects. It also stores metadata on the classification process (timestamp, browser used, etc.) and IDs that can be used to link each classification to the RGZ subject or to the user who classified it. As of 3 Mar 2014, the RGZ database had registered 533,934 unique classifications.


In [149]:
# Retrieve classifications from the database
classifications = db['radio_classifications']

# Find the latest date for which a classification was performed
mrc = classifications.find().sort([("updated_at", -1)]).limit(1)
most_recent_date = [x for x in mrc][0]['updated_at']

from datetime import datetime
tf = '%a, %d %b %Y %H:%M:%S %Z'

# Find total number of classifications 
print 'There are %i unique classifications as of %s.' % (classifications.find().count(),datetime.strftime(most_recent_date,tf))


There are 533934 unique classifications as of Mon, 03 Mar 2014 10:17:53 .

In [150]:
# Retrieve sample classification. Let's make it one that I (KWW) did.
my_id = db['radio_users'].find_one({'name':'KWillett'})['_id']
sample_classification = classifications.find_one({'user_id':my_id})
pprint.pprint(sample_classification)


{u'_id': ObjectId('52dd541b35cb5d7d76000576'),
 u'annotations': [{u'ir': {u'0': {u'x': u'216', u'y': u'229'}},
                   u'radio': {u'0': {u'scale_height': u'3.2442748091603053',
                                     u'scale_width': u'3.2196969696969697',
                                     u'xmax': u'73.10710356353849',
                                     u'xmin': u'62.22619851927605',
                                     u'ymax': u'80.45084724053386',
                                     u'ymin': u'61.529507305686664'}}},
                  {u'finished_at': u'Mon, 20 Jan 2014 16:51:39 GMT',
                   u'started_at': u'Mon, 20 Jan 2014 16:51:31 GMT'},
                  {u'user_agent': u'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36'},
                  {u'lang': u'en'}],
 u'created_at': datetime.datetime(2014, 1, 20, 16, 51, 39),
 u'project_id': ObjectId('52afdb804d69636532000001'),
 u'subject_ids': [ObjectId('52af810f7aa69f059a0048ec')],
 u'subjects': [{u'id': ObjectId('52af810f7aa69f059a0048ec'),
                u'location': {u'contours': u'http://radio.galaxyzoo.org/subjects/contours/52af810f7aa69f059a0048ec.json',
                              u'radio': u'http://radio.galaxyzoo.org/subjects/radio/52af810f7aa69f059a0048ec.jpg',
                              u'standard': u'http://radio.galaxyzoo.org/subjects/standard/52af810f7aa69f059a0048ec.jpg'},
                u'zooniverse_id': u'ARG0002buk'}],
 u'tutorial': False,
 u'updated_at': datetime.datetime(2014, 1, 20, 16, 51, 35, 997000),
 u'user_id': ObjectId('503fad32ba40af241100063a'),
 u'user_ip': u'75.72.226.46',
 u'user_name': u'KWillett',
 u'workflow_id': ObjectId('52afdb804d69636532000002')}

As with all documents, each classification has a unique ID. This is referred to as "_id" in this collection, and as "classification_id" when matching it in other collections.


In [151]:
print sample_classification['_id']


52dd541b35cb5d7d76000576

There are other IDs to match this classification against its project (RGZ), the workflow used (standard radio contour + IR host identification), and subject (the galaxy being worked on).


In [152]:
print sample_classification['project_id'];
print sample_classification['subject_ids'][0];
print sample_classification['workflow_id'];


52afdb804d69636532000001
52af810f7aa69f059a0048ec
52afdb804d69636532000002
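
Since MongoDB has no joins, linking a classification back to its subject is done by looking up the ID by hand. With a live connection this would be something like subjects.find_one({'_id': sample_classification['subject_ids'][0]}). The sketch below shows the same idea with plain dictionaries so that it runs on its own; the IDs are written as strings here, whereas in the database they are ObjectId values, and the helper name is hypothetical.

```python
def subject_for_classification(classification, subjects_by_id):
    """Return the subject document a classification refers to, or None."""
    subject_id = classification['subject_ids'][0]
    return subjects_by_id.get(subject_id)

# Trimmed stand-ins for documents shown in this notebook.
subject_docs = [
    {'_id': '52af7d53eb9a9b05ef000001', 'zooniverse_id': 'ARG000255t'},
    {'_id': '52af810f7aa69f059a0048ec', 'zooniverse_id': 'ARG0002buk'},
]

# Index the subjects by primary key once, then look up each classification.
subjects_by_id = dict((doc['_id'], doc) for doc in subject_docs)

classification = {'_id': '52dd541b35cb5d7d76000576',
                  'subject_ids': ['52af810f7aa69f059a0048ec']}

matched = subject_for_classification(classification, subjects_by_id)
print('Classification %s was made on subject %s'
      % (classification['_id'], matched['zooniverse_id']))
```

Building the dictionary index once is cheaper than querying the subjects collection separately for every classification when looping over many of them.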

There is some metadata on the act of performing the classification. This includes the user's browser and operating system (the user agent string), their IP address, timestamps for when they started and finished the classification, and when the classification was loaded into the system.


In [167]:
print 'IP address: %s' % sample_classification['user_ip'];
print sample_classification['annotations'][2]['user_agent'];

# Convert timestamps into Python datetime objects and we can do math on them.
started = datetime.strptime(sample_classification['annotations'][1]['started_at'],tf);
finished = datetime.strptime(sample_classification['annotations'][1]['finished_at'],tf);
print ''
print 'Started classification at:  %s' % datetime.strftime(started,tf);
print 'Finished classification at: %s' % datetime.strftime(finished,tf);
print 'User took %.2f seconds to finish classification' % (finished - started).total_seconds()   # total_seconds() also counts whole days

print ''
print sample_classification['created_at'];   # Should be within seconds of user completing classification
print sample_classification['updated_at'];


IP address: 75.72.226.46
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36

Started classification at:  Mon, 20 Jan 2014 16:51:31 
Finished classification at: Mon, 20 Jan 2014 16:51:39 
User took 8.00 seconds to finish classification

2014-01-20 16:51:39
2014-01-20 16:51:35.997000
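
The cell above reads the timestamps from a fixed position in the annotations list. If a user marks more than one source, the timing element may sit at a different index (this is an assumption; it has not been checked against the full database), so a more defensive approach is to search for the element that holds the timestamps. A sketch with a hypothetical helper name, using a trimmed copy of the sample annotations:

```python
from datetime import datetime

TIME_FORMAT = '%a, %d %b %Y %H:%M:%S %Z'

def classification_duration(annotations):
    """Return the seconds between started_at and finished_at, or None if absent."""
    for element in annotations:
        if 'started_at' in element and 'finished_at' in element:
            started = datetime.strptime(element['started_at'], TIME_FORMAT)
            finished = datetime.strptime(element['finished_at'], TIME_FORMAT)
            return (finished - started).total_seconds()
    return None

# Trimmed copy of the annotations from the sample classification above.
annotations = [
    {'ir': {'0': {'x': '216', 'y': '229'}}},
    {'finished_at': 'Mon, 20 Jan 2014 16:51:39 GMT',
     'started_at': 'Mon, 20 Jan 2014 16:51:31 GMT'},
    {'lang': 'en'},
]

duration = classification_duration(annotations)
print('User took %.2f seconds to finish classification' % duration)
```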

There is a True/False keyword to indicate if the classification was on the tutorial subject.


In [154]:
print 'The RGZ tutorial has been completed %i times as of %s.' % (classifications.find({'tutorial':True}).count(),datetime.strftime(most_recent_date,tf))


The RGZ tutorial has been completed 22263 times as of Mon, 03 Mar 2014 10:17:53 .

Finally, the annotations themselves. The annotations are stored as a list of JSON elements. Each of the leading elements corresponds to a unique infrared identification made by the user, together with any radio components they selected as being associated with that infrared source; in the sample above, the remaining elements hold the start and finish timestamps, the browser user agent, and the interface language. We allowed users to select more than one set of IR/radio associations in each image, although this may end up not being what we wanted, since there should have been only a single source per image.

Information for the IR source is given as a single set of (x,y) coordinates in pixel space. This is the center position (rounded to the nearest pixel) of where the user clicked on the image. The location of each radio component is given as the extent (xmin, xmax, ymin, ymax) of the box containing the contours of the component.

Here is an example of a classification where the user identified a single IR host galaxy and one radio component.


In [155]:
ir_coordinates = sample_classification['annotations'][0]['ir']['0']
radio_coordinates = sample_classification['annotations'][0]['radio']

r = radio_coordinates['0']

print 'IR source is located at (x,y) = (%i,%i)' % (int(ir_coordinates['x']),int(ir_coordinates['y']))
print 'Radio component (xmin, xmax, ymin, ymax) = (%.2f, %.2f, %.2f, %.2f)' % (float(r['xmin']),float(r['xmax']),float(r['ymin']),float(r['ymax']))


IR source is located at (x,y) = (216,229)
Radio component (xmin, xmax, ymin, ymax) = (62.23, 73.11, 61.53, 80.45)
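
For classifications with several sources or components, it may be more convenient to walk the whole annotations list than to index into it by hand. The sketch below collects every IR click and radio box while skipping the metadata elements. The function name is hypothetical, and the isinstance checks are a precaution: how an annotation without an IR or radio selection is stored has not been verified here.

```python
def extract_sources(annotations):
    """Return one (ir_clicks, radio_boxes) tuple per source marked by the user."""
    sources = []
    for element in annotations:
        if 'ir' not in element and 'radio' not in element:
            continue  # timing, user_agent or lang element
        ir, radio = element.get('ir'), element.get('radio')
        ir_clicks, radio_boxes = [], []
        if isinstance(ir, dict):
            for key in sorted(ir):
                ir_clicks.append((float(ir[key]['x']), float(ir[key]['y'])))
        if isinstance(radio, dict):
            for key in sorted(radio):
                box = radio[key]
                radio_boxes.append(tuple(float(box[k])
                                         for k in ('xmin', 'xmax', 'ymin', 'ymax')))
        sources.append((ir_clicks, radio_boxes))
    return sources

# Annotations copied from the sample classification above (user agent omitted).
annotations = [
    {'ir': {'0': {'x': '216', 'y': '229'}},
     'radio': {'0': {'scale_height': '3.2442748091603053',
                     'scale_width': '3.2196969696969697',
                     'xmax': '73.10710356353849', 'xmin': '62.22619851927605',
                     'ymax': '80.45084724053386', 'ymin': '61.529507305686664'}}},
    {'finished_at': 'Mon, 20 Jan 2014 16:51:39 GMT',
     'started_at': 'Mon, 20 Jan 2014 16:51:31 GMT'},
    {'lang': 'en'},
]

sources = extract_sources(annotations)
print('User marked %i source(s)' % len(sources))
```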

Somewhat confusingly, the pixel scales for the radio and IR coordinates are NOT the same. To put the radio coordinates on the IR pixel scale, multiply the x values by scale_width and the y values by scale_height, both of which are included in the data:


In [160]:
sh = float(sample_classification['annotations'][0]['radio']['0']['scale_height'])
sw = float(sample_classification['annotations'][0]['radio']['0']['scale_width'])

print 'Coordinates of radio and IR components on the same system:'
print ''
print 'IR source is located at (x,y) = (%i,%i)' % (int(ir_coordinates['x']),int(ir_coordinates['y']))
print 'Radio component (xmin, xmax, ymin, ymax) = (%.2f, %.2f, %.2f, %.2f)' % (float(r['xmin'])*sw,float(r['xmax'])*sw,float(r['ymin'])*sh,float(r['ymax'])*sh)


Coordinates of radio and IR components on the same system:

IR source is located at (x,y) = (216,229)
Radio component (xmin, xmax, ymin, ymax) = (200.35, 235.38, 199.62, 261.00)
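
Once both sets of coordinates are on the same pixel scale, one rough sanity check is whether the IR click falls inside the scaled radio box. This is only a sketch with hypothetical helper names; a real host galaxy need not lie inside the bounding box of a single radio component, so a False result is not necessarily a bad classification.

```python
def radio_box_in_ir_pixels(box):
    """Scale a radio bounding box onto the IR pixel grid."""
    sw = float(box['scale_width'])
    sh = float(box['scale_height'])
    return (float(box['xmin']) * sw, float(box['xmax']) * sw,
            float(box['ymin']) * sh, float(box['ymax']) * sh)

def click_inside_box(x, y, scaled_box):
    """Return True if the point (x, y) lies within (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = scaled_box
    return xmin <= x <= xmax and ymin <= y <= ymax

# Values copied from the sample classification above.
radio_box = {'scale_height': '3.2442748091603053',
             'scale_width': '3.2196969696969697',
             'xmax': '73.10710356353849', 'xmin': '62.22619851927605',
             'ymax': '80.45084724053386', 'ymin': '61.529507305686664'}
ir_click = {'x': '216', 'y': '229'}

scaled = radio_box_in_ir_pixels(radio_box)
inside = click_inside_box(float(ir_click['x']), float(ir_click['y']), scaled)
print('Scaled radio box: (%.2f, %.2f, %.2f, %.2f)' % scaled)
print('IR click inside radio box: %s' % inside)
```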

Let's look at the images and see if the classification seems reasonable.


In [165]:
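from IPython.display import Image

# Show the radio image for the subject of the sample classification.
# (sample_subject from earlier is a different galaxy, so use the URLs stored with the classification.)
classified_subject = sample_classification['subjects'][0]
Image(url=classified_subject['location']['radio'])


Out[165]:

In [166]:
# Show the infrared image for the same subject
Image(url=sample_classification['subjects'][0]['location']['standard'])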


Out[166]:

Database #3: radio_users

The third database contains the information for all of the users who have participated in RGZ.


In [157]:
# Database of users for RGZ
users = db['radio_users']

# Find my record as an example user
sample_user = users.find_one({'name':'KWillett'})
pprint.pprint(sample_user)


{u'_id': ObjectId('503fad32ba40af241100063a'),
 u'api_key': u'3ff69a72d0e143167bb8',
 u'avatar': u'http://zooniverse-avatars.s3.amazonaws.com/users/570417/forum.png',
 u'classification_count': 819,
 u'email': u'willettk@gmail.com',
 u'favorite_count': 16,
 u'ip': u'131.212.231.203',
 u'name': u'KWillett',
 u'preferences': {u'5101a1341a320ea77f000001': {u'dashboard': {u'tutorial': True}},
                  u'51e6fcdd3ae74023b9000001': {u'dashboard': {u'tutorial': True}},
                  u'dashboard': {u'beta': True, u'welcome_tut': True},
                  u'm83_tutorial_done': u'true',
                  u'radio': {u'tutorial_done': u'true'},
                  u'wise': {u'tutorial_done': u'true'}},
 u'projects': {u'4fdf8fb3c32dab6c95000001': {u'classification_count': 7,
                                             u'favorite_count': 0,
                                             u'recent_count': 5,
                                             u'splits': {},
                                             u'tutorial_done': True},
               u'4fff255d516bcb407b000001': {u'classification_count': 16,
                                             u'favorite_count': 0,
                                             u'recent_count': 16,
                                             u'splits': {},
                                             u'tutorial_done': True},
               u'502a701d516bcb0001000001': {u'classification_count': 2,
                                             u'favorite_count': 0,
                                             u'invitation': {u'response': u'no',
                                                             u'timestamp': datetime.datetime(2012, 9, 12, 20, 9, 53, 739000)},
                                             u'last_active_at': datetime.datetime(2012, 9, 12, 20, 9, 53, 739000),
                                             u'recent_count': 2,
                                             u'splits': {},
                                             u'tutorial_done': True},
               u'502a90cd516bcb060c000001': {u'classification_count': 182,
                                             u'favorite_count': 7,
                                             u'groups': {u'50251c3b516bcb6ecb000001': {u'classification_count': 34},
                                                         u'50251c3b516bcb6ecb000002': {u'classification_count': 136},
                                                         u'5244909c3ae7402d53000001': {u'classification_count': 9},
                                                         u'5249cbce3ae740728d000001': {u'classification_count': 3}},
                                             u'recent_count': 62,
                                             u'splits': {},
                                             u'talk': {u'active_at': datetime.datetime(2013, 10, 7, 1, 34, 18, 416000)}},
               u'503293e6516bcb6782000001': {u'classification_count': 12,
                                             u'favorite_count': 1,
                                             u'groups': {u'50575d4d516bcb57170246d7': {u'classification_count': 6},
                                                         u'50575db3516bcb5717025c85': {u'classification_count': 6}},
                                             u'recent_count': 12,
                                             u'reveal_count': 2,
                                             u'splits': {u'classifier_messaging': u'b'}},
               u'5040d826a7823f1d95000001': {u'classification_count': 4,
                                             u'favorite_count': 2,
                                             u'recent_count': 2,
                                             u'splits': {}},
               u'5077375154558fabd7000001': {u'classification_count': 349,
                                             u'favorite_count': 6,
                                             u'groups': {u'50c6197ea2fc8e1110000001': {u'classification_count': 49},
                                                         u'50c61e51a2fc8e1110000002': {u'classification_count': 79},
                                                         u'50c62517a2fc8e1110000003': {u'classification_count': 69},
                                                         u'50e477293ae740a45f000001': {u'classification_count': 31},
                                                         u'51ad041f3ae7401ecc000001': {u'classification_count': 119},
                                                         u'51f158983ae74082bb000001': {u'classification_count': 2,
                                                                                       u'name': u'season_6'}},
                                             u'recent_count': 228,
                                             u'splits': {u'classifier_messaging': u'b'},
                                             u'talk': {u'active_at': datetime.datetime(2013, 9, 17, 16, 18, 53, 464000)},
                                             u'tutorial_done': True},
               u'507edef23ae74020d6000001': {u'classification_count': 29,
                                             u'recent_count': 3,
                                             u'splits': {},
                                             u'tutorial_done': True},
               u'50e9e3d33ae740f1f3000001': {u'splits': {},
                                             u'talk': {u'active_at': datetime.datetime(2013, 5, 31, 15, 29, 9, 548000)}},
               u'5101a1341a320ea77f000001': {u'annotation_count': 10,
                                             u'classification_count': 44,
                                             u'groups': {u'5154a3783ae74086ab000001': {u'classification_count': 39},
                                                         u'5154a3783ae74086ab000002': {u'classification_count': 5}},
                                             u'splits': {},
                                             u'talk': {u'active_at': datetime.datetime(2013, 9, 12, 14, 18, 13, 926000)}},
               u'511410da3ae740c3ec000001': {u'classification_count': 42,
                                             u'groups': {u'5170103b3ae74027cf000002': {u'classification_count': 16},
                                                         u'517010563ae74027d3000002': {u'classification_count': 26}},
                                             u'splits': {},
                                             u'talk': {u'active_at': datetime.datetime(2013, 7, 1, 15, 36, 51, 759000)}},
               u'5154abce3ae740898b000001': {u'splits': {}},
               u'516d6f243ae740bc96000001': {u'classification_count': 3,
                                             u'splits': {u'tutorial': u'j'},
                                             u'talk': {u'active_at': datetime.datetime(2013, 9, 17, 15, 35, 35, 426000)},
                                             u'tutorial_done': True},
               u'51c1c9523ae74071c0000001': {u'classification_count': 27,
                                             u'groups': {u'530be1183ae74079c3000001': {u'classification_count': 5,
                                                                                       u'name': u'bin_0_20'},
                                                         u'530be1183ae74079c3000003': {u'classification_count': 9,
                                                                                       u'name': u'bin_20_40'},
                                                         u'530be1183ae74079c3000005': {u'classification_count': 6,
                                                                                       u'name': u'bin_40_50'},
                                                         u'530be1183ae74079c3000007': {u'classification_count': 4,
                                                                                       u'name': u'bin_50_60'},
                                                         u'530be1183ae74079c300000b': {u'classification_count': 3,
                                                                                       u'name': u'bin_65_90'}},
                                             u'splits': {}},
               u'51c9bba83ae7407725000001': {u'classification_count': 3,
                                             u'score': 400,
                                             u'splits': {},
                                             u'tutorial_done': True},
               u'51e6fcdd3ae74023b9000001': {u'classification_count': 15,
                                             u'splits': {},
                                             u'talk': {u'active_at': datetime.datetime(2013, 9, 19, 19, 21, 27, 932000)}},
               u'523ca1a03ae74053b9000001': {u'classification_count': 23,
                                             u'groups': {u'523ca1a03ae74053b9000003': {u'classification_count': 11},
                                                         u'523ca1a03ae74053b9000004': {u'classification_count': 12}},
                                             u'splits': {},
                                             u'tutorial_done': True},
               u'52afdb804d69636532000001': {u'classification_count': 30,
                                             u'splits': {},
                                             u'tutorial_done': True},
               u'52d065303ae740380a000001': {u'activity_count': 5,
                                             u'classification_count': 8,
                                             u'diary_date_count': 9,
                                             u'groups': {u'52d0568d3ae74026a3014592': {u'classification_count': 8,
                                                                                       u'name': u'3 CAVALRY DIVISION: 14 Mobile Veterinary Section '}},
                                             u'person_count': 3,
                                             u'place_count': 9,
                                             u'splits': {}},
               u'52d1718e3ae7401cc8000001': {u'classification_count': 2,
                                             u'splits': {}},
               u'52e2cfc1806ea54590000001': {u'classification_count': 19,
                                             u'splits': {}}},
 u'recent_count': 330,
 u'talk': {u'roles': {u'502a90cd516bcb060c000001': [u'scientist',
                                                    u'moderator'],
                      u'51e6fcdd3ae74023b9000001': [u'scientist'],
                      u'52afdb804d69636532000001': [u'scientist',
                                                    u'moderator']}},
 u'user_groups': [{u'id': ObjectId('50759e59d10d2426d0000d0f'),
                   u'name': u'UMN Zooites'},
                  {u'id': ObjectId('520bd2aee917ff7b35000051'),
                   u'name': u'UBERT Workshop'}],
 u'zooniverse_id': 570417}

This contains information on all the Zooniverse projects I've been doing, not just Radio Galaxy Zoo. To limit it to RGZ work, look for the matching project ID.


In [158]:
rgz_id = sample_subject['project_id']
rgz_user = sample_user['projects'][str(rgz_id)]

# Parentheses are needed: % binds more tightly than the conditional expression.
print 'User has %scompleted the RGZ tutorial' % ('' if rgz_user['tutorial_done'] else 'not ')
print 'User has classified %i RGZ subjects' % rgz_user['classification_count']


User has completed the RGZ tutorial
User has classified 30 RGZ subjects
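
As the sample record shows, not every project entry carries every key (some have no 'tutorial_done' or 'classification_count'), and a user who has never worked on RGZ presumably has no entry for it at all. A lookup with defaults avoids a KeyError in those cases. The helper below is a hypothetical sketch, run on a trimmed copy of the sample user; the defaults of False and 0 are an assumption about how missing keys should be read.

```python
def project_stats(user_doc, project_id):
    """Return (tutorial_done, classification_count) for one project, with defaults."""
    entry = user_doc.get('projects', {}).get(str(project_id), {})
    return (bool(entry.get('tutorial_done', False)),
            entry.get('classification_count', 0))

# Trimmed copy of the sample user above; project keys are stored as strings.
user = {
    'name': 'KWillett',
    'projects': {
        '52afdb804d69636532000001': {'classification_count': 30,
                                     'splits': {},
                                     'tutorial_done': True},
        '5154abce3ae740898b000001': {'splits': {}},
    },
}

rgz_done, rgz_count = project_stats(user, '52afdb804d69636532000001')
print('User has %scompleted the RGZ tutorial' % ('' if rgz_done else 'not '))
print('User has classified %i RGZ subjects' % rgz_count)
```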

To identify the user, there is a unique ID that serves as the primary key, as well as their name and IP address (if logged in). Either the ID or the name can be used to match classifications to the user who carried them out.


In [159]:
print sample_user['_id']
print sample_user['name']
print sample_user['ip']


503fad32ba40af241100063a
KWillett
131.212.231.203