The following is a brief description of the data structures and organization for the Radio Galaxy Zoo (RGZ) citizen science project, organized as part of the Zooniverse.
The data for the project is stored on Amazon Web Services using MongoDB. MongoDB is a "NoSQL"-type database, meaning that it does not operate on traditional joins and tabular relations such as those used in SQL. Individual records are stored as data documents in JSON or BSON formats.
The examples below show how MongoDB can be queried using Python and the pymongo module.
The live version of the database is stored on the Amazon servers and is not designed to be queried directly by the science team, since that can potentially slow the response of the system for the volunteers. Science-team analysis should instead be run on the backup copies, which are distributed through email links sent out weekly. Contact Chris Snyder at Zooniverse (cs@zooniverse.org) if you want to be put on the email list for this.
There are three collections of RGZ data: radio_classifications, radio_subjects, and radio_users. Each is stored in a BSON file of the same name, which you can find after downloading a backup copy and untarring the compressed file locally.
In [168]:
from pymongo import MongoClient
# Load the Mongo database so we can show examples of each data type. In this case, I have already restored the MongoDB files
# to my machine and am running a local instance of mongod on port 27017.
client = MongoClient("localhost",27017)
# Select the default database name (ouroboros) for RGZ classifications
db = client['ouroboros']
The radio_subjects collection contains the information and metadata for each subject (in this case, a radio source from the FIRST survey) being classified as part of RGZ. As of the project launch in late December 2013, this comprises 175,001 images (175,000 galaxies + 1 tutorial subject). Let's look at what data is being stored for a subject.
In [171]:
# If all the RGZ data has been loaded in, there should be three collections available. Let's first look at the subjects.
subjects = db['radio_subjects']
# Extract a sample subject from the collection and print the data to the screen.
import pprint
sample_subject = subjects.find_one() # In MongoDB, data is stored in a JSON-like (BSON) format;
# in Python, each document appears as a nested dictionary.
pprint.pprint(sample_subject)
The subject contains lots of data associated with the galaxy, as well as several IDs that can be used as keys to match this against the other databases.
Every document contains a unique ID which acts as the primary key. Within its own collection, this is always designated '_id'. When referenced from other collections, the key is renamed; for example, 'subject_ids' in the radio_classifications collection matches '_id' in the radio_subjects collection.
In [143]:
print sample_subject['_id']
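As a sketch of this key-renaming convention, here is how a classification matches back to its subject. The documents and ID strings below are made up for illustration (real records store BSON ObjectId values and many more fields), but the structure mirrors what is printed above:

```python
# Illustrative documents only: real records use bson.ObjectId values.
subject = {'_id': '52af7d53eb9a9b05ef000654'}
classification = {'_id': '52b0a0f2d7b8f1380a00a0f1',
                  'subject_ids': ['52af7d53eb9a9b05ef000654']}

# The subject's primary key '_id' reappears as 'subject_ids'
# in the classifications collection.
matches = classification['subject_ids'][0] == subject['_id']
```

Against the live collections, the equivalent pymongo query would be something like `classifications.find({'subject_ids': sample_subject['_id']})`.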
Each subject stores dates and times for when the object was first inserted into the database (created), activated as a subject that could be classified, and last updated. The last date is either the time of its most recent classification on the site or the time its metadata was last changed. These should all be in Universal Time (UT).
In [144]:
print sample_subject['activated_at'];
print sample_subject['created_at'];
print sample_subject['updated_at'];
Astronomical metadata on the source. This includes coordinates (RA and dec, in both decimal degrees and sexagesimal format), the constructed source name, and an rms value (its precise definition is not documented here).
In [145]:
print sample_subject['metadata']['source']
print 'RA [hms]: %s' % sample_subject['metadata']['ra_hms']
print 'dec [dms]: %s' % sample_subject['metadata']['dec_dms']
print 'RA, dec (decimal degrees): %.2f,%.2f' % (float(sample_subject['coords'][0]),float(sample_subject['coords'][1]));
print 'rms: %.3e' % float(sample_subject['metadata']['rms'])
Information on the classification status of the object. Once an object reaches 20 classifications, it is marked as complete and retired from active classification.
In [146]:
print sample_subject['classification_count'];
print sample_subject['state']
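The retirement rule above can be sketched as a small helper. The 20-classification threshold comes from the text; the function name and the sample documents are my own:

```python
RETIREMENT_LIMIT = 20  # classification threshold stated in the project description

def is_retired(subject):
    """Return True if a subject document should no longer be actively classified,
    based on its 'state' flag or its classification count."""
    return (subject['state'] == 'complete' or
            subject['classification_count'] >= RETIREMENT_LIMIT)

# Hypothetical subject documents mirroring the fields printed above
assert is_retired({'state': 'complete', 'classification_count': 20})
assert not is_retired({'state': 'active', 'classification_count': 5})
```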
Other IDs in the system include the project ID, which tells the system that this object is associated with RGZ (should be the same for all subjects), the Zooniverse ID (which can be used to find the object in Talk), and the workflow ID (which designates the workflows within a project that can be applied to this subject). At the moment, we have only a single workflow for all of RGZ.
In [147]:
print sample_subject['project_id'];
print sample_subject['workflow_ids'][0];
print sample_subject['zooniverse_id']
The URLs for the raw data are also given in the file. The radio contour information is stored as a series of coordinates (in pixel space) in JSON format, and the radio and IR images are stored as JPGs. These are handy if you ever want to grab the raw subject data.
In [148]:
print sample_subject['location']['contours']; # FIRST radio contours
print sample_subject['location']['radio']; # FIRST radio image at full opacity
print sample_subject['location']['standard']; # WISE infrared image at full opacity
The radio_classifications collection contains the actual annotations performed by the users on our subjects. It also collects metadata on the classification process (timestamp, browser used, etc.) and IDs that can be used to link each classification to the RGZ subject or to the user who classified it. As of 3 Mar 2014, the RGZ database had registered 533,934 unique classifications.
In [149]:
# Retrieve classifications from the database
classifications = db['radio_classifications']
# Find the latest date for which a classification was performed
mrc = classifications.find().sort([("updated_at", -1)]).limit(1)
most_recent_date = mrc[0]['updated_at']
from datetime import datetime
tf = '%a, %d %b %Y %H:%M:%S %Z'
# Find total number of classifications
print 'There are %i unique classifications as of %s.' % (classifications.find().count(),datetime.strftime(most_recent_date,tf))
In [150]:
# Retrieve sample classification. Let's make it one that I (KWW) did.
my_id = db['radio_users'].find_one({'name':'KWillett'})['_id']
sample_classification = classifications.find_one({'user_id':my_id})
pprint.pprint(sample_classification)
As with all documents, each classification has a unique ID. This is referred to as "_id" in this collection, and as "classification_id" when matching it in other collections.
In [151]:
print sample_classification['_id']
There are other IDs to match this classification against its project (RGZ), the workflow used (standard radio contour + IR host identification), and subject (the galaxy being worked on).
In [152]:
print sample_classification['project_id'];
print sample_classification['subject_ids'][0];
print sample_classification['workflow_id'];
There is some metadata on the act of performing the classification. This includes the user agent (browser and operating system), the user's IP address, timestamps for when they started and finished the classification, and when the classification was loaded into the system.
In [167]:
print 'IP address: %s' % sample_classification['user_ip'];
print sample_classification['annotations'][2]['user_agent'];
# Convert timestamps into Python datetime objects so we can do math on them.
started = datetime.strptime(sample_classification['annotations'][1]['started_at'],tf);
finished = datetime.strptime(sample_classification['annotations'][1]['finished_at'],tf);
print ''
print 'Started classification at: %s' % datetime.strftime(started,tf);
print 'Finished classification at: %s' % datetime.strftime(finished,tf);
print 'User took %.2f seconds to finish classification' % (finished - started).total_seconds()
print ''
print sample_classification['created_at']; # Should be within seconds of user completing classification
print sample_classification['updated_at'];
There is a True/False keyword ('tutorial') indicating whether the classification was performed on the tutorial subject.
In [154]:
print 'The RGZ tutorial has been completed %i times as of %s.' % (classifications.find({'tutorial':True}).count(),datetime.strftime(most_recent_date,tf))
Finally, the annotations themselves. The annotations are stored as a list of JSON elements; each element in the list corresponds to a unique infrared identification made by the user, plus any radio components they selected as being associated with that infrared source. We allowed users to select more than one set of IR/radio associations in each image, although this may end up not being what we wanted; there should have been only a single source per image.
Information for the IR source is given as a single set of (x,y) coordinates in pixel space. This is the center position (rounded to the nearest pixel) of where the users clicked on the image. The location of the radio components is given as the four corners of the box containing the contours of the component.
Here is an example of a classification where the user identified a single IR host galaxy and one radio component.
In [155]:
ir_coordinates = sample_classification['annotations'][0]['ir']['0']
radio_coordinates = sample_classification['annotations'][0]['radio']
r = radio_coordinates['0']
print 'IR source is located at (x,y) = (%i,%i)' % (int(ir_coordinates['x']),int(ir_coordinates['y']))
print 'Radio component (xmin, xmax, ymin, ymax) = (%.2f, %.2f, %.2f, %.2f)' % (float(r['xmin']),float(r['xmax']),float(r['ymin']),float(r['ymax']))
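Since an annotation list can hold several IR/radio associations alongside metadata entries (user_agent, timestamps), a small helper to flatten it can be useful. This is an illustrative sketch; the helper name and the sample values below are my own, but the nesting mirrors the structure printed above:

```python
def extract_sources(annotations):
    """Return a list of (ir_click, radio_components) pairs from a
    classification's annotation list, skipping metadata elements."""
    pairs = []
    for ann in annotations:
        if 'ir' not in ann or 'radio' not in ann:
            continue  # metadata elements such as 'started_at' or 'user_agent'
        ir_click = ann['ir']['0']                 # (x, y) of the user's click
        components = list(ann['radio'].values())  # bounding boxes of radio components
        pairs.append((ir_click, components))
    return pairs

# Made-up annotation list mirroring the structure shown above
sample_annotations = [
    {'ir': {'0': {'x': '241', 'y': '219'}},
     'radio': {'0': {'xmin': '66.8', 'xmax': '89.2',
                     'ymin': '59.9', 'ymax': '85.4',
                     'scale_width': '3.58', 'scale_height': '3.40'}}},
    {'started_at': 'Mon, 03 Mar 2014 00:00:00 UTC'},
]
pairs = extract_sources(sample_annotations)
```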
Somewhat confusingly, the pixel scales for the radio and IR coordinates are NOT the same. To convert between them, they must be multiplied by a scaling factor, which is included in the data:
In [160]:
sh = float(sample_classification['annotations'][0]['radio']['0']['scale_height'])
sw = float(sample_classification['annotations'][0]['radio']['0']['scale_width'])
print 'Coordinates of radio and IR components on the same system:'
print ''
print 'IR source is located at (x,y) = (%i,%i)' % (int(ir_coordinates['x']),int(ir_coordinates['y']))
print 'Radio component (xmin, xmax, ymin, ymax) = (%.2f, %.2f, %.2f, %.2f)' % (float(r['xmin'])*sw,float(r['xmax'])*sw,float(r['ymin'])*sh,float(r['ymax'])*sh)
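The conversion above can be wrapped in a reusable helper. The function name and the round sample numbers are my own; the arithmetic follows the cell above, using the scale factors stored with each radio component:

```python
def radio_box_to_ir_pixels(box):
    """Rescale a radio component's bounding box onto the IR pixel grid,
    using the scale_width/scale_height factors stored with the component."""
    sw = float(box['scale_width'])
    sh = float(box['scale_height'])
    return (float(box['xmin']) * sw, float(box['xmax']) * sw,
            float(box['ymin']) * sh, float(box['ymax']) * sh)

# Hypothetical component with round numbers for clarity
box = {'xmin': '10.0', 'xmax': '20.0', 'ymin': '30.0', 'ymax': '40.0',
       'scale_width': '2.0', 'scale_height': '3.0'}
scaled = radio_box_to_ir_pixels(box)  # (20.0, 40.0, 90.0, 120.0)
```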
Let's look at the images and see if the classification seems reasonable.
In [165]:
from IPython.display import Image
# Show the radio image
Image(url=sample_subject['location']['radio'])
Out[165]:
In [166]:
# Show the infrared image
Image(url=sample_subject['location']['standard'])
Out[166]:
The third collection, radio_users, contains the information for all of the users who have participated in RGZ.
In [157]:
# Database of users for RGZ
users = db['radio_users']
# Find my record as an example user
sample_user = users.find_one({'name':'KWillett'})
pprint.pprint(sample_user)
This contains information on all the Zooniverse projects I've been doing, not just Radio Galaxy Zoo. To limit it to RGZ work, look for the matching project ID.
In [158]:
rgz_id = sample_subject['project_id']
rgz_user = sample_user['projects'][str(rgz_id)]
print 'User has %scompleted the RGZ tutorial' % ('' if rgz_user['tutorial_done'] else 'not ')
print 'User has classified %i RGZ subjects' % rgz_user['classification_count']
To identify the user, there is a unique ID that serves as the primary key, as well as their name and IP address (if logged in). Either the ID or the name can be used to match classifications to the user who carried them out.
In [159]:
print sample_user['_id']
print sample_user['name']
print sample_user['ip']