In [68]:
from __future__ import division
import os
os.chdir('/Users/willettk/Astronomy/Research/GalaxyZoo/rgz-analysis/python/')

import rgz

In [4]:
subjects,classifications,users = rgz.load_rgz_data()

(1) Of the 6970 volunteers, how many are anonymous? Or are all anonymous users classified as one user?


In [29]:
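# Total number of records in the users collection
total_users = users.count()

# Read the RGZ project ID from an arbitrary classification
# (assumes every record in this collection belongs to RGZ)
c = classifications.find_one()
rgz_project_id = c['project_id']

# Users with at least one RGZ classification, registered or not
did_at_least_one = users.find({'projects.%s.classification_count' % rgz_project_id:{'$gte':1}})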

In [58]:
registered_users = users.find({'projects.%s.classification_count' % rgz_project_id:{'$gte':1},'name':{'$exists':True}})
print '%i registered users did at least 1 RGZ classification.' % registered_users.count()


6902 registered users did at least 1 RGZ classification.

In [ ]:
from collections import Counter

anonymous_classifications = classifications.find({'user_name':{'$exists':False}})
ips = [c['user_ip'] for c in anonymous_classifications]
ip_counter = Counter(ips)

registered_classifications = classifications.find({'user_name':{'$exists':True}})
rips = [c['user_ip'] for c in registered_classifications]
rip_counter = Counter(rips)

In [62]:
print '%i different IP addresses submitted an anonymous RGZ classification.' % len(ip_counter)


20396 different IP addresses submitted an anonymous RGZ classification.
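
The counting step above can be illustrated on toy data. This is a minimal sketch: the records are invented, and only the `user_name` and `user_ip` field names are taken from the queries above. One IP address can be shared by several people, and one person can use several addresses, so the number of distinct IPs is only a rough proxy for the number of anonymous volunteers.

```python
from collections import Counter

# Invented records; only the 'user_name' and 'user_ip' keys mirror the notebook.
toy_classifications = [
    {'user_name': 'alice', 'user_ip': '10.0.0.1'},
    {'user_ip': '10.0.0.2'},
    {'user_ip': '10.0.0.2'},
    {'user_ip': '10.0.0.3'},
    {'user_name': 'bob', 'user_ip': '10.0.0.3'},
]

def count_ips(records, anonymous=True):
    """Count classifications per IP for anonymous (no 'user_name') or registered records."""
    return Counter(r['user_ip'] for r in records
                   if ('user_name' not in r) == anonymous)

anon_ip_counter = count_ips(toy_classifications, anonymous=True)
reg_ip_counter = count_ips(toy_classifications, anonymous=False)

# IPs that appear in both groups; these may (or may not) be registered
# volunteers who also classified while logged out.
shared_ips = set(anon_ip_counter) & set(reg_ip_counter)
```

Comparing the two counters in this way gives a sense of how much the anonymous IP count might overlap with registered volunteers, although it cannot resolve individual people.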

(2) What is the highest number of classifications that one user has completed?


In [66]:
# Compare users on their RGZ classification count explicitly,
# rather than relying on Python 2's implicit ordering of dicts
count_field = 'projects.%s.classification_count' % rgz_project_id
get_count = lambda u: u['projects'][str(rgz_project_id)]['classification_count']
mc = max(users.find({count_field:{'$gte':1}},{count_field:1,'_id':0}), key=get_count)
nmax = get_count(mc)
max_user = users.find_one({count_field:nmax})
print 'Most classifications done is %i by %s' % (nmax,max_user['name'])


Most classifications done is 72823 by antikodon

(3) How many classifications have come from anonymous users? Do we include them in our analysis?


In [69]:
ac = anonymous_classifications.count()
cc = classifications.count()
print '%i RGZ classifications were by anonymous users. This is %.2f percent of the total.' % (ac,ac/cc * 100)


321549 RGZ classifications were by anonymous users. This is 26.68 percent of the total.

Anonymous and registered classifications are treated identically in the current analysis.

(4) A plot of the percentage of total classifications per user, or the number of classifications per user, whichever looks better. Split the superusers from everyone else, as the graph will look funny if we put everyone together. Can we use usernames, or should we label the users as 1, 2, 3, etc.?


In [72]:
%pylab inline
import pandas as pd


Populating the interactive namespace from numpy and matplotlib

In [73]:
# Retrieve RGZ data, convert into data frames
batch_classifications = classifications.find({"updated_at": {"$gt": rgz.main_release_date}})
dfc = pd.DataFrame( list(batch_classifications) )

In [75]:
rgz.plot_user_counts(dfc)
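
The body of `rgz.plot_user_counts` is not shown in this notebook, so the following is not its implementation. It is only a sketch, on invented data, of how per-user counts might be tabulated from a data frame like `dfc` and split into superusers and everyone else. The `user_name` column is assumed to match the field used in the queries above, and `superuser_threshold` is a hypothetical cut.

```python
import pandas as pd

# Invented stand-in for dfc; anonymous rows have no user name.
toy = pd.DataFrame({'user_name': ['a', 'a', 'a', 'a', 'b', 'b', 'c', None, None]})

# Classifications per registered user, sorted from most to least active.
# value_counts() drops missing values by default, so anonymous rows are excluded.
per_user = toy['user_name'].value_counts()

# Hypothetical cut: at least this many classifications makes a "superuser".
superuser_threshold = 3
superusers = per_user[per_user >= superuser_threshold]
everyone_else = per_user[per_user < superuser_threshold]

# Percentage of all classifications (anonymous ones included) done by each user.
percent_of_total = 100.0 * per_user / len(toy)

# Label users by rank (1, 2, 3, ...) instead of by username.
ranked = per_user.reset_index(drop=True)
ranked.index = ranked.index + 1
```

Plotting `superusers` and `everyone_else` on separate axes keeps the few very active users from compressing the rest of the distribution, and `ranked` avoids showing usernames on the plot.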


(5) To highlight the contribution of superusers, can you get the completion fraction of the project with and without the superusers?


In [74]:
rgz.plot_empirical_distribution_function(dfc)
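
The body of `rgz.plot_empirical_distribution_function` is not shown here either. As a rough illustration of the idea, the sketch below computes the cumulative share of classifications contributed by the most active users, using invented counts. Treating the top `n_super` users as superusers is a hypothetical definition, and the share of classifications is only a proxy: a true completion fraction would need the number of classifications per subject.

```python
import numpy as np

# Invented per-user classification counts (not RGZ data).
counts = np.array([50, 30, 10, 5, 3, 1, 1])

# Sort from most to least active and accumulate each user's share of the total.
sorted_counts = np.sort(counts)[::-1]
cumulative_fraction = np.cumsum(sorted_counts) / float(sorted_counts.sum())

# Hypothetical definition: the two most active users are the "superusers".
n_super = 2
fraction_from_superusers = cumulative_fraction[n_super - 1]
fraction_without_superusers = 1.0 - fraction_from_superusers
```

Plotting `cumulative_fraction` against user rank shows how quickly the curve saturates, which is one way to visualize how much of the total depends on a small number of volunteers.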