First import the pandas data analysis library and check which version I'm using (as I'm constantly updating it to stay up-to-date).
In [ ]:
import planet4 as p4
import pandas as pd
import numpy as np                # needed for np.arange and the weights below
import matplotlib.pyplot as plt   # needed for the plotting cells below
from planet4 import io
pd.__version__
In [ ]:
db = io.DBManager()
db_fname = db.dbname
In [ ]:
db.dbname
Because Seasons 2 and 3 together only use about 1.7 GB of RAM, there is no need for special on-disk techniques; I can just load the whole file.
In [ ]:
df = db.get_all()
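As an aside: should the database file ever grow beyond available memory, one could iterate over it in chunks instead of loading it at once. A minimal sketch, assuming the file is a single table-format HDF5 store (the chunk size is arbitrary):
In [ ]:
# hypothetical chunked read; requires the HDF5 store to be in 'table' format
n_users = set()
for chunk in pd.read_hdf(db_fname, chunksize=1000000):
    n_users.update(chunk.user_name.unique())
len(n_users)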
In [ ]:
df.info()
In [ ]:
df.user_name.nunique()
In [ ]:
df[df.user_name.str.startswith('not-logged-in')].user_name.nunique()
In [ ]:
df[df.marking=='fan'].shape
In [ ]:
df[df.marking=='blotch'].shape
In [ ]:
df[df.marking=='interesting'].shape
In [ ]:
df.shape
In [ ]:
img_ids = df.image_id.unique()
img_names = df.image_name.unique()
So, how many tiles (image_ids) and HiRISE images (image_names) were online:
In [ ]:
no_all = len(img_ids)
print(no_all)
n_images = len(img_names)
print(n_images)
Now we need to find out how often each image_id has been looked at. For that we have the groupby functionality. Specifically, because we want to know how many citizens have submitted a classification for each image_id, we group by image_id and count the unique classification_ids within each group.
We need to constrain to unique classification_ids because every marked object within one classification is stored with the same classification_id, and we don't want to count a submission more than once; for now we are only interested in the overall number of submissions.
In other words: because the different fans, blotches and interesting things for one image_id are all submitted under the same classification_id, I need to count unique classification_ids only. Otherwise images with many submitted items would appear 'more completed' merely for having a lot of fan content, and not for being analyzed by many citizens, which is what we actually want to measure.
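To illustrate why the uniqueness constraint matters, here is a toy example with made-up IDs: one classification contains two fans, so counting rows would inflate the tally, while counting unique classification_ids does not:
In [ ]:
toy = pd.DataFrame({'image_id': ['a', 'a', 'a'],
                    'classification_id': ['c1', 'c1', 'c2'],
                    'marking': ['fan', 'fan', 'blotch']})
print(toy.groupby('image_id').classification_id.size())     # 3 rows
print(toy.groupby('image_id').classification_id.nunique())  # 2 classifications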
In [ ]:
from planet4 import stats
stats.define_season_column(df)
In [ ]:
grp = df.groupby(['season', 'image_id'])
In [ ]:
counts = grp.classification_id.nunique()
In [ ]:
%matplotlib ipympl
In [ ]:
import seaborn as sns
In [ ]:
plt.close('all')
In [ ]:
sns.set_palette('colorblind', 4)
In [ ]:
pal = sns.color_palette("colorblind", 4)
In [ ]:
bins = np.arange(25, 170, 5)
bins
In [ ]:
n_my29 = counts.loc[2].size
n_my30 = counts.loc[3].size
In [ ]:
sns.distplot?
In [ ]:
fig, ax = plt.subplots()
kwargs = {'alpha': 0.2}
axlabel = "Number of classifications per Planet Four tile"
sns.distplot(counts.loc[2], label='MY29', ax=ax, kde=True, bins=bins,
             color=pal[0], hist_kws=kwargs)
sns.distplot(counts.loc[3], label='MY30', ax=ax, kde=True, bins=bins,
             color=pal[2], axlabel=axlabel, hist_kws=kwargs)
ax.set_ylabel('Density')
ax.legend()
ax.set_title("Distribution of Planet Four classification counts")
fig.savefig("/Users/klay6683/Dropbox/src/p4_paper1/figures/count_stats.png",
            dpi=200)
In [ ]:
plt.close('all')
In [ ]:
sns.set_context('paper')
In [ ]:
fig, ax = plt.subplots(figsize=(9, 4))
pal = sns.color_palette('colorblind')
weights2 = np.ones_like(counts.loc[2].values) / float(len(counts.loc[2]))
weights3 = np.ones_like(counts.loc[3].values) / float(len(counts.loc[3]))
# the weights already produce fractions, so no normalization kwarg is needed
ax.hist([counts.loc[2], counts.loc[3]], bins=bins, color=pal[:3:2],
        label=['MY29', 'MY30'], weights=[weights2, weights3])
ax.set_xlabel("Number of classifications per Planet Four tile")
ax.set_ylabel("Fraction of number of classifications")
ax.set_title("Distribution of Planet Four classification counts.")
ax.legend()
fig.savefig("/Users/klay6683/Dropbox/src/p4_paper1/figures/count_stats.png",
            dpi=200)
In [ ]:
counts[counts >= 30].size
That's pretty disappointing, but alas, the cold hard truth. It means that, taking all submitted years in the data into account, we currently have only the following percentage done:
In [ ]:
counts[counts >= 30].size / float(no_all) * 100
Wishing to see higher values, I briefly contemplated whether one has to sum up the different counts instead, but I don't think that's it.
The way I see it, one has to decide in which 'phase space' to determine the status of Planet Four: either the space of total subframes or the total number of classifications. I believe that to determine the finished state of Planet Four it is sufficient, and actually easier, to focus on the available number of subframes and determine how often each of them has been looked at.
The different seasons of our south polar observations are separated by jumps of several thousand in the orbit number of the original HiRISE image ID, which in P4 is called image_name.
In [ ]:
# str[5:7] is the 2-digit thousands count in, e.g., ESP_011234_0950, in this case 11.
df['thousands'] = df.image_name.str[5:7].astype('int')
In [ ]:
thousands = df.thousands.value_counts().sort_index()
thousands
As one can see, we have groups of [1..5, 11..13, 20..22]. Let's add another season column to the dataframe, first filled with zeros.
In [ ]:
df['season'] = 0
For the first season we actually don't need to look at the thousands counter, as all image_names of the first season start with PSP (for 'Primary Science Phase'). Let's set all rows with names starting with 'PSP' to season 1.
In [ ]:
df.loc[df.image_name.str.startswith('PSP'), 'season'] = 1
And for the later seasons, we actually need to use the thousands counter:
In [ ]:
df.loc[(df.thousands > 10) & (df.thousands < 20), 'season'] = 2
df.loc[df.thousands > 19, 'season'] = 3
So, for all seasons, how many rows do we have in the overall data:
In [ ]:
no_all = df.season.value_counts()
no_all
In [ ]:
def get_counts_per_classification_id(df, unique=True):
    grouping = df.classification_id.groupby(df.image_id, sort=False)
    # because I only grouped the classification_id column above, this function is only
    # applied to it. First, reduce to a unique list, and then save the size of that list.
    if unique:
        return grouping.agg(lambda x: x.unique().size)
    else:
        return grouping.size()
In [ ]:
df.image_name.groupby(df.season).agg(lambda x:x.unique().size)
In [ ]:
no_all = df.image_id.groupby(df.season).agg(lambda x: x.unique().size)
no_all
In [ ]:
def done_per_season(season, limit, unique=True, in_percent=True):
    subdf = df[df.season == season]
    counts_per_classid = get_counts_per_classification_id(subdf, unique)
    no_done = counts_per_classid[counts_per_classid >= limit].size
    if in_percent:
        return 100.0 * no_done / no_all[season]
    else:
        return no_done
In [ ]:
for season in [1, 2, 3]:
    print(season)
    print(done_per_season(season, 30, in_percent=True))
In the following code I check not only the different years, but also the influence of the required count limit used to define a subframe as 'finished'.
To collect the data I create an empty dataframe with an index ranging over the different limits I want to check (i.e. range(30, 101, 10)).
In [ ]:
import sys
from collections import OrderedDict

results = pd.DataFrame(index=range(30, 101, 10))
for season in [1, 2, 3]:
    print(season)
    sys.stdout.flush()  # to force a print-out of the stdout buffer
    subdf = df[df.season == season]
    counts = get_counts_per_classification_id(subdf)
    values = OrderedDict()
    for limit in results.index:
        values[limit] = done_per_season(season, limit)
    results[season] = list(values.values())
In [ ]:
np.round(results)
I realised that user_names should provide just the same access to the performed counts, because each classification_id should map to exactly one user, as it is created when that user clicks on Submit. At least that's how I understood it.
So imagine my surprise when I found out that it isn't the same answer. Unfortunately, it looks like we have to reduce our dataset even further because of apparent multiple submissions of the same classification, but let's see.
First, create the respective function that determines counts via user_name instead of classification_id after grouping by image_id. This first grouping by image_id remains the essential step for determining how often a particular image_id has been worked on, so that doesn't change.
In [ ]:
def get_counts_per_user_name(df):
    grouping = df.user_name.groupby(df.image_id, sort=False)
    counts = grouping.agg(lambda x: x.unique().size)
    # counts = counts.sort_values(ascending=False)
    return counts
In [ ]:
counts_by_user = get_counts_per_user_name(df)
counts_by_user
Compare that again to the output when counting per classification_id:
In [ ]:
counts_by_class = get_counts_per_classification_id(df)
counts_by_class
So, not the same result! Let's dig deeper.
Focus on one image_id and study what is happening there. I first get a sub-table for the subframe 'jp7' and determine the user_names that worked on that subframe.
Then I loop over the names, filtering another sub-part of the table where the current user worked on jp7. According to the hypothesis that a classification_id is created for a user at submission time, and the idea that a user should not see an image twice, there should be only one classification_id in that sub-part.
I test that by checking if the unique list of classification_ids has a length $>1$. If it does, I print out the user_name.
In [ ]:
jp7 = df[df.image_id == 'APF0000jp7']
unique_users = jp7.user_name.unique()
# loop over the list of users that worked on jp7
for user in unique_users:
    subdf = jp7[jp7.user_name == user]
    if len(subdf.classification_id.unique()) > 1:
        print(user, len(subdf))
Ok, so let's have a look at the data for the first user_name for the subframe jp7:
In [ ]:
jp7[jp7.user_name == 'not-logged-in-8d495c463aeffd67c08b2dfc1141f33b']
First note that the creation times of these two different classifications differ, so it looks like this user has seen the jp7 subframe more than once.
But when you scroll this HTML table to the right, you will notice that the submitted object has the exact same coordinates in both classifications. How likely is it that the user finds the exact same coordinates in less than 60 seconds?
So the question is: is this really a new classification that the user has done twice? Or was the same thing submitted twice? Hopefully Meg knows the answer to that.
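One way to gauge how widespread this is would be to look for rows that are identical in everything but the classification_id. A minimal sketch; I am assuming here that the marking coordinates live in columns named x and y, which may not match the actual schema:
In [ ]:
# hypothetical duplicate check: same user, tile, marking and coordinates,
# but potentially submitted under more than one classification_id
dupe_cols = ['user_name', 'image_id', 'marking', 'x', 'y']
dupes = df[df.duplicated(subset=dupe_cols, keep=False)]
print(len(dupes), 'rows look like repeated submissions, spread over',
      dupes.classification_id.nunique(), 'classification_ids')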
I found it instructive to look at how the fraction of finished data depends on the limit we put on the required counts per image_id (i.e. subframe), and also how it changes when counting unique user_names per image_id instead of unique classification_ids.
In [ ]:
ax = results[[2, 3]].plot()
ax.set_xlabel('Required number of analyses submitted to be considered "done".')
ax.set_ylabel('Current percentage of dataset finished [%]')
ax.set_title("Season 2 and 3 status, depending on definition of 'done'.")
ax.figure.savefig('Season2_3_status.png', dpi=200)
In [ ]:
x = range(1, 101)
per_class = []
per_user = []
# no_all holds per-season tile counts at this point, so sum for the overall total
n_tiles = float(no_all.sum())
for val in x:
    per_class.append(100 * counts_by_class[counts_by_class >= val].size / n_tiles)
    per_user.append(100 * counts_by_user[counts_by_user >= val].size / n_tiles)
In [ ]:
fig, ax = plt.subplots()
ax.plot(x, per_class, label='per classification_id')
ax.plot(x, per_user, label='per user_name')
ax.set_xlabel('Counts constraint for _finished_ criterion')
ax.set_ylabel('Current percent finished [%]')
ax.legend()
Ok, so not that big a deal until we require more than 80 classifications to be done.
The method value_counts() basically delivers a histogram of the counts_by_user data series. In other words, it shows how the classification frequencies are distributed over the dataset. It shows the expected peak close to 100, because that is what we are aiming for now, and the system today no longer shows a subframe that has already been seen 100 times.
But it also shows quite some waste of citizen power in all the classifications that went to subframes with counts > 100.
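To put a rough number on that waste, one could sum up every classification beyond the 100-count target; a quick sketch:
In [ ]:
# classifications beyond the intended 100 per subframe count as 'wasted' effort
over = counts_by_user[counts_by_user > 100]
(over - 100).sum()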
In [ ]:
counts_by_user.value_counts()
In [ ]:
counts_by_user.value_counts().plot(style='*')
In [ ]:
users_work = df.classification_id.groupby(df.user_name).agg(lambda x: x.unique().size)
In [ ]:
users_work.sort_values(ascending=False)[:10]
In [ ]:
df[df.user_name=='gwyneth walker'].classification_id.value_counts()
In [ ]:
import helper_functions as hf
from importlib import reload
reload(hf)
In [ ]:
hf.classification_counts_for_user('Kitharode', df).hist?
In [ ]:
hf.classification_counts_for_user('Paul Johnson', df)
In [ ]:
# marking is a string column, so use pandas' isnull instead of np.isnan
df.marking.isnull()
In [ ]:
df.marking
In [ ]:
s = 'INVESTIGATION OF POLAR SEASONAL FAN DEPOSITS USING CROWDSOURCING'
In [ ]:
s.title()