Distribution of edits by editor experience

I was intrigued by this graph for OSM, which shows how edits on OpenStreetMap are distributed according to editor experience (apologies to the person who suggested it on IRC; I don't remember who it was). Its conclusion is that the majority of edits are actually made by a small minority of experienced editors.

Is the situation similar for MusicBrainz?

Setup


In [1]:
%run startup.ipy


Last notebook update: 2018-06-07
Git repo: git@bitbucket.org:loujine/musicbrainz-dataviz.git
Importing libs
Defining database parameters

Defining *sql* helper function
Last database update: 2018-06-02

Python packages versions:
numpy       1.14.3
pandas      0.23.0
sqlalchemy  1.2.8
CPython 3.7.0b5
IPython 6.4.0

Fetch data from the DB

We could limit the edit history by date (open_time > ...) but let's take everything since the current edit system was created (in 2012 I think).
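To make the optional date limit concrete, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database. The toy rows, the `count_edits` helper and its `since` parameter are assumptions made for illustration only; the notebook itself queries the real MusicBrainz database through its own `sql` helper.

```python
import sqlite3


def count_edits(conn, since=None):
    """Return (editor, edit count) rows, optionally keeping only edits
    opened on or after `since` (an ISO date string)."""
    query = """
        SELECT editor.name AS editor, COUNT(*) AS cnt
          FROM edit
          JOIN editor ON editor.id = edit.editor
         WHERE editor.name != 'ModBot'
           AND (:since IS NULL OR edit.open_time >= :since)
      GROUP BY editor.name
      ORDER BY cnt DESC, editor.name
    """
    return conn.execute(query, {"since": since}).fetchall()


# Hypothetical miniature version of the two tables used by the query
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE editor (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edit (id INTEGER PRIMARY KEY, editor INTEGER, open_time TEXT);
    INSERT INTO editor VALUES (1, 'alice'), (2, 'bob'), (3, 'ModBot');
    INSERT INTO edit (editor, open_time) VALUES
        (1, '2016-05-01'), (1, '2017-03-10'), (1, '2018-01-15'),
        (2, '2017-06-20'),
        (3, '2017-07-01');
""")

all_time = count_edits(conn)                    # whole history
recent = count_edits(conn, since="2017-01-01")  # only edits from 2017 on
print(all_time)
print(recent)
```

Passing the date as a bound parameter rather than pasting it into the SQL string keeps the query text fixed whichever period is requested.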


In [2]:
edits_count = sql("""
SELECT editor.name AS editor,
       COUNT(*) AS cnt
  FROM edit
  JOIN editor ON editor.id = edit.editor
 WHERE editor.name != 'ModBot'
-- AND edit.open_time >= '2017-01-01'
GROUP BY editor.name
ORDER BY cnt DESC
-- LIMIT 1000
;
""")

In [3]:
edits_count = edits_count.set_index('editor')
edits_count.head()


Out[3]:
cnt
editor
reosarevok 1721106
TheBookkeeper 1645012
drsaunde 1305561
ListMyCDs.com 1142201
HibiscusKazeneko 816887

Those are the most active editors (i.e. those with the highest number of edits) as of mid-2018; the results should be close to the official editors statistics page.


In [4]:
print('Number of editors: {}'.format(len(edits_count)))
print('Number of edits: {}'.format(edits_count.sum().values[0]))


Number of editors: 195938
Number of edits: 49148162

Split editors in bins

We can split the editors into bins corresponding to their "experience level" (from complete novice to old and wise auto-editor). To do that, we add a "category" column to our dataframe.

Note that the limits between bins are completely arbitrary.


In [5]:
bounds = [0, 5, 10, 20, 50, 100, 1000, 10000, 100000, 1000000, 10000000]
names = ['hit-and-run', 'newbie', 'casual', 'great', 'heavy', 
         'super', 'legendary', 'fantastic', 'mega', 'epic']

In [6]:
edits_count['category'] = pandas.cut(edits_count.cnt, bins=bounds)

In [7]:
edits_count.head()


Out[7]:
cnt category
editor
reosarevok 1721106 (1000000, 10000000]
TheBookkeeper 1645012 (1000000, 10000000]
drsaunde 1305561 (1000000, 10000000]
ListMyCDs.com 1142201 (1000000, 10000000]
HibiscusKazeneko 816887 (100000, 1000000]

So there are (currently) 4 editors in the "epic" category (more than 1 million edits).
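The category column above holds raw intervals. As a possible variant, `pandas.cut` also accepts a `labels` argument, so the names can be attached directly. The sketch below uses a hypothetical `toy` dataframe with made-up edit counts placed on and around the bin edges, to show on which side each boundary value falls.

```python
import pandas

bounds = [0, 5, 10, 20, 50, 100, 1000, 10000, 100000, 1000000, 10000000]
names = ['hit-and-run', 'newbie', 'casual', 'great', 'heavy',
         'super', 'legendary', 'fantastic', 'mega', 'epic']

# Hypothetical edit counts, not taken from the database
toy = pandas.DataFrame(
    {'cnt': [1, 5, 6, 100, 101, 1721106]},
    index=['a', 'b', 'c', 'd', 'e', 'f'],
)

# Intervals are right-closed by default: 5 falls in (0, 5], 6 in (5, 10]
toy['category'] = pandas.cut(toy.cnt, bins=bounds, labels=names)
print(toy)
```

An editor with exactly 100 edits therefore still counts as "heavy"; "super" starts at 101.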

Split edit count by category

Now we want to compute the total count of edits made by editors in each category.


In [11]:
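# Number of editors in each category
cats = edits_count.groupby('category').count()
cats = cats.rename({"cnt": "nb_editors"}, axis="columns")
# Total number of edits made by the editors of each category
cats['nb_edits'] = edits_count.groupby('category').cnt.sum().values
# Prefix each interval with its human-readable name
cats.index = ['{name} {idx}'.format(name=name, idx=idx)
              for (name, idx) in zip(names, cats.index)]
cats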


Out[11]:
nb_editors nb_edits
hit-and-run (0, 5] 104858 231326
newbie (5, 10] 27866 212614
casual (10, 20] 22096 324307
great (20, 50] 19110 609774
heavy (50, 100] 8549 606343
super (100, 1000] 11006 3137339
legendary (1000, 10000] 1925 5487680
fantastic (10000, 100000] 449 13314860
mega (100000, 1000000] 75 19410039
epic (1000000, 10000000] 4 5813880
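The two columns of this table can also be obtained in a single pass by aggregating the `cnt` column with both `count` and `sum`. This is only a sketch of an alternative, shown on a hypothetical `toy` dataframe with made-up values, not the code that produced the table.

```python
import pandas

# Hypothetical data: one row per editor, already assigned to a category
toy = pandas.DataFrame({
    'cnt': [2, 3, 7, 8, 9, 40],
    'category': ['hit-and-run', 'hit-and-run',
                 'newbie', 'newbie', 'newbie', 'great'],
})

# count -> number of editors per category, sum -> number of edits per category
# sort=False keeps the categories in order of first appearance
cats = toy.groupby('category', sort=False).cnt.agg(['count', 'sum'])
cats = cats.rename(columns={'count': 'nb_editors', 'sum': 'nb_edits'})
print(cats)
```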

Let's plot those results as bar graphs and pie charts using plotly (so that the graphs are interactive).


In [12]:
iplot({
    'data': [{'type': 'bar', 'x': cats.index, 'y': cats.nb_editors}],
    'layout': {'title': 'Number of editors by category',
               'xaxis': {'title': 'Editor category'},
               'yaxis': {'title': 'Number of editors'},
              }
})


No surprise there: the vast majority of editors have only a few edits...


In [13]:
iplot({
    'data': [{'type': 'bar', 'x': cats.index, 'y': cats.nb_edits}],
    'layout': {'title': 'Number of edits by category',
               'xaxis': {'title': 'Editor category'},
               'yaxis': {'title': 'Number of edits'},
              }
})


... but the vast majority of edits (about 96%) are made by experienced users with more than 100 edits each, who represent only about 7% of all editors.
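The shares behind that observation can be checked directly from the category table. The sketch below copies the figures by hand from that table (the `rows` list and variable names are introduced here for illustration) and computes the weight of the categories above 100 edits.

```python
# (lower bound, upper bound, number of editors, number of edits),
# copied from the category table shown earlier in this notebook
rows = [
    (0, 5, 104858, 231326),
    (5, 10, 27866, 212614),
    (10, 20, 22096, 324307),
    (20, 50, 19110, 609774),
    (50, 100, 8549, 606343),
    (100, 1000, 11006, 3137339),
    (1000, 10000, 1925, 5487680),
    (10000, 100000, 449, 13314860),
    (100000, 1000000, 75, 19410039),
    (1000000, 10000000, 4, 5813880),
]

total_editors = sum(r[2] for r in rows)
total_edits = sum(r[3] for r in rows)

# "Experienced" here means the bins starting at 100, i.e. more than 100 edits
exp_editors = sum(r[2] for r in rows if r[0] >= 100)
exp_edits = sum(r[3] for r in rows if r[0] >= 100)

editor_share = 100 * exp_editors / total_editors
edit_share = 100 * exp_edits / total_edits
print('{:.1f}% of editors made {:.1f}% of edits'.format(editor_share, edit_share))
# prints "6.9% of editors made 96.0% of edits"
```

The totals also match the overall numbers of editors and edits printed earlier, which is a quick sanity check that no editor fell outside the bins.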

Same result as pie charts:


In [14]:
iplot({
    'data': [{'type': 'pie', 'labels': cats.index, 'values': cats.nb_editors, 
              'sort': False, 'direction': 'clockwise'}],
    'layout': {'title': 'Number of editors by category'}
})



In [15]:
iplot({
    'data': [{'type': 'pie', 'labels': cats.index, 'values': cats.nb_edits, 
              'sort': False, 'direction': 'clockwise'}],
    'layout': {'title': 'Number of edits by category'}
})