Get a list of 50,000 random users who have registered and created a userpage, from https://quarry.wmflabs.org/query/11715


In [3]:
wget https://quarry.wmflabs.org/run/107498/output/0/tsv?download=true -O user_reg_with_userpages.tsv


--2016-08-13 16:35:36--  https://quarry.wmflabs.org/run/107498/output/0/tsv?download=true
Resolving quarry.wmflabs.org (quarry.wmflabs.org)... 10.68.21.68
Connecting to quarry.wmflabs.org (quarry.wmflabs.org)|10.68.21.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘user_reg_with_userpages.tsv’

user_reg_with_userp     [      <=>             ]   1.69M  1.66MB/s   in 1.0s   

2016-08-13 16:35:37 (1.66 MB/s) - ‘user_reg_with_userpages.tsv’ saved [1768565]


In [6]:
python retention.py > retention.sh



This creates a shell script to query the database, which only works if you are on the Wikimedia Foundation's analytics server, ToolLabs. See https://tools.wmflabs.org/

The script runs one command to the mysql server for each user, querying the number of edits they have made between 1 and 2 years after they registered, and appends it to a file retention.tsv.


In [10]:
head -n 5 retention.sh


sql enwiki -e "   select 20158302 as user_id, 20131117000933 as user_registration,'2013-11' as year_mo, count(rev_id) as survival_1yr from ( select rev_id from revision_userindex where rev_user = 20158302 and rev_timestamp BETWEEN 20141117000933 and 20151117000933 limit 1 ) as s " >> retention.tsv  
sql enwiki -e "   select 2769023 as user_id, 20061116051526 as user_registration,'2006-11' as year_mo, count(rev_id) as survival_1yr from ( select rev_id from revision_userindex where rev_user = 2769023 and rev_timestamp BETWEEN 20071116051526 and 20081116051526 limit 1 ) as s " >> retention.tsv  
sql enwiki -e "   select 844241 as user_id, 20060127001248 as user_registration,'2006-01' as year_mo, count(rev_id) as survival_1yr from ( select rev_id from revision_userindex where rev_user = 844241 and rev_timestamp BETWEEN 20070127001248 and 20080127001248 limit 1 ) as s " >> retention.tsv  
sql enwiki -e "   select 21450716 as user_id, 20140521113239 as user_registration,'2014-05' as year_mo, count(rev_id) as survival_1yr from ( select rev_id from revision_userindex where rev_user = 21450716 and rev_timestamp BETWEEN 20150521113239 and 20160521113239 limit 1 ) as s " >> retention.tsv  
sql enwiki -e "   select 13065449 as user_id, 20100913082123 as user_registration,'2010-09' as year_mo, count(rev_id) as survival_1yr from ( select rev_id from revision_userindex where rev_user = 13065449 and rev_timestamp BETWEEN 20110913082123 and 20120913082123 limit 1 ) as s " >> retention.tsv  

Then we run it, although this WILL NOT WORK if you are not on ToolLabs. It also takes about 4 hours.


In [ ]:
bash retention.sh

Then we have to remove the header that mysql -e outputs with each query appended, so remove all lines that begin with u.


In [ ]:
grep -v '^u' retention.tsv