VoxCharta Part I

Webscraping VoxCharta.org

Michael Gully-Santiago

December 17, 2014

My approach is to copy the data from VoxCharta's usage statistics pages to the clipboard, then immediately save it locally for analysis.


In [26]:
%%html
<script type="text/javascript">
     show=true;
     function toggle(){
         if (show){$('div.input').hide();}else{$('div.input').show();}
            show = !show}
 </script>
 <h2><a href="javascript:toggle()" target="_self">Click to toggle code input</a></h2>


Click to toggle code input


In [1]:
%pylab inline
import numpy as np
from numpy.random import randn
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame


Populating the interactive namespace from numpy and matplotlib

Rather than use Beautiful Soup, I simply copied the data to the clipboard.

But for reproducibility purposes, I immediately saved the data to a .csv file with the code below.

institutes = pd.read_clipboard(header=None, sep='\s{2,}', index_col='Institute',
                               names=['Institute', 'Users', 'Votes', 'Up', 'Down',
                                        'Comments', 'Posts','DateStarted'])
institutes.to_csv('data/institutes.csv')

In [3]:
institutes = pd.read_csv('data/institutes.csv', na_values='-', parse_dates=['DateStarted'])
institutes.tail()


Out[3]:
Institute Users Votes Up Down Comments Posts DateStarted
222 UCSC Outflows 1 2 1 1 0 0 NaT
223 University of Surrey 1 2 1 1 0 0 NaT
224 CWRU 0 0 0 0 0 0 NaT
225 CWRU-Theory 0 0 0 0 0 0 2011-11-21
226 Yale-Clusters 0 0 0 0 0 0 NaT

In [4]:
#institutes['Votes'].cumsum().plot(title='Cumulative distribution')
institutes['Rank'] = institutes.Votes.rank(method='first', ascending=False)

In [5]:
cumsum_normalized = institutes.Votes.cumsum().div(institutes.Votes.cumsum().max())
cumsum_normalized.plot(title='Cumulative distribution of VoxCharta votes')
UT_id = institutes.Institute == 'UT Austin'
plt.scatter(institutes['Rank'][UT_id], cumsum_normalized[UT_id], s=50, c='r', marker='o')
plt.annotate('UTexas', 
             xy=(institutes['Rank'][UT_id], cumsum_normalized[UT_id]), 
             xytext=(30, -30),  # offset the label so the arrow is visible
             textcoords='offset points',
             fontsize=16.0,
             arrowprops=dict(arrowstyle="fancy",
                color="0.5",
                shrinkB=9,
                connectionstyle="arc3,rad=0.3",
                ),
            )

#plt.add_at("mutate", loc=2)


Out[5]:
<matplotlib.text.Annotation at 0x112db39d0>

In [6]:
ids = (cumsum_normalized < 0.5)
count = len(cumsum_normalized)
statement = "{0} institutions are responsible for half of all votes out of {1} institutions"
print statement.format(np.sum(ids), count)
print "They are: "
institutes.Institute[ids]


25 institutions are responsible for half of all votes out of 227 institutions
They are: 
Out[6]:
0                                                  DARK
1                                                   UCR
2                                             STScI/JHU
3                                                  Yale
4                         PUC-Institute of Astrophysics
5                                                  NTHU
6                                                  UCSC
7                                         U. Pittsburgh
8                                            AIfA-Cosmo
9                                                   LAM
10                                           CEA Saclay
11                                         unaffiliated
12                                          Harvard ITC
13                               University of Maryland
14                                             Columbia
15                                                 AIfA
16    Osservatorio Astrofisico di Arcetri - Extragal...
17                                             RSAA-ANU
18                                        UMass Amherst
19                                                  UCB
20                                                  UCI
21                                    Durham University
22                       U. of Sussex, Astronomy Centre
23                                              Caltech
24                                              Cornell
Name: Institute, dtype: object

Once again, I scrape the data manually, then immediately save it to a .csv file:

users = pd.read_clipboard(header=None, sep='\s{2,}', index_col='User',
                               names=['User', 'Votes', 'Up', 'Down',
                                        'Comments', 'Posts','DateUser', 'Affiliation'])
users.to_csv('data/users.csv')

In [7]:
users = pd.read_csv('data/users.csv', parse_dates=['DateUser'], index_col='User')
users.tail()


Out[7]:
Votes Up Down Comments Posts DateUser Affiliation
User
HildaBramlett 0 0 0 0 0 2014-12-13 Chicago-KICP
RoseannShea 0 0 0 0 0 2014-12-13 Chicago-KICP
SabinaDolling 0 0 0 0 0 2014-12-13 Chicago-KICP
AnneBrassell 0 0 0 0 0 2014-12-13 Chicago-KICP
KarryRanclaud 0 0 0 0 0 2014-12-13 Chicago-KICP

In [28]:
users.count() #5169, that's a lot of users!


Out[28]:
Votes          5169
Up             5169
Down           5169
Comments       5169
Posts          5169
DateUser       5169
Affiliation    5169
Rank           5169
dtype: int64

In [8]:
users['Rank'] = users.Votes.rank(method='first', ascending=False)

In [9]:
cumsum_norm_users = users.Votes.cumsum().div(users.Votes.cumsum().max())
cumsum_norm_users.plot(title='Cumulative distribution of VoxCharta votes by user')

gully_id = (users.index == 'gully')

plt.scatter(users['Rank'][gully_id], cumsum_norm_users[gully_id], s=50, c='r', marker='o')
plt.annotate('gully', 
             xy=(users['Rank'][gully_id], cumsum_norm_users[gully_id]), 
             xytext=(30, -30),  # offset the label so the arrow is visible
             textcoords='offset points',
             fontsize=16.0,
             arrowprops=dict(arrowstyle="fancy",
                color="0.5",
                shrinkB=9,
                connectionstyle="arc3,rad=0.3",
                ))


Out[9]:
<matplotlib.text.Annotation at 0x1132975d0>

I want to emphasize that voting for papers is not necessarily virtuous. The 'rank' assigned in this analysis implies nothing about a user's scientific merit; it simply reflects how many times he or she has voted for something on VoxCharta. Indeed, it would be annoying for someone to vote constantly, for no reason, having never read the papers.

Let's restrict the discussion to just UTexas at Austin (my current affiliation).


In [10]:
UTexas = (users.Affiliation == 'UT Austin')
print "There are {} users at UT Austin".format(np.sum(UTexas))


There are 79 users at UT Austin

In [11]:
UTexas_users = users[UTexas].copy()  # copy the slice so later column assignments don't warn
UTexas_users.head()


Out[11]:
Votes Up Down Comments Posts DateUser Affiliation Rank
User
gully 178 178 0 0 1 2011-07-18 UT Austin 144
Steve Finkelstein 119 119 0 0 0 2011-08-23 UT Austin 233
comerford 85 85 0 0 2 2011-07-18 UT Austin 329
H. Kim 85 85 0 1 0 2011-08-11 UT Austin 330
mjohnson 55 55 0 0 0 2012-05-16 UT Austin 488

In [12]:
cumsum_norm_UT = UTexas_users.Votes.cumsum().div(UTexas_users.Votes.cumsum().max())
cumsum_norm_UT.plot(title='Cumulative distribution of VoxCharta votes at UTexas')
UTexas_users['Cumulative'] = cumsum_norm_UT



In [13]:
never_voted = (UTexas_users.Votes == 0)
print "{} users out of {} at UT Austin have never voted".format(np.sum(never_voted), np.sum(UTexas))


23 users out of 79 at UT Austin have never voted

I'm curious to see how the UTexas CDF compares to other institutions'. This is a bit hard to compare directly, since the total number of users differs between institutions, so we might have to normalize the number of voters (i.e. the $x$-axis) too. What we might do instead is use a coarse estimator, like the number of votes cast by the highest-voting user at each institution. Let's come back to that.
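That coarse estimator is a one-liner with a groupby; here is a sketch on toy data (the `toy` frame is a hypothetical stand-in for the `users` table, assuming only `Affiliation` and `Votes` columns):

```python
import pandas as pd

# Toy stand-in for the users table: per-user vote counts with affiliations.
toy = pd.DataFrame({'Affiliation': ['A', 'A', 'B', 'B', 'B'],
                    'Votes': [100, 10, 60, 50, 40]})

# Coarse per-institution estimator: the vote count of each
# institution's highest-voting user.
top_voter = toy.groupby('Affiliation')['Votes'].max()
print(top_voter)
```

This sidesteps the unequal-user-count problem entirely, at the cost of throwing away the shape of each institution's CDF.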

Let's look at my personal voting record

The usual clipboard gimmick.

mgs = pd.read_clipboard(header=None, sep='\s{2,}', index_col='Date',
                               names=['Date', 'Title'])
mgs.to_csv('data/mgs.csv')

In [20]:
mgs = pd.read_csv('data/mgs.csv', parse_dates=['Date'], index_col='Date')
mgs['count'] = mgs.count(axis=1)
mgs.head()


Out[20]:
Title count
Date
2014-12-17 The velocity distribution in the solar neighbo... 1
2014-12-17 High-contrast Imaging with Spitzer: Deep Obser... 1
2014-12-16 A Measurement of the Cosmic Microwave Backgrou... 1
2014-12-15 Photometric Calibration on Lunar-based Ultravi... 1
2014-12-15 ALMA observations of alpha Centauri: First det... 1

In [22]:
grouped = mgs.groupby(level=0)
gsum = grouped.sum()

In [23]:
mgs_byday = gsum.resample('D', how=sum)
mgs_byday.fillna(0, inplace=True)
mgs_bymonth = gsum.resample('M', how=sum)
mgs_bymonth.fillna(0, inplace=True)

In [24]:
mgs_bymonth.plot(title='MGS voting record by month')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x112d36690>

In [25]:
mgs_byday.cumsum().plot(title='MGS cumulative votes as a function of time')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x113caf0d0>

Insights: My voting frequency is irregular. Notably, there were long stretches of time when I did not vote at all; lately I have been voting a lot.
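Those voting gaps could be quantified rather than eyeballed. A sketch, using a toy daily series as a stand-in for the zero-filled `mgs_byday['count']` built above:

```python
import pandas as pd

# Toy daily vote counts standing in for mgs_byday['count'].
idx = pd.date_range('2014-01-01', periods=10, freq='D')
daily = pd.Series([1, 0, 0, 0, 0, 2, 1, 0, 0, 3], index=idx)

# Longest run of consecutive zero-vote days: label each run by the
# cumulative count of nonzero days, then take the largest zero-run size.
zero = daily.eq(0)
runs = zero.groupby((~zero).cumsum()).sum()
longest_gap = int(runs.max())
print(longest_gap)  # longest streak of days with no votes
```

On the toy series the longest no-vote streak is 4 days; running the same lines on the real daily series would put a number on the "big patches" claim.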


In [ ]: