VoxCharta Part I

Webscraping VoxCharta.org

Michael Gully-Santiago

December 17, 2014

My approach is to copy the data from VoxCharta's usage statistics pages to the clipboard, then immediately save it locally for analysis.


In [26]:
%%html
<script type="text/javascript">
     show=true;
     function toggle(){
         if (show){$('div.input').hide();}else{$('div.input').show();}
            show = !show}
 </script>
 <h2><a href="javascript:toggle()" target="_self">Click to toggle code input</a></h2>


Click to toggle code input


In [1]:
%pylab inline
import numpy as np
from numpy.random import randn
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame


Populating the interactive namespace from numpy and matplotlib

Rather than use Beautiful Soup, I simply copied the data to the clipboard.

But for reproducibility purposes, I immediately saved the data to a .csv file with the code below.

institutes = pd.read_clipboard(header=None, sep='\s{2,}', index_col='Institute',
                               names=['Institute', 'Users', 'Votes', 'Up', 'Down',
                                        'Comments', 'Posts','DateStarted'])
institutes.to_csv('data/institutes.csv')

In [3]:
institutes = pd.read_csv('data/institutes.csv', na_values='-', parse_dates=['DateStarted'])
institutes.tail()


Out[3]:
Institute Users Votes Up Down Comments Posts DateStarted
222 UCSC Outflows 1 2 1 1 0 0 NaT
223 University of Surrey 1 2 1 1 0 0 NaT
224 CWRU 0 0 0 0 0 0 NaT
225 CWRU-Theory 0 0 0 0 0 0 2011-11-21
226 Yale-Clusters 0 0 0 0 0 0 NaT

In [4]:
#institutes['Votes'].cumsum().plot(title='Cumulative distribution')
institutes['Rank'] = institutes.Votes.rank(method='first', ascending=False)

In [5]:
cumsum_normalized = institutes.Votes.cumsum().div(institutes.Votes.cumsum().max())
cumsum_normalized.plot(title='Cumulative distribution of VoxCharta votes')
UT_id = institutes.Institute == 'UT Austin'
plt.scatter(institutes['Rank'][UT_id], cumsum_normalized[UT_id], s=50, c='r', marker='o')
plt.annotate('UTexas', 
             xy=(institutes['Rank'][UT_id], cumsum_normalized[UT_id]), 
             xytext=(30, -30),  # offset the label so the arrow is visible
             textcoords='offset points',
             fontsize=16.0,
             arrowprops=dict(arrowstyle="fancy",
                color="0.5",
                shrinkB=9,
                connectionstyle="arc3,rad=0.3",
                ),
            )

#plt.add_at("mutate", loc=2)


Out[5]:
<matplotlib.text.Annotation at 0x112db39d0>

In [6]:
ids = (cumsum_normalized < 0.5)
count = len(cumsum_normalized)
statement = "{0} institutions are responsible for half of all votes out of {1} institutions"
print statement.format(np.sum(ids), count)
print "They are: "
institutes.Institute[ids]


25 institutions are responsible for half of all votes out of 227 institutions
They are: 
Out[6]:
0                                                  DARK
1                                                   UCR
2                                             STScI/JHU
3                                                  Yale
4                         PUC-Institute of Astrophysics
5                                                  NTHU
6                                                  UCSC
7                                         U. Pittsburgh
8                                            AIfA-Cosmo
9                                                   LAM
10                                           CEA Saclay
11                                         unaffiliated
12                                          Harvard ITC
13                               University of Maryland
14                                             Columbia
15                                                 AIfA
16    Osservatorio Astrofisico di Arcetri - Extragal...
17                                             RSAA-ANU
18                                        UMass Amherst
19                                                  UCB
20                                                  UCI
21                                    Durham University
22                       U. of Sussex, Astronomy Centre
23                                              Caltech
24                                              Cornell
Name: Institute, dtype: object

Once again, I scrape the data manually, then immediately save it to a .csv file:

users = pd.read_clipboard(header=None, sep='\s{2,}', index_col='User',
                               names=['User', 'Votes', 'Up', 'Down',
                                        'Comments', 'Posts','DateUser', 'Affiliation'])
users.to_csv('data/users.csv')

In [7]:
users = pd.read_csv('data/users.csv', parse_dates=['DateUser'], index_col='User')
users.tail()


Out[7]:
Votes Up Down Comments Posts DateUser Affiliation
User
HildaBramlett 0 0 0 0 0 2014-12-13 Chicago-KICP
RoseannShea 0 0 0 0 0 2014-12-13 Chicago-KICP
SabinaDolling 0 0 0 0 0 2014-12-13 Chicago-KICP
AnneBrassell 0 0 0 0 0 2014-12-13 Chicago-KICP
KarryRanclaud 0 0 0 0 0 2014-12-13 Chicago-KICP

In [28]:
users.count() #5169, that's a lot of users!


Out[28]:
Votes          5169
Up             5169
Down           5169
Comments       5169
Posts          5169
DateUser       5169
Affiliation    5169
Rank           5169
dtype: int64

In [8]:
users['Rank'] = users.Votes.rank(method='first', ascending=False)

In [9]:
cumsum_norm_users = users.Votes.cumsum().div(users.Votes.cumsum().max())
cumsum_norm_users.plot(title='Cumulative distribution of VoxCharta votes by user')

gully_id = (users.index == 'gully')

plt.scatter(users['Rank'][gully_id], cumsum_norm_users[gully_id], s=50, c='r', marker='o')
plt.annotate('gully', 
             xy=(users['Rank'][gully_id], cumsum_norm_users[gully_id]), 
             xytext=(30, -30),  # offset the label so the arrow is visible
             textcoords='offset points',
             fontsize=16.0,
             arrowprops=dict(arrowstyle="fancy",
                color="0.5",
                shrinkB=9,
                connectionstyle="arc3,rad=0.3",
                ))


Out[9]:
<matplotlib.text.Annotation at 0x1132975d0>

I want to emphasize that voting for papers is not necessarily virtuous. The 'rank' assigned in this analysis implies nothing about a user's scientific merit; it simply reflects how many times he or she has voted for something on VoxCharta. Indeed, it would be annoying for someone to vote constantly, for no reason, having never read the papers.

Let's restrict the discussion to just UTexas at Austin (my current affiliation).


In [10]:
UTexas = (users.Affiliation == 'UT Austin')
print "There are {} users at UT Austin".format(np.sum(UTexas))


There are 79 users at UT Austin

In [11]:
UTexas_users = users[UTexas].copy()  # copy the slice so later column assignments don't warn
UTexas_users.head()


Out[11]:
Votes Up Down Comments Posts DateUser Affiliation Rank
User
gully 178 178 0 0 1 2011-07-18 UT Austin 144
Steve Finkelstein 119 119 0 0 0 2011-08-23 UT Austin 233
comerford 85 85 0 0 2 2011-07-18 UT Austin 329
H. Kim 85 85 0 1 0 2011-08-11 UT Austin 330
mjohnson 55 55 0 0 0 2012-05-16 UT Austin 488

In [12]:
cumsum_norm_UT = UTexas_users.Votes.cumsum().div(UTexas_users.Votes.cumsum().max())
cumsum_norm_UT.plot(title='Cumulative distribution of VoxCharta votes at UTexas')
UTexas_users['Cumulative'] = cumsum_norm_UT



In [13]:
never_voted = (UTexas_users.Votes == 0)
print "{} users out of {} at UT Austin have never voted".format(np.sum(never_voted), np.sum(UTexas))


23 users out of 79 at UT Austin have never voted

I'm curious to see how the UTexas CDF compares to other institutions'. This is a bit hard to compare directly, since the total number of users differs between institutions, so we might have to normalize the number of voters (i.e. the $x$-axis) too. What we might do instead is use a coarse estimator, like the number of votes cast by the highest-voting user at each institution. Let's come back to that.
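That coarse estimator is a one-liner with a groupby; here is a sketch on toy data (the `toy` frame is a hypothetical stand-in for the `users` table, assuming only `Affiliation` and `Votes` columns):

```python
import pandas as pd

# Toy stand-in for the users table: per-user vote counts with affiliations.
toy = pd.DataFrame({'Affiliation': ['A', 'A', 'B', 'B', 'B'],
                    'Votes': [100, 10, 60, 50, 40]})

# Coarse per-institution estimator: the vote count of each
# institution's highest-voting user.
top_voter = toy.groupby('Affiliation')['Votes'].max()
print(top_voter)
```

This sidesteps the unequal-user-count problem entirely, at the cost of throwing away the shape of each institution's CDF.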

Let's look at my personal voting record

The usual clipboard gimmick.

mgs = pd.read_clipboard(header=None, sep='\s{2,}', index_col='Date',
                               names=['Date', 'Title'])
mgs.to_csv('data/mgs.csv')

In [20]:
mgs = pd.read_csv('data/mgs.csv', parse_dates=['Date'], index_col='Date')
mgs['count'] = mgs.count(axis=1)
mgs.head()


Out[20]:
Title count
Date
2014-12-17 The velocity distribution in the solar neighbo... 1
2014-12-17 High-contrast Imaging with Spitzer: Deep Obser... 1
2014-12-16 A Measurement of the Cosmic Microwave Backgrou... 1
2014-12-15 Photometric Calibration on Lunar-based Ultravi... 1
2014-12-15 ALMA observations of alpha Centauri: First det... 1

In [22]:
grouped = mgs.groupby(level=0)
gsum = grouped.sum()

In [23]:
mgs_byday = gsum.resample('D', how=sum)
mgs_byday.fillna(0, inplace=True)
mgs_bymonth = gsum.resample('M', how=sum)
mgs_bymonth.fillna(0, inplace=True)

In [24]:
mgs_bymonth.plot(title='MGS voting record by month')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x112d36690>

In [25]:
mgs_byday.cumsum().plot(title='MGS cumulative votes as a function of time')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x113caf0d0>

Insights: My voting frequency is irregular. Notably, there were long stretches of time when I did not vote at all; lately I have been voting a lot.
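Those voting gaps could be quantified rather than eyeballed. A sketch, using a toy daily series as a stand-in for the zero-filled `mgs_byday['count']` built above:

```python
import pandas as pd

# Toy daily vote counts standing in for mgs_byday['count'].
idx = pd.date_range('2014-01-01', periods=10, freq='D')
daily = pd.Series([1, 0, 0, 0, 0, 2, 1, 0, 0, 3], index=idx)

# Longest run of consecutive zero-vote days: label each run by the
# cumulative count of nonzero days, then take the largest zero-run size.
zero = daily.eq(0)
runs = zero.groupby((~zero).cumsum()).sum()
longest_gap = int(runs.max())
print(longest_gap)  # longest streak of days with no votes
```

On the toy series the longest no-vote streak is 4 days; running the same lines on the real daily series would put a number on the "big patches" claim.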


In [ ]: