This notebook downloads all 100k+ articles with abstracts from the PLOS API.
nbviewer to share: http://nbviewer.ipython.org/gist/anonymous/11133500
In [32]:
import pandas as pd
import numpy as np
import settings
import requests
import urllib
import time
from retrying import retry
import os
In [2]:
#adapted from Raymond's notebook
def plos_search(q, start=0, rows=100, fl=None, extras=None):
    BASE_URL = 'http://api.plos.org/search'
    DEFAULT_FL = ('abstract', 'author',
                  'id', 'journal', 'publication_date',
                  'score', 'title_display', 'subject', 'subject_level')
    # removed elements: eissn, article_type
    # fl indicates fields to return
    # http://wiki.apache.org/solr/CommonQueryParameters#fl
    if fl is None:
        fl_ = ",".join(DEFAULT_FL)
    else:
        fl_ = ",".join(fl)
    query = {'q': q,
             'start': start,
             'rows': rows,
             'api_key': settings.PLOS_KEY,
             'wt': 'json',
             'fl': fl_,
             'fq': 'doc_type:full AND !article_type_facet:"Issue Image"'}
    if extras is not None:
        query.update(extras)
    query_url = BASE_URL + "?" + urllib.urlencode(query)
    r = requests.get(query_url)
    return r
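For reference, a custom field list or extra Solr parameters can be passed through fl and extras. The sketch below is illustrative only; it is not one of the original cells, and the sort parameter is an assumption that standard Solr sorting is passed through unchanged.
In [ ]:
# Illustrative call (assumption: 'sort' is forwarded as a standard Solr parameter):
# request only two fields and sort by publication date.
r = plos_search(q='subject:"Information technology"',
                rows=5,
                fl=('id', 'title_display'),
                extras={'sort': 'publication_date desc'})
for doc in r.json()['response']['docs']:
    print doc['id'], doc['title_display']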
Need to make sure the calls do not exceed the PLOS API rate limits:
7,200 requests per day, 300 per hour, and 10 per minute, and allow 5 seconds for each search to return results.
To be safe there will be a 15 second wait between each call: at 15 seconds per call that is 4 calls per minute and 240 calls per hour.
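The cell below makes that pacing arithmetic explicit; it is an illustrative sketch, not one of the original cells.
In [ ]:
# Pacing arithmetic for a 15 second wait between calls (illustration only)
WAIT_SECONDS = 15
calls_per_minute = 60 / WAIT_SECONDS      # 4, under the 10-per-minute limit
calls_per_hour = 60 * calls_per_minute    # 240, under the 300-per-hour limit
calls_per_day = 24 * calls_per_hour       # 5760, under the 7200-per-day limit
print calls_per_minute, calls_per_hour, calls_per_day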
Call for all articles
In [27]:
r = plos_search(q='subject:"Information technology"')
Check the total number of articles with abstracts.
In [30]:
tot_articles = r.json()['response']['numFound']
tot_articles
Out[30]:
118545
With 118,545 articles in total, we will have to perform 1,186 API requests at 100 articles per request.
At 240 requests per hour it should take about 5 hours to get all the data needed.
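The same estimate can be recomputed from tot_articles directly; the cell below is an illustrative sketch, assuming 100 articles per request and 240 requests per hour.
In [ ]:
import math

# Rough estimate of the number of requests and the total run time (illustration only)
requests_needed = int(math.ceil(tot_articles / 100.0))  # ~1,186 requests
hours_needed = requests_needed / 240.0                   # ~4.9 hours
print requests_needed, round(hours_needed, 1)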
This function calls plos_search every 15 seconds, incrementing the start offset so that all of the articles can be pulled. If a partial 'abstract_df.pkl' already exists, it resumes from where the previous run left off.
In [34]:
@retry(wait='exponential_sleep', wait_exponential_multiplier=1000, wait_exponential_max=10000,
       stop='stop_after_attempt', stop_max_attempt_number=7)
def data_request(end, start=0):
    # resume from a previous partial run if a pickle already exists
    if os.path.exists('../data/abstract_df.pkl'):
        df = pd.read_pickle('../data/abstract_df.pkl')
        start = len(df)
    current_end = end - start
    loops = (current_end / 100) + 1
    for n in range(loops):
        r = plos_search(q='subject:"Information technology"', start=start)
        # store data before the next call
        data = r.json()['response']['docs']
        if start == 0:
            abstract_df = pd.DataFrame(data)
        else:
            df = pd.read_pickle('../data/abstract_df.pkl')
            abstract_df = df.append(pd.DataFrame(data))
        # increment the start for the next request
        start += 100
        # pickle the dataframe after every request
        abstract_df.to_pickle('../data/abstract_df.pkl')
        # wait 15 seconds before the next loop
        time.sleep(15)
    # pickle when finished
    abstract_df.to_pickle('../data/abstract_df.pkl')
    return abstract_df
Now we can run the function, passing tot_articles as the end parameter.
Make sure that 'abstract_df.pkl' does not already exist in the data directory before running; if it does, data_request will resume from it instead of starting fresh.
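If a completely fresh run is wanted, any leftover pickle can be removed first. The cell below is a small optional sketch using the same path as data_request.
In [ ]:
# Optional: remove a leftover pickle so data_request starts from the beginning
pickle_path = '../data/abstract_df.pkl'
if os.path.exists(pickle_path):
    os.remove(pickle_path)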
In [35]:
abstract_df = data_request(end=tot_articles)
In [9]:
abstract_df = pd.read_pickle('abstract_df.pkl')
In [13]:
len(list(abstract_df.author))
Out[13]:
In [ ]:
print list(abstract_df.subject)[0]
In [12]:
abstract_df.tail()
Out[12]:
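Finally, a quick check that the retrying library's @retry decorator behaves as expected, using (roughly) the example from its README.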
In [26]:
import random

@retry
def do_something_unreliable():
    if random.randint(0, 2) > 1:
        raise IOError("Broken sauce, everything is hosed!!!111one")
    else:
        return "Awesome sauce!"

print do_something_unreliable()
In [ ]: