Batch Data Collection

This notebook will download all 100k+ articles with abstracts from the PLOS API.

nbviewer to share: http://nbviewer.ipython.org/gist/anonymous/11133500

Imports


In [32]:
import pandas as pd
import numpy as np 
import settings
import requests
import urllib
import time
from retrying import retry
import os

API Call Function


In [2]:
#adapted from Raymond's notebook

def plos_search(q, start=0, rows=100, fl=None, extras=None):

    BASE_URL = 'http://api.plos.org/search'
    DEFAULT_FL = ('abstract', 'author',
                  'id', 'journal', 'publication_date',
                  'score', 'title_display', 'subject', 'subject_level')
    # removed elements: eissn, article_type
    
    # fl indicates fields to return
    # http://wiki.apache.org/solr/CommonQueryParameters#fl
    
    if fl is None:
        fl_ = ",".join(DEFAULT_FL)
    else:
        fl_ = ",".join(fl)
        
    query = {'q':q,
             'start':start,
             'rows':rows,
             'api_key':settings.PLOS_KEY,
             'wt':'json',
             'fl':fl_,
             'fq': 'doc_type:full AND !article_type_facet:"Issue Image"'}
    
    if extras is not None:
        query.update(extras)
        
    query_url = BASE_URL + "?" + urllib.urlencode(query)
    
    r = requests.get(query_url)
    return r
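
As a quick sanity check before any batching, the response can be inspected directly. This is a sketch, assuming settings.PLOS_KEY holds a valid API key; it just prints the hit count and the DOI of the first returned article.

In [ ]:
# sketch: inspect a single response before batching (assumes a valid settings.PLOS_KEY)
resp = plos_search(q='subject:"Information technology"', rows=5)
body = resp.json()['response']
print body['numFound']       # total number of matching articles
print body['docs'][0]['id']  # DOI of the first returned article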

Finding Parameters

We need to make sure the calls do not exceed the API's rate limits:

7,200 requests a day, 300 per hour, 10 per minute, and allow 5 seconds for your search to return results.

To be safe there will be a 15-second wait between each call: 15 seconds per call works out to 4 calls per minute, or 240 calls per hour.
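
That throttling could be wrapped around plos_search like the sketch below; the throttled_search helper is hypothetical and not part of the original notebook, which instead sleeps inside the batch loop further down.

In [ ]:
# sketch: throttled request helper (15 s pause => 4 requests/min, 240 requests/hr)
def throttled_search(**kwargs):
    r = plos_search(**kwargs)
    time.sleep(15)  # stay well under 10 requests/minute and 300/hour
    return r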

Call for all articles


In [27]:
r = plos_search(q='subject:"Information technology"')

Check the total number of articles with abstracts.


In [30]:
tot_articles = r.json()['response']['numFound']
tot_articles


Out[30]:
1120

With 118,545 articles in total, that means we will have to perform 1,186 API requests at 100 articles per request.

At 240 requests per hour, it should take about 5 hours to get all the data needed.
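
The same estimate can be worked out from tot_articles, assuming 100 rows per request and a 15-second pause after each one:

In [ ]:
# rough time estimate: number of 100-row requests and hours at 15 s per request
n_requests = (tot_articles / 100) + 1  # integer division; ~1,186 for 118,545 articles
hours = n_requests * 15 / 3600.0       # ~5 hours
print n_requests, hours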

Looping Function

This function will call the plos_search function every 15 seconds while incrementing the start number so that all of the articles can be pulled.


In [34]:
@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, stop_max_attempt_number=7)
def data_request(end, start=0):
    # resume from an existing pickle if one is present
    if os.path.exists('../data/abstract_df.pkl'):
        df = pd.read_pickle('../data/abstract_df.pkl')
        start = len(df)
    # number of 100-row requests still needed (Python 2 integer division)
    current_end = end - start
    loops = (current_end/100) + 1
    for n in range(loops):
        r = plos_search(q='subject:"Information technology"', start=start)
        
        #store data before next call
        data = r.json()['response']['docs']
        if start == 0:
            abstract_df = pd.DataFrame(data)
        else:
            df = pd.read_pickle('../data/abstract_df.pkl')
            abstract_df = df.append(pd.DataFrame(data))
        
        #increment the start for the next request
        start += 100
        
        #every request pickle the dataframe
        abstract_df.to_pickle('../data/abstract_df.pkl')
        
        #wait 15 seconds before the next loop
        time.sleep(15)
        
    #pickle when finished
    abstract_df.to_pickle('../data/abstract_df.pkl')
    
    return abstract_df

Now we can run the function, passing tot_articles as the end parameter.

Make sure that 'abstract_df.pkl' does not already exist in the data directory before running, otherwise the function will resume from it.
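
One way to guard against a leftover pickle is to delete it first; this check is a sketch and was not part of the original run.

In [ ]:
# optional: remove a leftover pickle so the collection starts from scratch
if os.path.exists('../data/abstract_df.pkl'):
    os.remove('../data/abstract_df.pkl')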


In [35]:
abstract_df = data_request(end=tot_articles)

Exploring Output


In [9]:
abstract_df = pd.read_pickle('abstract_df.pkl')

In [13]:
len(list(abstract_df.author))


Out[13]:
300

In [ ]:
print list(abstract_df.subject)[0]

In [12]:
abstract_df.tail()


Out[12]:
abstract author id journal publication_date score subject title_display
95 [Background: Sperm from C57BL/6 mice are diffi... [G Charles Ostermeier, Michael V Wiles, Jane S... 10.1371/journal.pone.0002792 PLoS ONE 2008-07-30T00:00:00Z 1 [/Biology and life sciences/Biotechnology/Gene... Conserving, Distributing and Managing Genetica...
96 [Background: This study was conducted to deter... [Mahamadou S Sissoko, Abdoulaye Dabo, Hamidou ... 10.1371/journal.pone.0006732 PLoS ONE 2009-10-05T00:00:00Z 1 [/Research and analysis methods/Research desig... Efficacy of Artesunate + Sulfamethoxypyrazine/...
97 [Background: In southern China, a wild ectomyc... [Mochan Li, Junfeng Liang, Yanchun Li, Bang Fe... 10.1371/journal.pone.0010684 PLoS ONE 2010-05-18T00:00:00Z 1 [/Research and analysis methods/Molecular biol... Genetic Diversity of Dahongjun, the Commercial...
98 [Background: To examine the corneal epithelial... [I-Jong Wang, Ray Jui-Fang Tsai, Lung-Kun Yeh,... 10.1371/journal.pone.0014537 PLoS ONE 2011-01-14T00:00:00Z 1 [/Biology and life sciences/Anatomy/Ocular sys... Changes in Corneal Basal Epithelial Phenotypes...
99 [\n In order to further understand the ... [Daniel Wiswede, Svenja Taubner, Thomas F Münt... 10.1371/journal.pone.0022599 PLoS ONE 2011-07-21T00:00:00Z 1 [/Biology and life sciences/Psychology/Behavio... Neurophysiological Correlates of Laboratory-In...

5 rows × 8 columns

Cleaning Output
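
One plausible cleaning step, sketched here as an assumption rather than the notebook's actual procedure: because resumed runs append to the pickled frame, duplicate DOIs can be dropped and the index reset.

In [ ]:
# sketch: drop rows with duplicate DOIs left by resumed runs and reset the index
abstract_df = abstract_df[~abstract_df['id'].duplicated()].reset_index(drop=True)
len(abstract_df)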

Testing retry decorator

This is making sure that the retry decorator works


In [26]:
import random

@retry
def do_something_unreliable():
    if random.randint(0, 2) > 1:
        raise IOError("Broken sauce, everything is hosed!!!111one")
    else:
        return "Awesome sauce!"

print do_something_unreliable()


Awesome sauce!
