This notebook downloads all 100k+ articles with abstracts from the PLOS API.
nbviewer to share: http://nbviewer.ipython.org/gist/anonymous/11133500
In [32]:
import pandas as pd
import numpy as np
import settings
import requests
import urllib
import time
from retrying import retry
import os
In [2]:
#adapted from Raymond's notebook
def plos_search(q, start=0, rows=100, fl=None, extras=None):
    BASE_URL = 'http://api.plos.org/search'
    DEFAULT_FL = ('abstract', 'author',
                  'id', 'journal', 'publication_date',
                  'score', 'title_display', 'subject', 'subject_level')
    # removed elements: eissn, article_type
    # fl indicates fields to return
    # http://wiki.apache.org/solr/CommonQueryParameters#fl
    if fl is None:
        fl_ = ",".join(DEFAULT_FL)
    else:
        fl_ = ",".join(fl)
    query = {'q': q,
             'start': start,
             'rows': rows,
             'api_key': settings.PLOS_KEY,
             'wt': 'json',
             'fl': fl_,
             'fq': 'doc_type:full AND !article_type_facet:"Issue Image"'}
    if extras is not None:
        query.update(extras)
    query_url = BASE_URL + "?" + urllib.urlencode(query)
    r = requests.get(query_url)
    return r
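For reference, a custom field list or extra Solr parameters can be passed through fl and extras. The sketch below is illustrative only; it is not one of the original cells, and the sort parameter is an assumption that standard Solr sorting is passed through unchanged.
In [ ]:
# Illustrative call (assumption: 'sort' is forwarded as a standard Solr parameter):
# request only two fields and sort by publication date.
r = plos_search(q='subject:"Information technology"',
                rows=5,
                fl=('id', 'title_display'),
                extras={'sort': 'publication_date desc'})
for doc in r.json()['response']['docs']:
    print doc['id'], doc['title_display']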
Need to make sure the calls do not exceed the PLOS API rate limits:
7,200 requests per day, 300 per hour, and 10 per minute, and allow 5 seconds for each search to return results.
To be safe there will be a 15 second wait between each call: at 15 seconds per call that is 4 calls per minute and 240 calls per hour.
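The cell below makes that pacing arithmetic explicit; it is an illustrative sketch, not one of the original cells.
In [ ]:
# Pacing arithmetic for a 15 second wait between calls (illustration only)
WAIT_SECONDS = 15
calls_per_minute = 60 / WAIT_SECONDS      # 4, under the 10-per-minute limit
calls_per_hour = 60 * calls_per_minute    # 240, under the 300-per-hour limit
calls_per_day = 24 * calls_per_hour       # 5760, under the 7200-per-day limit
print calls_per_minute, calls_per_hour, calls_per_day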
Call for all articles
In [27]:
r = plos_search(q='subject:"Information technology"')
Check the total number of articles with abstracts.
In [30]:
tot_articles = r.json()['response']['numFound']
tot_articles
Out[30]:
118545
With 118,545 articles in total, we will have to perform 1,186 API requests at 100 articles per request.
At 240 requests per hour it should take about 5 hours to get all the data needed.
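The same estimate can be recomputed from tot_articles directly; the cell below is an illustrative sketch, assuming 100 articles per request and 240 requests per hour.
In [ ]:
import math

# Rough estimate of the number of requests and the total run time (illustration only)
requests_needed = int(math.ceil(tot_articles / 100.0))  # ~1,186 requests
hours_needed = requests_needed / 240.0                   # ~4.9 hours
print requests_needed, round(hours_needed, 1)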
This function calls plos_search every 15 seconds, incrementing the start offset so that all of the articles can be pulled. If a partial 'abstract_df.pkl' already exists, it resumes from where the previous run left off.
In [34]:
@retry(wait='exponential_sleep', wait_exponential_multiplier=1000, wait_exponential_max=10000,
       stop='stop_after_attempt', stop_max_attempt_number=7)
def data_request(end, start=0):
    # resume from a previous partial run if a pickle already exists
    if os.path.exists('../data/abstract_df.pkl'):
        df = pd.read_pickle('../data/abstract_df.pkl')
        start = len(df)
    current_end = end - start
    loops = (current_end / 100) + 1
    for n in range(loops):
        r = plos_search(q='subject:"Information technology"', start=start)
        # store data before the next call
        data = r.json()['response']['docs']
        if start == 0:
            abstract_df = pd.DataFrame(data)
        else:
            df = pd.read_pickle('../data/abstract_df.pkl')
            abstract_df = df.append(pd.DataFrame(data))
        # increment the start for the next request
        start += 100
        # pickle the dataframe after every request
        abstract_df.to_pickle('../data/abstract_df.pkl')
        # wait 15 seconds before the next loop
        time.sleep(15)
    # pickle when finished
    abstract_df.to_pickle('../data/abstract_df.pkl')
    return abstract_df
Now we can run the function, passing tot_articles as the end parameter.
Make sure that 'abstract_df.pkl' does not already exist in the data directory before running; if it does, data_request will resume from it instead of starting fresh.
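If a completely fresh run is wanted, any leftover pickle can be removed first. The cell below is a small optional sketch using the same path as data_request.
In [ ]:
# Optional: remove a leftover pickle so data_request starts from the beginning
pickle_path = '../data/abstract_df.pkl'
if os.path.exists(pickle_path):
    os.remove(pickle_path)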
In [35]:
abstract_df = data_request(end=tot_articles)
In [9]:
abstract_df = pd.read_pickle('abstract_df.pkl')
In [13]:
len(list(abstract_df.author))
Out[13]:
In [ ]:
print list(abstract_df.subject)[0]
In [12]:
abstract_df.tail()
Out[12]:
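Finally, a quick check that the retrying library's @retry decorator behaves as expected, using (roughly) the example from its README.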
In [26]:
import random

@retry
def do_something_unreliable():
    if random.randint(0, 2) > 1:
        raise IOError("Broken sauce, everything is hosed!!!111one")
    else:
        return "Awesome sauce!"

print do_something_unreliable()
In [ ]: