This notebook documents our process for figuring out what PLOS Cloud Explorer was going to be. It includes early code, prototypes, and dead ends.
For the full story, including the happy ending, read this document and follow the links to the other notebooks to see the code we actually used.
First things first. All imports for this notebook:
In [1]:
from __future__ import unicode_literals
# You need an API Key for PLOS
import settings
# Data analysis
import numpy as np
import pandas as pd
from numpy import nan
from pandas import Series, DataFrame
# Interacting with API
import requests
import urllib
import time
from retrying import retry
import os
import random
import json
# Natural language processing
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.corpus import stopwords
import string
# For the IPython widgets:
from IPython.display import display, Image, HTML, clear_output
from IPython.html import widgets
from jinja2 import Template
We began with a really simple way of getting article data from the PLOS Search API:
In [2]:
r = requests.get('http://api.plos.org/search?q=subject:"biotechnology"&start=0&rows=500&api_key={%s}&wt=json' % settings.PLOS_KEY).json()
len(r['response']['docs'])
Out[2]:
In [ ]:
# Write out a file.
with open('biotech500.json', 'wb') as fp:
    json.dump(r, fp)
We later developed a much more sophisticated way to get huge amounts of data from the API. To see how we collected data sets, see the batch data collection notebook.
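The real collection code lives in that notebook, but the core idea is simply to page through the search API with the start and rows parameters. Here is a rough sketch of that idea (the fetch_all name, defaults, and sleep interval are illustrative, not the code we actually ran):
# Illustrative sketch of paging through the search API; the real batch
# collection code is in the batch data collection notebook.
def fetch_all(query, rows=500, max_records=5000):
    docs = []
    start = 0
    while start < max_records:
        resp = requests.get('http://api.plos.org/search', params={
            'q': query,
            'start': start,
            'rows': rows,
            'api_key': settings.PLOS_KEY,
            'wt': 'json',
        }).json()
        batch = resp['response']['docs']
        if not batch:
            break
        docs.extend(batch)
        start += rows
        time.sleep(1)  # be gentle with the API between requests
    return docs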
Here we show what the output looks like, from a previously run API query. Through the magic of Python, we can pickle the resulting DataFrame and access it again now without making any API calls.
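(The pickle itself would have been written in the collection notebook with something along these lines; the path matches the one we read back below.)
abstract_df.to_pickle('../data/abstract_df.pkl')  # presumably how the pickle was saved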
In [8]:
abstract_df = pd.read_pickle('../data/abstract_df.pkl')
In [9]:
len(list(abstract_df.author))
Out[9]:
In [10]:
print list(abstract_df.subject)[0]
In [11]:
abstract_df.tail()
Out[11]:
We wanted to use basic natural language processing (NLP) to make word clouds out of aggregated abstract text, and see how they change over time.
NB: These examples use a previously collected dataset that's different and smaller than the one we generated above.
In [12]:
# Globally define a set of stopwords.
stops = set(stopwords.words('english'))
# We can add science-y stuff to it as well. Just an example:
stops.add('conclusions')
def wordify(abs_list, min_word_len=2):
    '''
    Convert the abstract field from PLoS API data to a filtered list of words.
    '''
    # The abstract field is a list. Make it a string.
    text = ' '.join(abs_list).strip(' \n\t')
    if text == '':
        return nan
    else:
        # Remove punctuation & replace with space,
        # because we want 'metal-contaminated' => 'metal contaminated'
        # ...not 'metalcontaminated', and so on.
        for c in string.punctuation:
            text = text.replace(c, ' ')
        # Now make it a Series of words, and do some cleaning.
        words = Series(text.split(' '))
        words = words.str.lower()
        # Filter out words shorter than the minimum word length.
        words = words[words.str.len() >= min_word_len]
        # Keep only tokens made up entirely of lowercase letters (plus '#' and '@');
        # this drops anything containing digits, non-ASCII characters, etc.
        words = words[~words.str.contains(r'[^#@a-z]')]
        # Filter out globally-defined stopwords.
        ignore = stops & set(words.unique())
        words_out = [w for w in words.tolist() if w not in ignore]
        return words_out
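As a quick sanity check, here is what wordify does to a made-up abstract (the input string is invented for illustration; the exact output depends on the stopword list):
wordify(['Detection of metal-contaminated soils and water.'])
# -> ['detection', 'metal', 'contaminated', 'soils', 'water']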
Load up some data.
In [13]:
with open('biotech500.json', 'rb') as fp:
    data = json.load(fp)
articles_list = data['response']['docs']
articles = DataFrame(articles_list)
articles = articles[articles['abstract'].notnull()]
articles.head()
Out[13]:
Applying this to the whole DataFrame of articles
In [14]:
articles['words'] = articles.apply(lambda s: wordify(s['abstract'] + [s['title_display']]), axis=1)
articles.drop(['article_type', 'score', 'title_display', 'abstract'], axis=1, inplace=True)
articles.head()
Out[14]:
In [15]:
abs_df = DataFrame(articles['words'].apply(lambda x: ' '.join(x)).tolist(), columns=['text'])
abs_df.head()
Out[15]:
This section uses all the words from the abstracts to find common word pairs (bigrams).
In [16]:
# Include all words from abstracts to find common word pairs.
words_all = pd.Series(' '.join(abs_df['text']).split(' '))
words_all.value_counts()
Out[16]:
In [17]:
relevant_words_pairs = words_all.copy()
relevant_words_pairs.value_counts()
Out[17]:
In [18]:
bcf = BigramCollocationFinder.from_words(relevant_words_pairs)
for pair in bcf.nbest(BigramAssocMeasures.likelihood_ratio, 30):
    print ' '.join(pair)
In [19]:
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 20)
Out[19]:
Here, we take only the unique words from each abstract.
In [20]:
abs_set_df = DataFrame(articles['words'].apply(lambda x: ' '.join(set(x))).tolist(), columns=['text'])
abs_set_df.head()
Out[20]:
In [21]:
words = pd.Series(' '.join(abs_set_df['text']).split(' '))
words.value_counts()
Out[21]:
In [22]:
top_words = words.value_counts().reset_index()
top_words.columns = ['word', 'count']
top_words.head(15)
Out[22]:
In [23]:
# top_words.to_csv('../wordcloud2.csv', index=False)
When we created the word clouds, we noticed something about the most common words in these article abstracts...
In [24]:
articles_list = data['response']['docs']
articles = DataFrame(articles_list)
articles = articles[articles['abstract'].notnull()].ix[:,['abstract', 'publication_date']]
# Use a minimum word length of 3 this time.
articles.abstract = articles.abstract.apply(wordify, min_word_len=3)
articles = articles[articles['abstract'].notnull()]
articles.publication_date = pd.to_datetime(articles.publication_date)
articles.head()
Out[24]:
In [25]:
print articles.publication_date.min(), articles.publication_date.max()
print len(articles)
The time series spans ~9 years with 57 data points. We need to resample!
There are probably many ways to do this...
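One alternative is a plain groupby on a year-month key, concatenating each month's word lists with Python's sum (a rough sketch only; we went with resample below, and unlike resample this skips months with no articles):
# Illustrative alternative: concatenate abstract word lists per year-month with groupby.
by_month = articles.groupby(articles.publication_date.apply(lambda d: d.strftime('%Y-%m')))
monthly_text = by_month.abstract.apply(lambda lists: ' '.join(sum(lists.tolist(), [])))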
In [26]:
articles_timed = articles.set_index('publication_date')
articles_timed.head()
Out[26]:
Using the sum aggregation method works because all the values were lists. The three abstracts published in 2013-05 were concatenated together (see below).
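A tiny illustration of why summing works here (plain Python, nothing pandas-specific): adding lists concatenates them.
sum([['gene', 'therapy'], ['crops'], ['soil']], [])
# -> ['gene', 'therapy', 'crops', 'soil']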
In [27]:
articles_monthly = articles_timed.resample('M', how='sum', fill_method='ffill', kind='period')
# Months with no articles come through the sum as 0; turn those into NaN
# so they can be forward-filled from the previous month.
articles_monthly.abstract = articles_monthly.abstract.apply(lambda x: np.nan if x == 0 else x)
articles_monthly.fillna(method='ffill', inplace=True)
articles_monthly.head()
Out[27]:
In [28]:
widgetmax = len(articles_monthly) - 1
def textbarf(t):
    '''Render the aggregated abstract text for month number t in a styled div.'''
    html_template = """
    <style>
    #textbarf {
        display: block;
        width: 666px;
        padding: 23px;
        background-color: #ddeeff;
    }
    </style>
    <div id="textbarf"> {{blargh}} </div>"""
    blob = ' '.join(articles_monthly.ix[t]['abstract'])
    html_src = Template(html_template).render(blargh=blob)
    display(HTML(html_src))
In [29]:
widgets.interact(textbarf,
                 t=widgets.IntSliderWidget(min=0, max=widgetmax, step=1, value=42),
                 )
Out[29]: