In [1]:

    
from __future__ import unicode_literals
import json
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

Transforming batch-collected data

The target structure for article information is a JSON object keyed by DOI:

articles:
  { <doi>:
      { author: [ ... ]
        title:
        journal:
        publication_date: <yyyy>
        subject: [ <full subject /-separated strings>, ... ]
        subj_top: [ set of top levels of each subject ]
        subj_leaf: [ set of last terms of each subject ]
      },
    ...
  }



In [2]:

    
df = pd.read_pickle('../data/abstract_df.pkl')

# Dropping abstract and score.
df.drop(['abstract', 'score'], axis=1, inplace=True)
df.set_index('id', inplace=True)
df.columns = ['author', 'journal', 'publication_date', 'subject', 'title']
df = df.reindex(columns=['author', 'title', 'journal', 'publication_date', 'subject'])
# We just want the year.
df.publication_date = df.publication_date.str[:4]

df.head()









    Out[2]:

id                            author                                             title                                              journal                           publication_date  subject
10.1371/journal.pntd.0000413  [Darren J Gray, Simon J Forsyth, Robert S Li, ...  An Innovative Database for Epidemiological Fie...  PLoS Neglected Tropical Diseases  2009              [/Computer and information sciences/Informatio...
10.1371/journal.pone.0083016  [Pedro Lopes, Tiago Nunes, David Campos, Laura...  Gathering and Exploring Scientific Knowledge i...  PLoS ONE                          2013              [/Medicine and health sciences/Pharmacology/Dr...
10.1371/journal.pmed.0030249  [Matthew E Falagas]                                Unique Author Identification Number in Scienti...  PLoS Medicine                     2006              [/Research and analysis methods/Database and i...
10.1371/journal.pone.0073275  [George J Besseris]                                A Distribution-Free Multi-Factorial Profiler f...  PLoS ONE                          2013              [/Computer and information sciences/Informatio...
10.1371/journal.pone.0043558  [Yuncui Hu, Yanpeng Li, Hongfei Lin, Zhihao Ya...  Integrating Various Resources for Gene Name No...  NaN                               2012              [/Computer and information sciences/Informatio...

5 rows × 5 columns



In [3]:

    
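def get_subj_top(subjects):
    """Return the set of top-level terms of the given subject paths."""
    subj_top = set()
    for s in subjects:
        # Each path starts with '/', so split('/')[0] is an empty string
        # and the top-level term sits at index 1.
        subj_top.add(s.split('/')[1])
    return subj_top

def get_subj_leaf(subjects):
    """Return the set of last (leaf) terms of the given subject paths."""
    subj_leaf = set()
    for s in subjects:
        subj_leaf.add(s.split('/')[-1])
    return subj_leaf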
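
A quick check on a hand-made list of paths can confirm the indexing used by the two helpers. The sketch below re-declares compact versions of them so that it runs on its own. The sample paths are assembled from terms that appear in the thesaurus excerpts in this notebook; they are illustrative and do not come from a real article.

```python
def get_subj_top(subjects):
    # Paths start with '/', so index 0 of the split is '' and index 1 is the top tier.
    return set(s.split('/')[1] for s in subjects)

def get_subj_leaf(subjects):
    # The last element of the split is the most specific term.
    return set(s.split('/')[-1] for s in subjects)

# Illustrative subject list for one article (an assumption, not real data).
sample = [
    '/Computer and information sciences/Information technology/Data mining',
    '/Computer and information sciences/Information technology/Databases/Relational databases',
    '/Science policy/Research funding/Research grants',
]

first_split = '/Science policy'.split('/')   # ['', 'Science policy']
top = get_subj_top(sample)
leaf = get_subj_leaf(sample)
```

Two of the three paths share a top level, so `top` has two members while `leaf` has three.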



In [4]:

    
df['subj_top'] = df.subject.apply(get_subj_top)
df['subj_leaf'] = df.subject.apply(get_subj_leaf)



In [5]:

    
df.head()









    Out[5]:

id                            author                                             title                                              journal                           publication_date  subject                                            subj_top                                           subj_leaf
10.1371/journal.pntd.0000413  [Darren J Gray, Simon J Forsyth, Robert S Li, ...  An Innovative Database for Epidemiological Fie...  PLoS Neglected Tropical Diseases  2009              [/Computer and information sciences/Informatio...  set([Physical sciences, Medicine and health sc...  set([Statistical methods, Infectious disease c...
10.1371/journal.pone.0083016  [Pedro Lopes, Tiago Nunes, David Campos, Laura...  Gathering and Exploring Scientific Knowledge i...  PLoS ONE                          2013              [/Medicine and health sciences/Pharmacology/Dr...  set([Medicine and health sciences, Engineering...  set([Signal processing, Engines, Drug interact...
10.1371/journal.pmed.0030249  [Matthew E Falagas]                                Unique Author Identification Number in Scienti...  PLoS Medicine                     2006              [/Research and analysis methods/Database and i...  set([Engineering and technology, Research and ...  set([Database searching, Electronics, Data pro...
10.1371/journal.pone.0073275  [George J Besseris]                                A Distribution-Free Multi-Factorial Profiler f...  PLoS ONE                          2013              [/Computer and information sciences/Informatio...  set([Biology and life sciences, Physical scien...  set([Statistical distributions, Engineering an...
10.1371/journal.pone.0043558  [Yuncui Hu, Yanpeng Li, Hongfei Lin, Zhihao Ya...  Integrating Various Resources for Gene Name No...  NaN                               2012              [/Computer and information sciences/Informatio...  set([Biology and life sciences, Physical scien...  set([Text mining, Gene mapping, Entity disambi...

5 rows × 7 columns

Check the JSON for the first few rows before exporting:



In [6]:

    
#df.head().to_json(orient='index', force_ascii=False)

If everything looks right, export:



In [7]:

    
#df.to_json(path_or_buf='../data/articles.json', orient='index', force_ascii=False)
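
One thing to watch in the export: subj_top and subj_leaf hold Python sets, and the standard json module refuses to serialize sets. Whether to_json accepts them may depend on the pandas version, which this notebook does not state. If it ever becomes a problem, one option is to turn the sets into sorted lists first. The sketch below uses plain dicts rather than the DataFrame; `to_jsonable`, the DOI, the author, and the title are hypothetical placeholders.

```python
import json

def to_jsonable(record):
    """Copy a record, turning any set value into a sorted list."""
    out = {}
    for key, value in record.items():
        if isinstance(value, (set, frozenset)):
            out[key] = sorted(value)
        else:
            out[key] = value
    return out

# Placeholder record in the target shape, keyed by a made-up DOI.
articles = {
    '10.1371/journal.example.0000000': {
        'author': ['A. Author'],
        'title': 'Example title',
        'journal': 'PLoS ONE',
        'publication_date': '2013',
        'subject': ['/Science policy/Research funding/Research grants'],
        'subj_top': {'Science policy'},
        'subj_leaf': {'Research grants'},
    }
}

clean = dict((doi, to_jsonable(rec)) for doi, rec in articles.items())
payload = json.dumps(clean, ensure_ascii=False)
roundtrip = json.loads(payload)
```

Sorting the members also makes the exported file stable from run to run, since set iteration order is not guaranteed.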

Transforming the PLOS thesaurus

The PLOS thesaurus was kindly provided to us as a spreadsheet with thousands of rows, one node per row. It is a polyhierarchy represented in the form of a tree. We need to transform it into a JSON object that also includes article counts for all the nodes in the tree.

An example of the desired data structure for the PLOS thesaurus:

{ 
  "name": "Computer and information sciences",
  "count": ###,
  "children": [
    {
      "name": "Information technology",
      "count": ###,
      "children": [
        {"name": "Data mining", "count": ###},
        {"name": "Data reduction", "count": ###}, 
        {
          "name":  "Databases",
          "count": ###,
          "children": [
            {"name": "Relational databases", "count": ###}
          ]
        },
        ...,
        {"name": "Text mining","count": ###} 
      ]
    }
  ]
}, 
...

In Python, each node is a dict, and its children are a list of dicts. The whole thesaurus is a list of top-level nodes, so it is itself a list of dicts.
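
As a minimal sketch of that shape, the snippet below writes out a trimmed piece of the Science policy branch by hand, using names and counts from the prototype output further down in this notebook, and walks it recursively. `iter_paths` is a hypothetical helper, not part of the notebook's code.

```python
import json

# Trimmed, hand-written copy of part of the Science policy branch.
tree = [
    {
        'name': 'Science policy',
        'count': 31,
        'children': [
            {'name': 'Bioethics', 'count': 2},
            {
                'name': 'Research integrity',
                'count': 5,
                'children': [
                    {'name': 'Publication ethics', 'count': 3},
                    {'name': 'Scientific misconduct', 'count': 0},
                ],
            },
        ],
    },
]

def iter_paths(nodes, prefix=''):
    """Yield (path, count) for every node, parents before children."""
    for node in nodes:
        path = prefix + '/' + node['name'] if prefix else node['name']
        yield path, node['count']
        # Leaves carry no 'children' key, so fall back to an empty list.
        for item in iter_paths(node.get('children', []), path):
            yield item

paths = list(iter_paths(tree))
as_json = json.dumps(tree)   # a list of dicts serializes directly
```

Leaf nodes simply omit the "children" key, which is why the walk uses `node.get('children', [])`.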



In [8]:

    
# Let's make sure we are counting articles correctly for each subject node.

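def count_articles(df, subject_path):
    # Stringify each article's subject list, then count the articles whose
    # string contains the path. regex=False makes this a literal substring
    # match, so characters such as parentheses in a term are not read as regex.
    s = df.subject.apply(lambda s: str(s))
    matching = s[s.str.contains(subject_path, regex=False)]
    return len(matching)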

print 'Total articles:', len(df)
print 'Science policy:', count_articles(df, 'Science policy')
print 'Science policy/Bioethics:', count_articles(df, 'Science policy/Bioethics')









    



Total articles: 1120
Science policy: 31
Science policy/Bioethics: 2
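
Note that count_articles matches the path anywhere inside the stringified subject list, so a string that is only the beginning of a term is counted too. A stricter variant compares whole path segments. The sketch below works on plain lists rather than the DataFrame; `has_subject`, `count_matching`, and the two sample articles are hypothetical.

```python
def has_subject(subject_paths, target):
    """True if `target` matches a run of whole terms in any subject path."""
    target_parts = target.strip('/').split('/')
    n = len(target_parts)
    for path in subject_paths:
        parts = path.strip('/').split('/')
        # Slide a window of n terms along the path and compare term by term.
        for i in range(len(parts) - n + 1):
            if parts[i:i + n] == target_parts:
                return True
    return False

def count_matching(articles, target):
    return sum(1 for paths in articles if has_subject(paths, target))

# Two made-up articles, each with a single subject path.
articles = [
    ['/Science policy/Research funding/Research grants'],
    ['/Science policy/Research integrity/Publication ethics'],
]

partial = 'Science policy/Research'   # deliberately cut off mid-term
loose = sum(1 for paths in articles if partial in str(paths))
strict = count_matching(articles, partial)
```

Here the substring test counts both articles for the cut-off term, while the segment comparison counts none.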



In [9]:

    
import xlrd
from collections import defaultdict

def tree_from_spreadsheet(f, df, verbose=False):
    
    subjects = df.subject.apply(lambda s: str(s))
    
    book = xlrd.open_workbook(f)
    pt = book.sheet_by_index(0)
    # spreadsheet cells : (row, col) :: cell A1 = (0, 0)

    # Initialize a list to contain the thesaurus.
    # Our test case will only have one item in this list.
    pt_test = []

    # Keep track of the path in the tree.
    cur_path = Series([np.nan]*10)

    for r in range(1, pt.nrows):
        # Start on row two.

        # Columns: the hierarchy goes up to 10 tiers.
        for c in range(10):
            if pt.cell_value(r, c):
                # If this condition is satisfied, we are at the node that's in this line.

                # Construct the path to this node.
                # Clean strings because some terms (RNA nomenclature) cause unicode error
                text = pt.cell_value(r, c).replace(u'\u2019', "'")
                cur_path[c] = text
                cur_path[c+1:] = np.nan
                path_list = list(cur_path.dropna())
                tier = len(path_list)
                path_str = '/'.join(path_list)
                if verbose:
                    print tier, path_str

                # Add the node to the JSON-like tree structure.
                node = defaultdict(list)
                node['name'] = text
                node['count'] = len(subjects[subjects.str.contains(path_str, regex=False)])

                # Walk down from the root list through the most recently added
                # node at each tier above this one, then attach the new node.
                # Because node is a defaultdict(list), 'children' is created
                # the first time it is accessed.
                siblings = pt_test
                for _ in range(tier - 1):
                    siblings = siblings[-1]['children']
                siblings.append(node)

                # Go to next row after finding a term. There is only one term listed per row.
                break

    return pt_test
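
The nesting step can be exercised without xlrd or the spreadsheet. In this sketch, rows are given as (tier, name) pairs in the depth-first order used by the thesaurus, and each node is attached by walking down from the root list through the most recently added node at each tier. `build_tree` is a hypothetical helper; the rows are a subset of the Science policy branch, and counts are left out because they depend on the article data.

```python
from collections import defaultdict
import json

def build_tree(rows):
    """Nest (tier, name) rows, listed depth-first, into a list of dicts."""
    roots = []
    for tier, name in rows:
        node = defaultdict(list)
        node['name'] = name
        # Descend tier - 1 levels, always following the last node added.
        siblings = roots
        for _ in range(tier - 1):
            siblings = siblings[-1]['children']
        siblings.append(node)
    return roots

rows = [
    (1, 'Science policy'),
    (2, 'Bioethics'),
    (3, 'Justice in science'),
    (2, 'Material transfer agreements'),
    (2, 'Research funding'),
    (3, 'Research grants'),
]

tree = build_tree(rows)
as_json = json.dumps(tree, sort_keys=True)   # defaultdicts serialize like dicts
```

A node only gets a "children" key when a deeper row follows it, so leaves stay compact in the JSON.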

Prototyping

Experimenting on a smaller subset of the thesaurus: the very small Science policy branch.



In [12]:

    
plosthes_test_file = '../data/plosthes_test.xlsx'

json.dumps(tree_from_spreadsheet(plosthes_test_file, df, verbose=True))









    



1 Science policy
2 Science policy/Bioethics
3 Science policy/Bioethics/Justice in science
3 Science policy/Bioethics/Respect for human dignity
3 Science policy/Bioethics/Sanctity of life
3 Science policy/Bioethics/Scientific beneficence
3 Science policy/Bioethics/Scientific nonmaleficence
2 Science policy/Material transfer agreements
2 Science policy/Research funding
3 Science policy/Research funding/Corporate funding of science
3 Science policy/Research funding/Government funding of science
3 Science policy/Research funding/Institutional funding of science
3 Science policy/Research funding/Military funding of science
3 Science policy/Research funding/Philanthropic funding of science
3 Science policy/Research funding/Research grants
2 Science policy/Research integrity
3 Science policy/Research integrity/Publication ethics
3 Science policy/Research integrity/Scientific misconduct
2 Science policy/Science and technology workforce
3 Science policy/Science and technology workforce/Careers in research
2 Science policy/Science education
3 Science policy/Science education/Science fairs
2 Science policy/Science policy and economics
2 Science policy/Technology regulations






    Out[12]:





'[{"count": 31, "name": "Science policy", "children": [{"count": 2, "name": "Bioethics", "children": [{"count": 0, "name": "Justice in science"}, {"count": 0, "name": "Respect for human dignity"}, {"count": 0, "name": "Sanctity of life"}, {"count": 0, "name": "Scientific beneficence"}, {"count": 0, "name": "Scientific nonmaleficence"}]}, {"count": 0, "name": "Material transfer agreements"}, {"count": 14, "name": "Research funding", "children": [{"count": 0, "name": "Corporate funding of science"}, {"count": 6, "name": "Government funding of science"}, {"count": 1, "name": "Institutional funding of science"}, {"count": 0, "name": "Military funding of science"}, {"count": 1, "name": "Philanthropic funding of science"}, {"count": 3, "name": "Research grants"}]}, {"count": 5, "name": "Research integrity", "children": [{"count": 3, "name": "Publication ethics"}, {"count": 0, "name": "Scientific misconduct"}]}, {"count": 0, "name": "Science and technology workforce", "children": [{"count": 0, "name": "Careers in research"}]}, {"count": 1, "name": "Science education", "children": [{"count": 0, "name": "Science fairs"}]}, {"count": 0, "name": "Science policy and economics"}, {"count": 0, "name": "Technology regulations"}]}]'
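
One sanity check on this output: a child's path string begins with its parent's path, so with literal substring matching every article counted for a child is also counted for its parent, and no child count should exceed the parent's. The sketch below runs that check on a trimmed copy of the output above; `check_counts` is a hypothetical helper.

```python
import json

# Trimmed copy of the prototype output (names and counts as printed above).
output = (
    '[{"count": 31, "name": "Science policy", "children": ['
    '{"count": 14, "name": "Research funding", "children": ['
    '{"count": 6, "name": "Government funding of science"}, '
    '{"count": 3, "name": "Research grants"}]}, '
    '{"count": 5, "name": "Research integrity", "children": ['
    '{"count": 3, "name": "Publication ethics"}]}]}]'
)

def check_counts(nodes, parent_count=None, path=''):
    """Return the paths of nodes whose count exceeds their parent's count."""
    problems = []
    for node in nodes:
        here = path + '/' + node['name'] if path else node['name']
        if parent_count is not None and node['count'] > parent_count:
            problems.append(here)
        problems.extend(check_counts(node.get('children', []), node['count'], here))
    return problems

tree = json.loads(output)
problems = check_counts(tree)
```

An empty `problems` list means the counts are consistent with the nesting; any path it returns points at a node worth inspecting by hand.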

Ready for full conversion and export?

Warning: it takes a minute.



In [11]:

    
plosthes_full_file = '../data/plosthes.2014-1.full.xlsx'

plos_tree = tree_from_spreadsheet(plosthes_full_file, df)

with open('../data/plos_full.json', 'wb') as f:
    json.dump(plos_tree, f)
