In [1]:
from __future__ import unicode_literals
import json
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

Transforming batch-collected data

The desired data structure for article information is the following JSON object:

articles:
  { <doi>:
      { author: [ ... ]
        title:
        journal:
        publication_date: <yyyy>
        subject: [ <full subject /-separated strings>, ... ]
        subj_top: [ set of top levels of each subject ]
        subj_leaf: [ set of last terms of each subject ]
      },
    ...
  }
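
For concreteness, here is a minimal sketch of one record in this shape, written as a Python dict keyed by DOI and round-tripped through the standard json module. The DOI, authors, and title are hypothetical placeholders; the subject paths are borrowed from thesaurus terms that appear later in this notebook.

```python
import json

# Hypothetical record: the DOI, authors, and title are placeholders, not real data.
articles = {
    "10.1371/journal.example.0000000": {
        "author": ["A. Author", "B. Author"],
        "title": "An example title",
        "journal": "PLoS ONE",
        "publication_date": "2013",
        "subject": [
            "/Computer and information sciences/Information technology/Data mining",
            "/Science policy/Research funding/Research grants",
        ],
        # JSON has no set type, so the sets are written as sorted lists.
        "subj_top": ["Computer and information sciences", "Science policy"],
        "subj_leaf": ["Data mining", "Research grants"],
    }
}

encoded = json.dumps(articles, sort_keys=True, indent=2)
decoded = json.loads(encoded)
```

Keying by DOI means one article can be looked up directly, without scanning a list.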

In [2]:
df = pd.read_pickle('../data/abstract_df.pkl')

# Dropping abstract and score.
df.drop(['abstract', 'score'], axis=1, inplace=True)
df.set_index('id', inplace=True)
df.columns = ['author', 'journal', 'publication_date', 'subject', 'title']
df = df.reindex(columns=['author', 'title', 'journal', 'publication_date', 'subject'])
# We just want the year.
df.publication_date = df.publication_date.str[:4]

df.head()


Out[2]:
author title journal publication_date subject
id
10.1371/journal.pntd.0000413 [Darren J Gray, Simon J Forsyth, Robert S Li, ... An Innovative Database for Epidemiological Fie... PLoS Neglected Tropical Diseases 2009 [/Computer and information sciences/Informatio...
10.1371/journal.pone.0083016 [Pedro Lopes, Tiago Nunes, David Campos, Laura... Gathering and Exploring Scientific Knowledge i... PLoS ONE 2013 [/Medicine and health sciences/Pharmacology/Dr...
10.1371/journal.pmed.0030249 [Matthew E Falagas] Unique Author Identification Number in Scienti... PLoS Medicine 2006 [/Research and analysis methods/Database and i...
10.1371/journal.pone.0073275 [George J Besseris] A Distribution-Free Multi-Factorial Profiler f... PLoS ONE 2013 [/Computer and information sciences/Informatio...
10.1371/journal.pone.0043558 [Yuncui Hu, Yanpeng Li, Hongfei Lin, Zhihao Ya... Integrating Various Resources for Gene Name No... NaN 2012 [/Computer and information sciences/Informatio...

5 rows × 5 columns


In [3]:
def get_subj_top(subjects):
    subj_top = set()
    for s in subjects:
        # Each subject starts with '/', so split('/')[0] is an empty string;
        # the top-level term is at index 1.
        subj_top.add(s.split('/')[1])
    return subj_top

def get_subj_leaf(subjects):
    subj_leaf = set()
    for s in subjects:
        # The last element of the split is the most specific (leaf) term.
        subj_leaf.add(s.split('/')[-1])
    return subj_leaf
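
A quick standalone illustration of the splitting logic used by these two functions. The subject list is a hypothetical article's, assembled from thesaurus paths that appear elsewhere in this notebook; it is a sketch, not part of the pipeline.

```python
# Hypothetical subject list for one article.
subjects = [
    "/Computer and information sciences/Information technology/Data mining",
    "/Computer and information sciences/Information technology/Databases/Relational databases",
    "/Science policy/Research funding/Research grants",
]

# The leading '/' makes the first element of the split an empty string,
# so the top-level term sits at index 1.
parts = subjects[0].split('/')

# Same logic as get_subj_top and get_subj_leaf, written as set comprehensions.
subj_top = {s.split('/')[1] for s in subjects}
subj_leaf = {s.split('/')[-1] for s in subjects}
```

Because the results are sets, the two subjects under "Computer and information sciences" contribute that top-level term only once.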

In [4]:
df['subj_top'] = df.subject.apply(get_subj_top)
df['subj_leaf'] = df.subject.apply(get_subj_leaf)

In [5]:
df.head()


Out[5]:
author title journal publication_date subject subj_top subj_leaf
id
10.1371/journal.pntd.0000413 [Darren J Gray, Simon J Forsyth, Robert S Li, ... An Innovative Database for Epidemiological Fie... PLoS Neglected Tropical Diseases 2009 [/Computer and information sciences/Informatio... set([Physical sciences, Medicine and health sc... set([Statistical methods, Infectious disease c...
10.1371/journal.pone.0083016 [Pedro Lopes, Tiago Nunes, David Campos, Laura... Gathering and Exploring Scientific Knowledge i... PLoS ONE 2013 [/Medicine and health sciences/Pharmacology/Dr... set([Medicine and health sciences, Engineering... set([Signal processing, Engines, Drug interact...
10.1371/journal.pmed.0030249 [Matthew E Falagas] Unique Author Identification Number in Scienti... PLoS Medicine 2006 [/Research and analysis methods/Database and i... set([Engineering and technology, Research and ... set([Database searching, Electronics, Data pro...
10.1371/journal.pone.0073275 [George J Besseris] A Distribution-Free Multi-Factorial Profiler f... PLoS ONE 2013 [/Computer and information sciences/Informatio... set([Biology and life sciences, Physical scien... set([Statistical distributions, Engineering an...
10.1371/journal.pone.0043558 [Yuncui Hu, Yanpeng Li, Hongfei Lin, Zhihao Ya... Integrating Various Resources for Gene Name No... NaN 2012 [/Computer and information sciences/Informatio... set([Biology and life sciences, Physical scien... set([Text mining, Gene mapping, Entity disambi...

5 rows × 7 columns

Before exporting, check that the JSON looks right for the first few rows:


In [6]:
#df.head().to_json(orient='index', force_ascii=False)
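
If the check is done with the standard json module instead of to_json, the set-valued columns need some care: json.dumps raises a TypeError on a set. The sketch below shows one way around that, converting sets to sorted lists through the default hook. The record and the helper name encode_sets are hypothetical; how to_json itself treats sets is not covered here.

```python
import json

# Hypothetical record holding one set-valued field.
record = {"subj_top": {"Science policy", "Computer and information sciences"}}

def encode_sets(obj):
    """Fallback for json.dumps: turn sets into sorted lists."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError("not JSON serializable: %r" % (obj,))

# Without the fallback, the standard encoder rejects the set.
try:
    json.dumps(record)
    plain_ok = True
except TypeError:
    plain_ok = False

encoded = json.dumps(record, default=encode_sets)
```

Sorting the set gives a stable order, which makes two exports easier to compare.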

If everything looks OK, export:


In [7]:
#df.to_json(path_or_buf='../data/articles.json', orient='index', force_ascii=False)

Transforming the PLOS thesaurus

The PLOS thesaurus was kindly provided to us as a spreadsheet with thousands of rows, one node per row. It is a polyhierarchy represented in the form of a tree. We need to transform it into a JSON object that also includes article counts for all the nodes in the tree.

An example of the desired data structure for PLOS thesaurus:

{ 
  "name": "Computer and information sciences",
  "count": ###,
  "children": [
    {
      "name": "Information technology",
      "count": ###,
      "children": [
        {"name": "Data mining", "count": ###},
        {"name": "Data reduction", "count": ###}, 
        {
          "name":  "Databases",
          "count": ###,
          "children": [
            {"name": "Relational databases", "count": ###}
          ]
        },
        ...,
        {"name": "Text mining","count": ###} 
      ]
    }
  ]
}, 
...

In Python, each node is a dict, and its children are a list of dicts. The whole thesaurus is a list of top-level nodes, so it is also a list of dicts.
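
As a sketch of how this structure can be used, the snippet below builds a small piece of the example as nested dicts and walks it recursively to list the full path of every node. The counts are placeholders (0) and the helper name iter_paths is hypothetical.

```python
# Counts are placeholders; the real values come from the article data.
tree = [
    {
        "name": "Computer and information sciences",
        "count": 0,
        "children": [
            {
                "name": "Information technology",
                "count": 0,
                "children": [
                    {"name": "Data mining", "count": 0},
                    {
                        "name": "Databases",
                        "count": 0,
                        "children": [{"name": "Relational databases", "count": 0}],
                    },
                ],
            }
        ],
    }
]

def iter_paths(nodes, prefix=()):
    """Yield the '/'-joined path of every node, depth first."""
    for node in nodes:
        path = prefix + (node["name"],)
        yield "/".join(path)
        # Leaves have no 'children' key, so fall back to an empty list.
        for sub in iter_paths(node.get("children", []), path):
            yield sub

paths = list(iter_paths(tree))
```

The same recursive pattern works for any per-node operation, such as collecting the nodes whose count is zero.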


In [8]:
# Let's make sure we are counting articles correctly for each subject node.

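def count_articles(df, subject_path):
    # Stringify each article's subject list so one substring test covers all of its subjects.
    s = df.subject.apply(str)
    # regex=False matches the path literally, even if a term contains regex metacharacters.
    matching = s[s.str.contains(subject_path, regex=False)]
    return len(matching)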

print 'Total articles:', len(df)
print 'Science policy:', count_articles(df, 'Science policy')
print 'Science policy/Bioethics:', count_articles(df, 'Science policy/Bioethics')


Total articles: 1120
Science policy: 31
Science policy/Bioethics: 2
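
Note that count_articles uses substring matching, so a path fragment such as 'Science pol' would also match. A stricter alternative, sketched below in plain Python, compares whole path terms starting from the root. It is not what this notebook uses: the function name count_by_prefix and the three sample articles are hypothetical, and unlike count_articles it only matches paths anchored at the top level.

```python
def count_by_prefix(subject_lists, subject_path):
    """Count articles with at least one subject at or below subject_path."""
    wanted = subject_path.strip('/').split('/')
    total = 0
    for subjects in subject_lists:
        for s in subjects:
            terms = s.strip('/').split('/')
            # Compare whole terms, so partial term names never match.
            if terms[:len(wanted)] == wanted:
                total += 1
                break  # count each article once
    return total

# Hypothetical articles, each given as a list of subject paths.
articles = [
    ["/Science policy/Bioethics/Sanctity of life"],
    ["/Science policy/Research funding/Research grants",
     "/Science policy/Research integrity"],
    ["/Computer and information sciences/Information technology/Data mining"],
]

n_policy = count_by_prefix(articles, "Science policy")
n_bioethics = count_by_prefix(articles, "Science policy/Bioethics")
n_partial = count_by_prefix(articles, "Science pol")
```

The second article has two subjects under "Science policy" but is counted once, which mirrors the per-article counting above.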

In [9]:
import xlrd
from collections import defaultdict

def tree_from_spreadsheet(f, df, verbose=False):
    
    subjects = df.subject.apply(lambda s: str(s))
    
    book = xlrd.open_workbook(f)
    pt = book.sheet_by_index(0)
    # spreadsheet cells : (row, col) :: cell A1 = (0, 0)

    # Initialize a list to contain the thesaurus.
    # Our test case will only have one item in this list.
    pt_test = []

    # Keep track of the path in the tree.
    cur_path = Series([np.nan]*10)

    for r in range(1, pt.nrows):
        # Start on row two.

        # Columns: the hierarchy goes up to 10 tiers.
        for c in range(10):
            if pt.cell_value(r, c):
                # The first non-empty cell in a row holds that row's term;
                # its column index gives the tier.

                # Construct the path to this node.
                # Replace the right single quotation mark (U+2019) with an ASCII apostrophe,
                # because some terms (RNA nomenclature) otherwise cause a unicode error.
                text = pt.cell_value(r, c).replace(u'\u2019', "'")
                cur_path[c] = text
                cur_path[c+1:] = np.nan
                path_list = list(cur_path.dropna())
                tier = len(path_list)
                path_str = '/'.join(path_list)
                if verbose:
                    print tier, path_str

                # Add the node to the JSON-like tree structure.
                node = defaultdict(list)
                node['name'] = text
                # regex=False: treat the path as a literal string, not a pattern.
                node['count'] = len(subjects[subjects.str.contains(path_str, regex=False)])

                # Attach the node to its parent: starting from the top-level list,
                # follow the most recently added node down (tier - 1) levels.
                # Each node is a defaultdict(list), so 'children' is created on first access.
                siblings = pt_test
                for _ in range(tier - 1):
                    siblings = siblings[-1]['children']
                siblings.append(node)

                # Go to next row after finding a term. There is only one term listed per row.
                break

    return pt_test
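
The attachment step relies on the rows being listed in depth-first order: a node at tier n belongs under the most recently added node at tier n-1. Below is a minimal standalone sketch of that idea, with no spreadsheet and no counts. The function name tree_from_rows is hypothetical, and the rows are a subset of the Science policy branch.

```python
from collections import defaultdict

def tree_from_rows(rows):
    """Build a nested tree from (tier, name) pairs listed in depth-first order."""
    roots = []
    for tier, name in rows:
        node = defaultdict(list)
        node['name'] = name
        # Follow the last node at each level down to the parent's child list.
        siblings = roots
        for _ in range(tier - 1):
            siblings = siblings[-1]['children']
        siblings.append(node)
    return roots

rows = [
    (1, 'Science policy'),
    (2, 'Bioethics'),
    (3, 'Justice in science'),
    (3, 'Sanctity of life'),
    (2, 'Research funding'),
    (3, 'Research grants'),
    (2, 'Technology regulations'),
]

tree = tree_from_rows(rows)
```

Because the loop descends tier - 1 levels, the same few lines handle any depth, and leaf nodes never get a 'children' key since it is only created when a child is attached.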

Prototyping

Experimenting on a smaller subset of the thesaurus: the very small Science policy branch.


In [12]:
plosthes_test_file = '../data/plosthes_test.xlsx'

json.dumps(tree_from_spreadsheet(plosthes_test_file, df, verbose=True))


1 Science policy
2 Science policy/Bioethics
3 Science policy/Bioethics/Justice in science
3 Science policy/Bioethics/Respect for human dignity
3 Science policy/Bioethics/Sanctity of life
3 Science policy/Bioethics/Scientific beneficence
3 Science policy/Bioethics/Scientific nonmaleficence
2 Science policy/Material transfer agreements
2 Science policy/Research funding
3 Science policy/Research funding/Corporate funding of science
3 Science policy/Research funding/Government funding of science
3 Science policy/Research funding/Institutional funding of science
3 Science policy/Research funding/Military funding of science
3 Science policy/Research funding/Philanthropic funding of science
3 Science policy/Research funding/Research grants
2 Science policy/Research integrity
3 Science policy/Research integrity/Publication ethics
3 Science policy/Research integrity/Scientific misconduct
2 Science policy/Science and technology workforce
3 Science policy/Science and technology workforce/Careers in research
2 Science policy/Science education
3 Science policy/Science education/Science fairs
2 Science policy/Science policy and economics
2 Science policy/Technology regulations
Out[12]:
'[{"count": 31, "name": "Science policy", "children": [{"count": 2, "name": "Bioethics", "children": [{"count": 0, "name": "Justice in science"}, {"count": 0, "name": "Respect for human dignity"}, {"count": 0, "name": "Sanctity of life"}, {"count": 0, "name": "Scientific beneficence"}, {"count": 0, "name": "Scientific nonmaleficence"}]}, {"count": 0, "name": "Material transfer agreements"}, {"count": 14, "name": "Research funding", "children": [{"count": 0, "name": "Corporate funding of science"}, {"count": 6, "name": "Government funding of science"}, {"count": 1, "name": "Institutional funding of science"}, {"count": 0, "name": "Military funding of science"}, {"count": 1, "name": "Philanthropic funding of science"}, {"count": 3, "name": "Research grants"}]}, {"count": 5, "name": "Research integrity", "children": [{"count": 3, "name": "Publication ethics"}, {"count": 0, "name": "Scientific misconduct"}]}, {"count": 0, "name": "Science and technology workforce", "children": [{"count": 0, "name": "Careers in research"}]}, {"count": 1, "name": "Science education", "children": [{"count": 0, "name": "Science fairs"}]}, {"count": 0, "name": "Science policy and economics"}, {"count": 0, "name": "Technology regulations"}]}]'

Ready for full conversion and export?

Warning: it takes a minute.


In [11]:
plosthes_full_file = '../data/plosthes.2014-1.full.xlsx'

plos_tree = tree_from_spreadsheet(plosthes_full_file, df)

with open('../data/plos_full.json', 'w') as f:
    json.dump(plos_tree, f)