Text Parsing. Counting authors in a journal issue

I was recently reading some of the reviews in a special issue of Chemical Reviews. I was surprised to see how long the reviews were and also that some authors appeared in several reviews. So here we are going to practice some Python to analyse this data.

I downloaded all the references of the issue in RIS format; you can find the file in the data directory (achs_chreay114_6557.ris). In this RIS file the articles are separated by blank lines, and each author is listed on a line starting with AU  -.

(A more powerful way to store this information would be a structured format such as JSON, but since this is only exploratory work in an introductory course, we will use only simple Python structures.)

We will start by parsing the file and generating a list of dictionaries, where each dictionary will contain the authors key, the starting page (sp) key and the end page (ep) key.

As in many of our everyday tasks, we need to parse some files (or web pages) to get the data. Parsing is often neglected in scientific programming, but it can be one of the most time-consuming parts of a task (fortunately, not in this case).

We first create a function that returns the value that follows the key in a RIS formatted line. Note that examples placed in the docstring can be used directly as tests through the doctest module. Testing your code takes time, and it can be difficult for scientific algorithms, but it is very important.

doctest is the simplest testing module. More elaborate testing can be done with nose (or with pytest, since nose is no longer maintained).
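As a minimal sketch of how docstring examples double as tests, the block below defines a hypothetical helper, get_key (not part of the original notebook), which returns the tag of a RIS line instead of its value. The examples in its docstring are then checked with doctest.run_docstring_examples, which prints nothing when they all pass.

```python
import doctest

def get_key(line):
    """
    Return the tag at the start of a RIS formatted line.

    >>> get_key('AU  - Garcia-Pino, Abel')
    'AU'
    >>> get_key('SP  - 6933')
    'SP'
    """
    # Hypothetical helper: everything before the first '-' is the tag
    return line.split('-', 1)[0].strip()

# Check only the examples in get_key's docstring; silent if they all pass
doctest.run_docstring_examples(get_key, {'get_key': get_key}, name='get_key')
```

If one of the expected values in the docstring were wrong, doctest would print a report showing the expected and the obtained result.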


In [ ]:
def get_val(line):
    """
    Get the value after the key for a RIS formatted line
    
    >>> get_val('AU  - Garcia-Pino, Abel')
    'Garcia-Pino, Abel'
    >>> get_val('AU  - Uversky, Vladimir N.')
    'Uversky, Vladimir N.'
    >>> get_val('SP  - 6933')
    '6933'
    >>> get_val('EP  - 6947')
    '6947'
    """
    #Finish...

import doctest
doctest.testmod()

Now we scan the whole text file. For each empty line we create a new entry (we'll correct for the blank lines at the beginning of the file later).

We use the setdefault method to add an empty list when the key is not present. As usual, if you do not know about this method, you could have programmed that line a little more verbosely with something like:

if line.startswith('AU  -'):
    if 'authors' in articles[-1]:
        articles[-1]['authors'].append(get_val(line).strip())
    else:
        articles[-1]['authors'] = [get_val(line).strip(), ]

This is a recurrent situation in Python, which we have already discussed. Python is a large language, and part of the coding time should be invested in looking for documentation about modules and language features that can ease our coding. At some point, however, you have to start coding and use the tools you have found, even if a more adequate tool for that specific task may exist. The time you invest in this search depends on how much you expect to use or re-use the code you are writing.
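To see setdefault in isolation, here is a small sketch on made-up data (the word list is purely illustrative). It groups words by their first letter: setdefault returns the list stored under the key, creating an empty one first if the key is missing, so the append always has a list to work on.

```python
# Illustrative data, unrelated to the RIS file
words = ['apple', 'avocado', 'banana', 'cherry', 'blueberry']

groups = {}
for word in words:
    # Return the list stored under word[0], inserting [] first if the key is new
    groups.setdefault(word[0], []).append(word)

print(groups)
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```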


In [ ]:
articles = []
# The with statement closes the file automatically when the block ends
with open('data/achs_chreay114_6557.ris', 'r') as filein:
    for line in filein:
        if line.strip() == '':
            # A blank line marks the start of a new article
            articles.append(dict())
        if line.startswith('AU  -'):
            articles[-1].setdefault('authors', []).append(get_val(line).strip())
        if line.startswith('SP  -'):
            #Finish...
        if line.startswith('EP  -'):
            #finish...

Because there were some blank lines at the beginning of the file, the loop created some empty dictionaries that we can now remove (using the fact that empty objects evaluate as False):


In [ ]:
articles = #Finish...
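The truthiness rule mentioned above can be sketched on a toy list (the records below are made up for illustration): an empty dictionary evaluates as False in a condition, so a list comprehension with a bare `if r` keeps only the non-empty ones.

```python
# Toy records; the empty dictionaries play the role of the spurious entries
records = [{}, {'sp': 1}, {}, {'sp': 11, 'ep': 30}]

# An empty dict is falsy, so "if r" filters it out
cleaned = [r for r in records if r]

print(cleaned)
# [{'sp': 1}, {'sp': 11, 'ep': 30}]
```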

Analysing the data

We start by computing the average length of the papers in this issue (it would be interesting to compare it with other issues of the same journal...).
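As a rough sketch of the calculation, the block below uses hypothetical page ranges (not taken from the RIS file) and the standard-library statistics module instead of numpy, so that it runs without extra dependencies. The length of an article is taken as end page minus start page plus one; np.mean and np.std with their default arguments should give the same two numbers, since np.std also computes the population standard deviation by default.

```python
import statistics

# Hypothetical (start page, end page) pairs, for illustration only
page_ranges = [(1, 10), (11, 30), (31, 45)]

# Both the first and the last page count, hence the +1
lengths = [ep - sp + 1 for sp, ep in page_ranges]

mean = statistics.mean(lengths)
std = statistics.pstdev(lengths)  # population standard deviation

summary = "The average number of pages is {:.1f} and its standard deviation {:.1f}".format(mean, std)
print(summary)
# The average number of pages is 15.0 and its standard deviation 4.1
```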


In [ ]:
import numpy as np

In [ ]:
page_lengths = [*FINISH* for d in articles]
page_lengths

In [ ]:
"The average number of pages is {:.1f} and its standard deviation {:.1f}".format(*FINISH*)

Now let's count how many papers each author has contributed to this issue. We'll create a dictionary with the authors as keys and the number of papers as values.


In [ ]:
author_dict = {}
for d in articles:
    for author in d['authors']:
        author_dict[author] = author_dict.setdefault(author, 0) + 1

In [ ]:
author_dict.values()

So one author published 7 papers in that issue, another one 3 papers and several have authored 2 papers! Let's see who authored more than one paper. We will print the results sorted by the number of papers authored. We want to get a list such as:

Uversky, Vladimir N. published 7 papers
Tompa, Peter published 3 papers
Longhi, Sonia published 2 papers
Habchi, Johnny published 2 papers
Weatheritt, Robert J. published 2 papers
Xue, Bin published 2 papers
Kurgan, Lukasz published 2 papers
Fuxreiter, Monika published 2 papers

In [ ]:
papers_authored = set([p for p in author_dict.values() if p>1]) # Remove repeated elements
papers_authored = # We need to convert to a list to sort
#Sort the list...

# Print the results
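One possible shape for this step, sketched on a made-up dictionary of counts (the author names are placeholders): collect the distinct counts greater than one, sort them in decreasing order, and for each count list the authors that have it.

```python
# Placeholder counts, not the real data from the issue
counts = {'Author A': 3, 'Author B': 1, 'Author C': 2, 'Author D': 2}

# Distinct counts above one, largest first
distinct = sorted({n for n in counts.values() if n > 1}, reverse=True)

lines = []
for n in distinct:
    for name, papers in counts.items():
        if papers == n:
            lines.append("{} published {} papers".format(name, papers))

print("\n".join(lines))
# Author A published 3 papers
# Author C published 2 papers
# Author D published 2 papers
```

Authors with the same count appear in the order in which they were inserted into the dictionary.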

Now let's see how many pages they have (presumably) written or supervised. We want to get something like:

Uversky, Vladimir N. published 7 papers adding up to 208 pages
Tompa, Peter published 3 papers adding up to 89 pages
Longhi, Sonia published 2 papers adding up to 60 pages
Habchi, Johnny published 2 papers adding up to 60 pages
Weatheritt, Robert J. published 2 papers adding up to 89 pages
Xue, Bin published 2 papers adding up to 70 pages
Kurgan, Lukasz published 2 papers adding up to 70 pages
Fuxreiter, Monika published 2 papers adding up to 81 pages

In [ ]:
def get_pages(name, articles):
    """
    Get the total number of pages for a given author.
    Return 0 if the author does not appear in any of the articles.
    """
    pages = 0
    for art in articles:
        if name in art['authors']:
            pages += art['ep']-art['sp']+1
    return pages
    
#Finish...

As usual, some documentation search would have shown us a module that could have eased the coding. The collections module has a Counter type that is useful for counting things. When fed with a list, Counter counts its elements and stores something similar to a dictionary.


In [ ]:
from collections import Counter

In [ ]:
author_count = Counter()
for d in articles:
    author_count.update(d['authors'])
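The behaviour of Counter can be checked on a small made-up list: passing a list to the constructor counts its elements, update adds further elements to the existing counts, and most_common returns (element, count) pairs sorted from the most to the least frequent.

```python
from collections import Counter

# Toy data for illustration
letters = Counter(['a', 'b', 'a', 'c', 'a', 'b'])

# update() adds to the existing counts instead of replacing them
letters.update(['d'])

print(letters['a'])            # 3
print(letters['z'])            # 0, missing keys count as zero
print(letters.most_common(2))  # [('a', 3), ('b', 2)]
```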

author_dict and author_count are similar objects, but a Counter has some useful counting methods, so we no longer need the papers_authored list. With the Counter object the code is simpler. This prints our final list in a single loop:


In [ ]:
for author, val in author_count.most_common():
    if val > 1:
        pages = get_pages(author, articles)
        print("{} published {} papers adding up to {} pages".format(author, val, pages))