Inspecting the PubMed Paper Dataset

(Adapted from: Inspecting the dataset - Luís F. Simões. Assignments added by J.E. Hoeksema, 2014-10-16. Converted to Python 3 and minor changes by Tobias Kuhn, 2015-10-23.)


This notebook's purpose is to provide a basic illustration of how to handle data in the PubMed dataset, as well as to provide some basic assignments about this dataset. Make sure you download all the dataset files (air__Summaries.pkl.bz2, etc.) from Blackboard and save them in a directory called data, which should be a sub-directory of the one that contains this notebook file (or adjust the file path in the code). The dataset consists of information about scientific papers from the PubMed dataset that contain the word "air" in the title or abstract.

Note that you can run all of this code from a normal Python or IPython shell, except for certain magic commands (marked with %) that only work for display within a notebook.

Loading the dataset


In [1]:
import pickle, bz2

Summaries_file = 'data/air__Summaries.pkl.bz2'
Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )

To make the data easier to access, we convert the paper entries into named tuples. This allows us to refer to fields by name rather than by index.


In [2]:
from collections import namedtuple

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )

for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )

In [3]:
Summaries[26488732]


Out[3]:
paper(title='Use of Whole-Genome Sequencing to Link Burkholderia pseudomallei from Air Sampling to Mediastinal Melioidosis, Australia.', authors=['Currie BJ', 'Price EP', 'Mayo M', 'Kaestli M', 'Theobald V', 'Harrington I', 'Harrington G', 'Sarovich DS'], year=2015, doi='10.3201/eid2111.141802')

In [4]:
Summaries[26488732].title


Out[4]:
'Use of Whole-Genome Sequencing to Link Burkholderia pseudomallei from Air Sampling to Mediastinal Melioidosis, Australia.'

Dataset statistics

Plotting relies on matplotlib, which in turn requires NumPy; both can be installed with pip if they are not already available.


In [5]:
import matplotlib.pyplot as plt

# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100

Papers per year

Here, we will get information on how many papers in the dataset were published per year.

We'll be using the Counter class to determine the number of papers per year.
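As a quick illustration of how Counter works (on made-up toy data, not the dataset itself), it tallies how often each item occurs in an iterable:

```python
from collections import Counter

# toy example: count publication years in a small hand-made list
toy_years = [2014, 2015, 2015, 2013, 2015, 2014]
counts = Counter(toy_years)

print(sorted(counts.items()))   # [(2013, 1), (2014, 2), (2015, 3)]
```

The real cell below applies exactly this pattern to the years of all papers in Summaries.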


In [6]:
from collections import Counter

paper_years = [ p.year for p in Summaries.values() ]
papers_per_year = sorted( Counter(paper_years).items() )
print('Number of papers in the dataset per year for the past decade:')
print(papers_per_year[-10:])


Number of papers in the dataset per year for the past decade:
[(2007, 6729), (2008, 7159), (2009, 7454), (2010, 7552), (2011, 8293), (2012, 8924), (2013, 9267), (2014, 10090), (2015, 9887), (2016, 23)]

Filtering the results to obtain only papers published since 1950:


In [7]:
papers_per_year = [ (y,count) for (y,count) in papers_per_year if y >= 1950 ]
years = [ y for (y,count) in papers_per_year ]
nr_papers = [ count for (y,count) in papers_per_year ]

print('Number of papers in the dataset published since 1950: %d.' % sum(nr_papers))


Number of papers in the dataset published since 1950: 188772.

Creating a bar plot to visualize the results (using matplotlib.pyplot.bar):


In [8]:
plt.bar( years, nr_papers, width=1.0 )
plt.xlim(1950,2016)
plt.xlabel('year')
plt.ylabel('number of papers');


Papers per author

Here, we will obtain the distribution of the number of papers published per author.


In [9]:
# flatten the per-paper author lists into a single list of author names
authors_expanded = [
    auth
    for paper in Summaries.values()
    for auth in paper.authors
    ]

nr_papers_by_author = Counter( authors_expanded )

In [10]:
print('There are %d authors in the dataset with distinct names.\n' % len(nr_papers_by_author))
print('50 authors with greatest number of papers:')
print(sorted(nr_papers_by_author.items(), key=lambda i:i[1] )[-50:])


There are 378722 authors in the dataset with distinct names.

50 authors with greatest number of papers:
[('Kim H', 160), ('Yang J', 161), ('Wu J', 164), ('Wang Q', 165), ('Wang D', 168), ('Li W', 172), ('Chen Z', 173), ('Li C', 174), ('Liu Z', 176), ('Liu L', 181), ('Jones KC', 181), ('Zhao Y', 182), ('Zhang G', 184), ('Liu H', 187), ('Xu X', 188), ('Zhang Q', 191), ('Chen H', 191), ('Chen X', 193), ('Chen L', 196), ('Yang Y', 199), ('Zhang W', 204), ('Lee JH', 207), ('Brunekreef B', 209), ('Wang C', 224), ('Li Z', 227), ('Wang W', 227), ('Zhang Z', 233), ('Zhang H', 243), ('Li L', 243), ('Li H', 261), ('Wang Z', 275), ('Wang L', 277), ('Liu X', 277), ('Wang S', 280), ('Wang H', 299), ('Chen J', 305), ('Schwartz J', 325), ('Liu J', 337), ('Chen Y', 341), ('Li X', 369), ('Zhang L', 383), ('Zhang X', 394), ('Li J', 418), ('Zhang J', 438), ('Wang J', 446), ('Li Y', 460), ('Liu Y', 479), ('Wang X', 493), ('Zhang Y', 598), ('Wang Y', 625)]
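As an aside, Counter also provides a most_common method that returns the same information already sorted in descending order of frequency, which could replace the manual sort above; a minimal sketch on toy data (not the dataset):

```python
from collections import Counter

# toy author list with repeated names
toy_authors = ['Wang Y', 'Zhang Y', 'Wang Y', 'Li J', 'Wang Y', 'Zhang Y']
toy_counts = Counter(toy_authors)

# the two most frequent authors in the toy list, most frequent first
print(toy_counts.most_common(2))   # [('Wang Y', 3), ('Zhang Y', 2)]
```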

Creating a histogram to visualize the results (using matplotlib.pyplot.hist):


In [11]:
plt.hist( x=list(nr_papers_by_author.values()), bins=range(51), histtype='step' )
plt.yscale('log')
plt.xlabel('number of papers authored')
plt.ylabel('number of authors');


Authors per paper


In [12]:
plt.hist( x=[ len(p.authors) for p in Summaries.values() ], bins=range(20), histtype='bar', align='left', density=True )
plt.xlabel('number of authors in one paper')
plt.ylabel('fraction of papers')
plt.xlim(0,15);


Most frequently occurring words in paper titles


In [13]:
# assemble list of words in paper titles, convert them to lowercase, and remove trailing '.'
title_words = Counter([
    ( word if word[-1] != '.' else word[:-1] ).lower()
    for paper in Summaries.values()
    for word in paper.title.split(' ')
    if word != ''   # discard empty strings that are generated when consecutive spaces occur in the title
    ])
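To see what this normalization does on a single title (a made-up example, not taken from the dataset), trailing periods are stripped, words are lowercased, and empty strings from repeated spaces are discarded:

```python
sample_title = 'Effects of Air  Pollution.'   # note the double space and trailing '.'

words = [
    ( word if word[-1] != '.' else word[:-1] ).lower()
    for word in sample_title.split(' ')
    if word != ''   # skip the empty string caused by the double space
]

print(words)   # ['effects', 'of', 'air', 'pollution']
```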

In [14]:
print(len(title_words), 'distinct words occur in the paper titles.\n')
print('50 most frequently occurring words:')
print(sorted( title_words.items(), key=lambda i:i[1] )[-50:])


144310 distinct words occur in the paper titles.

50 most frequently occurring words:
[('case', 2738), ('pressure', 2784), ('determination', 2832), ('surface', 2914), ('water', 2951), ('method', 3002), ('cells', 3044), ('use', 3136), ('comparison', 3142), ('assessment', 3147), ('system', 3197), ('treatment', 3255), ('new', 3296), ('between', 3313), ('environmental', 3459), ('carbon', 3479), ('gas', 3484), ('model', 3494), ('health', 3943), ('evaluation', 4040), ('patients', 4179), ('oxygen', 4205), ('respiratory', 4436), ('human', 4692), ('analysis', 4714), ('after', 4892), ('pulmonary', 5242), ('as', 5373), ('lung', 6160), ('at', 6836), ('using', 7104), ('effect', 7846), ('pollution', 8047), ('exposure', 8099), ('study', 8163), ('during', 8407), ('effects', 9216), ('an', 10919), ('from', 11991), ('by', 16682), ('with', 22315), ('to', 25693), ('on', 27451), ('for', 28468), ('air', 40947), ('a', 42906), ('the', 82133), ('and', 93899), ('in', 101331), ('of', 169882)]

Assignments

Your name: ...

  • Create a plot for the years from 1970 until 2015 that shows how many authors published at least one paper.
    Hint: use a defaultdict with a default value of set. You can retrieve the number of unique items in a set s with len(s). See also the documentation for set and defaultdict
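The mechanics of the hint can be sketched on toy data (this is only an illustration of defaultdict with set as the default factory, not a solution to the assignment):

```python
from collections import defaultdict

# toy data: map each year to the set of authors seen in that year
authors_by_year = defaultdict(set)   # missing keys start out as an empty set

for year, author in [(2014, 'Wang Y'), (2014, 'Li J'), (2015, 'Wang Y'), (2014, 'Wang Y')]:
    authors_by_year[year].add(author)

# sets discard duplicates, so len(s) gives the number of distinct authors
print(len(authors_by_year[2014]))   # 2
print(len(authors_by_year[2015]))   # 1
```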

In [15]:
# Add your code here
  • Calculate and plot (e.g. using plt.plot) a graph of the frequency of the 50 most frequent words in titles of papers, from most frequent to least frequent.

In [16]:
# Add your code here
  • While keeping in mind that we are dealing with a biased (preselected) dataset about air-related papers, what do you notice when looking at the top 10 most frequent words?

[Write your answer text here]


In [17]:
# Add your code here

Submit your answers as a modified version of this notebook file (a file with the .ipynb extension) that includes your code and your answers, via Blackboard.