(Adapted from: Inspecting the dataset - Luís F. Simões. Assignments added by J.E. Hoeksema, 2014-10-16. Converted to Python 3 and minor changes by Tobias Kuhn, 2015-10-23.)
This notebook's purpose is to provide a basic illustration of how to handle data in the PubMed dataset, as well as to provide some basic assignments about this dataset. Make sure you download all the dataset files (air__Summaries.pkl.bz2, etc.) from Blackboard and save them in a directory called data, which should be a sub-directory of the one that contains this notebook file (or adjust the file path in the code). The dataset consists of information about scientific papers from the PubMed dataset that contain the word "air" in the title or abstract.
Note that you can run all of this code from a normal Python or IPython shell, except for certain magic commands (marked with %) used for display within a notebook.
In [1]:
import pickle, bz2
Summaries_file = 'data/air__Summaries.pkl.bz2'
Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
To make it easier to access the data, here we convert paper entries into named tuples. This will allow us to refer to fields by keyword, rather than by index.
In [2]:
from collections import namedtuple
paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )
In [3]:
Summaries[26488732]
Out[3]:
In [4]:
Summaries[26488732].title
Out[4]:
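As a small self-contained illustration of how such named tuples behave (using a made-up entry, not actual data from the dataset):

```python
from collections import namedtuple

# the same structure used for the dataset entries
paper = namedtuple('paper', ['title', 'authors', 'year', 'doi'])

# a made-up example entry, purely for illustration
example = paper(title='An example paper',
                authors=['A. Author', 'B. Author'],
                year=2015, doi='10.1000/xyz123')

print(example.title)   # fields are accessible by name...
print(example[2])      # ...as well as by index
```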
Plotting relies on matplotlib (NumPy is also required); both can be downloaded from their project websites, or installed with your package manager.
In [5]:
import matplotlib.pyplot as plt
# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100
Here, we will get information on how many papers in the dataset were published per year.
We'll be using the Counter class to determine the number of papers per year.
In [6]:
from collections import Counter
paper_years = [ p.year for p in Summaries.values() ]
papers_per_year = sorted( Counter(paper_years).items() )
print('Number of papers in the dataset per year for the past decade:')
print(papers_per_year[-10:])
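To see what Counter does, here is a minimal sketch on a small made-up list of years (not actual dataset values):

```python
from collections import Counter

years = [2013, 2014, 2014, 2015, 2015, 2015]
counts = Counter(years)        # maps each value to its frequency
print(sorted(counts.items()))  # [(2013, 1), (2014, 2), (2015, 3)]
```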
Filtering results, to obtain only papers published since 1950:
In [7]:
papers_per_year = [ (y,count) for (y,count) in papers_per_year if y >= 1950 ]
years = [ y for (y,count) in papers_per_year ]
nr_papers = [ count for (y,count) in papers_per_year ]
print('Number of papers in the dataset published since 1950: %d.' % sum(nr_papers))
Creating a bar plot to visualize the results (using matplotlib.pyplot.bar):
In [8]:
plt.bar( years, nr_papers, width=1.0 )
plt.xlim(1950,2016)
plt.xlabel('year')
plt.ylabel('number of papers');
Here, we will obtain the distribution characterizing the number of papers published by an author.
In [9]:
# flattening out of the list of lists of authors
authors_expanded = [
    auth
    for paper in Summaries.values()
    for auth in paper.authors
]
nr_papers_by_author = Counter( authors_expanded )
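The nested comprehension above flattens a list of lists. A minimal sketch of the same pattern with made-up author lists:

```python
from collections import Counter

# hypothetical per-paper author lists, purely for illustration
papers_authors = [['A', 'B'], ['B'], ['A', 'C']]

# the outer loop runs over papers, the inner over each paper's authors
flat = [auth for authors in papers_authors for auth in authors]
print(flat)            # ['A', 'B', 'B', 'A', 'C']
print(Counter(flat))   # counts how many papers each author appears in
```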
In [10]:
print('There are %d authors in the dataset with distinct names.\n' % len(nr_papers_by_author))
print('50 authors with greatest number of papers:')
print(sorted(nr_papers_by_author.items(), key=lambda i:i[1] )[-50:])
Creating a histogram to visualize the results (using matplotlib.pyplot.hist):
In [11]:
plt.hist( x=list(nr_papers_by_author.values()), bins=range(51), histtype='step' )
plt.yscale('log')
plt.xlabel('number of papers authored')
plt.ylabel('number of authors');
In [12]:
plt.hist( x=[ len(p.authors) for p in Summaries.values() ], bins=range(20), histtype='bar', align='left', density=True )
plt.xlabel('number of authors in one paper')
plt.ylabel('fraction of papers')
plt.xlim(0,15);
In [13]:
# assemble list of words in paper titles, convert them to lowercase, and remove trailing '.'
title_words = Counter([
    ( word if word[-1] != '.' else word[:-1] ).lower()
    for paper in Summaries.values()
    for word in paper.title.split(' ')
    if word != ''  # discard empty strings that are generated when consecutive spaces occur in the title
])
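The normalization applied above (lowercasing and stripping a trailing '.') can be checked on a made-up title:

```python
title = 'Air  quality in cities.'   # note the double space

words = [(w if w[-1] != '.' else w[:-1]).lower()
         for w in title.split(' ')
         if w != '']                # drop empty strings from consecutive spaces

print(words)   # ['air', 'quality', 'in', 'cities']
```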
In [14]:
print(len(title_words), 'distinct words occur in the paper titles.\n')
print('50 most frequently occurring words:')
print(sorted( title_words.items(), key=lambda i:i[1] )[-50:])
Your name: ...

Hint: you can get the number of elements in a set s with len(s). See also the documentation for set and defaultdict.
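For reference, a minimal sketch of how defaultdict and set can be combined, using made-up (author, year) pairs (a hypothetical example, not part of the assignment solution):

```python
from collections import defaultdict

# made-up (author, year) pairs, purely for illustration
records = [('A. Author', 2014), ('A. Author', 2015), ('B. Author', 2014)]

years_by_author = defaultdict(set)   # missing keys start as an empty set
for author, year in records:
    years_by_author[author].add(year)

print(dict(years_by_author))
print(len(years_by_author['A. Author']))   # number of distinct years
```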
In [15]:
# Add your code here
In [16]:
# Add your code here
[Write your answer text here]
In [17]:
# Add your code here
Submit the answers to the assignment as a modified version of this Notebook file (file with .ipynb
extension) that includes your code and your answers via Blackboard.