In this notebook for the first mini-assignment, we will get to know the dataset that we will be using throughout. You can find the assignment tasks at the very bottom of this document.
Our dataset consists of short texts (article abstracts) from the PubMed database of scientific publications in the Life Science domain. As the full dataset consists of millions of documents, we are using just a small subset, namely all publications that contain the word "malaria" in their title or abstract. You can download that dataset in the form of four files (malaria__Summaries.pkl.bz2, etc.) from Blackboard. Save these four files in a directory called data, which should be a sub-directory of the one that contains this notebook file (or adjust the file path in the code).
In [1]:
import pickle, bz2
Summaries_file = 'data/malaria__Summaries.pkl.bz2'
Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
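To see what the loading step does, the following minimal sketch writes a toy dictionary to a bz2-compressed pickle file and reads it back with the same call pattern. The toy entry, its field order, and the temporary file name are assumptions made up for illustration; they are not part of the real dataset.

```python
import bz2
import os
import pickle
import tempfile

# Toy stand-in for the real data: paper id -> (title, authors, year, doi).
# These values are invented for illustration only.
toy_summaries = {1: ('A toy title.', ['Doe J', 'Roe R'], 1941, '10.0000/toy')}

# Hypothetical file name inside a fresh temporary directory.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy__Summaries.pkl.bz2')

# Write: pickle the object into a bz2-compressed file.
with bz2.BZ2File(toy_path, 'wb') as f:
    pickle.dump(toy_summaries, f)

# Read: open the compressed file in binary mode and unpickle it.
with bz2.BZ2File(toy_path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == toy_summaries)
```

Opening the file in a with block closes it automatically once the data has been read.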
To make it easier to access the data, we convert the paper entries into named tuples here. This will allow us to refer to fields by keyword (like var.year), rather than by index (like var[2]).
In [2]:
from collections import namedtuple
paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (paper_id, paper_info) in Summaries.items():
    Summaries[paper_id] = paper( *paper_info )
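As a small illustration of the difference between keyword and index access, here is a sketch on a single made-up entry. The type name toy_paper and the field values are hypothetical; only the field names mirror the ones used above.

```python
from collections import namedtuple

# Hypothetical record type with the same field names as in the notebook.
toy_paper = namedtuple('toy_paper', ['title', 'authors', 'year', 'doi'])

# Invented raw entry, in the same field order.
raw = ('A toy title.', ['Doe J', 'Roe R'], 1941, '10.0000/toy')

# Unpack the plain tuple into the named tuple.
entry = toy_paper(*raw)

print(entry.year)    # access by field name
print(entry[2])      # access by index still works
print(entry == raw)  # a named tuple is still a tuple
```

Because a named tuple is a subclass of tuple, code that indexes into the entries keeps working after the conversion.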
In [3]:
Summaries[24130474]
Out[3]:
In [4]:
Summaries[24130474].title
Out[4]:
Plotting relies on matplotlib and NumPy. If your installation doesn't include them already, you can download them from their respective project websites.
In [5]:
import matplotlib.pyplot as plt
# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100
Here, we will get information on how many papers in the dataset were published per year.
We'll be using the Counter class to determine the number of papers per year.
In [6]:
from collections import Counter
paper_years = [ p.year for p in Summaries.values() ]
papers_per_year = sorted( Counter(paper_years).items() )
print('Number of papers in the dataset per year for the past decade:')
print(papers_per_year[-10:])
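The counting step can be tried out on a handful of made-up years. This sketch only illustrates how Counter and sorted work together; the list toy_years is invented.

```python
from collections import Counter

# Invented publication years, deliberately unsorted.
toy_years = [2015, 2013, 2015, 2014, 2015, 2013]

# Counter maps each year to its number of occurrences;
# sorting the (year, count) pairs orders them by year.
toy_per_year = sorted(Counter(toy_years).items())

# A negative slice keeps the last entries, i.e. the most recent years present.
latest_two = toy_per_year[-2:]

print(toy_per_year)
print(latest_two)
```

Note that the slice selects the last entries of the sorted list, so years without any papers do not appear in it.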
Filtering the results to obtain only papers since 1940:
In [7]:
papers_per_year_since_1940 = [ (y,count) for (y,count) in papers_per_year if y >= 1940 ]
years_since_1940 = [ y for (y,count) in papers_per_year_since_1940 ]
nr_papers_since_1940 = [ count for (y,count) in papers_per_year_since_1940 ]
print('Number of papers in the dataset published since 1940:')
print(sum(nr_papers_since_1940))
Creating a bar plot to visualize the results (using matplotlib.pyplot.bar):
In [8]:
plt.bar(years_since_1940, nr_papers_since_1940, width=1.0)
plt.xlim(1940, 2017)
plt.xlabel('year')
plt.ylabel('number of papers');
Alternatively, you can get the same result in a more direct manner by plotting it as a histogram with matplotlib.pyplot.hist:
In [9]:
plt.hist( x=[p.year for p in Summaries.values()], bins=range(1940,2018) );
plt.xlim(1940, 2017)
plt.xlabel('year')
plt.ylabel('number of papers');
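Why the two plots agree can be checked without plotting. The sketch below uses a hypothetical pure-Python helper, histogram_counts, that follows the binning convention used by NumPy histograms: every bin includes its left edge and excludes its right edge, except the last bin, which includes both. The years are made up.

```python
from bisect import bisect_right
from collections import Counter

def histogram_counts(values, edges):
    """Count values per bin; bins are [edges[i], edges[i+1]),
    except the last bin, which also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # values outside the edges are ignored
        # position of the bin whose left edge is the largest edge <= v,
        # capped so that the right-most edge falls into the last bin
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts

toy_years = [1940, 1941, 1941, 1943, 1943, 1943]  # invented data
edges = list(range(1940, 1945))                   # 1940, 1941, ..., 1944

hist = histogram_counts(toy_years, edges)

# The same counts obtained via Counter, one entry per left bin edge.
per_year = Counter(toy_years)
counter_counts = [per_year.get(y, 0) for y in edges[:-1]]

print(hist)
print(counter_counts)
```

With consecutive integer edges, each bin corresponds to one year. One detail to keep in mind is that a value equal to the final edge is counted together with the year before it.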
Here, we will obtain the distribution characterizing the number of papers published by an author.
In [10]:
# flattening the list of lists of authors:
authors_expanded = [ auth for paper in Summaries.values() for auth in paper.authors ]
nr_papers_by_author = Counter( authors_expanded )
In [11]:
print('Number of authors in the dataset with distinct names:')
print(len(nr_papers_by_author))
In [12]:
print('Top 50 authors with greatest number of papers:')
print(sorted(nr_papers_by_author.items(), key=lambda i:i[1], reverse=True)[:50])
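The flattening and ranking steps can be illustrated on a few made-up author lists. The sketch also shows Counter.most_common, which is a shorter way to obtain the top entries; the names below are invented.

```python
from collections import Counter

# Invented author lists, one list per paper.
toy_author_lists = [['Doe J', 'Roe R'], ['Doe J'], ['Doe J', 'Poe E', 'Roe R']]

# Nested comprehension: the outer loop runs over papers, the inner over authors.
flat = [auth for authors in toy_author_lists for auth in authors]
counts = Counter(flat)

# Ranking as done above: sort (author, count) pairs by count, descending.
top_two_sorted = sorted(counts.items(), key=lambda i: i[1], reverse=True)[:2]

# The same ranking using the built-in helper.
top_two_common = counts.most_common(2)

print(top_two_sorted)
print(top_two_common)
```

In a nested comprehension the for clauses are read left to right, in the same order as the equivalent nested for loops.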
Creating a histogram to visualize the results:
In [13]:
plt.hist( x=list(nr_papers_by_author.values()), bins=range(51), log=True )
plt.xlabel('number of papers authored')
plt.ylabel('number of authors');
In [14]:
plt.hist( x=[ len(p.authors) for p in Summaries.values() ], bins=range(20), align='left', density=True )
plt.xlabel('number of authors in one paper')
plt.ylabel('fraction of papers')
plt.xlim(-0.5, 15.5);
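With bins of width one, the bar heights of a normalized histogram can be read as the fraction of papers with a given number of authors (among the papers that fall inside the binned range). The following sketch computes these fractions directly for a few made-up author lists.

```python
from collections import Counter

# Invented author lists, one list per paper.
toy_author_lists = [
    ['Doe J'],
    ['Doe J', 'Roe R'],
    ['Poe E', 'Roe R'],
    ['Doe J', 'Poe E', 'Roe R'],
]

# Number of authors for each paper.
sizes = [len(authors) for authors in toy_author_lists]

# Count how many papers have each size, then divide by the number of papers.
size_counts = Counter(sizes)
fractions = {n: c / len(sizes) for n, c in sorted(size_counts.items())}

print(fractions)
print(sum(fractions.values()))
```

The fractions sum to one, which is what normalizing the histogram achieves in the plot.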
Assemble a list of the words in paper titles, convert them to lowercase, and remove any trailing '.':
In [15]:
words = [ word.lower().rstrip('.') for paper in Summaries.values() for word in paper.title.split(' ') ]
word_counts = Counter(words)
print('Number of distinct words in the paper titles:')
print(len(word_counts))
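The effect of lowercasing and removing the trailing '.' is easiest to see on two made-up titles. This is a sketch under the assumption that words are separated by single spaces; the titles are invented.

```python
from collections import Counter

# Invented titles for illustration.
toy_titles = ['Malaria in children.', 'Children and malaria vaccines.']

# Split each title on spaces, lowercase each word, strip a trailing period.
toy_words = [
    word.lower().rstrip('.')
    for title in toy_titles
    for word in title.split(' ')
]
toy_word_counts = Counter(toy_words)

print(toy_words)
print(len(toy_word_counts))
```

Without the normalization, 'Malaria' and 'malaria' as well as 'children.' and 'Children' would be counted as different words.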
Your name: ...
Create a Python dictionary object that returns sets of author names for a given year. You can name this dictionary, for example, authors_at_year. (You can use a defaultdict with a default value of set.) Demonstrate the working of this dictionary by showing the author set for the year 1941.
In [ ]:
# Add your code here
In [ ]:
# Add your code here
Calculate and plot (e.g. using plt.plot) a graph of the frequency of the 100 most frequent words in titles of papers, from most frequent to least frequent. (You can make use of the data structures created above.)
In [ ]:
# Add your code here
In [ ]:
# Add your code here
Answer: [Write your answer text here]
Submit the answers to the assignment via Blackboard as a modified version of this Notebook file (file with .ipynb extension) that includes your code and your answers. Don't forget to add your name, and remember that the assignments have to be done individually and group submissions are not allowed.