Adapted from: Inspecting the dataset - Luís F. Simões
Assignments added by J.E. Hoeksema, 2014-10-16
This notebook illustrates how to handle data in the "evolution" dataset and provides some basic assignments about it.
Note that you can run all of this code from a normal python or ipython shell, except for certain magic commands (marked with %) that are used for display within a notebook.
In [1]:
search_term = 'evolution'
Summaries_file = search_term + '__Summaries.pkl.bz2'
In [2]:
import cPickle, bz2
Summaries = cPickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
To make the data easier to access, we convert the paper entries into named tuples. This allows us to refer to fields by name, rather than by index.
In [3]:
from collections import namedtuple
paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.iteritems():
Summaries[id] = paper( *paper_info )
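As a minimal sketch of what this conversion buys us, here is the same pattern applied to a toy dictionary (the entry below is illustrative, not taken from the dataset):

```python
from collections import namedtuple

paper = namedtuple('paper', ['title', 'authors', 'year', 'doi'])

# hypothetical miniature stand-in for the Summaries mapping
toy = {1: ('On Evolution', ['A. Darwin'], 1859, '10.0000/x')}
toy[1] = paper(*toy[1])

assert toy[1].year == 1859   # access by field name...
assert toy[1][2] == 1859     # ...while positional indexing still works
```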
In [4]:
Summaries[23144668]
Out[4]:
In [5]:
Summaries[23144668].title
Out[5]:
Plotting relies on matplotlib, which you can download from the matplotlib website (NumPy is also required).
In [6]:
import matplotlib.pyplot as plt
# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100
Here, we will get information on how many papers in the dataset were published per year.
We'll be using the Counter class to determine the number of papers per year.
In [7]:
paper_year = [ p.year for p in Summaries.itervalues() ]
from collections import Counter
papers_per_year = sorted( Counter(paper_year).items() )
print 'Number of papers in the dataset per year for the past decade:'
print papers_per_year[-10:]
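The Counter step above can be sketched on toy data (the years below are made up for illustration): Counter tallies occurrences, and sorting its (year, count) pairs orders them chronologically.

```python
from collections import Counter

# hypothetical publication years standing in for paper_year
years = [2001, 2003, 2001, 2004, 2003, 2001]
per_year = sorted(Counter(years).items())

assert per_year == [(2001, 3), (2003, 2), (2004, 1)]
```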
Filtering results, to obtain only papers published since 1950:
In [8]:
papers_per_year = [
(y,count)
for (y,count) in papers_per_year
if y >= 1950
]
years = [ y for (y,count) in papers_per_year ]
nr_papers = [ count for (y,count) in papers_per_year ]
print 'Number of papers in the dataset published since 1950: %d.' % sum(nr_papers)
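The filter-then-unzip pattern used above can be sketched with toy (year, count) pairs (the numbers are illustrative):

```python
# hypothetical stand-in for papers_per_year
pairs = [(1949, 5), (1950, 7), (1960, 9)]
since_1950 = [(y, c) for (y, c) in pairs if y >= 1950]
years = [y for (y, c) in since_1950]
counts = [c for (y, c) in since_1950]

assert years == [1950, 1960]
assert sum(counts) == 16
```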
Creating a bar plot to visualize the results (using matplotlib.pyplot.bar):
In [9]:
plt.bar( left=years, height=nr_papers, width=1.0 )
plt.xlim(1950,2016)
plt.xlabel( 'year' )
plt.ylabel( 'number of papers' );
Here, we will obtain the distribution characterizing the number of papers published by an author.
In [10]:
# flattening out of the list of lists of authors
authors_expanded = [
auth
for paper in Summaries.itervalues()
for auth in paper.authors
]
nr_papers_by_author = Counter( authors_expanded )
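The nested list comprehension above flattens a list of author lists into a single list, which Counter then tallies. A small sketch with made-up author names:

```python
from collections import Counter

# hypothetical per-paper author lists
author_lists = [['A', 'B'], ['B'], ['A', 'C', 'B']]
flat = [auth for lst in author_lists for auth in lst]

assert flat == ['A', 'B', 'B', 'A', 'C', 'B']
assert Counter(flat)['B'] == 3   # 'B' authored three of the toy papers
```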
In [11]:
print 'There are %d authors in the dataset with distinct names.\n' % len(nr_papers_by_author)
print '50 authors with greatest number of papers:'
print sorted( nr_papers_by_author.items(), key=lambda i:i[1] )[-50:]
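The sort used above orders (author, count) pairs by their count, ascending, so slicing from the end yields the most prolific authors. On a toy counter:

```python
# hypothetical author -> paper-count mapping
counts = {'a': 3, 'b': 1, 'c': 2}
ranked = sorted(counts.items(), key=lambda i: i[1])

assert ranked == [('b', 1), ('c', 2), ('a', 3)]
assert ranked[-2:] == [('c', 2), ('a', 3)]   # the two most frequent
```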
Creating a histogram to visualize the results (using matplotlib.pyplot.hist):
In [12]:
plt.hist( x=nr_papers_by_author.values(), bins=range(51), histtype='step' )
plt.yscale('log')
plt.xlabel('number of papers authored')
plt.ylabel('number of authors');
In [13]:
plt.hist( x=[ len(p.authors) for p in Summaries.itervalues() ], bins=range(20), histtype='bar', align='left', normed=True )
plt.xlabel('number of authors in one paper')
plt.ylabel('fraction of papers')
plt.xlim(0,15);
In [14]:
# assemble list of words in paper titles, convert them to lowercase, and remove any trailing '.'
title_words = Counter([
( word if word[-1]!='.' else word[:-1] ).lower()
for paper in Summaries.itervalues()
for word in paper.title.split(' ')
if word != '' # the split by spaces generates empty strings when consecutive spaces occur in the title; this discards them
])
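The normalization inside the comprehension above, shown on a single word: strip a trailing period if present, then lowercase.

```python
word = 'Evolution.'
norm = (word if word[-1] != '.' else word[:-1]).lower()
assert norm == 'evolution'

word2 = 'Genes'
assert (word2 if word2[-1] != '.' else word2[:-1]).lower() == 'genes'
```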
In [15]:
print len(title_words), 'distinct words occur in the paper titles.\n'
print '50 most frequently occurring words:'
print sorted( title_words.items(), key=lambda i:i[1] )[-50:]
You can obtain the number of elements in a collection s with len(s). See also the documentation for set and defaultdict.
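As a hedged sketch of how set and defaultdict might be combined for this kind of dataset (the word-to-papers index below is a hypothetical example, not part of the notebook's dataset):

```python
from collections import defaultdict

# hypothetical inverted index: title word -> set of paper ids
index = defaultdict(set)
for pid, title in [(1, 'gene evolution'), (2, 'gene drift')]:
    for w in title.split():
        index[w].add(pid)

assert index['gene'] == {1, 2}      # 'gene' occurs in both toy titles
assert len(index['drift']) == 1     # len() gives a set's size
```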