(Inspired by and borrowed heavily from: Collective Intelligence - Luís F. Simões. IR version and assignments by J.E. Hoeksema, 2014-11-12. Converted to Python 3 and minor changes by Tobias Kuhn, 2015-11-17.)
This notebook's purpose is to give examples of how graph algorithms can be used to improve a search engine. We look at two graphs in particular: the co-authorship network and the citation network.
The citation network is similar to the link network of the web: Citations are like web links pointing to other documents. We can therefore apply the same network-based ranking methods.
In [1]:
import pickle, bz2
from collections import *
import numpy as np
import matplotlib.pyplot as plt
# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100
from IPython.display import display, HTML
In [2]:
Ids_file = 'data/air__Ids.pkl.bz2'
Summaries_file = 'data/air__Summaries.pkl.bz2'
Citations_file = 'data/air__Citations.pkl.bz2'
Abstracts_file = 'data/air__Abstracts.pkl.bz2'
In [3]:
Ids = pickle.load( bz2.BZ2File( Ids_file, 'rb' ) )
In [4]:
Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )
In [5]:
Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )
In [6]:
def display_summary( id, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long titles or author lists, and links to the paper's DOI (when available).
    """
    s = Summaries[ id ]
    title = ( s.title if s.title[-1]!='.' else s.title[:-1] )
    title = title[:150].rstrip() + ('' if len(title)<=150 else '...')
    if s.doi!='':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    authors = ', '.join( s.authors[:5] ) + ('' if len(s.authors)<=5 else ', ...')
    lines = [
        title,
        authors,
        str(s.year),
        '<small>id: %d%s</small>' % (id, extra_text)
    ]
    display( HTML( '<blockquote>%s</blockquote>' % '<br>'.join(lines) ) )
In [7]:
from math import log10
def tokenize(text):
    return text.split(' ')

def preprocess(tokens):
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

inverted_index = defaultdict(set)
for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

tf_matrix = defaultdict(Counter)
for (id, abstract) in Abstracts.items():
    tf_matrix[id] = Counter(preprocess(tokenize(abstract)))

def tf(t,d):
    return float(tf_matrix[d][t])

def df(t):
    return float(len(inverted_index[t]))

numdocs = float(len(Abstracts))

def num_documents():
    return numdocs

# We don't need to keep this object in memory any longer:
Abstracts = {}
Summaries maps paper ids to paper summaries. Let us now create some mappings by different criteria.
We'll start by building a mapping from authors to the set of ids of the papers they authored. We'll again be using Python's sets for that purpose.
In [8]:
papers_of_author = defaultdict(set)
for id,p in Summaries.items():
    for a in p.authors:
        papers_of_author[a].add( id )
In [9]:
papers_of_author['Vine AK']
Out[9]:
In [10]:
for id in papers_of_author['Vine AK']:
    display_summary(id)
We now build a co-authorship network: a graph linking each author to the set of co-authors they have published with.
In [11]:
coauthors = defaultdict(set)
for p in Summaries.values():
    for a in p.authors:
        coauthors[a].update( p.authors )

# The code above results in each author being listed as having co-authored with himself/herself.
# We remove these self-references here:
for a,ca in coauthors.items():
    ca.remove(a)
In [12]:
print(', '.join( coauthors['Vine AK'] ))
Now we can have a look at some basic statistics about our graph:
In [13]:
print('Number of nodes: %8d (node = author)' % len(coauthors))
print('Number of links: %8d (link = collaboration between the two linked authors on at least one paper)' \
% sum( len(cas) for cas in coauthors.values() ))
With this data in hand, we can plot the degree distribution of the co-authorship network, that is, the distribution of the number of collaborators a scientist has published with:
In [14]:
plt.hist( x=[ len(ca) for ca in coauthors.values() ], bins=range(55), histtype='bar', align='left', density=True )
plt.xlabel('number of collaborators')
plt.ylabel('fraction of scientists')
plt.xlim(0,50);
We'll start by expanding the Citations dataset into two mappings:

- papers_citing[id]: papers citing a given paper;
- cited_by[id]: papers cited by a given paper (in other words, its list of references).

If we see the Citations dataset as a directed graph where papers are nodes and citations are links between them, then papers_citing gives you the list of a node's incoming links, whereas cited_by gives you the list of its outgoing links.
The dataset was assembled by querying for papers citing a given paper. As a result, the data mapped to by cited_by (its values) is necessarily limited to ids of papers that are part of the dataset.
In [15]:
papers_citing = Citations # no changes needed, this is what we are storing already in the Citations dataset
cited_by = defaultdict(list)
for ref, papers_citing_ref in papers_citing.items():
    for id in papers_citing_ref:
        cited_by[ id ].append( ref )
Let us now look at an arbitrary paper, say the one with PubMed ID 16820458 ("Changes in the spoilage-related microbiota of beef during refrigerated storage under different packaging conditions"). We can use the cited_by mapping to retrieve what we know of its list of references.
As mentioned above, because the process that generated the dataset asked for papers citing a given paper (and not for the papers a paper cites), the papers we get through cited_by are necessarily all members of our dataset, and we can therefore find them in Summaries.
In [16]:
paper_id = 16820458
refs = { id : Summaries[id].title for id in cited_by[paper_id] }
print(len(refs), 'references identified for the paper with id', paper_id)
refs
Out[16]:
If we look up the same paper in papers_citing, we see that some of the papers citing it are themselves in our dataset, but others are not (denoted here by '??'):
In [17]:
{ id : Summaries.get(id,['??'])[0] for id in papers_citing[paper_id] }
Out[17]:
Paper 17696886, for example, is not in our dataset and we do not have any direct information about it, but its repeated occurrence in other papers' citation lists does allow us to reconstruct a good portion of its references. Below is the list of papers in our dataset cited by that paper:
In [18]:
paper_id2 = 17696886
refs2 = { id : Summaries[id].title for id in cited_by[paper_id2] }
print(len(refs2), 'references identified for the paper with id', paper_id2)
refs2
Out[18]:
Now that we have a better understanding about the data we're dealing with, let us obtain again some basic statistics about our graph.
In [19]:
print('Number of core ids %d (100.00 %%)' % len(Ids))
with_cit = [ id for id in Ids if papers_citing[id]!=[] ]
print('Number of papers cited at least once: %d (%.2f %%)' % (len(with_cit), 100.*len(with_cit)/len(Ids)))
isolated = set( id for id in Ids if papers_citing[id]==[] and id not in cited_by )
print('Number of isolated nodes: %d (%.2f %%)\n\t' \
'(papers that are not cited by any others, nor do themselves cite any in the dataset)'% (
len(isolated), 100.*len(isolated)/len(Ids) ))
noCit_withRefs = [ id for id in Ids if papers_citing[id]==[] and id in cited_by ]
print('Number of dataset ids with no citations, but known references: %d (%.2f %%)' % (
len(noCit_withRefs), 100.*len(noCit_withRefs)/len(Ids)))
print('(percentages calculated with respect to just the core ids (members of `Ids`), i.e. excluding outsider ids)\n')
In [20]:
Ids_set = set( Ids )
citing_Ids = set( cited_by.keys() ) # == set( c for citing in papers_citing.values() for c in citing )
outsiders = citing_Ids - Ids_set # set difference: removes from `citing_Ids` all the ids that occur in `Ids_set`
nodes = (citing_Ids | Ids_set) - isolated # set union, followed by set difference
print('Number of (non-isolated) nodes in the graph: %d\n\t(papers with at least 1 known citation, or 1 known reference)' % len(nodes))
print(len( citing_Ids ), 'distinct ids are citing papers in our dataset.')
print('Of those, %d (%.2f %%) are ids from outside the dataset.\n' % ( len(outsiders), 100.*len(outsiders)/len(citing_Ids) ))
In [21]:
all_cits = [ c for citing in papers_citing.values() for c in citing ]
outsider_cits = [ c for citing in papers_citing.values() for c in citing if c in outsiders ]
print('Number of links (citations) in the graph:', len(all_cits))
print('A total of %d citations are logged in the dataset.' % len(all_cits))
print('Citations by ids from outside the dataset comprise %d (%.2f %%) of that total.\n' % (
len(outsider_cits),
100.*len(outsider_cits)/len(all_cits) ))
Let us now find which 10 papers are the most cited in our dataset.
In [22]:
nr_cits_per_paper = [ (id, len(cits)) for (id,cits) in papers_citing.items() ]
for (id, cits) in sorted( nr_cits_per_paper, key=lambda i:i[1], reverse=True )[:10]:
    display_summary( id, ', nr. citations: %d' % cits )
In order to use the citation network, we need to be able to perform some complex graph algorithms on it. To make our lives easier, we will load the data into the Python package NetworkX, a package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks. It provides a number of these graph algorithms (such as HITS and PageRank) out of the box.
You probably have to install the NetworkX package first.
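If it is not installed yet, it can typically be added from within the notebook itself (this assumes a standard pip-based Python environment):

!pip install networkx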
In [23]:
import networkx as nx
G = nx.DiGraph(cited_by)
We now have a NetworkX directed graph stored in G, where a node represents a paper and an edge represents a citation. This means we can now apply the algorithms and functions of NetworkX to our graph:
In [24]:
print(nx.info(G))
print(nx.is_directed(G))
print(nx.density(G))
As this graph was generated from citations only, we need to add all isolated nodes (nodes that are not cited and do not cite other papers) as well:
In [25]:
G.add_nodes_from(isolated)
In [26]:
print(nx.info(G))
print(nx.is_directed(G))
print(nx.density(G))
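As a small, optional illustration of what this gives us (using paper 16820458 from earlier; recall that edges were built from cited_by, so they point from a citing paper to the papers it cites), a node's in-degree counts its known citations and its out-degree its known references:

# in-degree = number of known citations, out-degree = number of known references
print('Citations known for paper 16820458: ', G.in_degree(16820458))
print('References known for paper 16820458:', G.out_degree(16820458))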
Your name: ...
In [27]:
# Add your code here
[Write your answer text here]
Using the link analysis algorithms provided by NetworkX, calculate the PageRank score for each node in the citation network and store the scores in a variable. Print out the PageRank values for the two example papers given below.
Hint: the pagerank_scipy implementation tends to be considerably faster than its regular pagerank counterpart (but you have to install the SciPy package for that).
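One possible sketch is shown below (the variable name pagerank is just a choice made here; on recent NetworkX versions the SciPy-based solver has been folded into nx.pagerank, so the plain call may be the one available to you):

pagerank = nx.pagerank_scipy(G)   # or: pagerank = nx.pagerank(G), depending on your NetworkX version

print('PageRank of paper 7168798: ', pagerank[7168798])
print('PageRank of paper 21056779:', pagerank[21056779])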
In [28]:
# Add your code here
# print PageRank for paper 7168798
# print PageRank for paper 21056779
In [29]:
# Add your code here
[Write your answer text here]