In this mini-assignment, we will exploit graph algorithms to improve search results. For our dataset of scientific papers, we look at two graphs in particular: the co-authorship network and the citation network.
The citation network is similar to the link network of the web: Citations are like web links pointing to other documents. We can therefore apply the same network-based ranking methods.
In [1]:
import pickle, bz2
from collections import defaultdict, namedtuple, Counter
from math import log10, sqrt
from IPython.display import display, HTML
import matplotlib.pyplot as plt
# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100
In [2]:
Ids_file = 'data/malaria__Ids.pkl.bz2'
Summaries_file = 'data/malaria__Summaries.pkl.bz2'
Citations_file = 'data/malaria__Citations.pkl.bz2'
Abstracts_file = 'data/malaria__Abstracts.pkl.bz2'
Ids = pickle.load( bz2.BZ2File( Ids_file, 'rb' ) )
Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )
Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )
paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )
In [3]:
def display_summary( id, show_abstract=False, show_id=True, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long author lists, and adds a link to the paper's DOI (when available).
    """
    s = Summaries[id]
    lines = []
    title = s.title
    if s.doi != '':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    title = '<strong>' + title + '</strong>'
    lines.append(title)
    authors = ', '.join( s.authors[:20] ) + ('' if len(s.authors) <= 20 else ', ...')
    lines.append(str(s.year) + '. ' + authors)
    if (show_abstract):
        lines.append('<small><strong>Abstract:</strong> <em>%s</em></small>' % Abstracts[id])
    if (show_id):
        lines.append('[ID: %d]' % id)
    if (extra_text != ''):
        lines.append(extra_text)
    display( HTML('<br>'.join(lines)) )
In [4]:
def tokenize(text):
    return text.split(' ')

def preprocess(tokens):
    result = []
    for token in tokens:
        result.append(token.lower())
    return result
In [5]:
inverted_index = defaultdict(set)
for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

tf_matrix = defaultdict(Counter)
length_values = defaultdict(int)
for (doc_id, abstract) in Abstracts.items():
    tokens = preprocess(tokenize(abstract))
    tf_matrix[doc_id] = Counter(tokens)
    l = 0
    for t in tf_matrix[doc_id].keys():
        l += tf_matrix[doc_id][t] ** 2
    length_values[doc_id] = sqrt(l)

def tf(t,d):
    return float(tf_matrix[d][t])

def df(t):
    return float(len(inverted_index[t]))

def num_documents():
    return float(len(Abstracts))

def length_tf(d):
    return length_values[d]

def tfidf(t,d):
    return tf(t,d) * log10(num_documents()/df(t))
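As a quick, hypothetical sanity check of these helpers (the example id and term are simply whatever happens to come first in the data; any abstract and any of its terms would do):
In [ ]:
example_id = next(iter(Abstracts))                              # an arbitrary paper id that has an abstract
example_term = preprocess(tokenize(Abstracts[example_id]))[0]   # an arbitrary term from that abstract
print('term:', example_term)
print('tf:', tf(example_term, example_id), ' df:', df(example_term), ' tf-idf:', tfidf(example_term, example_id))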
We start by building a mapping from authors to the set of identifiers of papers they authored. We'll be using Python's sets again for that purpose.
In [6]:
papers_of_author = defaultdict(set)
for (id, p) in Summaries.items():
    for a in p.authors:
        papers_of_author[a].add(id)
Let's try it out:
In [7]:
papers_of_author['Clauset A']
Out[7]:
In [8]:
for id in papers_of_author['Clauset A']:
    display_summary(id)
We can now build a co-authorship network, that is, a graph linking authors to the set of co-authors they have published with:
In [9]:
coauthors = defaultdict(set)
for p in Summaries.values():
    for a in p.authors:
        coauthors[a].update(p.authors)

# The code above results in each author being listed as having co-authored with himself/herself.
# We remove these self-references here:
for (a, ca) in coauthors.items():
    ca.remove(a)
And let's try it out again:
In [10]:
print(', '.join( coauthors['Clauset A'] ))
Now we can have a look at some basic statistics about our graph:
In [11]:
print('Number of nodes (authors): ', len(coauthors))
coauthor_rel_count = sum( len(c) for c in coauthors.values() )
print('Number of links (co-authorship relations): ', coauthor_rel_count)
With this data at hand, we can plot the degree distribution, showing how many co-authors each researcher has published with:
In [12]:
plt.hist( x=[ len(ca) for ca in coauthors.values() ], bins=range(60) )
plt.xlabel('number of co-authors')
plt.ylabel('number of researchers')
plt.xlim(0,51);
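To complement the histogram with a single summary number (a small extra, not part of the original statistics above), we can also compute the average number of co-authors per researcher:
In [ ]:
# coauthor_rel_count was computed above as the total number of co-authorship relations
avg_coauthors = coauthor_rel_count / len(coauthors)
print('Average number of co-authors per researcher: %.2f' % avg_coauthors)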
Next, we can look at the citation network. We'll start by expanding our data about citations into two mappings:
papers_citing[id]: papers citing a given paper
cited_by[id]: papers cited by a given paper (in other words: its list of references)
papers_citing will give us the list of a node's incoming links, whereas cited_by will give us the list of its outgoing links.
In [13]:
papers_citing = Citations # no changes needed, this is what we are storing already in the Citations dataset
cited_by = defaultdict(list)
for ref, papers_citing_ref in papers_citing.items():
    for id in papers_citing_ref:
        cited_by[ id ].append( ref )
In [14]:
display_summary(24130474)
As we are dealing with a subset of the data, papers_citing can contain references to papers outside of our subset. The way we created cited_by, on the other hand, means it only contains references from within our dataset, so it is incomplete with respect to the whole dataset. Nevertheless, we can use this citation network on our subset of malaria-related papers to implement link analysis techniques.
Let us now look at an example paper, say the one with identifier 24130474. We can use the cited_by mapping to retrieve its (incomplete) list of references:
In [15]:
paper_id = 24130474
refs = { id : Summaries[id].title for id in cited_by[paper_id] }
print(len(refs), 'references found for paper', paper_id)
refs
Out[15]:
If we look up the same paper in papers_citing, we see that some of the papers citing it are themselves in our dataset, but others are not (shown below as '??'):
In [16]:
{ id : Summaries.get(id,['??'])[0] for id in papers_citing[paper_id] }
Out[16]:
Paper 25122340, for example, is not in our dataset and we do not have any direct information about it, but the fact that it appears among the papers citing several papers in our dataset does allow us to reconstruct some of its references. Below is the list of papers in our dataset cited by that paper:
In [17]:
paper_id2 = 25122340
refs2 = { id : Summaries[id].title for id in cited_by[paper_id2] }
print(len(refs2), 'references identified for the paper with id', paper_id2)
refs2
Out[17]:
Now that we have a better understanding of the data we're dealing with, let us again obtain some basic statistics about our graph.
In [18]:
n = len(Ids)
print('Number of papers in our subset: %d (%.2f %%)' % (n, 100.0) )
with_citation = [ id for id in Ids if papers_citing[id] != [] ]
with_citation_rel = 100. * len(with_citation) / n
print('Number of papers cited at least once: %d (%.2f %%)' % (len(with_citation), with_citation_rel) )
isolated = set( id for id in Ids if papers_citing[id] == [] and id not in cited_by )
isolated_rel = 100. * len(isolated) / n
print('Number of isolated nodes: %d (%.2f %%)' % (len(isolated), isolated_rel) )
In [19]:
id_set = set( Ids )
citing_set = set( cited_by.keys() )
outsiders = citing_set - id_set # set difference
nodes = citing_set | id_set # set union
non_isolated = nodes - isolated # set difference
print('Overall number of nodes: %d (%.2f %%)' % (len(nodes), 100.0) )
non_isolated_rel = 100. * len(non_isolated) / len(nodes)
print('Number of non-isolated nodes: %d (%.2f %%)' % (len(non_isolated), non_isolated_rel) )
outsiders_rel = 100. * len(outsiders) / len(nodes)
print('Number of nodes outside our subset: %d (%.2f %%)' % ( len(outsiders), outsiders_rel ) )
In [20]:
all_citations = [ c for citing in papers_citing.values() for c in citing ]
outsider_citations = [ c for citing in papers_citing.values() for c in citing if c in outsiders ]
print('Overall number of links (citations): %d (%.2f %%)' % (len(all_citations), 100.0) )
outsider_citations_rel = 100. * len(outsider_citations) / len(all_citations)
print('Citations from outside the subset: %d (%.2f %%)' % (len(outsider_citations), outsider_citations_rel) )
Let us now find which 10 papers are the most cited in our dataset.
In [21]:
citation_count_per_paper = [ (id, len(citations)) for (id,citations) in papers_citing.items() ]
sorted_by_citation_count = sorted(citation_count_per_paper, key=lambda i:i[1], reverse=True)
for (id, c) in sorted_by_citation_count[:10]:
    display_summary(id, extra_text = 'Citation count: ' + str(c))
In order to use the citation network, we need to be able to perform some complex graph algorithms on it. To make our lives easier, we will use NetworkX, a Python package for dealing with complex networks. You might have to install the NetworkX package first.
In [22]:
import networkx as nx
G = nx.DiGraph(cited_by)
We now have a NetworkX Directed Graph stored in G, where a node represents a paper, and an edge represents a citation. This means we can now apply the algorithms and functions of NetworkX to our graph:
In [23]:
print(nx.info(G))
print('Directed graph:', nx.is_directed(G))
print('Density of graph:', nx.density(G))
As this graph was generated from citations only, we need to add all isolated nodes (nodes that are not cited and do not cite other papers) as well:
In [24]:
G.add_nodes_from(isolated)
And now we get slightly different values:
In [25]:
print(nx.info(G))
print('Directed graph:', nx.is_directed(G))
print('Density of graph:', nx.density(G))
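As an extra consistency check (not part of the original notebook), a paper's in-degree in G should match the number of papers we recorded as citing it, because every paper that cites something in our dataset became a key of cited_by and therefore a source node of the corresponding edges (duplicate entries in a citation list, if any, would be collapsed by the DiGraph):
In [ ]:
# Compare the graph's in-degree with the raw citation list for our example paper.
print('In-degree of paper 24130474:       ', G.in_degree(24130474))
print('Length of papers_citing[24130474]: ', len(papers_citing[24130474]))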
Your name: ...
In [ ]:
# Add your code here
Answer: [Write your answer text here]
Using the Link Analysis algorithms provided by NetworkX, calculate the PageRank score for each node in the citation network, and store them in a variable. Print out the PageRank values for the two example papers given below.
You can also use the pagerank_scipy implementation, which tends to be considerably faster than its regular pagerank counterpart (but you have to install the SciPy package for that). To print and compare PageRank values, you might want to use a command like print('%.6f' % var) to get regular decimal notation with a fixed number of decimal places.
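As a starting point, a minimal sketch of such a computation could look like the following (this assumes the graph G built above with default PageRank parameters; storing the scores in a dictionary called pagerank_scores is just one possible choice):
In [ ]:
# Sketch only: compute PageRank over the citation graph and look up two papers.
pagerank_scores = nx.pagerank(G)            # or nx.pagerank_scipy(G), if SciPy is installed
print('%.6f' % pagerank_scores[10399593])   # PageRank of paper 10399593
print('%.6f' % pagerank_scores[23863622])   # PageRank of paper 23863622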
In [ ]:
# Add your code here
# print PageRank for paper 10399593
# print PageRank for paper 23863622
In [ ]:
# Add your code here
Answer: [Write your answer text here]
Copy the scoring function score_ntn from Task 4 of mini-assignment 3. Rename it to score_ntn_pagerank and change its code to incorporate a paper's PageRank score in its final score, in addition to tf-idf. In other words, the new function should return a single value that is calculated based on both scores (PageRank and tf-idf). Explain your decision on how to combine the two scores.
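One possible way of combining the two quantities is sketched below. This is only an illustration, resting on two assumptions that are not spelled out in this notebook: that score_ntn takes a list of query terms and a document id and sums their tf-idf values, and that the PageRank scores from the previous task are stored in a dictionary called pagerank_scores.
In [ ]:
def score_ntn_pagerank(query_words, doc_id):
    # Hypothetical sketch: assumes score_ntn summed tf-idf values per query term,
    # and that pagerank_scores maps paper ids to their PageRank values.
    tfidf_score = sum(tfidf(t, doc_id) for t in query_words)
    return tfidf_score * pagerank_scores.get(doc_id, 0.0)
A multiplicative combination like this boosts well-cited papers without ignoring textual relevance; a weighted sum of normalized scores would be an equally defensible choice.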
In [ ]:
# Add your code here
Answer: [Write your answer text here]
Copy the query function query_ntn from Task 4 of mini-assignment 3. Rename it to query_ntn_pagerank and change the code to use our new scoring function score_ntn_pagerank from the task above. Demonstrate these functions with an example query that returns our paper 10399593 from above as the top result.
In [ ]:
# Add your code here