Mini-Assignment 4: Link Analysis

In this mini-assignment, we will exploit graph algorithms to improve search results. For our dataset of scientific papers, we look at two graphs in particular: the co-authorship network and the citation network.

The citation network is similar to the link network of the web: Citations are like web links pointing to other documents. We can therefore apply the same network-based ranking methods.
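The analogy can be made concrete with a small sketch. Below is a naive power-iteration PageRank on a hypothetical three-document graph; the graph, the function, and the damping factor are purely illustrative (NetworkX, used later on, provides a proper implementation):

```python
# Hypothetical toy graph: edges point from a citing document to the
# document it cites, just like web links or paper citations.
links = {
    'A': ['B', 'C'],  # A cites B and C
    'B': ['C'],       # B cites C
    'C': ['A'],       # C cites A
}

def simple_pagerank(links, damping=0.85, iterations=50):
    """Naive power-iteration PageRank; fine for toy graphs like this one."""
    n = len(links)
    rank = { node: 1.0 / n for node in links }
    for _ in range(iterations):
        new_rank = { node: (1.0 - damping) / n for node in links }
        for node, targets in links.items():
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = simple_pagerank(links)
# 'C' collects links from both 'A' and 'B', so it ends up ranked highest
```

The intuition carries over directly: a paper cited by many (well-cited) papers accumulates a high score, exactly as a web page linked from many pages does.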

Code from previous exercises


In [1]:
import pickle, bz2
from collections import defaultdict, namedtuple, Counter
from math import log10, sqrt
from IPython.display import display, HTML
import matplotlib.pyplot as plt

# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100

In [2]:
Ids_file = 'data/malaria__Ids.pkl.bz2'
Summaries_file = 'data/malaria__Summaries.pkl.bz2'
Citations_file = 'data/malaria__Citations.pkl.bz2'
Abstracts_file = 'data/malaria__Abstracts.pkl.bz2'

Ids = pickle.load( bz2.BZ2File( Ids_file, 'rb' ) )
Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )
Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )

for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )

In [3]:
def display_summary( id, show_abstract=False, show_id=True, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long author lists, and adds a link to the paper's DOI (when available).
    """
    s = Summaries[id]
    lines = []
    title = s.title
    if s.doi != '':
        title = '<a href="http://dx.doi.org/%s">%s</a>' % (s.doi, title)
    title = '<strong>' + title + '</strong>'
    lines.append(title)
    authors = ', '.join( s.authors[:20] ) + ('' if len(s.authors) <= 20 else ', ...')
    lines.append(str(s.year) + '. ' + authors)
    if (show_abstract):
        lines.append('<small><strong>Abstract:</strong> <em>%s</em></small>' % Abstracts[id])
    if (show_id):
        lines.append('[ID: %d]' % id)
    if (extra_text != ''):
         lines.append(extra_text)
    display( HTML('<br>'.join(lines)) )

In [4]:
def tokenize(text):
    return text.split(' ')

def preprocess(tokens):
    return [ token.lower() for token in tokens ]

In [5]:
inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

tf_matrix = defaultdict(Counter)
length_values = defaultdict(int)

for (doc_id, abstract) in Abstracts.items():
    tokens = preprocess(tokenize(abstract))
    tf_matrix[doc_id] = Counter(tokens)
    length_values[doc_id] = sqrt( sum( freq ** 2 for freq in tf_matrix[doc_id].values() ) )

def tf(t,d):
    return float(tf_matrix[d][t])

def df(t):
    return float(len(inverted_index[t]))
    
def num_documents():
    return float(len(Abstracts))

def length_tf(d):
    return length_values[d]

def tfidf(t,d):
    return tf(t,d) * log10(num_documents()/df(t))

Co-authorship network

We start by building a mapping from authors to the set of identifiers of papers they authored. We'll be using Python's sets again for that purpose.


In [6]:
papers_of_author = defaultdict(set)

for (id, p) in Summaries.items():
    for a in p.authors:
        papers_of_author[a].add(id)

Let's try it out:


In [7]:
papers_of_author['Clauset A']


Out[7]:
{24130474, 26456841}

In [8]:
for id in papers_of_author['Clauset A']:
    display_summary(id)


Ape parasite origins of human malaria virulence genes.
2015. Larremore DB, Sundararaman SA, Liu W, Proto WR, Clauset A, Loy DE, Speede S, Plenderleith LJ, Sharp PM, Hahn BH, Rayner JC, Buckee CO
[ID: 26456841]
A network approach to analyzing highly recombinant malaria parasite genes.
2013. Larremore DB, Clauset A, Buckee CO
[ID: 24130474]

We can now build a co-authorship network, that is, a graph linking each author to the set of co-authors they have published with:


In [9]:
coauthors = defaultdict(set)

for p in Summaries.values():
    for a in p.authors:
        coauthors[a].update(p.authors)

# The code above results in each author being listed as one of their own co-authors.
# We remove these self-references here:
for (a, ca) in coauthors.items():
    ca.remove(a)

And let's try it out again:


In [10]:
print(', '.join( coauthors['Clauset A'] ))


Liu W, Plenderleith LJ, Rayner JC, Sharp PM, Larremore DB, Buckee CO, Proto WR, Loy DE, Speede S, Hahn BH, Sundararaman SA

Now we can have a look at some basic statistics about our graph:


In [11]:
print('Number of nodes (authors): ', len(coauthors))

coauthor_rel_count = sum( len(c) for c in coauthors.values() )
print('Number of links (co-authorship relations): ', coauthor_rel_count)


Number of nodes (authors):  114572
Number of links (co-authorship relations):  2488486

With this data at hand, we can plot the degree distribution by showing the number of collaborators a scientist has published with:


In [12]:
plt.hist( x=[ len(ca) for ca in coauthors.values() ], bins=range(60) )
plt.xlabel('number of co-authors')
plt.ylabel('number of researchers')
plt.xlim(0,51);
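Beyond the histogram, the same mapping can answer simple degree questions directly. A minimal sketch on a hypothetical toy network (the names and structure are made up, but the dictionary has the same shape as coauthors):

```python
# Hypothetical toy mapping in the same shape as `coauthors`
toy_coauthors = {
    'Smith J': {'Jones A', 'Lee K'},
    'Jones A': {'Smith J'},
    'Lee K':   {'Smith J'},
}

# degree of a node = number of distinct co-authors
degrees = { author: len(ca) for (author, ca) in toy_coauthors.items() }

most_connected = max(degrees, key=degrees.get)
average_degree = sum(degrees.values()) / len(degrees)
```

The same two lines, applied to the real coauthors dictionary, would reveal the most collaborative researcher and the mean number of collaborators in our dataset.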


Citation network

Next, we can look at the citation network. We'll start by expanding our citation data into two mappings:

  • papers_citing[id]: papers citing a given paper
  • cited_by[id]: papers cited by a given paper (in other words: its list of references)

papers_citing will give us the list of a node's incoming links, whereas cited_by will give us the list of its outgoing links.


In [13]:
papers_citing = Citations  # no changes needed, this is what we are storing already in the Citations dataset

cited_by = defaultdict(list)

for ref, papers_citing_ref in papers_citing.items():
    for id in papers_citing_ref:
        cited_by[ id ].append( ref )
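To see what this inversion does, here is the same loop run on a hypothetical miniature citation mapping (the ids 1, 2, 3 are made up for illustration):

```python
from collections import defaultdict

# Miniature stand-in for `papers_citing`:
# paper 1 is cited by papers 2 and 3; paper 2 is cited by paper 3
toy_papers_citing = {1: [2, 3], 2: [3]}

toy_cited_by = defaultdict(list)
for ref, citing_papers in toy_papers_citing.items():
    for id in citing_papers:
        toy_cited_by[id].append(ref)

# toy_cited_by now maps each paper to its references:
# paper 2 cites 1, and paper 3 cites both 1 and 2
```

The inversion simply flips the direction of every link: an incoming link of one paper becomes an outgoing link of another.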

In [14]:
display_summary(24130474)


A network approach to analyzing highly recombinant malaria parasite genes.
2013. Larremore DB, Clauset A, Buckee CO
[ID: 24130474]

As we are dealing with a subset of the data, papers_citing can contain references to papers outside of our subset. cited_by, on the other hand, was constructed from our dataset alone, so it only contains backward references from within it, meaning that it is incomplete with respect to the whole dataset. Nevertheless, we can use this citation network on our subset of malaria-related papers to implement link analysis techniques.

Let us now look at an example paper, say the one with identifier 24130474. We can use the cited_by mapping to retrieve its (incomplete) list of references:


In [15]:
paper_id = 24130474
refs = { id : Summaries[id].title for id in cited_by[paper_id] }
print(len(refs), 'references found for paper', paper_id)
refs


26 references found for paper 24130474
Out[15]:
{7606788: 'The large diverse gene family var encodes proteins involved in cytoadherence and antigenic variation of Plasmodium falciparum-infected erythrocytes.',
 9500614: 'Parasite antigens on the infected red cell surface are targets for naturally acquired immunity to malaria.',
 9916084: 'Antibody recognition of Plasmodium falciparum erythrocyte surface antigens in Kenya: evidence for rare and prevalent variants.',
 10086393: 'Immunity to non-cerebral severe malaria is acquired after one or two infections.',
 10714439: 'Antibodies to variable Plasmodium falciparum-infected erythrocyte surface antigens are associated with protection from novel malaria infections.',
 11069183: 'Frequent ectopic recombination of virulence factor genes in telomeric chromosome clusters of P. falciparum.',
 11071284: 'Classification of adhesive domains in the Plasmodium falciparum erythrocyte membrane protein 1 family.',
 11349035: 'Antibodies to variant antigens on the surfaces of infected erythrocytes are associated with protection from malaria in Ghanaian children.',
 11544371: 'Antigenic variation at the infected red cell surface in malaria.',
 11827798: 'The role of antibodies to Plasmodium falciparum-infected-erythrocyte surface antigens in naturally acquired immunity to malaria.',
 12368864: 'Genome sequence of the human malaria parasite Plasmodium falciparum.',
 14565852: 'Sub-grouping of Plasmodium falciparum 3D7 var genes based on sequence analysis of coding and non-coding regions.',
 14651636: 'Evidence for the importance of genetic structuring to the structural and functional specialization of the Plasmodium falciparum var gene family.',
 16304608: 'Plasmodium falciparum variant surface antigen expression patterns during malaria.',
 16697476: 'Global genetic diversity and evolution of var genes associated with placental and severe childhood malaria.',
 16814594: 'A family affair: var genes, PfEMP1 binding, and malaria disease.',
 17286864: 'Patterns of gene recombination shape var gene repertoires in Plasmodium falciparum: comparisons of geographically diverse isolates.',
 17367208: 'Population genomics of the immune evasion (var) genes of Plasmodium falciparum.',
 17669514: 'Structural polymorphism and diversifying selection on the pregnancy malaria vaccine candidate VAR2CSA.',
 18395207: 'Frequent recombination events generate diversity within the multi-copy variant antigen gene families of Plasmodium falciparum.',
 20018734: 'Plasmodium falciparum var gene expression is modified by host immunity.',
 20862303: 'Plasmodium falciparum erythrocyte membrane protein 1 diversity in seven genomes--divide and conquer.',
 22496547: 'Prognostic indicators of life-threatening malaria are associated with distinct parasite variant antigen profiles.',
 22511852: 'Evolution of the multi-domain structures of virulence genes in the human malaria parasite, Plasmodium falciparum.',
 22850879: 'Targets of antibodies against Plasmodium falciparum-infected erythrocytes in malaria immunity.',
 23408914: 'Mitotic evolution of Plasmodium falciparum shows a stable core genome but recombination in antigen families.'}

If we look up the same paper in papers_citing, we see that some of the papers citing it are themselves in our dataset, while others are not (shown below as '??'):


In [16]:
{ id : Summaries.get(id,['??'])[0]  for id in papers_citing[paper_id] }


Out[16]:
{25122340: '??',
 25303095: '??',
 25368109: 'Immune characterization of Plasmodium falciparum parasites with a shared genetic signature in a region of decreasing transmission.',
 25521112: '??',
 26456841: 'Ape parasite origins of human malaria virulence genes.',
 27306566: '??'}

Paper 25122340, for example, is not in our dataset and we do not have any direct information about it, but its repeated occurrence in other papers' citation lists does allow us to reconstruct some of its references. Below is the list of papers in our dataset cited by that paper:


In [17]:
paper_id2 = 25122340
refs2 = { id : Summaries[id].title for id in cited_by[paper_id2] }
print(len(refs2), 'references identified for the paper with id', paper_id2)
refs2


2 references identified for the paper with id 25122340
Out[17]:
{20862303: 'Plasmodium falciparum erythrocyte membrane protein 1 diversity in seven genomes--divide and conquer.',
 24130474: 'A network approach to analyzing highly recombinant malaria parasite genes.'}

Now that we have a better understanding of the data we're dealing with, let us again obtain some basic statistics about our graph.


In [18]:
n = len(Ids)
print('Number of papers in our subset: %d (%.2f %%)' % (n, 100.0) )

with_citation = [ id for id in Ids if papers_citing[id] != [] ]
with_citation_rel = 100. * len(with_citation) / n
print('Number of papers cited at least once: %d (%.2f %%)' % (len(with_citation), with_citation_rel) )

isolated = set( id for id in Ids if papers_citing[id] == [] and id not in cited_by )
isolated_rel = 100. * len(isolated) / n
print('Number of isolated nodes: %d (%.2f %%)' % (len(isolated), isolated_rel) )


Number of papers in our subset: 67028 (100.00 %)
Number of papers cited at least once: 45340 (67.64 %)
Number of isolated nodes: 18765 (28.00 %)

In [19]:
id_set = set( Ids )
citing_set = set( cited_by.keys() )

outsiders = citing_set - id_set   # set difference
nodes = citing_set | id_set   # set union
non_isolated = nodes - isolated   # set difference

print('Overall number of nodes: %d (%.2f %%)' % (len(nodes), 100.0) )

non_isolated_rel = 100. * len(non_isolated) / len(nodes)
print('Number of non-isolated nodes: %d (%.2f %%)' % (len(non_isolated), non_isolated_rel) )

outsiders_rel = 100. * len(outsiders) / len(nodes)
print('Number of nodes outside our subset: %d (%.2f %%)' % ( len(outsiders), outsiders_rel ) )


Overall number of nodes: 146869 (100.00 %)
Number of non-isolated nodes: 128104 (87.22 %)
Number of nodes outside our subset: 79841 (54.36 %)
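The set algebra used above is worth internalizing; a toy illustration with hypothetical ids shows each operation in isolation:

```python
inside = {1, 2, 3}       # ids belonging to our subset
citing = {2, 3, 4, 5}    # ids observed citing at least one paper

toy_outsiders = citing - inside     # difference: citing papers outside the subset
toy_nodes = citing | inside         # union: every id we know about
toy_non_isolated = toy_nodes - {1}  # difference again, supposing paper 1 were isolated
```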

In [20]:
all_citations = [ c for citing in papers_citing.values() for c in citing ]
outsider_citations = [ c for citing in papers_citing.values() for c in citing if c in outsiders ]

print('Overall number of links (citations): %d (%.2f %%)' % (len(all_citations), 100.0) )

outsider_citations_rel = 100. * len(outsider_citations) / len(all_citations)
print('Citations from outside the subset: %d (%.2f %%)' % (len(outsider_citations), outsider_citations_rel) )


Overall number of links (citations): 552219 (100.00 %)
Citations from outside the subset: 172094 (31.16 %)

Let us now find which 10 papers are the most cited in our dataset.


In [21]:
citation_count_per_paper = [ (id, len(citations)) for (id,citations) in papers_citing.items() ]
sorted_by_citation_count = sorted(citation_count_per_paper, key=lambda i:i[1], reverse=True)

for (id, c) in sorted_by_citation_count[:10]:
    display_summary(id, extra_text = 'Citation count: ' + str(c))


Human malaria parasites in continuous culture.
1976. Trager W, Jensen JB
[ID: 781840]
Citation count: 1503
Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: a systematic analysis for the Global Burden of Disease Study 2010.
2012. Lozano R, Naghavi M, Foreman K, Lim S, Shibuya K, Aboyans V, Abraham J, Adair T, Aggarwal R, Ahn SY, Alvarado M, Anderson HR, Anderson LM, Andrews KG, Atkinson C, Baddour LM, Barker-Collo S, Bartels DH, Bell ML, Benjamin EJ, ...
[ID: 23245604]
Citation count: 1396
OrthoMCL: identification of ortholog groups for eukaryotic genomes.
2003. Li L, Stoeckert CJ Jr, Roos DS
[ID: 12952885]
Citation count: 1187
Genome sequence of the human malaria parasite Plasmodium falciparum.
2002. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, ...
[ID: 12368864]
Citation count: 1063
Mass spectrometry-based proteomics.
2003. Aebersold R, Mann M
[ID: 12634793]
Citation count: 967
Global and regional burden of disease and risk factors, 2001: systematic analysis of population health data.
2006. Lopez AD, Mathers CD, Ezzati M, Jamison DT, Murray CJ
[ID: 16731270]
Citation count: 918
Artemisinin resistance in Plasmodium falciparum malaria.
2009. Dondorp AM, Nosten F, Yi P, Das D, Phyo AP, Tarning J, Lwin KM, Ariey F, Hanpithakpong W, Lee SJ, Ringwald P, Silamut K, Imwong M, Chotivanich K, Lim P, Herdman T, An SS, Yeung S, Singhasivanon P, Day NP, ...
[ID: 19641202]
Citation count: 807
The global distribution of clinical episodes of Plasmodium falciparum malaria.
2005. Snow RW, Guerra CA, Noor AM, Myint HY, Hay SI
[ID: 15759000]
Citation count: 776
Synchronization of Plasmodium falciparum erythrocytic stages in culture.
1979. Lambros C, Vanderberg JP
[ID: 383936]
Citation count: 775
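As a side note, when only the top few entries are needed, heapq.nlargest from the standard library avoids sorting the whole collection. A sketch on hypothetical citation counts (the ids and counts are made up):

```python
import heapq

# Hypothetical citation counts per paper id
toy_counts = {101: 5, 102: 17, 103: 2, 104: 9}

# Retrieve the two most-cited ids without a full sort
top2 = heapq.nlargest(2, toy_counts.items(), key=lambda item: item[1])
# top2 == [(102, 17), (104, 9)]
```

For our ~67k papers a full sort is fast enough, so the sorted() approach above is perfectly fine; nlargest mainly pays off on much larger collections.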

In order to use the citation network, we need to be able to perform some complex graph algorithms on it. To make our lives easier, we will use NetworkX, a Python package for dealing with complex networks. You might have to install the NetworkX package first.


In [22]:
import networkx as nx

G = nx.DiGraph(cited_by)

We now have a NetworkX Directed Graph stored in G, where a node represents a paper, and an edge represents a citation. This means we can now apply the algorithms and functions of NetworkX to our graph:


In [23]:
print(nx.info(G))
print('Directed graph:', nx.is_directed(G))
print('Density of graph:', nx.density(G))


Name: 
Type: DiGraph
Number of nodes: 128104
Number of edges: 552219
Average in degree:   4.3107
Average out degree:   4.3107
Directed graph: True
Density of graph: 3.365033205197717e-05

As this graph was generated from citations only, we need to add all isolated nodes (nodes that are not cited and do not cite other papers) as well:


In [24]:
G.add_nodes_from(isolated)

And now we get slightly different values:


In [25]:
print(nx.info(G))
print('Directed graph:', nx.is_directed(G))
print('Density of graph:', nx.density(G))


Name: 
Type: DiGraph
Number of nodes: 146869
Number of edges: 552219
Average in degree:   3.7599
Average out degree:   3.7599
Directed graph: True
Density of graph: 2.560082886552999e-05
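The density NetworkX reports follows the standard directed-graph formula, edges / (nodes × (nodes − 1)): adding isolated nodes grows the denominator while the edge count stays fixed, which is why both the average degree and the density drop. A quick sketch of that effect with made-up numbers:

```python
def directed_density(num_nodes, num_edges):
    # fraction of all possible directed edges that are actually present
    return num_edges / (num_nodes * (num_nodes - 1))

d_before = directed_density(3, 4)  # 3 nodes, 4 edges
d_after = directed_density(5, 4)   # same 4 edges, two isolated nodes added
# d_after < d_before: isolated nodes dilute the density
```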

Assignments

Your name: ...

Task 1

Plot the in-degree distribution (the distribution of the number of incoming links) for the citation network. What can you tell about the shape of this distribution, and what does this tell us about the network?


In [ ]:
# Add your code here

Answer: [Write your answer text here]

Task 2

Using the Link Analysis algorithms provided by NetworkX, calculate the PageRank score for each node in the citation network, and store the scores in a variable. Print out the PageRank values for the two example papers given below. You can also use the pagerank_scipy implementation, which tends to be considerably faster than its regular pagerank counterpart (but you have to install the SciPy package for that). To print and compare PageRank values, you might want to use commands like print('%.6f' % var) to show them in regular decimal notation with a fixed number of decimal places.


In [ ]:
# Add your code here

# print PageRank for paper 10399593
# print PageRank for paper 23863622

Task 3

Why do the two papers above have such different PageRank values? Write code below to investigate this, and then explain the cause of the difference based on the results generated by your code.


In [ ]:
# Add your code here

Answer: [Write your answer text here]

Task 4

Copy the scoring function score_ntn from Task 4 of mini-assignment 3. Rename it to score_ntn_pagerank and change its code to incorporate a paper's PageRank score in its final score, in addition to tf-idf. In other words, the new function should return a single value calculated from both scores (PageRank and tf-idf). Explain your decision on how to combine the two scores.


In [ ]:
# Add your code here

Answer: [Write your answer text here]

Task 5

Copy the query function query_ntn from Task 4 of mini-assignment 3. Rename it to query_ntn_pagerank and change the code to use our new scoring function score_ntn_pagerank from task 4 above. Demonstrate these functions with an example query that returns our paper 10399593 from above as the top result.


In [ ]:
# Add your code here