Link analysis

(Inspired by and borrowed heavily from: Collective Intelligence - Luís F. Simões. IR version and assignments by J.E. Hoeksema, 2014-11-12. Converted to Python 3 and minor changes by Tobias Kuhn, 2015-11-17.)


This notebook's purpose is to give examples of how to use graph algorithms to improve a search engine. We look at two graphs in particular: the co-authorship network and the citation network.

The citation network is similar to the link network of the web: Citations are like web links pointing to other documents. We can therefore apply the same network-based ranking methods.
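
To make the web analogy concrete, here is a minimal sketch (with made-up paper ids) that ranks papers by their number of incoming citations, the simplest possible link-based score. Later in this notebook we will use ready-made algorithms such as PageRank instead.

In [ ]:
# Toy example with hypothetical ids: rank papers by how often they are cited,
# just as web pages can be ranked by how many pages link to them.
toy_citations = {
    101: [102, 103, 104],  # paper 101 is cited by three papers
    102: [104],            # paper 102 is cited by one paper
    103: [],               # paper 103 is not cited at all
}
ranking = sorted(toy_citations, key=lambda p: len(toy_citations[p]), reverse=True)
print(ranking)  # expected: [101, 102, 103]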

Code from previous exercises


In [1]:
import pickle, bz2
from collections import *

import numpy as np
import matplotlib.pyplot as plt

# show plots inline within the notebook
%matplotlib inline
# set plots' resolution
plt.rcParams['savefig.dpi'] = 100

from IPython.display import display, HTML

In [2]:
Ids_file = 'data/air__Ids.pkl.bz2'
Summaries_file = 'data/air__Summaries.pkl.bz2'
Citations_file = 'data/air__Citations.pkl.bz2'
Abstracts_file = 'data/air__Abstracts.pkl.bz2'

In [3]:
Ids = pickle.load( bz2.BZ2File( Ids_file, 'rb' ) )

In [4]:
Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )

for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )

In [5]:
Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )

In [6]:
def display_summary( id, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long titles or author lists, and links to the paper's DOI (when available).
    """
    s = Summaries[ id ]
    
    title = ( s.title if s.title[-1]!='.' else s.title[:-1] )
    title = title[:150].rstrip() + ('' if len(title)<=150 else '...')
    if s.doi!='':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    
    authors = ', '.join( s.authors[:5] ) + ('' if len(s.authors)<=5 else ', ...')
    
    lines = [
        title,
        authors,
        str(s.year),
        '<small>id: %d%s</small>' % (id, extra_text)
        ]
    
    display( HTML( '<blockquote>%s</blockquote>' % '<br>'.join(lines) ) )

In [7]:
from math import log10

def tokenize(text):
    return text.split(' ')

def preprocess(tokens):
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

tf_matrix = defaultdict(Counter)
for (id, abstract) in Abstracts.items():
    tf_matrix[id] = Counter(preprocess(tokenize(abstract)))

def tf(t,d):
    return float(tf_matrix[d][t])

def df(t):
    return float(len(inverted_index[t]))

numdocs = float(len(Abstracts))

def num_documents():
    return numdocs

# We don't need to keep this object in memory any longer:
Abstracts = {}

Co-authorship network

Summaries maps paper ids to paper summaries. Let us now create some additional mappings, organized by different criteria.

We'll start by building a mapping from authors to the set of ids of papers they authored. We'll be using Python's sets again for that purpose.


In [8]:
papers_of_author = defaultdict(set)

for id,p in Summaries.items():
    for a in p.authors:
        papers_of_author[a].add( id )

In [9]:
papers_of_author['Vine AK']


Out[9]:
{3174025, 9002129}

In [10]:
for id in papers_of_author['Vine AK']:
    display_summary(id)


Thrombin in the management of full thickness macular holes
Vine AK, Johnson MW
1996
id: 9002129
Ocular complications associated with retrobulbar injections
Morgan CM, Schatz H, Vine AK, Cantrill HL, Davidorf FH, ...
1988
id: 3174025

We now build a co-authorship network: a graph linking each author to the set of co-authors they have published with.


In [11]:
coauthors = defaultdict(set)

for p in Summaries.values():
    for a in p.authors:
        coauthors[a].update( p.authors )

# The code above results in each author being listed as having co-authored with himself/herself.
# We remove these self-references here:
for a,ca in coauthors.items():
    ca.remove(a)

In [12]:
print(', '.join( coauthors['Vine AK'] ))


Rudich R, Morgan CM, Cantrill HL, Johnson MW, Gitter KA, Davidorf FH, Schatz H

Now we can have a look at some basic statistics about our graph:


In [13]:
print('Number of nodes: %8d (node = author)' % len(coauthors))
print('Number of links: %8d (link = collaboration between the two linked authors on at least one paper)'  \
        % sum( len(cas) for cas in coauthors.values() ))


Number of nodes:   378722 (node = author)
Number of links:  4156534 (link = collaboration between the two linked authors on at least one paper)

With this data in hand, we can plot the degree distribution of the co-authorship network, i.e. the fraction of scientists having a given number of collaborators:


In [14]:
plt.hist( x=[ len(ca) for ca in coauthors.values() ], bins=range(55), histtype='bar', align='left', density=True )
plt.xlabel('number of collaborators')
plt.ylabel('fraction of scientists')
plt.xlim(0,50);

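Besides the histogram, a few summary numbers can be read off the same degree data. A minimal sketch, reusing the coauthors mapping built above:

In [ ]:
# Summary statistics of the co-authorship degree distribution
# (the degree of an author is their number of distinct collaborators).
degrees = [ len(ca) for ca in coauthors.values() ]
print('mean number of collaborators   :', np.mean(degrees))
print('median number of collaborators :', np.median(degrees))
print('largest number of collaborators:', max(degrees))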

Citation network

We'll start by expanding the Citations dataset into two mappings:

  • papers_citing[id]: papers citing a given paper;
  • cited_by[id]: papers cited by a given paper (in other words, its list of references).

If we see the Citations dataset as a directed graph in which papers are nodes and citations are links between them, then papers_citing gives you the list of a node's incoming links, whereas cited_by gives you the list of its outgoing links.

The dataset was assembled by querying for papers citing a given paper. As a result, the ids that cited_by maps to (its values) are necessarily limited to papers that are part of the dataset.


In [15]:
papers_citing = Citations  # no changes needed, this is what we are storing already in the Citations dataset

cited_by = defaultdict(list)

for ref, papers_citing_ref in papers_citing.items():
    for id in papers_citing_ref:
        cited_by[ id ].append( ref )

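With both mappings in place, here is a small sketch of the directed-graph view described above: a paper's incoming links are the papers citing it, and its outgoing links are its known references (using as an example the paper we examine next).

In [ ]:
# In- and out-degree of a single node of the citation graph
# (the id is the example paper discussed below).
example_id = 16820458
print('incoming links (papers citing it)    :', len(papers_citing[example_id]))
print('outgoing links (its known references):', len(cited_by[example_id]))
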
Let us now look at an arbitrary paper, let's say PubMed ID 16820458 ("Changes in the spoilage-related microbiota of beef during refrigerated storage under different packaging conditions"). We can now use the cited_by mapping to retrieve what we know of its list of references.

As mentioned above, because the process generating the dataset asked for papers citing a given paper (and not for the papers a paper cites), the papers we get through cited_by are necessarily all members of our dataset, and we can therefore find them in Summaries.


In [16]:
paper_id = 16820458
refs = { id : Summaries[id].title for id in cited_by[paper_id] }
print(len(refs), 'references identified for the paper with id', paper_id)
refs


3 references identified for the paper with id 16820458
Out[16]:
{4066549: 'Time course of volatile compound formation during refrigerated storage of naturally contaminated beef in air.',
 11851808: 'Effect of oregano essential oil on microbiological and physico-chemical attributes of minced meat stored in air and modified atmospheres.',
 12382683: 'Preservation of fresh meat with active and modified atmosphere packaging conditions.'}

If we look up the same paper in papers_citing, we see that some of the papers citing it are themselves in our dataset, but others are not (denoted here by '??'):


In [17]:
{ id : Summaries.get(id,['??'])[0]  for id in papers_citing[paper_id] }


Out[17]:
{17142357: 'Role of broiler carcasses and processing plant air in contamination of modified-atmosphere-packaged broiler products with psychrotrophic lactic acid bacteria.',
 17293505: '??',
 17696886: '??',
 19201980: '??',
 21784913: 'Spoilage-related activity of Carnobacterium maltaromaticum strains in air-stored and vacuum-packed meat.',
 21803905: 'Monitoring of microbial metabolites and bacterial diversity in beef stored under different packaging conditions.',
 22207745: '??',
 22582378: '??',
 23936168: '??',
 24884520: '??',
 24928886: '??',
 25083110: '??'}

Paper 17696886, for example, is not in our dataset and we have no direct information about it, but because it shows up repeatedly among the papers citing members of our dataset, we can reconstruct a good portion of its reference list. Below is the list of papers in our dataset cited by that paper:


In [18]:
paper_id2 = 17696886
refs2 = { id : Summaries[id].title for id in cited_by[paper_id2] }
print(len(refs2), 'references identified for the paper with id', paper_id2)
refs2


5 references identified for the paper with id 17696886
Out[18]:
{7577360: 'Effect of growth of selected lactic acid bacteria on storage life of beef stored under vacuum and in air.',
 12361300: 'Carnobacterium viridans sp. nov., an alkaliphilic, facultative anaerobe isolated from refrigerated, vacuum-packed bologna sausage.',
 16820458: 'Changes in the spoilage-related microbiota of beef during refrigerated storage under different packaging conditions.',
 16834594: 'Biogenic amine formation and microbial spoilage in chilled garfish (Belone belone belone)--effect of modified atmosphere packaging and previous frozen storage.',
 17142357: 'Role of broiler carcasses and processing plant air in contamination of modified-atmosphere-packaged broiler products with psychrotrophic lactic acid bacteria.'}

Now that we have a better understanding of the data we're dealing with, let us again obtain some basic statistics about our graph.


In [19]:
print('Number of core ids %d (100.00 %%)' % len(Ids))

with_cit = [ id for id in Ids if papers_citing[id]!=[] ]
print('Number of papers cited at least once: %d (%.2f %%)' % (len(with_cit), 100.*len(with_cit)/len(Ids)))

isolated = set( id for id in Ids if papers_citing[id]==[] and id not in cited_by )
print('Number of isolated nodes: %d (%.2f %%)\n\t'   \
      '(papers that are not cited by any others, nor do themselves cite any in the dataset)'% (
    len(isolated), 100.*len(isolated)/len(Ids) ))

noCit_withRefs = [ id for id in Ids if papers_citing[id]==[] and id in cited_by ]
print('Number of dataset ids with no citations, but known references: %d (%.2f %%)' % (
    len(noCit_withRefs), 100.*len(noCit_withRefs)/len(Ids)))

print('(percentages calculated with respect to just the core ids (members of `Ids`) -- exclude outsider ids)\n')


Number of core ids 190555 (100.00 %)
Number of papers cited at least once: 85150 (44.69 %)
Number of isolated nodes: 101264 (53.14 %)
	(papers that are not cited by any others, nor do themselves cite any in the dataset)
Number of dataset ids with no citations, but known references: 4141 (2.17 %)
(percentages calculated with respect to just the core ids (members of `Ids`) -- exclude outsider ids)


In [20]:
Ids_set    = set( Ids )
citing_Ids = set( cited_by.keys() ) # == set( c for citing in papers_citing.values() for c in citing )

outsiders = citing_Ids - Ids_set    # set difference: removes from `citing_Ids` all the ids that occur in `Ids_set`
nodes     = citing_Ids | Ids_set - isolated     # set union and set difference (note: `-` binds more tightly than `|`)

print('Number of (non-isolated) nodes in the graph: %d\n\t(papers with at least 1 known citation, or 1 known reference)' % len(nodes))
print(len( citing_Ids ), 'distinct ids are citing papers in our dataset.')
print('Of those, %d (%.2f %%) are ids from outside the dataset.\n' % ( len(outsiders), 100.*len(outsiders)/len(citing_Ids) ))


Number of (non-isolated) nodes in the graph: 301238
	(papers with at least 1 known citation, or 1 known reference)
227033 distinct ids are citing papers in our dataset.
Of those, 211947 (93.36 %) are ids from outside the dataset.
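
As a quick sanity check, the paper with id 17696886 that we looked at earlier is exactly such an outsider: it cites papers in our dataset (and is therefore a node of the graph), but it is not one of the core ids.

In [ ]:
# The outsider paper examined above: a node of the graph, but not a core id.
example_outsider = 17696886
print(example_outsider in Ids_set)     # False: not a core id
print(example_outsider in citing_Ids)  # True: it cites papers in our dataset
print(example_outsider in nodes)       # True: it is a (non-isolated) node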


In [21]:
all_cits      = [ c for citing in papers_citing.values() for c in citing ]
outsider_cits = [ c for citing in papers_citing.values() for c in citing if c in outsiders ]

print('Number of links (citations) in the graph:', len(all_cits))
print('A total of %d citations are logged in the dataset.' % len(all_cits))
print('Citations by ids from outside the dataset comprise %d (%.2f %%) of that total.\n' % (
    len(outsider_cits),
    100.*len(outsider_cits)/len(all_cits) ))


Number of links (citations) in the graph: 473193
A total of 473193 citations are logged in the dataset.
Citations by ids from outside the dataset comprise 365543 (77.25 %) of that total.

Most cited papers

Let us now find which 10 papers are the most cited in our dataset.


In [22]:
nr_cits_per_paper = [ (id, len(cits)) for (id,cits) in papers_citing.items() ]

for (id, cits) in sorted( nr_cits_per_paper, key=lambda i:i[1], reverse=True )[:10]:
    display_summary( id, ', nr. citations: %d' % cits )


Random-effects models for longitudinal data
Laird NM, Ware JH
1982
id: 7168798, nr. citations: 960
An association between air pollution and mortality in six U.S. cities
Dockery DW, Pope CA 3rd, Xu X, Spengler JD, Ware JH, ...
1993
id: 8179653, nr. citations: 597
A simple method for organotypic cultures of nervous tissue
Stoppini L, Buchs PA, Muller D
1991
id: 1715499, nr. citations: 480
Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution
Pope CA 3rd, Burnett RT, Thun MJ, Calle EE, Krewski D, ...
2002
id: 11879110, nr. citations: 474
Particulate matter air pollution and cardiovascular disease: An update to the scientific statement from the American Heart Association
Brook RD, Rajagopalan S, Pope CA 3rd, Brook JR, Bhatnagar A, ...
2010
id: 20458016, nr. citations: 454
HADDOCK: a protein-protein docking approach based on biochemical or biophysical information
Dominguez C, Boelens R, Bonvin AM
2003
id: 12580598, nr. citations: 434
Bilirubin is an antioxidant of possible physiological importance
Stocker R, Yamamoto Y, McDonagh AF, Glazer AN, Ames BN
1987
id: 3029864, nr. citations: 373
Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. Air Force/Tex...
Downs JR, Clearfield M, Weis S, Whitney E, Shapiro DR, ...
1998
id: 9613910, nr. citations: 363

In order to use the citation network, we need to be able to perform some complex graph algorithms on it. To make our lives easier, we will load the data into the Python package NetworkX, a package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks, which provides a number of these graph algorithms (such as HITS and PageRank) out of the box.

You probably have to install the NetworkX package first.
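
Before applying NetworkX to our full citation graph, here is a minimal sketch of its API on a tiny hand-made graph (the node names are made up). PageRank and HITS are available as ready-made functions:

In [ ]:
import networkx as nx

# Toy graph: papers 'a' and 'b' both cite paper 'c'.
toy = nx.DiGraph()
toy.add_edges_from([('a', 'c'), ('b', 'c')])

print(nx.pagerank(toy))    # 'c' gets the highest PageRank score
hubs, authorities = nx.hits(toy)
print(authorities)         # 'c' is also the strongest authority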


In [23]:
import networkx as nx

G = nx.DiGraph(cited_by)

We now have a NetworkX Directed Graph stored in G, where a node represents a paper, and an edge represents a citation. This means we can now apply the algorithms and functions of NetworkX to our graph:


In [24]:
print(nx.info(G))
print(nx.is_directed(G))
print(nx.density(G))


Name: 
Type: DiGraph
Number of nodes: 301238
Number of edges: 473193
Average in degree:   1.5708
Average out degree:   1.5708
True
5.214590895602556e-06
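
As a quick consistency check, a node's in-degree in G should equal the number of citations we have recorded for it, since edges in this graph point from a citing paper to the paper it cites. A small sketch using the most cited paper from the ranking above:

In [ ]:
# The in-degree of a node should match the length of its papers_citing list.
example_id = 7168798
print(G.in_degree(example_id), len(papers_citing[example_id]))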

As this graph was generated from citations only, we need to add all isolated nodes (nodes that are not cited and do not cite other papers) as well:


In [25]:
G.add_nodes_from(isolated)

In [26]:
print(nx.info(G))
print(nx.is_directed(G))
print(nx.density(G))


Name: 
Type: DiGraph
Number of nodes: 402502
Number of edges: 473193
Average in degree:   1.1756
Average out degree:   1.1756
True
2.9208099879856355e-06

Assignments

Your name: ...

  • Plot the in-degree distribution (the distribution of the number of incoming links) for the citation network. What can you tell about the shape of this distribution, and what does this tell us about the network?

In [27]:
# Add your code here

[Write your answer text here]

  • Using the Link Analysis algorithms provided by NetworkX, calculate the PageRank score for each node in the citation network, and store them in a variable. Print out the PageRank values for the two example papers given below.

    Hint: the pagerank_scipy implementation tends to be considerably faster than its regular pagerank counterpart (but you have to install the SciPy package for that).


In [28]:
# Add your code here

# print PageRank for paper 7168798
# print PageRank for paper 21056779

  • Copy your search engine from mini-assignment 3, and create a version that incorporates a paper's PageRank score in its final score, in addition to tf-idf. Show the result of an example query, and explain your decision on how to combine the two scores (PageRank and tf-idf).

In [29]:
# Add your code here

[Write your answer text here]