Introduction to Tethne: Co-citation analysis

In this workbook we will conduct a co-citation analysis using the approach outlined in Chen (2009). If you have used the Java-based desktop application CiteSpace II, this should be familiar, since that application implements the same methodology.

Before you start

  • Download the practice dataset from here, and store it in a place where you can find it. You'll need the full path to your dataset.
  • Complete the tutorial "Time-variant networks"

In [3]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [4]:
import matplotlib.pyplot as plt

Background

Co-citation analysis gained popularity in the 1970s as a technique for “mapping” scientific literatures, and for finding latent semantic relationships among technical publications.

Two papers are co-cited if they are both cited by the same, third, paper. The standard approach to co-citation analysis is to generate a sample of bibliographic records from a particular field by using certain keywords or journal names, and then build a co-citation graph describing relationships among their cited references. Thus most of the papers that appear as nodes in the co-citation graph are not themselves papers that matched the selection criteria used to build the dataset.
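To make the definition concrete, here is a minimal sketch in plain Python that counts co-citations in a toy dataset. It is an illustration only, not Tethne's implementation: the paper and reference labels and the cocitation_counts helper are made up for this example.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers):
    """Count how often each pair of references is cited by the same paper.

    `citing_papers` maps a citing paper (hypothetical label) to the list of
    references that it cites.
    """
    counts = Counter()
    for references in citing_papers.values():
        # Every unordered pair of distinct references in one reference list
        # is one co-citation event.
        for pair in combinations(sorted(set(references)), 2):
            counts[pair] += 1
    return counts


# Toy data: three citing papers and the references they cite.
toy_corpus = {
    'paper 1': ['REF A', 'REF B', 'REF C'],
    'paper 2': ['REF A', 'REF B'],
    'paper 3': ['REF B', 'REF C'],
}

toy_counts = cocitation_counts(toy_corpus)
for pair, weight in sorted(toy_counts.items()):
    print(pair, weight)
```

Note that none of the citing papers appears in the result: the nodes of the co-citation graph are the cited references, and the counts become edge weights.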

Our objective in this tutorial is to identify papers that bridge the gap between otherwise disparate areas of knowledge in the scientific literature. We rely on the theoretical framework described in Chen (2006) and Chen et al. (2009).

According to Chen, we can detect potentially transformative changes in scientific knowledge by looking for cited references that both (a) rapidly accrue citations, and (b) have high betweenness-centrality in a co-citation network. It helps if we think of each scientific paper as representing a “concept” (its core knowledge claim, perhaps), and a co-citation event as representing a proposition connecting two concepts in the knowledge-base of a scientific field. If a new paper emerges that is highly co-cited with two otherwise-distinct clusters of concepts, then that might mean that the field is adopting new concepts and propositions in a way that is structurally radical for its conceptual framework.

Chen (2009) introduces sigma ($\Sigma$) as a metric for potentially transformative cited references:

$$ \Sigma(v) = (g(v) + 1)^{burstness(v)} $$

...where the betweenness centrality of each node v is:

$$ g(v) = \sum\limits_{i\neq j\neq v} \frac{\sigma_{ij} (v)}{\sigma_{ij}} $$

...where $\sigma_{ij}$ is the number of shortest paths from node i to node j, and $\sigma_{ij}(v)$ is the number of those paths that pass through v. Burstness (normalized to the range 0 to 1) is estimated using Kleinberg’s (2002) automaton model, which is designed to detect rate spikes around features in a stream of documents.
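The betweenness term can be computed with Brandes' algorithm. The sketch below is a plain-Python illustration on a hypothetical five-node graph, not the code that Tethne uses. It assumes an undirected, unweighted graph and counts each unordered pair (i, j) once, which is one common convention; counting ordered pairs would double every score.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness centrality via Brandes' algorithm."""
    score = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        # Breadth-first search from `source`, counting shortest paths.
        n_paths = dict.fromkeys(adjacency, 0)
        n_paths[source] = 1
        dist = {source: 0}
        preds = {v: [] for v in adjacency}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    # v lies on a shortest path from source to w.
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating the share of
        # shortest paths that pass through each predecessor.
        dependency = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                share = n_paths[v] / float(n_paths[w])
                dependency[v] += share * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was visited from both of its endpoints.
    return {v: total / 2.0 for v, total in score.items()}


# Hypothetical graph: node 'c' bridges the clusters {a, b} and {d, e}.
toy_graph = {
    'a': ['b', 'c'],
    'b': ['a', 'c'],
    'c': ['a', 'b', 'd', 'e'],
    'd': ['c', 'e'],
    'e': ['c', 'd'],
}

g = betweenness(toy_graph)
print(g['c'], g['a'])
```

In this toy graph every shortest path between the two clusters runs through 'c', so it is the only node with non-zero betweenness. This is the structural signature that the sigma metric rewards.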

Loading data

We'll use the same WoS dataset as in previous workbooks...


In [5]:
from tethne.readers import wos
datadirpath = '/Users/erickpeirson/Projects/tethne-notebooks/data/wos'
MyCorpus = wos.read(datadirpath)

Time-slicing

Our first decision is about the time-resolution for our analysis. In this tutorial, we'll slice our Corpus into two-year sequential time periods.

Note: in previous versions of Tethne, slice() created new slice indices, which could then be accessed by other methods. As of v0.7, slice() returns a generator that yields subcorpora.
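The idea of a generator that yields one subcorpus per time period can be sketched in plain Python. This is not Tethne's slice() method: the slice_by_window helper, the 'year' key, and the toy records are assumptions made for illustration, and the windows here are sequential and non-overlapping.

```python
def slice_by_window(records, window_size=2):
    """Yield (start_year, records) for consecutive, non-overlapping windows."""
    years = [record['year'] for record in records]
    if not years:
        return
    for start in range(min(years), max(years) + 1, window_size):
        end = start + window_size
        window = [r for r in records if start <= r['year'] < end]
        yield start, window


# Hypothetical records; a real Corpus holds full bibliographic records.
toy_records = [
    {'title': 'w', 'year': 1994},
    {'title': 'x', 'year': 1995},
    {'title': 'y', 'year': 1996},
    {'title': 'z', 'year': 1999},
]

summary = [(start, len(window)) for start, window in slice_by_window(toy_records)]
print(summary)
```

Because the function is a generator, each window is only assembled when the caller asks for it, which keeps memory use low for large collections.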


In [6]:
years, values = MyCorpus.distribution(window_size=2)
plt.plot(years, values)
plt.show()


Co-citation graph

We will use the GraphCollection.build method to generate a cocitation GraphCollection.

The method_kwargs parameter lets us set keyword arguments for the networks.papers.cocitation graph builder. min_weight sets the minimum number of cocitations for an edge to be included in the graph.


In [7]:
from tethne import GraphCollection

In [8]:
CoCitation = GraphCollection()
CoCitation.build(MyCorpus, 'cocitation', method_kwargs={'min_weight': 3})
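The effect of min_weight can be pictured as a threshold on edge weights. The sketch below is a plain-Python illustration with made-up reference labels and a hypothetical filter_edges helper. It assumes the threshold is inclusive, in line with the description of min_weight as a minimum; it does not show how Tethne applies the parameter internally.

```python
def filter_edges(weights, min_weight=3):
    """Keep only the edges co-cited at least `min_weight` times."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Hypothetical co-citation counts for three pairs of references.
toy_weights = {
    ('REF A', 'REF B'): 5,
    ('REF A', 'REF C'): 3,
    ('REF B', 'REF C'): 1,
}

kept = filter_edges(toy_weights, min_weight=3)
print(sorted(kept))
```

Raising the threshold removes weakly connected pairs, which tends to produce a sparser graph that is easier to read but may drop emerging connections.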

Burstness

Kleinberg’s (2002) burstness model is a popular approach for detecting “bursts” of interest or activity in streams of data (e.g. identifying trending terms in Twitter feeds). Chen (2009) suggests that we apply this model to citations. The idea is that the (observed) frequency with which a reference is cited is a product of an (unobserved) level or state of interest surrounding that citation. Kleinberg uses a hidden Markov model to infer the most likely sequence of “burstness” states for an event (a cited reference, in our case) over time. His algorithm is implemented in tethne.analyze.corpus.burstness(), and can be used for any feature in our Corpus.
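The state-inference step can be sketched as a small dynamic program. The code below is a simplified illustration in the spirit of Kleinberg's batched automaton, not Tethne's implementation. Its assumptions: each state i expects the feature at rate p0 * s**i, where p0 is the overall rate; moving up one state costs gamma and moving down is free; and the helper names, default values, and toy counts are made up. It also assumes the feature occurs at least once and not in every document.

```python
import math


def slice_cost(rate, hits, total):
    """Negative log-likelihood of `hits` out of `total` at the given rate."""
    log_choose = (math.lgamma(total + 1) - math.lgamma(hits + 1)
                  - math.lgamma(total - hits + 1))
    return -(log_choose + hits * math.log(rate)
             + (total - hits) * math.log(1.0 - rate))


def burst_states(hits, totals, n_states=2, s=2.0, gamma=1.0):
    """Return the cheapest sequence of states, one per time slice."""
    p0 = float(sum(hits)) / sum(totals)
    rates = [min(p0 * s ** i, 0.9999) for i in range(n_states)]
    # cost[j]: cheapest cost of any sequence that ends in state j.
    # The automaton starts in state 0.
    cost = [0.0] + [float('inf')] * (n_states - 1)
    back = []
    for r, d in zip(hits, totals):
        new_cost, pointers = [], []
        for j in range(n_states):
            best = min(range(n_states),
                       key=lambda i: cost[i] + gamma * max(j - i, 0))
            pointers.append(best)
            new_cost.append(cost[best] + gamma * max(j - best, 0)
                            + slice_cost(rates[j], r, d))
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the best final state.
    state = min(range(n_states), key=lambda j: cost[j])
    path = []
    for pointers in reversed(back):
        path.append(state)
        state = pointers[state]
    path.reverse()
    return path


def normalized(states, n_states):
    """Scale states so that the highest possible state maps to 1.0."""
    return [state / float(n_states - 1) for state in states]


# Hypothetical counts: citations to one reference per slice, out of 100
# reference-list entries per slice, with a spike in the third and fourth.
toy_hits = [2, 2, 20, 22, 2, 2]
toy_totals = [100] * 6

print(burst_states(toy_hits, toy_totals))
print(normalized(burst_states(toy_hits, toy_totals, n_states=5), 5))
```

With two states the spike is flagged as a burst. With five states the same spike only reaches the first elevated state, so its normalized burstness is lower. This is why the number of states affects the scale of the values reported.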

Since citations are features in our Corpus, we can use the burstness function in tethne.analyze.corpus to get the burstness profiles for the top-cited references in our dataset.


In [9]:
from tethne.analyze.corpus import burstness

In [10]:
B = burstness(MyCorpus, 'citations', k=5, topn=5, perslice=True)

In [14]:
B.items()[1]


Out[14]:
(u'MEYER SE 1989 AM J BOT',
 ([1994, 1995, 1997, 2000, 2001, 2006], [0.4, 0.6, 0.6, 0.6, 0.6, 0.6]))

We can visualize the results of the burstness algorithm using the plot_burstness() function in tethne.plot.


In [12]:
from tethne.plot import plot_burstness

In [18]:
plot_burstness(MyCorpus, B)


Burstness values are normalized with respect to the highest possible burstness state. In other words, a burstness of 1.0 corresponds to the highest possible state.

Years prior to the first occurrence of each feature are grayed out. Periods in which the feature was bursty are depicted by colored blocks, the opacity of which indicates burstness intensity.

Sigma, $\Sigma$

Chen (2009) proposed sigma ($\Sigma$) as a metric for potentially transformative cited references:

$$ \Sigma(v) = (g(v) + 1)^{burstness(v)} $$
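As a worked example of the formula as written, the following sketch evaluates it for a few hypothetical pairs of betweenness and burstness values. The sigma_score helper and the input values are illustrative only.

```python
def sigma_score(centrality, burstness):
    """Evaluate (g(v) + 1) ** burstness(v) for a single node."""
    return (centrality + 1.0) ** burstness


# Hypothetical (betweenness, burstness) pairs.
print(sigma_score(3.0, 0.5))   # central and moderately bursty
print(sigma_score(3.0, 0.0))   # central but not bursty
print(sigma_score(0.0, 1.0))   # bursty but not central
```

Under this expression a node needs both non-zero betweenness and non-zero burstness to score above 1; if either term is zero, the result is exactly 1.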

The tethne.analyze.corpus module provides a sigma() function that calculates $\Sigma$ from a cocitation GraphCollection and a Corpus in one step.


In [19]:
from tethne.analyze.corpus import sigma

In [20]:
S = sigma(CoCitation, MyCorpus, 'citations')

In [21]:
S.items()[0:10]


Out[21]:
[(u'LEGER EA 2007 J EVOLUTION BIOL',
  [(2009, 2011, 2013), (0.006221135579585146, 0.0, 0.008468515817537714)]),
 (u'ROTH TC 2010 P ROY SOC B', [(2012,), (0.0,)]),
 (u'MEYER SE 1989 AM J BOT', [(1994,), (0.0,)]),
 (u'MULLER H 1989 WEED RES', [(2012,), (0.0,)]),
 (u'BERVEN KA 1990 ECOLOGY', [(2008, 2009), (0.0, 0.0)]),
 (u'HWANG SY 1997 OECOLOGIA', [(2007,), (0.0,)]),
 (u'CAMPBELL LG 2007 NEW PHYTOL', [(2009,), (0.0,)]),
 (u'PARKER IM 2003 CONSERV BIOL',
  [(2007, 2009, 2010, 2011, 2012, 2013),
   (0.03101505931486015,
    0.00010581637349571515,
    0.0,
    0.0001152248450788651,
    0.0,
    0.0)]),
 (u'SIEMANN E 2003 ECOL APPL', [(2008,), (0.0,)]),
 (u'SIMS RA 1989 FIELD GUIDE FOREST E', [(1996,), (0.0,)])]
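Each value in this output pairs a tuple of years with a tuple of $\Sigma$ values. One way to read such a structure is to find each reference's peak value and the year in which it occurred. The sketch below does this in plain Python for three of the entries shown above, copied into a hypothetical S_excerpt dictionary so that the example is self-contained.

```python
# Three entries copied from the output above.
S_excerpt = {
    'LEGER EA 2007 J EVOLUTION BIOL': [
        (2009, 2011, 2013),
        (0.006221135579585146, 0.0, 0.008468515817537714)],
    'ROTH TC 2010 P ROY SOC B': [(2012,), (0.0,)],
    'PARKER IM 2003 CONSERV BIOL': [
        (2007, 2009, 2010, 2011, 2012, 2013),
        (0.03101505931486015, 0.00010581637349571515, 0.0,
         0.0001152248450788651, 0.0, 0.0)],
}

# For each reference, keep the highest value and the year it occurred in.
peaks = {}
for reference, (years, values) in S_excerpt.items():
    peaks[reference] = max(zip(values, years))

# Rank references by their peak value, highest first.
ranked = sorted(peaks.items(), key=lambda item: item[1][0], reverse=True)
for reference, (value, year) in ranked:
    print(reference, year, value)
```

Among these three entries, PARKER IM 2003 CONSERV BIOL has the highest peak, in the 2007 slice.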

The plot_sigma function in tethne.plot generates a figure that shows $\Sigma$ values for the top nodes in the corpus.


In [22]:
from tethne.plot import plot_sigma

In [23]:
fig = plot_sigma(MyCorpus, S, topn=5, perslice=True)    # The top 5 citations per slice.


The nodes in our CoCitation GraphCollection were updated with a new 'sigma' node attribute.


In [24]:
CoCitation.values()[-1].nodes(data=True)[0]   # Attributes for a node in the GraphCollection


Out[24]:
(513, {'count': 8.0, 'documentCount': 8, 'sigma': 0.032113761390449636})

Export and visualize

We can export our CoCitation GraphCollection using tethne.writers.collection.to_dxgmml.


In [25]:
from tethne.writers import collection

In [26]:
outpath = '/Users/erickpeirson/Projects/tethne-notebooks/output/my_cocitation.xgmml'
collection.to_dxgmml(CoCitation, outpath)

In the visualizations below, node and label sizes are mapped to $\Sigma$, and border width is mapped to the number of citations for each respective node in each slice.