In this workbook we will conduct a co-citation analysis using the approach outlined in Chen (2009). If you have used the Java-based desktop application CiteSpace II, this should be familiar: this is the same methodology that is implemented in that application.
In [3]:
%pylab inline
In [4]:
import matplotlib.pyplot as plt
Co-citation analysis gained popularity in the 1970s as a technique for “mapping” scientific literatures, and for finding latent semantic relationships among technical publications.
Two papers are co-cited if they are both cited by the same third paper. The standard approach to co-citation analysis is to generate a sample of bibliographic records from a particular field, using selected keywords or journal names, and then build a co-citation graph describing relationships among their cited references. Thus the majority of papers represented as nodes in the co-citation graph are not papers that matched the selection criteria used to build the dataset.
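To make this concrete, here is a minimal sketch (not Tethne's implementation) of how a co-citation graph could be built by hand with networkx, using toy data in which each paper is represented only by its list of cited references. Later in this workbook we will use Tethne's own cocitation graph builder instead.

import networkx as nx
from itertools import combinations

# Toy data: each "paper" is represented only by the references it cites.
papers = [
    ['SMITH 1990', 'JONES 1995', 'CHEN 2006'],
    ['SMITH 1990', 'JONES 1995'],
    ['JONES 1995', 'CHEN 2006'],
]

cocitation = nx.Graph()
for cited_refs in papers:
    # Every pair of references cited together in one paper is co-cited once.
    for u, v in combinations(sorted(set(cited_refs)), 2):
        if cocitation.has_edge(u, v):
            cocitation[u][v]['weight'] += 1
        else:
            cocitation.add_edge(u, v, weight=1)

print(cocitation.edges(data=True))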
Our objective in this tutorial is to identify papers that bridge the gap between otherwise disparate areas of knowledge in the scientific literature. In this tutorial, we rely on the theoretical framework described in Chen (2006) and Chen et al. (2009).
According to Chen, we can detect potentially transformative changes in scientific knowledge by looking for cited references that both (a) rapidly accrue citations, and (b) have high betweenness centrality in a co-citation network. It helps to think of each scientific paper as representing a “concept” (its core knowledge claim, perhaps), and a co-citation event as representing a proposition connecting two concepts in the knowledge base of a scientific field. If a new paper emerges that is highly co-cited with two otherwise-distinct clusters of concepts, that may mean the field is adopting new concepts and propositions in a way that is structurally radical for its conceptual framework.
Chen (2009) introduces sigma ($\Sigma$) as a metric for potentially transformative cited references:
$$ \Sigma(v) = (g(v) + 1)^{\mathrm{burstness}(v)} $$
where the betweenness centrality of each node $v$ is:
$$ g(v) = \sum\limits_{i \neq j \neq v} \frac{\sigma_{ij}(v)}{\sigma_{ij}} $$
where $\sigma_{ij}$ is the number of shortest paths from node $i$ to node $j$, and $\sigma_{ij}(v)$ is the number of those paths that pass through $v$. Burstness (normalized to the range 0–1) is estimated using Kleinberg’s (2002) automaton model, which is designed to detect spikes in the rate at which a feature occurs in a stream of documents.
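Given a co-citation graph and per-node burstness values, $\Sigma$ is simple to compute directly. The sketch below uses networkx's betweenness_centrality (which is normalized by default; Chen's formulation may use a different normalization) and a made-up dict of burstness values. It is only meant to illustrate the formula; Tethne will compute $\Sigma$ for us later in this workbook.

import networkx as nx

def sigma_scores(graph, burstness):
    """Chen's sigma: (betweenness centrality + 1) ** burstness, per node."""
    g = nx.betweenness_centrality(graph)   # g(v); normalized by networkx
    return {v: (g[v] + 1.0) ** burstness.get(v, 0.0) for v in graph}

# Toy example: nodes 3 and 4 bridge two triangles, so they have high
# betweenness; pretend they are also highly bursty.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
print(sigma_scores(G, {3: 0.9, 4: 0.9}))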
We'll use the same WoS dataset as in previous workbooks...
In [5]:
from tethne.readers import wos
datadirpath = '/Users/erickpeirson/Projects/tethne-notebooks/data/wos'
MyCorpus = wos.read(datadirpath)
Our first decision is about the time-resolution for our analysis. In this tutorial, we'll slice our Corpus into sequential two-year time periods.
Note: in previous versions of Tethne, slice() created new slice indices, which could then be accessed by other methods. As of v0.7, slice() returns a generator that yields subcorpora.
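For example, the subcorpora can be consumed directly in a for loop. The window_size keyword below is an assumption based on the distribution() call in the next cell; check the slice() documentation for the exact signature in your version of Tethne.

# Iterate over two-year subcorpora; slice() is a generator, so nothing is
# materialized until we loop over it. The keyword name is assumed to mirror
# distribution() -- consult the Tethne docs if this raises a TypeError.
for subcorpus in MyCorpus.slice(window_size=2):
    print(len(subcorpus))   # number of papers in each two-year subcorpus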
In [6]:
years, values = MyCorpus.distribution(window_size=2)
plt.plot(years, values)
plt.show()
We will use the GraphCollection.build method to generate a cocitation GraphCollection. The method_kwargs parameter lets us set keyword arguments for the networks.papers.cocitation graph builder; min_weight sets the minimum number of cocitations required for an edge to be included in the graph.
In [7]:
from tethne import GraphCollection
In [8]:
CoCitation = GraphCollection()
CoCitation.build(MyCorpus, 'cocitation', method_kwargs={'min_weight': 3})
Kleinberg’s (2002) burstness model is a popular approach for detecting “bursts” of interest or activity in streams of data (e.g. identifying trending terms in Twitter feeds). Chen (2009) suggests that we apply this model to citations. The idea is that the (observed) frequency with which a reference is cited is a function of an (unobserved) level or state of interest surrounding that reference. Kleinberg uses a hidden Markov model to infer the most likely sequence of “burstness” states for an event (a cited reference, in our case) over time. His algorithm is implemented in tethne.analyze.corpus.burstness(), and can be used for any feature in our Corpus.
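To give a feel for what the model does, here is a highly simplified two-state sketch written from scratch; it is not Tethne's or Kleinberg's exact implementation. Each time period is assigned either a baseline or a bursty state, the bursty state expects an elevated rate of events, and a transition cost discourages entering the bursty state too readily. The rate ratio s and transition cost gamma are arbitrary illustrative choices.

import math

def burst_states(counts, totals, s=2.0, gamma=1.0):
    """Simplified two-state, Kleinberg-style burst detection (illustrative only).

    counts[t] -- occurrences of one feature (e.g. citations to one reference) in period t
    totals[t] -- total occurrences of all features in period t
    Returns a list of 0/1 labels, where 1 marks a "bursty" period.
    """
    n = len(counts)
    p0 = sum(counts) / float(sum(totals))   # baseline rate of the feature
    p1 = min(1.0 - 1e-9, s * p0)            # elevated rate expected in the bursty state
    trans = gamma * math.log(n)             # cost of entering the bursty state

    def fit_cost(t, p):
        # Negative log-likelihood of counts[t] out of totals[t] at rate p
        # (the binomial coefficient is identical for both states, so it is dropped).
        r, d = counts[t], totals[t]
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    # Viterbi-style dynamic programming over the two states.
    cost = [[0.0, 0.0] for _ in range(n)]
    back = [[0, 0] for _ in range(n)]
    cost[0] = [fit_cost(0, p0), fit_cost(0, p1) + trans]
    for t in range(1, n):
        for state, p in enumerate((p0, p1)):
            options = [cost[t - 1][0] + (trans if state == 1 else 0.0),  # from baseline
                       cost[t - 1][1]]                                   # from bursty
            prev = 0 if options[0] <= options[1] else 1
            back[t][state] = prev
            cost[t][state] = options[prev] + fit_cost(t, p)

    # Trace back the cheapest sequence of states.
    states = [0 if cost[n - 1][0] <= cost[n - 1][1] else 1]
    for t in range(n - 1, 0, -1):
        states.append(back[t][states[-1]])
    return states[::-1]

# A reference cited rarely, then heavily, then rarely again: periods 2 and 3
# come out as bursty with these (arbitrary) parameter choices.
print(burst_states([1, 1, 8, 9, 2, 1], [50, 50, 50, 50, 50, 50]))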
Since citations are features in our Corpus, we can use the burstness function in tethne.analyze.corpus to get burstness profiles for the top-cited references in our dataset.
In [9]:
from tethne.analyze.corpus import burstness
In [10]:
B = burstness(MyCorpus, 'citations', k=5, topn=5, perslice=True)
In [14]:
list(B.items())[1]   # burstness profile for one of the top-cited references
Out[14]:
We can visualize the results of the burstness algorithm using the plot_burstness() function in tethne.plot.
In [12]:
from tethne.plot import plot_burstness
In [18]:
plot_burstness(MyCorpus, B)
Burstness values are normalized with respect to the highest possible burstness state, so a value of 1.0 corresponds to the most intense state. Years prior to the first occurrence of each feature are grayed out. Periods in which the feature was bursty are depicted as colored blocks whose opacity indicates the intensity of the burst.
Chen (2009) proposed sigma ($\Sigma$) as a metric for potentially transformative cited references:
$$ \Sigma(v) = (g(v) + 1)^{\mathrm{burstness}(v)} $$
The tethne.analyze.corpus module provides a sigma function for calculating $\Sigma$ from a cocitation GraphCollection and a Corpus in one step.
In [19]:
from tethne.analyze.corpus import sigma
In [20]:
S = sigma(CoCitation, MyCorpus, 'citations')
In [21]:
list(S.items())[0:10]   # inspect the first ten entries
Out[21]:
The plot_sigma function generates a figure that shows $\Sigma$ values for the top nodes in the corpus.
In [22]:
from tethne.plot import plot_sigma
In [23]:
fig = plot_sigma(MyCorpus, S, topn=5, perslice=True) # The top 5 citations per slice.
The nodes in our CoCitation GraphCollection were updated with a new 'sigma' node attribute.
In [24]:
graphs = list(CoCitation.values())
list(graphs[-1].nodes(data=True))[0]   # Attributes for a node in the GraphCollection
Out[24]:
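Because the graphs in the GraphCollection are ordinary networkx graphs, we can, for example, rank the nodes in the most recent slice by this attribute. The snippet below is a sketch that assumes the 'sigma' attribute name shown above; node identifiers may be integer indices rather than citation labels, depending on how Tethne indexes the graph.

# Rank nodes in the most recent subgraph by their 'sigma' attribute,
# using standard networkx node-attribute access.
latest = list(CoCitation.values())[-1]
ranked = sorted(latest.nodes(data=True),
                key=lambda item: item[1].get('sigma', 0.0),
                reverse=True)
for node, attrs in ranked[:5]:
    print(node, attrs.get('sigma'))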
We can export our CoCitation GraphCollection using tethne.writers.collection.to_dxgmml.
In [25]:
from tethne.writers import collection
In [26]:
outpath = '/Users/erickpeirson/Projects/tethne-notebooks/output/my_cocitation.xgmml'
collection.to_dxgmml(CoCitation, outpath)
In the visualizations below, node and label sizes are mapped to $\Sigma$, and border width is mapped to the number of citations for each respective node in each slice.