In this workbook we'll start to analyze our data diachronically. To do that, we'll make use of a special object called a Corpus
that provides some handy methods for organizing and indexing our Paper
s. We'll use the slice
method to index our Paper
s temporally and across journals, and plot the distribution of features across those slice axes.
We'll use the Web of Science dataset that we used in our last workbook. Since this is a new workbook, we'll have to load it again.
In [ ]:
from tethne.readers import wos
datadirpath = '/Users/erickpeirson/Downloads/datasets/wos'
MyCorpus = wos.read(datadirpath)
Think of a Corpus
as a container for your Paper
s. Your Paper
s are still here; you can access your Paper
s via the papers
attribute.
In [ ]:
MyCorpus.papers
In [ ]:
from tethne.networks import authors
cg = authors.coauthors(MyCorpus)
print 'This graph has {0} nodes and {1} edges, just like before!'.format(len(cg.nodes()), len(cg.edges()))
Your Paper
s are also indexed. For WoS datasets, they are indexed by wosid
(UT
in the original field-tagged data file).
In [ ]:
MyCorpus.indexed_papers.keys()[:10] # The first 10 keys in the Paper index.
So if you know the wosid of a Paper, you can retrieve it from Corpus.papers
:
In [ ]:
MyCorpus.indexed_papers['WOS:000305886800001']
Often we're interested in how networks evolve over time. In Tethne, you can access Paper
s in a time-variant fashion using the slice
method. slice
returns a generator that yields dates and subcorpora.
In [ ]:
[i for i in MyCorpus.slice('date') ]
The default behavior is to divide your Paper
s up into 1-year time-periods. You can visualize the distribution of Paper
s over time using the plot_distribution
method.
In [9]:
MyCorpus.axes['date'].keys()
Out[9]:
In [7]:
fig = MyCorpus.plot_distribution('date')
You can change how slice
divides up your corpus temporally using the method
, window_size
, step_size
, and cumulative
keyword arguments.
Here are some methods for slicing a Corpus, which you can specify using the method
keyword argument.
Method | Description |
---|---|
time_window |
Slices data using a sliding time-window. Dataslices are indexed by the start of the time-window. |
time_period |
Slices data into time periods of equal length. Dataslices are indexed by the start of the time period. |
The main difference between the sliding time-window (time_window
) and the time-period (time_period
) slicing methods are whether the resulting periods can overlap. Whereas time-period slicing divides data into subsets by sequential non-overlapping time periods, subsets generated by time-window slicing can overlap.
Time-period slicing, with a window-size of 4 years:
Time-window slicing, with a window-size of 4 years and a step-size of 1 year:
Here's what plot_distribution
looks like using a sliding time-window of 4 years:
In [10]:
MyCorpus.slice('date', method='time_window', window_size=4)
In [11]:
fig = MyCorpus.plot_distribution('date')
Note that it's a bit smoother, and the per-slice counts are much higher over all (max of ~700/slice, versus ~200/slice before).
Setting cumulative=True
means that all of the Paper
s in the slice at time 0 will be included in the slice at time 1, and so on.
In [12]:
MyCorpus.slice('date', cumulative=True)
In [13]:
fig = MyCorpus.plot_distribution('date')
Before we get to the stage of networking, we can visualize how certain features are distributed in our Corpus
. We'll develop the concept of features further along in these tutorials.
In Tethne, a feature is anything (a categorical variable, scalar, etc) that can be distributed over Paper
s. A cited reference, for example, can be a feature. So can a word. We can think of features in terms of the presence or absence of something (e.g. a cited reference), or in terms of a quantity (e.g. the number of times the word 'organism' appears in a Paper
).
A featureset is, as the name suggests, a collection of similar features. For example, we can think of all of the cited references in our Corpus
as a featureset. A vocabulary of words can also be a featureset.
Each Corpus
has an attribute called features
that holds featuresets. It's a dictionary, so you can see what featuresets your Corpus
contains by calling Corpus.features.keys
:
In [14]:
MyCorpus.features.keys()
Out[14]:
So far our Corpus
contains only a featureset called citations. Each featureset has an index of features that it contains.
In [15]:
print 'There are {0} cited references in this Corpus.'.format(len(MyCorpus.features['citations']['index']))
MyCorpus.features['citations']['index'].items()[0:10] # Only viewing the first 10 items, since there are so many.
Out[15]:
We can view the distribution of a feature across the slices in our Corpus
using plot_distribution
. This is a bit more complicated than before. We first create a dictionary that describes the feature we're interested in, and then we pass it and the mode='features'
keyword argument to plot_distribution
.
In [17]:
fkwargs = {
'featureset': 'citations',
'feature': 'FALCONER DS 1996 INTRO QUANTITATIVE G', # An infamous textbook on quant-gen!
'mode': 'counts', # The frequency of the citation in each slice.
'normed': True, # Normalized by the number of papers in each slice.
}
In [18]:
MyCorpus.slice('date') # We'll go back to 1-year time-period slicing for now.
fig = MyCorpus.plot_distribution('date', mode='features', fkwargs=fkwargs)
We can also slice our Corpus
using other elements of our data, such as journal names. To slice by journal name, try:
In [19]:
MyCorpus.slice('jtitle') # jtitle stands for 'journal title'
We can use plot_distribution
took look at the distribution of Paper
s across journals. In this example I use MatPlotLib's pyplot
module to control the layout of the figure (it's pretty huge).
In [16]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(50,145))
fig = MyCorpus.plot_distribution(y_axis='jtitle', fig=fig, aspect=0.6, interpolation='none')
And we can view the distribution of a feature, like citations of Falconer's Quantitative Genetics textbook, across journals in a similar way:
In [17]:
fig = plt.figure(figsize=(50,145))
fig = MyCorpus.plot_distribution(y_axis='jtitle', fig=fig, aspect=0.6, interpolation='none', mode='features', fkwargs=fkwargs)
Of course, these figures are enormous and not very informative as is. The idea is that you can use this as a starting point for your analysis, and ultimately generate your own plots.
In [ ]: