In [1]:
%matplotlib inline
In [2]:
from pprint import pprint
import matplotlib.pyplot as plt
A Corpus is a collection of Papers with superpowers. Most importantly, it provides a consistent way of indexing bibliographic records. Indexing is important because it sets the stage for all of the subsequent analyses that we may wish to do with our bibliographic data.
In 1. Loading Data, part 1 we used the read function in tethne.readers.wos to parse a collection of Web of Science field-tagged data files and build a Corpus.
In [11]:
from tethne.readers import wos
datapath = '/Users/erickpeirson/Downloads/datasets/wos'
corpus = wos.read(datapath)
In this notebook, we'll dive deeper into the guts of the Corpus, focusing on indexing and features.
index_by
The primary indexing field is the field that Tethne uses to identify each of the Papers in your dataset. Ideally, every record in your bibliographic dataset will have this field. Good candidates include DOIs, URIs, or other unique identifiers.
Depending on which module you use, read will make assumptions about which field to use as the primary index for the Papers in your dataset. The default for Web of Science data, for example, is 'wosid' (the value of the UT field-tag).
In [5]:
print 'The primary index field for the Papers in my Corpus is "%s"' % corpus.index_by
The primary index for your Corpus can be found in the indexed_papers attribute. indexed_papers is a dictionary that maps the value of the indexing field for each Paper onto that Paper itself.
In [7]:
corpus.indexed_papers.items()[0:10] # We'll just show the first ten Papers, for the sake of space.
Out[7]:
So if you know (in this case) the wosid of a Paper, you can retrieve that Paper by passing the wosid to indexed_papers:
In [8]:
corpus.indexed_papers['WOS:000321911200011']
Out[8]:
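To make the idea of a primary index concrete, here is a minimal sketch in plain Python that does not use Tethne at all. The records, the 'WOS:0001'-style identifiers, and the build_primary_index helper are all hypothetical; the only thing taken from the text above is the shape of the mapping (index value onto record).

```python
# Toy records standing in for Papers; field names and values are made up.
records = [
    {'wosid': 'WOS:0001', 'title': 'First paper'},
    {'wosid': 'WOS:0002', 'title': 'Second paper'},
]


def build_primary_index(records, index_by):
    """Map the value of the ``index_by`` field of each record onto the record."""
    return {record[index_by]: record for record in records}


indexed = build_primary_index(records, 'wosid')

# Retrieval by identifier is a single dictionary lookup.
paper = indexed['WOS:0002']
print(paper['title'])
```

Because the index is an ordinary dictionary, asking for an identifier that is not present raises a KeyError in this sketch.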
If you'd prefer to index by a different field, you can pass the index_by parameter to read.
In [12]:
otherCorpus = wos.read(datapath, index_by='doi')
In [13]:
print 'The primary index field for the Papers in this other Corpus is "%s"' % otherCorpus.index_by
If some of the Papers lack the indexing field that you specified with the index_by parameter, Tethne will automatically generate a unique identifier for each of those Papers. For example, in our otherCorpus that we indexed by doi, most of the papers have valid DOIs, but a few (#1, below) do not; for those, a nonsensical-looking sequence of alphanumeric characters is used instead.
In [15]:
i = 0
for doi, paper in otherCorpus.indexed_papers.items()[0:10]:
    print '(%i) DOI: %s \t ---> \t Paper: %s' % (i, doi.ljust(30), paper)
    i += 1
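The fallback behaviour described above can be sketched in plain Python. This is not Tethne's implementation: the build_index_with_fallback helper, the example DOI, and the use of a random UUID as the generated identifier are all assumptions made for illustration.

```python
import uuid


def build_index_with_fallback(records, index_by):
    """Index records by ``index_by``, inventing a key when the field is missing."""
    indexed = {}
    for record in records:
        key = record.get(index_by)
        if not key:
            # Assumption: a random UUID hex string stands in for whatever
            # identifier the library actually generates.
            key = uuid.uuid4().hex
        indexed[key] = record
    return indexed


records = [
    {'doi': '10.1000/example.1', 'title': 'Has a DOI'},   # Made-up DOI.
    {'title': 'No DOI at all'},
]
indexed = build_index_with_fallback(records, 'doi')

for key, record in indexed.items():
    print('%s ---> %s' % (key.ljust(34), record['title']))
```

Every record ends up in the index, but the generated keys carry no meaning outside of this one index, so they cannot be used to match records across datasets.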
In [16]:
print 'The following Paper fields have been indexed: \n\n\t%s' % '\n\t'.join(corpus.indices.keys())
The 'citations' index, for example, allows us to look up all of the Papers that contain a particular bibliographic reference:
In [18]:
for citation, papers in corpus.indices['citations'].items()[7:10]:    # Show just three entries, for space's sake.
    print 'The following Papers cite %s: \n\n\t%s \n' % (citation, '\n\t'.join(papers))
Notice that the values above are not Papers themselves, but identifiers. These are the same identifiers used in the primary index, so we can use them to look up Papers:
In [20]:
papers = corpus.indices['citations']['CARLSON SM 2004 EVOL ECOL RES'] # Who cited Carlson 2004?
print papers
for paper in papers:
    print corpus.indexed_papers[paper]
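The two-step lookup above (field value to identifiers, then identifiers to Papers) can be sketched with ordinary dictionaries. The records, the identifiers, and the build_secondary_index helper below are hypothetical, and the sketch assumes that the indexed field holds a list of values for each record.

```python
from collections import defaultdict

# Toy records; 'id' plays the role of the primary index field.
records = [
    {'id': 'p1', 'citations': ['SMITH J 2001 J EXAMPLE', 'DOE A 1999 EXAMPLE REV']},
    {'id': 'p2', 'citations': ['SMITH J 2001 J EXAMPLE']},
    {'id': 'p3', 'citations': []},
]
indexed_papers = {record['id']: record for record in records}


def build_secondary_index(records, primary_field, field):
    """Map each value found in ``field`` onto the identifiers of the records that contain it."""
    index = defaultdict(list)
    for record in records:
        for value in record.get(field, []):
            index[value].append(record[primary_field])
    return dict(index)


citations = build_secondary_index(records, 'id', 'citations')

# Step 1: which records cite Smith 2001? (identifiers only)
citing = citations['SMITH J 2001 J EXAMPLE']
# Step 2: resolve the identifiers through the primary index.
citing_papers = [indexed_papers[identifier] for identifier in citing]
print(citing)
```

Storing identifiers rather than the records themselves keeps the secondary index small, at the cost of one extra dictionary lookup per result.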
We can create new indices using the index method. For example, to index our Corpus using the authorKeywords field:
In [22]:
corpus.index('authorKeywords')
In [25]:
for keyword, papers in corpus.indices['authorKeywords'].items()[6:10]:    # Show just four entries, for space's sake.
    print 'The following Papers contain the keyword %s: \n\n\t%s \n' % (keyword, '\n\t'.join(papers))
Since we're interested in historical trends in our Corpus, we probably also want to index the date field:
In [27]:
corpus.index('date')
for date, papers in corpus.indices['date'].items()[-11:-1]:    # Ten entries, skipping the final one.
    print 'There are %i Papers from %i' % (len(papers), date)
We can examine the distribution of Papers over time using the distribution method:
In [29]:
corpus.distribution()[-11:-1] # Last ten years.
Out[29]:
In [30]:
plt.figure(figsize=(10, 3))
start = min(corpus.indices['date'].keys())
end = max(corpus.indices['date'].keys())
X = range(start, end + 1)
plt.plot(X, corpus.distribution(), lw=2)
plt.ylabel('Number of Papers')
plt.xlim(start, end)
plt.show()
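The plotting cell above pairs the output of distribution with every year from the earliest to the latest, which suggests one count per year, including years with no Papers. The sketch below shows that computation on a toy date index. The distribution function here is a hypothetical stand-in, not the Tethne method, and the zero-filling is our assumption.

```python
def distribution(date_index):
    """Return (years, counts) with one count per year from the earliest to the latest."""
    start, end = min(date_index), max(date_index)
    years = list(range(start, end + 1))
    # Years that are absent from the index contribute a count of zero.
    counts = [len(date_index.get(year, [])) for year in years]
    return years, counts


# Toy date index: year ---> identifiers of the records from that year.
date_index = {2009: ['p1', 'p2'], 2011: ['p3']}
years, counts = distribution(date_index)

for year, count in zip(years, counts):
    print('There are %i Papers from %i' % (count, year))
```

Filling the gaps with zeros keeps the two sequences the same length, which is what a line plot over consecutive years needs.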
In [31]:
corpus['WOS:000309391500014']
Out[31]:
Whoa! But it gets better. We can select Papers using any of the indices in the Corpus. For example, we can select all of the papers with the authorKeyword LIFE:
In [33]:
corpus[('authorKeywords', 'LIFE')]
Out[33]:
We can also select Papers using several values. For example, with the primary index field:
In [34]:
corpus[['WOS:000309391500014', 'WOS:000306532900015']]
Out[34]:
...and with other indexed fields (think of this as an OR search):
In [36]:
corpus[('authorKeywords', ['LIFE', 'ENZYME GENOTYPE', 'POLAR AUXIN'])]
Out[36]:
Since we indexed 'date' earlier, we could select any Papers published between 2002 and 2012:
In [38]:
papers = corpus[('date', range(2002, 2013))] # range() excludes the "last" value.
print 'There are %i Papers published between %i and %i' % (len(papers), 2002, 2012)
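The selection styles shown above (a single identifier, a list of identifiers, and a (field, value or values) pair treated as an OR search) can be imitated with a small class. ToyCorpus below is a hypothetical illustration built on plain dictionaries. It is not how Tethne implements selection, and the record fields are made up.

```python
class ToyCorpus(object):
    """A tiny stand-in for a Corpus: a primary index plus named secondary indices."""

    def __init__(self, records, primary_field):
        self.indexed_papers = {r[primary_field]: r for r in records}
        self.indices = {}

    def index(self, field):
        """Build a secondary index: field value ---> list of identifiers."""
        index = {}
        for key, record in self.indexed_papers.items():
            values = record.get(field, [])
            if not isinstance(values, (list, tuple)):
                values = [values]           # Treat scalar fields like one-item lists.
            for value in values:
                index.setdefault(value, []).append(key)
        self.indices[field] = index

    def __getitem__(self, selector):
        if isinstance(selector, tuple):     # (field, value) or (field, values)
            field, values = selector
            if isinstance(values, (str, int)):
                values = [values]
            keys = []
            for value in values:            # OR search: match any of the values.
                for key in self.indices[field].get(value, []):
                    if key not in keys:     # Do not return the same record twice.
                        keys.append(key)
            return [self.indexed_papers[k] for k in keys]
        if isinstance(selector, list):      # Several identifiers.
            return [self.indexed_papers[k] for k in selector]
        return self.indexed_papers[selector]    # A single identifier.


records = [
    {'id': 'p1', 'date': 2001, 'authorKeywords': ['LIFE']},
    {'id': 'p2', 'date': 2005, 'authorKeywords': ['LIFE', 'DIVERSITY']},
    {'id': 'p3', 'date': 2012, 'authorKeywords': ['POLAR AUXIN']},
]
toy = ToyCorpus(records, 'id')
toy.index('date')
toy.index('authorKeywords')

recent = toy[('date', range(2002, 2013))]   # range() stops before 2013.
print('There are %i records from 2002 through 2012' % len(recent))
```

Note that the upper bound passed to range has to be one past the last year that we want, which is why 2013 appears in a selection that ends in 2012.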
Earlier we used specific fields in our Papers to create indices. The inverse of an index is what we call a FeatureSet. A FeatureSet contains data about the occurrence of specific features across all of the Papers in our Corpus.
The read function generates a few FeatureSets by default. All of the available FeatureSets are stored in a dictionary, the features attribute.
In [39]:
corpus.features.items()
Out[39]:
Each FeatureSet has several properties:
FeatureSet.index maps integer identifiers to specific features. For example, for author names:
In [40]:
featureset = corpus.features['authors']
for k, author in featureset.index.items()[0:10]:
    print '%i --> "%s"' % (k, ', '.join(author))    # Author names are stored as (LAST, FIRST M).
FeatureSet.lookup is the reverse of index: it maps features onto their integer IDs:
In [42]:
featureset = corpus.features['authors']
for author, k in featureset.lookup.items()[0:10]:
    print '%s --> %i' % (', '.join(author).ljust(25), k)
FeatureSet.documentCounts shows how many Papers in our Corpus have a specific feature:
In [43]:
featureset = corpus.features['authors']
for k, count in featureset.documentCounts.items()[0:10]:
    print 'Feature %i (which identifies author "%s") is found in %i documents' % (k, ', '.join(featureset.index[k]), count)
FeatureSet.features shows how many times each feature occurs in each Paper.
In [44]:
featureset.features.items()[0]
Out[44]:
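To see how the four properties relate to one another, the sketch below builds all of them from a toy mapping of records onto their feature values. The build_featureset helper is hypothetical, and the layout chosen for the per-record data (a list of (feature ID, count) pairs) is an assumption rather than something confirmed by the text above.

```python
from collections import Counter


def build_featureset(paper_features):
    """Derive index, lookup, document counts and per-record counts from raw values."""
    index = {}              # integer ID ---> feature
    lookup = {}             # feature ---> integer ID
    document_counts = Counter()     # integer ID ---> number of records with the feature
    features = {}           # record identifier ---> [(integer ID, count), ...]
    for paper_id, values in paper_features.items():
        row = []
        for feature, n in Counter(values).items():
            if feature not in lookup:
                k = len(lookup)     # Next unused integer ID.
                lookup[feature] = k
                index[k] = feature
            k = lookup[feature]
            document_counts[k] += 1  # Counted once per record, however often it occurs.
            row.append((k, n))
        features[paper_id] = row
    return index, lookup, dict(document_counts), features


# Toy data: 'LIFE' occurs twice in p1 and once in p2.
paper_features = {
    'p1': ['LIFE', 'DIVERSITY', 'LIFE'],
    'p2': ['LIFE'],
}
index, lookup, document_counts, features = build_featureset(paper_features)

k = lookup['LIFE']
print('Feature %i ("%s") is found in %i documents' % (k, index[k], document_counts[k]))
```

The distinction to keep in mind is that the document counts tally records, while the per-record data tallies occurrences within a single record.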
We can create a new FeatureSet from just about any field in our Corpus, using the index_feature method. For example, suppose that we were interested in the distribution of authorKeywords across the whole corpus:
In [46]:
corpus.index_feature('authorKeywords')
corpus.features.keys()
Out[46]:
In [48]:
featureset = corpus.features['authorKeywords']
for k, count in featureset.documentCounts.items()[0:10]:
    print 'Keyword %s is found in %i documents' % (featureset.index[k], count)
In [49]:
featureset.features['WOS:000324532900018'] # Feature for a specific Paper.
Out[49]:
In [50]:
plt.figure(figsize=(10, 3))
years, values = corpus.feature_distribution('authorKeywords', 'DIVERSITY')
start = min(years)
end = max(years)
X = range(start, end + 1)
plt.plot(years, values, lw=2)
plt.ylabel('Papers with DIVERSITY in authorKeywords')
plt.xlim(start, end)
plt.show()
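The plot above shows, for each year, how many Papers carry a given keyword. A minimal sketch of that calculation on toy data follows. The feature_distribution function here is a hypothetical stand-in rather than the Tethne method, and it assumes that we count Papers (not occurrences) and that years without a match are reported as zero.

```python
def feature_distribution(paper_years, paper_features, feature):
    """Return (years, values): per year, the number of records that have ``feature``."""
    start = min(paper_years.values())
    end = max(paper_years.values())
    years = list(range(start, end + 1))
    values = [0] * len(years)
    for paper_id, year in paper_years.items():
        if feature in paper_features.get(paper_id, []):
            values[year - start] += 1   # Offset of this year within the range.
    return years, values


# Toy data: publication year and keywords for three records.
paper_years = {'p1': 2010, 'p2': 2010, 'p3': 2012}
paper_features = {
    'p1': ['DIVERSITY'],
    'p2': ['LIFE'],
    'p3': ['DIVERSITY', 'LIFE'],
}
years, values = feature_distribution(paper_years, paper_features, 'DIVERSITY')

for year, value in zip(years, values):
    print('%i: %i' % (year, value))
```

The two returned sequences line up one-to-one, so they can be passed directly to a plotting call as the x and y values.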