In [1]:
%matplotlib inline
In [2]:
from pprint import pprint
import matplotlib.pyplot as plt
In this notebook we will take our first steps with the Tethne Python package. We'll parse some bibliographic records from the ISI Web of Science, and take a look at the Corpus class and its various features. We'll then use some of the functions in tethne.networks to generate some simple networks from our bibliographic dataset.
This notebook is part of a cluster of learning resources developed by the Laubichler Lab and the Digital Innovation Group at Arizona State University as part of an initiative for digital and computational humanities (d+cH). For more information, see our evolving online methods course at https://diging.atlassian.net/wiki/display/DCH.
Development of the Tethne project is led by Erick Peirson. To get help, first check our issue tracking system on GitHub. There, you can search for questions and problems reported by other users, or ask a question of your own. You can also reach Erick via e-mail at erick.peirson@asu.edu.
Additional documentation and tutorials for the Tethne Python package are available at ....
In [3]:
print "This is a code cell!"
You can execute the code in a code cell by clicking on it and pressing Shift-Enter on your keyboard, or by clicking the right-arrow "Run" button in the toolbar at the top of the page. The cell below will automatically be selected, so you can run many cells in quick succession by repeatedly pressing Shift-Enter (or the "Run" button). It's a good idea to run all of the code cells in order, from the top of the tutorial, since many commands later in the tutorial will depend on earlier ones.
As we work through the notebook, you'll need to modify certain values depending on where your data is located. You should also experiment! Try changing the parameters in the functions demonstrated below, and re-run the code-cell to see the result. That's what's great about iPython notebooks: you can play around with specific chunks of code without having to re-run the entire script.
The ISI Web of Science is a proprietary database owned by Thomson Reuters. It is one of the oldest and most comprehensive scientific bibliographic databases in existence. If you are affiliated with an academic institution, you may have access to this database via an institutional license.
For the purpose of this tutorial, you can download a practice dataset from (insert link to dataset). Move the downloaded zip to a place where you can find it, and uncompress its contents. You'll need the full path to the uncompressed dataset.
Perform a search for literature of interest using the interface provided. Your search criteria will be informed by the objectives of your research project. If you are attempting to characterize the development of a research field, for example, you should choose terms that pick out that field as uniquely as possible (consider using the Publication Name search field). You can also pick out literatures originating from particular institutions by using the Organization-Enhanced search field.
Note also that you can restrict your search to one of three indexes in the Web of Science Core Collection: the Science Citation Index Expanded, the Social Sciences Citation Index, or the Arts & Humanities Citation Index.
Once you have found the papers that you are interested in, find the Send to: menu at the top of the list of results. Click the small orange down-arrow, and select Other File Formats.
A small in-browser window should open in the foreground. Specify the range of records that you wish to download. Note that you can only download 500 records at a time, so you may have to make multiple download requests. Be sure to specify Full Record and Cited References in the Record Content field, and Plain Text in the File Format field. Then click Send.
After a few moments, a download should begin. WoS usually returns a field-tagged data file called savedrecs.txt. Put this in a location on your filesystem where you can find it later; this is the input for Tethne's WoS reader methods.
If you open the text file returned by the WoS database (usually named 'savedrecs.txt'), you should see a whole bunch of field-tagged data. "Field-tagged" means that each metadata field is denoted by a "tag" (a two-letter code), followed by values for that field. A complete list of WoS field tags can be found here. For best results, you should avoid making changes to the contents of WoS data files.
The metadata record for each paper in your data file should begin with:
PT J
...and end with:
ER
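Just to get a feel for the format before we bring in Tethne, you can count the records in a raw data file yourself, since each record ends with an ER line. This is a quick sketch; adjust the path to wherever you saved your savedrecs.txt:
# Count WoS records by counting the 'ER' lines that terminate each record.
count = 0
with open('/path/to/your/savedrecs.txt') as datafile:
    for line in datafile:
        if line.strip() == 'ER':
            count += 1
print 'Found %i records' % count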
There are two author fields: the AU field is always provided, and values take the form "Last, FI". AF is provided if author full-names are available, and values take the form "Last, First Middle". For example:
AU Dauvin, JC
Grimes, S
Bakalem, A
AF Dauvin, Jean-Claude
Grimes, Samir
Bakalem, Ali
Citations are listed in the CR block. For example:
CR Airoldi L, 2007, OCEANOGR MAR BIOL, V45, P345
Alexander Vera, 2011, Marine Biodiversity, V41, P545, DOI 10.1007/s12526-011-0084-1
Arvanitidis C, 2002, MAR ECOL PROG SER, V244, P139, DOI 10.3354/meps244139
Bakalem A, 2009, ECOL INDIC, V9, P395, DOI 10.1016/j.ecolind.2008.05.008
Bakalem Ali, 1995, Mesogee, V54, P49
…
Zenetos A, 2005, MEDITERR MAR SCI, V6, P63
Zenetos A, 2004, CIESM ATLAS EXOTIC S, V3
More recent records also include the institutional affiliations of authors in the C1 block.
C1 [Wang, Changlin; Washida, Haruhiko; Crofts, Andrew J.; Hamada, Shigeki;
Katsube-Tanaka, Tomoyuki; Kim, Dongwook; Choi, Sang-Bong; Modi, Mahendra; Singh,
Salvinder; Okita, Thomas W.] Washington State Univ, Inst Biol Chem, Pullman, WA 99164
USA.
For more information about WoS field tags, see the list on the Thomson Reuters website, here.
The modules in the tethne.readers subpackage allow you to parse data from a few different databases. The readers for Web of Science, JSTOR DfR, and Zotero RDF datasets are the most rigorously tested. Request support for a new dataset on our GitHub project site.
Database | Module
---|---
Web of Science | tethne.readers.wos
JSTOR Data-for-Research | tethne.readers.dfr
Zotero | tethne.readers.zotero
You can load the tethne.readers.wos module by importing it from the tethne.readers subpackage:
In [4]:
from tethne.readers import wos
To parse data from a WoS dataset, use the read function. Each module in the tethne.readers subpackage provides a read function. read can parse either a single data file or a directory full of data files, and returns a Corpus object. Just pass it a string containing the path to your data. First, try parsing a single WoS field-tagged data file.
In [5]:
corpus = wos.read('/Users/erickpeirson/Dropbox/HSS ThatCamp Workshop/sample_data/wos/savedrecs.txt')
You can see how many records were loaded from your data file by evaluating the len of the Corpus.
In [6]:
print 'Loaded %i records!' % len(corpus)
Often you'll be working with datasets made up of multiple data files. The Web of Science database only allows you to download 500 records at a time (because they're dirty capitalists). You can use the read function to load Papers from a directory containing multiple data files.
Instead of providing the path to a single data file, just provide the path to a directory containing several WoS field-tagged data files. The read function knows that your path is a directory and not a data file; it looks inside of that directory for WoS data files.
In [7]:
corpus = wos.read('/Users/erickpeirson/Dropbox/HSS ThatCamp Workshop/sample_data/wos/')
We should have quite a few more records this time:
In [8]:
print 'Loaded %i records!' % len(corpus)
A Corpus is a collection of Papers with superpowers. Each Paper represents one bibliographic record. Most importantly, the Corpus provides a consistent way of indexing bibliographic records. Indexing is important because it sets the stage for all of the subsequent analyses that we may wish to do with our bibliographic data.
A Corpus behaves like a list of Papers. We can select a single Paper like this:
In [9]:
corpus[500].__dict__ # [500] gets the 501st Paper, and __dict__ generates a
# key-value representation of the data in the Paper.
Out[9]:
There are several things to notice in the output above. First, each Paper should (generally) have a title:
In [10]:
corpus[500].title
Out[10]:
Each Paper should also have a date, journal, and wosid (WoS accession ID). Many will also have a doi. Note that we can access the attributes of each Paper using dot (.) notation:
In [11]:
# corpus[500] gets a Paper, and ``.date`` gets the date attribute.
print 'Date:'.ljust(20), corpus[500].date
print 'Journal:'.ljust(20), corpus[500].journal
print 'WoS accession ID:'.ljust(20), corpus[500].wosid
print 'DOI:'.ljust(20), corpus[500].doi
Each Paper will also have authors. Tethne represents author names as "tuples" of the form (last, first). Depending on the record, first might be first and middle initials, or first and middle names.
In [12]:
corpus[500].authors
Out[12]:
Unlike many other bibliographic datasets, WoS data contain the cited references of each Paper. Each cited reference is itself represented as a Paper:
In [13]:
corpus[2].citedReferences
Out[13]:
A "prettier" representation of the cited references is available in the citations
attribute.
In [14]:
corpus[2].citations
Out[14]:
Each cited reference is represented by what we call an 'ayjid': it contains the author name, year of publication, and the journal in which it was published. Every Paper has an 'ayjid'.
In [15]:
corpus[2].ayjid
Out[15]:
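To see how those pieces fit together, we can assemble a rough ayjid-style string ourselves from attributes we've already looked at. This is just a sketch for intuition; Tethne's own formatting may differ slightly:
# Roughly: first author's name, plus year and journal.
paper = corpus[2]
last, first = paper.authors[0]
print last, first, paper.date, paper.journal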
The most important functionality of the Corpus is indexing. Indexing provides a way of looking up Papers by specific attributes, e.g. by the year in which they were published, or by author.
Each Corpus has a single "primary" index. For WoS data, the wosid field (WoS accession ID) is used by default, since every WoS record has one. You can see which field was used as the primary index by accessing the .index_by attribute of the Corpus.
In [16]:
corpus.index_by
Out[16]:
All of the Papers in the Corpus are stored by wosid in the indexed_papers attribute. The code cell below shows the first ten Papers with their indexing keys.
In [17]:
corpus.indexed_papers.items()[:10]
Out[17]:
Additional indexes are located in the indices attribute. The code cell below shows which fields are already indexed.
In [18]:
corpus.indices.keys()
Out[18]:
We can look up Papers using the name of an indexed field and some value. For example, to see all of the Papers in which ('MAIENSCHEIN', 'J') is an author, we could do:
In [19]:
for paper in corpus[('authors', ('MAIENSCHEIN', 'J'))]:
print paper.date, paper.title
We can create a new index using the index() method. For example, to index Papers by date, we could do:
In [20]:
corpus.index('date')
'date' should now show up in the available indices...
In [21]:
corpus.indices.keys()
Out[21]:
...and we can now look up all of the Papers published in 1985:
In [22]:
for paper in corpus[('date', 1985)]:
print paper.date, paper.title
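Since we imported matplotlib at the top of the notebook, we can also get a quick picture of how the records are distributed over time. The sketch below assumes that corpus.indices['date'] maps each year to the identifiers of the matching records, as the lookups above suggest:
# Count the number of records published in each year, and plot them.
years = sorted(corpus.indices['date'].keys())
counts = [len(corpus.indices['date'][year]) for year in years]

plt.bar(years, counts)
plt.xlabel('Publication year')
plt.ylabel('Number of records')
plt.show()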
Now that our Corpus is indexed, we're ready to start building networks. The network-building functions live in the tethne.networks subpackage:
In [23]:
from tethne import networks
Now we can use the coauthors function to create a coauthor network. We need only provide it with our Corpus:
In [24]:
coauthor_graph = networks.coauthors(corpus)
Tethne uses a package called NetworkX to build networks. All of the network-building functions return NetworkX Graph objects. We can see how large our network is using the order() and size() methods:
In [25]:
print coauthor_graph.order() # Number of nodes.
print coauthor_graph.size() # Number of edges.
As you can see, historians of science don't collaborate much.
To see a list of nodes, use the nodes() method:
In [26]:
coauthor_graph.nodes()[:10] # [:10] just shows the first ten.
Out[26]:
...and edges() for edges:
In [27]:
coauthor_graph.edges(data=True)[:10] # [:10] just shows the first ten.
# data=True tells edges() to return details about each edge.
Out[27]:
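Beyond order() and size(), NetworkX gives us lots of ways to poke at the graph directly. For example, here is one way to find the best-connected authors. This sketch assumes an older (1.x) version of NetworkX, in which degree() returns a dictionary mapping each node to its number of neighbors:
# Sort authors by their number of distinct coauthors (their degree).
degrees = coauthor_graph.degree()
top_five = sorted(degrees.items(), key=lambda item: item[1], reverse=True)[:5]
for author, degree in top_five:
    print author, degree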
For networks with anything more than a few nodes, it's hard to visualize what's going on in the iPython environment. So we'll export the coauthor_graph and visualize it in a network analysis package called Cytoscape.
Cytoscape understands several network file formats. GraphML (link here) is probably the most versatile, so we'll use it to export our coauthor graph.
The tethne.writers.graph module has several functions for writing graphs to disk. We'll use to_graphml():
In [28]:
from tethne.writers.graph import to_graphml
to_graphml() accepts two arguments: the graph itself, and a string with the path to the output file (that will be created). In the example below, I just put the graph on my desktop.
In [29]:
to_graphml(coauthor_graph, '/Users/erickpeirson/Desktop/coauthors_graph.graphml')
If you were to open that file, the first few lines would look something like this:
<?xml version='1.0' encoding='utf-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<key attr.name="weight" attr.type="int" for="edge" id="weight" />
<key attr.name="documentCount" attr.type="int" for="node" id="documentCount" />
<key attr.name="count" attr.type="double" for="node" id="count" />
<graph edgedefault="undirected">
<node id="WEBER, BH">
<data key="count">1.0</data>
<data key="documentCount">1</data>
</node>
<node id="CAPELOTTI, PJ">
<data key="count">1.0</data>
<data key="documentCount">1</data>
</node>
<node id="GREENE, MOTT T">
<data key="count">3.0</data>
<data key="documentCount">3</data>
</node>
<node id="PRETE, FR">
<data key="count">3.0</data>
<data key="documentCount">3</data>
</node>
Everything in the graph is enclosed between the <graphml ...></graphml> tags. Each author is represented by a <node></node> element. Further down, relationships between authors are represented by <edge></edge> elements:
<edge source="TAUBER, AI" target="BALABAN, M">
<data key="weight">1</data>
</edge>
<edge source="TAUBER, AI" target="PODOLSKY, SH">
<data key="weight">1</data>
</edge>
<edge source="TAUBER, AI" target="CRIST, E">
<data key="weight">1</data>
</edge>
<edge source="RUPKE, N" target="HOSSFELD, U">
<data key="weight">1</data>
</edge>
<edge source="GAWNE, RICHARD" target="NICHOLSON, DANIEL J">
<data key="weight">1</data>
</edge>
Go ahead and load Cytoscape. After the application loads, you should see a splash screen like the one below. Click on "From network file", then select your graphml file and click OK.
Once the network loads, you'll see a jumble of nodes and edges. Click on the "Apply Preferred Layout" button (it looks like nodes with arrows pointing in various directions) at the top of the screen.
By default, this should apply a force-directed layout. After a few moments, your network should look something like the image below.
We can visualize attributes of the graph in the "Styles" menu. Click on "Styles" in the upper left. In the example below, I set node width and height to be equal, and set node size as a continuous function of "count" (the number of papers written by each author).
We can set edge attributes, too. In the example below, I set edge width to be a function of "weight", which is the number of papers that the two connected authors wrote together.
You can zoom in and out to take a closer look at parts of the graph. If you click on the "network" tab in the upper left, you'll see a mini version of your network in the lower left, with a blue box showing which area you're currently viewing.
Bibliographic coupling can be a useful and computationally cheap way to explore the thematic topology of a large scientific literature.
Bibliographic coupling was first proposed as a method for detecting latent topical affinities among research publications by Myer M. Kessler at MIT in 1958. In 1972, J.C. Donohue suggested that bibliographic coupling could be used to map "research fronts" in science, and this method, along with co-citation analysis and other citation-based clustering techniques, became a core methodology of the science-mapping craze of the 1970s. Bibliographic coupling is still employed in both information retrieval and science studies.
Two papers are bibliographically coupled if they both cite at least some of the same papers. The core assumption of bibliographic coupling analysis is that if two papers cite similar literatures, then they must be topically related in some way. That is, they are more likely to be related to each other than to papers with which they share no cited references.
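To make the idea concrete, we can count the references that two records share by hand, using the citedReferences attribute we saw earlier and matching on ayjid. This is just a sketch; corpus[2] and corpus[3] are arbitrary records, and it assumes both actually have cited references:
# Count the cited references that two papers have in common.
refs_a = set(ref.ayjid for ref in corpus[2].citedReferences)
refs_b = set(ref.ayjid for ref in corpus[3].citedReferences)
print 'Shared references: %i' % len(refs_a & refs_b)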
What we are aiming for is a graph model of our bibliographic data that reveals thematically coherent and informative clusters of documents. We will use Tethne's bibliographic_coupling() function to generate such a network.
First we import the function:
In [30]:
from tethne import bibliographic_coupling
We use this function just like the coauthors() function -- passing the Corpus as our first argument -- but we can also pass additional arguments. min_weight=3 indicates that two Papers must share at least three cited references to be coupled. node_attrs tells the function to add additional information to each node; in this case, 'date' and 'title'.
In [41]:
coupling_graph = bibliographic_coupling(corpus, min_weight=3, node_attrs=['date', 'title'])
We can "tune" this function by increasing or decreasing min_weight
to yield more or less dense graphs. order()
(the number of nodes) and size()
(the number of edges) give us a sense of the density of the graph.
In [42]:
coupling_graph.order(), coupling_graph.size()
Out[42]:
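One way to choose a sensible min_weight is simply to try a few values and compare the resulting graph sizes. A sketch (depending on the size of your corpus, this may take a little while to run):
# Compare graph density across several min_weight thresholds.
for threshold in (2, 3, 4, 5):
    graph = bibliographic_coupling(corpus, min_weight=threshold)
    print 'min_weight=%i: %i nodes, %i edges' % (threshold, graph.order(), graph.size())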
We can use the to_graphml() function once again to write the graph to disk, so that we can visualize it in Cytoscape.
In [40]:
to_graphml(coupling_graph, '/Users/erickpeirson/Desktop/coupling_graph.graphml')
The resulting graph, with some styling, might look like this: