In [1]:

    
%matplotlib inline



In [2]:

    
from pprint import pprint
import matplotlib.pyplot as plt

Introduction to Tethne: Working with data from the Web of Science

In this notebook we will take our first steps with the Tethne Python package. We'll parse some bibliographic records from the ISI Web of Science, and take a look at the Corpus class and its various features. We'll then use some of the functions in tethne.networks to generate some simple networks from our bibliographic dataset.

Methods in Digital & Computational Humanities

This notebook is part of a cluster of learning resources developed by the Laubichler Lab and the Digital Innovation Group at Arizona State University as part of an initiative for digital and computational humanities (d+cH). For more information, see our evolving online methods course at https://diging.atlassian.net/wiki/display/DCH.

Getting Help

Development of the Tethne project is led by Erick Peirson. To get help, first check our issue tracking system on GitHub. There, you can search for questions and problems reported by other users, or ask a question of your own. You can also reach Erick via e-mail at erick.peirson@asu.edu.

Using this notebook

This is an interactive Python notebook. Most of the content is just marked-down text, like this paragraph, that provides expository on some aspect of the Tethne package. Some of the cells are "code" cells, which look like this:



In [3]:

    
print "This is a code cell!"









    



This is a code cell!

You can execute the code in a code cell by clicking on it and pressing Shift-Enter on your keyboard, or by clicking the right-arrow "Run" button in the toolbar at the top of the page. The cell below will automatically be selected, so you can run many cells in quick succession by repeatedly pressing Shift-Enter (or the "Run" button). It's a good idea to run all of the code cells in order, from the top of the tutorial, since many commands later in the tutorial will depend on earlier ones.

Play!!

As we work through the notebook, you'll need to modify certain values depending on where your data is located. You should also experiment! Try changing the parameters in the functions demonstrated below, and re-run the code-cell to see the result. That's what's great about iPython notebooks: you can play around with specific chunks of code without having to re-run the entire script.

Getting Bibliographic Data from the ISI Web of Science

The ISI Web of Science is a proprietary database owned by Thompson Reuters. It is one of the oldest and most comprehensive scientific bibliographic databases in existence. If you are affiliated with an academic institution, you may have access to this database via an institutional license.

For the purpose of this tutorial, you can download a practice dataset from here. Move the downloaded zip to a place where you can find it, and uncompress its contents. You'll need the full path to the uncompressed dataset.

Perform a search for literature of interest using the interface provided.

Your search criteria will be informed by the objectives of your research project. If you are attempting to characterize the development of a research field, for example, you should choose terms that pick out that field as uniquely as possible (consider using the Publication Name search field). You can also pick out literatures originating from particular institutions, by using the Organization-Enhanced search field.

Note also that you can restrict your research to one of three indexes in the Web of Science Core Collection:

Science Citation Index Expanded is the largest index, containing scientific publications from 1900 onward.
Social Sciences Citation Index covers 1956 onward.
Arts & Humanities Citation Index is the smallest index, containing publications from 1975 onward.

Once you have found the papers that you are interested in, find the Send to: menu at the top of the list of results. Click the small orange down-arrow, and select Other File Formats.

A small in-browser window should open in the foreground. Specify the range of records that you wish to download. Note that you can only download 500 records at a time, so you may have to make multiple download requests. Be sure to specify Full Record and Cited References in the Record Content field, and Plain Text in the File Format field. Then click Send.

After a few moments, a download should begin. WoS usually returns a field-tagged data file called savedrecs.txt. Put this in a location on your filesystem where you can find it later; this is the input for Tethne's WoS reader methods.

Structure of the WoS Field-Tagged Data File

If you open the text file returned by the WoS database (usually named 'savedrecs.txt'), you should see a whole bunch of field-tagged data. "Field-tagged" means that each metadata field is denoted by a "tag" (a two-letter code), followed by values for that field. A complete list of WoS field tags can be found here. For best results, you should avoid making changes to the contents of WoS data files.

The metadata record for each paper in your data file should begin with:

   PT J

...and end with:

ER

There are two author fields: the AU field is always provided, and values take the form "Last, FI". AF is provided if author full-names are available, and values take the form "Last, First Middle". For example:

   AU Dauvin, JC
      Grimes, S
      Bakalem, A
   AF Dauvin, Jean-Claude
      Grimes, Samir
      Bakalem, Ali

Citations are listed in the CR block. For example:

   CR Airoldi L, 2007, OCEANOGR MAR BIOL, V45, P345
      Alexander Vera, 2011, Marine Biodiversity, V41, P545, DOI 10.1007/s12526-011-0084-1
      Arvanitidis C, 2002, MAR ECOL PROG SER, V244, P139, DOI 10.3354/meps244139
      Bakalem A, 2009, ECOL INDIC, V9, P395, DOI 10.1016/j.ecolind.2008.05.008
      Bakalem Ali, 1995, Mesogee, V54, P49
      …
      Zenetos A, 2005, MEDITERR MAR SCI, V6, P63
      Zenetos A, 2004, CIESM ATLAS EXOTIC S, V3

More recent records also include the institutional affiliations of authors in the C1 block.

   C1 [Wang, Changlin; Washida, Haruhiko; Crofts, Andrew J.; Hamada, Shigeki;
   Katsube-Tanaka, Tomoyuki; Kim, Dongwook; Choi, Sang-Bong; Modi, Mahendra; Singh,
   Salvinder; Okita, Thomas W.] Washington State Univ, Inst Biol Chem, Pullman, WA 99164
   USA.

For more information about WoS field tags, see a list on the Thompson Reuters website, here.

Parsing Web of Science Field-Tagged Data

The modules in the tethne.readers subpackage allow you to parse data from a few different databases. The readers for Web of Science, JSTOR DfR, and Zotero RDF datasets are the most rigorously tested. Request support for a new dataset on our GitHub project site.

Database	module
Web of Science	`tethne.readers.wos`
JSTOR Data-for-Research	`tethne.readers.dfr`
Zotero	`tethne.readers.zotero`

You can load the tethne.readers.wos module by importing it from the tethne.readers subpackage:



In [4]:

    
from tethne.readers import wos

To parse data from a WoS dataset, use the read method. Each module in the tethne.readers subpackage should have a read method.

read can parse a single data file, or a directory full of data files, and returns a Corpus object. Just pass it a string containing the path to your data. First, try parsing a single WoS field-tagged data file.



In [5]:

    
corpus = wos.read('/Users/erickpeirson/Dropbox/HSS ThatCamp Workshop/sample_data/wos/savedrecs.txt')

You can see how many records were loaded from your data file by evaluating the len of the Corpus.



In [6]:

    
print 'Loaded %i records!' % len(corpus)









    



Loaded 500 records!

Reading more than one data file at a time

Often you'll be working with datasets comprised of multiple data files. The Web of Science database only allows you to download 500 records at a time (because they're dirty capitalists). You can use the read function to load a list of Papers from a directory containing multiple data files.

Instead of providing the path to a single data file, just provide the path to a directory containing several WoS field-tagged data files. The read function knows that your path is a directory and not a data file; it looks inside of that directory for WoS data files.



In [7]:

    
corpus = wos.read('/Users/erickpeirson/Dropbox/HSS ThatCamp Workshop/sample_data/wos/')

We should have quite a few more records this time:



In [8]:

    
print 'Loaded %i records!' % len(corpus)









    



Loaded 1168 records!

`Corpus` objects

A Corpus is a collection of Papers with superpowers. Each Paper represents one bibliographic record. Most importantly, the Corpus provides a consistent way of indexing bibliographic records. Indexing is important, because it sets the stage for all of the subsequent analyses that we may wish to do with our bibliographic data.

A Corpus behaves like a list of Papers. We can selecte a single Paper like this:



In [9]:

    
corpus[500].__dict__  # [500] gets the 501st Paper, and __dict__ generates a 
                      #  key-value representation of the data in the Paper.









    Out[9]:





{'ER': u'',
 'GA': u'804NA',
 'ISSN': [u'0022-5010', u'J9 J HIST BIOL'],
 'PD': u'SPR',
 'PG': 50,
 'PT': u'J',
 'WC': u'Biology; History & Philosophy Of Science',
 'abstract': u'The new discipline of exobiology formed from the intertwining of origin of life research with the search for life or its building blocks on other planets, from 1957-1973. The field was inherently highly interdisciplinary, yet it coalesced very quickly and was responsible in its first twenty years for numerous important contributions to twentieth century life science and planetary sciences such as climatology, the study of mass extinctions, etc. NASA played a very important role in catalyzing the rapid consolidation of exobiology, both through research grants and through sponsored meetings that overcame disciplinary boundaries, bringing together scientists from diverse backgrounds. The presence of a handful of prominent senior scientists such as Joshua Lederberg, Melvin Calvin and Norman Horowitz helped gain credibility for exobiology, in the face of criticism and competition from existing life sciences disciplines. Tensions within the exobiology research community and between NASA-funded science and the academic research community are explored, as are such milestones of discipline formation as journals and professional societies. C1 Franklin & Marshall Coll, Program Sci Technol & Soc, Lancaster, PA 17604 USA.',
 'authorKeywords': [u'PRIMITIVE EARTH',
  u'LIFE',
  u'ORIGIN',
  u'HUMANOIDS',
  u'MOLECULES',
  u'CYANIDE',
  u'ADENINE',
  u'SPACE'],
 'authors_full': [(u'STRICK', u'JE')],
 'authors_init': [(u'STRICK', u'J E')],
 'citationCount': 118,
 'citedReferences': [<tethne.classes.paper.Paper at 0x109fba490>,
  <tethne.classes.paper.Paper at 0x109fc34d0>,
  <tethne.classes.paper.Paper at 0x109fcef90>,
  <tethne.classes.paper.Paper at 0x109fcef50>,
  <tethne.classes.paper.Paper at 0x109fcefd0>,
  <tethne.classes.paper.Paper at 0x109fdb050>,
  <tethne.classes.paper.Paper at 0x109fdb090>,
  <tethne.classes.paper.Paper at 0x109fdb110>,
  <tethne.classes.paper.Paper at 0x109fdb0d0>,
  <tethne.classes.paper.Paper at 0x109fdb150>,
  <tethne.classes.paper.Paper at 0x109fdb190>,
  <tethne.classes.paper.Paper at 0x109fdb1d0>,
  <tethne.classes.paper.Paper at 0x109fdb210>,
  <tethne.classes.paper.Paper at 0x109fdb250>,
  <tethne.classes.paper.Paper at 0x109fdb290>,
  <tethne.classes.paper.Paper at 0x109fdb310>,
  <tethne.classes.paper.Paper at 0x109fdb2d0>,
  <tethne.classes.paper.Paper at 0x109fdb390>,
  <tethne.classes.paper.Paper at 0x109fdb350>,
  <tethne.classes.paper.Paper at 0x109fdb410>,
  <tethne.classes.paper.Paper at 0x109fdb3d0>,
  <tethne.classes.paper.Paper at 0x109fdb490>,
  <tethne.classes.paper.Paper at 0x109fdb450>,
  <tethne.classes.paper.Paper at 0x109fdb4d0>,
  <tethne.classes.paper.Paper at 0x109fdb510>,
  <tethne.classes.paper.Paper at 0x109fdb550>,
  <tethne.classes.paper.Paper at 0x109fdb590>,
  <tethne.classes.paper.Paper at 0x109fdb5d0>,
  <tethne.classes.paper.Paper at 0x109fdb650>,
  <tethne.classes.paper.Paper at 0x109fdb610>,
  <tethne.classes.paper.Paper at 0x109fdb6d0>,
  <tethne.classes.paper.Paper at 0x109fdb690>,
  <tethne.classes.paper.Paper at 0x109fdb710>,
  <tethne.classes.paper.Paper at 0x109fdb750>,
  <tethne.classes.paper.Paper at 0x109fdb7d0>,
  <tethne.classes.paper.Paper at 0x109fdb810>,
  <tethne.classes.paper.Paper at 0x109fdb890>,
  <tethne.classes.paper.Paper at 0x109fdb910>,
  <tethne.classes.paper.Paper at 0x109fdb950>,
  <tethne.classes.paper.Paper at 0x109fdb9d0>,
  <tethne.classes.paper.Paper at 0x109fdb990>,
  <tethne.classes.paper.Paper at 0x109fdba50>,
  <tethne.classes.paper.Paper at 0x109fdba90>,
  <tethne.classes.paper.Paper at 0x109fdbb10>,
  <tethne.classes.paper.Paper at 0x109fdbad0>,
  <tethne.classes.paper.Paper at 0x109fdbb50>,
  <tethne.classes.paper.Paper at 0x109fdbb90>,
  <tethne.classes.paper.Paper at 0x109fdbc10>,
  <tethne.classes.paper.Paper at 0x109fdbbd0>,
  <tethne.classes.paper.Paper at 0x109fdbc50>,
  <tethne.classes.paper.Paper at 0x109fdbcd0>,
  <tethne.classes.paper.Paper at 0x109fdbc90>,
  <tethne.classes.paper.Paper at 0x109fdbd50>,
  <tethne.classes.paper.Paper at 0x109fdbd90>,
  <tethne.classes.paper.Paper at 0x109fdbd10>,
  <tethne.classes.paper.Paper at 0x109fdbdd0>,
  <tethne.classes.paper.Paper at 0x109fdbe50>,
  <tethne.classes.paper.Paper at 0x109fdbe90>,
  <tethne.classes.paper.Paper at 0x109fdbed0>,
  <tethne.classes.paper.Paper at 0x109fdbf10>,
  <tethne.classes.paper.Paper at 0x109fdbf50>,
  <tethne.classes.paper.Paper at 0x109fdbfd0>,
  <tethne.classes.paper.Paper at 0x109fe5090>,
  <tethne.classes.paper.Paper at 0x109fe50d0>,
  <tethne.classes.paper.Paper at 0x109fe5110>,
  <tethne.classes.paper.Paper at 0x109fe5150>,
  <tethne.classes.paper.Paper at 0x109fe5190>,
  <tethne.classes.paper.Paper at 0x109fe51d0>,
  <tethne.classes.paper.Paper at 0x109fe5210>,
  <tethne.classes.paper.Paper at 0x109fe5290>,
  <tethne.classes.paper.Paper at 0x109fe52d0>,
  <tethne.classes.paper.Paper at 0x109fe5310>,
  <tethne.classes.paper.Paper at 0x109fe5350>,
  <tethne.classes.paper.Paper at 0x109fe5390>,
  <tethne.classes.paper.Paper at 0x109fe53d0>,
  <tethne.classes.paper.Paper at 0x109fe5450>,
  <tethne.classes.paper.Paper at 0x109fe5410>,
  <tethne.classes.paper.Paper at 0x109fe5490>,
  <tethne.classes.paper.Paper at 0x109fe54d0>,
  <tethne.classes.paper.Paper at 0x109fe5510>,
  <tethne.classes.paper.Paper at 0x109fe5550>,
  <tethne.classes.paper.Paper at 0x109fe55d0>,
  <tethne.classes.paper.Paper at 0x109fe5650>,
  <tethne.classes.paper.Paper at 0x109fe5690>,
  <tethne.classes.paper.Paper at 0x109fe56d0>,
  <tethne.classes.paper.Paper at 0x109fe5710>,
  <tethne.classes.paper.Paper at 0x109fe5750>,
  <tethne.classes.paper.Paper at 0x109fe57d0>,
  <tethne.classes.paper.Paper at 0x109fe5790>,
  <tethne.classes.paper.Paper at 0x109fe5850>,
  <tethne.classes.paper.Paper at 0x109fe5810>,
  <tethne.classes.paper.Paper at 0x109fe58d0>,
  <tethne.classes.paper.Paper at 0x109fe5890>,
  <tethne.classes.paper.Paper at 0x109fe5910>,
  <tethne.classes.paper.Paper at 0x109fe5950>,
  <tethne.classes.paper.Paper at 0x109fe5990>,
  <tethne.classes.paper.Paper at 0x109fe5a10>,
  <tethne.classes.paper.Paper at 0x109fe5a90>,
  <tethne.classes.paper.Paper at 0x109fe5ad0>,
  <tethne.classes.paper.Paper at 0x109fe5b10>,
  <tethne.classes.paper.Paper at 0x109fe5b50>,
  <tethne.classes.paper.Paper at 0x109fe5b90>,
  <tethne.classes.paper.Paper at 0x109fe5bd0>,
  <tethne.classes.paper.Paper at 0x109fe5c50>,
  <tethne.classes.paper.Paper at 0x109fe5c90>,
  <tethne.classes.paper.Paper at 0x109fe5d10>,
  <tethne.classes.paper.Paper at 0x109fe5d90>,
  <tethne.classes.paper.Paper at 0x109fe5d50>,
  <tethne.classes.paper.Paper at 0x109fe5dd0>,
  <tethne.classes.paper.Paper at 0x109fe5e10>,
  <tethne.classes.paper.Paper at 0x109fe5cd0>,
  <tethne.classes.paper.Paper at 0x109fe5e90>,
  <tethne.classes.paper.Paper at 0x109fe5e50>,
  <tethne.classes.paper.Paper at 0x109fe5ed0>,
  <tethne.classes.paper.Paper at 0x109fe5f10>,
  <tethne.classes.paper.Paper at 0x109fe5f50>,
  <tethne.classes.paper.Paper at 0x109fe5fd0>,
  <tethne.classes.paper.Paper at 0x109fe5f90>],
 'date': 2004,
 'documentType': u'Review',
 'doi': u'10.1023/B:HIST.0000020279.73895.f2',
 'emailAddress': u'jamesstrick@earthlink.net',
 'isoSource': u'J. Hist. Biol.',
 'issue': 1,
 'journal': u'JOURNAL OF THE HISTORY OF BIOLOGY',
 'keywordsPlus': [u'abiogenesis; archaea; astrobiology; discipline formation; endosymbiosis;',
  u'exobiology; extraterrestrial life; microspheres; NASA; origin of life;',
  u'planetary science; proteinoids'],
 'language': u'English',
 'pageEnd': 180,
 'pageStart': 131,
 'publisher': u'KLUWER ACADEMIC PUBL',
 'publisherAddress': u'VAN GODEWIJCKSTRAAT 30, 3311 GZ DORDRECHT, NETHERLANDS',
 'publisherCity': u'DORDRECHT',
 'reprintAddress': u'Strick, JE (reprint author), Franklin & Marshall Coll, Program Sci Technol & Soc, Lancaster, PA 17604 USA.',
 'subject': [u'Life Sciences & Biomedicine - Other Topics; History & Philosophy of',
  u'Science'],
 'timesCited': [11, u'Z9 11'],
 'title': u'Creating A Cosmic Discipline: The Crystallization And Consolidation Of Exobiology, 1957-1973',
 'volume': u'37',
 'wosid': u'WOS:000220304200007'}

There are several things to notice in the output above. First, each Paper should (generally) have a title:



In [10]:

    
corpus[500].title









    Out[10]:





u'Creating A Cosmic Discipline: The Crystallization And Consolidation Of Exobiology, 1957-1973'

Each Paper should also have a date, journal, and wosid (WoS accession ID). Many will also have dois. Note that we can access the attributes of each Paper using . notation:



In [11]:

    
# corpus[500] gets a Paper, and ``.date`` gets the date attribute.
print 'Date:'.ljust(20), corpus[500].date    
print 'Journal:'.ljust(20), corpus[500].journal
print 'WoS accession ID:'.ljust(20), corpus[500].wosid
print 'DOI:'.ljust(20), corpus[500].doi









    



Date:                2004
Journal:             JOURNAL OF THE HISTORY OF BIOLOGY
WoS accession ID:    WOS:000220304200007
DOI:                 10.1023/B:HIST.0000020279.73895.f2

Each Paper will also have authors. Tethne represents author names as "tuples" of the form (last, first). Depending on the record, first might be first and middle initials, or first and middle names.



In [12]:

    
corpus[500].authors









    Out[12]:





[((u'STRICK', u'JE'), 1)]

Unlike other bibliographic datasets, WoS data contain the cited references of each Paper. Each cited reference is represented as a Paper:



In [13]:

    
corpus[2].citedReferences









    Out[13]:





[<tethne.classes.paper.Paper at 0x10d09e3d0>,
 <tethne.classes.paper.Paper at 0x10d09e4d0>,
 <tethne.classes.paper.Paper at 0x10d09e510>,
 <tethne.classes.paper.Paper at 0x10d09e550>,
 <tethne.classes.paper.Paper at 0x10d09e590>,
 <tethne.classes.paper.Paper at 0x10d09e5d0>,
 <tethne.classes.paper.Paper at 0x10d063f90>,
 <tethne.classes.paper.Paper at 0x10d09e610>,
 <tethne.classes.paper.Paper at 0x10d09e650>,
 <tethne.classes.paper.Paper at 0x10d09e690>,
 <tethne.classes.paper.Paper at 0x10d09e6d0>,
 <tethne.classes.paper.Paper at 0x10d09e710>,
 <tethne.classes.paper.Paper at 0x10d09e750>,
 <tethne.classes.paper.Paper at 0x10d09e7d0>,
 <tethne.classes.paper.Paper at 0x10d09e810>,
 <tethne.classes.paper.Paper at 0x10d09e790>,
 <tethne.classes.paper.Paper at 0x10d09e850>,
 <tethne.classes.paper.Paper at 0x10d09e890>,
 <tethne.classes.paper.Paper at 0x10d09e8d0>]

A "prettier" representation of the cited references is available in the citations attribute.



In [14]:

    
corpus[2].citations









    Out[14]:





[(u'EHRENBERG CG 1831 SYMBOLAE PHYS SEU IC', 1),
 (u'SCHWEIGGER AF 1819 BEOBACHTUNGEN NATURH', 1),
 (u'CUVIER GEORGES 1817 REGNE ANIMAL DISTRIB', 1),
 (u'DALY M 2007 ZOOTAXA', 1),
 (u'OWEN R 1843 LECT COMP ANATOMY PH', 1),
 (u'WINSOR MP 1976 STARFISH JELLYFISH O', 1),
 (u'JOHNSTON G 1838 HIST BRIT ZOOPHYTES', 1),
 (u'BENEDEN PVAN 1845 NOUVEAUX MEMOIRES AC', 1),
 (u'LEUCKART R 1848 MORPHOLOGIE VERWANDT', 1),
 (u'CARTWRIGHT P 2008 J MAR BIOL ASSOC UK', 1),
 (u'STEENSTRUP JJS 1845 ALTERNATION GENERATI', 1),
 (u'FREY H 1847 BEITRAGE KENNTNISS W', 1),
 (u'DUNN CW 2005 DEV DYNAM', 1),
 (u'LINNAEUS C 2003 LINNAEUS PHILOS BOT', 1),
 (u'BRAUN A 1855 AM J SCI ARTS', 3),
 (u'ELWICK J 2007 STYLES REASONING BRI', 1),
 (u'HOFMEISTER W 1851 VERGLEICHENDE UNTERS', 1)]

Each cited reference is represented by what we call an 'ayjid': it contains the author name, year of publication, and the journal in which it was published. Every Paper has an 'ayjid'.



In [15]:

    
corpus[2].ayjid









    Out[15]:





u'NYHART LK 2011 JOURNAL OF THE HISTORY OF BIOLOGY'

Indexing

The most important functionality of the Corpus is indexing. Indexing provides a way of looking up Papers by specific attributes, e.g. by the year in which they were published, or by author.

Each Corpus has a single "primary" index. For WoS data, the wosid field (WoS accession ID) is used by default, since every WoS record has one. You can see which field was used as the primary index by accessing the .index_by attribute of the Corpus.



In [16]:

    
corpus.index_by









    Out[16]:





'wosid'

All of the Papers in the Corpus are stored by wosid in the indexed_papers attribute. The code cell below shows the first ten Papers with their indexing keys.



In [17]:

    
corpus.indexed_papers.items()[:10]









    Out[17]:





[(u'WOS:000235887200021', <tethne.classes.paper.Paper at 0x10df59790>),
 (u'WOS:A1992KB57000009', <tethne.classes.paper.Paper at 0x10b7d0f50>),
 (u'WOS:000295037200001', <tethne.classes.paper.Paper at 0x10d09e350>),
 (u'WOS:A1986A555900003', <tethne.classes.paper.Paper at 0x10c22d910>),
 (u'WOS:000260782100010', <tethne.classes.paper.Paper at 0x10d8c8b90>),
 (u'WOS:A1991GX40800004', <tethne.classes.paper.Paper at 0x10b8a7110>),
 (u'WOS:A1991GX40800005', <tethne.classes.paper.Paper at 0x10b8b52d0>),
 (u'WOS:000295037200003', <tethne.classes.paper.Paper at 0x10d09e410>),
 (u'WOS:A1989CE40700004', <tethne.classes.paper.Paper at 0x10bd5fc10>),
 (u'WOS:A1991GX40800001', <tethne.classes.paper.Paper at 0x10b851510>)]

Additional indexes are located in the indices attribute. The code-cell below shows which fields are already indexed.



In [18]:

    
corpus.indices.keys()









    Out[18]:





['citations', 'authors']

We can look up Papers using the name of an indexed field and some value. For example, to see all of the Papers in which ('MAIENSCHEIN', 'J') is an author, we could do:



In [19]:

    
for paper in corpus[('authors', ('MAIENSCHEIN', 'J'))]:
    print paper.date, paper.title









    



1999 Time, Love, Memory: A Great Biologist And His Quest For The Origins Of Behavior.
2006 Sturtevant & Dobzhansky: Two Scientists At Odds, With A Student'S Recollections
2001 On Cloning: Advocating History Of Biology In The Public Interest
2003 Good Seeing: A Century Of Science At The Carnegie Institution Of Washington. 1902-2002
2000 Camillo Golgi And The Neurosciences
2000 The Hidden Structure: A Scientific Biography Of Camillo Golgi
2005 Jacob'S Ladder. The History Of The Human Genome
1993 Why Collaborate
2001 Untitled - Introduction
1986 Reflections On Ecology And Evolution - Preface
1986 Between Ecology And Evolutionary Biology - Introduction
1999 Untitled
2003 The Rockefeller University Achievements: A Century Of Science For The Benefit Of Mankind. 1901-2001

We can create a new index using the index() method. For example, to index Papers by date, we could do:



In [20]:

    
corpus.index('date')

'date' should now show up in the available indices...



In [21]:

    
corpus.indices.keys()









    Out[21]:





['date', 'citations', 'authors']

...and we can now look up all of the Papers published in 1985:



In [22]:

    
for paper in corpus[('date', 1985)]:
    print paper.date, paper.title









    



1985 From Galen Ureters To Harvey Veins
1985 Paleontology And Philosophy - A Critique
1985 Essay Review - Mayr,Ernst On The History Of Biology
1985 The History Of Ecology - Achievements And Opportunities .2.
1985 The Historical Bases Of The Concept Of Allelopathy
1985 Organotherapy And The Emergence Of Reproductive Endocrinology
1985 Conklin,E.G. On Evolution - The Popular Writings Of An Embryologist
1985 Essay Review - Stone Ages, Old And New
1985 Europe Discovers Civet Cats And Civet
1985 Notes On Darwin,Charles Autobiography
1985 Weismann And Evolution
1985 Keill,James, Cheyne,George, And Newtonian Physiology, 1690-1740
1985 Conceptual Models And Analytical Tools - The Biology Of Physicist Delbruck,Max
1985 American Morphology In The Late 19Th-Century - The Biology Department At Johns-Hopkins-University
1985 The Rise And Fall Of Darwin 2Nd Theory

Simple Networks

One of the core features of Tethne is a set of functions for building networks from bibliographic datasets. These functions are located in the tethne.networks subpackage. In this section, we'll build a coauthorship network.

The first step is to import the networks subpackage.



In [23]:

    
from tethne import networks

Now use the coauthors function to create the network. We need provide it only our Corpus:



In [24]:

    
coauthor_graph = networks.coauthors(corpus)

Tethne uses a package called NetworkX to build networks. All of the network-building functions return NetworkX Graph objects. We can see how large our network is using the order() and size() methods:



In [25]:

    
print coauthor_graph.order()    # Number of nodes.
print coauthor_graph.size()     # Number of edges.

As you can see, historians of science don't collaborate much.

To see a list of nodes, use the nodes() method:



In [26]:

    
coauthor_graph.nodes()[:10]    # [:10] just shows the first ten.









    Out[26]:





[(u'WEBER', u'BH'),
 (u'CAPELOTTI', u'PJ'),
 (u'GREENE', u'MOTT T'),
 (u'PRETE', u'FR'),
 (u'SEYFARTH', u'ERNSTAUGUST'),
 (u'GAYON', u'JEAN'),
 (u'BURIAN', u'RM'),
 (u'DOYLE', u'THOMAS J'),
 (u'MORGAN', u'GREGORY J'),
 (u'SIMOES', u'A')]

...and edges() for edges:



In [27]:

    
coauthor_graph.edges(data=True)[:10]    # [:10] just shows the first ten.
                                        # data=True tells edges() to return details about each edge.









    Out[27]:





[((u'WEBER', u'BH'), (u'PREBBLE', u'JN'), {'weight': 1}),
 ((u'CAPELOTTI', u'PJ'), (u'DEVLIN', u'CL'), {'weight': 1}),
 ((u'GREENE', u'MOTT T'), (u'GREENE', u'PETER S'), {'weight': 1}),
 ((u'PRETE', u'FR'), (u'WOLFE', u'MM'), {'weight': 1}),
 ((u'SEYFARTH', u'ERNSTAUGUST'), (u'ZOTTOLI', u'STEVEN J'), {'weight': 1}),
 ((u'GAYON', u'JEAN'), (u'OSBORNE', u'MICHAEL A'), {'weight': 1}),
 ((u'GAYON', u'JEAN'), (u'GOHAU', u'GABRIEL'), {'weight': 1}),
 ((u'GAYON', u'JEAN'), (u'TIRARD', u'STEPHANC'), {'weight': 1}),
 ((u'BURIAN', u'RM'), (u'LEDERMAN', u'M'), {'weight': 1}),
 ((u'BURIAN', u'RM'), (u'ZALLEN', u'D'), {'weight': 1})]

For networks with anything more than a few nodes, it's hard to visualize what's going on in the iPython environment. So we'll expore the coauthor_graph and visualize it in a network analysis package called Cytoscape.

Cytoscape understands several network file formats. GraphML ((link here)) is probably the most versatile, so we'll use it to export our coauthor graph.

The tethne.writers.graph module has several functions for writing graphs to disk. We'll use to_graphml():



In [28]:

    
from tethne.writers.graph import to_graphml

to_graphml() accepts two arguments: the graph itself, and a string with the path to the output file (that will be created). In the example below, I just put the graph on my desktop.



In [29]:

    
to_graphml(coauthor_graph, '/Users/erickpeirson/Desktop/coauthors_graph.graphml')

If you were to open that file, the first few lines would look something like this:

<?xml version='1.0' encoding='utf-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key attr.name="weight" attr.type="int" for="edge" id="weight" />
  <key attr.name="documentCount" attr.type="int" for="node" id="documentCount" />
  <key attr.name="count" attr.type="double" for="node" id="count" />
  <graph edgedefault="undirected">
    <node id="WEBER, BH">
      <data key="count">1.0</data>
      <data key="documentCount">1</data>
    </node>
    <node id="CAPELOTTI, PJ">
      <data key="count">1.0</data>
      <data key="documentCount">1</data>
    </node>
    <node id="GREENE, MOTT T">
      <data key="count">3.0</data>
      <data key="documentCount">3</data>
    </node>
    <node id="PRETE, FR">
      <data key="count">3.0</data>
      <data key="documentCount">3</data>
    </node>

Everything in the graph is enclosed between the <graphml ...></graphml> tags. Each author is represented by a <node></node> element. Further down, relationships between authors are represented by <edge></edge> elements.

    <edge source="TAUBER, AI" target="BALABAN, M">
      <data key="weight">1</data>
    </edge>
    <edge source="TAUBER, AI" target="PODOLSKY, SH">
      <data key="weight">1</data>
    </edge>
    <edge source="TAUBER, AI" target="CRIST, E">
      <data key="weight">1</data>
    </edge>
    <edge source="RUPKE, N" target="HOSSFELD, U">
      <data key="weight">1</data>
    </edge>
    <edge source="GAWNE, RICHARD" target="NICHOLSON, DANIEL J">
      <data key="weight">1</data>
    </edge>

Visualization in Cytoscape

Go ahead and load Cytoscape. After the application loads, you should see a splash screen like the one below. Click on "From network file", then select your graphml file and click OK.

Once the network loads, you'll see a jumble of nodes and edges. Click on the "Apply Preferred Layout" button (it looks like nodes with arrows pointing in various directions) at the top of the screen.

By default, this should apply a force-directed layout. After a few moments, your network should look something like the image below.

We can visualize attributes of the graph in the "Styles" menu. Click on "Styles" in the upper left. In the example below, I set node & height widths to be equal, and set node size as a continuous function of "count" (this is the number of papers written by each author.

We can set edge attributes, too. In the example below, I set edge width to be a function of "weight", which is the number of papers that the two connected authors wrote together.

You can zoom in and out to take a closer look at parts of the graph. If you click on the "network" tab in the upper left, you'll see a mini version of your network in the lower left, with a blue box showing which area you're currently viewing.

Bibliographic Coupling

Bibliographic coupling can be a useful and computationally cheap way to explore the thematic topology of a large scientific literature.

Bibliographic coupling was first proposed as a method for detecting latent topical affinities among research publications by Myer M. Kessler at MIT in 1958. In 1972, J.C. Donohue suggested that bibliographic coupling could be used to the map "research fronts" in science, and this method, along with co-citation analysis and other citation-based clustering techniques, became a core methodology of the science-mapping craze of the 1970s. Bibliographic coupling is still employed in the context of both information-retrieval and science-studies.

Two papers are bibliographically coupled if they both cite at least some of the same papers. The core assumption of bibliographic coupling analysis is that if two papers cite similar literatures, then they must be topically related in some way. That is, they are more likely to be related to each other than to papers with which they share no cited references.

What we are aiming for is a graph model of our bibliographic data that reveals thematically coherent and informative clusters of documents. We will use Tethne's bibligraphic_coupling() function to generate such a network.

First we import the function:



In [30]:

    
from tethne import bibliographic_coupling

We use this function just like the coauthors() function -- passing the Corpus are our first argument -- but we can also pass additional arguments. min_weight indicates that two Papers must share at least three cited references to be coupled. node_attrs tells the function to add additional information to each node; in this case, 'date' and 'title'.



In [41]:

    
coupling_graph = bibliographic_coupling(corpus, min_weight=3, node_attrs=['date', 'title'])

We can "tune" this function by increasing or decreasing min_weight to yield more or less dense graphs. order() (the number of nodes) and size() (the number of edges) give us a sense of the density of the graph.



In [42]:

    
coupling_graph.order(), coupling_graph.size()









    Out[42]:





(438, 1798)

We can use the to_graphml() function once again to write the graph to disk, so that we can visualize it in Cytoscape.



In [40]:

    
to_graphml(coupling_graph, '/Users/erickpeirson/Desktop/coupling_graph.graphml')

The resulting graph, with some styling, might look like this: