This guide gives a brief demo of what you can do with Paperweight. We'll use a LaTeX document that I have handy, but you may want to try the same commands on your own papers. There are some issues typesetting IPython notebooks on Read the Docs, so you may prefer to view this notebook with the IPython notebook viewer.
First, let's set our current working directory to match the paper for convenience.
In [1]:
from pprint import pprint
import os
os.chdir("/Users/jsick/Dropbox/m31/writing/skysubpub")
!rm -f embedded_demo.tex  # remove output left over from a previous run, if any
If you're processing TeX documents in an automated fashion, you might not know a priori which file contains the root of a LaTeX document. By root we mean the file with a '\documentclass' declaration. Paperweight provides an API to find this. (Note that find_root_tex_document() can accept a directory path as an argument, so you are not restricted to the current working directory.)
In [2]:
from paperweight import texutils
root_tex_path = texutils.find_root_tex_document()
print(root_tex_path)
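To make the idea of a root document concrete, here is a minimal sketch of how such a search could work. This is not Paperweight's implementation; the function name find_root_tex_sketch and its base_dir parameter are my own assumptions for illustration. It simply returns the first .tex file in a directory that contains an uncommented \documentclass declaration.

```python
import os
import re

# Matches a \documentclass declaration at the start of a line.
# A commented-out declaration (e.g. "%\documentclass") does not match.
DOCCLASS_PATTERN = re.compile(r"^\s*\\documentclass", re.MULTILINE)


def find_root_tex_sketch(base_dir="."):
    """Return the path of the first .tex file in base_dir that declares
    a \\documentclass, or None if no such file exists.

    Hypothetical helper for illustration only; not part of Paperweight.
    """
    for name in sorted(os.listdir(base_dir)):
        if not name.endswith(".tex"):
            continue
        path = os.path.join(base_dir, name)
        with open(path) as f:
            if DOCCLASS_PATTERN.search(f.read()):
                return path
    return None
```

The sketch only looks at the top level of the directory and stops at the first match, which is enough for a single-paper directory like the one used here.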
In [3]:
from paperweight.document import FilesystemTexDocument
doc = FilesystemTexDocument(root_tex_path)
With a FilesystemTexDocument you can ask questions about the document, such as what BibTeX file it uses:
In [4]:
print(doc.bib_name)
print(doc.bib_path)
or what the names and word-count locations of the sections are:
In [5]:
doc.sections
Out[5]:
or what other TeX files it includes using \input{} commands:
In [6]:
doc.find_input_documents()
Out[6]:
We can manipulate our LaTeX document too. For instance, we can embed the input TeX files and bibliography directly into the main text body:
In [7]:
doc.inline_inputs()
doc.inline_bbl()
Now you'll see that we no longer reference other TeX files or a BibTeX file, since all text content is embedded in the root TeX document. This can be handy for submitting the article to a journal (in fact, the preprint tool uses Paperweight to do just that).
In [8]:
print(doc.find_input_documents())
print(doc.bib_name)
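If you're curious what inlining involves, the sketch below shows the general idea: each \input{name} command is replaced by the text of the file it names, recursively. This is an illustration under my own assumptions, not how Paperweight does it. In particular, the sources dict is a hypothetical stand-in for files on disk, and inline_inputs_sketch is a made-up name.

```python
import re

# Matches \input{name}; the captured group is the file name argument.
INPUT_PATTERN = re.compile(r"\\input\{([^}]+)\}")


def inline_inputs_sketch(text, sources):
    """Recursively replace each \\input{name} in text with sources[name].

    `sources` is a hypothetical dict mapping input names to TeX source,
    standing in for files on disk. Names missing from `sources` are left
    untouched. Circular inputs are not detected in this sketch.
    """
    def replace(match):
        name = match.group(1).strip()
        if name not in sources:
            return match.group(0)
        # Inlined text may itself contain \input commands.
        return inline_inputs_sketch(sources[name], sources)

    return INPUT_PATTERN.sub(replace, text)


sources = {
    "intro": "Intro text with \\input{detail}.",
    "detail": "details",
}
root = "\\begin{document}\n\\input{intro}\n\\end{document}"
print(inline_inputs_sketch(root, sources))
```

After inlining, searching the result for \input{} commands finds nothing, which mirrors the empty list returned by find_input_documents() above.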
We can delete comments from the LaTeX source as well. When we do that you'll notice that the sections now appear at earlier word count locations.
In [9]:
doc.remove_comments()
In [10]:
doc.sections
Out[10]:
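To see why the section locations shift, here is a small sketch of comment stripping. It is not Paperweight's code: remove_comments_sketch is a hypothetical name, the sample text is made up, and counting words by splitting on whitespace is my own assumption about what a word count means here. The point is only that comment text contributes words until it is removed.

```python
import re

# An unescaped % starts a comment that runs to the end of the line.
# The negative lookbehind keeps literal percent signs written as \%.
# Limitation of this sketch: a % directly after a \\ line break is
# also treated as escaped.
COMMENT_PATTERN = re.compile(r"(?<!\\)%.*")


def remove_comments_sketch(text):
    """Strip LaTeX comments from text, keeping the line breaks."""
    return COMMENT_PATTERN.sub("", text)


# Made-up sample text for illustration.
tex = (
    "% draft notes go here\n"
    "\\section{Intro}\n"
    "We find 50\\% scatter. % check this\n"
)
cleaned = remove_comments_sketch(tex)

# Whitespace-delimited word counts before and after stripping comments.
print(len(tex.split()), len(cleaned.split()))
```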
You can easily write the modified TeX source back to the filesystem with the write() method:
In [11]:
doc.write("embedded_demo.tex")
One of the goals of Paperweight is to allow us to understand our scientific documents. A big part of that is understanding how we cite other papers.
With our document, we can ask which references are made, according to the cite keys used in \cite*{} commands:
In [12]:
doc.bib_keys
Out[12]:
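For intuition, cite keys can be pulled out of TeX source with a regular expression. The sketch below is my own illustration, not Paperweight's parser: extract_bib_keys_sketch is a hypothetical name, and the pattern assumes natbib-style commands such as \citet{}, \citep[][]{} and their starred forms. The key Example:2010 in the sample text is made up.

```python
import re

# \cite, an optional command suffix (t, p, alt, ...), an optional star,
# any number of [optional] arguments, then the {key1,key2} argument.
CITE_PATTERN = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]+)\}")


def extract_bib_keys_sketch(text):
    """Return the unique cite keys in text, in order of first appearance."""
    keys = []
    for match in CITE_PATTERN.finditer(text):
        for key in match.group(1).split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


# Made-up sample sentence; Example:2010 is a placeholder key.
sample = (
    "See \\citet{Vaduvescu:2004} and "
    "\\citep[e.g.,][]{Example:2010, Vaduvescu:2004}."
)
print(extract_bib_keys_sketch(sample))
```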
This is useful, but we can go deeper by understanding the context in which these works are cited. To do this we can use the extract_citation_context() method to generate a dictionary, keyed by bib keys, of all citation instances in the document. In this example paper I've cited 45 works:
In [13]:
cites = doc.extract_citation_context()
print(len(cites))
Each entry in the cites dictionary is a list of specific occurrences where that work was cited. Thus it's easy to count the number of times each work was cited:
In [14]:
# items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
for cite_key, instances in cites.items():
    print("{0} cited {1:d} time(s)".format(cite_key, len(instances)))
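With many cited works, it can help to sort the counts so the most-cited works come first. Here is a minimal sketch that only assumes cites is a dictionary mapping cite keys to lists, as described above. The rank_citations helper and the example_cites data (including the key Example:2010 and all word counts and section names) are made up for illustration.

```python
def rank_citations(cites):
    """Return (cite_key, count) pairs, most-cited first.

    Ties are broken alphabetically by cite key.
    """
    counts = [(key, len(instances)) for key, instances in cites.items()]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical stand-in for the dictionary returned by
# extract_citation_context(); the values here are invented.
example_cites = {
    "Example:2010": [{"section": (100, "Introduction")}],
    "Vaduvescu:2004": [
        {"section": (100, "Introduction")},
        {"section": (900, "Methods")},
    ],
}

for cite_key, count in rank_citations(example_cites):
    print("{0} cited {1:d} time(s)".format(cite_key, count))
```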
It looks like I've cited Vaduvescu:2004 a lot. Let's look at where it was cited:
In [15]:
print([c['section'] for c in cites['Vaduvescu:2004']])
In each entry of the list above, the first item is the cumulative word count where the section starts, and the second item is the name of the section.
There's a lot of other information associated with each citation instance. Here's the metadata associated with the first reference to Vaduvescu:2004:
In [16]:
pprint(cites['Vaduvescu:2004'][0])