Krista Longnecker 11 November 2015 Use this version to keep the code intact
Biopython 1.65 was released on 17th December 2014. One of the new additions to the libraries was the Bio.Graphics.KGML
module, which can be used for customised renderings of KEGG metabolic pathways.
This notebook is an introduction to the Bio.KEGG
and Bio.Graphics.KGML
modules, providing examples of usage, including data retrieval, manipulation and output rendering.
We need to import Biopython and parts of the KEGG
and KGML
modules, but first we'll check that a suitably recent version of Biopython is installed (1.65 or higher):
In [1]:
import Bio
Bio.__version__
Out[1]:
We're going to be using some other Biopython modules, and working with images in the notebook, so we also prepare for this.
In [2]:
from Bio import SeqIO
from Bio.KEGG.REST import *
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
from Bio.Graphics.ColorSpiral import ColorSpiral
from IPython.display import Image, HTML
import random
# A bit of code that will help us display the PDF output
def PDF(filename):
return HTML('<iframe src=%s width=700 height=350></iframe>' % filename)
# A bit of helper code to shorten long text
def head(text, lines=10):
""" Print the first lines lines of the passed text.
"""
print '\n'.join(text.split('\n')[:lines] + ['[...]'])
The KEGG database provides a RESTful interface to its data (see http://www.kegg.jp/kegg/rest/), which we can use via Biopython's Bio.KEGG.REST
module.
The KEGG interface is not as well documented as some other resources (such as NCBI or Ensembl), and KEGG does not provide any usage guidelines. To avoid risking overloading the service, Biopython restricts us to three calls per second. Be warned also that the conditions of service include:
"This service should not be used for bulk data downloads".
We can obtain basic information about a particular database (e.g. "pathway", "kegg", "hsa", etc.) using the kegg_info()
function. This returns a handle which can be read as plaintext, with some useful information about the resource.
In [3]:
# Kyoto Encyclopedia of Genes and Genomes
print(kegg_info("kegg").read())
In [4]:
# KEGG Pathway Database
print(kegg_info("pathway").read())
In [5]:
# Escherichia coli K-12 MG1655 KEGG Genes Database
print(kegg_info("eco").read())
The kegg_list()
function retrieves a handle, which provides a plaintext tab-separated key:value
list of entry identifiers and definitions in the given database or set of database entries.
The kegg_list()
function allows the organism to be specified for a database, with an optional second argument.
In [6]:
# List all pathways in the pathway database
head(kegg_list('pathway').read())
In [7]:
# Only list pathways present in E. coli K-12 MG1655
head(kegg_list('pathway', 'tps').read())
If only the organism is specified, then a list of gene entries is returned.
In [8]:
# E. coli K-12 MG1655 genes
head(kegg_list('tps').read())
The KEGG API also allows the +
operator to be used to combine list identifiers.
In [12]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list('C01290+K00942').read())
But kegg_list()
and other functions will take a list of identifiers, and combine them automatically:
In [13]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list(['C01290', 'K00942']).read())
The kegg_find(database, query)
function finds entries with matching query
keywords or other query
data in a given database
. Data is returned as tab-separated key:value
plain text.
In [15]:
# Find shiga toxin genes
head(kegg_find('genes', 'purines').read())
In [17]:
# Find shiga toxin genes only in Escherichia coli O111 H-11128 (EHEC)
print(kegg_find('tps', 'nucleotidase').read())
The API lets us query on useful properties, such as compound molecular weight:
In [18]:
# Compounds with molecular weight between 300 and 310g/mol
head(kegg_find('compound', '300-310/mol_weight').read())
The kegg_get()
function lets us retrieve data from KEGG in a range of different formats.
Database entries can be retrieved directly as plain text; sequences may be returned as database entries, amino acid or nucleotide sequences in FASTA format; compounds as database entries, or images (.gif
); pathways as database entries, KGML, or images (.png
).
Compounds can be retrieved as their database entry in plain text, or as a .gif
image.
In [19]:
# Compound as database entry
head(kegg_get("cpd:C00144").read())
In [20]:
# Compound as image
Image(kegg_get("cpd:C00144", "image").read())
Out[20]:
Gene data can be retrieved as the plain text database entry, or a FASTA nucleotide or protein sequence.
In [16]:
# Gene as database entry
head(kegg_get("ece:Z5100").read())
In [17]:
# Gene as amino acid sequence
print(kegg_get("ece:Z5100", "aaseq").read())
In [18]:
# Gene as nucleotide sequence
print(kegg_get("ece:Z5100", "ntseq").read())
In [19]:
# Parsing a returned sequence with SeqIO
seq = SeqIO.read(kegg_get("ece:Z5100", "ntseq"), 'fasta')
print seq.format('stockholm')
Pathways can be returned as database entries in plain text, or in the KGML format, or as .png
images. These last two formats are especially useful for generating custom pathway renderings with Bio.Graphics.KGML
.
In [21]:
# Pathway as database entry
head(kegg_get("tps00230").read())
In [22]:
# Pathway as image (png)
Image(kegg_get("tps00230", "image").read())
Out[22]:
In [23]:
# Pathway as KGML
head(kegg_get("tps00230", "kgml").read())
The easiest way to render a pathway is to retrieve an image file directly from KEGG with kegg_get()
, as above.
In [26]:
# Render central metabolism
Image(kegg_get("map01100", "image").read())
Out[26]:
In [27]:
# Render fatty-acid biosynthesis
Image(kegg_get("map00230", "image").read())
Out[27]:
KEGG also provides renderings that are specific to an organism, indicating which elements of the pathway are present in their database for that organism.
For central metabolism, compounds and reactions are only coloured if they are present in the specified organism.
In [25]:
# Render E.coli K-12 MG1655 central metabolism
Image(kegg_get("eco01100", "image").read())
Out[25]:
With other pathways, the reactions and/or genes are coloured green when they are present in the organism.
In [26]:
# Render E.coli K-12 MG1655 fatty-acid biosynthesis
Image(kegg_get("eco00061", "image").read())
Out[26]:
Three types of reference pathway can also be rendered: ko, ec, and rn. These are shown with blue boxes.
In [28]:
# Render reference fatty-acid biosynthesis
Image(kegg_get("ko00061", "image").read())
#Image(kegg_get("ec00061", "image").read())
#Image(kegg_get("rn00061", "image").read())
Out[28]:
KEGG provides descriptions of their pathway maps in an exchange format called KGML (spec). This doesn't quite match the maps that can be downloaded through the public interface. If you're a subscriber to KEGG, then you have access to an alternative set of KGML+ files. The Biopython interface uses the public API, so we only work with KGML files.
To obtain a pathway's KGML file from KEGG, we can use the kegg_get()
function:
In [29]:
# Get KGML for fatty-acid biosynthesis
ko_map = (kegg_get("ko00061", "kgml").read()) # KO version (KEGG orthologues)
eco_map = (kegg_get("eco00061", "kgml").read()) # E. coli version
Visualising the contents of the KGML file, it can be seen that two useful kinds of elements are described in the file: ortholog
and compound
entries. Reactions and other connections are usually not defined directly.
In [30]:
# View the contents of ko_map KGML
head(ko_map)
Typically, ortholog
s and compound
s are the only items that can be manipulated on a KEGG pathway map. KGML does not describe the complete map, as provided in the downloaded image.
The base maps (such as, in this case, map00061
) do not contain ortholog
or compound
features, and consequently there is no associated KGML file. Attempting to retrieve KGML data for these maps produces an error.
In [30]:
# Trying to retrieve base map KGML generates an error
#base_map = (kegg_get("map00061", "kgml").read())
The beautiful images generated by KEGG are rendered as .png
files, which may be downloaded, and the features described in the accompanying KGML description overlaid, with custom formatting. This is what is currently done in the majority of cases, here.
First, we need to parse the KGML we get from KEGG into a Pathway
object. This holds an internal representation of the pathway information, such as ortholog
s and compound
s (and, as we'll see later, reaction
s and relations.
The Pathway
object provides some useful functions for manipulation of elements in the KGML representations, and allows us to write out a new KGML file that preserves the changes we make. Using print
on the Pathway
object gives us some summary information.
KGML_parser
will read a KGML file on disk, and the stream returned from a kegg_get()
function, as below.
In [31]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
print(pathway)
If we want to render this as an image, we need to use a KGMLCanvas
object, from Bio.Graphics.KGML_vis
. We do this by instantiating a KGMLCanvas
object with our pathway.
In [32]:
canvas = KGMLCanvas(pathway)
Now that we have an image, we can render it. At the moment, there is only the capability to render to .pdf
. To render to a file, call the draw()
method of the canvas.
In [33]:
canvas.draw("fab_map.pdf")
PDF("fab_map.pdf")
Out[33]:
But there's something wrong with this picture. It shows only the elements defined in the KGML file. All the beautiful KEGG layout, including the lines indicating the flow of the biochemistry, is missing. That's because the KGML file doesn't contain any information about it, directly.
However, each Pathway
object created from KGML file has an image
attribute with the URL of its .png
format representation. To use this as a background, all we need to do to render this is specify import_imagemap=True
when creating the canvas, or set canvas.import_imagemap = True
once the canvas is created. (It is also possible to specify an image file on disk as the background.)
In [34]:
canvas.import_imagemap = True
canvas.draw("fab_map_with_image.pdf")
PDF("fab_map_with_image.pdf")
Out[34]:
This could be encapsulated in a useful helper function:
In [35]:
def draw_kegg_map(map_id):
""" Render a local PDF of a KEGG map with the passed map ID
"""
# Get the background image first
pathway = KGML_parser.read(kegg_get(map_id, "kgml"))
canvas = KGMLCanvas(pathway, import_imagemap=True)
img_filename = "%s.pdf" % map_id
canvas.draw(img_filename)
In the Bio.Graphics
model of rendering KGML, we are superimposing the elements described in a KGML file over an associated background image. As a result, only the elements that are described in the KGML file (or any we introduce ourselves...) can be modified.
The current interface to identifying and retrieving pathway elements is a little obscure as of Biopython 1.65 - if you want to retrieve a particular element by ID or annotation - but it is possible to modify the graphical elements that are rendered, to produce customised maps from KGML data.
The modifications we can make are:
graphics.type
)graphics.x
, graphics.y
)graphics.height
)graphics.width
)graphics.fgcolor
)graphics.bgcolor
)graphics.name
)but this still gives us a lot of flexibility in terms of visual annotation from a dataset. We can also choose to display, or not, elements of interest, and transparency effects are available.
As a motivating example, we're going to use the fatty-acid biosynthesis pathway (map00061
) from earlier, and make the following modifications:
ortholog
elements according to a random choice of colourscompound
element to reflect its molecular massWe saw the default rendering of the ko00061
pathway earlier:
In [36]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_with_image.pdf")
PDF("fab_map_with_image.pdf")
Out[36]:
We can get access to all the pathway ortholog
elements via the pathway.orthologs
attribute. This will list all ortholog
elements in the pathway as Entry
objects.
In [37]:
pathway.orthologs[:5]
Out[37]:
In [38]:
print(pathway.orthologs[0])
Each Entry
object has a graphics
attribute that stores information about how the item is rendered, in a list of Graphics
objects - this reflects the data structure in the KGML representation, which may have more than one graphical entity per entry in the pathway.
In [39]:
pathway.orthologs[0].graphics
Out[39]:
Each Graphics
object has attributes that may be modified directly:
name Label for the graphics object
x X-axis position of the object (int)
y Y-axis position of the object (int)
coords polyline co-ordinates, list of (int, int) tuples
type object shape
width object width (int)
height object height (int)
fgcolor object foreground color (hex RGB)
bgcolor object background color (hex RGB)
There are also two read-only properties that are provided for convenience
bounds returns the graphical bounds of the object
centre returns the centre of the object as an (x, y) tuple
In [40]:
element = pathway.orthologs[0].graphics[0]
attrs = [element.name, element.x, element.y, element.coords, element.type,
element.width, element.height, element.fgcolor, element.bgcolor,
element.bounds, element.centre]
print '\n'.join([str(attr) for attr in attrs])
To change the ortholog
colours, it's easy enough to loop over the ortholog
elements in the pathway, and modify their graphics.bgcolor
attributes.
As KGML expects hex colours in its specification, we also provide hex colours here, using the helper function below:
In [41]:
# Helper function to convert colour as RGB tuple to hex string
def rgb_to_hex(rgb):
rgb = tuple([int(255*val) for val in rgb])
return '#' + ''.join([hex(val)[2:] for val in rgb]).upper()
For this example we're defining arbitrary colours, but some obvious uses might include the use of colour to represent some biologically-significant value, like transcript abundance, $\frac{dN}{dS}$, or flux through that step.
In [42]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))
# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
for graphic in element.graphics:
graphic.bgcolor = rgb_to_hex(color)
In [43]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_new_colours.pdf")
PDF("fab_map_new_colours.pdf")
Out[43]:
We can get access to the pathway compound
elements via pathway.compounds
, just as for ortholog
s, and to change sizes, we can loop over them, modifying both width
and height
.
As with the colour changes, the sizes used are arbitrary, but this kind of modification (maybe incorporating colour changes, too) could be used to represent biochemical measurements such as metabolite concentration.
In [44]:
# Change the sizes of compound elements
for size, element in zip(range(8, 8+len(pathway.compounds)), pathway.compounds):
for graphic in element.graphics:
graphic.width = size
graphic.height = size
In [45]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_new_sizes.pdf")
PDF("fab_map_new_sizes.pdf")
Out[45]:
The maps we've been considering so far have not been completely described by the publicly-available KGML file. This is true for the great majority of KEGG maps. But there are three other maps: ko01100
, ko01110
, and ko01120
for which this is not the case, and the KGML file provides a lot more information. Specifically, they also describe lines representing reactions that connect metabolite compounds. This gives us quite a lot of scope for modifying their presentation, but it also means that they're slower to render.
In [46]:
# The three KEGG maps with lines representing reactions.
maps = ['ko01100', 'ko01110', 'ko01120']
[draw_kegg_map(map) for map in maps]
print(kegg_get(maps).read())
In [47]:
PDF(maps[0]+'.pdf')
Out[47]:
In [48]:
PDF(maps[1]+'.pdf')
Out[48]:
In [49]:
PDF(maps[2]+'.pdf')
Out[49]:
Unlike the custom renderings that can be obtained from KEGG directly, with this module we now have full control over line width, element size, element colour, and labelling options.
To avoid noise in the image, it can be worth setting import_imagemap=False
when making the canvas or drawing to a file for these maps.
First, let's set each reaction line to a random value between 1 and 10. This could be adapted to represent carbon flux measurements, for example.
In [48]:
# Use the bacterial diverse environments map
pathway = KGML_parser.read(kegg_get("ko01120", "kgml"))
# Change the widths of reaction entries elements
for element in pathway.orthologs:
for graphic in element.graphics:
graphic.width = random.randrange(1, 10, 1)
In [49]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw("bacteria_mod_widths.pdf")
PDF("bacteria_mod_widths.pdf")
Out[49]:
We can modify pathway colours in the same way as before, except that now we have to modify the foreground, not the background colour, through the attribute fgcolor
. The foreground colour should be specified as a hex string.
In [50]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))
# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
for graphic in element.graphics:
graphic.fgcolor = rgb_to_hex(color)
In [51]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw("bacteria_mod_colour.pdf")
PDF("bacteria_mod_colour.pdf")
Out[51]:
In [53]: