Interacting with KEGG data using Biopython

 Introduction

Biopython 1.65 was released on 17th December 2014. One of the new additions to the libraries was the Bio.Graphics.KGML module, which can be used for customised renderings of KEGG metabolic pathways.

This notebook is an introduction to the Bio.KEGG and Bio.Graphics.KGML modules, providing examples of usage, including data retrieval, manipulation and output rendering.

Running cells in this notebook

This is an interactive notebook, which means you are able to run the code that is written in each of the cells.

If this is successful, you should see the input marker to the left of the cell change from

In [ ]:

to (for example)

In [1]:

and you may see output appear below the cell.

0. Setting up

We need to import Biopython and parts of the KEGG and KGML modules, but first we'll check that a suitably recent version of Biopython is installed (1.65 or higher for Python 2; 1.67 or higher for Python 3):


In [ ]:
import Bio
Bio.__version__

We're going to be using some other Biopython modules, and working with images in the notebook, so we also prepare for this, with the following imports.


In [ ]:
from Bio import SeqIO
from Bio.KEGG.REST import *
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
from Bio.Graphics.ColorSpiral import ColorSpiral

from IPython.display import Image, HTML

import os
import random

To visualise the PDF output generated by this library inline, and to provide a head-like peek at text output, the functions below are defined:


In [ ]:
# A bit of code that will help us display the PDF output
def PDF(filename):
    return HTML('<iframe src=%s width=700 height=350></iframe>' % filename)

# A bit of helper code to shorten long text
def head(text, lines=10):
    """Print the first lines lines of the passed text."""
    print('\n'.join(text.split('\n')[:lines] + ['[...]']))

Output files will be stored in the subdirectory kgml_output, which may need to be created:


In [ ]:
# Create subdirectory to store output
outdir = "kgml_output"
try:
    os.makedirs(outdir, exist_ok=True)
except TypeError:
    if not os.path.isdir(outdir):
        os.mkdir(outdir)

1. Retrieving data from KEGG with Bio.KEGG.REST

The KEGG database provides a RESTful interface to its data (see http://www.kegg.jp/kegg/rest/), which we can use via Biopython's Bio.KEGG.REST module.

The KEGG interface is not as well documented as some other resources (such as NCBI or Ensembl), and KEGG does not provide any usage guidelines. To avoid risking overloading the service, Biopython restricts us to three calls per second. Be warned also that the conditions of service include:

"This service should not be used for bulk data downloads".

NOTE: The KEGG API returns data in Byte handles (in Python3, these are strings prefixed with the b character). Biopython's interface converts text data to String handles, but leaves image data as the Byte type.

1a. Obtaining information about a database with kegg_info()

We can obtain basic information about the KEGG databases (e.g. "pathway", "kegg", "hsa", etc.) using the kegg_info() function. This returns a handle which can be read as plaintext, with some useful information about the resource.


In [ ]:
# Information about the Kyoto Encyclopedia of Genes and Genomes
print(kegg_info("kegg").read())

In [ ]:
# Information about the KEGG Pathway Database
print(kegg_info("pathway").read())

In [ ]:
# Information about the Escherichia coli K-12 MG1655 KEGG Genes Database
print(kegg_info("eco").read())

1b. Obtaining a list of entries in a database with kegg_list()

The kegg_list() function retrieves a handle, which provides a plaintext tab-separated key:value list of entry identifiers and definitions in the given database or set of database entries.

The kegg_list() function allows the organism to be specified for a database, with an optional second argument.


In [ ]:
# List all pathways in the pathway database
head(kegg_list('pathway').read())

In [ ]:
# Only list pathways present in E. coli K-12 MG1655
head(kegg_list('pathway', 'eco').read())

If only the organism is specified, then a list of gene entries is returned.


In [ ]:
# E. coli K-12 MG1655 genes
head(kegg_list('eco').read())

The KEGG API also allows the + operator to be used to combine list identifiers.


In [ ]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list('C01290+G00092').read())

But kegg_list() and other functions will take a Python list of identifiers, and combine them automatically:


In [ ]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list(['C01290', 'G00092']).read())

1c. Finding a specific entry with kegg_find()

The kegg_find(database, query) function finds entries with matching query keywords or other query data in a given database. Data is returned as tab-separated key:value plain text.


In [ ]:
# Find shiga toxin genes
head(kegg_find('genes', 'shiga+toxin').read())

In [ ]:
# Find shiga toxin genes only in Escherichia coli O111 H-11128 (EHEC)
print(kegg_find('eoi', 'shiga+toxin').read())

The API lets us query on useful properties, such as compound molecular weight:


In [ ]:
# Compounds with molecular weight between 300 and 310g/mol
head(kegg_find('compound', '300-310/mol_weight').read())

1d. Retrieving specific entries with kegg_get()

The kegg_get() function lets us retrieve data from KEGG in a range of different formats.

Database entries can be retrieved directly as plain text; sequences may be returned as database entries, amino acid or nucleotide sequences in FASTA format; compounds as database entries, or images (.gif); pathways as database entries, KGML, or images (.png).

Compounds can be retrieved as their database entry in plain text, or as a .gif image.


In [ ]:
# Compound as database entry
head(kegg_get("cpd:C01290").read())

In [ ]:
# Compound as image
Image(kegg_get("cpd:C01290", "image").read())

Gene data can be retrieved as the plain text database entry, or a FASTA nucleotide or protein sequence.


In [ ]:
# Gene as database entry
head(kegg_get("ece:Z5100").read())

In [ ]:
# Gene as amino acid sequence
print(kegg_get("ece:Z5100", "aaseq").read())

In [ ]:
# Gene as nucleotide sequence
print(kegg_get("ece:Z5100", "ntseq").read())

In [ ]:
# Parsing a returned sequence with SeqIO, and converting format
seq = SeqIO.read(kegg_get("ece:Z5100", "ntseq"), 'fasta')
print(seq.format('stockholm'))

Pathways can be returned as database entries in plain text, or in the KGML format, or as .png images. These last two formats are especially useful for generating custom pathway renderings with Bio.Graphics.KGML.


In [ ]:
# Pathway as database entry
head(kegg_get("hsa05130").read())

In [ ]:
# Pathway as image (png)
Image(kegg_get("hsa05130", "image").read())

In [ ]:
# Pathway as KGML
head(kegg_get("hsa05130", "kgml").read())

2. Rendering pathways with Bio.KEGG and Bio.Graphics.KGML

2a. Retrieving renderings from KEGG

The easiest way to render a pathway is to retrieve an image file directly from KEGG with kegg_get(), as above.


In [ ]:
# Render central metabolism
Image(kegg_get("map01100", "image").read())

In [ ]:
# Render fatty-acid biosynthesis
Image(kegg_get("map00061", "image").read())

KEGG also provides renderings that are specific to an organism, indicating which elements of the pathway are present in their database for that organism.

For central metabolism, compounds and reactions are only coloured if they are present in the specified organism. Otherwise, they are rendered in grey.


In [ ]:
# Render E.coli K-12 MG1655 central metabolism
Image(kegg_get("eco01100", "image").read())

With other pathways, the reactions and/or genes are coloured green when they are present in the organism, and in white otherwise.


In [ ]:
# Render E.coli K-12 MG1655 fatty-acid biosynthesis
Image(kegg_get("eco00061", "image").read())

Three types of reference pathway can also be rendered: ko, ec, and rn. These are shown with blue boxes.


In [ ]:
# Render reference fatty-acid biosynthesis
#Image(kegg_get("ko00061", "image").read())
#Image(kegg_get("ec00061", "image").read())
Image(kegg_get("rn00061", "image").read())

2b. Reproducing KEGG renderings from KGML

KEGG provides descriptions of their pathway maps in an exchange format called KGML (spec). This doesn't quite match the maps that can be downloaded through the public interface. If you're a subscriber to KEGG, then you have access to an alternative set of KGML+ files. The Biopython interface uses the public API, so we only work with KGML files.

To obtain a pathway's KGML file from KEGG, we can use the kegg_get() function. In the cell below, we store the downloaded data for two maps in the variables ko_map and eco_map.


In [ ]:
# Get KGML for fatty-acid biosynthesis
ko_map = (kegg_get("ko00061", "kgml").read())    # KO version (KEGG orthologues)
eco_map = (kegg_get("eco00061", "kgml").read())  # E. coli version

Visualising the contents of the KGML file, it can be seen that two useful kinds of elements are described in the file: ortholog and compound entries. Reactions and other connections are usually not defined directly.


In [ ]:
# View the contents of ko_map KGML
head(ko_map, 20)

Typically, orthologs and compounds are the only items that can be manipulated on a KEGG pathway map. KGML does not describe the complete map, with its connectors and labels, as provided in the downloaded image.

The base maps (such as, in this case, map00061) do not contain ortholog or compound features, and consequently there is no associated KGML file. Attempting to retrieve KGML data for these maps produces an HTTP 404: not found error.


In [ ]:
# Trying to retrieve base map KGML generates an HTTP 404: not found error
#base_map = (kegg_get("map00061", "kgml").read())

The beautiful images generated by KEGG are rendered as .png files, which may be downloaded, and the features described in the accompanying KGML description overlaid, with custom formatting. This is what is currently done in the majority of cases, here.

To do this, first we need to parse the KGML we get from KEGG into a Pathway object. This holds an internal representation of the pathway information, such as orthologs and compounds (and, as we'll see later, reactions and relations.

The Pathway object provides some useful functions for manipulation of elements in the KGML representations, and allows us to write out a new KGML file that preserves the changes we make. Using print on the Pathway object gives us some summary information.

KGML_parser will read a KGML file on disk, and the stream returned from a kegg_get() function, as below.


In [ ]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
print(pathway)

To render this data as an image, we need to set up a KGMLCanvas object, from Bio.Graphics.KGML_vis, to draw on. We do this by instantiating a KGMLCanvas object with our pathway data.


In [ ]:
canvas = KGMLCanvas(pathway)

Now that we have an image, we can render it. At the moment, there is only the capability to render to .pdf. To render to a file, call the draw() method of the canvas, with the output filename as an argument.

Using the cell below, we also view the .pdf file that is created in an iFrame.


In [ ]:
canvas.draw(os.path.join(outdir, "fab_map.pdf"))
PDF(os.path.join(outdir, "fab_map.pdf"))

But there's something wrong with this picture. It shows only the elements defined in the KGML file. All the beautiful KEGG layout, including the lines indicating the flow of the biochemistry, is missing. That's because the KGML file doesn't contain any information about it, directly.

However, each Pathway object created from KGML file has an image attribute with the URL of its .png format representation. To use this as a background, all we need to do to render this is specify import_imagemap=True when creating the canvas, or set canvas.import_imagemap = True once the canvas is created. (It is also possible to specify an image file on disk as the background.)


In [ ]:
canvas.import_imagemap = True
canvas.draw(os.path.join(outdir, "fab_map_with_image.pdf"))
PDF(os.path.join(outdir, "fab_map_with_image.pdf"))

This could be encapsulated in a useful helper function, as in the cell below:


In [ ]:
def draw_kegg_map(map_id, outdir):
    """Render a local PDF of a KEGG map with the passed map ID."""
    # Get the background image first
    pathway = KGML_parser.read(kegg_get(map_id, "kgml"))
    canvas = KGMLCanvas(pathway, import_imagemap=True)
    img_filename = "%s.pdf" % map_id
    canvas.draw(os.path.join(outdir, img_filename))

2c. Customised renderings from KGML - Part 1

In the Bio.Graphics model of rendering KGML, we are superimposing the elements described in a KGML file over an associated background image. As a result, only the elements that are described in the KGML file (or that we introduce ourselves...) can be modified.

The current interface to identifying and retrieving pathway elements is a little obscure as of Biopython 1.65 - if you want to retrieve a particular element by ID or annotation - but it is possible to modify the graphical elements that are rendered, to produce customised maps from KGML data.

The modifications we can make are:

  • shape (graphics.type)
  • location ($x,y$ co-ordinates: graphics.x, graphics.y)
  • height (graphics.height)
  • width (graphics.width)
  • foreground colour (graphics.fgcolor)
  • background colour (graphics.bgcolor)
  • label (graphics.name)

but this still gives us a lot of flexibility in terms of visual annotation from a dataset. We can also choose to display, or not, elements of interest, and transparency effects are available.

A motivating example

As a motivating example, we're going to use the fatty-acid biosynthesis pathway (map00061) from earlier, and make the following modifications:

  1. colour the ortholog elements according to a random choice of colours
  2. change the size of each compound element to reflect its molecular mass

We saw the default rendering of the ko00061 pathway earlier:


In [ ]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw(os.path.join(outdir, "fab_map_with_image.pdf"))
PDF(os.path.join(outdir, "fab_map_with_image.pdf"))

2c.1 Changing ortholog background colours

We can get access to all the pathway ortholog elements via the pathway.orthologs attribute. This will list all ortholog elements in the pathway as Entry objects.


In [ ]:
pathway.orthologs[:5]

In [ ]:
print(pathway.orthologs[0])

Each Entry object has a graphics attribute that stores information about how the item is rendered, in a list of Graphics objects - this reflects the data structure in the KGML representation, which may have more than one graphical entity per entry in the pathway.


In [ ]:
pathway.orthologs[0].graphics

Each Graphics object has attributes that may be modified directly:

name         Label for the graphics object
x            X-axis position of the object (int)
y            Y-axis position of the object (int)
coords       polyline co-ordinates, list of (int, int) tuples
type         object shape
width        object width (int)
height       object height (int)
fgcolor      object foreground color (hex RGB)
bgcolor      object background color (hex RGB)

There are also two read-only properties that are provided for convenience

bounds       returns the graphical bounds of the object
centre       returns the centre of the object as an (x, y) tuple

In [ ]:
element = pathway.orthologs[0].graphics[0]
attrs = ["name", "x", "y", "coords", "type", "width", "height",
         "fgcolor", "bgcolor", "bounds", "centre"]
for a in attrs:
    print("element.{0: <10}:\t{1}".format(a, getattr(element, a)))

To change the ortholog colours, it's possible to loop over the ortholog elements in the pathway, and modify their graphics.bgcolor attributes.

As KGML expects hex colours in its specification, we also provide hex colours here, using the helper function below:


In [ ]:
# Helper function to convert colour as RGB tuple to hex string
def rgb_to_hex(rgb):
    rgb = tuple([int(255*val) for val in rgb])
    return '#' + ''.join([hex(val)[2:] for val in rgb]).upper()

For this example we're defining arbitrary colours, but some obvious uses might include the use of colour to represent some biologically-significant value, like transcript abundance, $\frac{dN}{dS}$, or flux through that step.


In [ ]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))

# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
    for graphic in element.graphics:
        graphic.bgcolor = rgb_to_hex(color)

In [ ]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw(os.path.join(outdir, "fab_map_new_colours.pdf"))
PDF(os.path.join(outdir, "fab_map_new_colours.pdf"))

2c.2 Changing compound sizes

We can get access to the pathway compound elements via pathway.compounds, just as for orthologs, and to change sizes, we can loop over them, modifying both width and height.

As with the colour changes, the sizes used are arbitrary, but this kind of modification (maybe incorporating colour changes, too) could be used to represent biochemical measurements such as metabolite concentration.


In [ ]:
# Change the sizes of compound elements
for size, element in zip(range(8, 8+len(pathway.compounds)), pathway.compounds):
    for graphic in element.graphics:
        graphic.width = size
        graphic.height = size

In [ ]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw(os.path.join(outdir, "fab_map_new_sizes.pdf"))
PDF(os.path.join(outdir, "fab_map_new_sizes.pdf"))

2d. Customised renderings from KGML - Part 2: central metabolism

The maps we've been considering so far have not been completely described by the publicly-available KGML file. This is true for the great majority of KEGG maps. But there are three KEGG maps: ko01100, ko01110, and ko01120 for which this is not the case, and the KGML file provides a lot more information. Specifically, they also describe lines representing reactions that connect metabolite compounds. This gives us quite a lot of scope for modifying their presentation to convey useful information.


In [ ]:
# The three KEGG maps with lines representing reactions.
maps = ['ko01100', 'ko01110', 'ko01120']
[draw_kegg_map(map, outdir) for map in maps]
print(kegg_get(maps).read())

In [ ]:
PDF(os.path.join(outdir, maps[0]+'.pdf'))

In [ ]:
PDF(os.path.join(outdir, maps[1]+'.pdf'))

In [ ]:
PDF(os.path.join(outdir, maps[2]+'.pdf'))

Unlike the custom renderings that can be obtained from KEGG directly, with this module we now have full control over line width, element size, element colour, and labelling options.

To avoid noise in the image, we can set import_imagemap=False when making the canvas or drawing to a file for these maps.

First, let's set each reaction line to a random value between 1 and 10. This could be adapted to represent carbon flux measurements, for example.


In [ ]:
# Use the bacterial diverse environments map
pathway = KGML_parser.read(kegg_get("ko01120", "kgml"))

# Change the widths of reaction entries elements
for element in pathway.orthologs:
    for graphic in element.graphics:
        graphic.width = random.randrange(1, 10, 1)

In [ ]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw(os.path.join(outdir, "bacteria_mod_widths.pdf"))
PDF(os.path.join(outdir, "bacteria_mod_widths.pdf"))

We can modify pathway colours in the same way as before, except that now we have to modify the foreground, not the background colour, through the attribute fgcolor. The foreground colour should be specified as a hex string.


In [ ]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))

# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
    for graphic in element.graphics:
        graphic.fgcolor = rgb_to_hex(color)

In [ ]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw(os.path.join(outdir, "bacteria_mod_colour.pdf"))
PDF(os.path.join(outdir, "bacteria_mod_colour.pdf"))