Krista Longnecker 11 November 2015 Use this version to keep the code intact

Playing with the new Biopython KGML rendering module

Biopython 1.65 was released on 17th December 2014. One of the new additions to the libraries was the Bio.Graphics.KGML module, which can be used for customised renderings of KEGG metabolic pathways.

This notebook is an introduction to the Bio.KEGG and Bio.Graphics.KGML modules, providing examples of usage, including data retrieval, manipulation and output rendering.

0. Setting up

We need to import Biopython and parts of the KEGG and KGML modules, but first we'll check that a suitably recent version of Biopython is installed (1.65 or higher):


In [1]:
import Bio
Bio.__version__


Out[1]:
'1.66'

We're going to be using some other Biopython modules, and working with images in the notebook, so we also prepare for this.


In [2]:
from Bio import SeqIO
from Bio.KEGG.REST import *
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
from Bio.Graphics.ColorSpiral import ColorSpiral

from IPython.display import Image, HTML

import random

# A bit of code that will help us display the PDF output
def PDF(filename):
    return HTML('<iframe src=%s width=700 height=350></iframe>' % filename)

# A bit of helper code to shorten long text
def head(text, lines=10):
    """ Print the first lines lines of the passed text.
    """
    print '\n'.join(text.split('\n')[:lines] + ['[...]'])

1. Retrieving data from KEGG with Bio.KEGG.REST

The KEGG database provides a RESTful interface to its data (see http://www.kegg.jp/kegg/rest/), which we can use via Biopython's Bio.KEGG.REST module.

The KEGG interface is not as well documented as some other resources (such as NCBI or Ensembl), and KEGG does not provide any usage guidelines. To avoid risking overloading the service, Biopython restricts us to three calls per second. Be warned also that the conditions of service include:

"This service should not be used for bulk data downloads".

1a. Obtaining information about a database with kegg_info()

We can obtain basic information about a particular database (e.g. "pathway", "kegg", "hsa", etc.) using the kegg_info() function. This returns a handle which can be read as plaintext, with some useful information about the resource.


In [3]:
# Kyoto Encyclopedia of Genes and Genomes
print(kegg_info("kegg").read())


kegg             Kyoto Encyclopedia of Genes and Genomes
kegg             Release 76.0+/11-10, Nov 15
                 Kanehisa Laboratories
                 pathway     418,712 entries
                 brite       142,941 entries
                 module      338,339 entries
                 disease       1,432 entries
                 drug         10,310 entries
                 environ         850 entries
                 orthology    19,072 entries
                 genome        4,069 entries
                 genes     17,856,112 entries
                 dgenes      602,101 entries
                 compound     17,492 entries
                 glycan       10,989 entries
                 reaction      9,970 entries
                 rpair        15,159 entries
                 rclass        2,983 entries
                 enzyme        6,605 entries


In [4]:
# KEGG Pathway Database
print(kegg_info("pathway").read())


pathway          KEGG Pathway Database
path             Release 76.0+/11-11, Nov 15
                 Kanehisa Laboratories
                 418,712 entries


In [5]:
# Escherichia coli K-12 MG1655 KEGG Genes Database
print(kegg_info("eco").read())


T00007           Escherichia coli K-12 MG1655 KEGG Genes Database
eco              Release 76.0+/11-10, Nov 15
                 Kanehisa Laboratories
                 4,497 entries

1b. Obtaining a list of entries in a database with kegg_list()

The kegg_list() function retrieves a handle, which provides a plaintext tab-separated key:value list of entry identifiers and definitions in the given database or set of database entries.

The kegg_list() function allows the organism to be specified for a database, with an optional second argument.


In [6]:
# List all pathways in the pathway database
head(kegg_list('pathway').read())


path:map00010	Glycolysis / Gluconeogenesis
path:map00020	Citrate cycle (TCA cycle)
path:map00030	Pentose phosphate pathway
path:map00040	Pentose and glucuronate interconversions
path:map00051	Fructose and mannose metabolism
path:map00052	Galactose metabolism
path:map00053	Ascorbate and aldarate metabolism
path:map00061	Fatty acid biosynthesis
path:map00062	Fatty acid elongation
path:map00071	Fatty acid degradation
[...]

In [7]:
# Only list pathways present in E. coli K-12 MG1655
head(kegg_list('pathway', 'tps').read())


path:tps00010	Glycolysis / Gluconeogenesis - Thalassiosira pseudonana
path:tps00020	Citrate cycle (TCA cycle) - Thalassiosira pseudonana
path:tps00030	Pentose phosphate pathway - Thalassiosira pseudonana
path:tps00040	Pentose and glucuronate interconversions - Thalassiosira pseudonana
path:tps00051	Fructose and mannose metabolism - Thalassiosira pseudonana
path:tps00052	Galactose metabolism - Thalassiosira pseudonana
path:tps00053	Ascorbate and aldarate metabolism - Thalassiosira pseudonana
path:tps00061	Fatty acid biosynthesis - Thalassiosira pseudonana
path:tps00062	Fatty acid elongation - Thalassiosira pseudonana
path:tps00071	Fatty acid degradation - Thalassiosira pseudonana
[...]

If only the organism is specified, then a list of gene entries is returned.


In [8]:
# E. coli K-12 MG1655 genes
head(kegg_list('tps').read())


tps:THAPSDRAFT_3326	hypothetical protein
tps:THAPSDRAFT_3328	hypothetical protein
tps:THAPSDRAFT_3331	hypothetical protein
tps:THAPSDRAFT_21698	hypothetical protein
tps:THAPSDRAFT_32933	hypothetical protein
tps:THAPSDRAFT_21706	hypothetical protein; K03679 exosome complex component RRP4
tps:THAPSDRAFT_3349	hypothetical protein; K10903 HUS1 checkpoint protein
tps:THAPSDRAFT_21707	hypothetical protein
tps:THAPSDRAFT_3354	hypothetical protein
tps:THAPSDRAFT_21712	hypothetical protein; K17907 autophagy-related protein 9
[...]

The KEGG API also allows the + operator to be used to combine list identifiers.


In [12]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list('C01290+K00942').read())


cpd:C01290	beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide; beta-D-Galactosyl-1,4-beta-D-glucosylceramide; Lactosylceramide; Gal-beta1->4Glc-beta1->1'Cer; LacCer; Lactosyl-N-acylsphingosine; D-Galactosyl-1,4-beta-D-glucosylceramide
ko:K00942	E2.7.4.8, gmk; guanylate kinase [EC:2.7.4.8]

But kegg_list() and other functions will take a list of identifiers, and combine them automatically:


In [13]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list(['C01290', 'K00942']).read())


cpd:C01290	beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide; beta-D-Galactosyl-1,4-beta-D-glucosylceramide; Lactosylceramide; Gal-beta1->4Glc-beta1->1'Cer; LacCer; Lactosyl-N-acylsphingosine; D-Galactosyl-1,4-beta-D-glucosylceramide
ko:K00942	E2.7.4.8, gmk; guanylate kinase [EC:2.7.4.8]

1c. Finding a specific entry with kegg_find()

The kegg_find(database, query) function finds entries with matching query keywords or other query data in a given database. Data is returned as tab-separated key:value plain text.


In [15]:
# Find shiga toxin genes
head(kegg_find('genes', 'purines').read())


aly:ARALYDRAFT_487134	permease, cytosine/purines, uracil, thiamine, allantoin family protein; K03457 nucleobase:cation symporter-1, NCS1 family
dosa:Os02t0666700-01	Os02g0666700; Permease for cytosine/purines, uracil, thiamine, allantoin family protein.; K03457 nucleobase:cation symporter-1, NCS1 family
olu:OSTLU_24245	NCS1 family transporter: cytosine/purines/uracil/thiamine/allantoin
olu:OSTLU_49824	NCS1 family transporter: cytosine/purines/uracil/thiamine/allantoin; K03457 nucleobase:cation symporter-1, NCS1 family
ota:Ot03g03230	Permease for cytosine/purines, uracil, t (ISS)
fme:FOMMEDRAFT_108086	permease for cytosine/purines uracil thiamine allantoin
ecv:APECO1_2638	putative permease for cytosine/purines, uracil, thiamine, allantoin
ecx:EcHS_A0584	permease, cytosine/purines, uracil, thiamine, allantoin family; K10975 allantoin permease
ecw:EcE24377A_0547	permease, cytosine/purines, uracil, thiamine, allantoin family; K10975 allantoin permease
ebd:ECBD_4207	permease for cytosine/purines uracil thiamine allantoin
[...]

In [17]:
# Find shiga toxin genes only in Escherichia coli O111 H-11128 (EHEC)
print(kegg_find('tps', 'nucleotidase').read())


tps:THAPSDRAFT_31247	hypothetical protein; K12304 soluble calcium-activated nucleotidase 1 [EC:3.6.1.6]
tps:THAPSDRAFT_34077	hypothetical protein; K18550 IMP and pyridine-specific 5'-nucleotidase [EC:3.1.3.-]
tps:THAPSDRAFT_3837	probable 5'-nucleotidase
tps:THAPSDRAFT_263360	hypothetical protein; K01082 3'(2'), 5'-bisphosphate nucleotidase [EC:3.1.3.7]
tps:THAPSDRAFT_32319	imp-gmp specific 5' nucleotidase
tps:THAPSDRAFT_263722	bisphosphate nucleotidase (EC:3.1.3.7); K01082 3'(2'), 5'-bisphosphate nucleotidase [EC:3.1.3.7]

The API lets us query on useful properties, such as compound molecular weight:


In [18]:
# Compounds with molecular weight between 300 and 310g/mol
head(kegg_find('compound', '300-310/mol_weight').read())


cpd:C00051	307.32348
cpd:C00200	306.33696
cpd:C00219	304.46688
cpd:C00239	307.197122
cpd:C00270	309.26986
cpd:C00357	301.187702
cpd:C00365	308.181882
cpd:C00389	302.2357
cpd:C00732	308.37276
cpd:C00777	300.43512
[...]

1d. Retrieving specific entries with kegg_get()

The kegg_get() function lets us retrieve data from KEGG in a range of different formats.

Database entries can be retrieved directly as plain text; sequences may be returned as database entries, amino acid or nucleotide sequences in FASTA format; compounds as database entries, or images (.gif); pathways as database entries, KGML, or images (.png).

Compounds can be retrieved as their database entry in plain text, or as a .gif image.


In [19]:
# Compound as database entry
head(kegg_get("cpd:C00144").read())


ENTRY       C00144                      Compound
NAME        GMP;
            Guanosine 5'-phosphate;
            Guanosine monophosphate;
            Guanosine 5'-monophosphate;
            Guanylic acid
FORMULA     C10H14N5O8P
EXACT_MASS  363.058
MOL_WEIGHT  363.2206
REACTION    R00328 R00332 R00426 R00881 R01134 R01227 R01228 R01229 
[...]

In [20]:
# Compound as image
Image(kegg_get("cpd:C00144", "image").read())


Out[20]:

Gene data can be retrieved as the plain text database entry, or a FASTA nucleotide or protein sequence.


In [16]:
# Gene as database entry
head(kegg_get("ece:Z5100").read())


ENTRY       Z5100             CDS       T00044
NAME        espF
DEFINITION  hypothetical protein
ORTHOLOGY   K12786  LEE-encoded effector EspF
ORGANISM    ece  Escherichia coli O157:H7 EDL933 (EHEC)
PATHWAY     ece05130  Pathogenic Escherichia coli infection
MODULE      ece_M00542  EHEC/EPEC pathogenicity signature, T3SS and effectors
BRITE       KEGG Orthology (KO) [BR:ece00001]
             Human Diseases
              Infectious diseases
[...]

In [17]:
# Gene as amino acid sequence
print(kegg_get("ece:Z5100", "aaseq").read())


>ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (A)
MLNGISNAASTLGRQLVGIASRVSSAGGTGFSVAPQAVRLTPVKVHSPFSPGSSNVNART
IFNVSSQVTSFTPSRPAPPPPTSGQASGASRPLPPIAQALKEHLAAYEKSKGPEALGFKP
ARQAPPPPTSGQASGASRPLPPIAQALKEHLAAYEKSKGPEALGFKPARQAPPPPTSGQA
SGASRPLPPIAQALKEHLAAYEKSKGPEALGFKPARQAPPPPTGPSGLPPLAQALKDHLA
AYEQSKKG


In [18]:
# Gene as nucleotide sequence
print(kegg_get("ece:Z5100", "ntseq").read())


>ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (N)
atgcttaatggaattagtaacgctgcttctacactagggcggcagcttgtaggtatcgca
agtcgagtgagctctgcggggggaactggattttctgtagcccctcaggccgtgcgtctt
actccggtgaaagttcattcccctttttctccaggctcgtcgaatgttaatgcgagaacg
atttttaatgtgagcagccaggtgacttcatttactccctctcgtccggcaccgccgcca
ccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcatta
aaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagccc
gcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgaccttta
ccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcct
gaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggca
tccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcc
tatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccaccg
ccaccgacagggcctagtggactaccgccccttgcacaggcattaaaagatcatttagct
gcctatgagcaatcgaagaaagggtaa


In [19]:
# Parsing a returned sequence with SeqIO
seq = SeqIO.read(kegg_get("ece:Z5100", "ntseq"), 'fasta')
print seq.format('stockholm')


# STOCKHOLM 1.0
#=GF SQ 1
ece:Z5100 atgcttaatggaattagtaacgctgcttctacactagggcggcagcttgtaggtatcgcaagtcgagtgagctctgcggggggaactggattttctgtagcccctcaggccgtgcgtcttactccggtgaaagttcattcccctttttctccaggctcgtcgaatgttaatgcgagaacgatttttaatgtgagcagccaggtgacttcatttactccctctcgtccggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccaccgccaccgacagggcctagtggactaccgccccttgcacaggcattaaaagatcatttagctgcctatgagcaatcgaagaaagggtaa
#=GS ece:Z5100 AC ece:Z5100
#=GS ece:Z5100 DE ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (N)
//

Pathways can be returned as database entries in plain text, or in the KGML format, or as .png images. These last two formats are especially useful for generating custom pathway renderings with Bio.Graphics.KGML.


In [21]:
# Pathway as database entry
head(kegg_get("tps00230").read())


ENTRY       tps00230                    Pathway
NAME        Purine metabolism - Thalassiosira pseudonana
CLASS       Metabolism; Nucleotide metabolism
PATHWAY_MAP tps00230  Purine metabolism
MODULE      tps_M00005  PRPP biosynthesis, ribose 5P => PRPP [PATH:tps00230]
            tps_M00049  Adenine ribonucleotide biosynthesis, IMP => ADP,ATP [PATH:tps00230]
            tps_M00050  Guanine ribonucleotide biosynthesis IMP => GDP,GTP [PATH:tps00230]
DBLINKS     BSID: 93916
            GO: 0042278
ORGANISM    Thalassiosira pseudonana [GN:tps]
[...]

In [22]:
# Pathway as image (png)
Image(kegg_get("tps00230", "image").read())


Out[22]:

In [23]:
# Pathway as KGML
head(kegg_get("tps00230", "kgml").read())


<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Sep 4, 2015 13:53:44 +0900 (GMT+9) -->
<pathway name="path:tps00230" org="tps" number="00230"
         title="Purine metabolism"
         image="http://www.kegg.jp/kegg/pathway/tps/tps00230.png"
         link="http://www.kegg.jp/kegg-bin/show_pathway?tps00230">
    <entry id="137" name="tps:THAPSDRAFT_268807 tps:THAPSDRAFT_32555 tps:THAPSDRAFT_33670 tps:THAPSDRAFT_370" type="gene" reaction="rn:R02019"
        link="http://www.kegg.jp/dbget-bin/www_bget?tps:THAPSDRAFT_268807+tps:THAPSDRAFT_32555+tps:THAPSDRAFT_33670+tps:THAPSDRAFT_370">
        <graphics name="RIR1_2..." fgcolor="#000000" bgcolor="#BFFFBF"
[...]

2. Rendering pathways with Bio.KEGG and Bio.Graphics.KGML

2a. Retrieving renderings from KEGG

The easiest way to render a pathway is to retrieve an image file directly from KEGG with kegg_get(), as above.


In [26]:
# Render central metabolism
Image(kegg_get("map01100", "image").read())


Out[26]:

In [27]:
# Render fatty-acid biosynthesis
Image(kegg_get("map00230", "image").read())


Out[27]:

KEGG also provides renderings that are specific to an organism, indicating which elements of the pathway are present in their database for that organism.

For central metabolism, compounds and reactions are only coloured if they are present in the specified organism.


In [25]:
# Render E.coli K-12 MG1655 central metabolism
Image(kegg_get("eco01100", "image").read())


Out[25]:

With other pathways, the reactions and/or genes are coloured green when they are present in the organism.


In [26]:
# Render E.coli K-12 MG1655 fatty-acid biosynthesis
Image(kegg_get("eco00061", "image").read())


Out[26]:

Three types of reference pathway can also be rendered: ko, ec, and rn. These are shown with blue boxes.


In [28]:
# Render reference fatty-acid biosynthesis
Image(kegg_get("ko00061", "image").read())
#Image(kegg_get("ec00061", "image").read())
#Image(kegg_get("rn00061", "image").read())


Out[28]:

2b. Reproducing KEGG renderings from KGML

KEGG provides descriptions of their pathway maps in an exchange format called KGML (spec). This doesn't quite match the maps that can be downloaded through the public interface. If you're a subscriber to KEGG, then you have access to an alternative set of KGML+ files. The Biopython interface uses the public API, so we only work with KGML files.

To obtain a pathway's KGML file from KEGG, we can use the kegg_get() function:


In [29]:
# Get KGML for fatty-acid biosynthesis
ko_map = (kegg_get("ko00061", "kgml").read())    # KO version (KEGG orthologues)
eco_map = (kegg_get("eco00061", "kgml").read())  # E. coli version

Visualising the contents of the KGML file, it can be seen that two useful kinds of elements are described in the file: ortholog and compound entries. Reactions and other connections are usually not defined directly.


In [30]:
# View the contents of ko_map KGML
head(ko_map)


<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Aug 13, 2015 14:15:00 +0900 (GMT+9) -->
<pathway name="path:ko00061" org="ko" number="00061"
         title="Fatty acid biosynthesis"
         image="http://www.kegg.jp/kegg/pathway/ko/ko00061.png"
         link="http://www.kegg.jp/kegg-bin/show_pathway?ko00061">
    <entry id="18" name="ko:K02371" type="ortholog" reaction="rn:R04969"
        link="http://www.kegg.jp/dbget-bin/www_bget?K02371">
        <graphics name="K02371" fgcolor="#000000" bgcolor="#BFBFFF"
[...]

Typically, orthologs and compounds are the only items that can be manipulated on a KEGG pathway map. KGML does not describe the complete map, as provided in the downloaded image.

The base maps (such as, in this case, map00061) do not contain ortholog or compound features, and consequently there is no associated KGML file. Attempting to retrieve KGML data for these maps produces an error.


In [30]:
# Trying to retrieve base map KGML generates an error
#base_map = (kegg_get("map00061", "kgml").read())

The beautiful images generated by KEGG are rendered as .png files, which may be downloaded, and the features described in the accompanying KGML description overlaid, with custom formatting. This is what is currently done in the majority of cases, here.

First, we need to parse the KGML we get from KEGG into a Pathway object. This holds an internal representation of the pathway information, such as orthologs and compounds (and, as we'll see later, reactions and relations.

The Pathway object provides some useful functions for manipulation of elements in the KGML representations, and allows us to write out a new KGML file that preserves the changes we make. Using print on the Pathway object gives us some summary information.

KGML_parser will read a KGML file on disk, and the stream returned from a kegg_get() function, as below.


In [31]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
print(pathway)


Pathway: Fatty acid biosynthesis
KEGG ID: path:ko00061
Image file: http://www.kegg.jp/kegg/pathway/ko/ko00061.png
Organism: ko
Entries: 242
Entry types:
	ortholog: 183
	compound: 50
	map: 9

If we want to render this as an image, we need to use a KGMLCanvas object, from Bio.Graphics.KGML_vis. We do this by instantiating a KGMLCanvas object with our pathway.


In [32]:
canvas = KGMLCanvas(pathway)

Now that we have an image, we can render it. At the moment, there is only the capability to render to .pdf. To render to a file, call the draw() method of the canvas.


In [33]:
canvas.draw("fab_map.pdf")
PDF("fab_map.pdf")


Out[33]:

But there's something wrong with this picture. It shows only the elements defined in the KGML file. All the beautiful KEGG layout, including the lines indicating the flow of the biochemistry, is missing. That's because the KGML file doesn't contain any information about it, directly.

However, each Pathway object created from KGML file has an image attribute with the URL of its .png format representation. To use this as a background, all we need to do to render this is specify import_imagemap=True when creating the canvas, or set canvas.import_imagemap = True once the canvas is created. (It is also possible to specify an image file on disk as the background.)


In [34]:
canvas.import_imagemap = True
canvas.draw("fab_map_with_image.pdf")
PDF("fab_map_with_image.pdf")


Out[34]:

This could be encapsulated in a useful helper function:


In [35]:
def draw_kegg_map(map_id):
    """ Render a local PDF of a KEGG map with the passed map ID
    """
    # Get the background image first
    pathway = KGML_parser.read(kegg_get(map_id, "kgml"))
    canvas = KGMLCanvas(pathway, import_imagemap=True)
    img_filename = "%s.pdf" % map_id
    canvas.draw(img_filename)

2c. Customised renderings from KGML - Part 1

In the Bio.Graphics model of rendering KGML, we are superimposing the elements described in a KGML file over an associated background image. As a result, only the elements that are described in the KGML file (or any we introduce ourselves...) can be modified.

The current interface to identifying and retrieving pathway elements is a little obscure as of Biopython 1.65 - if you want to retrieve a particular element by ID or annotation - but it is possible to modify the graphical elements that are rendered, to produce customised maps from KGML data.

The modifications we can make are:

  • shape (graphics.type)
  • location ($x,y$ co-ordinates: graphics.x, graphics.y)
  • height (graphics.height)
  • width (graphics.width)
  • foreground colour (graphics.fgcolor)
  • background colour (graphics.bgcolor)
  • label (graphics.name)

but this still gives us a lot of flexibility in terms of visual annotation from a dataset. We can also choose to display, or not, elements of interest, and transparency effects are available.

A motivating example

As a motivating example, we're going to use the fatty-acid biosynthesis pathway (map00061) from earlier, and make the following modifications:

  1. colour the ortholog elements according to a random choice of colours
  2. change the size of each compound element to reflect its molecular mass

We saw the default rendering of the ko00061 pathway earlier:


In [36]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_with_image.pdf")
PDF("fab_map_with_image.pdf")


Out[36]:

2c.1 Changing ortholog background colours

We can get access to all the pathway ortholog elements via the pathway.orthologs attribute. This will list all ortholog elements in the pathway as Entry objects.


In [37]:
pathway.orthologs[:5]


Out[37]:
[<Bio.KEGG.KGML.KGML_pathway.Entry at 0x7480208>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x7480240>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x74802b0>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x7480320>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x7480390>]

In [38]:
print(pathway.orthologs[0])


Entry node ID: 18
Names: ko:K02371
Type: ortholog
Components: set([])
Reactions: rn:R04969
Graphics elements: 1 [<Bio.KEGG.KGML.KGML_pathway.Graphics object at 0x0000000007480278>]

Each Entry object has a graphics attribute that stores information about how the item is rendered, in a list of Graphics objects - this reflects the data structure in the KGML representation, which may have more than one graphical entity per entry in the pathway.


In [39]:
pathway.orthologs[0].graphics


Out[39]:
[<Bio.KEGG.KGML.KGML_pathway.Graphics at 0x7480278>]

Each Graphics object has attributes that may be modified directly:

name         Label for the graphics object
x            X-axis position of the object (int)
y            Y-axis position of the object (int)
coords       polyline co-ordinates, list of (int, int) tuples
type         object shape
width        object width (int)
height       object height (int)
fgcolor      object foreground color (hex RGB)
bgcolor      object background color (hex RGB)

There are also two read-only properties that are provided for convenience

bounds       returns the graphical bounds of the object
centre       returns the centre of the object as an (x, y) tuple

In [40]:
element = pathway.orthologs[0].graphics[0]
attrs = [element.name, element.x, element.y, element.coords, element.type, 
         element.width, element.height, element.fgcolor, element.bgcolor, 
         element.bounds, element.centre]
print '\n'.join([str(attr) for attr in attrs])


K02371
1037.0
606.0
None
rectangle
46.0
17.0
#000000
#BFBFFF
[(1014.0, 597.5), (1060.0, 614.5)]
(1037.0, 606.0)

To change the ortholog colours, it's easy enough to loop over the ortholog elements in the pathway, and modify their graphics.bgcolor attributes.

As KGML expects hex colours in its specification, we also provide hex colours here, using the helper function below:


In [41]:
# Helper function to convert colour as RGB tuple to hex string
def rgb_to_hex(rgb):
    rgb = tuple([int(255*val) for val in rgb])
    return '#' + ''.join([hex(val)[2:] for val in rgb]).upper()

For this example we're defining arbitrary colours, but some obvious uses might include the use of colour to represent some biologically-significant value, like transcript abundance, $\frac{dN}{dS}$, or flux through that step.


In [42]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))

# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
    for graphic in element.graphics:
        graphic.bgcolor = rgb_to_hex(color)

In [43]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_new_colours.pdf")
PDF("fab_map_new_colours.pdf")


Out[43]:

2c.2 Changing compound sizes

We can get access to the pathway compound elements via pathway.compounds, just as for orthologs, and to change sizes, we can loop over them, modifying both width and height.

As with the colour changes, the sizes used are arbitrary, but this kind of modification (maybe incorporating colour changes, too) could be used to represent biochemical measurements such as metabolite concentration.


In [44]:
# Change the sizes of compound elements
for size, element in zip(range(8, 8+len(pathway.compounds)), pathway.compounds):
    for graphic in element.graphics:
        graphic.width = size
        graphic.height = size

In [45]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_new_sizes.pdf")
PDF("fab_map_new_sizes.pdf")


Out[45]:

2d. Customised renderings from KGML - Part 2: central metabolism

The maps we've been considering so far have not been completely described by the publicly-available KGML file. This is true for the great majority of KEGG maps. But there are three other maps: ko01100, ko01110, and ko01120 for which this is not the case, and the KGML file provides a lot more information. Specifically, they also describe lines representing reactions that connect metabolite compounds. This gives us quite a lot of scope for modifying their presentation, but it also means that they're slower to render.


In [46]:
# The three KEGG maps with lines representing reactions.
maps = ['ko01100', 'ko01110', 'ko01120']
[draw_kegg_map(map) for map in maps]
print(kegg_get(maps).read())


ENTRY       ko01100                     Pathway
NAME        Metabolic pathways
PATHWAY_MAP ko01100  Metabolic pathways
DISEASE     H00617  Desmosterolosis
            H00689  Delayed sleep phase syndrome
            H00734  Lamellar ichthyosis (LI) and Non-bullous congenital ichthyosiform erythroderma (NBCIE)
            H00753  Urofacial syndrome
            H00761  SEMD, Pakistani type
            H00804  Multiple cutaneous and uterine leiomyomata
            H00824  Calcification of joints and arteries
            H00920  Exocrine pancreatic insufficiency, dyserythropoietic anemia, and calvarial hyperostosis
            H00946  Arts syndrome
            H01002  Generalized arterial calcification of infancy
///
ENTRY       ko01110                     Pathway
NAME        Biosynthesis of secondary metabolites
PATHWAY_MAP ko01110  Biosynthesis of secondary metabolites
///
ENTRY       ko01120                     Pathway
NAME        Microbial metabolism in diverse environments
PATHWAY_MAP ko01120  Microbial metabolism in diverse environments
///


In [47]:
PDF(maps[0]+'.pdf')


Out[47]:

In [48]:
PDF(maps[1]+'.pdf')


Out[48]:

In [49]:
PDF(maps[2]+'.pdf')


Out[49]:

Unlike the custom renderings that can be obtained from KEGG directly, with this module we now have full control over line width, element size, element colour, and labelling options.

To avoid noise in the image, it can be worth setting import_imagemap=False when making the canvas or drawing to a file for these maps.

First, let's set each reaction line to a random value between 1 and 10. This could be adapted to represent carbon flux measurements, for example.


In [48]:
# Use the bacterial diverse environments map
pathway = KGML_parser.read(kegg_get("ko01120", "kgml"))

# Change the widths of reaction entries elements
for element in pathway.orthologs:
    for graphic in element.graphics:
        graphic.width = random.randrange(1, 10, 1)

In [49]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw("bacteria_mod_widths.pdf")
PDF("bacteria_mod_widths.pdf")


Out[49]:

We can modify pathway colours in the same way as before, except that now we have to modify the foreground, not the background colour, through the attribute fgcolor. The foreground colour should be specified as a hex string.


In [50]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))

# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
    for graphic in element.graphics:
        graphic.fgcolor = rgb_to_hex(color)

In [51]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw("bacteria_mod_colour.pdf")
PDF("bacteria_mod_colour.pdf")


Out[51]:

In [53]: