Playing with the new Biopython KGML rendering module

Biopython 1.65 was released on 17th December 2014. One of the new additions to the libraries was the Bio.Graphics.KGML module, which can be used for customised renderings of KEGG metabolic pathways.

This notebook is an introduction to the Bio.KEGG and Bio.Graphics.KGML modules, providing examples of usage, including data retrieval, manipulation and output rendering.

0. Setting up

We need to import Biopython and parts of the KEGG and KGML modules, but first we'll check that a suitably recent version of Biopython is installed (1.65 or higher):


In [1]:
import Bio
Bio.__version__


Out[1]:
'1.65+'

We're going to be using some other Biopython modules, and working with images in the notebook, so we also prepare for this.


In [2]:
from Bio import SeqIO
from Bio.KEGG.REST import *
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
from Bio.Graphics.ColorSpiral import ColorSpiral

from IPython.display import Image, HTML

import random

# A bit of code that will help us display the PDF output
def PDF(filename):
    return HTML('<iframe src=%s width=700 height=350></iframe>' % filename)

# A bit of helper code to shorten long text
def head(text, lines=10):
    """ Print the first lines lines of the passed text.
    """
    print '\n'.join(text.split('\n')[:lines] + ['[...]'])

1. Retrieving data from KEGG with Bio.KEGG.REST

The KEGG database provides a RESTful interface to its data (see http://www.kegg.jp/kegg/rest/), which we can use via Biopython's Bio.KEGG.REST module.

The KEGG interface is not as well documented as some other resources (such as NCBI or Ensembl), and KEGG does not provide any usage guidelines. To avoid risking overloading the service, Biopython restricts us to three calls per second. Be warned also that the conditions of service include:

"This service should not be used for bulk data downloads".

1a. Obtaining information about a database with kegg_info()

We can obtain basic information about a particular database (e.g. "pathway", "kegg", "hsa", etc.) using the kegg_info() function. This returns a handle which can be read as plaintext, with some useful information about the resource.


In [3]:
# Kyoto Encyclopedia of Genes and Genomes
print(kegg_info("kegg").read())


kegg             Kyoto Encyclopedia of Genes and Genomes
kegg             Release 72.0+/12-21, Dec 14
                 Kanehisa Laboratories
                 pathway     338,948 entries
                 brite       119,262 entries
                 module      275,779 entries
                 disease       1,402 entries
                 drug         10,130 entries
                 environ         849 entries
                 orthology    18,334 entries
                 genome        3,515 entries
                 genes     15,319,849 entries
                 dgenes      654,883 entries
                 compound     17,343 entries
                 glycan       10,987 entries
                 reaction      9,775 entries
                 rpair        14,849 entries
                 rclass        2,945 entries
                 enzyme        6,415 entries


In [4]:
# KEGG Pathway Database
print(kegg_info("pathway").read())


pathway          KEGG Pathway Database
path             Release 72.0+/12-21, Dec 14
                 Kanehisa Laboratories
                 338,948 entries


In [5]:
# Escherichia coli K-12 MG1655 KEGG Genes Database
print(kegg_info("eco").read())


T00007           Escherichia coli K-12 MG1655 KEGG Genes Database
eco              Release 72.0+/12-20, Dec 14
                 Kanehisa Laboratories
                 4,497 entries

1b. Obtaining a list of entries in a database with kegg_list()

The kegg_list() function retrieves a handle, which provides a plaintext tab-separated key:value list of entry identifiers and definitions in the given database or set of database entries.

The kegg_list() function allows the organism to be specified for a database, with an optional second argument.


In [6]:
# List all pathways in the pathway database
head(kegg_list('pathway').read())


path:map00010	Glycolysis / Gluconeogenesis
path:map00020	Citrate cycle (TCA cycle)
path:map00030	Pentose phosphate pathway
path:map00040	Pentose and glucuronate interconversions
path:map00051	Fructose and mannose metabolism
path:map00052	Galactose metabolism
path:map00053	Ascorbate and aldarate metabolism
path:map00061	Fatty acid biosynthesis
path:map00062	Fatty acid elongation
path:map00071	Fatty acid degradation
[...]

In [7]:
# Only list pathways present in E. coli K-12 MG1655
head(kegg_list('pathway', 'eco').read())


path:eco00010	Glycolysis / Gluconeogenesis - Escherichia coli K-12 MG1655
path:eco00020	Citrate cycle (TCA cycle) - Escherichia coli K-12 MG1655
path:eco00030	Pentose phosphate pathway - Escherichia coli K-12 MG1655
path:eco00040	Pentose and glucuronate interconversions - Escherichia coli K-12 MG1655
path:eco00051	Fructose and mannose metabolism - Escherichia coli K-12 MG1655
path:eco00052	Galactose metabolism - Escherichia coli K-12 MG1655
path:eco00053	Ascorbate and aldarate metabolism - Escherichia coli K-12 MG1655
path:eco00061	Fatty acid biosynthesis - Escherichia coli K-12 MG1655
path:eco00071	Fatty acid degradation - Escherichia coli K-12 MG1655
path:eco00130	Ubiquinone and other terpenoid-quinone biosynthesis - Escherichia coli K-12 MG1655
[...]

If only the organism is specified, then a list of gene entries is returned.


In [8]:
# E. coli K-12 MG1655 genes
head(kegg_list('eco').read())


eco:b0001	thrL; thr operon leader peptide; K08278 thr operon leader peptide
eco:b0002	thrA; Bifunctional aspartokinase/homoserine dehydrogenase 1 (EC:1.1.1.3 2.7.2.4); K12524 bifunctional aspartokinase / homoserine dehydrogenase 1 [EC:2.7.2.4 1.1.1.3]
eco:b0003	thrB; homoserine kinase (EC:2.7.1.39); K00872 homoserine kinase [EC:2.7.1.39]
eco:b0004	thrC; L-threonine synthase (EC:4.2.3.1); K01733 threonine synthase [EC:4.2.3.1]
eco:b0005	yaaX; DUF2502 family putative periplasmic protein
eco:b0006	yaaA; peroxide resistance protein, lowers intracellular iron; K09861 hypothetical protein
eco:b0007	yaaJ; putative transporter; K03310 alanine or glycine:cation symporter, AGCS family
eco:b0008	talB; transaldolase B (EC:2.2.1.2); K00616 transaldolase [EC:2.2.1.2]
eco:b0009	mog; molybdochelatase incorporating molybdenum into molybdopterin; K03831 molybdopterin adenylyltransferase [EC:2.7.7.75]
eco:b0010	satP; succinate-acetate transporter; K07034
[...]

The KEGG API also allows the + operator to be used to combine list identifiers.


In [9]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list('C01290+G00092').read())


cpd:C01290	beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide; beta-D-Galactosyl-1,4-beta-D-glucosylceramide; Lactosylceramide; Gal-beta1->4Glc-beta1->1'Cer; LacCer; Lactosyl-N-acylsphingosine; D-Galactosyl-1,4-beta-D-glucosylceramide
gl:G00092	Lactosylceramide; CDw17; LacCer; (Gal)1 (Glc)1 (Cer)1; Glycolipid; Sphingolipid

But kegg_list() and other functions will take a list of identifiers, and combine them automatically:


In [10]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list(['C01290', 'G00092']).read())


cpd:C01290	beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide; beta-D-Galactosyl-1,4-beta-D-glucosylceramide; Lactosylceramide; Gal-beta1->4Glc-beta1->1'Cer; LacCer; Lactosyl-N-acylsphingosine; D-Galactosyl-1,4-beta-D-glucosylceramide
gl:G00092	Lactosylceramide; CDw17; LacCer; (Gal)1 (Glc)1 (Cer)1; Glycolipid; Sphingolipid

1c. Finding a specific entry with kegg_find()

The kegg_find(database, query) function finds entries with matching query keywords or other query data in a given database. Data is returned as tab-separated key:value plain text.


In [11]:
# Find shiga toxin genes
head(kegg_find('genes', 'shiga+toxin').read())


ece:Z1464	stx2A; shiga-like toxin II A subunit encoded by bacteriophage BP-933W; K11006 shiga toxin subunit A
ece:Z1465	stx2B; shiga-like toxin II B subunit encoded by bacteriophage BP-933W; K11007 shiga toxin subunit B
ece:Z3343	stx1B; shiga-like toxin 1 subunit B encoded within prophage CP-933V; K11007 shiga toxin subunit B
ece:Z3344	stx1A; shiga-like toxin 1 subunit A encoded within prophage CP-933V; K11006 shiga toxin subunit A
ecs:ECs1205	Shiga toxin 2 subunit A; K11006 shiga toxin subunit A
ecs:ECs1206	Shiga toxin 2 subunit B; K11007 shiga toxin subunit B
ecs:ECs2973	Shiga toxin I subunit B; K11007 shiga toxin subunit B
ecs:ECs2974	Shiga toxin I subunit A; K11006 shiga toxin subunit A
ecf:ECH74115_2905	shigatoxin 2, subunit B; K11007 shiga toxin subunit B
ecf:ECH74115_2906	shiga toxin subunit A (EC:3.2.2.22); K11006 shiga toxin subunit A
[...]

In [12]:
# Find shiga toxin genes only in Escherichia coli O111 H-11128 (EHEC)
print(kegg_find('eoi', 'shiga+toxin').read())


eoi:ECO111_2429	Shiga toxin 2 subunit B; K11007 shiga toxin subunit B
eoi:ECO111_2430	Shiga toxin 2 subunit A; K11006 shiga toxin subunit A
eoi:ECO111_3361	Shiga toxin 1 subunit A; K11006 shiga toxin subunit A
eoi:ECO111_3362	Shiga toxin 1 subunit B; K11007 shiga toxin subunit B

The API lets us query on useful properties, such as compound molecular weight:


In [13]:
# Compounds with molecular weight between 300 and 310g/mol
head(kegg_find('compound', '300-310/mol_weight').read())


cpd:C00051	307.32348
cpd:C00200	306.33696
cpd:C00219	304.46688
cpd:C00239	307.197122
cpd:C00270	309.26986
cpd:C00357	301.187702
cpd:C00365	308.181882
cpd:C00389	302.2357
cpd:C00732	308.37276
cpd:C00777	300.43512
[...]

1d. Retrieving specific entries with kegg_get()

The kegg_get() function lets us retrieve data from KEGG in a range of different formats.

Database entries can be retrieved directly as plain text; sequences may be returned as database entries, amino acid or nucleotide sequences in FASTA format; compounds as database entries, or images (.gif); pathways as database entries, KGML, or images (.png).

Compounds can be retrieved as their database entry in plain text, or as a .gif image.


In [14]:
# Compound as database entry
head(kegg_get("cpd:C01290").read())


ENTRY       C01290                      Compound
NAME        beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide;
            beta-D-Galactosyl-1,4-beta-D-glucosylceramide;
            Lactosylceramide;
            Gal-beta1->4Glc-beta1->1'Cer;
            LacCer;
            Lactosyl-N-acylsphingosine;
            D-Galactosyl-1,4-beta-D-glucosylceramide
FORMULA     C31H56NO13R
REMARK      Same as: G00092
[...]

In [15]:
# Compound as image
Image(kegg_get("cpd:C01290", "image").read())


Out[15]:

Gene data can be retrieved as the plain text database entry, or a FASTA nucleotide or protein sequence.


In [16]:
# Gene as database entry
head(kegg_get("ece:Z5100").read())


ENTRY       Z5100             CDS       T00044
NAME        espF
DEFINITION  hypothetical protein
ORTHOLOGY   K12786  LEE-encoded effector EspF
ORGANISM    ece  Escherichia coli O157:H7 EDL933 (EHEC)
PATHWAY     ece05130  Pathogenic Escherichia coli infection
MODULE      ece_M00542  EHEC/EPEC pathogenicity signature, T3SS and effectors
BRITE       KEGG Orthology (KO) [BR:ece00001]
             Human Diseases
              Infectious diseases
[...]

In [17]:
# Gene as amino acid sequence
print(kegg_get("ece:Z5100", "aaseq").read())


>ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (A)
MLNGISNAASTLGRQLVGIASRVSSAGGTGFSVAPQAVRLTPVKVHSPFSPGSSNVNART
IFNVSSQVTSFTPSRPAPPPPTSGQASGASRPLPPIAQALKEHLAAYEKSKGPEALGFKP
ARQAPPPPTSGQASGASRPLPPIAQALKEHLAAYEKSKGPEALGFKPARQAPPPPTSGQA
SGASRPLPPIAQALKEHLAAYEKSKGPEALGFKPARQAPPPPTGPSGLPPLAQALKDHLA
AYEQSKKG


In [18]:
# Gene as nucleotide sequence
print(kegg_get("ece:Z5100", "ntseq").read())


>ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (N)
atgcttaatggaattagtaacgctgcttctacactagggcggcagcttgtaggtatcgca
agtcgagtgagctctgcggggggaactggattttctgtagcccctcaggccgtgcgtctt
actccggtgaaagttcattcccctttttctccaggctcgtcgaatgttaatgcgagaacg
atttttaatgtgagcagccaggtgacttcatttactccctctcgtccggcaccgccgcca
ccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcatta
aaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagccc
gcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgaccttta
ccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcct
gaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggca
tccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcc
tatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccaccg
ccaccgacagggcctagtggactaccgccccttgcacaggcattaaaagatcatttagct
gcctatgagcaatcgaagaaagggtaa


In [19]:
# Parsing a returned sequence with SeqIO
seq = SeqIO.read(kegg_get("ece:Z5100", "ntseq"), 'fasta')
print seq.format('stockholm')


# STOCKHOLM 1.0
#=GF SQ 1
ece:Z5100 atgcttaatggaattagtaacgctgcttctacactagggcggcagcttgtaggtatcgcaagtcgagtgagctctgcggggggaactggattttctgtagcccctcaggccgtgcgtcttactccggtgaaagttcattcccctttttctccaggctcgtcgaatgttaatgcgagaacgatttttaatgtgagcagccaggtgacttcatttactccctctcgtccggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccaccgccaccgacagggcctagtggactaccgccccttgcacaggcattaaaagatcatttagctgcctatgagcaatcgaagaaagggtaa
#=GS ece:Z5100 AC ece:Z5100
#=GS ece:Z5100 DE ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (N)
//

Pathways can be returned as database entries in plain text, or in the KGML format, or as .png images. These last two formats are especially useful for generating custom pathway renderings with Bio.Graphics.KGML.


In [20]:
# Pathway as database entry
head(kegg_get("hsa05130").read())


ENTRY       hsa05130                    Pathway
NAME        Pathogenic Escherichia coli infection - Homo sapiens (human)
DESCRIPTION Enteropathogenic E. coli (EPEC) and enterohemorrhagic E. coli (EHEC) are closely related pathogenic strains of Escherichia coli. The hallmark of EPEC/EHEC infections [DS:H00278 H00277] is induction of attaching and effacing (A/E) lesions that damage intestinal epithelial cells. The capacity to form A/E lesions is encoded mainly by the locus of enterocyte effacement (LEE) pathogenicity island. Tir, Map, EspF, EspG are known LEE-encoded effector proteins secreted via the type III secretion system, which is also LEE-encoded, into the host cell. EPEC and EHEC Tir's link the extracellular bacterium to the cell cytoskeleton. Map and EspF are involved in mitochondrion membrane permeabilization. EspG interacts with tubulins and stimulates microtubule destabilization. LEE-encoded adhesin or intimin (Eae) is exported via the general secretory pathway to the periplasm, where it is inserted into the outer membrane. In addition to Tir, two potential host cell-carried intimin receptors, beta1 integrin (ITGB1) and nucleolin (NCL), have so far been identified. The distinguishing feature of EHEC is the elaboration of Shiga-like toxin (Stx). Stx cleaves ribosomal RNA, thereby disrupting protein synthesis and killing the intoxicated epithelial or endothelial cells.
CLASS       Human Diseases; Infectious diseases
PATHWAY_MAP hsa05130  Pathogenic Escherichia coli infection
DISEASE     H00277  Enterohemorrhagic Escherichia coli (EHEC) infection
            H00278  Enteropathogenic Escherichia coli (EPEC) infection
ORGANISM    Homo sapiens (human) [GN:hsa]
GENE        7100  TLR5; toll-like receptor 5 [KO:K10168]
            929  CD14; CD14 molecule [KO:K04391]
[...]

In [21]:
# Pathway as image (png)
Image(kegg_get("hsa05130", "image").read())


Out[21]:

In [22]:
# Pathway as KGML
head(kegg_get("hsa05130", "kgml").read())


<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: May 13, 2014 11:30:29 +0900 (GMT+09:00) -->
<pathway name="path:hsa05130" org="hsa" number="05130"
         title="Pathogenic Escherichia coli infection"
         image="http://www.kegg.jp/kegg/pathway/hsa/hsa05130.png"
         link="http://www.kegg.jp/kegg-bin/show_pathway?hsa05130">
    <entry id="1" name="path:hsa04810" type="map"
        link="http://www.kegg.jp/dbget-bin/www_bget?hsa04810">
        <graphics name="Regulation of actin cytoskeleton" fgcolor="#000000" bgcolor="#FFFFFF"
[...]

2. Rendering pathways with Bio.KEGG and Bio.Graphics.KGML

2a. Retrieving renderings from KEGG

The easiest way to render a pathway is to retrieve an image file directly from KEGG with kegg_get(), as above.


In [23]:
# Render central metabolism
Image(kegg_get("map01100", "image").read())


Out[23]:

In [24]:
# Render fatty-acid biosynthesis
Image(kegg_get("map00061", "image").read())


Out[24]:

KEGG also provides renderings that are specific to an organism, indicating which elements of the pathway are present in their database for that organism.

For central metabolism, compounds and reactions are only coloured if they are present in the specified organism.


In [25]:
# Render E.coli K-12 MG1655 central metabolism
Image(kegg_get("eco01100", "image").read())


Out[25]:

With other pathways, the reactions and/or genes are coloured green when they are present in the organism.


In [26]:
# Render E.coli K-12 MG1655 fatty-acid biosynthesis
Image(kegg_get("eco00061", "image").read())


Out[26]:

Three types of reference pathway can also be rendered: ko, ec, and rn. These are shown with blue boxes.


In [27]:
# Render reference fatty-acid biosynthesis
Image(kegg_get("ko00061", "image").read())
#Image(kegg_get("ec00061", "image").read())
#Image(kegg_get("rn00061", "image").read())


Out[27]:

2b. Reproducing KEGG renderings from KGML

KEGG provides descriptions of their pathway maps in an exchange format called KGML (spec). This doesn't quite match the maps that can be downloaded through the public interface. If you're a subscriber to KEGG, then you have access to an alternative set of KGML+ files. The Biopython interface uses the public API, so we only work with KGML files.

To obtain a pathway's KGML file from KEGG, we can use the kegg_get() function:


In [28]:
# Get KGML for fatty-acid biosynthesis
ko_map = (kegg_get("ko00061", "kgml").read())    # KO version (KEGG orthologues)
eco_map = (kegg_get("eco00061", "kgml").read())  # E. coli version

Visualising the contents of the KGML file, it can be seen that two useful kinds of elements are described in the file: ortholog and compound entries. Reactions and other connections are usually not defined directly.


In [29]:
# View the contents of ko_map KGML
head(ko_map)


<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: Aug 19, 2014 16:25:18 +0900 (GMT+09:00) -->
<pathway name="path:ko00061" org="ko" number="00061"
         title="Fatty acid biosynthesis"
         image="http://www.kegg.jp/kegg/pathway/ko/ko00061.png"
         link="http://www.kegg.jp/kegg-bin/show_pathway?ko00061">
    <entry id="18" name="ko:K02371" type="ortholog" reaction="rn:R04969"
        link="http://www.kegg.jp/dbget-bin/www_bget?K02371">
        <graphics name="K02371" fgcolor="#000000" bgcolor="#BFBFFF"
[...]

Typically, orthologs and compounds are the only items that can be manipulated on a KEGG pathway map. KGML does not describe the complete map, as provided in the downloaded image.

The base maps (such as, in this case, map00061) do not contain ortholog or compound features, and consequently there is no associated KGML file. Attempting to retrieve KGML data for these maps produces an error.


In [30]:
# Trying to retrieve base map KGML generates an error
#base_map = (kegg_get("map00061", "kgml").read())

The beautiful images generated by KEGG are rendered as .png files, which may be downloaded, and the features described in the accompanying KGML description overlaid, with custom formatting. This is what is currently done in the majority of cases, here.

First, we need to parse the KGML we get from KEGG into a Pathway object. This holds an internal representation of the pathway information, such as orthologs and compounds (and, as we'll see later, reactions and relations.

The Pathway object provides some useful functions for manipulation of elements in the KGML representations, and allows us to write out a new KGML file that preserves the changes we make. Using print on the Pathway object gives us some summary information.

KGML_parser will read a KGML file on disk, and the stream returned from a kegg_get() function, as below.


In [31]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
print(pathway)


Pathway: Fatty acid biosynthesis
KEGG ID: path:ko00061
Image file: http://www.kegg.jp/kegg/pathway/ko/ko00061.png
Organism: ko
Entries: 242
Entry types:
	ortholog: 183
	compound: 50
	map: 9

If we want to render this as an image, we need to use a KGMLCanvas object, from Bio.Graphics.KGML_vis. We do this by instantiating a KGMLCanvas object with our pathway.


In [32]:
canvas = KGMLCanvas(pathway)

Now that we have an image, we can render it. At the moment, there is only the capability to render to .pdf. To render to a file, call the draw() method of the canvas.


In [33]:
canvas.draw("fab_map.pdf")
PDF("fab_map.pdf")


Out[33]:

But there's something wrong with this picture. It shows only the elements defined in the KGML file. All the beautiful KEGG layout, including the lines indicating the flow of the biochemistry, is missing. That's because the KGML file doesn't contain any information about it, directly.

However, each Pathway object created from KGML file has an image attribute with the URL of its .png format representation. To use this as a background, all we need to do to render this is specify import_imagemap=True when creating the canvas, or set canvas.import_imagemap = True once the canvas is created. (It is also possible to specify an image file on disk as the background.)


In [34]:
canvas.import_imagemap = True
canvas.draw("fab_map_with_image.pdf")
PDF("fab_map_with_image.pdf")


Out[34]:

This could be encapsulated in a useful helper function:


In [35]:
def draw_kegg_map(map_id):
    """ Render a local PDF of a KEGG map with the passed map ID
    """
    # Get the background image first
    pathway = KGML_parser.read(kegg_get(map_id, "kgml"))
    canvas = KGMLCanvas(pathway, import_imagemap=True)
    img_filename = "%s.pdf" % map_id
    canvas.draw(img_filename)

2c. Customised renderings from KGML - Part 1

In the Bio.Graphics model of rendering KGML, we are superimposing the elements described in a KGML file over an associated background image. As a result, only the elements that are described in the KGML file (or any we introduce ourselves...) can be modified.

The current interface to identifying and retrieving pathway elements is a little obscure as of Biopython 1.65 - if you want to retrieve a particular element by ID or annotation - but it is possible to modify the graphical elements that are rendered, to produce customised maps from KGML data.

The modifications we can make are:

  • shape (graphics.type)
  • location ($x,y$ co-ordinates: graphics.x, graphics.y)
  • height (graphics.height)
  • width (graphics.width)
  • foreground colour (graphics.fgcolor)
  • background colour (graphics.bgcolor)
  • label (graphics.name)

but this still gives us a lot of flexibility in terms of visual annotation from a dataset. We can also choose to display, or not, elements of interest, and transparency effects are available.

A motivating example

As a motivating example, we're going to use the fatty-acid biosynthesis pathway (map00061) from earlier, and make the following modifications:

  1. colour the ortholog elements according to a random choice of colours
  2. change the size of each compound element to reflect its molecular mass

We saw the default rendering of the ko00061 pathway earlier:


In [36]:
pathway = KGML_parser.read(kegg_get("ko00061", "kgml"))
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_with_image.pdf")
PDF("fab_map_with_image.pdf")


Out[36]:

2c.1 Changing ortholog background colours

We can get access to all the pathway ortholog elements via the pathway.orthologs attribute. This will list all ortholog elements in the pathway as Entry objects.


In [37]:
pathway.orthologs[:5]


Out[37]:
[<Bio.KEGG.KGML.KGML_pathway.Entry at 0x105097210>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x105097850>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x1050978d0>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x105097950>,
 <Bio.KEGG.KGML.KGML_pathway.Entry at 0x1050979d0>]

In [38]:
print(pathway.orthologs[0])


Entry node ID: 18
Names: ko:K02371
Type: ortholog
Components: set([])
Reactions: rn:R04969
Graphics elements: 1 [<Bio.KEGG.KGML.KGML_pathway.Graphics object at 0x105097890>]

Each Entry object has a graphics attribute that stores information about how the item is rendered, in a list of Graphics objects - this reflects the data structure in the KGML representation, which may have more than one graphical entity per entry in the pathway.


In [39]:
pathway.orthologs[0].graphics


Out[39]:
[<Bio.KEGG.KGML.KGML_pathway.Graphics at 0x105097890>]

Each Graphics object has attributes that may be modified directly:

name         Label for the graphics object
x            X-axis position of the object (int)
y            Y-axis position of the object (int)
coords       polyline co-ordinates, list of (int, int) tuples
type         object shape
width        object width (int)
height       object height (int)
fgcolor      object foreground color (hex RGB)
bgcolor      object background color (hex RGB)

There are also two read-only properties that are provided for convenience

bounds       returns the graphical bounds of the object
centre       returns the centre of the object as an (x, y) tuple

In [40]:
element = pathway.orthologs[0].graphics[0]
attrs = [element.name, element.x, element.y, element.coords, element.type, 
         element.width, element.height, element.fgcolor, element.bgcolor, 
         element.bounds, element.centre]
print '\n'.join([str(attr) for attr in attrs])


K02371
1037.0
606.0
None
rectangle
46.0
17.0
#000000
#BFBFFF
[(1014.0, 597.5), (1060.0, 614.5)]
(1037.0, 606.0)

To change the ortholog colours, it's easy enough to loop over the ortholog elements in the pathway, and modify their graphics.bgcolor attributes.

As KGML expects hex colours in its specification, we also provide hex colours here, using the helper function below:


In [41]:
# Helper function to convert colour as RGB tuple to hex string
def rgb_to_hex(rgb):
    rgb = tuple([int(255*val) for val in rgb])
    return '#' + ''.join([hex(val)[2:] for val in rgb]).upper()

For this example we're defining arbitrary colours, but some obvious uses might include the use of colour to represent some biologically-significant value, like transcript abundance, $\frac{dN}{dS}$, or flux through that step.


In [42]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))

# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
    for graphic in element.graphics:
        graphic.bgcolor = rgb_to_hex(color)

In [43]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_new_colours.pdf")
PDF("fab_map_new_colours.pdf")


Out[43]:

2c.2 Changing compound sizes

We can get access to the pathway compound elements via pathway.compounds, just as for orthologs, and to change sizes, we can loop over them, modifying both width and height.

As with the colour changes, the sizes used are arbitrary, but this kind of modification (maybe incorporating colour changes, too) could be used to represent biochemical measurements such as metabolite concentration.


In [44]:
# Change the sizes of compound elements
for size, element in zip(range(8, 8+len(pathway.compounds)), pathway.compounds):
    for graphic in element.graphics:
        graphic.width = size
        graphic.height = size

In [45]:
canvas = KGMLCanvas(pathway, import_imagemap=True)
canvas.draw("fab_map_new_sizes.pdf")
PDF("fab_map_new_sizes.pdf")


Out[45]:

2d. Customised renderings from KGML - Part 2: central metabolism

The maps we've been considering so far have not been completely described by the publicly-available KGML file. This is true for the great majority of KEGG maps. But there are three other maps: ko01100, ko01110, and ko01120 for which this is not the case, and the KGML file provides a lot more information. Specifically, they also describe lines representing reactions that connect metabolite compounds. This gives us quite a lot of scope for modifying their presentation, but it also means that they're slower to render.


In [46]:
# The three KEGG maps with lines representing reactions.
maps = ['ko01100', 'ko01110', 'ko01120']
[draw_kegg_map(map) for map in maps]
print(kegg_get(maps).read())


ENTRY       ko01100                     Pathway
NAME        Metabolic pathways
PATHWAY_MAP ko01100  Metabolic pathways
DISEASE     H00617  Desmosterolosis
            H00689  Delayed sleep phase syndrome
            H00734  Lamellar ichthyosis (LI) and Non-bullous congenital ichthyosiform erythroderma (NBCIE)
            H00753  Urofacial syndrome
            H00761  SEMD, Pakistani type
            H00804  Multiple cutaneous and uterine leiomyomata
            H00824  Calcification of joints and arteries
            H00920  Exocrine pancreatic insufficiency, dyserythropoietic anemia, and calvarial hyperostosis
            H00946  Arts syndrome
            H01002  Generalized arterial calcification of infancy
///
ENTRY       ko01110                     Pathway
NAME        Biosynthesis of secondary metabolites
PATHWAY_MAP ko01110  Biosynthesis of secondary metabolites
///
ENTRY       ko01120                     Pathway
NAME        Microbial metabolism in diverse environments
PATHWAY_MAP ko01120  Microbial metabolism in diverse environments
///


In [47]:
PDF(maps[0]+'.pdf')


Out[47]:

In [48]:
PDF(maps[1]+'.pdf')


Out[48]:

In [49]:
PDF(maps[2]+'.pdf')


Out[49]:

Unlike the custom renderings that can be obtained from KEGG directly, with this module we now have full control over line width, element size, element colour, and labelling options.

To avoid noise in the image, it can be worth setting import_imagemap=False when making the canvas or drawing to a file for these maps.

First, let's set each reaction line to a random value between 1 and 10. This could be adapted to represent carbon flux measurements, for example.


In [50]:
# Use the bacterial diverse environments map
pathway = KGML_parser.read(kegg_get("ko01120", "kgml"))

# Change the widths of reaction entries elements
for element in pathway.orthologs:
    for graphic in element.graphics:
        graphic.width = random.randrange(1, 10, 1)

In [51]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw("bacteria_mod_widths.pdf")
PDF("bacteria_mod_widths.pdf")


Out[51]:

We can modify pathway colours in the same way as before, except that now we have to modify the foreground, not the background colour, through the attribute fgcolor. The foreground colour should be specified as a hex string.


In [52]:
# Define arbitrary colours
colorspiral = ColorSpiral()
colorlist = colorspiral.get_colors(len(pathway.orthologs))

# Change the colours of ortholog elements
for color, element in zip(colorlist, pathway.orthologs):
    for graphic in element.graphics:
        graphic.fgcolor = rgb_to_hex(color)

In [53]:
canvas = KGMLCanvas(pathway, import_imagemap=False)
canvas.draw("bacteria_mod_colour.pdf")
PDF("bacteria_mod_colour.pdf")


Out[53]:

In [53]: