It is provided as a reference against which you can check your work from notebook 03a, and to give you ideas for how you can use the same techniques and methods in your own research.
All the libraries we need are imported at the top of the notebook. This is very helpful when running the notebook with, e.g. Cell -> Run All
or Kernel -> Restart & Run All
from the menu bar, as all the libraries are then available throughout the document.
In [1]:
# The line below allows the notebooks to show graphics inline
%pylab inline
import io # This lets us handle streaming data
import os # This lets us communicate with the operating system
import pandas as pd # This lets us use dataframes
import seaborn as sns # This lets us draw pretty graphics
# Biopython is a widely-used library for bioinformatics
# tasks, and integrating with software
from Bio import SeqIO # This lets us handle sequence data
from Bio.KEGG import REST # This lets us connect to the KEGG databases
# The bioservices library allows connections to common
# online bioinformatics resources
from bioservices import UniProt # This lets us connect to the UniProt databases
from IPython.display import Image # This lets us display images (.png etc) from code
The os.makedirs()
function allows us to create a new directory, and the exist_ok
option will prevent the notebook code from stopping and throwing an error if the directory already exists.
In [2]:
# Create a new directory for notebook output
OUTDIR = os.path.join("data", "reproducible", "output")
os.makedirs(OUTDIR, exist_ok=True)
The to_df() function will turn tabular data into a pandas dataframe.
In [3]:
# A small function to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)
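As a quick check of this helper, we could pass it a small tab-separated string; the input below is purely illustrative and not part of the analysis.
# Example only: convert a tiny tab-separated string to a dataframe
example = "gene1\t0.5\ngene2\t0.9"
to_df(example)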
Our plan is to load the candidate sequence, identify its closest match in the reference proteome using BLAST, and then query the UniProt and KEGG databases for information about its function.
In [4]:
# Define the path to the sequence file (in data/reproducible/sequences)
seqfile = os.path.join("data", "reproducible", "sequences", "lipase.fasta")
# Load the sequence data
wildtype = SeqIO.read(seqfile, 'fasta')
Once the sequence is loaded, we print it to the notebook to check it is correct.
In [5]:
# Print the 'wildtype' sequence
print(wildtype)
To see the sequence in FASTA
format, we can use the .format()
method:
In [6]:
# Print the 'wildtype' sequence as a FASTA formatted sequence
print(wildtype.format('fasta'))
print(len(wildtype))
BLAST Database
The %%bash magic at the top of the code cell makes the rest of the lines in the cell run as though they were entered at the command-line terminal. We build a BLAST protein database from the file data/reproducible/sequences/GCA_000069965.1_ASM6996v1_protein.faa using the makeblastdb command.
In [7]:
%%bash
# Drop out to bash to create the BLAST database
# Make a BLAST protein database
makeblastdb -in data/reproducible/sequences/GCA_000069965.1_ASM6996v1_protein.faa \
-dbtype prot \
-title reference_protein \
-out data/reproducible/output/reference_protein
BLAST Query
Again we use the %%bash magic at the top of the code cell. We BLASTX the nucleotide sequence query against the BLAST protein database in data/reproducible/output/reference_protein.
In [8]:
%%bash
# Drop out to bash to run the BLASTX query
blastx -query data/reproducible/sequences/lipase.fasta \
-db data/reproducible/output/reference_protein \
-outfmt 6 \
-out data/reproducible/output/wildtype_blastx_reference_protein.tab
BLAST Results
We use the pandas read_csv() function to load the tabular data that was created using the option -outfmt 6 in the blastx command.
In [9]:
# Define the path to the output file
blastfile = os.path.join("data", "reproducible", "output",
                         "wildtype_blastx_reference_protein.tab")
# Define column headers
# These are the default columns produced with -outfmt 6
headers = ['query', 'subject',
           'pc_identity', 'aln_length',
           'mismatches', 'gaps_opened',
           'query_start', 'query_end',
           'subject_start', 'subject_end',
           'e_value', 'bitscore']
# Load the BLAST output into a dataframe
results = pd.read_csv(blastfile, sep="\t", header=None)
results.columns = headers
We can now inspect the first few rows of the dataframe, to see what the best hits were in the reference genome:
In [10]:
# Show first few rows of the dataframe
results.head()
Out[10]:
There is a single best match, CAR42190.1, that aligns over the full length of the candidate sequence (861 nt / 3 = 287 residues) with 95% identity and 14 residue mismatches.
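These figures can be double-checked directly from the results dataframe; a minimal sketch, using the column names defined above:
# Sort the hits by bitscore and inspect the best one
best_hit = results.sort_values("bitscore", ascending=False).iloc[0]
print(best_hit["subject"], best_hit["pc_identity"], best_hit["aln_length"])
# The query is a nucleotide sequence, so divide its length by three
# to get the expected protein alignment length
print(len(wildtype) / 3)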
It is very likely that this protein shares the same molecular and metabolic function as our candidate sequence, so we proceed to query the UniProt
and KEGG
databases.
UniProt
We use the bioservices library to make a connection to the remote UniProt servers and make our query. We convert the result to a pandas dataframe, for ease of processing.
In [11]:
# Make a link to the UniProt webservice
service = UniProt()
# Make a new query string, and run a remote search for the UniProt entry
query = "CAR42190"
result = service.search(query)
# Inspect the result
print(result)
This result tells us that the UniProt
record for this sequence has the accession B4EVM3
(not the CAR42190
that we first thought), and that it is a putative lipase sequence that has been assigned the Enzyme Classification number EC 3.1.1.3
.
We will use this information when we query KEGG later.
In [12]:
# Get a dataframe from UniProt
uniprot_df = service.get_df(query)
# View the dataframe
uniprot_df.head()
Out[12]:
This dataframe gives us much more information (100 columns of data - we can see the headers with uniprot_df.columns
) than the initial result. We can see that there is a 3D structure for the protein, and that there is evidence for its existence at the protein level.
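To see which of those columns bear on a particular question, we could filter the column index by keyword; a minimal sketch (the keyword "structure" is just an example, and the exact column names depend on what UniProt returns):
# Count the columns and list any whose name mentions structure
print(len(uniprot_df.columns))
print([col for col in uniprot_df.columns if "structure" in col.lower()])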
In [13]:
# A function to plot the number of enzymes annotated with a given
# EC number, per genus, under a specified taxonomic classification.
def hist_enzymes_by_genus(ec=None, taxon=None):
    """Plots a histogram of enzymes with passed EC number, per
    genus, under the indicated taxon.
    """
    # Ask UniProt for all proteins with EC number under the passed
    # taxon
    query = "ec:%s taxonomy:%s" % (ec, taxon)
    uniprot_df = service.get_df(query)
    # How many sequences are returned?
    print("UniProt reports %i sequences (EC: %s, taxon: %s)" %
          (len(uniprot_df), ec, taxon))
    # Create a new column for the genus of the organism
    uniprot_df['genus'] = uniprot_df['Organism'].str.split().str.get(0)
    # Count the number of enzymes per genus
    genus_count = uniprot_df.groupby('genus')['Entry'].count()
    # Plot the number of enzymes per genus
    g = sns.barplot(genus_count.index, genus_count);
    g.set_xticklabels(labels=genus_count.index, rotation=45);
    g.set_title("EC: %s, taxon: %s" % (ec, taxon));
    g.set_ylabel("count")
    # Return the plot
    return g
In [14]:
# Call function to plot the number of enzymes in each genus
hist_enzymes_by_genus(ec="3.1.1.3", taxon="enterobacterales");
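If we want to keep a copy of this plot alongside the other notebook output, we could save the returned axes' figure into OUTDIR; a minimal sketch (the filename is our own choice):
# Re-draw the plot and save the figure to the output directory
g = hist_enzymes_by_genus(ec="3.1.1.3", taxon="enterobacterales")
g.get_figure().savefig(os.path.join(OUTDIR, "ec3_1_1_3_by_genus.png"),
                       bbox_inches="tight")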
KEGG
We use the Biopython library to make a connection to the remote KEGG servers and make our query. We use our to_df() function to turn the tabular query results from KEGG into a pandas dataframe, for ease of processing. The UniProt result tells us that the gene name is PMI0999, so we use this as our query. KEGG is composed of several databases, and we need to tell it which database we are searching in, as well as our query.
In [15]:
# Find KEGG GENES result for the query PMI0999
result = REST.kegg_find("genes", "PMI0999").read() # Do search at KEGG
to_df(result) # convert to a dataframe
Out[15]:
The first search tells us that KEGG
's internal ID for our query sequence is pmr:PMI0999
.
In [16]:
# Get the KEGG GENES entry for pmr:PMI0999
result = REST.kegg_get("pmr:PMI0999").read()
print(result)
From this entry we can see that:
- the gene product participates in glycerolipid metabolism [pmr00561] and central metabolism [pmr01100]
- there are cross-references to PFam
KEGG stores pathways in a separate database to genes, the PATHWAY database. We can use REST.kegg_get() to get the text description of a pathway, and also the KEGG database image for that pathway.
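The pathway identifiers mentioned in the entry can also be extracted from the flat-file text programmatically; a minimal sketch using a regular expression (it assumes the organism prefix pmr):
# Pull out all identifiers of the form pmr##### from the entry text
import re
print(sorted(set(re.findall(r"pmr\d{5}", result))))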
In [17]:
# Get the KEGG PATHWAY entry for our gene in plain text
pathway = REST.kegg_get("pmr00561").read()
print(pathway)
This query has given us the full text of the pathway entry, including the pathway name and the genes and compounds associated with it.
In [18]:
# Get the KEGG PATHWAY map image for pmr00561
pathway_img = REST.kegg_get("pmr00561", "image").read()
Image(pathway_img)
Out[18]:
This now renders the KEGG
pathway image for pmr00561
, with the enzymes that are present in our organism (Proteus mirabilis) highlighted in green.
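If we want to keep this rendered map with the rest of the notebook output, we could write the raw image bytes to a file in OUTDIR; a minimal sketch (the filename is our own choice):
# Save the pathway map image to the output directory
with open(os.path.join(OUTDIR, "pmr00561_pathway.png"), "wb") as ofh:
    ofh.write(pathway_img)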
In [19]:
# Get the central metabolism image for P. mirabilis
metabolism_img = REST.kegg_get("pmr01100", "image").read()
Image(metabolism_img)
Out[19]:
We can now share our notebook summarising what we know about our candidate sequence in a number of ways, using the File -> Download as
options in the menu bar. These are summarised below, with links to example output.
- Notebook (.ipynb): This will allow others to open the notebook and interact with it just as we are doing now, in Jupyter. (example.ipynb)
- Python (.py): This will extract the Python code from the notebook into a file that can be run as a plain Python script. (example.py)
- HTML (.html): This will render the notebook in HTML format, so it can be viewed, but not interacted with, in a web browser. (example.html)
- Markdown (.md): This will render the notebook in Markdown format, so it can be edited as a plaintext document, or converted to a new output format. (example.md)
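The same conversions can also be scripted with jupyter nbconvert, either from a terminal or in a %%bash cell, which is convenient when shared output needs to be regenerated automatically; a minimal sketch (the notebook filename below is a placeholder):
# Convert a notebook to HTML and to a plain Python script
jupyter nbconvert --to html notebook_03b.ipynb
jupyter nbconvert --to script notebook_03b.ipynb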