03a - Building a Reproducible Document

1. Python Imports/Startup

It can be very convenient to have all the `Python` library imports at the top of the notebook.

This is especially helpful when running the whole notebook with, e.g., Cell -> Run All or Kernel -> Restart & Run All from the menu bar: every library is then available throughout the document.


In [ ]:
# The line below allows the notebooks to show graphics inline
%pylab inline

import io                            # This lets us handle streaming data
import os                            # This lets us communicate with the operating system

import pandas as pd                  # This lets us use dataframes
import seaborn as sns                # This lets us draw pretty graphics

# Biopython is a widely-used library for bioinformatics
# tasks, and integrating with software
from Bio import SeqIO                # This lets us handle sequence data
from Bio.KEGG import REST            # This lets us connect to the KEGG databases

# The bioservices library allows connections to common
# online bioinformatics resources
from bioservices import UniProt      # This lets us connect to the UniProt databases

from IPython.display import Image    # This lets us display images (.png etc) from code

It can be useful here to create any output directories that will be used throughout the document.

The `os.makedirs()` function creates a new directory (including any missing parent directories), and the `exist_ok=True` option prevents the notebook from stopping with an error if the directory already exists.


In [ ]:
# Create a new directory for notebook output
OUTDIR = os.path.join("data", "reproducible", "output")
os.makedirs(OUTDIR, exist_ok=True)

It can be useful here to create helper functions that will be used throughout the document.

The `to_df()` function turns tabular text into a `pandas` dataframe.


In [ ]:
# A small function to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)

2. Biological Motivation

We are working on a project to improve bacterial throughput for biosynthesis, and have been provided with a nucleotide sequence of a gene of interest.

This gene is overrepresented in populations of bacteria that appear to be associated with enhanced metabolic function relevant to a biosynthetic output (lipid conversion to ethanol).

We want to find out more about the annotated function and literature associated with this gene, which appears to derive from *Proteus mirabilis*.

Our plan is to:

  1. identify a homologue in a reference isolate of *P. mirabilis*
  2. obtain the protein sequence/identifier for the homologue
  3. get information about the molecular function of this protein from UniProt
  4. get information about the metabolic function of this protein from KEGG
  5. visualise some of the information about this gene/protein

3. Load Sequence

We first load the sequence from a local `FASTA` file, using the `Biopython` `SeqIO` module.
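
As a minimal sketch of this step, the cell below reads a single-record `FASTA` file with `SeqIO.read()`. The helper name `load_single_fasta()`, the temporary file, and the short placeholder sequence are assumptions made for illustration; in the notebook the path would point at the real query file instead.

```python
import os
import tempfile

from Bio import SeqIO


def load_single_fasta(path):
    """Return the only record in a FASTA file.

    SeqIO.read() raises ValueError unless the file holds exactly one record.
    """
    return SeqIO.read(path, "fasta")


# Write a made-up placeholder record so that the example runs on its own
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.fasta")
    with open(demo_path, "w") as handle:
        handle.write(">demo_seq placeholder record\nATGGCTAAATAA\n")
    record = load_single_fasta(demo_path)

print(record.id, len(record.seq))  # demo_seq 12
```

`SeqIO.read()` suits a file that should contain one sequence; for multi-record files, `SeqIO.parse()` returns an iterator over records instead.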

4. Build BLAST Database

We now build a local `BLAST` protein database from the *P. mirabilis* reference proteins.
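
One way to sketch this step is to assemble a `makeblastdb` command and run it with `subprocess`. The helper names and the file paths below are hypothetical, and the example assumes that BLAST+ is installed; it only prints the command, so it runs even where `makeblastdb` is missing.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta_path, db_path, title="reference_proteins"):
    """Build the argument list for creating a protein BLAST database."""
    return [
        "makeblastdb",
        "-in", fasta_path,    # input FASTA file of reference proteins
        "-dbtype", "prot",    # build a protein database
        "-title", title,
        "-out", db_path,      # path prefix for the database files
    ]


def run_tool(cmd):
    """Run cmd if its executable is on PATH; return None if it is missing."""
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]} not found on PATH; command not run")
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Hypothetical paths, for illustration only
cmd = makeblastdb_cmd("reference_proteins.fasta", "reference_db")
print(" ".join(cmd))
# To build the database for real, call: run_tool(cmd)
```

Passing the command as a list, rather than a single shell string, avoids quoting problems when paths contain spaces.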

5. Run BLAST Query

We now query the wildtype sequence against our custom `BLAST` database, built from the *P. mirabilis* reference proteins.
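
Because the query is a nucleotide sequence and the database holds proteins, a `blastx` search is the natural fit. The sketch below assembles such a command with tabular output (`-outfmt 6`); the function names and all file paths are assumptions for illustration, and nothing is executed unless `run_blastx()` is called.

```python
import shutil
import subprocess


def blastx_cmd(query_path, db_path, out_path, evalue=None):
    """Build a blastx argument list that writes tabular (outfmt 6) results."""
    cmd = [
        "blastx",
        "-query", query_path,   # nucleotide query sequence
        "-db", db_path,         # protein database prefix
        "-out", out_path,       # results file
        "-outfmt", "6",         # tab-separated table, no header row
    ]
    if evalue is not None:
        cmd += ["-evalue", str(evalue)]
    return cmd


def run_blastx(cmd):
    """Run the search if blastx is installed; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)


# Hypothetical paths, for illustration only
cmd = blastx_cmd("query.fasta", "reference_db", "blastx_results.tab")
print(" ".join(cmd))
```

Tabular output is convenient here because it loads directly into a dataframe.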

6. Load BLAST Results

We now load the `BLASTX` results for inspection and visualisation, using `pandas`.
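
As a sketch of how this might look, assuming the search was written in the default tabular format (`-outfmt 6`), the twelve standard columns can be named while loading. The helper name `load_blast_tab()` and the two demo rows are made up for illustration and are not real search results.

```python
import io

import pandas as pd

# Column order of BLAST's default tabular output (-outfmt 6)
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def load_blast_tab(source):
    """Load headerless BLAST tabular output from a path or file-like object."""
    return pd.read_csv(source, sep="\t", header=None, names=BLAST_TAB_COLUMNS)


# Made-up rows, for illustration only
demo = (
    "query_1\tsubject_A\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-100\t400.0\n"
    "query_1\tsubject_B\t45.0\t180\t99\t2\t10\t550\t5\t185\t1e-20\t95.5\n"
)
hits = load_blast_tab(io.StringIO(demo))

# Take the hit with the highest bit score as the best match
best = hits.sort_values("bitscore", ascending=False).iloc[0]
print(best["sseqid"])  # subject_A
```

Naming the columns up front makes later filtering and plotting easier to read than working with integer column positions.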

7. Query UniProt

We now query the `UniProt` databases for information on our best match.
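
A hedged sketch of this query is shown below. It assumes a `bioservices`-style service object whose `search()` method returns tab-separated text with a header row; the accepted `frmt` value (`"tab"` in older releases, `"tsv"` in newer ones) and the valid column names depend on the installed `bioservices` version, so both are left as parameters. The helper name `uniprot_search_df()` is hypothetical.

```python
import io

import pandas as pd


def uniprot_search_df(service, query, frmt="tsv", columns=None):
    """Run a UniProt search and return the tabular result as a dataframe.

    service: an object with a bioservices-style search() method,
             e.g. bioservices.UniProt() (assumption: the result is
             tab-separated text whose first line is a header).
    """
    kwargs = {"frmt": frmt}
    if columns is not None:
        kwargs["columns"] = columns
    result = service.search(query, **kwargs)
    return pd.read_csv(io.StringIO(result), sep="\t")


# Usage in the notebook (needs network access):
#   service = UniProt()
#   df = uniprot_search_df(service, best_match_accession)
# where best_match_accession is a placeholder for the identifier
# taken from the BLAST results.
```

Passing the service in as an argument keeps the network connection separate from the parsing, which makes the helper easy to check offline.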

8. Query KEGG

We now query the `KEGG` databases for information on our best match.
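
One possible shape for this step, using the `REST` module imported at the top of the notebook together with the `to_df()` helper, is sketched below. The wrapper name `kegg_find_df()` and the placeholder text are assumptions; the wrapper needs network access, so the example only exercises the offline parsing.

```python
import io

import pandas as pd
from Bio.KEGG import REST


def to_df(result):
    """Return a dataframe, given headerless tab-separated text."""
    return pd.read_table(io.StringIO(result), header=None)


def kegg_find_df(database, query):
    """Search a KEGG database and return the hits as a dataframe.

    Needs network access: REST.kegg_find() contacts the KEGG REST service.
    """
    return to_df(REST.kegg_find(database, query).read())


# Offline illustration with made-up text in the two-column layout
# (identifier, description); these are not real KEGG entries.
placeholder = "org:GENE0001\tplaceholder description\n"
print(to_df(placeholder).shape)  # (1, 2)

# Usage in the notebook (needs network access), with a placeholder query:
#   hits = kegg_find_df("genes", best_match_identifier)
```

Individual entries can then be retrieved with `REST.kegg_get()`, which also returns a handle whose text is read with `.read()`.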