03a - Building a Reproducible Document

Table of Contents

1. Python Imports/Startup
2. Biological Motivation
3. Load Sequence
4. Build BLAST Database
5. Run BLAST Query
6. Load BLAST Results
7. Query UniProt
8. Query KEGG

1. Python Imports/Startup

It can be very convenient to have all the `Python` library imports at the top of the notebook.

This is especially helpful when running the whole notebook with, e.g., Cell -> Run All or Kernel -> Restart & Run All from the menu bar: every library is imported first, so it is available throughout the document.



In [ ]:

    
# The line below allows the notebooks to show graphics inline
%pylab inline

import io                            # This lets us handle streaming data
import os                            # This lets us communicate with the operating system

import pandas as pd                  # This lets us use dataframes
import seaborn as sns                # This lets us draw pretty graphics

# Biopython is a widely-used library for bioinformatics
# tasks, and integrating with software
from Bio import SeqIO                # This lets us handle sequence data
from Bio.KEGG import REST            # This lets us connect to the KEGG databases

# The bioservices library allows connections to common
# online bioinformatics resources
from bioservices import UniProt      # This lets us connect to the UniProt databases

from IPython.display import Image    # This lets us display images (.png etc) from code

It can be useful here to create any output directories that will be used throughout the document.

The `os.makedirs()` function creates a new directory (including any missing parent directories), and the `exist_ok=True` option prevents the notebook from stopping with an error if the directory already exists.



In [ ]:

    
# Create a new directory for notebook output
OUTDIR = os.path.join("data", "reproducible", "output")
os.makedirs(OUTDIR, exist_ok=True)

It can be useful here to create helper functions that will be used throughout the document.

The `to_df()` function turns tab-separated text with no header row into a `pandas` dataframe.



In [ ]:

    
# A small function to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)

2. Biological Motivation

We are working on a project to improve bacterial throughput for biosynthesis, and have been provided with a nucleotide sequence of a gene of interest.

This gene is overrepresented in populations of bacteria that appear to be associated with enhanced metabolic function relevant to a biosynthetic output (lipid conversion to ethanol).

We want to find out more about the annotated function and literature associated with this gene, which appears to derive from *Proteus mirabilis*.

Our plan is to:

identify a homologue in a reference isolate of *P. mirabilis*
obtain the protein sequence/identifier for the homologue
get information about the molecular function of this protein from UniProt
get information about the metabolic function of this protein from KEGG
visualise some of the information about this gene/protein

3. Load Sequence

We first load the sequence from a local `FASTA` file, using the `Biopython` `SeqIO` library.
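
As a minimal sketch of this step, assuming the query is a single-record nucleotide `FASTA` file: `SeqIO.read()` returns one `SeqRecord` and raises a `ValueError` if the file holds no records or more than one. The file name and sequence below are hypothetical placeholders, written to a temporary directory only so that the example is self-contained; in the notebook you would pass the path of your own input file instead.

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical stand-in for the real input: a one-record FASTA file
fasta_text = ">example_gene hypothetical query sequence\nATGGCTAAAGGTGAACGTTAA\n"
tmpdir = tempfile.mkdtemp()
seqfile = os.path.join(tmpdir, "example.fasta")
with open(seqfile, "w") as handle:
    handle.write(fasta_text)

# SeqIO.read() expects exactly one record in the file
record = SeqIO.read(seqfile, "fasta")
print(record.id, len(record.seq))
```

If the input could contain several sequences, `SeqIO.parse()` returns an iterator over all records instead.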

4. Build `BLAST` Database

We now build a local `BLAST` database from the *P. mirabilis* reference proteins.
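
One way to do this from within the notebook is to call the BLAST+ `makeblastdb` program through `subprocess`. The sketch below assumes BLAST+ is installed and on the `PATH`; the input and output paths and the two helper functions are hypothetical names chosen for illustration, not part of the original notebook.

```python
import os
import shutil
import subprocess


def makeblastdb_command(infile, outstem):
    """Return the makeblastdb command (as a list) for a protein database."""
    return ["makeblastdb", "-in", infile, "-dbtype", "prot", "-out", outstem]


def run_if_available(cmd):
    """Run cmd if its executable is on the PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        print("{} not found on the PATH; command not run".format(cmd[0]))
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Hypothetical input (reference proteins) and output (database stem) locations
PROTEINS = os.path.join("data", "reproducible", "example_proteins.faa")
DBSTEM = os.path.join("data", "reproducible", "output", "example_db")

cmd = makeblastdb_command(PROTEINS, DBSTEM)
print(" ".join(cmd))
# run_if_available(cmd)   # uncomment once PROTEINS points at a real FASTA file
```

Building the command as a list, rather than a single string, avoids shell quoting problems if a path contains spaces.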

5. Run `BLAST` Query

We now query the wildtype sequence against our custom `BLAST` database from the *P. mirabilis* reference proteins.
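
A hedged sketch of the search itself: because the query is a nucleotide sequence and the database holds proteins, the appropriate BLAST+ program is `blastx`. The option `-outfmt 6` requests tab-separated output, which is convenient to load into a dataframe later. All file names here are hypothetical placeholders, and the search is only attempted when `blastx` is installed and the query file exists.

```python
import os
import shutil
import subprocess

# Hypothetical file locations
QUERY = os.path.join("data", "reproducible", "example_query.fasta")
DB = os.path.join("data", "reproducible", "output", "example_db")
OUTFILE = os.path.join("data", "reproducible", "output", "example_blastx.tab")

# -outfmt 6 gives one tab-separated line per hit
cmd = ["blastx", "-query", QUERY, "-db", DB, "-out", OUTFILE, "-outfmt", "6"]
print(" ".join(cmd))

# Only attempt the search if BLAST+ is installed and the query file exists
if shutil.which("blastx") is not None and os.path.isfile(QUERY):
    subprocess.run(cmd, check=True)
```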

6. Load `BLAST` Results

We now load the `BLASTX` results for inspection and visualisation, using `pandas`.
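
The sketch below assumes the search was run with the default tabular output (`-outfmt 6`), which has twelve columns and no header row, so the column names are supplied explicitly. The two result rows are made-up values used purely to illustrate the layout; they are not real search results.

```python
import io

import pandas as pd

# Column names for the default BLAST+ tabular (-outfmt 6) output
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up rows standing in for the contents of a real results file
example_output = (
    "query1\tsubjB\t45.0\t180\t99\t2\t10\t549\t5\t184\t2e-30\t120\n"
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-100\t400\n"
)

results = pd.read_csv(io.StringIO(example_output), sep="\t",
                      header=None, names=BLAST_COLUMNS)

# Take the hit with the highest bit score as the best match
best = results.sort_values("bitscore", ascending=False).iloc[0]
print(best["sseqid"])
```

To read a real results file, pass its path to `pd.read_csv()` in place of the `io.StringIO` object.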

7. Query `UniProt`

We now query the `UniProt` databases for information on our best match.

8. Query `KEGG`

We now query the `KEGG` databases for information on our best match.
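
As a rough sketch, the `Bio.KEGG.REST` module wraps the KEGG web API: `kegg_find()` searches a database by keyword and `kegg_get()` retrieves a full entry, each returning a handle whose text is obtained with `.read()`. The wrapper and parsing functions below are hypothetical helpers, and the calls that contact KEGG need network access, so they are not executed here.

```python
from Bio.KEGG import REST


def kegg_find_genes(query):
    """Return raw text from a keyword search of the KEGG genes database."""
    return REST.kegg_find("genes", query).read()


def kegg_get_entry(entry_id):
    """Return the raw flat-file text of a single KEGG entry."""
    return REST.kegg_get(entry_id).read()


def parse_kegg_table(text):
    """Split KEGG tab-separated text into (identifier, description) pairs."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            ident, _, description = line.partition("\t")
            rows.append((ident, description))
    return rows


# Usage (requires a network connection); the search term is a placeholder:
# hits = parse_kegg_table(kegg_find_genes("example search term"))
```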
