It is provided as a reference against which you can check your work from notebook 03a, and to give you ideas for how you can use the same techniques and methods in your own research.
All the libraries we need are imported at the top of the notebook. This is very helpful when running the notebook with, e.g. Cell -> Run All
or Kernel -> Restart & Run All
from the menu bar, as all the libraries are then available throughout the document.
In [1]:
# The line below allows the notebooks to show graphics inline
%pylab inline
import io # This lets us handle streaming data
import os # This lets us communicate with the operating system
import pandas as pd # This lets us use dataframes
import seaborn as sns # This lets us draw pretty graphics
# Biopython is a widely-used library for bioinformatics
# tasks, and integrating with software
from Bio import SeqIO # This lets us handle sequence data
from Bio.KEGG import REST # This lets us connect to the KEGG databases
# The bioservices library allows connections to common
# online bioinformatics resources
from bioservices import UniProt # This lets us connect to the UniProt databases
from IPython.display import Image # This lets us display images (.png etc) from code
The os.makedirs()
function allows us to create a new directory, and the exist_ok
option will prevent the notebook code from stopping and throwing an error if the directory already exists.
In [2]:
# Create a new directory for notebook output
OUTDIR = os.path.join("data", "reproducible", "output")
os.makedirs(OUTDIR, exist_ok=True)
The to_df() function will turn tabular data into a pandas dataframe.
In [3]:
# A small function to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)
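As a quick check of this helper, we could pass it a small tab-separated string; the input below is purely illustrative and not part of the analysis.
# Example only: convert a tiny tab-separated string to a dataframe
example = "gene1\t0.5\ngene2\t0.9"
to_df(example)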
Our plan is to load the candidate sequence, identify its closest match in the reference proteome using BLAST, and then query the UniProt and KEGG databases for information about its function.
In [4]:
# Define the path to the sequence file (in data/reproducible/sequences)
seqfile = os.path.join("data", "reproducible", "sequences", "lipase.fasta")
# Load the sequence data
wildtype = SeqIO.read(seqfile, 'fasta')
Once the sequence is loaded, we print it to the notebook to check it is correct.
In [5]:
# Print the 'wildtype' sequence
print(wildtype)
To see the sequence in FASTA
format, we can use the .format()
method:
In [6]:
# Print the 'wildtype' sequence as a FASTA formatted sequence
print(wildtype.format('fasta'))
print(len(wildtype))
BLAST Database
The %%bash magic at the top of the code cell makes the rest of the lines in the cell run as though they were entered at the command-line terminal. We build a BLAST protein database from the file data/reproducible/sequences/GCA_000069965.1_ASM6996v1_protein.faa using the makeblastdb command.
In [7]:
%%bash
# Drop out to bash to create the BLAST database
# Make a BLAST protein database
makeblastdb -in data/reproducible/sequences/GCA_000069965.1_ASM6996v1_protein.faa \
-dbtype prot \
-title reference_protein \
-out data/reproducible/output/reference_protein
BLAST Query
Again we use the %%bash magic at the top of the code cell. We BLASTX the nucleotide sequence query against the BLAST protein database in data/reproducible/output/reference_protein.
In [8]:
%%bash
# Drop out to bash to run the BLASTX query
blastx -query data/reproducible/sequences/lipase.fasta \
-db data/reproducible/output/reference_protein \
-outfmt 6 \
-out data/reproducible/output/wildtype_blastx_reference_protein.tab
BLAST Results
We use the pandas read_csv() function to load the tabular data that was created using the option -outfmt 6 in the blastx command.
In [9]:
# Define the path to the output file
blastfile = os.path.join("data", "reproducible", "output",
                         "wildtype_blastx_reference_protein.tab")
# Define column headers
# These are the default columns produced with -outfmt 6
headers = ['query', 'subject',
           'pc_identity', 'aln_length',
           'mismatches', 'gaps_opened',
           'query_start', 'query_end',
           'subject_start', 'subject_end',
           'e_value', 'bitscore']
# Load the BLAST output into a dataframe
results = pd.read_csv(blastfile, sep="\t", header=None)
results.columns = headers
We can now inspect the first few rows of the dataframe, to see what the best hits were in the reference genome:
In [10]:
# Show first few rows of the dataframe
results.head()
Out[10]:
There is a single best match, CAR42190.1, that aligns over the full length of the candidate sequence (861 nt / 3 = 287 residues) with 95% identity and 14 residue mismatches.
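These figures can be double-checked directly from the results dataframe; a minimal sketch, using the column names defined above:
# Sort the hits by bitscore and inspect the best one
best_hit = results.sort_values("bitscore", ascending=False).iloc[0]
print(best_hit["subject"], best_hit["pc_identity"], best_hit["aln_length"])
# The query is a nucleotide sequence, so divide its length by three
# to get the expected protein alignment length
print(len(wildtype) / 3)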
It is very likely that this protein shares the same molecular and metabolic function as our candidate sequence, so we proceed to query the UniProt
and KEGG
databases.
UniProt
We use the bioservices library to make a connection to the remote UniProt servers and make our query. We convert the result to a pandas dataframe, for ease of processing.
In [11]:
# Make a link to the UniProt webservice
service = UniProt()
# Make a new query string, and run a remote search for the UniProt entry
query = "CAR42190"
result = service.search(query)
# Inspect the result
print(result)
This result tells us that the UniProt
record for this sequence has the accession B4EVM3
(not the CAR42190
that we first thought), and that it is a putative lipase sequence that has been assigned the Enzyme Classification number EC 3.1.1.3
.
We will use this information when we query KEGG later.
In [12]:
# Get a dataframe from UniProt
uniprot_df = service.get_df(query)
# View the dataframe
uniprot_df.head()
Out[12]:
This dataframe gives us much more information (100 columns of data - we can see the headers with uniprot_df.columns
) than the initial result. We can see that there is a 3D structure for the protein, and that there is evidence for its existence at the protein level.
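To see which of those columns bear on a particular question, we could filter the column index by keyword; a minimal sketch (the keyword "structure" is just an example, and the exact column names depend on what UniProt returns):
# Count the columns and list any whose name mentions structure
print(len(uniprot_df.columns))
print([col for col in uniprot_df.columns if "structure" in col.lower()])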
In [13]:
# A function to plot the number of enzymes annotated with a given
# EC number, per genus, under a specified taxonomic classification.
def hist_enzymes_by_genus(ec=None, taxon=None):
    """Plots a histogram of enzymes with passed EC number, per
    genus, under the indicated taxon.
    """
    # Ask UniProt for all proteins with EC number under the passed
    # taxon
    query = "ec:%s taxonomy:%s" % (ec, taxon)
    uniprot_df = service.get_df(query)
    # How many sequences are returned?
    print("UniProt reports %i sequences (EC: %s, taxon: %s)" %
          (len(uniprot_df), ec, taxon))
    # Create a new column for the genus of the organism
    uniprot_df['genus'] = uniprot_df['Organism'].str.split().str.get(0)
    # Count the number of enzymes per genus
    genus_count = uniprot_df.groupby('genus')['Entry'].count()
    # Plot the number of enzymes per genus
    g = sns.barplot(genus_count.index, genus_count);
    g.set_xticklabels(labels=genus_count.index, rotation=45);
    g.set_title("EC: %s, taxon: %s" % (ec, taxon));
    g.set_ylabel("count")
    # Return the plot
    return g
In [14]:
# Call function to plot the number of enzymes in each genus
hist_enzymes_by_genus(ec="3.1.1.3", taxon="enterobacterales");
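If we want to keep a copy of this plot alongside the other notebook output, we could save the returned axes' figure into OUTDIR; a minimal sketch (the filename is our own choice):
# Re-draw the plot and save the figure to the output directory
g = hist_enzymes_by_genus(ec="3.1.1.3", taxon="enterobacterales")
g.get_figure().savefig(os.path.join(OUTDIR, "ec3_1_1_3_by_genus.png"),
                       bbox_inches="tight")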
KEGG
We use the Biopython library to make a connection to the remote KEGG servers and make our query. We use our to_df() function to turn the tabular query results from KEGG into a pandas dataframe, for ease of processing. The UniProt result tells us that the gene name is PMI0999, so we use this as our query. KEGG is composed of several databases, and we need to tell it which database we are searching in, as well as our query.
In [15]:
# Find KEGG GENES result for the query PMI0999
result = REST.kegg_find("genes", "PMI0999").read() # Do search at KEGG
to_df(result) # convert to a dataframe
Out[15]:
The first search tells us that KEGG
's internal ID for our query sequence is pmr:PMI0999
.
In [16]:
# Get the KEGG GENES entry for pmr:PMI0999
result = REST.kegg_get("pmr:PMI0999").read()
print(result)
From this entry we can see that:
- the gene product participates in glycerolipid metabolism [pmr00561] and central metabolism [pmr01100]
- there are cross-references to PFam
KEGG stores pathways in a separate database to genes, the PATHWAY database. We can use REST.kegg_get() to get the text description of a pathway, and also the KEGG database image for that pathway.
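The pathway identifiers mentioned in the entry can also be extracted from the flat-file text programmatically; a minimal sketch using a regular expression (it assumes the organism prefix pmr):
# Pull out all identifiers of the form pmr##### from the entry text
import re
print(sorted(set(re.findall(r"pmr\d{5}", result))))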
In [17]:
# Get the KEGG PATHWAY entry for our gene in plain text
pathway = REST.kegg_get("pmr00561").read()
print(pathway)
This query has given us the full text of the pathway entry, including the pathway name and the genes and compounds associated with it.
In [18]:
# Get the KEGG PATHWAY map image for pmr00561
pathway_img = REST.kegg_get("pmr00561", "image").read()
Image(pathway_img)
Out[18]:
This now renders the KEGG
pathway image for pmr00561
, with the enzymes that are present in our organism (Proteus mirabilis) highlighted in green.
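If we want to keep this rendered map with the rest of the notebook output, we could write the raw image bytes to a file in OUTDIR; a minimal sketch (the filename is our own choice):
# Save the pathway map image to the output directory
with open(os.path.join(OUTDIR, "pmr00561_pathway.png"), "wb") as ofh:
    ofh.write(pathway_img)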
In [19]:
# Get the central metabolism image for P. mirabilis
metabolism_img = REST.kegg_get("pmr01100", "image").read()
Image(metabolism_img)
Out[19]:
We can now share our notebook summarising what we know about our candidate sequence in a number of ways, using the File -> Download as
options in the menu bar. These are summarised below, with links to example output.
- Notebook (.ipynb): This will allow others to open the notebook and interact with it just as we are doing now, in Jupyter. (example.ipynb)
- Python (.py): This will extract the Python code from the notebook into a file that can be run as a plain Python script. (example.py)
- HTML (.html): This will render the notebook in HTML format, so it can be viewed, but not interacted with, in a web browser. (example.html)
- Markdown (.md): This will render the notebook in Markdown format, so it can be edited as a plaintext document, or converted to a new output format. (example.md)
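The same conversions can also be scripted with jupyter nbconvert, either from a terminal or in a %%bash cell, which is convenient when shared output needs to be regenerated automatically; a minimal sketch (the notebook filename below is a placeholder):
# Convert a notebook to HTML and to a plain Python script
jupyter nbconvert --to html notebook_03b.ipynb
jupyter nbconvert --to script notebook_03b.ipynb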