It is provided as a reference against which you can check your work from notebook 03a, and which you can inspect for ideas on how to use the same techniques and methods in your own research.
This is very helpful when running the notebook with, e.g., Cell -> Run All or Kernel -> Restart & Run All from the menu bar: all the libraries are then available throughout the document.
In [1]:
# The line below allows the notebooks to show graphics inline
%pylab inline
import io # This lets us handle streaming data
import os # This lets us communicate with the operating system
import pandas as pd # This lets us use dataframes
import seaborn as sns # This lets us draw pretty graphics
# Biopython is a widely-used library for bioinformatics
# tasks, and integrating with software
from Bio import SeqIO # This lets us handle sequence data
from Bio.KEGG import REST # This lets us connect to the KEGG databases
# The bioservices library allows connections to common
# online bioinformatics resources
from bioservices import UniProt # This lets us connect to the UniProt databases
from IPython.display import Image # This lets us display images (.png etc) from code
The os.makedirs() function allows us to create a new directory, and the exist_ok option will prevent the notebook code from stopping and throwing an error if the directory already exists.
In [2]:
# Create a new directory for notebook output
OUTDIR = os.path.join("data", "reproducible", "output")
os.makedirs(OUTDIR, exist_ok=True)
The to_df() function will turn tabular text into a pandas dataframe.
In [3]:
# A small function to return a Pandas dataframe, given tabular text
def to_df(result):
    return pd.read_table(io.StringIO(result), header=None)
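As a quick sanity check of the helper, we can pass it a small made-up tab-separated string (purely illustrative data, not part of the analysis):
# Two rows, two columns of dummy tab-separated text
to_df("alpha\t1\nbeta\t2")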
Our plan is to:
- load the candidate (wildtype) lipase sequence from file
- build a BLAST protein database from the reference proteins, and BLASTX the candidate sequence against it
- query UniProt for functional information about the best match
- query KEGG for the corresponding gene and its metabolic pathways
In [4]:
# Define the path to the sequence file (in data/reproducible/sequences)
seqfile = os.path.join("data", "reproducible", "sequences", "lipase.fasta")
# Load the sequence data
wildtype = SeqIO.read(seqfile, 'fasta')
Once the sequence is loaded, we print it to the notebook to check it is correct.
In [5]:
# Print the 'wildtype' sequence
print(wildtype)
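The loaded record is a Biopython SeqRecord, so we can also pull out individual attributes if we only need part of the information (a small illustrative snippet):
# Inspect individual attributes of the SeqRecord
print(wildtype.id)           # sequence identifier
print(wildtype.description)  # full FASTA header line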
To see the sequence in FASTA format, we can use the .format() method:
In [6]:
# Print the 'wildtype' sequence as a FASTA formatted sequence
print(wildtype.format('fasta'))
print(len(wildtype))
BLAST Database
- We use %%bash at the top of the code cell.
- The %%bash command makes the rest of the lines in the cell work as though they were typed at the command-line terminal.
- We build a BLAST protein database from the file data/reproducible/sequences/GCA_000069965.1_ASM6996v1_protein.faa using the makeblastdb command.
In [7]:
%%bash
# Drop out to bash to create the BLAST database
# Make a BLAST protein database
makeblastdb -in data/reproducible/sequences/GCA_000069965.1_ASM6996v1_protein.faa \
-dbtype prot \
-title reference_protein \
-out data/reproducible/output/reference_protein
BLAST Query
- Again, we use %%bash at the top of the code cell.
- We BLASTX the nucleotide sequence query against the BLAST protein database in data/reproducible/output/reference_protein.
In [8]:
%%bash
# Drop out to bash to run the BLASTX query
blastx -query data/reproducible/sequences/lipase.fasta \
-db data/reproducible/output/reference_protein \
-outfmt 6 \
-out data/reproducible/output/wildtype_blastx_reference_protein.tab
BLAST Results
We use the pandas read_csv() function to load the tabular data that was created using the option -outfmt 6 in the blastx command.
In [9]:
# Define the path to the output file
blastfile = os.path.join("data", "reproducible", "output",
"wildtype_blastx_reference_protein.tab")
# Define column headers
# These are the default columns produced with -outfmt 6
headers = ['query', 'subject',
'pc_identity', 'aln_length',
'mismatches', 'gaps_opened',
'query_start', 'query_end',
'subject_start', 'subject_end',
'e_value', 'bitscore']
# Load the BLAST output into a dataframe
results = pd.read_csv(blastfile, sep="\t", header=None)
results.columns = headers
We can now inspect the first few rows of the dataframe, to see what the best hits were in the reference genome:
In [10]:
# Show first few rows of the dataframe
results.head()
Out[10]:
There is a single best match - CAR42190.1 - that aligns over the full length of the candidate sequence (861/3 = 287) with 95% identity and 14 residue mismatches.
It is very likely that this protein shares the same molecular and metabolic function as our candidate sequence, so we proceed to query the UniProt and KEGG databases.
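If we prefer to pull the best hit out programmatically rather than by eye, one approach (a minimal sketch, using the column headers defined above) is to sort the hits by bitscore and compare the alignment length with the expected protein length:
# Sort hits by bitscore, best first, and take the top row
best_hit = results.sort_values("bitscore", ascending=False).iloc[0]
print(best_hit["subject"], best_hit["pc_identity"], best_hit["aln_length"])
# The query is a nucleotide sequence, so the expected protein length
# is the nucleotide length divided by three
print("expected protein length:", len(wildtype) // 3)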
UniProt
- We use the bioservices library to make a connection to the remote UniProt servers, and make our query.
- The results can be returned as a pandas dataframe, for ease of processing.
In [11]:
# Make a link to the UniProt webservice
service = UniProt()
# Make a new query string, and run a remote search for the UniProt entry
query = "CAR42190"
result = service.search(query)
# Inspect the result
print(result)
This result tells us that the UniProt record for this sequence has the accession B4EVM3 (not the CAR42190 that we first thought), and that it is a putative lipase sequence that has been assigned the Enzyme Classification number EC 3.1.1.3.
We will use information from this record to query the KEGG databases later.
In [12]:
# Get a dataframe from UniProt
uniprot_df = service.get_df(query)
# View the dataframe
uniprot_df.head()
Out[12]:
This dataframe gives us much more information (100 columns of data - we can see the headers with uniprot_df.columns) than the initial result. We can see that there is a 3D structure for the protein, and that there is evidence for its existence at the protein level.
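To get a feel for what was returned, we can list the column headers and restrict the view to a few columns of interest. This is only a sketch: the exact column names depend on the versions of bioservices and the UniProt API, so they may need adjusting.
# How many columns did UniProt return, and what are the first few called?
print(len(uniprot_df.columns))
print(list(uniprot_df.columns)[:10])
# Restrict the view to two columns that are used later in this notebook;
# adjust the names if your UniProt output differs
uniprot_df[["Entry", "Organism"]].head()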
In [13]:
# A function to plot the number of enzymes annotated with a given
# EC number, per genus, under a specified taxonomic classification.
def hist_enzymes_by_genus(ec=None, taxon=None):
    """Plots a histogram of enzymes with passed EC number, per
    genus, under the indicated taxon.
    """
    # Ask UniProt for all proteins with EC number under the passed
    # taxon
    query = "ec:%s taxonomy:%s" % (ec, taxon)
    uniprot_df = service.get_df(query)
    # How many sequences are returned?
    print("UniProt reports %i sequences (EC: %s, taxon: %s)" %
          (len(uniprot_df), ec, taxon))
    # Create a new column for the genus of the organism
    uniprot_df['genus'] = uniprot_df['Organism'].str.split().str.get(0)
    # Count the number of enzymes per genus
    genus_count = uniprot_df.groupby('genus')['Entry'].count()
    # Plot the number of enzymes per genus
    g = sns.barplot(genus_count.index, genus_count)
    g.set_xticklabels(labels=genus_count.index, rotation=45)
    g.set_title("EC: %s, taxon: %s" % (ec, taxon))
    g.set_ylabel("count")
    # Return the plot
    return g
In [14]:
# Call function to plot the number of enzymes in each genus
hist_enzymes_by_genus(ec="3.1.1.3", taxon="enterobacterales");
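The function returns the seaborn axes, so assigning the call to a variable lets us save the figure alongside the other notebook outputs (a small sketch; the filename is arbitrary):
# Save the histogram into the output directory created earlier
g = hist_enzymes_by_genus(ec="3.1.1.3", taxon="enterobacterales")
g.get_figure().savefig(os.path.join(OUTDIR, "ec3_1_1_3_enterobacterales.png"),
                       bbox_inches="tight")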
KEGG
- We use the Biopython library to make a connection to the remote KEGG servers, and make our query.
- We use our to_df() function to turn the tabular query results from KEGG into a pandas dataframe, for ease of processing.
- The UniProt result tells us that the gene name is PMI0999, so we use this as our query.
- KEGG is composed of several databases, and we need to tell it which database we are searching in, as well as our query.
In [15]:
# Find KEGG GENES result for the query PMI0999
result = REST.kegg_find("genes", "PMI0999").read() # Do search at KEGG
to_df(result) # convert to a dataframe
Out[15]:
The first search tells us that KEGG's internal ID for our query sequence is pmr:PMI0999.
In [16]:
# Get the KEGG GENES entry for pmr:PMI0999
result = REST.kegg_get("pmr:PMI0999").read()
print(result)
From this entry we can see that:
- the gene product participates in glycerolipid metabolism [pmr00561] and central metabolism [pmr01100]
- the entry links out to external resources, including protein domain databases such as PFam

KEGG stores pathways in a separate database to genes, the PATHWAY database. We can use REST.kegg_get() to get the text description of a pathway, and also the KEGG database image for that pathway.
In [17]:
# Get the KEGG PATHWAY entry for our gene in plain text
pathway = REST.kegg_get("pmr00561").read()
print(pathway)
This query has given us the full text of the entry, including the pathway name and class, the genes and compounds involved, and cross-references to related KEGG entries.
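If we only need one or two of those fields, we can filter the flat-file text directly (a small sketch operating on the pathway string retrieved above):
# Print only the NAME and CLASS lines from the KEGG flat-file text
for line in pathway.splitlines():
    if line.startswith(("NAME", "CLASS")):
        print(line)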
In [18]:
# Get the KEGG PATHWAY map image for pmr00561
pathway_img = REST.kegg_get("pmr00561", "image").read()
Image(pathway_img)
Out[18]:
This now renders the KEGG pathway image for pmr00561, with the enzymes that are present in our organism (Proteus mirabilis) highlighted in green.
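Because the image is returned as raw data, we can also write it to the output directory for use outside the notebook (a sketch; the filename is our own choice):
# Save the pathway image bytes alongside the other notebook outputs
with open(os.path.join(OUTDIR, "pmr00561.png"), "wb") as ofh:
    ofh.write(pathway_img)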
In [19]:
# Get the central metabolism image for P. mirabilis
metabolism_img = REST.kegg_get("pmr01100", "image").read()
Image(metabolism_img)
Out[19]:
We can now share our notebook summarising what we know about our candidate sequence in a number of ways, using the File -> Download as options in the menu bar. These are summarised below, with links to example output.
- Notebook (.ipynb): This will allow others to open the notebook and interact with it just as we are doing now, in Jupyter. (example.ipynb)
- Python (.py): This will extract the Python code from the notebook into a file that can be run as a plain Python script. (example.py)
- HTML (.html): This will render the notebook in HTML format, so it can be viewed, but not interacted with, in a web browser. (example.html)
- Markdown (.md): This will render the notebook in Markdown format, so it can be edited as a plaintext document, or converted to a new output format. (example.md)
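The same conversions can also be scripted, which is useful for automating exports. For example, from a terminal (or a %%bash cell), jupyter nbconvert can produce the same formats (a sketch, assuming the notebook has been saved as example.ipynb):
# Convert a saved notebook to other formats with nbconvert
jupyter nbconvert --to html example.ipynb       # -> example.html
jupyter nbconvert --to script example.ipynb     # -> example.py
jupyter nbconvert --to markdown example.ipynb   # -> example.md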