In this workshop you will use your Bash
skills to translate bacterial DNA into amino acid sequences and use this to discover the location of protein coding genes.
Before you begin this workshop you should have familiarised yourself with chapters 1-25 of the interactive guide.
At the end of this workshop you should be able to;
Run the Setup Code
In order for this notebook to work properly you need to run the cell below before doing anything else. This will load custom functions and settings required to make the self assessment exercises work.
If you restart your kernel you will also need to rerun the setup code
Don't use the
cd
command
The answers to all self assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd
command to answer an exercise.
In [1]:
wget -q https://www.dropbox.com/s/es7lr8sy4ubg3ug/translate.py?dl=0 -O translate.py
chmod u+x translate.py
wget -q https://www.dropbox.com/s/4irxxomz3aajtou/setup.sh?dl=0 -O setup.sh
source setup.sh
Imagine that you are part of a team of researchers working on the pathogenic bacteria, Burkholderia pseudomallei. B. pseudomallei is a soil dwelling bacterium and is the causal agent of melioidosis
You have sequenced a 20kb long fragment of the B. pseudomallei genome. Through a mutagenesis screen you know that this fragment contains a region that is implicated in eukaryotic cell invasion.
Your task is to map the locations of protein coding genes within this region and determine whether any of them overlap with the known mutant site.
Use the ls
command to list the files in your current directory. One of the files should be called genomic_sequence.fasta
. This file contains the 20kb long genome fragment in FASTA format.
FASTA is used for both nucleotide and amino acid sequences and is very popular due to its simplicity. A FASTA file consists of one or more sequence entries. Each sequence entry begins with a description line which begins with a >
. Following the description line comes the actual sequence. When another description line is encountered it denotes a separate entry.
Inspect the top few lines of the file with the head
command and verify that it is is in FASTA format.
head genomic_sequence.fasta
Since only the header lines contain the >
symbol we can confirm that there is just one entry in the file by searching for these symbols with the grep
command as follows;
grep '>' genomic_sequence.fasta
Except for the first line, all of the text in the genomic_sequence.fasta
file consists of DNA sequence. Your goal in this exercise is to determine the length of the sequence (ie how many nucleotide bases are represented).
The wc
command can be use to count lines of text. For example;
wc -l genomic_sequence.fasta
There are 80 bases per line and the last line contains 55 bases.
Your task: Determine the length of the genomic sequence in genomic_sequence.fasta
Use the echo
command to make your answer work with the autograder. For example let's say your answer is 45
you would enter it as;
echo "45"
In [2]:
e1_answer(){
### BEGIN SOLUTION
echo "20455"
### END SOLUTION
}
In [3]:
test_e1
A better tool for obtaining FASTA sequence lengths (and much more) is bioawk . This tool is beyond the scope of this tutorial but if you are curious you may wish to try it.
The general form of the bioawk command for determining sequence lengths is shown below. See if you can reformulate this command to give the answer to exercise 1.
bioawk -c fastx '{print length($seq)}' <input>.fasta
In [4]:
bioawk -c fastx '{print length($seq)}' genomic_sequence.fasta
In principle the amino acid sequences of proteins encoded by bacterial genes can be determined simply by translating directly from the genomic sequence. Identifying the starting sites of translation in these genes is a little more challenging and will be explored in more detail later.
Given the bacterial DNA sequence
GCTGCGTGACGTACG
We would translate in the first forward frame by first breaking the sequence into its codons.
GCT GCG TGA CGT ACG
And then using the standard genetic code to translate these to amino acids.
GCT GCG TGA CGT ACG
A A * R T
Note: A stop codon *
indicates termination of translation.
A DNA sequence can be translate in six possible ways depending on the direction and starting base of translation.
Forward
The example above shows translation based on the first forward reading frame. In this frame we read from left to right and start at the first position. If we were translating in the second forward reading frame we would obtain codons and translate as follows;
G CTG CGT GAC GTA CG
L R D V
Note that we discarded one base from the left. This also resulted in two discarded bases at the right but overall it resulted in a longer contiuous amino acid sequence because there was no stop codon.
The third forward reading frame is obtained by discarding one more base from the left.
GC TGC GTG ACG TAC G
C V T Y
What if we kept going? Since there are three bases in a codon, discarding three from the left is equivalent to the first forward frame.
GCT GCG TGA CGT ACG
A * R T
Reverse
Reverse frames correspond to translation on the opposite strand of DNA. The bases on the opposite strand will be complementary to the forward strand and will be read in reverse. We therefore translate based on the reverse complement. For our example sequence this is;
CGTACGTCACGCAGC
Translation in the first reverse frame is therefore
CGT ACG TCA CGC AGC
R T S R S
The second and third reverse frames can be derived on the complementary sequence in exactly the same way as the forward frames on the forward sequence.
The tutorial directory provides a program translate.py
that translates DNA in all 6 frames.
When working with large datasets it is often useful to run some test code on a small fraction of the data.
The head
command is useful here since it will extract the top n
lines of a dataset where n
can be any number we specify.
For example to extract the top 10
lines of genomic_sequence.fasta
we would do
head -n 10 genomic_sequence.fasta
Rather than creating a separate file with our test data we simply pipe it to other commands we want to run. For example to verify that head did indeed extract only 10
lines we could pipe to wc
.
head -n 10 genomic_sequence.fasta | wc -l
Your task: Translate the top 5 lines of genomic_sequence.fasta
using translate.py
In [5]:
head -n 2 genomic_sequence.fasta | translate.py
In [6]:
e2_answer(){
### BEGIN SOLUTION
head -n 5 genomic_sequence.fasta | translate.py
### END SOLUTION
}
In [7]:
test_e2
Familiarise yourself with translate.py
and its options by doing the following;
translate.py -h
head genomic_sequence.fasta | translate.py
head genomic_sequence.fasta | translate.py -f 2
head genomic_sequence.fasta | translate.py -F 'tabular'
head genomic_sequence.fasta | translate.py -F 'tabular' -s
Don't worry if you don't understand all of the outputs at this stage. We'll cover that later. For now just run the commands above and verify that the output roughly corresponds to the option you specified.
One of the ways we can distinguish real coding sequences from translated non-coding sequences is their length.
Since 3/64
codons in the standard code are stops we would expect a stop codon to occur around 1 in every 20 amino acids if the underlying DNA sequence were random. Based on this logic, the probability that a continuous sequence of 100 amino acids with no stop codons could be obtained from a random sequence of nucleotides is;
In other words, any sequences we obtain that are $100$ amino acids or longer are unlikely to be non-coding sequences.
Note: Since non-coding sequences are often functional (eg regulatory) they are unlikely to be random but to a first approximation we can assume that they are.
Use translate.py
to translate genomic_sequence.fasta
in the first frame
translate.py -f 1 genomic_sequence.fasta
The tr
command can be used to delete all characters other than a specified character. For example to output only A
's in the sequence we would do;
translate.py -f 1 genomic_sequence.fasta | tr -cd 'A'
The number of A
's can then be counted with
translate.py -f 1 genomic_sequence.fasta | tr -cd 'A' | wc -m
Your task: Write a command to output the number of stop codons in the frame 1 translation of genomic_sequence.fasta
Compare your answer with the expected number of stop codons if the sequence was random. What does this tell you about the gene content of our sequence in frame 1?
In [8]:
e3_answer(){
### BEGIN SOLUTION
translate.py -f 1 genomic_sequence.fasta | tr -cd '*' | wc -m
### END SOLUTION
}
In [9]:
test_e3
Modify the command pipeline you created for exercise 3 to count amino acids.
Remember that the header line >NC_006351_f_1
contains some letters that could be confused with amino acids (ie N
and C
.
You can deal with this by using the tail
command to remove the first line of output like this
translate.py -f 1 genomic_sequence.fasta | tail -n +2
Your task: Write code to output the number of cysteines in the frame 1 translation of genomic_sequence.fasta
Try your code on various other amino acids. Do they vary much in abundance?
In [10]:
e4_answer(){
### BEGIN SOLUTION
translate.py -f 1 genomic_sequence.fasta | tail -n +2 | tr -cd 'C' | wc -m
### END SOLUTION
}
In [11]:
test_e4
For the purposes of this tutorial we will define an open reading frame as the part of a reading frame that does not contain stop codons. Some definitions also require that an ORF begin with a start codon.
Our strategy for finding genes in the B. mallei genome will be to first look for long ORFs as these are unlikely to occur in non-coding regions.
So far we have used translate.py
in its default output mode which simply translates the genome into one long amino acid sequence, including stop codons. To make finding long orfs easier we will use the --split
option to divide up the translation into its constituent ORFs. Eg;
translate.py -f 1 --split genomic_sequence.fasta | head
To make things even easier we change the output format to tabular
. This is very useful because unix provides many ways to manipulate tabular files
translate.py -f 1 --split -F 'tabular' genomic_sequence.fasta | head
The tabular format consists of the following fields (columns)
parent_id start end strand orf_number orf_length orf_sequence
The translate.py
script usually works with one frame at a time but you can ask it to translate in all 6 frames by providing 0
for the frame
argument. ie
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta
Your Task: Write code to output the number of ORFs generated by translating genomic_sequence.fasta
in all 6 frames.
In [12]:
e5_answer(){
### BEGIN SOLUTION
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | wc -l
### END SOLUTION
}
In [13]:
test_e5
The sort
command is very useful when working with tabular files.
Have a look at the help for sort
sort --help
As you can see it has many options. For our purposes the most useful options are
-k, --key
-n, --numeric-sort
The k
option allows us to specify a column to sort on. For example we could sort according to the 3rd column with
sort -k 3
By default, sort
sorts alphabetically. Providing the n
option changes this behaviour to a numeric sort (which we often want). By combining k
and n
options we could do a numeric sort on the 3rd field
translate.py -f 1 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 3 | head
In this exercise we will finally obtain a list of candidate ORFs that may represent bacterial gene products.
Preparation:
tail
command to display only the longest of these ORFs-n
option to tail
to adjust the number of long ORFs displayed. 100
.grep
to search for methionines M
.wc -l
to produce a count.Your task: Write code to print the number of ORFs from a 6 frame translation of genomic_sequence.fasta
that have length greater than or equal to 100 AND contain at least one start codon M
.
In [14]:
e6_answer(){
### BEGIN SOLUTION
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 6 | tail -n 100 | grep 'M' | wc -l
### END SOLUTION
}
In [15]:
test_e6