Workshop 3 - Finding bacterial genes

In this workshop you will use your Bash skills to translate bacterial DNA into amino acid sequences and use this to discover the location of protein coding genes.

Before you begin this workshop you should have familiarised yourself with chapters 1-25 of the interactive guide.

At the end of this workshop you should be able to;

  1. Recognise and understand FASTA formatted data

IMPORTANT

Run the Setup Code

In order for this notebook to work properly you need to run the cell below before doing anything else. This will load custom functions and settings required to make the self assessment exercises work.

If you restart your kernel you will also need to rerun the setup code

Don't use the cd command

The answers to all self assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd command to answer an exercise.


In [1]:
wget -q https://www.dropbox.com/s/es7lr8sy4ubg3ug/translate.py?dl=0 -O translate.py
chmod u+x translate.py

wget -q https://www.dropbox.com/s/4irxxomz3aajtou/setup.sh?dl=0 -O setup.sh
source setup.sh


Setup Done

Background

Imagine that you are part of a team of researchers working on the pathogenic bacteria, Burkholderia pseudomallei. B. pseudomallei is a soil dwelling bacterium and is the causal agent of melioidosis

You have sequenced a 20kb long fragment of the B. pseudomallei genome. Through a mutagenesis screen you know that this fragment contains a region that is implicated in eukaryotic cell invasion.

Your task is to map the locations of protein coding genes within this region and determine whether any of them overlap with the known mutant site.

FASTA format

Use the ls command to list the files in your current directory. One of the files should be called genomic_sequence.fasta. This file contains the 20kb long genome fragment in FASTA format.

FASTA is used for both nucleotide and amino acid sequences and is very popular due to its simplicity. A FASTA file consists of one or more sequence entries. Each sequence entry begins with a description line which begins with a >. Following the description line comes the actual sequence. When another description line is encountered it denotes a separate entry.

Inspect the top few lines of the file with the head command and verify that it is is in FASTA format.

head genomic_sequence.fasta

Since only the header lines contain the > symbol we can confirm that there is just one entry in the file by searching for these symbols with the grep command as follows;

grep '>' genomic_sequence.fasta

Exercise 1

Except for the first line, all of the text in the genomic_sequence.fasta file consists of DNA sequence. Your goal in this exercise is to determine the length of the sequence (ie how many nucleotide bases are represented).

The wc command can be use to count lines of text. For example;

wc -l genomic_sequence.fasta

There are 80 bases per line and the last line contains 55 bases.

Your task: Determine the length of the genomic sequence in genomic_sequence.fasta

Use the echo command to make your answer work with the autograder. For example let's say your answer is 45 you would enter it as;

echo "45"

In [2]:
e1_answer(){
### BEGIN SOLUTION
echo "20455"
### END SOLUTION
}

In [3]:
test_e1


Your answer is correct

Optional

A better tool for obtaining FASTA sequence lengths (and much more) is bioawk . This tool is beyond the scope of this tutorial but if you are curious you may wish to try it.

The general form of the bioawk command for determining sequence lengths is shown below. See if you can reformulate this command to give the answer to exercise 1.

bioawk -c fastx '{print length($seq)}' <input>.fasta

In [4]:
bioawk -c fastx '{print length($seq)}' genomic_sequence.fasta


20455

In-silico Translation

In principle the amino acid sequences of proteins encoded by bacterial genes can be determined simply by translating directly from the genomic sequence. Identifying the starting sites of translation in these genes is a little more challenging and will be explored in more detail later.

Given the bacterial DNA sequence

GCTGCGTGACGTACG

We would translate in the first forward frame by first breaking the sequence into its codons.

GCT GCG TGA CGT ACG

And then using the standard genetic code to translate these to amino acids.

GCT GCG TGA CGT ACG
 A   A   *   R   T

Note: A stop codon * indicates termination of translation.

Reading frames

A DNA sequence can be translate in six possible ways depending on the direction and starting base of translation.

Forward

The example above shows translation based on the first forward reading frame. In this frame we read from left to right and start at the first position. If we were translating in the second forward reading frame we would obtain codons and translate as follows;

G CTG CGT GAC GTA CG
   L   R   D   V

Note that we discarded one base from the left. This also resulted in two discarded bases at the right but overall it resulted in a longer contiuous amino acid sequence because there was no stop codon.

The third forward reading frame is obtained by discarding one more base from the left.

GC TGC GTG ACG TAC G
    C   V   T   Y

What if we kept going? Since there are three bases in a codon, discarding three from the left is equivalent to the first forward frame.

GCT GCG TGA CGT ACG
     A   *  R    T

Reverse

Reverse frames correspond to translation on the opposite strand of DNA. The bases on the opposite strand will be complementary to the forward strand and will be read in reverse. We therefore translate based on the reverse complement. For our example sequence this is;

CGTACGTCACGCAGC

Translation in the first reverse frame is therefore

CGT ACG TCA CGC AGC
 R   T   S   R   S

The second and third reverse frames can be derived on the complementary sequence in exactly the same way as the forward frames on the forward sequence.

Software

The tutorial directory provides a program translate.py that translates DNA in all 6 frames.

Exercise 2

When working with large datasets it is often useful to run some test code on a small fraction of the data.

The head command is useful here since it will extract the top n lines of a dataset where n can be any number we specify.

For example to extract the top 10 lines of genomic_sequence.fasta we would do

head -n 10 genomic_sequence.fasta

Rather than creating a separate file with our test data we simply pipe it to other commands we want to run. For example to verify that head did indeed extract only 10 lines we could pipe to wc.

head -n 10 genomic_sequence.fasta | wc -l

Your task: Translate the top 5 lines of genomic_sequence.fasta using translate.py


In [5]:
head -n 2 genomic_sequence.fasta | translate.py


>NC_006351_f_1
AARRTPCGARRAARGGKPCAATRAGG

In [6]:
e2_answer(){
### BEGIN SOLUTION
head -n 5 genomic_sequence.fasta | translate.py
### END SOLUTION
}

In [7]:
test_e2


Your answer is correct

Familiarise yourself with translate.py and its options by doing the following;

  1. View the help.
    translate.py -h
    
  2. Translate the some sequence in the default frame (frame 1)
    head genomic_sequence.fasta | translate.py
    
  3. Translate in a different frame
    head genomic_sequence.fasta | translate.py -f 2
    
  4. Change output from fasta to tabular format
    head genomic_sequence.fasta | translate.py -F 'tabular'
    
  5. Split the translated sequence by stop codons
    head genomic_sequence.fasta | translate.py -F 'tabular' -s
    

Don't worry if you don't understand all of the outputs at this stage. We'll cover that later. For now just run the commands above and verify that the output roughly corresponds to the option you specified.

Distinguishing coding from non-coding sequences

One of the ways we can distinguish real coding sequences from translated non-coding sequences is their length.

Since 3/64 codons in the standard code are stops we would expect a stop codon to occur around 1 in every 20 amino acids if the underlying DNA sequence were random. Based on this logic, the probability that a continuous sequence of 100 amino acids with no stop codons could be obtained from a random sequence of nucleotides is;

$$p = (1-(3/64))^{100}$$$$p \approx 0.01$$

In other words, any sequences we obtain that are $100$ amino acids or longer are unlikely to be non-coding sequences.

Note: Since non-coding sequences are often functional (eg regulatory) they are unlikely to be random but to a first approximation we can assume that they are.

Exercise 3

Use translate.py to translate genomic_sequence.fasta in the first frame

translate.py -f 1 genomic_sequence.fasta

The tr command can be used to delete all characters other than a specified character. For example to output only A's in the sequence we would do;

translate.py -f 1 genomic_sequence.fasta | tr -cd 'A'

The number of A's can then be counted with

translate.py -f 1 genomic_sequence.fasta | tr -cd 'A' | wc -m

Your task: Write a command to output the number of stop codons in the frame 1 translation of genomic_sequence.fasta

Compare your answer with the expected number of stop codons if the sequence was random. What does this tell you about the gene content of our sequence in frame 1?


In [8]:
e3_answer(){
### BEGIN SOLUTION
translate.py -f 1 genomic_sequence.fasta | tr -cd '*' | wc -m
### END SOLUTION
}

In [9]:
test_e3


Your answer is correct

Exercise 4

Modify the command pipeline you created for exercise 3 to count amino acids.

Remember that the header line >NC_006351_f_1 contains some letters that could be confused with amino acids (ie N and C.

You can deal with this by using the tail command to remove the first line of output like this

translate.py -f 1 genomic_sequence.fasta | tail -n +2

Your task: Write code to output the number of cysteines in the frame 1 translation of genomic_sequence.fasta

Try your code on various other amino acids. Do they vary much in abundance?


In [10]:
e4_answer(){
### BEGIN SOLUTION
translate.py -f 1 genomic_sequence.fasta | tail -n +2 | tr -cd 'C' | wc -m
### END SOLUTION
}

In [11]:
test_e4


Your answer is correct

Open Reading Frames

For the purposes of this tutorial we will define an open reading frame as the part of a reading frame that does not contain stop codons. Some definitions also require that an ORF begin with a start codon.

Our strategy for finding genes in the B. mallei genome will be to first look for long ORFs as these are unlikely to occur in non-coding regions.

So far we have used translate.py in its default output mode which simply translates the genome into one long amino acid sequence, including stop codons. To make finding long orfs easier we will use the --split option to divide up the translation into its constituent ORFs. Eg;

translate.py -f 1 --split genomic_sequence.fasta | head

To make things even easier we change the output format to tabular. This is very useful because unix provides many ways to manipulate tabular files

translate.py -f 1 --split -F 'tabular' genomic_sequence.fasta | head

The tabular format consists of the following fields (columns)

parent_id    start   end   strand   orf_number   orf_length   orf_sequence

Exercise 5

The translate.py script usually works with one frame at a time but you can ask it to translate in all 6 frames by providing 0 for the frame argument. ie

translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta

Your Task: Write code to output the number of ORFs generated by translating genomic_sequence.fasta in all 6 frames.


In [12]:
e5_answer(){
### BEGIN SOLUTION
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | wc -l
### END SOLUTION
}

In [13]:
test_e5


Your answer is correct

Sorting tabular output

The sort command is very useful when working with tabular files.

Have a look at the help for sort

sort --help

As you can see it has many options. For our purposes the most useful options are

-k, --key
-n, --numeric-sort

The k option allows us to specify a column to sort on. For example we could sort according to the 3rd column with

sort -k 3

By default, sort sorts alphabetically. Providing the n option changes this behaviour to a numeric sort (which we often want). By combining k and n options we could do a numeric sort on the 3rd field

translate.py -f 1 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 3 | head

Exercise 6

In this exercise we will finally obtain a list of candidate ORFs that may represent bacterial gene products.

Preparation:

  1. Use a test cell to formulate a command that prints all ORFs (in all frames) (in tabular format) sorted by length.
  2. Use the tail command to display only the longest of these ORFs
  3. Use the -n option to tail to adjust the number of long ORFs displayed.
  4. Adjust the displayed number of lines until you only display ORFs equal to or longer than 100.
  5. Use grep to search for methionines M.
  6. Use wc -l to produce a count.

Your task: Write code to print the number of ORFs from a 6 frame translation of genomic_sequence.fasta that have length greater than or equal to 100 AND contain at least one start codon M.


In [14]:
e6_answer(){
### BEGIN SOLUTION
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 6 | tail -n 100 | grep 'M' | wc -l
### END SOLUTION
}

In [15]:
test_e6


Your answer is correct