This workshop continues working with the B. pseudomallei genome sequence from Workshop 3.
Before you begin this workshop you should have familiarised yourself with chapters 1-25 of the interactive guide.
At the end of this workshop you should be able to;
awk
to convert between tabular formatsRun the Setup Code
In order for this notebook to work properly you need to run the cell below before doing anything else. This will load custom functions and settings required to make the self assessment exercises work.
If you restart your kernel you will also need to rerun the setup code
Don't use the
cd
command
The answers to all self assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd
command to answer an exercise.
Complete exercises in order
Unlike previous workshops this one requires you to complete all the exercises in the correct order. Most importantly you must make sure you correctly generate the files long_orfs.tabular
(exercise 1) and long_orfs.bed
(exercise 3) before moving to later exercises.
In [1]:
wget -q https://www.dropbox.com/s/es7lr8sy4ubg3ug/translate.py?dl=0 -O translate.py
chmod u+x translate.py
#wget -q https://www.dropbox.com/s/e5j9qra71qsdlrl/NC_006351.all.bed?dl=0 -O known_genes.bed
wget -q https://www.dropbox.com/s/z8xgt7jnkpqugxs/setup.sh?dl=0 -O setup.sh
source setup.sh
At the end of workshop 3 we had formulated a command to translate a genomic sequence from Burkholderia pseudomallei in all $6$ reading frames and extract all ORFs longer than $100$ amino acids.
In this tutorial we will match experimental information from a mutagenesis screen with the gene corresponding to one of these ORFs. The data we will use comes from a paper by Jones et. al who generated mutant forms of B. pseudomallei using a transposon, Tn5-OT182
. This transposon inserts itself into the B. pseudomallei genome at random points and thereby disrupts activity of genes by dramatically altering the genome sequence around its insertion point.
Our interest is in one particular mutant which was observed by Jones et. al to result in a dramatic reduction in the ability of B. pseudomallei to invade Eukaryotic cells.
The genomic location of the mutant is provided in a file called mutant.bed
.
Use the cat
command to display the contents of this file
cat mutant.bed
Your output should look like this
NC_006351 13833 13834 Tn5-OT182 Integration Site 0 +
Look carefully at the output. This is a bed formatted file with a single entry. It consists of $6$ columns with the following values;
chromosome
. Since B. pseudomallei has one chromosome this is just the genome IDstart
. The starting position of the feature in the genome in bases. end
. The end position of the feature in bases.name
. The name of the feature. In this case it is a mutant sitescore
. (Not important for this tutorial)strand
. A +
means the feature is on the forward strand and -
on the reverse. Coordinates (start
and end
are based on the forward strand though.The coordinates given in this BED
file refer to positions in our genomic fragment which is located in genomic_sequence.fasta
.
If you remember back to tutorial 2 you might realise that bed
format looks alot like gff
. Both formats have much the same purpose and you might be surprised to learn that they aren't the only formats used for genome annotation.
Why have multiple formats that do the same thing?
Unsurprisingly one of the most common tasks in bioinformatics is format conversion.
At the end of workshop 3 we translated our genomic sequence into all 6 frames and extracted ORFs longer than 100 amino acids (and containing a start codon) in tabular form.
The command we used was
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 6 | tail -n 100 | grep 'M'
Run this command in a test cell to verify that it works.
Your task: Run the command above and redirect its output to a file. Your output file must be called long_orfs.tabular
In [2]:
e1_answer(){
### BEGIN SOLUTION
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 6 | tail -n 100 | grep 'M' > long_orfs.tabular
### END SOLUTION
}
In [3]:
test_e1
Use the head
command to examine the output of the file you created in exercise 1. (It should be called long_orfs.tabular
.
The columns in this file are
chromosome
. Since B. pseudomallei has one chromosome this is just the genome IDstart
. The starting position of the ORF in the genome in bases. end
. The end position of the ORF in bases.strand
. A +
means the feature is on the forward strand and -
on the reverse.orf_num
. Just a number identifying the ORF within its framelength
. Length of the ORF in amino acidssequence
. Amino acid sequence of the ORFHow would you rearrange these columns in order to create a valid BED file?
Your task: Use awk
to transform long_orfs.tabular
so that the columns are in the correct order for BED
format. Pipe the result to head
to show only the first 10 lines of the result. For our purposes the name
and score
columns of the BED file are unimportant so you should use orf_num
for name
and length
for score
.
Note: The default behaviour of awk
is to separate columns with spaces but bed
requires tabs. For this exercise just leave it as spaces.
In [4]:
# If you could not get the answer to exercise 1 you should run this cell so that you have the
# required files for later exercises
ensure_long_orfs_tabular
In [5]:
e2_answer(){
### BEGIN SOLUTION
awk '{print $1,$2,$3,$5,$6,$4}' long_orfs.tabular | head
### END SOLUTION
}
In [6]:
test_e2
Look closely at your the result of running your answer to exercise 2. It should look like this
NC_006351 5651 5951 103 100 -
NC_006351 15256 15562 60 102 +
NC_006351 19818 20130 2 104 -
NC_006351 11406 11721 44 105 +
NC_006351 12347 12662 59 105 +
NC_006351 17863 18181 75 106 +
NC_006351 10209 10536 27 109 -
NC_006351 14923 15253 59 110 +
NC_006351 8519 8864 32 115 +
NC_006351 11261 11609 67 116 -
Notice that the fields on each line are separated by a single space.
The columns in a valid bed
file must be separated by tabs not spaces.
We can fix this by using the tr command. tr
is very useful. It can be used to translate any character in a file to any other character. We want to translate a space, which is represented by ' '
into a tab '\t'
.
Your task: Modify the command from exercise 2 in combination with the tr
command to produce a valid (tab separated) bed file. Pipe your result to a file called long_orfs.bed
. Make sure you include all of the ORFs in the original long_orfs.tabular
file.
Note: Another way to do this would be to set OFS
in awk
.
In [7]:
e3_answer(){
### BEGIN SOLUTION
awk '{print $1,$2,$3,$5,$6,$4}' long_orfs.tabular | tr ' ' '\t' > long_orfs.bed
### END SOLUTION
}
In [8]:
test_e3
Now that we have our long ORFs and our mutant site in bed
format we can use bedtools to compare them.
As its name suggests, bedtools
is not just one tool but many. All of the tools are accessible through the bedtools
command and its sub commands.
To find out all the available bedtools commands simply type bedtools
with no arguments
bedtools
Look at the usage:
statement at the very top of the bedtools help.
usage: bedtools <subcommand> [options]
Which tells us that if we want to use a particular subcommand, (eg intersect
) we would type bedtools intersect
. To see the help for bedtools intersect
type
bedtools intersect -h
The top part of this help tells us most of what we need for now;
Tool: bedtools intersect (aka intersectBed)
Version: v2.25.0
Summary: Report overlaps between two feature files.
Usage: bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>
The Summary:
tells what the tool does, ie report overlaps between two feature files and Usage:
tells us that we must provide two interval files (in bed
, gff
, vcf
or bam
format).
The bedtools manual online includes this graphical depiction of what the intersect
command does.
Which tells us that if we have two bed
files labeled a
and b
, intersect
will report the parts of intervals in a
that overlap with intervals in b
. With the -wa
option it will report full intervals rather than just the overlapping parts.
We have two feature files we want to compare
file name | description | format |
---|---|---|
long_orfs.bed |
coordinates of all long ORFs in our genomic segment | bed |
mutant.bed |
coordinates of the Tn5-OT182 insertion site identified by Jones et al. |
bed |
Let's try bedtools intersect
with these two files.
bedtools intersect -a long_orfs.bed -b mutant.bed
Make sure you understand the output. The following image may help. It shows ORFs overlapping the mutant site in red. Notice that each of the red blocks representing an ORF has a blunt end and a pointy end. This indicates the strand that the ORF is derived from. ORFs on the positive strand point from left to right while those on the negative strand point from right to left.
Your task: Modify the command above so that the output reports the entire interval not just the overlapping portion.
In [9]:
e4_answer(){
### BEGIN SOLUTION
bedtools intersect -wa -a long_orfs.bed -b mutant.bed
### END SOLUTION
}
In [10]:
test_e4
The full output from exercise 4 includes ORFs on both the positive and negative strands. In a later exercise we will need to separate these into separate datasets corresponding to the different strands.
There are various ways this can be done. Consider how you would do this with the grep
command
Your Task: Create a command that outputs the entire intervals for ORFs that overlap the mutant site on the negative strand.
Note: This task is the same as the task for exercise 4 but with the addition of restricting output to the negative strand only.
In [11]:
e5_answer(){
### BEGIN SOLUTION
bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '-'
### END SOLUTION
}
In [12]:
test_e5
So far we have used intersect
which is only one of the many bedtools
subcommands. Our goal has been to find ORFs that correspond to genes that could be affected by a known mutation site. We have reduced this to a list of just four ORFs.
The next step in our search for genes affected by the mutation site involves looking at the sequences of these ORFs. We can extract the sequences for these ORFs using another one of bedtools
many subcommands. This time we will use the getfasta
command.
The following graphical explanation illustrates what the getfasta
command does (Graphic is from The bedtools online manual).
The command line help tells us the arguments that are required.
Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
We need;
-fi
: A fasta file with the genome sequence. This would be genomic_sequence.fasta
-bed
: A bed file with the features whose sequence we want to extract (see below)BED Files
The features we want to extract correspond to the ORFs that overlap the mutant site. These correspond to the outputs from exercises 4 and 5.
To run getfasta
we simply need to put these results into a file. This could be done with a simple output redirect like so
bedtools intersect -wa -a long_orfs.bed -b mutant.bed > long_orfs_overlapping.bed
Later we will want to translate these sequences so it's a good idea to separate them by strand. This can be done like this
bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '-' > long_orfs_overlapping_neg.bed
bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '+' > long_orfs_overlapping_pos.bed
In [13]:
# You must run this cell to generate files required for exercise 6
ensure_long_orfs_bed
bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '-' > long_orfs_overlapping_neg.bed
bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '+' > long_orfs_overlapping_pos.bed
You should now have two files with ORF coordinates ready for extraction called long_orfs_overlapping_neg.bed
and long_orfs_overlapping_pos.bed
We can use bedtools getfasta
to extract the nucleotide sequences. For example to extract the nucleotide sequences for the positive strand ORFs;
bedtools getfasta -fi genomic_sequence.fasta -bed long_orfs_overlapping_pos.bed
Since these sequences are created in fasta
format we can pipe them to the translate command. For ORFs on the positive strand we just translate in the first positive frame as follows;
bedtools getfasta -fi genomic_sequence.fasta -bed long_orfs_overlapping_pos.bed | translate.py
Your Task: Modify the command above to output the correct translation of ORFs on the negative strand, ie those contained within long_orfs_overlapping_neg.bed
In [14]:
e6_answer(){
### BEGIN SOLUTION
bedtools getfasta -fi genomic_sequence.fasta -bed long_orfs_overlapping_neg.bed | translate.py -f -1
### END SOLUTION
}
In [15]:
test_e6
Now that we have translations for the four ORFs of interested we can use BLAST to find out which of these are homologous to known genes. If we were working on the genome of an as yet unsequenced species this process would be important for two reasons;
Since the genome of B. pseudomallei has already been sequenced, and its genes comprehensively annotated we can instead simply search the translations of our four candidate ORFs against known B. pseudomallei genes. To do this follow these steps
fasta
format. This was done as part of exercise 6 but for convenience you may copy-paste the text below
>NC_006351:13826-14189_f_1
SPLRSAPCADTCRSSARLPRSVFASGFTSMIRAAGAARPRARASRAIRPRVRRAWRAAARRAPPDRVPAKRAASARSTTG
CRRSRPSRAGAGPPRPCRWRTPSAGSRATRRAPSSAPAAAA
>NC_006351:13209-13914_f_1
SSARRARASDFEHVAHPAHRVDQLRLVRVVDLRAQPADRRVDDVRVAVEIHVPHLRGDRRARQDLALAAHQQPQQREFLR
GQQHDVALARGLAPRQVELQIGDPQDRRFARLPAAQDRAHARDQLDEIERLDDVIVRAELEPLHAVADVVARGQEQHGRR
VRAAQPLEHRPAVVAGQHHVEDDQVVLLGFGLMQARHAVLDPVRDEARFGQPLAQIHAGLRLVFHDQYSHLDSPR
>NC_006351:13212-13905_r_1
IQMRILIVEDEPKTGMYLRKGLTEAGFIADWVEDGVTGLHQAETEEYDLIILDVMLPGHDGWTVLERLRRAHSTPVLFLT
ARDDVGDRVKGLELGADDYVVKPFDFVELVARVRSILRRGQARESTVLRIADLELDLTRRKATRQGDVVLLTAKEFALLW
LLMRREGEILPRATIASQVWDMNFNSDTNVVDAAIRRLRSKIDDAYEPKLIHTVRGMGYVLEVRSASAPSR
>NC_006351:11821-15115_r_1
RADLRRAQRRRGQDLRRRHGDAQRNGRADRGRAAEGAGRVRGEGRADHGPARADRQPRSRQARALRRDGGRRAGHGRRGG
RRAQGRPAVPGRSPLRHRRAAARRAAHRRRGDQAAADRATPGRAGGRERPARAGAVRAARRARDGGHIARAEPDQPRGRQ
APRRRQRERARARRRLVRRRRARADRLGRPGARRLLAVVGRPVRTVAKRERAAEDRRAARARDGVRVAVRDVQQREGRPA
RVHRHPVRAHGRRRVAVAARHSAVDHGRGRLHRAIGRRGAQRPRDDLVHPQPARRRRAARRRGARRRAHAAAAGADDGAR
RVARLPADGVRHRHGRGGPAPARDGRDRRHPVVDRADAARFAGTLSGGARRAAARHARRTRGRMAREARARGRAAPAARI
IEVNPDANTDRGRRAEDRHVSAQGADRSGLHRGLGRGRRDGPASGRNRGVRPDHPRRDAARPRRLDGARAAAPRALDARA
VPDRARRRRRPREGARARRGRLRRQAVRFRRAGRARALDPAPRAGARIDGPADRRSGARPDAAQGHAPGRRRAADREGIR
AAVAADAPRGRDPAARDDRLAGVGHEFQQRHERRRRGDPPAALEDRRRVRAEADPHGARDGLRARSQKRERAEPMIRRLL
PRTLRARLTALIILSTAATLALSGVALYSALHNRLVGMSSYEMSATLAAMRTHLANVANVDDIPRKSDLWIDQLHGHQNL
DLAIYDTDGRLRFATRGFVAPRPALGAPQTRVPASAAPAGATFSYLADDAPLRGGNPRTARIVVQYDGKNDHALLRAYAY
TVVVIEVLAVVLTAALAYGIAMLGLSPLRRLVARAEQMSSSRLAQPLPELDTSGELKEMEHAFNAMLKRLDESFVRLSQF
SSNLAHDMRTPLTNLLAEAQVALSKPRTADEYRDVIESSIDEYQRLSRMIEDMLFLARSDNAQSHLAIRTLDAAAQAERV
AGYYEPMAEDADVRIVVRGKAEVRADALLYHRALSNLISNALNHAPRGSTITIECAQAADAATISVSDTGRGIEAPHRER
IFERFYRVDPARHNSASGTGLGLAIVRSIMENHGGTCGVDSEPHVRTTFWLKFPAHAA
Quick Blast
and then enter search parameters as in the screenshot shown. Make sure you restrict taxonomy to B. pseudomallei otherwise the search will take a very long time. If this were a real genome annotation project we would perform the blast search from the command line using a supercomputer.
You should find that only the two ORFs in the reverse frame correspond to known B. pseudomallei genes. One of these irlR
overlaps directly with the mutation site. The other gene irlS
is downstream of irlR
(notice that the BLAST hit for this ORF is only in amino acids $\approx 600-1100$).