Workshop 4 - Genomic Intervals

This workshop continues working with the B. pseudomallei genome sequence from Workshop 3.

Before you begin this workshop you should have familiarised yourself with chapters 1-25 of the interactive guide.

At the end of this workshop you should be able to;

  1. Use awk to convert between tabular formats
  2. Use bedtools to manipulate genomic intervals
  3. Extract and translate genomic sequence within known intervals

IMPORTANT

Run the Setup Code

In order for this notebook to work properly you need to run the cell below before doing anything else. This will load custom functions and settings required to make the self assessment exercises work.

If you restart your kernel you will also need to rerun the setup code

Don't use the cd command

The answers to all self assessment exercises assume that you don't change your directory from the default. You shouldn't ever need to use the cd command to answer an exercise.

Complete exercises in order

Unlike previous workshops this one requires you to complete all the exercises in the correct order. Most importantly you must make sure you correctly generate the files long_orfs.tabular (exercise 1) and long_orfs.bed (exercise 3) before moving to later exercises.


In [1]:
wget -q https://www.dropbox.com/s/es7lr8sy4ubg3ug/translate.py?dl=0 -O translate.py
chmod u+x translate.py

#wget -q https://www.dropbox.com/s/e5j9qra71qsdlrl/NC_006351.all.bed?dl=0 -O known_genes.bed

wget -q https://www.dropbox.com/s/z8xgt7jnkpqugxs/setup.sh?dl=0 -O setup.sh
source setup.sh


Setup Done

Background

At the end of workshop 3 we had formulated a command to translate a genomic sequence from Burkholderia pseudomallei in all $6$ reading frames and extract all ORFs longer than $100$ amino acids.

In this tutorial we will match experimental information from a mutagenesis screen with the gene corresponding to one of these ORFs. The data we will use comes from a paper by Jones et. al who generated mutant forms of B. pseudomallei using a transposon, Tn5-OT182. This transposon inserts itself into the B. pseudomallei genome at random points and thereby disrupts activity of genes by dramatically altering the genome sequence around its insertion point.

Our interest is in one particular mutant which was observed by Jones et. al to result in a dramatic reduction in the ability of B. pseudomallei to invade Eukaryotic cells.

BED Format

The genomic location of the mutant is provided in a file called mutant.bed.

Use the cat command to display the contents of this file

cat mutant.bed

Your output should look like this

NC_006351   13833   13834   Tn5-OT182 Integration Site  0   +

Look carefully at the output. This is a bed formatted file with a single entry. It consists of $6$ columns with the following values;

  • Column 1: chromosome. Since B. pseudomallei has one chromosome this is just the genome ID
  • Column 2: start . The starting position of the feature in the genome in bases.
  • Column 3: end . The end position of the feature in bases.
  • Column 4: name. The name of the feature. In this case it is a mutant site
  • Column 5: score. (Not important for this tutorial)
  • Column 6: strand. A + means the feature is on the forward strand and - on the reverse. Coordinates (start and end are based on the forward strand though.

The coordinates given in this BED file refer to positions in our genomic fragment which is located in genomic_sequence.fasta.

If you remember back to tutorial 2 you might realise that bed format looks alot like gff. Both formats have much the same purpose and you might be surprised to learn that they aren't the only formats used for genome annotation.

Why have multiple formats that do the same thing?

Unsurprisingly one of the most common tasks in bioinformatics is format conversion.

Exercise 1

At the end of workshop 3 we translated our genomic sequence into all 6 frames and extracted ORFs longer than 100 amino acids (and containing a start codon) in tabular form.

The command we used was

translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 6 | tail -n 100 | grep 'M'

Run this command in a test cell to verify that it works.

Your task: Run the command above and redirect its output to a file. Your output file must be called long_orfs.tabular


In [2]:
e1_answer(){
### BEGIN SOLUTION
translate.py -f 0 --split -F 'tabular' genomic_sequence.fasta | sort -n -k 6 | tail -n 100 | grep 'M' > long_orfs.tabular
### END SOLUTION
}

In [3]:
test_e1


Your answer is correct

Exercise 2

Use the head command to examine the output of the file you created in exercise 1. (It should be called long_orfs.tabular.

The columns in this file are

  • Column 1: chromosome. Since B. pseudomallei has one chromosome this is just the genome ID
  • Column 2: start . The starting position of the ORF in the genome in bases.
  • Column 3: end . The end position of the ORF in bases.
  • Column 4: strand. A + means the feature is on the forward strand and - on the reverse.
  • Column 5: orf_num. Just a number identifying the ORF within its frame
  • Column 6: length. Length of the ORF in amino acids
  • Column 7: sequence. Amino acid sequence of the ORF

How would you rearrange these columns in order to create a valid BED file?

Your task: Use awk to transform long_orfs.tabular so that the columns are in the correct order for BED format. Pipe the result to head to show only the first 10 lines of the result. For our purposes the name and score columns of the BED file are unimportant so you should use orf_num for name and length for score.

Note: The default behaviour of awk is to separate columns with spaces but bed requires tabs. For this exercise just leave it as spaces.


In [4]:
# If you could not get the answer to exercise 1 you should run this cell so that you have the
# required files for later exercises

ensure_long_orfs_tabular

In [5]:
e2_answer(){
### BEGIN SOLUTION
awk '{print $1,$2,$3,$5,$6,$4}' long_orfs.tabular | head
### END SOLUTION
}

In [6]:
test_e2


Your answer is correct

Exercise 3 - Convert spaces to tabs

Look closely at your the result of running your answer to exercise 2. It should look like this

NC_006351 5651 5951 103 100 -
NC_006351 15256 15562 60 102 +
NC_006351 19818 20130 2 104 -
NC_006351 11406 11721 44 105 +
NC_006351 12347 12662 59 105 +
NC_006351 17863 18181 75 106 +
NC_006351 10209 10536 27 109 -
NC_006351 14923 15253 59 110 +
NC_006351 8519 8864 32 115 +
NC_006351 11261 11609 67 116 -

Notice that the fields on each line are separated by a single space.

The columns in a valid bed file must be separated by tabs not spaces.

We can fix this by using the tr command. tr is very useful. It can be used to translate any character in a file to any other character. We want to translate a space, which is represented by ' ' into a tab '\t'.

Your task: Modify the command from exercise 2 in combination with the tr command to produce a valid (tab separated) bed file. Pipe your result to a file called long_orfs.bed. Make sure you include all of the ORFs in the original long_orfs.tabular file.

Note: Another way to do this would be to set OFS in awk.


In [7]:
e3_answer(){
### BEGIN SOLUTION
awk '{print $1,$2,$3,$5,$6,$4}' long_orfs.tabular | tr ' ' '\t' > long_orfs.bed
### END SOLUTION
}

In [8]:
test_e3


Your answer is correct

Bedtools

Now that we have our long ORFs and our mutant site in bed format we can use bedtools to compare them.

As its name suggests, bedtools is not just one tool but many. All of the tools are accessible through the bedtools command and its sub commands.

To find out all the available bedtools commands simply type bedtools with no arguments

bedtools

Look at the usage: statement at the very top of the bedtools help.

usage:    bedtools <subcommand> [options]

Which tells us that if we want to use a particular subcommand, (eg intersect) we would type bedtools intersect. To see the help for bedtools intersect type

bedtools intersect -h

The top part of this help tells us most of what we need for now;

Tool:    bedtools intersect (aka intersectBed)
Version: v2.25.0
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>

The Summary: tells what the tool does, ie report overlaps between two feature files and Usage: tells us that we must provide two interval files (in bed, gff, vcf or bam format).

The bedtools manual online includes this graphical depiction of what the intersect command does.

Which tells us that if we have two bed files labeled a and b, intersect will report the parts of intervals in a that overlap with intervals in b. With the -wa option it will report full intervals rather than just the overlapping parts.

Exercise 4

We have two feature files we want to compare

file name description format
long_orfs.bed coordinates of all long ORFs in our genomic segment bed
mutant.bed coordinates of the Tn5-OT182 insertion site identified by Jones et al. bed

Let's try bedtools intersect with these two files.

bedtools intersect -a long_orfs.bed -b mutant.bed

Make sure you understand the output. The following image may help. It shows ORFs overlapping the mutant site in red. Notice that each of the red blocks representing an ORF has a blunt end and a pointy end. This indicates the strand that the ORF is derived from. ORFs on the positive strand point from left to right while those on the negative strand point from right to left.

  1. Four lines are produced. Why? What do they represent?
  2. Look closely at the start and end coordinate of the output. Why does the command report such short features?

Your task: Modify the command above so that the output reports the entire interval not just the overlapping portion.


In [9]:
e4_answer(){
### BEGIN SOLUTION
bedtools intersect -wa -a long_orfs.bed -b mutant.bed
### END SOLUTION
}

In [10]:
test_e4


Your answer is correct

Exercise 5

The full output from exercise 4 includes ORFs on both the positive and negative strands. In a later exercise we will need to separate these into separate datasets corresponding to the different strands.

There are various ways this can be done. Consider how you would do this with the grep command

Your Task: Create a command that outputs the entire intervals for ORFs that overlap the mutant site on the negative strand.

Note: This task is the same as the task for exercise 4 but with the addition of restricting output to the negative strand only.


In [11]:
e5_answer(){
### BEGIN SOLUTION
bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '-'
### END SOLUTION
}

In [12]:
test_e5


Your answer is correct

Bedtools getfasta

So far we have used intersect which is only one of the many bedtools subcommands. Our goal has been to find ORFs that correspond to genes that could be affected by a known mutation site. We have reduced this to a list of just four ORFs.

The next step in our search for genes affected by the mutation site involves looking at the sequences of these ORFs. We can extract the sequences for these ORFs using another one of bedtools many subcommands. This time we will use the getfasta command.

The following graphical explanation illustrates what the getfasta command does (Graphic is from The bedtools online manual).

The command line help tells us the arguments that are required.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

We need;

  • -fi : A fasta file with the genome sequence. This would be genomic_sequence.fasta
  • -bed : A bed file with the features whose sequence we want to extract (see below)

BED Files

The features we want to extract correspond to the ORFs that overlap the mutant site. These correspond to the outputs from exercises 4 and 5.

To run getfasta we simply need to put these results into a file. This could be done with a simple output redirect like so

bedtools intersect -wa -a long_orfs.bed -b mutant.bed > long_orfs_overlapping.bed

Later we will want to translate these sequences so it's a good idea to separate them by strand. This can be done like this

bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '-' > long_orfs_overlapping_neg.bed
bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '+' > long_orfs_overlapping_pos.bed

In [13]:
# You must run this cell to generate files required for exercise 6

ensure_long_orfs_bed

bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '-' > long_orfs_overlapping_neg.bed

bedtools intersect -wa -a long_orfs.bed -b mutant.bed | grep '+' > long_orfs_overlapping_pos.bed

Exercise 6

You should now have two files with ORF coordinates ready for extraction called long_orfs_overlapping_neg.bed and long_orfs_overlapping_pos.bed

We can use bedtools getfasta to extract the nucleotide sequences. For example to extract the nucleotide sequences for the positive strand ORFs;

bedtools getfasta -fi genomic_sequence.fasta -bed long_orfs_overlapping_pos.bed

Since these sequences are created in fasta format we can pipe them to the translate command. For ORFs on the positive strand we just translate in the first positive frame as follows;

bedtools getfasta -fi genomic_sequence.fasta -bed long_orfs_overlapping_pos.bed | translate.py

Your Task: Modify the command above to output the correct translation of ORFs on the negative strand, ie those contained within long_orfs_overlapping_neg.bed


In [14]:
e6_answer(){
### BEGIN SOLUTION
bedtools getfasta -fi genomic_sequence.fasta -bed long_orfs_overlapping_neg.bed | translate.py -f -1
### END SOLUTION
}

In [15]:
test_e6


Your answer is correct

Now that we have translations for the four ORFs of interested we can use BLAST to find out which of these are homologous to known genes. If we were working on the genome of an as yet unsequenced species this process would be important for two reasons;

  1. Homology to known genes provides evidence that an ORF corresponds to a transcribed gene
  2. Homology provides some indication of the function that the corresponding gene.

Since the genome of B. pseudomallei has already been sequenced, and its genes comprehensively annotated we can instead simply search the translations of our four candidate ORFs against known B. pseudomallei genes. To do this follow these steps

  1. Extract translations of our four candidate ORFs into fasta format. This was done as part of exercise 6 but for convenience you may copy-paste the text below
    >NC_006351:13826-14189_f_1
    SPLRSAPCADTCRSSARLPRSVFASGFTSMIRAAGAARPRARASRAIRPRVRRAWRAAARRAPPDRVPAKRAASARSTTG
    CRRSRPSRAGAGPPRPCRWRTPSAGSRATRRAPSSAPAAAA
    >NC_006351:13209-13914_f_1
    SSARRARASDFEHVAHPAHRVDQLRLVRVVDLRAQPADRRVDDVRVAVEIHVPHLRGDRRARQDLALAAHQQPQQREFLR
    GQQHDVALARGLAPRQVELQIGDPQDRRFARLPAAQDRAHARDQLDEIERLDDVIVRAELEPLHAVADVVARGQEQHGRR
    VRAAQPLEHRPAVVAGQHHVEDDQVVLLGFGLMQARHAVLDPVRDEARFGQPLAQIHAGLRLVFHDQYSHLDSPR
    >NC_006351:13212-13905_r_1
    IQMRILIVEDEPKTGMYLRKGLTEAGFIADWVEDGVTGLHQAETEEYDLIILDVMLPGHDGWTVLERLRRAHSTPVLFLT
    ARDDVGDRVKGLELGADDYVVKPFDFVELVARVRSILRRGQARESTVLRIADLELDLTRRKATRQGDVVLLTAKEFALLW
    LLMRREGEILPRATIASQVWDMNFNSDTNVVDAAIRRLRSKIDDAYEPKLIHTVRGMGYVLEVRSASAPSR
    >NC_006351:11821-15115_r_1
    RADLRRAQRRRGQDLRRRHGDAQRNGRADRGRAAEGAGRVRGEGRADHGPARADRQPRSRQARALRRDGGRRAGHGRRGG
    RRAQGRPAVPGRSPLRHRRAAARRAAHRRRGDQAAADRATPGRAGGRERPARAGAVRAARRARDGGHIARAEPDQPRGRQ
    APRRRQRERARARRRLVRRRRARADRLGRPGARRLLAVVGRPVRTVAKRERAAEDRRAARARDGVRVAVRDVQQREGRPA
    RVHRHPVRAHGRRRVAVAARHSAVDHGRGRLHRAIGRRGAQRPRDDLVHPQPARRRRAARRRGARRRAHAAAAGADDGAR
    RVARLPADGVRHRHGRGGPAPARDGRDRRHPVVDRADAARFAGTLSGGARRAAARHARRTRGRMAREARARGRAAPAARI
    IEVNPDANTDRGRRAEDRHVSAQGADRSGLHRGLGRGRRDGPASGRNRGVRPDHPRRDAARPRRLDGARAAAPRALDARA
    VPDRARRRRRPREGARARRGRLRRQAVRFRRAGRARALDPAPRAGARIDGPADRRSGARPDAAQGHAPGRRRAADREGIR
    AAVAADAPRGRDPAARDDRLAGVGHEFQQRHERRRRGDPPAALEDRRRVRAEADPHGARDGLRARSQKRERAEPMIRRLL
    PRTLRARLTALIILSTAATLALSGVALYSALHNRLVGMSSYEMSATLAAMRTHLANVANVDDIPRKSDLWIDQLHGHQNL
    DLAIYDTDGRLRFATRGFVAPRPALGAPQTRVPASAAPAGATFSYLADDAPLRGGNPRTARIVVQYDGKNDHALLRAYAY
    TVVVIEVLAVVLTAALAYGIAMLGLSPLRRLVARAEQMSSSRLAQPLPELDTSGELKEMEHAFNAMLKRLDESFVRLSQF
    SSNLAHDMRTPLTNLLAEAQVALSKPRTADEYRDVIESSIDEYQRLSRMIEDMLFLARSDNAQSHLAIRTLDAAAQAERV
    AGYYEPMAEDADVRIVVRGKAEVRADALLYHRALSNLISNALNHAPRGSTITIECAQAADAATISVSDTGRGIEAPHRER
    IFERFYRVDPARHNSASGTGLGLAIVRSIMENHGGTCGVDSEPHVRTTFWLKFPAHAA
  2. Go to the NCBI Blast website https://blast.ncbi.nlm.nih.gov/Blast.cgi
  3. Select Quick Blast and then enter search parameters as in the screenshot shown. Make sure you restrict taxonomy to B. pseudomallei otherwise the search will take a very long time. If this were a real genome annotation project we would perform the blast search from the command line using a supercomputer.
  4. Run the BLAST search and look at the results.

You should find that only the two ORFs in the reverse frame correspond to known B. pseudomallei genes. One of these irlR overlaps directly with the mutation site. The other gene irlS is downstream of irlR (notice that the BLAST hit for this ORF is only in amino acids $\approx 600-1100$).