Part 2: Running a local BLAST+ search

Introduction

BLAST+ is divided into different applications based on three broad categories:

  1. Search tools
  2. Database tools
  3. Sequence filtering tools

We have already seen a database tool when we used makeblastdb to generate our BLAST databases. However, we have not yet looked at the different search tools. These search tools are divided based on functionality, query type and database type. Below is a summary of the five main search tools.

Program Input format Database
blastn nucleotide nucleotide
blastp protein protein
blastx translated nucleotide protein
tblastn protein translated nucleotide
tblastx translated nucleotide translated nucleotide

The first two search tools, blastp and blastn, have an additional -task option which will optimize the parameters such as word size or gap cost. This can be useful if you are searching short, similar or dissimilar sequences. Below is a summary of the available tasks and their uses.

Program Task name Description
blastn blastn Traditional blastn search
blastn blastn-short Optimized for queries shorter than 50 bases
blastn megablast Optimized to search for sequences with high similarity (e.g. intraspecies)
blastn dc-megablast Discontiguous megablast. Optimised to search for more distant (e.g. interspecies) sequences
blastp blastp Traditional blastp search
blastp blastp-short Optimized for queries shorter than 30 residues

How do these BLAST applications work?

Every BLAST search starts with a query sequence provided by you, the user. BLAST takes that query sequence and splits it into smaller fragments called words. It then uses these words to search the database for other sequences that contain identical or similar, words. This process is called seeding.

Each match or hit is checked to make sure that it meets a certain threshold or score. The score for each alignment is calculated using a substitution matrix and, if the score meets the threshold, the alignment moves forward for extension. During the extension phase, BLAST will try to increase the length of the alignment, extending out from either side. Extension will continue until the score for that alignment falls below a pre-defined threshold or score. The extended alignment is known as a High-scoring Segment Pair (HSP).

So, to recap. The BLAST output for a query sequence is known as the result. When a sequence that is similar to the query is found in the database, this is known as a hit. Within the hit there may be multiple regions of similarity which can be aligned and are known as a HSPs. Each hit can have multiple HSPs which must each meet a minimum threshold or score.

All of this sounds great, but it would be easier to understand if we could see an example, right? Then let's try running one of the simpler BLAST applications, blastn.

First we need to remember the three things we need for any BLAST search:

  • A query sequence
  • A sequence database
  • A BLAST application

Now, take another look at the table above, what do we need for a blastn search?

So, we already have our application, blastn and we have already generated our nucleotide database bacteria_nucl in the first part of the tutorial here. All we need now is our nucleotide query sequence which can be found in example/unknown.fa.

Let's take a look and check that our query is a nucleotide sequence!


In [1]:
ls example


unknown.fa

In [2]:
cat example/unknown.fa


>my_unknown_sequence
GTGATAGCATATGAAAACATAGAATTTTTTATATGCTTGGTGAATGTTTTGGGCAACAATATGTATAATA
TCCTTTTCTTCATCTTTCTTTCAATAGCAATTCCATTCCTTTTATTCCTCGCATGGAAACAGCACCTAAA
AACCAAAGAGATTCGTTCATATCTATTGAAAGAGGGATATAATATTATTTTCAACGGAGAAGGTAACTCA
TATCTCGCGTTTAATATTAGTAATGCGACATTTCGCGCAGGTAATTTAACTTCCAATGATTATTTTCAAG
CATCAATTTCTTATATCCACGATTATAGATGGGAGTGGAAGGAGGTTGAGGCAAAGAAAATAAATAATAT
ATTTATTATTTACATTTCGAATATTGATTTCCCTTCCCAAAAACTATTTTATCGCAATAATAAATCTTTA
GCAGAAATAGACTGGGCAAAATTACAAGCAATTTTTCATCAACCATATGAAATACAGAATGACGTCATGC
AAGATAACAATAATACGCACTATGATTTTTTCATATCCCATGCAAAAGAGGATAAAGATACTTTTGTCAG
ACCACTGGTAGACGAGTTAAATAGACTTGGTGTAATTATTTGGTATGATGAACAGACACTTGAAGTCGGC
GATAGCTTAAGGAGAAATATTGATTTAGGCCTAAGAAAAGCAAATTATGGCATAGTCATACTTTCTCATA
ACTTTCTAAACAAGAAATGGACACAATACGAATTAGATAGTTTAATTAATCGTGCAGTGTATGATGATAA
TAAGATTATATTGCCAATCTGGCATAATATCAATGCTCAAGAGGTATCTAAATACAGCCATTATTTGGCG
GATAAAATGGCACTGCAAACTTCTTTATATAGCGTTAAGGAAATAGCAAGAGAGTTGGCTGAAATAGCAT
ACAGGAGAAGATAA

So, now we have our three essentials let's run our blastn search. To look at the parameters available type blastn -help.

Parameter Meaning
-task Only for blastn and blastp. Defaults to megablast for blastn.
-query The location of the file containing your query sequence.
-db Location and reference (e.g. bacteria_nucl) of your BLAST database
-out Location and name of the output file

The format of the command will be:

blastn -task [task] -query [input file] -db [database reference] -out [output file]

As we are not in the same directory as our database, we will need to tell the program to look in the db folder. We can do this by putting the location (relative to where we are now) before the reference e.g. db/bacteria/bacteria_nucl. You can do the same for the output file which we want to write to the example folder. It is normally a good idea to give your output file a descriptive name. Here we use the program and a generic description of the database being queried e.g. blastn_bacteria.out.

Now, let's try and identify our unknown sequence using blastn!


In [3]:
blastn -task blastn -query example/unknown.fa -db db/bacteria/bacteria_nucl -out example/blastn_bacteria.out



If this has worked, you should be able to see a new file in the example directory called blastn_bacteria.out. To look at your file, you can open it in a text editor or look at it in the terminal using the command:

less example/blastn_bacteria.out

The output can be split into two sections: a summary hit table and the corresponding alignments. If we were to have multiple queries, these sections would then be replicated for each query.

Let's take a look at the hit summary...


In [4]:
grep -A 10 'Query=' example/blastn_bacteria.out


Query= my_unknown_sequence

Length=924
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  GQ903013.1 Escherichia coli strain AD110 TcpC (tcpC) gene, comp...  1667    0.0  
  GQ903012.1 Escherichia coli strain Asp12e TcpC (tcpC) gene, com...  1667    0.0  
  GQ903011.1 Escherichia coli strain BEN13f TcpC (tcpC) gene, com...  1667    0.0  
  GQ903010.1 Escherichia coli strain C1845 TcpC (tcpC) gene, comp...  1667    0.0  
  GQ903009.1 Escherichia coli strain C4737 TcpC (tcpC) gene, comp...  1667    0.0  

The hits from the bacteria_nucl database which match our unknown sequence are ordered by their bit score from highest to lowest, the highest representing the best hit. This is not the same as the alignment score that we were discussing earlier. The bit score is derived from those alignment scores, but is normalised so that it is possible to compare the alignment scores from different searches.

In addition to a bit score, each hit is also given an E Value which represents the number of different alignments that have scores which are the same or better than a score which is expected to occur by chance in a database search. Broadly speaking, it is a measure of confidence. The lower the value the more confident you can be that this score would not occur by chance.

Let's take a look at the lower end of the table and see how the bit score and e value differ.


In [5]:
grep -A 5 '  AY461808.1 ' example/blastn_bacteria.out


  AY461808.1 Vibrio cholerae strain XJ93006 toxin-regulated pilus...  22.9    6.5  
  KJ596546.1 Vibrio cholerae H1 toxin corregulated pilus (tcpA) g...  22.9    6.5  
  KJ596547.1 Vibrio cholerae strain N1 toxin corregulated pilus (...  22.9    6.5  
  KJ596548.1 Vibrio cholerae strain PC01 toxin corregulated pilus...  22.9    6.5  
  KJ596549.1 Vibrio cholerae strain VC55 toxin corregulated pilus...  22.9    6.5  

Now let's look at the alignment which corresponds to the best hit (GQ903013.1).


In [6]:
grep -A 18 '> GQ903013.1' example/blastn_bacteria.out


> GQ903013.1 Escherichia coli strain AD110 TcpC (tcpC) gene, complete 
cds
Length=924

 Score = 1667 bits (1848),  Expect = 0.0
 Identities = 924/924 (100%), Gaps = 0/924 (0%)
 Strand=Plus/Plus

Query  1    GTGATAGCATATGAAAACATAGAATTTTTTATATGCTTGGTGAATGTTTTGGGCAACAAT  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1    GTGATAGCATATGAAAACATAGAATTTTTTATATGCTTGGTGAATGTTTTGGGCAACAAT  60

Query  61   ATGTATAATATCCTTTTCTTCATCTTTCTTTCAATAGCAATTCCATTCCTTTTATTCCTC  120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  61   ATGTATAATATCCTTTTCTTCATCTTTCTTTCAATAGCAATTCCATTCCTTTTATTCCTC  120

Query  121  GCATGGAAACAGCACCTAAAAACCAAAGAGATTCGTTCATATCTATTGAAAGAGGGATAT  180
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121  GCATGGAAACAGCACCTAAAAACCAAAGAGATTCGTTCATATCTATTGAAAGAGGGATAT  180

The alignment gives more detail about this match than the summary table. Here, the hit is referred to as the subject (or Sbjct). The alignment gives us details about the orientation of our hit and query sequences. In this case, the sequences align in the same, forward direction (Plus/Plus) but this may not always be the case (e.g. if the hit was in the reverse orientation Plus/Minus). It also provides information on the number of exact matches (Identities) and gaps within our alignment, both of which contribute to the alignment score.

Based on the output of our blastn search, which species do you think our unknown sequence comes from? What gene might it be?

Output formats

BLAST results can be written in different formats. If we don't specify an output format the default is pairwise which contains a summary hit table and the corresponding alignments as we saw above. There are several other useful formats which are available using the -outfmt parameter.

-outfmt value Description
0 pairwise
1 query-anchored showing identities
2 query-anchored no identities
3 flat query-anchored, show identities
4 flat query-anchored, no identities
5 XML Blast output
6 tabular
7 tabular with comment lines
8 Text ASN.1
9 Binary ASN.1
10 Comma-separated values
11 BLAST archive format (ASN.1)

If you don't need to see the alignments, a tabular output is often the most simple to work with. Let's try adding -outfmt 6 to our command (don't forget to change the output file name!!).


In [7]:
blastn -task blastn -query example/unknown.fa -db db/bacteria/bacteria_nucl -out example/blastn_bacteria_outfmt6.out -outfmt 6



And take a look at what we've got...


In [8]:
head -10 example/blastn_bacteria_outfmt6.out


my_unknown_sequence	GQ903013.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903012.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903011.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903010.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903009.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903008.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903007.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903006.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903005.1	100.00	924	0	0	1	924	1	924	0.0	1667
my_unknown_sequence	GQ903004.1	100.00	924	0	0	1	924	1	924	0.0	1667

Our output is now in a tab-delimited format but we have no column names. By default, these are:

Heading tags Meaning
qseqid Query identifier/accession
sseqid Subject (hit) identifier/accession
pident Percentage of identical positions in alignment
length Alignment length
mismatch Number of mismatches in alignment
gapopen Number of gaps in alignment
qstart Start position of alignment in query
qend End position of alignment in query
sstart Start position of alignment in query
send End position of alignment in query
evalue E Value
bitscore Bit Score

One useful statistic that we are given when we use the online version of BLAST is the percentage of our query that has been aligned, also known as our query coverage. Not to worry, we don't have to manually calculate this as BLAST has some extra parameters we can use. For more information try blastn -help and look at the Formatting options section.

Let's say we don't want to know all of the alignment statistics, how can we generate a summary which tells us: query id, subject id, query length, subject length, alignment length, percentage identity, query coverage, bit score and evalue? Well, we need to specify the corresponding tags. To do this, we still need to use the -outfmt parameter but now we put our format identifier (0-11) followed by the tags for the columns we want to include. The tags should be separated by a single space with the format identifier and tags all enclosed in double quotes. Let's give it a try, it will make much more sense once you see it written out below!


In [9]:
blastn -task blastn -query example/unknown.fa -db db/bacteria/bacteria_nucl -out example/blastn_bacteria_final.out -outfmt "6 qseqid sseqid qlen slen length pident qcovs bitscore evalue"



Let's take a look at our output. Remember, the columns are in the same order as you specified in the command.


In [10]:
head -10 example/blastn_bacteria_final.out


my_unknown_sequence	GQ903013.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903012.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903011.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903010.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903009.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903008.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903007.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903006.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903005.1	924	924	924	100.00	100	1667	0.0
my_unknown_sequence	GQ903004.1	924	924	924	100.00	100	1667	0.0

From this output, you should now be able answer the following questions.

What percentage of our query aligns with our top hit?

Is our query sequence the same length as our top hit?

Using different tasks to optimise parameters?

In the last section of the tutorial, you will have noticed that we used the -task parameter to tell blastn that we want to use the blastn parameters. But what are these parameters?

If you remember back at the start, we described how BLAST splits your query sequence into smaller segements called words. The length of the word is defined by a parameter called -word_size which has a default value of 11. Broadly speaking, we can think of this as the minimum length of the initial alignment which can be found and extended by BLAST. So,if you have a large database, you can increase the speed of your search just by increasing word size.

The word size is just one of the parameters which is automatically changed when you use tasks such as megablast. For megablast, the word size is increase to a default of 28 and the cost of opening and extending gaps in the alignment is optimised to find long, highly similar alignments. This is why megablast is very efficient and particularly suited to interspecies comparisons.

Let's see if using -task megablast parameters instead of blastn changes our results. We are going to use the default -outfmt 6 columns this time.


In [11]:
blastn -task megablast -query example/unknown.fa -db db/bacteria/bacteria_nucl -out example/megablast_bacteria_outfmt6.out -outfmt 6



Let's take a look at our output. Looks pretty similar to the blastn results, right?


In [12]:
head -10 example/megablast_bacteria_outfmt6.out


my_unknown_sequence	GQ903013.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903012.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903011.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903010.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903009.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903008.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903007.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903006.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903005.1	100.00	924	0	0	1	924	1	924	0.0	1707
my_unknown_sequence	GQ903004.1	100.00	924	0	0	1	924	1	924	0.0	1707

Well, that's not quite true. Let's see how many results we have in both our blastn and megablast searches.


In [13]:
wc -l example/blastn_bacteria_outfmt6.out
wc -l example/megablast_bacteria_outfmt6.out


      45 example/blastn_bacteria_outfmt6.out
      24 example/megablast_bacteria_outfmt6.out

Did the blastn and megablast searches produce the same nummber of hits? Why do you think this is?
(hint: have a look at the end of the pairwise alignment file and think about the default megablast word size)

Searches using translated nucleotide sequences

Sometimes, depending on the biological question (e.g. don't do this with primers!), it can be better to perform a BLAST search using a translated nucleotide query. The simplest explanation is that several codons may encode the same amino acid (redundancy). So, while there may be differences between two nucleotide sequences, they may in fact encode the same amino acid sequence.

So far, we have created our own BLAST database of bacteria sequences and identified our unknown sequence as TcpC from Escherichia coli. But, is it only found in bacteria?

Exercise 2

We can't use our bacteria database for this search but we can use what you learnt in the first part of the tutorial, here, to generate a new database. There are some sequences provided for you in the db/mammalian folder. Let's take a look.


In [14]:
ls db/mammalian


mammalian.fa

In [15]:
head db/mammalian/mammalian.fa


>NP_067272.1 toll-like receptor 4 precursor [Mus musculus]
MMPPWLLARTLIMALFFSCLTPGSLNPCIEVVPNITYQCMDQKLSKVPDDIPSSTKNIDLSFNPLKILKS
YSFSNFSELQWLDLSRCEIETIEDKAWHGLHHLSNLILTGNPIQSFSPGSFSGLTSLENLVAVETKLASL
ESFPIGQLITLKKLNVAHNFIHSCKLPAYFSNLTNLVHVDLSYNYIQTITVNDLQFLRENPQVNLSLDMS
LNPIDFIQDQAFQGIKLHELTLRGNFNSSNIMKTCLQNLAGLHVHRLILGEFKDERNLEIFEPSIMEGLC
DVTIDEFRLTYTNDFSDDIVKFHCLANVSAMSLAGVSIKYLEDVPKHFKWQSLSIIRCQLKQFPTLDLPF
LKSLTLTMNKGSISFKKVALPSLSYLDLSRNALSFSGCCSYSDLGTNSLRHLDLSFNGAIIMSANFMGLE
ELQHLDFQHSTLKRVTEFSAFLSLEKLLYLDISYTNTKIDFDGIFLGLTSLNTLKMAGNSFKDNTLSNVF
ANTTNLTFLDLSKCQLEQISWGVFDTLHRLQLLNMSHNNLLFLDSSHYNQLYSLSTLDCSFNRIETSKGI
LQHFPKSLAFFNLTNNSVACICEHQKFLQWVKEQKQFLVNVEQMTCATPVEMNTSLVLDFNNSTCYMYKT

Using mammalian.fa create a new database which has the output prefix mammalian and can be referenced as mammalian. (_hint: you don't need to be in the same folder as your FASTA file to write your database files there, just prefix the output prefix with the relative location - e.g. db/mammalian/mammalian)

If our query sequence is nucleotide and we want to search a protein database, what BLAST application do we need to use?
(hint: look at the BLAST application table above)

With example/unknown.fa, run a BLAST search using the application in your answer above and search the database you have just created. We want a standard tabulated output file with the following additional columns

  • Full subject title
  • Query length
  • Subject length
  • Percentage query coverage

Notice in the previous tabulated output there is only the subject accession, not the full title/description. For the answer to this exercise you should look at stitle on the application help page. Also, you don't need to specify all of the standard output columns, just use std (e.g. -outfmt "6 std extra1 extra2...". Remember, as we are not using either blastn or blastp, we do not need the -task parameter.

What is our top hit? How much of our query sequence is covered by this alignment? What is the length of our top hit and where does the alignment start and finish?

So, our original question was whether TcpC is only found in bacteria. The answer is both yes and no. Looking at our answers, we could not find the whole TcpC protein in the mammalian database (see your answer to for query coverage). However, we did find a region of similarity in mammalian toll-like receptors. Biologically, this makes sense as TcpC and other bacterial protiens contain a region called a TIR domain. These domains are also found in mammalian innate immune receptors which include toll-like receptors. There is evidence to suggest that the similarity between the bacterial TcpC and mammalian immune receptor TIR domains allow the bacteria to interfere with the host immune system. And we can see much of this with a simple BLAST!

Well done, you have finished this tutorial! If you want to check your answers, click here.