Part 1: Creating a BLAST database

Introduction

As mentioned before, we need two things to run a local BLAST search:

  • Your query sequence
  • A database to search

When you run BLAST online, you are offered a series of pre-formatted databases (e.g. nr/nt, refseq_rna...). You can download these databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/. This is great, but what if we want to search our query against our own set of sequences?

Your sequences will typically be in FASTA format, but BLAST cannot use a FASTA file directly as a database. This part of the tutorial will show you how to use makeblastdb to convert your FASTA sequences into a format that BLAST can use.

Storing database files

Before we get started, let's consider a bit of housekeeping. BLAST databases are typically kept in a folder called db. Within this, it is good practice to give each database its own folder. This stops you from accidentally overwriting the original files when you download a newer version of the same database, or from replacing an old database by giving a new one the same name.

To see what we mean, let's take a look at the db folder for this tutorial.


In [1]:
ls db


bacteria	mammalian
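If you are setting up a db folder of your own, the one-folder-per-database layout can be sketched as follows. This is only an illustration: the folder names mirror the ones used in this tutorial, and the temporary working directory is an assumption made so that the example does not touch your real files.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)

# One top-level db folder, with one sub-folder per database.
mkdir -p "$workdir/db/bacteria" "$workdir/db/mammalian"

# List the sub-folders, as we did with "ls db" above.
listing=$(ls "$workdir/db")
echo "$listing"

# Remove the throwaway directory again.
rm -rf "$workdir"
```

In practice you would replace the temporary directory with the location where you want to keep your databases.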

In this part of the tutorial, we are going to create a BLAST database from a set of FASTA-formatted bacterial sequences, which can be found in the bacteria folder (db/bacteria/bacteria.fa). Let's take a closer look.


In [2]:
cd db/bacteria
ls


bacteria.fa	bacteria_tr.fa

In [3]:
head bacteria.fa


>KJ596549.1 Vibrio cholerae strain VC55 toxin corregulated pilus (tcpA) gene, complete cds
ATGCAATTATTAAAACAGCTTTTTAAGAAGAAGTTTGTAAAAGAAGAACACGATAAGAAAACCGGTCAAG
AGGGTATGACATTACTCGAAGTAATCATTGTTCTGGGTATTATGGGTGTGGTCTCAGCGGGTGTTGTTAC
GCTGGCTCAGCGTGCGATTGATTCGCAGAATATGACTAAGGCTGCGCAAAATCTAAACAGCGTGCAAATT
GCAATGACACAAACTTATCGTAGTCTTGGTAATTATCCAGCTACCGCAAACGCAAGTGCTGCTACACAGC
TAGCTAATGGTTTGGTCAGCCTTGGTAAGGTTTCAGCTGATGAGGCAAAGAATCCTTTCACTGGTACAGC
TATGGGGATTTTCTCATTTCCACGAAACTCTGCAGCGAATAAAGCATTCGCAATTACAGTCGGTGGCTTG
ACCCAAGCACAATGTAAGACTTTGGTTACAAGCGTAGGGGATATGTTTCCATTTATCAACGTGAAAGAAG
GTGCTTTCGCTGCTGTCGCTGATCTTGGTGATTTCGAAACGAGTGTCGCAGATGCTGCTACTGGCGCTGG
CGTAATTAAGTCCATTGCACCAGGAAGTGCCAACTTAAACCTAACTAATATCACGCATGTTGAGAAGCTT

What is the name of the file containing our FASTA sequences?
hint: it will have the file extension .fa or .fasta

What type of sequences do we have in our bacteria file?
hint: are they nucleotide or protein?
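If you cannot tell the sequence type by eye, a rough check can be scripted. The sketch below is a heuristic, not part of BLAST: it counts the sequence characters that are not A, C, G, T, U or N. A count of 0 suggests nucleotide sequences, while a large count suggests protein. Nucleotide files that use other IUPAC ambiguity codes would also give a non-zero count, so treat the result as a hint only. The two example files and the function name non_nucl_count are made up for illustration.

```shell
# Create two tiny example files (hypothetical data) in a temporary directory.
tmpdir=$(mktemp -d)
printf '>seq1 example nucleotide record\nATGCAATTATTAAAACAG\n' > "$tmpdir/example_nucl.fa"
printf '>seq2 example protein record\nMQLLKQLFKKKFVKEEHD\n' > "$tmpdir/example_prot.fa"

# Count sequence characters that are not A, C, G, T, U or N.
# Header lines (starting with ">") and line endings are ignored.
non_nucl_count() {
    grep -v '^>' "$1" | tr -d 'ACGTUNacgtun\n\r' | wc -c | tr -d ' '
}

nucl_result=$(non_nucl_count "$tmpdir/example_nucl.fa")
prot_result=$(non_nucl_count "$tmpdir/example_prot.fa")

echo "example_nucl.fa: $nucl_result non-nucleotide characters"
echo "example_prot.fa: $prot_result non-nucleotide characters"

rm -rf "$tmpdir"
```

To try it on the tutorial data, call non_nucl_count with bacteria.fa or bacteria_tr.fa in place of the example files.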

Creating a BLAST database

To create a BLAST database from our FASTA sequences, we use the makeblastdb application. Information about the different parameters we can give to makeblastdb can be found by typing makeblastdb -help.

However, there are two parameters we must always give to makeblastdb: the location of our input file and the type of sequences it contains.

Parameter   Meaning
-in         The location of the file containing your FASTA sequences.
-dbtype     The type of sequences in your database (nucl for nucleotide or prot for protein).

Using these parameters, the command we need will take the format:

makeblastdb -in [input file] -dbtype [nucl or prot]

Using the answers from the previous section and the information above, let's try creating our BLAST database.


In [4]:
makeblastdb -in bacteria.fa -dbtype nucl



Building a new DB, current time: 11/08/2016 14:12:59
New DB name:   bacteria.fa
New DB title:  bacteria.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 75 sequences in 0.0082469 seconds.

Using the output generated from our command, try and answer the following:

What is our new BLAST database (DB) called?

How many sequences were added to our new database?

To check that the number of sequences added to the new database matches the number of sequences in our FASTA file, we can use grep to count the header lines, each of which starts with >.


In [5]:
grep -c '>' bacteria.fa


75

Was the number of sequences added to our database the same as the number of sequences in our FASTA file?
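The same comparison can be scripted. The sketch below builds a tiny FASTA file (hypothetical data), counts its header lines with grep, and, only if the BLAST+ tools makeblastdb and blastdbcmd are found on your PATH, builds a database from it and prints the database summary with blastdbcmd -info so the two numbers can be compared by eye. The file names example.fa and example.log are assumptions for illustration.

```shell
# Create a small example FASTA file with two records (hypothetical data).
tmpdir=$(mktemp -d)
fasta="$tmpdir/example.fa"
printf '>seq1 first example record\nATGCAATTATT\n>seq2 second example record\nGGGTATGACAT\n' > "$fasta"

# Count the records: every FASTA record starts with a ">" header line.
fasta_count=$(grep -c '^>' "$fasta")
echo "FASTA records: $fasta_count"

# If BLAST+ is installed, build the database and print its summary,
# which includes the number of sequences it contains.
if command -v makeblastdb >/dev/null 2>&1 && command -v blastdbcmd >/dev/null 2>&1; then
    makeblastdb -in "$fasta" -dbtype nucl -logfile "$tmpdir/example.log"
    blastdbcmd -db "$fasta" -info
else
    echo "BLAST+ not found on PATH; skipping the database check"
fi
```

Anchoring the pattern with ^ means only lines that begin with > are counted.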

Now let's take a look at the files we have created.


In [6]:
ls -l


total 248
-rwxrwxr-x@ 1 vo1  1662  66352  4 Nov 16:15 bacteria.fa
-rw-r--r--  1 vo1  1662  11007  8 Nov 14:12 bacteria.fa.nhr
-rw-r--r--  1 vo1  1662    976  8 Nov 14:12 bacteria.fa.nin
-rw-r--r--  1 vo1  1662  14799  8 Nov 14:12 bacteria.fa.nsq
-rw-r--r--@ 1 vo1  1662  22812  7 Nov 15:58 bacteria_tr.fa

You will notice that three new files have been created, with the file extensions .nhr, .nin and .nsq. You don't need to worry about the details of these files, but in general the .nhr file holds the headers, the .nin file the index and the .nsq file the sequences.
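If you want a script to confirm that a database was written, one option is to check for these three files. The sketch below is an assumption about how you might do that, not a BLAST+ feature: check_db_files is a made-up helper that takes a database prefix and reports which of the .nhr, .nin and .nsq files exist. Newer BLAST+ releases may write additional files alongside these, which this check simply ignores.

```shell
# Report which of the three nucleotide database files exist for a prefix.
# Returns the number of missing files (0 means all three were found).
check_db_files() {
    prefix=$1
    missing=0
    for ext in nhr nin nsq; do
        if [ -f "$prefix.$ext" ]; then
            echo "found:   $prefix.$ext"
        else
            echo "missing: $prefix.$ext"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Check the database we built from bacteria.fa in the current folder.
check_db_files bacteria.fa || echo "some database files are missing"
```

Run it from the db/bacteria folder after the makeblastdb command above.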

Naming databases and creating logfiles

In the previous section we created a database using only the required parameters. However, there are several other parameters which can be very useful.

Parameter   Meaning
-title      A descriptive title for the database.
-out        The prefix for your output database files (i.e. database.nin, database.nhr...). This is the name BLAST uses to find the database.
-logfile    The file in which to write all command output and errors.

Let's take a look at what these parameters actually do. The following command generates a BLAST database called bacteria_nucl from the FASTA sequences stored in bacteria.fa. BLAST can then refer to the database as bacteria_nucl, and all command-line output is written to bacteria_nucl.log.


In [7]:
makeblastdb -in bacteria.fa -dbtype nucl -title bacteria_nucl -out bacteria_nucl -logfile bacteria_nucl.log



Did you notice that this time there was no output (such as Building a new DB, ...)? It has all been written to bacteria_nucl.log. Let's take a look.


In [8]:
head bacteria_nucl.log



Building a new DB, current time: 11/08/2016 14:13:00
New DB name:   bacteria_nucl
New DB title:  bacteria_nucl
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 75 sequences in 0.00582099 seconds.

Let's also take a look at the database files generated.


In [9]:
ls -l


total 320
-rwxrwxr-x@ 1 vo1  1662  66352  4 Nov 16:15 bacteria.fa
-rw-r--r--  1 vo1  1662  11007  8 Nov 14:12 bacteria.fa.nhr
-rw-r--r--  1 vo1  1662    976  8 Nov 14:12 bacteria.fa.nin
-rw-r--r--  1 vo1  1662  14799  8 Nov 14:12 bacteria.fa.nsq
-rw-r--r--  1 vo1  1662    272  8 Nov 14:13 bacteria_nucl.log
-rw-r--r--  1 vo1  1662  11007  8 Nov 14:13 bacteria_nucl.nhr
-rw-r--r--  1 vo1  1662    984  8 Nov 14:13 bacteria_nucl.nin
-rw-r--r--  1 vo1  1662  14799  8 Nov 14:13 bacteria_nucl.nsq
-rw-r--r--@ 1 vo1  1662  22812  7 Nov 15:58 bacteria_tr.fa

Here you can see that the files created by our first command, which used only the required parameters, have the prefix bacteria.fa. This is because, by default, -out takes the same value as -in (see makeblastdb -help). In the second command we changed this by giving a simpler prefix, -out bacteria_nucl. This can be very useful when you have complex file names but want a simpler or more descriptive database name.
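When you script this, the database prefix can be derived from the FASTA file name instead of being typed by hand. The sketch below shows one way to do that with shell parameter expansion. The _nucl suffix is just the naming convention used in this tutorial, and the makeblastdb call is only attempted if the program and the input file are actually present.

```shell
# Derive a database prefix from the FASTA file name.
fasta="bacteria.fa"
prefix="${fasta%.fa}_nucl"     # bacteria.fa -> bacteria_nucl
echo "Database prefix: $prefix"

# Build the database only if makeblastdb and the input file are available.
if command -v makeblastdb >/dev/null 2>&1 && [ -f "$fasta" ]; then
    makeblastdb -in "$fasta" -dbtype nucl -title "$prefix" -out "$prefix" -logfile "$prefix.log"
else
    echo "makeblastdb or $fasta not available; command not run"
fi
```

The expansion ${fasta%.fa} strips a trailing .fa, so a file with a different extension would need a different pattern.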

Exercise 1

You will have noticed that the bacteria folder also contains a file called bacteria_tr.fa. This file also holds FASTA sequences that need to be converted into a BLAST database. Create a BLAST database from this file with the output prefix bacteria_prot and the title bacteria_prot.

It is up to you whether you create a logfile, but it is worth using head to check the type of sequences first.
(hint: they might not be nucleotide).

What do you notice about the file extensions for the bacteria_prot database?
(hint: use ls -l)

Why do you think they are different from the previous files?
(hint: sequence type)

Summary

We have created two BLAST databases, one nucleotide (bacteria_nucl) and one protein (bacteria_prot), each containing 75 bacterial sequences. We will use them in the next part of the tutorial, which covers how to run a BLAST search.