Answers to tutorial questions and exercises

Part 1: Creating a BLAST database

General questions

What is the name of the file containing our FASTA sequences?
bacteria.fa

What type of sequences do we have in our bacteria file?
Nucleotide

What is our new BLAST database (DB) called?
bacteria.fa

How many sequences were added to our new database?
75

Was the number of sequences added to our database the same as the number of sequences in our FASTA file?
Yes

Exercise 1

You will have noticed that there is also a file in the /bacteria folder called bacteria_tr.fa which also contains FASTA sequences which need to be converted into a BLAST database. Create a BLAST database from this file which has the output prefix bacteria_prot and can be referenced using the title bacteria_prot.

It is up to you whether you create a logfile but it is worth using head to check the type of sequences.
(hint: they might not be nucleotide).


In [1]:
makeblastdb -in db/bacteria/bacteria_tr.fa -dbtype prot -title bacteria_prot -out db/bacteria/bacteria_prot -logfile db/bacteria/bacteria_prot.log




In [2]:
head db/bacteria/bacteria_prot.log



Building a new DB, current time: 11/08/2016 15:12:26
New DB name:   db/bacteria/bacteria_prot
New DB title:  bacteria_prot
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 75 sequences in 0.004632 seconds.

In [3]:
ls -l .


total 120
drwxrwxr-x  6 vo1  1662    204  8 Nov 15:11 db
drwxrwxr-x  9 vo1  1662    306  8 Nov 15:10 example
-rw-rw-r--  1 vo1  1662  13063  8 Nov 14:14 format_database.ipynb
-rw-r--r--  1 vo1  1662   7723  8 Nov 15:11 general_question_and_exercise_answers.ipynb
-rw-r--r--  1 vo1  1662   3947  8 Nov 14:12 index.ipynb
-rw-r--r--  1 vo1  1662  30232  8 Nov 15:10 run_blast.ipynb

What do you notice about the file extensions for the bacteria_prot database?
They begin with a 'p' not an 'n' (e.g. '.pin' not '.nin')

Why do you think they are different from the previous files?
Because nucleotide BLAST database files have an 'n' prefix (e.g. '.nin'), but protein BLAST database files have a 'p' prefix (e.g. '.pin')

General questions

What percentage of our query aligns with our top hit?
100%

Is our query sequence the same length as our top hit?
Yes, they are both 924 bp

Based on the output of our blastn search, which species do you think our unknown sequence comes from? What gene might it be?
Based on the description of the top hit, our sequence is TcpC from Escherichia coli

Exercise 2

Using mammalian.fa create a new database which has the output prefix mammalian and can be referenced as mammalian.
(hint: you don't need to be in the same folder as your FASTA file to write your database files there, just prefix the output prefix with the relative location - e.g. db/mammalian/mammalian)


In [4]:
head db/mammalian/mammalian.fa


>NP_067272.1 toll-like receptor 4 precursor [Mus musculus]
MMPPWLLARTLIMALFFSCLTPGSLNPCIEVVPNITYQCMDQKLSKVPDDIPSSTKNIDLSFNPLKILKS
YSFSNFSELQWLDLSRCEIETIEDKAWHGLHHLSNLILTGNPIQSFSPGSFSGLTSLENLVAVETKLASL
ESFPIGQLITLKKLNVAHNFIHSCKLPAYFSNLTNLVHVDLSYNYIQTITVNDLQFLRENPQVNLSLDMS
LNPIDFIQDQAFQGIKLHELTLRGNFNSSNIMKTCLQNLAGLHVHRLILGEFKDERNLEIFEPSIMEGLC
DVTIDEFRLTYTNDFSDDIVKFHCLANVSAMSLAGVSIKYLEDVPKHFKWQSLSIIRCQLKQFPTLDLPF
LKSLTLTMNKGSISFKKVALPSLSYLDLSRNALSFSGCCSYSDLGTNSLRHLDLSFNGAIIMSANFMGLE
ELQHLDFQHSTLKRVTEFSAFLSLEKLLYLDISYTNTKIDFDGIFLGLTSLNTLKMAGNSFKDNTLSNVF
ANTTNLTFLDLSKCQLEQISWGVFDTLHRLQLLNMSHNNLLFLDSSHYNQLYSLSTLDCSFNRIETSKGI
LQHFPKSLAFFNLTNNSVACICEHQKFLQWVKEQKQFLVNVEQMTCATPVEMNTSLVLDFNNSTCYMYKT

In [5]:
makeblastdb -in db/mammalian/mammalian.fa -dbtype prot -title mammalian -out db/mammalian/mammalian -logfile db/mammalian/mammalian.log




In [6]:
ls -l db/mammalian


total 112
-rwxrwxr-x  1 vo1  1662  21425  4 Nov 16:24 mammalian.fa
-rw-r--r--  1 vo1  1662    274  8 Nov 15:12 mammalian.log
-rw-r--r--  1 vo1  1662   2634  8 Nov 15:12 mammalian.phr
-rw-r--r--  1 vo1  1662    248  8 Nov 15:12 mammalian.pin
-rw-r--r--  1 vo1  1662  19841  8 Nov 15:12 mammalian.psq

If our query sequence is nucleotide and we want to search a protein database, what BLAST application do we need to use?
blastx

With example/unknown.fa, run a BLAST search using the application in your answer above and search the database you have just created. We want a standard tabulated output file with the following additional columns

  • Full subject title
  • Query length
  • Subject length
  • Percentage query coverage

In [7]:
blastx -query example/unknown.fa -db db/mammalian/mammalian -out example/blastx_mammalian.out -outfmt "6 std stitle qlen slen qcovs"




In [8]:
head example/blastx_mammalian.out


my_unknown_sequence	NP_003254.2	27.48	131	94	1	502	891	634	764	4e-07	42.7	NP_003254.2 toll-like receptor 1 precursor [Homo sapiens]	924	786	45
my_unknown_sequence	NP_003254.2	70.00	10	3	0	72	101	349	358	8.9	19.2	NP_003254.2 toll-like receptor 1 precursor [Homo sapiens]	924	786	45
my_unknown_sequence	NP_006059.2	26.72	131	95	1	502	891	639	769	6e-06	38.9	NP_006059.2 toll-like receptor 6 precursor [Homo sapiens]	924	796	42
my_unknown_sequence	NP_035734.3	25.64	78	58	0	502	735	650	727	2e-04	33.9	NP_035734.3 toll-like receptor 6 [Mus musculus]	924	806	25
my_unknown_sequence	NP_112218.2	27.71	83	58	1	490	732	627	709	3e-04	33.5	NP_112218.2 toll-like receptor 10 isoform a [Homo sapiens]	924	811	26
my_unknown_sequence	NP_109607.1	26.92	78	57	0	502	735	637	714	4e-04	33.1	NP_109607.1 toll-like receptor 1 precursor [Mus musculus]	924	795	25
my_unknown_sequence	NP_003255.2	26.92	104	67	3	433	738	622	718	8e-04	32.3	NP_003255.2 toll-like receptor 2 precursor [Homo sapiens]	924	784	33
my_unknown_sequence	NP_036035.3	25.96	104	68	2	433	738	622	718	0.001	31.6	NP_036035.3 toll-like receptor 2 precursor [Mus musculus]	924	784	33
my_unknown_sequence	NP_991388.2	22.22	81	58	1	508	735	778	858	0.078	25.8	NP_991388.2 toll-like receptor 11 [Mus musculus]	924	931	25

What is our top hit?
toll-like receptor 1 precursor [Homo sapiens]

How much of our query sequence is covered by this alignment?
45%

What is the length of our top hit and where does the alignment start and finish? Our top his is 786 amino acids in length with the alignment covering residues 634-764