What is the name of the file containing our FASTA sequences?
bacteria.fa
What type of sequences do we have in our bacteria file?
Nucleotide
What is our new BLAST database (DB) called?
bacteria.fa
How many sequences were added to our new database?
75
Was the number of sequences added to our database the same as the number of sequences in our FASTA file?
Yes
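One way to confirm this yourself is to count the header lines in the FASTA file and compare the result with the number reported by makeblastdb. The sketch below runs on a small throwaway file whose name and contents are made up for illustration, so it does not depend on the tutorial data.

```shell
# Build a tiny demo FASTA in a temporary directory (contents are invented for illustration).
tmpdir=$(mktemp -d)
printf '>seq1\nATGC\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$tmpdir/demo.fa"

# Every FASTA record starts with a '>' header line, so counting those lines counts the sequences.
n_seqs=$(grep -c '^>' "$tmpdir/demo.fa")
echo "$n_seqs"
```

Assuming bacteria.fa sits in db/bacteria alongside bacteria_tr.fa, running `grep -c '^>' db/bacteria/bacteria.fa` should report the same number that makeblastdb says it added.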
You will have noticed that the /bacteria folder also contains a file called bacteria_tr.fa. It holds FASTA sequences that also need to be converted into a BLAST database. Create a BLAST database from this file with the output prefix bacteria_prot and the title bacteria_prot.
It is up to you whether you create a logfile, but it is worth using head to check the type of sequences.
(hint: they might not be nucleotide).
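Looking at the first few lines with head is usually enough to tell the sequence type by eye. If you would like a rough automated check, the sketch below may help. The helper name guess_type and the demo files are hypothetical, and the test is only a heuristic: nucleotide files that use IUPAC ambiguity codes other than N would be reported as protein.

```shell
# Hypothetical helper: guess whether a FASTA file holds nucleotide or protein sequences.
guess_type() {
    # Skip header lines, then look for any character outside A, C, G, T, U and N.
    if grep -v '^>' "$1" | grep -q '[^ACGTUNacgtun]'; then
        echo prot
    else
        echo nucl
    fi
}

# Demo files with invented contents, written to a temporary directory.
tmpdir=$(mktemp -d)
printf '>demo_protein\nMKTLLVAGPE\n' > "$tmpdir/demo_tr.fa"
printf '>demo_nucleotide\nATGGCGTTANN\n' > "$tmpdir/demo.fa"

prot_guess=$(guess_type "$tmpdir/demo_tr.fa")
nucl_guess=$(guess_type "$tmpdir/demo.fa")
echo "$prot_guess $nucl_guess"
```

The value printed for your own file is what you would pass to the -dbtype option of makeblastdb.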
In [1]:
makeblastdb -in db/bacteria/bacteria_tr.fa -dbtype prot -title bacteria_prot -out db/bacteria/bacteria_prot -logfile db/bacteria/bacteria_prot.log
In [2]:
head db/bacteria/bacteria_prot.log
In [3]:
ls -l db/bacteria
What do you notice about the file extensions for the bacteria_prot database?
They begin with a 'p' not an 'n' (e.g. '.pin' not '.nin')
Why do you think they are different from the previous files?
Because nucleotide BLAST database files have an 'n' prefix (e.g. '.nin'), but protein BLAST database files have a 'p' prefix (e.g. '.pin')
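Since the extension encodes the database type, you can work out what kind of database a prefix points to just by checking which index file exists. This is a minimal sketch, assuming a simple single-volume database; db_type is a hypothetical helper and the demo files are empty stand-ins, not real BLAST databases.

```shell
# Empty stand-in files that only mimic the index file names of BLAST databases.
tmpdir=$(mktemp -d)
touch "$tmpdir/demo_prot.pin" "$tmpdir/demo_nucl.nin"

# Hypothetical helper: report the database type for a given output prefix.
db_type() {
    if [ -e "$1.pin" ]; then
        echo prot        # protein index file found
    elif [ -e "$1.nin" ]; then
        echo nucl        # nucleotide index file found
    else
        echo unknown     # no index file with this prefix
    fi
}

prot_type=$(db_type "$tmpdir/demo_prot")
nucl_type=$(db_type "$tmpdir/demo_nucl")
echo "$prot_type $nucl_type"
```

For the database built in this exercise, `db_type db/bacteria/bacteria_prot` would be expected to print prot.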
What percentage of our query aligns with our top hit?
100%
Is our query sequence the same length as our top hit?
Yes, they are both 924 bp
Based on the output of our blastn search, which species do you think our unknown sequence comes from? What gene might it be?
Based on the description of the top hit, our sequence is TcpC from Escherichia coli
Using mammalian.fa create a new database which has the output prefix mammalian and can be referenced as mammalian.
(hint: you don't need to be in the same folder as your FASTA file to write your database files there; just put the relative location in front of the output prefix, e.g. db/mammalian/mammalian)
In [4]:
head db/mammalian/mammalian.fa
In [5]:
makeblastdb -in db/mammalian/mammalian.fa -dbtype prot -title mammalian -out db/mammalian/mammalian -logfile db/mammalian/mammalian.log
In [6]:
ls -l db/mammalian
If our query sequence is nucleotide and we want to search a protein database, what BLAST application do we need to use?
blastx
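The choice of application follows from the type of the query and the type of the database. As a rough memory aid, the mapping can be written as a small shell function; choose_blast is a hypothetical helper for illustration and is not part of BLAST+.

```shell
# Hypothetical helper: pick a BLAST application from the query type and the
# database type, each given as "nucl" or "prot".
choose_blast() {
    case "$1:$2" in
        nucl:nucl) echo blastn ;;    # nucleotide query, nucleotide database
        prot:prot) echo blastp ;;    # protein query, protein database
        nucl:prot) echo blastx ;;    # translated nucleotide query, protein database
        prot:nucl) echo tblastn ;;   # protein query, translated nucleotide database
        *) echo "unknown sequence type" >&2; return 1 ;;
    esac
}

app=$(choose_blast nucl prot)
echo "$app"
```

tblastx, which translates both the query and the database, is left out of the sketch because it covers the same nucleotide versus nucleotide case as blastn.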
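With example/unknown.fa, run a BLAST search using the application from your answer above against the database you have just created. We want a standard tabulated output file with the following additional columns: subject title (stitle), query length (qlen), subject length (slen) and query coverage per subject (qcovs).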
In [7]:
blastx -query example/unknown.fa -db db/mammalian/mammalian -out example/blastx_mammalian.out -outfmt "6 std stitle qlen slen qcovs"
In [8]:
head example/blastx_mammalian.out
What is our top hit?
toll-like receptor 1 precursor [Homo sapiens]
How much of our query sequence is covered by this alignment?
45%
What is the length of our top hit and where does the alignment start and finish?
Our top hit is 786 amino acids in length, with the alignment covering residues 634-764.
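With the output format "6 std stitle qlen slen qcovs", the twelve standard columns come first, so sstart and send are columns 9 and 10, stitle is column 13, slen is column 15 and qcovs is column 16. The sketch below shows how awk could pull these out. It uses a single illustrative row: only stitle, slen, sstart, send and qcovs are taken from the answers above, and every NA is a placeholder rather than real BLAST output.

```shell
# One illustrative tab-separated row in the column order of
# -outfmt "6 std stitle qlen slen qcovs". Fields shown as NA are placeholders.
row=$(printf 'query\tsubject\tNA\tNA\tNA\tNA\tNA\tNA\t634\t764\tNA\tNA\ttoll-like receptor 1 precursor [Homo sapiens]\tNA\t786\t45')

# Split on tabs only, because the subject title contains spaces.
summary=$(printf '%s\n' "$row" | awk -F'\t' '{
    span = $10 - $9 + 1    # number of subject residues covered by the alignment
    printf "%s | slen=%d | sstart=%d | send=%d | span=%d | qcovs=%d%%", $13, $15, $9, $10, span, $16
}')
echo "$summary"
```

On the real results, the same awk program can be fed with `head -n 1 example/blastx_mammalian.out` to summarise the top hit.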