Run a blast and specify output format 6. Depending on the size of your input fasta file and the memory on your system, you probably want to split up the input fasta and/or also blast against the component parts of the nr/refseq on a cluster and then combine the output. E.g. to blast against each part of nr on a Slurm
cluster:
Slurm
job script for individual blast, example.00.sh:
In [ ]:
#!/usr/bin/env bash
#SBATCH --job-name=example_nr.00
#SBATCH --mail-user=your@email
#SBATCH --mail-type=all
#SBATCH --mem=3072
#SBATCH --cpus-per-task=12
#SBATCH --time=2-00:00:00
blastx -query example.fna -db nr.00 -num_threads 12 -evalue 1e-3 -max_target_seqs 1 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send stitle staxids sscinames evalue" -out example.fna.nr.00.outfmt6
Make a Slurm
job for each part of the nr blast database (at the time of writing nr is available from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ in 32 parts):
In [ ]:
%%bash
for i in {01..32}
do
sed "s/nr.00/nr.$i/g" example.00.sh > example.$i.sh
done
for i in *sh; do sbatch $i; done
Concatenate all blast results:
In [1]:
%%bash
cat *outfmt6 > outfmt6.all
For each 'gene', keep the blast result with the smallest e-value. This relies on the format of the current (v2.0.6, e.g. 'TR10000|c0_g1_i1') Trinity
fasta headers:
In [2]:
%%bash
sort -t _ -k1,2 outfmt6.all | sort -t$'\t' -k14,14g | sort -t _ -k1,2 -u > outfmt6.all.uniq
In [3]:
import pandas
import matplotlib
import matplotlib.pyplot as plt
In [4]:
def blast_in(in_file):
blast_data = pandas.read_table(in_file,header = None)
blast_data = pandas.DataFrame(blast_data)
return blast_data
def genus(blast_data):
desc = blast_data[tax_field]
desc = desc.str.split(' ').str.get(0)
desc = desc.str.split(';').str.get(0)
desc = desc.value_counts()
topfreq_desc = desc[:num_tophits]
return topfreq_desc
def plot_out(topfreq_desc):
fig = plt.figure()
topfreq_desc.plot(kind='barh')
plt.title("Top blast hits")
plt.xlabel("Number of best hits")
plt.ylabel("Taxa")
fig.tight_layout()
Define the field (column) containing the taxa ids (zero-based index) and number of taxa to plot (one-based index):
In [5]:
tax_field = int(12)
num_tophits = int(15)
In [6]:
%matplotlib inline
blast_data = blast_in("outfmt6.all.uniq")
topfreq_desc = genus(blast_data)
plot_out(topfreq_desc)
Or to a plot, after merging the blast output and filtering for best hits:
In [ ]:
%%bash
python blastplus_taxa_barh.py outfmt6.all.uniq 13 15 taxa_plot.png
Or if blastplus_taxa_barh.py
is in your PYTHONPATH
then:
In [4]:
from blastplus_taxa_barh import *
In [7]:
%matplotlib inline
blast_data = blast_in("outfmt6.all.uniq")
topfreq_desc = genus(blast_data, int(12),int(15))
plot_in(topfreq_desc)
In [9]:
blast_data.head()
Out[9]:
In [11]:
topfreq_desc
Out[11]:
In [12]:
plot_in(topfreq_desc)