Parse taxon names from Trinity blast results

Run a blast and specify output format 6. Depending on the size of your input fasta file and the memory on your system, you probably want to split up the input fasta and/or also blast against the component parts of the nr/refseq on a cluster and then combine the output. E.g. to blast against each part of nr on a Slurm cluster:

Slurm job script for individual blast, example.00.sh:


In [ ]:
#!/usr/bin/env bash
#SBATCH --job-name=example_nr.00
#SBATCH --mail-user=your@email
#SBATCH --mail-type=all
#SBATCH --mem=3072
#SBATCH --cpus-per-task=12
#SBATCH --time=2-00:00:00
blastx -query example.fna -db nr.00 -num_threads 12 -evalue 1e-3 -max_target_seqs 1 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send stitle staxids sscinames evalue" -out example.fna.nr.00.outfmt6

Make a Slurm job for each part of the nr blast database (at the time of writing nr is available from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ in 32 parts):


In [ ]:
%%bash
for i in {01..32}
do
  sed "s/nr.00/nr.$i/g" example.00.sh > example.$i.sh
done
for i in *sh; do sbatch $i; done

Concatenate all blast results:


In [1]:
%%bash
cat *outfmt6 > outfmt6.all

For each 'gene', keep the blast result with the smallest e-value. This relies on the format of the current (v2.0.6, e.g. 'TR10000|c0_g1_i1') Trinity fasta headers:


In [2]:
%%bash
sort -t _ -k1,2 outfmt6.all | sort -t$'\t' -k14,14g | sort -t _ -k1,2 -u > outfmt6.all.uniq

In [3]:
import pandas
import matplotlib
import matplotlib.pyplot as plt

In [4]:
def blast_in(in_file):
  blast_data = pandas.read_table(in_file,header = None)
  blast_data = pandas.DataFrame(blast_data)
  return blast_data

def genus(blast_data):
  desc = blast_data[tax_field]
  desc = desc.str.split(' ').str.get(0)
  desc = desc.str.split(';').str.get(0)
  desc = desc.value_counts()
  topfreq_desc = desc[:num_tophits]
  return topfreq_desc

def plot_out(topfreq_desc):
  fig = plt.figure()
  topfreq_desc.plot(kind='barh')
  plt.title("Top blast hits")
  plt.xlabel("Number of best hits")
  plt.ylabel("Taxa")
  fig.tight_layout()

Define the field (column) containing the taxa ids (zero-based index) and number of taxa to plot (one-based index):


In [5]:
tax_field = int(12)
num_tophits = int(15)

In [6]:
%matplotlib inline
blast_data = blast_in("outfmt6.all.uniq")
topfreq_desc = genus(blast_data)
plot_out(topfreq_desc)


Or to a plot, after merging the blast output and filtering for best hits:


In [ ]:
%%bash
python blastplus_taxa_barh.py outfmt6.all.uniq 13 15 taxa_plot.png

Or if blastplus_taxa_barh.py is in your PYTHONPATH then:


In [4]:
from blastplus_taxa_barh import *

In [7]:
%matplotlib inline
blast_data = blast_in("outfmt6.all.uniq")
topfreq_desc = genus(blast_data, int(12),int(15))
plot_in(topfreq_desc)



In [9]:
blast_data.head()


Out[9]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 TR10144|c0_g1_i1 gi|544770924|ref|WP_021196208.1| 100 20 0 0 60 1 64 83 terpene utilization protein AtuA [Ralstonia sp... 1235457 Ralstonia sp. AU12-08 0.001000
1 TR10186|c0_g1_i1 gi|490789651|ref|WP_004651801.1| 100 19 0 0 3 59 391 409 hypothetical protein [Acinetobacter sp. ANC 3994] 1217715 Acinetobacter sp. ANC 3994 0.000020
2 TR10270|c0_g1_i1 gi|224103219|ref|XP_002312970.1| 95 20 1 0 60 1 505 524 predicted protein [Populus trichocarpa] 3694 Populus trichocarpa 0.000004
3 TR10309|c0_g1_i1 gi|255576756|ref|XP_002529265.1| 95 20 1 0 1 60 57 76 60S ribosomal protein L13, putative [Ricinus c... 3988 Ricinus communis 0.000007
4 TR10330|c0_g1_i1 gi|554633737|ref|WP_023168797.1| 100 19 0 0 58 2 205 223 bifunctional aconitate hydratase 2/2-methyliso... 28901 Salmonella enterica 0.000100

In [11]:
topfreq_desc


Out[11]:
Escherichia    96
Ralstonia      79
Solanum        75
Populus        62
Arabidopsis    61
Glycine        37
Vitis          34
Cucumis        31
Ricinus        29
Medicago       29
Salmonella     25
Oryza          23
Cicer          16
Capsella       16
Fragaria       15
dtype: int64

In [12]:
plot_in(topfreq_desc)