Parse taxon names from `Trinity` blast results

Run a blast and specify output format 6. Depending on the size of your input fasta file and the memory on your system, you probably want to split up the input fasta and/or also blast against the component parts of the nr/refseq on a cluster and then combine the output. E.g. to blast against each part of nr on a Slurm cluster:

Slurm job script for individual blast, example.00.sh:



In [ ]:

    
#!/usr/bin/env bash
#SBATCH --job-name=example_nr.00
#SBATCH --mail-user=your@email
#SBATCH --mail-type=all
#SBATCH --mem=3072
#SBATCH --cpus-per-task=12
#SBATCH --time=2-00:00:00
blastx -query example.fna -db nr.00 -num_threads 12 -evalue 1e-3 -max_target_seqs 1 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send stitle staxids sscinames evalue" -out example.fna.nr.00.outfmt6

Make a Slurm job for each part of the nr blast database (at the time of writing nr is available from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ in 32 parts):



In [ ]:

    
%%bash
for i in {01..32}
do
  sed "s/nr.00/nr.$i/g" example.00.sh > example.$i.sh
done
for i in *sh; do sbatch $i; done

Concatenate all blast results:



In [1]:

    
%%bash
cat *outfmt6 > outfmt6.all

For each 'gene', keep the blast result with the smallest e-value. This relies on the format of the current (v2.0.6, e.g. 'TR10000|c0_g1_i1') Trinity fasta headers:



In [2]:

    
%%bash
sort -t _ -k1,2 outfmt6.all | sort -t$'\t' -k14,14g | sort -t _ -k1,2 -u > outfmt6.all.uniq



In [3]:

    
import pandas
import matplotlib
import matplotlib.pyplot as plt



In [4]:

    
def blast_in(in_file):
  blast_data = pandas.read_table(in_file,header = None)
  blast_data = pandas.DataFrame(blast_data)
  return blast_data

def genus(blast_data):
  desc = blast_data[tax_field]
  desc = desc.str.split(' ').str.get(0)
  desc = desc.str.split(';').str.get(0)
  desc = desc.value_counts()
  topfreq_desc = desc[:num_tophits]
  return topfreq_desc

def plot_out(topfreq_desc):
  fig = plt.figure()
  topfreq_desc.plot(kind='barh')
  plt.title("Top blast hits")
  plt.xlabel("Number of best hits")
  plt.ylabel("Taxa")
  fig.tight_layout()

Define the field (column) containing the taxa ids (zero-based index) and number of taxa to plot (one-based index):



In [5]:

    
tax_field = int(12)
num_tophits = int(15)



In [6]:

    
%matplotlib inline
blast_data = blast_in("outfmt6.all.uniq")
topfreq_desc = genus(blast_data)
plot_out(topfreq_desc)

Or to a plot, after merging the blast output and filtering for best hits:



In [ ]:

    
%%bash
python blastplus_taxa_barh.py outfmt6.all.uniq 13 15 taxa_plot.png

Or if blastplus_taxa_barh.py is in your PYTHONPATH then:



In [4]:

    
from blastplus_taxa_barh import *



In [7]:

    
%matplotlib inline
blast_data = blast_in("outfmt6.all.uniq")
topfreq_desc = genus(blast_data, int(12),int(15))
plot_in(topfreq_desc)



In [9]:

    
blast_data.head()









    Out[9]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
    
  
  
    
      0
       TR10144|c0_g1_i1
       gi|544770924|ref|WP_021196208.1|
       100
       20
       0
       0
       60
        1
        64
        83
       terpene utilization protein AtuA [Ralstonia sp...
       1235457
            Ralstonia sp. AU12-08
       0.001000
    
    
      1
       TR10186|c0_g1_i1
       gi|490789651|ref|WP_004651801.1|
       100
       19
       0
       0
        3
       59
       391
       409
       hypothetical protein [Acinetobacter sp. ANC 3994]
       1217715
       Acinetobacter sp. ANC 3994
       0.000020
    
    
      2
       TR10270|c0_g1_i1
       gi|224103219|ref|XP_002312970.1|
        95
       20
       1
       0
       60
        1
       505
       524
                 predicted protein [Populus trichocarpa]
          3694
              Populus trichocarpa
       0.000004
    
    
      3
       TR10309|c0_g1_i1
       gi|255576756|ref|XP_002529265.1|
        95
       20
       1
       0
        1
       60
        57
        76
       60S ribosomal protein L13, putative [Ricinus c...
          3988
                 Ricinus communis
       0.000007
    
    
      4
       TR10330|c0_g1_i1
       gi|554633737|ref|WP_023168797.1|
       100
       19
       0
       0
       58
        2
       205
       223
       bifunctional aconitate hydratase 2/2-methyliso...
         28901
              Salmonella enterica
       0.000100



In [11]:

    
topfreq_desc









    Out[11]:





Escherichia    96
Ralstonia      79
Solanum        75
Populus        62
Arabidopsis    61
Glycine        37
Vitis          34
Cucumis        31
Ricinus        29
Medicago       29
Salmonella     25
Oryza          23
Cicer          16
Capsella       16
Fragaria       15
dtype: int64



In [12]:

    
plot_in(topfreq_desc)

	0	1	2	3	4	6	7	8	9	10	11	12	13
0	TR10144\|c0_g1_i1	gi\|544770924\|ref\|WP_021196208.1\|	100	20	0	60	1	64	83	terpene utilization protein AtuA [Ralstonia sp...	1235457	Ralstonia sp. AU12-08	0.001000
1	TR10186\|c0_g1_i1	gi\|490789651\|ref\|WP_004651801.1\|	100	19	0	3	59	391	409	hypothetical protein [Acinetobacter sp. ANC 3994]	1217715	Acinetobacter sp. ANC 3994	0.000020
2	TR10270\|c0_g1_i1	gi\|224103219\|ref\|XP_002312970.1\|	95	20	1	60	1	505	524	predicted protein [Populus trichocarpa]	3694	Populus trichocarpa	0.000004
3	TR10309\|c0_g1_i1	gi\|255576756\|ref\|XP_002529265.1\|	95	20	1	1	60	57	76	60S ribosomal protein L13, putative [Ricinus c...	3988	Ricinus communis	0.000007
4	TR10330\|c0_g1_i1	gi\|554633737\|ref\|WP_023168797.1\|	100	19	0	58	2	205	223	bifunctional aconitate hydratase 2/2-methyliso...	28901	Salmonella enterica	0.000100

Parse taxon names from Trinity blast results

Parse taxon names from `Trinity` blast results