Lesser Blast job

I downloaded the bacterial and archaeal RefSeq datasets into /mnt/data2/lesser/prok/blastdb and built a BLAST database from them.


In [ ]:
cd blastdb
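# Quote the URL so the local shell does not try to expand the wildcard; wget does the FTP globbing itself
wget 'ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria.*.1.genomic.fna.gz'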
gzip -d *gz
cat *fna | makeblastdb -in - -out prok -dbtype nucl -title prok
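
An alternative to decompressing on disk is to stream the compressed files straight into makeblastdb, which already reads standard input via `-in -` in the command above. This is a minimal sketch, assuming the `.fna.gz` files are still compressed. `build_nucl_db` and the `MAKEBLASTDB` override variable are hypothetical names introduced here for illustration and are not part of BLAST+.

```shell
# Hypothetical helper: build_nucl_db DB_NAME FILE.fna.gz [FILE.fna.gz ...]
# Streams the gzipped FASTA files into makeblastdb without writing
# uncompressed copies to disk. MAKEBLASTDB is an assumed override hook
# (defaults to the real makeblastdb on PATH).
build_nucl_db() {
    db_name="$1"
    shift
    gzip -dc "$@" | "${MAKEBLASTDB:-makeblastdb}" \
        -in - -out "$db_name" -dbtype nucl -title "$db_name"
}
```

Called as `build_nucl_db prok bacteria.*.1.genomic.fna.gz`, this should produce the same database as the gzip/cat steps, at the cost of not keeping the uncompressed FASTA around for later checks.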

I ran blastn with settings that mimic those used by Blast2GO. The output is BLAST XML (-outfmt 5), which should import into Blast2GO.


In [6]:
%%bash

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/share/bin:/share/include:/share/snap/:/share/RepeatMasker:/share/maker/bin/:/share/bwa:\
/share/trinityrnaseq_r20140717:/share/Trimmomatic-0.32:/share/geneid/bin:\
/share/CEGMA_v2/bin/:/share/TransDecoder_r20140704:/share/Fastool:/share/barrnap-0.4.2/bin:\
/share/bcftools:/home/macmanes/omega_new/omegaMap:/share/angsd:/share/tophat-2.0.9.Linux_x86_64:\
/share/cufflinks-2.2.1.Linux_x86_64:/share/samtools:/share/bedtools2/bin:\
/share/RepeatModeler/RepeatModeler:\
$HOME/.rvm/bin:/home/macmanes/.rvm/gems/ruby-2.1.2/bin:\
/home/macmanes/.rvm/gems/ruby-2.1.2@global/bin:\
/home/macmanes/.rvm/rubies/ruby-2.1.2/bin:\
/home/macmanes/bin:/share/bless"

blastn -max_target_seqs 1 -query prok_only_contigs_CF_testN.fasta -db blastdb/prok -evalue 1e-04 -num_threads 25 -outfmt 5 > bact_arch.blast5

In [7]:
%%bash
cd blastdb
wget 'ftp://ftp.ncbi.nlm.nih.gov/refseq/release/invertebrate/inver*genomic.fna.gz'
# Already inside blastdb after the cd above, so the path is relative to it
gzip -d *gz

In [9]:
%%bash
cd blastdb

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/share/bin:/share/include:/share/snap/:/share/RepeatMasker:/share/maker/bin/:/share/bwa:\
/share/trinityrnaseq_r20140717:/share/Trimmomatic-0.32:/share/geneid/bin:\
/share/CEGMA_v2/bin/:/share/TransDecoder_r20140704:/share/Fastool:/share/barrnap-0.4.2/bin:\
/share/bcftools:/home/macmanes/omega_new/omegaMap:/share/angsd:/share/tophat-2.0.9.Linux_x86_64:\
/share/cufflinks-2.2.1.Linux_x86_64:/share/samtools:/share/bedtools2/bin:\
/share/RepeatModeler/RepeatModeler:\
$HOME/.rvm/bin:/home/macmanes/.rvm/gems/ruby-2.1.2/bin:\
/home/macmanes/.rvm/gems/ruby-2.1.2@global/bin:\
/home/macmanes/.rvm/rubies/ruby-2.1.2/bin:\
/home/macmanes/bin:/share/bless"

cat invert*fna | makeblastdb -in - -out invert -dbtype nucl -title invert
cd ../



Building a new DB, current time: 08/21/2014 12:24:48
New DB name:   invert
New DB title:  invert
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 996532 sequences in 668.545 seconds.
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 63% ambiguous nucleotides (shouldn't be over 40%)
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 52% ambiguous nucleotides (shouldn't be over 40%)
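
The sequence count reported by makeblastdb can be cross-checked against the number of FASTA header lines in the input. This is a small sketch, and `count_fasta_records` is a hypothetical helper name, not a BLAST+ tool.

```shell
# Hypothetical helper: count_fasta_records FILE [FILE ...]
# Prints the total number of FASTA records (lines starting with '>')
# across all given files.
count_fasta_records() {
    # cat first so grep reports one total rather than per-file counts;
    # '|| true' keeps the exit status zero when the count is 0.
    cat "$@" | grep -c '^>' || true
}
```

Run inside blastdb as `count_fasta_records invert*fna`, the result should match the "added ... sequences" line in the log if every record was loaded.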

In [19]:
%%bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/share/bin:/share/include:/share/snap/:/share/RepeatMasker:/share/maker/bin/:/share/bwa:\
/share/trinityrnaseq_r20140717:/share/Trimmomatic-0.32:/share/geneid/bin:\
/share/CEGMA_v2/bin/:/share/TransDecoder_r20140704:/share/Fastool:/share/barrnap-0.4.2/bin:\
/share/bcftools:/home/macmanes/omega_new/omegaMap:/share/angsd:/share/tophat-2.0.9.Linux_x86_64:\
/share/cufflinks-2.2.1.Linux_x86_64:/share/samtools:/share/bedtools2/bin:\
/share/RepeatModeler/RepeatModeler:\
$HOME/.rvm/bin:/home/macmanes/.rvm/gems/ruby-2.1.2/bin:\
/home/macmanes/.rvm/gems/ruby-2.1.2@global/bin:\
/home/macmanes/.rvm/rubies/ruby-2.1.2/bin:\
/home/macmanes/bin:/share/bless"
blastn -max_target_seqs 1 -query prok_only_contigs_CF_testN.fasta -db blastdb/invert -evalue 1e-04 -num_threads 25 -outfmt 5 > invert.blast5
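
The prokaryote and invertebrate searches differ only in the database and the output file, so the shared settings could be kept in one place. This is a sketch under that assumption. `run_blastn` and the `BLASTN` and `QUERY` override variables are hypothetical names introduced here, while the blastn flags are the ones already used in this notebook.

```shell
# Hypothetical helper: run_blastn DB_PREFIX OUT_FILE
# Runs blastn with the Blast2GO-like settings used in this notebook.
# BLASTN and QUERY are assumed override hooks; by default the real
# blastn on PATH and the notebook's query file are used.
run_blastn() {
    "${BLASTN:-blastn}" -max_target_seqs 1 \
        -query "${QUERY:-prok_only_contigs_CF_testN.fasta}" \
        -db "$1" -evalue 1e-04 -num_threads 25 -outfmt 5 > "$2"
}
```

With this in place the two searches would read `run_blastn blastdb/prok bact_arch.blast5` and `run_blastn blastdb/invert invert.blast5`.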

In [20]:
head invert.blast5


<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.2.29+</BlastOutput_version>
  <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), &quot;A greedy algorithm for aligning DNA sequences&quot;, J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
  <BlastOutput_db>blastdb/invert</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>10009_consensus</BlastOutput_query-def>
  <BlastOutput_query-len>551</BlastOutput_query-len>
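
Beyond looking at the head of the file, a quick summary of the XML is the number of queries searched versus the number of hits reported. This is a rough sketch that assumes the standard BLAST XML layout, where each query is an `<Iteration>` element and each subject match is a `<Hit>` element on its own line. `count_queries` and `count_hits` are hypothetical helper names.

```shell
# Hypothetical helpers for a BLAST XML (-outfmt 5) file.
# Both rely on the opening tags appearing once per element, one per line.

# Number of queries searched (one <Iteration> per query).
count_queries() {
    grep -c '<Iteration>' "$1" || true
}

# Number of hits reported (one <Hit> per matched subject sequence).
count_hits() {
    grep -c '<Hit>' "$1" || true
}
```

Because the search used `-max_target_seqs 1`, `count_hits invert.blast5` should also approximate the number of queries with at least one hit, which can be compared with `count_queries invert.blast5`.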