Description:

  • This notebook goes through the analysis of blasting spacers vs a database containing potential protospacers.
  • Any blast datbase can be used.
  • A script is provided to easily blast all of the genomes in CLdb.

Before running this notebook:

User-defined variables


In [40]:
# directory where you want the spacer blasting to be done
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Methanosarcina/"

Init


In [41]:
import os
from IPython.display import FileLinks
%load_ext rpy2.ipython


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

In [42]:
blastDir = os.path.join(workDir, 'arrayBlast')
if not os.path.isdir(blastDir):
    os.makedirs(blastDir)

Blast vs all E. coli genomes in the CLdb

Spacer blast

Selecting spacer sequences


In [43]:
!cd $blastDir; \
    CLdb -- array2fasta -cluster -cut 1 > spacers_cut1.fna

## getting number of spacer sequences
!printf 'Number of unique spacers:\t'
!cd $blastDir; \
    grep -c ">" spacers_cut1.fna; \
    echo; \
    head -n 6 spacers_cut1.fna


Number of unique spacers:	1104

>NA|spacer|NA|628|1
TTTTACCGGGGATGTGACAGTCAGGGAATACGGCAT
>NA|spacer|NA|804|1
TACCCAACTGTTCAAATTTCAAAGCCACTACGTGA
>NA|spacer|NA|923|1
TACTGTTCGGCGGTTCTTCGCATTGTTGCGTCAAT

Blast

  • The array blast commands are sub-sub commands.
  • SO, that's why when you type: 'CLdb -- arrayBlast -h', you get then following:

In [44]:
!CLdb -- arrayBlast -h


Usage:
    arrayBlast [options] subcommand -- [subcommand_options]

  Options:
    --list
        List all subcommands.

    --perldoc
        Get perldoc of subcommand.

    -v Verbose output
    -h This help message

  For more information:
    CLdb --perldoc -- arrayBlast


In [45]:
# listing all arrayBlast subcommands
!CLdb -- arrayBlast --list


add_crRNA
add_proto
align_proto
filter_arrayHits
get_PAM
get_SEEDstats
get_align
get_proto
run
srl2csv
srl2json
xml2json
xml2srl

Sub-sub-command help:

CLdb arrayBlast -- run -- -h


In [54]:
# spacer blast
## Really just a thin wrapper around blastn that will blast the spacers against all genomes in CLdb
### If you want to blast some else, just run the blast yourself (use 'perldoc CLdb_arrayBlast.pl' for more info)

!cd $blastDir; \
    CLdb -- arrayBlast -- run \
    -query spacers_cut1.fna \
    -fork 10 \
    > spacers_cut1_blastn.xml
    
# NOTE: using 10 CPUs


Making BLAST databases...
...making blast databases in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db/
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_acetivorans_C2A.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_barkeri_str_fusaro.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_horonobensis_HB_1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_C16.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_Go1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_LYC.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_S_6.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_SarPi.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_WWM610.fasta

Blasting sequences...
Number of blast hits to Methanosarcina_mazei_Go1.fasta:	28622
Number of blast hits to Methanosarcina_mazei_S_6.fasta:	27769
Number of blast hits to Methanosarcina_mazei_SarPi.fasta:	33535
Number of blast hits to Methanosarcina_mazei_C16.fasta:	29628
Number of blast hits to Methanosarcina_mazei_LYC.fasta:	31244
Number of blast hits to Methanosarcina_mazei_WWM610.fasta:	28519
Number of blast hits to Methanosarcina_barkeri_str_fusaro.fasta:	25491
Number of blast hits to Methanosarcina_horonobensis_HB_1.fasta:	28834
Number of blast hits to Methanosarcina_acetivorans_C2A.fasta:	24641

Note:

  • Using spacer clustering cutoff of 1, so just 'unique' spacer sequences

In [55]:
# converting .xml to .srl
## This retains all of the info in the xml, but produces a smaller file that is faster to load and write.
## For large blast runs (xml files > 100 Mb), this will take a little while

!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    spacers_cut1_blastn.xml \
    > spacers_cut1_blastn.srl

# checking output file
!cd $blastDir; \
    ls -thlc spacers_cut1_blastn.*


-rw-rw-r-- 1 nyoungb2 nyoungb2 2.5M Jan  3 20:21 spacers_cut1_blastn.srl
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.6M Jan  3 20:20 spacers_cut1_blastn.xml

Note:

  • You could pipe the output from CLdb -- arrayBlast -- run directly into CLdb -- arrayBlast -- xml2srl
    • Example: CLdb -- arrayBlast -- run -query spacers_cut1.fna | CLdb -- arrayBlast -- xml2srl

DR Blast

  • BLASTing DR sequences against the same subjects.
  • This is needed to filter out spacer blast hits to CRISPR arrays.

In [56]:
# selecting DRs, so we can blast them.
## This is needed blast filter out spacers that hit CRISPR arrays

!cd $blastDir; \
    CLdb -- array2fasta \
    -cluster -cut 1 -r \
    > DRs_cut1.fna

## getting number of spacer sequences
!printf 'Number of unique DRs:\t'
!cd $blastDir; \
    grep -c ">" DRs_cut1.fna; \
    echo; \
    head -n 6 DRs_cut1.fna


Number of unique DRs:	38

>NA|DR|NA|5|1
GTTTCAATCCTTGTTTTAGTGGATCTTGCTCTCGAAT
>NA|DR|NA|20|1
ATTCGTGAGCATGATCCATTAAGAGCCTATTCGAAAA
>NA|DR|NA|27|1
GCTTCAATTCTGCCACAACCTTTCGGTTATGGAAAC

In [57]:
# DR blast, just like spacer blast

!cd $blastDir; \
    CLdb -- arrayBlast -- run \
    -query DRs_cut1.fna \
    -fork 10 \
    > DRs_cut1_blastn.xml


Making BLAST databases...
...making blast databases in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db/
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_acetivorans_C2A.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_barkeri_str_fusaro.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_horonobensis_HB_1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_C16.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_Go1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_LYC.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_S_6.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_SarPi.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_WWM610.fasta

Blasting sequences...
Number of blast hits to Methanosarcina_barkeri_str_fusaro.fasta:	17138
Number of blast hits to Methanosarcina_acetivorans_C2A.fasta:	21463
Number of blast hits to Methanosarcina_mazei_S_6.fasta:	32573
Number of blast hits to Methanosarcina_horonobensis_HB_1.fasta:	40840
Number of blast hits to Methanosarcina_mazei_Go1.fasta:	47480
Number of blast hits to Methanosarcina_mazei_C16.fasta:	64637
Number of blast hits to Methanosarcina_mazei_WWM610.fasta:	47727
Number of blast hits to Methanosarcina_mazei_LYC.fasta:	79088
Number of blast hits to Methanosarcina_mazei_SarPi.fasta:	86164

In [58]:
# converting .xml to .srl
## This retains all of the info in the xml, but produces a smaller file that is faster to load and write.
## For large blast runs (xml files > 100 Mb), this will take a little while

!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    DRs_cut1_blastn.xml \
    > DRs_cut1_blastn.srl


# checking output file
!cd $blastDir; \
    ls -thlc DRs_cut1_blastn.*


-rw-rw-r-- 1 nyoungb2 nyoungb2 5.4M Jan  3 20:23 DRs_cut1_blastn.srl
-rw-rw-r-- 1 nyoungb2 nyoungb2  17M Jan  3 20:22 DRs_cut1_blastn.xml

Filtering out spacer hits to CRISPRs

  • Based on DR hits falling adjacent to spacer hits

In [59]:
# filtering spacer hits to arrays
## This is based on the premise that DR hits adjacent to spacer hits signify that the spacer is hitting a CRISPR array

!cd $blastDir; \
    CLdb -- arrayBlast -- filter_arrayHits \
    spacers_cut1_blastn.srl \
    DRs_cut1_blastn.srl \
    > spacers_cut1_blastn_filt.srl
    
    
# checking output file
!cd $blastDir; \
    echo; \
    ls -thlc spacers_cut1_blastn_filt.srl


### Summary for filtering of 'bad' DR hits ###
Total DR hits: 22538
DR hits with E-values < 10: 0
DR hits with hit lengths (fraction of total) < 0.66: 0
DR hits used for filtering: 22538
### DR filtering of spacer blast hits ###
Total spacer blast hits: 2585
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting an array defined just by DR-blast hits: 2414
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting a mobile genetic element: 171

-rw-rw-r-- 1 nyoungb2 nyoungb2 1.9M Jan  3 20:23 spacers_cut1_blastn_filt.srl

Converting blast .srl to .csv


In [63]:
!cd $blastDir; \
    CLdb -- arrayBlast -- srl2csv \
    spacers_cut1_blastn_filt.srl \
    > spacers_cut1_blastn_filt.txt
    
# assessing table of blast hits    
!cd $blastDir; \
    head spacers_cut1_blastn_filt.txt


Decoding the srl file...
qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore	
NA|spacer|NA|841|1	NC_007355	95.6521739130435	23	undef	5	10	32	3027136	3027114	0.000347095	38.1576
NA|spacer|NA|465|1	NC_007355	96.2962962962963	27	undef	5	5	31	857001	856975	1.48854e-06	46.087
NA|spacer|NA|465|1	NC_007355	100	19	undef	5	13	31	4703448	4703466	0.000362872	38.1576
NA|spacer|NA|465|1	NC_007355	100	19	undef	5	13	31	1036282	1036264	0.000362872	38.1576
NA|spacer|NA|280|1	NC_007355	95.8333333333333	24	undef	5	2	25	3935219	3935242	9.58273e-05	40.14
NA|spacer|NA|722|1	NC_007355	100	19	undef	5	3	21	3312027	3312009	0.000347095	38.1576
NA|spacer|NA|184|1	NC_007355	91.8918918918919	37	undef	5	1	37	2848699	2848735	9.9483e-08	50.0517
NA|spacer|NA|446|1	NC_007355	88.8888888888889	36	undef	5	1	36	3417664	3417699	9.18345e-05	40.14
NA|spacer|NA|446|1	NC_007355	88.8888888888889	36	undef	5	1	36	3928350	3928315	9.18345e-05	40.14

In [64]:
#!printf "Number of blast hits: "
nhits = !cd $blastDir; wc -l spacers_cut1_blastn_filt.txt
nhits = nhits[0].split(' ')[0]
print "Number of blast hits: {}".format(int(nhits) - 1)


Number of blast hits: 171

Conclusions

  • We have some BLAST hits!
  • Let's add some information to the blast results

adding crRNA info

  • Adding crRNA region to .srl (actually the DNA template, refering to it as 'crDNA')
    • How much of the adjacent DR sequences are included in the crDNA is determined by the user (default: 10bp on either side)

In [65]:
!cd $blastDir; \
    CLdb -- arrayBlast -- add_crRNA \
    < spacers_cut1_blastn_filt.srl \
    > spacers_cut1_blastn_filt_crDNA.srl
   
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_blastn_filt_crDNA.srl


Decoding .srl file...
Extracting all queries with blast hits from blast hit file...
...Number of total unique query IDs:	1103
Getting info from CLdb on each blast query spacer...
Getting spacer regions from each genome...
 Processing genome: Methanosarcina_mazei_Go1.fasta...
 Processing genome: Methanosarcina_mazei_SarPi.fasta...
 Processing genome: Methanosarcina_acetivorans_C2A.fasta...
 Processing genome: Methanosarcina_barkeri_str_fusaro.fasta...
 Processing genome: Methanosarcina_mazei_WWM610.fasta...
 Processing genome: Methanosarcina_horonobensis_HB_1.fasta...
 Processing genome: Methanosarcina_mazei_S_6.fasta...
 Processing genome: Methanosarcina_mazei_LYC.fasta...
 Processing genome: Methanosarcina_mazei_C16.fasta...

-rw-rw-r-- 1 nyoungb2 nyoungb2 2.3M Jan  3 20:24 spacers_cut1_blastn_filt_crDNA.srl

Adding protospacer info

  • Similar to adding crRNA info, but getting information from the blast database(s)

In [67]:
!cd $blastDir; \
    CLdb -- arrayBlast -- add_proto -workers 24 \
    < spacers_cut1_blastn_filt_crDNA.srl \
    > spacers_cut1_blastn_filt_crDNA_proto.srl   
    
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_blastn_filt_crDNA_proto.srl


Retrieving proto info from blast db: 'Methanosarcina_mazei_LYC.fasta'
Retrieving proto info from blast db: 'Methanosarcina_acetivorans_C2A.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_Go1.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_WWM610.fasta'
Retrieving proto info from blast db: 'Methanosarcina_horonobensis_HB_1.fasta'
Retrieving proto info from blast db: 'Methanosarcina_barkeri_str_fusaro.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_C16.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_SarPi.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_S_6.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_LYC.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_S_6.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_C16.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_Go1.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_horonobensis_HB_1.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_acetivorans_C2A.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_barkeri_str_fusaro.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_WWM610.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_SarPi.fasta'

-rw-rw-r-- 1 nyoungb2 nyoungb2 2.3M Jan  3 20:24 spacers_cut1_blastn_filt_crDNA_proto.srl

Aligning crRNA & protospacer

  • Alignment added to the .srl file.
  • These are individual alignments between the spacer and protospacer (using clustalw).

In [68]:
# aligning crRNA and protospacer

!cd $blastDir; \
    CLdb -- arrayBlast -- align_proto \
    < spacers_cut1_blastn_filt_crDNA_proto.srl \
    > spacers_cut1_blastn_filt_crDNA_proto_aln.srl
    
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_blastn_filt_crDNA_proto_aln.srl


Aligning protospacers from blast db: 'Methanosarcina_mazei_S_6.fasta'
Aligning protospacers from blast db: 'Methanosarcina_acetivorans_C2A.fasta'
Aligning protospacers from blast db: 'Methanosarcina_barkeri_str_fusaro.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_LYC.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_C16.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_SarPi.fasta'
Aligning protospacers from blast db: 'Methanosarcina_horonobensis_HB_1.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_Go1.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_WWM610.fasta'

-rw-rw-r-- 1 nyoungb2 nyoungb2 2.3M Jan  3 20:25 spacers_cut1_blastn_filt_crDNA_proto_aln.srl

Getting alignments (fasta)

  • You can parse out the alignments you want (parse by subtype, taxon_name, etc.)
    • OR you can parse them after writing (add subtype, taxon_name, e-value, etc. to the sequence name)
  • The alignments will be oriented to the crDNA (5'-3' for the crRNA)
  • See CLdb -- arrayBlast --perldoc -- get_align for info on what -outfmt does

In [69]:
# getting alignment of protospacer and crDNA and writing as a fasta

!cd $blastDir; \
    CLdb -- arrayBlast -- get_align \
    -outfmt taxon_name,subtype,evalue \
    < spacers_cut1_blastn_filt_crDNA_proto_aln.srl \
    > blastn_crDNA-proto_aln.fna
    

## checking output
!cd $blastDir; echo; \
    ls -thlc blastn_crDNA-proto_aln.fna


Decoding .srl file...
Getting alignments...

-rw-rw-r-- 1 nyoungb2 nyoungb2 35K Jan  3 20:25 blastn_crDNA-proto_aln.fna

In [70]:
# let's take a look at the sequences 
!cd $blastDir; \
    head -n 10 blastn_crDNA-proto_aln.fna


>crDNA|17|56|1ebv46LCvRJMH|0.000298084
tgctcgcgaaTCCCTGGTTTATGTCCTGCT.CGTCAAAATCTATGTTgtttcaatcc
>proto|17|56|1ebv46LCvRJMH|0.000298084
.agttaattacACCTGGTTTATGTCCTGCTCCGATTACCGCAACAAtggttttgtt.
>crDNA|17|89|1ebv46LCvRJMH|0.000298084
tgctcgcgaaTCCCTGGTTTATGTCCTGCT.CGTCAAAATCTATGTTgtttcaatcc
>proto|17|89|1ebv46LCvRJMH|0.000298084
.agttaattacACCTGGTTTATGTCCTGCTCCGATTACCGCAACAAtggttttgtt.
>crDNA|7|7|1eX1MJpd4Qk75|0.000325182
aggattgaaaCTTCTTCGATTGTTTTTATGCCGTT.TTTATCAAAAATAgtcgagaagt
  • The alignment is really just for each crDNA-protospacer pair (every 4 lines)
  • First, the crRNA (sequence name starts with "crRNA")
  • Second, the protospacer (sequence name start with "proto")
  • Example:

    • >crDNA
      ATAGACA
      >proto
      ATAGACA
  • Note: the lower case letters are sequence adjacent to the actual protospacer

If you wanted each crRNA-protospacer pair individually, you could use the split bash command like this:

WARNING: this produces a lot of files, so we will make them in a new directory


In [71]:
# file/directory
inFile = os.path.join(blastDir, 'blastn_crDNA-proto_aln.fna')
outDir = os.path.join(blastDir, 'blastn_crDNA-proto_aln_split')
if not os.path.isdir(outDir):
    os.makedirs(outDir)
outPre = os.path.join(outDir, 'split')    

# spliting alignment
!split -l 4 -d -a 4 $inFile $outPre

# checking out files
!printf 'Number of files produced: '
!find $outDir -name "split*" | wc -l


Number of files produced: 179

Getting PAMs

  • Looking at the alignments can be very useful, but it can help to extract what we specifically want and summarize.
    • Let's pull out the potential PAM regions from the alignment
    • We are going to assume that the PAM is:
      • 4bp long
      • 1bp away from the 5' end of the protospacer ('right' side of the alignment)

In [72]:
!cd $blastDir; \
    CLdb -- arrayBlast -- get_PAM -PAM 1 4 -f - \
    < blastn_crDNA-proto_aln.fna \
    > blastn_crDNA-proto_aln_PAM.fna


## checking output
!cd $blastDir; echo; \
    head blastn_crDNA-proto_aln_PAM.fna


>proto|18|37|uWdGwRtuvgdp|3.13923e-13
taat
>proto|18|74|1vQhyRa3ZkVOT|1.31536e-06
taaa
>proto|10|18|1VdAVgxl34ddZ|3.90827e-07
taaa
>proto|17|33|u4GEEvxNLPSL|0.000307282
atat
>proto|11|78|1beRkenSGXl|0.000347095
tata

Summary table of each unique PAM sequence


In [73]:
!cd $blastDir; \
    egrep -v "^>" blastn_crDNA-proto_aln_PAM.fna | \
    sort | uniq -c | sort -k 1 -n -r


     42 taaa
     31 ttaa
     12 tggt
      7 taac
      7 ctgc
      7 cagg
      6 tatt
      6 ccgg
      5 tccg
      5 taat
      5 gggc
      5 atat
      5 aaaa
      4 tgaa
      4 caat
      3 aagg
      2 tgac
      2 gaca
      2 ataa
      1 ttgc
      1 tctg
      1 tcca
      1 tcaa
      1 tatg
      1 tata
      1 taag
      1 gggg
      1 gctg
      1 cgca
      1 ccgc
      1 caag
      1 attg
      1 atga
      1 agtc
      1 aggg
      1 agac
      1 aatt
      1 aaga

Spacer (crDNA) - protospacer mismatch distribution

  • Let's look at potential spacer targeting by assessing mismatches of the crDNAs and protospacers.
  • This script will make 2 summary files of mismatches for each crDNA-protospacer alignment
  • Given that spacers can vary in length, the alignment is relative to a user-defined 'SEED' region.
  • I'm going to use the default SEED region: '-SEED -8 -1'
    • This SEED region covers the last 8 bp of the alignment

In [74]:
!cd $blastDir; \
    CLdb -- arrayBlast -- get_SEEDstats \
    --SEED -8 -1 \
    -prefix blastn_crDNA-proto_aln \
    < blastn_crDNA-proto_aln.fna 
    
## checking output files
!cd $blastDir; echo; \
    ls -thlc blastn_crDNA-proto_aln_SEEDstats*.txt


SEED summary file written: 'blastn_crDNA-proto_aln_SEEDstats-summary.txt'
SEED by-position table written: 'blastn_crDNA-proto_aln_SEEDstats-byPos.txt'

-rw-rw-r-- 1 nyoungb2 nyoungb2 3.3K Jan  3 20:28 blastn_crDNA-proto_aln_SEEDstats-byPos.txt
-rw-rw-r-- 1 nyoungb2 nyoungb2  55K Jan  3 20:28 blastn_crDNA-proto_aln_SEEDstats-summary.txt

In [75]:
# checking files
!cd $blastDir; \
    head -n 6 blastn_crDNA-proto_aln_SEEDstats-byPos.txt
    
print '-' * 50


!cd $blastDir; \
    head -n 6 blastn_crDNA-proto_aln_SEEDstats-summary.txt


region	pos_rel_SEED	pos_rel_align	count	mismatches	mismatch_norm
SEED	1	10	179	94	0.525139664804469
SEED	2	11	179	48	0.268156424581006
SEED	3	12	179	38	0.212290502793296
SEED	4	13	179	65	0.363128491620112
SEED	5	14	179	63	0.35195530726257
--------------------------------------------------
alignment	crDNA	protospacer	region	count	mismatches	mismatch_norm
9	crDNA|11|78|1beRkenSGXl|0.000347095	proto|11|78|1beRkenSGXl|0.000347095	protospacer	40	10	0.25
9	crDNA|11|78|1beRkenSGXl|0.000347095	proto|11|78|1beRkenSGXl|0.000347095	nonSEED	32	8	0.25
9	crDNA|11|78|1beRkenSGXl|0.000347095	proto|11|78|1beRkenSGXl|0.000347095	SEED	10	2	0.2
164	crDNA|11|49|2A8bCEvyJhoO3|6.57658e-12	proto|11|49|2A8bCEvyJhoO3|6.57658e-12	SEED	10	2	0.2
164	crDNA|11|49|2A8bCEvyJhoO3|6.57658e-12	proto|11|49|2A8bCEvyJhoO3|6.57658e-12	protospacer	39	5	0.128205128205128

Plotting mismatches


In [76]:
%%R
library(ggplot2)
library(gridExtra)

In [77]:
%%R -i blastDir -w 800

# loading table
infile = file.path(blastDir, 'blastn_crDNA-proto_aln_SEEDstats-byPos.txt')

tbl = read.delim(infile)
tbl = tbl[tbl$region != 'protospacer', ]

# mismatches and counts, & normalized mismatches
p.mis = ggplot(tbl, aes(pos_rel_SEED, mismatches, group=region, fill=region)) +
    geom_bar(stat='identity', position='dodge') +
    labs(y='Number\nof mismatches', x='Alignment position relative to SEED start')
p.count = ggplot(tbl, aes(pos_rel_SEED, count, group=region, fill=region)) +  
    geom_bar(stat='identity', position='dodge') +
    labs(y='Number of alignments\nwith nucleotide\nat position', x='Alignment position relative to SEED start')
p.norm = ggplot(tbl, aes(pos_rel_SEED, mismatch_norm, group=region, fill=region)) +
  geom_bar(stat='identity', position='dodge') +
    labs(y='Mismatchs /\nalignments', x='Alignment position relative to SEED start')

grid.arrange(p.count, p.mis, p.norm, ncol=1)


Notes on plots:

  • The top plot shows that some spacer-protospacer are not quiet as long as the rest, which is why the 'Number of alignment with nucleotide at position' drops at the higher positions (> ~30).
  • There are fewer mismatches in the middle of the spacer-protospacer alignments.

Array Blast vs NCBI's nt database

  • See the E. coli example for conducting a spacer blast vs the nt database.

In [ ]: