Description:

This notebook goes through the analysis of blasting spacers vs a database containing potential protospacers.
Any blast datbase can be used.
A script is provided to easily blast all of the genomes in CLdb.

Before running this notebook:

run the Setup notebook

User-defined variables



In [40]:

    
# directory where you want the spacer blasting to be done
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Methanosarcina/"

Init



In [41]:

    
import os
from IPython.display import FileLinks
%load_ext rpy2.ipython









    



The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython



In [42]:

    
blastDir = os.path.join(workDir, 'arrayBlast')
if not os.path.isdir(blastDir):
    os.makedirs(blastDir)

Blast vs all E. coli genomes in the CLdb

Spacer blast

Selecting spacer sequences



In [43]:

    
!cd $blastDir; \
    CLdb -- array2fasta -cluster -cut 1 > spacers_cut1.fna

## getting number of spacer sequences
!printf 'Number of unique spacers:\t'
!cd $blastDir; \
    grep -c ">" spacers_cut1.fna; \
    echo; \
    head -n 6 spacers_cut1.fna









    



Number of unique spacers:	1104

>NA|spacer|NA|628|1
TTTTACCGGGGATGTGACAGTCAGGGAATACGGCAT
>NA|spacer|NA|804|1
TACCCAACTGTTCAAATTTCAAAGCCACTACGTGA
>NA|spacer|NA|923|1
TACTGTTCGGCGGTTCTTCGCATTGTTGCGTCAAT

Blast

The array blast commands are sub-sub commands.
SO, that's why when you type: 'CLdb -- arrayBlast -h', you get then following:



In [44]:

    
!CLdb -- arrayBlast -h









    



Usage:
    arrayBlast [options] subcommand -- [subcommand_options]

  Options:
    --list
        List all subcommands.

    --perldoc
        Get perldoc of subcommand.

    -v Verbose output
    -h This help message

  For more information:
    CLdb --perldoc -- arrayBlast



In [45]:

    
# listing all arrayBlast subcommands
!CLdb -- arrayBlast --list









    



add_crRNA
add_proto
align_proto
filter_arrayHits
get_PAM
get_SEEDstats
get_align
get_proto
run
srl2csv
srl2json
xml2json
xml2srl

Sub-sub-command help:

CLdb arrayBlast -- run -- -h



In [54]:

    
# spacer blast
## Really just a thin wrapper around blastn that will blast the spacers against all genomes in CLdb
### If you want to blast some else, just run the blast yourself (use 'perldoc CLdb_arrayBlast.pl' for more info)

!cd $blastDir; \
    CLdb -- arrayBlast -- run \
    -query spacers_cut1.fna \
    -fork 10 \
    > spacers_cut1_blastn.xml
    
# NOTE: using 10 CPUs









    



Making BLAST databases...
...making blast databases in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db/
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_acetivorans_C2A.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_barkeri_str_fusaro.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_horonobensis_HB_1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_C16.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_Go1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_LYC.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_S_6.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_SarPi.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_WWM610.fasta

Blasting sequences...
Number of blast hits to Methanosarcina_mazei_Go1.fasta:	28622
Number of blast hits to Methanosarcina_mazei_S_6.fasta:	27769
Number of blast hits to Methanosarcina_mazei_SarPi.fasta:	33535
Number of blast hits to Methanosarcina_mazei_C16.fasta:	29628
Number of blast hits to Methanosarcina_mazei_LYC.fasta:	31244
Number of blast hits to Methanosarcina_mazei_WWM610.fasta:	28519
Number of blast hits to Methanosarcina_barkeri_str_fusaro.fasta:	25491
Number of blast hits to Methanosarcina_horonobensis_HB_1.fasta:	28834
Number of blast hits to Methanosarcina_acetivorans_C2A.fasta:	24641

Note:

Using spacer clustering cutoff of 1, so just 'unique' spacer sequences



In [55]:

    
# converting .xml to .srl
## This retains all of the info in the xml, but produces a smaller file that is faster to load and write.
## For large blast runs (xml files > 100 Mb), this will take a little while

!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    spacers_cut1_blastn.xml \
    > spacers_cut1_blastn.srl

# checking output file
!cd $blastDir; \
    ls -thlc spacers_cut1_blastn.*









    



-rw-rw-r-- 1 nyoungb2 nyoungb2 2.5M Jan  3 20:21 spacers_cut1_blastn.srl
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.6M Jan  3 20:20 spacers_cut1_blastn.xml

Note:

You could pipe the output from CLdb -- arrayBlast -- run directly into CLdb -- arrayBlast -- xml2srl
- Example: CLdb -- arrayBlast -- run -query spacers_cut1.fna | CLdb -- arrayBlast -- xml2srl

DR Blast

BLASTing DR sequences against the same subjects.
This is needed to filter out spacer blast hits to CRISPR arrays.



In [56]:

    
# selecting DRs, so we can blast them.
## This is needed blast filter out spacers that hit CRISPR arrays

!cd $blastDir; \
    CLdb -- array2fasta \
    -cluster -cut 1 -r \
    > DRs_cut1.fna

## getting number of spacer sequences
!printf 'Number of unique DRs:\t'
!cd $blastDir; \
    grep -c ">" DRs_cut1.fna; \
    echo; \
    head -n 6 DRs_cut1.fna









    



Number of unique DRs:	38

>NA|DR|NA|5|1
GTTTCAATCCTTGTTTTAGTGGATCTTGCTCTCGAAT
>NA|DR|NA|20|1
ATTCGTGAGCATGATCCATTAAGAGCCTATTCGAAAA
>NA|DR|NA|27|1
GCTTCAATTCTGCCACAACCTTTCGGTTATGGAAAC



In [57]:

    
# DR blast, just like spacer blast

!cd $blastDir; \
    CLdb -- arrayBlast -- run \
    -query DRs_cut1.fna \
    -fork 10 \
    > DRs_cut1_blastn.xml









    



Making BLAST databases...
...making blast databases in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db/
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_acetivorans_C2A.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_barkeri_str_fusaro.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_horonobensis_HB_1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_C16.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_Go1.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_LYC.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_S_6.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_SarPi.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Methanosarcina//arrayBlast/blast_db//Methanosarcina_mazei_WWM610.fasta

Blasting sequences...
Number of blast hits to Methanosarcina_barkeri_str_fusaro.fasta:	17138
Number of blast hits to Methanosarcina_acetivorans_C2A.fasta:	21463
Number of blast hits to Methanosarcina_mazei_S_6.fasta:	32573
Number of blast hits to Methanosarcina_horonobensis_HB_1.fasta:	40840
Number of blast hits to Methanosarcina_mazei_Go1.fasta:	47480
Number of blast hits to Methanosarcina_mazei_C16.fasta:	64637
Number of blast hits to Methanosarcina_mazei_WWM610.fasta:	47727
Number of blast hits to Methanosarcina_mazei_LYC.fasta:	79088
Number of blast hits to Methanosarcina_mazei_SarPi.fasta:	86164



In [58]:

    
# converting .xml to .srl
## This retains all of the info in the xml, but produces a smaller file that is faster to load and write.
## For large blast runs (xml files > 100 Mb), this will take a little while

!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    DRs_cut1_blastn.xml \
    > DRs_cut1_blastn.srl


# checking output file
!cd $blastDir; \
    ls -thlc DRs_cut1_blastn.*









    



-rw-rw-r-- 1 nyoungb2 nyoungb2 5.4M Jan  3 20:23 DRs_cut1_blastn.srl
-rw-rw-r-- 1 nyoungb2 nyoungb2  17M Jan  3 20:22 DRs_cut1_blastn.xml

Filtering out spacer hits to CRISPRs

Based on DR hits falling adjacent to spacer hits



In [59]:

    
# filtering spacer hits to arrays
## This is based on the premise that DR hits adjacent to spacer hits signify that the spacer is hitting a CRISPR array

!cd $blastDir; \
    CLdb -- arrayBlast -- filter_arrayHits \
    spacers_cut1_blastn.srl \
    DRs_cut1_blastn.srl \
    > spacers_cut1_blastn_filt.srl
    
    
# checking output file
!cd $blastDir; \
    echo; \
    ls -thlc spacers_cut1_blastn_filt.srl









    



### Summary for filtering of 'bad' DR hits ###
Total DR hits: 22538
DR hits with E-values < 10: 0
DR hits with hit lengths (fraction of total) < 0.66: 0
DR hits used for filtering: 22538
### DR filtering of spacer blast hits ###
Total spacer blast hits: 2585
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting an array defined just by DR-blast hits: 2414
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting a mobile genetic element: 171

-rw-rw-r-- 1 nyoungb2 nyoungb2 1.9M Jan  3 20:23 spacers_cut1_blastn_filt.srl

Converting blast .srl to .csv



In [63]:

    
!cd $blastDir; \
    CLdb -- arrayBlast -- srl2csv \
    spacers_cut1_blastn_filt.srl \
    > spacers_cut1_blastn_filt.txt
    
# assessing table of blast hits    
!cd $blastDir; \
    head spacers_cut1_blastn_filt.txt









    



Decoding the srl file...
qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore	
NA|spacer|NA|841|1	NC_007355	95.6521739130435	23	undef	5	10	32	3027136	3027114	0.000347095	38.1576
NA|spacer|NA|465|1	NC_007355	96.2962962962963	27	undef	5	5	31	857001	856975	1.48854e-06	46.087
NA|spacer|NA|465|1	NC_007355	100	19	undef	5	13	31	4703448	4703466	0.000362872	38.1576
NA|spacer|NA|465|1	NC_007355	100	19	undef	5	13	31	1036282	1036264	0.000362872	38.1576
NA|spacer|NA|280|1	NC_007355	95.8333333333333	24	undef	5	2	25	3935219	3935242	9.58273e-05	40.14
NA|spacer|NA|722|1	NC_007355	100	19	undef	5	3	21	3312027	3312009	0.000347095	38.1576
NA|spacer|NA|184|1	NC_007355	91.8918918918919	37	undef	5	1	37	2848699	2848735	9.9483e-08	50.0517
NA|spacer|NA|446|1	NC_007355	88.8888888888889	36	undef	5	1	36	3417664	3417699	9.18345e-05	40.14
NA|spacer|NA|446|1	NC_007355	88.8888888888889	36	undef	5	1	36	3928350	3928315	9.18345e-05	40.14



In [64]:

    
#!printf "Number of blast hits: "
nhits = !cd $blastDir; wc -l spacers_cut1_blastn_filt.txt
nhits = nhits[0].split(' ')[0]
print "Number of blast hits: {}".format(int(nhits) - 1)









    



Number of blast hits: 171

Conclusions

We have some BLAST hits!
Let's add some information to the blast results

adding crRNA info

Adding crRNA region to .srl (actually the DNA template, refering to it as 'crDNA')
- How much of the adjacent DR sequences are included in the crDNA is determined by the user (default: 10bp on either side)



In [65]:

    
!cd $blastDir; \
    CLdb -- arrayBlast -- add_crRNA \
    < spacers_cut1_blastn_filt.srl \
    > spacers_cut1_blastn_filt_crDNA.srl
   
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_blastn_filt_crDNA.srl









    



Decoding .srl file...
Extracting all queries with blast hits from blast hit file...
...Number of total unique query IDs:	1103
Getting info from CLdb on each blast query spacer...
Getting spacer regions from each genome...
 Processing genome: Methanosarcina_mazei_Go1.fasta...
 Processing genome: Methanosarcina_mazei_SarPi.fasta...
 Processing genome: Methanosarcina_acetivorans_C2A.fasta...
 Processing genome: Methanosarcina_barkeri_str_fusaro.fasta...
 Processing genome: Methanosarcina_mazei_WWM610.fasta...
 Processing genome: Methanosarcina_horonobensis_HB_1.fasta...
 Processing genome: Methanosarcina_mazei_S_6.fasta...
 Processing genome: Methanosarcina_mazei_LYC.fasta...
 Processing genome: Methanosarcina_mazei_C16.fasta...

-rw-rw-r-- 1 nyoungb2 nyoungb2 2.3M Jan  3 20:24 spacers_cut1_blastn_filt_crDNA.srl

Adding protospacer info

Similar to adding crRNA info, but getting information from the blast database(s)



In [67]:

    
!cd $blastDir; \
    CLdb -- arrayBlast -- add_proto -workers 24 \
    < spacers_cut1_blastn_filt_crDNA.srl \
    > spacers_cut1_blastn_filt_crDNA_proto.srl   
    
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_blastn_filt_crDNA_proto.srl









    



Retrieving proto info from blast db: 'Methanosarcina_mazei_LYC.fasta'
Retrieving proto info from blast db: 'Methanosarcina_acetivorans_C2A.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_Go1.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_WWM610.fasta'
Retrieving proto info from blast db: 'Methanosarcina_horonobensis_HB_1.fasta'
Retrieving proto info from blast db: 'Methanosarcina_barkeri_str_fusaro.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_C16.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_SarPi.fasta'
Retrieving proto info from blast db: 'Methanosarcina_mazei_S_6.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_LYC.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_S_6.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_C16.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_Go1.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_horonobensis_HB_1.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_acetivorans_C2A.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_barkeri_str_fusaro.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_WWM610.fasta'
Retrieving proto sequences from blast db: 'Methanosarcina_mazei_SarPi.fasta'

-rw-rw-r-- 1 nyoungb2 nyoungb2 2.3M Jan  3 20:24 spacers_cut1_blastn_filt_crDNA_proto.srl

Aligning crRNA & protospacer

Alignment added to the .srl file.
These are individual alignments between the spacer and protospacer (using clustalw).



In [68]:

    
# aligning crRNA and protospacer

!cd $blastDir; \
    CLdb -- arrayBlast -- align_proto \
    < spacers_cut1_blastn_filt_crDNA_proto.srl \
    > spacers_cut1_blastn_filt_crDNA_proto_aln.srl
    
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_blastn_filt_crDNA_proto_aln.srl









    



Aligning protospacers from blast db: 'Methanosarcina_mazei_S_6.fasta'
Aligning protospacers from blast db: 'Methanosarcina_acetivorans_C2A.fasta'
Aligning protospacers from blast db: 'Methanosarcina_barkeri_str_fusaro.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_LYC.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_C16.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_SarPi.fasta'
Aligning protospacers from blast db: 'Methanosarcina_horonobensis_HB_1.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_Go1.fasta'
Aligning protospacers from blast db: 'Methanosarcina_mazei_WWM610.fasta'

-rw-rw-r-- 1 nyoungb2 nyoungb2 2.3M Jan  3 20:25 spacers_cut1_blastn_filt_crDNA_proto_aln.srl

Getting alignments (fasta)

You can parse out the alignments you want (parse by subtype, taxon_name, etc.)
- OR you can parse them after writing (add subtype, taxon_name, e-value, etc. to the sequence name)
The alignments will be oriented to the crDNA (5'-3' for the crRNA)
See CLdb -- arrayBlast --perldoc -- get_align for info on what -outfmt does



In [69]:

    
# getting alignment of protospacer and crDNA and writing as a fasta

!cd $blastDir; \
    CLdb -- arrayBlast -- get_align \
    -outfmt taxon_name,subtype,evalue \
    < spacers_cut1_blastn_filt_crDNA_proto_aln.srl \
    > blastn_crDNA-proto_aln.fna
    

## checking output
!cd $blastDir; echo; \
    ls -thlc blastn_crDNA-proto_aln.fna









    



Decoding .srl file...
Getting alignments...

-rw-rw-r-- 1 nyoungb2 nyoungb2 35K Jan  3 20:25 blastn_crDNA-proto_aln.fna



In [70]:

    
# let's take a look at the sequences 
!cd $blastDir; \
    head -n 10 blastn_crDNA-proto_aln.fna









    



>crDNA|17|56|1ebv46LCvRJMH|0.000298084
tgctcgcgaaTCCCTGGTTTATGTCCTGCT.CGTCAAAATCTATGTTgtttcaatcc
>proto|17|56|1ebv46LCvRJMH|0.000298084
.agttaattacACCTGGTTTATGTCCTGCTCCGATTACCGCAACAAtggttttgtt.
>crDNA|17|89|1ebv46LCvRJMH|0.000298084
tgctcgcgaaTCCCTGGTTTATGTCCTGCT.CGTCAAAATCTATGTTgtttcaatcc
>proto|17|89|1ebv46LCvRJMH|0.000298084
.agttaattacACCTGGTTTATGTCCTGCTCCGATTACCGCAACAAtggttttgtt.
>crDNA|7|7|1eX1MJpd4Qk75|0.000325182
aggattgaaaCTTCTTCGATTGTTTTTATGCCGTT.TTTATCAAAAATAgtcgagaagt

The alignment is really just for each crDNA-protospacer pair (every 4 lines)
First, the crRNA (sequence name starts with "crRNA")
Second, the protospacer (sequence name start with "proto")
Example:
- ```
>crDNA
ATAGACA
>proto
ATAGACA
```
Note: the lower case letters are sequence adjacent to the actual protospacer

If you wanted each crRNA-protospacer pair individually, you could use the split bash command like this:

WARNING: this produces a lot of files, so we will make them in a new directory



In [71]:

    
# file/directory
inFile = os.path.join(blastDir, 'blastn_crDNA-proto_aln.fna')
outDir = os.path.join(blastDir, 'blastn_crDNA-proto_aln_split')
if not os.path.isdir(outDir):
    os.makedirs(outDir)
outPre = os.path.join(outDir, 'split')    

# spliting alignment
!split -l 4 -d -a 4 $inFile $outPre

# checking out files
!printf 'Number of files produced: '
!find $outDir -name "split*" | wc -l









    



Number of files produced: 179

Getting PAMs

Looking at the alignments can be very useful, but it can help to extract what we specifically want and summarize.
- Let's pull out the potential PAM regions from the alignment
- We are going to assume that the PAM is:
  - 4bp long
  - 1bp away from the 5' end of the protospacer ('right' side of the alignment)



In [72]:

    
!cd $blastDir; \
    CLdb -- arrayBlast -- get_PAM -PAM 1 4 -f - \
    < blastn_crDNA-proto_aln.fna \
    > blastn_crDNA-proto_aln_PAM.fna


## checking output
!cd $blastDir; echo; \
    head blastn_crDNA-proto_aln_PAM.fna









    



>proto|18|37|uWdGwRtuvgdp|3.13923e-13
taat
>proto|18|74|1vQhyRa3ZkVOT|1.31536e-06
taaa
>proto|10|18|1VdAVgxl34ddZ|3.90827e-07
taaa
>proto|17|33|u4GEEvxNLPSL|0.000307282
atat
>proto|11|78|1beRkenSGXl|0.000347095
tata

Summary table of each unique PAM sequence



In [73]:

    
!cd $blastDir; \
    egrep -v "^>" blastn_crDNA-proto_aln_PAM.fna | \
    sort | uniq -c | sort -k 1 -n -r









    



     42 taaa
     31 ttaa
     12 tggt
      7 taac
      7 ctgc
      7 cagg
      6 tatt
      6 ccgg
      5 tccg
      5 taat
      5 gggc
      5 atat
      5 aaaa
      4 tgaa
      4 caat
      3 aagg
      2 tgac
      2 gaca
      2 ataa
      1 ttgc
      1 tctg
      1 tcca
      1 tcaa
      1 tatg
      1 tata
      1 taag
      1 gggg
      1 gctg
      1 cgca
      1 ccgc
      1 caag
      1 attg
      1 atga
      1 agtc
      1 aggg
      1 agac
      1 aatt
      1 aaga

Spacer (crDNA) - protospacer mismatch distribution

Let's look at potential spacer targeting by assessing mismatches of the crDNAs and protospacers.
This script will make 2 summary files of mismatches for each crDNA-protospacer alignment
Given that spacers can vary in length, the alignment is relative to a user-defined 'SEED' region.
I'm going to use the default SEED region: '-SEED -8 -1'
- This SEED region covers the last 8 bp of the alignment



In [74]:

    
!cd $blastDir; \
    CLdb -- arrayBlast -- get_SEEDstats \
    --SEED -8 -1 \
    -prefix blastn_crDNA-proto_aln \
    < blastn_crDNA-proto_aln.fna 
    
## checking output files
!cd $blastDir; echo; \
    ls -thlc blastn_crDNA-proto_aln_SEEDstats*.txt









    



SEED summary file written: 'blastn_crDNA-proto_aln_SEEDstats-summary.txt'
SEED by-position table written: 'blastn_crDNA-proto_aln_SEEDstats-byPos.txt'

-rw-rw-r-- 1 nyoungb2 nyoungb2 3.3K Jan  3 20:28 blastn_crDNA-proto_aln_SEEDstats-byPos.txt
-rw-rw-r-- 1 nyoungb2 nyoungb2  55K Jan  3 20:28 blastn_crDNA-proto_aln_SEEDstats-summary.txt



In [75]:

    
# checking files
!cd $blastDir; \
    head -n 6 blastn_crDNA-proto_aln_SEEDstats-byPos.txt
    
print '-' * 50


!cd $blastDir; \
    head -n 6 blastn_crDNA-proto_aln_SEEDstats-summary.txt









    



region	pos_rel_SEED	pos_rel_align	count	mismatches	mismatch_norm
SEED	1	10	179	94	0.525139664804469
SEED	2	11	179	48	0.268156424581006
SEED	3	12	179	38	0.212290502793296
SEED	4	13	179	65	0.363128491620112
SEED	5	14	179	63	0.35195530726257
--------------------------------------------------
alignment	crDNA	protospacer	region	count	mismatches	mismatch_norm
9	crDNA|11|78|1beRkenSGXl|0.000347095	proto|11|78|1beRkenSGXl|0.000347095	protospacer	40	10	0.25
9	crDNA|11|78|1beRkenSGXl|0.000347095	proto|11|78|1beRkenSGXl|0.000347095	nonSEED	32	8	0.25
9	crDNA|11|78|1beRkenSGXl|0.000347095	proto|11|78|1beRkenSGXl|0.000347095	SEED	10	2	0.2
164	crDNA|11|49|2A8bCEvyJhoO3|6.57658e-12	proto|11|49|2A8bCEvyJhoO3|6.57658e-12	SEED	10	2	0.2
164	crDNA|11|49|2A8bCEvyJhoO3|6.57658e-12	proto|11|49|2A8bCEvyJhoO3|6.57658e-12	protospacer	39	5	0.128205128205128

Plotting mismatches



In [76]:

    
%%R
library(ggplot2)
library(gridExtra)



In [77]:

    
%%R -i blastDir -w 800

# loading table
infile = file.path(blastDir, 'blastn_crDNA-proto_aln_SEEDstats-byPos.txt')

tbl = read.delim(infile)
tbl = tbl[tbl$region != 'protospacer', ]

# mismatches and counts, & normalized mismatches
p.mis = ggplot(tbl, aes(pos_rel_SEED, mismatches, group=region, fill=region)) +
    geom_bar(stat='identity', position='dodge') +
    labs(y='Number\nof mismatches', x='Alignment position relative to SEED start')
p.count = ggplot(tbl, aes(pos_rel_SEED, count, group=region, fill=region)) +  
    geom_bar(stat='identity', position='dodge') +
    labs(y='Number of alignments\nwith nucleotide\nat position', x='Alignment position relative to SEED start')
p.norm = ggplot(tbl, aes(pos_rel_SEED, mismatch_norm, group=region, fill=region)) +
  geom_bar(stat='identity', position='dodge') +
    labs(y='Mismatchs /\nalignments', x='Alignment position relative to SEED start')

grid.arrange(p.count, p.mis, p.norm, ncol=1)

Notes on plots:

The top plot shows that some spacer-protospacer are not quiet as long as the rest, which is why the 'Number of alignment with nucleotide at position' drops at the higher positions (> ~30).
There are fewer mismatches in the middle of the spacer-protospacer alignments.

Array Blast vs NCBI's nt database

See the E. coli example for conducting a spacer blast vs the nt database.



In [ ]: