Description:

  • This notebook goes through the analysis of blasting spacers vs a database containing potential protospacers.
  • Any blast datbase can be used.
  • A script is provided to easily blast all of the genomes in CLdb.

Before running this notebook:

User-defined variables


In [51]:
# directory where you want the spacer blasting to be done
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Ecoli/"

Init


In [52]:
import os
from IPython.display import FileLinks
%load_ext rpy2.ipython


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

In [53]:
blastDir = os.path.join(workDir, 'arrayBlast')
if not os.path.isdir(blastDir):
    os.makedirs(blastDir)

Blast vs all E. coli genomes in the CLdb

Spacer blast

Selecting spacer sequences


In [85]:
!cd $blastDir; \
    CLdb -- array2fasta -cluster -cut 1 > spacers_cut1.fna

## getting number of spacer sequences
!printf 'Number of unique spacers:\t'
!cd $blastDir; \
    grep -c ">" spacers_cut1.fna; \
    echo; \
    head -n 6 spacers_cut1.fna


Number of unique spacers:	39

>NA|spacer|NA|10|1
ATAGACCCCGAACAACAATACGCGCAAACCGA
>NA|spacer|NA|1|1
TGGCTCTGCAACAGCAGCACCCATGACCACGTC
>NA|spacer|NA|29|1
GACAGAACGGCCTCAGTAGTCTCGTCAGGCTC

Blast

  • The array blast commands are sub-sub commands.
  • SO, that's why when you type: 'CLdb -- arrayBlast -h', you get then following:

In [81]:
!CLdb -- arrayBlast -h


Usage:
    arrayBlast [options] subcommand -- [subcommand_options]

  Options:
    --list
        List all subcommands.

    --perldoc
        Get perldoc of subcommand.

    -v Verbose output
    -h This help message

  For more information:
    CLdb_perldoc arrayBlast


In [82]:
# listing all arrayBlast subcommands
!CLdb -- arrayBlast --list


add_crRNA
add_proto
align_proto
filter_arrayHits
get_PAM
get_SEEDstats
get_align
get_proto
run
srl2csv
srl2json
xml2json
xml2srl

Sub-sub-command help:

CLdb arrayBlast -- run -- -h


In [83]:
# spacer blast
## Really just a thin wrapper around blastn that will blast the spacers against all genomes in CLdb
### If you want to blast some else, just run the blast yourself (use 'perldoc CLdb_arrayBlast.pl' for more info)

!cd $blastDir; \
    CLdb -- arrayBlast -- run \
    -query spacers_cut1.fna \
    > spacers_cut1_blastn.xml


Making BLAST databases...
...making blast databases in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db/
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_BL21_DE3.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_K-12_DH10B.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_K-12_MG1655.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_K-12_W3110.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_O157_H7.fasta

Blasting sequences...
Number of blast hits to Escherichia_coli_BL21_DE3.fasta:	1290
Number of blast hits to Escherichia_coli_K-12_DH10B.fasta:	1317
Number of blast hits to Escherichia_coli_K-12_MG1655.fasta:	1317
Number of blast hits to Escherichia_coli_K-12_W3110.fasta:	1317
Number of blast hits to Escherichia_coli_O157_H7.fasta:	912

Note:

  • Using spacer clustering cutoff of 1, so just 'unique' spacer sequences

In [84]:
# converting .xml to .srl
## This retains all of the info in the xml, but produces a smaller file that is faster to load and write.
## For large blast runs (xml files > 100 Mb), this will take a little while

!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    spacers_cut1_blastn.xml \
    > spacers_cut1_blastn.srl

# checking output file
!cd $blastDir; \
    ls -thlc spacers_cut1_blastn.*


-rw-rw-r-- 1 nyoungb2 nyoungb2  62K Dec 31 06:13 spacers_cut1_blastn.srl
-rw-rw-r-- 1 nyoungb2 nyoungb2 227K Dec 31 06:13 spacers_cut1_blastn.xml

Note:

  • You could pipe the output from CLdb -- arrayBlast -- run directly into CLdb -- arrayBlast -- xml2srl
    • Example: CLdb -- arrayBlast -- run -query spacers_cut1.fna | CLdb -- arrayBlast -- xml2srl

DR Blast

  • BLASTing DR sequences against the same subjects.
  • This is needed to filter out spacer blast hits to CRISPR arrays.

In [86]:
# selecting DRs, so we can blast them.
## This is needed blast filter out spacers that hit CRISPR arrays

!cd $blastDir; \
    CLdb -- array2fasta \
    -cluster -cut 1 -r \
    > DRs_cut1.fna

## getting number of spacer sequences
!printf 'Number of unique DRs:\t'
!cd $blastDir; \
    grep -c ">" DRs_cut1.fna; \
    echo; \
    head -n 6 DRs_cut1.fna


Number of unique DRs:	12

>NA|DR|NA|9|1
TGGTTTATCCCCGCTGGCGCGGGGAACTC
>NA|DR|NA|6|1
CGGTTTATCCTCGCTGGCGCGGGGAACTC
>NA|DR|NA|10|1
GGTTTATCCCCGCTGGCGCGGGGAACAC

In [87]:
# DR blast, just like spacer blast

!cd $blastDir; \
    CLdb -- arrayBlast -- run \
    -query DRs_cut1.fna \
    > DRs_cut1_blastn.xml


Making BLAST databases...
...making blast databases in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db/
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_BL21_DE3.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_K-12_DH10B.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_K-12_MG1655.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_K-12_W3110.fasta
makeblastdb -dbtype nucl -parse_seqids -in /home/nyoungb2/t/CLdb_Ecoli//arrayBlast/blast_db//Escherichia_coli_O157_H7.fasta

Blasting sequences...
Number of blast hits to Escherichia_coli_BL21_DE3.fasta:	4730
Number of blast hits to Escherichia_coli_K-12_DH10B.fasta:	4825
Number of blast hits to Escherichia_coli_K-12_MG1655.fasta:	4882
Number of blast hits to Escherichia_coli_K-12_W3110.fasta:	4882
Number of blast hits to Escherichia_coli_O157_H7.fasta:	1747

In [88]:
# converting .xml to .srl
## This retains all of the info in the xml, but produces a smaller file that is faster to load and write.
## For large blast runs (xml files > 100 Mb), this will take a little while

!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    DRs_cut1_blastn.xml \
    > DRs_cut1_blastn.srl


# checking output file
!cd $blastDir; \
    ls -thlc DRs_cut1_blastn.*


-rw-rw-r-- 1 nyoungb2 nyoungb2 247K Dec 31 06:14 DRs_cut1_blastn.srl
-rw-rw-r-- 1 nyoungb2 nyoungb2 783K Dec 31 06:14 DRs_cut1_blastn.xml

Filtering out spacer hits to CRISPRs

  • Based on DR hits falling adjacent to spacer hits

In [89]:
# filtering spacer hits to arrays
## This is based on the premise that DR hits adjacent to spacer hits signify that the spacer is hitting a CRISPR array

!cd $blastDir; \
    CLdb -- arrayBlast -- filter_arrayHits \
    spacers_cut1_blastn.srl \
    DRs_cut1_blastn.srl \
    > spacers_cut1_blastn_filt.srl
    
    
# checking output file
!cd $blastDir; \
    echo; \
    ls -thlc spacers_cut1_blastn_filt.srl


### Summary for filtering of 'bad' DR hits ###
Total DR hits: 1014
DR hits with E-values < 10: 0
DR hits with hit lengths (fraction of total) < 0.66: 0
DR hits used for filtering: 1014
### DR filtering of spacer blast hits ###
Total spacer blast hits: 79
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting an array defined just by DR-blast hits: 79
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting a mobile genetic element: 0

-rw-rw-r-- 1 nyoungb2 nyoungb2 43K Dec 31 06:14 spacers_cut1_blastn_filt.srl

Converting blast .srl to .csv


In [90]:
!cd $blastDir; \
    CLdb -- arrayBlast -- srl2csv \
    spacers_cut1_blastn_filt.srl \
    > spacers_cut1_blastn_filt.txt
    
# assessing table of blast hits    
!cd $blastDir; \
    head spacers_cut1_blastn_filt.txt


Decoding the srl file...
qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore	

Conclusions

  • No spacers hit any protospacers on the Ecoli genomes in the CLdb!
    • This indicates that there are no interegrated mobile genetic elements in the E.coli genomes that can be detected using the CRISPR spacers.
    • Maybe we will get some hits when we blast against NCBI's nt database.

Array Blast vs NCBI's nt database

  • NOTE: you will need the BLAST nt database & the blast+ toolkit
    • The arrayBlast wrapper only works with the genomes in CLdb.
    • We will need to do this part 'manually'
      • The blast output just needs to be in .xml format.
      • You will need the BLAST nt database

In [91]:
#set the number of threads to use for your machine 
num_threads=20

Spacer blast


In [92]:
## spacer blast with blastn
!cd $blastDir; \
    blastn -task blastn-short -db nt -query spacers_cut1.fna \
    -outfmt 5 -evalue 1e-5 -num_threads $num_threads > spacers_cut1_vNT.xml
    

## checking output
!cd $blastDir; \
    ls -thlc spacers_cut1_vNT.xml


-rw-rw-r-- 1 nyoungb2 nyoungb2 1.6M Dec 31 06:15 spacers_cut1_vNT.xml

DR blast


In [93]:
## spacer blast with blastn
!cd $blastDir; \
    blastn -task blastn-short -db nt -query DRs_cut1.fna \
    -outfmt 5 -evalue 1e-5 -num_threads $num_threads > DRs_cut1_vNT.xml
    

## checking output
!cd $blastDir; \
    ls -thlc DRs_cut1_vNT.xml


-rw-rw-r-- 1 nyoungb2 nyoungb2 15M Dec 31 06:16 DRs_cut1_vNT.xml

Converting to .srl


In [94]:
## spacers
!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    spacers_cut1_vNT.xml \
    > spacers_cut1_vNT.srl

## DRs
!cd $blastDir; \
    CLdb -- arrayBlast -- xml2srl \
    DRs_cut1_vNT.xml \
    > DRs_cut1_vNT.srl


## checking output
!cd $blastDir; \
    ls -thlc *_vNT.srl


-rw-rw-r-- 1 nyoungb2 nyoungb2 4.8M Dec 31 06:17 DRs_cut1_vNT.srl
-rw-rw-r-- 1 nyoungb2 nyoungb2 584K Dec 31 06:16 spacers_cut1_vNT.srl

Filtering spacer hits


In [95]:
# filtering spacer hits to arrays
## This is based on the premise that DR hits adjacent to spacer hits signify that the spacer is hitting a CRISPR array

!cd $blastDir; \
    CLdb -- arrayBlast -- filter_arrayHits \
    spacers_cut1_vNT.srl \
    DRs_cut1_vNT.srl \
    > spacers_cut1_vNT_blastn_filt.srl
    
    
## checking output file
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_vNT_blastn_filt.srl


### Summary for filtering of 'bad' DR hits ###
Total DR hits: 20159
DR hits with E-values < 10: 0
DR hits with hit lengths (fraction of total) < 0.66: 0
DR hits used for filtering: 20159
### DR filtering of spacer blast hits ###
Total spacer blast hits: 1573
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting an array defined just by DR-blast hits: 863
 NOTE: Deleting the blast hits to arrays
Spacer blast hits hitting a mobile genetic element: 710

-rw-rw-r-- 1 nyoungb2 nyoungb2 382K Dec 31 06:17 spacers_cut1_vNT_blastn_filt.srl

Great! It's looks like we have some hits to some putative protospacers!

File conversion

  • FYI, the .srl files can be converted to json or csv format. For example:

In [98]:
# The srl file can be converted to a json or csv file if you would like to use tools external to CLdb.

!cd $blastDir; \
    CLdb -- arrayBlast -- srl2json \
    < spacers_cut1_vNT_blastn_filt.srl \
    > spacers_cut1_vNT_blastn_filt.json


## viewing some of the json file
!cd $blastDir; \
    cat spacers_cut1_vNT_blastn_filt.json | \
    python -mjson.tool 2>/dev/null | head -n 40


{
    "0": {
        "BlastOutput_db": "nt",
        "BlastOutput_iterations": {
            "Iteration": [
                {
                    "Iteration_hits": {
                        "Hit": [
                            {
                                "Hit_accession": "KC765378",
                                "Hit_def": "Escherichia coli strain 90-0009 CRISPR2a repeat region",
                                "Hit_hsps": {
                                    "Hsp": {
                                        "2lOuFxXa0uuEg": {
                                            "CLdb_array-hit": 0,
                                            "Hsp_align-len": "32",
                                            "Hsp_bit-score": "63.9282",
                                            "Hsp_evalue": "3.97536e-08",
                                            "Hsp_gaps": "0",
                                            "Hsp_hit-frame": "-1",
                                            "Hsp_hit-from": "652",
                                            "Hsp_hit-to": "621",
                                            "Hsp_hseq": "ATAGACCCCGAACAACAATACGCGCAAACCGA",
                                            "Hsp_identity": "32",
                                            "Hsp_midline": "||||||||||||||||||||||||||||||||",
                                            "Hsp_num": "1",
                                            "Hsp_positive": "32",
                                            "Hsp_qseq": "ATAGACCCCGAACAACAATACGCGCAAACCGA",
                                            "Hsp_query-frame": "1",
                                            "Hsp_query-from": "1",
                                            "Hsp_query-to": "32",
                                            "Hsp_score": "32"
                                        }
                                    }
                                },
                                "Hit_id": "gi|528889517|gb|KC765378.1|",
                                "Hit_len": "922",
                                "Hit_num": "1"
                            },
                            {

adding crRNA info

  • Adding crRNA region to .srl (actually the DNA template, refering to it as 'crDNA')
    • How much of the adjacent DR sequences are included in the crDNA is determined by the user (default: 10bp on either side)

In [100]:
!cd $blastDir; \
    CLdb -- arrayBlast -- add_crRNA \
    < spacers_cut1_vNT_blastn_filt.srl \
    > spacers_cut1_vNT_blastn_filt_crDNA.srl
   

## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_vNT_blastn_filt_crDNA.srl


Decoding .srl file...
Extracting all queries with blast hits from blast hit file...
...Number of total unique query IDs:	39
Getting info from CLdb on each blast query spacer...
Getting spacer regions from each genome...
 Processing genome: Escherichia_coli_K-12_DH10B.fasta...
 Processing genome: Escherichia_coli_K-12_MG1655.fasta...
 Processing genome: Escherichia_coli_BL21_DE3.fasta...
 Processing genome: Escherichia_coli_K-12_W3110.fasta...
 Processing genome: Escherichia_coli_O157_H7.fasta...

-rw-rw-r-- 1 nyoungb2 nyoungb2 406K Dec 31 06:18 spacers_cut1_vNT_blastn_filt_crDNA.srl

Adding protospacer info

  • Similar to adding crRNA info, but getting information from the blast database(s)

In [102]:
!cd $blastDir; \
    CLdb -- arrayBlast -- add_proto -workers 24 \
    < spacers_cut1_vNT_blastn_filt_crDNA.srl \
    > spacers_cut1_vNT_blastn_filt_crDNA_proto.srl   
    
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_vNT_blastn_filt_crDNA_proto.srl


Retrieving proto info from blast db: 'nt'
Retrieving proto sequences from blast db: 'nt'

-rw-rw-r-- 1 nyoungb2 nyoungb2 517K Dec 31 06:18 spacers_cut1_vNT_blastn_filt_crDNA_proto.srl

Aligning crRNA & protospacer

  • Alignment added to the .srl file.
  • These are individual alignments between the spacer and protospacer (using clustalw).

In [106]:
# aligning crRNA and protospacer

!cd $blastDir; \
    CLdb -- arrayBlast -- align_proto \
    < spacers_cut1_vNT_blastn_filt_crDNA_proto.srl \
    > spacers_cut1_vNT_blastn_filt_crDNA_proto_aln.srl
    
## checking output
!cd $blastDir; echo; \
    ls -thlc spacers_cut1_vNT_blastn_filt_crDNA_proto_aln.srl


Aligning protospacers from blast db: 'nt'

-rw-rw-r-- 1 nyoungb2 nyoungb2 710K Dec 31 06:21 spacers_cut1_vNT_blastn_filt_crDNA_proto_aln.srl

Getting alignments (fasta)

  • You can parse out the alignments you want (parse by subtype, taxon_name, etc.)
    • OR you can parse them after writing (add subtype, taxon_name, e-value, etc. to the sequence name)
  • The alignments will be oriented to the crDNA (5'-3' for the crRNA)
  • See CLdb -- arrayBlast --perldoc -- get_align for info on what -outfmt does

In [108]:
# getting alignment of protospacer and crDNA and writing as a fasta

!cd $blastDir; \
    CLdb -- arrayBlast -- get_align \
    -outfmt taxon_name,subtype,evalue \
    < spacers_cut1_vNT_blastn_filt_crDNA_proto_aln.srl \
    > vNT_crDNA-proto_aln.fna
    

## checking output
!cd $blastDir; echo; \
    ls -thlc vNT_crDNA-proto_aln.fna


Decoding .srl file...
Getting alignments...

-rw-rw-r-- 1 nyoungb2 nyoungb2 285K Dec 31 06:23 vNT_crDNA-proto_aln.fna

In [109]:
# let's take a look at the sequences 
!cd $blastDir; \
    head -n 10 vNT_crDNA-proto_aln.fna


>crDNA|J|3|2lOuFxXa0uuEg|3.97536e-08
gggataaaccGTCGGTTTGCGCGTATTGTTGTTCGGGGTCTATgtgttccccg
>proto|J|3|2lOuFxXa0uuEg|3.97536e-08
.ggataaaccgTCGGTTTGCGCGTATTGTTGTTCGGGGTCTATgtgttccccg
>crDNA|J|3|2pFfHSafceYTm|3.97536e-08
gggataaaccGTCGGTTTGCGCGTATTGTTGTTCGGGGTCTATgtgttccccg
>proto|J|3|2pFfHSafceYTm|3.97536e-08
.ggataaaccgTCGGTTTGCGCGTATTGTTGTTCGGGGTCTATgtgttccccg
>crDNA|J|3|2q2X0hEcddMla|3.97536e-08
gggataaaccGTCGGTTTGCGCGTATTGTTGTTCGGGGTCTATgtgttccccg
  • The alignment is really just for each crDNA-protospacer pair (every 4 lines)
  • First, the crRNA (sequence name starts with "crRNA")
  • Second, the protospacer (sequence name start with "proto")
  • Example:

    • >crDNA
      ATAGACA
      >proto
      ATAGACA
  • Note: the lower case letters are sequence adjacent to the actual protospacer

If you wanted each crRNA-protospacer pair individually, you could use the split bash command like this:

WARNING: this produces a lot of files, so we will make them in a new directory


In [118]:
# file/directory
inFile = os.path.join(blastDir, 'vNT_crDNA-proto_aln.fna')
outDir = os.path.join(blastDir, 'vNT_crDNA-proto_aln_split')
if not os.path.isdir(outDir):
    os.makedirs(outDir)
outPre = os.path.join(outDir, 'split')    

# spliting alignment
!split -l 4 -d -a 4 $inFile $outPre

# checking out files
!printf 'Number of files produced: '
!find $outDir -name "split*" | wc -l


Number of files produced: 1594

In [123]:
## Let look at one of them
!find $outDir -name "split*" 2>/dev/null | head -n 1 | xargs -I % cat % 2>/dev/null


>crDNA|C|9|2BRoy0pg1sN3q|1.08991e-08
gggataaaccGTTTGGATCGGGTCTGGAATTTCTGAGCGGTCGCgagttccccg
>proto|C|9|2BRoy0pg1sN3q|1.08991e-08
...ataaaccgTTTGGATCGGGTCTGGAATTTCTGAGCGGTCGCgagttccccg

Getting PAMs

  • Looking at the alignments can be very useful, but it can help to extract what we specifically want and summarize.
    • Let's pull out the potential PAM regions from the alignment
    • We are going to assume that the PAM is:
      • 3bp long
      • 1bp away from the 5' end of the protospacer ('right' side of the alignment)

In [132]:
!cd $blastDir; \
    CLdb -- arrayBlast -- get_PAM -PAM 2 4 -f - \
    < vNT_crDNA-proto_aln.fna \
    > vNT_crDNA-proto_aln_PAM.fna


## checking output
!cd $blastDir; echo; \
    head vNT_crDNA-proto_aln_PAM.fna


>proto|C|9|2GoDPvFP7iQVG|2.65694e-06
agt
>proto|J|13|1tPvM8ZwJiJQY|3.97536e-08
tgt
>proto|F|3|J5IiK0FqYBYc|1.08991e-08
tgt
>proto|E|7|2NfzYigGApLo4|3.97536e-08
agt
>proto|F|5|35jIrFszg79Lq|1.08991e-08
tgt

Summary table of each unique PAM sequence


In [133]:
!cd $blastDir; \
    egrep -v "^>" vNT_crDNA-proto_aln_PAM.fna | \
    sort | uniq -c | sort -k 1 -n -r


   1201 tgt
    374 agt
      6 
      4 agc
      3 att
      2 gtg
      2 ggt
      2 cgg

And if we reverse the sequence...


In [134]:
!cd $blastDir; \
    egrep -v "^>" vNT_crDNA-proto_aln_PAM.fna | \
    rev | sort | uniq -c | sort -k 1 -n -r


   1201 tgt
    374 tga
      6 
      4 cga
      3 tta
      2 tgg
      2 gtg
      2 ggc

We get the (less prevalent) PAM region ('TGT') for known I-E CRISPR systems as shown in Rorek et al., 2013!

Spacer (crDNA) - protospacer mismatch distribution

  • Let's look at potential spacer targeting by assessing mismatches of the crDNAs and protospacers.
  • This script will make 2 summary files of mismatches for each crDNA-protospacer alignment
  • Given that spacers can vary in length, the alignment is relative to a user-defined 'SEED' region.
  • I'm going to use the default SEED region: '-SEED -8 -1'
    • This SEED region covers the last 8 bp of the alignment

In [136]:
!cd $blastDir; \
    CLdb -- arrayBlast -- get_SEEDstats \
    --SEED -8 -1 \
    -prefix vNT_crDNA-proto_aln \
    < vNT_crDNA-proto_aln.fna 
    


## checking output files
!cd $blastDir; echo; \
    ls -thlc vNT_crDNA-proto_aln_SEEDstats*.txt


SEED summary file written: 'vNT_crDNA-proto_aln_SEEDstats-summary.txt'
SEED by-position table written: 'vNT_crDNA-proto_aln_SEEDstats-byPos.txt'

-rw-rw-r-- 1 nyoungb2 nyoungb2 2.3K Dec 31 06:59 vNT_crDNA-proto_aln_SEEDstats-byPos.txt
-rw-rw-r-- 1 nyoungb2 nyoungb2 457K Dec 31 06:59 vNT_crDNA-proto_aln_SEEDstats-summary.txt

In [140]:
# checking files
!cd $blastDir; \
    head -n 6 vNT_crDNA-proto_aln_SEEDstats-byPos.txt
    
print '-' * 50


!cd $blastDir; \
    head -n 6 vNT_crDNA-proto_aln_SEEDstats-summary.txt


region	pos_rel_SEED	pos_rel_align	count	mismatches	mismatch_norm
SEED	1	11	1594	12	0.00752823086574655
SEED	2	12	1594	7	0.00439146800501882
SEED	3	13	1594	3	0.00188205771643664
SEED	4	14	1594	3	0.00188205771643664
SEED	5	15	1594	3	0.00188205771643664
--------------------------------------------------
alignment	crDNA	protospacer	region	count	mismatches	mismatch_norm
910	crDNA|I|4|1TElyNyRCfHEA|3.97536e-08	proto|I|4|1TElyNyRCfHEA|3.97536e-08	SEED	10	1	0.1
910	crDNA|I|4|1TElyNyRCfHEA|3.97536e-08	proto|I|4|1TElyNyRCfHEA|3.97536e-08	nonSEED	27	0	0
910	crDNA|I|4|1TElyNyRCfHEA|3.97536e-08	proto|I|4|1TElyNyRCfHEA|3.97536e-08	protospacer	35	1	0.0285714285714286
969	crDNA|J|9|2bqVUCgqCkNao|3.97536e-08	proto|J|9|2bqVUCgqCkNao|3.97536e-08	SEED	10	1	0.1
969	crDNA|J|9|2bqVUCgqCkNao|3.97536e-08	proto|J|9|2bqVUCgqCkNao|3.97536e-08	nonSEED	27	0	0

Plotting mismatches


In [141]:
%%R
library(ggplot2)
library(gridExtra)


Need help? Try the ggplot2 mailing list: http://groups.google.com/group/ggplot2.

In [142]:
%%R -i blastDir -w 800

# loading table
infile = file.path(blastDir, 'vNT_crDNA-proto_aln_SEEDstats-byPos.txt')

tbl = read.delim(infile)
tbl = tbl[tbl$region != 'protospacer', ]

# mismatches and counts, & normalized mismatches
p.mis = ggplot(tbl, aes(pos_rel_SEED, mismatches, group=region, fill=region)) +
    geom_bar(stat='identity', position='dodge') +
    labs(y='Number\nof mismatches', x='Alignment position relative to SEED start')
p.count = ggplot(tbl, aes(pos_rel_SEED, count, group=region, fill=region)) +  
    geom_bar(stat='identity', position='dodge') +
    labs(y='Number of alignments\nwith nucleotide\nat position', x='Alignment position relative to SEED start')
p.norm = ggplot(tbl, aes(pos_rel_SEED, mismatch_norm, group=region, fill=region)) +
  geom_bar(stat='identity', position='dodge') +
    labs(y='Mismatchs /\nalignments', x='Alignment position relative to SEED start')

grid.arrange(p.count, p.mis, p.norm, ncol=1)


Notes on plots:

  • The top plot shows that some spacer-protospacer are not quiet as long as the rest, which is why the 'Number of alignment with nucleotide at position' drops at the higher positions (> ~30).
  • There are no mismatches in the middle of the spacer-protospacer alignments.

In [ ]: