Description:

This notebook walks through assessing which spacers are shared across CRISPR loci.

Before running this notebook:

Run the Setup notebook.

User-defined variables



In [1]:

    
# directory where the shared-spacer analysis will be run
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Ecoli/spacers_shared/"

Init



In [2]:

    
import os
from IPython.display import FileLinks
%load_ext rpy2.ipython



In [13]:

    
%%R
library(dplyr)
library(tidyr)
library(ggplot2)









    





Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [3]:

    
if not os.path.isdir(workDir):
    os.makedirs(workDir)



In [4]:

    
# checking that CLdb is in $PATH & ~/.CLdb config file is set up
!CLdb --config-params









    



#-- Config params --#
DATABASE = /home/nyoungb2/t/CLdb_Ecoli/CLdb.sqlite
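The same `$PATH` check can also be done from Python before any shell cells are run. This is a minimal sketch: `tool_on_path` is a hypothetical helper, not part of CLdb, and it assumes the executable is named `CLdb` as in the cells of this notebook. It does not check the `~/.CLdb` config file.

```python
import shutil


def tool_on_path(name):
    """Return True if an executable called `name` is found on $PATH."""
    return shutil.which(name) is not None


# "CLdb" is the executable name used in this notebook's shell cells.
if not tool_on_path("CLdb"):
    print("CLdb was not found in $PATH; run the Setup notebook first")
```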

Spacers-shared table



In [5]:

    
!cd $workDir; \
    CLdb -- spacersShared -h









    



Usage:
    spacersShared.pl [flags] > shared.txt

  Required flags:
    -database <char>
        CLdb database.

  Optional flags:
    -subtype <bool>
        Group count data by subtype? [FALSE]

    -id <bool>
        Group count data by taxon_id? [FALSE]

    -name <bool>
        Group count data by taxon_name? [FALSE]

    -locus <bool>
        Group count data by locus_id? [FALSE]

    -cutoff <float>
        Which Spacer/DR clustering cutoffs to summarize? [1]

    -long <bool>
        Write long form of table instead of wide format.

    -sep <char>
        The separator for delimiting group IDs (e.g., subtype or taxon_name)

    -verbose <bool>
        Verbose output. [FALSE]

    -help <bool>
        This help message

  For more information:
    CLdb_perldoc spacersShared.pl

Let's try the default table



In [9]:

    
!cd $workDir; \
    CLdb -- spacersShared | head









    



Spacer_cluster	Total
0	3
1	3
10	1
11	3
12	1
13	3
14	1
15	3
16	1

The first column is the cluster that the spacer sequence belongs to.
- By default ('-cutoff 1'), the clusters are at 100% sequence identity (i.e., 'unique' spacer sequences).

The 2nd column shows the number of times that spacer sequence is found in the database.
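To make the two columns concrete, here is a small Python sketch that reads the default table and lists the clusters that occur more than once in the database. The input is just the rows printed above, pasted in as a string; `shared_clusters` and `min_total` are hypothetical names used only for illustration.

```python
import csv
import io

# First rows of the default table shown above (tab-separated).
table_text = (
    "Spacer_cluster\tTotal\n"
    "0\t3\n1\t3\n10\t1\n11\t3\n12\t1\n13\t3\n14\t1\n15\t3\n16\t1\n"
)


def shared_clusters(handle, min_total=2):
    """Return cluster IDs whose Total is at least `min_total`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Spacer_cluster"] for row in reader
            if int(row["Total"]) >= min_total]


shared = shared_clusters(io.StringIO(table_text))
print(shared)
```

Note that a total above 1 only says the sequence occurs more than once in the database; the per-taxon table is needed to see which genomes carry it.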

Let's show totals for each taxon



In [11]:

    
!cd $workDir; \
    CLdb -- spacersShared -name | head









    



Spacer_cluster	Escherichia_coli_BL21_DE3	Escherichia_coli_K-12_DH10B	Escherichia_coli_K-12_MG1655	Escherichia_coli_K-12_W3110	Escherichia_coli_O157_H7
0	0	1	1	1	0
1	0	1	1	1	0
10	1	0	0	0	0
11	0	1	1	1	0
12	1	0	0	0	0
13	0	1	1	1	0
14	1	0	0	0	0
15	0	1	1	1	0
16	0	0	0	0	1

As you can see, the genomes do not all contain the same set of unique spacer sequences.
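The wide table can be turned into a "which genomes carry this spacer" lookup. The sketch below does this in Python for the rows printed above only (not the full table); `taxa_per_cluster` is a hypothetical helper, and any count above 0 is treated as presence.

```python
import csv
import io

taxa = [
    "Escherichia_coli_BL21_DE3",
    "Escherichia_coli_K-12_DH10B",
    "Escherichia_coli_K-12_MG1655",
    "Escherichia_coli_K-12_W3110",
    "Escherichia_coli_O157_H7",
]
# Rows copied from the `head` output above.
wide_text = "\n".join([
    "\t".join(["Spacer_cluster"] + taxa),
    "0\t0\t1\t1\t1\t0",
    "1\t0\t1\t1\t1\t0",
    "10\t1\t0\t0\t0\t0",
    "11\t0\t1\t1\t1\t0",
    "12\t1\t0\t0\t0\t0",
    "13\t0\t1\t1\t1\t0",
    "14\t1\t0\t0\t0\t0",
    "15\t0\t1\t1\t1\t0",
    "16\t0\t0\t0\t0\t1",
])


def taxa_per_cluster(handle):
    """Map each cluster ID to the list of taxa with a count above 0."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    names = header[1:]
    return {
        row[0]: [name for name, count in zip(names, row[1:]) if int(count) > 0]
        for row in reader
    }


carriers = taxa_per_cluster(io.StringIO(wide_text))
# Clusters found in exactly one genome in this excerpt.
single_genome = [c for c, names in carriers.items() if len(names) == 1]
print(single_genome)
```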

Let's plot this



In [12]:

    
!cd $workDir; \
     CLdb -- spacersShared -name -long > shared_byTaxon.txt



In [16]:

    
%%R -i workDir

infile = file.path(workDir, 'shared_byTaxon.txt')
df = read.delim(infile, sep='\t')
df %>% head









    





  Spacer_cluster                     group_ID count
1              0  Escherichia_coli_K-12_DH10B     1
2              0 Escherichia_coli_K-12_MG1655     1
3              0  Escherichia_coli_K-12_W3110     1
4              1  Escherichia_coli_K-12_DH10B     1
5              1 Escherichia_coli_K-12_MG1655     1
6              1  Escherichia_coli_K-12_W3110     1
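The long format has one row per (cluster, group) pair, which is convenient for plotting. If the wide layout is needed again, it can be rebuilt from the long rows. Below is a minimal Python sketch using only the six rows printed above; `pivot_long` is a hypothetical helper. The printed rows contain no zero counts, so the sketch assumes a missing (cluster, group) pair means a count of 0.

```python
from collections import defaultdict

# (Spacer_cluster, group_ID, count) rows copied from the output above.
long_rows = [
    ("0", "Escherichia_coli_K-12_DH10B", 1),
    ("0", "Escherichia_coli_K-12_MG1655", 1),
    ("0", "Escherichia_coli_K-12_W3110", 1),
    ("1", "Escherichia_coli_K-12_DH10B", 1),
    ("1", "Escherichia_coli_K-12_MG1655", 1),
    ("1", "Escherichia_coli_K-12_W3110", 1),
]


def pivot_long(rows):
    """Build {cluster: {group_ID: count}} from long-format rows."""
    wide = defaultdict(dict)
    for cluster, group_id, count in rows:
        # Sum counts in case a pair appears on more than one row.
        wide[cluster][group_id] = wide[cluster].get(group_id, 0) + count
    return {cluster: dict(groups) for cluster, groups in wide.items()}


wide = pivot_long(long_rows)
# Missing pairs are read back as 0 (an assumption, see above).
print(wide["0"].get("Escherichia_coli_BL21_DE3", 0))
```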



In [21]:

    
%%R
# plotting
ggplot(df, aes(group_ID, Spacer_cluster, fill=count)) +
    geom_tile() +
    scale_x_discrete(expand=c(0,0)) +
    scale_y_continuous(expand=c(0,0)) +
    labs(y='unique spacer sequence') +
    theme_bw() +
    theme(
        text = element_text(size=16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle=60, hjust=1)
    )

Notes

All of the K-12 strains have the same spacers (not surprising), while BL21-DE3 and O157-H7 show some variation.
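One way to put a number on these observations is a pairwise Jaccard similarity between the sets of spacer clusters found in each genome. The sketch below is only an illustration: the sets are typed in by hand from the first nine rows of the per-taxon table, so the values describe that excerpt rather than the full dataset, and `jaccard` is a hypothetical helper.

```python
from itertools import combinations

k12_clusters = {"0", "1", "11", "13", "15"}
# Cluster sets taken from the first nine rows of the per-taxon table.
profiles = {
    "Escherichia_coli_BL21_DE3": {"10", "12", "14"},
    "Escherichia_coli_K-12_DH10B": set(k12_clusters),
    "Escherichia_coli_K-12_MG1655": set(k12_clusters),
    "Escherichia_coli_K-12_W3110": set(k12_clusters),
    "Escherichia_coli_O157_H7": {"16"},
}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


for t1, t2 in combinations(sorted(profiles), 2):
    print(t1, t2, round(jaccard(profiles[t1], profiles[t2]), 2))
```

A value of 1 means two genomes carry exactly the same clusters in the excerpt, and 0 means they share none.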


