Description:

  • This notebook walks through assessing which spacers are shared across CRISPR loci in a CLdb database.

Before running this notebook:

User-defined variables


In [1]:
# directory where the shared-spacer tables will be written
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Ecoli/spacers_shared/"

Init


In [2]:
import os
from IPython.display import FileLinks
%load_ext rpy2.ipython

In [13]:
%%R
library(dplyr)
library(tidyr)
library(ggplot2)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


In [3]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)

In [4]:
# checking that CLdb is in $PATH & ~/.CLdb config file is set up
!CLdb --config-params


#-- Config params --#
DATABASE = /home/nyoungb2/t/CLdb_Ecoli/CLdb.sqlite

Spacers-shared table


In [5]:
!cd $workDir; \
    CLdb -- spacersShared -h


Usage:
    spacersShared.pl [flags] > shared.txt

  Required flags:
    -database <char>
        CLdb database.

  Optional flags:
    -subtype <bool>
        Group count data by subtype? [FALSE]

    -id <bool>
        Group count data by taxon_id? [FALSE]

    -name <bool>
        Group count data by taxon_name? [FALSE]

    -locus <bool>
        Group count data by locus_id? [FALSE]

    -cutoff <float>
        Which Spacer/DR clustering cutoffs to summarize? [1]

    -long <bool>
        Write long form of table instead of wide format.

    -sep <char>
        The separator for delimiting group IDs (e.g., subtype or taxon_name)

    -verbose <bool>
        Verbose output. [FALSE]

    -help <bool>
        This help message

  For more information:
    CLdb_perldoc spacersShared.pl

Let's try the default table


In [9]:
!cd $workDir; \
    CLdb -- spacersShared | head


Spacer_cluster	Total
0	3
1	3
10	1
11	3
12	1
13	3
14	1
15	3
16	1
  • The first column is the ID of the cluster that the spacer sequence belongs to.
    • With the default '-cutoff' of 1, clusters are formed at 100% sequence identity, so each cluster is one unique spacer sequence.
  • The second column shows the number of times that spacer sequence is found in the database.
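
To make the two columns concrete, here is a minimal Python sketch that parses the default table and tallies how often each total occurs. The input is simply the nine rows printed above pasted into a string, and `parse_totals` is a hypothetical helper written for this illustration; it is not part of CLdb.

```python
import csv
import io
from collections import Counter

# The rows printed by `CLdb -- spacersShared | head` above (tab-delimited).
table = """Spacer_cluster\tTotal
0\t3
1\t3
10\t1
11\t3
12\t1
13\t3
14\t1
15\t3
16\t1
"""

def parse_totals(text):
    """Return {cluster_id: total} from the default spacersShared table.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Spacer_cluster"]: int(row["Total"]) for row in reader}

totals = parse_totals(table)

# How many clusters are found N times in the database?
histogram = Counter(totals.values())

# Clusters whose spacer sequence occurs more than once.
shared = sorted((c for c, n in totals.items() if n > 1), key=int)

print(dict(histogram))
print(shared)
```

On the full output, the same function could be fed the text of a file written with `CLdb -- spacersShared > shared.txt`.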

Let's show totals for each taxon


In [11]:
!cd $workDir; \
    CLdb -- spacersShared -name | head


Spacer_cluster	Escherichia_coli_BL21_DE3	Escherichia_coli_K-12_DH10B	Escherichia_coli_K-12_MG1655	Escherichia_coli_K-12_W3110	Escherichia_coli_O157_H7
0	0	1	1	1	0
1	0	1	1	1	0
10	1	0	0	0	0
11	0	1	1	1	0
12	1	0	0	0	0
13	0	1	1	1	0
14	1	0	0	0	0
15	0	1	1	1	0
16	0	0	0	0	1

As you can see, not every genome contains the same set of unique spacer sequences.
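
As a rough illustration of that observation, the sketch below reads the per-taxon table and reports which spacer clusters occur in every taxon and which occur in only one. It uses only the rows shown above, so the result says nothing about the rest of the database; `presence_by_cluster` is a hypothetical helper, not a CLdb function.

```python
import csv
import io

# The rows printed by `CLdb -- spacersShared -name | head` above.
wide = (
    "Spacer_cluster\tEscherichia_coli_BL21_DE3\tEscherichia_coli_K-12_DH10B\t"
    "Escherichia_coli_K-12_MG1655\tEscherichia_coli_K-12_W3110\t"
    "Escherichia_coli_O157_H7\n"
    "0\t0\t1\t1\t1\t0\n"
    "1\t0\t1\t1\t1\t0\n"
    "10\t1\t0\t0\t0\t0\n"
    "11\t0\t1\t1\t1\t0\n"
    "12\t1\t0\t0\t0\t0\n"
    "13\t0\t1\t1\t1\t0\n"
    "14\t1\t0\t0\t0\t0\n"
    "15\t0\t1\t1\t1\t0\n"
    "16\t0\t0\t0\t0\t1\n"
)

def presence_by_cluster(text):
    """Return (taxa, {cluster_id: set of taxa carrying that cluster}).

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    taxa = [name for name in reader.fieldnames if name != "Spacer_cluster"]
    presence = {}
    for row in reader:
        presence[row["Spacer_cluster"]] = {t for t in taxa if int(row[t]) > 0}
    return taxa, presence

taxa, presence = presence_by_cluster(wide)

# Clusters found in every taxon in the table.
core = [c for c, found in presence.items() if len(found) == len(taxa)]

# Clusters found in exactly one taxon, mapped to that taxon.
singletons = {c: next(iter(found)) for c, found in presence.items() if len(found) == 1}

print("in all taxa:", core)
print("in one taxon only:", singletons)
```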

Let's plot this


In [12]:
!cd $workDir; \
    CLdb -- spacersShared -name -long > shared_byTaxon.txt

In [16]:
%%R -i workDir

infile = file.path(workDir, 'shared_byTaxon.txt')
df = read.delim(infile, sep='\t')
df %>% head


  Spacer_cluster                     group_ID count
1              0  Escherichia_coli_K-12_DH10B     1
2              0 Escherichia_coli_K-12_MG1655     1
3              0  Escherichia_coli_K-12_W3110     1
4              1  Escherichia_coli_K-12_DH10B     1
5              1 Escherichia_coli_K-12_MG1655     1
6              1  Escherichia_coli_K-12_W3110     1

In [21]:
%%R
# heatmap of spacer presence: one column per taxon, one row per spacer cluster
ggplot(df, aes(group_ID, Spacer_cluster, fill=count)) +
    geom_tile() +
    scale_x_discrete(expand=c(0,0)) +
    scale_y_continuous(expand=c(0,0)) +
    labs(y='unique spacer sequence') +
    theme_bw() +
    theme(
        text = element_text(size=16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle=60, hjust=1)
    )


Notes

  • All of the K-12 strains have the same spacers (not surprising).
  • BL21-DE3 and O157-H7, in contrast, show some variation.
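
One way to check the first note without reading it off the plot is to compare the set of spacer clusters carried by each taxon. The sketch below does this on the six long-format rows shown earlier, assuming the file's header matches the column names in the R output (`Spacer_cluster`, `group_ID`, `count`); `clusters_by_taxon` is a hypothetical helper, and a real check would read all of `shared_byTaxon.txt` instead of this sample.

```python
import csv
import io
from itertools import combinations

# The six long-format rows shown by `df %>% head` above (tab-delimited).
long_rows = (
    "Spacer_cluster\tgroup_ID\tcount\n"
    "0\tEscherichia_coli_K-12_DH10B\t1\n"
    "0\tEscherichia_coli_K-12_MG1655\t1\n"
    "0\tEscherichia_coli_K-12_W3110\t1\n"
    "1\tEscherichia_coli_K-12_DH10B\t1\n"
    "1\tEscherichia_coli_K-12_MG1655\t1\n"
    "1\tEscherichia_coli_K-12_W3110\t1\n"
)

def clusters_by_taxon(text):
    """Return {taxon: set of spacer cluster IDs with count > 0}.

    Hypothetical helper for illustration only.
    """
    sets = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if int(row["count"]) > 0:
            sets.setdefault(row["group_ID"], set()).add(row["Spacer_cluster"])
    return sets

sets = clusters_by_taxon(long_rows)

# Pairs of taxa that carry exactly the same spacer clusters.
identical = [(a, b) for a, b in combinations(sorted(sets), 2) if sets[a] == sets[b]]

for a, b in identical:
    print(a, "==", b)
```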
