This notebook describes the setup of CLdb with a set of E. coli genomes.

Notes

  • It is assumed that you have CLdb in your PATH

In [287]:
# path to raw files
## CHANGE THIS!
rawFileDir = "~/perl/projects/CLdb/data/Ecoli/"
# directory where the CLdb database will be created
## CHANGE THIS!
workDir = "~/t/CLdb_Ecoli/"

In [288]:
# viewing file links
import os
import zipfile
import csv
from IPython.display import FileLinks
# pretty viewing of tables
## get from: http://epmoyer.github.io/ipy_table/
from ipy_table import *

In [290]:
rawFileDir = os.path.expanduser(rawFileDir)
workDir = os.path.expanduser(workDir)

The required files are in rawFileDir (set above):

  • a loci table
  • array files
  • genome nucleotide sequences
    • genbank (preferred) or fasta format

Let's look at the provided files for this example:


In [151]:
FileLinks(rawFileDir)


Out[151]:
/home/nyoungb2/perl/projects/CLdb/data/Ecoli/
  loci.zip
  GIs.txt.zip
  array.zip

Checking that CLdb is installed in PATH


In [152]:
!CLdb -h


Usage:
    CLdb [options] -- subcommand [subcommand_options]

  Options:
    --list
        List all subcommands.

    --perldoc
        Get perldoc of subcommand.

    --sql
        SQL passed to subcommand for limiting queries. (eg., --sql
        'loci.subtype == "I-B" or loci.subtype == "I-C"'). NOTE: The sql
        statement must go in SINGLE quotes!

    --config
        Config file (if not ~/.CLdb)

    --config-params
        List params set by config

    -v Verbose output
    -h This help message

  For more information:
    perldoc CLdb
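If `CLdb -h` fails, a likely cause is that the executable is not on PATH. The same check can be sketched from Python. This assumes Python 3.3+ (for `shutil.which`), and `find_executable` is a hypothetical helper name, not part of CLdb:

```python
import shutil

def find_executable(name):
    """Return the full path to `name` if it is on PATH, otherwise None."""
    return shutil.which(name)

cldb_path = find_executable('CLdb')
if cldb_path is None:
    print('CLdb was not found in PATH; add its directory to PATH first')
else:
    print('CLdb found at: {}'.format(cldb_path))
```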

Setting up the CLdb directory


In [153]:
# this makes the working directory
if not os.path.isdir(workDir):
    os.makedirs(workDir)

In [154]:
# unarchiving files in the raw folder over to the newly made working folder
files = ['array.zip','loci.zip', 'GIs.txt.zip']
files = [os.path.join(rawFileDir, x) for x in files]
for f in files:
    if not os.path.isfile(f):
        raise IOError('Cannot find file: {}'.format(f))
    # the context manager closes the archive after extraction
    with zipfile.ZipFile(f) as zf:
        zf.extractall(path=workDir)

print('unzipped raw files:')
FileLinks(workDir)


unzipped raw files:
Out[154]:
/home/nyoungb2/t/CLdb_Ecoli/
  GIs.txt
/home/nyoungb2/t/CLdb_Ecoli/array/
  Ecoli_K12_DH10B_a2.txt
  Ecoli_K12_W3110_a1.txt
  Ecoli_K12_MG1655_a1.txt
  Ecoli_BL21_DE3_a2.txt
  Ecoli_K12_MG1655_a2.txt
  Ecoli_BL21_DE3_a1.txt
  Ecoli_K12_W3110_a2.txt
  Ecoli_K12_DH10B_a1.txt
  Ecoli_0157_H7_a1.txt
  Ecoli_0157_H7_a2.txt
/home/nyoungb2/t/CLdb_Ecoli/loci/
  loci.txt

Downloading the genome GenBank files, using the 'GIs.txt' file

  • GIs.txt is just a list of GIs and taxon names.

In [155]:
# making genbank directory
genbankDir = os.path.join(workDir, 'genbank')
if not os.path.isdir(genbankDir):
    os.makedirs(genbankDir)    

# downloading genomes
!cd $genbankDir; \
    CLdb -- accession-GI2fastaGenome -format genbank -fork 5 < ../GIs.txt
    
# checking files
!cd $genbankDir; \
    ls -thlc *.gbk


Writing files to '/home/nyoungb2/t/CLdb_Ecoli/genbank'
Attempting to stream: Escherichia_coli_K-12_W3110 (accession/GI = 388476123)
Attempting to stream: Escherichia_coli_BL21_DE3 (accession/GI = 387825439)
Attempting to stream: Escherichia_coli_K-12_DH10B (accession/GI = 170079663)
Attempting to stream: Escherichia_coli_O157_H7 (accession/GI = 16445223)
Attempting to stream: Escherichia_coli_K-12_MG1655 (accession/GI = 49175990)
-rw-rw-r-- 1 nyoungb2 nyoungb2 14M Dec 29 14:47 Escherichia_coli_O157_H7.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 12M Dec 29 14:47 Escherichia_coli_K-12_W3110.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 11M Dec 29 14:47 Escherichia_coli_BL21_DE3.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 12M Dec 29 14:47 Escherichia_coli_K-12_DH10B.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 13M Dec 29 14:46 Escherichia_coli_K-12_MG1655.gbk

Creating/loading CLdb of E. coli CRISPR data


In [271]:
!CLdb -- makeDB -h


Usage:
    makeDB.pl [options] [DATABASE_name]

  options:
    -replace <bool>
        Replace existing database.

    -table <char>
        Table(s) to keep as is (if they exist). ["leaders" "genes"]

    -drop <bool>
        Drop all tables. [FALSE]

    -help <bool>
        This help message

  For more information:
    CLdb --perldoc -- makeDB

Making CLdb sqlite file


In [272]:
!cd $workDir; \
    CLdb -- makeDB -r -drop
    
CLdbFile = os.path.join(workDir, 'CLdb.sqlite')
print('CLdb file location: {}'.format(CLdbFile))


...sqlite3 database tables created
CLdb file location: /home/nyoungb2/t/CLdb_Ecoli/CLdb.sqlite

Setting up CLdb config

  • This way, the CLdb script will know where the CLdb database is located.
    • Otherwise, you would have to keep telling the CLdb script where the database is.

In [273]:
s = 'DATABASE = ' + CLdbFile
configFile = os.path.join(os.path.expanduser('~'), '.CLdb')

# text mode, since a str (not bytes) is written
with open(configFile, 'w') as outFH:
    outFH.write(s)
    
print('Config file written: {}'.format(configFile))


Config file written: /home/nyoungb2/.CLdb
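To confirm what was written, the config file can be read back and the DATABASE path checked. The sketch below assumes the file holds simple `KEY = value` lines, as in the cell above; `parse_config` is a hypothetical helper, not part of CLdb. According to the help text shown earlier, `CLdb --config-params` also lists the parameters set by the config.

```python
import os

def parse_config(text):
    """Parse 'KEY = value' lines into a dict; lines without '=' are skipped."""
    params = {}
    for line in text.splitlines():
        if '=' not in line:
            continue
        key, value = line.split('=', 1)
        params[key.strip()] = value.strip()
    return params

config_path = os.path.join(os.path.expanduser('~'), '.CLdb')
if os.path.isfile(config_path):
    with open(config_path) as fh:
        params = parse_config(fh.read())
    db = params.get('DATABASE')
    print('DATABASE = {}'.format(db))
    print('database file exists: {}'.format(bool(db) and os.path.isfile(db)))
```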

Loading loci

  • The next step is loading the loci table.
    • This table contains the user-provided info on each CRISPR-CAS system in the genomes.
    • Let's look at the table before loading it in CLdb

Checking out the CRISPR loci table


In [274]:
lociFile = os.path.join(workDir, 'loci', 'loci.txt')

# reading in file
tbl = []
with open(lociFile, 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        tbl.append(row)

# making table
make_table(tbl)
apply_theme('basic')


Out[274]:
Locus_ID | Taxon_ID | Taxon_Name | Subtype | Scaffold | Locus_Start | Locus_End | CAS_Start | CAS_End | Array_Start | Array_End | CAS_Status | Array_Status | Genbank_File | Fasta_File | Array_File | Scaffold_count | File_Creation_Date | Author | Leader_Start | Leader_End
A | 49175990 | Escherichia_coli_K-12_MG1655 | I-E | NC_000913 | 2875723 | 2885217 | 2876486 | 2885217 | 2875723 | 2876485 |  | intact | Escherichia_coli_K-12_MG1655.gbk |  | Ecoli_K12_MG1655_a1.txt |  | 4/19/14 | Nick |  | 
B | 49175990 | Escherichia_coli_K-12_MG1655 | I-E | NC_000913 | 2880177 | 2902429 | 2880177 | 2902035 | 2902036 | 2902429 |  | intact | Escherichia_coli_K-12_MG1655.gbk |  | Ecoli_K12_MG1655_a2.txt |  | 4/19/14 | Nick |  | 
C | 170079663 | Escherichia_coli_K-12_DH10B | I-E | NC_010473 | 2968265 | 2977783 | 2969028 | 2977783 | 2968265 | 2969027 |  | intact | Escherichia_coli_K-12_DH10B.gbk |  | Ecoli_K12_DH10B_a1.txt |  | 4/19/14 | Nick |  | 
D | 170079663 | Escherichia_coli_K-12_DH10B | I-E | NC_010473 | 2972719 | 2994971 | 2972719 | 2994577 | 2994578 | 2994971 |  | intact | Escherichia_coli_K-12_DH10B.gbk |  | Ecoli_K12_DH10B_a2.txt |  | 4/19/14 | Nick |  | 
E | 388476123 | Escherichia_coli_K-12_W3110 | I-E | NC_007779 | 2876357 | 2885875 | 2877120 | 2885875 | 2876357 | 2877119 |  | intact | Escherichia_coli_K-12_W3110.gbk |  | Ecoli_K12_W3110_a1.txt |  | 4/19/14 | Nick |  | 
F | 388476123 | Escherichia_coli_K-12_W3110 | I-E | NC_007779 | 2880811 | 2903063 | 2880811 | 2902669 | 2902670 | 2903063 |  | intact | Escherichia_coli_K-12_W3110.gbk |  | Ecoli_K12_W3110_a2.txt |  | 4/19/14 | Nick |  | 
G | 16445223 | Escherichia_coli_O157_H7 | I-E | NC_002655 | 3665521 | 3691424 | 3665733 | 3691424 | 3665521 | 3665732 |  | intact | Escherichia_coli_O157_H7.gbk |  | Ecoli_0157_H7_a1.txt |  | 4/19/14 | Nick |  | 
H | 16445223 | Escherichia_coli_O157_H7 | I-E | NC_002655 | 3665829 | 3691513 | 3665829 | 3691423 | 3691424 | 3691513 |  | intact | Escherichia_coli_O157_H7.gbk |  | Ecoli_0157_H7_a2.txt |  | 4/19/14 | Nick |  | 
I | 387825439 | Escherichia_coli_BL21_DE3 | I-E | NC_012971 | 2717668 | 2717940 |  |  | 2717668 | 2717940 |  | intact | Escherichia_coli_BL21_DE3.gbk |  | Ecoli_BL21_DE3_a1.txt |  | 4/19/14 | Nick |  | 
J | 387825439 | Escherichia_coli_BL21_DE3 | I-E | NC_012971 | 2736174 | 2736996 |  |  | 2736174 | 2736996 |  | intact | Escherichia_coli_BL21_DE3.gbk |  | Ecoli_BL21_DE3_a2.txt |  | 4/19/14 | Nick |  | 

Notes on the loci table:

  • As you can see, not all of the fields have values. Some are not required (e.g., 'fasta_file').
  • You will get an error if you try to load a table with missing values in required fields.
  • For a list of required columns, see the documentation for CLdb -- loadLoci -h.
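Missing values can be caught before loading with a quick pre-check. The sketch below is only an illustration: `rows_missing_values` is a hypothetical helper, and the list of required columns passed to it is an assumption that should be replaced with the columns named in the loadLoci documentation.

```python
import csv
import io

def rows_missing_values(table_text, required_columns):
    """Return (Locus_ID, column) pairs where a required column is empty."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter='\t')
    problems = []
    for row in reader:
        for col in required_columns:
            if not (row.get(col) or '').strip():
                problems.append((row.get('Locus_ID'), col))
    return problems

# Small stand-in table (not the real loci.txt); locus B lacks a taxon name
example = (
    'Locus_ID\tTaxon_Name\tFasta_File\n'
    'A\tEscherichia_coli_K-12_MG1655\t\n'
    'B\t\t\n'
)
# 'Taxon_Name' is treated as required here purely for illustration
print(rows_missing_values(example, ['Locus_ID', 'Taxon_Name']))
```

For the real table, the text could be read from lociFile and passed in the same way.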

Loading loci info into database


In [275]:
!CLdb -- loadLoci -h


Usage:
    loadLoci.pl [flags] < loci_table.txt

  Required flags:
    -database <char>
        CLdb database.

  Optional flags:
    -forks <int>
        Number of files to process in parallel. [1]

    -verbose <bool>
        Verbose output. [TRUE]

    -help <bool>
        This help message

  For more information:
    CLdb --perldoc -- loadLoci


In [276]:
!CLdb -- loadLoci < $lociFile


### checking line breaks for all external files (converting to unix) ###
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a2.txt
### checking locus_ID values ###
...locus_ID values are OK
### Checking/copying external files specfied in loci table ###
Processing locus 'B'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
Processing locus 'G'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
Processing locus 'C'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
Processing locus 'A'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
Processing locus 'J'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
Processing locus 'F'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
Processing locus 'E'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
Processing locus 'I'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
Processing locus 'D'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
Processing locus 'H'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
### Checking for existence of genome fasta ###
### Processing locus: "B" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_MG1655.fasta'
  Adding fasta to loci table
### Processing locus: "G" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_O157_H7.fasta'
  Adding fasta to loci table
### Processing locus: "C" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_DH10B.fasta'
  Adding fasta to loci table
### Processing locus: "A" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_MG1655.fasta'
  Adding fasta to loci table
### Processing locus: "J" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_BL21_DE3.fasta'
  Adding fasta to loci table
### Processing locus: "F" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_W3110.fasta'
  Adding fasta to loci table
### Processing locus: "E" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_W3110.fasta'
  Adding fasta to loci table
### Processing locus: "I" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_BL21_DE3.fasta'
  Adding fasta to loci table
### Processing locus: "D" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_DH10B.fasta'
  Adding fasta to loci table
### Processing locus: "H" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_O157_H7.fasta'
  Adding fasta to loci table
### checking for genome fasta for scaffold names ###

### loading entries into CLdb ###
...Number of entries added/updated to 'loci' table: 10
### Leader start-end provided. Loading values into leader table ###
...Number of entries added/updated to 'leaders' table: 0

Notes on loading

  • A lot is going on here:
    1. Various checks on the input files
    2. Extracting the genome fasta sequence from each genbank file
      • the genome fasta is required
    3. Loading of the loci information into the sqlite database
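Because the database is an ordinary sqlite file, the third step can also be spot-checked with Python's `sqlite3` module. This is a sketch under assumptions: the table name 'loci' is taken from the loadLoci output above, `count_rows` is a hypothetical helper, and the stand-in table below does not reflect the real CLdb schema.

```python
import sqlite3

def count_rows(conn, table):
    """Count rows in `table`, after checking that the table exists."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,))
    if cur.fetchone() is None:
        raise ValueError('No such table: {}'.format(table))
    # the table name was validated above, so it can be interpolated
    query = 'SELECT COUNT(*) FROM "{}"'.format(table)
    return conn.execute(query).fetchone()[0]

# Stand-in in-memory database so the sketch runs without CLdb;
# the single column is hypothetical
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE loci (locus_id TEXT)')
conn.executemany('INSERT INTO loci VALUES (?)', [('A',), ('B',)])
print(count_rows(conn, 'loci'))
```

Against the real database, `sqlite3.connect(CLdbFile)` would replace the in-memory connection, and the count for 'loci' should match the 10 entries reported above.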

Notes on the command

  • Why wasn't the 'required' -database flag used for CLdb -- loadLoci?
    • The flag isn't needed here because the database location is provided by the .CLdb config file that was created earlier.

In [277]:
# This is just a quick summary of the database 
## It should show 10 loci for the 'loci' rows
!CLdb -- summary


loci	NULL	intact	10
loci	Total	NA	10
spacer	NA	All	NULL
DR	NA	All	NULL
genes	Total	NA	NULL
leaders	Total	NA	NA

The summary doesn't show anything for spacers, DRs, genes or leaders!

That's because we haven't loaded that info yet...

Loading CRISPR arrays

  • The next step is to load the CRISPR array tables.
  • These are tables in 'CRISPRFinder format' that have CRISPR array info.
    • Let's take a look at one of the array files before loading them all.

In [278]:
# an example array file (obtained from CRISPRFinder)
arrayFile = os.path.join(workDir, 'array', 'Ecoli_0157_H7_a1.txt')
!head $arrayFile


3665521	CGGTTTATCCCCGCTGATGCGGGGAACAC	AGCGGCACGCTGGATTGAACAAATCCCTGGGC 	3665581	
3665582	CGGTTTATCCCCGCTGGCGCGGGGAACAC	AAACCGAAACACACGATCAATCCGAATATGAG 	3665642	
3665643	CGGTTTATCCCCGCTGGCGCGGGGAACAC	TTTGGTGACAGTTTTTGTCACTGTTTTGGTGA 	3665703	
3665704	CGGTTTATCCCCGCTGGCGCGGGGAACAC		3665732

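Note: the array file consists of 4 columns:

  1. direct-repeat start
  2. direct-repeat sequence
  3. spacer sequence (empty on the last line, which holds only the terminal repeat)
  4. spacer end (direct-repeat end on the last line)

Any extra columns are ignored.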
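A simple consistency check on an array file is that the span between the first and last columns matches the combined length of the two sequence columns. The sketch below tests this on two rows copied from the output above; `check_array_line` is a hypothetical helper, and tab-delimited fields are assumed.

```python
def check_array_line(line):
    """True if the coordinate span equals the combined sequence length."""
    fields = line.rstrip('\n').split('\t')
    start, end = int(fields[0]), int(fields[3])
    # strip stray whitespace around the sequences before measuring them
    seq_len = len(fields[1].strip()) + len(fields[2].strip())
    return abs(end - start) + 1 == seq_len

rows = [
    '3665521\tCGGTTTATCCCCGCTGATGCGGGGAACAC\t'
    'AGCGGCACGCTGGATTGAACAAATCCCTGGGC \t3665581\t',
    '3665704\tCGGTTTATCCCCGCTGGCGCGGGGAACAC\t\t3665732',
]
print([check_array_line(r) for r in rows])
```

A row that fails this check usually points to a shifted coordinate or a truncated sequence in the array file.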


In [279]:
# loading CRISPR array info
!CLdb -- loadArrays


...Number of entries added/updated to 'DRs' table: 85
...Number of entries added/updated to 'spacers' table: 75

In [280]:
# This is just a quick summary of the database 
!CLdb -- summary


loci	NULL	intact	10
loci	Total	NA	10
spacer	NA	All	75
DR	NA	All	85
genes	Total	NA	NULL
leaders	Total	NA	NA

Note: The output should show 75 spacer & 85 DR entries in the database
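The two counts are consistent with each other: 85 - 75 = 10, which is one more direct repeat than spacers for each of the 10 loci, as expected if every array ends in a terminal repeat. A short check, assuming one array per locus:

```python
n_loci = 10      # 'loci' rows reported by CLdb -- summary
n_spacers = 75   # 'spacers' entries reported by CLdb -- loadArrays
n_DRs = 85       # 'DRs' entries reported by CLdb -- loadArrays

# one extra repeat per array, assuming one array per locus
expected_DRs = n_spacers + n_loci
print(expected_DRs == n_DRs)
```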

Loading CAS genes

  • Technically, all coding sequences in the region specified in the loci table (CAS_start, CAS_end) will be loaded.
  • This requires 2 subcommands:
    1. The 1st gets the gene info
    2. The 2nd loads the info into CLdb

In [281]:
geneDir = os.path.join(workDir, 'genes')
if not os.path.isdir(geneDir):
    os.makedirs(geneDir)

In [282]:
!cd $geneDir; \
    CLdb -- getGenesInLoci 2> CAS.log > CAS.txt
    
# checking output    
!cd $geneDir; \
    head -n 5 CAS.log; \
    echo -----------; \
    tail -n 5 CAS.log; \
    echo -----------; \
    head -n 5 CAS.txt


...Getting features in:
 file:			/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk
 scaffold:		NC_010473
 region:		2972719-2994971
 CAS_status:		
-----------
 WARNING: LocusD -> 9 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 15 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 4 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 10 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 13 does not have a FIG-PEG ID in a db_xref tag!
-----------
Locus_ID	Gene_Id	Gene_start	Gene_end	Gene_length__AA	In_CAS	Gene_Alias	Sequence
B	GI:16130675::ASAP:ABE-0009074::UniProtKB/Swiss-Prot:Q46906::EcoGene:EG13123::GeneID:947536	2892218	2892793	191	yes	putative anti-terminator regulatory protein	MPLLHLLRQNPVIAAVKDNASLQLAIDSECQFISVLYGNICTISNIVKKIKNAGKYAFIHVDLLEGASNKEVVIQFLKLVTEADGIISTKASMLKAARAEGFFCIHRLFIVDSISFHNIDKQVAQSNPDCIEILPGCMPKVLGWVTEKIRQPLIAGGLVCDEEDARNAINAGVVALSTTNTGVWTLAKKLL
B	GI:16130670::ASAP:ABE-0009061::UniProtKB/Swiss-Prot:P17846::EcoGene:EG10190::GeneID:947231	2888121	2886409	570	yes	sulfite reductase, beta subunit, NAD(P)-binding, heme-binding	MSEKHPGPLVVEGKLTDAERMKHESNYLRGTIAEDLNDGLTGGFKGDNFLLIRFHGMYQQDDRDIRAERAEQKLEPRHAMLLRCRLPGGVITTKQWQAIDKFAGENTIYGSIRLTNRQTFQFHGILKKNVKPVHQMLHSVGLDALATANDMNRNVLCTSNPYESQLHAEAYEWAKKISEHLLPRTRAYAEIWLDQEKVATTDEEPILGQTYLPRKFKTTVVIPPQNDIDLHANDMNFVAIAENGKLVGFNLLVGGGLSIEHGNKKTYARTASEFGYLPLEHTLAVAEAVVTTQRDWGNRTDRKNAKTKYTLERVGVETFKAEVERRAGIKFEPIRPYEFTGRGDRIGWVKGIDDNWHLTLFIENGRILDYPARPLKTGLLEIAKIHKGDFRITANQNLIIAGVPESEKAKIEKIAKESGLMNAVTPQRENSMACVSFPTCPLAMAEAERFLPSFIDNIDNLMAKHGVSDEHIVMRVTGCPNGCGRAMLAEVGLVGKAPGRYNLHLGGNRIGTRIPRMYKENITEPEILASLDELIGRWAKEREAGEGFGDFTVRAGIIRPVLDPARDLWD
B	GI:90111487::ASAP:ABE-0009079::UniProtKB/Swiss-Prot:Q46908::EcoGene:EG13125::GeneID:947240	2894577	2893798	259	yes	putative flavoprotein	MNILLAFKAEPDAGMLAEKEWQAAAQGKSGPDISLLRSLLGADEQAAAALLLAQRKNGTPMSLTALSMGDERALHWLRYLMALGFEEAVLLETAADLRFAPEFVARHIAEWQHQNPLDLIITGCQSSEGQNGQTPFLLAEMLGWPCFTQVERFTLDALFITLEQRTEHGLRCCRVRLPAVIAVRQCGEVALPVPGMRQRMAAGKAEIIRKTVAAEMPAMQCLQLARAEQRRGATLIDGQTVAEKAQKLWQDYLRQRMQP
B	GI:16130669::ASAP:ABE-0009057::UniProtKB/Swiss-Prot:P17854::EcoGene:EG10189::GeneID:947230	2886334	2885600	244	yes	3'-phosphoadenosine 5'-phosphosulfate reductase	MSKLDLNALNELPKVDRILALAETNAELEKLDAEGRVAWALDNLPGEYVLSSSFGIQAAVSLHLVNQIRPDIPVILTDTGYLFPETYRFIDELTDKLKLNLKVYRATESAAWQEARYGKLWEQGVEGIEKYNDINKVEPMNRALKELNAQTWFAGLRREQSGSRANLPVLAIQRGVFKVLPIIDWDNRTIYQYLQKHGLKYHPLWDEGYLSVGDTHTTRKWEPGMAEEETRFFGLKRECGLHEG

In [283]:
# loading gene table into the database
!cd $geneDir; \
    CLdb -- loadGenes < CAS.txt


...Number of entries added/updated to 'genes' table: 123
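The gene table can be tallied per locus before or after loading. The sketch below uses the column name Locus_ID from the CAS.txt header shown above; `genes_per_locus` is a hypothetical helper, and the example rows use made-up gene IDs (and made-up coordinates for locus D).

```python
import csv
import io
from collections import Counter

def genes_per_locus(table_text):
    """Count rows per Locus_ID in a tab-delimited gene table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter='\t')
    return Counter(row['Locus_ID'] for row in reader)

# Stand-in table, not the real CAS.txt
example = (
    'Locus_ID\tGene_Id\tGene_start\tGene_end\n'
    'B\tgene1\t2892218\t2892793\n'
    'B\tgene2\t2888121\t2886409\n'
    'D\tgene3\t100\t400\n'
)
counts = genes_per_locus(example)
print(dict(counts), sum(counts.values()))
```

Run on the contents of CAS.txt, the total should equal the 123 entries reported by loadGenes, assuming every row is loaded.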

Setting array sense strand

  • The strand that is transcribed needs to be defined in order to have the correct sequence for downstream analyses (e.g., blasting spacers and getting PAM regions)
  • The sense (reading) strand is defined by (order of precedence):
    • The leader region (if defined; in this case, no).
    • Array_start,Array_end in the loci table
      • The genome negative strand will be used if array_start > array_end
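The Array_start/Array_end rule can be sketched as follows. This illustrates the rule stated above and is not CLdb's own implementation; `array_strand` and `reverse_complement` are hypothetical helpers, and only unambiguous bases plus N are handled.

```python
def array_strand(array_start, array_end):
    """Return '-' if array_start > array_end, otherwise '+'."""
    return '-' if array_start > array_end else '+'

def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    return ''.join(complement[base] for base in reversed(seq.upper()))

# Locus A from the loci table: Array_Start < Array_End
print(array_strand(2875723, 2876485))
# Sequences on the negative strand would be read as the reverse complement
print(reverse_complement('CGGTTTATCC'))
```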

In [284]:
!CLdb -- setSenseStrand


Setting sense strand for 10 loci...
...sense strand set based on array_start-array_end for:	10 loci
...sense strand set base on leader region for:	0 loci

Spacer and DR clustering

  • Clustering of spacer and/or DR sequences provides:
    • A method of comparing sequences within and between CRISPRs
    • Reduced redundancy for spacer and DR blasting
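The redundancy-reduction idea can be illustrated for the simplest case, exact duplicates. This is not how CLdb clusters sequences; it is only a sketch that groups identical sequences so that each distinct one would need to be blasted once. `group_identical` is a hypothetical helper, and the input is the 29-nt sequence column from the example array file shown earlier.

```python
from collections import defaultdict

def group_identical(named_seqs):
    """Map each distinct sequence to the names of the entries that share it."""
    groups = defaultdict(list)
    for name, seq in named_seqs:
        groups[seq.upper()].append(name)
    return dict(groups)

# Names are made up; sequences are from the four example rows
seqs = [
    ('row1', 'CGGTTTATCCCCGCTGATGCGGGGAACAC'),
    ('row2', 'CGGTTTATCCCCGCTGGCGCGGGGAACAC'),
    ('row3', 'CGGTTTATCCCCGCTGGCGCGGGGAACAC'),
    ('row4', 'CGGTTTATCCCCGCTGGCGCGGGGAACAC'),
]
groups = group_identical(seqs)
print(len(groups))
```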

In [285]:
!CLdb -- clusterArrayElements -s -r


Deleted all entries in 'spacer_clusters' table
Deleted all entries in 'DR_clusters' table
Getting array element sequences. Orienting sequences by array_sense_strand

Clustering spacer sequences...
...Clustering spacers at cutoff: 0.80
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.81
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.82
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.83
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.84
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.85
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.86
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.87
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.88
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.89
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.90
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.91
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.92
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.93
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.94
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.95
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.96
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.97
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.98
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.99
	Number of clusters produced: 39
...Clustering spacers at cutoff: 1.00
	Number of clusters produced: 39

Clustering DR sequences...
...Clustering DRs at cutoff: 0.80
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.81
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.82
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.83
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.84
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.85
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.86
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.87
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.88
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.89
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.90
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.91
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.92
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.93
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.94
	Number of clusters produced: 6
...Clustering DRs at cutoff: 0.95
	Number of clusters produced: 6
...Clustering DRs at cutoff: 0.96
	Number of clusters produced: 6
...Clustering DRs at cutoff: 0.97
	Number of clusters produced: 12
...Clustering DRs at cutoff: 0.98
	Number of clusters produced: 12
...Clustering DRs at cutoff: 0.99
	Number of clusters produced: 12
...Clustering DRs at cutoff: 1.00
	Number of clusters produced: 12

Inserting/updating entries in CLdb...
...1575 spacer cluster entries added/updated
...1785 DR cluster entries added/updated

Creating indices for cluster ids...
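The entry counts above are consistent with one cluster entry per array element per cutoff. This is inferred from the numbers in the output, not taken from the CLdb documentation:

```python
n_cutoffs = len(range(80, 101))   # cutoffs 0.80, 0.81, ..., 1.00
n_spacers = 75                    # spacers loaded by loadArrays
n_DRs = 85                        # DRs loaded by loadArrays

print(n_cutoffs, n_spacers * n_cutoffs, n_DRs * n_cutoffs)
```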

Database summary


In [286]:
!CLdb -- summary -name -subtype


loci	I-E	Escherichia_coli_BL21_DE3	NULL	intact	2
loci	I-E	Escherichia_coli_K-12_DH10B	NULL	intact	2
loci	I-E	Escherichia_coli_K-12_MG1655	NULL	intact	2
loci	I-E	Escherichia_coli_K-12_W3110	NULL	intact	2
loci	I-E	Escherichia_coli_O157_H7	NULL	intact	2
loci	I-E	Escherichia_coli_BL21_DE3	Total	NA	2
loci	I-E	Escherichia_coli_K-12_DH10B	Total	NA	2
loci	I-E	Escherichia_coli_K-12_MG1655	Total	NA	2
loci	I-E	Escherichia_coli_K-12_W3110	Total	NA	2
loci	I-E	Escherichia_coli_O157_H7	Total	NA	2
spacer	I-E	Escherichia_coli_BL21_DE3	All	NA	NA	17
spacer	I-E	Escherichia_coli_K-12_DH10B	All	NA	NA	18
spacer	I-E	Escherichia_coli_K-12_MG1655	All	NA	NA	18
spacer	I-E	Escherichia_coli_K-12_W3110	All	NA	NA	18
spacer	I-E	Escherichia_coli_O157_H7	All	NA	NA	4
spacers	I-E	Escherichia_coli_BL21_DE3	num_groups	1	17
spacers	I-E	Escherichia_coli_K-12_DH10B	num_groups	1	18
spacers	I-E	Escherichia_coli_K-12_MG1655	num_groups	1	18
spacers	I-E	Escherichia_coli_K-12_W3110	num_groups	1	18
spacers	I-E	Escherichia_coli_O157_H7	num_groups	1	4
DR	I-E	Escherichia_coli_BL21_DE3	All	NA	NA	19
DR	I-E	Escherichia_coli_K-12_DH10B	All	NA	NA	20
DR	I-E	Escherichia_coli_K-12_MG1655	All	NA	NA	20
DR	I-E	Escherichia_coli_K-12_W3110	All	NA	NA	20
DR	I-E	Escherichia_coli_O157_H7	All	NA	NA	6
DRs	I-E	Escherichia_coli_BL21_DE3	num_groups	1	4
DRs	I-E	Escherichia_coli_K-12_DH10B	num_groups	1	7
DRs	I-E	Escherichia_coli_K-12_MG1655	num_groups	1	6
DRs	I-E	Escherichia_coli_K-12_W3110	num_groups	1	6
DRs	I-E	Escherichia_coli_O157_H7	num_groups	1	4
genes	I-E	Escherichia_coli_K-12_DH10B	yes	NA	25
genes	I-E	Escherichia_coli_K-12_MG1655	no	NA	1
genes	I-E	Escherichia_coli_K-12_MG1655	yes	NA	24
genes	I-E	Escherichia_coli_K-12_W3110	yes	NA	25
genes	I-E	Escherichia_coli_O157_H7	yes	NA	48
genes	I-E	Escherichia_coli_K-12_DH10B	Total	NA	25
genes	I-E	Escherichia_coli_K-12_MG1655	Total	NA	25
genes	I-E	Escherichia_coli_K-12_W3110	Total	NA	25
genes	I-E	Escherichia_coli_O157_H7	Total	NA	48

Next Steps

  • arrayBlast
    • Blast spacers (& DRs), get protospacers, PAM regions, mismatches to the protospacer & SEED sequence
  • spacers_shared
    • Spacer sequences shared among CRISPRs
  • DR_consensus
    • Consensus sequences of direct repeats in each CRISPR
  • loci_plots
    • Plots of CRISPR arrays and CAS genes
