This notebook describes the setup of CLdb with a set of E. coli genomes.

Notes

It is assumed that you have CLdb in your PATH



In [287]:

    
# path to raw files
## CHANGE THIS!
rawFileDir = "~/perl/projects/CLdb/data/Ecoli/"
# directory where the CLdb database will be created
## CHANGE THIS!
workDir = "~/t/CLdb_Ecoli/"



In [288]:

    
# viewing file links
import os
import zipfile
import csv
from IPython.display import FileLinks
# pretty viewing of tables
## get from: http://epmoyer.github.io/ipy_table/
from ipy_table import *



In [290]:

    
rawFileDir = os.path.expanduser(rawFileDir)
workDir = os.path.expanduser(workDir)

The required files are in '../ecoli_raw/':

a loci table
array files
genome nucleotide sequences
- genbank (preferred) or fasta format

Let's look at the provided files for this example:



In [151]:

    
FileLinks(rawFileDir)









    Out[151]:




/home/nyoungb2/perl/projects/CLdb/data/Ecoli/

  loci.zip

  GIs.txt.zip

  array.zip

Checking that CLdb is installed in PATH



In [152]:

    
!CLdb -h









    



Usage:
    CLdb [options] -- subcommand [subcommand_options]

  Options:
    --list
        List all subcommands.

    --perldoc
        Get perldoc of subcommand.

    --sql
        SQL passed to subcommand for limiting queries. (eg., --sql
        'loci.subtype == "I-B" or loci.subtype == "I-C"'). NOTE: The sql
        statement must go in SINGLE quotes!

    --config
        Config file (if not ~/.CLdb)

    --config-params
        List params set by config

    -v Verbose output
    -h This help message

  For more information:
    perldoc CLdb

Setting up the CLdb directory



In [153]:

    
# this makes the working directory
if not os.path.isdir(workDir):
    os.makedirs(workDir)



In [154]:

    
# unarchiving files in the raw folder over to the newly made working folder
files = ['array.zip','loci.zip', 'GIs.txt.zip']
files = [os.path.join(rawFileDir, x) for x in files]
for f in files:
    if not os.path.isfile(f):
        raise IOError, 'Cannot find file: {}'.format(f)
    else:
        zip = zipfile.ZipFile(f)
        zip.extractall(path=workDir)         

print 'unzipped raw files:'        
FileLinks(workDir)









    



unzipped raw files:






    Out[154]:




/home/nyoungb2/t/CLdb_Ecoli/

  GIs.txt

/home/nyoungb2/t/CLdb_Ecoli/array/

  Ecoli_K12_DH10B_a2.txt

  Ecoli_K12_W3110_a1.txt

  Ecoli_K12_MG1655_a1.txt

  Ecoli_BL21_DE3_a2.txt

  Ecoli_K12_MG1655_a2.txt

  Ecoli_BL21_DE3_a1.txt

  Ecoli_K12_W3110_a2.txt

  Ecoli_K12_DH10B_a1.txt

  Ecoli_0157_H7_a1.txt

  Ecoli_0157_H7_a2.txt

/home/nyoungb2/t/CLdb_Ecoli/loci/

  loci.txt

Downloading the genome genbank files. Using the 'GIs.txt' file

GIs.txt is just a list of GIs and taxon names.



In [155]:

    
# making genbank directory
genbankDir = os.path.join(workDir, 'genbank')
if not os.path.isdir(genbankDir):
    os.makedirs(genbankDir)    

# downloading genomes
!cd $genbankDir; \
    CLdb -- accession-GI2fastaGenome -format genbank -fork 5 < ../GIs.txt
    
# checking files
!cd $genbankDir; \
    ls -thlc *.gbk









    



Writing files to '/home/nyoungb2/t/CLdb_Ecoli/genbank'
Attempting to stream: Escherichia_coli_K-12_W3110 (accession/GI = 388476123)
Attempting to stream: Escherichia_coli_BL21_DE3 (accession/GI = 387825439)
Attempting to stream: Escherichia_coli_K-12_DH10B (accession/GI = 170079663)
Attempting to stream: Escherichia_coli_O157_H7 (accession/GI = 16445223)
Attempting to stream: Escherichia_coli_K-12_MG1655 (accession/GI = 49175990)
-rw-rw-r-- 1 nyoungb2 nyoungb2 14M Dec 29 14:47 Escherichia_coli_O157_H7.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 12M Dec 29 14:47 Escherichia_coli_K-12_W3110.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 11M Dec 29 14:47 Escherichia_coli_BL21_DE3.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 12M Dec 29 14:47 Escherichia_coli_K-12_DH10B.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 13M Dec 29 14:46 Escherichia_coli_K-12_MG1655.gbk

Creating/loading CLdb of E. coli CRISPR data



In [271]:

    
!CLdb -- makeDB -h









    



Usage:
    makeDB.pl [options] [DATABASE_name]

  options:
    -replace <bool>
        Replace existing database.

    -table <char>
        Table(s) to keep as is (if they exist). ["leaders" "genes"]

    -drop <bool>
        Drop all tables. [FALSE]

    -help <bool>
        This help message

  For more information:
    CLdb --perldoc -- makeDB

Making CLdb sqlite file



In [272]:

    
!cd $workDir; \
    CLdb -- makeDB -r -drop
    
CLdbFile = os.path.join(workDir, 'CLdb.sqlite')
print 'CLdb file location: {}'.format(CLdbFile)









    



...sqlite3 database tables created
CLdb file location: /home/nyoungb2/t/CLdb_Ecoli/CLdb.sqlite

Setting up CLdb config

This way, the CLdb script will know where the CLdb database is located.
- Otherwise, you would have to keep telling the CLdb script where the database is.



In [273]:

    
s = 'DATABASE = ' + CLdbFile
configFile = os.path.join(os.path.expanduser('~'), '.CLdb')

with open(configFile, 'wb') as outFH:
    outFH.write(s)
    
print 'Config file written: {}'.format(configFile)









    



Config file written: /home/nyoungb2/.CLdb

Loading loci

The next step is loading the loci table.
- This table contains the user-provided info on each CRISPR-CAS system in the genomes.
- Let's look at the table before loading it in CLdb

Checking out the CRISPR loci table



In [274]:

    
lociFile = os.path.join(workDir, 'loci', 'loci.txt')

# reading in file
tbl = []
with open(lociFile, 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        tbl.append(row)

# making table
make_table(tbl)
apply_theme('basic')









    Out[274]:




Locus_ID Taxon_ID Taxon_Name Subtype Scaffold Locus_Start Locus_End CAS_Start CAS_End Array_Start Array_End CAS_Status Array_Status Genbank_File Fasta_File Array_File Scaffold_count File_Creation_Date Author Leader_Start Leader_End
A 49175990 Escherichia_coli_K-12_MG1655 I-E NC_000913 2875723 2885217 2876486 2885217 2875723 2876485 intact Escherichia_coli_K-12_MG1655.gbk Ecoli_K12_MG1655_a1.txt 4/19/14 Nick 
B 49175990 Escherichia_coli_K-12_MG1655 I-E NC_000913 2880177 2902429 2880177 2902035 2902036 2902429 intact Escherichia_coli_K-12_MG1655.gbk Ecoli_K12_MG1655_a2.txt 4/19/14 Nick 
C 170079663 Escherichia_coli_K-12_DH10B I-E NC_010473 2968265 2977783 2969028 2977783 2968265 2969027 intact Escherichia_coli_K-12_DH10B.gbk Ecoli_K12_DH10B_a1.txt 4/19/14 Nick 
D 170079663 Escherichia_coli_K-12_DH10B I-E NC_010473 2972719 2994971 2972719 2994577 2994578 2994971 intact Escherichia_coli_K-12_DH10B.gbk Ecoli_K12_DH10B_a2.txt 4/19/14 Nick 
E 388476123 Escherichia_coli_K-12_W3110 I-E NC_007779 2876357 2885875 2877120 2885875 2876357 2877119 intact Escherichia_coli_K-12_W3110.gbk Ecoli_K12_W3110_a1.txt 4/19/14 Nick 
F 388476123 Escherichia_coli_K-12_W3110 I-E NC_007779 2880811 2903063 2880811 2902669 2902670 2903063 intact Escherichia_coli_K-12_W3110.gbk Ecoli_K12_W3110_a2.txt 4/19/14 Nick 
G 16445223 Escherichia_coli_O157_H7 I-E NC_002655 3665521 3691424 3665733 3691424 3665521 3665732 intact Escherichia_coli_O157_H7.gbk Ecoli_0157_H7_a1.txt 4/19/14 Nick 
H 16445223 Escherichia_coli_O157_H7 I-E NC_002655 3665829 3691513 3665829 3691423 3691424 3691513 intact Escherichia_coli_O157_H7.gbk Ecoli_0157_H7_a2.txt 4/19/14 Nick 
I 387825439 Escherichia_coli_BL21_DE3 I-E NC_012971 2717668 2717940 2717668 2717940 intact Escherichia_coli_BL21_DE3.gbk Ecoli_BL21_DE3_a1.txt 4/19/14 Nick 
J 387825439 Escherichia_coli_BL21_DE3 I-E NC_012971 2736174 2736996 2736174 2736996 intact Escherichia_coli_BL21_DE3.gbk Ecoli_BL21_DE3_a2.txt 4/19/14 Nick

Notes on the loci table:

As you can see, not all of the fields have values. Some are not required (e.g., 'fasta_file').
You will get an error if you try to load a table with missing values in required fields.
For a list of required columns, see the documentation for CLdb -- loadLoci -h.

Loading loci info into database



In [275]:

    
!CLdb -- loadLoci -h









    



Usage:
    loadLoci.pl [flags] < loci_table.txt

  Required flags:
    -database <char>
        CLdb database.

  Optional flags:
    -forks <int>
        Number of files to process in parallel. [1]

    -verbose <bool>
        Verbose output. [TRUE]

    -help <bool>
        This help message

  For more information:
    CLdb --perldoc -- loadLoci



In [276]:

    
!CLdb -- loadLoci < $lociFile









    



### checking line breaks for all external files (converting to unix) ###
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a1.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a2.txt
 processing: /home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk
 processing: /home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a2.txt
### checking locus_ID values ###
...locus_ID values are OK
### Checking/copying external files specfied in loci table ###
Processing locus 'B'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
Processing locus 'G'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
Processing locus 'C'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
Processing locus 'A'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_MG1655_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_MG1655.gbk'
Processing locus 'J'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
Processing locus 'F'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
Processing locus 'E'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_W3110_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_W3110.gbk'
Processing locus 'I'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a1.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_BL21_DE3_a1.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_BL21_DE3.gbk'
Processing locus 'D'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_K12_DH10B_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk'
Processing locus 'H'...
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a2.txt'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/array/Ecoli_0157_H7_a2.txt'
  Checking: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
	File found: '/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_O157_H7.gbk'
### Checking for existence of genome fasta ###
### Processing locus: "B" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_MG1655.fasta'
  Adding fasta to loci table
### Processing locus: "G" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_O157_H7.fasta'
  Adding fasta to loci table
### Processing locus: "C" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_DH10B.fasta'
  Adding fasta to loci table
### Processing locus: "A" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_MG1655.fasta'
  Adding fasta to loci table
### Processing locus: "J" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_BL21_DE3.fasta'
  Adding fasta to loci table
### Processing locus: "F" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_W3110.fasta'
  Adding fasta to loci table
### Processing locus: "E" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_W3110.fasta'
  Adding fasta to loci table
### Processing locus: "I" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_BL21_DE3.fasta'
  Adding fasta to loci table
### Processing locus: "D" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_K-12_DH10B.fasta'
  Adding fasta to loci table
### Processing locus: "H" ###
  No genome fasta found. Trying to extract sequence from genbank...
  Success! Found fasta in '/home/nyoungb2/t/CLdb_Ecoli/fasta/Escherichia_coli_O157_H7.fasta'
  Adding fasta to loci table
### checking for genome fasta for scaffold names ###

### loading entries into CLdb ###
...Number of entries added/updated to 'loci' table: 10
### Leader start-end provided. Loading values into leader table ###
...Number of entries added/updated to 'leaders' table: 0

Notes on loading

A lot is going on here:
1. Various checks on the input files
2. Extracting the genome fasta sequence from each genbank file
  - the genome fasta is required
3. Loading of the loci information into the sqlite database

Notes on the command

Why didn't I use the 'required' -database flag for CLdb -- loadLoci???
- I didn't have to use the -database flag because it is provided via the .CLdb config file that was previously created.



In [277]:

    
# This is just a quick summary of the database 
## It should show 10 loci for the 'loci' rows
!CLdb -- summary









    



loci	NULL	intact	10
loci	Total	NA	10
spacer	NA	All	NULL
DR	NA	All	NULL
genes	Total	NA	NULL
leaders	Total	NA	NA

The summary doesn't show anything for spacers, DRs, genes or leaders!

That's because we haven't loaded that info yet...

Loading CRISPR arrays

The next step is to load the CRISPR array tables.
These are tables in 'CRISPRFinder format' that have CRISPR array info.
- Let's take a look at one of the array files before loading them all.



In [278]:

    
# an example array file (obtained from CRISPRFinder)
arrayFile = os.path.join(workDir, 'array', 'Ecoli_0157_H7_a1.txt')
!head $arrayFile









    



3665521	CGGTTTATCCCCGCTGATGCGGGGAACAC	AGCGGCACGCTGGATTGAACAAATCCCTGGGC 	3665581	
3665582	CGGTTTATCCCCGCTGGCGCGGGGAACAC	AAACCGAAACACACGATCAATCCGAATATGAG 	3665642	
3665643	CGGTTTATCCCCGCTGGCGCGGGGAACAC	TTTGGTGACAGTTTTTGTCACTGTTTTGGTGA 	3665703	
3665704	CGGTTTATCCCCGCTGGCGCGGGGAACAC		3665732

Note: the array file consists of 4 columns:

spacer start
spacer sequence
direct-repeat sequence
direct-repeat stop

All extra columns ignored!



In [279]:

    
# loading CRISPR array info
!CLdb -- loadArrays









    



...Number of entries added/updated to 'DRs' table: 85
...Number of entries added/updated to 'spacers' table: 75



In [280]:

    
# This is just a quick summary of the database 
!CLdb -- summary









    



loci	NULL	intact	10
loci	Total	NA	10
spacer	NA	All	75
DR	NA	All	85
genes	Total	NA	NULL
leaders	Total	NA	NA

Note: The output should show 75 spacer & 85 DR entries in the database

Loading CAS genes

Technically, all coding seuqences in the region specified in the loci table (CAS_start, CAS_end) will be loaded.
This requires 2 subcommands:
1. The 1st gets the gene info
2. The 2nd loads the info into CLdb



In [281]:

    
geneDir = os.path.join(workDir, 'genes')
if not os.path.isdir(geneDir):
    os.makedirs(geneDir)



In [282]:

    
!cd $geneDir; \
    CLdb -- getGenesInLoci 2> CAS.log > CAS.txt
    
# checking output    
!cd $geneDir; \
    head -n 5 CAS.log; \
    echo -----------; \
    tail -n 5 CAS.log; \
    echo -----------; \
    head -n 5 CAS.txt









    



...Getting features in:
 file:			/home/nyoungb2/t/CLdb_Ecoli/genbank/Escherichia_coli_K-12_DH10B.gbk
 scaffold:		NC_010473
 region:		2972719-2994971
 CAS_status:		
-----------
 WARNING: LocusD -> 9 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 15 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 4 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 10 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: LocusD -> 13 does not have a FIG-PEG ID in a db_xref tag!
-----------
Locus_ID	Gene_Id	Gene_start	Gene_end	Gene_length__AA	In_CAS	Gene_Alias	Sequence
B	GI:16130675::ASAP:ABE-0009074::UniProtKB/Swiss-Prot:Q46906::EcoGene:EG13123::GeneID:947536	2892218	2892793	191	yes	putative anti-terminator regulatory protein	MPLLHLLRQNPVIAAVKDNASLQLAIDSECQFISVLYGNICTISNIVKKIKNAGKYAFIHVDLLEGASNKEVVIQFLKLVTEADGIISTKASMLKAARAEGFFCIHRLFIVDSISFHNIDKQVAQSNPDCIEILPGCMPKVLGWVTEKIRQPLIAGGLVCDEEDARNAINAGVVALSTTNTGVWTLAKKLL
B	GI:16130670::ASAP:ABE-0009061::UniProtKB/Swiss-Prot:P17846::EcoGene:EG10190::GeneID:947231	2888121	2886409	570	yes	sulfite reductase, beta subunit, NAD(P)-binding, heme-binding	MSEKHPGPLVVEGKLTDAERMKHESNYLRGTIAEDLNDGLTGGFKGDNFLLIRFHGMYQQDDRDIRAERAEQKLEPRHAMLLRCRLPGGVITTKQWQAIDKFAGENTIYGSIRLTNRQTFQFHGILKKNVKPVHQMLHSVGLDALATANDMNRNVLCTSNPYESQLHAEAYEWAKKISEHLLPRTRAYAEIWLDQEKVATTDEEPILGQTYLPRKFKTTVVIPPQNDIDLHANDMNFVAIAENGKLVGFNLLVGGGLSIEHGNKKTYARTASEFGYLPLEHTLAVAEAVVTTQRDWGNRTDRKNAKTKYTLERVGVETFKAEVERRAGIKFEPIRPYEFTGRGDRIGWVKGIDDNWHLTLFIENGRILDYPARPLKTGLLEIAKIHKGDFRITANQNLIIAGVPESEKAKIEKIAKESGLMNAVTPQRENSMACVSFPTCPLAMAEAERFLPSFIDNIDNLMAKHGVSDEHIVMRVTGCPNGCGRAMLAEVGLVGKAPGRYNLHLGGNRIGTRIPRMYKENITEPEILASLDELIGRWAKEREAGEGFGDFTVRAGIIRPVLDPARDLWD
B	GI:90111487::ASAP:ABE-0009079::UniProtKB/Swiss-Prot:Q46908::EcoGene:EG13125::GeneID:947240	2894577	2893798	259	yes	putative flavoprotein	MNILLAFKAEPDAGMLAEKEWQAAAQGKSGPDISLLRSLLGADEQAAAALLLAQRKNGTPMSLTALSMGDERALHWLRYLMALGFEEAVLLETAADLRFAPEFVARHIAEWQHQNPLDLIITGCQSSEGQNGQTPFLLAEMLGWPCFTQVERFTLDALFITLEQRTEHGLRCCRVRLPAVIAVRQCGEVALPVPGMRQRMAAGKAEIIRKTVAAEMPAMQCLQLARAEQRRGATLIDGQTVAEKAQKLWQDYLRQRMQP
B	GI:16130669::ASAP:ABE-0009057::UniProtKB/Swiss-Prot:P17854::EcoGene:EG10189::GeneID:947230	2886334	2885600	244	yes	3'-phosphoadenosine 5'-phosphosulfate reductase	MSKLDLNALNELPKVDRILALAETNAELEKLDAEGRVAWALDNLPGEYVLSSSFGIQAAVSLHLVNQIRPDIPVILTDTGYLFPETYRFIDELTDKLKLNLKVYRATESAAWQEARYGKLWEQGVEGIEKYNDINKVEPMNRALKELNAQTWFAGLRREQSGSRANLPVLAIQRGVFKVLPIIDWDNRTIYQYLQKHGLKYHPLWDEGYLSVGDTHTTRKWEPGMAEEETRFFGLKRECGLHEG



In [283]:

    
# loading gene table into the database
!cd $geneDir; \
    CLdb -- loadGenes < CAS.txt









    



...Number of entries added/updated to 'genes' table: 123

Setting array sense strand

The strand that is transcribed needs to be defined in order to have the correct sequence for downstream analyses (e.g., blasting spacers and getting PAM regions)
The sense (reading) strand is defined by (order of precedence):
- The leader region (if defined; in this case, no).
- Array_start,Array_end in the loci table
  - The genome negative strand will be used if array_start > array_end



In [284]:

    
!CLdb -- setSenseStrand









    



Setting sense strand for 10 loci...
...sense strand set based on array_start-array_end for:	10 loci
...sense strand set base on leader region for:	0 loci

Spacer and DR clustering

Clustering of spacer and/or DR sequences accomplishes:
- A method of comparing within and between CRISPRs
- A reducing redundancy for spacer and DR blasting



In [285]:

    
!CLdb -- clusterArrayElements -s -r









    



Deleted all entries in 'spacer_clusters' table
Deleted all entries in 'DR_clusters' table
Getting array element sequences. Orienting sequences by array_sense_strand

Clustering spacer sequences...
...Clustering spacers at cutoff: 0.80
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.81
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.82
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.83
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.84
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.85
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.86
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.87
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.88
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.89
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.90
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.91
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.92
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.93
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.94
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.95
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.96
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.97
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.98
	Number of clusters produced: 39
...Clustering spacers at cutoff: 0.99
	Number of clusters produced: 39
...Clustering spacers at cutoff: 1.00
	Number of clusters produced: 39

Clustering DR sequences...
...Clustering DRs at cutoff: 0.80
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.81
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.82
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.83
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.84
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.85
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.86
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.87
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.88
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.89
	Number of clusters produced: 2
...Clustering DRs at cutoff: 0.90
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.91
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.92
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.93
	Number of clusters produced: 4
...Clustering DRs at cutoff: 0.94
	Number of clusters produced: 6
...Clustering DRs at cutoff: 0.95
	Number of clusters produced: 6
...Clustering DRs at cutoff: 0.96
	Number of clusters produced: 6
...Clustering DRs at cutoff: 0.97
	Number of clusters produced: 12
...Clustering DRs at cutoff: 0.98
	Number of clusters produced: 12
...Clustering DRs at cutoff: 0.99
	Number of clusters produced: 12
...Clustering DRs at cutoff: 1.00
	Number of clusters produced: 12

Inserting/updating entries in CLdb...
...1575 spacer cluster entries added/updated
...1785 DR cluster entries added/updated

Creating indices for cluster ids...

Database summary



In [286]:

    
!CLdb -- summary -name -subtype









    



loci	I-E	Escherichia_coli_BL21_DE3	NULL	intact	2
loci	I-E	Escherichia_coli_K-12_DH10B	NULL	intact	2
loci	I-E	Escherichia_coli_K-12_MG1655	NULL	intact	2
loci	I-E	Escherichia_coli_K-12_W3110	NULL	intact	2
loci	I-E	Escherichia_coli_O157_H7	NULL	intact	2
loci	I-E	Escherichia_coli_BL21_DE3	Total	NA	2
loci	I-E	Escherichia_coli_K-12_DH10B	Total	NA	2
loci	I-E	Escherichia_coli_K-12_MG1655	Total	NA	2
loci	I-E	Escherichia_coli_K-12_W3110	Total	NA	2
loci	I-E	Escherichia_coli_O157_H7	Total	NA	2
spacer	I-E	Escherichia_coli_BL21_DE3	All	NA	NA	17
spacer	I-E	Escherichia_coli_K-12_DH10B	All	NA	NA	18
spacer	I-E	Escherichia_coli_K-12_MG1655	All	NA	NA	18
spacer	I-E	Escherichia_coli_K-12_W3110	All	NA	NA	18
spacer	I-E	Escherichia_coli_O157_H7	All	NA	NA	4
spacers	I-E	Escherichia_coli_BL21_DE3	num_groups	1	17
spacers	I-E	Escherichia_coli_K-12_DH10B	num_groups	1	18
spacers	I-E	Escherichia_coli_K-12_MG1655	num_groups	1	18
spacers	I-E	Escherichia_coli_K-12_W3110	num_groups	1	18
spacers	I-E	Escherichia_coli_O157_H7	num_groups	1	4
DR	I-E	Escherichia_coli_BL21_DE3	All	NA	NA	19
DR	I-E	Escherichia_coli_K-12_DH10B	All	NA	NA	20
DR	I-E	Escherichia_coli_K-12_MG1655	All	NA	NA	20
DR	I-E	Escherichia_coli_K-12_W3110	All	NA	NA	20
DR	I-E	Escherichia_coli_O157_H7	All	NA	NA	6
DRs	I-E	Escherichia_coli_BL21_DE3	num_groups	1	4
DRs	I-E	Escherichia_coli_K-12_DH10B	num_groups	1	7
DRs	I-E	Escherichia_coli_K-12_MG1655	num_groups	1	6
DRs	I-E	Escherichia_coli_K-12_W3110	num_groups	1	6
DRs	I-E	Escherichia_coli_O157_H7	num_groups	1	4
genes	I-E	Escherichia_coli_K-12_DH10B	yes	NA	25
genes	I-E	Escherichia_coli_K-12_MG1655	no	NA	1
genes	I-E	Escherichia_coli_K-12_MG1655	yes	NA	24
genes	I-E	Escherichia_coli_K-12_W3110	yes	NA	25
genes	I-E	Escherichia_coli_O157_H7	yes	NA	48
genes	I-E	Escherichia_coli_K-12_DH10B	Total	NA	25
genes	I-E	Escherichia_coli_K-12_MG1655	Total	NA	25
genes	I-E	Escherichia_coli_K-12_W3110	Total	NA	25
genes	I-E	Escherichia_coli_O157_H7	Total	NA	48

Next Steps

arrayBlast
- Blast spacers (& DRs), get protospacers, PAM regions, mismatches to the protospacer & SEED sequence
spacers_shared
- Spacer sequences shared among CRISPSRs
DR_consensus
- Consensus sequences of direct repeats in each CRISPR
loci_plots
- Plots of CRISPR arrays and CAS genes



In [ ]: