This notebook describes the setup of CLdb with a set of Methanosarcina genomes.


In [124]:
# path to raw files
## CHANGE THIS!
rawFileDir = "~/perl/projects/CLdb/data/Methanosarcina/"
# directory where the CLdb database will be created
## CHANGE THIS!
workDir = "~/t/CLdb_Methanosarcina/"

In [125]:
# viewing file links
import os
import zipfile
import csv
from IPython.display import FileLinks
# pretty viewing of tables
## get from: http://epmoyer.github.io/ipy_table/
from ipy_table import *

In [126]:
rawFileDir = os.path.expanduser(rawFileDir)
workDir = os.path.expanduser(workDir)

The required files are in the raw file directory (rawFileDir, set above):

  • a loci table
  • array files
  • genome nucleotide sequences
    • genbank (preferred) or fasta format

Let's look at the provided files for this example:


In [127]:
FileLinks(rawFileDir)


Out[127]:
/home/nyoungb2/perl/projects/CLdb/data/Methanosarcina/
  loci.zip
  accessions.txt.zip
  array.zip

Checking that CLdb is installed in PATH


In [128]:
!CLdb -h


Usage:
    CLdb [options] -- subcommand [subcommand_options]

  Options:
    --list
        List all subcommands.

    --perldoc
        Get perldoc of subcommand.

    --sql
        SQL passed to subcommand for limiting queries. (eg., --sql
        'loci.subtype == "I-B" or loci.subtype == "I-C"'). NOTE: The sql
        statement must go in SINGLE quotes!

    --config
        Config file (if not ~/.CLdb)

    --config-params
        List params set by config

    -v Verbose output
    -h This help message

  For more information:
    perldoc CLdb

Setting up the CLdb directory


In [129]:
# this makes the working directory
if not os.path.isdir(workDir):
    os.makedirs(workDir)

In [130]:
# unarchiving files in the raw folder over to the newly made working folder
files = ['array.zip', 'loci.zip', 'accessions.txt.zip']
files = [os.path.join(rawFileDir, x) for x in files]
for f in files:
    if not os.path.isfile(f):
        raise IOError('Cannot find file: {}'.format(f))
    # extract each archive into the working directory
    with zipfile.ZipFile(f) as zf:
        zf.extractall(path=workDir)

print('unzipped raw files:')
FileLinks(workDir)




Downloading the genome GenBank files, using the 'accessions.txt' file

  • accessions.txt is just a list of accessions (or GIs) and taxon names.

In [131]:
# making genbank directory
genbankDir = os.path.join(workDir, 'genbank')
if not os.path.isdir(genbankDir):
    os.makedirs(genbankDir)    

# downloading genomes
!cd $genbankDir; \
    CLdb -- accession-GI2fastaGenome -format genbank -fork 9 < ../accessions.txt
    
# checking files
!cd $genbankDir; \
    ls -thlc *.gbk


Writing files to '/home/nyoungb2/t/CLdb_Methanosarcina/genbank'
Attempting to stream: Methanosarcina_barkeri_str_fusaro (accession/GI = NC_007355.1)
Attempting to stream: Methanosarcina_mazei_WWM610 (accession/GI = NZ_CP009509.1)
Attempting to stream: Methanosarcina_horonobensis_HB_1 (accession/GI = NZ_CP009516)
Attempting to stream: Methanosarcina_mazei_Go1 (accession/GI = NC_003901.1)
Attempting to stream: Methanosarcina_mazei_S_6 (accession/GI = NZ_CP009512.1)
Attempting to stream: Methanosarcina_mazei_C16 (accession/GI = NZ_CP009514.1)
Attempting to stream: Methanosarcina_mazei_SarPi (accession/GI = NZ_CP009511.1)
Attempting to stream: Methanosarcina_mazei_LYC (accession/GI = NZ_CP009513.1)
Attempting to stream: Methanosarcina_acetivorans_C2A (accession/GI = NC_003552)
-rw-rw-r-- 1 nyoungb2 nyoungb2  13M Jan  3 20:13 Methanosarcina_acetivorans_C2A.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2  12M Jan  3 20:13 Methanosarcina_horonobensis_HB_1.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2  11M Jan  3 20:13 Methanosarcina_barkeri_str_fusaro.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.4M Jan  3 20:13 Methanosarcina_mazei_LYC.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.4M Jan  3 20:13 Methanosarcina_mazei_S_6.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.3M Jan  3 20:13 Methanosarcina_mazei_Go1.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.4M Jan  3 20:13 Methanosarcina_mazei_C16.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.2M Jan  3 20:13 Methanosarcina_mazei_SarPi.gbk
-rw-rw-r-- 1 nyoungb2 nyoungb2 9.3M Jan  3 20:13 Methanosarcina_mazei_WWM610.gbk

Creating/loading CLdb of Methanosarcina CRISPR data


In [149]:
!CLdb -- makeDB -h


Usage:
    makeDB.pl [options] [DATABASE_name]

  options:
    -replace <bool>
        Replace existing database.

    -table <char>
        Table(s) to keep as is (if they exist). ["leaders" "genes"]

    -drop <bool>
        Drop all tables. [FALSE]

    -help <bool>
        This help message

  For more information:
    CLdb --perldoc -- makeDB

Making CLdb sqlite file


In [150]:
!cd $workDir; \
    CLdb -- makeDB -r -drop
    
CLdbFile = os.path.join(workDir, 'CLdb.sqlite')
print 'CLdb file location: {}'.format(CLdbFile)


...sqlite3 database tables created
CLdb file location: /home/nyoungb2/t/CLdb_Methanosarcina/CLdb.sqlite

Setting up CLdb config

  • This way, the CLdb script will know where the CLdb database is located.
    • Otherwise, you would have to keep telling the CLdb script where the database is.

In [151]:
s = 'DATABASE = ' + CLdbFile
configFile = os.path.join(os.path.expanduser('~'), '.CLdb')

with open(configFile, 'w') as outFH:
    outFH.write(s)
    
print 'Config file written: {}'.format(configFile)


Config file written: /home/nyoungb2/.CLdb

In [152]:
# checking that the config is set
!CLdb --config-params


#-- Config params --#
DATABASE = /home/nyoungb2/t/CLdb_Methanosarcina/CLdb.sqlite

Loading loci

  • The next step is loading the loci table.
    • This table contains the user-provided info on each CRISPR-CAS system in the genomes.
    • Let's look at the table before loading it in CLdb

Checking out the CRISPR loci table


In [153]:
lociFile = os.path.join(workDir, 'loci', 'loci.txt')

# reading in file
tbl = []
with open(lociFile, 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        tbl.append(row)

# making table
make_table(tbl)
apply_theme('basic')


Out[153]:
Locus_ID	Taxon_ID	Taxon_Name	Subtype	Scaffold	Locus_Start	Locus_End	CAS_Start	CAS_End	Array_Start	Array_End	CAS_Status	Array_Status	Genbank_File	Fasta_File	Array_File	Scaffold_count	File_Creation_Date	Author
1	188937.1	Methanosarcina_acetivorans_C2A	I-G	NC_003552	4529918	4518093	4529918	4518093	4523522	4525551	broken	intact	Methanosarcina_acetivorans_C2A.gbk		Methanosarcina_acetivorans_C2A_2.txt		8-Feb-14	Nick Y
2	188937.1	Methanosarcina_acetivorans_C2A	III-A	NC_003552	2379328	2392987	2381368	2392987	2379328	2381265	broken	intact	Methanosarcina_acetivorans_C2A.gbk		Methanosarcina_acetivorans_C2A_1.txt		8-Feb-14	Nick Y
3	269797.3	Methanosarcina_barkeri_str_fusaro	III-C	NC_007355	1660131	1674653	1665910	1674653	1660242	1661621	intact	intact	Methanosarcina_barkeri_str_fusaro.gbk		Methanosarcina_barkeri_str_fusaro_2.txt		8-Feb-14	Nick Y
4	269797.3	Methanosarcina_barkeri_str_fusaro	VI-3	NC_007355	4018779	4007239	4018779	4009182	4007239	4009012	intact	intact	Methanosarcina_barkeri_str_fusaro.gbk		Methanosarcina_barkeri_str_fusaro_3.txt		8-Feb-14	Nick Y
5	269797.3	Methanosarcina_barkeri_str_fusaro	I-G	NC_007355	353461	365194	353461	365194	356467	359809	broken	intact	Methanosarcina_barkeri_str_fusaro.gbk		Methanosarcina_barkeri_str_fusaro_1.txt		8-Feb-14	Nick Y
6	1434110.3	Methanosarcina_horonobensis_HB_1	NA	NZ_CP009516	2394773	2403809	2394773	2399711	2400559	2403809	broken	intact	Methanosarcina_horonobensis_HB_1.gbk		Methanosarcina_horonobensis_HB_1_3.txt		8-Feb-14	Nick Y
7	1434110.3	Methanosarcina_horonobensis_HB_1	VIII-4	NZ_CP009516	2394171	2380115	2392416	2380115	2392664	2394112	intact	intact	Methanosarcina_horonobensis_HB_1.gbk		Methanosarcina_horonobensis_HB_1_2.txt		8-Feb-14	Nick Y
8	1434110.3	Methanosarcina_horonobensis_HB_1	I-B	NZ_CP009516	1002075	1015295	1002075	1011589	1011703	1015295	intact	intact	Methanosarcina_horonobensis_HB_1.gbk		Methanosarcina_horonobensis_HB_1_1.txt		8-Feb-14	Nick Y
9	1434113.3	Methanosarcina_mazei_C16	I-B	NZ_CP009514	880835	895591	880835	889997	890110	895591	intact	intact	Methanosarcina_mazei_C16.gbk		Methanosarcina_mazei_C16_1.txt		8-Feb-14	Nick Y
10	1434113.3	Methanosarcina_mazei_C16	III-B	NZ_CP009514	3296823	3309235	3300128	3309235	3296823	3299881	broken	intact	Methanosarcina_mazei_C16.gbk		Methanosarcina_mazei_C16_2.txt		8-Feb-14	Nick Y
11	192952.1	Methanosarcina_mazei_Go1	III-C	NC_003901	4095299	4080452	4089089	4080452	4095187	4089310	intact	intact	Methanosarcina_mazei_Go1.gbk		Methanosarcina_mazei_Go1-cli118.txt		28-Jun-13	Nick Y
12	192952.1	Methanosarcina_mazei_Go1	I-B	NC_003901	691804	679124	691804	682642	682529	679124	intact	intact	Methanosarcina_mazei_Go1.gbk		Methanosarcina_mazei_Go1-cli117.txt		28-Jun-13	Nick Y
13	1434114.3	Methanosarcina_mazei_LYC	VIII-4	NZ_CP009513	1604279	1627422	1615209	1627422	1604387	1614962	intact	broken	Methanosarcina_mazei_LYC.gbk		Methanosarcina_mazei_LYC_2.txt		8-Feb-14	Nick Y
14	1434114.3	Methanosarcina_mazei_LYC	NA	NZ_CP009513	818271	799772	818271	802538	799772	802424	broken	intact	Methanosarcina_mazei_LYC.gbk		Methanosarcina_mazei_LYC_1.txt		8-Feb-14	Nick Y
15	213585.9	Methanosarcina_mazei_S_6	I-D	NZ_CP009512	889926	906212	889926	901670	901787	906212	broken	intact	Methanosarcina_mazei_S_6.gbk		Methanosarcina_mazei_S_6_1.txt		8-Feb-14	Nick Y
16	213585.9	Methanosarcina_mazei_S_6	III-C	NZ_CP009512	1620118	1634344	1625707	1634344	1620229	1625486	intact	intact	Methanosarcina_mazei_S_6.gbk		Methanosarcina_mazei_S_6_2.txt		8-Feb-14	Nick Y
17	1434115.3	Methanosarcina_mazei_SarPi	I-B	NZ_CP009511	849895	866550	849895	859057	859170	866550	intact	intact	Methanosarcina_mazei_SarPi.gbk		Methanosarcina_mazei_SarPi_1.txt		8-Feb-14	Nick Y
18	1434115.3	Methanosarcina_mazei_SarPi	VIII-3	NZ_CP009511	1508570	1530044	1518968	1530044	1508588	1518721	intact	intact	Methanosarcina_mazei_SarPi.gbk		Methanosarcina_mazei_SarPi_2.txt		8-Feb-14	Nick Y
19	1434117.3	Methanosarcina_mazei_WWM610	III-B	NZ_CP009509	1610239	1627817	1616658	1627817	1610347	1616437	intact	intact	Methanosarcina_mazei_WWM610.gbk		Methanosarcina_mazei_WWM610_2.txt		8-Feb-14	Nick Y
20	1434117.3	Methanosarcina_mazei_WWM610	I-B	NZ_CP009509	890817	903201	890817	900512	900625	903201	intact	intact	Methanosarcina_mazei_WWM610.gbk		Methanosarcina_mazei_WWM610_1.txt		8-Feb-14	Nick Y

Notes on the loci table:

  • As you can see, not all of the fields have values. Some are not required (e.g., 'fasta_file').
  • You will get an error if you try to load a table with missing values in required fields.
  • For a list of required columns, see the documentation for CLdb -- loadLoci -h.
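
A quick pre-load check can catch empty required fields before running loadLoci. The following is a minimal sketch and not part of CLdb: the function name find_missing and the column list passed as `required` are assumptions for illustration, and the authoritative list of required columns is in the CLdb documentation.

```python
def find_missing(rows, required):
    """Return (row_number, column) pairs where a required field is empty.

    rows: list of lists; the first row is the header (as read by csv.reader).
    required: column names to check (hypothetical; consult the CLdb docs).
    """
    header = rows[0]
    idx = {name: i for i, name in enumerate(header)}
    missing = []
    for n, row in enumerate(rows[1:], start=1):
        for col in required:
            i = idx[col]
            # a short row or a blank cell both count as missing
            if i >= len(row) or row[i].strip() == '':
                missing.append((n, col))
    return missing

# toy table: the second data row lacks an Array_File value
rows = [
    ['Locus_ID', 'Taxon_Name', 'Array_File'],
    ['1', 'taxon_A', 'a.txt'],
    ['2', 'taxon_B', ''],
]
problems = find_missing(rows, required=['Locus_ID', 'Taxon_Name', 'Array_File'])
print(problems)
```

With the notebook's own table, the call would be find_missing(tbl, required=[...]) using the `tbl` list read in the cell above.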

Loading loci info into database


In [154]:
!CLdb -- loadLoci -h


Usage:
    loadLoci.pl [flags] < loci_table.txt

  Required flags:
    -database <char>
        CLdb database.

  Optional flags:
    -forks <int>
        Number of files to process in parallel. [1]

    -verbose <bool>
        Verbose output. [TRUE]

    -help <bool>
        This help message

  For more information:
    CLdb --perldoc -- loadLoci


In [155]:
!CLdb -- loadLoci < $lociFile


### checking line breaks for all external files (converting to unix) ###
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_S_6_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_S_6.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_SarPi_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_SarPi.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_acetivorans_C2A_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_acetivorans_C2A.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_3.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_3.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_C16_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_C16.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_S_6_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_S_6.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_WWM610_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_WWM610.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_WWM610_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_WWM610.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_SarPi_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_SarPi.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_LYC_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_LYC.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_acetivorans_C2A_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_acetivorans_C2A.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_LYC_2.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_LYC.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_Go1-cli117.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_Go1.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_Go1-cli118.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_Go1.gbk
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_C16_1.txt
 processing: /home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_C16.gbk
### checking locus_ID values ###
...locus_ID values are OK
### Checking/copying external files specfied in loci table ###
Processing locus '15'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_S_6.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_S_6.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_S_6_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_S_6_1.txt'
Processing locus '18'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_SarPi.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_SarPi.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_SarPi_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_SarPi_2.txt'
Processing locus '8'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_1.txt'
Processing locus '5'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_1.txt'
Processing locus '2'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_acetivorans_C2A.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_acetivorans_C2A.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_acetivorans_C2A_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_acetivorans_C2A_1.txt'
Processing locus '4'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_3.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_3.txt'
Processing locus '3'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_barkeri_str_fusaro.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_barkeri_str_fusaro_2.txt'
Processing locus '6'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_3.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_3.txt'
Processing locus '10'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_C16.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_C16.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_C16_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_C16_2.txt'
Processing locus '16'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_S_6.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_S_6.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_S_6_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_S_6_2.txt'
Processing locus '19'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_WWM610.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_WWM610.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_WWM610_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_WWM610_2.txt'
Processing locus '7'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_horonobensis_HB_1.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_horonobensis_HB_1_2.txt'
Processing locus '20'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_WWM610.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_WWM610.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_WWM610_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_WWM610_1.txt'
Processing locus '17'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_SarPi.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_SarPi.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_SarPi_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_SarPi_1.txt'
Processing locus '14'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_LYC.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_LYC.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_LYC_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_LYC_1.txt'
Processing locus '1'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_acetivorans_C2A.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_acetivorans_C2A.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_acetivorans_C2A_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_acetivorans_C2A_2.txt'
Processing locus '13'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_LYC.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_LYC.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_LYC_2.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_LYC_2.txt'
Processing locus '12'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_Go1.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_Go1.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_Go1-cli117.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_Go1-cli117.txt'
Processing locus '11'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_Go1.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_Go1.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_Go1-cli118.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_Go1-cli118.txt'
Processing locus '9'...
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_C16.gbk'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_C16.gbk'
  Checking: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_C16_1.txt'
	File found: '/home/nyoungb2/t/CLdb_Methanosarcina/array/Methanosarcina_mazei_C16_1.txt'
### Checking for existence of genome fasta ###
### Processing locus: "15" ###
  No genome fasta found. Trying to extract sequence from genbank...
**OUTPUT MUTED**

Notes on loading

  • A lot is going on here:
    1. Various checks on the input files
    2. Extracting the genome fasta sequence from each genbank file
      • The genome fasta is required
    3. Loading of the loci information into the sqlite database

Notes on the command

  • Why didn't I use the 'required' -database flag for CLdb -- loadLoci?
    • I didn't have to use the -database flag because it is provided via the .CLdb config file that was previously created.

In [156]:
# This is just a quick summary of the database
## It should show 20 loci in total for the 'loci' rows
!CLdb -- summary


loci	broken	intact	7
loci	intact	broken	1
loci	intact	intact	12
loci	Total	NA	20
spacer	NA	All	NULL
DR	NA	All	NULL
genes	Total	NA	NULL
leaders	Total	NA	NA

The summary doesn't show anything for spacers, DRs, genes or leaders!

That's because we haven't loaded that info yet...
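
Because the CLdb database is an SQLite file, row counts can also be checked directly from Python. This is a minimal sketch under stated assumptions: the table names ('loci', 'spacers', 'genes') are taken from the CLdb messages elsewhere in this notebook, while the helper name table_counts and the stand-in columns in the demo are made up so that the example runs without CLdb.

```python
import sqlite3

def table_counts(conn, tables):
    """Count rows per table; None if the table does not exist."""
    counts = {}
    for t in tables:
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (t,)
        ).fetchone()
        if exists is None:
            counts[t] = None
        else:
            # the name was just verified against sqlite_master
            query = 'SELECT COUNT(*) FROM "{}"'.format(t)
            counts[t] = conn.execute(query).fetchone()[0]
    return counts

# stand-in database (hypothetical columns) so the sketch runs without CLdb
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE loci (locus_id TEXT)')
conn.execute('CREATE TABLE spacers (spacer_id TEXT)')
conn.executemany('INSERT INTO loci VALUES (?)', [('1',), ('2',)])
counts = table_counts(conn, ['loci', 'spacers', 'genes'])
print(counts)
```

For the real database, replace the in-memory connection with sqlite3.connect(CLdbFile).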

Loading CRISPR arrays

  • The next step is to load the CRISPR array tables.
  • These are tables in 'CRISPRFinder format' that have CRISPR array info.
    • Let's take a look at one of the array files before loading them all.

In [157]:
# an example array file (obtained from CRISPRFinder)
arrayFile = os.path.join(workDir, 'array', 'Methanosarcina_acetivorans_C2A_1.txt')
!head $arrayFile


2379328	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	TCCGGAACTGGAAACCGTGTAATGGTAACCGATGACTA 	2379402	
2379403	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	TCCTCGATTTGATCACAGCATTTCTTAACGTGATAC 	2379475	
2379476	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	TAATCAAGCTCTTTTTGAGCCTGGTTCCCGGGTTCGAAT 	2379551	
2379552	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	ATCTCTCAGGTTGGGATGATATCGCCGAAATACGGT 	2379624	
2379625	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	TTACAACAGGGCTATGAAAAACACTTTGTACAGAAAGT 	2379699	
2379700	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	CTGAAGGCTTTCCCGGTAATCCCGAATTCAGCGT 	2379770	
2379771	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	ATATAAGCGTCTTTTCCTGTGTATGTGATCCACTTT 	2379843	
2379844	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	CCTTGTTGGTAATTAGTACAATGTTACCAGTATCAG 	2379916	
2379917	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	ATTCCTCCAACTGCTTTTTTAGCTGGTCTTCTGGAACT 	2379991	
2379992	ATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC	ACTCTAAAAGAAGAGCTTAATGAACTACGTATAGAA 	2380064	

Note: the array file consists of 4 columns (in the example above, column 2 is identical in every row, so it holds the direct repeat):

  1. direct-repeat start
  2. direct-repeat sequence
  3. spacer sequence
  4. spacer end

All extra columns are ignored.
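
The example rows can be used to sanity-check the layout. In them, column 2 is identical in every row (as expected for a direct repeat) and column 3 varies (as expected for spacers), so the sketch below reads the columns in that order and checks that the two coordinates span exactly one repeat plus one spacer. The function names are illustrative and not part of CLdb.

```python
def parse_array_line(line):
    """Split one array-file line into (start, repeat, spacer, end)."""
    fields = [x.strip() for x in line.rstrip('\n').split('\t')]
    start, repeat, spacer, end = fields[:4]   # extra columns are ignored
    return int(start), repeat, spacer, int(end)

def span_is_consistent(start, repeat, spacer, end):
    """True if end - start + 1 equals the repeat length plus the spacer length."""
    return end - start + 1 == len(repeat) + len(spacer)

# first row of the example array file shown above
line = ("2379328\tATTCGCGAGCAAGATCCACTAAAACAAGGATTGAAAC\t"
        "TCCGGAACTGGAAACCGTGTAATGGTAACCGATGACTA \t2379402\t")
rec = parse_array_line(line)
print(span_is_consistent(*rec))
```

Running the same check over every line of an array file is a cheap way to spot shifted columns or truncated sequences before loading.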


In [158]:
# loading CRISPR array info
!CLdb -- loadArrays


...Number of entries added/updated to 'DRs' table: 1181
...Number of entries added/updated to 'spacers' table: 1160

In [159]:
# This is just a quick summary of the database 
!CLdb -- summary


loci	broken	intact	7
loci	intact	broken	1
loci	intact	intact	12
loci	Total	NA	20
spacer	NA	All	1160
DR	NA	All	1181
genes	Total	NA	NULL
leaders	Total	NA	NA

Note: The output should show 1160 spacer & 1181 DR entries in the database

Loading CAS genes

  • Technically, all coding sequences in the region specified in the loci table (CAS_start, CAS_end) will be loaded.
  • This requires 2 subcommands:
    1. The 1st gets the gene info
    2. The 2nd loads the info into CLdb
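
To make the coordinate logic concrete, here is a small sketch of testing whether a gene falls inside a CAS region when start may be larger than end (negative-strand loci). It assumes full containment is required; how CLdb treats genes that only partially overlap the region is not stated in this notebook, and the function name in_region is hypothetical.

```python
def in_region(gene_start, gene_end, region_start, region_end):
    """True if the gene lies entirely inside the region, whatever the orientation."""
    lo, hi = sorted((region_start, region_end))
    g_lo, g_hi = sorted((gene_start, gene_end))
    return lo <= g_lo and g_hi <= hi

# locus 10 in the loci table: CAS_Start=3300128, CAS_End=3309235
cas_start, cas_end = 3300128, 3309235
print(in_region(3308151, 3308441, cas_start, cas_end))   # gene inside the CAS region
print(in_region(3296900, 3297500, cas_start, cas_end))   # made-up gene outside it
```

Sorting both coordinate pairs first means the same test works for loci written as start > end.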

In [160]:
geneDir = os.path.join(workDir, 'genes')
if not os.path.isdir(geneDir):
    os.makedirs(geneDir)

In [161]:
!cd $geneDir; \
    CLdb -- getGenesInLoci 2> CAS.log > CAS.txt
    
# checking output    
!cd $geneDir; \
    head -n 5 CAS.log; \
    echo -----------; \
    tail -n 5 CAS.log; \
    echo -----------; \
    head -n 5 CAS.txt


...Getting features in:
 file:			/home/nyoungb2/t/CLdb_Methanosarcina/genbank/Methanosarcina_mazei_Go1.gbk
 scaffold:		NC_003901
 region:		4095299-4080452
 CAS_status:		intact
-----------
 WARNING: Locus4 -> 1 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: Locus4 -> 6 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: Locus4 -> 4 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: Locus4 -> 3 does not have a FIG-PEG ID in a db_xref tag!
 WARNING: Locus4 -> 5 does not have a FIG-PEG ID in a db_xref tag!
-----------
Locus_ID	Gene_Id	Gene_start	Gene_end	Gene_length__AA	In_CAS	Gene_Alias	Sequence
10	GI:850492586::GeneID:24882689	3308151	3308441	96	yes	hypothetical protein	MGNIKFTSKEEREYTLISFEMDDLLSPEDLAAITPPNINGAKGVVLSGRGPIWLFCFLTHFYHPTKFIATYDPRLEGAVIVERHTSGYEIGSVIKC
10	GI:850478304::GeneID:24882684	3302926	3304689	587	yes	type III-B CRISPR-associated protein Cas10/Cmr2	MSQFLFLFTVGPVQSFIAQARKSQDLYSGSFLLSHLSDIAIYKLKTLVSSCDLIFPNKEIASKPNRFIAKIECEDPEKIGSELHNFVQNEYRKICEDIVTKLNLNAPEGFNQQIDYLLDLHWIALDFEEGEYASKFSELESYLGAVKNIRRFHQFQEAGRKCSLCGERNVLFYGGARKRAHVHDAERINNVSNKFISDGEGLCAVCLSKRFAGRHFKKKYHSDYPSTAEIALMDTLSKLDSSLLNDYKSFFGKDFDEQLYFKDNLTKKYFEKNGIKADPEISLKELAKITNIAEQTGLKFSTYYALLCLDGDNMGKWLSGKFLEDKDKSNLMDFHFDLTKKLGTYAIKVKDIVQNPKGIVVYSGGDDVLAFINLDYLLPVMKELREHFPAFEEFSYTKQGEKSSASCGVCIAHYKTPLQEVLTWARKMEHEAKSIDDNAKKDAFAIAVLKRSGEIHKTVFNWKYETLDTIEVLAELISLLKSPNNSETPPFSDSFVNKLNEEFNLLMNDEGNYSEFPLFETEIKRLINRSCMMVKNTGESETEYRSRKDQTIADITEKLCALARESRSLENFLSLLNTAVFIERGSN
10	GI:919171957::GeneID:24882687	3306791	3307201	136	yes	type III-B CRISPR module-associated protein Cmr5	MKGLEQGRAKFAYEKALVGSGIKKKKEYKAYVRKIPTLIKANGLGETFAFVKAKKVKRAEETDKPGYAYYLIYDQTSQWLKENGLLEPNTDLVKWVVSLDSPTYRAVTNEVLSLFKWLSRFSEGLIEGELENEKQE
10	GI:850492902::GeneID:24882681	3300812	3301252	146	yes	hypothetical protein	MVLSKEWINPDRVEAVMAELGLELNKEKTYVGTAANGFEFVGFYFEEIMEENGAGSVIRVMPTEGSIEKVVESIESIGSNVRSIDSTARIEKAHKDDENKVQALDNLIKNICDVVDPWRSYYRHTDYAAGLERIEPSFNEKIQKFI

In [162]:
# loading gene table into the database
!cd $geneDir; \
    CLdb -- loadGenes < CAS.txt


...Number of entries added/updated to 'genes' table: 192

Setting array sense strand

  • The strand that is transcribed needs to be defined in order to have the correct sequence for downstream analyses (e.g., blasting spacers and getting PAM regions)
  • The sense (reading) strand is defined by (order of precedence):
    • The leader region (if defined; none are defined in this example).
    • Array_start,Array_end in the loci table
      • The genome negative strand will be used if array_start > array_end
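
The array-coordinate rule can be written down in a few lines. This sketch covers only that rule; the leader-based rule takes precedence when leaders are defined, but its details are not described here, so it is left out. The '+'/'-' labels and the function name are illustrative choices, not CLdb's internal representation.

```python
def strand_from_array(array_start, array_end):
    """Return '-' if array_start > array_end, otherwise '+'."""
    return '-' if array_start > array_end else '+'

# values from the loci table
print(strand_from_array(682529, 679124))     # locus 12: start > end
print(strand_from_array(2379328, 2381265))   # locus 2: start < end
```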

In [163]:
!CLdb -- setSenseStrand


Setting sense strand for 20 loci...
...sense strand set based on array_start-array_end for:	20 loci
...sense strand set base on leader region for:	0 loci

Spacer and DR clustering

  • Clustering of spacer and/or DR sequences accomplishes:
    • A method of comparing within and between CRISPRs
    • Reducing redundancy for spacer and DR blasting
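
To illustrate what clustering at a sequence-identity cutoff means, here is a toy greedy clustering sketch. It is not the method CLdb uses: it compares sequences position by position without alignment, which a real clustering tool would not do, and all names are made up for the example.

```python
def identity(a, b):
    """Fraction of matching positions, relative to the longer sequence."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / float(max(len(a), len(b)))

def greedy_cluster(seqs, cutoff):
    """Put each sequence in the first cluster whose representative is at
    least `cutoff` identical; otherwise start a new cluster."""
    reps = []      # one representative sequence per cluster
    labels = []    # cluster index for each input sequence
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= cutoff:
                labels.append(i)
                break
        else:
            reps.append(s)
            labels.append(len(reps) - 1)
    return labels

seqs = ['ACGTACGTAC', 'ACGTACGTAC', 'ACGTACGTAA', 'TTTTTTTTTT']
for cutoff in (0.80, 1.00):
    n = len(set(greedy_cluster(seqs, cutoff)))
    print('cutoff {:.2f}: {} clusters'.format(cutoff, n))
```

A stricter cutoff splits near-identical sequences into separate clusters, which is why the cluster counts can only stay the same or grow as the cutoff rises.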

In [164]:
!CLdb -- clusterArrayElements -s -r


Deleted all entries in 'spacer_clusters' table
Deleted all entries in 'DR_clusters' table
Getting array element sequences. Orienting sequences by array_sense_strand

Clustering spacer sequences...
...Clustering spacers at cutoff: 0.80
	Number of clusters produced: 1100
...Clustering spacers at cutoff: 0.81
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.82
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.83
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.84
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.85
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.86
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.87
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.88
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.89
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.90
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.91
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.92
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.93
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.94
	Number of clusters produced: 1101
...Clustering spacers at cutoff: 0.95
	Number of clusters produced: 1102
...Clustering spacers at cutoff: 0.96
	Number of clusters produced: 1102
...Clustering spacers at cutoff: 0.97
	Number of clusters produced: 1102
...Clustering spacers at cutoff: 0.98
	Number of clusters produced: 1104
...Clustering spacers at cutoff: 0.99
	Number of clusters produced: 1104
...Clustering spacers at cutoff: 1.00
	Number of clusters produced: 1104

Clustering DR sequences...
...Clustering DRs at cutoff: 0.80
	Number of clusters produced: 11
...Clustering DRs at cutoff: 0.81
	Number of clusters produced: 11
...Clustering DRs at cutoff: 0.82
	Number of clusters produced: 12
...Clustering DRs at cutoff: 0.83
	Number of clusters produced: 12
...Clustering DRs at cutoff: 0.84
	Number of clusters produced: 15
...Clustering DRs at cutoff: 0.85
	Number of clusters produced: 15
...Clustering DRs at cutoff: 0.86
	Number of clusters produced: 15
...Clustering DRs at cutoff: 0.87
	Number of clusters produced: 15
...Clustering DRs at cutoff: 0.88
	Number of clusters produced: 15
...Clustering DRs at cutoff: 0.89
	Number of clusters produced: 15
...Clustering DRs at cutoff: 0.90
	Number of clusters produced: 16
...Clustering DRs at cutoff: 0.91
	Number of clusters produced: 16
...Clustering DRs at cutoff: 0.92
	Number of clusters produced: 21
...Clustering DRs at cutoff: 0.93
	Number of clusters produced: 21
...Clustering DRs at cutoff: 0.94
	Number of clusters produced: 22
...Clustering DRs at cutoff: 0.95
	Number of clusters produced: 26
...Clustering DRs at cutoff: 0.96
	Number of clusters produced: 26
...Clustering DRs at cutoff: 0.97
	Number of clusters produced: 28
...Clustering DRs at cutoff: 0.98
	Number of clusters produced: 38
...Clustering DRs at cutoff: 0.99
	Number of clusters produced: 38
...Clustering DRs at cutoff: 1.00
	Number of clusters produced: 38

Inserting/updating entries in CLdb...
...24360 spacer cluster entries added/updated
...24801 DR cluster entries added/updated

Creating indices for cluster ids...
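
The clustering log above reports one cluster count per sequence-identity cutoff. To see how the counts change with the cutoff, the log text can be parsed into (cutoff, count) pairs. The following is a minimal sketch, not part of CLdb: the function name `parse_cluster_log` is hypothetical, and it assumes the log was captured as a string and follows the exact line format shown above.

```python
import re

def parse_cluster_log(log_text):
    """Collect (cutoff, n_clusters) pairs per sequence type from a clustering log.

    Assumes lines of the form '...Clustering <type> at cutoff: <float>'
    followed by 'Number of clusters produced: <int>', as printed above.
    """
    results = {}
    seq_type, cutoff = None, None
    for line in log_text.splitlines():
        m = re.search(r'Clustering (\w+) at cutoff: ([\d.]+)', line)
        if m:
            # remember which sequence type and cutoff the next count belongs to
            seq_type, cutoff = m.group(1), float(m.group(2))
            continue
        m = re.search(r'Number of clusters produced: (\d+)', line)
        if m and seq_type is not None:
            results.setdefault(seq_type, []).append((cutoff, int(m.group(1))))
    return results

# a few lines copied from the output above
example_log = (
    "...Clustering spacers at cutoff: 0.99\n"
    "\tNumber of clusters produced: 1104\n"
    "Clustering DR sequences...\n"
    "...Clustering DRs at cutoff: 0.80\n"
    "\tNumber of clusters produced: 11\n"
    "...Clustering DRs at cutoff: 1.00\n"
    "\tNumber of clusters produced: 38\n"
)
cluster_counts = parse_cluster_log(example_log)
print(cluster_counts)
```

In this run the spacer counts barely move across cutoffs (1101 to 1104), while the DR counts rise from 11 to 38, so the cutoff matters far more for DR grouping than for spacer grouping.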

Database summary


In [165]:
# summary
!cd $workDir; \
    CLdb -- summary -name -subtype > summary.txt

# checking output
!cd $workDir; \
    cat summary.txt


loci	I-B	Methanosarcina_horonobensis_HB_1	intact	intact	1
loci	I-B	Methanosarcina_mazei_C16	intact	intact	1
loci	I-B	Methanosarcina_mazei_Go1	intact	intact	1
loci	I-B	Methanosarcina_mazei_SarPi	intact	intact	1
loci	I-B	Methanosarcina_mazei_WWM610	intact	intact	1
loci	I-D	Methanosarcina_mazei_S_6	broken	intact	1
loci	I-G	Methanosarcina_acetivorans_C2A	broken	intact	1
loci	I-G	Methanosarcina_barkeri_str_fusaro	broken	intact	1
loci	III-A	Methanosarcina_acetivorans_C2A	broken	intact	1
loci	III-B	Methanosarcina_mazei_C16	broken	intact	1
loci	III-B	Methanosarcina_mazei_WWM610	intact	intact	1
loci	III-C	Methanosarcina_barkeri_str_fusaro	intact	intact	1
loci	III-C	Methanosarcina_mazei_Go1	intact	intact	1
loci	III-C	Methanosarcina_mazei_S_6	intact	intact	1
loci	NA	Methanosarcina_horonobensis_HB_1	broken	intact	1
loci	NA	Methanosarcina_mazei_LYC	broken	intact	1
loci	VI-3	Methanosarcina_barkeri_str_fusaro	intact	intact	1
loci	VIII-3	Methanosarcina_mazei_SarPi	intact	intact	1
loci	VIII-4	Methanosarcina_horonobensis_HB_1	intact	intact	1
loci	VIII-4	Methanosarcina_mazei_LYC	intact	broken	1
loci	I-B	Methanosarcina_horonobensis_HB_1	Total	NA	1
loci	I-B	Methanosarcina_mazei_C16	Total	NA	1
loci	I-B	Methanosarcina_mazei_Go1	Total	NA	1
loci	I-B	Methanosarcina_mazei_SarPi	Total	NA	1
loci	I-B	Methanosarcina_mazei_WWM610	Total	NA	1
loci	I-D	Methanosarcina_mazei_S_6	Total	NA	1
loci	I-G	Methanosarcina_acetivorans_C2A	Total	NA	1
loci	I-G	Methanosarcina_barkeri_str_fusaro	Total	NA	1
loci	III-A	Methanosarcina_acetivorans_C2A	Total	NA	1
loci	III-B	Methanosarcina_mazei_C16	Total	NA	1
loci	III-B	Methanosarcina_mazei_WWM610	Total	NA	1
loci	III-C	Methanosarcina_barkeri_str_fusaro	Total	NA	1
loci	III-C	Methanosarcina_mazei_Go1	Total	NA	1
loci	III-C	Methanosarcina_mazei_S_6	Total	NA	1
loci	NA	Methanosarcina_horonobensis_HB_1	Total	NA	1
loci	NA	Methanosarcina_mazei_LYC	Total	NA	1
loci	VI-3	Methanosarcina_barkeri_str_fusaro	Total	NA	1
loci	VIII-3	Methanosarcina_mazei_SarPi	Total	NA	1
loci	VIII-4	Methanosarcina_horonobensis_HB_1	Total	NA	1
loci	VIII-4	Methanosarcina_mazei_LYC	Total	NA	1
spacer	I-B	Methanosarcina_horonobensis_HB_1	All	NA	NA	49
spacer	I-B	Methanosarcina_mazei_C16	All	NA	NA	75
spacer	I-B	Methanosarcina_mazei_Go1	All	NA	NA	46
spacer	I-B	Methanosarcina_mazei_SarPi	All	NA	NA	101
spacer	I-B	Methanosarcina_mazei_WWM610	All	NA	NA	35
spacer	I-D	Methanosarcina_mazei_S_6	All	NA	NA	60
spacer	I-G	Methanosarcina_acetivorans_C2A	All	NA	NA	30
spacer	I-G	Methanosarcina_barkeri_str_fusaro	All	NA	NA	50
spacer	III-A	Methanosarcina_acetivorans_C2A	All	NA	NA	26
spacer	III-B	Methanosarcina_mazei_C16	All	NA	NA	41
spacer	III-B	Methanosarcina_mazei_WWM610	All	NA	NA	83
spacer	III-C	Methanosarcina_barkeri_str_fusaro	All	NA	NA	18
spacer	III-C	Methanosarcina_mazei_Go1	All	NA	NA	80
spacer	III-C	Methanosarcina_mazei_S_6	All	NA	NA	72
spacer	NA	Methanosarcina_horonobensis_HB_1	All	NA	NA	44
spacer	NA	Methanosarcina_mazei_LYC	All	NA	NA	36
spacer	VI-3	Methanosarcina_barkeri_str_fusaro	All	NA	NA	24
spacer	VIII-3	Methanosarcina_mazei_SarPi	All	NA	NA	139
spacer	VIII-4	Methanosarcina_horonobensis_HB_1	All	NA	NA	19
spacer	VIII-4	Methanosarcina_mazei_LYC	All	NA	NA	132
spacers	I-B	Methanosarcina_horonobensis_HB_1	num_groups	1	49
spacers	I-B	Methanosarcina_mazei_C16	num_groups	1	75
spacers	I-B	Methanosarcina_mazei_Go1	num_groups	1	46
spacers	I-B	Methanosarcina_mazei_SarPi	num_groups	1	97
spacers	I-B	Methanosarcina_mazei_WWM610	num_groups	1	35
spacers	I-D	Methanosarcina_mazei_S_6	num_groups	1	59
spacers	I-G	Methanosarcina_acetivorans_C2A	num_groups	1	30
spacers	I-G	Methanosarcina_barkeri_str_fusaro	num_groups	1	50
spacers	III-A	Methanosarcina_acetivorans_C2A	num_groups	1	26
spacers	III-B	Methanosarcina_mazei_C16	num_groups	1	41
spacers	III-B	Methanosarcina_mazei_WWM610	num_groups	1	83
spacers	III-C	Methanosarcina_barkeri_str_fusaro	num_groups	1	17
spacers	III-C	Methanosarcina_mazei_Go1	num_groups	1	78
spacers	III-C	Methanosarcina_mazei_S_6	num_groups	1	72
spacers	NA	Methanosarcina_horonobensis_HB_1	num_groups	1	44
spacers	NA	Methanosarcina_mazei_LYC	num_groups	1	36
spacers	VI-3	Methanosarcina_barkeri_str_fusaro	num_groups	1	24
spacers	VIII-3	Methanosarcina_mazei_SarPi	num_groups	1	132
spacers	VIII-4	Methanosarcina_horonobensis_HB_1	num_groups	1	19
spacers	VIII-4	Methanosarcina_mazei_LYC	num_groups	1	132
DR	I-B	Methanosarcina_horonobensis_HB_1	All	NA	NA	50
DR	I-B	Methanosarcina_mazei_C16	All	NA	NA	76
DR	I-B	Methanosarcina_mazei_Go1	All	NA	NA	47
DR	I-B	Methanosarcina_mazei_SarPi	All	NA	NA	102
DR	I-B	Methanosarcina_mazei_WWM610	All	NA	NA	36
DR	I-D	Methanosarcina_mazei_S_6	All	NA	NA	61
DR	I-G	Methanosarcina_acetivorans_C2A	All	NA	NA	31
DR	I-G	Methanosarcina_barkeri_str_fusaro	All	NA	NA	51
DR	III-A	Methanosarcina_acetivorans_C2A	All	NA	NA	27
DR	III-B	Methanosarcina_mazei_C16	All	NA	NA	42
DR	III-B	Methanosarcina_mazei_WWM610	All	NA	NA	84
DR	III-C	Methanosarcina_barkeri_str_fusaro	All	NA	NA	19
DR	III-C	Methanosarcina_mazei_Go1	All	NA	NA	81
DR	III-C	Methanosarcina_mazei_S_6	All	NA	NA	73
DR	NA	Methanosarcina_horonobensis_HB_1	All	NA	NA	45
DR	NA	Methanosarcina_mazei_LYC	All	NA	NA	37
DR	VI-3	Methanosarcina_barkeri_str_fusaro	All	NA	NA	25
DR	VIII-3	Methanosarcina_mazei_SarPi	All	NA	NA	140
DR	VIII-4	Methanosarcina_horonobensis_HB_1	All	NA	NA	20
DR	VIII-4	Methanosarcina_mazei_LYC	All	NA	NA	134
DRs	I-B	Methanosarcina_horonobensis_HB_1	num_groups	1	1
DRs	I-B	Methanosarcina_mazei_C16	num_groups	1	3
DRs	I-B	Methanosarcina_mazei_Go1	num_groups	1	4
DRs	I-B	Methanosarcina_mazei_SarPi	num_groups	1	3
DRs	I-B	Methanosarcina_mazei_WWM610	num_groups	1	3
DRs	I-D	Methanosarcina_mazei_S_6	num_groups	1	2
DRs	I-G	Methanosarcina_acetivorans_C2A	num_groups	1	2
DRs	I-G	Methanosarcina_barkeri_str_fusaro	num_groups	1	3
DRs	III-A	Methanosarcina_acetivorans_C2A	num_groups	1	3
DRs	III-B	Methanosarcina_mazei_C16	num_groups	1	5
DRs	III-B	Methanosarcina_mazei_WWM610	num_groups	1	2
DRs	III-C	Methanosarcina_barkeri_str_fusaro	num_groups	1	2
DRs	III-C	Methanosarcina_mazei_Go1	num_groups	1	5
DRs	III-C	Methanosarcina_mazei_S_6	num_groups	1	2
DRs	NA	Methanosarcina_horonobensis_HB_1	num_groups	1	1
DRs	NA	Methanosarcina_mazei_LYC	num_groups	1	1
DRs	VI-3	Methanosarcina_barkeri_str_fusaro	num_groups	1	6
DRs	VIII-3	Methanosarcina_mazei_SarPi	num_groups	1	4
DRs	VIII-4	Methanosarcina_horonobensis_HB_1	num_groups	1	3
DRs	VIII-4	Methanosarcina_mazei_LYC	num_groups	1	7
genes	I-B	Methanosarcina_horonobensis_HB_1	yes	NA	9
genes	I-B	Methanosarcina_mazei_C16	yes	NA	9
genes	I-B	Methanosarcina_mazei_Go1	yes	NA	9
genes	I-B	Methanosarcina_mazei_SarPi	yes	NA	9
genes	I-B	Methanosarcina_mazei_WWM610	yes	NA	9
genes	I-D	Methanosarcina_mazei_S_6	yes	NA	12
genes	I-G	Methanosarcina_acetivorans_C2A	yes	NA	9
genes	I-G	Methanosarcina_barkeri_str_fusaro	yes	NA	8
genes	III-A	Methanosarcina_acetivorans_C2A	yes	NA	9
genes	III-B	Methanosarcina_mazei_C16	yes	NA	10
genes	III-B	Methanosarcina_mazei_WWM610	yes	NA	12
genes	III-C	Methanosarcina_barkeri_str_fusaro	no	NA	5
genes	III-C	Methanosarcina_barkeri_str_fusaro	yes	NA	8
genes	III-C	Methanosarcina_mazei_Go1	yes	NA	8
genes	III-C	Methanosarcina_mazei_S_6	yes	NA	8
genes	NA	Methanosarcina_horonobensis_HB_1	no	NA	1
genes	NA	Methanosarcina_horonobensis_HB_1	yes	NA	6
genes	NA	Methanosarcina_mazei_LYC	no	NA	1
genes	NA	Methanosarcina_mazei_LYC	yes	NA	13
genes	VI-3	Methanosarcina_barkeri_str_fusaro	yes	NA	6
genes	VIII-3	Methanosarcina_mazei_SarPi	yes	NA	9
genes	VIII-4	Methanosarcina_horonobensis_HB_1	yes	NA	11
genes	VIII-4	Methanosarcina_mazei_LYC	yes	NA	11
genes	I-B	Methanosarcina_horonobensis_HB_1	Total	NA	9
genes	I-B	Methanosarcina_mazei_C16	Total	NA	9
genes	I-B	Methanosarcina_mazei_Go1	Total	NA	9
genes	I-B	Methanosarcina_mazei_SarPi	Total	NA	9
genes	I-B	Methanosarcina_mazei_WWM610	Total	NA	9
genes	I-D	Methanosarcina_mazei_S_6	Total	NA	12
genes	I-G	Methanosarcina_acetivorans_C2A	Total	NA	9
genes	I-G	Methanosarcina_barkeri_str_fusaro	Total	NA	8
genes	III-A	Methanosarcina_acetivorans_C2A	Total	NA	9
genes	III-B	Methanosarcina_mazei_C16	Total	NA	10
genes	III-B	Methanosarcina_mazei_WWM610	Total	NA	12
genes	III-C	Methanosarcina_barkeri_str_fusaro	Total	NA	13
genes	III-C	Methanosarcina_mazei_Go1	Total	NA	8
genes	III-C	Methanosarcina_mazei_S_6	Total	NA	8
genes	NA	Methanosarcina_horonobensis_HB_1	Total	NA	7
genes	NA	Methanosarcina_mazei_LYC	Total	NA	14
genes	VI-3	Methanosarcina_barkeri_str_fusaro	Total	NA	6
genes	VIII-3	Methanosarcina_mazei_SarPi	Total	NA	9
genes	VIII-4	Methanosarcina_horonobensis_HB_1	Total	NA	11
genes	VIII-4	Methanosarcina_mazei_LYC	Total	NA	11
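
The summary is tab-delimited, so it is easy to aggregate in Python. The sketch below sums the spacer counts per genome. It is an illustration only: the function name `spacers_per_taxon` is hypothetical, and it assumes (from the output above, not from CLdb documentation) that the first field is the category, the third is the taxon name, the fourth is `All` for per-locus totals, and the last field is the count.

```python
from collections import defaultdict

def spacers_per_taxon(summary_text):
    """Sum the 'spacer ... All' counts for each taxon name.

    Column meanings are inferred from the summary output shown above.
    """
    totals = defaultdict(int)
    for line in summary_text.splitlines():
        fields = line.split('\t')
        # keep only per-locus spacer totals; skip 'spacers ... num_groups' rows
        if len(fields) >= 5 and fields[0] == 'spacer' and fields[3] == 'All':
            totals[fields[2]] += int(fields[-1])
    return dict(totals)

# rows copied from the summary above
example_summary = "\n".join([
    "spacer\tI-B\tMethanosarcina_mazei_SarPi\tAll\tNA\tNA\t101",
    "spacer\tVIII-3\tMethanosarcina_mazei_SarPi\tAll\tNA\tNA\t139",
    "spacers\tI-B\tMethanosarcina_mazei_SarPi\tnum_groups\t1\t97",
    "DR\tI-B\tMethanosarcina_mazei_SarPi\tAll\tNA\tNA\t102",
])
totals = spacers_per_taxon(example_summary)
print(totals)
```

To run it on the real file, pass the contents of `summary.txt` from the working directory, for example `spacers_per_taxon(open(os.path.join(workDir, 'summary.txt')).read())`.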

Next Steps

  • arrayBlast
    • BLAST spacers (and DRs) to get protospacers, PAM regions, and mismatches to the protospacer and SEED sequence
  • TODO: spacers_shared
    • Spacer sequences shared among CRISPRs
  • TODO: DR_consensus
    • Consensus sequences of direct repeats in each CRISPR
  • TODO: loci_plots
    • Plots of CRISPR arrays and CAS genes