Description:

  • This notebook goes through the creation and assessment of direct repeat (DR) consensus sequences

Before running this notebook:

User-defined variables


In [1]:
# directory where you want the spacer blasting to be done
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Ecoli/DR_consensus/"

Init


In [2]:
import os
from IPython.display import FileLinks
%load_ext rpy2.ipython

In [3]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)

In [4]:
# checking that CLdb is in $PATH & ~/.CLdb config file is set up
!CLdb --config-params


#-- Config params --#
DATABASE = /home/nyoungb2/t/CLdb_Ecoli/CLdb.sqlite

Creating DR consensus seqs & loading into CLdb


In [5]:
!CLdb -- loadDRConsensus


...Processing locus 'F'
...Processing locus 'D'
...Processing locus 'B'
...Processing locus 'H'
...Processing locus 'J'
...Processing locus 'A'
...Processing locus 'C'
...Processing locus 'I'
...Processing locus 'E'
...Processing locus 'G'
...10 entries added/updated in database

That's it! Now, the CLdb.sqlite file contains the DR consensus sequences for each CRISPR locus

Assessing DR consensus seqs


In [6]:
!CLdb -- DRconsensus2fasta -h


Usage:
    DRconsensus2fasta.pl [flags] > DR_consensus.fasta

  Required flags:
    -database <char>
        CLdb database.

  Optional flags:
    -IUPAC <bool>
        Get consensus sequences with IUPAC nucleotide ambiguity codes
        instead of consensus by threshold? [FALSE]

    -subtype <char>
        Refine query to specific a subtype(s) (>1 argument allowed).

    -taxon_id <char>
        Refine query to specific a taxon_id(s) (>1 argument allowed).

    -taxon_name <char>
        Refine query to specific a taxon_name(s) (>1 argument allowed).

    -verbose <bool>
        Verbose output. [FALSE]

    -help <bool>
        This help message

  For more information:
    CLdb_perldoc DRconsensus2fasta.pl


In [9]:
# writing out the consensus sequences
!cd $workDir; \
    CLdb -- DRconsensus2fasta > DR_consensus.fna
    
# checking output    
!cd $workDir; \
    head -n 6 DR_consensus.fna


>G|I-E
cggtttatccccgctggcgcggggaacac
>A|I-E
cggtttatccccgctggcgcggggaactc
>E|I-E
cggtttatccccgctggcgcggggaactc

Sequence naming is 'locus_ID'|'subtype'

Making a quick tree of DR consensus sequences

Alignment


In [17]:
!cd $workDir; \
    mafft --adjustdirection DR_consensus.fna > DR_consensus_aln.fna
    
!cd $workDir; \
    echo "#-------#"; \
    head -n 6 DR_consensus_aln.fna


nseq =  10
distance =  ktuples
iterate =  0
cycle =  2
nthread = 0
nadd = 0
inputfile = infile
thresholdtorev = 0.010000
generating 200PAM scoring matrix for nucleotides ... done
done
done
scoremtx = -1


 9 / 10  
makedirectionlist (nuc) Version 7.123b alg=m, model=DNA200 (2),  1.530 ( 4.590),  0.123 ( 0.369)
0 thread(s)
directionfile = _direction
inputfile = infile
10 x 29 - 28 d
nthread = 0
generating 200PAM scoring matrix for nucleotides ... done
done
done
scoremtx = -1
Gap Penalty = -1.53, +0.00, +0.00

tuplesize = 6, dorp = d


Making a distance matrix ..

WARNING : 2 unknown characters
    1 / 10
done.

Constructing a UPGMA tree ... 
    0 / 10
done.

Progressive alignment ... 
STEP     9 / 9 d
done.

disttbfast (nuc) Version 7.123b alg=A, model=DNA200 (2),  1.530 ( 4.590), -0.000 (-0.000)
0 thread(s)
nthread = 0
blosum 62 / kimura 200
generating 200PAM scoring matrix for nucleotides ... done
done
done
scoremtx = -1
Gap Penalty = -1.53, +0.00, +0.00
Making a distance matrix .. 
    0 / 9
done.

Constructing a UPGMA tree ... 
    0 / 10
done.

Progressive alignment ... 
STEP     9 /9 d
done.
tbfast (nuc) Version 7.123b alg=A, model=DNA200 (2),  1.530 ( 4.590), -0.000 (-0.000)
0 thread(s)


Strategy:
 FFT-NS-2 (Fast but rough)
 Progressive method (guide trees were built 2 times.)

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --legacygappenalty option.

#-------#
>G|I-E
cggtttatccccgctggcgcggggaacac
>A|I-E
cggtttatccccgctggcgcggggaactc
>E|I-E
cggtttatccccgctggcgcggggaactc

Tree inference


In [10]:
%%R
library(ape)

In [27]:
%%R -i workDir

inFile = file.path(workDir, 'DR_consensus_aln.fna')
seqs = read.dna(inFile, format='fasta')
seqs.dist = dist.dna(seqs)
plot(hclust(seqs.dist), main='Hierarchical clustering dendrogram')
plot(nj(seqs.dist), main='Neighbor-joining tree')