Description:

This notebook goes through the creation and assessment of direct repeat (DR) consensus sequences

Before running this notebook:

run the Setup notebook

User-defined variables



In [1]:

    
# directory where you want the spacer blasting to be done
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Ecoli/DR_consensus/"

Init



In [2]:

    
import os
from IPython.display import FileLinks
%load_ext rpy2.ipython



In [3]:

    
if not os.path.isdir(workDir):
    os.makedirs(workDir)



In [4]:

    
# checking that CLdb is in $PATH & ~/.CLdb config file is set up
!CLdb --config-params









    



#-- Config params --#
DATABASE = /home/nyoungb2/t/CLdb_Ecoli/CLdb.sqlite

Creating DR consensus seqs & loading into CLdb



In [5]:

    
!CLdb -- loadDRConsensus









    



...Processing locus 'F'
...Processing locus 'D'
...Processing locus 'B'
...Processing locus 'H'
...Processing locus 'J'
...Processing locus 'A'
...Processing locus 'C'
...Processing locus 'I'
...Processing locus 'E'
...Processing locus 'G'
...10 entries added/updated in database

That's it! Now, the CLdb.sqlite file contains the DR consensus sequences for each CRISPR locus

Assessing DR consensus seqs



In [6]:

    
!CLdb -- DRconsensus2fasta -h









    



Usage:
    DRconsensus2fasta.pl [flags] > DR_consensus.fasta

  Required flags:
    -database <char>
        CLdb database.

  Optional flags:
    -IUPAC <bool>
        Get consensus sequences with IUPAC nucleotide ambiguity codes
        instead of consensus by threshold? [FALSE]

    -subtype <char>
        Refine query to specific a subtype(s) (>1 argument allowed).

    -taxon_id <char>
        Refine query to specific a taxon_id(s) (>1 argument allowed).

    -taxon_name <char>
        Refine query to specific a taxon_name(s) (>1 argument allowed).

    -verbose <bool>
        Verbose output. [FALSE]

    -help <bool>
        This help message

  For more information:
    CLdb_perldoc DRconsensus2fasta.pl



In [9]:

    
# writing out the consensus sequences
!cd $workDir; \
    CLdb -- DRconsensus2fasta > DR_consensus.fna
    
# checking output    
!cd $workDir; \
    head -n 6 DR_consensus.fna









    



>G|I-E
cggtttatccccgctggcgcggggaacac
>A|I-E
cggtttatccccgctggcgcggggaactc
>E|I-E
cggtttatccccgctggcgcggggaactc

Sequence naming is 'locus_ID'|'subtype'

Making a quick tree of DR consensus sequences

Alignment



In [17]:

    
!cd $workDir; \
    mafft --adjustdirection DR_consensus.fna > DR_consensus_aln.fna
    
!cd $workDir; \
    echo "#-------#"; \
    head -n 6 DR_consensus_aln.fna









    



nseq =  10
distance =  ktuples
iterate =  0
cycle =  2
nthread = 0
nadd = 0
inputfile = infile
thresholdtorev = 0.010000
generating 200PAM scoring matrix for nucleotides ... done
done
done
scoremtx = -1


 9 / 10  
makedirectionlist (nuc) Version 7.123b alg=m, model=DNA200 (2),  1.530 ( 4.590),  0.123 ( 0.369)
0 thread(s)
directionfile = _direction
inputfile = infile
10 x 29 - 28 d
nthread = 0
generating 200PAM scoring matrix for nucleotides ... done
done
done
scoremtx = -1
Gap Penalty = -1.53, +0.00, +0.00

tuplesize = 6, dorp = d


Making a distance matrix ..

WARNING : 2 unknown characters
    1 / 10
done.

Constructing a UPGMA tree ... 
    0 / 10
done.

Progressive alignment ... 
STEP     9 / 9 d
done.

disttbfast (nuc) Version 7.123b alg=A, model=DNA200 (2),  1.530 ( 4.590), -0.000 (-0.000)
0 thread(s)
nthread = 0
blosum 62 / kimura 200
generating 200PAM scoring matrix for nucleotides ... done
done
done
scoremtx = -1
Gap Penalty = -1.53, +0.00, +0.00
Making a distance matrix .. 
    0 / 9
done.

Constructing a UPGMA tree ... 
    0 / 10
done.

Progressive alignment ... 
STEP     9 /9 d
done.
tbfast (nuc) Version 7.123b alg=A, model=DNA200 (2),  1.530 ( 4.590), -0.000 (-0.000)
0 thread(s)


Strategy:
 FFT-NS-2 (Fast but rough)
 Progressive method (guide trees were built 2 times.)

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --legacygappenalty option.

#-------#
>G|I-E
cggtttatccccgctggcgcggggaacac
>A|I-E
cggtttatccccgctggcgcggggaactc
>E|I-E
cggtttatccccgctggcgcggggaactc

Tree inference



In [10]:

    
%%R
library(ape)



In [27]:

    
%%R -i workDir

inFile = file.path(workDir, 'DR_consensus_aln.fna')
seqs = read.dna(inFile, format='fasta')
seqs.dist = dist.dna(seqs)
plot(hclust(seqs.dist), main='Hierarchical clustering dendrogram')
plot(nj(seqs.dist), main='Neighbor-joining tree')