Goal:

  • Running CD-HIT on fullCyc OTUs & bac_genome1210 16S rRNA gene dataset
    • cutoff 97% seqID
    • ID OTUs with taxa from both datasets
      • target genomes

Setting variables


In [32]:
import os

baseDir = '/home/nick/notebook/SIPSim/dev/fullCyc/'
workDir = os.path.join(baseDir, 'CD-HIT')

rnammerSeqs = '/home/nick/notebook/SIPSim/dev/bac_genome1147/rnammer/bac_genome1147_16S.fna'
otuRepFile = '/home/nick/notebook/SIPSim/dev/fullCyc/OTU_reps.fna'
otuTaxFile = '/var/seq_data/fullCyc/MiSeq_16SrRNA/515f-806r/lib1-7/OTU_binning/otusn_tax/otusn_tax_assignments.txt'
genomeDir = '/var/seq_data/ncbi_db/genome/Jan2016/bac_complete_spec-rep1_rn/'

Init


In [33]:
import re
import glob
import itertools
import random
from pprint import pprint
%load_ext rpy2.ipython


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

In [34]:
%%R
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)

In [35]:
if not os.path.isdir(workDir):
    os.makedirs(workDir)
    
%cd $workDir


/home/nick/notebook/SIPSim/dev/fullCyc/CD-HIT

Making input fasta


In [36]:
# concatenating sequences
!cat $otuRepFile $rnammerSeqs > ssu_all.fna
!printf 'Number of sequences: '
!grep -c ">" ssu_all.fna


Number of sequences: 8868

CD-HIT run


In [9]:
!cd-hit-est -i ssu_all.fna -o ssu_all_cdhit -c 0.97 -d 0


================================================================
Program: CD-HIT, V4.6, Feb 20 2014, 09:04:54
Command: cd-hit-est -i ssu_all.fna -o ssu_all_cdhit -c 0.97 -d
         0

Started: Fri Jan 15 13:12:33 2016
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 8868
longest and shortest : 3198 and 239
Total letters: 7975133
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 9M
Buffer          : 1 X 12M = 12M
Table           : 1 X 16M = 16M
Miscellaneous   : 4M
Total           : 43M

Table limit with the given memory limit:
Max number of representatives: 464937
Max number of word counting entries: 94568193

comparing sequences from          0  to       8868
........
     8868  finished       4959  clusters

Apprixmated maximum memory consumption: 61M
writing new database
writing clustering information
program completed !

Total CPU time 6.84

Finding clusters with sequences from both datasets


In [10]:
inFile = os.path.join(workDir, 'ssu_all_cdhit.clstr')

tbl = {}
with open(inFile, 'rb') as inFH:
    clst_id = None
    for line in inFH:
        line = line.rstrip()
        if line.startswith('>'):
            clst_id = line.lstrip('>Cluster ')
            tbl[clst_id] = []
        else:
            tbl[clst_id].append(re.split('\t|, ', line))
            
print "Number of clusters loaded: {}".format(len(tbl.keys()))


Number of clusters loaded: 4959

In [11]:
# clusters that have '>OTU'  and '>rRNA' (OTU and genome)
def shared_clust(x):
    otus = any([y[2].startswith('>OTU') for y in x])
    genomes = any([y[2].startswith('>rRNA') for y in x])    
    return otus == True and genomes == True

tbl_f = {x:y for x,y in tbl.items() if shared_clust(y)}
print "Number of clusters with OTUs and genomes: {}".format(len(tbl_f.keys()))


Number of clusters with OTUs and genomes: 159

Getting taxonomic classification of OTUs in target clusters


In [14]:
# loading tax file
tax = {}
with open(otuTaxFile, 'rb') as inFH:
    for line in inFH:
        line = line.rstrip()
        if not line.startswith('OTU'):
            continue
        otu,cls,boot,_ = line.split('\t')
        cls = [x.lstrip(' __') for x in cls.split(';')]
        for i in range(8):
            try:
                len(cls[i])
            except IndexError:
                cls.append('Unclassified')
        tax[otu] = cls

In [15]:
def printDict(d, n=10):
    cnt = 0
    for x,y in d.items():
        pprint(x)
        print(y)
        cnt += 1
        if cnt >= n:
            break

In [16]:
printDict(tax, n=3)


'OTU.8469'
['Bacteria', 'Proteobacteria', 'Deltaproteobacteria', 'Bdellovibrionales', 'Bacteriovoracaceae', 'Peredibacter', 'uncultured_bacterium', 'Unclassified']
'OTU.11582'
['Bacteria', 'Acidobacteria', 'DA023', 'uncultured_bacterium', 'Unclassified', 'Unclassified', 'Unclassified', 'Unclassified']
'OTU.5687'
['Bacteria', 'Actinobacteria', 'Thermoleophilia', 'Solirubrobacterales', '480-2', 'uncultured_bacterium', 'Unclassified', 'Unclassified']

In [17]:
# adding taxonomic classifications to OTUs in target clusters

for clstr,x in tbl_f.items():
    for y in x:
        ID = y[2].lstrip('>')
        ID = re.sub('\.\.\..+','', ID)
        #print 'ID: "{}"'.format(ID)
        try:
            y.append(tax[ID])
        except KeyError:
            y.append(None)

In [18]:
# gut check: manual check of OTU classifications & genome names 

for clstr,x in tbl_f.items():
    print 'Cluster: {}'.format(clstr)
    for y in x:
        ID = y[2].lstrip('>')
        if ID.startswith('OTU'):
            # classifications
            try:
                print ':'.join(y[3])[:100]
            except IndexError:
                print ':'.join(y[3])
        elif ID.startswith('rRNA'):
            # genome names
            try:
                print ID[:100]
            except IndexError:
                print ID


Cluster: 212
Bacteria:Proteobacteria:Gammaproteobacteria:Pseudomonadales:Moraxellaceae:Acinetobacter:Unclassified
rRNA_NC_010611_Acinetobacter_baumannii_ACICU__Acinetobacter_baumannii_ACICU_3391663-3393187_DIR-... 
rRNA_NC_010611_Acinetobacter_baumannii_ACICU__Acinetobacter_baumannii_ACICU_40143-41667_DIR+... at +
rRNA_NC_010611_Acinetobacter_baumannii_ACICU__Acinetobacter_baumannii_ACICU_3891232-3892756_DIR-... 
rRNA_NC_010611_Acinetobacter_baumannii_ACICU__Acinetobacter_baumannii_ACICU_686949-688473_DIR+... at
rRNA_NC_010611_Acinetobacter_baumannii_ACICU__Acinetobacter_baumannii_ACICU_216608-218132_DIR+... at
rRNA_NC_010611_Acinetobacter_baumannii_ACICU__Acinetobacter_baumannii_ACICU_3424880-3426404_DIR-... 
rRNA_NC_014259_Acinetobacter_oleivorans_DR1__Acinetobacter_oleivorans_DR1_3479770-3481295_DIR-... at
rRNA_NC_014259_Acinetobacter_oleivorans_DR1__Acinetobacter_oleivorans_DR1_18773-20298_DIR+... at +/1
rRNA_NC_014259_Acinetobacter_oleivorans_DR1__Acinetobacter_oleivorans_DR1_4120139-4121673_DIR-... *
rRNA_NC_014259_Acinetobacter_oleivorans_DR1__Acinetobacter_oleivorans_DR1_541439-542964_DIR+... at +
rRNA_NC_014259_Acinetobacter_oleivorans_DR1__Acinetobacter_oleivorans_DR1_505907-507432_DIR+... at +
rRNA_NC_014259_Acinetobacter_oleivorans_DR1__Acinetobacter_oleivorans_DR1_3939237-3940762_DIR-... at
rRNA_NC_016603_Acinetobacter_calcoaceticus_PHEA_2_chromosome__Acinetobacter_calcoa_3321374-3322899_D
rRNA_NC_016603_Acinetobacter_calcoaceticus_PHEA_2_chromosome__Acinetobacter_calcoa_2678134-2679660_D
Cluster: 762
Bacteria:Proteobacteria:Alphaproteobacteria:Rhizobiales:Phyllobacteriaceae:Mesorhizobium:Unclassifie
rRNA_NC_019973_Mesorhizobium_australicum_WSM2073__Mesorhizobium_australicum_WSM207_1738466-1739938_D
rRNA_NC_019973_Mesorhizobium_australicum_WSM2073__Mesorhizobium_australicum_WSM207_1746046-1747518_D
rRNA_NC_014923_Mesorhizobium_ciceri_biovar_biserrulae_WSM1271_chromosome__Mesorhiz_1686680-1688152_D
rRNA_NC_014923_Mesorhizobium_ciceri_biovar_biserrulae_WSM1271_chromosome__Mesorhiz_1694553-1696026_D
rRNA_NC_002678_Mesorhizobium_loti_MAFF303099_DNA__Mesorhizobium_loti_MAFF303099_DN_2757519-2758991_D
rRNA_NC_002678_Mesorhizobium_loti_MAFF303099_DNA__Mesorhizobium_loti_MAFF303099_DN_2750006-2751478_D
rRNA_NC_015675_Mesorhizobium_opportunistum_WSM2075__Mesorhizobium_opportunistum_WS_1728175-1729647_D
rRNA_NC_015675_Mesorhizobium_opportunistum_WSM2075__Mesorhizobium_opportunistum_WS_1735934-1737406_D
Cluster: 663
Bacteria:Bacteroidetes:Cytophagia:Cytophagales:Cytophagaceae:Dyadobacter:Unclassified:Unclassified
rRNA_NC_013037_Dyadobacter_fermentans_DSM_18053__Dyadobacter_fermentans_DSM_18053_4611730-4613224_DI
rRNA_NC_013037_Dyadobacter_fermentans_DSM_18053__Dyadobacter_fermentans_DSM_18053_4003859-4005353_DI
rRNA_NC_013037_Dyadobacter_fermentans_DSM_18053__Dyadobacter_fermentans_DSM_18053_5058638-5060132_DI
rRNA_NC_013037_Dyadobacter_fermentans_DSM_18053__Dyadobacter_fermentans_DSM_18053_4639275-4640769_DI
Cluster: 767
Bacteria:Proteobacteria:Alphaproteobacteria:Rhizobiales:Brucellaceae:Ochrobactrum:Unclassified:Uncla
rRNA_NC_016795_Brucella_abortus_A13334_chromosome_1__complete_sequence__Brucella_a_606124-607596_DIR
rRNA_NC_016777_Brucella_abortus_A13334_chromosome_2__complete_sequence__Brucella_a_51294-52766_DIR-.
rRNA_NC_016795_Brucella_abortus_A13334_chromosome_1__complete_sequence__Brucella_a_810199-811671_DIR
rRNA_NC_016796_Brucella_canis_HSK_A52141_chromosome_2__complete_sequence__Brucella_893350-894822_DIR
rRNA_NC_016778_Brucella_canis_HSK_A52141_chromosome_1__complete_sequence__Brucella_1571481-1572953_D
rRNA_NC_016778_Brucella_canis_HSK_A52141_chromosome_1__complete_sequence__Brucella_1367069-1368541_D
rRNA_NC_022906_Brucella_ceti_TE10759_12_chromosome_2__complete_sequence__Brucella__189519-190991_DIR
rRNA_NC_022905_Brucella_ceti_TE10759_12_chromosome_1__complete_sequence__Brucella__396455-397927_DIR
rRNA_NC_022905_Brucella_ceti_TE10759_12_chromosome_1__complete_sequence__Brucella__198336-199808_DIR
rRNA_NC_012442_Brucella_melitensis_ATCC_23457_chromosome_II__complete_sequence__Br_1086380-1087852_D
rRNA_NC_012441_Brucella_melitensis_ATCC_23457_chromosome_I__complete_sequence__Bru_1608394-1609866_D
rRNA_NC_012441_Brucella_melitensis_ATCC_23457_chromosome_I__complete_sequence__Bru_1811793-1813265_D
rRNA_NC_013118_Brucella_microti_CCM_4915_chromosome_2__complete_sequence__Brucella_1121082-1122554_D
rRNA_NC_013119_Brucella_microti_CCM_4915_chromosome_1__complete_sequence__Brucella_1596962-1598434_D
rRNA_NC_013119_Brucella_microti_CCM_4915_chromosome_1__complete_sequence__Brucella_1801654-1803126_D
rRNA_NC_009504_Brucella_ovis_ATCC_25840_chromosome_II__complete_sequence__Brucella_1066188-1067660_D
rRNA_NC_009505_Brucella_ovis_ATCC_25840_chromosome_I__complete_sequence__Brucella__1596490-1597962_D
rRNA_NC_009505_Brucella_ovis_ATCC_25840_chromosome_I__complete_sequence__Brucella__1797640-1799112_D
rRNA_NC_015857_Brucella_pinnipedialis_B2_94_chromosome_1__complete_sequence__Bruce_1824429-1825901_D
rRNA_NC_015858_Brucella_pinnipedialis_B2_94_chromosome_2__complete_sequence__Bruce_1161698-1163170_D
rRNA_NC_015857_Brucella_pinnipedialis_B2_94_chromosome_1__complete_sequence__Bruce_1619360-1620832_D
rRNA_NC_004310_Brucella_suis_1330_chromosome_I__complete_sequence__Brucella_suis_1_1793668-1795140_D
rRNA_NC_004311_Brucella_suis_1330_chromosome_II__complete_sequence__Brucella_suis__1108145-1109617_D
rRNA_NC_004310_Brucella_suis_1330_chromosome_I__complete_sequence__Brucella_suis_1_1588606-1590078_D
rRNA_NC_009668_Ochrobactrum_anthropi_ATCC_49188_chromosome_2__complete_sequence__O_456833-458303_DIR
rRNA_NC_009667_Ochrobactrum_anthropi_ATCC_49188_chromosome_1__complete_sequence__O_1346251-1347721_D
rRNA_NC_009667_Ochrobactrum_anthropi_ATCC_49188_chromosome_1__complete_sequence__O_1084028-1085498_D
rRNA_NC_009668_Ochrobactrum_anthropi_ATCC_49188_chromosome_2__complete_sequence__O_1602089-1603559_D
Cluster: 132
Bacteria:Firmicutes:Bacilli:Lactobacillales:Streptococcaceae:Streptococcus:uncultured_bacterium:Uncl
rRNA_NC_009785_Streptococcus_gordonii_str__Challis_substr__CH1__Streptococcus_gord_661158-662702_DIR
rRNA_NC_009785_Streptococcus_gordonii_str__Challis_substr__CH1__Streptococcus_gord_1768564-1770108_D
rRNA_NC_009785_Streptococcus_gordonii_str__Challis_substr__CH1__Streptococcus_gord_2020574-2022118_D
rRNA_NC_009785_Streptococcus_gordonii_str__Challis_substr__CH1__Streptococcus_gord_2177015-2178559_D
rRNA_NC_009009_Streptococcus_sanguinis_SK36_chromosome__Streptococcus_sanguinis_SK_16386-17921_DIR+.
rRNA_NC_009009_Streptococcus_sanguinis_SK36_chromosome__Streptococcus_sanguinis_SK_1827868-1829403_D
rRNA_NC_009009_Streptococcus_sanguinis_SK36_chromosome__Streptococcus_sanguinis_SK_1933929-1935464_D
rRNA_NC_009009_Streptococcus_sanguinis_SK36_chromosome__Streptococcus_sanguinis_SK_124370-125905_DIR
Cluster: 131
Bacteria:Proteobacteria:Deltaproteobacteria:Myxococcales:Sorangiineae:Polyangiaceae:Sorangium:Unclas
rRNA_NC_021658_Sorangium_cellulosum_So0157_2__Sorangium_cellulosum_So0157_2_3603982-3605526_DIR+... 
rRNA_NC_021658_Sorangium_cellulosum_So0157_2__Sorangium_cellulosum_So0157_2_6708274-6709818_DIR-... 
rRNA_NC_021658_Sorangium_cellulosum_So0157_2__Sorangium_cellulosum_So0157_2_12129676-12131220_DIR-..
rRNA_NC_021658_Sorangium_cellulosum_So0157_2__Sorangium_cellulosum_So0157_2_4322994-4324538_DIR+... 
Cluster: 138
Bacteria:Proteobacteria:Deltaproteobacteria:Myxococcales:Nannocystineae:Haliangiaceae:Haliangium:Unc
rRNA_NC_013440_Haliangium_ochraceum_DSM_14365__Haliangium_ochraceum_DSM_14365_171442-172985_DIR+... 
rRNA_NC_013440_Haliangium_ochraceum_DSM_14365__Haliangium_ochraceum_DSM_14365_2898792-2900335_DIR+..
Cluster: 25
Bacteria:Firmicutes:Clostridia:Clostridiales:Veillonellaceae:Pelosinus:uncultured_bacterium:Unclassi
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_4381642-4383187_DIR-... at +
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_2255618-2257263_DIR+... *
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_4375737-4377282_DIR-... at +
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_2776214-2777851_DIR-... at +
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_31829-33374_DIR+... at +/97.
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_3215241-3216886_DIR-... at +
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_5043623-5045168_DIR-... at +
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_25880-27525_DIR+... at +/99.
rRNA_NZ_CP010978_Pelosinus_fermentans_JBW45__Pelosinus_fermentans_JBW45_5037863-5039408_DIR-... at +
Cluster: 27
Bacteria:Firmicutes:Clostridia:Clostridiales:Ruminococcaceae:Incertae_Sedis:Unclassified:Unclassifie
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_4055819-4057459_DIR-..
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_654800-656440_DIR+... 
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_3983643-3985283_DIR-..
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_605119-606759_DIR+... 
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_2914118-2915758_DIR-..
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_887840-889480_DIR+... 
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_453791-455431_DIR+... 
rRNA_NC_011898_Clostridium_cellulolyticum_H10__Clostridium_cellulolyticum_H10_1851052-1852690_DIR+..
Cluster: 22
Bacteria:Firmicutes:Clostridia:Clostridiales:Peptococcaceae:Desulfosporosinus:uncultured_bacterium:U
rRNA_NC_018068_Desulfosporosinus_acidiphilus_SJ4__Desulfosporosinus_acidiphilus_SJ_3672910-3674465_D
rRNA_NC_018068_Desulfosporosinus_acidiphilus_SJ4__Desulfosporosinus_acidiphilus_SJ_2953029-2954675_D
rRNA_NC_018068_Desulfosporosinus_acidiphilus_SJ4__Desulfosporosinus_acidiphilus_SJ**OUTPUT MUTED**

Notes:

  • At least most of the taxonomic classifications make sense for the genomes in each cluster

Writing out a list of target genomes and their corresponding OTUs

  • If an OTU has multiple associations with a genome, selecting 1 a random
    • ie., 1-to-1 association

In [24]:
# making a index of scaffolds for each genome
genome_list = glob.glob(os.path.join(genomeDir, '*.fna'))

def fasta_seqID(fastaFiles):
    file_seqID = {}
    for f in fastaFiles:
        with open(f, 'rb') as iFH:
            for line in iFH:
                if line.startswith('>'):
                    line = line.lstrip('>').rstrip()
                    file_seqID[line] = f
    return file_seqID

seqID_fastaID = fasta_seqID(genome_list)                    
print 'Index length: {}'.format(len(seqID_fastaID))


Index length: 1231

In [25]:
def write_targets(tbl, oFH, seqID_fastaID):
    """
    Args:
    tbl -- cd-hit results
    oFH -- output file handle
    seqID_fastaID -- dict(seqName : fastaFileName)
    """
    oFH.write('\t'.join(['cluster', 'ssu_ID', 'genome_fileID', 'genomeID',
                         'genome_seqID', 'OTU', 'OTU_taxonomy']) + '\n')
    for clstr,rows in tbl.items():
        # parsing cluster; getting all OTUs and genome_IDs
        targets = []
        otus = []
        for row in rows:
            ID = row[2].lstrip('>')
            ID = re.sub('\.\.\..+','', ID)        
            if ID.startswith('OTU'):
                otu = [ID, ':'.join(row[3])]
                otus.append(otu)
            elif ID.startswith('rRNA'):
                targets.append(ID)
                
                        
        # writing out list
        ## one 1 randomly selected genome is associated with OTU
        random.shuffle(targets)
        for otu, target in zip(otus, itertools.cycle(targets)):
            # genome sequence name
            seqID = target[5:]  # removing rRNA_
            seqID = re.sub('_\d+-\d+_DIR.+', '', seqID)
            # genome file name
            try:
                fileID = seqID_fastaID[seqID]
            except KeyError:
                msg = 'Cannot find "{}"'
                print msg.format(seqID)
                fileID = ''
                genomeID = ''
            else:
                # genome name                
                x = os.path.split(fileID)
                genomeID = os.path.splitext(x[1])[0]
            
            # writing row
            oFH.write('\t'.join([clstr, target, fileID, genomeID, seqID] + otu) + '\n')

In [26]:
outFile = os.path.join(workDir, 'target_taxa.txt')
with open(outFile, 'wb') as oFH:
    write_targets(tbl_f, oFH, seqID_fastaID)

Gut check on written file

  • Multiple OTUs will likely cluster with some representative genomes
    • Thus, the number of target genomes should be > target OTUs

In [27]:
!printf "Number of rows in table: "
!cd $workDir; \
    tail -n +2 target_taxa.txt | wc -l


Number of rows in table: 194

In [37]:
!printf "Number of clusters: "
!tail -n +2 target_taxa.txt | cut -f 1 | sort -u | wc -l


Number of clusters: 159

In [38]:
!printf "Number of target genomes: "
!tail -n +2 target_taxa.txt | cut -f 3 | sort -u | wc -l


Number of target genomes: 165

In [39]:
!printf "Number of OTUs: "
!tail -n +2 target_taxa.txt | cut -f 4 | sort -u | wc -l


Number of OTUs: 165

Comparing amplicon GC contents


In [54]:
repGenomeAmplicons = '/home/nick/notebook/SIPSim/dev/fullCyc/CD-HIT/ssu_all.fna'
fullCycAmplicons = '/var/seq_data/fullCyc/MiSeq_16SrRNA/515f-806r/lib1-7rs/OTU_binning/otusn.pick.fasta'

In [56]:
!seq_tools fasta_unwrap ssu_all.fna | grep ">OTU" -A 1| egrep -v "^--$" > ssu_all_OTU.fna

In [57]:
!seq_tools fasta_info --fn --tgc ssu_all_OTU.fna $repGenomeAmplicons $fullCycAmplicons


WARNING: the fasta contained duplicate sequences. Making a temporary fasta with renamed sequences.
ssu_all_OTU.fna	56.26
/home/nick/notebook/SIPSim/dev/fullCyc/CD-HIT/ssu_all.fna	55.07
/var/seq_data/fullCyc/MiSeq_16SrRNA/515f-806r/lib1-7rs/OTU_binning/otusn.pick.fasta	56.12

In [ ]: