Description:

  • The first thing to do is download a microbial genome dataset.
  • Here, we will be downloading 3 bacterial genomes with differing G+C contents:
    • Clostridium ljungdahlii DSM13528 (G+C = 31.1)
    • Escherichia coli 1303 (G+C = 50.7)
    • Streptomyces pratensis ATCC33331 (G+C = 71.1)

Setting variables

"workDir" is the path to the working directory for this analysis (where the files will be download to)


In [6]:
workDir = '../../t/SIPSim_example/'

Initializing

  • Loading packages & libraries
  • Make sure you have all of the dependencies!

In [7]:
import os

In [8]:
%load_ext rpy2.ipython

In [10]:
# making directories
## working directory
workDir = os.path.abspath(workDir)
if not os.path.isdir(workDir):
    os.makedirs(workDir)
%cd $workDir


/home/nick/notebook/SIPSim/t/SIPSim_example

In [11]:
# making directories
## genome directory
workDirGenome = os.path.join(workDir, 'genomes')
if not os.path.isdir(workDirGenome):
    os.mkdir(workDirGenome)  
print(workDirGenome)


/home/nick/notebook/SIPSim/t/SIPSim_example/genomes

SIPSim install & conda environment

This example uses a conda environment called SIPSim


In [37]:
!conda list -n SIPSim


# packages in environment at /home/nick/bin/anaconda3/envs/SIPSim:
#
biopython                 1.70                      <pip>
cairo                     1.12.18                       6  
cython                    0.25.2                   py27_0  
dill                      0.2.6                    py27_0  
docopt                    0.6.2                    py27_0  
fontconfig                2.11.1                        6  
freetype                  2.5.5                         1  
ghmm                      0.9                      py27_0    bioconda
glib                      2.50.2                        1  
gsl                       2.2.1                         0  
intervaltree              2.1.0                    py27_0    bioconda
libffi                    3.2.1                         1  
libgcc                    5.2.0                         0  
libgfortran               3.0.0                         1  
libiconv                  1.14                          0  
libpng                    1.6.17                        0  
libuuid                   1.0.3                         0  
libxcb                    1.12                          1  
libxml2                   2.9.4                         0  
matplotlib                1.4.3               np110py27_2  
mkl                       11.3.3                        0  
mpmath                    0.19                     py27_1  
multiprocess              0.70.5                    <pip>
multiprocess              0.70.5                   py27_0    conda-forge
nomkl                     1.0                           0  
numpy                     1.10.4             py27_nomkl_2  [nomkl]
openblas                  0.2.14                        4  
openssl                   1.0.2l                        0  
pandas                    0.18.1              np110py27_0  
pathos                    0.2.0                    py27_1    conda-forge
pathos                    0.2.1                     <pip>
pcre                      8.39                          1  
pip                       9.0.1                    py27_1  
pixman                    0.32.6                        0  
pox                       0.2.3                    py27_0    conda-forge
ppft                      1.6.4.7.1                py27_0    conda-forge
py2cairo                  1.10.0                   py27_2  
pyfasta                   0.5.2                    py27_0    bioconda
pymix                     0.8                      py27_0    bioconda
pyparsing                 2.0.3                    py27_0  
pyqt                      4.11.4                   py27_4  
python                    2.7.13                        0  
python-dateutil           2.3                      py27_0    bioconda
pytz                      2017.2                   py27_0  
qt                        4.8.7                         3  
readline                  6.2                           2  
scipy                     0.17.1          np110py27_nomkl_1  [nomkl]
setuptools                27.2.0                   py27_0  
sip                       4.18                     py27_0  
SIPSim                    0.2                       <pip>
SIPSim-cpp                0.0.0                     <pip>
six                       1.10.0                   py27_0  
sortedcontainers          1.5.7                    py27_0  
sqlite                    3.13.0                        0  
swig                      3.0.10                        0  
sympy                     1.1                      py27_0  
tk                        8.5.18                        0  
wheel                     0.29.0                   py27_0  
xorg-kbproto              1.0.7                         1    conda-forge
xorg-libice               1.0.9                         2    conda-forge
xorg-libsm                1.2.2                         2    conda-forge
xorg-libx11               1.6.4                         6    conda-forge
xorg-libxau               1.0.8                         3    conda-forge
xorg-libxdmcp             1.1.2                         3    conda-forge
xorg-libxext              1.3.3                         2    conda-forge
xorg-libxrender           0.9.10                        0    conda-forge
xorg-renderproto          0.11.1                        1    conda-forge
xorg-xextproto            7.3.0                         1    conda-forge
xorg-xproto               7.0.31                        6    conda-forge
zlib                      1.2.8                         3  

Let's check that SIPSim is installed properly


In [13]:
%%bash
source activate SIPSim 

SIPSim -l


#-- Commands --#
BD_shift
communities
DBL
deltaBD
diffusion
fragment_KDE
fragment_KDE_cat
fragment_parse
fragments
genome_download
genome_index
genome_rename
gradient_fractions
HRSIP
incorp_config_example
isotope_incorp
KDE_bandwidth
KDE_info
KDE_parse
KDE_plot
KDE_sample
KDE_select_taxa
OTU_add_error
OTU_PCR
OTU_sample_data
OTU_subsample
OTU_sum
OTU_table
OTU_wide_long
qSIP
qSIP_atom_excess
tree_sim

Downloading genomes

  • Downloading the genome sequences from NCBI based on their accession numbers.
  • If you had Taxonomy IDs, then you could use ncbi-genome-download instead for downloading.

In [15]:
taxa="""Clostridium_ljungdahlii_DSM_13528	NC_014328.1
Escherichia_coli_1303	NZ_CP009166.1
Streptomyces_pratensis_ATCC_33331	NC_016114.1
"""

genome_file = os.path.join(workDir, 'genome_list.txt')
with open(genome_file, 'wb') as oFH:
    oFH.write(taxa)
    
print 'File written: {}'.format(genome_file)


File written: /home/nick/notebook/SIPSim/t/SIPSim_example/genome_list.txt

In [16]:
%%bash -s $genome_file
source activate SIPSim

# downloading genomes
SIPSim genome_download -d genomes -n 3 $1


File written: genomes/Clostridium_ljungdahlii_DSM_13528.fna
File written: genomes/Escherichia_coli_1303.fna
File written: genomes/Streptomyces_pratensis_ATCC_33331.fna

In [18]:
!ls -thlc ./genomes


total 17M
-rw-rw-r-- 1 nick nick 7.1M Jul 13 14:33 Streptomyces_pratensis_ATCC_33331.fna
-rw-rw-r-- 1 nick nick 4.8M Jul 13 14:33 Escherichia_coli_1303.fna
-rw-rw-r-- 1 nick nick 4.5M Jul 13 14:33 Clostridium_ljungdahlii_DSM_13528.fna

Hopefully all 3 genomes downloaded (the files should be non-empty)

Renaming genome sequences

  • Let's make the genome sequences a bit simpler

In [19]:
# current sequence names
!grep ">" genomes/*fna | perl -pe 's/.+:>/>/'


>NC_014328.1 Clostridium ljungdahlii DSM 13528, complete genome
>NZ_CP009166.1 Escherichia coli 1303, complete genome
>NC_016114.1 Streptomyces pratensis ATCC 33331, complete genome

In [20]:
%%bash 
source activate SIPSim

# making sure each sequence is unique
find ./genomes/ -name "*fna" | \
    SIPSim genome_rename -n 3 --prefix genomes_rn -


File written: /home/nick/notebook/SIPSim/t/SIPSim_example/genomes_rn/Clostridium_ljungdahlii_DSM_13528.fna
File written: /home/nick/notebook/SIPSim/t/SIPSim_example/genomes_rn/Escherichia_coli_1303.fna
File written: /home/nick/notebook/SIPSim/t/SIPSim_example/genomes_rn/Streptomyces_pratensis_ATCC_33331.fna

In [21]:
# NEW sequence names
!grep ">" genomes_rn/*fna | perl -pe 's/.+:>/>/'


>NC_014328_1_Clostridium_ljungdahlii_DSM_13528
>NZ_CP009166_1_Escherichia_coli_1303
>NC_016114_1_Streptomyces_pratensis_ATCC_33331

Indexing genomes

  • One more step!
    • Creating genome indices is needed for the upcoming in-silico PCR step

We need MFPrimer_linux to be installed


In [30]:
%%bash
source activate SIPSim

# Checking that MFE_primer.py (and associated scripts) are installed
MFE_primer.py -h | head


usage: MFEprimer.py [options] -i primers.fasta -d Human.genomic.fasta

MFEprimer: A fast and thermodynamics-based PCR primer specificity checking
program.

optional arguments:
  -h, --help            show this help message and exit
  -i [INFILE], --infile [INFILE]
                        [Required] Primer sequences for specificity checking.
                        [File]

In [31]:
# changing the working directory
workDirGenome = os.path.join(workDir, 'genomes_rn')
%cd $workDirGenome


/home/nick/notebook/SIPSim/t/SIPSim_example/genomes_rn

In [32]:
# making index file (taxon_name<tab>taxon_genome_file_name)
indexFile = """Clostridium_ljungdahlii_DSM_13528 Clostridium_ljungdahlii_DSM_13528.fna
Escherichia_coli_1303 Escherichia_coli_1303.fna
Streptomyces_pratensis_ATCC_33331 Streptomyces_pratensis_ATCC_33331.fna""".replace(' ', '\t')

F = os.path.join(workDirGenome, 'genome_index.txt')
with open(F, 'wb') as oFH:
    oFH.write(indexFile)

print 'File written: {}'.format(F)


File written: /home/nick/notebook/SIPSim/t/SIPSim_example/genomes_rn/genome_index.txt

Note: This next step will use 3 processors (--np). Change this option if needed. Even with 3 processors, it will take a minute to complete

While you wait, here's turtle...

                       ____,------------------,______
                   ___/    \            /            \_____
                __/         \__________/              \___ \___
  ,^------.____/\           /          \              /   `----\_
  | (O))      /  \_________/            \____________/         \ \
  \_____,--' /   /         \            /            \          \ \
    \___,---|___/_______,---`----------'----,_________\__________\_\
              /  :__________________________/  :___________________/
             /   :          /   :          /   :          /   :
            /    :         /    :         /    :         /    :
        (^^^     )     (^^^     )     (^^^     )     (^^^     )
         ^^^^^^^^       ^^^^^^^^       ^^^^^^^^       ^^^^^^^^

In [33]:
%%bash
source activate SIPSim
# indexing genomes; saving log 
SIPSim genome_index \
    genome_index.txt \
    --fp . --np 3 \
    > index_log.txt


Indexing: "Clostridium_ljungdahlii_DSM_13528"
Indexing: "Escherichia_coli_1303"
Indexing: "Streptomyces_pratensis_ATCC_33331"
#-- All genomes indexed --#

In [36]:
# checking all of the files produced in the ./genome_rn/ directory
!ls -thlc


total 21M
-rw-rw-r-- 1 nick nick 4.6K Jul 13 14:40 index_log.txt
-rw-r--r-- 1 nick nick 8.0K Jul 13 14:39 Streptomyces_pratensis_ATCC_33331.fna.sqlite3.db
-rw-r--r-- 1 nick nick 8.0K Jul 13 14:39 Escherichia_coli_1303.fna.sqlite3.db
-rw-r--r-- 1 nick nick 8.0K Jul 13 14:39 Clostridium_ljungdahlii_DSM_13528.fna.sqlite3.db
-rw-rw-r-- 1 nick nick 1.8M Jul 13 14:39 Streptomyces_pratensis_ATCC_33331.fna.2bit
-rw-rw-r-- 1 nick nick 1.2M Jul 13 14:39 Escherichia_coli_1303.fna.2bit
-rw-rw-r-- 1 nick nick   91 Jul 13 14:39 Streptomyces_pratensis_ATCC_33331.fna.uni
-rw-rw-r-- 1 nick nick 1.2M Jul 13 14:39 Clostridium_ljungdahlii_DSM_13528.fna.2bit
-rw-rw-r-- 1 nick nick   81 Jul 13 14:39 Escherichia_coli_1303.fna.uni
-rw-rw-r-- 1 nick nick   91 Jul 13 14:39 Clostridium_ljungdahlii_DSM_13528.fna.uni
-rw-rw-r-- 1 nick nick  191 Jul 13 14:39 genome_index.txt
-rw-rw-r-- 1 nick nick 7.1M Jul 13 14:33 Streptomyces_pratensis_ATCC_33331.fna
-rw-rw-r-- 1 nick nick 4.8M Jul 13 14:33 Escherichia_coli_1303.fna
-rw-rw-r-- 1 nick nick 4.5M Jul 13 14:33 Clostridium_ljungdahlii_DSM_13528.fna

Next steps

...and now on to the simulation!

We will simulate some shotgun genome sequences.


In [ ]: