Dataset For Group Classification

This notebook does the following:

  • Download three bacterial genomes:

    • E. coli
    • M. tuberculosis
    • S. aureus
  • Simulate squiggles of 500 events from each genome

  • Save them into an HDF5 file as training and validation sets

The subsequent notebook will then train a model to classify these 500-event fragments by their source genome.

Warning: This example requires a few different libraries, such as scikit-bio, Keras and Theano. Your best bet is to work from an Anaconda distribution!

Warning: This example is not yet polished to the point that it is easy to run. Patience please!
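
For reference, a minimal environment setup might look like the cell below. The package list is my best guess at what this notebook needs; in particular, porekit may have to be installed from its source repository rather than from PyPI.

In [ ]:
# Assumed dependencies -- adjust to your environment.
!pip install biopython scikit-bio pandas numpy h5py keras theano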


In [1]:
# Please fill in your own e-mail address; NCBI requires it for Entrez queries.
from Bio import Entrez
Entrez.email = ""

In [2]:
import os
import porekit
import pandas as pd
import numpy as np
import random
import skbio
import h5py

These three species were the first thing that came to my mind for the question "What are three very different bacteria?". I searched for genomes in the RefSeq database and picked these accessions, mostly because they were the first in the search results.

Of course it would also be possible to use more genomes per species.

And yes, I know that instead of fancy sequencing and bioinformatics, a simple microscope and a couple of stains would suffice to distinguish these three groups ;-)


In [3]:
sources_list = [("E_Coli_Sakai", "E_Coli_Sakai.fasta", "NC_002695.1"),
                ("M_Tuberculosis", "M_Tuberculosis.fasta", "NC_000962.3"),
                ("S_Aureus", "S_Aureus.fasta", "AP008934.1"),
                ]
sources = pd.DataFrame(sources_list, columns=["short_name", "file_name", "accession"])
sources.index = sources.short_name
sources


Out[3]:
                    short_name             file_name    accession
short_name
E_Coli_Sakai      E_Coli_Sakai    E_Coli_Sakai.fasta  NC_002695.1
M_Tuberculosis  M_Tuberculosis  M_Tuberculosis.fasta  NC_000962.3
S_Aureus              S_Aureus        S_Aureus.fasta   AP008934.1

In [4]:
# Download from Entrez
for key, row in sources.iterrows():
    handle = Entrez.efetch(db="nucleotide", id=row.accession, rettype="fasta", retmode="text")
    data = handle.read()
    with open(row.file_name, "wt") as f:
        f.write(data)
sequences = {key: next(skbio.read(row.file_name, format="fasta"))
             for key, row in sources.iterrows()}
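
As a quick sanity check, the genome lengths can be inspected; each should be several megabases long, which matters for the megabase-parity split used further below:

In [ ]:
# Genome lengths in base pairs; all three should be in the millions.
{key: len(seq) for key, seq in sequences.items()}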

The "model" table inside Fast5 files references the voltages expected of all possible kmers. Both the event level and standard deviations are given with mean and standard deviation.


In [5]:
ru9 = pd.read_hdf("../data/ru9_meta.h5", "meta")
# Pick an arbitrary read with a long enough template and load its model table
fn = ru9[ru9.template_length > 3000].iloc[0].absolute_filename
fast5 = porekit.Fast5File(fn)
model = fast5.get_model()
model.head()


Out[5]:
                kmer  level_mean  level_stdv   sd_mean   sd_stdv       weight
kmer
b'AAAAAA'  b'AAAAAA'   62.784241    0.841324  0.664989  0.206892  1445.771698
b'AAAAAC'  b'AAAAAC'   58.017544    0.664955  0.704270  0.225492  1661.675444
b'AAAAAG'  b'AAAAAG'   62.984396    0.796764  0.689218  0.218302  1240.670417
b'AAAAAT'  b'AAAAAT'   61.414159    0.813223  0.827014  0.286941  1353.020413
b'AAAACA'  b'AAAACA'   56.506612    0.697534  0.713271  0.229829  1206.435764
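
To make the meaning of these columns concrete, here is a minimal sketch (my illustration, not porekit code) of drawing one simulated event level for a single k-mer:

In [ ]:
# Sketch: sample one event level for the k-mer AAAAAA.
# The model index holds byte strings, as shown in the table above.
params = model.loc[b'AAAAAA']
np.random.normal(params.level_mean, params.level_stdv)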

Generate Datasets

Sample subsequences from the genomes, then turn them into squiggles by drawing normally distributed events according to the model. This is done using porekit's make_squiggle function. make_dataset also partitions the genomes into pieces of 1 Mb, selecting odd- or even-numbered pieces according to the parity parameter. This makes sure training and validation data are sampled from different parts of the genome, so the validation accuracy should not be inflated by the recognition of individual genes/regions.
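
For intuition, squiggle generation could look roughly like the sketch below. This is an assumption about what make_squiggle does internally, not its actual source; it just draws one normally distributed event level per k-mer window.

In [ ]:
# Naive squiggle sketch (illustration only): slide a k-mer window along the
# fragment and draw one event level per k-mer from the model's parameters.
def naive_squiggle(fragment, model):
    k = len(model.index.values[0])
    s = str(fragment)
    kmers = [s[i:i + k].encode() for i in range(len(s) - k + 1)]
    params = model.loc[kmers]
    return np.random.normal(params.level_mean.values, params.level_stdv.values)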


In [ ]:
def make_dataset(num_samples, m=500, parity=0):
    """Sample num_samples squiggles of m events from each genome."""
    n_species = len(sources)
    k = len(model.index.values[0])  # k-mer length used by the model
    X = np.zeros((num_samples * n_species, m))
    y = np.zeros(num_samples * n_species)
    i = 0
    for key, row in sources.iterrows():
        seq = sequences[key]
        l = len(seq) - m - k
        y[i:i + num_samples] = i // num_samples  # class labels 0, 1, 2
        for j in range(i, i + num_samples):
            while True:
                a = np.random.randint(0, l)
                # Accept only start positions inside 1 Mb pieces of the
                # requested parity, so training and validation come from
                # disjoint parts of the genome.
                if (a // 1000000) % 2 != parity:
                    continue
                b = a + m + k
                sub = seq[a:b]
                if "N" not in sub:
                    break
            X[j, :] = porekit.make_squiggle(sub, model)
            if j % 100 == 0:
                print(j)
        i += num_samples
    return X, y

In [ ]:
# Create training dataset of 3 * 10000 squiggles with 500 events each

fn = 'gclassify_11.h5'
try:
    os.remove(fn)
except FileNotFoundError:
    pass

X1, y1 = make_dataset(10000, m=500, parity=0)
h5f = h5py.File(fn, 'w')
h5f.create_dataset('training/X', data=X1)
h5f.create_dataset('training/y', data=y1)

# Create validation dataset of 3 * 2500 squiggles with 500 events each
X, y = make_dataset(2500, m=500, parity=1)
h5f.create_dataset('validation/X', data=X)
h5f.create_dataset('validation/y', data=y)
h5f.close()
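
As an optional check, the file can be reopened read-only to confirm the dataset shapes:

In [ ]:
# Re-open the file and confirm the dataset shapes.
with h5py.File(fn, 'r') as f:
    print(f['training/X'].shape, f['training/y'].shape)
    print(f['validation/X'].shape, f['validation/y'].shape)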

In [ ]:
!ls -ltr
