Dataset For Group Classification

This notebook does the following:

  • Download three bacterial genomes:

    • E. coli
    • M. tuberculosis
    • S. aureus
  • Simulate squiggles of 500 events from each genome

  • Save them into an HDF5 file as training and validation sets

The subsequent notebook will then train a model to classify these 500-event fragments by their source genome.

Warning: This example requires a few different libraries, such as scikit-bio, Keras and Theano. Your best bet is to work from an Anaconda distribution!

Warning: This example is not yet polished to the point that it is easy to run. Patience please!
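
For reference, a minimal environment setup might look like the cell below. The package list is my best guess at what this notebook needs; in particular, porekit may have to be installed from its source repository rather than from PyPI.

In [ ]:
# Assumed dependencies -- adjust to your environment.
!pip install biopython scikit-bio pandas numpy h5py keras theano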


In [1]:
# Please fill in your own e-mail address; NCBI requires it for Entrez queries.
from Bio import Entrez
Entrez.email = ""

In [2]:
import os
import porekit
import pandas as pd
import numpy as np
import random
import skbio
import h5py

These three species were the first thing that came to my mind for the question "What are three very different bacteria?". I searched for genomes in the RefSeq database and picked these accessions, mostly because they were the first in the search results.

Of course it would also be possible to use more genomes per species.

And yes, I know that instead of fancy sequencing and bioinformatics, a simple microscope and a couple of stains would suffice to distinguish these three groups ;-)


In [3]:
sources_list = [("E_Coli_Sakai", "E_Coli_Sakai.fasta", "NC_002695.1"),
                ("M_Tuberculosis", "M_Tuberculosis.fasta", "NC_000962.3"),
                ("S_Aureus", "S_Aureus.fasta", "AP008934.1"),
                ]
sources = pd.DataFrame(sources_list, columns=["short_name", "file_name", "accession"])
sources.index = sources.short_name
sources


Out[3]:
                    short_name             file_name    accession
short_name
E_Coli_Sakai      E_Coli_Sakai    E_Coli_Sakai.fasta  NC_002695.1
M_Tuberculosis  M_Tuberculosis  M_Tuberculosis.fasta  NC_000962.3
S_Aureus              S_Aureus        S_Aureus.fasta   AP008934.1

In [4]:
# Download from Entrez
for key, row in sources.iterrows():
    handle = Entrez.efetch(db="nucleotide", id=row.accession, rettype="fasta", retmode="text")
    data = handle.read()
    with open(row.file_name, "wt") as f:
        f.write(data)
sequences = {key: next(skbio.read(row.file_name, format="fasta"))
             for key, row in sources.iterrows()}
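
As a quick sanity check, the genome lengths can be inspected; each should be several megabases long, which matters for the megabase-parity split used further below:

In [ ]:
# Genome lengths in base pairs; all three should be in the millions.
{key: len(seq) for key, seq in sequences.items()}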

The "model" table inside Fast5 files references the voltages expected of all possible kmers. Both the event level and standard deviations are given with mean and standard deviation.


In [5]:
ru9 = pd.read_hdf("../data/ru9_meta.h5", "meta")
# Pick an arbitrary read with a long enough template and load its model table
fn = ru9[ru9.template_length > 3000].iloc[0].absolute_filename
fast5 = porekit.Fast5File(fn)
model = fast5.get_model()
model.head()


Out[5]:
                kmer  level_mean  level_stdv   sd_mean   sd_stdv       weight
kmer
b'AAAAAA'  b'AAAAAA'   62.784241    0.841324  0.664989  0.206892  1445.771698
b'AAAAAC'  b'AAAAAC'   58.017544    0.664955  0.704270  0.225492  1661.675444
b'AAAAAG'  b'AAAAAG'   62.984396    0.796764  0.689218  0.218302  1240.670417
b'AAAAAT'  b'AAAAAT'   61.414159    0.813223  0.827014  0.286941  1353.020413
b'AAAACA'  b'AAAACA'   56.506612    0.697534  0.713271  0.229829  1206.435764
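
To make the meaning of these columns concrete, here is a minimal sketch (my illustration, not porekit code) of drawing one simulated event level for a single k-mer:

In [ ]:
# Sketch: sample one event level for the k-mer AAAAAA.
# The model index holds byte strings, as shown in the table above.
params = model.loc[b'AAAAAA']
np.random.normal(params.level_mean, params.level_stdv)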

Generate Datasets

Sample subsequences from the genomes, then turn them into squiggles by drawing normally distributed events according to the model. This is done using porekit's make_squiggle function. make_dataset also partitions the genomes into pieces of 1 Mb, selecting odd- or even-numbered pieces according to the parity parameter. This makes sure training and validation data are sampled from different parts of the genome, so the validation accuracy should not be inflated by the recognition of individual genes/regions.
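
For intuition, squiggle generation could look roughly like the sketch below. This is an assumption about what make_squiggle does internally, not its actual source; it just draws one normally distributed event level per k-mer window.

In [ ]:
# Naive squiggle sketch (illustration only): slide a k-mer window along the
# fragment and draw one event level per k-mer from the model's parameters.
def naive_squiggle(fragment, model):
    k = len(model.index.values[0])
    s = str(fragment)
    kmers = [s[i:i + k].encode() for i in range(len(s) - k + 1)]
    params = model.loc[kmers]
    return np.random.normal(params.level_mean.values, params.level_stdv.values)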


In [ ]:
def make_dataset(num_samples, m=500, parity=0):
    """Sample num_samples squiggles of m events from each genome."""
    n_species = len(sources)
    k = len(model.index.values[0])  # k-mer length used by the model
    X = np.zeros((num_samples * n_species, m))
    y = np.zeros(num_samples * n_species)
    i = 0
    for key, row in sources.iterrows():
        seq = sequences[key]
        l = len(seq) - m - k
        y[i:i + num_samples] = i // num_samples  # class labels 0, 1, 2
        for j in range(i, i + num_samples):
            while True:
                a = np.random.randint(0, l)
                # Accept only start positions inside 1 Mb pieces of the
                # requested parity, so training and validation come from
                # disjoint parts of the genome.
                if (a // 1000000) % 2 != parity:
                    continue
                b = a + m + k
                sub = seq[a:b]
                if "N" not in sub:
                    break
            X[j, :] = porekit.make_squiggle(sub, model)
            if j % 100 == 0:
                print(j)
        i += num_samples
    return X, y

In [ ]:
# Create training dataset of 3 * 10000 squiggles with 500 events each

fn = 'gclassify_11.h5'
try:
    os.remove(fn)
except FileNotFoundError:
    pass

X1, y1 = make_dataset(10000, m=500, parity=0)
h5f = h5py.File(fn, 'w')
h5f.create_dataset('training/X', data=X1)
h5f.create_dataset('training/y', data=y1)

# Create validation dataset of 3 * 2500 squiggles with 500 events each
X, y = make_dataset(2500, m=500, parity=1)
h5f.create_dataset('validation/X', data=X)
h5f.create_dataset('validation/y', data=y)
h5f.close()
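
As an optional check, the file can be reopened read-only to confirm the dataset shapes:

In [ ]:
# Re-open the file and confirm the dataset shapes.
with h5py.File(fn, 'r') as f:
    print(f['training/X'].shape, f['training/y'].shape)
    print(f['validation/X'].shape, f['validation/y'].shape)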

In [ ]:
!ls -ltr
