This notebook does the following:

- Download three bacterial genomes
- Simulate reads of length 500 from each genome

The subsequent notebook will then learn a model to classify these 500-event fragments into their source genomes.
Warning: This example requires a few different libraries, such as scikit-bio, Keras and Theano. Your best bet is to work from an Anaconda distribution!
Warning: This example is not yet polished to the point that it is easy to run. Patience please!
In [1]:
# Please set your own e-mail address; NCBI requires it for Entrez queries!
from Bio import Entrez
Entrez.email = ""
In [2]:
import os
import porekit
import pandas as pd
import numpy as np
import random
import skbio
import h5py
These three species were the first that came to mind for the question "What are three very different bacteria?". I searched for genomes in the RefSeq database and picked these accessions, mostly because they were the first in the search results.
Of course it would also be possible to use more genomes per species.
And yes, I know that instead of fancy sequencing and bioinformatics, a simple microscope and a couple of stains would suffice to distinguish these three groups ;-)
In [3]:
sources_list = [("E_Coli_Sakai", "E_Coli_Sakai.fasta", "NC_002695.1"),
("M_Tuberculosis", "M_Tuberculosis.fasta", "NC_000962.3"),
("S_Aureus", "S_Aureus.fasta", "AP008934.1"),
]
sources = pd.DataFrame(sources_list, columns=["short_name", "file_name", "accession"])
sources.index = sources.short_name
sources
Out[3]:
                    short_name             file_name    accession
short_name
E_Coli_Sakai      E_Coli_Sakai    E_Coli_Sakai.fasta  NC_002695.1
M_Tuberculosis  M_Tuberculosis  M_Tuberculosis.fasta  NC_000962.3
S_Aureus              S_Aureus        S_Aureus.fasta   AP008934.1
In [4]:
# Download each genome from Entrez and save it as a FASTA file
for key, row in sources.iterrows():
    handle = Entrez.efetch(db="nucleotide", id=row.accession, rettype="fasta", retmode="text")
    data = handle.read()
    with open(row.file_name, "wt") as f:
        f.write(data)

# Read the sequences back in as skbio sequence objects, keyed by short name
sequences = {key: next(skbio.read(row.file_name, format="fasta")) for key, row in sources.iterrows()}
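As a quick sanity check, we can print the length of each downloaded genome (an illustrative snippet, not part of the original pipeline):

In [ ]:
# Illustrative: print the length in bases of each genome
for key, seq in sequences.items():
    print(key, len(seq))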
The "model" table inside Fast5 files references the voltages expected of all possible kmers. Both the event level and standard deviations are given with mean and standard deviation.
In [5]:
# Load run metadata gathered earlier, pick a read with a long template,
# and extract the k-mer model table from its Fast5 file.
ru9 = pd.read_hdf("../data/ru9_meta.h5", "meta")
fn = ru9[ru9.template_length > 3000].iloc[0].absolute_filename
fast5 = porekit.Fast5File(fn)
model = fast5.get_model()
model.head()
Out[5]:
Sample subsequences from the genomes, then turn them into squiggles by drawing normally distributed events according to the model. This is done using porekit's make_squiggle function. make_dataset also partitions the genomes into pieces of 1 Mb, selecting odd- or even-numbered pieces according to the parity parameter. This makes sure training and validation data are sampled from different parts of the genome, so the validation accuracy should not be inflated by the recognition of individual genes/regions.
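For intuition, here is a minimal sketch of what such a synthesis could look like. This is an illustration, not porekit's actual implementation, and it assumes the model table is indexed by k-mer with level_mean and level_stdv columns:

In [ ]:
# Illustrative sketch only (assumes level_mean/level_stdv columns):
# slide a k-mer window over the sequence and draw one event per window.
def toy_squiggle(seq, model):
    k = len(model.index.values[0])
    s = str(seq)
    events = []
    for a in range(len(s) - k + 1):
        row = model.loc[s[a:a+k]]
        events.append(np.random.normal(row.level_mean, row.level_stdv))
    return np.array(events)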
In [ ]:
def make_dataset(num_samples, m=500, parity=0):
    """Sample num_samples squiggles of m events from each genome."""
    o = len(sources)
    k = len(model.index.values[0])  # k-mer length used by the model
    X = np.zeros((num_samples*o, m))
    y = np.zeros((num_samples*o))
    i = 0
    for key, row in sources.iterrows():
        seq = sequences[key]
        l = len(seq)-m-k
        y[i:i+num_samples] = i / num_samples  # class labels 0, 1, 2, ...
        for j in range(i, i+num_samples):
            # Draw start positions until one falls into a 1 Mb piece of the
            # requested parity and the subsequence contains no N bases.
            while True:
                a = np.random.randint(0, l)
                if (a // 1e6) % 2 != parity:
                    continue
                b = a+m+k
                sub = seq[a:b]
                if not "N" in sub:
                    break
            X[j,:] = porekit.make_squiggle(sub, model)
            if j % 100 == 0:
                print(j)
        i += num_samples
    return X, y
In [ ]:
# Create training dataset of 3*10000*500 events
fn = 'gclassify_11.h5'
try:
    os.remove(fn)
except FileNotFoundError:
    pass
X1, y1 = make_dataset(10000, m=500, parity=0)
h5f = h5py.File(fn, 'w')
h5f.create_dataset('training/X', data=X1)
h5f.create_dataset('training/y', data=y1)
# Create validation dataset of 3*2500*500 events
X, y = make_dataset(2500, m=500, parity=1)
h5f.create_dataset('validation/X', data=X)
h5f.create_dataset('validation/y', data=y)
h5f.close()
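The subsequent notebook can then read the arrays back, for example like this:

In [ ]:
# Example of loading the datasets again (as the next notebook will do)
with h5py.File('gclassify_11.h5', 'r') as h5f:
    X_train = h5f['training/X'][:]
    y_train = h5f['training/y'][:]
    X_val = h5f['validation/X'][:]
    y_val = h5f['validation/y'][:]
print(X_train.shape, X_val.shape)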
In [ ]:
!ls -ltr