Using Nucleus and TensorFlow for DNA Sequencing Error Correction


In [0]:
# @markdown Copyright 2019 Google LLC. \
# @markdown SPDX-License-Identifier: Apache-2.0
# @markdown (license hidden in Colab)

# Copyright 2019 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Introduction

In this tutorial, we formulate DNA sequencing error correction as a multiclass classification problem and propose two deep learning solutions. Our first approach corrects errors in a single read, whereas the second approach, shown in Figure 1, builds a consensus from several reads to predict the correct DNA sequence. We implement the second approach using the Nucleus and TensorFlow libraries. Our goal is to show how Nucleus can be used alongside TensorFlow for solving machine learning problems in genomics.

Problem Overview

While DNA sequencing continues to become faster and cheaper, it is still an error-prone process. Error rates for raw data from next-generation sequencing (NGS) technologies developed by companies such as Illumina are around 1%. Error rates for increasingly popular third-generation technologies like those developed by Pacific Biosciences (PacBio) are around 15%. Sequencing errors can be divided into substitutions, insertions, and deletions, the last two of which are commonly referred to as indels. All of these errors can be detrimental to downstream analysis steps such as variant calling and genome assembly.

A simple approach for obtaining higher quality datasets is to discard data that likely contains errors, either by throwing away entire reads or trimming regions of low quality. This approach is not ideal as it leads to a smaller final dataset. In addition, certain sequence contexts have naturally higher error rates, leading to biases in sampling. Thus, there exists a large body of research that is focused on developing more sophisticated methods for error correction. Most methods that have been developed can be categorized into one of two groups:

  1. Methods that operate on a single read and aim to determine the correct read sequence
  2. Consensus-based methods that operate on several reads and aim to determine the correct underlying DNA sequence

Deep Learning Overview

Both of the methods that we formulate in this tutorial use deep neural networks, which learn functions that map inputs to outputs. A neural network consists of several layers of linear and nonlinear operations applied sequentially to the input. Neural networks have been successfully applied to various problems including image classification and natural language translation. More recently, they have also been used for problems in genomics, such as protein structure prediction and variant calling.

Nucleus

Our implementation relies on Nucleus, a library for processing genomics data developed by the Genomics team in Google Brain. Nucleus makes it easy to read, write, and analyze data in common genomics file formats like BAM, FASTA, and VCF using specialized reader and writer objects. Nucleus allows us to:

  • Query a VCF file for all variants in a given genomic region
  • Query a BAM file for all reads mapping to a given genomic range
  • Query a FASTA file for the reference sequence starting at a given position

We also use Nucleus to write data out to TFRecords, a binary file format that consists of protocol buffer messages and can be easily read by TensorFlow. After reading in the TFRecords, we use the TensorFlow Keras API to train and evaluate a convolutional neural network.
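
As a concrete illustration of that query pattern, below is a minimal sketch using the same reader classes we rely on later in this tutorial. The file paths refer to the files described in the Data section below, the region is arbitrary, and the code is only runnable once Nucleus and the data have been set up in the Setup section.

from nucleus.io import fasta
from nucleus.io import sam
from nucleus.io import vcf
from nucleus.protos import reads_pb2
from nucleus.util import ranges

# A small window on chromosome 20.
region = ranges.make_range(chrom='20', start=10000000, end=10000100)

# Known variants overlapping the window.
with vcf.VcfReader(input_path='NA12878_calls.vcf.gz') as vcf_reader:
  variants = list(vcf_reader.query(region=region))

# Reads mapped to the window.
with sam.SamReader(input_path='NA12878_sliced.bam',
                   read_requirements=reads_pb2.ReadRequirements()) as sam_reader:
  reads = list(sam_reader.query(region=region))

# Reference bases spanning the window.
with fasta.IndexedFastaReader(input_path='hs37d5.fa.gz') as ref_reader:
  ref_bases = ref_reader.query(region=region)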

Data

Below is a list of the files we use in the implementation. All of the data is publicly available, and the Appendix contains download links and instructions.

  • NA12878_sliced.bam: Illumina HiSeq reads from chromosome 20 (positions 10,000,000-10,100,000), downsampled to 30x coverage.
  • NA12878_sliced.bam.bai: Index for NA12878_sliced.bam.
  • NA12878_calls.vcf.gz: Truth set of variants for NA12878 from Genome in a Bottle.
  • NA12878_calls.vcf.gz.tbi: Index for NA12878_calls.vcf.gz.
  • hs37d5.fa.gz: The hs37d5 reference genome.
  • hs37d5.fa.gz.fai and hs37d5.fa.gz.gzi: Index files for hs37d5.fa.gz.

Questions or Comments?

If you have any questions or comments regarding this tutorial, do not hesitate to reach out! You can file an issue on the Nucleus GitHub page.

Setup

If you are new to Colab or Jupyter notebooks, we recommend that you first go through an introduction to working with Colab notebooks.

Obtain Data, Install Nucleus, and Import Packages

Run the below cells to obtain the data, install Nucleus, and import Python packages.

Note that the code for some cells is hidden for clarity. These cells are marked with the following text: (code hidden in Colab). If you wish to view the code for a hidden cell, double-click the cell. To hide the code again, double-click the markdown output on the right side.


In [0]:
#@markdown Run this cell to obtain the data. (code hidden in Colab)

%%capture
!gsutil cp gs://deepvariant/case-study-testdata/NA12878_sliced.bam NA12878_sliced.bam
!gsutil cp gs://deepvariant/case-study-testdata/NA12878_sliced.bam.bai NA12878_sliced.bam.bai

!wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -O NA12878_calls.vcf.gz
!wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz.tbi -O NA12878_calls.vcf.gz.tbi

!gsutil cp gs://deepvariant/case-study-testdata/hs37d5.fa.gz hs37d5.fa.gz
!gsutil cp gs://deepvariant/case-study-testdata/hs37d5.fa.gz.fai hs37d5.fa.gz.fai
!gsutil cp gs://deepvariant/case-study-testdata/hs37d5.fa.gz.gzi hs37d5.fa.gz.gzi

In [0]:
#@markdown Run this cell to install Nucleus and TensorFlow 2.0. (code hidden in Colab)

!pip install -q google-nucleus==0.2.2
!pip install -q tensorflow==2.0.0-alpha0

In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import random

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

from nucleus.io import fasta
from nucleus.io import sam
from nucleus.io import vcf
from nucleus.io.genomics_writer import TFRecordWriter
from nucleus.protos import reads_pb2
from nucleus.util import cigar
from nucleus.util import ranges
from nucleus.util import utils

In [0]:
# Define constants and utility functions.

# We will only allow simple alignments, specified by the below cigar string
# operators. If you are not familiar with cigar strings, you can read more
# at section 1.4 of this link: https://samtools.github.io/hts-specs/SAMv1.pdf
_ALLOWED_CIGAR_OPS = frozenset([cigar.CHAR_TO_CIGAR_OPS[op] for op in 'MX='])

# We will only allow certain bases.
_ALLOWED_BASES = 'ACGT'

_TRAIN = 'train.tfrecord'
_EVAL = 'eval.tfrecord'
_TEST = 'test.tfrecord'

Network Architecture

Convolutional neural networks are commonly used for computer vision tasks, but also work well for genomics. Each convolutional layer repeatedly applies learned filters to the input. Convolutional filters appearing early in the network learn to recognize low-level features in the input, like edges and color gradients in images, whereas later filters learn to recognize more complex compositions of the low-level features. For DNA sequence inputs, low-level convolutional filters act as motif detectors, similar to the position weight matrices of sequence logos.

For our implementation, we use a standard convolutional architecture consisting of two convolutional layers followed by three fully connected layers. We use nonlinear ReLU activations to increase the expressive capacity of our model. Max pooling after each convolutional layer shrinks the input volume, and dropout after each fully connected layer acts as a regularizer. The final softmax layer normalizes the logits to produce a valid probability distribution. The details of each layer can be found in the code below.


In [0]:
def build_model(hparams):
  """Convolutional neural network architecture."""

  l2_reg = tf.keras.regularizers.l2

  return tf.keras.models.Sequential([

      # Two convolution + maxpooling blocks
      layers.Conv1D(
          filters=16,
          kernel_size=5,
          activation=tf.nn.relu,
          kernel_regularizer=l2_reg(hparams.l2)),
      layers.MaxPool1D(pool_size=3, strides=1),
      layers.Conv1D(
          filters=16,
          kernel_size=3,
          activation=tf.nn.relu,
          kernel_regularizer=l2_reg(hparams.l2)),
      layers.MaxPool1D(pool_size=3, strides=1),

      # Flatten the input volume
      layers.Flatten(),

      # Two fully connected layers, each followed by a dropout layer
      layers.Dense(
          units=16,
          activation=tf.nn.relu,
          kernel_regularizer=l2_reg(hparams.l2)),
      layers.Dropout(rate=0.3),
      layers.Dense(
          units=16,
          activation=tf.nn.relu,
          kernel_regularizer=l2_reg(hparams.l2)),
      layers.Dropout(rate=0.3),

      # Output layer with softmax activation
      layers.Dense(units=len(_ALLOWED_BASES), activation='softmax')
  ])
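
If you want to inspect the resulting layer shapes before training, one option is to build the model with an explicit input shape and print a summary. The sketch below assumes the default window size of 21 that we define later in BaseHparams; build_model itself only reads the l2 attribute of its hparams argument.

import collections

# Stand-in hyperparameters; build_model only uses hparams.l2.
_ModelHparams = collections.namedtuple('_ModelHparams', ['l2'])

model = build_model(_ModelHparams(l2=0.001))
# Input shape: (batch, window_size, number of allowed bases).
model.build(input_shape=(None, 21, len(_ALLOWED_BASES)))
model.summary()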

Approach 1: Error Correction of Single Reads

To correct errors in sequenced reads, we can use deep learning to train a neural network on a more general task: filling in missing bases in DNA sequences. The goal of this approach is to develop a model that understands the grammar of DNA sequences. The grammar of real sequences alone likely does not contain enough information to develop a solution that can be used in production. Nonetheless, this serves as a straightforward example application.

For instructional purposes, we simplify the problem in the following ways:

  • Consider only regions with substitution errors and ignore indel errors
  • Consider only regions with no known variants

We can train the neural network on regions of the reference genome. The input to this network is a DNA sequence of fixed length, centered around the base we wish to predict. The output of the network is a distribution over the possible bases, and the final prediction is the base with highest probability. The label set is generated using the bases observed in the reference genome. Since we only use reads mapping to regions with no known truth variants, we can unambiguously denote the base present in the reference genome as the label.

We generate input sequences by splitting the reference genome into non-overlapping sections of a fixed length. At training, evaluation, and test time, we simulate missing bases by zeroing out a base in the reference sequence, as shown in Figure 3 (position 5). In addition to simulating missing data using the reference genome, we can also apply such a model to data from sequenced reads, specifically bases with quality scores below a threshold value.
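
As a rough sketch of this featurization (illustrative only; the remainder of this notebook implements Approach 2), one way to turn a reference window into a training example is to one-hot encode the window, zero out the center base, and use the index of the true center base as the label. The helper below is hypothetical and reuses the _ALLOWED_BASES constant defined above.

import numpy as np

def make_masked_example(ref_window):
  """One-hot encodes a reference window, masking the center base.

  Returns (features, label), where the center row of features is all zeros
  to simulate a missing base and label is the index of the true center base
  in _ALLOWED_BASES.
  """
  window_size = len(ref_window)
  center = window_size // 2
  features = np.zeros((window_size, len(_ALLOWED_BASES)), dtype=np.float32)
  for i, base in enumerate(ref_window):
    if i != center:
      features[i, _ALLOWED_BASES.index(base)] = 1.0
  label = _ALLOWED_BASES.index(ref_window[center])
  return features, label

features, label = make_masked_example('ACGTACGTACGTACGTACGTA')
print(features.shape, label)  # (21, 4) and 2, the index of 'G'.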

Approach 2: Consensus-Based Error Correction

The ultimate goal of error correction is to determine the underlying DNA sequence, as opposed to correcting an individual read. In this section, we aggregate a pileup of multiple reads and use their consensus to determine the DNA sequence directly, without the intermediate step of correcting individual reads. An example of a pileup is shown below in Figure 4. Note that the figure only shows the portions of the reads that fall within the window.

For instructional purposes, we again simplify the problem in the following ways:

  • Consider only regions with substitution errors and ignore indel errors
  • Consider only regions with no known variants

Unlike the first approach, we do not train this model on the reference genome. Instead, our training data comes from mapped Illumina HiSeq reads. The input to this network is a matrix of normalized base counts observed in mapped reads, centered around the position at which we wish to predict the correct base. A similar featurization is used by the authors of Clairvoyante, a neural network for variant calling, and in an example method by Jason Chin. The output of the network is a distribution over the possible bases, and the final prediction is the base with highest probability. Similar to the first approach, the label set is generated using the bases observed in the reference genome. We use a mix of examples that contain errors (at least one read in the pileup does not match the reference at the center position) and examples that do not contain errors (all reads in the pileup match the reference at the center position).

Data Processing

The data for this problem comes from mapped 148 bp Illumina HiSeq reads. To process the data, we compare each read to the reference sequence and denote any positions that differ from the reference as errors.

  • For reads containing one or more errors, we randomly choose an error and create one example centered at the corresponding position.
  • For reads containing no errors, we create one example centered at the middle position.

Once we have determined the genomic window for the example, we use Nucleus to query for all reads mapping to the window. We then build a normalized counts matrix, as shown above. For some of the reads, only a subset of all bases will fall inside the window, and we ignore the bases that fall outside the window.
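
As a small worked example of the counts featurization (hypothetical values): if four reads cover a given window position with the bases A, A, C, and A, the normalized row for that position is [0.75, 0.25, 0.0, 0.0] for A, C, G, and T.

import numpy as np

# Hypothetical bases observed at one window position across four reads.
observed = ['A', 'A', 'C', 'A']
counts = np.zeros(len(_ALLOWED_BASES))
for base in observed:
  counts[_ALLOWED_BASES.index(base)] += 1
print(counts / counts.sum())  # [0.75 0.25 0.   0.  ] for A, C, G, T.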

Implement Neural Network Pipeline

We divide up our pipeline into the following steps, for each of which we implement several functions.

  1. Generate TFRecords datasets
  2. Read data from TFRecords datasets
  3. Train the model

Step 1: Generate TFRecords Datasets

We generate TFRecords datasets for training, evaluation, and testing. All examples that do not meet the criteria expressed in is_usable_example are discarded.


In [0]:
def generate_tfrecord_datasets(hparams):
  """Writes out TFRecords files for training, evaluation, and test datasets."""
  if not os.path.exists(hparams.out_dir):
    os.makedirs(hparams.out_dir)

  # Fraction of examples in each dataset.
  train_eval_test_split = [0.7, 0.2, 0.1]
  num_train_examples = 0
  num_eval_examples = 0
  num_test_examples = 0

  # Generate training, test, and evaluation examples.
  with TFRecordWriter(os.path.join(hparams.out_dir, _TRAIN)) as train_out, \
       TFRecordWriter(os.path.join(hparams.out_dir, _EVAL)) as eval_out, \
       TFRecordWriter(os.path.join(hparams.out_dir, _TEST)) as test_out:
    all_examples = make_ngs_examples(hparams)
    for example in all_examples:
      r = random.random()
      if r < train_eval_test_split[0]:
        train_out.write(proto=example)
        num_train_examples += 1
      elif r < train_eval_test_split[0] + train_eval_test_split[1]:
        eval_out.write(proto=example)
        num_eval_examples += 1
      else:
        test_out.write(proto=example)
        num_test_examples += 1
  print('# of training examples: %d' % num_train_examples)
  print('# of evaluation examples: %d' % num_eval_examples)
  print('# of test examples: %d' % num_test_examples)


def make_ngs_examples(hparams):
  """Generator function that yields training, evaluation and test examples."""
  ref_reader = fasta.IndexedFastaReader(input_path=hparams.ref_path)
  vcf_reader = vcf.VcfReader(input_path=hparams.vcf_path)
  read_requirements = reads_pb2.ReadRequirements()
  sam_reader = sam.SamReader(
      input_path=hparams.bam_path, read_requirements=read_requirements)

  # Use a separate SAM reader to query for reads falling in the pileup range.
  sam_query_reader = sam.SamReader(
      input_path=hparams.bam_path, read_requirements=read_requirements)
  used_pileup_ranges = set()
  with ref_reader, vcf_reader, sam_reader, sam_query_reader:
    for read in sam_reader:

      # Check that read has cigar string present and allowed alignment.
      if not read.alignment.cigar:
        print('Skipping read, no cigar alignment found')
        continue
      if not has_allowed_alignment(read):
        continue

      # Obtain window that will be used to construct an example.
      read_range = utils.read_range(read)
      ref = ref_reader.query(region=read_range)
      pileup_range = get_pileup_range(hparams, read, read_range, ref)

      # Do not construct multiple examples with the same pileup range.
      pileup_range_serialized = pileup_range.SerializeToString()
      if pileup_range_serialized in used_pileup_ranges:
        continue
      used_pileup_ranges.add(pileup_range_serialized)

      # Get reference sequence, reads, and truth variants for the pileup range.
      pileup_reads = list(sam_query_reader.query(region=pileup_range))
      pileup_ref = ref_reader.query(region=pileup_range)
      pileup_variants = list(vcf_reader.query(region=pileup_range))
      if is_usable_example(pileup_reads, pileup_variants, pileup_ref):
        yield make_example(hparams, pileup_reads, pileup_ref, pileup_range)


def get_pileup_range(hparams, read, read_range, ref):
  """Returns a range that will be used to construct one example."""

  # Find error positions where read and reference differ.
  ngs_read_length = read_range.end - read_range.start
  error_indices = [
      i for i in range(ngs_read_length) if ref[i] != read.aligned_sequence[i]
  ]

  # If read and reference sequence are the same, create an example centered
  # at middle base of read.
  if not error_indices:
    error_idx = ngs_read_length // 2

  # If read and reference differ at one or more positions, create example
  # centered at a random error position.
  else:
    error_idx = random.choice(error_indices)

  error_pos = read_range.start + error_idx
  flank_size = hparams.window_size // 2
  return ranges.make_range(
      chrom=read_range.reference_name,
      start=error_pos - flank_size,
      end=error_pos + flank_size + 1)


def has_allowed_alignment(read):
  """Determines whether a read's CIGAR string has the allowed alignments."""
  return all([c.operation in _ALLOWED_CIGAR_OPS for c in read.alignment.cigar])


def is_usable_example(reads, variants, ref_bases):
  """Determines whether a particular reference region and read can be used."""
  # Discard examples with variants or no mapped reads.
  if variants or not reads:
    return False

  # Use only examples where all reads have simple alignment and allowed bases.
  for read in reads:
    if not has_allowed_alignment(read):
      return False
    if any(base not in _ALLOWED_BASES for base in read.aligned_sequence):
      return False

  # Reference should only contain allowed bases.
  if any(base not in _ALLOWED_BASES for base in ref_bases):
    return False
  return True


def make_example(hparams, pileup_reads, pileup_ref, pileup_range):
  """Takes in an input sequence and outputs tf.train.Example ProtocolMessages.

  Each example contains the following features: A counts, C counts, G counts,
  T counts, reference sequence, correct base label.
  """
  assert len(pileup_ref) == hparams.window_size
  example = tf.train.Example()
  base_counts = np.zeros(shape=[hparams.window_size, len(_ALLOWED_BASES)])

  for read in pileup_reads:
    read_position = read.alignment.position.position
    read_ints = [_ALLOWED_BASES.index(b) for b in read.aligned_sequence]
    one_hot_read = np.zeros((len(read_ints), len(_ALLOWED_BASES)))
    one_hot_read[np.arange(len(one_hot_read)), read_ints] = 1

    window_start = read_position - pileup_range.start
    window_end = window_start + len(read_ints)

    # If read falls outside of window, adjust start/end indices for window.
    window_start = max(0, window_start)
    window_end = min(window_end, hparams.window_size)

    # We consider four possible scenarios for each read and adjust start/end
    # indices to only include portions of read that overlap the window.
    # 1) Read extends past 5' end of window
    # 2) Read extends past 3' end of window
    # 3) Read extends past 5' and 3' ends of window
    # 4) Read falls entirely within window
    if window_start == 0 and window_end != hparams.window_size:
      read_start = pileup_range.start - read_position
      read_end = None
    if window_end == hparams.window_size and window_start != 0:
      read_start = None
      read_end = -1 * ((read_position + len(read_ints)) - pileup_range.end)
    if window_start == 0 and window_end == hparams.window_size:
      read_start = pileup_range.start - read_position
      read_end = read_start + hparams.window_size
    if window_start != 0 and window_end != hparams.window_size:
      read_start = None
      read_end = None
    base_counts[window_start:window_end] += one_hot_read[read_start:read_end]

  # Use fractions at each position instead of raw base counts.
  base_counts /= np.expand_dims(np.sum(base_counts, axis=-1), -1)

  # Save counts/fractions for each base separately.
  features = example.features
  for i in range(len(_ALLOWED_BASES)):
    key = '%s_counts' % _ALLOWED_BASES[i]
    features.feature[key].float_list.value.extend(list(base_counts[:, i]))

  features.feature['ref_sequence'].int64_list.value.extend(
      [_ALLOWED_BASES.index(base) for base in pileup_ref])
  flank_size = hparams.window_size // 2
  true_base = pileup_ref[flank_size]
  features.feature['label'].int64_list.value.append(
      _ALLOWED_BASES.index(true_base))

  return example

Step 2: Read TFRecords Datasets

We define an input function to read in the TFRecords dataset. Since TFRecords is an unstructured binary format, it is necessary to define the structure of the data in order to read it back in. Specifically, we must define the type and size of each field in proto_features.


In [0]:
def get_dataset(hparams, filename, num_epochs):
  """Reads in and processes the TFRecords dataset.

  Builds a pipeline that returns pairs of features, label.
  """

  # Define field names, types, and sizes for TFRecords.
  proto_features = {
      'A_counts':
          tf.io.FixedLenFeature(shape=[hparams.window_size], dtype=tf.float32),
      'C_counts':
          tf.io.FixedLenFeature(shape=[hparams.window_size], dtype=tf.float32),
      'G_counts':
          tf.io.FixedLenFeature(shape=[hparams.window_size], dtype=tf.float32),
      'T_counts':
          tf.io.FixedLenFeature(shape=[hparams.window_size], dtype=tf.float32),
      'ref_sequence':
          tf.io.FixedLenFeature(shape=[hparams.window_size], dtype=tf.int64),
      'label':
          tf.io.FixedLenFeature(shape=[1], dtype=tf.int64),
  }

  def _process_input(proto_string):
    """Helper function for input function that parses a serialized example."""

    parsed_features = tf.io.parse_single_example(
        serialized=proto_string, features=proto_features)

    # Stack counts/fractions for all bases to create input of dimensions
    # `hparams.window_size` x len(_ALLOWED_BASES).
    feature_columns = []
    for base in _ALLOWED_BASES:
      feature_columns.append(parsed_features['%s_counts' % base])
    features = tf.stack(feature_columns, axis=-1)
    label = parsed_features['label']
    return features, label

  ds = tf.data.TFRecordDataset(filenames=filename)
  ds = ds.map(map_func=_process_input)
  ds = ds.shuffle(buffer_size=10000, reshuffle_each_iteration=True)
  ds = ds.batch(batch_size=hparams.batch_size).repeat(count=num_epochs)
  return ds
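
As a quick sanity check (a sketch that assumes Step 1 has already written examples/train.tfrecord with the default hyperparameters), you can pull a single batch from the pipeline and inspect its shapes: each features batch has shape (batch_size, window_size, 4) and each label batch has shape (batch_size, 1).

import collections

# Stand-in hyperparameters; get_dataset only uses window_size and batch_size.
_DataHparams = collections.namedtuple('_DataHparams', ['window_size', 'batch_size'])

ds = get_dataset(_DataHparams(window_size=21, batch_size=256),
                 filename='examples/train.tfrecord', num_epochs=1)
for features, label in ds.take(1):
  print(features.shape, label.shape)  # e.g. (256, 21, 4) and (256, 1).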

Step 3: Train the Model

We use tf.keras to train and evaluate our model.


In [0]:
def run(hparams, use_existing_data=False, seed=1):
  """Creates a model, runs training and evaluation."""

  # Set seed for reproducibility.
  random.seed(seed)
  tf.random.set_seed(seed)

  if not use_existing_data:
    print('Generating data...')
    generate_tfrecord_datasets(hparams)

  train_dataset = get_dataset(
      hparams, filename=os.path.join(hparams.out_dir, _TRAIN), num_epochs=1)
  eval_dataset = get_dataset(
      hparams, filename=os.path.join(hparams.out_dir, _EVAL), num_epochs=1)
  test_dataset = get_dataset(
      hparams, filename=os.path.join(hparams.out_dir, _TEST), num_epochs=1)

  optimizer = tf.keras.optimizers.Adam(learning_rate=hparams.learning_rate)
  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      hparams.log_dir, histogram_freq=1)
  model = build_model(hparams)
  model.compile(
      optimizer=optimizer,
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      metrics=['accuracy'])

  print('Training the model. This should take ~6 minutes...')
  # model.fit accepts tf.data Datasets directly in TensorFlow 2.0.
  model.fit(
      train_dataset,
      epochs=hparams.total_epochs,
      validation_data=eval_dataset,
      callbacks=[tensorboard_callback],
      verbose=0)
  print('Training complete. Obtaining final metrics...')
  eval_metrics = model.evaluate(eval_dataset, verbose=0)
  test_metrics = model.evaluate(test_dataset, verbose=0)
  print('Final eval metrics - loss: %f - accuracy: %f' %
        (eval_metrics[0], eval_metrics[1]))
  print('Final test metrics - loss: %f - accuracy: %f' %
        (test_metrics[0], test_metrics[1]))

Train and Evaluate Neural Network

We define the hyperparameters to be used and train the model.

The cell below will print some metrics directly in this notebook, but you may also wish to view the progress of training using TensorBoard. Detailed documentation for using TensorBoard locally is available on the TensorFlow website. Through the Files tab in the sidebar, you can download the TensorBoard summary files, which are written to logs/.
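
If you prefer to monitor training directly inside the notebook, one option (assuming the TensorBoard notebook extension is available in your Colab runtime) is to load the extension and point it at the log directory used below:

%load_ext tensorboard
%tensorboard --logdir logs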


In [0]:
# Feel free to experiment with different values.
# A description of all hyperparameters is provided
# in the appendix.

class BaseHparams(object):
  """Default hyperparameters."""

  def __init__(self,
               total_epochs=100,
               learning_rate=0.004,
               l2=0.001,
               batch_size=256,
               window_size=21,
               ref_path='hs37d5.fa.gz',
               vcf_path='NA12878_calls.vcf.gz',
               bam_path='NA12878_sliced.bam',
               out_dir='examples',
               model_dir='ngs_model',
               log_dir='logs'):

    self.total_epochs = total_epochs
    self.learning_rate = learning_rate
    self.l2 = l2
    self.batch_size = batch_size
    self.window_size = window_size
    self.ref_path = ref_path
    self.vcf_path = vcf_path
    self.bam_path = bam_path
    self.out_dir = out_dir
    self.model_dir = model_dir
    self.log_dir = log_dir

In [0]:
# Delete existing files.
!rm -rf examples
!rm -rf ngs_model
!rm -rf logs

In [0]:
# This cell should take ~6 minutes to run with the default parameters.
hparams = BaseHparams()
run(hparams)

With the default parameters, the final accuracy for this model should be around 99%. Though neither of the examples we develop is complex enough to be deployed in production, we hope they will help developers learn to apply Nucleus and deep learning efficiently within genomics. Feel free to experiment with different model architectures, learning rates, and other hyperparameters.
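
For example, to retrain with a smaller learning rate for more epochs while reusing the TFRecords generated earlier (values chosen purely for illustration), you could run:

hparams = BaseHparams(learning_rate=0.001, total_epochs=150)
run(hparams, use_existing_data=True)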

Appendix

Hyperparameters

  • total_epochs (int): The number of epochs for which training is run.
  • learning_rate (float): The learning rate for the optimizer.
  • l2 (float): The L2 regularization strength used for the neural network layers.
  • batch_size (int): The number of examples used in one iteration of training, evaluation, and testing.
  • window_size (int): The number of bases to consider at once. This should be an odd number so that the center base has an equal number of flanking bases on each side.
  • ref_path (str): Path to the reference genome.
  • vcf_path (str): Path to the truth VCF.
  • bam_path (str): Path to the mapped reads.
  • out_dir (str): Path where the training, evaluation, and testing TFRecords files are written.
  • model_dir (str): Path where model checkpoints are saved. If a checkpoint already exists at this path, training will start from the checkpoint.
  • log_dir (str): Path where TensorBoard logs are saved.

Obtaining Original Data Files

Below are the commands that were used to obtain and process the data used for this tutorial.


In [0]:
%%script false
# The line above prevents this cell from running as the
# default Colab environment does not include the necessary software.

# NA12878_sliced.bam
samtools view -h \
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam \
20:10,000,000-10,100,000 \
-o NA12878_sliced.bam


# NA12878_sliced.bam.bai
samtools index NA12878_sliced.bam


# NA12878_calls.vcf.gz
wget \
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz \
-O NA12878_calls.vcf.gz


# NA12878_calls.vcf.gz.tbi
wget \
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz.tbi \
-O NA12878_calls.vcf.gz.tbi


# hs37d5.fa.gz
wget \
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz \
-O hs37d5.fa.gz


# hs37d5.fa.gz.fai
wget \
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai \
-O hs37d5.fa.gz.fai


# hs37d5.fa.gz.gzi
wget \
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.gzi \
-O hs37d5.fa.gz.gzi