In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

Introduction to Porekit-Python

Disclaimer

Porekit is the result of my personal interest in nanopore sequencing. I'm not affiliated with Oxford Nanopore Technologies, or any MAP participant. This means a lot of the factual information presented in this notebook may be wrong.

General philosophy and goal of Porekit-Python

This library is meant to provide tools for interactively exploring nanopore data inside the Jupyter notebook, and for writing simple scripts or more complex software dealing with nanopore data. Therefore a lot of attention has been given to make interactive use easy and painless, and to keep the code in the background flexible and exposed to library users.

What Oxford Nanopore Data looks like

The MinION sequencer is attached to a laptop running MinKnow. This program connects directly to the MinION device and tells it what to do. Optionally, third party software can connect to an API inside the primary software to remote control the sequencing process. That is not covered here, though.

In a nutshell, nanopore sequencing works by dragging a DNA molecule through a tiny pore in a membrane. As the DNA passes, the voltage difference between the two sides of the membrane change, depending on the electrochemical properties of the passing nucleotides. This means that, at the core of the nanopore data, there is a timeseries of voltage measurements, which is called the "squiggle".

The process to convert the squiggle into a sequence of DNA letters is called base calling. The current MinKnow software uploads the squiggle to Metrichor servers, which perform the base calling, and send the result back to the user's computer.

The result of a sequencing run is a collection of FAST5 files, each containing data on one molecule of DNA which passed through one of currently 512 channels in the flowcell. These files are stored on disk, usually in one directory per run. A convention seems to be to name each file with a unique and descriptive string:


In [2]:
!ls /home/andi/nanopore/GenomeRU2/downloads/pass/ | tail -n 10


PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file64_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file67_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file6_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file74_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file86_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file90_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file92_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file95_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file97_strand.fast5
PLSP57501_20151028_Mk1_lambda_RU9_2752_1_ch9_file9_strand.fast5

These files belong to data publishd by Quick et al. http://europepmc.org/abstract/MED/25386338;jsessionid=ijHIHUVXlcpxeTzVUihz.0

Gathering Metadata

The following snippet will extract meta data from all of my downloaded nanopore data, searching directories recursively.


In [3]:
import porekit
everything = porekit.gather_metadata("/home/andi/nanopore/", workers=4)


/home/andi/anaconda3/lib/python3.5/site-packages/skbio/io/registry.py:547: FormatIdentificationWarning: <WrappedBufferedRandom> does not look like a fastq file
  % (file, fmt), FormatIdentificationWarning)
Process ForkPoolWorker-4:
Traceback (most recent call last):
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/andi/nanopore/porekit-python/porekit/porekit.py", line 240, in get_fast5_file_metadata
    fast5 = Fast5File(file_name)
  File "/home/andi/nanopore/porekit-python/porekit/porekit.py", line 17, in __init__
    super().__init__(filename, mode, **kwargs)
  File "/home/andi/anaconda3/lib/python3.5/site-packages/h5py/_hl/files.py", line 260, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/andi/anaconda3/lib/python3.5/site-packages/h5py/_hl/files.py", line 109, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
KeyboardInterrupt
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/andi/nanopore/porekit-python/porekit/porekit.py", line 240, in get_fast5_file_metadata
    fast5 = Fast5File(file_name)
  File "/home/andi/nanopore/porekit-python/porekit/porekit.py", line 17, in __init__
    super().__init__(filename, mode, **kwargs)
  File "/home/andi/anaconda3/lib/python3.5/site-packages/h5py/_hl/files.py", line 260, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/andi/anaconda3/lib/python3.5/site-packages/h5py/_hl/files.py", line 109, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
KeyboardInterrupt
ERROR: Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/home/andi/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-c5e67e82c3ed>", line 2, in <module>
    everything = porekit.gather_metadata("/home/andi/nanopore/", workers=4)
  File "/home/andi/nanopore/porekit-python/porekit/porekit.py", line 296, in gather_metadata
  File "/home/andi/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 939, in from_records
    first_row = next(data)
  File "/home/andi/nanopore/porekit-python/porekit/porekit.py", line 272, in gather_metadata_records
    record = get_fast5_file_metadata(file_name, plugins, raise_errors=raise_errors)
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/pool.py", line 602, in get
    self.wait(timeout)
  File "/home/andi/anaconda3/lib/python3.5/multiprocessing/pool.py", line 599, in wait
    self._event.wait(timeout)
  File "/home/andi/anaconda3/lib/python3.5/threading.py", line 549, in wait
    signaled = self._cond.wait(timeout)
  File "/home/andi/anaconda3/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/andi/anaconda3/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 1827, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/andi/anaconda3/lib/python3.5/site-packages/IPython/core/ultratb.py", line 1118, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/home/andi/anaconda3/lib/python3.5/site-packages/IPython/core/ultratb.py", line 300, in wrapped
    return f(*args, **kwargs)
  File "/home/andi/anaconda3/lib/python3.5/site-packages/IPython/core/ultratb.py", line 345, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/home/andi/anaconda3/lib/python3.5/inspect.py", line 1453, in getinnerframes
    frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
  File "/home/andi/anaconda3/lib/python3.5/inspect.py", line 1410, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/home/andi/anaconda3/lib/python3.5/inspect.py", line 672, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/home/andi/anaconda3/lib/python3.5/inspect.py", line 715, in getmodule
    f = getabsfile(module)
  File "/home/andi/anaconda3/lib/python3.5/inspect.py", line 684, in getabsfile
    _filename = getsourcefile(object) or getfile(object)
  File "/home/andi/anaconda3/lib/python3.5/inspect.py", line 669, in getsourcefile
    if os.path.exists(filename):
  File "/home/andi/anaconda3/lib/python3.5/genericpath.py", line 19, in exists
    os.stat(path)
KeyboardInterrupt
Unfortunately, your original traceback can not be constructed.

---------------------------------------------------------------------------

The result is a Pandas DataFrame object, which is too big to comfortably view in its entirety, but still comparatively "small data". Here is a subset of it:


In [ ]:
everything[['asic_id', 'channel_number', 'template_length', 'complement_length']].head()

All of the columns available:


In [ ]:
everything.columns

The philosopy of porekit is to gather the metadata once and then store it in a different format. This makes it easier and faster to analyse the metadata or use it in another context, for example with alignment data. In general, Fast5 Files don't change after MinKNOW and Metrichor, and possibly some third party programs are done with it.

The recommended file format to save this metadata is the "Feather File Format" (https://github.com/wesm/feather). It is a binary format for DataFrames, and has excellent support for R.

Feather's documentation emphasizes the file format can still change, and thus should not be used for long term storage. It's still fine for most medium term purposes, and even in the longer term, pinning the package versions will make older data available just as easily.


In [ ]:
everything.to_hdf("everything.h5", "meta")

Grouping by Device, ASIC and Run Ids


In [ ]:
g = everything.groupby(['device_id', 'asic_id', 'run_id'])

In [ ]:
df = g.template_length.agg([lambda v: len(v), np.mean, np.max])
df.columns = ['Count', 'Mean template length', 'Max template_length']
df

As you can see, I have downloaded several nanopore sets from ENA. These are mostly incomplete sets, since I was interested more in the variety of data rather than the completeness. You can easily use wget to download a tarball from ENA, then extract the partial download. The last file will be truncated, but the rest is usable.