Using the Illumina InterOp Library in Python

Install

If the Python InterOp library is not yet installed, you can install it with pip:

$ pip install -f https://github.com/Illumina/interop/releases/latest interop

You can verify that InterOp is properly installed:

$ python -m interop --test

Before you begin

If you plan to follow this tutorial interactively, download an example run folder for an Illumina sequencer. For example, you may download and extract this example run folder: MiSeqDemo

Please change the path below so that it points at the run folder you wish to use:


In [1]:
run_folder = r"D:\RTA.Data\InteropData\MiSeqDemo"
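
Before loading anything, it can help to confirm that the path really looks like a run folder. The sketch below is not part of the InterOp library: check_run_folder is a hypothetical helper, and it assumes the usual layout in which RunInfo.xml and an InterOp subdirectory sit directly under the run folder.

```python
import os


def check_run_folder(path):
    """Return a list of problems found with a candidate run folder.

    Hypothetical helper, not part of InterOp. It assumes the usual layout:
    RunInfo.xml and an InterOp/ subdirectory directly under the run folder.
    """
    if not os.path.isdir(path):
        return ["not a directory: %s" % path]
    problems = []
    if not os.path.isfile(os.path.join(path, "RunInfo.xml")):
        problems.append("missing RunInfo.xml")
    if not os.path.isdir(os.path.join(path, "InterOp")):
        problems.append("missing InterOp directory")
    return problems
```

An empty list means the folder passed both checks, for example check_run_folder(run_folder).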

Getting SAV Summary Tab-like Metrics

The run_metrics class encapsulates the model for all the individual InterOp files and also holds information from RunInfo.xml. The Modules page documents a subset of the application programming interface (API) for the major C++ classes. The available Python models all have the same names (with a few exceptions) and take the same parameters. That page is useful when you need to access specific values loaded from the individual files.


In [2]:
from interop import py_interop_run_metrics, py_interop_run, py_interop_summary

In [3]:
run_metrics = py_interop_run_metrics.run_metrics()

By default, the run_metrics class loads all the InterOp files.

run_folder = run_metrics.read(run_folder)

The InterOp library can provide a list of all necessary InterOp files for a specific application. The following shows how to generate that list for the summary statistics:


In [4]:
valid_to_load = py_interop_run.uchar_vector(py_interop_run.MetricCount, 0)

In [5]:
py_interop_run_metrics.list_summary_metrics_to_load(valid_to_load)

The run_metrics class can use this list to load only the required InterOp files as follows:


In [6]:
run_folder = run_metrics.read(run_folder, valid_to_load)

The run_summary class encapsulates all the metrics displayed on the SAV Summary tab. It is organized as a tree: the metrics describing the run summary sit at the root, each read summary is a branch, and each read/lane summary is a sub-branch.


In [7]:
summary = py_interop_summary.run_summary()

The run_summary object can be populated from the run_metrics object as follows:


In [8]:
py_interop_summary.summarize_run_metrics(run_metrics, summary)
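
Once the summary is populated, the tree described above (run, then read, then read/lane) can be walked with two nested loops. The following is a minimal sketch; walk_summary is a hypothetical helper name, and it calls only accessors that appear elsewhere in this tutorial (total_summary, size, lane_count, at, read().number() and lane).

```python
def walk_summary(summary):
    """Yield (level, label) pairs for the run, each read and each read/lane.

    `summary` is expected to be a populated run_summary. Only accessors
    shown elsewhere in this tutorial are called.
    """
    # Root of the tree: run-level metrics
    yield ("run", "total yield %.2f G" % summary.total_summary().yield_g())
    for read_index in range(summary.size()):
        # One branch per read
        read_summary = summary.at(read_index)
        read_number = read_summary.read().number()
        yield ("read", "read %d" % read_number)
        for lane_index in range(summary.lane_count()):
            # One sub-branch per read/lane
            lane_summary = read_summary.at(lane_index)
            yield ("lane", "read %d / lane %d" % (read_number, lane_summary.lane()))
```

Iterating over walk_summary(summary) and printing each pair gives a quick outline of what the summary contains before any table is built.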

Run Summary

The run summary comprises both the nonindex_summary and the total_summary. A metric in the run summary can be accessed as follows:

  • summary.total_summary().yield_g() or
  • summary.nonindex_summary().yield_g()

Below, we use pandas to display the run summary portion of the SAV Summary Tab:


In [9]:
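import pandas as pd
# Each column pairs a display label with the name of a summary accessor method
columns = (('Yield Total (G)', 'yield_g'), ('Projected Yield (G)', 'projected_yield_g'), ('% Aligned', 'percent_aligned'))
rows = [('Non-Indexed Total', summary.nonindex_summary()), ('Total', summary.total_summary())]
d = []
for label, func in columns:
    # Look up the accessor by name and call it on every row's summary object
    d.append((label, pd.Series([getattr(r[1], func)() for r in rows], index=[r[0] for r in rows])))
# DataFrame.from_items was removed in pandas 1.0; a dict keeps column order on Python 3.7+
df = pd.DataFrame(dict(d))
df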


Out[9]:
                   Yield Total (G)  Projected Yield (G)  % Aligned
Non-Indexed Total        18.765484            18.765484  98.047264
Total                    18.765484            18.765484  98.047264
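
The same label/accessor loop is reused for the read and lane tables further down, so it can be convenient to factor it out. The sketch below uses a hypothetical helper, collect_metrics, that relies only on getattr and returns plain Python containers rather than a DataFrame.

```python
def collect_metrics(rows, columns):
    """Build (index, table) from labelled summary objects.

    rows:    sequence of (row_label, summary_object) pairs
    columns: sequence of (column_label, accessor_name) pairs
    Each accessor is looked up by name and called with no arguments.
    """
    index = [row_label for row_label, _ in rows]
    table = {}
    for column_label, accessor_name in columns:
        table[column_label] = [getattr(obj, accessor_name)() for _, obj in rows]
    return index, table
```

With the rows and columns defined above, index, table = collect_metrics(rows, columns) followed by pd.DataFrame(table, index=index) should give the same table; the helper itself does not depend on pandas.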

You can also view the list of available metrics in the summary as follows:


In [10]:
print("\n".join([method for method in dir(summary.total_summary()) if not method.startswith('_') and method not in ("this", "resize")]))


error_rate
first_cycle_intensity
percent_aligned
percent_gt_q30
projected_yield_g
yield_g
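
The listing above can also be turned into values rather than names. The following sketch assumes that every remaining public callable is a zero-argument accessor, as the listing suggests; metrics_as_dict is a hypothetical helper, not an InterOp function.

```python
def metrics_as_dict(obj, skip=("this", "resize")):
    """Call every public zero-argument accessor on obj and collect the results.

    Assumption: after dropping private names and those in `skip`, every
    callable attribute can be called without arguments.
    """
    values = {}
    for name in dir(obj):
        if name.startswith("_") or name in skip:
            continue
        attr = getattr(obj, name)
        if callable(attr):  # ignore plain data attributes
            values[name] = attr()
    return values
```

For example, metrics_as_dict(summary.total_summary()) would map each name printed above to its current value.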

Read Summary

The read summary defines the same metrics as the run summary and can be accessed as follows:

read_index=0 # Possibly index of read 1
summary.at(read_index).summary().yield_g()

The read information can be accessed as follows:

summary.at(read_index).read().number()
summary.at(read_index).read().is_index()

The following code accesses relevant information from the read summary and puts it into a Pandas DataFrame:


In [11]:
# Label each read, marking index reads with "(I)"
rows = [("Read %s%d"%("(I)" if summary.at(i).read().is_index()  else " ", summary.at(i).read().number()), summary.at(i).summary()) for i in range(summary.size())]
d = []
for label, func in columns:
    d.append((label, pd.Series([getattr(r[1], func)() for r in rows], index=[r[0] for r in rows])))
# DataFrame.from_items was removed in pandas 1.0; a dict keeps column order on Python 3.7+
df = pd.DataFrame(dict(d))
df


Out[11]:
        Yield Total (G)  Projected Yield (G)  % Aligned
Read 1         9.382742             9.382742  99.200142
Read 2         9.382742             9.382742  96.894386

Read 1 Summary

The read/lane level summary defines a broader set of metrics, most of which provide several statistics, including mean, standard deviation and median. The mean value over all tiles for density can be accessed as follows:

summary.at(read_index).at(lane_index).density().mean()

Because a given metric may be either a plain value or a statistic with a mean, standard deviation and median, we define a simple function that detects which it is and formats it appropriately.


In [12]:
def format_value(val):
    if hasattr(val, 'mean'):
        return val.mean()
    else:
        return val
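
If a text rendering with the spread is wanted, the same hasattr check can be extended. This is only a sketch: format_stat is a hypothetical helper, and stddev() is an assumed accessor name that is used only when the object actually provides it.

```python
def format_stat(val, precision=2):
    """Format a summary value as text.

    A value exposing mean() is treated as a statistic. stddev() is an
    assumed accessor name and is appended only if it is present.
    """
    if hasattr(val, "mean"):
        text = "%.*f" % (precision, val.mean())
        if hasattr(val, "stddev"):
            text += " +/- %.*f" % (precision, val.stddev())
        return text
    if isinstance(val, float):
        return "%.*f" % (precision, val)
    # Counts and identifiers such as lane numbers are left as they are
    return str(val)
```

Plain integers such as a lane number or tile count pass through unchanged, so the helper can be applied to every column of a read/lane table.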

The following code accesses relevant information from the read/lane summary and puts it into a Pandas DataFrame:


In [13]:
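read = 0
columns = (('Lane', 'lane'), ('Tiles', 'tile_count'), ('Density (K/mm2)', 'density'))
# One read/lane summary per lane for the chosen read
rows = [summary.at(read).at(lane) for lane in range(summary.lane_count())]
d = []
for label, func in columns:
    # format_value reduces statistics to their mean and passes plain values through
    d.append((label, pd.Series([format_value(getattr(r, func)()) for r in rows])))
# DataFrame.from_items was removed in pandas 1.0; a dict keeps column order on Python 3.7+
df = pd.DataFrame(dict(d))
df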


Out[13]:
   Lane  Tiles  Density (K/mm2)
0     1     38       1361895.75
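
The density printed above is far larger than a figure in thousands per mm2 would be, which suggests the library reports clusters per mm2. Assuming that is the case, a small conversion brings the value in line with the column label; to_k_per_mm2 is a hypothetical helper.

```python
def to_k_per_mm2(density_per_mm2):
    """Convert a density in clusters/mm2 to thousands of clusters per mm2.

    Assumption: the input is in clusters per mm2, as the magnitude of the
    value in the table above suggests.
    """
    return density_per_mm2 / 1000.0


# Value taken from the table above
density_k = to_k_per_mm2(1361895.75)
```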