To follow this tutorial interactively, download an example run folder for an Illumina sequencer. For example, you may download and extract this example run folder: MiSeqDemo
Please change the path below so that it points at the run folder you wish to use:
In [1]:
run_folder = r"D:\RTA.Data\InteropData\MiSeqDemo"
The run_metrics class encapsulates the model for all the individual InterOp files and also contains information from RunInfo.xml. The Modules page contains a subset of the application programming interface (API) for all the major classes in C++. The available Python models have the same names (with a few exceptions) and take the same parameters. The Modules page is useful for accessing specific values loaded from the individual files.
In [2]:
from interop import py_interop_run_metrics, py_interop_run, py_interop_summary
In [3]:
run_metrics = py_interop_run_metrics.run_metrics()
By default, the run_metrics class loads all the InterOp files.
run_folder = run_metrics.read(run_folder)
The InterOp library can provide a list of all necessary InterOp files for a specific application. The following shows how to generate that list for the summary statistics:
In [4]:
valid_to_load = py_interop_run.uchar_vector(py_interop_run.MetricCount, 0)
In [5]:
py_interop_run_metrics.list_summary_metrics_to_load(valid_to_load)
The run_metrics class can use this list to load only the required InterOp files as follows:
In [6]:
run_folder = run_metrics.read(run_folder, valid_to_load)
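To check which metric groups were flagged before loading, the vector can be scanned for non-zero entries. The following is a minimal sketch: the helper name flagged_indices is hypothetical, and it assumes that the vector can be iterated like a Python sequence of integers and that required entries are marked with a non-zero value. A plain list stands in for the vector so the snippet runs without the InterOp library.

```python
def flagged_indices(flags):
    """Return the positions whose flag is non-zero."""
    return [index for index, flag in enumerate(flags) if flag]


# A plain list stands in for py_interop_run.uchar_vector here (assumption:
# the real vector iterates like a sequence of integers). The values are
# made up for illustration.
example_flags = [0, 1, 0, 0, 1]
print(flagged_indices(example_flags))
```

If the assumption holds, flagged_indices(valid_to_load) would list the positions of the metric groups that the summary needs.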
The run_summary class encapsulates all the metrics displayed on the SAV summary tab. This class contains a tree-like structure: metrics describing the run summary are at the root, each read summary is a branch, and each read/lane summary is a sub-branch.
In [7]:
summary = py_interop_summary.run_summary()
The run_summary object can be populated from the run_metrics object as follows:
In [8]:
py_interop_summary.summarize_run_metrics(run_metrics, summary)
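The tree-like structure can be walked with two nested loops, one over reads and one over lanes. The sketch below uses only the size(), lane_count() and at() accessors that appear elsewhere in this tutorial. The generator name iter_lane_summaries and the two Fake classes are hypothetical; the Fake classes exist only so the snippet runs without a run folder.

```python
def iter_lane_summaries(summary):
    """Yield (read_index, lane_index, lane_summary) for every read/lane pair."""
    for read_index in range(summary.size()):
        read_summary = summary.at(read_index)
        for lane_index in range(summary.lane_count()):
            yield read_index, lane_index, read_summary.at(lane_index)


# Hypothetical stand-ins that mimic only size(), lane_count() and at().
# They are not part of the InterOp library.
class FakeReadSummary(object):
    def __init__(self, read_index):
        self.read_index = read_index

    def at(self, lane_index):
        return "read %d / lane %d" % (self.read_index, lane_index)


class FakeRunSummary(object):
    def size(self):
        return 2

    def lane_count(self):
        return 3

    def at(self, read_index):
        return FakeReadSummary(read_index)


visited = list(iter_lane_summaries(FakeRunSummary()))
for read_index, lane_index, lane_summary in visited:
    print(read_index, lane_index, lane_summary)
```

With the populated summary object from the cell above, iter_lane_summaries(summary) would be called the same way.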
In [9]:
import pandas as pd
# Each column pairs a display label with the name of the accessor method to call
columns = ( ('Yield Total (G)', 'yield_g'), ('Projected Yield (G)', 'projected_yield_g'), ('% Aligned', 'percent_aligned'))
rows = [('Non-Indexed Total', summary.nonindex_summary()), ('Total', summary.total_summary())]
d = []
for label, func in columns:
    d.append( (label, pd.Series([getattr(r[1], func)() for r in rows], index=[r[0] for r in rows])))
# DataFrame.from_items was removed in pandas 1.0; a dict keeps insertion order on Python 3.7+
df = pd.DataFrame(dict(d))
df
Out[9]:
You can also view the list of available metrics in the summary as follows:
In [10]:
print("\n".join([method for method in dir(summary.total_summary()) if not method.startswith('_') and method not in ("this", "resize")]))
The read summary defines the same metrics as the run summary and can be accessed as follows:
read_index=0 # Possibly index of read 1
summary.at(read_index).summary().yield_g()
The read information can be accessed as follows:
summary.at(read_index).read().number()
summary.at(read_index).read().is_index()
The following code accesses relevant information from the read summary and puts it into a Pandas DataFrame:
In [11]:
# Label each read, marking index reads with "(I)"
rows = [("Read %s%d"%("(I)" if summary.at(i).read().is_index() else " ", summary.at(i).read().number()), summary.at(i).summary()) for i in range(summary.size())]
d = []
for label, func in columns:
    d.append( (label, pd.Series([getattr(r[1], func)() for r in rows], index=[r[0] for r in rows])))
# DataFrame.from_items was removed in pandas 1.0; a dict keeps insertion order on Python 3.7+
df = pd.DataFrame(dict(d))
df
Out[11]:
The read/lane-level summary defines a broader set of metrics, most of which provide several statistics, including mean, standard deviation and median. The mean density over all tiles can be accessed as follows:
summary.at(read_index).at(lane_index).density().mean()
Since a value may or may not provide the mean, standard deviation and median statistics, we define a simple function that detects whether it does and formats it accordingly.
In [12]:
def format_value(val):
    # Statistic objects expose mean(); plain values are returned unchanged
    if hasattr(val, 'mean'):
        return val.mean()
    else:
        return val
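The same hasattr check can be extended to report every statistic a value provides rather than only the mean. This is a sketch under stated assumptions: the accessor names stddev and median are assumed to exist alongside mean (only mean() is used in this tutorial), and describe_value and FakeStat are hypothetical names introduced for illustration.

```python
def describe_value(val):
    """Return a dict of the statistics a value exposes, or the value itself."""
    if hasattr(val, 'mean'):
        # Assumption: statistic objects may also expose stddev() and median().
        # Any accessor that is missing is simply skipped.
        names = ('mean', 'stddev', 'median')
        return {name: getattr(val, name)() for name in names if hasattr(val, name)}
    return val


class FakeStat(object):
    """Hypothetical stand-in for a statistic object; the values are made up."""

    def mean(self):
        return 10.0

    def stddev(self):
        return 2.0

    def median(self):
        return 9.5


print(describe_value(FakeStat()))
print(describe_value(4))
```

Plain values such as a lane number pass through unchanged, so the helper can be applied to every column in the same way as format_value.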
The following code accesses relevant information from the read/lane summary and puts it into a Pandas DataFrame:
In [13]:
read = 0
columns = ( ('Lane', 'lane'), ('Tiles', 'tile_count'), ('Density (K/mm2)', 'density'))
# One row per lane of the selected read
rows = [summary.at(read).at(lane) for lane in range(summary.lane_count())]
d = []
for label, func in columns:
    d.append( (label, pd.Series([format_value(getattr(r, func)()) for r in rows])))
# DataFrame.from_items was removed in pandas 1.0; a dict keeps insertion order on Python 3.7+
df = pd.DataFrame(dict(d))
df
Out[13]:
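The table cells in this tutorial share one pattern: each column names an accessor method, and getattr calls that method on every row object. The sketch below isolates the pattern without pandas. The helper name build_table and the FakeLane class are hypothetical, and the lane and tile values are made up so the snippet runs without a run folder.

```python
def build_table(rows, columns):
    """Map each column label to the values returned by the named accessor."""
    return {
        label: [getattr(row, accessor)() for row in rows]
        for label, accessor in columns
    }


class FakeLane(object):
    """Hypothetical stand-in for a read/lane summary; values are made up."""

    def __init__(self, lane, tile_count):
        self._lane = lane
        self._tile_count = tile_count

    def lane(self):
        return self._lane

    def tile_count(self):
        return self._tile_count


example_columns = (('Lane', 'lane'), ('Tiles', 'tile_count'))
table = build_table([FakeLane(1, 28), FakeLane(2, 28)], example_columns)
print(table)
```

The resulting dictionary of lists can be passed directly to pd.DataFrame. For columns that return statistic objects, such as density, the accessor result would still need to go through format_value first.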