pyña
Let's do a quick refresher on Curves output. First, Curves output for a trajectory is just the Curves output for each frame, concatenated together. That is, if you had frame1.out, frame2.out, and frame3.out, then cat frame1.out frame2.out frame3.out would be the same as the analysis for the whole trajectory.
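For readers who would rather do the concatenation from Python than from the shell, here is a minimal sketch using only the standard library. The helper name concatenate_frames and the placeholder file contents are made up for illustration; real Curves output would take their place.

```python
import os
import tempfile


def concatenate_frames(frame_paths, out_path):
    """Append each per-frame file to out_path, in order (like `cat`)."""
    with open(out_path, "w") as out:
        for path in frame_paths:
            with open(path) as frame:
                out.write(frame.read())


# Placeholder contents standing in for real per-frame Curves output.
workdir = tempfile.mkdtemp()
frame_paths = []
for i in (1, 2, 3):
    path = os.path.join(workdir, "frame%d.out" % i)
    with open(path, "w") as f:
        f.write("placeholder output for frame %d\n" % i)
    frame_paths.append(path)

trajectory_path = os.path.join(workdir, "trajectory.out")
concatenate_frames(frame_paths, trajectory_path)

with open(trajectory_path) as f:
    trajectory_text = f.read()
```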
Within each frame, there are 5 groups of data. Following the Curves labels, we call these groups A-E (although perhaps it would be better to give them more descriptive names?). They include the following:
groupA: Base pair axis parameters (xdisp, ydisp, inclin, tip, ax-bend)
groupB: Intra-base pair parameters (shear, stretch, stagger, buckle, propel, opening)
groupC: Inter-base pair parameters (shift, slide, rise, tilt, roll, twist, h-ris, h-twi)
groupD: Backbone parameters (not yet supported)
groupE: Groove parameters (w12, d12, w21, d21)
Note that, while Curves capitalizes the first letter of each measurement, pyña does not.
Getting your data into a pyña object is as easy as the following two lines:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pyna
curves = pyna.CurvesAnalysis('./data/shorter.data')
In [2]:
curves.panels['groupB']
Out[2]:
The Panel object includes 3 axes. In pyña, the "item" axis corresponds to the property being measured (e.g., buckle), the "minor" axis corresponds to the location within the strand as labeled by Curves (usually base-pair number), and the "major" axis corresponds to time.
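To make the three axes concrete, here is a small sketch that mimics the layout with a plain dictionary of DataFrames. This is only an illustration, not how pyña stores its data: pandas removed Panel in version 0.25, so the name panel_like, the made-up numbers, and the dictionary representation are all assumptions chosen to run on a recent pandas.

```python
import numpy as np
import pandas as pd

times = [0, 1, 2]             # "major" axis: one row per frame
locations = [1, 2, 3, 4]      # "minor" axis: one column per base-pair label
items = ["buckle", "propel"]  # "item" axis: one table per measured property

# Made-up numbers, only to give each table a recognizable shape.
values = np.arange(len(items) * len(times) * len(locations), dtype=float)
values = values.reshape(len(items), len(times), len(locations))

# One DataFrame per item: rows are times, columns are locations.
panel_like = {
    item: pd.DataFrame(values[i], index=times, columns=locations)
    for i, item in enumerate(items)
}

# Analogous in spirit to curves.panels['groupB']['buckle'].
buckle = panel_like["buckle"]
print(buckle)
```

Selecting an item gives a table with one row per frame and one column per base-pair label, which is the shape the rest of this document works with.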
If we want to get a table with all the buckle angles, with each column representing a different base pair and each row representing a different frame of the trajectory, it is as easy as this:
In [3]:
curves.panels['groupB']['buckle']
Out[3]:
Note how this dataframe differs from the Curves output: each column from Curves is now a separate table, and what were the rows in Curves are columns for pyña. In pyña, each row represents a time, whereas Curves puts each time in a separate table. At the end of this document, I'll show how to reconstruct the Curves approach.
The main object for generating statistics on subsets of that data is pyna.StrandStatistics. You create a StrandStatistics object with a dataframe, 0 or more locations, and 0 or more times. The locations refer to the columns in the pyña dataframe, and the times refer to its rows.
If you don't set locations, then the statistics are generated for all locations. The same is true of times.
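The selection rule described above (columns for locations, rows for times, everything when nothing is given) can be sketched with plain pandas indexing. The helpers as_label_list and select_subset are hypothetical names written for this illustration. They are not part of pyña, and the sketch assumes that the selected values are simply pooled into one sample before a statistic is taken.

```python
import pandas as pd


def as_label_list(labels):
    """Wrap a single label in a list; pass lists and tuples through."""
    if isinstance(labels, (list, tuple)):
        return list(labels)
    return [labels]


def select_subset(df, locations=None, times=None):
    """Return the requested rows (times) and columns (locations).

    None means "keep everything" along that axis.
    Hypothetical helper for illustration, not pyña's API.
    """
    rows = slice(None) if times is None else as_label_list(times)
    cols = slice(None) if locations is None else as_label_list(locations)
    return df.loc[rows, cols]


# Made-up buckle angles: rows are frames, columns are base-pair labels.
buckle = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]],
    index=[0, 1, 2],
    columns=[4, 5, 6],
)

everything = select_subset(buckle)
bp6_over_time = select_subset(buckle, locations=6)
frame0_subset = select_subset(buckle, locations=[4, 5], times=0)

# Pool every selected value into one sample before averaging.
pooled_mean = frame0_subset.values.mean()
```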
So if you want to see the average buckle angle, averaged over all time and all base pairs, you can obtain that with:
In [4]:
pyna.StrandStatistics(curves.panels['groupB']['buckle']).mean()
Out[4]:
Of course, it may be better to save the statistics object. The function .summary() shows a number of properties (each available as a function of that name, with no arguments). Internally, the structure has a property .df, to which any pandas statistics can be applied, as well as .np, which returns a numpy array for use with numpy or scipy statistical analysis.
In [5]:
all_buckle = pyna.StrandStatistics(curves.panels['groupB']['buckle'])
print(all_buckle.summary())
To get information about the distribution of the buckle angle for base pair 6 (distribution over time), do the following:
In [6]:
bp6_byTime = pyna.StrandStatistics(curves.panels['groupB']['buckle'], locations=6)
print(bp6_byTime.summary())
To get information about the distribution of buckle angles in a given frame (averaging over base pair number), you can do this:
In [7]:
frame0 = pyna.StrandStatistics(curves.panels['groupB']['buckle'], times=0)
print(frame0.summary())
You can also limit that to a subset of the base pairs by using the column headers for them:
In [8]:
frame0_subset_columns = pyna.StrandStatistics(curves.panels['groupB']['buckle'], times=0, locations=[4,5,6])
print(frame0_subset_columns.summary())
Say you want the distribution of buckle angles for a group of base pairs averaged over both that group and over all frames:
In [9]:
sub_columns = pyna.StrandStatistics(curves.panels['groupB']['buckle'], locations=[4,5,6])
print(sub_columns.summary())
print(sub_columns.summary(per_location=True))
Or perhaps you only want to look at that for a subsection of the frames:
In [10]:
subtime_subcol = pyna.StrandStatistics(curves.panels['groupB']['buckle'], times=[0,2], locations=[4, 5, 6])
print(subtime_subcol.summary())
Note also that pyña correctly handles the groove parameters, which can't always be calculated for all base pair labels. Where there is no answer, pyña returns "not-a-number" (NaN).
In [11]:
curves.panels['groupE']['w12']
Out[11]:
Having NaNs still allows correct statistics:
In [12]:
wstat3 = pyna.StrandStatistics(curves.panels['groupE']['w12'], locations=3)
print(wstat3.summary())
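To see why NaNs need care, and how they can be skipped, here is a small numpy sketch. The groove widths are made up, and the calls shown are ordinary numpy functions; whether pyña uses these exact calls internally is not something this document states.

```python
import numpy as np

# Made-up groove widths: rows are frames, columns are base-pair labels.
# The first label never has a defined width; another value is missing once.
w12 = np.array([
    [np.nan, 6.5, 7.0],
    [np.nan, 6.0, np.nan],
    [np.nan, 7.0, 8.0],
])

plain_mean = w12.mean()            # a single NaN poisons the ordinary mean
nan_aware_mean = np.nanmean(w12)   # averages only the defined values
n_defined = np.count_nonzero(~np.isnan(w12))

# Per-location means over time, for the labels that have any data.
per_location_mean = np.nanmean(w12[:, 1:], axis=0)
```

The ordinary mean comes out as NaN, while the NaN-aware version averages the five defined values. pandas statistics such as DataFrame.mean() skip NaN by default (skipna=True), so statistics applied to a dataframe behave like the NaN-aware version.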
In [13]:
wstat = pyna.StrandStatistics(curves.panels['groupE']['w12'])
print(wstat.summary())
print(wstat.summary(per_location=True))
In [14]:
wstat.hist(per_location=True, range=(2, 10), bins=8)
Out[14]:
There are several columns in the Curves output which remain unchanged from frame to frame. These give additional information about the Curves base pair key (the row in Curves, the column in pyña's dataframes). Since they are unchanged with each frame, we only save them in one place, called co_keys. For example, to see the co-key for the buckle (part of group B) in the column labeled "6", you type:
In [15]:
curves.co_keys['groupB'][6]
Out[15]:
These are the same columns you would see in Curves, following the "6)" label.
Maybe you want to double-check pyña's output against the output in Curves, or maybe you just like the paradigm used by Curves more. Here's how to create Panels using the Curves convention, where the "item" axis is time, the "minor" axis is the measurement name, and the "major" axis is the base pair label:
In [16]:
curvesE = pyna.curves_style(curves.panels['groupE'])
You still have one panel per group, but the organization of each panel has changed.
In [17]:
curvesE[0]
Out[17]:
Note that the order of columns doesn't necessarily match that of Curves+.
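The reorganization itself is just a regrouping of the same numbers. The following sketch shows the idea with a dictionary of DataFrames standing in for a Panel (which pandas removed in version 0.25). The function to_frame_tables and the sample values are hypothetical; this is not the implementation of pyna.curves_style.

```python
import pandas as pd


def to_frame_tables(tables_by_item):
    """Regroup {item: DataFrame[time x location]} into
    {time: DataFrame[location x item]}.

    Hypothetical helper for illustration only.
    """
    first = next(iter(tables_by_item.values()))
    return {
        t: pd.DataFrame(
            {item: df.loc[t] for item, df in tables_by_item.items()}
        )
        for t in first.index
    }


# Made-up groove data: rows are frames, columns are base-pair labels.
groove_like = {
    "w12": pd.DataFrame([[6.0, 6.5], [7.0, 7.5]], index=[0, 1], columns=[3, 4]),
    "d12": pd.DataFrame([[4.0, 4.5], [5.0, 5.5]], index=[0, 1], columns=[3, 4]),
}

by_frame = to_frame_tables(groove_like)
frame0 = by_frame[0]  # rows: base-pair labels, columns: measurement names
```

Each per-frame table now has one row per base pair label and one column per measurement, which is the shape of a single frame of Curves output.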