Analyzing a Curves output file with pyña

Let's do a quick refresher about Curves output. First, Curves output for a trajectory is just the Curves output for each frame, concatenated together. That is, if you had frame1.out, frame2.out, and frame3.out, then cat frame1.out frame2.out frame3.out would be the same as the analysis for the whole trajectory.

Within each frame, there are 5 groups of data. Following the Curves labels, we call these groups A-E (although perhaps it would be better to give them more descriptive names?). They include the following:

  • groupA : Base pair axis parameters (xdisp, ydisp, inclin, tip, ax-bend)
  • groupB : Intra-base pair parameters (shear, stretch, stagger, buckle, propel, opening)
  • groupC : Inter-base pair parameters (shift, slide, rise, tilt, roll, twist, h-ris, h-twi)
  • groupD : Backbone parameters (not yet supported)
  • groupE : Groove paramemeters (w12, d12, w21, d21)

Note that, while Curves capitalizes the first letter of each measurement, pyña does not.

Getting your data into a pyña object is as easy as the following two lines:


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pyna
curves = pyna.CurvesAnalysis('./data/shorter.data')

Using pandas.Panels for each group's data

The data from each group of parameters is put into a pandas.Panel object. By using pandas, we have many powerful options for analysis and export of data. However, I've simplified a few common tasks to have more familiar names within pyña.


In [2]:
curves.panels['groupB']


Out[2]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 4 (major_axis) x 30 (minor_axis)
Items axis: buckle to stretch
Major_axis axis: 0 to 3
Minor_axis axis: 1.0 to 30.0

The Panel object includes 3 axes. In pyña, the "item" axis corresponds to the property being measured (e.g., buckle), the "minor" axis corresponds to the location within the strand as labeled by Curves (usually base-pair number), and the "major" axis corresponds to time.

If we want to get a table with all the buckle angles, with each column representing a different base pair and each row representing a different frame of the trajectory, it is as easy as this:


In [3]:
curves.panels['groupB']['buckle']


Out[3]:
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ... 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0
0 -0.7 0.1 -5.6 -8.6 -4.7 -1.3 -4.1 -2.5 3.5 -1.9 ... -3.3 -11.9 2.8 -1.2 -15.0 -10.6 2.1 -13.3 -14.4 -14.8
1 -2.1 4.2 -0.2 -4.6 7.8 19.2 1.3 2.5 -8.3 -8.6 ... -2.9 8.5 3.2 -11.2 14.0 15.2 1.2 -5.1 0.9 -9.4
2 12.5 5.9 1.9 -11.1 -20.9 -2.1 -2.1 2.5 11.1 -10.3 ... -5.6 25.5 1.6 -11.8 1.0 10.3 9.5 -2.5 6.8 -8.5
3 -7.4 3.3 -3.5 -15.2 -1.6 -4.9 -13.6 -15.2 -10.5 11.6 ... 0.9 12.4 3.0 -4.6 -7.5 5.1 -9.3 -9.4 -16.1 -11.9

4 rows × 30 columns

Note how this dataframe differs from the Curves output: each column from Curves is now a separate table; what were the rows in Curves are columns for pyña. And for pyña each row represented a time, whereas Curves has each time in a separate table. At the end of this document, I'll show how to reconstruct the Curves approach.

The main object to generate statistics on subsets of that data is pyna.StrandStatistics. You create a StrandStatistics object with a dataframe, 0 or more locations, and 0 or more times. The locations refer to the columns in the pyña dataframe, and the times refer to its rows.

If you don't set locations, then the statistics are generated for all locations. The same is true of times.

So if you want to see the average buckle angle, averaged over all time and all base pairs, you can obtain that with:


In [4]:
pyna.StrandStatistics(curves.panels['groupB']['buckle']).mean()


Out[4]:
-1.3375000000000001

Of course, it may be better to save the statistics object. The function .summary() shows a number of properties (each available as a function of that name, with no arguments). Internally, the structure has a property .df, to which any pandas statistics can be applied, as well as .np, which returns a numpy array for use with numpy or scipy statistical analysis.


In [5]:
all_buckle = pyna.StrandStatistics(curves.panels['groupB']['buckle'])
print all_buckle.summary()


count: 120
mean: -1.3375
std:  9.43724061454
min:  -23.3
max:  26.0

To get information about the distribution of the buckle angle for base pair 6 (distribution over time), do the following:


In [6]:
bp6_byTime = pyna.StrandStatistics(curves.panels['groupB']['buckle'], locations=6)
print bp6_byTime.summary()


count: 4
mean: 2.725
std:  9.60530452406
min:  -4.9
max:  19.2

To get information about the distribution of buckle angles in a given frame (averaging over base pair number), you can do this:


In [7]:
frame0 = pyna.StrandStatistics(curves.panels['groupB']['buckle'], times=0)
print frame0.summary()


count: 30
mean: -3.29
std:  7.27765300996
min:  -15.0
max:  16.6

You can also limit that to a subset of the base pairs by using the column headers for them:


In [8]:
frame0_subset_columns = pyna.StrandStatistics(curves.panels['groupB']['buckle'], times=0, locations=[4,5,6])
print frame0_subset_columns.summary()


count: 3
mean: -4.86666666667
std:  2.98254179444
min:  -8.6
max:  -1.3

Say you want the distribution of buckle angles for a group of base pairs averaged over both that group and over all frames:


In [9]:
sub_columns = pyna.StrandStatistics(curves.panels['groupB']['buckle'], locations=[4,5,6])
print sub_columns.summary()
print sub_columns.summary(per_location=True)


count: 12
mean: -2.25
std:  9.43473193401
min:  -20.9
max:  19.2

count: [4 4 4]
mean: [-4.85   2.725 -4.625]
std:  [ 11.94724515  11.09125031   6.38507374]
min:  [-20.9  -4.9 -13.6]
max:  [  7.8  19.2   1.3]

Or perhaps you only want to look at that for a subsection of the frames:


In [10]:
subtime_subcol = pyna.StrandStatistics(curves.panels['groupB']['buckle'], times=[0,2], locations=[4, 5, 6])
print subtime_subcol.summary()


count: 6
mean: -5.86666666667
std:  6.82780263986
min:  -20.9
max:  -1.3

Note also that pyña correctly handles the groove parameters, which can't always be calculated for all base pair lables. Where there is no answer, pyña returns "not-a-number" (NaN).


In [11]:
curves.panels['groupE']['w12']


Out[11]:
1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 ... 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5
0 NaN NaN NaN 7.0 6.4 6.0 6.1 6.0 5.5 5.5 ... 5.7 7.1 8.3 7.2 5.6 4.8 NaN NaN NaN NaN
1 NaN NaN NaN NaN 6.1 6.6 6.9 7.1 6.6 6.1 ... 6.9 5.9 4.8 4.3 4.1 4.7 NaN NaN NaN NaN
2 NaN NaN NaN 4.5 3.5 3.5 5.0 6.3 6.3 6.0 ... 7.1 7.8 7.7 6.6 5.2 4.1 4.3 NaN NaN NaN
3 NaN NaN NaN 6.6 5.6 4.9 5.1 5.2 4.8 5.1 ... 4.5 5.3 6.4 5.9 5.1 5.1 NaN NaN NaN NaN

4 rows × 57 columns

Having NaNs still allows correct statistics:


In [12]:
wstat3 = pyna.StrandStatistics(curves.panels['groupE']['w12'], locations=3)
print wstat3.summary()


count: 3
mean: 6.03333333333
std:  1.09645894689
min:  4.5
max:  7.0


In [13]:
wstat = pyna.StrandStatistics(curves.panels['groupE']['w12'])
print wstat.summary()
print wstat.summary(per_location=True)


count: 200
mean: 5.7885
std:  1.27425183932
min:  2.4
max:  8.3

count: [0 0 0 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 0 0 0]
mean: [        nan         nan         nan  6.03333333  5.4         5.25        5.775
  6.15        5.8         5.675       5.925       6.1         6.1         6.125
  6.275       6.625       7.15        7.6         7.625       7.65        7.225
  6.65        6.325       5.9         4.95        4.2         4.325       4.525
  4.85        5.65        6.275       6.65        6.875       6.8         5.675
  4.85        4.725       5.25        6.075       6.55        5.875       5.1
  4.675       4.525       4.075       4.05        4.925       6.05        6.525
  6.8         6.          5.          4.675       4.3                nan
         nan         nan]
std:  [        nan         nan         nan  1.34288247  1.30894359  1.36259556
  0.89953692  0.78528127  0.81240384  0.46457866  0.49244289  1.17473401
  1.00995049  0.63966137  0.55602758  1.18988795  0.88128694  0.46904158
  0.23629078  0.61373175  1.19547759  1.56311655  1.21483881  0.96953597
  0.65574385  0.87559504  1.40089257  1.73084758  1.14455231  0.61373175
  0.45734742  0.7         0.92870878  1.12249722  0.94648472  0.47958315
  0.7410578   0.94692485  0.85391256  0.61373175  0.49916597  0.59441848
  0.22173558  0.71355915  0.98107084  1.80462369  1.23389627  1.20415946
  1.13247517  1.5513435   1.25166556  0.63770422  0.41932485         nan
         nan         nan         nan]
min:  [ nan  nan  nan  4.5  3.5  3.5  5.   5.2  4.8  5.1  5.3  4.6  5.1  5.4  5.6
  5.6  6.4  7.3  7.3  6.9  6.   5.2  4.8  4.5  4.1  3.6  3.   2.6  3.5  4.9
  5.6  5.6  5.6  5.5  4.6  4.3  3.8  4.4  5.   5.8  5.2  4.3  4.4  3.7  2.9
  2.4  3.5  4.5  5.3  4.8  4.3  4.1  4.1  4.3  nan  nan  nan]
max:  [ nan  nan  nan  7.   6.4  6.6  6.9  7.1  6.6  6.1  6.5  7.3  7.2  6.8  6.9
  7.8  8.1  8.3  7.8  8.2  8.3  8.1  7.4  6.6  5.7  5.5  6.3  6.8  6.3  6.4
  6.6  7.   7.8  8.2  6.9  5.3  5.5  6.3  6.9  7.1  6.3  5.6  4.9  5.4  5.3
  6.4  6.5  7.1  7.8  8.3  7.2  5.6  5.1  4.3  nan  nan  nan]


In [14]:
wstat.hist(per_location=True, range=(2, 10), bins=8)


Out[14]:
1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 ... 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5
2.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3.0 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4.0 0 0 0 1 0 1 0 0 1 0 ... 1 0 1 1 1 3 1 0 0 0
5.0 0 0 0 0 1 0 2 1 1 2 ... 1 2 0 1 3 1 0 0 0 0
6.0 0 0 0 1 2 2 2 2 2 2 ... 1 0 1 1 0 0 0 0 0 0
7.0 0 0 0 1 0 0 0 1 0 0 ... 1 2 1 1 0 0 0 0 0 0
8.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
9.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

8 rows × 57 columns

Co-keys

There are several columns in the Curves output which remain unchanged from frame to frame. These help give additional information about what the Curves base pair key (the row in Curves, the column in pyña's dataframes). Since they are unchanged with each frame, we only save them in one place, called co_keys. For example, to see the co-key for the buckle (part of group B) in the column labeled "6", you type:


In [15]:
curves.co_keys['groupB'][6]


Out[15]:
['T', '7-A', '26']

These are the same columns you would see in Curves, following the "6)" label.

Curves-like panels

Maybe you want to double-check pyña's output against the output in Curves, or maybe you just like the paradigm used by Curves more. Here's how to create Panels using the Curves convention, where the "item" axis is time, the "minor" axis is the measurement name, and the "major" axis is the base pair label:


In [16]:
curvesE =  pyna.curves_style(curves.panels['groupE'])

You still have one panel per group, but the organization of each panel has changed.


In [17]:
curvesE[0]


Out[17]:
d12 d21 w12 w21
1.5 NaN NaN NaN NaN
2.0 NaN NaN NaN NaN
2.5 NaN NaN NaN NaN
3.0 4.8 NaN 7.0 NaN
3.5 5.1 NaN 6.4 NaN
4.0 5.4 5.2 6.0 11.1
4.5 5.3 4.6 6.1 10.5
5.0 5.0 4.8 6.0 11.3
5.5 5.9 0.5 5.5 12.0
6.0 6.0 4.3 5.5 11.7
6.5 5.6 4.6 6.5 10.9
7.0 5.4 5.0 7.3 10.6
7.5 5.4 4.9 7.2 10.7
8.0 4.9 4.7 6.8 10.3
8.5 5.4 4.6 6.1 10.5
9.0 6.0 4.5 5.6 11.1
9.5 5.8 4.8 6.4 12.3
10.0 5.3 3.3 7.4 11.5
10.5 5.3 4.3 7.8 10.9
11.0 5.2 5.8 8.2 9.6
11.5 5.0 6.1 8.3 9.2
12.0 4.8 6.0 8.1 9.2
12.5 4.9 5.3 7.4 10.0
13.0 5.0 4.0 6.6 10.2
13.5 5.3 5.0 5.7 10.4
14.0 5.5 5.2 5.5 10.0
14.5 4.8 5.7 6.3 10.9
15.0 4.5 5.3 6.8 11.8
15.5 5.5 5.7 6.3 12.3
16.0 5.6 4.7 6.4 11.6
16.5 5.3 4.8 6.6 10.9
17.0 5.0 5.1 7.0 10.2
17.5 4.9 4.9 7.8 10.1
18.0 5.1 3.5 8.2 9.2
18.5 5.6 4.6 6.9 9.8
19.0 5.9 4.9 5.2 11.3
19.5 5.9 5.2 4.5 13.1
20.0 5.8 3.0 4.4 12.9
20.5 5.1 3.7 5.0 12.7
21.0 4.6 4.8 5.8 12.3
21.5 5.2 5.4 5.8 11.1
22.0 5.7 5.6 5.5 9.7
22.5 5.7 5.6 4.4 10.2
23.0 5.5 4.1 3.7 10.7
23.5 5.4 1.9 4.0 11.0
24.0 5.2 4.6 4.5 11.0
24.5 6.1 4.0 5.0 10.4
25.0 6.6 4.4 5.7 10.5
25.5 5.9 5.3 7.1 10.2
26.0 4.9 6.1 8.3 9.8
26.5 4.9 6.2 7.2 10.5
27.0 5.4 4.5 5.6 10.8
27.5 5.4 NaN 4.8 NaN
28.0 NaN NaN NaN NaN
28.5 NaN NaN NaN NaN
29.0 NaN NaN NaN NaN
29.5 NaN NaN NaN NaN

Note that the order of columns doesn't necessarily match that of Curves+.


In [ ]: