FIDDLE Predictions Visualization Tutorial:

This notebook outlines how to create graphs from the data in the output files. The types of graphs depicted in FIDDLE's original bioRxiv paper were created using similar code. It is important to note that if variables are chaged from the suggested names written throughout the steps to then also change their calls later on in step 7.

Note: The following python packages are easily installable via pip, e.g:

pip install scipy



In [1]:

    
%matplotlib inline
from matplotlib import pylab as pl
from scipy import stats
import numpy as np
import pandas as pd
import h5py
from matplotlib.backends.backend_pdf import PdfPages

1. Load predictions hdf5 dataset, e.g:

predictions = h5py.File('../results/experiment/predictions.h5','r')



In [2]:

    
predictions = h5py.File('../results/17_08_11_test1/predictions.h5','r')

1a. Examine the predictions hdf5 dataset, the keys correspond to outputs as determined by the parameters in the configurations.json file and the values correspond to the outputs' matrix shape, e.g:

predictions.items()



In [3]:

    
predictions.items()









    Out[3]:





[(u'tssseq', <HDF5 dataset "tssseq": shape (12900, 500), type "<f4">)]

2. Load original test hdf5 dataset (these are the genomic regions FIDDLE performed its final prediction on), e.g:

test = h5py.File('../data/hdf5datasets/NSMSDSRSCSTSRI_500bp/test.h5', 'r')



In [5]:

    
test = h5py.File('../data/hdf5datasets/NSMSDSRSCSTSRI_500bp/test.h5', 'r')

2a. Examine the test hdf5 dataset, the keys and values have the same structure as the predictions hdf5dataset, e.g:

test.items()



In [6]:

    
test.items()









    Out[6]:





[(u'chipseq', <HDF5 dataset "chipseq": shape (12900, 2, 500, 1), type "<f4">),
 (u'dnaseq', <HDF5 dataset "dnaseq": shape (12900, 4, 500, 1), type "<f4">),
 (u'info', <HDF5 dataset "info": shape (12900, 4), type "<f4">),
 (u'mnaseseq',
  <HDF5 dataset "mnaseseq": shape (12900, 2, 500, 1), type "<f4">),
 (u'netseq', <HDF5 dataset "netseq": shape (12900, 2, 500, 1), type "<f4">),
 (u'randinp', <HDF5 dataset "randinp": shape (12900, 1, 500, 1), type "<f4">),
 (u'rnaseq', <HDF5 dataset "rnaseq": shape (12900, 1, 500, 1), type "<f4">),
 (u'tssseq', <HDF5 dataset "tssseq": shape (12900, 2, 500, 1), type "<f4">)]

3. Read in the 'info' reference track from the test hdf5 dataset, e.g.:

infoRef_test = test.get('info')[:]



In [7]:

    
infoRef_test = test.get('info')[:]

3a. Examine the info reference track, the dimensions correspond to the following, this track is used for correct indexing:

1. Chromosome number (e.g. 1-16)
2. Strandedness (e.g. -1, 1)
3. Gene index (parsed from the original GFF file input)
4. Base Pair index (e.g. up to ~10^6)

stats.describe(infoRef_test[:, X])



In [8]:

    
stats.describe(infoRef_test[:, 0])
stats.describe(infoRef_test[:, 1])
stats.describe(infoRef_test[:, 2])
stats.describe(infoRef_test[:, 3])









    Out[8]:





DescribeResult(nobs=12900, minmax=(1.0, 16.0), mean=9.3024807, variance=20.961603, skewness=-0.12559175491333008, kurtosis=-1.3162881165056144)






    Out[8]:





DescribeResult(nobs=12900, minmax=(-1.0, 1.0), mean=0.0, variance=1.0000775, skewness=0.0, kurtosis=-2.0)






    Out[8]:





DescribeResult(nobs=12900, minmax=(1.0, 5159.0), mean=2594.1211, variance=2207513.8, skewness=-0.017716964706778526, kurtosis=-1.1933110312711752)






    Out[8]:





DescribeResult(nobs=12900, minmax=(2277.0, 1520825.0), mean=458661.19, variance=1.021782e+11, skewness=0.8305436968803406, kurtosis=0.29999693955268336)

4. Read in one of the sequencing output tracks (for instance 'tssseq' for TSS-Seq data) from the predictions hdf5 datasets. Possible output types were outlined previously in step 2a. This is the predicted profile that will be plotted, e.g.:

pred_tss = predictions.get('tssseq')[:]



In [9]:

    
pred_tss = predictions.get('tssseq')[:]

5. Correspondingly, read in the sequencing output track from the test hdf5 dataset. This is the original profile that the predicted profile will be plotted against, e.g:

orig_tss = test.get('tssseq')[:]



In [10]:

    
orig_tss = test.get('tssseq')[:]

6. Create directory to save resulting Figure:

! mkdir ../Figures

! mkdir ../Figures

6. Modify plotting variables according to your liking, such as number of random genomic regions of interest and more:

file_name='../figures/predictions_plot.pdf' # make sure directory exists before carrying through

size_interest=50 # 50 random genomic regions of interest

alpha_orig=0.6 # [0.0, 1.0] - [transparent, opaque], transparency of expected profile from test hdf5 dataset

alpha_pred=0.8 # [0.0, 1.0] - [transparent, opaque], transparency of predicted profile from predictions hdf5 dataset



In [12]:

    
file_name='../Figures/predictions_plot.pdf'
size_interest=50
alpha_orig=0.7
alpha_pred=0.9

7. Create pdf file with plots of overlayed predictions against expected profiles.



In [13]:

    
pl.ioff()
pp = PdfPages(file_name)
np.seterr(divide='ignore', invalid='ignore')

for ix in np.random.randint(pred_tss.shape[0], size=size_interest):
    x_ran = np.arange(ix, ix + 500)
    fig = pl.figure();
    pl.plot(x_ran, orig_tss[ix, 0] / np.sum(orig_tss[ix, 0]), color='red', alpha=alpha_orig);
    pl.plot(x_ran, pred_tss[ix], color='green', alpha=alpha_pred);
    
    pl.xlabel('Genomic coordinate (bp) - Chr'+ str(int(infoRef_test[ix,0])));
    pl.ylabel('Probability density function');    
    pp.savefig();
    pl.close(fig);
    
pp.close();



In [ ]: