FIDDLE File Types and Usage:

This notebook outlines input file types and their role in the FIDDLE work flow. It also outlines several ways of examining file contents and the variables lying within them that can be modified to suit a particular question. This notebook assumes that the quick start has been attempted by the user.

NOTE: The following python packages are easily installable via pip, e.g:

pip install h5py==2.7.0



In [1]:

    
from scipy import stats
import h5py

INTRO:

The highly modular nature of FIDDLE, exemplified by the depiction below, entails similarly modular input files.

RELEVANT JSON FILES:

There are two .json files that dictate how FIDDLE is run:

1. configurations.json
2. architecture.json

configurations.json

'configurations.json' parametrizes the sequencing file input types and their characteristics. In the example case, the Genome sub-field is "sacCer3", the Tracks sub-field consist of TSS-seq data and others, the Options sub-field consist of which "Inputs", "Outputs", and other traits FIDDLE takes into consideration. Note that the caveat to the hyper-modularity of this input file is that each of the modified variables must exactly mirror what lies within the input hdf5 files - more on that down the page.

! cat configurations.json



In [ ]:

    
! cat configurations.json

architecture.json

'architecture.json' parametrizes the hyper-parameters and other neural network specific variables that FIDDLE will employ. The Encoder and Decoder will utilize the same hyper-parameters.

! cat architecture.json



In [ ]:

    
! cat architecture.json

RELEVANT HDF5 FILES:

Using the quick start hdf5 datasets as examples, one can see that the dimensions of the tracks within the hdf5 datasets reflect the characteristics of the sequencing inputs. The train, validation, and test hdf5datasets are simply partitions of an original hdf5dataset that was compiled from scripts found in the 'fiddle/data_prep/' directory. A guide on how this is carried out can be found by starting up the 'fiddle/data_prep/data_guide.ipynb'.

train = h5py.File('../data/hdf5datasets/train.h5', 'r')

train.items()



In [10]:

    
train = h5py.File('../data/hdf5datasets/NSMSDSRSCSTSRI_500bp/train.h5', 'r')
train.items()









    Out[10]:





[(u'chipseq', <HDF5 dataset "chipseq": shape (109650, 2, 500, 1), type "<f4">),
 (u'dnaseq', <HDF5 dataset "dnaseq": shape (109650, 4, 500, 1), type "<f4">),
 (u'info', <HDF5 dataset "info": shape (109650, 4), type "<f4">),
 (u'mnaseseq',
  <HDF5 dataset "mnaseseq": shape (109650, 2, 500, 1), type "<f4">),
 (u'netseq', <HDF5 dataset "netseq": shape (109650, 2, 500, 1), type "<f4">),
 (u'randinp', <HDF5 dataset "randinp": shape (109650, 1, 500, 1), type "<f4">),
 (u'rnaseq', <HDF5 dataset "rnaseq": shape (109650, 1, 500, 1), type "<f4">),
 (u'tssseq', <HDF5 dataset "tssseq": shape (109650, 2, 500, 1), type "<f4">)]

validation = h5py.File('../data/hdf5datasets/validation.h5', 'r')

validation.items()



In [8]:

    
validation = h5py.File('../data/hdf5datasets/NSMSDSRSCSTSRI_500bp/validation.h5', 'r')
validation.items()









    Out[8]:





[(u'chipseq', <HDF5 dataset "chipseq": shape (6450, 2, 500, 1), type "<f4">),
 (u'dnaseq', <HDF5 dataset "dnaseq": shape (6450, 4, 500, 1), type "<f4">),
 (u'info', <HDF5 dataset "info": shape (6450, 4), type "<f4">),
 (u'mnaseseq', <HDF5 dataset "mnaseseq": shape (6450, 2, 500, 1), type "<f4">),
 (u'netseq', <HDF5 dataset "netseq": shape (6450, 2, 500, 1), type "<f4">),
 (u'randinp', <HDF5 dataset "randinp": shape (6450, 1, 500, 1), type "<f4">),
 (u'rnaseq', <HDF5 dataset "rnaseq": shape (6450, 1, 500, 1), type "<f4">),
 (u'tssseq', <HDF5 dataset "tssseq": shape (6450, 2, 500, 1), type "<f4">)]

test = h5py.File('../data/hdf5datasets/test.h5', 'r')

test.items()



In [4]:

    
test = h5py.File('../data/hdf5datasets/NSMSDSRSCSTSRI_500bp/test.h5', 'r')
test.items()









    Out[4]:





[(u'chipseq', <HDF5 dataset "chipseq": shape (12900, 2, 500, 1), type "<f4">),
 (u'dnaseq', <HDF5 dataset "dnaseq": shape (12900, 4, 500, 1), type "<f4">),
 (u'info', <HDF5 dataset "info": shape (12900, 4), type "<f4">),
 (u'mnaseseq',
  <HDF5 dataset "mnaseseq": shape (12900, 2, 500, 1), type "<f4">),
 (u'netseq', <HDF5 dataset "netseq": shape (12900, 2, 500, 1), type "<f4">),
 (u'randinp', <HDF5 dataset "randinp": shape (12900, 1, 500, 1), type "<f4">),
 (u'rnaseq', <HDF5 dataset "rnaseq": shape (12900, 1, 500, 1), type "<f4">),
 (u'tssseq', <HDF5 dataset "tssseq": shape (12900, 2, 500, 1), type "<f4">)]

Examining the 'info' track:

The 'info' track is the track that holds index information relevant to the sequencing datasets. The dimensions of the 'info' correspond to the following:

1. Chromosome number (e.g. 1-16)
2. Strandedness (e.g. -1, 1)
3. Gene index (parsed from the original GFF file input)
4. Base Pair index (e.g. up to ~10^6)

infoRef_test = test.get('info')[:]

stats.describe(infoRef_test[:, X])



In [13]:

    
infoRef_test = test.get('info')[:]
stats.describe(infoRef_test[:, 0])
stats.describe(infoRef_test[:, 1])
stats.describe(infoRef_test[:, 2])
stats.describe(infoRef_test[:, 3])
print()
infoRef_validation = validation.get('info')[:]
stats.describe(infoRef_validation[:, 0])
stats.describe(infoRef_validation[:, 1])
stats.describe(infoRef_validation[:, 2])
stats.describe(infoRef_validation[:, 3])
print()
infoRef_train = train.get('info')[:]
stats.describe(infoRef_train[:, 0])
stats.describe(infoRef_train[:, 1])
stats.describe(infoRef_train[:, 2])
stats.describe(infoRef_train[:, 3])









    Out[13]:





DescribeResult(nobs=12900, minmax=(1.0, 16.0), mean=9.3024807, variance=20.961603, skewness=-0.12559175491333008, kurtosis=-1.3162881165056144)






    Out[13]:





DescribeResult(nobs=12900, minmax=(-1.0, 1.0), mean=0.0, variance=1.0000775, skewness=0.0, kurtosis=-2.0)






    Out[13]:





DescribeResult(nobs=12900, minmax=(1.0, 5159.0), mean=2594.1211, variance=2207513.8, skewness=-0.017716964706778526, kurtosis=-1.1933110312711752)






    Out[13]:





DescribeResult(nobs=12900, minmax=(2277.0, 1520825.0), mean=458661.19, variance=1.021782e+11, skewness=0.8305436968803406, kurtosis=0.29999693955268336)






    



()






    Out[13]:





DescribeResult(nobs=6450, minmax=(1.0, 16.0), mean=9.4158144, variance=20.835289, skewness=-0.16184914112091064, kurtosis=-1.3103733499301848)






    Out[13]:





DescribeResult(nobs=6450, minmax=(-1.0, 1.0), mean=0.025736434, variance=0.99949259, skewness=-0.051489971578121185, kurtosis=-1.9973486931724096)






    Out[13]:





DescribeResult(nobs=6450, minmax=(3.0, 5159.0), mean=2578.4915, variance=2167186.8, skewness=0.014728446491062641, kurtosis=-1.1667646167150223)






    Out[13]:





DescribeResult(nobs=6450, minmax=(2757.0, 1520865.0), mean=458192.12, variance=9.9381756e+10, skewness=0.7993639707565308, kurtosis=0.28129983971314587)






    



()






    Out[13]:





DescribeResult(nobs=109650, minmax=(1.0, 16.0), mean=9.3204832, variance=20.868563, skewness=-0.13883629441261292, kurtosis=-1.3147827371123004)






    Out[13]:





DescribeResult(nobs=109650, minmax=(-1.0, 1.0), mean=0.013989968, variance=0.99981326, skewness=-0.02798268012702465, kurtosis=-1.9992167649675758)






    Out[13]:





DescribeResult(nobs=109650, minmax=(0.0, 5159.0), mean=2577.8389, variance=2223195.8, skewness=0.001250637462362647, kurtosis=-1.2026441943598167)






    Out[13]:





DescribeResult(nobs=109650, minmax=(2317.0, 1520885.0), mean=457756.59, variance=1.012257e+11, skewness=0.8345269560813904, kurtosis=0.3338435990318005)

Jupyter Notebook as a Documentation Resource:

An advantage of this medium of Python interaction is the ability to quickly examine Python scripts. Using the '?' as a prepend, FIDDLE's documentation is easily accessed. First, we will import several FIDDLE scripts and quickly checkout their internals.

import main, models, analysis



In [ ]:

    
import main, models, visualization

Note: the following applies to any Python method/class/ADT/etc. as well as all imported Python packages:

The '?' prepend allows direct access to a Python script's docstrings. The '??' prepend allows direct access to the whole Python script. Jupyter Notebook's autocomplete feature allows an easy understanding of available methods. Click to escape from the pop up that results from the following commands:

?main

??models

?visualization



In [ ]:

    
?main



In [ ]:

    
??models.



In [ ]:

    
?visualization