In [1]:
import gmql as gl
PyGMQL can work with BED and GTF files with arbitrary fields and schemas. In order to load a dataset into Python, the user can use the following functions:

- load_from_path: lazily loads a dataset into a GMQLDataset variable from the local file system
- load_from_remote: lazily loads a dataset into a GMQLDataset variable from a remote GMQL service
- from_pandas: lazily loads a dataset into a GMQLDataset variable from a Pandas DataFrame having at least the chromosome, start and stop columns

In addition to these functions, we also provide a function called get_example_dataset, which enables the user to load a sample dataset and play with it in order to get familiar with the library. Currently we provide two example datasets: Example_Dataset_1 and Example_Dataset_2.
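As an illustration, here is a minimal sketch of how these loading functions could be called. The paths, the DataFrame contents and the keyword argument names (local_path, chr_name, start_name, stop_name) are assumptions made for this sketch and should be checked against the library documentation.

import gmql as gl
import pandas as pd

# Lazily load a dataset from the local file system
# (hypothetical path; nothing is read until the query is materialized).
local_ds = gl.load_from_path(local_path="./my_bed_dataset/")

# Build a dataset from a Pandas DataFrame that has at least
# the chromosome, start and stop columns.
df = pd.DataFrame({"chr":   ["chr1", "chr1", "chr2"],
                   "start": [100,    500,    300],
                   "stop":  [200,    800,    450]})
pandas_ds = gl.from_pandas(df, chr_name="chr", start_name="start", stop_name="stop")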
In the following we will load two example datasets and play with them.
In [2]:
dataset1 = gl.get_example_dataset("Example_Dataset_1")
dataset2 = gl.get_example_dataset("Example_Dataset_2")
GMQLDataset
The dataset1 and dataset2 variables defined above are GMQLDataset objects, which represent GMQL variables and on which it is possible to apply GMQL operators. Note that no data has been loaded into memory yet: the computation starts only when the query is triggered. We will see how to trigger the execution of a query in the following steps.
We can inspect the schema of the dataset with the following:
In [3]:
dataset1.schema
Out[3]:
In [4]:
dataset2.schema
Out[4]:
In [5]:
filtered_dataset1 = dataset1.reg_select((dataset1.chr == 'chr3') & (dataset1.start >= 30000))
From this operation we can learn several things about the GMQLDataset data structure. Each GMQLDataset has a set of methods and fields which can be used to build GMQL queries. For example, in the previous statement we have:

- the reg_select method, which enables us to filter the dataset on the basis of a predicate on the region positions and features
- the chr and start fields, which enable the user to build predicates on the fields of the dataset

Every GMQL operator has a corresponding method accessible from the GMQLDataset data structure, and every field of the dataset is accessible in the same way.
In [6]:
filtered_dataset_2 = dataset2[dataset2['antibody_target'] == 'CTCF']
Notice that the notation for selecting the samples using metadata is the same as the one for filtering Pandas DataFrames.
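Since each of these operations returns a new GMQLDataset, metadata and region selections can be chained. The following sketch, which only uses fields already shown above, first keeps the CTCF samples and then restricts their regions to chromosome 3; both steps remain lazy.

# Select samples by metadata, then filter their regions.
ctcf_samples = dataset2[dataset2['antibody_target'] == 'CTCF']
ctcf_chr3 = ctcf_samples.reg_select(ctcf_samples.chr == 'chr3')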
It is not the focus of this tutorial to show all the operations that can be performed on a GMQLDataset; they can be found on the documentation page of the library.
For the sake of this example, let's show the JOIN operation between the two datasets. The semantics of the JOIN operation relies on the concepts of reference and experiment datasets: the reference dataset is the one calling the join method, while the experiment dataset is the one passed to it as the first argument. The general form of the operation is
resulting_dataset = <reference>.join(<experiment>, <genometric predicate>, ...)
In [7]:
dataset_join = dataset1.join(dataset2, [gl.DLE(0)])
To understand the concept of genometric predicate, please visit the documentation of the library.
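A genometric predicate can also combine several clauses. The sketch below assumes that the gl.MD (minimum distance) and gl.DGE (distance greater than or equal) clauses described in the PyGMQL documentation are available alongside gl.DLE; this is an illustrative assumption, not part of the example above.

# For each reference region, match the closest experiment region,
# provided it is at least 1 kb away (assumed clause semantics).
dataset_join_md = dataset1.join(dataset2, [gl.MD(1), gl.DGE(1000)])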
As we have already said, no operation has actually been executed up to this point; so far we have only defined the sequence of operations to apply to the data. In order to trigger the execution, we have to call the materialize function on the variable we want to compute.
In [8]:
query_result = dataset_join.materialize()
In [10]:
query_result.regs.head()
Out[10]:
In [11]:
query_result.meta.head()
Out[11]:
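Since the result exposes the regions and the metadata as Pandas DataFrames, standard Pandas operations can be used for downstream analysis. A minimal sketch, assuming (as in PyGMQL's GDataframe) that both DataFrames are indexed by sample id:

# Count how many result regions belong to each sample.
regions_per_sample = query_result.regs.groupby(level=0).size()

# Attach the counts to the sample metadata for a quick overview.
summary = query_result.meta.join(regions_per_sample.rename("n_regions"))
print(summary.head())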