In [1]:

    
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline









    



C:\Users\Luca\Anaconda3\envs\bio\lib\site-packages\IPython\html.py:14: ShimWarning: The `IPython.html` package has been deprecated since IPython 4.0. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

PyGMQL in action

Problem description

We are given three replicas of a ChIP-Seq experiment. We want to:

Extract high-confidence regions into one sample
Identify which of these regions overlap with a set of given genes
For each resulting region count ICGC mutations.
Finally we want to select the regions with at least one mutation.

Required GMQL operations

For this pipeline we will need the following GMQL operations:

cover: extracts regions which are confirmed by at least two replicas
join: extracts regions which overalap with genes
map: for each extracted region, counts the overlapping mutations

PyGMQL in a nutshell

PyGMQL enables the writing of GMQL queries using the Python programming language. It exposes to the user all the GMQL operators and also a data structure for holding, manipulating and converting the results of a GMQL query.

The library offers:

A data structure called GMQLDataset, which represent a GMQL variable in the query. Every GMQLDataset is produced by a GMQL operator. When you call the materialize operation on a GMQLDataset the execution is started and the result is returned.
A data structure called GDataframe, holding the result of a query. A GDataframe is a pure python structure and can be used directly as any other pandas dataframe. A GDataframe holds two pandas dataframes, one for the regions and one for the metadata. We can also, given a GDataframe, go back to a GMQLDataset for using it as a GMQL variable. This can be done calling the to_GMQLDataset function of the GDataframe.

Getting the library

The library can be downloaded from the PyPi public repository through the pip packaging system.

    pip install gmql

If you want the most recent version of the software you can directly download it from the GitHub page:

    git pull https://github.com/DEIB-GECO/PyGMQL.git
    cd PyGMQL
    pip install -e .

Importing the library



In [2]:

    
import gmql as gl
import pandas as pd

Execution modes

The queries are directly embedded inside the language and can be executed in two different modes:

Local mode: the computation is executed in the local machine
Remote mode: the query is sent to a remote server, executed there and the results are downloaded. This process is shown in the figure below.



In [3]:

    
gl.set_mode("local")
gl.set_progress(False)

The query

The GMQL query that we are going to present is the following:

refSeqGenes  = SELECT(annotation_type == 'gene' AND provider == 'RefSeq') HG19_BED_ANNOTATION;

myExp  = SELECT() myRawExp;
myConfirmExp  = COVER(2, ANY) myExp;

myExpOverGenes  = JOIN(DIST < 0; output: RIGHT_DISTINCT) refSeqGenes myConfirmExp;

myMut  = SELECT() myRawMut;
myMutOverExp = MAP() myExpOverGenes myMut;

myFilteredExp = SELECT(region: count_myMutOverExp_myMut > 0) myMutOverExp;

MATERIALIZE myFilteredExp INTO ./Results/FilteredExperiment

Loading a GMQL dataset

Loading datasets in GDM format

Both the gene dataset and the personal one are already in the GDM format, therefore the library only needs the location of the data for importing.

NB: This demo is meant to be executed without the need of a remote server. For this reason the HG19_BED_ANNOTATION dataset is loaded directly from the computer disk. Please be aware that it can also be found in the regular GMQL repository.



In [4]:

    
bed_annotation = gl.load_from_path("./Data/HG19_BED_ANNOTATION/")



In [5]:

    
myExp = gl.load_from_path("./Data/myRawExp/")

Loading datasets with a generic schema

The mutation dataset is in a classical BED format but not in the GDM format, therefore we need to specify its schema through a custom parser, which is an instance of a RegionParser. After the parser is instantiated we can use the load_from_path function to load the dataset, which is currently stored in the local file system



In [6]:

    
mutations_parser = gl.parsers.RegionParser(parser_name="mutations_parser",
                                           chrPos=0,
                                           startPos=1,
                                           stopPos=2,
                                           strandPos=3,
                                           delimiter="\t")



In [7]:

    
myMut = gl.load_from_path("./Data/myRawMut/", parser=mutations_parser)

Selection on the metadata

We have in the variable bed_annotation all the ENCODE dataset of annotations. We are interested only in the annotations regarding the genes. Therefore we need to filter the dataset on the basis of the metadata. This is called meta-selection and in PyGMQL can be performed using the square-bracket notation common to pandas users.



In [8]:

    
refSeqGenes = bed_annotation[(bed_annotation['annotation_type'] == 'gene') & 
                             (bed_annotation['provider'] == 'RefSeq')]

Cover operation

We want only reliable data from the Chip-Seq experiment, therefore we define a Chip-Seq region highly confident if it is confirmed by at least two replicas. This is a job for the cover operation.

minAcc: we define the minimum number of overlapping between samples for a region to be conserved. In this case 2.
maxAcc: we define the maximum number. The "ANY" keyword makes the upper bound infinite.



In [9]:

    
myConfirmExp = myExp.normal_cover(minAcc=2, maxAcc="ANY")

Join

Now we want to extract those regions that overlap with genes. We can do it using the join operation.

We use the genes as reference dataset
The Chip-Seq regions as experiment.
The genometric predicate DLE(0) and the option output="RIGHT" tell the engine to look at all the experiment regions that happen to intersect with the reference ones and keep them in the result.



In [10]:

    
myExpOverGenes = refSeqGenes.join(experiment=myConfirmExp, refName="gene",
                                  genometric_predicate=[gl.DLE(0)],
                                  output="RIGHT_DISTINCT")

Map

Now in insideGene we have the set of highly confident Chip-Seq regions that intersect with a gene. In order to see how many mutations happen in those region we can use the map operation. This operation will add a new attribute to the resulting dataset which will be named count_GENE_MUTATION.



In [11]:

    
myMutOverExp = myExpOverGenes.map(myMut, expName="MUTATION", refName="GENE")

Selection on regions

In order to filter out all the regions in mutationCount that do not have any mutation we can do a selection on region data using the reg_select operation. This operation takes as input a predicate on region fields.



In [12]:

    
myFilteredExp = myMutOverExp.reg_select(myMutOverExp.count_GENE_MUTATION > 0)

Materialization

We can now materialize the result. PyGMQL adopts a lazy loading approach (like Spark) and no action is performed until materialization. The result can be saved as a GDM dataset and also directly loaded in python as a GDataframe.



In [13]:

    
result = myFilteredExp.materialize("./Results/FilteredExperiment/")

The GDataframe

The result of a materialization is a GDataframe which contains both regions and metadata as two Pandas dataframes regs and meta



In [14]:

    
result.regs.head()









    Out[14]:







  
    
      
      chr
      start
      stop
      strand
      accindex
      jaccardintersect
      jaccardresult
      count_gene_mutation
    
  
  
    
      S_00000.gdm
      chr1
      8035225
      8035786
      *
      3
      0.921182
      0.748768
      1.0
    
    
      S_00000.gdm
      chr1
      11779578
      11780328
      *
      3
      0.651607
      0.501303
      1.0
    
    
      S_00000.gdm
      chr1
      11898457
      11899199
      *
      3
      0.876033
      0.783943
      1.0
    
    
      S_00000.gdm
      chr1
      17286965
      17287470
      *
      3
      0.698479
      0.641770
      1.0
    
    
      S_00000.gdm
      chr1
      22140883
      22141897
      *
      3
      0.910233
      0.754937
      2.0

The region dataframe represents the regions in the output result in a tabular format. The index of the regions is the identifier of the sample they belong to.



In [15]:

    
result.meta









    Out[15]:







  
    
      
      GENE.ID
      GENE.antibody
      GENE.antibody_antibodyDescription
      GENE.antibody_lab
      GENE.antibody_label
      GENE.antibody_lots
      GENE.antibody_orderUrl
      GENE.antibody_tag
      GENE.antibody_target
      GENE.antibody_targetClass
      ...
      GENE.treatment_tag
      GENE.treatment_type
      GENE.type
      GENE.url
      GENE.view
      GENE.view_description
      GENE.view_label
      GENE.view_tag
      GENE.view_type
      MUTATION.type
    
    
      id_sample
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      S_00000.gdm
      [1880, 1881, 1882]
      [CTCF]
      [Rabbit polyclonal. Antibody Target: CTCF]
      [Myers, Hardison, Snyder]
      [CTCF (07-729)]
      [1350637 DAM1472197]
      [http://www.millipore.com/catalogue/item/07-729]
      [CTCF]
      [CTCF]
      [TFSS]
      ...
      [NONE]
      [control]
      [broadPeak]
      [http://hgdownload.cse.ucsc.edu/goldenPath/hg1...
      [Hotspots]
      [ChIP-seq affinity zones identified using the ...
      [Hotspots]
      [HOTSPOTS]
      [view]
      [mutations]
    
  

1 rows × 92 columns

The metadata dataframe has as columns all the metadata attributes and every row represents a sample in the output dataset.

There can be multiple values in the same cell.

Additional operations

Having the result in a Pandas dataframe enables us to use all the Pandas functions and perform some statistics.

For example we can do an histogram of the mutation counts over the Chip-Seq regions showing the distribution of the number of mutations over those regions.



In [16]:

    
plt.figure(figsize=(25, 20))
result.regs[result.regs.count_gene_mutation <=50].hist("count_gene_mutation", bins=50)









    Out[16]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001FF7CDAF668>]], dtype=object)






    





<matplotlib.figure.Figure at 0x1ff7ce0cc88>

Of course we can also simply display the mean value



In [17]:

    
result.regs['count_gene_mutation'].mean()









    Out[17]:





1.404494382022472

High density regions



In [18]:

    
result.regs[result.regs.count_gene_mutation > 5].sort_values("count_gene_mutation", ascending=False)









    Out[18]:







  
    
      
      chr
      start
      stop
      strand
      accindex
      jaccardintersect
      jaccardresult
      count_gene_mutation
    
  
  
    
      S_00000.gdm
      chr19
      19296510
      19297234
      *
      3
      0.860880
      0.764566
      11.0
    
    
      S_00000.gdm
      chr19
      49495965
      49497004
      *
      3
      0.913808
      0.000000
      6.0

Alternative experiment

Mapping the mutations over the genes instead that over the experiment regions

The GMQL query that we are going to present is the following:

refSeqGenes  = SELECT(annotation_type == 'gene' AND provider == 'RefSeq') HG19_BED_ANNOTATION;

myExp  = SELECT() myRawExp;
myConfirmExp  = COVER(2, ANY) myExp;

myExpOverGenes  = JOIN(DIST < 0; output: RIGHT_DISTINCT) refSeqGenes myConfirmExp;

myMut  = SELECT() myRawMut;
myMutOverExp = MAP() myExpOverGenes myMut;

myFilteredExp = SELECT(region: count_myMutOverExp_myMut > 0) myMutOverExp;

MATERIALIZE myFilteredExp INTO ./Results/FilteredExperiment



In [19]:

    
genesOverExp = refSeqGenes.join(experiment=myConfirmExp, refName="gene",
                                  genometric_predicate=[gl.DLE(0)],
                                  output="LEFT_DISTINCT")
myMutOverGenes = genesOverExp.map(myMut, expName="MUTATION", refName="GENE")
myFilteredGenes = myMutOverGenes.reg_select(myMutOverGenes.count_GENE_MUTATION > 0)
result_genes = myFilteredGenes.materialize("./Results/FilteredGenes/")



In [20]:

    
result_genes.regs.head()









    Out[20]:







  
    
      
      chr
      start
      stop
      strand
      name
      score
      count_gene_mutation
    
  
  
    
      S_00000.gdm
      chr1
      1716723
      1822556
      -
      NM_002074
      0.0
      2.0
    
    
      S_00000.gdm
      chr1
      1716723
      1822556
      -
      NM_001282539
      0.0
      2.0
    
    
      S_00000.gdm
      chr1
      1716723
      1822556
      -
      NM_001282538
      0.0
      2.0
    
    
      S_00000.gdm
      chr1
      2115898
      2139172
      -
      NM_001146310
      0.0
      2.0
    
    
      S_00000.gdm
      chr1
      2120987
      2144159
      -
      NM_001282670
      0.0
      2.0



In [21]:

    
plt.figure(figsize=(25, 20))
result_genes.regs[result_genes.regs.count_gene_mutation <=50].hist("count_gene_mutation", bins=50)









    Out[21]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001FF7D1614A8>]], dtype=object)






    





<matplotlib.figure.Figure at 0x1ff7d161b00>



In [22]:

    
result_genes.regs['count_gene_mutation'].mean()









    Out[22]:





3.4381662107348365



In [23]:

    
result_genes.regs[result_genes.regs.count_gene_mutation > 50].sort_values("count_gene_mutation", ascending=False)









    Out[23]:







  
    
      
      chr
      start
      stop
      strand
      name
      score
      count_gene_mutation
    
  
  
    
      S_00000.gdm
      chr4
      144498560
      144621828
      -
      NM_001168235
      0.0
      283.0
    
    
      S_00000.gdm
      chr15
      78463186
      78527049
      -
      NM_001199377
      0.0
      141.0
    
    
      S_00000.gdm
      chr15
      78463186
      78527049
      -
      NM_015162
      0.0
      141.0
    
    
      S_00000.gdm
      chr2
      27505296
      27530307
      +
      NM_032546
      0.0
      114.0
    
    
      S_00000.gdm
      chr2
      27505296
      27530307
      +
      NM_187841
      0.0
      114.0
    
    
      S_00000.gdm
      chr11
      116714117
      116969131
      -
      NM_025164
      0.0
      99.0
    
    
      S_00000.gdm
      chr11
      116714117
      116969131
      -
      NM_001281749
      0.0
      99.0
    
    
      S_00000.gdm
      chr11
      116714117
      116969131
      -
      NM_001281748
      0.0
      99.0
    
    
      S_00000.gdm
      chr4
      144257982
      144395718
      +
      NM_002039
      0.0
      87.0
    
    
      S_00000.gdm
      chr4
      144257982
      144395718
      +
      NM_207123
      0.0
      87.0
    
    
      S_00000.gdm
      chr15
      78730517
      78793798
      +
      NM_004136
      0.0
      86.0
    
    
      S_00000.gdm
      chr16
      53737874
      54148379
      +
      NM_001080432
      0.0
      84.0
    
    
      S_00000.gdm
      chr1
      218458628
      218511325
      +
      NM_016052
      0.0
      71.0
    
    
      S_00000.gdm
      chr4
      88928798
      88998931
      +
      NM_000297
      0.0
      60.0
    
    
      S_00000.gdm
      chr8
      2792874
      4852328
      -
      NM_033225
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_047534
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_047535
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_047532
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_003529
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_120536
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_047536
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_047538
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_047543
      0.0
      59.0
    
    
      S_00000.gdm
      chr9
      21994789
      22121093
      +
      NR_047537
      0.0
      59.0
    
    
      S_00000.gdm
      chr15
      78556486
      78574538
      +
      NM_018602
      0.0
      52.0
    
    
      S_00000.gdm
      chr9
      133028157
      133309510
      +
      NM_001291815
      0.0
      51.0

	chr	start	stop	strand	accindex	jaccardintersect	jaccardresult	count_gene_mutation
S_00000.gdm	chr1	8035225	8035786	*	3	0.921182	0.748768	1.0
S_00000.gdm	chr1	11779578	11780328	*	3	0.651607	0.501303	1.0
S_00000.gdm	chr1	11898457	11899199	*	3	0.876033	0.783943	1.0
S_00000.gdm	chr1	17286965	17287470	*	3	0.698479	0.641770	1.0
S_00000.gdm	chr1	22140883	22141897	*	3	0.910233	0.754937	2.0

	GENE.ID	GENE.antibody	GENE.antibody_antibodyDescription	GENE.antibody_lab	GENE.antibody_label	GENE.antibody_lots	GENE.antibody_orderUrl	GENE.antibody_tag	GENE.antibody_target	GENE.antibody_targetClass	...	GENE.treatment_tag	GENE.treatment_type	GENE.type	GENE.url	GENE.view	GENE.view_description	GENE.view_label	GENE.view_tag	GENE.view_type	MUTATION.type
id_sample
S_00000.gdm	[1880, 1881, 1882]	[CTCF]	[Rabbit polyclonal. Antibody Target: CTCF]	[Myers, Hardison, Snyder]	[CTCF (07-729)]	[1350637 DAM1472197]	[http://www.millipore.com/catalogue/item/07-729]	[CTCF]	[CTCF]	[TFSS]	...	[NONE]	[control]	[broadPeak]	[http://hgdownload.cse.ucsc.edu/goldenPath/hg1...	[Hotspots]	[ChIP-seq affinity zones identified using the ...	[Hotspots]	[HOTSPOTS]	[view]	[mutations]

	chr	start	stop	strand	accindex	jaccardintersect	jaccardresult	count_gene_mutation
S_00000.gdm	chr19	19296510	19297234	*	3	0.860880	0.764566	11.0
S_00000.gdm	chr19	49495965	49497004	*	3	0.913808	0.000000	6.0

	chr	start	stop	strand	name	count_gene_mutation
S_00000.gdm	chr1	1716723	1822556	-	NM_002074	2.0
S_00000.gdm	chr1	1716723	1822556	-	NM_001282539	2.0
S_00000.gdm	chr1	1716723	1822556	-	NM_001282538	2.0
S_00000.gdm	chr1	2115898	2139172	-	NM_001146310	2.0
S_00000.gdm	chr1	2120987	2144159	-	NM_001282670	2.0

	chr	start	stop	strand	name	count_gene_mutation
S_00000.gdm	chr4	144498560	144621828	-	NM_001168235	283.0
S_00000.gdm	chr15	78463186	78527049	-	NM_001199377	141.0
S_00000.gdm	chr15	78463186	78527049	-	NM_015162	141.0
S_00000.gdm	chr2	27505296	27530307	+	NM_032546	114.0
S_00000.gdm	chr2	27505296	27530307	+	NM_187841	114.0
S_00000.gdm	chr11	116714117	116969131	-	NM_025164	99.0
S_00000.gdm	chr11	116714117	116969131	-	NM_001281749	99.0
S_00000.gdm	chr11	116714117	116969131	-	NM_001281748	99.0
S_00000.gdm	chr4	144257982	144395718	+	NM_002039	87.0
S_00000.gdm	chr4	144257982	144395718	+	NM_207123	87.0
S_00000.gdm	chr15	78730517	78793798	+	NM_004136	86.0
S_00000.gdm	chr16	53737874	54148379	+	NM_001080432	84.0
S_00000.gdm	chr1	218458628	218511325	+	NM_016052	71.0
S_00000.gdm	chr4	88928798	88998931	+	NM_000297	60.0
S_00000.gdm	chr8	2792874	4852328	-	NM_033225	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_047534	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_047535	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_047532	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_003529	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_120536	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_047536	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_047538	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_047543	59.0
S_00000.gdm	chr9	21994789	22121093	+	NR_047537	59.0
S_00000.gdm	chr15	78556486	78574538	+	NM_018602	52.0
S_00000.gdm	chr9	133028157	133309510	+	NM_001291815	51.0