In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
We are given three replicas of a ChIP-Seq experiment. We want to:
The library offers:
GMQLDataset
, which represent a GMQL variable in the query. Every GMQLDataset is produced by a GMQL operator. When you call the materialize
operation on a GMQLDataset the execution is started and the result is returned.GDataframe
, holding the result of a query. A GDataframe is a pure python structure and can be used directly as any other pandas dataframe. A GDataframe holds two pandas dataframes, one for the regions and one for the metadata. We can also, given a GDataframe, go back to a GMQLDataset for using it as a GMQL variable. This can be done calling the to_GMQLDataset
function of the GDataframe.The library can be downloaded from the PyPi public repository through the pip
packaging system.
pip install gmql
If you want the most recent version of the software you can directly download it from the GitHub page:
git pull https://github.com/DEIB-GECO/PyGMQL.git
cd PyGMQL
pip install -e .
In [2]:
import gmql as gl
import pandas as pd
The queries are directly embedded inside the language and can be executed in two different modes:
In [3]:
gl.set_mode("local")
gl.set_progress(False)
The GMQL query that we are going to present is the following:
refSeqGenes = SELECT(annotation_type == 'gene' AND provider == 'RefSeq') HG19_BED_ANNOTATION;
myExp = SELECT() myRawExp;
myConfirmExp = COVER(2, ANY) myExp;
myExpOverGenes = JOIN(DIST < 0; output: RIGHT_DISTINCT) refSeqGenes myConfirmExp;
myMut = SELECT() myRawMut;
myMutOverExp = MAP() myExpOverGenes myMut;
myFilteredExp = SELECT(region: count_myMutOverExp_myMut > 0) myMutOverExp;
MATERIALIZE myFilteredExp INTO ./Results/FilteredExperiment
Both the gene dataset and the personal one are already in the GDM format, therefore the library only needs the location of the data for importing.
NB: This demo is meant to be executed without the need of a remote server. For this reason the HG19_BED_ANNOTATION dataset is loaded directly from the computer disk. Please be aware that it can also be found in the regular GMQL repository.
In [4]:
bed_annotation = gl.load_from_path("./Data/HG19_BED_ANNOTATION/")
In [5]:
myExp = gl.load_from_path("./Data/myRawExp/")
The mutation dataset is in a classical BED format but not in the GDM format, therefore we need to specify its schema through a custom parser, which is an instance of a RegionParser
.
After the parser is instantiated we can use the load_from_path
function to load the dataset, which is currently stored in the local file system
In [6]:
mutations_parser = gl.parsers.RegionParser(parser_name="mutations_parser",
chrPos=0,
startPos=1,
stopPos=2,
strandPos=3,
delimiter="\t")
In [7]:
myMut = gl.load_from_path("./Data/myRawMut/", parser=mutations_parser)
We have in the variable bed_annotation
all the ENCODE dataset of annotations. We are interested only in the annotations regarding the genes. Therefore we need to filter the dataset on the basis of the metadata.
This is called meta-selection
and in PyGMQL can be performed using the square-bracket notation common to pandas users.
In [8]:
refSeqGenes = bed_annotation[(bed_annotation['annotation_type'] == 'gene') &
(bed_annotation['provider'] == 'RefSeq')]
We want only reliable data from the Chip-Seq experiment, therefore we define a Chip-Seq region highly confident if it is confirmed by at least two replicas. This is a job for the cover
operation.
minAcc
: we define the minimum number of overlapping between samples for a region to be conserved. In this case 2.maxAcc
: we define the maximum number. The "ANY"
keyword makes the upper bound infinite.
In [9]:
myConfirmExp = myExp.normal_cover(minAcc=2, maxAcc="ANY")
Now we want to extract those regions that overlap with genes. We can do it using the join
operation.
DLE(0)
and the option output="RIGHT"
tell the engine to look at all the experiment regions that happen to intersect with the reference ones and keep them in the result.
In [10]:
myExpOverGenes = refSeqGenes.join(experiment=myConfirmExp, refName="gene",
genometric_predicate=[gl.DLE(0)],
output="RIGHT_DISTINCT")
In [11]:
myMutOverExp = myExpOverGenes.map(myMut, expName="MUTATION", refName="GENE")
In [12]:
myFilteredExp = myMutOverExp.reg_select(myMutOverExp.count_GENE_MUTATION > 0)
In [13]:
result = myFilteredExp.materialize("./Results/FilteredExperiment/")
In [14]:
result.regs.head()
Out[14]:
The region dataframe represents the regions in the output result in a tabular format. The index of the regions is the identifier of the sample they belong to.
In [15]:
result.meta
Out[15]:
The metadata dataframe has as columns all the metadata attributes and every row represents a sample in the output dataset.
There can be multiple values in the same cell.
In [16]:
plt.figure(figsize=(25, 20))
result.regs[result.regs.count_gene_mutation <=50].hist("count_gene_mutation", bins=50)
Out[16]:
Of course we can also simply display the mean value
In [17]:
result.regs['count_gene_mutation'].mean()
Out[17]:
In [18]:
result.regs[result.regs.count_gene_mutation > 5].sort_values("count_gene_mutation", ascending=False)
Out[18]:
Mapping the mutations over the genes instead that over the experiment regions
The GMQL query that we are going to present is the following:
refSeqGenes = SELECT(annotation_type == 'gene' AND provider == 'RefSeq') HG19_BED_ANNOTATION;
myExp = SELECT() myRawExp;
myConfirmExp = COVER(2, ANY) myExp;
myExpOverGenes = JOIN(DIST < 0; output: RIGHT_DISTINCT) refSeqGenes myConfirmExp;
myMut = SELECT() myRawMut;
myMutOverExp = MAP() myExpOverGenes myMut;
myFilteredExp = SELECT(region: count_myMutOverExp_myMut > 0) myMutOverExp;
MATERIALIZE myFilteredExp INTO ./Results/FilteredExperiment
In [19]:
genesOverExp = refSeqGenes.join(experiment=myConfirmExp, refName="gene",
genometric_predicate=[gl.DLE(0)],
output="LEFT_DISTINCT")
myMutOverGenes = genesOverExp.map(myMut, expName="MUTATION", refName="GENE")
myFilteredGenes = myMutOverGenes.reg_select(myMutOverGenes.count_GENE_MUTATION > 0)
result_genes = myFilteredGenes.materialize("./Results/FilteredGenes/")
In [20]:
result_genes.regs.head()
Out[20]:
In [21]:
plt.figure(figsize=(25, 20))
result_genes.regs[result_genes.regs.count_gene_mutation <=50].hist("count_gene_mutation", bins=50)
Out[21]:
In [22]:
result_genes.regs['count_gene_mutation'].mean()
Out[22]:
In [23]:
result_genes.regs[result_genes.regs.count_gene_mutation > 50].sort_values("count_gene_mutation", ascending=False)
Out[23]: