In this tutorial you will learn how to connect PyGMQL to a remote GMQL service, how to run parts of a query remotely and parts locally, and how to load and visualize the results in Python.
PyGMQL can interact with a remote server hosting a GMQL instance and repository.
In order to connect to the remote instance, the user has to specify the address of the remote service with set_remote_address and then call the login function.
It must be noticed that GMQL offers both authenticated and unauthenticated login: by default PyGMQL performs an unauthenticated login. This means that the remote instance will provide PyGMQL with limited resources and no permanent result storage. In order to log in to the GMQL server as an authenticated user, one has to use the pygmql_login tool, which is installed together with the library.

In the following we are going to use the publicly accessible GMQL service offered by the Genomic Computing group at Politecnico di Milano (accessible at http://gmql.eu/gmql-rest/). Since we did not register to the service, when the login call is encountered the library will log in as a GUEST user, with limited capabilities in terms of computation and disk space.
In [1]:
import gmql as gl
gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()
From this point on, all the remote calls (explained later) will be forwarded to the GMQL service hosted at the specified address as a guest user.
With PyGMQL the user can decide whether the next query (a sequence of statements on GMQLDatasets ended by a materialization) will be executed locally or on the remote service. This can be done using the set_mode function. If the user sets the mode to remote, the local datasets used in the query are uploaded to the remote service, the query itself is executed remotely, and the results are downloaded back to the local machine and, on request, loaded into a GDataframe. The following image describes the steps of remote processing with PyGMQL:
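In code, the same flow can be sketched as follows (SOME_DATASET is a placeholder name, not a dataset used in this tutorial):

import gmql as gl

gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()                      # unauthenticated (guest) login
gl.set_mode("remote")           # the next query will run on the server

dataset = gl.load_from_remote("SOME_DATASET", owner="public")   # placeholder dataset
result = dataset.materialize("./result", all_load=False)        # triggers the remote execution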
We will showcase these features through a simple example. We want to understand the enrichment of histone modifications on the promoter regions of RefSeq genes. In particular, we will extract the promoter regions of the RefSeq genes, select a set of histone-modification experiments from ENCODE, and map the signal of each histone mark onto the promoters. The result of the analysis will be a heatmap $genes \times histone\_mark$ displaying the average signal value of each HM for each of the selected genes.
In [2]:
gl.set_mode("remote")
We "load" the remote RefSeq dataset using the load_from_remote
function. Since this dataset is available to all the users of the system, we will specify public
as the ownership of the datasets. In the case in which we are loading our private datasets stored on the service, we can omit the owner
parameter.
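For instance, loading one of our own private datasets (hypothetical name) would simply be:

my_dataset = gl.load_from_remote("MY_PRIVATE_DATASET")   # owner omitted for datasets we own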
In [3]:
genes = gl.load_from_remote("HG19_ANNOTATION_REFSEQ", owner="public")
Now we have to filter the HG19_ANNOTATION_REFSEQ dataset, keeping only the samples that carry gene annotations. This operation can be done using the select function on the genes GMQLDataset variable.
In [4]:
genes = genes.select(meta_predicate=genes["annotation_type"] == 'gene')
To extract the promoter regions we will use the reg_project (region projection) operation. In particular, the following call keeps only the chr, start, stop and gene_symbol region fields, and redefines start and stop so that each resulting region spans from 2000 bases upstream to 2000 bases downstream of the original gene start.
In [6]:
promoters = genes.reg_project(field_list=['gene_symbol'],
new_field_dict={'start': genes.start - 2000,
'stop': genes.start + 2000})
Since we are interested in keeping the result of this part of the query for future usage, we materialize the result locally; with all_load=False the result is stored on disk but not loaded into memory as a GDataframe.
In [7]:
promoters = promoters.materialize("./promoters", all_load=False)
Next we want to select the H3K27me3, H3K36me3, H3K4me1, H3K79me2, H3K9ac and H3K9me3 histone marks from the HG19_ENCODE_BROAD_NOV_2017 dataset, also in this case stored remotely.
As before, we load the dataset using load_from_remote. The following operation is a meta selection, i.e. a selection performed on the metadata attributes of the dataset: the resulting dataset will contain all the samples whose metadata satisfy the specified predicate. In PyGMQL the metadata selection follows the same syntax as in the Pandas Python data analysis library:
selected_dataset = dataset[dataset['metadata_attribute'] == 'value']
Since we want to test for multiple histone marks, we can use the isin() operator.
In [8]:
hms = gl.load_from_remote("HG19_ENCODE_BROAD_NOV_2017", owner="public")
hms = hms[hms['experiment_target'].isin(['H3K27me3-human', 'H3K36me3-human',
"H3K4me1-human", "H3K79me2-human",
"H3K9ac-human", "H3K9me3-human"])]
The dataset resulting from the previous selection will contain several duplicates of each region, due to the many biological and technical replicates offered by ENCODE. Therefore we need a way to group the overlapping regions together and to aggregate their signal values. This can be done using the normal_cover operation. In particular, this function enables you to specify the minimum and maximum number of overlapping regions to be kept, the metadata attributes on which to group the samples, and the aggregation functions used to compute new region fields. In our case we keep all the regions (minimum overlapping = 1, maximum overlapping = "ANY"), group by the experiment_target metadata attribute, and average the signals of overlapping regions.
In [9]:
hms = hms.normal_cover(1, "ANY", groupBy=['experiment_target'],
new_reg_fields={'avg_signal': gl.AVG("signal")})
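For illustration only (not executed here), requiring that a region be supported by at least two overlapping replicates would just change the minimum-overlap argument:

hms_strict = hms.normal_cover(2, "ANY", groupBy=['experiment_target'],
                              new_reg_fields={'avg_signal': gl.AVG("signal")})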
Also in this case we want to keep the resulting dataset for later processing.
In [10]:
hms = hms.materialize("./hms", all_load=False)
Now that we have our set of gene promoters and our set of histone marks, we want to map the signal of the latter onto the regions of the former. We can do that using the map operation. In particular, since multiple HM regions can map onto the same promoter region, we aggregate the signal by averaging it.
We decide to perform the following operations on the local machine.
In [11]:
gl.set_mode("local")
In [12]:
mapping = promoters.map(hms, refName='prom',
expName="hm",
new_reg_fields={'avg_signal': gl.AVG("avg_signal")})
We finally materialize the result and load it into a GDataframe.
In [13]:
mapping = mapping.materialize("./mapping")
In [14]:
mapping.regs.head()
Out[14]:
In [15]:
mapping.meta.head()
Out[15]:
Finally we visualize the heatmap (in the form of clusters) with the assistance of the Seaborn visualization package.

In order to visualize the data we have to transform the GDataframe data structure held in the mapping variable into a matrix with $genes$ on the rows and $histone\_marks$ on the columns. This can be done with the to_matrix function offered by the GDataframe.

The to_matrix method is particularly versatile, since it enables the user to project regions and metadata at the same time. For example, in our case the name of the histone modification is stored in the metadata attribute hm.experiment_target, while we want to use the avg_signal region field as the matrix value. The to_matrix method therefore enables us to combine information from both regions and metadata in a single structure.

Underneath, this method leverages the Pandas pivot_table function and accepts all of its parameters together with its own specific ones.
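For example, since the extra keyword arguments are forwarded to pivot_table, one could change the aggregation applied when the same gene symbol appears multiple times (a sketch; aggfunc is a standard pivot_table parameter):

matrix = mapping.to_matrix(index_regs=['gene_symbol'],
                           columns_meta=['hm.experiment_target'],
                           values_regs=['avg_signal'],
                           fill_value=0, aggfunc='max')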
In [16]:
!pip install seaborn
!pip install scikit-learn
In [17]:
import matplotlib.pyplot as plt
import seaborn as sns
In [18]:
plt.figure()
# Cluster a random sample of 100 promoters to keep the heatmap readable
sns.clustermap(mapping.to_matrix(index_regs=['gene_symbol'],
                                 columns_meta=['hm.experiment_target'],
                                 values_regs=['avg_signal'],
                                 fill_value=0, dropna=False).sample(100))
plt.show()
In [19]:
from sklearn.decomposition import PCA
import pandas as pd
In [20]:
pca = PCA(n_components=2)  # project the six histone-mark dimensions onto two components
In [21]:
promoter_region_VS_hm = mapping.to_matrix(index_regs=['chr', 'start', 'stop', 'gene_symbol'],
columns_meta=['hm.experiment_target'],
values_regs=['avg_signal'], fill_value=0)
In [22]:
components = pca.fit_transform(promoter_region_VS_hm.values)
In [23]:
PCA_regions = pd.DataFrame(index=promoter_region_VS_hm.index,
                           columns=['1st_component', '2nd_component'],
                           data=components).reset_index()
In [24]:
plt.figure(figsize=(10, 10))
sns.regplot(data=PCA_regions, x='1st_component', y='2nd_component', fit_reg=False)
plt.show()