In this tutorial you will learn how to connect PyGMQL to a remote GMQL service, how to run parts of a query remotely and parts locally, and how to load and visualize the results in Python.
PyGMQL can interact with a remote server hosting a GMQL instance and repository.
In order to connect to the remote instance, the user has to specify the address of the remote service with set_remote_address and then call the login function.
It must be noticed that GMQL offers both authenticated and unauthenticated login: by default PyGMQL performs an unauthenticated login. This means that the remote instance will provide PyGMQL with limited resources and no permanent result storage. In order to log in to the GMQL server as an authenticated user, one has to use the pygmql_login tool, which is installed together with the library.

In the following we are going to use the publicly accessible GMQL service offered by the Genomic Computing group at Politecnico di Milano (accessible at http://gmql.eu/gmql-rest/). Since we did not register to the service, when the login call is encountered the library will log in as a GUEST user, with limited capabilities in terms of computation and disk space.
In [1]:
import gmql as gl
gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()
From this point on, all the remote calls (explained later) will be forwarded to the GMQL service hosted at the specified address as a guest user.
With PyGMQL the user can decide whether the next query (a sequence of statements on GMQLDatasets ended by a materialization) will be executed locally or on the remote service. This can be done using the set_mode function. If the user sets the mode to remote, the local datasets used in the query are uploaded to the remote service, the query itself is executed remotely, and the results are downloaded back to the local machine and, on request, loaded into a GDataframe. The following image describes the steps of remote processing with PyGMQL:
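In code, the same flow can be sketched as follows (SOME_DATASET is a placeholder name, not a dataset used in this tutorial):

import gmql as gl

gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()                      # unauthenticated (guest) login
gl.set_mode("remote")           # the next query will run on the server

dataset = gl.load_from_remote("SOME_DATASET", owner="public")   # placeholder dataset
result = dataset.materialize("./result", all_load=False)        # triggers the remote execution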
We will showcase these features through a simple example. We want to understand the enrichment of histone modifications on the promoter regions of RefSeq genes. In particular, we will extract the promoter regions of the RefSeq genes, select a set of histone-modification experiments from ENCODE, and map the signal of each histone mark onto the promoters. The result of the analysis will be a heatmap $genes \times histone\_mark$ displaying the average signal value of each HM for each of the selected genes.
In [2]:
gl.set_mode("remote")
We "load" the remote RefSeq dataset using the load_from_remote
function. Since this dataset is available to all the users of the system, we will specify public
as the ownership of the datasets. In the case in which we are loading our private datasets stored on the service, we can omit the owner
parameter.
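For instance, loading one of our own private datasets (hypothetical name) would simply be:

my_dataset = gl.load_from_remote("MY_PRIVATE_DATASET")   # owner omitted for datasets we own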
In [3]:
genes = gl.load_from_remote("HG19_ANNOTATION_REFSEQ", owner="public")
Now we have to filter the HG19_ANNOTATION_REFSEQ dataset, keeping only the samples that carry gene annotations. This operation can be done using the select function on the genes GMQLDataset variable.
In [4]:
genes = genes.select(meta_predicate=genes["annotation_type"] == 'gene')
To extract the promoter regions we will use the reg_project (region projection) operation. In particular, the following call keeps only the chr, start, stop and gene_symbol region fields, and redefines start and stop so that each resulting region spans from 2000 bases upstream to 2000 bases downstream of the original gene start.
In [6]:
promoters = genes.reg_project(field_list=['gene_symbol'],
new_field_dict={'start': genes.start - 2000,
'stop': genes.start + 2000})
Since we are interested in keeping the result of this part of the query for future usage, we materialize the result locally; with all_load=False the result is stored on disk but not loaded into memory as a GDataframe.
In [7]:
promoters = promoters.materialize("./promoters", all_load=False)
Next we want to select the H3K27me3, H3K36me3, H3K4me1, H3K79me2, H3K9ac and H3K9me3 histone marks from the HG19_ENCODE_BROAD_NOV_2017 dataset, also in this case stored remotely.
As before, we load the dataset using load_from_remote. The following operation is a meta selection, i.e. a selection performed on the metadata attributes of the dataset: the resulting dataset will contain all the samples whose metadata satisfy the specified predicate. In PyGMQL the metadata selection follows the same syntax as in the Pandas Python data analysis library:
selected_dataset = dataset[dataset['metadata_attribute'] == 'value']
Since we want to test for multiple histone marks, we can use the isin() operator.
In [8]:
hms = gl.load_from_remote("HG19_ENCODE_BROAD_NOV_2017", owner="public")
hms = hms[hms['experiment_target'].isin(['H3K27me3-human', 'H3K36me3-human',
"H3K4me1-human", "H3K79me2-human",
"H3K9ac-human", "H3K9me3-human"])]
The dataset resulting from the previous selection will contain several duplicates of each region, due to the many biological and technical replicates offered by ENCODE. Therefore we need a way to group the overlapping regions together and to aggregate their signal values. This can be done using the normal_cover operation. In particular, this function enables you to specify the minimum and maximum number of overlapping regions to be kept, the metadata attributes on which to group the samples, and the aggregation functions used to compute new region fields. In our case we keep all the regions (minimum overlapping = 1, maximum overlapping = "ANY"), group by the experiment_target metadata attribute, and average the signals of overlapping regions.
In [9]:
hms = hms.normal_cover(1, "ANY", groupBy=['experiment_target'],
new_reg_fields={'avg_signal': gl.AVG("signal")})
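For illustration only (not executed here), requiring that a region be supported by at least two overlapping replicates would just change the minimum-overlap argument:

hms_strict = hms.normal_cover(2, "ANY", groupBy=['experiment_target'],
                              new_reg_fields={'avg_signal': gl.AVG("signal")})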
Also in this case we want to keep the resulting dataset for later processing.
In [10]:
hms = hms.materialize("./hms", all_load=False)
Now that we have our set of gene promoters and our set of histone marks, we want to map the signal of the latter onto the regions of the former. We can do that using the map operation. In particular, since multiple HM regions can map onto the same promoter region, we aggregate the signal by averaging it.
We decide to perform the following operations on the local machine.
In [11]:
gl.set_mode("local")
In [12]:
mapping = promoters.map(hms, refName='prom',
expName="hm",
new_reg_fields={'avg_signal': gl.AVG("avg_signal")})
We finally materialize the result and load it into a GDataframe.
In [13]:
mapping = mapping.materialize("./mapping")
In [14]:
mapping.regs.head()
Out[14]:
In [15]:
mapping.meta.head()
Out[15]:
Finally we visualize the heatmap (in the form of clusters) with the assistance of the Seaborn visualization package.

In order to visualize the data we have to transform the GDataframe data structure held in the mapping variable into a matrix with $genes$ on the rows and $histone\_marks$ on the columns. This can be done with the to_matrix function offered by the GDataframe.

The to_matrix method is particularly versatile, since it enables the user to project regions and metadata at the same time. For example, in our case the name of the histone modification is stored in the metadata attribute hm.experiment_target, while we want to use the avg_signal region field as the matrix value. The to_matrix method therefore enables us to combine information from both regions and metadata in a single structure.

Underneath, this method leverages the Pandas pivot_table function and accepts all of its parameters together with its own specific ones.
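For example, since the extra keyword arguments are forwarded to pivot_table, one could change the aggregation applied when the same gene symbol appears multiple times (a sketch; aggfunc is a standard pivot_table parameter):

matrix = mapping.to_matrix(index_regs=['gene_symbol'],
                           columns_meta=['hm.experiment_target'],
                           values_regs=['avg_signal'],
                           fill_value=0, aggfunc='max')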
In [16]:
!pip install seaborn
!pip install scikit-learn
In [17]:
import matplotlib.pyplot as plt
import seaborn as sns
In [18]:
plt.figure()
# Cluster a random sample of 100 promoters to keep the heatmap readable
sns.clustermap(mapping.to_matrix(index_regs=['gene_symbol'],
                                 columns_meta=['hm.experiment_target'],
                                 values_regs=['avg_signal'],
                                 fill_value=0, dropna=False).sample(100))
plt.show()
In [19]:
from sklearn.decomposition import PCA
import pandas as pd
In [20]:
pca = PCA(n_components=2)  # project the six histone-mark dimensions onto two components
In [21]:
promoter_region_VS_hm = mapping.to_matrix(index_regs=['chr', 'start', 'stop', 'gene_symbol'],
columns_meta=['hm.experiment_target'],
values_regs=['avg_signal'], fill_value=0)
In [22]:
components = pca.fit_transform(promoter_region_VS_hm.values)
In [23]:
PCA_regions = pd.DataFrame(index=promoter_region_VS_hm.index,
                           columns=['1st_component', '2nd_component'],
                           data=components).reset_index()
In [24]:
plt.figure(figsize=(10, 10))
sns.regplot(data=PCA_regions, x='1st_component', y='2nd_component', fit_reg=False)
plt.show()