In [1]:
import gmql as gl
PyGMQL can work with BED and GTF files with arbitrary fields and schemas. In order to load a dataset into Python, the user can use the following functions:

- load_from_path: lazily loads a dataset into a GMQLDataset variable from the local file system
- load_from_remote: lazily loads a dataset into a GMQLDataset variable from a remote GMQL service
- from_pandas: lazily loads a dataset into a GMQLDataset variable from a Pandas DataFrame having at least the chromosome, start and stop columns

In addition to these functions, we also provide a function called get_example_dataset, which enables the user to load a sample dataset and play with it in order to get familiar with the library. Currently we provide two example datasets: Example_Dataset_1 and Example_Dataset_2.
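As an illustration, here is a minimal sketch of how these loading functions could be called. The paths, the DataFrame contents and the keyword argument names (local_path, chr_name, start_name, stop_name) are assumptions made for this sketch and should be checked against the library documentation.

import gmql as gl
import pandas as pd

# Lazily load a dataset from the local file system
# (hypothetical path; nothing is read until the query is materialized).
local_ds = gl.load_from_path(local_path="./my_bed_dataset/")

# Build a dataset from a Pandas DataFrame that has at least
# the chromosome, start and stop columns.
df = pd.DataFrame({"chr":   ["chr1", "chr1", "chr2"],
                   "start": [100,    500,    300],
                   "stop":  [200,    800,    450]})
pandas_ds = gl.from_pandas(df, chr_name="chr", start_name="start", stop_name="stop")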
In the following we will load two example datasets and play with them.
In [2]:
dataset1 = gl.get_example_dataset("Example_Dataset_1")
dataset2 = gl.get_example_dataset("Example_Dataset_2")
GMQLDataset
The dataset1 and dataset2 variables defined above are GMQLDataset objects, which represent GMQL variables and on which it is possible to apply GMQL operators. Note that no data has been loaded into memory yet: the computation starts only when the query is triggered. We will see how to trigger the execution of a query in the following steps.
We can inspect the schema of the dataset with the following:
In [3]:
dataset1.schema
Out[3]:
In [4]:
dataset2.schema
Out[4]:
In [5]:
filtered_dataset1 = dataset1.reg_select((dataset1.chr == 'chr3') & (dataset1.start >= 30000))
From this operation we can learn several things about the GMQLDataset data structure. Each GMQLDataset has a set of methods and fields which can be used to build GMQL queries. For example, in the previous statement we have:

- the reg_select method, which enables us to filter the dataset on the basis of a predicate on the region positions and features
- the chr and start fields, which enable the user to build predicates on the fields of the dataset

Every GMQL operator has a corresponding method accessible from the GMQLDataset data structure, and every field of the dataset is accessible in the same way.
In [6]:
filtered_dataset_2 = dataset2[dataset2['antibody_target'] == 'CTCF']
Notice that the notation for selecting the samples using metadata is the same as the one for filtering Pandas DataFrames.
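Since each of these operations returns a new GMQLDataset, metadata and region selections can be chained. The following sketch, which only uses fields already shown above, first keeps the CTCF samples and then restricts their regions to chromosome 3; both steps remain lazy.

# Select samples by metadata, then filter their regions.
ctcf_samples = dataset2[dataset2['antibody_target'] == 'CTCF']
ctcf_chr3 = ctcf_samples.reg_select(ctcf_samples.chr == 'chr3')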
It is not the focus of this tutorial to show all the operations that can be performed on a GMQLDataset; they can be found on the documentation page of the library.
For the sake of this example, let's show the JOIN operation between the two datasets. The semantics of the JOIN operation relies on the concepts of reference and experiment datasets: the reference dataset is the one calling the join method, while the experiment dataset is the one passed to it as the first argument. The general form of the operation is
resulting_dataset = <reference>.join(<experiment>, <genometric predicate>, ...)
In [7]:
dataset_join = dataset1.join(dataset2, [gl.DLE(0)])
To understand the concept of genometric predicate, please visit the documentation of the library.
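A genometric predicate can also combine several clauses. The sketch below assumes that the gl.MD (minimum distance) and gl.DGE (distance greater than or equal) clauses described in the PyGMQL documentation are available alongside gl.DLE; this is an illustrative assumption, not part of the example above.

# For each reference region, match the closest experiment region,
# provided it is at least 1 kb away (assumed clause semantics).
dataset_join_md = dataset1.join(dataset2, [gl.MD(1), gl.DGE(1000)])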
As we have already said, no operation has actually been executed up to this point; so far we have only defined the sequence of operations to apply to the data. In order to trigger the execution, we have to call the materialize function on the variable we want to compute.
In [8]:
query_result = dataset_join.materialize()
In [10]:
query_result.regs.head()
Out[10]:
In [11]:
query_result.meta.head()
Out[11]:
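Since the result exposes the regions and the metadata as Pandas DataFrames, standard Pandas operations can be used for downstream analysis. A minimal sketch, assuming (as in PyGMQL's GDataframe) that both DataFrames are indexed by sample id:

# Count how many result regions belong to each sample.
regions_per_sample = query_result.regs.groupby(level=0).size()

# Attach the counts to the sample metadata for a quick overview.
summary = query_result.meta.join(regions_per_sample.rename("n_regions"))
print(summary.head())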