In this use case we are interested in getting missense variants from all the genes that are within two genomic regions of interest: 17:43045767-43046767
and 13:32317272-32318272
.
Before starting, we import all the required modules for this example:
In [1]:
from pycellbase.cbconfig import ConfigClient # Configuration client
from pycellbase.cbclient import CellBaseClient # CellBase client
PyCellBase configuration follows the next structure:
In [2]:
config = {
'rest': {'hosts': ['http://bioinfo.hpc.cam.ac.uk/cellbase']}, # List of RESTful host URLs
'species': 'hsapiens', # Name of the species
'version': 'v4' # API version
}
This custom configuration can be stored in a YAML file, JSON file or Python dictionary. Then, one of these files or dictionary can be passed to the ConfigClient
class, which is in charge of managing the configuration:
In [3]:
cc = ConfigClient(config)
If we need an example of the configuration structure, we can get the default one using the get_default_configuration
method:
In [4]:
cc.get_default_configuration()
Out[4]:
Once we have set up our PyCellBase configuration, we can create the CellBaseClient
, which is the central class of this package. We can pass a ConfigClient
with a customised configuration to CellBaseClient
:
In [5]:
cbc = CellBaseClient(cc)
We can check at any moment the configuration parameters used to make the calls to the database:
In [6]:
cbc.show_configuration()
Out[6]:
We can modify our CellBaseClient
configuration at any moment by modifying the ConfigClient
attributes:
In [7]:
cc.species = 'celegans'
cbc.show_configuration()
Out[7]:
In [8]:
cc.species = 'hsapiens'
cbc.show_configuration()
Out[8]:
Once we have initialised the main CellBaseClient
class, we are ready to query the database. First, we want to get all the genes that are within our regions of interest. To get information from genomic regions, we get the region-specific client:
In [9]:
rc = cbc.get_region_client()
If we do not know which method is the most adequate for our task, we can get helpful information for each data-specific client. In this case we are interested in gettting all the genes within a region, so we are going to use the get_gene
method:
In [10]:
rc.get_help()
Once we have our data-specific client, we can query the database with our query of interest and get the JSON response returned by CellBase.
In [11]:
regions_info = rc.get_gene('17:43045767-43046767,13:32317272-32318272', assembly='GRCh38')
The obtained response is a list of results for each query. In this case we have asked for information for two different regions so our response has two elements:
In [12]:
region1_result = regions_info[0]['result'] # 17:43045767-43046767
region2_result = regions_info[1]['result'] # 13:32317272-32318272
Now that we have the CellBase JSON output, it's just a question of navigating through it to retrieve the information of interest.
In [13]:
genes = []
for region in regions_info:
for gene in region['result']:
genes.append(gene['name'])
genes
Out[13]:
We have found two genes overlapping our regions of interest. Our next step is getting the variants within those genes. To get information from genes, we get the gene-specific client:
In [14]:
gc = cbc.get_gene_client()
In this example, as there are a lot of variants in these genes, we are going to limit the returned results to 5 variants per gene (limit=5), skipping the first 1100 variants (skip=1100):
In [15]:
genes_info = gc.get_snp(genes, assembly='GRCh38', limit=5, skip=1100)
As before, once we get the response, we just need to navigate through the output JSON to get the information of interest:
In [16]:
variants = []
for gene in genes_info:
for variant in gene['result']:
variants.append(':'.join([variant['chromosome'], str(variant['start']), variant['reference'], variant['alternate']]))
variants
Out[16]:
After getting the variants, the last step is selecting those ones whose consequence is missense (SO:0001583). To get information from variants, we get the variant-specific client:
In [17]:
vc = cbc.get_variant_client()
As before, we query the database and navigate through the output JSON to select our variants of interest.
In [18]:
# Querying the database
variants_info = vc.get_annotation(variants, assembly='GRCh38')
# Navigating the output
missense_variants = []
for index, variant in enumerate(variants_info):
# Getting variant consequences
for consequence in variant['result'][0]['consequenceTypes']:
# Filtering by Ensembl transcript ID
if 'ensemblTranscriptId' in consequence and consequence['ensemblTranscriptId'] in ['ENST00000357654', 'ENST00000544455']:
# Filtering by missense variant (SO:0001583)
if 'SO:0001583' in [so['accession'] for so in consequence['sequenceOntologyTerms']]:
missense_variants.append(variants[index])
missense_variants
Out[18]:
Finally, we have found two missense variants appearing in the genes of our region of interest.