In [1]:
from __future__ import print_function
import ga4gh.client
This notebook provides an overview of how to call a the GA4GH reference server from an iPython notebook. Before running this notebook:
Note: If you have trouble importing the ga4gh module, either symlink this notebook into the /server directory or add the path to the GA4GH development repo to your PYTHONPATH.
In [2]:
baseURL = "http://localhost:8000"
client = ga4gh.client.HttpClient(baseURL)
Great! Now we can run calls to query ReferenceSets, References, Datasets, VariantSets, CallSets, Variants, ReadGroups, ReadGroupSets, & Reads.
Search methods return generators. Here we will explictly create lists out of these objects to allow us to directly index into the list of instances that the query returned.
Instances have a toJsonDict method that returns a dictionary representation of the object. We'll make use of this to examine the the underlying data structures.
ReferenceSets are lists of References. Think of them like the genome version (ie. grch37, hg19).
In [3]:
referenceSets = list(client.searchReferenceSets())
print("ReferenceSets")
for referenceSet in referenceSets:
print("NCBI Taxon Id: {}".format(referenceSet.ncbiTaxonId))
In [4]:
referenceSet = client.getReferenceSet(referenceSets[0].id)
referenceSet.toJsonDict()
Out[4]:
The References endpoints let you access the DNA sequences that make up the reference genome. The example dataset is from subsets of the first 3 chromosomes:
In [5]:
references = list(client.searchReferences(referenceSet.id))
print("References")
for reference in references:
print("Name: {}, Length: {}".format(reference.name, reference.length))
In [6]:
reference = client.getReference(references[0].id)
reference.toJsonDict()
Out[6]:
In addition to fetching metadata about the reference, you can access the base sequence:
In [7]:
client.listReferenceBases(references[0].id, start=10000, end=10100)
Out[7]:
In [8]:
datasets = list(client.searchDatasets())
print("Datasets")
for dataset in datasets:
print("Name: {}".format(dataset.name))
In [9]:
dataset = client.getDataset(datasets[0].id)
dataset.toJsonDict()
Out[9]:
In [10]:
variantSets = list(client.searchVariantSets(dataset.id))
print("VariantSets")
for variantSet in variantSets:
print("Name: {}".format(variantSet.name))
In [11]:
variantSetId = variantSets[0].id
variantSet = client.getVariantSet(variantSetId)
variantSet.toJsonDict()
Out[11]:
In [12]:
callSets = list(client.searchCallSets(variantSetId))
print("CallSets")
for callSet in callSets:
print("Name: {}".format(callSet.name))
In [13]:
callSet = client.getCallSet(callSets[0].id)
callSet.toJsonDict()
Out[13]:
In [14]:
variants = list(client.searchVariants(variantSetId, start=100000, end=101000, referenceName = "1"))
print("Variants")
for variant in variants:
print("Reference Name: {}, Start: {}, Name: {}".format(variant.referenceName, variant.start, variant.names[0]))
In [15]:
variant = client.getVariant(variants[0].id)
variant.toJsonDict()
Out[15]:
In [16]:
readGroupSets = list(client.searchReadGroupSets(dataset.id))
print("ReadGroupSets")
for readGroup in readGroupSets:
print("Name: {}".format(readGroup.name))
In [17]:
readGroupSet = client.getReadGroupSet(readGroupSets[0].id)
readGroupSet.toJsonDict()
Out[17]:
In [18]:
readGroup = client.getReadGroup(readGroupSet.readGroups[0].id)
readGroup.toJsonDict()
Out[18]:
Each read alignment describes an alignment with additional information about the fragment and the read. A read alignment object is equivalent to a line in a SAM file.
Since there are often many reads, we won't iterate over all of them. Instead we can summerize the query and then output the dictionary representing the last one.
In [19]:
readGroupIds = [readGroup.id for readGroup in readGroupSet.readGroups]
readGroupDescriptions = [readGroup.description for readGroup in readGroupSet.readGroups]
reads = client.searchReads(readGroupIds, reference.id)
print("Read Alignments")
# The server paginates reads by default; here we iterate over the reads
# generate to make the appropriate requests to the server
count = 0
for read in reads:
count += 1
print("{} reads in Readgroups: {} on Reference: {}".format(
count, ', '.join(readGroupDescriptions), reference.name))
In [20]:
read.toJsonDict()
Out[20]: