GA4GH 1000 Genomes Reads Service Example

This example illustrates how to access alignment data made available using a GA4GH interface.

Initialize the client

In this step we create a client object, which is used to communicate with the server. It is initialized with the server's URL.


In [1]:
from ga4gh.client import client
c = client.HttpClient("http://1kgenomes.ga4gh.org")

In [2]:
# Obtain the dataset id (see the `1kg_metadata_service` example)
dataset = c.search_datasets().next()

# Obtain the reference set and a reference id (see the `1kg_reference_service` example)
reference_set = c.search_reference_sets().next()
reference = c.search_references(reference_set_id=reference_set.id).next()
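
The cell above takes the first result of each search with the Python 2 method .next(). As a minimal sketch of the same pattern, the built-in next() works on Python 2.6+ and Python 3 and accepts a default for the case where nothing is returned. Here fake_search_datasets and first_or_none are hypothetical stand-ins for illustration; they are not part of the ga4gh client.

```python
def fake_search_datasets():
    """Hypothetical stand-in for c.search_datasets(); yields plain dicts."""
    # The id is the dataset_id shown in the output of this example.
    yield {"id": "WyIxa2dlbm9tZXMiXQ"}


def first_or_none(iterator):
    """Return the first item of an iterator, or None if it is empty."""
    return next(iterator, None)


dataset = first_or_none(fake_search_datasets())
print(dataset["id"])

# An empty result no longer raises StopIteration.
empty = first_or_none(iter([]))
print(empty)
```

With the real client the call would presumably read next(c.search_datasets(), None), assuming the search methods return ordinary Python iterators as the .next() calls above suggest.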

Search read group sets

Read group sets are logical containers for read groups, similar to a BAM file.

We can obtain read group sets via a search_read_group_sets request. Observe that this request takes dataset_id as its main parameter. We obtained that id above with a search_datasets request, as in the 1kg_metadata_service example.


In [3]:
counter = 0
for read_group_set in c.search_read_group_sets(dataset_id=dataset.id):
    counter += 1
    if counter < 4:
        print "Read Group Set: {}".format(read_group_set.name)
        print "id: {}".format(read_group_set.id)
        print "dataset_id: {}".format(read_group_set.dataset_id)
        print "Aligned Read Count: {}".format(read_group_set.stats.aligned_read_count)
        print "Unaligned Read Count: {}\n".format(read_group_set.stats.unaligned_read_count)
        if read_group_set.name == "NA19675":
            rgSet = read_group_set
        for read_group in read_group_set.read_groups:
            print "  Read group:"
            print "  id: {}".format(read_group.id)
            print "  Name: {}".format(read_group.name)
            print "  Description: {}".format(read_group.description)
            print "  Biosample Id: {}\n".format(read_group.bio_sample_id)
    else: 
        break


Read Group Set: HG03270
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIl0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 177645990
Unaligned Read Count: 746202

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTgxMzI5Il0
  Name: ERR181329
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTg0MzI4Il0
  Name: ERR184328
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTg0MzM2Il0
  Name: ERR184336
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTg0MzQ0Il0
  Name: ERR184344
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

Read Group Set: HG03271
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIl0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 201280730
Unaligned Read Count: 944735

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTgxMzI4Il0
  Name: ERR181328
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTg0MzI3Il0
  Name: ERR184327
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTg0MzM1Il0
  Name: ERR184335
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTg0MzQzIl0
  Name: ERR184343
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

Read Group Set: NA19675
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1Il0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 251846416
Unaligned Read Count: 3935762

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM3Il0
  Name: SRR058937
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM4Il0
  Name: SRR058938
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM5Il0
  Name: SRR058939
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTY0Il0
  Name: SRR058964
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

Note: only a small subset of the fields is shown here. The data returned by the server is richer and contains other informational fields that may be of interest.
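
The loop above limits its output to the first three read group sets with a counter and a break. A minimal sketch of the same idea using itertools.islice follows. The generator fake_read_group_sets is a hypothetical stand-in for c.search_read_group_sets(dataset_id=dataset.id): its first three names are taken from the output above, and the fourth is a made-up placeholder.

```python
from itertools import islice


def fake_read_group_sets():
    """Hypothetical stand-in for the search request; yields plain dicts."""
    for name in ["HG03270", "HG03271", "NA19675", "PLACEHOLDER_4"]:
        yield {"name": name}


# islice stops after three items, so the fourth is never requested.
first_three = list(islice(fake_read_group_sets(), 3))
names = [rgs["name"] for rgs in first_three]
print(names)

# Pick out one set by name, or None if it is not among the first three.
selected = next((rgs for rgs in first_three if rgs["name"] == "NA19675"), None)
print(selected["name"])
```

One difference from the counter version is that islice never pulls the fourth element from the iterator, whereas the original loop fetches it before reaching the break.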

Get read group set

Similarly, we can obtain a single read group set by providing its identifier.


In [4]:
read_group_set = c.get_read_group_set(read_group_set_id=rgSet.id)
print "Read Group Set: {}".format(read_group_set.name)
print "id: {}".format(read_group_set.id)
print "dataset_id: {}".format(read_group_set.dataset_id)
print "Aligned Read Count: {}".format(read_group_set.stats.aligned_read_count)
print "Unaligned Read Count: {}\n".format(read_group_set.stats.unaligned_read_count)
for read_group in read_group_set.read_groups:
    print " Read Group: {}".format(read_group.name)
    print " id: {}".format(read_group.id)
    print " bio_sample_id: {}\n".format(read_group.bio_sample_id)


Read Group Set: NA19675
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1Il0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 251846416
Unaligned Read Count: 3935762

 Read Group: SRR058937
 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM3Il0
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

 Read Group: SRR058938
 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM4Il0
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

 Read Group: SRR058939
 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM5Il0
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

 Read Group: SRR058964
 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTY0Il0
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

Note: as in the previous example, only a few fields are shown for illustration. The data returned by the server is far richer; the output is trimmed here only to keep the presentation readable.
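
The identifiers in the output above look opaque, but the ones shown here happen to be URL-safe base64 strings with the trailing padding removed, each encoding a JSON list. The sketch below decodes two of them. This is only an observation about the ids printed in this example, not a documented contract of the service, so client code should keep treating ids as opaque strings. The helper decode_id is hypothetical and is not part of the ga4gh client.

```python
import base64
import json


def decode_id(opaque_id):
    """Decode an unpadded URL-safe base64 id into the JSON list it encodes."""
    # Restore the '=' padding that base64 decoding requires.
    padded = opaque_id + "=" * (-len(opaque_id) % 4)
    raw = base64.urlsafe_b64decode(padded.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# Ids copied from the output above.
read_group_set_parts = decode_id("WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1Il0")
dataset_parts = decode_id("WyIxa2dlbm9tZXMiXQ")
print(read_group_set_parts)
print(dataset_parts)
```

The read group set id decodes to the three strings 1kgenomes, rgs and NA19675, and the dataset id to the single string 1kgenomes, which shows how the nested ids in this output share a common prefix.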

Search reads

This request returns reads from the read groups of the read group set we obtained above. The reference ID provided corresponds to chromosome 1, as obtained in the 1kg_reference_service example. A search_reads request searches for read alignments in a region delimited by start and end coordinates.


In [5]:
for read_group in read_group_set.read_groups:
    print "Alignment from {}\n".format(read_group.name)
    alignment = c.search_reads(read_group_ids=[read_group.id], start=0, end=1000000, reference_id=reference.id).next()
    print " id: {}".format(alignment.id)
    print " fragment_name: {}".format(alignment.fragment_name)
    print " aligned_sequence: {}\n".format(alignment.aligned_sequence)


Alignment from SRR058937

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM3LjIyODQ4NjU0Il0
 fragment_name: SRR058937.22848654
 aligned_sequence: CGCTCTTCCGATCTCCCTAACCCTAACCCTAATCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

Alignment from SRR058938

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM4LjkzNTY4NiJd
 fragment_name: SRR058938.935686
 aligned_sequence: AACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTT

Alignment from SRR058939

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM5LjI1ODYxNTEzIl0
 fragment_name: SRR058939.25861513
 aligned_sequence: CTTAACCTTAACCTTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTA

Alignment from SRR058964

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTY0LjEwNzEwNDY5Il0
 fragment_name: SRR058964.10710469
 aligned_sequence: CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
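
To make the region query concrete, here is a minimal sketch of what "alignments in a region" can mean. It assumes 0-based coordinates with an inclusive start and an exclusive end, which should be checked against the service documentation. The function overlaps and the read positions are hypothetical; the server applies its own filtering and this code does not reproduce it.

```python
def overlaps(read_start, read_end, query_start, query_end):
    """Return True if two half-open intervals [start, end) share a base."""
    return read_start < query_end and query_start < read_end


# Query region used in the cell above: start=0, end=1000000.
QUERY_START, QUERY_END = 0, 1000000

# Hypothetical read positions, for illustration only.
inside = overlaps(10000, 10076, QUERY_START, QUERY_END)
straddling = overlaps(999999, 1000075, QUERY_START, QUERY_END)
just_outside = overlaps(1000000, 1000076, QUERY_START, QUERY_END)
print(inside)
print(straddling)
print(just_outside)
```

Under this assumption a read starting exactly at position 1000000 falls outside the query, because the end coordinate is exclusive.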

For documentation on the service and more information, see:

https://ga4gh-schemas.readthedocs.io/en/latest/schemas/read_service.proto.html