GA4GH 1000 Genomes Reads Protocol Example

This example illustrates how to access alignment data made available using a GA4GH interface.

Initialize the client

In this step we create a client object that will be used to communicate with the server. It is initialized with the server's URL.


In [1]:
import ga4gh_client.client as client
c = client.HttpClient("http://1kgenomes.ga4gh.org")

Search read group sets

A read group set is a logical container for read groups, similar to a BAM file.

We can obtain read group sets via a search_read_group_sets request. Note that this request takes dataset_id as its main parameter; that value was obtained with a search_datasets request in the 1kg_metadata_service example.


In [2]:
counter = 0
for read_group_set in c.search_read_group_sets(dataset_id="WyIxa2dlbm9tZXMiXQ"):
    counter += 1
    if counter < 4:
        print "Read Group Set: {}".format(read_group_set.name)
        print "id: {}".format(read_group_set.id)
        print "dataset_id: {}".format(read_group_set.dataset_id)
        print "Aligned Read Count: {}".format(read_group_set.stats.aligned_read_count)
        print "Unaligned Read Count: {}\n".format(read_group_set.stats.unaligned_read_count)
        for read_group in read_group_set.read_groups:
            print "  Read group:"
            print "  id: {}".format(read_group.id)
            print "  Name: {}".format(read_group.name)
            print "  Description: {}".format(read_group.description)
            print "  Biosample Id: {}\n".format(read_group.bio_sample_id)
    else:
        break


Read Group Set: HG03270
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIl0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 177645990
Unaligned Read Count: 746202

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTgxMzI5Il0
  Name: ERR181329
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTg0MzI4Il0
  Name: ERR184328
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTg0MzM2Il0
  Name: ERR184336
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcwIiwiRVJSMTg0MzQ0Il0
  Name: ERR184344
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MCJd

Read Group Set: HG03271
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIl0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 201280730
Unaligned Read Count: 944735

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTgxMzI4Il0
  Name: ERR181328
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTg0MzI3Il0
  Name: ERR184327
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTg0MzM1Il0
  Name: ERR184335
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJIRzAzMjcxIiwiRVJSMTg0MzQzIl0
  Name: ERR184343
  Description: SRP015238
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiSEcwMzI3MSJd

Read Group Set: NA19675
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1Il0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 251846416
Unaligned Read Count: 3935762

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM3Il0
  Name: SRR058937
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM4Il0
  Name: SRR058938
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTM5Il0
  Name: SRR058939
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

  Read group:
  id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc1IiwiU1JSMDU4OTY0Il0
  Name: SRR058964
  Description: SRP000803
  Biosample Id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3NSJd

Note: only a small subset of the fields is shown here. The data returned by the server is richer and contains other informational fields that may be of interest.
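The loop above stops after three read group sets by counting manually. Since the search result is consumed as an iterator, the same limit can be sketched with itertools.islice. In the snippet below, fake_search_read_group_sets is a hypothetical stand-in for the client call (it is not part of the ga4gh client) so that the example runs without a server, and first_n is an illustrative helper name.

```python
from __future__ import print_function
from itertools import islice


def fake_search_read_group_sets():
    """Hypothetical stand-in for c.search_read_group_sets(...).

    Yields read group set names lazily, the way an iterator-based
    search would.
    """
    for name in ["HG03270", "HG03271", "NA19675", "NA19678"]:
        yield name


def first_n(iterable, n):
    """Return at most the first n items of an iterable as a list."""
    return list(islice(iterable, n))


# Take only the first three results instead of counting by hand.
first_three = first_n(fake_search_read_group_sets(), 3)
for name in first_three:
    print("Read Group Set: {}".format(name))
```

With the real client, the stand-in generator would presumably be replaced by the search_read_group_sets call shown earlier.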

Get read group set

Similarly, we can obtain a specific Read Group Set by providing a specific identifier.


In [3]:
read_group_set = c.get_read_group_set(read_group_set_id="WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4Il0")
print "Read Group Set: {}".format(read_group_set.name)
print "id: {}".format(read_group_set.id)
print "dataset_id: {}".format(read_group_set.dataset_id)
print "Aligned Read Count: {}".format(read_group_set.stats.aligned_read_count)
print "Unaligned Read Count: {}\n".format(read_group_set.stats.unaligned_read_count)
for read_group in read_group_set.read_groups:
    print " Read Group: {}".format(read_group.name)
    print " bio_sample_id: {}\n".format(read_group.bio_sample_id)


Read Group Set: NA19678
id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4Il0
dataset_id: WyIxa2dlbm9tZXMiXQ
Aligned Read Count: 449711566
Unaligned Read Count: 5831622

 Read Group: SRR034578
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR034579
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR035488
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR038585
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR051575
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR424287
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR424288
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR424289
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR442018
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR442019
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

 Read Group: SRR442020
 bio_sample_id: WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd

Note: as in the previous example, only a few fields are printed for illustration. The data returned by the server is far richer; the output is trimmed here only to keep the presentation readable.
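The identifiers printed in these examples look like URL-safe base64 strings with the trailing "=" padding removed, each wrapping a small JSON list. That is an observation about this particular output, not a documented guarantee, and client code should keep treating IDs as opaque strings. Under that assumption, a minimal sketch for inspecting them while debugging could look like this (decode_id is a hypothetical helper, not part of the ga4gh client):

```python
from __future__ import print_function
import base64
import json


def decode_id(compound_id):
    """Decode an ID, assuming unpadded URL-safe base64 of a JSON list."""
    # Restore the '=' padding that appears to have been stripped.
    padding = "=" * (-len(compound_id) % 4)
    raw = base64.urlsafe_b64decode((compound_id + padding).encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# IDs copied from the output above: dataset, read group set, biosample.
print(decode_id("WyIxa2dlbm9tZXMiXQ"))
print(decode_id("WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4Il0"))
print(decode_id("WyIxa2dlbm9tZXMiLCJiIiwiTkExOTY3OCJd"))
```

Decoded this way, the dataset ID contains the single element 1kgenomes, while the read group set and biosample IDs carry the dataset name followed by a short type tag and the sample name NA19678.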

Search reads

This request returns reads from the read groups of the read group set we obtained above. The reference ID provided corresponds to chromosome 1, as obtained from the 1kg_reference_service examples. A search_reads request searches for read alignments in a region defined by start and end coordinates; here only the first alignment found for each read group is printed.


In [4]:
for read_group in read_group_set.read_groups:
    print "Alignment from {}\n".format(read_group.name)
    alignment = c.search_reads(read_group_ids=[read_group.id], start=0, end=1000000, reference_id="WyJOQ0JJMzciLCIxIl0").next()
    print " id: {}".format(alignment.id)
    print " fragment_name: {}".format(alignment.fragment_name)
    print " aligned_sequence: {}\n".format(alignment.aligned_sequence)


Alignment from SRR034578

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSMDM0NTc4LjE3MzYwMCJd
 fragment_name: SRR034578.173600
 aligned_sequence: ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC

Alignment from SRR034579

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSMDM0NTc5LjY2MzcyODkiXQ
 fragment_name: SRR034579.6637289
 aligned_sequence: CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT

Alignment from SRR035488

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSMDM1NDg4Ljc1NTQ1NDciXQ
 fragment_name: SRR035488.7554547
 aligned_sequence: ACCCTGACCCCGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCTCACCCTAACCCCTAAAC

Alignment from SRR038585

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSMDM4NTg1LjE1MzI5MDkiXQ
 fragment_name: SRR038585.1532909
 aligned_sequence: CCCTGACCCTGACCCTGACCCTGAACCCGAACCCGAACCCGAACCCCAACCCGAAGCGGAGCCCGAACCAGAACCC

Alignment from SRR051575

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSMDUxNTc1LjI0ODgxNDciXQ
 fragment_name: SRR051575.2488147
 aligned_sequence: CTCGTCATTCCTGCTGATCCGCTCTTCCGATCTGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG

Alignment from SRR424287

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSNDI0Mjg3Ljk1NDQ3MjgiXQ
 fragment_name: SRR424287.9544728
 aligned_sequence: TCCGATCTCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA

Alignment from SRR424288

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSNDI0Mjg4LjMwIl0
 fragment_name: SRR424288.30
 aligned_sequence: TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC

Alignment from SRR424289

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSNDI0Mjg5LjE5Il0
 fragment_name: SRR424289.19
 aligned_sequence: AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCTAACCCTAACCC

Alignment from SRR442018

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSNDQyMDE4LjMyIl0
 fragment_name: SRR442018.32
 aligned_sequence: AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC

Alignment from SRR442019

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSNDQyMDE5LjIwIl0
 fragment_name: SRR442019.20
 aligned_sequence: AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC

Alignment from SRR442020

 id: WyIxa2dlbm9tZXMiLCJyZ3MiLCJOQTE5Njc4IiwiU1JSNDQyMDIwLjY2Il0
 fragment_name: SRR442020.66
 aligned_sequence: CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
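
The cell above takes only the first alignment of each search by calling .next() on the returned iterator. On a region with no reads, that call would raise StopIteration. A minimal sketch of a more defensive pattern uses the built-in next() with a default value. Here fake_search_reads is a hypothetical stand-in for c.search_reads(...) so the snippet runs without a server, and first_or_none is an illustrative helper name.

```python
from __future__ import print_function


def fake_search_reads(fragment_names):
    """Hypothetical stand-in for c.search_reads(...): yields results lazily."""
    for fragment_name in fragment_names:
        yield fragment_name


def first_or_none(iterator):
    """Return the first item of an iterator, or None if it is empty."""
    return next(iterator, None)


# The first name comes from the output above; the second is a placeholder.
first = first_or_none(fake_search_reads(["SRR034578.173600", "placeholder-fragment"]))
# An empty result set yields None instead of raising StopIteration.
empty = first_or_none(fake_search_reads([]))

print(first)
print(empty is None)
```

The caller can then check for None and skip read groups that have no alignments in the requested region.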

For documentation on the service and more information, see:

https://ga4gh-schemas.readthedocs.io/en/latest/schemas/read_service.proto.html