In [1]:
from __future__ import print_function

import ga4gh.client

GA4GH IPython Example Notebook

This notebook provides an overview of how to call a the GA4GH reference server from an iPython notebook. Before running this notebook:

  1. git clone https://github.com/ga4gh/server.git -b develop
  2. Download the example data (if $scripts/download\_data.py$ doens't work, wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v3.2.tar and tar -xvf in the server/ directory
  3. Launch an instance of the reference server on localhost ("python server_dev.py")

Note: If you have trouble importing the ga4gh module, either symlink this notebook into the /server directory or add the path to the GA4GH development repo to your PYTHONPATH.

Connect to GA4GH Server


In [2]:
baseURL = "http://localhost:8000"
client = ga4gh.client.HttpClient(baseURL)

Great! Now we can run calls to query ReferenceSets, References, Datasets, VariantSets, CallSets, Variants, ReadGroups, ReadGroupSets, & Reads.

Search methods return generators. Here we will explictly create lists out of these objects to allow us to directly index into the list of instances that the query returned.

Instances have a toJsonDict method that returns a dictionary representation of the object. We'll make use of this to examine the the underlying data structures.

Search/Get ReferenceSets

ReferenceSets are lists of References. Think of them like the genome version (ie. grch37, hg19).


In [3]:
referenceSets = list(client.searchReferenceSets())
print("ReferenceSets")
for referenceSet in referenceSets:
    print("NCBI Taxon Id: {}".format(referenceSet.ncbiTaxonId))


ReferenceSets
NCBI Taxon Id: 9606

In [4]:
referenceSet = client.getReferenceSet(referenceSets[0].id)
referenceSet.toJsonDict()


Out[4]:
{u'assemblyId': u'TODO',
 u'description': u'TODO',
 u'id': u'R1JDaDM4LXN1YnNldA',
 u'isDerived': False,
 u'md5checksum': u'234ea63052f999c21c2bdb6f60e61038',
 u'name': u'GRCh38-subset',
 u'ncbiTaxonId': 9606,
 u'sourceAccessions': [],
 u'sourceURI': u'TODO'}

Search/Get References

The References endpoints let you access the DNA sequences that make up the reference genome. The example dataset is from subsets of the first 3 chromosomes:


In [5]:
references = list(client.searchReferences(referenceSet.id))
print("References")
for reference in references:
    print("Name: {}, Length: {}".format(reference.name, reference.length))


References
Name: 2, Length: 32403
Name: 1, Length: 138395
Name: 3, Length: 81617

In [6]:
reference = client.getReference(references[0].id)
reference.toJsonDict()


Out[6]:
{u'id': u'R1JDaDM4LXN1YnNldDoy',
 u'isDerived': False,
 u'length': 32403,
 u'md5checksum': u'f513b7c19964e17092b55df262f62990',
 u'name': u'2',
 u'ncbiTaxonId': 9606,
 u'sourceAccessions': [u'CM000664.2.subset'],
 u'sourceDivergence': None,
 u'sourceURI': u'http://www.ebi.ac.uk/ena/data/view/CM000664.2%26range=0-32403&display=fasta'}

In addition to fetching metadata about the reference, you can access the base sequence:


In [7]:
client.listReferenceBases(references[0].id, start=10000, end=10100)


Out[7]:
u'CGTATCCCACACACCACACCCACACACCACACCCACACACACCCACACCCACACCCACACACACCACACCCACACACCACACCCACACCCACACACCACA'

Search/Get Dataset

Datasets are a collection of related data. In this example, we'll examine a collection of reads and variants.


In [8]:
datasets = list(client.searchDatasets())
print("Datasets")
for dataset in datasets:
    print("Name: {}".format(dataset.name))


Datasets
Name: 1kg-p3-subset

In [9]:
dataset = client.getDataset(datasets[0].id)
dataset.toJsonDict()


Out[9]:
{u'description': None, u'id': u'MWtnLXAzLXN1YnNldA', u'name': u'1kg-p3-subset'}

Search/Get VariantSets

VariantSets are a collection of variants and calls that are intended to be analyzed together.


In [10]:
variantSets = list(client.searchVariantSets(dataset.id))
print("VariantSets")
for variantSet in variantSets:
    print("Name: {}".format(variantSet.name))


VariantSets
Name: mvncall

In [11]:
variantSetId = variantSets[0].id
variantSet = client.getVariantSet(variantSetId)
variantSet.toJsonDict()


Out[11]:
{u'datasetId': u'MWtnLXAzLXN1YnNldA',
 u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxs',
 u'metadata': [{u'description': u'',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOnZlcnNpb24',
   u'info': {},
   u'key': u'version',
   u'number': u'1',
   u'type': u'String',
   u'value': u'VCFv4.1'},
  {u'description': u'Confidence interval around END for imprecise variants',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQ0lFTkQ',
   u'info': {},
   u'key': u'INFO.CIEND',
   u'number': u'2',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Confidence interval around POS for imprecise variants',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQ0lQT1M',
   u'info': {},
   u'key': u'INFO.CIPOS',
   u'number': u'2',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Source call set.',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQ1M',
   u'info': {},
   u'key': u'INFO.CS',
   u'number': u'1',
   u'type': u'String',
   u'value': u''},
  {u'description': u'End coordinate of this variant',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uRU5E',
   u'info': {},
   u'key': u'INFO.END',
   u'number': u'1',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Imprecise structural variation',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uSU1QUkVDSVNF',
   u'info': {},
   u'key': u'INFO.IMPRECISE',
   u'number': u'0',
   u'type': u'Flag',
   u'value': u''},
  {u'description': u'Merged calls.',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uTUM',
   u'info': {},
   u'key': u'INFO.MC',
   u'number': u'.',
   u'type': u'String',
   u'value': u''},
  {u'description': u"Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this call, will be NULL NULL for START and END",
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uTUVJTkZP',
   u'info': {},
   u'key': u'INFO.MEINFO',
   u'number': u'4',
   u'type': u'String',
   u'value': u''},
  {u'description': u'Mitochondrial end coordinate of inserted sequence',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uTUVORA',
   u'info': {},
   u'key': u'INFO.MEND',
   u'number': u'1',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Estimated length of mitochondrial insert',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uTUxFTg',
   u'info': {},
   u'key': u'INFO.MLEN',
   u'number': u'1',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Mitochondrial start coordinate of inserted sequence',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uTVNUQVJU',
   u'info': {},
   u'key': u'INFO.MSTART',
   u'number': u'1',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'SV length. It is only calculated for structural variation MEIs. For other types of SVs; one may calculate the SV length by INFO:END-START+1, or by finding the difference between lengthes of REF and ALT alleles',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uU1ZMRU4',
   u'info': {},
   u'key': u'INFO.SVLEN',
   u'number': u'.',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Type of structural variant',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uU1ZUWVBF',
   u'info': {},
   u'key': u'INFO.SVTYPE',
   u'number': u'1',
   u'type': u'String',
   u'value': u''},
  {u'description': u'Precise Target Site Duplication for bases, if unknown, value will be NULL',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uVFNE',
   u'info': {},
   u'key': u'INFO.TSD',
   u'number': u'1',
   u'type': u'String',
   u'value': u''},
  {u'description': u'Total number of alternate alleles in called genotypes',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQUM',
   u'info': {},
   u'key': u'INFO.AC',
   u'number': u'A',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Estimated allele frequency in the range (0,1)',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQUY',
   u'info': {},
   u'key': u'INFO.AF',
   u'number': u'A',
   u'type': u'Float',
   u'value': u''},
  {u'description': u'Number of samples with data',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uTlM',
   u'info': {},
   u'key': u'INFO.NS',
   u'number': u'1',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Total number of alleles in called genotypes',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQU4',
   u'info': {},
   u'key': u'INFO.AN',
   u'number': u'1',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uRUFTX0FG',
   u'info': {},
   u'key': u'INFO.EAS_AF',
   u'number': u'A',
   u'type': u'Float',
   u'value': u''},
  {u'description': u'Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uRVVSX0FG',
   u'info': {},
   u'key': u'INFO.EUR_AF',
   u'number': u'A',
   u'type': u'Float',
   u'value': u''},
  {u'description': u'Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQUZSX0FG',
   u'info': {},
   u'key': u'INFO.AFR_AF',
   u'number': u'A',
   u'type': u'Float',
   u'value': u''},
  {u'description': u'Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQU1SX0FG',
   u'info': {},
   u'key': u'INFO.AMR_AF',
   u'number': u'A',
   u'type': u'Float',
   u'value': u''},
  {u'description': u'Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uU0FTX0FG',
   u'info': {},
   u'key': u'INFO.SAS_AF',
   u'number': u'A',
   u'type': u'Float',
   u'value': u''},
  {u'description': u'Total read depth; only low coverage data were counted towards the DP, exome data were not used',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uRFA',
   u'info': {},
   u'key': u'INFO.DP',
   u'number': u'1',
   u'type': u'Integer',
   u'value': u''},
  {u'description': u'Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF, ALT and IndelType are only defined for indels)',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uQUE',
   u'info': {},
   u'key': u'INFO.AA',
   u'number': u'1',
   u'type': u'String',
   u'value': u''},
  {u'description': u'indicates what type of variant the line represents',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uVlQ',
   u'info': {},
   u'key': u'INFO.VT',
   u'number': u'.',
   u'type': u'String',
   u'value': u''},
  {u'description': u'indicates whether a variant is within the exon pull down target boundaries',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uRVhfVEFSR0VU',
   u'info': {},
   u'key': u'INFO.EX_TARGET',
   u'number': u'0',
   u'type': u'Flag',
   u'value': u''},
  {u'description': u'indicates whether a site is multi-allelic',
   u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOm1ldGFkYXRhOklORk8uTVVMVElfQUxMRUxJQw',
   u'info': {},
   u'key': u'INFO.MULTI_ALLELIC',
   u'number': u'0',
   u'type': u'Flag',
   u'value': u''}],
 u'name': u'mvncall',
 u'referenceSetId': u''}

Search/Get Callset

Callsets are a collection of genotype calls. Callsets apply to the samples within a dataset.


In [12]:
callSets = list(client.searchCallSets(variantSetId))
print("CallSets")
for callSet in callSets:
    print("Name: {}".format(callSet.name))


CallSets
Name: HG00096
Name: HG00533
Name: HG00534

In [13]:
callSet = client.getCallSet(callSets[0].id)
callSet.toJsonDict()


Out[13]:
{u'created': 1453157520705,
 u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOkhHMDAwOTY',
 u'info': {},
 u'name': u'HG00096',
 u'sampleId': u'HG00096',
 u'updated': 1453157520705,
 u'variantSetIds': [u'MWtnLXAzLXN1YnNldDptdm5jYWxs']}

Search/Get Variants

A Variant represents a change in DNA sequence relative to some reference. For example, a variant could represent a SNP or an insertion. You can think of them as a row in a VCF. Variants can be supported by many calls.


In [14]:
variants = list(client.searchVariants(variantSetId, start=100000, end=101000, referenceName = "1"))
print("Variants")
for variant in variants:
    print("Reference Name: {}, Start: {}, Name: {}".format(variant.referenceName, variant.start, variant.names[0]))


Variants
Reference Name: 1, Start: 100648, Name: rs199608293
Reference Name: 1, Start: 100675, Name: rs188226172
Reference Name: 1, Start: 100714, Name: rs561694274
Reference Name: 1, Start: 100857, Name: rs368741663

In [15]:
variant = client.getVariant(variants[0].id)
variant.toJsonDict()


Out[15]:
{u'alternateBases': [u'C'],
 u'calls': [{u'callSetId': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOkhHMDAwOTY',
   u'callSetName': u'HG00096',
   u'genotype': [0, 0],
   u'genotypeLikelihood': [],
   u'info': {},
   u'phaseset': u'*'},
  {u'callSetId': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOkhHMDA1MzM',
   u'callSetName': u'HG00533',
   u'genotype': [0, 0],
   u'genotypeLikelihood': [],
   u'info': {},
   u'phaseset': u'*'},
  {u'callSetId': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOkhHMDA1MzQ',
   u'callSetName': u'HG00534',
   u'genotype': [0, 0],
   u'genotypeLikelihood': [],
   u'info': {},
   u'phaseset': u'*'}],
 u'created': 1453157520705,
 u'end': 100649,
 u'id': u'MWtnLXAzLXN1YnNldDptdm5jYWxsOjE6MTAwNjQ4OjNjNjFiODA3MDhlNDIzYTVhMGFiNjlmNGYwZjA2N2Yx',
 u'info': {u'AA': [u'.|||'],
  u'AC': [u'0'],
  u'AF': [u'0.000599041988607'],
  u'AFR_AF': [u'0.0'],
  u'AMR_AF': [u'0.0'],
  u'AN': [u'6'],
  u'DP': [u'16885'],
  u'EAS_AF': [u'0.00300000002608'],
  u'EUR_AF': [u'0.0'],
  u'NS': [u'2504'],
  u'SAS_AF': [u'0.0'],
  u'VT': [u'SNP']},
 u'names': [u'rs199608293'],
 u'referenceBases': u'G',
 u'referenceName': u'1',
 u'start': 100648,
 u'updated': 1453157520705,
 u'variantSetId': u'MWtnLXAzLXN1YnNldDptdm5jYWxs'}

Search/Get Readgroupsets

A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet represents all the reads from one experimental sample.


In [16]:
readGroupSets = list(client.searchReadGroupSets(dataset.id))
print("ReadGroupSets")
for readGroup in readGroupSets:
    print("Name: {}".format(readGroup.name))


ReadGroupSets
Name: HG00533
Name: HG00534

In [17]:
readGroupSet = client.getReadGroupSet(readGroupSets[0].id)
readGroupSet.toJsonDict()


Out[17]:
{u'datasetId': u'MWtnLXAzLXN1YnNldA',
 u'id': u'MWtnLXAzLXN1YnNldDpIRzAwNTMz',
 u'name': u'HG00533',
 u'readGroups': [{u'created': 1456755492901,
   u'datasetId': u'MWtnLXAzLXN1YnNldA',
   u'description': u'SRP001293',
   u'experiment': {u'description': u'SRP001293',
    u'id': u'MWtnLXAzLXN1YnNldDpIRzAwNTMzOkVSUjAyMDIzNzpleHBlcmltZW50',
    u'info': {},
    u'instrumentDataFile': None,
    u'instrumentModel': u'ILLUMINA',
    u'library': u'HUMwfmREJDIAAPEI-3',
    u'libraryLayout': None,
    u'molecule': None,
    u'name': None,
    u'platformUnit': None,
    u'recordCreateTime': u'2016-02-29T14:18:12Z',
    u'recordUpdateTime': u'2016-02-29T14:18:12Z',
    u'runTime': None,
    u'selection': None,
    u'sequencingCenter': u'BGI',
    u'strategy': None},
   u'id': u'MWtnLXAzLXN1YnNldDpIRzAwNTMzOkVSUjAyMDIzNw',
   u'info': {},
   u'name': u'ERR020237',
   u'predictedInsertSize': 473,
   u'programs': [{u'commandLine': u'bwa index -a bwtsw $reference_fasta',
     u'id': u'bwa_index',
     u'name': u'bwa',
     u'prevProgramId': None,
     u'version': u'0.5.9-r16'},
    {u'commandLine': u'bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file\tPP:bwa_index',
     u'id': u'bwa_aln_fastq',
     u'name': u'bwa',
     u'prevProgramId': None,
     u'version': u'0.5.9-r16'},
    {u'commandLine': u'bwa sampe -a 1419 -r $rg_line -f $sam_file $reference_fasta $sai_file(s) $fastq_file(s)\tPP:bwa_aln_fastq',
     u'id': u'bwa_sam',
     u'name': u'bwa',
     u'prevProgramId': None,
     u'version': u'0.5.9-r16'},
    {u'commandLine': u'samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file\tPP:bwa_sam',
     u'id': u'sam_to_fixed_bam',
     u'name': u'samtools',
     u'prevProgramId': None,
     u'version': u'0.1.17 (r973:277)'},
    {u'commandLine': u'java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -o $intervals_file -known $known_indels_file(s) \tPP:sam_to_fixed_bam',
     u'id': u'gatk_target_interval_creator',
     u'name': u'GenomeAnalysisTK',
     u'prevProgramId': None,
     u'version': u'1.2-29-g0acaf2d'},
    {u'commandLine': u'java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -known $known_indels_file(s) -LOD 0.4 -model KNOWNS_ONLY -compress 0 --disable_bam_indexing\tPP:gatk_target_interval_creator',
     u'id': u'bam_realignment_around_known_indels',
     u'name': u'GenomeAnalysisTK',
     u'prevProgramId': None,
     u'version': u'1.2-29-g0acaf2d'},
    {u'commandLine': u"java $jvm_args -jar GenomeAnalysisTK.jar -T CountCovariates -R $reference_fasta -I $bam_file -recalFile $bam_file.recal_data.csv -knownSites $known_sites_file(s) -l INFO -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate\tPP:bam_realignment_around_known_indels",
     u'id': u'bam_count_covariates',
     u'name': u'GenomeAnalysisTK',
     u'prevProgramId': None,
     u'version': u'1.2-29-g0acaf2d'},
    {u'commandLine': u'java $jvm_args -jar GenomeAnalysisTK.jar -T TableRecalibration -R $reference_fasta -recalFile $bam_file.recal_data.csv -I $bam_file -o $recalibrated_bam_file -l INFO -compress 0 --disable_bam_indexing\tPP:bam_count_covariates',
     u'id': u'bam_recalibrate_quality_scores',
     u'name': u'GenomeAnalysisTK',
     u'prevProgramId': None,
     u'version': u'1.2-29-g0acaf2d'},
    {u'commandLine': u'samtools calmd -Erb $bam_file $reference_fasta > $bq_bam_file\tPP:bam_recalibrate_quality_scores',
     u'id': u'bam_calculate_bq',
     u'name': u'samtools',
     u'prevProgramId': None,
     u'version': u'0.1.17 (r973:277)'},
    {u'commandLine': u'java $jvm_args -jar MergeSamFiles.jar INPUT=$bam_file(s) OUTPUT=$merged_bam VALIDATION_STRINGENCY=SILENT\tPP:bam_calculate_bq',
     u'id': u'bam_merge',
     u'name': u'picard',
     u'prevProgramId': None,
     u'version': u'1.53'},
    {u'commandLine': u'java $jvm_args -jar MarkDuplicates.jar INPUT=$bam_file OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT\tPP:bam_merge',
     u'id': u'bam_mark_duplicates',
     u'name': u'picard',
     u'prevProgramId': None,
     u'version': u'1.53'},
    {u'commandLine': u'java $jvm_args -jar MergeSamFiles.jar INPUT=$bam_file(s) OUTPUT=$merged_bam VALIDATION_STRINGENCY=SILENT\tPP:bam_mark_duplicates',
     u'id': u'bam_merge.1',
     u'name': u'picard',
     u'prevProgramId': None,
     u'version': u'1.53'}],
   u'referenceSetId': u'R1JDaDM4LXN1YnNldA',
   u'sampleId': u'HG00533',
   u'stats': {u'alignedReadCount': -1,
    u'baseCount': None,
    u'unalignedReadCount': -1},
   u'updated': 1456755492901}],
 u'stats': {u'alignedReadCount': 2939,
  u'baseCount': None,
  u'unalignedReadCount': 30}}

Get ReadGroups

A ReadGroup is a set of reads derived from one physical sequencing process.


In [18]:
readGroup = client.getReadGroup(readGroupSet.readGroups[0].id)
readGroup.toJsonDict()


Out[18]:
{u'created': 1456755492901,
 u'datasetId': u'MWtnLXAzLXN1YnNldA',
 u'description': u'SRP001293',
 u'experiment': {u'description': u'SRP001293',
  u'id': u'MWtnLXAzLXN1YnNldDpIRzAwNTMzOkVSUjAyMDIzNzpleHBlcmltZW50',
  u'info': {},
  u'instrumentDataFile': None,
  u'instrumentModel': u'ILLUMINA',
  u'library': u'HUMwfmREJDIAAPEI-3',
  u'libraryLayout': None,
  u'molecule': None,
  u'name': None,
  u'platformUnit': None,
  u'recordCreateTime': u'2016-02-29T14:18:12Z',
  u'recordUpdateTime': u'2016-02-29T14:18:12Z',
  u'runTime': None,
  u'selection': None,
  u'sequencingCenter': u'BGI',
  u'strategy': None},
 u'id': u'MWtnLXAzLXN1YnNldDpIRzAwNTMzOkVSUjAyMDIzNw',
 u'info': {},
 u'name': u'ERR020237',
 u'predictedInsertSize': 473,
 u'programs': [{u'commandLine': u'bwa index -a bwtsw $reference_fasta',
   u'id': u'bwa_index',
   u'name': u'bwa',
   u'prevProgramId': None,
   u'version': u'0.5.9-r16'},
  {u'commandLine': u'bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file\tPP:bwa_index',
   u'id': u'bwa_aln_fastq',
   u'name': u'bwa',
   u'prevProgramId': None,
   u'version': u'0.5.9-r16'},
  {u'commandLine': u'bwa sampe -a 1419 -r $rg_line -f $sam_file $reference_fasta $sai_file(s) $fastq_file(s)\tPP:bwa_aln_fastq',
   u'id': u'bwa_sam',
   u'name': u'bwa',
   u'prevProgramId': None,
   u'version': u'0.5.9-r16'},
  {u'commandLine': u'samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file\tPP:bwa_sam',
   u'id': u'sam_to_fixed_bam',
   u'name': u'samtools',
   u'prevProgramId': None,
   u'version': u'0.1.17 (r973:277)'},
  {u'commandLine': u'java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -o $intervals_file -known $known_indels_file(s) \tPP:sam_to_fixed_bam',
   u'id': u'gatk_target_interval_creator',
   u'name': u'GenomeAnalysisTK',
   u'prevProgramId': None,
   u'version': u'1.2-29-g0acaf2d'},
  {u'commandLine': u'java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -known $known_indels_file(s) -LOD 0.4 -model KNOWNS_ONLY -compress 0 --disable_bam_indexing\tPP:gatk_target_interval_creator',
   u'id': u'bam_realignment_around_known_indels',
   u'name': u'GenomeAnalysisTK',
   u'prevProgramId': None,
   u'version': u'1.2-29-g0acaf2d'},
  {u'commandLine': u"java $jvm_args -jar GenomeAnalysisTK.jar -T CountCovariates -R $reference_fasta -I $bam_file -recalFile $bam_file.recal_data.csv -knownSites $known_sites_file(s) -l INFO -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate\tPP:bam_realignment_around_known_indels",
   u'id': u'bam_count_covariates',
   u'name': u'GenomeAnalysisTK',
   u'prevProgramId': None,
   u'version': u'1.2-29-g0acaf2d'},
  {u'commandLine': u'java $jvm_args -jar GenomeAnalysisTK.jar -T TableRecalibration -R $reference_fasta -recalFile $bam_file.recal_data.csv -I $bam_file -o $recalibrated_bam_file -l INFO -compress 0 --disable_bam_indexing\tPP:bam_count_covariates',
   u'id': u'bam_recalibrate_quality_scores',
   u'name': u'GenomeAnalysisTK',
   u'prevProgramId': None,
   u'version': u'1.2-29-g0acaf2d'},
  {u'commandLine': u'samtools calmd -Erb $bam_file $reference_fasta > $bq_bam_file\tPP:bam_recalibrate_quality_scores',
   u'id': u'bam_calculate_bq',
   u'name': u'samtools',
   u'prevProgramId': None,
   u'version': u'0.1.17 (r973:277)'},
  {u'commandLine': u'java $jvm_args -jar MergeSamFiles.jar INPUT=$bam_file(s) OUTPUT=$merged_bam VALIDATION_STRINGENCY=SILENT\tPP:bam_calculate_bq',
   u'id': u'bam_merge',
   u'name': u'picard',
   u'prevProgramId': None,
   u'version': u'1.53'},
  {u'commandLine': u'java $jvm_args -jar MarkDuplicates.jar INPUT=$bam_file OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT\tPP:bam_merge',
   u'id': u'bam_mark_duplicates',
   u'name': u'picard',
   u'prevProgramId': None,
   u'version': u'1.53'},
  {u'commandLine': u'java $jvm_args -jar MergeSamFiles.jar INPUT=$bam_file(s) OUTPUT=$merged_bam VALIDATION_STRINGENCY=SILENT\tPP:bam_mark_duplicates',
   u'id': u'bam_merge.1',
   u'name': u'picard',
   u'prevProgramId': None,
   u'version': u'1.53'}],
 u'referenceSetId': u'R1JDaDM4LXN1YnNldA',
 u'sampleId': u'HG00533',
 u'stats': {u'alignedReadCount': -1,
  u'baseCount': None,
  u'unalignedReadCount': -1},
 u'updated': 1456755492901}

Get Reads

Each read alignment describes an alignment with additional information about the fragment and the read. A read alignment object is equivalent to a line in a SAM file.

Since there are often many reads, we won't iterate over all of them. Instead we can summerize the query and then output the dictionary representing the last one.


In [19]:
readGroupIds = [readGroup.id for readGroup in readGroupSet.readGroups]
readGroupDescriptions = [readGroup.description for readGroup in readGroupSet.readGroups]
reads = client.searchReads(readGroupIds, reference.id)
print("Read Alignments")
# The server paginates reads by default; here we iterate over the reads 
# generate to make the appropriate requests to the server
count = 0
for read in reads:
    count += 1
print("{} reads in Readgroups: {} on Reference: {}".format(
      count, ', '.join(readGroupDescriptions), reference.name))


Read Alignments
984 reads in Readgroups: SRP001293 on Reference: 2

In [20]:
read.toJsonDict()


Out[20]:
{u'alignedQuality': [31,
  31,
  33,
  27,
  36,
  36,
  38,
  37,
  37,
  35,
  35,
  37,
  37,
  30,
  35,
  39,
  39,
  40,
  39,
  39,
  41,
  37,
  41,
  40,
  40,
  40,
  40,
  36,
  39,
  33,
  40,
  39,
  39,
  38,
  29,
  42,
  38,
  37,
  36,
  29,
  31,
  32,
  34,
  30,
  34,
  34,
  39,
  34,
  36,
  39,
  16,
  35,
  35,
  33,
  32,
  39,
  35,
  35,
  29,
  27,
  41,
  37,
  37,
  40,
  39,
  41,
  31,
  31,
  38,
  30,
  38,
  31,
  38,
  34,
  31,
  23,
  16,
  15,
  27,
  25,
  36,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2],
 u'alignedSequence': u'GGTTCCAAGGTGGGTGCATAACTGAGGCTCATGCCAGGTGCCACACAGCACTCCCTTGGCCCCCTAATGGACTTCGGGACAGGCAAGGGCC',
 u'alignment': {u'cigar': [{u'operation': u'ALIGNMENT_MATCH',
    u'operationLength': 82,
    u'referenceSequence': None},
   {u'operation': u'CLIP_SOFT',
    u'operationLength': 9,
    u'referenceSequence': None}],
  u'mappingQuality': 60,
  u'position': {u'position': 27497,
   u'referenceName': u'2',
   u'strand': u'POS_STRAND'}},
 u'duplicateFragment': False,
 u'failedVendorQualityChecks': False,
 u'fragmentLength': 470,
 u'fragmentName': u'ERR020237.18651656',
 u'id': u'MWtnLXAzLXN1YnNldDpIRzAwNTMzOkVSUjAyMDIzNzpFUlIwMjAyMzcuMTg2NTE2NTY',
 u'info': {u'AM': [u'37'],
  u'BQ': [u'@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'],
  u'MD': [u'82'],
  u'MQ': [u'60'],
  u'NM': [u'0'],
  u'RG': [u'ERR020237'],
  u'SM': [u'37'],
  u'X0': [u'1'],
  u'X1': [u'0'],
  u'XC': [u'82'],
  u'XT': [u'U']},
 u'nextMatePosition': {u'position': 27877,
  u'referenceName': u'2',
  u'strand': u'NEG_STRAND'},
 u'numberReads': 2,
 u'properPlacement': True,
 u'readGroupId': u'MWtnLXAzLXN1YnNldDpIRzAwNTMzOkVSUjAyMDIzNw',
 u'readNumber': 0,
 u'secondaryAlignment': False,
 u'supplementaryAlignment': False}