Query for pile-up allignments at region "X"

We can query the API services to obtain reads from a given readgroupset such that we are able to make a pileup for any specified region

NOTE: Under the "Kernel" tab above, do "Restart & Run All" then uncomment the first cell and run it individually



In [14]:

    
Widget()









    



Position: 120394, ReadGrpSet: NA19102, Chrom: 1
NEXT PLOT VARS FUNCTION W/ PARAMS 120394, NA19102, 1

Description:

Note that the understanding of 3 terms allows for a complete/useful way to use this notebook. Now, the terminology is adapted to my understanding and therefore the expressions that I present might be incomplete or not strictly as defined by science.

First, our input takes the "position" argument which it is a unique position in the genome. It is necessary to specify which position wants to be observed because when we query the server for reads, it takes a starting and an ending position to return the set of data that spans our region of interest.
Second, the "Read Groupset Names" are specific subjects which had their genome sequenced on the 1k genomes project. The data rests on the server in the form of read sets, which will be defined later on.
Third, the “Reference Set Name” is a group of contiguous defined regions in the genome, which I refer to as chromosomes, but according to the 1k genomes website, there is more than just the 23-regular chromosomal expressions. Therefore I can only assume that other regions or references have been defined.

A set of reads is the data provided by the sequencer in the form of contiguous alleles. It is natural to observe multiple reads which overlap in a particular region, as well as reads which cover the same area. But that only adds on to the certainty of the statistics which determine the allele occurrence in a given position, that is, the purpose of Pileup. Also, variants are a set of alleles that differ from the reference bases and they are known as SNPs (Single Nucleotide Polymorphisms).

Pileup is a set of functions which do inner-mutual calls to obtain the fields previously defined. After the specific set of reads that span the region of interest have been obtained, we proceed to dissect the specific position of interest. We stack them in a counter dictionary which is then passed to a function that does the frequency calculation and finishes by returning the alleles observed as well as their individualized frequency. When the functions detect that the highest frequency allele differs from the reference bases, there is a call to the variant set to obtain the name, the position at which it starts, the alternate bases, and the genotype. Finally, it is plotted in a pie chart with the proper distribution of frequency.

Initialize the client

As seen in the "1kg.ipynb" example, we take the following steps to create the client object that will be used to obtain the information we desire and query the serever



In [2]:

    
from ga4gh.client import protocol
from ga4gh.client import client
c = client.HttpClient("http://1kgenomes.ga4gh.org")

Make reference to the data from the server

We query the server for the dataset, which is the 1k-genomes dataset. We access the bases of reference, followed by listingthe reference sets.



In [3]:

    
dataset = c.search_datasets().next()
reference_set = c.search_reference_sets().next()
references = [r for r in c.search_references(reference_set_id= reference_set.id)]

ReferenceSet Name (chromosome) & ReadGroupSet Reads

We define our contiguous sequence with a chromosome reference, and then make a reference array for our read group sets of read groups.



In [4]:

    
contig ={}
for i in references:
    contig[i.name] = str(i.id)

Functions to obtain ReadGroupSet ID by name.

We can obtain a set of reads for a given Read-Group. The set of reads is returned in the 'rgs' variable below.



In [5]:

    
def get_reads_for_name(Name):
    Name = str(Name)
    if type(get_read_groups_by_read_group_set_name(Name)) == str:
        return get_read_groups_by_read_group_set_name(Name)
    else:
        return [i for i in get_read_groups_by_read_group_set_name(Name)]
    
def read_group_set_by_name(name):
    result = None
    for rgs in c.search_read_group_sets(name=name, dataset_id= dataset.id):
        return rgs
    return result
## [name=name, dataset_id= dataset.id]
def get_read_groups_by_read_group_set_name(read_group_set_name):
    if None == read_group_set_by_name(read_group_set_name):
        return "Sorry, bad request for {}".format(read_group_set_name)
    else:
        return read_group_set_by_name(read_group_set_name).read_groups

Function to call multiple ReferenceSets.

Because some calls such as Variants, Reference Bases, and Reads require this field to return the region that wants to be analyzed. Also note, that it is a required input of this service.



In [6]:

    
def chrfunct(chromo):
    chr1 = filter(lambda x: x.name == str(chromo), references)[0]
    return chr1

Cigar-Unit interpreter function.

This function can be expanded in the sense that, INDELS are detected in this function. With more specifications this Pile-Up program with this function can be extended to also detect such variants. Also note that only 4 cigar operations are specified, because they were the only operations specified in the reads.



In [7]:

    
def cigar_interpreter(sequence, observe, ReferBase):
#     print "Sequence Val: {}".format(sequence)
#     print "Observe Val: {}".format(observe)
#     print "RefereBase Val: {}".format(ReferBase)
    Temp = 0
    BaseCounter = 0
    Variant = ""
    AligSeq = sequence.aligned_sequence
    InterpArr = list([])
    Iter = 0
    type(sequence) 
    for i in sequence.alignment.cigar:
        Length = i.operation_length
        if protocol.CigarUnit.Operation.Name(i.operation) == "ALIGNMENT_MATCH":
            InterpArr[len(InterpArr):len(InterpArr)+Length] = AligSeq[Temp:Temp+Length]
            Temp += Length 
            BaseCounter += Length
            
        elif protocol.CigarUnit.Operation.Name(i.operation) == "CLIP_SOFT":
            Temp += Length
            
     
        elif protocol.CigarUnit.Operation.Name(i.operation) == "DELETE":
            int_iter = 0
            for i in range(Length):
                InterpArr[len(InterpArr) : len(InterpArr)+1] = "N"
                BaseCounter += 1
                int_iter += 1
                if BaseCounter == observe:
                    Variant = ReferBase[BaseCounter:BaseCounter+int_iter]
                    return Variant
                
        elif protocol.CigarUnit.Operation.Name(i.operation) == "INSERT":
            for i in range(Length):
                InterpArr[len(InterpArr):len(InterpArr)+1] = AligSeq[Temp : Temp+1]
                Temp += 1
                if (Temp == observe) and (len(InterpArr) >= Temp+Length+1):
                    Variant = "".join(InterpArr[Temp:Temp+Length+1])
                    return Variant
            
        Iter += 1
    if (Temp >= observe) and (len(sequence.alignment.cigar) == Iter) :
            return InterpArr[observe]
    else: 
        return "N"

Variant Call Function

If the pile-up detects that the dominant allele frequency, defers from the reference bases, this function will be call and query the server for that variant.



In [8]:

    
list_of_callset_ids =[]
def find_variants(Start, End, RdGrpSetName, ChromoSm):
    for variant_set in c.search_variant_sets(datasetId=dataset.id):
        if variant_set.name == "phase3-release":
            release = variant_set
            print variant_set
    
    for callSet in c.search_call_sets(variant_set_id= release.id, name= str(RdGrpSetName)):
        mycallset = callSet
        list_of_callset_ids.append(callSet.id)
 
    for variant in c.search_variants(release.id, reference_name=ChromoSm, start=Start, end=End, call_set_ids=list_of_callset_ids):
        print variant
        if len(variant.alternate_bases[0]) == 1 and len(variant.reference_bases) == 1:
            print "\nA VARIANT WAS FOUND"
            print "Variant Name: {}, Start: {}, End: {} \nAlternate Bases: {} \nGenotypes: {}".format(str(variant.names[0]), str(variant.start), str(variant.end), str(variant.alternate_bases[0]), str(variant.calls[0].genotype))
            return 
    return False

Pile up function

This function calculates the pile up's for a given region, that is the position being observed. It takes as input the chromosome reference and the Read-Groups to obtain the needed aligned sequence.



In [9]:

    
def pileUp(contig, position, rgset, Chromosm):
    alleles = []
    rgset = get_reads_for_name(rgset)
    if type(rgset) != str:
        for i in rgset:
            for sequence in c.search_reads(read_group_ids=[i.id],start = position, end = position+1, reference_id=contig):
                if sequence.alignment != None:
                    start = sequence.alignment.position.position
                    observe = position - sequence.alignment.position.position
                    end = start+len(sequence.aligned_sequence)
                    
                    if observe > 100 or observe < 0:
                        continue
                    
                    if len(sequence.alignment.cigar) > 1:
                        allele = cigar_interpreter(sequence, observe,c.list_reference_bases(chrfunct(Chromosm).id, start=start, end= end))      
                    else:
                        allele = sequence.aligned_sequence[observe]
                        
                    alleles.append({"allele": str(allele), "readGroupId":i.id})
        return Calc_Freq(alleles)
    
    else:
        return rgset

Function to calculate occurrence frequency

The frequency is obtained from the occurrence of alleles in the observed position for all the reads which are mapped in that region. This function returns an array of occurrence alleles as well as their individualized frequency compared to all the reads detected.



In [10]:

    
def Calc_Freq(Test):
    tot = len(Test)
    AutCalc = {}
    Arr = []
    for i in range(tot):
        if AutCalc.has_key(Test[i]["allele"]) == False and (Test[i]['allele'] != "N"):
            AutCalc.setdefault(Test[i]["allele"], 1)
            Arr.append(Test[i]['allele'])
        else:
            if Test[i]['allele'] == "N":
                tot -= 1
            else:
                AutCalc[Test[i]["allele"]] = float(AutCalc.get(Test[i]["allele"]) + 1)
    Freq = {}
    print "\n{} Reads where used, to determine pile-up".format(tot) 
    tot = float(tot)
    for i in Arr:
        Freq.setdefault(i,float(AutCalc.get(i)/tot))
    return Freq

Precursor function

This function prepares the Read-Group set and does the inner calls, it also calls and obtains the reference bases. Note that only if the calls are correct will the function continue to make the calculations and inner calls.



In [11]:

    
def Variant_Comp(Position, ReadGroupSetName, Chromosm):
    RdGrp = get_reads_for_name(ReadGroupSetName)
    Chrm = contig.get(Chromosm, None) 
    if (Chrm != None) and type(RdGrp) != (str) :
        base = c.list_reference_bases(Chrm, start = Position, end = Position+1)
        var = pileUp(Chrm, Position, ReadGroupSetName, Chromosm)
        return (str(base), var)
    else:
        if RdGrp == None:
            print"Read Group Set '{}' is not in the API".format(ReadGroupSetName)
        else:
            print"Chromosome '{}' is not in the API".format(Chromosm)

Plotting Function

This function plots, the information obtained by the others. It obtains the reference base and denotes it. It also obtains the frequencies and plots them in a pie chart.



In [12]:

    
def plot_vars(Position, RdGrpName, Chromo):
    %matplotlib inline
    import matplotlib.pyplot as plt
    Refer, Freqs = Variant_Comp(int(Position), str(RdGrpName),str(Chromo))
    labels = Freqs.keys()
    sizes = Freqs.values()
    colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
    Expl= {}
    Legend = []
    print "Reference Bases:", Refer
    for i in labels:
        if Freqs.get(i) != max(sizes):
            find_variants(int(Position), int(Position)+1, str(RdGrpName), str(Chromo))
            Expl.setdefault(i, .15)
            Legend.append("{}: {} %".format(i, str(Freqs.get(i)*100)[:4]))
        elif i == Refer:
            Expl.setdefault(i,0.8)
            Legend.append("{}: {} %".format(i, str(Freqs.get(i)*100)[:4]))
        else:
            Expl.setdefault(i,0.0)
            Legend.append("{}: {} %".format(i, str(Freqs.get(i)*100)[:4]))
    explode = Expl.values()

    plt.pie(sizes, explode=explode, labels=labels, colors=colors,autopct='%1.1f%%', shadow=True, startangle=0)
    plt.axis('equal')
    plt.legend(['%s' % str(x) for x in (Legend)])
    plt.show()

This function calls the previous one, and sets up the interface so that it is an active application. The following one, will begin the query and plotting process.



In [13]:

    
def Widget():
    from ipywidgets import widgets
    from ipywidgets import interact
    from IPython.display import display
    
    t0 = widgets.Text(value="Position Exaple:  '120394'", disabled=True)
    text0 = widgets.Text()
    t1 = widgets.Text(value="ReadGroupName Example:  'NA19102'", disabled=True)
    text1 = widgets.Text()
    t2 = widgets.Text(value= "ReferenceSets Example:  '1'", disabled=True)
    text2 = widgets.Text()
    display(t0, text0, t1, text1, t2, text2)
    button = widgets.Button(description="Submit")
    exit = widgets.Button(description="Exit")
    display(button, exit)
    
    
    def exitFunct(c):
        import sys
        sys.exit(["Thank you, you have exited the function"]) 
    
    def Submit(sender):
        Pos, RgSetNm, Chrom = text0.value, text1.value, text2.value
        chr1 = chrfunct(Chrom)
        print "NEXT PLOT VARS FUNCTION W/ PARAMS {}, {}, {}".format(Pos, RgSetNm, Chrom)
        plot_vars(Pos, RgSetNm, Chrom)
        

    def button_clicked(b):
        print "Position: {}, ReadGrpSet: {}, Chrom: {}".format(text0.value, text1.value, text2.value)
        Submit(b)    

    button.on_click(button_clicked)
    exit.on_click(exitFunct)