Author: Laura Gutierrez Funderburk
Date: June 2017
1.4 Some Hidden Messages are More Elusive than Others
Hamming Distance Problem: compute the Hamming distance between two strings. The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. Equivalently, it is the minimum number of single-symbol substitutions needed to transform one string into the other.
Input: Two strings of equal length.
Output: The Hamming distance between these strings.

In [37]:
import numpy as np

In [40]:
def Hamming_Distance(sequence_one,sequence_two):
    # Count the positions at which two equal-length strings differ.
    if len(sequence_one) != len(sequence_two):
        # Raise instead of printing, so callers never receive None as a distance.
        raise ValueError("Sequences must be strings of the same length")
    count = 0
    for i in range(len(sequence_one)):
        if sequence_one[i] != sequence_two[i]:
            count += 1
    return count

In [195]:
seq_one = "CAGAAAGGAAGGTCCCCATACACCGACGCACCAGTTTA"
seq_two = "CACGCCGTATGCATAAACGAGCCGCACGAACCAGAGAG"
print(Hamming_Distance(seq_one,seq_two))


23
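For comparison, the same definition can be written more compactly. This is only a sketch, not a replacement for the function above: the name hamming_distance is hypothetical, and it assumes that raising a ValueError is an acceptable way to reject strings of unequal length.

```python
def hamming_distance(first, second):
    """Number of positions at which two equal-length strings differ."""
    if len(first) != len(second):
        raise ValueError("Sequences must be strings of the same length")
    # zip pairs the symbols position by position; each mismatch is True,
    # and True counts as 1 when summed.
    return sum(a != b for a, b in zip(first, second))


# The two strings differ at positions 2, 9 and 10.
print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # → 3
```

Summing booleans avoids the explicit counter, at the cost of being slightly less obvious to readers who are new to Python.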
We say that a k-mer Pattern appears as a substring of Text with at most d mismatches if there is some k-mer substring Pattern' of Text having d or fewer mismatches with Pattern. In our case, that means Hamming_Distance(Pattern, Pattern') <= d. We thus define the following problem.
Approximate Pattern Matching Problem: find all approximate occurrences of a pattern in a string.
Input: Two strings, Pattern and Text, along with an integer d.
Output: All starting positions where Pattern appears as a substring of Text with at most d mismatches.

In [185]:
def ApproximatePatternCount(Text,Pattern,d):
    count = 0
    for i in range(len(Text) - len(Pattern)+1):
        Pattern_two = Text[i:i + len(Pattern)]
        if Hamming_Distance(Pattern,Pattern_two) <= d:
            count = count + 1
    return count

In [196]:
result = ApproximatePatternCount("CGTGACAGTGTATGGGCATCTTT","TGT",1)
print(result)


8

In [166]:
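def ApproximatePatternMatching(Pattern,Text,d):
    # Collect every start index whose window is within d mismatches of Pattern.
    positions = []
    for i in range(len(Text)-len(Pattern) + 1):
        approx_pattern = Text[i:i+len(Pattern)]
        # "At most d mismatches" is inclusive, so compare with <= rather than <.
        if Hamming_Distance(Pattern,approx_pattern) <= d:
            positions.append(i)
    np_positions = np.array(positions)
    return np_positions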
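
A small input that can be checked by hand makes the inclusive "at most d" condition easier to see. The sketch below is self-contained and uses hypothetical helper names (hamming_distance, approximate_positions) rather than the notebook's functions; it returns a plain list instead of a NumPy array.

```python
def hamming_distance(first, second):
    # Assumes both strings have the same length, as every window below does.
    return sum(a != b for a, b in zip(first, second))


def approximate_positions(pattern, text, d):
    """Start indices where pattern occurs in text with at most d mismatches."""
    k = len(pattern)
    return [i for i in range(len(text) - k + 1)
            if hamming_distance(pattern, text[i:i + k]) <= d]


positions = approximate_positions("TGT", "CGTGACAGTGTATGGGCATCTTT", 1)
print(positions)       # → [0, 2, 6, 8, 10, 12, 18, 20]
print(len(positions))  # → 8, the same count reported for this input earlier
```

With d = 0 the same call reduces to exact matching and returns only position 8.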

In [167]:
test = ApproximatePatternMatching("CGTGGATTTTA","GGGTGTAGCGACCTAAACGGATTGCCCGGCTCTCACTAATGATGACTACGTATTCCCATTATCTGGCCCTTACCCGACCATTGAGAGATGCAACATCGCTCGAAGGCGGAGTTGTTCTGGTTCGAACGGTAACGTACAATCAGAACAGATAGGACGCGTAGTACCAACTAGGCAAGCCCATAGGCGAGGATCTGATTAACCCGTTATGACATCACCATTAAGCTCCCCACCCATTGCTGCGGGCAAAGGCAGCTCCGGTCTTGAGGCTCCGCAGACTTGGGAAAGTGCCACCCTCGGGCGCACGAGATCGAATGAACAGTTAACGATGCCGGAACAAAAAATAGAGTGTCCAACGAGACTGCGGTGTGTCAACGAATCTGCGCCGATGTATCTCTCTGCATGATATTATAACCCGTACTACAGCCCTGTAAGCTTTTGTGTCTATCTCCCCGGACATTACCCTGCTCGGGCTCAACGTGAAAAAATATCGTCGTGACACCCCGCGGGACGCCTCAGACGAGGTCCTAACTCATCGCCGCGTGGAGGCAAGCCCTAGTTTGGTTAGCTGTTCTAGTGTTTCATTGTACCGCCACGCGACAATATGATCTTAGCGCTTGCTTCAACGGGATCCCGGCAGTCCGCACCTCCTCACCGTAGCTTTCTCCTCGAGTTGCGGTGCTAAAGTCTGTTCTTAGCCTTACGCGGCCTTTGTTAACACCACTGTAGTTTGTACTTCGATAGGCGCCTTCTTAGTGTTACGGCGCATTCGTCAGATACGTAAAACATCTCCGACGTAAGAAGTCCTTGCGCGACAGCGTTATATGCGAATTCATTCTGAGGAGAGCTGGATTCAACTTCCCGCCGGCGAGGACGGACAACGATGCTTCGTTATTCCCCGCGTACAGTATGTCTTCATGTACGGTTTAATCGCATAATGCACAACATTCACATCGCATCTTGAGCATCTCCTTCATTCATAGGGGGAGCCACCTGGAGAATGAGTAGAACTGGGTATAGTCAGAGAGCCCGCCAATCGATTTATTCTAGGTCATCCGAGTGCCGGGACTTGTGTTTAGGGGTCTGGAGTGGAGAATTTCCCTGCGTAACGGTCAATTCCCAAATAGGGTTGTCTTAGCATTGTTCCTCGTACCCACTGAAATACCTAACCTCGGGTGGCGCGGGAAGTGATCGTAGTGCTACGTCTATCAAACCAGTAGACACATAACCCTACGGCAGATGTAAATACGCAATGCAAATACGTTAACGCAGAGCTTACGAATGGTATAACCGTTACCCCAGAGATGGAACACAAGAGATTCCGAGATCGCGAAAACAACTGCCCGGGAAGAAATTACAGTTGTAGACCTCCTTAGGGCTATGCCTGGCCGCTTACGGTCTAAATAAGACGGACGTTCACTTGAGGCGGTTCCCCAAACCTCGCGTTTGGTGAGTTGGGCACTTAACGGATCTGACAACCGATAAAGACCCAGAGTCGGAGTCGCTACATACCATGGCGCAGGACTCCTTCCGAGACTCTAATCATGAGACTTTCCATAGTTGAATTAGCAGGACACACAACTTGCGGGTTACCGTACGGATAGAAATAACTCCTACGTACCGGTTTAATTTGTACCATTGCCGTCCAAACCTGCTACGTGACTTCTGAGTTCCATAGCGCAGTTACATAGTCCGAGGATAAGGCTGCAGCTGGCATCCTGTACACTGATTGGCCGCGCGCTGGAGGGTTTGGTTCTATGCCCTCGCTCTGATTGCGCCAACCAAAGGGGTGACAACTCTTGACAGAAGATTACTTACGATCTCCAACTTAAATGCTGAGGTCAGACCACCTTGGGCATGGTCTGGTCTCAGGAACTGTGAACTTGGCAACTGAAGGCCCAACGTCCCAGATTGGTCATTGTGGTCCTTGTCTCACTTACCAGAACAACCGAGC
AAGTGTGCGTATTTTGTCGCAAGCCAGCGGAGTTCACCGCACATCAAGCTCATTCATGACACGTCTAAAAGAATCATTTCACTCAACTGGGCCCTTCTGGCGGGGCTATGATATATTGGATTCGGGCAGACTAGCATGCGTTTGTAGGACCGGGGACCACGTACCCATCATAGCAGCCGTTCGACATCGATCAACGACCTACATAGCCCTCTGCTTACCCAGCCGTCCGTCCCGATTCTCTTAGCCTACATTGGCAAGGGAAGAAACAATCAGGTTAGGAGAACCGTTGTGAGAACTGTAATGTGTGCGGTTTAGAAGTCATGAAATTCAGCTCCGAGTCTTGAGCCAAAACCACGCACTAACCCGAAGGTCGATTCACTGTTATTCCTTCGGAGCTTGTCTCCAGATTGGCAGGTTGGACTTTAAGAATTACTGCTAAGTATATAGGAGGCCGAGAGCGAGGGATGCTCTTGCATACGTGACCTCTACAAGCTCTAAGGTTGAGATTAGACACAATGCGCCGTCAAACCAACGCTTGGCAGAACTGGTCTGAGGCACCAGTATCCAACACGCTTACAACTGGTTCTACTTTTAAGTTTGGTAGAAAGCTTGCGACGAACGTCAAGAGTGCTAGTTATCCGATCCTCTTGAGGGTAGCGTTTAGCGAACAGCAGCGGTTTCCCCTAAGTCCTGTGTGCCTATCCGCAGCGGAAGCAATACCGGGTGCCCCGCTACAATAACTAAACCGTTGGGTAGATAAAGGAGTAGTCGCTGTATAATGATCCCCTAGACAAAGAGATTAACGGACTATTACAAAGACGCCGCAAGACATGTTAAACCATAGAGACCTCAATCTTTGTCATCCCAAGCTTCGCCCGCGATTTACATGATCGCCAAGGTAGCTCCCTGACAATGCGCTATCGCATACGGCTCCAAGGGGTATGCTAGCTTGAACGTATTTATAGAATCATCGACGCACACGGATCTGGCGCGACGCACGTTTATAGTTACACTTGTGTGGGCCAAATCTTGTACCTCTGTAATTCCGTCACATATTGAGACGTGACCCAGCGCATAGCAGGAGCTGGTCATGGACTACAGCCGACACATGCGGGCTCTCCTACTCGGACGGGGTGGGAAGTTCGTCAACGCTTGACGTCATGCGAACATCACTTGGATACAAAGACATGAGTCGTGCTAACTCCAGACCCATGTGTGGTACTTATAGGACGACGAGAGACCTGAGCCGATCGTCTCGCGTCATCGCGTCCGCAGCGGCCTGCACAGCTATTCCCGCATAGTTGGGAAGGCATAACAGAAGTCGATCGGTGCGCCGTTTTAACCTATTGACGGCCGAATCTCTAAAACGAGGGCCTTCCCCTAGGCCGGAAGATCAATTTACCAGCGTCCTATAGTCATAGTGACTGTTGACAGCGGGACCTATTATGGGCCCCTTATGAATGAACTTCCCCAAAACAACCGCAAGACACACCATCATTGTCTGACCTATGCATGTGCAATCACTACAAATGACCTCTGTTGTCTGACTGGTTAGACTACCCCACGCGCACTGACGTGCCAACAGGAGGCCAGCTGAAGGCAGCAAACAATTTCCCGGTGTGCGAGGCCACGATAGAATTTCCTCGCTCCGGCCACTATGCTGGTGGTCGCAAATCAGAGCTGCATTTGGCCGGAGGTCACGCACACGCTATCCCGCTATCAGTTTGACGCTCGGATGGGACTACTCGTGCTACGCGTAGATTCATCTTTCGCGAATAGCGTCCACAGCGGAATAGTCAGATCAGTCAATAATGCTGAGCAAGGAACTAGATTACCGTACCGCGGACTACGCGATAGACGGCGAGGGTACTGGAAATTGTCTGATCATGCGCCAGCCCCCAATCACGCAGGAAAAGTCAACTCGTCGGAAAGGAGCGCCAAGTGAGTGTACTGGCACCCGATGTCCGATTTGGCTTGTTCTTCGCCGTCCCATCGAATGT
TCGAGAAGTCGTCAGGTTCAGTAGTGGGCAAAGCGCTTGCTAGCATATGCTGATGAAGACTGGATTGACGGGCTTGTGGCATCAGGCCTCGATGTCACAGTGCGGCCGAATAATGGTCCAGTTAGATTGCACAGAAGTACGCACCGCCAGGAGAGCTCTTTCGTTTCATTCCTGGCCGCCGAATTCGAGACCCGATTGGCTGCAATATCCCAGACTACCGCTTCAGCTGGGCTGCAAGCGATAGTGTACCTCGCGAGAGGACGTCAGCGTGACCGGGCGTGCCCATGTCTCGAAGCGCCAAACTTTGTATACCTCGCGCCCACCCGTGCGCAGAGCGCATACGGGTGGAGGGTCAGCCATAATTAACAATCTCGCCACACTATGCAACCAGGCATCGGAACTCTGGATGCAACCGTCGGCGAATTGTAGCATTGAGCCCAAGTCGCATGTCACAGACCTTTTTGGCTGGGTTATACCCCTGGATCGGAAGGAGGTTTCTCCGATATGATCTACTACAGGACTTTGTACATAAGGGGTTTTTCTGAGCCGGGCTTGGATTGCCGACAGGGCATGCAACTTAAAGTAGTAATGGTTACATTCGGCTCCTGAGTTCATACTCCGTTGCACGCAGGAGATGATGAGATGTGAATATTCCGGTTGACTCAGTCCGCATTCACGTCCTCCGATGTCTCTACATGAGAAAGCATCCCGCGAAAATTCTGTACCTGGGGACTTGGACGGGATCCTAACAGCACGTCGCCCGTCCGATAATGGCGTGTCCTAATAGCCCCCACATTAATCTGTAGGATTTTGAGCCCTCGGCCTCTTAAGAGGCCACTATTCCGACGATCTGCTCAGTTAATAGAGGCCGAAGCTCAGCGAAAGAAACGCGGACGATCGCGGGGAGAGGATACCGAGGCACATGTGTGTATGTAGCGGTGAGGATCGCACACGGGAAGGGGTGTTAAAGAGGACGTATCGTCGGATCATGCCTCAACACCAAAAAGGCACACTCCCTCACCCCGGTACTGGTAGCGTCATACAGTGATGGCTGCTAGCTCGTGGACTACGGTGAAACATCGCATGGACTGCAAATACCGCGTGTGCCCTCTTCAGCCGAGGACACTGCTTGTAGGGATAAAGCGGGAGTAGTTTGAGGGTAAGTATGCCGATATCATGCATGACCAAGAAGAGACGTACGAGCACTGCTATCAGATTTAAGTCTGCGAGATCTGAAGACGAAGCCACGATTATGGTCTGGGGGGCTTCATCCGAGTGAGTCTTAGCAATAACGCACACCACGGATGTCTGGTCTCGCCCCTGTTAGGGAGCCCGCTAGGCGGTCCATTTATCAGCATCTTTCGAGCGTTCGCGTGTTGCTTATGGGGGGCGGTCTCATGTGGGGCTCGGTTCTGAAGCTCTCGGTTGGGTTTAGGACCCCCGAGGCCGTCCTGTCGGTCCCCTACGCAAGAACCCGTGCCGGCACACCTGGTAATGTCCGCGGCAGCGACTGTTCTTGGGTATCTCCACCGTCAGACCAAAGGAACCGGCGATGAAAGCGGTTCCTCGGGGATTCCCCTCAGTTCCTTTTTCGCGGCTCGCCAACACCATGCTGGTGCACTATCAAAGAAAGCCACAGTTTCCGAATGAGATCTGATGGGTCGACCAGTCACCAAGAGACCTATAACCGAGATCTTCGCCCCGCTGATCCAGGTCCAATCTCCACGTAAGGACATGCATTTTCCAAGAGCTTTCCTTACTGCAAGCTCCCGCGCGAAGTGGAATGGATCTACGCAGCAGCAACGGGCGCTGCTCCGTGAATAATCGATCAGTTGTGTGGTCTTCCGCAAGATTTCTCTCTCCACACTGTGATCAGTCGGTGTAATCCGGATTAGTCTTGGAGTACGCTCGTCAAACACTTGGGCATAGTCCGCTTCAGATGAGTTGCGTTTTCCCTTTTAGGCCTAAAACGAAAGGCTCTAGGCTAGAAACTGGTCGCG
TCCATTGGCCGCCGTGATCATTTGGATGCCCTTGTGTGCACGATAGCTGGCAGGACCTAATCCTATCCACTGACGGATCTATAATGCTCCAGCATAAAAAACCTGTCGCCGAGTCGGTAGTAAGGTAAACCATAGAGCCTGCTAGTTCTGACCTACGCTTTCCATAACTCGAGTGCCTTATCTTAAAATCTCAGGGCTACACTGATGTTTCACAGCCAGTCGCAATCCTCGGCTTTCAGCATAGCCCTGTTACTATTTCGGTATCGGTCGAAAAGGTTGATTGACAGATGAGCCGTCGGGTACGAACCATGCCAACTAGCGGACCTAACACGACGTTAGCATGAGCTGTACGTTCTTCGTCGGGCGGAGGTGCCGCGCGGATGTCCGGGAATGTGGCGGAGTGACTCTAGCATCTCATACGACCTGATCAACCACACATTTGGCAATCCCGAATTCGCTCTTCGCCGGCCAATATGAGCCATACGTTAAGAAGCTGCGCTCTCATCTCACTAAAATGTGATAACCCGTGTACGTCCGGTATCTACTAGATATGTATGGGATAGTTTCCTTGTATAGGTCCCCTCTTTCTAATTCTGAGCCATCGCCGCAGTTTGAGTATGACGATGAAGTTGTTTGAATTAGGGAACTTTTCCCCACTGTAGCTATTGTTACGTTATCTACGCTCATCTCACTTGCTGCCAGAGCCCCTGGGATGAAGTCGAGTACAACGACTGGAAGGTTGCTCCCCATCGGACCACGCATTACGCTACGATGTCTTCGTCTATGCCGGTTCACTGGTTAACGACCCGCTCAGACGACGAAAAGCTCAGTCATTAAACGTCAGAGGATCCACATTCCGTCTCCAACTGAATTCGGTGGAAGTACTGCGCCCCTGCTACGGTTCCGTTGGCATGGTCGAGTATCCCAGGGCCGTTTACCACGTATGCGAGGCTTCTCTGAGACGAGTCCTACCGTTCTGTTGTCGTCGAGGTTAGACGTAGAACGACCTTCGTGGGGTCCATCTCGGTTGGCTTGCTCCGAGGTCGTACATCCGGCCGACCGGATACCATACTAATCCCCATGGGGGGAGCCTTTTCGGTTGTTATGCAATCGGCTTACAGCTCCAAGAACCGGCACCTATCATGGCTAGTTTATCCGGACTCCGCTGGCAATTGCACGCAAGCTTTCTCGAATAAGTCAGTGATGATTGCAGCACCCCCCTACAGGTGGAATAGCCGGAACTGGCCTGGTTTCGGAACGGATTGACATACCCCTGCGTTTTGCCCTCCCATGGTCCGCAATAGCCACCCAGACTCGGACCATAGTAGTATTTTATTCACTGCAACTGCGGGCCTGACGTCGAGTCTAAAGATGTTCGCACCGATCCCACGTCCGGAGTGATAAGCCTTGTATAGTGACAGGTGTTAAATGCGACAATCGGTTTTGTTGCGTGCTAGACTATTCAGTGTATCTTGAAATGTTGGGAATACGCTCCGTGGGGTTAACTAGCATGTCAGGTAGTAAGGTTTTAGTGAGGAAAGAAGATCTGAACGTTACTTCAAGTTGTTGTCCTTCCTATTACTCATCGGGTATGGGATTATAGGCCCTATTTAATACGTCCTGCATTGAGTCCATGTATCCTTTTTTAAACTTATCCGCAACCCCCTACCCCTTACTGTTCTGCAACGAGACTATGGCCCCCGGAAGTGAAAGAACACTGCGAGGGCTGTACCAATATAGAGGAAAGACTCAAGGAACACGACCTAAATCCCTCCCGATTAATCGATGCGCAAGTCAGGCCATAGCCCCTCTGCGCGTGTCGAATTAGGGGTGTATGGTTGAGTCCAGATAGTAAAGGTGGGTTCCACAACCATGCGAGCCCGGCCCTTAATAAGGCCCGCACGCGATCACACTAATACCCAGCAAAAGACCCAAGATTTGGGCAGCATGCGGAAACGGGGTGACTCTGTGGTGCCTGACTCGTGACCCCCCCTCAAAGC
CGCATAGAGACTCAGAATACCACTGCACTGACGTTAACCATTTTAAGTTCTACAGGATGGGCCCTGCTCTCAGGTAAATTAGCGATTGTAATGCATACGATATCCCACGGATCGCAGGATATCAAGAGTCAATCAGCCGCCATAATGTACGCCGGATAGCTGCAGTACAGGCGCCGGGTAAAGGGGGGCACGGGACAGAGTCGGTAGAATGTCACAACATGCTTCCGAGACTGTATGGTGAGATGTGCAGGCCAGAAAAGGTCTTATATCGGAGCGTGCTAGTACTCGTTCAGTCGCATCGTGAAAGAACGGCATTCAGTGAGCCCCCTACTTACTCGGGAAGTAGGTCTTAGGGGACTCGGTTGACATATCAACCTCGCTGACGAATTTGGCCTGTAGTGCTACAAACGCAGGTCTGCCGCATAGCGTTAGCAGCTCTTAGAATAAGCAAAGGTTGCACGTGCGACCTCCCCACGTTACGTAAACATGCTTCTCTTGCATGGTTTCTAAACTGCACGGCCGCGCGGCGTGACATTGCTAACAATTGCAACAACATGCTATTTTTGTGCTTCTGTGACGCACATTTCCTGATAATGCTCAACTCTAAGAATCGTACTTGAACGTATTGGACTAACTGCTGTGTTCCTCTGAGGCGTATGGTCTTGCTCTTTCTTACACACATTATACTAGAGGAATTCACCAGCGTTACTACCAAGAGACCCATAAGTAGGATAAGCTCAATTTGACTTATAGTCGTAGCCCTTTTTGCCGACCCTTATCGGTACCATCCCCGCGGTTCGTTGTTCGGTCCAAAGACCCCTTGGCGTTGGGCCTTGGCCTCTTCTGTGATGTCCGCGAATATCCTATTAATCTAGCAGTGGTTTTTGCAATCACAAACATGCAAGTAGCCAAGTACAAGATGCTATTGGACAGCCAAATTCTACGCGAAGCACGAGAAGCTCACTGTAAATTCGGAGATTCGTTGCGGAGTACGGTGACGCACTTCGCTGTGTCAGGGGGGTTCTTGATATGTGATTAACCAGCAGTCCTCGTAGGCTCGTCGCCGCGCCTTACCTTGCCCTATAGGCGACGCGCCATACCAGTAGTTCCTTATGTACCTAGATTACGGTATTAGCATCGCCCGATTCAACAGCTACGAGTTTGTATGATAGCGGGGATAAATCTACGTTGGATTACGGTGCTCCAATAATTGAGCTGGAGCTGGCTGACCGGGGTGATAAGAGTCGGAGCTGACTTAGGCTGCTATCCCCTCTTCCTTATTATAATGGCTGTGACAAGGTGGTTCGCGGCACCCACAGCTTCAGTTTAACTAGCGGCACTACCGGTGACGATTCACTCCGCGTCATGTGGCCAGGCGCCGGATCAAGAATAACGCATGCGATAAGCTATCACGAAAGTTTAAAACGATTTGCGCTACCACTTCAAGGGAAGATCGTATGGCATCAAGCCAGAAACAACCCAAGCAGTAGAGCAAATAACCGTACAGGTCAGGCTACCTACGATGGCTAAGTAACTCATTATAGTACAACCATGAACCCACTAGGTACACACGTTTTCCCACCGTTCACAAGTAAAGGAAGAAAGAATGTGGGCTAGCTATACGTGATAATCGGGGGCGTATCCTGTGGTGATGTCAAAGAGTGCTTTGACAGGCGTAACTATGGCGTGCCTTTACATGACTCTGTAATCACGTTCATATCCCATGTCTTAACCCATTTTTAGACGCTCTGCGCCTAAACGGGGAAATAACAACTGCAGGTATTAGACAACGGAGGTTGAATCGGAATCACAGAGGTAGGACCGGGTTTGTATACAAGGCTATCGTTGCGAGGAATAAGCAATGACGGAGGATGGTAACATATCCTGTCCTCCCCGAAATCCGACCTACGAATGAGAGATGTCGCTCTTGCAATTAGCGAATCAGTCTAAAACCCGATTCGTTAACGCGCATTTACACTTTTCTTCAAGAGAGAAGACTT
GAACAGGCCATGCTAGTGCCTGAGAGATTCCGAGAACAGGCAGAGATGGTCGATCCCCGTTTAGCTGTGTGCCCACGTACGTAGGGATATAATTGCGGGGACCGCCCTGATATGCTCTGCGACCGTACCATGACGGGGAATATGAACCCACGACCTATTGCCGAGAAGTAACGCCTTCGCCCCGTAGGCATTCTGTAACATAGTGCAGCGCGGCTGGTCAGAATACTCCCCGGCTTACTTTGGTTTAGGAAATAGATCAACTTTTCGAGAGTTCGGGATAACCCAACATATACAGCAAAAGTTCCATGCTGCCAATCGTTGAAAGGACTAACTTGTGAAAAGCTATATTCGTAGAAGTGTTGTGCATGCTACAGGCGCGGTAAATATTGTTCGATTCATGCCAGAGGAAGGCTCAACCGACATCTCATGGTTCAGTTCATACGTGAAACTACTCAAGGATAAGCGCTTTGAGAGTGCCCAGACATTTAATCGACTAAGCACTATTAACAGCCCCAACTCTCAGGCGCCGCGCGCGGATGCATACACCCCATCCAGCAAGGTACTGCCTGATAGTATGAAATTTGGTGACTGTTTCAATCGTGAGACCGCGCATCATCGTGACCGTTGAACAAGGAAATACAGGCGCTATTCGTTCCTGCTAGTTCCCCTTCGGGGTTACCTACGACCGTTCTCATCCTCCGATCCTATAAACGGTCTGAGGAAGCAGTAAGTGAGGTCAACCAACTGCGTTATAGAACAGGTTATTCGACGGTACATCTTCTAGTCTTTTTAGGTGACATCTATACGCCTGCCAGGGCTAGCGCGTGATCTCTATCGAAGTGTTGAATCTCGAATGGTGATACTCATAGCTCGACTAACTTACCGTCTTCTGCAATTCTCTGTGGGTCAAACTATTCCCGCTTGCATCCTTGCGAGGCCCTAGCGATGTCGGAGATCCCACCTCTTAGTTCCAACCAAGTCGCCTAATCGGCAGACTAGGCCCCTGTGGGCGGCGAGCAGTGTTAATCTTTGCCCATTGGCCTTCTGCAGCACATGGTATACGAGTCTAGGCATGCTACCCTTGTCTACAGCGAGGGAGACAGAGCTCGAGGCATTATAAGAGTCCGGTCTCAATTGATTTACGTGCATACGTCGACATCTATAAGCGTGAACCGTACCTCTAACGTGTTGCCCACAAGAATTTAAAACCATATTTCAGAGACATCTATTAGAGCCATAGATTGGCCCGCGCAACCTGGTAGCGCAGCCTTTTTGTAAATTCTACCGGATGGCCCATAGCTTCGTAGAAGATCAGATACTGAGTCCGCATAATCTGTAGCGCTATGAGGGAAGGGGGAACACATTCGGATGGTACCAGTGAGCCCGAACATGTAACATAGGGTTATATTTCACGGAATAAACGGGGACCGGAATGGACGTAGACCTGAGTACGATTTCTCGCGAATTGTGTCGAGCACCTTCGAACCGCTGGATCTGCTGGTGTATATGTGTCGCGCATTGCGGGAAGAGGTCTCATCAAAGGAATCAGGAAAGAGCAATCGCTACCTTATACTGAGAGCCCTTAGCTCCTCATCTCCTTCAGGCCAGTCGACTACCCAGTGCGAGAACAGTGGACCCTGATAACAGGGACCCATAATAAGGGCCAGATTATTGTGGATCGACCGACTCCGATTGCGCGCGGAGCGACGGCTACGTGACTCAGAGCTGAATTTGTAAGAAACCTCACGGGGGGACCAAGGTACGTTAGATCACCGGCCGCTTGCTGACGACACCTAAATGTAACGTTTTTAACCGCACAAGAAAGATGGAGTCAATAATCGGTGTCACCTTAGATGCGCGCTTTTCCTTAGATACAGAGTTCCTCTTACGTAACCCTCTTGGCCGTCAAACTGCAATATCAGTAGTAGAGGCTTCTAGAGAAGAACCACCTGCCCAACCATCAAACTATAAAACGTTACTTGTATAGAGCTTCGTCCAGC
GACCAACGGGAAATACTCTACTTGGCCGATTCTCGAGCGAAATGTTGCGATTGTATCGAGTTCTACGAAAATAAGTTCTTCCAATGTGGCTCGATGACGGTAATGAAGCGCCACTAACAGACCATTGGAACCCGGATAGCGAGCGGACCTTTCCGGTCCAATGCTCACTGTGCTGAGACCACAACAATCCCTTGAGTGACGCTGCTGCGTCTTGTTCGGGCATATTGGTCACCTGATCCTGTATTGTTAAGTAAAATTACATCATGGTACCTGACCATTTTGTGGCCCACTCACCACTATCCGTAATTGTACATGTAAACCAAGCACGCAGGGGCAGTGCAGAGCATGCTATCAACTCCCTTTAGAGAGGTGGAGCGAACAAATTGATTGCGCATTGTATGTTTCTCATCCCGGGTGAAGTCGCCTCAAAAAAGGATACATGGCTACTGTCTCGGTTCGCGGACCTCTAGATCGCTTAGGATGTCAATTATTCCGCACGATCGTGCTGTCAAGCGTAACGCAGATGGTCGTGAACAAGGTGAACGAGTATGAGAGAGAATGGCCTAACCACAAGACGTAATACGGACATATGCAGCCGTCAACGTTCCTCTTGGAATGAAGGAATTTTCTCTGGGCAGCCCGGCTTTCGGTTCGAAATCGTCTCGATATATTTCGGGCACGAACACCGTGAGTCCGCCGACATATGAATGACGACAGCATGAATGTTTGGTGAACGTATAGCGACCAGTCGTTAGCGAGAAGAACAAGAGCACATCGAATTATGCACGACGGGCCTTGTCCACAGTTCGAAACTCTCTTTAGATCCGTAGTTAGATCCATTTCGGGGTCGCAGTCAAGCCCGTTATCTCTAATCCATGTCTCGGTCCGGCTGGGATATACTATATGTATTCATAGTGCAGACTAATAACAATAGTGTTCGCCAAGCAGTTCAACATCCTTAGGGTGCCCACATTCGCGCCTTTAACAACCGCAATGCATGTCCAGAGTGCATGTGGTCAGCTCGGCCACGCTTGGCGGTGGACATCCTTATTCTTAGGGTCCCTAATCAGATAATCGAACAATTACTTTGGCGAGTGTGGTACGCTTTGTTCAGCCATGTGTCCTGTTATCTGCCGATCGCAATCATCGATTTCTAACCCTATCCTATCGATTACAGCTGAAGCTAAATCCGCTTGTCAACCGAATACTGTGGTCATCACTCAAGGCAACGACCCACCTGTAGGTGTCCAATTGGTAACGGCAAGCTACTTTTAATATAAAGCCCAGACACCCGCGCTCTTGTAGCCACTTAGACAGCGAAGCTCTCCCGCGACGTATTATCATTGTCATTGGCCATCTGATATCATTAAGCGGCTCAGGTCTTGACCGCAACAACGGTCTGTAACACGAACTTCAACTAATTTCCAATGAGGACCAGCCTAGCAAGTGCCCCGAATTTTCTTTGGTGGGGCTTGTGAGCTTGAGGGATCTATCGTTCGTACTATGAAGCCAGTTGTTTCAATGGCTATGGCACTGAGCGTCATGTGACCTTCCTAAAGGGGGTTTTGCTAACAACTCGTCTCATTCGAACTTAACGTGTTTAAGAGATCGCCATGCCCCTGGTGCCGTTACAGCCCGGGACTGTATGACACGCCACCCAACCAGATAAGTTTCTACATCACTATAAACCAATCCATCCTCCATAGCCCCTCCATTACTTTTAGGATTTCTTGCACCCAGGCGCAAAGGTCTAGACGGCGCGTCTGCGATTACTCGTGTTCGAAGACGCGTGCTCAGAGTTCCACCCCGGACTTGTTCTTGCAGGTGCTATAAGGAAGATTGGTATACGTTGGACCTTCGCGTGGACGTGCATGGTACCCCCCCTAGCTGTAAGGGCTGGCGACACGTGCGACAACTGGGCTCGGATACACGGTGCGGATGTCCTTAGTACGGTCCAACCAAGATCAGAGGCCTAACCTGCTTGTGGTAATGGAACGGCG
GCACTTAGAGTGGACAGGCGTCATTAGGTGTACCTGTCTGCACTTAATAGTCGAACTGCACTCATGGCGCATCCTGTCGGATAGAGCCTACAAAGCGTATTCCAACAGGACCAAGCCGTGTGCAAGAAGATCTCTGAATTGACTACTGTATTATGACTTCGCTAAAGGCAGCCTATTTGAAACGACGAGCTCGCGAGTGGTACTGAAGCTGCACCGGATGTGCCTAGCAAGTCTTAAGACAGGAAACCGACGTGCCCACTCCTACGGCATCTAGCTTCTTATCTCATAGAACTATAGTACGAAGCAGAAATATTATTGAGGATGTGCCAAAACGACGGCCGCTGACGACAATACTTGGCGCCTCACAACTCAGAGAACGGCCGCAAGAGCACGGCCACCGCTTATTGCTCACGGATACAGATGAAGAGCCGGTCCACAATCAGCACAGAGGGCGAGTGGCAGCCAGGCTATTGTCCGCAATGCACCATGTATACGAAGATGTTACAATTGCCCTGCGTCTCACACAGAGTTAAGTGAGCAAAGTGCTGACGCCAAACACAACTGCCGCATTTACTCTGATTGGTACAGAACAGTGTATAGCACTATGCCGCTCCCTGCGACGACAACTGCTGTAAAAGCAGATAGACGGTTTCTGATGACGAGGTAGAAGAAAAATGCTCCCCCTAACTGATACTTGCGACGTCTGGCCCTCAATAATGTTGCCCACCAGCGAGTAATGTGGACGGGTAAAGCATCCGACCAGGTGTCATGAAGCCGAAAGTGATATCTCGTCAAGCTATACCCCAGACTGCCCGTTCTCTTATGCACTTATGTACTGCATCTGTCACCAGGTAAGTGAATCCCCATCCGTTAGCTGATAGCCTTAGCGTGGAACAGCCCTCACACATTTGGCCTTGCCGAATCGAATACTATGTATTCGAGGCACTCAAAAGGCCAGGTTCTCACATTCAGTGGGACCGACGTTTATTGCCCTATTATGGTACGTATTCCCTCACCGGCAATCGGTTGTATATAGATTTGGTACCAGATGCACCTGCCCCTCACGAATTTGGAATTGGGTCCATCAAATTTTGCACTCACCTTTTAATTACTTCGTGAAATCCCATCGGCCAATCAAACCGTCCTTAGGCCGTCATCCCGAGCCGAATAGAGCAGAGCTAATGTGACAATTACCGGTTTGAATCCCTACGTCAGCGCTCGCGGGAGAAAGATTTAACCTTTACTGCCAACTCTGGAGTCTTAAACATACGCGATTTTCGCCAGGGCTGTGTCTGCCGACTAGAAACAGCGGCTTAGGCTGTTTGCTGCAGATGGGCCAGTAAACATATTCCTATCCGTAAGCCACTTAGCCATGATCTTTTAGCCGCTCCAGCTCGATCGTTCTAAAGCGGCCGTTCGGAAAAGCTACCAAGACGTGTAGTCAAGCCTGAATCTATGCGCTCGCTTAGAACACACGGCCGGACATCCGCTCATAAGCCAACCCTCGAGTCACTTCAAAAGAGGACCGTTGTCGACGTCATGTTCCCTACCTGCCTATTGATCAACGTAAGCATGTTAGTACGAAAAGGTTAGACACTTGGAGTTTTAGTGGTGCATGGAAGGTATGGAAGATGCTTGTTTTGCGTACCTATAGCGGCTTCCGGGCCTCCTGAATGACGTATTAAGCGTGGCACTACCATTTGTACAAAGTCCCTATAAGGCCTGTGGCTGCGTTCACAGCTCATTCGGGTCTCCTAGGGGCAGCGCCGCTTCGGGCAAATTGATCAAATGACCCAGTGCGCTATCAGTGTACCTCGCCATCCGCTGTTTCTGCATATGTAGGGAGGATGCGGAGCCATGGGAGCACTACTGCCAGCTGCCGACCCCGACGTTTAGGTGCCTAGGGGGGCGAAGTAGGTATCATTTGGGGTATAACACACCGAAGCGAGCACGGGAAATCACCTTTGAACTCAATTTCCTACTGTGGTCATTGATGTCGC
CTGCTATCCTTAACGCACCTTTAAACCGGGACCCTGCAAGTTTATGGCTGCCGTGGTATCAGGCCCGCTAACTGTATCAATGTTAAGCCCTCCCAACATGGATCCTTTAGTTATTAAGGACAGGTTCAAAGCACCATTGCGCAACGACTGGTTGCCTCAGCCTTCTGCCCTTCAGTAGCGCAAGTGATCTGTAGGCACGGTGGTCAGTACTTAGACAGTGAGGTACTATCGACTCCTAAATGTAAGAAACTTTATTGGTTGCGCCAAGATTCGATGCAATTTGCATCTTATCTTTCCAGAGTGTAAAGTTACTCAATAGGTGAAGGGGTCTTCTGCGGTCGGTCCCCCAGCGGGAGGAAGCTAAGTTACGGAGATTGCGATCCTAATCCGCAAGCGGTGATAGGGCCCGCCTTTGGGTGTAACTCCTTTTATCGTATATAAACAGCACTGTGTTTTAATATTACACGCAACCTTTGGATAACAGCCGTTGTAACGGAACATCCGGACGCTTCGACGCGGGTCGATAAGTTTTGAGACTCAAGCTGCTAAGGCTTCCAATAAGCAAGAGAAGCGGTTTGTACAATCGTCAAGATGGCAAATACGGTTTGACCTGCGGACGCGCCAGCTGTGGTGTTCTCTATGCAGAGAGTCAATGGTCACATCCTACAGGTTAAGCAATCCCGCCTTATTACCCCACCGTACGTTCTTTTCGACTAACTGGAAACAAGACCGACAATACAGCACCAAGGACACCGTACACAGGAGACCTGAGTCAAAACGCGGATTCCACTCGAAAAGTTAGATAATAGGTTCGAACTCTATTAGGGGCTCGATGTACACAACGAGGTGCTAGACACTACCACGGAAATTCCAAGAATACTTATCGCTCTGCCCAACAGAAGGTAGAACCTTCTTATAGTCTTGGTGAATAGATGTCCGCCTGCTATACCTATCCCCGTATGCAGCTTCTCAGTATACTCGTGACTGATAAGAAGGCTAAAATAGTACCATGCGATTGGACACCCACCAGCGGGACCCGAGTGGTCCAAGGTCCACGGTTGGATTTCTCGTCGCGAGCATCTATCGTCACGAGTGAAAGACGCGTCTCTACTTTGTGGGCTGATTCAGTTTGGAAGCCTGCGGACTAATCCACATACAGTAGTCGAGGGACATGGAGCACCGTAGTGGACTGCGCCTTGGGGTATTTCTCAGATGATTTGCCCGGAGTACGGGGCTTTAGGGATAGCCGAGTTGAAGACTCACCAATTTTGCTCTTACCCGAGAAAATATGCTAGCCTAGAGCACGTCTCATTAAATAGTATCGACTGTGCCGGTTCCGGGCTCACAATCACGTAGTACTGCGGGTAGTATTTCTACTCTCCACACCTTTGAGATATTCTGTGATCTGGGGCTTATCAAGGACCTTTTAGTCGATGAGGTCTATTGCGGACGAGTTGGCCTCACACACGCATTCAAGTGCCTGTTAGAGCAACTGAATTCAGTAAGCGCTCATACTGCCTGCTTTCTCAAATCCGTTTACGCGGCGGCAAACTAGTTTCCAGGCTTAGCGTATACGGTGAACTAGAAATGGTCTCGAACTGCAACACTTGTCTCACATCGAAACAAGATCGTTTCAGATATTACCTTTATGGCTTTGACTCCCTGTTTTCCACCATAGAGCTGTTAATTCGTAATCCGCGCCCACATTAATCCCACACTACATTATACAACTGCTGCACATTCACGATATTATGTGTATAGTGGCAGCTTAACGCTCCATACAAAAGTCGCATTATCATTAGGAATTACATGCTTTGAAAAAAGGTAATCGTTACCACATGGCAGTAATTTCGTCCGTTTCGGGACGCCACTGAACGCGGCTAAATTCACTCAAAGCCGCACGTGTTCCCAGCGAACTATACTACGTGTAGATTGACGCTAGAACACAAAGCTAGAGAGACATCGAAGCTCTGGGTACCAGTGAACTTACAGGGCCCCCC
GTGAGGGTCCGGGATCAAACTAATCGCCACCCTTGAGGCGTCTGTATTTTCTGGCCTCGCACTGACTTTGCCTGGGTCTTCTATACGACGGCCAAAACCTCGAGACAGAAATTGTTCTTCATTCCCGAGTACATGTCAGTACGTGGACTACAAATCGTCCTAAGGAGACATATTTTAAGTCATAGTGTACTGTATCTACGGCTAAACAACCTGCATGCTGTCTAGACGGCTAATAAAGTGCGCTCAAAAAATCTACAATGATAGTAGTCTACTAGGAACAAGTGCGCCATTTTCCTAATGACAACGCACGTTTACTCATTTGTCCGTTCGTTGAGTCACATAATGCGGCATAGCATTTCGATAAGATATCTACAGATCCCAGGAAATGGCGTCTGAAGCAGGCAACTCTCACCACGACGACGCGTCCGATTCTCCCGGCCAATAACTAAAGAAATACTGGCGCTTAGTGCTTGGATGATCTAAATGTACCTCTCTTTAACGGCTCTTAGGAAGCGACATGTGCGACCCTCCGCAGAACATTTGGTCCACTTGACAACCGGACGTAAGGCTATAAGGCAGTCCGCAACGACCGCGGACAATTAGGTGAAATGGTGTCACGAGCAGTAACTGAGCCACTTGTGTTGCGATTAGAATTATTCGAGGAGTCGCCCTAGTAACAGTCTGTGAACGCGACCACGCCCGTTGGTTCAATCTCGTTGTATCAACTTGGGTGAAGTATGAAAAATGATCTCCGTCCGGATCCTAGGGCATTCTGCACGGAGAGGATATTTGTGTATCCGCCTAACGGCGGAAGCCCTCGACTAGCTGCCTGACTGCGCGCCGTGCGGACCAAGGGGGGCATCACGTCTATTAGACTTAACAGTATATGCCTTAATAAGATGGAGTAGTTCTTGGCAAGGTAAATTTCATATACTGATACGGACAGTGGTATCCCCAAAGTGTGCCCGAATCGTCCATGTGTAACGAGCTCGTTGCAGTGACCCAGCGAATGTGGTCGGTTAAGGGGTAAGTGTGGAGGCTATAAGTCCCACAGGATGCATATCATTGTACTTGCTGTTTACTACCTCCTATGTACGCGAGACAATGACCCCTCCGTACAAGCTAGGTGACATGGAGGACCAGGATACGCCGATCTCCGCGGAAGATTGCTCGCATGGCACTGTCTAATCTAGTTAAGATGTGCGAGCCCAATTAATATGGACCAAGTACTAAGTTTATATGACGAATAACAAGGCAATGGTTCAGCCAGTATGCGCAGACAATAATAAGGATAGGTATCTACTTTCTTGTTAACCTAGGATAGACGCAAAGACGACGGCTTGTATGACCCTGCTACGTGACAGGTAACCGCGCTAACATGAGATGCGCGTGACCGAGACCGTCTAGAACTAAAAGCAGAGGTTGCTCTGGACCGCTCCACTGTATTTGGGCTACGGATTCAGGACTTGGTAAAACTTACACTTGTAGCTGATAGCAGGCATCTGATCACGGAGCCCTATCGGATTTGATTATACATTTCCACCCTGTACGCTTTTGTCCGAACTCAAGCTAAAATAGAAGAGTCCAACCCCCCTACGAATCTCCTAACTTCGGATACGGGAGGCTAGTTTGACACGATATCAGACGGAGAACAGCCATTACATGAGGACACGCGGCATTACCCGATTCACGCGTGGGGACGTGGATTTTA",4)
with open('/home/lgutierrezfunderburk/Documents/approx.txt',"w") as myfile:
    for number in test:
        myfile.write("%s " % number)

In [146]:
res = ApproximatePatternCount("TACGCATTACAAAGCACA","AA",1)
print(res)
res_2 = Hamming_Distance("TGACCCGTTATGCTCGAGTTCGGTCAGAGCGTCATTGCGAGTAGTCGTTTGCTTTCTCAAACTCC","GAGCGATTAAGCGTGACAGCCCCAGGGAACCCACAAAACGTGATCGCAGTCCATCCGATCATACA")
print(res_2)


13
50
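A most frequent k-mer with up to d mismatches in Text is a string Pattern maximizing Count_d(Text, Pattern) among all k-mers, where Count_d(Text, Pattern) is the number of occurrences of Pattern in Text with at most d mismatches (the quantity computed by ApproximatePatternCount above). Note that Pattern does not need to appear as an exact substring of Text.
Frequent Words with Mismatches Problem: find the most frequent k-mers with mismatches in a string.
Input: A string Text as well as integers k and d, where we assume k <= 12 and d <= 3.
Output: All most frequent k-mers with up to d mismatches in Text.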
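
One possible way to approach this problem is to generate, for every k-mer window of Text, all strings within Hamming distance d of that window, and to tally how often each candidate is produced. The sketch below illustrates that idea under stated assumptions: the alphabet is assumed to be A, C, G, T, and the names ALPHABET, neighbors and frequent_words_with_mismatches are hypothetical, not part of this notebook.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: Text is a DNA string


def neighbors(pattern, d):
    """All strings within Hamming distance d of pattern."""
    result = {pattern}
    for m in range(1, d + 1):
        # Choose m positions to rewrite, then try every letter at each of them.
        for positions in combinations(range(len(pattern)), m):
            for letters in product(ALPHABET, repeat=m):
                candidate = list(pattern)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def frequent_words_with_mismatches(text, k, d):
    """Sorted list of k-mers maximizing Count_d(text, pattern)."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window adds 1 to every k-mer it approximately matches.
        for candidate in neighbors(text[i:i + k], d):
            counts[candidate] += 1
    if not counts:
        return []
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best)


# All three windows of "AAAA" are "AA", so every 2-mer containing an A ties.
print(frequent_words_with_mismatches("AAAA", 2, 1))
# → ['AA', 'AC', 'AG', 'AT', 'CA', 'GA', 'TA']
```

Because neighbors returns a set, each window contributes at most once to any candidate, so the final tally for a candidate equals Count_d(Text, candidate). Candidates that never appear in a neighborhood have a count of zero and can be ignored.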

In [ ]: