A string's $k$-mers are simply all the $k$-character-long substrings that it contains:
In [9]:
def justKmers(k, string):
return [string[i:i+k] for i in range(len(string) - (k - 1))]
print justKmers(3, "tomato")
Traditional algorithms used to find matching substrings among different strings begin by generating a sorted list of each string's $k$-mers. Each $k$-mer $s$ is stored as a tuple $(s, x, p)$ with the ID of the string it came from ($x$) and its offset in that string ($p$). We will from now on handle strings as tuples $(x, string)$, containing the string's ID and its value.
In [11]:
def kmers(k, (x, string)):
return [(string[i:i+k], x, i) for i in range(len(string) - (k - 1))]
print kmers(3, (1, "tomato"))
$k$-mers are, however, very memory-expensive. Every additional character added to a string demands a new $k$-mer tuple. Minimizers are modified $k$-mer sets that are smaller in memory. Out of each window of $w$ consecutive $k$-mers, a minimizing algorithm chooses one representative $k$-mer, the window's minimizer. Ideally, the minimizer is the rarest $k$-mer in the window, and therefore the most unique identifier, increasing the probability of meaningful matches. For further explanation, see Reducing storage requirements for biological sequence comparison by Roberts et al.
Below is an implementation of a function, minimize(), which returns a list of a supplied string's minimizers. getMin() determines which of a list of $k$-mers is the minimum by alphabetical ordering, selecting the substring that appears earlier in the string in the case of ties.
In [19]:
from operator import itemgetter
def getMin(kmers):
# the named "key" argument specifies to sort based on s primarily and p secondarily (ignoring x)
return sorted(kmers, key=itemgetter(0,2))[0]
def minimize(k, w, (x, string)):
minimizers = []
for i in range(len(string) - (k + w - 2)):
# print statements to make it clearer what's happening
print "string[" + str(i) + ":" + str(i + k + w - 1) + "] = " + '"' + string[i:i + k + w - 1] + '"'
print "k-mers:", getKmers(3, string[i:i + k + w - 1])
minKmer = getMin(kmers(k, (x, string[i:i + k + w - 1])))
# adjust offset
minKmer = (minKmer[0], minKmer[1], minKmer[2] + i)
# more printing for clarification
print "minimum k-mer:", minKmer, "\n"
if minKmer not in minimizers:
minimizers.append(minKmer)
minimizers.sort()
return minimizers
print minimize(3, 3, (0, "tomato-soup"))
getMin()'s logic is disconcertingly simple. We want to choose the most unique $k$-mer, but we are currently selecting solely based on alphabetical sorting, defaulting to the substring that appears earlier in the string in case of ties. This is arbitrary at best.
Because minimizers are often used in genome sequence analysis, techniques have been developed to best select rare and representative $k$-mers. One such technique is to give each $k$-mer a numerical score based on its nucleotide contents. In our example scoring system, nucleotides indexed by even numbers contribute 1 for c, 4 for a, 4 for t, and 5 for g. For odd-indexed nucleotides, the nucleotide values are reversed (5 for c, etc.). The $k$-mer with the lowest sum is selected as the minimizer. This promotes strands that contain the rarer nucleotide types c and g, which contribute an average of 3 compared to a and t's 4. It does this while also discouraging simple repeats like "cccccc", which are difficult to align and prone to sequencing errors, with the score reversal technique.
Another basic heuristic is to prefer sequences with more nucleotide diversity. We use Roberts et al.'s method of preferring sequences with more nucleotide types in their first four nucleotides. For example, "atca" has three, which is preferable to "attt"'s two.
We reimplement the minimizer selector below. It uses the scoring system as a primary sorter and resorts to the diversity heuristic to break any resulting ties.
In [22]:
sequence = (1, "aattaatggaccata")
print "Using the simple minimizer selection:\n"
print minimize(3, 3, sequence), '\n'
def getScore(chars):
evens = {'c': 1, 'a': 4, 't': 4, 'g': 5}
odds = {'c': 5, 'a': 4, 't': 4, 'g': 1}
score = 0
for i in range(len(chars)):
score += odds[chars[i]] if i % 2 else evens[chars[i]]
return score
def getMin(kmers):
# converts kmers to a list of tuples of the form (kmer, score, diversity)
kmers = [(kmer, getScore(kmer[0]), len(set(kmer[0][:4]))) for kmer in kmers]
# select primarily on score, secondarily on diversity
# (extra [0] to return only the kmer)
return sorted(kmers, key=itemgetter(1, 2))[0][0]
print "Using new minimizer selection: \n"
print minimize(3, 3, sequence)
By definition, every nucleotide in the string must be contained in at least one minimizer if $ w \leq k$, except the first $w-1$ and the last $w-1$ nucleotides. (see Fig. 2 and Fig. 3 in Roberts et al. for an illustration). The lack of resolution at the ends reduces our ability to recognize overlaps shorter than $w + k - 1$ between the ends of two strings. Because of this, many minimizing algorithms include end minimizers. These are the minimizers of the ends of the string using $w=1$ through $w-1$. All other minimizers are known as interior minimizers.
In [36]:
import pprint
print "string:", sequence[1], '\n'
def endMinimize(k, w, (x, string)):
minimizers = []
print "\nFRONT END MINIMIZERS\n"
for w_ in range(1, w):
pprint.pprint(["kmers:", kmers(k, (x, string[0:k + w_ - 1]))])
print "minimizer:", getMin(kmers(k, (x, string[0:k + w_ - 1])))
print '\n'
minimizers.append(getMin(kmers(k, (x, string[0:k + w_ - 1]))))
print "\nINTERIOR MINIMIZERS\n"
minimizers += minimize(k, w, (x, string))
print "\nBACK END MINIMIZERS\n"
for w_ in reversed(range(1, w)):
pprint.pprint(["kmers:", kmers(3, (x, string[len(string) - (k + w_ - 1):]))])
print "minimizer:", getMin(kmers(k, (x, string[len(string) - (k + w_ - 1):])))
print '\n'
minimizers.append(getMin(kmers(k, (x, string[len(string) - (k + w_ - 1):]))))
print "\nEND\n"
minimizers.sort()
return minimizers
print endMinimize(3, 3, sequence)
In [ ]: