# Minhash

### Min hashing and Jaccard similarity

#### Introduction

There are many ways to measure how similar two strings are: Hamming distance, edit distance and global alignment value for example. Another way is to turn each string into a set, e.g. the set of its constituent $k$-mers, then consider how similar the sets are.

We define a function that, given a string, returns the set of its constituent $k$-mers.



In :

def string_to_kmer_set(Astr, k):
return set([Astr[i:i+k] for i in range(len(Astr)-k+1)])




In :

string_to_kmer_set("hello", 3)




Out:

{'ell', 'hel', 'llo'}



The Jaccard similarity coefficient $J(A, B)$ of non-empty sets $A$ and $B$ is:

$$\frac{|A \cap B|}{|A \cup B|}$$

It equals 1 when the sets are identical and 0 when they are disjoint. Otherwise it is between 0 and 1.



In :

def jaccard(Aset, Bset):
# return Jaccard similarity coefficient between two sets
isz = len(Aset.intersection(Bset))
return float(isz) / (len(Aset) + len(Bset) - isz)




In :

def jaccard_kmer(Astr, Bstr, k):
# turn two strings into sets, then return Jaccard similarity coefficient of those sets
return jaccard(string_to_kmer_set(Astr, k),
string_to_kmer_set(Bstr, k))




In :

jaccard_kmer("ABC", "ABD", 2)
# intersection: {AB}, union: {AB, BC, BD}




Out:

0.3333333333333333



#### Evaluating use of sets

Explicitly building and intersecting sets of strings seems inefficient. Let's see how long it takes to run on many randomly generated pairs of similar strings.



In :

import random

""" Add num_muts random substitution mutations to string """
for _ in range(num_muts):
rndi = random.randint(0, len(string)-1)
string = string[:rndi] + random.choice('ACGT') + string[rndi+1:]
return string

def random_jaccard_kmer(length, k):
""" Make a random string and a second string separated from the
first by a few mutations, then return the two strings and
their jaccard similarity coefficient. """
str1 = ''.join([random.choice('ACGT') for _ in range(length)])
return str1, str2, jaccard_kmer(str1, str2, k)




In :

random.seed(77)




Out:

'ACGCGCGT'




In :

random.seed(76)
random_jaccard_kmer(20, 4)




Out:

('GTTCGATCGGTTCAGGCGAA', 'GTTCGATCGGTTCAGGCGTA', 0.7777777777777778)




In :

import timeit
timeit.timeit('random_jaccard_kmer(1000, 10)',
setup='''
from __main__ import random_jaccard_kmer;
import random;
random.seed(223)''',
number=10000)




Out:

16.9908575999998



It takes >10 seconds to find Jaccard similarities between 10,000 random pairs of 100-long DNA strings, using $k$-mer length of 10.

### Min hashing

#### Introduction

How about: instead of using the set of all $k$-mers from each string, we pick one representative $k$-mer from each string. Let's pick the minimum alphabetically. For example:



In :

def string_to_min_kmer(Astr, k):
return min([Astr[i:i+k] for i in range(len(Astr)-k+1)])




In :

string_to_min_kmer("hello", 3)




Out:

'ell'



We can compare two strings by comparing their minimal $k$-mers:



In :

def jaccard_min_kmer(Astr, Bstr, k):
return 1 if string_to_min_kmer(Astr, k) == string_to_min_kmer(Bstr, k) else 0




In :

jaccard_min_kmer("ABC", "ABD", 2)




Out:

1




In :

jaccard_min_kmer("ABC", "ACB", 2)




Out:

0




In :

jaccard_min_kmer("DBC", "ABC", 2)




Out:

0



This can yield a Jaccard similarity of 0 or 1; we cannot distinguish intermedaite amounts of similarity. On the other hand, we avoided building any sets.

We'll use the mmh3 library, which contains an implementation of MurmurHash3, a fast and widely used non-cryptographic hash function. Instead of taking our representative as being the minimal $k$-mer, we'll first hash the $k$-mers, then take the $k$-mer with minimal hash value:



In :

# you might need to 'pip install mmh3' first
import mmh3




In :

def string_to_min_hash(Astr, k):
return min([mmh3.hash (Astr[i:i+k]) for i in range(len(Astr)-k+1)])




In :

string_to_min_hash("hello", 3)




Out:

-173395898



That's the minimum among the hash values of the 3-mers of "hello".



In :

def jaccard_min_kmer_hash(Astr, Bstr, k):
return 1 if string_to_min_hash(Astr, k) == string_to_min_hash(Bstr, k) else 0




In :

jaccard_min_kmer_hash("ABC", "ABD", 2)




Out:

0




In :

jaccard_min_kmer_hash("ABC", "ACB", 2)




Out:

0




In :

jaccard_min_kmer_hash("DBC", "ABC", 2)




Out:

1



jaccard_min_kmer_hash's return value won't necessarily match jaccard_min_kmer's, since the function permutes the alphabetical order of the $k$-mers.

#### Multiple hash functions

jaccard_min_kmer and jaccard_min_kmer_hash return 0 or 1 -- not very precise! We can get a better estimate by calling jaccard_min_kmer_hash multiple times, each time using a different hash function.

Let's rewrite string_to_min_hash to include a seed parameter that "salts" the hash function.



In :

def string_to_min_hash(Astr, k, seed=0):
return min([mmh3.hash(Astr[i:i+k], seed) for i in range(len(Astr)-k+1)])



Now we can call string_to_min_hash with various hash functions, or various saltings of the same function.



In :

[string_to_min_hash("hello", 3, k) for k in range(10, 20)]




Out:

[-1948827108,
-1610908706,
-1823680268,
-1885168061,
-1068521670,
-1692363780,
-1923178236,
-85412340,
-1121674942,
-2094403364]




In :

def jaccard_min_kmer_hashes(Astr, Bstr, k, seeds=):
tot = sum(string_to_min_hash(Astr, k, seed) == string_to_min_hash(Bstr, k, seed) for seed in seeds)
return float(tot) / len(seeds)




In :

jaccard_min_kmer_hashes("ABC", "ABD", 2, seeds=range(10))




Out:

0.3



Not a terrible estimate.



In :

jaccard_min_kmer_hashes("ABC", "ABD", 2, seeds=range(100))




Out:

0.38



Again, not terrible.



In :

jaccard_min_kmer_hashes("ABC", "ABD", 2, seeds=range(10000))




Out:

0.3299



A very good estimate: off by only about 1%.

Why does this function give an estimate of Jaccard similarity? Each hash function permutes the ordering of the $k$-mers differently. For each permutation, some $k$-mer from the union of all $k$-mers becomes the minimal one. By calculating the fraction of the hash functions for which this minimal $k$-mer is present in both sets, we're estimating the size of the intersection divided by the size of the union: the Jaccard similarity.



In :

def random_jaccard_kmer(length, k):
str1 = ''.join([random.choice('ACGT') for _ in range(length)])
return str1, str2, jaccard_min_kmer_hashes(str1, str2, k, seeds=range(10))

import timeit
timeit.timeit('random_jaccard_kmer(1000, 10)',
setup='''
from __main__ import random_jaccard_kmer;
import random;
random.seed(223)''',
number=10000)




Out:

91.05219180000131



It's slower than what we had before, but for large enough sets it could be faster.