Jaccard Index

It's been a week since I started working on an algorithm called prioritySort. A key part of it is the calculation of the Jaccard Index between two sets. As this has to be done a huge number of times, I want to study this index more closely to improve the performance of the main algorithm.

I will start by defining the Jaccard Index, for any readers who might want to learn from or use this notebook. Given two sets A and B, the Jaccard Index is the ratio between the number of elements A and B have in common (their intersection) and the number of elements in their union.
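In code, the definition can be written in one line using Python's built-in sets (Python 3 syntax here; just a reference sketch, since we will avoid the set type below):

```python
def jaccardSets(A, B):
    """Jaccard Index via built-in sets: |A & B| / |A | B|."""
    if not A and not B:
        return 1.0  # by convention, two empty sets are identical
    return len(A & B) / len(A | B)

print(jaccardSets({1, 2, 3}, {2, 3, 4}))  # 0.5
```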

We will first consider how to represent the inputs. Obviously, we need to represent two sets, A and B. As we are working with Python in this document, we could use the built-in set data structure. But I want to implement this algorithm later in PHP, or even in a more efficient language such as C++. So no sets. We can use simple lists, or arrays, instead. We only need to take care of repeated elements: for the algorithms developed here, we will simply assume that our lists contain unique elements.
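If the inputs could contain duplicates, enforcing uniqueness beforehand is cheap; a minimal sketch (Python 3 syntax, helper name is my own):

```python
def dedupe(xs):
    # keep the first occurrence of each element, preserving order
    # (dict preserves insertion order in Python 3.7+)
    return list(dict.fromkeys(xs))

print(dedupe([2, 4, 2, 1, 4, 5]))  # [2, 4, 1, 5]
```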

Brute-force implementation

This algorithm is quite simple: it checks whether each element of set A appears in set B.


In [6]:
def jaccard(A, B):
    intersection = 0.0
    for i in A:
        for j in B: 
            if j == i:
                intersection += 1.0
    return intersection/(len(A)+len(B)-intersection)

In [7]:
A = [2,4,1,5,3,6,7,11,34]
B = [34,55,2,10,23,4,7,1]

print jaccard(A, B)


0.416666666667

This is obviously quite inefficient: its complexity is O(n²). Not terrible, but it can be much better.

Let's put some order

For instance, let's suppose A and B are lists instead of sets (which we are already taking for granted) and let's sort both of them. Now we can use binary search to look for each element of one list in the other.


In [8]:
def binarySearch(needle, haystack):
    " We are assuming that the haystack is in order"
    first = 0
    last = len(haystack)-1
    found = False

    while first<=last and not found:
        midpoint = (first + last)//2
        if haystack[midpoint] == needle:
            found = True
        else:
            if needle < haystack[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1

    return found

print binarySearch(3,[0,2,3,4])


True

In [9]:
def binJaccard(A, B):
    if len(A) > len(B):
        A,B = B,A
        
    intersection = 0.0
    for i in A:
        if binarySearch(i, B):
            intersection += 1.0
    return intersection/(len(A)+len(B)-intersection)

In [10]:
A = [1,2,3,4,5,6,7,11,34]
B = [1,2,4,7,10,23,34,55]

print binJaccard(A, B)


0.416666666667

Now we have achieved a complexity of n·log(m), where n = min{len(A), len(B)} and m = max{len(A), len(B)}. Because we are using binary search, each lookup always costs log(m), which is quite efficient.
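As an aside, Python's standard library already ships a binary search in the bisect module; the same n·log(m) idea could be sketched with it like this (Python 3 syntax, function name is my own):

```python
import bisect

def bisectJaccard(A, B):
    # A and B are sorted lists of unique elements
    if len(A) > len(B):
        A, B = B, A  # search the smaller list's elements in the bigger one
    intersection = 0
    for x in A:
        i = bisect.bisect_left(B, x)  # log(m) lookup
        if i < len(B) and B[i] == x:
            intersection += 1
    return intersection / (len(A) + len(B) - intersection)

print(bisectJaccard([1, 2, 3, 4, 5, 6, 7, 11, 34], [1, 2, 4, 7, 10, 23, 34, 55]))
# 0.4166666666666667
```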

Approximated Jaccard Index

What if I told you that I don't really care about the actual value of the Jaccard Index between two sets A and B? Perhaps, given some (more than two) sets, I want to compare the similarities between all of them. Could that be done without knowing the exact value of this index? And, more importantly, can it be done in less than n·log(m)?

Binary Position

Let's change the binary search algorithm. It will no longer return a boolean; instead, it will return the position of the element we are looking for (or -1 if it is absent).


In [11]:
def binaryPosition(needle, haystack):
    " We are assuming that the haystack is in order"
    first = 0
    last = len(haystack)-1
    found = False

    while first<=last and not found:
        midpoint = (first + last)//2
        if haystack[midpoint] == needle:
            found = True
        else:
            if needle < haystack[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1
    if found:
        return midpoint
    else:
        return -1
    
print binaryPosition(1,[-2,-1,0,1,2,3,4])


3

Let's see now how we can use this modified binary search. Our next algorithm does the following: search for the first and the last elements of the smaller list (either A or B) in the bigger one. The maximum possible number of elements in common between A and B is then min(distance between both elements in A, distance between both positions in B) + 1. This is an "optimistic" way of calculating the intersection of A and B, i.e., this function will return a value greater than or equal to the actual Jaccard Index.


In [12]:
def approximateJaccard(A, B):
    "Assume A and B ordered"
    if not A and not B:
        return 1 # definition
    elif not A or not B:
        return 0
    elif len(A)>len(B):
        A,B = B,A

    x = 0
    y = len(A)-1
    
    posX = binaryPosition(A[x], B)
    while posX == -1 and x < len(A)-1:
        x+=1
        posX = binaryPosition(A[x], B)
    
    if posX == -1:
        return 0
    else:
        posY = binaryPosition(A[y], B)
        while posY == -1 and y > 0:
            y-=1
            posY = binaryPosition(A[y], B)

        intersection = min(posY-posX,y-x)+1

        return float(intersection)/(len(A)+len(B)-intersection)

A = [1,2,3,4,5,6,7,11,34]
B = [1,2,4,7,10,23,34,55]

print A, B
print binJaccard(A, B)
print approximateJaccard(A, B)

A = [2,3]
B = [1,2,3]
print A, B
print binJaccard(A, B)
print approximateJaccard(A, B)


[1, 2, 3, 4, 5, 6, 7, 11, 34] [1, 2, 4, 7, 10, 23, 34, 55]
0.416666666667
0.7
[2, 3] [1, 2, 3]
0.666666666667
0.666666666667

The best case for this algorithm is 2·log(n) = O(log(n)), where n = max{len(A), len(B)}. Awesome! The worst case is n·log(m), the same as before. As we can see, it can return a value larger than the true Jaccard Index, but it can do its task, comparing many sets, quite well, as we wanted.
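A quick sanity check of the "optimistic" claim: the sketch below re-implements the bound with bisect (Python 3 syntax, helper names are my own) and verifies on random inputs that it never undershoots the exact index.

```python
import bisect
import random

def position(x, B):
    # position of x in sorted list B, or -1 if absent
    i = bisect.bisect_left(B, x)
    return i if i < len(B) and B[i] == x else -1

def approxJaccardBound(A, B):
    # A, B: sorted lists of unique elements; returns a value >= the exact Jaccard Index
    if not A and not B:
        return 1.0
    if not A or not B:
        return 0.0
    if len(A) > len(B):
        A, B = B, A
    x = 0
    while x < len(A) and position(A[x], B) == -1:
        x += 1  # first element of A present in B
    if x == len(A):
        return 0.0
    y = len(A) - 1
    while position(A[y], B) == -1:
        y -= 1  # last element of A present in B
    intersection = min(position(A[y], B) - position(A[x], B), y - x) + 1
    return intersection / (len(A) + len(B) - intersection)

def exactJaccard(A, B):
    sa, sb = set(A), set(B)
    return len(sa & sb) / len(sa | sb)

random.seed(0)
for _ in range(500):
    A = sorted(random.sample(range(60), random.randint(1, 25)))
    B = sorted(random.sample(range(60), random.randint(1, 25)))
    assert approxJaccardBound(A, B) >= exactJaccard(A, B) - 1e-12
print("bound holds on 500 random pairs")
```

The bound holds because every common element lies between the first and last elements of A found in B, both by index in A and by position in B; a larger intersection estimate also shrinks the denominator, so the ratio can only grow.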

We start to realize that calculating the Jaccard Index basically amounts to calculating the number of elements in common between A and B, i.e., the intersection. The next function provides a way to do it recursively, but the complexity is again O(n·log(m)).


In [19]:
def recursiveIntersection(A, B):
    if not A or not B:
        return 0
    elif len(A)>len(B):
        A,B = B,A

    x = 0
    y = len(A)-1
    
    posX = binaryPosition(A[x], B)
    while posX == -1 and x < len(A)-1:
        x+=1
        posX = binaryPosition(A[x], B)
    
    if posX == -1:
        return 0
    else:
        posY = binaryPosition(A[y], B)
        while posY == -1 and y > 0:
            y-=1
            posY = binaryPosition(A[y], B)

        if x == y:
            return 1
        elif x+1 == y:
            return 2
        else:
            return 2+recursiveIntersection(A[x+1:y],B[posX+1:posY])
        
print recursiveIntersection([-1,1,2,3,5,7],[2,3,4,5,7])


4

Linear intersection

There is a way of calculating the intersection of two sorted lists in linear time, O(n + m). This is beautiful, because now we can calculate the Jaccard Index in linear time too. The linear algorithm is the following:


In [20]:
def linearIntersection(A, B):
    if not A or not B:
        return 0
    i = 0
    j = 0
    intersection = 0
    while i<len(A) and j<len(B):
        if A[i] == B[j]:
            i+=1
            j+=1
            intersection+=1
        elif A[i]<B[j]:
            i+=1
        else: # B[j] < A[i]
            j+=1
    return intersection

print linearIntersection([-1,1,2,3,5,7],[2,3,4,5,7])


4
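The same two-pointer scan can just as easily count the size of the union, which is another way to see why the denominator len(A)+len(B)-intersection is correct; a sketch (Python 3 syntax, function name is my own):

```python
def linearUnionSize(A, B):
    # A, B: sorted lists of unique elements; O(len(A) + len(B))
    i = j = union = 0
    while i < len(A) and j < len(B):
        if A[i] == B[j]:
            i += 1
            j += 1
        elif A[i] < B[j]:
            i += 1
        else:
            j += 1
        union += 1  # every step consumes at least one distinct value
    return union + (len(A) - i) + (len(B) - j)

print(linearUnionSize([-1, 1, 2, 3, 5, 7], [2, 3, 4, 5, 7]))  # 7
```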

And the linear Jaccard:


In [21]:
def linearJaccard(A, B):
    if not A or not B:
        return 0
    i = 0
    j = 0
    intersection = 0.0
    while i<len(A) and j<len(B):
        if A[i] == B[j]:
            i+=1
            j+=1
            intersection+=1.0
        elif A[i]<B[j]:
            i+=1
        else: # B[j] < A[i]
            j+=1
    return intersection/(len(A)+len(B)-intersection)

A = [1,2,3,4,5,6,7,11,34]
B = [1,2,4,7,10,23,34,55]

print A, B
print linearJaccard(A, B)

A = [2,3]
B = [1,2,3]
print A, B
print linearJaccard(A, B)


[1, 2, 3, 4, 5, 6, 7, 11, 34] [1, 2, 4, 7, 10, 23, 34, 55]
0.416666666667
[2, 3] [1, 2, 3]
0.666666666667

This gives us the best exact algorithm possible on sorted input. At this point, we have two alternatives: we can find the exact Jaccard Index in linear time, or approximate its value in O(log(n)) in the best case.
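One caveat: all the linear and logarithmic costs above assume the lists are already sorted. Starting from unsorted input, the sorting step dominates; a sketch of the full pipeline (Python 3 syntax, function name is my own):

```python
def jaccardFromUnsorted(A, B):
    # sort once, O(n log n + m log m), then do the linear scan, O(n + m)
    A = sorted(set(A))
    B = sorted(set(B))
    if not A and not B:
        return 1.0
    i = j = 0
    intersection = 0
    while i < len(A) and j < len(B):
        if A[i] == B[j]:
            intersection += 1
            i += 1
            j += 1
        elif A[i] < B[j]:
            i += 1
        else:
            j += 1
    return intersection / (len(A) + len(B) - intersection)

print(jaccardFromUnsorted([2, 4, 1, 5, 3, 6, 7, 11, 34], [34, 55, 2, 10, 23, 4, 7, 1]))
# 0.4166666666666667
```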
