It's been a week since I have started to work in an algorithm called prioritySort. A key part of this is the calculation of the Jaccard Index between two sets. As this has to be done a huge number of times, I want to study this Index from nearer to improve the performance of the main algorithm.

I will start defining the Jaccard Index, for any readers who might want to learn from or use this notebook. Given two sets A and B, the Jaccard Index is the ratio between the number of elements in common of A and B (its intersection) and the number of elements of the union of A and B.

We will first consider how to represent the inputs. We need obviously to represent two sets, A and B. As we are working with python in this document, we could use the data structure **set**. But I want to implement this algorithm later in PHP, or even a more efficient language such as C++. So no sets. We can use simple lists, or arrays, instead. We only need to take care of the repeated elements. For developing a useful algorithm here, we will just assume that our lists contain unique elements.

This algorithm is quite simple. It consists of checking if each element of set A is in the set B.

```
In [6]:
```def jaccard(A, B):
intersection = 0.0
for i in A:
for j in B:
if j == i:
intersection += 1.0
return intersection/(len(A)+len(B)-intersection)

```
In [7]:
```A = [2,4,1,5,3,6,7,11,34]
B = [34,55,2,10,23,4,7,1]
print jaccard(A, B)

```
```

This is obviusly quite inefficient. Its complexity is n². Not bad, but it can be much better.

For instance, let's suppose A and B are lists instead of sets (which we are already taking for granted) and let's order both lists. Now we can use binary search to look for each element in both sets.

```
In [8]:
```def binarySearch(needle, haystack):
" We are assuming that the haystack is in order"
first = 0
last = len(haystack)-1
found = False
while first<=last and not found:
midpoint = (first + last)/2
if haystack[midpoint] == needle:
found = True
else:
if needle < haystack[midpoint]:
last = midpoint-1
else:
first = midpoint+1
return found
print binarySearch(3,[0,2,3,4])

```
```

```
In [9]:
```def binJaccard(A, B):
if len(A) > len(B):
A,B = B,A
intersection = 0.0
for i in A:
if binarySearch(i, B):
intersection += 1.0
return intersection/(len(A)+len(B)-intersection)

```
In [10]:
```A = [1,2,3,4,5,6,7,11,34]
B = [1,2,4,7,10,23,34,55]
print binJaccard(A, B)

```
```

**nlog(m)**, where n = min{len(A),len(B)} and m = max{len(A),len(B)}. Because we are using binary search, this complexity is fixed (we will always search something in log(m)), whichis quite efficient.

What if I told you, that I really don't care about the actual value of the Jaccard Index between two sets A and B? Perhaps given some (more than two) sets I want to compare the similarities between all the sets. Could that be done without knowing the exact value of this index? And more importantly, can it be done in less than nlogm?

Let's change the algorithm for binary search. Now it won't return any boolean. Instead, it will return the position of the element we are looking for.

```
In [11]:
```def binaryPosition(needle, haystack):
" We are assuming that the haystack is in order"
first = 0
last = len(haystack)-1
found = False
while first<=last and not found:
midpoint = (first + last)/2
if haystack[midpoint] == needle:
found = True
else:
if needle < haystack[midpoint]:
last = midpoint-1
else:
first = midpoint+1
if found:
return midpoint
else:
return -1
print binaryPosition(1,[2,3,4,1,8,-1,0])

```
```

```
In [12]:
```def approximateJaccard(A, B):
"Assume A and B ordered"
if not A and not B:
return 1 # definition
elif not A or not B:
return 0
elif len(A)>len(B):
A,B = B,A
x = 0
y = len(A)-1
posX = binaryPosition(A[x], B)
while posX == -1 and x < len(A):
x+=1
posX = binaryPosition(A[x], B)
if posX == -1:
return 0
else:
posY = binaryPosition(A[y], B)
while posY == -1 and y > 0:
y-=1
posY = binaryPosition(A[y], B)
intersection = min(posY-posX,y-x)+1
return float(intersection)/(len(A)+len(B)-intersection)
A = [1,2,3,4,5,6,7,11,34]
B = [1,2,4,7,10,23,34,55]
print A, B
print binJaccard(A, B)
print approximateJaccard(A, B)
A = [2,3]
B = [1,2,3]
print A, B
print binJaccard(A, B)
print approximateJaccard(A, B)

```
```

The best case for this algorithm is 2logn = **O(logn)**, where n = max{len(A), len(B)}. Awesome! The worst case is nlogm, same as earlier. As we can see, it will give an incorrect value of the Jaccard Index. But it can do its task - comparing many sets - quite well, as we wanted.

We start to realize that calculating the Jaccard Index is basically calculating the # of elements in common between A and B - the intersection. The next function provides a way to do it recursively, but the complexity is again nlogn.

```
In [19]:
```def recursiveIntersection(A, B):
if not A or not B:
return 0
elif len(A)>len(B):
A,B = B,A
x = 0
y = len(A)-1
posX = binaryPosition(A[x], B)
while posX == -1 and x < len(A):
x+=1
posX = binaryPosition(A[x], B)
if posX == -1:
return 0
else:
posY = binaryPosition(A[y], B)
while posY == -1 and y > 0:
y-=1
posY = binaryPosition(A[y], B)
if x == y:
return 1
elif x+1 == y:
return 2
else:
return 2+recIntersection(A[x+1:y],B[posX+1:posY])
print recursiveIntersection([-1,1,2,3,5,7],[2,3,4,5,7])

```
```

```
In [20]:
```def linearIntersection(A, B):
if not A or not B:
return 0
i = 0
j = 0
intersection = 0
while i<len(A) and j<len(B):
if A[i] == B[j]:
i+=1
j+=1
intersection+=1
elif A[i]<B[j]:
i+=1
else:#B[j]<A[j]
j+=1
return intersection
print linearIntersection([-1,1,2,3,5,7],[2,3,4,5,7])

```
```

And the linear jaccard:

```
In [21]:
```def linearJaccard(A, B):
if not A or not B:
return 0
i = 0
j = 0
intersection = 0.0
while i<len(A) and j<len(B):
if A[i] == B[j]:
i+=1
j+=1
intersection+=1.0
elif A[i]<B[j]:
i+=1
else:#B[j]<A[j]
j+=1
return intersection/(len(A)+len(B)-intersection)
A = [1,2,3,4,5,6,7,11,34]
B = [1,2,4,7,10,23,34,55]
print A, B
print linearJaccard(A, B)
A = [2,3]
B = [1,2,3]
print A, B
print linearJaccard(A, B)

```
```

**two alternatives**. We know we can find the exact Jaccard Index in linear time, or approximating its value in O(logn) in the best case.

```
In [ ]:
```