In [ ]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)

Question:

You are given a list of $n$ intervals $[x_i, y_i]$, where $x_i$, $y_i$ are integers with $x_i \leq y_i$. The interval $[x_i, y_i]$ represents the set of integers between $x_i$ and $y_i$. For instance, the interval $[3,6]$ represents the set $\{3,4,5,6\}$. Define the overlap of two intervals $I$, $I'$ to be $|I \cap I'|$, i.e. the number of integers that are members of both intervals.

Devise a divide-and-conquer algorithm that, when given $n$ intervals, finds and outputs the pair of intervals with highest overlap (you may resolve ties arbitrarily). A trivial $\Theta(n^2)$ algorithm can be achieved by comparing all pairs of intervals; look for something better.

Hint: Try splitting the list using the left endpoints of the intervals.

Explanation of the Solution

First, we sort the list of intervals by their left endpoints (this only happens once). Then we do the following:

  1. At each recursive step, we break the list into two halves and
  2. Recursively find the largest overlap on the left half, and on the right half.

Then we search for the largest overlap between an interval of the left half and an interval of the right half. From the left half we only need to consider the interval whose right endpoint is the greatest---no other interval from the left half can produce a greater overlap with one on the right. So we find that interval (in linear time), and then check its overlap with all of the intervals on the right half, which also takes linear time.


1: function MAX-OVERLAP(I[1,...,n])
2:     Sort the intervals I[1,...,n] by their left endpoints.
3:     return RECURSIVE-LARGEST-OVERLAP(I[1,...,n])

4: function RECURSIVE-MAX-OVERLAP(I[1,...,n])
5:     if n = 1 then return 0
6:     L ← RECURSIVE-MAX-OVERLAP(I[1,...,n/2])      #Largest overlap on left half
7:     R ← RECURSIVE-MAX-OVERLAP(I[n/2+1,...,n])    #Largest overlap on right half

8:     C ← 0               #Largest overlap between an interval in left half and an interval in right half
9:     J ← interval in I[1,...,n/2] with the largest right endpoint
10:    for interval JPrime ∈ I[n/2+1,...,n] do
11:        C ← max(C,OVERLAP(J, JPrime))
12: return max(L,C,R)

Running time:

The running time is $O(n \log n)$. Lines 8 through 11 take $O(n)$ time, and we recursively call the function on two instances of half the size. This gives the recurrence $[ T(n) = 2 T\left( \frac{n}{2} \right) + O(n) ]$ which solves to $T(n) = O(n \log n)$ by the master theorem. The sorting we do once at the very beginning also takes time $O(n \log n)$, so the total running time is again $O(n \log n)$.

Proof of correctness:

We prove, using induction on $n$, that given the list sorted by left endpoints, Recursive-Max-Overlap returns the largest overlap between any two intervals.

Base case: For $n=1$, there are no two intervals, so the largest overlap is $0$. Line 5 correctly handles this.

Inductive hypothesis: Recursive-Max-Overlap finds the largest overlap for lists of length at most $n-1$.

Inductive step: Suppose we compute the following:

  • The highest overlap $L$ between two intervals in the left half
  • The highest overlap $R$ between two intervals in the right half
  • The highest overlap $C$ between an interval in the left half and an interval in the right half Then the answer must be the max of the three, and we return it. We must now prove that Recursive-Max-Overlap correctly finds $L$, $R$, and $C$.

By hypothesis, Recursive-Max-Overlap works on all lists of length $<n$. Therefore, lines 6 and 7 correctly compute $L$ and $R$, respectively.

For the $C$ case, suppose $J_\ell$ is an interval in the left half and $J_r$ is an interval in the right half, and let $I[n/2 + 1] = [x,y]$ be the middle element of the list. Since our list is sorted, the left endpoint of $J_r$ is at least $x$. Therefore the intersection of $J_\ell$ and $J_r$ lies in $[x,\infty)$.

Now observe that the left endpoint of $J_\ell$ is at most $x$, and so it does not affect the size of the overlap. In other words, we can replace the left endpoint of $J_\ell$ with $x$ and nothing changes. Now if we hypothetically assume all left endpoints of the left intervals are $x$, it is clear that the best choice of $J_\ell$ the one that has the greatest right endpoint. Line 9 finds precisely this interval. We then we check its overlap with all intervals of the right half (lines 10-11). Our algorithm thus computes $C$, completing the proof.

Note that Max-Overlap returns only the value of the greatest overlap, but not the actual intervals; however, with a small amount of extra bookkeeping, it is easy to recover the intervals as well.

Question: Find the Missing Integer

An unsorted array A of length n contains all the integers from $0$ to $n$ except one. In this problem, we cannot access an entire integer in $A$ with a single operation. The elements of $A$ are represented in binary, and the only operation we can use to access them is “fetch the jth bit of A[i]”. Using only this operation to access $A$, give an algorithm that determines the missing integer by looking at only $O(n)$ bits.

Note that there are $O(\log n)$ bits total in $A$, so we can’t even look at all the bits. This means, for example, that we cannot add up all the numbers in $A$, which requires looking at all the bits in $A$. Your overall algorithm, however, may take up to $O(n \log n)$ time and $O(n \log n)$ space.

Explanation of the Solution

We show how we can reduce this problem to one of size $n/2$ in linear time. First, consider an array $A'$ containing all integers from 0 to n, (none missing). Suppose we count the number of times a 1 appears in the least significant digit of the entries of $A'$:

  1. If $n$ is even, we expect $n/2$ $1$ bits found.
  2. Otherwise, we expect $(n+1)/2$ $1$ bits found.

In other words, we expect there to be $\lceil \frac{n}{2} \rceil$ $1$ bits.

Now consider our array $A$, which is just like $A'$, but missing one integer. We count the number of times $1$ appears in the least significant digit; there are two cases:

  1. If the count is $\lceil \frac{n}{2} \rceil$, the least significant digit of the missing integer must be $0$.
  2. If $\lceil \frac{n}{2} \rceil - 1$, the least significant digit of the missing integer must be $1$.

Suppose we discovered that the least significant digit of the missing integer is $c$. We can then ignore all positions in $A$ where there was a $1-c$ as the least significant digit (this can be done by creating an auxiliary array $B$ containing all the interesting indices we are considering in $A$, which can be created in $O(n)$ time).

Notice if we ignore those positions, and the least significant digits, we get the problem of finding the missing integer in an array of length $n/2$. Thus, we can recursively apply our algorithm to get a series of bits representing all but the last bit in the missing integer. We then append the least significant digit $c$, and return the result.

Running time: For a problem of size $n$, we look at $O(n)$ bits. And since we divide the problem in half each time, we can write a recurrence describing the number of bits we must look at overall: [ B(n) = B(n/2) + O(n) ] which, by the master theorem, gives us $B(n) = O(n)$.

As for the time and space complexity, we must maintain an auxiliary array of size $n$ that keeps track of which indices we are still interested in. At every recursive step, we scan through this array to prevent accessing elements we no longer care about. There are $\log n$ recursive levels, so the overall time and space complexity of the algorithm are both $O(n \log n)$.

Question: Pattern matching, with tolerance for noise

We are given binary strings $s$,$t$; $s$ is $m$ bits long, and $t$ is $n$ bits long, and $m$ < $n$. We are also given an integer $k$. We want to find whether $s$ occurs as a substring of $t$, but with ≤ $k$ errors, and if so, find all such matches.

In other words, we want to determine whether there exists an index i such that $s_0 , s_1 ,..., s_{m−1}$ agrees with $t_i ,t_{i+1}, t_{i+2} , ... , t_{i+m−1}$ in all but $k$ bits; and if yes, find all such indices i.


In [1]:
def match(S, T, k):
    Mlist = []
    m = len(S)-1
    n = len(T)-1
    
    for i in xrange(0, n-m):
        e = 0
        for j in xrange(0,m-1):
            if(S[j] != T[i+j]):
                e += 1
        if(e <= k):
            Mlist.append(i)
            
    return Mlist

In [ ]: