Searching is the algorithmic process of finding a particular item in a collection of items. A search typically answers either True or False as to whether the item is present. On occasion it may be modified to return where the item is found.

In Python, there is a very easy way to ask whether an item is in a list of items. We use the in operator.


In [ ]:
10 in [1,4,9,7, 18, 31]

In [ ]:
4 in [1, 2, 3, 4, 5, 6, 7]

Even though this is easy to write, an underlying process must be carried out to answer the question. It turns out that there are many different ways to search for the item. What we are interested in here is how these algorithms work and how they compare to one another.

The Sequential Search (or Linear Search)

When data items are stored in a collection such as an array, we say that they have a linear or sequential relationship. Each data item is stored in a position relative to the others. In Python lists, these relative positions are the index values of the individual items. Since these index values are ordered, it is possible for us to visit them in sequence. This process gives rise to our first searching technique, the sequential search.

A sequential search is to look for a particular element in an array. The figure below shows how this search works. Starting at the first item in the list, we simply move from item to item, following the underlying sequential ordering until we either find what we are looking for or run out of items.

A possible algorithm is the following

The Python implementation for this algorithm is shown below. The function needs the list and the item we are looking for and returns a location in the list as to whether it is present.


In [ ]:
def sequentialSearch(alist, item):
    pos = 0
    found = False
    
    while pos < len(alist) and not found:
        if alist[pos] == item:
            found =  True
        else:
            pos = pos+1
            
    return pos

testlist = [1, 2, 32, 8, 17, 19, 42, 13, 0]
item = 2
k = sequentialSearch(testlist, item)
if k <= len(testlist)-1:
    print('The item t= %s was found at testlist[%s]') %(item, k)
else:
    print('The item t= %s was not found.' %item)

To analyze searching algorithms, we need to decide on a basic unit of computation. This is typically the common step that must be repeated in order to solve the problem. For searching, it makes sense to count the number of comparisons performed. Each comparison may or may not discover the item we are looking for. In addition, we make another assumption here. The list of items is not ordered in any way. The items have been placed randomly into the list. In other words, the probability that the item we are looking for is in any particular position is exactly the same for each position of the list.

If the item is not in the list, the only way to know it is to compare it against every item present. If there are $n$ items, then the sequential search requires $n$ comparisons to discover that the item is not there. In the case where the item is in the list, the analysis is not so straightforward. There are actually three different scenarios that can occur. In the best case we will find the item in the first place we look, at the beginning of the list. We will need only one comparison. In the worst case, we will not discover the item until the very last comparison, the $n$th comparison. On average, we will find the item about halfway into the list; that is, we will compare against $\frac{n}{2}$ items.

Recall, however, that as n gets large, the coefficients, no matter what they are, become insignificant in our approximation, so the complexity of the sequential search, is $\mathcal{O}(n)$. (In fact, it is $\mathcal{\Theta}(n)$.) Table summarizes these results.

What if the items were ordered in some way?

We assumed earlier that the items in our collection had been randomly placed so that there is no relative order between the items. What would happen to the sequential search if the items were ordered in some way? Would we be able to gain any efficiency in our search technique?

Assume that the list of items was constructed so that the items were in ascending order, from low to high. If the item we are looking for is present in the list, the chance of it being in any one of the n positions is still the same as before. We will still have the same number of comparisons to find the item. However, if the item is not present there is a slight advantage. Figure shows this process as the algorithm looks for the item 50. Notice that items are still compared in sequence until 54.

At this point, however, we know something extra. Not only is 54 not the item we are looking for, but no other elements beyond 54 can work either since the list is sorted. In this case, the algorithm does not have to continue looking through all of the items to report that the item was not found. It can stop immediately.

Exercise 1 Modify the function sequentialSearch for the code to stop the search at the moment it finds the desired item in a list.


In [ ]:
def orderedSequentialSearch(alist, item):
    pos = 0
    found = False
    stop = False
   
    #
    # Write the code here for Exercise 1 
    #
    
    return pos

In [ ]:
testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
item = 10
k = sequentialSearch(testlist, item)
if k <= len(testlist)-1:
    print('The item t= %s was found at testlist[%s]') %(item, k)
else:
    print('The item t= %s was not found.') %item

The table below summarizes these results. Note that in the best case we might discover that the item is not in the list by looking at only one item. On average, we will know after looking through only $\frac{n}{2}$ items. However, this technique is still $\mathcal{O}(n)$. In summary, a sequential search is improved by ordering the list only in the case where we find the item.

It is possible to take greater advantage of the ordered list if we are clever with our comparisons. In the sequential search, when we compare against the first item, there are at most $n−1$ more items to look through if the first item is not what we are looking for. Instead of searching the list in sequence, a binary search will start by examining the middle item.

If that item is the one we are searching for, we are done. If it is not the correct item, we can use the ordered nature of the list to eliminate half of the remaining items. If the item we are searching for is greater than the middle item, we know that the entire lower half of the list as well as the middle item can be eliminated from further consideration. The item, if it is in the list, must be in the upper half. The above figure shows how this algorithm can quickly find the value 54. An algorithm is

A complete code is


In [ ]:
def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False
    
    while first <= last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint - 1
            else:
                first = midpoint + 1
    return (found, midpoint)

In [ ]:
testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
item = 10
(found,k) = binarySearch(testlist, item)
if found == True:
    print('The item t= %s was found at testlist[%s]') %(item, k)
else:
    print('The item t= %s was not found.' %item)

Remark: This algorithm is a great example of a divide and conquer strategy. Divide and conquer means that we divide the problem into smaller pieces, solve the smaller pieces in some way, and then reassemble the whole problem to get the result. When we perform a binary search of a list, we first check the middle item. If the item we are searching for is less than the middle item, we can simply perform a binary search of the left half of the original list. Likewise, if the item is greater, we can perform a binary search of the right half. Either way, this is a recursive call to the binary search function passing a smaller list. The following shows this recursive version.


In [ ]:
def binarySearchRecursive(alist, item):
    if len(alist) == 0:
        return False
    else:
        midpoint = len(alist)//2
        if alist[midpoint] == item:
            return True
        else:
            if item < alist[midpoint]:
                return binarySearchRecursive(alist[:midpoint], item)
            else:
                return binarySearchRecursive(alist[midpoint+1:], item)
            
testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42]
print(binarySearchRecursive(testlist, 3))
print(binarySearchRecursive(testlist, 13))

To analyze the binary search algorithm, we need to recall that each comparison eliminates about half of the remaining items from consideration. What is the maximum number of comparisons this algorithm will require to check the entire list? If we start with $n$ items, about $\frac{n}{2}$ items will be left after the first comparison. After the second comparison, there will be about $\frac{n}{4}$. Then $\frac{n}{8}$, $\frac{n}{16}$, and so on. How many times can we split the list? Table below helps us to see the answer.

When we split the list enough times, we end up with a list that has just one item. Either that is the item we are looking for or it is not. Either way, we are done. The number of comparisons necessary to get to this point is $i$ where $\frac{n}{2^i}=1$. Solving for $i$ gives us $i=\log n$. The maximum number of comparisons is logarithmic with respect to the number of items in the list. Therefore, the binary search is $\mathcal{O}(\log n)$.

Remarks

  1. One additional analysis issue needs to be addressed. In the recursive solution shown above, the recursive call, binarySearch(alist[:midpoint],item) uses the slice operator to create the left half of the list that is then passed to the next invocation (similarly for the right half as well). The analysis that we did above assumed that the slice operator takes constant time. However, we know that the slice operator in Python is actually $\mathcal{O}(n)$. This means that the binary search using slice will not perform in strict logarithmic time. Luckily this can be remedied by passing the list along with the starting and ending indices. We leave this implementation as an exercise.

  2. Even though a binary search is generally better than a sequential search, it is important to note that for small values of n, the additional cost of sorting is probably not worth it. In fact, we should always consider whether it is cost effective to take on the extra work of sorting to gain searching benefits. If we can sort once and then search many times, the cost of the sort is not so significant. However, for large lists, sorting even once can be so expensive that simply performing a sequential search from the start may be the best choice.

Exercise 2 Write a python code plotting algorithmic time complexity of the sequentialSearch, binarySearch, and binarySearchRecursive function.

Exercise 3 Set up random experiments to test the difference between a sequential search and a binary search on a list of integers. Also, demonstrate or find a case


In [ ]: