Choose a value from $S$, to be used as **pivotValue**. Then divide the list into two partitions, **leftPart** (containing the list items that are smaller than **pivotValue**) and **rightPart** (containing the list items that are greater than **pivotValue**).
If the $k$th smallest item has been found, stop. Otherwise, select the partition that must contain the $k$th smallest item, and do the whole thing again with this partition.

In [22]:
```
def quickSelect(k, aList):
    if len(aList) == 1:
        return aList[0]  # Base case
    pivotValue = aList[0]
    leftPart = []
    rightPart = []
    for item in aList[1:]:
        if item < pivotValue:
            leftPart.append(item)
        else:
            rightPart.append(item)
    if len(leftPart) >= k:
        return quickSelect(k, leftPart)
    elif len(leftPart) == k - 1:
        return pivotValue
    else:
        return quickSelect(k - len(leftPart) - 1, rightPart)

print("Median:", quickSelect(6, [2, 36, 5, 21, 8, 13, 11, 20, 4, 1]))
```
In [23]:
```
def quickSelect(k, aList):
    if len(aList) == 1: return aList[0]
    pivotValue = aList[0]
    leftPart = [x for x in aList[1:] if x < pivotValue]
    rightPart = [x for x in aList[1:] if not x < pivotValue]
    if len(leftPart) >= k: return quickSelect(k, leftPart)
    elif len(leftPart) == k - 1: return pivotValue
    else: return quickSelect(k - len(leftPart) - 1, rightPart)

print("Median:", quickSelect(6, [2, 36, 5, 21, 8, 13, 11, 20, 4, 1]))
```

The crucial step (*cf.* **Quick Sort**) that determines whether we have best case or worst case performance is the choice of the pivot – if we are really lucky we will get a value that cuts down the list the algorithm needs to search very substantially at each step.

The algorithm is divide-and-conquer, and each iteration makes the sub-problem substantially smaller. In **Quick Sort**, both partitions are sorted recursively and, provided that the pivot at each stage divides the list into equal parts, we achieve $O(n \log n)$ complexity.

However, in the **Selection** algorithm we know which partition must contain the $k$th smallest item, so we recurse into only one partition on each call, and as a result it is even more efficient. It can be shown that its average-case complexity is $O(n)$.
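This can be made plausible with a short recurrence sketch. Assume the pivot splits the list roughly in half at each stage, and let $cn$ be the cost of partitioning a list of length $n$. Since only one partition is recursed into,

$$T(n) = T(n/2) + cn \le cn\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right) = 2cn = O(n).$$

Contrast this with Quick Sort, whose recurrence $T(n) = 2T(n/2) + cn$ solves to $O(n \log n)$ because both partitions are processed.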

Name: | StringMatch |
---|---|
Inputs: | A search string $S = (s_1, s_2, s_3, ..., s_n)$ and a target string $T = (t_1, t_2, t_3, ..., t_m)$ |
Outputs: | An integer $x$ |
Preconditions: | $m \le n$, $m > 0$ and $n > 0$ |
Postcondition: | If there is an occurrence of $T$ in $S$, $x$ is the start position of the first occurrence of $T$ in $S$; otherwise $x = -1$ |

In [85]:
```
def basicStringSearch(searchString, target):
    searchIndex = 0
    lenT = len(target)
    lenS = len(searchString)
    while searchIndex + lenT <= lenS:
        targetIndex = 0
        while targetIndex < lenT and target[targetIndex] == searchString[targetIndex + searchIndex]:
            targetIndex += 1
        if targetIndex == lenT:
            return searchIndex
        searchIndex += 1
    return -1
```
In [86]:
```
# Test code
for target, index in [('per', 0), ('lta', 14), ('ad', 10), ('astra', -1)]:
    print(basicStringSearch('per ardua ad alta', target) == index)
```

It becomes immediately apparent when implementing this algorithm that it consists of two nested loops, leading to complexity $O(mn)$ which, since $n \ge m$, is at least $O(m^2)$.

We know that if the character in $S$ following the failed comparison with $T$ is not in $T$ then there is no need to slide along one place to do another comparison. We should slide to the next point beyond it. This gives us the basis for an improved algorithm.

For each character in $T$ calculate the number of positions to shift $T$ if a comparison fails, according to where (if at all) that character appears in $T$.

Repeatedly compare the characters of $T$ with those of $S$. If a comparison fails, examine the next character along in $S$ and shift $T$ by the calculated shift distance for that character.

Do this until an occurrence of $T$ in $S$ is found, or the end of $S$ is reached.

An important point to note first of all is that the part of the algorithm calculating the shifts depends entirely on an analysis of the target string $T$ – there is no need to examine the search string $S$ at all because for any character in $S$ that is not in $T$, the shift is a fixed distance.

This lookup table is called a **shift table** and it stores a **shift distance** for each character in the domain of $S$ – e.g. for each character of the alphabet, or, say, all upper and lower case letters plus punctuation.

The **shift distance** is calculated according to the following rules:

- If the character does not appear in $T$, the shift distance is one more than the length of $T$.
- If the character does appear in $T$, the shift distance is the first position at which it appears, counting from right to left and starting at 1. (Hence, when a character appears more than once in $T$, we keep the lowest such position.)

Suppose $S = $'GGGGGAGGCGGCGGT'. Then for target string $T = $'TCCACC', we have:

G | A | C | T |
---|---|---|---|
7 | 3 | 1 | 6 |

G | A | C | T |
---|---|---|---|
1 | 6 | 2 | 5 |
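The first of these tables can be verified with a short sketch of the two rules (the helper `shiftDistance` is illustrative only, not part of the algorithm's code below):

```python
def shiftDistance(char, target):
    # Rule 1: a character not in the target shifts one place past it
    if char not in target:
        return len(target) + 1
    # Rule 2: position of the rightmost occurrence, counting from
    # the right and starting at 1
    return len(target) - target.rindex(char)

for c in 'GACT':
    print(c, shiftDistance(c, 'TCCACC'))
```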

Once the shift table has been computed, the search part of the quick search algorithm is similar to the basic string search algorithm, except that at the end of each failed attempt we look at the next character along in $S$ that is beyond $T$ and use this to look up in the shift table how many steps to slide $T$.

We implement the **shift table** as a dictionary in Python:

In [87]:
```
def buildShiftTable(target, alphabet):
    shiftTable = {}
    # Characters not in the target shift one place past it
    for character in alphabet:
        shiftTable[character] = len(target) + 1
    # For characters in the target, later (more rightward) occurrences
    # overwrite earlier ones, leaving the rightmost occurrence's shift
    for i in range(len(target)):
        char = target[i]
        shift = len(target) - i
        shiftTable[char] = shift
    return shiftTable
```
In [88]:
```
def quickSearch(searchString, target, alphabet):
    shiftTable = buildShiftTable(target, alphabet)
    searchIndex = 0
    while searchIndex + len(target) <= len(searchString):
        targetIndex = 0
        # Compare the strings
        while targetIndex < len(target) and target[targetIndex] == searchString[searchIndex + targetIndex]:
            targetIndex = targetIndex + 1
        # Return index if target found
        if targetIndex == len(target):
            return searchIndex
        # Continue the search with the new shift value, or exit
        if searchIndex + len(target) < len(searchString):
            nextChar = searchString[searchIndex + len(target)]
            searchIndex = searchIndex + shiftTable[nextChar]
        else:
            return -1
    return -1
```
In [89]:
```
theAlphabet = {'G', 'A', 'C', 'T'}
stringToSearch = 'ATGAATACCCACCTTACAGAAACCTGGGAAAAGGCAATAAATATTATAAAAGGTGAACTTACAGAAGTAA'
for thetarget in ['ACAG', 'AAGTAA', 'CCCC']:
    print(quickSearch(stringToSearch, thetarget, theAlphabet))
```

The basic brute-force algorithm we wrote first will work fine with relatively short search strings but, as with all algorithms, inputs of huge size may overwhelm it. For example, DNA strings can be billions of bases long, so algorithmic efficiency can be vital. We noted already that the complexity of the basic string search can be as bad as O(nm) in the worst case.

As for the quick search algorithm, research has shown that its average-case performance is good but, unfortunately, its worst case behaviour is still O(mn).
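To see where the worst case comes from, consider a pathological input in which almost every character matches. The following sketch (a hypothetical instrumented variant of the basic search that counts character comparisons rather than returning a match position) illustrates the $O(mn)$ behaviour:

```python
def countComparisons(searchString, target):
    # Basic string search, but counting every character comparison;
    # it scans all alignments rather than stopping at a match.
    comparisons = 0
    searchIndex = 0
    while searchIndex + len(target) <= len(searchString):
        targetIndex = 0
        while targetIndex < len(target):
            comparisons += 1
            if target[targetIndex] != searchString[searchIndex + targetIndex]:
                break
            targetIndex += 1
        searchIndex += 1
    return comparisons

# Every alignment matches m - 1 characters before failing on the 'B':
# (n - m + 1) * m = 16 * 5 = 80 comparisons for n = 20, m = 5
print(countComparisons('A' * 20, 'AAAAB'))
```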

This motivates the **Knuth–Morris–Pratt (KMP)** algorithm. A full description of the precise details of KMP is beyond the scope of this text.

The **KMP** algorithm is in two parts:

- Build a table of the lengths of prefix matches up to every character in the target string, $T$.
- Move along the search string, $S$, using the information in the table to do the shifting and compare.

Once the prefix table has been built, the actual search in the second step proceeds like the other string-searching algorithms above, but when a mismatch is detected the algorithm uses the prefix table to decide how to shift $T$. The problem is to know if these prefix matches exist and – if they do – how long the matching substrings are.

The prefix will then be aligned as shown in Figure 4.17 and comparison can continue at the next character in S.

If you want to take the trouble, you can verify that the final table will be:

In [100]:
```
prefixTable = [0, 1, 0, 0, 0, 1, 2, 3, 4, 0, 0, 0, 1, 2]
```
In [92]:
```
# Helper function for kmpSearch()
def buildPrefixTable(target):
    # The next line just builds a list that has len(target) items,
    # all of which are given the default value 0
    prefixTable = [0] * len(target)
    q = 0
    for p in range(1, len(target)):
        while q > 0 and target[q] != target[p]:
            q = prefixTable[q - 1]
        if target[q] == target[p]:
            q = q + 1
        prefixTable[p] = q
    return prefixTable
```
In [93]:
```
def kmpSearch(searchString, target):
    n = len(searchString)
    m = len(target)
    prefixTable = buildPrefixTable(target)
    q = 0  # Number of target characters matched so far
    for i in range(n):
        while q > 0 and target[q] != searchString[i]:
            q = prefixTable[q - 1]
        if target[q] == searchString[i]:
            q = q + 1
        if q == m:
            return i - m + 1
    return -1
```
In [94]:
```
stringToSearch = 'ATGAATACCCACCTTACAGAAACCTGGGAAAAGGCAATAAATATTATAAAAGGTGAACTTACAGAAGTAA'
for thetarget in ['ACAG', 'AAGTAA', 'CCCC']:
    print(kmpSearch(stringToSearch, thetarget))
```

What about the complexity of the KMP algorithm? Computing the prefix table takes significant effort but in fact there is an efficient algorithm for doing it. Overall, the KMP algorithm has complexity $O(m + n)$. Since $n$ is usually enormously larger than $m$ (think of searching a DNA string of billions of bases), $m$ is usually dominated by $n$, so this means that KMP has effective complexity $O(n)$.

String search is an immensely important application in modern computing, and at least 30 efficient algorithms have been developed for the task. Many of these depend on the principle embodied in the quick search and KMP algorithms – shifting the target string an appropriate distance along the search string at each step, based on information in a table. The **Boyer–Moore** algorithm, for example, combines elements of both these two algorithms. This algorithm is widely used in practical applications.

There are also string-search algorithms that work in entirely different ways from the examples we have looked at. Generally, these are beyond the scope of this text, but some are based on hashing functions, which we discuss next.

We have seen how we are able to make improvements in search algorithms by taking advantage of information about where items are stored in the collection with respect to one another. For example, by knowing that a list was ordered, we could search in logarithmic time using a binary search. In this section we will attempt to go one step further by building a data structure that can be searched in $O(1)$ time. This concept is referred to as **hashing**.

In order to do this, we will need to know even more about where the items might be when we go to look for them in the collection. If every item is where it should be, then the search can use a single comparison to discover the presence of an item.

A hash table is a collection of items which are stored in such a way as to make it easy to find them later. Each position of the hash table, often called a slot, can hold an item and is named by an integer value starting at 0.

Below is a hash table of size $m=11$, implemented in Python as a list with empty slots initialized to a default **None** value:
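A minimal sketch of such a table:

```python
m = 11                  # table size
hashTable = [None] * m  # eleven empty slots, named 0 to 10
print(hashTable)
```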

The mapping between an item and the slot where it belongs is called the **hash function**. The hash function will take any item in the collection and return an integer in the range of slot names, between $0$ and $m-1$.

Our first hash function, sometimes referred to as the **remainder method**, simply takes an item and divides it by the table size, returning the remainder as its hash value:

In [156]:
```
set_of_integers = [54, 26, 93, 17, 77, 31]
hash_function = lambda x: [y % 11 for y in x]
hash_vals = hash_function(set_of_integers)
hash_vals
```
Out[156]:
```
[10, 4, 5, 6, 0, 9]
```

When two items hash to the same slot, we have a **collision** (it may also be called a “clash”). Clearly, collisions create a problem for the hashing technique. We will discuss them in detail later.

A hash function that maps each item to a unique slot is referred to as a **perfect hash function**.

If we know the items and the collection will never change, then it is possible to construct a perfect hash function (refer to the exercises for more about perfect hash functions). Unfortunately, given an arbitrary collection of items, there is no systematic way to construct a perfect hash function. Luckily, we do not need the hash function to be perfect to still gain performance efficiency.

One way to always have a perfect hash function is to increase the size of the hash table so that each possible value in the item range can be accommodated. This guarantees that each item will have a unique slot. Although this is practical for small numbers of items, it is not feasible when the number of possible items is large. For example, if the items were nine-digit Social Security numbers, this method would require almost one billion slots. If we only want to store data for a class of 25 students, we will be wasting an enormous amount of memory.

Our goal is to create a hash function that minimizes the number of collisions, is easy to compute, and evenly distributes the items in the hash table. There are a number of common ways to extend the simple remainder method. We will consider a few of them here.

The **folding method** for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value.

For example, if our item was the phone number $436-555-4601$, we would take the digits and divide them into groups of $2$ and sum them; that is, $43+65+55+46+01=210$. If we assume our hash table has $11$ slots, then we need to perform the extra step of dividing by $11$ and keeping the remainder. In this case $210 \bmod 11 = 1$, so the phone number $436-555-4601$ hashes to slot $1$. (Some folding methods go one step further and reverse every other piece before the addition. For the above example, we get $43+56+55+64+01=219$, which gives $219 \bmod 11 = 10$.)

In [176]:
```
word = 4365554601
word = str(word)
step = 2
slots = 11
folds = [int(word[n: n + step]) for n in range(0, len(word), step)]
print(folds)
print(sum(folds))
print(sum(folds) % slots)
```

Another technique is the **mid-square method**: we first square the item, and then extract *some portion* of the resulting digits. For example, if the item were $44$, we would first compute $44^2 = 1936$. By extracting the middle two digits, $93$, and performing the remainder step, we get a remainder of $5$ on division by $11$.

In [144]:
```
set_of_integers = [54, 26, 93, 17, 77, 31]
hash_function = lambda x: [int(str(y**2)[1:-1]) % 11 for y in x]
hash_vals = hash_function(set_of_integers)
hash_vals
```
Out[144]:
```
[3, 7, 9, 8, 4, 6]
```

We can also hash character-based items such as strings by summing the ordinal values of their characters:

In [145]:
```
word = 'cat'
sum([ord(l) for l in word]) % 11
```
Out[145]:
```
4
```

A weakness of this scheme is that anagrams always hash to the same value. To avoid such collisions, we could weight each character by its position:

In [146]:
```
sum([(ord(word[x]) * (x + 1)) for x in range(len(word))]) % 11
```
Out[146]:
```
3
```

When two items hash to the same slot, we need a systematic method for placing the second item; this process is called **collision resolution**.

One method for resolving collisions looks into the hash table and tries to find another open slot to hold the item that caused the collision. A simple way to do this is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty.

Note that we may need to go back to the first slot (circularly) to cover the entire hash table. This collision resolution process is referred to as **open addressing** in that it tries to find the next open slot or address in the hash table. By systematically visiting each slot one at a time, we are performing an open addressing technique called **linear probing**. Consider what happens, using the hash values from the remainder method example, when we add $44$ and $55$, say.
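A sketch of linear probing (the helper name `insertLinearProbing` is ours, not from the text): $44$ and $55$ both hash to slot $0$, which is already occupied by $77$, so they are probed forward into the next open slots.

```python
def insertLinearProbing(table, item):
    # Probe forward (circularly) from the item's hash value
    # until an empty slot is found
    slot = item % len(table)
    while table[slot] is not None:
        slot = (slot + 1) % len(table)
    table[slot] = item
    return slot

table = [None] * 11
for item in [54, 26, 93, 17, 77, 31, 44, 55]:
    insertLinearProbing(table, item)
print(table)
# 44 lands in slot 1 and 55 in slot 2, next to 77 in slot 0
```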

A disadvantage of linear probing is the tendency toward **clustering**: items become clustered in the table. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution. This will have an impact on other items that are being inserted, as we saw when we tried to add the item 20 above. A cluster of values hashing to 0 had to be skipped to finally find an open position.

One way to deal with clustering is to extend the linear probing technique so that instead of looking sequentially for the next open slot, we skip slots, thereby more evenly distributing the items that have caused collisions. This will potentially reduce the clustering that occurs, e.g. with a “plus 3” probe. This means that once a collision occurs, we will look at every third slot until we find one that is empty.

The general process of looking for another slot after a collision is called **rehashing**. With simple linear probing, in general, $rehash(pos) = (pos + skip) \bmod sizeoftable$. It is important to note that the size of the “skip” must be such that all the slots in the table will eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often suggested that the table size be a prime number. This is the reason we have been using $11$ in our examples.
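A quick sketch shows why the table size matters (the helper `slotsVisited` is illustrative only): with a skip of $2$ and a table of size $10$, only half the slots are ever visited, whereas a prime size such as $11$ guarantees full coverage.

```python
def slotsVisited(start, skip, size):
    # Follow rehash(pos) = (pos + skip) % size until a slot repeats
    seen = []
    pos = start
    while pos not in seen:
        seen.append(pos)
        pos = (pos + skip) % size
    return seen

print(slotsVisited(0, 2, 10))       # half the table is unreachable
print(len(slotsVisited(0, 2, 11)))  # every slot is reachable
```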

A variant of this idea is **quadratic probing**. Instead of using a constant “skip” value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on. This means that if the first hash value is $h$, the successive values are $h+1$, $h+4$, $h+9$, $h+16$, and so on. In other words, quadratic probing uses a skip consisting of successive *perfect squares*:
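The first few quadratic probe positions for a table of size $11$ can be sketched as follows (the helper `quadraticProbes` is ours, for illustration):

```python
def quadraticProbes(h, size, attempts):
    # h, h+1, h+4, h+9, h+16, ... all taken modulo the table size
    return [(h + i * i) % size for i in range(attempts)]

print(quadraticProbes(0, 11, 5))  # [0, 1, 4, 9, 5]
```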

**Chaining** allows many items to exist at the same location in the hash table. When collisions happen, the item is still placed in the proper slot of the hash table. As more and more items hash to the same location, the difficulty of searching for the item in the collection increases:
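A sketch of chaining, using Python lists as the chains (with the items from the earlier examples; $77$, $44$ and $55$ all hash to slot $0$ and so share one chain):

```python
m = 11
chainTable = [[] for _ in range(m)]  # one initially-empty chain per slot
for item in [54, 26, 93, 17, 77, 31, 44, 55]:
    chainTable[item % m].append(item)
print(chainTable[0])  # the chain at slot 0 holds 77, 44 and 55
# A search hashes the item, then scans that slot's chain:
print(44 in chainTable[44 % m])
```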

In [155]:
```
set_of_integers = [123456, 431941, 789012, 60375]
print(set_of_integers)
set_of_integers = [((int(str(x)[0:2]) + int(str(x)[2:4]) + int(str(x)[4:])) % 80) - 1 for x in set_of_integers]
print(set_of_integers)
```
