Chapter 2: Analysis

My first jupyter notebook, demonstrating some code from class and the second chapter.

We explore somewhat how to write efficient algorithms. How do we define efficient though? Typically order of magnitude, or O(n**2) is least efficient. This is because we are squaring n with each iteration or function of O. On the scale of algorithm efficieny O(n**2) is closer to being an exponential algorithm. Which is bad, becase if n is greater than a certain number modern computers would not be able to calculate for n, we want algorithms as close to linear as possible. In the case that we are working with large amounts of data or doing huge mathematical expressions the problem becomes exponentially harder to solve, which takes more time.</span>


In [22]:
# Importing some things
import matplotlib.patches as patches
from matplotlib import pyplot
from time import time
import math
from executiontime import execution_time
from issubstring import *
from random_string import *

Plotting O():


In [42]:
title = 'Plotting Big O(): '
label_y = ''
label_x = ''

#"Lower bound"
lb = 1
#"Upper bound"
ub = 30000

vals = range(lb, ub)


ub = ub * 100
# Exponential
exp = [x ** 2 for x in vals]

# Linear
fn = [10 * x + 0.1 * x * math.log(x) for x in vals]

# Logorithmic
gn = [0.01 * x ** 2 for x in vals]

expplot = pyplot.plot(exp, label='Exponential')
linplot = pyplot.plot(fn, label='Linear')
logplot = pyplot.plot(gn, label='Logorithmic')

pyplot.legend(loc='best')
pyplot.ylabel(label_y)
pyplot.xlabel(label_x)
pyplot.title(title)
pyplot.grid(False)


pyplot.show()


In this case the Linear function which is green on the plot above, is the fastest algorithm. Another way to think about the problem of writing efficient algorithms: an ideal algorithm would be written such that it both takes the least amount of steps and that the steps taken also take the least amount of steps themselves. Since python is a very high level language as n becomes larger so too do the amount of steps exponentially, making processing times grow exponentially as well. So an ideal algorithm scales in order of magnitude as linearly as possible.

Finding substrings:

This one was tricky, here's how Dylan or this guy wrote the algorithm.

But I didn't want to do something so simple with so many lines of code. I wrote my own. This is my attempt at the Boyer-Moore string search algorithm:


In [5]:
string = 'bazbarthequickbrownfoxjumpsoverthelazydogoffthequickbrownfoxjumpsoverthelazydogthequickbrownoofjumpsoverthelazyfoo'
substring = 'foo'

print(isSubStr().is_substring(substring, string))
print(isSubStr().is_substring('bleh', 'blah'))


True
False

Say we have the string "ANPANMAN" and want to find the substring "PAN", this is the Boyer-Moore algorithm in a nutshell:


In [6]:
string = 'ANPANMAN'

print(string)
fstring = string[0:3]
print(fstring + '_____')
fstring = string[1:4]
print('.' + fstring + '____')
fstring = string[2:5]
print('..' + fstring + '___')
fstring = string[3:6]
print('...' + fstring + '__')
fstring = string[4:7]
print('....' + fstring + '_')
fstring = string[5:8]
print('.....' + fstring)
fstring = string[6:9]
print('......' + fstring)


ANPANMAN
ANP_____
.NPA____
..PAN___
...ANM__
....NMA_
.....MAN
......AN

Demonstrating how I want my algorithm to behave:


In [7]:
print(isSubStr().is_substring('foo', 'FOO'))
print(isSubStr().is_substring('foo', 'oof'))
print(isSubStr().is_substring('foo', 'foofoofoo'))


False
False
True

Of course this is just to demonstrate learning how to implement an algorithm from scratch (and testing for O()), but there's the default function:


In [8]:
print(isSubStr().default_is_substring('foo', 'FOO'))
print(isSubStr().default_is_substring('foo', 'oof'))
print(isSubStr().default_is_substring('foo', 'foofoofoo'))


False
False
True

Source on Github

""" issubstring.py | Tue, Jan 31, 2017 | Roman S. Collins

    Implementation of Boyer-Moore string search algorithm in Python.

    Source:
    https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
    https://duckduckgo.com/?q=python+substring+function&t=ffab&ia=q://duckduckgo.com/?q=python+substring+function&t=ffab&ia=qa

"""

class isSubStr:
    string = 'checkoffcheckoofcheckfool'
    substring = 'foo'

    def is_substring(self, substring, string):
        lcolon = 0; rcolon = len(substring); found = False

        for a_iteration in range(len(string) + 1):
            astring = string[lcolon:rcolon]
            lcolon += 1; rcolon += 1; check_char = 0

            for b_iteration in range(len(astring)):
                if astring[b_iteration] == substring[b_iteration]:
                    check_char += 1

                    if check_char == len(substring):
                        found = True

            if found == True:
                return found

            #if astring == substring:
            # Still comparing whole strings
            #    found = True
            #    return found
            #    pass

            if rcolon >= (len(string) + 1):
                found = False
                return found

    def default_is_substring(self, a, b):
        return a in b

def main():
    """ Doctest:
    >>> string = 'ANPANMAN'
    >>> substring = 'PAN'
    >>> print(isSubStr().is_substring(substring, string))
    True
    >>> print(isSubStr().default_is_substring(substring, string))
    True
    """
    string = 'ANPANMAN'
    substring = 'PAN'

    while isSubStr().is_substring(substring, string):
        break
        print(isSubStr().is_substring(substring, string))

if __name__=='__main__':
    main()

Generating Random Strings:

This was shown in the book as the monkey theorem problem. (And if I recall correctly in Jim showed an algorithm in class...)

This is mine. The function takes n equal to the length of the string you would like. Possible characters are simply every letter in the English alphabet plus space:


In [9]:
print(randStr().rand_string(10))


leyudfawqv

Source on Github

""" random_string.py | Wed, Feb 01, 2017 | Roman S. Collins

    Demonstrating returning random strings in Python.

"""
import random

class randStr:
    def rand_string(self, n):
        abc = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
                    'w', 'x', 'y', 'z', ' ']
        abc_copy = list(abc[x] for x in range(len(abc)))
        return ''.join(random.sample(abc_copy, n))

def main():
    """ Again only sanity checks, because output of the
        function changes.

        Doctest:
        >>> len(randStr().rand_string(10))
        10
        >>> isinstance(randStr().rand_string(10), str)
        True
    """
    # Fill main with something other than pass
    print(randStr().rand_string(10))


if __name__=='__main__':
    main()

Plotting is_substring():

Let's find out how my algorithm scales compared to Python's builtin version, and some others:


In [14]:
from executiontime import execution_time
from issubstring import *
from matplotlib import pyplot

def main():
    """ n is equal to the sample size, each sample is 10,000 iterations of each function
        to attempt to get a more accurate average execution time from system time.
    """
    n = 30000
    substring = 'PAN'
    fullstring = 'ANPANMAN'

    is_substring_times = []

    for i in range(1, n + 1):
        is_substring_time = execution_time(isSubStr().is_substring(substring, fullstring), 10000)
        is_substring_times.append(is_substring_time[1])
        #print(is_substring_times)

    pyplot.plot(is_substring_times, label='is_substring()')

    default_is_substring_times = []

    for i in range(1, n + 1):
        default_is_substring_time = execution_time(isSubStr().default_is_substring(substring, fullstring), 10000)
        default_is_substring_times.append(default_is_substring_time[1])
        #print(default_is_substring_times)

    pyplot.plot(default_is_substring_times, label='default_is_substring()')

    dylan_is_substring_times = []

    for i in range(1, n + 1):
        dylan_is_substring_time = execution_time(isSubStr().dylan_is_substring(substring, fullstring), 10000)
        dylan_is_substring_times.append(dylan_is_substring_time[1])
        #print(is_substring_times)

    pyplot.plot(dylan_is_substring_times, label='dylan_is_substring()')

    jim_is_substring_times = []

    for i in range(1, n + 1):
        jim_is_substring_time = execution_time(isSubStr().jim_is_substring(substring, fullstring), 10000)
        jim_is_substring_times.append(jim_is_substring_time[1])
        #print(is_substring_times)

    pyplot.plot(jim_is_substring_times, label='jim_is_substring()')


    title = 'Plotting Substring Search Algorithms: '
    label_y = ''
    label_x = ''

    pyplot.legend(loc='best')
    pyplot.ylabel(label_y)
    pyplot.xlabel(label_x)
    pyplot.title(title)
    pyplot.grid(False)

    pyplot.show()

if __name__ == '__main__':
    main()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-14c2b407a8ad> in <module>()
     58 
     59 if __name__ == '__main__':
---> 60     main()

<ipython-input-14-14c2b407a8ad> in main()
     29 
     30     for i in range(1, n + 1):
---> 31         dylan_is_substring_time = execution_time(isSubStr().dylan_is_substring(substring, fullstring), 10000)
     32         dylan_is_substring_times.append(dylan_is_substring_time[1])
     33         #print(is_substring_times)

AttributeError: 'isSubStr' object has no attribute 'dylan_is_substring'

In [ ]:
def main():
    n = 100
    substring = 'fjwefiojxcvnm'
    fullstring = """asdfkljasdflkalsdkfjlasdflkjasdflkjasdflasdflkjasdflkjasdflkjasdfljkasdflkjasdflkjasdflkjas
                    dflkjfghdsfkljljweoidfjwefiojxcvnm"""

    is_substring_times = []

    for i in range(1, n + 1):
        is_substring_time = execution_time(isSubStr().is_substring(substring, fullstring), 10000)
        is_substring_times.append(is_substring_time[1])
        #print(is_substring_times)

    pyplot.plot(is_substring_times, label='is_substring()')

    default_is_substring_times = []

    for i in range(1, n + 1):
        default_is_substring_time = execution_time(isSubStr().default_is_substring(substring, fullstring), 10000)
        default_is_substring_times.append(default_is_substring_time[1])
        #print(default_is_substring_times)

    pyplot.plot(default_is_substring_times, label='default_is_substring()')

    dylan_is_substring_times = []

    for i in range(1, n + 1):
        dylan_is_substring_time = execution_time(isSubStr().dylan_is_substring(substring, fullstring), 10000)
        dylan_is_substring_times.append(dylan_is_substring_time[1])
        #print(is_substring_times)

    pyplot.plot(dylan_is_substring_times, label='dylan_is_substring()')

    jim_is_substring_times = []

    for i in range(1, n + 1):
        jim_is_substring_time = execution_time(isSubStr().jim_is_substring(substring, fullstring), 10000)
        jim_is_substring_times.append(jim_is_substring_time[1])
        #print(is_substring_times)

    pyplot.plot(jim_is_substring_times, label='jim_is_substring()')


    title = 'Plotting Substring Search Algorithms: '
    label_y = ''
    label_x = ''

    pyplot.legend(loc='best')
    pyplot.ylabel(label_y)
    pyplot.xlabel(label_x)
    pyplot.title(title)
    pyplot.grid(False)

    pyplot.show()

if __name__ == '__main__':
    main()