BONUS

In this question, we'll take a break from your regularly-scheduled probability and stats to bring you some anomaly detection!

Part A

Anomaly detection is a huge area of data science and cybersecurity. Even on a single computer, there are hundreds of little programs running simultaneously, all generating log files that record their behavior. Parsing these log files is tricky by itself, but detecting when a program may be misbehaving from its logs can be very challenging; what's the threshold at which behavior goes from normal to suspicious?

In this first part, you'll write code that flags certain sequences of numbers. Write a function that

is named flag_segments
takes 2 arguments: a list of 1s and 0s (the log file), and an optional integer threshold giving the minimum number of consecutive 1s for a sequence to be flagged as suspicious (default is 4)
returns a list containing the starting indices in the log file of suspicious sequences

For example, if the input log file is [1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1], then flag_segments([1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1], 4) should return [2, 8]. With a threshold of 4, any run of 4 or more consecutive 1s is considered suspicious, and the starting index of each such run in the input log is recorded in the returned list.

A contiguous string of 1s counts as a single sequence: even if its length is a multiple of the threshold (say, eight 1s in a row with a threshold of 4), flag it only once, at its starting index.

You can't use any imports or built-in functions aside from range(), len(), and enumerate().
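One way to approach this is a single pass that tracks the current run of 1s. The sketch below is only an illustration of that idea, not the official solution; the variable names are made up, and it assumes list methods such as append() are allowed, since they are methods rather than built-in functions.

```python
def flag_segments(log, threshold=4):
    """Return the starting indices of runs of at least `threshold` 1s."""
    flags = []
    run_start = 0   # index where the current run of 1s began
    run_length = 0  # number of consecutive 1s seen so far in this run
    for i, value in enumerate(log):
        if value == 1:
            if run_length == 0:
                run_start = i
            run_length += 1
            # Record the run exactly once, at the moment it reaches the
            # threshold; longer runs are therefore not flagged again.
            if run_length == threshold:
                flags.append(run_start)
        else:
            run_length = 0
    return flags


print(flag_segments([1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1], 4))  # [2, 8]
```

Flagging at the moment the run length equals the threshold, rather than when the run ends, avoids a special case for a run that extends to the end of the log.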



In [ ]:



In [ ]:

    
import numpy as np

np.random.seed(583945)
l1 = np.random.randint(2, size = 1000).tolist()
a1 = set([39,87,96,132,137,169,174,185,235,257,269,292, 323, 472, 564, 583, 610, 628, 653, 695, 735, 783, 808, 865, 872, 880, 905,933,957,963,990])
assert set(flag_segments(l1)) == a1



In [ ]:

    
np.random.seed(49854)
l2 = np.random.randint(2, size = 1000).tolist()
a2 = set([61, 74, 90, 117, 124, 132, 151, 163, 179, 198, 229, 265, 297, 302, 354, 420, 479, 546, 582, 597, 632, 694, 778, 791, 923])
assert set(flag_segments(l2)) == a2



In [ ]:

    
np.random.seed(578472)
l3 = np.random.randint(2, size = 1000).tolist()
a3 = set([957, 478])
assert set(flag_segments(l3, 8)) == a3

Part B

On average, how many consecutive 0s precede a flagged sequence of suspicious 1s? Write a function which

is named preceding_zeros
takes 2 arguments: a list of 1s and 0s (the log file), and a list of index-based flags (output from Part A)
returns 1 float: the average number of consecutive 0s immediately preceding each suspicious sequence of 1s in the log

You cannot use any imports or built-in functions aside from range(), len(), and enumerate().
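A possible approach is to walk backward from each flagged index and count 0s until a 1 or the start of the log is reached. The sketch below assumes that reading of "preceding" (consecutive 0s directly before the flagged run) and returns 0.0 for an empty flag list, which the problem statement does not specify; treat it as an illustration rather than the reference solution.

```python
def preceding_zeros(log, flags):
    """Average number of consecutive 0s directly before each flagged index."""
    if len(flags) == 0:
        return 0.0  # assumption: no flags means an average of 0.0
    total = 0
    for start in flags:
        j = start - 1
        # Step backward while still inside the log and still on a 0.
        while j >= 0 and log[j] == 0:
            total += 1
            j -= 1
    return total / len(flags)


# One 0 precedes index 2 and two 0s precede index 8, so the average is 1.5.
print(preceding_zeros([1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1], [2, 8]))  # 1.5
```

A flag at index 0 contributes zero preceding 0s but still counts toward the denominator.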



In [ ]:



In [ ]:

    
import numpy as np

np.random.seed(8959384)
l1 = np.random.randint(2, size = 1000).tolist()
f1 = [25, 86, 104, 157, 180, 215, 259, 321, 346, 430, 518, 523, 537, 636, 657, 678, 687, 714, 771, 796, 820, 828, 850, 894, 902, 926, 954, 959]
a1 = 2.357143
np.testing.assert_allclose(preceding_zeros(l1, f1), a1)



In [ ]:

    
np.random.seed(94721)
l2 = np.random.randint(2, size = 1000).tolist()
f2 = [0, 13, 28, 48, 53, 72, 78, 102, 125, 132, 139, 155, 166, 206, 229, 319, 391, 418, 463, 532, 566, 574, 636, 661, 697, 732, 785, 830, 863, 912, 944, 980]
a2 = 1.78125
np.testing.assert_allclose(preceding_zeros(l2, f2), a2)