In our last lecture, we covered the basics of molecular biology and the role of sequence analysis. In this lecture, we'll dive deeper into how sequence analysis is performed and the role algorithms play in it. By the end of this lecture, you should be able to:
From computer science comes the notion of algorithmic complexity: how the runtime of an algorithm changes with respect to its input size.
$\mathcal{O}(n)$ - the "$\mathcal{O}$" is short for "order of the function", and the expression inside the parentheses is always in terms of $n$, the variable representing the size of the input data.
Big-oh notation describes behavior in the limit of large input sizes, and most often we are interested in "worst-case" runtime. Let's start with the example from the last lecture.
In [1]:
a = [1, 2, 3, 4, 5]
for element in a:
    print(element)
How many steps, or iterations, does this loop require to run?
In [2]:
a = range(100)
for element in a:
    print(element, end = " ")
How many iterations does this loop require?
For iterating once over any list using a single for loop, how many iterations does this require?
Algorithms which take $n$ iterations to run, where $n$ is the number of elements in our data set, are referred to as running in $\mathcal{O}(n)$ time.
This is roughly interpreted to mean that, for $n$ data points, $n$ processing steps are required.
Important to note: we never actually specify how long a single processing step takes. It could be a femtosecond, or an hour. Ultimately, it doesn't matter. What does matter when something is $\mathcal{O}(n)$ is that, if we add one more data point ($n + 1$), then however long a single processing step takes, the algorithm should take only that much longer to run.
How about this code? What is its big-oh?
In [3]:
a = range(100)
b = range(1, 101)
for i in a:
    print(a[i] * b[i], end = " ")
Still $\mathcal{O}(n)$. The important part is not (directly) the number of lists, but rather how we operate on them: again, we're using only one for loop, so our runtime is directly proportional to how long the lists are.
How about this code?
In [4]:
a = range(100)
x = []
for i in a:
    x.append(i ** 2)
for j in a:
    x.append(j ** 2)
Trick question! One loop, as we've seen, is $\mathcal{O}(n)$. Now we've written a second loop that is also $\mathcal{O}(n)$, so literally speaking the runtime is $2*\mathcal{O}(n)$, but what happens to the 2 in the limit as $n \rightarrow \infty$?
The 2 is insignificant, so the overall big-oh for this code is still $\mathcal{O}(n)$.
How about this code?
In [5]:
a = range(100)
for element_i in a:
    for element_j in a:
        print(element_i * element_j, end = " ")
Nested for loops are brutal--the inner loop runs in its entirety for every single iteration of the outer loop. In the limit, for a list of length $n$, there are $\mathcal{O}(n^2)$ iterations.
One more tricky one:
In [6]:
xeno = 100
while xeno > 1:
    xeno /= 2
    print(xeno, end = " ")
Here's another example from the same complexity class:
In [7]:
xeno = 100000
while xeno > 1:
    xeno /= 10
    print(xeno, end = " ")
What does this "look" like?
In [8]:
# I'm just plotting the iteration number against the value of "xeno".
%matplotlib inline
import matplotlib.pyplot as plt
x = []
y = []
xeno = 10000
i = 1
while xeno > 1:
    x.append(i)
    y.append(xeno)
    xeno /= 10
    i += 1
plt.plot(x, y)
Out[8]: (plot of the value of xeno against the iteration number)
In the first one, on each iteration, we're dividing the remaining space by 2, halving again and again and again.
In the second one, on each iteration, we're dividing the space by 10.
$\mathcal{O}(\log n)$ for both. The base of the logarithm (2, 10, or anything else) doesn't matter, because changing the base only multiplies the result by a constant, and in the limit, constants don't matter.
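To convince yourself that the base really doesn't matter, here's a quick sanity check (a throwaway snippet, not part of the lecture's notebook) comparing the number of halvings to the number of divisions by 10 needed to shrink $n$ down to 1:

import math

for n in [100, 10000, 1000000]:
    halvings = math.log(n, 2)    # iterations if we divide by 2 each time
    tenths = math.log(n, 10)     # iterations if we divide by 10 each time
    print(n, halvings / tenths)  # the ratio is always ~3.32, a constant

The two counts differ only by a constant factor ($\log_2 10 \approx 3.32$), which vanishes in big-oh terms.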
Recall from Lecture 6 what the SCS (shortest common superstring) of two strings is: the shortest string that contains both of them as substrings.
For example, let's say we have $X$ = ABACBDCAB and $Y$ = BDCABA. What would be the shortest common superstring?
Here is one alignment: BDCABA (the second string) followed by ABACBDCAB (the first string), where the ABA is the overlap between the two strings. The full alignment, BDCABACBDCAB, has a length of 12.
Can we do better?
ABACBDCAB and BDCABA overlap at BDCAB, which gives a full alignment of ABACBDCABA, which has a length of only 10. So this alignment would be the SCS.
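If you wanted to compute this for a pair of strings programmatically, one simple (if naive) sketch is below. The function names are my own; the idea is just to find the largest suffix/prefix overlap in each order and keep the shorter merge:

def overlap(s, t):
    # Length of the longest suffix of s that is also a prefix of t.
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def scs_pair(x, y):
    # Shortest common superstring of exactly two strings:
    # merge them in both orders and keep the shorter result.
    if y in x:
        return x
    if x in y:
        return y
    xy = x + y[overlap(x, y):]
    yx = y + x[overlap(y, x):]
    return xy if len(xy) <= len(yx) else yx

print(scs_pair("ABACBDCAB", "BDCABA"))   # ABACBDCABA, length 10

(The general SCS problem over many strings is much harder; this sketch only handles the two-string case.)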
(When do we need to use SCS?)
A related, but different, problem is the longest common substring, which asks: given two strings, what is the longest string that appears as a contiguous substring of both?
Let's go back to our sequences from before: $X$ = ABACBDCAB and $Y$ = BDCABA. What would be the longest common substring?
The easiest substrings are the single characters A, B, C, and D, which both $X$ and $Y$ have. But these are short: only length 1 for all. Can we do better?
ABACBDCAB and BDCABA both contain BDCAB, so the longest common substring is BDCAB.
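Again, a naive but readable sketch (the function name is mine): try every substring of the shorter string, longest first, and return the first one found in the longer string:

def longest_common_substring(x, y):
    # Brute force: check substrings of the shorter string, longest first.
    shorter, longer = (x, y) if len(x) <= len(y) else (y, x)
    for length in range(len(shorter), 0, -1):
        for start in range(len(shorter) - length + 1):
            candidate = shorter[start:start + length]
            if candidate in longer:
                return candidate
    return ""

print(longest_common_substring("ABACBDCAB", "BDCABA"))   # BDCAB

This brute force is fine for toy examples, but far too slow for genome-scale sequences.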
(When do we need LCS?)
Given two DNA sequences $v$ and $w$:
$v$: ATATATAT
$w$: TATATATA
How would you suggest aligning these sequences to determine their similarity?
Before we try to align them, we need some objective measure of what a "good" alignment is!
Hopefully, everyone has heard of Euclidean distance: this is the usual "distance" formula you use when trying to find out how far apart two points are in 2D space.
How is it computed?
For two points in 2D space, $a$ and $b$, their Euclidean distance $d_e(a, b)$ is defined as:
$d_e(a, b) = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2}$
So if $a = (1, 2)$ and $b = (5, 3)$, then:
$d_e(a, b) = \sqrt{(1 - 5)^2 + (2 - 3)^2} = \sqrt{(-4)^2 + (-1)^2} = \sqrt{16 + 1} = \sqrt{17} \approx 4.1231$
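For completeness, here's that same computation in Python using only the standard library:

import math

a = (1, 2)
b = (5, 3)
d = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(d)   # 4.123105625617661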
How can we measure distance between two sequences?
There is a metric called Hamming distance, which counts the number of positions at which the corresponding elements of two strings differ.
We'll represent the Hamming distance between two strings $v$ and $w$ as $d_H(v, w)$.
$v$: ATATATAT
$w$: TATATATA
$d_H(v, w)$ = 8
That seems reasonable. But, given how similar the two sequences are (after all, the LCS of these two is 7 characters), what if we shifted one of the sequences over by one space?
$v$: ATATATAT-
$w$: -TATATATA
Now, what's $d_H(v, w)$?
$d_H(v, w)$ = 2
The only elements of the two strings that don't overlap are the first and last; they match perfectly otherwise!
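Hamming distance is also trivial to compute. Here's a small sketch (the function name is mine), assuming the two strings are already the same length, padded with "-" as above when needed:

def hamming(v, w):
    # Count positions where the corresponding characters differ.
    assert len(v) == len(w), "Hamming distance needs equal-length strings"
    return sum(1 for vi, wi in zip(v, w) if vi != wi)

print(hamming("ATATATAT", "TATATATA"))     # 8
print(hamming("ATATATAT-", "-TATATATA"))   # 2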
Hamming distance is useful, but it neglects the possibility of insertions and deletions in DNA (what is the only thing it counts?). So we need something more robust.
The edit distance between two strings is the minimum number of elementary operations (insertions, deletions, or substitutions / mutations) required to transform one string into the other.
Hamming distance: compares the $i^{th}$ letter of $v$ with the $i^{th}$ letter of $w$ (how hard is this to do?)
Edit distance: compares the $i^{th}$ letter of $v$ with the $j^{th}$ letter of $w$ (how hard is this to do?)
Hamming distance is easy, but gives us the wrong answers. Edit distance gives us much better answers, but it's hard to compute: how do we know which $i$ to pair with which $j$?
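The standard answer (which we'll come back to next week) is dynamic programming. Just as a preview, here's a minimal sketch of the classic Levenshtein-style recurrence; treat it as illustrative, not as the version we'll build in class:

def edit_distance(v, w):
    # dp[i][j] = edit distance between the prefixes v[:i] and w[:j].
    dp = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(len(v) + 1):
        dp[i][0] = i                 # i deletions
    for j in range(len(w) + 1):
        dp[0][j] = j                 # j insertions
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            cost = 0 if v[i - 1] == w[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[len(v)][len(w)]

print(edit_distance("TGCATAT", "ATCCGAT"))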
What's the edit distance for $v$ = TGCATAT and $w$ = ATCCGAT?
One solution (this is just one possible sequence of five single-symbol edits):
TGCATAT -> ATGCATAT (insert A at the front)
ATGCATAT -> ATCCATAT (substitute C for the G)
ATCCATAT -> ATCCGTAT (substitute G for the following A)
ATCCGTAT -> ATCCGAAT (substitute A for the following T)
ATCCGAAT -> ATCCGAT (delete the extra A)
ATCCGAT == ATCCGAT, done in 5 steps!
Can it be done in 4 steps?
(...mmmmaybe--but that's for next week!)
"Indel" is a portmanteau of "insertion" and "deletion": an insertion in one sequence is equivalent to a deletion in the other, so we don't need to worry about which one we're actually referring to.
What is the edit distance here?
Things get hairier when we consider that two genes in different species may be similar over short, conserved regions and dissimilar over remaining regions.
Homeobox genes contain a short, highly conserved region (encoding the homeodomain) that is responsible for regulating patterns of anatomical development in animals, fungi, and plants.
Here's an example global alignment that minimizes edit distance over the entirety of these two sequences:
Here's an example local alignment that may have an overall larger edit distance, but it finds the highly conserved substring:
"BUT!", you protest.
"If the local alignment has a higher edit score, how do we find it at all?"
We've already seen that we need to consider three separate possibilities when aligning sequences:
1. a character is inserted or deleted (an indel),
2. one character is substituted for another (a mutation), or
3. the two characters match.
With Hamming distance, two characters were either the same or they weren't (options 1 and 2 above were a single criterion).
With edit distance, we separated #1 and #2 above into their own categories, but they are still weighted the same (1 insertion = 1 mutation = 1 edit).
Are all insertions / deletions created equal? How about all substitutions?
Say we want to align the sequences:
$v$ = AGTCA
$w$ = CGTTGG
But instead of using a standard edit distance as before, I give you the following scoring matrix:
This matrix gives the specific edit penalties for particular substitutions / insertions / deletions.
It also allows us to codify our understanding of biology and biochemistry into how we define a "good" alignment. For instance, this matrix penalizes aligning A with G more heavily than it penalizes aligning C with T.
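To make that concrete, here's a sketch of how a scoring matrix can be used in code. The penalty values below are made up purely for illustration (they are not taken from the matrix given in class); the only property they're meant to capture is that an A/G mismatch costs more than a C/T mismatch:

# Hypothetical penalties, for illustration only.
penalty = {
    ("A", "G"): 3, ("G", "A"): 3,   # heavily penalized substitution
    ("C", "T"): 1, ("T", "C"): 1,   # mildly penalized substitution
    ("A", "-"): 2, ("-", "A"): 2,   # example indel penalties
}

def alignment_penalty(x, y):
    # Total penalty of two already-aligned, equal-length strings
    # (with "-" marking gaps).
    total = 0
    for xi, yi in zip(x, y):
        if xi == yi:
            continue                        # matches cost nothing here
        total += penalty.get((xi, yi), 1)   # default cost for unlisted pairs
    return total

print(alignment_penalty("AAC", "GAT"))   # 4 = one A/G mismatch (3) + one C/T mismatch (1)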
Here is a sample alignment using this scoring matrix:
Scoring matrices are created based on biological evidence.
Some mutations, especially in amino acid sequences, may have little (if any!) effect on the protein's function. Using scoring matrices, we can directly quantify that understanding.
For nucleotide sequences, there aren't really "standard" scoring matrices: DNA is less conserved overall, and comparing coding regions directly at the nucleotide level is less effective than comparing the amino acid sequences they encode.
There are, however, some common amino acid scoring matrices. We'll discuss two:
PAM (Point Accepted Mutation) is a more theoretical model of amino acid substitutions.
It is always associated with a number, e.g. 1 PAM, written as PAM$_1$. This means the given PAM$_1$ scoring matrix is built to reflect a 1% average change in all amino acid positions of the polypeptide.
Some important notes:
PAM$_{250}$ is a widely used scoring matrix.
Keeping A as A is clearly the most preferable (the highest score in that row, 13 points), but after 250 PAM units of evolution, a mutation from A to G also seems very favorable (12 points).
Unlike PAM, scores in BLOSUM (BLOcks SUbstitution Matrix) are derived from direct empirical observations of the frequencies of substitutions in blocks of local alignments of related proteins.
Like PAM, BLOSUM also has a number associated with it, this time to represent the observed substitution rate between two proteins sharing some amount of similarity.
BLOSUM$_{62}$ is a common scoring matrix, representing substitution rates in proteins sharing no more than 62% identity.
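If you want to explore BLOSUM$_{62}$ yourself, Biopython ships a copy. Assuming a reasonably recent Biopython installation (the exact API has changed across versions), something like this should work:

# Requires Biopython: pip install biopython
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
print(blosum62["A", "A"])   # score for aligning alanine with alanine
print(blosum62["W", "A"])   # a much less favorable pairing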