Markov Chains are an example of a stochastic process: a sequence of random variables evolving over the dimension of time or space. Markov invented (or discovered) them as a way to answer the question *Does free will exist?* Their applications, however, are very wide.

Say we have a sequence of random variables $X_0, X_1, X_2, \cdots$. In our studies up until now, we were assuming that these random variables were *independent and identically distributed*. If the index represents time, then this is like starting with a fresh, new independent random variable at each step.

A more interesting case is when the random variables are in some way related. But once we allow dependence, the relations can get very complex.

Markov Chains are a compromise: they are one level of complexity beyond *i.i.d.*, and they come with some very useful properties. For example, think of $X_n$ as the state of a system at a particular discrete time $n$, like a wandering particle jumping from state to state.

The time index and the state space can each be discrete or continuous:

- discrete time
- continuous time
- discrete space
- continuous space

For this simple introduction, we will limit ourselves to the case of *discrete time, with a finite number of states* (discrete space), where $n \in \mathbb{Z}_{\geq 0}$.

Keeping to our limitations of discrete time and space, assume that $n$ means "now" (whatever that might be). We assume $n+1$ to be the "future". In general, the "future" would be characterized by the conditional probability

\begin{align} P(X_{n+1} = j | X_{n} = i, X_{n-1} = i_{n-1}, \cdots , X_{0} = i_{0}) \end{align}

This means that the future depends upon all of the former states in existence. Very complex.

But what if we explicitly assumed a simpler model? What if we said that the future is conditionally independent of the past, *given the present*? Then we could say

\begin{align} P(X_{n+1} = j | X_{n} = i, X_{n-1} = i_{n-1}, \cdots , X_{0} = i_{0}) &= P(X_{n+1} = j | X_{n} = i) \end{align}

That is, if this property holds, then once we know the current value $X_n$, everything else in the past is irrelevant. This is the **Markov assumption**; this is the **Markov Property**.

The transition probability in a Markov Chain is given by $P(X_{n+1} = j | X_n = i) = q_{ij}$. If the probability of transitioning from state $i$ to state $j$ does not depend on the time $n$, then we say the Markov Chain is *homogeneous*.

For this simple introduction, again, we will assume the Markov Chains to be homogeneous.

Here is a graphical example of a simple 4-state Markov Chain, listing all of its transition probabilities:

The *transition matrix* is then just the matrix representation of a Markov chain's transition probabilities, $Q = \left[ q_{ij} \right]$: row $i$ represents the current state, and column $j$ the next state. Since our introductory example is for a discrete, finite space, the elements of a row are the complete listing of the probability mass function of the next state given the current one, so each row sums to 1.

For the example above, that'd be

\begin{align} Q &= \begin{pmatrix} \frac{1}{3} & \frac{2}{3} & 0 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & 0 & 0 & 1 \\ \frac{1}{2} & 0 & \frac{1}{4} & \frac{1}{4} \\ \end{pmatrix} \end{align}

We can use a Markov chain as a model to describe a system evolving over time. The case of $P(X_{n+1} = j | X_n = i) = q_{ij}$ only looks at the present $X_n$, ignoring events of the past. Using such a *first-order* model might be too strong a starting assumption of conditional independence. We could, however, extend this model to $P(X_{n+1} = j | X_n = i, X_{n-1} = k)$, considering the last 2 states (second order), or even more past states.
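As a quick sketch, we can encode the example transition matrix in NumPy, check that each row is a valid PMF, and simulate a short run of the chain. (The starting state and random seed here are arbitrary choices for illustration.)

```python
import numpy as np

# Transition matrix for the 4-state example above (states labeled 0..3).
Q = np.array([
    [1/3, 2/3, 0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   1/4, 1/4],
])

# Each row is the PMF of the next state given the current one,
# so every row must sum to 1.
assert np.allclose(Q.sum(axis=1), 1.0)

# Simulate the chain: each step depends only on the current state
# (the Markov property), and the same Q is reused at every step
# (homogeneity). Starting in state 0 is an arbitrary choice.
rng = np.random.default_rng(0)
state = 0
path = [state]
for _ in range(10):
    state = rng.choice(4, p=Q[state])
    path.append(state)
print(path)
```

Note that the simulation never looks at `path[:-1]`; the row `Q[state]` is all it needs, which is exactly the Markov assumption in code.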

There is also a family of techniques called Markov Chain Monte Carlo (MCMC), where you synthesize your own Markov Chain to let you do large-scale simulations (computer-assisted, of course).

But originally, Markov came up with this idea to show that the Law of Large Numbers could be reconciled with relaxing the condition of *independent, identically distributed*: instead of completely ignoring the past states of the system, condition on the present state (first order) to predict the future. Of course, this could be extended to condition on states prior to the present as well. He wanted to go a step beyond *i.i.d.* His experiment was with the Russian language and its vowels and consonants, a chain of two states.

Suppose at time $n$, $X_n$ has distribution $\vec{s}$ (a $1 \times m$ row vector). Again, in our discrete, finite-state example, $\vec{s}$ is simply the PMF listed out, so its entries are non-negative and sum to 1.

*What is the PMF of $X_{n+1}$?*

- By the law of total probability, $P(X_{n+1} = j) = \sum_i s_i \, q_{ij}$, which is the $j$-th entry of $\vec{s}\,Q$. So row vector $\vec{s} \, Q$ is the distribution at time $n+1$.
- So row vector $\vec{s} \, Q^{2}$ is the distribution at time $n+2$ ($\vec{s}\,Q$ plays the role of $\vec{s}$).
- So row vector $\vec{s} \, Q^{3}$ is the distribution at time $n+3$ ($\vec{s}\,Q^2$ plays the role of $\vec{s}$).
- and so on, and so forth...
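The recursion above can be checked numerically. Here is a minimal sketch using the example transition matrix, with an assumed starting distribution concentrated on state 0:

```python
import numpy as np

Q = np.array([
    [1/3, 2/3, 0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   1/4, 1/4],
])

# Assumed starting distribution: the chain begins in state 0 for sure.
s = np.array([1.0, 0.0, 0.0, 0.0])

# One step of the recursion: the PMF at time n+1 is s Q.
s1 = s @ Q
# Two applications (s Q plays the role of s) ...
s2 = s1 @ Q
# ... which should agree with applying Q^2 directly.
s2_direct = s @ np.linalg.matrix_power(Q, 2)

assert np.allclose(s2, s2_direct)
assert np.isclose(s1.sum(), 1.0)  # each step yields a valid PMF
print(s1, s2)
```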

Similarly, we can think about $m$ steps into the future.

\begin{align}
P(X_{n+1} = j | X_n = i) &= q_{ij} &\text{ by definition; 1 step into the future } \\
P(X_{n+2} = j | X_n = i) &= \sum_k P(X_{n+2} = j | X_{n+1} = k, X_n = i) \, P(X_{n+1} = k | X_n = i) &\text{ 2 steps into the future } \\
&= \sum_k P(X_{n+2} = j | X_{n+1} = k \require{enclose} \, \enclose{horizontalstrike}{, X_n = i}) \, P(X_{n+1} = k | X_n = i) &\text{ only the present counts } \\
&= \sum_k q_{kj} \, q_{ik} \\
&= \sum_k q_{ik} \, q_{kj} &\text{ but this is the } (i,j) \text{ entry of } Q^2 \\
P(X_{n+m} = j | X_n = i) &= \left( Q^m \right)_{ij} &\text{ in general, } m \text{ steps into the future }
\end{align}

So you see that powers of the transition matrix give us a lot of information!
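The two-step identity $\sum_k q_{ik}\,q_{kj} = (Q^2)_{ij}$ is easy to verify numerically. A small sketch, again with the example matrix (the particular $(i, j)$ pair checked is arbitrary):

```python
import numpy as np

Q = np.array([
    [1/3, 2/3, 0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   1/4, 1/4],
])

Q2 = Q @ Q  # Q squared

# The (i, j) entry of Q^2 should equal sum_k q_ik q_kj -- the
# two-step transition probability derived above.
i, j = 0, 2
two_step = sum(Q[i, k] * Q[k, j] for k in range(4))
assert np.isclose(Q2[i, j], two_step)

# More generally, m-step transition probabilities are entries of Q^m.
Q3 = np.linalg.matrix_power(Q, 3)
assert np.allclose(3 * 0 + Q3, Q2 @ Q)
print(Q2[i, j])
```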

$\vec{s}$ is stationary for the Markov chain we are considering if

\begin{align} \vec{s} \, Q &= \vec{s} \end{align}

You should immediately recognize that $\vec{s}$ is a (left) eigenvector of transition matrix $Q$, with eigenvalue 1. This means that if the chain starts off in $\vec{s}$, after one step into the future it remains at $\vec{s}$. Another step into the future, and we *still* remain at $\vec{s}$, ad nauseam. If the chain starts off with the stationary distribution, it will have that distribution forever.
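Since $\vec{s}\,Q = \vec{s}$ says $\vec{s}$ is a left eigenvector of $Q$ with eigenvalue 1, one way to find it (a sketch, not the efficient method alluded to below) is to take an eigenvector of $Q^\mathsf{T}$ for eigenvalue 1 and normalize it into a PMF:

```python
import numpy as np

Q = np.array([
    [1/3, 2/3, 0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   1/4, 1/4],
])

# Left eigenvectors of Q are right eigenvectors of Q transposed.
vals, vecs = np.linalg.eig(Q.T)

# Pick the eigenvector whose eigenvalue is (numerically) 1 ...
k = np.argmin(np.abs(vals - 1.0))
s = np.real(vecs[:, k])
# ... and normalize it so its entries sum to 1, making it a PMF.
s = s / s.sum()

assert np.allclose(s @ Q, s)      # stationary: s Q = s
assert np.isclose(s.sum(), 1.0)   # a valid distribution
print(s)
```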

But that raises some questions:

- Does a stationary distribution exist? (all elements of $\vec{s}$ non-negative, summing to 1)
- Is it unique?
- Does the chain converge to $\vec{s}$?
- How can we compute it?

It turns out, under mild conditions on the chain, the answers to questions 1 through 3 are all *yes*.

Can we compute the stationary distribution *efficiently*? In certain examples of Markov chains, we can do so very easily and without using matrices!

But that is the topic of the next lecture.

View Lecture 31: Markov Chains | Statistics 110 on YouTube.