In graduate school I came up with an original proof of Kraft's inequality. The proofs I could find in textbooks tended to be more complicated and less intuitive to me, so I'd like to share my proof here.

Prefix Codes

First it is important to understand the concept of prefix codes. Prefix codes have nice properties and are widely used in applications such as JPEG encoding for images and MP3 for music. A prefix code is one in which no codeword is a prefix of any other codeword. For instance, suppose we'd like to encode the letters $\{a, b, c, d\}$ in binary, i.e. using the alphabet $\{0, 1\}$. Then one possible coding scheme is as follows:

\begin{equation} \left\{ \begin{aligned} a & \rightarrow 0 \\ b & \rightarrow 01 \\ c & \rightarrow 011 \\ d & \rightarrow 0111 \\ \end{aligned} \right. \end{equation}

This code is uniquely decodable, but it is not a prefix code because the codeword for $a$ is a prefix of the codeword for $b$. This means that we cannot instantaneously decode $a$ without waiting for the next bit of data (to determine whether it is actually $a$ or just the first half of $b$).

Alternatively, we can encode using a prefix code as follows:

\begin{equation} \left\{ \begin{aligned} a & \rightarrow 0 \\ b & \rightarrow 10 \\ c & \rightarrow 110 \\ d & \rightarrow 111 \\ \end{aligned} \right. \end{equation}

As you can see, no codeword is a prefix of another codeword. Naturally we'd like to know how to construct prefix codes efficiently, and what the fundamental constraints on the lengths of the codewords are.
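To make the prefix property and instantaneous decoding concrete, here is a small Python sketch (the function names are my own, purely for illustration) that checks whether a set of codewords forms a prefix code, and decodes a bit stream symbol by symbol:

```python
def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

def decode(bits, code):
    """Decode a bit string with a prefix code; each codeword is
    recognized as soon as its last bit arrives."""
    inverse = {w: sym for sym, w in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:            # no codeword extends another,
            out.append(inverse[buf])  # so we can emit immediately
            buf = ""
    return "".join(out)

non_prefix = {"a": "0", "b": "01", "c": "011", "d": "0111"}
prefix     = {"a": "0", "b": "10", "c": "110", "d": "111"}

print(is_prefix_code(non_prefix.values()))  # False
print(is_prefix_code(prefix.values()))      # True
print(decode("010110111", prefix))          # "abcd"
```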

Kraft's Inequality

Kraft's inequality states that given a list of positive integers $(n_1, n_2, \dots n_r)$, there exists a prefix code with codewords $(\sigma_1, \sigma_2, \dots \sigma_r)$ satisfying $|\sigma_i| = n_i$ for all $i$, if and only if

\begin{equation} \sum_{i=1}^{r} s^{-n_i} \leq 1 \end{equation}

where $s$ is the size of the alphabet $S$.

In our previous example, the alphabet is simply the binary digits $S = \{0, 1\}$, and therefore the size of the alphabet $s = 2$. We can easily check that indeed the inequality holds:

\begin{equation} 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 \leq 1 \end{equation}
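This is straightforward to verify with a couple of lines of Python (a throwaway sketch, just to confirm the arithmetic):

```python
# Kraft sum for codeword lengths (1, 2, 3, 3) over a binary alphabet.
lengths = [1, 2, 3, 3]
s = 2
kraft_sum = sum(s ** -n for n in lengths)
print(kraft_sum)  # 1.0, so the inequality holds with equality
```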

Simple Proof

First of all, without loss of generality, we can order the $n_i$ so that

\begin{equation} n_1 \leq n_2 \leq \dots \leq n_r \end{equation}

Now we try to construct a prefix code in increasing order of length, i.e., for $i = 1, 2, \dots r$. There exists a prefix code if and only if at each step $j$ there is at least one codeword to choose that does not contain any of the previous $j - 1$ codewords as a prefix. In other words, the codeword $\sigma_j$ must not contain any of the previous codewords $(\sigma_1, \sigma_2, \dots \sigma_{j-1})$ as a prefix. For $\sigma_j$, there are $s^{n_j}$ possible codewords to choose from if there were no prefix constraint. But due to the codeword chosen at a prior step $k < j$, $s^{n_j - n_k}$ of those codewords are forbidden because they contain $\sigma_k$ as a prefix. Furthermore, the sets of codewords forbidden by different prior steps are disjoint: if two prior steps both forbade the same codeword at step $j$, then the shorter of their two codewords would be a prefix of the longer one, contradicting the prefix property of the codewords chosen so far.

Therefore we know the total number of forbidden codewords at step $j$ is

\begin{equation} \sum_{i=1}^{j-1} s^{n_j - n_i} \end{equation}

There exists a prefix code if and only if we have a codeword to choose at every step, namely,

\begin{equation} s^{n_j} > \sum_{i=1}^{j-1} s^{n_j - n_i}, \forall j = 2, 3, \dots r \end{equation}

Since every term above is an integer, the strict inequality is equivalent to the following, where the $+1$ is absorbed into the sum as the $i = j$ term (note that $s^{n_j - n_j} = 1$), and where the case $j = 1$ holds trivially:

\begin{equation} s^{n_j} \geq \sum_{i=1}^{j-1} s^{n_j - n_i} + 1 = \sum_{i=1}^{j} s^{n_j - n_i}, \forall j = 1, 2, \dots r \end{equation}

Dividing both sides by $s^{n_j}$, we get

\begin{equation} 1 \geq \sum_{i=1}^{j} s^{- n_i}, \forall j = 1, 2, \dots r \end{equation}

Now, since every term above is positive, the $j = r$ case implies all the others, so this family of inequalities is equivalent to Kraft's inequality

\begin{equation} \sum_{i=1}^{r} s^{- n_i} \leq 1 \end{equation}

Notice that every argument in the proof goes both ways. Thus this proves that Kraft's inequality is both a necessary and sufficient condition for the existence of a prefix code.
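The proof is also constructive, so it translates directly into a greedy algorithm: sort the lengths, then at each step pick any string of the required length that has no earlier codeword as a prefix. Here is a Python sketch of that idea (my own illustrative code; the brute-force scan is exponential in the codeword length, so it is only suitable for small examples):

```python
from itertools import product

def construct_prefix_code(lengths, alphabet="01"):
    """Greedy construction mirroring the proof: sort the lengths,
    then at each step take the first string of the required length
    that has no earlier codeword as a prefix. Returns None exactly
    when the construction gets stuck, i.e. the Kraft sum exceeds 1."""
    chosen = []
    for n in sorted(lengths):
        for candidate in product(alphabet, repeat=n):
            word = "".join(candidate)
            if not any(word.startswith(c) for c in chosen):
                chosen.append(word)
                break
        else:  # every string of length n is forbidden
            return None
    return chosen

print(construct_prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(construct_prefix_code([1, 1, 2]))     # None: 1/2 + 1/2 + 1/4 > 1
```

Running it on the lengths $(1, 2, 3, 3)$ reproduces the prefix code from the introduction.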

McMillan's Inequality

Interestingly, Kraft's inequality can be shown to hold for all uniquely decodable codes, not just prefix codes. This result is also known as the Kraft–McMillan inequality. To see this, consider a sequence of $k$ codewords from any uniquely decodable code. For instance, for the uniquely decodable code

\begin{equation} \left\{ \begin{aligned} a & \rightarrow 0 \\ b & \rightarrow 01 \\ c & \rightarrow 011 \\ \end{aligned} \right. \end{equation}

all possible sequences of $k$ codewords for $k = 2$ would be

\begin{equation} \left\{ \begin{aligned} aa & \rightarrow 00 \\ ab & \rightarrow 001 \\ ac & \rightarrow 0011 \\ ba & \rightarrow 010 \\ bb & \rightarrow 0101 \\ bc & \rightarrow 01011 \\ ca & \rightarrow 0110 \\ cb & \rightarrow 01101 \\ cc & \rightarrow 011011 \\ \end{aligned} \right. \end{equation}

Now consider the following expression

\begin{equation} \left(\sum_{i=1}^{r} s^{- n_i}\right)^k \end{equation}

Expanding the power and grouping terms by the total length of the $k$ concatenated codewords, this is equal to

\begin{equation} \sum_{j=k \cdot n_1}^{k \cdot n_r} C_j s^{-j} \end{equation}

where $C_j$ is the number of combinations of $k$ codewords that have a total length of $j$. We can see that $j$ must be at least $k \cdot n_1$ and at most $k \cdot n_r$. Since there are only $s^j$ distinct sequences of length $j$, and unique decodability requires that no two combinations of codewords concatenate to the same sequence, we must have

\begin{equation} C_j \leq s^j \end{equation}
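We can verify this counting argument on the example code above with a short Python sketch (illustrative only), enumerating all $k = 2$ sequences and tallying $C_j$:

```python
from itertools import product
from collections import Counter

# For the uniquely decodable code {a: 0, b: 01, c: 011}, enumerate all
# k = 2 codeword sequences and count C_j, the number of sequences whose
# concatenation has total length j. Unique decodability forces C_j <= s^j.
code = {"a": "0", "b": "01", "c": "011"}
k, s = 2, 2

C = Counter(len("".join(code[sym] for sym in seq))
            for seq in product(code, repeat=k))
for j in sorted(C):
    print(j, C[j], s ** j)  # here C_j stays well below s^j
```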

Therefore we now have

\begin{aligned} \left(\sum_{i=1}^{r} s^{- n_i}\right)^k & = \sum_{j=k \cdot n_1}^{k \cdot n_r} C_j s^{-j} \\ & \leq \sum_{j=k \cdot n_1}^{k \cdot n_r} s^j s^{-j} \\ & = k (n_r - n_1) + 1 \end{aligned}

which means

\begin{equation} \sum_{i=1}^{r} s^{- n_i} \leq \left(k (n_r - n_1) + 1\right)^{1/k} \end{equation}

Since this is true for any $k$, we can let $k$ go to infinity: the right-hand side $\left(k (n_r - n_1) + 1\right)^{1/k}$ tends to $1$, so the code must satisfy Kraft's inequality.
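A quick numerical check (an illustrative sketch, taking $n_r - n_1 = 2$ as in the example code above) shows the bound creeping down toward $1$:

```python
# The bound (k (n_r - n_1) + 1)^(1/k) tends to 1 as k grows.
for k in (1, 10, 100, 1000, 10000):
    print(k, (k * 2 + 1) ** (1 / k))
```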

Shannon Entropy

Kraft's inequality can be used to show that any uniquely decodable code must have an expected length greater than or equal to the Shannon entropy. If the codeword $\sigma_i$ occurs with probability $p_i$, then the Shannon entropy is given by

\begin{equation} \sum_{i=1}^{r} p_i \log_s \frac{1}{p_i} \end{equation}

To see that the expected code length is bounded below by the Shannon entropy, consider the difference between the Shannon entropy and the expected length of the code:

\begin{aligned} \sum_{i=1}^{r} p_i \log_s \frac{1}{p_i} - \sum_{i=1}^{r} p_i n_i & = \sum_{i=1}^{r} p_i \log_s \frac{s^{-n_i}}{p_i} \\ & \leq \log_s \left( \sum_{i=1}^{r} p_i \frac{s^{-n_i}}{p_i} \right) \\ & = \log_s \left( \sum_{i=1}^{r} s^{-n_i} \right) \\ & \leq \log_s 1 = 0 \end{aligned}

where the first inequality follows from Jensen's inequality (using the fact that $\log_s$ is a concave function) and the last step applies Kraft's inequality.
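As a sanity check, here is a Python sketch comparing the two quantities for the prefix code from the beginning of the post, with illustrative probabilities chosen as $p_i = 2^{-n_i}$ (my own choice, not part of the derivation); in this case the expected length meets the entropy bound exactly:

```python
from math import log

# Illustrative probabilities p_i = 2^(-n_i) for the prefix code
# {a: 0, b: 10, c: 110, d: 111}; not taken from the derivation.
code_lengths = {"a": 1, "b": 2, "c": 3, "d": 3}
probs        = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
s = 2

entropy  = sum(p * log(1 / p, s) for p in probs.values())
expected = sum(probs[x] * code_lengths[x] for x in probs)
print(entropy, expected)  # 1.75 1.75: expected length meets the bound
```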