INFOTH Note 6: Kraft's Inequality, Shannon Code
6.1 Coding and Compression
Def: Code / Encoding
We define a coding/encoding map as follows:
$$ c:\mathcal X^n\to\{0,1\}^*\setminus\{\varepsilon\} $$where $\varepsilon$ denotes the empty string. We say this map is nonsingular iff it is injective.
The input is a sequence of source symbols (I call it a source word), and the output (image) is a codeword.
If $<$ denotes strict inclusion of code classes,
- Instantaneous codes $<$ Uniquely decodable codes $<$ Nonsingular codes $<$ All codes
- Nonsingular: injective (again)
- Uniquely decodable: every sequence formed by concatenating codewords can be decoded into only one source sequence.
- Equivalently, the “Kleene closure” of the map is also injective: an entire sequence concatenated from codewords can only be decoded into a unique sequence of source words.
- If $A\to0,B\to1,C\to01$, we cannot recover the original sequence from $0101$, even though this code is nonsingular.
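This ambiguity is easy to check mechanically; a minimal sketch using the mapping from the example above (the helper name `encode` is my own):

```python
# The nonsingular code A->0, B->1, C->01 is not uniquely decodable:
# two different source sequences encode to the same string "0101".
code = {"A": "0", "B": "1", "C": "01"}

def encode(seq, code):
    """Concatenate the codewords of a source sequence."""
    return "".join(code[s] for s in seq)

# Nonsingular: all codewords are distinct (the map itself is injective)...
assert len(set(code.values())) == len(code)
# ...but its extension to sequences is not injective:
assert encode("ABAB", code) == encode("CC", code) == "0101"
```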
We can view encoding as tracing paths in a binary tree.
Def: Multi-symbol Compression
$$ c(x_1,\dots,x_k)=c(x_{1:k})=[c(x_1),c(x_2),\dots,c(x_k)] $$defines the multi-symbol compression of a concatenated source word as the concatenation of the individual codewords.
Def: Code Extension
The extension of predefined $c$ is
$$ c^*:\mathcal X^*\setminus\{\varepsilon\}\to\{0,1\}^*\setminus\{\varepsilon\} $$- variable finite length
- If $\mathcal X=\{A,B\}$ then $\mathcal X^*=\{\varepsilon,A,B,AA,AB,BA,BB,AAA,AAB,ABA,\dots\}$
Def: Uniquely Decodable
$c$ is called uniquely decodable if its extension is nonsingular.
- UD -> Lossless Compression
Def: Prefix-free = Instantaneous
$c$ is called prefix-free or instantaneous if $\forall x\ne x'$, $c(x)$ is not a prefix of $c(x')$.
- UD but not PF codes:
- $A\to 1,B\to 10$
- $1\to0,2\to01,3\to11$
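The prefix-free property is a purely pairwise condition, so it can be tested directly; a small sketch (the function name is mine):

```python
def is_prefix_free(codewords):
    """True iff no codeword is a proper prefix of another codeword."""
    return not any(
        u != v and v.startswith(u) for u in codewords for v in codewords
    )

assert not is_prefix_free(["1", "10"])        # UD (suffix-free) but not PF
assert not is_prefix_free(["0", "01", "11"])  # UD but not PF
assert is_prefix_free(["0", "10", "110", "111"])
```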
Theorem: Kraft’s Inequality
For any prefix-free code over $\mathcal X$, i.e. $c:\mathcal X\to\{0,1\}^*\setminus\{\varepsilon\}$,
$$ \sum_{x\in\mathcal X}2^{-l(c(x))}\le 1 $$where $l(c(x))$ is the length of $c(x)$.
- The powers of two of the negative codeword lengths sum to at most 1.

We can construct a binary tree for every prefix-free code: each codeword corresponds to a leaf, one child edge appends 0 and the other appends 1, and the codeword is read off along the path from the root to that leaf.
In the picture, $0,10,110,111$ are the codewords of a prefix-free code.
Proof. If two sibling leaves sit at depth $h$, each contributing $2^{-h}$, they can be merged into their parent with combined weight $2^{-h+1}$; a lone leaf is merged upward keeping its weight $2^{-h}<2^{-h+1}$. Repeating this toward the root, the sum is maximized exactly when the tree is full (every internal node has 2 children), in which case the total at the root is 1; otherwise the sum is strictly smaller than 1.
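A numerical check of the theorem on the tree example, using exact fractions to avoid rounding (the helper name is mine):

```python
from fractions import Fraction

def kraft_sum(codewords):
    """Sum of 2^{-l(c(x))} over all codewords, computed exactly."""
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)

# The full tree {0, 10, 110, 111}: the Kraft sum is exactly 1.
assert kraft_sum(["0", "10", "110", "111"]) == 1
# Pruning a leaf leaves a non-full tree: the sum drops strictly below 1.
assert kraft_sum(["0", "10", "110"]) < 1
```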
6.2 Minimum Expected Length of the Code
For the space $\mathcal X$, every source word $x$ has a probability of appearance. So we want to minimize the expected length of the code over all prefix-free codes.
$$ \min_{c\in \text{PF}(\mathcal X)} \sum_{x\in\mathcal X}p(x)l(c(x)) $$
Converse Kraft’s Inequality
The original Kraft’s Ineq.: For a prefix-free code $c$, $\sum_x 2^{-l(c(x))}\le 1$
Converse:
Given a finite set $\mathcal X$ and integers $l(x)\ge 1$ for each $x\in\mathcal X$ such that
$$ \sum_x 2^{-l(x)}\le 1, $$then there exists $c:\mathcal X\to \{0,1\}^*\setminus\{\varepsilon\}$ with $l(c(x))=l(x),\forall x$, and $c$ is a prefix-free code.
In words: given a collection of values $l(x)$, i.e. a vector of positive integers, if it satisfies Kraft’s inequality then there exists a PF code with exactly these lengths.
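The converse has a constructive flavor: sort the lengths and assign codewords greedily, always taking the leftmost available node at each depth. A sketch of this standard canonical construction (the function name is mine):

```python
def canonical_code(lengths):
    """Build a prefix-free code with the given lengths (assumes Kraft holds)."""
    assert sum(2 ** -l for l in lengths) <= 1  # Kraft's inequality
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev_len     # descend to depth l, leftmost free node
        codewords.append(format(value, f"0{l}b"))
        value += 1                 # skip this node's entire subtree
        prev_len = l
    return codewords

assert canonical_code([1, 2, 3, 3]) == ["0", "10", "110", "111"]
```

Incrementing `value` before the next assignment steps past the whole subtree below the current codeword, which is exactly what guarantees the prefix-free property.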
Optimization
Converse Kraft -> We want to find such a collection of $l(x)$ to minimize the expected length (given a fixed $p$ distribution)
$$ \min_{(l(x),x\in\mathcal X,l\in\Z^+)}\sum_x p(x)l(x)\qquad\text{s.t.}\sum_x2^{-l(x)}\le 1 $$Note that the constraint $\sum_{x}2^{-l(x)}\le 1$ already rules out negative lengths: a single term with $l(x)<0$ would by itself make the sum greater than 1.
This problem is a difficult integer programming, so we do relaxation to real values.
Relaxation of Integer Programming
We relax the lengths to real values
$$ \min_{(l(x)\in\R,x\in\mathcal X)}\sum_xp(x)l(x)\qquad\text{s.t.}\sum_x2^{-l(x)}\le 1 $$KKT Conditions:
- Lagrangian and Stationarity
$$ L(\mathbf l,\lambda)=\sum_xp(x)l(x)+\lambda\left(\sum_{x\in\mathcal X}2^{-l(x)}-1\right),\qquad \nabla_{(l(x),x\in\mathcal X)} L=0 $$
- Primal Feasibility
$$ \sum_x 2^{-l(x)}\le 1 $$
- Dual Feasibility
$$ \lambda\ge 0 $$
- Complementary Slackness
$$ \lambda\left(\sum_x 2^{-l(x)}-1\right)=0 $$
(vector perspective)
$$ L=\mathbf p\cdot \mathbf l+\lambda\left(\sum_x 2^{-l(x)}-1\right)\implies \nabla_{\mathbf l}L=\mathbf p-\lambda\ln2\cdot 2^{-\mathbf l}=\mathbf 0\implies \mathbf l=-\log_2 \mathbf p+\log_2(\lambda\ln 2)\,\mathbf 1 $$(componentwise perspective)
For any $x$,
$$ \begin{aligned} \nabla_{l(x)}L=0&\implies \nabla_{l(x)}(p(x)l(x)+\lambda\cdot 2^{-l(x)})=0 \\&\implies p(x)-\underbrace{\lambda \ln 2}_{=1/c}\cdot 2^{-l(x)}=0 \\&\implies 2^{-l(x)}=cp(x) \\&\implies l(x)=-\log cp(x)=-\log c-\log p(x). \end{aligned} $$Since $\lambda>0$ at the optimum ($\lambda=0$ would force $p(x)=0$), complementary slackness gives $\sum_{x} 2^{-l(x)}=1$,
$$ \sum_x (2^{\log c+\log p(x)})=1\implies c\sum_x p(x)=1\implies c=1. $$Hence the optimal solution $l^*(x)$ is
$$ l^*(x)=-\log p(x) $$and we find that the expected length is the entropy of $p$
$$ \text{Objective}=E_p[l^*(X)]=\sum_x p(x)l^*(x)=H(p). $$
Coming back to IP
In integer programming i.e. the original problem, we can then find the solution:
$$ \hat l^*(x)=\lceil\log \frac1{p(x)}\rceil,\forall x\in\mathcal X $$where the ceiling makes the lengths integers while preserving primal feasibility (rounding up only shrinks each term $2^{-l(x)}$), and we call a code with this length collection a Shannon code.
Note that if we change the log base from 2 to $e$ or any other $D$, the solution still holds with $\log\equiv \log_D$.
6.3 Shannon Code
Def: Shannon Code $c_S$
Define on the finite set $\mathcal X$,
$$ l_S(x):=\lceil\log \frac1{p(x)}\rceil,\forall x\in\mathcal X $$which satisfies $\sum_x 2^{-l_S(x)}\le 1$, so by the converse of Kraft’s inequality there exists a prefix-free code
$$ \exists\, c_S:\mathcal X\to\{0,1\}^*\setminus\{\varepsilon\} $$with $l(c_S(x))=l_S(x),\forall x$.
We call such a code a Shannon code; it satisfies
$$ H(X)\le E[l(c_S(X))]= \sum_x p(x)l(c_S(x))=\sum_x p(x)\lceil \log\frac{1}{p(x)}\rceil\le H(X)+1, $$so its expected length is within one bit of the minimum.
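A quick check of the two-sided bound on a small example distribution (my own illustration; it is dyadic, so every quantity here is exact in floating point):

```python
import math

# Shannon-code lengths l_S(x) = ceil(log2(1/p(x))) for a toy distribution.
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
l_S = {x: math.ceil(math.log2(1 / px)) for x, px in p.items()}

assert sum(2.0 ** -l for l in l_S.values()) <= 1   # Kraft holds
H = -sum(px * math.log2(px) for px in p.values())  # entropy H(X)
EL = sum(px * l_S[x] for x, px in p.items())       # expected length
assert H <= EL <= H + 1                            # the Shannon-code bound
```

For a dyadic distribution the ceilings are exact, so here `EL` equals `H` (1.75 bits).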
If the entropy is small…
If $H(X)$ is small, say smaller than 1, a per-symbol code may be quite inefficient: the $+1$ overhead dominates the entropy term.
However, we can do this over blocks of symbols (concatenation)
Block Shannon Code
For a block of $k$ symbols drawn i.i.d. from the distribution $p$, apply the Shannon construction to the product distribution $p(\bold x^k)=\prod_{i=1}^kp(x_i)$:
$$ c_S^k:\mathcal X^k\to\{0,1\}^*\setminus\{\varepsilon\},\qquad l(c_S^k(\bold x^k))=\lceil\log\frac1{p(\bold x^k)}\rceil. $$Then
$$ E_p[l(c_S^k(\bold X^k))]=\sum_{\bold x^k}p(\bold x^k)l(c_S^k(\bold x^k))\le H(\bold X^k)+1=kH(X)+1. $$So the expected length per symbol is
$$ \frac1kE_p[l(c_S^k(\bold X^k))]=\frac1k\sum_{\bold x^k}p(\bold x^k)l(c_S^k(\bold x^k))\le H(X)+\frac1k, $$so we can use a long block to bring the per-symbol length arbitrarily close to the entropy.
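The shrinking $1/k$ overhead can be seen numerically for a low-entropy source; a sketch for an i.i.d. Bernoulli source (the parameter $q=0.1$ and the brute-force enumeration are my own illustration):

```python
import math
from itertools import product

q = 0.1  # Bernoulli source parameter (illustrative); H(X) ~ 0.469 bits < 1
H = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def per_symbol_length(k):
    """Expected block-Shannon-code length per symbol for blocks of size k."""
    total = 0.0
    for block in product([0, 1], repeat=k):
        pk = q ** sum(block) * (1 - q) ** (k - sum(block))
        total += pk * math.ceil(math.log2(1 / pk))
    return total / k

# The per-symbol rate stays within the bound H(X) + 1/k for every k.
for k in (1, 2, 4, 8):
    assert H <= per_symbol_length(k) <= H + 1 / k
```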
Compression Rate
We now (belatedly) define the compression rate as the expected number of bits per source symbol:
$$ CR(c)=E\left[{\text{\# of bits}\over\text{\# of symbols}}\right]=E_p[l(c(X))] $$- Allowing UD codes that are not prefix-free does not help:
- Theorem: No UD code has a better (smaller) compression rate than the best PF code.
Kraft–McMillan Inequality
UD codes also satisfy Kraft’s inequality; this conclusion extends the result from PF codes to all UD codes.
$$ \sum_{x\in\mathcal X}2^{-l(c(x))}\le 1 $$
Proof.
Consider a block of $k$ symbols and the length of its concatenated codeword.
Let $l_{\max}=\max_x l(c(x))$. Since $c(\bold x^k)=c(x_1)+c(x_2)+\dots+c(x_k)$ (concatenation), for any $k$-length block,
$$ \sum_{\bold x^k}2^{-l(c(\bold x^k))}=\sum_{x_1}\sum_{x_2}\cdots\sum_{x_k}2^{-l(c(x_1))}2^{-l(c(x_2))}\cdots2^{-l(c(x_k))} =\left(\sum_x 2^{-l(c(x))}\right)^{k} $$and we have
$$ \sum_{\bold x^k}2^{-l(c(\bold x^k))}=\sum_{l=1}^{k\cdot l_{\max}}2^{-l}N(l) $$where $N(l)$ is the number of blocks whose concatenated codeword has length $l$.
But unique decodability makes $\bold x^k\mapsto c(\bold x^k)$ injective, and the codewords of length $l$ live in $\{0,1\}^l$, which has size $2^l$, so $N(l)\le 2^l$; hence
$$ \sum_{\bold x^k}2^{-l(c(\bold x^k))}=\sum_{l=1}^{k\cdot l_{\max}}2^{-l}N(l)\le \sum_{l=1}^{k\cdot l_{\max}}1 =kl_{\max}. $$Hence
$$ \sum_x 2^{-l(c(x))} \le (k l_{\max})^{1/k},\qquad\forall k\in \Z^+. $$As $k\to \infty$, the RHS approaches 1, so
$$ \sum_x 2^{-l(c(x))}\le \inf_k(kl_{\max})^{1/k}=1. $$
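Kraft–McMillan is also useful contrapositively: a Kraft sum above 1 proves a code cannot be uniquely decodable, with no decoding argument needed. A quick check on the codes from 6.1 (the helper name is mine):

```python
def kraft_sum(codewords):
    """Sum of 2^{-l(c(x))} over all codewords."""
    return sum(2.0 ** -len(w) for w in codewords)

# UD but not prefix-free: satisfies the inequality, as the theorem requires.
assert kraft_sum(["1", "10"]) <= 1
# The nonsingular code A->0, B->1, C->01: Kraft sum 1.25 > 1,
# so by Kraft-McMillan it cannot be uniquely decodable.
assert kraft_sum(["0", "1", "01"]) > 1
```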