INFOTH Note 6: Kraft's Inequality, Shannon Code
6.1 Coding and Compression
Def: Code / Encoding
We define a coding/encoding map as follows:
$$ c:\mathcal X^n\to\{0,1\}^*\setminus\{\varepsilon\} $$where $\varepsilon$ denotes the empty string. We say this map is nonsingular iff it is injective.
The input is a sequence of source symbols (I call it a source word), and the output (image) is a codeword.
If $<$ denotes strict inclusion of code classes,
- Instantaneous codes $<$ Uniquely decodable codes $<$ Nonsingular codes $<$ All codes
- Nonsingular: injective (again)
- Uniquely decodable: every sequence formed by concatenating codewords can be decoded into only one source sequence.
- Equivalently, the “Kleene closure” of the map is also injective: an entire sequence concatenated from codewords can only be decoded into a unique sequence of source words.
- If $A\to0,B\to1,C\to01$, we cannot recover the original sequence from $0101$, even though this code is nonsingular.
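This ambiguity is easy to check mechanically; a minimal sketch using the mapping from the example above (the helper name `encode` is my own):

```python
# The nonsingular code A->0, B->1, C->01 is not uniquely decodable:
# two different source sequences encode to the same string "0101".
code = {"A": "0", "B": "1", "C": "01"}

def encode(seq, code):
    """Concatenate the codewords of a source sequence."""
    return "".join(code[s] for s in seq)

# Nonsingular: all codewords are distinct (the map itself is injective)...
assert len(set(code.values())) == len(code)
# ...but its extension to sequences is not injective:
assert encode("ABAB", code) == encode("CC", code) == "0101"
```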
We can view encoding as tracing paths in a binary tree.
Def: Multi-symbol Compression
$$ c(x_1,\dots,x_k)=c(x_{1:k})=[c(x_1),c(x_2),\dots,c(x_k)] $$defines the multi-symbol compression of a concatenated source word as the concatenation of the individual codewords.
Def: Code Extension
The extension of predefined $c$ is
$$ c^*:\mathcal X^*\setminus\{\varepsilon\}\to\{0,1\}^*\setminus\{\varepsilon\} $$- variable finite length
- If $\mathcal X=\{A,B\}$ then $\mathcal X^*=\{\varepsilon,A,B,AA,AB,BA,BB,AAA,AAB,ABA,\dots\}$
Def: Uniquely Decodable
$c$ is called uniquely decodable if its extension is nonsingular.
- UD -> Lossless Compression
Def: Prefix-free = Instantaneous
$c$ is called prefix-free or instantaneous if $\forall x\ne x'$, $c(x)$ is not a prefix of $c(x')$.
- UD but not PF codes:
- $A\to 1,B\to 10$
- $1\to0,2\to01,3\to11$
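The prefix-free property is a purely pairwise condition, so it can be tested directly; a small sketch (the function name is mine):

```python
def is_prefix_free(codewords):
    """True iff no codeword is a proper prefix of another codeword."""
    return not any(
        u != v and v.startswith(u) for u in codewords for v in codewords
    )

assert not is_prefix_free(["1", "10"])        # UD (suffix-free) but not PF
assert not is_prefix_free(["0", "01", "11"])  # UD but not PF
assert is_prefix_free(["0", "10", "110", "111"])
```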
Theorem: Kraft’s Inequality
For any prefix-free code over $\mathcal X$, i.e. $c:\mathcal X\to\{0,1\}^*\setminus\{\varepsilon\}$,
$$ \sum_{x\in\mathcal X}2^{-l(c(x))}\le 1 $$where $l(c(x))$ is the length of $c(x)$.
- The powers of two of the negative codeword lengths sum to at most 1.

We can construct a binary tree for every prefix-free code: each codeword corresponds to a leaf, one child edge appends 0 and the other appends 1, and the codeword is read off along the path from the root to that leaf.
In the picture, $0,10,110,111$ are the codewords of a prefix-free code.
Proof. If two sibling leaves sit at depth $h$, each contributing $2^{-h}$, they can be merged into their parent with combined weight $2^{-h+1}$; a lone leaf is merged upward keeping its weight $2^{-h}<2^{-h+1}$. Repeating this toward the root, the sum is maximized exactly when the tree is full (every internal node has 2 children), in which case the total at the root is 1; otherwise the sum is strictly smaller than 1.
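A numerical check of the theorem on the tree example, using exact fractions to avoid rounding (the helper name is mine):

```python
from fractions import Fraction

def kraft_sum(codewords):
    """Sum of 2^{-l(c(x))} over all codewords, computed exactly."""
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)

# The full tree {0, 10, 110, 111}: the Kraft sum is exactly 1.
assert kraft_sum(["0", "10", "110", "111"]) == 1
# Pruning a leaf leaves a non-full tree: the sum drops strictly below 1.
assert kraft_sum(["0", "10", "110"]) < 1
```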
6.2 Minimum Expected Length of the Code
For the space $\mathcal X$, every source word $x$ has a probability of appearance. So we want to minimize the expected length of the code over all prefix-free codes.
$$ \min_{c\in \text{PF}(\mathcal X)} \sum_{x\in\mathcal X}p(x)l(c(x)) $$
Converse Kraft’s Inequality
The original Kraft’s Ineq.: For a prefix-free code $c$, $\sum_x 2^{-l(c(x))}\le 1$
Converse:
Given a finite set $\mathcal X$ and integers $l(x)\ge 1$ for each $x\in\mathcal X$ such that
$$ \sum_x 2^{-l(x)}\le 1, $$then there exists $c:\mathcal X\to \{0,1\}^*\setminus\{\varepsilon\}$ with $l(c(x))=l(x),\forall x$, and $c$ is a prefix-free code.
In words: given a collection of values $l(x)$, i.e. a vector of positive integers, if it satisfies Kraft’s inequality then there exists a PF code with exactly these lengths.
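The converse has a constructive flavor: sort the lengths and assign codewords greedily, always taking the leftmost available node at each depth. A sketch of this standard canonical construction (the function name is mine):

```python
def canonical_code(lengths):
    """Build a prefix-free code with the given lengths (assumes Kraft holds)."""
    assert sum(2 ** -l for l in lengths) <= 1  # Kraft's inequality
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev_len     # descend to depth l, leftmost free node
        codewords.append(format(value, f"0{l}b"))
        value += 1                 # skip this node's entire subtree
        prev_len = l
    return codewords

assert canonical_code([1, 2, 3, 3]) == ["0", "10", "110", "111"]
```

Incrementing `value` before the next assignment steps past the whole subtree below the current codeword, which is exactly what guarantees the prefix-free property.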
Optimization
Converse Kraft -> We want to find such a collection of $l(x)$ to minimize the expected length (given a fixed $p$ distribution)
$$ \min_{(l(x),x\in\mathcal X,l\in\Z^+)}\sum_x p(x)l(x)\qquad\text{s.t.}\sum_x2^{-l(x)}\le 1 $$Note that the constraint $\sum_{x}2^{-l(x)}\le 1$ already rules out negative lengths: a single term with $l(x)<0$ would by itself make the sum greater than 1.
This problem is a difficult integer programming, so we do relaxation to real values.
Relaxation of Integer Programming
We relax the lengths to real values
$$ \min_{(l(x)\in\R,x\in\mathcal X)}\sum_xp(x)l(x)\qquad\text{s.t.}\sum_x2^{-l(x)}\le 1 $$KKT Conditions:
- Lagrangian and Stationarity
$$ L(\mathbf l,\lambda)=\sum_xp(x)l(x)+\lambda\left(\sum_{x\in\mathcal X}2^{-l(x)}-1\right),\qquad \nabla_{(l(x),x\in\mathcal X)} L=0 $$
- Primal Feasibility
$$ \sum_x 2^{-l(x)}\le 1 $$
- Dual Feasibility
$$ \lambda\ge 0 $$
- Complementary Slackness
$$ \lambda\left(\sum_x 2^{-l(x)}-1\right)=0 $$
(vector perspective)
$$ L=\mathbf p\cdot \mathbf l+\lambda\left(\sum_x 2^{-l(x)}-1\right)\implies \nabla_{\mathbf l}L=\mathbf p-\lambda\ln2\cdot 2^{-\mathbf l}=\mathbf 0\implies \mathbf l=-\log_2 \mathbf p+\log_2(\lambda\ln 2)\,\mathbf 1 $$(componentwise perspective)
For any $x$,
$$ \begin{aligned} \nabla_{l(x)}L=0&\implies \nabla_{l(x)}(p(x)l(x)+\lambda\cdot 2^{-l(x)})=0 \\&\implies p(x)-\underbrace{\lambda \ln 2}_{=1/c}\cdot 2^{-l(x)}=0 \\&\implies 2^{-l(x)}=cp(x) \\&\implies l(x)=-\log cp(x)=-\log c-\log p(x). \end{aligned} $$Since $\lambda>0$ at the optimum ($\lambda=0$ would force $p(x)=0$), complementary slackness gives $\sum_{x} 2^{-l(x)}=1$,
$$ \sum_x (2^{\log c+\log p(x)})=1\implies c\sum_x p(x)=1\implies c=1. $$Hence the optimal solution $l^*(x)$ is
$$ l^*(x)=-\log p(x) $$and we find that the expected length is the entropy of $p$
$$ \text{Objective}=E_p[l^*(X)]=\sum_x p(x)l^*(x)=H(p). $$
Coming back to IP
In integer programming i.e. the original problem, we can then find the solution:
$$ \hat l^*(x)=\lceil\log \frac1{p(x)}\rceil,\forall x\in\mathcal X $$where the ceiling makes the lengths integers while preserving primal feasibility (rounding up only shrinks each term $2^{-l(x)}$), and we call a code with this length collection a Shannon code.
Note that if we change the log base from 2 to $e$ or any other $D$, the solution still holds with $\log\equiv \log_D$.
6.3 Shannon Code
Def: Shannon Code $c_S$
Define on the finite set $\mathcal X$,
$$ l_S(x):=\lceil\log \frac1{p(x)}\rceil,\forall x\in\mathcal X $$which satisfies $\sum_x 2^{-l_S(x)}\le 1$, so by the converse of Kraft’s inequality there exists a prefix-free code
$$ \exists\, c_S:\mathcal X\to\{0,1\}^*\setminus\{\varepsilon\} $$with $l(c_S(x))=l_S(x),\forall x$.
We call such a code a Shannon code; it satisfies
$$ H(X)\le E[l(c_S(X))]= \sum_x p(x)l(c_S(x))=\sum_x p(x)\lceil \log\frac{1}{p(x)}\rceil\le H(X)+1, $$so its expected length is within one bit of the minimum.
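A quick check of the two-sided bound on a small example distribution (my own illustration; it is dyadic, so every quantity here is exact in floating point):

```python
import math

# Shannon-code lengths l_S(x) = ceil(log2(1/p(x))) for a toy distribution.
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
l_S = {x: math.ceil(math.log2(1 / px)) for x, px in p.items()}

assert sum(2.0 ** -l for l in l_S.values()) <= 1   # Kraft holds
H = -sum(px * math.log2(px) for px in p.values())  # entropy H(X)
EL = sum(px * l_S[x] for x, px in p.items())       # expected length
assert H <= EL <= H + 1                            # the Shannon-code bound
```

For a dyadic distribution the ceilings are exact, so here `EL` equals `H` (1.75 bits).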
If the entropy is small…
If $H(X)$ is small, say smaller than 1, a per-symbol code may be quite inefficient: the $+1$ overhead dominates the entropy term.
However, we can do this over blocks of symbols (concatenation)
Block Shannon Code
For a block of $k$ symbols drawn i.i.d. from the distribution $p$, apply the Shannon construction to the product distribution $p(\bold x^k)=\prod_{i=1}^kp(x_i)$:
$$ c_S^k:\mathcal X^k\to\{0,1\}^*\setminus\{\varepsilon\},\qquad l(c_S^k(\bold x^k))=\lceil\log\frac1{p(\bold x^k)}\rceil. $$Then
$$ E_p[l(c_S^k(\bold X^k))]=\sum_{\bold x^k}p(\bold x^k)l(c_S^k(\bold x^k))\le H(\bold X^k)+1=kH(X)+1. $$So the expected length per symbol is
$$ \frac1kE_p[l(c_S^k(\bold X^k))]=\frac1k\sum_{\bold x^k}p(\bold x^k)l(c_S^k(\bold x^k))\le H(X)+\frac1k, $$so we can use a long block to bring the per-symbol length arbitrarily close to the entropy.
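The shrinking $1/k$ overhead can be seen numerically for a low-entropy source; a sketch for an i.i.d. Bernoulli source (the parameter $q=0.1$ and the brute-force enumeration are my own illustration):

```python
import math
from itertools import product

q = 0.1  # Bernoulli source parameter (illustrative); H(X) ~ 0.469 bits < 1
H = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def per_symbol_length(k):
    """Expected block-Shannon-code length per symbol for blocks of size k."""
    total = 0.0
    for block in product([0, 1], repeat=k):
        pk = q ** sum(block) * (1 - q) ** (k - sum(block))
        total += pk * math.ceil(math.log2(1 / pk))
    return total / k

# The per-symbol rate stays within the bound H(X) + 1/k for every k.
for k in (1, 2, 4, 8):
    assert H <= per_symbol_length(k) <= H + 1 / k
```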
Compression Rate
We now (belatedly) define the compression rate as the expected number of bits per source symbol:
$$ CR(c)=E\left[{\text{\# of bits}\over\text{\# of symbols}}\right]=E_p[l(c(X))] $$- Allowing UD codes that are not prefix-free does not help:
- Theorem: No UD code has a better (smaller) compression rate than the best PF code.
Kraft–McMillan Inequality
UD codes also satisfy Kraft’s inequality; this conclusion extends the result from PF codes to all UD codes.
$$ \sum_{x\in\mathcal X}2^{-l(c(x))}\le 1 $$
Proof.
Consider a block of $k$ symbols and the length of its concatenated codeword.
Let $l_{\max}=\max_x l(c(x))$. Since $c(\bold x^k)=c(x_1)+c(x_2)+\dots+c(x_k)$ (concatenation), for any $k$-length block,
$$ \sum_{\bold x^k}2^{-l(c(\bold x^k))}=\sum_{x_1}\sum_{x_2}\cdots\sum_{x_k}2^{-l(c(x_1))}2^{-l(c(x_2))}\cdots2^{-l(c(x_k))} =\left(\sum_x 2^{-l(c(x))}\right)^{k} $$and we have
$$ \sum_{\bold x^k}2^{-l(c(\bold x^k))}=\sum_{l=1}^{k\cdot l_{\max}}2^{-l}N(l) $$where $N(l)$ is the number of blocks whose concatenated codeword has length $l$.
But unique decodability makes $\bold x^k\mapsto c(\bold x^k)$ injective, and the codewords of length $l$ live in $\{0,1\}^l$, which has size $2^l$, so $N(l)\le 2^l$; hence
$$ \sum_{\bold x^k}2^{-l(c(\bold x^k))}=\sum_{l=1}^{k\cdot l_{\max}}2^{-l}N(l)\le \sum_{l=1}^{k\cdot l_{\max}}1 =kl_{\max}. $$Hence
$$ \sum_x 2^{-l(c(x))} \le (k l_{\max})^{1/k},\qquad\forall k\in \Z^+. $$As $k\to \infty$, the RHS approaches 1, so
$$ \sum_x 2^{-l(c(x))}\le \inf_k(kl_{\max})^{1/k}=1. $$
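Kraft–McMillan is also useful contrapositively: a Kraft sum above 1 proves a code cannot be uniquely decodable, with no decoding argument needed. A quick check on the codes from 6.1 (the helper name is mine):

```python
def kraft_sum(codewords):
    """Sum of 2^{-l(c(x))} over all codewords."""
    return sum(2.0 ** -len(w) for w in codewords)

# UD but not prefix-free: satisfies the inequality, as the theorem requires.
assert kraft_sum(["1", "10"]) <= 1
# The nonsingular code A->0, B->1, C->01: Kraft sum 1.25 > 1,
# so by Kraft-McMillan it cannot be uniquely decodable.
assert kraft_sum(["0", "1", "01"]) > 1
```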