INFOTH Note 5: Fano Ineq., AEP

Fano’s Inequality, Markov Chain, Entropy Rate, Asymptotic Equipartition Property and Lossless Compression.


5.1 Fano’s Inequality

Fano's inequality addresses the estimation scenario: how well can $X$ be estimated from an observation $Y$?

Setup

Markov Chain

Given random variables $X\in\cal X$ and $Y\in\cal Y$, an estimator of $X$ from $Y$ is $\hat X=\hat X(Y)$ (the estimate is determined by $Y$ alone).

Note that $X\to Y\to \hat X$ forms a Markov chain, because

$$ P(\hat X=x|Y=y,X=z)=P(\hat X=x|Y=y) $$

Probability of Error

The error as an event: $\hat X\ne X$

The probability of error is defined as:

$$ p_e=P(\hat X\ne X) $$

Binary Entropy

For a probability value $p$, the binary entropy is

$$ h(p)=-p\log p-(1-p)\log (1-p) $$

which is a concave function of $p$ with maximum value $1$ (in bits), attained at $p=1/2$.

Theorem

$$ h(p_e)+p_e\log_2(|{\cal X}|-1)\ge H(X|Y) $$

Intuition / Weaker Propositions (Corollary)

$H(X|Y)$ gives a lower bound on $p_e$.

From the definition of entropy: if $H(X|Y)$ is large, then $X$ remains highly uncertain even given $Y$. In that case $Y$ is simply not a good observation of $X$, so no estimator built from $Y$ can be good.

This ineq. can be weakened into

$$ \begin{aligned} 1+p_e\log|{\cal X}|&\ge 1+p_e\log(|{\cal X}|-1)\\&\ge h(p_e)+p_e\log_2(|{\cal X}|-1)\\& \ge H(X|Y), \end{aligned} $$

which is equivalent to this corollary:

$$ p_e\ge {H(X|Y)-1\over \log |{\cal X}|}. $$
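
As a quick numerical sanity check (not from the original notes), the following Python sketch evaluates both sides of Fano's inequality and of the corollary for the MAP estimator under a made-up joint distribution; the matrix `p_xy` and the alphabet sizes are arbitrary illustrative choices.

```python
import numpy as np

def binary_entropy(p):
    """h(p) = -p log p - (1 - p) log(1 - p), in bits; h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

# Hypothetical joint distribution P(X = x, Y = y), |X| = 3, |Y| = 2 (illustration only).
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.15, 0.15]])
p_y = p_xy.sum(axis=0)

# H(X|Y) = sum_y P(y) * H(X | Y = y)
h_x_given_y = 0.0
for j, py in enumerate(p_y):
    cond = p_xy[:, j] / py
    h_x_given_y += py * sum(-c * np.log2(c) for c in cond if c > 0)

# MAP estimator x_hat(y) = argmax_x P(x|y); its error probability p_e.
p_e = 1.0 - sum(p_xy[:, j].max() for j in range(p_xy.shape[1]))

card_x = p_xy.shape[0]
fano_rhs = binary_entropy(p_e) + p_e * np.log2(card_x - 1)  # upper bound on H(X|Y)
corollary = (h_x_given_y - 1) / np.log2(card_x)             # lower bound on p_e

print(f"H(X|Y) = {h_x_given_y:.4f} <= {fano_rhs:.4f}  (Fano)")
print(f"p_e    = {p_e:.4f} >= {corollary:.4f}  (corollary)")
```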


Proof

Define a random variable $E=\bold 1[\hat X\ne X]=\begin{cases}1,&\hat X\ne X,\\0,&\hat X = X.\end{cases}$ Note that $H(E)=h(p_e)$.

Then from the chain rule,

$$ H(E,X|\hat X)=H(E|\hat X)+H(X|E,\hat X). $$

From the def. of $E$,

$$ \begin{aligned} H(X|E,\hat X)&=\sum_{e\in\{0,1\}}P(E=e)H(X|E=e,\hat X)\\ &=p_eH(X|E=1,\hat X)+(1-p_e)H(X|E=0,\hat X) \end{aligned} $$

From the defined cases of $E$, when $E=0$, we have $X=\hat X$, so

$$ H(X|E=0,\hat X)=0 \implies H(X|E,\hat X)=p_eH(X|E=1,\hat X) $$

Then

$$ \begin{aligned} H(E,X|\hat X)&=H(E|\hat X)+H(X|E,\hat X)\\ &\le H(E)+\underbrace{p_eH(X|E=1,\hat X)}_{H(X|E,\hat X)}\\ &= \underbrace{h(p_e)}_{H(E)} +p_eH(X|E=1,\hat X) \end{aligned} $$

Every distribution over an alphabet of size $m$ has entropy at most $\log m$; moreover, when $E=1$ we have $X\ne \hat X$, so given $\hat X$, $X$ ranges over at most $|{\cal X}|-1$ values. Hence

$$ H(X|E=1,\hat X)\le \log(|{\cal X}|-1) $$

Substituting this bound gives:

$$ H(E,X|\hat X)\le h(p_e)+p_e\log(|{\cal X}|-1) $$

Also from the chain rule:

$$ H(E,X|\hat X)=H(X|\hat X)+H(E|X,\hat X) \ge H(X|\hat X) $$

From the Markov chain $X\to Y\to \hat X$ and the data processing inequality:

$$ H(X|Y)\le H(X|\hat X) $$

so

$$ \begin{aligned} H(X|Y)&\le H(X|\hat X) \\&\le H(E,X|\hat X) \\&\le h(p_e)+p_e\log(|{\cal X}|-1)\qquad\square \end{aligned} $$

Q.E.D.

Stronger Proposition (from the Proof)

The proof also yields the following stronger variant of Fano's inequality:

$$ H(X|\hat X)\le h(p_e)+p_e\log(|{\cal X}|-1) $$

5.2 Markov Chain & Stationarity

Def: Stationary Sequences

Consider a sequence of random variables:

$$ \dots,X_{-2},X_{-1},X_0,X_1,\dots $$

We say the sequence is stationary if the joint distribution of any $N$ consecutive r.v.s is invariant under time shifts:

$$ P(X_0=x_0,\dots,X_{N-1}=x_{N-1})=P(X_k=x_0,\dots,X_{k+N-1}=x_{N-1}) $$

$\forall k\in\Z,\forall N\ge 1$.

(Note that time-homogeneity, i.e. time-invariance, refers to the shift-invariance of the conditional (transition) probabilities only, which is not the same as stationarity.)

Lemma

For a stationary process, entropies (and conditional entropies) are invariant under any shift $k$:

$$ H(X_{n_1},\dots,X_{n_m}|X_{p_1},\dots, X_{p_q}) =H(X_{n_1+k},\dots,X_{n_m+k}|X_{p_1+k},\dots, X_{p_q+k}). $$

5.3 Entropy Rate

Def: Entropy Rate of a Stochastic Process

The entropy rate of a stochastic process $\mathbb X=(X_n,n\in\Z)$ is defined as

$$ H(\mathbb X)=\lim_{n\to\infty}\frac1n H(X_{1:n}) $$

when the limit exists; and if $\mathbb X$ is stationary,

$$ H(\mathbb X)=\lim_{n\to \infty} H(X_{n+1}|X_{1:n}). $$

(since, by the Cesàro mean theorem, if $a_n\to a$ then the running averages $\frac1n\sum_{k=1}^n a_k$ converge to the same limit $a$)

This can also be written as $H(X_0|X_{-\infty}^{-1})$ or $H(X_0|X_{-\infty:-1})$. Here both $X_{m:n}$ and $X_m^n$ denote $(X_m,\dots,X_n)$.

The unit is bits per symbol when the logarithm has base 2.

In a stationary sequence (stationary process), consider 3 random variables:

$$ H(X_0,X_1,X_2)=H(X_0)+H(X_1|X_0)+H(X_2|X_1,X_0) $$

and $H(X_1|X_0)=H(X_0|X_{-1})\le H(X_0)$, $H(X_2|X_1,X_0)=H(X_0|X_{-1},X_{-2})\le H(X_0|X_{-1})$.

Similarly, we can prove that for any $k_1>k_2$,

$$ H(X_0|X_{-1},\dots,X_{-k_1})\le H(X_0|X_{-1},\dots,X_{-k_2}), $$

i.e., conditioning on a longer past never increases the conditional entropy, which gives the following monotonicity conclusion.

Monotonicity of Running Average

Conclusion. For a stationary stochastic process, ${H(X_0,\dots,X_{n-1})\over n}$ is a nonincreasing sequence of nonnegative numbers; hence the limit (the entropy rate) exists, since the sequence is monotone and bounded below by $0$.
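
A standard argument for this monotonicity, filled in here for completeness: by the chain rule and stationarity,

$$ H(X_{0:n-1})=\sum_{k=0}^{n-1}H(X_k|X_{0:k-1})\ge n\,H(X_{n-1}|X_{0:n-2}), $$

since each term satisfies $H(X_k|X_{0:k-1})=H(X_{n-1}|X_{n-1-k:n-2})\ge H(X_{n-1}|X_{0:n-2})$ (shift by stationarity, then condition on more). Hence

$$ \frac{H(X_{0:n})}{n+1}=\frac{H(X_{0:n-1})+H(X_n|X_{0:n-1})}{n+1}\le\frac{H(X_{0:n-1})+\frac1n H(X_{0:n-1})}{n+1}=\frac{H(X_{0:n-1})}{n}, $$

using $H(X_n|X_{0:n-1})\le H(X_n|X_{1:n-1})=H(X_{n-1}|X_{0:n-2})\le\frac1n H(X_{0:n-1})$.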

Cases of Entropy Rate

Typewriter

Consider the case of a typewriter that has $m$ equally likely output letters. The typewriter can produce $m^n$ sequences of length $n$, all of them equally likely.

For any $n$, $H(X_1,\dots,X_n)=\log m^n$ (we can view the sequence as one variable!) and the entropy rate is $H(\mathbb X)=\log m$ bits per symbol.

IID Case

If $\mathbb X=(X_n,n\in\Z)$ is an i.i.d. stationary sequence (i.i.d. means the variables in the sequence are independent and identically distributed), then its entropy rate is $H(\mathbb X)=H(X_0)$.
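
Indeed, by independence and identical distribution, $H(X_{1:n})=\sum_{k=1}^n H(X_k)=nH(X_0)$, so $\frac1n H(X_{1:n})=H(X_0)$ for every $n$, and the limit is trivially $H(X_0)$.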

1st Order Stationary Markov Chain

For a Markov chain in stationarity, where $X_0$ only has a dependence on $X_{-1}$,

$$ H(\mathbb X)=H(X_1|X_0)=\sum_{x,x'}\pi(x)p(x'|x)\log{1\over p(x'|x)} $$

where $p(x'|x)$ is the transition probability and $\pi(x)=P(X_0=x)$ is the stationary distribution.

The stationarity is also equivalent to

$$ \pi(x')=\sum_x \pi(x) p(x'|x). $$
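
A minimal Python sketch of this computation (the transition matrix `P` below is a made-up example, not from the notes): it finds the stationary distribution as a left eigenvector of the transition matrix and then evaluates $H(X_1|X_0)$.

```python
import numpy as np

# Hypothetical row-stochastic transition matrix P[x, x'] = p(x'|x) (illustration only).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])

# Stationary distribution pi: left eigenvector of P with eigenvalue 1 (pi = pi P).
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

# Entropy rate H(X_1|X_0) = sum_{x,x'} pi(x) p(x'|x) log2(1/p(x'|x)), in bits/symbol.
surprisal = np.zeros_like(P)
mask = P > 0
surprisal[mask] = -np.log2(P[mask])
entropy_rate = float(np.sum(pi[:, None] * P * surprisal))

print("stationary distribution:", np.round(pi, 4))
print("entropy rate (bits/symbol):", round(entropy_rate, 4))
```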

k-th Order Stationary Markov Chain

Similarly,

$$ H(\mathbb X)=H(X_0|X_{-k:-1}). $$

Case: Entropy Rate of a Random Walk on a Weighted Graph

On a weighted undirected graph where each edge has a weight $w(v,v')=w(v',v)$:

(Figure: a random walk on a weighted undirected graph.)

where $w(v,v')>0$ for every edge, and $w(v,v')=0$ if there is no edge between $v$ and $v'$.

Also we don’t allow self-loops.

From one node/vertex $v$, the particle walks to another node $v'$ with transition probability

$$ p(v'|v)\propto w(v,v'),\quad p(v'|v)={w({v,v'})\over \sum_{v''}w(v,v'')}. $$

The stationary distribution has a simple form. Let $W(v)=\sum_{v'}w(v,v')$. We define the normalizing constant

$$ W=\sum_{v}W(v)=\sum_{v,v'}w(v,v')=2\sum_{v<v'}w(v,v'), $$

and the stationary distribution is

$$ \pi(v)={W(v)\over W}. $$

Entropy Rate

$$ \begin{aligned} H(\mathbb X)&=H(X_1|X_0) \\&=-\sum_i \pi(i)\sum_j p(j|i)\log p(j|i) \\&=-\sum_i {W(i)\over W}\sum_j {w(i,j)\over W(i)}\log {w(i,j)\over W(i)} \\&=-\sum_{i,j}{w(i,j)\over W}\left(\log {w(i,j)\over W}-\log {W(i)\over W}\right) \\&=-\sum_{i,j}{w(i,j)\over W}\log {w(i,j)\over W}+\sum_i {W(i)\over W}\log {W(i)\over W} \\&=H(\{w(i,j)/W\})-H(\{W(i)/W\}) \end{aligned} $$
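
A hedged Python sketch (the weight matrix `w` is invented for illustration) that checks the closed form $H(\{w(i,j)/W\})-H(\{W(i)/W\})$ against the direct computation $-\sum_i\pi(i)\sum_j p(j|i)\log p(j|i)$:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a probability vector or matrix (zeros are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical symmetric weight matrix of an undirected graph, zero diagonal (no self-loops).
w = np.array([[0.0, 2.0, 1.0, 0.0],
              [2.0, 0.0, 3.0, 1.0],
              [1.0, 3.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])

W_v = w.sum(axis=1)            # W(v) = sum_{v'} w(v, v')
W = W_v.sum()                  # W = sum_v W(v) = twice the total edge weight
pi = W_v / W                   # stationary distribution pi(v) = W(v)/W
P = w / W_v[:, None]           # transition probabilities p(v'|v) = w(v,v')/W(v)

# Direct computation of H(X_1|X_0).
direct = sum(pi[i] * entropy_bits(P[i]) for i in range(len(pi)))

# Closed form: H({w(i,j)/W}) - H({W(i)/W}).
closed_form = entropy_bits(w / W) - entropy_bits(pi)

print(round(direct, 6), round(closed_form, 6))   # the two values should agree
```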

5.4 Asymptotic Equipartition Property

Intuition

The AEP states that, for an i.i.d. sequence following a distribution $p$ on the simplex $\Delta_\cal X$, the sample average of the surprisal converges (in probability) to the entropy $H(X_0)$.

  • ==Almost every long sequence of length $n$ has an empirical entropy (mean surprisal, which approximates the true entropy) close to the true entropy.==
  • Almost all long i.i.d. sequences share similar statistical properties: their empirical entropies are all approximately equal to the true single-variable entropy; equivalently, they are all $\epsilon$-weakly typical sequences.

AEP Form 1:

$\mathbb X=(X_n,n\in\Z)$ is i.i.d. following $p$.

Then $\log {1\over p(X_k)}$ is a random variable for each $k\in \Z$, and $(\log{1\over p(X_n)},n\in\Z)$ is itself an i.i.d. real-valued sequence. Hence, by the Weak Law of Large Numbers, $\frac1n \sum_{k=1}^n\log {1\over p(X_k)}$ converges in probability (i.e., the probability of deviating by more than any fixed $\epsilon$ tends to zero) to $E[-\log p(X_0)]=H(X_0)=H(\mathbb X)$:

$$ \lim_{n\to\infty} \Pr[|\frac1n \sum_{k=1}^n\log {1\over p(X_k)}-H(X_0)|\le \epsilon]= 1 $$

Note here because of the i.i.d. property, $H(X_0)=H(X_k),\forall k\in\Z$.
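
A small Monte-Carlo sketch of this convergence (the source distribution `p`, the seed, and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source distribution p on a 4-letter alphabet (illustration only).
p = np.array([0.5, 0.25, 0.15, 0.10])
H = float(-np.sum(p * np.log2(p)))   # H(X_0) in bits

for n in (10, 100, 1_000, 10_000):
    x = rng.choice(len(p), size=n, p=p)              # i.i.d. samples X_1, ..., X_n
    avg_surprisal = float(np.mean(-np.log2(p[x])))   # (1/n) sum_k log(1/p(X_k))
    print(f"n = {n:>6}: average surprisal = {avg_surprisal:.4f},  H(X_0) = {H:.4f}")
```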

Def: $\epsilon$-weakly Typical Set (of length $n$)

We define $\epsilon$-weakly typical set of length $n$ as a subset of $\mathcal X^n$:

$$ A_\epsilon^{(n)}=\{(x_{1:n}):\left|\frac1n\sum_{k=1}^{n}\log({1\over p(x_k)})-H(X_0)\right|\le \epsilon\} $$

A sequence in this set is called an $\epsilon$-weakly typical sequence.

AEP Form 2:

For all $\epsilon>0$ (this is exactly what the convergence in probability says):

$$ \lim_{n\to\infty}\Pr[X_{1:n}\notin A_\epsilon^{(n)}]=0. $$

Corollaries

Denote $\bold x=\bold x^n=x_{1:n}$ and $\bold X=\bold X^n=X_{1:n}$ for brevity.

Denote $\Pr(A_\epsilon^{(n)}):=\sum_{\bold x\in A_\epsilon^{(n)}}p(\bold x)$, then

$$ \Pr(A_\epsilon^{(n)}):=\sum_{\bold x\in A_\epsilon^{(n)}}p(\bold x)=\Pr[\bold X^n\in A_\epsilon^{(n)}]=\Pr[|\frac1n \sum_{k=1}^n\log {1\over p(X_k)}-H(X_0)|\le \epsilon] $$

For all sufficiently large $n$, i.e., $\forall n\ge N(\delta,\epsilon)$:

  1. If $x_{1:n}\in A_{\epsilon}^{(n)}$, we have

    $$ n(H(X_0)-\epsilon)\le \sum_{i=1}^n \log\frac1{p(x_i)}=\log {1\over p(x_{1:n})}\le n(H(X_0)+\epsilon) $$

    (these are lower/upper bounds on the total surprisal; dividing by $n$ gives bounds on the average surprisal, i.e., the empirical entropy)

    • Moreover, $$ 2^{-n(H(X_0)+\epsilon)}\le p(x_{1:n})\le 2^{-n(H(X_0)-\epsilon)},\quad\forall x_{1:n}\in A_\epsilon^{(n)}. $$
  2. $\forall \delta,\epsilon>0$, we denote $\Pr(A_\epsilon^{(n)})=\sum_{\bold x\in A_\epsilon^{(n)}}p(\bold x)$.

    $$ \Pr(A_\epsilon^{(n)})>1-\delta $$

    for all sufficiently large $n$.

  3. The cardinality/size upper bound of the typical set: $\forall \epsilon>0$, for sufficiently large $n$,

    $$ |A_\epsilon^{(n)}|\le 2^{n(H(X_0)+\epsilon)} $$

    Proof.

    $$ \begin{aligned} 1 & = \sum_{\bold x\in{\cal X^n}} p(\bold x) \\&\ge \sum_{\bold x\in A_\epsilon^{(n)}} p(\bold x) \\&\ge \sum_{\bold x\in A_\epsilon^{(n)}} 2^{-n(H(X_0)+\epsilon)} \\&=|A_\epsilon^{(n)}|2^{-n(H(X_0)+\epsilon)} \end{aligned} $$

    so $|A_\epsilon^{(n)}|\le 2^{n(H(X_0)+\epsilon)}$. Q.E.D.

  4. $\forall \delta,\epsilon > 0$, for sufficiently large $n$,

    $$ |A_\epsilon^{(n)}|\ge (1-\delta) 2^{n(H(X_0)-\epsilon)} $$

    Proof.

    $$ \begin{aligned} 1-\delta &<\Pr(A_\epsilon^{(n)}) \\&=\sum_{\bold x\in{A_\epsilon^{(n)}}} p(\bold x) \\&\le \sum_{\bold x\in A_\epsilon^{(n)}}2^{-n(H(X_0)-\epsilon)} \\&=|A_\epsilon^{(n)}|2^{-n(H(X_0)-\epsilon)} \end{aligned} $$

    so $|A_\epsilon^{(n)}|>(1-\delta)2^{n(H(X_0)-\epsilon)}$.

    We can also set $\delta=\epsilon$ to eliminate the extra variable. (The bounds above are checked numerically in the sketch after this list.)
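
The following brute-force Python sketch (binary alphabet; the distribution, $n$, and $\epsilon$ are arbitrary, and $n$ is kept tiny so $\mathcal X^n$ can be enumerated) builds $A_\epsilon^{(n)}$ explicitly and checks the cardinality bounds above. At such a small $n$ the probability $\Pr(A_\epsilon^{(n)})$ is still far from 1; the asymptotic guarantee only kicks in for much larger $n$.

```python
import itertools
import numpy as np

# Hypothetical i.i.d. binary source; n is kept tiny so X^n can be enumerated.
p = np.array([0.7, 0.3])
H = float(-np.sum(p * np.log2(p)))
n, eps = 12, 0.1

prob_typical, size_typical = 0.0, 0
for seq in itertools.product(range(len(p)), repeat=n):
    prob = float(np.prod(p[list(seq)]))
    if abs(-np.log2(prob) / n - H) <= eps:     # membership in A_eps^(n)
        prob_typical += prob
        size_typical += 1

upper = 2 ** (n * (H + eps))                   # Corollary 3: |A| <= 2^{n(H+eps)}
lower = prob_typical * 2 ** (n * (H - eps))    # proof of Corollary 4 with 1 - delta = Pr(A)

print(f"Pr(A) = {prob_typical:.4f}  (approaches 1 only for much larger n)")
print(f"{lower:.1f} <= |A| = {size_typical} <= {upper:.1f}")
```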

5.5 Lossless Data Compression

Setup

Assume we have a finite stochastic sequence $\bold X^n$ on the domain $\mathcal X^n$ that we want to compress. We use an injective (lossless) encoding map: every sequence corresponds to an index, represented as a bit string.

We define the typical set $A_\epsilon^{(n)}$ as above; some of the sequences $\bold x^n$ belong to this set and some do not.

$\forall \epsilon>0,\forall \delta>0,\exists\, n(\epsilon,\delta)$ s.t. for all $n\ge n(\epsilon,\delta)$ there exists an encoding map

$$ e_n: \mathcal X^n\to\{0,1\}^*\setminus\{\text{empty string}\} $$ $$ e_n(\bold x^n)=\begin{cases} \text{concat}(1,\phi(\bold x^n)),&\bold x^n\in A_\epsilon^{(n)},\\ \text{concat}(0,\psi(\bold x^n)),&\bold x^n\notin A_{\epsilon}^{(n)}. \end{cases} $$
  • We use one flag bit 0/1 to indicate whether the sequence is in the typical set.
  • $\phi,\psi$ are maps from sequences to bit strings.
  • If the sequence is typical, we use $\phi(\cdot)$ to encode it. $\phi(\bold x^n)$ is a unique string of length $\lceil \log|A_\epsilon^{(n)}|\rceil$ bits, since there are $|A_\epsilon^{(n)}|$ possible input sequences. The first bit 0/1 of an encoded string tells us whether the sequence is typical. In effect we compress down to the size of the typical set.
    • $l(e_n(\bold x^n))=1+\lceil \log|A_\epsilon^{(n)}|\rceil$
      • This is the minimum sufficient length, i.e., the smallest achievable value of the maximum codeword length in the codebook.
  • If the sequence is not typical, we use $\psi(\cdot)$ to encode it. $\psi(\bold x^n)$ is a unique string of length $\lceil \log |\mathcal X^n|\rceil = \lceil n\log |\mathcal X|\rceil$ bits, which amounts to no compression.
    • $l(e_n(\bold x^n))=1+\lceil \log |\mathcal X^n|\rceil = 1+\lceil n\log |\mathcal X|\rceil$

So we have the expected length

$$ E[l(e_n(\bold X^n))]\le 1+\Pr(A_\epsilon^{(n)})\lceil\log|A_\epsilon^{(n)}|\rceil+(1-\Pr(A_\epsilon^{(n)}))\lceil n\log |\mathcal X|\rceil $$

Equality holds if and only if every sequence is encoded with a codeword of the maximum length in its part of the codebook.
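
A toy Python sketch of the two-part code described above (binary alphabet; the distribution, $n$, and $\epsilon$ are invented, and $\phi$ is realized as an index into an explicit list of typical sequences). For such a small $n$ it does not compress yet; it only illustrates the mechanics.

```python
import itertools
import math
import numpy as np

# Hypothetical i.i.d. binary source; n is tiny so the typical set can be listed explicitly.
p = np.array([0.7, 0.3])
H = float(-np.sum(p * np.log2(p)))
n, eps = 10, 0.2

def is_typical(seq):
    prob = float(np.prod(p[list(seq)]))
    return abs(-np.log2(prob) / n - H) <= eps

all_seqs = list(itertools.product(range(len(p)), repeat=n))
typical = [s for s in all_seqs if is_typical(s)]
index_of = {s: i for i, s in enumerate(typical)}      # phi: typical sequence -> index

len_typ = math.ceil(math.log2(len(typical)))          # ceil(log |A_eps^(n)|) bits
len_raw = math.ceil(n * math.log2(len(p)))            # ceil(n log |X|) bits

def encode(seq):
    """Flag bit '1' + phi-index if typical; flag bit '0' + raw sequence (psi) otherwise."""
    if seq in index_of:
        return "1" + format(index_of[seq], f"0{len_typ}b")
    return "0" + "".join(str(s) for s in seq)          # psi: identity map (binary alphabet)

print(f"typical codeword length = 1 + {len_typ}, raw codeword length = 1 + {len_raw}")

# Expected code length per symbol vs. the entropy; the gain only appears as n grows.
expected_len = sum(float(np.prod(p[list(s)])) * len(encode(s)) for s in all_seqs)
print(f"E[l]/n = {expected_len / n:.4f} bits/symbol,  H(X_0) = {H:.4f}")
```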

Expected Length of Lossless Compression: Upper Bound

$$ \begin{aligned} E[l(e_n(\bold X^n))]&\le 1+\Pr(A_\epsilon^{(n)})(1+n(H(X_0)+\epsilon))+\delta(1+n\log |\mathcal X|)\qquad\text{(Tight bound)} \\&\le 1+1+nH(X_0)+n\epsilon+\delta+\delta n\log|\mathcal X| \\&=2+n(H(X_0)+\epsilon) + \delta(1+n\log |\mathcal X|)\qquad\qquad(\text{Loose bound}) \end{aligned} $$

Hence, setting $\delta=\epsilon$, we obtain the achievability (upper-bound) part of Shannon's First Theorem (the Source Coding Theorem):

$\forall \epsilon>0$, there exists a lossless compression scheme $((e_n,d_n),n\ge 1)$ (that is, $((e_1,d_1),(e_2,d_2),\dots)$) s.t.

$$ \limsup_{n\to\infty}\frac1nE[l(e_n(\bold X^n))]\le \lim_{n\to\infty}\frac{2+n(H(X_0)+\epsilon)+\epsilon(1+n\log |\mathcal X|)}{n}=H(X_0)+\epsilon(1+\log |\mathcal X|), $$

which can be made smaller than $H(X_0)+\epsilon'$ for any $\epsilon'>0$ by choosing $\epsilon$ small enough.

Converse: Lower Bound

$\forall \epsilon>0$, for every lossless compression scheme $((e_n,d_n),n\ge 1)$, we have

$$ \liminf_{n\to\infty}\frac1nE[l(e_n(\bold X^n))]\ge H(X_0)-\epsilon $$

Proof

Omitted.

First Shannon Theorem: Source Coding Theorem

Also known as the source coding theorem or the lossless variable-length coding theorem: it characterizes the fundamental limit of lossless data compression.

This theorem is the combination of the upper bound (achievability) and the lower bound (converse) above.