INFOTH Note 5: Fano Ineq., Entropy Rate, AEP

Fano’s Inequality, Markov Chain, Entropy Rate, Asymptotic Equipartition Property and Lossless Compression.


5.1 Fano’s Inequality

Fano's inequality addresses the scenario of estimating one random variable from another.

Setup

Markov Chain

Given random variables $X\in\cal X$ and $Y\in\cal Y$, an estimator of $X$ from $Y$ is $\hat X=\hat X(Y)$ (a function of $Y$ only).

Note that $X\to Y\to \hat X$ forms a Markov chain, because

$$ P(\hat X=x|Y=y,X=z)=P(\hat X=x|Y=y) $$

Probability of Error

The error as an event: $\hat X\ne X$

The probability of error is defined as:

$$ p_e=P(\hat X\ne X) $$

Binary Entropy

For a probability value $p$, the binary entropy is

$$ h(p)=-p\log p-(1-p)\log (1-p) $$

which is a concave function of $p$ with maximum value $1$ bit, attained at $p=1/2$.

Theorem

$$ h(p_e)+p_e\log_2(|{\cal X}|-1)\ge H(X|Y) $$

Intuition / Weaker Propositions (Corollary)

$H(X|Y)$ gives a lower bound on $p_e$.

From the definition of entropy, if $H(X|Y)$ is large, then $X$ remains highly uncertain even after observing $Y$. In that case $Y$ is not an informative observation of $X$, so no estimator built from $Y$ can be good.

This inequality can be weakened (using $h(p_e)\le 1$ and $|{\cal X}|-1\le|{\cal X}|$) into

$$ \begin{aligned} 1+p_e\log|{\cal X}|&\ge 1+p_e\log(|{\cal X}|-1)\\&\ge h(p_e)+p_e\log_2(|{\cal X}|-1)\\& \ge H(X|Y), \end{aligned} $$

which is equivalent to this corollary:

$$ p_e\ge {H(X|Y)-1\over \log |{\cal X}|}. $$
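
As a quick numerical illustration (my own sketch, not from the lecture), the snippet below finds the smallest $p_e$ compatible with Fano's inequality by a grid search and compares it with the weaker corollary; the function names and the example values $H(X|Y)=1.5$ bits, $|\mathcal X|=8$ are arbitrary choices.

```python
import numpy as np

def binary_entropy(p):
    """h(p) = -p log2 p - (1-p) log2(1-p), with h(0) = h(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_lower_bound(H_X_given_Y, alphabet_size, grid=100_000):
    """Smallest p_e consistent with h(p_e) + p_e*log2(|X|-1) >= H(X|Y) (grid search)."""
    p = np.linspace(0.0, 1.0, grid)
    f = binary_entropy(p) + p * np.log2(alphabet_size - 1)
    feasible = p[f >= H_X_given_Y]
    return feasible[0] if feasible.size else 1.0  # vacuous if H(X|Y) > log2|X|

# Hypothetical example: H(X|Y) = 1.5 bits, |X| = 8.
print(fano_lower_bound(1.5, 8))      # Fano bound on p_e (about 0.25)
print((1.5 - 1) / np.log2(8))        # weaker corollary: (H(X|Y)-1)/log|X|, about 0.17
```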


Proof

Define a random variable $E=\bold 1[\hat X\ne X]=\begin{cases}1,&\hat X\ne X,\\0,&\hat X = X.\end{cases}$ Note that $H(E)=h(p_e)$.

Then from the chain rule,

$$ H(E,X|\hat X)=H(E|\hat X)+H(X|E,\hat X). $$

From the def. of $E$,

$$ \begin{aligned} H(X|E,\hat X)&=\sum_{e\in\{0,1\}}P(E=e)H(X|E=e,\hat X)\\ &=p_eH(X|E=1,\hat X)+(1-p_e)H(X|E=0,\hat X) \end{aligned} $$

From the defined cases of $E$, when $E=0$, we have $X=\hat X$, so

$$ H(X|E=0,\hat X)=0 \implies H(X|E,\hat X)=p_eH(X|E=1,\hat X) $$

Then

$$ \begin{aligned} H(E,X|\hat X)&=H(E|\hat X)+H(X|E,\hat X)\\ &\le H(E)+\underbrace{p_eH(X|E=1,\hat X)}_{H(X|E,\hat X)}\\ &= \underbrace{h(p_e)}_{H(E)} +p_eH(X|E=1,\hat X) \end{aligned} $$

The entropy of any distribution on a finite set is at most the log of the set's size. Moreover, given $E=1$ and $\hat X=\hat x$, we know $X\ne \hat x$, so $X$ ranges over at most $|{\cal X}|-1$ values. Hence

$$ H(X|E=1,\hat X)\le \log(|{\cal X}|-1) $$

Substituting this bound, we get:

$$ H(E,X|\hat X)\le h(p_e)+p_e\log(|{\cal X}|-1) $$

Also from the chain rule:

$$ H(E,X|\hat X)=H(X|\hat X)+H(E|X,\hat X) \ge H(X|\hat X) $$

(in fact $H(E|X,\hat X)=0$ because $E$ is a function of $(X,\hat X)$, so this holds with equality)

From the Markov chain $X\to Y\to \hat X$ and the data processing inequality:

$$ H(X|Y)\le H(X|\hat X) $$

so

$$ \begin{aligned} H(X|Y)&\le H(X|\hat X) \\&\le H(E,X|\hat X) \\&\le h(p_e)+p_e\log(|{\cal X}|-1)\qquad\square \end{aligned} $$

Q.E.D.

Stronger Propositions (in the proof)

The proof above also establishes this variant of Fano's inequality, which does not involve $Y$:

$$ H(X|\hat X)\le h(p_e)+p_e\log(|{\cal X}|-1) $$
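
A small numerical sanity check of this variant (my own sketch; the random joint distribution of $(X,\hat X)$ and the alphabet size $5$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5                                        # |X| = 5, arbitrary toy alphabet size

# Random joint pmf of (X, hat X); rows index X, columns index hat X.
J = rng.random((m, m))
J /= J.sum()

p_e = 1.0 - np.trace(J)                      # P(hat X != X)
pX_given_hat = J / J.sum(axis=0, keepdims=True)
H_X_given_hat = -np.sum(J * np.log2(pX_given_hat))   # H(X | hat X) in bits

def h(p):                                    # binary entropy
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# The variant h(p_e) + p_e*log2(|X|-1) >= H(X | hat X) should hold for any joint pmf.
print(H_X_given_hat, "<=", h(p_e) + p_e * np.log2(m - 1))
```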

5.2 Markov Chain & Stationarity

Def: Stationary Sequences

Consider a sequence of random variables:

$$ \dots,X_{-2},X_{-1},X_0,X_1,\dots $$

We say the sequence is stationary if the joint distribution of any $N$ consecutive random variables is invariant under time shifts:

$$ P(X_0=x_0,\dots,X_{N-1}=x_{N-1})=P(X_k=x_0,\dots,X_{k+N-1}=x_{N-1}) $$

$\forall k\in\Z,\forall N\ge 1$, and all $x_0,\dots,x_{N-1}$.

(Note that time-homogeneity, i.e. time-invariance of a Markov chain, refers only to the conditional (transition) probabilities being the same at every time step; it is weaker than stationarity.)

Lemma

For a stationary process (in particular, a stationary Markov chain), entropies and conditional entropies are invariant under any time shift $k$:

$$ H(X_{n_1},\dots,X_{n_m}|X_{p_1},\dots, X_{p_q}) =H(X_{n_1+k},\dots,X_{n_m+k}|X_{p_1+k},\dots, X_{p_q+k}) $$

for all index sets and all $k\in\Z$.

5.3 Entropy Rate

Def: Entropy Rate of a Stochastic Process

The entropy rate of a stochastic process $\mathbb X=(X_n,n\in\Z)$ is defined as

$$ H(\mathbb X)=\lim_{n\to\infty}\frac1n H(X_{1:n}) $$

when the limit exists; and if $\mathbb X$ is stationary,

$$ H(\mathbb X)=\lim_{n\to \infty} H(X_{n+1}|X_{1:n}). $$

(since, by the Cesàro mean argument, if $a_n\to a$ then the running average $\frac1n\sum_{k=1}^n a_k\to a$ as well; here $a_n=H(X_n|X_{1:n-1})$ and the running average is $\frac1n H(X_{1:n})$ by the chain rule)

This can also be written as $H(X_0|X_{-\infty}^{-1})$ or $H(X_0|X_{-\infty:-1})$. Here both $X_{m:n}$ and $X_m^n$ denote $(X_m,\dots,X_n)$.

The unit is bits per symbol when the logarithm is taken with base 2.

In a stationary sequence (stationary process), consider 3 random variables:

$$ H(X_0,X_1,X_2)=H(X_0)+H(X_1|X_0)+H(X_2|X_1,X_0) $$

and $H(X_1|X_0)=H(X_0|X_{-1})\le H(X_0)$, $H(X_2|X_1,X_0)=H(X_0|X_{-1},X_{-2})\le H(X_0|X_{-1})$.

We can easily prove

$$ \forall k_1>k_2,\qquad H(X_0|X_{-1},\dots,X_{-k_1})\le H(X_0|X_{-1},\dots,X_{-k_2}) $$

i.e., that $H(X_0|X_{-1},\dots,X_{-k})$ is nonincreasing in $k$: conditioning on more of the past cannot increase entropy.

Monotonicity of Running Average

Conclusion. In a stationary stochastic process, ${H(X_0,\dots,X_{n-1})\over n}$ is a nonincreasing sequence of nonnegative numbers; hence the limit (the entropy rate) exists, since the sequence is monotone and bounded below by $0$.
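
The sketch below illustrates this (my own toy example: a stationary two-state Markov chain with an arbitrarily chosen transition matrix); it computes $H(X_{1:n})/n$ by brute-force enumeration and shows it decreasing toward $H(X_1|X_0)$.

```python
import itertools
import numpy as np

# Stationary two-state Markov chain; P and pi are arbitrary toy choices with pi = pi P.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])

def joint_entropy(n):
    """H(X_1, ..., X_n) in bits, by summing over all 2^n sample paths."""
    H = 0.0
    for path in itertools.product((0, 1), repeat=n):
        prob = pi[path[0]]
        for a, b in zip(path, path[1:]):
            prob *= P[a, b]
        H -= prob * np.log2(prob)
    return H

# Running average H(X_{1:n})/n is nonincreasing and tends to H(X_1|X_0) ~ 0.572 bits.
print([round(joint_entropy(n) / n, 4) for n in range(1, 11)])
```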

Cases of Entropy Rate

Typewriter

Consider the case of a typewriter that has $m$ equally likely output letters. The typewriter can produce $m^n$ sequences of length $n$, all of them equally likely.

For any $n$, $H(X_1,\dots,X_n)=\log m^n=n\log m$ (we can view the whole sequence as a single uniform variable over $m^n$ outcomes), so the entropy rate is $H(\mathbb X)=\log m$ bits per symbol.

IID Case

If $\mathbb X=(X_n,n\in\Z)$ is an i.i.d. stationary sequence (i.i.d. means the variables in the sequence are independent and identically distributed), then its entropy rate is $H(\mathbb X)=H(X_0)$.

1st Order Stationary Markov Chain

For a stationary first-order Markov chain, where $X_0$ depends on the past only through $X_{-1}$,

$$ H(\mathbb X)=H(X_1|X_0)=\sum_{x,x'}\pi(x)p(x'|x)\log{1\over p(x'|x)} $$

where $p(x'|x)$ is the transition probability and $\pi(x)=P(X_0=x)$ is the stationary distribution.

The stationarity is also equivalent to

$$ \pi(x')=\sum_x \pi(x) p(x'|x). $$
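
For instance, the following sketch (function names are mine; power iteration is just one convenient way to obtain $\pi$) computes the stationary distribution and the entropy rate $H(X_1|X_0)$ of a small, arbitrarily chosen transition matrix.

```python
import numpy as np

def stationary_distribution(P, iters=10_000):
    """pi with pi = pi P, via power iteration; P[i, j] = p(j | i) is row-stochastic."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi

def markov_entropy_rate(P):
    """H(X_1 | X_0) = sum_x pi(x) sum_x' p(x'|x) log2(1/p(x'|x)), in bits/symbol."""
    pi = stationary_distribution(P)
    logP = np.log2(np.where(P > 0, P, 1.0))   # log2 p, with 0 log 0 treated as 0
    return -np.sum(pi[:, None] * P * logP)

# Arbitrary 3-state toy chain.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
print(stationary_distribution(P), markov_entropy_rate(P))
```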

k-th Order Stationary Markov Chain

Similarly,

$$ H(\mathbb X)=H(X_0|X_{-k:-1}). $$

Case: Entropy Rate of a Random Walk on a Weighted Graph

On a weighted undirected graph where each edge has a weight $w(v,v')=w(v',v)$:

(Figure: a random walk on a weighted undirected graph.)

where $w(v,v')>0$ if $\{v,v'\}$ is an edge, and $w(v,v')=0$ if there is no edge.

We also do not allow self-loops.

From one node/vertex $v$ to another $v'$, the particle moves with transition probability

$$ p(v'|v)\propto w(v,v'),\quad p(v'|v)={w({v,v'})\over \sum_{v''}w(v,v'')}. $$

The stationary distribution has a simple form. Let $W(v)=\sum_{v'}w(v,v')$. We define the normalizing constant

$$ W=\sum_{v}W(v)=\sum_{v,v'}w(v,v')=2\sum_{\{v,v'\}\text{ an edge}}w(v,v'), $$

i.e., twice the total edge weight, and the stationary distribution is

$$ \pi(v)={W(v)\over W}. $$

Entropy Rate

$$ \begin{aligned} H(\mathbb X)&=H(X_1|X_0) \\&=-\sum_i \pi(i)\sum_j p(j|i)\log p(j|i) \\&=-\sum_i {W(i)\over W}\sum_j {w(i,j)\over W(i)}\log {w(i,j)\over W(i)} \\&=-\sum_{i,j}{w(i,j)\over W}\left(\log {w(i,j)\over W}-\log {W(i)\over W}\right) \\&=-\sum_{i,j}{w(i,j)\over W}\log {w(i,j)\over W}+\sum_i {W(i)\over W}\log {W(i)\over W} \\&=H(\{w(i,j)/W\})-H(\{W(i)/W\}) \end{aligned} $$
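
The identity on the last line can be verified numerically; here is a small sketch with an arbitrary weighted graph on four nodes (the weight matrix is my own toy example).

```python
import numpy as np

# Symmetric weight matrix with zero diagonal (no self-loops); w[i, j] = w(i, j).
w = np.array([[0, 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]], dtype=float)

W_v = w.sum(axis=1)            # W(v) = sum_v' w(v, v')
W = W_v.sum()                  # W = sum_v W(v) = 2 * (total edge weight)
P = w / W_v[:, None]           # transition probabilities p(v' | v)
pi = W_v / W                   # stationary distribution pi(v) = W(v) / W

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Direct computation of H(X_1 | X_0).
logP = np.log2(np.where(P > 0, P, 1.0))
direct = -np.sum(pi[:, None] * P * logP)

# Closed form H({w(i,j)/W}) - H({W(i)/W}); the two numbers should agree.
closed = entropy_bits(w.flatten() / W) - entropy_bits(W_v / W)
print(direct, closed)
```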

5.4 Asymptotic Equipartition Property

Intuition

The AEP states that, for an i.i.d. sequence drawn from a distribution $p\in\Delta_{\cal X}$ (the probability simplex over $\cal X$), the sample average of the surprisal converges to $H(X_0)$.

  • ==Almost every sequence of a long length $n$ has an empirical entropy (mean surprisal, approximating the real entropy) close to the real entropy.==
  • Almost all long i.i.d. sequences share similar statistical properties: their empirical entropies are all approximately equal to the true single-letter entropy, i.e., they all belong to the $\epsilon$-weakly typical set.

AEP Form 1:

$\mathbb X=(X_n,n\in\Z)$ is i.i.d. following $p$.

Then $\log {1\over p(X_k)}$ is a random variable for each $k\in \Z$, and $(\log {1\over p(X_n)},n\in\Z)$ is also an i.i.d. real-valued sequence. Hence, by the Weak Law of Large Numbers, $\frac1n \sum_{k=1}^n\log {1\over p(X_k)}$ converges in probability (i.e., the probability of deviating from the limit by more than any fixed margin tends to zero) to $E[-\log p(X_0)]=H(X_0)=H(\mathbb X)$: for every $\epsilon>0$,

$$ \lim_{n\to\infty} \Pr[|\frac1n \sum_{k=1}^n\log {1\over p(X_k)}-H(X_0)|\le \epsilon]= 1 $$

Note here because of the i.i.d. property, $H(X_0)=H(X_k),\forall k\in\Z$.

Def: $\epsilon$-weakly Typical Set (of length $n$)

We define the $\epsilon$-weakly typical set of length $n$ as the subset of $\mathcal X^n$:

$$ A_\epsilon^{(n)}=\{(x_{1:n}):\left|\frac1n\sum_{k=1}^{n}\log({1\over p(x_k)})-H(X_0)\right|\le \epsilon\} $$

A sequence in this set is called an $\epsilon$-weakly typical sequence.

AEP Form 2:

For every $\epsilon>0$ (this is exactly what the convergence in probability says),

$$ \lim_{n\to\infty}\Pr[X_{1:n}\notin A_\epsilon^{(n)}]=0. $$
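
A quick simulation of this convergence (my own sketch; the three-letter source distribution and $\epsilon=0.05$ are arbitrary) estimates the probability of landing in $A_\epsilon^{(n)}$ for growing $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])        # toy source on {0, 1, 2}
H = -np.sum(p * np.log2(p))          # H(X_0) in bits
eps = 0.05

for n in [10, 100, 1000, 10_000]:
    # 1000 i.i.d. sequences of length n; empirical entropy = average surprisal.
    samples = rng.choice(len(p), size=(1000, n), p=p)
    avg_surprisal = -np.log2(p[samples]).mean(axis=1)
    frac_typical = np.mean(np.abs(avg_surprisal - H) <= eps)
    print(n, frac_typical)           # fraction inside A_eps^(n), tending to 1
```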

Corollaries

Denote $\bold x=\bold x^n=x_{1:n}$ and $\bold X=\bold X^n=X_{1:n}$ for convenience when the notation gets heavy.

Denote $\Pr(A_\epsilon^{(n)}):=\sum_{\bold x\in A_\epsilon^{(n)}}p(\bold x)$, then

$$ \Pr(A_\epsilon^{(n)}):=\sum_{\bold x\in A_\epsilon^{(n)}}p(\bold x)=\Pr[\bold X^n\in A_\epsilon^{(n)}]=\Pr[|\frac1n \sum_{k=1}^n\log {1\over p(X_k)}-H(X_0)|\le \epsilon] $$

The following hold for all sufficiently large $n$, i.e., $\forall n\ge N(\delta,\epsilon)$ (item 1 in fact holds for every $n$):

  1. If $x_{1:n}\in A_{\epsilon}^{(n)}$, we have

    $$ n(H(X_0)-\epsilon)\le \sum_{i=1}^n \log\frac1{p(x_i)}=\log\frac1{p(x_{1:n})}\le n(H(X_0)+\epsilon) $$

    (the lower/upper bounds of total surprisal, which can also be the bounds of average surprisal/empirical entropy)

    • Moreover, equivalently, $$ 2^{-n(H(X_0)+\epsilon)}\le p(x_{1:n})\le 2^{-n(H(X_0)-\epsilon)},\quad\forall x_{1:n}\in A_\epsilon^{(n)}. $$
  2. $\forall \delta,\epsilon>0$, we denote $\Pr(A_\epsilon^{(n)})=\sum_{\bold x\in A_\epsilon^{(n)}}p(\bold x)$.

    $$ \Pr(A_\epsilon^{(n)})>1-\delta $$

    for all sufficiently large $n$.

  3. The cardinality/size upper bound of the typical set: $\forall \epsilon>0$, for sufficiently large $n$,

    $$ |A_\epsilon^{(n)}|\le 2^{n(H(X_0)+\epsilon)} $$

    Proof.

    $$ \begin{aligned} 1 & = \sum_{\bold x\in{\cal X^n}} p(\bold x) \\&\ge \sum_{\bold x\in A_\epsilon^{(n)}} p(\bold x) \\&\ge \sum_{\bold x\in A_\epsilon^{(n)}} 2^{-n(H(X_0)+\epsilon)} \\&=|A_\epsilon^{(n)}|2^{-n(H(X_0)+\epsilon)} \end{aligned} $$

    so $|A_\epsilon^{(n)}|\le 2^{n(H(X_0)+\epsilon)}$. Q.E.D.

  4. $\forall \delta,\epsilon > 0$, for sufficiently large $n$,

    $$ |A_\epsilon^{(n)}|\ge (1-\delta) 2^{n(H(X_0)-\epsilon)} $$

    Proof.

    $$ \begin{aligned} 1-\delta &<\Pr(A_\epsilon^{(n)}) \\&=\sum_{\bold x\in{A_\epsilon^{(n)}}} p(\bold x) \\&\le \sum_{\bold x\in A_\epsilon^{(n)}}2^{-n(H(X_0)-\epsilon)} \\&=|A_\epsilon^{(n)}|2^{-n(H(X_0)-\epsilon)} \end{aligned} $$

    so $|A_\epsilon^{(n)}|>(1-\delta)2^{n(H(X_0)-\epsilon)}$.

    We can also set $\delta=\epsilon$ to eliminate the extra variable $\delta$. (A numerical check of these cardinality bounds is given right after this list.)
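
As a numerical check of the cardinality bounds above (a sketch under my own toy assumptions: a binary source with $P(X=1)=0.3$, $\epsilon=0.1$, $n=200$): for an i.i.d. binary source, $-\frac1n\log p(x_{1:n})$ depends only on the number of ones, so $|A_\epsilon^{(n)}|$ can be counted exactly with binomial coefficients.

```python
from math import comb, log2

p1 = 0.3                                   # toy binary source, P(X = 1) = 0.3
H = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))
eps, n = 0.1, 200

size = 0
for k in range(n + 1):                     # k = number of ones in the sequence
    per_symbol = -(k * log2(p1) + (n - k) * log2(1 - p1)) / n
    if abs(per_symbol - H) <= eps:
        size += comb(n, k)                 # all C(n, k) such sequences are typical

# log2|A_eps^(n)| / n lies below H + eps exactly, and roughly above H - eps.
print(log2(size) / n, "in", (H - eps, H + eps))
```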

5.5 Lossless Data Compression

Setup

Assume we have a length-$n$ sequence $\bold X^n$ from a stochastic process, taking values in $\mathcal X^n$, and we want to compress it. We use an injective (lossless) encoding table: every sequence corresponds to an index, given as a bit string.

We define the typical set $A_\epsilon^{(n)}$ as above; some of the sequences $\bold x^n$ belong to this set.

$\forall \epsilon>0,\forall \delta>0,\exists\, n(\epsilon,\delta)$ s.t. there exists an encoding map

$$ e_n: \mathcal X^n\to\{0,1\}^*\setminus\{\text{empty string}\} $$ $$ e_n(\bold x^n)=\begin{cases} \text{concat}(1,\phi(\bold x^n)),&\bold x^n\in A_\epsilon^{(n)},\\ \text{concat}(0,\psi(\bold x^n)),&\bold x^n\notin A_{\epsilon}^{(n)}. \end{cases} $$
  • We use a leading bit 0/1 to indicate whether the sequence is in the typical set.
  • $\phi$ and $\psi$ are maps from a sequence to a bit string.
  • If the sequence is typical, we encode it with $\phi(\cdot)$. $\phi(\bold x^n)$ is a unique string of length $\lceil \log|A_\epsilon^{(n)}|\rceil$ bits, since there are $|A_\epsilon^{(n)}|$ possible input sequences. For an encoded string, the leading bit 0/1 tells us whether the sequence is typical; typical sequences are thus compressed down to (roughly) the log-size of the typical set.
    • $l(e_n(\bold x^n))=1+\lceil \log|A_\epsilon^{(n)}|\rceil$
      • This is the minimum sufficient length, i.e. the smallest possible value of the maximum codeword length over the typical part of the codebook.
  • If the sequence is not typical, we encode it with $\psi(\cdot)$. $\psi(\bold x^n)$ is a unique string of length $\lceil \log |\mathcal X^n|\rceil = \lceil n\log |\mathcal X|\rceil$ bits, which amounts to no compression.
    • $l(e_n(\bold x^n))=1+\lceil \log |\mathcal X^n|\rceil = 1+\lceil n\log |\mathcal X|\rceil$

So we have the expected length

$$ E[l(e_n(\bold X^n))]\le 1+\Pr(A_\epsilon^{(n)})\lceil\log|A_\epsilon^{(n)}|\rceil+(1-\Pr(A_\epsilon^{(n)}))\lceil n\log |\mathcal X|\rceil $$

Equality holds if and only if every sequence is encoded with the longest codeword in the codebook (i.e., the lengths above are all attained).
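
As a concrete toy instantiation of the scheme above (my own sketch; the binary source with $P(X=1)=0.2$, $n=12$, $\epsilon=0.15$ are arbitrary, and $n$ is kept tiny so that the typical set can be enumerated explicitly):

```python
import itertools
import math

p1, n, eps = 0.2, 12, 0.15
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

def per_symbol_surprisal(x):
    k = sum(x)
    return -(k * math.log2(p1) + (n - k) * math.log2(1 - p1)) / n

# Enumerate A_eps^(n) and index it; the index plays the role of phi.
typical = [x for x in itertools.product((0, 1), repeat=n)
           if abs(per_symbol_surprisal(x) - H) <= eps]
index_of = {x: i for i, x in enumerate(typical)}
bits_typ = math.ceil(math.log2(len(typical)))   # ceil(log |A_eps^(n)|)

def encode(x):
    x = tuple(x)
    if x in index_of:                            # flag bit 1 + index in the typical set
        return "1" + format(index_of[x], f"0{bits_typ}b")
    return "0" + "".join(map(str, x))            # flag bit 0 + raw bits (psi: no compression)

def decode(code):
    if code[0] == "1":
        return typical[int(code[1:], 2)]
    return tuple(int(b) for b in code[1:])

x = (0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0)
assert decode(encode(x)) == x
print(len(encode(x)), "bits for a typical sequence vs", 1 + n, "for an atypical one")
```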

Expected Length of Lossless Compression: Upper Bound

$$ \begin{aligned} E[l(e_n(\bold X^n))]&\le 1+\Pr(A_\epsilon^{(n)})(1+n(H(X_0)+\epsilon))+\delta(1+n\log |\mathcal X|)\qquad\text{(Tight bound)} \\&\le 1+1+nH(X_0)+n\epsilon+\delta+\delta n\log|\mathcal X| \\&=2+n(H(X_0)+\epsilon) + \delta(1+n\log |\mathcal X|)\qquad\qquad(\text{Loose bound}) \end{aligned} $$

Hence, setting $\delta=\epsilon$, we obtain the upper (achievability) part of Shannon's First Theorem (Source Coding Theorem):

$\forall \epsilon>0$, there exists a lossless compression scheme $((e_n,d_n),n\ge 1)$ (that is, $((e_1,d_1),(e_2,d_2),\dots)$) s.t.

$$ \limsup_{n\to\infty}\frac1nE[l(e_n(\bold X^n))]\le \lim_{n\to\infty}\frac{2+n(H(X_0)+\epsilon)+\epsilon(1+n\log |\mathcal X|)}{n}=H(X_0)+\epsilon(1+\log |\mathcal X|). $$

Since $\epsilon>0$ is arbitrary, the compression rate can be made arbitrarily close to $H(X_0)$: for every $\epsilon'>0$ there is a scheme with $\limsup_{n\to\infty}\frac1n E[l(e_n(\bold X^n))]\le H(X_0)+\epsilon'$.

Converse: Lower Bound

$\forall \epsilon>0$, for all lossless compression schemes $((e_n,d_n),n\ge 1)$, we have

$$ \liminf_{n\to\infty}\frac1nE[l(e_n(\bold X^n))]\ge H(X_0)-\epsilon $$

Proof

Omitted (too lazy to write it out).

First Shannon Theorem: Source Coding Theorem

Also known as the lossless variable-length source coding theorem: it gives the fundamental limit (critical rate) of lossless data compression.

This theorem is the combination of the upper bound (achievability) and the lower bound (converse) above.