INFOTH Note 21: Polar Codes and Related Theories (Bhattacharyya, Martingale)
Reference: E. Arikan, “Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels,” IEEE Trans. Inform. Theory, vol. 55, no. 7, pp. 3051–3073, July 2009.
21.1 Core Idea of Polar Codes
-
Notation. Throughout this note, $I(W)$ denotes the symmetric capacity $I(X;Y)$ under uniform input $X$. For BISO channels, uniform input achieves capacity, so $I(W)=C(W)$.
本文中 $I(W)$ 始终指 uniform 输入下的对称容量 $I(X;Y)$。对于 BISO 信道,uniform 输入恰好达到容量,因此 $I(W)=C(W)$。
-
Starting point. We have a noisy channel $W$ with capacity $I(W)$. Shannon’s channel coding theorem says reliable communication at rate $I(W)$ is possible, but does not give a practical construction.
有一条噪声信道 $W$,容量 $I(W)$。Shannon: 能以 $I(W)$ 的 rate 可靠通信,但没有给出实际的构造方法。
-
Arikan’s idea. Take $N=2$ copies of $W$. Instead of using them independently, mix the inputs through a linear pre-processing step: $x_1=u_1\oplus u_2$, $x_2=u_2$, a butterfly transform analogous to the FFT.
使用 $N=2$ 份 $W$,把输入 $(u_1, u_2)$ 经线性变换得到 $(x_1, x_2)$:$x_1 = u_1 \oplus u_2,~~ x_2 = u_2$,这是一个类似 FFT 的蝴蝶变换。
-
Channel splitting. After mixing, the decoder wants to recover $(u_1,u_2)$ from the outputs $(y_1,y_2)$. Rather than the intuitive order of decoding $u_2$ first (since $x_2=u_2$ directly), we deliberately decode $u_1$ first and then $u_2$, which enables polarization.
混合后从解码端看,我们想通过输出 $(y_1,y_2)$ 解码 $(u_1,u_2)$. 我们故意设计先不解码直观的 $u_2$ 而是先解码 $u_1$,再解码 $u_2$,这样才能极化两个信道。
The two identical physical channels $W$ can be viewed as two different synthetic sub-channels $W'$ and $W''$:
原来的两个相同物理信道 $W$ 可以看成两个不同的合成子信道 $W'$ 和 $W''$;
-
In $W'$, the sole input is $u_1$, while $u_2$ is unknown, an uncontrollable random variable that must be summed over (marginalized out). The presence of this unknown $u_2$ degrades the decoding of $u_1$.
在 $W'$ 我们假设输入只有 $u_1$,而 $u_2$ 是未知信息,即不可控的随机噪声,因此要对 $u_2$ 求和消去,并且直觉上存在额外噪声导致解码 $u_1$ 的性能变差;
-
In $W''$, the sole input is $u_2$, and $u_1$ has already been decoded; it is known side information available to the decoder before decoding $u_2$. Mathematically, $u_1$ is grouped with $y_1,y_2$ as part of the output. Since $x_1=u_1\oplus u_2$ and $u_1$ is known, the decoder can use $y_1$ to extract additional information about $u_2$.
而在 $W''$ 中我们假设 $u_2$ 是唯一的输入,此时 $u_1$ 已经被解码,是已知信息,也即 $u_1$ 作为 decoder 在解码前拿到的辅助信息存在,因此数学上可以和 $y_1,y_2$ 归在一起;直觉上可以清除 $x_1=u_1 \oplus u_2$ 对 $u_2$ 的干扰,从而提升性能。
也即:
$$ W': u_1 \mapsto (y_1,y_2), \text{unknown } u_2 \text{ is marginalized out}; $$ $$ W'': u_2 \mapsto (y_1,y_2,u_1), u_1 \text{ is known side info}. $$Total information is conserved: $I(W')+I(W'')=2I(W)$, but redistributed unevenly.
总信息量守恒:$I(W') + I(W'') = 2I(W)$,但一多一少。
-
-
Recursion. Apply the same transform (copy, XOR, split) recursively to $W'$ and $W''$ and we get 4 equivalent channels. After $n$ levels of recursion, we have $N=2^n$ sub-channels. Each level makes the good ones better and the bad ones worse.
对 $W'$ 和 $W''$ 各自再做同样的 copy-XOR-split,就可以产出 4 个等效信道。$n$ 层之后有 $N=2^n$ 个子信道。每一层都让好的更好、差的更差。
-
Polarization. As $N\to\infty$, every sub-channel becomes either perfectly reliable ($I\to 1$) or completely useless ($I\to 0$). The fraction of perfect channels approaches exactly $I(W)$.
$N\to\infty$ 时,所有子信道要么 $I\to 1$(完美),要么 $I\to 0$(无用)。完美的那些占比恰好趋向 $I(W)$。
-
Usage. Transmit data only on the good sub-channels ($I\approx 1$). Freeze the bad ones ($I\approx 0$) to a fixed known value (e.g., $0$). The resulting rate approaches $I(W)$ i.e. achieves capacity.
只在好的子信道上发数据,坏的 freeze 成 $0$。速率 $\approx I(W)$,达到容量。
21.2 Bhattacharyya Parameter
Definition
For a binary-input channel $W:\{0,1\}\to\mathcal Y$, the Bhattacharyya parameter is defined as
$$ Z(W)=\sum_{y\in\mathcal Y} \sqrt{W(y|0)\,W(y|1)}. $$Geometrically, $Z(W)$ is the inner product of two vectors $(\sqrt{W(y|0)})_{y\in\mathcal Y}$ and $(\sqrt{W(y|1)})_{y\in\mathcal Y}$, each of unit norm.
- $Z(W)\to 0$: the channel is reliable (close to perfect).
- $Z(W)\to 1$: the channel is noisy (close to useless).
Bound on Error Probability
Bayesian Optimal Decision Rule
Consider estimating the equiprobable input $x\in\{0,1\}$ given output $y\in\mathcal Y$, i.e. $p(x=0)=p(x=1)=\frac12$.
The optimal decision rule (MAP with uniform prior) is:
$$ \hat x=\arg\max_{x\in\{0,1\}}p(x|y)=\begin{cases} 1,&\text{if }p(y|1)\ge p(y|0),\\ 0,&\text{if }p(y|1)\lt p(y|0).\end{cases} $$In other words, the optimal decision rule is:
$$ \frac{p(y|1)}{p(y|0)}\ge 1 \implies \text{decide }1,\quad\text{otherwise decide }0. $$Bounding
The error probability satisfies
$$ \begin{aligned} P(\text{error})&=\overset{\frac{1}{2}}{\cancel{p(x=0)}}\sum_{y:\,p(y|1)/p(y|0)\ge 1}p(y|0)+\overset{\frac{1}{2}}{\cancel{p(x=1)}}\sum_{y:\,p(y|1)/p(y|0)\lt 1}p(y|1)\\\\ &=\frac{1}{2}\sum_{y}\min\big(p(y|0),\,p(y|1)\big)\\\\ &\le\frac{1}{2}\sum_{y}\sqrt{p(y|0)\,p(y|1)}=\frac{1}{2}\,Z(W). \end{aligned} $$Inequalities of Symmetric Capacity and Bh. Param. for BISO Channels
For BISO (Binary Input Symmetric Output) channels, the Bhattacharyya parameter $Z(W)$ and the symmetric capacity $I(W)$ satisfy (Arikan, Prop. 1):
$$ \log \frac{2}{1+Z(W)} \le I(W) \le \sqrt{1-Z(W)^2}. $$A stronger lower bound $I(W)\ge 1-Z(W)$ (with equality iff $W$ is a BEC) also holds, but requires the full polarization argument (Arikan, Prop. 11).
Here are some conceived proof sketches from Claude…
Proof sketch: lower bound $I(W)\ge\log\frac{2}{1+Z(W)}$
Since $I(W)=I(X;Y)=H(X)-H(X|Y)$ and $H(X)=1$ for uniform binary input, we have $I(W)=1-H(X|Y)$. Let $w_y=\frac{W(y|0)+W(y|1)}{2}$ and $p_y=\frac{W(y|0)}{W(y|0)+W(y|1)}$, so
$$H(X|Y)=\sum_y w_y\, h_2(p_y).$$Using the pointwise bound $h_2(p)\le\log(1+2\sqrt{p(1-p)})$ and noting that
$$2\sqrt{p_y(1-p_y)}=\frac{2\sqrt{W(y|0)W(y|1)}}{W(y|0)+W(y|1)},$$we get
$$H(X|Y)\le\sum_y w_y\log\!\left(1+\frac{2\sqrt{W(y|0)W(y|1)}}{W(y|0)+W(y|1)}\right).$$Since $\log$ is concave and $\sum_y w_y=1$, Jensen’s inequality gives
$$H(X|Y)\le\log\!\left(1+\sum_y\sqrt{W(y|0)W(y|1)}\right)=\log(1+Z(W)).$$Therefore $I(W)=1-H(X|Y)\ge 1-\log(1+Z(W))=\log\frac{2}{1+Z(W)}$. $\blacksquare$
Proof sketch: upper bound $I(W)\le\sqrt{1-Z(W)^2}$
Again $I(W)=\sum_y w_y(1-h_2(p_y))$. Using the pointwise bound $1-h_2(p)\le|1-2p|$, we get
$$I(W)\le\sum_y w_y\,|1-2p_y|=\frac{1}{2}\sum_y|W(y|0)-W(y|1)|.$$Factor: $|W(y|0)-W(y|1)|=|\sqrt{W(y|0)}-\sqrt{W(y|1)}|\cdot(\sqrt{W(y|0)}+\sqrt{W(y|1)})$. By Cauchy–Schwarz,
$$\frac{1}{2}\sum_y|W(y|0)-W(y|1)|\le\frac{1}{2}\sqrt{\sum_y(\sqrt{W(y|0)}-\sqrt{W(y|1)})^2}\cdot\sqrt{\sum_y(\sqrt{W(y|0)}+\sqrt{W(y|1)})^2}.$$Computing each factor:
$$\sum_y(\sqrt{W(y|0)}-\sqrt{W(y|1)})^2=2-2Z(W),$$ $$\sum_y(\sqrt{W(y|0)}+\sqrt{W(y|1)})^2=2+2Z(W).$$Hence $I(W)\le\frac{1}{2}\sqrt{(2-2Z)(2+2Z)}=\sqrt{1-Z(W)^2}$. $\blacksquare$
21.3 Channel Splitting: $W'$ and $W''$
The Polar Transform
Given a BISO channel $W$ with output alphabet $\mathcal Y$, take two independent copies and apply the linear pre-processing:
$$ x_1 = u_1\oplus u_2,\qquad x_2 = u_2. $$Then $x_1$ is sent through the first copy of $W$ producing $y_1$, and $x_2$ through the second copy producing $y_2$.
Sub-channel Definitions
This gives rise to two synthetic sub-channels $W'$ and $W''$ (Arikan, eqns. 17–18), corresponding to $W_2^{(1)}$ (worse) and $W_2^{(2)}$ (better) respectively.
Sub-channel $W'$
- output alphabet $\mathcal Y\times\mathcal Y$
- marginalize over unknown $u_2$, whose $P(u_2=0)=P(u_2=1)=\frac12$
For example, $W'(y_1,y_2\mid 0)=\frac{1}{2}W(y_1|0)W(y_2|0)+\frac{1}{2}W(y_1|1)W(y_2|1)$.
Sub-channel $W''$
- output alphabet $\mathcal Y\times\mathcal Y\times\{0,1\}$
- condition on known $u_1$ (assigned already) whose $P(u_1)=\frac12$
Identity of Mutual Information
Since the polar transform $(u_1,u_2)\mapsto(x_1,x_2)$ is a bijection, $I(Y_1,Y_2;\,U_1,U_2)=I(Y_1,Y_2;\,X_1,X_2)=I(Y_1;X_1)+I(Y_2;X_2)=2I(W)$. By the chain rule of mutual information:
$$ \underbrace{I(Y_1,Y_2;\,U_1,U_2)}_{2I(W)}=\underbrace{I(Y_1,Y_2;\,U_1)}_{I(W')}+\underbrace{I(Y_1,Y_2,U_1;\,U_2)}_{I(W'')}. $$We will prove $I(W')\le I(W)$ and $I(W'')\ge I(W)$ in the following sections.
Note that the total mutual information is conserved ($2I(W)$), but it has been redistributed: one sub-channel gets more, the other gets less. Iterating this splitting is what leads to polarization.
Bhattacharyya Parameters of the Split Channels
The Bhattacharyya parameters satisfy (Arikan, Prop. 5):
$$ \begin{cases} Z(W'')&=Z(W)^2,\\ Z(W')&\le 2Z(W)-Z(W)^2. \end{cases} $$Summing: $Z(W')+Z(W'')\le 2Z(W)$.
Proof: $Z(W'')=Z(W)^2$
By definition,
$$Z(W'')=\sum_{y_1,y_2,u}\sqrt{W''(y_1,y_2,u|0)\,W''(y_1,y_2,u|1)}.$$Substituting the expressions for $W''$ derived above:
$$ \begin{aligned} Z(W'')&=\dots=\sum_{y_1,y_2,u}\sqrt{\frac{1}{2}W(y_1|u)W(y_2|0)\cdot\frac{1}{2}W(y_1|u\oplus 1)W(y_2|1)}\\\\ &=\frac{1}{2}\sum_{y_1,y_2,u}\sqrt{W(y_1|u)\,W(y_1|u\oplus 1)}\cdot\sqrt{W(y_2|0)\,W(y_2|1)}. \end{aligned} $$The sum over $y_2$ gives $Z(W)$. For the sum over $y_1$: by BISO symmetry, $\sum_{y_1}\sqrt{W(y_1|u)\,W(y_1|u\oplus 1)}=Z(W)$ for both $u=0$ and $u=1$. Summing over $u\in\{0,1\}$:
$$Z(W'')=\frac{1}{2}\cdot 2\cdot Z(W)\cdot Z(W)=Z(W)^2.\quad\blacksquare$$Proof: $Z(W')\le 2Z(W)-Z(W)^2$
By definition,
$$Z(W')=\sum_{y_1,y_2}\sqrt{W'(y_1,y_2|0)\,W'(y_1,y_2|1)}.$$Substituting and pulling out $\frac{1}{2}$:
$$Z(W')=\frac{1}{2}\sum_{y_1,y_2}\sqrt{\big[W(y_1|0)W(y_2|0)+W(y_1|1)W(y_2|1)\big]\big[W(y_1|1)W(y_2|0)+W(y_1|0)W(y_2|1)\big]}.$$Applying $\sqrt{(a+b)(c+d)}\le\sqrt{ac}+\sqrt{bd}$ (Cauchy–Schwarz) with $a=W(y_1|0)W(y_2|0)$, $b=W(y_1|1)W(y_2|1)$, $c=W(y_1|1)W(y_2|0)$, $d=W(y_1|0)W(y_2|1)$:
$$ \begin{aligned} \sqrt{ac}&=\sqrt{W(y_1|0)W(y_1|1)}\cdot W(y_2|0),\\\\ \sqrt{bd}&=\sqrt{W(y_1|0)W(y_1|1)}\cdot W(y_2|1). \end{aligned} $$Summing over $y_1,y_2$:
$$Z(W')\le\frac{1}{2}\sum_{y_1}\sqrt{W(y_1|0)W(y_1|1)}\cdot\sum_{y_2}\big[W(y_2|0)+W(y_2|1)\big]=\frac{1}{2}\cdot Z(W)\cdot 2=Z(W).$$Since $Z(W)\in[0,1]$, we have $Z(W)^2\le Z(W)$, hence $2Z(W)-Z(W)^2\ge Z(W)$. Therefore:
$$Z(W')\le Z(W)\le 2Z(W)-Z(W)^2.\quad\blacksquare$$Recursion to $N=2^n$
The same $1\to 2$ split is applied recursively. At depth $n$, we have $N=2^n$ sub-channels $W_N^{(1)},\dots,W_N^{(N)}$, where
$$ W_N^{(i)}: U_i \mapsto (Y_1^N,\, U_1^{i-1}). $$The decoding is sequential: decode $U_1$ first (worst channel, no side information), then $U_2$ given decoded $U_1$, then $U_3$ given $U_1,U_2$, and so on. Each subsequent decoder has more side information.
The total capacity is preserved at every level:
$$ \sum_{i=1}^N I(W_N^{(i)}) = N\cdot I(W). $$21.4 Martingale Theory
Reference: Handout 11 of EE 229A Fall 2024 @ UC Berkeley, V. Anantharam.
A martingale is a discrete-time stochastic process that captures the notion of a fair sequence of bets: the conditional expectation of the increment at any time, given the information available at that time, should be zero.
鞅 (martingale) 是一种离散时间随机过程,用于刻画公平赌局的概念:在任意时刻,给定当前已知的全部信息,增量的条件期望为零。
Definition: Martingale
A real-valued discrete-time stochastic process $(M_n,\,n\ge 0)$ is called a martingale if:
实值离散时间随机过程 $(M_n,\,n\ge 0)$ 称为鞅,若满足:
-
$E[|M_n|]<\infty, \forall n\ge 0$.
Integrability; note that there need not be a single upper bound on all $E[|M_n|]$.
可积性;注意不要求 $E[|M_n|]$ 有统一上界,只需每个有限。
-
$E[M_{n+1}\mid M_0,\dots,M_n]=M_n$ for all $n\ge 0$.
Fair Game Property. Given all history $M_0,M_1,\dots,M_n$ until time $n$, the expected value (best estimate) of next step $M_{n+1}$ is still $M_n$. In other words, given the past and present values of the process, the best prediction for the future value is the current value.
公平博弈性质。给定直到时刻 $n$ 的所有历史 $M_0,M_1,\dots,M_n$,下一步 $M_{n+1}$ 的期望值(最佳估计)仍是 $M_n$。已知现在和过去,对未来的预测仍然是现在。
Example
Let $\xi_1,\xi_2,\dots$ be independent random variables (not necessarily identically distributed) with $E[|\xi_m|]\lt\infty$ and $E[\xi_m]=0$ for all $m\ge 1$. Let $S_0$ be independent of $(\xi_1,\xi_2,\dots)$ with $E[|S_0|]\lt\infty$. Define $M_0\triangleq S_0$ and $M_n\triangleq S_0+\xi_1+\cdots+\xi_n$ for $n\ge 1$. Then $(M_n)$ is a martingale.
设 $\xi_1,\xi_2,\dots$ 独立随机变量(不要求同分布),$E[|\xi_m|]\lt\infty$,$E[\xi_m]=0$。$S_0$ 与 $(\xi_1,\xi_2,\dots)$ 独立且 $E[|S_0|]\lt\infty$;令 $M_0\triangleq S_0$,$M_n\triangleq S_0+\xi_1+\cdots+\xi_n$。则 $(M_n)$ 是鞅。
Proof
Condition (1): $E[|M_n|]\le E[|S_0|]+\sum_{i=1}^n E[|\xi_i|]\lt\infty$ by triangle inequality. Condition (2): since $M_{n+1}=M_n+\xi_{n+1}$,
$$ E[M_{n+1}\mid M_0,\dots,M_n]\overset{(a)}{=}E[M_n\mid M_0,\dots,M_n]+E[\xi_{n+1}\mid M_0,\dots,M_n]\overset{(b)}{=}M_n+E[\xi_{n+1}]\overset{(c)}{=}M_n. $$$(a)$: linearity of conditional expectation. $(b)$: $M_n$ is $\sigma(M_0,\dots,M_n)$-measurable (i.e. $M_n$ is fully determined by $(M_0,\dots,M_n)$, since it is literally one of them), so conditioning on known information gives $E[M_n\mid M_0,\dots,M_n]=M_n$; and $\xi_{n+1}$ is independent of $(S_0,\xi_1,\dots,\xi_n)$, hence independent of $(M_0,\dots,M_n)$, so the conditioning drops. $(c)$: $E[\xi_{n+1}]=0$. $\blacksquare$
条件 (1):由三角不等式,$E[|M_n|]\le E[|S_0|]+\sum_{i=1}^n E[|\xi_i|]\lt\infty$。条件 (2):因为 $M_{n+1}=M_n+\xi_{n+1}$,
$$ E[M_{n+1}\mid M_0,\dots,M_n]\overset{(a)}{=}E[M_n\mid M_0,\dots,M_n]+E[\xi_{n+1}\mid M_0,\dots,M_n]\overset{(b)}{=}M_n+E[\xi_{n+1}]\overset{(c)}{=}M_n. $$$(a)$:条件期望的线性性。$(b)$:$M_n$ 关于 $\sigma(M_0,\dots,M_n)$ 可测(即 $M_n$ 的值完全由 $(M_0,\dots,M_n)$ 决定,因为它本身就是其中之一),对已知量求条件期望等于自身,故 $E[M_n\mid M_0,\dots,M_n]=M_n$;又 $\xi_{n+1}$ 与 $(S_0,\xi_1,\dots,\xi_n)$ 独立,从而与 $(M_0,\dots,M_n)$ 独立,条件可去掉。$(c)$:$E[\xi_{n+1}]=0$。$\blacksquare$
Martingale w.r.t. Another Process
More generally, the “information available at time $n$” may be larger than that in $(M_0,\dots,M_n)$.
更一般地,时刻 $n$ 可用的信息可以比 $(M_0,\dots,M_n)$ 本身更多。
Let $(X_n,\,n\ge 0)$ be an arbitrary discrete-time stochastic process. A real-valued process $(M_n,\,n\ge 0)$ is called a martingale with respect to $(X_n)$ if:
设 $(X_n,\,n\ge 0)$ 是任意离散时间随机过程。实值过程 $(M_n,\,n\ge 0)$ 称为关于 $(X_n)$ 的鞅,若:
- $E[|M_n|]\lt\infty$ for all $n\ge 0$,
- $E[M_{n+1}\mid X_n,\dots,X_0]=M_n$ for all $n\ge 0$.
This implies $M_n$ is a function of $(X_0,\dots,X_n)$. This generalization is needed for the polarization argument: the capacity process $I_n=I(W_{2^n}^{(B_1\cdots B_n)+1})$ is a martingale w.r.t. the random bit sequence $(B_1,B_2,\dots)$, because the capacity at depth $n$ depends on the entire path $(B_1,\dots,B_n)$ through the splitting tree, not just on the capacity values seen so far.
这意味着 $M_n$ 是 $(X_0,\dots,X_n)$ 的函数。极化证明需要这个推广:容量过程 $I_n=I(W_{2^n}^{(B_1\cdots B_n)+1})$ 是关于随机比特序列 $(B_1,B_2,\dots)$ 的鞅,因为第 $n$ 层的容量取决于在分裂树中走过的完整路径 $(B_1,\dots,B_n)$,而不仅仅是之前看到的容量值。
Submartingale
Definition: Submartingale
A real-valued process $(M_n,\,n\ge 0)$ is a submartingale w.r.t. $(X_n,\,n\ge 0)$ if $M_n$ is a function of $(X_n,\dots,X_0)$ and:
实值过程 $(M_n,\,n\ge 0)$ 是关于 $(X_n,\,n\ge 0)$ 的下鞅 (submartingale),若 $M_n$ 是 $(X_n,\dots,X_0)$ 的函数且:
- $E[|M_n|]\lt\infty$ for all $n\ge 0$,
- $E[M_{n+1}\mid X_n,\dots,X_0]\ge M_n$ for all $n\ge 0$.
The process tends to increase (or stay flat) in expectation.
过程的期望趋势是上升(或不变)。
Doob’s Submartingale Convergence Theorem
Theorem (Doob). Let $(M_n,\,n\ge 0)$ be a submartingale w.r.t. $(X_n,\,n\ge 0)$. If $\sup_{n\ge 0}E[|M_n|]\lt\infty$, then there is a random variable $M$ such that $M_n\xrightarrow{\text{a.s.}}M$ as $n\to\infty$. Here a.s. (almost surely) means $P(\lim_{n\to\infty}M_n=M)=1$, i.e. convergence holds with probability 1, allowing exceptions only on a set of measure zero.
定理 (Doob)。 设 $(M_n,\,n\ge 0)$ 是关于 $(X_n)$ 的下鞅。若 $\sup_{n\ge 0}E[|M_n|]\lt\infty$,则存在随机变量 $M$ 使得 $M_n\xrightarrow{\text{a.s.}}M$。此处 a.s.(almost surely,几乎必然)指 $P(\lim_{n\to\infty}M_n=M)=1$,即以概率 1 收敛,仅允许零测集上的例外。
Proof: See Chapter VII of “Probability” by A. N. Shiryayev, pg. 476.
Example: Likelihood Ratio Martingale
Let $f(x\mid 0)$ and $f(x\mid 1)$ be two densities on $\mathbb{R}$ (null and alternate hypothesis). Define the likelihood ratio $l(x)\triangleq f(x\mid 1)/f(x\mid 0)$, assuming $\int l(x)f(x\mid 0)\,dx=1$.
设 $f(x\mid 0)$、$f(x\mid 1)$ 为 $\mathbb R$ 上两个密度(零假设和备择假设)。似然比 $l(x)\triangleq f(x\mid 1)/f(x\mid 0)$,假设 $\int l(x)f(x\mid 0)\,dx=1$。
Let $\xi_1,\xi_2,\dots$ be i.i.d. with density $f(x\mid 0)$. Define $M_0\triangleq 1$, $M_n\triangleq l(\xi_1)\cdots l(\xi_n)$ for $n\ge 1$. Then $M_n$ is the cumulative likelihood ratio of the first $n$ observations, measuring how much the data collectively resemble $H_1$ versus $H_0$.
设 $\xi_1,\xi_2,\dots$ i.i.d. 密度 $f(x\mid 0)$。令 $M_0\triangleq 1$,$M_n\triangleq l(\xi_1)\cdots l(\xi_n)$。$M_n$ 是前 $n$ 个观测的累积似然比,衡量数据整体上有多像 $H_1$ 而非 $H_0$。
Then $(M_n)$ is a nonnegative martingale with $E[M_n]=1$ for all $n$: since $M_{n+1}=M_n\cdot l(\xi_{n+1})$ and $l(\xi_{n+1})$ is independent of $(M_0,\dots,M_n)$,
则 $(M_n)$ 是非负鞅且 $E[M_n]=1$:因为 $M_{n+1}=M_n\cdot l(\xi_{n+1})$,$l(\xi_{n+1})$ 独立于 $(M_0,\dots,M_n)$,
$$ E[M_{n+1}\mid M_n,\dots,M_0]=M_n\cdot E[l(\xi_{n+1})]=M_n\cdot\underbrace{\int \frac{f(x\mid 1)}{f(x\mid 0)}f(x\mid 0)\,dx}_{=\int f(x\mid 1)\,dx=1}=M_n. $$Since $M_n\ge 0$, we have $E[|M_n|]=E[M_n]=1$, so $\sup_n E[|M_n|]=1\lt\infty$. By Doob’s theorem, $M_n\xrightarrow{\text{a.s.}}M_\infty$.
由 $M_n\ge 0$ 得 $E[|M_n|]=E[M_n]=1$,故 $\sup_n E[|M_n|]=1\lt\infty$。由 Doob 定理,$M_n\xrightarrow{\text{a.s.}}M_\infty$。
In fact, unless $f(x\mid 0)\equiv f(x\mid 1)$, the a.s. limit is $M_\infty=0$ (despite $E[M_n]=1$ for all $n$). For example, take $f(x\mid 0)=N(0,1)$ and $f(x\mid 1)=N(1,1)$:
事实上,除非 $f(x\mid 0)\equiv f(x\mid 1)$,几乎必然极限为 $M_\infty=0$(尽管 $E[M_n]=1$)。例如取 $f(x\mid 0)=N(0,1)$,$f(x\mid 1)=N(1,1)$:
$$ \frac{1}{n}\log M_n=\frac{1}{n}\sum_{i=1}^n \xi_i-\frac{1}{2}\xrightarrow{\text{a.s.}}-\frac{1}{2}\lt 0, $$so $M_n\to 0$ a.s. by the strong law of large numbers. Under $H_0$, the cumulative likelihood ratio shrinks to zero (correctly rejecting $H_1$), yet $E[M_n]=1$ because a vanishingly rare set of extreme sample paths contributes enormous values. This illustrates that a.s. convergence and convergence of expectations can behave very differently.
由大数定律 $M_n\to 0$ a.s.。在 $H_0$ 下,累积似然比趋向零(正确否定 $H_1$),但 $E[M_n]=1$ 因为极少数极端样本路径贡献了巨大的值。这说明几乎必然收敛和期望收敛可以非常不同。
Supermartingale
Definition: Supermartingale
A real-valued process $(M_n,\,n\ge 0)$ is a supermartingale w.r.t. $(X_n,\,n\ge 0)$ if $(\tilde M_n)\triangleq(-M_n)$ is a submartingale w.r.t. $(X_n)$. Equivalently:
实值过程 $(M_n)$ 是上鞅 (supermartingale),若 $(\tilde M_n)\triangleq(-M_n)$ 是下鞅。等价条件:
- $E[|M_n|]\lt\infty$ for all $n\ge 0$,
- $E[M_{n+1}\mid X_n,\dots,X_0]\le M_n$ for all $n\ge 0$.
The process tends to decrease (or stay flat) in expectation.
过程的期望趋势是下降(或不变)。
Nonnegative Supermartingale Convergence
Theorem. Let $(M_n,\,n\ge 0)$ be a nonnegative supermartingale w.r.t. $(X_n)$. Then $M_n\xrightarrow{\text{a.s.}}M$ for some finite random variable $M$.
定理。 非负上鞅 $(M_n,\,n\ge 0)$ 必定几乎必然收敛到某有限随机变量 $M$。
Proof (from Handout 11)
Let $\tilde M_n:=-M_n$. Then $(\tilde M_n)$ is a nonpositive submartingale. Since
$$E[|\tilde M_{n+1}|]=-E[\tilde M_{n+1}]=-E[E[\tilde M_{n+1}\mid X_n,\dots,X_0]]\le -E[\tilde M_n]=E[|\tilde M_n|],$$we have $\sup_{n\ge 0}E[|\tilde M_n|]\lt\infty$, so by Doob’s submartingale convergence theorem $\tilde M_n$ converges a.s., and so does $M_n$. $\blacksquare$
21.5 Polarization
The Capacity Process is a Martingale
Binary Tree and Random Paths
The recursive splitting forms a binary tree. At depth $n$, there are $N=2^n$ sub-channels (leaves). Each sub-channel corresponds to a unique path from the root: at each node, going left ($W'$, the worse channel) or right ($W''$, the better channel).
递归分裂形成一棵二叉树。深度 $n$ 时有 $N=2^n$ 个子信道(叶子)。每个子信道对应从根到叶的唯一路径:每个节点处,向左走($W'$,较差信道)或向右走($W''$,较好信道)。
Let $B_1,B_2,\dots$ be i.i.d. uniform $\{0,1\}$-valued, representing a random path through this tree: $B_i=0$ means “go left” (take $W'$) at depth $i$, $B_i=1$ means “go right” (take $W''$). The bit string $(B_1 B_2\cdots B_n)$ encodes the integer $B_1\cdot 2^{n-1}+\cdots+B_n\in[0,2^n-1]$, the index of the sub-channel reached.
设 $B_1,B_2,\dots$ i.i.d. $\{0,1\}$ 上的均匀分布,表示树中的随机路径:$B_i=0$ 在第 $i$ 层向左(取 $W'$),$B_i=1$ 向右(取 $W''$)。比特串 $(B_1\cdots B_n)$ 编码整数 $B_1\cdot 2^{n-1}+\cdots+B_n\in[0,2^n-1]$,即子信道编号。
Capacity of a Sub-Channel as a R.V.
Setting $N=2^n$, the capacity along this random path is:
令 $N=2^n$,沿随机路径的容量是:
$$ I_n \triangleq I(W_N^{(B_1\cdots B_n)+1}),\quad n\ge 1 $$where $(B_1\cdots B_n)+1$ means the bit string encodes an integer in $[0, 2^n-1]$ and the $+1$ shifts to 1-indexed sub-channel numbering $W_N^{(1)},\dots,W_N^{(N)}$. Since $(B_1,\dots,B_n)$ are random, $I_n$ is a random variable.
其中 $(B_1\cdots B_n)+1$ 是因为比特串编码的整数范围是 $[0,2^n-1]$,而子信道编号 $W_N^{(1)},\dots,W_N^{(N)}$ 从 1 开始,所以 $+1$ 做 0-indexed 到 1-indexed 的转换。由于 $(B_1,\dots,B_n)$ 是随机的,$I_n$ 是一个随机变量。
$I_n$ is a martingale w.r.t. $(B_1,B_2,\dots)$. At depth $n$, the current sub-channel splits into two at depth $n+1$. Since $B_{n+1}$ is uniform and independent of $(B_1,\dots,B_n)$, using $I(W')+I(W'')=2I(W)$ from Section 21.3:
$I_n$ 是关于 $(B_1,B_2,\dots)$ 的鞅。在深度 $n$,当前子信道在深度 $n+1$ 分裂为二。因为 $B_{n+1}$ 均匀且独立于 $(B_1,\dots,B_n)$,利用 21.3 的 $I(W')+I(W'')=2I(W)$:
$$ E[I_{n+1}\mid B_1,\dots,B_n]=\frac{1}{2}\,I(W')+\frac{1}{2}\,I(W'')=\frac12 \cdot 2 I(W)=\frac{1}{2}\cdot 2\,I_n=I_n. $$The Bhattacharyya Parameter Process is a Supermartingale
From Section 21.3, $Z(W')+Z(W'')\le 2Z(W)$.
Let $Z_n$ be the Bhattacharyya Parameter for the $n$-th level. Then
$$ E[Z_{n+1}\mid B_1,\dots,B_n]=\frac{1}{2}\,Z(W')+\frac{1}{2}\,Z(W'')\le \frac12 \cdot 2 Z(W)=\frac{1}{2}\cdot 2\,Z_n=Z_n. $$By the same random-path argument, $Z_n$ is a nonnegative supermartingale w.r.t. $(B_1,B_2,\dots)$.
Convergence and Polarization
By the martingale convergence theorem (Section 21.4): $I_n\in[0,1]$ is a bounded martingale and $Z_n\in[0,1]$ is a nonneg supermartingale, so both converge a.s., denoted as $I_n\to I_\infty$ and $Z_n\to Z_\infty$.
At each step, with probability $\frac12$ the process takes the $W''$ branch where $Z_{n+1}=Z_n^2$.
If $Z_\infty\in(0,1)$, then $Z_\infty^2\neq Z_\infty$, so with probability $\frac12$ the next step jumps to $Z_\infty^2\lt Z_\infty$, contradicting convergence. The only fixed points of $z\mapsto z^2$ in $[0,1]$ are $z=0$ and $z=1$, so $Z_\infty\in\{0,1\}$.
As a capacity, $I_\infty \in [0, 1]$ since the input alphabet has 2 symbols and $\log 2 = 1$.
Combined with the $Z$-$I$ bounds for BISO channels (Section 21.2):
$$ \log \frac{2}{1+Z(W)} \le I(W) \le \sqrt{1-Z(W)^2}, $$$Z_\infty=0$ forces $I_\infty\ge\log 2=1$, hence $I_\infty=1$; $Z_\infty=1$ forces $I_\infty\le 0$, hence $I_\infty=0$. Therefore as $N\to\infty$, for the sub-channels $W_N^{(1)},\dots,W_N^{(N)}$:
$$ Z(W_N^{(i)})\to 0\text{ or }1,\qquad I(W_N^{(i)})\to 1\text{ or }0, $$almost surely. Each sub-channel becomes either perfectly reliable or completely useless. The average capacity is preserved:
$$ \frac{1}{N}\sum_{i=1}^N I(W_N^{(i)})=I(W). $$Choosing the Good Sub-channels
- Asymptotically ($N\to\infty$): select sub-channels where $I(W_N^{(i)})\to 1$ (equivalently $Z(W_N^{(i)})\to 0$). The fraction of such channels approaches $I(W)$.
- Finite $N$: a practical heuristic is to pick sub-channels with $Z\lt 0.5$, yielding roughly $N\cdot I(W)$ good sub-channels.
- The remaining (bad) sub-channels are frozen to a fixed known value (e.g., $0$); no information is transmitted through them.