INFOTH Note 21: Polar Codes and Related Theories (Bhattacharyya, Martingale)

2025-11-18

10 min read

Notes

Information Theory and Coding (UCB FA24 EE229A, rewritten in 2025)

Reference: E. Arikan, “Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels,” IEEE Trans. Inform. Theory, vol. 55, no. 7, pp. 3051–3073, July 2009.

21.1 Core Idea of Polar Codes

Notation. Throughout this note, $I(W)$ denotes the symmetric capacity $I(X;Y)$ under uniform input $X$. For BISO channels, uniform input achieves capacity, so $I(W)=C(W)$.

本文中 $I(W)$ 始终指 uniform 输入下的对称容量 $I(X;Y)$。对于 BISO 信道，uniform 输入恰好达到容量，因此 $I(W)=C(W)$。
Starting point. We have a noisy channel $W$ with capacity $I(W)$. Shannon’s channel coding theorem says reliable communication at rate $I(W)$ is possible, but does not give a practical construction.

有一条噪声信道 $W$，容量 $I(W)$。Shannon: 能以 $I(W)$ 的 rate 可靠通信，但没有给出实际的构造方法。
Arikan’s idea. Take $N=2$ copies of $W$. Instead of using them independently, mix the inputs through a linear pre-processing step: $x_1=u_1\oplus u_2$, $x_2=u_2$, a butterfly transform analogous to the FFT.

使用 $N=2$ 份 $W$，把输入 $(u_1, u_2)$ 经线性变换得到 $(x_1, x_2)$：$x_1 = u_1 \oplus u_2,~~ x_2 = u_2$，这是一个类似 FFT 的蝴蝶变换。
Channel splitting. After mixing, the decoder wants to recover $(u_1,u_2)$ from the outputs $(y_1,y_2)$. Rather than the intuitive order of decoding $u_2$ first (since $x_2=u_2$ directly), we deliberately decode $u_1$ first and then $u_2$, which enables polarization.

混合后从解码端看，我们想通过输出 $(y_1,y_2)$ 解码 $(u_1,u_2)$. 我们故意设计先不解码直观的 $u_2$ 而是先解码 $u_1$，再解码 $u_2$，这样才能极化两个信道。

The two identical physical channels $W$ can be viewed as two different synthetic sub-channels $W'$ and $W''$:

原来的两个相同物理信道 $W$ 可以看成两个不同的合成子信道 $W'$ 和 $W''$；
- In $W'$, the sole input is $u_1$, while $u_2$ is unknown, an uncontrollable random variable that must be summed over (marginalized out). The presence of this unknown $u_2$ degrades the decoding of $u_1$.
  
  在 $W'$ 我们假设输入只有 $u_1$，而 $u_2$ 是未知信息，即不可控的随机噪声，因此要对 $u_2$ 求和消去，并且直觉上存在额外噪声导致解码 $u_1$ 的性能变差；
- In $W''$, the sole input is $u_2$, and $u_1$ has already been decoded; it is known side information available to the decoder before decoding $u_2$. Mathematically, $u_1$ is grouped with $y_1,y_2$ as part of the output. Since $x_1=u_1\oplus u_2$ and $u_1$ is known, the decoder can use $y_1$ to extract additional information about $u_2$.
  
  而在 $W''$ 中我们假设 $u_2$ 是唯一的输入，此时 $u_1$ 已经被解码，是已知信息，也即 $u_1$ 作为 decoder 在解码前拿到的辅助信息存在，因此数学上可以和 $y_1,y_2$ 归在一起；直觉上可以清除 $x_1=u_1 \oplus u_2$ 对 $u_2$ 的干扰，从而提升性能。
也即：
$$ W': u_1 \mapsto (y_1,y_2), \text{unknown } u_2 \text{ is marginalized out}; $$ $$ W'': u_2 \mapsto (y_1,y_2,u_1), u_1 \text{ is known side info}. $$
Total information is conserved: $I(W')+I(W'')=2I(W)$, but redistributed unevenly.

总信息量守恒：$I(W') + I(W'') = 2I(W)$，但一多一少。
Recursion. Apply the same transform (copy, XOR, split) recursively to $W'$ and $W''$ and we get 4 equivalent channels. After $n$ levels of recursion, we have $N=2^n$ sub-channels. Each level makes the good ones better and the bad ones worse.

对 $W'$ 和 $W''$ 各自再做同样的 copy-XOR-split，就可以产出 4 个等效信道。$n$ 层之后有 $N=2^n$ 个子信道。每一层都让好的更好、差的更差。
Polarization. As $N\to\infty$, every sub-channel becomes either perfectly reliable ($I\to 1$) or completely useless ($I\to 0$). The fraction of perfect channels approaches exactly $I(W)$.

$N\to\infty$ 时，所有子信道要么 $I\to 1$（完美），要么 $I\to 0$（无用）。完美的那些占比恰好趋向 $I(W)$。
Usage. Transmit data only on the good sub-channels ($I\approx 1$). Freeze the bad ones ($I\approx 0$) to a fixed known value (e.g., $0$). The resulting rate approaches $I(W)$ i.e. achieves capacity.

只在好的子信道上发数据，坏的 freeze 成 $0$。速率 $\approx I(W)$，达到容量。

21.2 Bhattacharyya Parameter

Definition

For a binary-input channel $W:\{0,1\}\to\mathcal Y$, the Bhattacharyya parameter is defined as

$$ Z(W)=\sum_{y\in\mathcal Y} \sqrt{W(y|0)\,W(y|1)}. $$

Geometrically, $Z(W)$ is the inner product of two vectors $(\sqrt{W(y|0)})_{y\in\mathcal Y}$ and $(\sqrt{W(y|1)})_{y\in\mathcal Y}$, each of unit norm.

$Z(W)\to 0$: the channel is reliable (close to perfect).
$Z(W)\to 1$: the channel is noisy (close to useless).

Bound on Error Probability

Bayesian Optimal Decision Rule

Consider estimating the equiprobable input $x\in\{0,1\}$ given output $y\in\mathcal Y$, i.e. $p(x=0)=p(x=1)=\frac12$.

The optimal decision rule (MAP with uniform prior) is:

$$ \hat x=\arg\max_{x\in\{0,1\}}p(x|y)=\begin{cases} 1,&\text{if }p(y|1)\ge p(y|0),\\ 0,&\text{if }p(y|1)\lt p(y|0).\end{cases} $$

In other words, the optimal decision rule is:

$$ \frac{p(y|1)}{p(y|0)}\ge 1 \implies \text{decide }1,\quad\text{otherwise decide }0. $$

Bounding

The error probability satisfies

$$ \begin{aligned} P(\text{error})&=\overset{\frac{1}{2}}{\cancel{p(x=0)}}\sum_{y:\,p(y|1)/p(y|0)\ge 1}p(y|0)+\overset{\frac{1}{2}}{\cancel{p(x=1)}}\sum_{y:\,p(y|1)/p(y|0)\lt 1}p(y|1)\\\\ &=\frac{1}{2}\sum_{y}\min\big(p(y|0),\,p(y|1)\big)\\\\ &\le\frac{1}{2}\sum_{y}\sqrt{p(y|0)\,p(y|1)}=\frac{1}{2}\,Z(W). \end{aligned} $$

Inequalities of Symmetric Capacity and Bh. Param. for BISO Channels

For BISO (Binary Input Symmetric Output) channels, the Bhattacharyya parameter $Z(W)$ and the symmetric capacity $I(W)$ satisfy (Arikan, Prop. 1):

$$ \log \frac{2}{1+Z(W)} \le I(W) \le \sqrt{1-Z(W)^2}. $$

A stronger lower bound $I(W)\ge 1-Z(W)$ (with equality iff $W$ is a BEC) also holds, but requires the full polarization argument (Arikan, Prop. 11).

Here are some conceived proof sketches from Claude…

Proof sketch: lower bound $I(W)\ge\log\frac{2}{1+Z(W)}$

Since $I(W)=I(X;Y)=H(X)-H(X|Y)$ and $H(X)=1$ for uniform binary input, we have $I(W)=1-H(X|Y)$. Let $w_y=\frac{W(y|0)+W(y|1)}{2}$ and $p_y=\frac{W(y|0)}{W(y|0)+W(y|1)}$, so

$$H(X|Y)=\sum_y w_y\, h_2(p_y).$$

Using the pointwise bound $h_2(p)\le\log(1+2\sqrt{p(1-p)})$ and noting that

$$2\sqrt{p_y(1-p_y)}=\frac{2\sqrt{W(y|0)W(y|1)}}{W(y|0)+W(y|1)},$$

we get

$$H(X|Y)\le\sum_y w_y\log\!\left(1+\frac{2\sqrt{W(y|0)W(y|1)}}{W(y|0)+W(y|1)}\right).$$

Since $\log$ is concave and $\sum_y w_y=1$, Jensen’s inequality gives

$$H(X|Y)\le\log\!\left(1+\sum_y\sqrt{W(y|0)W(y|1)}\right)=\log(1+Z(W)).$$

Therefore $I(W)=1-H(X|Y)\ge 1-\log(1+Z(W))=\log\frac{2}{1+Z(W)}$. $\blacksquare$

Proof sketch: upper bound $I(W)\le\sqrt{1-Z(W)^2}$

Again $I(W)=\sum_y w_y(1-h_2(p_y))$. Using the pointwise bound $1-h_2(p)\le|1-2p|$, we get

$$I(W)\le\sum_y w_y\,|1-2p_y|=\frac{1}{2}\sum_y|W(y|0)-W(y|1)|.$$

Factor: $|W(y|0)-W(y|1)|=|\sqrt{W(y|0)}-\sqrt{W(y|1)}|\cdot(\sqrt{W(y|0)}+\sqrt{W(y|1)})$. By Cauchy–Schwarz,

$$\frac{1}{2}\sum_y|W(y|0)-W(y|1)|\le\frac{1}{2}\sqrt{\sum_y(\sqrt{W(y|0)}-\sqrt{W(y|1)})^2}\cdot\sqrt{\sum_y(\sqrt{W(y|0)}+\sqrt{W(y|1)})^2}.$$

Computing each factor:

$$\sum_y(\sqrt{W(y|0)}-\sqrt{W(y|1)})^2=2-2Z(W),$$ $$\sum_y(\sqrt{W(y|0)}+\sqrt{W(y|1)})^2=2+2Z(W).$$

Hence $I(W)\le\frac{1}{2}\sqrt{(2-2Z)(2+2Z)}=\sqrt{1-Z(W)^2}$. $\blacksquare$

21.3 Channel Splitting: $W'$ and $W''$

The Polar Transform

Given a BISO channel $W$ with output alphabet $\mathcal Y$, take two independent copies and apply the linear pre-processing:

$$ x_1 = u_1\oplus u_2,\qquad x_2 = u_2. $$

Then $x_1$ is sent through the first copy of $W$ producing $y_1$, and $x_2$ through the second copy producing $y_2$.

Sub-channel Definitions

This gives rise to two synthetic sub-channels $W'$ and $W''$ (Arikan, eqns. 17–18), corresponding to $W_2^{(1)}$ (worse) and $W_2^{(2)}$ (better) respectively.

Sub-channel $W'$

output alphabet $\mathcal Y\times\mathcal Y$
marginalize over unknown $u_2$, whose $P(u_2=0)=P(u_2=1)=\frac12$

$$ \begin{aligned} W'(y_1,y_2\mid u_1)&=\sum_{u_2\in\{0,1\}}P(u_2)\,W(y_1\mid u_1\oplus u_2)\,W(y_2\mid u_2)\\\\ &=\frac{1}{2}W(y_1\mid u_1)\,W(y_2\mid 0)+\frac{1}{2}W(y_1\mid u_1\oplus 1)\,W(y_2\mid 1). \end{aligned} $$

For example, $W'(y_1,y_2\mid 0)=\frac{1}{2}W(y_1|0)W(y_2|0)+\frac{1}{2}W(y_1|1)W(y_2|1)$.

Sub-channel $W''$

output alphabet $\mathcal Y\times\mathcal Y\times\{0,1\}$
condition on known $u_1$ (assigned already) whose $P(u_1)=\frac12$

$$ W''(y_1,y_2,u_1\mid u_2)=P(u_1)\,W(y_1\mid u_1\oplus u_2)\,W(y_2\mid u_2)=\frac{1}{2}W(y_1\mid u_1\oplus u_2)\,W(y_2\mid u_2). $$

Identity of Mutual Information

Since the polar transform $(u_1,u_2)\mapsto(x_1,x_2)$ is a bijection, $I(Y_1,Y_2;\,U_1,U_2)=I(Y_1,Y_2;\,X_1,X_2)=I(Y_1;X_1)+I(Y_2;X_2)=2I(W)$. By the chain rule of mutual information:

$$ \underbrace{I(Y_1,Y_2;\,U_1,U_2)}_{2I(W)}=\underbrace{I(Y_1,Y_2;\,U_1)}_{I(W')}+\underbrace{I(Y_1,Y_2,U_1;\,U_2)}_{I(W'')}. $$

We will prove $I(W')\le I(W)$ and $I(W'')\ge I(W)$ in the following sections.

Note that the total mutual information is conserved ($2I(W)$), but it has been redistributed: one sub-channel gets more, the other gets less. Iterating this splitting is what leads to polarization.

Bhattacharyya Parameters of the Split Channels

The Bhattacharyya parameters satisfy (Arikan, Prop. 5):

$$ \begin{cases} Z(W'')&=Z(W)^2,\\ Z(W')&\le 2Z(W)-Z(W)^2. \end{cases} $$

Summing: $Z(W')+Z(W'')\le 2Z(W)$.

Proof: $Z(W'')=Z(W)^2$

By definition,

$$Z(W'')=\sum_{y_1,y_2,u}\sqrt{W''(y_1,y_2,u|0)\,W''(y_1,y_2,u|1)}.$$

Substituting the expressions for $W''$ derived above:

$$ \begin{aligned} Z(W'')&=\dots=\sum_{y_1,y_2,u}\sqrt{\frac{1}{2}W(y_1|u)W(y_2|0)\cdot\frac{1}{2}W(y_1|u\oplus 1)W(y_2|1)}\\\\ &=\frac{1}{2}\sum_{y_1,y_2,u}\sqrt{W(y_1|u)\,W(y_1|u\oplus 1)}\cdot\sqrt{W(y_2|0)\,W(y_2|1)}. \end{aligned} $$

The sum over $y_2$ gives $Z(W)$. For the sum over $y_1$: by BISO symmetry, $\sum_{y_1}\sqrt{W(y_1|u)\,W(y_1|u\oplus 1)}=Z(W)$ for both $u=0$ and $u=1$. Summing over $u\in\{0,1\}$:

$$Z(W'')=\frac{1}{2}\cdot 2\cdot Z(W)\cdot Z(W)=Z(W)^2.\quad\blacksquare$$

Proof: $Z(W')\le 2Z(W)-Z(W)^2$

By definition,

$$Z(W')=\sum_{y_1,y_2}\sqrt{W'(y_1,y_2|0)\,W'(y_1,y_2|1)}.$$

Substituting and pulling out $\frac{1}{2}$:

$$Z(W')=\frac{1}{2}\sum_{y_1,y_2}\sqrt{\big[W(y_1|0)W(y_2|0)+W(y_1|1)W(y_2|1)\big]\big[W(y_1|1)W(y_2|0)+W(y_1|0)W(y_2|1)\big]}.$$

Applying $\sqrt{(a+b)(c+d)}\le\sqrt{ac}+\sqrt{bd}$ (Cauchy–Schwarz) with $a=W(y_1|0)W(y_2|0)$, $b=W(y_1|1)W(y_2|1)$, $c=W(y_1|1)W(y_2|0)$, $d=W(y_1|0)W(y_2|1)$:

$$ \begin{aligned} \sqrt{ac}&=\sqrt{W(y_1|0)W(y_1|1)}\cdot W(y_2|0),\\\\ \sqrt{bd}&=\sqrt{W(y_1|0)W(y_1|1)}\cdot W(y_2|1). \end{aligned} $$

Summing over $y_1,y_2$:

$$Z(W')\le\frac{1}{2}\sum_{y_1}\sqrt{W(y_1|0)W(y_1|1)}\cdot\sum_{y_2}\big[W(y_2|0)+W(y_2|1)\big]=\frac{1}{2}\cdot Z(W)\cdot 2=Z(W).$$

Since $Z(W)\in[0,1]$, we have $Z(W)^2\le Z(W)$, hence $2Z(W)-Z(W)^2\ge Z(W)$. Therefore:

$$Z(W')\le Z(W)\le 2Z(W)-Z(W)^2.\quad\blacksquare$$

Recursion to $N=2^n$

The same $1\to 2$ split is applied recursively. At depth $n$, we have $N=2^n$ sub-channels $W_N^{(1)},\dots,W_N^{(N)}$, where

$$ W_N^{(i)}: U_i \mapsto (Y_1^N,\, U_1^{i-1}). $$

The decoding is sequential: decode $U_1$ first (worst channel, no side information), then $U_2$ given decoded $U_1$, then $U_3$ given $U_1,U_2$, and so on. Each subsequent decoder has more side information.

The total capacity is preserved at every level:

$$ \sum_{i=1}^N I(W_N^{(i)}) = N\cdot I(W). $$

21.4 Martingale Theory

Reference: Handout 11 of EE 229A Fall 2024 @ UC Berkeley, V. Anantharam.

A martingale is a discrete-time stochastic process that captures the notion of a fair sequence of bets: the conditional expectation of the increment at any time, given the information available at that time, should be zero.

鞅 (martingale) 是一种离散时间随机过程，用于刻画公平赌局的概念：在任意时刻，给定当前已知的全部信息，增量的条件期望为零。

Definition: Martingale

A real-valued discrete-time stochastic process $(M_n,\,n\ge 0)$ is called a martingale if:

实值离散时间随机过程 $(M_n,\,n\ge 0)$ 称为鞅，若满足：

$E[|M_n|]<\infty, \forall n\ge 0$.

Integrability; note that there need not be a single upper bound on all $E[|M_n|]$.

可积性；注意不要求 $E[|M_n|]$ 有统一上界，只需每个有限。
$E[M_{n+1}\mid M_0,\dots,M_n]=M_n$ for all $n\ge 0$.

Fair Game Property. Given all history $M_0,M_1,\dots,M_n$ until time $n$, the expected value (best estimate) of next step $M_{n+1}$ is still $M_n$. In other words, given the past and present values of the process, the best prediction for the future value is the current value.

公平博弈性质。给定直到时刻 $n$ 的所有历史 $M_0,M_1,\dots,M_n$，下一步 $M_{n+1}$ 的期望值（最佳估计）仍是 $M_n$。已知现在和过去，对未来的预测仍然是现在。

Example

Let $\xi_1,\xi_2,\dots$ be independent random variables (not necessarily identically distributed) with $E[|\xi_m|]\lt\infty$ and $E[\xi_m]=0$ for all $m\ge 1$. Let $S_0$ be independent of $(\xi_1,\xi_2,\dots)$ with $E[|S_0|]\lt\infty$. Define $M_0\triangleq S_0$ and $M_n\triangleq S_0+\xi_1+\cdots+\xi_n$ for $n\ge 1$. Then $(M_n)$ is a martingale.

设 $\xi_1,\xi_2,\dots$ 独立随机变量（不要求同分布），$E[|\xi_m|]\lt\infty$，$E[\xi_m]=0$。$S_0$ 与 $(\xi_1,\xi_2,\dots)$ 独立且 $E[|S_0|]\lt\infty$；令 $M_0\triangleq S_0$，$M_n\triangleq S_0+\xi_1+\cdots+\xi_n$。则 $(M_n)$ 是鞅。

Proof

Condition (1): $E[|M_n|]\le E[|S_0|]+\sum_{i=1}^n E[|\xi_i|]\lt\infty$ by triangle inequality. Condition (2): since $M_{n+1}=M_n+\xi_{n+1}$,

$$ E[M_{n+1}\mid M_0,\dots,M_n]\overset{(a)}{=}E[M_n\mid M_0,\dots,M_n]+E[\xi_{n+1}\mid M_0,\dots,M_n]\overset{(b)}{=}M_n+E[\xi_{n+1}]\overset{(c)}{=}M_n. $$

$(a)$: linearity of conditional expectation. $(b)$: $M_n$ is $\sigma(M_0,\dots,M_n)$-measurable (i.e. $M_n$ is fully determined by $(M_0,\dots,M_n)$, since it is literally one of them), so conditioning on known information gives $E[M_n\mid M_0,\dots,M_n]=M_n$; and $\xi_{n+1}$ is independent of $(S_0,\xi_1,\dots,\xi_n)$, hence independent of $(M_0,\dots,M_n)$, so the conditioning drops. $(c)$: $E[\xi_{n+1}]=0$. $\blacksquare$

条件 (1)：由三角不等式，$E[|M_n|]\le E[|S_0|]+\sum_{i=1}^n E[|\xi_i|]\lt\infty$。条件 (2)：因为 $M_{n+1}=M_n+\xi_{n+1}$，

$$ E[M_{n+1}\mid M_0,\dots,M_n]\overset{(a)}{=}E[M_n\mid M_0,\dots,M_n]+E[\xi_{n+1}\mid M_0,\dots,M_n]\overset{(b)}{=}M_n+E[\xi_{n+1}]\overset{(c)}{=}M_n. $$

$(a)$：条件期望的线性性。$(b)$：$M_n$ 关于 $\sigma(M_0,\dots,M_n)$ 可测（即 $M_n$ 的值完全由 $(M_0,\dots,M_n)$ 决定，因为它本身就是其中之一），对已知量求条件期望等于自身，故 $E[M_n\mid M_0,\dots,M_n]=M_n$；又 $\xi_{n+1}$ 与 $(S_0,\xi_1,\dots,\xi_n)$ 独立，从而与 $(M_0,\dots,M_n)$ 独立，条件可去掉。$(c)$：$E[\xi_{n+1}]=0$。$\blacksquare$

Martingale w.r.t. Another Process

More generally, the “information available at time $n$” may be larger than that in $(M_0,\dots,M_n)$.

更一般地，时刻 $n$ 可用的信息可以比 $(M_0,\dots,M_n)$ 本身更多。

Let $(X_n,\,n\ge 0)$ be an arbitrary discrete-time stochastic process. A real-valued process $(M_n,\,n\ge 0)$ is called a martingale with respect to $(X_n)$ if:

设 $(X_n,\,n\ge 0)$ 是任意离散时间随机过程。实值过程 $(M_n,\,n\ge 0)$ 称为关于 $(X_n)$ 的鞅，若：

$E[|M_n|]\lt\infty$ for all $n\ge 0$,
$E[M_{n+1}\mid X_n,\dots,X_0]=M_n$ for all $n\ge 0$.

This implies $M_n$ is a function of $(X_0,\dots,X_n)$. This generalization is needed for the polarization argument: the capacity process $I_n=I(W_{2^n}^{(B_1\cdots B_n)+1})$ is a martingale w.r.t. the random bit sequence $(B_1,B_2,\dots)$, because the capacity at depth $n$ depends on the entire path $(B_1,\dots,B_n)$ through the splitting tree, not just on the capacity values seen so far.

这意味着 $M_n$ 是 $(X_0,\dots,X_n)$ 的函数。极化证明需要这个推广：容量过程 $I_n=I(W_{2^n}^{(B_1\cdots B_n)+1})$ 是关于随机比特序列 $(B_1,B_2,\dots)$ 的鞅，因为第 $n$ 层的容量取决于在分裂树中走过的完整路径 $(B_1,\dots,B_n)$，而不仅仅是之前看到的容量值。

Submartingale

Definition: Submartingale

A real-valued process $(M_n,\,n\ge 0)$ is a submartingale w.r.t. $(X_n,\,n\ge 0)$ if $M_n$ is a function of $(X_n,\dots,X_0)$ and:

实值过程 $(M_n,\,n\ge 0)$ 是关于 $(X_n,\,n\ge 0)$ 的下鞅 (submartingale)，若 $M_n$ 是 $(X_n,\dots,X_0)$ 的函数且：

$E[|M_n|]\lt\infty$ for all $n\ge 0$,
$E[M_{n+1}\mid X_n,\dots,X_0]\ge M_n$ for all $n\ge 0$.

The process tends to increase (or stay flat) in expectation.

过程的期望趋势是上升（或不变）。

Doob’s Submartingale Convergence Theorem

Theorem (Doob). Let $(M_n,\,n\ge 0)$ be a submartingale w.r.t. $(X_n,\,n\ge 0)$. If $\sup_{n\ge 0}E[|M_n|]\lt\infty$, then there is a random variable $M$ such that $M_n\xrightarrow{\text{a.s.}}M$ as $n\to\infty$. Here a.s. (almost surely) means $P(\lim_{n\to\infty}M_n=M)=1$, i.e. convergence holds with probability 1, allowing exceptions only on a set of measure zero.

定理 (Doob)。 设 $(M_n,\,n\ge 0)$ 是关于 $(X_n)$ 的下鞅。若 $\sup_{n\ge 0}E[|M_n|]\lt\infty$，则存在随机变量 $M$ 使得 $M_n\xrightarrow{\text{a.s.}}M$。此处 a.s.（almost surely，几乎必然）指 $P(\lim_{n\to\infty}M_n=M)=1$，即以概率 1 收敛，仅允许零测集上的例外。

Proof: See Chapter VII of “Probability” by A. N. Shiryayev, pg. 476.

Example: Likelihood Ratio Martingale

Let $f(x\mid 0)$ and $f(x\mid 1)$ be two densities on $\mathbb{R}$ (null and alternate hypothesis). Define the likelihood ratio $l(x)\triangleq f(x\mid 1)/f(x\mid 0)$, assuming $\int l(x)f(x\mid 0)\,dx=1$.

设 $f(x\mid 0)$、$f(x\mid 1)$ 为 $\mathbb R$ 上两个密度（零假设和备择假设）。似然比 $l(x)\triangleq f(x\mid 1)/f(x\mid 0)$，假设 $\int l(x)f(x\mid 0)\,dx=1$。

Let $\xi_1,\xi_2,\dots$ be i.i.d. with density $f(x\mid 0)$. Define $M_0\triangleq 1$, $M_n\triangleq l(\xi_1)\cdots l(\xi_n)$ for $n\ge 1$. Then $M_n$ is the cumulative likelihood ratio of the first $n$ observations, measuring how much the data collectively resemble $H_1$ versus $H_0$.

设 $\xi_1,\xi_2,\dots$ i.i.d. 密度 $f(x\mid 0)$。令 $M_0\triangleq 1$，$M_n\triangleq l(\xi_1)\cdots l(\xi_n)$。$M_n$ 是前 $n$ 个观测的累积似然比，衡量数据整体上有多像 $H_1$ 而非 $H_0$。

Then $(M_n)$ is a nonnegative martingale with $E[M_n]=1$ for all $n$: since $M_{n+1}=M_n\cdot l(\xi_{n+1})$ and $l(\xi_{n+1})$ is independent of $(M_0,\dots,M_n)$,

则 $(M_n)$ 是非负鞅且 $E[M_n]=1$：因为 $M_{n+1}=M_n\cdot l(\xi_{n+1})$，$l(\xi_{n+1})$ 独立于 $(M_0,\dots,M_n)$，

$$ E[M_{n+1}\mid M_n,\dots,M_0]=M_n\cdot E[l(\xi_{n+1})]=M_n\cdot\underbrace{\int \frac{f(x\mid 1)}{f(x\mid 0)}f(x\mid 0)\,dx}_{=\int f(x\mid 1)\,dx=1}=M_n. $$

Since $M_n\ge 0$, we have $E[|M_n|]=E[M_n]=1$, so $\sup_n E[|M_n|]=1\lt\infty$. By Doob’s theorem, $M_n\xrightarrow{\text{a.s.}}M_\infty$.

由 $M_n\ge 0$ 得 $E[|M_n|]=E[M_n]=1$，故 $\sup_n E[|M_n|]=1\lt\infty$。由 Doob 定理，$M_n\xrightarrow{\text{a.s.}}M_\infty$。

In fact, unless $f(x\mid 0)\equiv f(x\mid 1)$, the a.s. limit is $M_\infty=0$ (despite $E[M_n]=1$ for all $n$). For example, take $f(x\mid 0)=N(0,1)$ and $f(x\mid 1)=N(1,1)$:

事实上，除非 $f(x\mid 0)\equiv f(x\mid 1)$，几乎必然极限为 $M_\infty=0$（尽管 $E[M_n]=1$）。例如取 $f(x\mid 0)=N(0,1)$，$f(x\mid 1)=N(1,1)$：

$$ \frac{1}{n}\log M_n=\frac{1}{n}\sum_{i=1}^n \xi_i-\frac{1}{2}\xrightarrow{\text{a.s.}}-\frac{1}{2}\lt 0, $$

so $M_n\to 0$ a.s. by the strong law of large numbers. Under $H_0$, the cumulative likelihood ratio shrinks to zero (correctly rejecting $H_1$), yet $E[M_n]=1$ because a vanishingly rare set of extreme sample paths contributes enormous values. This illustrates that a.s. convergence and convergence of expectations can behave very differently.

由大数定律 $M_n\to 0$ a.s.。在 $H_0$ 下，累积似然比趋向零（正确否定 $H_1$），但 $E[M_n]=1$ 因为极少数极端样本路径贡献了巨大的值。这说明几乎必然收敛和期望收敛可以非常不同。

Supermartingale

Definition: Supermartingale

A real-valued process $(M_n,\,n\ge 0)$ is a supermartingale w.r.t. $(X_n,\,n\ge 0)$ if $(\tilde M_n)\triangleq(-M_n)$ is a submartingale w.r.t. $(X_n)$. Equivalently:

实值过程 $(M_n)$ 是上鞅 (supermartingale)，若 $(\tilde M_n)\triangleq(-M_n)$ 是下鞅。等价条件：

$E[|M_n|]\lt\infty$ for all $n\ge 0$,
$E[M_{n+1}\mid X_n,\dots,X_0]\le M_n$ for all $n\ge 0$.

The process tends to decrease (or stay flat) in expectation.

过程的期望趋势是下降（或不变）。

Nonnegative Supermartingale Convergence

Theorem. Let $(M_n,\,n\ge 0)$ be a nonnegative supermartingale w.r.t. $(X_n)$. Then $M_n\xrightarrow{\text{a.s.}}M$ for some finite random variable $M$.

定理。 非负上鞅 $(M_n,\,n\ge 0)$ 必定几乎必然收敛到某有限随机变量 $M$。

Proof (from Handout 11)

Let $\tilde M_n:=-M_n$. Then $(\tilde M_n)$ is a nonpositive submartingale. Since

$$E[|\tilde M_{n+1}|]=-E[\tilde M_{n+1}]=-E[E[\tilde M_{n+1}\mid X_n,\dots,X_0]]\le -E[\tilde M_n]=E[|\tilde M_n|],$$

we have $\sup_{n\ge 0}E[|\tilde M_n|]\lt\infty$, so by Doob’s submartingale convergence theorem $\tilde M_n$ converges a.s., and so does $M_n$. $\blacksquare$

21.5 Polarization

The Capacity Process is a Martingale

Binary Tree and Random Paths

The recursive splitting forms a binary tree. At depth $n$, there are $N=2^n$ sub-channels (leaves). Each sub-channel corresponds to a unique path from the root: at each node, going left ($W'$, the worse channel) or right ($W''$, the better channel).

递归分裂形成一棵二叉树。深度 $n$ 时有 $N=2^n$ 个子信道（叶子）。每个子信道对应从根到叶的唯一路径：每个节点处，向左走（$W'$，较差信道）或向右走（$W''$，较好信道）。

Let $B_1,B_2,\dots$ be i.i.d. uniform $\{0,1\}$-valued, representing a random path through this tree: $B_i=0$ means “go left” (take $W'$) at depth $i$, $B_i=1$ means “go right” (take $W''$). The bit string $(B_1 B_2\cdots B_n)$ encodes the integer $B_1\cdot 2^{n-1}+\cdots+B_n\in[0,2^n-1]$, the index of the sub-channel reached.

设 $B_1,B_2,\dots$ i.i.d. $\{0,1\}$ 上的均匀分布，表示树中的随机路径：$B_i=0$ 在第 $i$ 层向左（取 $W'$），$B_i=1$ 向右（取 $W''$）。比特串 $(B_1\cdots B_n)$ 编码整数 $B_1\cdot 2^{n-1}+\cdots+B_n\in[0,2^n-1]$，即子信道编号。

Capacity of a Sub-Channel as a R.V.

Setting $N=2^n$, the capacity along this random path is:

令 $N=2^n$，沿随机路径的容量是：

$$ I_n \triangleq I(W_N^{(B_1\cdots B_n)+1}),\quad n\ge 1 $$

where $(B_1\cdots B_n)+1$ means the bit string encodes an integer in $[0, 2^n-1]$ and the $+1$ shifts to 1-indexed sub-channel numbering $W_N^{(1)},\dots,W_N^{(N)}$. Since $(B_1,\dots,B_n)$ are random, $I_n$ is a random variable.

其中 $(B_1\cdots B_n)+1$ 是因为比特串编码的整数范围是 $[0,2^n-1]$，而子信道编号 $W_N^{(1)},\dots,W_N^{(N)}$ 从 1 开始，所以 $+1$ 做 0-indexed 到 1-indexed 的转换。由于 $(B_1,\dots,B_n)$ 是随机的，$I_n$ 是一个随机变量。

$I_n$ is a martingale w.r.t. $(B_1,B_2,\dots)$. At depth $n$, the current sub-channel splits into two at depth $n+1$. Since $B_{n+1}$ is uniform and independent of $(B_1,\dots,B_n)$, using $I(W')+I(W'')=2I(W)$ from Section 21.3:

$I_n$ 是关于 $(B_1,B_2,\dots)$ 的鞅。在深度 $n$，当前子信道在深度 $n+1$ 分裂为二。因为 $B_{n+1}$ 均匀且独立于 $(B_1,\dots,B_n)$，利用 21.3 的 $I(W')+I(W'')=2I(W)$：

$$ E[I_{n+1}\mid B_1,\dots,B_n]=\frac{1}{2}\,I(W')+\frac{1}{2}\,I(W'')=\frac12 \cdot 2 I(W)=\frac{1}{2}\cdot 2\,I_n=I_n. $$

The Bhattacharyya Parameter Process is a Supermartingale

From Section 21.3, $Z(W')+Z(W'')\le 2Z(W)$.

Let $Z_n$ be the Bhattacharyya Parameter for the $n$-th level. Then

$$ E[Z_{n+1}\mid B_1,\dots,B_n]=\frac{1}{2}\,Z(W')+\frac{1}{2}\,Z(W'')\le \frac12 \cdot 2 Z(W)=\frac{1}{2}\cdot 2\,Z_n=Z_n. $$

By the same random-path argument, $Z_n$ is a nonnegative supermartingale w.r.t. $(B_1,B_2,\dots)$.

Convergence and Polarization

By the martingale convergence theorem (Section 21.4): $I_n\in[0,1]$ is a bounded martingale and $Z_n\in[0,1]$ is a nonneg supermartingale, so both converge a.s., denoted as $I_n\to I_\infty$ and $Z_n\to Z_\infty$.

At each step, with probability $\frac12$ the process takes the $W''$ branch where $Z_{n+1}=Z_n^2$.

If $Z_\infty\in(0,1)$, then $Z_\infty^2\neq Z_\infty$, so with probability $\frac12$ the next step jumps to $Z_\infty^2\lt Z_\infty$, contradicting convergence. The only fixed points of $z\mapsto z^2$ in $[0,1]$ are $z=0$ and $z=1$, so $Z_\infty\in\{0,1\}$.

As a capacity, $I_\infty \in [0, 1]$ since the input alphabet has 2 symbols and $\log 2 = 1$.

Combined with the $Z$-$I$ bounds for BISO channels (Section 21.2):

$$ \log \frac{2}{1+Z(W)} \le I(W) \le \sqrt{1-Z(W)^2}, $$

$Z_\infty=0$ forces $I_\infty\ge\log 2=1$, hence $I_\infty=1$; $Z_\infty=1$ forces $I_\infty\le 0$, hence $I_\infty=0$. Therefore as $N\to\infty$, for the sub-channels $W_N^{(1)},\dots,W_N^{(N)}$:

$$ Z(W_N^{(i)})\to 0\text{ or }1,\qquad I(W_N^{(i)})\to 1\text{ or }0, $$

almost surely. Each sub-channel becomes either perfectly reliable or completely useless. The average capacity is preserved:

$$ \frac{1}{N}\sum_{i=1}^N I(W_N^{(i)})=I(W). $$

Choosing the Good Sub-channels

Asymptotically ($N\to\infty$): select sub-channels where $I(W_N^{(i)})\to 1$ (equivalently $Z(W_N^{(i)})\to 0$). The fraction of such channels approaches $I(W)$.
Finite $N$: a practical heuristic is to pick sub-channels with $Z\lt 0.5$, yielding roughly $N\cdot I(W)$ good sub-channels.
The remaining (bad) sub-channels are frozen to a fixed known value (e.g., $0$); no information is transmitted through them.