INFOTH Note 4: Cond. MI & Cond. KL, Data Processing Ineq.

4.1 Conditional Mutual Information and Cond. KL Divergence

Def: Cond. MI

$$ I(X;Y|Z):=H(X|Z)-H(X|Y,Z)=H(Y|Z)-H(Y|X,Z) $$

Expanding the definition shows that, just as in conditional entropy, the "given $Z$" in conditional mutual information is averaged (summed) over $Z$:

$$ \begin{aligned} I(X;Y|Z)&=H(X|Z)-H(X|Y,Z) \\&=\sum_z p(z)\sum_x p(x|z)\log \frac1{p(x|z)} \\&+\sum_{z}p(z)\sum_y p(y|z)\sum_x p(x|y,z)\log p(x|y,z)\quad\text{(Def)} \\&=\sum_{x,z}p(x,z)\log \frac1{p(x|z)}+ \sum_{x,y,z}p(x,y,z)\log{p(x|y,z)} \\&=\sum_{x,y,z}p(x,y,z)\log \frac1{p(x|z)}+ \sum_{x,y,z}p(x,y,z)\log{p(x|y,z)} \\&=\sum_{x,y,z}p(x,y,z)\log {p(x|y,z)\over p(x|z)} \\&=\sum_{x,y,z}p(x,y,z)\log {p(x,y|z)\over p(x|z)p(y|z)} \\&=E_{X,Y,Z}\left[\log {p(x,y|z)\over p(x|z)p(y|z)}\right] \end{aligned} $$

We can also write the expectation as

$$ D(p(x,y|z)||p(x|z)p(y|z)) $$

which is the conditional relative entropy (KL divergence).

That is, $I(X;Y|Z)$ has exactly the relative-entropy form of the unconditional mutual information, except that both distributions are conditioned on $Z$ and the divergence is averaged over $p(z)$; equivalently, it is the expectation written above.
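As a quick numerical sanity check of this expectation form, here is a minimal Python sketch (the helper name and array layout are my own, not from the notes) that computes $I(X;Y|Z)$ straight from a joint pmf table; when the table factors as $p(z)p(x|z)p(y|z)$ the result should be $\approx 0$.

```python
import numpy as np

def cond_mutual_info(pxyz):
    """I(X;Y|Z) in bits, from a joint pmf array indexed as pxyz[x, y, z]."""
    pz  = pxyz.sum(axis=(0, 1))   # p(z)
    pxz = pxyz.sum(axis=1)        # p(x, z)
    pyz = pxyz.sum(axis=0)        # p(y, z)
    total = 0.0
    for x, y, z in np.ndindex(pxyz.shape):
        p = pxyz[x, y, z]
        if p > 0:
            # p(x,y|z) / (p(x|z) p(y|z)) = p(x,y,z) p(z) / (p(x,z) p(y,z))
            total += p * np.log2(p * pz[z] / (pxz[x, z] * pyz[y, z]))
    return total

# Sanity check: if p(x,y,z) = p(z) p(x|z) p(y|z), then I(X;Y|Z) should be 0.
rng = np.random.default_rng(0)
pz   = rng.dirichlet(np.ones(3))            # p(z) over 3 values
px_z = rng.dirichlet(np.ones(2), size=3)    # p(x|z), one row per z
py_z = rng.dirichlet(np.ones(2), size=3)    # p(y|z), one row per z
p = np.einsum('z,zx,zy->xyz', pz, px_z, py_z)
print(cond_mutual_info(p))                  # ~0.0 (up to floating-point error)
```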

Def: Cond. KL

$$ D(p(x|z)||q(x|z)|\sigma(z)):=\sum_z \sigma(z)\sum_xp(x|z)\log {p(x|z)\over q(x|z)} $$
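In this notation, with $\sigma(z)=p(z)$, the identity derived in 4.1 becomes

$$ I(X;Y|Z)=D\big(p(x,y|z)\,||\,p(x|z)p(y|z)\,\big|\,p(z)\big)=\sum_z p(z)\sum_{x,y}p(x,y|z)\log {p(x,y|z)\over p(x|z)p(y|z)} $$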

Fact: the conditional MI need not be smaller (or larger) than the unconditional MI

$$ I(X;Y|Z) ~?~ I(X;Y) $$

Neither inequality holds in general. By contrast, for entropy we are guaranteed that $H(X,Y|Z)\le H(X,Y)$ (conditioning never increases entropy).
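A standard pair of examples shows that either direction is possible. If $X,Y$ are independent fair bits and $Z=X\oplus Y$, then

$$ I(X;Y)=0,\qquad I(X;Y|Z)=H(X|Z)-H(X|Y,Z)=1-0=1\text{ bit}, $$

since $X$ is independent of $Z$ but is completely determined by $(Y,Z)$ via $X=Y\oplus Z$: conditioning increased the mutual information. Conversely, if $Y=X$ is a fair bit and $Z=X$ as well, then $I(X;Y)=1$ bit while $I(X;Y|Z)=0$: conditioning decreased it.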

4.2 Chain Rules

Theorem: KL Divergence Chain Rule

  • Analogy: the factorization $p(u,z)=p(z)p(u|z)$ splits a joint distribution into a marginal and a conditional; the divergence splits the same way.

Chain Rule for Relative Entropy:

$$ D(p(u,z)||q(u,z))=D(p(z)||q(z))+D(p(u|z)||q(u|z)\,|\,p(z)) $$
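The proof is a one-line expansion of the logarithm using the factorization in the analogy above; the second sum is exactly the conditional KL from the definition, with $\sigma(z)=p(z)$:

$$ \begin{aligned} D(p(u,z)||q(u,z)) &=\sum_{u,z}p(u,z)\log{p(z)\,p(u|z)\over q(z)\,q(u|z)} \\ &=\sum_{u,z}p(u,z)\log{p(z)\over q(z)}+\sum_{u,z}p(u,z)\log{p(u|z)\over q(u|z)} \\ &=\sum_z p(z)\log{p(z)\over q(z)}+\sum_z p(z)\sum_u p(u|z)\log{p(u|z)\over q(u|z)}. \end{aligned} $$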

Theorem: General MI Chain Rule

$$ I(X;Y_1,\dots,Y_n)=I(X;Y_1)+I(X;Y_2|Y_1)+\dots+ I(X;Y_n|Y_1\dots Y_{n-1}) $$
  • The right-hand argument $(Y_1,\dots,Y_n)$ is what gets expanded by the chain rule.
  • Of course, chaining over the left-hand argument holds as well: $I(X_1,\dots,X_n;Y)=\sum_{i=1}^n I(X_i;Y|X_1,\dots,X_{i-1})$.

Proof Sketches (2 methods)

  1. Using Successive Addition (see the sketch after this list)
  2. Using Expectation Linearity
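Sketch of method 1 (successive addition / telescoping): expand both joint entropies with the entropy chain rule and pair the terms.

$$ \begin{aligned} I(X;Y_1,\dots,Y_n) &= H(Y_1,\dots,Y_n)-H(Y_1,\dots,Y_n|X) \\ &=\sum_{i=1}^n H(Y_i|Y_1,\dots,Y_{i-1})-\sum_{i=1}^n H(Y_i|Y_1,\dots,Y_{i-1},X) \\ &=\sum_{i=1}^n I(X;Y_i|Y_1,\dots,Y_{i-1}). \end{aligned} $$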

4.3 Data Processing Inequality

Def: Markov Chain (1st order)

We say three random variables form a (first-order) Markov chain, written $X\to Y\to Z$, if

$$ p(z|x,y)=p(z|y),\qquad\text{equivalently}\qquad p(x,y,z)=p(x)\,p(y|x)\,p(z|y) $$
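Since the proof below uses $X\perp\!\!\!\perp Z\mid Y$, note that the Markov condition is exactly this conditional independence:

$$ p(x,z|y)={p(x,y,z)\over p(y)}={p(x,y)\,p(z|x,y)\over p(y)}={p(x,y)\,p(z|y)\over p(y)}=p(x|y)\,p(z|y), $$

so $X\to Y\to Z$ holds iff $X\perp\!\!\!\perp Z\mid Y$, and hence iff $Z\to Y\to X$: the chain can be read in either direction.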

Theorem

If $X\to Y\to Z$, then

$$ I(X;Z)\le I(X;Y). $$
  • A variable obtained by processing another variable cannot contain more information about the source (a numerical sketch follows this list).

    • Information will not increase after being processed.
  • Equality condition: $I(X;Y|Z)=0$, where $I(X;Y|Z)$ is the amount of information that $Y$ provides about $X$ that is not available in $Z$.

    • $Z$ must contain all the information that $Y$ has about $X$.
    • When $Z$ is a sufficient statistic of $Y$ for $X$, we indeed get $I(X;Z)=I(X;Y)$.
  • Also: because $I(X;Z)=H(X)-H(X|Z)$ and $I(X;Y)=H(X)-H(X|Y)$, we get

    $H(X|Z)\ge H(X|Y)$ – the variable farther along the chain leaves at least as much residual uncertainty about $X$; processing never decreases conditional entropy.
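As a quick numerical illustration of the theorem (a sketch with arbitrary channel parameters, not from the notes): pass a fair bit $X$ through two binary symmetric channels in series, so $X\to Y\to Z$, and compare $I(X;Y)$ with $I(X;Z)$.

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in bits, from a joint pmf array indexed as pxy[x, y]."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask])).sum())

def bsc(eps):
    """Row-stochastic transition matrix of a binary symmetric channel."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

px = np.array([0.5, 0.5])            # fair input bit X
p_y_given_x = bsc(0.1)               # channel X -> Y
p_z_given_y = bsc(0.2)               # channel Y -> Z (depends only on Y: Markov)

pxy = px[:, None] * p_y_given_x      # p(x, y) = p(x) p(y|x)
pxz = pxy @ p_z_given_y              # p(x, z) = sum_y p(x, y) p(z|y)

# I(X;Y) ~ 0.53 bits > I(X;Z) ~ 0.17 bits, as the DPI predicts.
print(mutual_info(pxy), mutual_info(pxz))
```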

Proof

First, from the chain rule of MI,

$$ I(X;Y,Z)=I(X;Z)+I(X;Y|Z)=I(X;Y)+I(X;Z|Y) $$

Because $X\perp\!\!\!\perp Z|Y$, $I(X;Z|Y)=0$.

So

$$ I(X;Z)+I(X;Y|Z)=I(X;Y) $$

and since (conditional) mutual information is nonnegative, $I(X;Y|Z)\ge 0$, it follows that

$$ I(X;Z)\le I(X;Y)\qquad\square $$

and the equality holds iff

$$ I(X;Y|Z)=0 $$

(i.e., $X$ and $Y$ are conditionally independent given $Z$: $X\perp\!\!\!\perp Y\mid Z$).
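A minimal instance of the equality case: if $Z=g(Y)$ for an injective $g$ (a mere relabeling of $Y$'s values), then $Y$ and $Z$ determine each other, so $H(X|Z)=H(X|Y)$, hence $I(X;Y|Z)=0$ and $I(X;Z)=I(X;Y)$; such a $Z$ discards nothing, which is the sufficient-statistic situation mentioned above.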