INFOTH Note 4: Cond. MI & Cond. KL, Data Processing Ineq.
4.1 Conditional Mutual Information and Cond. KL Divergence
Def: Cond. MI
$$ I(X;Y|Z):=H(X|Z)-H(X|Y,Z)=H(Y|Z)-H(Y|X,Z) $$Expanding the definition shows that, just as in conditional entropy, the "given $Z$" in conditional mutual information is also summed (averaged) over $Z$:
$$ \begin{aligned} I(X;Y|Z)&=H(X|Z)-H(X|Y,Z) \\&=\sum_z p(z)\sum_x p(x|z)\log \frac1{p(x|z)} \\&+\sum_{z}p(z)\sum_y p(y|z)\sum_x p(x|y,z)\log p(x|y,z)\quad\text{(Def)} \\&=\sum_{x,z}p(x,z)\log \frac1{p(x|z)}+ \sum_{x,y,z}p(x,y,z)\log{p(x|y,z)} \\&=\sum_{x,y,z}p(x,y,z)\log \frac1{p(x|z)}+ \sum_{x,y,z}p(x,y,z)\log{p(x|y,z)} \\&=\sum_{x,y,z}p(x,y,z)\log {p(x|y,z)\over p(x|z)} \\&=\sum_{x,y,z}p(x,y,z)\log {p(x,y|z)\over p(x|z)p(y|z)} \\&=E_{X,Y,Z}\left[\log {p(x,y|z)\over p(x|z)p(y|z)}\right] \end{aligned} $$We can also write the expectation as
$$ D(p(x,y|z)||p(x|z)p(y|z)\,|\,p(z)) $$which is a conditional relative entropy (conditional KL divergence), with the conditioning variable averaged over $p(z)$.
So $I(X;Y|Z)$ is a conditional relative entropy $D(\cdot||\cdot\,|\,Z)$, in the same way that the unconditional MI is the relative entropy $D(p(x,y)||p(x)p(y))$; and it also has the expectation form derived above.
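As a quick sanity check, here is a minimal sketch (the joint pmf, the dict layout, and the helper `marg` are illustrative choices, not from the lecture) that computes $I(X;Y|Z)$ both as $H(X|Z)-H(X|Y,Z)$ and as the expectation above; the two routes should agree up to floating-point error.
```python
# Minimal sketch: compute I(X;Y|Z) two ways from a tabulated joint pmf.
# The joint below is an arbitrary illustrative example.
from collections import defaultdict
from math import log2

p = {  # joint pmf p(x, y, z), keyed by (x, y, z); entries sum to 1
    (0, 0, 0): 0.20, (0, 1, 0): 0.10, (1, 0, 0): 0.05, (1, 1, 0): 0.15,
    (0, 0, 1): 0.05, (0, 1, 1): 0.15, (1, 0, 1): 0.20, (1, 1, 1): 0.10,
}

def marg(keep):
    """Marginal of p onto the coordinates listed in `keep` (0 = x, 1 = y, 2 = z)."""
    m = defaultdict(float)
    for k, pr in p.items():
        m[tuple(k[i] for i in keep)] += pr
    return m

p_z, p_xz, p_yz = marg((2,)), marg((0, 2)), marg((1, 2))

# Route 1: I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
H_X_Z  = -sum(pr * log2(p_xz[(x, z)] / p_z[(z,)]) for (x, y, z), pr in p.items())
H_X_YZ = -sum(pr * log2(pr / p_yz[(y, z)])        for (x, y, z), pr in p.items())

# Route 2: I(X;Y|Z) = E[ log p(x,y|z) / (p(x|z) p(y|z)) ], via joint marginals
I_exp = sum(pr * log2(pr * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
            for (x, y, z), pr in p.items())

print(H_X_Z - H_X_YZ, I_exp)   # both routes print the same value
```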
Def: Cond. KL
$$ D(p(x|z)||q(x|z)\,|\,\sigma(z)):=\sum_z \sigma(z)\sum_x p(x|z)\log {p(x|z)\over q(x|z)} $$Fact: conditioning can either increase or decrease mutual information; neither inequality holds in general:
$$ I(X;Y|Z) ~?~I(X;Y) $$Either direction can occur. In contrast, conditioning never increases entropy, so $H(X,Y|Z)\le H(X,Y)$ is always guaranteed.
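To make this concrete, the sketch below (the joint pmfs and the helper `mi_and_cond_mi` are ad hoc constructions for this note) shows both directions: with $X,Y$ independent fair bits and $Z=X\oplus Y$, conditioning raises the mutual information from $0$ to $1$ bit; with $X=Y$ a fair bit and $Z=Y$, conditioning lowers it from $1$ bit to $0$.
```python
# Two small examples showing I(X;Y|Z) can exceed or fall below I(X;Y).
from collections import defaultdict
from math import log2

def mi_and_cond_mi(p):
    """Return (I(X;Y), I(X;Y|Z)) from a joint pmf keyed by (x, y, z)."""
    def marg(keep):
        m = defaultdict(float)
        for k, pr in p.items():
            m[tuple(k[i] for i in keep)] += pr
        return m
    p_x, p_y, p_z = marg((0,)), marg((1,)), marg((2,))
    p_xy, p_xz, p_yz = marg((0, 1)), marg((0, 2)), marg((1, 2))
    I_xy = sum(pr * log2(pr / (p_x[(x,)] * p_y[(y,)]))
               for (x, y), pr in p_xy.items())
    I_xy_given_z = sum(pr * log2(pr * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
                       for (x, y, z), pr in p.items())
    return I_xy, I_xy_given_z

# X, Y i.i.d. fair bits, Z = X XOR Y: I(X;Y) = 0 but I(X;Y|Z) = 1 bit
xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mi_and_cond_mi(xor))    # (0.0, 1.0)

# X = Y a fair bit, Z = Y: I(X;Y) = 1 bit but I(X;Y|Z) = 0
copy = {(b, b, b): 0.5 for b in (0, 1)}
print(mi_and_cond_mi(copy))   # (1.0, 0.0)
```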
4.2 Chain Rules
Theorem: KL Divergence Chain Rule
- Analogy: $p(u,z)=p(z)p(u|z)$
 
Chain Rule for Relative Entropy:
$$ D(p(u,z)||q(u,z))=D(p(z)||q(z))+D(p(u|z)||q(u|z)\,|\,p(z)) $$Theorem: General MI Chain Rule
$$ I(X;Y_1,\dots,Y_n)=I(X;Y_1)+I(X;Y_2|Y_1)+\dots+ I(X;Y_n|Y_1\dots Y_{n-1}) $$- The chain is expanded over the second argument (the $Y_i$'s)
 - Chaining over the first argument holds as well, by symmetry
 
Proof Sketches (2 methods; a numerical check of the $n=2$ case follows the list)
- Using successive addition: apply the two-term chain rule $I(X;Y_1,\dots,Y_k)=I(X;Y_1,\dots,Y_{k-1})+I(X;Y_k|Y_1,\dots,Y_{k-1})$ repeatedly and telescope
 - Using linearity of expectation: write each term as an expectation of a log-ratio and combine them by linearity
 
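A minimal numerical check of the $n=2$ case, $I(X;Y_1,Y_2)=I(X;Y_1)+I(X;Y_2|Y_1)$, using a randomly generated joint pmf over three binary variables (the setup and names here are assumptions for illustration only):
```python
# Numerical check (not a proof) of I(X; Y1, Y2) = I(X; Y1) + I(X; Y2 | Y1).
import itertools
import random
from collections import defaultdict
from math import log2

random.seed(0)
keys = list(itertools.product((0, 1), repeat=3))       # tuples (x, y1, y2)
w = [random.random() for _ in keys]
p = {k: wi / sum(w) for k, wi in zip(keys, w)}          # normalized random joint pmf

def marg(keep):
    """Marginal of p onto the coordinates in `keep` (0 = x, 1 = y1, 2 = y2)."""
    m = defaultdict(float)
    for k, pr in p.items():
        m[tuple(k[i] for i in keep)] += pr
    return m

p_x, p_y1 = marg((0,)), marg((1,))
p_xy1, p_y1y2 = marg((0, 1)), marg((1, 2))

# I(X; Y1, Y2): compare p(x, y1, y2) with p(x) p(y1, y2)
I_joint = sum(pr * log2(pr / (p_x[(x,)] * p_y1y2[(y1, y2)]))
              for (x, y1, y2), pr in p.items())

# I(X; Y1): compare p(x, y1) with p(x) p(y1)
I_1 = sum(pr * log2(pr / (p_x[(x,)] * p_y1[(y1,)]))
          for (x, y1), pr in p_xy1.items())

# I(X; Y2 | Y1) = E[ log p(x, y2 | y1) / (p(x | y1) p(y2 | y1)) ]
I_2_given_1 = sum(pr * log2(pr * p_y1[(y1,)] / (p_xy1[(x, y1)] * p_y1y2[(y1, y2)]))
                  for (x, y1, y2), pr in p.items())

print(I_joint, I_1 + I_2_given_1)   # the two sides agree up to float error
```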
4.3 Data Processing Inequality
Def: Markov Chain (1st order)
We say three random variables form a (first-order) Markov chain, written $X\to Y\to Z$, if
$$ p(z|x,y)=p(z|y),\quad\text{equivalently}\quad p(x,y,z)=p(x)\,p(y|x)\,p(z|y) $$Theorem
If $X\to Y\to Z$, then
$$ I(X;Z)\le I(X;Y). $$- A variable obtained by processing another variable cannot contain more information about the source:
 - information about $X$ will not increase after $Y$ is processed into $Z$.
- Equality condition: $I(X;Y|Z)=0$, i.e. $Y$ provides no information about $X$ beyond what is already available in $Z$.
 - $Z$ must contain all the information that $Y$ has about $X$;
 - equivalently, $Z$ is a sufficient statistic of $Y$ (for $X$), in which case $I(X;Z)=I(X;Y)$.
- Also: because $I(X;Z)=H(X)-H(X|Z)$ and $I(X;Y)=H(X)-H(X|Y)$, we get $H(X|Z)\ge H(X|Y)$: the variable farther along the chain leaves more residual uncertainty about $X$; processing cannot reduce this conditional entropy.
 
Proof
First, from the chain rule of MI,
$$ I(X;Y,Z)=I(X;Z)+I(X;Y|Z)=I(X;Y)+I(X;Z|Y) $$Because $X\perp\!\!\!\perp Z|Y$, $I(X;Z|Y)=0$.
So
$$ I(X;Z)+I(X;Y|Z)=I(X;Y) $$and since $I(X;Y|Z)\ge 0$, it follows that
$$ I(X;Z)\le I(X;Y)\qquad\square $$and the equality holds iff
$$ I(X;Y|Z)=0 $$i.e. $X\perp\!\!\!\perp Y|Z$: $X$ and $Y$ are conditionally independent given $Z$.
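A numerical illustration of the inequality and its equality condition (the binary-symmetric-channel construction and the flip probabilities are arbitrary choices for this sketch): build $X\to Y\to Z$ from two BSCs and check $I(X;Z)\le I(X;Y)$, with equality when the second channel is noiseless, i.e. $Z$ is a lossless function of $Y$.
```python
# Sketch: data processing inequality on X -> Y -> Z built from two binary
# symmetric channels with flip probabilities a and b (arbitrary choices).
from collections import defaultdict
from math import log2

def joint(a, b):
    """Joint pmf of (X, Y, Z): X ~ Bernoulli(1/2), Y = BSC_a(X), Z = BSC_b(Y)."""
    p = {}
    for x in (0, 1):
        for y in (0, 1):
            for z in (0, 1):
                p_y_given_x = 1 - a if y == x else a
                p_z_given_y = 1 - b if z == y else b
                p[(x, y, z)] = 0.5 * p_y_given_x * p_z_given_y   # p(x) p(y|x) p(z|y)
    return p

def mutual_info(p, i, j):
    """Mutual information between coordinates i and j of a joint pmf keyed by tuples."""
    def marg(keep):
        m = defaultdict(float)
        for k, pr in p.items():
            m[tuple(k[t] for t in keep)] += pr
        return m
    p_i, p_j, p_ij = marg((i,)), marg((j,)), marg((i, j))
    return sum(pr * log2(pr / (p_i[(u,)] * p_j[(v,)]))
               for (u, v), pr in p_ij.items() if pr > 0)

p = joint(a=0.1, b=0.2)
print(mutual_info(p, 0, 1), mutual_info(p, 0, 2))   # I(X;Y) >= I(X;Z)

# Equality case: b = 0 makes Z a lossless copy of Y, so I(X;Z) = I(X;Y).
q = joint(a=0.1, b=0.0)
print(mutual_info(q, 0, 1), mutual_info(q, 0, 2))
```
With $a=0.1$, $b=0.2$ the first line prints roughly $0.53$ vs $0.17$ bits: cascading two BSCs gives another BSC with a larger flip probability, so information about $X$ can only degrade along the chain.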