INFOTH Note 9: Differential Entropy

9.1 Differential Entropy

For discrete r.v.s we have the entropy $H$. For real-valued r.v.s with a density, we need an analogous quantity: the differential entropy.

Def: Diff. Ent.

Let $X\sim f$ be a real-valued r.v. with density $f$, meaning

$$ \Pr[X\in[a,b]]=\int_a^b f(x)\mathrm d x,\quad \int_{-\infty}^\infty f(x)\mathrm d x=1. $$

We define (with the shorthand $\int_\R:=\int_{-\infty}^\infty$)

$$ h(X)=-\int_{\R} f(x)\log f(x)\mathrm d x $$

In fact, $h(X)$ need not be finite (the integral can be $+\infty$ or $-\infty$, and is undefined if both cases below occur), namely when

$$ \int_\R \max(-f(x)\log f(x),0)\,\mathrm d x=\infty,\text{ or } \int_\R \max(f(x)\log f(x),0)\,\mathrm d x=\infty. $$

Example: Uniform distribution

$$ X\sim\text{Unif}(a,b),\quad f(x)=\frac1{b-a}\text{ on }[a,b],\\ h(X)=-\int_a^b \frac1{b-a}\log\frac1{b-a}\mathrm d x=\log(b-a) $$
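
A minimal numeric sanity check of this example (a sketch, not part of the derivation: it assumes $\log=\log_2$, so values are in bits, and approximates the integral by a Riemann sum over the support):

```python
import numpy as np

def h_uniform_numeric(a, b, n=100_000):
    """Riemann-sum approximation of h(X) = -∫ f log2(f) dx for X ~ Unif(a, b), in bits."""
    x = np.linspace(a, b, n, endpoint=False)   # grid over the support
    f = np.full_like(x, 1.0 / (b - a))         # the density is constant on [a, b]
    dx = (b - a) / n
    return -np.sum(f * np.log2(f)) * dx

print(h_uniform_numeric(0.0, 2.0), np.log2(2.0))    # both ≈ 1 bit
print(h_uniform_numeric(0.0, 0.5), np.log2(0.5))    # both ≈ -1 bit: h(X) can be negative
```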

Example: Gaussian

$$ X\sim\mathcal N(0,\sigma^2),\quad \sigma^2>0,\\ f(x)={1\over \sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2},x\in\R $$

The diff. entropy:

$$ \begin{aligned} h(X)&=-\int_\R {1\over \sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}\log \left({1\over \sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}\right)\mathrm d x \\&=-\log e\int_\R {1\over \sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}\ln\left({1\over \sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}\right)\mathrm d x \\&=-\log e\int_\R {1\over \sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}\left(-{x^2\over 2\sigma^2}-\ln(\sigma\sqrt{2\pi})\right)\mathrm d x \\&=\log e\cdot \ln(\sigma\sqrt{2\pi})+\log e\int_\R{1\over \sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}{x^2\over 2\sigma^2}\mathrm d x \\&=\log(\sigma\sqrt{2\pi})+\log e\cdot\text{Var}[X]\cdot\frac1{2\sigma^2} \\&=\log(\sigma\sqrt{2\pi})+\frac12\log e \\&=\frac12\log(2\pi e\sigma^2) \end{aligned} $$
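
The same kind of numeric check for the Gaussian (again a sketch: base-2 logs, tails truncated at $\pm 12\sigma$, which contributes negligible error):

```python
import numpy as np

def h_gaussian_numeric(sigma, n=200_001, width=12.0):
    """Riemann-sum approximation of h(X) for X ~ N(0, sigma^2), in bits."""
    x = np.linspace(-width * sigma, width * sigma, n)   # truncate the (negligible) tails
    f = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    dx = x[1] - x[0]
    return -np.sum(f * np.log2(f)) * dx

sigma = 2.0
print(h_gaussian_numeric(sigma))                   # numeric value
print(0.5 * np.log2(2 * np.pi * np.e * sigma**2))  # closed form (1/2) log2(2πe σ²)
```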

Differential Entropy is not Entropy

  1. It can be negative, e.g. the uniform distribution with $b-a<1$.
  2. Quantization: the entropy of a $\Delta$-quantized version of $X$ blows up as $\Delta\to0$, so $h(X)$ is not a limit of discrete entropies (see below).

Quantization (Discretization)

Quantize the real line into intervals of width $\Delta$, so that the integral is approximated by a Riemann sum:

$$ \sum_{k=-\infty}^\infty -\Delta\cdot (f(x_k)\log f(x_k))\approx-\int_{\R} f(x)\log f(x)\mathrm d x $$

where $x_k=k\Delta$ i.e. $\cdots,-2\Delta,-\Delta,0,\Delta,\cdots$

So the quantized variable is discrete, with (countably) infinitely many probability values summing to 1, and the true entropy at quantization interval $\Delta$ is

$$ H(\cdots,p_{-2}^\Delta,p_{-1}^\Delta,p_0^\Delta,p_1^\Delta,p_2^\Delta,\cdots) =-\sum_k p_k^\Delta\log p_k^\Delta $$

where the probability value $p_k^\Delta$ is

$$ p_k^\Delta:=\Pr[X\in[k\Delta,(k+1)\Delta)]=\int_{k\Delta}^{(k+1)\Delta}f(x)\mathrm d x \approx f(x_k)\Delta $$

Then

$$ -p_k^\Delta\log p_k^\Delta\approx -f(x_k)\Delta\log (f(x_k)\Delta) =-f(x_k)\Delta\cdot\log(f(x_k))-f(x_k)\Delta\cdot\log \Delta $$

Hence

$$ H(\cdots,p_{-2}^\Delta,p_{-1}^\Delta,p_0^\Delta,p_1^\Delta,p_2^\Delta,\cdots) \approx-\sum_k f(x_k)\Delta \log f(x_k)-\log\Delta\cdot\sum_k \Delta f(x_k) $$

When $\Delta\to 0$,

$$ \sum_k\Delta \cdot f(x_k)\to\int_\R f(x)\mathrm d x=1 $$

and

$$ -\sum_k\Delta \cdot f(x_k)\log f(x_k)\to-\int_\R f(x)\log f(x)\mathrm d x=h(X) $$

hence

$$ H(\cdots,p_{-2}^\Delta,p_{-1}^\Delta,p_0^\Delta,p_1^\Delta,p_2^\Delta,\cdots) \to h(X)-\log \Delta,\quad\Delta\to 0 $$

As $\Delta\to0$, $\log\Delta\to-\infty$, so the discrete entropy $H$ diverges to $+\infty$ even though $h(X)$ stays finite. Differential entropy is therefore not a limit of discrete entropies; rather, $H\approx h(X)+\log\frac1\Delta$, the extra term being the cost of resolving $X$ to precision $\Delta$.
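
A small numerical illustration of this (a sketch under the same assumptions as above: base-2 logs, $X\sim\mathcal N(0,\sigma^2)$, bin probabilities approximated by $p_k^\Delta\approx f(x_k)\Delta$ on a truncated grid):

```python
import numpy as np

def quantized_entropy(sigma, delta, width=12.0):
    """Entropy (bits) of the Δ-quantization of X ~ N(0, sigma^2)."""
    x_k = np.arange(-width * sigma, width * sigma, delta)
    p = np.exp(-x_k**2 / (2 * sigma**2)) * delta / (sigma * np.sqrt(2 * np.pi))
    p = p / p.sum()                    # renormalize (truncation + p_k ≈ f(x_k)Δ)
    return -np.sum(p * np.log2(p))

sigma = 1.0
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)     # h(X) in bits
for delta in (0.1, 0.01, 0.001):
    # the quantized entropy tracks h(X) - log2(Δ) ever more closely as Δ shrinks
    print(delta, quantized_entropy(sigma, delta), h - np.log2(delta))
```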

9.2 Differential Definitions of Other Quantities

Joint, Cond., MI

Joint Differential Entropy

$$ h(X,Y):=-\int_\R\int_\R f(x,y)\log f(x,y)\mathrm d x\mathrm d y\\ h(\bold X^n):=-\int_{\R^n}f(\bold x^n)\log f(\bold x^n)\mathrm d {\bold x}^n $$

Conditional Diff. Ent.

$$ h(X|Y):=-\int_\R f(y)\int_\R f(x|y)\log f(x|y)\mathrm d x\mathrm d y =-\int_\R\int_\R f(x,y)\log f(x|y)\mathrm d x\mathrm d y $$

Mutual Information (differentially)

$$ I(X;Y):=h(Y)-h(Y|X)=h(X)-h(X|Y)=h(X)+h(Y)-h(X,Y)=D(f(x,y)||f(x)f(y)) $$
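
As a standard worked example (not derived in these notes; it follows from $h(X)=\frac12\log(2\pi e\sigma_X^2)$ together with the multivariate Gaussian formula $h=\frac12\log((2\pi e)^n\det\Sigma)$): if $X,Y$ are jointly Gaussian with variances $\sigma_X^2,\sigma_Y^2$ and correlation coefficient $\rho$, then

$$ h(X,Y)=\frac12\log\left((2\pi e)^2\sigma_X^2\sigma_Y^2(1-\rho^2)\right),\qquad I(X;Y)=h(X)+h(Y)-h(X,Y)=-\frac12\log(1-\rho^2), $$

which is nonnegative, equals $0$ iff $\rho=0$, and diverges as $|\rho|\to1$.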

Relative Entropy ($f$ vs $g$)

$$ D(f||g):=\int_\R f(x)\log{f(x)\over g(x)}\mathrm d x $$

Note that $D(f||g)\ge 0$ even though differential entropy need not be: by Jensen's inequality applied to the convex function $t\mapsto t\log t$ under the density $g$ (and using $\int_\R g(x){f(x)\over g(x)}\mathrm d x=\int_\R f(x)\mathrm d x=1$),

$$ D(f||g) =\int_\R g(x){f(x)\over g(x)}\log{f(x)\over g(x)}\mathrm d x \ge\left(\int_\R g(x){f(x)\over g(x)}\mathrm d x \right)\log \left(\int_\R g(x){f(x)\over g(x)}\mathrm d x \right)=0. $$

By the same convention as in the discrete case ($\infty$ whenever $f$ puts mass where $g$ does not),

$$ D(f||g)=\begin{cases}\int_\R f(x)\log{f(x)\over g(x)}\mathrm d x,&g(x)=0\implies f(x)=0,\\ \infty,&\text{otherwise}.\end{cases} $$

For example, $f=\text{Unif}(0,2)$ and $g=\text{Unif}(0,1)$ give $D(f||g)=\infty$.

Intuition from discrete to continuous

We have discrete distributions

$$ (p^\Delta(k),k\in\Z):p^\Delta (k)=\int_{k\Delta}^{(k+1)\Delta}f(x)\mathrm d x \\ (q^\Delta(k),k\in\Z):q^\Delta (k)=\int_{k\Delta}^{(k+1)\Delta}g(x)\mathrm d x $$

Then

$$ \lim_{\Delta\to 0}D(p^\Delta||q^\Delta)=D(f||g) $$

Intuition: the $\mathrm d x$ factors cancel inside the logarithm, so (unlike for entropy) no $\log\Delta$ term survives:

$$ \sum_{\mathrm d x}f(x)\mathrm d x\cdot {\log {f(x)\mathrm d x\over g(x)\mathrm d x}}=\sum_{\mathrm d x}f(x)\mathrm d x{\log {f(x)\over g(x)}}=\int_\R f{\log{f\over g}}\mathrm d x. $$
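
A numeric sketch of this convergence for two equal-variance Gaussians (assumptions: base-2 logs, $f=\mathcal N(\mu_1,\sigma^2)$, $g=\mathcal N(\mu_2,\sigma^2)$, bins approximated by $f(x_k)\Delta$ and $g(x_k)\Delta$; the comparison uses the standard closed form $D(f||g)=\frac{(\mu_1-\mu_2)^2}{2\sigma^2}\log e$):

```python
import numpy as np

def discretized_kl(mu1, mu2, sigma, delta, width=12.0):
    """D(p^Δ || q^Δ) in bits, with p^Δ(k) ≈ f(x_k)Δ and q^Δ(k) ≈ g(x_k)Δ."""
    lo = min(mu1, mu2) - width * sigma
    hi = max(mu1, mu2) + width * sigma
    x = np.arange(lo, hi, delta)
    p = np.exp(-(x - mu1)**2 / (2 * sigma**2)); p /= p.sum()  # constants cancel on normalizing
    q = np.exp(-(x - mu2)**2 / (2 * sigma**2)); q /= q.sum()
    return np.sum(p * np.log2(p / q))

mu1, mu2, sigma = 0.0, 1.0, 1.0
D_fg = (mu1 - mu2)**2 / (2 * sigma**2) * np.log2(np.e)   # continuous D(f||g) in bits
for delta in (0.5, 0.1, 0.01):
    print(delta, discretized_kl(mu1, mu2, sigma, delta), D_fg)
```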

9.3 Chain Rules

Chain Rule for Diff. Ent.

$$ h(\bold X^n)=\sum_{k=1}^nh(X_k|\bold X^{k-1})=E\left[\log\frac1{f(\bold X^n)}\right]=\sum_k E\left[\log\frac1{f(X_k|\bold X^{k-1})}\right] $$

Chain Rule for MI for real-valued R.V.s

$$ I(X;\bold Y^n)=\sum_k I(X;Y_k|\bold Y^{k-1}) =E\left[\log{f(X,\bold Y^n)\over f(X)f(\bold Y^n)}\right] =\sum_k E\left[\log{f(X,Y_k|\bold Y^{k-1})\over f(X|\bold Y^{k-1})f(Y_k|\bold Y^{k-1})}\right] $$
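
A small numeric sketch of the MI chain rule for a jointly Gaussian triple $(X,Y_1,Y_2)$ (assumptions: all quantities are computed from the Gaussian formula $h=\frac12\log_2((2\pi e)^n\det\Sigma)$, and the covariance matrix below is an arbitrary illustrative choice):

```python
import numpy as np

def h_gauss(cov):
    """Differential entropy (bits) of a Gaussian vector with covariance matrix cov."""
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(cov))

# Covariance of (X, Y1, Y2); any positive-definite matrix would do here.
S = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
X, Y1, Y2 = 0, 1, 2

def h(idx):                                  # entropy of the sub-vector with these indices
    return h_gauss(S[np.ix_(idx, idx)])

I_X_Y1Y2 = h([X]) + h([Y1, Y2]) - h([X, Y1, Y2])                        # I(X; Y1, Y2)
I_X_Y1 = h([X]) + h([Y1]) - h([X, Y1])                                  # I(X; Y1)
I_X_Y2_g_Y1 = (h([X, Y1]) - h([Y1])) - (h([X, Y1, Y2]) - h([Y1, Y2]))   # I(X; Y2 | Y1) = h(X|Y1) - h(X|Y1,Y2)

print(I_X_Y1Y2, I_X_Y1 + I_X_Y2_g_Y1)        # the two values agree, as the chain rule says
```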