INFOTH Note 1: Axioms & Def of Entropy

1.1 Definition of the Shannon Entropy

To (re)invent entropy, we require this quantity to satisfy a few axioms. Denote the entropy by $H(\cdot)$.

$H$ may take its argument in several equivalent forms: a list of probabilities $p_1,\dots,p_n$, a distribution tuple $p$, or a random variable $X$.

Def 1.1.1 Shannon Entropy

Consider a discrete probability distribution $p=(p_1,\dots,p_n)$ on the set $\mathcal X=\{1,\dots,n\}$.

The Shannon Entropy of $p$ is defined by

$$ H(p)=-\sum_{x\in\cal X}p_x\log p_x. $$
  • $\log:=\log_2$; this is Shannon’s convention, and it will be ours as well.
  • We have defined the entropy only on a finite set.
  • We also allow $H$ to wrap a tuple like this: $H(p)=H(p_1,\dots,p_n)$.

Def 1.1.2 Entropy of a Random Variable

Given a discrete r.v. $X$ taking values in a finite set $\cal X$ with distribution $p_X$, we write the entropy as

$$ H(X)=H(p_X)=-\sum_{x\in\cal X}p_X(x)\log p_X(x)=-\sum_{x\in\cal X}P(X=x)\log P(X=x). $$

We can also write this in an expectation form:

$$ H(X)=E[-\log p_X(X)]=E\left[\log\frac{1}{p_X(X)}\right]. $$

Surprisal / Information Content

Entropy is thus the expectation of a particular random quantity. That quantity is called the information content (or surprisal) of the single event $X=x$:

$$ I_X(x)=-\log p_X(x);\quad H(X)=E[I_X(X)]. $$
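
A quick numerical illustration of $H(X)=E[I_X(X)]$ (a minimal Python sketch; the helper names `entropy` and `surprisal` are ours, not from the notes):

```python
import math

def surprisal(p: float) -> float:
    """Information content I(x) = -log2 p of an outcome with probability p."""
    return -math.log2(p)

def entropy(probs) -> float:
    """Shannon entropy H(p) = -sum_x p_x log2 p_x, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.25, 0.125, 0.125]
# Entropy as an expectation of the surprisal: E[I_X(X)] = sum_x p_x * I_X(x).
expected_surprisal = sum(px * surprisal(px) for px in p if px > 0)
print(entropy(p), expected_surprisal)  # both equal 1.75 bits
```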

Lemma 1.1.3

For the entropy defined in 1.1.2, $H(X)\ge 0$: every term satisfies $-p_X(x)\log p_X(x)\ge 0$ because $p_X(x)\in[0,1]$, so the sum is nonnegative.

Lemma 1.1.4

$$ H_b(X)=(\log_b a)H_a(X) $$

where $H_b(X)$ is the entropy computed with logarithm base $b$. In particular, with our base-2 convention, $H_b(X)=(\log_b 2)\,H(X)$.

  • When we want to maximize / minimize an entropy, we may switch the log base for convenience: this only rescales $H$ by a positive constant, so the optimizer is unchanged. (A quick numerical check follows.)
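
A small numerical sanity check of the change-of-base identity (the helper `entropy_base` is our own, illustrative only):

```python
import math

def entropy_base(probs, base):
    """Entropy of a distribution computed with an arbitrary log base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

p = [0.7, 0.2, 0.1]
h_bits = entropy_base(p, 2)        # H_2(p), in bits
h_nats = entropy_base(p, math.e)   # H_e(p), in nats
# Lemma 1.1.4 with a = 2, b = e: H_e(X) = (log_e 2) * H_2(X).
print(h_nats, math.log(2) * h_bits)   # the two values agree up to float error
```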

1.2 Motivation of the Shannon Entropy: Axioms

1.2.1 Axiom 1:

Fair Coin Toss: Random Var $X\in\{H,T\}$, $P(X=H)=P(X=T)=1/2$.

This fixes the unit of entropy: we set $H(X)=H(\frac12,\frac12)=1$ (one bit).

1.2.2 Axiom 2:

For a finite distribution $p=(p_1\dots p_n)$,

$$ H(p_1,\cdots,p_n)=H(p_1+p_2,p_3,\cdots,p_n)+(p_1+p_2)H({p_1\over p_1+p_2},{p_2\over p_1+p_2}). $$

More generally, for any subset $A\subseteq\cal X$ with $\sum_{x\in A}p_x>0$:

$$ H(p_1,\dots,p_n)=H\Big(\sum_{x\in A}p_x,\ (p_y)_{y\notin A}\Big)+\Big(\sum_{x\in A}p_x\Big)H\Big(\Big(\tfrac{p_x}{\sum_{x'\in A}p_{x'}}\Big)_{x\in A}\Big) $$
  • Probability merging
    • The merged probability appears as a single mass in the first term, while the internal composition of the merged block is accounted for by the second term (see the numerical check below).
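
A concrete check of the grouping axiom with $A=\{1,2\}$ (a minimal sketch; the helper `H` and the example distribution are ours):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.1, 0.2, 0.3, 0.4]
s = p[0] + p[1]                    # merged mass of the block A = {1, 2}
lhs = H(p)
rhs = H([s, p[2], p[3]]) + s * H([p[0] / s, p[1] / s])
print(lhs, rhs)                    # both sides agree, illustrating Axiom 2
```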

Corollary from Axiom 2: Proposition of Independence

If $X\perp\!\!\!\perp Y$, i.e. $P(X=x,Y=y)=P(X=x)P(Y=y)$, then Axiom 2 implies $H(X,Y)=H(X)+H(Y)$. Below, $p=(p_1,\dots,p_n)$ and $q=(q_1,\dots,q_m)$ denote the marginal distributions of $X$ and $Y$.

Proof:

$$ \begin{align*} H(X,Y)&=H(p_1q_1,p_1q_2,\cdots,p_nq_m) \\&=H(\sum_j p_1q_j,p_2q_1,p_2q_2,\cdots,p_nq_m)+(\sum_jp_1q_j)H({p_1q_1\over \sum_jp_1q_j},\cdots,{p_1q_m\over \sum_jp_1q_j}) \\&=H(p_1,p_2q_1,p_2q_2,\cdots,p_nq_m)+p_1H({p_1q_1\over \sum_jp_1q_j},\cdots,{p_1q_m\over \sum_jp_1q_j}) \\&=p_1H({p_1q_1\over \sum_jp_1q_j},\cdots,{p_1q_m\over \sum_jp_1q_j})+H(p_1,p_2q_1,p_2q_2,\cdots,p_nq_m) \\&=p_1H(q_1,\cdots,q_m)+H(p_1,p_2q_1,p_2q_2,\cdots,p_nq_m) \\&=p_1H(q_1,\cdots,q_m)+p_2H(q_1,\cdots,q_m)+H(p_1,p_2,p_3q_1,p_3q_2,\cdots,p_nq_m) \\&=(\sum_i p_i)H(q_1,\cdots,q_m)+H(p_1,p_2,\cdots,p_n) \\&=H(q)+H(p)=H(p)+H(q)=H(X)+H(Y). \end{align*} $$
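
A numerical illustration of this corollary (again with our own small `H` helper and arbitrarily chosen marginals): for independent $X$ and $Y$, the entropy of the product distribution equals $H(p)+H(q)$.

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.3, 0.2]                          # marginal distribution of X
q = [0.6, 0.4]                               # marginal distribution of Y
joint = [pi * qj for pi in p for qj in q]    # independence: P(X=x, Y=y) = P(X=x) P(Y=y)
print(H(joint), H(p) + H(q))                 # the two values agree
```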

1.2.3 Axiom 3:

$p\mapsto H(p,1-p)$, $p\in[0,1]$, should be a continuous function (with the convention $0\log 0=0$).

1.2.4 Axiom 4:

$H(p)$ is invariant to permutations of $(p_1,p_2,\cdots, p_n)$.

$$ \forall \sigma\in S_n,\quad H(p_{\sigma(1)},\dots,p_{\sigma(n)})=H(p_1,\dots,p_n), $$

where $S_n$ is the set of permutations of $\{1,\dots,n\}$.

Theorem 1.2.5 Unique Entropy Function

There is only one choice for $H$.

Under Axioms 1–4, there is exactly one such function for a finite-state distribution (this also carries over to infinitely many states, but we will discuss that later):

$$ H(p_1,\cdots,p_n)=-\sum_{x=1}^n p_x\log_2 p_x $$

Verification of Axiom 3

$$ p\mapsto H(p,1-p)=-p\log_2p - (1-p)\log_2(1-p) $$
  • is continuous
  • More importantly, it’s concave (see Section 1.3; a quick numerical check follows).
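
A quick sanity check of the binary entropy function $h(p)=H(p,1-p)$ (the function name `h` is ours): $h(1/2)=1$ as Axiom 1 requires, and a midpoint test on a grid is consistent with concavity.

```python
import math

def h(p):
    """Binary entropy H(p, 1 - p) in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

print(h(0.5))   # 1.0, matching Axiom 1
# Midpoint test on a grid, consistent with concavity: h((a+b)/2) >= (h(a)+h(b))/2.
grid = [i / 20 for i in range(21)]
assert all(h((a + b) / 2) >= (h(a) + h(b)) / 2 - 1e-12
           for a in grid for b in grid)
```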

1.3 Concavity of Shannon Entropy

Theorem 1.3.1 Convexity of $x\log x$

$$ f:x\mapsto x\ln x~~(x>0),\quad f'(x)=\ln x+1, \quad f''(x)=\frac1x>0. $$

(We adopt the convention $0\log 0=0$, consistent with the limit $\lim_{x\to 0^+}x\log x=0$.)

Def 1.3.2 Convex Set

A subset $K\subseteq \mathbb R^d$ is called a convex set if and only if

$$ \forall v,w\in K,\ \forall \lambda\in[0,1]:\quad \lambda v+(1-\lambda)w\in K. $$

Def 1.3.3 Probability Simplex

For $d\ge 2$, the set of probability functions on $\mathcal X=\{1,\dots,d\}$ is called the probability simplex (geometrically, the standard $(d-1)$-simplex), written as

$$ \Delta_{\cal X}=\{(p(1),\dots,p(d)):p(i)\ge 0,\sum_{i=1}^dp(i)=1\}. $$

This simplex is a convex set: every weighted average (with weights in $[0,1]$ summing to $1$) of probability vectors is again a probability vector on the simplex. This is what makes convexity/concavity of entropy-related quantities well defined; a quick numerical check follows.
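
A tiny numerical check that a convex combination of two probability vectors stays in the simplex (purely illustrative; the vectors and the weight are arbitrary choices):

```python
p0 = [0.7, 0.2, 0.1]
p1 = [0.1, 0.1, 0.8]
lam = 0.3
p_lam = [lam * a + (1 - lam) * b for a, b in zip(p1, p0)]
# Entries are nonnegative and sum to 1, so p_lam lies in the simplex as well.
print(p_lam, sum(p_lam))
```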

Theorem 1.3.4 Concavity of Shannon Entropy

Here $H$ is viewed as a function of the probability vector $(p(x),\ x\in{\cal X})$ on the probability simplex for fixed $d$:

$$ \forall p_0,p_1\in\Delta_{\cal X},\forall \lambda\in[0,1],\quad p_\lambda(x):=\lambda p_1(x)+(1-\lambda)p_0(x), $$

then

$$ H(p_\lambda)\ge \lambda H(p_1)+(1-\lambda)H(p_0). $$

Proof Sketch

This follows from the convexity of $u\log u$ (i.e. the concavity of $-u\log u$, Theorem 1.3.1): for each $x$, $-p_\lambda(x)\log p_\lambda(x)\ge -\lambda p_1(x)\log p_1(x)-(1-\lambda)p_0(x)\log p_0(x)$, and summing over $x\in\cal X$ gives the claim.
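
A numerical spot check of the concavity inequality (our own helpers `H` and `random_dist`; random distributions drawn only for illustration):

```python
import math
import random

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_dist(d):
    """A random point of the probability simplex, used only for testing."""
    w = [random.random() for _ in range(d)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(1000):
    p0, p1 = random_dist(5), random_dist(5)
    lam = random.random()
    p_lam = [lam * a + (1 - lam) * b for a, b in zip(p1, p0)]
    # Concavity: H(p_lambda) >= lambda * H(p1) + (1 - lambda) * H(p0).
    assert H(p_lam) >= lam * H(p1) + (1 - lam) * H(p0) - 1e-9
```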

Theorem 1.3.5 Upper Bound of Entropy = Uniform

The uniform distribution on the finite set $\{1,\cdots,d\}$ has the largest entropy among all probability distributions on this set.

Proof

Let $S_d$ denote the set of all permutations of $\mathcal X=\{1,\cdots,d\}$. Then, for any probability distribution $p$ (written as a probability vector),

$$ (1/d,\cdots,1/d)={1\over d!}\sum_{\sigma\in S_d}\big(p(\sigma(1)),p(\sigma(2)),\cdots,p(\sigma(d))\big), $$

where $\sigma(i)$ denotes the image of $i$ under the permutation $\sigma$.

Permuting the indices in this way is the same as permuting the entries of the probability vector.

Averaging these permuted probability vectors coordinate-wise, every coordinate of the average equals

$$ {(d-1)!\sum_{i\in\mathcal X}p(i)\over d!}=\frac1d, $$

since each value $p(i)$ falls in any fixed coordinate for exactly $(d-1)!$ of the permutations.

Since $H$ is concave (Theorem 1.3.4),

$$ H(1/d,\cdots,1/d) \ge {1\over d!}\sum_{\sigma\in S_d}H\big(p(\sigma(1)),\cdots,p(\sigma(d))\big), $$

and by Axiom 4 (1.2.4) the entropy is invariant to permutations, so every summand on the right equals $H(p)$, i.e. $RHS=H(p)$. Q.E.D.

Corollary: Upper Bound Value

$$ \sup_{p_X} H(X)=\log |\cal X|. $$

where $X$ has a distribution $p_X$ on the domain $\mathcal X$.

Proof. By Theorem 1.3.5 the supremum is attained by the uniform distribution, and $H(1/d,\dots,1/d)=-\sum_{x=1}^d\frac1d\log\frac1d=\log d=\log|\cal X|$.
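
A final numerical illustration (our own helper `H` again, with randomly drawn test distributions): on $d=6$ points no distribution exceeds $\log_2 6$ bits, and the uniform distribution attains it.

```python
import math
import random

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

d = 6
bound = math.log2(d)
print(H([1 / d] * d), bound)        # the uniform distribution attains log2(d)

for _ in range(1000):
    w = [random.random() for _ in range(d)]
    p = [x / sum(w) for x in w]
    assert H(p) <= bound + 1e-9     # no distribution on d points exceeds log2(d)
```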