INFOTH Note 1: Axioms & Def of Entropy

1.1 Definition of the Shannon Entropy

To (re)invent entropy, we require this quantity to satisfy a few axioms. Denote the entropy by $H(\cdot)$.

$H$ may take its argument in several equivalent forms: a list of probabilities $p_1,\dots,p_n$, a distribution tuple $p$, or a random variable $X$.

Def 1.1.1 Shannon Entropy

Consider a discrete probability distribution $p=(p_1,\dots,p_n)$ on the set $\mathcal X=\{1,\dots,n\}$.

The Shannon Entropy of $p$ is defined by

$$ H(p)=-\sum_{x\in\cal X}p_x\log p_x. $$
  • $\log:=\log_2$; this is Shannon’s convention, and it will be ours as well.
  • We have defined the entropy only on a finite set.
  • We also allow $H$ to wrap a tuple like this: $H(p)=H(p_1,\dots,p_n)$.

Def 1.1.2 Entropy of a Random Variable

Given a discrete r.v. $X$ taking values in a finite set $\cal X$ with distribution $p_X$, we write the entropy as

$$ H(X)=H(p_X)=-\sum_{x\in\cal X}p_X(x)\log p_X(x)=-\sum_{x\in\cal X}P(X=x)\log P(X=x). $$

We can also write this in an expectation form:

$$ H(X)=E[-\log p_X(X)]=E\left[\log\frac{1}{p_X(X)}\right]. $$

Surprisal / Information Content

Entropy is thus the expectation of a particular random quantity. That quantity is called the information content (or surprisal) of the single event $X=x$:

$$ I_X(x)=-\log p_X(x);\quad H(X)=E[I_X(X)]. $$
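
A quick numerical illustration of $H(X)=E[I_X(X)]$ (a minimal Python sketch; the helper names `entropy` and `surprisal` are ours, not from the notes):

```python
import math

def surprisal(p: float) -> float:
    """Information content I(x) = -log2 p of an outcome with probability p."""
    return -math.log2(p)

def entropy(probs) -> float:
    """Shannon entropy H(p) = -sum_x p_x log2 p_x, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.25, 0.125, 0.125]
# Entropy as an expectation of the surprisal: E[I_X(X)] = sum_x p_x * I_X(x).
expected_surprisal = sum(px * surprisal(px) for px in p if px > 0)
print(entropy(p), expected_surprisal)  # both equal 1.75 bits
```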

Lemma 1.1.3

For the entropy defined in 1.1.2, $H(X)\ge 0$: every term satisfies $-p_X(x)\log p_X(x)\ge 0$ because $p_X(x)\in[0,1]$, so the sum is nonnegative.

Lemma 1.1.4

$$ H_b(X)=(\log_b a)H_a(X) $$

where $H_b(X)$ is the entropy computed with logarithm base $b$. In particular, with our base-2 convention, $H_b(X)=(\log_b 2)\,H(X)$.

  • When we want to maximize / minimize an entropy, we may switch the log base for convenience: this only rescales $H$ by a positive constant, so the optimizer is unchanged. (A quick numerical check follows.)
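
A small numerical sanity check of the change-of-base identity (the helper `entropy_base` is our own, illustrative only):

```python
import math

def entropy_base(probs, base):
    """Entropy of a distribution computed with an arbitrary log base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

p = [0.7, 0.2, 0.1]
h_bits = entropy_base(p, 2)        # H_2(p), in bits
h_nats = entropy_base(p, math.e)   # H_e(p), in nats
# Lemma 1.1.4 with a = 2, b = e: H_e(X) = (log_e 2) * H_2(X).
print(h_nats, math.log(2) * h_bits)   # the two values agree up to float error
```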

1.2 Motivation of the Shannon Entropy: Axioms

1.2.1 Axiom 1:

Fair Coin Toss: Random Var $X\in\{H,T\}$, $P(X=H)=P(X=T)=1/2$.

This fixes the unit of entropy: we set $H(X)=H(\frac12,\frac12)=1$ (one bit).

1.2.2 Axiom 2:

For a finite distribution $p=(p_1\dots p_n)$,

$$ H(p_1,\cdots,p_n)=H(p_1+p_2,p_3,\cdots,p_n)+(p_1+p_2)H({p_1\over p_1+p_2},{p_2\over p_1+p_2}). $$

More generally, for any subset $A\subseteq\cal X$ with $\sum_{x\in A}p_x>0$:

$$ H(p_1,\dots,p_n)=H\Big(\sum_{x\in A}p_x,\ (p_y)_{y\notin A}\Big)+\Big(\sum_{x\in A}p_x\Big)H\Big(\Big(\tfrac{p_x}{\sum_{x'\in A}p_{x'}}\Big)_{x\in A}\Big) $$
  • Probability merging
    • The merged probability appears as a single mass in the first term, while the internal composition of the merged block is accounted for by the second term (see the numerical check below).
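
A concrete check of the grouping axiom with $A=\{1,2\}$ (a minimal sketch; the helper `H` and the example distribution are ours):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.1, 0.2, 0.3, 0.4]
s = p[0] + p[1]                    # merged mass of the block A = {1, 2}
lhs = H(p)
rhs = H([s, p[2], p[3]]) + s * H([p[0] / s, p[1] / s])
print(lhs, rhs)                    # both sides agree, illustrating Axiom 2
```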

Corollary from Axiom 2: Proposition of Independence

If $X\perp\!\!\!\perp Y$, i.e. $P(X=x,Y=y)=P(X=x)P(Y=y)$, then Axiom 2 implies $H(X,Y)=H(X)+H(Y)$. Below, $p=(p_1,\dots,p_n)$ and $q=(q_1,\dots,q_m)$ denote the marginal distributions of $X$ and $Y$.

Proof:

$$ \begin{align*} H(X,Y)&=H(p_1q_1,p_1q_2,\cdots,p_nq_m) \\&=H(\sum_j p_1q_j,p_2q_1,p_2q_2,\cdots,p_nq_m)+(\sum_jp_1q_j)H({p_1q_1\over \sum_jp_1q_j},\cdots,{p_1q_m\over \sum_jp_1q_j}) \\&=H(p_1,p_2q_1,p_2q_2,\cdots,p_nq_m)+p_1H({p_1q_1\over \sum_jp_1q_j},\cdots,{p_1q_m\over \sum_jp_1q_j}) \\&=p_1H({p_1q_1\over \sum_jp_1q_j},\cdots,{p_1q_m\over \sum_jp_1q_j})+H(p_1,p_2q_1,p_2q_2,\cdots,p_nq_m) \\&=p_1H(q_1,\cdots,q_m)+H(p_1,p_2q_1,p_2q_2,\cdots,p_nq_m) \\&=p_1H(q_1,\cdots,q_m)+p_2H(q_1,\cdots,q_m)+H(p_1,p_2,p_3q_1,p_3q_2,\cdots,p_nq_m) \\&=(\sum_i p_i)H(q_1,\cdots,q_m)+H(p_1,p_2,\cdots,p_n) \\&=H(q)+H(p)=H(p)+H(q)=H(X)+H(Y). \end{align*} $$
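
A numerical illustration of this corollary (again with our own small `H` helper and arbitrarily chosen marginals): for independent $X$ and $Y$, the entropy of the product distribution equals $H(p)+H(q)$.

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.3, 0.2]                          # marginal distribution of X
q = [0.6, 0.4]                               # marginal distribution of Y
joint = [pi * qj for pi in p for qj in q]    # independence: P(X=x, Y=y) = P(X=x) P(Y=y)
print(H(joint), H(p) + H(q))                 # the two values agree
```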

1.2.3 Axiom 3:

$p\mapsto H(p,1-p)$, $p\in[0,1]$, should be a continuous function (with the convention $0\log 0=0$).

1.2.4 Axiom 4:

$H(p)$ is invariant to permutations of $(p_1,p_2,\cdots, p_n)$.

$$ \forall \sigma\in S_n,\quad H(p_{\sigma(1)},\dots,p_{\sigma(n)})=H(p_1,\dots,p_n), $$

where $S_n$ is the set of permutations of $\{1,\dots,n\}$.

Theorem 1.2.5 Unique Entropy Function

There is only one choice for $H$.

Under Axioms 1–4, there is exactly one such function for a finite-state distribution (this also carries over to infinitely many states, but we will discuss that later):

$$ H(p_1,\cdots,p_n)=-\sum_{x=1}^n p_x\log_2 p_x $$

Verification of Axiom 3

$$ p\mapsto H(p,1-p)=-p\log_2p - (1-p)\log_2(1-p) $$
  • is continuous
  • More importantly, it’s concave (see Section 1.3; a quick numerical check follows).
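
A quick sanity check of the binary entropy function $h(p)=H(p,1-p)$ (the function name `h` is ours): $h(1/2)=1$ as Axiom 1 requires, and a midpoint test on a grid is consistent with concavity.

```python
import math

def h(p):
    """Binary entropy H(p, 1 - p) in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

print(h(0.5))   # 1.0, matching Axiom 1
# Midpoint test on a grid, consistent with concavity: h((a+b)/2) >= (h(a)+h(b))/2.
grid = [i / 20 for i in range(21)]
assert all(h((a + b) / 2) >= (h(a) + h(b)) / 2 - 1e-12
           for a in grid for b in grid)
```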

1.3 Concavity of Shannon Entropy

Theorem 1.3.1 Convexity of $x\log x$

$$ f:x\mapsto x\ln x~~(x>0),\quad f'(x)=\ln x+1, \quad f''(x)=\frac1x>0. $$

(We adopt the convention $0\log 0=0$, consistent with the limit $\lim_{x\to 0^+}x\log x=0$.)

Def 1.3.2 Convex Set

A subset $K\subseteq \mathbb R^d$ is called a convex set if and only if

$$ \forall v,w\in K,\ \forall \lambda\in[0,1]:\quad \lambda v+(1-\lambda)w\in K. $$

Def 1.3.3 Probability Simplex

For $d\ge 2$, the set of probability functions on $\mathcal X=\{1,\dots,d\}$ is called the probability simplex (geometrically, the standard $(d-1)$-simplex), written as

$$ \Delta_{\cal X}=\{(p(1),\dots,p(d)):p(i)\ge 0,\sum_{i=1}^dp(i)=1\}. $$

This simplex is a convex set: every weighted average (with weights in $[0,1]$ summing to $1$) of probability vectors is again a probability vector on the simplex. This is what makes convexity/concavity of entropy-related quantities well defined; a quick numerical check follows.
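
A tiny numerical check that a convex combination of two probability vectors stays in the simplex (purely illustrative; the vectors and the weight are arbitrary choices):

```python
p0 = [0.7, 0.2, 0.1]
p1 = [0.1, 0.1, 0.8]
lam = 0.3
p_lam = [lam * a + (1 - lam) * b for a, b in zip(p1, p0)]
# Entries are nonnegative and sum to 1, so p_lam lies in the simplex as well.
print(p_lam, sum(p_lam))
```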

Theorem 1.3.4 Concavity of Shannon Entropy

Here $H$ is viewed as a function of the probability vector $(p(x),\ x\in{\cal X})$ on the probability simplex for fixed $d$:

$$ \forall p_0,p_1\in\Delta_{\cal X},\forall \lambda\in[0,1],\quad p_\lambda(x):=\lambda p_1(x)+(1-\lambda)p_0(x), $$

then

$$ H(p_\lambda)\ge \lambda H(p_1)+(1-\lambda)H(p_0). $$

Proof Sketch

This follows from the convexity of $u\log u$ (i.e. the concavity of $-u\log u$, Theorem 1.3.1): for each $x$, $-p_\lambda(x)\log p_\lambda(x)\ge -\lambda p_1(x)\log p_1(x)-(1-\lambda)p_0(x)\log p_0(x)$, and summing over $x\in\cal X$ gives the claim.
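
A numerical spot check of the concavity inequality (our own helpers `H` and `random_dist`; random distributions drawn only for illustration):

```python
import math
import random

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_dist(d):
    """A random point of the probability simplex, used only for testing."""
    w = [random.random() for _ in range(d)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(1000):
    p0, p1 = random_dist(5), random_dist(5)
    lam = random.random()
    p_lam = [lam * a + (1 - lam) * b for a, b in zip(p1, p0)]
    # Concavity: H(p_lambda) >= lambda * H(p1) + (1 - lambda) * H(p0).
    assert H(p_lam) >= lam * H(p1) + (1 - lam) * H(p0) - 1e-9
```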

Theorem 1.3.5 Upper Bound of Entropy = Uniform

The uniform distribution on the finite set $\{1,\cdots,d\}$ has the largest entropy among all probability distributions on this set.

Proof

Let $S_d$ denote the set of all permutations of $\mathcal X=\{1,\cdots,d\}$. Then, for any probability distribution $p$ (written as a probability vector),

$$ (1/d,\cdots,1/d)={1\over d!}\sum_{\sigma\in S_d}\big(p(\sigma(1)),p(\sigma(2)),\cdots,p(\sigma(d))\big), $$

where $\sigma(i)$ denotes the image of $i$ under the permutation $\sigma$.

Permuting the indices in this way is the same as permuting the entries of the probability vector.

Averaging these permuted probability vectors coordinate-wise, every coordinate of the average equals

$$ {(d-1)!\sum_{i\in\mathcal X}p(i)\over d!}=\frac1d, $$

since each value $p(i)$ falls in any fixed coordinate for exactly $(d-1)!$ of the permutations.

Since $H$ is concave (Theorem 1.3.4),

$$ H(1/d,\cdots,1/d) \ge {1\over d!}\sum_{\sigma\in S_d}H\big(p(\sigma(1)),\cdots,p(\sigma(d))\big), $$

and by Axiom 4 (1.2.4) the entropy is invariant to permutations, so every summand on the right equals $H(p)$, i.e. $RHS=H(p)$. Q.E.D.

Corollary: Upper Bound Value

$$ \sup_{p_X} H(X)=\log |\cal X|. $$

where $X$ has a distribution $p_X$ on the domain $\mathcal X$.

Proof. By Theorem 1.3.5 the supremum is attained by the uniform distribution, and $H(1/d,\dots,1/d)=-\sum_{x=1}^d\frac1d\log\frac1d=\log d=\log|\cal X|$.
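
A final numerical illustration (our own helper `H` again, with randomly drawn test distributions): on $d=6$ points no distribution exceeds $\log_2 6$ bits, and the uniform distribution attains it.

```python
import math
import random

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

d = 6
bound = math.log2(d)
print(H([1 / d] * d), bound)        # the uniform distribution attains log2(d)

for _ in range(1000):
    w = [random.random() for _ in range(d)]
    p = [x / sum(w) for x in w]
    assert H(p) <= bound + 1e-9     # no distribution on d points exceeds log2(d)
```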