INFOTH Note 8: BW Transform
8.1 Burrows-Wheeler Transform
Why
BWT is a string (symbol sequence) transform algorithm for data compression. Via rearranging the symbols and making the same symbols more concentrated, it can improve the efficiency of subsequent compression algorithms.
Transformation Algorithm
Consider a sequence of symbols of length $n\ge 1$ drawn from a finite alphabet $\mathcal X$ (repetition allowed), call the sequence
$$ x_1x_2\dots x_n $$Do:
-
Make a $n\times n$ array of the cyclic permutations of the sequence like:
$$ x_1x_2\dots x_n\\ x_2x_3\dots x_{n}x_1\\ x_3x_4\dots x_{n}x_1x_2\\ \vdots\\ x_nx_1\dots x_{n-2}x_{n-1} $$Example:
SHANNONoriginal->
SHANNON HANNONS ANNONSH NNONSHA NONSHAN ONSHANN NSHANNO -
Make every row lexicographically sorted (字典序)
ANNONSH HANNONS NNONSHA NONSHAN NSHANNO ONSHANN SHANNON -
As a sender (transformer) we want to send the receiver (de-transformer):
- The last column $\hat {\bold{x}}$:
HSANONN - The row index of the original sequence $i$:
7here starting from 1
- The last column $\hat {\bold{x}}$:
Inverse Algorithm
Suppose we receive a HSANONN and an index 7, we repeat this:
Initialize the partial array with the column $\hat{\bold x}$ HSANONN
- Sort the partial array with lexicographic order
- Insert $\hat{\bold x}$ before the partial array
Until we inserted $n$ times.
Example:
H
S
A
N
O
N
N
Sort:
A
H
N
N
N
O
S
Insert:
HA
SH
AN
NN
ON
NO
NS
Sort:
AN
HA
NN
NO
NS
ON
SH
…
Finally after sorting:
ANNONSH
HANNONS
NNONSHA
NONSHAN
NSHANNO
ONSHANN
SHANNON
We recovered the original array, and with the index 7, we index the location and get SHANNON.
Facts
For the tuple $(\hat{\bold x},i)$,
-
$\hat{\bold x}$ and $i$ uniquely identify the original string
-
$\hat{\bold x}$ is sent by arithmetic coding because of rather good performance
-
$i$ is sent by a prefix-free code for sending an integer: Comma-free (PF) Encoding of Positive Integers, p.444 of the Thomas book, Lemma 13.5.1.
The code length is $\lceil \log i\rceil + 2\lceil \log\lceil\log i\rceil\rceil$.
Otherwise we will send a binary if not compressed.