INFOTH Note 8: BW Transform

8.1 Burrows-Wheeler Transform

Why

BWT is a string (symbol sequence) transform algorithm for data compression. Via rearranging the symbols and making the same symbols more concentrated, it can improve the efficiency of subsequent compression algorithms.

Transformation Algorithm

Consider a sequence of symbols of length $n\ge 1$ drawn from a finite alphabet $\mathcal X$ (repetition allowed), call the sequence

$$ x_1x_2\dots x_n $$

Do:

  1. Make a $n\times n$ array of the cyclic permutations of the sequence like:

    $$ x_1x_2\dots x_n\\ x_2x_3\dots x_{n}x_1\\ x_3x_4\dots x_{n}x_1x_2\\ \vdots\\ x_nx_1\dots x_{n-2}x_{n-1} $$

    Example: SHANNON original

    ->

    SHANNON
    HANNONS
    ANNONSH
    NNONSHA
    NONSHAN
    ONSHANN
    NSHANNO
    
  2. Make every row lexicographically sorted (字典序)

    ANNONSH
    HANNONS
    NNONSHA
    NONSHAN
    NSHANNO
    ONSHANN
    SHANNON
    
  3. As a sender (transformer) we want to send the receiver (de-transformer):

    • The last column $\hat {\bold{x}}$: HSANONN
    • The row index of the original sequence $i$: 7 here starting from 1

Inverse Algorithm

Suppose we receive a HSANONN and an index 7, we repeat this:

Initialize the partial array with the column $\hat{\bold x}$ HSANONN

  1. Sort the partial array with lexicographic order
  2. Insert $\hat{\bold x}$ before the partial array

Until we inserted $n$ times.

Example:

H
S
A
N
O
N
N

Sort:

A
H
N
N
N
O
S

Insert:

HA
SH
AN
NN
ON
NO
NS

Sort:

AN
HA
NN
NO
NS
ON
SH

Finally after sorting:

ANNONSH
HANNONS
NNONSHA
NONSHAN
NSHANNO
ONSHANN
SHANNON

We recovered the original array, and with the index 7, we index the location and get SHANNON.

Facts

For the tuple $(\hat{\bold x},i)$,

  1. $\hat{\bold x}$ and $i$ uniquely identify the original string

  2. $\hat{\bold x}$ is sent by arithmetic coding because of rather good performance

  3. $i$ is sent by a prefix-free code for sending an integer: Comma-free (PF) Encoding of Positive Integers, p.444 of the Thomas book, Lemma 13.5.1.

    The code length is $\lceil \log i\rceil + 2\lceil \log\lceil\log i\rceil\rceil$.

    Otherwise we will send a binary if not compressed.