Data Compression

Feel free to leave a reply if something in this post is wrong or could be improved.

This guide explains various data compression methods. You can use them to shrink data in places like:

  • custom data sets
  • program ROMs (CPU)
  • memory

… at the cost of degraded performance, since the data has to be decoded at runtime.

1. Bitstream

Sometimes you want to read a number with a known restriction - say, the value can't be greater than 32. In that case, instead of reading an entire byte, you can simply read the next 6 bits. (Not 5 bits, because 5 bits only reaches 31, not 32.)

Or if you want to read a boolean. Instead of consuming an entire byte, you can just read a single bit.

Define two internal variables, P and B:
P is the pointer to the bit position within B
B is the current byte

Define the property C as:
(B & (1 << P)) != 0

Define the function read_bit() as:

if P == 8:
    P = 0
    B = read_next_byte()
retval = C
P += 1
return retval

Define the initial value of B as 0 and the initial value of P as 8.

The function read_next_byte() should read the next byte from the byte stream (e.g. FILE or System.IO.Stream).

Define the function read_bits(n) as:

val = 0
for i = 0; i < n; i++:
    val = (val << 1) | read_bit()
return val

Now you can read individual bits rather than consuming whole bytes. If for instance you want to read 20 bits, reading 3 bytes ends up wasting 4 bits, whereas this way you can read 20 bits directly without wasting any leftover bits.
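As a concrete sketch, here's the reader above in C over an in-memory buffer (a real implementation would pull bytes from a FILE or stream; the struct and function names are just illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal bit reader over an in-memory buffer. */
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos; /* index of the next byte to load into B */
    uint8_t B;  /* current byte */
    int P;      /* bit position inside B; 8 forces a reload */
} BitReader;

void br_init(BitReader *br, const uint8_t *data, size_t len) {
    br->data = data; br->len = len; br->pos = 0;
    br->B = 0;
    br->P = 8; /* initial value of P is 8, so the first read loads a byte */
}

int br_read_bit(BitReader *br) {
    if (br->P == 8) {
        assert(br->pos < br->len); /* caller must not read past the end */
        br->B = br->data[br->pos++];
        br->P = 0;
    }
    int bit = (br->B & (1 << br->P)) != 0; /* the property C */
    br->P++;
    return bit;
}

/* Reads n bits, most significant bit first, as in read_bits(n) above. */
uint32_t br_read_bits(BitReader *br, int n) {
    uint32_t val = 0;
    for (int i = 0; i < n; i++)
        val = (val << 1) | (uint32_t)br_read_bit(br);
    return val;
}
```

Note that this takes bits LSB-first within each byte, matching the (B & (1 << P)) definition of C, while read_bits assembles them MSB-first, matching the pseudocode.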

This is one of the simplest space-saving techniques, and if that’s not enough, let’s keep going.

2. Exponential Golomb Coding

Imagine a number that can hold values like 1, 2, or 580000. If the restriction is large (580000 alone needs 20 bits) but most values are small, it is wise to use dynamic-length coding rather than a fixed number of bits. In essence, this algorithm automatically consumes fewer bits for smaller numbers and more bits for larger ones.

Golomb codes were proposed in the ’60s (the exponential variant came later), and they gained lots of popularity after being featured in the most popular video codecs - and of course, I’m talking about H.264, H.265, H.266 and AV1.

There are multiple flavors of Exponential Golomb codings - UE (Unsigned), SE (Signed), ME (Mapped) and TE (Truncated).

To read an Unsigned ExpGolomb (for unsigned integers of variable size):

# read_ue_golomb
ZC = 0 # Leading Zero Count
while !read_bit() && ZC <= 31:
    ZC += 1
return (1 << ZC) - 1 + read_bits(ZC)

To read a Signed ExpGolomb (for signed integers of variable size):

# read_se_golomb
CN = read_ue_golomb() # CodeNum
val = (CN + 1) >> 1
return (CN & 1) ? val : -val # odd CodeNums are positive: 0, 1, -1, 2, -2, …

To read a Truncated ExpGolomb (which takes the upper bound x as a parameter):

# read_te_golomb(x)

if x == 1:
    return !read_bit() # the single-bit case is inverted (as in H.264)
else:
    V = read_ue_golomb()
    if V >= x:
        ERROR()
    return V
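Putting the UE and SE readers together, a C sketch might look like this (a plain array of 0/1 ints stands in for the bitstream reader, purely for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Toy "bitstream": an array of 0/1 ints with a cursor. */
typedef struct { const int *bits; int pos; } Bits;

static int read_bit(Bits *b) { return b->bits[b->pos++]; }

static uint32_t read_bits(Bits *b, int n) {
    uint32_t v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 1) | (uint32_t)read_bit(b);
    return v;
}

/* ue(v): count leading zeros, then 2^ZC - 1 + the next ZC bits. */
uint32_t read_ue_golomb(Bits *b) {
    int zc = 0;
    while (zc < 32 && !read_bit(b))
        zc++;
    return ((uint32_t)1 << zc) - 1 + read_bits(b, zc);
}

/* se(v): map the unsigned CodeNum to 0, 1, -1, 2, -2, … */
int32_t read_se_golomb(Bits *b) {
    uint32_t cn = read_ue_golomb(b);
    int32_t val = (int32_t)((cn + 1) >> 1);
    return (cn & 1) ? val : -val;
}
```

For example, the bits 0 0 1 0 0 decode as ue(v) = 3: two leading zeros, a stop bit, then two suffix bits equal to 0, giving (1 << 2) - 1 + 0 = 3.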

3. Quantization

Sometimes you can trade precision for compression rate.

Quantization involves lowering data values to minimize the number of bits being written. For instance, you could turn “10” into “2” and at runtime transform it back with 2 * 5 to get “10” again. Writing 2 takes fewer bits than writing 10.

Remember, for Exp Golombs that’s a big deal - they’re variable length, so smaller values directly mean fewer bits.

There are countless ways to implement quantization. Here’s an example that quantizes audio samples to sample / 5, then turns the quantized values back with quantized * 5. Of course, that means you can’t represent 1, 2, 3, 4, 6, 7, and so on anymore, but for audio samples that’s barely a noticeable difference.

int quantize_sample(int sample) {
    return sample / 5;
}

int dequantize_sample(int quantized) {
    return quantized * 5;
}

The most popular method of implementing quantization is using a quant (a.k.a. scale factor): your values are divided by a single quant value and stored (often as floating point), and that same value is used as the multiplier to restore them.

void quantize_multiple_values(float* array, int length, float quant) {
    for (int i = 0; i < length; i++) {
        array[i] /= quant;
    }
}

void dequantize_multiple_values(float* array, int length, float quant) {
    for (int i = 0; i < length; i++) {
        array[i] *= quant;
    }
}

This is a type of lossy compression and can result in minor data loss due to precision, but the difference is usually so small that it’s barely noticeable in contexts like video, audio or image compression. But where data must be preserved exactly, like machine code, quantization should be used carefully, if at all.

4. Arithmetic Coding

It’s popular in video codecs - H.264, H.265, H.266 and AV1. Chances are your MP4 video file has arithmetic coding turned on in the video codec data (called CABAC in H.264, H.265 and H.266).

This involves predicting values in the bitstream using probability models. For instance, “yes, given that there were 3 bits of 1 before, I’m 70% confident that there’s a 1 afterwards. If there isn’t, provide more bits to prove me wrong.”

I myself am not sure how this actually works in detail, but it’s worth including here given how popular and powerful arithmetic coding is. I recommend checking out the boolean entropy coder (used in VP8) - it’s the simplest type of arithmetic coding.

5. RLE

Run Length Encoding. It involves describing how the data repeats rather than including every repetition.

For instance, rather than saying “helloooooooooo”, you could say

String "hell". No RLE
String "o". RLE enabled. Repeat this 10 times.

It’s highly popular especially in image compression.
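As an illustration, here’s a minimal (count, byte) flavor of RLE in C. Real formats (PackBits, for example) use a smarter layout so runs of non-repeating bytes aren’t doubled in size:

```c
#include <assert.h>
#include <stddef.h>

/* Encodes each run of identical bytes as a (count, byte) pair.
 * Runs are capped at 255 so the count fits in one byte.
 * Returns the number of bytes written; out must be able to hold
 * up to 2 * len bytes in the worst case. */
size_t rle_encode(const unsigned char *in, size_t len,
                  unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < len; ) {
        unsigned char c = in[i];
        size_t run = 1;
        while (i + run < len && in[i + run] == c && run < 255)
            run++;
        out[o++] = (unsigned char)run;
        out[o++] = c;
        i += run;
    }
    return o;
}
```

For the string above, rle_encode turns the 14 bytes of “helloooooooooo” into 8 bytes: {1,'h', 1,'e', 2,'l', 10,'o'}.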

6. Data inheritance

This is simple. Split your data into windows, each window being a pool of, let’s say, 16 bytes. Each pool starts with a bit that specifies, “hey, should this pool inherit from the previous one?”. If that bit is 1, the current pool carries no data of its own, and at runtime it inherits the data from the previous pool.

Like this.

# compute_pool(px, py) - px is the previous pool, py the current one
if py.inherit:
    for i = 0; i < 16; i++:
        py.private_data[i] = px.private_data[i]
else:
    for i = 0; i < 16; i++:
        py.private_data[i] = read_bits(8)
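In C, under the assumption that fresh pool data arrives as plain bytes (a byte pointer standing in for the read_bits(8) calls), the same logic might look like:

```c
#include <assert.h>
#include <string.h>

#define POOL_SIZE 16

typedef struct {
    unsigned char private_data[POOL_SIZE];
} Pool;

/* Fills the current pool py: either by copying the previous pool px
 * (inherit == 1) or by consuming POOL_SIZE fresh bytes from src.
 * Returns the next unread position in src. */
const unsigned char *compute_pool(Pool *py, const Pool *px,
                                  int inherit, const unsigned char *src) {
    if (inherit) {
        memcpy(py->private_data, px->private_data, POOL_SIZE);
    } else {
        memcpy(py->private_data, src, POOL_SIZE);
        src += POOL_SIZE;
    }
    return src;
}
```

Every inherited pool costs one bit instead of 16 bytes, which is where the saving comes from - at the price of an extra copy at decode time.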