Sunday, April 22, 2018

Proper pbit computation in the BC7 texture format

The BC7 GPU texture format supports the clever concept of endpoint "pbits", where the LSB's of RGB(A) endpoints are forced to the same value so only 1 bit (vs. 3 or 4) needs to be coded. BC7's use of pbits saves precious bits which can be used for other things which decrease error more. Some modes support a unique pbit per endpoint, and some only have 1 pbit for each endpoint pair.

I'm covering this because the majority of available BC7 encoders mess this important detail up. (I now kinda doubt any available BC7 encoder handles this correctly in all modes.) The overall difference across a 31 texture corpus (including the kodim images) is .26 dB RGB PSNR, which is quite a lot considering the CPU/GPU cost of doing this correctly vs. incorrectly is the same. (The improvement is even greater if you try all pbit combinations with proper rounding: .4 dB.)

ispc_texcomp handles this correctly for sure in most if not all modes, while the DirectXTex CPU, Volition GPU, and NVidia Texture Tool encoders don't as far as I can tell (they use pbit voting without proper rounding - the worst). The difference to doing this correctly in some modes is pretty significant - by at least ~.6 dB in mode 0!

Not doing this properly means your encoder will run slower because it will have to work harder (scanning more of the search space) to keep PSNR up vs. the competition. The amount of compute involved in lifting a BC7 encoder "up" by .26 dB across an entire test corpus is actually quite large, because there's a very steep quality vs. compute "wall" in BC7.

Here are some of the ways p-bits can be computed. The RGB PSNR's were for a single encoded image (kodim18), purposely limited to mode 0 (with 4 bit components+unique per-endpoint pbits) to emphasize the importance of doing this correctly:
  • 40.217 dB: pbit search+compensated rounding: Compute properly rounded component endpoints compensating for the chosen pbit, try all pbit combinations. This is an encoder option in my new BC7 encoder. Encoding error evaluation cost: 2X or 4X (expensive!)
  • 39.663 dB: Round to middle of component bin+pbit search: Compute rounded endpoints (with a scale factor not including the pbit), then evaluate the full error of all 2 or 4 pbit possibilities. This is what I initially started doing, because it's trivial to code. In mode 0, you scale by 2^4, round, then iterate through all the pbits and test the error of each combination. Error evaluation cost: 2X or 4X
  • 39.431 dB: Compensated rounding (ispc_texcomp method): Proper quantization interval rounding, factoring in the shift introduced when the pbit is 1 vs. 0. The key idea: If an endpoint scaled and rounded to full-precision (with a scale factoring in the pbit) has an LSB which differs from the pbit actually encoded, you must properly round the output component value to compensate for the different LSB or you will introduce more error than necessary. So basically, if the LSB you want is different from what's encoded, you need to correctly round the encoded component index to compensate for this difference. You must also factor in the way the hardware scales up the encoded component+pbit to 8-bits. Error evaluation cost: 1X
  • 39.151 dB: Voting/parity (DirectXTex/Volition): Count how many endpoint components in the scaled colors (with a scale factor including the pbit) sharing each pbit have set LSB's. If half or more do then set the encoded pbit to 1, otherwise 0. pbit voting doesn't round the encoded components so it introduces a lot of error. Error evaluation cost: 1X
  • 38.712 dB: Always 0 or 0,1
  • 37.878: Always 0 or 0,0
I tested a few different ways of breaking ties when computing pbits by voting and just reported the best one. At least on this image biasing the high endpoint towards 1 helps a little:

Shared  Unique
> >    39.053
> >=   39.151
>= >   38.996
>= >=  38.432
>=  > >=   39.151

This stuff is surprisingly tricky to get right, so here's a mode 0 example to illustrate what's going on. Each component can be coded to 16 possible values with one pbit selecting between two different ramps. So factoring in the pbit we have 32 possible representable 8-bit values. Here are the resulting 8-bit values (scaled using the shift+or method BC7 uses - not pure arithmetic scaling by 255/31 which is slightly different):

pbit 0:
0
16
33
49
66
82
99
115
132
148
165
181
198
214
231
247

pbit 1:
8
24
41
57
74
90
107
123
140
156
173
189
206
222
239
255

Let's say the encoder wants to encode an endpoint value of 9/255 (using 8-bit precision) in mode 0 (4-bit components+pbit). The pbit voting encoders will compute a quantized/scaled component value of 1/31 (from a range of [0,31] - not [0,15] because we're factoring in the pbit). The LSB is 1 and the encoded component index (the top 4 MSB's) is 0, and if more than half of the other component LSB's are also 1 we're ok. In the good case we're coding a value of 8/255, which is closer to 9/255 than 24/255.

If instead a pbit of 0 is chosen, we're now encoding a value of 0/255 (because the encoded component index of 0 wasn't compensated), when we should have chosen the closer value of 16/255 (i.e. a component index of 1). Choosing the wrong LSB and not compensating the index has resulted in increased error.

There's an additional bit of complexity to all this: The hardware scales the mode 0 index+pbit up to 8-bits by shifting the index+pbit left by 3 bits for mode 0, then it takes the upper 3 MSB's of this and logically or's them into the lower 3 bits to fill in. This isn't quite the same as scaling by 255/31. So proper pbit determination code needs to factor this in, too. Here are the ramps computed using arithmetic scaling+rounding (notice they slightly differ from the previous ramps computed using shifting+or'ing):

0
16
33
49
66
82
99
115
132
148
165
181
197
214
230
247

8
25
41
58
74
90
107
123
140
156
173
189
206
222
239
255

I worked out the formulas involved on a piece of paper: 

How to compute [0,1] values from mode 0 bins+pbits (using arithmetic scaling, NOT hardware scaling):
pbit 0: value=bin*2/31
pbit 1: value=(bin*2+1)/31

How to compute mode 0 bins from [0,1] values with proper compensation/rounding (rearranging the equations+rounding) for each pbit index:
pbit 0: bin=floor(value*31/2+.5)
pbit 1: bin=floor((value*31-1)/2+.5)

Here's the clever code in ispc_texcomp that handles this correctly for modes with unique p-bits (modes 0,3,6,7). I bolded the bin calculations, which are slightly optimized forms of the previous set of equations. 

I believe there's actually a bug in here for mode 7 - I don't see it scaling the component values up to 8-bit bytes for this mode. It has a special case in there to handle mode 0, and modes 3/6 don't need scaling because they have 7-bit components, but mode 7 has 5-bit components. I didn't check the rest of the code to see if it actually handles mode 7 elsewhere, but it's possible ispc_texcomp's handling of mode 7 is actually broken due to this bug. Mode 7 isn't valuable when encoding opaque textures, but is pretty valuable for alpha textures because it's the only alpha mode that supports partitions.

///////////////////////////
// endpoint quantization

inline int unpack_to_byte(int v, uniform const int bits)
{
    assert(bits >= 4);
    int vv = v << (8 - bits);
    return vv + shift_right(vv, bits);
}

void ep_quant0367(int qep[], float ep[], uniform int mode, uniform int channels)
{
    uniform int bits = 7;
    if (mode == 0) bits = 4;
    if (mode == 7) bits = 5;

    uniform int levels = 1 << bits;
    uniform int levels2 = levels * 2 - 1;

    for (uniform int i = 0; i < 2; i++)
    {
        int qep_b[8];

        for (uniform int b = 0; b < 2; b++)
            for (uniform int p = 0; p < 4; p++)
            {
                int v = (int)((ep[i * 4 + p] / 255f*levels2 - b) / 2 + 0.5) * 2 + b;
                qep_b[b * 4 + p] = clamp(v, b, levels2 - 1 + b);
            }

        float ep_b[8];
        for (uniform int j = 0; j < 8; j++)
            ep_b[j] = qep_b[j];

        if (mode == 0)
            for (uniform int j = 0; j < 8; j++)
                ep_b[j] = unpack_to_byte(qep_b[j], 5);

        float err0 = 0f;
        float err1 = 0f;
        for (uniform int p = 0; p < channels; p++)
        {
            err0 += sq(ep[i * 4 + p] - ep_b[0 + p]);
            err1 += sq(ep[i * 4 + p] - ep_b[4 + p]);
        }

        for (uniform int p = 0; p < 4; p++)
            qep[i * 4 + p] = (err0 < err1) ? qep_b[0 + p] : qep_b[4 + p];
    }
}

No comments:

Post a Comment