A quick note that’s mostly regurgitating the paper “Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance” with some comments, code and extra numbers.
The problem is computing
NOTE: The working assumption throughout this is that the inputs are limited to floating point values on the normal range and that no overflow or underflow (absolute value is not smaller than smallest normal number) occurs. Otherwise error bounds are not as quoted.
BEWARE: Also division latency and throughput are not as bad as in days of yore.
The “naive” method multiply by reciprocal
Let’s run though this well known method. First we precompute the reciprocal:
where
Performing the division (the correctly rounded result) has an error bound of
Of course the authors dismiss this as being too awful to contemplate but for those of us willing to increase the error bound for a performance boost it’s a very useful tool. I’m including this method because:
- It’s seemly common to run across code which is performing some
divides to avoid the increased error bound and at the same time and/or the values of are computed with a very wide bound. Improving the latter and using this “naive” method will be faster and lower total error in some cases. - The other methods require hardware FMA.
- As noted in intro the latency of division isn’t the murder it used to be and this might be the only viable option.
- And might as well mention that computing the reciprocal as a double, promoting the
values to doubles and demoting is another potential option.
The product of unevaluated pair method 1 FMA + 1 product
The main proposed method of the paper precomputes an unevaluated pair
One possible way to translate that into C code would be:
typedef struct { float h,l; } f32_pair_t;
// compute the unevaluated pair (h,l) approximation of 1/a
void f32_up_recip(f32_pair_t* p, float a)
{
p->h = 1.f/a;
p->l = -fmaf(p->h, a, -1.f)/a;
}
To approximate the division we simply multiply by the unevaluated pair:
// multiply 'x' by the value represented by the unevaluated pair p=(h,l)
static inline float f32_up_mul(f32_pair_t* p, float x)
{
return fmaf(x, p->h, x*p->l);
}
As an aside there’s a related paper “Correctly rounded multiplication by arbitrary precision constants” which describes how to determine if some constant (stored as an unevaluated pair) is correctly rounded for all x. Some examples include:
So how good is this approximation? The short version is that it’s excellent:
- For
of choices of the result is exact (meaning correctly rounded, aka bit identical to performing the division) for all input. - The remain
of values only return an inexact result for exactly one mantissa bit pattern per power-of-two interval which in single precision means in or approximately - The maximum error of these inexact results is
, so they are all faithfully rounded. The average error is a much tighter and the RMS is .
A minor note is that the sign of a zero output is flushed to positive if the signs of the two elements of the ordered pair are different.
Aside: unevaluated pairs are a special case of (non-overlapping) floating-point expansions. We have:
Corrected “naive” method 2 FMA + 1 product
Up front: this method isn’t very promising since it requires two FMAs and a product. Minimum sketch: we compute the approximate
and a code version:
// it up to the reader to come up with a good name
inline float algorithm_1(float x, float y, float zh)
{
float q = x*zh;
float r = -fmaf(q,y,-x);
return fmaf(r,zh,q);
}
Oh! For form sake!
For the single FMA method the paper provides a recipe to determine if a given divisor
My code is slightly different than the method presented in the paper. First I computed the smallest 0x9f0237
. That’s folded with the paper’s initial test of all even
// computes the unevaluated pair C which is approximates 1/y and returns the mantissa
// bit pattern X for which a result will be inexact (zero if always correctly rounded)
uint32_t fp_up_recip_x(f32_pair_t* C, float y)
{
f32_up_recip(C, y); // compute 1/y approximation C=(h,l)
uint32_t S = f32_to_bits(y); // IEEE bit pattern of 'y'
uint32_t Y = S & 0x7FFFFF; // mantissa as an integer (without implied bit)
// folded test: even or smaller than smallest that fails
if (u32_rotr(Y,1) < 0x800f811b) return 0; // p taken = ~.62113
// 37.89% reach here
Y |= 0x800000; // add in the implied bit
// compute: P- and X- (see supplement)
uint32_t p = u32_mod_inverse(Y) & 0x1FFFFFF;
uint32_t q = p >> 1;
uint32_t X = 0;
// this block could be made branch-free
if (q >= (1<<23)) { // p taken = ~.5
X = (uint32_t)(((uint64_t)p*Y-1)>>25);
}
else {
// compute: P+ and X+
p = 0x2000000-p;
X = (uint32_t)(((uint64_t)p*Y+1)>>25);
}
// ------------------------------------
if (X >= (1<<23)) { // p taken = ~0.742677
// we have a potential X so test if it
// is correctly rounded or not.
float x = (float)X;
float r0 = x/(float)Y;
float r1 = f32_up_mul(C,x);
if (r0 != r1) // p taken = ~0.0452307
return X;
}
return 0;
}
The above needs to the following helper functions:
// multiplicative modulo inverse of odd 'a'
static inline uint32_t u32_mod_inverse(uint32_t a)
{
uint32_t x;
x = (a*a)+a-1;
x *= 2-a*x;
x *= 2-a*x;
x *= 2-a*x;
return x;
}
// warning: 'i' on [1,31] otherwise undefined
static inline uint32_t u32_rotr(uint32_t x, uint32_t i)
{
return (x>>i)|(x<<((0-i) & 31));
}
static inline uint32_t f32_to_bits(float x)
{
uint32_t u; memcpy(&u, &x, 4); return u;
}
static inline float f32_from_bits(uint32_t x)
{
float f; memcpy(&f, &x, 4); return f;
}