Is Fast Inverse Square Root still Fast?
Introduction
Fast Inverse Square Root (Fast InvSqrt) is an algorithm that quickly estimates the reciprocal of the square root of a float variable. The algorithm first appeared in the source code of the first-person shooter Quake III Arena [1], where the inverse square root function is used for vector normalization in the game's lighting and reflection calculations.
At that time, Intel had already introduced the MMX SIMD instruction set, which had no dedicated instruction for computing the reciprocal of the square root of a floating-point value [2]. So the Fast InvSqrt algorithm was one of the fastest ways to compute 1 / sqrt(x) with acceptable precision. The question is: "Are CPU instruction extensions still eating Fast InvSqrt's dust?"
Experiment
I am going to run the original Fast InvSqrt code and compare its execution time with that of several other functions that compute 1 / sqrt(x). I consider the following function the benchmark and assess the other functions' performance relative to it:
float Standard_InvSqrt(float number)
{
    float squareRoot = pow(number, 0.5f);
    return pow(squareRoot, -1.0f);
}
This function uses only the standard math.h library.
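For context, the relative timings quoted below come from a simple measurement loop. The exact harness is not reproduced here, but a minimal sketch along these lines (the iteration count and the clock() timer are illustrative choices, not the original setup) shows the idea:

#include <stdio.h>
#include <time.h>

float Standard_InvSqrt(float number);   /* the function under test, defined above */

int main(void)
{
    volatile float sink = 0.0f;         /* prevents the calls from being optimized away */
    const int iterations = 10000000;

    clock_t start = clock();
    for (int i = 1; i <= iterations; ++i)
        sink += Standard_InvSqrt((float)i);
    clock_t end = clock();

    printf("%.3f seconds (checksum: %f)\n",
           (double)(end - start) / CLOCKS_PER_SEC, (double)sink);
    return 0;
}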
The following is the Fast InvSqrt function as retrieved from [3]. If you are eager to know the logic behind it, you can watch the video at reference [4]:
float Fast_InvSqrt(float number)
{
    long i;                                 // Assumes a 32-bit 'long'
    float x2, y;
    const float threehalfs = 1.5f;

    x2 = number * 0.5f;
    y  = number;
    i  = *(long *)&y;                       // Floating-point bit hack
    i  = 0x5f3759df - (i >> 1);             // Magic number
    y  = *(float *)&i;
    y  = y * (threehalfs - (x2 * y * y));   // Newton's method, 1st iteration
    // y = y * (threehalfs - (x2 * y * y)); // 2nd iteration (disabled)
    return y;
}
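As an aside, the pointer casts above break C's strict-aliasing rules, and 'long' is 64 bits wide on many modern platforms. On a modern compiler the same bit trick is usually written with memcpy and a fixed-width integer, which optimizes to the same instructions. A sketch, assuming 32-bit IEEE-754 floats:

#include <stdint.h>
#include <string.h>

/* Same bit-level trick, expressed with memcpy so the compiler
   performs a well-defined type pun (assumes 32-bit IEEE-754 float). */
float Fast_InvSqrt_Memcpy(float number)
{
    const float threehalfs = 1.5f;
    float x2 = number * 0.5f;
    float y  = number;
    int32_t i;

    memcpy(&i, &y, sizeof(i));           // reinterpret float bits as an integer
    i = 0x5f3759df - (i >> 1);           // magic-number initial guess
    memcpy(&y, &i, sizeof(y));           // back to float
    y = y * (threehalfs - (x2 * y * y)); // one Newton-Raphson iteration
    return y;
}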
The Fast_InvSqrt function runs ~132% faster than Standard_InvSqrt, so it seems Fast InvSqrt is still the winner. Now let's optimize Standard_InvSqrt a bit. We can combine the two pow calls into one, which leads to the code below:
float Standard_InvSqrtV2(float number)
{
    return pow(number, -0.5f);
}
Standard_InvSqrtV2 runs ~157% faster than the original version, which makes it ~11% faster than Fast_InvSqrt. Still, it may be too early to judge Fast InvSqrt's performance, because the machine specification and the compiler optimizations can affect the assessment.
For a better judgement, let's replace the pow function with the sqrt function from the math.h library. The code looks as follows:
float Standard_InvSqrtV3(float number)
{
    float squareRoot = sqrt(number);
    return 1.0f / squareRoot;
}
Standard_InvSqrtV3 runs ~578% faster than the original version! That is a significant jump compared with the previous version, so it is time to take a deeper look into Standard_InvSqrtV3. The following is the x64 assembly of the function:
Standard_InvSqrtV3(float):
    pxor    xmm2, xmm2
    movaps  xmm1, xmm0
    ucomiss xmm2, xmm0
    sqrtss  xmm1, xmm1                ; <-- sqrt(number)
    ja      .L12                      ; <-- jump to call 'sqrt()' to set 'errno'
    movss   xmm0, DWORD PTR .LC1[rip]
    divss   xmm0, xmm1                ; <-- 1.0f / sqrt(number)
    ret
As shown above, the standard sqrt function is translated to the sqrtss instruction [5], and the division to the divss instruction, both introduced with the SSE instruction set.
Apparently, the newer instruction sets can outperform Fast InvSqrt. Fortunately, my machine also supports the AVX instruction set, so I can implement 1 / sqrt(x) using AVX intrinsic functions [6] and compare it with the SSE version. All I need is to include the <immintrin.h> header file and pass the -mavx flag when compiling with GCC [7].
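For reference, a GCC compile line of this shape does the job (the file names here are placeholders):

gcc -O0 -mavx invsqrt_bench.c -o invsqrt_bench -lm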
Following is the AVX based square root inverse computation code:
float AVX_InvSqrt(float number)
{
    float array[8] = { number };  // 256-bit ymm register (8x 32-bit floats)
    __m256 _srcRegister = _mm256_loadu_ps(array);
    __m256 _dstRegister = _mm256_rsqrt_ps(_srcRegister);
    _mm256_storeu_ps(array, _dstRegister);
    return array[0];              // The result is in the lowest element
}
AVX_InvSqrt runs ~123% faster than our benchmark. Even though _mm256_rsqrt_ps can approximately compute the reciprocal square root of eight float values at once, for a single float value it is ~4% slower than Fast_InvSqrt.
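To be fair to AVX, _mm256_rsqrt_ps is really designed for batch workloads such as normalizing many vectors, where all eight lanes carry real data and one call produces eight results. A sketch of that usage pattern (the batch function and its interface are my own illustration):

#include <immintrin.h>

/* Illustrative batch variant: 1 / sqrt(x) for eight floats per call,
   the access pattern _mm256_rsqrt_ps is actually built for. */
void AVX_InvSqrt_Batch8(const float *in, float *out)
{
    __m256 src = _mm256_loadu_ps(in);   // load 8 (possibly unaligned) floats
    __m256 dst = _mm256_rsqrt_ps(src);  // 8 approximate reciprocal square roots
    _mm256_storeu_ps(out, dst);         // store all 8 results
}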
The AVX_InvSqrt implementation itself is also a bit unfair: _mm256_loadu_ps loads an array (which is not necessarily aligned) from memory into a ymm register, which is a relatively expensive operation. So we can replace the load instruction with a value-broadcast instruction, which looks like this:
float AVX_InvSqrtV2(float number)
{
    __m256 _srcRegister = _mm256_set1_ps(number);
    __m256 _dstRegister = _mm256_rsqrt_ps(_srcRegister);
    float array[8];               // 256-bit ymm register (8x 32-bit floats)
    _mm256_storeu_ps(array, _dstRegister);
    return array[0];              // All elements hold the same result
}
AVX_InvSqrtV2 runs ~392% faster than our benchmark, which makes it ~121% faster than its original version and ~112% faster than Fast_InvSqrt. However, it is still ~38% slower than Standard_InvSqrtV3. So let's go back to Standard_InvSqrtV3.
I increased the compiler optimization from level 0 to level 1 to see whether the compiler would replace the combination of sqrtss and divss with the rsqrtss instruction [8]. It did not. So I used SSE intrinsic functions to issue the instruction myself:
float SSE_InvSqrt(float number)
{
    __m128 _srcRegister = _mm_set1_ps(number);
    __m128 _dstRegister = _mm_rsqrt_ps(_srcRegister);
    float array[4];               // 128-bit xmm register (4x 32-bit floats)
    _mm_storeu_ps(array, _dstRegister);
    return array[0];              // All elements hold the same result
}
This last function runs ~592% faster than the benchmark, which makes it only ~2% faster than Standard_InvSqrtV3. On the other hand, SSE_InvSqrt has a ~2% larger computation error than Standard_InvSqrtV3.
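That precision gap is not fundamental: the hardware rsqrt estimate carries roughly 12 bits of accuracy, and a single Newton-Raphson step, the same refinement Fast_InvSqrt performs, brings it close to full single-precision accuracy at the cost of a few extra instructions. A sketch of that refinement (the function name is mine):

#include <immintrin.h>

/* SSE reciprocal square root refined by one Newton-Raphson step:
   y' = y * (1.5 - 0.5 * x * y * y). Trades a few extra instructions
   for roughly full single-precision accuracy. */
float SSE_InvSqrt_Refined(float number)
{
    __m128 x           = _mm_set_ss(number);
    __m128 y           = _mm_rsqrt_ss(x);          // ~12-bit estimate
    __m128 half        = _mm_set_ss(0.5f);
    __m128 three_halfs = _mm_set_ss(1.5f);

    __m128 x2 = _mm_mul_ss(x, half);               // 0.5 * x
    __m128 yy = _mm_mul_ss(y, y);                  // y * y
    __m128 t  = _mm_sub_ss(three_halfs, _mm_mul_ss(x2, yy));
    y = _mm_mul_ss(y, t);                          // refined estimate

    return _mm_cvtss_f32(y);
}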
Conclusion
As the results above show, the SSE_InvSqrt function is the fastest way to compute 1 / sqrt(x) with reasonable precision. However, the standard sqrt function can provide more or less the same performance in a SISD context, and with definitely better portability and maintainability.
Update
The Standard_InvSqrtV3 function can be compiled with the -ffast-math flag (with some risks, explained in [9]) so that the compiler no longer calls the sqrt function in libm.a to set the error number (errno) for negative inputs. With that flag enabled, no comparison or branch instruction is involved, and Standard_InvSqrtV3 runs ~33% faster than its original version (thanks to Marc Reynolds for informing me about this).
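As Marc Reynolds also pointed out, math errno is a legacy feature, and the narrower -fno-math-errno flag (in GCC and Clang) disables only that behavior. With it, the function compiles down to just a square root and a division with no branch (https://godbolt.org/z/fKWrv6759):

vsqrtss xmm0, xmm0, xmm0
vmovss  xmm1, dword ptr [rip + .LCPI1_0]
vdivss  xmm0, xmm1, xmm0
ret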
References
[1] Wikipedia: Fast inverse square root
[2] Wikipedia: MMX (instruction set)
[4] YouTube: Fast Inverse Square Root — A Quake III Algorithm