For x87, you can use FLDZ to load `+0.0` onto the register stack. My guess is that neither Intel nor AMD has tried particularly hard to optimize that, given x87's legacy status and weird register semantics (see: them being a pseudo-stack).
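For SIMD-based FP, you can use VZEROALL if you want to zero the {X,Y,Z}MM registers wholesale (as far as I know only registers 0-15 are affected), or VZEROUPPER to clear just the bits above the low 128 of each. For a single register, I believe PXOR is still the common idiom, mirroring `XOR EAX, EAX`.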
Should read: for x87, unless you are doing this for fun, shoot yourself in the face and instead use SSE because no one in their right mind should be writing x87 code in 2021!
Please help this architectural misfeature die; that should have happened decades ago.
> Should read: for x87, unless you are doing this for fun, shoot yourself in the face and instead use SSE because no one in their right mind should be writing x87 code in 2021!
My information is pretty old, but IIRC x87 still has specialized instructions for sin/cos/tan that are sometimes more performant than their equivalent implementations in SSE. x87 instructions are also very small, so trig-heavy workloads where I$ is a measurable performance component might unfortunately still be a good fit for x87.
On recent Intel/AMD CPUs, the x87 transcendental functions are usually much slower than their equivalents using SSE/AVX instructions.
For future CPUs, it is expected that the performance gap will increase.
The x87 trigonometric functions typically require between 100 and 200 clock cycles. During that time a recent CPU can execute 200 to 500 instructions, enough to compute many values of a trigonometric function (using a polynomial approximation). When SIMD instructions can be used, several tens of values of a function could be computed in the time taken by a single x87 instruction.
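To make the polynomial-approximation point concrete, here is a minimal sketch in C. It assumes the argument has already been reduced to |x| <= pi/4 and uses plain Taylor coefficients rather than a tuned minimax fit, so it is an illustration and not what any real libm does. The names `poly_sin` and `poly_sin_array` are made up for this example.

```c
#include <stddef.h>

/* Approximate sin(x) for |x| <= pi/4 using the odd Taylor polynomial
 * up to x^9, evaluated in Horner form on x^2.
 * Assumption: range reduction has already been done by the caller. */
double poly_sin(double x)
{
    const double c3 = -1.0 / 6.0;       /* -1/3! */
    const double c5 =  1.0 / 120.0;     /*  1/5! */
    const double c7 = -1.0 / 5040.0;    /* -1/7! */
    const double c9 =  1.0 / 362880.0;  /*  1/9! */

    double x2 = x * x;
    double p = c9;
    p = p * x2 + c7;
    p = p * x2 + c5;
    p = p * x2 + c3;
    /* sin(x) ~= x + x^3 * (c3 + c5 x^2 + c7 x^4 + c9 x^6) */
    return x + x * x2 * p;
}

/* Apply the approximation to a whole array. The loop body is branch-free
 * straight-line arithmetic, which is the shape a compiler can turn into
 * SSE/AVX code; whether it actually does depends on compiler and flags. */
void poly_sin_array(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = poly_sin(in[i]);
}
```

Each call is a handful of multiplies and adds with no microcode involved, which is why many of them fit in the time one x87 trig instruction takes. On the reduced range the truncation error of this particular polynomial stays below roughly 2e-9; a production implementation would use better coefficients and a careful range reduction.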
x87 trig instructions are pretty inaccurate even in the supported range mainly because of faulty range reduction [1] and can't be vectorized at all. If you have trig-heavy workloads nowadays you would want SIMD libm, not x87.
Why can't the trig instructions be vectorized? I've never quite understood why SSE/AVX didn't add trig functions to finally kill off any argument for using x87.
Unlike simple operations like a floating-point multiplication, trigonometric functions and the other transcendental functions are too complex, so they must be split into many steps.
If a trigonometric function is encoded as a single instruction, then it must launch a microprogram, to execute the many required steps.
A microprogram cannot execute faster than the same steps would if they were encoded as separate instructions, so the only advantage of encoding a trigonometric function in a single instruction would be to reduce the program size. Most programs contain few trigonometric functions, so the reduction in program size is not worthwhile.
While a microprogrammed trigonometric function could be as fast as the equivalent sequence of instructions, in reality it is usually much slower.
The reason is that modern CPUs are optimized for the most frequent instructions and dedicate a minimum of resources to the seldom-used microprogrammed instructions, so these hit various limitations that do not exist for the simple instructions. The microprogrammed instructions usually have some phases whose execution cannot be overlapped in time with other instructions, which leads to lower performance.
Gotcha, hence the reason they've added instructions like FMA. Those don't require microprograms but do make things like calculating a Taylor series faster. Right?
> My information is pretty old, but IIRC x87 still has specialized instructions for sin/cos/tan that are sometimes more performant than their equivalent implementations in SSE.
They absolutely are not. If you accept the same, crappy precision, you can do an estimate in just a few cycles instead of the 60+ that the instructions take.