Except even intrinsics aren't much of a guarantee - clang converts them to its own internal operations and runs the same optimization passes over them as it does over its autovectorized code; and there are no intrinsics for x86's inline memory operands, so issues can arise around those (or the inverse - I've seen clang load the same constant from memory twice, the second load sitting behind some branches, even though a single load would have sufficed).
And there are also no intrinsics for most scalar operations; e.g. there's no way to force "x>>48 == 0x1234" to actually be computed via the shift rather than as "x & 0xffff000000000000 == 0x1234000000000000" (or vice versa).
And of course assembly means writing platform-specific code (potentially undesirable even if you only want the optimization on a single architecture, as it means having to learn to write that architecture's assembly).
There is a potential middle ground in black-boxing values, but as-is in C/C++ the way to do that is a no-op asm block, which can make register allocation worse and still requires some platform-specific logic to derive the register kind from the value type.