I've only glanced over that paper, but they're using a processor with a 5-stage pipeline, which is really short by modern standards (Zen 1 uses a 19-stage pipeline, and I couldn't quickly find the number for subsequent versions of Zen). Using a very short pipeline significantly reduces the advantage of control flow speculation (CFS)... but they still showed CFS offering up to a ~50% advantage over their best alternative, if I'm reading that right.
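As a back-of-the-envelope sketch of why pipeline depth matters here: the toy model below compares a core that stalls on every branch against one that speculates and only pays on a mispredict. The branch fraction, mispredict rate, and penalty cycles are all made-up illustrative values, not figures from the paper or from any real Zen core. The paper's alternative is smarter than a plain stall, so treat this as an upper bound on the gap.

```python
def cycles_per_instruction(branch_fraction, penalty_cycles, penalized_fraction, base_cpi=1.0):
    """Average CPI when a fraction of branches each cost penalty_cycles extra."""
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles


def speculation_speedup(penalty_cycles, branch_fraction=0.2, mispredict_rate=0.05):
    """Speedup of speculating over stalling on every branch (toy model)."""
    # No speculation: every branch waits until it resolves.
    stalled = cycles_per_instruction(branch_fraction, penalty_cycles, 1.0)
    # Speculation: only mispredicted branches pay the flush penalty.
    speculated = cycles_per_instruction(branch_fraction, penalty_cycles, mispredict_rate)
    return stalled / speculated


if __name__ == "__main__":
    # Assumed penalties: a couple of cycles for a short pipeline, many more for a deep one.
    for label, penalty in (("short pipeline", 2), ("deep pipeline", 15)):
        print(f"{label}: {speculation_speedup(penalty):.2f}x")
```

Under these assumptions the advantage of speculating grows as the branch penalty grows, which is the direction I'd expect on a deeper pipeline.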
I wish they had included the geometric mean of their benchmarks, but I didn't see it anywhere, and I'm not going to run the numbers right now. Even if the speedup CFS offered on a 5-stage pipeline were "only" 25%, that would still be huge... and on a deeper pipeline, that delta would grow. "Not much" is drastically different from my interpretation of those results.
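For anyone who does want to run the numbers, the geometric mean of per-benchmark speedups is only a few lines. The speedup values below are hypothetical placeholders, not results from the paper; you'd substitute the per-benchmark ratios from their tables.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed via logs for stability."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need a non-empty list of positive ratios")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical per-benchmark speedups (speculative time ratio), NOT from the paper.
    speedups = [1.05, 1.10, 1.25, 1.50]
    print(f"geomean speedup: {geometric_mean(speedups):.3f}x")
```

The geometric mean is the usual choice for ratios like these because a single outlier benchmark skews it less than it skews the arithmetic mean.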
I do think security is extremely important, but I'm not convinced that things are currently so terrible that this is the only way forward, as the authors seemed to imply.
OTOH, I would enjoy seeing a return of an Itanium-style ISA that moves a lot of speculation from the hardware to the compiler. I think compilers are in a much better place now than they were when Itanium hit the scene; their immaturity back then did not help Itanium's problems.
In many ways we've seen that return to dumber hardware + smarter compilers in the GPGPU realm, although even there the hardware continues to get more capable over time.
Those applications tend to work, though, because the toolchain either generates fat binaries that support multiple architecture versions (e.g. CUDA), ships some sort of IR, or compiles at runtime (e.g. OpenCL). That doesn't really work if you want a single binary that performs well across a wide range of hardware versions, which matters to users asking "how will application X work on future hardware Y" and really gets in the way of general-purpose use.
That's really the great advantage of putting more smarts in the hardware - you can evolve the processor design (often to improve performance) while executing the same binaries.
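To make the fat-binary vs. IR trade-off concrete, here is a deliberately simplified sketch of how a loader might pick code for a device. The function name, the integer architecture versions, and the exact-match rule are all hypothetical; real toolchains such as CUDA have more nuanced compatibility rules.

```python
def pick_variant(bundled_archs, device_arch, has_ir=False):
    """Choose how to run on device_arch given the precompiled variants in the binary.

    bundled_archs: set of architecture versions with precompiled native code.
    has_ir: whether the binary also carries an IR that can be compiled at runtime.
    """
    if device_arch in bundled_archs:
        # A precompiled variant matches this hardware exactly.
        return ("native", device_arch)
    if has_ir:
        # Unknown (e.g. future) hardware: fall back to compiling the IR at runtime.
        return ("jit", device_arch)
    raise RuntimeError(f"no code available for architecture {device_arch}")


if __name__ == "__main__":
    print(pick_variant({70, 80}, 80))
    print(pick_variant({70, 80}, 90, has_ir=True))
```

In this toy version, a binary without IR simply fails on hardware that didn't exist when it was built, which is exactly the "future hardware Y" problem; smarter hardware sidesteps it by running the old binary as-is.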
https://arxiv.org/abs/2007.15919
This paper adds (more or less) a really big branch delay slot to fill the CPU with work while it's waiting for the branch to resolve.
Impossibility of Spectre-like attacks is a neat side-effect.