I could have sworn that PPro used 2-bit saturating counters, but I checked and: "The P6 rejects the commonly used Smith algorithm, which maintains four states using two bits, in favor of the more recent Yeh method[1]. This adaptive algorithm uses four bits of branch history and can recognize and predict repeatable sequences of branches, for example, taken–taken–not taken."
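For anyone curious what the difference buys you, here's a toy C simulation (not the real P6 design; table sizes and names are made up) of a plain 2-bit saturating counter vs. a minimal two-level predictor keyed on 4 bits of per-branch history. On a repeating taken-taken-not-taken pattern the counter mispredicts every third branch forever, while the history-indexed one learns the pattern after a short warmup:

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy comparison, not the actual P6 implementation: a classic 2-bit
       saturating ("Smith") counter vs. a minimal two-level predictor that
       indexes a table of 2-bit counters with 4 bits of per-branch history. */

    static int smith_state = 2;                  /* start weakly taken */

    static bool smith_predict(void) { return smith_state >= 2; }

    static void smith_update(bool taken) {
        if (taken) { if (smith_state < 3) smith_state++; }
        else       { if (smith_state > 0) smith_state--; }
    }

    static unsigned history = 0;   /* last 4 outcomes, newest in bit 0 */
    static int pht[16];            /* pattern table of 2-bit counters */

    static bool twolevel_predict(void) { return pht[history & 0xFu] >= 2; }

    static void twolevel_update(bool taken) {
        int *c = &pht[history & 0xFu];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = ((history << 1) | (taken ? 1u : 0u)) & 0xFu;
    }

    int main(void) {
        bool pattern[] = { true, true, false };  /* taken, taken, not-taken */
        int smith_wrong = 0, twolevel_wrong = 0, n = 3000;

        for (int i = 0; i < 16; i++) pht[i] = 2; /* all weakly taken */

        for (int i = 0; i < n; i++) {
            bool outcome = pattern[i % 3];
            if (smith_predict()    != outcome) smith_wrong++;
            if (twolevel_predict() != outcome) twolevel_wrong++;
            smith_update(outcome);
            twolevel_update(outcome);
        }
        /* The 2-bit counter mispredicts the not-taken of every period
           (~1000 of 3000); the two-level predictor learns the pattern. */
        printf("2-bit counter mispredicts: %d / %d\n", smith_wrong, n);
        printf("two-level     mispredicts: %d / %d\n", twolevel_wrong, n);
        return 0;
    }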
One other thing I wondered about is whether there have been improvements to the recovery of mispredicted branches that have not yet been retired. It seems like the huge OoO windows of the latest Apple and Intel CPUs would see diminishing returns otherwise. Does anyone know?
Misprediction recovery is an important area of optimization. Modern cores flush and begin refetching as soon as possible: rather than waiting until the branch retires, they do it when the branch executes, including out of order (a younger branch can trigger a flush before an older one resolves, and can later itself be flushed by that older one).
There are also several levels of predictor, and a mistake in the low-latency but less accurate ones can cause a recovery event if it is overridden a few cycles later by one of the slower, more accurate predictors.
Flush/recovery is made as lightweight as possible: only the instructions after the branch are flushed, and only from the parts of the pipeline that the oldest of them has reached.
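To make the "only flush younger stuff" point concrete, here's a tiny toy sketch in C; the structures and names are made up and no real core looks like this. In-flight µops carry a program-order sequence number, and when a branch resolves as mispredicted, only the entries younger than that branch get squashed:

    #include <stdio.h>

    /* Toy sketch of selective flush (illustrative structures only). */

    enum { WINDOW = 8 };

    struct uop {
        int  seq;        /* program-order age */
        int  valid;      /* still in flight? */
        char desc[16];
    };

    static void flush_younger(struct uop win[], int n, int branch_seq) {
        for (int i = 0; i < n; i++)
            if (win[i].valid && win[i].seq > branch_seq)
                win[i].valid = 0;            /* squash wrong-path work only */
    }

    int main(void) {
        struct uop win[WINDOW] = {
            {1, 1, "load  r1"}, {2, 1, "add   r2"}, {3, 1, "br    r2"},
            {4, 1, "mul   r3"}, {5, 1, "store r3"}, {6, 1, "sub   r4"},
        };

        /* The branch (seq 3) executes and turns out mispredicted long
           before it retires; older work (seq 1-2) is left untouched. */
        flush_younger(win, WINDOW, 3);

        for (int i = 0; i < WINDOW; i++)
            if (win[i].seq)
                printf("seq %d  %-9s %s\n", win[i].seq, win[i].desc,
                       win[i].valid ? "kept" : "flushed");
        return 0;
    }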
There is some literature on throttling speculation when confidence is low. Say you have a really poorly predicted indirect branch, where the odds of a mispredict are not 50/50 but more like 99/1, and it depends on a very slow source: then you might not want to speculate 500 instructions past it, but instead say okay, let's just wait for it to resolve. Or, if you have a 50/50, hard-to-predict branch, you might kick off some prefetching down the other path when you see it so that recovery is faster (at the cost of more power). I don't know if anybody does anything with these "hard to predict branches" today. Apparently it's actually quite difficult to predict whether a branch is hard to predict, and the resources needed for that prediction can be put to better use just trying to predict such branches, instead of predicting that you can't predict them :) Although that might change. The point is that all kinds of misprediction optimizations are actively being pursued.
Yeah, these days it's likely that only the frontend stalls during misprediction recovery. Already-decoded µops from before the branch still make their way through the backend as normal. If you have enough of those (long dependency chains, or the branch was reordered early enough), it's possible for the misprediction to have effectively no penalty.
As a corollary, this means that where possible you want the dependency chain feeding a branch condition to be as short and as independent of the surrounding work as possible, so the CPU can reorder the branch early and resolve it early.
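A small illustrative C example of that (function names are made up): in the first loop the exit branch depends on the slow floating-point accumulation chain, so it can only resolve once that chain catches up; in the second, the branch depends only on the counter, so the core can resolve and verify it many iterations ahead of the FP work.

    #include <math.h>
    #include <stddef.h>

    /* Branch condition tied to the slow data-dependent chain. */
    double process_until_threshold(const double *a, size_t n, double limit) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++) {
            acc += sqrt(a[i]);
            if (acc > limit)        /* branch waits on the accumulation */
                break;
        }
        return acc;
    }

    /* Branch condition depends only on i, independent of the loads/FP. */
    double process_fixed_count(const double *a, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)  /* counter resolves early */
            acc += sqrt(a[i]);
        return acc;
    }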