I could have sworn that PPro used 2-bit saturating counters, but I checked and: "The P6 rejects the commonly used Smith algorithm, which maintains four states using two bits, in favor of the more recent Yeh method[1]. This adaptive algorithm uses four bits of branch history and can recognize and predict repeatable sequences of branches, for example, taken–taken–not taken."
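For anyone curious what the difference buys you, here's a toy C simulation (not the real P6 design; table sizes and names are made up) of a plain 2-bit saturating counter vs. a minimal two-level predictor keyed on 4 bits of per-branch history. On a repeating taken-taken-not-taken pattern the counter mispredicts every third branch forever, while the history-indexed one learns the pattern after a short warmup:

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy comparison, not the actual P6 implementation: a classic 2-bit
       saturating ("Smith") counter vs. a minimal two-level predictor that
       indexes a table of 2-bit counters with 4 bits of per-branch history. */

    static int smith_state = 2;                  /* start weakly taken */

    static bool smith_predict(void) { return smith_state >= 2; }

    static void smith_update(bool taken) {
        if (taken) { if (smith_state < 3) smith_state++; }
        else       { if (smith_state > 0) smith_state--; }
    }

    static unsigned history = 0;   /* last 4 outcomes, newest in bit 0 */
    static int pht[16];            /* pattern table of 2-bit counters */

    static bool twolevel_predict(void) { return pht[history & 0xFu] >= 2; }

    static void twolevel_update(bool taken) {
        int *c = &pht[history & 0xFu];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = ((history << 1) | (taken ? 1u : 0u)) & 0xFu;
    }

    int main(void) {
        bool pattern[] = { true, true, false };  /* taken, taken, not-taken */
        int smith_wrong = 0, twolevel_wrong = 0, n = 3000;

        for (int i = 0; i < 16; i++) pht[i] = 2; /* all weakly taken */

        for (int i = 0; i < n; i++) {
            bool outcome = pattern[i % 3];
            if (smith_predict()    != outcome) smith_wrong++;
            if (twolevel_predict() != outcome) twolevel_wrong++;
            smith_update(outcome);
            twolevel_update(outcome);
        }
        /* The 2-bit counter mispredicts the not-taken of every period
           (~1000 of 3000); the two-level predictor learns the pattern. */
        printf("2-bit counter mispredicts: %d / %d\n", smith_wrong, n);
        printf("two-level     mispredicts: %d / %d\n", twolevel_wrong, n);
        return 0;
    }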
One other thing I wondered about is whether there have been improvements to the recovery of mispredicted branches that have not yet been retired. It seems like the huge OoO windows of the latest Apple and Intel CPUs would see diminishing returns otherwise. Does anyone know?
Misprediction recovery is an important area of optimization. Modern cores flush and begin refetching as soon as possible: rather than waiting until the branch retires, they do it when the branch executes, including out of order (a younger branch can trigger a flush before an older one resolves, and can later itself be flushed by that older one).
There are also several levels of predictor, and a mistake in the low-latency but less accurate ones can cause a recovery event if it is overridden a few cycles later by one of the slower, more accurate predictors.
Flush/recovery is made as lightweight as possible: only the instructions after the branch are flushed, and only from the parts of the pipeline that the oldest of them has reached.
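To make the "only flush younger stuff" point concrete, here's a tiny toy sketch in C; the structures and names are made up and no real core looks like this. In-flight µops carry a program-order sequence number, and when a branch resolves as mispredicted, only the entries younger than that branch get squashed:

    #include <stdio.h>

    /* Toy sketch of selective flush (illustrative structures only). */

    enum { WINDOW = 8 };

    struct uop {
        int  seq;        /* program-order age */
        int  valid;      /* still in flight? */
        char desc[16];
    };

    static void flush_younger(struct uop win[], int n, int branch_seq) {
        for (int i = 0; i < n; i++)
            if (win[i].valid && win[i].seq > branch_seq)
                win[i].valid = 0;            /* squash wrong-path work only */
    }

    int main(void) {
        struct uop win[WINDOW] = {
            {1, 1, "load  r1"}, {2, 1, "add   r2"}, {3, 1, "br    r2"},
            {4, 1, "mul   r3"}, {5, 1, "store r3"}, {6, 1, "sub   r4"},
        };

        /* The branch (seq 3) executes and turns out mispredicted long
           before it retires; older work (seq 1-2) is left untouched. */
        flush_younger(win, WINDOW, 3);

        for (int i = 0; i < WINDOW; i++)
            if (win[i].seq)
                printf("seq %d  %-9s %s\n", win[i].seq, win[i].desc,
                       win[i].valid ? "kept" : "flushed");
        return 0;
    }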
There is some literature on throttling speculation when confidence is low. Say you have a really poorly predicted indirect branch, where the odds of a mispredict are not 50/50 but more like 99/1, and it depends on a very slow source: then you might not want to speculate 500 instructions past it, but instead say okay, let's just wait for it to resolve. Or, if you have a 50/50, hard-to-predict branch, you might kick off some prefetching down the other path when you see it so that recovery is faster (at the cost of more power). I don't know if anybody does anything with these "hard to predict branches" today. Apparently it's actually quite difficult to predict whether a branch is hard to predict, and the resources needed for that prediction can be put to better use just trying to predict such branches, instead of predicting that you can't predict them :) Although that might change. The point is that all kinds of misprediction optimizations are actively being pursued.
Yeah, these days it's likely that only the frontend stalls during misprediction recovery. Already-decoded µops from before the branch still make their way through the backend as normal. If you have enough of those (long dependency chains, or the branch was reordered early enough), it's possible for the misprediction to have effectively no penalty.
As a corollary, this means that where possible you want the dependency chain feeding a branch condition to be as short and as independent of the surrounding work as possible, so the CPU can reorder the branch early and resolve it early.
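A small illustrative C example of that (function names are made up): in the first loop the exit branch depends on the slow floating-point accumulation chain, so it can only resolve once that chain catches up; in the second, the branch depends only on the counter, so the core can resolve and verify it many iterations ahead of the FP work.

    #include <math.h>
    #include <stddef.h>

    /* Branch condition tied to the slow data-dependent chain. */
    double process_until_threshold(const double *a, size_t n, double limit) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++) {
            acc += sqrt(a[i]);
            if (acc > limit)        /* branch waits on the accumulation */
                break;
        }
        return acc;
    }

    /* Branch condition depends only on i, independent of the loads/FP. */
    double process_fixed_count(const double *a, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)  /* counter resolves early */
            acc += sqrt(a[i]);
        return acc;
    }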