Store-to-Load Forwarding and Memory Disambiguation in x86 Processors

rayiner · on June 15, 2014

Dave Kanter at RWT has published a few articles that go into more depth on memory disambiguation in Merom (Core 2+) and Haswell processors: http://www.realworldtech.com/merom/7; http://www.realworldtech.com/haswell-tm-alt (the latter in the context of how the traditional memory order buffer had to be updated to support transactional memory).

This is a really interesting area of modern OOO processor design. Every load has to probe every entry of the store buffer to see whether there is an earlier store to the same address that hasn't hit the cache yet. A bigger store buffer lets you keep more stores in flight without waiting for the cache, but it also needs a bigger, more power-hungry CAM. That structure tends to be a major point of contention when trading off increased memory parallelism against the cycle time and power usage of the design. Structures for predicting whether a load will conflict with an earlier store whose address isn't known yet use even more power.
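To make the store buffer probe concrete, here is a minimal Python sketch of the lookup a load performs. It is a toy model, not how any particular core implements it: the names StoreBuffer, StoreEntry, load and drain_oldest are made up for illustration, addresses are whole words (no partial overlaps), and the policy on a store with an unknown address is the conservative one (stall the load) rather than a predictor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    addr: Optional[int]  # None means the store's address is not computed yet
    value: int


class StoreBuffer:
    """Toy store buffer: entries are kept oldest first."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = []

    def store(self, addr, value):
        # A full buffer means the pipeline would have to stall the store.
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(StoreEntry(addr, value))
        return True

    def load(self, addr, memory):
        # Scan from youngest to oldest so the most recent matching store wins.
        # Hardware does these comparisons in parallel with a CAM.
        for entry in reversed(self.entries):
            if entry.addr is None:
                # Unknown address: it might alias, so be conservative.
                return ("stall", None)
            if entry.addr == addr:
                return ("forward", entry.value)
        # No pending store to this address: read the cache/memory.
        return ("cache", memory.get(addr, 0))

    def drain_oldest(self, memory):
        # Retire the oldest store to memory once its address is known.
        if self.entries and self.entries[0].addr is not None:
            entry = self.entries.pop(0)
            memory[entry.addr] = entry.value
```

The loop is sequential only because this is software. In hardware every entry needs its own comparator firing on every load, which is why capacity translates so directly into area and power.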

See this discussion of why Silvermont (the OOO Atom core in Bay Trail) avoids memory disambiguation by simply stalling on stores with unknown addresses, in order to save power: http://www.realworldtech.com/silvermont/7.

To avoid expensive memory disambiguation hardware, Itanium punts on the problem entirely and uses a software-visible structure called the ALAT (Advanced Load Address Table): http://www.realworldtech.com/poulson/6.
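Roughly, the ALAT lets the compiler hoist a load above a store that might alias it, then check afterwards whether that was safe. A minimal Python sketch of the idea follows. The class and method names are hypothetical, and it assumes an unbounded table with exact whole-word address matching, whereas the real structure is small and has its own matching and replacement rules.

```python
class ALAT:
    """Toy model of an advanced-load table: register -> address loaded from."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, memory):
        # Like ld.a: do the load early and remember which address it read.
        self.entries[reg] = addr
        return memory.get(addr, 0)

    def store(self, addr, value, memory):
        # Every store invalidates any entry that recorded the same address.
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check_load(self, reg, addr, memory, speculative_value):
        # Like ld.c: if the entry survived, the early value is still good;
        # otherwise an intervening store hit the address, so reload.
        if reg in self.entries:
            return speculative_value
        return memory.get(addr, 0)


memory = {0x100: 5}
alat = ALAT()
early = alat.advanced_load("r1", 0x100, memory)  # hoisted above the stores
alat.store(0x200, 9, memory)                     # different address: entry survives
value = alat.check_load("r1", 0x100, memory, early)
```

The hardware only has to track the loads the compiler explicitly marked, instead of checking every load against every in-flight store.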