Wait, I'm pretty sure the operand collection logic and banking are there to keep the number of ports on the SRAM low: you're arbitrating and buffering requests coming from instructions with high register counts (say a 3-input FMA) and potentially multiple pipelined SMT threads (not "thread" in the Nvidia sense, but thread in the "a whole wavefront/warp" sense).
However, that seems completely orthogonal to the vector lanes: I don't see why two parallel lanes in a single thread (e.g. a 64-element GCN wavefront) would need cross-connected logic at the register file, since almost all instructions do _not_ read/write data from another lane.
There are a few cross-lane shuffle/reduce instructions, but it seems to me that those would be handled by a dedicated execution unit (they are not really the fast path / common case).
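To make the cross-lane reduce pattern concrete, here is a small software model (a sketch, not real ISA semantics) of the XOR/butterfly exchange that shuffle-based reductions use — the pattern behind Nvidia's `__shfl_xor_sync` or AMD's DPP/swizzle operations. Each round, every lane reads the value of the lane whose index differs in one bit; after log2(width) rounds, every lane holds the full sum. The function name and list-based representation are illustrative only.

```python
def butterfly_reduce(lanes):
    """Reduce across a power-of-two-wide wavefront; lanes[i] models lane i's register."""
    width = len(lanes)
    vals = list(lanes)
    offset = 1
    while offset < width:
        # One "instruction": every lane adds its XOR-partner's value, in parallel.
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset <<= 1
    return vals

# 64-element GCN-style wavefront: 6 rounds, then every lane holds sum(0..63) = 2016.
wave = list(range(64))
result = butterfly_reduce(wave)
assert all(v == 2016 for v in result)
```

Note that each round only ever exchanges data between lane pairs at a fixed XOR distance, which is why this pattern maps onto limited lane-interconnect hardware rather than requiring a full crossbar.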
> There are a few cross-lane shuffle/reduce instructions, but it seems to me that those would be handled by a dedicated execution unit (they are not really the fast path / common case).
Yes, you essentially need a (kind of) crossbar for shuffle and value broadcast.
But as far as I know there is no unit dedicated to this on Nvidia GPUs.
However, depending on the GPU microarchitecture, shuffle and broadcast may be implemented differently (e.g. through the load/store units).
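A sketch of how the load/store-unit route could work (assumed mechanism, not a documented implementation): every lane stores its register to a scratchpad slot indexed by its own lane id, then loads back from an arbitrary lane's slot — functionally a full shuffle or broadcast without a dedicated crossbar, at the cost of a memory round trip. The function name and dict-based scratchpad are illustrative.

```python
def lds_shuffle(lanes, src_index):
    """Shuffle via a shared scratchpad: lane i ends up with lanes[src_index[i]].

    Models a two-instruction sequence: every lane writes its value to
    shared memory at its own lane id, then reads back from src_index[i].
    """
    scratch = {i: v for i, v in enumerate(lanes)}               # parallel store
    return [scratch[src_index[i]] for i in range(len(lanes))]  # parallel load

# Broadcast lane 0's value to all 64 lanes of a wavefront:
wave = [10 * i for i in range(64)]
assert lds_shuffle(wave, [0] * 64) == [0] * 64
```

This is why the shared-memory datapath is a plausible home for shuffle/broadcast: it already has the lane-to-bank routing that a general permutation needs.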
Note that I said "crossbar" for simplicity and because there is little public information; I doubt that all the paths really exist.