Wait, I'm pretty sure the operand collection logic and banking are there to keep the number of ports on the SRAM low: you're arbitrating and buffering requests coming from instructions with high register counts (say a 3-input FMA) and potentially multiple pipelined SMT threads (not "thread" in the Nvidia sense, but thread in the "a whole wavefront/warp" sense).
However, that seems completely orthogonal to the vector lanes: I don't see why two parallel lanes in a single thread (e.g. a 64-element GCN wavefront) would need cross-connected logic at the register file, since almost all instructions do _not_ read/write data from another lane.
There are a few cross-lane shuffle/reduce instructions, but it seems to me that those would be handled by a dedicated execution unit (they are not really the fast path / common case).
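To make the cross-lane reduce pattern concrete, here is a small software model (a sketch, not real ISA semantics) of the XOR/butterfly exchange that shuffle-based reductions use — the pattern behind Nvidia's `__shfl_xor_sync` or AMD's DPP/swizzle operations. Each round, every lane reads the value of the lane whose index differs in one bit; after log2(width) rounds, every lane holds the full sum. The function name and list-based representation are illustrative only.

```python
def butterfly_reduce(lanes):
    """Reduce across a power-of-two-wide wavefront; lanes[i] models lane i's register."""
    width = len(lanes)
    vals = list(lanes)
    offset = 1
    while offset < width:
        # One "instruction": every lane adds its XOR-partner's value, in parallel.
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset <<= 1
    return vals

# 64-element GCN-style wavefront: 6 rounds, then every lane holds sum(0..63) = 2016.
wave = list(range(64))
result = butterfly_reduce(wave)
assert all(v == 2016 for v in result)
```

Note that each round only ever exchanges data between lane pairs at a fixed XOR distance, which is why this pattern maps onto limited lane-interconnect hardware rather than requiring a full crossbar.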
> There are a few cross-lane shuffle/reduce instructions, but it seems to me that those would be handled by a dedicated execution unit (they are not really the fast path / common case).
Yes, you essentially need a (kind of) crossbar for shuffle and value broadcast.
But as far as I know there is no unit dedicated to this on Nvidia GPUs.
However, depending on the GPU microarchitecture, shuffle and broadcast may be implemented differently (e.g. through the load/store units).
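A sketch of how the load/store-unit route could work (assumed mechanism, not a documented implementation): every lane stores its register to a scratchpad slot indexed by its own lane id, then loads back from an arbitrary lane's slot — functionally a full shuffle or broadcast without a dedicated crossbar, at the cost of a memory round trip. The function name and dict-based scratchpad are illustrative.

```python
def lds_shuffle(lanes, src_index):
    """Shuffle via a shared scratchpad: lane i ends up with lanes[src_index[i]].

    Models a two-instruction sequence: every lane writes its value to
    shared memory at its own lane id, then reads back from src_index[i].
    """
    scratch = {i: v for i, v in enumerate(lanes)}               # parallel store
    return [scratch[src_index[i]] for i in range(len(lanes))]  # parallel load

# Broadcast lane 0's value to all 64 lanes of a wavefront:
wave = [10 * i for i in range(64)]
assert lds_shuffle(wave, [0] * 64) == [0] * 64
```

This is why the shared-memory datapath is a plausible home for shuffle/broadcast: it already has the lane-to-bank routing that a general permutation needs.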
Note that I said "crossbar" for simplicity and because there is little public information; I doubt that all the paths really exist.