
Wait, I'm pretty sure operand collection logic and banking are there to keep the number of ports on the SRAM low: you're arbitrating and buffering requests coming from instructions with many register operands (say, a 3-input FMA) and potentially from multiple pipelined SMT threads (not threads in the Nvidia sense, but threads in the "whole wavefront/warp" sense).

However, to me that seems completely orthogonal to the vector lanes: I don't see why two parallel lanes in a single thread (e.g. a 64-element GCN wavefront) would need cross-connected logic at the register file, since almost all instructions _do not_ read/write data from another lane.

There are a few cross-lane shuffle / reduce instructions, but it seems to me that those would be handled in a dedicated execution unit (they are not really the fast-path/common case).
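
To make the banking argument above concrete, here is a toy model of operand collection, assuming single-ported banks and a simple modulo register-to-bank mapping. The bank count, the `bank_of` mapping, and the function names are all hypothetical and chosen for illustration; real hardware mappings are not publicly documented in this detail.

```python
from collections import Counter

NUM_BANKS = 4  # assumed bank count, purely illustrative


def bank_of(reg: int, warp: int = 0) -> int:
    """Hypothetical mapping: interleave registers across banks, skewed per warp."""
    return (reg + warp) % NUM_BANKS


def collect_cycles(requests):
    """Cycles needed to read all operands when each bank has one read port.

    `requests` is a list of (warp, reg) pairs. Each bank serves one request
    per cycle, so the busiest bank determines the total.
    """
    load = Counter(bank_of(reg, warp) for warp, reg in requests)
    return max(load.values(), default=0)


# fma with sources r0, r1, r2: three different banks, collected in one cycle
no_conflict = collect_cycles([(0, 0), (0, 1), (0, 2)])

# fma with sources r0, r4, r8: all three map to bank 0, so reads serialize
conflict = collect_cycles([(0, 0), (0, 4), (0, 8)])
```

Note that the lane index never appears in this sketch: the arbitration is between operands and warps, which is consistent with the point that banking is orthogonal to the vector lanes.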



> There are a few cross-lane shuffle / reduce instructions, but it seems to me that those would be handled in a dedicated execution unit (they are not really the fast-path/common case).

Yes, you essentially need a (kind of) crossbar for shuffle and value broadcast. But as far as I know, there is no unit dedicated to this on Nvidia GPUs. However, depending on the GPU microarchitecture, shuffle and broadcast may be implemented differently (e.g. through the load/store units).

Note that I said "crossbar" for simplicity and because there is little information available; I doubt that all the paths really exist.
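
As a rough illustration of what shuffle and broadcast require functionally, the sketch below models a warp as a Python list and a shuffle as "each lane reads from a chosen source lane". This is a behavioral model only, assuming a 32-lane warp; the function names are made up for the example and it says nothing about how any real GPU routes the data.

```python
WARP_SIZE = 32  # assumed warp width


def shuffle(values, src_lane):
    """Lane i receives the value held by lane src_lane[i]."""
    return [values[src_lane[i] % WARP_SIZE] for i in range(WARP_SIZE)]


def broadcast(values, lane):
    """Every lane reads the same source lane."""
    return shuffle(values, [lane] * WARP_SIZE)


def butterfly_reduce(values):
    """Sum across lanes using XOR-partner shuffles (log2(WARP_SIZE) steps)."""
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset:
        # Each lane exchanges with the lane whose index differs in one bit.
        partner = shuffle(vals, [i ^ offset for i in range(WARP_SIZE)])
        vals = [a + b for a, b in zip(vals, partner)]
        offset //= 2
    return vals


lanes = list(range(WARP_SIZE))
everyone_sees_lane_5 = broadcast(lanes, 5)
totals = butterfly_reduce(lanes)  # every lane ends up with the full sum
```

The general `shuffle` needs any-to-any connectivity, which is the "crossbar" in the abstract sense. The reduction, by contrast, only ever uses five fixed permutations, so a common pattern like this does not by itself require every path of a full crossbar to exist.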



