That's totally nuts. How do they deal with the silicon warping around disabled cores or dark silicon? How long can the chip run hard before it gets fatally damaged and needs to be replaced in their system? Word on the street is that H100s fail surprisingly often; this can't be better.
This video from Cerebras perfectly explains how they solve the interconnect problem, and why their approach greatly reduces the risk of Blackwell-type hardware design challenges.
edit: I should have just checked their website instead of guessing. Apparently WSE has significant fabrication challenges, which makes what Cerebras has accomplished all the more impressive. But it is still surprising that no one else has attempted this in the HPC field.
I had guessed that Cerebras had made some trade-offs in process in order to make it work at scale, but then they aren't actually building these devices at scale (yet).
It shouldn't be surprising. It's hard as fuck, and we don't know if it's worth it yet (there's something to be said for "if your compute dies you send your remote hands to swap out a high-four-figure component" and not "decommission a high-six-figure node").
I hope I didn't make it sound like it was easy; at least I don't think I said that anywhere. It doesn't really matter how hard something is to do (short of it being trivially proven impossible). What matters is whether there's a good enough chance that the payoff exceeds the cost.
And actually there have been attempts to do it. I mentioned in an earlier version of my comment that Gene Amdahl had attempted to make wafer-scale integration work back in the early 1980s, without success, but also without the clear profitability story of AI to attract the same mountains of cash being thrown around today.
What's surprising is not that it is hard, or even that it's hard as fuck, but that given the potentially stratospheric rewards for success there have not been more attempts in this direction.
https://cerebras.ai/product-chip/