
Cerebras' Wafer Scale Engine is the opposite of small and cheap.

https://cerebras.ai/product-chip/



That's totally nuts. How do they deal with the silicon warping around disabled cores or dark silicon? How much hard running does it take before the chip is fatally damaged and needs to be replaced in their system? Word on the street is that H100s fail surprisingly often; this can't be better.


https://www.youtube.com/watch?v=7GV_OdqzmIU&t=1104s

This video from Cerebras explains nicely how they solve the interconnect problem, and why their approach greatly reduces the risk of Blackwell-type hardware design challenges.


Super informative video!


The way it's mounted, there's unlikely to be warping: https://web.archive.org/web/20230812020202/https://www.youtu...

The cooling is significantly better than what you'd see on a typical server platform, with water-cooling channels going to each row of the wafer.


They have a way of bypassing bad cores, and they over-provision both logic and memory by 1.5% to account for them. They get 100% yield this way.

https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
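For intuition on why ~1.5% of spares is enough for effectively perfect yield, here's a back-of-the-envelope binomial model in Python. The core count and per-core defect rate are made-up illustrative numbers, not Cerebras' actual figures:

    # Toy yield model: the wafer "yields" if the number of defective cores
    # does not exceed the spares that can be routed around. Computed in log
    # space because the raw binomial terms over/underflow ordinary floats.
    from math import lgamma, log, exp

    def wafer_yield(n_cores, spare_fraction, p_defect):
        """P(defective cores <= spares) under an independent-defect model."""
        spares = int(n_cores * spare_fraction)
        def log_pmf(k):
            return (lgamma(n_cores + 1) - lgamma(k + 1) - lgamma(n_cores - k + 1)
                    + k * log(p_defect) + (n_cores - k) * log(1 - p_defect))
        return sum(exp(log_pmf(k)) for k in range(spares + 1))

    # Hypothetical: 900k cores, 1.5% spares, 1% of cores defective on average.
    print(wafer_yield(900_000, 0.015, 0.01))   # ~1.0 -- spares absorb defects
    # With zero spares, a defect-free wafer is essentially impossible:
    print(0.99 ** 900_000)                     # underflows to 0.0

The intuition: the expected defect count (~9,000 here) sits dozens of standard deviations below the ~13,500 spares, so the survival probability rounds to 1.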


Yes, that's exactly what would cause the problems I'm suggesting.


edit: I should have just checked their website instead of guessing. Apparently WSE has significant fabrication challenges, which makes what Cerebras has accomplished all the more impressive. But it is still surprising that no one else has attempted this in the HPC field.

I had guessed that Cerebras had made some process trade-offs in order to make it work at scale, but then again, they aren't actually building these devices at scale (yet).


They're using TSMC 5-nm for WSE-3: https://spectrum.ieee.org/cerebras-chip-cs3


Cerebras is known to use TSMC, so your speculation about a boutique fab is incorrect.

https://cerebras.ai/press-release/cerebras-systems-smashes-t...


It shouldn't be surprising. It's hard as fuck, and we don't know if it's worth it yet. (There's something to be said for "if your compute dies, you send your remote hands to swap out a high-four-figure component" rather than "decommission a high-six-figure node".)
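To make the blast-radius argument concrete, a quick Python sketch with entirely made-up prices and failure rates:

    # Illustrative failure-economics math; every number here is a guess.
    afr = 0.05                    # assumed annualized failure rate per unit

    gpu_price = 9_000             # "high four figures" per accelerator
    gpus_per_cluster = 64         # hypothetical cluster matching one wafer node
    wafer_node_price = 900_000    # "high six figures" per wafer-scale node

    # Expected yearly replacement spend, assuming independent failures:
    gpu_spend = gpus_per_cluster * afr * gpu_price   # swap one card at a time
    wafer_spend = afr * wafer_node_price             # lose the whole node

    print(f"GPU cluster:      ${gpu_spend:,.0f}/yr")    # $28,800/yr
    print(f"Wafer-scale node: ${wafer_spend:,.0f}/yr")  # $45,000/yr

The real comparison is muddier (on-wafer redundancy should lower the wafer's effective failure rate, and compute-per-dollar differs), but it shows why the granularity of the replaceable unit matters.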


I hope I didn't make it sound like it was easy; at least I don't think I said that anywhere. It doesn't really matter how hard something is to do (short of it being trivially proven impossible); what matters is whether there's a good enough chance that the payoff exceeds the cost.

And actually there have been attempts to do it. I mentioned in an earlier version of my comment that Gene Amdahl attempted to make wafer-scale integration work back in the 1980s with Trilogy Systems, without success - but also without the clear profitability story of AI to attract the same mountains of cash being thrown around today.

What's surprising is not that it is hard, or that it's hard as fuck, but that given the potentially stratospheric rewards for success there have not been more attempts in this direction.


What makes you think you couldn't do as well on something less radical?

IIRC Cerebras' design was originally for HPC workloads, so even it may not be optimized for LLMs.


Is this all mostly a heat spreader efficiency requirement?


I suppose the proper term would be the heat spreader's thermal resistance limit, dictated by the sum of the thermal stresses, or whatever the right term is.
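For a rough sense of scale, the usual lumped model is T_junction = T_coolant + P * R_th. Plugging in guessed numbers (assumptions, not Cerebras specs) shows how tight the thermal budget gets at wafer power levels:

    # Toy junction-temperature budget; all values are assumptions for scale.
    power_w = 15_000        # hypothetical whole-wafer power draw
    t_coolant_c = 25.0      # assumed inlet water temperature
    t_max_c = 105.0         # typical silicon junction limit

    # Largest junction-to-coolant thermal resistance the wafer can tolerate:
    r_th_max = (t_max_c - t_coolant_c) / power_w    # K/W
    print(f"R_th must stay under {r_th_max * 1000:.2f} mK/W")   # ~5.33 mK/W

For comparison, a decent consumer CPU cooler is roughly on the order of 100 mK/W, which is why per-row water channels across the wafer matter.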


I updated my comment to be clearer. I meant smaller and/or cheaper per token. In this case it's cheaper per token.


Maybe the real trend is that huge parameter counts are curiosities.



