Wow, I wish we could post pictures to HN. That chip is HUGE!!!!
The WSE-3 is the largest AI chip ever built, measuring 46,255 mm² and containing 4 trillion transistors. It delivers 125 petaflops of AI compute through 900,000 AI-optimized cores — 19× more transistors and 28× more compute than the NVIDIA B200.
The problem is our primitive text representation online. The formatting should be localized but there’s not a number type I can easily insert inline in a text box.
It's a whole wafer. Basically all chips are made on wafers that big, but normally it's a lot of different chips, you cut the wafer into small chips and throw the bad ones away.
Cerebras has other ways of marking the defects so they don't affect things.
I was under the impression that chips at the top of the line often failed to be manufactured perfectly to spec, and that those with, say, a core that was a bit under spec, or which were missing a core entirely, would be downclocked or disabled and sold as the next chip down the line.
Is that not a thing anymore? Or would a chip like this maybe be so specialized that you'd use, say, a generation-earlier transistor width and thus have more certainty of a successful run?
Or does a chip this size just naturally ebb around 900,000 cores and that's not always the exact count?
20kwh! Wow! 900,000 cores. 125 petaflops of compute. Very neat
More or less, yes. Of course, defects are not evenly distributed, so you get a lot of chips with different grades of brokenness. Normally the more broken chips get sold off as lower-tier products. A six-core CPU is probably an eight-core with two broken cores.
Though in this case, it seems [1] that Cerebras just has so many small cores that they can expect a fairly consistent level of broken cores and route around them.
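That "consistent level of broken cores" intuition can be sketched with a toy defect model (my own illustration with made-up defect rates, not Cerebras' numbers): if defects land independently per core, a chip with a handful of big cores swings wildly in how much it loses, while a wafer with hundreds of thousands of tiny cores loses an almost fixed fraction every time.

```python
import random

def bad_core_fraction(num_cores, defect_prob, trials):
    """Simulate what fraction of cores get knocked out by random, independent defects."""
    fractions = []
    for _ in range(trials):
        bad = sum(random.random() < defect_prob for _ in range(num_cores))
        fractions.append(bad / num_cores)
    return min(fractions), max(fractions)

# 8 big cores: losing even one core is a 12.5% hit, so binning varies a lot.
lo8, hi8 = bad_core_fraction(num_cores=8, defect_prob=0.02, trials=1000)

# 900,000 tiny cores: the dead fraction is extremely consistent wafer to wafer.
lo_w, hi_w = bad_core_fraction(num_cores=900_000, defect_prob=0.02, trials=5)

print(f"8-core chips:     {lo8:.1%} .. {hi8:.1%} of cores dead")
print(f"900k-core wafers: {lo_w:.2%} .. {hi_w:.2%} of cores dead")
```

The small-core case spreads across bins (0%, 12.5%, 25%, ...), while the wafer-scale case clusters tightly around the defect rate, which is why Cerebras can just budget spare cores up front.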
There have been discussions about this chip here in the past. Maybe not that particular one but previous versions of it. The whole server, if I remember correctly, eats some 20 kW of power.
A first-gen Oxide Computer rack puts out max 15 kW of power, and they manage to do that with air cooling. The liquid-cooled AI racks being used today for training and inference workloads almost certainly have far higher power output than that.
(Bringing liquid cooling to the racks likely has to be one of the biggest challenges with this whole new HPC/AI datacenter infrastructure, so the fact that an aircooled rack can just sit in mostly any ordinary facility is a non-trivial advantage.)
> Bringing liquid cooling to the racks likely has to be one of the biggest challenges with this whole new HPC/AI
Are you sure about that? HPC has had full rack liquid cooling for a long time now.
The primary challenge with the current generation is the unusual increase in power density per rack. That necessitates capacity upgrades; notably, getting 10-20 kW of heat away from a few Us is generally tough, but if you can do it, it increases density.
Watt is a measure of power, that is a rate: Joule/second, [energy/time]
> The watt (symbol: W) is the unit of power or radiant flux in the International System of Units (SI), equal to 1 joule per second or 1 kg⋅m²⋅s⁻³. It is used to quantify the rate of energy transfer.
You would hope that an EV reporting x kWh/hour considers the charge curve when charging for an hour. Then it makes sense to report that instead of the peak kW rating. But reality is that they just report the peak kW rating as the "kWh/hour" :-(
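The difference between the peak kW rating and a charge-curve-aware "kWh per hour" is easy to see numerically (the taper curve below is entirely made up, not any real EV's):

```python
def charge_power_kw(minute, peak_kw=250.0):
    """Hypothetical charge curve: full peak power for 10 minutes,
    then a linear taper down to 20% of peak by the 60-minute mark."""
    if minute < 10:
        return peak_kw
    frac = (minute - 10) / 50
    return peak_kw * (1.0 - 0.8 * frac)

# Integrate power over 60 one-minute steps to get energy delivered in kWh.
energy_kwh = sum(charge_power_kw(m) for m in range(60)) / 60
print(f"peak: 250 kW, but energy over the hour: {energy_kwh:.0f} kWh")
```

So a charger that peaks at 250 kW delivers well under 250 kWh in an hour once the taper kicks in, which is exactly the gap between the honest figure and the marketing one.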
I asked because that's the average power consumption of an average household in the US per day. So, if that figure is per hour, that's equivalent to one household worth of power consumption per hour...which is a lot.
Others clarified the kW versus kWh, but to re-visit the comparison to a household:
One household uses about 30 kWh per day.
20 kW * 24 = 480 kWh per day for the server.
So you're looking at one server (if parent's 20kW number is accurate - I see other sources saying even 25kW) consuming 16 households worth of energy.
For comparison, a hair dryer draws around 1.5 kW of power, which is just below the rating of most US home electrical circuits. This is something like 13 hair dryers going on full blast.
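The back-of-envelope arithmetic above is easy to check (numbers are the rough figures from this thread, not measurements):

```python
server_kw = 20                    # thread's figure for the whole Cerebras server
server_kwh_per_day = server_kw * 24          # energy per day
household_kwh_per_day = 30                   # rough US household average

households = server_kwh_per_day / household_kwh_per_day
print(f"{server_kwh_per_day} kWh/day ≈ {households:.0f} households")

hair_dryer_kw = 1.5
print(f"or about {server_kw / hair_dryer_kw:.0f} hair dryers on full blast")
```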
At least with GPT-5.3-Codex-Spark, I gather most of the AI inference isn't rendering cat videos but mostly useful work.. so I don't feel tooo bad about 16 households worth of energy.
To be fair, this is 16 households of electrical energy. The average household uses about as much electrical energy as it uses energy in form of natural gas (or butane or fuel oil, depending on what they use). And then roughly as much gasoline as they use electricity. So really more like 5 households of energy. And that's just your direct energy use, not accounting for all the products including food consumed in the average household.
Consumption of a house per day is measured in kilowatt-hours (an amount of energy, like litres of water), not kilowatts (a rate of flow, like litres of water per second).
I think you are confusing kW (kilowatt) with kWh (kilowatt-hour).
A kW is a unit of power while a kWh is a unit of energy. Power is the rate at which energy is transferred, which is why an electronic device's consumption is rated in terms of power: it draws energy over time.
In terms of paying for electricity, you care about the total energy consumed, which is why your electric bill is denominated in kWh, the amount of energy used if you draw one kilowatt of power for one hour.
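The relationship is just multiplication; a tiny sketch (the $0.15/kWh rate is a made-up example, not any real tariff):

```python
def energy_cost(power_kw, hours, rate_per_kwh=0.15):
    """Energy (kWh) is power (kW) times time (h); the bill charges per kWh.
    The rate here is a hypothetical example, not a real tariff."""
    energy_kwh = power_kw * hours
    return energy_kwh, energy_kwh * rate_per_kwh

# Running a 1.5 kW hair dryer for 2 hours:
kwh, dollars = energy_cost(power_kw=1.5, hours=2)
print(f"{kwh} kWh, ${dollars:.2f}")
```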
It’s the chip they’re apparently running the model on.
> Codex-Spark runs on Cerebras’ Wafer Scale Engine 3 (opens in a new window)—a purpose-built AI accelerator for high-speed inference giving Codex a latency-first serving tier. We partnered with Cerebras to add this low-latency path to the same production serving stack as the rest of our fleet, so it works seamlessly across Codex and sets us up to support future models.
That's what it's running on. It's optimized for very high throughput using Cerebras' hardware which is uniquely capable of running LLMs at very, very high speeds.
It's a single wafer, not a single compute core. A familiar equivalent might be putting 192 cores in a single Epyc CPU (or, to be more technically accurate, the group of cores in a single CCD) rather than trying to interconnect 192 separate single-core CPUs externally with each other.
Those are scribe lines, where you would usually cut the wafer into individual chips, which is why it resembles multiple chips. However, Cerebras works with TSMC to etch interconnect across them.
>Wow, I wish we could post pictures to HN. That chip is HUGE!!!!
Using a wafer-sized chip doesn't sound great from a cost perspective compared to using many smaller chips for inference. Yield will be much lower and prices higher.
Nevertheless, the actual price might not be very high if Cerebras doesn't apply an Nvidia level tax.
That's an intentional trade-off in the name of latency. We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:
(A) Massively parallel (optimize for token/$)
(B) Serial low latency (optimize for token/s)
Users will switch between A and B depending on need.
An example of (A):
- "Search this 1M line codebase for DRY violations subject to $spec."
Examples of (B):
- "Diagnose this one specific bug."
- "Apply this diff."
(B) is used in funnels to unblock (A). (A) is optimized for cost and bandwidth, (B) is optimized for latency.
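The A/B split above could look something like this in a serving layer (a hypothetical sketch with made-up names and thresholds, not any vendor's actual router):

```python
def pick_tier(task):
    """Route latency-sensitive interactive work to tier B (token/s),
    bulk batch work to tier A (token/$). Thresholds are illustrative only."""
    if task["interactive"] and task["estimated_tokens"] < 10_000:
        return "B"  # serial low-latency serving
    return "A"      # massively parallel batch serving

print(pick_tier({"interactive": True, "estimated_tokens": 500}))        # diagnose a bug
print(pick_tier({"interactive": False, "estimated_tokens": 2_000_000})) # scan a codebase
```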
As I understand it the chip consists of a huge number of processing units, with a mesh network between them so to speak, and they can tolerate disabling a number of units by routing around them.
Speed will suffer, but it's not like a stuck pixel on an 8k display rendering the whole panel useless (to consumers).
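"Routing around" a dead unit on a mesh is just pathfinding with the bad tiles treated as walls. A toy model (not Cerebras' actual fabric or routing algorithm):

```python
from collections import deque

def route(grid_size, bad, start, goal):
    """BFS a shortest path on a 2-D mesh, treating defective tiles as walls.
    Toy model of defect tolerance, not the real interconnect."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (x, y), path = queue.popleft()
        if (x, y) == goal:
            return path
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < grid_size and 0 <= nxt[1] < grid_size
                    and nxt not in bad and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # a defect cluster severed the mesh

# One dead tile in a 4x4 mesh: the path detours around it, costing extra hops.
path = route(4, bad={(1, 0)}, start=(0, 0), goal=(3, 0))
print(f"{len(path) - 1} hops instead of the direct 3")
```

As the comment above says, you lose a bit of speed (extra hops) but the rest of the wafer keeps working.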
For now. And also largely because it's easier to get that up and running than the alternative.
Eventually, as we ramp up domestic solar production (and maybe even drop solar tariffs for a while), the numbers will make them switch to renewable energy.
One thing is really bothering me: they show these tiny cores, but in the wafer comparison image, they chose to show their full chip as a square bounded by the circle of the wafer. Even the prior GPU arch tiled into the arcs of the circle. What gives?
There is a certain subset of Tesla owners who have this belief that features in certain Tesla vehicles are completely novel to Teslas and other auto manufacturers haven't even considered them. They can often be identified by how they refer to them as "dinosaurs".
Adjustable ride height? Miraculous. Meanwhile my car is mapping the road surface, actively leaning into corners and following road camber, actively avoiding potholes, and adjusting the suspension, including ride height, constantly.
Traffic Sign Recognition, including recognizing school zones, and recognizing active school zones.
Adaptive blind spot - so nice. Speed differential low, or you're going faster? It won't activate, or only at the last moment. But if someone is blowing by you in the HOV lane, it will warn about them when they're still several hundred feet back.
Laser headlights. Matrix headlights. Night vision with thermal imaging.
Predictive active suspension - The car actively scans the road ahead with sensors and it will adjust suspension for poorer road conditions.
The car can not just stop, but will actively swerve, if safe, around obstructions to avoid a collision, or even a parked car opening a door into traffic.
In my opinion it isn't useful at all, because if the only thing you can get into a spot is a vehicle with 4-wheel steering, you have already fucked up your site planning. You aren't going to be delivering materials with that thing: bulk materials are too heavy and light materials are too large. Maybe tools, but it isn't large enough to be a tool truck and it's too expensive for small handyman-type work.
Thanks for posting that, I watched a couple minutes of it and it suggests that the cybertruck has a really good turning radius, and it was able to drive on a go-kart track.
From https://www.cerebras.ai/chip:
https://cdn.sanity.io/images/e4qjo92p/production/78c94c67be9...
https://cdn.sanity.io/images/e4qjo92p/production/f552d23b565...