>Cerebras is the only platform to enable instant responses at a blistering 450 tokens/sec. All this is achieved using native 16-bit weights for the model, ensuring the highest accuracy responses.
As near as I can tell from the model card [1], the majority of the math for this model is 4096x4096 multiply-accumulates. So there should be about 70B / 16.7M ≈ 4,000 of these per token in the Llama3-70B model (4096^2 ≈ 16.7 million weights per matrix).
A 16x16 multiplier is about 9,000 transistors, according to a quick Google search. A full 4096x4096 array should thus be about 150 billion transistors, including the bias values. There are plenty of transistors on this chip to have many of them operating in parallel.
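A quick sanity check of the two estimates above. The 70B parameter count, the 4096x4096 matrix size, and the 9,000-transistors-per-multiplier figure are all taken as given from the comment; none are measured values.

```python
# Back-of-envelope check of the MAC-count and transistor-count estimates.
PARAMS = 70e9                  # Llama3-70B weight count (approximate)
MATRIX = 4096 * 4096           # weights per 4096x4096 matrix ≈ 16.7M

num_macs = PARAMS / MATRIX     # matrix multiplies per token ≈ 4,200

TRANSISTORS_PER_MULT = 9_000   # rough figure for a 16x16 multiplier
array_transistors = MATRIX * TRANSISTORS_PER_MULT  # ≈ 1.5e11

print(round(num_macs))                 # ≈ 4200 matrix multiplies
print(array_transistors / 1e9)         # ≈ 151 billion transistors
```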
According to [2], a switching transition in the 7 nm process node costs about 0.025 femtojoules (10^-15 joules) per transistor. At a clock rate of 1 GHz, that's about 25 nanowatts per transistor. Scaling that by a 50% activity factor (a 50/50 chance any given gate in the MAC flips each cycle) gets you about 2 kW for each 4096x4096 MAC array running at 1 GHz.
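The power figure follows directly from those numbers. This sketch assumes the 0.025 fJ/transition value from [2] and the 50% activity factor; both are rough inputs, not measurements.

```python
# Dynamic-power sketch for one 4096x4096 MAC array at 1 GHz.
E_TRANSITION = 0.025e-15       # joules per switching transition (7 nm, from [2])
F_CLOCK = 1e9                  # 1 GHz clock
TRANSISTORS = 4096**2 * 9_000  # ≈ 151e9 transistors in the array (estimate above)
ACTIVITY = 0.5                 # assume half the gates flip each cycle

power_w = TRANSISTORS * E_TRANSITION * F_CLOCK * ACTIVITY
print(power_w / 1e3)           # ≈ 1.9 kW
```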
There are enough transistors, and enough RAM on the wafer, to fit the entire model. Even with a single 4096x4096 MAC array, a clock rate of 1 GHz should result in a total time of about 4 µs/token, or roughly 250,000 tokens/second.
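The throughput claim assumes the very optimistic model above: one full 4096x4096 matrix multiply completes per clock cycle, and nothing else (memory movement, attention, activations) adds latency.

```python
# Latency/throughput sketch: ~4,200 matrix multiplies per token,
# one per cycle at 1 GHz, with no other overhead assumed.
F_CLOCK = 1e9                        # 1 GHz
num_macs = 70e9 / 4096**2            # ≈ 4,200 matrix multiplies per token

seconds_per_token = num_macs / F_CLOCK    # ≈ 4.2e-6 s
tokens_per_second = 1 / seconds_per_token # ≈ 240,000

print(seconds_per_token * 1e6)       # ≈ 4.2 µs/token
print(round(tokens_per_second))
```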
>There are enough transistors, and enough RAM on the wafer to fit the entire model.
Not the entire 70B fp16 model. It'd take about 140 GB of RAM to hold the entire model (70B weights at 2 bytes each). Each Cerebras wafer chip has 44 GB of SRAM, so you need 4 of them chained together to hold the entire model.
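The wafer-count arithmetic in that reply is straightforward: fp16 weights are 2 bytes each, and the 44 GB SRAM figure is taken as given.

```python
import math

# Memory check: can one wafer's SRAM hold the fp16 weights?
PARAMS = 70e9                  # Llama3-70B weight count (approximate)
BYTES_PER_WEIGHT = 2           # fp16
SRAM_PER_WAFER = 44e9          # 44 GB of on-wafer SRAM (as stated above)

bytes_needed = PARAMS * BYTES_PER_WEIGHT          # ≈ 140 GB
wafers = math.ceil(bytes_needed / SRAM_PER_WAFER) # → 4

print(bytes_needed / 1e9, wafers)
```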
[1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
[2] https://mpedram.com/Papers/7nm-finfet-libraries-tcasII.pdf