
Relative to what? According to this, a single card is capable of >300 t/s on Mistral-7B, and the workstations with 4 cards are doing nearly 500 t/s on Mixtral-8x7B.

Yeah, the H100 and MI300 nearly 10x those numbers [1] at the same batch sizes, but those cards are unobtainium server-class hardware priced way outside the prosumer range, while these cards cost less than an RTX 4090 and draw only ~300W.

What other options exist for individuals or small companies looking to run/train locally at that kind of speed?

1: https://blog.runpod.io/amd-mi300x-vs-nvidia-h100-sxm-perform...



> Yeah, the H100 and MI300 nearly 10x those numbers

Don't they cost 10x as much, or more?


No, from their benchmark setup, it’s more like 3x as much. So yeah it’s just really not competitive.


I've only ever seen rumors and reports of them being >$10k USD and only available to enterprise customers.

Where are you seeing pricing for the H100 and MI300?


Yes, but the benchmarks listed above are mostly done on an 8x setup at $1400 each, so ca. $12k, and the performance achieved is a fraction of what a $30k H100 will do.


The benchmarks on GitHub use the N300, which has 2 chips per board with 4 boards in the system -- that's the "2x4" they refer to -- at 1400 USD per board. So it's only $5.6k to match the system they sell, versus north of $30k for an H100. Granted, at an equivalent bs=32 it's only 4x or 5x worse than the H100 according to the benchmarks in this thread (500 t/s vs ~2000 t/s), but as you note that's not a batch size anyone practically uses, and the power usage is roughly 4x worse overall too. So at an impractical batch size it draws ~4x the power for about 1/4 the total tokens/second. Given the pricing you could in theory buy 4x as many cards and still come in under an H100, but that totally ignores operational and other scaling costs. Apparently Tenstorrent is still on 12nm for Wormhole, too.
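A quick back-of-the-envelope in Python, using the numbers quoted in this thread. The only figure I'm assuming is the H100's power draw (~700W for the SXM part); everything else comes from the comments above:

    # rough cost/throughput/power comparison; all figures are
    # thread-quoted except the H100 wattage, which is assumed
    n300_boards, n300_usd_each = 4, 1400         # "2x4" system, $1400/board
    n300_tok_s, n300_w_each = 500, 300           # ~500 t/s total, ~300W/board
    h100_usd, h100_tok_s, h100_w = 30_000, 2000, 700   # bs=32; 700W assumed

    n300_usd = n300_boards * n300_usd_each       # $5,600 for the 4-board setup
    n300_w = n300_boards * n300_w_each           # 1,200 W total
    print(f"$ per tok/s: {n300_usd / n300_tok_s:.1f} vs {h100_usd / h100_tok_s:.1f}")
    print(f"tok/s per W: {n300_tok_s / n300_w:.2f} vs {h100_tok_s / h100_w:.2f}")

The exact perf-per-watt ratio depends heavily on which H100 power figure you plug in; the dollars-per-throughput comparison is less sensitive, since both prices are quoted in the thread.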

Anyway, I overall agree with you that it's not as amazing as people here might make it sound. I think people are viewing it through rose-tinted glasses because, practically speaking, nobody else actually sells B2C accelerator hardware at a reasonable cost and with actual availability, which is what they want. They look at Tenstorrent and see a "Buy now" form as a way to spend $5k USD and get 96GB of GDDR6 and a toolchain that's both open-source and not-Nvidia, or whatever. This forum is going to be particularly sensitive to things like that.

The actual hardware still has a ways to go, I think, but hopefully they can scale it up in a bunch of ways and people can at least buy functionally usable cards with a software stack that works on all of them. So they're doing better than a lot of competitors in those respects, I guess...


> the N300 which has 2 chips per board with 4 boards in the system

Ah my mistake, thanks for that.

But also, an H100 will definitely do more than 2000 t/s at bs=32 in fp16.


That is bs=32, which is wholly unimpressive. Last-gen consumer cards can do better than that at a similar power envelope; the current gen, even better.


Do you have any benchmarks to share? The best I've seen for a 4090 are all in the mid-20s, never more than 30 t/s.


Where have you seen this? Look at what vLLM or TensorRT will do on a 4090 at those batch sizes.
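If you want to sanity-check that yourself, here's a minimal vLLM sketch for measuring batched throughput on a single GPU. The model name, batch size, and output length are just placeholders, and this is a rough measurement, not a tuned benchmark:

    # rough tokens/s measurement with vLLM on a single GPU (sketch)
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-v0.1", dtype="float16")
    params = SamplingParams(temperature=0, max_tokens=256)
    prompts = ["Write a short story about a robot."] * 32  # bs=32

    start = time.time()
    outputs = llm.generate(prompts, params)
    elapsed = time.time() - start

    # count generated tokens across all 32 requests
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")

Note this reports aggregate throughput across the batch, which is the number the parent is talking about; the mid-20s figures quoted above sound like per-stream rates.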



