
I find these findings questionable unless Whisper is very poorly optimized in the way it was run on the 4090.

I have a 3090 and an M1 Max 32GB, and although I haven't tried Whisper, the inference difference on Llama and Stable Diffusion between the two is staggering, especially with Stable Diffusion, where SDXL takes about 9 seconds on the 3090 and 1 minute 10 seconds on the M1 Max.



You're taking benchmark numbers from a latent diffusion model's (SDXL) inference and extrapolating them to an encoder-decoder transformer model's (Whisper) inference. These two model architectures have little in common (except perhaps the fact that Stable Diffusion models use a pre-trained text encoder from CLIP, which again is very different from an encoder-decoder transformer).


The point still stands though. Popular models tend to have massively hand-optimized Nvidia implementations.

Whisper is no exception: https://github.com/Vaibhavs10/insanely-fast-whisper
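For a sense of what those optimizations boil down to, insanely-fast-whisper essentially runs Whisper through the Hugging Face transformers pipeline in fp16 with chunked, batched decoding on the GPU. A rough sketch of that setup (exact kwargs vary between transformers versions; model name and audio path are placeholders):

    import torch
    from transformers import pipeline

    # Batched fp16 Whisper inference on an Nvidia GPU, in the spirit of the
    # insanely-fast-whisper README; parameter names may differ across versions.
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",
    )
    out = pipe(
        "audio.mp3",            # placeholder path
        chunk_length_s=30,      # split long audio into 30 s chunks
        batch_size=16,          # decode many chunks in parallel on the GPU
        return_timestamps=True,
    )
    print(out["text"])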

SDXL is actually an interesting exception for Nvidia, because most users still tend to run it in PyTorch eager mode. There are super-optimized Nvidia implementations, like stable-fast, but their use is less common. Apple, on the other hand, took the odd step of hand-writing a Metal implementation themselves, at least for SD 1.5.


This will determine who has a shot at actually being Nvidia competitive.

What I like to say is that (generally speaking) other implementations like AMD (ROCm), Intel, Apple, etc. are more or less at the "get it to work" stage. Due to their early lead and absolute market dominance, Nvidia has been at the "wring every last penny of performance out of this" stage for years.

Efforts like this are a good step, but they still have a very long way to go to compete with multiple layers (throughout the stack) of insanely optimized Nvidia/CUDA implementations. Bonus points: nearly anything with Nvidia is a Docker command that just works on any chip they've made in the last half decade, from laptop to datacenter.

This can be seen (dramatically) with ROCm. I recently went through the significant effort (again) of getting an LLM to run on an AMD GPU. The AMD GPU is "cheaper" in initial cost, but when the roughly dollar-equivalent (within 10-30%) Nvidia GPU is 5-10x faster (or whatever), you're not saving anything.

Unless your time is free, you're already at a loss just getting it to work (random patches, version hacks, etc.), and then the performance isn't even close, so the "value prop" of AMD currently doesn't make any sense whatsoever. The advantage for Apple is that you likely spent the money on the machine anyway, and when it's just sitting in front of you for a variety of tasks, the value prop increases significantly.
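To be fair, once a ROCm build of PyTorch is actually working, the code itself is identical to the CUDA path, since HIP hides behind the torch.cuda API; the pain is entirely in getting to that point. A quick sanity check, assuming a ROCm PyTorch install:

    import torch

    # On ROCm builds of PyTorch, the "cuda" device maps to the AMD GPU via HIP.
    print(torch.cuda.is_available())             # True if the AMD GPU is visible
    print(getattr(torch.version, "hip", None))   # HIP/ROCm version string on AMD builds
    x = torch.randn(4096, 4096, device="cuda")
    print((x @ x).sum())                         # runs on the AMD GPU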


Although LDM inference and encoder-decoder / decoder-only LLM inference are both fundamentally iterative in nature, LLM inference is memory bound while LDM inference is compute bound. In that light, it makes sense that the difference between a 4090 and an M1 Pro isn't as pronounced as one would expect at first approximation.
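A back-of-the-envelope way to see the memory-bound point: each generated token has to stream essentially the full weight set from memory, so bandwidth sets the ceiling. Using rough, assumed figures for a 7B fp16 model:

    # Upper bound on single-stream decode speed for a memory-bound LLM:
    # every new token reads all of the weights from memory at least once.
    weight_bytes = 7e9 * 2      # ~7B params at fp16
    for name, bw in [("M1 Max (~400 GB/s)", 400e9), ("RTX 4090 (~1 TB/s)", 1000e9)]:
        print(f"{name}: <= {bw / weight_bytes:.0f} tokens/s")
    # ~29 vs ~71 tokens/s: the ratio tracks the ~2.5x bandwidth gap, not the far
    # larger FLOPS gap, which is why compute-bound SDXL shows a much bigger difference.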

Also, as you hint, whisper.cpp certainly isn't one of the fastest implementations of Whisper inference out there. Perhaps a comparison between a pure PyTorch version running on the 4090 and an MLX version of Whisper running on the M1 Pro would be fairer. Or better yet, run the Whisper encoder on the ANE with CoreML and have the decoder running with Metal and Accelerate (which uses Apple's undocumented AMX ISA) using MLX, since MLX currently does not use the ANE. IIRC, whisper.cpp has a similar optimization on Apple hardware, where it optionally runs the encoder using CoreML and the decoder using Metal.
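For the "pure PyTorch on the 4090" side of such a comparison, the reference openai-whisper package would be the natural baseline; a minimal sketch (model size and audio path are placeholders):

    import whisper  # the reference openai-whisper package

    # Plain PyTorch Whisper inference on an Nvidia GPU, no extra optimization.
    model = whisper.load_model("large-v2", device="cuda")
    result = model.transcribe("audio.mp3", fp16=True)
    print(result["text"])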


Modest 30x speedup


There has been a ton of optimization around Whisper with regard to Apple silicon; whisper.cpp is a good example that takes advantage of this. Also, this article is specifically referencing the new Apple MLX framework, which I'm guessing your tests with Llama and Stable Diffusion weren't utilizing.


I assume people are working on bringing an MLX backend to llama.cpp... Any idea what the state of that project is?


https://github.com/ml-explore/mlx-examples

Several people are working on MLX-enabled backends for popular ML workloads, but it seems inference workloads are seeing the most acceleration versus generative/training ones.
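For reference, the Whisper port in that repo is meant to be driven from Python roughly like this (a sketch based on my reading of its README; it assumes you're inside the repo's whisper/ directory with the converted weights in place, and the helper names may have changed since):

    # Inside mlx-examples/whisper, after converting the weights per the README.
    import whisper  # the MLX port in the repo, not the openai-whisper package

    result = whisper.transcribe("audio.mp3")   # placeholder audio path
    print(result["text"])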


Reading through some (admittedly very early) MLX docs, it seems that convolutions (as used heavily in GANs and particularly Stable Diffusion) are not really seeing meaningful uplifts on MLX at all, and in some cases are slower than on the CPU.

Not sure if this is a hardware limitation or just unoptimized MLX libraries, but I find it hard to believe they would have just ignored this very prominent use case. It's more likely that convolutions use high precision and much larger tile sets that require expensive context switching when the entire transform can't fit on the GPU.
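This is easy enough to poke at directly; something along these lines (arbitrary shapes, naive timing) is how I'd check whether MLX convolutions beat the CPU on a given machine:

    import time
    import mlx.core as mx
    import mlx.nn as nn

    # Naive timing of a stack of 3x3 convolutions on GPU vs CPU with MLX.
    # MLX convolutions expect NHWC layout; shapes here are arbitrary.
    def bench(device):
        mx.set_default_device(device)
        conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        x = mx.random.normal((8, 128, 128, 64))
        mx.eval(conv.parameters(), x)   # materialize before timing (MLX is lazy)
        start = time.perf_counter()
        for _ in range(20):
            x = conv(x)
        mx.eval(x)                      # force the lazy graph to actually run
        return time.perf_counter() - start

    print("gpu:", bench(mx.gpu))
    print("cpu:", bench(mx.cpu))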


I have a 4090 and an M1 Max 64GB. The 4090 is far superior on Llama 2.


But are you using the newly released Apple MLX optimizations?


It's been approximately 2 months since I have tested it, so probably not.


But those optimizations are the subject of the article you are commenting on.


On models < 24GB presumably. "Faster" depends on the model size.


In this case, the 4090 is far more memory efficient thanks to ExLlamav2.

70B in particular is indeed a significant compromise on the 4090, but not as much as you'd think. 34B and down though, I think Nvidia is unquestionably king.


Doesn't running 70B in 24GB need 2 bit quantisation?

I'm no expert, but to me that sounds like a recipe for bad performance. Does a 70B model in 2-bit really outperform a smaller-but-less-quantised model?


2.65bpw, on a totally empty 3090 (and I mean totally empty).
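The arithmetic on why that only works on a completely empty card (ignoring embeddings and per-tensor overhead):

    # ~70B weights at 2.65 bits per weight, versus a 24 GiB card.
    weight_gib = 70e9 * 2.65 / 8 / 2**30
    print(f"weights alone: {weight_gib:.1f} GiB")   # ~21.6 GiB
    # That leaves roughly 2 GiB for the KV cache, activations and the CUDA
    # context, so nothing else (not even the desktop) can be using the GPU.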

I would say 34B is the performance sweet spot, yeah. There was a long period where all we had in the 33B range was llamav1, but now we have Yi and Codellamav2 (among others).


Having used Whisper a ton, there are versions of it that have one or two orders of magnitude better performance at the same quality while using less memory, for reasons I don't fully understand.

So I'd be very careful about your intuition on Whisper performance unless it's literally the same software and the same model (and even then the comparison isn't very meaningful, seeing how we want to optimize it for different platforms).


It's all really messy; I would assume that almost any model is poorly optimized to run on Apple Silicon as well.


Both of your SDXL numbers, on the 3090 and the M1 Max, should be faster (of course, it depends on how many steps). But the point stands: for SDXL, a 3090 should be 5x to 6x faster than an M1 Max, and 2x to 2.5x faster than an M2 Ultra.


Thank you for sharing this data. I've just been debating between an M2 Max Mac Studio and a 64GB i9-10900X with an RTX 3090 for personal ML use. Glad I chose the 3090! Would love to learn more about your setup.


"I haven't tried Whisper"

I haven't tried the hardware/software/framework/... of the article, but I have an opinion on this exact topic.


The topic is benchmarking some hardware and a specific implementation of some tool.

The provided context is an earlier version of the hardware where known implementations perform drastically differently, by an order of magnitude.

That leaves the question of why that specific tool exhibits the behavior described in the article.


The M1 Max has 400GB/s of memory bandwidth and a 4090 has 1TB/s of memory bandwidth. M1 Max has 32 GPU cores and a 4090 has 16,000. The difference is more about how well the software is optimized for the hardware platform than about any performance difference between the two, which are frankly not comparable in any way.


> M1 Max has 32 GPU cores and a 4090 has 16,000.

The Apple M1 Max has 32 GPU cores, each core contains 16 Execution Units, and each EU has 8 ALUs (also called shaders), so overall there are 4096 shaders. The Nvidia RTX 4090 is built on AD102: the full die has 12 Graphics Processing Clusters, each GPC has 12 Streaming Multiprocessors, and each SM has 128 ALUs, for 18432 shaders; the 4090 itself ships with 128 SMs enabled, i.e. 16384 shaders.

A single shader is somewhat similar to a single lane of a vector ALU in a CPU. One can say that a single-core CPU with AVX-512 has 8 shaders, because it can process 8 FP64s at the same time. Calling them "cores" (as in "CUDA core") is extremely misleading, which is why "shader" became the common name for a GPU's ALU. If Nvidia were in charge of marketing a 4-core x86-64 CPU, they would call it a CPU with 32 "AVX cores", because each core has 8-wide SIMD.
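Spelling out those counts (the 4090 figures assume the AD102 numbers above, with 128 of the SMs enabled on the shipping card):

    # "Shaders" = scalar ALU lanes, counted the way described above.
    m1_max     = 32 * 16 * 8     # GPU cores x EUs per core x ALUs per EU = 4096
    ad102_full = 12 * 12 * 128   # GPCs x SMs per GPC x ALUs per SM      = 18432
    rtx_4090   = 128 * 128       # 128 enabled SMs x 128 ALUs per SM     = 16384
    print(m1_max, ad102_full, rtx_4090)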


Actually, each core in those x86 CPUs probably has at least two AVX FMA units and can issue 16 FP32 FMAs per cycle – that makes it at least "64 AVX cores"! :)


Doesn't Zen 4 have 2x 256-bit FADD and 2x 256-bit FMA units, with AVX-512 ops double-pumping the ALUs (a good overview here [0])? If you count FADD as a single flop and FMA as 2, that's 48 "1-flop cores" per core.

I think it's got the same total FP ALU resources as Zen 3, which shows how register width and ALU resources can be completely decoupled.

[0] https://www.mersenneforum.org/showthread.php?p=614191
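Spelling that count out under those assumptions (two 256-bit FADD pipes and two 256-bit FMA pipes per core, FP32 lanes):

    # Zen 4 per-core FP32 "one-flop lanes" per cycle, per the assumptions above.
    fadd = 2 * 8        # two 256-bit FADD pipes x 8 fp32 lanes, 1 flop each
    fma  = 2 * 8 * 2    # two 256-bit FMA pipes x 8 fp32 lanes, 2 flops each
    print(fadd + fma)   # 48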


I think the 4090 has ~16,000 ALUs, not "cores" (let's call a component capable of executing instructions independently of others a "core"). And the M1 Max probably has more than one ALU in each core; otherwise it would resemble an ancient GPU.


Yeah; 'core' is a pretty meaningless term when it comes to GPUs, or at least it's meaningless outside the context of a particular architecture.

We may just be thankful that this particular bit of marketing never caught on for CPUs.


Nvidia switched to marketing speak a long time ago when it came to the word "core". If we go with Nvidia's definition then M1 Max has 4096 cores, still behind the 4090, but the gap isn't as big as 32 to 16k.



