I find these findings questionable unless Whisper is very poorly optimized in the way it was run on the 4090.
I have a 3090 and an M1 Max 32GB, and although I haven't tried Whisper, the inference difference on Llama and Stable Diffusion between the two is staggering, especially with Stable Diffusion, where SDXL takes about 9 seconds on the 3090 and about 1 minute 10 seconds on the M1 Max.
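For what it's worth, a minimal timing harness along these lines is how I'd measure it; this is a sketch using Hugging Face diffusers, and the model id, prompt, and step count are placeholders rather than my exact settings:

```python
# Minimal SDXL timing sketch with Hugging Face diffusers; model id, prompt,
# and step count are illustrative placeholders.
import time
import torch
from diffusers import StableDiffusionXLPipeline

device = "cuda" if torch.cuda.is_available() else "mps"  # 3090 vs Apple Silicon
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)

pipe("a photo of a cat", num_inference_steps=30)          # warm-up run

start = time.perf_counter()
pipe("a photo of a cat", num_inference_steps=30)
print(f"{device}: one image in {time.perf_counter() - start:.1f}s")
```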
You're taking benchmark numbers from a latent diffusion model's (SDXL) inference and extrapolating them to an encoder-decoder transformer model's (Whisper) inference. These two model architectures have little in common (except perhaps that Stable Diffusion models use a pre-trained text encoder from CLIP, which again is very different from an encoder-decoder transformer).
SDXL is actually an interesting exception for Nvidia because most users still tend to run it in PyTorch eager mode. There are super-optimized Nvidia implementations, like stable-fast, but their use is less common. Apple, on the other hand, took the unusual step of hand-writing a Metal implementation themselves, at least for SD 1.5.
This will determine who has a shot at actually being competitive with Nvidia.
What I like to say is that (generally speaking) the other implementations, AMD (ROCm), Intel, Apple, etc., are more or less at the “get it to work” stage. Due to their early lead and absolute market dominance, Nvidia has been at the “wring every last penny of performance out of this” stage for years.
Efforts like this are a good step, but they still have a very long way to go to compete with the multiple layers (throughout the stack) of insanely optimized Nvidia/CUDA implementations. Bonus points: nearly anything with Nvidia is a docker command that just works on any chip they’ve made in the last half decade, from laptop to datacenter.
This can be seen (dramatically) with ROCm. I recently took the significant effort (again) to get an LLM to run on an AMD GPU. The AMD GPU is “cheaper” in initial cost, but when the roughly dollar-equivalent (to within 10-30%) Nvidia GPU is 5-10x faster (or whatever), you’re not saving anything.
You’re already at a loss (unless your time is free) just getting it to work (random patches, version hacks, etc.), and then the performance isn’t even close, so the “value prop” of AMD currently doesn’t make any sense whatsoever. The advantage for Apple is that you likely spent whatever on the machine anyway, and when it’s just sitting in front of you for a variety of tasks the value prop increases significantly.
Although LDM inference and encoder-decoder / decoder-only LLM inference are both fundamentally autoregressive in nature, LLM inference is memory bound while LDM inference is compute bound. In that light, it makes sense that the difference between a 4090 and an M1 Pro isn't as pronounced as one would expect at first approximation.
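A back-of-the-envelope way to see that split, using the bandwidth and peak-throughput figures quoted elsewhere in this thread (all numbers are rough spec-sheet approximations, and real-world gaps are smaller than the peak ratios):

```python
# Rough roofline-style estimate; all figures are approximate spec-sheet numbers.

# Memory-bound case (LLM decode): each generated token streams the full weights,
# so throughput is roughly bandwidth / model size.
params = 7e9                                    # e.g. a 7B model
bytes_per_param = 2                             # fp16
bandwidth = {"4090": 1.0e12, "M1 Max": 400e9}   # ~1 TB/s vs ~400 GB/s
for name, bw in bandwidth.items():
    print(f"{name}: ~{bw / (params * bytes_per_param):.0f} tok/s upper bound")
# The gap tracks the ~2.5x bandwidth ratio.

# Compute-bound case (LDM steps): the gap tracks peak FP16 throughput instead,
# which is far more lopsided on paper (measured SDXL gaps are smaller).
peak_tflops = {"4090": 165.0, "M1 Max": 10.4}   # very rough peak figures
print(f"compute ratio: ~{peak_tflops['4090'] / peak_tflops['M1 Max']:.0f}x")
```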
Also, as you hint, whisper.cpp certainly isn't one of the fastest implementations of Whisper inference out there. Perhaps a comparison between a pure PyTorch version running on the 4090 and an MLX version of Whisper running on the M1 Pro would be fairer. Or better yet, run the Whisper encoder on the ANE with CoreML and have the decoder run with Metal and Accelerate (which uses Apple's undocumented AMX ISA) via MLX, since MLX currently does not use the ANE. IIRC, whisper.cpp has a similar optimization on Apple hardware, where it optionally runs the encoder using CoreML and the decoder using Metal.
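As a starting point for that comparison, the CUDA-side baseline could be as simple as the sketch below (openai-whisper's Python API; the model name and audio path are placeholders):

```python
# Plain PyTorch Whisper on CUDA via the openai-whisper package, as a baseline
# to compare against an MLX or CoreML-accelerated run on Apple Silicon.
import time
import whisper

model = whisper.load_model("large-v3", device="cuda")  # model name is illustrative

start = time.perf_counter()
result = model.transcribe("sample.wav", fp16=True)     # placeholder audio path
print(f"transcribed in {time.perf_counter() - start:.1f}s")
print(result["text"][:200])
```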
There has been a ton of optimization around Whisper on Apple Silicon; whisper.cpp is a good example that takes advantage of this. Also, this article is specifically referencing the new Apple MLX framework, which I'm guessing your tests with Llama and Stable Diffusion weren't utilizing.
Several people are working on MLX-enabled backends for popular ML workloads, but it seems inference workloads are the most accelerated versus generative/training workloads.
Reading through some (admittedly very early) MLX docs, it seems that convolutions (as used heavily in GANs and particularly Stable Diffusion) are not really seeing meaningful uplifts on MLX at all, and in some cases are slower than on the CPU.
Not sure if this is a hardware limitation or just unoptimized MLX libraries, but I find it hard to believe they would have just ignored this very prominent use case. It's more likely that convolutions use high precision and much larger tile sets that require some expensive context switching when the entire transform can't fit in the GPU.
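If anyone wants to check, a micro-benchmark along these lines should show it; the calls are taken from the early MLX docs, so treat the exact names and shapes as assumptions:

```python
# Micro-benchmark sketch: the same conv2d on MLX's CPU vs GPU devices.
import time
import mlx.core as mx

def bench(device, iters=20):
    mx.set_default_device(device)
    x = mx.random.normal((8, 256, 256, 64))      # NHWC input
    w = mx.random.normal((64, 3, 3, 64))         # (out, kH, kW, in) weights
    mx.eval(x, w)
    start = time.perf_counter()
    for _ in range(iters):
        y = mx.conv2d(x, w, stride=1, padding=1)
        mx.eval(y)                               # force MLX's lazy evaluation
    return (time.perf_counter() - start) / iters

print("cpu:", bench(mx.cpu))
print("gpu:", bench(mx.gpu))
```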
In this case, the 4090 is far more memory efficient thanks to ExLlamav2.
70B in particular is indeed a significant compromise on the 4090, but not as much as you'd think. 34B and down though, I think Nvidia is unquestionably king.
Doesn't running 70B in 24GB need 2 bit quantisation?
I'm no expert, but to me that sounds like a recipe for bad performance. Does a 70B model in 2-bit really outperform a smaller-but-less-quantised model?
2.65bpw, on a totally empty 3090 (and I mean totally empty).
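The arithmetic behind that, roughly (ignoring per-layer quant variation, KV cache growth with context, and runtime overhead):

```python
# Rough VRAM estimate for a 70B model at 2.65 bits per weight (EXL2-style).
params = 70e9
bpw = 2.65
weights_gb = params * bpw / 8 / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")        # ~23.2 GB

# What's left of a 24 GB card for KV cache, activations, and the CUDA context
# is why the card has to be totally empty (nothing else, not even a desktop).
print(f"headroom on 24 GB: ~{24 - weights_gb:.1f} GB")
```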
I would say 34B is the performance sweet spot, yeah. There was a long period where all we had in the 33B range was Llama v1, but now we have Yi and CodeLlama (among others).
Having used Whisper a ton, there are versions of it that get one or two orders of magnitude better performance at the same quality while using less memory, for reasons I don't fully understand.
So I'd be very careful about your intuition on Whisper performance unless it's literally the same software and same model (and even then the comparison still isn't very meaningful, seeing how we want to optimize it for different platforms).
Both your 3090 and M1 Max SDXL numbers should be faster (of course, it depends on how many steps). But the point stands: for SDXL, a 3090 should be 5x to 6x faster than an M1 Max, and 2x to 2.5x faster than an M2 Ultra.
Thank you for sharing this data. I've just been debating between an M2 Max Mac Studio and a 64GB i9 10900X with an RTX 3090 for personal ML use. Glad I chose the 3090! Would love to learn more about your setup.
The M1 Max has 400GB/s of memory bandwidth and a 4090 has 1TB/s. The M1 Max has 32 GPU cores and a 4090 has 16,000. The difference is more about how well the software is optimized for the hardware platform than about any performance difference between the two, which are frankly not comparable in any way.
The Apple M1 Max has 32 GPU cores; each core contains 16 Execution Units and each EU has 8 ALUs (also called shaders), so overall there are 4096 shaders. The Nvidia RTX 4090's AD102 die contains 12 Graphics Processing Clusters, each GPC has 12 Streaming Multiprocessors, and each SM has 128 ALUs, for 18432 shaders on the full die (the shipping 4090 has 128 of those SMs enabled, i.e. 16384 shaders).
A single shader is somewhat similar to a single lane of a vector ALU in a CPU. One can say that a single-core CPU with AVX-512 has 8 shaders, because it can process 8 FP64s at the same time. Calling them "cores" (as in "CUDA core") is extremely misleading, which is why "shader" became the common name for a GPU's ALU. If Nvidia were in charge of marketing a 4-core x86-64 CPU, they would call it a CPU with 32 "AVX cores" because each core has 8-way SIMD.
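Spelling the counting out (per-unit numbers as quoted above, with the full-die vs shipping-4090 distinction made explicit):

```python
# ALU ("shader") counting from the block diagrams quoted above.
m1_max   = 32 * 16 * 8       # 32 GPU cores x 16 EUs x 8 ALUs = 4096
ad102    = 12 * 12 * 128     # 12 GPCs x 12 SMs x 128 ALUs    = 18432 (full die)
rtx_4090 = 128 * 128         # 128 enabled SMs x 128 ALUs     = 16384 (shipping 4090)
print(m1_max, ad102, rtx_4090)

# The AVX analogy: a 4-core CPU with 8-wide SIMD would be a "32 core" part
# under the same counting scheme.
print(4 * 8)
```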
Doesn't Zen 4 have 2x 256-bit FADD and 2x 256-bit FMA, with AVX-512 ops double-pumping the ALUs (a good overview here [0])? If you count FADD as a single flop and FMA as 2, that's 48 "1-flop cores" per core, counting 32-bit lanes.
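Counted out explicitly (32-bit lanes, FADD as 1 flop, FMA as 2):

```python
# Per-core flop-unit counting for Zen 4's FP pipes, using 32-bit lanes.
lanes = 256 // 32            # 8 FP32 lanes per 256-bit pipe
fadd  = 2 * lanes * 1        # 2 FADD pipes, 1 flop per lane  -> 16
fma   = 2 * lanes * 2        # 2 FMA pipes,  2 flops per lane -> 32
print(fadd + fma)            # 48 "1-flop cores" per core
```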
I think it has the same total FP ALU resources as Zen 3, which shows how register width and ALU resources can be completely decoupled.
I think the 4090 has 16000 ALUs, not "cores" (let's call a component capable of executing instructions independently from the others a "core"). And the M1 Max probably has more than 1 ALU in every core, otherwise it would resemble an ancient GPU.
Nvidia switched to marketing speak a long time ago when it came to the word "core". If we go with Nvidia's definition then M1 Max has 4096 cores, still behind the 4090, but the gap isn't as big as 32 to 16k.