I find these findings questionable unless Whisper is very poorly optimized in the way it was run on the 4090.
I have a 3090 and an M1 Max 32GB, and although I haven't tried Whisper, the inference difference on Llama and Stable Diffusion between the two is staggering, especially with Stable Diffusion, where SDXL takes about 9 seconds on the 3090 and about 1 minute 10 seconds on the M1 Max.
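For what it's worth, a minimal timing harness along these lines is how I'd measure it; this is a sketch using Hugging Face diffusers, and the model id, prompt, and step count are placeholders rather than my exact settings:

```python
# Minimal SDXL timing sketch with Hugging Face diffusers; model id, prompt,
# and step count are illustrative placeholders.
import time
import torch
from diffusers import StableDiffusionXLPipeline

device = "cuda" if torch.cuda.is_available() else "mps"  # 3090 vs Apple Silicon
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)

pipe("a photo of a cat", num_inference_steps=30)          # warm-up run

start = time.perf_counter()
pipe("a photo of a cat", num_inference_steps=30)
print(f"{device}: one image in {time.perf_counter() - start:.1f}s")
```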
You're taking benchmark numbers from a latent diffusion model's (SDXL) inference and extrapolating them to an encoder-decoder transformer model's (Whisper) inference. These two model architectures have little in common (except perhaps that Stable Diffusion models use a pre-trained text encoder from CLIP, which again is very different from an encoder-decoder transformer).
SDXL is actually an interesting exception for Nvidia because most users still tend to run it in PyTorch eager mode. There are super-optimized Nvidia implementations, like stable-fast, but their use is less common. Apple, on the other hand, took the unusual step of hand-writing a Metal implementation themselves, at least for SD 1.5.
This will determine who has a shot at actually being competitive with Nvidia.
What I like to say is that (generally speaking) the other implementations, AMD (ROCm), Intel, Apple, etc., are more or less at the “get it to work” stage. Due to their early lead and absolute market dominance, Nvidia has been at the “wring every last penny of performance out of this” stage for years.
Efforts like this are a good step, but they still have a very long way to go to compete with the multiple layers (throughout the stack) of insanely optimized Nvidia/CUDA implementations. Bonus points: nearly anything with Nvidia is a docker command that just works on any chip they’ve made in the last half decade, from laptop to datacenter.
This can be seen (dramatically) with ROCm. I recently took the significant effort (again) to get an LLM to run on an AMD GPU. The AMD GPU is “cheaper” in initial cost, but when the roughly dollar-equivalent (to within 10-30%) Nvidia GPU is 5-10x faster (or whatever), you’re not saving anything.
You’re already at a loss (unless your time is free) just getting it to work (random patches, version hacks, etc.), and then the performance isn’t even close, so the “value prop” of AMD currently doesn’t make any sense whatsoever. The advantage for Apple is that you likely spent whatever on the machine anyway, and when it’s just sitting in front of you for a variety of tasks the value prop increases significantly.
Although LDM inference and encoder-decoder / decoder-only LLM inference are both fundamentally autoregressive in nature, LLM inference is memory bound while LDM inference is compute bound. In that light, it makes sense that the difference between a 4090 and an M1 Pro isn't as pronounced as one would expect at first approximation.
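A back-of-the-envelope way to see that split, using the bandwidth and peak-throughput figures quoted elsewhere in this thread (all numbers are rough spec-sheet approximations, and real-world gaps are smaller than the peak ratios):

```python
# Rough roofline-style estimate; all figures are approximate spec-sheet numbers.

# Memory-bound case (LLM decode): each generated token streams the full weights,
# so throughput is roughly bandwidth / model size.
params = 7e9                                    # e.g. a 7B model
bytes_per_param = 2                             # fp16
bandwidth = {"4090": 1.0e12, "M1 Max": 400e9}   # ~1 TB/s vs ~400 GB/s
for name, bw in bandwidth.items():
    print(f"{name}: ~{bw / (params * bytes_per_param):.0f} tok/s upper bound")
# The gap tracks the ~2.5x bandwidth ratio.

# Compute-bound case (LDM steps): the gap tracks peak FP16 throughput instead,
# which is far more lopsided on paper (measured SDXL gaps are smaller).
peak_tflops = {"4090": 165.0, "M1 Max": 10.4}   # very rough peak figures
print(f"compute ratio: ~{peak_tflops['4090'] / peak_tflops['M1 Max']:.0f}x")
```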
Also, as you hint, whisper.cpp certainly isn't one of the fastest implementations of Whisper inference out there. Perhaps a comparison between a pure PyTorch version running on the 4090 and an MLX version of Whisper running on the M1 Pro would be fairer. Or better yet, run the Whisper encoder on the ANE with CoreML and have the decoder run with Metal and Accelerate (which uses Apple's undocumented AMX ISA) via MLX, since MLX currently does not use the ANE. IIRC, whisper.cpp has a similar optimization on Apple hardware, where it optionally runs the encoder using CoreML and the decoder using Metal.
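As a starting point for that comparison, the CUDA-side baseline could be as simple as the sketch below (openai-whisper's Python API; the model name and audio path are placeholders):

```python
# Plain PyTorch Whisper on CUDA via the openai-whisper package, as a baseline
# to compare against an MLX or CoreML-accelerated run on Apple Silicon.
import time
import whisper

model = whisper.load_model("large-v3", device="cuda")  # model name is illustrative

start = time.perf_counter()
result = model.transcribe("sample.wav", fp16=True)     # placeholder audio path
print(f"transcribed in {time.perf_counter() - start:.1f}s")
print(result["text"][:200])
```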
There has been a ton of optimization around Whisper on Apple Silicon; whisper.cpp is a good example that takes advantage of this. Also, this article is specifically referencing the new Apple MLX framework, which I'm guessing your tests with Llama and Stable Diffusion weren't utilizing.
Several people are working on MLX-enabled backends for popular ML workloads, but it seems inference workloads are the most accelerated versus generative/training workloads.
Reading through some (admittedly very early) MLX docs, it seems that convolutions (as used heavily in GANs and particularly Stable Diffusion) are not really seeing meaningful uplifts on MLX at all, and in some cases are slower than on the CPU.
Not sure if this is a hardware limitation or just unoptimized MLX libraries, but I find it hard to believe they would have just ignored this very prominent use case. It's more likely that convolutions use high precision and much larger tile sets that require some expensive context switching when the entire transform can't fit in the GPU.
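If anyone wants to check, a micro-benchmark along these lines should show it; the calls are taken from the early MLX docs, so treat the exact names and shapes as assumptions:

```python
# Micro-benchmark sketch: the same conv2d on MLX's CPU vs GPU devices.
import time
import mlx.core as mx

def bench(device, iters=20):
    mx.set_default_device(device)
    x = mx.random.normal((8, 256, 256, 64))      # NHWC input
    w = mx.random.normal((64, 3, 3, 64))         # (out, kH, kW, in) weights
    mx.eval(x, w)
    start = time.perf_counter()
    for _ in range(iters):
        y = mx.conv2d(x, w, stride=1, padding=1)
        mx.eval(y)                               # force MLX's lazy evaluation
    return (time.perf_counter() - start) / iters

print("cpu:", bench(mx.cpu))
print("gpu:", bench(mx.gpu))
```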
In this case, the 4090 is far more memory efficient thanks to ExLlamav2.
70B in particular is indeed a significant compromise on the 4090, but not as much as you'd think. 34B and down though, I think Nvidia is unquestionably king.
Doesn't running 70B in 24GB need 2 bit quantisation?
I'm no expert, but to me that sounds like a recipe for bad performance. Does a 70B model in 2-bit really outperform a smaller-but-less-quantised model?
2.65bpw, on a totally empty 3090 (and I mean totally empty).
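The arithmetic behind that, roughly (ignoring per-layer quant variation, KV cache growth with context, and runtime overhead):

```python
# Rough VRAM estimate for a 70B model at 2.65 bits per weight (EXL2-style).
params = 70e9
bpw = 2.65
weights_gb = params * bpw / 8 / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")        # ~23.2 GB

# What's left of a 24 GB card for KV cache, activations, and the CUDA context
# is why the card has to be totally empty (nothing else, not even a desktop).
print(f"headroom on 24 GB: ~{24 - weights_gb:.1f} GB")
```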
I would say 34B is the performance sweet spot, yeah. There was a long period where all we had in the 33B range was Llama v1, but now we have Yi and CodeLlama (among others).
Having used Whisper a ton, there are versions of it that get one or two orders of magnitude better performance at the same quality while using less memory, for reasons I don't fully understand.
So I'd be very careful about your intuition on Whisper performance unless it's literally the same software and same model (and even then the comparison still isn't very meaningful, seeing how we want to optimize it for different platforms).
Both your 3090 and M1 Max SDXL numbers should be faster (of course, it depends on how many steps). But the point stands: for SDXL, a 3090 should be 5x to 6x faster than an M1 Max, and 2x to 2.5x faster than an M2 Ultra.
Thank you for sharing this data. I've just been debating between an M2 Max Mac Studio and a 64GB i9 10900X with an RTX 3090 for personal ML use. Glad I chose the 3090! Would love to learn more about your setup.
The M1 Max has 400GB/s of memory bandwidth and a 4090 has 1TB/s. The M1 Max has 32 GPU cores and a 4090 has 16,000. The difference is more about how well the software is optimized for the hardware platform than about any performance difference between the two, which are frankly not comparable in any way.
The Apple M1 Max has 32 GPU cores; each core contains 16 Execution Units and each EU has 8 ALUs (also called shaders), so overall there are 4096 shaders. The Nvidia RTX 4090's AD102 die contains 12 Graphics Processing Clusters, each GPC has 12 Streaming Multiprocessors, and each SM has 128 ALUs, for 18432 shaders on the full die (the shipping 4090 has 128 of those SMs enabled, i.e. 16384 shaders).
A single shader is somewhat similar to a single lane of a vector ALU in a CPU. One can say that a single-core CPU with AVX-512 has 8 shaders, because it can process 8 FP64s at the same time. Calling them "cores" (as in "CUDA core") is extremely misleading, which is why "shader" became the common name for a GPU's ALU. If Nvidia were in charge of marketing a 4-core x86-64 CPU, they would call it a CPU with 32 "AVX cores" because each core has 8-way SIMD.
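Spelling the counting out (per-unit numbers as quoted above, with the full-die vs shipping-4090 distinction made explicit):

```python
# ALU ("shader") counting from the block diagrams quoted above.
m1_max   = 32 * 16 * 8       # 32 GPU cores x 16 EUs x 8 ALUs = 4096
ad102    = 12 * 12 * 128     # 12 GPCs x 12 SMs x 128 ALUs    = 18432 (full die)
rtx_4090 = 128 * 128         # 128 enabled SMs x 128 ALUs     = 16384 (shipping 4090)
print(m1_max, ad102, rtx_4090)

# The AVX analogy: a 4-core CPU with 8-wide SIMD would be a "32 core" part
# under the same counting scheme.
print(4 * 8)
```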
Doesn't Zen 4 have 2x 256-bit FADD and 2x 256-bit FMA, with AVX-512 ops double-pumping the ALUs (a good overview here [0])? If you count FADD as a single flop and FMA as 2, that's 48 "1-flop cores" per core, counting 32-bit lanes.
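Counted out explicitly (32-bit lanes, FADD as 1 flop, FMA as 2):

```python
# Per-core flop-unit counting for Zen 4's FP pipes, using 32-bit lanes.
lanes = 256 // 32            # 8 FP32 lanes per 256-bit pipe
fadd  = 2 * lanes * 1        # 2 FADD pipes, 1 flop per lane  -> 16
fma   = 2 * lanes * 2        # 2 FMA pipes,  2 flops per lane -> 32
print(fadd + fma)            # 48 "1-flop cores" per core
```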
I think it has the same total FP ALU resources as Zen 3, which shows how register width and ALU resources can be completely decoupled.
I think the 4090 has 16000 ALUs, not "cores" (let's call a component capable of executing instructions independently from the others a "core"). And the M1 Max probably has more than 1 ALU in every core, otherwise it would resemble an ancient GPU.
Nvidia switched to marketing speak a long time ago when it came to the word "core". If we go with Nvidia's definition then M1 Max has 4096 cores, still behind the 4090, but the gap isn't as big as 32 to 16k.