Ok, but in the end you're just evaluating a graph, and I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written. So it makes more sense to work on a backend that you can use for any type of model than to hand-optimize everything.
>I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written.
No, that's exactly the problem. Compilers can't, because the GPU hardware and the algorithms involved are such rapidly moving targets. Bespoke, hardware-specific quantization, inference, attention, and kernel compilation are the only way to squeeze out the performance users are looking for.
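To make that concrete, here's a toy sketch (my own example, not from any real library) of a textbook tiled matmul in CUDA. Even in this simplest version, the tile size is a knob with no universally good value: the right choice depends on shared-memory capacity, register pressure, and occupancy targets, which differ from one architecture to the next. A compiler has to guess knobs like this for every op on every target; a hand-tuned kernel just hardcodes what's known to be fast on that chip:

```cuda
#include <cuda_runtime.h>

// TILE is exactly the kind of knob with no single right value:
// the best choice depends on shared-memory size, register pressure,
// and occupancy on the specific chip. 16 here is a placeholder.
#define TILE 16

// Textbook shared-memory tiled matmul: C = A * B, all N x N,
// assuming N is a multiple of TILE for brevity.
__global__ void matmul_tiled(const float* A, const float* B,
                             float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Stage one tile of A and one tile of B in fast shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply-accumulate within the staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Launch: matmul_tiled<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N);
```

This is also, as far as I know, roughly why vendor libraries like cuBLAS ship many kernel variants and pick among them per architecture and problem shape, rather than relying on one compiled-from-a-graph implementation.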
Creating one fast implementation for all models on all hardware would be like writing one GPU driver for all GPUs and OSs. It just isn't going to work, and even if it somehow does, it isn't going to be fast on all hardware.