Ok, but in the end you're just evaluating a graph, and I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written. So it makes more sense to work on a backend that you can use for any type of model than to hand-optimize everything.
>I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written.
No, that's exactly the problem. Compilers can't, because the GPU hardware and the algorithms involved are such rapidly moving targets. Bespoke, hardware-specific quantization, inference, attention, and kernel compilation are the only way to squeeze out the performance users are looking for.
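To make that concrete, here's a toy sketch (my own example, not from any real library) of a textbook tiled matmul in CUDA. Even in this simplest version, the tile size is a knob with no universally good value: the right choice depends on shared-memory capacity, register pressure, and occupancy targets, which differ from one architecture to the next. A compiler has to guess knobs like this for every op on every target; a hand-tuned kernel just hardcodes what's known to be fast on that chip:

```cuda
#include <cuda_runtime.h>

// TILE is exactly the kind of knob with no single right value:
// the best choice depends on shared-memory size, register pressure,
// and occupancy on the specific chip. 16 here is a placeholder.
#define TILE 16

// Textbook shared-memory tiled matmul: C = A * B, all N x N,
// assuming N is a multiple of TILE for brevity.
__global__ void matmul_tiled(const float* A, const float* B,
                             float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Stage one tile of A and one tile of B in fast shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply-accumulate within the staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Launch: matmul_tiled<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N);
```

This is also, as far as I know, roughly why vendor libraries like cuBLAS ship many kernel variants and pick among them per architecture and problem shape, rather than relying on one compiled-from-a-graph implementation.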
Creating one fast implementation for all models on all hardware would be like writing one GPU driver for all GPUs and OSs. It just isn't going to work, and even if it somehow does, it isn't going to be fast on all hardware.