There are two problems with this: a) natural language is an inherently poor medium for giving artistic direction compared to higher-order methods like sketching and references, even if you had a human on the other end of the wire, and b) to create something conceptually appealing/novel, the model would need much better conceptualizing ability than the best current LLMs have, and those already need some mighty hardware to run. Besides, tweaking the prompt will probably never be stable, partly for the reasons outlined in the OP; although you could optimize for that, I guess.
That said, better understanding is always welcome. DeepFloyd IF tried to pair a full-fledged transformer with a diffusion part (albeit with only 11B parameters). It improved the understanding of complex prompts like "koi fish doing a handstand on a skateboard", but it also pushed the hardware requirements way up and didn't solve the fundamental issues above.
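For context, here's roughly what that pairing looks like in practice: a minimal sketch of running the first two stages of DeepFloyd IF's cascade (frozen T5 text encoder feeding pixel-space diffusion UNets) via Hugging Face diffusers. The model IDs and call pattern are from the diffusers docs as I remember them and may have changed; even in fp16 with CPU offload, the text encoder alone is what blows up the VRAM budget.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil

# Stage I: T5-XXL text encoder + 64x64 base diffusion UNet
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()  # offload to CPU between calls to fit on a single consumer GPU

# Stage II: 64 -> 256 super-resolution UNet, reusing the stage I text embeddings
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

prompt = "koi fish doing a handstand on a skateboard"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images
image = stage_2(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images

pt_to_pil(image)[0].save("koi_handstand.png")
```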
I think you're right about the current limitations, but imagine a trillion or ten trillion parameter model trained and RLHF'd for this specific use case. It may take a year or two, but I see no reason to think it isn't coming.
Yes, hardware requirements will be steep, but it will still be cheap compared to equivalent human illustrators. And compute costs will go down in the long run.