I'm more interested in how this "cross attention" part works.
Being able to combine two different kinds of AI sounds too good to be true. It sounds like AGI. Why does it work for SD? Why aren't we trying to combine more AIs to create a super AI? Or are we already doing this?
Cross attention is not really a way to "combine multiple AI models"; it is a layer inside one network that lets it condition on another network's output. There are many ways to combine models, though, and diffusion models are especially easy to combine with other things, thanks in part to tricks like score distillation (see dreamfusion3d.github.io). But none of this is anything like AGI, because the AI is not inventing the combinations itself, and even if it could, there is no clear way to make it self-directed. These are still processes that require lots of programmers being very clever.
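Since you asked how the cross attention part works, here is a rough sketch of a single cross-attention step in plain Python. It is only an illustration under some assumptions: the sizes are tiny, the weights are random rather than trained, and all the names (`cross_attention`, `image_feats`, `text_feats`, `w_q`, `w_k`, `w_v`) are made up for this example, not taken from the SD code. The idea is that queries come from the image latents while keys and values come from the text embeddings, so each image position takes a weighted mix of the text tokens.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_feats, w_q, w_k, w_v):
    """Single-head cross attention: image positions attend to text tokens."""
    q = matmul(image_feats, w_q)  # queries from the image side
    k = matmul(text_feats, w_k)   # keys from the text side
    v = matmul(text_feats, w_v)   # values from the text side
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # one row per image position
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or real features."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image_feats = random_matrix(4, 6, rng)  # 4 image positions, 6 dims each
text_feats = random_matrix(3, 5, rng)   # 3 text tokens, 5 dims each
w_q = random_matrix(6, 2, rng)          # project image features to 2 dims
w_k = random_matrix(5, 2, rng)          # project text features to 2 dims
w_v = random_matrix(5, 2, rng)

out, weights = cross_attention(image_feats, text_feats, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and says how much one image position draws from each text token. Note that the two inputs can have different sizes (6 vs 5 dims here) because the projections bring them into a shared space, which is roughly why a text encoder's output can steer an image model.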