Interesting, thanks for sharing! I share some of the concerns others have raised about this piece, but I’m most shocked by their finding that image generation is cheaper than text generation. I’ve gone down this rabbit hole multiple times, and this runs against every single paper I’ve ever cited on the topic. Does anyone know why? Maybe this is a recent change? It also doesn’t help that multimodal transformers are now blurring the line between image and text models, of course… though this article doesn’t even address that, treating all image models as diffusion models.