Reminds me of [Language as Intermediate Representation](https://chrisvoncsefalvay.com/posts/lair/): LLMs are optimized for language, so if you translate an image into language first, they'll do better at modeling it.
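For concreteness, here's a minimal sketch of that idea: caption the image with a vision model, then hand the caption to a text-only LLM. The model names and the image path are placeholders I chose for illustration, not anything from the linked post.

```python
from transformers import pipeline

# Step 1: translate the image into language (image -> caption).
# "Salesforce/blip-image-captioning-base" is just one off-the-shelf captioner.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("photo.jpg")[0]["generated_text"]  # hypothetical image path

# Step 2: let a text-optimized LLM reason over the caption instead of pixels.
llm = pipeline("text-generation", model="gpt2")
prompt = f"Image description: {caption}\nQuestion: What is happening here?\nAnswer:"
print(llm(prompt, max_new_tokens=50)[0]["generated_text"])
```

Any captioner/LLM pair would work; the point is that the intermediate representation the LLM consumes is language rather than pixels.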
Cool connection; I hadn't seen this before, but it feels intuitively correct! I develop similar (if a bit more out-there) philosophical thoughts in Section 5.3 of my undergrad thesis [1], describing a word's meaning by the topological structure of its corresponding images in embedding space.