Thanks! Yeah, it seems like a lot can be done with just text while we wait for multimodal models to catch up. The recent Platonic Representation Hypothesis [1] also suggests that different models, regardless of modality, build the same internal representations of the world.
[1] https://arxiv.org/abs/2405.07987