
A random n×n matrix is full rank with probability 1... So full rank is really the default: any amount of noise in the embedding is going to result in full-rank transformations.

So it's really a less-than-full-rank result that would require an explanation - i.e., why does this image representation project into this perfectly isolated subspace of the language representation (or vice versa)?
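(A minimal numpy sketch of the "full rank is the default" point, just to make it concrete - the matrix sizes here are arbitrary, and matrix_rank is a numerical check via SVD with a tolerance, not a proof. But rank deficiency is a measure-zero event for continuous random entries, so you'd essentially never see a False:)

  import numpy as np

  # With continuous (Gaussian) entries, the set of rank-deficient matrices
  # has measure zero, so every draw should come back full rank.
  for n in (4, 16, 64, 256):
      A = np.random.randn(n, n)
      print(n, np.linalg.matrix_rank(A) == n)  # expect: True every time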

If that kind of clean split happened, I would start looking for things like a vocabulary of smell that is completely distinct and non-overlapping with any visual context. But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose'), so you wouldn't expect any clean separations for different modalities... Maybe there's some branch of analytic philosophy which has managed to completely divorce itself from the physical world...



> But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities...

That's a really good point. Thank you!




