I think it's very surprising, although I would like the paper to show more experiments (they already have a lot, I know).
ViT models are never really trained from scratch; they are always finetuned, since they require large amounts of data to converge nicely, and the pretraining just provides a nice initialization. Why would one expect two ViTs finetuned on two different tasks, image and text classification, to end up in the same subspace, as they show? I think this is groundbreaking.
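Just to be concrete about what "ending up in the same subspace" could mean: one crude way to check it is to compare the task vectors (weight deltas from the shared pretrained parent) of the two finetunes. A minimal sketch below, not the paper's protocol; the state dicts are synthetic placeholders, not real checkpoints.

```python
# Rough sketch: compare the directions two finetunes moved in, relative to
# the shared pretrained base, via the cosine similarity of their task vectors.
# base_sd / ft_a_sd / ft_b_sd are placeholder state dicts, not real models.
import torch

def task_vector(base_sd, ft_sd):
    """Flatten the concatenated weight deltas (finetuned - base) into one vector."""
    return torch.cat([(ft_sd[k] - base_sd[k]).flatten() for k in base_sd])

def cosine_overlap(base_sd, ft_a_sd, ft_b_sd):
    """Cosine similarity between two task vectors; a value near 0 would mean
    the two finetunes moved in roughly orthogonal directions."""
    va, vb = task_vector(base_sd, ft_a_sd), task_vector(base_sd, ft_b_sd)
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()

# Tiny synthetic demo with random "layers" standing in for ViT weights.
torch.manual_seed(0)
base = {f"layer{i}.weight": torch.randn(64, 64) for i in range(4)}
ft_a = {k: v + 0.01 * torch.randn_like(v) for k, v in base.items()}
ft_b = {k: v + 0.01 * torch.randn_like(v) for k, v in base.items()}
print(f"cosine overlap of task vectors: {cosine_overlap(base, ft_a, ft_b):.4f}")
```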
I don't really agree with the idea that finetuned models don't drift far from the parent model: I think they drift pretty far in terms of their norms. Even the small LoRA adapters drift pretty far from the base model.
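By "drift in terms of norms" I mean something like the relative norm of the weight change. A minimal sketch of how I'd eyeball it, assuming PyTorch state dicts and the usual delta = B @ A LoRA parameterization; all tensors here are synthetic placeholders, not measurements from the paper.

```python
# Sketch: quantify drift as ||W_ft - W_base|| / ||W_base||, aggregated over
# all parameters, plus the same ratio for a single LoRA update B @ A.
import torch

def relative_drift(base_sd, ft_sd):
    """Relative L2 drift of the full finetune from the base weights."""
    num = sum(torch.linalg.vector_norm(ft_sd[k] - base_sd[k]) ** 2 for k in base_sd)
    den = sum(torch.linalg.vector_norm(base_sd[k]) ** 2 for k in base_sd)
    return (num / den).sqrt().item()

torch.manual_seed(0)
base = {f"layer{i}.weight": torch.randn(64, 64) for i in range(4)}
ft = {k: v + 0.05 * torch.randn_like(v) for k, v in base.items()}
print(f"full finetune relative drift: {relative_drift(base, ft):.3f}")

# LoRA: even a rank-8 adapter's update B @ A can carry a non-trivial norm
# relative to the frozen base weight it modifies.
W = base["layer0.weight"]
A, B = torch.randn(8, 64) * 0.1, torch.randn(64, 8) * 0.1
delta = B @ A
ratio = torch.linalg.matrix_norm(delta) / torch.linalg.matrix_norm(W)
print(f"LoRA relative drift on one layer: {ratio.item():.3f}")
```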