This is basically just a rehash of the fact that a "trained" DNN is a function that depends strongly on its initialization parameters (easily provable).
It would be awesome to have a way of finding good initializations in advance, but this is also just another case for avoiding pure DNNs because of their strong reliance on initialization parameters.
Looking at transformers by comparison, you see a much weaker dependence of the model on the initial parameters. Does that mean the model is better or worse at learning, or just more stable?
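To make the "strong dependence on initialization" claim concrete, here's a minimal sketch (assuming PyTorch; the toy MLP, data, and hyperparameters are illustrative, not from the thread): train the same small network twice from different random seeds on the same data and measure how much the two learned functions disagree on held-out inputs.

```python
import torch
import torch.nn as nn

def train_mlp(seed, X, y, steps=500):
    # Only the random initialization changes between runs; data and
    # optimization are otherwise identical (full-batch deterministic SGD).
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return model

# Toy regression data shared by both runs.
X = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(X) + 0.1 * torch.randn_like(X)

m1 = train_mlp(seed=0, X=X, y=y)
m2 = train_mlp(seed=1, X=X, y=y)

# Disagreement between the two trained functions on held-out inputs:
# a large value indicates strong dependence on the initialization.
X_test = torch.linspace(-4, 4, 400).unsqueeze(1)
with torch.no_grad():
    disagreement = (m1(X_test) - m2(X_test)).abs().mean()
print(f"mean |f_seed0(x) - f_seed1(x)| on held-out inputs: {disagreement:.4f}")
```

The same harness could be pointed at a small transformer to compare seed-to-seed variability across architectures, which is the comparison the comment above is gesturing at.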
This is an interesting insight I hadn't thought much about before. It reminds me a bit of the mechanistic interpretability work on branch specialization in CNNs, which found that architectures with built-in branches tended to have those branches specialize in a way that was consistent across multiple training runs [1]. Maybe the multi-headed and branching nature of transformers adds an inductive bias that is useful for stable training at larger scales.