I think I’ve misinterpreted it in the same way. I guess you’re asking something like: if we can excise parts of a model without affecting the quality of its inferences (in some particular domain), can we do the same with the training step? That is, is it necessary to train a model on a wide variety of topics in order to get high-quality ‘understanding’ for a particular application?
If we don’t need those weights at inference time, why do the computation to train them in the first place?
The real answer is that we don't know yet, but it's an interesting question.
To go back to your ax+b example, imagine instead you're fitting a much higher-dimensional model, but you don't know how high: ax^n + bx^(n-1) + ..., where n might be in the millions, or hundreds of millions, or more. We know that if we make the model high enough order it will overfit (e.g., with only n-1 training points a degree-n polynomial will fit them perfectly), so we throw in some regularization and a bit of hand-wavy tuning and end up with a model of, say, n = 7213472123 and a set of coefficients a, b, ..., which behaves pretty well, but from its behavior we suspect most of them don't matter, and that n maybe should be <= 2 million, or whatever.
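To make that concrete, here's a tiny toy sketch of that setup (the degree, the ridge penalty, and the "tiny" threshold are all made-up numbers for illustration): fit a polynomial of far higher degree than the data needs, with some regularization, and count how many coefficients end up negligibly small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data secretly generated by a cubic, plus noise.
x = rng.uniform(-1, 1, size=200)
y = 0.5 * x**3 - 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

degree = 50                    # deliberately far too high an order
X = np.vander(x, degree + 1)   # columns x^degree, ..., x^1, 1

# Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

tiny = np.abs(w) <= 1e-3
print(f"{tiny.sum()} of {w.size} coefficients have magnitude <= 1e-3")
```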
So, a few obvious questions. One is whether we can find a way to throw out most of the a, b, c, ... and keep just the core, i.e., if we throw away every coefficient with |k| <= 0.00001, does it change anything (for inference)? A very different question is whether we could decide that ahead of time (during training). A different class of question looks more like "could we have figured this out from the data?"
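The first (after-the-fact) question is easy to check in the toy setting. This repeats the same toy fit as above, zeroes out every coefficient below a threshold, and compares predictions on fresh inputs; the 1e-3 cutoff is an arbitrary stand-in for the 0.00001 above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy fit as in the previous sketch.
x = rng.uniform(-1, 1, size=200)
y = 0.5 * x**3 - 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)
degree, lam = 50, 1e-2
X = np.vander(x, degree + 1)
w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

# Post-hoc pruning: zero out every coefficient below the threshold, then
# see whether inference on fresh inputs changes in any meaningful way.
threshold = 1e-3
w_pruned = np.where(np.abs(w) <= threshold, 0.0, w)

x_test = rng.uniform(-1, 1, size=1000)
X_test = np.vander(x_test, degree + 1)
print("kept", int(np.count_nonzero(w_pruned)), "of", w.size, "coefficients")
print("max prediction change:", np.max(np.abs(X_test @ w - X_test @ w_pruned)))
```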
It's a lot harder to reason about the latter questions, because the former one is empirical: after training, you can just check whether a given coefficient seems to do anything. Ahead of time, how do you know? This has interesting offshoots, like how stable the distribution of the parts that matter is, etc.
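One way to read that stability question, still in the same toy setting with the same made-up thresholds, is to refit on several resampled datasets and see how much the set of above-threshold coefficients overlaps from run to run:

```python
import numpy as np

def important_coeffs(seed, degree=50, lam=1e-2, threshold=1e-3):
    """Fit the toy over-parameterized polynomial on a fresh sample and
    return the indices of coefficients above the magnitude threshold."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=200)
    y = 0.5 * x**3 - 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)
    X = np.vander(x, degree + 1)
    w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return set(np.flatnonzero(np.abs(w) > threshold))

runs = [important_coeffs(seed) for seed in range(5)]
print("important in every run:", len(set.intersection(*runs)),
      "| important in at least one run:", len(set.union(*runs)))
```

If the intersection is close to the union, the "parts that matter" are stable across runs; if not, which coefficients matter depends heavily on the particular sample, which would make deciding ahead of time that much harder.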