
Yes. Think of it in terms of fitting a line to some points y=mx+b. Training is finding the right slope m and intercept b of the line to get a good fit to the points. Inference is when you take an x coordinate and find the y value using the "trained" m and b in the line equation
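A minimal sketch of that analogy in Python (the data points are made up, just to make the two steps concrete):

    import numpy as np

    # "Training": fit slope m and intercept b to some (x, y) points.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # roughly y = 2x + 1, with noise
    m, b = np.polyfit(x, y, deg=1)             # least-squares line fit

    # "Inference": plug a new x into the trained line.
    x_new = 2.5
    y_pred = m * x_new + b
    print(f"m={m:.2f}, b={b:.2f}, prediction at x={x_new}: {y_pred:.2f}")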


I'm not sure that gives me an intuition for the title of the article: "Do large language models need all those layers?"

Am I interpreting you correctly if I say: "Finding the slope (training) may require those extra layers, but finding a particular y value given a known x coordinate (inference) may not require those extra layers"?

What I mean is, does the answer to the article's question change if one is considering training vs. inference?


Apologies, I thought you were asking a general question about ML. Will let someone else comment on the specifics here.


I think I’ve misinterpreted it in the same way. I guess you’re asking something like: if we can excise parts of a model without affecting the quality of inferences (in some particular domain), can we do the same with the training step? That is, is it necessary to train a model on a wide variety of topics in order to get high-quality ‘understanding’ for a particular application?

If we don’t need those weights at inference time, why do the computation to train them in the first place?


The real answer is we don't know yet but it's interesting.

To go back to your mx+b example, imagine instead you are fitting a much higher-dimensional model, but you don't know how high: ax^n + bx^(n-1) + ... where n might be in the millions, or hundreds of millions, or?? We know that if we make the model high enough order it will overfit (a polynomial of degree n-1 can pass exactly through n training points), so we throw in some regularization and a bit of handwavy tuning and we end up with a model of say n=7213472123 and a set of coefficients a, b, ... which behaves pretty well, but from its behavior we suspect most of them don't matter. Maybe n should really be <= 2 million, or whatever.

So, a few obvious questions. One is: can we find a way to throw out most of the a, b, c, ... and keep just the core, i.e. if we throw away every coefficient with |k| <= 0.00001, does it change anything (for inference)? A very different question is whether we could decide that ahead of time (during training). A different class of question looks more like "could we have figured this out from the data?"
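Here's a rough sketch of the first (empirical) question in that polynomial setting, using a plain ridge fit and a magnitude threshold — the degree, regularization strength, and threshold are arbitrary numbers I picked for illustration, not anything from the article:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 40)
    y = np.sin(3 * x) + 0.05 * rng.standard_normal(x.shape)   # toy data

    degree = 30                                    # far more terms than the data needs
    X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ..., x^degree

    # "Training": ridge-regularized least squares, w = (X'X + lam*I)^-1 X'y
    lam = 1e-3
    w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

    # Post-hoc pruning: throw away every coefficient below some magnitude threshold
    threshold = 1e-4
    w_pruned = np.where(np.abs(w) > threshold, w, 0.0)

    # "Inference": does dropping the tiny coefficients change the predictions?
    x_test = np.linspace(-1, 1, 200)
    X_test = np.vander(x_test, degree + 1, increasing=True)
    diff = np.abs(X_test @ w - X_test @ w_pruned).max()
    print(f"kept {np.count_nonzero(w_pruned)} of {w.size} coefficients, "
          f"max prediction change: {diff:.2e}")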

It's a lot harder to reason about the latter questions, because the former one is empirical: after training, you can just check that a given coefficient doesn't seem to do anything. Ahead of time, how do you know? This has interesting offshoots, like how stable the distribution of the parts that matter is, etc.


Wouldn't the most accurate 2D geometric analogy here be the Discrete Fourier Transform?
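If I'm reading that analogy right — a basis where only a few coefficients carry the signal and the rest can be zeroed out — here's a small FFT illustration (the signal and threshold are arbitrary choices of mine):

    import numpy as np

    # A signal built from a couple of frequencies, plus a little noise.
    rng = np.random.default_rng(0)
    t = np.arange(256)
    signal = np.sin(2 * np.pi * 5 * t / 256) + 0.5 * np.sin(2 * np.pi * 12 * t / 256)
    signal += 0.01 * rng.standard_normal(t.shape)

    coeffs = np.fft.rfft(signal)                      # "all the parameters"
    mask = np.abs(coeffs) > 0.1 * np.abs(coeffs).max()
    pruned = np.where(mask, coeffs, 0)                # keep only the big ones

    recon = np.fft.irfft(pruned, n=signal.size)
    print(f"kept {mask.sum()} of {coeffs.size} coefficients, "
          f"max reconstruction error: {np.abs(signal - recon).max():.3f}")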



