
Forgive me if this is not a good question, but is there a difference here between training and inference?


(I assume this question is about whether models need all those layers during training, even if they don't need them during inference)

Yes. There's the so-called "lottery ticket hypothesis". Essentially the idea is that large models start with many randomly initialized subnetworks ("lottery tickets") and that training finds which ones work best. Then it's only natural that during inference we can prune all the "losing tickets" away, even though we need them during training.

It's kind of an open question how large this effect is though. As the article mentions, if you can prune a lot away, this could also just mean that the network isn't optimally trained.
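
As a rough sketch of what the pruning side looks like, here's a toy numpy example of magnitude pruning. The weights are random (so it only shows the mechanics of pruning and measuring the damage, not the lottery-ticket result itself), and the sizes and threshold are arbitrary:

    import numpy as np

    # Toy magnitude pruning: zero out small weights, measure the effect.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256))          # stand-in for a trained weight matrix
    x = rng.normal(size=256)                 # stand-in for an activation vector

    threshold = np.quantile(np.abs(W), 0.9)  # keep only the largest 10% of weights
    W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

    full = W @ x
    pruned = W_pruned @ x
    print("fraction of weights kept:", np.mean(W_pruned != 0))
    print("relative output change:",
          np.linalg.norm(full - pruned) / np.linalg.norm(full))

With an actually trained network you'd do this layer by layer (or subnetwork by subnetwork) and check task accuracy instead of a single matrix-vector product.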


I figured this must be a well-known property of neural networks. I'll do some reading on the lottery ticket hypothesis. That is almost exactly what I was thinking when reading the article: sure after you have trained it you can prune the layers that aren't used. But I wasn't sure you could know/guess which layers will be unused before you train.

It strikes me as an interesting open question since, if it is the case that you need big networks for training but can use significantly smaller "pruned" networks for inference, there are many, many reasons why that might be true. Determining which of the possible reasons is the actual reason may be a key to understanding how LLMs work.


Training is making the model (or rather going from something random and useless to something well calibrated and useful). Inference is using it to make a prediction.

This is saying that you don’t need the entire model to make good predictions for specific subsets of tasks. You can literally remove a large part of the model and it will do fine. Which is not very controversial. The model, after being trained, is a large collection of interacting nodes. When this is talking about dropping chunks of the model it means dropping nodes after training to make predictions. The advantage primarily being that smaller models are cheaper and faster to run or modify with further training.

You know that meme about how you only use 10% of your brain at a time? Well, yeah, but the idiot movies that suggest using 100% of your brain would make you impossibly smarter are not correct. The other 90% just isn't relevant. More brain/model is not better than the relevant subset alone.

The important question to be asking is whether you can remove large chunks of the model without hurting its ability to do well generally on whatever you ask it.

As a very crude example, imagine you trained a simple model to predict rainfall using a weather monitor and the number of farts you did last week. The model will probably learn that the monitor is useful and the farts are irrelevant. If this were as simple as a linear regression, you could just remove the farts coefficient from the equation and the model would produce the same outcomes. Neural nets are not so easily observed, but it's still just dropping the bits that are irrelevant to whatever you're trying to do.
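
In toy numpy form (made-up numbers, just to make that crude example concrete): fit on one informative feature and one irrelevant one, watch the irrelevant coefficient come out near zero, drop it, and check that the predictions barely move.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    monitor = rng.normal(size=n)                    # informative signal
    farts = rng.normal(size=n)                      # irrelevant noise
    rainfall = 3.0 * monitor + rng.normal(scale=0.1, size=n)

    # Least-squares fit with an intercept column
    X = np.column_stack([monitor, farts, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, rainfall, rcond=None)
    print("coefficients (monitor, farts, intercept):", coef)

    pruned = coef.copy()
    pruned[1] = 0.0                                 # remove the farts term
    print("max prediction change after pruning:",
          np.max(np.abs(X @ coef - X @ pruned)))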


Yes. Think of it in terms of fitting a line to some points, y = mx + b. Training is finding the right slope m and intercept b of the line to get a good fit to the points. Inference is when you take an x coordinate and find the y value using the "trained" m and b in the line equation.
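
If it helps to see it in code (toy numpy sketch, made-up data):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 50)
    y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

    m, b = np.polyfit(x, y, deg=1)   # "training": fit slope and intercept
    print("learned m, b:", m, b)

    x_new = 4.2                      # "inference": plug a new x into the fitted line
    print("prediction at x =", x_new, "is", m * x_new + b)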


I'm not sure if that gives me an intuition on the title of the article: "Do large language models need all those layers"

Am I interpreting you correctly if I say: "Finding the slope (training) may require those extra layers, but finding a particular y value given a known x coordinate (inference) may not require those extra layers"?

What I mean is, does the answer to the article's question change if one is considering training vs. inference?


Apologies, I thought you were asking a general question about ML. Will let someone else comment on the specifics here.


I think I’ve misinterpreted it in the same way. I guess you’re asking something like: if we can excise parts of a model without affecting the quality of inferences (in some particular domain), can we do the same with the training step? That is, is it necessary to train a model on a wide variety of topics in order to get high-quality ‘understanding’ for a particular application?

If we don’t need those weights at inference time, why do the computation to train them in the first place?


The real answer is we don't know yet but it's interesting.

To go back to your ax+b example, imagine instead you are fitting a much higher dimensional model, but you don't know how high: ax^n + bx^(n-1) + ... where n might be in the millions, or hundreds of millions, or?? We know that if we make the order high enough (e.g. an order of (number of training points) - 1 gives a "perfect" fit) it will overfit, so we throw some regularization and a bit of handwavy tuning at it and end up with a model of, say, n=7213472123 and a set of coefficients a, b, ... which behaves pretty well, but from its behavior we suspect most of them don't matter, and maybe n should be <= 2 million, or whatever.

So, a few obvious questions. One is whether we can find a way to throw out most of the a, b, c, ... to get just the core, i.e. if we throw away every coefficient with |k| <= 0.00001, does it change anything (for inference)? A very different question is whether we could decide that ahead of time (during training). A different class of question looks more like "could we have figured this out from the data?"

It's a lot harder to reason about the latter questions, because the former one is empirical: after training, this one doesn't seem to do anything. Ahead of time, how do you know? This has interesting offshoots, like how stable the distribution of the parts that matter is, etc.
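
Scaled way down, the empirical version of the first question looks something like this (toy numpy sketch; the degree, threshold, and regularization strength are all arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(-1, 1, 40)
    y = np.sin(3 * x) + rng.normal(scale=0.05, size=x.size)

    degree = 25                                    # deliberately far too high order
    X = np.vander(x, degree + 1)                   # polynomial feature matrix
    lam = 1e-3                                     # handwavy regularization
    coef = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

    # Throw away the tiny coefficients and see if the fit changes (for inference)
    pruned = np.where(np.abs(coef) > 1e-2, coef, 0.0)
    print("coefficients kept:", np.count_nonzero(pruned), "of", coef.size)
    print("max prediction change:", np.max(np.abs(X @ coef - X @ pruned)))

Deciding that ahead of time, before you've seen the fitted coefficients, is the much harder version.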


wouldn't the most accurate 2D geometric analogy be the Discrete Fourier Transform?


The use of the word ‘inference’ in this context can seem a bit weird, but I think it’s borrowed from statistics and it’s quite standard.

Training = optimising model parameters to ‘learn’ from data.

Inference = asking the model to make a prediction, usually assuming the model is already trained.

Instead of inference, you could say running/querying the model.



