At least with my hardware this runs at about one token every "[size of model]/[speed of SSD reads]" seconds, which (up to some possible further memory reduction so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.
At 125GB and a 2GB/s read speed (largest model, what I get from my SSD) that's about 60 seconds per token (1 day per ~1440 words), which isn't exactly practical. That's really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.
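To make that concrete, here's the same arithmetic written out (a back-of-the-envelope sketch using only the figures above; the variable names are just for illustration):

    # Streaming the whole model from disk for every token means
    # seconds-per-token is simply model size / read bandwidth.
    model_size_gb = 125      # largest (65B) checkpoint, roughly
    read_speed_gb_s = 2      # the SSD read speed quoted above

    seconds_per_token = model_size_gb / read_speed_gb_s   # ~62.5 s/token
    tokens_per_day = 24 * 60 * 60 / seconds_per_token     # ~1380, i.e. the "1 day per ~1440 words" ballpark
    print(seconds_per_token, tokens_per_day)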
You could probably optimize quite a bit for batch throughput if you're ok with the latency though.
Yeah, it does seem like there's a fundamental limit to how fast you can go even if you engineer the data juggling to perfection. My guess is that every pass through the transformer has to visit every weight, and if those weights can't fit in your fastest memory, then you're going to spend time transferring data from the SSD or whatever sits lower in your memory hierarchy.
The quantization used in the post luckily seems to work somewhat well; I'm also wondering if some clever new ways will be invented to reduce the amount of data you need to juggle. Maybe, e.g., not just using 4-bit weights but also compressing them in some way, sorting the weights, or something.
Huffman encoding the weights (treating each 16-bit float as a symbol) could reduce them to roughly 85% of their original size (I calculated this exactly before, but am going from memory). You could maybe get a bit more than that with arithmetic coding (if you managed to decode fast enough), but it shouldn't be much more.
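For anyone who wants to reproduce that kind of estimate, a rough sketch of the calculation (my own code, not from the post; it computes the Shannon entropy of the 16-bit symbol distribution, which is the limit arithmetic coding approaches and which Huffman coding stays within about one bit per symbol of):

    import numpy as np

    def compressed_fraction(weights_fp16: np.ndarray) -> float:
        """Estimate compressed size as a fraction of the original fp16 size,
        treating each 16-bit pattern as one symbol."""
        symbols = weights_fp16.view(np.uint16)
        counts = np.bincount(symbols, minlength=2**16)
        probs = counts[counts > 0] / symbols.size
        entropy_bits = -(probs * np.log2(probs)).sum()  # bits per symbol
        return entropy_bits / 16.0

    # e.g. run this over each weight tensor of a checkpoint and average,
    # weighted by tensor size, to get the overall compression ratio.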
Once you start including lossy steps like quantization, though, it's much less clear. At some point you just reach "knowledge distillation is an open problem".
It requires some very minimal system RAM to load the model into VRAM and to compile the 4-bit quantized weights.
But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's the 30B model that needs 20GB of VRAM.)
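As a rough sanity check on those numbers (just the weight storage at 4 bits per parameter; this ignores activations and the KV cache, which is presumably where the extra headroom in the 20GB figure goes, and the helper name is my own):

    # 4-bit weights: params * 4 bits / 8 bits-per-byte
    def weight_size_gb(n_params_billion, bits_per_weight=4):
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(weight_size_gb(65))   # ~32.5 GB -> "just over 32GB" for 65B
    print(weight_size_gb(30))   # ~15 GB of weights for 30B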
The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.
Judging from downloads of the 4-bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.
I would not personally call compilation of software part of its "use case." Its use case is text generation.
Quantisation is a one-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.
Or it is probably possible to make it work slowly using a swapfile on Linux.
I have a separate branch that streams weights from RAM, at which point I think I was only seeing negligible performance loss compared to storing the weights in VRAM. The bottleneck was compute, not GPU bandwidth.
The 65B model only needs just over 32GB of VRAM to run. It does not need system RAM to run if you use pre-quantized weights, which you can already find in many places.
No need to quantize yourself (besides, it takes almost a day to do 4-bit GPTQ quantization on 3xA6000).
Quantizing is a lossy process; you can't really claim to be running the 65B LLaMA model at that point (though the 65B qgpt-llama does look like it might be very useful).
Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...