At least with my hardware this runs at about one token every "[size of model]/[speed of SSD reads]" seconds, which (up to some possible further memory reduction so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.
At 125GB and a 2GB/s read speed (largest model, what I get from my SSD) that's about 60 seconds per token (1 day per ~1440 words), which isn't exactly practical. That's really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.
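To make that concrete, here's the same arithmetic written out (a back-of-the-envelope sketch using only the figures above; the variable names are just for illustration):

    # Streaming the whole model from disk for every token means
    # seconds-per-token is simply model size / read bandwidth.
    model_size_gb = 125      # largest (65B) checkpoint, roughly
    read_speed_gb_s = 2      # the SSD read speed quoted above

    seconds_per_token = model_size_gb / read_speed_gb_s   # ~62.5 s/token
    tokens_per_day = 24 * 60 * 60 / seconds_per_token     # ~1380, i.e. the "1 day per ~1440 words" ballpark
    print(seconds_per_token, tokens_per_day)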
You could probably optimize quite a bit for batch throughput if you're ok with the latency though.
Yeah, it does seem like there's a fundamental limit to how fast you can go even if you engineer the data juggling to perfection. My guess is that every pass through the transformer has to visit every weight, and if those weights can't fit in your fastest memory, then you're going to spend time transferring data from the SSD or whatever sits lower in your memory hierarchy.
The quantization used in the post luckily seems to work somewhat well; I'm also wondering if some clever new ways will be invented to reduce the amount of data you need to juggle. Maybe, e.g., not just using 4-bit weights but also compressing them in some way, sorting the weights, or something.
Huffman encoding the weights (treating each 16-bit float as a symbol) could reduce them to roughly 85% of their original size (I calculated this exactly before, but am going from memory). You could maybe get a bit more than that with arithmetic coding (if you managed to decode fast enough), but it shouldn't be much more.
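For anyone who wants to reproduce that kind of estimate, a rough sketch of the calculation (my own code, not from the post; it computes the Shannon entropy of the 16-bit symbol distribution, which is the limit arithmetic coding approaches and which Huffman coding stays within about one bit per symbol of):

    import numpy as np

    def compressed_fraction(weights_fp16: np.ndarray) -> float:
        """Estimate compressed size as a fraction of the original fp16 size,
        treating each 16-bit pattern as one symbol."""
        symbols = weights_fp16.view(np.uint16)
        counts = np.bincount(symbols, minlength=2**16)
        probs = counts[counts > 0] / symbols.size
        entropy_bits = -(probs * np.log2(probs)).sum()  # bits per symbol
        return entropy_bits / 16.0

    # e.g. run this over each weight tensor of a checkpoint and average,
    # weighted by tensor size, to get the overall compression ratio.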
Once you start including lossy steps like quantization, though, it's much less clear. At some point you just reach "knowledge distillation is an open problem".
It requires some very minimal system RAM to load the model into VRAM and to compile the 4-bit quantized weights.
But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's the 30B model that needs 20GB of VRAM.)
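As a rough sanity check on those numbers (just the weight storage at 4 bits per parameter; this ignores activations and the KV cache, which is presumably where the extra headroom in the 20GB figure goes, and the helper name is my own):

    # 4-bit weights: params * 4 bits / 8 bits-per-byte
    def weight_size_gb(n_params_billion, bits_per_weight=4):
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(weight_size_gb(65))   # ~32.5 GB -> "just over 32GB" for 65B
    print(weight_size_gb(30))   # ~15 GB of weights for 30B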
The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.
Judging from downloads of the 4-bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.
I would not personally call compilation of software part of its "use case." Its use case is text generation.
Quantisation is a one-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.
Or it is probably possible to make it work slowly using a swapfile on Linux.
I have a separate branch that streams weights from RAM, at which point I think I was only seeing negligible performance loss compared to storing the weights in VRAM. The bottleneck was compute, not GPU bandwidth.
The 65B model only needs just over 32GB of VRAM to run. It does not need system RAM to run if you use pre-quantized weights, which you can already find in many places.
No need to quantize yourself (besides, it takes almost a day to do 4-bit GPTQ quantization on 3xA6000).
Quantizing is a lossy process; you can't really claim to be running the 65B LLaMA model at that point (though the 65B qgpt-llama does look like it might be very useful).
Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...