
So if I'm reading this right, 65B at 4bit would consume around 20GB of VRAM and ~130GB of system RAM?


LLaMA doesn't require any system RAM to run.

It requires only a minimal amount of system RAM to load the model into VRAM and to compile the 4-bit quantized weights.

But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's the 30B model that needs 20GB of VRAM.)
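
A back-of-the-envelope check on those numbers (parameter counts are the published LLaMA sizes; the fixed activation/KV-cache overhead is a guess on my part, not a measurement):

    # Rough VRAM estimate: weights take n_params * bits/8 bytes, plus a
    # fixed allowance for activations and the KV cache (a guess, not measured).
    def vram_gb(n_params, bits=4, overhead_gb=3.0):
        return n_params * bits / 8 / 1024**3 + overhead_gb

    # Published LLaMA parameter counts.
    for name, n in [("7B", 6.7e9), ("13B", 13.0e9), ("30B", 32.5e9), ("65B", 65.2e9)]:
        print(f"LLaMA-{name}: ~{vram_gb(n):.0f} GB of VRAM")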


The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.
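
Roughly speaking, quantization has to hold each full fp16 weight matrix in RAM while the 4-bit values are computed, which is where the memory goes. A toy round-to-nearest sketch of the idea (the repo's actual method is more sophisticated, but the memory pressure is the same):

    import numpy as np

    # Minimal symmetric round-to-nearest 4-bit quantization of one weight
    # matrix. Illustrative only: it shows why the full fp16 weights must
    # sit in RAM while the 4-bit values are computed.
    def quantize_rtn(w, bits=4):
        qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
        scale = float(np.abs(w).max()) / qmax # per-tensor scale factor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    w = np.random.randn(4096, 4096).astype(np.float16)  # one LLaMA-sized matrix
    q, scale = quantize_rtn(w)
    print("max abs reconstruction error:", float(np.abs(w - q * scale).max()))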


Judging from downloads of the 4-bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.

I would not personally call compilation of software part of its "use case." Its use case is text generation.


Quantisation is a one-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.

Or it is probably possible to make it work slowly using a swapfile on Linux.
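
If you go the swap route, free swap counts toward the same budget as RAM. A quick Linux-only check (the ~130GB figure for 65B is the rough number from upthread, not something I've measured):

    # Linux-only: read /proc/meminfo and compare RAM + swap against a
    # rough budget for quantizing 65B (figure from upthread, not measured).
    def meminfo_gb(field):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field):
                    return int(line.split()[1]) / 1024**2  # kB -> GB
        return 0.0

    budget_gb = 130
    available = meminfo_gb("MemAvailable:") + meminfo_gb("SwapFree:")
    print(f"RAM + swap available: {available:.0f} GB (want ~{budget_gb} GB)")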


Closer to 38-40GB VRAM (and hardly any RAM).




