
So if I'm reading this right, 65B at 4bit would consume around 20GB of VRAM and ~130GB of system RAM?


LLaMA doesn't require any system RAM to run.

It requires only a minimal amount of system RAM to load the model into VRAM and to compile the 4-bit quantized weights.

But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's the 30B model that needs 20GB of VRAM.)
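
A back-of-the-envelope check on those numbers (parameter counts are the published LLaMA sizes; the fixed activation/KV-cache overhead is a guess on my part, not a measurement):

    # Rough VRAM estimate: weights take n_params * bits/8 bytes, plus a
    # fixed allowance for activations and the KV cache (a guess, not measured).
    def vram_gb(n_params, bits=4, overhead_gb=3.0):
        return n_params * bits / 8 / 1024**3 + overhead_gb

    # Published LLaMA parameter counts.
    for name, n in [("7B", 6.7e9), ("13B", 13.0e9), ("30B", 32.5e9), ("65B", 65.2e9)]:
        print(f"LLaMA-{name}: ~{vram_gb(n):.0f} GB of VRAM")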


The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.
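
Roughly speaking, quantization has to hold each full fp16 weight matrix in RAM while the 4-bit values are computed, which is where the memory goes. A toy round-to-nearest sketch of the idea (the repo's actual method is more sophisticated, but the memory pressure is the same):

    import numpy as np

    # Minimal symmetric round-to-nearest 4-bit quantization of one weight
    # matrix. Illustrative only: it shows why the full fp16 weights must
    # sit in RAM while the 4-bit values are computed.
    def quantize_rtn(w, bits=4):
        qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
        scale = float(np.abs(w).max()) / qmax # per-tensor scale factor
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    w = np.random.randn(4096, 4096).astype(np.float16)  # one LLaMA-sized matrix
    q, scale = quantize_rtn(w)
    print("max abs reconstruction error:", float(np.abs(w - q * scale).max()))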


Judging from downloads of the 4-bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.

I would not personally call compilation of software part of its "use case." Its use case is text generation.


Quantisation is a one-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.

Or it is probably possible to make it work slowly using a swapfile on Linux.
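
If you go the swap route, free swap counts toward the same budget as RAM. A quick Linux-only check (the ~130GB figure for 65B is the rough number from upthread, not something I've measured):

    # Linux-only: read /proc/meminfo and compare RAM + swap against a
    # rough budget for quantizing 65B (figure from upthread, not measured).
    def meminfo_gb(field):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field):
                    return int(line.split()[1]) / 1024**2  # kB -> GB
        return 0.0

    budget_gb = 130
    available = meminfo_gb("MemAvailable:") + meminfo_gb("SwapFree:")
    print(f"RAM + swap available: {available:.0f} GB (want ~{budget_gb} GB)")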


Closer to 38-40GB VRAM (and hardly any RAM).




