
A 13B model in 4-bit GPTQ has practically no quality loss and runs in 8 GB. I get 8 tokens/second on my laptop CPU and 4 tokens/second on my phone CPU.

Even 33B needs only 20 GB of VRAM in 4-bit GPTQ.
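For anyone who wants to sanity-check memory figures like these, here is a rough back-of-the-envelope sketch. It counts the weights only and assumes exactly N bits per parameter; real GPTQ files also store per-group scales and zero points, and inference needs extra room for the KV cache and activations, so actual usage lands somewhat above these numbers. The function name `weight_gb` is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal GB.

    Assumption: every parameter takes exactly `bits_per_weight` bits.
    Quantization metadata, KV cache and activations are NOT included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare the model sizes and bit widths discussed in this thread.
    for params in (13, 33):
        for bits in (4, 8, 16):
            print(f"{params}B @ {bits}-bit: ~{weight_gb(params, bits):.1f} GB of weights")
```

By this estimate the 4-bit weights come to about 6.5 GB for 13B and 16.5 GB for 33B, which leaves a few GB of headroom under the 8 GB and 20 GB figures for everything else.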

8-bit has essentially zero perplexity loss, so there's really no reason to run in 16-bit.
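In case "perplexity loss" is unfamiliar: perplexity is the exponential of the average negative log-likelihood per token on some evaluation text, and the "loss" is how much that number rises after quantization. A minimal sketch of the calculation follows; the helper name and the idea of passing in a plain list of per-token log-probabilities are assumptions for illustration, not any particular library's API.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Computed as exp(-mean(log p)). Lower is better; a quantized model
    with "no perplexity loss" scores the same as the 16-bit original
    on the same evaluation text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

To compare two versions of a model, you would score the same text with both and look at the difference between the two perplexity values.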

Even a $200 P40 with 24 GB is enough to run 33B in 4-bit GPTQ at extremely high speeds.


