Doesn't running 70B in 24GB need 2-bit quantisation?
I'm no expert, but to me that sounds like a recipe for bad performance. Does a 70B model in 2-bit really outperform a smaller-but-less-quantised model?
2.65bpw, on a totally empty 3090 (and I mean totally empty).
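For a rough sense of why the card has to be empty, here's a back-of-envelope sketch (my own illustrative numbers, not exact ExLlamaV2 figures):

```python
# Back-of-envelope VRAM estimate for quantized weights.
# Illustrative assumption: VRAM for weights ~= params * bits-per-weight / 8.

def weight_vram_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate GiB needed just for the quantized weights."""
    return n_params * bits_per_weight / 8 / 2**30

params_70b = 70e9
bpw = 2.65  # fractional bits per weight, as in the quant mentioned above

print(f"~{weight_vram_gib(params_70b, bpw):.1f} GiB for weights alone")
# -> ~21.6 GiB, leaving only ~2 GiB of a 24 GiB 3090 for the KV cache,
#    activations, and CUDA overhead -- hence "totally empty".
```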
I would say 34B is the performance sweet spot, yeah. There was a long period where all we had in the 33B range was LLaMA v1, but now we have Yi and CodeLlama v2 (among others).