They specifically call out fp8-aware training, and TensorRT-LLM is really good (efficient) with fp8 inference on H100 and other Hopper cards. It's possible that they run the 7B natively in fp16, since smaller models suffer more from even "modest" quantization like this.
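If you want a rough feel for why a small model might stay in fp16, here's a toy sketch (mine, not anything from their pipeline) comparing the round-trip error of fp16 vs fp8 e4m3 on random weights; the 0.02 weight scale and the use of PyTorch's float8 dtype are just illustrative assumptions:

```python
# Toy comparison: round-trip error of fp16 vs fp8 (e4m3), the format
# TensorRT-LLM targets on Hopper. Needs PyTorch >= 2.1 for float8 dtypes.
import torch

weights = torch.randn(4096, 4096) * 0.02  # assumed "typical" weight scale

def roundtrip_error(w: torch.Tensor, dtype: torch.dtype) -> float:
    """Mean relative error after casting to `dtype` and back to fp32."""
    wq = w.to(dtype).to(torch.float32)
    return ((w - wq).abs() / (w.abs() + 1e-8)).mean().item()

print(f"fp16 error: {roundtrip_error(weights, torch.float16):.4%}")
print(f"fp8  error: {roundtrip_error(weights, torch.float8_e4m3fn):.4%}")
# e4m3 has only a 3-bit mantissa, so the per-weight error is much larger;
# a 7B model has less redundancy to absorb that than a 70B one.
```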
For the benchmarks, it depends on how you interpret them. The other models are quite popular, so many people have a starting point. Now, if you regularly use them you can assess: "just 3% better on some benchmark, 80% to 83%, at the cost of almost twice the inference time and base RAM requirement, but with a 16x context window, and for commercial usage..." and at the end ask: "for my use case, is it worth it?"
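A literal back-of-the-envelope version of that last question, using only the hypothetical numbers above (the 2x cost ratio is an assumption, not from any spec):

```python
# Toy cost/benefit check: extra cost per point of quality, ignoring the
# 16x context window, which only your own workload can put a price on.
old_score, new_score = 80.0, 83.0   # benchmark scores from the example above
old_cost, new_cost = 1.0, 2.0       # assumed relative inference cost

extra_cost_pct = (new_cost / old_cost - 1) * 100       # +100% cost
quality_gain_pct = (new_score / old_score - 1) * 100   # +3.75% quality
print(f"{extra_cost_pct / quality_gain_pct:.1f}% extra cost per 1% quality gain")
# -> ~26.7% extra cost per 1% gain; whether that's acceptable is use-case specific.
```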
"It significantly outperforms existing models smaller or similar in size."
is a statement that goes in that direction and would allow comparing a 1.7T-param model with a 7B one.
The same thing happened with Gemma-27B, where they compared it to all the 7-9B models.
It seems like an easy way to boost benchmarks while coming off as "small" at first glance.