They specifically call out fp8-aware training, and TensorRT-LLM is really good (efficient) with fp8 inference on H100 and other Hopper cards. It's possible that they run the 7B natively in fp16, since smaller models suffer more from even "modest" quantization like this.
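If you want a rough feel for why a small model might stay in fp16, here's a toy sketch (mine, not anything from their pipeline) comparing the round-trip error of fp16 vs fp8 e4m3 on random weights; the 0.02 weight scale and the use of PyTorch's float8 dtype are just illustrative assumptions:

```python
# Toy comparison: round-trip error of fp16 vs fp8 (e4m3), the format
# TensorRT-LLM targets on Hopper. Needs PyTorch >= 2.1 for float8 dtypes.
import torch

weights = torch.randn(4096, 4096) * 0.02  # assumed "typical" weight scale

def roundtrip_error(w: torch.Tensor, dtype: torch.dtype) -> float:
    """Mean relative error after casting to `dtype` and back to fp32."""
    wq = w.to(dtype).to(torch.float32)
    return ((w - wq).abs() / (w.abs() + 1e-8)).mean().item()

print(f"fp16 error: {roundtrip_error(weights, torch.float16):.4%}")
print(f"fp8  error: {roundtrip_error(weights, torch.float8_e4m3fn):.4%}")
# e4m3 has only a 3-bit mantissa, so the per-weight error is much larger;
# a 7B model has less redundancy to absorb that than a 70B one.
```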
For the benchmarks, it depends on how you interpret them. The other models are quite popular, so many people have a starting point. Now, if you regularly use them you can assess: "just 3% better on some benchmark, 80% to 83%, at the cost of almost twice the inference time and base RAM requirement, but with a 16x context window, and for commercial usage..." and at the end ask: "for my use case, is it worth it?"
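A literal back-of-the-envelope version of that last question, using only the hypothetical numbers above (the 2x cost ratio is an assumption, not from any spec):

```python
# Toy cost/benefit check: extra cost per point of quality, ignoring the
# 16x context window, which only your own workload can put a price on.
old_score, new_score = 80.0, 83.0   # benchmark scores from the example above
old_cost, new_cost = 1.0, 2.0       # assumed relative inference cost

extra_cost_pct = (new_cost / old_cost - 1) * 100       # +100% cost
quality_gain_pct = (new_score / old_score - 1) * 100   # +3.75% quality
print(f"{extra_cost_pct / quality_gain_pct:.1f}% extra cost per 1% quality gain")
# -> ~26.7% extra cost per 1% gain; whether that's acceptable is use-case specific.
```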
"It significantly outperforms existing models smaller or similar in size."
is a statement that goes in that direction and would allow comparing a 1.7T-param model with a 7B one.
The same thing happened with Gemma-27B, where they compared it to all the 7-9B models.
It seems like an easy way to boost benchmarks while coming off as "small" at first glance.