It means that the benchmark isn't useful anymore and we need to build a harder one.
edit: as for what the numbers mean, they are arbitrary. They are only useful insofar as you can run two models (or two versions of the same model) on the same benchmark and compare the results. On an absolute scale, the numbers don't mean anything.
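To make the comparison point concrete, here is a toy sketch. Everything in it is made up for illustration: the four-question benchmark, the two stand-in "models" (plain lookup functions, not real LLM calls), and the exact-match scoring. Real benchmarks differ in the details, but the shape is roughly the same.

```python
from typing import Callable

# Hypothetical toy benchmark: (question, expected answer) pairs.
BENCHMARK = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]


def score(model: Callable[[str], str]) -> float:
    """Fraction of benchmark questions the model answers exactly right."""
    correct = sum(model(question) == answer for question, answer in BENCHMARK)
    return correct / len(BENCHMARK)


# Stand-in "model" A: gets three of the four questions right.
def model_a(question: str) -> str:
    answers = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "5"}
    return answers[question]


# Stand-in "model" B: always answers "4", so it is right only when 4 is expected.
def model_b(question: str) -> str:
    return "4"


score_a = score(model_a)
score_b = score(model_b)

# The individual scores only become informative when compared
# on the same benchmark under the same scoring rule.
print(f"model A: {score_a:.2f}, model B: {score_b:.2f}")
print(f"difference (A - B): {score_a - score_b:+.2f}")
```

A score of 0.75 here says nothing on its own; it depends entirely on which questions were picked and how answers are graded. What you can say is that A beat B on this particular set.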
It was actually very helpful, as it answered my question about what the benchmark numbers are. It wasn't a request for advice; I'm merely trying to understand the article, which doesn't really elaborate on what it is presenting. It either assumes an audience that is already very familiar with these benchmarks, or one so dazzled by number going up that they forget to ask what the number is.