I highly doubt this in practice on a large scale. Outside of the common phenomena of "most large NNs are under trained" and "less better data is sometimes better than more worse data", there are no other obvious mechanisms to explain why a smaller model with same or similar architecture would be better than a larger one.
I claim instead that we are still hardly scratching the surface with how we evaluate NLP systems. Also, some fields have straight up trash evaluation schemes. Summarization and ROGUE scores are totally BS and I find the claim that they even correlate with high quality summaries suspect. I say this with publications in the that subfield, so I have personal experience with just how crummy many summarizes are.
I claim instead that we are still hardly scratching the surface with how we evaluate NLP systems. Also, some fields have straight up trash evaluation schemes. Summarization and ROGUE scores are totally BS and I find the claim that they even correlate with high quality summaries suspect. I say this with publications in the that subfield, so I have personal experience with just how crummy many summarizes are.