If you're talking about fig. 4, then the units are a normalized score, rescaled so that random performance is 0 and perfect performance is 100 (depending on the task, the underlying metric may be accuracy or something else). Since the models are so large, good benchmarks have to be diverse, and different tasks require different metrics, so a common scale is needed to put them on one plot.
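For illustration, here is a minimal sketch of that kind of rescaling. It assumes the mapping is a plain linear one between the random baseline and the perfect score, which the figure doesn't confirm; the function name `normalized_score` and the 4-way multiple-choice numbers are made up for the example.

```python
def normalized_score(raw, random_baseline, perfect):
    """Linearly rescale a raw metric so random = 0 and perfect = 100.

    Assumption: the rescaling is linear. The actual figure may use
    a different mapping for some tasks.
    """
    if perfect == random_baseline:
        raise ValueError("perfect and random_baseline must differ")
    return 100.0 * (raw - random_baseline) / (perfect - random_baseline)


# Hypothetical 4-way multiple-choice task: guessing gives 25% accuracy.
print(normalized_score(0.25, 0.25, 1.0))   # random guessing -> 0.0
print(normalized_score(1.0, 0.25, 1.0))    # perfect -> 100.0
print(normalized_score(0.625, 0.25, 1.0))  # halfway between -> 50.0
```

Note that under this scheme a model doing worse than chance gets a negative score, and the same raw accuracy maps to different normalized values on tasks with different random baselines.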
Poetic that the top post right now is (partially) about how over-simplified figures in science communication result in a popular misunderstanding of science, leading readers to believe that conducting research is easier than it actually is.