This seems like a good way to measure LLM improvement. It matches the my persona...

		grim_io 38 days ago \| parent \| context \| favorite \| on: Measuring AI Ability to Complete Long Tasks This seems like a good way to measure LLM improvement. It matches the my personal feeling when using progressively better models over time.