I'm sure you're right that it's more than just the problem being in the training data, but the fact that it is in the training data means you can't draw any conclusions about general mathematical ability from this benchmark alone, even if you substitute the numbers.
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines, after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.