The optimizer is functioning correctly, and the pattern really exists in the training data. But consider:
- This behavior damages the model's performance on out-of-sample data: every word the model emits during silence increases the transcript's Word Error Rate.
- These translation credits are an artifact of our training data, not a reflection of the process we are modeling (spoken language).
So, while you are correct about the mechanism at work here, it is still accurate to call learning a spurious pattern that damages generalization "overfitting".
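To make the first point concrete, here is a minimal sketch of how hallucinated words during silence inflate WER. This is an illustrative Levenshtein-based implementation (real evaluations typically use a library such as jiwer); the example transcripts are hypothetical:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

reference = "the meeting is adjourned"
clean = "the meeting is adjourned"
# Model hallucinates a credits line during trailing silence:
hallucinated = "the meeting is adjourned subtitles by the community"

print(wer(reference, clean))         # 0.0
print(wer(reference, hallucinated))  # 1.0 -- four insertions against four reference words
```

Every hallucinated word is a pure insertion error, so a model that fills silence with credits pays for it directly in the metric, even when the spoken words are transcribed perfectly.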