
The optimizer is functioning correctly, and the pattern really exists in the training data. But consider:

- This behavior damages the model's performance on out-of-sample data; every word you predict during silence increases the transcript's word error rate (WER).

- These translation credits are an artifact of our training data, not a reflection of the process we are modeling (spoken language).
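The first point can be made concrete with a small sketch. The transcripts below are hypothetical, and the hallucinated credit line is an illustrative example of the kind of subtitle-credit text that appears in training data; WER here is computed as word-level edit distance divided by reference length:

```python
def word_edit_distance(ref, hyp):
    # Classic dynamic-programming edit distance over word lists
    # (counts substitutions, deletions, and insertions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[-1][-1]

def wer(ref, hyp):
    # WER = (substitutions + deletions + insertions) / reference length
    ref_words, hyp_words = ref.split(), hyp.split()
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)

# Hypothetical reference for an utterance followed by silence:
ref = "thanks for watching"
clean = "thanks for watching"
# Hallucinated credit line emitted during the trailing silence:
hallucinated = "thanks for watching subtitles by the amara org community"

print(wer(ref, clean))         # 0.0 — nothing predicted during silence
print(wer(ref, hallucinated))  # 2.0 — six inserted words over a
                               # three-word reference
```

Every word emitted during silence is a pure insertion error, so it raises WER with no offsetting gain.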

So, while you are correct about the mechanism at work here, it is still correct to call learning a spurious pattern that damages our performance "overfitting".


