
Really? It's in line with my experience. I'd think that making the model predict more stuff acts as a kind of regularization that forces the model to better focus on real predictive features and not memorize some shortcut. I've generally found CV models to be bad at picking the "right" features (those that generalize to new examples) during training, and making them predict more stuff is a good way of helping them along.
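To make that concrete, here's a rough sketch of the pattern I mean - a shared backbone with an extra prediction head, trained on a weighted sum of the two losses. This is just an illustrative PyTorch sketch; the head shapes, loss types, and the 0.3 weight are placeholders, not anything from the experiments discussed here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadModel(nn.Module):
        def __init__(self, backbone, feat_dim, n_aux_classes):
            super().__init__()
            self.backbone = backbone                             # any feature extractor, e.g. a CNN minus its classifier
            self.primary_head = nn.Linear(feat_dim, 1)           # main target (here a regression)
            self.aux_head = nn.Linear(feat_dim, n_aux_classes)   # auxiliary target (here a classification)

        def forward(self, x):
            feats = self.backbone(x)
            return self.primary_head(feats), self.aux_head(feats)

    def combined_loss(pred_main, pred_aux, y_main, y_aux, aux_weight=0.3):
        # the auxiliary term pushes the shared features toward signal that
        # also explains the extra labels, which is the regularization-like effect
        main = F.mse_loss(pred_main.squeeze(-1), y_main)
        aux = F.cross_entropy(pred_aux, y_aux)
        return main + aux_weight * aux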


It's definitely not just a regularizer in my case, because the gap appears even before a single epoch. The gap does also appear for two very different model architectures.

One explanation is that price labels are super noisy. If there is enough noise in the primary labels, you could imagine that adding in the more predictable target variables could help reduce gradient noise and speed up training. That's my current hypothesis, but I'm very open to others. If I had more time I'd try to do more experiments on this.


That's very interesting. Do the train and val set losses both show that behavior? I did a very similar experiment earlier this year - in my case it was a classifier where images could be categorized in different ways, and my takeaway was that making it predict more classes improved performance. I'll have to go back and look at the loss curves during training and see if the improvement is immediate as in your case.


Before one epoch, both the train and eval curves look pretty much identical. Quite curious


I've seen that happen before, but it was always a temporary artifact of the chosen architecture, optimizer, loss function, or training details, and it would disappear once those were improved, offering no help after that. I have little experience with CV compared to other tasks though, so that might be the reason. Have you seen any papers about this phenomenon?


I don't know that I've seen a paper that says this explicitly, I'm just saying it based on experience. You may have seen papers where they obscure parts of the image and find that the classifier is basing its decision on the background; I think that's well known. Likewise I can't cite a paper off the top of my head, but the usual augmentations like random cropping, flipping, color jitter, etc. are pretty well known to practitioners as a way of preventing overfitting. I see additional prediction targets as an extension of that, because they likewise incentivise the model to learn the "right" features by making it harder to latch on to some spurious pattern in the data. And I've had success with it practically, which is why I made my original comment. YMMV of course.
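For concreteness, this is the kind of augmentation pipeline I mean (torchvision); the specific transforms and parameters are arbitrary examples, not a recommendation:

    from torchvision import transforms

    train_tfms = transforms.Compose([
        transforms.RandomResizedCrop(224),        # random cropping
        transforms.RandomHorizontalFlip(),        # flipping
        transforms.ColorJitter(0.2, 0.2, 0.2),    # color jitter
        transforms.ToTensor(),
    ])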


I haven't seen a paper on this either, but I think that if it does work, somebody should definitely write one about it, because it seems quite consequential. I am not sure the underlying reason would be regularization though; it seems more like a case of indirect feature engineering.


I'd argue that both are the same thing, but yes, I agree - one way or another, it's a way of bringing additional domain knowledge to the problem. I suppose it also entails the downside that if you do it wrong, you can reduce predictive power.


Before writing this post, I asked ChatGPT for examples of positive transfer from auxiliary losses in the literature. It pointed me to this paper:

https://arxiv.org/abs/1705.07115
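That's Kendall, Gal and Cipolla, "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics". The idea is to weight each task's loss by a learned homoscedastic uncertainty instead of hand-tuning the weights. A simplified sketch of the usual log-variance parameterization (it glosses over the paper's 1/2 factors and the regression vs. classification distinction):

    import torch
    import torch.nn as nn

    class UncertaintyWeightedLoss(nn.Module):
        def __init__(self, n_tasks):
            super().__init__()
            # one learned log-variance per task
            self.log_vars = nn.Parameter(torch.zeros(n_tasks))

        def forward(self, task_losses):
            # exp(-log_var) down-weights noisy tasks; the +log_var term keeps
            # the model from inflating every variance to zero out the loss
            total = 0.0
            for loss, log_var in zip(task_losses, self.log_vars):
                total = total + torch.exp(-log_var) * loss + log_var
            return total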



