We're still working on training the DWA weights on top of a pretrained model. We're hopeful that this is feasible. The experiments you mention in the appendix do not change the learning rate scheduler. For example, when DWA training starts after 20k iterations, the learning rate is already quite small. To some extent, this might explain the diminishing returns. Maybe this would work with a completely different learning rate scheduler.
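For concreteness, here is a minimal sketch of what we mean by a different scheduler: restart the schedule with a fresh warmup when the DWA weights begin training, instead of inheriting the already-decayed schedule from the backbone run. The stand-in model, parameter names, step counts, and peak LR below are all illustrative assumptions, not our actual setup.

```python
import math
import torch
from torch import nn

# Stand-in for a frozen pretrained backbone plus newly added DWA mixing weights
# (one learned scalar per layer); the real DWA formulation may differ.
class TinyModelWithDWA(nn.Module):
    def __init__(self, n_layers=12, hidden=64):
        super().__init__()
        self.backbone = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_layers))
        self.dwa_weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, x):
        outputs = []
        for layer in self.backbone:
            x = torch.relu(layer(x))
            outputs.append(x)
        # Depth-weighted average of all layer outputs.
        w = torch.softmax(self.dwa_weights, dim=0)
        return sum(wi * oi for wi, oi in zip(w, outputs))

def warmup_cosine(step, warmup_steps=1_000, total_steps=10_000, min_ratio=0.1):
    """Fresh warmup followed by cosine decay, returned as an LR multiplier."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))

model = TinyModelWithDWA()
for p in model.backbone.parameters():
    p.requires_grad_(False)  # backbone stays frozen; only the DWA weights are trained

optimizer = torch.optim.AdamW([model.dwa_weights], lr=3e-4)  # peak LR is an assumption
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)

x = torch.randn(8, 64)
for step in range(10_000):
    loss = model(x).pow(2).mean()  # dummy objective, just to drive the loop
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```

The point of the restarted warmup is simply that the DWA weights see a reasonably large LR for a while, rather than the small tail end of the original decay.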
Yeah, you can't change the model much with low LRs. That's the point! It's the same reason you don't get continual learning if you just keep using low LRs (https://arxiv.org/abs/2403.08763). You need to really shake up the model if you want it to learn genuinely better (i.e. different) internal representations that exploit the DenseNet (https://arxiv.org/abs/1608.06993)/LTG-BERT (https://arxiv.org/abs/2311.02265) arch you're using here.