There are a couple of other papers in the same vein, a genre of "you never needed residual layers or complicated normalization tricks if you just get the initialization & layers right", like Balduzzi https://arxiv.org/abs/1702.08591 or NFNets https://arxiv.org/abs/2102.06171#deepmind . I find them pretty interesting because they are a datapoint that neural nets were, all along, much more powerful and simpler than we realized, that we just don't know how to NN right, and that we mistook our incompetence at setting up easily-optimized instances for fundamental limits of the models.
Most of that work focuses on CNNs, but it looks like it is true of simple MLPs too (https://www.gwern.net/notes/FC): with MLP-Mixer, we find that simple MLP architectures are already highly competitive with the CNNs & Transformers that have had literally a decade of intense research put into them. But MLPs long predate both families, so why did it take this long? As far as I can tell, there were just some cursory experiments relatively early on where MLPs didn't work well past a few layers (just as CNNs didn't early on), everyone shrugged and abandoned MLPs, and it became conventional wisdom that 'MLPs are too flexible to work'. If it doesn't work immediately, who's going to persist and try to scale up? But then it turns out that with a little bit of normalization or gating, MLPs Just Work. It all seems obvious in retrospect (working fine at a small layer count and then plateauing after a few more layers obviously looks like the sort of optimization problem you'd fix with residual connections, normalization, or better initialization - in retrospect!). And yet.
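To make the "little bit of normalization or gating" concrete, here is a minimal sketch (PyTorch assumed; the block below combines a pre-LayerNorm with a ReZero-style zero-initialized residual gate purely to show both tricks in one place, and the widths/depth are arbitrary placeholder values, not anything from the papers):

```python
# Hypothetical sketch: a deep MLP made trainable by adding a pre-LayerNorm
# and a ReZero-style zero-initialized residual gate. Sizes are arbitrary.
import torch
import torch.nn as nn

class GatedMLPBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)               # the "bit of normalization"
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )
        self.alpha = nn.Parameter(torch.zeros(1))   # the "gating": block starts as the identity

    def forward(self, x):
        # Residual path scaled by a learned gate, so at initialization the
        # whole stack is the identity and depth is not an optimization hazard.
        return x + self.alpha * self.ff(self.norm(x))

# A 32-block stack like this trains, where an equally deep plain
# Linear+GELU stack tends to stall after a few layers.
model = nn.Sequential(*[GatedMLPBlock(256, 1024) for _ in range(32)])
print(model(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```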
A humbling instance of our cognitive limits; we're just too dumb and ignorant to do this stuff the easy way, and have to do it the hard way.
Great insights and commentary - also, thank you for the references; I'm looking forward to digging into those (especially your site).
I was going to ask about your thoughts regarding the ReZero paper, but it looks like you already have it cited in your FCN/MLP bibliography! I will study that list a bit harder, especially the MLP-Mixer section.
One paper that I didn't see cited is [1] - I only mention it because I'm intrigued by your thoughts on normalization and gating, and I can see how the early literature on dynamical isometry (especially with respect to weight initialization and choice of activation function) would add further support to your general line of thinking.
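For what it's worth, the recipe from that dynamical-isometry line of work is simple enough to sketch: roughly, orthogonal weight initialization with a gain near criticality plus a saturating nonlinearity like tanh, and no residuals or norm layers at all. A hedged illustration (PyTorch assumed; the gain, width, and depth below are placeholders I picked for the example, not values from the paper):

```python
# Rough illustration of the dynamical-isometry setup: a deep vanilla tanh MLP
# with orthogonal weight init, no residual connections, no normalization layers.
# (PyTorch assumed; gain/width/depth are placeholder values, not from the paper.)
import torch
import torch.nn as nn

def deep_tanh_mlp(dim: int = 256, depth: int = 64) -> nn.Sequential:
    layers = []
    for _ in range(depth):
        linear = nn.Linear(dim, dim)
        # Orthogonal init makes each weight matrix a scaled orthogonal matrix
        # (all singular values equal), which, with a gain tuned to the chosen
        # activation, keeps the network's input-output Jacobian well-conditioned
        # at initialization.
        nn.init.orthogonal_(linear.weight, gain=1.05)
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.Tanh()]
    return nn.Sequential(*layers)

net = deep_tanh_mlp()
out = net(torch.randn(4, 256))
print(out.std())  # activations settle at a stable scale rather than exploding or vanishing with depth
```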