There are a couple of other papers in the same vein, a genre of "you never needed residual layers or complicated normalization tricks if you just get the initialization & layers right", like Balduzzi https://arxiv.org/abs/1702.08591 or NFNets https://arxiv.org/abs/2102.06171#deepmind . I find them pretty interesting because they are a datapoint that neural nets were, all along, much more powerful and simpler than we realized, that we just don't know how to NN right, and that we mistook our incompetence at setting up easily-optimized instances for fundamental limits of the models.
Most of that work focuses on CNNs, but it looks like it is true of simple MLPs too (https://www.gwern.net/notes/FC): with MLP-Mixer, we find that simple MLP architectures are already highly competitive with the CNNs & Transformers that have had literally a decade of intense research put into them. But MLPs long predate both families, so why did it take this long? As far as I can tell, there were just some cursory experiments relatively early on where MLPs didn't work well past a few layers (just as CNNs didn't early on), everyone shrugged and abandoned MLPs, and it became conventional wisdom that 'MLPs are too flexible to work'. If it doesn't work immediately, who's going to persist and try to scale up? But then it turns out that with a little bit of normalization or gating, MLPs Just Work. It all seems obvious in retrospect (working fine at a small layer count and then plateauing after a few more layers obviously looks like the sort of optimization problem you'd fix with residual connections, normalization, or better initialization - in retrospect!). And yet.
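To make the "little bit of normalization or gating" concrete, here is a minimal sketch (PyTorch assumed; the block below combines a pre-LayerNorm with a ReZero-style zero-initialized residual gate purely to show both tricks in one place, and the widths/depth are arbitrary placeholder values, not anything from the papers):

```python
# Hypothetical sketch: a deep MLP made trainable by adding a pre-LayerNorm
# and a ReZero-style zero-initialized residual gate. Sizes are arbitrary.
import torch
import torch.nn as nn

class GatedMLPBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)               # the "bit of normalization"
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )
        self.alpha = nn.Parameter(torch.zeros(1))   # the "gating": block starts as the identity

    def forward(self, x):
        # Residual path scaled by a learned gate, so at initialization the
        # whole stack is the identity and depth is not an optimization hazard.
        return x + self.alpha * self.ff(self.norm(x))

# A 32-block stack like this trains, where an equally deep plain
# Linear+GELU stack tends to stall after a few layers.
model = nn.Sequential(*[GatedMLPBlock(256, 1024) for _ in range(32)])
print(model(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```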
A humbling instance of our cognitive limits; we're just too dumb and ignorant to do this stuff the easy way, and have to do it the hard way.
Great insights and commentary - also, thank you for the references; I'm looking forward to digging into those (especially your site).
I was going to ask about your thoughts regarding the ReZero paper, but it looks like you already have it cited in your FCN/MLP bibliography! I will study that list a bit harder, especially the MLP-Mixer section.
One paper that I didn't see cited is [1] - I only mention it because I'm intrigued by your thoughts on normalization and gating, and I can see how the early literature on dynamical isometry (especially with respect to weight initialization and choice of activation function) would add further support to your general line of thinking.
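For what it's worth, the recipe from that dynamical-isometry line of work is simple enough to sketch: roughly, orthogonal weight initialization with a gain near criticality plus a saturating nonlinearity like tanh, and no residuals or norm layers at all. A hedged illustration (PyTorch assumed; the gain, width, and depth below are placeholders I picked for the example, not values from the paper):

```python
# Rough illustration of the dynamical-isometry setup: a deep vanilla tanh MLP
# with orthogonal weight init, no residual connections, no normalization layers.
# (PyTorch assumed; gain/width/depth are placeholder values, not from the paper.)
import torch
import torch.nn as nn

def deep_tanh_mlp(dim: int = 256, depth: int = 64) -> nn.Sequential:
    layers = []
    for _ in range(depth):
        linear = nn.Linear(dim, dim)
        # Orthogonal init makes each weight matrix a scaled orthogonal matrix
        # (all singular values equal), which, with a gain tuned to the chosen
        # activation, keeps the network's input-output Jacobian well-conditioned
        # at initialization.
        nn.init.orthogonal_(linear.weight, gain=1.05)
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.Tanh()]
    return nn.Sequential(*layers)

net = deep_tanh_mlp()
out = net(torch.randn(4, 256))
print(out.std())  # activations settle at a stable scale rather than exploding or vanishing with depth
```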