How to Train 10k-Layer Vanilla Convolutional Neural Networks [pdf] (mlr.press)
4 points by criticaltinker on Aug 16, 2021 | 3 comments


This paper is one of my all time favorites.

It shows that extremely deep vanilla CNNs - without batch normalization or residual connections - can be trained simply by using a Delta-Orthogonal weight initialization scheme and an appropriate activation function.
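
For a concrete picture, here's a minimal NumPy sketch of what a Delta-Orthogonal kernel looks like (my own illustration, not the paper's reference code; the gain would be tuned for the chosen activation via the paper's mean field analysis): the kernel is zero everywhere except at the spatial center, which holds an orthogonal matrix.

    import numpy as np

    def delta_orthogonal(shape, gain=1.0, seed=None):
        """Delta-Orthogonal conv kernel sketch, shape = (k_h, k_w, c_in, c_out)."""
        k_h, k_w, c_in, c_out = shape
        if c_out < c_in:
            raise ValueError("this sketch assumes c_out >= c_in")
        rng = np.random.default_rng(seed)
        # Haar-distributed orthogonal matrix via QR of a Gaussian matrix.
        q, r = np.linalg.qr(rng.standard_normal((c_out, c_out)))
        q *= np.sign(np.diag(r))
        kernel = np.zeros(shape, dtype=np.float32)
        # Zero everywhere except the spatial center, which gets orthonormal rows (H @ H.T = I).
        kernel[k_h // 2, k_w // 2] = gain * q[:c_in, :]
        return kernel

At initialization such a layer acts like a norm-preserving orthogonal map applied pixel-wise, which is what lets activations and gradients propagate through thousands of layers.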

The Delta-Orthogonal initialization scheme is derived theoretically by developing a mean field theory for signal propagation which characterizes the conditions for dynamical isometry. Ultra-deep CNNs can train faster and perform better if their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to 1.

Put another way, dynamical isometry is a necessary condition for signals to flow both forward and backward through the network without attenuation. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging - mean field theory is a powerful tool that offers solutions to these challenges.
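
Here's a toy NumPy check of that intuition, deliberately simplified to a deep linear network, where the input-output Jacobian is just the product of the weight matrices (the paper's contribution is extending this to nonlinear conv nets via mean field theory). Orthogonal initialization keeps every singular value at 1, while standard Gaussian fan-in initialization produces a spectrum that spreads over hundreds of orders of magnitude at depth 200:

    import numpy as np

    rng = np.random.default_rng(0)
    depth, width = 200, 64

    def haar_orthogonal(n):
        q, r = np.linalg.qr(rng.standard_normal((n, n)))
        return q * np.sign(np.diag(r))

    # End-to-end Jacobian of a deep linear net = product of its weight matrices.
    jac_orth = np.linalg.multi_dot([haar_orthogonal(width) for _ in range(depth)])
    jac_gauss = np.linalg.multi_dot(
        [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)])

    sv_orth = np.linalg.svd(jac_orth, compute_uv=False)
    sv_gauss = np.linalg.svd(jac_gauss, compute_uv=False)
    print(sv_orth.min(), sv_orth.max())    # both ~1.0: dynamical isometry
    print(sv_gauss.min(), sv_gauss.max())  # wildly ill-conditioned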

The authors demonstrate experimentally that Delta-Orthogonal kernels outperform existing initialization schemes for very deep vanilla convolutional networks. They also find strikingly good agreement between theoretical and experimental results. IMO one of the most astonishing findings is that for networks initialized using this scheme the learning time measured in number of training epochs is independent of depth.

> Our results indicate that we have removed all the major fundamental obstacles to training arbitrarily deep vanilla convolutional networks. In doing so, we have laid the groundwork to begin addressing some outstanding questions in the deep learning community, such as whether depth alone can deliver enhanced generalization performance. Our initial results suggest that past a certain depth, on the order of tens or hundreds of layers, the test performance for vanilla convolutional architecture saturates. These observations suggest that architectural features such as residual connections and batch normalization are likely to play an important role in defining a good model class, rather than simply enabling efficient training.

Here is a link to the ConvolutionDeltaOrthogonal initializer in TensorFlow [1].

[1] https://github.com/tensorflow/tensorflow/blob/d287ff3d95c06b...
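
(Not the actual ConvolutionDeltaOrthogonal class - just a rough sketch of how one could wrap a kernel like the delta_orthogonal() above as a custom Keras initializer; the class name, gain, and layer hyperparameters here are illustrative.)

    import tensorflow as tf

    class DeltaOrthogonalSketch(tf.keras.initializers.Initializer):
        # assumes the delta_orthogonal() NumPy sketch above is in scope
        def __init__(self, gain=1.0, seed=None):
            self.gain = gain
            self.seed = seed

        def __call__(self, shape, dtype=None):
            kernel = delta_orthogonal(tuple(shape), gain=self.gain, seed=self.seed)
            return tf.constant(kernel, dtype=dtype or tf.float32)

    # e.g. one layer of a deep vanilla (no BN, no residuals) tanh conv stack:
    layer = tf.keras.layers.Conv2D(
        64, 3, padding="same", activation="tanh",
        kernel_initializer=DeltaOrthogonalSketch())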


There are a couple of other papers in the same vein, a genre of "you never needed residual layers or complicated normalization tricks if you just get the initialization & layers right", like Balduzzi https://arxiv.org/abs/1702.08591 or NFNets https://arxiv.org/abs/2102.06171#deepmind . I find them pretty interesting because they are a datapoint that neural nets were, all along, much more powerful and simpler than we realized, and we just don't know how to NN right: we mistook our incompetence at setting up easily-optimized instances for fundamental limits of the models.

Most of that work focuses on CNNs, but it looks like it is true of simple MLPs too: https://www.gwern.net/notes/FC . With MLP-Mixer, we find that simple MLP archs are already highly competitive with the CNNs & Transformers that have had literally a decade of intense research put into them. But MLPs long predate both families, so why did it take this long? As far as I can tell, there were just some cursory experiments relatively early on where MLPs didn't work well past a few layers (just like CNNs didn't early on), everyone shrugged and abandoned MLPs, and it became conventional wisdom that 'MLPs are too flexible to work'. If it doesn't work immediately, who's going to persist and try to scale up? But then it turns out that with a little bit of normalization or gating, MLPs Just Work. It all seems obvious in retrospect (working fine at a small layer count and then asymptoting at a few layers obviously looks like the sort of optimization problem you'd fix with residual layers or normalization or better initialization - in retrospect!). And yet.

A humbling instance of our cognitive limits; we're just too dumb and ignorant to do this stuff the easy way, and have to do it the hard way.


Great insights and commentary - and thank you for the references; I'm looking forward to digging into those (especially your site).

I was going to ask about your thoughts regarding the ReZero paper, but it looks like you already have it cited in your FCN/MLP bibliography! I will study that list a bit harder, especially the MLP-Mixer section.

One paper that I didn't see cited is [1] - I only mention it because I'm intrigued by your thoughts on normalization and gating, and I can see how the early literature on dynamical isometry (especially with respect to weight initialization and choice of activation function) would add further support to your general line of thinking.

[1] Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice https://papers.nips.cc/paper/2017/file/d9fc0cdb67638d50f4114...



