To people who are as confused as I was: "nonlinear activation free" doesn't mean "linear" in the paper (otherwise this would be a ground-breaking discovery). They use polynomial functions in place of the traditional GELU or ReLU gating, or the sigmoid (in the attention module). Replacing those with a simple "x^2", plus some bells and whistles, seems to get good results.
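For anyone who wants it concrete, here is roughly what I understand that to mean, sketched in PyTorch. The module name, the sizes, and exactly where the x^2 goes are my own guesses, not the paper's code:

    import torch
    import torch.nn as nn

    class SquaredFFN(nn.Module):
        # A standard transformer feed-forward block, except the GELU is
        # swapped for x**2. Names and dimensions are mine, not the paper's.
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.up = nn.Linear(d_model, d_hidden)
            self.down = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            h = self.up(x)
            return self.down(h * h)  # h * h in place of nn.GELU()(h)

    ffn = SquaredFFN(64, 256)
    out = ffn(torch.randn(2, 10, 64))  # (batch, seq, d_model) -> same shape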
If I am being pedantic, "x^2" is still nonlinear though ...
This is fascinating because one of the first things they teach you in ML class is why nonlinear activations are necessary -- because otherwise, your entire network is mathematically equivalent to a single linear transformation since linear transformations are composable! I'm going to have to read through this paper because I'm really curious if they posit any theories as to why removing nonlinear transforms in their model had such a positive effect in this particular scenario.
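For reference, here is the textbook argument as a toy example (plain numpy, nothing to do with the paper itself): two stacked linear layers with no activation in between collapse into a single linear layer.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(8, 4))   # "layer 1"
    W2 = rng.normal(size=(3, 8))   # "layer 2"
    x = rng.normal(size=(4,))

    two_layers = W2 @ (W1 @ x)     # two linear layers, no activation
    one_layer = (W2 @ W1) @ x      # one equivalent linear layer
    print(np.allclose(two_layers, one_layer))  # True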
>If I am being pedantic, "x^2" is still nonlinear though
Yes, and this is where the paper should have stopped, with no ill intent towards the authors.
Polynomial networks have gotten good results (my colleague https://scholar.google.com/citations?user=1bU041kAAAAJ has done extensive work on them), and there have been multiple papers studying multiplicative interactions and the effects of feature engineering, plus dozens of small tweaks on activation functions, not to mention the NAS papers automating the whole process. But bigger numbers get a paper in, I guess.
I think the idea is that the derivative of x^2 is linear (namely 2x), which makes training/backpropagation very fast since 2x can be evaluated much faster than dsigma(x)/dx, dGELU(x)/dx, etc., and it also speeds up inference since x*x is cheaper than sigma(x) et cetera. But if so, they didn't explain it well, or possibly at all. And "nonlinear activation free" is still nonsense.
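If that is the intended argument, the backward pass for x^2 really is just a multiply. A toy illustration of that as a custom autograd function, purely my own sketch and not anything from the paper:

    import torch

    class Square(torch.autograd.Function):
        # x**2 as an "activation": forward is one multiply,
        # backward is just 2 * x (no exp/erf to evaluate).
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x * x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            return 2 * x * grad_out

    x = torch.randn(5, requires_grad=True)
    Square.apply(x).sum().backward()
    print(torch.allclose(x.grad, 2 * x))  # True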