To people who are as confused as I was: "nonlinear activation free" doesn't mean "linear" in the paper (otherwise this would be a ground-breaking discovery). They use polynomial functions in place of the traditional GELU or ReLU gating, or the sigmoid (in the attention module). Replacing those with a simple "x^2", plus some bells and whistles, seems to get good results.
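For anyone who wants it concrete, here is roughly what I understand that to mean, sketched in PyTorch. The module name, the sizes, and exactly where the x^2 goes are my own guesses, not the paper's code:

    import torch
    import torch.nn as nn

    class SquaredFFN(nn.Module):
        # A standard transformer feed-forward block, except the GELU is
        # swapped for x**2. Names and dimensions are mine, not the paper's.
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.up = nn.Linear(d_model, d_hidden)
            self.down = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            h = self.up(x)
            return self.down(h * h)  # h * h in place of nn.GELU()(h)

    ffn = SquaredFFN(64, 256)
    out = ffn(torch.randn(2, 10, 64))  # (batch, seq, d_model) -> same shape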
If I am being pedantic, "x^2" is still nonlinear though ...
This is fascinating because one of the first things they teach you in ML class is why nonlinear activations are necessary -- because otherwise, your entire network is mathematically equivalent to a single linear transformation since linear transformations are composable! I'm going to have to read through this paper because I'm really curious if they posit any theories as to why removing nonlinear transforms in their model had such a positive effect in this particular scenario.
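For reference, here is the textbook argument as a toy example (plain numpy, nothing to do with the paper itself): two stacked linear layers with no activation in between collapse into a single linear layer.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(8, 4))   # "layer 1"
    W2 = rng.normal(size=(3, 8))   # "layer 2"
    x = rng.normal(size=(4,))

    two_layers = W2 @ (W1 @ x)     # two linear layers, no activation
    one_layer = (W2 @ W1) @ x      # one equivalent linear layer
    print(np.allclose(two_layers, one_layer))  # True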
>If I am being pedantic, "x^2" is still nonlinear though
Yes, and this is where the paper should have stopped, with no ill intent towards the authors.
Polynomial networks have gotten good results (my colleague https://scholar.google.com/citations?user=1bU041kAAAAJ has done extensive work on them), and there have been multiple papers studying multiplicative interactions and the effects of feature engineering, plus dozens of small tweaks on activation functions, not to mention the NAS papers automating the whole process. But bigger numbers get a paper in, I guess.
I think the idea is that the derivative of x^2 is linear (namely 2x), which makes training/backpropagation very fast since 2x can be evaluated much faster than dsigma(x)/dx, dGELU(x)/dx, etc., and it also speeds up inference since x*x is cheaper than sigma(x) et cetera. But if so, they didn't explain it well, or possibly at all. And "nonlinear activation free" is still nonsense.
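If that is the intended argument, the backward pass for x^2 really is just a multiply. A toy illustration of that as a custom autograd function, purely my own sketch and not anything from the paper:

    import torch

    class Square(torch.autograd.Function):
        # x**2 as an "activation": forward is one multiply,
        # backward is just 2 * x (no exp/erf to evaluate).
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return x * x

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            return 2 * x * grad_out

    x = torch.randn(5, requires_grad=True)
    Square.apply(x).sum().backward()
    print(torch.allclose(x.grad, 2 * x))  # True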