
I think the idea is that the derivative of x^2 is linear (namely 2x), which speeds up training/backpropagation, since evaluating 2x is much cheaper than dsigma(x)/dx, dGELU(x)/dx, etc., and also speeds up the forward pass, since x*x is cheaper than sigma(x) et cetera. But if so, they didn't explain it well, or possibly at all. And "nonlinear activation" is still nonsense.
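As a rough sanity check, here's a minimal timing sketch (assuming PyTorch is available; the tensor size and iteration count are arbitrary choices for illustration, not anything from the paper) that times forward plus backward through x*x, ReLU, and GELU in isolation:

    import time
    import torch
    import torch.nn.functional as F

    x = torch.randn(4096, 4096, requires_grad=True)

    def bench(name, fn, iters=20):
        # Time forward + backward through the activation alone.
        start = time.perf_counter()
        for _ in range(iters):
            y = fn(x).sum()
            y.backward()
            x.grad = None  # reset accumulated gradient between runs
        elapsed = time.perf_counter() - start
        print(f"{name:8s} {elapsed / iters * 1e3:.2f} ms/iter")

    bench("square", lambda t: t * t)   # derivative is 2x
    bench("relu",   torch.relu)        # derivative is a 0/1 mask
    bench("gelu",   F.gelu)            # derivative involves erf/tanh

Actual numbers depend heavily on hardware and tensor size, so treat the output as illustrative only.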


It's no faster than ReLU in either direction (forward or backward), and ReLU is the most common choice.


Usually activation functions are not a bottleneck.
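For scale, a hedged sketch (again assuming PyTorch; sizes are arbitrary and chosen only for illustration) comparing one matmul against one elementwise GELU on same-sized data:

    import time
    import torch
    import torch.nn.functional as F

    a = torch.randn(4096, 4096)
    b = torch.randn(4096, 4096)

    def time_it(name, fn, iters=10):
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        elapsed = time.perf_counter() - start
        print(f"{name:10s} {elapsed / iters * 1e3:.2f} ms/iter")

    time_it("matmul", lambda: a @ b)      # O(n^3) work for n x n matrices
    time_it("gelu",   lambda: F.gelu(a))  # O(n^2) elementwise work

The matmuls that surround the activation do asymptotically more work, which is why the choice of activation rarely shows up in end-to-end timings.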



