I think the idea is that the derivative of x^2 is linear (namely 2x), which makes training/backpropagation fast, since 2x can be evaluated much more cheaply than dsigma(x)/dx, dGELU(x)/dx, etc., and it also speeds up the forward pass, since x*x is cheaper than sigma(x) and the like. But if so, they didn't explain it well, or possibly at all. And "nonlinear activation" is still nonsense.
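
For what it's worth, here's a rough sketch of the comparison I mean (my own illustration, not anything from the paper; the GELU here is the usual tanh approximation, and the exact timings aren't the point, just the amount of work per element):

```python
import numpy as np

def square(x):
    return x * x          # forward pass: one multiply per element

def square_grad(x):
    return 2.0 * x        # backward pass: one multiply -- linear in x

# tanh approximation of GELU (Hendrycks & Gimpel)
C = np.sqrt(2.0 / np.pi)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(C * (x + 0.044715 * x**3)))

def gelu_grad(x):
    # product rule + chain rule on 0.5*x*(1 + tanh(u)), u = C*(x + 0.044715*x^3)
    u = C * (x + 0.044715 * x**3)
    t = np.tanh(u)
    du = C * (1.0 + 3.0 * 0.044715 * x**2)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * du

x = np.linspace(-3, 3, 7)
print(square_grad(x))   # just 2x
print(gelu_grad(x))     # a tanh, a cubic, and a handful of multiplies per element
```

Whether that per-element saving actually matters next to the matmuls is a separate question, but that's the cost argument as I understand it.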