Do not be fooled by the simplicity; the magic is in the many Q, K and V matrices (each of which is huge), which are learned and depend on the language(s). This is just the form in which those matrices/transformations are applied: making the embedding for the last token of a context "attend to" (hence attention) all information contained in the context so far, at all layers of meaning, not just syntactic or semantic but logical, scientific, poetic, discoursal, etc. => multi-head attention.
Any complex function can be made to look simple in some representation (e.g. its Fourier series or Taylor series).
attention(Q, K, V) = softmax(Q Kᵀ / √d_k) @ V
is just half a line; the multi-head, masking and positional stuff are just toppings.
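For the curious, here is a minimal NumPy sketch of that one line. The shapes and the random W_q/W_k/W_v stand-ins are made up for illustration; in a real transformer those projection matrices are the learned part, and there is one set per head per layer.

    import numpy as np

    def softmax(x, axis=-1):
        # numerically stable softmax over the last axis
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) @ V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k): similarity of each query to each key
        weights = softmax(scores, axis=-1)  # each row sums to 1: how much each token attends to each other token
        return weights @ V                  # weighted average of the value vectors

    # toy example: 4 context tokens with (made-up) 8-dimensional embeddings
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 8))
    # random stand-ins for the learned projection matrices
    W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
    out = attention(X @ W_q, X @ W_k, X @ W_v)
    print(out.shape)  # (4, 8): one updated embedding per token
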
We have many basic algorithms in CS that are more involved; it's amazing we get language understanding from such simple math.