
At the same time, it feels extremely simple:

attention(Q, K, V) = softmax(Q K^T / √d_k) @ V

fits in half a line; the multi-head, masking, and positional stuff are just toppings.

We have many basic algorithms in CS that are more involved; it's amazing we get language understanding from such simple math.
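A minimal NumPy sketch of that one-liner, with toy shapes chosen purely for illustration (not any particular library's API):

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      d_k = K.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k): similarity of each query to each key
      weights = softmax(scores, axis=-1)  # each query gets a distribution over keys
      return weights @ V                  # weighted sum of the values

  # Toy example: 4 tokens, d_k = d_v = 8 (shapes are illustrative only)
  rng = np.random.default_rng(0)
  Q = rng.normal(size=(4, 8))
  K = rng.normal(size=(4, 8))
  V = rng.normal(size=(4, 8))
  print(attention(Q, K, V).shape)  # (4, 8)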



Do not be fooled by the simplicity; the magic is in the many learned Q, K, and V matrices (each of which is huge), which depend on the language(s). The formula is just the form in which those matrices/transformations are applied: it makes the embedding for the last token of a context "attend to" (hence attention) all the information contained in the context so far, at all layers of meaning, not just syntactic or semantic but logical, scientific, poetic, discoursal, etc. (hence multi-head attention).
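A sketch of the multi-head version, assuming per-head projection matrices W_q, W_k, W_v (names and shapes here are illustrative, and the usual final output projection is only noted in a comment):

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def multi_head_attention(X, heads):
      # X: (n_tokens, d_model); heads: list of (W_q, W_k, W_v), one triple per head.
      # Each head learns its own projections, so it can attend to a different
      # "layer of meaning"; the head outputs are concatenated at the end.
      outputs = []
      for W_q, W_k, W_v in heads:
          Q, K, V = X @ W_q, X @ W_k, X @ W_v
          d_k = K.shape[-1]
          weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
          outputs.append(weights @ V)
      return np.concatenate(outputs, axis=-1)  # normally followed by an output projection

  rng = np.random.default_rng(0)
  d_model, d_head, n_heads = 16, 4, 4
  X = rng.normal(size=(6, d_model))  # a 6-token toy "context"
  heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
  print(multi_head_attention(X, heads).shape)  # (6, 16)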

Any complex function can be made to look simple in some representation (e.g. its Fourier or Taylor series).


I never had too much trouble understanding the algorithm, but this is the first time I can see why it works.



