
At the same time, it feels extremely simple:

attention(Q, K, V) = softmax(Q K^T / √d_k) @ V

fits in half a line; the multi-head, masking, and positional stuff are just toppings.

We have many basic algorithms in CS that are more involved; it's amazing we get language understanding from such simple math.
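A minimal NumPy sketch of that one-liner, with toy shapes chosen purely for illustration (not any particular library's API):

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      d_k = K.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k): similarity of each query to each key
      weights = softmax(scores, axis=-1)  # each query gets a distribution over keys
      return weights @ V                  # weighted sum of the values

  # Toy example: 4 tokens, d_k = d_v = 8 (shapes are illustrative only)
  rng = np.random.default_rng(0)
  Q = rng.normal(size=(4, 8))
  K = rng.normal(size=(4, 8))
  V = rng.normal(size=(4, 8))
  print(attention(Q, K, V).shape)  # (4, 8)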



Do not be fooled by the simplicity; the magic is in the many learned Q, K, and V matrices (each of which is huge), which depend on the language(s). The formula is just the form in which those matrices/transformations are applied: it makes the embedding for the last token of a context "attend to" (hence attention) all the information contained in the context so far, at all layers of meaning, not just syntactic or semantic but logical, scientific, poetic, discoursal, etc. (hence multi-head attention).
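A sketch of the multi-head version, assuming per-head projection matrices W_q, W_k, W_v (names and shapes here are illustrative, and the usual final output projection is only noted in a comment):

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def multi_head_attention(X, heads):
      # X: (n_tokens, d_model); heads: list of (W_q, W_k, W_v), one triple per head.
      # Each head learns its own projections, so it can attend to a different
      # "layer of meaning"; the head outputs are concatenated at the end.
      outputs = []
      for W_q, W_k, W_v in heads:
          Q, K, V = X @ W_q, X @ W_k, X @ W_v
          d_k = K.shape[-1]
          weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
          outputs.append(weights @ V)
      return np.concatenate(outputs, axis=-1)  # normally followed by an output projection

  rng = np.random.default_rng(0)
  d_model, d_head, n_heads = 16, 4, 4
  X = rng.normal(size=(6, d_model))  # a 6-token toy "context"
  heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
  print(multi_head_attention(X, heads).shape)  # (6, 16)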

Any complex function can be made to look simple in some representation (e.g. its Fourier or Taylor series).


I never had too much trouble understanding the algorithm, but this is the first time I can see why it works.



