Generating Pixels One by One

akoboldfrying · 2025-06-09T01:48:48 1749433728

Surely the ideal choice for the

> learned token embeddings

for the pixel intensities in M3 would be... a single scalar value between 0 and 1? (That is, a "vector" of length 1, consisting of just the original pixel intensity.)

Am I misunderstanding something? Or is this maybe a case of "Yes, you could just do that much simpler thing in this simplified example, but in general you need the complicated approach that we're trying to demonstrate"?
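
To make the comparison concrete, here is a minimal sketch of the two options. It assumes the pixel intensities are quantized into a small number of discrete levels that serve as token IDs. The names `NUM_LEVELS`, `EMBED_DIM`, `make_embedding_table`, `embed_lookup`, and `embed_scalar` are hypothetical and not taken from the article, and the random table only stands in for values that training would actually learn.

```python
import random

NUM_LEVELS = 4  # hypothetical number of quantized intensity levels (token IDs 0..3)
EMBED_DIM = 3   # hypothetical width of the learned embedding


def make_embedding_table(num_levels, dim, seed=0):
    """Build a lookup table with one vector per token.

    Random values stand in for parameters that training would adjust.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_levels)]


def embed_lookup(table, token):
    """Learned-embedding style: the token ID just selects a row of the table."""
    return table[token]


def embed_scalar(token, num_levels):
    """Scalar style: a length-1 'vector' holding the intensity scaled to [0, 1]."""
    return [token / (num_levels - 1)]


table = make_embedding_table(NUM_LEVELS, EMBED_DIM)

# The scalar option is equivalent to a fixed, untrained table of width 1.
scalar_table = [embed_scalar(t, NUM_LEVELS) for t in range(NUM_LEVELS)]

for token in range(NUM_LEVELS):
    print(token, embed_lookup(table, token), scalar_table[token])
```

Seen this way, the scalar is a special case of the lookup table: width 1, with rows fixed in intensity order rather than learned. The table version gives the model more dimensions to work with and does not force the levels to lie on a single line, though whether that matters for a simple example like this one is exactly the question being asked.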

archermarks · 2025-06-09T14:18:56 1749478736

Really nice article! This clarified a lot for me about how positional embeddings work and how other useful context might be embedded in other models.