> This is the beautiful part - a mere multiplication is enough to convert the image tensor to text tensor. One freaking line of code, and a simple one.

I thought they were creating image tokens based on the queries during fine-tuning and appending them to the language model's input. Those are not text tokens.
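To make the distinction concrete, here is a rough sketch of what I mean. It is not the actual code under discussion: the sizes, the random weights, and the `project` helper are all made up for illustration. The multiplication maps image features to the language model's embedding width, but the result is a set of continuous vectors placed next to the text embeddings, not discrete vocabulary IDs.

```python
import random

def project(image_feats, W):
    """Multiply (n_patches x d_img) features by a (d_img x d_txt) matrix."""
    return [[sum(f * w for f, w in zip(row, col)) for col in zip(*W)]
            for row in image_feats]

random.seed(0)
n_patches, d_img, d_txt = 4, 8, 6  # hypothetical sizes, chosen only for the demo

# Stand-ins for the vision encoder output and a learned projection matrix.
image_feats = [[random.gauss(0, 1) for _ in range(d_img)] for _ in range(n_patches)]
W = [[random.gauss(0, 1) for _ in range(d_txt)] for _ in range(d_img)]

# Stand-in for three text tokens after the embedding lookup.
text_embeds = [[random.gauss(0, 1) for _ in range(d_txt)] for _ in range(3)]

image_embeds = project(image_feats, W)  # the "one line" multiplication
lm_input = image_embeds + text_embeds   # joined along the sequence axis

print(len(lm_input), len(lm_input[0]))  # prints: 7 6
```

The four image rows have the same width as the text embeddings, so the language model can consume them in one sequence, but none of them corresponds to an entry in the vocabulary.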