In naive ML scenarios you are right. You can think of JPEG as an input embedding, one of many. The JPEG/spectral embedding is useful because it already provides a miniature variational encoding that "makes sense" in terms of translation, sharpness, color, scale, and texture.
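To make that concrete, here is a minimal sketch of treating JPEG's underlying transform as an embedding: each 8x8 block is mapped to its DCT coefficient vector. The grayscale input, block size, and SciPy usage are illustrative assumptions, not how a real codec pipeline is wired.

```python
# Sketch: JPEG-style block DCT coefficients as a per-block embedding (no quantization).
import numpy as np
from scipy.fft import dctn

def block_dct_embedding(img: np.ndarray, block: int = 8) -> np.ndarray:
    """One DCT coefficient vector per block, for a grayscale image."""
    h, w = img.shape
    h, w = h - h % block, w - w % block              # crop to a multiple of the block size
    x = img[:h, :w].astype(np.float32) - 128.0       # center pixel values, as JPEG does
    vecs = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            coeffs = dctn(x[i:i + block, j:j + block], norm="ortho")
            vecs.append(coeffs.ravel())              # DC coefficient first, then higher frequencies
    return np.stack(vecs)                            # shape: (num_blocks, block * block)

emb = block_dct_embedding(np.random.randint(0, 256, (64, 64)).astype(np.float32))
print(emb.shape)                                     # (64, 64): 64 blocks x 64 coefficients each
```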
But with clever ML you can design better variational characteristics, such as rotation, or nonlinear things like faces, fingers, projections, and abstract objects.
Further, JPEG encoding/decoding will be an obstacle for many architectures that need gradients to flow back and forth between pixel space and JPEG space in order to run evaluation steps and loss functions defined on pixels (which would be superior). Not to mention cases where you need human feedback in generative scenarios, retouching the output and running training steps on the changed pixels.
And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
>And finally, there are already picture and video embeddings that are gradient-friendly and reusable.
I have been thinking about such things for a while and considered ideas like giving each of the R rows and each of the C columns a vector, and using the inner product of row_i and col_j as pixel (i, j)'s intensity (in the simplest demonstrative case monochromatic, but reordering the floats in each vector before taking the inner product allows many more channels).
But this is just my quick, shallow concoction. If I look at the KonIQ-10k dataset, there are 10373 images at 1024 x 768 totaling 5.3 GB. That's ~511 KB per image. 511 KB / (1024 + 768) ≈ 285 bytes for each row or column. Dividing by 4 for standard floats gives each column and each row a vector of 71 (32-bit) floats. This would use absolutely no prior knowledge about human visual perception, so fitting these float vectors' inner products (and their permutations for different channels) to the image by the most naive metric (average per-pixel residual error) will probably not result in great images. But I'm curious how bad it performs. Perhaps I will try it out in a few hours.
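For what it's worth, a minimal sketch of that fit, assuming a grayscale image, PyTorch, and plain gradient descent on the mean per-pixel squared residual. Note this parameterization is exactly a rank-K matrix factorization, so the best it can do per channel is the truncated SVD of the image.

```python
# Sketch: give each row and each column a 71-dim vector and fit their inner
# products to the image by average per-pixel squared residual (assumed setup).
import torch

H, W, K = 768, 1024, 71                              # image size and floats per row/column vector
img = torch.rand(H, W)                               # stand-in for a real grayscale image in [0, 1]

rows = (0.1 * torch.randn(H, K)).requires_grad_()    # one K-vector per row
cols = (0.1 * torch.randn(W, K)).requires_grad_()    # one K-vector per column
opt = torch.optim.Adam([rows, cols], lr=0.05)

for step in range(2001):
    recon = rows @ cols.T                            # pixel (i, j) = <row_i, col_j>
    loss = (recon - img).pow(2).mean()               # naive average per-pixel residual
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item())
```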
Do you have any references for such or similar simplistic embeddings? I don't want to force you to dig for me, but if you happen to know of a few such papers or perhaps even a review paper that would be welcome!
I'm not aware of this type of simplistic embedding. I think taking the first few layers of a large pretrained vision model will get you better results. Blindly learning from generic images would probably steer it toward the dominant textures and shapes rather than the linear operations of camera moves.
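Taking the early layers of a pretrained model as a reusable, gradient-friendly embedding is only a few lines; the ResNet-18 backbone and the cut-off after layer1 below are arbitrary choices for illustration.

```python
# Sketch: use the first stages of a pretrained ResNet-18 as a generic image embedding.
import torch
import torchvision

weights = torchvision.models.ResNet18_Weights.DEFAULT
backbone = torchvision.models.resnet18(weights=weights)
early = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1,                                 # cut off early: mostly edges and textures
).eval()

with torch.no_grad():
    feats = early(torch.rand(1, 3, 224, 224))        # dummy RGB image batch
print(feats.shape)                                   # torch.Size([1, 64, 56, 56])
```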
The simplest embeddings for vision should focus on camera primitives and invariants: translation, rotation, scale, skew, projections, lighting. It doesn't matter that much what you use in the layers, but you should steer the training with augmented data, e.g. rotate and skew the objects in the batches to make sure the layers are invariant to these things.
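One way to steer the training like that: apply random camera-like transforms to each batch and penalize the embedding for changing under them. The specific transforms and ranges below are illustrative assumptions.

```python
# Sketch: augment batches with camera-like transforms (rotation, skew/shear, scale,
# translation, projection, lighting) to push the embedding toward invariance.
import torchvision.transforms as T

camera_augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2), shear=10),
    T.RandomPerspective(distortion_scale=0.2, p=0.5),   # rough stand-in for projection changes
    T.ColorJitter(brightness=0.2, contrast=0.2),         # simple lighting variation
])

# Typical use inside a training loop: embed both the original and the augmented view
# of each image and penalize the distance between the two embeddings (an invariance loss).
```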
Next are some depth-mapping embeddings which go beyond flat camera awareness.
The best papers I've seen are face embeddings; you can get useful results with smaller models. There are of course deeper embeddings that focus on the whole scene and depth maps, but those are huge.