Short text representations (via good tokenization) significantly reduce the computational cost of a transformer: you need to generate fewer tokens for the same output length, and you need fewer tokens to represent the same window size. I think these combine to roughly n^3 scaling in the token count n (n^2 from attention over the window and n from the number of output tokens generated).
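To make the arithmetic concrete, here is a toy cost model of that estimate. It is only a sketch: the function name `generation_cost` and the token counts are made up for illustration, and it assumes every generated token pays a full quadratic attention cost over the window, ignoring optimizations such as key/value caching that would lower the exponent.

```python
def generation_cost(window_tokens, output_tokens):
    """Toy model (an assumption, not a measurement):
    each generated token costs window_tokens**2 attention work."""
    return output_tokens * window_tokens ** 2


# Hypothetical example: a tokenizer that halves the token count
# for both the context window and the generated output.
baseline = generation_cost(window_tokens=2048, output_tokens=512)
shorter = generation_cost(window_tokens=1024, output_tokens=256)

print(baseline / shorter)  # 8.0, i.e. 2**3
```

Under this model, halving the number of tokens per unit of text cuts the cost by a factor of 8, which is where the cubic estimate comes from.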
For images it's not clear to me if there are any preprocessing methods that do a lot better than resizing the image to a smaller resolution (which is commonly done already).
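As a rough illustration of why resizing helps, the sketch below counts input tokens under the assumption that the image is cut into non-overlapping square patches, one token per patch (as in ViT-style models). The helper `patch_token_count` and the patch size of 16 are hypothetical choices for the example, not something a specific model is known to use here.

```python
def patch_token_count(height, width, patch_size=16):
    """Number of tokens if the image is split into non-overlapping
    patch_size x patch_size patches (assumed ViT-style tokenization)."""
    return (height // patch_size) * (width // patch_size)


full = patch_token_count(512, 512)  # 32 * 32 patches
half = patch_token_count(256, 256)  # 16 * 16 patches

print(full, half, full / half)  # 1024 256 4.0
```

Halving the resolution along each side cuts the token count by 4x, so a cheaper preprocessing method would have to beat that trade-off between tokens and lost detail.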