Don't they what, port their models to run on-device?
I'm not Google, but just look at why they did it for voice: the hardware caught up, the UX is better when it's done locally, and I think they also mentioned better power consumption.
Presumably the same will happen for image recognition at some point. They've already started with Google Lens, which moved OCR and translation onto the device.