@feditips That’s from-scratch text-to-image generation, not captioning.
E.g., VoiceOver runs on iOS and basically describes what you see in your camera app in real time. If it used as much energy as it takes to charge your phone, I imagine you’d notice.
I don’t necessarily know which tools any given piece of software uses to caption images, but most people can run some form of captioning locally without the CPU breaking a sweat.