Contrary to what most commenters assume, the high bandwidth usage is not coming from scraping text, but images. They are pretty clear about it:
> Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.
There are two distinct problems caused by AI scrapers:
1. Bandwidth consumption - that's scrapers downloading multimedia files.
2. CPU resource exhaustion - AI scrapers don't take contextual clues into account. They blindly follow every link they can find, which means they hit a lot of pages that aren't cached but are regenerated on every request. That's things like article history pages, and especially the version-delta (diff) pages. Those are very expensive to generate and are requested so rarely that caching them doesn't make sense.
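To make the second point concrete, here's a minimal cache-aside sketch where only page types with a decent hit rate are worth caching and everything else gets regenerated per request. All names here (`render_page`, `CACHEABLE`) are illustrative, not MediaWiki internals:

```python
# Hypothetical sketch: cache-aside with a whitelist of cacheable page types.
CACHE = {}
CACHEABLE = {"article"}  # high hit rate: worth caching
                         # "diff" and "history" excluded: rarely re-requested

def render_page(page_type, key):
    """Serve from cache when the page type is cacheable, else regenerate."""
    if page_type in CACHEABLE and key in CACHE:
        return CACHE[key], "hit"
    result = f"rendered:{page_type}:{key}"  # stand-in for the expensive work
    if page_type in CACHEABLE:
        CACHE[key] = result
    return result, "miss"
```

A scraper that blindly walks every diff link keeps hitting the bottom path, so every request pays the full rendering cost.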
You want images to be available only to users with a Wikipedia login? That would mean the vast majority of people would no longer see images in Wikipedia articles.
No, I am saying what a lot of other people are: force bots into API access, which can then be authenticated and limited by bandwidth or calls per day, and block bot access to the HTML pages. Nobody loses their images, and bots can no longer eat up the bandwidth.
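The "limited by calls per day" part is the standard per-key rate-limiting idea. A minimal sketch, assuming a token-bucket limiter keyed by API key (names like `check` and the rate/capacity numbers are made up for illustration):

```python
import time

class TokenBucket:
    """Hypothetical per-key limiter: each API key earns `rate` requests
    per second and can burst up to `capacity` requests."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens proportionally to the time elapsed since last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # api_key -> TokenBucket

def check(api_key):
    """Admit or reject one request for the given API key."""
    bucket = buckets.setdefault(api_key, TokenBucket(rate=1.0, capacity=5))
    return bucket.allow()
```

Unauthenticated HTML requests from bots get blocked outright; authenticated API requests pass through `check` and get throttled once they exhaust their budget.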
Have you actually tried blocking these scraper bots? The whole problem is that if you do, they start impersonating normal browsers from residential IPs instead. They actively evade countermeasures.
Isn't everything measures and countermeasures, though?
As far as I am aware there is no such thing as a silver bullet anywhere when it comes to security.
It's like moving your SSH port from port 22 to some other random one. Will it stop advanced scripts from scanning your server and finding it? No, but it sure as hell will cut down the noise from unsophisticated connection attempts, which means you can focus on the tougher ones.
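For reference, that analogy boils down to a one-line change in the OpenSSH server config (the port number here is just an arbitrary example):

```shell
# /etc/ssh/sshd_config - listen on a non-default port instead of 22
Port 50022
```

It filters nothing determined, but it does filter the dumb mass scanners, which is exactly the point being made about scrapers.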