The internet happens to be quite a large source of data and information. As of last year, the web contained 149 zettabytes of data. That's 149 million petabytes, 149 billion terabytes, or 149 trillion gigabytes, otherwise known as a lot. Such a collection of text, image, video, and audio data is irresistible to AI companies that need more data than ever to keep growing and improving their models.
Those lawsuits probably aren't slowing down the AI vacuum machines. In fact, the machines are in desperate need of more data: Last year, researchers found that AI models were running out of data necessary to continue with the current rate of growth. Some projections saw the runway giving out sometime in 2028, which, if true, gives only a few years left for AI companies to scrape the web for data. While they'll look to other data sources, like official deals or synthetic data (data produced by AI), they need the internet more than ever.
The web isn't giving up without a fight
But just because the situation is a bit dire for the internet at large, that doesn't mean it's giving up entirely. On the contrary, there is real opposition to this kind of scraping, especially when it goes after the little guy.
Anubis is the creation of Xe Iaso, a developer based in Ottawa, Canada. As reported by 404 Media, Iaso started Anubis after she discovered an Amazon bot clicking on every link on her Git server. After deciding against taking down the Git server entirely, she experimented with a few different tactics before discovering a way to block these bots entirely: an "uncaptcha," as Iaso calls it.
This isn't something the general web surfer needs to think about. Instead, Anubis is made for the people who run websites and servers of their own. To that point, the tool is totally free and open source, and is in continued development. Iaso tells 404 Media that while she doesn't have the resources to work on Anubis full time, she is planning to update the tool with new features. That includes a new test that doesn't push the end-user's CPU as much, as well as one that doesn't rely on JavaScript, as some users disable JavaScript as a privacy measure.
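The reason the test taxes the end-user's CPU is that it's a proof-of-work style check: the browser must burn a little computation to prove it's worth serving, which is cheap for one human visitor but expensive for a crawler hitting millions of pages. Here's a minimal sketch of that general idea in Python; the challenge format, hash function, and difficulty here are illustrative assumptions, not Anubis's actual implementation.

```python
import hashlib
import secrets

def make_challenge():
    # Server side: hand the visitor a random challenge string and a
    # difficulty target (here, the number of leading zero hex digits
    # the solution hash must have). Both values are assumptions for
    # illustration, not Anubis's real parameters.
    return secrets.token_hex(16), 4

def solve(challenge, difficulty):
    # Client side: brute-force a nonce until the hash meets the target.
    # This is the CPU-burning step a real browser would run in JavaScript.
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1

def verify(challenge, difficulty, nonce):
    # Server side: checking the answer costs one hash, so the server's
    # work stays trivial no matter how many clients it challenges.
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the whole trick: verification is one hash, but solving takes thousands of attempts on average, so a scraper fetching pages at scale pays a compute bill a single human never notices.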
Iaso isn't the only one on the web fighting back against AI crawlers. Cloudflare, for example, is blocking AI crawlers by default as of this month, and will also let customers charge AI companies that want to harvest the data on their sites. Perhaps as it becomes easier to stop AI companies from openly scraping the web, these companies will scale back their efforts—or, at the very least, offer site owners more in return for their data.
My hope is that I run into more websites that initially load with the Anubis splash screen. If I click a link, and am presented with the "Making sure you're not a bot" message, I'll know that site has successfully blocked these AI crawlers. For a while there, the AI machine felt unstoppable. Now, it feels like there's something we can do to at least put it in check.