Very proud to share our cross-institutional work on how shifting patterns of consent on the internet impact AI ✨ .
We find that consent around web data is rapidly evolving 🌐 . As the internet has changed, so have user preferences. In the last year alone, >5% of all tokens (and ~30% of tokens from the top sites) have become restricted for AI training.
**Yet, while restrictions are increasing, widely used protocols like robots.txt are ineffective at expressing this intent.**
Paper: https://lnkd.in/eEjqKU9K
NYT article by Kevin Roose: https://lnkd.in/eiDRFRBu
Much of consent on the internet relies on a decades-old protocol called robots.txt. Originally designed to specify whether search bots were allowed to crawl a site's pages, it has increasingly been leaned on to express preferences about data being used for AI training. However, it places a huge burden on site owners to specify each agent individually, which leads to a patchwork of restrictions and could disproportionately impact access for researchers.
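To see why this is burdensome: a site owner must name every AI crawler one by one, and any agent not listed is unrestricted by default. A minimal sketch (the user-agent tokens below are real crawler names, but the list is illustrative and far from exhaustive):

```
# robots.txt: each AI crawler must be opted out individually.
# A new crawler name means every site must update this file again.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else (including search crawlers) stays allowed.
User-agent: *
Allow: /
```

Every newly launched crawler restarts this process across millions of sites, which is exactly the patchwork of inconsistent restrictions we document.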
I see this as part of a wider research agenda to understand how data informs breakthroughs. AI datasets are no longer static but reflect an evolving internet. Understanding how we shape data, as well as the protocols needed to express consent effectively, is critical work.
If you made it this far -- take a look at the wider research agenda of the Data Provenance Initiative. Earlier this year we released the largest-scale audit of dataset licensing & attribution in AI: https://lnkd.in/eh5mZkF2
A shoutout to Shayne Longpre, who led this initiative, along with many cross-institutional collaborators including Robert Mahari, Ariel Lee, Campbell Lund, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Jad Kabbara, and Alex 'Sandy' Pentland, among many more. 🔥