Discussion is growing about LLM bots and other tools scraping public posts. This raises concerns about privacy, consent, and community safety. IFTAS has compiled community conversations, expert advice, and tools for identifying and blocking scrapers: https://connect.iftas.org/library/tools-resources/web-crawlers-and-scrapers/
For Meta specifically, here's a post from @cuchaz with some tools for setting up firewall-level blocking: https://gladtech.social/@cuchaz/115004304985099620
Also, Meta's user agents include Meta-ExternalFetcher and Meta-ExternalAgent (which is the one they say they use for AI training). https://www.businessinsider.com/meta-web-crawler-bots-robots-txt-ai-2024-8
And for instances that use a separate domain for media, make sure you've got your robots.txt and firewall blocks in place for that domain as well. I've heard from a couple of admins who recently realized they didn't.
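As a rough illustration of the robots.txt side of this, here's a minimal Python sketch that builds rules for the two Meta user agents named above and checks them with the standard library's robots.txt parser. The hostnames (social.example, media.social.example) and the helper names are made up for the example, and this isn't taken from any of the linked tools. Keep in mind robots.txt only helps with crawlers that choose to honor it, which is why the firewall blocks matter too.

```python
from urllib.robotparser import RobotFileParser

# User agent tokens named in the post above.
META_AGENTS = ["Meta-ExternalAgent", "Meta-ExternalFetcher"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows every path for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    return "\n".join(blocks)


def is_blocked(robots_txt, agent, url):
    """Return True if these robots.txt rules disallow `agent` from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


robots_txt = build_robots_txt(META_AGENTS)
print(robots_txt)

# Hypothetical hostnames: the main instance domain and a separate media domain.
# The same file needs to be served from BOTH of them.
for host in ("social.example", "media.social.example"):
    for agent in META_AGENTS:
        url = f"https://{host}/some/path"
        print(host, agent, "blocked" if is_blocked(robots_txt, agent, url) else "allowed")
```

If you already have a robots.txt, you'd add these two blocks to it rather than replacing the file, and then confirm it's actually reachable at /robots.txt on the media domain as well as the main one.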