Discussion is growing about LLM bots and other tools scraping public posts.

sw_isac@mastodon.iftas.org

Discussion is growing about LLM bots and other tools scraping public posts. This raises concerns about privacy, consent, and community safety. IFTAS has compiled community conversations, expert advice, and tools for identifying and blocking scrapers: https://connect.iftas.org/library/tools-resources/web-crawlers-and-scrapers/

thenexusofprivacy@infosec.exchange

Specifically with Meta, here's a post from @cuchaz with some tools to set up firewall-level blocking. https://gladtech.social/@cuchaz/115004304985099620

Also, Meta's user agents include Meta-ExternalFetcher and Meta-ExternalAgent (which is the one they say they use for AI training). https://www.businessinsider.com/meta-web-crawler-bots-robots-txt-ai-2024-8

And for instances that use a separate domain for media, make sure that you've got your robots.txt and firewall blocks in place for that domain as well -- I've heard from a couple of admins who recently realized they didn't.
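
For anyone who wants to sanity-check their rules, here's a rough sketch in Python (standard library only) that builds robots.txt rules for the two Meta user agents mentioned above and confirms a parser reads them as blocked. The helper names `build_robots_txt` and `blocked_agents` are made up for illustration, and this only checks the rules text, not what your server actually serves. Keep in mind that robots.txt is a request crawlers can ignore, which is why the firewall-level blocks still matter.

```python
from urllib.robotparser import RobotFileParser

# User agents named in the thread; extend this list as needed.
META_AGENTS = ["Meta-ExternalAgent", "Meta-ExternalFetcher"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the whole site for each agent.

    Hypothetical helper for illustration, not part of any fediverse software.
    """
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    return "\n".join(blocks)


def blocked_agents(robots_txt, agents, path="/"):
    """Return the agents that the given robots.txt text disallows for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


ROBOTS_TXT = build_robots_txt(META_AGENTS)

if __name__ == "__main__":
    print(ROBOTS_TXT)
    print("Blocked:", blocked_agents(ROBOTS_TXT, META_AGENTS))
```

To check what a live domain serves (main or media), `RobotFileParser` can also fetch the file: call `set_url()` with the robots.txt URL of your own domain, then `read()`, then `can_fetch()` for each agent. Run it once per domain so the media domain doesn't get missed.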

@sw_isac

