As you've probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of "the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's proprietary AI models" -- including quite a few fedi sites.

thenexusofprivacy@infosec.exchange

As you've probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of "the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's proprietary AI models" -- including quite a few fedi sites. Meta denies everything of course, but they routinely lie through their teeth so who knows. In any case, whether or not the specific details in the report are accurate, it's certainly a threat worth thinking about.

So I'm wondering what defenses fedi admins are using today to try to defeat scrapers: robots.txt, user-agent blocking, firewall-level blocking of IP ranges, Cloudflare or Fastly AI scraper blocking, Anubis, stuff you don't want to disclose ... @deadsuperhero has some good discussion on We Distribute, and it would be very interesting to hear what various instances are doing.
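For anybody who hasn't set up the first two of those yet, here's a rough sketch of what robots.txt and user-agent blocking boil down to. The agent names are an illustrative assumption (not taken from the report), `example.social` is a made-up instance, and the helper names are hypothetical. Keep in mind that robots.txt is purely advisory, so it only stops crawlers that choose to honor it; the user-agent check is the sort of rule you'd actually enforce at a reverse proxy, and it only works until a scraper spoofs its header.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list only (an assumption, not from the report): user-agent
# tokens that some AI crawlers are reported to send. Check vendor docs.
AI_CRAWLER_AGENTS = ["meta-externalagent", "GPTBot", "CCBot"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows every path for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    # Everyone not named above stays allowed.
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)


def is_blocked_agent(user_agent_header, agents=AI_CRAWLER_AGENTS):
    """Case-insensitive substring match against the blocklist."""
    header = user_agent_header.lower()
    return any(agent.lower() in header for agent in agents)


robots_txt = build_robots_txt(AI_CRAWLER_AGENTS)

# Parse the generated file the way a well-behaved crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical profile URL on a made-up instance.
profile_url = "https://example.social/@alice"
print(parser.can_fetch("GPTBot", profile_url))
print(parser.can_fetch("Mozilla/5.0", profile_url))
print(is_blocked_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))
```

Substring matching on the header is deliberately crude; it catches crawlers that identify themselves honestly and nothing else, which is why people layer IP-range blocking or something like Anubis on top.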

And a couple of more open-ended questions:

Do you feel like your defenses against scraping are generally holding up pretty well?
Are there other approaches that you think might be promising that you just haven't had the time or resources to try?
Do you have any language in your terms of service that attempts to prohibit training for AI?

Here's @FediPact's post with a link to the Dropsitenews report and (in the replies) a list of fedi instances and CDNs that show up on the list.

:pona_plush: #FediPact :pona_plush: (@FediPact@cyberpunk.lol)

# **LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI (Including Many Fediverse Instances!!!)**

> *"The tech giant is sidestepping guardrails that websites use to prevent being scraped, data show, in a move whistleblowers say is unethical and potentially illegal."*

**ARTICLE:** https://www.dropsitenews.com/p/meta-facebook-tech-copyright-privacy-whistleblower

**FULL PDF:** https://www.dropsitenews.com/api/v1/file/b3555944-e204-4f5e-9a64-e44281b19a82.pdf

#FediPact #meta #threads #AI

cyberpunk dot lol (cyberpunk.lol)

@fediverse @fediversenews

#MastoAdmin #Meta #FediPact

rancidrabbit@anarchism.space

@thenexusofprivacy I'd like to know if they're using the API to scrape or the WebUI. Considering Anubis.

thenexusofprivacy@infosec.exchange

@rancidrabbit good point, thanks. In practice both are potential vectors (as is RSS), so useful to lock them all down.

The Nexus of Discussions
