The Nexus of Discussions

As you've probably seen or heard Dropsitenews has published a list (from a Meta whistleblower) of "the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's proprietary AI models" -- including quite a few fedi sites.

    thenexusofprivacy@infosec.exchange
    wrote last edited by thenexusofprivacy@infosec.exchange
    #1

    As you've probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of "the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's proprietary AI models" -- including quite a few fedi sites. Meta denies everything, of course, but they routinely lie through their teeth, so who knows. In any case, whether or not the specific details in the report are accurate, it's certainly a threat worth thinking about.

    So I'm wondering what defenses fedi admins are using today to try to defeat scrapers: robots.txt, user-agent blocking, firewall-level blocking of IP ranges, Cloudflare or Fastly AI scraper blocking, Anubis, stuff you don't want to disclose ... @deadsuperhero has some good discussion on We Distribute, and it would be very interesting to hear what various instances are doing.

    And a few more open-ended questions:

    • Do you feel like your defenses against scraping are generally holding up pretty well?

    • Are there other approaches that you think might be promising that you just haven't had the time or resources to try?

    • Do you have any language in your terms of service that attempts to prohibit training for AI?

    Here's @FediPact's post with a link to the Dropsitenews report and (in the replies) a list of fedi instances and CDNs that show up on the list.

    https://cyberpunk.lol/@FediPact/114999480874284493

    @fediverse @fediversenews

    #MastoAdmin #Meta #FediPact

      rancidrabbit@anarchism.space
      wrote last edited by
      #2

      @thenexusofprivacy I'd like to know whether they're scraping through the API or the web UI. I'm considering Anubis.

        thenexusofprivacy@infosec.exchange
        wrote last edited by
        #3

        @rancidrabbit good point, thanks. In practice both are potential vectors (as is RSS), so it's useful to lock them all down.
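
For anyone who wants a starting point, here is a minimal sketch of two of the defenses mentioned in this thread: a robots.txt that asks AI crawlers to stay away, and a user-agent check. The function names are hypothetical, and the crawler tokens are only an illustrative, incomplete list that an admin would need to maintain. User-agent strings are trivially spoofed and robots.txt is purely advisory, so this only helps against scrapers that identify themselves honestly.

```python
# Illustrative User-Agent tokens for crawlers associated with AI training.
# This list is an assumption for the example, not a complete or current one.
AI_CRAWLER_TOKENS = ["GPTBot", "CCBot", "ClaudeBot", "meta-externalagent", "Bytespider"]


def build_robots_txt(tokens=AI_CRAWLER_TOKENS):
    """Return robots.txt text asking each listed crawler to avoid the whole site."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


def should_block(user_agent, tokens=AI_CRAWLER_TOKENS):
    """Return True if the User-Agent header contains a listed token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in tokens)


if __name__ == "__main__":
    print(build_robots_txt())
    print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

Since the API, the web UI, and RSS are all potential vectors, a check like `should_block` would presumably need to run in front of all three (for example in a reverse proxy or middleware) rather than on the web pages alone.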
