A user on the online forum 4chan has leaked a massive 270GB of data purportedly belonging to The New York Times. This leak includes what is claimed to be the source code for the newspaper’s digital operations.

  • lurch@sh.itjust.works
    20 days ago

    reminds me of the time someone said “Who is this 4chan?” on TV and it became a meme. good times

  • Autonomous User@lemmy.world
    20 days ago

    We still have no legal right to use, change and share its source code, or to control it, both ourselves and in groups. It’s still anti-libre software.

    • seathru@lemmy.sdf.org
      20 days ago

      Anything that may help develop better adblockers/paywall bypasses or exposes how/what of our personal information is collected is a win in my book. And this may very well be none of those things.

      • Autonomous User@lemmy.world
        20 days ago

        They only exist when we keep them relevant and we already know we can’t prove it’s private but if it helps some people, that’s good.

      • 0xD@infosec.pub
        20 days ago

        Right, because fuck paying for proper journalism. Everything must be free!

        Remind me again, how does that work?

        • noisefree@lemmy.world
          19 days ago

          The inverse of this is where subscription services that previously had no ads for paying subscribers then add in ads on paid plans while also increasing the fees associated. It’s a pretty standard practice, NYT included. Adblocking is necessary.

  • Dark Arc@social.packetloss.gg
    20 days ago

    I doubt this will affect much … that’s a lot more source code than I’d expect though, dang.

    Presumably a lot of it is for internal operations (custom editing software or something of that ilk).

    • Dark Arc@social.packetloss.gg
      20 days ago

      That’s a really silly take … a Paywall is just an authorization mechanism.

      That’s like saying that if the source code of Lemmy leaked, you’d expect your account to be compromised any second.

  • skymtf@pricefield.org
    20 days ago

    I have not read the news in a really long time just cause paywalls are annoying as frick.

  • merthyr1831@lemmy.world
    19 days ago

    270GB feels insane for the source code of a single organisation. Are there media assets or backups in there too?

    EDIT: yep, multiple subsidiaries and Slack comms, which could inflate it by a lot. We post a whole lot of uncompressed shit on our Slack.

      • DudeImMacGyver@sh.itjust.works
        19 days ago

        Yeah, I guess I didn’t consider all the other operational shit that goes into providing content and funding for the website.

        • aStonedSanta@lemm.ee
          19 days ago

          It’s why our PCs have gotten insanely fast but websites still load like fucking trash. All the back end spying shit takes up a ton of CPU cycles. If you don’t already have ’em, run uBlock Origin and NoScript and the internet is so fucking speedy 😆

  • muntedcrocodile@lemm.ee
    19 days ago

    That’s a lot of data, but surely it’s not all their articles, ’cos I’d very much like to train Mixtral 8x7B on it along with 4chan data and shit from the dark web. Surely there is a project where such a model is public and being trained on literally everything regardless of legality.

    EDIT: why am I getting downvoted?

    • reddithalation@sopuli.xyz
      19 days ago

      you’re getting downvoted because LLMs are simply not very good, they consume lots of energy (bad for climate), and seemingly most people involved in AI hype want to replace human creativity or something.

      how about instead of training a not very trustworthy or useful LLM on lots of NYT, 4chan, and “dark web”, you go read lots of NYT, 4chan, and dark web to train your own (much better) model (your brain).

  • Dogyote@slrpnk.net
    19 days ago

    Did this leak happen before or after NYT published an investigation detailing how Israeli forces were raping and torturing defenseless Palestinian detainees brought in from the Gaza Strip?

    • General_Effort@lemmy.world
      18 days ago

      In case anyone missed the hubbub: [ETA: This is from March 2024; unconnected to this hack/leak]

      https://apnews.com/article/new-york-times-wordle-clones-takedown-dmca-35d32b7548f7312ea74a2065b2cd31a6

      The Times has filed several Digital Millennium Copyright Act, or DMCA, takedown notices to developers of Wordle-inspired games, which cited infringement on the Times’ ownership of the Wordle name, as well as its look and feel — such as the layout and color scheme of green, gray and yellow tiles.

      Numerous impacted developers have also taken to social media to share their frustrations. Many said that their games, which range from Wordle-like offerings in other languages to more guessing games, would be taken down as a result.

      Still, Brauneis said he believes the Times’ arguments for Wordle copyright infringement are on “a little bit shaky ground” for several reasons. Rules of a game, for example, are not covered by copyright — and that can include the layout of the game itself, he said.