Reddit signs $60M contract allowing AI company to train its models on the social media platform's content

minnix@lemux.minnix.dev · 9 months ago

Reddit signs $60M contract allowing AI company to train its models on the social media platform's content

SorteKanin@feddit.dk · 9 months ago

Remember the whole “if you aren’t paying for the product, you are the product”?

It wasn’t enough to turn you into a product. Now they also want to turn you into a resource. Farming your comments and posts to feed to an AI model.

What an economy we’ve built.

rhabarba@feddit.de · 9 months ago

I wonder why I don’t pay for Lemmy.

SorteKanin@feddit.dk · 9 months ago

The kind of frightening thing is that anyone could start an instance on the Fediverse, collect all the posts and comments coming in as all instances usually do and then use it to do the same thing, and I’m not sure there’s currently anything (legally or otherwise) stopping them.

But at least we have the option to defederate such an instance. If we can find out which ones do it…

GenderNeutralBro@lemmy.sdf.org · 9 months ago

I totally understand your perspective, but I approach this from the opposite direction.

From my perspective, there’s no “at least” here. My Lemmy posts are public. I have no control over what is done with them after I post them. I am comfortable with that.

The difference between Reddit and Lemmy is not that one protects privacy and the other doesn’t. NEITHER is a platform for private discussion.

The difference is that with Lemmy, public means PUBLIC. Reddit, Twitter, and Facebook are also “public” in the sense that there can be no expectation of privacy. But they’re “private” in the corporate sense — a single corporate entity retains control of the data. They can, at will, restrict access to that data, without the consent of the users who created it.

And that’s not just theoretical; all of those companies have literally restricted access to content that users meant to be public. People can’t read the Twitter posts that I made with the intention of them being public, because Twitter now requires an account to read posts and comments. Reddit has restricted access to posts I made with the intention of them being public and readily accessible, because they killed apps and integrations, and implemented onerous access control in an attempt to hoard my data.

They altered the terms, and I, for one, got sick of praying that they would not alter them further.

Lemmy is public. You cannot control who can read it, and you cannot control what they do with it. The difference is that with a truly public platform like Lemmy, my data can benefit the whole world, instead of just some corporation.

If you are looking for a platform for private discussion, Matrix is probably it. But even then, the concept of data privacy only makes sense if you trust all the people that ever have access to the data. If I’m in a Matrix room with hundreds of strangers, I wouldn’t consider that “private” either, regardless of the protocol’s encryption.

Bad actors will always have access to the posts I make public. On Lemmy, good actors do, too, and nobody can take that away from us. THAT’S the difference.

Scrubbles@poptalk.scrubbles.tech · 9 months ago

This is the right way to think of it. Reddit feels dirty because they were a private company and we trusted them in the walled garden. That trust was naive, at least on my part, but I joined 14ish years ago and they never did wrong, until recently.

Lemmy, however, is a public protocol. From the ground up everything is public. There is no illusion of privacy here, and anyone who thinks there is should forget about it. The protocol is by definition public, and will launch any comment/post across the globe to anyone listening. It’s nailing the paper to the door for everyone to see. To me this is okay though, because I know that going in. The tradeoff is less privacy, but it’s an open platform that no one can take away.

SorteKanin@feddit.dk · 9 months ago

I really like that perspective, thank you for easing my fear.

AeroLemming@lemm.ee · edited 3 months ago

deleted by creator

Kichae@lemmy.ca · 9 months ago

An instance isn’t required. It’s not like the current generation of generative AI wasn’t trained from web scrapings

BlameThePeacock@lemmy.ca · 9 months ago

The instance would likely just act as a regular instance and allow normal users on, you couldn’t even tell they were using it to scrape data at that point.

skillissuer@discuss.tchncs.de · 9 months ago

There are already a few instances that ignore delete requests

BitOneZero@beehaw.org · 9 months ago

Free and open information, like Wikipedia, used to be an ideal. I have used Reddit since 2008 or earlier because it got on search engines and shared information consistently on precise topics. Twitter used to also be this way, but now mostly only puts paid subscribers on search engines.

If you are to organize information around topics, such as a Commodore 64 community, and the protocol openly allows copies to be made via federation, I encourage people to have the attitude that information be treated like Wikipedia content. It sucks now that so much information from 10 years ago has been just entirely lost now that so many deliberately purged their Reddit comments, etc. Tragedy of the commons. And it drags down the entire planet that people squirrel away discussions on topics that are generally public. It’s like now everyone wants to monetize even their discussions on Commodore 64 or automotive repair / have behind absolute control or paywalls /etc.

towerful@programming.dev · 9 months ago

Just join the Commodore 64 discord!
/s

tryptaminev 🇵🇸 🇺🇦 🇪🇺@feddit.de · 9 months ago

I wouldn’t consider this a tragedy of the commons situation. People entrusted reddit to remain a somewhat acceptable company, and reddit betrayed that trust.

People didn’t purge their comments to remove this information from the public, but they purged it from reddit making money off limiting the access to this information.

BitOneZero@beehaw.org · 9 months ago

People didn’t purge their comments to remove this information from the public, but they purged it from reddit making money off limiting the access to this information.

Reddit was always making money off their content. The tragedy is that the common knowledge is destroyed. They didn’t bother to copy it to a public place, they just nuked information and context. The loss is for newcomers on any topic. The result is the same old questions being asked over and over, which is what all social media sites (including Lemmy) thrive on: FRESH content.

Sibbo@sopuli.xyz · 9 months ago

Legally, in EU, you probably cannot scrape an instance of someone else because of the database copyright law. But I have no idea if that applies to being part of the network. Since the other instances send you their content willingly.

Maybe someone should make a license extension to ActivityPub, where instances can communicate what can and what can’t be done with the information they publish. Then at least there would be legal clarity. If it can be enforced is another question.
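To make the idea concrete, here is a rough sketch of what such a license hint could look like on a federated post. Everything license-related in it is made up for illustration: the `contentLicense` property and the `example.org` namespace are not part of ActivityPub or ActivityStreams as far as I know, and `may_train_on` only describes what a cooperating receiver might choose to do, not anything enforceable.

```python
import json

# Hypothetical namespace and property name, invented for this sketch.
# Neither is defined by ActivityPub or ActivityStreams.
LICENSE_NS = "https://example.org/ns/content-license#"


def note_with_license(author_url: str, text: str, license_url: str) -> dict:
    """Build an ActivityStreams-style Note carrying a made-up license hint."""
    return {
        "@context": [
            "https://www.w3.org/ns/activitystreams",
            # Maps the invented term to an IRI so JSON-LD consumers can resolve it.
            {"contentLicense": LICENSE_NS + "contentLicense"},
        ],
        "type": "Note",
        "attributedTo": author_url,
        "content": text,
        "contentLicense": license_url,
    }


def may_train_on(note: dict, allowed_licenses: set) -> bool:
    """Check a cooperating receiver could run; a missing license counts as 'no'."""
    return note.get("contentLicense") in allowed_licenses


note = note_with_license(
    "https://feddit.example/u/alice",  # placeholder actor URL
    "Hello fediverse",
    "https://creativecommons.org/licenses/by-nc/4.0/",
)
print(json.dumps(note, indent=2))
```

A receiving server that ignores the extra field still gets a perfectly normal Note, which is exactly why the enforcement question stays open.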

Kichae@lemmy.ca · 9 months ago

The thing is, the license probably doesn’t mean a whole lot in that case because of the way content is shared on the Fediverse.

As you say, you actively send your content to other websites, and licenses need at least some degree of active acceptance. Including a license field in the metadata almost certainly does not meet any kind of legal threshold. It’s significantly weaker than the EULAs that everyone knows nobody reads.

Sibbo@sopuli.xyz · 9 months ago

I would think that subscribing to a community could be coupled to a license. Servers do not randomly send data, they only send it to other servers that are subscribed. And a server could technically decline a subscription.

But anyways, by default, copyright is with the creator. No idea what that looks like in legislation around the world, but if I remember correctly, in the EU, just because you give someone a copy of e.g. a song you wrote does not actually mean they can do what they want with it. By default, you have all the rights, and you need to grant them to someone else. So if you also give that someone a contract stating that they can play it in front of an audience, then they can; otherwise they cannot.

However, I am not sure how much implied consent can play a role here. By posting something on a fediverse instance, since the purpose of the fediverse is to share these posts with other servers, then by posting you may implicitly agree to this data being shared, and the next server can share it with another server again, and so on. This is the basic “boost” functionality of mastodon.

I believe though that because the purpose of the fediverse is not explicitly to train AI models or to sell the posts to someone else, it may be illegal to scrape all posts off to feed e.g. an AI model. But may also not be. We will never know until someone starts doing it and someone else sues them.

Kichae@lemmy.ca · 9 months ago

The thing is, servers don’t subscribe to anything, users do. If the end user is provided with a license, the server is not obligated to honour it, because the server didn’t agree to shit.

rhabarba@feddit.de · 9 months ago

The content posted here has no obvious license. I wonder if an administrator could just put any license of his choice on your posts.

BitOneZero@beehaw.org · 9 months ago

people joined basically with no terms of service on a lot of Lemmy instances.

RobotToaster · 9 months ago

People can already do that without an instance, the same way google indexes the site.

lemmyingly@lemm.ee · 9 months ago

If an instance is defederated, the owners can just spin up a new instance.

I’ve always thought about what you’ve said about Lemmy when people start talking about how Lemmy is more privacy focused than Reddit.

As one of your replies has said, many people, in the hundreds or thousands, have a copy of your data on Lemmy: the instance owners. If you decide you’ve shared too much information then you end up asking every owner to delete that nugget of information. And realistically there is nothing to enforce it. This is one benefit of the walled garden of places like Reddit because they are legally obligated to delete the information especially in places like the EU.

SorteKanin@feddit.dk · 9 months ago

This is one benefit of the walled garden of places like Reddit because they are legally obligated to delete the information especially in places like the EU.

In theory yes, but anyone can also scrape reddit for all its posts and comments (and someone likely is). And nobody is making them delete the data. And then there’s stuff like the Internet archive complicating stuff further.

lemmyingly@lemm.ee · 9 months ago

Whilst it’s true that anyone can scrape data off Reddit, I think it’s more of a pain, since before the API updates the rate limit was 2 API calls per second. You also have to find or create a scraper. With Lemmy, you follow the instructions (copy and paste) on join-lemmy.org to create your instance and you’re done. With both methods you have to configure it to subscribe to communities, so they’re about the same.

In the EU at least there is a right to be forgotten, so yeah, Reddit and other platforms are forced to delete the data on request. I’m not sure how the same can be applied to a distributed network like Lemmy.

There were publicly available archives of Reddit. The last time I checked, you couldn’t find the latest submissions and comments. Maybe things have changed, maybe newer alternatives have appeared.

tryptaminev 🇵🇸 🇺🇦 🇪🇺@feddit.de · 9 months ago

The right to be forgotten only applies to personal information, i.e. information that can be used to identify you, or that can be associated with such information.

Since you usually have an email for signup, that would make the data fall under personal information. But reddit could just delete the email address and your user name and show something like:

[deleted]
When does the Narwhal bacon?

And well, it is pretty difficult to find out if, when and where there are backups that still contain your information and could be given to the AI model trainers too. To find these things out, we’d need a precedent-setting case that makes a data protection agency investigate reddit thoroughly.

lemmyingly@lemm.ee · 9 months ago

It’s all of the data or just the data that associates content with you, the latter if the company has a genuine reason to keep the content, which a forum generally does.

If the content cannot be associated with you then does it matter if the content is present on the website?

Kichae@lemmy.ca · 9 months ago

Creating a new instance only gets you access to content that users of your instance have subscribed to, and then mostly only content that comes in after subscription (I believe Lemmy primes the pump a bit on community subs, pulling in a handful of posts at the time of discovery, but discovery is done by users). So, there’s a limit on what you can scrape with your own private instance, and you’re taking a bit of a bet on which communities will yield what you’re looking for in the future.

It’d be easier and more reliable to just crawl the network and scrape it the old-fashioned way.

lemmyingly@lemm.ee · 9 months ago

"If you search for a community first time, 20 posts are fetched initially. Only if a least one user on your instance subscribes to the remote community, will the community send updates to your instance. Updates include:

New posts, comments
Votes
Post, comment edits and deletions
Mod actions"

So you create a single user and subscribe to all communities of interest.

I probably downplayed the difficulty of setting up a Lemmy instance, which shows up if you do something out of order or don’t quite have the host set up correctly. I do still think it’s easier than pigging about with web crawlers, though.
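For anyone who wants to see the quoted behaviour spelled out, here is a toy model of it in Python. It is not Lemmy code and touches no real API: the class names are invented, and the choice to hand over the most recent 20 posts on discovery is my assumption, since the quote only says that 20 posts are fetched.

```python
from dataclasses import dataclass, field

INITIAL_FETCH = 20  # number taken from the quoted docs


@dataclass(eq=False)
class Instance:
    """Toy stand-in for a Lemmy instance; only tracks which posts it holds."""
    domain: str
    store: dict = field(default_factory=dict)

    def search(self, community: "RemoteCommunity") -> None:
        # First discovery pulls a fixed handful of posts (assumed: the latest ones).
        self.store.setdefault(community.name, list(community.posts[-INITIAL_FETCH:]))

    def subscribe(self, community: "RemoteCommunity") -> None:
        # Models "at least one local user subscribed": updates start flowing.
        self.search(community)
        if self not in community.subscribers:
            community.subscribers.append(self)

    def receive(self, community_name: str, post: str) -> None:
        self.store.setdefault(community_name, []).append(post)


@dataclass(eq=False)
class RemoteCommunity:
    """Toy stand-in for a community hosted on another instance."""
    name: str
    posts: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)

    def publish(self, post: str) -> None:
        self.posts.append(post)
        # Updates only go to instances that have a subscriber.
        for instance in self.subscribers:
            instance.receive(self.name, post)


c64 = RemoteCommunity("commodore64")
for i in range(50):
    c64.publish(f"post {i}")

scraper = Instance("scraper.example")
scraper.search(c64)          # holds 20 of the 50 existing posts
c64.publish("post 50")       # missed: nobody on the instance is subscribed yet
scraper.subscribe(c64)
c64.publish("post 51")       # delivered
print(len(scraper.store["commodore64"]))
```

In this model the instance ends up with the initial batch plus whatever arrives after subscribing, so the older backlog and anything posted before the subscription never show up, which is the limit Kichae was pointing at.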

lemmyingly@lemm.ee · edited 9 months ago

deleted by creator

Creesch@beehaw.org · edited 9 months ago

At least for the instance this was posted on: the February 2024 Beehaw Financial Update

Scrubbles@poptalk.scrubbles.tech · 9 months ago

You don’t have to, but the owners of your instance are probably paying out of pocket to keep it online. I’m sure they’re taking donations

👍Maximum Derek👍@discuss.tchncs.de · 9 months ago

That’s why I’m on Lemmy. At least when they train AI on my posts here it’s not legitimized by some contract.

FIash Mob #5678@beehaw.org · 9 months ago

That AI is going to get really racist, really fast, judging by the muck we all saw daily on Reddit.

Echo Dot@feddit.uk · 9 months ago

Although it’s going to be really good at anime porn too. So there’s that.

FIash Mob #5678@beehaw.org · 9 months ago

If that’s your thing, then hell yeah brother!

Lojcs@lemm.ee · 9 months ago

Damn just 60 mil??

dmrzl@programming.dev · 9 months ago

Like seriously, this must be fake. Add a zero and I’d still find it suspiciously cheap.

Evil_Shrubbery@lemm.ee · 9 months ago

Yeah, the diarrhea of my shitposts over there alone is worth more, it’s what will make the future AI kinda smart & very depressed.

DeltaTangoLima@reddrefuge.com · 9 months ago

And that’s why I deleted all my posts and comments before deleting my account. Sure, they could probably go back and restore it if they wanted but, so far, they haven’t.

Glad I landed here on Lemmy.

Phen@lemmy.eco.br · 9 months ago

I deleted all my comments last year. Recently I got a notification for a response to one of those comments. When I clicked the notification link, my comment and the response were visible. The comment doesn’t show up in my profile.

DeltaTangoLima@reddrefuge.com · edited 9 months ago

Interesting. I’ve specifically searched for some fairly unique content (Python scripts, etc) I posted in my time over there, and it hasn’t shown up at all.

So you left your Reddit account intact?

Edit: Fucking. Cunts. I just searched (had been a few months) and at least some of my data is back. I reckon they’ve done it ahead of the planned AI move and IPO.

Edit 2: joke’s on them - my posts were linked to an alt account I set up on Pastebin years ago. Still had the creds, so have deleted the pastes. Fuck Reddit. 🤘

thatsnothowyoudoit@lemmy.ca · edited 9 months ago

Reddit was aggressively rate limiting tools used to delete and edit content in a funny way when the API pricing was announced. The API wouldn’t return an error, the rate limiting was silent, and the tools would report successful deletion or edits even when the edit or deletion wasn’t made.

I had to modify an existing script to handle the 5-second rate limit and, in lieu of deleting, I just rewrote each comment with a farewell.

Even then I did 3 passes (minor additional edits) in case Reddit was saving previous edits.

My content has stayed edited.
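For the curious, the general shape of that approach (wait between calls, then re-read each comment rather than trusting the API's success response) can be sketched roughly like this. This is not the script mentioned above, and it does not call any real Reddit API: `edit_comment` and `fetch_body` are hypothetical callables you would have to supply yourself, and here the repeat passes are used as retries for edits that silently failed, which is a slightly different use than the extra edits described.

```python
import time
from typing import Callable, Iterable, List


def overwrite_with_verification(
    comment_ids: Iterable[str],
    edit_comment: Callable[[str, str], None],  # hypothetical: submits an edit
    fetch_body: Callable[[str], str],          # hypothetical: re-reads the live comment
    farewell: str,
    delay_seconds: float = 5.0,                # the 5-second limit mentioned above
    max_passes: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Overwrite each comment, then confirm the edit actually stuck.

    Returns the ids that still do not show the farewell text after max_passes.
    """
    pending = list(comment_ids)
    for _ in range(max_passes):
        if not pending:
            break
        still_pending = []
        for comment_id in pending:
            edit_comment(comment_id, farewell)
            sleep(delay_seconds)  # stay under the rate limit before the next call
            # Do not trust a "success" response; check what is really stored.
            if fetch_body(comment_id) != farewell:
                still_pending.append(comment_id)
        pending = still_pending
    return pending
```

Anything left in the returned list is a comment the service kept ignoring, which is worth knowing before you delete the account.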

dubyakay@lemmy.ca · edited 9 months ago

Do you still have the Python script available?

I was fine with keeping my comments up before for the future searchers, but I’m not fine with that shithole making profit off of it.

Hubi@feddit.de · 9 months ago

I recently used shreddit with the --gdpr-export-dir flag and it worked perfectly.

thatsnothowyoudoit@lemmy.ca · 9 months ago

DM’ed you the link.

Reason: personal GitHub account.

Hubi@feddit.de · 9 months ago

I’ve had the same experience. Most scripts just erase the comments available directly through your reddit profile, which is limited to the most recent ~2000 posts that you’ve made. To fully erase anything and everything, you need to request all your data from reddit, download the .zip and feed it into an application like shreddit.

Skull giver@popplesburger.hilciferous.nl · edited 9 months ago

deleted by creator

Echo Dot@feddit.uk · 9 months ago

Presumably most of the current AI models have already had access to reddit data in the past, so I am a bit confused about why they would pay 60 million for it now.

bevan@lemmy.nz · 9 months ago

Yep used ‘power delete suite’ to delete everything before I left.

DeltaTangoLima@reddrefuge.com · 9 months ago

Well, I just discovered a bunch of my stuff had been restored. Says deleted account, but it’s there.

blackstrat@lemmy.fwgx.uk · 9 months ago

Deleting your account doesn’t delete your content AFAIK.

DeltaTangoLima@reddrefuge.com · 9 months ago

I was saying elsewhere I deleted all my content before deleting my account, but now some of my content is back.

Echo Dot@feddit.uk · 9 months ago

I don’t think I ever actually bothered deleting my content because I suspected that they would just do something like that anyway.

FarFarAway@startrek.website · 9 months ago

Supposedly, if you deleted it during the blackouts… any sub that was down at the time of deletion, didn’t delete comments.

sunbeam60@lemmy.one · 9 months ago

I suspect Reddit holds a perfect copy of every edit you’ve ever done, including the first. For legal reasons if nothing else. Now also to prevent perfectly good AI training content from being deleted.

soggy_kitty@sopuli.xyz · edited 9 months ago

deleted by creator

DaleGribble88@programming.dev · 9 months ago

Yeah! Here, no one gets paid when someone else wants to profit off of all the free user generated content. Wait, what was our goal again?

Evil_Shrubbery@lemm.ee · edited 9 months ago

Just in time to make new AI generated shitposts with AI generated replies & pump up those numbers for the IPO.

Can’t wait to read a post about how a novice AI finds it hard to animate human hands and some other AI suggests studying hentai porn to get the finger/tentacle movements just right. And ofc lots of ads. From AIs, to AIs, by AIs, for AIs.

Lemmy_2019@lemmy.one · 9 months ago

r/TotallyNotRobots is spreading everywhere.

Evil_Shrubbery@lemm.ee · 9 months ago

Reddit is run by pigeons and other birds/drones, confirmed. Actually we always knew that.

bilboswaggings@sopuli.xyz · 9 months ago

Trained on 99% reposts

Hubi@feddit.de · edited 9 months ago

And the outputs of bots. There has been a shocking increase in auto-generated comments on reddit in the past years and it’s turning the training data into a minefield.

nul@programming.dev · 9 months ago

Haven’t touched reddit socially in 8 months, but every now and then I’ll use it to search for opinions or instructions on things. Searched “reddit best domain registrar” recently and landed on a thread where top to bottom, every comment recommending a registrar was from a bot and/or banned account. No real person testimonials, all ads. And as AI implementations improve, that’s going to get harder to spot. In the meantime, I’m formatting searches like “best domain registrar lemmy” because reddit is legit that bad rn.

sabreW4K3@lemmy.tf · 9 months ago

We all knew it was coming, but it’s still disappointing

DragonTypeWyvern@literature.cafe · 9 months ago

Funny, I don’t see anyone saying the AI companies have free right to Reddit’s content.

Natanael@slrpnk.net · 9 months ago

Can users opt out? Because the content belongs to the users.

tryptaminev 🇵🇸 🇺🇦 🇪🇺@feddit.de · 9 months ago

My layman’s understanding is that they include it in the TOS, and your only option would be to leave the platform and demand that they delete all your content, which they may or may not do. E.g. they could just train the AI on an older backup. Good luck getting your rights recognized and abided by.

And009@lemmynsfw.com · 9 months ago

It doesn’t, as soon as you post on reddit it becomes ‘content’ on their social media.

Kichae@lemmy.ca · 9 months ago

No, the user owns it, but by creating an account you provide Reddit a license to use that content in certain ways.

So, it’s yours, but you’ve agreed to let them do whatever they want with it as if it’s theirs, too.

And009@lemmynsfw.com · 9 months ago

Yes, as we left reddit, the option to delete everything and leave a memorable ‘fuck u/spez’ was always ours.

jarfil@beehaw.org · 9 months ago

The content belongs to users… they just license it to Reddit, for Reddit to do as it pleases:

https://www.redditinc.com/policies/user-agreement

soggy_kitty@sopuli.xyz · edited 9 months ago

Good point. People are only loud about something if it directly affects them.

fine_sandy_bottom@discuss.tchncs.de · 9 months ago

$60m doesn’t seem like that much in an era where twitter could (have been) sold for $40b.

mob@sopuli.xyz · 9 months ago

60 million a year for access to the relatively public data… That seems pretty good to me tbh.

fine_sandy_bottom@discuss.tchncs.de · 9 months ago

Maybe, but with people saying reddit’s main value proposition is access to AI training data, and that reddit is worth n billion dollars, $60m seems like a pittance.

mob@sopuli.xyz · 9 months ago

It’s just an API, right?

fine_sandy_bottom@discuss.tchncs.de · 9 months ago

No, it’s really not.

Firstly, while the data may be public, it’s not “free”. Scraping reddit and using it to train an AI would likely contravene their terms of use, you’d end up facing similar copyright issues that the current generation of bots has.

Secondly, scraped data would be incomplete, you wouldn’t get anything edited or “deleted”, which would surely be available if you paid them. The edits and deletes would be very valuable for AI training.

Thirdly, you would get the metadata that reddit has: geolocation, user agent, alt accounts, browsing habits, et cetera.

Fourthly, you wouldn’t get exclusivity. Locking out a competitor is worth something.

mob@sopuli.xyz · 9 months ago

Idk why you are talking about scraping when I said API?

And is all that information in the training contract?

fine_sandy_bottom@discuss.tchncs.de · 9 months ago

I assumed that when you said “it’s just an API” you were saying you’re paying $60m for an API as opposed to scraping for free.

Is all what information in the training contract?

wargreymon2023@sopuli.xyz · edited 9 months ago

$44B was a bad deal, good luck looking for another Elon Musk 😜

kib48@lemm.ee · 9 months ago

so the API thing was over nothing? brilliant

Natanael@slrpnk.net · 9 months ago

No, it was just preemptive to enforce control over who can programmatically read the site

RobotToaster · 9 months ago

We did it reddit, we trained an AI to be the pure embodiment of cringe.

TheRadiatorIsWarm@sopuli.xyz · 9 months ago

Add the bot problem to it and you’ll get garbage in, garbage out

Echo Dot@feddit.uk · 9 months ago

Hell even the users didn’t exactly contribute good quality content.

comicallycluttered@beehaw.org · edited 9 months ago

Lol, so they’re going to be training their AI on… AI generated content? The uptick in that shit on reddit has made it more annoying than usual.

That and all the confidently incorrect shit on the site… Not to mention the constant in-jokes. I’m just imagining a chatbot responding to something about how to deal with grief with “I also choose this man’s dead wife!”

Can’t see how this could possibly go wrong.

unknowing8343@discuss.tchncs.de · 9 months ago

They are gonna love it when their chatbot also chooses that man’s dead wife.

🇰 🌀 🇱 🇦 🇳 🇦 🇰 ℹ️@yiffit.net · edited 9 months ago

There’s gonna be so many bots commenting “Actually…” Followed by the most incorrect information about the topic at hand possible.