That’s the “prover” dataset, i.e. the evaluation dataset mentioned in the articles I linked you to. It’s for checking the output; it is not the training data.
It’s also 20 MB, which is minuscule not just for a training dataset but even by the standard of what you seem to think is a “huge data file” in general.
You really need to stop digging and admit this is one more thing you have a surface-level understanding of.
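For a sense of scale on the 20 MB point, here is a back-of-the-envelope sketch. The bytes-per-token figure and the corpus size are illustrative assumptions, not numbers published for DeepSeek or taken from this thread; only the 20 MB comes from the discussion above.

```python
# Rough scale comparison. Every constant except the 20 MB file size
# is an illustrative assumption, not a published figure.
FILE_SIZE_BYTES = 20 * 1000**2      # the ~20 MB file discussed above
BYTES_PER_TOKEN = 4                 # assumed average for English-like text
ASSUMED_CORPUS_TOKENS = 10**12      # hypothetical pretraining corpus: 1 trillion tokens


def approx_tokens(size_bytes: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimate a token count from raw size under the bytes-per-token assumption."""
    return size_bytes // bytes_per_token


file_tokens = approx_tokens(FILE_SIZE_BYTES)
ratio = ASSUMED_CORPUS_TOKENS // file_tokens

print(f"~{file_tokens:,} tokens in the file")
print(f"assumed corpus is ~{ratio:,}x larger")
```

Swap in whatever corpus size you think is realistic; the gap stays at several orders of magnitude for any plausible value.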
Do show me a published data set of the kind you’re demanding.
Since you’re definitely asking this in good faith and not just downvoting and making nonsense sealion requests in an attempt to make me shut up, sure! Here’s three.
https://commoncrawl.org/
https://github.com/togethercomputer/RedPajama-Data
https://huggingface.co/datasets/legacy-datasets/wikipedia/tree/main/
Oh, and it’s not me demanding. It’s the OSI defining what an open source AI model is. I’m sure once you’ve asked all your questions you’ll circle back around to whether you disagree with their definition or not.
Thank you for posting those links. While I’m not sure the person you replied to was asking in good faith, I wanted to see an example myself after reading the discussion.
Seems like even if it’s not fully open source, it’s a step in the right direction in a world where terms like “open” and “non-profit” have been co-opted by corporations and have lost their original meaning.
It’s certainly better than “Open”AI being completely closed and secretive with their models. But as people have discovered in the last 24 hours, DeepSeek is pretty strongly trained to be protective of the Chinese government policy on, uh, truth. If this were a truly Open Source model, someone could “fork” it and remake it without those limitations. That’s the spirit of “Open Source” even if the actual term “source” is a bit misapplied here.
As it is, without the original training data, an attempt to remake the model would have the issues DeepSeek themselves had with their “zero” release where it would frequently respond in a gibberish mix of English, Mandarin and programming code. They had to supply specific data to make it not do this, which we don’t have access to.
So you found a legacy data set that was released nearly a year ago as your best example. Thanks for proving my point. And since you obviously know what you’re talking about, do explain to the class what stops people from using these data sets to train a DeepSeek model?
The most recent crawl is from December 15th:
https://commoncrawl.org/blog/december-2024-crawl-archive-now-available
You don’t know, and can’t know, when DeepSeek’s dataset is from. Thanks for proving my point.
What I do know is that you can take the DeepSeek model and train it on this open crawl to get a fully open model. I love how you ignored this part in your reply, being the clown that you are.
I ignored the bit you edited in after I replied? And you’re complaining about ignoring questions in general? Do you disagree with the OSI definition Yogsy? You feel ready for that question yet?
What on earth do you even mean “take a model and train it on this open crawl to get a fully open model”? This sentence doesn’t even make sense. Never mind that that’s not how training a model works; let’s pretend it is. You understand that adding open source data to closed source data wouldn’t make the closed source data less closed source, right?... Right?
Thank fuck you’re not paid real money for this, Yiggly, because they’d be looking for their dollars back.
Why would you lie about something with timestamps? I edited 18 min ago, and you replied 17 min ago. 🤡
I already answered this question earlier in the thread, but clearly your reading comprehension needs some work.
I’m talking about taking the code that DeepSeek released publicly, and training it on the open source data that’s available. That’s what model training is. The fact that this needs to be spelled out for you is amazing.
What closed source data are you talking about? Nobody is suggesting this.
You sound upset there, little buddy. I guess misspelling my handle was the peak insult you could muster. Really showing your intellectual prowess there, champ.
I take more than a minute on my replies, Autocorrect Disaster. You asked for information and I treat your request as genuine because it just leads to more hilarity, like you describing a model as “code”.