Using supervised fine-tuning (SFT) to introduce even a small amount of relevant data to the training set can often lead to strong improvements in this kind of “out of domain” model performance. But the researchers say that this kind of “patch” for various logical tasks “should not be mistaken for achieving true generalization. … Relying on SFT to fix every [out of domain] failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability.”
Rather than showing the capability for generalized logical inference, these chain-of-thought models are “a sophisticated form of structured pattern matching” that “degrades significantly” when pushed even slightly outside of its training distribution, the researchers write. Further, the ability of these models to generate “fluent nonsense” creates “a false aura of dependability” that does not stand up to a careful audit.
As such, the researchers warn heavily against “equating [chain-of-thought]-style output with human thinking” especially in “high-stakes domains like medicine, finance, or legal analysis.” Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond “surface-level pattern recognition to exhibit deeper inferential competence,” they write.
I do get it. And that’s why I’m disdainful towards all this “simulated reasoning” babble.
Emphasis mine: that “near” is a sleight of hand.
It doesn’t really matter if it’s hitting “near” or “far”; in both cases someone will need to stop the brick-throwing machine, get into the construction site (as if building a house manually), place the brick in the correct location (as if building a house manually), and then redo operations as usual.
In other words, “hitting near the target” = “failure to hit the target”.
And it’s obvious why it’s wrong; the idea that an auto-builder should throw bricks is silly. It should detect where the brick should be placed, and lay it down gently.
The same thing applies to those large token* models; they won’t reach anywhere close to reasoning, just like a brick-throwing machine won’t reach anywhere close to an automatic house builder.
*I’m calling it “large token model” instead of “large language model” to highlight another thing: those models don’t even model language fully, except in the brain of functionally illiterate tech bros who think language is just a bunch of words. Semantics and pragmatics are core parts of a language; you don’t have language if utterances don’t have meaning or purpose. The nearest thing LLMs do is to plop in some mislabelled “semantic supplement” - because it’s a great red herring (if you mislabel something, you’re bound to get suckers confusing it with the real thing, and saying “I dun unrurrstand, they have semantics! Y u say they don’t? I is so confusion… lol lmao”).
If the machine relies on you to be an assumer (i.e. to make shit up, like a muppet), there’s already something wrong with it.
To be blunt, that stinks of “wishful thinking” from a distance.
As I implied in the other comment (“Can house construction be partially automated? Certainly. Perhaps even fully. But not through a brick-throwing machine.”), I don’t think reasoning algorithms are impossible; but it’s clear LLMs are not the way to go.
I think the brick that is the point of this parody sailed right over your head.😁
If it is not a parody, the user got a serious answer. And if it is, I’m just playing along ;-)
(If it is a parody, it’s so good that it allows me to actually answer it as if it wasn’t.)
It is most definitely satire, but that doesn’t mean your comments aren’t worth reading.
And you should see the therapeutic effects of brick throwing and the very promising health applications.
You would be amazed at what you can achieve with a well-thrown brick.
Sorry, I just got carried away in your analogy, like the proverbial brick thrown into the air by a large machine that is always very precisely almost often sometimes hitting its target.
I should apologise - I didn’t catch right off the bat that you were playing along with the analogy.