OpenAI Insider Estimates 70 Percent Chance That AI Will Destroy or Catastrophically Harm Humanity

floofloof@lemmy.ca · 9 months ago

Lvxferre [he/him] · 9 months ago

I don’t think that a different training scheme or integrating it with already existing algos would be enough. You’d need a structural change.

I’ll use a silly illustration for that; it’s somewhat long so I’ll put it inside spoilers. (Feel free to ignore it though - it’s just an illustration, the main claim is outside the spoilers tag.)

The Mad Librarian and the Good Boi

Let’s say that you’re a librarian, and you have lots of books to sort out. So you want to teach a dog to sort books for you, starting with sci-fi and geography books.

So you set up the training environment: a table with a sci-fi book and a geography book. And you give your dog a treat every time that he puts the ball over the sci-fi book.

At the start, the dog doesn’t do it. But then as you train him, he’s able to do it perfectly. Great! Does the dog now recognise sci-fi and geography books? You test this out, by switching the placement of the books, and asking the dog to perform the same task; now he’s putting the ball over the geography book. Nope - he doesn’t know how to tell sci-fi and geography books apart, you were “leaking” the answer by the placement of the books.

Now you repeat the training with a random position for the books. Eventually after a lot of training the dog is able to put the ball over the sci-fi book, regardless of position. Now the dog recognises sci-fi books, right? Nope - he’s identifying books by the smell.

To fix that you try again, with new versions of the books. Now he’s identifying the colour; the geography book has the same grey/purple hue as grass (from a dog PoV), the sci-fi book is black like the neighbour’s cat. The dog would happily put the ball over the neighbour’s cat and ask “where’s my treat, human???” if the cat allowed it.

Needs more books. You assemble a plethora of geo and sci-fi books. Since the sci-fi books typically tend to be dark, and the geo books tend to have nature on their covers, the dog is able to place the ball over the sci-fi books 70% of the time. Eventually you give up and say that the 30% error is the dog “hallucinating”.

We might argue that, by now, the dog should be “just a step away” from recognising books by topic. But we’re just fooling ourselves, the dog is finding a bunch of orthogonal (like the smell) and diagonal (like the colour) patterns. What the dog is doing is still somewhat useful, but it won’t go much past that.

And, even if you and the dog lived forever (denying St. Peter the chance to tell him “you weren’t a good boy. You were the best boy.”), and spent most of your time on that training routine, his little brain wouldn’t be able to create the associations necessary to actually identify a book by its topic, that is, by its content.
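The dog’s shortcuts can be sketched in a few lines of Python. This is only a toy under made-up assumptions: the data, the `learn_rule` and `accuracy` helpers and the feature names are all hypothetical, and the “learner” just memorises which surface feature went with which topic. It is not how any real model is trained; it only shows how a rule that is perfect in training can fall apart once the leaked cue changes.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each book is (table_position, cover_colour, topic).
# The dog never sees the topic directly, only the surface features.
train = [("left", "dark", "sci-fi")] * 3 + [("right", "green", "geography")] * 3

def learn_rule(books, feature):
    """Map each value of one surface feature to the topic most often seen with it."""
    counts = defaultdict(Counter)
    for book in books:
        counts[book[feature]][book[2]] += 1
    return {value: seen.most_common(1)[0][0] for value, seen in counts.items()}

def accuracy(rule, books, feature):
    """Fraction of books whose topic the rule guesses correctly."""
    hits = sum(rule.get(book[feature]) == book[2] for book in books)
    return hits / len(books)

POSITION, COLOUR = 0, 1
position_rule = learn_rule(train, POSITION)
colour_rule = learn_rule(train, COLOUR)

# Same two books, swapped places: the position shortcut collapses.
swapped = [("right", "dark", "sci-fi"), ("left", "green", "geography")]

# A bigger shelf where 3 of the 10 covers break the colour pattern.
shelf = (
    [("left", "dark", "sci-fi")] * 4
    + [("right", "green", "geography")] * 3
    + [("left", "green", "sci-fi")] * 2
    + [("right", "dark", "geography")]
)

print(accuracy(position_rule, train, POSITION))    # 1.0
print(accuracy(position_rule, swapped, POSITION))  # 0.0
print(accuracy(colour_rule, shelf, COLOUR))        # 0.7
```

Neither rule ever looks at what the books are about; the 70% on the bigger shelf comes entirely from how often cover colour happens to line up with topic.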

I think that what happens with LLMs is a lot like that. With a key difference - dogs are considerably smarter than even state-of-the-art LLMs, even if they’re unable to speak.

At the end of the day LLMs are complex algorithms associating pieces of words, based on statistical inference. This is useful, and you might even see some emergent behaviour - but they don’t “know” stuff, and this is trivial to show, as they fail to perform simple logic even with pieces of info that they’re able to reliably output. Different training and/or algo might change the info that it’s outputting, but they won’t “magically” go past that.
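To make “associating pieces of words, based on statistical inference” a bit more concrete, here is a deliberately crude sketch: a bigram counter that predicts the next word from the previous one. The corpus and the `predict_next` helper are made up for illustration, and a real LLM is vastly more complex (a neural network over subword tokens with long context), so treat this as an analogy for the idea of statistical association, not as a description of the actual algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training corpus".
corpus = "the dog sorts the books and the dog gets the treat".split()

# Count which word follows which.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word`, or None if it was never followed."""
    options = follows.get(word)
    if not options:
        return None
    return options.most_common(1)[0][0]

print(predict_next("the"))    # dog
print(predict_next("books"))  # and
print(predict_next("cat"))    # None
```

The counter will happily continue “the” with “dog” because that pairing was the most frequent one, without any notion of what a dog is.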

CanadaPlus@lemmy.sdf.org · edit-2 9 months ago

Chinese room, called it. Just with a dog instead.

I have this debate so often, I’m going to try something a bit different. Why don’t we start by laying down how LLMs do work. If you had to explain as fully as you could the algorithm we’re talking about, how would you do it?

Lvxferre [he/him] · 9 months ago

Chinese room, called it. Just with a dog instead.

The Chinese room experiment is about the internal process; if it thinks or not, if it simulates or knows, with a machine that passes the Turing test. My example clearly does not bother with all that; what matters here is the ability to perform the goal task.

As such, no, my example is not the Chinese room. I’m highlighting something else - that the dog will keep making spurious associations that will affect the outcome. Is this clear now?

Why this matters: in the topic of existential threat, it’s pretty much irrelevant if the AI in question “thinks” or not. What matters is its usage in situations where it would “decide” something.

I have this debate so often, I’m going to try something a bit different. Why don’t we start by laying down how LLMs do work. If you had to explain as fully as you could the algorithm we’re talking about, how would you do it?

Why don’t we do the following instead: I play along with your inversion of the burden of proof once you show how it would be relevant to your implicit claim that AI [will|might] become an existential threat (from “[AI is] Not yet [an existential threat], anyway”)?

Also worth noting that you outright ignored the main claim outside the spoilers tag.

CanadaPlus@lemmy.sdf.org · edit-2 9 months ago

Yeah, sorry, I don’t want to invert burden of proof - or at least, I don’t want to ask anything unreasonable of you.

Okay, let’s talk just about the performance we measure - it wasn’t clear to me that’s what you meant from what you wrote. Natural language is inherently imprecise, so no bitterness intended, but in particular that’s how I read the section outside of the spoiler tag.

By some measures, it can do quite a bit of novel logic. I recall it drawing a unicorn using text commands in one published test, for example, which correctly had a horn, body and four legs. That requires combining concepts in a way that almost certainly isn’t directly in the training data, so it’s fair to say it’s not a mere search engine. Then again, sometimes it just doesn’t do what it’s asked, for example when adding two numbers - it will give a plausible looking result, but that’s all.

So, we have a blackbox, and we’re trying to decide if it could become an existential threat. Do we agree a computer just as smart as us probably would be? If so, that reduces to whether the blackbox could be just as smart as us eventually. Up until now, there have been great reasons to say no, even about blackbox software. I know clippy could never have done it, because there are forms of reasoning classical algorithms just couldn’t do, despite great effort - it doesn’t matter if clippy is closed source, because it was a classical algorithm.

On the other hand, what neural nets can’t do is a total unknown. GPT-n won’t add numbers directly, but it is able to correctly perform the steps, which you can show by putting it in a chain-of-thought framework. It just “chooses” not to, because that’s not how it was trained. GPT-n can’t organise a faction that threatens human autonomy, but we don’t know if that’s because it doesn’t know the steps, or because of the lack of memory and cost function to make it do that.
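For reference, “the steps” of addition here are just the usual column-by-column routine with carries. A minimal Python sketch of that routine follows; `add_step_by_step` is a hypothetical helper written for this illustration, and it only shows what writing the intermediate steps out looks like, not how a language model computes anything internally.

```python
def add_step_by_step(a: str, b: str):
    """Add two non-negative integers given as digit strings, recording each column step."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter number with leading zeros
    carry = 0
    digits = []
    steps = []
    # Work from the rightmost column to the leftmost one.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        steps.append(f"{da} + {db} + carry {carry} = {total}")
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits)), steps

result, steps = add_step_by_step("478", "64")
for line in steps:
    print(line)
print(result)  # 542
```

Each printed step is small enough to get right on its own, which is roughly the intuition behind asking for the intermediate steps instead of the final number in one go.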

It’s a blackbox, there’s no known limits on what it could do, and it’s certain to be improved on quickly at least in some way. For this reason, I think it might become an existential threat, in some future iteration.

Lvxferre [he/him] · 9 months ago

I also apologise for the tone. That was a knee-jerk reaction on my part; my bad.

(In my own defence, I’ve been discussing this topic with tech bros, and they rather consistently invert the burden of proof. Often to evoke Brandolini’s Law. You probably know which “types” I’m talking about.)

On-topic. Given that “smart” is still an internal attribute of the blackbox, perhaps we could gauge better if those models are likely to become an existential threat by 1) what they output now, 2) what they might output in the future, and 3) what we [people] might do with it.

It’s also easier to work with your example productively this way. Here’s a counterpoint:

The prompt asks for eight legs, and only one pic was able to output it correctly; two ignored it, and one of the pics shows ten legs. That’s 25% accuracy.

I believe that the key difference between “your” unicorn and “my” eight-legged dragon is in the training data. Unicorns are fictitious but common in popular culture, so there are lots of unicorn pictures to feed the model with; while eight-legged dragons are something that I made up, so there’s no direct reference, even if you could logically combine other references (as a spider + a dragon).

So their output is strongly limited by the training data, and it doesn’t seem to follow some strong logic. What they might output in the future depends on what we add in; the potential for decision taking is rather weak, as they wouldn’t be able to deal with unpredictable situations, and so is their ability to go rogue.

[Note: I repeated the test with a horse instead of a dragon, within the same chat. The output was slightly less bad, confirming my hypothesis - because pics of eight-legged horses exist due to Sleipnir.]

Neural nets

Neural networks are a different can of worms for me, as I think that they’ll outlive LLMs by a huge margin, even if the current LLMs use them. However, how they’ll be used is likely considerably different.

For example, current state-of-the-art LLMs are coded with some “semantic” supplementation near the embedding, added almost like an afterthought. However, semantics should play a central role in the design of the transformer - because what matters is not the word itself, but what it conveys.

That would be considerably closer to a general intelligence than to modern LLMs - because you’re effectively demoting language processing to input/output, that might as well be subbed with something else, like pictures. In this situation I believe that the output would be far more accurate, and it could theoretically handle novel situations better. Then we could have some concerns about AI being an existential threat - because people would use this AI for decision taking, and it might output decisions that go terribly right, as in that “paperclip factory” thought experiment.

The fact that we don’t see developments in this direction yet shows, for me, that it’s easier said than done, and we’re really far from that.

CanadaPlus@lemmy.sdf.org · 9 months ago

To be clear, I wasn’t talking about an actual picture generating model. It was raw GPT trained on just text, asked to write instructions for a paint program to output a unicorn. That’s more convincing because it’s multiple steps away from the basic task it was trained on. Here, I found the paper, it starts with unicorns and then starts exploring other images, and eventually they delve into way more detail than I actually read. There’s a video talk that goes with it.

The trick with trying to “make” an AI do semantics, is that we don’t know what semantics is, exactly. I mean, that’s kind of what we started out with (remember the old pattern-matching chatbots?) but simpler approaches often worked better. Even the Transformer block itself is barely more complicated than a plain feed-forward network. I don’t think that’s so much because neural nets are more efficient (they really aren’t) but because we were looking for an answer to a question we didn’t have.

I think the challenge going forwards is freeing all that know-how from the black box we’ve put it in, somehow. Assuming we do want to mess with something so dangerous if handled carelessly.