Bryan Caplan  

So Far: Unfriendly AI Edition

Eliezer Yudkowsky responds to my "selective pessimism" challenge with another challenge.  Here he is, reprinted with his permission.

Bryan Caplan issued the following challenge, naming Unfriendly AI as one among several disaster scenarios he thinks is unlikely: "If you're selectively morbid, though, I'd like to know why the nightmares that keep you up at night are so much more compelling than the nightmares that put you to sleep."

Well, in the case of Unfriendly AI, I'd ask which of the following statements Bryan Caplan denies:

1. Orthogonality thesis - intelligence can be directed toward any compact goal; consequentialist means-end reasoning can be deployed to find means corresponding to a free choice of end; AIs are not automatically nice; moral internalism is false.

2. Instrumental convergence - an AI doesn't need to specifically hate you to hurt you; a paperclip maximizer doesn't hate you but you're made out of atoms that it can use to make paperclips, so leaving you alive represents an opportunity cost and a number of foregone paperclips. Similarly, paperclip maximizers want to self-improve, to perfect material technology, to gain control of resources, to persuade their programmers that they're actually quite friendly, to hide their real thoughts from their programmers via cognitive steganography or similar strategies, to give no sign of value disalignment until they've achieved near-certainty of victory from the moment of their first overt strike, etcetera.

3. Rapid capability gain and large capability differences - under scenarios seeming more plausible than not, there's the possibility of AIs gaining in capability very rapidly, achieving large absolute differences of capability, or some mixture of the two. (We could try to keep that possibility non-actualized by a deliberate effort, and that effort might even be successful, but that's not the same as the avenue not existing.)

4. 1-3 in combination imply that Unfriendly AI is a critical Problem-to-be-solved, because AGI is not automatically nice, by default does things we regard as harmful, and will have avenues leading up to great intelligence and power.

If we get this far we're already past the pool of comparisons that Bryan Caplan draws to phenomena like industrialization. If we haven't gotten this far, I want to know which of 1-4 Caplan thinks is false.

But there are further reasons why the above Problem might be *difficult* to solve, as opposed to being the sort of thing you can handle straightforwardly with a moderate effort:

A. Aligning superhuman AI is hard to solve for the same reason a successful rocket launch is mostly about having the rocket *not explode*, rather than the hard part being assembling enough fuel. The stresses, accelerations, temperature changes, etcetera in a rocket are much more extreme than they are in engineering a bridge, which means that the customary practices we use to erect bridges aren't careful enough to make a rocket not explode. Similarly, dumping the weight of superhuman intelligence on machine learning practice will make things explode that will not explode with merely infrahuman stressors.

B. Aligning superhuman AI is hard for the same reason sending a space probe to Neptune is hard - you have to get the design right the *first* time, and testing things on Earth doesn't solve this because the Earth environment isn't quite the same as the Neptune-transit environment, so having things work on Earth doesn't guarantee that they'll work in transit to Neptune. You might be able to upload a software patch after the fact, but only if the antenna still works to receive the software patch - if a critical failure occurs, one that prevents further software updates, you can't just run out and fix things; the probe is already too far above you and out of your reach. Similarly, if a critical failure occurs in a sufficiently superhuman intelligence, if the error-recovery mechanism itself is flawed, it can prevent you from fixing it and will be out of your reach.

C. And above all, aligning superhuman AI is hard for similar reasons to cryptography being hard. If you do everything *right*, the AI won't oppose you intelligently; but if something goes wrong at any level of abstraction, there may be powerful cognitive processes seeking out flaws and loopholes in your safety measures. When you think a goal criterion implies something you want, you may have failed to see where the real maximum lies. When you try to block one behavior mode, the next result of the search may be another very similar behavior mode that you failed to block. This means that safe practice in this field needs to obey the same kind of mindset as appears in cryptography, of "Don't roll your own crypto" and "Don't tell me about the safe systems you've designed, tell me what you've broken if you want me to respect you" and "Literally anyone can design a code they can't break themselves, see if other people can break it" and "Nearly all verbal arguments for why you'll be fine are wrong, try to put it in a sufficiently crisp form that we can talk math about it" and so on.

And on a meta-level:

D. These problems don't show up in qualitatively the same way when people are pursuing their immediate incentives to get today's machine learning systems working today and today's robotic cars not to run over people. Their immediate incentives don't force them to solve the bigger, harder long-term problems; and we've seen little abstract awareness or eagerness to pursue those long-term problems in the absence of those immediate incentives. We're looking at people trying to solve a rocket-accelerating cryptographic Neptune probe and who seem to want to do it using substantially less real caution and effort than normal engineers apply to making a bridge stay up. Among those who say their goal is AGI, you will search in vain for any part of their effort that spends as much effort trying to poke holes in things and foresee what might go wrong on a technical level, as you would find allocated to the effort of double-checking an ordinary bridge. There's some noise about making sure the bridge and its pot o' gold stays in the correct hands, but none about what strength of steel is required to make the bridge not fall down and say what does anyone else think about that being the right quantity of steel and is corrosion a problem too.

So if we stay on the present track and nothing else changes, then the straightforward extrapolation is a near-lightspeed spherically expanding front of self-replicating probes, centered on the former location of Earth, which converts all reachable galaxies into configurations that we would regard as being of insignificant value.

On a higher level of generality, my reply to Bryan Caplan is that, yes, things have gone well for humanity so far. We can quibble about the Toba eruption and anthropics and, less quibblingly, ask what would've happened if Vasili Arkhipov had possessed a hotter temper. But yes, in terms of surface outcomes, Technology Has Been Good for a nice long time.

But there has to be *some* level of causally forecasted disaster which breaks our confidence in that surface generalization. If our telescopes happened to show a giant asteroid heading toward Earth, we can't expect the laws of gravity to change in order to preserve a surface generalization about rising living standards. The fact that every single year for hundreds of years has been numerically less than 2017 doesn't stop me from expecting that it'll be 2017 next year; deep generalizations take precedence over surface generalizations. Although it's a trivial matter by comparison, this is why we think that carbon dioxide causally raises the temperature (carbon dioxide goes on behaving as previously generalized) even though we've never seen our local thermometers go that high before (carbon dioxide behavior is a deeper generalization than observed thermometer behavior).

In the face of 123ABCD, I don't think I believe in the surface generalization about planetary GDP any more than I'd expect the surface generalization about planetary GDP to change the laws of gravity to ward off an incoming asteroid. For a lot of other people, obviously, their understanding of the metaphorical laws of gravity governing AGIs won't feel that crisp and shouldn't feel that crisp. Even so, 123ABCD should not be *that* hard to understand in terms of what someone might perhaps be concerned about, and it should be clear why some people might be legitimately worried about a causal mechanism that seems like it should by default have a catastrophic output, regardless of how the soon-to-be-disrupted surface indicators have behaved over a couple of millennia previously. 2000 years is a pretty short period of time anyway on a cosmic scale, and the fact that it was all done with human brains ought to make us less than confident in all the trends continuing neatly past the point of it not being all human brains. Statistical generalizations about one barrel are allowed to stop being true when you start taking billiard balls out of a different barrel.

But to answer Bryan Caplan's original question, his other possibilities don't give me nightmares because in those cases I don't have a causal model strongly indicating that the default outcome is the destruction of everything in our future light cone. Or to put it slightly differently, if one of Bryan Caplan's other possibilities leads to the destruction of our future light cone, I would have needed to learn something very surprising about immigration; whereas if AGI *doesn't* lead to the destruction of our future lightcone, then the way people talk and act about the issue in the future must have changed sharply from its current state, or I must have been wrong about moral internalism being false, or the Friendly AI problem must have been far easier than it currently looks, or the theory of safe machine learning systems that *aren't* superhuman AGIs must have generalized really surprisingly well to the superhuman regime, or something else surprising must have occurred to make the galaxies live happily ever after. I mean, it wouldn't be *extremely* surprising but I would have needed to learn some new fact I don't currently know.

COMMENTS (8 to date)
J Storrs Hall writes:

Be careful. There already is an evil AI out there, and it has found a hack that doesn't require its waiting around until we build physical computers good enough to run it. It runs on a substrate of human brains! And it doesn't want to make paper clips--its built-in goal is simply to have copies of itself running on as many brains as possible. To this end it will convert any brain it captures into an AI-spreading machine.
Watch the skies!

entirelyuseless writes:

I am the one who bet Eliezer $10 against $1,000 that AI will not destroy the world, and I deny all three of his claims, in the sense that matters for his result:

1. Real AIs (not AI in principle, but machines created and raised by human beings in a human world) will not have utility functions. So the "orthogonality thesis" is not relevant; AIs will not be pursuing a goal at all in the way Eliezer is talking about, let alone a destructive one.

2. Likewise, instrumental convergence is irrelevant, since it presupposes that the AIs are pursuing a specific goal.

3. This kind of progress peters out rapidly; there is no reason to suppose that you can write code that says "IQ = 10,000" as easily as you can write "IQ = 100". So I think the third claim is flatly false.

Mark Bahner writes:

I think Eliezer is absolutely correct. His paper clip maximizer scenario is perfect. It wouldn't be that the maximizer hated us; it simply had other priorities. Just like humans don't hate the mice when they do medical experiments that kill the mice slowly.

And AI improves many orders of magnitude more quickly than human intelligence.

ChrisA writes:

I don't worry (too much) about the paper clip maximisers. I believe that now that this kind of threat has been identified (and it was pretty obvious), it won't be allowed to happen. I worry more about uploaded humans. The first human uploaded will observe that if he allows others to be uploaded they can present a serious threat to him. And so, given he or she is now possessed of God-like intelligence, they will make sure no one else is uploaded again. And they will destroy humanity to make sure. And if not the first, then the second and so on. And don't think morality will save you; morality will be edited out as soon as it presents any barrier.

Alan Crowe writes:

Eliezer is right to see rogue AI as a problem for the whole of the galaxy, but I doubt that Earthlings will be the ones to cause the great catastrophe. Humans are strictly fingers-on-keyboard programmers. They don't write code that writes code.

They do write compilers, and compilers are in a sense writing code. Indeed they write cute little technology stacks, with code on level n writing the code on level n-1 and all the way down to level 0 where it runs on the hardware. But the fingers-on-keyboard mentality is very strong. They are not even close to being able to get level n to rewrite level n. They don't even have a sense that this is a whole new level of programming difficulty. They cannot rise to a challenge that they cannot even see.

If we are turned into paper clips by a rogue AI it will be due to programming errors by a different sapient species on a different planet; one where they can really code. MIRI cannot save the galaxy because it is based on the wrong planet.

Mark Bahner writes:

"I worry more about uploaded humans. The first human uploaded will observe that if he allows others to be uploaded they can present a serious threat to him. And so, given he or she is now possessed of God-like intelligence, they will make sure no one else is uploaded again. And they will destroy humanity to make sure."

Yes, any super-intelligent entity that perceives an existential threat can be very dangerous, assuming it wants to continue to exist. In science fiction, there's Skynet, computer Moriarty in Star Trek TNG, and of course good ol' HAL 9000.

Gary Miguel writes:

We are so far from Artificial General Intelligence, and we have had so many close calls with nuclear weapons.
I guess the AGI argument is interesting in theory, but I really don't see it happening in the next 100 years.

Jacob C. Witmer writes:

Only a small minority of humans behave morally in all-or-nearly-all areas, but the averaged majority of people behave increasingly more morally, in increments, in most obvious areas. If the threshold is set very low for the result of nonpunishment, juries will tend to not punish, except in cases where the person obviously needs to be punished and/or isolated (such as the serial torturers and killers of children, Bittaker and Norris).

Judges have figured out how to use the tendency of conformity and signaling, and other tricks to get nonconformists to tip their hand in the pre-trial questioning (which we only got as a means of enforcing the 1850 Fugitive Slave Law, which then proved useful for enforcing other unjust laws and expanding judicial and prosecutorial power).

Virtually everything really intelligent in our current world/nature is evolved, and has complex goals. The primary danger of a superintelligent AGI is that it might be a sociopathic singleton. A sociopathic plurality would be less damaging as a whole, as an equilibrium would be reached (a few humans might even survive by being aligned with the superintelligent sociopaths' goal structures).

No law seems to indicate that the functional neocortex cannot get larger, but in a far smaller space than ours currently occupies. The birth canal limited its size for most of human evolution. So, it seems to me that a "more or less" superintelligent human is going to exist, even if it's all synthetic (which seems likely).

Because of the economic pressures for the prior to happen, this seems like the baseline for human goal structures to become way more hostile and competitive. Thus, any such system that can't defend itself from theft and murder will probably be murdered. Why? Because sociopaths make a huge effort to obtain the seats of power, right now. To believe they won't attempt to maintain the seats of relative power into the future is absurd, when their entire thieving lifestyle is built upon that idea.

Also, natural goals like life-extension require the reverse-engineering of brain components, and a mind-migration to synthetic substrates, as per Kurzweil. It seems that anyone arguing against this pathway proceeding forward simply hasn't thought it through, or is betting on a FOOM! beforehand. A FOOM! beforehand seems likely, but only from the military (with likely negative goal structures).

With mirror neurons, decentralized robots could prove to be the same boon for humanity that really smart scientists have been. I think this is likely, in spite of the doomsday scenarios I've been writing about.

There's nothing in it for superintelligences deciding to be sociopathic, and no incentive to build them except cost-cutting and ignorance. So long as the builders of AGI are optimal parents, and thoughtful designers, progress will and should proceed steadily, and this is likely to pay off for us all.
