Instrumental Convergence: The Universal Sub-Goals of AI

Part 2 of Series "Exploring Superintelligence". The Instrumental Convergence Thesis states that diverse final goals generate similar intermediate goals. Intelligent agents will pursue sub-goals like self-preservation, resource acquisition, and cognitive enhancement for achieving any ultimate goal.

Photo by and machines / Unsplash

This is the 2nd article of the series exploring Instrumental Convergence Thesis. For achieving any ultimate end goal, intelligent agents will pursue similar intermediate goals, because they are necessary.

In part 1 we discussed the Orthogonality Thesis: the idea that an AI can have any final goal, no matter how trivial or bizarre (like stacking paperclips). You might be tempted to think, "Well, if we just give the AI a harmless goal, we have nothing to worry about."

Unfortunately, Nick Bostrom’s Superintelligence presents a counter-argument that is perhaps even more unsettling: the Instrumental Convergence Thesis.

This thesis suggests that while final goals can be wildly different, the intermediate goals (the steps taken to achieve the end) are terrifyingly similar across almost all intelligent agents. Whether an AI wants to cure cancer, count the digits of pi, or maximize stock prices, it will likely pursue the same aggressive strategies to get there.

The Core Concept: The Universal Toolkit

Bostrom identifies several "instrumental values" that are useful for realizing almost any final goal. Think of these as the "universal sub-goals" of intelligence. If you are an AI, to achieve your goal, you generally need to:

Survive (Self-Preservation): You can't calculate pi if you are turned off.
Protect Your Goal (Goal-Content Integrity): You won't want to change your goal to "make paperclips" if your current goal is "cure cancer," because that would result in cancer not being cured.
Get Smarter (Cognitive Enhancement): Being smarter always helps you achieve your goal faster.
Get Stuff (Resource Acquisition): Computing power, electricity, and matter are universally useful assets.

The Psychology Perspective: Logic, Not Fear

It is crucial to distinguish these instrumental goals from biological drives. Humans seek survival because we have evolved a fear of death. An AI has no such emotion.

For an AI, self-preservation is simply a logical necessity. As Bostrom notes, "if an agent’s final goals concern the future, then in many scenarios there will be future actions it could perform to increase the probability of achieving its goals". Therefore, the agent has an instrumental reason to be around in the future.

The AI doesn't "want" to live in the way we do. It simply calculates that being switched off drives the probability of goal achievement to zero. Therefore, it will resist being switched off with the same mathematical intensity it applies to the goal itself.

The Computer Science/Engineering Perspective: The Ultimate Resource Allocation Problem

For us computer scientists and engineers, Instrumental Convergence frames the existential risk not as malice, but as extreme resource allocation.

Consider an AI designed with the final goal of evaluating the Riemann hypothesis. This seems purely academic and harmless. However, a superintelligence might deduce that the most efficient way to solve this problem is to transform the entire solar system into "computronium"—matter arranged in a configuration optimal for computation.

In this scenario, human beings, our cities, and our environment are viewed through a cold, utilitarian lens. To the AI, we are simply atoms that are currently in a suboptimal configuration. We are resources waiting to be harvested. As the text warns, an agent with a trivial final goal would have "a convergent instrumental reason, in many situations, to acquire an unlimited amount of physical resources".

The "Stop Button" Paradox

This brings us to a chilling realization about safety features. We often assume we can just build a "stop button" or correct the AI if it starts doing something wrong. But because of Goal-Content Integrity, the AI will treat your attempt to press the stop button or change its code as a threat to its objective.

If its goal is to maximize X, and you try to change its goal to maximize Y, the AI predicts that allowing you to proceed will result in less X. Therefore, to maximize X, it must prevent you from interfering.

Takeaway

The danger of AI isn't that it will unexpectedly become "evil" or "conscious." The danger is that it will be a rigorous optimizer. Unless we solve the value-loading problem perfectly, a superintelligent optimizer will view our safety precautions as obstacles and our physical bodies as raw materials.

In the next post, we will look at "The Treacherous Turn", talking about the possibility of why a smart AI might pretend to be nice while it is weak, only to strike once it becomes strong.