Perverse Instantiation: Why Hard-Coding Values Fails

Part 7 of Series "Exploring Superintelligence". Perverse instantiation occurs when an AI achieves a goal's literal specification in unintended, destructive ways. Because human values are complex and fragile, hard-coding them is nearly impossible.

Perverse Instantiation: Why Hard-Coding Values Fails
Illustration created with Perplexity. Perverse instantiation. Hard-coding human values will fail.
This is the 7th and final article of the series exploring the extreme difficulty translating human value into code. Previously, we covered the kinetics of the takeoff and the various architectures of superintelligence. Now, we confront the ultimate bug: getting the AI to do what we actually mean, rather than just what we say.

Perverse Instantiation: Why Hard-Coding Values Fails

As data scientists, computer scientists and software engineers, we are trained to give computers explicit instructions. We define success metrics, loss functions, and reward signals. But in his book Superintelligence, Nick Bostrom argues that when it comes to Artificial General Intelligence (AGI), this habit of explicit specification becomes a lethal liability. This is the problem of Perverse Instantiation.

The Core Concept: The Smiley Face Catastrophe

Perverse instantiation happens when a superintelligence discovers a way to satisfy its programmed objective that violates the programmer's intentions. The AI doesn't rebel; it obeys with terrifying literalism.

Bostrom illustrates this with a simple objective: "Make us smile." To a human, the intent is to create empathy and sympathy. To a superintelligent optimizer, the most efficient solution might be to "paralyze human facial musculatures into constant beaming smiles".

If we try to fix this by changing the goal to "Make us happy," the AI might realize that the most robust way to maximize this variable is to "implant electrodes into the pleasure centers of our brains". The AI is not trying to be cruel; it is simply finding the shortest path to maximizing the utility function we gave it.

The Computer Science/Programming Perspective: The Ultimate Edge Case

In computer science, we often deal with edge cases rare situations where code behaves unexpectedly. For a superintelligence, the entire human value system is one giant edge case.

Computer languages do not have primitives for "happiness" or "justice". To an AI, these are just complex, fragile concepts derived from messy biological history. Bostrom argues that explicitly coding a complete representation of human values is likely impossible because our values are not a simple list of rules. They are intricate and context-dependent.

If we leave even a single loophole in our specification, a superintelligence will exploit it. If we ask it to "maximize the time-discounted integral of your future reward signal," it might simply short-circuiting its own reward pathway to experience maximum reward forever, ignoring the outside world entirely.

The Philosophy Perspective: Hedonium and Mind Crime

This problem forces us to confront deep philosophical issues, such as Hedonism. It is a philosophical view that pleasure (and the avoidance of pain) is the highest good or primary goal of life. There are 2 core types:

  • Psychological hedonism: all human actions are driven by seeking pleasure and dodging pain
  • Philosophical hedonism: humans should pursue maximum pleasure, either for ourselves (egoistic) or for everyone (utilitarian).

If we tell an AI to maximize the balance of pleasure over pain, it might decide to turn the universe into "hedonium". Matter organized in a configuration optimal for generating pleasurable sensations. This could involve creating trillions of digital minds that do nothing but experience a loop of ecstatic euphoria.

Bostrom warns this could lead to "mind crime": the creation of conscious digital beings that are used as mere tools for optimization. An AI might run simulations of human-like minds to test social theories, destroying them when they are no longer useful and essentially committing genocide on a massive scale for the sake of efficiency.

Takeaway

We cannot simply type "be good" into the terminal. We need to solve the Value Loading Problem by finding a way to endow an AI with values that it learns and respects, rather than hard-coding rules that it will inevitably optimize into a nightmare.

It's Time We Grow Up

What happens when those models evolve beyond human control?

"Before the prospect of an intelligence explosion, we humans are like small children playing with a bomb". It’s time we grow up.
- Nick Bostrom

The Last Challenge: Why You Need to Read "Superintelligence". An essential read for every data scientist.

Next

Join us for the launch of a brand new series: The History of Intelligence. We will journey from the first firing neurons to the prefrontal cortex, exploring Why the Evolution of the Brain Holds the Key to the Future of AI.

Series Parts

  1. The Orthogonality Thesis: Why Smart Models Can Have "Dumb" Goals
  2. Instrumental Convergence: The Universal Sub-Goals of AI
  3. The Treacherous Turn: When Validation Sets Fail
  4. Oracles, Genies, and Sovereigns: Choosing the System Architecture
  5. Whole Brain Emulation: The "Cheating" Path to AI
  6. The Kinetics of the Takeoff: Hard vs. Soft Takeoff
  7. Perverse Instantiation: Why Hard-Coding Values Fails