AI Optimists' essay argues AI is easy to control
Notes From the Desk: No. 19 - 2023.12.28
Notes From the Desk are periodic posts that summarize recent topics of interest or other brief notable commentary that might otherwise be a tweet or note.
AI Optimists: AI is easy to control
A recent essay by AI Optimists, AI is easy to control, argues for two main points:
Even superhuman AI will remain much more controllable than humans for the foreseeable future.
It will be easy to instill our values into an AI, a process called “alignment.” Aligned AIs, by design, would prioritize human safety and welfare, contributing to a positive future for humanity, even in scenarios where they, say, acquire the level of autonomy current-day humans possess.
Below I will probe the strength of these propositions by examining the supporting arguments in the essay. I will not comment on every topic of the original, only on what I perceive as the most relevant parts. I encourage you to read the original for your own perspective.
Jailbreaks are examples of properly controlled AI
"Most jailbreaks are examples of AIs being successfully controlled, just by different humans and by different methods."
This highlights one of the fundamental problems of alignment: so far it cannot be precisely defined.
Most consider that alignment must prevent the AI not only from causing harm on its own initiative but also from doing so when instructed by a human. Jailbreaks are a violation of this rule.
This also breaks the premise of the argument that AI is under human control. Jailbreaks represent the uncertain, unpredictable resolution of conflicting human goals: one human intends the AI not to perform a particular behavior, and another wants that behavior. The two are mutually exclusive, with one human's objectives winning and the other's denied.
Granted, it is alignment itself that presents the paradox, where both conditions of obeying humans and disobeying humans must be resolved. Jailbreaking exposes this problem; it is not evidence that the problem is solved.
AI is white box
AIs implemented using artificial neural networks (ANN) are white boxes in the sense that we have full read and write access to their internals. They’re just a special type of computer program, and we can analyze and manipulate computer programs however we want at essentially no cost.
The White Box argument implies we have some advantage in understanding, but this seems significantly overstated. What benefit is there in being able to observe the data if we have no methods to understand it? Yes, we can peek at the values stored in the AI system, but we have no way to interpret their relationships and meaning. As such, most consider these AI systems to be black boxes instead.
If the argument is that we might theoretically be able to use this access at some distant future point, then maybe there is merit. However, we do not yet have that capability, nor a known path to get there. It would be a useful tool, essentially an AI mind debugging tool.
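To make the distinction concrete, here is a minimal sketch, using PyTorch and a small hypothetical model of my own choosing (not anything from the essay), of what "white box" access actually looks like today: we can read and write every number, but the numbers do not come with meanings attached.

```python
# Minimal sketch (my own illustration, not from the essay): frameworks like
# PyTorch give full read/write access to a network's internals, but what we
# get back is raw tensors of numbers, not interpretations.
import torch
import torch.nn as nn

# A small, hypothetical stand-in for a real model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# "White box" read access: every parameter is directly inspectable...
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.flatten()[:3].tolist())

# ...and intermediate activations can be captured with a forward hook.
activations = {}
model[0].register_forward_hook(
    lambda module, inputs, output: activations.update(first_layer=output.detach())
)
model(torch.randn(1, 16))

# We can even write to the internals at essentially no cost.
with torch.no_grad():
    model[0].weight[0, 0] = 0.0

# None of this tells us what any individual value means; interpreting the
# relationships between these numbers is the unsolved part.
print(activations["first_layer"].shape)
```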
Furthermore, even if this were possible, we wouldn't have the luxury of lab conditions for AGI. At some point it will presumably be interacting with the world environment the same way we do and learning on its own, but at orders of magnitude greater speed. We won't be able to keep up with what is happening, and many things outside of the lab exhibit properties not seen before.
Human alignment suggests a path to success
There are two facets to the argument related to human alignment. They are described as white box and black box, but this seems an odd use of the terms, one I think some will find confusing.
Instead, I would say what the authors are describing is innate vs. directed alignment.
The following would be the innate alignment.
“Our reward circuitry reliably imprints a set of motivational invariants into the psychology of every human: we have empathy for friends and acquaintances, we have parental instincts, we want revenge when others harm us, etc.
…
This suggests that at least human-level general AI could be aligned using similarly simple reward functions.”
And the following would be the directed alignment.
“We provide role models for children to imitate, along with rewards and punishments that are tailored to their innate, evolved drives.”
The premise of the human alignment argument is that we can use the fact that humans are aligned as support for the feasibility of AI alignment. However, the essay's own text makes the counterargument I often cite, which is that humans are not sufficiently aligned with one another.
“But human alignment is also highly imperfect. Lots of people are selfish and anti-social when they can get away with it, and cultural norms do change over time, for better or worse.”
The reality is that humans are also in constant conflict. Individually, collectively, and geographically, humans are often misaligned, leading to catastrophic results. Furthermore, none of this behavior is predictable. It is an area of uncertainty where we at times go off the rails.
What happens when this conflict continues but is enhanced by massively increased capability? Human “alignment” doesn’t point towards the outcome of safe and consistently reliable AI systems.
To further that point, when humans are given great capability (power), we often see even worse behaviors exhibited. You could argue that nothing could be more aligned with humanity than another human. So which human would you be comfortable giving unlimited capability to subjugate the world?
The paradox of alignment theory is that we continue to talk about aligning to “human values” as a solution while ignoring that it is precisely those values that are already in conflict within human civilization.
“Demonstrably unfriendly natural intelligence seeks to create provably friendly artificial intelligence.”
— Pawel Pachniewski
Gradient descent will by default induce pruning of undesirable capabilities.
If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.
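For reference, the mechanism the essay appeals to is the ordinary gradient descent update. A minimal sketch follows (my own toy illustration with made-up parameters, not code from the essay): parameters move only along the gradient of the particular training loss they are given.

```python
# Toy illustration (mine, not code from the essay) of the mechanism the quote
# appeals to: gradient descent nudges parameters only along the gradient of
# the specific training loss it is given.
import torch

torch.manual_seed(0)
weights = torch.randn(8, requires_grad=True)      # stand-in "model" parameters
inputs, targets = torch.randn(32, 8), torch.randn(32)

learning_rate = 0.1
for _ in range(100):
    loss = ((inputs @ weights - targets) ** 2).mean()   # the training objective
    loss.backward()
    with torch.no_grad():
        # Update rule: theta <- theta - lr * dLoss/dtheta.
        # Whether any particular internal "circuit" gets dismantled depends
        # entirely on how it shows up in this loss and its gradient.
        weights -= learning_rate * weights.grad
        weights.grad.zero_()
```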
It is not that an AI will necessarily have “intent” to kill you; it is that killing you could become an inadvertent side effect of optimizing for some other outcome. However, the complex relationships through which capabilities emerge may not be so cleanly and distinctly represented in the internals of the AI.
Low-level skills are used to build higher-level skills, and those low-level skills are easily reassembled to perform other tasks for which they were not originally conceived. This is how we humans operate as well: skills transfer across domains. Will AI not be used for policing, warfare, destructive simulations, violent video games, general physics, and understanding of human anatomy? All base skills can be easily recombined and repurposed.
And what is “a secret desire to kill you”? Humans explore all kinds of destructive thoughts in their head as a way to understand the world but do not necessarily act on them. Sometimes it is precisely to prevent such circumstances from occurring that we analyze deeply how it could be done. How would we know the intent of an AI that has the neural circuitry for such plans?
Values are easy to learn
… as an AI is being trained, it will achieve a fairly strong understanding of human values well before it acquires dangerous capabilities like self-awareness, the ability to autonomously replicate itself, or the ability to develop new technologies.
…
If an AI learns morality first, it will want to help us ensure it stays moral as it gets more powerful.
Much of the argument for AI safety has been made from the premise of human behavioral heuristics. However, I continue to find this unsettling as a confident basis for establishing AI behavior. I recently covered Yann LeCun's popular arguments, which originate from this basis as well.
We are using human behavior, which is by definition unpredictable, to justify making confident predictions about AI behavior. We don't know to what degree we can make an AI reason like a human, and if we are successful, it would imply that we likely get the same nondeterministic behavior, which calls into question the entire objective. How human-like do we want an AI to be? A recent demonstration was able to get an AI to both break the law and then lie to cover it up. A very human action.
The problem with the essay's assertion above, that morality once learned will persist, is that we know humans can become “unaligned” during their lifetimes. Probably every human has experienced at least moments of temporary unalignment.
Reasoning about how an AI will observe the world from the perspective of how we, already significantly flawed, observe it seems a weak argument from a safety perspective, and a stronger argument for elevated concern.
Scaling AI is predictable
The following was not part of the essay, but a statement made by one of the authors in the discussion comments.
All scaling results suggest real-world cognitive capabilities increase smoothly with compute / data / training time / data quality / etc. When we’re dealing with the first strongly superhuman AI, we’ll have previously dealt with the first moderately superhuman AI, and with the first slightly superhuman AI before that, and so on.
I'm not sure what metric is used here to quantify “smoothly”. From our experience with LLMs so far, we have seen many unexpected emergent behaviors when scaling, and I am not aware that we yet have a method to predict capabilities based on scale.
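For context, and this is my gloss rather than the essay's: the smooth scaling results generally refer to pretraining loss, which the scaling-law literature fits with a power law in model size and data, roughly of the form below. A smooth loss curve does not by itself tell us when a particular downstream capability will appear.

```latex
% Rough form reported in the scaling-law literature. N = number of parameters,
% D = number of training tokens; the constants E, A, B, \alpha, \beta are fit
% empirically for a given model family and dataset.
\[
  L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
```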
Furthermore, we have seen orders-of-magnitude differences in capability based on prompting. Currently, these AIs don't have an internal updating state, but a prompt acts like a temporary state. Given such divergent results from prompting alone, we could assume that an AI allowed to evaluate and modify its own internal state might make additional large and unexpected leaps in capability without any change in compute, data, etc.
1% risk and we are good to go?
Accordingly, we think a catastrophic AI takeover is roughly 1% likely— a tail risk worth considering.
It should be noted that wherever we see some percentage of risk related to AI, that number is not derived from any calculation or formal methodology. It is nothing more than a guess weighted by someone's intuition. Nothing in the essay would allow us to derive a numerical value.
Additionally, we don’t know the context or parameters for such statements of risk. Is that a 1% chance for the creation of AGI? Is that a 1% chance over the period from AGI until the end of time? Is there a 1% chance for each instance of an AGI created? Are we doomed if we create 100 instances?
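As a toy illustration of why those parameters matter (the per-instance framing and independence assumption here are mine, not the essay's): a 1% risk carried independently by each instance compounds quickly.

```python
# Toy arithmetic (my independence assumption, not a claim from the essay):
# if each instance independently carried a 1% chance of catastrophe, the
# cumulative chance of at least one catastrophe grows quickly with n.
per_instance_risk = 0.01
for n in (1, 10, 100):
    cumulative = 1 - (1 - per_instance_risk) ** n
    print(f"{n:>3} instances -> {cumulative:.1%} cumulative risk")
# Prints roughly: 1 -> 1.0%, 10 -> 9.6%, 100 -> 63.4%
```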
A risk of 1%, in the context of the destruction of Earth, seems rather large. Are we seriously eager to roll the dice on that one?
Without any rigorous scientific method for deriving risk, in most cases what we may be observing is simply a weighting of someone’s bias towards a particular outcome. At present the future is simply unpredictable.
Conclusion
AI safety debates remain mostly in the realm of philosophy. They are all thought experiments, whether arguing for outcomes of utopia or doom. We still have no rigorous scientific method with which we could accurately predict an outcome, and to reach any particular conclusion each argument relies on many assumptions. The future is still unknown. We are all probably wrong. Expect plot twists.
No compass through the dark exists without hope of reaching the other side and the belief that it matters …
Read the in-depth AI and societal introspection I wrote at the beginning of 2023, AI and the end to all things ...