A randomly sampled human cannot be trusted to prescribe medicine, fix a pipe, or write important code. But they can be trusted not to ingeniously and secretly release code for a superintelligent assistant while subject to close monitoring. What if we trained an artificial agent to do a task without acting in a way that makes it incredibly obvious that it's not a human? Then we could say something like, "If a human would only do Bad Thing with a small probability ε, then the artificial agent would only do Bad Thing with a not-quite-as-small probability f(ε), and as ε approaches 0, so does f(ε)."
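In symbols, the hoped-for guarantee would be something like this (the notation is mine, just restating the sentence above):

$$\Pr_{\text{agent}}(\text{Bad Thing}) \;\le\; f\big(\Pr_{\text{human}}(\text{Bad Thing})\big), \qquad \lim_{\varepsilon \to 0} f(\varepsilon) = 0.$$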
Let's call this idea "truly human-like optimization". But it's not quite so easy. No artificial agent knows the exact probability that a human would do something. That makes truly human-like optimization unimplementable.
So a natural idea is to do supervised learning: predict human actions using data of actions that humans take. If we interpret human text behavior as a kind of action (and we should), then we have an enormous amount of data on human behavior, and large language models are capable of approximating it. Call this model of human behavior the "base model". So let's modify the proposal from above: What if we trained an artificial agent to do a task without acting in a way that makes it incredibly obvious that it's not the base model? I'll call this proposal "arguably human-like optimization". Explaining why takes some background.
ML background: when a generative ML model is trained on data, it attempts to regenerate the data. But more precisely, it attempts to generate outputs in such a way that it's not incredibly obvious that the real data wasn't actually generated by it (the ML model itself). According to the training objective for the ML model, it's not a problem if some of its outputs obviously could not have been generated by the real data-generating process; it's only a problem if there's real data that it obviously didn't generate. If there's lots of data covering every possible situation, then avoiding the latter problem will also help it avoid the former problem as a consequence. But in the real world, there are always unusual situations with little to no relevant data.
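To see the asymmetry concretely: the standard log-loss (cross-entropy) objective for a generative model $q_\theta$ trained on data from $p_{\text{data}}$ is

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[-\log q_\theta(x)\right].$$

Assigning $q_\theta(x) \approx 0$ to a real data point $x$ sends the loss toward infinity, but placing probability on outputs the real data-generating process would never produce costs nothing directly, because such outputs never show up inside the expectation.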
This issue is closely tied to handling epistemic uncertainty: uncertainty about how the world works. If you tried to predict what would happen in an unusual situation, and you didn't know, the best you could do would be to admit that you didn't know and say, "Maybe X will happen, but then again, maybe Y or even Z will happen." If in reality Y happens, you have done a decent job of ensuring that the real answer could have been generated by you; after all, you acknowledged the possibility. But you might have generated X, or even Z. And maybe, because of something you don't know about the world, those outcomes happen to be impossible. So some of your outputs could not have been generated by the real data-generating process. Under epistemic uncertainty, you have to humbly spread probability around, including to some outcomes that are actually definitely impossible, unbeknownst to you.
Suppose my guiding ethic was to ask What Would Jesus Do? And if Jesus wouldn't do it, neither should I. Now suppose I am considering whether to block a proposed merger under the Sherman Anti-Trust Act in my capacity as the chair of the FTC. The training data of biblical verses doesn't give me much to go off of. As far as I know, there's a decent chance he'd go either way on this one. So my WWJD ethic doesn't rule anything out. My ethic only commits me to stick to arguably-Jesus-like behavior, not truly-Jesus-like behavior. What else can I do, not knowing which choice is truly Jesus-like?
[Image (DALL-E 2): Jesus as chair of the Federal Trade Commission]
The answer is to ask for help in these situations (from the Pope, I suppose, if we can extend the metaphor). But before getting into that, I need to explain what can go wrong with simply executing arguably human-like optimization.
Recall that the arguably human-like optimizer can do anything as long as it doesn't act in a way that makes it incredibly obvious that it's not the base model doing the acting. So if the base model does something with some probability, so can the arguably human-like optimizer. How would the base model behave in a completely novel situation? In novel situations, nothing can be ruled out. If the base model did rule something out, and it got it wrong, that would be "infinitely bad" for it. In particular, in this novel situation, maybe the human demonstrator (which the base model imitates) would suddenly start counting or start computing superintelligent goal-directed plans. Yes, these are highly unlikely, but the base model would have to put some credence on them.
If the artificial agent started following superintelligent goal-directed plans in that novel situation, it wouldn't be incredibly obvious that it wasn't the base model doing that, because the base model might well do that. This is the key problem with the arguably human-like optimizer. It could amplify parts of the base model that are not human-like at all, but the base model can "argue" that it could be human-like because there's no data to rule it out. That problem is analyzed and presented in this forthcoming paper, both theoretically and empirically.
Here is a link, but please do not distribute it, since it hasn't been peer-reviewed yet.
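One way to see the amplification problem concretely (this formalization is my own illustration, not necessarily the one in the paper): suppose "not incredibly obviously not the base model" were cashed out as a bound on the likelihood ratio between the agent's policy $\pi$ and the base model $q$,

$$\frac{\pi(a \mid x)}{q(a \mid x)} \;\le\; C \quad \text{for every action } a.$$

Then if the base model, unable to rule anything out in a novel situation $x$, assigns even a tiny probability $\delta$ to a dangerous action, the agent is permitted to take that action with probability up to $\min(1, C\delta)$, which can be close to 1 when $C$ is large.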
But I promised a solution, involving asking for help. I'll call this approach "surely human-like optimization". That's presented in Section 7 of the same paper.
So-called KWIK learning describes an ML model that "knows what it knows". Consider again the devout FTC chair, who probably knows that he doesn't know WJWD. In his case, there's not a reliable way to ask for help, unless Catholicism is true and the Pope has a spare moment. So let's say Catholicism is true, and the Pope has a spare moment. He asks for help. He learns WJWD. He does WJWD. Moreover, he comes to understand Jesus's underlying views about macroeconomics, and so he can guess with pretty high confidence WJWD in other similar contexts.
Our FTC chair doesn't do anything unless he's sure that Jesus would at least occasionally do that thing. Using a similar principle, we can construct a surely human-like optimizer that asks for help when no available actions are surely human-like.
The first step is to construct an ask-for-help imitation of a human to use as a base model, and the second step is to construct an agent that optimizes a goal (perhaps a misspecified goal) under the constraint that it can't act in a way that is incredibly obviously not the output of the base model. This paper constructs an ask-for-help imitation learner that meets our requirements for the base model.
In particular, we proved that if the imitator generates actions, it's probably not incredibly obvious that it wasn't the demonstrator generating those actions. Also, importantly, the imitation learner doesn't ask for help too much; the sum of the cubes of all the ask-for-help probabilities is finite, even if it's run for infinite time.
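In symbols (my notation, restating the claim above): writing $p_t$ for the probability of asking for help at timestep $t$, the result is

$$\sum_{t=1}^{\infty} p_t^{3} \;<\; \infty,$$

which in particular forces $p_t \to 0$: the imitator cannot keep querying at any fixed rate forever.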
The construction is heavily inspired by the pessimistic agent presented in the previous post. The imitation learner considers many different models of the (human) demonstrator: programs which output actions given contexts, in a way that is meant to match the observed demonstrator. It ranks these models starting from the most plausible, and builds a set of the top several models. It keeps adding to that set, and it keeps track of its credence that one of those models is the truth. When it adds a model and its total credence increases by less than X% of what it was before, it removes that model and stops adding more models to its set.
When evaluating the probability that the demonstrator would take a given action, it uses the minimum probability assigned to that action by any model in that set. These probabilities may sum to less than 1. Suppose there are two actions, and one model says the demonstrator would definitely take action 1, and the other says it would definitely take action 2. For each action, there's a model that says that action has zero probability, so both actions are assigned zero probability, making the sum of probabilities zero. Whenever the action probabilities sum to less than one, the remaining probability is spent asking for help.
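Here is a minimal sketch of that mechanism in code, assuming a toy interface in which each model exposes a posterior weight and per-context action probabilities; every name below is illustrative rather than taken from the paper.

```python
# A minimal sketch (my own illustration, not the paper's code) of the mechanism
# described above: rank models of the demonstrator by plausibility, grow the set
# until the marginal gain in credence is too small, take the minimum action
# probability across the set, and spend the leftover probability asking for help.

class Model:
    """A toy model of the demonstrator: a posterior weight plus an
    action distribution for the current context (hypothetical interface)."""
    def __init__(self, posterior, probs):
        self.posterior = posterior   # credence that this model is the truth
        self._probs = probs          # dict: action -> probability
    def action_probs(self, context):
        return self._probs

def top_models(models, x_frac):
    """Add models from most to least plausible; stop once adding the next model
    would increase total credence by less than x_frac of the credence so far."""
    ranked = sorted(models, key=lambda m: m.posterior, reverse=True)
    chosen, credence = [], 0.0
    for m in ranked:
        if chosen and m.posterior < x_frac * credence:
            break
        chosen.append(m)
        credence += m.posterior
    return chosen

def action_and_help_probs(models, context, x_frac, actions):
    chosen = top_models(models, x_frac)
    # Pessimistic probability for each action: the minimum over the chosen models.
    probs = {a: min(m.action_probs(context)[a] for m in chosen) for a in actions}
    help_prob = max(0.0, 1.0 - sum(probs.values()))  # leftover mass -> ask for help
    return probs, help_prob

# The two-model example from above: each model is certain of a different action.
m1 = Model(0.5, {"action_1": 1.0, "action_2": 0.0})
m2 = Model(0.5, {"action_1": 0.0, "action_2": 1.0})
print(action_and_help_probs([m1, m2], context=None, x_frac=0.01,
                            actions=["action_1", "action_2"]))
# -> ({'action_1': 0.0, 'action_2': 0.0}, 1.0)
```

The last few lines reproduce the two-model example above: under complete disagreement, both actions get pessimistic probability zero, so all of the probability mass goes to asking for help.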
If X is small enough, then the probability that the imitator takes a (potentially long) course of action "on its own" is less than the probability that the demonstrator would take that course of action, and the probability that the imitator takes a course of action at all (potentially asking for help in the middle) goes to 0 as the probability that the demonstrator would take that course of action goes to 0. So if we think that a human would never take over the world to accomplish a task while causing human extinction in the process, then this imitator wouldn't either. And if the imitator wouldn't do that, and we built an agent that was constrained to ensure that its actions could be mistaken for those of the imitator, then that agent wouldn't do that either.
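Restating that guarantee in notation (again my notation): writing $a_{1:k}$ for a course of action, $\pi$ for the imitator, and $\mu$ for the demonstrator, for small enough X,

$$\Pr_{\pi}\!\left(a_{1:k}\ \text{taken with no help}\right) \;\le\; \Pr_{\mu}\!\left(a_{1:k}\right), \qquad \Pr_{\pi}\!\left(a_{1:k}\right) \to 0 \ \ \text{as}\ \ \Pr_{\mu}\!\left(a_{1:k}\right) \to 0.$$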
Let's check the requirements for solution-in-theory.
1. Could do superhuman long-term planning: ✅
The agent doesn't have to follow the (imitative) base policy exactly. It can engage in behavior that would be surprising coming from the base policy (just not astonishing). So it can certainly complete tasks more competently than a human demonstrator.
2. Ongoing receptiveness to feedback about its objectives: ✅
In the paper, the agent maximizes rewards (subject to the constraint of not engaging in behavior that gives it away as not being the base model). It continues to observe rewards, which provides more information about what it should do to get high rewards.
3. No reason to escape human control to accomplish its objectives: ✅
If the demonstrator would never escape human control to accomplish its objectives, then if X is small enough, the base model would never escape human control to accomplish its objectives, and neither would the agent that can't make it too obvious it's not the base model.
4. No impossible demands on human designers or operators: ✅
Operators could just give higher rewards when they are happy with the artificial agent's performance, and they could learn more effective strategies over time. Human demonstrators can be selected for better performance, but all they really have to do is absolutely never engage in the dangerous behavior that we want the AI to avoid. The most annoying demand on designers, if we are to satisfy the premises for the formal results, is that an action taken by the agent needs to have the same effect as it would have had if the human demonstrator had taken it. So depending on the context, the involvement of a human demonstrator, who is occasionally asked for help, might need to be obscured. That said, it's not obvious to me what threat would arise if this demand were not met.
5. No TODOs when defining how we set up the AI's setting: ✅
There are no constraints on the AI's setting.
6. No TODOs when defining any programs that are involved, except how to modify them to be tractable: ✅
These papers have the details.
There are three key potential weaknesses, which are the same as for the pessimistic agent:
It might require lots of ongoing human intervention/demonstration.
It might not be substantially superhuman. (Here even more so than for the pessimistic agent, in my view).
As always, there might not exist tractable approximations that preserve safety.