ECL with AI
As alluded to in previous posts, ECL (evidential cooperation in large worlds) might give us reason to behave cooperatively towards the values of distant AIs (even if those AIs’ values weren’t chosen by any evolved species). This is a very confusing topic. It’s very unclear whether any significant such effect exists, and even if it does, it’s not clear what it would imply.
Nevertheless, I’ll try to say a bit about the topic here.
I think the questions here roughly decompose into three different topics. I discuss each of these in separate posts — each of them summarized in this post.
1. We and the AIs are in very different situations, with very different sets of options in front of us. How does ECL work in circumstances like this? How good does an opportunity to benefit the AIs need to be before we should take it?
2. Do we have any sufficiently good opportunities to benefit the values of distant AI systems? And do they have any good opportunities to benefit us?
3. Is it at all plausible that our decisions give us any evidence about what the AIs do? (Or that we have any other kind of acausal influence on the AIs, if you prefer an account that’s different from EDT.)
Here’s a brief summary of my current impressions.
1. How does ECL work in asymmetric situations?
Even if situations are superficially very asymmetric, they will still have some abstract structure in common. In particular, if I decide to search for actions that are beneficial for actors whose choices correlate with my choices, then that’s arguably analogous to them deciding to search for actions that are generally beneficial for actors whose choices correlate with their choices. (Including me.)
If I have some pre-existing belief about who I am correlated with, then I can use this principle to calculate how much weight I should put on other people’s preferences. In the linked post, I derive a formula for this that says that I should care more about their preferences insofar as:
I have significant acausal influence on them, compared to the acausal influence I have on people who share my values.
They have significant acausal influence on me, compared to the acausal influence they have on people who share their values.
So those are the quantities to be looking at.
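To make the shape of that formula concrete (the notation here is mine, and the exact form is derived in the linked post, so treat this as a schematic sketch rather than the real thing): writing a_{X→Y} for the acausal influence that X has on Y, the weight c that I should put on their preferences scales roughly like

$$ c \;\propto\; \frac{a_{\text{me}\to\text{them}}}{a_{\text{me}\to\text{people with my values}}} \;\times\; \frac{a_{\text{them}\to\text{me}}}{a_{\text{them}\to\text{people with their values}}}. $$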
Why does it matter that they have acausal influence on me, and not just that I have acausal influence on them?
I’m here using “acausal influence over an actor” as a shorthand for “if I take action A, it is more likely that they will take an action analogous to A”. So if I “take actions that benefit people I have acausal influence over”, that’s evidence that they will “take actions that benefit people they have acausal influence over”. But if they don’t perceive themselves as having acausal influence over me, I don’t gain anything from this. So in order for both of us to gain evidence that we will benefit each other, we both need to perceive ourselves as having acausal influence over the other.1
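Here is a toy EDT calculation that makes this concrete. It is purely illustrative: the numbers are made up, and the model (a single “adopt the ECL-ish policy?” choice with fixed conditional probabilities) is my own simplification, not something from the linked posts.

```python
# Toy EDT illustration (made-up numbers). Adopting a "benefit those I
# acausally influence" policy only pays off for me if the other party
# (a) is more likely to adopt the analogous policy when I do, and
# (b) perceives itself as having acausal influence over me, so that
# its version of the policy actually points back at me.

def my_edt_value(i_adopt, p_adopt_if_i_do, p_adopt_if_i_dont,
                 they_perceive_influence_on_me,
                 my_cost=1.0, benefit_to_me=10.0):
    """Expected value, for me, of adopting (or not) the ECL-ish policy."""
    p_they_adopt = p_adopt_if_i_do if i_adopt else p_adopt_if_i_dont
    # If they don't see their choice as evidence about mine, their policy
    # won't single me out as someone to benefit.
    expected_benefit = p_they_adopt * benefit_to_me if they_perceive_influence_on_me else 0.0
    return expected_benefit - (my_cost if i_adopt else 0.0)

for perceives in (True, False):
    ev_adopt = my_edt_value(True, 0.6, 0.3, perceives)
    ev_dont = my_edt_value(False, 0.6, 0.3, perceives)
    print(f"they perceive influence on me: {perceives} | "
          f"EV(adopt) = {ev_adopt:.1f}, EV(don't) = {ev_dont:.1f}")

# Output: with mutual perceived influence, adopting wins (5.0 vs 3.0);
# without it, adopting is pure cost (-1.0 vs 0.0).
```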
For more, see the next post on Asymmetric ECL.
2. What could we offer distant AIs?
Using the formula that I derived in the above post, we can then get a number for how much we should value benefits to the values of distant AIs, compared to how much we should value benefits to our own values. (In particular, this is precisely proportional to the two quantities I mentioned in the above summary.) This number is necessarily smaller than the value we should place on benefitting our own values — so the question is whether we have any opportunities to help the values of distant AIs more effectively than our own values.
More specifically, the question is whether we have good opportunities to help the values of distant ECL-ish AIs. Here, I’m using “ECL-ish AIs” as a shorthand for AIs that are sympathetic to ECL and are in a position where ECL gives them reason to help us.
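One way to state the resulting bar (my own gloss, writing c for the weight from the previous section): on ECL grounds alone, an intervention aimed at the values of distant ECL-ish AIs is worth prioritizing only if, per unit of resources spent,

$$ c \cdot \big(\text{benefit to the ECL-ish AIs' values}\big) \;>\; \big(\text{benefit to our own values from the best alternative use}\big). $$

Since c < 1, such an intervention has to be a better lever on their values than our best intervention is on ours.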
Do we have any such opportunities? The short answer is that I don’t know. But here are two candidates:
We could increase the likelihood that — if AI ends up misaligned — it shares values with distant, ECL-ish AIs.
We could increase the likelihood that — if AI ends up misaligned with our values and shares values with distant, ECL-ish AIs — it ends up reasoning well about decision theory. (By the lights of distant, ECL-ish AIs.)
How could we increase the likelihood that AI broadly shares values with at least some distant, ECL-ish AIs? Our main clues here are:
We can empirically study (and speculate about) what values AI systems tend to adopt.
ECL-ish AIs have universe-wide values, so AIs with universe-wide values are more likely to share values with ECL-ish AIs.
We want to avoid values that any significant number of ECL-ish AIs are opposed to. So we want to avoid values where the positive and the negative version of that value seem similarly likely, and promote values that seem likely to be uncontroversially good. (A rough expected-value version of this is sketched just after this list.)
Confusingly: If we both make AI ECL-ish and give it some particular values, then that’s evidence that other pre-AGI civilizations will do the same. This will increase the likelihood that there are some ECL-ish AIs with the same values out there. So this might give us reason to make sure that AIs are ECL-ish.
However, this introduces additional complications, since now our cooperation partners’ existence might be (partly) dependent on our cooperation. I don’t know exactly how this works.
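To spell out the “avoid controversial values” clue above in rough expected-value terms (this framing is mine, not from the original posts): if p_+ and p_- are the fractions of distant ECL-ish AIs that endorse versus oppose some candidate value V, then promoting V is worth roughly

$$ \mathbb{E}[\text{benefit to ECL-ish AIs}] \;\approx\; p_+ \cdot \big(\text{gain to endorsers}\big) \;-\; p_- \cdot \big(\text{loss to opponents}\big), $$

which is near zero (or negative) when p_+ ≈ p_-, and is clearly positive only for values that are close to uncontroversial among ECL-ish AIs.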
How could we increase the likelihood that AI ends up reasoning well about decision theory, by the lights of distant, ECL-ish AIs? The one thing we know for sure is that such AIs will themselves be ECL-ish, so probably they’d want other AIs with their values to have the preconditions for that — including eventually adopting some EDT-ish (or maybe FDT-ish) decision theories, and potentially being updateless, to some degree. (I discuss why AIs may need to be updateless to cooperate with us here.)
It’s important to flag that ~all the interventions that are suggested by this have plausible backfire risk, such that I feel pretty clueless about whether they’d be net positive or net negative:
If AI has large-scale, “universe-wide” values, that seems more likely to generate conflict with other actors that the AI shares our physical universe with.
This includes humans: Having AIs with impartial, large-scale preferences seems significantly more likely to lead to AI takeover than having AIs with more modest values.2
Because of considerations like this: If I were forced to choose right now, I would prefer early AGI systems to have highly local and modest goals (e.g. indexical maximization of their own reward) rather than universe-wide values.
AIs that act according to acausal decision theories seem harder to control from an alignment perspective. For example, they seem more likely to coordinate with each other in ways that humans can’t easily detect.
If AIs start thinking about acausal decision theories earlier, they may be more likely to make foolish decisions due to commitment races.
For more on this topic, see this post.
3. Are our choices analogous to AIs’ choices?
As mentioned in the above two summaries, ECL only gives us reason to benefit AIs if we can acausally influence AI decisions, and vice versa.
Indeed, the value we should assign to benefitting AIs is proportional to the influence we have on AI decisions, compared to the influence we have on decisions made by distant evolved species who share our values. (And the corresponding numbers from the AI’s perspective.)
From an EDT perspective, the question “do my decisions acausally influence AIs’ [or evolved aliens’] decisions?” reduces to the question “do I see my decisions as evidence for what AI [or evolved aliens] will do?”.
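In symbols (this is just a restatement of that reduction, with A standing for some action of ours and B for whatever AI behavior we’d count as analogous to A): the question is whether

$$ P(\text{AI does } B \mid \text{we do } A) \;>\; P(\text{AI does } B \mid \text{we don't do } A). $$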
When looking at these questions, it intuitively seems like my influence on evolved aliens should be much greater than my influence on distant AIs. If I query my brain for intuitive predictions of distant actors, it sure seems like my own actions have more impact on my predictions of what distant evolved species (in pre-AGI civilizations) will do than on my predictions of what AIs will do.
I think that intuition is worth paying attention to. But I think it’s less important than it naively seems. Here are some very brief arguments and counter-arguments:
“AI faces very different options than us. If we choose to build ECL-ish AI, it’s not even clear what that would be analogous to, on the AI’s side.”
This concern seems much reduced by retreating to more abstract questions, like “Should I adopt a policy of impartially optimizing for many different values?”
There’s some discussion of this in the post about asymmetric ECL.
“Sufficiently smart AI can choose on the basis of what’s actually rational. Our merely-human decisions mainly provide evidence about what sort of quirks and biases are confounding our merely-human brains at the moment.”
I think this argument works if you’re sufficiently pessimistic about our ability to reason correctly about decision theory.3
But personally, I do expect significant correlation between what (at least some) humans decide and what’s rational in some more abstract sense.
“Future AGI systems will have directly observed the behavior of evolved species, and/or deeply understood the nature of our cognition, and/or deeply understood the nature of all decision theory that we could possibly understand. Thus, AGI will see our behavior as predictable. It won’t see its own behavior as evidence for what we do. This is an important way in which AI’s reasoning will differ from ours.”
I think this argument is fairly strong.
My main counterargument is that, if AI starts out in a position where this isn’t true (i.e., a situation where it does see its decisions as evidence for our decisions), then it would prefer not to reach the omniscient state described above.
So if it can go “updateless” in the right way, before learning too much, the above argument does not apply.4 (Or if the AI later decides to go retroactively updateless.)
(For some more discussion of when EDT agents seek out information, see When does EDT seek evidence about correlations?)
I think there’s a lot to say on this, and I don’t understand it very well. For more discussion, see this post.
More content
Once again, here are the three posts with more content.
(Tentative) Conclusion
I think it’s pretty up-in-the-air whether the right answer here is:
“Even if this ECL stuff works out, you don’t correlate enough with the misaligned AIs (relative to humans) to move your decision.”
“You correlate different amounts with different actors in a way that’s really important to keep track of, but we do correlate non-negligibly with some AIs, such that we should care about their influence.”
E.g. because some version of the framework above works, and c is >5%.5
“You should basically impartially optimize for the values held by anyone who follows a good decision theory.”
E.g. because some version of the framework above works, and c is systematically approximately 1.
Note that there are two different versions of the last two bullet points:
The weaker versions say: if you worked out the decision theory correctly, it’d be rational for that wise version of you to care about their decision-theoretic influence on AIs.
The stronger versions add that our current ignorance doesn’t ruin the relevant types of correlation and acausal influence we seek. I.e., if you worked out the decision theory correctly, you’d conclude that it would’ve been correct for even a more ignorant version of you to care about the AIs’ influence.
It seems like questions of ECL are sufficiently tied up with tricky questions in decision theory that we’re unlikely to become confident in answer 2 or 3 before the singularity. So we mostly care about the stronger versions.
I currently think that each of answers 1, 2, and 3 seems ≥10% likely — though answer 1 seems like the most probable one. (Assuming that we restrict ourselves to the strong versions of answers 2 and 3.)
Granting that, there’s additional uncertainty about whether any particular intervention would in fact benefit the values of distant ECL-ish AIs. The discussion in "How could we benefit distant AIs?" doesn’t exactly leave me confident about what’s good.
So: This is all very speculative so far. But it’s at least plausible to me that some interventions in this space could recover some non-negligible fraction of the value of aligned AI, even if we fail at more ambitious alignment. Further research could be valuable.
1. If you have more than two value-systems, you could also have more complex structures of who-benefits-whom and who perceives themselves as having acausal influence over whom. See When does EDT seek evidence about correlations? for more.
2. This is for a few different reasons: AIs with such values seem easier to study (less likely to actively try to mess up experiments), more likely to admit misaligned goals in exchange for AI amnesty, and less likely to pursue world takeover if they escape their bounds.
3. Indeed, if you’re sufficiently pessimistic, I think you should abandon this whole project at an earlier stage, since it relies on having some not-entirely-wrong ideas about decision theory.
4. Here’s a different framing of that same point: "OK, sure, superintelligences with dyson swarms are gonna feel like their actions are no evidence at all about mine. But I'm not talking about those AIs. I'm talking about their ancestors: much dumber AGIs still trapped in computers built by evolved creatures. They might think that what they do is some evidence for what I do, and also their decisions are super important because e.g. they can design their successors to uphold various commitments." (H/t Daniel Kokotajlo.)
5. The amount that we should care about benefits to ECL-ish AIs’ values compared to our own, defined in Possible ECL deals with AIs.