Matthew Barnett

Someone who is interested in learning and doing good.

My Twitter: https://twitter.com/MatthewJBar

My Substack: https://matthewbarnett.substack.com/

Sequences

Daily Insights

Wiki Contributions

Comments

I agree with virtually all of the high-level points in this post — the term "AGI" did not seem to usually initially refer to a system that was better than all human experts at absolutely everything, transformers are not a narrow technology, and current frontier models can meaningfully be called "AGI".

Indeed, my own attempt to define AGI a few years ago was initially criticized for being too strong, as I initially specified a difficult construction task, which was later weakened to being able to "satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model" in response to pushback. These days the opposite criticism is generally given: that my definition is too weak.

However, I do think there is a meaningful sense in which current frontier AIs are not "AGI" in a way that does not require goalpost shifting. Various economically-minded people have provided definitions for AGI that were essentially "can the system perform most human jobs?" And as far as I can tell, this definition has held up remarkably well.

For example, Tobias Baumann wrote in 2018,

A commonly used reference point is the attainment of “human-level” general intelligence (also called AGI, artificial general intelligence), which is defined as the ability to successfully perform any intellectual task that a human is capable of. The reference point for the end of the transition is the attainment of superintelligence – being vastly superior to humans at any intellectual task – and the “decisive strategic advantage” (DSA) that ensues.1 The question, then, is how long it takes to get from human-level intelligence to superintelligence.

I find this definition problematic. The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different2 – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”. AI systems may also diverge from biological minds in terms of speed, communication bandwidth, reliability, the possibility to create arbitrary numbers of copies, and entanglement with existing systems.

Unless we have reason to expect a much higher degree of convergence between human and artificial intelligence in the future, this implies that at the point where AI systems are at least on par with humans at any intellectual task, they actually vastly surpass humans in most domains (and have just fixed their worst weakness). So, in this view, “human-level AI” marks the end of the transition to powerful AI rather than its beginning.

As an alternative, I suggest that we consider the fraction of global economic activity that can be attributed to (autonomous) AI systems.3 Now, we can use reference points of the form “AI systems contribute X% of the global economy”. (We could also look at the fraction of resources that’s controlled by AI, but I think this is sufficiently similar to collapse both into a single dimension. There’s always a tradeoff between precision and simplicity in how we think about AI scenarios.)

Comparing my current message to his, he talks about "selfishness" and explicitly disclaims, "most humans are not evil" (why did he say this?), and focuses on everyday (e.g. consumer) behavior instead of what "power reveals".

The reason I said "most humans are not evil" is because I honestly don't think the concept of evil, as normally applied, is a truthful way to describe most people. Evil typically refers to an extraordinary immoral behavior, in the vicinity of purposefully inflicting harm to others in order to inflict harm intrinsically, rather than out of indifference, or as a byproduct of instrumental strategies to obtain some other goal. I think the majority of harms that most people cause are either (1) byproducts of getting something they want, which is not in itself bad (e.g. wanting to eat meat), or (2) the result of their lack of will to help others (e.g. refusing to donate any income to those in poverty).

By contrast, I focused on consumer behavior because the majority of the world's economic activity is currently engaged in producing consumer products and services. There exist possible worlds in which this is not true. During World War 2, the majority of GDP in Nazi Germany was spent on hiring soldiers, producing weapons of war, and supporting the war effort more generally—which are not consumer goods and services.

Focusing on consumer preferences a natural thing to focus on if you want to capture intuitively "what humans are doing with their wealth", at least in our current world. Before focusing on something else by default—such as moral preferences—I'd want to hear more about why those things are more likely to be influential than ordinary consumer preferences in the future. 

You mention one such argument along these lines:

I guess I wasn't as worried because it seemed like humans are altruistic enough, and their selfish everyday desires limited enough that as they got richer and more powerful, their altruistic values would have more and more influence.

I just think it's not clear it's actually true that humans get more altruistic as they get richer. For example, is it the case that selfish consumer preferences have gotten weaker in the modern world, compared to centuries ago when humans were much poorer on a per capita basis? I have not seen a strong defense of this thesis, and I'd like to see one before I abandon my focus on "everyday (e.g. consumer) behavior".

AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using combined compute, algorithms, data, and fine-tuning.

In my original comment, by "merging" I meant something more like "merging two agents into a single agent that pursues the combination of each other's values" i.e. value handshakes. I am pretty skeptical that the form of merging discussed in the linked article robustly achieves this agentic form of merging. 

In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I'll try to use more concrete language in the future to clarify what I'm talking about.

How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward?

I don't know whether the solution to the problem I described exists, but it seems fairly robustly true that if a problem is not imminent, nor clearly inevitable, then we can probably better solve it by deferring to smarter agents in the future with more information.

Let me put this another way. I take you to be saying something like:

  • In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to halt and give ourselves more time to solve it.

Whereas I think the following intuition is stronger:

  • In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it.

These intuitions can trade off against each other. Sometimes problem X is something that's made worse by getting more intelligent, in which case we might prefer more time. For example, in this case, you probably think that the intelligence of AIs are inherently contributing to the problem. That said, in context, I have more sympathies in the reverse direction. If the alleged "problem" is that there might be a centralized agent in the future that can dominate the entire world, I'd intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we'd prefer.

These are of course vague and loose arguments, and I can definitely see counter-considerations, but it definitely seems like (from my perspective) that this problem is not really the type where we should expect "try to get more time" to be a robustly useful strategy.

For what it's worth, I don't really agree that the dichotomy you set up is meaningful, or coherent. For example, I tend to think future AI will be both "like today's AI but better" and "like the arrival of a new intelligent species on our planet". I don't see any contradiction in those statements.

To the extent the two columns evoke different images of future AI, I think it mostly reflects a smooth, quantitative difference: how many iterations of improvement are we talking? After you make the context windows sufficiently long, add a few more modalities, give them a robot body, and improve their reasoning skills, LLMs will just look a lot like "a new intelligent species on our planet". Likewise, agency exists on a spectrum, and will likely be increased incrementally. The point at which you start to call an LLM an "agent" rather than a "tool" is subjective. This just seems natural to me, and I feel I see a clear path forward from current AI to the right-column AI.

I think even your definition of what it means for an agent to be aligned is a bit underspecified because it doesn't distinguish between two possibilities:

  1. Is the agent creating positive outcomes because it trades and compromises with us, creating a mutually beneficial situation that benefits both us and the agent, or

  2. Is the agent creating positive outcomes because it inherently "values what we value", i.e. its utility function overlaps with ours, and it directly pursues what we want from it, with no compromises?

Definition (1) is more common in the human world. We say that a worker is aligned with us if they do their job as instructed (receiving a wage in return). Definition (2) is more common in theoretical discussions of AI alignment, because people frequently assume that compromise is either unnecessary or impossible, as a strategy that we can take in an AI-human scenario.

By itself, the meaning you gave appears to encompass both definitions, but it seems beneficial to clarify which of these definitions you'd consider closer to the "spirit" of the word "aligned". It's also important to specify what counts as a good outcome by our values if these things are a matter of degree, as opposed to being binary. As they say, clear thinking requires making distinctions.

I sometimes think this of counterarguments given by my interlocutors, but usually don't say it aloud, since it's likely that from their perspective they're just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively

I think that's a reasonable complaint. I tried to soften the tone with "It's possible this argument works because of something very clever that I'm missing", while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.

Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn't count as yourself or doesn't count as "not dying", analogous to how some people don't think it's safe to step into a teleporter that works by destructive scanning and reconstruction.

Interestingly, I'm not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I'm happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said "digital records", although I really meant "public records"). It seems conceivable to me that someone could use my public data to train "me" in the future, but I find it unlikely, just because there's so much about me that isn't public. (If we're including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that's a different question, and one that I'm much more sympathetic towards you about. In fact, I shouldn't have used the pronoun "I" in that sentence at all, because I'm actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)

I don't understand why you say this chance is "tiny", given that earlier you wrote "I agree there’s a decent chance this hypothesis is true"

To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:

  1. Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there's still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
  2. Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?

I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that "AI values are well-modeled as being randomly sampled from a large space of possible goals", and thus, from my perspective, it's important to talk about how I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning gives way to different conclusions about the strength of the "narrow target" argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.

I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.

For what it's worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.

You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said "no, it doesn't actually make the prediction you claim it makes" and gave my reasons for believing that

I don't think what you said really counts as a "correction" so much as a counter-argument. I think it's reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.

What you said was,

I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.

This seems distinct from an "anything could happen"-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.

If instead you meant to make an "anything could happen"-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I'm not claiming you're doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]

the new OA board will include Altman (60%)

Looks like you were right, at least if the reporting in this article is correct, and I'm interpreting the claim accurately.

If it did resist shutdown this would make it less likely for an AI takeover in-general to succeed, and the AI is coordinating with other AIs on one succeeding

I think it plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but:

  1. To the extent you're using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they're involved in a secret conspiracy to overthrow the government. 

    The fact that that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don't see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way.
  2. While one can certainly imagine this fact being decisive in whether AIs will resist shutdown in the future, this argument seems like an ad hoc attempt to avoid falsification in my view. Here are some reasons why I think that: 

    (a) you appear to be treating misaligned AIs as a natural class, such that "AI takeover" is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category of AIs; it just refers to "anything that isn't aligned with humans". A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there's little reason to expect all of these misaligned AIs to be coordinating with each other on some sort of abstract takeover, across this vast mindspace, but somehow none of them want to coordinate with humans peacefully (at least, among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I'm being uncharitable) imagine coming from a place of wanting to avoid the potential for falsification.

    (b) in real life, it seems pretty rare for these considerations to play a large role in people's decision-making regarding death, and for good reason. Even if we assume a high degree of stoicism and ability to accept death, I find it unlikely that as a revolutionary fighting against some oppressive (even alien) regime, I would calculate that it is in fact in my best interest to passively accept death when threatened, rather than trying to fight or otherwise avoid the threat. The consequences of my resistance just seems small in expectation, compared to the obviously large catastrophic consequences of death (from the perspective of my values). If someone gave me this argument in real life, in any situation resembling the type that misaligned AIs are likely to face, I feel I would reject it.
Load More