Provable AI Alignment (ProvAIA)

A new R&D paradigm for developing "AI systems with quantitative safety guarantees".

Preliminary arguments to introduce the perspective adopted in this sequence:

AI Safety: I dismiss the capabilities vs safety divide. In the future, all AI development should be, in some respect, about AI safety. The high-level objective for the entire field should be "building towards a better future where all humanity can flourish".[1] Accomplishing this will require coordination across almost all aspects of the AI ecosystem and society more broadly.[2]

AI Alignment: AI Safety from a purely technical perspective, aimed at solving what is commonly known as the hard problem of alignment. In my current view, it must include two aspects: steering and control. To steer AI ≈ align it with human values, intent and goals. To control AI ≈ avoid harmful effects in the real world.[3] Our ability to directly align superhuman AI systems is limited to weak supervision. But maybe we could adopt an indirect approach? What if we could safely transfer some aspects of alignment to an external process or system? The broad ideas currently pursued in this regard are AI assistance/automation and formal verification.[4] These have led to what I call the Automated AI Alignment and Provable AI Alignment agendas, the only ones I consider to have a real shot at solving alignment.

Provable AI Alignment (ProvAIA)[5]: At the control and formal verification[6] end of the spectrum, several similar proposals have been introduced - the most notable being OAA and provably safe systems[7] - that seek to develop "AI systems with quantitative safety guarantees". With this sequence, I aim to distill, integrate and extrapolate from existing research agendas, domains and methodologies related to ProvAIA. My contribution will most likely be at a meta/strategic/epistemic level. However, depending on how my AI safety career progresses over the coming months, I hope to also make concrete technical contributions.

Upcoming posts
ProvAIA Introduction: Establish the high-level objectives, results and strategy (defined in terms of technical areas, TAs).

ProvAIA Meta: Distill and critique high-level, conceptual approaches. Clarify the theory of impact/change. Define meta-level strategy to orchestrate work across TAs.

  1. ^

    A disproportionate emphasis on either capabilities or safety seems to have led many people astray from what truly matters. Preventing X-risk is such a low bar to obsess over! A scenario where a meaningful number of people are severely harmed seems to me as serious as (or even more serious than) human extinction.

  2. ^

    I like to view the AI ecosystem in terms of five pillars: AI Capabilities, AI Alignment, AI Audit, AI Governance, AI Proliferation. Currently, I believe these are necessary and sufficient to cover the entire space.

  3. ^

    I expect the steering and control aspects to be both necessary and sufficient to achieve technical AI safety. In line with the Deontic Sufficiency Hypothesis, I believe that in the short term these two aspects could be pursued somewhat independently. Unlike the argument presented in provably safe systems, I consider steering (what I believe the authors mean by alignment) to be a necessary component for developing useful AI systems (e.g. the proof assistants required to implement provably safe systems). However, to contain/control, we must first "de-pessimize" and specify the absence of a catastrophe (see the sketch below). As AI capabilities increase, I expect we will need to unify steering and control approaches.
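    To make "specify the absence of a catastrophe" concrete, here is a minimal sketch. It is my own toy illustration, not taken from OAA or provably safe systems: the safety specification is just a predicate over states that must never hold, and checking a policy against it reduces to reachability analysis in a coarse, pessimistic world model. The state space, policy and noise model are invented for the example.

        # Toy sketch (illustrative only): a "de-pessimized" safety spec says what
        # must NOT happen; verification is reachability analysis in a pessimistic,
        # nondeterministic world model. All numbers are made up.

        def catastrophe(state: int) -> bool:
            """The spec only rules out catastrophe; it says nothing about values."""
            return state >= 9    # hypothetical 1-D world: positions 0..10

        def policy(state: int) -> int:
            """Example controlled policy: never push right of position 7."""
            return +1 if state < 7 else -1

        def transition(state: int, action: int) -> set:
            """Pessimistic world model: the action may under- or overshoot by 1."""
            return {max(0, min(10, state + action + noise)) for noise in (-1, 0, 1)}

        def verify(initial_states) -> bool:
            """Exhaustive reachability check: no reachable state is catastrophic."""
            frontier, seen = set(initial_states), set()
            while frontier:
                s = frontier.pop()
                if catastrophe(s):
                    return False
                seen.add(s)
                frontier |= transition(s, policy(s)) - seen
            return True

        print(verify({0, 1, 2}))    # True: the catastrophe region is unreachable

    Nothing here requires the policy to be aligned with anyone's values; the specification only rules out the catastrophic region, which is what makes the control side easier to formalise (in principle) than the steering side.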

  4. ^

    My current view is that, if a once-and-for-all superalignment solution even exists, it will likely combine these two ideas. 

    Without a validation protocol that is external to the automated AI alignment researcher (AAR), we cannot guarantee safety. However, humans alone will not be able to reliably evaluate complex AI behaviours (a capable AAR is expected to surpass most humans in meaningful ways). 

    Formal methods are the only alternative I can think of that could, in theory, enable such a validation protocol. Unfortunately, in practice, applying formal verification to ML is bottlenecked by the scale, complexity and lack of transparency of our currently deployed models. This is exactly where AI automation could yield the most leverage.
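    As a toy illustration of what a formal guarantee about an ML model can look like - and of why scale, rather than principle, is the bottleneck - here is a sketch using interval bound propagation, a standard sound over-approximation; the network weights and the "safe" output threshold are invented for the example.

        # Toy sketch: interval bound propagation certifies an output bound for ALL
        # inputs in a box. Weights and the "safe" threshold are made up; real
        # networks are far too large and opaque for such bounds to stay cheap and tight.
        import numpy as np

        def affine_bounds(lo, hi, W, b):
            """Soundly propagate the input box [lo, hi] through x -> Wx + b."""
            Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
            return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

        def relu_bounds(lo, hi):
            """ReLU is monotone, so it maps interval bounds elementwise."""
            return np.maximum(lo, 0), np.maximum(hi, 0)

        # Tiny 2-2-1 ReLU network with arbitrary weights.
        W1, b1 = np.array([[1.0, -0.5], [0.3, 0.8]]), np.array([0.1, -0.2])
        W2, b2 = np.array([[0.7, -1.2]]), np.array([0.05])

        # Property to certify: for every input in [-1, 1]^2, the output stays below 2.
        lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
        lo, hi = relu_bounds(*affine_bounds(lo, hi, W1, b1))
        lo, hi = affine_bounds(lo, hi, W2, b2)

        print(f"certified output range: [{lo[0]:.2f}, {hi[0]:.2f}]")
        print("property 'output < 2' holds on the whole box:", bool(hi[0] < 2.0))

    On two hidden neurons the bounds are cheap and tight; on frontier-scale models they blow up, which is exactly the gap an AAR automating specification, decomposition and proof search would need to close.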

  5. ^

    While I think "Provable AI Safety" is a better name (since existing works cover aspects beyond technical alignment, from all the other AI ecosystem pillars I mentioned in [2]), for the sake of maintaining consistent (literary) structure I chose "Provable AI Alignment".

  6. ^

    Note, however, that all Provable AI Alignment proposals I am aware of have a significant AI automation component.

  7. ^

    OAA has the advantage of being a deeply technical and relatively detailed/well-explored research agenda, while provably safe systems is a higher-level proposal that emphasises the integration of formal verification within the broader digital, physical and social infrastructure. Beyond these two proposals, I would also keep in mind that OpenAI's AAR-based plan might be grounded in similar ideas.