An Introduction to Alignment


AI Safety

In the past few years, artificial intelligence capabilities have dramatically increased in broad domains like playing games and generating text, code, images, and videos. Most of these latest advances have come from “foundation models,” large neural networks trained on massive amounts of data. These models continue to demonstrate new and more powerful capabilities as model size, training data, and compute increase.

As AI models increase in capabilities and complexity, they pose greater risks to individuals and society. Some risks stem from misuse by bad actors, such as the spread of disinformation, government censorship or surveillance, or election manipulation. But other risks exist even without misuse, such as ingrained biases, the generation of untruthful or misleading information, or physical harm from applied AI like self-driving cars.

AI Safety is a field focused on finding technical solutions to these problems. Researchers have pursued approaches such as designing software or hardware to better control AI behavior, eliminating bias from datasets, and detecting AI-generated content. According to the 2017 Tokyo Statement “Cooperation for Beneficial AI,” the aim of AI Safety efforts is to make AI systems “demonstrably safe, reliable and robust, and developed in alignment with the values of the communities in which it will be deployed.”

AI Safety and AI Alignment

The increasing capabilities of AI models have re-fueled a race among major tech companies to develop transformatively powerful AI systems. Many organizations have as their stated goal the creation of Artificial General Intelligence (AGI): systems able to perform all intellectual tasks that humans can.

Most AI Safety techniques being developed for today’s AI systems will not scale as AI becomes increasingly powerful. As laboratories get closer to developing AGI, AI Safety experts need to contend with a different set of potential risks, including catastrophic damage caused by transformatively powerful systems.

AI Alignment is the subfield of AI Safety focused on these risks. It is broadly concerned with the problem of directing artificial intelligence toward desired outcomes, particularly as the capabilities of AIs, and the responsibilities delegated to them, increase.

Alignment research can be split into two broad categories: conceptual alignment and applied alignment, akin to theoretical and applied science.

The goal of conceptual alignment is to reliably target AIs toward desired outcomes in ways that are robust to scaling of automation and capabilities. Although some work directly proposes (incomplete) alignment schemes, like Paul Christiano and Geoffrey Irving’s AI Safety via Debate and Koen Holtman’s Counterfactual Planning, most of the literature focuses instead on more fundamental questions: constraints on inferring preferences, what the right concepts are for directing AIs, and mathematical theories that can guarantee key behaviors and outcomes.

Applied alignment, on the other hand, focuses on experimental work studying the architectures that will be used to implement powerful AIs. The vast majority of this work falls under the heading of interpretability, the subfield that searches for explanations and causal models of why current ML models behave the way they do. Practically implementing alignment schemes also falls under applied alignment, though little has been done so far beyond a few experiments around debate.

Alignment fundamentally differs from most scientific and engineering disciplines, and from most mainstream research in AI Safety. Because alignment solutions must be robust to scaling, researchers cannot assume any reasonable limit on the capabilities and responsibilities of AIs, which causes many bounded solutions to fail. Because of the potentially massive consequences of malfunction, researchers cannot iterate on powerful AIs and catch alignment problems as they crop up. And because AIs act in the world and react to our interventions on them, researchers cannot treat alignment as a static problem of natural science. All these subtleties require a complete reworking of the methods used to tackle alignment and the risks posed by the malfunction of increasingly capable AI.

Come work with us!

Check out our current open positions!