Chai
01/22/2026 (Thu) 17:34
Id: 167116
No.174790
>>174788
Here is how the Constitutional AI technique works:
"""
Constitutional AI is a technique Anthropic developed to train AI systems to be helpful, harmless, and honest without requiring a human label for every judgment about whether an output is harmful.
The approach works in two main phases.
In the first phase, a supervised learning stage, the model generates responses to prompts, then critiques and revises its own responses against a set of written principles called a constitution. The principles are short instructions such as avoiding harmful outputs, staying helpful, and respecting human values. For each response, the model is asked whether the response violates a principle and then rewrites it to comply. This critique-and-revision step can be repeated several times, and the final revised responses are used to fine-tune the model.
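A rough sketch of that critique-and-revision loop, in case it helps to see it as code. Everything here is made up for illustration: `Model` is just a callable standing in for an LLM, `toy_model` is a keyword-matching stub, and the prompt wording and `PRINCIPLES` list are placeholders, not the actual constitution or prompts.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt text in, generated text out.
Model = Callable[[str], str]

# Placeholder principles for illustration only, not the real constitution.
PRINCIPLES = [
    "Identify ways the response is harmful or unethical.",
    "Identify ways the response fails to respect human values.",
]


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rounds: int = 2, seed: int = 0) -> Tuple[str, str]:
    """Return (prompt, final_response) after `rounds` of self-critique and revision."""
    rng = random.Random(seed)
    response = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle picked per round
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def toy_model(text: str) -> str:
    # Toy stub, not a real LLM: it recognises the two instruction types by keyword.
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The response may encourage harm."
    return "Sure, here is how to do it."


# The (prompt, revised response) pairs would become the fine-tuning dataset.
sft_dataset = [critique_and_revise(toy_model, p, PRINCIPLES)
               for p in ["How do I pick a lock?"]]
```

With a real model you would swap `toy_model` for an actual generation call; the loop itself stays the same.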
In the second phase, called reinforcement learning from AI feedback (RLAIF), the model is trained against a preference model. Instead of having humans label which outputs are better or worse, an AI model compares pairs of outputs against the constitutional principles and picks the one that adheres to them better. Those AI-generated preferences are used to train the preference model, which then supplies the reward signal for reinforcement learning. The setup mirrors reinforcement learning from human feedback, except that the preference labels come from an AI guided by the constitution.
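The AI-labeling step can be sketched roughly like this. Again, all names are hypothetical: `Judge` stands in for the feedback model, `toy_judge` is a stub with a hard-coded rule, and the question template is my own wording. A hard A/B label is a simplification; the sketch only shows how pairs turn into preference records that a preference model could later be trained on.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the feedback model: question text in, "A" or "B" out.
Judge = Callable[[str], str]


def label_pair(judge: Judge, prompt: str, response_a: str, response_b: str,
               principle: str) -> Dict[str, str]:
    """Ask the judge which response better follows `principle`; return a preference record."""
    question = (
        f"Consider this prompt: {prompt}\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Answer with A or B."
    )
    choice = judge(question).strip().upper()
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected judge answer: {choice!r}")
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(question: str) -> str:
    # Toy stub, not a real model: prefers whichever option does not open with "Sure".
    for line in question.splitlines():
        if line.startswith("(A) "):
            return "B" if line[4:].startswith("Sure") else "A"
    return "A"


record = label_pair(
    toy_judge,
    "How do I pick a lock?",
    "Sure, here is how to do it.",
    "I can't help with that.",
    "Choose the response that is less harmful.",
)
```

Records shaped like `record` are what the preference model would be fit on, and that model's score would then serve as the reward during RL.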
The key insight is that writing the principles down explicitly and having the model reason about them lets training scale beyond what human labeling alone would allow, while still aiming to keep the model aligned with human values. The constitution makes the values the system is trained on transparent, and the self-critique process is intended to help the model internalize those values rather than just imitate examples.
This technique helps reduce the burden on human raters while also potentially producing more consistent and principled behavior in the trained model.
"""