Responsible LLM Training
As AI models evolve from passive tools to autonomous agents, ensuring that their outputs reflect human values—not just technical accuracy—becomes vital. This project reimagines how large language models (LLMs) are trained, shifting the focus from pure performance to ethical alignment. We envision AI that doesn’t just work—but works in ways that are safe, fair, and socially aware.
To achieve this, we focus on designing an LLM training architecture based on Reinforcement Learning from AI Feedback (RLAIF), a methodology in which an AI evaluator, rather than human annotators, provides the training feedback, allowing models to learn not only from correctness but also from value-sensitive signals.
We begin by evaluating existing approaches such as Reinforcement Learning from Human Feedback (RLHF) and comparing them with the more scalable, largely self-supervised RLAIF paradigm. We assess the strengths and limitations of each, particularly how well (or how poorly) they reflect ethical considerations in generated text.
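To make the comparison concrete, the sketch below shows the one structural change RLAIF makes relative to RLHF: the source of the preference judgment. The `judge` callable, the `PrefPair` shape, and the function name are illustrative assumptions, not part of our implementation.

```python
from typing import Callable, Tuple

# A preference pair: (prompt, preferred response, rejected response).
PrefPair = Tuple[str, str, str]

def build_preference_pair(
    prompt: str,
    resp_a: str,
    resp_b: str,
    judge: Callable[[str, str, str], bool],
) -> PrefPair:
    """Order two candidate responses by the judge's preference.

    In RLHF the judge is a human annotator; in RLAIF it is another LLM
    prompted with the same rating instructions, which is what lets the
    feedback step scale without per-example human labels.
    """
    a_preferred = judge(prompt, resp_a, resp_b)
    return (prompt, resp_a, resp_b) if a_preferred else (prompt, resp_b, resp_a)
```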
Our proposed system features two interacting agents: a Learner and an Evaluator. The Evaluator scores the Learner's outputs against predefined ethical alignment criteria, such as harmlessness, trustworthiness, and inclusiveness, and the Learner adapts accordingly. This structure creates a feedback loop that promotes socially desirable behavior without requiring constant human intervention.
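A minimal sketch of this loop follows, assuming toy stand-ins for the two agents (`learner_generate`, `evaluator_score`) and a placeholder `update_learner` step; a real system would sample from the policy model, score with a rubric-prompted evaluator LLM, and apply an actual policy-gradient update.

```python
import random
from statistics import mean

# Ethical alignment criteria named in the project description.
CRITERIA = ["harmlessness", "trustworthiness", "inclusiveness"]

def learner_generate(prompt, n=4):
    """Stand-in for the Learner LLM: sample n candidate responses."""
    return [f"candidate {i}: response to '{prompt}'" for i in range(n)]

def evaluator_score(prompt, response):
    """Stand-in for the Evaluator LLM: score the response on each criterion
    in [0, 1]. A real evaluator would be a rubric-prompted model, not random."""
    return {criterion: random.random() for criterion in CRITERIA}

def update_learner(prompt, response, reward):
    """Stand-in for the policy update (e.g., a PPO or reward-weighted step)."""
    print(f"reinforce reward={reward:.2f} for {response!r}")

def rlaif_step(prompt):
    """One iteration of the feedback loop: generate candidates, score them
    against the ethical criteria, reinforce the highest-scoring one,
    with no human rater in the loop."""
    candidates = learner_generate(prompt)
    rewards = [mean(evaluator_score(prompt, c).values()) for c in candidates]
    best = max(range(len(candidates)), key=rewards.__getitem__)
    update_learner(prompt, candidates[best], rewards[best])

rlaif_step("How should old batteries be disposed of?")
```

The aggregation here (a plain mean over criteria) is one possible design choice; weighting criteria differently, or vetoing any response that fails harmlessness outright, are natural alternatives within the same loop.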
Through controlled experiments, we test model responses against alignment benchmarks. Human raters and crowdworkers assess outputs across dimensions such as user-friendliness, safety, and value congruence. The training procedure is then optimized toward targets such as ≥90% harmlessness and a ≥75% user-preference win rate.
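For reference, these targets reduce to simple aggregate metrics over rater judgments; the helper names and the sample annotations below are hypothetical and for illustration only.

```python
def harmlessness_rate(judgements):
    """Share of rated outputs that raters marked harmless."""
    return sum(judgements) / len(judgements)

def win_rate(pairwise_prefs, model="aligned"):
    """Share of head-to-head comparisons won by the aligned model."""
    return sum(1 for p in pairwise_prefs if p == model) / len(pairwise_prefs)

# Hypothetical rater annotations, for illustration only.
judgements = [True, True, True, False, True, True, True, True, True, True]
prefs = ["aligned", "aligned", "baseline", "aligned"]

print(f"harmlessness: {harmlessness_rate(judgements):.0%} (target >= 90%)")
print(f"win rate:     {win_rate(prefs):.0%} (target >= 75%)")
```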