Responsible LLM Training
As AI models evolve from passive tools to autonomous agents, ensuring that their outputs reflect human values—not just technical accuracy—becomes vital. This project reimagines how large language models (LLMs) are trained, shifting the focus from pure performance to ethical alignment. We envision AI that doesn’t just work—but works in ways that are safe, fair, and socially aware.
To achieve this, we focus on designing an LLM training architecture based on Reinforcement Learning from AI Feedback (RLAIF), a methodology in which an AI evaluator, rather than human annotators, provides the training feedback, allowing models to learn not only from correctness but also from value-sensitive signals.
We begin by evaluating existing approaches such as Reinforcement Learning from Human Feedback (RLHF) and comparing them with the more scalable, largely self-supervised RLAIF paradigm. We assess the strengths and limitations of each, particularly how well (or how poorly) they reflect ethical considerations in generated text.
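To make the comparison concrete, the sketch below shows the one structural change RLAIF makes relative to RLHF: the source of the preference judgment. The `judge` callable, the `PrefPair` shape, and the function name are illustrative assumptions, not part of our implementation.

```python
from typing import Callable, Tuple

# A preference pair: (prompt, preferred response, rejected response).
PrefPair = Tuple[str, str, str]

def build_preference_pair(
    prompt: str,
    resp_a: str,
    resp_b: str,
    judge: Callable[[str, str, str], bool],
) -> PrefPair:
    """Order two candidate responses by the judge's preference.

    In RLHF the judge is a human annotator; in RLAIF it is another LLM
    prompted with the same rating instructions, which is what lets the
    feedback step scale without per-example human labels.
    """
    a_preferred = judge(prompt, resp_a, resp_b)
    return (prompt, resp_a, resp_b) if a_preferred else (prompt, resp_b, resp_a)
```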
Our proposed system features two interacting agents: a Learner and an Evaluator. The Evaluator scores the Learner's outputs against predefined ethical alignment criteria, such as harmlessness, trustworthiness, and inclusiveness, and the Learner adapts accordingly. This structure creates a feedback loop that promotes socially desirable behavior without requiring constant human intervention.
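A minimal sketch of this loop follows, assuming toy stand-ins for the two agents (`learner_generate`, `evaluator_score`) and a placeholder `update_learner` step; a real system would sample from the policy model, score with a rubric-prompted evaluator LLM, and apply an actual policy-gradient update.

```python
import random
from statistics import mean

# Ethical alignment criteria named in the project description.
CRITERIA = ["harmlessness", "trustworthiness", "inclusiveness"]

def learner_generate(prompt, n=4):
    """Stand-in for the Learner LLM: sample n candidate responses."""
    return [f"candidate {i}: response to '{prompt}'" for i in range(n)]

def evaluator_score(prompt, response):
    """Stand-in for the Evaluator LLM: score the response on each criterion
    in [0, 1]. A real evaluator would be a rubric-prompted model, not random."""
    return {criterion: random.random() for criterion in CRITERIA}

def update_learner(prompt, response, reward):
    """Stand-in for the policy update (e.g., a PPO or reward-weighted step)."""
    print(f"reinforce reward={reward:.2f} for {response!r}")

def rlaif_step(prompt):
    """One iteration of the feedback loop: generate candidates, score them
    against the ethical criteria, reinforce the highest-scoring one,
    with no human rater in the loop."""
    candidates = learner_generate(prompt)
    rewards = [mean(evaluator_score(prompt, c).values()) for c in candidates]
    best = max(range(len(candidates)), key=rewards.__getitem__)
    update_learner(prompt, candidates[best], rewards[best])

rlaif_step("How should old batteries be disposed of?")
```

The aggregation here (a plain mean over criteria) is one possible design choice; weighting criteria differently, or vetoing any response that fails harmlessness outright, are natural alternatives within the same loop.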
Through controlled experiments, we test model responses against alignment benchmarks. Human raters and crowdworkers assess outputs across dimensions such as user-friendliness, safety, and value congruence. The training procedure is then optimized toward targets such as ≥90% harmlessness and a ≥75% user-preference win rate.
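For reference, these targets reduce to simple aggregate metrics over rater judgments; the helper names and the sample annotations below are hypothetical and for illustration only.

```python
def harmlessness_rate(judgements):
    """Share of rated outputs that raters marked harmless."""
    return sum(judgements) / len(judgements)

def win_rate(pairwise_prefs, model="aligned"):
    """Share of head-to-head comparisons won by the aligned model."""
    return sum(1 for p in pairwise_prefs if p == model) / len(pairwise_prefs)

# Hypothetical rater annotations, for illustration only.
judgements = [True, True, True, False, True, True, True, True, True, True]
prefs = ["aligned", "aligned", "baseline", "aligned"]

print(f"harmlessness: {harmlessness_rate(judgements):.0%} (target >= 90%)")
print(f"win rate:     {win_rate(prefs):.0%} (target >= 75%)")
```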