Reinforcement Learning from Human Feedback (RLHF)

Technical

A training technique where a reward model trained on human preference data is used to fine-tune an LLM via reinforcement learning, aligning it with human values.

Explained at 5 levels

👶5 Year Old

Teaching the AI to be nicer and more helpful by having people tell it "good answer!" or "bad answer!" over and over.

📚Middle Schooler

A way to make AI better by having humans rate its answers — thumbs up or thumbs down — so it learns what people actually want.

🎓College Student

A training technique where a reward model trained on human preference data is used to fine-tune an LLM via reinforcement learning, aligning it with human values.

🧑Adult

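An alignment method that trains a reward model from pairwise human preferences, then optimizes the language model policy with reinforcement learning (typically PPO) to maximize the learned reward, while a KL penalty keeps the policy close to its starting model. DPO is an alternative that optimizes the policy directly on the preference pairs, without a separate reward model.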
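
As a rough illustration of the two ingredients above, the sketch below computes a pairwise preference loss for a reward model and a reward adjusted by a penalty for drifting away from a reference model. It is a minimal sketch under stated assumptions, not a training implementation: the function names, the scalar inputs, and the default `beta=0.1` are hypothetical, and the log-ratio penalty is one common way to discourage drift rather than the only one.

```python
import math


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-probability that the chosen answer beats the rejected one.

    Uses -log(sigmoid(margin)) = softplus(-margin), computed in a
    numerically stable form.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    """Learned reward minus beta times the policy/reference log-ratio.

    beta is a hypothetical default; the log-ratio is a per-sample
    estimate of the KL divergence from the reference model.
    """
    return reward - beta * (logp_policy - logp_ref)


# The loss is small when the reward model already ranks the pair correctly
# and large when it ranks the pair the wrong way round.
loss_correct = pairwise_preference_loss(2.0, 0.0)
loss_wrong = pairwise_preference_loss(0.0, 2.0)
```

A larger `beta` keeps the fine-tuned model closer to the reference, and a smaller one lets it chase the learned reward more aggressively.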

🧠Genius

A preference-based alignment technique: first fit a Bradley-Terry reward model to human comparison data, then optimize the LLM policy with proximal policy optimization (PPO) under a KL-divergence penalty against the SFT reference policy. It is increasingly supplanted by direct preference optimization (DPO), which optimizes the policy on the comparison data directly and needs no explicit reward model.
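
For readers who want the formulas behind this description, the standard formulation can be written as follows. The notation is an assumption introduced here, not taken from this page: $r_\phi$ is the reward model, $\pi_\theta$ the policy being trained, $\pi_{\mathrm{SFT}}$ the reference policy, $y_w$ and $y_l$ the preferred and rejected responses to prompt $x$, $\sigma$ the logistic function, and $\beta$ the penalty strength.

```latex
% Bradley-Terry preference model and the reward-model loss
P(y_w \succ y_l \mid x) = \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)

\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \bigl[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\bigr]

% KL-regularized policy objective (optimized with PPO)
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \bigl[r_\phi(x, y)\bigr]
  - \beta\, D_{\mathrm{KL}}\bigl(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\bigr)

% DPO loss: the same preference data, with no explicit reward model
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \Bigl[\log \sigma\Bigl(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{SFT}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{SFT}}(y_l \mid x)}
  \Bigr)\Bigr]
```

The DPO loss has the same logistic form as the reward-model loss, with the scaled log-ratio of policy to reference playing the role of the reward.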
