SeekBox

Reinforcement Learning from Human Feedback (RLHF)

Technical


Explained at 5 levels

👶 5 Year Old

Teaching the AI to be nicer and more helpful by having people tell it "good answer!" or "bad answer!" over and over.

📚 Middle Schooler

A way to make AI better by having humans rate its answers (thumbs up or thumbs down) so it learns what people actually want.

🎓 College Student

A training technique where a reward model trained on human preference data is used to fine-tune an LLM via reinforcement learning, aligning it with human values.

🧑 Adult

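An alignment method that trains a reward model from pairwise human preferences, then optimizes the language model policy with a reinforcement learning algorithm such as PPO to maximize the learned reward, while a KL penalty keeps the policy close to its starting model. DPO is a related alternative that optimizes on the preference pairs directly, without a separate reward model.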
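
To make the "reward model from pairwise human preferences" step concrete, here is a minimal sketch of the loss commonly used for it. It assumes the reward model has already produced one scalar score per response; the function name `pairwise_preference_loss` is illustrative and not taken from any library.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it ranks them
    the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of the
    # margin so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical scores: the first pair is ranked correctly, the second is not.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
```

In a real pipeline this loss would be averaged over many comparisons and minimized by gradient descent on the reward model's parameters.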

🧠 Genius

A preference-based alignment technique: a Bradley-Terry reward model is first trained on human comparison data, then the LLM policy is optimized via proximal policy optimization (PPO) with a KL-divergence penalty against the SFT reference. In many pipelines it is increasingly supplanted by direct preference optimization (DPO).
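
The two objectives named above can be sketched in a few lines. This is a simplified illustration, assuming sequence-level log-probabilities are already available for the policy and the SFT reference; the function names and the default `beta=0.1` are hypothetical choices, not values from the source.

```python
import math


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_reference: float, beta: float = 0.1) -> float:
    """Reward signal for the RL step, with the KL penalty folded in.

    log pi(y|x) - log pi_ref(y|x) is a single-sample estimate of the KL
    divergence from the reference; subtracting beta times it discourages
    the policy from drifting far from the SFT model.
    """
    return reward - beta * (logp_policy - logp_reference)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, with no explicit reward model.

    The implicit reward of a response is beta * (log pi - log pi_ref);
    the loss is -log sigmoid(implicit_reward_chosen - implicit_reward_rejected).
    """
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Stable evaluation of -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Hypothetical numbers: the policy has moved away from the reference.
    print(kl_penalized_reward(1.0, logp_policy=-4.0, logp_reference=-5.0))
    print(dpo_loss(-4.0, -6.0, ref_logp_chosen=-5.0, ref_logp_rejected=-5.0))
```

Both functions use the same reference log-probabilities, which reflects why DPO is often described as optimizing the same KL-regularized objective without the separate reward model and RL loop.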
