Teaching the AI to be nicer and more helpful by having people tell it "good answer!" or "bad answer!" over and over.
A way to make AI better by having humans rate its answers (thumbs up or thumbs down) so it learns what people actually want.
A training technique where a reward model trained on human preference data is used to fine-tune an LLM via reinforcement learning, aligning it with human values.
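An alignment method that trains a reward model from pairwise human preferences, then optimizes the language model policy with PPO to maximize the learned reward while a KL penalty keeps the policy close to its reference model. DPO is a related alternative that optimizes the policy directly on the preference pairs, without a separate reward model.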
A preference-based alignment technique: first training a Bradley-Terry reward model on human comparison data, then optimizing the LLM policy via proximal policy optimization with a KL-divergence penalty against the SFT reference; the approach is increasingly supplanted by direct preference optimization.
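The objectives named in the definitions above can be sketched for a single preference pair. This is a minimal illustration under stated assumptions, not a training implementation: the function names, the scalar inputs, and the default `beta=0.1` are all hypothetical, and a real pipeline would compute these quantities over batches of token log-probabilities in a tensor library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss: negative log-probability that the chosen
    response beats the rejected one under the Bradley-Terry model."""
    return -log_sigmoid(r_chosen - r_rejected)


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    """Reward used in the RL step: the reward-model score minus beta
    times a single-sample KL estimate (the log-ratio between the
    policy and the SFT reference). beta is an assumed hyperparameter."""
    return reward - beta * (logp_policy - logp_ref)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: the same logistic form as Bradley-Terry,
    applied to policy-vs-reference log-ratios instead of a learned
    reward, so no separate reward model is needed."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin)
```

In this sketch the reward-model loss shrinks as the chosen response is scored further above the rejected one, and the KL term lowers the effective reward whenever the policy assigns a response more probability than the reference does.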