Скачать или смотреть Direct Preference Optimization

Direct Preference Optimization

deep learningdeep learning for NLPnatural language processingllmslarge language modelsdpopreference datareinforcement learningRLHFpreference optimization

Скачать Direct Preference Optimization бесплатно в качестве 4к (2к / 1080p)

У нас вы можете скачать бесплатно Direct Preference Optimization или посмотреть видео с ютуба в максимальном доступном качестве.

Для скачивания выберите вариант из формы ниже:

Информация по загрузке:

Cкачать музыку Direct Preference Optimization бесплатно в формате MP3:

Если иконки загрузки не отобразились, ПОЖАЛУЙСТА, НАЖМИТЕ ЗДЕСЬ или обновите страницу
Если у вас возникли трудности с загрузкой, пожалуйста, свяжитесь с нами по контактам, указанным в нижней части страницы.
Спасибо за использование сервиса video2dn.com

Описание к видео Direct Preference Optimization

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper, a new parameterization of the reward model in RLHF is introduced that enables extraction of the corresponding optimal policy in closed form, allowing the standard RLHF problem to be solved with only a simple classification loss. The resulting algorithm, which is called Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

In this video, I will talk about the following: How does RLHF work? What is DPO? Your LM Is Secretly a Reward Model. How does DPO compare with RLHF?

For more details, please look at https://arxiv.org/pdf/2305.18290.pdf

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. "Direct preference optimization: Your language model is secretly a reward model." arXiv preprint arXiv:2305.18290 (2023).

Комментарии

Информация по комментариям в разработке