[Paper Review] Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS '24). Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. https://arxiv.org/abs/2305.18290 While large-scale unsupervised language models (LMs) learn broad ..
2024. 9. 26.