Untold Stories of Intellectual Property: background


Friday, January 31, 2025

A summary of the background technologies applied in DeepSeek-R1

Detailed Explanation of the Foundational Research Behind DeepSeek-R1

The success of DeepSeek-R1 lies in its ability to harness multiple foundational breakthroughs in AI research and strategically merge them to produce a high-performance, cost-efficient system. Below is a closer look at the key research areas and innovations that serve as its backbone:

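A Detailed Look at the Prior Research Applied in DeepSeek-R1

DeepSeek-R1 succeeds by strategically combining innovations from several core lines of AI research to achieve high performance at low cost. The sections below examine the main research and innovations that underpin it.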

1. Reinforcement Learning (RL) and Policy Optimization

Core Idea: Reinforcement Learning teaches AI through trial and error by assigning rewards to desired outcomes. The Group Relative Policy Optimization (GRPO) used in DeepSeek-R1 builds upon previous research in Proximal Policy Optimization (PPO), introduced by OpenAI.

Foundational Work:

- Schulman et al. (2017), “Proximal Policy Optimization Algorithms”
  PPO is a policy-gradient method that keeps updates stable by clipping the probability ratio between the new and old policies, so that a single update cannot move the policy too far. GRPO keeps this clipped objective but drops the separate value (critic) network: for each prompt it samples a group of responses and scores each one relative to the group's average reward, rather than against a learned per-sample baseline.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
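
To make the clipping idea concrete, here is a minimal sketch of PPO's clipped surrogate objective in plain Python. The function name, the argument names, and the default `clip_eps=0.2` are illustrative assumptions, not code from DeepSeek or OpenAI; a real implementation would work on tensors and backpropagate through the new log-probabilities.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions.

    logp_new / logp_old: log-probabilities of each sampled action under the
    new and old policies. advantages: one advantage estimate per action.
    All names and defaults are hypothetical, chosen for illustration.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; the clip only matters once the policy starts to drift.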

Impact on DeepSeek:
Because GRPO estimates its baseline from group scores instead of training a critic model, which is typically as large as the policy model itself, DeepSeek-R1 can run large-scale reinforcement learning with less memory and computation. This lets the model keep improving across diverse prompts without excessive training cost.
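
The group-relative scoring described above can be sketched in a few lines. This is an illustration under stated assumptions, not DeepSeek's code: the function name and the `eps` term are hypothetical, and the use of the population standard deviation is an assumption made here for simplicity.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of responses to the same prompt.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation. eps (an assumption) avoids division by zero
    when every response in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, without any learned value function. If every answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal.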

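1. Reinforcement Learning (RL) and Policy Optimization

Core Idea: Reinforcement learning trains an AI by trial and error, rewarding the outcomes we want. DeepSeek-R1 uses the Group Relative Policy Optimization (GRPO) algorithm, which extends the ideas of Proximal Policy Optimization (PPO), introduced by OpenAI.

Foundational Work:
- Schulman et al. (2017), “Proximal Policy Optimization Algorithms”
  PPO is an optimization method that keeps policy updates stable while balancing exploration and exploitation. GRPO extends it to the group level, so the policy is optimized from feedback on a whole group of samples rather than on individual samples.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Impact on DeepSeek:
GRPO lets DeepSeek handle diverse inputs efficiently and adjust its policy dynamically and globally rather than locally, which improves the AI's decision-making.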

2. Reward Modeling and Rejection Sampling

DeepSeek-R1 incorporates rejection sampling: the model generates several candidate responses, and only those that pass a quality check are kept as data for further Supervised Fine-Tuning (SFT). The idea of scoring model outputs and training on the result builds on the RLHF framework of Paul F. Christiano et al.

Foundational Work:
- Christiano et al. (2017), “Deep Reinforcement Learning from Human Preferences”
  This paper laid the groundwork for Reinforcement Learning from Human Feedback (RLHF): humans compare pairs of the agent's behaviors, a reward model is trained to predict those preferences, and the agent is then optimized against the learned reward.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Further developments: Rejection sampling adds a filtering step in which suboptimal responses are discarded before the remainder is added to the supervised learning dataset.
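
As a rough illustration of how a reward model learns from pairwise preferences, the sketch below computes a Bradley-Terry style loss for a single comparison. It is a simplification: it assumes one scalar reward per compared item, and the function and parameter names are hypothetical.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood that the human-preferred item wins.

    Models P(preferred beats rejected) = sigmoid(r_preferred - r_rejected),
    so the loss is -log(sigmoid(diff)), written in a numerically stable form.
    """
    diff = reward_preferred - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss is small when the reward model already ranks the preferred item higher and grows roughly linearly when it ranks the pair the wrong way round, which is what pushes the learned reward toward human judgments.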
Impact on DeepSeek:
DeepSeek-R1 adapts this process to rely on rule-based rewards and AI-generated feedback (in the spirit of RLAIF) in addition to human oversight. This shortens the feedback loop, and it reduces both the risk of reward hacking and the cost of repeatedly retraining a reward model.
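
The filtering step can be sketched as follows. This is a toy example under stated assumptions, not DeepSeek's pipeline: the candidate strings are hard-coded instead of sampled from a model, and `exact_answer_score` is a hypothetical rule-based checker invented for illustration.

```python
def rejection_sample(prompt, candidates, score, threshold):
    """Keep only the (prompt, response) pairs whose score clears the threshold."""
    return [(prompt, c) for c in candidates if score(prompt, c) >= threshold]

def exact_answer_score(expected):
    """Build a toy rule-based scorer: 1.0 if the response ends with `expected`."""
    def score(prompt, response):
        return 1.0 if response.strip().endswith(expected) else 0.0
    return score

# In practice the candidates would be sampled from the model being trained.
candidates = ["2 + 2 = 4", "2 + 2 = 5", "The answer is 4"]
kept = rejection_sample(
    "What is 2 + 2?", candidates, exact_answer_score("4"), threshold=1.0
)
```

Only the two candidates ending in the correct answer survive, and those surviving pairs are what would be fed into supervised fine-tuning. A rule-based check like this one needs no learned reward model, which is one reason it is harder to exploit.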

2. Reward Modeling and Rejection Sampling

DeepSeek-R1 uses rejection sampling so that only high-quality responses are used for training. The idea grew out of OpenAI's RLHF (Reinforcement Learning from Human Feedback) framework.

Foundational Work:
- Christiano et al. (2017), “Deep Reinforcement Learning from Human Preferences”
  This paper introduced the RLHF methodology, in which people give feedback on AI-generated outputs and the AI is fine-tuned on that feedback.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Extension with rejection sampling: Rejection sampling adds a filtering step to this process, screening out low-quality responses so that only high-quality ones are used for training.

Impact on DeepSeek:
DeepSeek-R1 incorporates AI feedback in place of human feedback, which gives a faster and more efficient feedback loop and reduces reward hacking and resource consumption. Reinforcement Learning from AI Feedback (RLAIF), which trains a reward model (RM) on preferences generated by a large language model (LLM), was introduced in Bai et al. (2022).
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
