Untold Stories of Intellectual Property: DeepSeek R1의 두려움에서 벗어나라 (Break Free from the Fear of DeepSeek R1) !

Thursday, January 30, 2025

DeepSeek R1의 두려움에서 벗어나라 (Break Free from the Fear of DeepSeek R1) !

서론(Background)

최근 DeepSeek(딥시크) 관련 뉴스를 보면, 혁신 기술에 대한 논의가 정치적 문제로 비화하거나 기존 시장 경쟁이 무너질 것처럼 과도하게 부정적으로 다뤄지고 있다는 점에서 의구심이 든다. 몇몇 보도가 시장과 소비자의 불안 심리를 부추겨, 실제 상황을 더 악화시키는 듯한 인상도 준다.

Recent news coverage about DeepSeek seems to excessively politicize discussions on innovative technology or frame it as a threat to existing market competition. This raises concerns that some reports may be amplifying market and consumer anxiety, potentially worsening the actual situation.

여러 기사에서는 엔비디아 주가 급락의 원인을 중국 스타트업 딥시크로 단정 짓고, 심지어 딥시크가 키보드 입력 패턴을 수집해 중국으로 대규모 정보를 유출한다고까지 우려를 표한다. 하지만 그러한 주장이 객관적 증거에 기반한 것인지, 아니면 시장과 정책 입안자들에게 막연한 불안감을 심어주는 보도인지는 냉정하게 살펴볼 필요가 있다.

Several articles have attributed the recent drop in NVIDIA’s stock price entirely to the Chinese startup DeepSeek. Some even allege that DeepSeek collects keyboard input patterns and transmits vast amounts of data to China. However, we should ask whether these claims rest on solid evidence or merely stoke vague fears among markets and policymakers.

특히 딥시크 R1 모델 등장 후 “엔비디아 GPU가 필요 없어질 것”이라는 주장도 나왔는데, 이는 “딥시크 R1이 GPU 없이도 고성능을 낼 수 있으니, 앞으로 AI 연구에서 GPU가 불필요해진다”는 전제에서 출발한다. 그러나 실제로는 고성능 하드웨어가 있을수록 더 큰 모델도 더 빠르게 학습할 수 있기 때문에 중국의 AI 모델이 제한된 GPU 자원으로도 주목할 만한 성과를 냈다면, 더 강력한 GPU를 활용하면 훨씬 뛰어난 결과를 낼 수 있다. 즉 AI 기술이 진화할수록 고성능 칩에 대한 수요가 증가할 공산이 크다는 것이다.

After the debut of the DeepSeek R1 model, some have claimed that "NVIDIA GPUs will no longer be needed." This argument is based on the premise that since DeepSeek R1 can deliver high performance without GPUs, AI research will eventually no longer require them. However, in reality, the more powerful the hardware, the faster and more efficiently larger models can be trained. If a Chinese AI model has achieved notable results despite limited GPU resources, then utilizing more powerful GPUs could yield even greater outcomes. In other words, as AI technology advances, the demand for high-performance chips is more likely to increase.

또한 OpenAI 같은 미국 기업들은 내부 구조나 모델 파라미터를 공개하지 않지만, 딥시크는 오픈 소스 형태로 공개해 많은 개발자·스타트업이 쉽게 접근할 수 있다. 딥시크가 활용한 비지도 학습의 강화 학습 기법은 대규모의 학습데이터가 필요없는 모델로 다른 기업도 사실상 그대로 적용 가능하므로, 이제는 다른 AI 모델들이 딥시크 방식을 이어받아 학습 효율을 높이기가 쉬워진다. 같은 방식이라면 더 많은 고성능 하드웨어를 투입할수록 더 빠르고 정확한 학습과 추론이 가능해져, 고성능 칩을 동원한 AI들이 딥시크보다 더 뛰어난 성능을 낼 잠재력이 크다. 이로 인해 고성능 반도체 수요 역시 상당히 늘어날 가능성이 있다.

Moreover, while U.S. companies like OpenAI do not disclose their internal structures or model parameters, DeepSeek has released its model openly, giving many developers and startups easy access. The reinforcement learning technique DeepSeek used does not require large-scale supervised training data, and other companies can apply it more or less as is, so other AI models can now adopt DeepSeek’s method to raise their own training efficiency. With the same method, the more high-performance hardware is employed, the faster and more accurate training and inference become, which means AI models backed by powerful chips have the potential to surpass DeepSeek. Consequently, demand for high-performance semiconductors could increase significantly.

DeepSeek GitHub page : https://github.com/deepseek-ai/DeepSeek-R1/tree/main

한편, 딥시크 사례는 기존 미국 대형 기업들이 막대한 비용을 들여 개발한 모델에 비해 훨씬 적은 투자로도 유의미한 AI를 구현할 수 있음을 보여준다. 즉, “누구나 할 수 있다”는 메시지를 던지는 셈이다. 이는 특히 AI 산업에서 뒤처져 있던 한국 같은 IT 기업들에게도 기회가 될 수 있다. 그동안은 막대한 투자비에 가로막혀 AI 시장에 진입하기 어려웠지만, 딥시크가 촉발한 저비용 구조의 AI 모델이 확산된다면, 새롭게 AI 생태계에 적극 참여할 수 있는 길이 열릴 수 있기 때문이다.

Also noteworthy is the fact that DeepSeek demonstrates how one can achieve a meaningful AI system on a far more modest budget than what major American firms have historically poured into proprietary models. This effectively underscores the message that “anyone can do it.” For nations like South Korea, where the AI industry has lagged behind that of the United States or China, the historically steep AI development costs could now become less of a barrier. If DeepSeek’s low-cost model spawns broader adoption, these emerging markets might find themselves in a more dynamic AI ecosystem.

다만, 딥시크 측이 학습 데이터는 공개하지 않고 모델과 가중치(Weights)만 공개하고 있기에, 오픈 소스 이니셔티브(OSI) 정의상의 ‘완전한 오픈 소스’와는 다소 차이가 있다. 보통 이 같은 유형을 ‘오픈 웨이트 모델(Open Weight Model)’이라 부르지만, 모델 구조와 가중치·파라미터만으로도 재현이 가능한 만큼, ‘오픈 소스 모델’이라 부르는 데도 큰 문제는 없어 보인다.

However, DeepSeek only released the model code and weights, not the actual training data. So, strictly speaking, it doesn’t meet the Open Source Initiative (OSI) definition of open source. Typically, such releases are referred to as “Open Weight Models,” but since sharing the model structure and parameters is enough for replication, it’s arguably reasonable to label DeepSeek as an “open source model.”

흥미로운 점은 딥시크가 정부나 관 주도가 아닌 항저우 소재의 민간 스타트업이라는 것이다. 중국 총리가 뒤늦게 이를 알고 항저우로 직접 날아가 “무슨 일이 벌어지고 있는지” 확인했다는 이야기는, 정부가 미리 이 성과를 예측하지 못했음을 방증한다. 중국 스타트업이 만들었다고 해서 자동으로 정부 후원 프로젝트라고 볼 수는 없으며, 오히려 정부가 나중에야 성공 소식을 접했다는 일화가 이를 뒷받침한다.

An interesting aspect of DeepSeek is that it is a private startup based in Hangzhou, rather than a government-led initiative. Reports suggest that the Chinese Premier was unaware of its achievements until later and rushed to Hangzhou to personally assess “what was happening,” indicating that the government had not foreseen this breakthrough.

The fact that DeepSeek was developed by a Chinese startup does not automatically imply state sponsorship; on the contrary, the government’s delayed recognition of its success reinforces the notion that it was not a premeditated state-backed project.

한편, 미국의 일부 대형 벤처캐피털(VC)들이 막대한 LP(출자자) 자금을 독점적(proprietary) AI 모델에 투자해왔고, 미국 정부가 중국 등 경쟁 시장을 배제하도록 로비하고 있다는 지적도 있다. 그러나 트럼프 등 여러 인사를 상대로 제재를 요구하는 것만으로 세계적 혁신 흐름을 막기는 어렵다는 평가가 많다. 이미 AI가 더 이상 미국 대형 업체의 전유물이 아닐 수 있음을 중국 등 다양한 국가들이 보여주고 있으며, 그들은 오픈 소스와 저비용 전략을 통해 빠르게 추격 중이다.

Meanwhile, some argue that a handful of major U.S. venture capital firms, which have funneled substantial LP (limited partner) funds into expensive proprietary AI models, are lobbying the U.S. government to exclude China and other competing markets. However, many believe that simply pressuring Trump for sanctions is unlikely to halt the global tide of innovation. Countries like China have already demonstrated that AI is no longer the exclusive domain of large American corporations, and they are rapidly closing the gap by leveraging open-source models and cost-effective strategies.

또한 ‘중국산이라 믿을 수 없다’는 서구권 태도가 얼마나 실효성이 있을지는 의문이다. 막연한 반중 정서를 부추기는 식의 접근은 시장과 혁신 전반에 해를 끼칠 수도 있다는 우려가 제기된다.

Likewise, it is doubtful how effective the Western stance of distrusting technology solely because it originates from China can be. There are concerns that an approach which fosters vague anti-China sentiment could harm the market and innovation as a whole.

마지막으로, 개인정보 유출 우려와 관련해서는, 일반적으로 사용자가 텍스트를 입력한 뒤 ‘전송’ 버튼을 누르기 전까지는 서버에 전송되지 않는다. 하지만 자바스크립트 등의 스크립트로, 실시간 키 입력(간격·순서)을 추적해 ‘키보드 입력 패턴’을 수집할 기술적 가능성이 전혀 없는 것은 아니다. 그럼에도 지금까지 딥시크 AI가 실제로 그런 입력 패턴을 몰래 수집하고 있다는 구체적 보고나 증거는 확인되지 않았다. 이는 중국 안보 리스크를 과장해 막연히 불안감을 조성하는 프레임일 가능성이 높고, 충분한 객관적 근거 없이 성급한 결론을 내리는 것은 경계해야 한다.

Finally, regarding data privacy concerns—ordinarily, text typed into a prompt doesn’t get uploaded to a server until the user actually hits “Send.” In other words, under typical conditions, keystroke patterns aren’t transmitted before submission.

It’s technically possible, however, to use JavaScript or similar scripts to capture input in real-time (including timing intervals and key order) even before the “Send” button is pressed. So far, though, there’s no solid evidence or reports that DeepSeek AI actually collects such data.

Such speculation is likely a framing that exaggerates China-related security risks and stokes vague fears without concrete proof. It is crucial to avoid hasty conclusions without sufficient objective evidence.

논문에서 밝힌 DeepSeek-R1의 특징

Key Features of DeepSeek-R1 Highlighted in the Paper

2025년 1월 20일에 발표된 DeepSeek-R1 관련 논문을 보면, DeepSeek-R1이 다단계 강화학습을 기반으로 하는 하이브리드 모델임을 확인할 수 있다.

According to the DeepSeek-R1 paper published on January 20, 2025, DeepSeek-R1 is a hybrid model built on multi-stage reinforcement learning.

Reference: Guo, D., et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

논문에 따르면, DeepSeek R1은 다음과 같은 특징으로 요약할 수 있다.

According to the paper, DeepSeek R1 can be summarized by the following key features:

1. DeepSeek-R1-Zero:

대규모 강화학습(large-scale Reinforcement Learning, RL)을 통해 사전 감독 학습(Supervised Fine-Tuning, SFT) 없이 학습된 모델.

이 모델은 추론 능력이 뛰어나고 흥미로운 행동들을 자연스럽게 나타냄으로써 놀라운 성능을 보임.

하지만 문제점으로는 문장 가독성(readability)이 낮고, 언어 혼합(language mixing) 문제가 있었음.

A model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT).

This model demonstrates outstanding performance by naturally exhibiting powerful reasoning capabilities and interesting behaviors.

However, it faced issues such as poor sentence readability and language mixing problems.

2. DeepSeek-R1:

이러한 문제를 해결하기 위해 도입된 후속 모델로, 다단계 학습(multi-stage training)과 콜드 스타트 데이터(cold-start data)를 강화 학습(RL) 이전에 도입.

이를 통해 추론 성능이 개선되었으며, 결과적으로 OpenAI-o1-1217과 유사한 수준의 성능을 달성.

A follow-up model was introduced to address these issues by incorporating multi-stage training and cold-start data before reinforcement learning (RL).

This approach improves reasoning performance and, as a result, achieves performance comparable to OpenAI-o1-1217.

3. 공개와 모델 확장 (Open Release and Model Expansion):

연구 커뮤니티를 지원하기 위해 DeepSeek-R1-Zero, DeepSeek-R1, 그리고 DeepSeek-R1에서 증류(distillation)된 6개의 밀집(dense) 모델을 오픈소스로 제공.

증류된 모델들은 각각 1.5B, 7B, 8B, 14B, 32B, 70B 파라미터 크기로 구성되며, Qwen 및 Llama 아키텍처 기반으로 설계됨.

To support the research community, DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 are provided as open-source.

The distilled models come in configurations of 1.5B, 7B, 8B, 14B, 32B, and 70B parameters and are designed based on the Qwen and Llama architectures.

DeepSeek-R1에 적용된 주요 기법

Major Techniques Applied in DeepSeek-R1

1. 기본 모델 설계

강화 학습 기반:

기존의 빅데이터에 의존하는 지도 학습 대신, 보상 신호를 활용하여 최적의 행동을 스스로 찾아가는 강화 학습(RL) 방식 적용.
Group Relative Policy Optimization (GRPO) 알고리즘으로 정책을 최적화하며, 별도의 크리틱(critic) 모델 없이 같은 질문에 대해 샘플링한 응답 그룹 내의 상대적 보상으로 각 응답을 평가.

1. Base Model Design

Reinforcement Learning (RL)-Based Approach:
- Instead of relying on big-data-driven supervised learning, DeepSeek-R1 applies reinforcement learning to autonomously discover optimal actions through reward signals.
- The Group Relative Policy Optimization (GRPO) algorithm is used to optimize the policy; it scores each sampled response relative to the other responses generated for the same prompt, which removes the need for a separate critic model.
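
The group-relative idea behind GRPO can be illustrated with a minimal sketch. It assumes that the advantage of each sampled response is its reward normalized by the mean and standard deviation of the rewards in its own group; the function name, the `eps` constant, and the sample rewards are hypothetical, and this is not DeepSeek's actual training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rewards in the same group.

    The group mean acts as the baseline, so no separate critic model
    is needed to estimate how good a response is.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps is an assumed safeguard against division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and are reinforced, while below-average responses are discouraged.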

2. 콜드 스타트 문제 해결

문제: 초기 데이터가 부족하거나 불안정한 학습이 발생할 수 있음.
해결:

초기 CoT(Chain-of-Thought) 데이터를 소량 수집하여 지도 미세 조정(SFT)을 선행.
지도 학습 후 강화 학습을 적용하여 초기 학습의 안정성과 효율성을 높임.
초기 데이터 수집에는 Few-shot prompting, 직접 프롬프트, DeepSeek-R1-Zero 출력 활용, 인간 검토 및 후처리를 포함.

2. Solving the Cold Start Problem

Problem:
- Insufficient initial data or unstable training can result in inefficient and unreliable learning during the early stages.
Solution:
- Collect a small amount of Chain-of-Thought (CoT) data and perform supervised fine-tuning (SFT) before initiating RL.
- This improves the stability and efficiency of the initial training process.
- Data collection methods include: Few-shot prompting, direct prompting, leveraging DeepSeek-R1-Zero outputs, and human review and post-processing.

3. 두 단계의 강화 학습 (RL Stages)

첫 번째 단계:
- 추론 패턴 개선에 집중하여 정교한 CoT 기반 추론과 중간 사고 단계 확장을 목표로 학습.
두 번째 단계:
- 모델의 응답이 인간의 기대치와 정합(Alignment)하도록 조정.
- 인간 피드백을 기반으로 윤리적 기준과 사용자 선호도에 부합하는 응답을 생성.

결과: 문제 해결 과정과 결과가 정확성뿐 아니라 사용자와 사회적 요구에 부합할 수 있도록 설계.

3. Two Reinforcement Learning (RL) Stages

Stage 1: Improving Reasoning Patterns
- Focuses on refining Chain-of-Thought (CoT)-based reasoning and expanding intermediate reasoning steps to handle complex problems.
Stage 2: Alignment with Human Expectations
- The model is adjusted to ensure that its responses align with human feedback, ethical standards, and user preferences.
- As a result, both the problem-solving process and the final outputs meet not only accuracy requirements but also user and societal expectations.

4. 두 단계의 지도 미세 조정 학습 (SFT Stages)

첫 번째 SFT 단계:
- 모델의 기본적인 추론 및 비추론 능력을 형성.
- 언어적 구조, 일반적인 문제 해결 방법과 같은 기초적인 능력 학습.
두 번째 SFT 단계:

강화 학습 이후 모델의 추론 패턴을 정교화하고 구체적 문제 해결 능력을 최적화.
수학적 문제, 논리적 추론 등 고난도 작업의 성능 강화.

4. Two Stages of Supervised Fine-Tuning (SFT)

Stage 1:
- Establishes the model’s fundamental reasoning and non-reasoning capabilities, including linguistic structure understanding and basic problem-solving skills.
Stage 2:
- Conducted after reinforcement learning, this stage refines and optimizes the model’s reasoning patterns and enhances its performance on high-level tasks such as mathematical reasoning and logical analysis.

5. 거부 샘플링과 데이터 결합을 통한 지도 미세 조정

Rejection Sampling:
- 이전 RL 체크포인트에서 부정확하거나 혼란스러운 데이터를 필터링하고, 정확한 응답만을 선택하여 추론 데이터를 구성.
- 총 60만 개의 추론 데이터와 20만 개의 비추론 데이터를 결합하여 80만 개의 통합 데이터셋을 구축.
최종 미세 조정:

통합 데이터셋으로 모델을 두 번의 에폭(epoch) 동안 학습하여 논리적 사고와 일반 언어 처리 능력을 동시에 강화.

5. Rejection Sampling and Combined Data for SFT

Rejection Sampling:
- Filters out inaccurate or confusing data from previous RL checkpoints and selects only correct responses to construct the reasoning dataset.
- Combines 600k reasoning data samples and 200k non-reasoning data samples to build a comprehensive dataset of 800k samples.
Final Fine-Tuning:
- The integrated dataset is used to train the model over two epochs, enhancing both its logical reasoning and general language processing capabilities.
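
As a rough illustration of the rejection sampling step, the sketch below samples several candidate responses and keeps only those that pass a correctness check. The generator, the checker, and the canned responses are hypothetical stand-ins; in the real pipeline the candidates come from an RL checkpoint and the filtering is more elaborate.

```python
from itertools import cycle


def rejection_sample(prompt, reference, generate, is_correct, num_samples=4):
    """Sample several candidates and keep only those that pass the check."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    return [
        {"prompt": prompt, "response": c}
        for c in candidates
        if is_correct(c, reference)
    ]


# Hypothetical stand-in for a model checkpoint: cycles through canned outputs
_canned = cycle([
    "The answer is 4",
    "The answer is 5",
    "4 is the answer",
    "The answer is 4",
])


def toy_generate(prompt):
    return next(_canned)


def toy_is_correct(response, reference):
    # Assumed rule: the response must end with the reference answer
    return response.strip().endswith(reference)


sft_examples = rejection_sample(
    "What is 2 + 2?", "4", toy_generate, toy_is_correct, num_samples=8
)
```

The surviving examples would then be merged with non-reasoning data to form the supervised fine-tuning set.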

6. 모든 시나리오를 위한 강화 학습

목적:
- 추론 능력 유지와 함께 유용성(Helpfulness) 및 무해성(Harmlessness)을 보장하는 응답 생성.
추론 데이터:
- 수학, 코딩, 논리적 추론과 같은 명확한 도메인에서는 규칙 기반 보상을 사용.
일반 데이터:
- 정답이 명확하지 않은 일반적 질문에서는 보상 모델을 도입해 사용자 선호를 반영.
보상 설계:

유용성: 최종 요약의 유용성 평가로 응답의 적합성 보장.
무해성: 전체 응답(추론 과정 + 요약)을 검토해 잠재적 위험 요소와 편향을 제거.

6. Reinforcement Learning for All Scenarios

Objective:
- Generate responses that maintain strong reasoning capabilities while ensuring helpfulness and harmlessness.
Reasoning Data:
- For domains like math, coding, and logical reasoning, rule-based rewards are applied.
General Data:
- For general questions without clear-cut answers, reward models are used to reflect user preferences.
Reward Design:
- Helpfulness: Evaluates the utility and relevance of the final summary to ensure the response suits the user.
- Harmlessness: Evaluates the entire response (reasoning process + summary) to detect and eliminate potential biases, risks, or harmful content.
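
A rule-based reward for reasoning data might look like the following sketch. It assumes that the response wraps its reasoning in `<think>` tags followed by the final answer, and that the accuracy and format rewards are simply added with equal weight; the function names and the weighting are illustrative assumptions rather than the paper's exact implementation.

```python
import re

# Assumed output layout: <think>reasoning</think> followed by the final answer
THINK_PATTERN = re.compile(r"^<think>.*?</think>\s*\S", re.DOTALL)


def format_reward(response):
    """1.0 if the response follows the assumed <think>...</think> layout."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response, reference):
    """1.0 if the text after the reasoning block equals the reference answer."""
    final_answer = response.split("</think>")[-1].strip()
    return 1.0 if final_answer == reference.strip() else 0.0


def rule_based_reward(response, reference):
    # Equal weighting of the two rewards is an assumption for illustration
    return accuracy_reward(response, reference) + format_reward(response)


score = rule_based_reward("<think>2 + 2 = 4</think> 4", "4")
```

Because the check is a deterministic rule, no learned reward model is needed for domains with verifiable answers; a reward model is only needed for general questions without a clear-cut answer.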

7. 증류(Distillation)를 통한 소형 모델의 추론 능력 강화

과정:
- 대형 모델의 고급 추론 패턴을 소형 모델에 전이하여 리소스 효율적인 소형 모델 구축.
- Qwen 및 Llama 모델을 대상으로 미세 조정(SFT)을 적용하고 800k의 데이터셋 활용.
특징:

강화 학습 없이 지도 미세 조정만으로도 고급 성능을 계승할 수 있음을 입증.
RL 추가 시 더 높은 성능 가능성을 열어두어 후속 연구의 기반 마련.

7. Distillation for Enhancing Reasoning Capabilities of Smaller Models

Process:
- Transfers advanced reasoning patterns from large models to smaller models, resulting in resource-efficient models with high performance.
- Fine-tuning (SFT) is applied to Qwen and Llama models using an 800k dataset.
Key Features:
- Demonstrates that supervised fine-tuning alone is sufficient to transfer advanced reasoning capabilities without the need for reinforcement learning.
- This leaves the option open for incorporating RL in future research to achieve even higher performance.

DeepSeek-R1은 콜드 스타트 해결, 강화 학습과 지도 학습의 결합, 증류 기법을 통해 정확성, 유연성, 그리고 실시간 응용 가능성을 갖춘 AI 모델로 발전했다. 향후 RL과 증류의 통합 연구는 더욱 강력하고 효율적인 소형 모델을 제공할 수 있는 가능성을 열어준다.

DeepSeek-R1 integrates solutions for the cold start problem, combines reinforcement learning with supervised fine-tuning, and employs distillation to create an AI model with high accuracy, flexibility, and real-time applicability. The ongoing integration of reinforcement learning and distillation is expected to further develop more powerful and efficient smaller models in the future.

OpenAI의 모델과 경쟁할 만한 성능과 문제점

Performance Comparable to OpenAI’s Models and Issues with DeepSeek

이러한 최적화 전략 덕분에 DeepSeek-R1은 비용 효율성 면에서 OpenAI의 주력 모델과도 견줄 수 있는 수준에 도달했다. 이는 단순히 대형 모델을 모방하는 것이 아니라, 기술적 혁신과 효율성의 결합을 통해 더 적은 비용으로 더 나은 결과를 낼 수 있다는 가능성을 보여주었다.


Thanks to its optimized strategies, DeepSeek-R1 has reached a level of performance that can compete with OpenAI’s flagship models, while maintaining superior cost efficiency.

This achievement is not about simply imitating large-scale models; rather, it highlights the synergistic combination of technological innovation and efficiency. It demonstrates the potential to achieve superior results with fewer resources, paving the way for more accessible and sustainable AI development.

<source> Lilmod. "Can Deepseek R1 become the next best reasoning model for AI Agents ?". medium https://medium.com/@lilmod/can-deepseek-r1-become-the-next-best-reasoning-model-for-ai-agents-3647fb6b6274

할루시네이션(hallucination) issues under wireheading

Hallucination issues under wireheading

필자도 실제로 DeepSeek R1을 테스트해보았고, ChatGPTo1에 필적하는 성능을 여러 차례 확인했다. 참고로 DeepSeek에서 R1 모델을 사용하려면 프롬프트 대화창 하단에 있는 ‘DeepThink (R1)’ 옵션을 활성화해야 한다.

다만, 지도학습 모델에서 흔히 보고되는 할루시네이션(hallucination) 문제가 강화학습 모델에서도 여전히 나타나는 듯하다. 한 번은 브뤼셀 1bis 규정(Brussels I-bis Regulation) (EU Regulation no. 1215/2012) 제26조(응소관할)에 대한 신속한 리서치를 요구했는데, ChatGPTo1은 정확한 답을 제시한 반면, DeepSeek R1은 제26조가 아닌 제29조(동일 소송의 중복 제기) 및 제30조 관련 내용을 찾아내면서 틀린 답변을 고집했다. 이처럼 생성된 결과를 한 번 더 검증하는 주의가 필요하다.

In my own testing, DeepSeek R1 repeatedly delivered performance comparable to ChatGPT o1. To use the R1 model, users must activate the “DeepThink (R1)” option at the bottom of the prompt window.

However, the hallucination problem commonly reported in supervised learning models appears to persist in reinforcement learning models as well.

For example, when I requested a quick legal research summary on Article 26 (Jurisdiction by Appearance) of the Brussels I-bis Regulation (EU Regulation no. 1215/2012), ChatGPT o1 gave the correct answer, whereas DeepSeek R1 returned material on Article 29 (Lis Pendens) and Article 30 instead.

It repeatedly insisted on the incorrect answer, which highlights the importance of cross-checking AI-generated responses.

DeepSeek-R1에 적용된 거인들의 선행 연구들

Preceding Research by Giants Applied in DeepSeek-R1

기술적 기반과 미래를 향한 도약

DeepSeek-R1에 적용된 모든 기법은 여러 선행 연구자들의 지식을 바탕으로 하고 있다. DeepSeek 팀은 이러한 거인들의 여러 선행 연구 결과를 영리하고 독창적인 방식으로 결합하여 놀라운 성과를 이루었다. 이 모델은 단순히 현재에 안주하지 않고, 미래의 AI 발전을 향한 중요한 도약을 시사한다.

Technological Foundation and a Leap Toward the Future

Every technique applied to DeepSeek-R1 stands on the shoulders of prior research from many pioneering scholars. The DeepSeek team, however, has taken this foundation and creatively combined it in a way that delivers exceptional results. Their work not only reflects the current state of AI but also signals a significant leap toward the future of AI development.

Deepseek가 활용한 거인들의 선행 연구와 기법이 무엇이 있으며, 이러한 기법의 DeepSeek로의 접목은 맨 아래 참고글과 아래 링크에서 제공하기로 한다.

The foundational research and techniques of the giants that DeepSeek drew on, and how they were integrated into DeepSeek, are covered in the reference note at the end of this post and in the article linked below.

Link : The integration of conventional background technologies (pioneering studies) with DeepSeek R1.

결론 (Final Thoughts)

DeepSeek R1은 딥시크-V3 베이스 모델을 기반으로 한 새로운 오픈 웨이트 LLM으로, 강화학습(RL), AI 피드백을 활용한 강화학습(RLAIF), 전문가 혼합 모델(MoE), 그리고 지식 증류(Knowledge Distillation) 기법을 결합하여 높은 효율성과 정확도를 유지하면서도 연산 비용을 절감하는 인상적인 AI 모델로 평가받고 있다.

순수 RL(pure RL) 방식으로 성능을 향상할 수 있음을 입증한 중간 모델인 딥시크-R1-제로(DeepSeek-R1-Zero) 역시 주목할 만하다.

그러나 딥시크-R1은 새로운 AI 패러다임을 제시하는 모델은 아니다. 기존의 LLM 학습 아키텍처를 기반으로 기술적·구조적 최적화를 통해 학습과 추론의 효율성을 높인 사례로 보는 것이 적절하다.

성능 면에서도 최신 모델들과 비교할 때 눈에 띄는 혁신보다는 기존 수준을 유지하는 데 초점이 맞춰져 있으며, 새로운 성능 기준을 정립하기보다는 효율적인 학습 방식을 도입해 기존 성능을 최적화한 모델이라고 할 수 있다.

또한, 딥시크의 접근 방식이 하드웨어와 데이터 투입을 통한 모델 확장의 중요성을 부정하는 것은 아니다. 오히려 보다 효율적인 모델 확장이 AI 성능을 극대화하는 데 있어 핵심 요소임을 다시금 확인시켜주는 사례라고 볼 수 있다.

딥시크-R1이 패러다임 전환할 정도의 혁신적인 모델은 아니라고 하지만, 오픈소스로 공개된 점과 빠르게 발전하는 기술적 성과를 고려할 때 AI 업계에서 주목할 만한 모델임은 분명하다.

우리가 한가지 더 주목해야 할 것은 DeepSeek R1이 현대 AI 혁신이 기존 기술(거인의 어깨) 위에서 어떻게 발전하는지를 잘 보여주는 모범 사례라는 점이다.

기존 기술을 새롭게 결합하여 독창적인 혁신을 이끌어낸다는 것을 입증하고 있으며, 발명은 종종 알려진 기술을 얼마나 창의적이고 효과적으로 연결하는가에 따라 탄생한다는 점을 보여준다. 그러나 거인의 어깨 위에서만 멀리 바라 볼 수 있다는 진리 역시 변하지 않는다.

따라서 만약 딥시크가 활용한 주요 기술들이 기존 거인들의 특허로 잘 보호되었다면, 이는 AI 시대의 새로운 특허 전쟁으로 이어질 수 있다.

다행인지 불행인지 모르겠으나 미국은 그동안 빅테크 기업 보호 정책에 따라 소프트웨어 알고리즘에 대한 특허 보호에 소극적이었다. 이 부분은 반드시 정비할 필요가 있다.

이러한 점에서 특허 보호 정책을 유형적 변화(재료 및 기구의 물리적 변화)에만 국한하지 않고, 전자기적 데이터 신호처리나 무형적 변화에 기반한 AI 및 소프트웨어 알고리즘에도 특허 인정과 보호를 확대해야 한다고 제안한다.

DeepSeek R1 is a new open-weight LLM based on the DeepSeek-V3 base model. It combines Reinforcement Learning (RL), Reinforcement Learning with AI Feedback (RLAIF), Mixture of Experts (MoE), and Knowledge Distillation techniques to enhance efficiency and accuracy while reducing computational costs. As a result, it is regarded as an impressive AI model.

Another noteworthy development is DeepSeek-R1-Zero, an intermediate model that has demonstrated the feasibility of improving performance using a pure RL approach.

However, DeepSeek R1 does not introduce a new AI paradigm. Instead, it builds on existing LLM training architectures, optimizing technical and structural aspects to improve training and inference efficiency.

In terms of performance, DeepSeek R1 focuses on matching the current state of the art rather than delivering a groundbreaking leap. It does not set a new performance benchmark; it reaches existing performance levels through more efficient training methods.

That being said, DeepSeek’s approach does not dismiss the importance of hardware and data investment in model scaling. On the contrary, it underscores how efficient model scaling remains a critical factor in maximizing AI performance.

While DeepSeek R1 enhances efficiency, it does not represent a paradigm shift. However, given its open-source nature and the rapid pace of its technical advancements, it is undeniably a model worth paying attention to in the AI industry.

One key takeaway is that DeepSeek R1 serves as a prime example of how modern AI innovation builds upon existing technologies—standing on the shoulders of giants.

It proves that true innovation often comes from creatively and effectively integrating well-established technologies.

Many great inventions have emerged not from entirely new discoveries but from recombining known technologies in novel and impactful ways.

Still, the truth remains that one can see far only from the shoulders of giants.

Therefore, if the core technologies used in DeepSeek R1 had been well protected by the patents of those giants, this could have led to a new patent war in the AI era.

For better or worse, the U.S. has historically been reluctant to grant strong patent protection for software algorithms, largely due to policies favoring big tech companies. This issue must be addressed.

In this regard, patent protection policies should not be limited to tangible innovations (such as material and mechanical changes) but should also be expanded to cover AI and software algorithms based on electromagnetic data processing and intangible transformations.

※ 종래 배경기술(선행연구)와 DeepSeek R1의 접목

※ The integration of conventional background technologies (pioneering studies) with DeepSeek R1.

ChatGPT는 OpenAI에서 개발한 비공개 모델이다. 초창기에는 레이블이 달린 기존 데이터를 학습하는 지도학습 기반 모델로 출발했으나, 현재는 지도학습(Supervised Learning)과 사람 피드백을 적용한 강화학습(RLHF, Reinforcement Learning from Human Feedback)을 결합한 하이브리드 구조로 진화했다. 이를 통해 더욱 자연스럽고 인간 친화적인 답변을 생성하고, 추론 성능을 높이고 있다.

한편, DeepSeek R1은 중국 항저우에 소재한 량원펑(梁文锋, Liang Wenfeng)이 이끄는 스타트업 딥시크(DeepSeek)가 개발한 오픈소스 공개 모델이다. 초기에는 스스로 학습하는 강화학습(RL) 중심의 비지도 학습 모델로 알려졌으나, 최근에는 다른 AI 모델이 제공하는 피드백(보상)을 활용한 강화학습(RLAIF, Reinforcement Learning from AI Feedback)을 결합해 성능을 한층 끌어올린 하이브리드 모델을 사용하고 있다.

ChatGPT, developed by OpenAI, is a proprietary (closed-source) model.

Initially, it was based on supervised learning techniques using labeled datasets. In its current form, however, ChatGPT employs a hybrid approach that combines supervised learning with reinforcement learning from human feedback (RLHF). This refines its reasoning and overall response quality, producing more natural and human-friendly responses.

DeepSeek R1, on the other hand, is an open-source model developed by DeepSeek, a Hangzhou-based startup led by Liang Wenfeng (梁文锋).

Early on, DeepSeek R1 functioned primarily as an unsupervised, reinforcement learning-driven model, teaching itself without explicit supervision. Currently, to further boost its performance, it takes advantage of reinforcement learning from AI feedback (RLAIF), in which the model receives “rewards” from other AI systems—thus making DeepSeek R1 a hybrid approach as well.

OpenAI의 ChatGPT는 지도학습 기반의 트랜스포머(Transformer) 모델이다.

이 모델은 기존에 레이블이 달린 데이터를 통해 패턴을 학습한 뒤, 사람의 피드백(보상)을 반영하는 강화학습(RLHF, Reinforcement Learning from Human Feedback)을 추가로 적용한 하이브리드 방식을 사용한다.

먼저, 지도학습(Supervised Learning)은 AI에게 정답이 있는 데이터를 제공하고, 그 정답을 맞히도록 훈련하는 과정이다. 이는 마치 학생이 교과서를 보고 공부하는 것과 비슷하다. AI가 미리 준비된 문제-정답 데이터셋을 반복적으로 학습하면, 질문에 맞는 적절한 답변을 확률적으로 생성할 수 있는 능력을 얻게 된다.

여기에 AI가 생성한 여러 답변 중 사람이 더 나은 답을 골라 보상을 주는 RLHF(사람 피드백을 활용한 강화학습) 과정을 추가하면, 모델이 더 자연스러운 답변을 만들어낼 수 있다. 비유하자면, 학교에서 수업을 마친 뒤, 방과 후 과외를 한 번 더 받는 셈이다. 이처럼 지도학습으로 기본기를 쌓은 뒤, 사람 피드백을 통한 강화학습으로 세밀한 조정이 이뤄지므로, 모델이 사용자 의도와 맥락에 더욱 부합하는 답변을 생성하게 된다.

OpenAI’s ChatGPT: A Hybrid Transformer Model with RLHF

OpenAI’s ChatGPT is a Transformer-based language model initially trained using supervised learning and later enhanced through reinforcement learning from human feedback (RLHF) to generate more natural and human-like responses.

Supervised learning involves training AI on labeled data, similar to how a student studies from textbooks. The model learns from pre-existing question-answer datasets, allowing it to recognize patterns and generate probabilistically relevant responses.

Source: Radford, A. (2018). Improving language understanding by generative pre-training.

On top of this, RLHF refines the model by having human evaluators rank multiple AI-generated responses, rewarding those that are more accurate and natural. This process is akin to receiving additional tutoring after regular school classes, fine-tuning the model’s ability to align with user intent and context.

DeepSeek R1: A Hybrid Model with RL and AI Feedback (RLAIF)

딥시크(DeepSeek)의 DeepSeek R1은 비지도 학습 기반의 강화학습(RL) 모델이다. 단순히 사람이 정답을 제공하는 것이 아니라, AI가 스스로 탐색하여 올바른 답을 찾도록 설계된 방식이다. 여기에 다른 AI로부터 제공받는 피드백(보상)을 반영하는 강화학습(RLAIF, Reinforcement Learning from AI Feedback) 기법을 추가해, 하이브리드 모델로 성능을 높였다.

강화학습(RL)은, 정답을 직접 가르쳐주지 않는 대신 AI가 환경과 상호작용하며 보상을 극대화하는 방향으로 행동(액션)을 선택하도록 학습하는 방식이다. 예를 들어, 아이가 자전거 타기를 배울 때 누군가가 정답(균형 잡는 법)을 자세히 알려주지 않아도, 넘어지고 다시 시도하는 과정을 통해 스스로 균형을 찾고, 성공하면 칭찬(보상)을 받는 것과 비슷하다.

To save training costs, the paper employs Group Relative Policy Optimization (GRPO) instead of traditional RL algorithms like Proximal Policy Optimization (PPO). GRPO is a reinforcement learning method introduced in the DeepSeekMath paper (2024). It builds upon the PPO framework and is designed to improve mathematical reasoning capabilities while reducing memory consumption, which makes it particularly suitable for tasks requiring advanced mathematical reasoning. (source: DhanushKumar (2025), “DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning…”, Medium, https://medium.com/@danushidk507/deepseek-r1-incentivizing-reasoning-capability-in-large-language-models-via-reinforcement-learning-9515a28a23ad.)

DeepSeek R1은 이 RL 과정에서 거절 샘플링(Rejection Sampling)에 의한 지도 학습 파인 튜닝(SFT, Supervised Fine-Tuning)을 적용한다. 즉, 모델이 스스로 생성한 여러 응답을 규칙에 따라 평가해 “좋은 응답”만 선별하고, 그 선별된 응답을 다시 지도학습 데이터로 활용하여 모델을 추가로 미세 조정한다. 결과적으로, 다른 AI가 내린 보상이나 평가(AI Feedback)를 활용하는 RLAIF 기법이 가미되어, DeepSeek R1은 비지도 학습 중심의 RL에 더해 지도학습 요소까지 혼합한 하이브리드 모델이 된 셈이다.

또한, Harrison Lee et al. (2024)의 서베이 연구에 따르면, RLHF(Reinforcement Learning from Human Feedback)는 2017년 OpenAI의 폴 F. 크리스티아노(Paul F. Christiano)와 동료들이 처음 소개했으며, RLAIF는 2022년 Bai et al.이 제안한 것으로 알려져 있다.

Reference: Lee, H., et al. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. In Forty-first International Conference on Machine Learning.

---

Unlike ChatGPT, DeepSeek R1 is based on reinforcement learning (RL) rather than traditional supervised learning. Instead of being given explicit answers, the model learns by exploration, refining its responses based on the rewards it receives.

DeepSeek R1 further integrates reinforcement learning from AI feedback (RLAIF)—a technique where feedback and rewards are provided not by humans but by another AI model. This hybrid approach enhances performance by leveraging both self-learning and AI-driven feedback loops.

Reinforcement learning (RL) is comparable to how a child learns to ride a bike without direct instructions. The child falls, adjusts, and eventually finds balance through trial and error. Similarly, an RL-based model interacts with its environment, optimizing actions based on received rewards.

One distinctive step in DeepSeek R1’s training process is rejection sampling combined with supervised fine-tuning (SFT). In this approach, the model generates multiple responses, filters out poor-quality ones using predefined criteria, and uses the best responses to further fine-tune itself through additional supervised learning. By also incorporating AI feedback (RLAIF), DeepSeek R1 combines the benefits of both reinforcement learning and supervised learning to achieve higher efficiency and accuracy.

Historical Context: RLHF and RLAIF

According to the 2024 paper by Harrison Lee et al., RLHF was first introduced in 2017 by Paul F. Christiano and his colleagues at OpenAI. RLAIF, in contrast, was proposed in 2022 by Bai et al., marking a shift from human-provided feedback to AI-provided feedback.

Reference:

Lee, H., et al. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. Proceedings of the Forty-first International Conference on Machine Learning.

Cold Start Training: Stabilizing RL with Pretraining

전혀 학습되지 않은 모델에 강화학습(RL)을 곧바로 적용하면, 학습 과정이 불안정해지고 시행착오가 지나치게 많아질 수 있다. 간단히 말해, 어디서부터 시작해야 할지 전혀 모르는 상태에 빠지는 셈이다.

이 문제를 해결하기 위해, DeepSeek R1은 먼저 콜드 스타트(Cold Start) 데이터를 사용해 지도학습(SFT, Supervised Fine-Tuning)을 수행한다. 이를 통해 모델이 기본적인 언어 능력과 문제 해결 패턴을 어느 정도 갖춘 상태에서 RL을 시작하게 한다. 이는 마치 자전거를 처음 배울 때, 누군가 뒤에서 자전거를 잡아주어 넘어지지 않도록 도와주며 안정적으로 균형 잡는 요령을 익히는 것과 유사하다.

결국, 강화학습 이전에 콜드 스타트 단계를 거쳐 소량의 지도학습을 적용함으로써, 모델이 강화학습(RL)에 적합한 초기 파라미터(초깃값)를 갖추도록 하는 전략이다.

---

Applying reinforcement learning directly to an untrained model often leads to instability and excessive trial-and-error. In simple terms, it’s like trying to navigate without any reference points.

To address this, DeepSeek R1 first undergoes a Cold Start phase using Supervised Fine-Tuning (SFT) before reinforcement learning begins. This ensures that the model has a foundational understanding of language and problem-solving patterns before diving into RL training.

This process is similar to how a child learning to ride a bike might initially receive assistance—such as someone holding the bike steady—before attempting to balance independently. By integrating a preliminary Cold Start training phase, DeepSeek R1 establishes a solid baseline before reinforcement learning optimizes its decision-making further.

Mixture of Experts (MoE): Efficient Computation

Source : Aoki, R., et al. (2022). Heterogeneous multi-task learning with expert diversity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(6), 3093-3102.

무엇보다 DeepSeek R1은 Mixture of Experts(MoE) 기법을 채택해, 필요한 서브 모델(전문가 네트워크)만 선택적으로 활성화하여 연산 효율성을 높인다. 이와 달리, Dense Transformer 구조는 입력을 받을 때마다 모델 전체 뉴런(모든 파라미터)을 매번 활성화한다.

MoE 모델에서는, 예를 들어 언어 모델링에 특화된 신경망과 수학 문제 풀이에 특화된 신경망 등 여러 “전문가(Experts)”가 준비되어 있다. 입력을 처리할 때, 전체 신경망 중에서 해당 입력에 적합한 전문가 신경망만 골라 활성화하므로, 불필요한 연산을 줄일 수 있다. 여기서 말하는 “전문가”는 사람이 아니라, 라우터(게이팅 네트워크)와 함께 학습되는 작은 신경망(Neural Networks)이다.

이러한 MoE(Mixture of Experts) 모델은 입력마다 일부 전문가만 활성화하여 계산량을 절감하고, 동시에 더 큰 모델을 구성할 수 있게 한다. 최근 GPT-4o, DeepSeek, Gemini 같은 최신 AI 모델들이 성능 극대화와 연산 비용 최적화를 동시에 달성하기 위해 MoE를 활용하고 있다고 알려져 있다.

한편, Mixture of Experts 기법은 1991년, Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, 그리고 Geoffrey E. Hinton이 “Adaptive Mixtures of Local Experts”라는 논문을 통해 처음 제안했다.

Reference: Jacobs, R. A., et al. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79-87.

---

One of DeepSeek R1’s defining features is its use of Mixture of Experts (MoE), a technique that activates only specific subnetworks (expert models) instead of engaging the entire neural network for every input.

This approach improves computational efficiency, in contrast to the Dense Transformer architecture, where all neurons (parameters) are activated for every input.

MoE models contain multiple expert networks that tend to specialize in different kinds of input, such as language modeling, mathematics, or logical reasoning. When processing an input, a routing (gating) network selects only the most relevant experts, which reduces unnecessary computation while maintaining high performance. These experts are not human specialists but small neural subnetworks, trained together with the router. Because only a subset of experts runs for each input, the total parameter count can grow without a proportional increase in computational cost.
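The selective activation described above can be sketched in a few lines of plain Python. This is a minimal top-k routing example under assumed settings (8 experts, 2 active per token, random weights, each expert reduced to a single matrix); it is not DeepSeek R1's actual router, and every name in it is hypothetical.

```python
import math
import random

random.seed(0)
D_MODEL, N_EXPERTS, TOP_K = 4, 8, 2  # assumed toy sizes

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

router = rand_matrix(N_EXPERTS, D_MODEL)  # one scoring row per expert
experts = [rand_matrix(D_MODEL, D_MODEL) for _ in range(N_EXPERTS)]

def moe_layer(x):
    """Send one token vector through only its TOP_K highest-scoring experts."""
    scores = matvec(router, x)
    chosen = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    # Softmax over the chosen experts only, so their gate weights sum to 1.
    top = max(scores[i] for i in chosen)
    gates = [math.exp(scores[i] - top) for i in chosen]
    total = sum(gates)
    out = [0.0] * D_MODEL
    for gate, i in zip(gates, chosen):
        y = matvec(experts[i], x)  # unchosen experts are never evaluated
        out = [o + (gate / total) * v for o, v in zip(out, y)]
    return out, chosen

token = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
output, used = moe_layer(token)
active_fraction = TOP_K / N_EXPERTS
print(f"experts used: {used}, share of expert parameters active: {active_fraction:.0%}")
```

In a dense layer every weight matrix would be applied to every token; here only `TOP_K` of the `N_EXPERTS` matrices are, which is where the compute saving comes from.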

Leading AI models such as GPT-4o, DeepSeek R1, and Gemini are reported to have adopted MoE to maximize performance while optimizing efficiency.

MoE was first introduced in 1991 by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton in their paper "Adaptive Mixtures of Local Experts".

Reference:

Jacobs, R. A., et al. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3(1), 79-87.

Knowledge Distillation: Compressing Large Models

Source: Gou, J., et al. (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), 1789-1819.

DeepSeek R1에는 디스틸레이션(distillation)이라는 독특한 기법도 적용된다. 이 디스틸레이션 기법과 앞서 언급한 Mixture of Experts(MoE) 방식을 강화학습(RL) 과정에서 함께 사용해, 최소한의 비용으로 최고 성능을 내도록 설계되었다.

디스틸레이션은 고성능 모델(‘교사 모델’)이 학습해낸 지식이나 패턴의 결과를 더 작은 모델(‘학생 모델’)에 전달하거나 압축해 학습시키는 기법이다. 이렇게 하면 AI의 성능을 유지하면서도 경량화와 효율성을 높일 수 있어, 실제로 모델 추론 속도가 빨라지고 메모리 점유량도 줄어드는 효과가 있다.

구체적으로는 이미 학습된 교사(Teacher) 모델의 출력 확률 분포(혹은 중간 레이어 표현)를, 학생(Student) 모델이 모방하도록 훈련한다. 예를 들어 교사 모델이 어떤 분류 문제에서 (0.01, 0.9, 0.09) 같은 확률 분포를 예측하면, 학생 모델도 유사한 확률을 맞추도록 학습하게 된다. 이를 통해 학생 모델은 교사 모델의 예측 패턴을 학습함으로써, 파라미터 수가 적어도 높은 성능에 근접할 수 있다. 결과적으로 추론 속도 향상과 메모리 절감이라는 장점을 얻으며, 규모가 작은 모델임에도 대형 모델에 필적하는 수준의 성능을 낼 수 있다.

이 디스틸레이션(distillation) 기법은 2015년 Geoffrey Hinton, Oriol Vinyals, 그리고 Jeff Dean이 발표한 “Distilling the Knowledge in a Neural Network” 논문에서 처음 소개되었다.

Reference: Hinton, G., et al. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.

---

Another key optimization in DeepSeek R1 is Knowledge Distillation, a technique that compresses the knowledge of a large high-performance model (Teacher Model) into a smaller, more efficient model (Student Model) without significant loss of accuracy.

Rather than deploying the large pre-trained model directly, Knowledge Distillation trains the smaller Student Model to mimic the output probability distribution (or intermediate representations) of the Teacher Model. For example, if the Teacher Model assigns probabilities 0.01, 0.9, and 0.09 to three possible outputs, the Student Model learns to approximate that same distribution.

This allows Student Models to retain much of the predictive accuracy of their larger counterparts while being lighter, faster, and more memory-efficient.
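The teacher distribution from the example (0.01, 0.9, 0.09) can be turned into a small worked calculation. The sketch below measures how far a student's predicted distribution is from the teacher's using KL divergence, a common choice of distillation loss; the student logits are made up for illustration, and the `temperature` parameter follows the softening idea in Hinton et al. (2015) but is left at 1.0 here.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): the penalty for predicting q when the target distribution is p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.01, 0.9, 0.09]

# Hypothetical students: one that has learned nothing yet (uniform output),
# and one whose logits reproduce the teacher's distribution.
student_untrained = softmax([0.0, 0.0, 0.0])
student_matched = softmax([math.log(t) for t in teacher])

loss_untrained = kl_divergence(teacher, student_untrained)
loss_matched = kl_divergence(teacher, student_matched)
print(f"loss before training: {loss_untrained:.4f}, after matching: {loss_matched:.4f}")
```

Training the student amounts to adjusting its weights so that this loss shrinks toward zero across many inputs. Note that the teacher's small probabilities (0.01 versus 0.09) also carry information about which wrong answers are more plausible, which a hard label would discard.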

The Knowledge Distillation (KD) technique was first introduced in 2015 by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their paper "Distilling the Knowledge in a Neural Network."

Reference:

Hinton, G., et al. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.

Estimating Parameter Reduction through Distillation

딥시크 R1 논문이나 개발사 측 자료를 보면 더 정확한 수치를 확인할 수 있겠지만, 일반적으로 지식 증류(Knowledge Distillation, KD)를 통해 파라미터를 5~10배, 혹은 그 이상 줄이면서도 성능을 상당 부분 유지한 사례가 많다. 대형 모델의 규모를 생각해보면, 수십 억~수백 억 개 파라미터를 줄일 수 있다는 의미이다. 예를 들어, GPT-3(파라미터 1750억 개)를 130억~300억 파라미터 규모의 학생(Student) 모델로 디스틸(distill)해도, 기존 성능의 90% 이상을 유지할 수 있다는 보고가 있다.

GPT-4가 파라미터가 수천 억 개부터 최대 1조 개 이상으로 추정되는 만큼, “OpenAI o1-mini 수준의 성능”이라고 한다면 대략 수십 억~100억 단위 모델이거나, 경우에 따라서는 그보다 더 작을 수도 있다는 의미가 된다. 학계나 업계 보고서에 따르면, 10배 이상 축소가 흔히 관찰되는 편이며, 이렇게 줄여도 성능이 80~90% 정도는 보전되는 사례가 많다고 알려져 있다.

---

Exact figures for DeepSeek R1 would have to come from its paper or the developer's own disclosures, but in general Knowledge Distillation (KD) has often been reported to shrink models by 5× to 10×, or more, while preserving much of their performance. For large models, that means removing billions to tens of billions of parameters.

For example:

GPT-3 (175 billion parameters) can reportedly be distilled into a 13B–30B student model while retaining over 90% of its performance.

GPT-4 is estimated to have hundreds of billions to over a trillion parameters, so a model at the level of OpenAI’s o1-mini would plausibly fall in the range of a few billion to tens of billions of parameters, or possibly smaller.

Studies and industry reports suggest that AI models can often be reduced by a factor of 10 or more while retaining 80–90% of their original accuracy.
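The arithmetic behind the GPT-3 figures is easy to check. The short sketch below uses only the parameter counts quoted above (175B teacher, 13B and 30B students); it says nothing about accuracy, which has to be measured separately.

```python
teacher_params = 175e9  # GPT-3, as quoted above

reduction = {}
for student_params in (13e9, 30e9):
    ratio = teacher_params / student_params
    removed = teacher_params - student_params
    reduction[student_params] = (ratio, removed)
    print(f"{student_params / 1e9:.0f}B student: {ratio:.1f}x smaller, "
          f"{removed / 1e9:.0f}B parameters removed")
```

The two students work out to roughly 13.5× and 5.8× smaller than the teacher, so the 13B case is already beyond the 5× to 10× range and closer to the "factor of 10 or more" figure.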
