누구나 쉽게 이해할 수 있는 강화학습 원리(The Principles of Reinforcement Learning Made Easy)

I. 들어가는 말(Background)

얼마 전 한 변호사님이 강화학습이 무엇인지 질문하셨습니다. 딥러닝 및 인공지능 기술을 수학적 배경 지식이 부족한 분께 설명하는 것은 쉽지 않은 일입니다. 이를 해결하기 위해, "역와쿠이 요시유키. (2020). 엑셀로 배우는 순환 신경망·강화학습 초(超)입문"이라는 책을 참고하여, 수학적 용어를 최소화하고 논리적으로 설명을 시도하였습니다. 이 책을 바탕으로, 수학적 기초 지식이 부족한 분들도 인공지능 컴퓨팅 기술의 기본을 이해할 수 있도록 이 글을 작성하게 되었습니다. 이 과정은 저에게도 관련 내용을 체계적으로 정리할 수 있는 좋은 기회가 되었습니다.

Not long ago, a lawyer asked me what reinforcement learning is. Explaining deep learning and artificial intelligence to someone without a strong mathematical background is no easy task. To address this, I referred to the book Excel de Manabu RNN & Reinforcement Learning: A Super Introductory Guide by Yoshiyuki Wakui (2020) and tried to give a logical explanation while keeping mathematical terminology to a minimum. Based on this book, I wrote this piece to help readers with limited mathematical knowledge grasp the fundamentals of AI computing technologies. Writing it also gave me a good opportunity to organize my own understanding of the subject.

II. 지도학습 인공신경망과 시계열 데이터 신경망

Supervised Learning Neural Networks and Time Series Neural Networks

1. 지도학습 인공신경망(Supervised Learning Neural Networks)

지도학습은 데이터의 특성과 결과가 연결된 패턴을 포함한 학습 데이터를 사용하여 인공 신경망에 주입함으로써 데이터 패턴을 학습하는 모델입니다. 따라서 지도학습에서는 학습 데이터가 입력과 정답(출력) 쌍으로 주어집니다. 이때 인공 신경망은 가중치(w)를 곱하여 합산된 형태로 데이터를 표현합니다. 학습 과정에서 신경망은 주어진 데이터의 패턴을 학습하며, 효과적으로 훈련된 모델은 최적의 가중치를 도출하여 모든 데이터 패턴을 정확히 표현할 수 있습니다.

Supervised learning trains an artificial neural network using labeled training data that includes patterns linking input features to corresponding outcomes. In this process, the neural network receives training data as input-output pairs, allowing it to learn the underlying patterns. The network represents data as a weighted sum of input features, where weights (w) are assigned to each feature. During the training phase, the network adjusts these weights based on the given patterns, and a well-trained model can derive optimal weights to accurately represent the data relationships.

예를 들어, 토마토와 당근 이미지에 각각 '토마토', '당근'이라는 레이블을 붙여 모델을 학습시킬 경우, 신경망은 두 이미지를 구분하는 데 필요한 가중치를 학습합니다. 새로운 이미지가 입력되면, 이 가중치를 바탕으로 해당 이미지가 토마토인지 당근인지를 판별할 수 있습니다. 이 과정에서 사용되는 가중치를 파라미터라고 하며, 학습 효율을 높이기 위해 조정하는 값을 하이퍼파라미터라고 합니다. 모델의 구조와 이러한 파라미터들을 공개하면, 다른 연구자나 개발자들이 모델을 재현하거나 개선하는 데 도움이 됩니다. 학습 데이터까지 공개되면, 이는 모델의 발전과 개선을 더욱 촉진할 수 있습니다.

For example, if you label images of tomatoes and carrots as 'tomato' and 'carrot' respectively and train the model with these labels, the neural network learns the necessary weights to distinguish between the two types of images. When a new image is introduced, these learned weights allow the model to determine whether the image is of a tomato or a carrot. The weights used in this process are known as parameters, while values adjusted to enhance learning efficiency are called hyperparameters. Sharing the model structure and these parameters can help other researchers and developers reproduce or improve the model. If the training data is also made public, it can further accelerate the development and refinement of the model.

<Fig. 1>

2. 시계열 데이터 신경망(Time Series Neural Networks)

앞에서 설명한 신경망은 정적 이미지를 구별할 수 있으나, 동적 이미지의 다음 동작을 예측할 수는 없습니다. 예를 들어 사진 속에서 고양이를 식별해낼 수 있어도 그 고양이가 어떻게 움직일지는 예측할 수 없습니다.

The neural networks previously described can distinguish static images but cannot predict the subsequent movements of dynamic images. For example, while they can identify a cat in a photograph, they cannot predict how the cat will move.

<Fig. 2>

자연어도 마찬가지입니다. 영어 "Please set the vase on the table."를 프랑스어로 번역한다고 할 때, 영어단어 "set"은 특정 장소에 무언가를 두는 것, 물체를 고정시키는 것, 회사를 설립하는 것, 무언가를 사용 준비하는 것, 조건을 조정하는 것, 시간이나 날짜를 정하는 것, 일치하는 아이템의 집합을 나타내는 것, 무대의 배경을 설명하는 것, 현재의 방향을 가리키는 것, 힘을 적용하기 시작하는 것, 스포츠에서 자신이나 객체를 위치시키는 것, 사회적 그룹을 언급하는 것, 수학에서 서로 다른 요소의 집합을 정의하는 것 등 여러 가지 의미를 가질 수 있습니다. 따라서 이 단어를 정확히 프랑스어로 번역하려면, 앞 뒤 단어에 의해 연결된 문맥을 파악하여야 합니다. 이와 같이 순서가 문제가 되는 데이터를 시계열 데이터라고 합니다.

The same principle applies to natural language. When translating the English sentence "Please set the vase on the table." into French, the English word "set" can have multiple meanings, such as placing something in a specific location, fixing an object in place, establishing a company, preparing something for use, adjusting conditions, setting a time or date, denoting a collection of matching items, describing a stage backdrop, indicating the direction of a current, starting to apply force, positioning oneself or an object in sports, referring to a social group, or defining a set of distinct elements in mathematics. Accurate translation into French therefore requires understanding the context provided by the surrounding words. Data in which the order of the elements matters in this way is referred to as time-series data.

<Fig. 3>

이런 시계열 데이터를 처리하려면 인공신경망에 기억능력을 갖게 해주어야 합니다. 이렇게 기억 능력을 갖게 한 모델은 순환신경망과 트랜스포머가 대표적입니다.

Handling time-series data requires endowing neural networks with memory capabilities. Recurrent Neural Networks (RNNs) and Transformers are prominent examples of models equipped with memory.

순환신경망은 시간에 따라 데이터를 순차적으로 처리하면서 이전의 정보를 '기억'하여 다음 데이터 처리에 활용합니다. 이와 달리 트랜스포머는 전체 입력 데이터를 동시에 처리하면서 자기 주의 메커니즘(Self-Attention Mechanism)을 통해 입력 데이터의 모든 부분 사이의 관계를 '기억'합니다. Self-Attention Mechanism이란 입력 시퀀스 내의 "모든" 단어 쌍 간의 상호 작용을 계산하여 각 단어의 유의성(attention score)을 학습합니다. 이렇게 트랜스포머는 고정된 정보를 시간적 순서대로 저장하고 추출하는 RNN의 '기억' 방식과 달리, 전체 입력에 대한 글로벌한 문맥을 한 번에 고려하는 방식입니다. OpenAI에서 개발한 GPT (Generative Pre-trained Transformer)는 트랜스포머 모델의 일종입니다.

Recurrent Neural Networks (RNNs) process data sequentially over time, ‘remembering’ previous information and using it to process subsequent data. In contrast, Transformers process the entire input data simultaneously and ‘remember’ the relationships between all parts of the input through the Self-Attention Mechanism. This mechanism computes the interactions between all word pairs within the input sequence, learning attention scores that determine the significance of each word. Unlike RNNs, which store and retrieve fixed information in a temporal sequence, Transformers capture the global context of the entire input at once. A Generative Pre-trained Transformer (GPT), developed by OpenAI, is a type of Transformer model.

<Fig. 4>

[Self-Attention in the Transformer Model]

트랜스포머에서 어텐션은 세 곳에 적용됩니다. 인코더의 셀프 어텐션, 디코더의 마스크드 셀프 어텐션, 그리고 디코더의 인코더-디코더 어텐션입니다. 인코더의 셀프 어텐션은 입력 문장의 단어 벡터를 기반으로 수행됩니다. 그러나 셀프 어텐션은 인코더의 초기 입력인 단어 벡터를 직접 사용하지 않고, 먼저 각 단어 벡터를 Query(Q), Key(K), Value(V) 벡터로 변환하는 과정을 거칩니다. 예를 들어, 입력 문장에 "student" 라는 단어가 있다면, 해당 단어 벡터는 Q, K, V 벡터로 변환됩니다. 이 과정은 문장의 모든 단어에 적용되므로 "I," "am," "a," "student" 각각의 단어도 Q, K, V 벡터를 갖게 됩니다.

A Transformer uses attention in three places: self-attention in the encoder, masked self-attention in the decoder, and encoder-decoder attention in the decoder. In the encoder, self-attention operates on the word vectors of the input sentence. However, instead of applying self-attention directly to these initial word vectors, the model first transforms each word vector into three separate representations: Query (Q), Key (K), and Value (V) vectors. For example, if the input sentence contains the word "student," its word vector is converted into corresponding Q, K, and V vectors. This process is applied to all words in the sentence, meaning that "I," "am," "a," and "student" each obtain their own Q, K, and V vectors.

<Fig. 5>

이제 Q, K, V 벡터가 준비되었으면, 각 Q 벡터는 모든 K 벡터와의 어텐션 스코어(점수)를 계산하고, 이를 바탕으로 어텐션 분포(Attention Distribution)를 구한 뒤, 모든 V 벡터를 가중합하여 최종 어텐션 값(Attention Value)을 계산합니다. 이 과정은 문장의 모든 Q 벡터에 대해 반복됩니다.

Once the Q, K, and V vectors are obtained, each Q vector computes attention scores against all K vectors, forming the attention distribution. Using this distribution, a weighted sum of all V vectors is calculated to determine the final attention value. This process is repeated for each Q vector in the sentence.

예를 들어, 어텐션 스코어는 단어 "I"가 문장 내의 다른 단어들인 "I," "am," "a," "student" 와 얼마나 연관되어 있는지를 수치적으로 나타낸 것입니다. 더 구체적으로, 어텐션 스코어에 소프트맥스(softmax) 함수를 적용하면 어느 단어와 더 연관되어 있는지를 나타내는 어텐션 분포가 구해지며, 이를 사용하여 각 V 벡터와 가중합하여 최종 어텐션 값(연관도 값)을 얻습니다. 이때, 특정 단어에 대한 어텐션 값은 그 단어의 문맥 벡터(context vector) 라고도 합니다.

For example, an attention score represents how strongly the word "I" is related to each of the words "I," "am," "a," and "student." More precisely, by applying the softmax function to the attention scores, the attention distribution is obtained. Then, this distribution is used to compute the weighted sum of the corresponding V vectors, resulting in the final attention value for each word. This attention value is also referred to as the context vector for the respective word.

<Fig. 6>

이 모든 과정은 벡터 연산이 아니라 행렬 연산(matrix operations)을 사용하여 한 번에 계산할 수 있습니다. 행렬 연산을 통해 어텐션 값 행렬(Attention Value Matrix)이 생성되면, 이 행렬은 입력 문장의 각 단어가 다른 단어와 가지는 연관도를 나타냅니다. 이러한 과정 덕분에, 예를 들어 문장에 대명사 "it"이 포함되어 있을 경우, 어텐션 메커니즘을 통해 "it"이 지칭하는 대상이 무엇인지 판별할 수 있습니다.

Rather than performing these operations as separate vector calculations, they can be efficiently computed using matrix operations. For example, in the attention score matrix, the value at the intersection of the "I" row and "student" column represents the attention score between the Q vector of "I" and the K vector of "student."

Once the attention score matrix is constructed, the next step is to compute the attention distribution and use it to obtain the final attention value matrix. This is achieved by applying the softmax function to the attention score matrix and multiplying it with the V matrix.

The final output is the attention value matrix, which represents the relationship between each word and all other words in the input sentence. This mechanism enables the model to understand contextual dependencies between words, such as resolving pronoun references (e.g., determining what "it" refers to in a given sentence).
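The steps described above (attention scores, softmax, weighted sum of V vectors) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the Q, K, and V numbers are made up, the step that produces them from word vectors is skipped, and the division by the square root of the vector length follows the commonly used scaled dot-product form, which the text above does not spell out.

```python
import math

def softmax(scores):
    """Turn raw scores into a distribution that sums to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_distribution(q, K):
    """Score one Q vector against every K vector, then apply softmax."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
    return softmax(scores)

def self_attention(Q, K, V):
    """Return one context vector (attention value) per word."""
    context = []
    for q in Q:
        dist = attention_distribution(q, K)
        # Weighted sum of all V vectors, one output component at a time.
        context.append([sum(w * v[j] for w, v in zip(dist, V))
                        for j in range(len(V[0]))])
    return context

words = ["I", "am", "a", "student"]
# Hypothetical 2-dimensional Q, K, V vectors, one row per word.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

context = self_attention(Q, K, V)
for word, vector in zip(words, context):
    print(word, [round(x, 3) for x in vector])
```

Each printed row is the context vector for one word. Real models do the same work with matrix operations on much larger vectors.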

참고로 벡터 개념을 모르시면 그냥 숫자의 집합이라고 생각하셔도 좋습니다. 여러 개의 숫자가 모여 하나의 정보를 나타내는 구조라고 보면 됩니다.

가중합(weighted sum)은 직선 방정식의 기울기 계수(가중치)를 곱한 후 모두 더한 값이라고 이해하면 쉽습니다. 즉, 어떤 값이 더 중요한지를 나타내는 가중치를 곱한 후 최종적으로 합산하는 과정입니다. 좀 더 직관적으로 설명하자면, 가중합은 특정 신호(값)가 얼마나 영향을 미치는지를 계산한 후, 그 영향을 모두 합한 것이라고 볼 수 있습니다. 중요한 값일수록 가중치가 커지고, 덜 중요한 값일수록 가중치가 작아지며, 이를 합산하여 최종 결과를 도출하는 방식입니다.

행렬(Matrix)은 벡터처럼 숫자의 집합이지만, 여러 개의 벡터를 층층이 쌓아 놓은 형태라고 생각하면 됩니다. 마치 표(테이블)처럼 가로(행)와 세로(열)로 구성된 숫자의 집합이며, 한 번에 많은 벡터를 처리할 수 있도록 만들어진 구조입니다.

If you are unfamiliar with the concept of vectors, you can think of them simply as a collection of numbers that collectively represent some information.

The weighted sum is similar to a linear equation, where coefficients (weights) are multiplied by values and the results are summed. In simpler terms, the weighted sum determines how much influence each value has by assigning different levels of importance to each component. The more important a value is, the larger its weight, and vice versa.

A matrix is also a collection of numbers, just like a vector, but instead of being a single sequence, a matrix stacks multiple vectors together in rows and columns. You can think of it as a table that organizes numbers in a structured format, allowing efficient processing of multiple vectors at once.
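As a small illustration of these three ideas, the sketch below treats a vector as a list of numbers, computes a weighted sum, and then applies the same weights to every row of a matrix. All numbers are invented for the example.

```python
def weighted_sum(values, weights):
    """Multiply each value by its weight and add up the results."""
    return sum(v * w for v, w in zip(values, weights))

values = [2.0, 1.0, 3.0]      # a vector: three hypothetical input signals
weights = [0.5, 0.1, 0.4]     # how much each signal matters

result = weighted_sum(values, weights)   # 2*0.5 + 1*0.1 + 3*0.4

# A matrix is several vectors stacked as rows; compute one weighted sum per row.
matrix = [
    [2.0, 1.0, 3.0],
    [0.0, 4.0, 1.0],
]
row_results = [weighted_sum(row, weights) for row in matrix]

print(round(result, 3), [round(r, 3) for r in row_results])
```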

III. 비지도학습 인공신경망과 강화학습

Unsupervised Learning Neural Networks and Reinforcement Learning

1. 비지도학습 인공신경망(Unsupervised Learning Neural Networks)

비지도학습(unsupervised learning)은 정답(label) 없이 데이터를 학습하는 방식입니다. 주어진 데이터에서 숨겨진 패턴이나 구조를 발견하는 것이 목표이며, 대표적인 비지도학습 인공신경망으로는 데이터의 중요한 특징을 압축하고 복원하는 오토인코더(Autoencoder), 두 개의 신경망(생성자 Generator, 판별자 Discriminator)이 서로 경쟁하면서 현실적인 데이터를 생성하는 생성적 적대 신경망(GAN, Generative Adversarial Networks), K-Means와 같은 전통적인 클러스터링 기법을 사용하여 데이터의 그룹을 자동으로 학습하는 클러스터링 기반 신경망 (Clustering Neural Networks) 등이 있습니다. 생성적 적대 신경망(GAN, Generative Adversarial Networks)은 주로 가짜 이미지를 생성하고 이를 실제 이미지처럼 보이도록 학습하는 데 사용됩니다.

Unsupervised learning is a method of training a model without labeled data. Its primary goal is to uncover hidden patterns or structures within the given data. Representative types of neural networks used in unsupervised learning include:

- Autoencoders: These compress the key features of data and reconstruct it, making them useful for dimensionality reduction and anomaly detection.

- Generative Adversarial Networks (GANs): GANs consist of two neural networks — a generator and a discriminator — that compete with each other. The generator creates realistic data while the discriminator evaluates its authenticity.

- Clustering-Based Neural Networks: These use traditional clustering techniques, such as K-Means, to automatically group similar data without requiring labels.

Among these, GANs are mainly used to generate synthetic images, including deepfakes, that are trained to look as realistic as possible.

<Fig. 7>

2. 준지도학습 인공신경망, 강화학습 (Semi-Supervised Learning Neural Networks, RL)

강화학습 (Reinforcement Learning, RL)은 기계가 환경(environment)과 상호작용하면서 보상(reward) 을 최대화하는 방향으로 행동(action)을 학습하는 방식입니다. 지도학습(supervised learning)이나 비지도학습과 달리, 강화학습은 명확한 정답을 제공받는 것이 아니라 스스로 탐색하며 최적의 전략(policy)을 학습해야 합니다.

Reinforcement Learning (RL) is a learning method in which a machine interacts with an environment and learns to take actions that maximize rewards. Unlike supervised learning or unsupervised learning, reinforcement learning does not receive explicit correct answers but instead explores autonomously to learn an optimal policy.

비지도학습은 라벨이 없는 데이터에서 패턴을 발견하는 것을 목표로 합니다. 강화학습(Reinforcement Learning, RL) 또한 지도학습과 달리 정답(라벨)이 직접 주어지지 않는다는 점에서 비지도학습과 유사합니다. 그러나 강화학습은 에이전트가 환경과 상호작용하며 수집한 데이터에 대해 보상이라는 피드백을 받아 학습한다는 점에서 지도학습(Supervised Learning)의 특성을 갖는다고 볼 수도 있습니다.

Unsupervised learning aims to discover patterns in unlabeled data. Reinforcement learning (RL) resembles unsupervised learning in that, unlike supervised learning, it is not given explicit labels or correct answers. However, because the agent interacts with the environment and learns from reward feedback on the data it collects, reinforcement learning can also be seen as having some characteristics of supervised learning.

그러나 다음과 같은 이유로 강화학습을 비지도학습의 범주로 분류할 수도 있습니다.

However, reinforcement learning can also be classified as a type of unsupervised learning for the following reasons.

(1) 명확한 정답(라벨)이 없음

지도학습에서는 학습 데이터가 입력과 정답(출력) 쌍으로 제공되지만, 강화학습에서는 에이전트가 환경에서 행동을 선택하고, 그 결과로 보상을 받으며 학습합니다. 이때, 보상(reward)은 지도학습에서 제공하는 정답(라벨)과 달리 명확한 정답이 아니라 환경을 탐색하는 과정에서 얻는 피드백입니다.

In supervised learning, training data is provided as input-output pairs, whereas reinforcement learning involves an agent selecting actions within an environment and receiving rewards based on the outcomes. Here, the reward is not a predefined correct answer (label) but rather feedback obtained through environmental exploration.

다만, Q-러닝(Q-Learning)과 같은 기법에서는 최적의 행동을 찾기 위해 가치(value)를 예측하는 과정이 포함되며, 이 과정은 지도학습과 유사한 면이 있습니다.

However, techniques such as Q-learning involve predicting values to identify optimal actions, which makes this process somewhat similar to supervised learning.

(2) 에이전트가 데이터를 스스로 탐색하고 학습

비지도학습은 사람이 직접 라벨링하지 않은 데이터를 분석하는 과정인데, 강화학습에서는 에이전트가 환경과 상호작용하며 데이터를 직접 수집하고 학습합니다. 즉, 주어진 데이터셋을 단순히 학습하는 것이 아니라, 데이터를 능동적으로 탐색하고 의미를 찾아야 한다는 점에서 비지도학습과 유사합니다.

Unsupervised learning analyzes unlabeled data without human intervention, and similarly, reinforcement learning agents interact with the environment, collecting and learning from data independently. In this sense, reinforcement learning is similar to unsupervised learning because it requires the agent to actively explore and derive meaning from its experiences rather than simply learning from a fixed dataset.

다만, 강화학습에서 데이터에 담긴 경험을 바탕으로 최적의 선택 전략을 찾아가는 방식입니다. 따라서 보상을 통해 더 나은 행동을 선택하도록 유도하는 것은 일종의 간접적인 정답 제공으로 볼 수 있어, 일부 지도학습의 개념이 적용된다고 볼 수 있습니다.

However, in reinforcement learning, the agent derives optimal decision-making strategies based on accumulated experiences. Since rewards guide the agent toward better actions, this can be seen as an implicit form of supervision, incorporating some aspects of supervised learning.

(3) 데이터 패턴을 발견하는 과정

비지도학습의 주요 목표는 주어진 데이터에서 숨겨진 패턴을 찾는 것입니다. 강화학습에서도 최적의 정책(optimal policy)을 학습하는 과정에서 데이터의 패턴을 발견하는 것이 중요한 요소입니다.

The primary goal of unsupervised learning is to uncover hidden patterns within the given data. Similarly, reinforcement learning also involves discovering patterns in data during the process of learning an optimal policy.

예를 들어, 게임 AI가 각 행동(action)이 장기적으로 어떤 결과를 가져오는지 패턴을 분석하는 과정은 비지도학습의 성격을 띠고 있습니다. 다만, 강화학습에서는 특정 행동에서 보상을 받을 수 있게 지도하기 때문에 지도학습과 간접적으로 유사한 특성도 포함됩니다.

For example, a game AI analyzing the long-term consequences of each action and identifying patterns shares characteristics with unsupervised learning. However, reinforcement learning also involves structured reward mechanisms that help guide the agent’s actions, making it somewhat similar to supervised learning in an indirect way.

IV. 강화학습 알고리즘의 이해

Understanding Reinforcement Learning Algorithms

강화학습은 에이전트(agent)가 환경(environment)과 상호작용하며 누적 보상을 극대화하는 방법을 학습하는 머신 러닝 모델입니다. 지도학습(supervised learning)이 정답이 주어진 데이터에서 학습하고, 비지도학습(unsupervised learning)이 레이블이 없는 데이터에서 패턴을 찾는 것과 달리, 강화학습은 시행착오(trial and error)를 통해 최적의 전략(policy)을 학습합니다.

Reinforcement learning (RL) is a machine learning model in which an agent interacts with an environment to learn how to maximize cumulative rewards. Unlike supervised learning, where models learn from labeled data, or unsupervised learning, which identifies patterns in unlabeled data, reinforcement learning relies on trial and error to discover an optimal policy.

강화학습은 에이전트(agent), 환경(environment), 행동(action)이라는 세 가지 주요 구성 요소로 이루어집니다.

에이전트: 환경과 상호작용하며 학습하는 지능형 개체
환경: 에이전트가 행동을 수행하는 외부 시스템
행동: 에이전트가 환경의 상태를 변화시키기 위해 수행하는 선택

Reinforcement learning consists of three key components: the agent, the environment, and actions.

Agent: An intelligent entity that interacts with and learns from the environment.
Environment: The external system in which the agent operates.
Action: The decision made by the agent to transition between states in the environment.

강화학습에서 에이전트는 자신의 행동에 따라 보상(reward) 또는 벌점(penalty)을 받습니다.

보상: 에이전트가 어떤 선택을 극대화해야 하는 긍정적인 강화 신호
벌점: 에이전트가 어떤 선택을 최소화해야 하는 부정적인 결과

에이전트는 이러한 보상과 벌점을 바탕으로 행동을 최적화하며, 목표 달성을 위한 최적의 전략을 학습합니다.

In reinforcement learning, the agent receives rewards or penalties based on its actions.

Reward: A positive reinforcement signal that the agent aims to maximize.
Penalty: A negative consequence that the agent seeks to minimize.

Through this system of rewards and penalties, the agent learns to optimize its actions and develop an optimal strategy to achieve its goal.

<Fig. 8>

1. 강화학습 대표주자, Q-학습

Q-러닝(Q-learning)은 강화학습에서 널리 사용되는 알고리즘이었습니다. 강화학습 원리를 이해하는데 좋은 모델입니다. 이 알고리즘은 모델이 필요 없는(model-free) 방식이므로 환경의 동적 특성(dynamics)에 대한 명시적인 정보 없이도 학습이 가능합니다.

Q-learning is a widely used algorithm in reinforcement learning and a good model for understanding how reinforcement learning works. It falls under the model-free category, meaning it does not require explicit knowledge of the environment’s dynamics.

Q-러닝은 각 상태-행동(state-action) 쌍의 가치(value)를 추정하고, 관찰된 보상(reward)을 기반으로 Q 값을 반복적으로 업데이트합니다. 이를 통해 에이전트(agent)는 경험을 통해 최적의 정책(optimal policy)을 학습하며, 보다 지능적인 결정을 내릴 수 있습니다.

Q-learning estimates the value of state-action pairs and iteratively updates Q-values based on observed rewards. Through this process, the agent learns the optimal policy from experience, enabling it to make more intelligent decisions.

Q-러닝을 딥 신경망(deep neural networks, DNN)과 결합한 모델이 딥 Q-네트워크(Deep Q-Network, DQN)입니다. DQN은 딥러닝의 강력한 표현 학습 능력을 활용하여, 대규모 상태-행동 공간에서 Q 값을 근사(approximate)합니다.

A model that combines Q-learning with deep neural networks (DNNs) is called the Deep Q-Network (DQN). DQN leverages the power of deep learning architectures to approximate Q-values in large state-action spaces.

신경망을 함수 근사자(function approximator)로 사용함으로써 DQN은 복잡한 환경을 처리하고, 고차원(high-dimensional) 데이터를 학습할 수 있습니다.

By using neural networks as function approximators, DQN can handle complex environments and learn high-dimensional representations.

이때 사용되는 가치 함수(Value function)는 특정 상태(state)에서의 기대 수익(reward) 또는 특정 행동(action)을 취할 때의 예상 유용성(utility)을 추정하는 함수입니다.

가치 함수를 최적화함으로써 에이전트(agent)는 장기적인 보상을 극대화하는 방향으로 정보에 입각한(informed) 결정을 내릴 수 있습니다.

The value function is a function that estimates the expected reward of being in a particular state or the anticipated utility of taking a specific action.

By optimizing the value function, the agent can make informed decisions that maximize long-term rewards.

Q-Learning에서는 각 상태(state)와 행동(action)의 가치를 Q-값(Q-value)으로 저장하며, 이를 Q-테이블(Q-Table) 형태로 관리합니다.

학습 과정에서 Q-값은 지속적으로 업데이트되며, 이 과정을 반복함으로써 에이전트는 더 높은 보상을 기대할 수 있는 행동을 선택하는 경향을 가지게 되고, 결국 최적의 정책(optimal policy)을 형성하게 됩니다.

In Q-learning, the value of each state-action pair is stored as a Q-value and maintained in a structure called a Q-table.

During training, the Q-values are continuously updated. By iteratively refining these values, the agent learns to favor actions that lead to higher expected rewards, ultimately forming an optimal policy.

(Reference)

[Q-Learning: 강화학습의 핵심 개념과 이해]

[강화학습알고리즘: DQN]

<Fig. 9>

1) Q학습의 개념이해 (Understanding the Concept of Q-Learning)

이제부터 본격적으로, 서두에서 언급한 "역와쿠이 요시유키 (2020)" 책을 참고하여 강화학습 개념을 쉽게 설명해보겠습니다.

From here on, I will explain the concept of reinforcement learning in simple terms, drawing on the book by Yoshiyuki Wakui (2020) mentioned in the introduction.

이 책에서는 강화학습 알고리즘을 개미가 미로에서 케이크가 있는 방까지 최단 경로를 찾는 방식으로 학습하는 예제를 통해 설명하고 있습니다.

개미는 에이전트(Agent)
개미가 활동하는 다수의 방으로 구성된 세계는 환경(Environment)
개미가 한 방에서 인접한 옆 방으로 움직이는 동작은 행동(Action)
목적지에 있는 케이크가 보상(Reward)

In this book, reinforcement learning algorithms are illustrated through an example where an ant learns to find the shortest path to a room with a cake inside a maze.

The ant represents the Agent
The maze consisting of multiple rooms is the Environment
The movement of the ant from one room to an adjacent room is an Action
The cake in the destination room is the Reward

이 환경(Environment)에서 개미가 이동할 수 있는 방의 상태(state)는 총 9개입니다. 상태(State, S) 란 환경에서 에이전트가 현재 위치한 상태를 의미합니다. 개미가 방 1 상태에서 출발하여 방 9 상태에 도착하면 케이크라는 보상을 받게 됩니다.

In this environment, there are nine possible states where the ant can be positioned. A state (S) represents the current location of the agent within the environment. For example, when the ant starts at room 1 and reaches room 9, it receives the reward (cake).

행동(Action, A)는 특정 상태에서 다른 상태로 이동하기 위해 수행할 수 있는 선택지를 의미합니다. 개미 입장에서, 방의 벽이 막혀있지 않다면 오른쪽, 위쪽, 왼쪽, 아래쪽으로 이동할 수 있습니다. 편의상, 이러한 움직임을 각각 1, 2, 3, 4로 표현합니다.

An action (A) refers to the choices available in a given state. From the ant’s perspective, if no walls are blocking the way, it can move right, up, left, or down. For simplicity, these movements are represented as 1, 2, 3, and 4, respectively.

<Fig.10>

이때, 특정 상태에서 행동을 선택하는 전략을 정책(Policy, π)라고 합니다. 개미가 이 전략에 따라 케이크가 있는 방까지 도달하기 위해 이동한 경로는 상태(state)들의 연속적인 순서로 표현되는데, 이러한 하나의 경로를 에피소드(Episode)라고 부릅니다.

At this point, the strategy the agent follows to choose actions in different states is called a Policy (π). The sequence of states the ant follows to reach the cake can be represented as an ordered sequence of transitions, which is known as an Episode.

즉, 에피소드(Episode) 란 시작 상태에서 목표 상태까지 도달하는 일련의 단계(step)들을 포함하는 상태의 집합을 의미합니다. 개미가 선택한 에피소드는 다양한 경로 선택에 따라 여러 개가 존재할 수 있습니다. 이 중 누적 보상이 가장 큰 경로를 학습합니다.

Thus, an Episode is a series of steps taken from the initial state to the goal state. The ant can follow multiple different episodes depending on the paths it chooses. Among these, it learns the path with the largest cumulative reward.

Q-러닝에서는 행동 가치 함수(State-Action Value Function, Q(S, A))는 Q-테이블(Q-Table)로 표현되며, 이는 각 상태(state)에서 가능한 행동(action)에 대한 예상 보상 값(Q-Value)을 저장하는 테이블입니다.

The State-Action Value Function (Q(S, A)) is represented as a Q-table, which stores the expected reward values (Q-values) for each possible action in a given state.

Q-러닝 알고리즘에서는 이 Q-테이블을 점진적으로 업데이트하며 최적의 행동을 학습합니다. 즉 Q-테이블이라는 구조를 사용하여 각 상태에서의 행동 가치를 학습하며, Q-값을 반복적으로 업데이트함으로써 에이전트는 더 높은 보상을 기대할 수 있는 행동을 선택하는 경향을 가지게 되고, 결국 최적의 정책(optimal policy)을 형성하게 됩니다.

In the Q-learning algorithm, the Q-table is gradually updated to learn the optimal actions. Q-learning utilizes the Q-table structure to learn the action values for each state. By iteratively updating Q-values, the agent tends to choose actions that yield higher rewards, ultimately forming an optimal policy.

예를 들어, 개미는 페로몬(pheromone) 냄새를 이용하여 옆 방에서 받을 수 있는 보상을 예측합니다. 따라서 각 상태에서 취한 행동에 대한 예상 보상값인 Q값은 해당 출구에 남아 있는 페로몬 강도 값으로 표현됩니다. 한번 학습을 진행하고 나면 이 값을 경험한 냄새 값으로 갱신합니다.

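For example, the ant uses pheromone scent to predict the reward it might receive from an adjacent room. The Q-value, the expected reward for an action taken in a particular state, is therefore represented by the pheromone intensity remaining at the corresponding exit. After each round of learning, this value is updated with the scent the ant actually experienced.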

개미는 강한 냄새가 나는 방향으로 이동하는 경향이 있으며, 이는 Q값(Q-Value)의 본질을 설명하는 개념입니다. 즉, Q-값은 선택한 행동이 얼마나 유용한지를 나타내는 지표이며, 높은 Q-값을 가진 방향으로 개미가 유도됩니다. 이렇게 학습을 반복하면, 개미는 가장 냄새가 강한 경로를 따라 이동할 수 있습니다.

Ants naturally tend to move in the direction where the scent is stronger, which captures the essence of Q-values. In other words, a Q-value indicates how useful a chosen action is, and the ant is guided toward the direction with the higher Q-value. By repeating this learning, the ant comes to follow the path with the strongest scent.

아래 그림에서는 앞서 예로 든 미로 환경에서 벽이 막혀 이동할 수 없는 경우 해당 Q값을 "closed"로 표기하고, 이동 가능한 경로의 Q-값을 페로몬 강도로 수치화하여 표현하였습니다.

The figure below shows the maze environment from the earlier example. Where a wall blocks movement, the corresponding Q-value is marked "closed"; for open paths, the Q-value is shown as a number representing pheromone intensity.

<Fig.11>

2) Q-Learning의 학습 과정 (Q-Learning Training Process)

Q-Learning의 학습은 Q-테이블(Q-Table)을 반복적으로 업데이트하는 과정을 통해 진행됩니다. 이 과정은 크게 Q-테이블 초기화, 행동 선택, 보상 관찰 및 Q-값 업데이트의 단계로 구성됩니다.

Q-Learning training progresses through an iterative process of updating the Q-table. This process consists of three main steps: Q-table initialization, action selection, and reward observation with Q-value updates.

A. Q-테이블 초기화 (Q-Table Initialization)

학습의 첫 단계는 Q-테이블을 초기화하는 과정입니다. 초기에는 모든 상태(state)와 행동(action)의 Q-값(Q(S, A))을 0으로 설정합니다. 개미가 처음 출발하는 상태는 아무런 냄새가 나지 않는 것과 같습니다.

이를 수식으로 나타내면 다음과 같습니다.

The first step in training is initializing the Q-table. Initially, all Q-values (Q(S, A)) for each state (S) and action (A) are set to 0. This can be represented mathematically as follows:

𝑄(𝑆, 𝐴)=0

이후, 학습이 진행됨에 따라 에이전트(agent)는 환경(environment)과 상호작용하며 Q-값을 지속적으로 업데이트합니다.

As training progresses, the agent interacts with the environment and continuously updates the Q-values.
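A Q-table for the ant example can be sketched as a nested dictionary with one row per room and one entry per action. This is a minimal sketch: the action numbers follow the text above (1 = right, 2 = up, 3 = left, 4 = down), while the variable names are hypothetical and walls are ignored here.

```python
N_ROOMS = 9
ACTIONS = {1: "right", 2: "up", 3: "left", 4: "down"}

# Every Q(S, A) starts at 0: the ant has not smelled any pheromone yet.
q_table = {room: {action: 0.0 for action in ACTIONS}
           for room in range(1, N_ROOMS + 1)}

print(q_table[1])   # Q-values for the four exits of room 1
```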

B. 행동 선택 (탐색 vs. 활용)(Action Selection (Exploration vs. Exploitation))

Q-Learning에서 에이전트는 탐색(Exploration)과 활용(Exploitation)을 조합하여 행동을 선택합니다. 이 과정에서 ϵ-탐욕(ϵ-greedy) 정책이 사용됩니다.

In Q-Learning, the agent selects actions by balancing exploration and exploitation. This process is managed using the ϵ-greedy policy.

만약 개미가 냄새의 강도(페로몬)에만 의존하여 행동을 결정한다면, 경우에 따라서는 무한 루프에 빠져 목표 지점에 도달하지 못할 가능성이 존재합니다. 이를 방지하기 위해서는 냄새의 강도와 관계없이 새로운 경로를 탐색할 필요가 있습니다.

If the ant relies solely on the pheromone intensity to determine its actions, it may sometimes enter an infinite loop and fail to reach the target destination. To prevent this, the ant must explore new paths regardless of pheromone strength.

Exploitation (활용/이용): 기존의 정보를 활용하여 현재 가장 높은 Q-값을 가진 행동을 선택하는 방식
- Exploitation: Utilizing existing information to choose the action with the highest Q-value.

Exploration (탐색): 새로운 가능성을 찾기 위해 랜덤하게 다른 행동을 선택하는 방식
- Exploration: Randomly selecting a different action to discover new possibilities.

ϵ-greedy 기법은 특정 "확률 ϵ"로 탐색(Explore)을 수행하고, 나머지 "확률 (1−ϵ)"에서는 현재 최적의 행동(Exploit)을 선택하는 방식으로 동작합니다.

일반적으로 학습 초반에는 탐색(Explore) 비율을 높게 설정하고, 학습이 진행될수록 이용(Exploit) 비율을 점진적으로 증가시키는 방식을 사용합니다.

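The ϵ-greedy method operates by performing exploration with a probability of 𝜖 while selecting the current optimal action (exploitation) with a probability of (1−𝜖). Typically, the exploration rate is set high early in training, and the share of exploitation is gradually increased as training progresses.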

확률 (1 - ϵ) : 최적 행동 선택 (Exploitation)
- Probability (1−𝜖): Selects the optimal action (Exploitation).

확률 ϵ : 랜덤 행동 선택 (Exploration)
- Probability 𝜖: Chooses a random action (Exploration).
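The selection rule above can be written as a short function. This is a sketch under stated assumptions: the function names, the example Q-values, and the decay schedule (start at 1.0, multiply by 0.99 per episode, never go below 0.05) are illustrative choices rather than values taken from the book.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping action -> Q-value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore: any action at random
    return max(q_values, key=q_values.get)     # exploit: highest Q-value

def epsilon_for(episode, start=1.0, end=0.05, decay=0.99):
    """Explore a lot at first, then rely more and more on what was learned."""
    return max(end, start * decay ** episode)

# Hypothetical pheromone strengths at the four exits of one room.
q_values = {1: 0.2, 2: 0.0, 3: 0.9, 4: 0.1}
rng = random.Random(0)
for episode in (0, 100, 1000):
    eps = epsilon_for(episode)
    print(episode, round(eps, 3), epsilon_greedy(q_values, eps, rng))
```

With 𝜖 = 0 the function always returns the action with the highest Q-value, and with 𝜖 = 1 it always picks at random.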

C. 보상 관찰 및 Q-값 업데이트 (Reward Observation and Q-Value Update)

Q-값은 벨만 방정식(Bellman Equation)을 이용해 지속적으로 업데이트됩니다. 벨만 방정식은 최적 정책(Optimal Policy)을 찾기 위해 상태(State)와 행동(Action) 간의 관계를 수식화한 핵심 원리로, 강화학습에서 Q-값을 갱신하는 데 사용됩니다.

The Q-value is continuously updated using the Bellman Equation.

The Bellman Equation is a fundamental principle in reinforcement learning that formulates the relationship between states and actions to find the optimal policy. It is used to iteratively update the Q-values.

Q-값 업데이트는 다음과 같은 재귀 방정식을 따릅니다.

Q-value updates follow the recursive equation shown below:

𝑄(𝑆, 𝐴) ← 𝑄(𝑆, 𝐴) + 𝛼 [ 𝑅 + 𝛾 max 𝑄(𝑆′, 𝑎′) − 𝑄(𝑆, 𝐴) ]

여기서,

Q(S,A): 현재 상태 S 에서 행동 A 를 했을 때의 Q-값
α (학습률, Learning Rate, 0~1): Q-값 업데이트 비율 조절
γ (할인율, Discount Factor, 0~1): 미래 보상의 가중치를 조절
R: 현재 행동을 수행한 후 받은 보상 (Reward)
max Q'(S', a'): 다음 상태 S'에서 최적 행동 a'의 Q'값

Where:
- 𝑄(𝑆, 𝐴): The Q-value of taking action 𝐴 in the current state 𝑆
- 𝛼 (Learning Rate, 0–1): Controls the rate at which Q-values are updated.
- 𝛾 (Discount Factor, 0–1): Adjusts the weight of future rewards.
- 𝑅 : The reward received after performing the action.
- max 𝑄(𝑆′, 𝑎′): The Q-value of the optimal action 𝑎′ in the next state 𝑆′
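Using the symbols defined above, a single update step can be sketched as follows. The function implements the standard Q-learning rule, Q(S, A) ← Q(S, A) + α [R + γ max Q(S′, a′) − Q(S, A)]; the function name and the example numbers are hypothetical.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    """Return the new Q(S, A) after observing one reward and the next state."""
    best_next = max(next_q_values, default=0.0)   # max Q(S', a'); 0 if S' has no actions
    return q_sa + alpha * (reward + gamma * best_next - q_sa)

# The ant moves to a room whose strongest exit smells 100, with no reward yet.
new_q = q_update(q_sa=0.0, reward=0.0, next_q_values=[0.0, 100.0, 50.0],
                 alpha=0.5, gamma=0.9)
print(round(new_q, 2))   # 0 + 0.5 * (0 + 0.9 * 100 - 0)
```

A learning rate of 0.5 moves the old value halfway toward the new estimate, which is why the result lands at half of 0.9 × 100.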

이 과정을 충분히 반복하면 Q-값이 수렴하며, 최적의 정책(optimal policy)을 찾을 수 있습니다.

By repeating this process sufficiently, the Q-values converge, leading to the discovery of the optimal policy.
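Putting initialization, ϵ-greedy selection, and the update rule together gives a complete training loop. The sketch below rests on several assumptions that are not in the source: rooms 1 to 9 are laid out row by row in a 3×3 grid with no interior walls (the actual maze in the figure has walls), the cake is worth 100, and the hyperparameter values are arbitrary.

```python
import random

# Actions follow the text: 1 = right, 2 = up, 3 = left, 4 = down.
MOVES = {1: (0, 1), 2: (-1, 0), 3: (0, -1), 4: (1, 0)}
START, GOAL = 1, 9

def next_room(room, action):
    """Return the neighbouring room, or None if that exit is closed."""
    row, col = divmod(room - 1, 3)
    d_row, d_col = MOVES[action]
    row, col = row + d_row, col + d_col
    if 0 <= row < 3 and 0 <= col < 3:
        return row * 3 + col + 1
    return None

def open_actions(room):
    return [a for a in MOVES if next_room(room, a) is not None]

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # A. Initialise Q(S, A) = 0 for every open exit.
    q = {room: {a: 0.0 for a in open_actions(room)} for room in range(1, 10)}
    for _ in range(episodes):
        room = START
        while room != GOAL:
            # B. Choose an action with the epsilon-greedy rule.
            if rng.random() < epsilon:
                action = rng.choice(list(q[room]))
            else:
                action = max(q[room], key=q[room].get)
            # C. Observe the reward and update the Q-value.
            new_room = next_room(room, action)
            reward = 100.0 if new_room == GOAL else 0.0
            best_next = 0.0 if new_room == GOAL else max(q[new_room].values())
            q[room][action] += alpha * (reward + gamma * best_next - q[room][action])
            room = new_room
    return q

def greedy_path(q, max_steps=20):
    """Follow the strongest scent from the start room."""
    path, room = [START], START
    while room != GOAL and len(path) <= max_steps:
        room = next_room(room, max(q[room], key=q[room].get))
        path.append(room)
    return path

q_table = train()
path = greedy_path(q_table)
print(path)
```

After training, following the highest Q-value from room 1 should trace one of the shortest routes to room 9. Which route is chosen depends on the random seed.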

D. 할인율(Discount Factor,𝛾)의 역할 (Role of the Discount Factor, 𝛾)

Q-Learning에서는 미래 보상의 가치를 현재 보상과 비교할 때 할인율(Discount Factor, 𝛾)을 적용합니다. 이는 페로몬 냄새가 시간이 지남에 따라 휘발하는 현상과 비슷하게 비유할 수 있습니다.

In Q-Learning, the discount factor 𝛾 is applied to compare the value of future rewards with present rewards. This concept can be analogized to pheromone trails gradually evaporating over time.

정확히 말하면, Q-값이 확률적으로 변화할 수 있기 때문에 이러한 불확실성을 반영하기 위해 할인율이 사용되고, 무한급수의 수렴 성질을 이용해 최적의 값으로 수렴하도록하기 위해 사용됩니다.

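More precisely, since Q-values can change probabilistically, the discount factor is used to account for this uncertainty. It also ensures, through the convergence property of infinite series, that the values converge rather than grow without bound.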

할인율 𝛾가 높을수록 → 먼 미래의 보상을 고려 (예: 마라톤 경기 전략 설계)
- Higher 𝛾 values → Consider long-term rewards (e.g., strategizing for a marathon race).

할인율 𝛾가 낮을수록 → 즉각적인 보상을 우선시 (예: 즉각적인 이익을 추구하는 도박 전략)
- Lower 𝛾 values → Prioritize immediate rewards (e.g., short-term gambling strategies).
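The effect of 𝛾 can be seen by computing a discounted sum of rewards directly. In this sketch the reward sequences and the two 𝛾 values are made up for illustration; the last line shows the convergence property: a constant reward of 1 received for a very long time approaches 1 / (1 − 𝛾) instead of growing without limit.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where a reward t steps away is scaled by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The cake (reward 100) arrives three steps in the future.
rewards = [0.0, 0.0, 0.0, 100.0]
far_sighted = discounted_return(rewards, gamma=0.9)     # 100 * 0.9**3
short_sighted = discounted_return(rewards, gamma=0.5)   # 100 * 0.5**3

# A reward of 1 at every step stays bounded: close to 1 / (1 - 0.9).
long_run = discounted_return([1.0] * 1000, gamma=0.9)

print(round(far_sighted, 2), round(short_sighted, 2), round(long_run, 2))
```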

E. DQN과의 차이(Differences from DQN)

Deep Q-Network (DQN)은 Q-table 대신 인공신경망을 이용하여 학습결과를 보존한다는 점에서 차이가 있습니다.

DQN 신경망에는 '상태'가 입력되고 각 '액션'의 Q-값이 출력되며, 이 중 Q-값이 가장 큰 액션이 선택됩니다. DQN은 딥러닝의 강력한 표현 학습 능력을 활용하여, 대규모 상태-행동 공간에서 Q 값을 근사(approximate)합니다.

The Deep Q-Network (DQN) differs from Q-Learning in that it uses a neural network instead of a Q-table to store learning results.

In a DQN, the state is fed as input to the neural network, and the network outputs a Q-value for each possible action; the action with the highest Q-value is then selected.
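To make the contrast with a Q-table concrete, the sketch below shows only the forward pass of a tiny Q-network: a room number goes in, and four Q-values (one per action) come out. Everything here is an assumption for illustration: the one-hot encoding, the single hidden layer of 8 units, and the random, untrained weights. Training, and the extra machinery used by practical DQNs such as experience replay and a target network, is left out.

```python
import random

N_STATES, N_ACTIONS, HIDDEN = 9, 4, 8
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Untrained weights; in a real DQN these are the parameters being learned.
W1 = random_matrix(HIDDEN, N_STATES)
W2 = random_matrix(N_ACTIONS, HIDDEN)

def one_hot(room):
    """Encode room 1..9 as a vector with a single 1."""
    return [1.0 if i == room - 1 else 0.0 for i in range(N_STATES)]

def matvec(matrix, vector):
    """One weighted sum per row of the matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def q_network(room):
    """State in, one Q-value per action out."""
    hidden = [max(0.0, h) for h in matvec(W1, one_hot(room))]   # ReLU
    return matvec(W2, hidden)

q_values = q_network(1)
best_action = q_values.index(max(q_values)) + 1   # actions are numbered 1..4
print([round(q, 3) for q in q_values], best_action)
```

The network replaces the table lookup: instead of storing one number per room and exit, it computes the Q-values from its weights, which is what lets the approach scale to state spaces too large to tabulate.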

<Fig. 12>

V. 결론(Conclusion)

강화학습과 지도학습은 학습 방식과 목표에서 큰 차이를 보입니다. 지도학습은 라벨이 지정된 데이터셋을 기반으로 입력과 출력 간의 패턴을 학습하는 방식이며, 학습 과정에서 명확한 정답(라벨)이 주어지는 즉각적인 피드백을 받습니다.

반면, 강화학습은 에이전트가 환경과 상호작용하며 시행착오를 통해 최적의 행동을 학습하는 방식입니다. 이 과정에서 즉각적인 정답이 주어지지 않고, 누적된 보상을 극대화하는 방향으로 학습이 진행됩니다.

Reinforcement learning and supervised learning exhibit significant differences in their learning methods and objectives. Supervised learning is a method that learns patterns between inputs and outputs based on labeled datasets, receiving immediate feedback in the form of explicit correct answers (labels) during the training process.

In contrast, reinforcement learning involves an agent interacting with an environment and learning optimal behaviors through trial and error. In this process, immediate correct answers are not provided, and learning progresses in a way that maximizes cumulative rewards.

강화학습의 가장 중요한 특징 중 하나는 탐색(Exploration)과 활용(Exploitation)의 균형을 유지하는 학습 전략입니다. 에이전트는 ϵ-탐욕(ϵ-greedy) 정책을 통해 새로운 행동을 시도하면서도 기존 경험을 활용하여 최적의 행동을 찾아갑니다. 또한, 벨만 방정식(Bellman Equation)을 기반으로 Q-값을 지속적으로 업데이트하면서 학습을 최적화합니다.

One of the most crucial characteristics of reinforcement learning is its strategy for balancing exploration and exploitation. An agent follows an ϵ-greedy policy to explore new actions while also leveraging past experiences to determine the optimal behavior. Additionally, reinforcement learning optimizes learning by continuously updating Q-values based on the Bellman Equation.

미래 보상의 가치를 현재 보상과 비교할 때, 할인율(Discount Factor)을 적용하는 것도 강화학습의 핵심 요소입니다. 할인율 값이 높을수록 장기적인 보상을 고려하는 방향으로 학습이 이루어지며, 낮을수록 즉각적인 보상을 우선시하는 학습이 진행됩니다. 나아가 할인율은 학습 계산과정을 결국 수렴하게 합니다.

Another core concept in reinforcement learning is the application of the discount factor to compare the value of future rewards with present rewards. A higher discount factor encourages learning that considers long-term rewards, while a lower discount factor prioritizes immediate rewards. The discount factor also ensures that the learning computation eventually converges.

또한, Q-Learning을 확장한 딥 Q-네트워크(DQN)는 인공신경망을 활용하여 Q-값을 근사하는 방식으로, 복잡한 환경에서도 효과적으로 학습할 수 있도록 합니다. 기존 Q-테이블 방식은 상태와 행동 공간이 커질수록 한계를 보이지만, DQN은 신경망을 활용하여 고차원 데이터를 다룰 수 있는 확장성을 제공합니다.

Furthermore, Deep Q-Network (DQN), an extension of Q-Learning, utilizes artificial neural networks to approximate Q-values, enabling effective learning even in complex environments. Unlike traditional Q-tables, which become impractical as the state-action space expands, DQN leverages deep learning to handle high-dimensional data, providing greater scalability.

강화학습은 게임 AI, 로봇 제어, 자율주행, 금융 최적화 등 다양한 분야에서 활용되며, 복잡한 환경에서 최적의 의사 결정을 수행하는 데 강점을 가지고 있습니다. 이러한 특징으로 인해 강화학습은 미래의 인공지능 연구와 실제 응용 분야에서 중요한 역할을 할 것으로 기대됩니다.

Reinforcement learning is widely applied in various fields, including game AI, robotics, autonomous driving, and financial optimization, excelling in making optimal decisions in complex environments. Due to these strengths, reinforcement learning is expected to play a vital role in future AI research and real-world applications.

[EoD]
