Model-Free Reinforcement Learning

Last Lecture
- Model-Free Prediction
- Estimate the value function of an unknown MDP
This Lecture
- Model-Free Control
- Optimize the value function of an unknown MDP

Use of Model-Free Control

Some example problems that can be modelled as MDPs

Elevator, Parallel Parking, Ship Steering, Bioreactor, Helicopter, Aeroplane Logistics, Robocup Soccer, Quake, Portfolio management, Protein Folding, Robot walking, Game of Go

Model-free control can also handle the following cases → use it when the problem has to be solved by sampling

MDP model is unknown, but experience can be sampled
MDP model is known, but is too big to use, except by samples

On and Off-Policy Learning

On-policy learning

(policy being optimized) == (behaviour policy that gathers experience in the environment)

Learn on the job
Learn about policy π from experience sampled from π

Off-policy learning

(policy being optimized) != (behaviour policy that gathers experience in the environment)

Look over someone’s shoulder → learn from the experience of another agent
Learn about policy π from experience sampled from µ

On-Policy MC Control > Generalised Policy Iteration (Refresher)

On-Policy MC Control > Generalised Policy Iteration With Monte-Carlo Evaluation

Policy improvement Greedy policy improvement?
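- Not possible with V. Greedy improvement over V(s) needs the MDP model (rewards and transition probabilities) to look ahead to the next state, and a model-free method does not know the MDP
With Monte-Carlo evaluation, very long episodes take a long time to evaluate, and paths that were never experienced are never learned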

On-Policy MC Control > Model-Free Policy Iteration Using Action-Value Function

Greedy improvement over V(s) requires a model of the MDP to evaluate the value of the next state
Greedy improvement over Q(s, a) is model-free: in state s, simply take the action with the largest Q value

On-Policy MC Control > Generalized Policy Iteration with Action Value Function

Policy improvement Greedy policy improvement?
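- Possible with Q. The set of available actions is known, so the policy can simply pick the action with the highest Q value
- Acting only greedily does not visit enough of the state space. The example that follows shows this
Policy evaluation updates the state-action value function Q instead of V
Repeating policy evaluation and improvement leads to the optimal q and, with it, the optimal policy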
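
A minimal sketch of greedy policy improvement from a Q table. The dictionary layout, the state and action names, and the numbers are made up for illustration; the point is only that no MDP model is needed to take the argmax.

```python
def greedy_policy(Q, states, actions):
    """For each state, return the action with the largest Q(s, a)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Hypothetical Q table keyed by (state, action).
states = ["s0", "s1"]
actions = ["left", "right"]
Q = {
    ("s0", "left"): 0.2, ("s0", "right"): 1.0,
    ("s1", "left"): 0.7, ("s1", "right"): 0.1,
}
policy = greedy_policy(Q, states, actions)
print(policy)
```

Doing the same with V(s) would require the transition probabilities to compute the expected value of the next state.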

On-Policy MC Control > Example of Greedy Action Selection

The agent ends up choosing the right door every time
It has almost no experience of opening the left door, so this cannot be called a true optimum
This motivates ε-greedy exploration

On-Policy MC Control > ε-Greedy Exploration

Simplest idea for ensuring continual exploration
All m actions are tried with non-zero probability
With probability 1 - ε choose the greedy action
- i.e. exploit the best-known action with probability 1 - ε
With probability ε choose an action at random

Advantages of ε-greedy exploration

ε guarantees that every action keeps being explored
The greedy choice made with probability 1 - ε guarantees that the policy keeps improving
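
ε-greedy action selection can be sketched as follows. The function name, the list-of-Q-values input, and the tie-breaking rule are assumptions made for this example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from the Q values of one state.

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the largest Q value (ties go to the lowest index).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.5, 0.2]          # made-up Q values for m = 3 actions
counts = [0, 0, 0]
for _ in range(10_000):
    counts[epsilon_greedy(q_values, epsilon=0.1, rng=rng)] += 1
print(counts)
```

Because the random branch can also land on the greedy action, the greedy action is picked with probability 1 - ε + ε/m and every other action with probability ε/m, so all three counts stay above zero.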

On-Policy MC Control > ε-Greedy Policy Improvement

On-Policy MC Control > Monte-Carlo Policy Iteration

On-Policy MC Control > Monte-Carlo Control

Monte-Carlo evaluation is run as soon as a single episode has been generated
ε-greedy policy improvement follows immediately after

On-Policy MC Control > GLIE

All state-action pairs are explored infinitely many times: the exploration condition
The policy converges on a greedy policy: the exploitation condition. The policy must eventually become greedy, e.g. by using ε_k = 1/k

On-Policy MC Control > GLIE Monte-Carlo Control

Similar to the "Monte-Carlo Policy Evaluation" part of Lecture 4
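
A rough sketch of GLIE Monte-Carlo control, assuming an every-visit incremental-mean update and the schedule ε_k = 1/k. The two-door task, its reward ranges, and all function names are hypothetical and only serve to make the loop runnable.

```python
import random
from collections import defaultdict

def glie_mc_update(Q, N, episode, gamma=1.0):
    """Every-visit Monte-Carlo update of Q from one finished episode.

    episode is a list of (state, action, reward) tuples, where reward is
    the reward received after taking the action.
    """
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                  # return from this step onward
        N[(state, action)] += 1
        Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

def epsilon_greedy(Q, state, actions, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def two_door_episode(Q, epsilon, rng):
    """Hypothetical one-step task: 'left' pays about 2, 'right' about 0.5."""
    action = epsilon_greedy(Q, "room", ["left", "right"], epsilon, rng)
    reward = rng.uniform(1.5, 2.5) if action == "left" else rng.uniform(0.0, 1.0)
    return [("room", action, reward)]

rng = random.Random(0)
Q, N = defaultdict(float), defaultdict(int)
for k in range(1, 2001):
    # GLIE schedule: epsilon_k = 1/k shrinks towards the greedy policy.
    episode = two_door_episode(Q, epsilon=1.0 / k, rng=rng)
    # Evaluate right after the episode; the next episode acts on the new Q.
    glie_mc_update(Q, N, episode)
print(dict(Q), dict(N))
```

The step size 1/N(s, a) turns the update into a running mean of the sampled returns, which is the same incremental form used for Monte-Carlo prediction.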

On-Policy MC Control > Monte-Carlo Control in Blackjack

On-Policy TD Learning > MC vs. TD Control

Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)

Lower variance
Online: values can be updated before the episode ends
Incomplete sequences

Natural idea: use TD instead of MC in our control loop

Apply TD to Q(S, A)
Use ε-greedy policy improvement
Update every time-step

On-Policy TD Learning > Updating Action-Value Functions with Sarsa

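Taking action A in state S gives reward R and leads to state S'. In state S', the next action A' is chosen.
The sequence S, A, R, S', A' gives the Sarsa algorithm its name.
Q is a lookup table of size |S| (columns) x |A| (rows). Because this is TD, it can be updated before the episode ends.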

On-Policy TD Learning > On-Policy Control With Sarsa

Each arrow in the figure is a single step, not a whole episode
One step updates Q, and the next step is taken ε-greedily with respect to the updated Q. This repeats

On-Policy TD Learning > Sarsa Algorithm for On-Policy Control

1. Initialize Q(s, a) with random values
2. In the first state, choose an action with the ε-greedy policy
3. Take action A, then observe reward R and next state S'
4. In state S', choose action A' with the ε-greedy policy, and use it to update Q

Repeat steps 3-4
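
The steps above can be sketched on a toy problem. The corridor environment, the hyperparameters, and the zero initialization (instead of random values) are assumptions for this example; the update line is the usual Sarsa rule Q(S, A) ← Q(S, A) + α(R + γQ(S', A') − Q(S, A)).

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
GOAL = 4                      # hypothetical corridor: states 0..4, state 4 is terminal

def step(state, action):
    """Toy environment: move one cell, reward -1 per step until the goal."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    return -1.0, next_state, next_state == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                       # step 1 (zeros instead of random values)
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state, epsilon, rng)                # step 2
        done = False
        while not done:
            reward, next_state, done = step(state, action)             # step 3
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)  # step 4
            # Q of the terminal state is never updated, so it stays at 0.
            target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q

Q = sarsa()
for s in range(GOAL):
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})
```

The action used in the target is the one the agent actually takes next, which is what makes Sarsa on-policy.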

On-Policy TD Learning > Convergence of Sarsa

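Sarsa converges to the optimal action-value function when two conditions hold:
GLIE: the sequence of policies is GLIE (see above)
Robbins-Monro: a condition on the step sizes α, which set how far each update moves: Σ α_t = ∞ and Σ α_t² < ∞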
- https://en.wikipedia.org/wiki/Stochastic_approximation#Robbins%E2%80%93Monro_algorithm

On-Policy TD Learning > Windy Gridworld Example

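The number under each column is the wind strength: 0 means no wind, and a value of 1 or more pushes the agent up by that many cells
The example appears to use king's moves, i.e. 8 directions including diagonals
The first episode needs about 2,000 steps to finish → the agent moves at random, so it takes a long time
Once the goal has been reached a reward is observed. Bootstrapping then propagates that information, so later episodes reach the goal within a few hundred steps
The Q table has |S| x |A| = about 70 x 8 entries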
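
A sketch of the environment dynamics, to make the wind and the table size concrete. The 7 x 10 layout, the wind strengths, the start and goal cells, and the reward of -1 per step follow the common textbook version of this example and are assumptions here; these notes do not state them. The wind applied is that of the column the agent leaves.

```python
ROWS, COLS = 7, 10                          # 7 x 10 = 70 states
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]       # upward push per column (assumed layout)
START, GOAL = (3, 0), (3, 7)                # (row, col), row 0 is the top row

# King's moves: the 8 neighbouring cells, diagonals included.
KING_MOVES = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]

def windy_step(state, move):
    """Apply a move plus the wind of the current column, then clip to the grid."""
    row, col = state
    d_row, d_col = move
    new_row = row + d_row - WIND[col]       # wind pushes up, i.e. towards row 0
    new_col = col + d_col
    new_row = min(max(new_row, 0), ROWS - 1)
    new_col = min(max(new_col, 0), COLS - 1)
    next_state = (new_row, new_col)
    return -1.0, next_state, next_state == GOAL   # reward of -1 per step (assumed)

q_table_size = ROWS * COLS * len(KING_MOVES)
print(q_table_size)
```

With these numbers the Q table has 70 x 8 = 560 entries, matching the estimate above.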

On-Policy TD Learning > n-Step Sarsa

Forward View Sarsa(λ)

Backward View Sarsa(λ)

On-Policy TD Learning > Sarsa(λ) Algorithm

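One-step Sarsa: taking one action in one state updates only that single cell
Sarsa(λ): taking one action in one state updates every cell, in proportion to its eligibility trace
→ Information propagates faster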
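
A minimal sketch of one backward-view Sarsa(λ) update, assuming accumulating eligibility traces. The function name, the three-step path, and the parameter values are made up for illustration.

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s2, a2, alpha, gamma, lam):
    """One backward-view Sarsa(lambda) update with accumulating traces."""
    delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]   # one-step TD error
    E[(s, a)] += 1.0                              # mark the pair just visited
    for key in list(E):
        Q[key] += alpha * delta * E[key]          # every traced cell moves, scaled by its trace
        E[key] *= gamma * lam                     # traces decay at each step

# Hypothetical 3-step path to a goal: rewards 0, 0, then 1 on arrival.
Q, E = defaultdict(float), defaultdict(float)
path = [(0, "right", 0.0, 1, "right"),
        (1, "right", 0.0, 2, "right"),
        (2, "right", 1.0, 3, "right")]            # state 3 is terminal, Q there stays 0
for s, a, r, s2, a2 in path:
    sarsa_lambda_step(Q, E, s, a, r, s2, a2, alpha=0.5, gamma=1.0, lam=0.5)
print({k: v for k, v in Q.items() if v})
```

After the single rewarded step, all three cells on the path have moved: 0.5 for the last cell, 0.25 for the one before it, and 0.125 for the first. One-step Sarsa would have changed only the last cell.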

Sarsa(λ) Gridworld Example

One-step Sarsa updates only the cell just before the goal → it learns that a single action was good
Sarsa(λ) updates every cell on the path taken, in proportion to its eligibility trace

Off-Policy Learning

The behaviour policy µ is the policy that actually samples the actions.

The goal is not to imitate the human or other agent, but to learn the optimal policy from their experience.

Evaluate target policy π(a|s) to compute v_π(s) or q_π(s, a)
While following behaviour policy µ(a|s)
- {S₁, A₁, R₂, ..., S_T} ∼ µ
Why is this important?
- Learn from observing humans or other agents
- Re-use experience generated from old policies π₁, π₂, ..., π_t−1
- Learn about optimal policy while following exploratory policy
- Learn about multiple policies while following one policy

Importance Sampling

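X is a sample drawn from the probability distribution P
It does not really work in practice (the variance becomes very large); it matters mostly as a concept.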
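
The idea can be checked numerically on a small discrete example: the expectation of f(X) under P equals the expectation under another distribution Q once each sample is reweighted by P(X)/Q(X). The two distributions, the function f, and the sample size are made-up values for this sketch.

```python
import random

P = {"a": 0.8, "b": 0.2}          # target distribution (made-up numbers)
Q = {"a": 0.5, "b": 0.5}          # distribution the samples actually come from
f = {"a": 1.0, "b": 10.0}

exact = sum(P[x] * f[x] for x in P)                        # E_P[f(X)]
reweighted = sum(Q[x] * (P[x] / Q[x]) * f[x] for x in Q)   # E_Q[(P/Q) f(X)]

# Sample-based estimate: draw from Q, weight each sample by P(x)/Q(x).
rng = random.Random(0)
samples = rng.choices(list(Q), weights=list(Q.values()), k=100_000)
estimate = sum(P[x] / Q[x] * f[x] for x in samples) / len(samples)
print(exact, reweighted, estimate)
```

In off-policy learning P plays the role of the target policy π and Q the role of the behaviour policy µ. For Monte-Carlo returns the ratios are multiplied along the whole episode, which is where the large variance comes from.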

Importance Sampling for Off-Policy MC

Importance Sampling for Off-Policy TD

Off-Policy Learning > Q-Learning

Off-Policy Control with Q-Learning

Q-Learning Control Algorithm

Also called Sarsa-max
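
The name comes from the update target, which takes the maximum over next actions instead of the action actually taken: Q(S, A) ← Q(S, A) + α(R + γ max_a' Q(S', a') − Q(S, A)). A minimal sketch of that single update follows; the function signature and the numbers are assumptions for this example.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, done=False):
    """Move Q(s, a) towards r + gamma * max_a' Q(s2, a')."""
    best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
Q[("s1", "left")], Q[("s1", "right")] = 1.0, 3.0     # made-up values for the next state
q_learning_update(Q, "s0", "right", r=1.0, s2="s1",
                  actions=["left", "right"], alpha=0.5, gamma=0.9)
print(Q[("s0", "right")])
```

Here the target is 1 + 0.9 x 3 = 3.7 regardless of which action the behaviour policy takes in s1. Sarsa would instead use the Q value of the action it actually chose there.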

Relationship Between DP and TD

[RL]Lecture #5 - Model-Free Control