
[RL] Lecture #3 - Planning by Dynamic Programming

by 커피콩 2022. 3. 6.

 


    Planning

    • The environment is fully known (we know the reward function and the state transitions)
    • The agent does not interact with the environment; it improves its policy by looking ahead at possible situations through internal computation

     

    Dynamic Programming

    • A general method for solving complex problems
    • Break a large problem into smaller subproblems → combine the solutions to those subproblems
    • Requirements for Dynamic Programming
      • Two conditions must hold for dynamic programming to apply
        • Optimal substructure
          • Principle of optimality applies
          • Optimal solution can be decomposed into subproblems
        • Overlapping subproblems
          • Subproblems recur many times
          • Solution can be cached and reused
    • Markov decision processes satisfy both properties (MDPs meet both conditions; a toy sketch follows this list)
      • Bellman equation gives recursive decomposition
      • Value function stores and reuses solutions
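
    As a toy illustration (outside RL, so purely an assumption of mine) of why these two properties matter: the problem below decomposes recursively, and the same subproblems recur, so cached solutions can be reused. The cache plays the role the value function plays for an MDP.

        from functools import lru_cache

        @lru_cache(maxsize=None)
        def fib(n: int) -> int:
            # Recursive decomposition; the cache stores and reuses subproblem solutions.
            return n if n < 2 else fib(n - 1) + fib(n - 2)

        print(fib(50))  # fast, because each subproblem is solved only once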

     

    Planning by Dynamic Programming

    • Dynamic programming assumes full knowledge of the MDP
      • Presumably because an MDP assumes a fully observable environment (agent state == environment state == information state)
    • The planning problems are classified as follows
      • For Prediction
        • Prediction: evaluate the future, i.e. find the value function vπ of the given policy
        • Input
          • MDP <S, A, P, R, γ> and Policy π
          • or MRP <S, Pπ, Rπ, γ>
        • Output
          • Value Function vπ
      • or For Control
        • Control: optimise the future, i.e. find the optimal policy
        • Input
          • MDP <S, A, P, R, γ>
        • Output
          • Optimal Value Function v*
          • and Optimal Policy π*

     

    Other Applications of Dynamic Programming

    • Dynamic programming is used to solve many other problems, e.g.
      • Scheduling algorithms
      • String algorithms (e.g. sequence alignment)
      • Graph algorithms (e.g. shortest path algorithms)
      • Graphical models (e.g. Viterbi algorithm)
      • Bioinformatics (e.g. lattice models)

     

    Policy Evaluation

    The Prediction problem

    Iterative Policy Evaluation

    • Problem: evaluate a given policy π
      • → If we know how much return we get by following this policy, we have evaluated the policy; that quantity is the value function
      • → In other words, find the value function obtained by following this policy
    • Solution: iterative application of Bellman expectation backup
      • We will solve this iteratively using the Bellman expectation equation
    • v1 → v2 → ... → vπ
      • Start with an arbitrary v1 (e.g. the value of every state set to 0)
      • Repeating the update, the values converge to vπ
    • backup: a place where intermediate results are stored temporarily, like a cache
    • Using synchronous backups,
      • At each iteration k + 1
        • If we run, say, 10 iterations, this is like for (int k = 0; k < 10; ++k)
      • For all states s ∈ S
        • Suppose there are 10 states; all 10 are updated (initially they hold arbitrary values)
      • Update vk+1(s) from vk(s')
        • Update the value of every state using the Bellman expectation backup (see the sketch after this list)
      • where s' is a successor state of s
        • Updating every state once is called a full sweep; since all states are updated together at each iteration, this is a synchronous backup
        • s' denotes the states reachable from s
    • We will discuss asynchronous backups later
    • Convergence to vπ will be proven at the end of the lecture
      • Once the exact values are reached, further sweeps no longer change them
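
    As a concrete version of the synchronous backup described above, here is a minimal policy-evaluation sketch. The data layout is an assumption for illustration only: P[s][a] is a list of (prob, next_state, reward) triples and policy[s][a] is π(a|s); neither name comes from the lecture.

        def policy_evaluation(P, policy, gamma=1.0, theta=1e-8):
            # Synchronous iterative policy evaluation (a sketch, not the lecture's code).
            v = [0.0] * len(P)                       # v1: every state starts at 0
            while True:
                v_new = [0.0] * len(P)               # synchronous: build v_{k+1} from v_k
                for s in range(len(P)):
                    for a, pi_sa in policy[s].items():
                        for prob, s_next, reward in P[s][a]:
                            # Bellman expectation backup
                            v_new[s] += pi_sa * prob * (reward + gamma * v[s_next])
                delta = max(abs(v_new[s] - v[s]) for s in range(len(P)))
                v = v_new
                if delta < theta:                    # values stop changing: converged to v_pi
                    return v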

     

    Evaluating a Random Policy in the Small Gridworld

    • Probability of moving in each of the four directions (N/E/S/W): 25% → π(a|s) = 1/4
    • Rsa = -1, γ = 1
    • Example
    • At k = 1
      • (figure: gridworld value function after the first sweep)
      • v1(s) = (1/4 * (-1 + 1 * 1 * 0)) * (4 directions) = -1/4 * 4 = -1 (checked in the sketch below)

     

    Policy Iteration

    The Control problem

    How to Improve a Policy

    • Given a policy π (repeat the two steps below)
      • Evaluate the policy π (find its value function)
        • vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
      • Improve the policy by acting greedily with respect to vπ (build a new policy that acts greedily with respect to the value function just found)
        • π' = greedy(vπ)
    • In Small Gridworld improved policy was optimal, π' = π*
    • In general, need more iterations of improvement / evaluation
    • But this process of policy iteration always converges to π* (sketched below)
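
    The two alternating steps can be sketched as below, reusing the policy_evaluation function and the P[s][a] = [(prob, next_state, reward), ...] convention assumed earlier; the function names are illustrative, not the lecture's.

        def greedy_policy(P, v, gamma=1.0):
            # pi' = greedy(v_pi): put all probability on the best one-step-lookahead action.
            policy = []
            for s in range(len(P)):
                q = {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]}
                best = max(q, key=q.get)
                policy.append({a: float(a == best) for a in P[s]})
            return policy

        def policy_iteration(P, gamma=1.0):
            # Start from the uniform random policy, then alternate evaluate / improve.
            policy = [{a: 1.0 / len(P[s]) for a in P[s]} for s in range(len(P))]
            while True:
                v = policy_evaluation(P, policy, gamma)    # 1) evaluate pi
                new_policy = greedy_policy(P, v, gamma)    # 2) act greedily w.r.t. v_pi
                if new_policy == policy:                   # greedy policy stopped changing
                    return new_policy, v                   # pi*, v*
                policy = new_policy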

     

    Policy Iteration

     

    Example: Jack's Car Rental

     

    Policy Improvement

    • Consider a deterministic policy, a = π(s)
    • We can improve the policy by acting greedily
      • A policy π' that acts greedily with respect to qπ is at least as good as π
    • This improves the value from any state s over one step (see the reconstruction below),
    • It therefore improves the value function, vπ'(s) ≥ vπ(s)
    • If improvements stop,
    • Then the Bellman optimality equation has been satisfied
    • Therefore vπ(s) = v*(s) for all s ∈ S
    • so π is an optimal policy
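
    The bullets above compress the standard argument; reconstructed in LaTeX (with the greedy policy π'(s) = argmax over a of qπ(s, a)) it reads:

        % acting greedily improves the value over one step
        q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)
        % unrolling this inequality along the trajectory improves the value function
        v_{\pi'}(s) \ge v_\pi(s) \quad \forall s \in \mathcal{S}
        % if improvements stop, the maximum is attained by pi itself,
        \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)
        % which is the Bellman optimality equation, so v_\pi = v_* and pi is an optimal policy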

     

    Modified Policy Iteration

    • Does policy evaluation need to converge to vπ?
    • Or should we introduce a stopping condition
      • e.g. ε-convergence of the value function (see the sketch after this list)
    • Or simply stop after k iterations of iterative policy evaluation?
    • For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
    • Why not update policy every iteration? i.e. stop after k = 1
      • This is equivalent to value iteration (next section)
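
    A sketch of the stopping options above, under the same assumed P / policy layout: stop when the value function changes by less than ε, or after at most max_sweeps sweeps. Using max_sweeps = 1 inside the policy-iteration loop gives the value-iteration-equivalent scheme mentioned in the last bullet.

        def truncated_policy_evaluation(P, policy, gamma=1.0, epsilon=1e-6, max_sweeps=None):
            v = [0.0] * len(P)
            sweep = 0
            while max_sweeps is None or sweep < max_sweeps:
                v_new = [sum(pi_sa * p * (r + gamma * v[s2])
                             for a, pi_sa in policy[s].items()
                             for p, s2, r in P[s][a])
                         for s in range(len(P))]
                delta = max(abs(x - y) for x, y in zip(v_new, v))
                v, sweep = v_new, sweep + 1
                if delta < epsilon:          # epsilon-convergence stopping condition
                    break
            return v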

     

    Generalized Policy Iteration

     

    Value Iteration

    The Control problem

    Principle of Optimality

    • Any optimal policy can be subdivided into two components
      • An optimal first action A*
      • Followed by an optimal policy from successor state S'

     

    Deterministic Value Iteration

    • If we know the solution to subproblems v*(s')
    • Then the solution v*(s) can be found by one-step lookahead (reconstructed below)
      • If we know the solution to the subproblems v*(s'), we can obtain v*(s) from v*(s') with a one-step lookahead
    • The idea of value iteration is to apply these updates iteratively
    • Intuition: start with final rewards and work backwards
      • Start from the final rewards and work backwards
    • Still works with loopy, stochastic MDPs
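
    The one-step lookahead referred to above is the Bellman optimality backup; in the lecture's notation it can be written as

        v_*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s') \right)

    and value iteration simply applies this as the update v_{k+1}(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_k(s') \right).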

     

    Example: Shortest Path

     

    Value Iteration

    • Problem: Find optimal policy π
    • Solution: iterative application of Bellman optimality backup
    • v1 → v2 → ... → v*
    • Using synchronous backups
      • At each iteration k + 1
      • For all states s ∈ S
      • Update vk+1(s) from vk(s')
    • Convergence to v* will be proven later
    • Unlike policy iteration, there is no explicit policy
    • Intermediate value functions may not correspond to any policy
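
    A minimal value-iteration sketch under the same assumed P[s][a] layout. Matching the last two bullets, no policy is maintained during the loop; a greedy policy is only read off at the end.

        def value_iteration(P, gamma=1.0, theta=1e-8):
            v = [0.0] * len(P)
            while True:
                # Bellman optimality backup for every state (synchronous sweep).
                v_new = [max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                             for a in P[s])
                         for s in range(len(P))]
                delta = max(abs(x - y) for x, y in zip(v_new, v))
                v = v_new
                if delta < theta:
                    break
            # Extract a greedy policy only once v has (approximately) converged to v*.
            policy = [max(P[s], key=lambda a, s=s: sum(p * (r + gamma * v[s2])
                                                       for p, s2, r in P[s][a]))
                      for s in range(len(P))]
            return v, policy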

     

    Synchronous Dynamic Programming Algorithms

     

    Asynchronous Dynamic Programming

    • DP methods described so far used synchronous backups
      • All the DP methods so far use synchronous backups: every state is backed up in parallel
    • i.e. all states are backed up in parallel
    • Asynchronous DP backs up states individually, in any order
      • Meaning we may back up only certain states, or back them up in a different order
    • For each selected state, apply the appropriate backup
    • Can significantly reduce computation
    • Guaranteed to converge if all states continue to be selected
    • Three simple ideas for asynchronous dynamic programming
      • In-place dynamic programming
      • Prioritised sweeping
      • Real-time dynamic programming

     

    In-place Dynamic Programming

    This is closer to a coding technique.

    • Synchronous value iteration stores two copies of value function
      • Two tables are needed: the value-function table from the previous sweep and the one being updated
    • In-place value iteration only stores one copy of value function
      • Keeps only a single value-function table; a state's just-updated value is used immediately (contrast sketched below)
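
    A minimal contrast with the synchronous version, at the coding-technique level noted above: a single value table is kept, and a freshly updated v[s] is used immediately by later states in the same sweep (same assumed P[s][a] layout as before).

        def in_place_value_iteration(P, gamma=1.0, sweeps=100):
            v = [0.0] * len(P)                 # only one copy of the value function
            for _ in range(sweeps):
                for s in range(len(P)):
                    # Overwrite v[s] right away; later states in this sweep see the new value.
                    v[s] = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                               for a in P[s])
            return v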

     

    Prioritised Sweeping

    Update states in order of importance; states with a large Bellman error are regarded as the important ones (a sketch follows the list below).

    • Use magnitude of Bellman error to guide state selection,
    • Backup the state with the largest remaining Bellman error
    • Update Bellman error of affected states after each backup
    • Requires knowledge of reverse dynamics (predecessor states)
    • Can be implemented efficiently by maintaining a priority queue
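
    A hedged sketch of the idea: keep states in a priority queue ordered by the magnitude of their Bellman error, back up the worst one, then refresh the errors of its predecessors. The predecessors argument stands in for the reverse dynamics mentioned above, and the stale-entry handling is a simplification of my own.

        import heapq

        def bellman_error(P, v, s, gamma):
            backup = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
            return abs(backup - v[s]), backup

        def prioritised_sweeping(P, predecessors, gamma=1.0, theta=1e-8, max_backups=100_000):
            v = [0.0] * len(P)
            pq = [(-bellman_error(P, v, s, gamma)[0], s) for s in range(len(P))]
            heapq.heapify(pq)                                  # max-priority via negated errors
            backups = 0
            while pq and backups < max_backups:
                _, s = heapq.heappop(pq)
                err, backup = bellman_error(P, v, s, gamma)    # recompute; the entry may be stale
                if err < theta:
                    continue
                v[s] = backup                                  # back up the worst remaining state
                backups += 1
                for pred in predecessors[s]:                   # their Bellman errors have changed
                    heapq.heappush(pq, (-bellman_error(P, v, pred, gamma)[0], pred))
            return v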

     

    Real-Time Dynamic Programming

    • Idea: only states that are relevant to agent
      • Update only the states the agent actually visits; the earlier methods apply the backup to every state regardless, so they need no agent at all (see the sketch below)
    • Use agent’s experience to guide the selection of states
    • After each time-step St, At, Rt+1
    • Backup the state St
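
    A hedged sketch, assuming a simple env object with reset() returning a start state and step(a) returning (next_state, reward, done); these are illustrative assumptions, not part of the lecture. The agent's own trajectory decides which state gets backed up, but the model P is still needed for the backup itself.

        def greedy_action(P, v, s, gamma):
            # One-step lookahead used to generate the agent's experience.
            return max(P[s], key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))

        def real_time_dp(P, env, gamma=1.0, episodes=100):
            v = [0.0] * len(P)
            for _ in range(episodes):
                s = env.reset()
                done = False
                while not done:
                    # Back up only the state S_t the agent is currently visiting.
                    v[s] = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                               for a in P[s])
                    s, _, done = env.step(greedy_action(P, v, s, gamma))
            return v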

     

    Full-Width Backups

    • DP uses full-width backups
      • All the methods covered so far fall into this category
    • For each backup (sync or async)
      • Every successor state and action is considered
        • Consider every successor state s' reachable from s
      • Using knowledge of the MDP transitions and reward function
    • DP is effective for medium-sized problems (millions of states)
    • For large problems DP suffers Bellman’s curse of dimensionality
      • Full-width backups cannot be applied to large problems
      • Number of states n = |S| grows exponentially with number of state variables
    • Even one backup can be too expensive

     

    Sample Backups

    • In subsequent lectures we will consider sample backups
    • Using sample rewards and sample transitions <S, A, R, S'>
    • Instead of reward function R and transition dynamics P
    • Advantages
      • Model-free: no advance knowledge of MDP required
      • Breaks the curse of dimensionality through sampling
      • Cost of backup is constant, independent of n = |S|

     

     

     
