Table of Contents
Planning
- The environment is known (the rewards and state transitions are known)
- The agent improves its policy through internal computation, looking ahead at possible situations without interacting with the environment
Dynamic Programming
- A methodology for solving complex problems
- Break a big problem into smaller subproblems → combine the solutions of the subproblems
- Requirements for Dynamic Programming
- Two conditions must hold in order to apply it
- Optimal substructure
- Principle of optimality applies
- Optimal solution can be decomposed into subproblems
- Overlapping subproblems
- Subproblems recur many times
- Solution can be cached and reused
- Markov decision processes satisfy both properties
- Bellman equation gives recursive decomposition
- Value function stores and reuses solutions
Planning by Dynamic Programming
- Dynamic programming assumes full knowledge of the MDP
- Presumably because an MDP is a fully observable environment (agent state == environment state == information state)
- How the problems are classified
- For Prediction
- Prediction: evaluate the future; find the value function vπ for the given policy
- Input
- MDP <S, A, P, R, γ> and Policy π
- or an MRP <S, Pπ, Rπ, γ>
- Output
- Value Function vπ
- For Control
- Control: optimise the future; find the optimal policy
- Input
- MDP <S, A, P, R, γ>
- Output
- Optimal Value Function v*
- and Optimal Policy π*
Other Applications of Dynamic Programming
- Dynamic programming is used to solve many other problems, e.g.
- Scheduling algorithms
- String algorithms (e.g. sequence alignment)
- Graph algorithms (e.g. shortest path algorithms)
- Graphical models (e.g. Viterbi algorithm)
- Bioinformatics (e.g. lattice models)
Policy Evaluation
The Prediction problem
Iterative Policy Evaluation
- Problem: evaluate a given policy π
- → If we know how much return we get by following this policy, we have evaluated the policy = the value function
- → In other words, find the value function obtained by following this policy
- Solution: iterative application of Bellman expectation backup
- Solve it iteratively using the Bellman expectation equation (the update rule is written out after this list)
- v1 → v2 → ... → vπ
- Start from an arbitrary v1 (e.g., the value of every state is 0)
- Repeating this process, it converges to vπ
- backup: temporarily storing intermediate values for reuse, like a cache
- Using synchronous backups,
- At each iteration k + 1
- If we assume k runs up to 10, this is essentially for (int k = 0; k < 10; ++k)
- For all states s ∈ S
- Suppose there are 10 states. For all 10 states (initialised with arbitrary values)
- Update vk+1(s) from vk(s')
- Update the value of every state using the Bellman expectation equation
- where s' is a successor state of s
- Updating every state like this is called a full sweep; because every state is updated once per iteration, it is a synchronous backup
- s' are the states reachable from s
- We will discuss asynchronous backups later
- Convergence to vπ will be proven at the end of the lecture
- Once the exact values are reached, the values no longer change
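The synchronous backup described in this list, written out as an update rule (this is the Bellman expectation equation applied iteratively):

```latex
v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
             \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \right)
```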
Evaluating a Random Policy in the Small Gridworld
- Probability of moving in each of the four directions (N/E/S/W): 25% → π(a|s) = 1/4
- R_s^a = -1, γ = 1
- Example (a runnable sketch follows below)
- When k = 1, for any non-terminal state s:
- v1(s) = (1/4 × (-1 + 1 × 0)) × 4 directions = -1/4 × 4 = -1
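A minimal runnable sketch of this evaluation, assuming the usual 4x4 gridworld with two terminal corner states (the exact grid layout is an assumption for illustration; the per-step reward of -1 and γ = 1 follow the setup above):

```python
# Iterative policy evaluation of the uniform random policy on a 4x4 gridworld.
import numpy as np

N = 4                                          # 4x4 grid
TERMINALS = {(0, 0), (N - 1, N - 1)}           # assumed terminal corner states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # N, S, W, E
GAMMA = 1.0

def step(state, action):
    """Deterministic transition: move unless it would leave the grid."""
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc)
    return state                               # bumping into a wall keeps you in place

def evaluate_random_policy(num_sweeps):
    v = np.zeros((N, N))                       # v_0: all zeros
    for _ in range(num_sweeps):
        v_new = np.zeros((N, N))               # synchronous backup: separate copy
        for r in range(N):
            for c in range(N):
                if (r, c) in TERMINALS:
                    continue                   # terminal states keep value 0
                total = 0.0
                for a in ACTIONS:              # pi(a|s) = 1/4 for every direction
                    s_next = step((r, c), a)
                    total += 0.25 * (-1.0 + GAMMA * v[s_next])
                v_new[r, c] = total
        v = v_new
    return v

print(evaluate_random_policy(1))    # every non-terminal state is -1, as computed above
print(evaluate_random_policy(100))  # after many sweeps, close to v_pi
```

With one sweep every non-terminal state comes out as -1, matching the hand calculation above; running many sweeps approaches vπ.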
Policy Iteration
The Control problem
How to Improve a Policy
- Given a policy π, repeat the following two steps (a code sketch follows after this list)
- Evaluate the policy π (find its value function)
- vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
- Improve the policy by acting greedily with respect to vπ (build a new policy that acts greedily with respect to the value function we just found)
- π' = greedy(vπ)
- In Small Gridworld improved policy was optimal, π' = π∗
- In general, need more iterations of improvement / evaluation
- But this process of policy iteration always converges to π∗
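A compact sketch of these two alternating steps, assuming a deterministic policy and a tabular model given as arrays P[s, a, s'] and R[s, a] (these array names are illustrative, not from the lecture):

```python
# Policy iteration: evaluate the current policy, then act greedily on the result.
import numpy as np

def policy_evaluation(P, R, policy, gamma, theta=1e-8):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # Bellman expectation backup for the current deterministic policy
        v_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v
                          for s in range(n_states)])
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        v = policy_evaluation(P, R, policy, gamma)      # 1. evaluate pi
        q = R + gamma * np.einsum('sat,t->sa', P, v)    # q_pi(s, a) for all s, a
        new_policy = np.argmax(q, axis=1)               # 2. act greedily w.r.t. v_pi
        if np.array_equal(new_policy, policy):          # improvement stopped
            return new_policy, v
        policy = new_policy
```

Here policy_evaluation runs the Bellman expectation backup to near convergence; the Modified Policy Iteration section below asks whether that full convergence is actually necessary.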
Policy Iteration
Example: Jack's Car Rental
Policy Improvement
- Consider a deterministic policy, a = π(s)
- We can improve the policy by acting greedily
- π', which acts greedily with respect to qπ, is at least as good as π (the one-step argument is written out after this list)
- This improves the value from any state s over one step,
- It therefore improves the value function, vπ'(s) ≥ vπ(s)
- If improvements stop,
- Then the Bellman optimality equation has been satisfied
- Therefore vπ(s) = v∗(s) for all s ∈ S
- so π is an optimal policy
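The argument sketched in this list, written out step by step (this is the standard derivation from the lecture, reproduced here for reference):

```latex
\pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s, a)

q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \;\geq\; q_\pi(s, \pi(s)) = v_\pi(s)

v_\pi(s) \leq q_\pi(s, \pi'(s))
        = \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
        \leq \cdots \leq v_{\pi'}(s)

\text{If improvements stop:}\quad
\max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)
\;\Rightarrow\; v_\pi(s) = v_*(s) \text{ for all } s \in \mathcal{S}
```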
Modified Policy Iteration
- Does policy evaluation need to converge to vπ?
- Or should we introduce a stopping condition
- e.g. ε-convergence of the value function (a sketch of this stopping rule follows after this list)
- Or simply stop after k iterations of iterative policy evaluation?
- For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
- Why not update policy every iteration? i.e. stop after k = 1
- This is equivalent to value iteration (next section)
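A sketch of the truncated evaluation described above, reusing the illustrative P / R arrays from the policy iteration sketch (max_sweeps and epsilon are illustrative parameter names):

```python
# Truncated policy evaluation: stop after k sweeps or once the value function
# changes by less than epsilon, whichever comes first.
import numpy as np

def truncated_policy_evaluation(P, R, policy, gamma, max_sweeps=3, epsilon=1e-3):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(max_sweeps):                       # stop after k iterations
        v_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v
                          for s in range(n_states)])
        if np.max(np.abs(v_new - v)) < epsilon:       # epsilon-convergence
            return v_new
        v = v_new
    return v
```

With max_sweeps = 1 and a greedy improvement after every sweep, this collapses into value iteration, as the last bullet notes.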
Generalized Policy Iteration
Value Iteration
The Control problem
Principle of Optimality
- Any optimal policy can be subdivided into two components
- An optimal first action A*
- Followed by an optimal policy from successor state S'
Deterministic Value Iteration
- If we know the solution to subproblems v*(s')
- Then the solution v*(s) can be found by one-step lookahead (written out after this list)
- If we know the solutions v*(s') of the subproblems, we can obtain v*(s) from them via a one-step lookahead
- The idea of value iteration is to apply these updates iteratively
- Intuition: start with final rewards and work backwards
- Start from the final rewards and work backwards
- Still works with loopy, stochastic MDPs
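The one-step lookahead referred to above, in the MDP notation <S, A, P, R, γ> (the Bellman optimality equation used as an update):

```latex
v_*(s) \leftarrow \max_{a \in \mathcal{A}}
    \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \right)
```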
Example: Shortest Path
Value Iteration
- Problem: Find optimal policy π
- Solution: iterative application of the Bellman optimality backup (a code sketch follows after this list)
- v1 → v2 → ... → v*
- Using synchronous backups
- At each iteration k + 1
- For all states s ∈ S
- Update vk+1(s) from vk(s')
- Convergence to v∗ will be proven later
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
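A minimal sketch of value iteration, reusing the illustrative P[s, a, s'] / R[s, a] arrays from the earlier sketches; note that no explicit policy is maintained during the iterations, and a greedy policy is only extracted at the end:

```python
# Value iteration with synchronous Bellman optimality backups.
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)                               # v_1: arbitrary start
    while True:
        q = R + gamma * np.einsum('sat,t->sa', P, v)     # one-step lookahead
        v_new = np.max(q, axis=1)                        # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # extract a greedy policy only once v has (approximately) converged to v_*
    policy = np.argmax(R + gamma * np.einsum('sat,t->sa', P, v), axis=1)
    return v, policy
```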
Synchronous Dynamic Programming Algorithms
Asynchronous Dynamic Programming
- DP methods described so far used synchronous backups
- The DP methods so far all use synchronous backups: every state is backed up in parallel
- i.e. all states are backed up in parallel
- Asynchronous DP backs up states individually, in any order
- i.e., we may back up only some states, or back them up in a different order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
- Three simple ideas for asynchronous dynamic programming
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
In-place Dynamic Programming
This is closer to a coding technique
- Synchronous value iteration stores two copies of value function
- Two tables are needed: one holding the previous values and one holding the values being updated
- In-place value iteration only stores one copy of value function
- Only one value function table is kept, so a state's freshly updated value is used immediately within the same sweep (see the sketch below)
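The contrast between the two variants, sketched for a Bellman optimality backup over the illustrative tabular arrays P[s, a, s'] and R[s, a]:

```python
import numpy as np

def sweep_synchronous(P, R, v, gamma):
    """Two copies: every state is backed up from the *old* table v."""
    v_new = np.empty_like(v)
    for s in range(P.shape[0]):
        v_new[s] = np.max(R[s] + gamma * P[s] @ v)   # reads only the old v
    return v_new

def sweep_in_place(P, R, v, gamma):
    """One copy: later states immediately see values updated earlier in the sweep."""
    for s in range(P.shape[0]):
        v[s] = np.max(R[s] + gamma * P[s] @ v)       # reads the partially updated v
    return v
```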
Prioritised Sweeping
Update states in order of importance; a state with a large Bellman error is considered important (a priority-queue sketch follows this list)
- Use magnitude of Bellman error to guide state selection,
- Backup the state with the largest remaining Bellman error
- Update Bellman error of affected states after each backup
- Requires knowledge of reverse dynamics (predecessor states)
- Can be implemented efficiently by maintaining a priority queue
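A rough sketch of the idea, assuming the same illustrative model arrays; stale heap entries are simply tolerated here for simplicity, so this is not an efficient implementation:

```python
import heapq
import numpy as np

def bellman_error(P, R, v, s, gamma):
    """|max_a (R(s,a) + gamma * sum_s' P(s'|s,a) v(s')) - v(s)|"""
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

def prioritised_sweeping(P, R, gamma=0.9, num_backups=1000):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    # reverse dynamics: which states can transition into s_next (predecessors)
    predecessors = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for s_next in np.nonzero(P[s, a])[0]:
                predecessors[s_next].add(s)
    # max-heap via negated priorities
    heap = [(-bellman_error(P, R, v, s, gamma), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(num_backups):
        _, s = heapq.heappop(heap)                 # state with the largest (possibly stale) error
        v[s] = np.max(R[s] + gamma * P[s] @ v)     # back up that state
        for p in predecessors[s] | {s}:            # errors of affected states have changed
            heapq.heappush(heap, (-bellman_error(P, R, v, p, gamma), p))
    return v
```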
Real-Time Dynamic Programming
- Idea: only states that are relevant to agent
- Only the states the agent actually visits are updated; the earlier methods sweep over all states, so no agent is needed there (a sketch follows after this list)
- Use agent’s experience to guide the selection of states
- After each time-step St, At, Rt+1
- Backup the state St
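A sketch of this loop; the env interface (reset/step) and the greedy action choice are assumptions for illustration, not part of the lecture:

```python
# Real-time DP: back up only the state the agent is currently visiting.
import numpy as np

def real_time_dp(env, P, R, gamma=0.9, num_episodes=100):
    v = np.zeros(P.shape[0])
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = np.argmax(R[s] + gamma * P[s] @ v)   # act greedily w.r.t. current v
            v[s] = np.max(R[s] + gamma * P[s] @ v)   # back up only the visited state S_t
            s, _, done = env.step(a)                 # sampled reward unused; the model R is used
    return v
```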
Full-Width Backups
- DP uses full-width backups
- All the methods covered so far belong to this category
- For each backup (sync or async)
- Every successor state and action is considered
- Every S' reachable from S is considered
- Using knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems DP suffers Bellman’s curse of dimensionality
- Full-width backups cannot be applied to large problems
- Number of states n = |S| grows exponentially with number of state variables
- Even one backup can be too expensive
Sample Backups
- In subsequent lectures we will consider sample backups
- Using sample rewards and sample transitions <S, A, R, S'>
- Instead of reward function R and transition dynamics P
- Advantages
- Model-free: no advance knowledge of MDP required
- Breaks the curse of dimensionality through sampling
- Cost of backup is constant, independent of n = |S|
'Study > Machine Learning' 카테고리의 다른 글
[RL]Lecture #5 - Model-Free Control (0) | 2022.03.10 |
---|---|
[RL]Lecture #4 - Model-Free Prediction (0) | 2022.03.07 |
[RL]Lecture #2 - Markov Decision Processes (0) | 2022.03.06 |
[RL]Lecture #1 - Introduction to Reinforcement Learning (0) | 2022.03.04 |
유용한 강의 영상 (0) | 2022.02.03 |
댓글