
[RL] Lecture #3 - Planning by Dynamic Programming

by 커피콩 2022. 3. 6.

 


    Planning

    • The environment is fully known (we know the reward function and the state transitions)
    • The agent does not interact with the environment; it improves its policy by looking ahead at possible situations through internal computation

     

    Dynamic Programming

    • A general method for solving complex problems
    • Break a large problem into smaller subproblems → combine the solutions to those subproblems
    • Requirements for Dynamic Programming
      • Two conditions must hold for dynamic programming to apply
        • Optimal substructure
          • Principle of optimality applies
          • Optimal solution can be decomposed into subproblems
        • Overlapping subproblems
          • Subproblems recur many times
          • Solution can be cached and reused
    • Markov decision processes satisfy both properties (MDPs meet both conditions; a toy sketch follows this list)
      • Bellman equation gives recursive decomposition
      • Value function stores and reuses solutions
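
    As a toy illustration (outside RL, so purely an assumption of mine) of why these two properties matter: the problem below decomposes recursively, and the same subproblems recur, so cached solutions can be reused. The cache plays the role the value function plays for an MDP.

        from functools import lru_cache

        @lru_cache(maxsize=None)
        def fib(n: int) -> int:
            # Recursive decomposition; the cache stores and reuses subproblem solutions.
            return n if n < 2 else fib(n - 1) + fib(n - 2)

        print(fib(50))  # fast, because each subproblem is solved only once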

     

    Planning by Dynamic Programming

    • Dynamic programming assumes full knowledge of the MDP
      • Presumably because an MDP assumes a fully observable environment (agent state == environment state == information state)
    • The planning problems are classified as follows
      • For Prediction
        • Prediction: evaluate the future, i.e. find the value function vπ of the given policy
        • Input
          • MDP <S, A, P, R, γ> and Policy π
          • or MRP <S, Pπ, Rπ, γ>
        • Output
          • Value Function vπ
      • or For Control
        • Control: optimise the future, i.e. find the optimal policy
        • Input
          • MDP <S, A, P, R, γ>
        • Output
          • Optimal Value Function v*
          • and Optimal Policy π*

     

    Other Applications of Dynamic Programming

    • Dynamic programming is used to solve many other problems, e.g.
      • Scheduling algorithms
      • String algorithms (e.g. sequence alignment)
      • Graph algorithms (e.g. shortest path algorithms)
      • Graphical models (e.g. Viterbi algorithm)
      • Bioinformatics (e.g. lattice models)

     

    Policy Evaluation

    The Prediction problem

    Iterative Policy Evaluation

    • Problem: evaluate a given policy π
      • → If we know how much return we get by following this policy, we have evaluated the policy; that quantity is the value function
      • → In other words, find the value function obtained by following this policy
    • Solution: iterative application of Bellman expectation backup
      • We will solve this iteratively using the Bellman expectation equation
    • v1 → v2 → ... → vπ
      • Start with an arbitrary v1 (e.g. the value of every state set to 0)
      • Repeating the update, the values converge to vπ
    • backup: a place where intermediate results are stored temporarily, like a cache
    • Using synchronous backups,
      • At each iteration k + 1
        • If we run, say, 10 iterations, this is like for (int k = 0; k < 10; ++k)
      • For all states s ∈ S
        • Suppose there are 10 states; all 10 are updated (initially they hold arbitrary values)
      • Update vk+1(s) from vk(s')
        • Update the value of every state using the Bellman expectation backup (see the sketch after this list)
      • where s' is a successor state of s
        • Updating every state once is called a full sweep; since all states are updated together at each iteration, this is a synchronous backup
        • s' denotes the states reachable from s
    • We will discuss asynchronous backups later
    • Convergence to vπ will be proven at the end of the lecture
      • Once the exact values are reached, further sweeps no longer change them
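
    As a concrete version of the synchronous backup described above, here is a minimal policy-evaluation sketch. The data layout is an assumption for illustration only: P[s][a] is a list of (prob, next_state, reward) triples and policy[s][a] is π(a|s); neither name comes from the lecture.

        def policy_evaluation(P, policy, gamma=1.0, theta=1e-8):
            # Synchronous iterative policy evaluation (a sketch, not the lecture's code).
            v = [0.0] * len(P)                       # v1: every state starts at 0
            while True:
                v_new = [0.0] * len(P)               # synchronous: build v_{k+1} from v_k
                for s in range(len(P)):
                    for a, pi_sa in policy[s].items():
                        for prob, s_next, reward in P[s][a]:
                            # Bellman expectation backup
                            v_new[s] += pi_sa * prob * (reward + gamma * v[s_next])
                delta = max(abs(v_new[s] - v[s]) for s in range(len(P)))
                v = v_new
                if delta < theta:                    # values stop changing: converged to v_pi
                    return v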

     

    Evaluating a Random Policy in the Small Gridworld

    • Probability of moving in each of the four directions (N/E/S/W): 25% → π(a|s) = 1/4
    • Rsa = -1, γ = 1
    • Example
    • At k = 1
      • (figure: gridworld value function after the first sweep)
      • v1(s) = (1/4 * (-1 + 1 * 1 * 0)) * (4 directions) = -1/4 * 4 = -1 (checked in the sketch below)

     

    Policy Iteration

    The Control problem

    How to Improve a Policy

    • Given a policy π (repeat the two steps below)
      • Evaluate the policy π (find its value function)
        • vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
      • Improve the policy by acting greedily with respect to vπ (build a new policy that acts greedily with respect to the value function just found)
        • π' = greedy(vπ)
    • In Small Gridworld improved policy was optimal, π' = π*
    • In general, need more iterations of improvement / evaluation
    • But this process of policy iteration always converges to π* (sketched below)
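
    The two alternating steps can be sketched as below, reusing the policy_evaluation function and the P[s][a] = [(prob, next_state, reward), ...] convention assumed earlier; the function names are illustrative, not the lecture's.

        def greedy_policy(P, v, gamma=1.0):
            # pi' = greedy(v_pi): put all probability on the best one-step-lookahead action.
            policy = []
            for s in range(len(P)):
                q = {a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s]}
                best = max(q, key=q.get)
                policy.append({a: float(a == best) for a in P[s]})
            return policy

        def policy_iteration(P, gamma=1.0):
            # Start from the uniform random policy, then alternate evaluate / improve.
            policy = [{a: 1.0 / len(P[s]) for a in P[s]} for s in range(len(P))]
            while True:
                v = policy_evaluation(P, policy, gamma)    # 1) evaluate pi
                new_policy = greedy_policy(P, v, gamma)    # 2) act greedily w.r.t. v_pi
                if new_policy == policy:                   # greedy policy stopped changing
                    return new_policy, v                   # pi*, v*
                policy = new_policy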

     

    Policy Iteration

     

    Example: Jack's Car Rental

     

    Policy Improvement

    • Consider a deterministic policy, a = π(s)
    • We can improve the policy by acting greedily
      • A policy π' that acts greedily with respect to qπ is at least as good as π
    • This improves the value from any state s over one step (see the reconstruction below),
    • It therefore improves the value function, vπ'(s) ≥ vπ(s)
    • If improvements stop,
    • Then the Bellman optimality equation has been satisfied
    • Therefore vπ(s) = v*(s) for all s ∈ S
    • so π is an optimal policy
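
    The bullets above compress the standard argument; reconstructed in LaTeX (with the greedy policy π'(s) = argmax over a of qπ(s, a)) it reads:

        % acting greedily improves the value over one step
        q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)
        % unrolling this inequality along the trajectory improves the value function
        v_{\pi'}(s) \ge v_\pi(s) \quad \forall s \in \mathcal{S}
        % if improvements stop, the maximum is attained by pi itself,
        \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)
        % which is the Bellman optimality equation, so v_\pi = v_* and pi is an optimal policy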

     

    Modified Policy Iteration

    • Does policy evaluation need to converge to vπ?
    • Or should we introduce a stopping condition
      • e.g. ε-convergence of the value function (see the sketch after this list)
    • Or simply stop after k iterations of iterative policy evaluation?
    • For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
    • Why not update policy every iteration? i.e. stop after k = 1
      • This is equivalent to value iteration (next section)
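
    A sketch of the stopping options above, under the same assumed P / policy layout: stop when the value function changes by less than ε, or after at most max_sweeps sweeps. Using max_sweeps = 1 inside the policy-iteration loop gives the value-iteration-equivalent scheme mentioned in the last bullet.

        def truncated_policy_evaluation(P, policy, gamma=1.0, epsilon=1e-6, max_sweeps=None):
            v = [0.0] * len(P)
            sweep = 0
            while max_sweeps is None or sweep < max_sweeps:
                v_new = [sum(pi_sa * p * (r + gamma * v[s2])
                             for a, pi_sa in policy[s].items()
                             for p, s2, r in P[s][a])
                         for s in range(len(P))]
                delta = max(abs(x - y) for x, y in zip(v_new, v))
                v, sweep = v_new, sweep + 1
                if delta < epsilon:          # epsilon-convergence stopping condition
                    break
            return v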

     

    Generalized Policy Iteration

     

    Value Iteration

    The Control problem

    Principle of Optimality

    • Any optimal policy can be subdivided into two components
      • An optimal first action A*
      • Followed by an optimal policy from successor state S'

     

    Deterministic Value Iteration

    • If we know the solution to subproblems v*(s')
    • Then the solution v*(s) can be found by one-step lookahead (reconstructed below)
      • If we know the solution to the subproblems v*(s'), we can obtain v*(s) from v*(s') with a one-step lookahead
    • The idea of value iteration is to apply these updates iteratively
    • Intuition: start with final rewards and work backwards
      • Start from the final rewards and work backwards
    • Still works with loopy, stochastic MDPs
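
    The one-step lookahead referred to above is the Bellman optimality backup; in the lecture's notation it can be written as

        v_*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s') \right)

    and value iteration simply applies this as the update v_{k+1}(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_k(s') \right).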

     

    Example: Shortest Path

     

    Value Iteration

    • Problem: Find optimal policy π
    • Solution: iterative application of Bellman optimality backup
    • v1 → v2 → ... → v*
    • Using synchronous backups
      • At each iteration k + 1
      • For all states s ∈ S
      • Update vk+1(s) from vk(s')
    • Convergence to v* will be proven later
    • Unlike policy iteration, there is no explicit policy
    • Intermediate value functions may not correspond to any policy
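
    A minimal value-iteration sketch under the same assumed P[s][a] layout. Matching the last two bullets, no policy is maintained during the loop; a greedy policy is only read off at the end.

        def value_iteration(P, gamma=1.0, theta=1e-8):
            v = [0.0] * len(P)
            while True:
                # Bellman optimality backup for every state (synchronous sweep).
                v_new = [max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                             for a in P[s])
                         for s in range(len(P))]
                delta = max(abs(x - y) for x, y in zip(v_new, v))
                v = v_new
                if delta < theta:
                    break
            # Extract a greedy policy only once v has (approximately) converged to v*.
            policy = [max(P[s], key=lambda a, s=s: sum(p * (r + gamma * v[s2])
                                                       for p, s2, r in P[s][a]))
                      for s in range(len(P))]
            return v, policy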

     

    Synchronous Dynamic Programming Algorithms

     

    Asynchronous Dynamic Programming

    • DP methods described so far used synchronous backups
      • All the DP methods so far use synchronous backups: every state is backed up in parallel
    • i.e. all states are backed up in parallel
    • Asynchronous DP backs up states individually, in any order
      • Meaning we may back up only certain states, or back them up in a different order
    • For each selected state, apply the appropriate backup
    • Can significantly reduce computation
    • Guaranteed to converge if all states continue to be selected
    • Three simple ideas for asynchronous dynamic programming
      • In-place dynamic programming
      • Prioritised sweeping
      • Real-time dynamic programming

     

    In-place Dynamic Programming

    This is closer to a coding technique.

    • Synchronous value iteration stores two copies of value function
      • Two tables are needed: the value-function table from the previous sweep and the one being updated
    • In-place value iteration only stores one copy of value function
      • Keeps only a single value-function table; a state's just-updated value is used immediately (contrast sketched below)
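
    A minimal contrast with the synchronous version, at the coding-technique level noted above: a single value table is kept, and a freshly updated v[s] is used immediately by later states in the same sweep (same assumed P[s][a] layout as before).

        def in_place_value_iteration(P, gamma=1.0, sweeps=100):
            v = [0.0] * len(P)                 # only one copy of the value function
            for _ in range(sweeps):
                for s in range(len(P)):
                    # Overwrite v[s] right away; later states in this sweep see the new value.
                    v[s] = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                               for a in P[s])
            return v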

     

    Prioritised Sweeping

    Update states in order of importance; states with a large Bellman error are regarded as the important ones (a sketch follows the list below).

    • Use magnitude of Bellman error to guide state selection,
    • Backup the state with the largest remaining Bellman error
    • Update Bellman error of affected states after each backup
    • Requires knowledge of reverse dynamics (predecessor states)
    • Can be implemented efficiently by maintaining a priority queue
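
    A hedged sketch of the idea: keep states in a priority queue ordered by the magnitude of their Bellman error, back up the worst one, then refresh the errors of its predecessors. The predecessors argument stands in for the reverse dynamics mentioned above, and the stale-entry handling is a simplification of my own.

        import heapq

        def bellman_error(P, v, s, gamma):
            backup = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
            return abs(backup - v[s]), backup

        def prioritised_sweeping(P, predecessors, gamma=1.0, theta=1e-8, max_backups=100_000):
            v = [0.0] * len(P)
            pq = [(-bellman_error(P, v, s, gamma)[0], s) for s in range(len(P))]
            heapq.heapify(pq)                                  # max-priority via negated errors
            backups = 0
            while pq and backups < max_backups:
                _, s = heapq.heappop(pq)
                err, backup = bellman_error(P, v, s, gamma)    # recompute; the entry may be stale
                if err < theta:
                    continue
                v[s] = backup                                  # back up the worst remaining state
                backups += 1
                for pred in predecessors[s]:                   # their Bellman errors have changed
                    heapq.heappush(pq, (-bellman_error(P, v, pred, gamma)[0], pred))
            return v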

     

    Real-Time Dynamic Programming

    • Idea: only states that are relevant to agent
      • Update only the states the agent actually visits; the earlier methods apply the backup to every state regardless, so they need no agent at all (see the sketch below)
    • Use agent’s experience to guide the selection of states
    • After each time-step St, At, Rt+1
    • Backup the state St
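
    A hedged sketch, assuming a simple env object with reset() returning a start state and step(a) returning (next_state, reward, done); these are illustrative assumptions, not part of the lecture. The agent's own trajectory decides which state gets backed up, but the model P is still needed for the backup itself.

        def greedy_action(P, v, s, gamma):
            # One-step lookahead used to generate the agent's experience.
            return max(P[s], key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))

        def real_time_dp(P, env, gamma=1.0, episodes=100):
            v = [0.0] * len(P)
            for _ in range(episodes):
                s = env.reset()
                done = False
                while not done:
                    # Back up only the state S_t the agent is currently visiting.
                    v[s] = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                               for a in P[s])
                    s, _, done = env.step(greedy_action(P, v, s, gamma))
            return v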

     

    Full-Width Backups

    • DP uses full-width backups
      • All the methods covered so far fall into this category
    • For each backup (sync or async)
      • Every successor state and action is considered
        • Consider every successor state s' reachable from s
      • Using knowledge of the MDP transitions and reward function
    • DP is effective for medium-sized problems (millions of states)
    • For large problems DP suffers Bellman’s curse of dimensionality
      • Full-width backups cannot be applied to large problems
      • Number of states n = |S| grows exponentially with number of state variables
    • Even one backup can be too expensive

     

    Sample Backups

    • In subsequent lectures we will consider sample backups
    • Using sample rewards and sample transitions <S, A, R, S'>
    • Instead of reward function R and transition dynamics P
    • Advantages
      • Model-free: no advance knowledge of MDP required
      • Breaks the curse of dimensionality through sampling
      • Cost of backup is constant, independent of n = |S|

     

     

     
