# Reinforcement Learning (Machine Learning, SIR)



Matthieu Geist (CentraleSupélec) matthieu.geist@centralesupelec.fr



Figure: The perception-action cycle in reinforcement learning.


## Applications

- playing games (Backgammon, Go, Tetris, Atari...)
- robotics, e.g., autonomous acrobatic helicopter control [1]
- operations research (pricing, vehicle routing...)
- human-computer interaction (dialogue management, e-learning...)
- virtually any control problem [2]

[1] http://heli.stanford.edu/
[2] An old list: http://umichrl.pbworks.com/w/page/7597597/Successes_of_Reinforcement_Learning


## Outline

1. Formalism: Markov Decision Processes; policy and value function; Bellman operators
2. Dynamic Programming: linear programming; value iteration; policy iteration
3. Approximate Dynamic Programming: state-action value function; approximate value iteration; approximate policy iteration
4. Online learning: SARSA and Q-learning; the exploration-exploitation dilemma
5. Policy search and actor-critic methods: the policy gradient theorem; actor-critic methods


## Markov Decision Processes

A Markov Decision Process (MDP) is a tuple $\{S, A, P, r, \gamma\}$ where:

- $S$ is the (finite) state space;
- $A$ is the (finite) action space;
- $P \in \Delta_S^{S \times A}$ is the Markovian transition kernel (a distribution over $S$ for each state-action pair); the term $P(s'|s, a)$ denotes the probability of transitioning to state $s'$ given that action $a$ was chosen in state $s$;
- $r \in \mathbb{R}^{S \times A}$ is the reward function; it associates the reward $r(s, a)$ with taking action $a$ in state $s$. The reward function is assumed to be uniformly bounded;
- $\gamma \in (0, 1)$ is a discount factor that favors shorter-term rewards (usually set to a value close to 1).
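To fix ideas, here is a minimal sketch of such a finite MDP stored as NumPy arrays; the names (`n_states`, `n_actions`, `P`, `r`, `gamma`) and the random instance are illustrative choices, not part of the slides.

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] = P(s'|s, a): each (s, a) slice is a distribution over S.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# r[s, a]: bounded reward for taking action a in state s.
r = rng.random((n_states, n_actions))

gamma = 0.95  # discount factor, usually close to 1
```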


## Policy and value function

Policy: $\pi \in A^S$; in state $s$, an agent applying policy $\pi$ chooses the action $\pi(s)$.

Value function (quantifies the quality of a policy):

$$v_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot \mid S_t, \pi(S_t))\right].$$

Comparing policies (partial ordering):

$$\pi_1 \geq \pi_2 \iff \forall s \in S,\ v_{\pi_1}(s) \geq v_{\pi_2}(s).$$

Optimal policy:

$$\pi_* \in \operatorname*{argmax}_{\pi \in A^S} v_\pi.$$
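As a sanity check on this definition, $v_\pi(s)$ can be estimated by Monte Carlo: roll out the policy many times and average the discounted returns, truncating the infinite sum at a finite horizon (the neglected tail is small since rewards are bounded and $\gamma < 1$). A sketch, assuming the array-based MDP above and a deterministic policy `pi` given as an integer array:

```python
import numpy as np

def mc_value(P, r, gamma, pi, s0, n_rollouts=1000, horizon=200, seed=1):
    """Monte Carlo estimate of v_pi(s0), truncated at `horizon` steps."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = pi[s]                            # deterministic policy: pi in A^S
            ret += disc * r[s, a]
            disc *= gamma
            s = rng.choice(n_states, p=P[s, a])  # S_{t+1} ~ P(.|S_t, pi(S_t))
        total += ret
    return total / n_rollouts
```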


## Rewriting the Bellman equation

$$\begin{aligned}
v_\pi(s) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot \mid S_t, \pi(S_t))\right] \\
&= r(s, \pi(s)) + \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot \mid S_t, \pi(S_t))\right] \\
&= r(s, \pi(s)) + \gamma\,\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_{t+1}, \pi(S_{t+1})) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot \mid S_t, \pi(S_t))\right] \\
\iff v_\pi(s) &= r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, v_\pi(s').
\end{aligned}$$


## Rewriting the Bellman equation (cont.)

Define the stochastic matrix $P_\pi \in \mathbb{R}^{S \times S}$ and the vector $r_\pi \in \mathbb{R}^S$ as

$$P_\pi = \big(P(s' \mid s, \pi(s))\big)_{s, s' \in S} \quad \text{and} \quad r_\pi = \big(r(s, \pi(s))\big)_{s \in S}.$$

Using these notations, we have:

$$v_\pi = r_\pi + \gamma P_\pi v_\pi \iff v_\pi = (I - \gamma P_\pi)^{-1} r_\pi,$$

where the inverse exists because $P_\pi$ is stochastic and $\gamma < 1$, so the spectral radius of $\gamma P_\pi$ is at most $\gamma < 1$.
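In code, this matrix form turns policy evaluation into a single linear solve; a sketch under the same array conventions as above (solving the system is preferable to forming the inverse explicitly):

```python
import numpy as np

def evaluate_policy(P, r, gamma, pi):
    """Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, pi]  # P_pi[s, s'] = P(s'|s, pi(s))
    r_pi = r[idx, pi]  # r_pi[s]     = r(s, pi(s))
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```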


## Bellman evaluation operator

Define the Bellman evaluation operator $T_\pi : \mathbb{R}^S \to \mathbb{R}^S$ as

$$\forall v \in \mathbb{R}^S, \quad T_\pi v = r_\pi + \gamma P_\pi v,$$

or equivalently, componentwise,

$$\forall s \in S, \quad [T_\pi v](s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, v(s').$$

$T_\pi$ is a contraction (in supremum norm) and $v_\pi$ is its unique fixed point:

$$v_\pi = T_\pi v_\pi.$$
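Because $T_\pi$ is a contraction, iterating it from any starting vector converges to $v_\pi$; the sketch below does exactly that, under the same array conventions and with an illustrative stopping tolerance:

```python
import numpy as np

def bellman_eval_operator(v, P, r, gamma, pi):
    """Apply T_pi: T_pi v = r_pi + gamma * P_pi v."""
    idx = np.arange(P.shape[0])
    return r[idx, pi] + gamma * P[idx, pi] @ v

def fixed_point_evaluation(P, r, gamma, pi, tol=1e-10):
    """Iterate T_pi from v = 0 until the sup-norm update is below tol."""
    v = np.zeros(P.shape[0])
    while True:
        v_next = bellman_eval_operator(v, P, r, gamma, pi)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
```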


## Optimal value function and policies

Assume that $v_* = v_{\pi_*}$ is known; an optimal policy is then obtained by acting greedily with respect to $v_*$:

$$\pi_*(s) \in \operatorname*{argmax}_{a \in A} \left( r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, v_*(s') \right).$$

Characterizing $v_*$:

$$\forall s \in S, \quad v_*(s) = \max_{a \in A} \left( r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, v_*(s') \right).$$
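The greedy step can be written compactly over state-action values; a sketch under the same conventions (it is defined for any $v$, and applied to $v_*$ it returns an optimal policy):

```python
import numpy as np

def greedy_policy(v, P, r, gamma):
    """Return a policy greedy w.r.t. v: pi(s) in argmax_a q(s, a)."""
    # q[s, a] = r(s, a) + gamma * sum_{s'} P(s'|s, a) v(s')
    q = r + gamma * np.einsum('sap,p->sa', P, v)
    return np.argmax(q, axis=1)
```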


## Bellman optimality operator

Define the Bellman optimality operator $T_* : \mathbb{R}^S \to \mathbb{R}^S$ as

$$\forall v \in \mathbb{R}^S, \quad T_* v = \max_{\pi \in A^S} \left( r_\pi + \gamma P_\pi v \right),$$

or equivalently, componentwise,

$$\forall s \in S, \quad [T_* v](s) = \max_{a \in A} \left( r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, v(s') \right).$$

$T_*$ is a contraction (in supremum norm) and $v_*$ is its unique fixed point:

$$v_* = T_* v_*.$$
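Componentwise, applying $T_*$ is a greedy backup over actions; iterating it is precisely the value iteration scheme discussed later in the course. A sketch under the same array conventions:

```python
import numpy as np

def bellman_optimality_operator(v, P, r, gamma):
    """Apply T_*: [T_* v](s) = max_a (r(s, a) + gamma * sum_{s'} P(s'|s, a) v(s'))."""
    q = r + gamma * np.einsum('sap,p->sa', P, v)
    return np.max(q, axis=1)
```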


## Dynamic Programming

DP: solving an MDP when the model ($P$ and $r$) is known.

In p