Laplacian Agent Learning: Representation Policy Iteration

Size: px

Start display at page:

Download "Laplacian Agent Learning: Representation Policy Iteration"

Dale Griffith
6 years ago
Views:

1 Laplacian Agent Learning: Representation Policy Iteration Sridhar Mahadevan

2 Example of a Markov Decision Process a1: $0 Heaven $1 Earth What should the agent do? a2: $100 Hell $-1 V a1 ( Earth ) = f(0,1,1,1,1,...) V a2 ( Earth ) = f(100,-1,-1,-1,-1,...)

3 Infinite Horizon Markov Decision Processes! Graded myopia: discounted sum of rewards " Maximize γ t " Hell if γ < 0.98, otherwise Heaven " Maximize average-adjusted sum of rewards n lim rt n Always go to heaven! t r t = 1 t n

4 Sample MDP , , +1 V b (1) = 0.5 x x 2 + γ (0.5) V b (1) + γ (0.5) V b (2) V b (2) = 1 + γ V b (2)

5 Sample MDP , , +1 V b (2) = 1/(1 - γ) = 2 (if γ = 0.5) V b (1) = (1.5 + γ (0.5) V b (2))/(1 - γ(0.5)) = 8/3

6 Sample MDP , , +1 V r (2) = 1/(1 - γ) = 2 (if γ = 0.5) V r (1) = (1 + γ (0.5)V r (2))/(1 - γ(0.5)) = 2 Hence, the blue policy is optimal in state 1

7 Bellman Equation Choose greedy action given V* EXIT WEST Goal! +50 WAIT EXIT EAST

8 Value Function Approximation R S x A LSPI [Lagoudakis and Parr, 2003] Inverted Pendulum with Radial Basis Functions (10) R k

9 VFA by Least-Squares Projection: Minimize Bellman Residual Error T π ( Vˆ( s)) Bellman residual = i V ˆ( s) φ ( s) i w i Subspace Φ (poly, RBF, neural nets)

10 VFA by Least-Squares Projection: Fixed Point Projection T π ( Vˆ( s)) = i V ˆ( s) φ ( s) i w i Minimize Projected Residual Subspace Φ (poly, RBF, neural nets)

11 Parametric Value Function Approximation Can Fail Approximator blind to symmetries! LSPI converges to an incorrect policy Approximator blind to bottlenecks! Goal is to get to center of square grid G [Drummond, JAIR 2002]

12 Divergence of Neural Nets in Mountain Car Task (Boyan and Moore, NIPS 1995)

Automating Value Function Approximation " Many approaches to value function approximation " Neural nets, radial basis functions, support vector

13 Automating Value Function Approximation " Many approaches to value function approximation " Neural nets, radial basis functions, support vector machines, kernel density estimation, nearest neighbor " How to automate the design of a function approximator? " We want to go beyond model selection!

Representation Learning using Harmonic Analysis on Manifolds (Mahadevan and Maggioni, NIPS 2005) 8 6 Fourier 2 ( F) = 0 Div(Grad(F))=0 4 Inverted pendulum 2 0-2

14 Representation Learning using Harmonic Analysis on Manifolds (Mahadevan and Maggioni, NIPS 2005) 8 6 Fourier 2 ( F) = 0 Div(Grad(F))=0 4 Inverted pendulum Wavelets Dilations of diffusion operator on manifold Samples from random walk

15 Proto-Value Functions (Mahadevan, ICML 2005) Proto-value functions are the representational building blocks of all value functions on a given environment. Proto-value functions are based on modeling the state space manifold Fourier Wavelet

16 How are Proto-Value Functions Learned? Proto-value functions Graph Laplacian

17 Representation Policy Iteration (Mahadevan, UAI 2005) Policy improvement Greedy Policy Actor Policy evaluation Representation Learner Trajectories Critic Proto-value functions

18 Policy Iteration (Howard, 1960) Policy Evaluation: ( Critic ) γ Policy Improvement: ( Actor ) γ

19 Representation Policy Iteration " Learn a set of proto-value functions from a sample of transitions generated from a random walk (or from watching an expert) " These basis functions can then be used in an approximate policy iteration algorithm, such as Least Squares Policy Iteration [Lagoudakis and Parr, 2003]

20 Least-Squares Policy Iteration (Boyan, ICML 1999; Lagoudakis and Parr, JMLR 2003) Do a random walk generating a set of transitions D = (s t, a t, r, s t ) Solve the equation:

21 Learned vs. Handcoded Representations: Chain MDP Time Error Laplacian RBF Poly

22 Results of RPI on a Grid World LSPI RBF LSPI Polynomial RPI Goal is to get to center of square grid

23 Results of RPI on Two Room World G LSPI Polynomial Nonlinearity due to bottleneck is nicely captured by RPI!

24 RPI in Continuous State Spaces (Mahadevan, Maggioni, Ferguson, Osentoski, 2006) " To extend RPI to continuous state spaces, we need to use the Nystrom extension to extend eigenfunctions from sample points to new points " Many issues are involved here, including how many sampled points to choose from, how to construct the graph to reflect the underlying manifold, etc.

25 RPI : Inverted Pendulum Task Normalized Graph Laplacian matrix Sample transitions from random walk (~ ) Proto value functions

26 RPI: Inverted Pendulum Q-value for action Angle Angular Velocity -4 Q-value for action Angle Angular Velocity -4 Learned Proto-Value Functions Handcoded Radial Basis Functions

27 RPI Results on Inverted Pendulum (Mahadevan, Maggioni, Ferguson, Osentoski, 2006) 3200 Pendulum: Proto Value Functions vs Radial Basis Functions Steps RBF 50 PVF 36 RBF 50 RBF Number of training episodes

28 RPI Results for Mountain Car (Mahadevan, Maggioni, Ferguson, Osentoski, 2006) 550 Mountain Car: Proto Value Functions vs. Radial Basis Functions Steps PVF (Z=500, P=100, k=25) PVF (Z=1500, P=100, k=25) 13 RBF (Z=1500) 49 RBF (Z=1500) Number of training episodes

29 Summary: Proto-Value Functions " A major challenge facing MDP research is to how to automate value function approximation " Proto-value functions " Fourier and wavelet representations on manifolds that capture largescale geometry of the state (action) space " Representation Policy Iteration is a general framework for simultaneously learning representations and policies " Extensions of proto-value functions " On-policy proto-value functions [Maggioni and Mahadevan, 2005] " Factored Markov decision processes [Mahadevan, 2006] " Group-theoretic extensions [Mahadevan, in preparation]

Representation Policy Iteration: A Unified Framework for Learning Behavior and Representation

Representation Policy Iteration: A Unified Framework for Learning Behavior and Representation Sridhar Mahadevan Unified Theories of Cognition (Newell, William James Lectures, Harvard) How are we able to