AIBO experiences change of surface incline.

1 AIBO experiences a change of surface incline. Decisions are based only on the kinesthetic experience vector.

2 AIBO experiences a change of surface incline (continued). Decisions are based only on the kinesthetic experience vector.

3 CONCLUSION: We conjecture that the proposed experience-based approach will usher in a whole new phase of development in the decision and control fields, making a significant stride toward more human-like decision and control. We further conjecture that the context-discernment concepts, together with the manifold representation, will provide a basis for constructing learning agents capable of long-term, rapidly accessible memory. If so, this could pave the way for scaling neural systems to brain-like capabilities.

4 The geometric-topology construct of a manifold provides a useful formalism: 1) a set of elements, S, and 2) a coordinate system (a one-to-one mapping from S to R^n that specifies each element of S via a vector of n real numbers, a.k.a. the coordinates of the element). We let the experience repository be the set portion of a manifold. The manifold's coordinate space serves as a searchable indexing vehicle for the repository, and since the coordinate space is R^n, the Euclidean distance provides a natural metric for nearness.
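
As a concrete illustration (a minimal sketch, not from the talk; the class and names are ours), the repository can be stored as coordinate/element pairs, with retrieval by Euclidean nearest neighbor in the coordinate space:

```python
import numpy as np

# Minimal sketch: an experience repository indexed by manifold coordinates.
# Elements are stored alongside their coordinate vectors in R^n, and
# retrieval is nearest-neighbor search under the Euclidean metric.
class ExperienceRepository:
    def __init__(self):
        self.coords = []    # list of coordinate vectors (the "addresses")
        self.elements = []  # the stored experiences (models, policies, ...)

    def add(self, coord, element):
        self.coords.append(np.asarray(coord, dtype=float))
        self.elements.append(element)

    def nearest(self, query):
        # Euclidean distance in coordinate space defines "nearness"
        dists = [np.linalg.norm(c - query) for c in self.coords]
        return self.elements[int(np.argmin(dists))]

repo = ExperienceRepository()
repo.add([0.0, 1.0], "policy A")
repo.add([2.0, 2.0], "policy B")
print(repo.nearest(np.array([0.2, 0.9])))  # -> "policy A"
```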

5 Demonstration Example: Let the manifold set be a collection of neural networks (NNs) generated from a NN whose structure is fixed and whose adjustable parameters (weights and biases) are made to take on all possible value combinations. Each such combination yields a distinct member of the set, and the parameter values may serve as the coordinates; a point in the coordinate space may be called the set element's address.
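
A minimal sketch of one such set member (the fixed one-hidden-layer tanh structure is an assumed example, not the talk's): the flattened parameter vector is the element's address on the manifold.

```python
import numpy as np

# Minimal sketch (assumed structure): a fixed-architecture NN -- one hidden
# layer, tanh activation -- whose flattened parameter vector serves as its
# "address" on the neural manifold.
def make_nn(n_in, n_hidden, n_out, rng):
    W1 = rng.standard_normal((n_hidden, n_in))
    b1 = rng.standard_normal(n_hidden)
    W2 = rng.standard_normal((n_out, n_hidden))
    b2 = rng.standard_normal(n_out)
    return (W1, b1, W2, b2)

def forward(params, x):
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ x + b1) + b2

def address(params):
    # Concatenated weights/biases = coordinates of this set element
    return np.concatenate([p.ravel() for p in params])

rng = np.random.default_rng(0)
nn = make_nn(n_in=2, n_hidden=3, n_out=1, rng=rng)
print(address(nn).shape)  # (13,) -- a point in R^13
```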

6 When the manifold's set elements are NNs, we use the label neural manifold. Care is needed relative to two aspects: 1) while each coordinate point corresponds to a distinct NN instantiation, many such points may nevertheless perform the same mapping, and 2) the set of (distinct) mappings that can be performed by this set of NNs is typically just a subset of all possible mappings from the NN's input domain to its output range (called the NN's Performance Subset).
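
The first caveat is easy to demonstrate (a sketch under the same assumed one-hidden-layer structure): permuting the hidden units gives a different coordinate point that performs the identical mapping.

```python
import numpy as np

# Minimal sketch (not from the talk): two distinct coordinate points that
# realize the same mapping. Permuting the hidden units of a one-hidden-layer
# tanh network changes the weight vector but not the input-output function.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 2)); b1 = rng.standard_normal(3)
W2 = rng.standard_normal((1, 3)); b2 = rng.standard_normal(1)

perm = [2, 0, 1]  # relabel the hidden units
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.standard_normal(2)
y  = W2  @ np.tanh(W1  @ x + b1)  + b2
yp = W2p @ np.tanh(W1p @ x + b1p) + b2
print(np.allclose(y, yp))  # True: different address, same mapping
```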

7 The mapping from Context Space to the policy manifold may, in general, be many-to-one (in controls vocabulary: changes in the plant dynamics or in its environment do not necessarily imply a needed change in control policy).

8 The indexing schemas for both a plant and a policy neural manifold may employ the weights of their respective sets of NNs. So far so good. But how does one go about crafting a mapping between, say, the plant manifold's coordinate space and that of the policy manifold? Clearly, such a mapping will be required for the Agent to select a policy based on information about the plant model. More generally, how does one craft an appropriate mapping from the full Context Space (whatever form of representation is employed) to the coordinate system of the policy manifold? The task of answering these questions is assigned to a Higher-Level Learning Algorithm (HLLA); i.e., the answers are to be learned.
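
As a toy sketch of the kind of mapping the HLLA must produce (the linear setup and synthetic data are our assumptions, not the paper's method), one can fit a plant-coordinates-to-policy-coordinates map from examples by least squares:

```python
import numpy as np

# Minimal sketch (assumed setup, not the talk's HLLA): learn a mapping from
# plant-manifold coordinates to policy-manifold coordinates by linear least
# squares, from examples of (plant coords, known-good policy coords).
rng = np.random.default_rng(2)
plant_coords  = rng.standard_normal((50, 4))   # 50 discerned plants
true_map      = rng.standard_normal((4, 6))
policy_coords = plant_coords @ true_map        # matching policy addresses

# Fit policy ~ plant @ M  (the "learned answer" the HLLA would provide)
M, *_ = np.linalg.lstsq(plant_coords, policy_coords, rcond=None)

new_plant = rng.standard_normal(4)
selected_policy_address = new_plant @ M        # index into policy repository
print(np.allclose(selected_policy_address, new_plant @ true_map))  # True
```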

9 For another aspect of mappings, consider a linear plant example, and assume the plant transfer functions in the plant manifold are given as factored polynomials, but the CF requirements are given in terms of an expanded polynomial representation (e.g., requirements given in terms of the damping coefficient of a second-order system). While the two representations are equivalent, the Agent would need a mapping between the two to accomplish the controller selection (e.g., via factoring the polynomial or, in the other direction, multiplying out the factors). In the second-order case, the notion of nearness in the CF sub-space would be in terms of the damping coefficient, whereas in the corresponding plant manifold's coordinate space, nearness would be in terms of s-plane pole locations.
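
A sketch of the two coordinate systems for the second-order case (illustrative values, not from the talk): converting between the expanded-polynomial coordinates (damping ratio and natural frequency) and the factored-form coordinates (s-plane pole locations):

```python
import numpy as np

# Minimal sketch: the two equivalent second-order representations.
# Expanded: s^2 + 2*zeta*wn*s + wn^2 (coordinates zeta, wn);
# factored: (s - p1)(s - p2)      (coordinates = s-plane pole locations).
zeta, wn = 0.5, 2.0                       # damping ratio, natural frequency
poles = np.roots([1.0, 2*zeta*wn, wn**2])
print(poles)                              # [-1.+1.732j, -1.-1.732j]

# Inverse mapping: from pole locations back to (zeta, wn)
coeffs = np.poly(poles).real              # [1, 2*zeta*wn, wn^2]
wn_back   = np.sqrt(coeffs[2])
zeta_back = coeffs[1] / (2 * wn_back)
print(zeta_back, wn_back)                 # 0.5 2.0
```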

10 As an intuitive example of notions such as efficiency, nearness, and mappings, consider a store that rents movie DVDs shelved alphabetically or by content type. Which is more efficient for the customer depends on the customer's needs and knowledge.

11 Key exploration steps ahead: Refine and further develop the ideas related to Context Discernment / System Identification developed thus far. Move into the Controller Selection aspect of the suggested EB Control method. Expand the exploration to multiple-level considerations.

12 Key exploration steps ahead (cont.): Provide one or more feasibility demonstrations in support of developing theory and techniques for populating the experience repositories, progressing from the synthetic methods already demonstrated to more and more general ones.

13 Key exploration steps ahead (cont.): Formalize ideas about and develop demonstration experiments for incorporating application domain knowledge as repository constraints of a nature that facilitates the larger objectives of improved speed of access and good generalization.

14 Key exploration steps ahead (cont.): Formalize ideas about constructing and achieving the needed mappings between components of Context and from Context to Repository, and develop demonstration examples useful for theory development. [Figure: CONTEXT (A. PLANT, B. ENVIRONMENT, C. CF) mapped to the CONTROL LAW REPOSITORY (EXPERIENCE).]

15 Key exploration steps ahead (cont.): Further develop ideas related to multi-level aspects of EB Identification & Control, leading to a Context Space Hierarchy notion, and use the associated ideas as a guide in refining and further defining the HLLA concepts and training methods.

16 Key exploration steps ahead (cont.): Formalize and further develop the role of the human designer in providing higher-level knowledge for crafting the RL process entailed in the HLLAs, particularly the designation of state variables and CFs specialized to the multi-level conceptualization. Develop demonstration examples for EB controls, similar to the successful demonstrations of the Context Discernment (system identification) part of the EB process accomplished thus far.

17 QUESTIONS?

19 Agent: a computational intelligence device (that, in this paper, is to perform the acts of context discernment and selection, along with possible design refinement).

Context Variables (Agent-centric): those attributes of i) the environment and ii) the plant/process whose variations could engender changes to the decision rule / control policy employed by the Agent while accomplishing the Agent's current objective or goal; and, in addition, iii) the criteria (representing the objective or goal) to be used for designing and subsequently selecting the decision rule or control law. [The term Criterion Function (CF) is used here to represent these criteria.]

Context Space (Agent-centric): a vector space in which each context variable is assigned to a dimension. The Context Space concept comprises three sub-spaces, one each associated with the i) Plant, ii) Environment, and iii) Criterion Function.

Context (Agent-centric): a point in Context Space; the set of values taken on by the context variables in a given situation.

Context Awareness: the act of monitoring the application to take notice (become aware) that a change may be occurring in the Context.

Context Discernment: the act or process of determining the current values of the context variables (the current point in Context Space) appropriate to the task being performed. [Webster's on-line, for discern: to recognize or identify as separate and distinct.]

Experience-Based approach: a two-component concept. Component A: a repository of previously developed context-specific models (controller or plant models). Component B: algorithms used by the Agent to effectively and efficiently select a model from the repository as changes in context occur. [Note: a key task of the Higher-Level Learning Algorithm (defined below) is to train the Agent to learn Component B.]
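
As an illustrative sketch of these definitions (the field names are ours, not the paper's), a Context is simply a point assembled from the three sub-space coordinate blocks:

```python
import numpy as np
from dataclasses import dataclass

# Minimal sketch (field names are illustrative, not from the paper): a
# Context as a point in Context Space, with one coordinate block per
# sub-space (Plant, Environment, Criterion Function).
@dataclass
class Context:
    plant: np.ndarray        # plant/process parameters
    environment: np.ndarray  # environment measurements/parameters
    cf: np.ndarray           # criterion-function parameters

    def as_point(self) -> np.ndarray:
        # The full context is the concatenation of the sub-space coordinates
        return np.concatenate([self.plant, self.environment, self.cf])

c = Context(plant=np.array([1.0, 0.2]),
            environment=np.array([0.0]),
            cf=np.array([0.5, 2.0]))
print(c.as_point())  # one point in the 5-dimensional Context Space
```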

20 Selection: the act of choosing/retrieving an appropriate element of the repository corresponding to the discerned context.

Higher-Level Learning Algorithm (HLLA): the reference level for the term "higher" is the case where the learning algorithms are applied directly to the design of optimal controllers (as in Learning Control), ones that would be accumulated in the repository (cf. Fig. 1). "Higher-Level" here means applying the learning method to create a strategy for selecting an appropriate controller from the repository, where the process of selection is optimized; thus, the focus of the learning process is at the next level up. Definition of the Utility function (a specific type of CF) is key for application of this process. Note: when the Contextual Hierarchy ideas mentioned in Section I are developed, more levels will be involved.

World Space (Agent-centric): a vector space whose dimensions are associated with designated attributes of the Agent's relevant environment, its physical body, and the external CF. [Note: this definition is included for completeness. It is not explicitly used in this paper, but is used in related publications in terms of mappings from World Space to Context Space, e.g. [39].]

Guidelines: parametric models/equations are used to represent the Plant, Criterion Function (CF), and Environment (for the latter, measurements may serve as parameters without an explicit model). Construct (conceptually) a Parameter Space that comprises three sub-spaces: (Plant, Environment, CF). The associated parameters serve as Context variables for the discernment activity; the Agent's Context Space may be a sub-space of Parameter Space. Controllers are also represented via parametric models.

23 To develop a feel for the weight update rule in the Adaptive Critic, consider a partial block diagram and a little math (discrete time). [Block diagram: R(t) → Controller (weights $w_{ij}$) → u(t) → PLANT → R(t+1) → Critic → J(t+1).] We desire a training Delta Rule for the $w_{ij}$ that minimizes the cost-to-go $J(t)$; we obtain it from $\partial J(t)/\partial w_{ij}(t)$ via the chain rule of differentiation.

24 Family of Adaptive Critic Methods: the critic approximates either 1) $J(t)$: Heuristic Dynamic Programming (HDP) (cf. Q-Learning), or 2) the gradient of $J(t)$ with respect to the state vector $R(t)$: Dual Heuristic Programming (DHP), where $\lambda(t) \equiv \partial J(R(t))/\partial R(t)$. [Today, the focus is on DHP.]

25 Overview of the Adaptive Critic method: the control engineer provides the design objectives / criteria for success through a Utility Function $U(t)$ (local cost). Then a new cost function is defined (Bellman equation),
$$J(t) = \sum_{k=0}^{\infty} \gamma^{k}\, U(t+k) \qquad [\text{cost-to-go}]$$
which is to be minimized [~ Dynamic Programming]. [We note the Bellman Recursion: $J(t) = U(t) + \gamma\, J(t+1)$.]
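
A quick numerical sanity check of the Bellman recursion on an arbitrary cost sequence (a sketch, not from the talk):

```python
import numpy as np

# Minimal sketch: numerically checking the Bellman recursion
# J(t) = U(t) + gamma * J(t+1) for a finite run of local costs.
gamma = 0.9
U = np.array([1.0, 0.5, 2.0, 0.0, 1.5])  # local costs U(t), t = 0..4

def cost_to_go(t):
    # J(t) = sum_k gamma^k * U(t+k), truncated at the end of the run
    ks = np.arange(len(U) - t)
    return float(np.sum(gamma**ks * U[t:]))

for t in range(len(U) - 1):
    assert np.isclose(cost_to_go(t), U[t] + gamma * cost_to_go(t + 1))
print("Bellman recursion verified on this cost sequence")
```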

26 The weights in the controller NN are updated with the objective of minimizing $J(t)$:
$$\Delta w_{ij}(t) = -\,\mathrm{lcoef} \cdot \frac{\partial J(t)}{\partial w_{ij}(t)},$$
where
$$\frac{\partial J(t)}{\partial w_{ij}(t)} = \sum_{k=1}^{a} \frac{\partial J(t)}{\partial u_k(t)}\, \frac{\partial u_k(t)}{\partial w_{ij}(t)}$$
and
$$\frac{\partial J(t)}{\partial u_k(t)} = \frac{\partial U(t)}{\partial u_k(t)} + \gamma\, \frac{\partial J(t+1)}{\partial u_k(t)}, \qquad \frac{\partial J(t+1)}{\partial u_k(t)} = \sum_{s=1}^{n} \frac{\partial J(t+1)}{\partial R_s(t+1)}\, \frac{\partial R_s(t+1)}{\partial u_k(t)}.$$
We call the term $\partial J(t+1)/\partial R_s(t+1) \equiv \lambda_s(t+1)$ (to be the output of the critic).

27 It follows that Controller training is based on
$$\frac{\partial J(t)}{\partial u_k(t)} = \frac{\partial U(t)}{\partial u_k(t)} + \gamma \sum_{s=1}^{n} \lambda_s(t+1)\, \frac{\partial R_s(t+1)}{\partial u_k(t)},$$
with $\lambda_s(t+1)$ supplied by the CRITIC and $\partial R_s(t+1)/\partial u_k(t)$ by the Plant Model. Similarly, Critic training is based on
$$\frac{\partial J(t)}{\partial R_s(t)} = \frac{d U(t)}{d R_s(t)} + \gamma \sum_{k=1}^{n} \lambda_k(t+1) \left[ \frac{\partial R_k(t+1)}{\partial R_s(t)} + \sum_{m} \frac{\partial R_k(t+1)}{\partial u_m(t)}\, \frac{\partial u_m(t)}{\partial R_s(t)} \right],$$
with the partials of $R(t+1)$ obtained via the Plant Model. [The Bellman Recursion and the Chain Rule are used in the above.] A plant model is needed to calculate these partial derivatives for DHP.
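
To make the two update rules concrete, here is a minimal DHP sketch for an assumed scalar linear plant (toy parameters of our choosing; not the talk's vehicle experiments). The critic is trained toward the recursion-derived target for $\lambda(t) = \partial J(t)/\partial R(t)$, and the controller descends $\partial J(t)/\partial u(t)$:

```python
import numpy as np

# Minimal DHP sketch (assumed toy setup): scalar linear plant r' = a*r + b*u,
# controller u = -k*r, linear critic lambda(r) ~ c*r approximating dJ/dr,
# utility U = q*r^2 + s*u^2.
a, b, q, s, gamma = 0.9, 0.5, 1.0, 0.1, 0.95
k, c = 0.0, 0.0                      # controller and critic parameters
lr_k, lr_c = 0.05, 0.05

rng = np.random.default_rng(3)
for _ in range(2000):
    r = rng.uniform(-1.0, 1.0)       # sampled state
    u = -k * r                       # controller action
    r_next = a * r + b * u           # plant model (also gives derivatives)
    lam_next = c * r_next            # critic output at t+1

    # Controller training: dJ(t)/du = dU/du + gamma * lambda(t+1) * dr'/du
    dJ_du = 2 * s * u + gamma * lam_next * b
    k -= lr_k * dJ_du * (-r)         # chain rule: du/dk = -r

    # Critic training target: dJ(t)/dr via Bellman recursion + chain rule
    dU_dr = 2 * q * r + 2 * s * u * (-k)   # total derivative of U w.r.t. r
    dr_next_dr = a + b * (-k)              # total derivative of r' w.r.t. r
    target = dU_dr + gamma * lam_next * dr_next_dr
    c -= lr_c * (c * r - target) * r       # squared-error gradient step

print(f"learned gain k = {k:.3f}, critic slope c = {c:.3f}")
```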

28 Utility Functions for three Design Scenarios [different combinations of the above criteria]: 1. U1 = U(1,2,3); 2. U2 = U(1,2,3,5); 3. U3 = U(1,2,3,4,5). All are applied to the task of designing a controller for an autonomous two-axle terrestrial vehicle.

32 Design Scenario 2 adds Criterion 5 ("friction sense") in U2. This is intended to 1) allow aggressive lane changes on dry pavement, and 2) make lane changes on icy roads as aggressive as the icy conditions will allow. [This was our first foray into the use of a CONTEXT variable, in this case via the Utility function.]

35 Conclusions from the Utility Function experiments: the controller designs resulting via DHP satisfy an intuitive sense of being good; each looks and feels like one a human designer might have produced. The control engineer knows that controller design requires careful specification of objectives, and that as the design criteria change, the controller changes. For DHP, the control objectives are contained in the Utility Function. The DHP process embodied the different requirements of the three design scenarios in qualitatively distinct controllers, all yielding intuitively good results according to the design constraints.
