IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 2013

Online Learning Control Using Adaptive Critic Designs With Sparse Kernel Machines

Xin Xu, Senior Member, IEEE, Zhongsheng Hou, Chuanqiang Lian, and Haibo He, Senior Member, IEEE

Abstract: In the past decade, adaptive critic designs (ACDs), including heuristic dynamic programming (HDP), dual heuristic programming (DHP), and their action-dependent variants, have been widely studied to realize online learning control of dynamical systems. However, because neural networks with manually designed features are commonly used to deal with continuous state and action spaces, the generalization capability and learning efficiency of previous ACDs still need to be improved. In this paper, a novel framework of ACDs with sparse kernel machines is presented by integrating kernel methods into the critic of ACDs. To improve the generalization capability as well as the computational efficiency of kernel machines, a sparsification method based on approximately linear dependence analysis is used. Using the sparse kernel machines, two kernel-based ACD algorithms, kernel HDP (KHDP) and kernel DHP (KDHP), are proposed and their performance is analyzed both theoretically and empirically. Because of the representation learning and generalization capability of sparse kernel machines, KHDP and KDHP can obtain much better performance than previous HDP and DHP with manually designed neural networks. Simulation and experimental results on two nonlinear control problems, a continuous-action inverted pendulum problem and a ball and plate control problem, demonstrate the effectiveness of the proposed kernel ACD methods.

Index Terms: Adaptive critic designs, approximate dynamic programming, kernel machines, learning control, Markov decision processes, reinforcement learning.

Manuscript received September 2, 2011; revised October 12, 2012; accepted December 16, 2012. Date of publication February 13, 2013; date of current version March 8, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant , Grant 98232, Grant , and Grant , the New Century Excellent Talent Program under Grant NCET-1-91, and the U.S. National Science Foundation under Grant CAREER ECCS . X. Xu and C. Lian are with the College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China (e-mail: xuxin_mail@263.net; xinxu@nudt.edu.cn). Z. Hou is with the Advanced Control Systems Laboratory, School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: zhshhou@bjtu.edu.cn). H. He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: he@ele.uri.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

I. INTRODUCTION

Reinforcement learning (RL) is a machine learning framework for solving sequential decision-making problems that can be modeled using the Markov decision process (MDP) formalism. In RL, the learning agent interacts with an initially unknown environment and modifies its action policies to maximize its cumulative payoffs [1], [2]. Although earlier RL research focused on tabular algorithms in discrete state/action spaces, approximation and generalization methods for RL have received increasing research interest in recent years.
In the literature, there are several synonyms used for RL, including approximate/adaptive dynamic programming (ADP) and neuro-dynamic programming [3] [7]. One common goal of ADP and RL is to solve the optimal control problem of MDP with large or continuous state and action spaces. Until now, RL has been shown to be a very promising framework to solve learning control problems which are difficult or even impossible for mathematical programming and supervised learning methods. However, despite some successful empirical results in real-world applications [8] [11], realizing efficient online learning control for MDPs with large or continuous space is still a difficult problem. In such cases, many RL or ADP algorithms are slow to converge and require a large amount of training samples [12]. As indicated in [1], this problem is closely related to the generalization capability of learning machines, which is the ability of a learning algorithm to perform accurately on new, unseen examples after having trained on a finite data set. In order to improve the generalization capability and learning efficiency of RL, function approximation has been a central topic in RL. Currently, there are three main categories of research work on function approximation for RL, that is, value function approximation (VFA) [13], [14], policy search [15], and actor-critic methods [16]. The actor-critic algorithms, viewed as a hybrid of VFA and policy search, have been shown to be more effective than standard VFA or policy search in online learning tasks with continuous state/action spaces [17]. In an actor-critic learning controller, there is an actor for policy learning and a critic for VFA or policy evaluation. One pioneering work on RL algorithms using the actor-critic architecture can be found in [18]. In recent years, adaptive critic designs (ACDs) [19] [23], [28] [3] were widely studied as an important class of actor-critic learning control methods for dynamical systems. Generally, ACDs can be categorized as the following major groups: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [17]. Among ACD architectures, DHP is the most popular one, which has been proven to be more efficient than HDP [19]. Although ACDs have been applied in various learning control problems [24] [26], such as aircraft control, automotive engine control, and power system control, there are still some difficult issues in the design and implementation of ACDs. The first issue is that the learning efficiency and convergence

2 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 763 of ACDs greatly rely on the empirical design of the critic, including the approximation structure and the learning rate. In ACDs, multilayer perceptron neural networks (MLPNNs) [48] were commonly used for VFA, but the structure and learning rates (step sizes) of MLPNNs have to be manually selected for good performance [27]. The second difficulty is that the robustness to uncertainties in learning control systems based on DHP or HDP still needs to be improved. In DHP and HDP, due to the local minima in neural network training, how to improve the quality of the final policies is still an open problem [22]. As suggested in [22], the most important potential extension of their results would be to characterize the quality of the converged solution in ACDs. Recent studies have attempted to approximate the optimal control solution using various ADP techniques with or without apriorisystem model [28] [3], [47]. Vamvoudakis and Lewis [29] proposed an online actor critic algorithm to solve the continuous-time infinite-horizon optimal control problem, with the assumption of known dynamics. Zhang et al. proposed a data-driven robust approximate optimal tracking control scheme for unknown general nonlinear systems [3]. Nevertheless, the above works still relied on manual settings of critic networks, and the learning control performance depended on the empirical design of basis functions. Therefore, it is desirable to develop automatic feature representation and selection methods for the critic learning of ADP approaches. As is well known, feature representation and selection is a critical factor for improving the generalization performance of machine learning algorithms. However, compared with supervised learning, there are relatively fewer works on feature representation and selection in reinforcement learning, especially in online learning control methods. For ACDs, it was pointed out recently [22] that a study on the choice of the basis functions for the critic to obtain a good estimate of the policy gradient should be done to improve the performance of ACDs. The motivation of this paper is to present a novel kernelbased feature representation method for ACDs and develop new online learning control algorithms with sparse kernel machines. Based on the theoretical and empirical results from statistical learning [31], [32], sparse kernel machines will have better generalization capability than conventional MLPNNs with manually designed structures. Therefore, the goal of this paper is to provide a new kernel-based feature representation method for ACDs, which is important to realize efficient online learning control methods for uncertain dynamical systems. Recently, kernel machines have been popularly studied to realize nonlinear and nonparametric versions of supervised or unsupervised learning algorithms [31] [32]. The main idea of kernel machines is as follows: an inner product in a highdimensional feature space can be represented as a Mercer kernel function, thus, existing learning algorithms in linear spaces can be transformed to kernel-based algorithms without explicitly computing the inner products in high-dimensional feature spaces. This idea, which is usually called the kernel trick, has been widely used in supervised and unsupervised learning problems [32]. 
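A minimal Python sketch of the kernel trick described above, assuming a degree-2 polynomial kernel with its explicit feature map and a Gaussian kernel whose (infinite-dimensional) feature map is never formed; the toy input vectors are illustrative only:

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for the degree-2 polynomial kernel k(x, y) = (x . y)^2 (2-D input)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def poly2_kernel(x, y):
    return float(np.dot(x, y)) ** 2

def gaussian_kernel(x, y, sigma=1.0):
    """Mercer kernel evaluated directly; the induced feature space is infinite-dimensional."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

# Toy input vectors (illustrative only).
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# The kernel value equals the inner product of explicit features (the kernel trick).
assert np.isclose(poly2_kernel(x, y), np.dot(poly2_features(x), poly2_features(y)))
print(gaussian_kernel(x, y))
```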
In supervised learning, the most popular kernel machines include support vector machines (SVMs) and Gaussian processes (GPs), which have been applied in many classification and regression problems. In most cases, kernel machines obtained very good results or even the stateof-the-art performance [32] [34]. In unsupervised learning, kernel principal component analysis and kernel independent component analysis were also studied by many researchers [34]. Comprehensive reviews on kernel machines can be found in [35]. The combination of kernel methods with RL and ADP has also received increased research interest in recent years. However, the function approximation problem is more difficult in RL than in supervised learning. One of the earlier works in this direction was published in [36], where kernel-based locally weighted averaging was used to approximate the state value functions of MDPs. The applications of GPs or SVMs in reinforcement learning problems were also studied in the literature, such as GPs in temporal difference [TD()] learning [37], SVMs for RL [38], and Gaussian processes in modelbased approximate policy iteration [39]. In [38], support vector regression was applied to batch learning of state value functions of MDPs with discrete state spaces, and there were no theoretical results on the policies obtained. The GP-based policy iteration method in [39] uses support points, which are usually selected by manual discretization of the state spaces, and policy evaluation is performed using the state transition model approximated by a GP model. In [4], a model-free approximate policy iteration algorithm, called least-squares policy iteration (LSPI), was presented, which offers an RL method with good properties in convergence, stability, and sample complexity. Nevertheless, the approximation structures in LSPI may lead to degraded performance when the features are improperly selected. In [41], a kernel-based least-squares policy iteration (KLSPI) algorithm was presented for MDPs with large or continuous state spaces. However, both LSPI and KLSPI are mainly restricted to solving MDPs with discrete actions. In this paper, a novel framework of ACDs with sparse kernel machines is presented by integrating kernel methods into the critic learning of ACD algorithms. A sparsification method based on the approximately linear dependence (ALD) analysis [42] is used to sparsify the kernel machines when approximating the action value functions or their derivatives. Using the sparsified kernel machines, two Kernel ACD algorithms, that is, kernel HDP (KHDP) and kernel DHP (KDHP), are proposed to realize efficient online learning control for dynamical systems. To the best of our best knowledge, there are very few works on integrating kernel methods into online learning control based on ACDs in the community. Simulation and experimental results on two nonlinear control problems, a continuous-action inverted pendulum problem and a ball and plate control problem, demonstrate that kernel ACDs can obtain much better performance than that of previous ACDs. The main contributions of this paper include the following two aspects. One is automatic feature representation using kernels for VFA in ACDs. Because of the structure learning and nonlinear approximation ability of sparse kernel machines, KHDP and KDHP can obtain much better performance than previous HDP and DHP methods with manually designed neural networks. The second is to combine sparsified kernel

3 764 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 features with the recursive least-squares TD (RLS-TD) algorithm [42] so that faster learning speed can be realized in the critic of kernel ACDs. As studied in [16] and [22], the convergence of actor-critic algorithms can be ensured based on the principle of two-timescale stochastic approximations, which are characterized by coupled stochastic recursions that are driven by two different step size schedules. According to the results in [22], when linear function approximators are used, actor-critic algorithms can be proved to converge if the learning process in the critic is a faster recursion than the actor. Thus, when faster learning speed is realized by using RLS-TD in the critic, kernel ACDs can be expected to have improved performance in convergence. The idea of kernelbased VFA can also be applied to other ADP methods for learning control of dynamical systems [28] [3]. In recent studies on ADP methods, VFA is still a central problem and it can be expected that new kernel-based ADP algorithms can be developed. In the following, we will focus on kernel methods in popularly used ACDs including HDP and DHP, and the extension of kernel methods in other ADP algorithms is a promising direction for future work. The rest of this paper is organized as follows. In Section II, some research backgrounds on MDPs and the ALD-based kernel sparsification process are introduced. In Section III, the framework of ACDs with sparse kernel machines is presented and the KHDP and KDHP algorithms are proposed. The performance of kernel ACDs is analyzed from two perspectives. One is the performance error of critic learning and the other is the convergence of the actor-critic learning control process. In Section IV, simulation and experimental results on two nonlinear learning control systems are provided to illustrate the effectiveness of the proposed method. Finally, conclusions and future work are summarized in Section V. II. BACKGROUND A. Markov Decision Processes An MDP M is denoted as a quadruple {X, A, R, P}, where X is the state space, A is the action space,p is the state transition probability, and R is the reward function. A stochastic stationary policy π(or just stationary policy) maps states to distributions over the action space. When referring to such a policy π, weuseπ(a x) to denote the probability of selecting action a in state x by π. A deterministic stationary policy directly maps states to actions, denoted as a t = π(x t ), t. (1) When the actions a t (t ) satisfy (1), policy π is followed in the MDP M. A stochastic stationary policy π is said to be followed in the MDP M if a t π(a x t ), t. The objective of a learning controller is to estimate the optimal policy π satisfying J π = max π J π = max π E π[ ] γ t r t where < γ < 1 is the discount factor and r t is the reward at time step t, E π [ ] stands for the expectation with respect to the policy π and the state transition probabilities, and J π is the t= (2) expected total reward along the state trajectories by following policy π. In this paper, J π is also called the performance value of policy π. The state value function V π (x) of a policy π is the expected, discounted total rewards when starting from x and following policy π thereafter [ ] V π (x) = E π γ t r t x = x. 
(3) t= Similarly, the state action value function Q π (x,a) is defined as the expected, discounted total rewards when taking action a in state x and following policy π thereafter Q π (x, a) = E π[ t= ] γ t r t x = x, a = a. (4) For an MDP, a deterministic optimal policy π (x) maximizes the expected, discounted total reward of state x π (x) = arg max Q π (x, a). (5) a B. ALD-Based Kernel Sparsification Let X denote the original state space. A kernel function is a mapping from X X to R, which is usually assumed to be continuous. A Mercer kernel is a kernel function that is positive definite, that is, for any finite set of points {x 1, x 2,..., x n }, the kernel matrix K = [k(x i, x j )](1 i, j n) is positive definite. According to the Mercer theorem [32], there exists a Hilbert space H and a mapping φ from X to H such that k(x i, x j ) =< φ(x i ), φ(x j )> (6) where <, > is the inner product in H. Although the dimension of H may be infinite and the nonlinear mapping φ is usually unknown, all the computation in the feature space can still be performed if it is in the form of inner products. As introduced in [42], in the ALD analysis, after the sample collection process, the kernel-based features are constructed in a data-driven way. Let S n = {s 1, s 2,...,s n }denote a set of data samples and φ be a feature mapping on the data, which can be determined by the Mercer kernel function defined in (6). A feature vector set can be obtained as n = {φ(s 1 ), φ(s 2 ),...,φ(s n )}, φ(s i ) R m 1, i = 1, 2,...,n. To perform ALD analysis on the feature vector set, a data dictionary is defined as a subset of the feature vector set. The data dictionary D is initially empty and the ALD analysis is implemented by testing every feature vector in n, one at a time. If a feature vector φ(s) cannot be approximated within a predefined precision by the linear combination of the feature vectors in the dictionary, it will be added to the dictionary. Otherwise, it will not be added to the dictionary. Thus, after the ALD analysis process, all the feature vectors of the data samples in S n can be approximately represented by linear combinations of the feature vectors in the dictionary within a given precision.
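A minimal Python sketch of the discounted objective in (2)-(4), assuming an illustrative two-state MDP and stochastic policy that do not appear in the paper; V^pi(x) is estimated by averaging truncated discounted returns over Monte Carlo rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95

# Toy 2-state, 2-action MDP: P[s, a] gives next-state probabilities, R[s, a] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
policy = np.array([[0.7, 0.3],   # pi(a | s): probability of each action in each state
                   [0.4, 0.6]])

def discounted_return(s, horizon=200):
    """One truncated rollout of sum_t gamma^t r_t starting from state s under pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=policy[s])
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])
    return g

# Monte Carlo estimate of V_pi(0) as in (3).
print(np.mean([discounted_return(0) for _ in range(2000)]))
```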

4 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 765 The ALD-based sparsification procedure mainly includes two steps. The first step is to compute the following optimization solution: 2 δ t = min c c j φ(s j ) φ(s t ). (7) s j D t Due to the kernel trick, after substituting (6) into (7), we obtain δ t = min{c T K t 1 c 2c T k t 1 (s t ) + k tt } (8) c where [K t 1 ] i, j = k(s i, s j ), s i (i = 1, 2,...,d(t 1)) are the elements in the dictionary, d(t 1) is the length of the data dictionary, k t 1 (s t ) = [k(s 1,s t ), k(s 2, s t ),..., k(s d(t 1),st )] T, c = [c 1, c 2,..., c d ] T,andk tt = k(s t, s t ). The optimal solution for (8) is c t = Kt 1 1 k t 1(s t ) (9) δ t = k tt kt 1 T (s t )c t. (1) The second step of the ALD-based sparsification is to update the data dictionary by comparing δ t with a predefined threshold μ. Ifδ t < μ, the dictionary is unchanged, otherwise, s t is added to the dictionary, that is, D t = D t 1 s t. After the sparsification procedure, a data dictionary D n with reduced number of data vectors is obtained and the approximated state action value function or its derivative is represented as follows: d(n) Q(x, a) = α j k(s, s j ) (11) j=1 d(n) λ(x) = α j k(x, x j ) (12) j=1 where d(n), usually much smaller than the original sample size n, is the length of the dictionary D n, s j = s(x j, a j ),and x j ( j =1,2,..., d(n)) are the elements of the data dictionary. III. ACDS WITH SPARSE KERNEL MACHINES A. Framework of Kernel ACDs A general framework of ACDs with sparse kernel machines is shown in Fig. 1. The main components of kernel ACDs include a critic, a kernel-based feature learning module, a reward function, an actor/controller, and a model of the plant. The kernel-based feature learning module is to implement data-driven feature representation and learning so that better learning efficiency and generalization performance can be obtained for ACDs. The critic is used to approximate the value functions or their derivatives. In the proposed framework, the kernel function and its induced feature space play important roles in the critic learning process. Since kernel-based features are in linear forms, the RLS-TD learning algorithms can be employed in the critic. The actor or controller receives measurement data about the plant s current state x t and outputs the control u t. The output of the critic is used in the training process of the actor so that policy gradients can be computed. The plant model receives the control u t, and estimates the next Algorithm 1 Kernel ACDs Input: k(.,.): a Mercer kernel function g(x,θ): the approximation structure in the actor S = {s i s i = (x i, a i )} N : asampleset 1) Initialize: A kernel dictionaryd = NULL, actor weights θ = θ, critic weights α = α, step size in the actor β = β. 2) For i = 1, 2,..., Size(D) Compute δ t using (8); If δ t μ Add s i to D; End if End for 3) Let t = ; 4) Loop: t = t + 1; Draw action a t = g(x t, θ t ); Get reward r t ; Observe next state x t+1 ; Compute feature vector k(s t ) and k(s t+1 ); Update θ and α according to (35) and (27) or (56) and (53); Until the termination criterion is satisfied 5) Return the final policy in the actor. Fig. 1. Critic V ( x t ) λ( x t ) Actor at ˆ +1 x t Kernel-based feature learning Model Plant xt Learning control structure of kernel ACDs. r t Reward function state x t+1. The state data are provided to the critic and to the reward function. 
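A minimal Python sketch of the ALD-based sparsification in (7)-(10), assuming a Gaussian Mercer kernel, random sample data, and a threshold of 0.1 for illustration:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

def build_dictionary(samples, mu=0.1, sigma=1.0):
    """ALD-based sparsification: keep s_t only if it is not approximately
    linearly dependent on the feature vectors already in the dictionary."""
    dictionary = []
    for s_t in samples:
        if not dictionary:
            dictionary.append(s_t)
            continue
        K = np.array([[gaussian_kernel(si, sj, sigma) for sj in dictionary]
                      for si in dictionary])                      # K_{t-1}
        k_t = np.array([gaussian_kernel(si, s_t, sigma) for si in dictionary])
        c_t = np.linalg.solve(K, k_t)                             # optimal coefficients, (9)
        delta_t = gaussian_kernel(s_t, s_t, sigma) - k_t @ c_t    # ALD error, (10)
        if delta_t >= mu:                                         # ALD test against threshold mu
            dictionary.append(s_t)
    return dictionary

# Illustrative random samples (not from the paper).
samples = np.random.default_rng(1).uniform(-1, 1, size=(200, 2))
D = build_dictionary(samples, mu=0.1)
print(f"{len(D)} dictionary vectors kept from {len(samples)} samples")
```

Samples whose feature vectors can already be approximated within the threshold by the current dictionary are discarded, which is why the dictionary stays small even for long sample sequences.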
In some ACDs, such as DHP, by making use of the plant model, x t+1 is provided for a second pass through the critic so that V (x t+1 ) can be obtained for critic training. Algorithm 1 shows the proposed kernel ACDs, which include two main procedures, that is, a kernel-based feature construction process and an online learning control process. The sample collection process for kernel feature construction can be realized either by collecting data when a conventional controller is used or by observing the MDP running with an initially randomized policy in the actor. The data samples are in the form of state transitions {(x 1, a 1 ),(x 2, a 2 ),...,(x n, a n )}. Based on the data samples, the ALD-based kernel sparsification procedure, which was introduced in Section II-B, can be performed offline before the online learning process of ACDs. Since HDP and DHP are the most widely studied ACDs, we will focus on integrating sparse kernel machines into these two online learning control methods. In HDP, the aim of the

5 766 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 critic is to approximate the value functions or action value functions, whereas in DHP, the derivatives of value functions are approximated in the critic. So, in the proposed kernel ACDs, a recursive algorithm of KLSTD [44] will be used and the action value function or the value function derivative is approximated as t Q(s) = α i k(s, s i ) (13) λ(x) = t α i k(x, x i ) (14) where s and s i are the combined features of the state action pairs (x, a) and (x i, a i ), respectively, α i (i = 1, 2,..., t) are the weights, and (x i, a i ) (i = 1, 2,..., t) are selected state action pairs in the sample data, that is, trajectories generated from a Markov decision process. B. KHDP Algorithm In the critic of KHDP, the action value function Q(x, a) is approximated in a linear weighted form, where a Mercer kernel function k(x, y) = <φ(x), φ(y)> is employed to realize the feature mapping in a reproducing kernel Hilbert space (RKHS). Let s t = (x t, a t ) denote the state action pair at time step t. Then, the action value function Q(x t, a t ) can also be expressed as Q(s t ). As studied in [4], the regression equation for the linear LS-TD() (λ = ) algorithm is ] ] E [φ(s t )( Q(s t ) γ Q(s t+1 )) = E [φ(s t )r t (15) where E is the expectation with respect to the state transition probability when following a stationary policy and Q(s) = φ T (s)w, φ,w R q 1. (16) Equation (15) can be rewritten as [ ] E [φ(s t )(φ T (s t ) γφ T (s t+1 )) ]W = E φ(s t )r(s t ). (17) The observation equation of (17) is as follows: φ(s t )(φ T (s t ) γφ T (s t+1 ))W = φ(s t )r t + ε t (18) where ε t is the one-step observation noise. Due to the property of RKHS, the weight vector W in (18) can be represented by the weighted sum of the state feature vectors T W = φ(s i )α i (19) where s i (i = 1, 2,...,T )are the selected state action pairs after the ALD analysis, T is the number of selected samples, and α i are the coefficients. Let T = (φ T (s 1 ), φ T (s 2 ),...,φ T (s T )) T (2) k(s t ) = (k(s 1, s t ), k(s 2, s t ),...,k(s T, s t )) T. (21) By multiplying T to both sides of the observation equation (18), due to the kernel trick, we get k(s t )[ k T (s t ) α γ k T (s t+1 ) α] = k(s t )r t + ν t (22) where v t R T 1 is a transformed noise vector and Let A T = b T = α =[α 1,α 2,...,α T ] T. (23) N k(s t )[ k T (s t ) γ k T (s t+1 )] (24) t=1 N k(s t )r t (25) t=1 where N is the total number of samples. Then, the kernel-based least-squares fixed-point solution to the TD learning problem is as follows: α = A 1 T b T. (26) To realize online learning in the critic, the following update rules based on the kernel RLS-TD() algorithm are used in the critic of KHDP. Critic Update in KHDP: β t+1 = P t k(s t )/(μ + ( k T (s t ) γ k T (s t+1 ))P t k(s t )) α t+1 = α t + β t+1 (r t ( k T (s t ) γ k T (s t+1 )) α t ) P t+1 = 1 [ P t P tk(s t )( k T (s t ) γ k T ] (s t+1 ))P t [ ] μ μ + ( k T (s t ) γ k T (s t+1 ))P t k(s t ) (27) where β t is the step size in the critic, μ( <μ 1) is the forgetting factor, P = δi, δ is a positive number, and I is the identity matrix. The actor network in KHDP uses MLPNNs to approximate the policy function a t = g(x t, θ t ). (28) In this paper, the learning control objective is to minimize or maximize the following total discounted reward: [ ] J(x) = V (x) = E γ t r t x = x. (29) where < γ < 1 is the discount factor. 
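A minimal Python sketch of the KHDP critic with the kernel RLS-TD update in (27), assuming a Gaussian kernel over combined state-action vectors; the dictionary contents and the constants mu, delta, and sigma are illustrative choices, not values from the paper:

```python
import numpy as np

class KernelRLSTDCritic:
    """Recursive least-squares TD critic over sparse kernel features, a sketch of (27);
    the dictionary, kernel width, and initialization constants are assumptions."""

    def __init__(self, dictionary, gamma=0.95, mu=1.0, delta=100.0, sigma=1.0):
        self.D = np.asarray(dictionary)      # state-action pairs kept by the ALD analysis
        self.gamma, self.mu, self.sigma = gamma, mu, sigma
        self.alpha = np.zeros(len(self.D))   # critic weights
        self.P = delta * np.eye(len(self.D))

    def k(self, s):
        """Kernel feature vector k(s) against the dictionary."""
        return np.exp(-np.sum((self.D - s) ** 2, axis=1) / self.sigma ** 2)

    def q(self, s):
        return self.k(s) @ self.alpha        # Q(s) as in (13)

    def update(self, s_t, r_t, s_next):
        k_t, k_next = self.k(s_t), self.k(s_next)
        d = k_t - self.gamma * k_next        # temporal-difference feature vector
        denom = self.mu + d @ self.P @ k_t
        beta = self.P @ k_t / denom          # gain vector beta_{t+1}
        self.alpha += beta * (r_t - d @ self.alpha)
        self.P = (self.P - np.outer(self.P @ k_t, d @ self.P) / denom) / self.mu

# Illustrative usage with a random dictionary of (x1, x2, a) vectors.
critic = KernelRLSTDCritic(dictionary=np.random.default_rng(0).normal(size=(6, 3)))
critic.update(np.array([0.1, 0.0, 0.2]), r_t=-1.0, s_next=np.array([0.12, 0.01, 0.2]))
print(critic.q(np.array([0.1, 0.0, 0.2])))
```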
In this paper, we will mainly focus on deterministic MDPs, and the reward function is defined as nonpositive or nonnegative values. For nonpositive reward functions, the learning objective is to maximize the expected total discounted reward. For nonnegative reward functions, the learning objective is to minimize the expected total discounted reward. Therefore, the following cost function is used in the actor to realize the learning control objective: t= E a = 1 2 Q2 (x, a) (3) Since the minimization of cost function (3) is equivalent to minimize J(x) when Q(x, a) is nonnegative or maximize J(x) when Q(x, a) is nonpositive, the policy gradient learning rule in the actor can be designed as θ t = E a = Q(x t, a t ) Q(x t, a t ) a t. θ t a t θ t (31)
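A minimal Python sketch of an actor step built on the cost function (30) and policy gradient (31), assuming a linear policy a = theta . x for clarity (the paper uses an MLP actor) and the Gaussian-kernel form of Q and its action derivative given in (32)-(34) below; the dictionary and weights are illustrative:

```python
import numpy as np

def q_and_grad_a(s, dictionary, alpha, sigma=1.0):
    """Q(s) = sum_i alpha_i exp(-||s - s_i||^2 / sigma^2) and dQ/da, where the action
    is the last component of s (the Gaussian-kernel form given in (32)-(34) below)."""
    diffs = dictionary - s
    k_vec = np.exp(-np.sum(diffs ** 2, axis=1) / sigma ** 2)
    q = alpha @ k_vec
    dq_da = alpha @ (-2.0 * (s[-1] - dictionary[:, -1]) / sigma ** 2 * k_vec)
    return q, dq_da

def actor_step(theta, x_t, dictionary, alpha, eta=0.01, sigma=1.0):
    """One gradient step minimizing E_a = Q^2 / 2 as in (30)-(31); the linear
    policy a = theta . x is an illustrative assumption (the paper uses an MLP actor)."""
    a_t = theta @ x_t                      # a_t = g(x_t, theta)
    s_t = np.append(x_t, a_t)              # combined state-action vector
    q, dq_da = q_and_grad_a(s_t, dictionary, alpha, sigma)
    da_dtheta = x_t                        # gradient of the linear policy w.r.t. theta
    return theta - eta * q * dq_da * da_dtheta

# Illustrative call with random dictionary entries and critic weights.
rng = np.random.default_rng(0)
D = rng.normal(size=(5, 3))                # dictionary of (x1, x2, a) vectors
alpha = rng.normal(size=5)
theta = actor_step(np.zeros(2), np.array([0.2, -0.1]), D, alpha)
print(theta)
```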

6 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 767 When Gaussion kernels are used, the approximated action value function is T T Q(x, a) = α i k(s, s i ) = α i e s s i 2 /σ 2 (32) where x = (x (1), x (2),...,x (m) ) T, s = (x (1), x (2),...,x (m), a) is the combined vector of the state action pair (x, a), and is defined as n s s i = (x ( j) x i( j) ) 2 + (a a i ) 2. (33) j=1 On the basis of the definition in (32), we have Q(x t, a t ) a t = T 2α i (a t a i ) σ 2 e s t s i 2 /σ 2. (34) Then, the actor learning rule in KHDP is as follows. Actor Update in KHDP: θ t = θ t η t θ t = θ t η t Q(x t, a t ) a t θ t T where η t is the step size in the actor. α i (a t a i ) σ 2 e ( s t s i ) 2 /σ 2 (35) C. KDHP Algorithm The critic learning in KDHP is to approximate the derivatives of state value functions, which satisfy the following Bellman equation: V (x t ) k=1 = R t + γ E[V (x t+1)] (36) where R t is the expected reward and E[.] is with respect to the state transition probability when following a stationary policy. Let λ(x t ) = V (x t) (37) λ(x t+1 ) = V (x t+1). +1 (38) If x t and a t are 1-D values, the following relation holds: V (x t+1 ) = V (x t+1) +1 + V (x t+1) +1 a t a t = λ(x t+1 ) +1 + λ(x t+1 ) +1 a t a t (39) If x t = [x i (t)] n 1 and a t = [u i (t)] m 1 are multidimensional vectors, equation (39) becomes V (x t+1 ) n V (x t+1 ) x i (t + 1) = x j (t) x i (t + 1) x j (t) n m V (x t+1 ) x i (t + 1) a k (t) + x i (t + 1) a k (t) x j (t) = n λ(x i (t + 1)) x i(t + 1) x j (t) n m + λ(x i (t + 1)) x i(t + 1) a k (t) (4) a k (t) x j (t) k=1 where m and n are the dimensions of a t and x t, respectively. To simplify notations, we only show the results when x t and a t are 1-D variables, therefore (39) is employed. The extensions to multidimensional state and control vectors can be done by considering (4) instead of (39). Then, (36) can be rewritten as λ(x t ) = R [ ( t xt+1 a )] t + γ E λ(x t+1 ) (41) + +1 a t where +1 / and +1 / a t can be computed based on the model network in Fig. 1, and a t / can be computed on the basis of the actor network. Suppose the following nonlinear mappings are implemented by the model network and the actor network, respectively: x t+1 = f (x t, a t ) (42) a t = g(x t, θ t ) (43) where θ t is the weight vector of the actor network. Then, the derivatives in the right-hand side of (41) can be obtained as +1 = f (x t, a t ) (44) a t = g(x t, θ t ). (45) The temporal differences can be defined as δ(t) = r ( t xt+1 + γ + +1 a ) t λ(x t+1 ) λ(x t ). (46) a t In the critic learning of KDHP, a kernel-based approximation structure is considered to approximate λ(x t ).Atfirst, consider the following approximation structure in linear forms: λ(x t ) = V (x t) = φ T (x t )W = l φ j (x t )w j (47) where φ(x t ) = [φ 1 (x t ), φ 2 (x t ),...,φ l (x t )] T is a vector of basis functions, x t is the input state of the critic, and W = [w 1, w 2,..., w l ] T is the weight vector. By multiplying φ(x t ) to both sides of (41), the fixed-point equation for linear LS-TD() algorithms is derived [ [ ( xt+1 E φ(x t ) λ(x t ) γ + +1 a ) ]] t λ(x t+1 ) a t Let j=1 = φ(x t ) R t (48) D(x t ) = a t. (49) a t Equation (48) can be rewritten as [ ] E φ(x t )(φ T (x t ) γ D(x t )φ T (x t+1 )) W = E [ φ(x t ) r t ]. (5) Assume that x i (i = 1, 2,...,T )are the selected states after the ALD analysis, k(x, y) =<φ(x), φ(y)> is a Mercer kernel, and T is the number of selected samples. Similar to the

7 768 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 derivation of (22), by using the kernel trick, we can also obtain the least-squares fixed point equation for approximating λ(x) in the form of kernel-based features: k(x t )[ k T (x t ) α γ D(x t ) k T (x t+1 ) α] = k(x t )r t + ν t (51) where v t R T 1 is a noise vector, α =[α 1,α 2,...,α T ] T is the coefficient vector for approximating λ(x) and ( T k(x t ) = k(x 1, x t ), k(x 2, x t ),...,k(x T, x t )). The kernel-based RLS-TD update rules for critic learning in KDHP are as follows. Critic Update in KDHP: β t+1 =P t k(x t )/(μ+ ( k T (x t ) γ D(x t ) k T (x t+1 ))P t k(x t )) (52) ( rt ) α t+1 = α t + β t+1 ( k T (x t ) γ D(x t ) k T (x t+1 )) α t (53) ( P t+1 = 1 [ k T (x t ) γ D(x t ) k T (x t+1 )) P t ] P t P t k(x t ) ) μ μ+( k T (x t ) γ D(x t ) k T (x t+1 ) P t k(x t ) (54) where β t is the step size in the critic, μ( < μ 1) is the forgetting factor, P = δi, δ is a positive number, and I is the identity matrix. The actor network is used to generate the control actions based on the observed states of the plant. The output of the actor is given by (43). The learning objective of the actor is to minimize the performance value of the closed-loop system, which can be computed by the value functions of the MDP [ ] J(x) = V (x) = E π γ t r t x = x. (55) In KDHP, based on the outputs of the critic, the following policy gradient methods can be used to train the actor: Actor Update in KDHP: V (x t+1 ) a t θ t+1 = θ t η t θ t = θ t η t t= = θ t η t λ(x t+1 ) +1 a t a t θ t a t. (56) θ t Since λ(x t+1 ) and +1 / a t can be computed by the critic and the model network, respectively, and a t / θ t is given by (45), the above policy gradient learning can be implemented along with the critic learning. D. Performance Analysis and Discussions Compared with recent attempts in ADP methods for modelfree learning control, one advantage of kernel ACDs is that the manual selection of approximation structures in the critic is avoided and automatic feature construction and selection can be realized to improve the approximation and generalization capability of ACDs. Furthermore, by making use of the generalization capability of sparse kernel machines, which has been verified in the literature [35], [41], better learning control performance can be obtained. In the critic training of ACDs, the TD(λ) algorithm was popularly used to approximate the value functions or their derivatives, where function approximators were employed to realize generalization in large or continuous spaces. However, for TD(λ) with nonlinear approximators, for example, MLPNNs, there are no convergence proofs, and some divergence counterexamples were found in previous studies [45]. According to the recent theoretical results in [16] and [22], the convergence of ACDs can be ensured based on twotimescale stochastic approximations, where the critic needs to implement a faster recursion than the actor. In kernel ACDs, by making use of the kernel-based features, which are in a form of linear basis functions, the RLS-TD algorithm [43] is used to approximate the value functions or their derivatives with improved data efficiency and stability. As shown in [44], the kernel-based LS-TD algorithm is superior to conventional linear or nonlinear TD algorithms in terms of fast convergence rates. Therefore, with faster learning in the critic, kernel ACDs can have better performance than previous ACDs in terms of convergence rates. 
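A minimal Python sketch of the KDHP updates (52)-(54) and (56) for a scalar state and action, assuming that the reward derivative dr_t/dx_t, the term D(x_t), dx_{t+1}/da_t, and da_t/dtheta_t are supplied by the reward, model, and actor networks as in (41)-(45); the dictionary and constants are illustrative:

```python
import numpy as np

def k_vec(x, dictionary, sigma=1.0):
    return np.exp(-(np.asarray(dictionary) - x) ** 2 / sigma ** 2)

def kdhp_critic_update(alpha, P, x_t, x_next, dr_dx, D_t,
                       dictionary, gamma=0.95, mu=1.0, sigma=1.0):
    """One kernel RLS-TD step for lambda(x) = dV/dx as in (52)-(54), scalar-state case.
    dr_dx is dr_t/dx_t and D_t = dx_{t+1}/dx_t + dx_{t+1}/da_t * da_t/dx_t, both assumed
    to be provided by the model and actor networks as in (41)-(45)."""
    kt, kn = k_vec(x_t, dictionary, sigma), k_vec(x_next, dictionary, sigma)
    d = kt - gamma * D_t * kn
    denom = mu + d @ P @ kt
    beta = P @ kt / denom
    alpha = alpha + beta * (dr_dx - d @ alpha)
    P = (P - np.outer(P @ kt, d @ P) / denom) / mu
    return alpha, P

def kdhp_actor_update(theta, lam_next, dxnext_da, da_dtheta, eta=0.01):
    """Policy-gradient step (56): theta <- theta - eta * lambda(x_{t+1}) * dx_{t+1}/da_t * da_t/dtheta."""
    return theta - eta * lam_next * dxnext_da * da_dtheta

# Illustrative usage with a small 1-D dictionary.
dictionary = np.linspace(-1.0, 1.0, 7)
alpha, P = np.zeros(7), 100.0 * np.eye(7)
alpha, P = kdhp_critic_update(alpha, P, x_t=0.1, x_next=0.05,
                              dr_dx=-0.2, D_t=0.9, dictionary=dictionary)
theta = kdhp_actor_update(0.5, lam_next=alpha @ k_vec(0.05, dictionary),
                          dxnext_da=0.02, da_dtheta=0.1)
print(alpha[:3], theta)
```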
1) Performance Error of Critic Learning in Kernel ACDs: In KHDP and KDHP, sparsification of kernel machines is implemented based on the ALD analysis so that the kernelbased features have approximately linear independence. The following Lemma 1 shows that the kernel dictionary obtained by the ALD-based sparsification procedure is finite even if infinite samples are used. Lemma 1 [42]: For the ALD-based kernel sparsification procedure, assume that 1) k(.,.) is a continuous Mercer kernel and 2) S is a compact subset of a Banach space. Then, for any training sequence {x i } X (i = 1, 2,..., ) and for any μ >, the number of dictionary vectors is finite. In Lemma 1, it is shown that if the original state space X is compact, the ultimate dictionary set will be finite regardless of the dimension of the Hilbert space H. In the following, to simplify the notation, a countable state action space is considered, but the results on TD learning can also be extended to general spaces [45]. Let the cardinality of the states be N. The kernel matrix can be denoted as [ ] T K = k(x 1 ), k(x 2 ),..., k(x N ) R N d (57) where d is the number of dictionary vectors. Let α be the critic s weight vector, Ṽ (α) be the approximated value function using kernel machines, and θ t be the actor s weight vector. Since θ t is updated in a slower timescale than the critic, the policy π(θ t ) determined by θ t is also slowly varying. In the following, we will analyze the approximation error of kernel-based RLS-TD learning when the actor s policy is stationary or changes very slowly. An MDP with a stationary action policy π can be viewed as an equivalent Markov reward process with state transition probability P. Suppose μ is the unique distribution that satisfies μ T P = μ T with μ(i) > for all i X and μ is a finite or infinite vector, depending on the cardinality of X. The theoretical results in [4] and [46] show that when LS-TD or RLS-TD converges, a fixed-point solution can be obtained to minimize the projected Bellman residual errors min α J α = min α Ṽ (α) T Ṽ (α) (58)

8 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 769 where = ( T D ) 1 T D, = [φ 1, φ 2,...,φ n ] R N n, T is the Bellman operator, and D = diag{μ(i)}. In kernel ACDs, the projection operator is determined by the sparsified kernel features and the ALD-based sparsification can be viewed as a regularization procedure for the optimization problem in (58), where the objective function becomes min α J α = min α ˆV (α) T ˆV (α) + h(α) (59) where h(α) is the structural risk of the kernel machines. Although the combined objective function in (59) may be minimized in a synchronized way, it is sequentially optimized in kernel ACDs. At first, by using the ALD-based sparsification criterion, the structural risk h(α) is reduced. Then, the combined objective function in (59) is optimized by kernel LS-TD using the kernel dictionary obtained by the ALD analysis. In [22], it is proved that if the policy parameters change slowly, the critic weight vector can converge to a solution determined by the actor s policy π(θ). The next problem is the approximation error between the true value function V π(θ) (x) and the solution based on the least-squares fixedpoint equation (58). Since the kernel-based RLS-TD learning essentially implements linear TD learning using kernel-based feature vectors, (58) is equivalent to the fixed-point equation of linear LS-TD learning algorithms. Therefore, based on the analysis of TD learning using linear basis functions in [45], the following relation holds: Ṽ (α) V π(θ) D 1 λγ 1 γ V π(θ) V π(θ) D (6) where D = diag{μ(i)}, = K (K T DK) 1 K T D, λ ( λ 1) is the parameter for eligibility traces, and X D = X T DX. Since is determined by the sparsified kernel features, inequality (6) also shows that by appropriately selecting and sparsifying the kernel-based features, the approximation error bounds of value functions can also be reduced. E. Convergence Analysis of Kernel ACDs Similar to the analysis in [22], the update rules (53) and (56) in ACDs can be modeled as a general setting of two-timescale stochastic approximations X t+1 = X t + β t ( f (X t, Y t ) + Nt+1 1 ) (61) Y t+1 = Y t + γ t (g(x t, Y t ) + Nt+1 2 ) (62) where f and g are Lipschitz continuous functions and {Nt+1 1 }and {N2 t+1 } are martingale difference sequences with respect to the field [ N i 2 ] E t+1 F t D 1 (1+ X t 2 + Y t 2 ), i = 1, 2, t ] (63) for some constant D 1 <. In KHDP and KDHP, the learning rules in the critic use recursive least-squares methods and the step sizes are adaptively determined by online computation rules (27) and (52), respectively. When the update in the critic is a faster recursion than the update in the actor, the weights in the critic have uniformly higher increments compared to the weights in the actor. To analyze the convergence of kernel ACDs based on twotimescale stochastic approximations, the following ordinary differential equations can be considered: Ẋ = f (X (t), Y ) (64) where Assumptions (A1) (A3) hold. (A1) sup X t, sup Y t < ; t t (A2) Ẋ = f (X (t), Y ) has a globally asymptotically stable equilibrium μ(y ), where μ(.) is a Lipschitz continuous function; (A3) Ẏ = g(μ(y (t)), Y (t)) has a globally asymptotically stable equilibrium Y ; In [22], the main convergence result was obtained for twotimescale stochastic approximations: Theorem 1: Under Assumptions (A1) (A3), the updates in (61) and (62) converge asymptotically to the equilibrium, that is, (X t, Y t ) (μ(y ), Y ) as t, with probability one. 
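A toy Python sketch of the two-timescale setting in (61)-(64), assuming simple scalar recursions and step-size schedules chosen only to illustrate the mechanism behind Theorem 1: the fast (critic-like) variable tracks its equilibrium mu(Y) while the slow (actor-like) variable converges; none of this is the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = 5.0, 5.0
for t in range(1, 20001):
    beta_t = 1.0 / t ** 0.6      # fast step size: sums to infinity, squares summable
    eta_t = 1.0 / t              # slow step size: eta_t / beta_t -> 0
    # f(X, Y) = Y - X has the stable equilibrium mu(Y) = Y; g(X, Y) = -X drives Y to 0.
    X += beta_t * ((Y - X) + 0.1 * rng.normal())   # fast recursion, cf. (61)
    Y += eta_t * ((-X) + 0.1 * rng.normal())       # slow recursion, cf. (62)
print(f"X ~ {X:.3f}, Y ~ {Y:.3f}  (both approach the equilibrium (0, 0))")
```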
In KHDP and KDHP, by appropriately selecting the actor s step sizes, it can be expected that the update in the critic is a faster recursion than the update in the actor, and the weights in the critic have uniformly higher increments as compared with the weights in the actor. In [22], when the update in the critic is a faster recursion than the actor, it was proved that a class of actor-critic algorithms with linear function approximators will converge almost surely to a small neighborhood of a local minimum of the averaged reward J. InkernelACDs, by making use of kernel-based features and the RLS-TD algorithm in the critic, the updates in the critic can be a faster recursion than the actor. Thus, it will be more beneficial to ensure the convergence of the online learning process. In Section IV, extensive performance tests and comparisons were conducted and it was shown that kernel ACDs have much better performance than conventional ACDs both in terms of convergence speed and in terms of the quality of the final policies. IV. SIMULATION AND EXPERIMENTAL RESULTS A. Inverted Pendulum Problem The inverted pendulum problem has been widely studied as a benchmark control problem with nonlinearity and instability. In the following, simulation and experimental studies will be conducted on the inverted pendulum problem to compare the performance of different RL algorithms. In simulation, the performance of kernel ACDs is compared with that of ACDs under different conditions and parameter settings. The nearoptimal policies of different algorithms are also implemented in a real inverted pendulum system to test the performance of different controllers. The aim of the learning controller is to balance the pole as long as possible and make the angle variations of the pendulum be as small as possible. The dynamics equations are assumed to be unknown or only partially known for the learning controller. For HDP and KHDP, the reward r is always before the pole angle or the position of the cart exceeds

9 77 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 average trials average trials average trials successful rate successful rate successful rate actor module learning rate cart mass/kg pole length/m KDHP KHDP DHP HDP (a) actor module learning rate cart mass/kg pole length/m KDHP KHDP DHP HDP (b) Fig. 2. Performance comparisons between K-ACDs and ACDs under different parameter settings such as (a) success rates and (b) average trials. the boundary conditions, that is, if θ 12, x 1.2m, r(t) = ; else r(t) = 1. For DHP and KDHP, a differentiable reward function is defined as r(t) =.5(x 2 + θ 2 ). The simulation time step is.2 s. A learning controller is regarded to be successful when its final policy can balance the pole for at least 1 time steps. A trail starts from an initial state near the equilibrium and ends when the controller balances the pole for 1 time steps or the pole angle or the position of the cart exceeds the boundary conditions. In Fig. 2, the performance of kernel ACDs and conventional ACDs is compared under different parameter settings including the variations of actor learning rates, the cart mass, and the pole length. We use two performance measures to evaluate the learning efficiency of different learning control methods. One is the success rate of a learning controller, which is defined as the percentage of successful learning trials that can learn a policy to balance the pole for at least 1 time steps. The other is the averaged number of trials which is needed to learn a successful policy. The averaged number of trials was computed by running the learning control process for 1 independent runs. For each independent run, the maximum number of learning trials is 1. For KHDP and KDHP, 4 trials of samples were collected by a random policy to construct the dictionary of kernel features. The threshold parameter for the ALDanalysisissetasμ =.1. It is shown in Fig. 2 that the performance of KDHP and KHDP is much better than that of DHP and HDP, respectively. In Fig. 2(a), we see that the success rates of KDHP are all 1% under different settings of actor learning rates, whereas the performance of DHP and HDP is greatly influenced by the actor learning rates. It is observed that KHDP has higher success rates than HDP and it is also less sensitive to the variations of actor learning rates. In Fig. 2(a), it is illustrated that KDHP has the best performance (1% success rate) under different dynamics changes of the plant including the variations of the cart mass and the pole length. The performance of KHDP is also much more robust than that of HDP and DHP. In Fig. 2(b), it is shown that KDHP needs the minimum averaged number of

10 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 771 average trials successful rate average trials successful rate [-.1,.1] [-.5,.5] [-.1,.1] (,.1) (,.2) noise level [-.1,.1] [-.5,.5] [-.1,.1] (,.1) (,.2) noise level KDHP KHDP DHP HDP (a) number of hidden layer nodes in the actor number of hidden layer nodes in the actor KDHP KHDP DHP HDP (b) Fig. 3. Performance comparisons between K-ACDs and ACDs under (a) different conditions of noise levels and (b) different number of hidden layer nodes in actor networks. KDHP KHDP theta(rad) theta(rad) t(s) DHP t(s) HDP theta(rad) theta(rad) t(s) t(s) Fig. 4. Angle variations of the real cart-pole system controlled by different learning controllers after convergence. trials for learning a successful control policy, which means that KDHP converges faster than other learning control algorithms. Compared with HDP, KHDP converges to a good control policy much faster. However, compared with KHDP and HDP, DHP needs smaller number of trials to balance the pole successfully. This is mainly due to the fact that DHP makes

11 772 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 the ball position/m DHP KDHP KHDP HDP DHP KDHP HDP KHDP Fig. 5. the cumulative squared errors/m*m Fig. 6. Ball and plate system. KDHP DHP KHDP HDP DHP KDHP HDP KHDP trials Total squared errors of four algorithms in 18 trials. use of some model information to estimate the policy gradient, which will greatly reduce the variance of policy gradients and increase the convergence speed of ACDs. The performance comparisons between HDP and DHP were also studied in the simulation, where the performance of HDP and DHP was evaluated under different learning rates and hidden node numbers of the critic and the actor. It is observed that DHP can consistently obtain better performance than HDP. In Fig. 3, the performance of different learning control algorithms is compared under different noise levels and different number of hidden nodes in the actor network. It is illustrated that KDHP has the best performance among all the algorithms and it is very robust to sensor noises and structure variations in the actor network. It can be seen that KHDP has much better performance than HDP and its performance is more robust than DHP. Fig. 4 shows the angle variations of a real cart-pole system controlled by different learning controllers after convergence. From Fig. 4, it is observed that the final policy obtained by KDHP can stabilize the system in a shorter time than other learning controllers. This means that the quality of the final near-optimal policy of KDHP is better than other algorithms. Moreover, the performance of KHDP is also better than HDP. From the above simulation and experimental results, it is illustrated that by making use of the sparse kernel machines in the critic of ACDs, the robustness and the efficiency of learning controllers can be greatly improved time-steps Fig. 7. Performance comparisons of the final policies obtained by the four algorithms. B. Learning Control of the Ball and Plate System The ball and plate system is a typical multivariable nonlinear plant, which has been used to test various learning control methods as an experimental device. The controller design for the ball and plate system becomes very difficult when there are model uncertainties and unknown disturbances in the plant. In the following, both simulation and experimental studies will be conducted on the ball and plate problem to compare the performance of different ACDs. As shown in Fig. 5, a typical ball and plate system comprises a ball, a round plate, a charge-coupled device (CCD) vidicon, two electromotors, and some other control devices. The CCD vidicon is used to detect the position of the ball, and the two electromotors can drive the round plate inclining so that the ball can roll arbitrarily on the plate. The control problems of the ball and plate system comprise the rolling from point to point, route tracking and obstacle avoiding, and so on. In this paper, the learning control problem of rolling from point to point was studied to compare the performance of different ACDs. The movement of the ball on the plate can be decomposed into two parts: the move along the x axes and the move along the y axes. Because of the independence of the control actions and the coherence of the dynamics models on the x axes and y axes, only the learning control problem on the x axes is considered. Let x stand for the position of the ball on the plate, and θ denote the inclining angle of the plate. 
R is the diameter of the ball and m is the mass of the ball, τ denotes the moment by which the inclining angle of the plate can be changed. In the learning control process, the moment τis defined as the action. If the state exceeds the boundary conditions, the current trial ends and the controller is regarded as unsuccessful. For the state defined as (x 1, x 2, x 3, x 4 ) = (θ, θ,x, ẋ), the dynamics equations of the ball and plate system on the x axes can be described as follows: ẋ 1 x 2 g Ẋ = ẋ 2 ẋ 3 = 1 x 4R 2 +x3 2 3 cos x 1 + τ 4mR 2 +mx3 2 x. (65) 4 ẋ 4 7sin(x 1 )

12 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS DHP KDHP the ball position/m time/s Fig. 8. Performance comparisons of KDHP and DHP algorithms for real-time control of the ball and plate system. The reward function is defined as r(t) =.25(θ(t) θ d (t)) (x(t) x d (t)) 2 where θ d (t) is the expected inclining angle of the plate and x d (t) is the expected position of the ball, they are both zeros in the simulation. The discount factor γ is set to be.95. In the simulation, the time step is set to be.2 s, and a trail starts from an initial position and ends after 1 time steps or the controller is unsuccessful. The initial state is randomly set around zero vectors and within 1% of the state boundary. If the ball can be stabilized at the expected position in 1 time steps, the controller is regarded as successful. One run of the learning control process consists of at most 2 trials. The initial conditions are independently set among different runs. If a successful controller is obtained in one run, this run ends and a new run starts. For the learning control algorithms, the performance is evaluated based on 1 independent runs. In the four algorithms, action modules are all constructed with neural networks whose structures and parameter settings are the same. The network structure (number of nodes in each layer) of action modules is set as The transfer functions from the input layer to the hidden layer is f (x) = (1 + e x ) 1 and from the hidden layer to the output layer, the transfer function is L(x) = kx. The learning rate in the actor is.3 and the actor weights are randomly initialized from.5 to.5. In the HDP and DHP algorithms, the critic modules are constructed with neural networks, whose parameter settings are the same as that in the action module except the structure, which is here. In the KHDP and KDHP algorithms, kernel-based methods are employed to approximate the value functions and the derivatives of value functions, respectively. The performance of the four algorithms is compared by the tracking errors. In Fig. 6, the following performance index in each trial (T = 1 ) is used to compare the four algorithms: J = T (x (t) x d (t)) 2. (66) t= Fig. 6 shows that KDHP has the fastest convergence rate and smallest tracking errors. KHDP also has faster convergence rates and smaller tracking errors than HDP. TABLE I AVERAGED LEARNING CONTROL PERFORMANCE IN 1 RUNS Minimum No. of Trials Maximum No. of Trials Averaged Trials Success Rates HDP % KHDP % DHP % KDHP % The final policies obtained by the four algorithms can stabilize the ball within a very small region around the plate center, as demonstrated in Fig. 7. In Fig. 7, the position variations of the ball controlled by the final policies obtained by the four algorithms are depicted. It is shown that using the final policies obtained by KDHP, it takes the shortest time to control the ball to reach the plate center and be stabilized there. Compared with HDP, KHDP also costs less time to stabilize the ball. Because of the characteristics of online learning control, the convergence rates and success rates of the four algorithms were evaluated for performance comparisons. In Table I, it is shown that in 1 independent runs, KDHP needs the smallest number of trials to converge and KHDP needs smaller number of trials than HDP. In KDHP, the success rate of learning control is 1%, whereas in DHP, it is 84%. In KHDP, the success rate of learning control is 92%, whereas in HDP, it is only 57%. As shown in Fig. 
5, the ball and plate control system developed by Googol Technology is used for experimental studies. In the experiments, the control policies obtained from simulation data are used for performance tests, and the time step is still.2 s. Fig. 8 shows that in the real-time control experiments, by using the final policies obtained by KDHP, it takes smaller number of time steps to control the ball to reach the plate center and the tracking error of KDHP is smaller than DHP. Therefore, on the basis of the simulation and experimental results, it is clearly shown that the proposed kernel ACDs can obtain better performance than standard ACDs.
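A minimal Python sketch of the per-trial performance index (66), assuming an illustrative logged ball-position trajectory rather than the experimental data:

```python
import numpy as np

def tracking_error_index(x_traj, x_desired=0.0):
    """Per-trial performance index J = sum_t (x(t) - x_d(t))^2 as in (66)."""
    x_traj = np.asarray(x_traj, dtype=float)
    return float(np.sum((x_traj - x_desired) ** 2))

# Toy logged ball positions (illustrative numbers only, not experimental data).
example_trajectory = 0.05 * np.exp(-0.01 * np.arange(1000))
print(tracking_error_index(example_trajectory, x_desired=0.0))
```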

V. CONCLUSION

ACDs were among the first methods to address reinforcement learning problems in a general setting, and they have recently attracted renewed interest because of their ability to realize online learning control of dynamical systems. In ACDs, MLPNNs with manually designed structures are commonly used for function approximation in continuous state and action spaces. However, when the network structures are improperly designed, previous ACDs have difficulty achieving good generalization capability and learning efficiency. In this paper, a novel class of ACDs with sparse kernel machines, called kernel ACDs, was presented for online learning control problems. Based on this framework, two kernel ACD algorithms, KHDP and KDHP, were proposed and their performance was analyzed. Because of the representation learning and generalization capability of sparse kernel machines, as well as the fast recursion in the critic realized by RLS-TD, kernel ACDs can obtain much better performance than previous ACDs with MLPNNs. The research in this paper shows that it is very promising to integrate sparse kernel machines into online learning control.

There are also some interesting topics to be studied in future work. One is the application of kernel ACDs to real-world online learning control problems so that better performance can be obtained in real-time learning control systems. Another is to develop more rigorous theoretical results on the convergence of kernel ACDs: existing convergence analyses of ACDs still require various assumptions, and further work is needed to close the gap between these theoretical assumptions and practical implementations.

REFERENCES
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, May 2009.
[3] C. Szepesvári, Algorithms for Reinforcement Learning. San Mateo, CA: Morgan, 2010.
[4] D. A. White and D. A. Sofge, Handbook of Intelligent Control. New York: Van Nostrand, 1992.
[5] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Netw., vol. 22, no. 3, Apr. 2009.
[6] P. J. Werbos, "Using ADP to understand and replicate brain intelligence: The next level design," in Proc. IEEE Int. Symp. Approx. Dynamic Program. Reinforcement Learn., Apr. 2007.
[7] D. P. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[8] D. Liu, Y. Zhang, and H. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, Sep. 2005.
[9] R. H. Crites and A. G. Barto, "Elevator group control using multiple reinforcement learning agents," Mach. Learn., vol. 33, nos. 2-3, Nov. 1998.
[10] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, Mar. 1994.
[11] P. Shih, B. C. Kaul, S. Jagannathan, and J. A. Drallmeier, "Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation," IEEE Trans. Neural Netw., vol. 19, no. 8, Aug. 2008.
[12] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York: Wiley, 2007.
[13] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL: CRC Press, 2010.
[14] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proc. Int. Conf. Mach. Learn., 2009.
[15] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation," J. Artif. Intell. Res., vol. 15, no. 1, Jul. 2001.
[16] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2000.
[17] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, Jul. 1997.
[18] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuron-like adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, Sep.-Oct. 1983.
[19] G. K. Venayagamoorthy, R. G. Harley, and D. C. Wunsch, "Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator," IEEE Trans. Neural Netw., vol. 13, no. 3, May 2002.
[20] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32-50, Aug. 2009.
[21] J. Peters and S. Schaal, "Natural actor-critic," Neurocomputing, vol. 71, nos. 7-9, Mar. 2008.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, Natural Actor-Critic Algorithms. Alberta, Canada: Dept. Comput. Sci., 2009.
[23] S. N. Balakrishnan and V. Biega, "Adaptive-critic-based neural networks for aircraft optimal control," J. Guid., Control, Dynamics, vol. 19, no. 4, 1996.
[24] R. Enns and J. Si, "Helicopter trimming and tracking control using direct neural dynamic programming," IEEE Trans. Neural Netw., vol. 14, no. 4, Jul. 2003.
[25] C. Lu, J. Si, and X. Xie, "Direct heuristic dynamic programming for damping oscillations in a large power system," IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 38, no. 4, Aug. 2008.
[26] P. Shih, B. C. Kaul, S. Jagannathan, and J. A. Drallmeier, "Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation," IEEE Trans. Neural Netw., vol. 19, no. 8, Aug. 2008.
[27] G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis, "Effective backpropagation training with variable stepsize," Neural Netw., vol. 10, no. 1, Jan. 1997.
[28] S. Bhasin, N. Sharma, P. Patre, and W. E. Dixon, "Asymptotic tracking by a reinforcement learning-based adaptive critic controller," J. Control Theory Appl., vol. 9, no. 3, 2011.
[29] K. G. Vamvoudakis and F. L. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, May 2010.
[30] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Trans. Neural Netw., vol. 22, no. 12, Dec. 2011.
[31] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[32] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2002.
[33] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[34] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," J. Mach. Learn. Res., vol. 3, pp. 1-48, Jul. 2002.
[35] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," Ann. Statist., vol. 36, no. 3, 2008.
[36] D. Ormoneit and S. Sen, "Kernel-based reinforcement learning," Mach. Learn., vol. 49, nos. 2-3, 2002.
[37] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: The Gaussian process approach to temporal difference learning," in Proc. Int. Conf. Mach. Learn., 2003.
[38] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002.
[39] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems 16, S. Thrun, L. K. Saul, and B. Schölkopf, Eds. Cambridge, MA: MIT Press, 2004.
[40] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," J. Mach. Learn. Res., vol. 4, Dec. 2003.
[41] X. Xu, D. Hu, and X. Lu, "Kernel-based least-squares policy iteration for reinforcement learning," IEEE Trans. Neural Netw., vol. 18, no. 4, Jul. 2007.
[42] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, Aug. 2004.
[43] X. Xu, H. G. He, and D. W. Hu, "Efficient reinforcement learning using recursive least-squares methods," J. Artif. Intell. Res., vol. 16, Jun. 2002.
[44] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," Int. J. Inf. Technol., vol. 11, no. 9, 2005.
[45] J. N. Tsitsiklis and B. V. Roy, "An analysis of temporal difference learning with function approximation," IEEE Trans. Autom. Control, vol. 42, no. 5, May 1997.
[46] A. Nedic and D. P. Bertsekas, "Least squares policy evaluation algorithms with linear function approximation," Discrete Event Dyn. Syst., vol. 13, nos. 1-2, Jan.-Apr. 2003.
[47] T. Dierks and S. Jagannathan, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, Jul. 2012.
[48] S. Zhong, X. Zeng, S. Wu, and L. Han, "Sensitivity-based adaptive learning rules for binary feedforward neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, Mar. 2012.

Xin Xu (M'07-SM'12) received the B.S. degree in electrical engineering from the Department of Automatic Control, National University of Defense Technology (NUDT), Changsha, China, in 1996, and the Ph.D. degree in control science and engineering from the College of Mechatronics and Automation, NUDT, in 2002. He is currently a Full Professor with the Institute of Unmanned Systems, College of Mechatronics and Automation, NUDT. He has been a Visiting Scientist for cooperative research with Hong Kong Polytechnic University, Hong Kong, the University of Alberta, Edmonton, AB, Canada, the University of Guelph, Guelph, ON, Canada, and the University of Strathclyde, Glasgow, U.K. He has authored or co-authored more than 90 papers in international journals and conferences and has co-authored four books. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles. Dr. Xu was a recipient of the first-class Natural Science Award of Hunan Province, China, in 2009 and the Fok Ying Tong Youth Teacher Fund of China in 2008. He is currently an Associate Editor of Information Sciences and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. He is a member of the IEEE Technical Committee on Approximate Dynamic Programming and Reinforcement Learning (ADPRL) and the IEEE Technical Committee on Robot Learning. He has served as a PC member or session chair for many international conferences.

Zhongsheng Hou received the Bachelor's and Master's degrees from the Jilin University of Technology, Changchun, China, in 1983 and 1988, respectively, and the Ph.D. degree from Northeastern University, Shenyang, China, in 1994. He was a Post-Doctoral Fellow with the Harbin Institute of Technology, Harbin, China, from 1995 to 1997, and a Visiting Scholar with Yale University, New Haven, CT, from 2002 to 2003. In 1997, he joined Beijing Jiaotong University, Beijing, China, where he is currently a Full Professor, the Founding Director of the Advanced Control Systems Laboratory, and the Head of the Department of Automatic Control, School of Electronic and Information Engineering. He has authored or co-authored over 100 papers in peer-reviewed journals and over 100 papers in prestigious conference proceedings, and has authored two monographs, Nonparametric Model and its Adaptive Control Theory and the forthcoming Model Free Adaptive Control: Theory and Applications (CRC Press, 2013). His current research interests include data-driven control, model-free adaptive control, learning control, and intelligent transportation systems. Dr. Hou has served as a committee member of over 40 international and Chinese conferences, and as an Associate Editor and Guest Editor for several international and Chinese journals.

Chuanqiang Lian received the Bachelor's degree from the Department of Automation, Qinghua University, Beijing, China, in 2008, and the Master's degree from the College of Mechatronics and Automation, National University of Defense Technology, Changsha, China, in 2010, where he is currently pursuing the Ph.D. degree with the Institute of Unmanned Systems. He has co-authored more than 10 papers in international journals and conferences. His current research interests include reinforcement learning, approximate dynamic programming, and autonomous vehicles.

Haibo He (SM'11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, in 2006. He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 2006 to 2009. He is currently an Associate Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI. He has authored or co-authored over 100 peer-reviewed journal and conference papers, authored one research book (Wiley), and edited six conference proceedings (Springer). His research has been highlighted in numerous media outlets, such as the IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News. His current research interests include adaptive dynamic programming, machine learning, computational intelligence, hardware design for machine intelligence, and various applications such as the smart grid. Dr. He was a recipient of the National Science Foundation CAREER Award in 2011 and the Providence Business News Rising Star Innovator Award in 2011. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID.
