IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 2013

Online Learning Control Using Adaptive Critic Designs With Sparse Kernel Machines

Xin Xu, Senior Member, IEEE, Zhongsheng Hou, Chuanqiang Lian, and Haibo He, Senior Member, IEEE

Abstract: In the past decade, adaptive critic designs (ACDs), including heuristic dynamic programming (HDP), dual heuristic programming (DHP), and their action-dependent variants, have been widely studied to realize online learning control of dynamical systems. However, because neural networks with manually designed features are commonly used to deal with continuous state and action spaces, the generalization capability and learning efficiency of previous ACDs still need to be improved. In this paper, a novel framework of ACDs with sparse kernel machines is presented by integrating kernel methods into the critic of ACDs. To improve the generalization capability as well as the computational efficiency of kernel machines, a sparsification method based on approximately linear dependence analysis is used. Using the sparse kernel machines, two kernel-based ACD algorithms, kernel HDP (KHDP) and kernel DHP (KDHP), are proposed and their performance is analyzed both theoretically and empirically. Because of the representation learning and generalization capability of sparse kernel machines, KHDP and KDHP can obtain much better performance than previous HDP and DHP with manually designed neural networks. Simulation and experimental results on two nonlinear control problems, a continuous-action inverted pendulum problem and a ball and plate control problem, demonstrate the effectiveness of the proposed kernel ACD methods.

Index Terms: Adaptive critic designs, approximate dynamic programming, kernel machines, learning control, Markov decision processes, reinforcement learning.

Manuscript received September 2, 2011; revised October 12, 2012; accepted December 16, 2012. Date of publication February 13, 2013; date of current version March 8, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant , Grant 98232, Grant , and Grant , the New Century Excellent Talent Program under Grant NCET-1-91, and the U.S. National Science Foundation under Grant CAREER ECCS . X. Xu and C. Lian are with the College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China (e-mail: xuxin_mail@263.net; xinxu@nudt.edu.cn). Z. Hou is with the Advanced Control Systems Laboratory, School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China (e-mail: zhshhou@bjtu.edu.cn). H. He is with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: he@ele.uri.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

I. INTRODUCTION

Reinforcement learning (RL) is a machine learning framework for solving sequential decision-making problems that can be modeled using the Markov decision process (MDP) formalism. In RL, the learning agent interacts with an initially unknown environment and modifies its action policies to maximize its cumulative payoffs [1], [2]. Although earlier RL research focused on tabular algorithms in discrete state/action spaces, approximation and generalization methods for RL have received increasing research interest in recent years.
In the literature, there are several synonyms used for RL, including approximate/adaptive dynamic programming (ADP) and neuro-dynamic programming [3] [7]. One common goal of ADP and RL is to solve the optimal control problem of MDP with large or continuous state and action spaces. Until now, RL has been shown to be a very promising framework to solve learning control problems which are difficult or even impossible for mathematical programming and supervised learning methods. However, despite some successful empirical results in real-world applications [8] [11], realizing efficient online learning control for MDPs with large or continuous space is still a difficult problem. In such cases, many RL or ADP algorithms are slow to converge and require a large amount of training samples [12]. As indicated in [1], this problem is closely related to the generalization capability of learning machines, which is the ability of a learning algorithm to perform accurately on new, unseen examples after having trained on a finite data set. In order to improve the generalization capability and learning efficiency of RL, function approximation has been a central topic in RL. Currently, there are three main categories of research work on function approximation for RL, that is, value function approximation (VFA) [13], [14], policy search [15], and actor-critic methods [16]. The actor-critic algorithms, viewed as a hybrid of VFA and policy search, have been shown to be more effective than standard VFA or policy search in online learning tasks with continuous state/action spaces [17]. In an actor-critic learning controller, there is an actor for policy learning and a critic for VFA or policy evaluation. One pioneering work on RL algorithms using the actor-critic architecture can be found in [18]. In recent years, adaptive critic designs (ACDs) [19] [23], [28] [3] were widely studied as an important class of actor-critic learning control methods for dynamical systems. Generally, ACDs can be categorized as the following major groups: heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized dual heuristic programming (GDHP), and their action-dependent versions [17]. Among ACD architectures, DHP is the most popular one, which has been proven to be more efficient than HDP [19]. Although ACDs have been applied in various learning control problems [24] [26], such as aircraft control, automotive engine control, and power system control, there are still some difficult issues in the design and implementation of ACDs. The first issue is that the learning efficiency and convergence

2 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 763 of ACDs greatly rely on the empirical design of the critic, including the approximation structure and the learning rate. In ACDs, multilayer perceptron neural networks (MLPNNs) [48] were commonly used for VFA, but the structure and learning rates (step sizes) of MLPNNs have to be manually selected for good performance [27]. The second difficulty is that the robustness to uncertainties in learning control systems based on DHP or HDP still needs to be improved. In DHP and HDP, due to the local minima in neural network training, how to improve the quality of the final policies is still an open problem [22]. As suggested in [22], the most important potential extension of their results would be to characterize the quality of the converged solution in ACDs. Recent studies have attempted to approximate the optimal control solution using various ADP techniques with or without apriorisystem model [28] [3], [47]. Vamvoudakis and Lewis [29] proposed an online actor critic algorithm to solve the continuous-time infinite-horizon optimal control problem, with the assumption of known dynamics. Zhang et al. proposed a data-driven robust approximate optimal tracking control scheme for unknown general nonlinear systems [3]. Nevertheless, the above works still relied on manual settings of critic networks, and the learning control performance depended on the empirical design of basis functions. Therefore, it is desirable to develop automatic feature representation and selection methods for the critic learning of ADP approaches. As is well known, feature representation and selection is a critical factor for improving the generalization performance of machine learning algorithms. However, compared with supervised learning, there are relatively fewer works on feature representation and selection in reinforcement learning, especially in online learning control methods. For ACDs, it was pointed out recently [22] that a study on the choice of the basis functions for the critic to obtain a good estimate of the policy gradient should be done to improve the performance of ACDs. The motivation of this paper is to present a novel kernelbased feature representation method for ACDs and develop new online learning control algorithms with sparse kernel machines. Based on the theoretical and empirical results from statistical learning [31], [32], sparse kernel machines will have better generalization capability than conventional MLPNNs with manually designed structures. Therefore, the goal of this paper is to provide a new kernel-based feature representation method for ACDs, which is important to realize efficient online learning control methods for uncertain dynamical systems. Recently, kernel machines have been popularly studied to realize nonlinear and nonparametric versions of supervised or unsupervised learning algorithms [31] [32]. The main idea of kernel machines is as follows: an inner product in a highdimensional feature space can be represented as a Mercer kernel function, thus, existing learning algorithms in linear spaces can be transformed to kernel-based algorithms without explicitly computing the inner products in high-dimensional feature spaces. This idea, which is usually called the kernel trick, has been widely used in supervised and unsupervised learning problems [32]. 
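A minimal Python sketch of the kernel trick described above, assuming a degree-2 polynomial kernel with its explicit feature map and a Gaussian kernel whose (infinite-dimensional) feature map is never formed; the toy input vectors are illustrative only:

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for the degree-2 polynomial kernel k(x, y) = (x . y)^2 (2-D input)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def poly2_kernel(x, y):
    return float(np.dot(x, y)) ** 2

def gaussian_kernel(x, y, sigma=1.0):
    """Mercer kernel evaluated directly; the induced feature space is infinite-dimensional."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

# Toy input vectors (illustrative only).
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# The kernel value equals the inner product of explicit features (the kernel trick).
assert np.isclose(poly2_kernel(x, y), np.dot(poly2_features(x), poly2_features(y)))
print(gaussian_kernel(x, y))
```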
In supervised learning, the most popular kernel machines include support vector machines (SVMs) and Gaussian processes (GPs), which have been applied in many classification and regression problems. In most cases, kernel machines obtained very good results or even the stateof-the-art performance [32] [34]. In unsupervised learning, kernel principal component analysis and kernel independent component analysis were also studied by many researchers [34]. Comprehensive reviews on kernel machines can be found in [35]. The combination of kernel methods with RL and ADP has also received increased research interest in recent years. However, the function approximation problem is more difficult in RL than in supervised learning. One of the earlier works in this direction was published in [36], where kernel-based locally weighted averaging was used to approximate the state value functions of MDPs. The applications of GPs or SVMs in reinforcement learning problems were also studied in the literature, such as GPs in temporal difference [TD()] learning [37], SVMs for RL [38], and Gaussian processes in modelbased approximate policy iteration [39]. In [38], support vector regression was applied to batch learning of state value functions of MDPs with discrete state spaces, and there were no theoretical results on the policies obtained. The GP-based policy iteration method in [39] uses support points, which are usually selected by manual discretization of the state spaces, and policy evaluation is performed using the state transition model approximated by a GP model. In [4], a model-free approximate policy iteration algorithm, called least-squares policy iteration (LSPI), was presented, which offers an RL method with good properties in convergence, stability, and sample complexity. Nevertheless, the approximation structures in LSPI may lead to degraded performance when the features are improperly selected. In [41], a kernel-based least-squares policy iteration (KLSPI) algorithm was presented for MDPs with large or continuous state spaces. However, both LSPI and KLSPI are mainly restricted to solving MDPs with discrete actions. In this paper, a novel framework of ACDs with sparse kernel machines is presented by integrating kernel methods into the critic learning of ACD algorithms. A sparsification method based on the approximately linear dependence (ALD) analysis [42] is used to sparsify the kernel machines when approximating the action value functions or their derivatives. Using the sparsified kernel machines, two Kernel ACD algorithms, that is, kernel HDP (KHDP) and kernel DHP (KDHP), are proposed to realize efficient online learning control for dynamical systems. To the best of our best knowledge, there are very few works on integrating kernel methods into online learning control based on ACDs in the community. Simulation and experimental results on two nonlinear control problems, a continuous-action inverted pendulum problem and a ball and plate control problem, demonstrate that kernel ACDs can obtain much better performance than that of previous ACDs. The main contributions of this paper include the following two aspects. One is automatic feature representation using kernels for VFA in ACDs. Because of the structure learning and nonlinear approximation ability of sparse kernel machines, KHDP and KDHP can obtain much better performance than previous HDP and DHP methods with manually designed neural networks. The second is to combine sparsified kernel

3 764 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 features with the recursive least-squares TD (RLS-TD) algorithm [42] so that faster learning speed can be realized in the critic of kernel ACDs. As studied in [16] and [22], the convergence of actor-critic algorithms can be ensured based on the principle of two-timescale stochastic approximations, which are characterized by coupled stochastic recursions that are driven by two different step size schedules. According to the results in [22], when linear function approximators are used, actor-critic algorithms can be proved to converge if the learning process in the critic is a faster recursion than the actor. Thus, when faster learning speed is realized by using RLS-TD in the critic, kernel ACDs can be expected to have improved performance in convergence. The idea of kernelbased VFA can also be applied to other ADP methods for learning control of dynamical systems [28] [3]. In recent studies on ADP methods, VFA is still a central problem and it can be expected that new kernel-based ADP algorithms can be developed. In the following, we will focus on kernel methods in popularly used ACDs including HDP and DHP, and the extension of kernel methods in other ADP algorithms is a promising direction for future work. The rest of this paper is organized as follows. In Section II, some research backgrounds on MDPs and the ALD-based kernel sparsification process are introduced. In Section III, the framework of ACDs with sparse kernel machines is presented and the KHDP and KDHP algorithms are proposed. The performance of kernel ACDs is analyzed from two perspectives. One is the performance error of critic learning and the other is the convergence of the actor-critic learning control process. In Section IV, simulation and experimental results on two nonlinear learning control systems are provided to illustrate the effectiveness of the proposed method. Finally, conclusions and future work are summarized in Section V. II. BACKGROUND A. Markov Decision Processes An MDP M is denoted as a quadruple {X, A, R, P}, where X is the state space, A is the action space,p is the state transition probability, and R is the reward function. A stochastic stationary policy π(or just stationary policy) maps states to distributions over the action space. When referring to such a policy π, weuseπ(a x) to denote the probability of selecting action a in state x by π. A deterministic stationary policy directly maps states to actions, denoted as a t = π(x t ), t. (1) When the actions a t (t ) satisfy (1), policy π is followed in the MDP M. A stochastic stationary policy π is said to be followed in the MDP M if a t π(a x t ), t. The objective of a learning controller is to estimate the optimal policy π satisfying J π = max π J π = max π E π[ ] γ t r t where < γ < 1 is the discount factor and r t is the reward at time step t, E π [ ] stands for the expectation with respect to the policy π and the state transition probabilities, and J π is the t= (2) expected total reward along the state trajectories by following policy π. In this paper, J π is also called the performance value of policy π. The state value function V π (x) of a policy π is the expected, discounted total rewards when starting from x and following policy π thereafter [ ] V π (x) = E π γ t r t x = x. 
(3) t= Similarly, the state action value function Q π (x,a) is defined as the expected, discounted total rewards when taking action a in state x and following policy π thereafter Q π (x, a) = E π[ t= ] γ t r t x = x, a = a. (4) For an MDP, a deterministic optimal policy π (x) maximizes the expected, discounted total reward of state x π (x) = arg max Q π (x, a). (5) a B. ALD-Based Kernel Sparsification Let X denote the original state space. A kernel function is a mapping from X X to R, which is usually assumed to be continuous. A Mercer kernel is a kernel function that is positive definite, that is, for any finite set of points {x 1, x 2,..., x n }, the kernel matrix K = [k(x i, x j )](1 i, j n) is positive definite. According to the Mercer theorem [32], there exists a Hilbert space H and a mapping φ from X to H such that k(x i, x j ) =< φ(x i ), φ(x j )> (6) where <, > is the inner product in H. Although the dimension of H may be infinite and the nonlinear mapping φ is usually unknown, all the computation in the feature space can still be performed if it is in the form of inner products. As introduced in [42], in the ALD analysis, after the sample collection process, the kernel-based features are constructed in a data-driven way. Let S n = {s 1, s 2,...,s n }denote a set of data samples and φ be a feature mapping on the data, which can be determined by the Mercer kernel function defined in (6). A feature vector set can be obtained as n = {φ(s 1 ), φ(s 2 ),...,φ(s n )}, φ(s i ) R m 1, i = 1, 2,...,n. To perform ALD analysis on the feature vector set, a data dictionary is defined as a subset of the feature vector set. The data dictionary D is initially empty and the ALD analysis is implemented by testing every feature vector in n, one at a time. If a feature vector φ(s) cannot be approximated within a predefined precision by the linear combination of the feature vectors in the dictionary, it will be added to the dictionary. Otherwise, it will not be added to the dictionary. Thus, after the ALD analysis process, all the feature vectors of the data samples in S n can be approximately represented by linear combinations of the feature vectors in the dictionary within a given precision.
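A minimal Python sketch of the discounted objective in (2)-(4), assuming an illustrative two-state MDP and stochastic policy that do not appear in the paper; V^pi(x) is estimated by averaging truncated discounted returns over Monte Carlo rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95

# Toy 2-state, 2-action MDP: P[s, a] gives next-state probabilities, R[s, a] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
policy = np.array([[0.7, 0.3],   # pi(a | s): probability of each action in each state
                   [0.4, 0.6]])

def discounted_return(s, horizon=200):
    """One truncated rollout of sum_t gamma^t r_t starting from state s under pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=policy[s])
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])
    return g

# Monte Carlo estimate of V_pi(0) as in (3).
print(np.mean([discounted_return(0) for _ in range(2000)]))
```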

4 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 765 The ALD-based sparsification procedure mainly includes two steps. The first step is to compute the following optimization solution: 2 δ t = min c c j φ(s j ) φ(s t ). (7) s j D t Due to the kernel trick, after substituting (6) into (7), we obtain δ t = min{c T K t 1 c 2c T k t 1 (s t ) + k tt } (8) c where [K t 1 ] i, j = k(s i, s j ), s i (i = 1, 2,...,d(t 1)) are the elements in the dictionary, d(t 1) is the length of the data dictionary, k t 1 (s t ) = [k(s 1,s t ), k(s 2, s t ),..., k(s d(t 1),st )] T, c = [c 1, c 2,..., c d ] T,andk tt = k(s t, s t ). The optimal solution for (8) is c t = Kt 1 1 k t 1(s t ) (9) δ t = k tt kt 1 T (s t )c t. (1) The second step of the ALD-based sparsification is to update the data dictionary by comparing δ t with a predefined threshold μ. Ifδ t < μ, the dictionary is unchanged, otherwise, s t is added to the dictionary, that is, D t = D t 1 s t. After the sparsification procedure, a data dictionary D n with reduced number of data vectors is obtained and the approximated state action value function or its derivative is represented as follows: d(n) Q(x, a) = α j k(s, s j ) (11) j=1 d(n) λ(x) = α j k(x, x j ) (12) j=1 where d(n), usually much smaller than the original sample size n, is the length of the dictionary D n, s j = s(x j, a j ),and x j ( j =1,2,..., d(n)) are the elements of the data dictionary. III. ACDS WITH SPARSE KERNEL MACHINES A. Framework of Kernel ACDs A general framework of ACDs with sparse kernel machines is shown in Fig. 1. The main components of kernel ACDs include a critic, a kernel-based feature learning module, a reward function, an actor/controller, and a model of the plant. The kernel-based feature learning module is to implement data-driven feature representation and learning so that better learning efficiency and generalization performance can be obtained for ACDs. The critic is used to approximate the value functions or their derivatives. In the proposed framework, the kernel function and its induced feature space play important roles in the critic learning process. Since kernel-based features are in linear forms, the RLS-TD learning algorithms can be employed in the critic. The actor or controller receives measurement data about the plant s current state x t and outputs the control u t. The output of the critic is used in the training process of the actor so that policy gradients can be computed. The plant model receives the control u t, and estimates the next Algorithm 1 Kernel ACDs Input: k(.,.): a Mercer kernel function g(x,θ): the approximation structure in the actor S = {s i s i = (x i, a i )} N : asampleset 1) Initialize: A kernel dictionaryd = NULL, actor weights θ = θ, critic weights α = α, step size in the actor β = β. 2) For i = 1, 2,..., Size(D) Compute δ t using (8); If δ t μ Add s i to D; End if End for 3) Let t = ; 4) Loop: t = t + 1; Draw action a t = g(x t, θ t ); Get reward r t ; Observe next state x t+1 ; Compute feature vector k(s t ) and k(s t+1 ); Update θ and α according to (35) and (27) or (56) and (53); Until the termination criterion is satisfied 5) Return the final policy in the actor. Fig. 1. Critic V ( x t ) λ( x t ) Actor at ˆ +1 x t Kernel-based feature learning Model Plant xt Learning control structure of kernel ACDs. r t Reward function state x t+1. The state data are provided to the critic and to the reward function. 
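A minimal Python sketch of the ALD-based sparsification in (7)-(10), assuming a Gaussian Mercer kernel, random sample data, and a threshold of 0.1 for illustration:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

def build_dictionary(samples, mu=0.1, sigma=1.0):
    """ALD-based sparsification: keep s_t only if it is not approximately
    linearly dependent on the feature vectors already in the dictionary."""
    dictionary = []
    for s_t in samples:
        if not dictionary:
            dictionary.append(s_t)
            continue
        K = np.array([[gaussian_kernel(si, sj, sigma) for sj in dictionary]
                      for si in dictionary])                      # K_{t-1}
        k_t = np.array([gaussian_kernel(si, s_t, sigma) for si in dictionary])
        c_t = np.linalg.solve(K, k_t)                             # optimal coefficients, (9)
        delta_t = gaussian_kernel(s_t, s_t, sigma) - k_t @ c_t    # ALD error, (10)
        if delta_t >= mu:                                         # ALD test against threshold mu
            dictionary.append(s_t)
    return dictionary

# Illustrative random samples (not from the paper).
samples = np.random.default_rng(1).uniform(-1, 1, size=(200, 2))
D = build_dictionary(samples, mu=0.1)
print(f"{len(D)} dictionary vectors kept from {len(samples)} samples")
```

Samples whose feature vectors can already be approximated within the threshold by the current dictionary are discarded, which is why the dictionary stays small even for long sample sequences.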
In some ACDs, such as DHP, by making use of the plant model, x t+1 is provided for a second pass through the critic so that V (x t+1 ) can be obtained for critic training. Algorithm 1 shows the proposed kernel ACDs, which include two main procedures, that is, a kernel-based feature construction process and an online learning control process. The sample collection process for kernel feature construction can be realized either by collecting data when a conventional controller is used or by observing the MDP running with an initially randomized policy in the actor. The data samples are in the form of state transitions {(x 1, a 1 ),(x 2, a 2 ),...,(x n, a n )}. Based on the data samples, the ALD-based kernel sparsification procedure, which was introduced in Section II-B, can be performed offline before the online learning process of ACDs. Since HDP and DHP are the most widely studied ACDs, we will focus on integrating sparse kernel machines into these two online learning control methods. In HDP, the aim of the

5 766 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 critic is to approximate the value functions or action value functions, whereas in DHP, the derivatives of value functions are approximated in the critic. So, in the proposed kernel ACDs, a recursive algorithm of KLSTD [44] will be used and the action value function or the value function derivative is approximated as t Q(s) = α i k(s, s i ) (13) λ(x) = t α i k(x, x i ) (14) where s and s i are the combined features of the state action pairs (x, a) and (x i, a i ), respectively, α i (i = 1, 2,..., t) are the weights, and (x i, a i ) (i = 1, 2,..., t) are selected state action pairs in the sample data, that is, trajectories generated from a Markov decision process. B. KHDP Algorithm In the critic of KHDP, the action value function Q(x, a) is approximated in a linear weighted form, where a Mercer kernel function k(x, y) = <φ(x), φ(y)> is employed to realize the feature mapping in a reproducing kernel Hilbert space (RKHS). Let s t = (x t, a t ) denote the state action pair at time step t. Then, the action value function Q(x t, a t ) can also be expressed as Q(s t ). As studied in [4], the regression equation for the linear LS-TD() (λ = ) algorithm is ] ] E [φ(s t )( Q(s t ) γ Q(s t+1 )) = E [φ(s t )r t (15) where E is the expectation with respect to the state transition probability when following a stationary policy and Q(s) = φ T (s)w, φ,w R q 1. (16) Equation (15) can be rewritten as [ ] E [φ(s t )(φ T (s t ) γφ T (s t+1 )) ]W = E φ(s t )r(s t ). (17) The observation equation of (17) is as follows: φ(s t )(φ T (s t ) γφ T (s t+1 ))W = φ(s t )r t + ε t (18) where ε t is the one-step observation noise. Due to the property of RKHS, the weight vector W in (18) can be represented by the weighted sum of the state feature vectors T W = φ(s i )α i (19) where s i (i = 1, 2,...,T )are the selected state action pairs after the ALD analysis, T is the number of selected samples, and α i are the coefficients. Let T = (φ T (s 1 ), φ T (s 2 ),...,φ T (s T )) T (2) k(s t ) = (k(s 1, s t ), k(s 2, s t ),...,k(s T, s t )) T. (21) By multiplying T to both sides of the observation equation (18), due to the kernel trick, we get k(s t )[ k T (s t ) α γ k T (s t+1 ) α] = k(s t )r t + ν t (22) where v t R T 1 is a transformed noise vector and Let A T = b T = α =[α 1,α 2,...,α T ] T. (23) N k(s t )[ k T (s t ) γ k T (s t+1 )] (24) t=1 N k(s t )r t (25) t=1 where N is the total number of samples. Then, the kernel-based least-squares fixed-point solution to the TD learning problem is as follows: α = A 1 T b T. (26) To realize online learning in the critic, the following update rules based on the kernel RLS-TD() algorithm are used in the critic of KHDP. Critic Update in KHDP: β t+1 = P t k(s t )/(μ + ( k T (s t ) γ k T (s t+1 ))P t k(s t )) α t+1 = α t + β t+1 (r t ( k T (s t ) γ k T (s t+1 )) α t ) P t+1 = 1 [ P t P tk(s t )( k T (s t ) γ k T ] (s t+1 ))P t [ ] μ μ + ( k T (s t ) γ k T (s t+1 ))P t k(s t ) (27) where β t is the step size in the critic, μ( <μ 1) is the forgetting factor, P = δi, δ is a positive number, and I is the identity matrix. The actor network in KHDP uses MLPNNs to approximate the policy function a t = g(x t, θ t ). (28) In this paper, the learning control objective is to minimize or maximize the following total discounted reward: [ ] J(x) = V (x) = E γ t r t x = x. (29) where < γ < 1 is the discount factor. 
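A minimal Python sketch of the KHDP critic with the kernel RLS-TD update in (27), assuming a Gaussian kernel over combined state-action vectors; the dictionary contents and the constants mu, delta, and sigma are illustrative choices, not values from the paper:

```python
import numpy as np

class KernelRLSTDCritic:
    """Recursive least-squares TD critic over sparse kernel features, a sketch of (27);
    the dictionary, kernel width, and initialization constants are assumptions."""

    def __init__(self, dictionary, gamma=0.95, mu=1.0, delta=100.0, sigma=1.0):
        self.D = np.asarray(dictionary)      # state-action pairs kept by the ALD analysis
        self.gamma, self.mu, self.sigma = gamma, mu, sigma
        self.alpha = np.zeros(len(self.D))   # critic weights
        self.P = delta * np.eye(len(self.D))

    def k(self, s):
        """Kernel feature vector k(s) against the dictionary."""
        return np.exp(-np.sum((self.D - s) ** 2, axis=1) / self.sigma ** 2)

    def q(self, s):
        return self.k(s) @ self.alpha        # Q(s) as in (13)

    def update(self, s_t, r_t, s_next):
        k_t, k_next = self.k(s_t), self.k(s_next)
        d = k_t - self.gamma * k_next        # temporal-difference feature vector
        denom = self.mu + d @ self.P @ k_t
        beta = self.P @ k_t / denom          # gain vector beta_{t+1}
        self.alpha += beta * (r_t - d @ self.alpha)
        self.P = (self.P - np.outer(self.P @ k_t, d @ self.P) / denom) / self.mu

# Illustrative usage with a random dictionary of (x1, x2, a) vectors.
critic = KernelRLSTDCritic(dictionary=np.random.default_rng(0).normal(size=(6, 3)))
critic.update(np.array([0.1, 0.0, 0.2]), r_t=-1.0, s_next=np.array([0.12, 0.01, 0.2]))
print(critic.q(np.array([0.1, 0.0, 0.2])))
```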
In this paper, we will mainly focus on deterministic MDPs, and the reward function is defined as nonpositive or nonnegative values. For nonpositive reward functions, the learning objective is to maximize the expected total discounted reward. For nonnegative reward functions, the learning objective is to minimize the expected total discounted reward. Therefore, the following cost function is used in the actor to realize the learning control objective: t= E a = 1 2 Q2 (x, a) (3) Since the minimization of cost function (3) is equivalent to minimize J(x) when Q(x, a) is nonnegative or maximize J(x) when Q(x, a) is nonpositive, the policy gradient learning rule in the actor can be designed as θ t = E a = Q(x t, a t ) Q(x t, a t ) a t. θ t a t θ t (31)
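A minimal Python sketch of an actor step built on the cost function (30) and policy gradient (31), assuming a linear policy a = theta . x for clarity (the paper uses an MLP actor) and the Gaussian-kernel form of Q and its action derivative given in (32)-(34) below; the dictionary and weights are illustrative:

```python
import numpy as np

def q_and_grad_a(s, dictionary, alpha, sigma=1.0):
    """Q(s) = sum_i alpha_i exp(-||s - s_i||^2 / sigma^2) and dQ/da, where the action
    is the last component of s (the Gaussian-kernel form given in (32)-(34) below)."""
    diffs = dictionary - s
    k_vec = np.exp(-np.sum(diffs ** 2, axis=1) / sigma ** 2)
    q = alpha @ k_vec
    dq_da = alpha @ (-2.0 * (s[-1] - dictionary[:, -1]) / sigma ** 2 * k_vec)
    return q, dq_da

def actor_step(theta, x_t, dictionary, alpha, eta=0.01, sigma=1.0):
    """One gradient step minimizing E_a = Q^2 / 2 as in (30)-(31); the linear
    policy a = theta . x is an illustrative assumption (the paper uses an MLP actor)."""
    a_t = theta @ x_t                      # a_t = g(x_t, theta)
    s_t = np.append(x_t, a_t)              # combined state-action vector
    q, dq_da = q_and_grad_a(s_t, dictionary, alpha, sigma)
    da_dtheta = x_t                        # gradient of the linear policy w.r.t. theta
    return theta - eta * q * dq_da * da_dtheta

# Illustrative call with random dictionary entries and critic weights.
rng = np.random.default_rng(0)
D = rng.normal(size=(5, 3))                # dictionary of (x1, x2, a) vectors
alpha = rng.normal(size=5)
theta = actor_step(np.zeros(2), np.array([0.2, -0.1]), D, alpha)
print(theta)
```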

6 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 767 When Gaussion kernels are used, the approximated action value function is T T Q(x, a) = α i k(s, s i ) = α i e s s i 2 /σ 2 (32) where x = (x (1), x (2),...,x (m) ) T, s = (x (1), x (2),...,x (m), a) is the combined vector of the state action pair (x, a), and is defined as n s s i = (x ( j) x i( j) ) 2 + (a a i ) 2. (33) j=1 On the basis of the definition in (32), we have Q(x t, a t ) a t = T 2α i (a t a i ) σ 2 e s t s i 2 /σ 2. (34) Then, the actor learning rule in KHDP is as follows. Actor Update in KHDP: θ t = θ t η t θ t = θ t η t Q(x t, a t ) a t θ t T where η t is the step size in the actor. α i (a t a i ) σ 2 e ( s t s i ) 2 /σ 2 (35) C. KDHP Algorithm The critic learning in KDHP is to approximate the derivatives of state value functions, which satisfy the following Bellman equation: V (x t ) k=1 = R t + γ E[V (x t+1)] (36) where R t is the expected reward and E[.] is with respect to the state transition probability when following a stationary policy. Let λ(x t ) = V (x t) (37) λ(x t+1 ) = V (x t+1). +1 (38) If x t and a t are 1-D values, the following relation holds: V (x t+1 ) = V (x t+1) +1 + V (x t+1) +1 a t a t = λ(x t+1 ) +1 + λ(x t+1 ) +1 a t a t (39) If x t = [x i (t)] n 1 and a t = [u i (t)] m 1 are multidimensional vectors, equation (39) becomes V (x t+1 ) n V (x t+1 ) x i (t + 1) = x j (t) x i (t + 1) x j (t) n m V (x t+1 ) x i (t + 1) a k (t) + x i (t + 1) a k (t) x j (t) = n λ(x i (t + 1)) x i(t + 1) x j (t) n m + λ(x i (t + 1)) x i(t + 1) a k (t) (4) a k (t) x j (t) k=1 where m and n are the dimensions of a t and x t, respectively. To simplify notations, we only show the results when x t and a t are 1-D variables, therefore (39) is employed. The extensions to multidimensional state and control vectors can be done by considering (4) instead of (39). Then, (36) can be rewritten as λ(x t ) = R [ ( t xt+1 a )] t + γ E λ(x t+1 ) (41) + +1 a t where +1 / and +1 / a t can be computed based on the model network in Fig. 1, and a t / can be computed on the basis of the actor network. Suppose the following nonlinear mappings are implemented by the model network and the actor network, respectively: x t+1 = f (x t, a t ) (42) a t = g(x t, θ t ) (43) where θ t is the weight vector of the actor network. Then, the derivatives in the right-hand side of (41) can be obtained as +1 = f (x t, a t ) (44) a t = g(x t, θ t ). (45) The temporal differences can be defined as δ(t) = r ( t xt+1 + γ + +1 a ) t λ(x t+1 ) λ(x t ). (46) a t In the critic learning of KDHP, a kernel-based approximation structure is considered to approximate λ(x t ).Atfirst, consider the following approximation structure in linear forms: λ(x t ) = V (x t) = φ T (x t )W = l φ j (x t )w j (47) where φ(x t ) = [φ 1 (x t ), φ 2 (x t ),...,φ l (x t )] T is a vector of basis functions, x t is the input state of the critic, and W = [w 1, w 2,..., w l ] T is the weight vector. By multiplying φ(x t ) to both sides of (41), the fixed-point equation for linear LS-TD() algorithms is derived [ [ ( xt+1 E φ(x t ) λ(x t ) γ + +1 a ) ]] t λ(x t+1 ) a t Let j=1 = φ(x t ) R t (48) D(x t ) = a t. (49) a t Equation (48) can be rewritten as [ ] E φ(x t )(φ T (x t ) γ D(x t )φ T (x t+1 )) W = E [ φ(x t ) r t ]. (5) Assume that x i (i = 1, 2,...,T )are the selected states after the ALD analysis, k(x, y) =<φ(x), φ(y)> is a Mercer kernel, and T is the number of selected samples. Similar to the

7 768 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 derivation of (22), by using the kernel trick, we can also obtain the least-squares fixed point equation for approximating λ(x) in the form of kernel-based features: k(x t )[ k T (x t ) α γ D(x t ) k T (x t+1 ) α] = k(x t )r t + ν t (51) where v t R T 1 is a noise vector, α =[α 1,α 2,...,α T ] T is the coefficient vector for approximating λ(x) and ( T k(x t ) = k(x 1, x t ), k(x 2, x t ),...,k(x T, x t )). The kernel-based RLS-TD update rules for critic learning in KDHP are as follows. Critic Update in KDHP: β t+1 =P t k(x t )/(μ+ ( k T (x t ) γ D(x t ) k T (x t+1 ))P t k(x t )) (52) ( rt ) α t+1 = α t + β t+1 ( k T (x t ) γ D(x t ) k T (x t+1 )) α t (53) ( P t+1 = 1 [ k T (x t ) γ D(x t ) k T (x t+1 )) P t ] P t P t k(x t ) ) μ μ+( k T (x t ) γ D(x t ) k T (x t+1 ) P t k(x t ) (54) where β t is the step size in the critic, μ( < μ 1) is the forgetting factor, P = δi, δ is a positive number, and I is the identity matrix. The actor network is used to generate the control actions based on the observed states of the plant. The output of the actor is given by (43). The learning objective of the actor is to minimize the performance value of the closed-loop system, which can be computed by the value functions of the MDP [ ] J(x) = V (x) = E π γ t r t x = x. (55) In KDHP, based on the outputs of the critic, the following policy gradient methods can be used to train the actor: Actor Update in KDHP: V (x t+1 ) a t θ t+1 = θ t η t θ t = θ t η t t= = θ t η t λ(x t+1 ) +1 a t a t θ t a t. (56) θ t Since λ(x t+1 ) and +1 / a t can be computed by the critic and the model network, respectively, and a t / θ t is given by (45), the above policy gradient learning can be implemented along with the critic learning. D. Performance Analysis and Discussions Compared with recent attempts in ADP methods for modelfree learning control, one advantage of kernel ACDs is that the manual selection of approximation structures in the critic is avoided and automatic feature construction and selection can be realized to improve the approximation and generalization capability of ACDs. Furthermore, by making use of the generalization capability of sparse kernel machines, which has been verified in the literature [35], [41], better learning control performance can be obtained. In the critic training of ACDs, the TD(λ) algorithm was popularly used to approximate the value functions or their derivatives, where function approximators were employed to realize generalization in large or continuous spaces. However, for TD(λ) with nonlinear approximators, for example, MLPNNs, there are no convergence proofs, and some divergence counterexamples were found in previous studies [45]. According to the recent theoretical results in [16] and [22], the convergence of ACDs can be ensured based on twotimescale stochastic approximations, where the critic needs to implement a faster recursion than the actor. In kernel ACDs, by making use of the kernel-based features, which are in a form of linear basis functions, the RLS-TD algorithm [43] is used to approximate the value functions or their derivatives with improved data efficiency and stability. As shown in [44], the kernel-based LS-TD algorithm is superior to conventional linear or nonlinear TD algorithms in terms of fast convergence rates. Therefore, with faster learning in the critic, kernel ACDs can have better performance than previous ACDs in terms of convergence rates. 
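A minimal Python sketch of the KDHP updates (52)-(54) and (56) for a scalar state and action, assuming that the reward derivative dr_t/dx_t, the term D(x_t), dx_{t+1}/da_t, and da_t/dtheta_t are supplied by the reward, model, and actor networks as in (41)-(45); the dictionary and constants are illustrative:

```python
import numpy as np

def k_vec(x, dictionary, sigma=1.0):
    return np.exp(-(np.asarray(dictionary) - x) ** 2 / sigma ** 2)

def kdhp_critic_update(alpha, P, x_t, x_next, dr_dx, D_t,
                       dictionary, gamma=0.95, mu=1.0, sigma=1.0):
    """One kernel RLS-TD step for lambda(x) = dV/dx as in (52)-(54), scalar-state case.
    dr_dx is dr_t/dx_t and D_t = dx_{t+1}/dx_t + dx_{t+1}/da_t * da_t/dx_t, both assumed
    to be provided by the model and actor networks as in (41)-(45)."""
    kt, kn = k_vec(x_t, dictionary, sigma), k_vec(x_next, dictionary, sigma)
    d = kt - gamma * D_t * kn
    denom = mu + d @ P @ kt
    beta = P @ kt / denom
    alpha = alpha + beta * (dr_dx - d @ alpha)
    P = (P - np.outer(P @ kt, d @ P) / denom) / mu
    return alpha, P

def kdhp_actor_update(theta, lam_next, dxnext_da, da_dtheta, eta=0.01):
    """Policy-gradient step (56): theta <- theta - eta * lambda(x_{t+1}) * dx_{t+1}/da_t * da_t/dtheta."""
    return theta - eta * lam_next * dxnext_da * da_dtheta

# Illustrative usage with a small 1-D dictionary.
dictionary = np.linspace(-1.0, 1.0, 7)
alpha, P = np.zeros(7), 100.0 * np.eye(7)
alpha, P = kdhp_critic_update(alpha, P, x_t=0.1, x_next=0.05,
                              dr_dx=-0.2, D_t=0.9, dictionary=dictionary)
theta = kdhp_actor_update(0.5, lam_next=alpha @ k_vec(0.05, dictionary),
                          dxnext_da=0.02, da_dtheta=0.1)
print(alpha[:3], theta)
```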
1) Performance Error of Critic Learning in Kernel ACDs: In KHDP and KDHP, sparsification of kernel machines is implemented based on the ALD analysis so that the kernelbased features have approximately linear independence. The following Lemma 1 shows that the kernel dictionary obtained by the ALD-based sparsification procedure is finite even if infinite samples are used. Lemma 1 [42]: For the ALD-based kernel sparsification procedure, assume that 1) k(.,.) is a continuous Mercer kernel and 2) S is a compact subset of a Banach space. Then, for any training sequence {x i } X (i = 1, 2,..., ) and for any μ >, the number of dictionary vectors is finite. In Lemma 1, it is shown that if the original state space X is compact, the ultimate dictionary set will be finite regardless of the dimension of the Hilbert space H. In the following, to simplify the notation, a countable state action space is considered, but the results on TD learning can also be extended to general spaces [45]. Let the cardinality of the states be N. The kernel matrix can be denoted as [ ] T K = k(x 1 ), k(x 2 ),..., k(x N ) R N d (57) where d is the number of dictionary vectors. Let α be the critic s weight vector, Ṽ (α) be the approximated value function using kernel machines, and θ t be the actor s weight vector. Since θ t is updated in a slower timescale than the critic, the policy π(θ t ) determined by θ t is also slowly varying. In the following, we will analyze the approximation error of kernel-based RLS-TD learning when the actor s policy is stationary or changes very slowly. An MDP with a stationary action policy π can be viewed as an equivalent Markov reward process with state transition probability P. Suppose μ is the unique distribution that satisfies μ T P = μ T with μ(i) > for all i X and μ is a finite or infinite vector, depending on the cardinality of X. The theoretical results in [4] and [46] show that when LS-TD or RLS-TD converges, a fixed-point solution can be obtained to minimize the projected Bellman residual errors min α J α = min α Ṽ (α) T Ṽ (α) (58)

8 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 769 where = ( T D ) 1 T D, = [φ 1, φ 2,...,φ n ] R N n, T is the Bellman operator, and D = diag{μ(i)}. In kernel ACDs, the projection operator is determined by the sparsified kernel features and the ALD-based sparsification can be viewed as a regularization procedure for the optimization problem in (58), where the objective function becomes min α J α = min α ˆV (α) T ˆV (α) + h(α) (59) where h(α) is the structural risk of the kernel machines. Although the combined objective function in (59) may be minimized in a synchronized way, it is sequentially optimized in kernel ACDs. At first, by using the ALD-based sparsification criterion, the structural risk h(α) is reduced. Then, the combined objective function in (59) is optimized by kernel LS-TD using the kernel dictionary obtained by the ALD analysis. In [22], it is proved that if the policy parameters change slowly, the critic weight vector can converge to a solution determined by the actor s policy π(θ). The next problem is the approximation error between the true value function V π(θ) (x) and the solution based on the least-squares fixedpoint equation (58). Since the kernel-based RLS-TD learning essentially implements linear TD learning using kernel-based feature vectors, (58) is equivalent to the fixed-point equation of linear LS-TD learning algorithms. Therefore, based on the analysis of TD learning using linear basis functions in [45], the following relation holds: Ṽ (α) V π(θ) D 1 λγ 1 γ V π(θ) V π(θ) D (6) where D = diag{μ(i)}, = K (K T DK) 1 K T D, λ ( λ 1) is the parameter for eligibility traces, and X D = X T DX. Since is determined by the sparsified kernel features, inequality (6) also shows that by appropriately selecting and sparsifying the kernel-based features, the approximation error bounds of value functions can also be reduced. E. Convergence Analysis of Kernel ACDs Similar to the analysis in [22], the update rules (53) and (56) in ACDs can be modeled as a general setting of two-timescale stochastic approximations X t+1 = X t + β t ( f (X t, Y t ) + Nt+1 1 ) (61) Y t+1 = Y t + γ t (g(x t, Y t ) + Nt+1 2 ) (62) where f and g are Lipschitz continuous functions and {Nt+1 1 }and {N2 t+1 } are martingale difference sequences with respect to the field [ N i 2 ] E t+1 F t D 1 (1+ X t 2 + Y t 2 ), i = 1, 2, t ] (63) for some constant D 1 <. In KHDP and KDHP, the learning rules in the critic use recursive least-squares methods and the step sizes are adaptively determined by online computation rules (27) and (52), respectively. When the update in the critic is a faster recursion than the update in the actor, the weights in the critic have uniformly higher increments compared to the weights in the actor. To analyze the convergence of kernel ACDs based on twotimescale stochastic approximations, the following ordinary differential equations can be considered: Ẋ = f (X (t), Y ) (64) where Assumptions (A1) (A3) hold. (A1) sup X t, sup Y t < ; t t (A2) Ẋ = f (X (t), Y ) has a globally asymptotically stable equilibrium μ(y ), where μ(.) is a Lipschitz continuous function; (A3) Ẏ = g(μ(y (t)), Y (t)) has a globally asymptotically stable equilibrium Y ; In [22], the main convergence result was obtained for twotimescale stochastic approximations: Theorem 1: Under Assumptions (A1) (A3), the updates in (61) and (62) converge asymptotically to the equilibrium, that is, (X t, Y t ) (μ(y ), Y ) as t, with probability one. 
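A toy Python sketch of the two-timescale setting in (61)-(64), assuming simple scalar recursions and step-size schedules chosen only to illustrate the mechanism behind Theorem 1: the fast (critic-like) variable tracks its equilibrium mu(Y) while the slow (actor-like) variable converges; none of this is the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = 5.0, 5.0
for t in range(1, 20001):
    beta_t = 1.0 / t ** 0.6      # fast step size: sums to infinity, squares summable
    eta_t = 1.0 / t              # slow step size: eta_t / beta_t -> 0
    # f(X, Y) = Y - X has the stable equilibrium mu(Y) = Y; g(X, Y) = -X drives Y to 0.
    X += beta_t * ((Y - X) + 0.1 * rng.normal())   # fast recursion, cf. (61)
    Y += eta_t * ((-X) + 0.1 * rng.normal())       # slow recursion, cf. (62)
print(f"X ~ {X:.3f}, Y ~ {Y:.3f}  (both approach the equilibrium (0, 0))")
```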
In KHDP and KDHP, by appropriately selecting the actor s step sizes, it can be expected that the update in the critic is a faster recursion than the update in the actor, and the weights in the critic have uniformly higher increments as compared with the weights in the actor. In [22], when the update in the critic is a faster recursion than the actor, it was proved that a class of actor-critic algorithms with linear function approximators will converge almost surely to a small neighborhood of a local minimum of the averaged reward J. InkernelACDs, by making use of kernel-based features and the RLS-TD algorithm in the critic, the updates in the critic can be a faster recursion than the actor. Thus, it will be more beneficial to ensure the convergence of the online learning process. In Section IV, extensive performance tests and comparisons were conducted and it was shown that kernel ACDs have much better performance than conventional ACDs both in terms of convergence speed and in terms of the quality of the final policies. IV. SIMULATION AND EXPERIMENTAL RESULTS A. Inverted Pendulum Problem The inverted pendulum problem has been widely studied as a benchmark control problem with nonlinearity and instability. In the following, simulation and experimental studies will be conducted on the inverted pendulum problem to compare the performance of different RL algorithms. In simulation, the performance of kernel ACDs is compared with that of ACDs under different conditions and parameter settings. The nearoptimal policies of different algorithms are also implemented in a real inverted pendulum system to test the performance of different controllers. The aim of the learning controller is to balance the pole as long as possible and make the angle variations of the pendulum be as small as possible. The dynamics equations are assumed to be unknown or only partially known for the learning controller. For HDP and KHDP, the reward r is always before the pole angle or the position of the cart exceeds

9 77 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 average trials average trials average trials successful rate successful rate successful rate actor module learning rate cart mass/kg pole length/m KDHP KHDP DHP HDP (a) actor module learning rate cart mass/kg pole length/m KDHP KHDP DHP HDP (b) Fig. 2. Performance comparisons between K-ACDs and ACDs under different parameter settings such as (a) success rates and (b) average trials. the boundary conditions, that is, if θ 12, x 1.2m, r(t) = ; else r(t) = 1. For DHP and KDHP, a differentiable reward function is defined as r(t) =.5(x 2 + θ 2 ). The simulation time step is.2 s. A learning controller is regarded to be successful when its final policy can balance the pole for at least 1 time steps. A trail starts from an initial state near the equilibrium and ends when the controller balances the pole for 1 time steps or the pole angle or the position of the cart exceeds the boundary conditions. In Fig. 2, the performance of kernel ACDs and conventional ACDs is compared under different parameter settings including the variations of actor learning rates, the cart mass, and the pole length. We use two performance measures to evaluate the learning efficiency of different learning control methods. One is the success rate of a learning controller, which is defined as the percentage of successful learning trials that can learn a policy to balance the pole for at least 1 time steps. The other is the averaged number of trials which is needed to learn a successful policy. The averaged number of trials was computed by running the learning control process for 1 independent runs. For each independent run, the maximum number of learning trials is 1. For KHDP and KDHP, 4 trials of samples were collected by a random policy to construct the dictionary of kernel features. The threshold parameter for the ALDanalysisissetasμ =.1. It is shown in Fig. 2 that the performance of KDHP and KHDP is much better than that of DHP and HDP, respectively. In Fig. 2(a), we see that the success rates of KDHP are all 1% under different settings of actor learning rates, whereas the performance of DHP and HDP is greatly influenced by the actor learning rates. It is observed that KHDP has higher success rates than HDP and it is also less sensitive to the variations of actor learning rates. In Fig. 2(a), it is illustrated that KDHP has the best performance (1% success rate) under different dynamics changes of the plant including the variations of the cart mass and the pole length. The performance of KHDP is also much more robust than that of HDP and DHP. In Fig. 2(b), it is shown that KDHP needs the minimum averaged number of

10 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS 771 average trials successful rate average trials successful rate [-.1,.1] [-.5,.5] [-.1,.1] (,.1) (,.2) noise level [-.1,.1] [-.5,.5] [-.1,.1] (,.1) (,.2) noise level KDHP KHDP DHP HDP (a) number of hidden layer nodes in the actor number of hidden layer nodes in the actor KDHP KHDP DHP HDP (b) Fig. 3. Performance comparisons between K-ACDs and ACDs under (a) different conditions of noise levels and (b) different number of hidden layer nodes in actor networks. KDHP KHDP theta(rad) theta(rad) t(s) DHP t(s) HDP theta(rad) theta(rad) t(s) t(s) Fig. 4. Angle variations of the real cart-pole system controlled by different learning controllers after convergence. trials for learning a successful control policy, which means that KDHP converges faster than other learning control algorithms. Compared with HDP, KHDP converges to a good control policy much faster. However, compared with KHDP and HDP, DHP needs smaller number of trials to balance the pole successfully. This is mainly due to the fact that DHP makes

11 772 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 5, MAY 213 the ball position/m DHP KDHP KHDP HDP DHP KDHP HDP KHDP Fig. 5. the cumulative squared errors/m*m Fig. 6. Ball and plate system. KDHP DHP KHDP HDP DHP KDHP HDP KHDP trials Total squared errors of four algorithms in 18 trials. use of some model information to estimate the policy gradient, which will greatly reduce the variance of policy gradients and increase the convergence speed of ACDs. The performance comparisons between HDP and DHP were also studied in the simulation, where the performance of HDP and DHP was evaluated under different learning rates and hidden node numbers of the critic and the actor. It is observed that DHP can consistently obtain better performance than HDP. In Fig. 3, the performance of different learning control algorithms is compared under different noise levels and different number of hidden nodes in the actor network. It is illustrated that KDHP has the best performance among all the algorithms and it is very robust to sensor noises and structure variations in the actor network. It can be seen that KHDP has much better performance than HDP and its performance is more robust than DHP. Fig. 4 shows the angle variations of a real cart-pole system controlled by different learning controllers after convergence. From Fig. 4, it is observed that the final policy obtained by KDHP can stabilize the system in a shorter time than other learning controllers. This means that the quality of the final near-optimal policy of KDHP is better than other algorithms. Moreover, the performance of KHDP is also better than HDP. From the above simulation and experimental results, it is illustrated that by making use of the sparse kernel machines in the critic of ACDs, the robustness and the efficiency of learning controllers can be greatly improved time-steps Fig. 7. Performance comparisons of the final policies obtained by the four algorithms. B. Learning Control of the Ball and Plate System The ball and plate system is a typical multivariable nonlinear plant, which has been used to test various learning control methods as an experimental device. The controller design for the ball and plate system becomes very difficult when there are model uncertainties and unknown disturbances in the plant. In the following, both simulation and experimental studies will be conducted on the ball and plate problem to compare the performance of different ACDs. As shown in Fig. 5, a typical ball and plate system comprises a ball, a round plate, a charge-coupled device (CCD) vidicon, two electromotors, and some other control devices. The CCD vidicon is used to detect the position of the ball, and the two electromotors can drive the round plate inclining so that the ball can roll arbitrarily on the plate. The control problems of the ball and plate system comprise the rolling from point to point, route tracking and obstacle avoiding, and so on. In this paper, the learning control problem of rolling from point to point was studied to compare the performance of different ACDs. The movement of the ball on the plate can be decomposed into two parts: the move along the x axes and the move along the y axes. Because of the independence of the control actions and the coherence of the dynamics models on the x axes and y axes, only the learning control problem on the x axes is considered. Let x stand for the position of the ball on the plate, and θ denote the inclining angle of the plate. 
R is the diameter of the ball and m is the mass of the ball, τ denotes the moment by which the inclining angle of the plate can be changed. In the learning control process, the moment τis defined as the action. If the state exceeds the boundary conditions, the current trial ends and the controller is regarded as unsuccessful. For the state defined as (x 1, x 2, x 3, x 4 ) = (θ, θ,x, ẋ), the dynamics equations of the ball and plate system on the x axes can be described as follows: ẋ 1 x 2 g Ẋ = ẋ 2 ẋ 3 = 1 x 4R 2 +x3 2 3 cos x 1 + τ 4mR 2 +mx3 2 x. (65) 4 ẋ 4 7sin(x 1 )

12 XU et al.: ONLINE LEARNING CONTROL USING ADAPTIVE CRITIC DESIGNS DHP KDHP the ball position/m time/s Fig. 8. Performance comparisons of KDHP and DHP algorithms for real-time control of the ball and plate system. The reward function is defined as r(t) =.25(θ(t) θ d (t)) (x(t) x d (t)) 2 where θ d (t) is the expected inclining angle of the plate and x d (t) is the expected position of the ball, they are both zeros in the simulation. The discount factor γ is set to be.95. In the simulation, the time step is set to be.2 s, and a trail starts from an initial position and ends after 1 time steps or the controller is unsuccessful. The initial state is randomly set around zero vectors and within 1% of the state boundary. If the ball can be stabilized at the expected position in 1 time steps, the controller is regarded as successful. One run of the learning control process consists of at most 2 trials. The initial conditions are independently set among different runs. If a successful controller is obtained in one run, this run ends and a new run starts. For the learning control algorithms, the performance is evaluated based on 1 independent runs. In the four algorithms, action modules are all constructed with neural networks whose structures and parameter settings are the same. The network structure (number of nodes in each layer) of action modules is set as The transfer functions from the input layer to the hidden layer is f (x) = (1 + e x ) 1 and from the hidden layer to the output layer, the transfer function is L(x) = kx. The learning rate in the actor is.3 and the actor weights are randomly initialized from.5 to.5. In the HDP and DHP algorithms, the critic modules are constructed with neural networks, whose parameter settings are the same as that in the action module except the structure, which is here. In the KHDP and KDHP algorithms, kernel-based methods are employed to approximate the value functions and the derivatives of value functions, respectively. The performance of the four algorithms is compared by the tracking errors. In Fig. 6, the following performance index in each trial (T = 1 ) is used to compare the four algorithms: J = T (x (t) x d (t)) 2. (66) t= Fig. 6 shows that KDHP has the fastest convergence rate and smallest tracking errors. KHDP also has faster convergence rates and smaller tracking errors than HDP. TABLE I AVERAGED LEARNING CONTROL PERFORMANCE IN 1 RUNS Minimum No. of Trials Maximum No. of Trials Averaged Trials Success Rates HDP % KHDP % DHP % KDHP % The final policies obtained by the four algorithms can stabilize the ball within a very small region around the plate center, as demonstrated in Fig. 7. In Fig. 7, the position variations of the ball controlled by the final policies obtained by the four algorithms are depicted. It is shown that using the final policies obtained by KDHP, it takes the shortest time to control the ball to reach the plate center and be stabilized there. Compared with HDP, KHDP also costs less time to stabilize the ball. Because of the characteristics of online learning control, the convergence rates and success rates of the four algorithms were evaluated for performance comparisons. In Table I, it is shown that in 1 independent runs, KDHP needs the smallest number of trials to converge and KHDP needs smaller number of trials than HDP. In KDHP, the success rate of learning control is 1%, whereas in DHP, it is 84%. In KHDP, the success rate of learning control is 92%, whereas in HDP, it is only 57%. As shown in Fig. 
5, the ball and plate control system developed by Googol Technology is used for experimental studies. In the experiments, the control policies obtained from simulation data are used for performance tests, and the time step is still.2 s. Fig. 8 shows that in the real-time control experiments, by using the final policies obtained by KDHP, it takes smaller number of time steps to control the ball to reach the plate center and the tracking error of KDHP is smaller than DHP. Therefore, on the basis of the simulation and experimental results, it is clearly shown that the proposed kernel ACDs can obtain better performance than standard ACDs.
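A minimal Python sketch of the per-trial performance index (66), assuming an illustrative logged ball-position trajectory rather than the experimental data:

```python
import numpy as np

def tracking_error_index(x_traj, x_desired=0.0):
    """Per-trial performance index J = sum_t (x(t) - x_d(t))^2 as in (66)."""
    x_traj = np.asarray(x_traj, dtype=float)
    return float(np.sum((x_traj - x_desired) ** 2))

# Toy logged ball positions (illustrative numbers only, not experimental data).
example_trajectory = 0.05 * np.exp(-0.01 * np.arange(1000))
print(tracking_error_index(example_trajectory, x_desired=0.0))
```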

V. CONCLUSION

ACDs were among the first methods to address reinforcement learning problems in a general setting, and they have recently attracted renewed interest because of their ability to realize online learning control of dynamical systems. In ACDs, MLPNNs with manually designed structures are commonly used for function approximation in continuous state and action spaces. However, when the network structures are improperly designed, previous ACDs have difficulty achieving good generalization capability and learning efficiency. In this paper, a novel class of ACDs with sparse kernel machines, called kernel ACDs, was presented for online learning control problems. Based on this framework, two kernel ACD algorithms, KHDP and KDHP, were proposed and their performance was analyzed. Because of the representation learning and generalization capability of sparse kernel machines, as well as the fast recursion in the critic realized by RLS-TD, kernel ACDs can obtain much better performance than previous ACDs with MLPNNs. The research in this paper shows that it is very promising to integrate sparse kernel machines into online learning control.

There are also some interesting topics to be studied in future work. One is the application of kernel ACDs to real-world online learning control problems so that better performance can be obtained in real-time learning control systems. Another is to develop more rigorous theoretical results on the convergence of kernel ACDs: existing convergence analyses of ACDs still require various assumptions, and further work is needed to close the gap between these theoretical assumptions and practical implementations.

REFERENCES
[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, May 2009.
[3] C. Szepesvári, Algorithms for Reinforcement Learning. San Mateo, CA: Morgan, 2010.
[4] D. A. White and D. A. Sofge, Handbook of Intelligent Control. New York: Van Nostrand, 1992.
[5] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Netw., vol. 22, no. 3, Apr. 2009.
[6] P. J. Werbos, "Using ADP to understand and replicate brain intelligence: The next level design," in Proc. IEEE Int. Symp. Approx. Dynamic Program. Reinforcement Learn., Apr. 2007.
[7] D. P. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[8] D. Liu, Y. Zhang, and H. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, Sep. 2005.
[9] R. H. Crites and A. G. Barto, "Elevator group control using multiple reinforcement learning agents," Mach. Learn., vol. 33, nos. 2-3, Nov. 1998.
[10] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, Mar. 1994.
[11] P. Shih, B. C. Kaul, S. Jagannathan, and J. A. Drallmeier, "Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation," IEEE Trans. Neural Netw., vol. 19, no. 8, Aug. 2008.
[12] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York: Wiley, 2007.
[13] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL: CRC Press, 2010.
[14] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proc. Int. Conf. Mach. Learn., 2009.
[15] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation," J. Artif. Intell. Res., vol. 15, no. 1, Jul. 2001.
[16] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2000.
[17] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, Jul. 1997.
[18] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuron-like adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, Sep.-Oct. 1983.
[19] G. K. Venayagamoorthy, R. G. Harley, and D. C. Wunsch, "Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator," IEEE Trans. Neural Netw., vol. 13, no. 3, May 2002.
[20] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32-50, Aug. 2009.
[21] J. Peters and S. Schaal, "Natural actor-critic," Neurocomputing, vol. 71, nos. 7-9, Mar. 2008.
[22] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, Natural Actor-Critic Algorithms. Alberta, Canada: Dept. Comput. Sci., 2009.
[23] S. N. Balakrishnan and V. Biega, "Adaptive-critic-based neural networks for aircraft optimal control," J. Guid., Control, Dynamics, vol. 19, no. 4, 1996.
[24] R. Enns and J. Si, "Helicopter trimming and tracking control using direct neural dynamic programming," IEEE Trans. Neural Netw., vol. 14, no. 4, Jul. 2003.
[25] C. Lu, J. Si, and X. Xie, "Direct heuristic dynamic programming for damping oscillations in a large power system," IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 38, no. 4, Aug. 2008.
[26] P. Shih, B. C. Kaul, S. Jagannathan, and J. A. Drallmeier, "Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation," IEEE Trans. Neural Netw., vol. 19, no. 8, Aug. 2008.
[27] G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis, "Effective backpropagation training with variable stepsize," Neural Netw., vol. 10, no. 1, Jan. 1997.
[28] S. Bhasin, N. Sharma, P. Patre, and W. E. Dixon, "Asymptotic tracking by a reinforcement learning-based adaptive critic controller," J. Control Theory Appl., vol. 9, no. 3, 2011.
[29] K. G. Vamvoudakis and F. L. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, May 2010.
[30] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Trans. Neural Netw., vol. 22, no. 12, Dec. 2011.
[31] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[32] B. Schölkopf and A. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2002.
[33] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[34] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," J. Mach. Learn. Res., vol. 3, pp. 1-48, Jul. 2002.
[35] T. Hofmann, B. Schölkopf, and A. J. Smola, "Kernel methods in machine learning," Ann. Statist., vol. 36, no. 3, 2008.
[36] D. Ormoneit and S. Sen, "Kernel-based reinforcement learning," Mach. Learn., vol. 49, nos. 2-3, 2002.
[37] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: The Gaussian process approach to temporal difference learning," in Proc. Int. Conf. Mach. Learn., 2003.
[38] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002.
[39] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Advances in Neural Information Processing Systems 16, S. Thrun, L. K. Saul, and B. Schölkopf, Eds. Cambridge, MA: MIT Press, 2004.
[40] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," J. Mach. Learn. Res., vol. 4, Dec. 2003.
[41] X. Xu, D. Hu, and X. Lu, "Kernel-based least-squares policy iteration for reinforcement learning," IEEE Trans. Neural Netw., vol. 18, no. 4, Jul. 2007.
[42] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, Aug. 2004.
[43] X. Xu, H. G. He, and D. W. Hu, "Efficient reinforcement learning using recursive least-squares methods," J. Artif. Intell. Res., vol. 16, Jun. 2002.
[44] X. Xu, T. Xie, D. Hu, and X. Lu, "Kernel least-squares temporal difference learning," Int. J. Inf. Technol., vol. 11, no. 9, 2005.
[45] J. N. Tsitsiklis and B. V. Roy, "An analysis of temporal difference learning with function approximation," IEEE Trans. Autom. Control, vol. 42, no. 5, May 1997.
[46] A. Nedic and D. P. Bertsekas, "Least squares policy evaluation algorithms with linear function approximation," Discrete Event Dyn. Syst., vol. 13, nos. 1-2, Jan.-Apr. 2003.
[47] T. Dierks and S. Jagannathan, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, Jul. 2012.
[48] S. Zhong, X. Zeng, S. Wu, and L. Han, "Sensitivity-based adaptive learning rules for binary feedforward neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, Mar. 2012.

Xin Xu (M'07-SM'12) received the B.S. degree in electrical engineering from the Department of Automatic Control, National University of Defense Technology (NUDT), Changsha, China, in 1996, and the Ph.D. degree in control science and engineering from the College of Mechatronics and Automation, NUDT, in 2002. He is currently a Full Professor with the Institute of Unmanned Systems, College of Mechatronics and Automation, NUDT. He has been a Visiting Scientist for cooperative research with Hong Kong Polytechnic University, Hong Kong, the University of Alberta, Edmonton, AB, Canada, the University of Guelph, Guelph, ON, Canada, and the University of Strathclyde, Glasgow, U.K. He has authored or co-authored more than 90 papers in international journals and conferences and has co-authored four books. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles. Dr. Xu was a recipient of the first-class Natural Science Award of Hunan Province, China, in 2009 and the Fok Ying Tong Youth Teacher Fund of China in 2008. He is currently an Associate Editor of Information Sciences and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. He is a member of the IEEE Technical Committee on Approximate Dynamic Programming and Reinforcement Learning (ADPRL) and the IEEE Technical Committee on Robot Learning. He has served as a PC member or session chair for many international conferences.

Zhongsheng Hou received the Bachelor's and Master's degrees from the Jilin University of Technology, Changchun, China, in 1983 and 1988, respectively, and the Ph.D. degree from Northeastern University, Shenyang, China, in 1994. He was a Post-Doctoral Fellow with the Harbin Institute of Technology, Harbin, China, from 1995 to 1997, and a Visiting Scholar with Yale University, New Haven, CT, from 2002 to 2003. In 1997, he joined Beijing Jiaotong University, Beijing, China, where he is currently a Full Professor, the Founding Director of the Advanced Control Systems Laboratory, and the Head of the Department of Automatic Control, School of Electronic and Information Engineering. He has authored or co-authored over 100 papers in peer-reviewed journals and over 100 papers in prestigious conference proceedings, and has authored two monographs, Nonparametric Model and its Adaptive Control Theory and the forthcoming Model Free Adaptive Control: Theory and Applications (CRC Press, 2013). His current research interests include data-driven control, model-free adaptive control, learning control, and intelligent transportation systems. Dr. Hou has served as a committee member of over 40 international and Chinese conferences, and as an Associate Editor and Guest Editor for several international and Chinese journals.

Chuanqiang Lian received the Bachelor's degree from the Department of Automation, Qinghua University, Beijing, China, in 2008, and the Master's degree from the College of Mechatronics and Automation, National University of Defense Technology, Changsha, China, in 2010, where he is currently pursuing the Ph.D. degree with the Institute of Unmanned Systems. He has co-authored more than 10 papers in international journals and conferences. His current research interests include reinforcement learning, approximate dynamic programming, and autonomous vehicles.

Haibo He (SM'11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, in 2006. He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 2006 to 2009. He is currently an Associate Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI. He has authored or co-authored over 100 peer-reviewed journal and conference papers, authored one research book (Wiley), and edited six conference proceedings (Springer). His research has been highlighted in numerous media outlets, such as the IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News. His current research interests include adaptive dynamic programming, machine learning, computational intelligence, hardware design for machine intelligence, and various applications such as the smart grid. Dr. He was a recipient of the National Science Foundation CAREER Award in 2011 and the Providence Business News Rising Star Innovator Award in 2011. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID.
