arXiv: v1 [cs.SY] 20 Sep 2017


On the Design of LQR Kernels for Efficient Controller Learning

Alonso Marco¹, Philipp Hennig¹, Stefan Schaal¹,², and Sebastian Trimpe¹

Abstract: Finding optimal feedback controllers for nonlinear dynamic systems from data is hard. Recently, Bayesian optimization (BO) has been proposed as a powerful framework for direct controller tuning from experimental trials. For selecting the next query point and finding the global optimum, BO relies on a probabilistic description of the latent objective function, typically a Gaussian process (GP). As is shown herein, GPs with a common kernel choice can, however, lead to poor learning outcomes on standard quadratic control problems. For a first-order system, we construct two kernels that specifically leverage the structure of the well-known Linear Quadratic Regulator (LQR), yet retain the flexibility of Bayesian nonparametric learning. Simulations of uncertain linear and nonlinear systems demonstrate that the LQR kernels yield superior learning performance.

I. INTRODUCTION

A core problem of learning control is to determine optimal feedback controllers for (partially) unknown nonlinear systems from experimental data. Reinforcement learning (RL) [1], [2] is a promising framework for this, yet often requires performing many experiments on the physical system to even find suitable controllers, which limits the applicability of such techniques. Therefore, a lot of research effort has been invested into data efficiency of RL, aiming at learning controllers from as few experiments as possible. Recently, Bayesian optimization (BO) has been proposed for RL as a promising approach in this direction. BO employs a probabilistic description of the latent objective function (typically a Gaussian process (GP)), which allows for selecting the next control experiments in a principled manner, e.g., to maximize information gain [3] or perform safe exploration [4].
While BO provides a promising framework for learning controllers in fairly general settings, the full power of Bayesian learning is often not exploited. A key advantage of Bayesian methods is that they allow for combining prior problem knowledge with learning from data in a principled manner. In the case of GP models, this concerns specifically the choice of the kernel, which captures the covariance between function values at different inputs and is thus the core component for modeling prior knowledge about the function shape. By choosing standard kernels, however, naive BO approaches often fail to exploit this opportunity to improve learning performance.

¹ Max Planck Institute for Intelligent Systems, 77 Tübingen, Germany. {amarco, phennig, sschaal, strimpe}@tuebingen.mpg.de
² Computational Learning and Motor Control Lab, University of Southern California, Los Angeles, USA.
This work was supported in part by the Max Planck Society, the Max Planck ETH Center for Learning Systems, National Science Foundation grants IIS-159, IIS-11713, EECS-95, the Office of Naval Research, and the Okawa Foundation.

In this paper, we show how structural knowledge about the optimal control problem at hand can be leveraged for designing specific kernels in order to improve data efficiency in learning control. For a first-order nonlinear quadratic optimal control problem, we propose two LQR kernels that leverage the structure of the famous Linear Quadratic Regulator (LQR) problem, given in the form of an approximate linear model of the true nonlinear dynamics. The proposed kernels exploit this structure while retaining the flexibility of nonparametric GP regression.

Contributions: In detail, this paper makes the following contributions:
1) We discuss how the structure of the well-known LQR problem can be leveraged for efficient learning of controllers for nonlinear systems.
2) This discussion leads to the proposal of two new kernels for GP regression in the context of learning control: the parametric and nonparametric LQR kernels.
3) The improved learning performance achieved with these kernels over a standard kernel is demonstrated through numerical simulations.

Related work: BO for learning controllers has recently been considered, for example in [3]-[7], in different flavors. While [3] and [5] focus on data efficiency by maximizing the information gain in each iteration, [4] and [6] propose methods for safe exploration. Different BO algorithms are compared in [7]. These works all consider tuning of state-feedback controllers with a quadratic cost (possibly saturated) similar to the setting herein, but using standard kernels. The design of customized kernels for GP regression has been considered before in the context of control and robotics for related problems. A kernel for bipedal locomotion, which captures typical gait characteristics, is proposed in [8]. In [9], an impedance-based model is incorporated as prior knowledge for improved predictions in human-robot interaction. For the problem of maximizing power generation in photovoltaic plants, the authors in [10] incorporate explicit basis functions about known power curves in the kernel. In [5], a kernel is designed to model information from simulation and physical experiments, in order to leverage both sources of information for RL. None of the above references considers the problem of incorporating the structure of the LQR problem for improving data efficiency in learning control with BO.

Accepted final version. To appear in 56th IEEE Conference on Decision and Control. © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Outline: We introduce the considered learning control problem in Sec. II, and summarize BO for RL in Sec. III, together with the necessary background on GPs. While the BO framework for learning control is introduced for general multivariate systems, we focus on the special case of a scalar
problem thereafter and develop the LQR kernels in Sec. IV. Numerical results in Sec. V illustrate the improved learning performance of the proposed kernels over a standard kernel. The paper concludes with remarks in Sec. VI.

Notation: A Gaussian random variable $x$ with mean $m$ and variance $V$ is denoted by $x \sim \mathcal{N}(m, V)$. The expected value of $x$ is denoted by $\mathbb{E}[x]$, while $\mathbb{V}[x_1, x_2]$ denotes the covariance of $x_1$ and $x_2$. We also use $\mathbb{V}[x_1] := \mathbb{V}[x_1, x_1]$.

II. LEARNING CONTROL PROBLEM

We consider regulation of the nonlinear stochastic system
$$x_{t+1} = g(x_t, u_t, v_t) \tag{1}$$
with state $x_t \in \mathbb{R}^{n_x}$, control input $u_t \in \mathbb{R}^{n_u}$, and random noise $v_t \sim \mathcal{N}(0, V)$. We assume that the state $x_t$ can be measured or otherwise estimated. The system dynamics $g(\cdot)$ are unknown. The control objective is to find a state-feedback controller
$$u_t = F x_t \tag{2}$$
with controller gain $F \in \mathbb{R}^{n_u \times n_x}$ such that the quadratic cost
$$J = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} x_t^{\mathsf{T}} Q x_t + u_t^{\mathsf{T}} R u_t \right] \tag{3}$$
with symmetric weights $Q$ and $R > 0$ is minimized. A quadratic cost is a very common choice in optimal control [11], expressing the designer's preference in the fundamental trade-off between control performance and control effort.

While the solution of the above problem is standard when the dynamics (1) are known and linear [11], here we face the problem of optimal control under unknown and nonlinear dynamics. To address this problem, we follow a direct RL approach [12], where the optimal controller is learned by directly evaluating the performance of candidate controllers $F_i$ on the real system (1) without learning an (intermediate) dynamics model. We employ Bayesian optimization (BO) as a popular approach for sequential global optimization, which leverages the information from previous trials in order to propose the next candidate controller $F_{i+1}$ so as to eventually find the optimum of (3). Because evaluations on the system are expensive, we seek to find the optimal controller with as few evaluations as possible.
To this end, we address herein how to leverage information from a linear model that approximates the true system (1).

III. BAYESIAN OPTIMIZATION FOR LEARNING CONTROL

We briefly introduce necessary background material on Gaussian processes (GPs) and Bayesian optimization (BO), before describing reinforcement learning with BO.

A. Gaussian process regression

Let $\theta$ denote the free parameters of a feedback controller; that is, $\theta = \mathrm{vec}(F)$ in (2), or any other parameterization. The dependence of the cost function $J$ in (3) on $\theta$ is unknown a priori because of the lack of knowledge of the system (1). We use GP regression to approximate the function $J : \Theta \to \mathbb{R}$, $\theta \in \Theta$, from data (i.e., noisy function evaluations)
$$J_i = J(\theta_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \tag{4}$$
Obtaining one data point $J_i$ corresponds to performing a closed-loop experiment on the true system (1) with controller $\theta_i$, recording the state-input trajectory, and computing the cost from these data (in the case of (3), a finite approximation is used with a sufficiently long horizon $T$).

A GP can be defined as a probability distribution over the space of functions $J : \Theta \to \mathbb{R}$ whose restriction to any finite number of function values is jointly Gaussian [13, p. 13]. A GP is specified through its prior mean function $\mu : \Theta \to \mathbb{R}$ and covariance function $k : \Theta \times \Theta \to \mathbb{R}$. We write $J(\theta) \sim \mathcal{GP}(\mu(\theta), k(\theta, \theta'))$ with
$$\mu(\theta) = \mathbb{E}[J(\theta)], \qquad k(\theta, \theta') = \mathbb{V}[J(\theta), J(\theta')].$$
Hence, $\mu$ is the expected function value, and $k$ captures the covariance between any two function values and is used to model uncertainty. The latter is also called the kernel and must satisfy the following property to give rise to a valid covariance function:

Definition 1 (according to [13]): Let $k : \Theta \times \Theta \to \mathbb{R}$ be a function, and $K_N \in \mathbb{R}^{N \times N}$ be the symmetric matrix whose entries are computed as $K_{ij} = k(\theta_i, \theta_j)$ from the collection $\{\theta_1, \theta_2, \ldots, \theta_N\}$, $\theta_i \in \Theta$. The function $k$ is called a positive semidefinite kernel if $K_N$ is positive semidefinite for any finite collection $\{\theta_1, \theta_2, \ldots, \theta_N\}$. The matrix $K_N$ is called the Gram matrix.
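Definition 1 can be checked numerically for a given collection of inputs. The following is a minimal sketch, not part of the original paper: it builds the Gram matrix for a squared exponential kernel and tests positive semidefiniteness by attempting a Cholesky factorization. The function names, the hyperparameter values, the test inputs, and the jitter term are all illustrative assumptions.

```python
import math

def se_kernel(x, y, signal_var=1.0, length_scale=0.5):
    """Squared exponential kernel; hyperparameter values are illustrative."""
    return signal_var * math.exp(-((x - y) ** 2) / (2.0 * length_scale ** 2))

def gram_matrix(kernel, thetas):
    """K[i][j] = k(theta_i, theta_j) for a finite collection of inputs."""
    return [[kernel(ti, tj) for tj in thetas] for ti in thetas]

def is_positive_semidefinite(K, jitter=1e-10):
    """Try a Cholesky factorization of K + jitter*I.

    Success means K is positive semidefinite up to numerical tolerance;
    a non-positive pivot means it is not.
    """
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = K[i][i] + jitter - s
                if d <= 0.0:
                    return False
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return True

# Hypothetical controller parameters used only for this check.
thetas = [-1.5, -1.0, -0.5, -0.1]
K_se = gram_matrix(se_kernel, thetas)

# A symmetric function that is NOT a valid kernel, for contrast.
not_a_kernel = lambda x, y: -abs(x - y)
K_bad = gram_matrix(not_a_kernel, thetas)
```

Passing the check for one collection does not prove that a function is a kernel, since Definition 1 quantifies over all finite collections; failing it for a single collection does disprove it.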
A kernel typically has its own parameters, called hyperparameters, such as length scales or output variance. By choosing the prior mean and the kernel, the user can specify prior knowledge about the function such as expected shape, length scales, and smoothness properties.

In GP regression, learning a function amounts to predicting the (normal) distribution of function values $J(\theta_*)$ at arbitrary inputs $\theta_* \in \Theta$ based on previous evaluations. Given $N$ data points $\mathcal{D}_N = \{\theta_i, J_i\}_{i=1}^{N}$, the posterior mean and variance of $J(\theta_*)$ can be stated in closed form as
$$\mathbb{E}[J(\theta_*) \mid \mathcal{D}_N] = \mu(\theta_*) + k_N^{\mathsf{T}}(\theta_*)\,(K_N + \sigma^2 I)^{-1}\,\hat{J}_N \tag{5}$$
$$\mathbb{V}[J(\theta_*) \mid \mathcal{D}_N] = k(\theta_*, \theta_*) - k_N^{\mathsf{T}}(\theta_*)\,(K_N + \sigma^2 I)^{-1}\,k_N(\theta_*) \tag{6}$$
where $k_N(\theta_*) := [k(\theta_*, \theta_1), \ldots, k(\theta_*, \theta_N)]^{\mathsf{T}}$ and $\hat{J}_N := [J_1 - \mu(\theta_1), \ldots, J_N - \mu(\theta_N)]^{\mathsf{T}}$.

In addition to computing the posterior distribution, the data $\mathcal{D}_N$ can also be used to optimize hyperparameters in order to improve the model fit. A popular approach is maximizing the marginal likelihood of the data, $p(\hat{J}_N \mid \theta_1, \ldots, \theta_N)$, which is given in logarithmic form as [13, p. 19]
$$-\tfrac{1}{2}\,\hat{J}_N^{\mathsf{T}} (K_N + \sigma^2 I)^{-1} \hat{J}_N - \tfrac{1}{2} \log \left| K_N + \sigma^2 I \right| - \tfrac{N}{2} \log 2\pi. \tag{7}$$
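The closed-form expressions (5)-(7) translate almost directly into code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses only the Python standard library, solves the linear systems with a hand-written Cholesky factorization, and all function names (`gp_posterior`, `log_marginal_likelihood`, and the helpers) are hypothetical.

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L^T (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def _noisy_gram(kernel, thetas, noise_var):
    """K_N + sigma^2 I."""
    n = len(thetas)
    return [[kernel(thetas[i], thetas[j]) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def gp_posterior(kernel, mean_fn, thetas, costs, noise_var, theta_star):
    """Posterior mean (5) and variance (6) at a single test input."""
    L = cholesky(_noisy_gram(kernel, thetas, noise_var))
    residuals = [c - mean_fn(t) for t, c in zip(thetas, costs)]   # J_hat_N
    k_star = [kernel(theta_star, t) for t in thetas]              # k_N(theta_*)
    alpha = chol_solve(L, residuals)
    beta = chol_solve(L, k_star)
    mean = mean_fn(theta_star) + sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(theta_star, theta_star) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var

def log_marginal_likelihood(kernel, mean_fn, thetas, costs, noise_var):
    """Log marginal likelihood (7)."""
    L = cholesky(_noisy_gram(kernel, thetas, noise_var))
    residuals = [c - mean_fn(t) for t, c in zip(thetas, costs)]
    alpha = chol_solve(L, residuals)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(thetas)))
    n = len(thetas)
    return (-0.5 * sum(r * a for r, a in zip(residuals, alpha))
            - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi))

# Illustrative kernel and prior mean (assumed values, not from the paper).
def unit_se(x, y):
    return math.exp(-0.5 * (x - y) ** 2)

def zero_mean(theta):
    return 0.0
```

With a single observation $J_1 = 2$ at $\theta_1 = 0$, unit signal variance and noise variance 0.5, the posterior at $\theta_* = 0$ has mean $2/1.5$ and variance $1 - 1/1.5$, which gives a quick sanity check of the implementation.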

Algorithm 1 Learning control with Bayesian optimization
 1: Specify objective function $J$ (e.g., (3) with finite horizon $T$)
 2: Specify GP prior (mean $\mu$, kernel $k$)
 3: Initialize: controller $\theta_0$; data set $\mathcal{D}_0 = \emptyset$
 4: while (not terminated) do    (e.g., fixed number of experiments)
 5:     perform closed-loop experiment with controller $\theta_i$
 6:     compute $J_i$ from experimental data $\{x_t, u_t\}_{t=0}^{T-1}$
 7:     add evaluation $\{\theta_i, J_i\}$ to data set $\mathcal{D}_i$
 8:     update GP posterior (5), (6)
 9:     [optional:] optimize hyperparameters (e.g., maximize (7))
10:     compute next controller $\theta_{i+1}$ via (8)
11: end while
12: determine best guess $\theta_{bg}$ for the controller parameters, e.g., minimum of posterior mean (5)
13: return $\theta_{bg}$

B. Bayesian optimization

Bayesian optimization [14] denotes a class of global optimization methods, where a probabilistic description $p(J(\theta))$ of the latent objective function $J$ is used for data-efficient optimization. Here, the probabilistic description is given by the GP $J \sim \mathcal{GP}(\mu, k)$. Given $p(J(\theta))$, a utility $U[p(J(\theta))]$ is defined, which is maximized in each iteration to find the next evaluation point
$$\theta_{i+1} = \arg\max_{\theta}\; U[p(J(\theta))]. \tag{8}$$
Different BO algorithms are distinguished primarily by how the selection of the next evaluation (8) is formulated. Popular BO algorithms include expected improvement (EI) [15], probability of improvement (PI) [16], GP upper confidence bound (GP-UCB) [17], and entropy search (ES) [18]. The method developed herein is agnostic to the type of BO algorithm used. For the examples in Sec. V, we employ EI, which uses
$$U[p(J(\theta))] = \mathbb{E}[\max\{0,\, J_{low} - J(\theta)\}] \tag{9}$$
for the optimization in (8), where $J_{low}$ represents the lowest function value in the current data set $\mathcal{D}_N$. EI thus selects the next query point where the expected improvement upon $J_{low}$ (under the GP of $J$) is maximal.

C. Reinforcement learning with Bayesian optimization

Algorithm 1 summarizes direct RL using Bayesian optimization. This framework has been successfully applied in experimental settings to learn feedback controllers for the control problem of Sec.
II and related scenarios. Berkenkamp et al. [4] optimize a state-feedback controller (2) for quadrotor trajectory tracking using a safe BO algorithm [19], which builds on GP-UCB. Marco et al. [3] propose to parametrize the feedback gain $F$ as an LQR policy [20], whose optimal weights are learned using ES, which maximizes the information gain from each experiment. This algorithm has been extended in [5] to include information from different sources (such as simulation and physical experiments). Both methods [3], [5] were successfully applied to learn pole-balancing controllers. Calandra et al. [7] compare PI, EI, and GP-UCB for learning the parameters of a discrete event controller for a walking robot.

IV. CONSTRUCTING LQR KERNELS

In this section, we present two kernels that are specifically designed for the control problem of Sec. II and incorporate approximate model knowledge in the form of a linear approximation to (1). Because the solution to the optimal control problem of Sec. II for a linear system is the well-known LQR, we term these kernels LQR kernels.

The following derivations are presented for a first-order system (1), where all variables are scalars ($n_x = n_u = 1$). Small letters are used in place of the capital ones to emphasize scalar variables (e.g., $v$ instead of $V$ in (1) and $f$ instead of $F$ in (2)). We consider minimization of the cost (3), which is rewritten as
$$J = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} q x_t^2 + r u_t^2 \right]. \tag{10}$$
We start by considering a learning example with a standard kernel choice for the GP, which motivates why a specifically designed kernel can be desirable.

A. Problems with standard kernel

The most common kernel choice in GP regression is arguably the squared exponential (SE) kernel [13]
$$k_{SE}(\theta, \theta') = \sigma_{SE}^2 \exp\left( -\frac{(\theta - \theta')^2}{2 l^2} \right) \tag{11}$$
with signal variance $\sigma_{SE}^2$ and length scale $l$ as its hyperparameters. Let us consider the problem of learning the cost function $J$ in (10) via GP regression with the SE kernel for the following linear example:

Example 1: Let (1) be given by $x_{t+1} = 0.9 x_t + u_t + v_t$ with $v_t \sim \mathcal{N}(0, 1)$.
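To make concrete what one noisy data point for such an example looks like, the sketch below simulates a closed loop with $u_t = f x_t$ and averages the stage cost over a finite horizon, as described for (3) and (10). It is an illustration only: the function name `rollout_cost`, the weights $q = r = 1$, the horizon, the zero initial state, and the seed are assumptions, and the coefficient 0.9 is taken from the example as reconstructed here.

```python
import random

def rollout_cost(f, a=0.9, b=1.0, q=1.0, r=1.0, noise_std=1.0,
                 horizon=20000, seed=0):
    """Finite-horizon estimate of the average quadratic cost for gain f.

    Simulates x_{t+1} = a*x_t + b*u_t + v_t with u_t = f*x_t and
    v_t ~ N(0, noise_std^2), starting from x_0 = 0 (an assumption).
    """
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(horizon):
        u = f * x
        total += q * x * x + r * u * u       # stage cost q*x^2 + r*u^2
        x = a * x + b * u + rng.gauss(0.0, noise_std)
    return total / horizon

# Two hypothetical candidate gains; both stabilize the assumed system.
cost_mid = rollout_cost(-0.5)
cost_edge = rollout_cost(-0.05)
```

Because the horizon is finite, repeated calls with different seeds return slightly different values, which is the evaluation noise that the GP model has to absorb. Gains close to the stability boundary give much larger and noisier costs than gains in the middle of the stable range.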
Figure 1 shows the prior GP and the posterior GP after obtaining four data points (i.e., after four evaluations of controllers $f_i$). A few issues are apparent from the posterior: (i) the kernel has problems with the different length scales of the function ($J$ is steep around $f = .1$, but rather flat in the center region); (ii) the GP does not generalize well to regions where no data has been seen (e.g., around $f = 1.$, where the posterior mean resorts to the prior); and (iii) the overall fit is not very good. Clearly, the fit will improve with more data, but, for efficient and fast learning of controllers, we are particularly interested in good fits from just a few data points. Hence, we seek to improve the fitting properties of the GP by exploiting knowledge about the system (1) in terms of an approximate linear model.

B. Incorporating prior knowledge

In practice, one never has a perfect model of the system to be controlled. At the same time, it is often possible to obtain a rough model, e.g., from first-principles modeling or some system identification procedure. Here, we assume that an uncertain linear model
$$x_{t+1} = a x_t + b u_t + v_t \tag{12}$$
$$a \in [a_{\min}, a_{\max}], \quad b \in [b_{\min}, b_{\max}] \tag{13}$$

is available as an approximation to (1), e.g., from linearization of a first-principles model with (possibly) some uncertainty about the physical parameters. In the following, we will consider controller gains $f$ such that the system (12) is guaranteed stable for all parameters (13). That is, we consider $f \in \mathcal{F}$ with
$$\mathcal{F} := \{ f \in \mathbb{R} \;:\; |a + bf| < 1 \;\; \forall a \in [a_{\min}, a_{\max}],\; b \in [b_{\min}, b_{\max}] \}. \tag{14}$$
This restriction makes sense, for example, in safety-critical applications, where one wants to avoid the risk of trying an unstable controller based on the system knowledge available (i.e., (12), (13)). Moreover, the restriction to $\mathcal{F}$ will ensure that subsequent calculations are well-defined.

[Figure 1: two panels, "Prior GP" and "Posterior GP (Ex. 1)"; vertical axis $J(f)$.]
Fig. 1. Prior and posterior GPs of Example 1 using the squared exponential kernel $k_{SE}$ (hyperparameters: $\sigma_{SE}^2 = 5$, $l = .$). The thick line represents the GP mean, and the light blue area the GP variance (+/- two standard deviations). The true function is shown in the bottom plot in dashed gray, and data points in orange.

If $a$ and $b$ were known, the functional dependence of the cost $J$ on the controller gain $f$ in (2) for the linear system (12) could be derived using standard control theory.

Fact 1: Consider the system (12) with known parameters $a$ and $b$, and let $|a + bf| < 1$. Then, the cost (10) is given by¹
$$J = v\, \frac{q + r f^2}{1 - (a + bf)^2} =: \phi_{(a,b)}(f). \tag{15}$$

Proof: The controlled process $x_{t+1} = (a + bf) x_t + v_t$ is stable by assumption and thus converges to a stationary process with zero mean and variance $\mathbb{E}[x_t^2] = P$, where $P$ is the unique positive solution to $P - P(a + bf)^2 = v$ [21, Theorem 3.1]. For stationary $x_t$, (10) resolves into $J = \mathbb{E}[q x_t^2 + r u_t^2] = (q + r f^2)\, \mathbb{E}[x_t^2] = (q + r f^2) P$. Equation (15) then follows.

¹ In the notation $\phi_{(a,b)}(f)$, we omit the parametric dependence on $v$, $q$ and $r$, since $v$ is a multiplicative constant that does not play a role in the later optimization, and we assume that $q$ and $r$ are fixed.
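Fact 1 is easy to verify numerically. The following sketch (illustrative only; the names `lqr_cost` and `stationary_variance` and the chosen numbers are assumptions) evaluates the closed-form cost $v (q + r f^2) / (1 - (a + bf)^2)$ and compares it with $(q + r f^2) P$, where $P$ is obtained by iterating the variance recursion $P \leftarrow (a + bf)^2 P + v$ used in the proof.

```python
def lqr_cost(f, a, b, q=1.0, r=1.0, v=1.0):
    """Closed-form stationary cost of Fact 1 for a scalar linear system.

    Valid only for stabilizing gains, i.e. |a + b*f| < 1.
    """
    closed_loop = a + b * f
    if abs(closed_loop) >= 1.0:
        raise ValueError("gain f does not stabilize the system (|a + b f| >= 1)")
    return v * (q + r * f * f) / (1.0 - closed_loop ** 2)

def stationary_variance(f, a, b, v=1.0, iterations=2000):
    """Fixed point of P = (a + b f)^2 P + v, found by simple iteration."""
    P = 0.0
    for _ in range(iterations):
        P = (a + b * f) ** 2 * P + v
    return P

# Hypothetical numbers: a = 0.9, b = 1, gain f = -0.5, q = r = v = 1.
f, a, b = -0.5, 0.9, 1.0
cost_closed_form = lqr_cost(f, a, b)
cost_from_variance = (1.0 + f * f) * stationary_variance(f, a, b)
```

For these numbers the closed loop is $a + bf = 0.4$, so both expressions evaluate to $1.25 / 0.84 \approx 1.488$. The explicit stability check mirrors the restriction of the gains to a stabilizing set, outside of which the formula is not meaningful.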
If we are uncertain about $a$ and $b$, a collection of possible costs (15) emerges from all the possible combinations of $a$ and $b$ within their ranges (13). We assume this collection of costs to be explained by a Gaussian process $J_{LQR}$. While any arbitrary choice of $a$ and $b$ in (12) is only an approximation to (1), it yields a cost (15) that contains useful structural information, which can be leveraged for faster learning of controllers from data. In the next sections, we show how this prior knowledge can be exploited to construct LQR kernels.

C. Parametric LQR kernel

A reasonable choice for $J_{LQR}(f)$ is
$$J_{LQR}(f) = w\, \phi_{(\bar{a}, \bar{b})}(f), \quad w \sim \mathcal{N}(\bar{w}, \sigma_w^2) \tag{16}$$
where $\bar{a} := (a_{\min} + a_{\max})/2$ and $\bar{b} := (b_{\min} + b_{\max})/2$ are the midpoints of the uncertainty intervals (13). Equation (16) is a standard parametric model with a single feature $\phi_{(\bar{a}, \bar{b})}(f)$ and Gaussian prior. It is well known (see, e.g., [13]) that $J_{LQR}(f)$ is a GP,
$$J_{LQR}(f) \sim \mathcal{GP}\left(0,\; k_{pLQR}(f, f')\right) \tag{17}$$
where we have assumed $\bar{w} = 0$, with kernel
$$k_{pLQR}(f, f') := \sigma_w^2\, \phi_{(\bar{a}, \bar{b})}(f)\, \phi_{(\bar{a}, \bar{b})}(f') = \sigma_p^2 v^2\, \frac{(q + r f^2)(q + r f'^2)}{\left(1 - (\bar{a} + \bar{b} f)^2\right)\left(1 - (\bar{a} + \bar{b} f')^2\right)} \tag{18}$$
and hyperparameters $\sigma_p := \sigma_w$, $\bar{a}$, and $\bar{b}$. We refer to (18) as the parametric LQR kernel because it captures the cost function for the linear system $(\bar{a}, \bar{b})$ with quadratic cost, and thus the structure of the LQR problem.

To illustrate the performance of the parametric LQR kernel $k_{pLQR}$, we revisit Example 1 using this kernel instead of the SE kernel. The top two graphs of Fig. 2 show the prior and posterior GP for the same data points as in Fig. 1 when using $k_{pLQR}$ with hyperparameters $\bar{a} = 0.9$ and $\bar{b} = 1$. We see that the posterior fit from only four data points is almost perfect and much better than the one in Fig. 1. This is, of course, not a big surprise because the hyperparameters match the true underlying system of Example 1 perfectly. The fitting performance deteriorates, however, if the hyperparameters are off.
This can be seen in the bottom graph of Fig. 2, which shows the posterior GP with the same hyperparameters, but for the system

Example 2: $x_{t+1} = 0.8 x_t + 0.9 u_t + v_t$ with $v_t \sim \mathcal{N}(0, 1)$.

To improve the fit in this situation, one can employ hyperparameter optimization (see Sec. III-A) to find improved parameters $\bar{a}$ and $\bar{b}$ that better explain the data. The simulation results in Sec. V will show that this can be a viable approach. An alternative is to design more flexible and expressive kernels, which allow for fitting more general models. This we discuss next.

D. Nonparametric LQR kernel

The kernel (18) captures the structure of the cost function for the LQR problem with one specific linear model $(\bar{a}, \bar{b})$. A straightforward way to increase the flexibility of the kernel

in order to fit more general problems is to use $m$ basis functions (or features) $\phi_{(a_i, b_i)}$ corresponding to $m$ models $(a_1, b_1), (a_2, b_2), \ldots, (a_m, b_m)$,
$$J_{LQR}(f) = \underbrace{\left[ \phi_{(a_1,b_1)}(f) \;\; \phi_{(a_2,b_2)}(f) \;\; \cdots \;\; \phi_{(a_m,b_m)}(f) \right]}_{=:\, \Phi^{\mathsf{T}}(f)}\, w = \Phi^{\mathsf{T}}(f)\, w \tag{19}$$
with $w \in \mathbb{R}^m$, $w \sim \mathcal{N}(\bar{w}, \Sigma_w)$. The derivation of the corresponding kernel is analogous to (18) (see [13, Sec. 2.7]) and yields
$$k_{pLQR,m}(f, f') = \Phi^{\mathsf{T}}(f)\, \Sigma_w\, \Phi(f'). \tag{20}$$

[Figure 2: three panels, "Prior GP", "Posterior GP (Ex. 1)", and "Posterior GP (Ex. 2)"; vertical axis $J(f)$.]
Fig. 2. GP fit using the parametric LQR kernel $k_{pLQR}$ (hyperparameters: $\sigma_p^2 = 1$, $\bar{a} = 0.9$, $\bar{b} = 1$). The color code is the same as in Fig. 1. The hyperparameters are exact for Example 1, while they are off by about 10% for Example 2.

Same as (16), the model (19) represents a parametric model for the LQR cost $J_{LQR}$. That is, its flexibility is essentially limited by the number of explicit features $\phi_{(a_i, b_i)}$. Employing powerful kernel techniques [22], the parametric model can be turned into a nonparametric one, which includes an infinite number of features while retaining finite computational complexity.

The key idea is to consider the kernel (20) in the limit of infinitely many features corresponding to models $a \in [a_{\min}, a_{\max}]$ and $b \in [b_{\min}, b_{\max}]$. The derivation follows ideas similar to how the standard SE kernel can be derived from basic features [13, p. 84]. Consider the partitions of $[a_{\min}, a_{\max}]$ and $[b_{\min}, b_{\max}]$ into $m$ equidistant intervals, and let $\{a_i\}_{1:m}$ and $\{b_i\}_{1:m}$ be the lower (or upper) interval limits. Consider the model (19) with feature vector
$$\Phi^{\mathsf{T}}(f) = \left[ \phi_{(a_1,b_1)}(f) \;\; \cdots \;\; \phi_{(a_i,b_j)}(f) \;\; \cdots \;\; \phi_{(a_m,b_m)}(f) \right]$$
which includes all combinations $\phi_{(a_i,b_j)}$ for $i, j \in \{1, \ldots, m\}$, and the parametric prior $\bar{w} = 0$ and $\Sigma_w = \sigma_n^2\, \frac{(a_{\max} - a_{\min})(b_{\max} - b_{\min})}{m^2}\, I$ for some $\sigma_n \in \mathbb{R}$. The kernel (20) then becomes
$$k_{pLQR,m}(f, f') = \sigma_n^2\, \frac{(a_{\max} - a_{\min})(b_{\max} - b_{\min})}{m^2} \sum_{j=1}^{m} \sum_{i=1}^{m} \phi_{(a_i,b_j)}(f)\, \phi_{(a_i,b_j)}(f'). \tag{21}$$
Since $\phi_{(a_i,b_j)}$ is continuous on $\mathcal{F}$, the finite sum (21) converges to the Riemann integral in the limit as $m \to \infty$.
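The convergence of the finite-feature kernel can be observed numerically. The sketch below (an illustration, not the authors' code) evaluates the double sum over an $m \times m$ grid of models using the lower interval limits, with each term weighted by the cell area $(a_{\max} - a_{\min})(b_{\max} - b_{\min})/m^2$. The function names, the interval bounds $[0.8, 1.0]$ and $[0.9, 1.1]$, the weights $q = r = v = 1$, $\sigma_n^2 = 1$, and the two test gains are assumptions chosen so that all closed loops are stable.

```python
def phi(f, a, b, q=1.0, r=1.0, v=1.0):
    """LQR cost feature: v (q + r f^2) / (1 - (a + b f)^2)."""
    return v * (q + r * f * f) / (1.0 - (a + b * f) ** 2)

def k_plqr_m(f1, f2, m, a_range=(0.8, 1.0), b_range=(0.9, 1.1), sigma_n2=1.0):
    """Finite-feature LQR kernel: Riemann sum over an m x m grid of models.

    Grid points are the lower limits of m equidistant sub-intervals.
    """
    da = (a_range[1] - a_range[0]) / m
    db = (b_range[1] - b_range[0]) / m
    total = 0.0
    for i in range(m):
        a = a_range[0] + i * da
        for j in range(m):
            b = b_range[0] + j * db
            total += phi(f1, a, b) * phi(f2, a, b)
    return sigma_n2 * da * db * total

# Kernel values for two hypothetical gains at increasing resolution.
f1, f2 = -0.5, -0.3
k50 = k_plqr_m(f1, f2, 50)
k100 = k_plqr_m(f1, f2, 100)
k200 = k_plqr_m(f1, f2, 200)
```

The gap between successive refinements shrinks as $m$ grows, which is the behaviour expected of a Riemann sum of a continuous integrand. The cost of the sum grows as $m^2$, whereas the limiting integral can be handed to any standard quadrature routine.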
We can thus define the nonparametric LQR kernel
$$k_{nLQR}(f, f') = \lim_{m \to \infty} k_{pLQR,m}(f, f') = \sigma_n^2 \int_{b_{\min}}^{b_{\max}} \int_{a_{\min}}^{a_{\max}} \phi_{(a,b)}(f)\, \phi_{(a,b)}(f')\; da\, db \tag{22}$$
for $f, f' \in \mathcal{F}$, with the signal variance $\sigma_n^2$ and the integration boundaries $a_{\min}$, $a_{\max}$, $b_{\min}$, and $b_{\max}$ as hyperparameters. While the kernel in (22) represents the structure of the cost function (15) for an infinite number of models (all $a \in [a_{\min}, a_{\max}]$ and $b \in [b_{\min}, b_{\max}]$), its computation is finite, consisting of solving the integral in (22). By contrast, the computational complexity of the parametric kernels (20) and (21) grows with the number of features $m$.

We prove that the kernel (22) is indeed a valid covariance function.

Proposition 1: $k_{nLQR}(f, f')$ is positive semidefinite for all $f, f' \in \mathcal{F}$.

Proof: Take any collection $\{\theta_1, \theta_2, \ldots, \theta_N\}$ and any $z \in \mathbb{R}^N$. Let $K_N$ be the Gram matrix for the kernel $k_{nLQR}$. Then
$$z^{\mathsf{T}} K_N z = \sum_{j=1}^{N} \sum_{i=1}^{N} z_j z_i\, k_{nLQR}(\theta_i, \theta_j)$$
$$= \sigma_n^2 \sum_{j=1}^{N} \sum_{i=1}^{N} z_j z_i \int_{b_{\min}}^{b_{\max}} \int_{a_{\min}}^{a_{\max}} \phi_{(a,b)}(\theta_i)\, \phi_{(a,b)}(\theta_j)\; da\, db$$
$$= \sigma_n^2 \int_{b_{\min}}^{b_{\max}} \int_{a_{\min}}^{a_{\max}} \sum_{j=1}^{N} z_j\, \phi_{(a,b)}(\theta_j) \sum_{i=1}^{N} z_i\, \phi_{(a,b)}(\theta_i)\; da\, db$$
$$= \sigma_n^2 \int_{b_{\min}}^{b_{\max}} \int_{a_{\min}}^{a_{\max}} \left( \sum_{i=1}^{N} z_i\, \phi_{(a,b)}(\theta_i) \right)^2 da\, db \;\geq\; 0$$
which completes the proof.

The above derivation corresponds to the kernel trick [13], [22], which is a core idea of kernel methods and lies behind many powerful learning algorithms. In essence, the kernel trick means writing a learning algorithm solely in terms of inner products of features and replacing those by a kernel.² In particular, this allows for considering an infinite number of features, while retaining finite computation.

² All computations for the GP regression, i.e., equations (5), (6), and (7), are expressed solely in terms of the kernel $k$ (and the mean $\mu$).

Figure 3 shows the prior and posterior GP for the nonparametric LQR kernel (22). Because the kernel is more flexible

than (18), it can fit the cost functions for the two different models of Example 1 and Example 2 well.

[Figure 3: three panels, "Prior GP", "Posterior GP (Ex. 1)", and "Posterior GP (Ex. 2)"; vertical axis $J(f)$.]
Fig. 3. GP fit using the nonparametric LQR kernel $k_{nLQR}$ (hyperparameters: $\sigma_n$, $a_{\min} = 0.8$, $a_{\max} = 1.0$, $b_{\min} = 0.9$, and $b_{\max} = 1.1$). Colors are the same as in Fig. 1. Both examples are fitted well.

E. A combined kernel

System (12) is an approximation to the true system (1). It is thus not desirable to fully commit to the model for solving the optimal control problem. This would mean directly minimizing (15) (which would result in the well-known LQR solution). On the other hand, the linear problem contains information about the structure of the optimization problem, which should also be useful for optimization of the true nonlinear system (1), as long as we believe (12) to be a reasonable approximation thereof. In other words, we can expect the true cost function $J$ for the nonlinear problem to bear some similarity to (15).

We model the cost function (3) for the nonlinear system (1) as being composed of a part that stems from the approximation as an LQR problem and an error term,
$$J(f) = J_{LQR}(f) + J_{\Delta}(f). \tag{23}$$
The term $J_{\Delta}(f)$ captures the error in the model that stems from the fact that the true problem is nonlinear. We model it as a standard GP, e.g., employing the SE kernel (11): $J_{\Delta}(f) \sim \mathcal{GP}(0, k_{SE}(f, f'))$. We can model $J_{LQR}(f)$ as a GP (17) using either the parametric LQR kernel (18) or the nonparametric LQR kernel (22). Then, since the sum of two independent Gaussians is also Gaussian, it follows from (23) that $J$ is also a GP (see [13, Sec. 2.7]), with
$$J(f) \sim \mathcal{GP}\left(0,\; k_{LQR}(f, f') + k_{SE}(f, f')\right)$$
where $k_{LQR}$ can be replaced by (18) or (22). By choosing the hyperparameters of the kernels, the designer can express how much they trust the LQR versus the SE model. For example, $\sigma_{SE} = 0$ means to fully rely on the LQR kernel.

V. SIMULATIONS

In this section, we show statistical comparisons of the LQR kernels proposed in Sec.
IV against the commonly used SE kernel, in two different settings. In the first setting, we evaluate the performance of each kernel in the context of GP regression. Specifically, we quantify the mismatch between the GP posterior mean, computed from a set of random evaluations, and the underlying cost function. In the second setting, we evaluate each kernel in the context of BO by comparing the learned minimum to the true global minimum. The GP regression and BO experiments are presented in Sec. V-B and Sec. V-C, respectively, considering a linear system (12). In addition, we also evaluate the BO setting for a nonlinear system in Sec. V-D.

A. Experimental choices

For the simulations in Sec. V-B and Sec. V-C, we consider the true system (1) to be linear (i.e., (12)) with uncertain parameters $a \in [0.8, 1.0]$ and $b \in [0.9, 1.1]$. We consider the optimal control problem as in Sec. II with $q = r = v = 1$. Feedback controllers (2) are considered to be in the range (14), F = [ 1.,.1]. For each controller, the corresponding infinite-horizon LQR cost $J(f)$ is given by (15). In practice, only finite-horizon simulations can be realized. Therefore, the outcome of an experiment is noisy, as modeled in (4), with σ =.5. In each simulation, a different linear model $(a, b)$ is obtained by uniformly sampling the ranges above, which yields a different underlying cost function (15). Each simulation is repeated for four different kernels:
SE kernel: As described in (11).
LQR kernel I: Parametric LQR kernel (18) with fixed parameters $(\bar{a}, \bar{b})$ (midpoints of uncertainty intervals).
LQR kernel II: Parametric LQR kernel (18), with $(\bar{a}, \bar{b})$ optimized from evaluations by maximizing (7).
LQR kernel III: Nonparametric LQR kernel (22).

The parametric LQR kernel (18) is constructed taking the midpoints of the uncertainty intervals, i.e., $\bar{a} = 0.9$ and $\bar{b} = 1$. For the nonparametric LQR kernel (22), the intervals serve as integration domains. The length scale of the SE kernel is computed as one fifth of the input domain, i.e., l =.37.
The signal variances of the parametric and nonparametric LQR kernels, i.e., $\sigma_p^2$ and $\sigma_n^2$, are normalized such that $k_{pLQR}(\bar{f}, \bar{f}) = k_{nLQR}(\bar{f}, \bar{f}) = 1$, where $\bar{f}$ is the midpoint of $\mathcal{F}$. Since the variance of these two kernels grows fast toward the corners of the domain, we set $\sigma_{SE}^2 = 5$ for the SE kernel, for a fair comparison.

B. GP regression setting

For this statistical comparison, we run 1 simulations. In each simulation, we compute the GP posterior conditioned

on two evaluations randomly sampled from the underlying cost function. We assess the quality of the regression by computing the root mean squared error (RMSE) between the true cost function and the GP posterior mean, both computed on a grid of 1 points over $\mathcal{F}$. Fig. 4 shows the histograms of the RMSE obtained with the different kernels. The LQR kernels clearly outperform the SE kernel in these experiments because they contain structural knowledge about the true cost, which contributes to a better GP fit, even with only two data points. The nonparametric kernel makes good predictions because it inherently contains information about all possible functions within the uncertainty ranges of $(a, b)$. However, poorly specified integration bounds will decrease its performance. The RMSE statistical analysis confirms this when the integration intervals on $(a, b)$ are 5% larger than their uncertainties. The parametric kernel with fixed hyperparameters $(\bar{a}, \bar{b})$ has a significant number of outliers since, in many cases, the data is queried from a cost function whose sampled $(a, b)$ are far away from the ones of the kernel. The parametric kernel with hyperparameter optimization also leads to a better fit than SE, but has the most outliers (about 75% of the simulations) since hyperparameter optimization is not reliable with just two data points.

[Figure 4: four histogram panels, "SE kernel", "LQR kernel I", "LQR kernel II", and "LQR kernel III"; horizontal axis RMSE.]
Fig. 4. Histograms of the root mean squared error (RMSE) between the true cost function and the GP posterior mean conditioned on two evaluations. The histograms are computed out of 1 simulations.

[Figure 5: four histogram panels, "SE kernel", "LQR kernel I", "LQR kernel II", and "LQR kernel III"; horizontal axis Regret.]
Fig. 5. Histogram of the regret incurred by stopping BO after three evaluations. The system considered is the linear system (12). The histogram is computed over 1 BO runs.

We have repeated these experiments with N = 1, 5, and 1 evaluations.
Table I shows the averaged RMSE over 1 simulations for each kernel, and the corresponding standard deviation (in parentheses). The outliers (i.e., any RMSE above 5) were excluded from these computations. In general, we see that the LQR kernel optimized from data performs better than the others for larger numbers of evaluations. For a fair comparison, we also include the SE kernel with hyperparameter optimization in the table (marked with an asterisk). Because it performs similarly to the SE kernel and does not improve upon the LQR kernels, we leave it out of the discussion for the rest of the paper.

TABLE I
RMSE AVERAGED OVER 1 SIMULATIONS.
N    SE ker.       SE ker. (*)    LQR ker. I     LQR ker. II    LQR ker. III
     1.7 (.7)      .39 (.1)       1.1 (.95)      1.9 (1.9)      1.31 (.71)
     .9 (.5)       . (.93)        1. (.9)        1.9 (1.7)      1. (.7)
     (1.)          1.31 (.95)     1.1 (1.5)      .5 (.7)        1. (.7)
     (.99)         .9 (.)         .9 (.9)        . (.)          1.11 (.7)

Remark: The hyperparameters $(a, b)$ of the parametric LQR kernel are optimized from data by maximizing the marginal likelihood (7). In a sense, optimizing the hyperparameters $(a, b)$ of the LQR kernel from data can be considered similar to doing system identification on the linear system $(a, b)$ using (7) as performance metric.

C. BO setting

In this section, we evaluate the performance of each kernel in the context of BO. For each BO run, the first evaluation is chosen randomly within the range of controllers $\mathcal{F}$. Subsequent evaluations are acquired using the expected improvement (EI) method (8), (9). We stop the exploration after three evaluations and compute the instantaneous regret (i.e., the absolute error between the true minimum and the minimum of the GP posterior mean) as the outcome of each experiment. Fig. 5 shows the histogram of the regret for each kernel over 1 BO runs. The LQR kernels consistently outperform the SE kernel. The nonparametric kernel shows some outliers because in some cases the GP posterior mean grows large toward negative values, and so does its minimum.
This is a wrong prediction o the underlying cost, deined positive. However, this minor issue can easily be detected, and the optimization procedure continued with a randomly sampled controller.
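The acquisition step and the regret measure used in this setting can be sketched as follows. The closed form below is the standard expected improvement for minimization under a Gaussian posterior. The paper's own EI equations are not reproduced in this section, so treating them as equivalent is an assumption. The function names, the grid-based search, and the positivity check are hypothetical helpers introduced only for illustration.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best_observed):
    """EI for minimization at a point with posterior mean mu and std sigma."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(best_observed - mu, 0.0)
    z = (best_observed - mu) / sigma
    return (best_observed - mu) * normal_cdf(z) + sigma * normal_pdf(z)


def next_query(grid, means, stds, best_observed):
    """Return the grid point with the largest expected improvement."""
    scores = [expected_improvement(m, s, best_observed)
              for m, s in zip(means, stds)]
    return grid[max(range(len(grid)), key=scores.__getitem__)]


def instantaneous_regret(true_costs, posterior_means):
    """Absolute error between the true minimum and the posterior-mean minimum."""
    return abs(min(true_costs) - min(posterior_means))


def posterior_minimum_is_plausible(posterior_means):
    """The true cost is positive, so a non-positive predicted minimum is flagged."""
    return min(posterior_means) > 0.0
```

When `posterior_minimum_is_plausible` returns `False`, a caller could fall back to a randomly sampled controller, which mirrors the remedy described in the text.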

TABLE II
REGRET AVERAGED OVER 1 BO RUNS FOR A NONLINEAR SYSTEM
N SE kernel LQR kernel I LQR kernel II LQR kernel III
1.3 (.33).3 (.1).3 (.).35 (.1) 3.9 (.39).33 (.3).31 (.19).3 (.1).35 (.7).3 (.).3 (.1).3 (.17) 5.3 (.1).31 (.3).3 (.).3 (.1)

D. BO setting for a nonlinear system

In this section, we use the same BO setting as in Sec. V-C, but now consider a nonlinear system (1), namely

$x_{t+1} = \tilde{a} \sin(x_t) + b u_t + v_t$ ()

with $v_t \sim \mathcal{N}(, 1)$ and uncertain parameters $\tilde{a} \in [.9, 1.1]$, $b \in [.9, 1.1]$. We control this system from x = 1 to zero, using the same controller structure as the one described in Sec. V-A. In this case, the considered range of controllers is F = [ 1.57, .7], which corresponds to (1), reduced by a %. The LQR kernels are built using the linearized version of () around the zero equilibrium point, i.e., $x_{t+1} = \tilde{a} x_t + b u_t + v_t$. Table II shows the average and standard deviation of the regret. As can be seen, the LQR kernels perform better than the SE kernel.

VI. CONCLUDING REMARKS

In this paper, we discussed how prior knowledge about the structure of an optimal control problem can be leveraged for data-efficient learning control. Specifically, for a nonlinear quadratic optimal control problem, we showed how an uncertain linear model approximating the true nonlinear dynamics can be exploited in a Bayesian setting. This led to the proposal of two novel kernels, a parametric and a nonparametric version of an LQR kernel, which incorporate the structure of the well-known LQR problem as prior knowledge. Numerical simulations herein demonstrate improved data efficiency over standard kernels, i.e., good controllers are learned from fewer experiments. We hope that the discussion and analysis herein also motivate further development of kernels tailored for other learning control problems. Approaching the nonlinear quadratic optimal control problem presented herein with pure model-based methods can lead to superior performance for very accurate models compared to the proposed data-based approach.
However, this paper shares the motivation of [3], which proposes a data-based approach when only poor models are available, and extends it by incorporating the LQR structure into the kernel. The results herein are preliminary in the sense that the LQR kernels are developed for a first-order system. While this is the natural first step, future work will concern the extension of the ideas and derivations to multivariate systems in order to develop this into a powerful framework for learning control in practice. Moreover, we plan to validate the benefit of the new kernels in experiments on more realistic nonlinear examples or physical hardware such as those in [3] and [5], settings that pose challenging issues such as imperfect state measurements, among others.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 199.
[2] J. Kober, J. A. Bagnell, and J. Peters, Reinforcement learning in robotics: A survey, The International Journal of Robotics Research, vol. 3, no. 11, pp , 13.
[3] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe, Automatic LQR tuning based on Gaussian process global optimization, in IEEE International Conference on Robotics and Automation, 1, pp
[4] F. Berkenkamp, A. P. Schoellig, and A. Krause, Safe controller optimization for quadrotors with Gaussian processes, in IEEE International Conference on Robotics and Automation, 1, pp
[5] A. Marco, F. Berkenkamp, P. Hennig, A. P. Schoellig, A. Krause, S. Schaal, and S. Trimpe, Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization, in IEEE International Conference on Robotics and Automation, 17, pp
[6] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint, Safe exploration for active learning with Gaussian processes, in Machine Learning and Knowledge Discovery in Databases: European Conference, 15, pp
[7] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth, Bayesian optimization for learning gaits under uncertainty, Annals of Mathematics and Artificial Intelligence, pp. 1 19, 15.
[8] R. Antonova, A. Rai, and C. G. Atkeson, Sample efficient optimization for learning controllers for bipedal locomotion, in IEEE International Conference on Humanoid Robots, 1, pp..
[9] J. R. Medina, S. Endo, and S. Hirche, Impedance-based Gaussian processes for predicting human behavior during physical interaction, in IEEE International Conference on Robotics and Automation, 1, pp
[10] H. Abdelrahman, F. Berkenkamp, J. Poland, and A. Krause, Bayesian optimization for maximum power point tracking in photovoltaic power plants, in European Control Conference, 1, pp
[11] B. D. O. Anderson and J. B. Moore, Optimal Control: Linear Quadratic Methods. Mineola, New York: Dover Publications, 7.
[12] S. Schaal and C. Atkeson, Learning control in robotics, IEEE Robotics Automation Magazine, vol. 17, no., pp. 9, Jun. 1.
[13] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press,.
[14] J. Mockus, Bayesian approach to global optimization: theory and applications, ser. Mathematics and its applications. Kluwer Academic Publishers, 199, vol. 37.
[15] D. R. Jones, M. Schonlau, and W. J. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization, vol. 13, no., pp. 55 9, 199.
[16] H. J. Kushner, A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise, Journal of Basic Engineering, vol., no. 1, pp. 97 1, 19.
[17] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, in International Conference on Machine Learning, 1, pp
[18] P. Hennig and C. J. Schuler, Entropy search for information-efficient global optimization, The Journal of Machine Learning Research, vol. 13, no. 1, pp , 1.
[19] Y. Sui, A. Gotovos, J. W. Burdick, and A. Krause, Safe exploration for optimization with Gaussian processes, in International Conference on Machine Learning, 15, pp
[20] S. Trimpe, A. Millane, S. Doessegger, and R. D'Andrea, A self-tuning LQR approach demonstrated on an inverted pendulum, in IFAC World Congress, Cape Town, South Africa, Aug. 1, pp
[21] B. D. O. Anderson and J. B. Moore, Optimal Filtering. Mineola, New York: Dover Publications, 5.
[22] B. Schölkopf and A. J. Smola, Learning with kernels. MIT Press,.
