Covariance Matrix Estimation for Reinforcement Learning

Tomer Lancewicki
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN

Itamar Arel
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN

Abstract

One of the goals in scaling reinforcement learning (RL) pertains to dealing with high-dimensional and continuous state-action spaces. In order to tackle this problem, recent efforts have focused on harnessing well-developed methodologies from statistical learning, estimation theory and empirical inference. A key related challenge is tuning the many parameters and efficiently addressing numerical problems, such that ultimately efficient RL algorithms could be scaled to real-world problem settings. Methods such as Covariance Matrix Adaptation - Evolutionary Strategy (CMAES), Policy Improvement with Path Integrals (PI²) and their variations depend heavily on the covariance matrix of the noisy data observed by the agent. It is well known that covariance matrix estimation is problematic when the number of samples is relatively small compared to the number of variables. One way to tackle this problem is through the use of shrinkage estimators that offer a compromise between the sample covariance matrix and a well-conditioned matrix (also known as the target) with the aim of minimizing the mean-squared error (MSE). Recently, it has been shown that a Multi-Target Shrinkage Estimator (MTSE) can greatly improve on the single-target variation by utilizing several targets simultaneously. Unlike the computationally complex cross-validation (CV) procedure, shrinkage estimators provide an analytical framework, which makes them an attractive alternative to CV. We consider the application of shrinkage estimators to a function approximation problem, using the quadratic discriminant analysis (QDA) technique, and show that a two-target shrinkage estimator generates improved performance. The approach paves the way for improved value function estimation in large-scale RL settings, offering higher efficiency and fewer hyper-parameters.

Keywords: covariance matrix estimation, path integral, classification uncertainty

The authors are with the Machine Intelligence Lab at the University of Tennessee - http://mil.engr.utk.edu

1 Introduction

Reinforcement learning (RL) applied to real-world problems inherently involves combining optimal control theory and dynamic programming methods with learning techniques from statistical estimation theory [1, 2, 3, 4]. The motivation is achieving efficient value function approximation for the non-stationary iterative learning process involved, particularly when the number of state variables exceeds 10 [5]. Recent efforts in scaling RL address continuous state and/or action spaces by optimizing parametrized policies. For example, Policy Improvement with Path Integrals (PI²) [5] combines a derivation from first principles of stochastic optimal control with tools from statistical estimation theory. It has been shown in [6] that PI² is a member of a wider family of methods which share probabilistic modeling concepts, such as Covariance Matrix Adaptation - Evolutionary Strategy (CMAES) [7] and the Cross-Entropy Methods (CEM) [8]. Path Integral Policy Improvement with Covariance Matrix Adaptation (PI²-CMA) [6] builds on the PI² method by determining the magnitude of the exploration noise automatically. The PI²-SEQ [9] scheme applies PI² to sequences of motion primitives. One application of PI²-SEQ is concerned with object grasping under uncertainty [9, Sec. 5] while applying the experimental paradigm of [10]. The latter approach has illustrated that, over time, humans adapt their reaching motion and grasp to the shape of the object position distribution, determined by the orientation of the main axis of its covariance matrix. Moreover, it has been shown that the PI² optimal control policy can be approximated through linear regression [11]. This connection allows the use of well-developed linear regression algorithms for learning the optimal policy.

The aforementioned methods rely on accurate covariance matrix estimation of the multivariate data involved. Unfortunately, when the number of observations $n$ is comparable to the number of state variables $p$, the covariance estimation problem becomes more challenging. In such scenarios, the sample covariance matrix is not well-conditioned and is not necessarily invertible (despite the fact that those two properties are required for most applications). When $n < p$, the inversion cannot be computed at all [5, Sec. 2.2]. The same covariance problem arises in other related applications of RL. For example, in RL with Gaussian processes, the covariance matrix is regularized [12, Sec. 2]. However, although the regularization parameter plays a pivotal role, it is not clear how it should be set [12, Sec. 3]. Other related work [13] studies the ability to mitigate potentially overconfident classifications by assessing how qualified the system is to make a judgment on the current test datum.

It is well known that for a small ratio of training observations $n$ to observation dimensionality $p$, the conventional Quadratic Discriminant Analysis (QDA) classifier performs poorly, due to highly variable class-conditional sample covariance matrices. In order to improve the classifier's performance, regularization is recommended, with the aim of providing an appropriate compromise between the bias and variance of the solution. While other regularization methods [14] define regularization coefficients by the computationally complicated cross-validation (CV) procedure, the shrinkage estimators studied in this paper provide an analytical solution, which is an attractive alternative to the CV procedure.

This paper elaborates on the Multi-Target Shrinkage Estimator (MTSE) [15], which addresses the problem of covariance matrix estimation when the number of samples is relatively small compared to the number of variables. MTSE offers a compromise between the sample covariance matrix and well-conditioned matrices (also known as targets) with the aim of minimizing the mean-squared error (MSE). Section 2 presents the MTSE and examines the squared biases of two diagonal targets.
In Section 3, we conduct a careful experimental study and examine the two-target and one-target shrinkage estimators, as well as the Ledoit-Wolf (LW) [16] method, for different covariance matrices. We demonstrate an application to the quadratic discriminant analysis (QDA) classifier, showing that the test classification accuracy rate (TCAR) is higher when using two-target, rather than one-target, shrinkage regularization. The QDA classifier is a fundamental component in DeSTIN [17], which is a deep learning system for spatiotemporal feature extraction. The DeSTIN architecture currently assumes diagonal covariance matrices, which is one of the targets examined in this paper. In our future research we intend to utilize the results shown in this paper in order to improve the DeSTIN architecture.

2 Multi-Target Shrinkage Estimation

Let $\{x_i\}_{i=1}^{n}$ be a sample of independent and identically distributed (i.i.d.) $p$-dimensional vectors drawn from a density having zero mean and covariance $\Sigma = \{\sigma_{ij}\}$. The most common estimator of $\Sigma$ is the sample covariance matrix $S = \{s_{ij}\}$, defined as

$$S = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T, \tag{1}$$

which is unbiased, i.e., $E\{S\} = \Sigma$. The MTSE model [15] is defined as

$$\hat{\Sigma}(\gamma) = \left(1 - \sum_{i=1}^{t} \gamma_i\right) S + \sum_{i=1}^{t} \gamma_i T_i, \tag{2}$$

where $t$ is the number of targets $T_i$, $i = 1, \ldots, t$, and $\gamma = [\gamma_1, \ldots, \gamma_t]^T$ is the vector of shrinkage coefficients. Our objective is therefore to find the $\hat{\Sigma}(\gamma)$ of (2) that minimizes the MSE loss function

$$L(\gamma) = E\left\{ \left\| \hat{\Sigma}(\gamma) - \Sigma \right\|_F^2 \right\}. \tag{3}$$

The optimal shrinkage coefficient vector $\gamma$ that minimizes $L(\gamma)$ in (3) can be found by using a strictly convex quadratic program [15]. In this paper, we use the two diagonal targets

$$T_1 = \frac{\operatorname{Tr}(S)}{p} I, \qquad T_2 = \operatorname{diag}(S). \tag{4}$$

Following the developments in [16, Sec. 2.2], the covariance matrix $\Sigma$ can be written as $\Sigma = V \Lambda V^T$, where $V$ and $\Lambda$ are the eigenvector and eigenvalue matrices of $\Sigma$, respectively. The eigenvalues of $\Sigma$ are denoted as $\zeta_i$, $i = 1, \ldots, p$, in increasing order, i.e., $\zeta_1 \le \zeta_2 \le \ldots \le \zeta_p$, and it is well known that $\sum_{i=1}^{p} \zeta_i = \operatorname{Tr}(\Sigma)$. As a result, the squared bias of $T_1$ with respect to $\Sigma$ can be written as

$$\left\| E\{T_1\} - \Sigma \right\|_F^2 = \left\| \frac{\operatorname{Tr}(\Sigma)}{p} I - V \Lambda V^T \right\|_F^2 = \sum_{i=1}^{p} \left( \zeta_i - \bar{\zeta} \right)^2, \qquad \bar{\zeta} = \frac{\operatorname{Tr}(\Sigma)}{p} = \frac{1}{p} \sum_{i=1}^{p} \zeta_i, \tag{5}$$

where $\bar{\zeta}$ is the mean of the eigenvalues $\zeta_i$, $i = 1, \ldots, p$. The above result shows that $\| E\{T_1\} - \Sigma \|_F^2$ is equal to the dispersion of the eigenvalues around their mean. Therefore, $T_1$ becomes less suitable for describing $\Sigma$ when the dispersion of the eigenvalues (5) increases. On the other hand, the squared bias of $T_2$ with respect to $\Sigma$ can be written as

$$\left\| E\{T_2\} - \Sigma \right\|_F^2 = \left\| \operatorname{diag}(\Sigma) - \Sigma \right\|_F^2 = \sum_{i \ne j} \sigma_{ij}^2, \tag{6}$$

which shows that it is equal to the sum of the squared off-diagonal entries of $\Sigma$. Therefore, $T_2$ becomes less suitable for describing $\Sigma$ when the variables of $\Sigma$ are more highly correlated.

3 Experiments

In this section, we present an extensive experimental study of one-target and two-target shrinkage estimators. The estimators are affected by the squared bias and the variance of a target, where the latter depends on the number of data observations $n$. Therefore, we examine cases of different true covariance matrices $\Sigma$ that result in different biases of $T_1$ and $T_2$. We then examine the estimators' performance as a function of $n$.
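As a rough illustration of the two-target estimator and the two squared-bias expressions of Section 2, the following NumPy sketch may help. It is not the authors' implementation: the function names, the dimensions `p = 10` and `n = 5`, and the weights `gamma = (0.3, 0.3)` are hypothetical choices made for the example. In particular, the optimal coefficients are obtained in [15] from a quadratic program, which is not reproduced here.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance for zero-mean data: S = (1/n) * sum_i x_i x_i^T.
    X has shape (n, p), one observation per row."""
    n = X.shape[0]
    return X.T @ X / n

def diagonal_targets(S):
    """The two diagonal targets: T1 = (Tr(S)/p) * I and T2 = diag(S)."""
    p = S.shape[0]
    T1 = np.trace(S) / p * np.eye(p)
    T2 = np.diag(np.diag(S))
    return T1, T2

def shrink(S, targets, gamma):
    """Multi-target shrinkage: (1 - sum(gamma)) * S + sum_i gamma_i * T_i."""
    gamma = np.asarray(gamma, dtype=float)
    estimate = (1.0 - gamma.sum()) * S
    for g, T in zip(gamma, targets):
        estimate = estimate + g * T
    return estimate

def squared_bias_T1(Sigma):
    """Dispersion of the eigenvalues of Sigma around their mean."""
    zeta = np.linalg.eigvalsh(Sigma)
    return float(np.sum((zeta - zeta.mean()) ** 2))

def squared_bias_T2(Sigma):
    """Sum of the squared off-diagonal entries of Sigma."""
    off_diagonal = Sigma - np.diag(np.diag(Sigma))
    return float(np.sum(off_diagonal ** 2))

# Hypothetical setting with fewer observations than variables (n < p).
rng = np.random.default_rng(0)
p, n = 10, 5
X = rng.standard_normal((n, p))
S = sample_covariance(X)
T1, T2 = diagonal_targets(S)

# Assumed, non-optimal weights chosen only to show the mechanics.
Sigma_hat = shrink(S, (T1, T2), gamma=(0.3, 0.3))

print("rank of S:", np.linalg.matrix_rank(S))
print("smallest eigenvalue of the shrunk estimate:", np.linalg.eigvalsh(Sigma_hat).min())
```

With `n < p` the sample covariance is rank-deficient, whereas any positive weight on `T1` makes the combined estimate positive definite, which is the practical reason for shrinking toward well-conditioned targets.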
In order to study the effect of the squared biases, we create a covariance matrix $\Sigma$ with a determinant of one, i.e., $|\Sigma| = 1$, according to two parameters. The first parameter is the condition number $\eta$, which is the ratio of the largest eigenvalue $\zeta_{\max}$ to the smallest eigenvalue $\zeta_{\min}$ of $\Sigma$, i.e., $\eta = \zeta_{\max} / \zeta_{\min}$. In the experiments, the eigenvalues of $\Sigma$, denoted as $\zeta_i$, $i = 1, 2, \ldots, p$, are generated according to

$$\zeta_i = \zeta_{\min} \left( \frac{(\eta - 1)(i - 1)}{p - 1} + 1 \right), \qquad i = 1, \ldots, p. \tag{7}$$

Then, the eigenvalue matrix of $\Sigma$ is defined as having the elements $\zeta_i$, $i = 1, 2, \ldots, p$, in the matrix form

$$\Lambda(\eta) = \operatorname{diag}(\zeta_1, \zeta_2, \ldots, \zeta_p). \tag{8}$$

The second parameter, $K$, controls the rotation of $\Lambda(\eta)$. Our approach is to select a set of orthonormal transformations, as in [18, Sec. 2.B],

$$E(K) = \prod_{k=1}^{K} E_k = E_1 E_2 \cdots E_K, \qquad E_k = \prod_{l=1}^{p-k} E_{kl} = E_{k1} E_{k2} \cdots E_{k(p-k)}. \tag{9}$$

The matrix $E_{kl}$ is an orthonormal rotation of $\theta = 45^\circ$ in a two-coordinate plane for the coordinates $k$ and $p + 1 - l$, i.e.,

$$E_{kl} = I + \Phi(k, p + 1 - l), \tag{10}$$

where $\Phi(i_k, j_k)$ is defined as

$$[\Phi]_{ij} = \begin{cases} \cos\theta - 1 & \text{if } i = j = i_k \text{ or } i = j = j_k \\ \sin\theta & \text{if } i = i_k \text{ and } j = j_k \\ -\sin\theta & \text{if } i = j_k \text{ and } j = i_k \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$
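The construction of the eigenvalue profile and of the rotation $E(K)$ can be sketched in NumPy as follows. This is an illustrative reading of the equations above rather than the authors' code: the function names are hypothetical, the sign convention of the plane rotation is assumed to follow the usual Givens form, and `p = 8` is an arbitrary dimension picked for the example. The final lines apply the rotation to $\Lambda(\eta)$ to obtain a covariance matrix with the chosen condition number and unit determinant.

```python
import numpy as np

def eigenvalues(p, eta):
    """Linearly spaced eigenvalues with zeta_max / zeta_min = eta,
    scaled so that their product (the determinant) equals one."""
    i = np.arange(1, p + 1)
    profile = (eta - 1.0) * (i - 1) / (p - 1) + 1.0   # runs from 1 to eta
    zeta_min = np.exp(-np.mean(np.log(profile)))      # enforces prod(zeta) = 1
    return zeta_min * profile

def plane_rotation(p, a, b, theta=np.pi / 4):
    """I + Phi(a, b): rotation by theta in the plane of coordinates a and b (1-based).
    The placement of +sin and -sin is an assumed convention."""
    E = np.eye(p)
    a, b = a - 1, b - 1
    E[a, a] = E[b, b] = np.cos(theta)
    E[a, b] = np.sin(theta)
    E[b, a] = -np.sin(theta)
    return E

def rotation(p, K):
    """E(K) = E_1 E_2 ... E_K with E_k = E_k1 E_k2 ... E_k(p-k).
    K = 0 returns the identity (no rotation)."""
    E = np.eye(p)
    for k in range(1, K + 1):
        for l in range(1, p - k + 1):
            E = E @ plane_rotation(p, k, p + 1 - l)
    return E

# Hypothetical example values; p = 8 is an assumption made for illustration.
p, eta, K = 8, 10.0, 5
Lam = np.diag(eigenvalues(p, eta))
E = rotation(p, K)
Sigma = E @ Lam @ E.T   # rotated covariance; eigenvalues and determinant are unchanged

print("condition number:", np.linalg.cond(Sigma))
print("determinant:", np.linalg.det(Sigma))
```

Because `E` is orthonormal, the rotation changes only how strongly the variables are correlated, while the eigenvalue dispersion is fixed by `eta` alone.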

The parameter $K$ is an integer value with the range $0 \le K \le p - 1$, where $K = 0$ indicates there is no rotation, and $K = p - 1$ indicates full rotation, such that all the coordinates rotate with respect to each other at an angle of $45^\circ$. Then, by using $\Lambda(\eta)$ (8) and $E(K)$ (9), the covariance matrix is created by

$$\Sigma(\eta, K) = E(K) \, \Lambda(\eta) \, E^T(K). \tag{12}$$

By employing the covariance matrix (12), the biases of $T_1$ and $T_2$ can be controlled independently for $\eta > 1$. The squared bias $\| E\{T_1\} - \Sigma \|_F^2$ is affected only by $\eta$ and increases as $\eta$ does, with $\| E\{T_1\} - \Sigma \|_F^2 = 0$ for $\eta = 1$. The squared bias $\| E\{T_2\} - \Sigma \|_F^2$ is affected only by $K$ and increases as $K$ does, with $\| E\{T_2\} - \Sigma \|_F^2 = 0$ for $K = 0$. It should be noted that if $\eta = 1$ then $K$ has no impact, while if $\eta$ is near 1 then $K$ may have only a minor impact.

The shrinkage estimators used in the study are of the one-target variety with $T_1$ and $T_2$. In the figures that appear in this section, these estimators are denoted as T1 and T2, respectively. The LW estimator [16] is of the one-target shrinkage variety with $T_1$; it uses a biased shrinkage coefficient estimator and is denoted as LW. Finally, the two-target shrinkage estimator appears in the figures as TT.

We show that the two-target estimator can improve classification results compared with one-target estimators when using the quadratic discriminant analysis (QDA) method. The purpose of QDA is to assign observations to one of several groups $g = 1, \ldots, G$ with $p$-variate normal distributions

$$f_g(x) = \frac{1}{\sqrt{(2\pi)^p \, |\Sigma_g|}} \exp\left( -0.5 \, (x - m_g)^T \Sigma_g^{-1} (x - m_g) \right), \tag{13}$$

where $m_g$ and $\Sigma_g$ are the population mean vector and covariance matrix of the group $g$. An observation $x$ is assigned to a class $\hat{g}$ according to

$$d_{\hat{g}}(x) = \min_{1 \le g \le G} d_g(x), \tag{14}$$

with

$$d_g(x) = (x - m_g)^T \Sigma_g^{-1} (x - m_g) + \ln |\Sigma_g| - 2 \ln \pi_g, \tag{15}$$

where $\pi_g$ is the unconditional prior probability of observing a member from the group $g$. In our experiments, we classify two groups ($G = 2$), with observations generated from a normal distribution with zero mean and $\pi_1 = \pi_2$.
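The QDA decision rule above amounts to computing one quadratic score per group and choosing the smallest. A minimal NumPy sketch follows, with hypothetical function names and a two-dimensional toy example that is not taken from the paper. In practice the covariance arguments would be the estimated (for instance, shrunk) matrices rather than the true ones. Groups are indexed from zero in the code.

```python
import numpy as np

def qda_score(x, mean, cov, prior):
    """d_g(x) = (x - m_g)^T cov^{-1} (x - m_g) + ln|cov| - 2 ln(prior)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)              # log-determinant, numerically stable
    mahalanobis = diff @ np.linalg.solve(cov, diff)  # avoids forming the explicit inverse
    return mahalanobis + logdet - 2.0 * np.log(prior)

def qda_classify(x, means, covs, priors):
    """Index of the group with the smallest discriminant score."""
    scores = [qda_score(x, m, c, pr) for m, c, pr in zip(means, covs, priors)]
    return int(np.argmin(scores))

# Toy two-group problem: zero means, equal priors, identity covariance for
# group 0 and an elongated unit-determinant covariance for group 1.
means = [np.zeros(2), np.zeros(2)]
covs = [np.eye(2), np.diag([10.0, 0.1])]
priors = [0.5, 0.5]

print(qda_classify(np.array([3.0, 0.0]), means, covs, priors))  # 1: far along the high-variance axis
print(qda_classify(np.array([0.0, 1.0]), means, covs, priors))  # 0: far along the low-variance axis
```

Since both groups share the same mean and prior here, the decision rests entirely on the covariance matrices, which is why the quality of the covariance estimate drives the classification accuracy in this setting.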
The covariance matrix of the first group is the identity matrix, $\Sigma_1 = I$, while that of the second group is the covariance matrix $\Sigma_2 = \Sigma(\eta, K)$ of (12). The goal is to study the effectiveness of the shrinkage estimators when using QDA, by assigning observations to one of these two groups based on the classification rule (14). We run our experiments for $n = 2, 3, \ldots, 30$. For each $n$, twenty sets of data of size $n$ are produced.

Figure 1: QDA for (a) $\Sigma(\eta, 0) = \Lambda(\eta)$ with $\eta = 10$ and (b) an unrestricted $\Sigma(10, K)$ with $K = 5$.

We summarize for each experiment the average test classification accuracy rate (TCAR) with standard deviations (the bars in the figure) over the twenty replications for each $n$. For each group, $10^5$ test observations were generated in order to examine the efficiency of the classifier. We provide the best TCAR, calculated by using (14) when the covariance matrices are known, denoted in the figures as Bayes. We also compare the results with a regularization [19, Sec. 6] in which the zero eigenvalues are replaced with a small number just large enough to permit numerically stable inversion. This has the effect of producing a classification rule based on Euclidean distance in the zero-variance subspace. We denote this procedure as the zero-variance regularization (ZVR).

In all experiments, the TCAR of the two-target estimator is higher than that of the one-target variety. The LW estimator is inferior to its unbiased version when dealing with a small number of observations, and converges to its unbiased version as the number of observations increases. Fig. 1(a) presents the result when the covariance matrix is a diagonal matrix, i.e., $\Sigma(\eta, 0) = \Lambda(\eta)$, with $\eta = 10$, and therefore $T_2$ is unbiased while $T_1$ is biased. The target $T_1$ provides a higher TCAR than $T_2$ for small numbers of observations, after which $T_2$ provides the better TCAR. In Fig. 1(b), the covariance matrix is unrestricted, i.e., $\Sigma(10, K)$, with $K = 5$. The targets $T_1$ and $T_2$ are both biased. The squared bias of $T_1$ is not affected by $K$, whereas the higher the value of $K$, the higher the squared bias of $T_2$, and therefore $T_2$ loses its advantage over $T_1$.

In conclusion, it has been shown that the Multi-Target Shrinkage Estimator (MTSE) [15] can greatly improve on the single-target variation in the sense of mean-squared error (MSE) by utilizing several targets simultaneously. We consider the application of shrinkage estimators in the context of a function approximation problem, using the quadratic discriminant analysis (QDA) technique, and show that a two-target shrinkage estimator generates improved performance. This is done through a careful experimental study which examines the squared biases of the two diagonal targets. Unlike the computationally complex cross-validation (CV) procedure commonly used in QDA, the shrinkage estimators provide an analytical solution, which is an attractive alternative. The approach paves the way for improved value function estimation in large-scale RL settings, offering higher efficiency and fewer hyper-parameters.

References

[1] P. Dayan and G. E. Hinton, "Using expectation-maximization for reinforcement learning," Neural Computation, vol. 9, no. 2.
[2] M. Ghavamzadeh and Y. Engel, "Bayesian actor-critic algorithms," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
[3] M. Toussaint and A. Storkey, "Probabilistic inference for solving discrete and continuous state Markov decision processes," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
[4] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis, "Learning model-free robot control by a Monte Carlo EM algorithm," Autonomous Robots, vol. 27, no. 2, 2009.
[5] E. Theodorou, J. Buchli, and S. Schaal, "A generalized path integral control approach to reinforcement learning," J. Mach. Learn. Res., vol. 11, Dec.
[6] F. Stulp and O. Sigaud, "Path integral policy improvement with covariance matrix adaptation," in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[7] N. Hansen and A. Ostermeier, "Completely derandomized self-adaptation in evolution strategies," Evolutionary Computation, vol. 9, no. 2, June 2001.
[8] S. Mannor, R. Y. Rubinstein, and Y. Gat, "The cross entropy method for fast policy search," in ICML, 2003.
[9] F. Stulp, E. Theodorou, and S. Schaal, "Reinforcement learning with sequences of motion primitives for robust manipulation," IEEE Transactions on Robotics, vol. 28, no. 6, Dec. 2012.
[10] V. N. Christopoulos and P. R. Schrater, "Grasping objects with environmentally induced position uncertainty," PLoS Computational Biology, vol. 5, no. 10, 2009.
[11] F. Farshidian and J. Buchli, "Path integral stochastic optimal control for reinforcement learning," in The 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM2013), 2013.
[12] G. Chowdhary, M. Liu, R. Grande, T. Walsh, J. How, and L. Carin, "Off-policy reinforcement learning with Gaussian processes," IEEE/CAA Journal of Automatica Sinica, vol. 1, no. 3, pp. 227-238, 2014.
[13] H. Grimmett, R. Paul, R. Triebel, and I. Posner, "Knowing when we don't know: Introspective classification for mission-critical decision making," in 2013 IEEE International Conference on Robotics and Automation (ICRA), May 2013.
[14] P. J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," The Annals of Statistics, vol. 36, no. 1, 2008.
[15] T. Lancewicki and M. Aladjem, "Multi-target shrinkage estimation for covariance matrices," IEEE Transactions on Signal Processing, vol. 62, no. 24, Dec. 2014.
[16] O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," Journal of Multivariate Analysis, vol. 88, no. 2, 2004.
[17] S. Young, J. Lu, J. Holleman, and I. Arel, "On the impact of approximate computation in an analog DeSTIN architecture," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, May 2014.
[18] G. Cao, L. Bachega, and C. Bouman, "The sparse matrix transform for covariance estimation and analysis of high dimensional signals," IEEE Transactions on Image Processing, vol. 20, no. 3, 2011.
[19] J. H. Friedman, "Regularized discriminant analysis," Journal of the American Statistical Association, vol. 84, no. 405.
