An Analysis of Laplacian Methods for Value Function Approximation in MDPs


Marek Petrik
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
petrik@cs.umass.edu

Abstract

Recently, a method based on Laplacian eigenfunctions was proposed to automatically construct a basis for value function approximation in MDPs. We show that its success may be explained by drawing a connection between the spectrum of the Laplacian and the value function of the MDP. This explanation helps us to identify more precisely the conditions that this method requires to achieve good performance. Based on this, we propose a modification of the Laplacian method for which we derive an analytical bound on the approximation error. Further, we show that the method is related to the augmented Krylov methods, commonly used to solve sparse linear systems. Finally, we empirically demonstrate that in basis construction the augmented Krylov methods may significantly outperform the Laplacian methods in terms of both speed and quality.

1 Introduction

Markov Decision Processes (MDPs) [Puterman, 2005] are a widely used framework for planning under uncertainty. In this paper, we focus on the discounted infinite-horizon problem with discount γ such that 0 < γ < 1. We also assume finite state and action spaces. While solving this problem requires only polynomial time, many practical problems are too large to be solved precisely. This motivated the development of approximate methods for solving very large MDPs that are sparse and structured.

Value Function Approximation (VFA) is a method for finding approximate solutions of MDPs, and it has received a lot of attention [Bertsekas and Tsitsiklis, 1996]. Linear approximation is a popular VFA method, because it is simple to analyze and use. The representation of the value function in linear schemes is a linear combination of basis vectors. The optimal policy is usually calculated using approximate value iteration or approximate linear programming [Bertsekas and Tsitsiklis, 1996]. The choice of the basis plays an important role in solving the problem. Usually, the basis used to represent the space is hand-crafted using human insight about some topological properties of the problem [Sutton and Barto, 1998]. Recently, a new framework for automatically constructing the basis was proposed in [Mahadevan, 2005]. This approach is based on an analysis of the neighborhood relation among the states. The analysis is inspired by spectral methods, often used in machine-learning tasks. We describe this framework in more detail in Section 2.

In the following, we use v to denote the value function, and r to denote the reward vector. The matrix I denotes an identity matrix of an appropriate size. We use n to denote the number of states of the MDP.

An important property of many VFA methods is that they are guaranteed to converge to an approximately optimal solution. In methods based on approximate policy iteration, the distance of the approximate value function v̂_k from the optimal value function v* is asymptotically bounded by [Munos, 2003]:

\limsup_{k \to \infty} \|v^* - \hat{v}_k\|_\mu \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \|v_k - \tilde{v}_k\|_{\mu_k},

where v_k and ṽ_k are the true and approximated value functions at step k given the current policy. The norm ‖·‖_µ denotes a weighted quadratic norm with distribution µ. The distribution µ is arbitrary, and µ_k depends on µ and the current transition matrix [Munos, 2003].
Similar bounds hold for algorithms based on Bellman residual minimization, when ‖v_k − ṽ_k‖ is replaced by ‖r − (I − γP_{π_k})ṽ_k‖, where P_{π_k} is the transition matrix of the current policy. In addition, these bounds hold also for the max norm [Bertsekas and Tsitsiklis, 1996]. In the following, we refer to the value function v_k and its approximation ṽ_k as v and ṽ respectively, without using the index k.

The main focus of the paper is the construction of a good basis for algorithms that minimize ‖v − ṽ‖_2 in each iteration. We focus only on the quadratic approximation bound, because the other bounds are related; notice that ‖x‖_µ ≤ ‖x‖_2 for any distribution µ, so controlling the quadratic norm also controls the weighted norm in the bound above.

The paper is organized as follows. Section 2 describes the spectral methods for VFA in greater detail. The main contribution of the paper is an explanation of the good performance of spectral methods for VFA and its connection to methods for solving sparse linear systems. This is described in Section 3, where we also propose two new alternative algorithms. In Section 4, we show theoretical error bounds for one of the methods, and then in Section 5 we demonstrate that our method may significantly outperform the previously proposed spectral methods.
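Before turning to the spectral framework, the following minimal sketch (not from the paper; the MDP, the basis, and the state distribution are made up for illustration) makes the quantities in the bounds above concrete: a linear value-function approximation fit by least squares, and the weighted quadratic norm ‖v − ṽ‖_µ.

```python
import numpy as np

def weighted_norm(x, mu):
    """Weighted quadratic norm ||x||_mu = sqrt(sum_i mu_i * x_i^2)."""
    return np.sqrt(np.sum(mu * x ** 2))

# A small random MDP under a fixed policy (illustrative data only).
rng = np.random.default_rng(0)
n, gamma = 50, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)               # row-stochastic transition matrix
r = rng.standard_normal(n)                      # reward vector
v = np.linalg.solve(np.eye(n) - gamma * P, r)   # exact value function

# Linear VFA: approximate v in the span of a few basis vectors (columns of Phi).
Phi = rng.standard_normal((n, 5))
w, *_ = np.linalg.lstsq(Phi, v, rcond=None)
v_tilde = Phi @ w

mu = np.full(n, 1.0 / n)                        # an arbitrary state distribution
print("||v - v_tilde||_mu =", weighted_norm(v - v_tilde, mu))
print("||v - v_tilde||_2  =", np.linalg.norm(v - v_tilde))
```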

2 Proto-Value Functions

In this section, we describe in greater detail the proto-value function methods proposed in [Mahadevan, 2005]. These methods use the spectral graph framework to construct a basis for linear VFA that respects the topology of the problem. The states of the MDP for a fixed policy represent nodes of an undirected weighted graph (N, E). This graph may be represented by a symmetric adjacency matrix W, where W_{xy} represents the weight between the nodes. A diagonal matrix of the node degrees is denoted by D and defined as D_{xx} = \sum_{y \in N} W_{xy}.

There are several definitions of the graph Laplacian, with the most common being the normalized Laplacian, defined for W as

L = I - D^{-1/2} W D^{-1/2}.

The advantage of the normalized Laplacian is that it is symmetric, and thus its eigenvectors are orthogonal. Its use is closely related to the random walk Laplacian L_r = I − D^{-1}W and the combinatorial Laplacian L_c = D − W, which are commonly used to motivate the use of the Laplacian by making a connection to random walks on graphs [Chung, 1997].

A function f on a graph (N, E) is a mapping from each vertex to R. One well-defined measure of smoothness of a function on a graph is its Sobolev norm. The bottom eigenvectors of L_c may be seen as a good approximation to smooth functions, that is, functions with low Sobolev norm [Chung, 1997].

The application of the spectral framework to VFA is straightforward. The value function may be seen as a function on the graph of nodes that correspond to states. The edge weight between two nodes is determined by the probability of transiting either way between the corresponding states, given a fixed policy. Usually, the weight is 1 if the transition is possible either way and 0 otherwise. Notice that these weights must be symmetric, unlike the transition probabilities. If we assume the value function is smooth on the induced graph, the spectral graph framework should lead to good results. While the adjacency matrix with edge weights of 1 was usually used before, [Mahadevan, 2005] also discusses other schemes.

While the approach is interesting, it suffers from some problems. In our opinion, the good performance of this method has not been sufficiently well explained, because the construction of the adjacency matrix is not well motivated. In addition, the construction of the adjacency matrix makes the method very hard to analyze. As we show later, using the actual transition matrix instead of the adjacency matrix leads to a better motivated algorithm, which is easy to derive and analyze.

The requirement of the value function being smooth was partially resolved by using diffusion wavelets in [Mahadevan and Maggioni, 2005; Maggioni and Mahadevan, 2006]. Briefly, diffusion wavelets construct a low-order approximation of the inverse of a matrix. A disadvantage of using the wavelets is the high computational overhead needed to construct the inverse approximation. The advantage is, however, that once the inverse is constructed it may be reused for different rewards as long as the transition matrix is fixed. Thus, we do not compare our approach to diffusion wavelets due to the different goals and computational complexities of the two methods.

3 Analysis

In this section we show that if we use the actual transition matrix instead of the random walk Laplacian of the adjacency matrix, the method may be well justified and analyzed. We assume in the following that the transition matrix P and the reward function r are available.
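The following is a minimal sketch (not from the paper; the example graph is made up) of the three Laplacian variants defined above, built with NumPy from a symmetric adjacency matrix W.

```python
import numpy as np

def graph_laplacians(W):
    """Return the combinatorial, random-walk, and normalized Laplacians of
    a symmetric adjacency matrix W (assumed to have no isolated nodes)."""
    d = W.sum(axis=1)                       # node degrees
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    I = np.eye(W.shape[0])
    L_c = np.diag(d) - W                    # combinatorial: L_c = D - W
    L_r = I - D_inv @ W                     # random walk:   L_r = I - D^{-1} W
    L = I - D_inv_sqrt @ W @ D_inv_sqrt     # normalized:    L = I - D^{-1/2} W D^{-1/2}
    return L_c, L_r, L

# Example: a 4-cycle with unit edge weights.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L_c, L_r, L = graph_laplacians(W)
# The normalized Laplacian is symmetric, so its eigenvectors are orthogonal.
print(np.round(np.linalg.eigvalsh(L), 6))
```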
It is reasonable to explain the performance using P instead of L_r, because in past applications P was usually very similar to I − L_r. This is because the transitions in these problems were symmetric, and the adjacency matrix was based on a random policy. Notice that the eigenvectors of L_r and I − L_r are the same. Moreover, if λ is an eigenvalue of L_r, then 1 − λ is an eigenvalue of I − L_r.

3.1 Spectral Approximation

Assuming that P is the transition matrix for a fixed Markov policy, we can express the value function as [Puterman, 2005]:

v = (I - \gamma P)^{-1} r = \sum_{i=0}^{\infty} (\gamma P)^i r.    (3.1)

The second equality follows from the Neumann series expansion. In fact, synchronous backups in modified policy iteration calculate this series, adding one term in each iteration.

We assume that the transition matrix is diagonalizable. The analysis for non-diagonalizable matrices would be similar, but would have to use the Jordan decomposition. Let x_1, ..., x_n be the eigenvectors of P with corresponding eigenvalues λ_1, ..., λ_n. Moreover, without loss of generality, ‖x_j‖_2 = 1 for all j. Since the matrix is diagonalizable, x_1, ..., x_n are linearly independent, and we can decompose the reward as

r = \sum_{j=1}^{n} c_j x_j,

for some c_1, ..., c_n. Using P^i x_j = λ_j^i x_j, we have

v = \sum_{j=1}^{n} \sum_{i=0}^{\infty} (\gamma \lambda_j)^i c_j x_j = \sum_{j=1}^{n} \frac{1}{1 - \gamma \lambda_j} c_j x_j = \sum_{j=1}^{n} d_j x_j.

Considering a subset U of the eigenvectors x_j as a basis leads to the following bound on the approximation error:

\|v - \tilde{v}\|_2 \le \sum_{j \notin U} |d_j|.    (3.2)

Therefore, the value function may be well approximated by considering those x_j with the greatest |d_j|. Assuming that all c_j are equal, the best choice to minimize the bound (3.2) is to consider the eigenvectors with high λ_j. This is identical to taking the low-order eigenvectors of the random walk Laplacian, as proposed in the spectral proto-VFA framework, except that the transition matrix is used instead. Using the analysis above, we propose a new algorithm in Subsection 3.3.
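A minimal sketch of the computation behind the bound (3.2) follows (illustrative only; the MDP is randomly generated and P is assumed diagonalizable): decompose r in the eigenbasis of P, form the weights d_j = c_j / (1 − γλ_j), and keep the eigenvectors with the largest |d_j|.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, k = 30, 0.9, 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)               # row-stochastic transition matrix
r = rng.standard_normal(n)
v = np.linalg.solve(np.eye(n) - gamma * P, r)   # exact value function

lam, X = np.linalg.eig(P)                       # P assumed diagonalizable here;
c = np.linalg.solve(X, r)                       # complex pairs would need extra care
d = c / (1.0 - gamma * lam)                     # v = sum_j d_j x_j

# Weighted spectral basis: eigenvectors with largest |d_j| (not largest |lambda_j|).
top = np.argsort(-np.abs(d))[:k]
basis = X[:, top]
w, *_ = np.linalg.lstsq(basis, v, rcond=None)
err = np.linalg.norm(v - basis @ w)
print("||v - v_tilde||_2 =", err, " bound (3.2):", np.sum(np.abs(np.delete(d, top))))
```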

3.2 Krylov Methods

There are also other well-motivated basis choices besides the eigenvectors. Another choice is to use some of the vectors in the Neumann series (3.1). We denote these vectors as y_i = P^i r for all i ∈ [0, ∞). Calculating the value function by progressively adding all vectors in the series, as done in modified policy iteration, potentially requires an infinite number of iterations. However, since the linear VFA methods consider any linear combination of the basis vectors, we need at most n linearly independent vectors from the sequence {y_i}_{i=0}^∞. It is then preferable to choose those y_i with small i, since these are simple to calculate. Interestingly, it can be shown that we need just y_0, ..., y_{m-1} to represent any value function, where m is the degree of the minimal polynomial [Ipsen and Meyer, 1998] of (I − γP). Moreover, even taking fewer vectors than that may be a good choice in many cases.

To show that the vectors y_0, ..., y_{m-1} are sufficient to precisely represent any value function, let the minimal polynomial be p(A) = \sum_{i=0}^{m} \alpha_i A^i, and let

B = -\frac{1}{\alpha_0} \sum_{i=0}^{m-1} \alpha_{i+1} A^i.

By algebraic multiplication, we have BA = I. Having this, the value function may be represented as

v = Br = -\frac{1}{\alpha_0} \sum_{i=0}^{m-1} \alpha_{i+1} (I - \gamma P)^i r = \sum_{i=0}^{m-1} \beta_i y_i,

for some β_i. A more rigorous derivation may be found, for example, in [Ipsen and Meyer, 1998; Golub and Loan, 1996]. The space spanned by y_0, ..., y_{m-1} is known as the Krylov space (or Krylov subspace), denoted K(P, r). It has been previously used in a variety of numerical methods, such as GMRES, Lanczos, and Arnoldi [Golub and Loan, 1996]. It is also common to combine the use of eigenvectors and the Krylov space, in what are known as augmented Krylov methods [Saad, 1997]. These methods actually subsume the methods based on simply considering the largest eigenvectors of P. We discuss this method in Subsection 3.3.

3.3 Algorithms

In this section we propose two algorithms, based on the previous analysis, for constructing a good basis for VFA. These algorithms deal only with the selection of the basis, and they can be arbitrarily incorporated into any algorithm based on approximate policy iteration.

The first algorithm, which we refer to as the weighted spectral method, is to form the basis from the eigenvectors of the transition matrix. This is in contrast with the method presented in [Mahadevan, 2005], which uses one of the Laplacians. Moreover, we propose to choose the vectors with the greatest |d_j| value, in contrast to choosing the ones with the largest λ_j. The reason is that this leads to minimization of the bound in (3.2). A practical implementation of this algorithm faces some major obstacles. One issue is that there is no standard eigenvector solver that can efficiently calculate the top eigenvectors of a sparse matrix with regard to d_j. However, such a method may be developed in the future. In addition, when the transition matrix is not diagonalizable, we need to calculate the Jordan decomposition, which is very time-consuming and unstable. Finally, some of the eigenvectors and eigenvalues may contain complex numbers, which would require a revision of the approximate policy iteration algorithms. If these issues were resolved, this may be a viable alternative to other methods.

Require: P, r, k (number of eigenvectors in the basis), l (total number of vectors)
  Let z_1, ..., z_k be the top real eigenvectors of P
  z_{k+1} ← r
  for i ← 1, ..., l + k do
    if i > k + 1 then
      z_i ← P z_{i-1}
    end if
    for j ← 1, ..., (i - 1) do
      z_i ← z_i − ⟨z_i, z_j⟩ z_j
    end for
    if z_i = 0 then
      break
    end if
    z_i ← z_i / ‖z_i‖_2
  end for

Figure 1: Augmented Krylov method for basis construction.
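A minimal NumPy sketch of the construction in Figure 1 (an illustration, not the paper's implementation; it assumes a dense P, uses numpy.linalg.eig to obtain the top real eigenvectors, and stops on numerical breakdown):

```python
import numpy as np

def augmented_krylov_basis(P, r, k, l):
    """Sketch of Figure 1: orthonormal basis combining the k top real
    eigenvectors of P with Krylov vectors of (P, r), via modified Gram-Schmidt."""
    eigvals, eigvecs = np.linalg.eig(P)
    real = np.isreal(eigvals)                      # keep only real eigenvectors
    order = np.argsort(-eigvals[real].real)
    Z = [eigvecs[:, real][:, j].real for j in order[:k]]
    Z.append(r.astype(float))                      # z_{k+1} <- r
    basis = []
    for i in range(k + l):
        z = (P @ basis[-1]) if i > k else Z[i]     # extend the Krylov part
        z = z.copy()
        for b in basis:                            # modified Gram-Schmidt
            z -= (b @ z) * b
        norm = np.linalg.norm(z)
        if norm < 1e-12:                           # breakdown: no new direction
            break
        basis.append(z / norm)
    return np.column_stack(basis)
```

Under these assumptions, augmented_krylov_basis(P, r, k=3, l=17) returns an orthonormal n x 20 matrix, with fewer columns if the iteration breaks down early.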
The second algorithm we propose is to use the vectors from the augmented Krylov method, that is, to combine the vectors in the Krylov space with a few top eigenvectors. A pseudocode of the algorithm is in Figure 1 above. The algorithm calculates an orthonormal basis of the augmented Krylov space using a modified Gram-Schmidt method. Using the Krylov space eliminates the problems with non-diagonalizable transition matrices and complex eigenvectors. Another reason to combine these two methods is to take advantage of the approximation properties of both. Intuitively, the Krylov vectors capture the short-term behavior, while the eigenvectors capture the long-term behavior. Though there is no reliable decision rule to determine the right number of augmenting eigenvectors, it is preferable to keep their number relatively low, because they are usually more expensive to compute than vectors in the Krylov space.

4 Approximation Bounds

In this section, we briefly present a theoretical error bound guaranteed by the augmented Krylov methods. The bound characterizes the worst-case approximation error of the value function, given that the basis is constructed using the current transition matrix and reward vector. We focus on bounding the quadratic norm, which also implies a bound in the max norm. We show the bound for ‖r − ṽ + γPṽ‖_2. This bound applies directly to algorithms that minimize the Bellman residual [Bertsekas and Tsitsiklis, 1996], and it also implies:

\|v - \tilde{v}\|_2 = \|(I - \gamma P)^{-1} r - (I - \gamma P)^{-1}(I - \gamma P)\tilde{v}\|_2 \le \frac{1}{1 - \gamma} \|r - (I - \gamma P)\tilde{v}\|_2.

This follows from the Neumann series expansion of the inverse and from ‖P‖ = 1.
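A small numerical sketch of the relation above (illustrative only; a random MDP and an arbitrary basis are used): minimize the Bellman residual ‖r − (I − γP)Φw‖_2 over a basis Φ and compare the resulting value error with the residual scaled by 1/(1 − γ).

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma, k = 40, 0.9, 8
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(n)
v = np.linalg.solve(np.eye(n) - gamma * P, r)

Phi = rng.standard_normal((n, k))               # an arbitrary basis
A = np.eye(n) - gamma * P

# Bellman residual minimization: min_w || r - A Phi w ||_2
w, *_ = np.linalg.lstsq(A @ Phi, r, rcond=None)
v_tilde = Phi @ w
residual = np.linalg.norm(r - A @ v_tilde)
error = np.linalg.norm(v - v_tilde)
print("Bellman residual   :", residual)
print("value error        :", error)
print("residual/(1-gamma) :", residual / (1 - gamma))
```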

In the following, we denote the set of m Krylov vectors as K_m and the chosen set of top eigenvectors of P as U. We also use E(c, d, a) to denote an ellipse in the set of complex numbers with center c, focal distance d, and major semi-axis a. The approximation error for a basis constructed for the current policy may be bounded as the following theorem states.

Theorem 4.1. Let (I − γP) = XSX^{-1} be a diagonalizable matrix. Further, let

\phi = \max_{x \perp U,\ \|x\|_2 = 1} \|X T X^{-1} x\|_2,

where T is a diagonal matrix with 1 in place of the eigenvalues of the eigenvectors in U. The approximation error using the basis K_m ∪ U is bounded by

\|r - (I - \gamma P)\tilde{v}\|_2 \le \max\left\{ \kappa(X)\, \frac{C_m(a_1/d_1)}{C_m(c_1/d_1)},\ \phi\, \frac{C_m(a/d)}{C_m(c/d)} \right\},

where C_m is the Chebyshev polynomial of the first kind of degree m, and κ(X) = ‖X‖_2 ‖X^{-1}‖_2. The parameters a, c, d and a_1, c_1, d_1 are chosen such that E(c_1, d_1, a_1) includes the lower n − |U| eigenvalues of (I − γP), and E(c, d, a) includes all its eigenvalues.

The value of φ depends on the angle between the invariant subspace U and the other eigenvectors of P. If they are perpendicular, such as when P is symmetric, then φ = 0.

Proof. To simplify the notation, we denote A = (I − γP). First, we show the standard bound on approximation by a Krylov space [Saad, 2003], ignoring the additional eigenvectors. In this case, the objective is to find w ∈ K_m that minimizes

\|r - Aw\|_2 = \left\| \sum_{i=0}^{m} w(i) A^i r \right\|_2,

where w(0) = 1. Notice that this defines a polynomial in A multiplied by r, with the constant factors determined by w. Let P_m denote the set of polynomials of degree at most m such that every p ∈ P_m satisfies p(0) = 1. The minimization problem may then be expressed as finding a polynomial p ∈ P_m that minimizes ‖p(A)r‖_2. This is related to the value of the polynomial on complex numbers as [Golub and Loan, 1996]:

\|p(A)\|_2 \le \kappa(X) \max_{i=1,\ldots,n} |p(\lambda_i)|,

where λ_i are the eigenvalues of A. A more practical bound may be obtained using Chebyshev polynomials, as for example in [Saad, 2003]:

\min_{p \in P_m} \max_{i=1,\ldots,n} |p(\lambda_i)| \le \min_{p \in P_m} \max_{\lambda \in E(c,d,a)} |p(\lambda)| \le \frac{C_m(a/d)}{C_m(c/d)},

where the ellipse E(c, d, a) covers all eigenvalues of A. In the approximation with eigenvectors, the minimization may be expressed as follows:

\min_{\tilde{v}} \|r - A\tilde{v}\|_2 = \min_{w \in K_m} \min_{q \in U} \|r - Aw - AUq\|_2
  = \min_{p \in P_m} \min_{q \in U} \|p(A) r - AUq\|_2
  = \min_{p \in P_m} \min_{q \in U} \|p(A)(I - P_U) r + p(A) P_U r - AUq\|_2
  = \min_{p \in P_m} \|p(A)(I - P_U) r\|_2,

where P_U is the least-squares projection matrix onto U. Note that p(A)P_U r − AUq = 0 for some q, since U is an invariant subspace of A. Moreover, (I − P_U) r is perpendicular to U. The theorem then follows from the approximation by Chebyshev polynomials as described above, and the definition of matrix-valued function approximation [Golub and Loan, 1996].

This bound shows that it is important to choose an invariant subspace that corresponds to the top eigenvalues of P. It also shows that U should be chosen to be as close to perpendicular to the remaining eigenvectors as possible. Finally, the bound also implies that the approximation precision increases with lower γ, as this decreases the size of the ellipse needed to cover the eigenvalues.
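To give a rough feel for the Chebyshev factor in Theorem 4.1, the following sketch (an illustration, not part of the paper) evaluates it in the simplest special case, assuming the eigenvalues of (I − γP) are real and lie in the interval [1 − γ, 1 + γ]; the ellipse then degenerates to that interval, with c = 1 and d = a = γ, so the factor reduces to 1 / C_m(1/γ).

```python
import numpy as np

def cheb(m, x):
    """Chebyshev polynomial of the first kind C_m(x), valid for x >= 1."""
    return np.cosh(m * np.arccosh(x))

# With C_m(a/d) = C_m(1) = 1, the bound factor is 1 / C_m(c/d) = 1 / C_m(1/gamma).
# It decays with the number of Krylov vectors m, and faster for smaller gamma.
for gamma in (0.9, 0.99):
    factors = [1.0 / cheb(m, 1.0 / gamma) for m in range(1, 31, 5)]
    print(f"gamma = {gamma}:", np.round(factors, 6))
```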
5 Experiments

In this section we demonstrate the proposed methods on the two-room problem similar to the one used in [Mahadevan, 2005; Mahadevan and Maggioni, 2005]. This problem is a typical representative of some stochastic planning problems encountered in AI. The MDP we use is a two-room grid with a single-cell doorway in the middle of the wall. The actions are to move in any of the four main directions. We use the policy with an equal probability of taking any action. The size of each room is by cells. The problem size is intentionally small to make explicit calculation of the value function possible. Notice that because of the structure of the problem and the policy, the random walk Laplacian has identical eigenvectors as the transition matrix, in the same order. This allows us to evaluate the impact of choosing the eigenvectors in the order proposed by the weighted spectral method.

We used the problem with various reward vectors and discount factors, but with the same transition structure. The reward vectors are synthetically constructed to show possible advantages and disadvantages of the methods. They are shown projected onto the grid in Figure 2. Reward 1 represents an arbitrary smooth reward function. Reward 2 and Reward 3 are made perpendicular to the top 4 and the top 19 eigenvectors of the random walk Laplacian, respectively.

A lower discount rate is generally more favorable to Krylov methods, because the value is closer to the value of the reward received in the state. This can also be seen from Theorem 4.1, because γ shrinks the eigenvalues. Therefore, to use problems that are generally less favorable to Krylov methods, we evaluate the approach using two high discount rates, γ = 0.9 and γ = 0.99.

In our experiments, we evaluate the Mean Squared Error (MSE) of the value approximation with regard to the number of vectors in the basis. The basis and the value function were calculated based on the same policy with equiprobable action choice. We obtained the true value function by explicitly evaluating (I − γP)^{-1} r. Notice, however, that the true value does not need to be used in the approximation; it is only required to determine the approximation error.
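A minimal sketch of this kind of experiment follows (illustrative only; the room size, doorway position, and reward vector here are placeholders rather than the paper's exact setup): construct the two-room transition matrix under the equiprobable policy and measure the MSE of a plain Krylov basis.

```python
import numpy as np

def two_room_transition(side):
    """Equiprobable random-walk transition matrix on a two-room grid.
    The rooms are side x side, joined by a single-cell doorway in the wall."""
    width, height = 2 * side + 1, side          # wall column in the middle
    door_row = side // 2
    def free(x, y):
        if not (0 <= x < width and 0 <= y < height):
            return False
        return x != side or y == door_row       # wall cells blocked except doorway
    cells = [(x, y) for x in range(width) for y in range(height) if free(x, y)]
    index = {c: i for i, c in enumerate(cells)}
    P = np.zeros((len(cells), len(cells)))
    for (x, y), i in index.items():
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy) if free(x + dx, y + dy) else (x, y)
            P[i, index[nxt]] += 0.25             # each of the 4 actions with prob 1/4
    return P

gamma, m = 0.9, 10
P = two_room_transition(5)
n = P.shape[0]
r = np.zeros(n)
r[0] = 1.0                                       # placeholder reward
v = np.linalg.solve(np.eye(n) - gamma * P, r)

# Krylov basis K_m = span{r, Pr, ..., P^{m-1} r}, orthonormalized.
K = np.empty((n, m))
z = r.copy()
for i in range(m):
    for j in range(i):
        z -= (K[:, j] @ z) * K[:, j]
    K[:, i] = z / np.linalg.norm(z)
    z = P @ K[:, i]

w, *_ = np.linalg.lstsq(K, v, rcond=None)
print("MSE with", m, "Krylov vectors:", np.mean((v - K @ w) ** 2))
```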

Figure 2: Reward vectors projected onto the grid.

Figure 3: Mean squared error of each method, using a discount of γ = 0.9. The axis is in log scale.

We compared the following four methods:

Laplacian: The technique of using the eigenvectors of the random walk Laplacian of the adjacency matrix, as proposed in [Mahadevan, 2005]. The results of the normalized Laplacian are not shown because they were practically identical to the random walk Laplacian.

WL: The weighted spectral method, described in Subsection 3.3, which determines the optimal order of the eigenvectors to be used based on the value of d_j.

Krylov: The augmented Krylov method with no eigenvectors, as described in Subsection 3.3.

KA: The augmented Krylov method with 3 eigenvectors, as proposed in Figure 1. The number of eigenvectors was chosen before running any experiments on this domain.

The results for γ = 0.9 are in Figure 3 and for γ = 0.99 in Figure 4. Reward 3 intentionally violates the smoothness assumption, but it is the easiest one to approximate for all methods except the Laplacian. In the case of Reward 1 and γ = 0.99, the weighted spectral method and the random walk Laplacian outperform the ordinary Krylov method for the first vectors. However, with additional vectors, both the Krylov and augmented Krylov methods significantly outperform them.

Our results suggest that Krylov methods may offer superior performance compared to using the eigenvectors of the Laplacian. Since these methods have been found very effective for many sparse linear systems [Saad, 2003], they may perform well for basis construction also in other MDP problems. In addition, constructing a Krylov space is typically faster than constructing the invariant space of eigenvectors. Using Matlab on a standard PC, it took 2.99 seconds to calculate the top eigenvectors, while it took only 0.24 seconds to calculate the first vectors of the Krylov space for Reward 1.

6 Discussion

We presented an alternative explanation of the success of Laplacian methods used for value approximation. This explanation allows us to determine more precisely when the method may work. Moreover, it shows that these methods are closely related to augmented Krylov methods. We demonstrated on a limited problem set that a basis constructed by augmented Krylov methods may be superior to one constructed from the eigenvectors. In addition, both the weighted spectral method and the augmented Krylov method do not assume the value function to be smooth. Moreover, calculating vectors in the Krylov space is typically cheaper than calculating the eigenvectors.

An approach for state-space compression of Partially Observable MDPs (POMDPs) that also uses the Krylov space was proposed in [Poupart and Boutilier, 2002]. While POMDPs are a generalization of MDPs, the approach is not practical for large MDPs since it implicitly assumes that the number of observations is small. This is not true for MDPs, because the number of observations is equivalent to the number of states. In addition, the objectives of this approach are somewhat different, though the method is similar.

Figure 4: Mean squared error of each method, using a discount of γ = 0.99. The axis is in log scale.

One important issue that we did not address is an application to state spaces that are too large to be enumerated, such as factored MDPs. Because the vectors in the Krylov space are as large as the whole state space, they cannot be enumerated either. However, the approach may be extended along similar lines as discussed in [Poupart and Boutilier, 2002]. In reinforcement learning, the MDP needs to be solved using only an estimate of the transition matrix obtained from sampling. An additional problem is that the basis is not defined for states that were not sampled. This was addressed for eigenvectors, for example, in [Mahadevan et al., 2006]. Thus, augmenting the Krylov space with the eigenvectors also has the advantage that these methods may be directly applied.

We focused mainly on VFA for a fixed policy, without explicitly considering the control part of the problem. While these methods may be combined arbitrarily, a future challenge is to determine an efficient combination. Our results suggest that exploring the connection between basis construction in VFA and sparse linear solvers may bring about interesting advances in the future.

Acknowledgements

The author was supported by the National Science Foundation under Grant No. IIS. I also thank Sridhar Mahadevan, Hala Mostafa, Shlomo Zilberstein, and the anonymous reviewers for valuable comments.

References

[Bertsekas and Tsitsiklis, 1996] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.

[Chung, 1997] Fan Chung. Spectral graph theory. American Mathematical Society, 1997.

[Golub and Loan, 1996] Gene H. Golub and Charles F. Van Loan. Matrix computations. Johns Hopkins University Press, 3rd edition, 1996.

[Ipsen and Meyer, 1998] Ilse C. F. Ipsen and Carl D. Meyer. The idea behind Krylov methods. American Mathematical Monthly, 1998.

[Maggioni and Mahadevan, 2006] Mauro Maggioni and Sridhar Mahadevan. Fast direct policy evaluation using multiscale analysis of Markov diffusion processes. In Proceedings of the International Conference on Machine Learning. ACM Press, 2006.

[Mahadevan and Maggioni, 2005] Sridhar Mahadevan and Mauro Maggioni. Value function approximation with diffusion wavelets and Laplacian eigenfunctions. In Proceedings of Advances in Neural Information Processing Systems, 2005.

[Mahadevan et al., 2006] Sridhar Mahadevan, Mauro Maggioni, Kimberly Ferguson, and Sarah Osentoski. Learning representation and control in continuous Markov decision processes. In Proceedings of the National Conference on Artificial Intelligence, 2006.

[Mahadevan, 2005] Sridhar Mahadevan. Samuel meets Amarel: Automating value function approximation using global state space analysis. In Proceedings of the National Conference on Artificial Intelligence, 2005.

[Munos, 2003] Remi Munos. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, pages 560-567, 2003.

[Poupart and Boutilier, 2002] Pascal Poupart and Craig Boutilier. Value-directed compression of POMDPs. In Proceedings of Advances in Neural Information Processing Systems, 2002.

[Puterman, 2005] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005.

[Saad, 1995] Yousef Saad. Preconditioned Krylov subspace methods for CFD applications. In W. G. Habashi, editor, Solution Techniques for Large Scale CFD Problems, page 141. Wiley, 1995.
[Saad, 1997] Yousef Saad. Analysis of augmented Krylov subspace methods. SIAM Journal on Matrix Analysis and Applications, 18(2):435-449, April 1997.

[Saad, 2003] Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2nd edition, 2003.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew Barto. Reinforcement learning. MIT Press, 1998.
