On Sparse Associative Networks: A Least Squares Formulation


Björn Johansson

August 7, 2001

Technical report LiTH-ISY-R-2368

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
bjorn@isy.liu.se

Abstract

This report is a complement to the working document [4], where a sparse associative network is described. This report shows that the net learning rule in [4] can be viewed as the solution to a weighted least squares problem. This means that we can apply the theory framework of least squares problems and compare the net rule with some other iterative algorithms that solve the same problem. The learning rule is compared with the gradient search algorithm and the RPROP algorithm in a simple synthetic experiment. The gradient rule has the slowest convergence, while the associative and RPROP rules have similar convergence; the associative learning rule has a smaller initial error than the RPROP rule, though. The same experiment also shows that we get faster convergence if we have a monopolar constraint on the solution, i.e. if the solution is constrained to be non-negative. The least squares error is then a bit higher, but the norm of the solution is smaller, which gives a smaller interpolation error. The report also discusses a generalization of the least squares model which includes other known function approximation models.

Contents

1 Introduction
2 Least squares model
  2.1 Iterative solutions
    2.1.1 Gradient search
    2.1.2 Associative net rule
    2.1.3 RPROP - Resilient propagation
3 The sparse associative network
  3.1 A least squares formulation
  3.2 Choice of normalization mode
  3.3 Generalization
4 Experiments
  4.1 Experiment data
  4.2 Experiment setup
  4.3 Results
  4.4 Conclusions
5 Summary
6 Acknowledgment
A Appendices
  A.1 Proof of theorem 1, section 2.1.1
  A.2 Proof of gradient solution, section 2.1.1

1 Introduction

This report is a complement to the working document [4], where a sparse associative network is described. The network parameters are computed using an iterative update rule. This report shows that the update rule can be viewed as an iterative solution to a weighted least squares problem. This means that we can compare the net rule with some other iterative algorithms that solve the same least squares problem. The least squares formulation also makes it easier to compare the associative network with other known high-dimensional function approximation theory, such as the least squares models used in neural networks, radial basis functions, and probabilistic mixture models, see [5].

Section 2 introduces the least squares problem and some iterative methods to compute the solution. Section 3 derives the least squares model corresponding to the net learning rule and analyzes the different choices of models (normalization modes) mentioned in [4] using the least squares approach. Section 4 evaluates different iterative algorithms and models on a simple synthetic example.

2 Least squares model

Let A be an M × N matrix, b an M × 1 vector, and x an N × 1 vector. The problem considered in this report is formulated as

    x_0 = arg min_{l ≤ x ≤ u} ε(x)                                          (1)

where

    ε(x) = (1/2) ||Ax - b||^2_W = (1/2) (Ax - b)^T W (Ax - b)               (2)

Some comments:

W is a positive semi-definite diagonal weight matrix, which depends on the application.

l ≤ x ≤ u means element-wise bounds, i.e. l_i ≤ x_i ≤ u_i. In the case of the sparse associative network we have l_i = 0, u_i = ∞ for all i.

The factor 1/2 is only included to avoid an extra factor 2 in the gradient.

In the case of infinite boundaries (l = -∞, u = ∞) we can compute a solution as

    x_0 = (A^T W A)^+ A^T W b                                               (3)

where (.)^+ denotes the pseudo-inverse. The pseudo-inverse is equivalent to the regular inverse (.)^{-1} in the case of a unique solution.
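As a concrete illustration of equation (3), the minimal NumPy sketch below computes the weighted least squares solution via the pseudo-inverse; the problem sizes and random data are assumptions made only for the example (the report's point is that M and N are far too large for this direct computation in practice).

```python
import numpy as np

# Minimal sketch of equation (3): x0 = (A^T W A)^+ A^T W b.
rng = np.random.default_rng(0)
M, N = 20, 10
A = rng.random((M, N))
b = rng.random(M)
w = rng.random(M) + 0.1          # diagonal of the weight matrix W (positive)
W = np.diag(w)

x0 = np.linalg.pinv(A.T @ W @ A) @ (A.T @ W @ b)

# Sanity check: the weighted gradient A^T W (A x0 - b) should be (numerically) zero.
print(np.linalg.norm(A.T @ W @ (A @ x0 - b)))
```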

4 (rank(a) <N) the pseudo-inverse gives the minimal norm solution. The minimal norm solution can also approximately be achieved by adding a regularization term ɛ(x) = 2 Ax b 2 W + 2 x 2 W r = 2 (Ax b)t W(Ax b)+ 2 xt W r x where W r is a diagonal weight matrix. (4) 2. Iterative solutions The sparse associative networks are intended to be used for very high dimensional data applications, where M, N are very large. The analytical solution is therefore not practically possible to compute. There exist several iterative solutions to the least squares problem. The iterative algorithms discussed in this report are based on the gradient ɛ x = ɛ x = AT WAx A T Wb (5) There are many other, more elaborate, iterative update rules for the least squares solution, see e.g. [, 7, 2]. Some use more refined rules based on the gradient, some are based on higher order derivatives. They should converge faster, but there will be more computations in each iteration. This report focuses on simple, robust rules based on the gradient, which is suited for very large, sparse systems. But other algorithms may be a topic for future research. as Without the bounds l, u the gradient-based update rules can be formulated x p+ = x p f(ɛ p x) (6) where ɛ p x = ɛ x x p means the partial derivative at the point x = x p and f(.) is some suitably chosen update function with the property f(ɛ x ) 0 with equality iff ɛ x = 0 (7) If we have bounds we simply truncate: ( ( )) x p+ = min u, max l, x p f(ɛ p x) (8) where min and max denote element-wise operations. The iterative algorithms will converge to the same solution if it is unique, otherwise the solution will depend on the algorithm (choice of f(.)) and on the initial value. Then there is no guarantee that we will get the minimal norm solution, unless we for example include the regularization term. The next subsections discuss three different choices of f(.). Section 4 contain experiments which compare the different gradient-based update rules on a simple synthetic example. 4

2.1.1 Gradient search

The simple gradient update function is proportional to the gradient,

    f(ε_x) = η ε_x                                                          (9)

and the update rule becomes

    x_{p+1} = x_p - η ε_x^p = x_p - η (A^T W A x_p - A^T W b)               (10)

It can be shown that the gradient search algorithm converges for 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of the matrix A^T W A, see e.g. [5]. In the case of a large sparse matrix A the largest eigenvalue can be estimated using for example the power method, see e.g. [3]. There is an interesting special choice of W which gives λ_max = 1 if all elements in A have the same sign. We state the following theorem:

Theorem 1  Assume A ≥ 0, i.e. all values in A are non-negative. Let W = diag(A A^T 1)^{-1}, where 1 = (1 1 ... 1)^T. Then the largest eigenvalue of A^T W A is equal to 1.

The theorem is proven in appendix A.1. It will be used in section 3. It can also be shown that if we use the gradient update rule with initial vector x_0 = 0 we get the minimal norm solution, see appendix A.2.

2.1.2 Associative net rule

As will be shown later in section 3, the associative learning rule in [4] has the update function f(ε_x) = D_η ε_x, where

    D_η = diag(η_1, η_2, ..., η_N)                                          (11)

and the update rule becomes

    x_{p+1} = x_p - D_η ε_x^p = x_p - D_η A^T W (A x_p - b)                 (12)

We now have an individual learning rate for each dimension. The gradient search algorithm in section 2.1.1 is the special case where D_η = η I. Using the associative net update rule with x_0 = 0 does not ensure that we get the minimal norm solution, as it did with the gradient update rule. The only thing we know for certain is that the dimensions of x that do not affect the solution remain zero.
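The sketch below illustrates gradient search (equations (9)-(10)) with the step size taken from a power-method estimate of λ_max(A^T W A), written in the form of equation (12) with a diagonal D_η (here simply η I). The data and iteration counts are illustrative assumptions.

```python
import numpy as np

# Gradient search with eta chosen from a power-method estimate of lambda_max.
rng = np.random.default_rng(2)
M, N = 40, 20
A = rng.random((M, N))
b = rng.random(M)
w = np.ones(M)                              # diagonal of W

v = rng.random(N)
for _ in range(200):                        # power method on A^T W A
    v = A.T @ (w * (A @ v))
    v /= np.linalg.norm(v)
lam_max = v @ (A.T @ (w * (A @ v)))

eta = 1.0 / lam_max                         # safely inside 0 < eta < 2 / lambda_max
d_eta = np.full(N, eta)                     # diagonal of D_eta; one rate per dimension
x = np.zeros(N)
for _ in range(2000):
    x -= d_eta * (A.T @ (w * (A @ x - b)))  # equation (12); equation (10) when D_eta = eta*I
print(np.linalg.norm(A @ x - b))            # residual after training
```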

2.1.3 RPROP - Resilient propagation

The RPROP algorithm has attracted attention in recent years, see e.g. [7, 8]. It only uses the signs of the gradient ε_x. The update function is written as

    f(ε_x) = D_η sign(ε_x)                                                  (13)

where D_η is a diagonal matrix similar to equation (11) and sign(ε_x) means element-wise sign (define sign(0) = 0). The learning rate D_η is in this case adaptive. The basic idea is that we increase the learning rate if we are updating in a consistent direction, otherwise we decrease it. The update rule for the learning rates is

    η_k^p = η^+ η_k^{p-1}    if ε_{x_k}^p ε_{x_k}^{p-1} > 0
    η_k^p = η^- η_k^{p-1}    if ε_{x_k}^p ε_{x_k}^{p-1} < 0  (and set ε_{x_k}^p := 0)
    η_k^p = η_k^{p-1}        if ε_{x_k}^p ε_{x_k}^{p-1} = 0                 (14)

where 0 < η^- < 1 < η^+ and ε_{x_k}^p = (ε_x^p)_k denotes component k of the gradient at x = x_p. η^- and η^+ are called the retardation and acceleration factor respectively. Good empirical values are η^- = 0.5 and η^+ = 1.2. A suitable initial value is for example η_k^0 = 0.01. Note that in the case of the gradient changing sign we also set the gradient to zero. This avoids unnecessary oscillations in the following iterations. The update rule becomes

    x_{p+1} = x_p - D_η^p sign(ε_x^p)                                       (15)

The main difference between RPROP and most other heuristic algorithms is that the learning rate adjustments and weight changes depend only on the signs of the gradient terms, not their magnitudes. It is argued that the gradient magnitude depends on the scaling of the error function and can change greatly from one step to the next. Also, the gradient vanishes at a minimum, so the step size becomes smaller and smaller as the iterate nears the minimum. This can give slow convergence near the minimum (cf. the experiments in section 4). Another advantage of RPROP compared to the previous gradient-based methods is that we do not have to choose suitable learning rates; they adapt over time. As for the associative net rule, we cannot say which solution we will get if the solution is non-unique.
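The following minimal sketch applies the RPROP update of equations (13)-(15) to the least squares gradient of equation (5), with η^- = 0.5 and η^+ = 1.2 as quoted above. The data, the initial step size, and the iteration count are assumptions made only for the illustration.

```python
import numpy as np

# RPROP applied to the least squares gradient.
rng = np.random.default_rng(3)
M, N = 40, 20
A = rng.random((M, N))
b = rng.random(M)
w = np.ones(M)

x = np.zeros(N)
step = np.full(N, 0.01)                  # initial learning rates eta_k^0
g_old = np.zeros(N)
for _ in range(2000):
    g = A.T @ (w * (A @ x - b))          # gradient, equation (5)
    same = g * g_old > 0                 # consistent direction: accelerate
    flip = g * g_old < 0                 # sign change: retard and zero the gradient
    step[same] *= 1.2                    # eta+
    step[flip] *= 0.5                    # eta-
    g[flip] = 0.0
    x -= step * np.sign(g)               # equation (15)
    g_old = g
print(np.linalg.norm(A @ x - b))         # residual after training
```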

3 The sparse associative network

To avoid confusion, note that the matrix A in this section is not the same as in section 2, and that some of the vectors here are row vectors instead of column vectors. The associative net tries to associate an N-dimensional input feature vector a to an output response value u. The output and the input are associated by a 1 × N link vector c. The link vector contains the net parameters that are computed using a set of training samples. For purposes outside the scope of this report, all the involved quantities a, u, and c are restricted to having non-negative values. Another preferable property is that A (and sometimes also u) is sparse, giving a sparse link vector. There can be several output responses, with a link vector for each one of them, but they are optimized independently. We will therefore focus on a scalar output response.

Assume we have M training samples {a_k, u_k}_{k=1}^M. We want to associate the feature vectors a_k with the responses u_k using some suitable criterion. Let A = (a_1 a_2 ... a_M) be the N × M matrix with the input training data and u = (u_1 u_2 ... u_M) be the 1 × M vector of output training data. The optimization rule used in the working document [4] is

    ĉ(i+1) = max(ĉ(i) - ν_f .* ((û(i) - u) A^T), 0)
    û(i+1) = ν_s .* (ĉ(i+1) A)                                              (16)

(Note that A is not the same matrix as in section 2, rather the transpose.) Here .* denotes element-wise multiplication, ν_f is a 1 × N vector called the feature domain normalization, and ν_s is a 1 × M vector called the sample domain normalization. ν_f and ν_s are functions of A. To use the net, we take a feature vector a and compute the output response as

    û = ν_s .* (ĉ a)                                                        (17)

where ν_s is a function of a and possibly also of other feature vectors.

3.1 A least squares formulation

We will now show that the update rule (16) is the solution to a weighted least squares problem. First, denote

    D_f = diag(ν_f),   D_s = diag(ν_s)                                      (18)

If we ignore the boundary limit for a moment we can rewrite the update rule as

    ĉ(i+1) = ĉ(i) - (û(i) - u) A^T D_f
    û(i+1) = ĉ(i+1) A D_s                                                   (19)

By combining the two equations into one we get

    ĉ(i+1) = ĉ(i) - (û(i) - u) A^T D_f
           = ĉ(i) - (ĉ(i) A D_s - u) A^T D_f
           = ĉ(i) - (ĉ(i) A D_s - u) D_s^{-1} D_s A^T D_f                   (20)

If we compare this equation with equation (12) (note that the two equations differ by a transpose) and include the boundary limit again, we can see that update rule (16) is the iterative gradient-based solution to the problem

    arg min_{0 ≤ c} ||u - c A D_s||^2_{D_s^{-1}}                            (21)

with D_f serving as the learning rate D_η in equation (11). This means that D_s controls the choice of model (normalization of A) and the weight, while D_f controls the convergence rate. As mentioned in section 2.1.2, in the case of a non-unique solution we cannot know which solution we will get, only that it minimizes the least squares function in equation (21). The solution depends on the initial value c_0, and also on the learning rate D_f.

3.2 Choice of normalization mode

Three different combinations of D_f and D_s are suggested in [4]. D_s controls the net model and depends on the application. D_f affects the optimization algorithm. The D_f in choice 1 is optimal in the special case when all features are uncorrelated. The D_f in choices 2 and 3 is more difficult to analyze.

Choice 1: Normalization entirely in the feature domain

    D_s = I,   D_f = diag(A A^T)^{-1} = diag(1/Σ_m a_{1m}^2, 1/Σ_m a_{2m}^2, ...)   (22)

where diag(A A^T)^{-1} denotes the inverse of the diagonal part of A A^T. From the least squares formulation, equation (21), we see that this choice corresponds to the net model

    u = c a                                                                 (23)

c is optimized by non-weighted least squares using a gradient-based update rule with learning rate D_f. It is difficult to analyze the convergence properties of D_f except in the very simple case when all features are uncorrelated, i.e. if the rows in A are orthogonal (A A^T is diagonal). Then we can optimize each link element c_k independently, and it is easy to show that the D_f above is the optimal learning rate (it gives convergence after one iteration). In general, though, we have correlated features.
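The minimal check below illustrates this one-step convergence claim for choice 1: a non-negative A with disjoint row supports (an assumption made only to obtain orthogonal rows), D_s = I, and the learning rate D_f = diag(A A^T)^{-1}.

```python
import numpy as np

# One update step is enough when the rows of A are orthogonal.
rng = np.random.default_rng(4)
N, M = 4, 12
A = np.zeros((N, M))
for k in range(N):                           # each feature is active on its own samples
    A[k, 3 * k:3 * (k + 1)] = rng.random(3)
u = rng.random(M)

D_f = np.diag(1.0 / np.sum(A**2, axis=1))    # diag(AA^T)^{-1}; AA^T is diagonal here
c = np.zeros(N)                              # c_0 = 0
c = c - (c @ A - u) @ A.T @ D_f              # one step of c <- c - (cA - u)A^T D_f

print(np.linalg.norm((c @ A - u) @ A.T))     # gradient after one step: ~0
```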

Choice 2: Mixed domain normalization

    D_s = diag(A^T 1)^{-1} = diag(1/(a_1^T 1), 1/(a_2^T 1), ...)
    D_f = diag(A 1)^{-1} = diag(1/Σ_m a_{1m}, 1/Σ_m a_{2m}, ...)            (24)

Each diagonal element in D_s is the inverse sum of all feature values in one training sample. A D_s means that each training sample is normalized with its sum, and we get the net model

    u = c a / (1^T a)                                                       (25)

In addition, we use D_s^{-1} as weight in the least squares problem. This means that the feature vectors a_k with the largest sum will have the most impact on the solution.

Choice 3: Normalization entirely in the sample domain

    D_s = diag(A^T A 1)^{-1} = diag(1/(a_1^T(a_1 + a_2 + ... + a_M)), 1/(a_2^T(a_1 + a_2 + ... + a_M)), ...)
    D_f = I                                                                 (26)

We can view a_1 + a_2 + ... + a_M as an average operation (ignoring the factor 1/M). This choice therefore corresponds to the net model

    u = c a / (ā^T a),   where ā = E[a]                                     (27)

Again we use D_s^{-1} as weight, this time meaning that the feature vectors with the largest norm and with a direction close to the direction of the average vector ā will have the most impact on the solution.

Since D_f = I the update rule reduces to ordinary gradient search with η = 1 (section 2.1.1). It was mentioned that the gradient search algorithm converges for 0 < η < 2/λ_max. In this case λ_max is the largest eigenvalue of the matrix (A D_s) D_s^{-1} (A D_s)^T = A D_s A^T. With the choice of D_s as above we can use theorem 1 in appendix A.1, which says that we get λ_max = 1. Note that the theorem only holds for certain if A ≥ 0, which is the case in the associative networks theory. D_f = I is therefore optimal!
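The minimal sketch below numerically spot-checks the statement above: for a non-negative A and D_s = diag(A^T A 1)^{-1}, the largest eigenvalue of A D_s A^T equals 1, so the unit learning rate D_f = I lies inside the convergence interval. The random non-negative data is an assumption made only for the illustration.

```python
import numpy as np

# Check lambda_max(A D_s A^T) = 1 for normalization mode 3.
rng = np.random.default_rng(5)
N, M = 30, 50
A = rng.random((N, M))                           # A >= 0

D_s = np.diag(1.0 / (A.T @ A @ np.ones(M)))      # diag(A^T A 1)^{-1}
lam = np.linalg.eigvalsh(A @ D_s @ A.T)          # A D_s A^T is symmetric
print(lam.max())                                 # ~1.0, as predicted by theorem 1
```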

3.3 Generalization

This section is somewhat outside the scope of this report, but it is included to show that the associative net model in equation (21) can be generalized to include other algorithms. For example, we could have independent normalization and weight:

    arg min_{c_l ≤ c ≤ c_u} ||u - c A D_s||^2_W                             (28)

One example of this can be found in [6], which is one of the contributions to the radial basis theory (see [5]). Two choices of normalization were suggested there:

    D_s = I   and   D_s = diag(A^T 1)^{-1} = diag(1/(a_1^T 1), 1/(a_2^T 1), ...)   (29)

The last choice corresponds to the model

    u = c a / (1^T a)                                                       (30)

which is the same model as the second choice of normalization mode, mixed domain normalization, in section 3.2. Each element in a is in this case a radial basis function. The model parameters c are computed using unweighted least squares (W = I). To make the problem well-posed, some form of regularization is often used.

The model in equation (30) is also used in kernel regression theory, or mixture model theory, building on the notion of density estimation. In this case each element in a is a kernel function playing the role of a local density function, e.g. a Gaussian function, see [5]. The normalization factor is an estimate of the underlying probability density function of the input. The model parameters c are found by minimizing a least squares function or a maximum likelihood function. The solution is computed using gradient search or expectation-maximization. Again, regularization is often used to make the problem well-posed.

The examples above use c_l = -∞ and c_u = ∞, which is one of the major differences from the associative network in [4], which uses the monopolar constraint c_l = 0. Another difference is that the above examples use unweighted least squares, whereas a weighted least squares goal function is used for the associative network.
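The minimal sketch below illustrates the normalized radial basis model of equation (30), u = c a / (1^T a), with Gaussian kernels fitted by unweighted, unbounded least squares (W = I, c_l = -∞, c_u = ∞). The kernel centers, the width, and the sine target function are assumptions made only for the illustration.

```python
import numpy as np

# Normalized RBF regression in one dimension.
rng = np.random.default_rng(6)
centers = np.linspace(0.0, 1.0, 15)
width = 0.1
x_train = rng.random(100)
u_train = np.sin(2 * np.pi * x_train)                # target to approximate

def normalized_features(x):
    a = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))
    return a / a.sum(axis=1, keepdims=True)          # a / (1^T a)

c, *_ = np.linalg.lstsq(normalized_features(x_train), u_train, rcond=None)

x_test = np.linspace(0.0, 1.0, 5)
print(normalized_features(x_test) @ c)               # approximates sin(2*pi*x_test)
```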

4 Experiments

Section 2 described three different iterative algorithms that solve the least squares problem in section 3: the gradient rule, the RPROP rule, and the associative net rule. These algorithms are compared on a simple synthetic set of data.

4.1 Experiment data

The goal of the experiment is to train an associative net to estimate 2D position from a set of local distance functions, also called feature channels. Given a 2D position x = (x_1, x_2), the local distance functions are computed as

    d_k(x) = cos^2(||x - x_k|| / w_k)   if ||x - x_k|| / w_k ≤ π/2
    d_k(x) = 0                          otherwise                           (31)

where x_k is called the channel center and w_k is called the channel width. Two choices of {x_k, w_k} are explored:

Regularly placed feature channels: The centers x_k are placed in a regular Cartesian grid and all widths w_k are equal.

Randomly placed feature channels: All x_k are randomly placed and the widths w_k vary randomly within a limited range.

The two choices are shown in figure 1. The data used to train the system are computed along the spiral function in figure 1. The net is also evaluated on another set of data that is randomly located within the training region, see figure 2. The following list contains some facts about the data:

N = 500 local distance functions (feature channels)
M = 200 training samples
M_e = 1000 evaluation samples randomly located within the training region

The input to the net is a 500-dimensional vector containing the values of the local distance functions, and the output is a 2D vector with the position x, i.e.

    a(x) = (d_1(x) d_2(x) ... d_N(x))^T,   u = x = (x_1 x_2)                (32)
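The minimal sketch below implements the channel encoding of equations (31)-(32), using the reconstruction cos^2(||x - x_k|| / w_k) with support ||x - x_k|| / w_k ≤ π/2. The grid of centers and the common width are assumptions made only for the illustration.

```python
import numpy as np

# Local distance functions (feature channels) and the channel vector a(x).
grid = np.linspace(0.0, 1.0, 10)
centers = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)   # regularly placed
widths = np.full(len(centers), 0.15)

def channel_vector(x):
    r = np.linalg.norm(centers - x, axis=1) / widths
    return np.where(r <= np.pi / 2, np.cos(r)**2, 0.0)

x = np.array([0.3, 0.7])
a = channel_vector(x)
print(np.count_nonzero(a), len(a))        # the channel vector is sparse
```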

4.2 Experiment setup

As discussed in previous sections, different net models can be used. They can be summarized using the sample domain normalization D_s:

    U = (u_1; u_2):   u_1 = c_1 A D_s,  u_2 = c_2 A D_s,   i.e.  U = C A D_s,  where C = (c_1; c_2)   (33)

The 1 × M vector u_i contains all the output samples for coordinate x_i and the N × M matrix A = (a_1 a_2 ...) contains all the feature channel vectors. The association is computed using least squares, weighted with W = D_s^{-1}. We can solve the total system U = C A D_s directly or, equivalently, each system u_i = c_i A D_s separately.

Several combinations of boundaries and models are considered; table 1 lists the cases. Experiments 1 and 2 compare bipolar and monopolar solutions on regularly placed feature channels. Experiments 3 and 4 contain the same comparison on randomly placed feature channels. Experiments 4, 5, and 6 compare three different choices of normalization.

4.3 Results

Table 2 contains the results after training using the different iterative algorithms. We do not have any boundaries in experiments 1 and 3, so there we can compute the analytical solution in equation (3) as well. The training error e is the relative error defined as

    e = ||U - C A D_s||_F / ||U||_F                                         (34)

where the norm is the Frobenius norm. The table also contains the norm and the sparsity (nnz = number of non-zero values) of the solution C. Figure 3 shows the error during training for each of the experiments and algorithms.

The net is also investigated for its interpolation performance on the set of evaluation data described in section 4.1. The error between the net output û and the true position u is computed for each of the evaluation samples:

    Δu = ||u - û|| = ||u - C a / h(a)||                                     (35)

where h(a) depends on the net model, see table 1. The mean value, standard deviation, minimal value, and maximal value of Δu are shown in table 3.
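The minimal sketch below evaluates the two error measures used above: the relative training error e of equation (34) and the interpolation error of equation (35). The small random stand-in data, D_s = I, and h(a) = 1 (the model u = c a of experiments 1-4) are assumptions made only for the illustration.

```python
import numpy as np

# Relative Frobenius training error and interpolation error.
rng = np.random.default_rng(7)
N, M = 50, 80
A = rng.random((N, M))                       # training feature vectors as columns
U = rng.random((2, M))                       # 2D training positions
D_s = np.eye(M)

C = U @ np.linalg.pinv(A @ D_s)              # stand-in for the trained link vectors

e = np.linalg.norm(U - C @ A @ D_s) / np.linalg.norm(U)    # equation (34), Frobenius norms
print("relative training error e:", e)

a_eval = rng.random(N)                       # one evaluation feature vector
u_true = rng.random(2)
delta_u = np.linalg.norm(u_true - C @ a_eval / 1.0)        # equation (35) with h(a) = 1
print("interpolation error:", delta_u)
```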

4.4 Conclusions

If we compare the three iterative rules we can see that the RPROP rule and the associative net rule have roughly the same convergence rate in all experiments, while the gradient rule is slower. RPROP has a higher error at the beginning because the learning rate has not yet adapted. The computational complexity of RPROP is somewhat higher because we have to update the learning rate as well. This is compensated by a slightly faster convergence rate. The solutions using RPROP and the associative rule also have similar norm and sparsity. They are therefore comparable in performance. The main advantage of the RPROP rule, or any other general iterative rule for solving the least squares problem, is that we do not have to derive an explicit learning rate for each choice of normalization.

In addition, we can make some observations regarding the choice of model and boundaries:

In experiments 1 and 2 we had a regular grid of feature channels and compared the bipolar solution (no boundary constraint) in experiment 1 with the monopolar solution (only positive coefficients) in experiment 2. The bipolar case had still not converged after 10000 iterations (the optimal analytical solution is close to zero), but the error is at least fairly low. In experiments 3 and 4 we made the same comparison but on randomly placed feature channels. In this case the difference between bipolar and monopolar coefficients is much more evident, see figure 3. This is because the feature channels are more correlated in this case. Again, the bipolar case had still not converged after 10000 iterations. The experiments indicate that we get faster convergence using only monopolar coefficients compared to bipolar coefficients. The cost is a larger error for the solution.

The analytical solution in experiment 3 has a large norm ||C|| compared to the iterative solutions in the same experiment (this can also be seen in experiment 1). This is because the solution contains a few very large positive and a few very large negative values. The iterative solutions did not converge within 10000 iterations, but they would eventually give a large norm as well.

By comparing the results from experiments 1 and 3, where we have a bipolar solution, with experiments 2 and 4, where the solution is monopolar, we can see that the norm ||C|| is lower in the monopolar cases. A lower norm may give better robustness to noise and a lower interpolation error, which is partly confirmed by the evaluation results in table 3. This is a topic for future research. A lower norm could also have been accomplished if we had used both lower and upper constraints, or if we had included a regularization term. The alternatives might differ in convergence rate though.

In experiments 5 and 6 we get a lower interpolation error than in experiment 4. This may be because we use a different model (normalization of the feature vectors), but it may also be because we use a weight. This is a topic for future research.

The generalization of the least squares problem, equation (28), where we have independent normalization and weight, allows for this to be investigated.

In experiment 6 we get identical convergence for the gradient rule and the associative update rule, which confirms the statement in section 3.2 that normalization mode 3 is equivalent to gradient search.

One argument in [4] for having the monopolar constraint is that the link vector becomes much more sparse than without the constraint. A sparse link vector gives a lower computational complexity in the net. Table 2 partly confirms the argument, but only when we have randomly located feature channels and no normalization of the input a (experiment 4). Otherwise we only get a slightly higher sparsity.

5 Summary

This report shows that the learning algorithm for the associative net in [4] can be described as a weighted least squares problem. This allows for comparison with other iterative solution methods. The report compared the associative net update rule with the gradient rule and the RPROP rule in a simple experiment. The gradient rule performs worse than the associative rule. The RPROP rule has a higher initial error, but its convergence rate is comparable to the associative net rule. The experiments also show that the convergence is considerably faster when using only positive (monopolar) values in the solution compared to using both negative and positive (bipolar) values. This holds for all three of the update rules.

The weighted least squares problem was generalized to include other algorithms as well. This might be of help when the associative net is compared to other function approximation methods. It also allows for investigation of the importance of normalization and weight.

6 Acknowledgment

This work was supported by the Swedish Foundation for Strategic Research, project VISIT - VIsual Information Technology. The author would like to thank the people at CVL for helpful discussions, especially my supervisor Gösta Granlund.

Figure 1: Experiment data. Local distance functions (feature channels) and training samples along a spiral function, for regularly placed feature channels and for randomly placed feature channels.

Figure 2: Evaluation data. The evaluation data is randomly located within the training region (spiral).

    Experiment   Feature location   Lower bound c_l   Model (D_s)        Weight W = D_s^{-1}
    1            Regular            -∞                u = c a            I
    2            Regular            0                 u = c a            I
    3            Random             -∞                u = c a            I
    4            Random             0                 u = c a            I
    5            Random             0                 u = c a/(1^T a)    diag(A^T 1)
    6            Random             0                 u = c a/(ā^T a)    diag(A^T A 1)

    Common for all experiments: number of iterations 10000, upper bound c_u = ∞, initial value c_0 = 0, no regularization term.

Table 1: Experiment setup. The three iterative update methods using the gradient rule, the RPROP rule, and the associative net rule are evaluated on each of experiments 1-6.

Figure 3: Error during training for each of experiments 1-6 and each of the iterative update rules (gradient rule, RPROP rule, associative rule). Note the logarithmic scale on the x-axis. The gradient rule and the associative rule coincided in the last experiment.

Table 2: Results after training. For each of experiments 1-6 the table lists the relative error e, the norm ||C||_F, and the sparsity nnz(C) of the solution for the gradient rule, the RPROP rule, and the associative net rule (and for the analytical solution in experiments 1 and 3).

Table 3: Interpolation performance on the evaluation data. For each of experiments 1-6 the table lists the mean value, standard deviation, minimum, and maximum of the position error Δu for the gradient rule, the RPROP rule, and the associative net rule (and for the analytical solution in experiments 1 and 3).

A Appendices

A.1 Proof of theorem 1, section 2.1.1

The theorem is repeated below:

Theorem 1  Let A be a matrix. Assume A ≥ 0, i.e. all values in A are non-negative. Assume W = diag(A A^T 1)^{-1}, where 1 = (1 1 ... 1)^T. Then the largest eigenvalue of A^T W A is equal to 1.

Proof  The proof consists of two parts. First, we show that the eigenvalues cannot be larger than 1. Second, we show that there exists an eigenvector that has the eigenvalue 1. We assume that all values in the vector A A^T 1 are nonzero so that W exists (since A ≥ 0 this basically means that no row in A contains only zeros).

1. Let v be an eigenvector of A^T W A. We can compute the eigenvalue as

    λ = v^T A^T W A v / (v^T v)                                             (36)

It is a well known fact that A A^T and A^T A have the same non-zero eigenvalues. In this case we also have a weight involved, but we can make a modification and state that A^T W A = (A^T W^{1/2})(W^{1/2} A) and W^{1/2} A A^T W^{1/2} have the same non-zero eigenvalues. The eigenvalue λ above can thus be computed as

    λ = u^T W^{1/2} A A^T W^{1/2} u / (u^T u)                               (37)

for some (eigen)vector u. Let z = W^{1/2} u, i.e. u = W^{-1/2} z, and insert this into the equation:

    λ = u^T W^{1/2} A A^T W^{1/2} u / (u^T u)
      = z^T A A^T z / (z^T W^{-1} z)
      = z^T A A^T z / (z^T diag(A A^T 1) z)                                 (38)

It remains to show that this quotient cannot be larger than 1. To simplify the index notation we let B = A A^T and compute the quotient:

    λ = z^T B z / (z^T diag(B 1) z)
      = Σ_{i,j} b_{ij} z_i z_j / Σ_i (Σ_j b_{ij}) z_i^2
      = (Σ_i b_{ii} z_i^2 + Σ_{i<j} b_{ij} 2 z_i z_j) / (Σ_i b_{ii} z_i^2 + Σ_{i<j} b_{ij} (z_i^2 + z_j^2))   (39)

In the last equality we have used the symmetry property b_{ij} = b_{ji}. For each numerator term b_{ij} 2 z_i z_j we have a corresponding denominator term b_{ij} (z_i^2 + z_j^2). The inequality between the arithmetic and geometric means states that z_i^2 + z_j^2 ≥ 2 z_i z_j, and since all b_{ij} are non-negative we can therefore conclude that λ ≤ 1.

2. Does there exist an eigenvector with eigenvalue λ = 1? Yes, it is easy to show that v = A^T 1 has eigenvalue 1:

    A^T W A v = A^T (W A A^T 1) = A^T (diag(A A^T 1)^{-1} A A^T 1) = A^T 1 = v   (40)

(In the third equality we used the fact that diag(x)^{-1} x = 1.)
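The minimal sketch below numerically spot-checks theorem 1: for a random non-negative A and W = diag(A A^T 1)^{-1}, the largest eigenvalue of A^T W A is 1 and v = A^T 1 is a corresponding eigenvector. The random data is an assumption made only for the illustration.

```python
import numpy as np

# Numerical spot-check of theorem 1.
rng = np.random.default_rng(8)
A = rng.random((6, 9))                        # A >= 0
ones = np.ones(A.shape[0])

W = np.diag(1.0 / (A @ A.T @ ones))           # diag(A A^T 1)^{-1}
B = A.T @ W @ A                               # symmetric, since W is diagonal

print(np.linalg.eigvalsh(B).max())            # ~1.0
v = A.T @ ones
print(np.allclose(B @ v, v))                  # True: eigenvalue 1 for v = A^T 1
```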

A.2 Proof of gradient solution, section 2.1.1

Theorem 2  Let

    S = {x : ε(x) = ||Ax - b||^2 is minimum}                                (41)

Then the gradient search algorithm x_{p+1} = x_p - η ε_x(x_p) with x_0 = 0 gives the solution x_0 ∈ S with minimal Euclidean norm. (The weight W is ignored here; it does not affect the theorem.)

Proof  Let

    A^T A = V Σ V^T,   Σ = ( Σ_r 0 ; 0 0 ),   V^T V = V V^T = I             (42)

be the SVD decomposition of A^T A (r = rank(A^T A)). Furthermore, let y = V^T x (y has the same norm as x). The gradient update rule in equation (10) can then be written

    y_{p+1} = y_p - η (Σ y_p - s),   where s = V^T A^T b                    (43)

y_{p+1} can be expressed as a function of the initial value y_0 = V^T x_0 as

    y_{p+1} = (I - ηΣ) y_p + η s
            = (I - ηΣ)^2 y_{p-1} + (I - ηΣ) η s + η s
            = ...
            = (I - ηΣ)^{p+1} y_0 + (Σ_{k=0}^{p} (I - ηΣ)^k) η s             (44)

Assume x_S ∈ S. It has the property A^T A x_S = A^T b (the gradient of ε(x) equals zero). We can then write

    s = V^T A^T A x_S = Σ y_S,   where y_S = V^T x_S                        (45)

and the update can then be written

    y_{p+1} = (I - ηΣ)^{p+1} y_0 + (Σ_{k=0}^{p} (I - ηΣ)^k) η Σ y_S         (46)

By using the property (Σ_{k=0}^{N} B^k)(I - B) = I - B^{N+1} with B = I - ηΣ we can write the update rule as

    y_{p+1} = (I - ηΣ)^{p+1} y_0 + (I - (I - ηΣ)^{p+1}) y_S                 (47)

or, equivalently,

    y_{p+1} = ( (I - ηΣ_r)^{p+1} 0 ; 0 I ) y_0 + ( I - (I - ηΣ_r)^{p+1} 0 ; 0 0 ) y_S   (48)

After convergence (if η is suitably chosen) we get

    y_∞ = ( 0 0 ; 0 I ) y_0 + ( I 0 ; 0 0 ) y_S                             (49)

and we can see that if we choose y_0 = 0 we get the solution

    y_∞ = ( I 0 ; 0 0 ) y_S                                                 (50)

which can be shown to be the minimal norm solution, see e.g. [3].
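The minimal sketch below numerically spot-checks theorem 2: for a rank-deficient A, gradient search started at x_0 = 0 approaches the minimal-norm least squares solution, i.e. the pseudo-inverse solution. The random data, step size, and iteration count are assumptions made only for the illustration.

```python
import numpy as np

# Gradient search from x_0 = 0 versus the pseudo-inverse solution.
rng = np.random.default_rng(9)
A = rng.standard_normal((20, 4)) @ rng.standard_normal((4, 10))   # rank 4 < N = 10
b = rng.standard_normal(20)

eta = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
x = np.zeros(10)
for _ in range(50000):
    x -= eta * (A.T @ (A @ x - b))            # equation (10) with W = I

x_min_norm = np.linalg.pinv(A) @ b
print(np.linalg.norm(x - x_min_norm))         # close to 0: the minimal-norm solution
```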

References

[1] M. Adlers. Topics in Sparse Least Squares Problems. PhD thesis, Dept. of Mathematics, Linköping University, Linköping, Sweden.

[2] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Society for Industrial and Applied Mathematics, 1996.

[3] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition, 1989.

[4] G. Granlund. Parallel Learning in Artificial Vision Systems: Working Document. Technical report, Dept. EE, Linköping University.

[5] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999.

[6] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-293, 1989.

[7] R. D. Reed and R. J. Marks II. Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks. MIT Press, 1999.

[8] M. Riedmiller and H. Braun. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. In Proceedings of the IEEE International Conference on Neural Networks, volume 1, San Francisco, CA, 1993.


More information

CHAPTER 7. Regression

CHAPTER 7. Regression CHAPTER 7 Regression This chapter presents an extended example, illustrating and extending many of the concepts introduced over the past three chapters. Perhaps the best known multi-variate optimisation

More information

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data

More information

Principal Component Analysis and Linear Discriminant Analysis

Principal Component Analysis and Linear Discriminant Analysis Principal Component Analysis and Linear Discriminant Analysis Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/29

More information

Matrix Factorizations

Matrix Factorizations 1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular

More information

This appendix provides a very basic introduction to linear algebra concepts.

This appendix provides a very basic introduction to linear algebra concepts. APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not

More information

Maths for Signals and Systems Linear Algebra in Engineering

Maths for Signals and Systems Linear Algebra in Engineering Maths for Signals and Systems Linear Algebra in Engineering Lectures 13 15, Tuesday 8 th and Friday 11 th November 016 DR TANIA STATHAKI READER (ASSOCIATE PROFFESOR) IN SIGNAL PROCESSING IMPERIAL COLLEGE

More information

Fixed Weight Competitive Nets: Hamming Net

Fixed Weight Competitive Nets: Hamming Net POLYTECHNIC UNIVERSITY Department of Computer and Information Science Fixed Weight Competitive Nets: Hamming Net K. Ming Leung Abstract: A fixed weight competitive net known as the Hamming net is discussed.

More information

Optimization of Gaussian Process Hyperparameters using Rprop

Optimization of Gaussian Process Hyperparameters using Rprop Optimization of Gaussian Process Hyperparameters using Rprop Manuel Blum and Martin Riedmiller University of Freiburg - Department of Computer Science Freiburg, Germany Abstract. Gaussian processes are

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

A Program for Data Transformations and Kernel Density Estimation

A Program for Data Transformations and Kernel Density Estimation A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate

More information

LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12,

LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12, LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12, 2000 74 6 Summary Here we summarize the most important information about theoretical and numerical linear algebra. MORALS OF THE STORY: I. Theoretically

More information

Lecture: Local Spectral Methods (1 of 4)

Lecture: Local Spectral Methods (1 of 4) Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Extreme Values and Positive/ Negative Definite Matrix Conditions

Extreme Values and Positive/ Negative Definite Matrix Conditions Extreme Values and Positive/ Negative Definite Matrix Conditions James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University November 8, 016 Outline 1

More information

Grassmann Averages for Scalable Robust PCA Supplementary Material

Grassmann Averages for Scalable Robust PCA Supplementary Material Grassmann Averages for Scalable Robust PCA Supplementary Material Søren Hauberg DTU Compute Lyngby, Denmark sohau@dtu.dk Aasa Feragen DIKU and MPIs Tübingen Denmark and Germany aasa@diku.dk Michael J.

More information

arxiv: v1 [math.na] 1 Sep 2018

arxiv: v1 [math.na] 1 Sep 2018 On the perturbation of an L -orthogonal projection Xuefeng Xu arxiv:18090000v1 [mathna] 1 Sep 018 September 5 018 Abstract The L -orthogonal projection is an important mathematical tool in scientific computing

More information