On Sparse Associative Networks: A Least Squares Formulation
Björn Johansson
August 7, 200

Technical report LiTH-ISY-R-2368
ISSN
Computer Vision Laboratory, Department of Electrical Engineering
Linköping University, SE Linköping, Sweden
bjorn@isy.liu.se

Abstract

This report is a complement to the working document [4], where a sparse associative network is described. It shows that the net learning rule in [4] can be viewed as the solution to a weighted least squares problem. This means that we can apply the theoretical framework of least squares problems and compare the net rule with other iterative algorithms that solve the same problem. The learning rule is compared with the gradient search algorithm and the RPROP algorithm in a simple synthetic experiment. The gradient rule has the slowest convergence, while the associative and the RPROP rules have similar convergence. The associative learning rule has a smaller initial error than the RPROP rule, though. The same experiment also shows that we get faster convergence with a monopolar constraint on the solution, i.e. if the solution is constrained to be non-negative. The least squares error is then a bit higher, but the norm of the solution is smaller, which gives a smaller interpolation error. The report also discusses a generalization of the least squares model that includes other known function approximation models.
Contents

1 Introduction
2 Least squares model
  2.1 Iterative solutions
    2.1.1 Gradient search
    2.1.2 Associative net rule
    2.1.3 RPROP - Resilient propagation
3 The sparse associative network
  3.1 A least squares formulation
  3.2 Choice of normalization mode
  3.3 Generalization
4 Experiments
  4.1 Experiment data
  4.2 Experiment setup
  4.3 Results
  4.4 Conclusions
5 Summary
6 Acknowledgment
A Appendices
  A.1 Proof of theorem 1, section 2.1.1
  A.2 Proof of gradient solution, section 2.1.1
1 Introduction

This report is a complement to the working document [4], where a sparse associative network is described. The network parameters are computed using an iterative update rule. This report shows that the update rule can be viewed as an iterative solution to a weighted least squares problem. This means that we can compare the net rule with other iterative algorithms that solve the same least squares problem. The least squares formulation also makes it easier to compare the associative network with other known high-dimensional function approximation theory, such as the least squares models used in neural networks, radial basis functions, and probabilistic mixture models, see [5].

Section 2 introduces the least squares problem and some iterative methods to compute the solution. Section 3 derives the least squares model corresponding to the net learning rule and uses the least squares approach to analyze the different choices of models (normalization modes) mentioned in [4]. Section 4 evaluates different iterative algorithms and models on a simple synthetic example.

2 Least squares model

Let $A$ be an $M \times N$ matrix, $b$ an $M \times 1$ vector, and $x$ an $N \times 1$ vector. The problem considered in this report is formulated as

$$x_0 = \arg\min_{l \le x \le u} \epsilon(x) \tag{1}$$

where

$$\epsilon(x) = \tfrac{1}{2} \|Ax - b\|_W^2 = \tfrac{1}{2} (Ax - b)^T W (Ax - b) \tag{2}$$

Some comments:

- $W$ is a positive semi-definite diagonal weight matrix, which depends on the application.
- $l \le x \le u$ means element-wise bounds, i.e. $l_i \le x_i \le u_i$. In the case of the sparse associative network we have $l_i = 0$, $u_i = \infty$ for all $i$.
- The factor $\tfrac{1}{2}$ is only included to avoid an extra factor 2 in the gradient.

In the case of infinite boundaries ($l = -\infty$, $u = \infty$) we can compute a solution as

$$x_0 = (A^T W A)^{\dagger} A^T W b \tag{3}$$

where $(\cdot)^{\dagger}$ means pseudo-inverse. The pseudo-inverse is equivalent to the regular inverse $(\cdot)^{-1}$ in case of a unique solution. In case of a non-unique solution ($\operatorname{rank}(A) < N$) the pseudo-inverse gives the minimal norm solution. The minimal norm solution can also be approximated by adding a regularization term

$$\epsilon(x) = \tfrac{1}{2} \|Ax - b\|_W^2 + \tfrac{1}{2} \|x\|_{W_r}^2 = \tfrac{1}{2} (Ax - b)^T W (Ax - b) + \tfrac{1}{2} x^T W_r x \tag{4}$$

where $W_r$ is a diagonal weight matrix.

2.1 Iterative solutions

The sparse associative networks are intended to be used for very high dimensional data applications, where $M$ and $N$ are very large. The analytical solution is therefore not practically possible to compute. There exist several iterative solutions to the least squares problem. The iterative algorithms discussed in this report are based on the gradient

$$\epsilon_x = \frac{\partial \epsilon}{\partial x} = A^T W A x - A^T W b \tag{5}$$

There are many other, more elaborate, iterative update rules for the least squares solution, see e.g. [1, 7, 2]. Some use more refined rules based on the gradient, some are based on higher order derivatives. They should converge faster, but each iteration requires more computations. This report focuses on simple, robust rules based on the gradient, which are suited for very large, sparse systems. Other algorithms may be a topic for future research.

Without the bounds $l$, $u$ the gradient-based update rules can be formulated as

$$x^{p+1} = x^p - f(\epsilon_x^p) \tag{6}$$

where $\epsilon_x^p = \left.\frac{\partial \epsilon}{\partial x}\right|_{x = x^p}$ means the partial derivative at the point $x = x^p$ and $f(\cdot)$ is some suitably chosen update function with the property

$$f(\epsilon_x)^T \epsilon_x \ge 0 \quad \text{with equality iff } \epsilon_x = 0 \tag{7}$$

If we have bounds we simply truncate:

$$x^{p+1} = \min\big(u, \max\big(l,\; x^p - f(\epsilon_x^p)\big)\big) \tag{8}$$

where min and max denote element-wise operations.

The iterative algorithms will converge to the same solution if it is unique. Otherwise the solution will depend on the algorithm (the choice of $f(\cdot)$) and on the initial value. In that case there is no guarantee that we get the minimal norm solution, unless we for example include the regularization term. The next subsections discuss three different choices of $f(\cdot)$. Section 4 contains experiments that compare the different gradient-based update rules on a simple synthetic example.
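As a small illustration of the closed-form solution in equation 3 and the regularized variant in equation 4, the following NumPy sketch solves an underdetermined toy problem. The function names, the test matrix, and the explicit `rcond` threshold are assumptions made for this example; they are not taken from the report.

```python
import numpy as np

def weighted_lstsq(A, b, w):
    """Equation 3: x0 = pinv(A^T W A) A^T W b, with W = diag(w)."""
    AtW = A.T * w                      # scales column m of A^T by w[m], i.e. A^T W
    # rcond is set explicitly so numerically zero singular values are discarded
    return np.linalg.pinv(AtW @ A, rcond=1e-10) @ (AtW @ b)

def regularized_lstsq(A, b, w, w_r):
    """Minimizer of equation 4; w_r is the (positive) diagonal of W_r."""
    AtW = A.T * w
    return np.linalg.solve(AtW @ A + np.diag(w_r), AtW @ b)

rng = np.random.default_rng(0)
A = rng.random((4, 6))                 # M = 4 < N = 6: the minimizer is not unique
b = rng.random(4)
w = np.ones(4)

x_min = weighted_lstsq(A, b, w)                         # minimal norm solution
x_reg = regularized_lstsq(A, b, w, 1e-8 * np.ones(6))   # weak regularization
print(np.linalg.norm(x_min), np.linalg.norm(x_reg))
```

With a very small regularization weight the two solutions should nearly coincide, which is the sense in which equation 4 approximates the minimal norm solution.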
2.1.1 Gradient search

The simple gradient update function is proportional to the gradient,

$$f(\epsilon_x) = \eta\, \epsilon_x \tag{9}$$

and the update rule becomes

$$x^{p+1} = x^p - \eta\, \epsilon_x^p = x^p - \eta\,(A^T W A x^p - A^T W b) \tag{10}$$

It can be shown that the gradient search algorithm converges for $0 < \eta < 2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the matrix $A^T W A$, see e.g. [5]. In the case of a large sparse matrix $A$ the largest eigenvalue can be estimated using for example the power method, see e.g. [3].

There is an interesting special choice of $W$ which gives $\lambda_{\max} = 1$ if all elements in $A$ have the same sign. We state the following theorem:

Theorem 1 Assume $A \ge 0$, i.e. all values in $A$ are non-negative. Let $W = \operatorname{diag}^{-1}(A A^T \mathbf{1})$, where $\mathbf{1} = (1\ 1\ \ldots\ 1)^T$. Then the largest eigenvalue of $A^T W A$ is equal to 1.

The theorem is proven in appendix A.1 and will be used in section 3. It can also be shown that if we use the gradient update rule with initial vector $x^0 = 0$ we get the minimal norm solution, see appendix A.2.

2.1.2 Associative net rule

As will be shown in section 3, the associative learning rule in [4] has the update function $f(\epsilon_x) = D_\eta \epsilon_x$, where

$$D_\eta = \operatorname{diag}(\eta_1, \eta_2, \ldots, \eta_N) \tag{11}$$

and the update rule becomes

$$x^{p+1} = x^p - D_\eta\, \epsilon_x^p = x^p - D_\eta A^T W (A x^p - b) \tag{12}$$

We now have an individual learning rate for each dimension. The gradient search algorithm in section 2.1.1 is the special case $D_\eta = \eta I$. Using the associative net update rule with $x^0 = 0$ does not ensure that we get the minimal norm solution, as it did with the gradient update rule. The only thing we know for certain is that the dimensions of $x$ that do not affect the solution remain zero.
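The truncated update in equation 8 with the update functions of equations 9 and 11 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the toy data, and the choice $\eta = 1/\lambda_{\max}$ (one value inside the admissible interval $0 < \eta < 2/\lambda_{\max}$) are assumptions, and the power method is written in its most basic form.

```python
import numpy as np

def power_method(H, iters=200):
    """Estimate the largest eigenvalue of a symmetric PSD matrix H."""
    v = np.ones(H.shape[0])
    for _ in range(iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return v @ H @ v                   # Rayleigh quotient of the final vector

def bounded_gradient_solve(A, b, w, eta, lower, upper, iters=5000):
    """Iterate equation 8 with f = eta * gradient.

    eta may be a scalar (gradient search, equation 10) or a length-N
    vector holding the diagonal of D_eta (equation 12).
    """
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (w * (A @ x - b))                  # equation 5
        x = np.minimum(upper, np.maximum(lower, x - eta * grad))
    return x

rng = np.random.default_rng(0)
A = rng.random((30, 5))                                 # non-negative toy data
x_true = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
b = A @ x_true
w = np.ones(30)

H = A.T @ (w[:, None] * A)                              # A^T W A
lam_max = power_method(H)
x_hat = bounded_gradient_solve(A, b, w, 1.0 / lam_max, 0.0, np.inf)
print(lam_max, x_hat)
```

For very large sparse systems one would avoid forming $A^T W A$ explicitly and apply the power method through matrix-vector products instead; the dense form is used here only to keep the sketch short.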
2.1.3 RPROP - Resilient propagation

The RPROP algorithm has attracted attention in recent years, see e.g. [7, 8]. It only uses the signs of the gradient $\epsilon_x$. The update function is written as

$$f(\epsilon_x) = D_\eta \operatorname{sign}(\epsilon_x) \tag{13}$$

$D_\eta$ is a diagonal matrix similar to equation 11, and $\operatorname{sign}(\epsilon_x)$ means element-wise sign (define $\operatorname{sign}(0) = 0$). The learning rate $D_\eta$ is in this case adaptive. The basic idea is that we increase the learning rate if we are updating in a consistent direction, otherwise we decrease it. The update rule for the learning rates is

$$\eta_k^p = \begin{cases} \eta^+ \eta_k^{p-1} & \text{if } \epsilon_{x_k}^p \epsilon_{x_k}^{p-1} > 0 \\ \eta^- \eta_k^{p-1} \ \ (\text{and set } \epsilon_{x_k}^p := 0) & \text{if } \epsilon_{x_k}^p \epsilon_{x_k}^{p-1} < 0 \\ \eta_k^{p-1} & \text{if } \epsilon_{x_k}^p \epsilon_{x_k}^{p-1} = 0 \end{cases} \qquad \text{where } 0 < \eta^- < 1 < \eta^+, \quad \epsilon_{x_k}^p = (\epsilon_x^p)_k = \left.\frac{\partial \epsilon}{\partial x_k}\right|_{x^p} \tag{14}$$

$\eta^-$ and $\eta^+$ are called the retardation and acceleration factor, respectively. Good empirical values are $\eta^- = 0.5$ and $\eta^+ = 1.2$. A suitable initial value is for example $\eta_k^0 = 0.01$. Note that when the gradient changes sign we also set the gradient to zero. This avoids unnecessary oscillations in the following iterations. The update rule becomes

$$x^{p+1} = x^p - D_\eta^p \operatorname{sign}(\epsilon_x^p) \tag{15}$$

The main difference between RPROP and most other heuristic algorithms is that the learning rate adjustments and weight changes depend only on the signs of the gradient terms, not their magnitudes. It is argued that the gradient magnitude depends on the scaling of the error function and can change greatly from one step to the next. Also, the gradient vanishes at a minimum, so for rules that use the gradient magnitude the step size becomes smaller and smaller near the minimum. This can give slow convergence near the minimum (cf. the experiments in section 4). Another advantage of RPROP compared to the previous gradient-based methods is that we do not have to choose suitable learning rates; they adapt over time. As for the associative net rule, we cannot say which solution we will get if the solution is non-unique.
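The RPROP rule of equations 13 to 15 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the report: the function name, the toy data, and the iteration count are made up for the example, and the cap `eta_max` on the learning rates is an extra safeguard that does not appear in equation 14.

```python
import numpy as np

def rprop_solve(A, b, w, iters=2000, eta_minus=0.5, eta_plus=1.2,
                eta0=0.01, eta_max=50.0, lower=-np.inf, upper=np.inf):
    """RPROP for the weighted least squares problem, with optional bounds."""
    n = A.shape[1]
    x = np.zeros(n)
    eta = np.full(n, eta0)             # one learning rate per dimension
    prev_grad = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (w * (A @ x - b))                  # equation 5
        s = grad * prev_grad
        # equation 14: accelerate on consistent sign, retard on sign change
        eta = np.where(s > 0, eta * eta_plus,
                       np.where(s < 0, eta * eta_minus, eta))
        eta = np.minimum(eta, eta_max)  # assumption: cap, not part of eq. 14
        grad = np.where(s < 0, 0.0, grad)   # sign change: zero the gradient
        x = np.clip(x - eta * np.sign(grad), lower, upper)  # eqs. 15 and 8
        prev_grad = grad
    return x

rng = np.random.default_rng(0)
A = rng.random((30, 5))
x_true = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
b = A @ x_true
w = np.ones(30)

x_hat = rprop_solve(A, b, w)
print(np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b))
```

Note that only `np.sign(grad)` enters the step, so rescaling the error function (for example multiplying `w` by a constant) leaves the iterates unchanged.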
3 The sparse associative network

To avoid confusion, note that the matrix $A$ in this section is not the same as in section 2 and that some of the vectors here are row vectors instead of column vectors.

The associative net tries to associate an $N$-dimensional input feature vector $a$ with an output response value $u$. The output and input are associated through a $1 \times N$ link vector $c$. The link vector contains the net parameters, which are computed from a set of training samples. For purposes outside the scope of this report, all the involved quantities $a$, $u$, and $c$ are restricted to non-negative values. Another preferable property is that $A$ (and sometimes also $u$) is sparse, giving a sparse link vector. There can be several output responses with one link vector each, but they are optimized independently. We will therefore focus on a scalar output response.

Assume we have $M$ training samples $\{a_k, u_k\}_{k=1}^{M}$. We want to associate the feature vectors $a_k$ with the responses $u_k$ using some suitable criterion. Let $A = (a_1\ a_2\ \ldots\ a_M)$ be the $N \times M$ matrix with the input training data and $u = (u_1\ u_2\ \ldots\ u_M)$ be the $1 \times M$ vector of output training data. The optimization rule used in the working document [4] is

$$\begin{cases} \hat{c}(i+1) = \max\big(\hat{c}(i) - \nu_f \mathbin{.\!*} \big((\hat{u}(i) - u) A^T\big),\; 0\big) \\ \hat{u}(i+1) = \nu_s \mathbin{.\!*} \big(\hat{c}(i+1) A\big) \end{cases} \tag{16}$$

where $.\!*$ denotes element-wise multiplication. (Note that $A$ is not the same matrix as in section 2, but rather its transpose.) $\nu_f$ is a $1 \times N$ vector called the feature domain normalization. $\nu_s$ is a $1 \times M$ vector called the sample domain normalization. $\nu_f$ and $\nu_s$ are functions of $A$. To use the net, we take a feature vector $a$ and compute the output response as

$$\hat{u} = \nu_s \mathbin{.\!*} (\hat{c}\, a) \tag{17}$$

where $\nu_s$ is a function of $a$ and possibly also of other feature vectors.

3.1 A least squares formulation

We will now show that the update rule 16 is the solution to a weighted least squares problem. First, denote

$$D_f = \operatorname{diag}(\nu_f), \qquad D_s = \operatorname{diag}(\nu_s) \tag{18}$$

If we ignore the boundary limit for a moment we can rewrite the update rule as

$$\begin{cases} \hat{c}(i+1) = \hat{c}(i) - (\hat{u}(i) - u) A^T D_f \\ \hat{u}(i+1) = \hat{c}(i+1) A D_s \end{cases} \tag{19}$$

By combining the two equations into one we get

$$\hat{c}(i+1) = \hat{c}(i) - (\hat{u}(i) - u) A^T D_f = \hat{c}(i) - (\hat{c}(i) A D_s - u) A^T D_f = \hat{c}(i) - (\hat{c}(i) A D_s - u) D_s^{-1} D_s A^T D_f \tag{20}$$
If we compare this equation with equation 12 (note that the two equations differ by a transpose) and include the boundary limit again, we can see that update rule 16 is the iterative gradient-based solution to the problem

$$\arg\min_{0 \le c} \|u - c A D_s\|^2_{D_s^{-1}} \tag{21}$$

with $D_f$ serving as the learning rate $D_\eta$ in equation 12. This means that $D_s$ controls the choice of model (normalization of $A$) and weight, while $D_f$ controls the convergence rate. As mentioned in section 2.1.2, in the case of a non-unique solution we cannot know which solution we will get, only that it minimizes the least squares function in equation 21. The solution depends on the initial value $c_0$, and also on the learning rate $D_f$.

3.2 Choice of normalization mode

Three different combinations of $D_f$ and $D_s$ are suggested in [4]. $D_s$ controls the net model and depends on the application. $D_f$ affects the optimization algorithm. $D_f$ in choice 1 is optimal in the special case when all features are uncorrelated. $D_f$ in choices 2 and 3 is more difficult to analyze.

Choice 1: Normalization entirely in the feature domain

$$D_s = I, \qquad D_f = \operatorname{diag}^{-1}(A A^T) = \operatorname{diag}\Big(\frac{1}{\sum_m a_{1m}^2},\ \frac{1}{\sum_m a_{2m}^2},\ \ldots\Big) \tag{22}$$

From the least squares formulation, equation 21, we see that this choice corresponds to the net model

$$u = c\,a \tag{23}$$

$c$ is optimized by non-weighted least squares using a gradient-based update rule with learning rate $D_f$. It is difficult to analyze the convergence properties of $D_f$ except in the very simple case when all features are uncorrelated, i.e. if the rows in $A$ are orthogonal ($A A^T = I$). Then we can optimize each link element $c_k$ independently, and it is easy to show that $D_f$ above is the optimal learning rate (it gives convergence after one iteration). In general, though, we have correlated features.
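The one-iteration convergence claimed for choice 1 can be checked on a tiny example with mutually orthogonal, non-negative feature rows. The sketch below implements one pass of update rule 16 with NumPy; the function name and the numbers are hypothetical and chosen only so that the rows of $A$ have disjoint support.

```python
import numpy as np

def associative_step(c, A, u, nu_f, nu_s):
    """One pass of update rule 16 for row vectors c (length N) and u (length M)."""
    u_hat = nu_s * (c @ A)                          # current net output
    return np.maximum(c - nu_f * ((u_hat - u) @ A.T), 0.0)

# Two features (rows) with disjoint support, so the rows are orthogonal.
A = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0]])
u = np.array([1.0, 2.0, 3.0, 4.0])

nu_f = 1.0 / np.sum(A**2, axis=1)   # choice 1: D_f = diag(1 / sum_m a_nm^2)
nu_s = np.ones(A.shape[1])          # choice 1: D_s = I

c1 = associative_step(np.zeros(2), A, u, nu_f, nu_s)
c2 = associative_step(c1, A, u, nu_f, nu_s)
print(c1, c2)
```

Here the second step leaves the link vector unchanged, because the gradient $(\hat{c} A - u) A^T$ is already zero after the first step. With correlated features this is no longer the case and several iterations are needed.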
Choice 2: Mixed domain normalization

$$D_s = \operatorname{diag}^{-1}(A^T \mathbf{1}) = \operatorname{diag}\Big(\frac{1}{a_1^T \mathbf{1}},\ \frac{1}{a_2^T \mathbf{1}},\ \ldots\Big), \qquad D_f = \operatorname{diag}^{-1}(A \mathbf{1}) = \operatorname{diag}\Big(\frac{1}{\sum_m a_{1m}},\ \frac{1}{\sum_m a_{2m}},\ \ldots\Big) \tag{24}$$

Each diagonal element in $D_s$ is the inverse sum of all feature values in one training sample. $A D_s$ means that each training sample is normalized with its sum, and we get the net model

$$u = \frac{1}{\mathbf{1}^T a}\, c\,a \tag{25}$$

In addition, we use $D_s^{-1}$ as weight in the least squares problem. This means that the feature vectors $a_k$ with the largest sum will have the most impact on the solution.

Choice 3: Normalization entirely in the sample domain

$$D_s = \operatorname{diag}^{-1}(A^T A \mathbf{1}) = \operatorname{diag}\Big(\frac{1}{a_1^T (a_1 + a_2 + \ldots)},\ \frac{1}{a_2^T (a_1 + a_2 + \ldots)},\ \ldots\Big), \qquad D_f = I \tag{26}$$

We can view $a_1 + a_2 + \ldots + a_M$ as an average operation (ignoring the factor $1/M$). This choice therefore corresponds to the net model

$$u = \frac{1}{\bar{a}^T a}\, c\,a, \qquad \text{where } \bar{a} = E[a] \tag{27}$$

Again we use $D_s^{-1}$ as weight, this time meaning that the feature vectors with the largest norm and with a direction close to that of the average vector $\bar{a}$ will have the most impact on the solution.

Since $D_f = I$ the update rule reduces to ordinary gradient search with $\eta = 1$ (section 2.1.1). It was mentioned that the gradient search algorithm converges for $0 < \eta < 2/\lambda_{\max}$. In this case $\lambda_{\max}$ is the largest eigenvalue of the matrix $(A D_s) D_s^{-1} (A D_s)^T = A D_s A^T$. With the choice of $D_s$ above we can use theorem 1 in appendix A.1, which says that we get $\lambda_{\max} = 1$. Note that the theorem only holds for certain if $A \ge 0$, which is the case in the associative networks theory. $D_f = I$ is therefore optimal.
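The three normalization modes, and the statement that $\lambda_{\max}(A D_s A^T) = 1$ for choice 3, can be sketched and checked numerically as below. The function name and the random non-negative test matrix are assumptions made for this illustration.

```python
import numpy as np

def normalization(A, choice):
    """Return (nu_f, nu_s), the diagonals of D_f and D_s, for an N x M matrix A >= 0."""
    N, M = A.shape
    if choice == 1:                    # feature domain, equation 22
        return 1.0 / np.sum(A**2, axis=1), np.ones(M)
    if choice == 2:                    # mixed domain, equation 24
        return 1.0 / A.sum(axis=1), 1.0 / A.sum(axis=0)
    if choice == 3:                    # sample domain, equation 26
        return np.ones(N), 1.0 / (A.T @ A.sum(axis=1))
    raise ValueError("choice must be 1, 2 or 3")

rng = np.random.default_rng(0)
A = rng.random((6, 40)) + 0.01         # N = 6 features, M = 40 samples, all > 0

_, nu_s2 = normalization(A, 2)
col_sums = (A * nu_s2).sum(axis=0)     # each normalized sample sums to one

_, nu_s3 = normalization(A, 3)
H = (A * nu_s3) @ A.T                  # A D_s A^T
lam_max = np.linalg.eigvalsh(H)[-1]
print(lam_max)
```

The largest eigenvalue should come out as 1 up to rounding, so that the step $\eta = 1$ implied by $D_f = I$ lies inside the convergence interval $0 < \eta < 2/\lambda_{\max}$.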
3.3 Generalization

This section is somewhat outside the scope of this report, but it is included to show that the associative net model in equation 21 can be generalized to include other algorithms. For example, we could have independent normalization and weight:

$$\arg\min_{c_l \le c \le c_u} \|u - c A D_s\|^2_W \tag{28}$$

One example of this can be found in [6], which is one of the contributions to the radial basis theory (see [5]). Two choices of normalization were suggested:

$$D_s = I \qquad \text{and} \qquad D_s = \operatorname{diag}\Big(\frac{1}{a_1^T \mathbf{1}},\ \frac{1}{a_2^T \mathbf{1}},\ \ldots\Big) \tag{29}$$

The last choice corresponds to the model

$$u = \frac{1}{\mathbf{1}^T a}\, c\,a \tag{30}$$

which is the same model as the second choice of normalization mode, mixed domain normalization, in section 3.2. Each element in $a$ is in this case a radial basis function. The model parameters $c$ are computed using unweighted least squares ($W = I$). To make the problem well-posed, some form of regularization is often used.

The model in equation 30 is also used in kernel regression theory, or mixture model theory, building on the notion of density estimation. In this case each element in $a$ is a kernel function playing the role of a local density function, e.g. a Gaussian function, see [5]. The normalization factor is an estimate of the underlying probability density function of the input. The model parameters $c$ are found by minimizing a least squares function or by maximizing a likelihood function. The solution is computed using gradient search or expectation-maximization. Again, regularization is often used to make the problem well-posed.

The examples above use $c_l = -\infty$ and $c_u = \infty$, which is one of the major differences from the associative network in [4], which uses the monopolar constraint $c_l = 0$. Another difference is that the above examples use unweighted least squares, whereas a weighted least squares goal function is used for the associative network.
4 Experiments

Section 2 described three different iterative algorithms that solve the least squares problem in section 3: the gradient rule, the RPROP rule, and the associative net rule. These algorithms are compared on a simple synthetic set of data.

4.1 Experiment data

The goal of the experiment is to train an associative net to estimate 2D position from a set of local distance functions, also called feature channels. Given a 2D position $x = (x_1, x_2)$, the local distance functions are computed as

$$d_k(x) = \begin{cases} \cos^2\Big(\dfrac{\|x - x_k\|}{w_k}\Big) & \text{if } \dfrac{\|x - x_k\|}{w_k} \le \dfrac{\pi}{2} \\ 0 & \text{otherwise} \end{cases} \tag{31}$$

where $x_k$ is called the channel center and $w_k$ is called the channel width. Two choices of $\{x_k, w_k\}$ will be explored:

- Regularly placed feature channels: The centers $x_k$ are placed in a regular Cartesian grid and all widths $w_k$ are equal.
- Randomly placed feature channels: All $x_k$ are randomly placed and the widths $w_k$ vary randomly within a limited range.

The two choices are shown in figure 1. The data used to train the system are computed along the spiral function in figure 1. The net is also evaluated on another set of data that is randomly located within the training region, see figure 2. The following list contains some facts about the data:

- $N = 500$ local distance functions (feature channels)
- $M = 200$ training samples
- $M_e = 1000$ evaluation samples randomly located within the training region

The input to the net is a 500-dimensional vector containing the values of the local distance functions, and the output is a 2D vector with the position $x$, i.e.

$$a(x) = \begin{pmatrix} d_1(x) \\ d_2(x) \\ \vdots \\ d_N(x) \end{pmatrix}, \qquad u = x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \tag{32}$$
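The feature channels of equation 31 and the feature matrix of equation 32 can be computed as in the following sketch. The function name, the two example channels, and the three test positions are hypothetical; the grid and spiral of the actual experiment are not reproduced here.

```python
import numpy as np

def channel_features(x, centers, widths):
    """Evaluate equation 31.

    x:       P x 2 array of positions
    centers: N x 2 array of channel centers x_k
    widths:  length-N array of channel widths w_k
    Returns the N x P matrix A whose columns are the vectors a(x) of equation 32.
    """
    # pairwise distances between the N centers and the P positions
    dist = np.linalg.norm(centers[:, None, :] - x[None, :, :], axis=2)
    r = dist / widths[:, None]
    return np.where(r <= np.pi / 2, np.cos(r) ** 2, 0.0)

centers = np.array([[0.0, 0.0], [1.0, 0.0]])
widths = np.array([1.0, 0.5])
x = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

A = channel_features(x, centers, widths)
print(A)
```

Each channel responds with 1 at its own center, decays smoothly, and is exactly zero beyond the distance $w_k \pi / 2$, which is what makes the resulting feature matrix non-negative and sparse.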
4.2 Experiment setup

As discussed in previous sections, different net models can be used. They can be summarized using the sample domain normalization $D_s$:

$$U = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \qquad \begin{cases} u_1 = c_1 A D_s \\ u_2 = c_2 A D_s \end{cases} \iff U = C A D_s, \quad \text{where } C = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} \tag{33}$$

The $1 \times M$ vector $u_i$ contains all the output samples for coordinate $x_i$ and the $N \times M$ matrix $A = (a_1\ a_2\ \ldots)$ contains all the feature channel vectors. The association is computed using least squares, weighted with $W = D_s^{-1}$. We can solve the total system $U = C A D_s$ directly or, equivalently, each system $u_i = c_i A D_s$ separately.

Several combinations of boundaries and models are considered; table 1 lists the cases. Experiments 1 and 2 compare bipolar and monopolar solutions on regularly placed feature channels. Experiments 3 and 4 contain the same comparison on randomly placed feature channels. Experiments 4, 5, and 6 compare three different choices of normalization.

4.3 Results

Table 2 contains the results after training using the different iterative algorithms. We do not have any boundaries in experiments 1 and 3, so there we can compute the analytical solution in equation 3 as well. The training error $e$ is the relative error defined as

$$e = \frac{\|U - C A D_s\|_F}{\|U\|_F} \tag{34}$$

where the norm is the Frobenius norm. The table also contains the norm and the sparsity (nnz = number of non-zero values) of the solution $C$. Figure 3 shows the error during training for each of the experiments and algorithms.

The net is also investigated for its interpolation performance on a set of evaluation data (described in section 4.1). The error between the net output $\hat{u}$ and the true position $u$ is computed for each of the evaluation samples:

$$\Delta u = \|u - \hat{u}\| = \Big\|u - \frac{1}{h(a)}\, C a\Big\| \tag{35}$$

where $h(a)$ depends on the net model, see table 1. The mean value, standard deviation, minimal value, and maximal value of $\Delta u$ are shown in table 3.

4.4 Conclusions

If we compare the three iterative rules we can see that the RPROP rule and the associative net rule have roughly the same convergence rate in all experiments, while the gradient rule is slower. The RPROP rule has a higher error at the beginning because the learning rate has not yet adapted. The computational complexity of RPROP is somewhat higher because we have to update the learning rate as well. This is compensated by a slightly faster convergence rate. The solutions using RPROP and the associative rule also have similar norm and sparsity. They are therefore comparable in performance. The main advantage of the RPROP rule, or any other general iterative rule for the least squares problem, is that we do not have to derive an explicit learning rate for each choice of normalization.

In addition, we can make some observations regarding the choice of model and boundaries:

- In experiments 1 and 2 we had a regular grid of feature channels and compared the bipolar solution (no boundary constraint) in experiment 1 with the monopolar solution (only positive coefficients) in experiment 2. The bipolar case had still not converged after 10000 iterations (the analytical solution has an error close to zero), but the error is at least fairly low.
- In experiments 3 and 4 we ran the same cases on randomly placed feature channels. In this case the difference between bipolar and monopolar coefficients is much more evident, see figure 3. This is because the feature channels are more correlated in this case. Again, the bipolar case had still not converged after 10000 iterations.
- The experiments indicate that we get faster convergence using only monopolar coefficients compared to bipolar coefficients. The cost is a larger error for the solution.
- The analytical solution in experiment 3 has a large norm $\|C\|$ compared to the iterative solutions in the same experiment (this can also be seen in experiment 1). This is because the solution contains a few very large positive and a few very large negative values. The iterative solutions did not converge within 10000 iterations, but they would eventually give a large norm as well.
- By comparing the results from experiments 1 and 3, where we have a bipolar solution, with experiments 2 and 4, where the solution is monopolar, we can see that the norm $\|C\|$ is lower in the monopolar cases. A lower norm may give better robustness to noise and a lower interpolation error, which is partly confirmed by the evaluation results in table 3. This is a topic for future research. A lower norm could also have been accomplished by using both lower and upper constraints, or by including a regularization term. The alternatives might differ in convergence rate, though.
- In experiments 5 and 6 we get a lower interpolation error than in experiment 4. This may be because we use a different model (normalization of the feature vectors), but it may also be because we use a weight. This is a topic for future research. The generalization of the least squares problem, equation 28, where we have independent normalization and weight, allows this to be investigated.
- In experiment 6 we get identical convergence for the gradient rule and the associative update rule, which confirms the statement in section 3.2 that normalization mode 3 is equivalent to gradient search.
- One argument in [4] for having the monopolar constraint is that the link vector becomes much more sparse than without the constraint. A sparse link vector gives a lower computational complexity in the net. Table 2 partly confirms the argument, but only when we have randomly located feature channels and no normalization of the input $a$ (experiment 4). Otherwise we only get a slightly higher sparsity.

5 Summary

This report shows that the learning algorithm for the associative net in [4] can be described as a weighted least squares problem. This allows for comparison with other iterative solution methods. The report compared the associative net update rules with the gradient rule and the RPROP rule in a simple experiment. The gradient rule performs worse than the associative rules. The RPROP rule has a higher initial error, but its convergence rate is comparable to that of the associative net rules.

The experiments also show that the convergence is considerably faster when using only positive (monopolar) values in the solution compared to using both negative and positive (bipolar) values. This holds for all three update rules.

The weighted least squares problem was generalized to include other algorithms as well. This might be of help when the associative net is compared to other function approximation methods. It also allows for investigation of the importance of normalization and weight.

6 Acknowledgment

This work was supported by the Swedish Foundation for Strategic Research, project VISIT - VIsual Information Technology. The author would like to thank the people at CVL for helpful discussions, especially my supervisor Gösta Granlund.
Figure 1: Experiment data. Local distance functions (feature channels) and training samples along a spiral function. Panels: regularly placed feature channels; randomly placed feature channels.
Figure 2: Evaluation data. The evaluation data is randomly located within the training region (spiral).

| Experiment | Feature location | Lower bound, $c_l$ | Model ($D_s$) | Weight, $W = D_s^{-1}$ |
| 1 | Regular | $-\infty$ | $u = ca$ | $I$ |
| 2 | Regular | $0$ | $u = ca$ | $I$ |
| 3 | Random | $-\infty$ | $u = ca$ | $I$ |
| 4 | Random | $0$ | $u = ca$ | $I$ |
| 5 | Random | $0$ | $u = \frac{1}{\mathbf{1}^T a} ca$ | $\operatorname{diag}(A^T \mathbf{1})$ |
| 6 | Random | $0$ | $u = \frac{1}{\bar{a}^T a} ca$ | $\operatorname{diag}(A^T A \mathbf{1})$ |

Common for all experiments:

| # iterations | 10000 |
| Upper bound, $c_u$ | $\infty$ |
| Initial value, $c_0$ | $0$ |
| Regularization term | - |

Table 1: Experiment setup. The three iterative update methods using the gradient rule, the RPROP rule, and the associative net rule are evaluated on each of experiments 1-6.
Figure 3: Error during training for each of the experiments and iterative update rules (one panel per experiment 1-6, each with curves for the gradient rule, the RPROP rule, and the associative rule). Note the logarithmic scale on the x-axis. The gradient rule and the associative rule coincide in the last experiment.
Table 2: Results after training. Columns: error $e$, $\|C\|_F$, nnz($C$). Rows: analytical solution (experiments 1 and 3 only), gradient rule, RPROP rule, and associative net rule for each of experiments 1-6. (Numerical entries not preserved in this transcription.)
Table 3: Interpolation performance on evaluation data. Columns: mean($\Delta u$), std($\Delta u$), min($\Delta u$), max($\Delta u$). Rows: analytical solution (experiments 1 and 3 only), gradient rule, RPROP rule, and associative net rule for each of experiments 1-6. (Numerical entries not preserved in this transcription.)
A Appendices

A.1 Proof of theorem 1, section 2.1.1

The theorem is repeated below:

Theorem 1 Let $A$ be a matrix. Assume $A \ge 0$, i.e. all values in $A$ are non-negative. Assume $W = \operatorname{diag}^{-1}(A A^T \mathbf{1})$, where $\mathbf{1} = (1\ 1\ \ldots\ 1)^T$. Then the largest eigenvalue of $A^T W A$ is equal to 1.

Proof The proof consists of two parts. First, we show that the eigenvalues cannot be larger than 1. Second, we show that there exists an eigenvector that has the eigenvalue 1. We assume that all values in the vector $A A^T \mathbf{1}$ are nonzero so that $W$ exists (since $A \ge 0$ this basically means that no row in $A$ contains only zeros).

1. Let $v$ be an eigenvector of $A^T W A$. We can compute the eigenvalue as

$$\lambda = \frac{v^T A^T W A v}{v^T v} \tag{36}$$

It is a well known fact that $A A^T$ and $A^T A$ have the same non-zero eigenvalues. In this case we also have a weight involved, but we can make a modification and state that $A^T W A = A^T \sqrt{W} \sqrt{W} A$ and $\sqrt{W} A A^T \sqrt{W}$ have the same non-zero eigenvalues. The eigenvalue $\lambda$ above can thus be computed as

$$\lambda = \frac{u^T \sqrt{W} A A^T \sqrt{W} u}{u^T u} \tag{37}$$

for some (eigen-)vector $u$. Let $z = \sqrt{W} u \iff u = \sqrt{W}^{-1} z$ and insert this into the equation:

$$\lambda = \frac{u^T \sqrt{W} A A^T \sqrt{W} u}{u^T u} = \frac{z^T A A^T z}{z^T \sqrt{W}^{-1} \sqrt{W}^{-1} z} = \frac{z^T A A^T z}{z^T W^{-1} z} = \frac{z^T A A^T z}{z^T \operatorname{diag}(A A^T \mathbf{1})\, z} \tag{38}$$

It remains to show that this quotient cannot be larger than 1. To simplify the index notation we let $B = A A^T$ and compute the quotient:

$$\lambda = \frac{z^T B z}{z^T \operatorname{diag}(B \mathbf{1})\, z} = \frac{\sum_{i,j} b_{ij} z_i z_j}{\sum_i \big(\sum_j b_{ij}\big) z_i^2} = \frac{\sum_{i,j} b_{ij} z_i z_j}{\sum_{i,j} b_{ij} z_i^2} = \frac{\sum_i b_{ii} z_i^2 + \sum_{i<j} b_{ij}\, 2 z_i z_j}{\sum_i b_{ii} z_i^2 + \sum_{i<j} b_{ij} (z_i^2 + z_j^2)} \tag{39}$$

In the last equality we have used the symmetry property $b_{ij} = b_{ji}$. For each numerator term $b_{ij}\, 2 z_i z_j$ we have a corresponding denominator term $b_{ij} (z_i^2 + z_j^2)$. The inequality between the arithmetic mean and the geometric mean states that $z_i^2 + z_j^2 \ge 2 z_i z_j$, and since all $b_{ij}$ are non-negative we can conclude that $\lambda \le 1$.

2. Does there exist an eigenvector with eigenvalue $\lambda = 1$? Yes, it is easy to show that $v = A^T \mathbf{1}$ has eigenvalue 1:

$$A^T W A v = A^T (W A A^T \mathbf{1}) = A^T \big(\operatorname{diag}^{-1}(A A^T \mathbf{1})\, A A^T \mathbf{1}\big) = A^T \mathbf{1} = v \tag{40}$$

(In the third equality we used the fact that $\operatorname{diag}^{-1}(x)\, x = \mathbf{1}$.)

A.2 Proof of gradient solution, section 2.1.1

Theorem 2 Let

$$S = \{x : \epsilon(x) = \|Ax - b\|^2 \text{ is minimum}\} \tag{41}$$

Then the gradient search algorithm $x^{p+1} = x^p - \eta\, \epsilon_x(x^p)$ with initial value $x^0 = 0$ gives the solution $x_0 \in S$ with minimal Euclidean norm. (The weight $W$ is ignored here; it does not affect the theorem.)

Proof Let

$$A^T A = V \Sigma V^T, \qquad \Sigma = \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix}, \qquad V^T V = V V^T = I \tag{42}$$

be the SVD decomposition of $A^T A$ ($r = \operatorname{rank}(A^T A)$). Furthermore, let $y = V^T x$ ($y$ will have the same norm as $x$). The gradient update rule in equation 10 can then be written

$$y^{p+1} = y^p - \eta (\Sigma y^p - s), \qquad \text{where } s = V^T A^T b \tag{43}$$

$y^{p+1}$ can be expressed as a function of the initial value $y^0 = V^T x^0$ as

$$y^{p+1} = (I - \eta \Sigma) y^p + \eta s = (I - \eta \Sigma)^2 y^{p-1} + (I - \eta \Sigma) \eta s + \eta s = \ldots = (I - \eta \Sigma)^{p+1} y^0 + \Big(\sum_{k=0}^{p} (I - \eta \Sigma)^k\Big) \eta s \tag{44}$$
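Assume $x_S \in S$. It has the property $A^T A x_S = A^T b$ (the gradient of $\epsilon(x)$ equals zero). We can then write

$$s = V^T A^T A x_S = \Sigma y_S, \qquad \text{where } y_S = V^T x_S \tag{45}$$

The update can then be written

$$y^{p+1} = (I - \eta \Sigma)^{p+1} y^0 + \Big(\sum_{k=0}^{p} (I - \eta \Sigma)^k\Big) \eta \Sigma\, y_S \tag{46}$$

By using the property $\big(\sum_{k=0}^{N} B^k\big)(I - B) = I - B^{N+1}$ with $B = I - \eta \Sigma$ we can write the update rule as

$$y^{p+1} = (I - \eta \Sigma)^{p+1} y^0 + \big(I - (I - \eta \Sigma)^{p+1}\big) y_S \tag{47}$$

or, equivalently,

$$y^{p+1} = \begin{pmatrix} (I - \eta \Sigma_r)^{p+1} & 0 \\ 0 & I \end{pmatrix} y^0 + \begin{pmatrix} I - (I - \eta \Sigma_r)^{p+1} & 0 \\ 0 & 0 \end{pmatrix} y_S \tag{48}$$

After convergence (if $\eta$ is suitably chosen) we get

$$y^{\infty} = \begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix} y^0 + \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} y_S \tag{49}$$

and we can see that if we choose $y^0 = 0$ we get the solution

$$y^{\infty} = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} y_S \tag{50}$$

which can be shown to be the minimal norm solution, see e.g. [3].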
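Theorem 2 can be illustrated numerically on a small underdetermined system: gradient search started at zero ends up at the pseudo-inverse (minimal norm) solution, while another starting point gives a different minimizer with larger norm. The function name, the matrix, and the iteration count in this sketch are assumptions chosen for the example.

```python
import numpy as np

def gradient_search(A, b, x0, eta, iters=5000):
    """Unweighted gradient rule of equation 10 (W = I), starting at x0."""
    x = x0.astype(float)               # astype returns a copy, x0 is untouched
    for _ in range(iters):
        x -= eta * (A.T @ (A @ x - b))
    return x

# rank 3 < N = 5, so the set S of minimizers is a 2-dimensional affine space
A = np.array([[1.0, 2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 1.0, 1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

eta = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / lambda_max(A^T A), inside (0, 2/lambda_max)

x_zero = gradient_search(A, b, np.zeros(5), eta)   # starts at x^0 = 0
x_ones = gradient_search(A, b, np.ones(5), eta)    # starts elsewhere
x_min = np.linalg.pinv(A) @ b                      # minimal norm solution

print(np.linalg.norm(x_zero), np.linalg.norm(x_ones), np.linalg.norm(x_min))
```

Both runs solve $Ax = b$, but only the run started at zero matches `x_min`: the component of the starting point in the null space of $A$ is never changed by the update, exactly as the lower-right identity block in equation 49 indicates.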
References

[1] M. Adlers. Topics in Sparse Least Squares Problems. PhD thesis, Linköping University, Linköping, Sweden, Dept. of Mathematics, Dissertation No
[2] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Society for Industrial and Applied Mathematics, 1996.
[3] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition, 1989.
[4] G. Granlund. Parallel Learning in Artificial Vision Systems: Working Document. Technical report, Dept. EE, Linköping University,
[5] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999. ISBN
[6] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-293, 1989.
[7] Russel D. Reed and Robert J. Marks II. Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks. MIT Press, 1999.
[8] M. Riedmiller and H. Braun. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. In Proceedings of the IEEE International Conference on Neural Networks, volume 1, San Francisco, CA,
More information8.1 Concentration inequality for Gaussian random matrix (cont d)
MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration
More information7 Principal Component Analysis
7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview & Matrix-Vector Multiplication Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 20 Outline 1 Course
More informationLeast Squares Optimization
Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques, which are widely used to analyze and visualize data. Least squares (LS)
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationSpectral Regularization
Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More informationMobile Robotics 1. A Compact Course on Linear Algebra. Giorgio Grisetti
Mobile Robotics 1 A Compact Course on Linear Algebra Giorgio Grisetti SA-1 Vectors Arrays of numbers They represent a point in a n dimensional space 2 Vectors: Scalar Product Scalar-Vector Product Changes
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More information14 Singular Value Decomposition
14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationLecture 1: Review of linear algebra
Lecture 1: Review of linear algebra Linear functions and linearization Inverse matrix, least-squares and least-norm solutions Subspaces, basis, and dimension Change of basis and similarity transformations
More informationLinear Algebra and Eigenproblems
Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details
More informationBackprojection of Some Image Symmetries Based on a Local Orientation Description
Backprojection of Some Image Symmetries Based on a Local Orientation Description Björn Johansson Computer Vision Laboratory Department of Electrical Engineering Linköping University, SE-581 83 Linköping,
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationBackground Mathematics (2/2) 1. David Barber
Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and
More informationLinear Subspace Models
Linear Subspace Models Goal: Explore linear models of a data set. Motivation: A central question in vision concerns how we represent a collection of data vectors. The data vectors may be rasterized images,
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationLeast Squares Optimization
Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. I assume the reader is familiar with basic linear algebra, including the
More informationReview of some mathematical tools
MATHEMATICAL FOUNDATIONS OF SIGNAL PROCESSING Fall 2016 Benjamín Béjar Haro, Mihailo Kolundžija, Reza Parhizkar, Adam Scholefield Teaching assistants: Golnoosh Elhami, Hanjie Pan Review of some mathematical
More informationc Springer, Reprinted with permission.
Zhijian Yuan and Erkki Oja. A FastICA Algorithm for Non-negative Independent Component Analysis. In Puntonet, Carlos G.; Prieto, Alberto (Eds.), Proceedings of the Fifth International Symposium on Independent
More informationPrincipal Components Theory Notes
Principal Components Theory Notes Charles J. Geyer August 29, 2007 1 Introduction These are class notes for Stat 5601 (nonparametrics) taught at the University of Minnesota, Spring 2006. This not a theory
More informationChapter Two Elements of Linear Algebra
Chapter Two Elements of Linear Algebra Previously, in chapter one, we have considered single first order differential equations involving a single unknown function. In the next chapter we will begin to
More informationLearning Gaussian Process Models from Uncertain Data
Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
More informationTHE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR
THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR WEN LI AND MICHAEL K. NG Abstract. In this paper, we study the perturbation bound for the spectral radius of an m th - order n-dimensional
More informationLearning with Singular Vectors
Learning with Singular Vectors CIS 520 Lecture 30 October 2015 Barry Slaff Based on: CIS 520 Wiki Materials Slides by Jia Li (PSU) Works cited throughout Overview Linear regression: Given X, Y find w:
More informationCHAPTER 11. A Revision. 1. The Computers and Numbers therein
CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of
More informationLogistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017
1 Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm1 that works by trying to learn a function
More informationBackward Error Estimation
Backward Error Estimation S. Chandrasekaran E. Gomez Y. Karant K. E. Schubert Abstract Estimation of unknowns in the presence of noise and uncertainty is an active area of study, because no method handles
More informationSingular Value Decomposition
Singular Value Decomposition (Com S 477/577 Notes Yan-Bin Jia Sep, 7 Introduction Now comes a highlight of linear algebra. Any real m n matrix can be factored as A = UΣV T where U is an m m orthogonal
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationLinear Models for Regression. Sargur Srihari
Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More informationData Mining Lecture 4: Covariance, EVD, PCA & SVD
Data Mining Lecture 4: Covariance, EVD, PCA & SVD Jo Houghton ECS Southampton February 25, 2019 1 / 28 Variance and Covariance - Expectation A random variable takes on different values due to chance The
More information1 Kernel methods & optimization
Machine Learning Class Notes 9-26-13 Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions
More informationWe wish to solve a system of N simultaneous linear algebraic equations for the N unknowns x 1, x 2,...,x N, that are expressed in the general form
Linear algebra This chapter discusses the solution of sets of linear algebraic equations and defines basic vector/matrix operations The focus is upon elimination methods such as Gaussian elimination, and
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationLecture 8: Linear Algebra Background
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected
More informationFast Approximate Matrix Multiplication by Solving Linear Systems
Electronic Colloquium on Computational Complexity, Report No. 117 (2014) Fast Approximate Matrix Multiplication by Solving Linear Systems Shiva Manne 1 and Manjish Pal 2 1 Birla Institute of Technology,
More informationLarge Scale Data Analysis Using Deep Learning
Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset
More information(v, w) = arccos( < v, w >
MA322 Sathaye Notes on Inner Products Notes on Chapter 6 Inner product. Given a real vector space V, an inner product is defined to be a bilinear map F : V V R such that the following holds: For all v
More informationComparative Performance Analysis of Three Algorithms for Principal Component Analysis
84 R. LANDQVIST, A. MOHAMMED, COMPARATIVE PERFORMANCE ANALYSIS OF THR ALGORITHMS Comparative Performance Analysis of Three Algorithms for Principal Component Analysis Ronnie LANDQVIST, Abbas MOHAMMED Dept.
More information1 Inner Product and Orthogonality
CSCI 4/Fall 6/Vora/GWU/Orthogonality and Norms Inner Product and Orthogonality Definition : The inner product of two vectors x and y, x x x =.., y =. x n y y... y n is denoted x, y : Note that n x, y =
More informationRadial Basis Function Networks. Ravi Kaushik Project 1 CSC Neural Networks and Pattern Recognition
Radial Basis Function Networks Ravi Kaushik Project 1 CSC 84010 Neural Networks and Pattern Recognition History Radial Basis Function (RBF) emerged in late 1980 s as a variant of artificial neural network.
More informationProperties of Matrices and Operations on Matrices
Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,
More informationOn Optimal Frame Conditioners
On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,
More informationEfficient and Accurate Rectangular Window Subspace Tracking
Efficient and Accurate Rectangular Window Subspace Tracking Timothy M. Toolan and Donald W. Tufts Dept. of Electrical Engineering, University of Rhode Island, Kingston, RI 88 USA toolan@ele.uri.edu, tufts@ele.uri.edu
More informationNotes on Linear Algebra and Matrix Theory
Massimo Franceschet featuring Enrico Bozzo Scalar product The scalar product (a.k.a. dot product or inner product) of two real vectors x = (x 1,..., x n ) and y = (y 1,..., y n ) is not a vector but a
More informationSubset selection for matrices
Linear Algebra its Applications 422 (2007) 349 359 www.elsevier.com/locate/laa Subset selection for matrices F.R. de Hoog a, R.M.M. Mattheij b, a CSIRO Mathematical Information Sciences, P.O. ox 664, Canberra,
More informationMAT 610: Numerical Linear Algebra. James V. Lambers
MAT 610: Numerical Linear Algebra James V Lambers January 16, 2017 2 Contents 1 Matrix Multiplication Problems 7 11 Introduction 7 111 Systems of Linear Equations 7 112 The Eigenvalue Problem 8 12 Basic
More informationSVD, PCA & Preprocessing
Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees
More informationMLCC 2015 Dimensionality Reduction and PCA
MLCC 2015 Dimensionality Reduction and PCA Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2015 Outline PCA & Reconstruction PCA and Maximum Variance PCA and Associated Eigenproblem Beyond the First Principal Component
More informationOrdinary Least Squares Linear Regression
Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics
More informationMathematical foundations - linear algebra
Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar
More informationChannel Representation of Colour Images
Channel Representation of Colour Images Report LiTH-ISY-R-2418 Per-Erik Forssén, Gösta Granlund and Johan Wiklund Computer Vision Laboratory, Department of Electrical Engineering Linköping University,
More informationLinear Algebra in Computer Vision. Lecture2: Basic Linear Algebra & Probability. Vector. Vector Operations
Linear Algebra in Computer Vision CSED441:Introduction to Computer Vision (2017F Lecture2: Basic Linear Algebra & Probability Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Mathematics in vector space Linear
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationMachine Learning Applied to 3-D Reservoir Simulation
Machine Learning Applied to 3-D Reservoir Simulation Marco A. Cardoso 1 Introduction The optimization of subsurface flow processes is important for many applications including oil field operations and
More informationWHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,
WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu
More informationApplied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic
Applied Mathematics 205 Unit II: Numerical Linear Algebra Lecturer: Dr. David Knezevic Unit II: Numerical Linear Algebra Chapter II.3: QR Factorization, SVD 2 / 66 QR Factorization 3 / 66 QR Factorization
More informationCS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares
CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares Robert Bridson October 29, 2008 1 Hessian Problems in Newton Last time we fixed one of plain Newton s problems by introducing line search
More informationLecture 9: Numerical Linear Algebra Primer (February 11st)
10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template
More informationSUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA
SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,
More informationMACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA
1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR
More informationLinear Algebra for Machine Learning. Sargur N. Srihari
Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it
More informationOn the Equivariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund Computer Vision Laboratory, Depa
On the Invariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund LiTH-ISY-R-530 993-09-08 On the Equivariance of the Orientation and the Tensor Field
More informationA Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute
More informationLinear Algebra Review. Vectors
Linear Algebra Review 9/4/7 Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa (UCSD) Cogsci 8F Linear Algebra review Vectors
More informationKasetsart University Workshop. Multigrid methods: An introduction
Kasetsart University Workshop Multigrid methods: An introduction Dr. Anand Pardhanani Mathematics Department Earlham College Richmond, Indiana USA pardhan@earlham.edu A copy of these slides is available
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationTWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES. Mika Inki and Aapo Hyvärinen
TWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES Mika Inki and Aapo Hyvärinen Neural Networks Research Centre Helsinki University of Technology P.O. Box 54, FIN-215 HUT, Finland ABSTRACT
More informationTutorials in Optimization. Richard Socher
Tutorials in Optimization Richard Socher July 20, 2008 CONTENTS 1 Contents 1 Linear Algebra: Bilinear Form - A Simple Optimization Problem 2 1.1 Definitions........................................ 2 1.2
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationNormed & Inner Product Vector Spaces
Normed & Inner Product Vector Spaces ECE 174 Introduction to Linear & Nonlinear Optimization Ken Kreutz-Delgado ECE Department, UC San Diego Ken Kreutz-Delgado (UC San Diego) ECE 174 Fall 2016 1 / 27 Normed
More informationMODELS OF LEARNING AND THE POLAR DECOMPOSITION OF BOUNDED LINEAR OPERATORS
Eighth Mississippi State - UAB Conference on Differential Equations and Computational Simulations. Electronic Journal of Differential Equations, Conf. 19 (2010), pp. 31 36. ISSN: 1072-6691. URL: http://ejde.math.txstate.edu
More informationCS 340 Lec. 6: Linear Dimensionality Reduction
CS 340 Lec. 6: Linear Dimensionality Reduction AD January 2011 AD () January 2011 1 / 46 Linear Dimensionality Reduction Introduction & Motivation Brief Review of Linear Algebra Principal Component Analysis
More informationCHAPTER 7. Regression
CHAPTER 7 Regression This chapter presents an extended example, illustrating and extending many of the concepts introduced over the past three chapters. Perhaps the best known multi-variate optimisation
More informationA Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag
A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data
More informationPrincipal Component Analysis and Linear Discriminant Analysis
Principal Component Analysis and Linear Discriminant Analysis Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/29
More informationMatrix Factorizations
1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular
More informationThis appendix provides a very basic introduction to linear algebra concepts.
APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not
More informationMaths for Signals and Systems Linear Algebra in Engineering
Maths for Signals and Systems Linear Algebra in Engineering Lectures 13 15, Tuesday 8 th and Friday 11 th November 016 DR TANIA STATHAKI READER (ASSOCIATE PROFFESOR) IN SIGNAL PROCESSING IMPERIAL COLLEGE
More informationFixed Weight Competitive Nets: Hamming Net
POLYTECHNIC UNIVERSITY Department of Computer and Information Science Fixed Weight Competitive Nets: Hamming Net K. Ming Leung Abstract: A fixed weight competitive net known as the Hamming net is discussed.
More informationOptimization of Gaussian Process Hyperparameters using Rprop
Optimization of Gaussian Process Hyperparameters using Rprop Manuel Blum and Martin Riedmiller University of Freiburg - Department of Computer Science Freiburg, Germany Abstract. Gaussian processes are
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More informationA Program for Data Transformations and Kernel Density Estimation
A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate
More informationLINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12,
LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12, 2000 74 6 Summary Here we summarize the most important information about theoretical and numerical linear algebra. MORALS OF THE STORY: I. Theoretically
More informationLecture: Local Spectral Methods (1 of 4)
Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide
More informationExtreme Values and Positive/ Negative Definite Matrix Conditions
Extreme Values and Positive/ Negative Definite Matrix Conditions James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University November 8, 016 Outline 1
More informationGrassmann Averages for Scalable Robust PCA Supplementary Material
Grassmann Averages for Scalable Robust PCA Supplementary Material Søren Hauberg DTU Compute Lyngby, Denmark sohau@dtu.dk Aasa Feragen DIKU and MPIs Tübingen Denmark and Germany aasa@diku.dk Michael J.
More informationarxiv: v1 [math.na] 1 Sep 2018
On the perturbation of an L -orthogonal projection Xuefeng Xu arxiv:18090000v1 [mathna] 1 Sep 018 September 5 018 Abstract The L -orthogonal projection is an important mathematical tool in scientific computing
More information