On Sparse Associative Networks: A Least Squares Formulation


Björn Johansson

August 7, 2001

Technical report LiTH-ISY-R-2368

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
bjorn@isy.liu.se

Abstract

This report is a complement to the working document [4], where a sparse associative network is described. This report shows that the net learning rule in [4] can be viewed as the solution to a weighted least squares problem. This means that we can apply the theory framework of least squares problems and compare the net rule with some other iterative algorithms that solve the same problem. The learning rule is compared with the gradient search algorithm and the RPROP algorithm in a simple synthetic experiment. The gradient rule has the slowest convergence, while the associative and RPROP rules have similar convergence; the associative learning rule has a smaller initial error than the RPROP rule, though. The same experiment also shows that we get faster convergence if we have a monopolar constraint on the solution, i.e. if the solution is constrained to be non-negative. The least squares error is then a bit higher, but the norm of the solution is smaller, which gives a smaller interpolation error. The report also discusses a generalization of the least squares model which includes other known function approximation models.

Contents

1 Introduction
2 Least squares model
  2.1 Iterative solutions
    2.1.1 Gradient search
    2.1.2 Associative net rule
    2.1.3 RPROP - Resilient propagation
3 The sparse associative network
  3.1 A least squares formulation
  3.2 Choice of normalization mode
  3.3 Generalization
4 Experiments
  4.1 Experiment data
  4.2 Experiment setup
  4.3 Results
  4.4 Conclusions
5 Summary
6 Acknowledgment
A Appendices
  A.1 Proof of theorem 1, section 2.1.1
  A.2 Proof of gradient solution, section 2.1.1

1 Introduction

This report is a complement to the working document [4], where a sparse associative network is described. The network parameters are computed using an iterative update rule. This report shows that the update rule can be viewed as an iterative solution to a weighted least squares problem. This means that we can compare the net rule with some other iterative algorithms that solve the same least squares problem. The least squares formulation also makes it easier to compare the associative network with other known high-dimensional function approximation theory, such as the least squares models used in neural networks, radial basis functions, and probabilistic mixture models, see [5].

Section 2 introduces the least squares problem and some iterative methods to compute the solution. Section 3 derives the least squares model corresponding to the net learning rule and analyzes the different choices of models (normalization modes) mentioned in [4] using the least squares approach. Section 4 evaluates different iterative algorithms and models on a simple synthetic example.

2 Least squares model

Let A be an M × N matrix, b an M × 1 vector, and x an N × 1 vector. The problem considered in this report is formulated as

    x_0 = arg min_{l ≤ x ≤ u} ε(x)                                          (1)

where

    ε(x) = (1/2) ||Ax - b||^2_W = (1/2) (Ax - b)^T W (Ax - b)               (2)

Some comments:

W is a positive semi-definite diagonal weight matrix, which depends on the application.

l ≤ x ≤ u means element-wise bounds, i.e. l_i ≤ x_i ≤ u_i. In the case of the sparse associative network we have l_i = 0, u_i = ∞ for all i.

The factor 1/2 is only included to avoid an extra factor 2 in the gradient.

In the case of infinite boundaries (l = -∞, u = ∞) we can compute a solution as

    x_0 = (A^T W A)^+ A^T W b                                               (3)

where (.)^+ denotes the pseudo-inverse. The pseudo-inverse is equivalent to the regular inverse (.)^{-1} in the case of a unique solution.
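As a concrete illustration of equation (3), the minimal NumPy sketch below computes the weighted least squares solution via the pseudo-inverse; the problem sizes and random data are assumptions made only for the example (the report's point is that M and N are far too large for this direct computation in practice).

```python
import numpy as np

# Minimal sketch of equation (3): x0 = (A^T W A)^+ A^T W b.
rng = np.random.default_rng(0)
M, N = 20, 10
A = rng.random((M, N))
b = rng.random(M)
w = rng.random(M) + 0.1          # diagonal of the weight matrix W (positive)
W = np.diag(w)

x0 = np.linalg.pinv(A.T @ W @ A) @ (A.T @ W @ b)

# Sanity check: the weighted gradient A^T W (A x0 - b) should be (numerically) zero.
print(np.linalg.norm(A.T @ W @ (A @ x0 - b)))
```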

4 (rank(a) <N) the pseudo-inverse gives the minimal norm solution. The minimal norm solution can also approximately be achieved by adding a regularization term ɛ(x) = 2 Ax b 2 W + 2 x 2 W r = 2 (Ax b)t W(Ax b)+ 2 xt W r x where W r is a diagonal weight matrix. (4) 2. Iterative solutions The sparse associative networks are intended to be used for very high dimensional data applications, where M, N are very large. The analytical solution is therefore not practically possible to compute. There exist several iterative solutions to the least squares problem. The iterative algorithms discussed in this report are based on the gradient ɛ x = ɛ x = AT WAx A T Wb (5) There are many other, more elaborate, iterative update rules for the least squares solution, see e.g. [, 7, 2]. Some use more refined rules based on the gradient, some are based on higher order derivatives. They should converge faster, but there will be more computations in each iteration. This report focuses on simple, robust rules based on the gradient, which is suited for very large, sparse systems. But other algorithms may be a topic for future research. as Without the bounds l, u the gradient-based update rules can be formulated x p+ = x p f(ɛ p x) (6) where ɛ p x = ɛ x x p means the partial derivative at the point x = x p and f(.) is some suitably chosen update function with the property f(ɛ x ) 0 with equality iff ɛ x = 0 (7) If we have bounds we simply truncate: ( ( )) x p+ = min u, max l, x p f(ɛ p x) (8) where min and max denote element-wise operations. The iterative algorithms will converge to the same solution if it is unique, otherwise the solution will depend on the algorithm (choice of f(.)) and on the initial value. Then there is no guarantee that we will get the minimal norm solution, unless we for example include the regularization term. The next subsections discuss three different choices of f(.). Section 4 contain experiments which compare the different gradient-based update rules on a simple synthetic example. 4

2.1.1 Gradient search

The simple gradient update function is proportional to the gradient,

    f(ε_x) = η ε_x                                                          (9)

and the update rule becomes

    x_{p+1} = x_p - η ε_x^p = x_p - η (A^T W A x_p - A^T W b)               (10)

It can be shown that the gradient search algorithm converges for 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of the matrix A^T W A, see e.g. [5]. In the case of a large sparse matrix A the largest eigenvalue can be estimated using for example the power method, see e.g. [3]. There is an interesting special choice of W which gives λ_max = 1 if all elements in A have the same sign. We state the following theorem:

Theorem 1  Assume A ≥ 0, i.e. all values in A are non-negative. Let W = diag(A A^T 1)^{-1}, where 1 = (1 1 ... 1)^T. Then the largest eigenvalue of A^T W A is equal to 1.

The theorem is proven in appendix A.1. It will be used in section 3. It can also be shown that if we use the gradient update rule with initial vector x_0 = 0 we get the minimal norm solution, see appendix A.2.

2.1.2 Associative net rule

As will be shown later in section 3, the associative learning rule in [4] has the update function f(ε_x) = D_η ε_x, where

    D_η = diag(η_1, η_2, ..., η_N)                                          (11)

and the update rule becomes

    x_{p+1} = x_p - D_η ε_x^p = x_p - D_η A^T W (A x_p - b)                 (12)

We now have an individual learning rate for each dimension. The gradient search algorithm in section 2.1.1 is the special case where D_η = η I. Using the associative net update rule with x_0 = 0 does not ensure that we get the minimal norm solution, as it did with the gradient update rule. The only thing we know for certain is that the dimensions of x that do not affect the solution remain zero.
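The sketch below illustrates gradient search (equations (9)-(10)) with the step size taken from a power-method estimate of λ_max(A^T W A), written in the form of equation (12) with a diagonal D_η (here simply η I). The data and iteration counts are illustrative assumptions.

```python
import numpy as np

# Gradient search with eta chosen from a power-method estimate of lambda_max.
rng = np.random.default_rng(2)
M, N = 40, 20
A = rng.random((M, N))
b = rng.random(M)
w = np.ones(M)                              # diagonal of W

v = rng.random(N)
for _ in range(200):                        # power method on A^T W A
    v = A.T @ (w * (A @ v))
    v /= np.linalg.norm(v)
lam_max = v @ (A.T @ (w * (A @ v)))

eta = 1.0 / lam_max                         # safely inside 0 < eta < 2 / lambda_max
d_eta = np.full(N, eta)                     # diagonal of D_eta; one rate per dimension
x = np.zeros(N)
for _ in range(2000):
    x -= d_eta * (A.T @ (w * (A @ x - b)))  # equation (12); equation (10) when D_eta = eta*I
print(np.linalg.norm(A @ x - b))            # residual after training
```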

2.1.3 RPROP - Resilient propagation

The RPROP algorithm has attracted attention in recent years, see e.g. [7, 8]. It only uses the signs of the gradient ε_x. The update function is written as

    f(ε_x) = D_η sign(ε_x)                                                  (13)

where D_η is a diagonal matrix similar to equation (11) and sign(ε_x) means element-wise sign (define sign(0) = 0). The learning rate D_η is in this case adaptive. The basic idea is that we increase the learning rate if we are updating in a consistent direction, otherwise we decrease it. The update rule for the learning rates is

    η_k^p = η^+ η_k^{p-1}    if ε_{x_k}^p ε_{x_k}^{p-1} > 0
    η_k^p = η^- η_k^{p-1}    if ε_{x_k}^p ε_{x_k}^{p-1} < 0  (and set ε_{x_k}^p := 0)
    η_k^p = η_k^{p-1}        if ε_{x_k}^p ε_{x_k}^{p-1} = 0                 (14)

where 0 < η^- < 1 < η^+ and ε_{x_k}^p = (ε_x^p)_k denotes component k of the gradient at x = x_p. η^- and η^+ are called the retardation and acceleration factor respectively. Good empirical values are η^- = 0.5 and η^+ = 1.2. A suitable initial value is for example η_k^0 = 0.01. Note that in the case of the gradient changing sign we also set the gradient to zero. This avoids unnecessary oscillations in the following iterations. The update rule becomes

    x_{p+1} = x_p - D_η^p sign(ε_x^p)                                       (15)

The main difference between RPROP and most other heuristic algorithms is that the learning rate adjustments and weight changes depend only on the signs of the gradient terms, not their magnitudes. It is argued that the gradient magnitude depends on the scaling of the error function and can change greatly from one step to the next. Also, the gradient vanishes at a minimum, so the step size becomes smaller and smaller as the iterate nears the minimum. This can give slow convergence near the minimum (cf. the experiments in section 4). Another advantage of RPROP compared to the previous gradient-based methods is that we do not have to choose suitable learning rates; they adapt over time. As for the associative net rule, we cannot say which solution we will get if the solution is non-unique.
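The following minimal sketch applies the RPROP update of equations (13)-(15) to the least squares gradient of equation (5), with η^- = 0.5 and η^+ = 1.2 as quoted above. The data, the initial step size, and the iteration count are assumptions made only for the illustration.

```python
import numpy as np

# RPROP applied to the least squares gradient.
rng = np.random.default_rng(3)
M, N = 40, 20
A = rng.random((M, N))
b = rng.random(M)
w = np.ones(M)

x = np.zeros(N)
step = np.full(N, 0.01)                  # initial learning rates eta_k^0
g_old = np.zeros(N)
for _ in range(2000):
    g = A.T @ (w * (A @ x - b))          # gradient, equation (5)
    same = g * g_old > 0                 # consistent direction: accelerate
    flip = g * g_old < 0                 # sign change: retard and zero the gradient
    step[same] *= 1.2                    # eta+
    step[flip] *= 0.5                    # eta-
    g[flip] = 0.0
    x -= step * np.sign(g)               # equation (15)
    g_old = g
print(np.linalg.norm(A @ x - b))         # residual after training
```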

3 The sparse associative network

To avoid confusion, note that the matrix A in this section is not the same as in section 2, and that some of the vectors here are row vectors instead of column vectors. The associative net tries to associate an N-dimensional input feature vector a to an output response value u. The output and the input are associated by a 1 × N link vector c. The link vector contains the net parameters that are computed using a set of training samples. For purposes outside the scope of this report, all the involved quantities a, u, and c are restricted to having non-negative values. Another preferable property is that A (and sometimes also u) is sparse, giving a sparse link vector. There can be several output responses, with a link vector for each one of them, but they are optimized independently. We will therefore focus on a scalar output response.

Assume we have M training samples {a_k, u_k}_{k=1}^M. We want to associate the feature vectors a_k with the responses u_k using some suitable criterion. Let A = (a_1 a_2 ... a_M) be the N × M matrix with the input training data and u = (u_1 u_2 ... u_M) be the 1 × M vector of output training data. The optimization rule used in the working document [4] is

    ĉ(i+1) = max(ĉ(i) - ν_f .* ((û(i) - u) A^T), 0)
    û(i+1) = ν_s .* (ĉ(i+1) A)                                              (16)

(Note that A is not the same matrix as in section 2, rather the transpose.) Here .* denotes element-wise multiplication, ν_f is a 1 × N vector called the feature domain normalization, and ν_s is a 1 × M vector called the sample domain normalization. ν_f and ν_s are functions of A. To use the net, we take a feature vector a and compute the output response as

    û = ν_s .* (ĉ a)                                                        (17)

where ν_s is a function of a and possibly also of other feature vectors.

3.1 A least squares formulation

We will now show that the update rule (16) is the solution to a weighted least squares problem. First, denote

    D_f = diag(ν_f),   D_s = diag(ν_s)                                      (18)

If we ignore the boundary limit for a moment we can rewrite the update rule as

    ĉ(i+1) = ĉ(i) - (û(i) - u) A^T D_f
    û(i+1) = ĉ(i+1) A D_s                                                   (19)

By combining the two equations into one we get

    ĉ(i+1) = ĉ(i) - (û(i) - u) A^T D_f
           = ĉ(i) - (ĉ(i) A D_s - u) A^T D_f
           = ĉ(i) - (ĉ(i) A D_s - u) D_s^{-1} D_s A^T D_f                   (20)

If we compare this equation with equation (12) (note that the two equations differ by a transpose) and include the boundary limit again, we can see that update rule (16) is the iterative gradient-based solution to the problem

    arg min_{0 ≤ c} ||u - c A D_s||^2_{D_s^{-1}}                            (21)

with D_f serving as the learning rate D_η in equation (11). This means that D_s controls the choice of model (normalization of A) and the weight, while D_f controls the convergence rate. As mentioned in section 2.1.2, in the case of a non-unique solution we cannot know which solution we will get, only that it minimizes the least squares function in equation (21). The solution depends on the initial value c_0, and also on the learning rate D_f.

3.2 Choice of normalization mode

Three different combinations of D_f and D_s are suggested in [4]. D_s controls the net model and depends on the application. D_f affects the optimization algorithm. The D_f in choice 1 is optimal in the special case when all features are uncorrelated. The D_f in choices 2 and 3 is more difficult to analyze.

Choice 1: Normalization entirely in the feature domain

    D_s = I,   D_f = diag(A A^T)^{-1} = diag(1/Σ_m a_{1m}^2, 1/Σ_m a_{2m}^2, ...)   (22)

where diag(A A^T)^{-1} denotes the inverse of the diagonal part of A A^T. From the least squares formulation, equation (21), we see that this choice corresponds to the net model

    u = c a                                                                 (23)

c is optimized by non-weighted least squares using a gradient-based update rule with learning rate D_f. It is difficult to analyze the convergence properties of D_f except in the very simple case when all features are uncorrelated, i.e. if the rows in A are orthogonal (A A^T is diagonal). Then we can optimize each link element c_k independently, and it is easy to show that the D_f above is the optimal learning rate (it gives convergence after one iteration). In general, though, we have correlated features.
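The minimal check below illustrates this one-step convergence claim for choice 1: a non-negative A with disjoint row supports (an assumption made only to obtain orthogonal rows), D_s = I, and the learning rate D_f = diag(A A^T)^{-1}.

```python
import numpy as np

# One update step is enough when the rows of A are orthogonal.
rng = np.random.default_rng(4)
N, M = 4, 12
A = np.zeros((N, M))
for k in range(N):                           # each feature is active on its own samples
    A[k, 3 * k:3 * (k + 1)] = rng.random(3)
u = rng.random(M)

D_f = np.diag(1.0 / np.sum(A**2, axis=1))    # diag(AA^T)^{-1}; AA^T is diagonal here
c = np.zeros(N)                              # c_0 = 0
c = c - (c @ A - u) @ A.T @ D_f              # one step of c <- c - (cA - u)A^T D_f

print(np.linalg.norm((c @ A - u) @ A.T))     # gradient after one step: ~0
```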

Choice 2: Mixed domain normalization

    D_s = diag(A^T 1)^{-1} = diag(1/(a_1^T 1), 1/(a_2^T 1), ...)
    D_f = diag(A 1)^{-1} = diag(1/Σ_m a_{1m}, 1/Σ_m a_{2m}, ...)            (24)

Each diagonal element in D_s is the inverse sum of all feature values in one training sample. A D_s means that each training sample is normalized with its sum, and we get the net model

    u = c a / (1^T a)                                                       (25)

In addition, we use D_s^{-1} as weight in the least squares problem. This means that the feature vectors a_k with the largest sum will have the most impact on the solution.

Choice 3: Normalization entirely in the sample domain

    D_s = diag(A^T A 1)^{-1} = diag(1/(a_1^T(a_1 + a_2 + ... + a_M)), 1/(a_2^T(a_1 + a_2 + ... + a_M)), ...)
    D_f = I                                                                 (26)

We can view a_1 + a_2 + ... + a_M as an average operation (ignoring the factor 1/M). This choice therefore corresponds to the net model

    u = c a / (ā^T a),   where ā = E[a]                                     (27)

Again we use D_s^{-1} as weight, this time meaning that the feature vectors with the largest norm and with a direction close to the direction of the average vector ā will have the most impact on the solution.

Since D_f = I the update rule reduces to ordinary gradient search with η = 1 (section 2.1.1). It was mentioned that the gradient search algorithm converges for 0 < η < 2/λ_max. In this case λ_max is the largest eigenvalue of the matrix (A D_s) D_s^{-1} (A D_s)^T = A D_s A^T. With the choice of D_s as above we can use theorem 1 in appendix A.1, which says that we get λ_max = 1. Note that the theorem only holds for certain if A ≥ 0, which is the case in the associative networks theory. D_f = I is therefore optimal!
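The minimal sketch below numerically spot-checks the statement above: for a non-negative A and D_s = diag(A^T A 1)^{-1}, the largest eigenvalue of A D_s A^T equals 1, so the unit learning rate D_f = I lies inside the convergence interval. The random non-negative data is an assumption made only for the illustration.

```python
import numpy as np

# Check lambda_max(A D_s A^T) = 1 for normalization mode 3.
rng = np.random.default_rng(5)
N, M = 30, 50
A = rng.random((N, M))                           # A >= 0

D_s = np.diag(1.0 / (A.T @ A @ np.ones(M)))      # diag(A^T A 1)^{-1}
lam = np.linalg.eigvalsh(A @ D_s @ A.T)          # A D_s A^T is symmetric
print(lam.max())                                 # ~1.0, as predicted by theorem 1
```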

3.3 Generalization

This section is somewhat outside the scope of this report, but it is included to show that the associative net model in equation (21) can be generalized to include other algorithms. For example, we could have independent normalization and weight:

    arg min_{c_l ≤ c ≤ c_u} ||u - c A D_s||^2_W                             (28)

One example of this can be found in [6], which is one of the contributions to the radial basis theory (see [5]). Two choices of normalization were suggested there:

    D_s = I   and   D_s = diag(A^T 1)^{-1} = diag(1/(a_1^T 1), 1/(a_2^T 1), ...)   (29)

The last choice corresponds to the model

    u = c a / (1^T a)                                                       (30)

which is the same model as the second choice of normalization mode, mixed domain normalization, in section 3.2. Each element in a is in this case a radial basis function. The model parameters c are computed using unweighted least squares (W = I). To make the problem well-posed, some form of regularization is often used.

The model in equation (30) is also used in kernel regression theory, or mixture model theory, building on the notion of density estimation. In this case each element in a is a kernel function playing the role of a local density function, e.g. a Gaussian function, see [5]. The normalization factor is an estimate of the underlying probability density function of the input. The model parameters c are found by minimizing a least squares function or a maximum likelihood function. The solution is computed using gradient search or expectation-maximization. Again, regularization is often used to make the problem well-posed.

The examples above use c_l = -∞ and c_u = ∞, which is one of the major differences from the associative network in [4], which uses the monopolar constraint c_l = 0. Another difference is that the above examples use unweighted least squares, whereas a weighted least squares goal function is used for the associative network.
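The minimal sketch below illustrates the normalized radial basis model of equation (30), u = c a / (1^T a), with Gaussian kernels fitted by unweighted, unbounded least squares (W = I, c_l = -∞, c_u = ∞). The kernel centers, the width, and the sine target function are assumptions made only for the illustration.

```python
import numpy as np

# Normalized RBF regression in one dimension.
rng = np.random.default_rng(6)
centers = np.linspace(0.0, 1.0, 15)
width = 0.1
x_train = rng.random(100)
u_train = np.sin(2 * np.pi * x_train)                # target to approximate

def normalized_features(x):
    a = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))
    return a / a.sum(axis=1, keepdims=True)          # a / (1^T a)

c, *_ = np.linalg.lstsq(normalized_features(x_train), u_train, rcond=None)

x_test = np.linspace(0.0, 1.0, 5)
print(normalized_features(x_test) @ c)               # approximates sin(2*pi*x_test)
```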

4 Experiments

Section 2 described three different iterative algorithms that solve the least squares problem in section 3: the gradient rule, the RPROP rule, and the associative net rule. These algorithms are compared on a simple synthetic set of data.

4.1 Experiment data

The goal of the experiment is to train an associative net to estimate 2D position from a set of local distance functions, also called feature channels. Given a 2D position x = (x_1, x_2), the local distance functions are computed as

    d_k(x) = cos^2(||x - x_k|| / w_k)   if ||x - x_k|| / w_k ≤ π/2
    d_k(x) = 0                          otherwise                           (31)

where x_k is called the channel center and w_k is called the channel width. Two choices of {x_k, w_k} are explored:

Regularly placed feature channels: The centers x_k are placed in a regular Cartesian grid and all widths w_k are equal.

Randomly placed feature channels: All x_k are randomly placed and the widths w_k vary randomly within a limited range.

The two choices are shown in figure 1. The data used to train the system are computed along the spiral function in figure 1. The net is also evaluated on another set of data that is randomly located within the training region, see figure 2. The following list contains some facts about the data:

N = 500 local distance functions (feature channels)
M = 200 training samples
M_e = 1000 evaluation samples randomly located within the training region

The input to the net is a 500-dimensional vector containing the values of the local distance functions, and the output is a 2D vector with the position x, i.e.

    a(x) = (d_1(x) d_2(x) ... d_N(x))^T,   u = x = (x_1 x_2)                (32)
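The minimal sketch below implements the channel encoding of equations (31)-(32), using the reconstruction cos^2(||x - x_k|| / w_k) with support ||x - x_k|| / w_k ≤ π/2. The grid of centers and the common width are assumptions made only for the illustration.

```python
import numpy as np

# Local distance functions (feature channels) and the channel vector a(x).
grid = np.linspace(0.0, 1.0, 10)
centers = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)   # regularly placed
widths = np.full(len(centers), 0.15)

def channel_vector(x):
    r = np.linalg.norm(centers - x, axis=1) / widths
    return np.where(r <= np.pi / 2, np.cos(r)**2, 0.0)

x = np.array([0.3, 0.7])
a = channel_vector(x)
print(np.count_nonzero(a), len(a))        # the channel vector is sparse
```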

4.2 Experiment setup

As discussed in previous sections, different net models can be used. They can be summarized using the sample domain normalization D_s:

    U = (u_1; u_2):   u_1 = c_1 A D_s,  u_2 = c_2 A D_s,   i.e.  U = C A D_s,  where C = (c_1; c_2)   (33)

The 1 × M vector u_i contains all the output samples for coordinate x_i and the N × M matrix A = (a_1 a_2 ...) contains all the feature channel vectors. The association is computed using least squares, weighted with W = D_s^{-1}. We can solve the total system U = C A D_s directly or, equivalently, each system u_i = c_i A D_s separately.

Several combinations of boundaries and models are considered; table 1 lists the cases. Experiments 1 and 2 compare bipolar and monopolar solutions on regularly placed feature channels. Experiments 3 and 4 contain the same comparison on randomly placed feature channels. Experiments 4, 5, and 6 compare three different choices of normalization.

4.3 Results

Table 2 contains the results after training using the different iterative algorithms. We do not have any boundaries in experiments 1 and 3, so there we can compute the analytical solution in equation (3) as well. The training error e is the relative error defined as

    e = ||U - C A D_s||_F / ||U||_F                                         (34)

where the norm is the Frobenius norm. The table also contains the norm and the sparsity (nnz = number of non-zero values) of the solution C. Figure 3 shows the error during training for each of the experiments and algorithms.

The net is also investigated for its interpolation performance on the set of evaluation data described in section 4.1. The error between the net output û and the true position u is computed for each of the evaluation samples:

    Δu = ||u - û|| = ||u - C a / h(a)||                                     (35)

where h(a) depends on the net model, see table 1. The mean value, standard deviation, minimal value, and maximal value of Δu are shown in table 3.
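The minimal sketch below evaluates the two error measures used above: the relative training error e of equation (34) and the interpolation error of equation (35). The small random stand-in data, D_s = I, and h(a) = 1 (the model u = c a of experiments 1-4) are assumptions made only for the illustration.

```python
import numpy as np

# Relative Frobenius training error and interpolation error.
rng = np.random.default_rng(7)
N, M = 50, 80
A = rng.random((N, M))                       # training feature vectors as columns
U = rng.random((2, M))                       # 2D training positions
D_s = np.eye(M)

C = U @ np.linalg.pinv(A @ D_s)              # stand-in for the trained link vectors

e = np.linalg.norm(U - C @ A @ D_s) / np.linalg.norm(U)    # equation (34), Frobenius norms
print("relative training error e:", e)

a_eval = rng.random(N)                       # one evaluation feature vector
u_true = rng.random(2)
delta_u = np.linalg.norm(u_true - C @ a_eval / 1.0)        # equation (35) with h(a) = 1
print("interpolation error:", delta_u)
```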

4.4 Conclusions

If we compare the three iterative rules we can see that the RPROP rule and the associative net rule have roughly the same convergence rate in all experiments, while the gradient rule is slower. RPROP has a higher error at the beginning because the learning rate has not yet adapted. The computational complexity of RPROP is somewhat higher because we have to update the learning rate as well. This is compensated by a slightly faster convergence rate. The solutions using RPROP and the associative rule also have similar norm and sparsity. They are therefore comparable in performance. The main advantage of the RPROP rule, or any other general iterative rule for solving the least squares problem, is that we do not have to derive an explicit learning rate for each choice of normalization.

In addition, we can make some observations regarding the choice of model and boundaries:

In experiments 1 and 2 we had a regular grid of feature channels and compared the bipolar solution (no boundary constraint) in experiment 1 with the monopolar solution (only positive coefficients) in experiment 2. The bipolar case had still not converged after 10000 iterations (the optimal analytical solution is close to zero), but the error is at least fairly low. In experiments 3 and 4 we made the same comparison but on randomly placed feature channels. In this case the difference between bipolar and monopolar coefficients is much more evident, see figure 3. This is because the feature channels are more correlated in this case. Again, the bipolar case had still not converged after 10000 iterations. The experiments indicate that we get faster convergence using only monopolar coefficients compared to bipolar coefficients. The cost is a larger error for the solution.

The analytical solution in experiment 3 has a large norm ||C|| compared to the iterative solutions in the same experiment (this can also be seen in experiment 1). This is because the solution contains a few very large positive and a few very large negative values. The iterative solutions did not converge within 10000 iterations, but they would eventually give a large norm as well.

By comparing the results from experiments 1 and 3, where we have a bipolar solution, with experiments 2 and 4, where the solution is monopolar, we can see that the norm ||C|| is lower in the monopolar cases. A lower norm may give better robustness to noise and a lower interpolation error, which is partly confirmed by the evaluation results in table 3. This is a topic for future research. A lower norm could also have been accomplished if we had used both lower and upper constraints, or if we had included a regularization term. The alternatives might differ in convergence rate though.

In experiments 5 and 6 we get a lower interpolation error than in experiment 4. This may be because we use a different model (normalization of the feature vectors), but it may also be because we use a weight. This is a topic for future research.

The generalization of the least squares problem, equation (28), where we have independent normalization and weight, allows for this to be investigated.

In experiment 6 we get identical convergence for the gradient rule and the associative update rule, which confirms the statement in section 3.2 that normalization mode 3 is equivalent to gradient search.

One argument in [4] for having the monopolar constraint is that the link vector becomes much more sparse than without the constraint. A sparse link vector gives a lower computational complexity in the net. Table 2 partly confirms the argument, but only when we have randomly located feature channels and no normalization of the input a (experiment 4). Otherwise we only get a slightly higher sparsity.

5 Summary

This report shows that the learning algorithm for the associative net in [4] can be described as a weighted least squares problem. This allows for comparison with other iterative solution methods. The report compared the associative net update rule with the gradient rule and the RPROP rule in a simple experiment. The gradient rule performs worse than the associative rule. The RPROP rule has a higher initial error, but its convergence rate is comparable to the associative net rule. The experiments also show that the convergence is considerably faster when using only positive (monopolar) values in the solution compared to using both negative and positive (bipolar) values. This holds for all three of the update rules.

The weighted least squares problem was generalized to include other algorithms as well. This might be of help when the associative net is compared to other function approximation methods. It also allows for investigation of the importance of normalization and weight.

6 Acknowledgment

This work was supported by the Swedish Foundation for Strategic Research, project VISIT - VIsual Information Technology. The author would like to thank the people at CVL for helpful discussions, especially my supervisor Gösta Granlund.

Figure 1: Experiment data. Local distance functions (feature channels) and training samples along a spiral function, for regularly placed feature channels and for randomly placed feature channels.

Figure 2: Evaluation data. The evaluation data is randomly located within the training region (spiral).

    Experiment   Feature location   Lower bound c_l   Model (D_s)        Weight W = D_s^{-1}
    1            Regular            -∞                u = c a            I
    2            Regular            0                 u = c a            I
    3            Random             -∞                u = c a            I
    4            Random             0                 u = c a            I
    5            Random             0                 u = c a/(1^T a)    diag(A^T 1)
    6            Random             0                 u = c a/(ā^T a)    diag(A^T A 1)

    Common for all experiments: number of iterations 10000, upper bound c_u = ∞, initial value c_0 = 0, no regularization term.

Table 1: Experiment setup. The three iterative update methods using the gradient rule, the RPROP rule, and the associative net rule are evaluated on each of experiments 1-6.

Figure 3: Error during training for each of experiments 1-6 and each of the iterative update rules (gradient rule, RPROP rule, associative rule). Note the logarithmic scale on the x-axis. The gradient rule and the associative rule coincided in the last experiment.

Table 2: Results after training. For each of experiments 1-6 the table lists the relative error e, the norm ||C||_F, and the sparsity nnz(C) of the solution for the gradient rule, the RPROP rule, and the associative net rule (and for the analytical solution in experiments 1 and 3).

Table 3: Interpolation performance on the evaluation data. For each of experiments 1-6 the table lists the mean value, standard deviation, minimum, and maximum of the position error Δu for the gradient rule, the RPROP rule, and the associative net rule (and for the analytical solution in experiments 1 and 3).

A Appendices

A.1 Proof of theorem 1, section 2.1.1

The theorem is repeated below:

Theorem 1  Let A be a matrix. Assume A ≥ 0, i.e. all values in A are non-negative. Assume W = diag(A A^T 1)^{-1}, where 1 = (1 1 ... 1)^T. Then the largest eigenvalue of A^T W A is equal to 1.

Proof  The proof consists of two parts. First, we show that the eigenvalues cannot be larger than 1. Second, we show that there exists an eigenvector that has the eigenvalue 1. We assume that all values in the vector A A^T 1 are nonzero so that W exists (since A ≥ 0 this basically means that no row in A contains only zeros).

1. Let v be an eigenvector of A^T W A. We can compute the eigenvalue as

    λ = v^T A^T W A v / (v^T v)                                             (36)

It is a well known fact that A A^T and A^T A have the same non-zero eigenvalues. In this case we also have a weight involved, but we can make a modification and state that A^T W A = (A^T W^{1/2})(W^{1/2} A) and W^{1/2} A A^T W^{1/2} have the same non-zero eigenvalues. The eigenvalue λ above can thus be computed as

    λ = u^T W^{1/2} A A^T W^{1/2} u / (u^T u)                               (37)

for some (eigen)vector u. Let z = W^{1/2} u, i.e. u = W^{-1/2} z, and insert this into the equation:

    λ = u^T W^{1/2} A A^T W^{1/2} u / (u^T u)
      = z^T A A^T z / (z^T W^{-1} z)
      = z^T A A^T z / (z^T diag(A A^T 1) z)                                 (38)

It remains to show that this quotient cannot be larger than 1. To simplify the index notation we let B = A A^T and compute the quotient:

    λ = z^T B z / (z^T diag(B 1) z)
      = Σ_{i,j} b_{ij} z_i z_j / Σ_i (Σ_j b_{ij}) z_i^2
      = (Σ_i b_{ii} z_i^2 + Σ_{i<j} b_{ij} 2 z_i z_j) / (Σ_i b_{ii} z_i^2 + Σ_{i<j} b_{ij} (z_i^2 + z_j^2))   (39)

In the last equality we have used the symmetry property b_{ij} = b_{ji}. For each numerator term b_{ij} 2 z_i z_j we have a corresponding denominator term b_{ij} (z_i^2 + z_j^2). The inequality between the arithmetic and geometric means states that z_i^2 + z_j^2 ≥ 2 z_i z_j, and since all b_{ij} are non-negative we can therefore conclude that λ ≤ 1.

2. Does there exist an eigenvector with eigenvalue λ = 1? Yes, it is easy to show that v = A^T 1 has eigenvalue 1:

    A^T W A v = A^T (W A A^T 1) = A^T (diag(A A^T 1)^{-1} A A^T 1) = A^T 1 = v   (40)

(In the third equality we used the fact that diag(x)^{-1} x = 1.)
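The minimal sketch below numerically spot-checks theorem 1: for a random non-negative A and W = diag(A A^T 1)^{-1}, the largest eigenvalue of A^T W A is 1 and v = A^T 1 is a corresponding eigenvector. The random data is an assumption made only for the illustration.

```python
import numpy as np

# Numerical spot-check of theorem 1.
rng = np.random.default_rng(8)
A = rng.random((6, 9))                        # A >= 0
ones = np.ones(A.shape[0])

W = np.diag(1.0 / (A @ A.T @ ones))           # diag(A A^T 1)^{-1}
B = A.T @ W @ A                               # symmetric, since W is diagonal

print(np.linalg.eigvalsh(B).max())            # ~1.0
v = A.T @ ones
print(np.allclose(B @ v, v))                  # True: eigenvalue 1 for v = A^T 1
```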

A.2 Proof of gradient solution, section 2.1.1

Theorem 2  Let

    S = {x : ε(x) = ||Ax - b||^2 is minimum}                                (41)

Then the gradient search algorithm x_{p+1} = x_p - η ε_x(x_p) with x_0 = 0 gives the solution x_0 ∈ S with minimal Euclidean norm. (The weight W is ignored here; it does not affect the theorem.)

Proof  Let

    A^T A = V Σ V^T,   Σ = ( Σ_r 0 ; 0 0 ),   V^T V = V V^T = I             (42)

be the SVD decomposition of A^T A (r = rank(A^T A)). Furthermore, let y = V^T x (y has the same norm as x). The gradient update rule in equation (10) can then be written

    y_{p+1} = y_p - η (Σ y_p - s),   where s = V^T A^T b                    (43)

y_{p+1} can be expressed as a function of the initial value y_0 = V^T x_0 as

    y_{p+1} = (I - ηΣ) y_p + η s
            = (I - ηΣ)^2 y_{p-1} + (I - ηΣ) η s + η s
            = ...
            = (I - ηΣ)^{p+1} y_0 + (Σ_{k=0}^{p} (I - ηΣ)^k) η s             (44)

Assume x_S ∈ S. It has the property A^T A x_S = A^T b (the gradient of ε(x) equals zero). We can then write

    s = V^T A^T A x_S = Σ y_S,   where y_S = V^T x_S                        (45)

and the update can then be written

    y_{p+1} = (I - ηΣ)^{p+1} y_0 + (Σ_{k=0}^{p} (I - ηΣ)^k) η Σ y_S         (46)

By using the property (Σ_{k=0}^{N} B^k)(I - B) = I - B^{N+1} with B = I - ηΣ we can write the update rule as

    y_{p+1} = (I - ηΣ)^{p+1} y_0 + (I - (I - ηΣ)^{p+1}) y_S                 (47)

or, equivalently,

    y_{p+1} = ( (I - ηΣ_r)^{p+1} 0 ; 0 I ) y_0 + ( I - (I - ηΣ_r)^{p+1} 0 ; 0 0 ) y_S   (48)

After convergence (if η is suitably chosen) we get

    y_∞ = ( 0 0 ; 0 I ) y_0 + ( I 0 ; 0 0 ) y_S                             (49)

and we can see that if we choose y_0 = 0 we get the solution

    y_∞ = ( I 0 ; 0 0 ) y_S                                                 (50)

which can be shown to be the minimal norm solution, see e.g. [3].
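The minimal sketch below numerically spot-checks theorem 2: for a rank-deficient A, gradient search started at x_0 = 0 approaches the minimal-norm least squares solution, i.e. the pseudo-inverse solution. The random data, step size, and iteration count are assumptions made only for the illustration.

```python
import numpy as np

# Gradient search from x_0 = 0 versus the pseudo-inverse solution.
rng = np.random.default_rng(9)
A = rng.standard_normal((20, 4)) @ rng.standard_normal((4, 10))   # rank 4 < N = 10
b = rng.standard_normal(20)

eta = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
x = np.zeros(10)
for _ in range(50000):
    x -= eta * (A.T @ (A @ x - b))            # equation (10) with W = I

x_min_norm = np.linalg.pinv(A) @ b
print(np.linalg.norm(x - x_min_norm))         # close to 0: the minimal-norm solution
```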

References

[1] M. Adlers. Topics in Sparse Least Squares Problems. PhD thesis, Dept. of Mathematics, Linköping University, Linköping, Sweden.

[2] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Society for Industrial and Applied Mathematics, 1996.

[3] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition, 1989.

[4] G. Granlund. Parallel Learning in Artificial Vision Systems: Working Document. Technical report, Dept. EE, Linköping University.

[5] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1999.

[6] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-293, 1989.

[7] R. D. Reed and R. J. Marks II. Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks. MIT Press, 1999.

[8] M. Riedmiller and H. Braun. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. In Proceedings of the IEEE International Conference on Neural Networks, volume 1, San Francisco, CA, 1993.


More information

CHAPTER 7. Regression

CHAPTER 7. Regression CHAPTER 7 Regression This chapter presents an extended example, illustrating and extending many of the concepts introduced over the past three chapters. Perhaps the best known multi-variate optimisation

More information

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data

More information

Principal Component Analysis and Linear Discriminant Analysis

Principal Component Analysis and Linear Discriminant Analysis Principal Component Analysis and Linear Discriminant Analysis Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/29

More information

Matrix Factorizations

Matrix Factorizations 1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular

More information

This appendix provides a very basic introduction to linear algebra concepts.

This appendix provides a very basic introduction to linear algebra concepts. APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not

More information

Maths for Signals and Systems Linear Algebra in Engineering

Maths for Signals and Systems Linear Algebra in Engineering Maths for Signals and Systems Linear Algebra in Engineering Lectures 13 15, Tuesday 8 th and Friday 11 th November 016 DR TANIA STATHAKI READER (ASSOCIATE PROFFESOR) IN SIGNAL PROCESSING IMPERIAL COLLEGE

More information

Fixed Weight Competitive Nets: Hamming Net

Fixed Weight Competitive Nets: Hamming Net POLYTECHNIC UNIVERSITY Department of Computer and Information Science Fixed Weight Competitive Nets: Hamming Net K. Ming Leung Abstract: A fixed weight competitive net known as the Hamming net is discussed.

More information

Optimization of Gaussian Process Hyperparameters using Rprop

Optimization of Gaussian Process Hyperparameters using Rprop Optimization of Gaussian Process Hyperparameters using Rprop Manuel Blum and Martin Riedmiller University of Freiburg - Department of Computer Science Freiburg, Germany Abstract. Gaussian processes are

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

A Program for Data Transformations and Kernel Density Estimation

A Program for Data Transformations and Kernel Density Estimation A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate

More information

LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12,

LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12, LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12, 2000 74 6 Summary Here we summarize the most important information about theoretical and numerical linear algebra. MORALS OF THE STORY: I. Theoretically

More information

Lecture: Local Spectral Methods (1 of 4)

Lecture: Local Spectral Methods (1 of 4) Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Extreme Values and Positive/ Negative Definite Matrix Conditions

Extreme Values and Positive/ Negative Definite Matrix Conditions Extreme Values and Positive/ Negative Definite Matrix Conditions James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University November 8, 016 Outline 1

More information

Grassmann Averages for Scalable Robust PCA Supplementary Material

Grassmann Averages for Scalable Robust PCA Supplementary Material Grassmann Averages for Scalable Robust PCA Supplementary Material Søren Hauberg DTU Compute Lyngby, Denmark sohau@dtu.dk Aasa Feragen DIKU and MPIs Tübingen Denmark and Germany aasa@diku.dk Michael J.

More information

arxiv: v1 [math.na] 1 Sep 2018

arxiv: v1 [math.na] 1 Sep 2018 On the perturbation of an L -orthogonal projection Xuefeng Xu arxiv:18090000v1 [mathna] 1 Sep 018 September 5 018 Abstract The L -orthogonal projection is an important mathematical tool in scientific computing

More information