Convergence of the Majorization Method for Multidimensional Scaling
Journal of Classification 5 (1988)

Convergence of the Majorization Method for Multidimensional Scaling

Jan de Leeuw
University of California Los Angeles

Abstract: In this paper we study the convergence properties of an important class of multidimensional scaling algorithms. We unify and extend earlier qualitative results on convergence, which tell us when the algorithms are convergent. In order to prove global convergence results we use the majorization method. We also derive, for the first time, some quantitative convergence theorems, which give information about the speed of convergence. It turns out that in almost all cases convergence is linear, with a convergence rate close to unity. This has the practical consequence that convergence will usually be very slow, and this makes techniques to speed up convergence very important. It is pointed out that step-size techniques will generally not succeed in producing marked improvements in this respect.

Keywords: Multidimensional scaling; Convergence; Step size; Local minima.

Author's Address: Jan de Leeuw, Departments of Psychology and Mathematics, University of California Los Angeles, 405 Hilgard Avenue, Los Angeles, CA 90024, USA.

1. Introduction

Recent research in multidimensional scaling has moved in the direction of proposing more and more complicated models, often with a very large number of parameters, and sometimes even with severe discontinuities in the model. The emphasis has been on producing computer programs that work, and comparatively little attention has been paid to theoretical problems associated with the loss functions and the algorithms used to minimize them. We
think that such a more theoretical study is long overdue. In fact we think that at the moment an in-depth study of some of the simpler models and techniques is more urgent than the development of even more complicated ones. This paper is a contribution to the study of a simple algorithm for fitting the simplest of multidimensional scaling models. Thus, our paper does not present a new algorithm or a new model. It also makes no claim about superiority of the algorithm studied here over other possible or actual algorithms. The model is chosen from a large spectrum of possible models, and the loss function is one among many. The only reason for studying this particular model, loss function, and algorithm is that they are simple and direct. This choice makes the mathematical study of their properties relatively easy. It is also the reason why they have been around for quite some time now, and why they have been used in a majority of the applications of multidimensional scaling. The paper will mainly be about metric multidimensional scaling; towards the end we shall briefly discuss the nonmetric case. Results for nonmetric scaling are often simple extensions of the metric results, and in this sense metric scaling is more basic.

2. Notation and Terminology

We introduce the notation and terminology more or less standard for metric multidimensional scaling (Kruskal and Wish 1978). The data in a classical multidimensional scaling problem are collected in a symmetric non-negative matrix Δ = {δij}. The elements of Δ are called dissimilarities; δij is the dissimilarity between objects i and j. There are n objects, and thus Δ is of order n. We suppose that self-dissimilarities are zero; as a consequence Δ has a zero diagonal.
A common purpose of multidimensional scaling is to represent the objects as points in a low-dimensional Euclidean space, in such a way that the distance between points i and j is approximately equal to the given dissimilarity of objects i and j. The xi denote n points in p-space, with coordinates in the n × p matrix X, called the configuration. The matrix D(X), with elements dij(X), contains the Euclidean distances between the points xi. It follows that

d²ij(X) = (xi − xj)'(xi − xj) (1a)
        = (ei − ej)'XX'(ei − ej) (1b)
        = tr X'Aij X. (1c)

In (1b) the ei are unit vectors (columns of the identity matrix of order n), and in (1c) we have Aij = (ei − ej)(ei − ej)'. In order to find out how successful a representation is we compute the value of a loss function, defined to be
σ(X) = ½ Σi Σj wij (δij − dij(X))², (2)

where W = {wij} is a symmetric, non-negative matrix of weights, with a zero diagonal. Matrix W is known and fixed throughout the computations. Its nondiagonal elements can be used to code various forms of supplementary information. If there are missing data, for instance, we can set the wij corresponding to missing dissimilarities equal to zero. If there are replications of the dissimilarities in a cell we can estimate their variability, and use this information to choose weights. It sometimes also makes sense to set wij equal to a fixed function of the dissimilarity values, such as the square or the inverse square, in order to differentially weight errors. This strategy can be used to simulate the behavior of other multidimensional scaling loss functions. The purpose of the particular form of multidimensional scaling we have presented can now be stated more precisely: given weights, dissimilarities, and a dimensionality p, we want to find X that minimizes σ(X).

3. Basic Algorithm

The algorithm we discuss in this paper was first given by Guttman (1968). He derived it by setting the stationary equations for the minimization of σ(X) equal to zero, and he observed that the algorithm could be interpreted as a gradient algorithm with constant step-size. Compare also Lingoes and Roskam (1973, p. 8-10), Hartman (1979), and Borg (1981). In de Leeuw (1977) the very same algorithm was derived in a somewhat more general context from convex analysis, as a subgradient method. It was observed that in the simple Euclidean case, which is the one we are interested in here, the algorithm could be derived from the Cauchy-Schwarz inequality, without using either differentiation or subdifferentiation. This is the derivation we present here. It is a much simplified version of the one given in de Leeuw and Heiser (1980). In order to describe the algorithm efficiently we need some additional notation.
First let

η²(X) = ½ Σi Σj wij d²ij(X), (3)

and

ρ(X) = ½ Σi Σj wij δij dij(X). (4)

If we assume, without loss of generality, that
½ Σi Σj wij δij² = 1, (5)

then

σ(X) = 1 − 2ρ(X) + η²(X). (6)

It is clear that η²(X) is a convex quadratic function of X. From (1c) and (3) we have

η²(X) = tr X'VX, (7)

with

V = ½ Σi Σj wij Aij. (8)

Throughout the paper we assume that V has rank n − 1, an assumption which can be made without any loss of generality (de Leeuw 1977). It also follows from (8) that V is positive semidefinite, and that its (one-dimensional) null space consists of all vectors with constant elements. The function ρ(X) is somewhat more complicated than η²(X). We know that dij(X) is a convex and positively homogeneous function of X. Thus, by (4), the same claim holds for ρ(X). It is convenient to write ρ(X) as

ρ(X) = tr X'B(X)X, (9)

with

B(X) = ½ Σi Σj wij δij sij(X) Aij, (10)

where

sij(X) = 1/dij(X) if dij(X) ≠ 0, (11a)
sij(X) = 0 otherwise. (11b)

It is obvious that the difference between (7) and (9) is that in (7) V is a constant matrix, while in (9) B(X) varies with X. Thus η²(X) is quadratic in X, while ρ(X) is not. With the notation developed so far it is easy to explain the algorithm. Clearly the partial derivatives of η²(X) are given by ∇η²(X) = 2VX. Using the definitions it is also not difficult to see that ∇ρ(X) = B(X)X, provided that ρ(X) is differentiable at X. Thus ∇σ(X) = 2(VX − B(X)X). In de Leeuw and Heiser (1980), the Guttman transform Γ(Y) of a configuration Y is defined as
Γ(Y) = V⁺B(Y)Y, (12)

with V⁺ the Moore-Penrose inverse of V. Observe that the Guttman transform depends on weights and dissimilarities, and is consequently defined relative to a particular metric multidimensional scaling problem. Using the Guttman transform makes it possible to rewrite the gradient as ∇σ(X) = 2V(X − Γ(X)), and thus ∇σ(X) = 0 if and only if X = Γ(X). As a consequence, a configuration X is called stationary for a particular metric multidimensional scaling problem if it is equal to its Guttman transform. The result also immediately suggests the algorithm Xk+1 = Γ(Xk), or more explicitly,

Xk+1 = V⁺B(Xk)Xk. (13)

Equivalently we can also write

Xk+1 = Xk − ½ V⁺ ∇σ(Xk), (14)

which shows the gradient interpretation of the algorithm. In this derivation of the algorithm we have made the provision that σ had to be differentiable at X. This is true if and only if dij(X) > 0 for all i, j for which wij δij > 0. Let us agree to call a configuration usable if this condition holds. Thus if X is not usable, then σ is not differentiable at X, and interpretation (14) cannot be used. Using definition (10), however, iteration (13) can still be carried out. In de Leeuw (1977) and de Leeuw and Heiser (1980) it is shown that in this case the algorithm still converges to a stationary point. There are three reasons why we simply assume differentiability in the sequel. In the first place it was proved by de Leeuw (1984) that if σ has a local minimum at X, then X is usable. Since we are interested in the behavior of our algorithm in the neighborhood of a local minimum in most of this paper, we might as well assume differentiability. In the second place, although we do not need differentiability for the qualitative study of convergence (i.e., for proving that the algorithm converges), we do need it for the quantitative study (i.e., for establishing the rate of convergence).
And, finally, we shall base our convergence proof below directly on the Cauchy-Schwarz inequality, for which differentiability is not needed in the first place. The convergence proof starts with a lemma, which is the foundation of our approach to the minimization of this particular multidimensional scaling loss function.
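Before stating the lemma, the notation can be made concrete. The following sketch (ours, not part of the original paper; Python with NumPy, helper names of our own choosing) builds V and B(X) for the small four-point example of Section 6 and checks the identities (6), (7) and (9) at a random configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 2
Delta = (1 - np.eye(n)) / np.sqrt(6.0)   # dissimilarities of Section 6; satisfy (5)
W = 1.0 - np.eye(n)                      # unit weights, zero diagonal

def distances(X):
    # Euclidean distance matrix D(X)
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def stress(X):                           # sigma(X), equation (2)
    return 0.5 * np.sum(W * (Delta - distances(X)) ** 2)

def vmat():                              # V = 1/2 sum_ij w_ij A_ij, equation (8)
    V = -W.copy()
    np.fill_diagonal(V, W.sum(axis=1))
    return V

def bmat(X):                             # B(X), equations (10)-(11)
    D = distances(X)
    S = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
    B = -S
    np.fill_diagonal(B, S.sum(axis=1))
    return B

X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # column-centered configuration
V = vmat()
eta2 = np.trace(X.T @ V @ X)             # equation (7)
rho = np.trace(X.T @ bmat(X) @ X)        # equation (9)
assert np.isclose(eta2, 0.5 * np.sum(W * distances(X) ** 2))
assert np.isclose(rho, 0.5 * np.sum(W * Delta * distances(X)))
assert np.isclose(stress(X), 1.0 - 2.0 * rho + eta2)   # equation (6)
```

The assertions confirm that, with this Δ, the normalization (5) holds and the decomposition σ = 1 − 2ρ + η² is exact.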
Lemma 1: For all X and Y

σ(X) ≤ 1 − η²(Γ(Y)) + η²(X − Γ(Y)). (15a)

Moreover, for all X,

σ(X) = 1 − η²(Γ(X)) + η²(X − Γ(X)). (15b)

Proof: By Cauchy-Schwarz,

tr X'AijY ≤ {tr X'AijX}^½ {tr Y'AijY}^½ = dij(X) dij(Y). (16)

Multiply both sides by wij δij sij(Y), and add over all i, j. This gives, using (4) and (10),

tr X'B(Y)Y ≤ ρ(X), (17)

with equality if Y = X. We can also write (17) as

ρ(X) ≥ tr X'VΓ(Y), (18)

and substitution in (6) gives

σ(X) ≤ 1 − 2 tr X'VΓ(Y) + tr X'VX. (19)

We can write (19) as

σ(X) ≤ 1 − tr Γ(Y)'VΓ(Y) + tr (X − Γ(Y))'V(X − Γ(Y)), (20)

which is (15a). We have equality if X = Y, which gives (15b). Q.E.D.

Figure 1 provides a useful interpretation of Lemma 1, and also shows the algorithmic implications. If we use the abbreviation

ω_Y(X) = 1 − η²(Γ(Y)) + η²(X − Γ(Y)), (21)

then for each Y the function ω_Y is quadratic in X. Moreover, by Lemma 1, σ(X) ≤ ω_Y(X) and σ(Y) = ω_Y(Y). Thus, for each Y, the function ω_Y majorizes the function σ, and the two functions touch only for X = Y. Moreover, ω_Y is minimized over X by setting X = Γ(Y), and thus we see that the Guttman transform decreases the loss function. We write this result as

σ(Γ(Y)) ≤ ω_Y(Γ(Y)) < ω_Y(Y) = σ(Y), (22)
provided that Y ≠ Γ(Y). These results are illustrated in Figure 1. Thus either Y = Γ(Y), in which case ∇σ(Y) = 0 and we have found a solution of the stationary equations, or Y ≠ Γ(Y) and we can decrease the loss by replacing Y by its Guttman transform. This procedure constitutes the basic algorithm. It also explains why we call it a majorization method: instead of local linear approximation we use global quadratic majorization in each iteration step.

Figure 1. Three iterations of the majorization algorithm. We have drawn a section of the stress loss function, and two quadratic majorization functions. These touch the loss function at the current configuration, they are always above it, and their minimum provides the next configuration.
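The sandwich (22) can be verified numerically. The sketch below (our illustration, using the four-point example of Section 6 and helper names of our own) implements the Guttman transform (12) and the majorizing function (21), and checks Lemma 1 at random configurations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 2
Delta = (1 - np.eye(n)) / np.sqrt(6.0)   # satisfies normalization (5)
W = 1.0 - np.eye(n)

def distances(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def stress(X):                           # sigma(X), equation (2)
    return 0.5 * np.sum(W * (Delta - distances(X)) ** 2)

def bmat(X):                             # B(X), equations (10)-(11)
    D = distances(X)
    S = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
    B = -S
    np.fill_diagonal(B, S.sum(axis=1))
    return B

V = -W.copy()
np.fill_diagonal(V, W.sum(axis=1))
Vp = np.linalg.pinv(V)                   # Moore-Penrose inverse V+

def guttman(Y):                          # Gamma(Y) = V+ B(Y) Y, equation (12)
    return Vp @ bmat(Y) @ Y

def omega(X, Y):                         # majorizing function, equation (21)
    G = guttman(Y)
    return 1.0 - np.trace(G.T @ V @ G) + np.trace((X - G).T @ V @ (X - G))

X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
Y = rng.standard_normal((n, p)); Y -= Y.mean(axis=0)
assert stress(X) <= omega(X, Y) + 1e-12      # (15a): omega_Y majorizes sigma
assert np.isclose(stress(Y), omega(Y, Y))    # (15b): the functions touch at X = Y
assert stress(guttman(Y)) <= stress(Y)       # the sandwich (22)
```

One Guttman transform therefore never increases the loss, which is the whole algorithmic content of Lemma 1.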
4. Sequences Generated by the Algorithm

The algorithm (13) generates a sequence Xk, and also sequences of real numbers σk = σ(Xk), ρk = ρ(Xk), ηk² = η²(Xk). Also define the sequences λk = ρ(Xk)/η(Xk) and εk² = η²(Xk − Γ(Xk)). The symbols ↑ and ↓ are used for convergence of monotone sequences which are, respectively, increasing and decreasing. The theorems in this section unify earlier results in de Leeuw (1977) and de Leeuw and Heiser (1980).

Theorem 1: (a) ρk ↑ ρ∞, (b) ηk ↑ η∞, with η∞² = ρ∞, (c) λk ↑ λ∞ = η∞, (d) σk ↓ σ∞ = 1 − ρ∞, (e) εk → 0.

Proof: From (3) and (4) we find, by using Cauchy-Schwarz, that ρ(X) ≤ η(X) for all X. Thus λk ≤ 1 for all k. We can write (9) as ρ(X) = tr X'VΓ(X), and again by Cauchy-Schwarz, ρ(X) ≤ η(X)η(Γ(X)) for all X. Thus ρk ≤ ηkηk+1, and also λk ≤ ηk+1. Now write (17) as ρ(Γ(X)) ≥ tr Γ(X)'B(X)X = η²(Γ(X)), which implies that ρk ≥ ηk² and λk ≥ ηk. Summarizing the results so far gives

ηk ≤ λk ≤ ηk+1 ≤ λk+1 ≤ 1, (23a)
ηk² ≤ ρk ≤ ηkηk+1 ≤ ρk+1 ≤ 1. (23b)

These chains are sufficient to prove parts (a), (b), (c). We have already proved the decrease of σk in (22); the limit value follows trivially from σk = 1 − 2ρk + ηk², together with ρ∞ = η∞². This proves (d). For (e) we write εk² = ηk² + ηk+1² − 2ρk, and again use ρ∞ = η∞². Q.E.D.

Observe that the theorem does not state that εk decreases monotonically to zero. In fact, convergence of εk to zero is usually nonmonotonic. The theorem also does not say anything about the convergence of Xk. The sequence Xk is studied in a separate theorem. Remember that X∞ is an accumulation point of a sequence Xk if each neighborhood of X∞ contains infinitely many points of the sequence.

Theorem 2: Suppose S∞ is the set of all accumulation points of the sequence Xk. Then: (a) S∞ is nonempty,
(b) if X∞ ∈ S∞ then σ(X∞) = σ∞, (c) if X∞ ∈ S∞ and X∞ is usable, then X∞ is stationary (i.e., equal to its Guttman transform), (d) if S∞ is not a singleton, then it is a continuum.

Proof: All Xk are column centered. In the (n − 1)p dimensional space of all column-centered n × p matrices, the function η defines a norm. Because ηk ≤ 1 for all k, it follows that all Xk are in the unit ball of this normed space, and consequently they have at least one accumulation point. This proves (a). Suppose a subsequence of Xk converges to X∞. Then, by continuity of σ, the corresponding subsequence of σ(Xk) converges to σ(X∞). But the only accumulation point of σ(Xk) is σ∞. This proves (b). In the neighborhood of a usable configuration the Guttman transform is continuous. Thus, by the same argument that proved (b), we find that along the subsequence η(Xk − Γ(Xk)) = η(Xk − Xk+1) converges to η(X∞ − Γ(X∞)), which must be zero by Theorem 1, part (e). This proves (c). For result (d), we remember that a continuum is a closed set which cannot be written as the union of two or more disjoint closed sets. A proof of (d) is given by Ostrowski (1966, Theorem 28.1). Q.E.D.

Again Theorem 2 does not say that Xk converges. From a practical point of view, however, this is quite irrelevant. If we define μ-optimal configurations as those configurations for which η(X − Γ(X)) < μ, then for all μ > 0 the algorithm finds a μ-optimal configuration in a finite number of steps. This is all the convergence we ever need in practice. In Theorem 2 there is a restriction in part (c), because we require that X∞ is usable. This restriction can be removed by using subdifferentials (de Leeuw and Heiser 1980). But again, from a practical point of view, the restriction is not very important, because all local minima are usable (de Leeuw 1984).
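The monotonicity claims of Theorem 1 and the μ-optimal stopping rule are easy to observe in practice. The following sketch (ours; four-point example of Section 6, helper names our own) runs iteration (13) from a random start and checks that ρk and ηk increase, σk decreases, η∞² = ρ∞, and εk becomes small.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 2
Delta = (1 - np.eye(n)) / np.sqrt(6.0)
W = 1.0 - np.eye(n)

def distances(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def stress(X):
    return 0.5 * np.sum(W * (Delta - distances(X)) ** 2)

def bmat(X):
    D = distances(X)
    S = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
    B = -S
    np.fill_diagonal(B, S.sum(axis=1))
    return B

V = -W.copy(); np.fill_diagonal(V, W.sum(axis=1))
Vp = np.linalg.pinv(V)
eta = lambda X: np.sqrt(np.trace(X.T @ V @ X))
rho = lambda X: 0.5 * np.sum(W * Delta * distances(X))

X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
X /= eta(X)                               # start inside the unit ball
sig, rh, et = [], [], []
for k in range(200):
    sig.append(stress(X)); rh.append(rho(X)); et.append(eta(X))
    X = Vp @ bmat(X) @ X                  # one step of iteration (13)

tol = 1e-12
assert all(s1 <= s0 + tol for s0, s1 in zip(sig, sig[1:]))   # sigma_k decreases
assert all(r1 >= r0 - tol for r0, r1 in zip(rh, rh[1:]))     # rho_k increases
assert all(e1 >= e0 - tol for e0, e1 in zip(et, et[1:]))     # eta_k increases
assert np.isclose(rh[-1], et[-1] ** 2)    # eta_inf^2 = rho_inf (Theorem 1)
assert eta(X - Vp @ bmat(X) @ X) < 1e-6   # a mu-optimal configuration is reached
</```

The last assertion is exactly the practical stopping criterion: η(X − Γ(X)) < μ holds after finitely many steps.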
The conclusion from Theorems 1 and 2 is that the sequences of loss and fit values generated by the algorithm converge monotonically, and that the difference between successive solutions converges to zero. This does not imply, however, that the sequence of solutions itself converges. In a precise mathematical sense we have either convergence to a single point, or convergence to a continuum of stationary points, with all these stationary points having the same value of the loss function. We have not been able to exclude this last possibility, and indeed a natural candidate for such a continuum is available in any multidimensional scaling problem. If X∞ is a stationary point, then for any p × p rotation matrix K the matrix X∞K is also a stationary point, with the same loss function value. Thus there is the possibility that Xk converges to a
continuum of the form S∞ = {X : X = X∞K}, in the sense that min {η(Xk − X) : X ∈ S∞} converges to zero, but Xk does not converge to a point in S∞. Again it is clear that this is irrelevant from a practical point of view.

5. Derivatives of the Guttman Transform

A more detailed study of the convergence behavior of the algorithm is possible if we determine the derivative of the Guttman transform. This makes it possible to investigate, for the first time, the rate of convergence of our basic algorithm. For a general discussion of the role of the derivative in one-step iterative processes, such as our (13), we refer to Ostrowski (1966, Chapter 22) or Ortega and Rheinboldt (1970, Chapter 10). For ease of reference we briefly summarize their key result here. If Xk+1 = Φ(Xk) is any convergent iterative algorithm generating a sequence converging to X∞, and if the largest eigenvalue τ∞ of the derivative of Φ at X∞ satisfies 0 < τ∞ < 1, then ||Xk+1 − X∞|| / ||Xk − X∞|| → τ∞, i.e., we have linear convergence with rate τ∞. The derivative is considered as a linear operator mapping the (n − 1)p dimensional space of column-centered configurations into itself. We can represent it computationally by an np × np matrix, but here we prefer to give it as a linear map which associates with each centered configuration Y another centered configuration Γx(Y). The map Γx is the derivative of the Guttman transform at the usable configuration X. Thus we have, for example,

Γ(X + Y) = Γ(X) + Γx(Y) + o(η(Y)), (24)

showing the local linear approximation provided by the derivative. Observe that in (24) we have chosen η as the norm we use in defining the derivative. This choice is not essential, of course, but it certainly is convenient. We now give the formula for the derivative. It is most usefully written as

Γx(Y) = V⁺{B(X)Y − U(X,Y)X}, (25)

with

U(X,Y) = ½ Σi Σj {wij δij cij(X,Y) / d³ij(X)} Aij, (26)

and cij(X,Y) = tr X'Aij Y. Thus U(X,X) = B(X).
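Formula (25) can be checked against a finite-difference approximation of the Guttman transform. The sketch below (our illustration; helper names our own) implements U(X,Y) and Γx(Y) for the four-point example and compares Γx(Y) with (Γ(X + hY) − Γ(X))/h.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 4, 2
Delta = (1 - np.eye(n)) / np.sqrt(6.0)
W = 1.0 - np.eye(n)

def distances(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def bmat(X):
    D = distances(X)
    S = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
    B = -S
    np.fill_diagonal(B, S.sum(axis=1))
    return B

V = -W.copy(); np.fill_diagonal(V, W.sum(axis=1))
Vp = np.linalg.pinv(V)
guttman = lambda X: Vp @ bmat(X) @ X      # equation (12)

def umat(X, Y):
    # U(X,Y) of equation (26); c_ij(X,Y) = tr X'A_ij Y = (x_i - x_j)'(y_i - y_j)
    DX = X[:, None, :] - X[None, :, :]
    DY = Y[:, None, :] - Y[None, :, :]
    C = np.sum(DX * DY, axis=2)
    D = distances(X)
    M = np.where(D > 0, W * Delta * C / np.where(D > 0, D, 1.0) ** 3, 0.0)
    U = -M
    np.fill_diagonal(U, M.sum(axis=1))
    return U

def dguttman(X, Y):                       # Gamma_X(Y), equation (25)
    return Vp @ (bmat(X) @ Y - umat(X, Y) @ X)

X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
Y = rng.standard_normal((n, p)); Y -= Y.mean(axis=0)
h = 1e-6
fd = (guttman(X + h * Y) - guttman(X)) / h          # finite-difference derivative
assert np.allclose(fd, dguttman(X, Y), atol=1e-4)   # matches (25)
assert np.allclose(dguttman(X, X), 0.0, atol=1e-12) # since U(X,X) = B(X)
```

The second assertion illustrates the consequence U(X,X) = B(X): the configuration X itself is always mapped to zero by its own derivative.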
Next we are interested in the eigenvalues and eigenvectors of Γx. Observe that we have interpreted it as an operator on the (n − 1)p dimensional
space of centered configurations. It consequently has (n − 1)p eigenvalues, not necessarily all distinct. If we consider Γx as an operator on the space of all n × p matrices, then it has p additional eigenvalues equal to zero. The corresponding eigensubspace contains the solutions of η(Y) = 0. Before proceeding, we have to single out one special case. If p = 1, then U(X,Y)X = B(X)Y, and thus Γx(Y) = 0. It is shown in de Leeuw and Heiser (1977) that the Guttman transform iterations converge in a finite number of steps if p = 1. This special case was already singled out by Guttman (1968). Defays (1978), Heiser (1981), and Hubert and Arabie (1986) show that one-dimensional scaling is essentially a combinatorial problem. Because either the result Γx(Y) = 0 or the fact of finite convergence stops all considerations having to do with rate of convergence, we assume from now on that p > 1.

Result 1. X is an eigenvector of Γx with eigenvalue zero. This follows directly from U(X,X) = B(X) and (25).

Result 2. Γx has simple structure, i.e., (n − 1)p linearly independent eigenvectors. This follows because Γx(Y) = λY can be written in the form (∇²ρ(X))Y = λVY, where ∇²ρ is the operator corresponding with the second partial derivatives of ρ, which is consequently symmetric. Thus the eigenvalues are real, and the eigenvectors Ys can be chosen such that they are orthogonal, i.e., such that tr Ys'VYt = δst, with δst the Kronecker delta.

Result 3. The eigenvalues of Γx are non-negative. This follows from the representation in the previous result. Because ρ is convex, the operator ∇²ρ is positive semidefinite.

Result 4. If X is a local minimum of σ, then all eigenvalues of Γx are less than or equal to one. This follows because if X is a local minimum, then ∇²σ(X) must be positive semidefinite.
And ∇²σ(X) is positive semidefinite if and only if I − Γx is positive semidefinite, i.e., if and only if all eigenvalues of Γx are less than or equal to one.

Result 5. If X is stationary, then Γx has ½p(p − 1) eigenvalues equal to one. To prove this, choose Y = XS, with S anti-symmetric, i.e., S = −S'. Then cij(X,Y) = tr X'AijXS = 0, and thus from (25) and (26) we have Γx(Y) = V⁺B(X)XS = XS = Y. The anti-symmetric matrix S can be chosen in ½p(p − 1) linearly independent ways.

6. A Small Example

The example we use has n = 4 objects, and has all dissimilarities δij, with i ≠ j, equal to 1/√6. The weights wij are all equal to one, and consequently normalization (5) holds. Of course this example does not constitute a realistic multidimensional scaling problem, because it is much too small. We use the example to illustrate what kind of stationary points we can expect
in a metric scaling problem, and how our simple algorithm will behave in a neighborhood of these stationary points. We do not intend to prove, in any sense, that our algorithm is better than other existing algorithms. In fact we think that the other algorithms will behave quite similarly. It seems to us that the example can be used to illustrate the great difficulties multidimensional scaling methods can encounter if the iterations are started at unfortunate places, and to illustrate the slow rate of convergence that seems to be typical for multidimensional scaling problems. Both the local minimum problem and the slow convergence problem will tend to become more serious if we increase the size of the problem. We first make a small list of stationary points. This list may not be exhaustive, but it contains some of the more interesting types of stationary points.

Stationary point 1. Take four points equally spaced on a line. This actually defines a whole family of stationary points, because any permutation of four equally spaced points will do. Moreover we can think of this solution as embedded in one-dimensional space, in two-dimensional space, and so on. If p = 1 we know that Γx = 0; we also know that the equally spaced points define the global minimum (de Leeuw and Stoop 1983). But, as we said before, we are really only interested in p > 1. If p = 2, i.e., if we have four points equally spaced on a line in two-space, then the eigenvalues of Γx are four zeros together with the eigenvalues of ¼B(X), which are 0, 1, 1.5, and 11/6 ≈ 1.83. Thus for p = 2 the four points on a line are not a local minimum; in fact they define a saddle-point. The function value at this saddle-point is 1/6 ≈ .1667.

Stationary point 2. Take three points at the corners of an equilateral triangle, and the fourth one in the centroid of the triangle. This is, again, a whole family of stationary points, all with loss function value (2 − √3)/4 ≈ .0670. For p = 2 the operator Γx has 8 eigenvalues.
Three are equal to zero, three are equal to one, and the remaining two are equal to a common value strictly between zero and one. Thus the triangle plus centroid defines a local minimum, but not an isolated one. There are three eigenvalues equal to one, and only one of them corresponds with the trivial unit eigenvalue of Result 5 in the previous section. If p = 3 we have the same 8 eigenvalues, plus the 4 eigenvalues of ¼B(X), which are 0, 1, 1, and 1.4641 in this case. Thus for p = 3 this stationary point is not a local minimum any more.

Stationary point 3. Take four points at the corners of a square. The loss is ½ − √2/3 ≈ .0286. For p = 2 the eigenvalues of Γx are three zeros, .4142, three times .5858, and one trivial unit eigenvalue. Thus we have a local minimum here that is isolated in the sense that the manifold of all rotations of the square is isolated from other stationary values. Again the square is not a local minimum for p = 3, although it remains stationary.
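The eigenvalue list for the square can be reproduced numerically. The stationary scale of the square is not stated explicitly above; the fixed-point condition Γ(X) = X with unit weights gives half-diagonal coordinates t = δ(2 + √2)/8 (our computation), and the listed eigenvalues .4142 and .5858 match √2 − 1 and 2 − √2. The sketch below (ours) assembles Γx as an np × np matrix from formula (25) and recovers this spectrum.

```python
import numpy as np

n, p = 4, 2
delta = 1.0 / np.sqrt(6.0)
Delta = delta * (1 - np.eye(n))
W = 1.0 - np.eye(n)

def distances(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def bmat(X):
    D = distances(X)
    S = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
    B = -S
    np.fill_diagonal(B, S.sum(axis=1))
    return B

V = -W.copy(); np.fill_diagonal(V, W.sum(axis=1))
Vp = np.linalg.pinv(V)

def umat(X, Y):                           # U(X,Y), equation (26)
    DX = X[:, None, :] - X[None, :, :]
    DY = Y[:, None, :] - Y[None, :, :]
    C = np.sum(DX * DY, axis=2)
    D = distances(X)
    M = np.where(D > 0, W * Delta * C / np.where(D > 0, D, 1.0) ** 3, 0.0)
    U = -M
    np.fill_diagonal(U, M.sum(axis=1))
    return U

def dguttman(X, Y):                       # Gamma_X(Y), equation (25)
    return Vp @ (bmat(X) @ Y - umat(X, Y) @ X)

t = delta * (2 + np.sqrt(2)) / 8          # stationary scale (our computation)
X = t * np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
assert np.allclose(Vp @ bmat(X) @ X, X)   # the square is indeed stationary

# assemble Gamma_X as an (np x np) matrix, column by column
J = np.zeros((n * p, n * p))
for k in range(n * p):
    E = np.zeros(n * p); E[k] = 1.0
    J[:, k] = dguttman(X, E.reshape(n, p)).ravel()
evc = np.linalg.eigvals(J)
assert np.allclose(evc.imag, 0.0, atol=1e-6)        # real spectrum (Result 2)
ev = np.sort(evc.real)
# three zeros, .4142, three times .5858, and one trivial unit eigenvalue
assert np.allclose(ev[:3], 0.0, atol=1e-6)
assert np.isclose(ev[3], np.sqrt(2) - 1, atol=1e-5)
assert np.allclose(ev[4:7], 2 - np.sqrt(2), atol=1e-5)
assert np.isclose(ev[7], 1.0, atol=1e-6)
```

The three zeros are the two translation directions plus the direction X itself (Result 1), and the single unit eigenvalue is the rotation direction of Result 5.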
Stationary point 4. Take four points at the corners of a regular tetrahedron. The loss function is now zero, which shows directly that this three-dimensional solution is certainly the global minimum. The eigenvalues of Γx are zero four times, 1/2 three times, 3/4 two times, and one three times. The unit eigenvalues correspond with the ½p(p − 1) trivial eigenvalues defining the manifold of rotations.

7. Rate of Convergence

The discussion in the previous sections shows that at least part of the difficulty with proving actual convergence of our iterations comes from the rotational indeterminacy of multidimensional scaling. Because of this rotational indeterminacy, Γx has at least ½p(p − 1) unit eigenvalues at a stationary point X. If we eliminate rotational indeterminacy, then we eliminate these difficulties. First, we call a stationary point isolated if Γx has exactly ½p(p − 1) unit eigenvalues. We call it an isolated local minimum if all other eigenvalues are strictly less than one. In the small example of the previous section, the four points at the corners of a square are an isolated local minimum for p = 2. For p = 3 the solution is neither isolated nor a local minimum. The equilateral triangle with centroid is a local minimum for p = 2, but not an isolated local minimum. The regular tetrahedron is an isolated local minimum for p = 3. At an isolated local minimum, we use the symbol κ for the largest eigenvalue less than one. We call it the level of the isolated local minimum.

Theorem 3: If Xk has an accumulation point which is an isolated local minimum with level κ, then εk+1 / εk → κ.

Proof: We want to apply the general theorems of Ostrowski and of Ortega and Rheinboldt referred to above. First, we eliminate rotational indeterminacy by defining a new sequence X̃k.
For each k the configuration X̃k is a rotation of Xk; moreover it is a specific rotation which identifies the configuration uniquely in the manifold of rotations. We can rotate to principal components, for example, with some special provision for equal eigenvalues. Because X̃k is a rotation of Xk, the sequences σk, ρk, ηk, λk generated by this modified algorithm are exactly the same. So is εk² = ηk+1² + ηk² − 2ρk, although now εk ≠ η(X̃k+1 − X̃k). The transformation which maps X̃k to X̃k+1 has a derivative Γ̃x at a stationary point with exactly the same eigenvalues as Γx, except for the ½p(p − 1) unit eigenvalues, which are replaced by zeros. Thus κ < 1 is actually the largest eigenvalue of Γ̃x, so that X̃k converges linearly, with rate κ. Q.E.D.
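The linear rate of Theorem 3 is easy to observe at the square, whose level is κ = .5858 (which matches 2 − √2; our reading of the value listed in Section 6). The sketch below (ours; perturbation size and seed are arbitrary choices) starts close to the stationary square and measures the ratio of successive εk.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 4, 2
delta = 1.0 / np.sqrt(6.0)
Delta = delta * (1 - np.eye(n))
W = 1.0 - np.eye(n)

def distances(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def bmat(X):
    D = distances(X)
    S = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
    B = -S
    np.fill_diagonal(B, S.sum(axis=1))
    return B

V = -W.copy(); np.fill_diagonal(V, W.sum(axis=1))
Vp = np.linalg.pinv(V)
eta = lambda X: np.sqrt(np.trace(X.T @ V @ X))

# start close to the isolated local minimum: the square at its stationary scale
t = delta * (2 + np.sqrt(2)) / 8          # our computation of the stationary scale
X = t * np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
X = X + 1e-3 * rng.standard_normal((n, p))
X -= X.mean(axis=0)

eps = []
for k in range(30):
    Xn = Vp @ bmat(X) @ X                 # iteration (13)
    eps.append(eta(X - Xn))               # epsilon_k = eta(X_k - X_{k+1})
    X = Xn
ratio = eps[-1] / eps[-2]
kappa = 2 - np.sqrt(2)                    # level of the square, approx .5858
assert abs(ratio - kappa) < 0.02          # epsilon ratios converge to kappa
```

After a few dozen iterations the ratio εk+1/εk is already indistinguishable from κ, illustrating the slow linear convergence the paper emphasizes.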
If the stationary point is a non-isolated local minimum, or not even a local minimum, then Theorem 3 does not say anything about the rate of convergence. This does not seem to be a very important restriction of generality in practice, because it seems difficult to get our algorithm to converge to a non-isolated local optimum. We illustrate this with the small example from the previous section. The equilateral triangle with centroid is a non-isolated local minimum for p = 2. Start the iterations from a small perturbation of this stationary point. With a very close start, we have convergence to the stationary value with 10-decimal precision within 5 iterations. The ratio εk+1² / εk² continues to increase, however, although extremely slowly. It is .99 at iteration 8, and .999 at iteration 10. We have stopped the process at iteration 30, at which point we are still equally close to the stationary value, and the ratio is still increasing. We have restarted the iterations somewhat further away from the stationary point. Now σ and ε² decrease considerably during the first iterations, until around iteration 60 the value of σ drops below that of the equilateral triangle with centroid and ε² begins to rise, causing a ratio εk+1 / εk larger than one. This situation continues for a very long time, until iteration 250. In the meantime ε² keeps increasing, while σ, which starts dropping rapidly at iteration 225, moves well below the triangle-with-centroid value. Convergence now becomes rapid, and within 20 iterations the configuration converges to the four corners of the square, which is an isolated local minimum (in fact the global minimum) for p = 2. At iteration 270 the ratio εk+1² / εk² is for all practical purposes equal to κ².
Thus we have started close to a non-isolated local minimum. The algorithm has great difficulty in getting away from it, but ultimately succeeds. With a restart even further away the algorithm has difficulty escaping only until iteration 30. The ε² then decreases rapidly again, and we have convergence to the square in 55 iterations. It is now not difficult to conjecture that in the first start, in which we seemed to converge on the equilateral triangle, we merely did not continue long enough. After hundreds or perhaps thousands of iterations we would converge on the square again, if we continued.
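This escape behavior can be replicated. In the sketch below (ours; the perturbation size, seed, and iteration budget are arbitrary choices, and the stationary scales follow from the fixed-point condition, our computation) the iterations start near the triangle-with-centroid and are run long enough that the loss ends at or below the triangle-with-centroid level, typically at the square, the global minimum for p = 2.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 4, 2
delta = 1.0 / np.sqrt(6.0)
Delta = delta * (1 - np.eye(n))
W = 1.0 - np.eye(n)

def distances(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def stress(X):
    return 0.5 * np.sum(W * (Delta - distances(X)) ** 2)

def bmat(X):
    D = distances(X)
    S = np.where(D > 0, W * Delta / np.where(D > 0, D, 1.0), 0.0)
    B = -S
    np.fill_diagonal(B, S.sum(axis=1))
    return B

V = -W.copy(); np.fill_diagonal(V, W.sum(axis=1))
Vp = np.linalg.pinv(V)

# equilateral triangle with centroid, at its stationary scale (our computation)
r = delta * (1 + np.sqrt(3)) / 4
ang = 2 * np.pi * np.arange(3) / 3
tri = np.vstack([r * np.column_stack([np.cos(ang), np.sin(ang)]), [[0.0, 0.0]]])
assert np.allclose(Vp @ bmat(tri) @ tri, tri)         # stationary
assert np.isclose(stress(tri), (2 - np.sqrt(3)) / 4)  # loss approx .0670

X = tri + 0.05 * rng.standard_normal((n, p))          # perturbed start
X -= X.mean(axis=0)
vals = [stress(X)]
for k in range(3000):
    X = Vp @ bmat(X) @ X                              # iteration (13)
    vals.append(stress(X))
# the loss decreases monotonically (Theorem 1) ...
assert all(v1 <= v0 + 1e-12 for v0, v1 in zip(vals, vals[1:]))
# ... and ends at or below the triangle-with-centroid level; the square,
# with loss 1/2 - sqrt(2)/3 (approx .0286), is the usual destination
assert vals[-1] <= (2 - np.sqrt(3)) / 4 + 1e-6
```

How many iterations the escape takes depends strongly on how close the start is, which is exactly the phenomenon described above.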
8. Nonmetric Scaling

We now introduce some additional terminology in order to define nonmetric scaling. The loss function for a nonmetric scaling problem can be written as

σ(X,Δ̂) = ½ Σi Σj wij (δ̂ij − dij(X))², (27)

where all symbols have the same meaning as before, except for Δ̂ = {δ̂ij}, which now contains the disparities and no longer the dissimilarities. In nonmetric scaling the loss function (27) must be minimized over both the configuration X and the disparities Δ̂, where the disparities are restricted by the ordinal information in the data. Thus the disparities are additional parameters in the nonmetric scaling problem over which we minimize, in contrast to the metric scaling problem in which the dissimilarities are fixed numbers. Thus we can choose Δ̂ freely, provided it is feasible, which means in this case monotone with the given dissimilarities. In this sense nonmetric scaling generalizes metric scaling, in which Δ̂ must not only be monotone with but in fact identical to the dissimilarities. In order to prevent trivial solutions, we must also impose a normalization condition on the disparities, for instance that their sum of squares is unity, or that their variance is unity. As indicated in de Leeuw (1977) we can think of nonmetric scaling algorithms in two different ways. First, we can interpret them as alternating least squares methods, which alternate one gradient step (or Guttman transform) with a monotone regression step. In the gradient step, the loss function (27) is minimized (or rather decreased) by choosing a new X; in the monotone regression step (27) is decreased by choosing a new Δ̂, consistent with the ordinal constraints. Of course it is possible, and perhaps sometimes advisable, to introduce some obvious modifications of this algorithm. Instead of alternating one Guttman transform with one monotone regression we can perform more Guttman transforms between monotone regressions.
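The monotone regression step itself is classically implemented with the pool-adjacent-violators algorithm. A minimal sketch (our helper, not code from the paper): the current distances are taken in the rank order of the dissimilarities, fitted by a monotone least-squares function, and the result (after rescaling to the chosen normalization) gives the new disparities.

```python
import numpy as np

def pava(y, w=None):
    """Weighted least-squares monotone (nondecreasing) regression by
    pool-adjacent-violators; returns the fitted values, block by block."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    vals, wts, sizes = [], [], []
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); sizes.append(1)
        # pool while the last two blocks violate monotonicity
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wv = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / wv
            wts[-2] = wv
            sizes[-2] += sizes[-1]
            del vals[-1], wts[-1], sizes[-1]
    return np.repeat(vals, sizes)

# distances listed in the rank order of the dissimilarities; the monotone
# fit pools the adjacent violating pair (3, 2) into its mean 2.5
fit = pava([1.0, 3.0, 2.0, 4.0])
assert np.allclose(fit, [1.0, 2.5, 2.5, 4.0])
assert np.all(np.diff(fit) >= 0)          # the disparities are monotone
```

The pooled blocks produced here are exactly the "partitioning into blocks" referred to below when the smoothness of the projection X → Δ̂(X) is discussed.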
These Guttman steps have a rate of convergence which is described by our results above. In the alternating least squares interpretation, the loss function (27) is clearly interpreted as a function of two sets of parameters, the coordinates and the disparities. Second, it is also possible to view the loss function in nonmetric scaling as a function of X alone. This is the original definition of stress as proposed by Kruskal (1964a, 1964b). If σ(X,Δ̂) is the "stress" used in the first approach, then Kruskal's stress σ(X) is the minimum of σ(X,Δ̂) over all feasible disparities Δ̂. Thus Δ̂ is "projected out," and the remaining function depends only on X. This elementary fact has caused a great deal of confusion in the early days of multidimensional scaling. The confusion was made even bigger
by the fact that the derivative of σ(X) is the same as the partial derivative of σ(X, Δ̂) with respect to X, evaluated at the optimal Δ̂(X). Thus σ(X) = σ(X, Δ̂(X)), but also ∇σ(X) = ∇X σ(X, Δ̂(X)). This last result is due to Kruskal (1971), who used it to show that σ(X) is differentiable (whenever σ(X, Δ̂) is differentiable). It is also used by de Leeuw (1977) to show that the iteration Xk+1 = V⁺B(Xk)Xk is still a convergent algorithm if we define B(X) as in (10), but with Δ̂(X) substituted for Δ. Thus our qualitative convergence results remain true without modification, both in the alternating least squares and in the (sub)differential interpretation. Unfortunately the transformation X → Δ̂(X) is generally not differentiable. In fact, to find Δ̂(X) we have to project D(X) orthogonally on a polyhedral convex cone, which implies that the transformation is piecewise linear in the distances. The linear pieces are joined in a continuous but nonsmooth way (compare Kruskal 1971). If we have convergence of the nonmetric scaling algorithm to a point where the cone-projection is locally a constant linear map, then our convergence results apply. In monotone regression terms this means that for all points in an open neighborhood of the solution we are converging to, monotone regression finds the same partitioning into blocks in its computation of the disparities, i.e., for all these points it projects on the same face of the cone. In general, however, we cannot exclude the possibility that the convergence is to a point on the boundary of two regions with different projection maps. These linear maps again correspond to the partitioning into blocks found by the monotone regression algorithm. In this case our results do not apply, and they must be adapted.
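The piecewise linear nature of the cone projection can be made concrete with a small self-contained check (our notation, not the paper's). For the particular vector y below, monotone regression pools it into the blocks {1,2}, {3,4}, {5}; on the face of the cone determined by that partitioning, the projection is the fixed block-averaging matrix A, and it remains the same linear map for small perturbations of y that do not change the blocks.

```python
import numpy as np

def pava(y):
    # Pool-adjacent-violators: least squares monotone (isotonic) fit.
    means, sizes = [], []
    for v in np.asarray(y, dtype=float):
        means.append(v)
        sizes.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m1, s1 = means.pop(), sizes.pop()
            m0, s0 = means.pop(), sizes.pop()
            means.append((m0 * s0 + m1 * s1) / (s0 + s1))
            sizes.append(s0 + s1)
    return np.repeat(means, sizes)

y = np.array([3.0, 1.0, 5.0, 4.0, 6.0])
fit = pava(y)            # blocks {1,2}, {3,4}, {5} -> [2, 2, 4.5, 4.5, 6]
A = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],   # averaging within each block
              [0.5, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
y2 = y + np.array([0.05, -0.02, 0.01, -0.03, 0.0])   # same face of the cone
```

At the boundary between two faces the active block partition changes, and with it the matrix A; that is precisely the situation in which the convergence results above must be adapted.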
9. Summary and Conclusions

We have shown in this paper that our basic majorization algorithm for multidimensional scaling converges to a stationary point, if convergence is defined using the asymptotic regularity of the generated sequence, i.e., in terms of the fact that the distance between two consecutive members of the sequence converges to zero. We have also shown that if one of the accumulation points of the sequence is an isolated local minimum, then convergence is linear. This condition seems to be all that is needed for practical applications. It follows from our small numerical example that it is possible for actual computer programs to stop at saddle points which are not local minima, and that in the neighborhood of such saddle points convergence may look sublinear. Our experience with many practical examples indicates that, even near isolated local minima, the convergence rate of the algorithm we have described is very often close to unity. Thus although convergence is theoretically linear, it can be extremely slow.
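The near-unit linear rate is easy to observe numerically. The sketch below (metric case, unit weights, hypothetical noisy dissimilarities; `guttman_step` is our illustrative name) iterates the Guttman transform and watches the ratio of successive step lengths, which settles toward the linear convergence rate at the limit point.

```python
import numpy as np

def guttman_step(X, delta):
    # One metric Guttman-transform step with unit weights: X -> (1/n) B(X) X.
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    ratio = np.divide(delta, D, out=np.zeros_like(D), where=D > 0)
    np.fill_diagonal(ratio, 0.0)
    B = -ratio
    np.fill_diagonal(B, ratio.sum(axis=1))
    return B @ X / n

def stress(X, delta):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return 0.5 * np.sum((delta - D) ** 2)

rng = np.random.default_rng(1)
Y = rng.normal(size=(10, 2))                          # hypothetical "true" configuration
delta = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
delta += 0.1 * np.abs(rng.normal(size=delta.shape))   # perturbed dissimilarities
delta = 0.5 * (delta + delta.T)
np.fill_diagonal(delta, 0.0)

X = Y + 0.5 * rng.normal(size=Y.shape)                # perturbed start
steps, sig = [], []
for _ in range(60):
    Xn = guttman_step(X, delta)
    steps.append(np.linalg.norm(Xn - X))
    sig.append(stress(Xn, delta))
    X = Xn
    if steps[-1] < 1e-10:     # stop before rounding noise dominates the steps
        break
rates = [steps[k + 1] / steps[k] for k in range(len(steps) - 1)]
# rates[-1] approximates the linear convergence rate near the limit point
```

Stress never increases along the iterations (the majorization guarantee), and the step-length ratios stabilize at a constant strictly below one, often uncomfortably close to it.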
Consequently it becomes very important, at least in some cases, to look for ways to speed up linear convergence, or even for ways to attain supralinear convergence. These acceleration devices will be investigated in subsequent publications. Simple ways to speed up linear convergence were already investigated by de Leeuw and Heiser (1980); more complicated ones were studied by Stoop and de Leeuw (1983). In both papers the basic convergence of the Guttman-transform iterations is preserved, but a specially developed step-size procedure is added to the algorithm. Both on theoretical and on empirical grounds one can argue that step-size procedures in MDS, including the ones devised by Kruskal (1964a, 1964b), will generally double the speed of convergence, i.e., halve the number of iterations required for a given precision. In the case of very slow linear convergence, this does not really help very much. Of course the formulae derived in Section 5 can be used quite easily to derive the exact form of Newton's method applicable to multidimensional scaling, but our numerical experience so far suggests that Newton's method must also be used with much care in this context. Our main conclusion is that the majorization method is reliable and very simple, but that it is generally slow, and sometimes intolerably slow. It seems to us that an additional conclusion is that one should always study the second derivatives of the loss function at the stopping point of the algorithm. This information indicates whether we have stopped at a local minimum, and how much improvement we can expect in various directions. Such information on improvement can be used either to try to make one or more Newton steps, or to derive information on the stability of the solution.

References

BORG, I. (1981), Anwendungsorientierte Multidimensionale Skalierung, Berlin, West Germany: Springer.
DEFAYS, D.
(1978), "A Short Note on a Method of Seriation," British Journal of Mathematical and Statistical Psychology, 31, DE LEEUW, J. (1977), "Applications of Convex Analysis to Multidimensional Scaling," in Recent Developments in Statistics, eds. J.R. Barra, F. Brodeau, G. Romier and B. van Cutsem, Amsterdam: North Holland, DE LEEUW, J. (1984), "Differentiability of Kruskal's Stress at a Local Minimum." Psychometrika, 49, DE LEEUW, J., and HEISER, W. J. (1977), "Convergence of Correction Matrix Algorithm for Multidimensional Scaling," in Geometric Representations of Relational Data, ed. J. C. Lingoes, Ann Arbor: Mathesis Press, DE LEEUW, J., and HEISER, W. J, (1980), "Multidimensional Scaling with Restrictions on the Configuration," in Multivariate Analysis, Vol. V, ed. P. R. Krishnaiah, Amsterdam: North Holland.
DE LEEUW, J., and STOOP, I. (1984), "Upper Bounds for Kruskal's Stress," Psychometrika, 49.
GUTTMAN, L. (1968), "A General Nonmetric Technique for Finding the Smallest Coordinate Space for a Configuration of Points," Psychometrika, 33.
HARTMANN, W. (1979), Geometrische Modelle zur Analyse empirischer Daten, Berlin: Akademie Verlag.
HEISER, W. J. (1981), Unfolding Analysis of Proximity Data, Unpublished Doctoral Dissertation, University of Leiden.
HUBERT, L., and ARABIE, P. (1986), "Unidimensional Scaling and Combinatorial Optimization," in Multidimensional Data Analysis, eds. J. de Leeuw et al., Leiden: DSWO Press.
KRUSKAL, J. B. (1964a), "Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis," Psychometrika, 29.
KRUSKAL, J. B. (1964b), "Nonmetric Multidimensional Scaling: A Numerical Method," Psychometrika, 29.
KRUSKAL, J. B. (1971), "Monotone Regression: Continuity and Differentiability Properties," Psychometrika, 36.
KRUSKAL, J. B., and WISH, M. (1978), Multidimensional Scaling, Newbury Park, CA: Sage.
LINGOES, J. C., and ROSKAM, E. E. (1973), "A Mathematical and Empirical Comparison of Two Multidimensional Scaling Algorithms," Psychometrika, 38, Monograph Supplement.
ORTEGA, J. M., and RHEINBOLDT, W. C. (1970), Iterative Solution of Nonlinear Equations in Several Variables, New York: Academic Press.
OSTROWSKI, A. M. (1966), Solution of Equations and Systems of Equations, New York: Academic Press.
STOOP, I., and DE LEEUW, J. (1983), The Stepsize in Multidimensional Scaling Algorithms, Paper presented at the Third European Meeting of the Psychometric Society, Jouy-en-Josas, France, July 5-8, 1983.
More informationChapter 2: Unconstrained Extrema
Chapter 2: Unconstrained Extrema Math 368 c Copyright 2012, 2013 R Clark Robinson May 22, 2013 Chapter 2: Unconstrained Extrema 1 Types of Sets Definition For p R n and r > 0, the open ball about p of
More informationLecture 2 - Unconstrained Optimization Definition[Global Minimum and Maximum]Let f : S R be defined on a set S R n. Then
Lecture 2 - Unconstrained Optimization Definition[Global Minimum and Maximum]Let f : S R be defined on a set S R n. Then 1. x S is a global minimum point of f over S if f (x) f (x ) for any x S. 2. x S
More informationA misère-play -operator
A misère-play -operator Matthieu Dufour Silvia Heubach Urban Larsson arxiv:1608.06996v1 [math.co] 25 Aug 2016 July 31, 2018 Abstract We study the -operator (Larsson et al, 2011) of impartial vector subtraction
More informationMAXIMA AND MINIMA CHAPTER 7.1 INTRODUCTION 7.2 CONCEPT OF LOCAL MAXIMA AND LOCAL MINIMA
CHAPTER 7 MAXIMA AND MINIMA 7.1 INTRODUCTION The notion of optimizing functions is one of the most important application of calculus used in almost every sphere of life including geometry, business, trade,
More information08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms
(February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops
More informationSemidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 2
Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 2 Instructor: Farid Alizadeh Scribe: Xuan Li 9/17/2001 1 Overview We survey the basic notions of cones and cone-lp and give several
More information