Chapter 5: Norms
J. Fessler, December 17, 2017 (final version)

Contents
  5.0 Vector norms
      Properties of norms
      Norm notation
      Unitarily invariant norms
      Inner products
  Matrix norms
      Induced matrix norms
      Properties of matrix norms
      Spectral radius
  Procrustes analysis
      Complex case, with translation
  Convergence of sequences of vectors and matrices
  Summary

Introduction

This chapter discusses vector norms and matrix norms in more generality and applies them to the Procrustes problem. The reading for this chapter is [1].

The Tikhonov regularized least-squares problem in Ch. 4 illustrates the two primary uses of norms:
$$\hat{x} = \arg\min_x \underbrace{\|Ax - y\|_2^2}_{\text{distance: how far}} + \beta \underbrace{\|x\|_2^2}_{\text{size: how big}}.$$

5.0 Vector norms

So far the only vector norm discussed in these notes has been the common Euclidean norm. Many other vector norms are also important in signal processing.

Define. [1, p. 57] A norm on a vector space $V$ defined over a field $F$ is a function from $V$ to $[0, \infty)$ that satisfies the following properties $\forall x, y \in V$:
- $\|x\| \geq 0$ (nonnegative)
- $\|x\| = 0$ iff $x = 0$ (positive)
- $\|\alpha x\| = |\alpha| \, \|x\|$ for all scalars $\alpha \in F$ in the field (homogeneous)
- $\|x + y\| \leq \|x\| + \|y\|$ (triangle inequality)

Examples of vector norms

For $1 \leq p < \infty$, the $\ell_p$ norm is
$$\|x\|_p \triangleq \Big( \sum_i |x_i|^p \Big)^{1/p}. \tag{5.1}$$

The vector 2-norm or Euclidean norm is the case $p = 2$: $\|x\|_2 \triangleq \sqrt{\sum_i |x_i|^2}$.

The 1-norm or Manhattan norm is the case $p = 1$: $\|x\|_1 \triangleq \sum_i |x_i|$.

The max norm or infinity norm or $\ell_\infty$ norm is
$$\|x\|_\infty \triangleq \sup\{ |x_1|, |x_2|, \ldots \}, \tag{5.2}$$
where sup denotes the supremum (least upper bound) of a set. One can show [2, Prob. 2.12] that
$$\|x\|_\infty = \lim_{p \to \infty} \|x\|_p. \tag{5.3}$$
For the vector space $F^N$, the supremum is simply a maximum:
$$\|x\|_\infty = \max\{ |x_1|, \ldots, |x_N| \}. \tag{5.4}$$

For quantifying sparsity, it is useful to note that
$$\lim_{p \to 0} \|x\|_p^p = \sum_i I_{\{x_i \neq 0\}} \triangleq \|x\|_0, \tag{5.5}$$
where $I_{\{\cdot\}}$ denotes the indicator function that is unity if the argument is true and zero if false. However, the "0-norm" $\|x\|_0$ is not a vector norm because it does not satisfy at least one of the conditions of the norm definition above. The proper name for $\|x\|_0$ is counting measure.

Which of the four properties of a vector norm does $\|\cdot\|_0$ satisfy?
A: 1,2 B: 1,3 C: 1,2,3 D: 1,2,4 E: 1,3,4 ??

Practical implementation
For the preceding examples, in JULIA use: vecnorm(v, p), vecnorm(v, 2), vecnorm(v, 1), vecnorm(v, Inf), vecnorm(v, 0).
Caution. Using norm(A, Inf) and vecnorm(A, Inf) yields different results in general! (See matrix norms below.) It is clearer to use vecnorm().
Caution. For $p < 1$, $\|\cdot\|_p$ is not a norm, though it is sometimes used in practical problems, and vecnorm will evaluate (5.1) for any $p$.
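As a quick numeric illustration, here is a minimal Julia sketch of these vector norms. It assumes Julia 1.x with the LinearAlgebra standard library, where the function norm plays the role of the older vecnorm used in these notes.

```julia
using LinearAlgebra   # `norm` here corresponds to `vecnorm` in older Julia

v = [3.0, -4.0, 0.0]
norm(v, 2)    # Euclidean norm: 5.0
norm(v, 1)    # Manhattan norm: 7.0
norm(v, Inf)  # max norm: 4.0
norm(v, 0)    # counting "norm": number of nonzeros = 2
norm(v, 0.5)  # evaluates (5.1) even though p < 1 is not a norm
```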

Properties of norms

Let $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ be any two vector norms on a finite-dimensional space. Then there exist finite positive constants $C_m$ and $C_M$ (that depend on $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$) such that
$$C_m \|x\|_\alpha \leq \|x\|_\beta \leq C_M \|x\|_\alpha, \quad \forall x. \tag{5.6}$$
In a sense, then, all norms are equivalent to within constant factors.

For any vector norm, the reverse triangle inequality is
$$\big| \|x\| - \|y\| \big| \leq \|x - y\|, \quad \forall x, y \in V.$$

All vector norms are convex functions:
$$\|\alpha x + (1 - \alpha) z\| \leq \alpha \|x\| + (1 - \alpha) \|z\|, \quad \forall \alpha \in [0, 1],\ x, z \in V.$$
This fact is easy to prove using the triangle inequality and the homogeneity property (HW). For $p > 1$, $f(x) \triangleq \|x\|_p^p$ is strictly convex.

Norm notation

Some math literature uses $|x|$ instead of $\|x\|$ to denote a vector norm. That notation should be avoided for matrices, where $|A|$ often denotes the determinant of $A$. Sometimes one must determine from context what $|\cdot|$ means in such literature.

Unitarily invariant norms

Some vector norms have the following useful property.

Define. A vector norm $\|\cdot\|$ on $F^N$ is unitarily invariant iff for every unitary matrix $U \in F^{N \times N}$:
$$\|Ux\| = \|x\|, \quad \forall x \in F^N.$$

Example. The Euclidean norm $\|\cdot\|_2$ on $F^N$ is unitarily invariant, because for any unitary $U$ (see p. 1.42):
$$\|Ux\|_2 = \sqrt{(Ux)'(Ux)} = \sqrt{x'U'Ux} = \sqrt{x'x} = \|x\|_2, \quad \forall x \in F^N.$$
As noted previously, this property is related to Parseval's theorem.

Example. $\|\cdot\|_1$ is not unitarily invariant. If $U = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$ and $x = e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, then $\|x\|_1 = 1$ but $\|Ux\|_1 = \sqrt{2}$.
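A minimal Julia check of this example (LinearAlgebra assumed); the unitary matrix is the one from the example above.

```julia
using LinearAlgebra

U = [1 1; -1 1] / sqrt(2)     # a unitary (rotation) matrix
x = [1.0, 0.0]

norm(U * x, 2) ≈ norm(x, 2)   # true: the 2-norm is unitarily invariant
norm(U * x, 1), norm(x, 1)    # (≈ 1.414, 1.0): the 1-norm is not
```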

Inner products

Most of the vector spaces used in this course are inner product spaces, meaning a vector space with an associated inner product operation.

Define. For a vector space $V$ over the field $F$, an inner product operation is a function $\langle \cdot, \cdot \rangle : V \times V \to F$ that must satisfy the following axioms $\forall x, y, z \in V,\ \alpha \in F$:
- $\langle x, y \rangle = \overline{\langle y, x \rangle}$ (Hermitian symmetry)
- $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ (additivity)
- $\langle \alpha x, y \rangle = \alpha \langle x, y \rangle$ (scaling)
- $\langle x, x \rangle \geq 0$ and $\langle x, x \rangle = 0$ iff $x = 0$ (positive definite)

Examples of inner products

Example. For vectors in $F^N$, the usual inner product is
$$\langle x, y \rangle = \sum_{n=1}^N x_n y_n^*.$$

Example. For the (infinite dimensional) vector space of square integrable functions on the interval $[a, b]$, the following integral is a valid inner product:
$$\langle f, g \rangle = \int_a^b f(t)\, g^*(t)\, dt.$$

Example. For two matrices $A, B \in F^{M \times N}$ (a vector space!), the Frobenius inner product is:
$$\langle A, B \rangle = \mathrm{trace}\{AB'\}.$$
Exercise. Verify the four properties above for these inner product examples.

Properties of inner products

Bilinearity:
$$\Big\langle \sum_i \alpha_i x_i, \sum_j \beta_j y_j \Big\rangle = \sum_i \sum_j \alpha_i \beta_j^* \langle x_i, y_j \rangle, \quad \forall \{x_i\}, \{y_j\} \subset V.$$

Any valid vector inner product induces a valid vector norm:
$$\|x\| = \sqrt{\langle x, x \rangle}. \tag{5.7}$$

A vector norm satisfies the parallelogram law
$$\frac{1}{2}\big( \|x + y\|^2 + \|x - y\|^2 \big) = \|x\|^2 + \|y\|^2, \quad \forall x, y \in V,$$
iff it is induced by an inner product via (5.7). The required inner product is
$$\langle x, y \rangle \triangleq \frac{1}{4}\big( \|x + y\|^2 - \|x - y\|^2 + \imath \|x + \imath y\|^2 - \imath \|x - \imath y\|^2 \big)
= \frac{\|x + y\|^2 - \|x\|^2 - \|y\|^2}{2} + \imath\, \frac{\|x + \imath y\|^2 - \|x\|^2 - \|y\|^2}{2}.$$
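A small Julia sketch (real-valued case, LinearAlgebra assumed) checking the trace form of the Frobenius inner product and the parallelogram law for a norm that is induced by an inner product versus one that is not.

```julia
using LinearAlgebra

A = randn(3, 2); B = randn(3, 2)
sum(A .* B) ≈ tr(A * B')   # Frobenius inner product: elementwise sum = trace form (real case)

x = randn(4); y = randn(4)
(norm(x + y)^2 + norm(x - y)^2) / 2 ≈ norm(x)^2 + norm(y)^2              # true: 2-norm is induced
(norm(x + y, 1)^2 + norm(x - y, 1)^2) / 2 ≈ norm(x, 1)^2 + norm(y, 1)^2  # generally false: 1-norm is not
```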

The Cauchy-Schwarz inequality (or Schwarz or Cauchy-Bunyakovsky-Schwarz inequality) states:
$$|\langle x, y \rangle| \leq \|x\| \, \|y\| = \sqrt{\langle x, x \rangle}\, \sqrt{\langle y, y \rangle}, \quad \forall x, y \in V, \tag{5.8}$$
for a norm induced by an inner product via (5.7), with equality iff $x$ and $y$ are linearly dependent.

In an inner product space on $\mathbb{R}$, is $\langle x, y \rangle \leq \|x\| \, \|y\|$?
A: Yes, always B: Not always C: Never ??

Proof of Cauchy-Schwarz inequality for $F^N$

For any $x, y \in F^N$ let $A = [x\ y]$, so
$$A'A = \begin{bmatrix} x'x & x'y \\ y'x & y'y \end{bmatrix}.$$
$A'A$ is Hermitian symmetric (and positive semidefinite, being a Gram matrix) $\Rightarrow$ its eigenvalues are all real and nonnegative
$\Rightarrow \det\{A'A\} \geq 0 \Rightarrow (x'x)(y'y) - (y'x)(x'y) \geq 0 \Rightarrow |x'y|^2 \leq (x'x)(y'y) = \|x\|_2^2 \|y\|_2^2$.
Taking the square root of both sides yields the inequality. We used the fact that
$$(y'x)(x'y) = \langle x, y \rangle \langle y, x \rangle = \langle x, y \rangle (\langle x, y \rangle)^* = |\langle x, y \rangle|^2.$$

Angle between vectors

Define. The angle $\theta$ between two nonzero vectors $x, y \in V$ is defined by
$$\cos\theta = \frac{|\langle x, y \rangle|}{\|x\| \, \|y\|} \Rightarrow \theta \in [0, \pi/2].$$
The Cauchy-Schwarz inequality is equivalent to the statement $\cos\theta \leq 1$.

What is the angle between two orthogonal vectors?
A: 0 B: 1 C: π/2 D: π ??

An inner product for random variables

For two real, zero-mean random variables $X, Y$ defined on a joint probability space, a natural inner product is $\langle X, Y \rangle \triangleq \mathrm{E}[XY]$. (Keep in mind that random variables are functions.) With this definition, the corresponding norm is
$$\|X\| = \sqrt{\langle X, X \rangle} = \sqrt{\mathrm{E}[X^2]} = \sigma_X,$$
the standard deviation of $X$. Here, the Cauchy-Schwarz inequality is equivalent to the usual bound on the correlation coefficient:
$$\rho_{X,Y} \triangleq \frac{\mathrm{E}[XY]}{\sigma_X \sigma_Y} \Rightarrow |\rho_{X,Y}| \leq 1.$$
With this definition of inner product, what types of random variables are orthogonal? Pairs of random variables that are uncorrelated, i.e., where $\mathrm{E}[XY] = 0$.

Matrix norms

Also important are matrix norms; these quantify how large the elements of a matrix are.

Define. [1, p. 59] A norm on the vector space of matrices $F^{M \times N}$ is a function from $F^{M \times N}$ to $[0, \infty)$ that satisfies the following properties $\forall A, B \in F^{M \times N}$:
- $\|A\| \geq 0$ (nonnegative)
- $\|A\| = 0$ iff $A = 0_{M \times N}$ (positive)
- $\|\alpha A\| = |\alpha| \, \|A\|$ for all scalars $\alpha \in F$ in the field (homogeneous)
- $\|A + B\| \leq \|A\| + \|B\|$ (triangle inequality)

Because the set of all $M \times N$ matrices $F^{M \times N}$ is itself a vector space, matrix norms are simply vector norms for that space. So at first having a new definition might seem to have modest utility. However, many, but not all, matrix norms are sub-multiplicative, also called consistent [1, p. 61], meaning that they satisfy the following inequality:
$$\|AB\| \leq \|A\| \, \|B\|, \quad \forall A, B. \tag{5.9}$$
These notes use the triple-bar notation $|||A|||$ to distinguish such sub-multiplicative matrix norms from the ordinary matrix norms on the vector space $F^{M \times N}$ that need not satisfy this extra condition.

Examples of matrix norms

The max norm on $F^{M \times N}$ is the element-wise maximum:
$$\|A\|_{\max} \triangleq \max_{i,j} |a_{ij}|.$$
This norm is somewhat like the infinity norm for vectors. One can compute it in JULIA using vecnorm(A, Inf). However, this differs completely from norm(A, Inf), which computes $\|A\|_\infty$ described below. The max norm is a matrix norm on the vector space $F^{M \times N}$, but it does not satisfy the sub-multiplicative condition (5.9), so it is of limited use. Most of the norms of interest in signal processing are sub-multiplicative, so such matrix norms are our primary focus hereafter.

The Frobenius norm (aka Hilbert-Schmidt norm) is defined on $F^{M \times N}$ by
$$\|A\|_F \triangleq \sqrt{\sum_{m=1}^M \sum_{n=1}^N |a_{mn}|^2} = \sqrt{\mathrm{trace}\{A'A\}} = \sqrt{\mathrm{trace}\{AA'\}}, \tag{5.10}$$
and is also called the Schur norm and Schatten 2-norm. It is a very easy norm to compute. The equalities related to trace are a HW problem.
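A short Julia sketch (LinearAlgebra assumed) contrasting the element-wise max norm with the Frobenius norm and its trace identity (5.10). In Julia 1.x, norm(A) on a matrix gives the Frobenius norm and norm(A, Inf) gives the element-wise maximum, the role vecnorm played in older Julia.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]

maximum(abs.(A))      # max norm ‖A‖_max = 4.0
norm(A)               # Frobenius norm ≈ 5.477
sqrt(tr(A' * A))      # same value, via the trace identity (5.10)
```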

To relate the Frobenius norm of a matrix to its singular values:
$$\|A\|_F = \sqrt{\mathrm{trace}\{AA'\}} = \sqrt{\mathrm{trace}\{U_r \Sigma_r V_r' V_r \Sigma_r U_r'\}} = \sqrt{\mathrm{trace}\{\Sigma_r \Sigma_r U_r' U_r\}} = \sqrt{\mathrm{trace}\{\Sigma_r^2\}} = \sqrt{\textstyle\sum_{k=1}^r \sigma_k^2}.$$
This norm is invariant to unitary transformations [3, p. 442], because of the trace property (1.13).

This norm is induced by the Frobenius inner product. It is not induced by any vector norm on $F^N$ (see next page) [4], but nevertheless it is compatible with the Euclidean vector norm because
$$\|Ax\|_2 \leq \|A\|_F \, \|x\|_2. \tag{5.11}$$
However, this upper bound is not tight in general. (It is tight for rank-1 matrices only.) By combining (5.11) with the definition of matrix multiplication, one can show easily that the Frobenius norm is sub-multiplicative [5, p. 291].

What is $\|uv'\|_F$?
A: $\|u\| \, \|v\|$ B: $\|u\| \, \|v\|_2$ C: $\|u\|_2 \|v\|_2$ D: $\|u\|_2^2 \|v\|_2^2$ E: None of these. ??

Induced matrix norms

If $\|\cdot\|$ is any vector norm that is suitable for both $F^N$ and $F^M$, then a matrix norm for $F^{M \times N}$ is:
$$\|A\| \triangleq \max_{x : \|x\| = 1} \|Ax\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|}. \tag{5.12}$$
By construction:
$$\|Ax\| \leq \|A\| \, \|x\|, \quad \forall x \in F^N. \tag{5.13}$$
We say such a matrix norm is induced by the vector norm. Importantly, the sub-multiplicative property (5.9) holds whenever the number of columns of $A$ matches the number of rows of $B$.

Example. The most important matrix norms are induced by the vector norm $\|\cdot\|_p$, i.e.,
$$\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}. \tag{5.14}$$

The spectral norm $\|\cdot\|_2$, often denoted simply $\|\cdot\|$, is defined on $F^{M \times N}$ by (5.14) with $p = 2$. This is the matrix norm induced by the Euclidean vector norm. As shown on p. 2.20:
$$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \sqrt{\max\{\lambda : \lambda \in \mathrm{eig}\{A'A\}\}} = \sigma_1(A).$$
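A quick Julia check (LinearAlgebra assumed) that the induced 2-norm from (5.14) matches the largest singular value; opnorm is the Julia 1.x name for the induced (operator) norm, and the random-vector line is only a crude lower-bound probe of the maximization in (5.12), not an exact computation.

```julia
using LinearAlgebra

A = randn(4, 3)

opnorm(A, 2)           # induced (spectral) norm
maximum(svdvals(A))    # σ₁(A): same value

# crude probe of definition (5.12) using random directions (a lower bound):
maximum(norm(A * x) / norm(x) for x in eachcol(randn(3, 10_000)))
```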

The maximum row sum matrix norm is defined on $F^{M \times N}$ by
$$\|A\|_\infty \triangleq \max_{1 \leq i \leq M} \sum_{j=1}^N |a_{ij}|. \tag{5.15}$$
It is induced by the $\ell_\infty$ vector norm. It differs from the max norm defined above!

The maximum column sum matrix norm is defined on $F^{M \times N}$ by
$$\|A\|_1 \triangleq \max_{x \neq 0} \frac{\|Ax\|_1}{\|x\|_1} = \max_{1 \leq j \leq N} \sum_{i=1}^M |a_{ij}|. \tag{5.16}$$
It is induced by the $\ell_1$ vector norm.
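These two induced norms are easy to verify directly; a minimal Julia sketch (LinearAlgebra assumed, with opnorm giving the induced norms in Julia 1.x):

```julia
using LinearAlgebra

A = [1.0 -2.0 3.0; 4.0 5.0 -6.0]

opnorm(A, Inf)                    # max row sum: 15 = |4| + |5| + |-6|
maximum(sum(abs.(A), dims=2))     # same, directly from (5.15)

opnorm(A, 1)                      # max column sum: 9 = |3| + |-6|
maximum(sum(abs.(A), dims=1))     # same, directly from (5.16)
```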

For $1 \leq p \leq \infty$, the Schatten p-norm of an $M \times N$ matrix having rank $r$ is defined in terms of its nonzero singular values:
$$\|A\|_{S,p} \triangleq \Big( \sum_{k=1}^r \sigma_k^p \Big)^{1/p}.$$
All of these Schatten norms are sub-multiplicative. Important special cases:
- $\|A\|_{S,1} = \|A\|_*$ is also called the nuclear norm and sometimes the trace norm [1, p. 60].
- $\|A\|_{S,2} = \|A\|_F$
- $\|A\|_{S,\infty} = \sigma_1(A) = \|A\|_2$

The Ky-Fan K-norm is the sum of the first $K$ singular values of a matrix:
$$\|A\|_{\text{Ky-Fan},K} = \sum_{k=1}^K \sigma_k(A).$$
Challenge. Determine if the Ky-Fan K-norm is sub-multiplicative, or find a counter-example. For a PCA generalization that uses this norm see [6].
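Because all of these norms are functions of the singular values, each is one line of Julia once svdvals is available; a minimal sketch (the helper names schatten and kyfan are illustrative, not standard library functions):

```julia
using LinearAlgebra

A = randn(5, 3)
σ = svdvals(A)                 # singular values, in decreasing order

schatten(p) = sum(σ .^ p)^(1/p)
schatten(1)                    # nuclear norm ‖A‖_*
schatten(2) ≈ norm(A)          # Schatten-2 = Frobenius norm
maximum(σ) ≈ opnorm(A, 2)      # Schatten-∞ = spectral norm

kyfan(K) = sum(σ[1:K])         # Ky-Fan K-norm
kyfan(2)
```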

Practical implementation

JULIA commands for some of these norms are as follows:
- $\|A\|_1$ : norm(A, 1)
- $\|A\|_2$ : norm(A, 2)
- $\|A\|_\infty$ : norm(A, Inf)  ($\|A\|_{\max}$ is vecnorm(A, Inf))
- $\|A\|_*$ : sum(svd(A)[2]) or sum(svdfact(A)[:S])

Again, caution that norm(A, Inf) differs from vecnorm(A, Inf).

Examples

For A = [1 2 3]' * [1 1 1], what is vecnorm(A, Inf)? A: 1 B: 2 C: 3 D: 6 E: None of these ??
For A = [1 2 3]' * [1 1 1], what is norm(A, Inf)? A: 1 B: 2 C: 3 D: 6 E: None of these ??
For A = [1 2 3]' * [1 1 1], what is norm(A, 2)? Hint: perhaps recall the eigenvalue commutative property (1.12). A: 1 B: 2 C: 3 D: 6 E: None of these ??

Properties of matrix norms

All matrix norms are also equivalent (to within constants that depend on the matrix dimensions). See [1, p. 61] for inequalities relating various matrix norms. Example (see HW): for $A \in F^{M \times N}$, the norms $\|A\|_1$ and $\|A\|_2$ are related by constants of this form that depend on the dimensions.

Example. To relate the spectral norm and nuclear norm for a matrix $A$ having rank $r$:
$$\|A\|_* = \sum_{k=1}^r \sigma_k \leq r \sigma_1 = r \|A\|_2, \qquad \|A\|_2 = \sigma_1 \leq \sum_{k=1}^r \sigma_k = \|A\|_*.$$
Combining:
$$\frac{1}{r} \|A\|_* \leq \|A\|_2 \leq \|A\|_* \leq r \|A\|_2.$$
To express it in a way that depends on the norm only (not $r$, which is a property of a specific matrix):
$$\frac{1}{\min(M, N)} \|A\|_* \leq \|A\|_2 \leq \|A\|_* \leq \min(M, N) \|A\|_2.$$

Unitarily invariant matrix norms

Define. A matrix norm $\|\cdot\|$ on $F^{M \times N}$ is called unitarily invariant iff for all unitary matrices $U \in F^{M \times M}$ and $V \in F^{N \times N}$:
$$\|UAV\| = \|A\|, \quad \forall A \in F^{M \times N}.$$

Theorem.
- The spectral norm $\|A\|_2$ is unitarily invariant.
- All Schatten p-norms $\|A\|_{S,p}$ are unitarily invariant.
  Proof sketch: unitary matrix rotations do not change singular values.
- The Frobenius norm $\|A\|_F$ is unitarily invariant.
  Proof for the Frobenius case:
  $$\|UAV\|_F = \sqrt{\mathrm{trace}\{V'A'U'UAV\}} = \sqrt{\mathrm{trace}\{V'A'AV\}} = \sqrt{\mathrm{trace}\{VV'A'A\}} = \sqrt{\mathrm{trace}\{A'A\}} = \|A\|_F.$$
  (We could also prove it using singular values.)
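A quick numerical check in Julia (LinearAlgebra assumed), using the Q factors of QR factorizations of random matrices as the unitary U and V:

```julia
using LinearAlgebra

A = randn(4, 3)
U = Matrix(qr(randn(4, 4)).Q)         # random orthogonal matrices
V = Matrix(qr(randn(3, 3)).Q)

opnorm(U * A * V, 2) ≈ opnorm(A, 2)   # spectral norm: invariant
norm(U * A * V) ≈ norm(A)             # Frobenius norm: invariant
opnorm(U * A * V, 1) ≈ opnorm(A, 1)   # max column sum: generally NOT invariant
```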

Spectral radius

Define. For any square matrix, the spectral radius is the maximum absolute eigenvalue:
$$A \in F^{N \times N} \Rightarrow \rho(A) \triangleq \max_i |\lambda_i(A)|.$$
By construction, $|\lambda_i(A)| \leq \rho(A)$, so all eigenvalues lie within a disk in the complex plane of radius $\rho(A)$, hence the name.

In general, $\rho(A)$ is not a matrix norm and $\|Ax\| \nleq \rho(A) \|x\|$.

However, if $A$ is Hermitian, then its eigenvalues are real, so assuming we order the eigenvalues in decreasing order of their absolute values, we can relate its orthogonal eigendecomposition to an SVD as follows:
$$A = V \Lambda V' = \sum_{n=1}^N \lambda_n v_n v_n' = \sum_{n=1}^N \underbrace{|\lambda_n|}_{\sigma_n}\, \underbrace{\mathrm{sign}(\lambda_n)\, v_n}_{u_n}\, v_n'.$$
Thus $A = A' \Rightarrow \rho(A) = \sigma_1(A) = \|A\|_2$, so $\|Ax\|_2 \leq \rho(A) \|x\|_2$.

Furthermore, $A = A' \Rightarrow |x'Ax| \leq \rho(A) \|x\|_2^2$ because
$$\max_{x \neq 0} \frac{|x'Ax|}{\|x\|_2^2} = \max_{z \neq 0} \frac{|(Vz)'A(Vz)|}{\|Vz\|_2^2} = \max_{z \neq 0} \frac{|z'\Lambda z|}{\|z\|_2^2} = \max_n |\lambda_n(A)| = \rho(A) = \sigma_1(A) = \|A\|_2.$$

If $\|\cdot\|$ is any induced matrix norm on $F^{N \times N}$ and if $A \in F^{N \times N}$, then
$$\rho(A) \leq \|A\|. \tag{5.17}$$
Proof. If $Av = \lambda v$, then $|\lambda| \|v\| = \|\lambda v\| = \|Av\| \leq \|A\| \|v\|$. Dividing by $\|v\|$, which is fine because $v \neq 0$, yields $|\lambda| \leq \|A\|$. This inequality holds for all eigenvalues, including the one with maximum magnitude.

For any $A \in F^{N \times N}$, the spectral radius is an infimum of all induced matrix norms:
$$\rho(A) = \inf\{ \|A\| : \|\cdot\| \text{ is an induced matrix norm} \}.$$

If $A \in F^{N \times N}$, then $\lim_{k \to \infty} A^k = 0$ if and only if $\rho(A) < 1$. This property is particularly important for analyzing the convergence of iterative algorithms, including training recurrent neural networks [7]. (cf. HW)

Gelfand's formula for any induced matrix norm is:
$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}. \tag{5.18}$$
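A small Julia sketch (LinearAlgebra assumed) illustrating (5.17), the decay of A^k when ρ(A) < 1, and Gelfand's formula (5.18); the specific matrix is just an arbitrary example with spectral radius below one.

```julia
using LinearAlgebra

A = [0.5 0.4; 0.1 0.3]
ρ = maximum(abs.(eigvals(A)))   # spectral radius

ρ ≤ opnorm(A, 1), ρ ≤ opnorm(A, 2), ρ ≤ opnorm(A, Inf)   # (5.17): true for induced norms

opnorm(A^50, 2)            # ≈ 0, since ρ(A) < 1
opnorm(A^50, 2)^(1/50)     # ≈ ρ, illustrating Gelfand's formula (5.18)
```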

Which equality (if any) correctly relates a singular value and a spectral radius for any general matrix $A \in F^{M \times N}$?
A: $\sigma_1(A) \overset{?}{=} \rho(A)$ B: $\sigma_1(A) \overset{?}{=} \rho^2(A)$ C: $\sigma_1(A) \overset{?}{=} \rho(A'A)$ D: $\sigma_1(A) \overset{?}{=} \sqrt{\rho(A'A)}$ E: None of these. ??

Procrustes analysis

(An application of the SVD and the Frobenius matrix norm.)

One use of matrix norms is quantifying the dissimilarity of two matrices by using the norm of their difference. We illustrate that use by solving the orthogonal Procrustes problem [8]. The goal of the Procrustes problem is to find an orthogonal matrix $Q$ in $\mathbb{R}^{M \times M}$ that makes two other matrices $B$ and $A$ in $\mathbb{R}^{M \times N}$ as similar as possible by rotating the columns of $A$:
$$\hat{Q} = \arg\min_{Q : Q'Q = I_M} f(Q), \qquad f(Q) \triangleq \underbrace{\|B - QA\|_F^2}_{\text{cost function}}.$$
(One could use some other norm, but the Frobenius norm is simple and natural here.)

One of many motivating applications is performing image registration of two pictures of the same scene acquired with different camera orientations, using a technique called landmark registration.

Example. Here the goal is to match (by rotation) two sets of landmark coordinates:
$$B = QA, \qquad Q = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix},$$
where $A$ holds the landmark coordinates of one image and $B$ those of the other. Here $M = 2$ and $N = 3$ (three landmark points in the plane), and typically $M < N$ in such problems.

Analyze the cost function:
$$f(Q) = \|B - QA\|_F^2 = \mathrm{trace}\{(B - QA)'(B - QA)\}$$
$$= \mathrm{trace}\{B'B - A'Q'B - B'QA + A'Q'QA\} \quad \text{(expanding)}$$
$$= \mathrm{trace}\{B'B - A'Q'B - B'QA + A'A\} \quad (Q'Q = I)$$
$$= \mathrm{trace}\{B'B\} + \mathrm{trace}\{A'A\} - \mathrm{trace}\{A'Q'B\} - \mathrm{trace}\{B'QA\} \quad \text{(linearity)}$$
$$= \mathrm{trace}\{B'B\} + \mathrm{trace}\{A'A\} - \mathrm{trace}\{A'Q'B\} - \mathrm{trace}\{(A'Q'B)'\} \quad \text{(transpose)}$$
$$= \mathrm{trace}\{B'B\} + \mathrm{trace}\{A'A\} - 2\,\mathrm{trace}\{A'Q'B\} \quad \text{(transpose invariance of trace)}$$
$$= \mathrm{trace}\{B'B\} + \mathrm{trace}\{A'A\} - 2\,\mathrm{trace}\{Q'BA'\} \quad \text{(commutative property of trace)}.$$
So minimizing $f(Q)$ is equivalent to maximizing $g(Q) \triangleq \mathrm{trace}\{Q'BA'\}$.

Use an SVD (of course!) of the $M \times M$ matrix $C \triangleq BA' = U\Sigma V'$, so
$$g(Q) = \mathrm{trace}\{Q'BA'\} = \mathrm{trace}\{Q'U\Sigma V'\} = \mathrm{trace}\{V'Q'U\Sigma\} = \mathrm{trace}\{W\Sigma\}, \qquad W = W(Q) \triangleq V'Q'U.$$
Using the orthogonality of $U$, $V$, and $Q$, it is clear that the $M \times M$ matrix $W$ is orthogonal (cf. HW):
$$W'W = U'QVV'Q'U = U'Q\, I_M\, Q'U = U'U = I_M.$$

We must maximize $\mathrm{trace}\{W\Sigma\}$ over $Q$ orthogonal, where $W$ depends on $Q$ but $\Sigma$ does not. Observe:
$$[W\Sigma]_{mm} = w_{mm}\, \sigma_m \Rightarrow \mathrm{trace}\{W\Sigma\} = \sum_{m=1}^M w_{mm}\, \sigma_m.$$
To proceed, we look for an upper bound for this sum. Because $W$ is an orthogonal matrix, each of its columns has unit norm, i.e., $\sum_{m=1}^M w_{mn}^2 = 1$ for all $n$, so $|w_{mn}| \leq 1$ for all $m, n$. This inequality yields the following upper bound:
$$\mathrm{trace}\{W\Sigma\} \leq \sum_{m=1}^M \sigma_m = \mathrm{trace}\{I\Sigma\}.$$
This upper bound is achieved when $W = I$. Now solve for $Q$:
$$W = V'Q'U = I \Rightarrow VV'Q'UU' = VU' \Rightarrow Q' = VU' \Rightarrow \hat{Q} = UV'.$$
In summary:
$$\hat{Q} = \arg\min_{Q : Q'Q = I} \|B - QA\|_F^2 = UV', \quad \text{where } C = BA' = U\Sigma V'. \tag{5.19}$$
A homework problem will express $C = QP$ where $P$ is positive semi-definite, using a polar decomposition or polar factorization [1, p. 41].
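Here is a minimal Julia sketch of the solution (5.19), checked on synthetic data (LinearAlgebra assumed); procrustes is just an illustrative helper name, not a function from these notes or from a library.

```julia
using LinearAlgebra

# Orthogonal Procrustes solution (5.19): Q̂ = U V' from the SVD of C = B A'.
procrustes(A, B) = (F = svd(B * A'); F.U * F.Vt)

θ = 0.3
Q = [cos(θ) -sin(θ); sin(θ) cos(θ)]   # true rotation
A = randn(2, 5)                       # landmark coordinates (M = 2, N = 5)
B = Q * A                             # rotated landmarks

procrustes(A, B) ≈ Q                  # recovers the rotation
```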

Sanity check (self consistency and scale invariance)

Suppose $B$ is exactly a rotated version of the columns of $A$, along with an additional scale factor, i.e., $B = \alpha QA$ for some orthogonal matrix $Q$; equivalently $A = \frac{1}{\alpha} Q'B$. We now verify that the Procrustes method finds the correct rotation, i.e., $\hat{Q} = Q$.

Let $B = \tilde{U}\tilde{\Sigma}\tilde{V}'$ denote the SVD of $B$. Then the SVD of $C$ is evident by inspection:
$$C = BA' = \frac{1}{\alpha} BB'Q = \frac{1}{\alpha} \underbrace{\tilde{U}\tilde{\Sigma}\tilde{V}'}_{B} \underbrace{\tilde{V}\tilde{\Sigma}'\tilde{U}'}_{B'} Q = \underbrace{\tilde{U}}_{U}\, \underbrace{\tfrac{1}{\alpha}\tilde{\Sigma}\tilde{\Sigma}'}_{\Sigma}\, \underbrace{\tilde{U}'Q}_{V'}.$$
The Procrustes solution is indeed correct (self consistent), and invariant to the scale parameter $\alpha$:
$$\hat{Q} = UV' = (\tilde{U})(\tilde{U}'Q) = Q.$$

After finding $\hat{Q}$, if we also want to estimate the scale, we can solve a least-squares problem:
$$\hat{\alpha} = \arg\min_\alpha \|B - \alpha \hat{Q}A\|_F^2 = \frac{\mathrm{trace}\{BA'\hat{Q}'\}}{\mathrm{trace}\{A'A\}} = \frac{\mathrm{trace}\{U\Sigma V'VU'\}}{\mathrm{trace}\{A'A\}} = \frac{\sum_{k=1}^r \sigma_k}{\|A\|_F^2},$$
where $\{\sigma_k\}$ are the singular values of $C = BA'$.
A HW problem will explore a real-world image registration example.

Example. For determining 2D image rotation, even a single nonzero point in each image will suffice! For example, suppose the first point is at $(1, 0)$ and the second point is at $(x, y)$ where $x = 5\cos\varphi$ and $y = 5\sin\varphi$. (This example includes scaling by a factor of 5 just to illustrate the generality.) Then $A = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $B = \begin{bmatrix} x \\ y \end{bmatrix}$, so
$$C = BA' = \begin{bmatrix} 5\cos\varphi \\ 5\sin\varphi \end{bmatrix} \begin{bmatrix} 1 & 0 \end{bmatrix}
= \underbrace{\begin{bmatrix} \cos\varphi & -q_1\sin\varphi \\ \sin\varphi & q_1\cos\varphi \end{bmatrix}}_{U}\,
\underbrace{\begin{bmatrix} 5 & 0 \\ 0 & 0 \end{bmatrix}}_{\Sigma}\,
\underbrace{\begin{bmatrix} 1 & 0 \\ 0 & q_2 \end{bmatrix}}_{V'}, \qquad q_1, q_2 \in \{\pm 1\}.$$
Here $C$ is a simple outer product, so finding a (full!) SVD by hand was easy. In fact we found four SVDs, corresponding to different signs for $u_2$ and $v_2$. For each of these SVDs, the optimal rotation matrix per (5.19) is
$$\hat{Q} = UV' = \begin{bmatrix} \cos\varphi & -q_1\sin\varphi \\ \sin\varphi & q_1\cos\varphi \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & q_2 \end{bmatrix} = \begin{bmatrix} \cos\varphi & -q\sin\varphi \\ \sin\varphi & q\cos\varphi \end{bmatrix},$$
where $q \triangleq q_1 q_2 \in \{\pm 1\}$. The two Procrustes solutions here (for $q = \pm 1$) both have the correct $\cos\varphi$ in the upper left and both exactly satisfy $B = 5\hat{Q}A$. So there are two Procrustes solutions that fit the data exactly, one of which (for $q = 1$) corresponds to a rotation matrix, and the other of which (for $q = -1$) has a sign flip for the second coordinate. In 2D, any rotation matrix is a unitary matrix, but the converse is not true!

Exercise. Explore what happens with two points: collinear, symmetric around zero, non-collinear.

Complex case, with translation (reading only)

In many practical applications of the Procrustes problem, there can be both rotation and an unknown translation between the two sets of coordinates. Instead of the model $B_{:,n} \approx QA_{:,n}$, a more realistic model is $B_{:,n} \approx QA_{:,n} + d$, where $d \in F^M$ is an unknown displacement vector. In matrix form: $B \approx QA + d1_N'$.

Now we must determine both an orthogonal matrix $Q \in F^{M \times M}$ and the vector $d \in \mathbb{C}^M$ by a double minimization using a Frobenius norm:
$$(\hat{Q}, \hat{d}) \triangleq \arg\min_{Q : Q'Q = I_M}\ \arg\min_{d \in F^M} g(d, Q), \qquad g(d, Q) \triangleq \|B - (QA + d1_N')\|_F^2.$$
We first focus on the inner minimization over the displacement $d$ for any given $Q$. Defining $Z \triangleq B - QA$:
$$g(d, Q) = \|B - (QA + d1_N')\|_F^2 = \mathrm{trace}\{(Z - d1_N')'(Z - d1_N')\}$$
$$= \mathrm{trace}\{Z'Z\} - \mathrm{trace}\{Z'd1_N'\} - \mathrm{trace}\{1_N d'Z\} + \mathrm{trace}\{1_N d'd\, 1_N'\}$$
$$= \mathrm{trace}\{Z'Z\} - \mathrm{trace}\{1_N'Z'd\} - \mathrm{trace}\{d'Z1_N\} + \mathrm{trace}\{d'd\, 1_N'1_N\}$$
$$= \mathrm{trace}\{Z'Z\} - 1_N'Z'd - d'Z1_N + N d'd$$
$$= \mathrm{trace}\{Z'Z\} - \frac{1}{N}\|Z1_N\|_2^2 + N \Big\| d - \frac{1}{N} Z1_N \Big\|_2^2.$$

It is clear from this expression that the optimal estimate of the displacement $d$ for any $Q$ is:
$$\hat{d}(Q) = \frac{1}{N} Z 1_N = \frac{1}{N} (B - QA) 1_N.$$
Now to find the optimal rotation matrix $Q$ we must solve the outer minimization:
$$\hat{Q} \triangleq \arg\min_{Q : Q'Q = I_M} g\big(\hat{d}(Q), Q\big),$$
$$g\big(\hat{d}(Q), Q\big) = \mathrm{trace}\{(B - QA)'(B - QA)\} - \frac{1}{N}\|(B - QA)1_N\|_2^2$$
$$\equiv_c -2\,\mathrm{real}\{\mathrm{trace}\{Q'BA'\}\} + \frac{2}{N}\,\mathrm{real}\{1_N'A'Q'B1_N\} \quad \text{(dropping terms constant in } Q)$$
$$= -2\,\mathrm{real}\{\mathrm{trace}\{Q'BA'\}\} + 2\,\mathrm{real}\Big\{\mathrm{trace}\Big\{Q'B\, \tfrac{1}{N} 1_N 1_N'\, A'\Big\}\Big\}$$
$$= -2\,\mathrm{real}\{\mathrm{trace}\{Q'BMA'\}\}, \qquad M \triangleq I - \frac{1}{N} 1_N 1_N'.$$
Following the previous logic, the optimal $Q$ is
$$\hat{Q} = UV', \quad \text{where } C \triangleq BMA' = U\Sigma V'.$$
The matrix $M$ is called a de-meaning or centering operator, because $y = Mx$ subtracts the mean of $x$ from each element of $x$. In code: y = x .- mean(x)

The de-meaning matrix $M$ is a symmetric idempotent matrix, so $M = MM$, and we can rewrite $C$ above as
$$C = (BM)(AM)' = \tilde{B}\tilde{A}', \qquad \tilde{A} \triangleq AM, \quad \tilde{B} \triangleq BM,$$
where $\tilde{A}$ and $\tilde{B}$ are versions of $A$ and $B$ in which the mean of the columns (the centroid) has been subtracted from each column. In words, to find the optimal rotation matrix when there is possible translation, we first de-mean the columns of $A$ and $B$, and then compute the usual SVD of $\tilde{B}\tilde{A}'$ and use the left and right bases via $\hat{Q} = UV'$.

Exercise. Do a sanity check in the case where $B = \alpha QA + d1_N'$.

What is $M(\alpha 1_N)$? A: 0 B: 1 C: I D: None of these. ??

Practical implementation
The solution to the Procrustes problem requires just a couple of JULIA statements; the key ingredient is simply the svd command, as in the sketch below.
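A minimal Julia sketch of the translation-aware Procrustes solution (LinearAlgebra and Statistics assumed); procrustes_translate is an illustrative helper name, and subtracting the column means implements the de-meaning operator M.

```julia
using LinearAlgebra, Statistics

function procrustes_translate(A, B)
    Ac = A .- mean(A, dims=2)           # de-mean: subtract the centroid (A M)
    Bc = B .- mean(B, dims=2)           # de-mean: subtract the centroid (B M)
    F = svd(Bc * Ac')                   # C = B̃ Ã' = U Σ V'
    Q = F.U * F.Vt                      # Q̂ = U V'
    d = vec(mean(B - Q * A, dims=2))    # d̂ = (1/N)(B − Q̂ A) 1_N
    return Q, d
end

θ = 0.7
Q = [cos(θ) -sin(θ); sin(θ) cos(θ)]
d = [2.0, -1.0]
A = randn(2, 6)
B = Q * A .+ d                          # rotated and translated landmarks

Qhat, dhat = procrustes_translate(A, B)
Qhat ≈ Q, dhat ≈ d                      # both are recovered
```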

Convergence of sequences of vectors and matrices

In later chapters we will be discussing iterative optimization algorithms and analyzing when such algorithms converge.

Convergence of a sequence of numbers
Define. We say a sequence of (possibly complex) numbers $\{x_k\}$ converges to a limit $x$ iff $|x_k - x| \to 0$ as $k \to \infty$, where $|\cdot|$ denotes absolute value (or complex magnitude more generally). Specifically,
$$\forall \epsilon > 0,\ \exists N_\epsilon \in \mathbb{N} \text{ s.t. } |x_k - x| < \epsilon,\ \forall k \geq N_\epsilon.$$

We now define convergence of a sequence of vectors or matrices by using a norm to quantify distance, relating to convergence of a sequence of scalars.

Define. We say a sequence of vectors $\{x_k\}$ in a vector space $V$ converges to a limit $x \in V$ iff $\|x_k - x\| \to 0$ for some norm as $k \to \infty$. Specifically,
$$\forall \epsilon > 0,\ \exists N_\epsilon \in \mathbb{N} \text{ s.t. } \|x_k - x\| < \epsilon,\ \forall k \geq N_\epsilon.$$

A matrix is simply a point in a vector space of matrices, so we use essentially the same definition for convergence of a sequence of matrices:

Define. We say a sequence of matrices $\{X_k\}$ (in a vector space $V$ of matrices) converges to a limit $X \in V$ iff $\|X_k - X\| \to 0$ for some (matrix) norm as $k \to \infty$. Specifically,
$$\forall \epsilon > 0,\ \exists N_\epsilon \in \mathbb{N} \text{ s.t. } \|X_k - X\| < \epsilon,\ \forall k \geq N_\epsilon.$$

Example. Consider (for simplicity) the sequence of diagonal matrices $\{D_k\}$ defined by
$$D_k = \begin{bmatrix} 3 + 2^{-k} & 0 \\ 0 & (-1)^k / k^2 \end{bmatrix}.$$
This sequence converges to the limit $D = \begin{bmatrix} 3 & 0 \\ 0 & 0 \end{bmatrix}$ because
$$\|D_k - D\|_F = \left\| \begin{bmatrix} 2^{-k} & 0 \\ 0 & (-1)^k / k^2 \end{bmatrix} \right\|_F = \sqrt{4^{-k} + 1/k^4} \to 0.$$
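The same example in a few lines of Julia (LinearAlgebra assumed), watching the Frobenius distance shrink:

```julia
using LinearAlgebra

D = [3.0 0.0; 0.0 0.0]
Dk(k) = [3 + 2.0^(-k)  0.0; 0.0  (-1)^k / k^2]

[norm(Dk(k) - D) for k in (1, 5, 10, 20)]   # Frobenius distances → 0
```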

5.4 Summary

The Procrustes problem has an SVD-based solution:
$$\hat{Q} = \arg\min_{Q : Q'Q = I_M} \|B - QA\|_F^2 = UV', \qquad C = BA' = U\Sigma V'.$$
- The solution is invariant to scaling factors ($\alpha QA$).
- Unknown displacement (translation) simply requires de-meaning $A$ and $B$ before doing the SVD.
- The displacement estimate (if needed) is $\frac{1}{N}\big(B - \hat{Q}A\big)1_N$.

Deriving the solution to this problem used many of the tools discussed so far: Frobenius norm, matrix trace and its properties, SVD, matrix/vector algebra.

Exercise. Suppose $A$ and $B$ are both real $1 \times N$ vectors (each with mean 0 for simplicity). What does the Procrustes solution mean in this case geometrically? Hint: What is the SVD of $BA'$ here?
If $A = x'$ and $B = y'$ where $x, y \in \mathbb{R}^N$, then
$$BA' = y'x = \underbrace{\mathrm{sgn}(y'x)}_{U}\, \underbrace{|y'x|}_{\sigma_1}\, \underbrace{1}_{V'},$$
so $\hat{Q} = UV' = \mathrm{sgn}(y'x) = \pm 1$. Here the "rotation" is just possibly negating the sign to match in 1D.

Exercise. What if $B = e^{\imath\varphi} A$? (in class if possible)

Challenge (for much later in the course). The Frobenius norm is not robust to outliers. Using something like an $\ell_1$ norm instead would provide better robustness [9].

Bibliography
[1] A. J. Laub. Matrix Analysis for Scientists and Engineers. Soc. Indust. Appl. Math.
[2] D. G. Luenberger. Optimization by Vector Space Methods. New York: Wiley.
[3] R. H. Chan and M. K. Ng. "Conjugate gradient methods for Toeplitz systems." SIAM Review 38.3 (Sept. 1996).
[4] V-S. Chellaboina and W. M. Haddad. "Is the Frobenius matrix norm induced?" IEEE Trans. Auto. Control (Dec. 1995).
[5] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge: Cambridge Univ. Press.
[6] S. Feizi and D. Tse. "Maximally correlated principal component analysis." arXiv.
[7] R. Pascanu, T. Mikolov, and Y. Bengio. "On the difficulty of training recurrent neural networks." PMLR 28.3 (2013).
[8] P. H. Schönemann. "A generalized solution of the orthogonal Procrustes problem." Psychometrika 31.1 (Mar. 1966).
[9] P. J. F. Groenen, P. Giaquinto, and H. A. L. Kiers. "An improved majorization algorithm for robust Procrustes analysis." In: New Developments in Classification and Data Analysis: Proceedings of the Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, University of Bologna. Berlin: Springer, 2005.
