CSC 576: Linear System

Ji Liu
Department of Computer Science, University of Rochester
September 3, 2016

1 Linear Equations

Consider solving the linear equations

$$Ax = b, \qquad (1)$$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$; $m$ and $n$ could be extremely large.

2 Preliminary

- $\sigma_i(A)$ denotes the $i$th largest singular value of $A$, $\sigma_{\min}(A)$ denotes the minimal nonzero singular value of $A$, and $\sigma_{\max}(A)$ denotes the maximal singular value of $A$.
- If $U$ has orthogonal columns, that is, $U^T U = I$, then $\|Ux\| = \|x\|$ and $\|U^T y\| \le \|y\|$.
- $\|Ax\| \le \sigma_{\max}(A)\|x\|$, where $\sigma_{\max}(A)$ denotes the largest singular value of $A$. Note that we do NOT have $\|Ax\| \ge \sigma_{\min}(A)\|x\|$ in general, where $\sigma_{\min}(A)\ (>0)$ denotes the minimal nonzero singular value of $A$. However, we do have $\|AA^T x\| \ge \sigma_{\min}(A)\|A^T x\|$, due to
  $$\|AA^T x\| = \|U\Sigma^2 U^T x\| = \|\Sigma^2 U^T x\| \ge \sigma_{\min}(\Sigma)\|\Sigma U^T x\| = \sigma_{\min}(\Sigma)\|V\Sigma U^T x\| = \sigma_{\min}(A)\|A^T x\|.$$
- The compact SVD of $A$ is $A = U\Sigma V^T$. We have $\mathrm{span}(A) = \mathrm{span}(U)$ and $\mathrm{span}(A^T) = \mathrm{span}(V)$.
- Let $a$ and $b$ be two random variables. We have $\mathbb{E}_{a,b}[f(a,b)] = \mathbb{E}_b[\mathbb{E}_{a|b}[f(a,b)]] = \mathbb{E}_a[\mathbb{E}_{b|a}[f(a,b)]]$.
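To make the norm and SVD facts above concrete, here is a small NumPy check (an illustrative sketch added to these notes, not part of the original): it verifies $\|Ux\| = \|x\|$ for a matrix with orthonormal columns and $\|AA^T x\| \ge \sigma_{\min}(A)\|A^T x\|$ on a random instance.

```python
# Numerical check of the preliminary facts above (illustrative, not from the notes).
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
A = rng.standard_normal((m, n))
x = rng.standard_normal(m)

# Compact SVD: A = U diag(s) V^T with U, V having orthonormal columns.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
sigma_min = s[s > 1e-12].min()        # minimal nonzero singular value

# ||Uz|| = ||z|| for any z when U has orthonormal columns.
z = rng.standard_normal(U.shape[1])
print(np.isclose(np.linalg.norm(U @ z), np.linalg.norm(z)))   # True

# ||A A^T x|| >= sigma_min(A) * ||A^T x||.
lhs = np.linalg.norm(A @ (A.T @ x))
rhs = sigma_min * np.linalg.norm(A.T @ x)
print(lhs >= rhs - 1e-10)                                      # True
```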

The derivative (gradient, differential) of a function $f(X)$ in terms of a matrix $X \in \mathbb{R}^{m \times n}$ is defined below:

$$\frac{\partial f(X)}{\partial X} = \begin{bmatrix}
\frac{\partial f(X)}{\partial X_{11}} & \cdots & \frac{\partial f(X)}{\partial X_{1j}} & \cdots & \frac{\partial f(X)}{\partial X_{1n}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial f(X)}{\partial X_{i1}} & \cdots & \frac{\partial f(X)}{\partial X_{ij}} & \cdots & \frac{\partial f(X)}{\partial X_{in}} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial f(X)}{\partial X_{m1}} & \cdots & \frac{\partial f(X)}{\partial X_{mj}} & \cdots & \frac{\partial f(X)}{\partial X_{mn}}
\end{bmatrix}$$

The commonly used derivatives include

$$f(X) = \mathrm{trace}(A^T X) = \langle A, X\rangle, \qquad \frac{\partial f(X)}{\partial X} = A$$

$$f(X) = \mathrm{trace}(X^T A X), \qquad \frac{\partial f(X)}{\partial X} = (A + A^T)X$$

$$f(X) = \frac{1}{2}\|AX - B\|_F^2, \qquad \frac{\partial f(X)}{\partial X} = A^T(AX - B)$$

$$f(X) = \frac{1}{2}\mathrm{trace}(B^T X^T X B), \qquad \frac{\partial f(X)}{\partial X} = XBB^T$$

$$f(X) = \frac{1}{2}\mathrm{trace}(B^T X^T A X B), \qquad \frac{\partial f(X)}{\partial X} = \frac{1}{2}(A + A^T)XBB^T$$

A general rule to compute the derivative of $f(X, X)$ is

$$\frac{\partial f(X, X)}{\partial X} = \left.\frac{\partial f(X_1, X_2)}{\partial X_1}\right|_{X_1 = X_2 = X} + \left.\frac{\partial f(X_1, X_2)}{\partial X_2}\right|_{X_1 = X_2 = X}.$$

3 Closed Form of Eq. (1)

3.1 Case 1: A is invertible

First we consider the case that $A$ is invertible (which implies that $A$ is a square matrix with full rank). The closed form solution is

$$x = A^{-1}b. \qquad (2)$$

Computational complexity: the complexity of computing the inverse of a matrix is $O(n^3)$; the Gaussian elimination algorithm can achieve this complexity. The memory requirement is $O(n^2)$. Although Eq. (2) gives the exact solution to problem (1), it would be a disaster when $n$ is huge. This method is only used for solving small scale problems.

3.2 Case 2: A may not be invertible

If $A$ is not invertible, then Eq. (1) may have no solution or more than one solution. For example, $A = \begin{bmatrix}1 & 0\\ 0 & 1\\ 0 & 0\end{bmatrix}$, $b = \begin{bmatrix}1\\ 1\\ 1\end{bmatrix}$ has no solution, while $A = \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\end{bmatrix}$, $b = \begin{bmatrix}1\\ 1\end{bmatrix}$ has more than one solution. In the more general situation, people use the pseudo inverse to compute $x$:

$$x = A^+ b,$$

where $A^+$ is defined as follows. Denote the compact SVD of $A$ as $A = U\Sigma V^T$. The pseudo inverse of $A$ is defined as

$$A^+ = (U\Sigma^{-1}V^T)^T = V\Sigma^{-1}U^T.$$

Note that if $A$ is invertible, then $A^{-1} = A^+$.
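As an illustration (a sketch added here, not from the original notes), the pseudo inverse can be formed directly from the compact SVD and compared against NumPy's built-in `np.linalg.pinv`; as discussed next, $A^+ b$ also returns a least-squares solution.

```python
# Sketch: pseudo inverse via the compact SVD, A+ = V Sigma^{-1} U^T (illustrative, not from the notes).
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol                       # drop zero singular values (compact SVD)
    return Vt.T[:, keep] @ np.diag(1.0 / s[keep]) @ U[:, keep].T

# Example with no exact solution: A = [1 0; 0 1; 0 0], b = [1; 1; 1].
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
b = np.array([1.0, 1.0, 1.0])

x = pinv_via_svd(A) @ b
print(np.allclose(x, np.linalg.pinv(A) @ b))                  # matches NumPy's pseudo inverse
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # minimizes ||Ax - b||^2
print(np.allclose(A.T @ A @ x, A.T @ b))                      # optimality condition A^T A x = A^T b
```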

Now you may ask why $x = A^+ b$. Actually, $A^+ b$ solves the following objective:

$$\min_x\ \frac{1}{2}\|Ax - b\|_2^2.$$

To verify this, we only need to check the optimality condition $A^T A x = A^T b$. Substituting $x = A^+ b$ into the left hand side, we have

$$A^T A(A^+ b) = (U\Sigma V^T)^T (U\Sigma V^T)(V\Sigma^{-1}U^T)b = V\Sigma U^T b = A^T b.$$

It is worth noting that $A(A^+ b)$ is the projection of $b$ onto the range space of $A$.

Question: if there are multiple solutions to (1), then $A^+ b$ is just one of them. So why is this particular solution special?

Since $A^{-1}$ is just a special case of $A^+$, the complexity of computing $A^+$ is still of order $O(n^3)$ in general. The next question is how to solve (1) when $n$ is extremely large. Read the next section!

4 (Randomized) Kaczmarz Algorithm

First we assume that Eq. (1) has at least one solution. (Even if this condition fails, we can reformulate the problem into one satisfying this condition, which will become clear soon.) Note that we do not need to assume that $A$ is a square matrix. The randomized Kaczmarz (RK) algorithm is a storage efficient algorithm.

Algorithm 1 Randomized Kaczmarz Algorithm
1: Given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$;
2: Initialize $k \leftarrow 0$ and $x_0 = 0$;
3: while $k \le K$ do
4:   Choose $i$ from $\{1, 2, \ldots, m\}$ with probability $\|A_i\|^2 / \|A\|_F^2$;
5:   Update $x_{k+1} \leftarrow x_k - \dfrac{A_i x_k - b_i}{\|A_i\|^2}A_i^T$;
6:   $k \leftarrow k + 1$;
7: end while

Intuition: the RK algorithm iteratively and randomly selects a hyperplane $A_i x = b_i$ and projects the current $x$ onto the selected hyperplane. Given a hyperplane $c^T x = d$, the projection of a point $z$ onto this hyperplane is

$$\mathrm{proj}_{\{x\,:\,c^T x = d\}}(z) = z - \frac{c^T z - d}{\|c\|^2}\,c,$$

which is obtained by considering two conditions: 1) the difference $z - \mathrm{proj}_{\{x\,:\,c^T x = d\}}(z)$ must be along the normal direction of the hyperplane, so it should be proportional to $c$; 2) the factor in front of $c$ can be calculated from the observation that $\mathrm{proj}_{\{x\,:\,c^T x = d\}}(z)$ must lie on the hyperplane.

Computational complexity: the RK algorithm only needs a single row of the data matrix $A$ per iteration. Both the memory cost and the per-iteration computational complexity are just $O(n)$.

Convergence rate: the RK algorithm is nothing but the stochastic gradient algorithm, which will become clear later in the course. However, RK guarantees a much faster convergence rate than the general rate of the stochastic gradient algorithm.
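A minimal NumPy sketch of Algorithm 1 follows (illustrative code, not part of the original notes); rows are sampled with probability proportional to $\|A_i\|^2$, and each update projects the iterate onto the hyperplane $A_i x = b_i$.

```python
# Sketch of the randomized Kaczmarz iteration (Algorithm 1); illustrative, not from the notes.
import numpy as np

def randomized_kaczmarz(A, b, K=1000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms_sq = np.sum(A * A, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()   # P(i) = ||A_i||^2 / ||A||_F^2
    x = np.zeros(n)
    for _ in range(K):
        i = rng.choice(m, p=probs)
        # Project x onto the hyperplane {x : A_i x = b_i}.
        x = x - (A[i] @ x - b[i]) / row_norms_sq[i] * A[i]
    return x

# Consistent system: b is in the range of A, so an exact solution exists.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
x_true = rng.standard_normal(50)
b = A @ x_true
x_rk = randomized_kaczmarz(A, b, K=5000)
print(np.linalg.norm(x_rk - x_true))   # small error after enough iterations
```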

Theorem 1 Assume that Eq. (1) has at least one solution. Denote the minimal nonzero singular value of $A$ by $\sigma_{\min}(A)$. Then

$$\mathbb{E}(\|x_k - x^*\|^2) \le \left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right)^k \|x_0 - x^*\|^2,$$

where $x^* = A^+ b$ and $\mathbb{E}$ means taking the expectation in terms of all random variables.

Proof. First we verify that $x^* = A^+ b$ is a solution to $Ax = b$. As we showed before, $x^*$ essentially minimizes $\|Ax - b\|^2$. Since there exists at least one solution to $Ax = b$, we have $\min_x \|Ax - b\|^2 = 0$, which implies that $Ax^* = b$.

We notice that $x_k$ (for any $k$) is in the span of the rows of $A$, that is, $x_k \in \mathrm{span}\{A_1^T, A_2^T, \ldots, A_m^T\}$, or $x_k$ can be written as $A^T y_k$ for some vector $y_k$. We also notice that $x^* = A^+ b \in \mathrm{span}\{A_1^T, A_2^T, \ldots, A_m^T\}$. Therefore $x_k - x^* \in \mathrm{span}\{A_1^T, A_2^T, \ldots, A_m^T\}$, or $x_k - x^* = A^T z_k$ for some $z_k$. Using the result in our preliminary section, we have

$$\|A(x_k - x^*)\| = \|AA^T z_k\| \ge \sigma_{\min}(A)\|A^T z_k\| = \sigma_{\min}(A)\|x_k - x^*\|.$$

Next we have

$$\begin{aligned}
\|x_{k+1} - x^*\|^2 &= \left\|x_k - \frac{1}{\|A_i\|^2}A_i^T(A_i x_k - b_i) - x^*\right\|^2 \\
&= \|x_k - x^*\|^2 + \frac{1}{\|A_i\|^4}\|A_i^T(A_i x_k - b_i)\|^2 - \frac{2}{\|A_i\|^2}\langle x_k - x^*,\ A_i^T(A_i x_k - b_i)\rangle \\
&= \|x_k - x^*\|^2 + \frac{1}{\|A_i\|^2}|A_i x_k - b_i|^2 - \frac{2}{\|A_i\|^2}\langle A_i(x_k - x^*),\ A_i x_k - b_i\rangle \\
&= \|x_k - x^*\|^2 - \frac{1}{\|A_i\|^2}[A_i(x_k - x^*)]^2 \qquad (\text{from } Ax^* = b).
\end{aligned}$$

Taking expectation on both sides in terms of $i(k)$ given $i(0), i(1), \ldots, i(k-1)$, we have

$$\begin{aligned}
\mathbb{E}_{i(k)\mid i(0),\ldots,i(k-1)}(\|x_{k+1} - x^*\|^2) &= \|x_k - x^*\|^2 - \mathbb{E}_{i(k)\mid i(0),\ldots,i(k-1)}\!\left(\frac{1}{\|A_i\|^2}[A_i(x_k - x^*)]^2\right) \\
&= \|x_k - x^*\|^2 - \sum_i \frac{\|A_i\|^2}{\|A\|_F^2}\cdot\frac{1}{\|A_i\|^2}[A_i(x_k - x^*)]^2 \\
&= \|x_k - x^*\|^2 - \frac{1}{\|A\|_F^2}\|A(x_k - x^*)\|^2 \\
&\le \|x_k - x^*\|^2 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\|x_k - x^*\|^2 \\
&= \left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right)\|x_k - x^*\|^2,
\end{aligned}$$

which implies that

$$\begin{aligned}
\mathbb{E}_{i(k), i(k-1), \ldots, i(0)}(\|x_{k+1} - x^*\|^2) &= \mathbb{E}_{i(0), \ldots, i(k-1)}\!\left(\mathbb{E}_{i(k)\mid i(0), \ldots, i(k-1)}(\|x_{k+1} - x^*\|^2)\right) \\
&\le \mathbb{E}_{i(0), \ldots, i(k-1)}\!\left(\left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right)\|x_k - x^*\|^2\right) \\
&\le \left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right)^{k+1}\|x_0 - x^*\|^2.
\end{aligned}$$

This completes the proof.
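The key identity in the proof, $\mathbb{E}_i\|x_{k+1} - x^*\|^2 = \|x_k - x^*\|^2 - \|A(x_k - x^*)\|^2/\|A\|_F^2$, can be checked numerically by averaging the one-step update over all rows with the RK sampling probabilities; the sketch below (illustrative, not from the original notes) does exactly that.

```python
# Sketch verifying the one-step expected decrease used in the proof of Theorem 1 (illustrative).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10))
x_star = rng.standard_normal(10)
b = A @ x_star                     # consistent system, so x* solves Ax = b

x_k = rng.standard_normal(10)      # arbitrary current iterate
row_norms_sq = np.sum(A * A, axis=1)
probs = row_norms_sq / row_norms_sq.sum()

# Exact expectation over the row index i of ||x_{k+1} - x*||^2.
expected = 0.0
for i in range(A.shape[0]):
    x_next = x_k - (A[i] @ x_k - b[i]) / row_norms_sq[i] * A[i]
    expected += probs[i] * np.linalg.norm(x_next - x_star) ** 2

predicted = (np.linalg.norm(x_k - x_star) ** 2
             - np.linalg.norm(A @ (x_k - x_star)) ** 2 / row_norms_sq.sum())
print(np.isclose(expected, predicted))   # True
```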

If problem (1) does not have a solution, one usually solves

$$\min_x\ \frac{1}{2}\|Ax - b\|_2^2,$$

or equivalently

$$A^T A x = A^T b. \qquad (3)$$

One can directly apply the RK algorithm to solve Eq. (3). However, in many cases it is hard to compute $A^T A$ in advance due to memory issues. One way to deal with this is to solve an equivalent problem by introducing a dual variable $y \in \mathbb{R}^m$:

$$A^T y = A^T b, \qquad Ax = y,$$

or

$$\begin{bmatrix} 0 & A^T \\ A & -I \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} A^T b \\ 0 \end{bmatrix}.$$

Question: how can the following linear system be solved using the RK algorithm, where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{k \times l}$, and $X \in \mathbb{R}^{n \times k}$?

$$AXB = C$$
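Returning to the augmented system above, the following sketch (illustrative, not from the original notes) builds the block system and confirms that its $x$-part is the least-squares solution, without ever forming $A^T A$.

```python
# Sketch: build the augmented system [0 A^T; A -I][x; y] = [A^T b; 0] and check its solution (illustrative).
import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)          # generally no exact solution to Ax = b

# Block matrix of size (n + m) x (n + m).
M = np.block([[np.zeros((n, n)), A.T],
              [A, -np.eye(m)]])
rhs = np.concatenate([A.T @ b, np.zeros(m)])

z = np.linalg.solve(M, rhs)         # each row of M is sparse in A, so RK could also be applied row by row
x_aug, y_aug = z[:n], z[n:]

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_aug, x_ls))       # x-part equals the least-squares solution
print(np.allclose(y_aug, A @ x_aug))  # y = Ax
```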

5 Conjugate Gradient Algorithm

The conjugate gradient (CG) algorithm is one of the most important algorithms for solving linear systems. However, the original paper proposing this algorithm was rejected many times, until people realized its merits. CG aims at solving the problem

$$Bx = b,$$

where $B$ is a PSD matrix. To solve a least squares problem $\frac{1}{2}\|Ax - b\|^2$, one can apply CG to solve

$$A^T A x = A^T b. \qquad (4)$$

The CG algorithm is provided in Algorithm 2. To see the motivation of CG, let us consider projection in the inner product space with the inner product

$$\langle x, y\rangle = x^T B y,$$

where $B$ is a positive definite matrix. Assume that we have a group of orthogonal basis vectors $\{p_1, p_2, \ldots, p_n\}$ satisfying $\langle p_i, p_j\rangle = 0$ for any $i \ne j$. Let $x^*$ be the solution to $Bx = b$. Apparently, we can find a unique decomposition of $x^*$:

$$x^* = \sum_{i=1}^n \alpha_i p_i.$$

It implies that

$$b = \sum_{i=1}^n \alpha_i B p_i.$$

To decide the value of $\alpha_i$, we can use

$$\alpha_i = \frac{\langle b, p_i\rangle}{\|p_i\|^2},$$

where we use the orthogonality among all basis vectors, and $\|p_i\|^2$ is defined as $\langle p_i, p_i\rangle$. Therefore, once we have the basis, the solution can be easily obtained. So the key question is how to obtain such a basis.

Next we introduce a general approach to construct a basis. Given a group of linearly independent vectors $\{r_1, r_2, \ldots, r_n\}$, we can construct an orthogonal basis following the procedure below:

$$\begin{aligned}
p_1 &= r_1 \\
p_2 &= r_2 - \frac{\langle r_2, p_1\rangle}{\|p_1\|^2}p_1 \\
&\;\;\vdots \\
p_n &= r_n - \sum_{k<n}\frac{\langle r_n, p_k\rangle}{\|p_k\|^2}p_k
\end{aligned}$$

The key idea is to remove the components along $p_1, p_2, \ldots, p_{k-1}$ from $r_k$ to obtain $p_k$. One can verify that $\langle p_i, p_j\rangle = 0$ for any $i \ne j$.

Now we know how to construct an orthogonal basis in the space associated with $B$. However, the computational complexity is too high. Can we do better by reducing the computational complexity? If we can find a group of linearly independent vectors $\{r_k\}_{k=1}^n$ satisfying the property

$$\langle r_k, p_j\rangle = 0, \quad j = 1, 2, \ldots, k-2, \qquad (5)$$

then we can construct $\{p_k\}_{k=1}^n$ by

$$\begin{aligned}
p_1 &= r_1 \\
p_2 &= r_2 - \frac{\langle r_2, p_1\rangle}{\|p_1\|^2}p_1 \\
&\;\;\vdots \\
p_n &= r_n - \frac{\langle r_n, p_{n-1}\rangle}{\|p_{n-1}\|^2}p_{n-1}
\end{aligned}$$

CG basically designs a smart way to construct $\{r_k\}_{k=1}^n$:

$$\begin{aligned}
r_1 &= b, & p_1 &= r_1 \\
r_2 &= r_1 - \frac{\langle r_1, p_1\rangle}{\|p_1\|^2}Bp_1 = r_1 - \frac{\langle b, p_1\rangle}{\|p_1\|^2}Bp_1, & p_2 &= r_2 - \frac{\langle r_2, p_1\rangle}{\|p_1\|^2}p_1 \\
&\;\;\vdots \\
r_n &= r_{n-1} - \frac{\langle r_{n-1}, p_{n-1}\rangle}{\|p_{n-1}\|^2}Bp_{n-1} = r_{n-1} - \frac{\langle b, p_{n-1}\rangle}{\|p_{n-1}\|^2}Bp_{n-1}, & p_n &= r_n - \frac{\langle r_n, p_{n-1}\rangle}{\|p_{n-1}\|^2}p_{n-1}
\end{aligned}$$
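The sketch below (illustrative, not from the original notes) runs the Gram-Schmidt procedure above in the $B$-inner product on a small example and then recovers the solution of $Bx = b$ from the expansion $x^* = \sum_i \alpha_i p_i$; note that each $\alpha_i$ is computed here as $b^T p_i / p_i^T B p_i$, with an ordinary inner product in the numerator.

```python
# Sketch: Gram-Schmidt in the B-inner product <x, y>_B = x^T B y (illustrative, not from the notes).
import numpy as np

rng = np.random.default_rng(4)
n = 6
Q = rng.standard_normal((n, n))
B = Q @ Q.T + n * np.eye(n)         # a positive definite matrix
b = rng.standard_normal(n)

R = rng.standard_normal((n, n))     # columns r_1, ..., r_n, linearly independent with high probability
P = np.zeros((n, n))                # columns will hold the B-orthogonal basis p_1, ..., p_n
for k in range(n):
    p = R[:, k].copy()
    for j in range(k):
        # Remove the component of r_k along p_j in the B-inner product.
        p -= (R[:, k] @ B @ P[:, j]) / (P[:, j] @ B @ P[:, j]) * P[:, j]
    P[:, k] = p

# Check B-orthogonality: P^T B P should be diagonal.
G = P.T @ B @ P
print(np.allclose(G, np.diag(np.diag(G))))            # True

# Recover x* solving Bx = b from alpha_i = b^T p_i / p_i^T B p_i.
alpha = (P.T @ b) / np.diag(G)
x = P @ alpha
print(np.allclose(x, np.linalg.solve(B, b)))          # True
```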

To see why (5) holds for this construction, define the Krylov subspaces $K_j = \mathrm{span}(b, Bb, \ldots, B^{j-1}b)$. We have

$$\begin{aligned}
&r_1 \in K_1, && p_1 \in K_1 \\
&r_2 \in K_2,\ r_2 \perp K_1, && p_2 \in K_2 \\
&r_3 \in K_3,\ r_3 \perp K_2, && p_3 \in K_3 \\
&\;\;\vdots \\
&r_n \in K_n,\ r_n \perp K_{n-1}, && p_n \in K_n
\end{aligned}$$

From $Bp_j \in K_{j+1}$, we obtain that $\langle r_k, p_j\rangle = 0$ for all $j \le k - 2$.

Algorithm 2 Conjugate Gradient
1: $r_0 = A^T b - A^T A x_0$;
2: $p_0 = r_0$;
3: $k = 0$;
4: while $k \le K$ do
5:   Compute the steplength $\alpha_k = \dfrac{\|r_k\|^2}{\|Ap_k\|^2}$
6:   Update $x_{k+1} = x_k + \alpha_k p_k$
7:   $r_{k+1} = r_k - \alpha_k A^T A p_k$
8:   if $r_{k+1}$ is sufficiently small then exit loop
9:   $\beta_k = \dfrac{\|r_{k+1}\|^2}{\|r_k\|^2}$
10:  Compute the conjugate gradient $p_{k+1} = r_{k+1} + \beta_k p_k$
11:  $k \leftarrow k + 1$
12: end while

Theorem 2 Apply CG to (4). The $k$th iterate of the CG algorithm satisfies

$$\|x_k - x^*\|^2 \le O\!\left(\left(1 - \frac{\sigma_{\min}(A)}{\sigma_{\max}(A)}\right)^k\right)\|x_0 - x^*\|^2.$$

The computational complexity of CG per iteration is $O(mn)$. To compare RK and CG, assume that $\mathrm{rank}(A) = m$. Since one CG iteration costs roughly as much as $m$ RK iterations, we only need to compare

$$\left(1 - \frac{\sigma_m^2(A)}{\|A\|_F^2}\right)^m \approx 1 - \frac{m\,\sigma_m^2(A)}{\|A\|_F^2} \qquad\text{and}\qquad 1 - \frac{\sigma_m(A)}{\sigma_1(A)},$$

which suggests that CG outperforms RK if

$$\frac{m\,\sigma_m^2(A)}{\sum_{i=1}^{r}\sigma_i^2(A)} \le \frac{\sigma_m(A)}{\sigma_1(A)}, \qquad\text{that is,}\qquad \sigma_m(A)\,\sigma_1(A) \le \frac{1}{m}\sum_{i=1}^{m}\sigma_i^2(A).$$
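Below is a compact sketch of Algorithm 2 applied to the normal equations $A^T A x = A^T b$ (illustrative code, not part of the original notes); note that $A^T A$ is never formed explicitly, only matrix-vector products with $A$ and $A^T$ are used.

```python
# Sketch of Algorithm 2 (CG on the normal equations A^T A x = A^T b); illustrative, not from the notes.
import numpy as np

def cg_normal_equations(A, b, K=None, tol=1e-10):
    m, n = A.shape
    K = n if K is None else K
    x = np.zeros(n)
    r = A.T @ b - A.T @ (A @ x)       # r_0 = A^T b - A^T A x_0
    p = r.copy()
    for _ in range(K):
        Ap = A @ p
        alpha = (r @ r) / (Ap @ Ap)   # steplength alpha_k = ||r_k||^2 / ||A p_k||^2
        x = x + alpha * p
        r_new = r - alpha * (A.T @ Ap)
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p          # next conjugate direction
        r = r_new
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
x_cg = cg_normal_equations(A, b)
print(np.allclose(x_cg, np.linalg.lstsq(A, b, rcond=None)[0]))  # matches the least-squares solution
```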

6 Successive Over-Relaxation

Consider the linear system (1) with a full rank $A$:

$$Ax = b,$$

where

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.$$

Then $A$ can be decomposed into a diagonal component $D$ and strictly lower and strictly upper triangular components $L$ and $U$:

$$A = D + L + U,$$

where

$$D = \begin{bmatrix} a_{11} & & & \\ & a_{22} & & \\ & & \ddots & \\ & & & a_{nn} \end{bmatrix}, \quad L = \begin{bmatrix} 0 & & & \\ a_{21} & 0 & & \\ \vdots & \ddots & \ddots & \\ a_{n1} & \cdots & a_{n,n-1} & 0 \end{bmatrix}, \quad U = \begin{bmatrix} 0 & a_{12} & \cdots & a_{1n} \\ & 0 & \ddots & \vdots \\ & & \ddots & a_{n-1,n} \\ & & & 0 \end{bmatrix}.$$

The system of linear equations may be rewritten as

$$(D + \omega L)x = \omega b - [\omega U + (\omega - 1)D]x$$

for a constant $\omega > 1$, called the relaxation factor. The method of successive over-relaxation is an iterative technique that solves the left hand side of this expression for $x$, using the previous value of $x$ on the right hand side. Analytically, this may be written as

$$x^{(k+1)} = (D + \omega L)^{-1}\left(\omega b - [\omega U + (\omega - 1)D]x^{(k)}\right) = L_\omega x^{(k)} + c.$$

However, by taking advantage of the triangular form of $(D + \omega L)$, the elements of $x^{(k+1)}$ can be computed sequentially using forward substitution:

$$x_i^{(k+1)} = (1 - \omega)\,x_i^{(k)} + \frac{\omega}{a_{ii}}\left(b_i - \sum_{j<i}a_{ij}x_j^{(k+1)} - \sum_{j>i}a_{ij}x_j^{(k)}\right), \quad i = 1, 2, \ldots, n.$$

The choice of the relaxation factor $\omega$ is not necessarily easy and depends upon the properties of the coefficient matrix. In 1947, Ostrowski proved that if $A$ is symmetric and positive-definite then $\rho(L_\omega) < 1$ for $0 < \omega < 2$. Thus convergence of the iteration process follows, but we are generally interested in faster convergence rather than just convergence.
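A minimal sketch of the SOR sweep with the elementwise forward-substitution update is given below (illustrative, not part of the original notes); a symmetric positive-definite $A$ is used so that, by Ostrowski's result, the iteration converges for $0 < \omega < 2$.

```python
# Sketch of successive over-relaxation (SOR) with the forward-substitution update (illustrative).
import numpy as np

def sor(A, b, omega=1.5, iters=200):
    n = len(b)
    x = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            # Uses the new x[j] for j < i and the old x[j] for j > i.
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (1 - omega) * x[i] + omega / A[i, i] * (b[i] - s)
    return x

rng = np.random.default_rng(6)
n = 20
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)        # symmetric positive definite, so rho(L_omega) < 1 for 0 < omega < 2
b = rng.standard_normal(n)

x_sor = sor(A, b, omega=1.5)
print(np.allclose(x_sor, np.linalg.solve(A, b)))   # True after enough sweeps
```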
