STA141C: Big Data & High Performance Statistical Computing
Numerical Linear Algebra Background
Cho-Jui Hsieh
UC Davis
May 15, 2018
Linear Algebra Background
Vectors
A vector has a direction and a magnitude (norm)
Example (2-norm): $x = [1, 2]^T$, $\|x\|_2 = \sqrt{1^2 + 2^2} = \sqrt{5}$
Properties satisfied by a vector norm $\|\cdot\|$:
  $\|x\| \ge 0$, and $\|x\| = 0$ if and only if $x = 0$ (positivity)
  $\|\alpha x\| = |\alpha| \, \|x\|$ (homogeneity)
  $\|x + y\| \le \|x\| + \|y\|$ (triangle inequality)
Examples of vector norms
$x = [x_1, x_2, \dots, x_n]^T$
$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$ (2-norm)
$\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$ (1-norm)
$\|x\|_p = \left( |x_1|^p + |x_2|^p + \cdots + |x_n|^p \right)^{1/p}$ (p-norm)
$\|x\|_\infty = \max_{1 \le i \le n} |x_i|$ ($\infty$-norm)
Distances
$x = [1, 2]$, $y = [2, 1]$
$x - y = [1 - 2, 2 - 1] = [-1, 1]$
Distance:
  $\|x - y\|_2 = \|[-1, 1]\|_2 = \sqrt{2}$
  $\|x - y\|_1 = 2$
  $\|x - y\|_\infty = 1$
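These distances are easy to check in numpy; a minimal sketch (np.linalg.norm takes the order of the norm as its second argument):
>>> import numpy as np
>>> x = np.array([1, 2]); y = np.array([2, 1])
>>> np.linalg.norm(x - y)          # 2-norm
1.4142135623730951
>>> np.linalg.norm(x - y, 1)       # 1-norm
2.0
>>> np.linalg.norm(x - y, np.inf)  # infinity-norm
1.0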
Metrics
A metric $d(x, y)$ satisfies the following properties:
  $d(x, y) \ge 0$, and $d(x, y) = 0$ if and only if $x = y$
  $d(x, y) = d(y, x)$ (symmetry)
  $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality)
Is $d(x, y) = \|x - y\|$ a valid metric if $\|\cdot\|$ is a norm?
Yes: $d(x, z) = \|x - z\| = \|(x - y) + (y - z)\| \le \|x - y\| + \|y - z\| = d(x, y) + d(y, z)$
Inner Product between Vectors
$x = [x_1, \dots, x_n]^T$, $y = [y_1, \dots, y_n]^T$
Inner product: $x^T y = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_{i=1}^n x_i y_i$
$x^T x = \|x\|_2^2$, $\|x - y\|_2^2 = (x - y)^T (x - y)$
Orthogonality: $x \perp y \iff x^T y = 0$ (x and y are orthogonal to each other)
Projection onto a vector
$x^T y = \|x\|_2 \|y\|_2 \cos(\theta)$, so $\cos\theta = \dfrac{x^T y}{\|x\|_2 \|y\|_2}$
$\|x\|_2 \cos\theta = \dfrac{x^T y}{\|y\|_2} = x^T \hat{y}$, where $\hat{y} = y / \|y\|_2$
Projection of x onto y: $(\|x\|_2 \cos\theta)\, \hat{y} = \underbrace{\hat{y}\hat{y}^T}_{\text{projection matrix}} x$
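In numpy the projection matrix is just an outer product; a small sketch reusing the vectors from the distance example:
>>> x = np.array([1., 2.]); y = np.array([2., 1.])
>>> y_hat = y / np.linalg.norm(y)   # unit vector along y
>>> P = np.outer(y_hat, y_hat)      # projection matrix y_hat y_hat^T
>>> P @ x                           # projection of x onto y
array([1.6, 0.8])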
Linearly independent
Suppose we have 3 vectors $x_1, x_2, x_3$ with $x_1 = \alpha_2 x_2 + \alpha_3 x_3$: then $x_1$ is linearly dependent on $x_2$ and $x_3$
When are $x_1, \dots, x_n$ linearly independent? When $\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_n x_n = 0$ if and only if $\alpha_1 = \alpha_2 = \cdots = \alpha_n = 0$
A vector space is a set of vectors that is closed under vector addition and scalar multiplication: if $x_1, x_2 \in V$, then $\alpha_1 x_1 + \alpha_2 x_2 \in V$
A basis of a vector space is a maximal set of vectors in the space that are linearly independent of each other.
An orthogonal basis is a basis where all basis vectors are orthogonal to each other.
Dimension of the vector space: the number of vectors in a basis.
Matrices
An m-by-n matrix $A \in \mathbb{R}^{m \times n}$:
$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$
A matrix is also a linear transform $A : \mathbb{R}^n \to \mathbb{R}^m$, $x \mapsto Ax$:
$A(\alpha x + \beta y) = \alpha Ax + \beta Ay$ (linear transform)
Matrix Norms
Popular matrix norm: Frobenius norm
$\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2}$
Matrix norms satisfy the following properties:
  $\|A\| \ge 0$, and $\|A\| = 0$ if and only if $A = 0$ (positivity)
  $\|\alpha A\| = |\alpha| \, \|A\|$ (homogeneity)
  $\|A + B\| \le \|A\| + \|B\|$ (triangle inequality)
Induced Norm
Given a vector norm $\|\cdot\|$, we can define the corresponding induced norm (operator norm) by
$\|A\| = \sup\{\|Ax\| : \|x\| = 1\} = \sup_{x \ne 0} \dfrac{\|Ax\|}{\|x\|}$
Induced p-norm: $\|A\|_p = \sup_{x \ne 0} \dfrac{\|Ax\|_p}{\|x\|_p}$
Examples:
  $\|A\|_2 = \sup_{x \ne 0} \dfrac{\|Ax\|_2}{\|x\|_2} = \sigma_{\max}(A)$ (the induced 2-norm is the maximum singular value)
  $\|A\|_1 = \max_j \sum_{i=1}^m |a_{ij}|$ (maximum absolute column sum)
  $\|A\|_\infty = \max_i \sum_{j=1}^n |a_{ij}|$ (maximum absolute row sum)
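All of these matrix norms (and the Frobenius norm) are available through np.linalg.norm; the second argument selects the norm:
>>> A = np.array([[1., 2.], [3., 4.]])
>>> np.linalg.norm(A, 'fro')     # Frobenius norm
>>> np.linalg.norm(A, 2)         # induced 2-norm = sigma_max(A)
>>> np.linalg.norm(A, 1)         # max absolute column sum
>>> np.linalg.norm(A, np.inf)    # max absolute row sum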
Rank of a matrix
Column rank of A: the dimension of the column space (the vector space spanned by the column vectors)
Row rank of A: the dimension of the row space
Column rank = row rank := rank (always true)
Examples:
Rank-2 matrix:
$\begin{bmatrix} 2 & 3 & 1 \\ 0 & 1 & 1 \\ 3 & 3 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 0 & 1 \\ 3 & 3 \end{bmatrix} \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 1 \end{bmatrix}$
Rank-1 matrix:
$\begin{bmatrix} 1 & 0 & 2 & 1 \\ 1 & 0 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 2 & 1 \end{bmatrix}$
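Numerically, rank can be checked with np.linalg.matrix_rank (which estimates rank from the SVD); here on the two examples above:
>>> A = np.array([[2, 3, 1], [0, 1, 1], [3, 3, 0]])
>>> np.linalg.matrix_rank(A)
2
>>> B = np.outer([1, 1], [1, 0, 2, 1])
>>> np.linalg.matrix_rank(B)
1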
Singular Value Decomposition
Low-rank Approximation
Big matrix A: want to approximate it by a simple matrix $\hat{A}$
Goodness of the approximation $\hat{A}$ can be measured by $\|A - \hat{A}\|$ (e.g., using $\|\cdot\|_F$)
Low-rank approximation: $A \approx BC = \hat{A}$
Important Question
Given A and k, what is the best rank-k approximation?
$\min_{B, C \text{ of rank } k} \|A - BC\|_F$
The optimal B and C are obtained from the SVD (Singular Value Decomposition) of A
Singular Value Decomposition
Any real matrix $A \in \mathbb{R}^{m \times n}$ has the following Singular Value Decomposition (SVD):
$A = U \Sigma V^T$
U is an m-by-m unitary matrix ($U^T U = I$); the columns of U are orthonormal
V is an n-by-n unitary matrix ($V^T V = I$); the columns of V are orthonormal
Σ is an m-by-n diagonal matrix with non-negative real numbers on the diagonal. Usually, we assume the diagonal entries are sorted in descending order:
$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$
Singular Value Decomposition
$u_1, u_2, \dots, u_m$: left singular vectors, a basis of the column space
  $u_i^T u_j = 0$ for $i \ne j$; $u_i^T u_i = 1$ for $i = 1, \dots, m$
$v_1, v_2, \dots, v_n$: right singular vectors, a basis of the row space
  $v_i^T v_j = 0$ for $i \ne j$; $v_i^T v_i = 1$ for $i = 1, \dots, n$
The SVD is unique (up to permutations and rotations of singular vectors with equal singular values)
Linear Transformations
Matrix A is a linear transformation
For the right singular vectors $v_1, \dots, v_n$, what are the vectors after the transformation?
$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma$
Therefore, $Av_i = \sigma_i u_i$ for $i = 1, \dots, n$
Geometry of SVD
Given an input vector x, how do we compute the linear transform Ax? Assume $A = U\Sigma V^T$ is the SVD; then Ax is obtained by:
Step 1: Represent x in the right singular basis: $x = a_1 v_1 + \cdots + a_n v_n$, where $a_i = v_i^T x$
Step 2: Transform $v_i \mapsto \sigma_i u_i$ for all i: $Ax = a_1 \sigma_1 u_1 + \cdots + a_n \sigma_n u_n$
Reduced SVD (Thin SVD)
If m > n, then only the first n left singular vectors $u_1, \dots, u_n$ are useful (the remaining columns of U multiply zero rows of Σ).
Therefore, when m > n, we often just use the following thin SVD:
$A = U\Sigma V^T$, where $U \in \mathbb{R}^{m \times n}$, $\Sigma, V \in \mathbb{R}^{n \times n}$; the columns of U and V are orthonormal, and Σ is diagonal
Best Rank-k approximation
Given a matrix A, how do we form the best rank-k approximation?
$\arg\min_{X:\, \mathrm{rank}(X) = k} \|X - A\|_F$ or $\arg\min_{W \in \mathbb{R}^{m \times k},\, H \in \mathbb{R}^{n \times k}} \|WH^T - A\|_F$
Solution: truncated SVD
$A \approx U_k \Sigma_k V_k^T$, where $U_k$, $V_k$ consist of the first k columns of U and V, and $\Sigma_k$ contains the top-k singular values
Why? Keep the k most important components in the SVD
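A truncated-SVD sketch in numpy (np.linalg.svd returns singular values in descending order, so keeping the leading k triplets gives the best rank-k approximation):
import numpy as np

def best_rank_k(A, k):
    # Truncated SVD: keep the top-k singular triplets
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]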
Application: Data Approximation/Compression
Given a data matrix $A \in \mathbb{R}^{m \times n}$
Approximate the data matrix by the best rank-k approximation: $A \approx WH^T$
Compress the data: O(mn) memory → O(mk + nk) memory
De-noising (sometimes)
Example: if each column of A is a data point, then $w_j$ (the j-th column of W) is a dictionary element, and H contains the coefficients: $a_i \approx \sum_{j=1}^k H_{ij} w_j$
Eigenvalue Decomposition
Eigen Decomposition
For an n-by-n matrix H, if $Hy = \lambda y$ with $y \ne 0$, then we say
  λ is an eigenvalue of H
  y is the corresponding eigenvector
Eigen Decomposition
Consider $A \in \mathbb{R}^{n \times n}$ to be a square, symmetric matrix. The eigenvalue decomposition of A is:
$A = V \Lambda V^T$, where $V^T V = I$ (V is unitary) and Λ is diagonal
$A = V\Lambda V^T \;\Rightarrow\; AV = V\Lambda \;\Rightarrow\; Av_i = \lambda_i v_i$, $i = 1, \dots, n$
Each $v_i$ is an eigenvector, and each $\lambda_i$ is an eigenvalue
Usually, we assume the diagonal entries are sorted in descending order: $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
The eigenvalue decomposition is unique (up to sign) when the n eigenvalues are distinct.
Eigen Decomposition
Each eigenvector $v_i$ is mapped to $Av_i = \lambda_i v_i$ by the linear transform: scaling without changing the direction of the eigenvector
$Ax = \sum_{i=1}^n \lambda_i v_i (v_i^T x)$: project x onto the eigenvectors, then scale each component
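A quick numpy check of this expansion on a small symmetric example (np.linalg.eigh returns eigenvalues in ascending order, which does not affect the sum):
>>> A = np.array([[2., 1.], [1., 2.]])
>>> lam, V = np.linalg.eigh(A)
>>> x = np.array([1., 0.])
>>> Ax = sum(lam[i] * V[:, i] * (V[:, i] @ x) for i in range(2))
>>> np.allclose(Ax, A @ x)
True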
How to compute SVD/eigen decomposition?
Eigen Decomposition & SVD
SVD can be transformed into an eigenvalue decomposition!
Given $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^T$:
$AA^T = U\Sigma V^T V \Sigma U^T = U\Sigma^2 U^T$ (eigen decomposition of $AA^T$)
  After getting U and Σ, we can compute V by $V^T = \Sigma^{-1} U^T A$
$A^T A = V\Sigma^2 V^T$ (eigen decomposition of $A^T A$)
  After getting V and Σ, we can compute $U = AV\Sigma^{-1}$
Eigen Decomposition & SVD
Which one should we use: the eigenvalue decomposition of $AA^T$ or of $A^T A$? It depends on the dimensions m vs n.
If $A \in \mathbb{R}^{m \times n}$ and m > n:
  Step 1: Compute the eigenvalue decomposition $A^T A = V\Sigma^2 V^T$
  Step 2: Compute $U = AV\Sigma^{-1}$
If $A \in \mathbb{R}^{m \times n}$ and m < n:
  Step 1: Compute the eigenvalue decomposition $AA^T = U\Sigma^2 U^T$
  Step 2: Compute $V = A^T U\Sigma^{-1}$
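A minimal numpy sketch of this reduction for the m > n case (assuming all singular values are strictly positive so Σ is invertible; production SVD routines do not work this way):
import numpy as np

def svd_via_gram(A):
    # Eigen decomposition of the smaller Gram matrix: A^T A = V Sigma^2 V^T
    evals, V = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1]        # reorder to descending
    evals, V = evals[idx], V[:, idx]
    sigma = np.sqrt(np.maximum(evals, 0.0))
    U = A @ V / sigma                    # U = A V Sigma^{-1}
    return U, sigma, V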
Computing Eigenvalue Decomposition
If A is a dense matrix:
  Call scipy.linalg.eig
If A is a (large) sparse matrix: usually we only compute the top-k eigenvectors with a small k
  Power method
  Lanczos algorithm (implemented in built-in packages)
  Call scipy.sparse.linalg.eigs
Dense Eigenvalue Decomposition
Only for a dense matrix
Will compute all the eigenvalues and eigenvectors
>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[1, 2], [3, 4]])
>>> S, V = linalg.eig(A)
>>> S
array([-0.37228132+0.j,  5.37228132+0.j])
>>> V
array([[-0.82456484, -0.41597356],
       [ 0.56576746, -0.90937671]])
Dense SVD
Only for a dense matrix
Specify full_matrices=False for the thin SVD
>>> A = np.array([[1, 2], [3, 4], [5, 6]])
>>> U, S, V = linalg.svd(A)
>>> U
array([[-0.2298477 ,  0.88346102,  0.40824829],
       [-0.52474482,  0.24078249, -0.81649658],
       [-0.81964194, -0.40189603,  0.40824829]])
>>> S
array([ 9.52551809,  0.51430058])
>>> U, S, V = linalg.svd(A, full_matrices=False)
>>> U
array([[-0.2298477 ,  0.88346102],
       [-0.52474482,  0.24078249],
       [-0.81964194, -0.40189603]])
Sparse Eigenvalue Decomposition
Usually we only compute the top-k eigenvalues and eigenvectors (since U, V will be dense)
Use iterative methods to compute $U_k$ (the top-k eigenvectors):
  Initialize $U_k$ with an n-by-k random matrix $U_k^{(0)}$
  Iteratively update: $U_k^{(0)} \to U_k^{(1)} \to U_k^{(2)} \to \cdots$
  Converges to the solution: $\lim_{t \to \infty} U_k^{(t)} = U_k$
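A usage sketch of scipy.sparse.linalg.eigs on a random sparse test matrix (the size and density here are illustrative; eigs returns the k eigenvalues of largest magnitude by default):
>>> from scipy import sparse
>>> from scipy.sparse.linalg import eigs
>>> A = sparse.random(1000, 1000, density=0.01, format='csr', random_state=0)
>>> vals, vecs = eigs(A, k=5)
>>> vecs.shape
(1000, 5)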
Power Iteration for top eigenvector
Main idea: compute $A^T v / \|A^T v\|$ for a large power T (here $A^T$ denotes the T-th power of A, not the transpose):
$A^T v / \|A^T v\| \to$ top eigenvector as $T \to \infty$
Power Iteration for top eigenvector
  Input: $A \in \mathbb{R}^{d \times d}$, number of iterations T
  Initialize a random vector $v \in \mathbb{R}^d$
  For $t = 1, 2, \dots, T$:
    $v \leftarrow Av$
    $v \leftarrow v / \|v\|$
  Output v (top eigenvector)
Properties of power iteration:
  The top eigenvalue is $v^T A v$ ($= v^T \lambda v = \lambda \|v\|^2 = \lambda$, since $\|v\| = 1$)
  Only one matrix-vector product is needed at each iteration
  Time complexity: O(nnz(A)) per iteration
  If $A = XX^T$, then $Av = X(X^T v)$: no need to form A explicitly
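A direct numpy transcription of the algorithm (a sketch: the fixed iteration count T stands in for a proper convergence test):
import numpy as np

def power_iteration(A, T=100, seed=0):
    # Power iteration for the top eigenvector of a symmetric matrix A
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(T):
        v = A @ v                 # one matrix-vector product per iteration
        v = v / np.linalg.norm(v)
    return v @ A @ v, v           # (top eigenvalue, top eigenvector)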
Power Iteration
If $A = U\Lambda U^T$ is the eigenvalue decomposition of A, what is the eigenvalue decomposition of $A^t$?
$A^t = (U\Lambda U^T)(U\Lambda U^T)\cdots(U\Lambda U^T) = U\Lambda^t U^T = U\,\mathrm{diag}(\lambda_1^t, \lambda_2^t, \dots, \lambda_d^t)\,U^T$
$A^t v = \sum_{i=1}^d \lambda_i^t (u_i^T v) u_i \;\propto\; u_1 + \left(\frac{\lambda_2}{\lambda_1}\right)^t \frac{u_2^T v}{u_1^T v} u_2 + \cdots + \left(\frac{\lambda_d}{\lambda_1}\right)^t \frac{u_d^T v}{u_1^T v} u_d$
If v is a random vector, then $u_1^T v \ne 0$ with probability 1, so $\frac{A^t v}{\|A^t v\|} \to u_1$ as $t \to \infty$
Power iteration for top-k eigenvectors
Power Iteration for top-k eigenvectors (block power iteration)
  Input: $A \in \mathbb{R}^{d \times d}$, number of iterations T, rank k
  Initialize a random matrix $V \in \mathbb{R}^{d \times k}$
  For $t = 1, 2, \dots, T$:
    $V \leftarrow AV$
    $[Q, R] \leftarrow \mathrm{qr}(V)$
    $V \leftarrow Q$
  $\tilde{S} \leftarrow V^T A V$
  $[\bar{U}, S] \leftarrow \mathrm{eig}(\tilde{S})$
  Output eigenvectors $U = V\bar{U}$ and eigenvalues S
Sometimes faster than the default functions (scipy.sparse.linalg.eigs and svds), since those aim to compute very accurate solutions.
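A numpy sketch of the block version (again with a fixed iteration count in place of a convergence test; assumes A is symmetric):
import numpy as np

def block_power_iteration(A, k, T=100, seed=0):
    # Block (subspace) power iteration for the top-k eigenpairs
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((A.shape[0], k))
    for _ in range(T):
        V = A @ V
        V, _ = np.linalg.qr(V)        # re-orthonormalize the block
    S_tilde = V.T @ A @ V             # small k-by-k projected matrix
    evals, U_bar = np.linalg.eigh(S_tilde)
    idx = np.argsort(evals)[::-1]     # sort eigenvalues descending
    return evals[idx], (V @ U_bar)[:, idx]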
Coming up
Numerical Linear Algebra for Machine Learning
Questions?