Lecture 8: Linear Algebra Background

CSE 521: Design and Analysis of Algorithms I                                Winter 2017

Lecture 8: Linear Algebra Background
Lecturer: Shayan Oveis Gharan        2/1/2017        Scribe: Swati Padmanabhan

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

Now that we have finished our lecture series on randomized algorithms, we start with a bit of linear algebra review so that we can use these tools in the algorithms we learn next. The book Matrix Analysis by Horn and Johnson is an excellent reference for all the concepts reviewed here.

8.1 Eigenvalues

For a matrix $A \in \mathbb{R}^{n \times n}$, an eigenvalue-eigenvector pair is a pair $(\lambda, x)$ with $x \neq 0$ such that $Ax = \lambda x$. Many of our algorithms will deal with the family of symmetric matrices (which we denote by $S_n$), whose eigenvalues have special properties. We start with the fact that a symmetric matrix has real eigenvalues. This means we can order them and talk about the largest/smallest eigenvalues (which we will do in Section 8.2).

8.1.1 Spectral Theorem

Theorem 8.1 (Spectral Theorem). For any symmetric matrix $M \in S_n$, there are eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ with corresponding eigenvectors $v_1, v_2, \dots, v_n$ which are orthonormal (that is, they have unit length measured in the $\ell_2$ norm and $\langle v_i, v_j \rangle = 0$ for all $i \neq j$). We can then write

$$M = \sum_{i=1}^{n} \lambda_i v_i v_i^T = V \Lambda V^T, \qquad (8.1)$$

where $V$ is the matrix with the $v_i$'s arranged as column vectors and $\Lambda$ is the diagonal matrix of eigenvalues.

The $v_i$'s in the above theorem form a basis for all vectors in $\mathbb{R}^n$. This means that any vector $x$ can be written uniquely as $x = \sum_i \langle v_i, x \rangle v_i$. An application of this is being able to write complicated functions of a symmetric matrix in terms of functions of its eigenvalues, that is, $f(M) = \sum_{i=1}^n f(\lambda_i) v_i v_i^T$ for $M \in S_n$. For example:

- $M^2 = \sum_{i=1}^n \lambda_i^2 v_i v_i^T$.
- $\exp(M) = \sum_{k \geq 0} \frac{M^k}{k!} = \sum_{i=1}^n \exp(\lambda_i) v_i v_i^T$.
- For an invertible matrix, $M^{-1} = \sum_{i=1}^n \frac{1}{\lambda_i} v_i v_i^T$.

Two special functions of the eigenvalues are the trace and the determinant, described in the next subsection.
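Before moving on, here is a small numerical sketch (our own addition, not part of the original scribe notes) that checks the spectral decomposition (8.1) and the identity $f(M) = \sum_i f(\lambda_i) v_i v_i^T$ for $f(x) = x^2$ and $f = \exp$; the random matrix and the choice of NumPy/SciPy routines are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

# Build a random symmetric matrix M = (B + B^T) / 2.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
M = (B + B.T) / 2

# Spectral theorem (8.1): M = V diag(lambda) V^T with orthonormal columns of V.
lam, V = np.linalg.eigh(M)              # eigh is the symmetric/Hermitian eigensolver
assert np.allclose(M, V @ np.diag(lam) @ V.T)
assert np.allclose(V.T @ V, np.eye(4))  # the eigenvectors are orthonormal

# f(M) = sum_i f(lambda_i) v_i v_i^T, checked for f(x) = x^2 and f = exp.
assert np.allclose(V @ np.diag(lam**2) @ V.T, M @ M)
assert np.allclose(V @ np.diag(np.exp(lam)) @ V.T, expm(M))
```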

8.1.2 Trace, Determinant and Rank

Definition 8.2. The trace of a square matrix is the sum of its diagonal entries.

Alternatively, we can say the following:

Lemma 8.3. The trace of a square matrix is the sum of its eigenvalues.

We will give three different proofs for this lemma.

Proof 1: By definition of trace, $\mathrm{Tr}(A) = \sum_{i=1}^n 1_i^T A 1_i$, where $1_i$ is the indicator vector of $i$, i.e., the vector that is equal to 1 in the $i$-th coordinate and 0 everywhere else. Using (8.1) we can write

$$\mathrm{Tr}(A) = \sum_{i=1}^n 1_i^T \Big( \sum_{j=1}^n \lambda_j v_j v_j^T \Big) 1_i = \sum_{i=1}^n \sum_{j=1}^n \lambda_j\, 1_i^T v_j v_j^T 1_i = \sum_{j=1}^n \lambda_j \sum_{i=1}^n \langle 1_i, v_j \rangle^2 = \sum_{j=1}^n \lambda_j.$$

The last identity uses the fact that for any vector $v_j$, $\sum_{i=1}^n \langle 1_i, v_j \rangle^2 = \|v_j\|^2 = 1$, as $1_1, \dots, 1_n$ form another orthonormal basis of $\mathbb{R}^n$.

Next, we give two other proofs of the same statement for the sake of intuition.

Proof 2: Write out the characteristic polynomial of $A$. It turns out that the characteristic polynomial of $A$ is the monic polynomial of degree $n$ whose roots are the eigenvalues of $A$ (with multiplicity),

$$\chi_A(t) = \det(tI - A) = (t - \lambda_1)(t - \lambda_2) \cdots (t - \lambda_n).$$

Observe that in the RHS of the above, the coefficient of $t^{n-1}$ (up to a negative sign) is equal to the sum of all eigenvalues. On the other hand, if we expand the determinant of the matrix $tI - A$, we see that the sum of the diagonal entries of $A$ exactly equals the coefficient of $t^{n-1}$, again up to the same negative sign.

Proof 3: Recall the cyclic permutation property of trace:

$$\mathrm{Tr}(ABC) = \mathrm{Tr}(BCA) = \mathrm{Tr}(CAB).$$

This is derived directly from the definition. Now let $\lambda_1, \dots, \lambda_n$ be the eigenvalues of $A$ with corresponding eigenvectors $v_1, \dots, v_n$. We have

$$\mathrm{Tr}(A) = \mathrm{Tr}\Big( \sum_{i=1}^n \lambda_i v_i v_i^T \Big) = \sum_{i=1}^n \mathrm{Tr}(\lambda_i v_i v_i^T) = \sum_{i=1}^n \lambda_i \mathrm{Tr}(v_i^T v_i) = \sum_{i=1}^n \lambda_i.$$

In the last identity we used the cyclic property together with the fact that $\|v_i\| = 1$ for all $i$.

Lemma 8.4. The determinant of a matrix is the product of its eigenvalues.

The above lemma can be proved using the characteristic polynomial (evaluate $\chi_A$ at $t = 0$). It follows from the lemma that the determinant is zero if and only if at least one eigenvalue is zero, that is, if and only if the matrix is not full rank. For a symmetric matrix, we can also state that the rank is the number of non-zero eigenvalues.

8.2 Rayleigh Quotient

Let $A$ be a symmetric matrix. The Rayleigh quotient gives a characterization of all eigenvalues (and eigenvectors) of $A$ in terms of the solutions to optimization problems. Let $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$ be the eigenvalues of $A$. Then,

$$\lambda_1(A) = \max_{\|x\|_2 = 1} x^T A x = \max_{x \neq 0} \frac{x^T A x}{x^T x}. \qquad (8.2)$$

Let $x_1$ be the optimum vector in the above. It follows that $x_1$ is the eigenvector of $A$ corresponding to $\lambda_1$. Then,

$$\lambda_2(A) = \max_{x :\ \langle x, x_1 \rangle = 0,\ \|x\|_2 = 1} x^T A x.$$

And so on: the third eigenvector is the unit vector maximizing the quadratic form $x^T A x$ over all vectors orthogonal to the first two eigenvectors, and $\lambda_3$ is the corresponding maximum value. Similarly, we can write

$$\lambda_n(A) = \min_{\|x\|_2 = 1} x^T A x.$$

Let us derive Equation (8.2). Note that $f(x) = x^T A x$ is a continuous function and $\{x : \|x\|_2 = 1\}$ is a compact set, so by the Weierstrass theorem the maximum is attained. Now we diagonalize $A$ using Equation (8.1) as $A = \sum_{i=1}^n \lambda_i v_i v_i^T$ and multiply on either side by $x$ to get the following chain of equalities:

$$x^T A x = x^T \Big( \sum_{i=1}^n \lambda_i v_i v_i^T \Big) x = \sum_{i=1}^n \lambda_i\, x^T v_i v_i^T x = \sum_{i=1}^n \lambda_i \langle x, v_i \rangle^2. \qquad (8.3)$$

Since $\|x\| = 1$ and $v_1, \dots, v_n$ form an orthonormal basis of $\mathbb{R}^n$, we have $\sum_{i=1}^n \langle v_i, x \rangle^2 = \|x\|^2 = 1$. Therefore, (8.3) is maximized when $\langle x, v_1 \rangle = 1$ and the remaining terms are 0. This means the vector $x$ for which this optimum value is attained is $v_1$, as desired. In the same way, we can also get the characterization for the minimum eigenvalue.

Positive (Semi)Definite Matrices

A real symmetric matrix $A$ is said to be positive semidefinite (PSD), written $A \succeq 0$, if $x^T A x \geq 0$ for all $x \in \mathbb{R}^n$. A real symmetric matrix $A$ is said to be positive definite (PD), written $A \succ 0$, if $x^T A x > 0$ for all $x \neq 0$, $x \in \mathbb{R}^n$. By the Rayleigh-Ritz characterization, we can see that $A$ is PSD if and only if all eigenvalues of $A$ are nonnegative. Also, $A$ is positive definite if and only if all eigenvalues of $A$ are positive.

8.3 Singular Value Decomposition

Of course, not every matrix is unitarily diagonalizable. In fact, non-symmetric matrices may not have real eigenvalues, and their eigenvectors are not necessarily orthonormal. Instead, when dealing with a non-symmetric matrix, we first turn it into a symmetric matrix and then apply the spectral theorem to that matrix. This idea is called the Singular Value Decomposition (SVD). Any matrix $A \in \mathbb{R}^{m \times n}$ (with $m \leq n$) can be written as

$$A = U \Sigma V^T = \sum_{i=1}^m \sigma_i u_i v_i^T, \qquad (8.4)$$

where $\sigma_1 \geq \dots \geq \sigma_m \geq 0$ are the singular values of $A$, $u_1, \dots, u_m \in \mathbb{R}^m$ are orthonormal and are called the left singular vectors of $A$, and $v_1, \dots, v_m \in \mathbb{R}^n$ are orthonormal and are called the right singular vectors of $A$.

To construct this decomposition we apply the spectral theorem to the matrix $A^T A$. Observe that if the above identity holds then

$$A^T A = \Big( \sum_{i=1}^m \sigma_i v_i u_i^T \Big) \Big( \sum_{j=1}^m \sigma_j u_j v_j^T \Big) = \sum_{i=1}^m \sigma_i^2 v_i v_i^T,$$

where we used that $\langle u_i, u_j \rangle$ is 1 if $i = j$ and 0 otherwise. Therefore, $v_1, \dots, v_m$ are in fact eigenvectors of $A^T A$ and $\sigma_1^2, \dots, \sigma_m^2$ are the corresponding eigenvalues. By a similar argument it follows that $u_1, \dots, u_m$ are eigenvectors of $A A^T$ and $\sigma_1^2, \dots, \sigma_m^2$ are its eigenvalues. Note that both matrices $A A^T$ and $A^T A$ are symmetric PSD matrices. In matrix form the above identities can be written as

$$A^T A = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T = [V\ \tilde V] \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} [V\ \tilde V]^T, \qquad (8.5)$$

$$A A^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T = [U\ \tilde U] \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} [U\ \tilde U]^T, \qquad (8.6)$$

where $\tilde V, \tilde U$ are any matrices for which $[V\ \tilde V]$ and $[U\ \tilde U]$ are orthogonal. The right-hand expressions are eigenvalue decompositions of $A^T A$ and $A A^T$. To summarize:

- The singular values $\sigma_i$ are the square roots of the eigenvalues of $A^T A$ and $A A^T$, that is, $\sigma_i(A) = \sqrt{\lambda_i(A^T A)} = \sqrt{\lambda_i(A A^T)}$ (with $\lambda_i(A^T A) = \lambda_i(A A^T) = 0$ for $i > r$, where $r = \mathrm{rank}(A)$).
- The left singular vectors $u_1, \dots, u_r$ are eigenvectors of $A A^T$, and the right singular vectors $v_1, \dots, v_m$ are eigenvectors of $A^T A$.
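The relationship between the SVD of $A$ and the eigendecompositions of $A^T A$ and $A A^T$ can be checked numerically. The following sketch is our own addition (the random matrix and the NumPy routines are illustrative choices); it verifies that $\sigma_i^2 = \lambda_i(A A^T)$ and that the singular vectors are the corresponding eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5                                   # m <= n, as in (8.4)
A = rng.standard_normal((m, n))

# Thin SVD: A = U diag(sigma) V^T with U of size m x m and V^T of size m x n.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(sigma) @ Vt)

# Singular values are the square roots of the eigenvalues of A A^T (and A^T A).
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]   # sorted descending
assert np.allclose(sigma**2, eig_AAt)

# Columns of U are eigenvectors of A A^T; columns of V (rows of Vt) are
# eigenvectors of A^T A, with eigenvalue sigma_i^2 in both cases.
assert np.allclose(A @ A.T @ U, U * sigma**2)
assert np.allclose(A.T @ A @ Vt.T, Vt.T * sigma**2)
```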

In general, computing the singular value decomposition takes $O(n^3)$ time.

8.4 Matrix Norms

Any matrix $A \in \mathbb{R}^{n \times n}$ can be thought of as a vector of $n^2$ dimensions. Therefore, we can measure the size of a matrix using matrix norms. For a function $\|\cdot\| : \mathbb{R}^{n \times n} \to \mathbb{R}$ to be a matrix norm, it must satisfy non-negativity (with $\|A\| = 0$ only when $A = 0$), homogeneity, the triangle inequality, and submultiplicativity. We list below a few important matrix norms that we will repeatedly encounter.

Frobenius norm:

$$\|A\|_F = \mathrm{Tr}(A A^T)^{1/2} = \Big( \sum_{i,j=1}^n a_{ij}^2 \Big)^{1/2}. \qquad (8.7)$$

The Frobenius norm is just the Euclidean norm of the matrix $A$ thought of as a vector. As we just saw in Section 8.3,

$$\mathrm{Tr}(A A^T) = \sum_i \lambda_i(A A^T) = \sum_i \sigma_i(A)^2,$$

therefore this gives us an important alternative characterization of the Frobenius norm:

$$\|A\|_F = \Big( \sum_i \sigma_i(A)^2 \Big)^{1/2}. \qquad (8.8)$$

Operator norm: The operator norm $\|\cdot\|_2$ is defined as

$$\|A\|_2 = \max_{\|x\| = 1} \|Ax\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|}. \qquad (8.9)$$

It follows from the Rayleigh-Ritz characterization that

$$\|A\|_2^2 = \max_{x \neq 0} \frac{\|Ax\|^2}{\|x\|^2} = \max_{x \neq 0} \frac{x^T A^T A x}{x^T x} = \lambda_{\max}(A^T A) = \sigma_{\max}(A)^2,$$

so $\|A\|_2 = \sigma_{\max}(A)$.

8.5 Low Rank Approximation

The ideas described in the previous sections are used in low-rank approximation theory, which finds many applications in computer science. A famous recent example is the Netflix problem. We have a large dataset of users, many of whom have provided ratings for many movies, but this ratings matrix obviously has many missing entries. The problem is to figure out, using this limited data, what movies to recommend to users. Under the (justifiable) assumption that the ratings matrix is low rank, this is a matrix completion problem that falls in the category of low-rank approximation. We may, for example, set the unknown entries to 0, approximate the resulting matrix with a low-rank matrix, and then fill in the unknown entries with the entries of the estimated low-rank matrix. This gives a heuristic for the matrix completion problem.

Formally, in the low-rank approximation problem we are given a matrix $M$ and we want to find another matrix $\tilde M$ of rank $k$ such that $\|M - \tilde M\|$ is as small as possible.
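As a quick numerical check of (8.8) and (8.9), and a preview of how the SVD yields a rank-$k$ candidate $\tilde M$ for the problem above (the choice analyzed in the next lecture), here is a short sketch. It is our own illustration: the random matrix, the value $k = 2$, and the NumPy calls are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 8))
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# (8.8): the Frobenius norm equals the l2 norm of the singular values.
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(sigma**2)))
# (8.9): the operator norm equals the largest singular value.
assert np.isclose(np.linalg.norm(A, 2), sigma[0])

# A rank-k candidate: keep only the top k singular triples.
k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k
# The Frobenius error of this truncation equals the norm of the dropped tail.
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(sigma[k:]**2)))
```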

The famous Johnson-Lindenstrauss dimension reduction lemma tells us that for any set of $n$ points $P \subseteq \mathbb{R}^m$, with high probability, we can map them, using a matrix $\Gamma \in \mathbb{R}^{d \times m}$, to a $d = O(\log n)/\epsilon^2$ dimensional space such that for any $x, y \in P$,

$$(1 - \epsilon)\|x - y\|^2 \leq \|\Gamma(x) - \Gamma(y)\|^2 \leq (1 + \epsilon)\|x - y\|^2.$$

We do not prove this lemma in this course, as it has been covered in the randomized algorithms course. The important fact here is that the mapping is a linear map and $\Gamma$ is just a Gaussian matrix; i.e., $\Gamma \in \mathbb{R}^{d \times m}$ with each entry $\Gamma_{i,j} \sim \mathcal{N}(0,1)/\sqrt{d}$.

As is clear from this construction, these dimension reduction ideas are oblivious to the structure of the data: the Gaussian mapping defined above does not look at the data points to construct the lower-dimensional map. Because of that, it may not help us observe hidden structures in the data. As we will see in the next lecture, low-rank approximation algorithms choose the low-rank matrix by looking at the SVD of $M$; because of that, they can typically reveal hidden structure among the data points that $M$ represents.
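To make the oblivious Gaussian map concrete, here is a small sketch (our own addition; the distortion $\epsilon = 0.5$, the constant 8 in the choice of $d$, and the random point set are illustrative assumptions) that projects points with a Gaussian $\Gamma$ and reports how well pairwise distances are preserved.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, m, eps = 50, 1000, 0.5
d = int(np.ceil(8 * np.log(n) / eps**2))           # d = O(log n / eps^2); constant 8 is illustrative

P = rng.standard_normal((n, m))                    # n points in R^m, one per row
Gamma = rng.standard_normal((d, m)) / np.sqrt(d)   # Gamma_{ij} ~ N(0,1)/sqrt(d)
Q = P @ Gamma.T                                    # projected points in R^d

# Ratios of projected to original squared distances; typically within [1-eps, 1+eps].
ratios = [np.sum((Q[i] - Q[j])**2) / np.sum((P[i] - P[j])**2)
          for i, j in combinations(range(n), 2)]
print("min ratio:", min(ratios), "max ratio:", max(ratios))
```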