Dimensionality Reduction
Le Song
Machine Learning I, CSE 6740, Fall 2013
Unsupervised learning
Learning from raw (unlabeled, unannotated, etc.) data, as opposed to supervised learning, where a labeling of examples is given
Explore and understand your data before you create expensive labeled data or build predictive models
Extract meaningful patterns or compact representations from raw data according to certain criteria
More subjective than supervised learning, and harder to evaluate
Document collections: what are the relations between data points?
Image databases: what are the relations between data points?
Handwritten digits: what are the relations between data points?
Cartoon characters: what are the relations between data points?
So what is dimensionality reduction?
The process of reducing the number of random variables under consideration
One can combine, transform, or select variables
The dimension-reduced data can be used for
  Visualizing, exploring, and understanding the data
  Cleaning the data
  Building simpler models later
Issues for dimensionality reduction
  How to represent objects? (Vector space? Normalization?)
  What is the criterion for carrying out the reduction process?
  What are the algorithm steps?
Bag of words representation
Document 1: "Machine learning concerns the construction and study of systems that can learn from data."
Document 2: "Representation of data instances and functions evaluated on these instances are part of all machine learning systems."
Count how often each vocabulary word (learn, represent, system, data, instance, function, ...) occurs in a document; each document becomes a vector in $\mathbb{R}^n$
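As a concrete sketch (not from the slides), here is a minimal bag-of-words encoder in Python; the tiny vocabulary and the prefix-based matching are simplifying assumptions:

```python
import re

import numpy as np

docs = [
    "Machine learning concerns the construction and study of "
    "systems that can learn from data.",
    "Representation of data instances and functions evaluated on "
    "these instances are part of all machine learning systems.",
]

# Assumed toy vocabulary; a real pipeline would build it from the whole corpus.
vocab = ["learn", "represent", "system", "data", "instance", "function"]

def bag_of_words(text, vocab):
    """Count occurrences of each vocabulary word (prefix match as crude stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return np.array([sum(t.startswith(v) for t in tokens) for v in vocab])

X = np.stack([bag_of_words(d, vocab) for d in docs])  # one row per document
print(X)  # each row is the document's vector in R^n (here n = 6)
```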
Pixel representation: each image is a vector of pixel intensities in $\mathbb{R}^n$
Images of different sizes: features such as color, texture, and composition map each image to a vector in $\mathbb{R}^n$
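A minimal NumPy sketch of both ideas (the random images and the 16-bin intensity histogram are assumed stand-ins for real pixels and for the color/texture features on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pixel representation: a 28x28 grayscale digit flattens to a vector in R^784.
digit = rng.integers(0, 256, size=(28, 28))
x_pixels = digit.reshape(-1)  # shape (784,)

def intensity_hist(img):
    """16-bin intensity histogram, normalized by image size."""
    return np.histogram(img, bins=16, range=(0, 256))[0] / img.size

# Images of different sizes: the histogram gives a fixed-length vector
# regardless of image size (a crude color/texture feature).
small = rng.integers(0, 256, size=(32, 48))
large = rng.integers(0, 256, size=(200, 300))
print(intensity_hist(small).shape, intensity_hist(large).shape)  # both (16,)
```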
Objects in real life
Categorical attributes, e.g. Family: 0/1/2, Sex: 0/1, Work place: 0/1/2/3
Each object becomes a vector in $\mathbb{R}^n$
Use what criterion for reduction?
There are many criteria (geometry-based, information-theory-based, etc.)
Want to capture variation in data
  Variations are the signals, or information, in the data
  Need to normalize each variable first (see the sketch after this slide)
Want to discover variables or dimensions that are highly correlated or dependent
  They represent highly related phenomena
  Combine them to form a simpler representation
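Since a variance-based criterion is dominated by whichever variable happens to have the largest scale, a common first step is z-score normalization; a minimal sketch with assumed toy data:

```python
import numpy as np

# Assumed toy data: e.g. height in cm and weight in kg, on very different scales.
X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 50.0]])

# Standardize each variable (column) to zero mean and unit variance so that
# no variable dominates the variance criterion simply because of its units.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and 1 per column
```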
An example
[Figure: scatter plot; data vary more in one direction and less in the orthogonal direction; the two features are correlated]
Reduce to one dimension
Principal component analysis
Given $m$ data points $x_1, x_2, \ldots, x_m \in \mathbb{R}^n$, with their mean $\mu = \frac{1}{m}\sum_{i=1}^m x_i$
Find a direction $w \in \mathbb{R}^n$ with $\|w\| = 1$
such that the variance (or variation) of the data along direction $w$ is maximized:
$$\max_{w:\,\|w\|=1}\; \frac{1}{m}\sum_{i=1}^m \left(w^\top x_i - w^\top \mu\right)^2 \quad \text{(variance)}$$
Is it an easy optimization problem?
Manipulate the objective with linear algebra:
$$\frac{1}{m}\sum_{i=1}^m \left(w^\top x_i - w^\top \mu\right)^2
= \frac{1}{m}\sum_{i=1}^m \left(w^\top (x_i - \mu)\right)^2
= \frac{1}{m}\sum_{i=1}^m w^\top (x_i - \mu)(x_i - \mu)^\top w
= w^\top \underbrace{\left[\frac{1}{m}\sum_{i=1}^m (x_i - \mu)(x_i - \mu)^\top\right]}_{\text{covariance matrix } C} w$$
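A quick numerical sanity check of this identity (random data and a random unit direction are assumed): the variance of the projections onto $w$ equals $w^\top C w$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # m = 100 points in R^3
mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)              # covariance matrix C

w = rng.normal(size=3)
w /= np.linalg.norm(w)                          # unit direction

variance_along_w = np.mean(((X - mu) @ w) ** 2)
print(np.isclose(variance_along_w, w @ C @ w))  # True
```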
Landscape of the optimization problem
Suppose the data has two dimensions ($n = 2$) and $C$ is a diagonal matrix:
$$C = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$$
The optimization problem becomes
$$\max_{w:\,\|w\|=1} w^\top C w
= \max_{w:\,\|w\|=1} (w_1, w_2) \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}
= \max_{w:\,\|w\|=1} w_1^2 + 2 w_2^2$$
Landscape of the optimization problem f w, w 2 = w 2 + 2w 2 2 3 2.5 2.5.5.5 -.5 - - -.5.5 8
Eigen-value problem
Given a symmetric matrix $C \in \mathbb{R}^{n \times n}$
Find a vector $w \in \mathbb{R}^n$ with $\|w\| = 1$
such that $Cw = \lambda w$
There will be multiple solutions $w_1, w_2, \ldots$ with different $\lambda_1, \lambda_2, \ldots$
They are orthonormal: $w_i^\top w_i = 1$ and $w_i^\top w_j = 0$ for $i \neq j$
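For instance, NumPy's np.linalg.eigh computes such a decomposition for symmetric matrices; a minimal check (random symmetric matrix assumed) that the returned eigenvectors satisfy both properties:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
C = (A + A.T) / 2                       # make a symmetric matrix

lam, W = np.linalg.eigh(C)              # eigenvalues (ascending), eigenvectors in columns
print(np.allclose(C @ W, W * lam))      # C w_i = lambda_i w_i for every column
print(np.allclose(W.T @ W, np.eye(4)))  # the columns are orthonormal
```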
Equivalent to eigen-value problem
Claim: $\max_{w:\,\|w\|=1} w^\top C w \iff Cw = \lambda w$
Form the Lagrangian function of the optimization problem:
$$L(w, \lambda) = w^\top C w + \lambda\,(1 - \|w\|^2)$$
Necessary condition: if $w$ is a maximum of the original optimization problem, then there exists a $\lambda$ such that $(w, \lambda)$ is a stationary point of $L(w, \lambda)$
This implies $\frac{\partial L}{\partial w} = 2Cw - 2\lambda w = 0$, i.e., $Cw = \lambda w$
Principal direction of the data
[Figure: data scatter with an arrow along the principal direction $w$]
Variance in the principal direction
The principal direction $w$ satisfies $Cw = \lambda w$
The variance in the principal direction is $w^\top C w = \lambda\, w^\top w = \lambda$, the eigen-value
Multiple principal directions
Find directions $w_1, w_2, \ldots$ that have the largest variances but are orthogonal to each other
Take the eigenvectors $w_1, w_2, \ldots$ of $C$ corresponding to the largest eigenvalue $\lambda_1$, the second largest eigenvalue $\lambda_2$, and so on
Solve eigen-value problem
Not an easy task in general
But eigen-decomposition is implemented in many modern linear algebra libraries
Large-scale, parallel, distributed, and iterative implementations also exist
For instance, in Matlab
  [W, S] = eig(C) returns all eigen-vectors
  [W, S] = eigs(C, k) returns k eigen-vectors
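Putting the pieces together, here is a minimal PCA sketch in NumPy (the random data and the choice k = 2 are assumptions; np.linalg.eigh plays the role of Matlab's eig):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # m = 200 data points in R^5
k = 2                               # number of principal directions to keep

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)  # covariance matrix

lam, W = np.linalg.eigh(C)          # eigh sorts eigenvalues in ascending order
order = np.argsort(lam)[::-1]       # re-sort in descending order
lam, W = lam[order], W[:, order]

W_k = W[:, :k]                      # top-k principal directions (columns)
Z = (X - mu) @ W_k                  # dimension-reduced data, shape (200, 2)
print(lam[:k])                      # variances along the principal directions
```

For large sparse matrices, scipy.sparse.linalg.eigsh(C, k=k) returns only the top-k eigenpairs, analogous to Matlab's eigs.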
Experiments with handwritten digits
Experiments with 20 newsgroups
Bag-of-words, or term-document matrix
Singular value decomposition
For a matrix $X$, decompose it as $X = U S V^\top$
A singular vector pair $(u, v)$ is related by $Xv = su$ and $X^\top u = sv$
Singular value decomposition is related to eigen-decomposition:
Let $C = X^\top X$. Then $Xv = su \implies Cv = X^\top X v = s\, X^\top u = s^2 v$, so $Cv = \lambda v$ with $\lambda = s^2$
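A numerical check of this correspondence (random matrix assumed): the right singular vectors of $X$ are eigenvectors of $C = X^\top X$, with eigenvalues equal to the squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(s) @ Vt
C = X.T @ X

v = Vt[0]                                         # top right singular vector
print(np.allclose(C @ v, s[0]**2 * v))            # C v = s^2 v  ->  True
print(np.allclose(X @ v, s[0] * U[:, 0]))         # X v = s u    ->  True
```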