Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
1 Machine Learning, B. Unsupervised Learning, B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling. Information Systems and Machine Learning Lab (ISMLL), Institute for Computer Science, University of Hildesheim, Germany. 1 / 31
2 Outline
2. Non-linear Dimensionality Reduction
3. Supervised Dimensionality Reduction
4 The Dimensionality Reduction Problem
Given
- a set X called data space, e.g., X := R^m,
- a set 𝒳 ⊆ X called data,
- a function D : (R^K)^𝒳 → R₀⁺ called distortion, where D(P) measures how bad a low dimensional representation P : 𝒳 → R^K for a data set 𝒳 ⊆ X is, and
- a number K ∈ N of latent dimensions,
find a low dimensional representation P : 𝒳 → R^K with K dimensions with minimal distortion D(P).
5 Distortions for Dimensionality Reduction (1/2)
Let d_X be a distance on X and d_Z be a distance on the latent space R^K, usually just the Euclidean distance

  d_Z(v, w) := ||v − w||_2 = ( Σ_{i=1}^K (v_i − w_i)² )^{1/2}

Multidimensional scaling aims to find latent representations P that reproduce the distance measure d_X as well as possible:

  D(P) := 2 / (|𝒳| (|𝒳| − 1)) Σ_{x, x' ∈ 𝒳, x ≠ x'} ( d_X(x, x') − d_Z(P(x), P(x')) )²
        = 2 / (n (n − 1)) Σ_{i=1}^n Σ_{j=1}^{i−1} ( d_X(x_i, x_j) − ||z_i − z_j|| )²,   z_i := P(x_i)
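The MDS distortion above can be computed directly from pairwise distances. A minimal NumPy sketch (the function name `mds_distortion` is ours, not from the slides):

```python
import numpy as np

def mds_distortion(X, Z):
    """MDS distortion of a low-dimensional representation Z of data X:
    average squared difference between pairwise Euclidean distances
    in data space and in latent space."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i):
            d_x = np.linalg.norm(X[i] - X[j])
            d_z = np.linalg.norm(Z[i] - Z[j])
            total += (d_x - d_z) ** 2
    return 2.0 / (n * (n - 1)) * total

# Points lying in a 2-d plane embedded in R^3 incur zero distortion
X = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.]])
Z = X[:, :2]                  # drop the constant third coordinate
print(mds_distortion(X, Z))  # distances are preserved exactly -> 0.0
```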
6 Distortions for Dimensionality Reduction (2/2)
Feature reconstruction methods aim to find latent representations P and reconstruction maps r : R^K → X from a given class of maps that reconstruct the features as well as possible:

  D(P, r) := 1/|𝒳| Σ_{x ∈ 𝒳} d_X(x, r(P(x)))
           = 1/n Σ_{i=1}^n d_X(x_i, r(z_i)),   z_i := P(x_i)
7 PCA: Intuition
We want to find an orthogonal basis that represents directions of maximum variance:
- the red line indicates the direction of maximal variance within the data,
- the blue line represents the direction of maximal variance given that it has to be orthogonal to the red line.
10 PCA: Task
We assume that the data has mean zero: Σ_{i=1}^n x_i = 0.
Then PCA can be accomplished by finding the top K eigenvalues of the covariance matrix X^T X of the data!
The mean has to be zero to account for different units in the data (Degrees C vs Degrees F).
PCA can also be accomplished through a singular value decomposition of the data matrix X, as we will see.
11 Singular Value Decomposition (SVD)
Theorem (Existence of SVD). For every A ∈ R^{n×m} there exist matrices

  U ∈ R^{n×k}, V ∈ R^{m×k}, Σ := diag(σ_1, ..., σ_k) ∈ R^{k×k},  k := min{n, m}

with

  σ_1 ≥ σ_2 ≥ ... ≥ σ_r > σ_{r+1} = ... = σ_k = 0,   r := rank(A),
  U, V orthonormal, i.e., U^T U = I, V^T V = I,

such that A = U Σ V^T. The σ_i are called singular values of A.
Note: I := diag(1, ..., 1) ∈ R^{k×k} denotes the unit matrix.
12 Singular Value Decomposition (SVD; 2/2)
It holds:
a) σ_i² are eigenvalues and V_i eigenvectors of A^T A:
   (A^T A) V_i = σ_i² V_i,   i = 1, ..., k,   V = (V_1, ..., V_k)
b) σ_i² are eigenvalues and U_i eigenvectors of A A^T:
   (A A^T) U_i = σ_i² U_i,   i = 1, ..., k,   U = (U_1, ..., U_k)

Proof:
a) (A^T A) V_i = V Σ^T U^T U Σ V^T V_i = V Σ² e_i = σ_i² V_i
b) (A A^T) U_i = U Σ V^T V Σ^T U^T U_i = U Σ² e_i = σ_i² U_i
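Property a) can be checked numerically: the squared singular values of A coincide with the eigenvalues of A^T A. A small NumPy sketch (not from the slides):

```python
import numpy as np

# Verify: squared singular values of A = eigenvalues of A^T A
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigenvalues of A^T A, descending

assert np.allclose(sigma ** 2, eigvals)
print(sigma ** 2)
```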
14 Truncated SVD
Let A ∈ R^{n×m} and U Σ V^T = A its SVD. Then for k ≤ min{n, m} the decomposition

  A ≈ U' Σ' V'^T

with

  U' := (U_{·,1}, ..., U_{·,k}),  V' := (V_{·,1}, ..., V_{·,k}),  Σ' := diag(σ_1, ..., σ_k)

is called the truncated SVD with rank k.
15 Low Rank Approximation
Let A ∈ R^{n×m}. For k ≤ min{n, m}, any pair of matrices U ∈ R^{n×k}, V ∈ R^{m×k} is called a low rank approximation of A with rank k. The matrix U V^T is called the reconstruction of A by U, V, and the quantity

  ||A − U V^T||_F = ( Σ_{i=1}^n Σ_{j=1}^m (A_{i,j} − U_i^T V_j)² )^{1/2}

the L2 reconstruction error. Note: ||A||_F is called the Frobenius norm.
17 Optimal Low Rank Approximation is Truncated SVD
Theorem (Low Rank Approximation; Eckart-Young theorem). Let A ∈ R^{n×m}. For k ≤ min{n, m}, the optimal low rank approximation of rank k (i.e., the one with smallest reconstruction error) is the truncated SVD:

  (U, V) := arg min_{U ∈ R^{n×k}, V ∈ R^{m×k}} ||A − U V^T||²

Note: as U, V do not have to be orthonormal, one can take U := U' Σ', V := V' for the truncated SVD A ≈ U' Σ' V'^T.
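The theorem says the best rank-k reconstruction keeps the top k singular triplets. A NumPy sketch (the helper name `truncated_svd_reconstruction` is ours):

```python
import numpy as np

def truncated_svd_reconstruction(A, k):
    """Best rank-k approximation of A in Frobenius norm (Eckart-Young):
    keep only the top k singular values and vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
for k in range(1, 5):
    err = np.linalg.norm(A - truncated_svd_reconstruction(A, k))
    print(k, err)   # error shrinks with k; at k = rank(A) it is ~0
```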
18 Principal Components Analysis (PCA)
Let 𝒳 := {x_1, ..., x_n} ⊆ R^m be a data set and K ∈ N the number of latent dimensions (K ≤ m). PCA finds K principal components v_1, ..., v_K ∈ R^m and latent weights z_i ∈ R^K for each data point i ∈ {1, ..., n}, such that the linear combination of the principal components Σ_{k=1}^K z_{i,k} v_k reconstructs the original features x_i as well as possible:

  arg min_{v_1,...,v_K, z_1,...,z_n} Σ_{i=1}^n ||x_i − Σ_{k=1}^K z_{i,k} v_k||²
    = Σ_{i=1}^n ||x_i − V z_i||²,   V := (v_1, ..., v_K)
    = ||X − Z V^T||²_F,   X := (x_1, ..., x_n)^T,  Z := (z_1, ..., z_n)^T

Thus PCA is just the SVD of the data matrix X.
21 PCA Algorithm
1: procedure dimred-pca(𝒟 := {x_1, ..., x_N} ⊆ R^M, K ∈ N)
2:   X := (x_1, x_2, ..., x_N)^T
3:   (U, Σ, V) := svd(X)
4:   Z := U_{·,1:K} Σ_{1:K,1:K}
5:   return 𝒟^dimred := {Z_{1,·}, ..., Z_{N,·}}
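The procedure above translates almost line for line into NumPy. A sketch (the centering step is added because the slides assume zero-mean data):

```python
import numpy as np

def dimred_pca(X, K):
    """PCA via SVD of the (centered) data matrix:
    latent weights Z = U[:, :K] @ diag(sigma[:K])."""
    Xc = X - X.mean(axis=0)            # slides assume zero-mean data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :K] * s[:K]            # Z, shape (N, K)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
Z = dimred_pca(X, 2)
print(Z.shape)   # (100, 2)
```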
22 Principal Components Analysis (Example 1)
[Figure: simulated data in three classes, near ... (caption truncated). HTFF05, p. 530]
23 Principal Components Analysis (Example 1)
[Figure: the best rank-two linear approximation to the half-sphere data; the right panel shows the projected points with coordinates given by U_2 D_2, the first two principal components of the data. Axes: first and second principal component. HTFF05, p. 536]
24 Principal Components Analysis (Example 2)
[Figure: a sample of 130 handwritten 3s shows a variety of writing styles. HTFF05, p. 537]
25 Principal Components Analysis (Example 2)
[Figure, left panel: the first two principal components of the handwritten threes; the circled points are the closest projected images to the vertices of ... (caption truncated). Axes: first and second principal component. HTFF05, p. 538]
26 2. Non-linear Dimensionality Reduction
Outline
2. Non-linear Dimensionality Reduction
3. Supervised Dimensionality Reduction
27 2. Non-linear Dimensionality Reduction
Linear Dimensionality Reduction
Dimensionality reduction accomplishes two tasks:

1. compute lower dimensional representations for given data points x_i;
   for PCA: u_i = Σ^{-1} V^T x_i,   U := (u_1, ..., u_n)^T

2. compute lower dimensional representations for new data points x (often called "fold in");
   for PCA: u := arg min_u ||x − V Σ u||² = Σ^{-1} V^T x

PCA is called a linear dimensionality reduction technique because the latent representations u depend linearly on the observed representations x.
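The fold-in formula u = Σ^{-1} V^T x can be sketched in NumPy; for a training point it recovers exactly the row of U stored during the SVD:

```python
import numpy as np

# Fold-in for PCA: with X = U Sigma V^T, a point x is projected to the
# latent space via u = Sigma^{-1} V^T x, the minimizer of ||x - V Sigma u||^2.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

x_new = rng.normal(size=4)
u = (Vt @ x_new) / s               # Sigma^{-1} V^T x for a new point

# For a training point, fold-in recovers its stored latent representation:
assert np.allclose((Vt @ X[0]) / s, U[0])
```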
30 2. Non-linear Dimensionality Reduction
Kernel Trick
Represent (conceptually) non-linearity by linearity in a higher dimensional embedding φ : R^m → R^{m̃}, but compute in the lower dimensionality: for methods that depend on x only through a scalar product,

  x^T θ ⇝ φ(x)^T φ(θ) = k(x, θ),   x, θ ∈ R^m,

if k can be computed without explicitly computing φ.
31 2. Non-linear Dimensionality Reduction
Kernel Trick / Example
Example: φ : R → R^{1001},

  x ↦ ( (1000 choose i)^{1/2} x^i )_{i=0,...,1000} = (1, (1000 choose 1)^{1/2} x, (1000 choose 2)^{1/2} x², ..., x^{1000})

  φ(x)^T φ(θ) = Σ_{i=0}^{1000} (1000 choose i) x^i θ^i = (1 + xθ)^{1000} =: k(x, θ)

Naive computation: 2002 binomial coefficients, 3003 multiplications, 1000 additions.
Kernel computation: 1 multiplication, 1 addition, 1 exponentiation.
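The identity φ(x)^T φ(θ) = (1 + xθ)^d can be checked numerically; the sketch below uses d = 10 instead of 1000 to keep the numbers small:

```python
import math

# Check phi(x)^T phi(theta) == (1 + x*theta)^d for the polynomial embedding
d = 10
x, theta = 0.3, -0.7

phi = lambda v: [math.sqrt(math.comb(d, i)) * v**i for i in range(d + 1)]
naive = sum(a * b for a, b in zip(phi(x), phi(theta)))   # explicit embedding
kernel = (1 + x * theta) ** d                            # kernel shortcut

assert abs(naive - kernel) < 1e-12
print(naive, kernel)
```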
32 2. Non-linear Dimensionality Reduction
Kernel PCA

  φ : R^m → R^{m̃},   X̃ := (φ(x_1), φ(x_2), ..., φ(x_n))^T,   X̃ = Ũ Σ̃ Ṽ^T

We can compute the columns of Ũ as eigenvectors of X̃ X̃^T ∈ R^{n×n} without having to compute Ṽ ∈ R^{m̃×k} (which is large!):

  X̃ X̃^T Ũ_i = σ̃_i² Ũ_i
33 2. Non-linear Dimensionality Reduction
Kernel PCA / Removing the Mean
Issue 1: the x̃_i := φ(x_i) may not have zero mean and thus distort PCA. Center them:

  x̃'_i := x̃_i − (1/n) Σ_{j=1}^n x̃_j,   X̃' := (x̃'_1, ..., x̃'_n)^T = (I − (1/n) 1I) X̃

For the centered kernel matrix this yields

  K' := X̃' X̃'^T = (I − (1/n) 1I) X̃ X̃^T (I − (1/n) 1I) = H K H,   H := I − (1/n) 1I  (centering matrix)

Note: 1I := (1)_{i=1,...,n, j=1,...,n} denotes the matrix of ones and I := (δ(i = j))_{i=1,...,n, j=1,...,n} the unit matrix.

Thus, the kernel matrix K' with means removed can be computed from the kernel matrix K := X̃ X̃^T without having to access coordinates.
36 2. Non-linear Dimensionality Reduction
Kernel PCA / Fold In
Issue 2: how to compute projections u of new points x (as Ṽ is not computed)?
With Ṽ = X̃^T Ũ Σ̃^{-1}:

  u := arg min_u ||φ(x) − Ṽ Σ̃ u||² = Σ̃^{-1} Ṽ^T φ(x)
     = Σ̃^{-1} Σ̃^{-1} Ũ^T X̃ φ(x)
     = Σ̃^{-2} Ũ^T (k(x_i, x))_{i=1,...,n}

u can be computed with access to kernel values only (and to Ũ, Σ̃).
37 2. Non-linear Dimensionality Reduction
Kernel PCA / Summary
Given: data set 𝒳 := {x_1, ..., x_n} ⊆ R^m, kernel function k : R^m × R^m → R.

Task 1: learn latent representations U of the data set 𝒳:
  K := (k(x_i, x_j))_{i=1,...,n, j=1,...,n}        (0)
  K' := H K H,   H := I − (1/n) 1I                 (1)
  (U, Σ²) := eigendecomposition(K')                (2)

Task 2: learn the latent representation u of a new point x:
  u := Σ^{-2} U^T (k(x_i, x))_{i=1,...,n}          (3)
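Steps (0)-(3) can be sketched in a few lines of NumPy. The function names and the RBF kernel choice below are ours, not from the slides:

```python
import numpy as np

def kernel_pca(X, k, K_latent):
    """Kernel PCA following steps (0)-(3): build the kernel matrix,
    center it, eigendecompose, and return a fold-in function."""
    n = len(X)
    Kmat = np.array([[k(xi, xj) for xj in X] for xi in X])       # (0)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ Kmat @ H                                            # (1)
    eigvals, U = np.linalg.eigh(Kc)                              # ascending order
    eigvals, U = eigvals[::-1][:K_latent], U[:, ::-1][:, :K_latent]  # (2)

    def fold_in(x):                                              # (3)
        kx = np.array([k(xi, x) for xi in X])
        return (U.T @ kx) / eigvals        # Sigma^{-2} U^T (k(x_i, x))_i

    return fold_in

rbf = lambda a, b, gamma=0.5: np.exp(-gamma * np.sum((a - b) ** 2))
X = np.random.default_rng(4).normal(size=(30, 2))
fold_in = kernel_pca(X, rbf, K_latent=2)
print(fold_in(X[0]).shape)   # (2,)
```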
38 2. Non-linear Dimensionality Reduction
Kernel PCA: Example 1
[Figure: the first eight kernel principal components, one panel per component, each titled with its eigenvalue (values not recovered). Mur12, p. 493]
39 2. Non-linear Dimensionality Reduction
Kernel PCA: Example 2
[Figure: projection of the data by (linear) PCA. Mur12, p. 495]
40 2. Non-linear Dimensionality Reduction
Kernel PCA: Example 2
[Figure: projection of the data by kernel PCA. Mur12, p. 495]
41 3. Supervised Dimensionality Reduction
Outline
2. Non-linear Dimensionality Reduction
3. Supervised Dimensionality Reduction
42 3. Supervised Dimensionality Reduction
Dimensionality Reduction as Pre-Processing
Given a prediction task and a data set D^train := {(x_1, y_1), ..., (x_n, y_n)} ⊆ R^m × Y:

1. compute latent features z_i ∈ R^K for the objects of the data set by means of dimensionality reduction of the predictors x_i, e.g., using PCA on {x_1, ..., x_n} ⊆ R^m
2. learn a prediction model ŷ : R^K → Y on the latent features, based on D'^train := {(z_1, y_1), ..., (z_n, y_n)}
3. treat the number K of latent dimensions as a hyperparameter, e.g., find it using grid search.
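The three steps above can be sketched with PCA and linear least squares as the two component methods; the data, split, and model choices below are illustrative assumptions:

```python
import numpy as np

# Dimensionality reduction as pre-processing: PCA on the predictors,
# linear least squares on the latent features, K chosen on a validation split.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = X[:, :2] @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

X_tr, X_va, y_tr, y_va = X[:150], X[150:], y[:150], y[150:]
mu = X_tr.mean(axis=0)
_, _, Vt = np.linalg.svd(X_tr - mu, full_matrices=False)

best = None
for K in range(1, 11):                        # grid search over K
    Z_tr = (X_tr - mu) @ Vt[:K].T             # latent features z_i
    Z_va = (X_va - mu) @ Vt[:K].T
    w, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)
    err = np.mean((Z_va @ w - y_va) ** 2)
    if best is None or err < best[1]:
        best = (K, err)
print(best)   # selected K and its validation error
```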
43 3. Supervised Dimensionality Reduction
Dimensionality Reduction as Pre-Processing
Advantages:
- simple procedure
- generic procedure: works with any dimensionality reduction method and prediction method as component methods
- usually fast

Disadvantages:
- the dimensionality reduction is unsupervised, i.e., not informed about the target that should be predicted later on:
  - it leads to the very same latent features regardless of the prediction task
  - likely not the best task-specific features are extracted
45 3. Supervised Dimensionality Reduction
Supervised PCA

  p(z) := N(z; 0, I)
  p(x | z; µ_x, σ_x², W_x) := N(x; µ_x + W_x z, σ_x² I)
  p(y | z; µ_y, σ_y², W_y) := N(y; µ_y + W_y z, σ_y² I)

Like two PCAs, coupled by shared latent features z:
- one for the predictors x
- one for the targets y
The latent features act as an information bottleneck.
Also known as Latent Factor Regression or Bayesian Factor Regression.
46 3. Supervised Dimensionality Reduction
Supervised PCA: Discriminative Likelihood
A simple likelihood would put the same weight on reconstructing the predictors and reconstructing the targets. A weight α ∈ R₀⁺ for the reconstruction error of the predictors should be introduced (discriminative likelihood):

  L_α(Θ; x, y, z) := Π_{i=1}^n p(y_i | z_i; Θ) p(x_i | z_i; Θ)^α p(z_i; Θ)

α can be treated as a hyperparameter and found by grid search.
47 3. Supervised Dimensionality Reduction
Supervised PCA: EM
The M-steps for µ_x, σ_x², W_x and µ_y, σ_y², W_y are exactly as before. The coupled E-step is:

  z_i = ( (1/σ_y²) W_y^T W_y + α (1/σ_x²) W_x^T W_x )^{-1} ( (1/σ_y²) W_y^T (y_i − µ_y) + α (1/σ_x²) W_x^T (x_i − µ_x) )
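The coupled E-step is a single linear solve. A NumPy sketch of the formula above; all dimensions and parameter values below are made up for illustration:

```python
import numpy as np

# Coupled E-step of supervised PCA: solve A z_i = b for z_i
rng = np.random.default_rng(6)
m, p, K, alpha = 5, 1, 2, 0.5
W_x, W_y = rng.normal(size=(m, K)), rng.normal(size=(p, K))
mu_x, mu_y = rng.normal(size=m), rng.normal(size=p)
var_x, var_y = 1.0, 0.5
x_i, y_i = rng.normal(size=m), rng.normal(size=p)

A = W_y.T @ W_y / var_y + alpha * W_x.T @ W_x / var_x
b = W_y.T @ (y_i - mu_y) / var_y + alpha * W_x.T @ (x_i - mu_x) / var_x
z_i = np.linalg.solve(A, b)
print(z_i.shape)   # (2,)
```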
48 3. Supervised Dimensionality Reduction
Conclusion (1/3)
Dimensionality reduction aims to find a lower dimensional representation of data that preserves the information as much as possible. Preserving information means
- to preserve pairwise distances between objects (multidimensional scaling), or
- to be able to reconstruct the original object features (feature reconstruction).

The truncated Singular Value Decomposition (SVD) provides the best low rank factorization of a matrix into two factor matrices. The SVD is usually computed by an algebraic factorization method (such as QR decomposition).
49 3. Supervised Dimensionality Reduction
Conclusion (2/3)
Principal components analysis (PCA) finds latent object and variable features that provide the best linear reconstruction (in L2 error). PCA is a truncated SVD of the data matrix.
Probabilistic PCA (PPCA) provides a probabilistic interpretation of PCA; PPCA adds an L2 regularization of the object features and is learned by the EM algorithm.
Adding L2 regularization for the linear reconstruction/variable features on top leads to Bayesian PCA. Generalizing to variable-specific variances leads to Factor Analysis. For both Bayesian PCA and Factor Analysis, EM can be adapted easily.
50 3. Supervised Dimensionality Reduction
Conclusion (3/3)
To capture a nonlinear relationship between latent features and observed features, PCA can be kernelized (Kernel PCA). Learning a Kernel PCA is done by an eigendecomposition of the kernel matrix.
Kernel PCA is often found to lead to unnatural visualizations. But Kernel PCA sometimes provides better classification performance for simple classifiers on latent features (such as 1-Nearest Neighbor).
51 Readings
- Principal Components Analysis (PCA): [HTFF05], ch. ..., [Bis06], ch. 12.1, [Mur12], ch. ...
- Probabilistic PCA: [Bis06], ch. 12.2, [Mur12], ch. ...
- Factor Analysis: [HTFF05], ch. ..., [Bis06], ch. ...
- Kernel PCA: [HTFF05], ch. ..., [Bis06], ch. 12.3, [Mur12], ch. ...
52 Further Readings
- (Non-negative) Matrix Factorization: [HTFF05], ch. ...
- Independent Component Analysis, Exploratory Projection Pursuit: [HTFF05], ch. ..., [Bis06], ch. ..., [Mur12], ch. ...
- Nonlinear Dimensionality Reduction: [HTFF05], ch. 14.9, [Bis06], ch. ...
53 References
[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.
[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction.
[Mur12] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
More informationLinear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 The Singular Value Decomposition (SVD) continued Linear Algebra Methods for Data Mining, Spring 2007, University
More informationEE731 Lecture Notes: Matrix Computations for Signal Processing
EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University October 17, 005 Lecture 3 3 he Singular Value Decomposition
More informationMachine Learning (Spring 2012) Principal Component Analysis
1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in
More informationLinear and Non-Linear Dimensionality Reduction
Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections
More informationCollaborative Filtering: A Machine Learning Perspective
Collaborative Filtering: A Machine Learning Perspective Chapter 6: Dimensionality Reduction Benjamin Marlin Presenter: Chaitanya Desai Collaborative Filtering: A Machine Learning Perspective p.1/18 Topics
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA
More informationCSC411: Final Review. James Lucas & David Madras. December 3, 2018
CSC411: Final Review James Lucas & David Madras December 3, 2018 Agenda 1. A brief overview 2. Some sample questions Basic ML Terminology The final exam will be on the entire course; however, it will be
More informationDimensionality Reduction
Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 8 Continuous Latent Variable
More informationSingular Value Decomposition
Singular Value Decomposition CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Singular Value Decomposition 1 / 35 Understanding
More informationThe Singular Value Decomposition
The Singular Value Decomposition Philippe B. Laval KSU Fall 2015 Philippe B. Laval (KSU) SVD Fall 2015 1 / 13 Review of Key Concepts We review some key definitions and results about matrices that will
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationTutorial on Gaussian Processes and the Gaussian Process Latent Variable Model
Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model (& discussion on the GPLVM tech. report by Prof. N. Lawrence, 06) Andreas Damianou Department of Neuro- and Computer Science,
More informationDecember 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis
.. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make
More informationLinear Regression and Discrimination
Linear Regression and Discrimination Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationCS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)
CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions
More informationbe a Householder matrix. Then prove the followings H = I 2 uut Hu = (I 2 uu u T u )u = u 2 uut u
MATH 434/534 Theoretical Assignment 7 Solution Chapter 7 (71) Let H = I 2uuT Hu = u (ii) Hv = v if = 0 be a Householder matrix Then prove the followings H = I 2 uut Hu = (I 2 uu )u = u 2 uut u = u 2u =
More informationPCA and LDA. Man-Wai MAK
PCA and LDA Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/ mwmak References: S.J.D. Prince,Computer
More informationLecture 7: Con3nuous Latent Variable Models
CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/
More informationPrincipal Component Analysis and Linear Discriminant Analysis
Principal Component Analysis and Linear Discriminant Analysis Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/29
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More information1 Singular Value Decomposition and Principal Component
Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)
More informationGI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil
GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection
More informationNonlinear Dimensionality Reduction
Nonlinear Dimensionality Reduction Piyush Rai CS5350/6350: Machine Learning October 25, 2011 Recap: Linear Dimensionality Reduction Linear Dimensionality Reduction: Based on a linear projection of the
More informationPrincipal Component Analysis
B: Chapter 1 HTF: Chapter 1.5 Principal Component Analysis Barnabás Póczos University of Alberta Nov, 009 Contents Motivation PCA algorithms Applications Face recognition Facial expression recognition
More informationLecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University
Lecture 4: Principal Component Analysis Aykut Erdem May 016 Hacettepe University This week Motivation PCA algorithms Applications PCA shortcomings Autoencoders Kernel PCA PCA Applications Data Visualization
More informationPCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationPrincipal Component Analysis and Singular Value Decomposition. Volker Tresp, Clemens Otte Summer 2014
Principal Component Analysis and Singular Value Decomposition Volker Tresp, Clemens Otte Summer 2014 1 Motivation So far we always argued for a high-dimensional feature space Still, in some cases it makes
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationThe Mathematics of Facial Recognition
William Dean Gowin Graduate Student Appalachian State University July 26, 2007 Outline EigenFaces Deconstruct a known face into an N-dimensional facespace where N is the number of faces in our data set.
More informationMachine Learning 2nd Edition
INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010
More informationLECTURE NOTE #11 PROF. ALAN YUILLE
LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform
More informationChemometrics: Classification of spectra
Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture
More informationKernel methods, kernel SVM and ridge regression
Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;
More informationLeverage Sparse Information in Predictive Modeling
Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from
More information