Introduction to Machine Learning (2013-14): PCA and Spectral Clustering. Slides: Eran Halperin


Singular Value Decomposition (SVD) The singular value decomposition (SVD) of any matrix $X \in \mathbb{R}^{n \times m}$ is $X = U \Sigma V^t$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is a diagonal matrix whose diagonal entries are the (non-negative) singular values.
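As a quick numeric illustration (a minimal sketch; the matrix size and the names below are arbitrary), the factors can be computed with numpy and multiplied back together:

    import numpy as np

    # Sketch: compute the thin SVD X = U diag(s) V^t and check its properties.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 3))                     # any real matrix

    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # thin SVD
    print(np.allclose(X, U @ np.diag(s) @ Vt))          # True: the factors recover X
    print(np.allclose(U.T @ U, np.eye(3)))              # columns of U are orthonormal
    print(np.allclose(Vt @ Vt.T, np.eye(3)))            # rows of V^t are orthonormal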

Existence of SVD Spectral decomposition (assume for simplicity $n = m$): $X^t X$ is symmetric and positive semi-definite, so $X^t X = V \Lambda V^t$ with $V$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$, $\lambda_i \ge 0$.

Existence of SVD Define $\sigma_i = \sqrt{\lambda_i}$ and define $u_i = X v_i / \sigma_i$. We will assume all eigenvalues are $> 0$. Then the $u_i$ are orthonormal and $X = U \Sigma V^t$ with $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_m)$.

Existence of SVD We will assume all eigenvalues are $> 0$. Note that the transposed matrix has the same eigenvalues: $X X^t$ and $X^t X$ share their non-zero eigenvalues, so the same construction applies to $X^t$.

SVD and linear regression Recall, in linear regression we solve the normal equations $X^t X w = X^t y$. If $X^t X$ is singular there are many solutions. How can we find the solution with the smallest norm?

SVD and linear regression Assume $X = U \Sigma V^t$ and write $w = V \alpha$, so that $\|X w - y\|^2 = \|\Sigma \alpha - U^t y\|^2$. The minimum-norm solution is unique, since if $\sigma_i = 0$ then it is always better to set $\alpha_i = 0$: it does not change the fit but it reduces $\|w\|$.

SVD and linear regression Pseudoinverse of $X$: $X^+ = V \Sigma^+ U^t$, where $\Sigma^+$ inverts the non-zero singular values and keeps the zeros. The minimum-norm solution is $w = X^+ y$.

SVD and linear regression Pseudoinverse of $X$. Exercise: Show that the Ridge solution converges to the pseudoinverse solution as the regularization vanishes: $(X^t X + \lambda I)^{-1} X^t y \to X^+ y$ as $\lambda \to 0^+$.
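A minimal numpy sketch of this convergence, assuming a standard design-matrix setup (rows are samples) with a duplicated column so that $X^t X$ is singular; the names and toy data are illustrative.

    import numpy as np

    # Sketch: minimum-norm least squares via the pseudoinverse, and the ridge
    # solution approaching it as the regularization lambda -> 0.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 5))
    X[:, 4] = X[:, 3]                  # duplicated column: X^t X is singular
    y = rng.standard_normal(20)

    w_pinv = np.linalg.pinv(X) @ y     # minimum-norm solution X^+ y

    for lam in [1.0, 1e-2, 1e-4, 1e-6]:
        w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
        print(lam, np.linalg.norm(w_ridge - w_pinv))   # gap shrinks with lambda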

Unsupervised Learning Supervised Learning: $(x_1, y_1), \dots, (x_n, y_n), (x_{n+1}, ?), \dots, (x_m, ?)$: fill in the missing labels. Unsupervised Learning: $(x_1, ?), \dots, (x_m, ?)$: fill in the missing labels.

High Dimensional Data Documents, face images, genetic data.

Dimensionality Reduction Avoid the risk of over-fitting. Data compression. Visualization. Outlier detection.

Linear Dimensionality Reduction Input: $(x_1, \dots, x_n)$, $x_i \in \mathbb{R}^m$, arranged as the columns of $X = (x_1 \cdots x_n)$. Output: $(z_1, \dots, z_n)$, $z_i = U^t x_i \in \mathbb{R}^d$, where $U = (u_1 \cdots u_d)$ has orthonormal columns ($u_i \cdot u_j = 0$ for $i \ne j$). How should we choose $U$?

Reconstruction Error We can think of dimensionality reduction as compression. Encode step: $z_i = U^t x_i \in \mathbb{R}^d$. Decode step: $x'_i = U z_i = U U^t x_i \in \mathbb{R}^m$. Minimize the loss of information: $\min_U \sum_{i=1}^n \|x_i - U U^t x_i\|^2$.
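A minimal numpy sketch of the encode/decode view, assuming the points are stored as the columns of $X$ and using a random orthonormal $U$ purely for illustration; PCA is the choice of $U$ that minimizes this error.

    import numpy as np

    # Sketch: encode z_i = U^t x_i, decode x'_i = U z_i = U U^t x_i, and measure
    # the total reconstruction error for an arbitrary orthonormal U.
    rng = np.random.default_rng(0)
    m, n, d = 10, 200, 3
    X = rng.standard_normal((m, n))                   # columns are the points x_i

    U, _ = np.linalg.qr(rng.standard_normal((m, d)))  # random orthonormal columns
    Z = U.T @ X                                       # encode: d x n codes
    X_rec = U @ Z                                     # decode: U U^t x_i

    print(np.sum((X - X_rec) ** 2))                   # total reconstruction error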

Capturing Variance Assume the data points are centered at zero. That is: $\frac{1}{n}\sum_{i=1}^n x_i = 0$. The estimate of the variance of the distance from the origin: $\frac{1}{n}\sum_{i=1}^n \|x_i\|^2$. The estimate of the variance of the projected distance: $\frac{1}{n}\sum_{i=1}^n \|U^t x_i\|^2$.

Capturing Variance Recall: $x'_i = U U^t x_i$. Note: $\|x_i\|^2 = \|U^t x_i\|^2 + \|x_i - U U^t x_i\|^2$ (since the columns of $U$ are orthonormal). Minimize reconstruction error = maximize projected variance.
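Summing that identity over the points makes the equivalence explicit (same notation as the slides above):

    \sum_{i=1}^{n} \|x_i - U U^t x_i\|^2
      = \sum_{i=1}^{n} \left( \|x_i\|^2 - \|U^t x_i\|^2 \right)
      = \sum_{i=1}^{n} \|x_i\|^2 \; - \; \sum_{i=1}^{n} \|U^t x_i\|^2 .

The first sum is fixed by the data, so minimizing the reconstruction error over $U$ is the same as maximizing the projected variance $\sum_i \|U^t x_i\|^2$.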

Reconstruction Error

Principal Component Analysis Input: $X = (x_1 \cdots x_n)$. Output: $U = (u_1 \cdots u_d)$ with $u_i \cdot u_j = 0$ for $i \ne j$, and $(z_1, \dots, z_n)$ with $z_i = U^t x_i \in \mathbb{R}^d$. Objective: $\max_U \sum_{i=1}^n \|U^t x_i\|^2$ (equivalently, minimize the reconstruction error).

PCA one dimension The optimal $u_1$ is the eigenvector corresponding to the largest eigenvalue of $X X^t$.

PCA d dimensions Let $u_1, \dots, u_m$ be the eigenvectors of $X X^t$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$.

PCA d dimensions When $U = (u_1 \cdots u_d)$ we get that the projected variance is $\sum_{i=1}^d \lambda_i$.

PCA d dimensions Denote the projected variance of a candidate orthonormal $U$ by $\mathrm{tr}(U^t X X^t U)$.

PCA d dimensions This is maximized when $U = (u_1 \cdots u_d)$. The maximal value of the objective is $\sum_{i=1}^d \lambda_i$, i.e., the top $d$ eigenvectors are the maximal solution.

Principal Component Analysis (PCA) The entries of $u_i$ are called the loadings of the $i$-th principal component. The contribution of the $i$-th principal component to the explained variance is $\lambda_i / \sum_j \lambda_j$. The number of PCs can be determined by the eigenvalue distribution. The PC scores are obtained by $z_i = U^t x_i$.

Computing PCA Computing the eigenvectors requires forming $X X^t$. Time: $O(n m^2)$ + the time for the eigenvector computation. More efficient: compute the singular value decomposition (SVD) of $X$ with only the top $d$ singular values.
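A minimal numpy sketch of the two routes on toy data (sizes and names are illustrative): forming $X X^t$ and taking its top eigenvectors, versus taking the top left singular vectors of $X$ directly.

    import numpy as np

    # Sketch: top-d principal directions of centered data X (m features x n
    # samples, columns are the points) via eigendecomposition and via SVD.
    rng = np.random.default_rng(0)
    m, n, d = 6, 500, 2
    X = rng.standard_normal((m, m)) @ rng.standard_normal((m, n))
    X = X - X.mean(axis=1, keepdims=True)          # center the data

    # Route 1: eigenvectors of X X^t (O(n m^2) just to form the matrix).
    evals, evecs = np.linalg.eigh(X @ X.T)         # ascending eigenvalues
    U_eig = evecs[:, ::-1][:, :d]                  # top-d eigenvectors

    # Route 2: top-d left singular vectors of X, without forming X X^t.
    U_svd = np.linalg.svd(X, full_matrices=False)[0][:, :d]

    # Both span the same subspace (columns may differ by sign).
    print(np.allclose(U_eig @ U_eig.T, U_svd @ U_svd.T))
    Z = U_svd.T @ X                                # PC scores z_i = U^t x_i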

Example: Eigenfaces Goal: Recognition and/or detection of a face.

Example: Eigenfaces Each picture is a matrix of pixels. We can think of it as a vector of real numbers. Compute PCA, represent each picture by a small number of directions. Given a new picture: Project the picture on the space defined by the PCs. Compute the distance to the face points. If the distance is small enough this is a face. Nearest neighbor classification (face detection).
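A minimal sketch of this pipeline on synthetic data; the random vectors standing in for flattened face images, the number of eigenfaces, and the detection threshold are all illustrative assumptions.

    import numpy as np

    # Sketch of the eigenfaces pipeline with synthetic "images".
    rng = np.random.default_rng(0)
    m, n, d = 64 * 64, 100, 20                 # pixels per image, images, eigenfaces

    faces = rng.standard_normal((m, n))        # columns: training face vectors
    mean_face = faces.mean(axis=1, keepdims=True)
    U = np.linalg.svd(faces - mean_face, full_matrices=False)[0]
    eigenfaces = U[:, :d]                      # top-d principal directions
    train_z = eigenfaces.T @ (faces - mean_face)      # scores of the training faces

    new_img = rng.standard_normal((m, 1))             # a new picture
    z = eigenfaces.T @ (new_img - mean_face)          # project on the PC space

    dist_to_space = np.linalg.norm((new_img - mean_face) - eigenfaces @ z)
    print(dist_to_space < 100.0)               # detection: arbitrary threshold
    print(np.argmin(np.linalg.norm(train_z - z, axis=0)))   # nearest training face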

Using more eigenfaces allows for a better reconstruction. Number of eigenfaces: 10, 30, 50, 70, ..., 310.

Example: Latent Semantic Analysis Goal: Cluster documents based on word count similarity. Generate a vector for each document: the histogram of words. Compute PCA. Run clustering in the projected space.
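A minimal sketch of this pipeline, assuming scikit-learn is available; the toy documents, the use of TruncatedSVD in place of the centered PCA step (common for sparse counts), and the parameter choices are illustrative.

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    # Sketch: word-count histograms -> low-dimensional projection -> clustering.
    docs = [
        "the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell as markets closed",
        "investors sold shares and stocks",
    ]
    counts = CountVectorizer().fit_transform(docs)       # word histograms
    scores = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
    print(labels)                                        # cluster id per document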

Example: Population Structure Goal: Use genetic data to cluster individuals based on their ancestry. Input: the allele count (e.g., the number of copies of allele A) in each position in the genome. Visualize and cluster populations based on the first PCs, detect outliers.

PCA on a European population

Practical Issues Features have to be standardized. No standardization leads to a non-uniform contribution of the features. Outliers can affect the results. The variance explained in the dimension spanned by the outlier is typically large. In population structure, the first PCs could be due to outliers (e.g., individuals with wrong genotype data, or a different platform). Eigenfaces could be affected by the camera used, time of the day, etc.
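A minimal numpy sketch of the standardization point, with one feature put on a much larger scale (toy data, illustrative names):

    import numpy as np

    # Sketch: without standardization, a large-scale feature dominates the first PC.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 200))               # 5 features, 200 points (columns)
    X[0, :] *= 100.0                                # feature 0 on a much larger scale
    Xc = X - X.mean(axis=1, keepdims=True)

    u1_raw = np.linalg.svd(Xc, full_matrices=False)[0][:, 0]
    print(np.round(u1_raw, 2))                      # loadings concentrate on feature 0

    Xs = Xc / Xc.std(axis=1, keepdims=True)         # standardize each feature
    u1_std = np.linalg.svd(Xs, full_matrices=False)[0][:, 0]
    print(np.round(u1_std, 2))                      # no single feature dominates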

Regression and Normalization PCA is often used to normalize data by regressing out the first PCs. The first PCs often correspond to general phenomena that are not related to the question of interest. Comparing two sets of faces, the main difference may be due to the camera used in each group or to the amount of light. Comparing genomes of patients and controls, population structure and the genotyping lab are confounders.
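A minimal numpy sketch of this normalization, assuming the confounding directions are taken to be the first k PCs (toy data, illustrative names): the component of each point along those PCs is regressed out and the residuals are kept.

    import numpy as np

    # Sketch: remove the first k principal directions from the data.
    rng = np.random.default_rng(0)
    m, n, k = 50, 300, 2
    X = rng.standard_normal((m, n))
    X = X - X.mean(axis=1, keepdims=True)        # center

    U = np.linalg.svd(X, full_matrices=False)[0]
    U_k = U[:, :k]                               # first k PCs (confounder directions)

    X_adj = X - U_k @ (U_k.T @ X)                # residuals after regressing out the PCs
    print(np.allclose(U_k.T @ X_adj, 0.0))       # no remaining component along the PCs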

Spectral Clustering - Intuition Scenario: two clusters, $A$ and $B$, each of size $n/2$, with similarity weights $w_{ij}$. Note: the quadratic form $\sum_{i,j} w_{ij} v_i v_j$ is maximized (over balanced $\pm 1$ vectors $v$) when $v_i = +1$ for $i \in A$ and $v_i = -1$ for $i \in B$, i.e., when $v$ labels the two clusters.


Spectral Clustering Given a weighted graph $G$ with edge weights $w_{ij}$, define $W = (w_{ij})$ and $D = \mathrm{diag}(d_1, \dots, d_n)$ with $d_i = \sum_j w_{ij}$. The Laplacian of $G$ is defined as $L = D - W$. Find the eigenvector of the smallest eigenvalue of $L$ among vectors orthogonal to $(1, 1, \dots, 1)$.

Spectral Clustering Find the eigenvector $v$ of the smallest eigenvalue of $L$ orthogonal to $(1, 1, \dots, 1)$ (the second smallest eigenvector): cluster the values of $v$. We can add more eigenvectors and run k-means in the resulting space.
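A minimal numpy sketch of the two-cluster case on toy data, assuming Gaussian similarity weights (an illustrative choice):

    import numpy as np

    # Sketch: build W, D and L = D - W, take the eigenvector of the 2nd smallest
    # eigenvalue of L, and cluster its values (here simply by sign).
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0.0, 0.3, (30, 2)),    # cluster A
                     rng.normal(3.0, 0.3, (30, 2))])   # cluster B

    diff = pts[:, None, :] - pts[None, :, :]
    W = np.exp(-np.sum(diff ** 2, axis=2))             # edge weights w_ij
    np.fill_diagonal(W, 0.0)

    D = np.diag(W.sum(axis=1))
    L = D - W                                          # graph Laplacian

    evals, evecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    v = evecs[:, 1]                                    # 2nd smallest eigenvector
    print((v > 0).astype(int))                         # recovered cluster labels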

Spectral Clustering A general framework that can work with every similarity measure (not only the scalar product). It can be derived as a relaxation of a minimum sparse-cut (normalized-cut) problem.

Spectral Clustering Figure: the original data (left) and the entries of the 2nd smallest eigenvector (right).

PCA Probabilistic View Model assumptions: $x_i = W z_i + \epsilon_i$, where $W \in \mathbb{R}^{m \times d}$, $z_i \in \mathbb{R}^d$, and $\epsilon_i \sim N(0, \sigma^2 I)$.

PCA Probabilistic View Likelihood: $\prod_{i=1}^n (2\pi\sigma^2)^{-m/2} \exp\!\left(-\frac{\|x_i - W z_i\|^2}{2\sigma^2}\right)$. Maximizing the likelihood is equivalent to minimizing $\sum_{i=1}^n \|x_i - W z_i\|^2$.

PCA Probabilistic View Minimize $\sum_i \|x_i - W z_i\|^2$. Fixing $W$, we get a standard linear regression problem: $z_i = (W^t W)^{-1} W^t x_i$. Plugging back in, we need to minimize $\sum_i \|x_i - W (W^t W)^{-1} W^t x_i\|^2$.

PCA Probabilistic View Minimize $\sum_i \|x_i - W (W^t W)^{-1} W^t x_i\|^2$. Using the SVD $X = U \Sigma V^t$, we get that the optimum is attained when the columns of $W$ span the top $d$ left singular vectors, recovering the PCA solution.

PCA Probabilistic View We can deal with missing data using Expectation Maximization. We can model a mixture of PCAs. Often the variables $z_i$ are treated as random variables and the standard deviation as unknown. The derivation is then different, as well as the result.

Regression: Minimize the Residuals $r_i = y_i - a x_i - b$ for data points $(x_i, y_i)$ and the fitted line $y = a x + b$.

PCA: Minimize the reconstruction error, i.e., the orthogonal distance from each point $(x_i, y_i)$ to the line $y = a x + b$.

Interpretation of the Loadings Loading interpretation: Which pixels provide the most important contribution to the eigenfaces? Which genes contribute the most to the difference in ancestry?

Sparse PCA Goal: PCs with a small number of non-zero loadings. Interpretation is easier. Avoid capturing noise when m = O(n) and the true structure is defined by a small dimension.

Sparse PCA SCoTLASS: Simplified Component Technique-LASSO. $u_1 = \arg\max_u u^t X X^t u$ s.t. $\|u\|_2 = 1$, $\|u\|_1 \le t$.

Sparse PCA SCoTLASS (continued): $u_k = \arg\max_u u^t X X^t u$ s.t. $\|u\|_2 = 1$, $u^t u_i = 0$ for all $i < k$, and $\|u\|_1 \le t$. Disadvantage: slow (non-convex optimization). In practice not sparse when a high fraction of the variance is explained.

Sparse PCA A regression view: run PCA and approximate the PCs using sparse vectors: $\hat\beta_i = \arg\min_\beta \|u_i^t X - \beta^t X\|^2 + \lambda_1 \|\beta\|_1$. Problem: if $X$ is not full rank, the above formulation might not return the PC when $\lambda_1 = 0$.

Sparse PCA Solution: add L2 regularization: $\hat\beta_i = \arg\min_\beta \|u_i^t X - \beta^t X\|^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1$.

Sparse PCA Theorem: If $\lambda_1 = 0$ and $\lambda_2 > 0$ then $\frac{\hat\beta_i}{\|\hat\beta_i\|_2} = u_i$. Proof: $\hat\beta_i = (X X^t + \lambda_2 I)^{-1} X X^t u_i = U (\Sigma^2 + \lambda_2 I)^{-1} U^t \, U \Sigma^2 U^t u_i = \frac{\sigma_i^2}{\sigma_i^2 + \lambda_2} u_i$. Note this formulation is a convex quadratic program.
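A quick numeric check of the theorem on toy data (the sizes, the index $i$, and $\lambda_2$ below are arbitrary):

    import numpy as np

    # Sketch: with lambda1 = 0 and lambda2 > 0, the ridge solution is a positive
    # multiple of u_i, so normalizing it recovers u_i exactly.
    rng = np.random.default_rng(0)
    m, n, lam2 = 5, 40, 0.7
    X = rng.standard_normal((m, n))

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    i = 1
    beta = np.linalg.solve(X @ X.T + lam2 * np.eye(m), X @ X.T @ U[:, i])

    print(np.allclose(beta, (s[i] ** 2 / (s[i] ** 2 + lam2)) * U[:, i]))   # True
    print(np.allclose(beta / np.linalg.norm(beta), U[:, i]))               # True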