PCA, SVD, LDA, MDS, LLE, CCA. Data mining: dimensionality reduction. University of Szeged.


Dimensionality reduction. University of Szeged

The role of dimensionality reduction. We can spare computational costs (or simply fit entire datasets into main memory) if we represent the data in fewer dimensions. Visualization of datasets (in 2 or 3 dimensions). Elimination of noise from the data, feature selection. Key idea: try to represent the data points in lower dimensions. Depending on our objective function with respect to the lower-dimensional representation, we obtain the different methods covered here (PCA, SVD, LDA, MDS, LLE, CCA).

Principal Component Analysis. Transform multidimensional data into lower dimensions in such a way that we lose as small a proportion of the original variation of the data as possible. Assumption: the data points of the original m-dimensional space lie in (or at least very close to) an m'-dimensional subspace; we shall express the data points with respect to this subspace. What might that m'-dimensional subspace be? We would like to minimize the reconstruction error $\sum_{i=1}^{n} \|x_i - \hat{x}_i\|^2$, where $\hat{x}_i$ is an approximation of point $x_i$.

Covariance. Reminder: covariance quantifies how much the random variables Y and Z change together, $\mathrm{cov}(Y, Z) = E[(Y - \mu_Y)(Z - \mu_Z)]$, where $\mu_Y = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\mu_Z = \frac{1}{n}\sum_{i=1}^{n} z_i$. Columns i and j of the data matrix X (i.e. $X_{:,i}$, $X_{:,j}$) can be regarded as observations of two random variables.

Scatter and covariance matrix. Scatter matrix: $S = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$. (Un)biased covariance matrix: $\Sigma = \frac{1}{n} S$ (unbiased: $\Sigma = \frac{1}{n-1} S$).
$$\Sigma = \begin{pmatrix} \mathrm{cov}(X_{:,1}, X_{:,1}) & \mathrm{cov}(X_{:,1}, X_{:,2}) & \dots & \mathrm{cov}(X_{:,1}, X_{:,m}) \\ \mathrm{cov}(X_{:,2}, X_{:,1}) & \mathrm{cov}(X_{:,2}, X_{:,2}) & \dots & \mathrm{cov}(X_{:,2}, X_{:,m}) \\ \vdots & & \ddots & \vdots \\ \mathrm{cov}(X_{:,m}, X_{:,1}) & \dots & \dots & \mathrm{cov}(X_{:,m}, X_{:,m}) \end{pmatrix}$$
$\Sigma_{i,j}$ is the covariance of variables i and j, i.e. $\mathrm{cov}(X_{:,i}, X_{:,j})$. What values are included in the main diagonal?

Characteristics of the scatter and covariance matrices. Claim: the matrices S and Σ are symmetric and positive semi-definite. Proof: $S^\top = \big(\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top\big)^\top = \sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top = S$, and $a^\top S a = \sum_{i=1}^{n} \big(a^\top(x_i - \mu)\big)\big((x_i - \mu)^\top a\big) = \sum_{i=1}^{n} \big(a^\top(x_i - \mu)\big)^2 \ge 0$. Consequence: the eigenvalues of S and Σ are real and non-negative, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0$. The m'-dimensional projection which preserves most of the variation of the data can be obtained by projecting the data points using the eigenvectors belonging to the m' highest eigenvalues of either matrix S or Σ (proof: see the board).
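A quick numerical illustration of these two properties (a minimal sketch on illustrative toy data):

    import numpy as np

    X = np.random.randn(100, 4)                    # toy data: 100 points in 4 dimensions
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu)                      # scatter matrix

    print(np.allclose(S, S.T))                     # symmetric: True
    eigvals = np.linalg.eigvalsh(S)
    print(np.all(eigvals >= -1e-10))               # eigenvalues non-negative: True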

Lagrange multipliers. Provide a scheme for solving (non-)linear optimization problems of the form $f(x) \to \min/\max$ such that $g_i(x) \le 0$, $i \in \{1, \dots, n\}$. Lagrange function: $L(x, \lambda) = f(x) - \sum_{i=1}^{n} \lambda_i g_i(x)$. Karush-Kuhn-Tucker (KKT) conditions, necessary conditions for an optimum: (1) $\nabla L(x, \lambda) = 0$, (2) $\lambda_i g_i(x) = 0$ for all $i \in \{1, \dots, n\}$, (3) $\lambda_i \ge 0$.
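This is exactly the machinery behind the PCA claim above; a short sketch of the derivation (maximizing the projected variance under a unit-norm constraint, using a single multiplier for the equality constraint):

    % maximize the variance of the projected data over unit-length directions w
    \max_{w}\; w^\top \Sigma w \quad \text{subject to} \quad w^\top w = 1
    % Lagrange function
    L(w, \lambda) = w^\top \Sigma w - \lambda\,(w^\top w - 1)
    % stationarity condition
    \nabla_w L = 2\Sigma w - 2\lambda w = 0 \;\Rightarrow\; \Sigma w = \lambda w
    % the attained objective value is the eigenvalue itself
    w^\top \Sigma w = \lambda\, w^\top w = \lambda

So w must be an eigenvector of Σ, and the variance-maximizing direction is the eigenvector with the largest eigenvalue.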

Practical issues. It is worth handling all the features on similar scales:
min-max normalization: $x'_{i,j} = \frac{x_{i,j} - \min(X_{:,j})}{\max(X_{:,j}) - \min(X_{:,j})}$;
standardization: $x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}$.
How to choose the reduced dimensionality m'? Hint: $\sum_{i=1}^{m} s_i^2 = \sum_{i=1}^{m} \lambda_i$, so typical choices are
$m' = \arg\min_{k} \big\{ k : \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{m} \lambda_i} \ge \text{threshold} \big\}$ (the smallest k reaching a given proportion of the total variance),
$m' = \arg\max_{1 \le i \le m} \big\{ i : \lambda_i > \frac{1}{m}\sum_{j=1}^{m} \lambda_j \big\}$ (keep the eigenvalues above the average eigenvalue),
$m' = \arg\max_{1 \le i \le m-1} (\lambda_i - \lambda_{i+1})$ (the largest gap in the eigenvalue spectrum).
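A minimal sketch of the first heuristic (assuming NumPy, a row-per-observation data matrix X and a 90% variance threshold; all names are illustrative):

    import numpy as np

    def choose_num_components(X, threshold=0.90):
        """Smallest m' whose top eigenvalues explain >= threshold of the total variance."""
        Xc = X - X.mean(axis=0)                    # center the data
        cov = np.cov(Xc, rowvar=False)             # covariance matrix
        eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues, largest first
        ratios = np.cumsum(eigvals) / eigvals.sum()
        return int(np.searchsorted(ratios, threshold) + 1)

    X = np.random.randn(200, 10) @ np.random.randn(10, 10)   # toy data
    print(choose_num_components(X))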

Summarizing PCA. Subtract the mean vector from the data X and also normalize it somehow. Calculate the scatter/covariance matrix of the normalized data. Calculate its eigenvalues and eigenvectors. Form the projection matrix P from the eigenvectors corresponding to the m' largest eigenvalues. X' = XP gives the transformed data, and $X' P^\top$ gives an approximation of the original positions of the data points. A useful tutorial on ...
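A compact sketch of these steps (a minimal NumPy implementation under the assumptions above; the function name and toy data are illustrative):

    import numpy as np

    def pca_fit_transform(X, m_reduced):
        """Project X onto its top m_reduced principal components."""
        mu = X.mean(axis=0)
        Xc = X - mu                                  # subtract the mean vector
        cov = np.cov(Xc, rowvar=False)               # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (ascending order)
        order = np.argsort(eigvals)[::-1][:m_reduced]
        P = eigvecs[:, order]                        # projection matrix (m x m')
        X_low = Xc @ P                               # transformed data X' = XP
        X_approx = X_low @ P.T + mu                  # approximate original positions
        return X_low, X_approx

    X = np.random.randn(100, 5)
    X_low, X_rec = pca_fit_transform(X, 2)
    print(X_low.shape, X_rec.shape)                  # (100, 2) (100, 5)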

Singular Value Decomposition. $X = U \Sigma V^\top = \sum_{i=1}^{\mathrm{rank}(X)} \sigma_i u_i v_i^\top$, and $\sum_{i=1}^{n}\sum_{j=1}^{m} x_{ij}^2 = \|X\|_F^2 = \sum_{i=1}^{\mathrm{rank}(X)} \sigma_i^2$. A low(er)-rank approximation of X is $X' = U' \Sigma' V'^\top$: we rely on the top m' < m largest singular values of Σ upon reconstructing X. This is the best possible m'-dimensional (rank-m') approximation of X if we look for the approximation which minimizes $\|X - X'\|_{\mathrm{Frobenius}}$.
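A minimal sketch of the truncated reconstruction and its error (NumPy; the toy matrix and the kept rank are illustrative):

    import numpy as np

    X = np.random.randn(50, 8)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    m_reduced = 3                                    # keep the top m' singular values
    X_approx = U[:, :m_reduced] @ np.diag(s[:m_reduced]) @ Vt[:m_reduced, :]

    err = np.linalg.norm(X - X_approx, 'fro')
    print(err, np.sqrt(np.sum(s[m_reduced:] ** 2)))  # the two numbers coincide

The second printed value is the square root of the sum of the discarded squared singular values, matching the Frobenius-norm argument on the following slide.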

Singular Value Decomposition and eigendecomposition. Reminder: any symmetric matrix A is decomposable as $A = X \Lambda X^{-1}$, where $X = [x_1\ x_2\ \dots\ x_m]$ comprises the orthogonal eigenvectors of A and $\Lambda = \mathrm{diag}([\lambda_1\ \lambda_2\ \dots\ \lambda_m])$ contains the corresponding eigenvalues in its main diagonal. Why? Any n × m matrix X can be decomposed (essentially uniquely) into the product of three matrices of the form $U \Sigma V^\top$, where $U_{n \times n} = [u_1\ u_2\ \dots\ u_n]$ is the orthonormal matrix consisting of the eigenvectors of $X X^\top$, $\Sigma_{n \times m} = \mathrm{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \dots, \sqrt{\lambda_m})$, and $V_{m \times m} = [v_1\ v_2\ \dots\ v_m]$ is the orthonormal matrix consisting of the eigenvectors of $X^\top X$. Why? Orthogonal matrices: $M^\top M = I$ (a transformation which preserves distances in the transformed space as well). Why?
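A quick numerical check of this relation (illustrative random matrix, reduced-size SVD):

    import numpy as np

    X = np.random.randn(6, 4)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # eigenvalues of X^T X are the squared singular values of X
    eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
    print(np.allclose(eigvals, s ** 2))                     # True

    # columns of V are eigenvectors of X^T X with eigenvalues s^2
    V = Vt.T
    print(np.allclose(X.T @ X @ V, V @ np.diag(s ** 2)))    # True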

Relation between SVD and the Frobenius norm. Suppose $M = PQR$, i.e. $m_{ij} = \sum_k \sum_l p_{ik} q_{kl} r_{lj}$. Then $\|M\|_F^2 = \sum_i \sum_j (m_{ij})^2 = \sum_i \sum_j \big(\sum_k \sum_l p_{ik} q_{kl} r_{lj}\big)^2 = \sum_{i,j,k,l,\nu,\mu} p_{ik} q_{kl} r_{lj}\, p_{i\nu} q_{\nu\mu} r_{\mu j}$. Given that the matrices P, Q, R originate from an SVD (P and R orthonormal, Q diagonal), the diagonal Q removes the sums over l and μ, and the orthonormality of P and R removes the cross terms: $\|M\|_F^2 = \sum_{i,j,k,\nu} p_{ik} q_{kk} r_{kj}\, p_{i\nu} q_{\nu\nu} r_{\nu j} = \sum_{j,k} (q_{kk} r_{kj})^2 = \sum_k (q_{kk})^2$. Why? Consequently, the error of approximating X by $X' = U' \Sigma' V'^\top$ is $\|X - X'\|_F^2 = \|U(\Sigma - \Sigma')V^\top\|_F^2 = \sum_k (\sigma_k - \sigma'_k)^2$, i.e. the sum of the squared singular values that were dropped.

CUR decomposition. The drawback of SVD is that a typically sparse matrix is decomposed into a product of dense matrices (i.e. U and V). One alternative is to use the CUR decomposition: this time only matrix U happens to be dense, while matrices C and R are composed of actual columns and rows of the matrix X, thus they preserve the sparsity of X. SVD is unique, unlike CUR.

CUR decomposition, producing C and R. Choose k columns (rows) from the data matrix with replacement; a column (row) can be selected more than once into C (R). The probability of selecting a column (row) should be proportional to the sum of squared elements in it. Scale the elements of the selected column (row) by $1/\sqrt{k\, p_i}$, where $p_i$ is its selection probability.
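A minimal sketch of this sampling step (NumPy; function and variable names are illustrative):

    import numpy as np

    def sample_cur_columns(X, k, seed=0):
        """Sample k columns with replacement, proportionally to squared column norms."""
        rng = np.random.default_rng(seed)
        col_norms_sq = (X ** 2).sum(axis=0)
        p = col_norms_sq / col_norms_sq.sum()        # selection probabilities
        idx = rng.choice(X.shape[1], size=k, replace=True, p=p)
        C = X[:, idx] / np.sqrt(k * p[idx])          # rescale each selected column
        return C, idx

    # Rows of X are sampled the same way by applying the function to X.T:
    # R_t, row_idx = sample_cur_columns(X.T, k); R = R_t.T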

CUR decomposition, producing U.
1. Create $W \in \mathbb{R}^{k \times k}$ from the intersection of the k selected rows and columns.
2. Perform SVD on W such that $W = X \Sigma Y^\top$.
3. Let $U = Y (\Sigma^{+})^2 X^\top$, where $\Sigma^{+}$ is the pseudoinverse of Σ.
Σ is diagonal, so calculating its pseudoinverse is easy: just take the reciprocal of the non-zero elements in the diagonal; zeros can be left intact.
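A sketch of this step, continuing the sampling sketch above (NumPy; names are illustrative, and the tolerance used to decide which diagonal entries count as non-zero is an assumption):

    import numpy as np

    def cur_U(X, row_idx, col_idx):
        """Linking matrix U built from the intersection of the sampled rows and columns."""
        W = X[np.ix_(row_idx, col_idx)]              # k x k intersection matrix
        Xw, sigma, Ywt = np.linalg.svd(W)            # W = Xw @ diag(sigma) @ Ywt
        sigma_pinv = np.zeros_like(sigma)
        nz = sigma > 1e-12
        sigma_pinv[nz] = 1.0 / sigma[nz]             # reciprocal of the non-zero values only
        return Ywt.T @ np.diag(sigma_pinv ** 2) @ Xw.T   # U = Y (Sigma^+)^2 X^T

    # The CUR approximation of the data matrix is then X ≈ C @ U @ R.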

CUR decomposition example. A small user-movie rating matrix (users Paul, Sam, Martha, Bob, Sarah; movies Alien, Matrix, Casablanca, Titanic) is decomposed on the slide: the selection probabilities of the columns and rows are proportional to their sums of squared ratings, and the resulting C, U and R matrices are shown.

Linear Discriminant Analysis. Transform data points into lower dimensions in such a way that points of the same class have as little dispersion as possible, whereas points of different classes mix as little as possible. How should we choose w, i.e. the direction of the projection? With $\tilde{\mu}_i = w^\top \mu_i$ we have $\tilde{\mu}_1 - \tilde{\mu}_2 = w^\top (\mu_1 - \mu_2)$ and $\tilde{s}_i^2 = \sum_{x \text{ with label } i} (w^\top x - \tilde{\mu}_i)^2$, and we look for
$$w^* = \arg\max_w J(w) = \arg\max_w \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}. \quad (1)$$

Within and between scatter matrices. The within-scatter matrix of points with class label i: $S_i = \sum_{x \text{ with label } i} (x - \mu_i)(x - \mu_i)^\top$. Aggregated within-scatter matrix: $S_W = S_1 + S_2$. Scatter of the points with class label i: $\tilde{s}_i^2 = \sum_{x \text{ with label } i} (w^\top x - w^\top \mu_i)^2 = w^\top \big(\sum_{x \text{ with label } i} (x - \mu_i)(x - \mu_i)^\top\big) w = w^\top S_i w$. Scatter matrix of the points between the different classes: $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top$. Scatter of the points between the different classes: $(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^\top \mu_1 - w^\top \mu_2)^2 = w^\top S_B w$.

An objective equivalent to Eq. (1) is $w^* = \arg\max_w J(w) = \arg\max_w \frac{w^\top S_B w}{w^\top S_W w}$, where $\frac{w^\top S_B w}{w^\top S_W w}$ is the so-called generalized Rayleigh coefficient. J(w) is maximal when $S_B w = \lambda S_W w$, i.e. $S_W^{-1} S_B w = \lambda w$, which gives $w^* = S_W^{-1}(\mu_1 - \mu_2)$. Reminder: $\frac{\partial}{\partial x}\frac{f(x)}{g(x)} = \frac{f'(x) g(x) - f(x) g'(x)}{g^2(x)}$, $\frac{\partial}{\partial x}\, x^\top A x = 2Ax$ given that $A = A^\top$, and $\frac{\partial}{\partial x}\, x^\top y = \frac{\partial}{\partial x} \sum_{i=1}^{n} x_i y_i = y$ (i.e. a vector pointing in the direction of y).
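A minimal two-class Fisher LDA sketch following this closed form (NumPy; the data and names are illustrative):

    import numpy as np

    def fisher_lda_direction(X, y):
        """Projection direction w = S_W^{-1} (mu_0 - mu_1) for binary labels y in {0, 1}."""
        X0, X1 = X[y == 0], X[y == 1]
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        S0 = (X0 - mu0).T @ (X0 - mu0)               # within-class scatter of class 0
        S1 = (X1 - mu1).T @ (X1 - mu1)               # within-class scatter of class 1
        Sw = S0 + S1                                  # aggregated within-scatter matrix
        w = np.linalg.solve(Sw, mu0 - mu1)            # S_W^{-1} (mu_0 - mu_1)
        return w / np.linalg.norm(w)

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [3, 2]])
    y = np.array([0] * 50 + [1] * 50)
    print(fisher_lda_direction(X, y))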

PCA vs. LDA (comparison figure: the projection directions chosen by the two methods on the same two-dimensional, two-class example).

Multi-Dimensional Scaling (MDS). Goal: given a matrix containing the pairwise costs/distances $\delta_{ij}$ of the points, find positions of the points for which $\|x_i - x_j\| \approx \delta_{ij}$. MDS transforms multidimensional points into lower dimensions such that the pairwise distances are preserved as much as possible.
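One common realization of this objective is classical (metric) MDS via double centering of the squared distance matrix; a sketch (assuming a pairwise distance matrix D, names illustrative; this is one of several ways to fit the MDS goal):

    import numpy as np

    def classical_mds(D, m_reduced=2):
        """Embed points in m_reduced dimensions from a pairwise distance matrix D."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                    # double-centered squared distances
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1][:m_reduced]  # top eigenpairs
        L = np.sqrt(np.maximum(eigvals[order], 0.0))
        return eigvecs[:, order] * L                   # coordinates: V * sqrt(Lambda)

    # Sanity check: distances of the embedding reproduce the input distances
    pts = np.random.randn(10, 3)
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    Y = classical_mds(D, 3)
    D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    print(np.allclose(D, D_hat))                       # True for Euclidean input and enough dimensions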

Locally Linear Embedding (LLE). The previous techniques all assume a linear relationship between the variables; LLE is a non-linear dimensionality reduction technique. Idea: determine the nearest neighbors of every point and express each point as a linear combination of its neighbors, i.e. minimize $J(W) = \sum_{i=1}^{n} \big\|x_i - \sum_{j=1}^{n} W_{ij} x_j\big\|^2$ such that $\sum_{j=1}^{n} W_{ij} = 1$ and $W_{ij} \neq 0$ only if $x_j \in \mathrm{neighbors}(x_i)$. (The low-dimensional coordinates are then chosen so that the same weights reconstruct them as well.)
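A sketch of the weight-finding step for a single point, solving the constrained least-squares problem via the local Gram matrix (a standard approach, not necessarily the exact one used in the course; names are illustrative):

    import numpy as np

    def lle_weights_for_point(x, neighbors, reg=1e-3):
        """Weights that reconstruct x from its neighbors and sum to one."""
        Z = neighbors - x                                  # shift neighbors to the origin
        G = Z @ Z.T                                        # local Gram matrix
        G += reg * np.trace(G) * np.eye(len(neighbors))    # regularize in case G is singular
        w = np.linalg.solve(G, np.ones(len(neighbors)))
        return w / w.sum()                                 # enforce the sum-to-one constraint

    x = np.array([1.0, 2.0])
    neighbors = np.array([[0.5, 1.5], [1.5, 2.5], [1.0, 1.0]])
    w = lle_weights_for_point(x, neighbors)
    print(w, w @ neighbors)                                # reconstruction close to x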

Canonical Correlation Analysis (CCA). Our data points have two distinct representations (coordinate systems), x and y. Goal: find a common coordinate system (with reduced dimensionality) such that the correlation between the transformed points is maximized. Recall that for centered scalar variables $\rho = \frac{E[xy]}{\sqrt{E[x^2]}\sqrt{E[y^2]}}$; here
$$\rho = \frac{E[w_x^\top x\, y^\top w_y]}{\sqrt{E[w_x^\top x\, x^\top w_x]}\, \sqrt{E[w_y^\top y\, y^\top w_y]}} = \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\, \sqrt{w_y^\top C_{yy} w_y}},$$
where the covariance blocks come from $\Sigma = E\Big[\begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}^\top\Big] = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$. ρ is independent of the lengths of $w_x$ and $w_y$, so $\arg\max \rho = \arg\max w_x^\top C_{xy} w_y$ once the denominator terms are fixed (e.g. constrained to 1).
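A minimal sketch of one standard way to compute the first pair of directions, by whitening both views and taking an SVD of the cross-covariance (NumPy; names and toy data are illustrative):

    import numpy as np

    def inv_sqrt(A, eps=1e-10):
        """Inverse matrix square root of a symmetric positive definite matrix."""
        vals, vecs = np.linalg.eigh(A)
        return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

    def cca_first_pair(X, Y):
        """First pair of canonical directions (w_x, w_y) and their correlation."""
        Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
        n = X.shape[0]
        Cxx, Cyy = Xc.T @ Xc / n, Yc.T @ Yc / n
        Cxy = Xc.T @ Yc / n
        M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)        # whitened cross-covariance
        U, s, Vt = np.linalg.svd(M)
        w_x = inv_sqrt(Cxx) @ U[:, 0]                  # top canonical direction for x
        w_y = inv_sqrt(Cyy) @ Vt[0, :]                 # top canonical direction for y
        return w_x, w_y, s[0]                          # s[0] is the canonical correlation

    X = np.random.randn(100, 3)
    Y = X @ np.random.randn(3, 2) + 0.1 * np.random.randn(100, 2)
    print(cca_first_pair(X, Y)[2])                     # close to 1 for strongly related views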