Feature Selection: Linear Transformations (y_new = M x_old); Constraint Optimization (insertion)


Feature Selection: Linear Transformations: y_new = M x_old

Constraint Optimization (insertion) (slide 3)
Problem: Given an objective function f(x) to be optimized, let the constraints be given by h_k(x) = c_k. Moving the constants to the left: h_k(x) - c_k = g_k(x). f(x) and g_k(x) must have continuous first partial derivatives.
A solution: Lagrangian multipliers, ∇_x f(x) + Σ_k λ_k ∇_x g_k(x) = 0, or, starting with the Lagrangian L(x, λ) = f(x) + Σ_k λ_k g_k(x), solve ∇_x L(x, λ) = 0.
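As a quick numerical illustration of the recipe above (a minimal sketch with a made-up objective and constraint, solved symbolically with sympy; not part of the original slides):

    # Lagrange-multiplier sketch for a toy problem: maximize f = x + y s.t. x^2 + y^2 = 1.
    # Any f, g with continuous first partial derivatives works the same way.
    import sympy as sp

    x, y, lam = sp.symbols('x y lambda', real=True)
    f = x + y                      # objective
    g = x**2 + y**2 - 1            # constraint g(x, y) = 0

    L = f + lam * g                # Lagrangian L(x, lambda) = f + lambda * g
    stationary = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
    print(stationary)              # two candidates: (1/sqrt(2), 1/sqrt(2)) and (-1/sqrt(2), -1/sqrt(2))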

The Covariance Matrix (insertion) (slide 4)
Definition: Let x = (x_1, ..., x_N)^T ∈ R^N be a real-valued random variable (data vectors), with expectation value (mean) E[x] = μ. We define the covariance matrix Σ_x of a random variable x as Σ_x := E[(x - μ)(x - μ)^T], with matrix elements Σ_ij = E[(x_i - μ_i)(x_j - μ_j)].
Application: estimating E[x] and E[(x - E[x])(x - E[x])^T] from data. We assume m samples of the random variable x ∈ R^N, that is, a set of vectors {x_1, ..., x_m}, or, put into a data matrix, X ∈ R^{N×m}. The maximum-likelihood estimators for μ and Σ_x are
μ_ML = (1/m) Σ_k x_k and Σ_ML = (1/m) Σ_k (x_k - μ_ML)(x_k - μ_ML)^T = (1/m) X X^T for mean-free data.

KLT/PCA Motivation (slide 5)
- Find meaningful directions in correlated data
- Linear dimensionality reduction
- Visualization of higher-dimensional data
- Compression / noise reduction
- PDF estimation
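A minimal numpy sketch of these estimators (toy data and variable names of my own choosing; each column of X is one sample x_k):

    # Maximum-likelihood estimates of mean and covariance from a data matrix
    # X of shape (N, m): each column is one sample x_k in R^N.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 500))                  # toy data, N = 3, m = 500

    mu_ml = X.mean(axis=1, keepdims=True)          # mu_ML = (1/m) * sum_k x_k
    Xc = X - mu_ml                                 # mean-free data
    sigma_ml = (Xc @ Xc.T) / X.shape[1]            # Sigma_ML = (1/m) * X X^T (mean-free X)

    print(mu_ml.ravel())
    print(sigma_ml)                                # compare: np.cov(X, bias=True)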

Karhunen-Loève Transform: 1st Derivation (slide 7)
Problem: Let x = (x_1, ..., x_N)^T ∈ R^N be a feature vector of zero-mean, real-valued random variables. We seek the direction a_1 of maximum variance: y_1 = a_1^T x, for which a_1 is such that E[y_1^2] is maximal, with the constraint a_1^T a_1 = 1. This is a constrained optimization, so we use the Lagrangian:
L(a_1, λ) = E[a_1^T x x^T a_1] - λ (a_1^T a_1 - 1) = a_1^T Σ_x a_1 - λ (a_1^T a_1 - 1), with Lagrange multiplier λ.

Karhunen-Loève Transform (slide 8)
L(a_1, λ) = a_1^T Σ_x a_1 - λ (a_1^T a_1 - 1). For E[y_1^2] to be maximal: ∂L(a_1, λ)/∂a_1 = 0 => Σ_x a_1 - λ a_1 = 0 => a_1 must be an eigenvector of Σ_x with eigenvalue λ. Then E[y_1^2] = a_1^T Σ_x a_1 = λ, so for E[y_1^2] to be maximal, λ must be the largest eigenvalue.
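A small numerical check of this result (toy data of my own making): the projection onto the top eigenvector of Σ_x attains variance λ_max, and no other unit direction exceeds it.

    # The direction of maximum variance is the eigenvector of Sigma_x with the
    # largest eigenvalue, and the variance achieved equals that eigenvalue.
    import numpy as np

    rng = np.random.default_rng(1)
    A_mix = rng.normal(size=(2, 2))
    X = A_mix @ rng.normal(size=(2, 2000))          # (approximately) zero-mean, correlated samples

    sigma_x = (X @ X.T) / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(sigma_x)       # eigenvalues in ascending order
    a1 = eigvecs[:, -1]                              # eigenvector for the largest eigenvalue

    y1 = a1 @ X                                      # y_1 = a_1^T x for every sample
    print(np.mean(y1**2), eigvals[-1])               # projected variance equals lambda_max

    random_dirs = rng.normal(size=(2, 100))
    random_dirs /= np.linalg.norm(random_dirs, axis=0)
    print((random_dirs.T @ sigma_x @ random_dirs).diagonal().max() <= eigvals[-1] + 1e-12)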

Karhunen-Loève Transform (slide 9)
Now let us search for a second direction a_2 such that y_2 = a_2^T x, E[y_2^2] is maximal, a_2^T a_1 = 0 and a_2^T a_2 = 1. A similar derivation, L(a_2, λ) = a_2^T Σ_x a_2 - λ (a_2^T a_2 - 1) with a_2^T a_1 = 0, shows that a_2 must be the eigenvector of Σ_x associated with the second largest eigenvalue λ_2. We can derive N orthonormal directions that maximize the variance: A = [a_1, a_2, ..., a_N] and y = A^T x. The resulting matrix A is known as the Principal Component Analysis (PCA) or Karhunen-Loève transform (KLT): y = A^T x.

Karhunen-Loève Transform: 2nd Derivation (slide 10)
Problem: Let x = (x_1, ..., x_N)^T ∈ R^N be a feature vector of zero-mean, real-valued random variables. We seek a transformation A of x that results in a new set of variables y = A^T x (feature vectors) which are uncorrelated (i.e. E[y_i y_j] = 0 for i ≠ j). Let y = A^T x; then, by definition of the correlation matrix, R_y = E[y y^T] = E[A^T x x^T A] = A^T R_x A. R_x is symmetric, so its eigenvectors are mutually orthogonal.
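A sketch checking the decorrelation property on toy data (my own example, not from the slides): with A's columns the orthonormal eigenvectors of R_x, the transformed features are uncorrelated.

    # Decorrelation property: y = A^T x has a diagonal correlation matrix when the
    # columns of A are orthonormal eigenvectors of R_x.
    import numpy as np

    rng = np.random.default_rng(2)
    C_true = [[4.0, 1.5, 0.0], [1.5, 2.0, 0.5], [0.0, 0.5, 1.0]]
    X = np.linalg.cholesky(C_true) @ rng.normal(size=(3, 5000))   # correlated toy samples

    R_x = (X @ X.T) / X.shape[1]
    _, A = np.linalg.eigh(R_x)            # columns of A: orthonormal eigenvectors of R_x
    Y = A.T @ X                           # y = A^T x for every sample

    R_y = (Y @ Y.T) / Y.shape[1]          # = A^T R_x A
    print(np.round(R_y, 6))               # off-diagonal entries are (numerically) zero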

Karhunen-Loève Transform (slide 11)
I.e. if we choose A such that its columns a_i are the orthonormal eigenvectors of R_x, we get R_y = A^T R_x A = diag(λ_1, λ_2, ..., λ_N). If we further assume R_x to be positive definite, the eigenvalues will be positive. The resulting matrix A is known as the Karhunen-Loève transform (KLT): y = A^T x.

Karhunen-Loève Transform (slide 12)
The Karhunen-Loève transform (KLT): y = A^T x. For mean-free vectors (e.g. replace x by x - E[x]) this process diagonalizes the covariance matrix Σ_x.

KLT Properties: MSE Approximation (slide 13)
We define a new vector x̂ in an m-dimensional subspace (m < N), using only m basis vectors: x̂ = Σ_{i=1..m} y_i a_i, the projection of x onto the subspace spanned by the m used (orthonormal) eigenvectors. Now, what is the expected mean square error between x and its projection x̂?
E[ ||x - x̂||^2 ] = E[ || Σ_{i=m+1..N} y_i a_i ||^2 ] = E[ Σ_i Σ_j (y_i a_i)^T (y_j a_j) ].

KLT Properties: MSE Approximation (slide 14)
E[ ||x - x̂||^2 ] = ... = Σ_{i=m+1..N} E[ y_i^2 ] = Σ_{i=m+1..N} λ_i.
The error is minimized if we choose as basis those eigenvectors corresponding to the m largest eigenvalues of the correlation matrix. Amongst all other possible orthogonal transforms, the KLT is the one leading to minimum MSE. This form of the KLT (applied to mean-free data) is also referred to as Principal Component Analysis (PCA). The principal components are the eigenvectors ordered (descending) by their respective eigenvalue magnitudes.
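A sketch verifying the MSE statement empirically (toy covariance of my own choosing): the mean squared reconstruction error of the rank-m projection equals the sum of the discarded eigenvalues.

    # Reconstruction error of an m-dimensional PCA/KLT projection equals the sum of
    # the eigenvalues that were left out (measured here on the empirical correlation matrix).
    import numpy as np

    rng = np.random.default_rng(3)
    C_true = np.diag([5.0, 2.0, 0.5, 0.1])
    X = np.linalg.cholesky(C_true) @ rng.normal(size=(4, 20000))   # zero-mean toy samples

    R_x = (X @ X.T) / X.shape[1]
    eigvals, A = np.linalg.eigh(R_x)
    order = np.argsort(eigvals)[::-1]                # sort descending
    eigvals, A = eigvals[order], A[:, order]

    m = 2
    A_m = A[:, :m]                                   # keep the top-m eigenvectors
    X_hat = A_m @ (A_m.T @ X)                        # projection onto the subspace
    mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))  # E[ ||x - x_hat||^2 ]
    print(mse, eigvals[m:].sum())                    # the two numbers agree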

KLT Properties (slide 15)
Total variance: let, w.l.o.g., E[x] = 0 and y = A^T x be the KLT (PCA) of x. From the previous definitions we get E[y_i^2] = σ_{y_i}^2 = λ_i, i.e. the eigenvalues of the input covariance matrix are equal to the variances of the transformed coordinates. Selecting the features corresponding to the largest eigenvalues retains the maximal possible total variance (sum of component variances) associated with the original random variables x.

KLT Properties: Entropy (slide 16)
For a random vector y the entropy H_y = -E[ ln p_y(y) ] is a measure of the randomness of the underlying process. Example: for a zero-mean (μ = 0) m-dimensional Gaussian,
H_y = -E[ ln( (2π)^{-m/2} |Σ_y|^{-1/2} exp( -y^T Σ_y^{-1} y / 2 ) ) ]
    = (1/2) E[ y^T Σ_y^{-1} y ] + (1/2) ln |Σ_y| + (m/2) ln 2π
    = (1/2) E[ trace( Σ_y^{-1} y y^T ) ] + (1/2) ln |Σ_y| + (m/2) ln 2π
    = (1/2) trace( I_m ) + (1/2) ln |Σ_y| + (m/2) ln 2π.
Selecting the features corresponding to the largest eigenvalues maximizes the entropy in the remaining features. No wonder: variance and randomness are directly related!
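A numerical check of the entropy formula (toy covariance of my own choosing): the closed form (m/2)(1 + ln 2π) + (1/2) ln|Σ_y| matches a Monte-Carlo estimate of -E[ln p_y(y)].

    # Differential entropy of a zero-mean Gaussian:
    # H = m/2 * (1 + ln(2*pi)) + 0.5 * ln det(Sigma).
    import numpy as np

    rng = np.random.default_rng(4)
    sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
    m = sigma.shape[0]

    h_closed = 0.5 * m * (1.0 + np.log(2 * np.pi)) + 0.5 * np.linalg.slogdet(sigma)[1]

    Y = np.linalg.cholesky(sigma) @ rng.normal(size=(m, 200000))
    inv = np.linalg.inv(sigma)
    log_p = -0.5 * np.sum(Y * (inv @ Y), axis=0) \
            - 0.5 * (m * np.log(2 * np.pi) + np.linalg.slogdet(sigma)[1])
    print(h_closed, -log_p.mean())                   # the two values agree closely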

Computing a PCA (slide 17)
Problem: Given mean-free data X, a set of n feature vectors x_i ∈ R^m, compute the orthonormal eigenvectors a_i of the correlation matrix R_x. There are many algorithms that can compute the eigenvectors of a matrix very efficiently. However, most of these methods can be very unstable in certain special cases. Here we present SVD, a method that is in general not the most efficient one. However, the method can be made numerically stable very easily!

Computing a PCA (slide 18)
Singular Value Decomposition: an excursus into linear algebra (without proofs).

Singular Value Decomposition (slide 19)
SVD (reduced version): For matrices A ∈ R^{m×n} with m ≥ n, there exist matrices U ∈ R^{m×n} with orthonormal columns (U^T U = I), V ∈ R^{n×n} orthogonal (V^T V = I), and Σ ∈ R^{n×n} diagonal, such that A = U Σ V^T. The diagonal values of Σ (σ_1, σ_2, ..., σ_n) are called the singular values. It is customary to sort them: σ_1 ≥ σ_2 ≥ ... ≥ σ_n.

SVD Applications (slide 20)
SVD is an all-rounder! Once you have U, Σ, V, you can use them to:
- Solve linear systems A x = b:
  a) compute the matrix inverse, if A^{-1} exists
  b) handle fewer equations than unknowns
  c) handle more equations than unknowns
  d) if there is no solution: compute the x for which ||A x - b|| = min
  e) compute the rank (numerical rank) of a matrix
- Compute a PCA / KLT
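A quick numpy sketch of the reduced SVD as defined above (toy matrix of my own choosing):

    # Reduced SVD with numpy: A (m x n, m >= n) factors as U (m x n, orthonormal
    # columns) * diag(singular values) * V^T (n x n orthogonal).
    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.normal(size=(6, 3))                     # m = 6, n = 3

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(U.shape, s, Vt.shape)                     # s is sorted: s[0] >= s[1] >= s[2]
    print(np.allclose(U.T @ U, np.eye(3)))          # orthonormal columns of U
    print(np.allclose(Vt @ Vt.T, np.eye(3)))        # V orthogonal
    print(np.allclose(U @ np.diag(s) @ Vt, A))      # A = U Sigma V^T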

SVD: Matrix Inverse A^{-1} (slide 21)
A x = b, with A = U Σ V^T (U, Σ, V exist for all A). If A is square (n×n) and not singular, then A^{-1} exists:
A^{-1} = (U Σ V^T)^{-1} = V Σ^{-1} U^T, with Σ^{-1} = diag(1/σ_1, ..., 1/σ_n).
Computing A^{-1} for a singular A!? Since U, Σ, V always exist, the only problem can originate if some σ_i = 0 or numerically close to zero --> the singular values indicate whether A is singular or not!!

SVD: Rank of a Matrix (slide 22)
- The rank of A is the number of non-zero singular values.
- If there are very small singular values, then A is close to being singular. We can set a threshold t and set σ_i = 0 if σ_i ≤ t; then numeric_rank(A) = #{ σ_i > t }.
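A sketch of both uses (toy matrices of my own choosing): the inverse of a non-singular square A via V Σ^{-1} U^T, and the numerical rank via a singular-value threshold.

    # SVD gives the matrix inverse (for non-singular A) and the numerical rank.
    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.normal(size=(4, 4))
    U, s, Vt = np.linalg.svd(A)
    A_inv = Vt.T @ np.diag(1.0 / s) @ U.T            # A^-1 = V Sigma^-1 U^T
    print(np.allclose(A_inv, np.linalg.inv(A)))

    B = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # rank-1 matrix
    sb = np.linalg.svd(B, compute_uv=False)
    t = max(B.shape) * np.finfo(float).eps * sb[0]   # one common threshold choice
    print(sb, int((sb > t).sum()))                   # numeric_rank(B) = 1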

SVD: Rank of a Matrix (2) (slide 23)
- numeric_rank(A) = #{ σ_i > t }; the rank of A equals dim(Img(A)), and n = dim(Img(A)) + dim(Ker(A)).
- The columns of U corresponding to σ_i ≠ 0 span the range of A.
- The columns of V corresponding to σ_i = 0 span the nullspace of A.

Remember linear mappings A x = b (slide 24)
1) Case A^{-1} exists: A maps R^n one-to-one onto R^n, and A x = b has a unique solution.
2) A is singular: dim(Ker(A)) ≠ 0.

SVD: Solving A x = b (slide 25)
2) A is singular: dim(Ker(A)) ≠ 0. There are an infinite number of different x that solve A x = b!!?? Which one should we choose?? E.g. we can choose the x with ||x|| = min; then we have to search in the space orthogonal to the nullspace.

SVD: Solving ||A x - c|| = min (slide 26)
3) c is not in the range of A:
1) Projecting c onto the range of A results in c*.
2) From all the solutions of A x = c* we choose the x with ||x|| = min.

SVD: Solving ||A x - c|| = min (slide 27)
A x = c: U Σ V^T x = c => x = V Σ^{-1} U^T c. For any A there exist U, Σ, V with A = U Σ V^T and σ_1 ≥ ... ≥ σ_n. Computing Σ^{-1} for a singular A!? --> What to do in Σ^{-1} with 1/0 = ???? (some σ_i set to 0 if σ_i ≤ t). Remember what we need ---->

SVD: Solving ||A x - c|| = min (slide 28)
We need to:
1) project c onto the range of A to obtain c*;
2) from all the solutions of A x = c* choose the x with ||x|| = min, that is, the x in the space orthogonal to the nullspace.
x = V Σ_0^{-1} U^T c, where Σ_0^{-1} contains 1/σ_i for σ_i > t and 0 otherwise.
- The columns of U corresponding to σ_i ≠ 0 span the range of A.
- The columns of V corresponding to σ_i = 0 span the nullspace of A.
Basically, all rows or columns multiplied by 1/0 are irrelevant!! --> so even setting 1/0 := 0 will lead to the correct result.
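A sketch of the thresholded pseudoinverse described here (a made-up singular system): reciprocals of near-zero singular values are set to 0, which projects c onto the range of A and picks the minimum-norm solution; numpy's lstsq agrees.

    # Thresholded pseudoinverse: x = V Sigma_0^-1 U^T c, with 1/sigma_i := 0 when
    # sigma_i <= t. This yields the minimum-norm least-squares solution.
    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],      # singular: row 2 = 2 * row 1
                  [0.0, 1.0, 1.0]])
    c = np.array([1.0, 3.0, 1.0])       # not in the range of A

    U, s, Vt = np.linalg.svd(A)
    t = max(A.shape) * np.finfo(float).eps * s[0]
    s_inv = np.where(s > t, 1.0 / s, 0.0)            # "1/0 := 0"
    x = Vt.T @ (s_inv * (U.T @ c))                   # x = V Sigma_0^-1 U^T c

    x_ref, *_ = np.linalg.lstsq(A, c, rcond=None)    # numpy's min-norm least squares
    print(x, x_ref)                                  # the two solutions agree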

SVD at Work (slide 29)
For linear systems A x = b:
- Case fewer equations than unknowns: fill rows of A with zeros so that m = n.
- Perform SVD on A (with m ≥ n): compute U, Σ, V with A = U Σ V^T.
- Compute a threshold t; in Σ set σ_i = 0 for all σ_i ≤ t, and in Σ^{-1} set 1/σ_i = 0 for all σ_i ≤ t.
- For linear systems: compute the pseudoinverse A^+ = V Σ^{-1} U^T and compute x = A^+ b.

Application: Compute PCA via SVD (slide 30)
Problem: Given mean-free data X, a set of n feature vectors x_i ∈ R^m, compute the orthonormal eigenvectors a_i of the correlation matrix R_x. Now we use SVD:
1. Move the center of mass to the origin: x_i = x_i - μ.
2. Build the data matrix from the mean-free data: X = U Σ V^T.
3. The principal axes are the eigenvectors of the covariance matrix C = (1/n) X X^T, since X X^T = U Σ^2 U^T = U Λ_d U^T.
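A compact sketch of this recipe on toy data (my own example): the SVD of the mean-free data matrix gives the principal axes in U, and λ_i = σ_i^2 / n matches the eigendecomposition of C.

    # PCA via SVD of the mean-free data matrix X (features x samples):
    # principal axes = columns of U, eigenvalues lambda_i = sigma_i^2 / n.
    import numpy as np

    rng = np.random.default_rng(7)
    C_true = [[6.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 1.0]]
    X = rng.normal(size=(3, 1)) + np.linalg.cholesky(C_true) @ rng.normal(size=(3, 1000))

    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)             # 1. move center of mass to origin
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # 2. SVD of the data matrix
    lam_svd = s**2 / n                                 # 3. eigenvalues of C = (1/n) X X^T

    C = (Xc @ Xc.T) / n
    lam_eig = np.sort(np.linalg.eigvalsh(C))[::-1]
    print(np.allclose(lam_svd, lam_eig))               # same spectrum, same principal axes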

Application: Compute PCA via SVD (2) (slide 31)
With the SVD: X X^T = U Σ V^T (U Σ V^T)^T = U Σ V^T (V Σ U^T) = U Σ^2 U^T. Since C = (1/n) X X^T, the eigenvalues compute to λ_i = (1/n) σ_i^2, with λ_i the variance of y_i (= E[y_i^2]) and σ_i from the SVD.

Example: PCA on Images (slide 32)
Assume we have a set of k images (of size N×N). Each image can be seen as an N^2-dimensional point p_i (lexicographically ordered); the whole set can be stored as the matrix X = [p_1, p_2, ..., p_k]. Computing PCA the naïve way: build the correlation matrix X X^T (N^4 elements) and compute the eigenvectors of this matrix: O((N^2)^3). Already for small images (e.g. N = 100) this is far too expensive.

PCA on Images (slide 33)
Now we use SVD:
1. Move the center of mass to the origin: p_i = p_i - μ.
2. Build the data matrix from the mean-free data: X = [p_1, p_2, ..., p_n].
3. The principal axes are the eigenvectors of X X^T = U Σ^2 U^T = U Λ_d U^T.

PCA on Images (slide 34)
[Figure: mean face, example faces, and eigenfaces.] Principal components can be visualized by adding to the mean vector an eigenvector multiplied by a factor (e.g. λ_i).
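A sketch of the same recipe on synthetic "image" data (random images, purely illustrative), showing why the SVD route never has to form the N^2 × N^2 matrix X X^T:

    # Eigen-image sketch on synthetic data: k images of size N x N, stacked as
    # columns of X (N^2 x k). The SVD of X gives the eigen-images directly,
    # without ever building the N^2 x N^2 matrix X X^T.
    import numpy as np

    rng = np.random.default_rng(8)
    N, k = 32, 40                                    # tiny synthetic "faces"
    images = rng.normal(size=(k, N, N))

    X = images.reshape(k, N * N).T                   # lexicographic ordering, one column per image
    mean_image = X.mean(axis=1, keepdims=True)
    Xc = X - mean_image                              # mean-free data

    U, s, _ = np.linalg.svd(Xc, full_matrices=False) # U: N^2 x k, at most k eigen-images
    eigen_images = U.T.reshape(-1, N, N)             # each row of U^T is one principal axis
    print(eigen_images.shape, (s**2 / k)[:5])        # top eigenvalues (component variances)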

PCA Applied to Face Images (slide 35)
Here the faces were normalized in eye distance and eye position. [Figure: mean face, eigenfaces, eigenvalue spectrum as a function of r.] Choosing the subspace dimension r: look at the decay of the eigenvalues as a function of r. A larger r means a lower expected error in the subspace approximation of the data.

Eigenfaces for Face Recognition (slide 36)
In the 90s, the best performing face recognition system! Turk, M. and Pentland, A. (1991). Face recognition using eigenfaces. In Proceedings of Computer Vision and Pattern Recognition, pages 586-591. IEEE.
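Relating back to the choice of the subspace dimension r on slide 35, a small sketch (toy eigenvalue spectrum of my own choosing): pick the smallest r whose cumulative variance fraction reaches a target.

    # Choosing the subspace dimension r from the eigenvalue spectrum:
    # smallest r that retains a given fraction of the total variance.
    import numpy as np

    eigvals = np.array([12.0, 6.0, 2.5, 0.8, 0.4, 0.2, 0.1])   # toy spectrum, sorted descending
    explained = np.cumsum(eigvals) / eigvals.sum()
    target = 0.95
    r = int(np.searchsorted(explained, target) + 1)
    print(explained, r)                                         # r = 4 for this spectrum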

PCA for Face Recognition (slide 37)
[Figure.]

PCA & Discrimination (slide 38)
PCA/KLT does not use any class labels in the construction of the transform. The resulting features may obscure the existence of separate groups.

PCA Summary (slide 39)
- Unsupervised: no assumption about the existence or nature of groupings within the data.
- PCA is similar to learning a Gaussian distribution for the data.
- Optimal basis for compression (if measured via MSE).
- As far as dimensionality reduction is concerned, this process is distribution-free, i.e. it is a mathematical method without an underlying statistical model.
- Extracted features (PCs) often lack intuition.

PCA and Neural Networks (slide 40)
A three-layer NN with linear hidden units, trained as an auto-encoder, develops an internal representation that corresponds to the principal components of the full data set. The transformation F is a linear projection onto a k-dimensional subspace (Duda, Hart and Stork: chapter 10.13.1).
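A small numpy sketch illustrating this statement (my own toy setup with plain gradient descent, not a particular textbook recipe): a linear auto-encoder with a k-unit hidden layer, trained to reconstruct its input, ends up spanning the same subspace as the top-k principal components.

    # Linear auto-encoder vs. PCA: after training, the decoder columns span the
    # same k-dimensional subspace as the top-k eigenvectors of the data covariance.
    import numpy as np

    rng = np.random.default_rng(9)
    C_true = np.diag([9.0, 4.0, 1.0, 0.25])
    X = np.linalg.cholesky(C_true) @ rng.normal(size=(4, 2000))   # zero-mean toy data
    n, k = X.shape[1], 2

    W1 = 0.1 * rng.normal(size=(k, 4))             # encoder
    W2 = 0.1 * rng.normal(size=(4, k))             # decoder
    lr = 0.01
    for _ in range(3000):                          # gradient descent on ||X - W2 W1 X||^2 / n
        R = X - W2 @ (W1 @ X)                      # reconstruction residual
        W2 += lr * (R @ (W1 @ X).T) / n
        W1 += lr * (W2.T @ R @ X.T) / n

    # Compare subspaces via principal angles: cosines close to 1 mean the same subspace.
    pcs = np.linalg.eigh((X @ X.T) / n)[1][:, -k:]            # top-k principal axes
    Q, _ = np.linalg.qr(W2)                                   # orthonormal basis of the decoder span
    print(np.linalg.svd(pcs.T @ Q, compute_uv=False))         # close to [1. 1.]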