Lecture VIII: Dimensionality Reduction (I)
Contents: Subset Selection & Shrinkage; Ridge Regression; Lasso; PCA, PCR, PLS
Lecture VIII: MLSC - Dr. Sethu Vijayakumar
Data From Human Movement
Measure arm movement and full-body movement of humans and anthropomorphic robots. Perform local dimensionality analysis with a growing variational mixture of factor analyzers.
Dimensionality of Full Body Motion
[Figure: cumulative variance vs. number of retained dimensions for global PCA, and a histogram of local dimensionality (probability vs. dimensionality) for local FA]
About 8 dimensions in the space formed by joint positions, velocities, and accelerations are needed to model an inverse dynamics model.
Dimensionality Reduction
Goals:
- fewer dimensions for subsequent processing
- better numerical stability due to removal of correlations
- simplified post-processing due to improved statistical properties of the preprocessed data
- don't lose important information, only redundant or irrelevant information
- perform dimensionality reduction spatially localized for nonlinear problems (e.g., each RBF has its own local dimensionality reduction)
Subset Selection & Shrinkage Methods
Subset Selection
- Refers to methods which select a set of variables to be included, discard the other dimensions based on some optimality criterion, and do regression on this reduced-dimensional input set.
- Discrete method: variables are either selected or discarded
- Leaps and bounds procedure (Furnival & Wilson, 1974)
- Forward/Backward stepwise selection (a sketch follows below)
Shrinkage Methods
- Refers to methods which reduce/shrink the redundant or irrelevant variables in a more continuous fashion.
- Ridge Regression
- Lasso
Derived Input Direction Methods
- PCR, PLS
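As an illustration of discrete subset selection, here is a minimal sketch of forward stepwise selection in Python/NumPy (not from the original slides; function and variable names are my own): at each step it adds the input dimension that most reduces the residual sum of squares.

import numpy as np

def forward_stepwise(X, t, k):
    """Greedily select k columns of X for least-squares regression on t."""
    N, M = X.shape
    selected, remaining = [], list(range(M))
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in remaining:
            Xs = X[:, selected + [j]]
            w, *_ = np.linalg.lstsq(Xs, t, rcond=None)
            rss = np.sum((t - Xs @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
t = 3 * X[:, 2] - 2 * X[:, 5] + 0.1 * rng.standard_normal(100)
print(forward_stepwise(X, t, k=2))   # typically selects columns 2 and 5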
Ridge Regression
Ridge regression shrinks the coefficients by imposing a penalty on their size. They minimize a penalized residual sum of squares:
$$\hat{w}^{\text{ridge}} = \arg\min_{w} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right\}$$
where λ is the complexity parameter controlling the amount of shrinkage.
Equivalent representation:
$$\hat{w}^{\text{ridge}} = \arg\min_{w} \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} w_j^2 \le s$$
Ridge Regression (cont'd)
Some notes:
- When there are many correlated variables in a regression problem, their coefficients can be poorly determined and exhibit high variance: a wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.
- The bias term is not included in the penalty.
- When the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimates.
Matrix representation of the criterion and its solution (see the sketch below):
$$J(w, \lambda) = (t - Xw)^{T}(t - Xw) + \lambda\, w^{T}w$$
$$\hat{w}^{\text{ridge}} = (X^{T}X + \lambda I)^{-1} X^{T} t$$
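A minimal NumPy sketch of the closed-form ridge solution above (my own illustration, not the slides' code), assuming the inputs and targets have been centered so the bias term can be handled separately:

import numpy as np

def ridge(X, t, lam):
    """Closed-form ridge solution w = (X'X + lambda*I)^-1 X't for centered X, t."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ t)

# Example: coefficients shrink toward zero as lambda grows
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge(X, t, lam), 3))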
Ridge Regression (cont'd)
Ridge regression shrinks the directions with least variance the most.
Shrinkage factor: each direction j is shrunk by
$$\frac{d_j^2}{d_j^2 + \lambda}$$
where $d_j$ is the corresponding singular value of X (so $d_j^2$ is an eigenvalue of $X^{T}X$). [Elem. Stat. Learning]
Effective degrees of freedom:
$$\mathrm{df}(\lambda) = \mathrm{tr}\big[X(X^{T}X + \lambda I)^{-1}X^{T}\big] = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}$$
[Page 63: Elem. Stat. Learning]
Note: df(λ) = M if λ = 0 (no regularization).
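A small sketch (my own, not from the slides) that computes df(λ) from the singular values of X and checks it against the trace of the hat matrix:

import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom df(lambda) = sum_j d_j^2 / (d_j^2 + lambda)."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values d_j
    return np.sum(d**2 / (d**2 + lam))

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 6))
for lam in (0.0, 1.0, 10.0):
    # direct check via the trace of the hat matrix X (X'X + lam I)^-1 X'
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T)
    print(lam, round(ridge_df(X, lam), 3), round(np.trace(H), 3))   # df = M = 6 at lam = 0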
Lasso
Lasso is also a shrinkage method like ridge regression. It minimizes a penalized residual sum of squares:
$$\hat{w}^{\text{lasso}} = \arg\min_{w} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} |w_j| \right\}$$
Equivalent representation:
$$\hat{w}^{\text{lasso}} = \arg\min_{w} \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} |w_j| \le s$$
The L2 ridge penalty is replaced by the L1 lasso penalty. This makes the solution non-linear in t, and there is no closed-form expression as in ridge regression.
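Since the lasso has no closed form, here is a minimal coordinate-descent sketch in NumPy (my own illustration, not the slides' method; it assumes centered data), using a soft-thresholding update for each coordinate:

import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, t, lam, n_iters=200):
    """Coordinate descent for min_w ||t - Xw||^2 + lam * sum_j |w_j| (X, t centered)."""
    N, M = X.shape
    w = np.zeros(M)
    col_sq = np.sum(X**2, axis=0)           # x_j' x_j for each column
    for _ in range(n_iters):
        for j in range(M):
            r = t - X @ w + X[:, j] * w[j]  # partial residual excluding coordinate j
            w[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / col_sq[j]
    return w

# Example: with a large enough lambda, some coefficients become exactly zero
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
t = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.standard_normal(100)
print(np.round(lasso_cd(X, t, lam=20.0), 3))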
Derived Input Direction Methods
These methods essentially involve transforming the input directions to some low-dimensional representation and using these directions to perform the regression.
Example Methods:
- Principal Components Regression (PCR): based on input variance only
- Partial Least Squares (PLS): based on input-output correlation and output variance
Principal Component Analysis
From earlier discussions: PCA finds the eigenvectors of the correlation (covariance) matrix of the data.
Here, we show how PCA can be learned by bottleneck neural networks (auto-associators) and Hebbian learning frameworks.
PCA Implementation: Hebbian Learning
- Obtain a measure of familiarity of a new data point: the more familiar a data point, the larger the output y of a single linear unit with inputs x_1, x_2, x_3, ..., x_d.
- Hebbian update rule:
$$w^{n+1} = w^{n} + \alpha\, y\, x$$
Hebbian Learning
Properties of Hebbian Learning:
- unstable learning rule (at most neutrally stable)
- finds the direction of maximal variance of the data
It performs stochastic gradient ascent on
$$J = \tfrac{1}{2}\, y^2 = \tfrac{1}{2}\, w^{T} x\, x^{T} w, \qquad \Delta w = \alpha\, \frac{\partial J}{\partial w} = \alpha\, y\, x$$
$$E\left\{\frac{\partial J}{\partial w}\right\} = E\{x x^{T} w\} = E\{x x^{T}\}\, w = C\, w$$
Here, C is the correlation matrix.
Fixing the Instability (Oja's Rule)
Renormalization:
$$w^{n+1} = \frac{w^{n} + \alpha\, y\, x}{\lVert w^{n} + \alpha\, y\, x \rVert}$$
Oja's Rule (a first-order approximation of the renormalized update):
$$\Delta w = w^{n+1} - w^{n} = \alpha\, y\,(x - y\, w) = \alpha\,(y\, x - y^{2}\, w)$$
Verify Oja's Rule (does it do the right thing?):
$$E\{\Delta w\} = \alpha\, E\{x x^{T} w - (w^{T} x x^{T} w)\, w\} = \alpha\,(C w - (w^{T} C w)\, w)$$
After convergence, E{Δw} = 0:
$$C w = (w^{T} C w)\, w = \lambda\, w, \quad \text{thus } w \text{ is an eigenvector of } C,$$
$$\text{and } w^{T} w = \frac{w^{T} C w}{\lambda} = 1, \quad \text{thus } w \text{ has unit length.}$$
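A small numerical check (my own sketch, not part of the slides) that Oja's rule drives w toward a unit-length leading eigenvector of the data correlation matrix:

import numpy as np

rng = np.random.default_rng(4)
# Zero-mean 2-D data with most variance along the direction (1, 1)
A = np.array([[2.0, 1.5], [1.5, 2.0]])
X = rng.standard_normal((5000, 2)) @ np.linalg.cholesky(A).T

w = rng.standard_normal(2)
alpha = 0.01
for x in X:
    y = w @ x
    w += alpha * y * (x - y * w)        # Oja's rule

C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)
print("Oja direction :", w / np.linalg.norm(w))
print("Top eigenvector:", eigvecs[:, -1])   # may differ from the above by sign
print("||w|| =", np.linalg.norm(w))         # should be close to 1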
Application Examples of Oja's Rule
- Finds the direction (line) which minimizes the sum of the total squared distances from each point to its orthogonal projection onto the line.
- X = U D Vᵀ: the Singular Value Decomposition (SVD)
PCA: Batch vs. Stochastic
PCA in Batch Update:
- just subtract the mean from the data
- calculate eigenvectors to obtain principal components (Matlab function 'eig')
PCA in Stochastic Update (see the sketch below):
- stochastically estimate the mean and subtract it from the input data
- use Oja's rule on the mean-subtracted data
$$\bar{x}^{n+1} = \bar{x}^{n} + \frac{1}{n+1}\big(x^{n+1} - \bar{x}^{n}\big), \qquad \tilde{x}^{n+1} = x^{n+1} - \bar{x}^{n+1}$$
$$w^{n+1} = w^{n} + \alpha\, y\,\big(\tilde{x}^{n+1} - y\, w^{n}\big), \qquad y = (w^{n})^{T}\tilde{x}^{n+1}$$
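A minimal sketch (my own, not the slides') of the stochastic variant: estimate the mean online, apply Oja's rule on the mean-subtracted sample, and compare against the batch eigenvector:

import numpy as np

rng = np.random.default_rng(5)
mean_true = np.array([5.0, -3.0])
A = np.array([[3.0, 1.0], [1.0, 1.0]])
X = mean_true + rng.standard_normal((10000, 2)) @ np.linalg.cholesky(A).T

# Stochastic update: running mean estimate + Oja's rule on centered samples
x_bar = np.zeros(2)
w = rng.standard_normal(2)
alpha = 0.005
for n, x in enumerate(X):
    x_bar += (x - x_bar) / (n + 1)       # stochastic mean estimate
    x_tilde = x - x_bar
    y = w @ x_tilde
    w += alpha * y * (x_tilde - y * w)   # Oja's rule

# Batch update: subtract the mean, eigendecompose the covariance
Xc = X - X.mean(axis=0)
_, V = np.linalg.eigh(Xc.T @ Xc / len(X))
print("stochastic PC1:", w / np.linalg.norm(w))
print("batch PC1     :", V[:, -1])       # may differ from the above by sign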
Autoencoder as Motivation for Oja's Rule
Note that Oja's rule looks like a supervised learning rule: the update looks like a reverse delta-rule, i.e., it depends on the difference between the actual input and the back-propagated output:
$$\Delta w = w^{n+1} - w^{n} = \alpha\, y\,(x - y\, w)$$
Idea: convert unsupervised learning into a supervised learning problem by trying to reconstruct the inputs from few features!
Learning in the Autoencoder
Minimize the reconstruction cost (with feature y = wᵀx and reconstruction x̂ = w y):
$$J = \tfrac{1}{2}\,(x - \hat{x})^{T}(x - \hat{x}) = \tfrac{1}{2}\,(x - w\, y)^{T}(x - w\, y)$$
$$\frac{\partial J}{\partial w} = -(x - w\, y)\, y - \big(w^{T}(x - w\, y)\big)\, x = -y\,(x - y\, w) - y\,(1 - w^{T}w)\, x \approx -y\,(x - y\, w)$$
(the second term vanishes when w has unit length)
$$\Delta w = -\alpha\, \frac{\partial J}{\partial w} = \alpha\, y\,(x - y\, w)$$
This is Oja's rule!
PCA with More Than One Feature
Oja's Rule in Multiple Dimensions:
$$y = W x, \qquad \hat{x} = W^{T} y, \qquad \Delta W = \alpha\, y\,(x - W^{T} y)^{T}$$
Sanger's Rule: a clever trick added to Oja's rule! (see the sketch below)
$$y = W x, \qquad \Delta W_{(r,:)} = \alpha\, y_r \Big(x - \sum_{i=1}^{r} W_{(i,:)}^{T}\, y_i\Big)^{T} = \alpha\, y_r \big(x - W_{(1:r,:)}^{T}\, y_{(1:r)}\big)^{T}$$
Matlab notation: ΔW(r,:) = α * y(r) * (x − W(1:r,:)' * y(1:r))'
This rule makes the rows of W become the eigenvectors of the data, ordered in descending sequence according to the magnitude of the eigenvalues.
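A compact NumPy sketch (my own illustration, not the slides' code) of Sanger's rule, extracting the top two principal directions of zero-mean data:

import numpy as np

def sanger_step(W, x, alpha):
    """One Sanger's-rule update: rows of W move toward the leading eigenvectors."""
    y = W @ x                               # k features
    dW = np.zeros_like(W)
    for r in range(W.shape[0]):
        recon = W[:r + 1].T @ y[:r + 1]     # reconstruction from features 1..r
        dW[r] = alpha * y[r] * (x - recon)
    return W + dW

rng = np.random.default_rng(6)
A = np.array([[4.0, 1.0, 0.0], [1.0, 2.0, 0.5], [0.0, 0.5, 1.0]])
X = rng.standard_normal((20000, 3)) @ np.linalg.cholesky(A).T   # zero-mean data

W = 0.1 * rng.standard_normal((2, 3))
for x in X:
    W = sanger_step(W, x, alpha=0.002)

_, V = np.linalg.eigh(X.T @ X / len(X))
print("Sanger rows:\n", W / np.linalg.norm(W, axis=1, keepdims=True))
print("Top 2 eigenvectors as rows:\n", V[:, ::-1][:, :2].T)    # may differ by sign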
Discussion about PCA
- If the data is noisy, we may represent the noise instead of the data. The way out: Factor Analysis (handles noise in the input dimensions as well).
- PCA has no data-generating model.
- PCA has no probabilistic interpretation (not quite true!!)
- PCA ignores the possible influence of subsequent (e.g., supervised) learning steps.
- PCA is a linear method. Way out: Nonlinear PCA.
- PCA can converge very slowly. Way out: EM versions of PCA.
But PCA is a very reliable method for dimensionality reduction if it is appropriate!
PCA Preprocessing for Supervised Learning
In batch learning notation for a linear network:
$$U = \text{eigenvectors}_{\max(1:k)}\left(\frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^{T}\right) = \text{eigenvectors}_{\max(1:k)}\left(\frac{1}{N}\,\tilde{X}^{T}\tilde{X}\right)$$
where X̃ contains the mean-zero data.
Subsequent linear regression for the network weights:
$$W = \big(U^{T}\tilde{X}^{T}\tilde{X}\, U\big)^{-1} U^{T}\tilde{X}^{T} Y$$
NOTE: Inversion of the above matrix is very cheap since it is diagonal! No numerical problems!
Problem of this pre-processing: important regression data might have been clipped!
Principal Component Regression (PCR)
Build the matrix X and the target vector t:
$$X = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N)^{T}, \qquad t = (t_1, t_2, \ldots, t_N)^{T}$$
Eigen-decomposition:
$$X^{T}X = V D^{2} V^{T}$$
Data projection: U = [v_1, v_2, ..., v_k], the eigenvectors with the k largest eigenvalues.
Compute the linear model (a sketch follows below):
$$\hat{w}^{\text{pcr}} = (U^{T}X^{T}X\, U)^{-1}\, U^{T} X^{T} t$$
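A short NumPy sketch (my own, not the slides') of these PCR steps: center the data, project onto the top-k eigenvectors of XᵀX, and regress on the projected inputs:

import numpy as np

def pcr(X, t, k):
    """Principal component regression with k retained components (data is centered inside)."""
    Xc = X - X.mean(axis=0)
    tc = t - t.mean()
    eigvals, V = np.linalg.eigh(Xc.T @ Xc)      # ascending eigenvalues
    U = V[:, np.argsort(eigvals)[::-1][:k]]     # top-k eigenvectors
    Z = Xc @ U                                  # projected data
    w_z = np.linalg.solve(Z.T @ Z, Z.T @ tc)    # cheap: Z'Z is diagonal
    return U @ w_z                              # weights expressed in the original input space

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 10))
t = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)
print(np.round(pcr(X, t, k=3), 3))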
PCA in Joint Data Space
A straightforward extension to take the supervised learning step into account:
- perform PCA in the joint data space of inputs x and outputs y
- extract the linear weight parameters from the PCA results
PCA in Joint Data Space: Formalization
Form the joint data vector and perform PCA on it:
$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad U = \begin{pmatrix} U_x \\ U_y \end{pmatrix} = \text{eigenvectors}_{\max(1:k)}\left(\frac{1}{N}\sum_{n=1}^{N}(z_n - \bar{z})(z_n - \bar{z})^{T}\right)$$
Since the data lies (approximately) in the span of U, x ≈ U_x c and y ≈ U_y c, and the network weights become (see the sketch below):
$$W = U_y\big(U_x^{T}U_x\big)^{-1}U_x^{T} = U_y\big(I - U_y^{T}U_y\big)^{-1}U_x^{T}, \qquad U_x\ (d \times k), \quad U_y\ (c \times k)$$
Note: this new kind of linear network can actually tolerate noise in the input data! But only the same noise level in all (joint) dimensions!
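A minimal NumPy sketch (my own; the weight formula on the slide above is reconstructed, so treat this as an illustration of the idea rather than the slides' exact code): stack inputs and outputs, take the top-k eigenvectors of the joint covariance, and read off the x → y mapping:

import numpy as np

def joint_pca_weights(X, Y, k):
    """Linear map W (y ~ W x) extracted from PCA in the joint (x, y) space."""
    Z = np.hstack([X, Y])
    Zc = Z - Z.mean(axis=0)
    eigvals, V = np.linalg.eigh(Zc.T @ Zc / len(Z))
    U = V[:, np.argsort(eigvals)[::-1][:k]]    # top-k joint eigenvectors
    d = X.shape[1]
    Ux, Uy = U[:d], U[d:]                      # split into input and output parts
    return Uy @ np.linalg.pinv(Ux)             # W = U_y U_x^+, maps x to y

rng = np.random.default_rng(8)
W_true = rng.standard_normal((2, 4))           # 4 inputs, 2 outputs
X_clean = rng.standard_normal((500, 4))
Y_clean = X_clean @ W_true.T
noise = 0.01
X = X_clean + noise * rng.standard_normal(X_clean.shape)
Y = Y_clean + noise * rng.standard_normal(Y_clean.shape)   # equal noise in all joint dimensions
W_est = joint_pca_weights(X, Y, k=4)
print(np.round(W_est - W_true, 2))             # close to zero despite input noise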