Lecture VIII Dim. Reduction (I)


Lecture VIII: Dimensionality Reduction (I)
Contents: Subset Selection & Shrinkage (Ridge Regression, Lasso); PCA, PCR, PLS
Lecture VIII: MLSC - Dr. Sethu Vijayakumar

Data From Human Movement
- Measure arm movement and full-body movement of humans and anthropomorphic robots
- Perform local dimensionality analysis with a growing variational mixture of factor analyzers

Dimensionality of Full Body Motion
[Figure: Global PCA - cumulative variance vs. number of retained dimensions; Local FA - probability vs. local dimensionality]
About 8 dimensions in the space formed by joint positions, velocities, and accelerations are needed to model an inverse dynamics model.

Dimensionality Reduction
Goals:
- fewer dimensions for subsequent processing
- better numerical stability due to removal of correlations
- simpler post-processing due to the advantageous statistical properties of the preprocessed data
- don't lose important information, only redundant or irrelevant information
- perform dimensionality reduction spatially localized for nonlinear problems (e.g., each RBF has its own local dimensionality reduction)

Subset Selection & Shrinkage Methods
Subset Selection
- Refers to methods which select a set of variables to be included and discard the other dimensions based on some optimality criterion, then perform regression on this reduced input set.
- Discrete methods: variables are either selected or discarded
  - leaps and bounds procedure (Furnival & Wilson, 1974)
  - Forward/Backward stepwise selection (a minimal sketch of forward selection follows)
Shrinkage Methods
- Refers to methods which reduce/shrink the redundant or irrelevant variables in a more continuous fashion.
  - Ridge Regression
  - Lasso
Derived Input Direction Methods
- PCR, PLS
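To make the discrete flavour of subset selection concrete, here is a minimal NumPy sketch of greedy forward stepwise selection (not the leaps-and-bounds procedure); the function name `forward_stepwise` and its interface are assumptions of this illustration, not part of the lecture.

```python
import numpy as np

def forward_stepwise(X, t, k):
    """Greedy forward selection: repeatedly add the variable (column of X)
    that most reduces the residual sum of squares, until k are selected."""
    n, m = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(m):
            if j in selected:
                continue
            cols = X[:, selected + [j]]
            w, *_ = np.linalg.lstsq(cols, t, rcond=None)
            rss = np.sum((t - cols @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```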

Ridge Regression
Ridge regression shrinks the coefficients by imposing a penalty on their size. The ridge estimates minimize a penalized residual sum of squares:

$$\hat{w}^{ridge} = \arg\min_w \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right\}$$

where $\lambda$ is a complexity parameter controlling the amount of shrinkage.

Equivalent representation:

$$\hat{w}^{ridge} = \arg\min_w \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} w_j^2 \le s$$

Ridge Regression (cont'd)
Some notes:
- When there are many correlated variables in a regression problem, their coefficients can be poorly determined and exhibit high variance: e.g., a wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.
- The bias term $w_0$ is not included in the penalty.
- When the inputs are orthogonal, the ridge estimates are just a scaled version of the least-squares estimates.

Matrix representation of the criterion and its solution:

$$J(w, \lambda) = (t - Xw)^T (t - Xw) + \lambda\, w^T w$$
$$\hat{w}^{ridge} = (X^T X + \lambda I)^{-1} X^T t$$
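A minimal NumPy sketch of the closed-form ridge solution above; the function name and the convention of centring the data so that the unpenalized bias $w_0$ can be recovered from the means are assumptions of this example, not part of the lecture.

```python
import numpy as np

def ridge_fit(Xc, tc, lam):
    """Closed-form ridge solution w = (X^T X + lambda I)^{-1} X^T t,
    applied to centred inputs Xc and centred targets tc so that the
    (unpenalised) bias term stays out of the penalty."""
    M = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(M), Xc.T @ tc)

# usage sketch: centre, fit, then recover the bias from the means
# w = ridge_fit(X - X.mean(0), t - t.mean(), lam=1.0)
# w0 = t.mean() - X.mean(0) @ w
```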

Ridge Regression (cont'd)
Ridge regression shrinks the directions with least variance the most.

Shrinkage factor: each direction is shrunk by
$$\frac{d_j^2}{d_j^2 + \lambda},$$
where $d_j$ is the corresponding singular value of $X$ (so $d_j^2$ is an eigenvalue of $X^T X$). [Elem. Stat. Learning, Ch. 3]

Effective degrees of freedom:
$$df(\lambda) = \mathrm{tr}\left[ X (X^T X + \lambda I)^{-1} X^T \right] = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}$$
[Elem. Stat. Learning, Ch. 3]

Note: $df(\lambda) = M$ if $\lambda = 0$ (no regularization).
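The shrinkage factors and effective degrees of freedom can be read off the singular values of the centred input matrix; a small sketch follows (the helper name is assumed for this example).

```python
import numpy as np

def ridge_df(Xc, lam):
    """Effective degrees of freedom df(lambda) = sum_j d_j^2 / (d_j^2 + lambda),
    where d_j are the singular values of the centred input matrix Xc."""
    d = np.linalg.svd(Xc, compute_uv=False)
    shrinkage = d**2 / (d**2 + lam)   # per-direction shrinkage factors
    return shrinkage.sum()            # equals M (number of columns) when lam == 0
```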

Lasso
The lasso is also a shrinkage method like ridge regression. It minimizes a penalized residual sum of squares:

$$\hat{w}^{lasso} = \arg\min_w \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} |w_j| \right\}$$

Equivalent representation:

$$\hat{w}^{lasso} = \arg\min_w \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} |w_j| \le s$$

The L2 ridge penalty is replaced by the L1 lasso penalty. This makes the solution nonlinear in $t$, and there is no closed-form expression as in ridge regression.
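Because of the L1 penalty the lasso has no closed-form solution, and the lecture does not prescribe a solver; the sketch below uses one simple option, proximal gradient descent (ISTA) with soft-thresholding. The function name, step-size choice, and iteration count are assumptions of this example.

```python
import numpy as np

def lasso_ista(X, t, lam, n_iter=1000):
    """Minimise 0.5*||X w - t||^2 + lam*||w||_1 by iterating a gradient
    step on the squared error followed by soft-thresholding of w."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2                        # 1 / Lipschitz constant
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - t))                        # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return w
```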

Derived Input Direction Methods
These methods essentially involve transforming the input directions into some low-dimensional representation and then performing the regression on these derived directions.
Example methods:
- Principal Components Regression (PCR): based on input variance only
- Partial Least Squares (PLS): based on input-output correlation and output variance

Principal Component Analysis
From earlier discussions: PCA finds the eigenvectors of the correlation (covariance) matrix of the data.
Here, we show how PCA can be learned by bottleneck neural networks (auto-associators) and Hebbian learning frameworks.

PCA Implementation: Hebbian Learning
- obtain a measure of familiarity of a new data point
- the more familiar a data point, the larger the output y
A single linear unit with inputs $x_1, x_2, \ldots, x_d$ and output $y = w^T x$, trained with the Hebbian update
$$w^{n+1} = w^n + \alpha\, y\, x$$

Hebbian Learning
Properties of Hebbian learning:
- unstable learning rule (at most neutrally stable)
- finds the direction of maximal variance of the data
With $y = w^T x$, the rule performs gradient ascent on
$$J = \tfrac{1}{2}\, y^2 = \tfrac{1}{2}\, w^T x x^T w, \qquad \frac{\partial J}{\partial w} = x x^T w = y\, x, \qquad E\left\{ \frac{\partial J}{\partial w} \right\} = E\{ x x^T \}\, w = C\, w,$$
where $C$ is the correlation matrix. The update is $\Delta w = \alpha\, y\, x$.
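A small numerical illustration of these two properties (the toy data, step size, and seed are chosen only for this sketch): under the plain Hebbian rule the norm of w grows without bound, while its direction aligns with the leading eigenvector of the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) * np.array([2.0, 0.5])   # zero-mean data, variances 4 and 0.25

w = rng.normal(size=2)
alpha = 0.005
for x in X:
    y = w @ x
    w += alpha * y * x                                   # plain Hebbian update

print(np.linalg.norm(w))           # huge: the rule is unstable
print(w / np.linalg.norm(w))       # direction ~ (+/-1, 0), the maximal-variance direction
```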

Fixing the Instability (Oja's Rule)
Renormalization:
$$w^{n+1} = \frac{w^n + \alpha\, y\, x}{\left\| w^n + \alpha\, y\, x \right\|}$$
Oja's rule (a first-order approximation of the renormalized update):
$$\Delta w = w^{n+1} - w^n = \alpha\, y\,( x - y\, w ) = \alpha\,( y\, x - y^2 w )$$
Verify Oja's rule (does it do the right thing?):
$$E\{\Delta w\} = \alpha\, E\{ y\, x - y^2 w \} = \alpha\, E\{ x x^T w - (w^T x x^T w)\, w \} = \alpha\,\big( C w - (w^T C w)\, w \big)$$
After convergence, $E\{\Delta w\} = 0$, so $C w = (w^T C w)\, w$; thus $w$ is an eigenvector of $C$ with eigenvalue $\lambda = w^T C w$, and it follows that $w^T w = 1$.
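The same kind of experiment with Oja's rule (again a toy sketch with assumed data, step size, and seed) converges to a unit-norm estimate of the leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0],
              [1.0, 1.0]])                               # true correlation matrix
X = rng.multivariate_normal(np.zeros(2), C, size=5000)

w = rng.normal(size=2)
alpha = 0.01
for x in X:
    y = w @ x
    w += alpha * y * (x - y * w)                         # Oja's rule

_, evecs = np.linalg.eigh(C)
print(np.linalg.norm(w))                                 # ~1: the norm stays bounded
print(w, evecs[:, -1])                                   # w ~ +/- leading eigenvector of C
```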

Application Examples of Oja's Rule
Oja's rule finds the direction (line) which minimizes the sum of the squared distances from each point to its orthogonal projection onto that line.
Connection to the Singular Value Decomposition (SVD): $X = U D V^T$; for mean-zero data, the first principal direction is the leading right singular vector.

PCA: Batch vs. Stochastic
PCA in batch update:
- just subtract the mean from the data
- calculate the eigenvectors to obtain the principal components (Matlab function "eig")
PCA in stochastic update:
- stochastically estimate the mean and subtract it from the input data
- use Oja's rule on the mean-subtracted data
$$\bar{x}^{n+1} = \bar{x}^n + \frac{1}{n+1}\left( x - \bar{x}^n \right) \quad \text{or, with a constant rate,} \quad \bar{x}^{n+1} = \bar{x}^n + \alpha\left( x - \bar{x}^n \right)$$
A sketch of both variants follows.
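A NumPy sketch of the two variants (the function names and the constant-rate mean estimate are illustrative assumptions):

```python
import numpy as np

def batch_pca(X, k):
    """Batch PCA: subtract the mean, eigendecompose the covariance
    (Matlab 'eig', here numpy.linalg.eigh), keep the top-k eigenvectors."""
    Xc = X - X.mean(axis=0)
    _, evecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    return evecs[:, ::-1][:, :k]          # columns ordered by decreasing eigenvalue

def stochastic_pca_step(w, x_bar, x, alpha):
    """One stochastic step: update the running mean estimate, then apply
    Oja's rule to the mean-subtracted sample."""
    x_bar = x_bar + alpha * (x - x_bar)   # stochastic mean estimate
    xc = x - x_bar
    y = w @ xc
    w = w + alpha * y * (xc - y * w)      # Oja's rule on mean-subtracted data
    return w, x_bar
```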

Autoencoder as Motivation for Oja's Rule
Note that Oja's rule looks like a supervised learning rule: the update looks like a reverse delta rule, since it depends on the difference between the actual input and the back-propagated output,
$$\Delta w = w^{n+1} - w^n = \alpha\, y\,( x - w\, y ).$$
Idea: convert unsupervised learning into a supervised learning problem by trying to reconstruct the inputs from a few features!

Learning in the Autoencoder
Minimize the reconstruction cost
$$J = \tfrac{1}{2}\,( x - \hat{x} )^T ( x - \hat{x} ) = \tfrac{1}{2}\,( x - w\, y )^T ( x - w\, y ), \qquad y = w^T x, \quad \hat{x} = w\, y.$$
Taking the gradient with respect to $w$:
$$-\frac{\partial J}{\partial w} = y\,( x - w\, y ) + \big( (x - w\, y)^T w \big)\, x,$$
and the second term vanishes when $\|w\| = 1$ (since $(x - w\, y)^T w = y\,(1 - w^T w)$), so the gradient-descent update becomes
$$\Delta w = -\alpha\, \frac{\partial J}{\partial w} = \alpha\, y\,( x - w\, y ).$$
This is Oja's rule!
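A quick numerical check of this derivation (dimensions, seed, and tolerance are chosen for the sketch): at a unit-norm w, the negative numerical gradient of the reconstruction cost coincides with Oja's update direction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
w = rng.normal(size=4)
w /= np.linalg.norm(w)                       # the simplification assumes ||w|| = 1

def J(w, x):
    y = w @ x
    e = x - w * y                            # reconstruction error x - w(w^T x)
    return 0.5 * e @ e

eps = 1e-6
num_grad = np.array([(J(w + eps * np.eye(4)[i], x) - J(w - eps * np.eye(4)[i], x)) / (2 * eps)
                     for i in range(4)])

y = w @ x
oja = y * (x - y * w)                        # Oja's update direction
print(np.allclose(-num_grad, oja, atol=1e-5))   # True
```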

PCA with More Than One Feature
Oja's rule in multiple dimensions (the rows of $W$ are the weight vectors):
$$y = W x, \qquad \hat{x} = W^T y, \qquad \Delta W = \alpha\, y\,( x - W^T y )^T$$
Sanger's rule: a clever trick added to Oja's rule!
$$y = W x, \qquad \hat{x} = W^T y, \qquad \Delta W_{r,:} = \alpha\, y_r \Big( x - \sum_{i=1}^{r} y_i\, W_{i,:}^T \Big)^T$$
Matlab notation: ΔW(r,:) = alpha * y(r) * (x - W(1:r,:)'*y(1:r))'
This rule makes the rows of W converge to the eigenvectors of the data, ordered in descending sequence according to the magnitude of the eigenvalues.
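A minimal sketch of one Sanger update in NumPy (the helper name is an assumption); all row updates are computed from the current weights before being applied.

```python
import numpy as np

def sanger_step(W, x, alpha):
    """One step of Sanger's rule (generalised Hebbian algorithm).
    Row r is trained on the input minus the reconstruction from rows 1..r,
    so the rows converge to the leading eigenvectors in order of
    decreasing eigenvalue."""
    y = W @ x
    dW = np.zeros_like(W)
    for r in range(W.shape[0]):
        recon = W[: r + 1].T @ y[: r + 1]     # reconstruction using rows 1..r only
        dW[r] = alpha * y[r] * (x - recon)
    return W + dW
```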

Discussion about PCA
- If the data is noisy, we may represent the noise instead of the data.
  - The way out: Factor Analysis (which also models noise in the input dimensions)
- PCA has no data-generating model and no probabilistic interpretation (not quite true!).
- PCA ignores the possible influence of subsequent (e.g., supervised) learning steps.
- PCA is a linear method. Way out: nonlinear PCA.
- PCA can converge very slowly. Way out: EM versions of PCA.
But PCA is a very reliable method for dimensionality reduction when it is appropriate!

PCA Preprocessing for Supervised Learning
In batch learning notation for a linear network:
$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad U = \mathrm{eigenvectors}\left( \frac{1}{N} \sum_{n=1}^{N} ( x_n - \bar{x} )( x_n - \bar{x} )^T \right)_{\max(1:k)} = \mathrm{eigenvectors}\left( \frac{1}{N} \tilde{X}^T \tilde{X} \right)_{\max(1:k)}$$
where $\tilde{X}$ contains the mean-zero data.
Subsequent linear regression for the network weights:
$$W = U \left( U^T \tilde{X}^T \tilde{X}\, U \right)^{-1} U^T \tilde{X}^T Y$$
NOTE: Inversion of the above matrix is very cheap since it is diagonal! No numerical problems!
Problem with this preprocessing: important regression data might have been clipped!
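A NumPy sketch of this preprocessing recipe (the function name is assumed; inputs and outputs are both centred so that the bias can be handled via the means):

```python
import numpy as np

def pca_then_regress(X, Y, k):
    """Project the centred inputs onto the top-k eigenvectors U of the
    input covariance, then solve W = U (U^T X^T X U)^{-1} U^T X^T Y.
    The k x k matrix being inverted is diagonal up to numerical error."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    _, evecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    U = evecs[:, ::-1][:, :k]                        # top-k eigenvectors
    S = U.T @ Xc.T @ Xc @ U                          # ~ diagonal matrix of eigenvalues
    return U @ np.linalg.solve(S, U.T @ Xc.T @ Yc)
```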

Principal Component Regression (PCR)
Build the input matrix and target vector:
$$X = ( \tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n )^T, \qquad t = ( t_1, t_2, \ldots, t_n )^T$$
Eigendecomposition:
$$X^T X = V D^2 V^T$$
Data projection onto the k directions with the largest eigenvalues:
$$U = [\, v_1\ v_2 \cdots v_k \,]$$
Compute the linear model:
$$\hat{w}^{pcr} = U \left( U^T X^T X\, U \right)^{-1} U^T X^T t$$
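The same computation specialised to a single target vector, written via the SVD to make the cheap (diagonal) inversion explicit; the helper name is an assumption of this sketch.

```python
import numpy as np

def pcr_fit(Xc, t, k):
    """PCR using the SVD of the centred inputs: the projected Gram matrix
    is diag(d_1^2, ..., d_k^2), so its inverse is simply 1/d_j^2."""
    _, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk, dk = Vt[:k].T, d[:k]                   # top-k principal directions and singular values
    return Vk @ ((Vk.T @ (Xc.T @ t)) / dk**2)  # w_pcr = V_k diag(1/d_j^2) V_k^T X^T t
```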

PCA in Joint Data Space
A straightforward extension to take the supervised learning step into account:
- perform PCA in the joint (x, y) data space
- extract the linear weight parameters from the PCA results

PCA in Joint Data Space: Formalization
Stack inputs and outputs into a joint vector and perform PCA:
$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad U = \mathrm{eigenvectors}\left( \frac{1}{N} \sum_{n=1}^{N} ( z_n - \bar{z} )( z_n - \bar{z} )^T \right)_{\max(1:k)} = \mathrm{eigenvectors}\left( \frac{1}{N} \tilde{Z}^T \tilde{Z} \right)_{\max(1:k)}$$
Partition the eigenvectors as $U = \begin{pmatrix} U_x \\ U_y \end{pmatrix}$, with $U_x$ of size $d \times k$ and $U_y$ of size $c \times k$. Since the joint data are taken to lie in the span of $U$, we have $x = U_x a$ and $y = U_y a$, and the network weights become
$$W = U_y \left( U_x^T U_x \right)^{-1} U_x^T = U_y \left( I - U_y^T U_y \right)^{-1} U_x^T .$$
Note: this new kind of linear network can actually tolerate noise in the input data, but only the same noise level in all (joint) dimensions!
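A NumPy sketch of the joint-space construction under the reconstruction reading given above (the function name is an assumption of this sketch):

```python
import numpy as np

def joint_pca_weights(X, Y, k):
    """PCA in joint (x, y) space: stack inputs and outputs, take the top-k
    eigenvectors U = [Ux; Uy] of the joint covariance, and read off the
    linear map y = W x implied by x = Ux a, y = Uy a."""
    Z = np.column_stack([X, Y])
    Zc = Z - Z.mean(axis=0)
    _, evecs = np.linalg.eigh(Zc.T @ Zc / len(Z))
    U = evecs[:, ::-1][:, :k]
    d = X.shape[1]
    Ux, Uy = U[:d], U[d:]
    return Uy @ np.linalg.solve(Ux.T @ Ux, Ux.T)     # W = Uy (Ux^T Ux)^{-1} Ux^T
```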