Slide10. Haykin Chapter 8: Principal Components Analysis


Slide10
Haykin Chapter 8: Principal Components Analysis
CPSC 636-600, Instructor: Yoonsuck Choe, Spring 2015

Motivation

[Figure: scatter plot of the 2-D data set cloud.dat.]

How can we project the given data so that the variance in the projected points is maximized?

Principal Component Analysis: Variance Probe

X: an m-dimensional random vector (a vector random variable following a certain probability distribution). Assume E[X] = 0.

Projection of X onto a unit vector q (with (q^T q)^{1/2} = 1):

    A = X^T q = q^T X.

We know E[A] = q^T E[X] = 0. The variance can also be calculated:

    σ^2 = E[A^2] = E[(q^T X)(X^T q)] = q^T E[X X^T] q = q^T R q,

where R = E[X X^T] is the covariance matrix.

Principal Component Analysis: Variance Probe (cont'd)

This is a sort of variance probe: ψ(q) = q^T R q. Using different unit vectors q for the projection of the input data points will result in smaller or larger variance in the projected points. With this, we can ask: along which vector directions does the variance probe ψ(q) have extremal values?

The solution is obtained by finding unit vectors satisfying the following condition:

    Rq = λq,

where λ is a scaling factor. This is basically an eigenvalue problem.
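As a quick numerical illustration of the variance probe (a minimal Octave/MATLAB sketch; the synthetic data and all variable names here are assumptions, not material from the slides), the following evaluates ψ(q) = q^T R q for a few unit vectors and compares the result with the largest eigenvalue of R:

    % Minimal sketch, not from the slides: psi(q) = q'*R*q for several unit vectors q.
    X = randn(1000, 2) * [1 0.6; 0 0.3];   % assumed correlated 2-D samples
    X = X - mean(X);                       % enforce zero mean, E[X] = 0
    R = (X' * X) / size(X, 1);             % covariance matrix R = E[X X^T]
    for ang = 0:30:150                     % unit vectors q at a few angles
        q = [cosd(ang); sind(ang)];
        fprintf('angle %3d deg: psi(q) = %.4f\n', ang, q' * R * q);
    end
    fprintf('largest eigenvalue lambda_1 = %.4f\n', max(eig(R)));

The largest printed ψ(q) stays below λ_1 and approaches it as q approaches the corresponding eigenvector, which is the point of the eigenvalue formulation above.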

PCA

With the m×m covariance matrix R, we can get m eigenvectors and eigenvalues:

    R q_j = λ_j q_j,  j = 1, 2, ..., m.

We can sort the eigenvectors/eigenvalues according to the eigenvalues, so that λ_1 > λ_2 > ... > λ_m, and arrange the eigenvectors in a column-wise matrix Q = [q_1, q_2, ..., q_m]. Then we can write

    RQ = QΛ,  where Λ = diag(λ_1, λ_2, ..., λ_m).

Q is orthogonal, so that QQ^T = I; that is, Q^{-1} = Q^T.

PCA: Summary

The eigenvectors of the covariance matrix R of the zero-mean random input vector X define the principal directions q_j along which the variance of the projected inputs has extremal values. The associated eigenvalues define the extremal values of the variance probe.

PCA: Usage

Project the input x onto the principal directions: a = Q^T x. We can also recover the input from the projected point a: x = (Q^T)^{-1} a = Qa. Note that we don't need all principal directions, depending on how much variance is captured in the first few eigenvalues: we can do dimensionality reduction.

PCA: Dimensionality Reduction

Encoding: we can use the first l eigenvectors to encode x:

    [a_1, a_2, ..., a_l]^T = [q_1, q_2, ..., q_l]^T x.

Note that we only need to calculate the l projections a_1, a_2, ..., a_l, where l < m.

Decoding: once [a_1, a_2, ..., a_l]^T is obtained, we want to reconstruct the full x = [x_1, x_2, ..., x_l, ..., x_m]^T:

    x̂ = [q_1, q_2, ..., q_l][a_1, a_2, ..., a_l]^T ≈ x.

Or, alternatively,

    x̂ = Q [a_1, a_2, ..., a_l, 0, 0, ..., 0]^T   (with m − l zeros).
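The encode/decode pipeline above is short enough to write out directly. A minimal Octave/MATLAB sketch (the synthetic data, dimensions, and variable names are assumptions for illustration, not values from the slides):

    % Minimal sketch: encode x with the first l principal directions, then decode.
    N = 2000; m = 5; l = 2;
    X = randn(N, m) * diag([1.0 0.7 0.4 0.2 0.1]);  % assumed (approximately zero-mean) data
    R = (X' * X) / N;                               % covariance matrix
    [Q, L] = eig(R);
    [lambda, idx] = sort(diag(L), 'descend');       % lambda_1 >= lambda_2 >= ...
    Q = Q(:, idx);                                  % columns q_1, ..., q_m
    x  = X(1, :)';                                  % one input vector
    a  = Q(:, 1:l)' * x;                            % encoding: a = [q_1 ... q_l]^T x
    xh = Q(:, 1:l) * a;                             % decoding: xhat = [q_1 ... q_l] a
    fprintf('variance captured by first %d components: %.3f\n', l, sum(lambda(1:l)) / sum(lambda));
    fprintf('reconstruction error ||x - xhat|| = %.4f\n', norm(x - xh));

The captured-variance ratio printed here is exactly the quantity discussed next: the total variance of the first l components over the total variance of all m components.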

PCA: Total Variance

The total variance of the m components of the data vector is

    Σ_{j=1}^m σ_j^2 = Σ_{j=1}^m λ_j.

The truncated version with the first l components has variance

    Σ_{j=1}^l σ_j^2 = Σ_{j=1}^l λ_j.

The larger the variance in the truncated version, i.e., the smaller the variance in the remaining components, the more accurate the dimensionality reduction.

PCA Example

[Figure: plot of the generated 2-D data set.]

    inp = [randn(800,2)/9+0.5; randn(1000,2)/6+ones(1000,2)];

    Q = [0.70285 0.71134; 0.71134 0.70285]   (columns are the principal directions, up to sign),
    λ = diag(0.14425, 0.02161).

PCA's Relation to Neural Networks: Hebbian-Based Maximum Eigenfilter

How does all the above relate to neural networks? A remarkable result by Oja (1982) shows that a single linear neuron with a Hebbian synapse can evolve into a filter for the first principal component of the input distribution!

Activation:

    y = Σ_{i=1}^m w_i x_i.

Learning rule:

    w_i(n+1) = [w_i(n) + η y(n) x_i(n)] / {Σ_{i=1}^m [w_i(n) + η y(n) x_i(n)]^2}^{1/2}.

Hebbian-Based Maximum Eigenfilter (cont'd)

Expanding the denominator as a power series, dropping the higher-order terms, etc., we get

    w_i(n+1) = w_i(n) + η y(n)[x_i(n) − y(n) w_i(n)] + O(η^2),

with O(η^2) including the second- and higher-order effects of η, which we can ignore for small η. Based on that, we get

    w_i(n+1) = w_i(n) + η [y(n) x_i(n) − y^2(n) w_i(n)],

where η y(n) x_i(n) is the Hebbian term and −η y^2(n) w_i(n) is the stabilization term.
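The simplified single-neuron update above is easy to try numerically. A minimal Octave/MATLAB sketch of Oja's maximum eigenfilter (the data, learning rate, and sample count are assumptions, not values from the slides):

    % Minimal sketch of Oja's maximum eigenfilter on cloud-like 2-D data.
    N = 5000; m = 2;
    X = [randn(N/2, 2)/9 + 0.5; randn(N/2, 2)/6 + 1];  % assumed data, similar in spirit to the example above
    X = X - mean(X);                                   % zero-mean input
    R = (X' * X) / N;
    [Q, L] = eig(R); [~, i1] = max(diag(L)); q1 = Q(:, i1);
    w = randn(m, 1); w = w / norm(w);                  % random initial unit weight vector
    eta = 0.01;
    for n = 1:N
        x = X(n, :)';
        y = w' * x;                                    % activation y(n) = w(n)^T x(n)
        w = w + eta * y * (x - y * w);                 % Hebbian term + stabilization term
    end
    fprintf('|cosine(w, q_1)| = %.4f, ||w|| = %.4f\n', abs(w' * q1) / norm(w), norm(w));

After one pass over the data, w points (up to sign) along the eigenvector q_1 of R with the largest eigenvalue, and its norm is close to 1, matching the analysis that follows.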

Matrix Formulation of the Algorithm

Activation:

    y(n) = x^T(n) w(n) = w^T(n) x(n).

Learning:

    w(n+1) = w(n) + η y(n)[x(n) − y(n) w(n)].

Combining the above,

    w(n+1) = w(n) + η [x(n) x^T(n) w(n) − (w^T(n) x(n) x^T(n) w(n)) w(n)],

which represents a nonlinear stochastic difference equation that is hard to analyze.

Asymptotic Stability Theorem

To ease the analysis, we rewrite the learning rule as

    w(n+1) = w(n) + η(n) h(w(n), x(n)).

The goal is to associate a deterministic ordinary differential equation (ODE) with this stochastic equation. Under certain reasonable conditions on η, h(·,·), and w, we get the asymptotic stability theorem stating that

    lim_{n→∞} w(n) = q_1   infinitely often with probability 1.

Conditions for Stability

1. η(n) is a decreasing sequence of positive real numbers such that Σ_{n=1}^∞ η(n) = ∞, Σ_{n=1}^∞ η^p(n) < ∞ for p > 1, and η(n) → 0 as n → ∞.
2. The sequence of parameter vectors w(·) is bounded with probability 1.
3. The update function h(w, x) is continuously differentiable with respect to w and x, and its derivatives are bounded in time.
4. The limit ĥ(w) = lim_{n→∞} E[h(w, X)] exists for each w, where X is a random vector.
5. There is a locally asymptotically stable solution to the ODE dw(t)/dt = ĥ(w(t)).
6. Let q_1 denote the solution to the ODE above, with basin of attraction B(q_1). The parameter vector w(n) enters a compact subset A of B(q_1) infinitely often with probability 1.

Stability Analysis of Maximum Eigenfilter

Set it up to satisfy the conditions of the asymptotic stability theorem:

Set the learning rate to η(n) = 1/n.

Set h(·,·) to

    h(w(n), x(n)) = x(n) y(n) − y^2(n) w(n) = x(n) x^T(n) w(n) − (w^T(n) x(n) x^T(n) w(n)) w(n).

Taking the expectation over all x,

    ĥ(w) = lim_{n→∞} E[X(n) X^T(n) w(n) − (w^T(n) X(n) X^T(n) w(n)) w(n)] = R w(∞) − (w^T(∞) R w(∞)) w(∞).

Substituting ĥ into the ODE,

    dw(t)/dt = ĥ(w(t)) = R w(t) − (w^T(t) R w(t)) w(t).
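To see the averaged dynamics in isolation, the ODE dw(t)/dt = R w(t) − (w^T(t) R w(t)) w(t) can be integrated numerically. A minimal Euler-integration sketch in Octave/MATLAB (the matrix R, step size, and initial condition are assumed for illustration):

    % Minimal sketch: Euler integration of dw/dt = R*w - (w'*R*w)*w.
    R = [0.14425 0.05; 0.05 0.02161];              % assumed symmetric covariance matrix
    [Q, L] = eig(R); [~, i1] = max(diag(L)); q1 = Q(:, i1);
    w = [0.6; -0.8];                               % assumed initial condition
    dt = 0.05;
    for t = 1:4000
        w = w + dt * (R * w - (w' * R * w) * w);   % one Euler step of the averaged ODE
    end
    fprintf('||w|| = %.4f, |w^T q_1| = %.4f\n', norm(w), abs(w' * q1));

Both printed values approach 1: the trajectory converges to (plus or minus) the top eigenvector q_1, which is the fixed point established by the stability analysis below.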

Stability Analysis of Maximum Eigenfilter (cont'd)

Expanding w(t) in terms of the eigenvectors of R,

    w(t) = Σ_k θ_k(t) q_k,

and using the basic definitions R q_k = λ_k q_k and q_k^T R q_k = λ_k, we get (derivation below)

    ĥ(w(t)) = Σ_k λ_k θ_k(t) q_k − [Σ_{l=1}^m λ_l θ_l^2(t)] Σ_k θ_k(t) q_k.

Equating the right-hand sides of

    dw(t)/dt = Σ_k (dθ_k(t)/dt) q_k

and

    dw(t)/dt = ĥ(w(t)) = R w(t) − (w^T(t) R w(t)) w(t),

we get

    Σ_k (dθ_k(t)/dt) q_k = Σ_k λ_k θ_k(t) q_k − [Σ_{l=1}^m λ_l θ_l^2(t)] Σ_k θ_k(t) q_k.

First, we show R w(t) = Σ_k λ_k θ_k(t) q_k, using R q_k = λ_k q_k:

    R w(t) = R Σ_k θ_k(t) q_k = Σ_k θ_k(t) R q_k = Σ_k λ_k θ_k(t) q_k.

Next, we show (w^T(t) R w(t)) w(t) = [Σ_{l=1}^m λ_l θ_l^2(t)] Σ_k θ_k(t) q_k:

    (w^T(t) R w(t)) w(t)
      = [w^T(t) R w(t)] Σ_k θ_k(t) q_k
      = [(Σ_{l=1}^m θ_l(t) q_l)^T R (Σ_k θ_k(t) q_k)] Σ_k θ_k(t) q_k
      = [Σ_{l=1}^m Σ_k θ_l(t) θ_k(t) q_l^T R q_k] Σ_k θ_k(t) q_k
      = [Σ_{l=1}^m Σ_k θ_l(t) θ_k(t) λ_k q_l^T q_k] Σ_k θ_k(t) q_k
        (the inner sum collapses, since q_l^T q_k = 0 for l ≠ k and 1 for l = k)
      = [Σ_{l=1}^m λ_l θ_l^2(t)] Σ_k θ_k(t) q_k.
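The two identities just derived are easy to sanity-check numerically. A minimal Octave/MATLAB sketch (the dimension and the random symmetric matrix R are assumptions for illustration):

    % Minimal sketch: check R*w = sum_k lambda_k*theta_k*q_k and
    % (w'*R*w)*w = (sum_l lambda_l*theta_l^2)*w for w = sum_k theta_k*q_k.
    m = 4;
    A = randn(m); R = (A * A') / m;        % assumed random covariance-like matrix
    [Q, L] = eig(R); lambda = diag(L);     % R*q_k = lambda_k*q_k, Q orthogonal
    theta = randn(m, 1);                   % arbitrary expansion coefficients
    w = Q * theta;                         % w = sum_k theta_k * q_k
    err1 = norm(R * w - Q * (lambda .* theta));
    err2 = norm((w' * R * w) * w - sum(lambda .* theta.^2) * w);
    fprintf('identity errors: %.2e, %.2e (both at round-off level)\n', err1, err2);

Note that the second identity uses w itself on the right-hand side; it is the same expression as above, since w = Σ_k θ_k(t) q_k.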

Factoring out q_k, we get

    dθ_k(t)/dt = λ_k θ_k(t) − [Σ_{l=1}^m λ_l θ_l^2(t)] θ_k(t).

We can analyze the above in two cases (details below):

Case I: k ≠ 1. In this case, α_k(t) = θ_k(t)/θ_1(t) → 0 as t → ∞, by using the above to derive dα_k(t)/dt = −(λ_1 − λ_k) α_k(t), where (λ_1 − λ_k) is positive.

Case II: k = 1. In this case, θ_1(t) → ±1 as t → ∞, from dθ_1(t)/dt = λ_1 θ_1(t)[1 − θ_1^2(t)].

Case I (in detail): k ≠ 1

Given

    dθ_k(t)/dt = λ_k θ_k(t) − [Σ_{l=1}^m λ_l θ_l^2(t)] θ_k(t),   (1)

define α_k(t) = θ_k(t)/θ_1(t) and derive

    dα_k(t)/dt = (1/θ_1(t)) dθ_k(t)/dt − (θ_k(t)/θ_1^2(t)) dθ_1(t)/dt.   (2)

Plugging (1) into (2) (for both dθ_k(t)/dt and dθ_1(t)/dt), we finally get

    dα_k(t)/dt = −(λ_1 − λ_k) α_k(t),

so α_k(t) → 0 as t → ∞.

Case II (in detail): k = 1

    dθ_1(t)/dt = λ_1 θ_1(t) − [Σ_{l=1}^m λ_l θ_l^2(t)] θ_1(t)
               = λ_1 θ_1(t) − λ_1 θ_1^3(t) − θ_1(t) Σ_{l=2}^m λ_l θ_l^2(t)
               = λ_1 θ_1(t) − λ_1 θ_1^3(t) − θ_1^3(t) Σ_{l=2}^m λ_l α_l^2(t).

Using the result from Case I (α_l(t) → 0 for l ≠ 1 as t → ∞), θ_1(t) → ±1 as t → ∞, from

    dθ_1(t)/dt = λ_1 θ_1(t)[1 − θ_1^2(t)].

Recalling the original expansion w(t) = Σ_k θ_k(t) q_k, we can conclude that

    w(t) → q_1 as t → ∞,

where q_1 is the normalized eigenvector associated with the largest eigenvalue λ_1 of the covariance matrix R. The other conditions of stability can also be shown to hold (see the textbook).
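The same two-case conclusion can be observed by integrating the modal equations for θ_k(t) directly. A minimal Octave/MATLAB sketch (the eigenvalues, step size, and initial coefficients are assumptions for illustration):

    % Minimal sketch: integrate d(theta_k)/dt = lambda_k*theta_k - (sum_l lambda_l*theta_l^2)*theta_k.
    lambda = [0.14425; 0.05; 0.02161];     % assumed eigenvalues, lambda_1 the largest
    theta  = [0.1; 0.7; -0.7];             % assumed initial expansion coefficients
    dt = 0.05;
    for t = 1:6000
        s = sum(lambda .* theta.^2);       % sum_l lambda_l * theta_l(t)^2
        theta = theta + dt * (lambda .* theta - s * theta);   % Euler step
    end
    disp(theta');                          % approximately [1 0 0]: theta_1 -> +/-1, others -> 0

As predicted by Case I and Case II, θ_1(t) saturates at ±1 while the other coefficients decay, so w(t) = Σ_k θ_k(t) q_k tends to ±q_1.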

Summary of Hebbian-Based Maximum Eigenfilter

A Hebbian-based linear neuron converges with probability 1 to a fixed point, which is characterized as follows:

The variance of the output approaches the largest eigenvalue of the covariance matrix R (y(n) is the output):

    lim_{n→∞} σ^2(n) = lim_{n→∞} E[Y^2(n)] = λ_1.

The synaptic weight vector approaches the associated eigenvector, with

    lim_{n→∞} w(n) = q_1  and  lim_{n→∞} ||w(n)|| = 1.

Generalized Hebbian Algorithm for Full PCA

Sanger (1989) showed how to construct a feedforward network to learn all the eigenvectors of R.

Activation:

    y_j(n) = Σ_{i=1}^m w_ji(n) x_i(n),   j = 1, 2, ..., l.

Learning:

    Δw_ji(n) = η [y_j(n) x_i(n) − y_j(n) Σ_{k=1}^j w_ki(n) y_k(n)],   i = 1, 2, ..., m;  j = 1, 2, ..., l.
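A minimal Octave/MATLAB sketch of the GHA update written in matrix form (the synthetic data, learning rate, and the lower-triangular mask used to implement the Σ_{k=1}^j term are illustration choices, not from the slides):

    % Minimal sketch of the Generalized Hebbian Algorithm (GHA) for the first l principal directions.
    N = 20000; m = 5; l = 2;
    X = randn(N, m) * diag([1.0 0.6 0.3 0.2 0.1]); % assumed data; principal directions = coordinate axes
    W = 0.1 * randn(l, m);                         % row j of W is the weight vector of output neuron j
    eta = 0.002;
    for n = 1:N
        x = X(n, :)';
        y = W * x;                                 % y_j(n) = sum_i w_ji(n) x_i(n)
        W = W + eta * (y * x' - tril(y * y') * W); % Delta w_ji = eta [y_j x_i - y_j sum_{k<=j} w_ki y_k]
    end
    disp(abs(W));                                  % rows approach e_1 and e_2 (up to sign)

Here tril(y*y')*W reproduces the componentwise term y_j(n) Σ_{k=1}^j w_ki(n) y_k(n), so the matrix update applies the same rule as above to all l output neurons at once.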