Haykin Chapter 8: Principal Components Analysis
CPSC 636-600
Instructor: Yoonsuck Choe
Spring 2015

Motivation

[Figure: scatter plot of the two-dimensional data set cloud.dat]

How can we project the given data so that the variance in the projected points is maximized?

Principal Component Analysis: Variance Probe

X: m-dimensional random vector (a vector random variable following a certain probability distribution). Assume E[X] = 0.

Projection of X onto a unit vector q (with (qᵀq)^{1/2} = 1): A = Xᵀq = qᵀX.

We know E[A] = E[qᵀX] = qᵀE[X] = 0. The variance can also be calculated:

σ² = E[A²] = E[(qᵀX)(Xᵀq)] = qᵀ E[XXᵀ] q = qᵀRq,

where R = E[XXᵀ] is the covariance matrix.

Principal Component Analysis: Variance Probe (cont'd)

This is a sort of variance probe: ψ(q) = qᵀRq.

Using different unit vectors q for the projection of the input data points will result in smaller or larger variance in the projected points.

With this, we can ask: for which vector directions does the variance probe ψ(q) have extremal values?

The solution to the question is obtained by finding unit vectors satisfying the following condition: Rq = λq, where λ is a scaling factor. This is basically an eigenvalue problem.
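To make the variance probe concrete, here is a minimal sketch (not from the slides; it assumes NumPy and made-up two-dimensional data) that evaluates ψ(q) = qᵀRq for random unit directions and checks that it is largest at the eigenvector of R with the largest eigenvalue:

# Minimal sketch (not from the slides): numerically checking that the variance
# probe psi(q) = q^T R q is maximized by the leading eigenvector of R.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.6], [0.0, 0.3]])  # assumed toy data
X = X - X.mean(axis=0)                   # enforce the zero-mean assumption E[X] = 0

R = X.T @ X / len(X)                     # sample covariance matrix R = E[X X^T]
eigvals, eigvecs = np.linalg.eigh(R)     # symmetric eigendecomposition (ascending)
q1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

def variance_probe(q, R):
    """psi(q) = q^T R q, the variance of the projection A = q^T X."""
    return q @ R @ q

# Compare psi at the leading eigenvector against random unit directions.
random_q = rng.normal(size=(100, 2))
random_q /= np.linalg.norm(random_q, axis=1, keepdims=True)
print(variance_probe(q1, R), max(variance_probe(q, R) for q in random_q))

The printed value for q1 should be at least as large as the best of the random directions, which is exactly the extremal-value question answered by the eigenvalue problem Rq = λq.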
PCA

With an m × m covariance matrix R, we can get m eigenvectors and eigenvalues:

Rq_j = λ_j q_j,  j = 1, 2, ..., m.

We can sort the eigenvectors/eigenvalues according to the eigenvalues, so that λ_1 > λ_2 > ... > λ_m, and arrange the eigenvectors in a column-wise matrix Q = [q_1, q_2, ..., q_m]. Then we can write RQ = QΛ, where Λ = diag(λ_1, λ_2, ..., λ_m).

Q is orthogonal, so that QQᵀ = I. That is, Q⁻¹ = Qᵀ.

PCA: Summary

The eigenvectors of the covariance matrix R of the zero-mean random input vector X define the principal directions q_j along which the variance of the projected inputs has extremal values.

The associated eigenvalues define the extremal values of the variance probe.

PCA: Usage

Project input x onto the principal directions: a = Qᵀx.

We can also recover the input from the projected point a: x = (Qᵀ)⁻¹ a = Qa.

Note that we don't need all principal directions, depending on how much variance is captured in the first few eigenvalues: we can do dimensionality reduction.

PCA: Dimensionality Reduction

Encoding: We can use the first l eigenvectors to encode x:

[a_1, a_2, ..., a_l]ᵀ = [q_1, q_2, ..., q_l]ᵀ x.

Note that we only need to calculate the l projections a_1, a_2, ..., a_l, where l < m.

Decoding: Once [a_1, a_2, ..., a_l]ᵀ is obtained, we want to reconstruct the full [x_1, x_2, ..., x_l, ..., x_m]ᵀ:

x = Qa ≈ [q_1, q_2, ..., q_l][a_1, a_2, ..., a_l]ᵀ = x̂.

Or, alternatively, x̂ = Q[a_1, a_2, ..., a_l, 0, 0, ..., 0]ᵀ, with m − l zeros.
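The encoding/decoding steps above translate directly into a few lines of linear algebra. The following is a minimal sketch (not from the slides; the helper names, toy data, and use of NumPy are assumptions) of PCA-based dimensionality reduction with the first l principal directions:

# Minimal sketch (not from the slides): encode x with the first l eigenvectors
# of the covariance matrix and reconstruct x_hat from the l projections.
import numpy as np

def pca_directions(X):
    """Return Q (eigenvectors as columns) and eigenvalues of the covariance of
    zero-mean data X, sorted so that lambda_1 > lambda_2 > ... > lambda_m."""
    R = np.cov(X, rowvar=False)
    eigvals, Q = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    return Q[:, order], eigvals[order]

def encode(x, Q, l):
    """a = [q_1, ..., q_l]^T x : only l projections are needed."""
    return Q[:, :l].T @ x

def decode(a, Q):
    """x_hat = [q_1, ..., q_l] a, the rank-l reconstruction of x."""
    l = len(a)
    return Q[:, :l] @ a

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # assumed toy data
X = X - X.mean(axis=0)
Q, lam = pca_directions(X)
x = X[0]
a = encode(x, Q, l=2)
print(x, decode(a, Q))    # x_hat approximates x; exact if l = m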
PCA: Total Variance

The total variance of the m components of the data vector is

Σ_{j=1}^{m} σ_j² = Σ_{j=1}^{m} λ_j.

The truncated version with the first l components has variance

Σ_{j=1}^{l} σ_j² = Σ_{j=1}^{l} λ_j.

The larger the variance in the truncated version, i.e., the smaller the variance in the remaining components, the more accurate the dimensionality reduction.

PCA Example

[Figure: scatter plot of the example data with the principal directions overlaid]

inp = [randn(800,2)/9+0.5; randn(1000,2)/6+ones(1000,2)];

Q = [ 0.70285  0.71134 ]      Λ = [ 0.14425  0.00000 ]
    [ 0.71134  0.70285 ]          [ 0.00000  0.02161 ]

PCA's Relation to Neural Networks: Hebbian-Based Maximum Eigenfilter

How does all the above relate to neural networks?

A remarkable result by Oja (1982) shows that a single linear neuron with a Hebbian synapse can evolve into a filter for the first principal component of the input distribution!

Activation:

y = Σ_{i=1}^{m} w_i x_i

Learning rule:

w_i(n+1) = [w_i(n) + η y(n) x_i(n)] / (Σ_{i=1}^{m} [w_i(n) + η y(n) x_i(n)]²)^{1/2}

Hebbian-Based Maximum Eigenfilter

Expanding the denominator as a power series, dropping the higher-order terms, etc., we get

w_i(n+1) = w_i(n) + η y(n)[x_i(n) − y(n) w_i(n)] + O(η²),

with O(η²) including the second- and higher-order effects of η, which we can ignore for small η. Based on that, we get

w_i(n+1) = w_i(n) + η y(n) x_i(n) − η y²(n) w_i(n)
         = w_i(n) + η [y(n) x_i(n) − y²(n) w_i(n)],

where y(n) x_i(n) is the Hebbian term and −y²(n) w_i(n) is the stabilization term.
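Oja's rule is simple enough to try directly. Below is a minimal sketch (not from the slides; the toy data, learning rate, and use of NumPy are assumptions) of a single linear neuron trained with the simplified update w(n+1) = w(n) + η y(n)[x(n) − y(n)w(n)]:

# Minimal sketch (not from the slides): Oja's (1982) Hebbian maximum eigenfilter.
# A single linear neuron y = w^T x trained with the Hebbian + stabilization
# update converges to the eigenvector of R with the largest eigenvalue.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20000, 2)) @ np.array([[1.0, 0.5], [0.0, 0.2]])  # assumed toy data
X = X - X.mean(axis=0)

w = rng.normal(size=2)              # arbitrary initial weight vector
eta = 0.01                          # assumed small, fixed learning rate
for x in X:
    y = w @ x                       # activation y(n) = w^T(n) x(n)
    w += eta * y * (x - y * w)      # Hebbian term + stabilization term

R = X.T @ X / len(X)
_, eigvecs = np.linalg.eigh(R)
q1 = eigvecs[:, -1]
print(w, q1)   # w should line up with +/- q1 and have (near) unit norm

Note that the stabilization term −y²(n)w(n) is what keeps ‖w‖ near 1 without the explicit normalization in the original rule.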
Matrix Formulation of the Algorithm

Activation:

y(n) = xᵀ(n) w(n) = wᵀ(n) x(n)

Learning:

w(n+1) = w(n) + η y(n)[x(n) − y(n) w(n)]

Combining the above,

w(n+1) = w(n) + η [x(n)xᵀ(n)w(n) − (wᵀ(n)x(n)xᵀ(n)w(n)) w(n)]

represents a nonlinear stochastic difference equation, which is hard to analyze.

Asymptotic Stability Theorem

To ease the analysis, we rewrite the learning rule as

w(n+1) = w(n) + η(n) h(w(n), x(n)).

The goal is to associate a deterministic ordinary differential equation (ODE) with the stochastic equation.

Under certain reasonable conditions on η, h(·,·), and w, we get the asymptotic stability theorem stating that

lim_{n→∞} w(n) = q_1

infinitely often with probability 1.

Conditions for Stability

1. η(n) is a decreasing sequence of positive real numbers such that Σ_{n=1}^{∞} η(n) = ∞, Σ_{n=1}^{∞} η^p(n) < ∞ for p > 1, and η(n) → 0 as n → ∞.
2. The sequence of parameter vectors w(·) is bounded with probability 1.
3. The update function h(w, x) is continuously differentiable w.r.t. w and x, and its derivatives are bounded in time.
4. The limit h̄(w) = lim_{n→∞} E[h(w, X)] exists for each w, where X is a random vector.
5. There is a locally asymptotically stable solution to the ODE d w(t)/dt = h̄(w(t)).
6. Let q_1 denote the solution to the ODE above, with a basin of attraction B(q_1). The parameter vector w(n) enters a compact subset A of B(q_1) infinitely often with probability 1.

Stability Analysis of Maximum Eigenfilter

Set it up to satisfy the conditions of the asymptotic stability theorem:

Set the learning rate to be η(n) = 1/n.

Set h(·,·) to

h(w, x) = x(n) y(n) − y²(n) w(n)
        = x(n) xᵀ(n) w(n) − (wᵀ(n) x(n) xᵀ(n) w(n)) w(n)

Taking expectation over all x,

h̄(w) = lim_{n→∞} E[X(n)Xᵀ(n)w(n) − (wᵀ(n)X(n)Xᵀ(n)w(n)) w(n)]
     = R w(∞) − (wᵀ(∞) R w(∞)) w(∞).

Substituting h̄ into the ODE,

d w(t)/dt = h̄(w(t)) = R w(t) − (wᵀ(t) R w(t)) w(t).
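The averaged ODE can be explored numerically before going through the analysis. Here is a minimal sketch (not from the slides; the toy matrix R, step size, and Euler integration are assumptions) showing that the ODE dw/dt = Rw − (wᵀRw)w drives w(t) toward the leading eigenvector q_1:

# Minimal sketch (not from the slides): forward-Euler integration of the ODE
#   dw/dt = R w - (w^T R w) w
# associated with the stochastic update, to see w(t) -> +/- q_1.
import numpy as np

R = np.array([[2.0, 0.4],
              [0.4, 1.0]])           # assumed symmetric covariance-like matrix
eigvals, eigvecs = np.linalg.eigh(R)
q1 = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue

w = np.array([0.3, -0.8])            # arbitrary nonzero initial condition
dt = 0.01
for _ in range(10000):
    w += dt * (R @ w - (w @ R @ w) * w)

print(w, q1)                         # w converges to +/- q1 with ||w|| close to 1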
Stability Analysis of Maximum Eigenfilter (cont'd)

Expanding w(t) with the eigenvectors of R,

w(t) = Σ_{k=1}^{m} θ_k(t) q_k,

and using the basic definitions Rq_k = λ_k q_k and q_kᵀ R q_k = λ_k, we get (see the next slides for the derivation)

Σ_{k=1}^{m} (dθ_k(t)/dt) q_k = Σ_{k=1}^{m} λ_k θ_k(t) q_k − (Σ_{l=1}^{m} λ_l θ_l²(t)) Σ_{k=1}^{m} θ_k(t) q_k.

Equating the RHSs of the following,

dw(t)/dt = d(Σ_{k=1}^{m} θ_k(t) q_k)/dt = Σ_{k=1}^{m} (dθ_k(t)/dt) q_k

dw(t)/dt = h̄(w(t)) = R w(t) − (wᵀ(t) R w(t)) w(t),

we get

Σ_{k=1}^{m} (dθ_k(t)/dt) q_k = Σ_{k=1}^{m} λ_k θ_k(t) q_k − (Σ_{l=1}^{m} λ_l θ_l²(t)) Σ_{k=1}^{m} θ_k(t) q_k.

First, we show Rw(t) = Σ_{k=1}^{m} λ_k θ_k(t) q_k, using Rq_k = λ_k q_k:

Rw(t) = R Σ_{k=1}^{m} θ_k(t) q_k = Σ_{k=1}^{m} θ_k(t) R q_k = Σ_{k=1}^{m} λ_k θ_k(t) q_k.

Next, we show wᵀ(t) R w(t) w(t) = (Σ_{l=1}^{m} λ_l θ_l²(t)) Σ_{k=1}^{m} θ_k(t) q_k:

wᵀ(t) R w(t) w(t)
= [wᵀ(t) R w(t)] Σ_{k=1}^{m} θ_k(t) q_k
= [(Σ_{l=1}^{m} θ_l(t) q_lᵀ) R (Σ_{k=1}^{m} θ_k(t) q_k)] Σ_{k=1}^{m} θ_k(t) q_k
= [Σ_{l=1}^{m} Σ_{k=1}^{m} θ_l(t) θ_k(t) q_lᵀ R q_k] Σ_{k=1}^{m} θ_k(t) q_k
= [Σ_{l=1}^{m} Σ_{k=1}^{m} θ_l(t) θ_k(t) λ_k q_lᵀ q_k] Σ_{k=1}^{m} θ_k(t) q_k
  {the inner sum disappears since q_lᵀ q_k = 0 for l ≠ k and 1 for l = k}
= [Σ_{l=1}^{m} λ_l θ_l²(t)] Σ_{k=1}^{m} θ_k(t) q_k.
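Both identities are easy to verify numerically. The following minimal sketch (not from the slides; the random matrix and use of NumPy are assumptions) expands an arbitrary w in the eigenbasis of R, with θ_k = q_kᵀ w, and checks the two results just derived:

# Minimal sketch (not from the slides): checking the two identities
#   R w         = sum_k lambda_k theta_k q_k
#   (w^T R w) w = (sum_l lambda_l theta_l^2) * sum_k theta_k q_k
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
R = A @ A.T                         # symmetric positive semi-definite "covariance"
lam, Q = np.linalg.eigh(R)          # columns of Q are orthonormal eigenvectors

w = rng.normal(size=4)
theta = Q.T @ w                     # expansion coefficients of w in the eigenbasis

lhs1, rhs1 = R @ w, Q @ (lam * theta)
lhs2, rhs2 = (w @ R @ w) * w, (lam @ theta**2) * (Q @ theta)
print(np.allclose(lhs1, rhs1), np.allclose(lhs2, rhs2))   # True True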
Factoring out q_k, we get

dθ_k(t)/dt = λ_k θ_k(t) − (Σ_{l=1}^{m} λ_l θ_l²(t)) θ_k(t).

We can analyze the above in two cases (details in the following slides):

Case I: k ≠ 1. In this case, α_k(t) = θ_k(t)/θ_1(t) → 0 as t → ∞, by using the above to derive dα_k(t)/dt = −(λ_1 − λ_k) α_k(t), where (λ_1 − λ_k) is positive!

Case II: k = 1. In this case, θ_1(t) → ±1 as t → ∞, from dθ_1(t)/dt = λ_1 θ_1(t)[1 − θ_1²(t)].

Case I (in detail): k ≠ 1

Given

dθ_k(t)/dt = λ_k θ_k(t) − (Σ_{l=1}^{m} λ_l θ_l²(t)) θ_k(t).   (1)

Define α_k(t) = θ_k(t)/θ_1(t). Derive

dα_k(t)/dt = (1/θ_1(t)) dθ_k(t)/dt − (θ_k(t)/θ_1²(t)) dθ_1(t)/dt.   (2)

Plug (1) into (2) (for both dθ_k(t)/dt and dθ_1(t)/dt). Finally, we get:

dα_k(t)/dt = −(λ_1 − λ_k) α_k(t), so α_k(t) → 0 as t → ∞.

Case II: k = 1

dθ_1(t)/dt = λ_1 θ_1(t) − (Σ_{l=1}^{m} λ_l θ_l²(t)) θ_1(t)
           = λ_1 θ_1(t) − λ_1 θ_1³(t) − θ_1(t) Σ_{l=2}^{m} λ_l θ_l²(t)
           = λ_1 θ_1(t) − λ_1 θ_1³(t) − θ_1³(t) Σ_{l=2}^{m} λ_l α_l²(t)

Using the results from Case I (α_l → 0 for l ≠ 1 as t → ∞),

θ_1(t) → ±1 as t → ∞, from dθ_1(t)/dt = λ_1 θ_1(t)[1 − θ_1²(t)].

Recalling the original expansion

w(t) = Σ_{k=1}^{m} θ_k(t) q_k,

we can conclude that

w(t) → q_1 as t → ∞,

where q_1 is the normalized eigenvector associated with the largest eigenvalue λ_1 of the covariance matrix R.

The other conditions of stability can also be shown to hold (see the textbook).
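The two cases can also be seen directly by simulating the coefficient dynamics. Here is a minimal sketch (not from the slides; the eigenvalues, initial coefficients, and Euler integration are assumptions) that integrates dθ_k/dt = λ_k θ_k − (Σ_l λ_l θ_l²) θ_k and shows θ_k → 0 for k ≠ 1 (Case I) while θ_1 → ±1 (Case II):

# Minimal sketch (not from the slides): forward-Euler integration of the
# coefficient dynamics in the eigenbasis of R, for assumed eigenvalues.
import numpy as np

lam = np.array([3.0, 1.5, 0.5])          # assumed lambda_1 > lambda_2 > lambda_3
theta = np.array([0.1, 0.7, -0.6])       # small but nonzero theta_1(0)

dt = 0.001
for _ in range(50000):
    theta += dt * (lam * theta - (lam @ theta**2) * theta)

print(theta)    # approximately [1, 0, 0] (or [-1, 0, 0] if theta_1(0) < 0)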
Summary of Hebbian-Based Maximum Eigenfilter

The Hebbian-based linear neuron converges with probability 1 to a fixed point, which is characterized as follows:

The variance of the output approaches the largest eigenvalue of the covariance matrix R (y(n) is the output):

lim_{n→∞} σ²(n) = lim_{n→∞} E[Y²(n)] = λ_1

The synaptic weight vector approaches the associated eigenvector, with

lim_{n→∞} w(n) = q_1 and lim_{n→∞} ‖w(n)‖ = 1.

Generalized Hebbian Algorithm for Full PCA

Sanger (1989) showed how to construct a feedforward network to learn all the eigenvectors of R.

Activation:

y_j(n) = Σ_{i=1}^{m} w_ji(n) x_i(n),  j = 1, 2, ..., l

Learning:

Δw_ji(n) = η [y_j(n) x_i(n) − y_j(n) Σ_{k=1}^{j} w_ki(n) y_k(n)],
  i = 1, 2, ..., m;  j = 1, 2, ..., l.
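The Generalized Hebbian Algorithm is just Oja's rule with an extra "deflation" sum over the earlier outputs. The following is a minimal sketch (not from the slides; the toy data, learning rate, and matrix form of the update are assumptions) of Sanger's rule learning the first l eigenvectors of R:

# Minimal sketch (not from the slides): Sanger's (1989) Generalized Hebbian
# Algorithm. l linear outputs y_j = sum_i w_ji x_i are trained with
#   dw_ji = eta * y_j * (x_i - sum_{k<=j} w_ki y_k),
# so row j of W converges to the j-th eigenvector of R.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30000, 3)) @ np.diag([1.0, 0.5, 0.2])   # assumed toy data
X = X - X.mean(axis=0)

m, l = 3, 2                          # input dimension m, number of components l
W = 0.1 * rng.normal(size=(l, m))
eta = 0.005
for x in X:
    y = W @ x                                    # activations y_j(n)
    # lower-triangular "deflation": sum_{k=1}^{j} w_ki y_k for each output j
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

R = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(R)
print(W)                             # rows ~ +/- eigenvectors of the 2 largest eigenvalues
print(eigvecs[:, ::-1][:, :l].T)

The lower-triangular term is what distinguishes GHA from running l independent Oja neurons: output j is trained on the input with the contributions of outputs 1, ..., j−1 removed, so the rows of W converge to the first l principal directions in order.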