The Dynamics of Learning Vector Quantization

Barbara Hammer, TU Clausthal-Zellerfeld, Institute of Computing Science
Michael Biehl, Anarta Ghosh, Rijksuniversiteit Groningen, Mathematics and Computing Science

Introduction
- prototype-based learning from example data: representation, classification
- Vector Quantization (VQ)
- Learning Vector Quantization (LVQ)

The dynamics of learning
- a model situation: randomized data
- learning algorithms for VQ and LVQ
- analysis and comparison: dynamics, success of learning

Summary

Outlook

Vector Quantization (VQ)

aim: representation of large amounts of data by (few) prototype vectors

example: identification and grouping of similar data in clusters

assignment of a feature vector ξ to the closest prototype w (similarity or distance measure, e.g. Euclidean distance)

unsupervised competitive learning
- initialize K prototype vectors
- present a single example
- identify the closest prototype, i.e. the so-called winner
- move the winner even closer towards the example

intuitively clear, plausible procedure
- places prototypes in areas with high density of data
- identifies the most relevant combinations of features
- (stochastic) on-line gradient descent with respect to the cost function ...

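quantization error

$$ H_{VQ} = \sum_{j=1}^{K} \sum_{\mu=1}^{P} \left( \xi^\mu - w_j \right)^2 \prod_{k \neq j}^{K} \Theta\!\left( d_k^\mu - d_j^\mu \right) $$

here: Euclidean distance $d_j^\mu = (\xi^\mu - w_j)^2$ between prototype $w_j$ and data point $\xi^\mu$; the product of step functions Θ is 1 only if $w_j$ is the winner

aim: faithful representation (in general: ≠ clustering)

Results depend on
- the number of prototype vectors
- the distance measure / metric used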
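The competitive learning procedure and the quantization error can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy data (two 2-d Gaussian clouds), the step size `eta`, the initialisation on two data points and all function names are assumptions chosen for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def winner(prototypes, xi):
    """Index of the prototype closest to the example xi."""
    return min(range(len(prototypes)), key=lambda j: sq_dist(prototypes[j], xi))

def vq_step(prototypes, xi, eta):
    """Move only the winner a fraction eta of the way towards xi (in place)."""
    j = winner(prototypes, xi)
    prototypes[j] = [w + eta * (x - w) for w, x in zip(prototypes[j], xi)]
    return j

def quantization_error(prototypes, data):
    """H_VQ: sum over all examples of the squared distance to the winner."""
    return sum(sq_dist(prototypes[winner(prototypes, xi)], xi) for xi in data)

def demo(seed=0, eta=0.05, epochs=20):
    rng = random.Random(seed)
    # Hypothetical toy data: 100 points around (0, 0) and 100 around (3, 3).
    data = [[rng.gauss(c, 0.3), rng.gauss(c, 0.3)]
            for c in (0.0, 3.0) for _ in range(100)]
    # K = 2 prototypes, both initialised on points of the first cloud.
    prototypes = [list(data[0]), list(data[1])]
    before = quantization_error(prototypes, data)
    for _ in range(epochs):
        rng.shuffle(data)
        for xi in data:          # present one example at a time
            vq_step(prototypes, xi, eta)
    return before, quantization_error(prototypes, data)
```

Calling `demo()` returns the quantization error before and after training; the second value should be much smaller once one prototype has moved into each cloud.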

Learning Vector Quantization (LVQ)

aim: classification of data, learning from examples

example situation: 3 classes, 3 prototypes

classification: assignment of a vector ξ to the class of the closest prototype w

learning: choice of prototypes according to example data

aim: generalization ability, i.e. correct classification of novel data after training

mostly: heuristically motivated variations of competitive learning

prominent example [Kohonen]: LVQ 2.1
- initialize prototype vectors (for different classes)
- present a single example
- identify the closest correct and the closest wrong prototype
- move the corresponding winner towards / away from the example

known convergence / stability problems, e.g. for infrequent classes

LVQ algorithms ...
- appear plausible, intuitive, flexible
- are fast, easy to implement
- are frequently applied in a variety of problems involving the classification of structured data, a few examples:
  - real time speech recognition
  - medical diagnosis, e.g. from histological data
  - gene expression data analysis
  - texture recognition and classification
  - ...

illustration: microscopic images of (pig) semen cells after freezing and storage, c/o Lidia Sanchez-Gonzalez, Leon/Spain
[figure panels: healthy cells, damaged cells; prototypes obtained by LVQ (1)]

LVQ algorithms ...
- are often based on purely heuristic arguments, or derived from a cost function with unclear relation to the generalization ability
- almost exclusively use the Euclidean distance measure, which is inappropriate for heterogeneous data
- lack, in general, a thorough theoretical understanding of dynamics, convergence properties, performance w.r.t. generalization, etc.

In the following: analysis of LVQ algorithms w.r.t.
- dynamics of the learning process
- performance, i.e. generalization ability
- asymptotic behavior in the limit of many examples

typical behavior in a model situation
- randomized, high-dimensional data
- essential features of LVQ learning

aim:
- contribute to the theoretical understanding
- develop efficient LVQ schemes
- test in applications

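model situation: two clusters of N-dimensional data

random vectors $\xi \in \mathbb{R}^N$ according to a mixture of two Gaussians:

$$ P(\xi) = \sum_{\sigma = \pm 1} p_\sigma \, P(\xi \mid \sigma), \qquad P(\xi \mid \sigma) = \frac{1}{(2\pi)^{N/2}} \exp\left[ -\frac{1}{2} \left( \xi - \ell B_\sigma \right)^2 \right] $$

orthonormal center vectors: $B_+, B_- \in \mathbb{R}^N$, $(B_\sigma)^2 = 1$, $B_+ \cdot B_- = 0$

prior weights of classes $p_+, p_-$ with $p_+ + p_- = 1$

separation $\ell$: the clusters are centered at $\ell B_+$ and $\ell B_-$

independent components with mean $\langle \xi_j \rangle_\sigma = \ell B_{\sigma,j}$ and variance $\langle \xi_j^2 \rangle_\sigma - \langle \xi_j \rangle_\sigma^2 = 1$, hence

$$ \langle \xi^2 \rangle_\sigma = \sum_{j=1}^{N} \langle \xi_j^2 \rangle_\sigma = N + \ell^2 $$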
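Data from this model are easy to generate. The sketch below assumes, purely for convenience, that the center vectors are the first two unit vectors (any orthonormal pair would do); the function names and the default parameters of `demo` are illustrative choices.

```python
import random

def sample_example(N, ell, p_plus, rng):
    """Draw (xi, sigma) from the two-cluster model.

    Assumption for illustration: B_+ = e_1 and B_- = e_2.
    """
    sigma = 1 if rng.random() < p_plus else -1
    xi = [rng.gauss(0.0, 1.0) for _ in range(N)]  # independent unit-variance components
    xi[0 if sigma == 1 else 1] += ell             # shift the mean to ell * B_sigma
    return xi, sigma

def demo(N=200, ell=1.0, p_plus=0.6, P=400, seed=1):
    rng = random.Random(seed)
    data = [sample_example(N, ell, p_plus, rng) for _ in range(P)]
    frac_plus = sum(1 for _, s in data if s == 1) / P
    mean_sq_norm = sum(sum(x * x for x in xi) for xi, _ in data) / P
    return frac_plus, mean_sq_norm
```

For large P the fraction of class +1 examples approaches $p_+$ and the mean squared norm approaches $N + \ell^2$, in line with the moments stated above.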

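high-dimensional data (formally: $N \to \infty$)

400 examples $\xi^\mu \in \mathbb{R}^N$, $N = 200$, $\ell = 1$, $p_+ = 0.6$ (240 examples of class +1, 160 of class −1)

[figure, panel 1: projections into the plane of the center vectors, $y_+ = B_+ \cdot \xi$ and $y_- = B_- \cdot \xi$]
[figure, panel 2: projections on two independent random directions $w_{1,2}$, $x_1 = w_1 \cdot \xi$ and $x_2 = w_2 \cdot \xi$]

Note: model for studying typical behavior of LVQ algorithms, not: density-estimation based classification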

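dynamics of on-line training

sequence of independent random data $\xi^\mu$, $\mu = 1, 2, 3, \ldots$, drawn according to $P(\xi)$

update of prototype vectors:

$$ w_S^\mu = w_S^{\mu-1} + \frac{\eta}{N} \, f\!\left[ d_+^\mu, d_-^\mu, S, \sigma^\mu, \ldots \right] \left( \xi^\mu - w_S^{\mu-1} \right), \qquad S, \sigma^\mu = \pm 1, \qquad d_S^\mu = \left( \xi^\mu - w_S^{\mu-1} \right)^2 $$

- η: learning rate, step size
- f[...]: competition, direction of update, etc.
- $(\xi^\mu - w_S^{\mu-1})$: change of prototype towards or away from the current data

above examples:

unsupervised Vector Quantization, "The Winner Takes It All" (classes irrelevant/unknown):

$$ f[\ldots] = \Theta\!\left( d_{-S}^\mu - d_S^\mu \right) $$

Learning Vector Quantization 2.1 (here: two prototypes, no explicit competition):

$$ f[\ldots] = S \sigma^\mu = \begin{cases} +1 & \text{(correct class)} \\ -1 & \text{(wrong class)} \end{cases} $$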
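The generic update with an exchangeable modulation function f can be written as a single step function. This is a sketch under assumptions: prototypes are stored in a dict keyed by S = +1 / −1, and the names `online_step`, `f_vq` and `f_lvq21` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def f_vq(d, S, sigma):
    """Winner takes all: Theta(d_{-S} - d_S); the label sigma is ignored."""
    return 1.0 if d[-S] > d[S] else 0.0

def f_lvq21(d, S, sigma):
    """LVQ 2.1: +1 for the prototype of the correct class, -1 for the wrong one."""
    return float(S * sigma)

def online_step(w, xi, sigma, eta, f):
    """One on-line update of both prototypes; w maps S in {+1, -1} to a vector."""
    N = len(xi)
    # Distances are evaluated with the prototypes before the update.
    d = {S: sq_dist(xi, w[S]) for S in (1, -1)}
    return {S: [wj + eta / N * f(d, S, sigma) * (xj - wj)
                for wj, xj in zip(w[S], xi)]
            for S in (1, -1)}
```

With `f_vq` only the closer prototype moves, towards the example; with `f_lvq21` both prototypes move, one towards and one away from it.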

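mathematical analysis of the learning dynamics

1. description in terms of a few characteristic quantities (here: $\mathbb{R}^{2N} \to \mathbb{R}^7$)

- $R_{S\sigma}^\mu = w_S^\mu \cdot B_\sigma$: projections in the $(B_+, B_-)$-plane
- $Q_{ST}^\mu = w_S^\mu \cdot w_T^\mu$: lengths and relative position of prototypes
- $S, T, \sigma \in \{+1, -1\}$

With the shorthand $f_S[\ldots] = f[d_+^\mu, d_-^\mu, S, \sigma^\mu, \ldots]$, the update

$$ w_S^\mu = w_S^{\mu-1} + \frac{\eta}{N} \, f_S[\ldots] \left( \xi^\mu - w_S^{\mu-1} \right) $$

yields the recursions

$$ \frac{R_{S\sigma}^\mu - R_{S\sigma}^{\mu-1}}{1/N} = \eta \, f_S[\ldots] \left( y_\sigma^\mu - R_{S\sigma}^{\mu-1} \right) $$

$$ \frac{Q_{ST}^\mu - Q_{ST}^{\mu-1}}{1/N} = \eta \, f_S[\ldots] \left( x_T^\mu - Q_{ST}^{\mu-1} \right) + \eta \, f_T[\ldots] \left( x_S^\mu - Q_{ST}^{\mu-1} \right) + \eta^2 f_S[\ldots] \, f_T[\ldots] + O(1/N) $$

the random vector $\xi^\mu$ enters only in the form of the projections

$$ x_S^\mu = w_S^{\mu-1} \cdot \xi^\mu, \qquad y_\tau^\mu = B_\tau \cdot \xi^\mu $$

and of the distances

$$ d_S^\mu = \left( \xi^\mu - w_S^{\mu-1} \right)^2 = (\xi^\mu)^2 - 2 x_S^\mu + Q_{SS}^{\mu-1} $$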
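The seven characteristic quantities (four values of R and three independent values of Q) are plain dot products, and the distance can be recovered from projections alone. A small sketch, with hypothetical function names and prototypes stored in dicts keyed by ±1:

```python
def dot(a, b):
    """Scalar product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def order_parameters(w, B):
    """R[S, sigma] = w_S . B_sigma and Q[S, T] = w_S . w_T for S, T, sigma in {+1, -1}."""
    R = {(S, s): dot(w[S], B[s]) for S in (1, -1) for s in (1, -1)}
    Q = {(S, T): dot(w[S], w[T]) for S in (1, -1) for T in (1, -1)}
    return R, Q

def distance_from_projections(xi, w, S):
    """d_S = xi^2 - 2 x_S + Q_SS, using only the projection x_S = w_S . xi."""
    return dot(xi, xi) - 2.0 * dot(w[S], xi) + dot(w[S], w[S])
```

Q is symmetric, so `Q[1, -1]` and `Q[-1, 1]` hold the same number; both keys are kept only for convenient lookup.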

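2. average over the current example

random vector $\xi^\mu$ drawn according to $P(\xi \mid \sigma)$: in the thermodynamic limit $N \to \infty$, the projections $x_S^\mu = w_S^{\mu-1} \cdot \xi^\mu$ and $y_\tau^\mu = B_\tau \cdot \xi^\mu$ are correlated Gaussian random quantities, completely specified in terms of first and second moments (w/o indices μ):

$$ \langle x_S \rangle_\sigma = \sum_{j=1}^{N} w_{S,j} \langle \xi_j \rangle_\sigma = \ell \sum_{j=1}^{N} w_{S,j} B_{\sigma,j} = \ell R_{S\sigma}, \qquad \langle y_\tau \rangle_\sigma = \ell \, \delta_{\tau\sigma} = \begin{cases} \ell & \text{if } \tau = \sigma \\ 0 & \text{else} \end{cases} $$

$$ \langle y_\rho y_\tau \rangle_\sigma - \langle y_\rho \rangle_\sigma \langle y_\tau \rangle_\sigma = \delta_{\rho\tau}, \qquad \langle x_S x_T \rangle_\sigma - \langle x_S \rangle_\sigma \langle x_T \rangle_\sigma = Q_{ST}, \qquad \langle x_S y_\tau \rangle_\sigma - \langle x_S \rangle_\sigma \langle y_\tau \rangle_\sigma = R_{S\tau} $$

averaged recursions, $\langle \ldots \rangle = \sum_{\sigma = \pm 1} p_\sigma \langle \ldots \rangle_\sigma$, closed in $\{ R_{S\sigma}, Q_{ST} \}$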

3. self-averaging property

characteristic quantities $Q_{ST}$, $R_{S\sigma}$
- depend on the random sequence of example data
- their variance vanishes with $N \to \infty$ (here: $\propto N^{-1}$)

learning dynamics is completely described in terms of averages

4. continuous learning time

$\alpha = \mu / N$: # of examples, i.e. # of learning steps, per degree of freedom

recursions → coupled, ordinary differential equations

evolution of projections $Q_{ST}(\alpha)$, $R_{S\sigma}(\alpha)$

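5. learning curves

probability for misclassification of a novel example:

$$ \varepsilon_g = p_+ \langle \Theta(d_+ - d_-) \rangle_+ + p_- \langle \Theta(d_- - d_+) \rangle_- $$

$$ \varepsilon_g = p_+ \, \Phi\!\left[ \frac{Q_{++} - Q_{--} - 2\ell \left( R_{++} - R_{-+} \right)}{2 \sqrt{Q_{++} - 2 Q_{+-} + Q_{--}}} \right] + p_- \, \Phi\!\left[ \frac{Q_{--} - Q_{++} - 2\ell \left( R_{--} - R_{+-} \right)}{2 \sqrt{Q_{++} - 2 Q_{+-} + Q_{--}}} \right] $$

where $\Phi(z) = \int_{-\infty}^{z} \frac{dx}{\sqrt{2\pi}} \, e^{-x^2/2}$ is the standard normal cumulative distribution function

generalization error $\varepsilon_g(\alpha)$ after training with $\alpha N$ examples

investigation and comparison of given algorithms
- repulsive/attractive fixed points of the dynamics
- asymptotic behavior for $\alpha \to \infty$
- dependence on learning rate, separation, initialization
- ...

optimization and development of new prescriptions
- time-dependent learning rate η(α)
- variational optimization w.r.t. f[...]
- ...

in both cases the aim is to maximize the decrease of $\varepsilon_g$ with learning time, i.e. to optimize $d\varepsilon_g / d\alpha$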
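The generalization error can be evaluated directly from the characteristic quantities. The sketch below assumes unit-variance clusters as in the model, takes Φ to be the standard normal cumulative distribution function, and uses dicts keyed by pairs of ±1 for R and Q; the function names are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def generalization_error(R, Q, ell, p_plus):
    """eps_g from R[S, sigma] and Q[S, T] for the two-cluster model."""
    p = {1: p_plus, -1: 1.0 - p_plus}
    # 2 * |w_+ - w_-|, the same for both classes
    denom = 2.0 * math.sqrt(Q[1, 1] - 2.0 * Q[1, -1] + Q[-1, -1])
    eps = 0.0
    for s in (1, -1):
        # class-s example is misclassified if the prototype -s is closer
        num = Q[s, s] - Q[-s, -s] - 2.0 * ell * (R[s, s] - R[-s, s])
        eps += p[s] * Phi(num / denom)
    return eps
```

As a plausibility check, prototypes placed exactly on the cluster centers with $p_+ = 1/2$ give $\varepsilon_g = \Phi(-\ell/\sqrt{2})$, and exchanging the two prototypes gives $1 - \Phi(-\ell/\sqrt{2})$.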

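optimal classification with minimal generalization error

in the model situation (equal variances of clusters): separation of classes by the plane with

$$ p_- \, P(\xi \mid \sigma = -1) = p_+ \, P(\xi \mid \sigma = +1) $$

[figure: optimal decision plane between the cluster centers $\ell B_+$ and $\ell B_-$, shown for $p_- > p_+$; label: excess error]
[figure: minimal $\varepsilon_g$ as a function of the prior weight $p_+ \in [0, 1]$ for $\ell = 0, 1, 2$; vertical axis from 0 to 0.50]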
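For unit-variance clusters the minimal generalization error follows from a one-dimensional calculation along the axis through the two centers. The sketch below is a worked version of that calculation, not taken from the slides; it assumes $0 < p_+ < 1$ and the function names are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimal_generalization_error(p_plus, ell):
    """Error of the plane p_- P(xi|-1) = p_+ P(xi|+1); requires 0 < p_plus < 1."""
    p_minus = 1.0 - p_plus
    if ell == 0.0:
        return min(p_plus, p_minus)   # clusters coincide: assign the more frequent class
    a = ell / math.sqrt(2.0)          # half the distance between the centers
    # Threshold on the axis through the centers (class +1 at +a, class -1 at -a).
    z_star = math.log(p_minus / p_plus) / (2.0 * a)
    return p_plus * Phi(z_star - a) + p_minus * Phi(-z_star - a)
```

For $p_+ = 1/2$ the threshold lies midway between the centers and the error reduces to $\Phi(-\ell/\sqrt{2})$; for unequal priors the plane shifts towards the center of the less frequent class.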

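LVQ 2.1: update the correct and the wrong winner

[Seo, Obermayer]: LVQ 2.1 corresponds to a cost function (likelihood ratio)

$$ w_S^\mu = w_S^{\mu-1} + \frac{\eta}{N} \, S \sigma^\mu \left( \xi^\mu - w_S^{\mu-1} \right) $$

prior weights $p_\sigma = (1 + m\sigma)/2$ with $m > 0$

(analytical) integration for $w_S(0) = 0$:

$$ R_{++} = \frac{\ell}{m} \frac{1+m}{2} \left( 1 - e^{-\eta m \alpha} \right), \qquad R_{+-} = -\frac{\ell}{m} \frac{1-m}{2} \left( 1 - e^{-\eta m \alpha} \right) $$

$$ R_{-+} = \frac{\ell}{m} \frac{1+m}{2} \left( 1 - e^{+\eta m \alpha} \right), \qquad R_{--} = -\frac{\ell}{m} \frac{1-m}{2} \left( 1 - e^{+\eta m \alpha} \right) $$

$$ Q_{++} = \ldots $$

theory and simulation ($N = 100$), $p_+ = 0.8$, $\ell = 1$, $\eta = 0.5$, averages over 100 independent runs

- $R_{++}$, $R_{+-}$, $Q_{++}$ remain finite
- $R_{--}$, $R_{-+}$, $Q_{--}$, $Q_{+-}$ diverge with $\alpha \to \infty$

[figure: characteristic quantities vs. α ∈ [0, 10]; vertical axis from −6 to 6]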
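The divergence can be checked numerically. Averaging the recursion for $R_{S\tau}$ with $f = S\sigma$ over the model density gives, as a worked assumption derived here rather than quoted from the slides, $dR_{S\tau}/d\alpha = \eta S (\ell \tau p_\tau - m R_{S\tau})$, whose solution for $w_S(0) = 0$ is $R_{S\tau} = (\ell \tau p_\tau / m)(1 - e^{-S \eta m \alpha})$. The sketch compares this closed form with a simple Euler integration; function names and the step count are illustrative.

```python
import math

def lvq21_R_closed_form(alpha, eta, ell, m):
    """R_{S tau}(alpha) for w_S(0) = 0, with p_tau = (1 + m tau) / 2 and m > 0."""
    R = {}
    for S in (1, -1):
        # saturates for S = +1, grows exponentially in magnitude for S = -1
        growth = 1.0 - math.exp(-S * eta * m * alpha)
        for tau in (1, -1):
            R[S, tau] = ell * tau * (1.0 + m * tau) / (2.0 * m) * growth
    return R

def lvq21_R_euler(alpha, eta, ell, m, steps=20000):
    """Euler integration of dR_{S tau}/d alpha = eta S (ell tau p_tau - m R_{S tau})."""
    R = {(S, tau): 0.0 for S in (1, -1) for tau in (1, -1)}
    h = alpha / steps
    for _ in range(steps):
        for (S, tau), r in list(R.items()):
            p_tau = (1.0 + m * tau) / 2.0
            R[S, tau] = r + h * eta * S * (ell * tau * p_tau - m * r)
    return R
```

With `m = 0.6` (i.e. $p_+ = 0.8$), `ell = 1.0` and `eta = 0.5`, the projections of $w_+$ saturate below $\ell (1+m)/(2m)$ while those of $w_-$ keep growing in magnitude, which is the instability described above.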

problem: instability of the algorithm due to repulsion of wrong prototypes

trivial classification for $\alpha \to \infty$ (shown for $p_+ > p_-$): all data are assigned to the more frequent class, $\varepsilon_g = \min\{p_+, p_-\}$

strategies:
- selection of data in a window close to the current decision boundary: slows down the repulsion, system remains unstable
- Soft Robust Learning Vector Quantization [Seo & Obermayer]: density-estimation based cost function; limiting case "learning from mistakes": LVQ 2.1 step only if the example is currently misclassified; slow learning, poor generalization

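The winner takes it all

I) LVQ 1 [Kohonen]: only the winner is updated, according to the class membership

$$ w_S^\mu = w_S^{\mu-1} + \frac{\eta}{N} \, \Theta\!\left( d_{-S}^\mu - d_S^\mu \right) S \sigma^\mu \left( \xi^\mu - w_S^{\mu-1} \right) $$

- $\Theta(d_{-S}^\mu - d_S^\mu)$: $w_S$ is the winner
- $S \sigma^\mu = \pm 1$: towards / away from the example

numerical integration for $w_S(0) = 0$

theory and simulation ($N = 200$), $p_+ = 0.2$, $\ell = 1.2$, $\eta = 1.2$, averaged over 100 indep. runs

[figure: characteristic quantities $R_{++}, R_{+-}, R_{-+}, R_{--}, Q_{++}, Q_{+-}, Q_{--}$ vs. α]
[figure: trajectories of $w_+$ and $w_-$ in the $(B_+, B_-)$-plane with coordinates $R_{S+}$, $R_{S-}$; markers at α = 20, 40, ..., 140; cluster centers $\ell B_+$, $\ell B_-$; optimal decision boundary; asymptotic position]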
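A single LVQ 1 step for two prototypes can be sketched as follows. The dict layout (prototype $w_S$ stored under key S = ±1), the tie-breaking in favour of $w_+$ and the function names are assumptions made for this illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lvq1_step(w, xi, sigma, eta):
    """Update only the winner: towards xi if its class S equals sigma, away otherwise."""
    N = len(xi)
    # Winner = closest prototype (ties go to S = +1, an arbitrary choice).
    S = 1 if sq_dist(xi, w[1]) <= sq_dist(xi, w[-1]) else -1
    new_w = dict(w)
    new_w[S] = [wj + eta / N * S * sigma * (xj - wj) for wj, xj in zip(w[S], xi)]
    return new_w
```

The losing prototype is returned unchanged, which is the difference to LVQ 2.1, where both prototypes move in every step.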

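learning curves ($p_+ = 0.2$, $\ell = 1.2$)

[figure: $\varepsilon_g$ vs. α ∈ [0, 300] for η = 2.0, 1.2, 0.4, 0.2 and η → 0; vertical axis from 0.14 to 0.26]
- role of the learning rate
- stationary state: $\varepsilon_g(\alpha \to \infty)$ grows linearly with η
- variable rate η(α)!?

[figure: $\varepsilon_g$ vs. rescaled learning time (ηα) ∈ [0, 50]; vertical axis from 0.14 to 0.26; labels: suboptimal, min. $\varepsilon_g$]
- well-defined asymptotics: η → 0, α → ∞, (ηα) → ∞ (ODE linear in η)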

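The winner takes it all

II) LVQ+ (only positive steps, without repulsion)

$$ w_S^\mu = w_S^{\mu-1} + \frac{\eta}{N} \, \Theta\!\left( d_{-S}^\mu - d_S^\mu \right) \delta_{\sigma^\mu, S} \left( \xi^\mu - w_S^{\mu-1} \right) $$

update only if $w_S$ is the winner and correct ($w_S$ is updated only from class S)

[figure: prototypes $w_+$, $w_-$ and cluster centers $\ell B_+$, $\ell B_-$ for $p_+ = 0.2$, $\ell = 1.2$, $\eta = 1.2$; for $\alpha \to \infty$ the asymptotic configuration is symmetric about $\ell (B_+ + B_-)/2$]

classification scheme and the achieved generalization error are independent of the prior weights $p_\pm$ (and optimal for $p_\pm = 1/2$)

LVQ+ corresponds to VQ within the classes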
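The LVQ+ rule differs from LVQ 1 only in that a wrong winner stays where it is. A minimal sketch, with the same assumed conventions as before (prototypes in a dict keyed by S = ±1, ties resolved in favour of $w_+$, hypothetical function names):

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lvq_plus_step(w, xi, sigma, eta):
    """Move the winner towards xi only if it carries the correct label."""
    N = len(xi)
    S = 1 if sq_dist(xi, w[1]) <= sq_dist(xi, w[-1]) else -1   # winner
    if S != sigma:
        return dict(w)          # wrong winner: no repulsion, nothing changes
    new_w = dict(w)
    new_w[S] = [wj + eta / N * (xj - wj) for wj, xj in zip(w[S], xi)]
    return new_w
```

Because each prototype only ever sees examples of its own class, the rule behaves like vector quantization carried out separately within each class.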

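learning curves ($p_+ = 0.2$, $\ell = 1.0$, $\eta = 1.0$)

[figure: $\varepsilon_g$ vs. α for LVQ+ and LVQ1]

asymptotics: η → 0, (ηα) → ∞

[figure: asymptotic $\varepsilon_g$ as a function of $p_+$ for LVQ 2.1, LVQ 1 and LVQ+, together with the optimal classification and $\min\{p_+, p_-\}$]
- LVQ 2.1: trivial assignment to the more frequent class
- LVQ 1: here close to optimal classification
- LVQ+: min-max solution, $p_\pm$-independent classification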

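Vector Quantization: competitive learning

$$ w_S^\mu = w_S^{\mu-1} + \frac{\eta}{N} \, \Theta\!\left( d_{-S}^\mu - d_S^\mu \right) \left( \xi^\mu - w_S^{\mu-1} \right) $$

update only if $w_S$ is the winner; class membership is unknown or identical for all data

numerical integration for $w_S(0) \approx 0$ ($p_+ = 0.2$, $\ell = 1.0$, $\eta = 1.2$)

[figure: $R_{++}, R_{+-}, R_{-+}, R_{--}$ vs. α; the system is invariant under exchange of the prototypes, which gives weakly repulsive fixed points]
[figure: $\varepsilon_g$ vs. α ∈ [0, 300] for VQ, LVQ+ and LVQ1; vertical axis from 0 to 1.0]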

interpretations:
- VQ, unsupervised learning from unlabelled data
- LVQ, two prototypes of the same class, identical labels
- LVQ, different classes, but labels are not used in training

[figure: asymptotic $\varepsilon_g$ (α → ∞, η → 0, ηα → ∞) as a function of $p_+ \in [0, 1]$]
- low quantization error
- high gen. error $\varepsilon_g$

Summary
- prototype-based learning: Vector Quantization and Learning Vector Quantization
- a model scenario: two clusters, two prototypes; dynamics of online training
- comparison of algorithms:
  - LVQ 2.1: instability, trivial (stationary) classification
  - LVQ 1: close to optimal asymptotic generalization
  - LVQ+: min-max solution w.r.t. asymptotic generalization
  - VQ: symmetry breaking, representation

work in progress, outlook
- regularization of LVQ 2.1, Robust Soft LVQ [Seo, Obermayer]
- model: different cluster variances, more clusters/prototypes
- optimized procedures: learning rate schedules, variational approach / density estimation / Bayes optimal on-line
- several classes and prototypes

Perspectives

Generalized Relevance LVQ [Hammer & Villmann]: adaptive metrics, e.g. a distance measure with factors $\lambda_i$ that are adapted in training

$$ d^\lambda(w, \xi) = \sum_{i=1}^{N} \lambda_i \left( \xi_i - w_i \right)^2 $$

Self-Organizing Map (SOM): (many) N-dim. prototypes form a (low) d-dimensional grid; representation of data in a topology preserving map

neighborhood preserving SOM, Neural Gas (distance based)

applications
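The adaptive distance measure is a weighted squared Euclidean distance. The sketch below only evaluates it for given factors; how the factors are adapted during training is not covered, and the function name and the example values are hypothetical.

```python
def relevance_distance(w, xi, lam):
    """d^lambda(w, xi) = sum_i lambda_i (xi_i - w_i)^2 for given relevance factors lam."""
    return sum(l * (x - wi) ** 2 for l, x, wi in zip(lam, xi, w))

# Illustrative values: the second feature is noisy and gets zero relevance.
w = [0.0, 0.0]
xi = [1.0, 5.0]
plain = relevance_distance(w, xi, [1.0, 1.0])      # ordinary squared Euclidean distance
weighted = relevance_distance(w, xi, [1.0, 0.0])   # second feature ignored
```

With all factors equal to 1 the measure reduces to the Euclidean distance used in the rest of the analysis; a factor of 0 removes a feature from the comparison entirely.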