Relevance Vector Machines Explained
Tristan Fletcher

October 19, 2010
Introduction

This document has been written in an attempt to make Tipping's [1] Relevance Vector Machines (RVM) as simple to understand as possible for those with minimal experience of Machine Learning. It assumes knowledge of probability in the areas of Bayes' theorem and Gaussian distributions, including marginal and conditional Gaussian distributions. It also assumes familiarity with matrix differentiation, the vector representation of regression and kernel (basis) functions. These latter two areas are briefly covered in the author's similar paper on Support Vector Machines, which can be found through the URL on the cover page.

The document has been split into two main sections. The first introduces the problem that needs to be solved, namely maximising the posterior probability of regression target values over some hyperparameters, and then proceeds to derive the equations required to do this. Aside from the areas mentioned above where knowledge is assumed, every mathematical step is gone through. It is therefore hoped that there is no ambiguity over any of the details in this explanation, though this does make the description a little cumbersome. The second section then explains, from an algorithmic viewpoint, the iterations that would be required to actually apply the technique. Aside from Tipping [1], the majority of this document is based on work by MacKay [2], [3], [4], [5] and Bishop [6].

Notation

With the view of making this description as explicit (in the mathematical sense) as possible, it is worth introducing some of the notation that will be used:

$P(A \mid B, C)$ is the probability of $A$ given $B$ and $C$. Note that using this notation, different representations of the parameters will not alter this probability, e.g. $P(A \mid B) \equiv P(A \mid B^{-1})$. Furthermore, for the sake of simplicity, some elements of the probabilistic notation will occasionally be omitted where they are not relevant, e.g. $P(A \mid B)$ for $P(A \mid B, C, D, \text{etc.})$.

$X \sim \mathcal{N}(\mu, \sigma^2)$ is used to signify that $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$.

Bold font is used to represent vectors and matrices.
1 Theory

1.1 Evidence Approximation Theory

Linear regression problems are generally based on finding the parameter vector $\mathbf{w}$ and the offset $c$ so that we can predict $y$ for an unknown input $\mathbf{x}$ ($\mathbf{x} \in \mathbb{R}^M$):

$$y = \mathbf{w}^T \mathbf{x} + c$$

In practice we usually incorporate the offset $c$ into $\mathbf{w}$. If there is a non-linear relationship between $\mathbf{x}$ and $y$ then a basis function can be used:

$$y = \mathbf{w}^T \phi(\mathbf{x})$$

where $\mathbf{x} \mapsto \phi(\mathbf{x})$ is a non-linear mapping (i.e. basis function). When attempting to calculate $\mathbf{w}$ from our training examples, we assume that each target $t_i$ is representative of the true model $y_i$, but with the addition of noise $\epsilon_i$:

$$t_i = y_i + \epsilon_i = \mathbf{w}^T \phi(\mathbf{x}_i) + \epsilon_i$$

where the $\epsilon_i$ are assumed to be independent samples from a Gaussian noise process with zero mean and variance $\sigma^2$, i.e. $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. This means that:

$$P(t_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2) \sim \mathcal{N}(y_i, \sigma^2) = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2}(t_i - y_i)^2\right\} = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2}\left(t_i - \mathbf{w}^T \phi(\mathbf{x}_i)\right)^2\right\}$$

Looking at $N$ training points simultaneously, so that the vector $\mathbf{t}$ represents all the individual training points $t_i$ and the $N \times M$ design matrix $\mathbf{\Phi}$ is constructed such that the $i$th row is the vector $\phi(\mathbf{x}_i)^T$, we have:

$$P(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(\mathbf{w}^T \phi(\mathbf{x}_i), \sigma^2) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2}\left(t_i - \mathbf{w}^T \phi(\mathbf{x}_i)\right)^2\right\} = (2\pi\sigma^2)^{-\frac{N}{2}} \exp\left\{-\frac{1}{2\sigma^2}\|\mathbf{t} - \mathbf{\Phi}\mathbf{w}\|^2\right\}$$
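To make the construction of $\mathbf{\Phi}$ concrete, the following is a minimal numpy sketch assuming a Gaussian (RBF) basis with one basis function centred on each training input. The kernel choice and its `width` parameter are illustrative assumptions, since the text leaves the basis function unspecified.

```python
import numpy as np

def rbf_design_matrix(X, centres, width=1.0):
    """N x M design matrix: row i holds phi(x_i) under a Gaussian basis.

    `width` is an illustrative kernel parameter, not prescribed by the text.
    Passing X itself as `centres` gives the usual RVM setup of one basis
    function per training input.
    """
    # Pairwise squared Euclidean distances between inputs and basis centres
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * width ** 2))
```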
When attempting to learn the relationship between $\mathbf{x}$ and $y$, we wish to constrain complexity, and hence the growth of the weights $\mathbf{w}$, and do this by defining an explicit prior probability distribution on $\mathbf{w}$. Our preference for smoother and therefore less complex functions is encoded by using a zero-mean Gaussian prior over each weight $w_i$:

$$P(w_i \mid \alpha_i) \sim \mathcal{N}(0, \alpha_i^{-1})$$

where we have used $\alpha_i$ to describe the inverse variance (i.e. precision) of each $w_i$. If once again we look at all the weights simultaneously, so that the $i$th element in the vector $\boldsymbol{\alpha}$ represents $\alpha_i$, we have:

$$P(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(0, \alpha_i^{-1})$$

This means that there is an individual hyperparameter $\alpha_i$ associated with each weight, modifying the strength of the prior thereon. The posterior probability over all the unknown parameters, given the data, is expressed as $P(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t})$. We are trying to find the $\mathbf{w}$, $\boldsymbol{\alpha}$ and $\sigma^2$ which maximise this posterior probability. We can decompose the posterior:

$$P(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = P(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2)\, P(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) \quad (1.1)$$

Substituting $\beta^{-1}$ for $\sigma^2$ to make the maths appear less cluttered, the first part of (1.1) can be expressed:

$$P(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \beta) \sim \mathcal{N}(\mathbf{m}, \mathbf{\Sigma}) \quad (1.2)$$

where the mean $\mathbf{m}$ and the covariance $\mathbf{\Sigma}$ are given by:

$$\mathbf{m} = \beta \mathbf{\Sigma} \mathbf{\Phi}^T \mathbf{t} \quad (1.3)$$

$$\mathbf{\Sigma} = (\mathbf{A} + \beta \mathbf{\Phi}^T \mathbf{\Phi})^{-1} \quad (1.4)$$

and $\mathbf{A} = \mathrm{diag}(\alpha_i)$. The method for arriving at (1.2), (1.3) and (1.4), relating to conditional Gaussian distributions, lies outside the scope of this document. In order to evaluate $\mathbf{m}$ and $\mathbf{\Sigma}$ we need to find the hyperparameters ($\boldsymbol{\alpha}$ and $\beta$) which maximise the second part of (1.1), which we decompose:

$$P(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) \propto P(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2)\, P(\boldsymbol{\alpha})\, P(\sigma^2)$$

We will assume uniform hyperpriors and hence ignore $P(\boldsymbol{\alpha})$ and $P(\sigma^2)$. Our problem is now to maximise the evidence:

$$P(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^2) \equiv P(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = \int P(\mathbf{t} \mid \mathbf{w}, \beta)\, P(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w} \quad (1.5)$$
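Equations (1.3) and (1.4) translate directly into code. A minimal sketch, assuming numpy arrays `Phi` (shape $N \times M$), `t` (length $N$), `alpha` (length $M$) and a scalar `beta`; the explicit matrix inverse is used for readability, though a linear solver would be preferred numerically.

```python
def posterior(Phi, t, alpha, beta):
    """Posterior mean m (1.3) and covariance Sigma (1.4) over the weights."""
    A = np.diag(alpha)                              # A = diag(alpha_i)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # (1.4)
    m = beta * Sigma @ (Phi.T @ t)                  # (1.3)
    return m, Sigma
```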
Looking at the first component of the equation:

$$P(\mathbf{t} \mid \mathbf{w}, \beta) = \prod_{i=1}^{N} \mathcal{N}(y_i, \beta^{-1}) = \left(\frac{2\pi}{\beta}\right)^{-\frac{N}{2}} \exp\left\{-\frac{\beta}{2}\|\mathbf{t} - \mathbf{\Phi}\mathbf{w}\|^2\right\} \quad (1.6)$$

And then at the second (where $M$ is the dimensionality of $\mathbf{x}$):

$$P(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(0, \alpha_i^{-1}) = \prod_{i=1}^{M} (2\pi\alpha_i^{-1})^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\alpha_i w_i^2\right\} = (2\pi)^{-\frac{M}{2}} \prod_{i=1}^{M} \alpha_i^{\frac{1}{2}} \exp\left\{-\frac{1}{2}\mathbf{w}^T \mathbf{A} \mathbf{w}\right\} \quad (1.7)$$

Substituting (1.6) and (1.7) into (1.5):

$$P(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = \int \left(\frac{2\pi}{\beta}\right)^{-\frac{N}{2}} \exp\left\{-\frac{\beta}{2}\|\mathbf{t} - \mathbf{\Phi}\mathbf{w}\|^2\right\} (2\pi)^{-\frac{M}{2}} \prod_{i=1}^{M} \alpha_i^{\frac{1}{2}} \exp\left\{-\frac{1}{2}\mathbf{w}^T \mathbf{A} \mathbf{w}\right\} d\mathbf{w} = \left(\frac{\beta}{2\pi}\right)^{\frac{N}{2}} \left(\frac{1}{2\pi}\right)^{\frac{M}{2}} \prod_{i=1}^{M} \alpha_i^{\frac{1}{2}} \int \exp\left\{-\frac{\beta}{2}\|\mathbf{t} - \mathbf{\Phi}\mathbf{w}\|^2 - \frac{1}{2}\mathbf{w}^T \mathbf{A} \mathbf{w}\right\} d\mathbf{w} \quad (1.8)$$

In order to simplify (1.8), we create a definition to represent the integrand:

$$E(\mathbf{w}) = \frac{\beta}{2}\|\mathbf{t} - \mathbf{\Phi}\mathbf{w}\|^2 + \frac{1}{2}\mathbf{w}^T \mathbf{A} \mathbf{w} \quad (1.9)$$

This means that:

$$P(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = \left(\frac{\beta}{2\pi}\right)^{\frac{N}{2}} \left(\frac{1}{2\pi}\right)^{\frac{M}{2}} \prod_{i=1}^{M} \alpha_i^{\frac{1}{2}} \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}$$

Expanding out (1.9) we get:

$$E(\mathbf{w}) = \frac{\beta}{2}\left(\mathbf{t}^T \mathbf{t} - 2\mathbf{t}^T \mathbf{\Phi}\mathbf{w} + \mathbf{w}^T \mathbf{\Phi}^T \mathbf{\Phi}\mathbf{w}\right) + \frac{1}{2}\mathbf{w}^T \mathbf{A} \mathbf{w} = \frac{1}{2}\left(\beta\mathbf{t}^T \mathbf{t} - 2\beta\mathbf{t}^T \mathbf{\Phi}\mathbf{w} + \beta\mathbf{w}^T \mathbf{\Phi}^T \mathbf{\Phi}\mathbf{w} + \mathbf{w}^T \mathbf{A} \mathbf{w}\right)$$
Substituting in (1.4) and then using $\mathbf{I} = \mathbf{\Sigma}^{-1}\mathbf{\Sigma}$:

$$E(\mathbf{w}) = \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\mathbf{\Phi}\mathbf{w} + \mathbf{w}^T\mathbf{\Sigma}^{-1}\mathbf{w}\right) = \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\mathbf{\Phi}\mathbf{\Sigma}\mathbf{\Sigma}^{-1}\mathbf{w} + \mathbf{w}^T\mathbf{\Sigma}^{-1}\mathbf{w}\right)$$

Substituting in (1.3):

$$E(\mathbf{w}) = \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\mathbf{m}^T\mathbf{\Sigma}^{-1}\mathbf{w} + \mathbf{w}^T\mathbf{\Sigma}^{-1}\mathbf{w} + \mathbf{m}^T\mathbf{\Sigma}^{-1}\mathbf{m} - \mathbf{m}^T\mathbf{\Sigma}^{-1}\mathbf{m}\right) = E(\mathbf{t}) + \frac{1}{2}(\mathbf{w} - \mathbf{m})^T\mathbf{\Sigma}^{-1}(\mathbf{w} - \mathbf{m})$$

where

$$E(\mathbf{t}) = \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\mathbf{\Sigma}^{-1}\mathbf{m}\right)$$

Our integrand from (1.8) now becomes:

$$\int \exp\{-E(\mathbf{w})\}\, d\mathbf{w} = \exp\{-E(\mathbf{t})\}\, (2\pi)^{\frac{M}{2}} |\mathbf{\Sigma}|^{\frac{1}{2}}$$

Substituting this back in gives us:

$$P(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = \left(\frac{\beta}{2\pi}\right)^{\frac{N}{2}} \left(\frac{1}{2\pi}\right)^{\frac{M}{2}} \prod_{i=1}^{M} \alpha_i^{\frac{1}{2}} \exp\{-E(\mathbf{t})\}\, (2\pi)^{\frac{M}{2}} |\mathbf{\Sigma}|^{\frac{1}{2}}$$

This is known as our marginal likelihood, and taking logs gives us our log marginal likelihood:

$$\ln P(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = \frac{N}{2}\ln\beta - E(\mathbf{t}) + \frac{1}{2}\ln|\mathbf{\Sigma}| - \frac{N}{2}\ln(2\pi) + \frac{1}{2}\sum_{i=1}^{M}\ln\alpha_i \quad (1.10)$$

It is this equation we need to maximise with respect to $\boldsymbol{\alpha}$ and $\beta$, a process known as the evidence approximation procedure.

1.2 Evidence Approximation Procedure

In order to maximise our log marginal likelihood, we start by taking derivatives of (1.10) with respect to each $\alpha_i$ and setting these to zero:

$$\frac{d}{d\alpha_i} \ln P(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = \frac{1}{2\alpha_i} - \frac{1}{2}\Sigma_{ii} - \frac{1}{2}m_i^2 = 0 \;\Longrightarrow\; \alpha_i = \frac{1 - \alpha_i \Sigma_{ii}}{m_i^2}$$
Substituting in $\gamma_i = 1 - \alpha_i \Sigma_{ii}$, our recursive definition for the $\alpha_i$ which maximise (1.10) can be expressed more elegantly as:

$$\alpha_i = \frac{\gamma_i}{m_i^2}$$

We now need to differentiate (1.10) with respect to $\beta$ and set this derivative to zero:

$$\frac{d}{d\beta} \ln P(\mathbf{t} \mid \boldsymbol{\alpha}, \beta) = \frac{1}{2}\left(\frac{N}{\beta} - \|\mathbf{t} - \mathbf{\Phi}\mathbf{m}\|^2 - \mathrm{Tr}\left[\mathbf{\Sigma}\mathbf{\Phi}^T\mathbf{\Phi}\right]\right) = 0 \quad (1.11)$$

In order to solve this, we first simplify the argument of the trace operator $\mathrm{Tr}[\cdot]$:

$$\mathbf{\Sigma}\mathbf{\Phi}^T\mathbf{\Phi} = \mathbf{\Sigma}\mathbf{\Phi}^T\mathbf{\Phi} + \beta^{-1}\mathbf{\Sigma}\mathbf{A} - \beta^{-1}\mathbf{\Sigma}\mathbf{A} = \mathbf{\Sigma}\left(\mathbf{\Phi}^T\mathbf{\Phi}\beta + \mathbf{A}\right)\beta^{-1} - \beta^{-1}\mathbf{\Sigma}\mathbf{A} = \left(\mathbf{A} + \beta\mathbf{\Phi}^T\mathbf{\Phi}\right)^{-1}\left(\mathbf{\Phi}^T\mathbf{\Phi}\beta + \mathbf{A}\right)\beta^{-1} - \beta^{-1}\mathbf{\Sigma}\mathbf{A} = (\mathbf{I} - \mathbf{A}\mathbf{\Sigma})\beta^{-1}$$

Substituting this back into (1.11):

$$\frac{1}{2}\left(\frac{N}{\beta} - \|\mathbf{t} - \mathbf{\Phi}\mathbf{m}\|^2 - \mathrm{Tr}\left[\frac{\mathbf{I} - \mathbf{A}\mathbf{\Sigma}}{\beta}\right]\right) = 0$$
$$\frac{N}{\beta} - \mathrm{Tr}\left[\frac{\mathbf{I} - \mathbf{A}\mathbf{\Sigma}}{\beta}\right] = \|\mathbf{t} - \mathbf{\Phi}\mathbf{m}\|^2$$
$$\beta^{-1}\left(N - \mathrm{Tr}[\mathbf{I} - \mathbf{A}\mathbf{\Sigma}]\right) = \|\mathbf{t} - \mathbf{\Phi}\mathbf{m}\|^2$$
$$\beta^{-1} = \frac{\|\mathbf{t} - \mathbf{\Phi}\mathbf{m}\|^2}{N - \mathrm{Tr}[\mathbf{I} - \mathbf{A}\mathbf{\Sigma}]}$$
$$\beta = \frac{N - \sum_i \gamma_i}{\|\mathbf{t} - \mathbf{\Phi}\mathbf{m}\|^2}$$

The $\boldsymbol{\alpha}$ and $\beta$ which maximise our marginal likelihood are then found iteratively by setting $\boldsymbol{\alpha}$ and $\beta$ to initial values, finding values for $\mathbf{m}$ and $\mathbf{\Sigma}$ from (1.3) and (1.4), using these to calculate new estimates for $\boldsymbol{\alpha}$ and $\beta$, and repeating this process until a convergence criterion is met. We will then be left with values for $\boldsymbol{\alpha}$ and $\beta$ which maximise our marginal likelihood and which we can use to evaluate our predictive distribution over $t_\ast$ for a new input $\mathbf{x}_\ast$:

$$P(t_\ast \mid \mathbf{x}_\ast, \boldsymbol{\alpha}, \beta) = \int P(t_\ast \mid \mathbf{w}, \beta)\, P(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \beta)\, d\mathbf{w} = \mathcal{N}\left(\mathbf{m}^T \phi(\mathbf{x}_\ast), \sigma^2(\mathbf{x}_\ast)\right)$$
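One evidence-approximation iteration then re-estimates $\boldsymbol{\alpha}$ and $\beta$ from the current posterior. A sketch under the same assumptions as the `posterior` function above, with no safeguard against vanishing $m_i$:

```python
def reestimate(Phi, t, m, Sigma, alpha):
    """One update of alpha and beta from the derivations above."""
    N = len(t)
    gamma = 1.0 - alpha * np.diag(Sigma)       # gamma_i = 1 - alpha_i * Sigma_ii
    alpha_new = gamma / m ** 2                 # alpha_i = gamma_i / m_i^2
    beta_new = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    return alpha_new, beta_new
```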
This means that our estimate for $t_\ast$ is the mean of the above distribution, $\mathbf{m}^T \phi(\mathbf{x}_\ast)$. Our confidence in our prediction is determined by the variance of this distribution, $\sigma^2(\mathbf{x}_\ast)$, which is given by:

$$\sigma^2(\mathbf{x}_\ast) = \beta^{-1} + \phi(\mathbf{x}_\ast)^T \mathbf{\Sigma}\, \phi(\mathbf{x}_\ast)$$

1.3 Automatic Relevance Determination

Whilst carrying out the evidence approximation procedure described above, many of the $\alpha_i$ will tend to infinity. This has implications for the variance $\mathbf{\Sigma}$ and mean $\mathbf{m}$ of the posterior distribution over the corresponding weights in (1.2):

$$\lim_{\alpha_i \to \infty} \mathbf{\Sigma} = \lim_{\alpha_i \to \infty} (\mathbf{A} + \beta\mathbf{\Phi}^T\mathbf{\Phi})^{-1} = \mathbf{0}$$
$$\lim_{\alpha_i \to \infty} \mathbf{m} = \lim_{\alpha_i \to \infty} \beta\mathbf{\Sigma}\mathbf{\Phi}^T\mathbf{t} = \mathbf{0}$$

This means that each $w_i$ that such $\alpha_i$ relate to will be distributed as $\mathcal{N}(0, 0)$, i.e. will be equal to zero. The corresponding basis functions $\phi_i(\mathbf{x})$ should therefore be pruned from the overall design matrix $\mathbf{\Phi}$ each iteration. The $\mathbf{x}_i$ corresponding to the remaining non-zero weights after pruning are called relevance vectors and are analogous to the support vectors of an SVM.
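In code, this pruning amounts to dropping the diverged $\alpha_i$ together with the matching entries of $\mathbf{m}$, rows and columns of $\mathbf{\Sigma}$, and columns of $\mathbf{\Phi}$. A sketch, where `alpha_thresh` is an illustrative finite stand-in for infinity and `keep_idx` tracks which original basis functions (and hence which relevance vectors) survive:

```python
def prune(alpha, m, Sigma, Phi, keep_idx, alpha_thresh=1e9):
    """Drop basis functions whose alpha_i has effectively diverged (ARD)."""
    keep = alpha < alpha_thresh                 # weights not yet driven to zero
    return (alpha[keep], m[keep], Sigma[np.ix_(keep, keep)],
            Phi[:, keep], keep_idx[keep])
```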
2 Application

The RVM process is an iterative one and involves repeatedly re-estimating $\boldsymbol{\alpha}$ and $\beta$ until a stopping condition is met. The steps are as follows (a code sketch of the full loop is given at the end of this section):

1. Select a suitable kernel function for the data set and relevant parameters. Use this kernel function to create the design matrix $\mathbf{\Phi}$.
2. Establish a suitable convergence criterion for $\boldsymbol{\alpha}$ and $\beta$, e.g. a threshold value for change $\delta_{Thresh}$ between one iteration's estimation of $\boldsymbol{\alpha}$ and the next, $\delta = |\boldsymbol{\alpha}^{n+1} - \boldsymbol{\alpha}^{n}|$, so that re-estimation will stop when $\delta < \delta_{Thresh}$.
3. Establish a threshold value $\alpha_{Thresh}$, upon reaching which an $\alpha_i$ is assumed to be tending to infinity.
4. Choose starting values for $\boldsymbol{\alpha}$ and $\beta$.
5. Calculate $\mathbf{m} = \beta\mathbf{\Sigma}\mathbf{\Phi}^T\mathbf{t}$ and $\mathbf{\Sigma} = (\mathbf{A} + \beta\mathbf{\Phi}^T\mathbf{\Phi})^{-1}$.
6. Update $\alpha_i = \gamma_i / m_i^2$ and $\beta = (N - \sum_i \gamma_i) / \|\mathbf{t} - \mathbf{\Phi}\mathbf{m}\|^2$.
7. Prune the $\alpha_i$ and corresponding basis functions where $\alpha_i > \alpha_{Thresh}$.
8. Repeat (5) to (7) until the convergence criterion is met.

Our hyperparameter values $\boldsymbol{\alpha}$ and $\beta$ which result from the above procedure are those that maximise our marginal likelihood and hence are those used when making a new estimate of a target value $t_\ast$ for a new input $\mathbf{x}_\ast$:

$$t_\ast = \mathbf{m}^T \phi(\mathbf{x}_\ast) \quad (2.1)$$

The variance relating to our confidence in this estimate is given by:

$$\sigma^2(\mathbf{x}_\ast) = \beta^{-1} + \phi(\mathbf{x}_\ast)^T \mathbf{\Sigma}\, \phi(\mathbf{x}_\ast) \quad (2.2)$$
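As an illustration of steps 1 to 8, the pieces above can be assembled into a short training loop together with the predictive equations (2.1) and (2.2). This is a minimal sketch reusing the `posterior` and `prune` functions from the earlier sketches; the starting values, thresholds and iteration cap are assumptions, since steps 2 to 4 leave them to the practitioner.

```python
def rvm_train(Phi, t, alpha0=1.0, beta0=1.0, alpha_thresh=1e9,
              delta_thresh=1e-3, max_iter=1000):
    """Iterative RVM training following steps 1-8 (step 1, building Phi,
    is assumed done, e.g. via rbf_design_matrix)."""
    N, M = Phi.shape
    alpha = np.full(M, alpha0)                                 # step 4
    beta = beta0
    keep_idx = np.arange(M)            # indices of surviving basis functions
    for _ in range(max_iter):
        m, Sigma = posterior(Phi, t, alpha, beta)              # step 5
        gamma = 1.0 - alpha * np.diag(Sigma)                   # step 6
        alpha_new = gamma / m ** 2
        beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
        delta = np.max(np.abs(alpha_new - alpha))
        alpha = alpha_new
        alpha, m, Sigma, Phi, keep_idx = prune(                # step 7
            alpha, m, Sigma, Phi, keep_idx, alpha_thresh)
        if delta < delta_thresh:                               # step 8
            break
    m, Sigma = posterior(Phi, t, alpha, beta)
    return m, Sigma, beta, keep_idx

def rvm_predict(phi_star, m, Sigma, beta):
    """Predictive mean (2.1) and variance (2.2) for a new basis vector
    phi(x_*), restricted to the surviving basis functions."""
    return m @ phi_star, 1.0 / beta + phi_star @ Sigma @ phi_star
```

Each iteration costs $O(M^3)$ for the inverse in (1.4), but the pruning of step 7 typically shrinks $M$ quickly, which is what keeps the procedure practical.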
References

[1] M. E. Tipping, J. Mach. Learn. Res. 1, 211 (2001).

[2] D. J. C. MacKay, Neural Computation 4, 415 (1992).

[3] D. J. C. MacKay, Neural Computation 4, 448 (1992).

[4] D. J. C. MacKay, Bayesian methods for backprop networks, chap. 6. Springer (1994).

[5] D. J. C. MacKay, Neural Computation 11, 1035 (1999).

[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer (2006).