Regression Using Support Vector Machines: Basic Foundations


Regression Using Support Vector Machines: Basic Foundations
Technical Report, December 2004
Aly Farag and Refaat M. Mohamed
Computer Vision and Image Processing Laboratory
Electrical and Computer Engineering Department
University of Louisville, Louisville, KY 40292

Support Vector Machines (SVM) were developed by Vapnik [1] to solve the classification problem, but recently SVM have been successfully extended to regression and density estimation problems [2]. SVM are gaining popularity due to many attractive features and promising empirical performance. For instance, the formulation of SVM density estimation employs the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle employed in conventional learning algorithms (e.g. neural networks) [3]. SRM minimizes an upper bound on the generalization error, as opposed to ERM, which minimizes the error on the training data. This difference makes SVM more attractive in statistical learning applications.

The traditional formulation of the SVM density estimation problem raises a quadratic optimization problem of the same size as the training data set. This computationally demanding optimization problem prevents the SVM from being the default choice of the pattern recognition community [4]. Several approaches have been introduced for circumventing the above shortcomings of SVM learning. These include simpler optimization criteria for SVM design (e.g. the kernel ADATRON [5]), specialized QP algorithms such as the conjugate gradient method, decomposition techniques (which break down the large QP problem into a series of smaller QP sub-problems), the sequential minimal optimization (SMO) algorithm and its various extensions [6], Nystrom approximations [7], greedy Bayesian methods [8], and the chunking algorithm [9]. Recently, active learning has become a popular paradigm for reducing the sample complexity of large-scale learning tasks (e.g. [10-12]). In active learning, instead of learning from random samples, the learner has the ability to select its own training data. This is done iteratively, and the output of one step is used to select the examples for the next step.

This tutorial presents the mathematical foundations of the SVM regression algorithm. Then, it presents a new learning algorithm which uses Mean Field (MF) theory. MF methods provide efficient approximations which are able to cope with the complexity of probabilistic data models [13]. MF methods replace the intractable task of computing high dimensional sums and integrals by the much easier problem of solving a system of linear equations. The regression problem is formulated so that the MF method can be used to approximate the learning procedure in a way that avoids the quadratic programming optimization. This approach is suitable for high dimensional regression problems, and several experimental examples are presented.

1 Problem Statement and Some Basic Principles

The regression problem can be stated as: given a training data set $D = \{(y_i, t_i),\ i = 1, 2, \ldots, n\}$ of input vectors $y_i$ and associated targets $t_i$, the goal is to fit a function $g(y)$ which approximates the relation inherent in the data set points and which can be used later on to infer the output $t$ for a new input data point $y$.

Any practical regression algorithm has a loss function $L(t, g(y))$, which describes how the estimated function deviates from the true one. Many forms for the loss function can be found in the literature, e.g. linear, quadratic, exponential, etc. In this tutorial, Vapnik's loss function is used, which is known as the $\varepsilon$-insensitive loss function and is defined as:

$$L(t, g(y)) = \begin{cases} 0 & \text{if } |t - g(y)| \le \varepsilon \\ |t - g(y)| - \varepsilon & \text{otherwise} \end{cases} \qquad (1)$$

Figure 1: The soft margin loss function.

where $\varepsilon > 0$ is a predefined constant which controls the noise tolerance. With the $\varepsilon$-insensitive loss function, the goal is to find $g(y)$ that has at most $\varepsilon$ deviation from the actually obtained targets $t_i$ for all training data, and is at the same time as flat as possible. In other words, the regression algorithm does not care about errors as long as they are smaller than $\varepsilon$, but will not accept any deviation larger than this.
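
As a concrete illustration of Eq. (1), the short Python sketch below (not part of the original report; the array values and the choice of $\varepsilon$ are illustrative assumptions) evaluates the $\varepsilon$-insensitive loss for a vector of targets and predictions:

```python
import numpy as np

def eps_insensitive_loss(t, g, eps=0.1):
    """Vapnik's epsilon-insensitive loss of Eq. (1):
    zero inside the eps-tube, linear outside it."""
    residual = np.abs(t - g)
    return np.maximum(residual - eps, 0.0)

# Toy check: deviations of 0.05 and 0.30 with eps = 0.1
t = np.array([1.00, 1.00])
g = np.array([1.05, 1.30])
print(eps_insensitive_loss(t, g, eps=0.1))  # -> [0.  0.2]
```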

For pedagogical reasons, the following discussion begins by describing the case of linear functions $g$, taking the form:

$$g(y) = w \cdot y + b \qquad (2)$$

where $w \in Y$, $Y$ is the input space, $b \in \mathbb{R}$, and $w \cdot y$ is the dot product of the vectors $w$ and $y$.

2 Classical Formulation of the Regression Problem

As stated before, the goal of a regression algorithm is to fit a flat function to the data points. Flatness in the case of Eq. (2) means that one seeks a small $w$. One way to ensure this flatness is to minimize the norm $\|w\|^2$. Thus, the regression problem can be written as a convex optimization problem:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \qquad (3)$$
$$\text{subject to} \quad \begin{cases} t_i - (w \cdot y_i + b) \le \varepsilon \\ (w \cdot y_i + b) - t_i \le \varepsilon \end{cases} \qquad (4)$$

The implied assumption in Eq. (4) is that such a function $g$ actually exists that approximates all pairs $(y_i, t_i)$ with $\varepsilon$ precision, or in other words, that the convex optimization problem is feasible. Sometimes, however, this may not be the case, or we may also want to allow for some errors. Analogously to the soft margin loss function [14], which was adapted to SVM by Vapnik [15], slack variables $\zeta_i, \zeta_i^*$ can be introduced to cope with otherwise infeasible constraints of the optimization problem in Eq. (4). Hence the formulation stated in [15] is attained:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) \qquad (5)$$
$$\text{subject to} \quad \begin{cases} t_i - (w \cdot y_i + b) \le \varepsilon + \zeta_i \\ (w \cdot y_i + b) - t_i \le \varepsilon + \zeta_i^* \\ \zeta_i, \zeta_i^* \ge 0 \end{cases} \qquad (6)$$

The constant $C > 0$ determines the trade-off between the flatness of $g$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. This corresponds to dealing with the so-called $\varepsilon$-insensitive loss function described before.
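
To make the primal problem of Eqs. (5)-(6) concrete, the following minimal Python sketch sets it up with the cvxpy modeling package. This is illustrative only and is not the report's implementation: the use of cvxpy, the synthetic data, and the values of C and $\varepsilon$ are all assumptions.

```python
import numpy as np
import cvxpy as cp

# Synthetic 1-D training data (illustrative only)
rng = np.random.default_rng(0)
Y = rng.uniform(-3, 3, size=(40, 1))
t = np.sin(Y[:, 0]) + 0.1 * rng.standard_normal(40)

C, eps = 10.0, 0.1
n, d = Y.shape

w = cp.Variable(d)
b = cp.Variable()
zeta = cp.Variable(n, nonneg=True)       # slack above the tube
zeta_star = cp.Variable(n, nonneg=True)  # slack below the tube

# Objective of Eq. (5) and constraints of Eq. (6)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(zeta + zeta_star))
constraints = [
    t - (Y @ w + b) <= eps + zeta,
    (Y @ w + b) - t <= eps + zeta_star,
]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```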

As shown in Fig. 1, only the points outside the shaded region contribute to the cost, insofar as the deviations are penalized in a linear fashion. It turns out that in most cases the optimization problem of Eq. (6) can be solved more easily in its dual formulation. Moreover, the dual formulation provides the key for extending the SVM to nonlinear functions. Hence, a standard dualization method utilizing Lagrange multipliers will be described next.

2.1 Dual problem and quadratic programming

The minimization problem in Eq. (6) is called the primal objective function. The key idea of the dual problem is to construct a Lagrange function from the primal objective function and the corresponding constraints, by introducing a dual set of variables. It can be shown that the Lagrange function has a saddle point with respect to the primal and dual variables at the solution (for details see e.g. [16], [17]). The primal objective function with its constraints is transformed to the Lagrange function as follows:

$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) - \sum_{i=1}^{n} (\lambda_i \zeta_i + \lambda_i^* \zeta_i^*) - \sum_{i=1}^{n} \alpha_i \left( \varepsilon + \zeta_i - t_i + (w \cdot y_i + b) \right) - \sum_{i=1}^{n} \alpha_i^* \left( \varepsilon + \zeta_i^* + t_i - (w \cdot y_i + b) \right) \qquad (7)$$

Here $L$ is the Lagrangian and $\alpha_i, \alpha_i^*, \lambda_i, \lambda_i^*$ are Lagrange multipliers. Hence the dual variables in Eq. (7) have to satisfy positivity constraints:

$$\alpha_i, \alpha_i^*, \lambda_i, \lambda_i^* \ge 0. \qquad (8)$$

It follows from the saddle point condition that the partial derivatives of $L$ with respect to the primal variables $(w, b, \zeta_i, \zeta_i^*)$ have to vanish for optimality (note that $\alpha_i^{(*)}$ refers to both $\alpha_i$ and $\alpha_i^*$):

$$\partial_b L = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \qquad (9)$$
$$\partial_w L = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i = 0 \qquad (10)$$
$$\partial_{\zeta_i^{(*)}} L = C - \alpha_i^{(*)} - \lambda_i^{(*)} = 0 \qquad (11)$$

Substituting from Eqs. (9), (10), and (11) into Eq. (7) yields the dual optimization problem:

$$\text{maximize} \quad -\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)(y_i \cdot y_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} t_i (\alpha_i - \alpha_i^*)$$
$$\text{subject to} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C] \qquad (12)$$

In deriving Eq. (12), the dual variables $\lambda_i, \lambda_i^*$ are eliminated through the condition in Eq. (11), which can be reformulated as $\lambda_i^{(*)} = C - \alpha_i^{(*)}$. Eq. (10) can be rewritten as $w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i$, thus:

$$g(y) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)(y_i \cdot y) + b \qquad (13)$$

This is the so-called Support Vector Machines regression expansion, i.e. $w$ can be completely described as a linear combination of the training patterns $y_i$. In a sense, the complexity of a function's representation by SVs is independent of the dimensionality of the input space $Y$, and depends only on the number of SVs. Moreover, the complete algorithm can be described in terms of dot products between the data. Even when evaluating $g(y)$, the value of $w$ does not need to be computed explicitly. These observations will come in handy for the formulation of a nonlinear extension.

2.2 Support Vectors

The Karush-Kuhn-Tucker (KKT) conditions [18, 19] are the basis for the Lagrangian solution. These conditions state that at the solution point, the product between dual variables and constraints has to vanish, i.e.:

$$\alpha_i \left( \varepsilon + \zeta_i - t_i + w \cdot y_i + b \right) = 0$$
$$\alpha_i^* \left( \varepsilon + \zeta_i^* + t_i - w \cdot y_i - b \right) = 0 \qquad (14)$$

$$(C - \alpha_i)\, \zeta_i = 0$$
$$(C - \alpha_i^*)\, \zeta_i^* = 0 \qquad (15)$$
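
To make the expansion of Eq. (13) concrete, the following minimal Python sketch (not from the report; the dual coefficients here are made-up values standing in for the solution of the QP in Eq. (12)) evaluates the linear SVM regression expansion at new points:

```python
import numpy as np

def svr_predict_linear(Y_train, alpha, alpha_star, b, Y_new):
    """Evaluate g(y) = sum_i (alpha_i - alpha_i*) (y_i . y) + b  (Eq. 13)."""
    coef = alpha - alpha_star      # (n,) dual coefficients from Eq. (12)
    gram = Y_train @ Y_new.T       # (n, m) dot products y_i . y
    return coef @ gram + b         # (m,) predictions

# Tiny illustrative call with made-up coefficients
Y_train = np.array([[0.0], [1.0], [2.0]])
alpha = np.array([0.0, 0.5, 0.0])
alpha_star = np.array([0.2, 0.0, 0.0])
print(svr_predict_linear(Y_train, alpha, alpha_star, b=0.1,
                         Y_new=np.array([[1.5]])))
```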

Several useful conclusions can be drawn from these conditions. Firstly, only samples $(y_i, t_i)$ with corresponding $\alpha_i^{(*)} = C$ lie outside the $\varepsilon$-insensitive tube. Secondly, $\alpha_i \alpha_i^* = 0$, i.e. there can never be a set of dual variables $\alpha_i, \alpha_i^*$ which are both simultaneously nonzero. This allows us to conclude:

$$\varepsilon - t_i + w \cdot y_i + b \ge 0 \quad \text{and} \quad \zeta_i = 0 \qquad \text{if } \alpha_i < C \qquad (16)$$
$$\varepsilon - t_i + w \cdot y_i + b \le 0 \qquad \text{if } \alpha_i > 0 \qquad (17)$$

with the analogous conditions holding for $\alpha_i^*$ and $\zeta_i^*$, with $\varepsilon + t_i - w \cdot y_i - b$ in place of $\varepsilon - t_i + w \cdot y_i + b$. (18)

A final note has to be made regarding the sparsity of the SVM expansion. From Eq. (14) it follows that only for $|t_i - g(y_i)| \ge \varepsilon$ may the Lagrange multipliers be nonzero; in other words, for all samples inside the $\varepsilon$-tube (i.e. the shaded region in Fig. 1) the $\alpha_i, \alpha_i^*$ vanish: for $|t_i - g(y_i)| < \varepsilon$ the second factor in Eq. (14) is nonzero, hence $\alpha_i, \alpha_i^*$ have to be zero so that the KKT conditions are satisfied. Therefore there is a sparse expansion of $w$ in terms of $y_i$ (i.e. not all $y_i$ are needed to describe $w$). The training samples that come with nonvanishing coefficients are called Support Vectors.

2.3 Computing b

There are many ways to compute the value of $b$ in Eq. (13). One such way can be found in [20]:

$$b = -\frac{1}{2}\, \left( w \cdot (y_r + y_s) \right) \qquad (19)$$

where $y_r$ and $y_s$ are support vectors (i.e. input vectors which have a nonzero value of $\alpha$ or $\alpha^*$, respectively).

3 Nonlinear Regression: The Kernel Trick

The next step is to make the SVM algorithm nonlinear. This, for instance, could be achieved by simply preprocessing the training patterns $y_i$ by a map $\Psi: Y \to I$ into some feature space $I$, as described in [1], and then applying the standard SVM regression algorithm. Here is a brief look at an example given in [1].

Example 1 (Quadratic features in $\mathbb{R}^2$)

Consider the map $\Psi: \mathbb{R}^2 \to \mathbb{R}^3$ with $\Psi(y_1, y_2) = (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2)$. (The subscripts refer to the components of $y \in \mathbb{R}^2$.) Training a linear SVM on the preprocessed features would yield a quadratic function. While this approach seems reasonable in the particular example above, it can easily become computationally infeasible for both polynomial features of higher order and higher dimensionality.

3.1 Mapping via the Kernel

To overcome the infeasibility of the above approach, the key observation is that the feature map of Example 1 can be rewritten as:

$$(y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) \cdot (y_1'^2, \sqrt{2}\, y_1' y_2', y_2'^2) = (y \cdot y')^2 \qquad (20)$$

As noted in the previous section, the SVM algorithm only depends on dot products between patterns $y_i$. Hence it suffices to know $K(y, y') = \Psi(y) \cdot \Psi(y')$ rather than $\Psi$ explicitly, which allows us to restate the SVM optimization problem as:

$$\text{maximize} \quad -\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(y_i, y_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} t_i (\alpha_i - \alpha_i^*)$$
$$\text{subject to} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C] \qquad (21)$$

Likewise, the expansion of $g$ in Eq. (13) may be written as $w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \Psi(y_i)$ and:

$$g(y) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(y_i, y) + b \qquad (22)$$

An important note here is that in the nonlinear setting, the optimization problem corresponds to finding the flattest function in feature space, not in input space. The details of the conditions for admissible SVM kernel functions can be found in [4]; roughly speaking, any positive semi-definite Reproducing Kernel Hilbert Space (RKHS) kernel function is an admissible SVM kernel. Probably the most used kernel in the literature is the Radial Basis Gaussian Function (RBGF), which is defined as:

$$K(y, y') = \exp\left( -\frac{1}{2}\, (y - y')\, \Lambda^{-1}\, (y - y')^T \right) \qquad (23)$$

where $\Lambda$ is a parameter.
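
The following minimal Python sketch implements the RBGF kernel of Eq. (23) with a diagonal $\Lambda$ and the kernelized prediction of Eq. (22). It is illustrative only; the diagonal form of $\Lambda$, the data values, and the dual coefficients are assumptions, not values from the report.

```python
import numpy as np

def rbgf_kernel(Y1, Y2, lam):
    """Gaussian kernel of Eq. (23) with a diagonal Lambda = diag(lam)."""
    diff = Y1[:, None, :] - Y2[None, :, :]           # (n1, n2, d)
    return np.exp(-0.5 * np.sum(diff**2 / lam, axis=-1))

def svr_predict_kernel(Y_train, alpha, alpha_star, b, Y_new, lam):
    """g(y) = sum_i (alpha_i - alpha_i*) K(y_i, y) + b   (Eq. 22)."""
    K = rbgf_kernel(Y_train, Y_new, lam)              # (n, m)
    return (alpha - alpha_star) @ K + b

# Illustrative call with made-up coefficients
Y_train = np.array([[0.0], [1.0], [2.0]])
alpha, alpha_star = np.array([0.3, 0.0, 0.1]), np.array([0.0, 0.4, 0.0])
print(svr_predict_kernel(Y_train, alpha, alpha_star, 0.0,
                         np.array([[0.5], [1.5]]), lam=np.array([1.0])))
```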

4 Statistical Formulation of the Regression Problem

As stated before, the main problem with the classical formulation of the SVM regression is the optimization problem in Eq. (21). The solution of such an optimization problem is of order $O(n^3)$, which is highly infeasible for a large training sample size $n$. This section presents another formulation of the SVM regression which overcomes this problem.

To construct a Bayesian framework under the assumed loss function in Eq. (1), an exponential model is employed. In this model, the likelihood of the true output $t$ at a given point $y$, provided that the machine output is $g(y)$, denoted $p(t \mid g(y))$, is assumed to follow the relationship:

$$p(t \mid g(y)) = \frac{C}{2(\varepsilon C + 1)}\, \exp\{ -C\, L(t, g(y)) \} \qquad (24)$$

Since the elements of the training sample are assumed to be statistically independent random vectors, the probabilistic interpretation of the SVM regression can be considered to have the following likelihood:

$$p(T \mid g(D)) = \left( \frac{C}{2(\varepsilon C + 1)} \right)^n \exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) \right\} \qquad (25)$$

where $T = [t_1, t_2, \ldots, t_n]$ and $g(D) = [g(y_1), g(y_2), \ldots, g(y_n)]$.

Since the SVM is considered as a maximum a posteriori probability estimator with a Gaussian prior, the prior probability distribution of the prediction $g(y)$ is assumed to be a Gaussian Process (GP). Generally, a GP is a stochastic process which is completely specified by its mean vector and covariance matrix. Thus, the prior probability for a sample $D$ can be expressed as a GP with zero mean (for simplicity) and a covariance function $K(y, y')$ as:

$$p(g(D)) = \frac{1}{\sqrt{(2\pi)^n \det(K_n)}}\, \exp\left\{ -\frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T \right\} \qquad (26)$$

where $K_n = [K(y_i, y_j)]$ is the covariance matrix at the points of $D$. From Bayes' theorem:

$$p(g(D) \mid D) = \frac{p(D \mid g(D))\, p(g(D))}{p(D)} = \frac{M \exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) \right\} \frac{1}{\sqrt{(2\pi)^n \det(K_n)}} \exp\left\{ -\frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T \right\}}{p(D)} \qquad (27)$$
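
To illustrate the zero-mean GP prior of Eq. (26), the sketch below builds the covariance matrix $K_n$ with the RBGF kernel and draws prior samples of $g(D)$. This is an illustration only; the scalar bandwidth, the jitter term added for numerical stability, and the input grid are assumptions.

```python
import numpy as np

def gram_matrix(Y, lam=1.0):
    """Covariance matrix K_n = [K(y_i, y_j)] using the RBGF kernel of Eq. (23)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    return np.exp(-0.5 * d2 / lam)

rng = np.random.default_rng(0)
Y = np.linspace(-5, 5, 50).reshape(-1, 1)
K_n = gram_matrix(Y, lam=2.0) + 1e-8 * np.eye(len(Y))  # small jitter for stability

# Draw three functions g(D) ~ N(0, K_n) from the GP prior of Eq. (26)
L = np.linalg.cholesky(K_n)
samples = L @ rng.standard_normal((len(Y), 3))
print(samples.shape)  # (50, 3)
```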

where

$$M = \left( \frac{C}{2(\varepsilon C + 1)} \right)^n.$$

Let:

$$I = \int \frac{\exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) - \frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T \right\}}{\sqrt{(2\pi)^n \det(K_n)}}\, dg(D) = \int N(g(D) \mid 0, K_n)\, \exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) \right\} dg(D) \qquad (28)$$

where $N(g(D) \mid 0, K_n)$ is a normal distribution with zero mean and covariance matrix $K_n$. Then the normalizing constant $p(D)$ can be expressed as:

$$p(D) = M\, I \qquad (29)$$

From the above discussion, it can be noted that the estimate of the posterior prediction distribution $p(g(D) \mid D)$ is the one which maximizes the numerator of Eq. (27). Equivalently, the MAP estimate is the one which minimizes:

$$\min_{g(D)} \quad C \sum_{i=1}^{n} L(t_i, g(y_i)) + \frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T \qquad (30)$$

The traditional SVM formulation [3] stops at this point and uses quadratic programming optimization, introducing Lagrange multipliers to solve Eq. (30). The size of the optimization problem is the same as the size of the training sample. Thus, if the size of the training sample increases, the optimization problem becomes infeasible (in time and accuracy considerations), if it is feasible at all. Thus, learning algorithms are needed which avoid such a quadratic optimization. In the following, a learning algorithm which accommodates this requirement is presented.

5 Mean Field Theory for Learning of SVM Regression

An approximate formulation of the SVM regression algorithm is desirable to avoid raising the quadratic programming problem of the classical formulation. Recently, the authors of [13] suggested an advanced approach which utilizes some principles of mean field theory to cope with the Gaussian process classification problem. The basic idea of the mean field theory is to approximate the statistics of a random variable which is correlated to other random variables by assuming that the influence of the other variables can be compressed into a single effective mean field with a rather simple distribution.

In this paper, this approach is used to approximate a distribution for the SVM output $g(y_i)$, corresponding to an instance $y_i$ from the training data set, given the rest of the training data set $D^{\setminus i}$, i.e. $p(g(y_i) \mid D^{\setminus i})$. The derivation of this approximation is discussed next.

Using the posterior prediction distribution $p(g(D) \mid D)$ defined in Eq. (27), the prediction (expectation) at a new test point $y$ is given by:

$$\langle g(y) \rangle = \int g(y)\, p(g(y) \mid D)\, dg(y) = \int g(y)\, p(g(y), g(D) \mid D)\, dg(y)\, dg(D) \qquad (31)$$

Substituting from Eq. (27) into Eq. (31), and with some mathematical reduction:

$$\langle g(y) \rangle = \frac{M}{p(D)} \int g(y)\, \frac{1}{\sqrt{(2\pi)^{n+1} \det(K_{n+1})}}\, A\; dg(y)\, dg(D) \qquad (32)$$

where:

$$A = \exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) - \frac{1}{2}\, g(D, y)\, K_{n+1}^{-1}\, g(D, y)^T \right\},$$

$$K_{n+1} = \begin{pmatrix} K_n & K_n(y)^T \\ K_n(y) & K(y, y) \end{pmatrix}, \quad \text{and} \quad K_n(y) = [K(y_1, y), K(y_2, y), \ldots, K(y_n, y)].$$

But:

$$g(y) \exp\left\{ -\frac{1}{2}\, g(D, y)\, K_{n+1}^{-1}\, g(D, y)^T \right\} = -\sum_{i=1}^{n+1} K(y, y_i)\, \frac{\partial}{\partial g(y_i)} \exp\left\{ -\frac{1}{2}\, g(D, y)\, K_{n+1}^{-1}\, g(D, y)^T \right\} \qquad (33)$$

Substituting from Eq. (33) into Eq. (32) and integrating by parts, the prediction reduces to a kernel expansion:

$$\langle g(y) \rangle = \sum_{i=1}^{n} w_i\, K(y, y_i) \qquad (34)$$

where $w_i$ is a constant defined as:

$$w_i = \frac{M}{p(D)} \int N(g(D) \mid 0, K_n)\, \frac{\partial}{\partial g(y_i)} \exp\left\{ -C \sum_{j=1}^{n} L(t_j, g(y_j)) \right\} dg(D) \qquad (35)$$

The weights $w_i$ are estimated using the training sample. One way to facilitate this estimation is to assume a distribution for the expected output corresponding to an instance which is left out of the training data set. This idea is known as the Leave-One-Out principle, in which one instance $y_i$ is tentatively taken away (left out) from the training sample and its corresponding weight $w_i$ is estimated using the remaining data instances and the assumed distribution, which is defined as:

$$p(g(y_i) \mid D^{\setminus i}) = \frac{\int B\, dg(D^{\setminus y_i})}{\int B\, dg(D)} \qquad (36)$$

where $B = N(g(D) \mid 0, K_n)\, \exp\{ -C \sum_{j \ne i} L(t_j, g(y_j)) \}$, $D^{\setminus i}$ is obtained by removing the training data pattern $(y_i, t_i)$ from $D$, and $g(D^{\setminus y_i})$ is obtained by removing the instance $y_i$ from the sample $D$ (so that the numerator integrates over all outputs except $g(y_i)$). It can be noted that $p(g(y_i) \mid D^{\setminus i})$ is the predictive distribution at the test point $y_i$ given the data set $D^{\setminus i}$. With the predictive distribution $p(g(y_i) \mid D^{\setminus i})$, an average (expected) value is defined by:

$$\langle V \rangle_{\setminus i} = \int V\, p(g(y_i) \mid D^{\setminus i})\, dg(y_i) \qquad (37)$$

where $\langle V \rangle_{\setminus i}$ denotes the expected value of $V$ given only the data sample $D^{\setminus i}$. Substituting from Eq. (29) into Eq. (35) for the normalizing constant $p(D)$, and then using Eq. (36) and Eq. (37), the weight coefficient $w_i$ in Eq. (35) can be rewritten as:

$$w_i = \frac{\left\langle \frac{\partial}{\partial g(y_i)} \exp\{ -C\, L(t_i, g(y_i)) \} \right\rangle_{\setminus i}}{\left\langle \exp\{ -C\, L(t_i, g(y_i)) \} \right\rangle_{\setminus i}} \qquad (38)$$

Thus, the weight coefficients in Eq. (34) can be obtained from the rate of variation of the likelihood with respect to the local predictive distribution $p(g(y_i) \mid D^{\setminus i})$. Depending on the form of the local predictive distribution $p(g(y_i) \mid D^{\setminus i})$, a formula for calculating this weight can be obtained. In this paper, a Gaussian approximation is used for $p(g(y_i) \mid D^{\setminus i})$, which has the form:

$$p(g(y_i) \mid D^{\setminus i}) \approx \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\left\{ -\frac{\left( g(y_i) - \langle g(y_i) \rangle_{\setminus i} \right)^2}{2\sigma_i^2} \right\} \qquad (39)$$

where the variance is defined as $\sigma_i^2 = \langle g(y_i)^2 \rangle_{\setminus i} - \langle g(y_i) \rangle_{\setminus i}^2$.

Inserting Eq. (39) into Eq. (37) and evaluating Eq. (38), the weight coefficients can be obtained as:

$$w_i = \frac{F\left( \langle g(y_i) \rangle_{\setminus i}, \sigma_i \right)}{G\left( \langle g(y_i) \rangle_{\setminus i}, \sigma_i \right)} = \frac{F}{G} \qquad (40)$$

where:

$$F = \frac{C}{2} \exp\left\{ C\left( \langle g(y_i) \rangle_{\setminus i} - t_i + \varepsilon \right) + \frac{C^2 \sigma_i^2}{2} \right\} \left( 1 - \operatorname{erf}\left\{ \frac{\langle g(y_i) \rangle_{\setminus i} - t_i + \varepsilon + C\sigma_i^2}{\sqrt{2}\, \sigma_i} \right\} \right) - \frac{C}{2} \exp\left\{ C\left( t_i - \langle g(y_i) \rangle_{\setminus i} + \varepsilon \right) + \frac{C^2 \sigma_i^2}{2} \right\} \left( 1 - \operatorname{erf}\left\{ \frac{t_i - \langle g(y_i) \rangle_{\setminus i} + \varepsilon + C\sigma_i^2}{\sqrt{2}\, \sigma_i} \right\} \right)$$

and

$$G = \frac{1}{2} \operatorname{erf}\left\{ \frac{t_i - \langle g(y_i) \rangle_{\setminus i} + \varepsilon}{\sqrt{2}\, \sigma_i} \right\} - \frac{1}{2} \operatorname{erf}\left\{ \frac{t_i - \langle g(y_i) \rangle_{\setminus i} - \varepsilon}{\sqrt{2}\, \sigma_i} \right\} + \frac{1}{2} \exp\left\{ C\left( \langle g(y_i) \rangle_{\setminus i} - t_i + \varepsilon \right) + \frac{C^2 \sigma_i^2}{2} \right\} \left( 1 - \operatorname{erf}\left\{ \frac{\langle g(y_i) \rangle_{\setminus i} - t_i + \varepsilon + C\sigma_i^2}{\sqrt{2}\, \sigma_i} \right\} \right) + \frac{1}{2} \exp\left\{ C\left( t_i - \langle g(y_i) \rangle_{\setminus i} + \varepsilon \right) + \frac{C^2 \sigma_i^2}{2} \right\} \left( 1 - \operatorname{erf}\left\{ \frac{t_i - \langle g(y_i) \rangle_{\setminus i} + \varepsilon + C\sigma_i^2}{\sqrt{2}\, \sigma_i} \right\} \right) \qquad (41)$$

Equations (40) and (41) are called the Mean Field equations corresponding to the weight coefficient $w_i$. To evaluate the weight coefficients in Eq. (40), it is required to obtain both the mean $\langle g(y_i) \rangle_{\setminus i}$ and the variance $\sigma_i^2$ of the assumed Gaussian model for the local predictive distribution $p(g(y_i) \mid D^{\setminus i})$. The detailed derivation of both $\langle g(y_i) \rangle_{\setminus i}$ and $\sigma_i^2$ based on the mean field theory can be found in [13]; only the final results are summarized here. The posterior average at $y_i$ is given by:

$$\langle g(y_i) \rangle = \sum_{j=1}^{n} w_j\, K(y_i, y_j) \qquad (42)$$

From [13], the following results are obtained:

$$\langle g(y_i) \rangle_{\setminus i} \approx \langle g(y_i) \rangle - \sigma_i^2\, w_i \qquad (43)$$

and

$$\sigma_i^2 \approx \frac{1}{\left[ (\Sigma + K_n)^{-1} \right]_{ii}} - \Sigma_i \qquad (44)$$

where $\Sigma = \operatorname{diag}(\Sigma_1, \Sigma_2, \ldots, \Sigma_n)$ and

$$\Sigma_i = -\left( \frac{\partial w_i}{\partial \langle g(y_i) \rangle_{\setminus i}} \right)^{-1} - \sigma_i^2.$$

The expression for $\partial w_i / \partial \langle g(y_i) \rangle_{\setminus i}$ can be obtained by differentiating Equations (40) and (41); since $F = \partial G / \partial \langle g(y_i) \rangle_{\setminus i}$, it can be written as:

$$\frac{\partial w_i}{\partial \langle g(y_i) \rangle_{\setminus i}} = \frac{1}{G\left( \langle g(y_i) \rangle_{\setminus i}, \sigma_i \right)}\, \frac{\partial F\left( \langle g(y_i) \rangle_{\setminus i}, \sigma_i \right)}{\partial \langle g(y_i) \rangle_{\setminus i}} - w_i^2 \qquad (45)$$

where the mass of the local predictive distribution inside the $\varepsilon$-tube,

$$IG = \int_{t_i - \varepsilon}^{t_i + \varepsilon} p(g(y_i) \mid D^{\setminus i})\, dg(y_i) = \frac{1}{2} \operatorname{erf}\left\{ \frac{t_i - \langle g(y_i) \rangle_{\setminus i} + \varepsilon}{\sqrt{2}\, \sigma_i} \right\} - \frac{1}{2} \operatorname{erf}\left\{ \frac{t_i - \langle g(y_i) \rangle_{\setminus i} - \varepsilon}{\sqrt{2}\, \sigma_i} \right\},$$

enters the evaluation of $\partial F / \partial \langle g(y_i) \rangle_{\setminus i}$.

6 Summary of the Proposed MF-Based SVM Regression Algorithm

The implementation steps of the proposed approach for density estimation using SVM, with the mean field theory applied to the learning process, are presented below:

1. Consider the training data set D.
2. Set a learning rate η and randomly initialize the $w_i$'s.

3. Choose a kernel $K(y, y')$ and, accordingly, calculate the covariance matrix $K_n$; let $\sigma_i^2 = [K_n]_{ii}$.
4. Iterate steps 5 and 6 until convergence of the $w_i$'s.
5. Inner loop: For $i = 1, 2, \ldots, n$ do:
   5.1 Calculate $\langle g(y_i) \rangle$ from Eq. (42).
   5.2 Calculate $\langle g(y_i) \rangle_{\setminus i}$ from Eq. (43).
   5.3 Calculate $F$ and $G$ from Eq. (41).
   5.4 Update $w_i$ by: $w_i = w_i + \eta \left( \frac{F}{G} - w_i \right)$.
6. Outer loop: After every $M$ iterations of the $w_i$ updates, update $\sigma_i^2$ from Eq. (44).

6.1 Remarks on the MF-Based SVM Density Estimation Algorithm

1. The most computationally expensive step in the above algorithm is the inversion of the matrix $K_n + \Sigma$ in step 6. So, it is recommended that step 6 of the outer loop iterate less frequently than step 5 of the inner loop. For example, after $M = 10$ iterations of updating $w$, there is one update of $\sigma_i^2$.
2. The optimization needed to obtain the weights is carried out in the feature space, i.e. after applying the kernel function to the input samples.
3. Since the optimization is done in the feature space, the optimization does not depend on the input space dimensionality, and neither does the density estimation procedure.

7 Sample Results

In this section, a sample of experimental results on the SVM regression is introduced. In this experiment, a data set of 41 points is generated from a mixture of Gaussian functions:

$$f(y) = N(-10, 9) + N(0, 2.5) + N(10, 9) \qquad (46)$$
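
A toy experiment of this kind can be reproduced with the short Python sketch below. It is not the report's MATLAB implementation: scikit-learn's QP-based SVR is used here only as a stand-in for the classical formulation, and the input range, absence of noise, and hyperparameter values (C, ε, γ) are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.svm import SVR

# 41 sample points and the mixture-of-Gaussians target of Eq. (46)
y = np.linspace(-20, 20, 41).reshape(-1, 1)           # assumed input range
t = (norm.pdf(y[:, 0], loc=-10, scale=3) +            # N(-10, 9): variance 9
     norm.pdf(y[:, 0], loc=0, scale=np.sqrt(2.5)) +   # N(0, 2.5)
     norm.pdf(y[:, 0], loc=10, scale=3))              # N(10, 9)

# Classical (QP-based) epsilon-SVR with an RBF kernel as a stand-in
model = SVR(kernel="rbf", C=100.0, epsilon=0.005, gamma=0.1)  # assumed hyperparameters
model.fit(y, t)

approx = model.predict(y)
print("max abs error:", np.max(np.abs(approx - t)))
```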

Figure 2: Function approximation for a mixture of Gaussians function using: (a) the classical formulation, and (b) the statistical formulation of the SVM regression algorithm. (Each panel plots the true function together with its approximation.)

Figure 2 shows the results of applying the SVM regression algorithm to approximate f(y) and illustrates the good performance of the SVM regression algorithm with both types of implementation. The superiority of the statistically based formulation would appear with data sets of large size. The SVM regression is implemented using MATLAB software. The following link contains both the classical and the statistically based implementations.

8 Conclusion

An overview of the mathematical foundations of SVM regression was introduced. The basics of the regression process and the idea of the soft margin loss function were discussed. The classical formulation of the SVM regression algorithm was introduced, along with its shortcomings. The formulation of SVM regression in a statistical setup was then discussed, with the advantage of avoiding the shortcomings of the classical formulation.

References

[1] V. Vapnik, The Nature of Statistical Learning Theory, Second Edition, Springer, New York, 2001.

[2] Refaat M. Mohamed and Aly A. Farag, "Classification of Multispectral Data Using Support Vector Machines Approach for Density Estimation," IEEE Seventh International Conference on Intelligent Engineering Systems (INES'03), Assiut, Egypt, March 2003.

[3] V. Vapnik, S. Golowich, and A. Smola, "Support Vector Method for Multivariate Density Estimation," Advances in Neural Information Processing Systems, Vol. 12, MIT Press.

[4] B. Scholkopf, C. Burges, and A. Smola, Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999.

[5] T. Friess, N. Cristianini, and C. Campbell, "The Kernel ADATRON Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines," 15th International Conference on Machine Learning, July 24-27, 1998, Madison, Wisconsin, USA.

[6] J. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," in Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999.

[7] C. Williams and M. Seeger, "Using the Nystrom Method to Speed Up Kernel Machines," Advances in Neural Information Processing Systems, Vol. 14, 2001.

[8] M. Tipping and A. Faul, "Fast Marginal Likelihood Maximization for Sparse Bayesian Models," International Workshop on AI and Statistics, 2003.

[9] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 121-167, 1998.

[10] P. Mitra, C. Murthy, and S. Pal, "A Probabilistic Active Support Vector Learning Algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 3, March 2004.

[11] D. Cohn, Z. Ghahramani, and M. Jordan, "Active Learning with Statistical Models," Journal of Artificial Intelligence Research, Vol. 4, pp. 129-145, 1996.

[12] D. MacKay, "Information-Based Objective Functions for Active Data Selection," Neural Computation, Vol. 4, No. 4, 1992.

[13] M. Opper and O. Winther, "Gaussian Processes for Classification: Mean Field Algorithms," Neural Computation, Vol. 12, 2000.

[14] K. P. Bennett and O. L. Mangasarian, "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets," Optimization Methods and Software, Vol. 1, pp. 23-34, 1992.

[15] C. Cortes and V. Vapnik, "Support Vector Networks," Machine Learning, Vol. 20, pp. 273-297, 1995.

[16] O. L. Mangasarian, Nonlinear Programming, McGraw-Hill, New York, 1969.

[17] R. J. Vanderbei, "LOQO User's Manual - Version 3.10," Technical Report SOR-97-08, Princeton University, Statistics and Operations Research, 1997.

[18] W. Karush, "Minima of Functions of Several Variables with Inequalities as Side Constraints," Master's thesis, Dept. of Mathematics, University of Chicago, 1939.

[19] H. W. Kuhn and A. W. Tucker, "Nonlinear Programming," 2nd Berkeley Symposium on Mathematical Statistics and Probability, pp. 481-492, Berkeley, 1951.

[20] S. R. Gunn, "Support Vector Machines for Classification and Regression," Technical Report, University of Southampton, School of Electronics and Computer Science, 1998.
