Semi-supervised Learning using Kernel Self-consistent Labeling


Keywords: semi-supervised learning, kernel methods, support vector machine, leave-one-out cross validation, Gaussian process.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

Abstract

We present a new method for semi-supervised learning based on any given valid kernel. Our strategy is to view the kernel as the covariance matrix of a Gaussian process and predict the label of each instance conditioned on all other instances. We then find a self-consistent labeling of the instances by using the hinge loss on the predictions on labeled data and the ε-insensitive loss on the predictions on unlabeled data. Computationally, this leads to a quadratic programming problem with only bound constraints. We examine the similarities and differences between the algorithm and other semi-supervised learning algorithms. Evaluation of the method on both synthetic and real-world datasets gives encouraging results.

1. Introduction

In many applications, labeled data are expensive but unlabeled data can be obtained easily. In these applications, it is desirable to exploit the unlabeled data to improve the performance of learning algorithms. Methods that learn from both labeled and unlabeled data are called transductive or semi-supervised learning algorithms.

Kernel methods have been used successfully in various machine learning problems including classification, regression, ranking and dimensionality reduction. The use of kernels allows non-linear functions to be learned efficiently in these problems by mapping the instances into a higher-dimensional feature space. By designing the kernels appropriately, a rich family of high-dimensional mappings is available to suit various learning problems. In view of the advantages and flexibility of kernel methods, effective semi-supervised kernel learning algorithms could be quite useful.

We use the self-consistency principle to design a semi-supervised kernel classification method. The idea behind this principle is to label the unlabeled data in such a way that a classifier trained using a subset of the (labeled and artificially labeled) data would have low validation error on the (labeled and artificially labeled) data that is not used in training. This principle was used in designing the mincut-based semi-supervised learner of (Blum & Chawla, 2001), where the algorithm was shown to be able to generate a labeling that minimizes the leave-one-out cross validation (LOOCV) error of the 1-nearest neighbor algorithm. A normalized version of the algorithm based on minimizing the ratiocut (the Spectral Graph Transducer) was proposed in (Joachims, 2003).

We propose a support vector style algorithm with ε-insensitive regression constraints for self-consistency on unlabeled points and hinge loss penalties on labeled points. The problem can be formulated as a quadratic programming problem with only bound constraints, for which the global optimum can be found efficiently. As the method is based on leave-one-out Gaussian process prediction using kernels, we call the method kernel self-consistent labeling.

Kernel self-consistent labeling has interesting connections with many existing algorithms. Without the ε-insensitive self-consistency constraints, the method is very similar to a leave-one-out version of the support vector machine. The Representer Theorem states that the solution of the support vector machine can be represented by linearly combining only the kernel functions associated with the labeled data, regardless of whether unlabeled data is present. Interestingly, kernel self-consistent labeling utilizes kernel functions of both the labeled and unlabeled instances even when only the hinge loss penalties on labeled points are used.

Like many semi-supervised learning methods, it is also possible to view kernel self-consistent labeling as a graph-based semi-supervised learning method. Both the mincut and the Spectral Graph Transducer algorithms are based on minimizing edge costs on graphs. Similarly, (Zhu, Ghahramani & Lafferty, 2003) minimizes the sum of squared differences of the function over all pairs of points (edges), weighted by the pair-wise similarity in the graph.
Instead of minimizing a weighted cost on edges, a second class of graph algorithms minimizes node costs. These algorithms express the predicted label of one instance as the average of the other instances' labels, weighted by mutual similarity. In (Roweis & Saul, 2000), the cost function is defined on each node as the squared difference between the node's own function value and the weighted average of the values of its neighbors in a certain locality. Similarly, we can view kernel self-consistent labeling as a method for

minimizing node costs on a graph, using what is known as the equivalent kernel (Silverman, 1984) of the original kernel as the weight matrix of the graph. Conversely, in some graph-based semi-supervised learning methods, such as harmonic energy minimization (Zhu et al., 2003), the Laplacian of the graph (Chung, 1997) can be viewed as the pseudo-inverse of a kernel (Ham et al., 2004).

Kernels can be interpreted as the covariance matrix of a Gaussian process, except with independent Gaussian noise added to the output (Seeger, 2004). Thus, the relationship between the given labels, the probability distribution of the unknown labels, and the pair-wise similarity among examples can all be represented under the framework of multivariate Gaussian distributions. Interestingly, in Gaussian distributions, the expected label of an unlabeled point, conditioned on the known labels, turns out not to utilize the other unlabeled data once the kernel is fixed. In harmonic energy minimization (Zhu et al., 2003), the unlabeled data is utilized to construct the Laplacian, so changing an unlabeled instance results in a different kernel and hence different predictions on the other unlabeled data. In contrast, kernel self-consistent labeling uses a fixed kernel and allows a change in an unlabeled instance to affect the predictions on the other unlabeled instances through self-consistency and the optimization of an appropriate cost function.

The rest of this paper is organized as follows. Section 2 gives some background on Gaussian processes and introduces the leave-one-out prediction formula, which establishes the relationships among all data points based on any valid kernel. Section 3 presents the proposed mixed classification and regression leave-one-out style algorithm that uses the hinge loss on labeled data and the ε-insensitive loss on unlabeled data. The model's connections to existing learning algorithms are discussed in Section 4. Computational efficiency is considered in Section 5. Section 6 contains experimental results. Finally, the paper is concluded in Section 7.
2. Leave-one-out Gaussian Process Prediction

We assume that either a proper kernel function k(x, x') satisfying Mercer's theorem, or a valid Gram matrix K (symmetric and positive semi-definite) (Schölkopf & Smola, 2002) over both the labeled and unlabeled data, is given. By Mercer's theorem,

k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x'),

where {\phi_i(\cdot)} are the eigenfunctions of k(\cdot,\cdot), or the eigenvectors of the Gram matrix. We then define a Bayesian linear regression model:

z(x) = \theta^T \phi(x),   y(x) = z(x) + \epsilon = \theta^T \phi(x) + \epsilon,

where \phi(x) = (\phi_1(x), \phi_2(x), ...), \theta \sim N(0, \Sigma) with \Sigma = diag(\lambda_1, \lambda_2, ...), and the noise \epsilon \sim N(0, \sigma^2) is independent between different x. Then

cov(z(x), z(x')) = \phi(x)^T \Sigma \phi(x') = k(x, x'),
cov(y(x), y(x')) = \phi(x)^T \Sigma \phi(x') + \sigma^2 \delta(x, x') = k(x, x') + \sigma^2 \delta(x, x'),

where \delta(x, x') is the Kronecker delta. So we can view {Z(x_i)} as a Gaussian process (GP) with covariance matrix K if K is nonsingular.

Suppose we have l labeled examples {(x_i, c_i)}_{i=1}^{l} with c_i \in {-1, +1}. Then the negative log posterior of the Gaussian process Z(x) is given, up to constants, by

J = \frac{1}{\sigma^2} (c - z)^T (c - z) + z^T K^{-1} z,    (1)

where the first term comes from the log likelihood and the second term is the log of the prior, acting as a regularization factor. Assume there are U unlabeled examples {(x_i, \cdot)}_{i=l+1}^{n} and denote n = l + U. We can view {Y(x_i)} as a GP as well, with covariance matrix \tilde{K} = K + \sigma^2 I. Suppose \tilde{K} is written in block form as

\tilde{K} = [ \tilde{K}_{LL}  \tilde{K}_{LU} ; \tilde{K}_{UL}  \tilde{K}_{UU} ].

As K is positive semi-definite, \tilde{K} must be positive definite if \sigma^2 > 0. Conditioning on the given labels c, the expected value of the unlabeled data's labels is y_U = \tilde{K}_{UL} \tilde{K}_{LL}^{-1} c. Unfortunately, this shows that the unlabeled data are not used in predicting y_U; only \tilde{K}_{LL} and the row corresponding to the single test example are involved. Unlabeled data are not mutually related, directly or indirectly via the labeled data, in the process of classifying other unlabeled data.

Our strategy for utilizing unlabeled data is to make a leave-one-out prediction for each instance, assuming that the labels of all other instances are known, and to impose self-consistency. We compute a decision vector y with each element associated with one example and then assign labels based on y.
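As a small illustration of the conditional-mean formula above, the following numpy sketch (the toy RBF kernel, sizes, and labels are assumptions, not from the paper) shows that the \tilde{K}_{UU} block never enters the standard GP prediction:

```python
import numpy as np

def gp_conditional_mean(K_noisy, c, n_labeled):
    """Conditional mean of the unlabeled outputs given labels c, for a GP
    with covariance K_noisy = K + sigma^2 I: E[y_U | c] = K_UL K_LL^{-1} c."""
    K_LL = K_noisy[:n_labeled, :n_labeled]
    K_UL = K_noisy[n_labeled:, :n_labeled]
    return K_UL @ np.linalg.solve(K_LL, c)

# A small synthetic RBF Gram matrix on toy 1-D inputs (an assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=8)
K = np.exp(-(x[:, None] - x[None, :]) ** 2) + 1e-4 * np.eye(8)

l = 3
c = np.array([1.0, -1.0, 1.0])
pred = gp_conditional_mean(K, c, l)

# Perturbing the lower-right U x U block leaves the prediction unchanged,
# which is exactly the point made in the text.
K2 = K.copy()
K2[l:, l:] += np.eye(8 - l)
pred2 = gp_conditional_mean(K2, c, l)
print(np.allclose(pred, pred2))  # -> True
```

This is why the leave-one-out construction in the next paragraphs is needed: without it, the unlabeled block of the kernel is simply ignored.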
For labeled data, we do not fix the associated element of y to any predefined value, allowing it to vary. Assume we know z_{-i} = (y_1, ..., y_{i-1}, y_{i+1}, ..., y_n)^T. Then the prediction of y_i is k_i^T K_{-i}^{-1} z_{-i}, where k_i is the i-th row vector of K without its i-th element, and K_{-i} is the matrix obtained by removing the i-th row and i-th column of K. Let b_i be the vector created by inserting a 0 at the i-th position of K_{-i}^{-1} k_i; then the prediction of y_i based on all other elements is b_i^T y. The function b(\cdot,\cdot) is called the equivalent kernel or dual kernel, and is very similar to the equivalent kernel of (Silverman, 1984; Sollich & Williams, 2004), defined in a supervised learning setting. Let B = (b_1, ..., b_n)^T. We call B the GP prediction coefficient matrix.

3. Leave-one-out ρ-ε Learning

Having built up the relationship among all examples through the leave-one-out prediction formula, we can implement the self-consistency principle. For labeled data, since we are only concerned about their class and this is a classification problem, there is no need to force a strong match to a target value as in regression; rather, the maximum margin principle is preferable. On unlabeled data, if we still use margin constraints, as in the Transductive SVM (Joachims, 1999), to find an assignment such that y_i b_i^T y \geq 1 - \xi_i, where y_i is a variable, then we inevitably end up with a non-convex optimization problem. One simple alternative is to use regression constraints such as the ε-insensitive loss function (Vapnik, 1995). Finally, regularization is incorporated in the same way as in (1). Putting it all together, suppose the labels of the first l instances are known and K has already incorporated the Gaussian noise; the problem is formulated as:

Minimize with respect to {y, \xi, \xi'}:

\frac{1}{2} y^T K^{-1} y + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{n} (\xi_i + \xi_i')    (2)

s.t.  c_i b_i^T y \geq 1 - \xi_i,            i = 1, ..., l    (3)
      b_i^T y - y_i \leq \epsilon + \xi_i,   i = l+1, ..., n    (4)
      y_i - b_i^T y \leq \epsilon + \xi_i',  i = l+1, ..., n    (5)
      \xi_i \geq 0, i = 1, ..., n;  \xi_i' \geq 0, i = l+1, ..., n    (6)

Constraint (3) is a margin constraint. Constraints (4) and (5) implement ε-insensitive regression. The first term in (2) is for regularization. We name this mixed classification-regression algorithm ρ-ε learning, because ρ is usually used to denote the margin. The whole problem is a quadratic programming (QP) problem with K (and thus K^{-1}) positive definite. The Lagrangian is:

L = \frac{1}{2} y^T K^{-1} y + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{n} (\xi_i + \xi_i')
    - \sum_{i=1}^{l} \alpha_i (c_i b_i^T y - 1 + \xi_i)
    - \sum_{i=l+1}^{n} \beta_i (y_i - b_i^T y + \epsilon + \xi_i)
    - \sum_{i=l+1}^{n} \gamma_i (b_i^T y - y_i + \epsilon + \xi_i')
    - \sum_{i=1}^{n} \lambda_i \xi_i - \sum_{i=l+1}^{n} \sigma_i \xi_i'    (7)

Setting the derivatives with respect to the slack variables to zero gives

\partial L / \partial \xi_i = 0:   C_1 - \alpha_i - \lambda_i = 0,   i = 1, ..., l    (8)
\partial L / \partial \xi_i = 0:   C_2 - \beta_i - \lambda_i = 0,    i = l+1, ..., n    (9)
\partial L / \partial \xi_i' = 0:  C_2 - \gamma_i - \sigma_i = 0,    i = l+1, ..., n    (10)

and setting \partial L / \partial y = 0 gives

K^{-1} y - \sum_{i=1}^{l} \alpha_i c_i b_i + \sum_{i=l+1}^{n} \beta_i (b_i - e_i) - \sum_{i=l+1}^{n} \gamma_i (b_i - e_i) = 0,

where e_i = (0, ..., 0, 1, 0, ..., 0)^T is the i-th standard basis vector.
So

y = K ( \sum_{i=1}^{l} \alpha_i c_i b_i + \sum_{i=l+1}^{n} \beta_i (e_i - b_i) - \sum_{i=l+1}^{n} \gamma_i (e_i - b_i) ).    (11)

Define x = (\alpha^T, \beta^T, \gamma^T)^T and

R_0 = (c_1 b_1, c_2 b_2, ..., c_l b_l),
R_1 = (e_{l+1} - b_{l+1}, ..., e_n - b_n) = [ 0_{l \times U} ; I_U ] - (b_{l+1}, ..., b_n),
R = (R_0, R_1, -R_1),    (12)

where I_U is the U \times U identity matrix and 0_{l \times U} is the l \times U zero matrix. We obtain

y = K (R_0 \alpha + R_1 \beta - R_1 \gamma) = K R x.    (13)

Substituting (8)-(13) into (7), we obtain the dual problem:

Maximize with respect to {\alpha, \beta, \gamma}:

-\frac{1}{2} x^T R^T K R x + \sum_{i=1}^{l} \alpha_i - \epsilon \sum_{i=l+1}^{n} (\beta_i + \gamma_i)    (14)

s.t.  \alpha_i \in [0, C_1],  i = 1, ..., l    (15)
      \beta_i \in [0, C_2],  \gamma_i \in [0, C_2],  i = l+1, ..., n    (16)

and the prediction formula is (13). Now the constraints are all bound constraints, and a bound-constrained QP can be solved very efficiently. In our experiments using the TRON algorithm (Lin & Moré, 1999), the optimization usually converges to the global optimum within seconds, after a few tens of iterations. As in the standard SVM, which employs a 1-norm penalty, the solutions of ρ-ε learning also enjoy a sparse structure.

4. Model Interpretation and Connections to Other Work

The principal idea of our model is to minimize the leave-one-out (LOO) cross validation error. This is embodied both in the Gaussian process prediction formula and in the mixed classification-regression ρ-ε model.
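The GP prediction coefficient matrix B that underlies the model can be built exactly as defined: row i is K_{-i}^{-1} k_i with a 0 inserted at position i. A minimal numpy sketch follows; the closed form via a single inverse of K is a standard leave-one-out identity added here as an assumption (it is not stated in the paper, which instead uses rank-one updates):

```python
import numpy as np

def coeff_matrix_naive(K):
    """GP prediction coefficient matrix B: row i predicts y_i from all
    other entries, b_i = K_{-i}^{-1} k_i with a 0 inserted at position i."""
    n = K.shape[0]
    B = np.zeros_like(K)
    for i in range(n):
        mask = np.arange(n) != i
        k_i = K[i, mask]
        B[i, mask] = np.linalg.solve(K[np.ix_(mask, mask)], k_i)
    return B

# Toy RBF Gram matrix with a small jitter (all assumptions).
rng = np.random.default_rng(1)
X = rng.normal(size=(7, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)) + 1e-3 * np.eye(7)

B = coeff_matrix_naive(K)

# Equivalent closed form from one inversion of K (assumed LOO identity:
# b_i^T y = y_i - [K^{-1} y]_i / [K^{-1}]_{ii}).
Kinv = np.linalg.inv(K)
B_fast = np.eye(7) - Kinv / np.diag(Kinv)[:, None]
print(np.allclose(B, B_fast))  # -> True
```

Note that the diagonal of B is zero by construction: each instance is predicted only from the others, which is the self-consistency hook exploited by the ρ-ε program.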

4.1 LOO Gaussian Process Prediction

Semi-supervised learning relies heavily on the similarity between instances. Since kernels can also be viewed as descriptions of similarity, it is interesting to interpret existing semi-supervised learning algorithms from the perspective of kernels, or Gaussian processes. In (Zhu et al., 2003), the RBF kernel W is used as the similarity measure:

w_{ij} = \exp( -\|x_i - x_j\|^2 / \sigma^2 ).    (17)

Defining a decision vector f = (f_L^T, f_U^T)^T with f_L fixed to the given 0/1 labels, that algorithm minimizes the weighted sum of the squared differences of f over each edge, which can also be interpreted as a plausible cost for a one-dimensional embedding of the nodes:

\frac{1}{2} \sum_{i,j} w_{ij} (f_i - f_j)^2.    (18)

Let D = diag(d_1, ..., d_n), where d_i = \sum_j w_{ij}, and define the graph Laplacian as \Delta = D - W. The algorithm essentially minimizes the quadratic form f^T \Delta f. Suppose

\Delta = [ \Delta_{LL}  \Delta_{LU} ; \Delta_{UL}  \Delta_{UU} ].

Then the closed-form solution is

f_U = -\Delta_{UU}^{-1} \Delta_{UL} f_L.    (19)

Locally linear embedding (LLE) (Roweis & Saul, 2000) establishes cost functions on each node. Suppose that for node x_i, the combined contribution from all other nodes is \sum_j w_{ij} f_j, subject to \sum_j w_{ij} = 1, with w_{ij} \geq 0 (if the reconstruction of each point is required to lie within the convex hull of its neighbors) and w_{ij} = 0 if x_j does not belong to the set of neighbors of x_i. We would then like the weighted average to be close to f_i, giving the following problem:

Minimize \sum_i ( f_i - \sum_j w_{ij} f_j )^2.    (20)

Using the notation above, this essentially minimizes f^T (I - W)^T (I - W) f, and a similar closed-form solution can be given. (Ham et al., 2004) pointed out that both the Laplacian algorithm (18) and LLE are kernel PCA, with kernels being the pseudo-inverse \Delta^+ and \lambda_{max} I - M (or their centered forms) respectively, where M = (I - W)^T (I - W) and \lambda_{max} is the largest eigenvalue of M. In the minimization problems (18) and (20) for classification tasks, it is M in LLE that plays a role similar to \Delta in the Laplacian method. To make \Delta^+ and \lambda_{max} I - M valid covariance matrices of the associated Gaussian processes, we add a very small noise term \sigma^2 I. Surprisingly, from this perspective the prediction, say (19), does not make use of the information in the lower-right U \times U sub-matrix of the covariance matrix (kernel) associated with the unlabeled data.
This is because for a multivariate Gaussian distribution over (x_L, x_U) with zero mean and covariance

\Sigma = [ \Sigma_{LL}  \Sigma_{LU} ; \Sigma_{UL}  \Sigma_{UU} ],

the expected mean of x_U given x_L is \Sigma_{UL} \Sigma_{LL}^{-1} x_L. What (18) seeks is the x_U that maximizes the logarithm of the Gaussian probability density with x_L fixed, and its solution (19) is exactly this expected conditional mean of x_U. Note, however, that when these methods are used for semi-supervised learning, it is the Laplacian that is constructed from the unlabeled data, hence the kernel changes when the unlabeled data change.

One restriction of the Laplacian algorithm is the requirement of non-negative similarities; otherwise the Laplacian is no longer positive semi-definite. In LLE, the weights are decided by minimizing a cost function, and positive weights may also be required for certain applications. As some kernel functions can take negative values, this restriction precludes a rich source of similarity functions. Using our leave-one-out Gaussian process prediction, we obtain a naturally scaled formula for combining the contributions from all other nodes while utilizing kernel functions, even those that may take negative values. The objective now becomes minimizing \sum_i (f_i - b_i^T f)^2 with respect to f_U in f = (f_L^T, f_U^T)^T, clamping f_L to the given labels (\pm 1 or 0/1). This LOO Gaussian process prediction algorithm makes it possible to use general kernels as similarities in graph algorithms.

4.2 LOO Mixed ρ-ε Learning

It is interesting to investigate the model's behavior as an interpolation between putting unlabeled data on a similar footing to the labeled data and ignoring the unlabeled data completely. In mincut-class graph algorithms, the re-weighting of labeled and unlabeled data can usually be carried out by modifying the weights between nodes, scaling the original weight by a factor η (0 \leq η \leq 1) when an edge connects unlabeled nodes. The simplest corresponding manipulation in our model is to down-weight the entries of B associated with the unlabeled points:

B = [ B_{LL}  B_{LU} ; B_{UL}  B_{UU} ]  \to  [ B_{LL}  ηB_{LU} ; B_{UL}  ηB_{UU} ].

In general, η \leq 1, as unlabeled points are less reliable and less informative than labeled data. Another way to provide this tradeoff is to tune ε.
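The LOO objective from Section 4.1, minimizing \sum_i (f_i - b_i^T f)^2 over f_U with f_L clamped, is an ordinary least-squares problem in f_U. A minimal numpy sketch (the toy data and the single-inverse shortcut for B are assumptions, not from the paper):

```python
import numpy as np

# Toy 1-D data: the first l points are labeled, the rest unlabeled.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-1.0, 1.0, size=9))
K = np.exp(-4.0 * (x[:, None] - x[None, :]) ** 2) + 1e-3 * np.eye(9)
l = 3
f_L = np.array([-1.0, -1.0, 1.0])

# Equivalent-kernel coefficient rows via the standard LOO identity
# (an assumed shortcut; the paper builds B by per-instance inversion).
Kinv = np.linalg.inv(K)
B = np.eye(9) - Kinv / np.diag(Kinv)[:, None]

# Minimize ||(I - B) f||^2 over f_U with f_L clamped: split the columns
# of M = I - B into labeled and unlabeled blocks and solve least squares.
M = np.eye(9) - B
f_U, *_ = np.linalg.lstsq(M[:, l:], -M[:, :l] @ f_L, rcond=None)
f = np.concatenate([f_L, f_U])
```

Labels for the unlabeled points would then be read off as the signs of f_U; the least-squares solution satisfies the stationarity condition M_U^T M f = 0 of the clamped objective.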
By applying the Karush-Kuhn-Tucker (KKT) conditions, we have:

\beta_i ( \epsilon + \xi_i - b_i^T y + y_i ) = 0,
\gamma_i ( \epsilon + \xi_i' + b_i^T y - y_i ) = 0.
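The two losses whose multipliers appear in these conditions can be written down directly; a minimal sketch with toy values (all inputs here are illustrative assumptions):

```python
def hinge(c, pred):
    """Hinge loss on a labeled point (constraint (3)): penalize c*pred < 1."""
    return max(0.0, 1.0 - c * pred)

def eps_insensitive(target, pred, eps):
    """Epsilon-insensitive loss on an unlabeled point (constraints (4)-(5)):
    no penalty while |pred - target| <= eps."""
    return max(0.0, abs(pred - target) - eps)

# Inside the epsilon tube the regression constraints are inactive
# (beta = gamma = 0); a larger eps widens the tube and so shuts off
# the unlabeled part of the model, as discussed next.
print(hinge(+1, 1.3))                  # -> 0.0 (margin satisfied)
print(eps_insensitive(0.2, 0.3, 0.5))  # -> 0.0 (inside the tube)
print(hinge(+1, 0.4))
print(eps_insensitive(0.2, 0.9, 0.5))
```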

When ε is very large, constraints (4) and (5) are unlikely to be active, so the β_i and γ_i are almost certainly 0. The only remaining constraints are on the labeled data. In this sense, tuning ε re-weights the effect of the unlabeled data on our model. If β = 0 and γ = 0, then Rx = R_0 α and the prediction formula becomes

y = K R_0 \alpha.    (21)

Substituting (21) into (14), we have a simplified model:

Maximize  -\frac{1}{2} \alpha^T R_0^T K R_0 \alpha + \sum_{i=1}^{l} \alpha_i    (22)
s.t.  \alpha_i \in [0, C_1],  i = 1, ..., l.    (23)

Specifically, for a data point x, the prediction is

y(x) = k_x^T R_0 \alpha = k_x^T \sum_{i=1}^{l} \alpha_i c_i b_i,    (24)

where k_x = (k(x, x_1), ..., k(x, x_n))^T is the column of K corresponding to the instance x. Since \sum_{i=1}^{l} \alpha_i c_i b_i is a vector independent of x, this result shows that the optimal solution lies in the span of the kernel functions evaluated at both the labeled and unlabeled data points. In comparison, according to the Representer Theorem (Schölkopf & Smola, 2002), the solution of the support vector machine can be represented entirely with kernels associated with the labeled data, even when unlabeled data is present. Note also that the Representer Theorem, applied to ρ-ε learning, shows that the solution can be represented using only kernels associated with the labeled and unlabeled data and does not require kernels associated with unseen data. Hence, when additional examples are given at test time, the model can operate in an inductive fashion rather than only in the transductive style.

There is a natural connection between our mixed ρ-ε model and the leave-one-out (LOO) SVM (Herbrich, 2002). The LOO SVM is formulated as:

Minimize \sum_{i=1}^{l} \xi_i    (25)
s.t.  c_i ( \sum_{j \neq i} \alpha_j c_j k(x_j, x_i) + b ) \geq 1 - \xi_i,  i = 1, ..., l    (26)
      \xi_i \geq 0,  \alpha_i \geq 0.    (27)

The constraints of our mixed ρ-ε model have the form c_i b_i^T y \geq 1 - \xi_i, which differs from (26) but is still a leave-one-out prediction constraint. The objective function (2) has an additional regularization term compared with (25), which just minimizes the 1-norm of the penalties. Despite these differences, both the LOO SVM and our model minimize the leave-one-out error, using similar frameworks to predict one instance's label assuming the labels of the others are known.

5. Efficiency Considerations

Two costly operations are involved in this algorithm: the matrix inversions for building the GP prediction coefficient matrix B, and the calculation of the Hessian matrix for the quadratic program. To calculate B, one needs K_{-i}^{-1} k_i for all i = 1, ..., n, so the bottleneck lies in computing K_{-i}^{-1}. Here we can apply the matrix inversion lemma in its simplest form:

(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u},

where A is an invertible matrix and u, v are vectors. We first calculate K^{-1}. Noticing that each K_{-i} differs from K only in the i-th row and i-th column, each K_{-i}^{-1} can be reached from K^{-1} by rank-one updates of the form u e_i^T and e_i v^T, where e_i is the vector with all elements 0 except the i-th element, which is 1. So K_{-1}^{-1}, ..., K_{-n}^{-1} can be calculated one after another. As each application of the matrix inversion lemma has computational complexity O(n^2), the total cost of building B is O(n^3).

The QP itself can be solved efficiently, and the calculation of the Hessian matrix in (14), with R defined in (12), can also be done quickly. If the one-against-all approach is adopted for multi-class tasks, however, calculating R^T K R multiple times can be costly. But since R = (R_0, R_1, -R_1) and only R_0 depends on the labels, which change during the one-against-all process, we only need to recompute R_0^T K R_0 and R_0^T K R_1. Both cost O(n^2 l), which is small under the typical semi-supervised setting l \ll U; therefore both can be done efficiently. The O(n^2 U) step producing R_1^T K R_1 needs to be done only once.

6. Experimental Results

We evaluate our algorithm on eight datasets. Three are from the 20-newsgroup dataset (19997 instances) (Lang, 1995): baseball-hockey (1993 instances / 2 classes), religion-atheism (1440 / 2), and pc-mac (1943 / 2). Two are from a handwritten digit recognition task: odd-even and 10-digits (10-way), for which we used a simplified form of the data, with 196 features each ranging between 0 and 9; we extracted the same number of examples from each class. The other three datasets are the synthetic 2-spiral (194 / 2 / 2 features), wine (178 / 3 / 13) (UCI Repository), and yeast (1484 / 10 / 8).

All input feature vectors are normalized to unit length, except for the three newsgroup datasets and the 2-spiral dataset. We randomly pick training sets and test the prediction accuracy on the remaining examples, under the constraint that all classes, including each digit as a subclass for the odd-even dataset, must be present in the labeled set. For multi-class tasks, the one-against-all heuristic is used and we pick the class whose function value is largest. The program is written in Matlab, except for the QP solver, which is implemented more efficiently using the TRON algorithm provided by the Toolkit for Advanced Optimization (TAO) (Benson et al., 2004).

We compare our algorithm with the SVM. To ensure that the comparison is fair, we use the RBF kernel for both models. For the SVM, we choose the RBF kernel bandwidth σ (as in (17)) and tradeoff parameter C that yield the best average generalization performance, i.e., that maximize the average testing accuracy over all choices of training set size. For kernel self-consistent labeling, we fix the Gaussian noise variance to a small constant. After some experimentation, we find that a single setting of C_1, C_2, and ε performs well across all datasets and all numbers of labeled data. In fact, the performance is not very sensitive to the exact values of C_1 and C_2, as long as they are of the proper magnitude; this is similar to the behavior of the tradeoff parameter C in the SVM. However, the RBF kernel bandwidth σ needs to be tuned for each dataset. To demonstrate that good values of σ can be found with reasonable effort, we experimented with four bandwidth settings, indexed by a scale factor r, for all datasets. In practice, when enough labeled data exist, k-fold cross validation can be used to select good settings of the parameters.

Figures 1 to 4 show the resulting accuracies. Each result is averaged over 30 trials of randomly picking the labeled data. We observe that self-consistent labeling with a single shared bandwidth setting outperformed the SVM on 6 out of the 8 datasets.
For the other two datasets, yeast and 10-digit (both 10-way multi-class tasks), we tried more settings to see whether settings that outperform the SVM exist, and indeed we found better settings, as shown in Figure 4. For the yeast dataset, a smaller r yields consistently better results, while for the 10-digit dataset, a smaller r gives results with little statistical difference from the SVM in most cases. Whether the preference for a smaller r is due to the larger number of classes or simply to properties of the dataset itself requires further investigation, because the effect of the one-against-all heuristic on kernel self-consistent labeling is less well understood. One avenue of investigation, as future work, is to transplant strategies that extend the SVM to multi-class classification to our algorithm.

For all datasets except the 10-digit dataset, we find that r > 1 yields better results than r < 1 (not shown, because of the consistently poor performance at the smallest r and the trend already visible). In normal supervised learning, a larger training set size usually allows the use of smaller bandwidths to improve performance. This suggests that kernel self-consistent labeling is able to take advantage of the unlabeled data and use kernels with a smaller bandwidth effectively.

It is interesting to investigate the influence of the ratio between labeled positive and negative data. For all five binary tasks, the performance of the SVM grows worse along the sequence #positive/#negative = 1/1, 1/2, 1/3, 1/4, even though the total amount of training data increases. Kernel self-consistent labeling shows a different trend when using unbalanced training data on the 2-spiral dataset: with the number of labeled data growing along that sequence, albeit increasingly unbalanced, there is consistent improvement in accuracy. The bandwidth σ and the optimal C for the SVM are given in Table 1.

Table 1. Optimal SVM parameters (σ and C) for each dataset: baseball-hockey, religion-atheism, pc-mac, odd-even, 10-digits, 2-spiral, wine, and yeast.

Figure 1. Accuracy results (in percentage) for the 2-spiral data. In the upper figure there are equal numbers of positive and negative labeled data; the lower figure shows different numbers of positive/negative labeled data. The legend entries are the values of r.

7 Fgure. Accuracy results for baseball vs hockey (left), relgon vs athesm (mddle), and pc vs mac (rght). he upper mddle fgure has no curve for r = because ts accuracy s always below 5%. Fgure 3. Accuracy results for odd vs even. 7. Conclusons In ths paper, we descrbe a new method of utlzng unlabeled data for classfcaton based on leave-one-out Gaussan process predcton and mxed classfcatonregresson. We motvate ths approach usng the prncples of low tranng error on labeled data and low OOCV error on whole dataset. he essence of the algorthm s to vew any generc vald kernel or Gram matrx as covarance matrx of a Gaussan process and then establsh the leave-one-out predcton formula based on condtonal expectaton under multvarate Gaussan dstrbuton. hs formula connects all nstances and can also possbly be nterpreted as a smlarty measure. Bult upon ths connecton, a support vector style model s proposed by mnmzng the hnge loss on labeled data for classfcaton, and the ε-nsenstve cost on unlabeled data for regresson. he model reduces to a local-optma-free quadratc programmng problem wth only bound constran and can be solved effcently. Both the Hessan matrx and GP predcton coeffcent matrx can be computed

8 Fgure 4. Accuracy results for wne (left), yeast (mddle), and -dgt (rght). For these mult-class tasks, we do not show the nfluence of dfferent rato of labeled data from each class. here s no lne for r = n the rght fgure (-dgt) because ts accuracy s always below 8%. effcently, by matrx block decomposton and matrx nverson lemma respectvely. Expermental results llustrate the advantage of sem-supervsed learnng usng kernel self-consstent labelng when compared to usng the. Further work needs to be done to extend the current bnary classfcaton settngs so that t uses an n-bult mechansm for mult-class classfcaton, rather than general one-aganst-all heurstcs. Other varants of such as γ - or other loss functons lke Huber s robust loss (Schölkopf & Smola, ) can also be utlzed under the framework of self-consstency. Fnally, t s desrable to explore and establsh, under the settngs of sem-supervsed learnng, stronger theoretcal connectons wth RKHS models and learnng theory, where many state-of-the-art learnng models are rooted. References Benson, S., McInnes,., Moré, J., & Sarch, J. (4). AO User Manual (Revson.7) A/MCS-M-4. Blum, A., & Chawla, S. (). earnng from labeled and unlabeled data usng graph mncuts. Proc. of the 8 th Int l Conf. on Machne earnng (pp. 9-6). Chung, F. R. K. (997). Spectral graph theory. o. 9 n Regonal Conference Seres n Mathematcs. Amercan Mathematcal Socety. Crstann,., Shawe-aylor, J., Elsseeff, A., & Kandola, J. (). On kernel-target algnment. In Advances n eural Informaton Processng Systems, 4, Ham, J., ee, D., Mka, S., & Schölkopf, B. (4). A kernel vew of the dmensonalty reducton of manfolds. Proceedngs of the th Internatonal Conference on Machne earnng (pp. 9-97). Joachms,. (999). ransductve nference for text Classfcaton usng support vector machnes. Proceedngs of the 6 th Internatonal Conference on Machne earnng (pp. -9). Joachms,. (3). ransductve learnng va Spectral Graph Parttonng. Proceedngs of the th Internatonal Conference on Machne earnng (pp. 
9-97). n C.-J., & Moré, J. (999). ewton s method for large bound-constraned optmzaton problems. SIOP, 9(4), -7. Herbrch, R. (). earnng kernel classfers: theory and algorthms. Cambrdge, Mass.: MI Press. Rowes, S., & Saul, K. (). onlnear dmensonalty reducton by locally lnear embeddng. Scence, 9, Seeger, M. (4). Gaussan processes for machne learnng. Int l Journal of eural Systems, 4(), Schölkopf, B., & Smola, A. (). earnng wth kernels: support vector machnes, regularzaton, optmzaton, and beyond. Cambrdge, MA: MI Press. Slverman, B. (984). Splne smoothng: the equvalent varable kernel method. Annals of Statstcs,, Sollch, P., & Wllams, C. (4). Usng the equvalent kernel to understand Gaussan Process regresson. In Advances n eural Informaton Processng Systems, 7. Szummer, M., & Jaakkola,. (). Partally labeled classfcaton wth Markov random walks. In Advances n eural Informaton Processng Systems, 4, UCI (). Repostory of machne learnng databases. Vapnk, V. (995). he ature of Statstcal earnng heory. Sprnger, ew York. Zhu, X., Ghahraman, Z., & afferty, J. (3). Semsupervsed learnng usng Gaussan felds and harmonc functons. Proceedngs of the th Internatonal Conference on Machne earnng (pp. 9-99).


Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

CSE 252C: Computer Vision III

CSE 252C: Computer Vision III CSE 252C: Computer Vson III Lecturer: Serge Belonge Scrbe: Catherne Wah LECTURE 15 Kernel Machnes 15.1. Kernels We wll study two methods based on a specal knd of functon k(x, y) called a kernel: Kernel

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Linear Classification, SVMs and Nearest Neighbors

Linear Classification, SVMs and Nearest Neighbors 1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Analyss of Varance and Desgn of Experment-I MODULE VII LECTURE - 3 ANALYSIS OF COVARIANCE Dr Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur Any scentfc experment s performed

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Estimation: Part 2. Chapter GREG estimation

Estimation: Part 2. Chapter GREG estimation Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

18-660: Numerical Methods for Engineering Design and Optimization

18-660: Numerical Methods for Engineering Design and Optimization 8-66: Numercal Methods for Engneerng Desgn and Optmzaton n L Department of EE arnege Mellon Unversty Pttsburgh, PA 53 Slde Overve lassfcaton Support vector machne Regularzaton Slde lassfcaton Predct categorcal

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Chapter 12 Analysis of Covariance

Chapter 12 Analysis of Covariance Chapter Analyss of Covarance Any scentfc experment s performed to know somethng that s unknown about a group of treatments and to test certan hypothess about the correspondng treatment effect When varablty

More information

Sparse Gaussian Processes Using Backward Elimination

Sparse Gaussian Processes Using Backward Elimination Sparse Gaussan Processes Usng Backward Elmnaton Lefeng Bo, Lng Wang, and Lcheng Jao Insttute of Intellgent Informaton Processng and Natonal Key Laboratory for Radar Sgnal Processng, Xdan Unversty, X an

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

LECTURE 9 CANONICAL CORRELATION ANALYSIS

LECTURE 9 CANONICAL CORRELATION ANALYSIS LECURE 9 CANONICAL CORRELAION ANALYSIS Introducton he concept of canoncal correlaton arses when we want to quantfy the assocatons between two sets of varables. For example, suppose that the frst set of

More information

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 )

Yong Joon Ryang. 1. Introduction Consider the multicommodity transportation problem with convex quadratic cost function. 1 2 (x x0 ) T Q(x x 0 ) Kangweon-Kyungk Math. Jour. 4 1996), No. 1, pp. 7 16 AN ITERATIVE ROW-ACTION METHOD FOR MULTICOMMODITY TRANSPORTATION PROBLEMS Yong Joon Ryang Abstract. The optmzaton problems wth quadratc constrants often

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Salmon: Lectures on partial differential equations. Consider the general linear, second-order PDE in the form. ,x 2

Salmon: Lectures on partial differential equations. Consider the general linear, second-order PDE in the form. ,x 2 Salmon: Lectures on partal dfferental equatons 5. Classfcaton of second-order equatons There are general methods for classfyng hgher-order partal dfferental equatons. One s very general (applyng even to

More information

Regularized Discriminant Analysis for Face Recognition

Regularized Discriminant Analysis for Face Recognition 1 Regularzed Dscrmnant Analyss for Face Recognton Itz Pma, Mayer Aladem Department of Electrcal and Computer Engneerng, Ben-Guron Unversty of the Negev P.O.Box 653, Beer-Sheva, 845, Israel. Abstract Ths

More information

17 Support Vector Machines

17 Support Vector Machines 17 We now dscuss an nfluental and effectve classfcaton algorthm called (SVMs). In addton to ther successes n many classfcaton problems, SVMs are responsble for ntroducng and/or popularzng several mportant

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Unified Subspace Analysis for Face Recognition

Unified Subspace Analysis for Face Recognition Unfed Subspace Analyss for Face Recognton Xaogang Wang and Xaoou Tang Department of Informaton Engneerng The Chnese Unversty of Hong Kong Shatn, Hong Kong {xgwang, xtang}@e.cuhk.edu.hk Abstract PCA, LDA

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

MAXIMUM A POSTERIORI TRANSDUCTION

MAXIMUM A POSTERIORI TRANSDUCTION MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,

More information

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur

Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Analyss of Varance and Desgn of Exerments-I MODULE III LECTURE - 2 EXPERIMENTAL DESIGN MODELS Dr. Shalabh Deartment of Mathematcs and Statstcs Indan Insttute of Technology Kanur 2 We consder the models

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Image classification. Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing i them?

Image classification. Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing i them? Image classfcaton Gven te bag-of-features representatons of mages from dfferent classes ow do we learn a model for dstngusng tem? Classfers Learn a decson rule assgnng bag-offeatures representatons of

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Non-linear Canonical Correlation Analysis Using a RBF Network

Non-linear Canonical Correlation Analysis Using a RBF Network ESANN' proceedngs - European Smposum on Artfcal Neural Networks Bruges (Belgum), 4-6 Aprl, d-sde publ., ISBN -97--, pp. 57-5 Non-lnear Canoncal Correlaton Analss Usng a RBF Network Sukhbnder Kumar, Elane

More information

Chapter 8 Indicator Variables

Chapter 8 Indicator Variables Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n

More information

Nonlinear Classifiers II

Nonlinear Classifiers II Nonlnear Classfers II Nonlnear Classfers: Introducton Classfers Supervsed Classfers Lnear Classfers Perceptron Least Squares Methods Lnear Support Vector Machne Nonlnear Classfers Part I: Mult Layer Neural

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Chapter 6 Support vector machine. Séparateurs à vaste marge

Chapter 6 Support vector machine. Séparateurs à vaste marge Chapter 6 Support vector machne Séparateurs à vaste marge Méthode de classfcaton bnare par apprentssage Introdute par Vladmr Vapnk en 1995 Repose sur l exstence d un classfcateur lnéare Apprentssage supervsé

More information

Efficient, General Point Cloud Registration with Kernel Feature Maps

Efficient, General Point Cloud Registration with Kernel Feature Maps Effcent, General Pont Cloud Regstraton wth Kernel Feature Maps Hanchen Xong, Sandor Szedmak, Justus Pater Insttute of Computer Scence Unversty of Innsbruck 30 May 2013 Hanchen Xong (Un.Innsbruck) 3D Regstraton

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Konstantn Tretyakov (kt@ut.ee) MTAT.03.227 Machne Learnng So far Supervsed machne learnng Lnear models Least squares regresson Fsher s dscrmnant, Perceptron, Logstc model Non-lnear

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

Report on Image warping

Report on Image warping Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

Probability Density Function Estimation by different Methods

Probability Density Function Estimation by different Methods EEE 739Q SPRIG 00 COURSE ASSIGMET REPORT Probablty Densty Functon Estmaton by dfferent Methods Vas Chandraant Rayar Abstract The am of the assgnment was to estmate the probablty densty functon (PDF of

More information