9 Adaptive Soft K-Nearest-Neighbour Classifiers with Large Margin


Abstract - A novel classifier is introduced to overcome the limitations of k-NN classification systems. It estimates the posterior class probabilities using a local Parzen window estimation with the k nearest-neighbour prototypes (in the Euclidean sense) to the pattern to classify. A learning algorithm to reduce the number of prototypes, one that maximises the confidence, or margin, on the correct classification of training patterns, is also presented. Experimental results in two handwritten classification problems demonstrate the potential of the proposed classification system.

Index Terms - Soft Nearest Neighbour Classifiers, Online Gradient Descent, Handwritten Character Recognition.

1. Introduction

Nearest neighbour (NN) techniques are simple but powerful non-parametric classification systems. Since their introduction in the fifties, many sophistications of the basic scheme have appeared, concerned with different topics such as condensing, editing, learning, or extensions to include rejection. (See (Dasarathy, 1991) for a review of much of the existing literature.) In recent years, interest in these methods has flourished again in several fields (including statistics, machine learning and pattern recognition), since they remain the best choice in many classification problems.

This chapter presents a novel classification learning architecture in the context of condensed k-NN classifiers, with the goal of enhancing their generalization properties. In learning systems, generalization performance is affected by a trade-off between the number of training examples and the capacity (e.g. the number of parameters) of the learning machine. This global trade-off can be reinterpreted in local learning systems (like NN methods) as a trade-off between capacity and locality (how local the solution is) (Bottou & Vapnik, 1992). One implication is that the learning system must control the effective number of training samples available for training the system locally. k-NN techniques are good examples of such local learning systems: for every testing pattern, they estimate posterior class probabilities with a fixed number of training points. In this way, good generalization is guaranteed, since there is always enough data to compute the estimations. By contrast, these systems have poor approximation capabilities, because they estimate using a ratio of integers. The quality of the k-NN approximation can be simply improved if, instead of the usual ratio, we use a Parzen window estimate computed with the k nearest training samples. The resulting soft estimation could then benefit from the virtues of both k-NN and Parzen estimates, since it improves the approximation quality of crisp k-NN estimators by means of a smooth interpolation while retaining their control over the training points that contribute to the estimations. However, an important disadvantage of the method, inherited from its ancestors, is that all of the training data must be retained to compute the estimations. So we also present a more sophisticated version of the algorithm that allows fewer data points to be used. This includes a learning algorithm to compute a condensed set of prototypes from the training data.

The organization of this chapter is as follows. Section 2 derives the soft k-NN classifier in the context of local learning algorithms (Bottou & Vapnik, 1992). In Section 3 we introduce an adaptive version of the previous classification rule: the proposed learning algorithm designs a condensed set of prototypes that minimizes the empirical average number of misclassifications.

Section 4 shows experimental results that successfully compare the adaptive soft k-NN classifier with other algorithms using two NIST handwritten character databases (Garris, 1997). Finally, in Sections 5 and 6 some discussion and conclusions are given.

2. The soft k-NN classifier

Let c: X → {1, ..., C} be a maximum classifier that assigns an input vector x to one of the C existing classes. This kind of classifier is defined in terms of a set of discriminant functions d_l(x), l = 1, ..., C: the classifier c assigns x to class i if d_i(x) > d_j(x) for all j ≠ i. In the case of equal misclassification costs, c achieves the minimum expected number of misclassifications when d_l is the posterior probability of belonging to class l, P(C_l | x). Suppose we have N observation pairs T = {(x_i, y_i)}, i = 0, ..., N-1, where x_i is a random sample taken from the input space X that belongs to one of the classes, and y_i is an indicator variable: if x_i belongs to the nth class, the nth coefficient of y_i is equal to 1 and the other coefficients are equal to 0. To ensure that c is a good classification procedure, each discriminant function d_l(x) must estimate the binary random variable y_l (equal to 1 if x belongs to the lth class and 0 otherwise). In local algorithms this estimation is defined, for each neighbourhood around the testing pattern x, as the one that minimizes a weighted empirical average loss function of the following form ((Bottou & Vapnik, 1992), p. 892):

    I_{emp}^{local}[d_l] = \sum_{i=0}^{N-1} G(x_i - x; \gamma) \, J(y_{il}, d_l(x_i; \gamma)), \qquad d_l^*(x; \gamma) = \arg\min_{d_l} I_{emp}^{local}[d_l], \quad l \in \{1, ..., C\}    (1)

where the kernel G is a bounded and even function on X that is peaked about 0, γ denotes the locality control parameters associated with G, and d_l^*(·; γ), l = 1, ..., C, are the optimal discriminant functions. If we use a constant approximation d_l (with respect to x) of y_l and a quadratic loss J(y_l, d_l) = (y_l - d_l)^2, solving ∂I_{emp}^{local}[d_l]/∂d_l = 0, l = 1, ..., C, yields

    d_l^*(x; \gamma) = \frac{\sum_{i=0}^{N-1} G(x_i - x; \gamma) \, y_{il}}{\sum_{i=0}^{N-1} G(x_i - x; \gamma)}    (2)

As we see from inspecting (2), these optimal local discriminant functions compute kernel-based estimates of the posterior probabilities of the classes ((Bishop, 1995), §2.5). The shape of the kernel G, controlled by γ, then determines different well-known solutions to the same original problem. If G = G_H, where G_H is a square kernel whose width is adjusted to contain exactly k samples, then equation (2) is the k-NN algorithm. On the other hand, if G = G_S, where G_S is a smooth kernel with a width modulated by a locality parameter σ, then (2) is the Parzen window algorithm.

The k-NN technique considers a hypersphere centered at the testing pattern x that contains exactly k training samples. This feature prevents the estimator from running out of training data, ensuring good generalization; in practice, however, its estimates approximate the posterior probabilities poorly, since the region around x is not small enough to keep the estimations, a ratio of integers (k_l/k), valid. By contrast, the Parzen window algorithm can estimate better, since it is based on a smooth interpolation between training points, although it does not effectively control the trade-off between the capacity of the learning device and the number of training samples. This is due to using a fixed width parameter σ for all the training points: if the distribution of the training patterns in the input space is uneven, it will be impossible to ensure that enough training data contribute significantly to the estimations for every testing pattern x.

We propose an approach that benefits from the virtues of these two techniques and can be understood as a compromise between them: for each testing pattern x, a Parzen window estimate is computed using the k nearest (in the Euclidean sense) training points. In fact, this algorithm results from using a composed kernel G_soft = G_H · G_S in equation (2). As we show in the Appendix, the resulting equation also estimates the posterior class probabilities. Equation (2) is then called the soft k-NN algorithm and can be written as follows:

    d_l^*(x; \gamma) = \frac{\sum_{u=1}^{k_l} G(x_{lu} - x; \gamma)}{\sum_{m=1}^{C} \sum_{u=1}^{k_m} G(x_{mu} - x; \gamma)}    (3)

where C_l^{k-NN} = {x_{lu}, u = 1, ..., k_l} are those prototypes of class l among the k nearest prototypes to x, and \sum_{l=1}^{C} k_l = k. We remark that the (crisp) k-NN algorithm is recovered from equation (3) in the limit σ → ∞. In the particular case k = 2, the resulting classifier is equivalent to the 1-nearest-neighbour classification rule, since only two prototypes are involved in the decision. Figure 1 shows the classification borders for Ripley's synthetic problem (Ripley, 1994) in the case of the soft k-NN and crisp k-NN (σ → ∞) rules. Note that the soft k-NN algorithm can smooth the crisp classification border, where σ controls the degree of smoothness.
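As a concrete illustration of rule (3), the following sketch (in Python with NumPy; the function and variable names are ours, not the chapter's) computes the soft k-NN posterior estimates from a set of labelled prototypes with a Gaussian kernel:

```python
import numpy as np

def soft_knn_posteriors(x, prototypes, labels, n_classes, k, sigma):
    """Soft k-NN estimate of P(C_l|x), as in equation (3): a Gaussian
    Parzen window evaluated over the k nearest prototypes only.
    prototypes: (P, dim) array; labels: (P,) int array of their classes."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)   # squared Euclidean distances
    nearest = np.argsort(d2)[:k]                 # indices of the k nearest
    w = np.exp(-d2[nearest] / sigma ** 2)        # Gaussian kernel weights
    post = np.bincount(labels[nearest], weights=w, minlength=n_classes)
    return post / post.sum()
```

A hard decision assigns x to the class with the highest returned value. As sigma grows, all weights tend to 1 and the estimate reduces to the crisp vote k_l/k, while with k = 2 the rule behaves as 1-NN, as noted above.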

[Figure 1: panels show classification borders for k = 3, 11, 33, 55 and 99, each with σ → ∞ (the crisp rule) and several finite values of σ between 0.001 and 0.1.]

Fig. 1. Soft & crisp k-nearest-neighbour classification borders on Ripley's synthetic problem for several values of σ and k. (The soft version uses Gaussian kernels.)

3. The adaptive soft k-NN classifier

Given a training set T, we can consider applying (2) using G_soft, classifying a new input into the class whose discriminant function has the highest value. However, when the training set size is large, storage and processing requirements make this simple algorithm unusable in engineering applications. In the case of (crisp) k-NN classifiers, many heuristic sophistications of the basic algorithm allow fewer data points to be used (e.g. Condensed NN (Hart, 1968)). By contrast, our approach can simply be extended by replacing the local kernel estimation with a simplified adaptive mixture model (McLachlan & Basford, 1988). Suppose we define for each class density function a mixture as follows:

    f_{X|C_l}(x) \, P(C_l) = \sum_{j=1}^{M} f_{X|C_l, j}(x)    (4)

where {f_{X|C_l}, l = 1, ..., C} are the class density probabilities, {P(C_l), l = 1, ..., C} are the priors, and {f_{X|C_l, j}, j = 1, ..., M, l = 1, ..., C} are known parametric modes restricted to the form G(x - w_{lj}; γ), with G(u) a decreasing function peaked about u = 0. Applying Bayes' theorem, we arrive at the following expression for the posterior probabilities:

    P(C_l | x) = \frac{\sum_{j=1}^{M} f_{X|C_l, j}(x)}{\sum_{m=1}^{C} \sum_{u=1}^{M} f_{X|C_m, u}(x)}    (5)

Since each mode of the mixture is locally defined, we can approximate equation (5) using only the k nearest modes to the testing pattern x. If we use this simplification to construct the classifier, the discriminant functions give

    P(C_l | x) \approx \frac{\sum_{j=1}^{k_l} f_{X|C_l, j}(x)}{\sum_{m=1}^{C} \sum_{u=1}^{k_m} f_{X|C_m, u}(x)}    (6)

where {f_{X|C_l, u}(x), l = 1, ..., C, u = 1, ..., k_l} are the k nearest modes to x. Since the mixture modes are locally defined around their centers w_{lj}, P(C_l | x) can then be approximated by

    P(C_l | x) \approx \frac{\sum_{u=1}^{k_l} G(w_{lu} - x; \gamma)}{\sum_{m=1}^{C} \sum_{u=1}^{k_m} G(w_{mu} - x; \gamma)}    (7)

where {w_{lu}, u = 1, ..., k_l, l = 1, ..., C} are the k nearest mixture centers to x and \sum_{l=1}^{C} k_l = k. The total number of mixture modes (M·C) will typically be much smaller than the number of training points (N), and the mixture mode centers (w_{lj}) are no longer constrained to coincide with the training points. The parameters of the mixture modes could be estimated using unsupervised procedures such as the EM algorithm (Dempster, 1977), although the approach followed here makes use of a supervised learning procedure based on minimizing the number of misclassifications.

3.1. The loss function

Consider a training set of random pairs T = {(x_i, y_i)}, i = 0, ..., N-1. If the pattern x_i belongs to the nth class, the nth coefficient of y_i is equal to 1 and the other coefficients are equal to 0. As a loss function we propose a modification of the cross-entropy error function for multiple classes ((Bishop, 1995), §6.9):

    L = 1 - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{l=1}^{C} y_{il} \, d_l(x_i)    (8)

(We will see in the next subsection, when we derive the learning equations, why equation (8) can be of more practical interest than the cross-entropy error.) The absolute minimum with respect to the {d_l(x_i)} occurs when all the training points are correctly classified with the highest confidence (or maximum margin), that is, when d_l(x_i) = y_{il} for all i and l. In the limit in which the size N of the training set goes to infinity, we can substitute the sum over patterns with the following integral:

    \langle L \rangle = \lim_{N \to \infty} L = 1 - \sum_{l=1}^{C} \int_{\mathbb{R}^d} y_l(x) \, d_l(x) \, f_X(x) \, dx    (9)

where f_X is the probability density of the input space.
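In code, the empirical loss (8) is simply one minus the average discriminant value assigned to the correct classes. A minimal sketch under our own naming (D holds the soft k-NN outputs, Y the one-hot targets):

```python
import numpy as np

def average_margin_loss(D, Y):
    """Empirical loss of equation (8): L = 1 - (1/N) sum_i sum_l y_il * d_l(x_i).
    D: (N, C) discriminant outputs; Y: (N, C) one-hot labels.  L attains its
    absolute minimum, 0, when every pattern is classified with maximum
    confidence, i.e. d(x_i) = y_i."""
    return 1.0 - np.mean(np.sum(Y * D, axis=1))
```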

Since y_l(x) = 1(x ∈ class l), where 1 is the indicator function

    1(\text{condition}) = \begin{cases} 1 & \text{if condition is true} \\ 0 & \text{otherwise} \end{cases}    (10)

and f_X(x) = \sum_{l=1}^{C} f_{X|C_l}(x) P(C_l), equation (9) can easily be rewritten as

    \langle L \rangle = 1 - \sum_{l=1}^{C} \left[ \int_{R_l} d_l(x) \, f_{X|C_l}(x) P(C_l) \, dx + \int_{\mathbb{R}^d \setminus R_l} d_l(x) \, f_{X|C_l}(x) P(C_l) \, dx \right]    (11)

where R_l is the decision region of class l, that is, the input region in which d_l has the highest value among the discriminant functions. ⟨L⟩ would be minimized when d_l = 1 for all x in R_l and f_{X|C_l}(x) P(C_l) = max_m f_{X|C_m}(x) P(C_m) for all x in R_l. But, as we know, the best possible choice for d_l is P(C_l | x). If we use an optimization method to minimize ⟨L⟩ iteratively, then, since the algorithm will try to minimize the number of misclassifications, a convergence of d_l to P(C_l | x) is possible (at least near the Bayes borders). In this case, if class overlapping is moderate, the maximum expected number of correct classifications could be approximated from ⟨L⟩_min,

    \langle L \rangle_{min} \approx 1 - \sum_{l=1}^{C} \int_{R_l} P(C_l | x) \, f_{X|C_l}(x) P(C_l) \, dx    (12)

This latter result gives us, when N is large enough, a possible tendency of the system from minimizing equation (8) during the learning phase.

3.2. The learning algorithm

In real-life pattern recognition problems, training sets are usually large and high dimensional. So algorithmic simplicity of the learning procedure, or of the optimization process, is a requirement in order to achieve a solution with moderate computational resources. Online gradient descent algorithms (Bottou, 1998) are among the simplest computational approaches to the learning problem. Other features of these algorithms include better convergence speed than their batch counterparts on redundant training sets (e.g. real-world databases) and a good mathematical characterization (Benveniste, 1990)(Bottou, 1998).

3.2.1. Derivation of the online gradient learning algorithm

We restrict the mixture model parameters to the centers and a globally defined parameter σ that controls locality. The goal of the learning phase is thus reduced to computing a set of labelled codebooks C_l = {w_{lj}, j = 1, ..., M}, l = 1, ..., C, one for each class, that minimizes an estimator of the expected loss function given in equation (9),

    \langle L \rangle(\{w_{lj}\}) = E_X[\, Q(x, \{w_{lj}\}) \,]    (13)

Since E_X[·] is unknown, an empirical estimator is formed with a (training) set of random sample pairs L = {(x_i, class index(x_i))}, i = 0, ..., N-1. The online gradient learning algorithm estimates using the instantaneous loss function Q(x, {w_{lj}}). Each step of this algorithm consists of picking (cyclically or randomly) one sample pair from the training set L and applying the following update equation:

    w_{lj}[k+1] = w_{lj}[k] - \eta[k] \, H(w_{lj}[k], x[k])    (14)

The learning rates η[k] are positive numbers, usually made to decrease monotonically with the discrete time k, although they can also be constant in time. The term H(w_{lj}, x) is defined as

    H(w_{lj}, x) = \frac{\partial Q(x, \{w_{lj}\})}{\partial w_{lj}} \ \text{where the derivative exists, and } 0 \text{ otherwise}    (15)

Given a training set of infinite size, the necessary (but not sufficient, see (Bottou, 1998), §5) condition to ensure the convergence of the iterative equation (14) to a local minimum of ⟨L⟩ is that

    E_X[H(w_{lj}, x)] = \frac{\partial \langle L \rangle(\{w_{lj}\})}{\partial w_{lj}}    (16)

Since Q is non-differentiable only on a finite number of points and the iterative algorithm has zero probability of reaching them, this condition is met (see (Bottou, 1998), p. 10). In the case of using a constant η, we must ensure that η is small enough (see (Benveniste et al., 1990), p. 48). Practically, an optimal value of η can be estimated using a validation set.
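The update (14) amounts to a generic online loop of the following form; this is a schematic rendering with our own names, where H is supplied as a callable and eta maps the step counter to a learning rate:

```python
def online_gradient_descent(w, training_pairs, eta, H, sweeps=1):
    """Online gradient descent of equation (14):
    w[k+1] = w[k] - eta[k] * H(w[k], x[k]), picking samples up cyclically."""
    k = 0
    for _ in range(sweeps):
        for x, label in training_pairs:       # one cyclic pass over L
            w = w - eta(k) * H(w, x, label)   # instantaneous gradient step
            k += 1
    return w
```

Here eta can implement a monotonically decreasing schedule, e.g. eta = lambda k: eta0 / (1 + k / tau), or simply return a small constant, as discussed above.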

Let G(x - w_{lj}) be exp(-\|x - w_{lj}\|^2 / \sigma^2), and let L' = (\sigma^2/2) L and \eta'[k] = (\sigma^2/2) \eta[k]. We have modified the loss function for analytical convenience, since the added constant does not affect the desired points of convergence. Hence equation (15) yields

    H(w_{lj}, x) = \begin{cases}
      -d_{W,lj}(x) \, (1 - d_c(x)) \, (x - w_{lj}) & \text{if } w_{lj} \in C^{k\text{-NN}}(x) \text{ and } l = c \\
      \ \ d_{W,lj}(x) \, d_c(x) \, (x - w_{lj}) & \text{if } w_{lj} \in C^{k\text{-NN}}(x) \text{ and } l \neq c \\
      \ \ 0 & \text{otherwise}
    \end{cases}    (17)

where c = class index(x), C^{k-NN}(x) are the k nearest prototypes (in the Euclidean sense) to x, and d_{W,lj} is equal to

    d_{W,lj}(x) = \frac{\exp(-\|x - w_{lj}\|^2 / \sigma^2)}{\sum_{m=1}^{C} \sum_{u=1}^{k_m} \exp(-\|x - w_{mu}\|^2 / \sigma^2)}    (18)

As σ goes to infinity, exp(-\|\cdot\|^2/\sigma^2) tends to unity, so H can be simplified, since d_{W,lj} tends to 1/k and d_l to k_l/k. Finally, the learning equation gives

    w_{lj}[k+1] = \begin{cases}
      w_{lj}[k] + \eta[k] \, d_{W,lj}(x[k]) \, (1 - d_c(x[k])) \, (x[k] - w_{lj}[k]) & \text{if } w_{lj}[k] \in C^{k\text{-NN}}(x[k]) \text{ and } l = c \\
      w_{lj}[k] - \eta[k] \, d_{W,lj}(x[k]) \, d_c(x[k]) \, (x[k] - w_{lj}[k]) & \text{if } w_{lj}[k] \in C^{k\text{-NN}}(x[k]) \text{ and } l \neq c \\
      w_{lj}[k] & \text{otherwise}
    \end{cases}    (19)

where c = class index(x[k]) and x[k] = x_{(k \bmod N)} if we pick the samples up cyclically.
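Putting (17)-(19) together, one online step of the learning rule can be sketched as follows (our reconstruction and naming; W stacks all M·C prototype centers row-wise):

```python
import numpy as np

def soft_knn_update(W, proto_labels, x, c, k, sigma, eta):
    """One online update of rule (19).  W: (M*C, dim) prototype matrix,
    proto_labels: (M*C,) int array of their class indices, x: training
    pattern, c: its class index, eta: the current learning rate."""
    d2 = np.sum((W - x) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]                  # the k nearest prototypes
    g = np.exp(-d2[nearest] / sigma ** 2)
    d_W = g / g.sum()                             # d_W,lj of equation (18)
    d_c = d_W[proto_labels[nearest] == c].sum()   # discriminant of the true class
    for dwj, idx in zip(d_W, nearest):
        if proto_labels[idx] == c:                # attract same-class winners
            W[idx] += eta * dwj * (1.0 - d_c) * (x - W[idx])
        else:                                     # repel other-class winners
            W[idx] -= eta * dwj * d_c * (x - W[idx])
    return W
```

Prototypes outside the k nearest are left untouched, matching the third case of (19).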

3.2.2. Properties of the learning equation

As we pointed out before, the proposed loss function has the same minimum points as the cross-entropy error function

    L_{cross\text{-}entropy} = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{l=1}^{C} y_{il} \ln d_l(x_i)    (20)

If we derive the online gradient equations for this error function, we obtain equation (17) divided by d_l(x). So the practical differences between performing online gradient descent with one loss function or the other must be determined by analysing how this term can affect the evolution of the optimization process. Suppose that several outliers are present in the training set; this is a typical situation in real-world databases. When we pick an outlier of class l, the amount of change in a winning prototype w_{lj} will be

    \Delta w_{lj}^{cross\text{-}entropy} = \eta \, d_{W,lj}(x) \, \frac{1 - d_l(x)}{d_l(x)} \, (x - w_{lj})    (21)

Since it is probable that d_l(x) takes a low value when x is an outlier of class l, the term (1 - d_l(x))/d_l(x) tends to infinity. In this way, Δw^{cross-entropy} can be arbitrarily large, and the normal evolution of the learning process can be enormously degraded by the presence of outliers. By contrast, if we use our proposed loss function, outliers have little impact, since the term d_{W,lj} dominates the amount of change and tends to zero as the distance between the winning prototype and the outlier grows large.

4. Empirical Results

In this section, we compare Kohonen's LVQX algorithms (Kohonen, 1996) (where X is 1, 2.1 and 3) and k-NN classifiers with our proposal. We have used the NIST (uppercase & lowercase) handwritten characters database (Garris, 1997), which can be found in the directories train/hsf_4, train/hsf_6 and train/hsf_7. We have split the directory train/hsf_7 into two sets, one for validation and one for test (over 10,000 samples), preserving the original class distribution of the data; the other two directories were chosen for training. The images (32x32 pixels) have been preprocessed with a Principal Component Analyzer that extracts 64 components from each image (NIST's mis2evt utility). The correlation matrix of the PCA was computed using the training set.

4.1. Simulations #1: Adaptive Soft k-NN vs. LVQ

Ten runs have been made for each classification problem, learning algorithm and several codebook sizes. We have initialized the codebooks using LVQ_PAK's eveninit program (Kohonen et al., 1995). We have then applied every learning algorithm until the classification error on the validation set increases or stalls. (We have monitored this error at a fixed number of iterations; in the case of the LVQX algorithms, every time we monitored the error we restarted their α value.) Optimal parameters of the algorithms have been estimated using the validation set.

4.2. Results

Two main aspects are of interest in this section. The first concerns the comparison of our classifier with its crisp counterparts when our system is not constrained to use the crisp k-NN estimation. The second compares the classifiers when all of them make use of this classification rule. In this way we can evaluate whether the use of a more complex classification procedure, the soft k-NN estimation, is of practical interest and, secondly, what happens when we force our system to compete with other well-established methods for designing prototype sets for crisp k-NN classifiers.

Tables 1 and 2 summarize the average classification error on the uppercase and lowercase test sets. Each result is the average of ten trials for each learning algorithm (except in the case of k-NN classifiers, whose test error is computed once using the whole training set as the prototype set). In Table 1, we show the average misclassification rate of our proposed classifier in four different cases. In the first case, we fix the number of nearest prototypes k equal to 2 and σ to infinity. In the second case, with k fixed to 2, we make use of the best σ value estimated by cross-validation. In these first two cases, the resulting classification systems are crisp 1-NN classifiers, since only the two nearest prototypes contribute to the soft estimations. In the third case, we fix σ to infinity and make use of the best estimated k value (with k greater than 2); our classifier is then the crisp k-NN classifier. Finally, in the fourth case, we constrain k to be greater than 2 and allow σ to take a finite value; both parameters are estimated with the validation set under these constraints. Table 2 shows the classification results of 1-NN classifiers whose codebooks are designed using Kohonen's LVQ algorithms, and of crisp k-NN classifiers that use the whole training database as the prototype set.

The condensed soft k-NN classifiers (the fourth case) always outperform crisp 1-NN classifiers based on LVQ algorithms for each tested codebook size. In uppercase and lowercase recognition, the best results are achieved when the codebook size is 800; in this case, our classification systems also outperform k-NN classifiers that use the whole training database as the prototype set. On average, our learning system correctly recognized 94,62% of the uppercase handwritten letters and 87,46% of the lowercase handwritten characters, against 91,77% and 85,94% respectively for LVQ2.1, the best of the LVQ algorithms. Differences between our procedure and the LVQ algorithms grow notably greater as the codebook size decreases. We also note that the evolution of the misclassification rate as the codebook size grows is less abrupt with our procedure. This behaviour suggests that our procedure achieves a more graceful degradation of the classification accuracy as the codebook size gets smaller. Our best crisp 1-NN classifiers, the best solutions of the first two cases, clearly outperform the LVQ-based crisp 1-NN classifiers except in one case: when the codebook size is equal to 200, the codebooks designed with the LVQ2.1 algorithm are better than ours.

It is interesting to compare the classification accuracy between case 1 and case 2, since both learning algorithms design codebooks for a crisp 1-NN classifier, although they differ in the way of doing it. While the first algorithm uses an infinite σ, the second one does not, and can better define how local the solution is, since σ regulates how many training data contribute to updating each prototype during the learning phase. In both classification problems we observe the same behaviour: for small codebooks (100 and 200), the effect of using a finite σ allows an optimal solution to be achieved, whereas the use of an infinite σ works better for medium codebooks. Classifiers of the third case only achieve better classification accuracy than k-NN classifiers (which use the whole training set as prototypes) when their codebook size is 800, and they are inferior to the best solutions achieved in the above crisp cases. Finally, we compare the best solutions of cases 1, 2 and 4: the soft estimation is superior to the crisp estimation in six out of eight cases. Only in lowercase handwritten recognition is the crisp estimation superior on average, when the codebook sizes are 400 and 800.

Codebook   Upper                                                        Lower
size       σ→∞, k=2   best σ<∞, k=2   σ→∞, best k>2   best σ<∞, best k>2   σ→∞, k=2   best σ<∞, k=2   σ→∞, best k>2   best σ<∞, best k>2
100        18,2       14,52           25,95 (3)       11,94 (5)            24,7       20,92           31,0 (3)        18,8 (11)
200        10,98      10,63           15,8 (3)        9,0 (5)              17,62      17,26           22,69 (3)       15,93 (5)
400        7,36       7,97            10,09 (3)       6,53 (5)             13,3       15,28           17,3 (3)        14,02 (5)
800        5,45       7,22            5,74 (3)        5,38 (11)            11,28      14,1            11,82 (3)       12,54 (5)

Table 1. Experimental results for our proposed classifier. We show for each algorithm the average test error over ten runs. The number in parentheses denotes the value of k.

Codebook        Upper                                 Lower
size            LVQ1    LVQ2.1   LVQ3    k-NN         LVQ1    LVQ2.1   LVQ3    k-NN
100             21,5    15,82    16,02   -            27,89   22,4     22,1    -
200             11,3    9,94     11,4    -            20,85   16,8     18,39   -
400             11,8    8,24     9,08    -            17,27   14,84    15,1    -
800             11,6    8,23     8,26    -            15,53   14,06    14,43   -
Whole database  -       -        -       ,84 ()       -       -        -       , (7)

Table 2. Experimental results for LVQ-based and k-NN classifiers. We show for each algorithm the average test error over ten runs, except in the case of k-NN classification. For k-NN classifiers, the number in parentheses denotes the value of k.

4.3. Simulations #2: The utility of soft outputs for post-processing purposes

An additional feature of the soft k-NN approach is that soft labels can be assigned to test patterns. This can enhance the overall classification accuracy when some post-processing is performed using context (e.g. relations between patterns in time). This section explores how the misclassification rate improves when we accept that the correct solution was among the top n greatest activations of the discriminant functions. In this way, we can detect whether a correct solution is among the top n greatest activations, so that a post-processor could then recover the correct solution. We use the best soft k-NN classifiers of Section 4.2 for the upper problem, and now we perform ten runs for each classifier, checking the validation error at fixed intervals of steps. Tables 3 and 4 show the averaged validation and test error, respectively, using the top n highest activations. (The classifier makes a correct decision if the label of the training pattern agrees with one of the labels of the discriminant functions with highest activation.) Note that the top-1 results in Table 4 are slightly different from the fourth column of Table 1, since the classification error is now checked at different points, so the algorithm stops at different solutions. As Tables 3 and 4 show, if a correct solution is allowed to be among the top 2 highest activations, the classification error is reduced by more than a half (except in the case of 100 prototypes).

[Table 3: averaged validation error for the top 1 to top 4 highest activations, for each codebook size.]

Table 3. Averaged validation error over ten runs for the best soft k-NN classifiers in the upper problem. (Note: a classification error is produced if the label of the training pattern does not agree with one of the labels of the discriminant functions with highest activation.)

[Table 4: averaged test error for the top 1 to top 4 highest activations, for each codebook size.]

Table 4. Averaged test error over ten runs for the best soft k-NN classifiers in the upper problem.
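The top-n errors reported in Tables 3 and 4 can be computed from the soft outputs as in the following sketch (a hypothetical helper; D holds one row of discriminant activations per pattern and y the true labels):

```python
import numpy as np

def top_n_error(D, y, n):
    """Fraction of patterns whose true label is not among the n classes
    with the highest discriminant activations."""
    top = np.argsort(D, axis=1)[:, -n:]          # n highest activations per row
    correct = np.any(top == y[:, None], axis=1)
    return 1.0 - correct.mean()
```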

4.4. Simulations #3: Overtraining and overfitting in adaptive soft k-NN classifiers

Overtraining (that is, the excessive tuning of the solution to the training set due to reaching a deep minimum of the cost function) can lead to solutions with poor generalization capabilities. If the number of free parameters of the classifier is excessive, overtraining leads to overfitting (the solution is too fitted to noisy samples). The usual way of avoiding them is to stop at a (local) minimum of the validation error. However, this practice can be too conservative, since the learning algorithm could converge to a local minimum of the training cost function far from the global minimum, in which case the risk of overtraining would be low. (See e.g. multi-layer perceptrons trained with the back-propagation algorithm (Lawrence et al., 1997).) In Table 5 we show the results of the best soft k-NN classifiers for the upper problem over 50 runs when the learning algorithm stops at a minimum of the training and validation errors. Observe that the classifiers achieve their best results when they stop at a minimum of the training error.

Stop at a minimum of   Codebook size
                       100             200         400         800
Training error         (1.30)          (0.4689)    (0.2336)    (0.43)
Validation error       12.7 (1.4342)   (0.837)     (0.297)     (0.8)

Table 5. Averaged test error and variance (in parentheses) over 50 runs for the best soft k-NN classifiers in the upper problem when the learning algorithm stops at a minimum of the training and validation errors.

5. Discussion

5.1. Soft k-NN vs. crisp k-NN

The soft k-NN algorithm uses the distance to test patterns in its estimation of the posterior class probabilities. This additional information enhances the approximation quality of the estimation, since it introduces a smooth interpolation relative to the crisp k-NN estimate. We have shown in Figure 1 that the classification borders of the soft k-NN estimate are a smooth version of the crisp k-NN borders. As Section 4.1 demonstrated, the resulting smoothed borders increase the classification rate for a hard decision at the expense of a little more computation. Besides, the soft k-NN rule allows a fuzzy classification (Bezdek & Pal, 1992), where soft labels can be used for post-processing purposes. Section 4.3 showed that if a post-processor detects the correct label among the two greatest outputs of the soft k-NN classifier, the classification error can be notably reduced (by more than a half).

5.2. The adaptive soft k-NN algorithm vs. LVQ algorithms

The adaptive soft k-NN algorithm computes a reduced set of prototypes for the soft k-NN classification rule. The learning algorithm minimises a modification of the cross-entropy function whose absolute minimum points ensure the minimisation of the training classification error. The learning rule, however, resembles a fuzzified version of Kohonen's LVQ algorithms (Kohonen, 1996), which indicates some similarity between our algorithm and LVQ when hard decisions are employed (σ → ∞) and k = 2. Suppose that m_i[k] and m_j[k] are the two nearest prototypes to the pattern x[k]; then equation (19) yields

    m_i[k+1] = m_i[k] + \frac{\eta}{4} (x[k] - m_i[k]) \quad \text{if } m_i[k] \text{ and } x[k] \text{ belong to the same class}
    m_j[k+1] = m_j[k] - \frac{\eta}{4} (x[k] - m_j[k]) \quad \text{if } m_j[k] \text{ and } x[k] \text{ do not belong to the same class}    (22)

Equation (22) is the LVQ2.1 algorithm when the training pattern x[k] falls in the window defined in the LVQ2.1 and LVQ3 algorithms. This result is not a surprise, since (Bottou, 1998) showed that the LVQ2.1 algorithm performs gradient descent over an instantaneous loss function based on a continuous approximation of the (binary) classification error function.
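Under those conditions (k = 2, σ → ∞, one winner per class), d_W = 1/2 and d_c = 1/2 in rule (19), so both prototypes move by η/4, as the following sketch of (22) makes explicit (assuming m_i is the nearest same-class prototype and m_j the nearest different-class one):

```python
def lvq21_like_step(m_i, m_j, x, eta):
    """Special case of rule (19) for k = 2 and sigma -> infinity, i.e.
    equation (22): an LVQ2.1-style update with effective rate eta/4."""
    m_i = m_i + (eta / 4.0) * (x - m_i)   # attract the same-class prototype
    m_j = m_j - (eta / 4.0) * (x - m_j)   # repel the different-class prototype
    return m_i, m_j
```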

5.3. The loss function

The loss function that we use in the learning phase has interesting properties. It is less sensitive to outliers, as Section 3.2.2 shows (see (Bermejo, 2000) for an additional discussion). Besides, our error function is, for a two-class problem, the margin function of the classifier averaged over the whole set of training patterns. If k = 2 for a two-class problem, the learning rule coincides with the Learn1NN algorithm (Bermejo, 2000), so the computed prototypes use the support vectors to build Optimal Margin (OM) hyperplanes (Schölkopf, 1997), as was empirically shown in (Bermejo, 2000).

5.4. Overtraining and overfitting

Experimental Section 4.4 shows that if we stop at a local minimum of the training classification error instead of using the early stopping criterion, the test error is reduced. This result suggests that overtraining is harder than expected (Lawrence, 1997). Since the learning algorithm can hardly achieve a global minimum, the risk of overtraining is reduced, so the use of a validation set makes the algorithm stop too early. Furthermore, if overtraining is difficult, then the risk of overfitting is notably reduced.

6. Conclusion

Crisp k-nearest-neighbour classifiers estimate posterior class probabilities using a fixed number of training data. This feature prevents them, unlike other approaches, from not having enough data in some input regions to ensure good generalization, although the quality of their estimations (a simple ratio of integers) is too poor to properly approximate the true probabilities when finite data sets are employed. An extension of k-NN estimation was presented in order to overcome this limitation: for each testing pattern x, a Parzen window estimate is computed using the k nearest (in the Euclidean sense) training points. This simple algorithm also estimates the posterior class probabilities, although all the training data points must be retained in order to compute the estimations. An improvement of this approach can be achieved, in terms of storage and processing requirements, by replacing the local kernel estimation with a simplified adaptive mixture model: since each mode of the mixture is locally defined, for each testing pattern we only take into account its k nearest modes. The total number of mixture modes will typically be much smaller than the number of training points, and the mixture mode centers are no longer constrained to coincide with the training points. We have proposed a supervised learning procedure to estimate the mixture mode parameters, based on minimizing a modification of the cross-entropy error function. This error function has the same minimum points as the cross-entropy error function, while it allows the derived online gradient learning algorithm to behave more robustly in the presence of outliers in the training set. Experimental results in two handwritten classification problems demonstrate the potential and versatility of the proposed classification system. Since the soft k-NN classifier includes the crisp version as a special case, in the numerical simulations we always find an optimal solution that achieves better classification accuracy than crisp 1-NN classifiers designed with LVQ algorithms and than k-NN classifiers.

References

Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations, Berlin: Springer-Verlag.

Bermejo, S. (2000). Large Margin Adaptive K-Nearest Neighbour Classifiers, in Bermejo, S., Learning with Nearest Neighbour Classifiers, Ph.D. Thesis, Barcelona: Universitat Politècnica de Catalunya.

Bezdek, J.C. & Pal, S.K. (Eds.) (1992). Fuzzy Models for Pattern Recognition, NY: IEEE Press.

Bishop, C.M. (1995). Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

Bottou, L. (1998). Online Learning and Stochastic Approximations, in D. Saad (Ed.), Online Learning and Neural Networks, Cambridge: Cambridge University Press.

Bottou, L. & Vapnik, V. (1992). Local Learning Algorithms, Neural Computation, 4, 888-900.

Dasarathy, B.V. (Ed.) (1991). Nearest Neighbor Pattern Classification Techniques, Los Alamitos, CA: IEEE Computer Society Press.

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-38.

Garris, M. et al. (1997). NIST Form-Based Handprint Recognition System (Release 2.0), National Institute of Standards and Technology.

Hart, P.E. (1968). The Condensed Nearest Neighbor Rule, IEEE Transactions on Information Theory (Corresp.), 14, 515-516.

Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., & Torkkola, K. (1995). LVQ_PAK: The Learning Vector Quantization Program Package, Version 3.1, Helsinki: Helsinki University of Technology, Laboratory of Computer and Information Science.

Kohonen, T. (1996). Self-Organizing Maps, 2nd Edition, Berlin: Springer-Verlag.

Lawrence, S., Giles, C.L. & Tsoi, A.C. (1997). Lessons in Neural Network Training: Overfitting May be Harder than Expected, Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI-97, 540-545, Menlo Park, CA: AAAI Press.

McLachlan, G.J., & Basford, K.E. (1988). Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker.

Ripley, B.D. (1994). Neural Networks and Related Methods for Classification, Journal of the Royal Statistical Society, Series B, 56, 409-456.

Schölkopf, B. (1997). Support Vector Learning, Ph.D. Thesis, Berlin: Informatik der Technischen Universität Berlin.

Appendix: The Soft k-NN Kernel Estimation

Consider a training set of random pairs T = {(x_i, class label(x_i))}, i = 0, ..., N-1, where x_i is a random sample belonging to one of the C existing classes. Suppose we take the region R(x) to be a small hypersphere centered at the point x that contains precisely k training data points. Let us denote by R the event that X belongs to R(X). This event is always true, since any vector belongs to a region centered at itself. So the following equation can be written:

    P(C_l | x) = P(C_l | x, R)    (A.1)

Making use of Bayes' theorem in the right-hand side of equation (A.1), we have

    P(C_l | x, R) = \frac{f_{X|C_l, R}(x) \, P(C_l | R)}{f_{X|R}(x)}    (A.2)

If the training data belonging to class l that fall in R(x) are denoted by {x_{lj}}, j = 1, ..., k_l, l = 1, ..., C, we obtain the following Parzen window estimates (equation (2.57) of (Bishop, 1995)) for the probability density functions f_{X|R} and f_{X|C_l, R}:

    f_{X|C_l, R}(x) = \frac{1}{k_l h^d} \sum_{j=1}^{k_l} G\!\left(\frac{x - x_{lj}}{h}\right), \qquad f_{X|R}(x) = \frac{1}{k h^d} \sum_{m=1}^{C} \sum_{j=1}^{k_m} G\!\left(\frac{x - x_{mj}}{h}\right)    (A.3)

Since the probability of belonging to class l conditioned on belonging to the region R, P(C_l | R), can be approximated by k_l/k, we obtain as an estimate of P(C_l | x)

    P(C_l | x) = \frac{\sum_{j=1}^{k_l} G\!\left(\frac{x - x_{lj}}{h}\right)}{\sum_{m=1}^{C} \sum_{j=1}^{k_m} G\!\left(\frac{x - x_{mj}}{h}\right)}    (A.4)


Pop-Click Noise Detection Using Inter-Frame Correlation for Improved Portable Auditory Sensing Advanced Scence and Technology Letters, pp.164-168 http://dx.do.org/10.14257/astl.2013 Pop-Clc Nose Detecton Usng Inter-Frame Correlaton for Improved Portable Audtory Sensng Dong Yun Lee, Kwang Myung Jeon,

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

Application of support vector machine in health monitoring of plate structures

Application of support vector machine in health monitoring of plate structures Appcaton of support vector machne n heath montorng of pate structures *Satsh Satpa 1), Yogesh Khandare ), Sauvk Banerjee 3) and Anrban Guha 4) 1), ), 4) Department of Mechanca Engneerng, Indan Insttute

More information

Training Convolutional Neural Networks

Training Convolutional Neural Networks Tranng Convolutonal Neural Networks Carlo Tomas November 26, 208 The Soft-Max Smplex Neural networks are typcally desgned to compute real-valued functons y = h(x) : R d R e of ther nput x When a classfer

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

Active Learning with Support Vector Machines for Tornado Prediction

Active Learning with Support Vector Machines for Tornado Prediction Actve Learnng wth Support Vector Machnes for Tornado Predcton Theodore B. Trafas, Indra Adranto, and Mchae B. Rchman Schoo of Industra Engneerng, Unversty of Okahoma, 0 West Boyd St, Room 4, Norman, OK

More information

A New Modified Gaussian Mixture Model for Color-Texture Segmentation

A New Modified Gaussian Mixture Model for Color-Texture Segmentation Journa of Computer Scence 7 (): 79-83, 0 ISSN 549-3636 0 Scence Pubcatons A New Modfed Gaussan Mxture Mode for Coor-Texture Segmentaton M. Sujartha and S. Annadura Department of Computer Scence and Engneerng,

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Analysis of Non-binary Hybrid LDPC Codes

Analysis of Non-binary Hybrid LDPC Codes Anayss of Non-bnary Hybrd LDPC Codes Luce Sassate and Davd Decercq ETIS ENSEA/UCP/CNRS UMR-5 954 Cergy, FRANCE {sassate,decercq}@ensea.fr Abstract Ths paper s egbe for the student paper award. In ths paper,

More information

L-Edge Chromatic Number Of A Graph

L-Edge Chromatic Number Of A Graph IJISET - Internatona Journa of Innovatve Scence Engneerng & Technoogy Vo. 3 Issue 3 March 06. ISSN 348 7968 L-Edge Chromatc Number Of A Graph Dr.R.B.Gnana Joth Assocate Professor of Mathematcs V.V.Vannaperuma

More information

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION European Journa of Mathematcs and Computer Scence Vo. No. 1, 2017 ON AUTOMATC CONTNUTY OF DERVATONS FOR BANACH ALGEBRAS WTH NVOLUTON Mohamed BELAM & Youssef T DL MATC Laboratory Hassan Unversty MORO CCO

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before

More information

Lossy Compression. Compromise accuracy of reconstruction for increased compression.

Lossy Compression. Compromise accuracy of reconstruction for increased compression. Lossy Compresson Compromse accuracy of reconstructon for ncreased compresson. The reconstructon s usually vsbly ndstngushable from the orgnal mage. Typcally, one can get up to 0:1 compresson wth almost

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

CSE 252C: Computer Vision III

CSE 252C: Computer Vision III CSE 252C: Computer Vson III Lecturer: Serge Belonge Scrbe: Catherne Wah LECTURE 15 Kernel Machnes 15.1. Kernels We wll study two methods based on a specal knd of functon k(x, y) called a kernel: Kernel

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Research Article Green s Theorem for Sign Data

Research Article Green s Theorem for Sign Data Internatonal Scholarly Research Network ISRN Appled Mathematcs Volume 2012, Artcle ID 539359, 10 pages do:10.5402/2012/539359 Research Artcle Green s Theorem for Sgn Data Lous M. Houston The Unversty of

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization

A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization Journa of Machne Learnng Research 18 17 1-5 Submtted 9/16; Revsed 1/17; Pubshed 1/17 A Genera Dstrbuted Dua Coordnate Optmzaton Framework for Reguarzed Loss Mnmzaton Shun Zheng Insttute for Interdscpnary

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Research Article H Estimates for Discrete-Time Markovian Jump Linear Systems

Research Article H Estimates for Discrete-Time Markovian Jump Linear Systems Mathematca Probems n Engneerng Voume 213 Artce ID 945342 7 pages http://dxdoorg/11155/213/945342 Research Artce H Estmates for Dscrete-Tme Markovan Jump Lnear Systems Marco H Terra 1 Gdson Jesus 2 and

More information