Experiments on logistic regression


Ning Bao

March 2008

Abstract

In this report, several experiments are conducted on a spam data set with Logistic Regression based on the Gradient Descent approach. First, the overfitting effect is shown with the basic settings (vanilla version). Then the Stochastic Gradient Descent and 2-Norm Regularization techniques are both implemented, with a demonstration of the benefits of these two methods in preventing overfitting. Besides, a new trick of modifying the Sigmoid transfer function in the training stage is described and implemented, which is shown to help reduce overfitting as well. At last, the Label Shrinking technique is implemented; yet, the results on this data set are not satisfying.

1 Preliminaries

1.1 Logistic Regression

The data is a set of examples (x_t, y_t). The features of an example x_t are typically binary, and the label y_t is typically binary as well. To predict, first compute the linear activation of the individual features, â_t = w · x_t. This is actually a single Neuron, and one of our goals is to train this neuron to make it predict well. Then the Sigmoid function is used as the transfer function, turning the linear activation into a probability that serves as the prediction:

    ŷ_t = σ(â_t) = 1 / (1 + e^(-â_t)) = 1 / (1 + e^(-w · x_t))

The Logistic Loss has the following form:

    Loss(y_t, ŷ_t) = y_t ln(y_t / ŷ_t) + (1 - y_t) ln((1 - y_t) / (1 - ŷ_t))
                   = -ln(1 - ŷ_t) = ln(1 + e^(w · x_t))               if y_t = 0
                   = -ln(ŷ_t)     = ln(1 + e^(w · x_t)) - w · x_t     if y_t = 1
                   = ln(1 + e^(w · x_t)) - y_t (w · x_t)
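As a concrete illustration, the prediction and the average logistic loss can be computed in a few lines of MATLAB (a sketch; X, y and w are assumed to hold the T-by-d feature matrix, the T-by-1 label vector and the d-by-1 weight vector):

    % Sigmoid prediction and average logistic loss.
    a_hat = X * w;                                  % linear activations, T-by-1
    y_hat = 1 ./ (1 + exp(-a_hat));                 % sigmoid predictions in (0,1)
    loss  = mean(log(1 + exp(a_hat)) - y .* a_hat); % average logistic loss
    % note: exp(a_hat) can overflow for very large activations, which is
    % the numerical issue mentioned in Section 1.2 below.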

1.2 Gradient Descent

Basically speaking, the Gradient Descent I have used is essentially about solving the minimization problem below:

    inf_w (1/T) Σ_t Loss(y_t, σ(w · x_t))

For the unregularized version of Gradient Descent, we try to minimize the unregularized average logistic loss with respect to the weight vector w:

    Loss(w) = (1/T) Σ_{t=1..T} Loss(y_t, ŷ_t) = (1/T) Σ_{t=1..T} ( ln(1 + e^(w · x_t)) - y_t (w · x_t) )

(Theoretically, I should use the total logistic loss as described in Section 1.1 as the target function to minimize. However, during the experiments I found that if I use the total logistic loss, some of the weights of the neuron become rather large and lead to considerably huge linear activations. This in turn causes infinite logistic loss in later training passes. Thus, I switched to using the average logistic loss over the examples as the target function.)

Consequently, the gradient of the target function is

    ∇_w Loss(w) = (1/T) Σ_t (ŷ_t - y_t) x_t = (1/T) Σ_t (σ(w · x_t) - y_t) x_t

or, for each feature i,

    ∂Loss(w)/∂w_i = (1/T) Σ_t (ŷ_t - y_t) x_{t,i} = (1/T) Σ_t (σ(w · x_t) - y_t) x_{t,i}

If we set ∇_w Loss(w) = 0 to minimize the loss, the resulting equations below are difficult to solve directly:

    Σ_t σ(w · x_t) x_{t,i} = Σ_t y_t x_{t,i}    for every feature i

Thus, we use the Gradient Descent update iteratively to minimize the loss:

    w^(p+1) = w^p - η ∇_w Loss(w^p),    i.e.    w_i^(p+1) = w_i^p - η ∂Loss(w^p)/∂w_i

in which i is the index of features and p is the index of iterations. (It might take quite a few iterations to converge to the approximate minimum.)
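A minimal MATLAB sketch of this batch update, assuming a learning rate eta and a gradient-norm tolerance tol, could look as follows:

    % Batch gradient descent on the average logistic loss.
    [T, d] = size(X);
    w = zeros(d, 1);                        % start from the zero vector
    while true
        y_hat = 1 ./ (1 + exp(-X * w));     % current predictions
        grad  = X' * (y_hat - y) / T;       % gradient of the average loss
        if norm(grad) < tol, break; end     % stop once the gradient is small
        w = w - eta * grad;                 % descent step
    end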

1.3 Data

The data set is a matrix in which each row is an example; the first column of each row is the real label and the remaining columns are the features. All the labels and features are binary, taking on values of {0, 1}. In the data there are 68 positive examples (with label value 1). The visualization of the original and permuted data sets is shown in Figure 1, in which a black dot represents a value of 1 for either a label or a feature entry.

(MATLAB: imagesc(-data); colormap gray)
(MATLAB: p = randperm(size(data, 1)); data_perm = data(p, :); imagesc(-data_perm); colormap gray)

Figure 1: The maps of the original and permuted data sets

1.4 Cross Validation

For some of the experiments conducted, I employed the Cross Validation technique to pick the best model. In this context, the best model is composed of particular parameter values and the corresponding weight vector (or Neuron), since for each different value of the parameters we need to train the neuron to learn the data. Cross Validation is the standard technique in machine learning for model selection and result reporting; it prevents the bias incurred by a particular split of the data and thus makes the results of an algorithm more acceptable. In the training stage, the 3-fold Cross Validation scheme below is used:

    partition the training set into 3 parts
    for each of the 3 holdouts:
        train all models on the other 2/3 of the parts
        record the average logistic loss on the held-out 1/3 part
    the best model is chosen as the one with the best average validation loss over the 3 holdouts

Figure 2: The 3-fold Cross Validation scheme
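In MATLAB, this scheme could be sketched as below, where train_model and params are hypothetical stand-ins for the training routine and its candidate parameter values:

    % 3-fold cross validation for model selection (a sketch).
    avgloss = @(X, y, w) mean(log(1 + exp(X * w)) - y .* (X * w));
    T = size(X, 1);
    fold = mod(randperm(T), 3) + 1;            % random assignment to 3 folds
    for j = 1:numel(params)
        for f = 1:3
            tr = (fold ~= f); va = (fold == f);
            w = train_model(X(tr, :), y(tr), params(j));   % hypothetical trainer
            vloss(f, j) = avgloss(X(va, :), y(va), w);     % validation loss
        end
    end
    [~, best] = min(mean(vloss, 1));           % best average validation loss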

1.5 Structure of this report

I have conducted several experiments with logistic regression; the outline of the results is listed below. In Section 2, I experiment with the vanilla version of logistic regression to illustrate the overfitting problem. Then, in Section 3, I briefly demonstrate that Stochastic Gradient Descent with a simulated annealing technique can help prevent the overfitting effect in logistic regression. The 2-Norm regularization can help reduce the overfitting effect as well, as shown in Section 4. Then I try the new trick of modifying the Sigmoid transfer function during the training process; the detailed information is provided in Section 5. At last, the technique of label shrinking is implemented and described in Section 6.

2 Vanilla version of Logistic Regression

For the vanilla version of Logistic Regression, I only use the methods mentioned in Sections 1.1 and 1.2. Yet, one needs to decide when to stop the training process of the Gradient Descent method. Usually, the 1-Norm of the gradient is a rule-of-thumb stopping criterion; one could use the 2-Norm of the gradient as well. The basic idea is the same: stop the training when the gradient is small enough, which implies that the model approaches a local minimum of the loss function. Note that using the 1-Norm (or 2-Norm) of the gradient as a stopping criterion is a kind of implicit regularization, namely early stopping: as one trains the model more on the training data set, the model will overfit the training set with a relatively small loss, but when predicting, the model may incur a considerably large loss on unknown new data.

2.1 1-Norm of gradient as stopping criterion

For this part, the method's outline is listed in Figure 3:

    train the neuron on 3/4 of the whole data set, keeping 1/4 as the testing set
    early stopping with the criterion that the 1-Norm of the gradient converges to some level
    report the average logistic loss on the training set & testing set for comparison

Figure 3: Method outline for the 1-Norm-of-gradient stopping criterion, vanilla Log. Reg.

Note that the learning rate η is a parameter which should be tuned. Yet, for simplicity, I picked a single rule-of-thumb value of η for all the experiments below. The justification is that if η is larger, each step of gradient descent is larger, which converges faster yet may fail to approach the local minimum closely enough because of the jumping back-and-forth effect of gradient descent; on the other hand, if η is smaller, the convergence is extremely slow, which is not feasible in a practical implementation. The stopping criterion here is

    Σ_i | (1/T) Σ_t (σ(w · x_t) - y_t) x_{t,i} | ≤ ε    (the 1-Norm of the gradient)

where i ranges over the features and the level ε is varied in the experiments. I plotted the mean logistic loss on the training set as well as on the testing set in Figure 4.

Figure 4: Vanilla version, early stopping by 1-Norm of gradient, training vs. test loss

We can see from Figure 4 that, as the model is trained more and more, the mean logistic loss on the testing set increases (after the 1-Norm of the gradient drops below a certain level) due to the overfitting effect.

2.2 2-Norm of gradient as stopping criterion

Next, I experimented with the 2-Norm of the gradient as the stopping criterion. Other settings are the same as in Section 2.1. The stopping criterion now is

    ( Σ_i ( (1/T) Σ_t (σ(w · x_t) - y_t) x_{t,i} )² )^(1/2) ≤ ε    (the 2-Norm of the gradient)

and the result is depicted in Figure 5.

Figure 5: Vanilla version, early stopping by 2-Norm of gradient, training vs. test loss

This result raises two points of discussion. First, the test loss does not blow up as substantially as in the 1-Norm result; in other words, the overfitting phenomenon is not so obvious in this case. My explanation is that the 2-Norm of the gradient is always smaller than its 1-Norm, so by adopting this criterion the neuron is not as over-trained as before. Second, the mean logistic losses in this result are greater than in the 1-Norm case. The explanation is similar: the neuron is not trained as much, so the average logistic loss is not minimized as far. For a better visualization, I plotted the 1-Norm and 2-Norm cases together in Figure 6.
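The sweep behind Figures 4-6 could be produced with a loop like the following sketch (my own reconstruction, with Xtr/ytr and Xte/yte as the assumed train/test split and eta the learning rate): checkpoint the losses the first time the gradient norm drops below each level.

    % Record train/test loss at a range of early-stopping levels.
    levels  = 10 .^ (0:-1:-4);                 % stopping levels for the gradient norm
    avgloss = @(X, y, w) mean(log(1 + exp(X * w)) - y .* (X * w));
    w = zeros(size(Xtr, 2), 1);
    k = 1;
    while k <= numel(levels)
        g = Xtr' * (1 ./ (1 + exp(-Xtr * w)) - ytr) / size(Xtr, 1);
        if norm(g, 1) < levels(k)              % 1-Norm criterion; use norm(g) for the 2-Norm
            trloss(k) = avgloss(Xtr, ytr, w);  % losses at this stopping point
            teloss(k) = avgloss(Xte, yte, w);
            k = k + 1;
        end
        w = w - eta * g;                       % keep descending
    end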

Figure 6: Early stopping by 1-Norm vs. 2-Norm of gradient

Another interesting observation I have made is that training down to a given level of the 1-Norm of the gradient is extremely slow compared with using the 2-Norm, whereas the 2-Norm criterion can also provide a reasonable loss if set to a suitable level. Basically speaking, the two stopping criteria have similar effects in the sense of early stopping. It is difficult to judge which one is better if one only considers the mean logistic loss, since the only difference between them is the training time; if the model is trained sufficiently without overfitting, they provide the exact same results. To conclude, by setting a reasonable stopping criterion, one can use early stopping to help prevent the overfitting effect without changing any other part of the gradient descent method for logistic regression.

3 Stochastic Gradient Descent with Simulated Annealing

In this part, I implemented the stochastic gradient descent approach. Instead of updating all the weights after seeing all the examples, stochastic gradient descent updates the weight vector example by example, similar to the online setting. Besides, this approach employs a simulated annealing technique to stabilize the converging process: the learning rate decays as the training process goes on, which helps prevent the overfitting effect. The stochastic gradient descent updates the weights with

    w := w - η α^i ∇_w Loss(y_t, σ(w · x_t))

in which i is the number of training passes over the training set and α < 1.
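One annealed, pass-by-pass implementation could be sketched as follows (eta, alpha and the number of passes npasses are assumed values):

    % Stochastic gradient descent with simulated annealing.
    [T, d] = size(X);
    w = zeros(d, 1);
    for i = 1:npasses                          % passes over the training set
        rate = eta * alpha^i;                  % annealed learning rate for pass i
        for t = randperm(T)                    % visit examples in random order
            y_hat = 1 / (1 + exp(-X(t, :) * w));
            w = w - rate * (y_hat - y(t)) * X(t, :)';   % per-example update
        end
    end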

Thus, as the training goes on pass after pass, the learning rate drops and stabilizes ("cools down") the convergence; consequently, the weights change less and less. This is shown in Figure 7.

Figure 7: Mean log. loss and weights of Stochastic Gradient Descent (α = 0.95)

4 2-Norm Regularization

In this part, I experimented with the effect of 2-Norm regularization in preventing overfitting. With 2-Norm regularization, the minimization problem becomes

    inf_w ( (λ/2) ||w||² + (1/T) Σ_t Loss(y_t, σ(w · x_t)) )

and the update formula becomes

    w := w - η ( λ w + (1/T) Σ_t (σ(w · x_t) - y_t) x_t )

Figure 8 is a plot of 2-Norm regularized logistic regression, trained until the norm of the gradient is below 10^-6. As we can see, since we have the regularizer in the objective, it helps prevent overfitting by punishing large weights: as we train more, the mean loss on the testing set flattens out, and the weights are controlled within a relatively small range.
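The regularized batch update could be sketched as follows (eta and lambda assumed; training until the gradient norm is below 10^-6, as in Figure 8):

    % Batch gradient descent with 2-Norm regularization.
    w = zeros(size(X, 2), 1);
    while true
        y_hat = 1 ./ (1 + exp(-X * w));
        grad  = lambda * w + X' * (y_hat - y) / size(X, 1);  % regularized gradient
        if norm(grad) < 1e-6, break; end
        w = w - eta * grad;
    end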

Figure 8: Mean log. loss and weights of 2-Norm Regularized G.D.

5 Modifying the prediction function in the update formula

In this part, I changed the original sigmoid prediction function in the weight update formula during training, in order to control the variation of the weights. However, when predicting on the testing set, the sigmoid function is left unchanged. By changing the transfer function, one can control the weights, since if the linear activation is larger/smaller than a certain threshold, the weights won't be updated. In other words, the idea could be expressed as "once the prediction is good, the weights stay put".

5.1 Modified sigmoid function

Here, I use the following prediction during training:

    ŷ_t = 1                        if w · x_t ≥ σ_threshold
        = 1 / (1 + e^(-w · x_t))   otherwise
        = 0                        if w · x_t ≤ -σ_threshold

which is plotted in Figure 9.
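In MATLAB, this modified transfer function used during training could be written as below (a sketch; sig_th stands for σ_threshold):

    % Modified sigmoid: saturates exactly once |activation| >= sig_th, so
    % saturated examples contribute nothing to the weight update; note the
    % jump at the threshold, which makes this function discontinuous.
    function y_hat = clipped_sigmoid(a, sig_th)
        y_hat = 1 ./ (1 + exp(-a));   % ordinary sigmoid in the middle
        y_hat(a >=  sig_th) = 1;      % predict exactly 1
        y_hat(a <= -sig_th) = 0;      % predict exactly 0
    end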

Figure 9: Modified Sigmoid function as prediction (σ_threshold = 3)

5.1.1 Effects in preventing overfitting

With this change, the weights are kept from growing/dropping too much. As shown in Figure 10, the weights are intentionally bounded compared with the original sigmoid case.

Figure 10: Weights comparison, modified sigmoid vs. original

Intuitively, this bounding effect on the weights helps prevent overfitting on the training set. As shown in Figure 11, the mean logistic loss on the testing set increases only moderately as training continues (i.e., stopping at a smaller norm of the gradient).

Figure 11: Mean log. loss comparison, modified sigmoid vs. original

5.1.2 Varying σ_threshold

By varying the value of σ_threshold, I plotted the curves of the mean logistic loss on both the training and testing sets, as shown in Figure 12. It is observed that as σ_threshold increases, the testing loss increases as well while the training loss drops.

Figure 12: The effects of different values of σ_threshold

This result accords with intuition, since as σ_threshold increases the function becomes more like the original sigmoid and the weights are no longer intentionally bounded; therefore, the overfitting effect shows up as the testing loss goes up in Figure 12. The cross-validation losses are also recorded in Table 1.

Table 1: 3-fold Cross Validation, varying σ_threshold; training/validation/test log. loss per fold and on average

In Table 1, the test loss is reported based on the best set of weights (with minimum validation loss for each fixed σ_threshold). As the value of σ_threshold increases, the training loss keeps dropping, which accords with expectation since the modified sigmoid function becomes more and more like the original one. Based on the results, the best model seems to be σ_threshold = .5 plus the correspondingly trained weights.

5.1.3 Side-effect of the modified sigmoid

One interesting property of this modified sigmoid function is its instability: the gradients jump back and forth from iteration to iteration. Consider Figure 13 below.

Figure 13: Variation of the norm of the gradient for the modified sigmoid method

This is due to the artificial modification of the original sigmoid function. In other words, since the modified sigmoid function is no longer continuous, the changes of the weights can be dramatic, causing the unstable effect. (For the step-like transfer function in Section 5.2, I did not observe a similar effect, since that transfer function is still continuous.) It is also observed that as σ_threshold becomes bigger (the modified sigmoid function then approximates the original one), the norm of the gradient tends to vary less.

5.2 Step-like transfer function

Next, I tried using a step-like transfer function:

    ŷ_t = 0                                           if w · x_t ≤ -σ_threshold
        = (w · x_t + σ_threshold) / (2 σ_threshold)   otherwise
        = 1                                           if w · x_t ≥ σ_threshold

which is plotted in Figure 14.
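This ramp could be written analogously to the clipped sigmoid above (a sketch):

    % Step-like (ramp) transfer function: linear between -sig_th and sig_th,
    % clipped to 0 and 1 outside; unlike the modified sigmoid, it is continuous.
    function y_hat = ramp_transfer(a, sig_th)
        y_hat = min(max((a + sig_th) ./ (2 * sig_th), 0), 1);
    end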

Figure 14: Step-like transfer function as prediction (σ_threshold = 3)

5.2.1 Effects in preventing overfitting

The corresponding results of this change are shown in Figures 15 and 16.

Figure 15: Mean log. loss comparison, step-like transfer func. vs. original sigmoid

Figure 16: Weights comparison, step-like transfer func. vs. original sigmoid

This method helps to control the weights to some extent; consequently, it also helps to reduce the effect of overfitting.

5.2.2 Comparison with the Modified Sigmoid

Comparing with the Modified Sigmoid above by training until the norm of the gradient is below 10^-4 yields the plots below (Figures 17 and 18).

Figure 17: Mean log. loss comparison 3, modified sigmoid vs. step-like transfer func.

Figure 18: Weights comparison 3, modified sigmoid vs. step-like transfer func.

As seen in this comparison, the Modified Sigmoid performs better if only trained until the gradient norm is below 10^-3, and the weights don't diffuse. Yet, with more training, the two methods' performance gets closer. They both artificially control the weights by changing the prediction in the update formula; as shown in the plot, all the weights are eventually bounded in the range [-σ_threshold, σ_threshold].

5.2.3 Varying σ_threshold

By varying the value of σ_threshold, I plotted the curves of the average logistic loss on both the training and testing sets, as shown in Figure 19. It is observed that as σ_threshold increases, the testing loss increases as well while the training loss drops.

Figure 19: The effects of different values of σ_threshold

Also, the relevant data is recorded in Table 2.

Table 2: 3-fold Cross Validation, varying σ_threshold; training/validation/test log. loss per fold and on average

In Table 2, the test loss is reported based on the best set of weights (with minimum validation loss for each fixed σ_threshold). Based on the results, the best model seems to be σ_threshold = .5 plus the correspondingly trained weights.

6 Label Shrinking

Based on [1], I implemented the label shrinking technique for preventing overfitting. The original idea in [1] is more concerned with overfitting resulting from adding sparse features, while here I employ this technique in the context of logistic regression to observe the result.

6.1 Label Shrinking

As mentioned above, for the spam data set the labels are binary, y_t ∈ {0, 1}. Shrinking uses a linear transformation mapping the binary labels to a shrunk label set {a, b}, in which a > 0 and b < 1, as demonstrated below:

    y_t := a + (b - a) y_t,    i.e. 0 → a, 1 → b

Consider the plots in Figure 20, in which one can see the effect of shrinking on the corresponding logistic loss.

Figure 20: Effect of Label Shrinking on the logistic loss (left: varying a; right: varying b)

First, note that by shrinking, the logistic loss becomes quadratic-shaped; that is, the loss curve no longer flattens out when the prediction ŷ_t is close to 0 or 1 (equivalently, the loss curve won't flatten out as the linear prediction â_t becomes relatively large or small). In the implementation, I employed a single parameter ɛ to shrink the labels; the label mapping relation is

    y_t := ɛ + (1 - 2ɛ) y_t,    i.e. {0, 1} → {ɛ, 1 - ɛ}
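In code, the shrinking itself is a one-line transformation applied to the labels before training (a sketch; the value of eps0 is an assumption):

    % Label shrinking: map binary labels {0,1} into {eps0, 1-eps0}.
    eps0 = 0.1;                              % assumed shrinking parameter
    y_shrunk = eps0 + (1 - 2 * eps0) * y;    % then train with y_shrunk instead of y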

Figure 21: Mean log. loss comparison 4, Label Shrinking at two values of ɛ vs. no shrinking (degenerate case ɛ = 0)

It is observed in Figure 21 that label shrinking doesn't help prevent overfitting on this data set, and the mean logistic loss is even worse than with the vanilla version of logistic regression. I propose two explanations for this result: (1) The performance of the label shrinking technique may be highly related to the characteristics of the data set; therefore, it is hard to conclude that this technique won't help in the general case. Besides, the experiments I have done are based on the original data set; permuting the order of the data may yield different results. (2) Theoretically, training with shrunk labels makes the model minimize the modified logistic loss function (as demonstrated in Figure 20), which is different from the original loss function. This may incur a higher loss than training without shrinking. Yet, by plotting the change of the weights in Figure 22, we can observe that the weight vector is indeed well controlled, as expected. Moreover, as ɛ increases, the weights are more constrained.

Figure 22: Weights comparison 4, Label Shrinking at two values of ɛ vs. no shrinking (ɛ = 0)

Label shrinking can help control the weights, yet its effect in preventing overfitting and benefiting the test loss is still not confirmed. Moreover, due to time limitations, the Prediction Stretching technique was not fully tested, with no reasonable results to present. Future experiments are needed to investigate these techniques.

7 Conclusion and Future work

To conclude, I propose the points below: (1) The overfitting effect is obvious in logistic regression based on gradient descent. Early stopping can help reduce this effect, yet it requires extra computation to decide reasonable stopping points as well as which stopping criterion to adopt. The 1-Norm/2-Norm of the gradient are the common stopping criteria; there is no essential difference between these two criteria. (2) The Stochastic Gradient Descent with Simulated Annealing approach clearly helps prevent overfitting, due to the nature of this technique; yet, to apply it in practice, many parameters need to be tuned. 2-Norm regularization is also effective in preventing overfitting, and it is more practical. (3) Modifying the transfer function during training is observed to be effective in preventing overfitting by controlling the weights.

More extensive experiments are needed to justify the usability of this technique in practical problems. (4) The Label Shrinking technique is useful in controlling the weights; however, the results here do not show that this technique is as beneficial as in [1]. Future experiments on various data sets, as well as with the Prediction Stretching technique, should be conducted.

8 Acknowledgment

Thanks to Maya Hristakeva and Nikhila Arkalgud for their excellent work on homework 3 about Logistic Regression; I learned a lot from their work. Thanks to Bruno Astuto Arouche Nunes for teaching me the screen-detaching trick on the SSH client, which facilitated my experiments a great deal. Special thanks to Professor Manfred Warmuth for his encouragement, guidance and plenty of help, which supported me in working through this project.

References

[1] M. Warmuth, Shrink-stretch of labels for regularizing logistic regression, 2008.
