Article from Predictive Analytics and Futurism, July 2016 Issue
An Introduction to Incremental Learning

By Qiang Wu and Dave Snell

Machine learning provides useful tools for predictive analytics. The typical machine learning problem can be described as follows: A system produces a specific output for each given input. The mechanism underlying the system can be described by a function that maps the input to the output. Human beings do not know the mechanism but can observe the inputs and outputs. The goal of a machine learning algorithm is to infer the mechanism from a set of observations collected for the input and output.

Mathematically, we use (x_i, y_i) to denote the i-th pair of observations of input and output. If the real mechanism of the system that produces the data is described by a function f*, then the true output is supposed to be f*(x_i). However, due to systematic noise or measurement error, the observed output y_i satisfies

y_i = f*(x_i) + ε_i,

where ε_i is an unavoidable but hopefully small error term. The goal, then, is to learn the function f* from the n pairs of observations {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}.

A machine learning algorithm must first specify a loss function L(y, f(x)) to measure the error that will occur when we use f(x) to predict the output y for an unobserved x. We use the term unobserved x to describe new observations outside our training sets. We wish to find a function such that the total loss on all unobserved data is as small as possible. Ideally, for an appropriately designed loss function, f* is the target function. In this case, if we could compute the total loss on all unobserved data, we could find f* exactly. Unfortunately, computing the total loss on unobserved data is impossible. A machine learning algorithm usually searches for an approximation of f* by minimizing the loss on the observed data. This is called the empirical loss. The term generalization error measures how well a function having small empirical loss can predict unobserved data.

There are two machine learning paradigms. Batch learning refers to machine learning methods that use all the observed data at once. Incremental learning (also called online learning) refers to machine learning methods that
apply to streaming data collected over time. These methods update the learned function accordingly as new data come in. Incremental learning mimics the human process of learning from experience. In this article, we will introduce three classical incremental learning algorithms: stochastic gradient descent for linear regression, the perceptron for classification, and incremental principal component analysis.

STOCHASTIC GRADIENT DESCENT

In linear regression, f*(x) = w^T x is a linear function of the input vector. The usual choice of loss function is the squared loss L(y, w^T x) = (y − w^T x)^2. The gradient of L with respect to the weight vector w is given by

∇_w L = −2 (y − w^T x) x.

Note the gradient is the direction in which the function increases, so if we want the squared loss to decrease, we need to let the weight vector move opposite to the gradient. This motivates the stochastic gradient descent algorithm for linear regression as follows. The algorithm starts with an initial guess of w as w_0. At time t, we receive the t-th observation x_t and can predict the output as ŷ_t = w_t^T x_t. After we observe the true output y_t, we update the estimate of w by

w_{t+1} = w_t + η_t (y_t − w_t^T x_t) x_t.

The number η_t > 0 is called the step size. Theoretical study shows that w_t becomes closer and closer to the true coefficient vector w* provided the step size is properly chosen. A typical choice of step size is η_t = η_0/√t for some predetermined constant η_0. Another quantity to measure the effectiveness is the accumulated regret after T steps, defined by

R(T) = Σ_{t=1}^{T} [ L(y_t, w_t^T x_t) − L(y_t, w*^T x_t) ].

If this algorithm is used in a financial decision-making process and w*^T x_t is the optimal decision at step t, the regret measures the total additional loss incurred because the decisions are not optimal. In theory, the regret is bounded, implying that the average additional loss resulting from any one decision is minimal when T is large.

We use a simulation to illustrate the use and the effect of this algorithm. Assume that in a certain business there are five risk factors. They may drive the financial losses either up or down. The loss is a weighted sum of these factors plus some fluctuation due to noise,

y = w*^T x + ε,

where the true coefficient vector w* has positive weights on the first, third and fifth risk factors and negative weights on the second and fourth. We assume each risk factor takes values between 0 and 1 and that the noise follows a mean-zero normal distribution with a small variance. The small variance is chosen empirically to achieve a large signal-to-noise ratio. We generate the data points sequentially to mimic the data-generating process and perform the learning with the initial estimate w_0 = [0, 0, 0, 0, 0]. In Figure 1, we plot the distance between w_t and w*, showing that the estimation error decays fast (which is desirable). In Figure 2, we plot the regret at each step. We see that most additional losses occur at the beginning, because we have used a stupid initial guess. The regret increases very slowly after the early steps, indicating that the decisions become near optimal. In other words, even a poor guess can lead to excellent results after a sufficient number of steps.

Figure 1: Estimation Error vs. Iterations

Figure 2: Regret vs. Iterations

PERCEPTRON

In a classification problem, the target is to develop a rule to assign a label to each instance. For example, in auto insurance, a driver could be labeled as a high-risk or low-risk driver. In financial decision-making, one can determine whether an action should be taken or not. In a binary classification problem where there are two classes, the labels for the two classes are usually taken as 0 and 1, or −1 and +1. When −1 and +1 are used as the two labels, the classifier can be determined by the sign of a real
valued function. A linear classifier is the sign of a linear function of the predictors, f(x) = sign(w^T x). Mathematically, w^T x = 0 forms a separating hyperplane in the space of predictors. The perceptron for binary classification is an algorithm to incrementally update the weight vector of the hyperplane after receiving each new instance. It starts with an initial vector w_0, and when each new instance (x_t, y_t) is received, the coefficient vector is updated by

w_{t+1} = w_t + y_t x_t   if y_t (w_t^T x_t) ≤ γ,
w_{t+1} = w_t             otherwise,

where γ ≥ 0 is a user-specified parameter called the margin. The original perceptron introduced by Rosenblatt in the 1950s has zero margin, i.e., γ = 0. The perceptron can be explained as follows. If y_t (w_t^T x_t) < 0, the t-th observation is classified incorrectly, and thus the rule is updated to decrease the chance of its being classified incorrectly. If y_t (w_t^T x_t) > 0, the t-th observation is classified correctly, and no update is necessary. The idea of using a positive margin comes from the well-known support vector machine classification algorithm. The motivation is that the classification is considered unstable if an observation is too close to the decision boundary, even when it is classified correctly. Updating is still required in this case as a penalty. The classification rule is left unchanged only when an instance is classified correctly and has a margin from the decision boundary.
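The margin perceptron described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the two Gaussian class centers, the sample size, and the margin value below are assumptions chosen so that the optimal boundary passes through the origin (the update rule has no bias term).

```python
import numpy as np

def margin_perceptron(stream, d, gamma=0.0):
    """Online perceptron with margin gamma (gamma = 0 is Rosenblatt's original).

    stream: iterable of (x_t, y_t) pairs with labels y_t in {-1, +1}.
    d:      number of predictors.
    Returns the final weight vector and the cumulative accuracy after each step.
    """
    w = np.zeros(d)
    correct = 0
    cumulative_accuracy = []
    for t, (x, y) in enumerate(stream, start=1):
        if y * (w @ x) > 0:        # score the prediction made before updating
            correct += 1
        if y * (w @ x) <= gamma:   # misclassified, or correct but inside the margin:
            w = w + y * x          #   nudge the hyperplane toward this instance
        cumulative_accuracy.append(correct / t)
    return w, cumulative_accuracy

# Illustrative data: two overlapping Gaussian classes whose centers are
# symmetric about the line x1 - x2 = 0, so a homogeneous linear rule suffices.
rng = np.random.default_rng(0)
n = 2000
pos = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(n // 2, 2))   # label +1
neg = rng.normal(loc=[0.0, 2.0], scale=1.0, size=(n // 2, 2))   # label -1
X = np.vstack([pos, neg])
y = np.array([1] * (n // 2) + [-1] * (n // 2))
order = rng.permutation(n)                                       # interleave the stream
w, acc = margin_perceptron(zip(X[order], y[order]), d=2, gamma=0.0)
```

Because the two classes overlap, the perceptron never stops updating entirely, but the cumulative accuracy still climbs toward the optimum as the early mistakes are averaged out.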
For the perceptron, the cumulative classification accuracy, defined as the percentage of correctly classified instances so far, can be used to measure the effectiveness of the algorithm. In Figure 3, we simulated data points for two classes, the positive class and the negative class, each normally distributed around its own center, with the centers placed symmetrically about the line x_1 − x_2 = 0. The optimal separating line is x_1 − x_2 = 0, which can achieve a classification accuracy of 92.14 percent. That is, there is a systematic error of 7.86 percent. We assume the data points come in sequentially and apply the perceptron algorithm. The cumulative classification accuracy is shown in Figure 4. As desired, the classification ability of the perceptron is near optimal after some number of updates.

Figure 3: Data for a Binary Classification Problem

Figure 4: Cumulative Classification Accuracy of Perceptron

INCREMENTAL PCA

Principal component analysis (PCA) is probably the most famous feature extraction tool for analytics professionals. The principal components are linear combinations of the predictors that preserve the most variability in the data. Mathematically, they are defined as the directions on which the projection of the data has the largest variance, and they can be calculated as the eigenvectors associated with the largest eigenvalues of the covariance matrix. PCA can also be implemented in an incremental manner. For the first principal component v, the algorithm can be described as follows. It starts with an initial estimate v_0, and when a new instance x_t comes in, the estimate is updated by

v_t = ((t − 1)/t) v_{t−1} + (1/t) x_t (x_t^T v_{t−1}) / ‖v_{t−1}‖.

The accuracy can be measured by the distance between the estimated principal component and the true one. Again, we use a simulation to illustrate its use and effectiveness. We generated data points from a five-dimensional multivariate normal distribution with mean vector μ and a specified covariance matrix.
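The incremental update above can be sketched in Python as follows. This is a minimal illustration of the covariance-free scheme, under two stated assumptions: the stream is already mean-centered, and the dimension, covariance matrix, and sample size below are arbitrary choices for demonstration, not the article's simulation settings.

```python
import numpy as np

def incremental_first_pc(stream):
    """Incrementally estimate the first principal component of a
    mean-centered data stream, one observation at a time."""
    v = None
    for t, x in enumerate(stream, start=1):
        x = np.asarray(x, dtype=float)
        if v is None:
            v = x.copy()                  # initialize with the first observation
            continue
        u = v / np.linalg.norm(v)         # current unit-length direction estimate
        # Weighted average of the old estimate and the new observation scaled by
        # its projection on the current direction -- a streaming power iteration.
        v = ((t - 1) / t) * v + (1 / t) * (x @ u) * x
    return v / np.linalg.norm(v)          # return a unit vector

# Illustrative data: 5,000 centered Gaussian samples whose covariance has its
# largest eigenvalue along the first coordinate axis, so the true first
# principal component is (plus or minus) the first standard basis vector.
rng = np.random.default_rng(0)
cov = np.diag([5.0, 1.0, 0.5])
X = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=5000)
v_hat = incremental_first_pc(X)
```

The 1/t weighting plays the same role as the decaying step size in stochastic gradient descent: early observations set the direction roughly, and later ones refine it.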
The first principal component has nonzero loadings only on the first two variables. In Figure 5, we used a scatter plot to show the first two variables of the data, with the red line indicating the direction of the first principal component. After applying the incremental PCA algorithm, the distance between the estimated principal component and the true principal component is plotted for each step in Figure 6. As expected, the distance shrinks to 0 as more and more data points come in.

Figure 5: Feature Abstraction via Principal Component Analysis

Figure 6: Estimation Error from Principal Component Analysis

REMARKS

We close with a few remarks. First, incremental learning has very important application domains, for example, personalized handwriting recognition for smartphones and sequential decision-making for financial systems. In real applications, batch learning methods are usually used with a number of past experiences to set up the initial estimator. This helps avoid large losses at the beginning. Incremental learning can then be used to refine or personalize the estimate. Second, we have introduced these algorithms for linear models. All of them can be extended to nonlinear models by using the so-called kernel trick in machine learning. Finally, we would mention that the term online learning seems more popular in the machine learning literature; however, we prefer the term incremental learning because online learning is widely used to refer to learning via the Internet and can easily confuse people. Actually, in Google, you probably cannot get what you want by searching "online learning." Instead, "online machine learning" should be used.

Qiang Wu, PhD, ASA, is associate professor at Middle Tennessee State University in Murfreesboro, Tenn. He can be reached at qwu@mtsu.edu.

Dave Snell, ASA, MAAA, is technology evangelist at RGA Reinsurance Company in Chesterfield, MO. He can be reached at dave@ActuariesAndTechnology.com.

ENDNOTES

1. Vladimir N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
2. Juyang Weng, Yilu Zhang, and Wey-Shiuan Hwang, "Candid Covariance-Free Incremental Principal
Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8), 2003.
3. Wikipedia, "Online Machine Learning," https://en.wikipedia.org/wiki/Online_machine_learning
4. Wikipedia, "Perceptron," https://en.wikipedia.org/wiki/Perceptron