In: Fourth International Conference on Artificial Neural Networks, Churchill College, University of Cambridge, UK
In: Fourth International Conference on Artificial Neural Networks, Churchill College, University of Cambridge, UK. IEE Conference Publication No. 409, pp. 160-165 (1995)

ON THE RELATIONSHIP BETWEEN BAYESIAN ERROR BARS AND THE INPUT DATA DENSITY

C K I Williams, C Qazaz, C M Bishop and H Zhu
Neural Computing Research Group, Aston University, UK.

ABSTRACT

We investigate the dependence of Bayesian error bars on the distribution of data in input space. For generalized linear regression models we derive an upper bound on the error bars which shows that, in the neighbourhood of the data points, the error bars are substantially reduced from their prior values. For regions of high data density we also show that the contribution to the output variance due to the uncertainty in the weights can exhibit an approximate inverse proportionality to the probability density. Empirical results support these conclusions.

1 INTRODUCTION

When given a prediction, it is also very useful to be given some idea of the "error bars" associated with that prediction. Error bars arise naturally in a Bayesian treatment of neural networks and are made up of two terms, one due to the posterior weight uncertainty, and the other due to the intrinsic noise in the data. As the two contributions are independent, we have

    \sigma_y^2(x) = \sigma_w^2(x) + \sigma_\nu^2(x)    (1)

where \sigma_w^2(x) is the variance of the output due to weight uncertainty and \sigma_\nu^2(x) is the variance of the intrinsic noise. Under the assumption that the posterior in weight space can be approximated by a Gaussian (MacKay (1)), we have

    \sigma_w^2(x) = g^T(x) A^{-1} g(x)    (2)

where A is the Hessian matrix of the model and g = \partial y(x; w) / \partial w is the vector of the derivatives of the output with respect to the weight parameters in the network. A contains contributions from both the prior distribution on the weights and the effect of the training data. Although the weight uncertainty component of the error bar is given by equation 2, the dependence of this quantity on the location of the training points is not at all obvious.
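To make this dependence concrete, here is a minimal numerical sketch (not from the paper; the two-parameter model y(x) = w0 + w1*x and all numbers are assumptions chosen for illustration). It builds the Hessian from a prior term plus an outer-product data term, evaluates the weight-uncertainty variance g^T(x) A^{-1} g(x), and shows that it is small near the training inputs and reverts towards the prior variance far away:

```python
# Hedged sketch of the variance decomposition: sigma_y^2 = sigma_w^2 + sigma_nu^2,
# with sigma_w^2(x) = g(x)^T A^{-1} g(x) and A = S + (1/sigma_nu^2) sum_n g(x_n) g(x_n)^T.
# Model, prior strength, noise level and inputs below are all illustrative assumptions.

def g(x):                      # derivative of y(x; w) = w0 + w1*x w.r.t. (w0, w1)
    return (1.0, x)

def inv2(m):                   # inverse of a 2x2 matrix given as ((a, b), (c, d))
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def quad(m, v):                # v^T m v for a 2x2 matrix and a 2-vector
    (a, b), (c, d) = m
    v0, v1 = v
    return v0 * (a * v0 + b * v1) + v1 * (c * v0 + d * v1)

alpha, sigma_nu2 = 0.1, 0.05   # assumed prior precision and noise variance
xs = [-0.5, 0.0, 0.5]          # training inputs clustered near the origin

# Hessian: prior term S = alpha*I plus the outer-product data term.
A = [[alpha, 0.0], [0.0, alpha]]
for xn in xs:
    gn = g(xn)
    for i in range(2):
        for j in range(2):
            A[i][j] += gn[i] * gn[j] / sigma_nu2
A = (tuple(A[0]), tuple(A[1]))

S_inv = inv2(((alpha, 0.0), (0.0, alpha)))
A_inv = inv2(A)

def sigma_y2(x):               # total predictive variance, equation-(1) style
    return quad(A_inv, g(x)) + sigma_nu2

var_near = quad(A_inv, g(0.0))    # weight-uncertainty term near the data
var_far = quad(A_inv, g(10.0))    # ... and far from the data
prior_far = quad(S_inv, g(10.0))  # prior variance at the distant point
print(var_near, var_far, prior_far)
```

With these illustrative numbers the weight-uncertainty term at the data is below even the noise variance, while far from the data it grows back towards (but never exceeds) the prior variance.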
Intuitively we would expect the error bars from the prior (i.e. before any data is seen) to be quite large, and that the effect of the training data would be to reduce the magnitude of the error bars for those regions of the input space close to the data points, while leaving large error bars further away. (If the network used is not the correct generative model for the data there will be a third component due to model mis-specification; we do not discuss this further in this paper.) The purpose of this paper is to provide theoretical insights to support this intuition. In particular, our analysis focusses on generalized linear regression (such as radial basis function networks with fixed basis function parameters) and allows us to quantify the extent of the reduction and the length scale over which it occurs. We also show that the relationship

    \sigma_w^2(x) \simeq \sigma_\nu^2 [N p(x) V(x)]^{-1}

holds approximately, where p(x) is the density of the data in the input space, N is the number of data points in the training set and V(x) is a function of x that measures a volume in the input space. This relationship pertains to the "high-data" limit where the effect of the data overwhelms the prior in the Hessian.

2 GENERALIZED LINEAR REGRESSION

Consider a generalized linear regression (GLR) model of the form

    y(x) = \phi^T(x) w = \sum_{j=1}^{m} w_j \phi_j(x)    (3)

where j = 1, ..., m labels the basis functions {\phi_j} of the model. Given a data set D = ((x_1, t_1), (x_2, t_2), ..., (x_N, t_N)), a squared error function with noise variance \sigma_\nu^2 and a regularizer of the form \frac{1}{2} w^T S w, the posterior mean value of the weights \hat{w} is the choice of w that minimizes the quadratic form

    \frac{1}{2\sigma_\nu^2} \sum_i \left( t_i - \sum_j w_j \phi_j(x_i) \right)^2 + \frac{1}{2} w^T S w    (4)

so that \hat{w} is the solution of

    (B + S) \hat{w} = \frac{1}{\sigma_\nu^2} \Phi^T t    (5)

(In this section we assume that \sigma_\nu^2 is independent of x. This assumption can be easily relaxed, but at the expense of somewhat more complicated notation.)
where \Phi is the n \times m design matrix

    \Phi = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_m(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_m(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(x_n) & \phi_2(x_n) & \cdots & \phi_m(x_n) \end{pmatrix}    (6)

B = \frac{1}{\sigma_\nu^2} \Phi^T \Phi and t is the vector of targets. Writing A = B + S, we find

    \hat{y}(x) = \phi^T(x) \hat{w} = \frac{1}{\sigma_\nu^2} \phi^T(x) A^{-1} \Phi^T t \equiv k^T(x) t    (7)

where \hat{y}(x) is the function obtained from equation 3 using \hat{w} as the weight vector. Equation 7 defines the effective kernel k(x) and makes it clear that \hat{y}(x) can be written as a linear combination of the target values, i.e. it is a linear smoother (see, e.g. Hastie and Tibshirani (2)). The contribution of the uncertainty of the weights to the variance of the prediction is given from equation 2 by

    \sigma_w^2(x) = \phi^T(x) A^{-1} \phi(x)    (8)

Note that for generalized linear regression this expression is exact, and that the error bars (given \sigma_\nu^2) are independent of the targets.

3 ERROR BARS FOR GLR

[Figure 1: A schematic illustration of the effect of one data point on \sigma_y^2(x). The posterior variance is reduced from its prior level in the neighbourhood of the data point (+), but remains above the noise level.]

In this section we analyze the response of the prior variance to the addition of the data points. In particular we show that the effect of a single data point is to pull the \sigma_y^2(x) surface down to a value less than 2\sigma_\nu^2(x) at and nearby to the data point, and that the length scale over which this effect operates is determined by the prior covariance function

    C(x, x') = \phi^T(x) A_0^{-1} \phi(x')    (9)

where A_0 = S. (The analysis in this section permits the noise level to vary as a function of x.)

The main tool used in this analysis is the effect of adding just one data point. A schematic illustration of this effect is shown in Figure 1. The variance due to the prior is quite large (and roughly constant over x-space). Adding a single data point pulls down the variance in its neighbourhood (but not as far as the \sigma_\nu^2 limit). Figure 1 is relevant because we can show (see Appendix A.1) that \sigma_y^2(x), when all data points are used to compute the Hessian, is never greater than \sigma_y^2(x) when any subset of the data points are used, and hence the surface pertaining to any particular data point is an upper bound on the overall surface.
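The subset bound just invoked can be checked numerically. The following is a hedged sketch (an assumed two-parameter model with made-up inputs, not the paper's experiment): it builds the Hessian for a one-point subset and for the full data set, and verifies that the full-data variance surface never exceeds the subset surface at any test point:

```python
# Illustrative check of the subset bound: adding data points to the Hessian of a
# generalized linear regression model can never increase sigma_w^2(x) anywhere.
# Basis phi(x) = (1, x), prior strength and inputs are assumptions for the sketch.

def inv2(m):                   # inverse of a 2x2 matrix given as ((a, b), (c, d))
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def quad(m, v):                # v^T m v for a 2x2 matrix and a 2-vector
    (a, b), (c, d) = m
    v0, v1 = v
    return v0 * (a * v0 + b * v1) + v1 * (c * v0 + d * v1)

def hessian(points, alpha=0.1, sigma_nu2=0.05):
    # A = S + (1/sigma_nu2) * sum_n phi(x_n) phi(x_n)^T with S = alpha*I
    A = [[alpha, 0.0], [0.0, alpha]]
    for x in points:
        phi = (1.0, x)
        for i in range(2):
            for j in range(2):
                A[i][j] += phi[i] * phi[j] / sigma_nu2
    return (tuple(A[0]), tuple(A[1]))

subset = [0.3]
full = [0.3, -1.2, 2.0]        # the subset plus two extra points
A_sub_inv = inv2(hessian(subset))
A_full_inv = inv2(hessian(full))

# sigma_w^2(x) = phi(x)^T A^{-1} phi(x); the full-data surface should be an
# everywhere lower (or equal) bound on the subset surface.
gap = [quad(A_sub_inv, (1.0, x)) - quad(A_full_inv, (1.0, x))
       for x in [-3.0, -1.0, 0.0, 0.3, 1.0, 4.0]]
print(min(gap))
```

The non-negativity of every gap is exactly the statement that the one-point surface upper-bounds the overall surface.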
To obtain a bound on the depth of the dip, consider the case when there is only one data point (at x = x_1), so that the Hessian is given by A_1 = A_0 + \phi(x_1)\phi^T(x_1)/\sigma_\nu^2(x_1). Using the identity

    (M + vv^T)^{-1} = M^{-1} - \frac{(M^{-1}v)(v^T M^{-1})}{1 + v^T M^{-1} v}    (10)

it is easy to show that

    \sigma_{w|x_1}^2(x_1) = \sigma_\nu^2(x_1) \frac{r}{1 + r}    (11)

where \sigma_{w|x_1}^2 denotes the posterior weight uncertainty surface due to a data point at x_1 and

    r = \frac{\phi^T(x_1) A_0^{-1} \phi(x_1)}{\sigma_\nu^2(x_1)}    (12)

i.e. r is the ratio of the prior to noise variances at the point x_1. For any positive value of z, the function z/(1 + z) lies between 0 and 1, hence we see that the \sigma_w^2 contribution to the error bars must always be less than \sigma_\nu^2(x_1) at a data point. Typically the noise variance is much smaller than the prior variance, so r \gg 1. Further evidence that \sigma_w^2 at any data point is of the order of \sigma_\nu^2(x) is provided by the calculation in appendix A.2 which shows that the average of \sigma_w^2(x_n) at the data points is less than m\sigma_\nu^2/N, where m is the number of weights in the model and N is the number of data points. For a single data point at x_1, we can use equation 10 to show that

    \sigma_{w|x_1}^2(x) = C(x, x) - \frac{(C(x, x_1))^2}{\sigma_\nu^2(x_1) + C(x_1, x_1)}    (13)

Hence the width of the depression in the variance surface is related to the characteristic length scale of the
prior covariance function C(x, x'). It is also possible to show that if a test point x has zero covariance C(x, x_i) with all of the training points {x_i}, then its posterior variance will be equal to its prior variance. We are currently exploring the properties of C(x, x') for different weight priors and choices of basis functions. However, we note that a simple diagonal prior S = \alpha I as used by some authors is not in general a very sensible prior, because if the type of basis functions used (e.g. Gaussians, tanh functions etc.) is changed, then the covariance structure of the prior also changes. More sensibly, the weight prior should be chosen so as to approximate some desired prior covariance function C(x, x').

4 DENSITY DEPENDENCE OF \sigma_w^2(x)

As we have already noted, error bars on network predictions would be expected to be relatively large in regions of input space for which there is little data, and smaller in regions of high data density. In this section, we establish an approximate proportionality between the variance due to weight uncertainty and the inverse of the probability density of training data, valid in regions of high data density. A relationship of this kind was conjectured in Bishop (3).

We first consider a special case of the class of generalized linear models where the basis functions are non-overlapping bin (or "top-hat") activation functions. Let the i-th basis function have height h_i and a d-dimensional "base area" of V_i, where d is the dimensionality of x. If we choose a diagonal prior (S = \alpha I) then the Hessian is diagonal and thus easy to invert:

    B_{ij} = \frac{1}{\sigma_\nu^2} \sum_q \phi_i(x_q) \phi_j(x_q) = \begin{cases} n_i h_i^2 / \sigma_\nu^2 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}    (14)

where n_i is the number of data points falling in bin i. From equation (8) the error bars associated with a point x which falls into the i-th bin are given by

    \sigma_w^2(x) = \frac{h_i^2}{\alpha + n_i h_i^2 / \sigma_\nu^2}    (15)

As usual, the effect of the prior is to reduce the size of the error bar compared to the case where it is not present. In the limit of \alpha \to 0 we have

    \sigma_w^2(x) = \frac{\sigma_\nu^2}{n_i} = \frac{\sigma_\nu^2}{N V_i \hat{p}(x)}    (16)

where N is the total number of data points and \hat{p}(x) is the histogram estimate of the density inside the bin containing x.
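The bin-basis result above reduces to elementary arithmetic, which the following sketch makes explicit (the bins, data and noise level are illustrative assumptions, not taken from the paper): in the small-prior limit the error bar in each occupied bin is the noise variance divided by the bin count, i.e. inversely proportional to the histogram density estimate:

```python
# Hedged sketch of the non-overlapping "bin" basis: with a diagonal prior the
# Hessian is diagonal, and as alpha -> 0 the error bar in bin i is sigma_nu2/n_i,
# which equals sigma_nu2 / (N * V * p_hat). All numbers here are illustrative.

sigma_nu2 = 0.04
h, V = 1.0, 0.5                      # bin height and base "area" (1-d width)
edges = [0.0, 0.5, 1.0, 1.5, 2.0]    # four bins covering [0, 2)
data = [0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 1.2, 1.7]
N = len(data)

counts = [sum(1 for x in data if lo <= x < hi)
          for lo, hi in zip(edges[:-1], edges[1:])]

def var_w(i, alpha):
    # sigma_w^2 in bin i: h^2 / (alpha + n_i * h^2 / sigma_nu2)
    return h * h / (alpha + counts[i] * h * h / sigma_nu2)

def hist_density(i):
    # histogram density estimate inside bin i: n_i / (N * V)
    return counts[i] / (N * V)

# In the alpha -> 0 limit: dense bins get small error bars, sparse bins large ones.
limit = [var_w(i, 0.0) for i in range(4)]
predicted = [sigma_nu2 / (N * V * hist_density(i)) for i in range(4)]
print(limit, predicted)
```

A non-zero prior only shrinks the bars further, consistent with the remark that the prior reduces the size of the error bar.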
Equation (16) demonstrates that for this kind of model the error bars are inversely proportional to the input density and to the volume factor V_i. (This analysis can easily be extended to arbitrary non-constant basis functions as long as they do not overlap.) It also shows that we can understand the reduction in the variance \sigma_y^2(x) in regions of high density as the 1/n effect for the variance of the mean of n i.i.d. Gaussian variables each of which has variance \sigma_\nu^2.

The aim of the remainder of this section is to show how results similar to those for the bin basis functions can be obtained, in certain circumstances, for generalized linear regression models, i.e. that the error bars will be inversely proportional to p(x) and an area factor V(x). The key idea needed is that of an effective kernel, which we now describe. As noted in equation 7, we can write \hat{y}(x) = k^T(x) t, where k(x) is the effective kernel. To take this analysis further it is helpful to think of \hat{y}(x) = k^T(x) t = \sum_i k_i t_i as an approximation to the integral \hat{y}(x) = \int K(z, x) t(z) dz, where K(z, x) (regarded as a function of z) is the effective kernel for the point x and t(z) is a "target function". Following similar reasoning we obtain

    B = \frac{1}{\sigma_\nu^2} \sum_q \phi(x_q) \phi^T(x_q) \simeq \frac{N}{\sigma_\nu^2} \int p(x) \phi(x) \phi^T(x) dx    (17)

If the original basis functions are linearly combined to produce a new set \tilde{\phi} = C\phi, then the matrix C can be chosen so that \int p(x) \tilde{\phi}_i(x) \tilde{\phi}_j(x) dx = \delta_{ij}, where \delta_{ij} is the Kronecker delta. From now on it is assumed that we are working with the orthonormal basis functions (i.e. the tildes are omitted) and that B = (N/\sigma_\nu^2) I. Ignoring the weight prior we obtain

    \hat{w} = A^{-1} \frac{1}{\sigma_\nu^2} \Phi^T t = \frac{1}{N} \Phi^T t    (18)

However,

    (\Phi^T t)_i = \sum_q \phi_i(x_q) t(x_q) \simeq N \int \phi_i(z) p(z) t(z) dz    (19)

and so

    \hat{y}(x) = \frac{1}{N} \phi^T(x) \Phi^T t    (20)
               = \int \left\{ \sum_i \phi_i(x) \phi_i(z) p(z) \right\} t(z) dz    (21)
               = \int K(z, x) t(z) dz    (22)

We can also show that K(z, x) is the projection of the delta function onto the basis space {\psi_i}, where \psi_i(x) = \phi_i(x) p(x), and that if a constant (bias) function is one of the original basis functions (before orthonormalization), then \int K(z, x) dz = 1. The fact that K(z, x) is an approximation to the delta
function suggests that as the number of basis functions increases the effective kernel should become more tightly peaked and concentrated around x.

[Figure 2 appears here; its axes are labelled "probability density" (panel A) and "noise variance/(N*variance)" (panel B), with legend entries Gaussian, Sigmoid, Polynomial and Network. The caption is given below.]

We now turn to the variance of the generalized linear model. Using orthonormal basis functions, the error bar at x is given by \sigma_w^2(x) = (\sigma_\nu^2/N) \phi^T(x) \phi(x). However, this can be rewritten in terms of the effective kernel as

    \sigma_w^2(x) = \frac{\sigma_\nu^2}{N} \int \frac{K^2(z, x)}{p(z)} dz    (23)

using the orthonormality properties. If K(z, x) is sharply peaked around x (i.e. it looks something like a Gaussian) then the p(z) in the denominator can be pulled through the integral sign as p(x). Also, \int K^2(z, x) dz measures the inverse base area of K(z, x); for example, for a one dimensional Gaussian with standard deviation \sigma centered at x we find that \int K^2(z, x) dz = 1/(2\sqrt{\pi}\sigma). Defining

    \int K^2(z, x) dz \equiv \frac{1}{V(x)}    (24)

we can write

    \sigma_w^2(x) \simeq \frac{\sigma_\nu^2}{N p(x) V(x)}    (25)

By extending the analysis of appendix A.2 to the continuous case we obtain

    \int \sigma_w^2(x) p(x) dx = \frac{\gamma \sigma_\nu^2}{N}    (26)

where \gamma is the effective number of parameters in the model (1), showing that we would expect \sigma_w^2(x) to be larger for a model with more parameters.

Under the assumption that K(z, x) is sharply peaked about x we have obtained a result in equation 25 similar to equation 16 for the bin basis functions. We will now present evidence to show that this relationship holds experimentally. The first experiment has a one dimensional input space. The probability density from which the data was drawn is shown in Figure 2(A). Figure 2(B) shows that for a range of GLR models (and for a two-layer perceptron) there is a close relationship between \sigma_w^2(x) and the density, indicating that V(x) is roughly constant in the high density regions for these models. This conclusion is backed up by Figure 4, which plots V^{-1}(x) = \int K^2(z, x) dz against x. The log-log plot in Figure 3 also indicates that the relationship \sigma_w^2(x) \propto p^{-1}(x) holds quite reliably, especially for areas with high data density.

It is interesting to note that the error bar \sigma_w^2(x) can also be obtained from the finite-dimensional effective kernel defined by \hat{y}(x) = k^T(x) t.
Using the assumption that each t_i has independent, zero-mean noise of variance \sigma_\nu^2, we find that the variance of the linear combination \hat{y}(x) is \sigma_w^2(x) = \sigma_\nu^2 k^T k, which can easily be shown to be equivalent to \sigma_w^2(x) = (\sigma_\nu^2/N) \phi^T(x) \phi(x) when the weight prior is neglected.

[Figure 2: (A) A mixture of two Gaussian densities, from which data points were drawn for the experiments. (B) shows the (scaled) inverse variance against x for three generalized linear regression (GLR) models and a neural network. The GLR models used Gaussian, sigmoid and polynomial basis functions respectively, and each model consisted of 16 basis functions and a bias and was trained on data points drawn from this density. (B) also shows the inverse variance for a two layer perceptron with two hidden units. The net was trained on a data set with inputs drawn from the density shown in panel (A) and targets generated from sin(x) with added zero-mean Gaussian noise. For all four models the similarity between the inverse variance for these models and the plot of the density is striking.]

Figure 2(B) also shows that the dependence of the overall magnitude of \sigma_w^2 on the number of effective parameters described in equation 26 holds; the two-layer perceptron, which has only seven weights compared to the 16 in the GLR models, has a correspondingly larger inverse variance.

Some effective kernels for the GLR model with a bias and 16 Gaussian basis functions of standard deviation 0.5, spaced equally across the input range, are shown in Figures 5 and 6. (Similar {\psi_i} and kernels are obtained for sigmoidal and polynomial basis functions.) The kernels in Figure 5 correspond to areas of high density and show a strong, narrow single peak. For regions of low density Figure 6 shows that the kernels are much wider and more oscillatory, indicating that target values from a wide range of x values are used to compute \hat{y}(x). As the widths of the kernels in the low density regions
are greater than the length scale of the variation of the density, we would expect the approximation used in equation 25 to break down at this point. We have conducted several other experiments with one and two dimensional input spaces which produce similar results to those shown in the log-log plot, Figure 3, including a two-layer perceptron which learned to approximate a function of two inputs.

[Figure 3: Plot of the log inverse density of the input data against the log of \sigma_w^2(x) for a generalized linear model with 16 Gaussian basis functions. Note that the points lie close to the line with slope 1, indicating that \sigma_w^2(x) \propto p^{-1}(x).]

[Figure 4: Plot of V^{-1}(x) = \int K^2(z, x) dz against x for a GLR model with 16 Gaussian basis functions spaced equally across the input range, and a bias. Note that the plot is roughly constant in regions of high density.]

[Figure 5: Effective kernels at two values of x corresponding to high density regions, as shown in Figure 2(A). See text for further discussion.]

While this relationship between \sigma_w^2(x) and the input data density is interesting, it should be noted that its validity is limited at best to regions of high data density. Furthermore, in such regions the contribution to the error bars from \sigma_w^2(x) is dwarfed by that from the noise term. This can be seen in the case of non-overlapping basis functions from equation 16. More generally we can consider the extension of the result to the case of n data points all located at x_1. This leads to

    \sigma_{w|nx_1}^2(x_1) \simeq \sigma_\nu^2 \frac{r}{1 + nr} \approx \frac{\sigma_\nu^2}{n}    (27)

again indicating that for regions of high data density the noise term will dominate.

5 DISCUSSION

In this paper we have analyzed the behaviour of the Bayesian error bars for generalized linear regression models. For the case of a single isolated data point we have shown that the error bar is pulled down close to the noise level, and that the length scale over which this effect occurs is characterized by the prior covariance function.
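The co-located-points result quoted above, \sigma_\nu^2 r/(1 + nr), reduces to scalar arithmetic for a model with a single basis function. The following sketch (all values are assumptions for illustration) confirms that the direct computation \phi^2(x_1)/A_n agrees with that formula and approaches \sigma_\nu^2/n once nr is large:

```python
# Scalar sketch of n data points all at x_1, for a one-basis-function model:
# sigma_w^2(x_1) = phi(x_1)^2 / A_n with A_n = A_0 + n*phi(x_1)^2/sigma_nu2,
# which equals sigma_nu2 * r / (1 + n*r). Numbers below are illustrative.

s = 0.2                  # prior precision A_0 = S (scalar model, assumed)
sigma_nu2 = 0.05
phi1 = 1.3               # phi(x_1), assumed
r = (phi1 * phi1 / s) / sigma_nu2    # ratio of prior to noise variance at x_1

def var_direct(n):
    # direct evaluation of phi(x_1)^2 / A_n
    return phi1 * phi1 / (s + n * phi1 * phi1 / sigma_nu2)

def var_formula(n):
    # the closed form sigma_nu2 * r / (1 + n*r)
    return sigma_nu2 * r / (1.0 + n * r)

vals = [(var_direct(n), var_formula(n)) for n in (1, 5, 50)]
print(vals)
print(var_direct(50) * 50)   # close to sigma_nu2 once n*r >> 1
```

So at a heavily sampled location the weight-uncertainty term scales like \sigma_\nu^2/n, which is why the noise term dominates the error bar there.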
We have also shown theoretically that, in regions of high data density, the contribution to the output variance due to the uncertainty in the weights can exhibit an approximate inverse proportionality to the data density. These findings have been supported by numerical simulation. Also, we have noted that, in such high-density regions, this contribution to the variance will be insignificant compared to the contribution arising from the noise term.

Although much of the theoretical analysis has been performed for generalized linear regression models, there is empirical evidence that similar results hold also for multi-layer networks. Furthermore, if the outputs of the network have linear activation functions, then under least-squares training it is effectively a generalized linear regression model with adaptive basis functions. It is therefore a linear smoother with \hat{y}(x) = k^T(x) t, and hence the result that \sigma_w^2 = \sigma_\nu^2 k^T k will still hold. Other results, including the expression 36 derived in Appendix A.2, also hold for general non-linear networks, provided we make the usual Gaussian approximation for the posterior weight distribution, and the outer-product approximation to the Hessian.

One potentially important limitation of the models considered in this paper (and indeed of the models considered by most authors) is that the noise variance is assumed to be a constant, independent of x. To understand why this assumption may be particularly restrictive, consider the situation in which there is a lot of data in one region of input space and a single data point in another region. The estimate of the noise variance, which we shall assume to be relatively small, will be dominated by the high density region. However, as we have seen, the error bar will be pulled down to less than 2\sigma_\nu^2 in the neighbourhood of the isolated data point. The model is therefore highly confident of the regression function (i.e. the most probable interpolant) in this region even though there is only a single data point present! If, however, we relax the assumption of a constant \sigma_\nu^2 then we see that in the neighbourhood of the isolated data point there is little evidence to suggest a small value of \sigma_\nu^2 and so we would expect much larger error bars. We are currently investigating models in which \sigma_\nu^2(x) is adapted to the data.

[Figure 6: Effective kernels at two values of x corresponding to low density regions of the input space, as shown in Figure 2(A). Note that the density function seems to define an "envelope" for the lower kernel; even though x may be in a low density region, the magnitude of K(z, x) is largest in the high density regions. See text for further discussion.]

Acknowledgements

This work was mainly supported by EPSRC grant GR/J755 (CW) and by an EPSRC post-graduate scholarship to CQ.

APPENDICES

A.1

In this appendix we show that for generalized linear regression, \sigma_{y|D}^2(x) \le \sigma_{y|T}^2(x), where D is the full data set ((x_1, t_1), ..., (x_N, t_N)) and T is a subset of this data set. We first note that as \sigma_\nu^2(x) is equal in both cases, we are only concerned about the relative contributions from the weight uncertainty to the overall variance. The key to the proof is to decompose the Hessian A into two parts, A_1 and A_2, where

    A_1 = A_0 + \sum_{q \in T} \frac{1}{\sigma_\nu^2} \phi_q \phi_q^T,    A_2 = \sum_{q \notin T} \frac{1}{\sigma_\nu^2} \phi_q \phi_q^T    (28)

and A_0 = S. Note that A_1 and A_2 are symmetric non-negative definite, and hence A_1^{-1} and A_2^{-1} are also (using the Moore-Penrose pseudo-inverse if necessary). The matrix identity

    (A_1 + A_2)^{-1} = A_1^{-1} - A_1^{-1} (A_1^{-1} + A_2^{-1})^{-1} A_1^{-1}    (29)

implies that for any vector v

    v^T (A_1 + A_2)^{-1} v = v^T A_1^{-1} v - (A_1^{-1} v)^T (A_1^{-1} + A_2^{-1})^{-1} (A_1^{-1} v)    (30)

From the non-negative definite condition we see that the second term in equation 30 is always non-negative, and hence

    v^T A_1^{-1} v \ge v^T (A_1 + A_2)^{-1} v    (31)

Substituting \phi(x) for v completes the proof.

A.2

In this appendix we show that \bar{\sigma}_w^2, the average value of \sigma_w^2(x) evaluated at the data points, is equal to \gamma \sigma_\nu^2 / N, where \gamma (\le m) is the effective number of parameters in the model (1):

    \bar{\sigma}_w^2 = \frac{1}{N} \sum_n \sigma_w^2(x_n)    (32)
                     = \frac{1}{N} \sum_n \phi^T(x_n) A^{-1} \phi(x_n)    (33)
                     = \frac{1}{N} tr[\Phi A^{-1} \Phi^T]    (34)
                     = \frac{\sigma_\nu^2}{N} tr(B A^{-1})    (35)
                     = \frac{\gamma \sigma_\nu^2}{N}    (36)

where \gamma = tr(B A^{-1}).

References

1. MacKay D. J. C., 1992, "Bayesian Interpolation", Neural Computation, 4(3), 415-447.
2. Hastie T. J. and Tibshirani R. J., 1990, "Generalized Additive Models", Chapman and Hall.
3. Bishop C. M., 1994, "Novelty detection and neural network validation", IEE Proceedings: Vision, Image and Signal Processing, 141, 217-222.
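As a closing numerical note (an illustration under an assumed two-parameter model, not part of the original paper), the averaged-variance identity of Appendix A.2 can be verified directly: the mean of \sigma_w^2(x_n) over the training inputs equals \gamma \sigma_\nu^2 / N with \gamma = tr(B A^{-1}):

```python
# Hedged sketch verifying the trace identity: average sigma_w^2 at the data
# points equals gamma * sigma_nu2 / N, gamma = tr(B A^{-1}) (effective number
# of parameters). Basis phi(x) = (1, x); all numbers are illustrative.

def inv2(m):                   # inverse of a 2x2 matrix given as ((a, b), (c, d))
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matmul2(p, q):             # product of two 2x2 matrices
    return tuple(tuple(sum(p[i][k] * q[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

alpha, sigma_nu2 = 0.3, 0.1
xs = [-1.0, -0.2, 0.4, 1.1, 2.5]
N = len(xs)

# B = (1/sigma_nu2) * Phi^T Phi; A = B + S with S = alpha*I.
B = [[0.0, 0.0], [0.0, 0.0]]
for x in xs:
    phi = (1.0, x)
    for i in range(2):
        for j in range(2):
            B[i][j] += phi[i] * phi[j] / sigma_nu2
A = ((B[0][0] + alpha, B[0][1]), (B[1][0], B[1][1] + alpha))
A_inv = inv2(A)

BA_inv = matmul2((tuple(B[0]), tuple(B[1])), A_inv)
gamma = BA_inv[0][0] + BA_inv[1][1]          # tr(B A^{-1}), between 0 and m = 2

def var_w(x):
    phi = (1.0, x)
    return sum(phi[i] * A_inv[i][j] * phi[j] for i in range(2) for j in range(2))

avg = sum(var_w(x) for x in xs) / N
print(avg, gamma * sigma_nu2 / N)
```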
More informationDETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH
Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,
More informationTransfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system
Transfer Functons Convenent representaton of a lnear, dynamc model. A transfer functon (TF) relates one nput and one output: x t X s y t system Y s The followng termnology s used: x y nput output forcng
More informationwhere the sums are over the partcle labels. In general H = p2 2m + V s(r ) V j = V nt (jr, r j j) (5) where V s s the sngle-partcle potental and V nt
Physcs 543 Quantum Mechancs II Fall 998 Hartree-Fock and the Self-consstent Feld Varatonal Methods In the dscusson of statonary perturbaton theory, I mentoned brey the dea of varatonal approxmaton schemes.
More informationChapter 9: Statistical Inference and the Relationship between Two Variables
Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,
More informationMIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU
Group M D L M Chapter Bayesan Decson heory Xn-Shun Xu @ SDU School of Computer Scence and echnology, Shandong Unversty Bayesan Decson heory Bayesan decson theory s a statstcal approach to data mnng/pattern
More informationEcon107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)
I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes
More informationU-Pb Geochronology Practical: Background
U-Pb Geochronology Practcal: Background Basc Concepts: accuracy: measure of the dfference between an expermental measurement and the true value precson: measure of the reproducblty of the expermental result
More informationExplaining the Stein Paradox
Explanng the Sten Paradox Kwong Hu Yung 1999/06/10 Abstract Ths report offers several ratonale for the Sten paradox. Sectons 1 and defnes the multvarate normal mean estmaton problem and ntroduces Sten
More informationMaximum Likelihood Estimation (MLE)
Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y
More informationChapter 6. Supplemental Text Material
Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.
More informationSupporting Information
Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to
More information2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification
E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton
More informationStatistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics )
Ismor Fscher, 8//008 Stat 54 / -8.3 Summary Statstcs Measures of Center and Spread Dstrbuton of dscrete contnuous POPULATION Random Varable, numercal True center =??? True spread =???? parameters ( populaton
More informationModule 3: Element Properties Lecture 1: Natural Coordinates
Module 3: Element Propertes Lecture : Natural Coordnates Natural coordnate system s bascally a local coordnate system whch allows the specfcaton of a pont wthn the element by a set of dmensonless numbers
More informationLogistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton
More informationLOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin
Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence
More informationFor now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.
Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson
More informationNon-linear Canonical Correlation Analysis Using a RBF Network
ESANN' proceedngs - European Smposum on Artfcal Neural Networks Bruges (Belgum), 4-6 Aprl, d-sde publ., ISBN -97--, pp. 57-5 Non-lnear Canoncal Correlaton Analss Usng a RBF Network Sukhbnder Kumar, Elane
More informationLinear Feature Engineering 11
Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19
More informationEPR Paradox and the Physical Meaning of an Experiment in Quantum Mechanics. Vesselin C. Noninski
EPR Paradox and the Physcal Meanng of an Experment n Quantum Mechancs Vesseln C Nonnsk vesselnnonnsk@verzonnet Abstract It s shown that there s one purely determnstc outcome when measurement s made on
More informationResearch Article Green s Theorem for Sign Data
Internatonal Scholarly Research Network ISRN Appled Mathematcs Volume 2012, Artcle ID 539359, 10 pages do:10.5402/2012/539359 Research Artcle Green s Theorem for Sgn Data Lous M. Houston The Unversty of
More informationChapter 13: Multiple Regression
Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to
More informationChapter 3 Differentiation and Integration
MEE07 Computer Modelng Technques n Engneerng Chapter Derentaton and Integraton Reerence: An Introducton to Numercal Computatons, nd edton, S. yakowtz and F. zdarovsky, Mawell/Macmllan, 990. Derentaton
More information8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS
SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 493 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces you have studed thus far n the text are real vector spaces because the scalars
More informationFeature Selection: Part 1
CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?
More informationBOUNDEDNESS OF THE RIESZ TRANSFORM WITH MATRIX A 2 WEIGHTS
BOUNDEDNESS OF THE IESZ TANSFOM WITH MATIX A WEIGHTS Introducton Let L = L ( n, be the functon space wth norm (ˆ f L = f(x C dx d < For a d d matrx valued functon W : wth W (x postve sem-defnte for all
More informationNumerical Heat and Mass Transfer
Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and
More informationELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM
ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM An elastc wave s a deformaton of the body that travels throughout the body n all drectons. We can examne the deformaton over a perod of tme by fxng our look
More informationSupplementary Notes for Chapter 9 Mixture Thermodynamics
Supplementary Notes for Chapter 9 Mxture Thermodynamcs Key ponts Nne major topcs of Chapter 9 are revewed below: 1. Notaton and operatonal equatons for mxtures 2. PVTN EOSs for mxtures 3. General effects
More informationLecture 3: Dual problems and Kernels
Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM
More informationEconomics 130. Lecture 4 Simple Linear Regression Continued
Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do
More information2 STATISTICALLY OPTIMAL TRAINING DATA 2.1 A CRITERION OF OPTIMALITY We revew the crteron of statstcally optmal tranng data (Fukumzu et al., 1994). We
Advances n Neural Informaton Processng Systems 8 Actve Learnng n Multlayer Perceptrons Kenj Fukumzu Informaton and Communcaton R&D Center, Rcoh Co., Ltd. 3-2-3, Shn-yokohama, Yokohama, 222 Japan E-mal:
More information2016 Wiley. Study Session 2: Ethical and Professional Standards Application
6 Wley Study Sesson : Ethcal and Professonal Standards Applcaton LESSON : CORRECTION ANALYSIS Readng 9: Correlaton and Regresson LOS 9a: Calculate and nterpret a sample covarance and a sample correlaton
More informationWhy feed-forward networks are in a bad shape
Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de
More informationMidterm Examination. Regression and Forecasting Models
IOMS Department Regresson and Forecastng Models Professor Wllam Greene Phone: 22.998.0876 Offce: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Emal: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/regresson/outlne.htm
More information1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations
Physcs 171/271 -Davd Klenfeld - Fall 2005 (revsed Wnter 2011) 1 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys
More informationChapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems
Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons
More information2.3 Nilpotent endomorphisms
s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms
More informationThe conjugate prior to a Bernoulli is. A) Bernoulli B) Gaussian C) Beta D) none of the above
The conjugate pror to a Bernoull s A) Bernoull B) Gaussan C) Beta D) none of the above The conjugate pror to a Gaussan s A) Bernoull B) Gaussan C) Beta D) none of the above MAP estmates A) argmax θ p(θ
More informationDr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur
Analyss of Varance and Desgn of Experment-I MODULE VII LECTURE - 3 ANALYSIS OF COVARIANCE Dr Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur Any scentfc experment s performed
More informationStructure and Drive Paul A. Jensen Copyright July 20, 2003
Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.
More informationThe topics in this section concern with the second course objective. Correlation is a linear relation between two random variables.
4.1 Correlaton The topcs n ths secton concern wth the second course objectve. Correlaton s a lnear relaton between two random varables. Note that the term relaton used n ths secton means connecton or relatonshp
More informationLecture 12: Discrete Laplacian
Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly
More informationLecture 10 Support Vector Machines. Oct
Lecture 10 Support Vector Machnes Oct - 20-2008 Lnear Separators Whch of the lnear separators s optmal? Concept of Margn Recall that n Perceptron, we learned that the convergence rate of the Perceptron
More informationKernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan
Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems
More informationFeb 14: Spatial analysis of data fields
Feb 4: Spatal analyss of data felds Mappng rregularly sampled data onto a regular grd Many analyss technques for geophyscal data requre the data be located at regular ntervals n space and/or tme. hs s
More informationx = , so that calculated
Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to
More informationDUE: WEDS FEB 21ST 2018
HOMEWORK # 1: FINITE DIFFERENCES IN ONE DIMENSION DUE: WEDS FEB 21ST 2018 1. Theory Beam bendng s a classcal engneerng analyss. The tradtonal soluton technque makes smplfyng assumptons such as a constant
More informationMore metrics on cartesian products
More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of
More informationU.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016
U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and
More informationCHAPTER III Neural Networks as Associative Memory
CHAPTER III Neural Networs as Assocatve Memory Introducton One of the prmary functons of the bran s assocatve memory. We assocate the faces wth names, letters wth sounds, or we can recognze the people
More information