Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where unknown faults may occur


Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where unknown faults may occur

Yuhua Li, Michael J. Pont and N. Barrie Jones

Control & Instrumentation Research Section, Department of Engineering, University of Leicester, Leicester, LE1 7RH, United Kingdom

Abstract: This paper presents a novel technique which may be used to determine an appropriate threshold for interpreting the outputs of a trained Radial Basis Function (RBF) classifier. Results from two experiments demonstrate that this method can be used to improve the performance of RBF classifiers in practical applications.

Keywords: Radial basis function network, condition monitoring, novelty detection, threshold determination

Pre-print of: Li, Y., Pont, M.J. and Jones, N.B. (2002) Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where unknown faults may occur, Pattern Recognition Letters, 23:

To whom correspondence should be addressed.

1. Introduction

Radial Basis Function (RBF) neural networks (Broomhead and Lowe, 1988) have been widely used in classification problems such as speech recognition, medical diagnosis, handwriting recognition, image processing, and fault diagnosis. In such applications, RBF networks are frequently used with a winner-takes-all (WTA) output rule (Scholkopf et al., 1997). This rule is simple to use, and is often an appropriate solution (Cordella et al., 1995). However, in this paper, we are primarily concerned with condition monitoring and fault diagnosis (CMFD) systems: for such systems, the WTA rule is not always appropriate. Instead, the outputs of the classifier may be interpreted by applying a threshold to the output vector: if the value of an output neuron exceeds the given threshold, then an example of the corresponding class is said to have occurred (e.g. Joshi et al., 1997; Cheon et al., 1993).

There are two reasons why threshold-based output interpretations are appropriate in CMFD systems. The first reason is that a CMFD classifier is often required not only to classify known normal and fault input vectors, but also to recognise that a particular input is neither normal, nor a member of one of the existing fault categories: this process is often called novelty detection (Bishop, 1994). By using a threshold-based classifier, outputs may be readily interpreted as an unknown fault when none of the normal or fault output neurons exceeds the threshold. The second reason for using threshold-based output interpretations is that the WTA classifier is unsuitable for use in applications where an input vector may need to be assigned to more than one class: this is often the case in CMFD applications, where multiple faults may occur simultaneously (see, for example, Cheon et al., 1993).
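The threshold-based output interpretation described above is straightforward to express in code. The sketch below is illustrative only (the function name and the numbers are our own): it returns every class whose output neuron exceeds the threshold, so an empty result signals a possible unknown fault, and multiple results signal simultaneous faults.

```python
def interpret_outputs(y_hat, tau):
    """Threshold-based interpretation of a classifier output vector:
    return the indices of all classes whose output exceeds tau.
    An empty list signals a novel (unknown) condition; more than one
    index signals simultaneous faults."""
    return [i for i, y in enumerate(y_hat) if y > tau]

# A clear single-fault response, a double-fault response, and a novel input:
print(interpret_outputs([0.1, 0.9, 0.2], 0.5))  # [1]
print(interpret_outputs([0.1, 0.8, 0.7], 0.5))  # [1, 2]
print(interpret_outputs([0.2, 0.3, 0.1], 0.5))  # []
```

A winner-takes-all rule, by contrast, always returns exactly one class, and so can express neither of the last two cases.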
Despite these potential advantages, use of a threshold classifier for CMFD applications can be problematic, because the overall performance of the system depends on the use of an appropriate threshold value: in most published work in this area, thresholds are simply set empirically, usually at values of 0.5 (Watanabe et al., 1994), a process which does not necessarily result in optimal classifier performance (Joshi et al., 1997). To improve the performance of threshold classifiers in CMFD and other application areas, methods for identifying the optimal threshold values are required.

In this paper, we present a novel technique for determining an appropriate threshold for RBF classifiers: this technique is based on an analysis of the relationship between the behaviour of a well-trained RBF classifier and its response to the data set. The paper is organised as follows: in Section 2, a method for determining a suitable threshold is derived, based on theoretical considerations. Following that, the results of two empirical tests are presented in Section 3. The results are discussed in Section 4.

2. Theoretical Considerations

A technique for reliable threshold selection in RBF classifiers will be derived in this section. The problem of threshold selection for neural network classifiers is first formulated in Section 2.1. In Section 2.2, the decision behaviour of RBF classifiers is mathematically and geometrically analysed. The reliable threshold selection method for RBF classifiers is then proposed in Section 2.3.

2.1 The general problem

A basic learning problem can be represented by six components (Smolensky et al., 1996): X, Y, A, H, P, and L. The first four components are the instance (input vector), outcome, decision, and decision rule spaces, respectively. X is an arbitrary set, Y = A = {0, 1}, and H is a family of functions from X into A. The fifth component, P, is a family of joint probability distributions on Z = X × Y. These represent the possible states of nature that might be governing the generation of examples. The last component, the loss function L, is a mapping from Y × A into the real number set R. This paper mainly concerns the behaviour of A and L with respect to the threshold.

Assume that the classifier has been well trained with m samples, z = (z₁, …, z_m), where z_i = (x_i, y_i) ∈ Z, drawn independently at random according to some probability distribution P ∈ P. After training, the classifier is applied to a set of samples whose class is known. Suppose the outcome of output neuron i is ŷ_i; when a threshold τ is applied to ŷ_i, we have the hypothesis h ∈ H that specifies the appropriate action a_i ∈ A as:

    h: X → A,  a_i(x, τ) = θ(ŷ_i − τ)    (1)

where:

    θ(δ) = 0 if δ ≤ 0;  θ(δ) = 1 if δ > 0

This gives (1) the following values:

    a_i(x, τ) = θ(ŷ_i − τ) = 0 if ŷ_i ≤ τ;  1 if ŷ_i > τ    (2)

a_i equals one to indicate that class i has occurred, and zero otherwise. We consider the following loss function:

    L(z, τ) = L(y, a(x, τ)) = 0 if a = y;  1 if a ≠ y    (3)

and the risk function (Scholkopf et al., 1997):

    I(τ) = ∫ L(z, τ) dP(z)    (4)

Since the probability distribution function P(z) is unknown, but a random, independent sample of pairs z_i = (x_i, y_i) is given, we have instead the empirical risk function from (4):

    I(τ) = (1/m) Σ_{i=1}^{m} L(y_i, a_i)    (5)

Thus the expected risk of the decision rule (or hypothesis) is just the probability that it predicts incorrectly. Our goal is then to make the risk function approach its minimum, by determining an optimum value of the threshold, τ (see Section 2.3).

2.2 Behaviour of the RBF classifier

A radial basis function (RBF) network with k hidden neurons has the form (Bishop, 1995):

    y_i(x) = w_i^T φ(x) + b_i    (6)

where the φ_j are basis functions: these can take several forms (see below). The weight coefficients w_i combine the basis functions into an output value, and b_i is a bias term². The most commonly used RBF is the Gaussian basis function:

    y_i(x) = Σ_{j=1}^{k} w_ij φ_j(x) + b_i = Σ_{j=1}^{k} w_ij exp( −‖x − μ_j‖² / (2σ_j²) ) + b_i    (7)

² The significance of this bias is that the desired output values of the classifier have non-zero mean.
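To make (6) and (7) concrete, the following sketch evaluates a single output neuron of a Gaussian RBF network. All centres, widths, weights and the bias here are illustrative values, not parameters from the paper.

```python
import numpy as np

def rbf_output(x, centres, sigmas, weights, bias):
    """Evaluate one output neuron of a Gaussian RBF network (Eq. 7):
    y(x) = sum_j w_j * exp(-||x - mu_j||^2 / (2 * sigma_j^2)) + b."""
    d2 = np.sum((centres - x) ** 2, axis=1)      # squared distances to the centres
    phi = np.exp(-d2 / (2.0 * sigmas ** 2))      # Gaussian basis activations
    return float(weights @ phi + bias)

centres = np.array([[0.0, 0.0], [1.0, 1.0]])     # mu_1, mu_2
sigmas = np.array([0.5, 0.5])
weights = np.array([0.8, -0.3])
bias = 0.1

print(rbf_output(np.array([0.0, 0.0]), centres, sigmas, weights, bias))
print(rbf_output(np.array([10.0, 10.0]), centres, sigmas, weights, bias))  # ≈ 0.1 (the bias)
```

For an input far from every centre, the Gaussian activations vanish and the output decays to the bias b: this is the localised-response property analysed in the remainder of this section.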

Thus in (7), φ_j is the j-th Gaussian basis function with centre μ_j and variance σ_j². Since the receptive fields of the radial basis functions in the hidden layer have a localised response behaviour in the input space, their response approaches zero at large radii, i.e., φ_j → 0 as ‖x − μ_j‖ → ∞. Considering the most common form of coding scheme for classification using neural networks (Tarassenko and Roberts, 1994), the output value y_i is 1 if the input pattern x belongs to class i and 0 otherwise; that is, y_i ∈ [0, 1] on the training data³. As an RBF classifier obtains its parameters through training, this gives the following statement about the interval of the bias in the output layer:

For classification using an RBF network, y_i = w_i^T φ(x) + b_i; if the output is set as y_i ∈ [0, 1] on the training data, then a well-trained RBF classifier satisfies b_i ∈ [0, 1].

The validity of the above statement can be proved as follows. Since y_i ∈ [0, 1], φ_j ∈ (0, 1], and the w_ij are finite numbers, if an input vector is far from the centres of the radial basis functions, we have φ_j ≈ 0 for j = 1, …, k, and then from (7):

    lim_{‖x − μ_j‖ → ∞} y_i = b_i

Assume b_i ∉ [0, 1], say b_i > 1; then for such an input, y_i = w_i^T φ(x) + b_i ≈ b_i > 1, which contradicts the condition y_i ∈ [0, 1]. Thus b_i ≤ 1. Similarly we can prove that b_i ≥ 0. Thus b_i ∈ [0, 1].

To understand the physical meaning of the outcomes and the bias in the output layer, consider a geometric interpretation. Suppose there is a two-class problem with inputs distributed in two dimensions, as in Figure 1(a). An RBF

³ The condition y_i ∈ [0, 1] applies for the training data only, using the coding scheme discussed above. Even when an optimally trained network is used for classification, the network output may still fall outside the interval [0, 1]. In this case, the classifier can still perform well, provided an appropriate threshold is applied to the network output.

classifier with two input and two output neurons is used for this classification problem. After training, the values of the output neurons with respect to the input variable x are as in Figure 1(b). The corresponding output neuron has a high value if the input is within the region of its class, while the other output neuron has a low value. The outcomes of all the output neurons will be asymptotic to their biases if the input is outside their class regions. It is notable that the bias value of a trained RBF classifier may lie outside the interval [0, 1] in some implementations. However, such classifiers will perform very poorly, as demonstrated in Sections 3 and 4. In these circumstances the RBF classifier will generally need to be re-trained by adjusting the training parameters (that is, the number and the spread constant of the RBFs).

2.3 Reliable threshold selection for RBF classifiers

As discussed in Section 2.1, for the problem of classification, our goal is to determine a threshold that tends to minimise the probability of erroneous classification in a given class. To derive the optimal threshold for a radial basis function classifier, the general model (Smolensky et al., 1996) is used. The goal of designing a network is to find the network which provides the most likely explanation of the observed data set. To do this it is necessary to try to maximise the probability:

    P(N|D) = P(D|N) P(N) / P(D)    (8)

where N represents the network (with all of the weights and biases specified), D represents the observed data set, and P(D|N) is the probability that the network N would have produced the observed data D. Applying the monotonic logarithm transformation (Bishop, 1995) to (8), we have:

    ln P(N|D) = ln P(D|N) + ln P(N) − ln P(D)    (9)

Thus, maximising (9) is equivalent to maximising (8). Since the probability distribution of the data is not dependent on the network, ln P(D) makes no contribution to the maximising solution of ln P(N|D) and is dropped from (9).
The second term of (9) is a representation of the probability of the network itself; that is, it is the a priori probability or a priori constraint on the network: since our method assumes that the classifier has been well trained - and our purpose is to determine an optimal threshold for the trained RBF classifier - we will also drop this term. The first term represents the probability of the data given the network: that is, it is a measure of how well the

classifier accounts for the data. Therefore the threshold selection problem is equivalent to a requirement to maximise ln P(D|N). Further, if the data are broken into two parts, the output y and the input x, then:

    ln P(D|N) = ln P((x, y)|N) = ln P(y|x, N) + ln P(x)    (10)

Finally, supposing that the input x does not depend on the network, the last term of (10) has no effect in maximising ln P(D|N). Therefore, we need only maximise the first term, ln P(y|x, N). For the classification problem, the output vectors, a, defined in (2), consist of a sequence of 0s and 1s. In this case, we imagine that each element of the classifier output, a_i, represents the probability that the corresponding element of the desired output, y_i, takes on the value 1. Then the probability of the data given the network, for a c-class problem, is represented by the binomial distribution (Fleming and Nellis, 1994):

    P(y|x, N) = Π_{i=1}^{c} a_i^{y_i} (1 − a_i)^{1 − y_i}    (11)

Thus we need to maximise the log of (11), given by:

    F = Σ_{i=1}^{c} ( y_i ln a_i + (1 − y_i) ln(1 − a_i) )    (12)

By differentiating (12) with respect to the decision action a_i, we obtain:

    ∂F/∂a_i = (y_i − a_i) / ( a_i (1 − a_i) )    (13)

To obtain the stationary points of F, we set (13) to zero. We then have:

    y_i − a_i = 0    (14)

Since ∂F/∂a_i is positive to the left of the above stationary point and negative to the right, the stationary point from (14) is the maximum point of F. For an RBF network:

    ŷ_i(x) = w_i^T φ(x) + b_i    (15)

As discussed in Section 2.2, b_i ∈ [0, 1], and the classifier output in (15) satisfies ŷ_i ≈ b_i for an input pattern located far from the training patterns. To satisfy (14), we wish a_i to coincide with the actual y_i when τ_i

(assuming, here, that each output node of the RBF network is allowed to have a different threshold) is applied to ŷ_i, as in (2). In other words, a_i should be 1 if the input pattern is within the training class, and should be 0 if it is outside the training class. Therefore, we must select:

    τ_i = b_i    (16)

The threshold with the value given by (16) decides whether samples are classified into that class. If there is noise or disturbance in the data, we expect the turning point to be less sensitive to the data set, and we therefore increase (16) by a small amount, ε:

    τ_i = b_i + ε    (17)

The introduction of ε makes the classifier robust to noise and disturbance while causing little increase in the misclassification rate. Finally, if we use a single-threshold RBF classifier, then we obtain:

    τ = max(b_i) + ε    (18)

Here, ε is a very small positive constant which makes τ slightly greater than max(b); of course, τ must not exceed 1.

3. Experimental Tests

In this section, the threshold determination technique derived in Section 2 is assessed in two experimental studies. The chosen data sets were obtained, firstly, from a mathematical model simulating static fault diagnosis and, secondly, from a non-linear model of a diesel engine cooling system. In both cases, the input variables were normalised to [0, 1], and the classifiers were trained using the orthogonal least squares algorithm (Chen et al., 1991) in the Matlab Neural Network Toolbox. Please note that, in addition to determining the threshold value, implementing an effective RBF classifier for a given task involves determining two further important parameters: a) the maximum number (me) of radial basis functions to use in the hidden layer; b) the spread constant (sc) of the radial basis functions. For each of the following two experiments, the classifier was trained using a range of possible values for me and sc. The trained classifier was then tested both on the (seen) training set and on (unseen) test sets. The classification error rate on each data set is estimated using (5).
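Given the biases of a trained network, the selection rules in (17) and (18) amount to a few lines of code. A minimal sketch, taking ε proportional to max(b) as in the experiments below (the 5% coefficient here is illustrative):

```python
def select_thresholds(biases, eps):
    """Per-output thresholds (Eq. 17): tau_i = b_i + eps."""
    return [b + eps for b in biases]

def select_single_threshold(biases, lam=0.05):
    """Single threshold (Eq. 18): tau = max(b_i) + eps, with
    eps = lam * max(b_i); tau is clipped so it never exceeds 1."""
    b_max = max(biases)
    return min(b_max * (1.0 + lam), 1.0)

print(select_single_threshold([0.02, 0.08, 0.05]))  # slightly above max(b) = 0.08
```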

The key purpose of each study was to explore the impact of the threshold values. To this end, a traditional threshold value (τ₀) of 0.5 was used. This was compared with a threshold value (τ₁) determined according to (18). More explicitly, ε was selected in proportion to max(b), i.e. ε = λ max(b) for a small positive coefficient λ, to provide a threshold slightly greater than max(b).

3.1 Mathematical Model Data Set

The following model represents a large class of static diagnosis problems and is adapted from that introduced in (Leonard and Kramer, 1990):

    Y = Y₀ + α p + v    (19)

Here Y represents the measurement vector of the plant, and Y₀ represents the plant's nominal steady state. The measurement Y is a function of the plant's physical parameters p, and suffers from measurement noise v; α is a distribution matrix of the parameter effects on the measurement vector. Plant faults are caused by deviations of the parameters. Here we assume that Y has two measurements, y₁ and y₂, that p has two parameters, p₁ and p₂, and that:

    α = [ 1   1 ]
        [ 1  −1 ]

This means that a p₁ fault causes y₁ and y₂ to deviate in the same direction, and a p₂ fault causes y₁ and y₂ to deviate in opposite directions. The classes were defined as:

    Normal (C₀): |p₁| < 0.05, |p₂| < 0.05
    Fault 1 (C₁): |p₁| > 0.05, |p₂| < 0.05
    Fault 2 (C₂): |p₂| > 0.05, |p₁| < 0.05
    v₁, v₂ ~ N(0, 0.05)

One set of training data was generated with the values of p₁ and p₂ sampled from the normal distribution N(0, 0.25). In total, 600 input/output pairs were generated from Equation (19) and were used for training all networks. Two additional sets of test data, each with 300 input/output pairs, were also generated, designated Test Set 1 and Test Set 2. These data sets were intended to explore how this approach performs, both in terms of interpolation (Test Set 1) and extrapolation (Test Set 2). Test Set 1 had the same distribution as the training set. Test Set 2 had values distributed over the whole parameter space; for this set, samples within the region of the training data were assigned to one of the known classes, and all other samples were assumed to belong to unknown faults.
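The data generation for this experiment can be sketched as follows. This is our own reconstruction with stated assumptions: Y₀ is taken as zero, N(·, ·) is read as mean and standard deviation, and samples where both parameters exceed the tolerance (a case the text does not specify) are simply labelled as normal here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Distribution matrix alpha: a p1 fault moves y1 and y2 in the same
# direction, a p2 fault moves them in opposite directions (Section 3.1).
ALPHA = np.array([[1.0, 1.0],
                  [1.0, -1.0]])

def simulate(n, noise_sd=0.05, param_sd=0.25):
    """Generate n samples from the static diagnosis model (Eq. 19):
    Y = Y0 + alpha p + v, assuming Y0 = 0."""
    p = rng.normal(0.0, param_sd, size=(n, 2))   # parameter deviations
    v = rng.normal(0.0, noise_sd, size=(n, 2))   # measurement noise
    y = p @ ALPHA.T + v                          # each row: alpha @ p + v
    return p, y

def label(p, tol=0.05):
    """Class from the parameter deviations: 0 = normal, 1 = fault 1
    (only |p1| large), 2 = fault 2 (only |p2| large); the unspecified
    both-large case is folded into class 0 here (an assumption)."""
    if abs(p[0]) > tol and abs(p[1]) <= tol:
        return 1
    if abs(p[1]) > tol and abs(p[0]) <= tol:
        return 2
    return 0

params, measurements = simulate(600)             # 600 training pairs, as in the text
labels = [label(p) for p in params]
```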
Using the training data set, RBF classifiers were trained by varying the number and the spread constant of

radial basis functions in the hidden layer. Table 1 lists the misclassification rates of the trained RBF classifiers. In the table, the training data, Test Set 1 and Test Set 2 are represented by D₀, D₁ and D₂, respectively. Row 2 gives the maximum number of hidden neurons (me) and the spread constant (sc). The elements in row b are the biases of the corresponding output neurons. The following rows are the misclassification rates for the data sets (D₀, D₁ and D₂) using τ₀ = 0.5 and τ₁ = max(b) + ε, respectively. As noted above, ε was set in proportion to max(b).

From the table, it is clear that when using either τ₀ or τ₁, the classifiers provide a very similar misclassification rate on the training set and Test Set 1. However, for samples outside the training region, the classifier using τ₁ produces a lower misclassification rate. These results may be readily understood: they arise because an appropriately trained classifier is intended to produce a high output value (close to 1) for samples in the class and a low output value (close to 0) for samples in other classes. Thus there is a large interval for threshold selection: theoretically, one could use any threshold between 0 and 1 for a perfectly-trained classifier which is required only to classify samples within the training range.

To explore further the effects of the value of the threshold on the misclassification rate, different thresholds were used for classifier 5 in Table 1. Figure 2 shows the misclassification rate versus the threshold. These empirical results confirm the findings in (18). Specifically, as is apparent in Figure 2, the classifier with a threshold slightly larger than max(b), that is τ₁, produces a minimum misclassification rate for Test Set 2 and a near-minimum misclassification rate for the training set and Test Set 1; τ₁ is also the turning point of the misclassification rate on Test Set 2. It is frequently a requirement in CMFD applications that the classifier provides not only high performance for known conditions but also good performance in the presence of unknown faults. For this purpose, these empirical results support the use of a classifier with threshold τ₁.
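The threshold sweep behind Figure 2 can be reproduced in outline. The sketch below uses one plausible per-sample error criterion (a known-class sample counts as correct only if exactly its own neuron fires; an unknown-fault sample, labelled −1 here, is correct only if no neuron fires); the paper does not spell out its exact scoring rule.

```python
import numpy as np

def sweep_error(outputs, labels, taus):
    """Misclassification rate versus threshold. `outputs` is an (n, c)
    array of network outputs; `labels` holds the true class index, or
    -1 for an unknown-fault sample."""
    rates = []
    for tau in taus:
        fired = outputs > tau                    # boolean (n, c) firing decisions
        errors = 0
        for f, lab in zip(fired, labels):
            if lab == -1:
                errors += int(f.any())           # any firing neuron is wrong
            else:
                ok = f[lab] and f.sum() == 1     # only the true neuron fires
                errors += int(not ok)
        rates.append(errors / len(labels))
    return rates

outputs = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.2]])  # last row: unknown fault
labels = [0, 1, -1]
print(sweep_error(outputs, labels, [0.25, 0.5, 0.85]))
```

A threshold that is too low fires on the unknown-fault sample; one that is too high misses a known class; the middle value classifies all three correctly.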
3.2 Diesel engine cooling system diagnosis data set

The second data set used in this study was generated from a non-linear model of the cooling system for a diesel

engine, developed elsewhere in our research group (Twiddle, 1999). The model can simulate various faults, including those considered in this study: a fan fault (that is, the radiator fan is permanently off), a thermostat fault (that is, the thermostat is stuck open) and a pump fault (that is, the coolant pump has 50 percent of the capacity of the normal condition). To detect these faults, this experiment used six measurements: the ambient temperature; the engine block temperature; the coolant temperature at the engine block inlet; the coolant temperature at the engine block outlet; the coolant temperature at the radiator outlet; and the engine speed.

Using the model, a training data set (D₀) with three states (normal, fan fault, and thermostat fault) was generated. Two different test data sets were also generated. One test set (Test Set 1, D₁) has the same three classes as the training data. The other test data set (Test Set 2, D₂) has the same three classes plus the pump fault; here, the pump fault was considered as an unknown fault. Each data set has 300 samples and, in each case, the data sets consisted of equal numbers of samples for each class.

Table 2 lists the misclassification rates of classifiers with different training parameters. To explore, again, the effects of the value of the threshold on the misclassification rate, different thresholds were used for classifiers 3 and 5 in Table 2. Figure 3 shows the misclassification rate versus the threshold, and Table 3 lists the misclassification rates for thresholds around max(b). Here, ε in Equation (18) was set to be proportional to max(b), i.e. ε = λ max(b), where λ is a small coefficient in {−0.1, −0.05, −0.01, 0.01, 0.05, 0.1}.

The results show that the classifiers with τ₀ produce a lower misclassification rate for the training data and Test Set 1 than those with τ₁, while the classifiers with τ₁ give a better misclassification rate for Test Set 2. As demonstrated in Experiment 1, the classifier with τ₁ gives a near-minimum misclassification rate for samples within the range of the training data, so τ₁ can still be used in such applications.
However, if we are concerned with the classifier's performance in applications with possible unknown classes, τ₁ is more suitable, at the cost of the risk of some misclassification of known classes.

4. Discussion and Conclusions

In this paper, a technique for determining a reliable threshold for RBF classifiers has been derived, and assessed in two empirical studies. The approach is easy to use and is seen to be particularly effective in classification problems where there may be possible new classes or unknown faults.

Overall, these results suggest a two-phase approach to RBF classifier use in situations where unknown faults may occur. In the first phase, we deal with the possibility of unknown faults. Since the threshold τ₁ (determined by the method described in this paper) results in a measurable improvement of classifier performance for cases with unknown faults, τ₁ is used in these circumstances. However, as the ability to detect unknown faults may be at the expense of a decrease in the classification performance for known classes, we also perform a second phase. In this second phase, the classifier is modified in order to translate the unknown faults into known faults. In other words, when we have collected sufficient data about the unknown faults, the classifier will be re-trained using all available data. A traditional threshold value of τ₀ may then be applied to the re-trained classifier, since unknown faults are less likely to occur.

It should be noted that the results in this paper have also demonstrated that the bias, b, at the output layer of a well-trained RBF classifier should satisfy the condition b ∈ [0, 1]. This result may be used to check whether a particular classifier has trained successfully, and the classifier should be discarded if max(b) > 1 or min(b) < 0. Thus, on this basis, it would be sensible to abandon classifiers numbered 8 and 9 in Table 1. Finally, it should also be noted that care must be exercised when accepting the training if max(b) is close to 1, because the classifier will leave little margin for response to previously unseen samples. An ideal RBF classifier will therefore have a low misclassification rate, and a small value for max(b).
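The acceptance check described above can be sketched in a few lines; the 0.9 cut-off used to flag "max(b) close to 1" is our illustrative choice, not a value from the paper.

```python
def check_training(biases):
    """Post-training acceptance check based on the b in [0, 1] result:
    reject the classifier if any output-layer bias falls outside [0, 1],
    and flag a warning when max(b) is close to 1 (little margin is left
    for responding to previously unseen samples)."""
    if max(biases) > 1.0 or min(biases) < 0.0:
        return "discard: re-train with different me/sc"
    if max(biases) > 0.9:
        return "accept with caution: max(b) close to 1"
    return "accept"

print(check_training([0.03, 0.12, 0.07]))  # accept
print(check_training([0.03, 1.25, 0.07]))  # discard: re-train with different me/sc
```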

References

Bishop, C.M., 1994. Novelty detection and neural-network validation. IEE Proceedings - Vision, Image and Signal Processing 141.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Broomhead, D.S., Lowe, D., 1988. Multivariable functional interpolation and adaptive networks. Complex Systems 2.

Chen, S., Cowan, F.N., Grant, P.M., 1991. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks 2.

Cheon, S.W., Chang, S.H., Chung, H.Y., Bien, Z.N., 1993. Application of neural networks to multiple alarm processing and diagnosis in nuclear power plants. IEEE Transactions on Nuclear Science 40, 11-20.

Cordella, L.P., De Stefano, C., Tortorella, F., Vento, M., 1995. A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks 6.

Fleming, M.C., Nellis, J.G., 1994. Principles of Applied Statistics. Routledge.

Joshi, A., Ramakrishnan, N., Houstis, E.N., Rice, J.R., 1997. On neurobiological, neuro-fuzzy, machine learning, and statistical pattern recognition techniques. IEEE Transactions on Neural Networks 8, 18-31.

Leonard, J.A., Kramer, M.A., 1990. Classifying process behaviour with neural networks: strategies for improved training and generalization. Proceedings of the American Control Conference, San Diego, CA, USA.

Scholkopf, B., Sung, K.K., Burges, C.J.C., Girosi, F., Niyogi, P., Poggio, T., Vapnik, V., 1997. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing 45.

Smolensky, P., Mozer, M.C., Rumelhart, D.E., 1996. Mathematical Perspectives on Neural Networks. Lawrence Erlbaum Associates, Mahwah, New Jersey.

Tarassenko, L., Roberts, S., 1994. Supervised and unsupervised learning in radial basis function classifiers. IEE Proceedings - Vision, Image and Signal Processing 141.

Twiddle, J.A., 1999. Fuzzy model based fault diagnosis of a diesel engine cooling system. Department of Engineering, University of Leicester, Report 99-1.

Watanabe, K., Hirota, S., Hou, L., Himmelblau, D.M., 1994. Diagnosis of multiple simultaneous faults via hierarchical artificial neural networks. AIChE Journal 40.

Figure 1: Outcomes of output neurons. (a) Inputs of a two-class problem distributed in two dimensions; (b) values of the two output neurons versus the input x, asymptotic to the biases b₁ and b₂.

Figure 2: Misclassification rate vs. threshold (classifier 5), for the training set, Test Set 1 and Test Set 2, with the threshold τ₁ marked.

Figure 3: Misclassification rate vs. threshold, (a) for classifier 3 and (b) for classifier 5, for the training set, Test Set 1 and Test Set 2, with the threshold τ₁ marked.

Table 1: Classification error rate (%) for the mathematical model data. Columns give classifiers 1-9 with their training parameters (me/sc); rows give the biases b of the corresponding output neurons, and the error rates on D₀, D₁ and D₂ using τ₀ and using τ₁.

Table 2: Classification error rate (%) for cooling system diagnosis. Columns give classifiers 1-6 with their training parameters (me/sc); rows give the biases b of the output neurons, and the error rates on D₀, D₁ and D₂ using τ₀ and using τ₁.

Table 3: Classification error rate (%) using thresholds around max(b), where τ₁ = max(b) + ε = max(b) + λ max(b). Rows correspond to the values of λ; columns give the error rates on D₀, D₁ and D₂ for classifiers 3 and 5.


CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests Smulated of the Cramér-von Mses Goodness-of-Ft Tests Steele, M., Chaselng, J. and 3 Hurst, C. School of Mathematcal and Physcal Scences, James Cook Unversty, Australan School of Envronmental Studes, Grffth

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Statistics II Final Exam 26/6/18

Statistics II Final Exam 26/6/18 Statstcs II Fnal Exam 26/6/18 Academc Year 2017/18 Solutons Exam duraton: 2 h 30 mn 1. (3 ponts) A town hall s conductng a study to determne the amount of leftover food produced by the restaurants n the

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Statistical Foundations of Pattern Recognition

Statistical Foundations of Pattern Recognition Statstcal Foundatons of Pattern Recognton Learnng Objectves Bayes Theorem Decson-mang Confdence factors Dscrmnants The connecton to neural nets Statstcal Foundatons of Pattern Recognton NDE measurement

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition) Count Data Models See Book Chapter 11 2 nd Edton (Chapter 10 1 st Edton) Count data consst of non-negatve nteger values Examples: number of drver route changes per week, the number of trp departure changes

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science. December 2005 Examinations STA437H1F/STA1005HF. Duration - 3 hours

UNIVERSITY OF TORONTO Faculty of Arts and Science. December 2005 Examinations STA437H1F/STA1005HF. Duration - 3 hours UNIVERSITY OF TORONTO Faculty of Arts and Scence December 005 Examnatons STA47HF/STA005HF Duraton - hours AIDS ALLOWED: (to be suppled by the student) Non-programmable calculator One handwrtten 8.5'' x

More information

Bayesian predictive Configural Frequency Analysis

Bayesian predictive Configural Frequency Analysis Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION CAPTER- INFORMATION MEASURE OF FUZZY MATRI AN FUZZY BINARY RELATION Introducton The basc concept of the fuzz matr theor s ver smple and can be appled to socal and natural stuatons A branch of fuzz matr

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

Department of Computer Science Artificial Intelligence Research Laboratory. Iowa State University MACHINE LEARNING

Department of Computer Science Artificial Intelligence Research Laboratory. Iowa State University MACHINE LEARNING MACHINE LEANING Vasant Honavar Bonformatcs and Computatonal Bology rogram Center for Computatonal Intellgence, Learnng, & Dscovery Iowa State Unversty honavar@cs.astate.edu www.cs.astate.edu/~honavar/

More information

Statistics for Economics & Business

Statistics for Economics & Business Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable

More information

CHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD

CHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD CHALMERS, GÖTEBORGS UNIVERSITET SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS COURSE CODES: FFR 35, FIM 72 GU, PhD Tme: Place: Teachers: Allowed materal: Not allowed: January 2, 28, at 8 3 2 3 SB

More information

Comparison of Regression Lines

Comparison of Regression Lines STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence

More information

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for P Charts. Dr. Wayne A. Taylor

Copyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for P Charts. Dr. Wayne A. Taylor Taylor Enterprses, Inc. Control Lmts for P Charts Copyrght 2017 by Taylor Enterprses, Inc., All Rghts Reserved. Control Lmts for P Charts Dr. Wayne A. Taylor Abstract: P charts are used for count data

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

x = , so that calculated

x = , so that calculated Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to

More information

LECTURE 9 CANONICAL CORRELATION ANALYSIS

LECTURE 9 CANONICAL CORRELATION ANALYSIS LECURE 9 CANONICAL CORRELAION ANALYSIS Introducton he concept of canoncal correlaton arses when we want to quantfy the assocatons between two sets of varables. For example, suppose that the frst set of

More information

Temperature. Chapter Heat Engine

Temperature. Chapter Heat Engine Chapter 3 Temperature In prevous chapters of these notes we ntroduced the Prncple of Maxmum ntropy as a technque for estmatng probablty dstrbutons consstent wth constrants. In Chapter 9 we dscussed the

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

Basically, if you have a dummy dependent variable you will be estimating a probability.

Basically, if you have a dummy dependent variable you will be estimating a probability. ECON 497: Lecture Notes 13 Page 1 of 1 Metropoltan State Unversty ECON 497: Research and Forecastng Lecture Notes 13 Dummy Dependent Varable Technques Studenmund Chapter 13 Bascally, f you have a dummy

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

= z 20 z n. (k 20) + 4 z k = 4

= z 20 z n. (k 20) + 4 z k = 4 Problem Set #7 solutons 7.2.. (a Fnd the coeffcent of z k n (z + z 5 + z 6 + z 7 + 5, k 20. We use the known seres expanson ( n+l ( z l l z n below: (z + z 5 + z 6 + z 7 + 5 (z 5 ( + z + z 2 + z + 5 5

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

LOGIT ANALYSIS. A.K. VASISHT Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi

LOGIT ANALYSIS. A.K. VASISHT Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi LOGIT ANALYSIS A.K. VASISHT Indan Agrcultural Statstcs Research Insttute, Lbrary Avenue, New Delh-0 02 amtvassht@asr.res.n. Introducton In dummy regresson varable models, t s assumed mplctly that the dependent

More information

Lecture 6: Introduction to Linear Regression

Lecture 6: Introduction to Linear Regression Lecture 6: Introducton to Lnear Regresson An Manchakul amancha@jhsph.edu 24 Aprl 27 Lnear regresson: man dea Lnear regresson can be used to study an outcome as a lnear functon of a predctor Example: 6

More information

Negative Binomial Regression

Negative Binomial Regression STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

Department of Statistics University of Toronto STA305H1S / 1004 HS Design and Analysis of Experiments Term Test - Winter Solution

Department of Statistics University of Toronto STA305H1S / 1004 HS Design and Analysis of Experiments Term Test - Winter Solution Department of Statstcs Unversty of Toronto STA35HS / HS Desgn and Analyss of Experments Term Test - Wnter - Soluton February, Last Name: Frst Name: Student Number: Instructons: Tme: hours. Ads: a non-programmable

More information

CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING INTRODUCTION

CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING INTRODUCTION CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING N. Phanthuna 1,2, F. Cheevasuvt 2 and S. Chtwong 2 1 Department of Electrcal Engneerng, Faculty of Engneerng Rajamangala

More information

Bounds on the Generalization Performance of Kernel Machines Ensembles

Bounds on the Generalization Performance of Kernel Machines Ensembles Bounds on the Generalzaton Performance of Kernel Machnes Ensembles Theodoros Evgenou theos@a.mt.edu Lus Perez-Breva lpbreva@a.mt.edu Massmlano Pontl pontl@a.mt.edu Tomaso Poggo tp@a.mt.edu Center for Bologcal

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Uncertainty as the Overlap of Alternate Conditional Distributions

Uncertainty as the Overlap of Alternate Conditional Distributions Uncertanty as the Overlap of Alternate Condtonal Dstrbutons Olena Babak and Clayton V. Deutsch Centre for Computatonal Geostatstcs Department of Cvl & Envronmental Engneerng Unversty of Alberta An mportant

More information

CHAPTER 14 GENERAL PERTURBATION THEORY

CHAPTER 14 GENERAL PERTURBATION THEORY CHAPTER 4 GENERAL PERTURBATION THEORY 4 Introducton A partcle n orbt around a pont mass or a sphercally symmetrc mass dstrbuton s movng n a gravtatonal potental of the form GM / r In ths potental t moves

More information

Uncertainty in measurements of power and energy on power networks

Uncertainty in measurements of power and energy on power networks Uncertanty n measurements of power and energy on power networks E. Manov, N. Kolev Department of Measurement and Instrumentaton, Techncal Unversty Sofa, bul. Klment Ohrdsk No8, bl., 000 Sofa, Bulgara Tel./fax:

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Feb 14: Spatial analysis of data fields

Feb 14: Spatial analysis of data fields Feb 4: Spatal analyss of data felds Mappng rregularly sampled data onto a regular grd Many analyss technques for geophyscal data requre the data be located at regular ntervals n space and/or tme. hs s

More information

Speeding up Computation of Scalar Multiplication in Elliptic Curve Cryptosystem

Speeding up Computation of Scalar Multiplication in Elliptic Curve Cryptosystem H.K. Pathak et. al. / (IJCSE) Internatonal Journal on Computer Scence and Engneerng Speedng up Computaton of Scalar Multplcaton n Ellptc Curve Cryptosystem H. K. Pathak Manju Sangh S.o.S n Computer scence

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014 COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #16 Scrbe: Yannan Wang Aprl 3, 014 1 Introducton The goal of our onlne learnng scenaro from last class s C comparng wth best expert and

More information

DUE: WEDS FEB 21ST 2018

DUE: WEDS FEB 21ST 2018 HOMEWORK # 1: FINITE DIFFERENCES IN ONE DIMENSION DUE: WEDS FEB 21ST 2018 1. Theory Beam bendng s a classcal engneerng analyss. The tradtonal soluton technque makes smplfyng assumptons such as a constant

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

STAT 511 FINAL EXAM NAME Spring 2001

STAT 511 FINAL EXAM NAME Spring 2001 STAT 5 FINAL EXAM NAME Sprng Instructons: Ths s a closed book exam. No notes or books are allowed. ou may use a calculator but you are not allowed to store notes or formulas n the calculator. Please wrte

More information

Chapter 6 Support vector machine. Séparateurs à vaste marge

Chapter 6 Support vector machine. Séparateurs à vaste marge Chapter 6 Support vector machne Séparateurs à vaste marge Méthode de classfcaton bnare par apprentssage Introdute par Vladmr Vapnk en 1995 Repose sur l exstence d un classfcateur lnéare Apprentssage supervsé

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Atmospheric Environmental Quality Assessment RBF Model Based on the MATLAB

Atmospheric Environmental Quality Assessment RBF Model Based on the MATLAB Journal of Envronmental Protecton, 01, 3, 689-693 http://dxdoorg/10436/jep0137081 Publshed Onlne July 01 (http://wwwscrporg/journal/jep) 689 Atmospherc Envronmental Qualty Assessment RBF Model Based on

More information