Statistical Foundations of Pattern Recognition


Statistical Foundations of Pattern Recognition. Learning objectives: Bayes' theorem, decision-making, confidence factors, discriminants, and the connection to neural nets.

Statistical Foundations of Pattern Recognition. An NDE measurement system performs feature (pattern) extraction, producing observed patterns x. Problem: How do we decide which class x belongs to, out of the possible classes c_1, c_2, c_3, ...?

Bayesian (probabilistic) approach. Let
P(c_i) = a priori probability that a pattern belongs to class c_i, regardless of the identity of the pattern
P(x) = a priori probability that a pattern is x, regardless of its class membership
P(x | c_i) = conditional probability that the pattern is x, given that it belongs to class c_i
P(c_i | x) = conditional probability that the pattern's class membership is c_i, given that the pattern is x
P(c_i, x) = the joint probability that the pattern is x and the class membership is c_i

Example: consider the case where there is one pattern value x that is observed or not, and two classes (e.g. signal or noise). Suppose ten patterns are observed (~x means x not observed; * marks the cases where x and c_1 occur together, # the cases where x and c_2 occur together):

x, c_1 *   ~x, c_1   x, c_1 *   ~x, c_2   x, c_2 #   x, c_1 *   x, c_2 #   ~x, c_1   ~x, c_1   x, c_1 *

Then
P(x) = 6/10, P(c_1) = 7/10, P(c_2) = 3/10
P(x | c_1) = 4/7, P(x | c_2) = 2/3
P(c_1 | x) = 4/6, P(c_2 | x) = 2/6
P(x, c_1) = 4/10 (see the *'s), P(x, c_2) = 2/10 (see the #'s)
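These relative-frequency estimates are easy to check by direct counting. Below is a minimal Python sketch (the variable names are mine, not the lecture's) that tabulates the ten observations and recovers the same probabilities:

    from fractions import Fraction

    # The ten observations from the slide: (feature, class), "~x" meaning x not observed.
    obs = [("x", "c1"), ("~x", "c1"), ("x", "c1"), ("~x", "c2"), ("x", "c2"),
           ("x", "c1"), ("x", "c2"), ("~x", "c1"), ("~x", "c1"), ("x", "c1")]
    n = len(obs)

    P_x    = Fraction(sum(1 for o, c in obs if o == "x"), n)                 # P(x)     = 6/10
    P_c1   = Fraction(sum(1 for o, c in obs if c == "c1"), n)                # P(c1)    = 7/10
    P_x_c1 = Fraction(sum(1 for o, c in obs if o == "x" and c == "c1"), n)   # P(x, c1) = 4/10

    print(P_x_c1 / P_c1)   # P(x | c1) = 4/7
    print(P_x_c1 / P_x)    # P(c1 | x) = 4/6 (= 2/3)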

Bayes' Theorem: P(c_i, x) = P(x | c_i) P(c_i) = P(c_i | x) P(x), or, equivalently, in an "updating form",

P(c_i | x) = P(x | c_i) P(c_i) / P(x)

where P(c_i | x) is the "new" probability of c_i, having seen x, and P(c_i) is the "old" probability of c_i.

Bayes' Theorem: P(c_i | x) = P(x | c_i) P(c_i) / P(x). Since

P(x) = Σ_j P(x | c_j) P(c_j)

we can calculate P(c_i | x) if we know the probabilities P(c_j), j = 1, 2, ..., and P(x | c_j), j = 1, 2, ...

Now, consider our previous example, where P(x) = 6/10, P(c_1) = 7/10, P(c_2) = 3/10, P(x | c_1) = 4/7, P(x | c_2) = 2/3, P(c_1 | x) = 4/6, P(c_2 | x) = 2/6, P(x, c_1) = 4/10, P(x, c_2) = 2/10. Then

P(c_1 | x) = P(x, c_1) / P(x) = P(x | c_1) P(c_1) / P(x) = (4/7)(7/10) / (6/10) = (4/10) / (6/10) = 4/6

or, in the "updating" form,

P(c_1 | x) = P(x | c_1) P(c_1) / [P(x | c_1) P(c_1) + P(x | c_2) P(c_2)] = (4/7)(7/10) / [(4/7)(7/10) + (2/3)(3/10)] = (4/10) / (6/10) = 4/6
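The updating form translates directly into a few lines of Python; this is only an illustrative sketch (the function name posterior is mine):

    def posterior(prior, likelihoods):
        """Bayes' theorem in 'updating' form: P(c_i | x) for each class c_i.

        prior       : dict of prior probabilities P(c_i)
        likelihoods : dict of conditional probabilities P(x | c_i)
        """
        evidence = sum(likelihoods[c] * prior[c] for c in prior)   # P(x) = sum_j P(x | c_j) P(c_j)
        return {c: likelihoods[c] * prior[c] / evidence for c in prior}

    # Numbers from the slide example:
    print(posterior({"c1": 7/10, "c2": 3/10}, {"c1": 4/7, "c2": 2/3}))
    # -> P(c1 | x) = 0.666... = 4/6, P(c2 | x) = 0.333... = 2/6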

As a simple example, consider trying to classify a flaw as a crack or a volumetric flaw based on these two features:
x_1: a positive leading edge pulse, PP
x_2: flash points, FP

Assume:
P(crack) = 0.5, P(volumetric) = 0.5
P(PP | crack) = 0.1 (cracks have a leading edge signal that is always negative, so unless the leading edge signal is mistakenly identified, this case is unlikely)
P(PP | volumetric) = 0.5 (low impedance (relative to host) volumetric flaws have negative leading edge pulses and high impedance volumetric flaws have positive leading edge pulses, so assume both types of volumetric flaws are equally likely)
P(FP | crack) = 0.8 (flashpoints are a feature strongly characteristic of cracks, so make this probability high)
P(FP | volumetric) = 0.05 (alternatively, make this a very low probability)

(1) Now, suppose a piece of data comes in and there is firm evidence that flashpoints (FP) exist in the measured response. Then what is the probability that the flaw is a crack?

P(crack | FP) = P(FP | crack) P(crack) / [P(FP | crack) P(crack) + P(FP | vol) P(vol)] = (0.8)(0.5) / [(0.8)(0.5) + (0.05)(0.5)] = 0.94118

Thus, we also have P(vol | FP) = 0.05882.

(2) Now, suppose another piece of data comes in with firm evidence of a positive leading edge pulse (PP). What is the new probability that the flaw is a crack?

P(crack | PP) = P(PP | crack) P(crack) / [P(PP | crack) P(crack) + P(PP | vol) P(vol)] = (0.1)(0.94118) / [(0.1)(0.94118) + (0.5)(0.05882)] = 0.76191

and, hence, P(vol | PP) = 0.23809. Note how the previous P(crack | FP) was taken as the new, a priori P(crack) in this Bayesian updating.

(3) Finally, suppose another data set comes in with firm evidence that the flashpoints (FP) do not exist. What is the probability now that the flaw is a crack?

P(crack | ~FP) = P(~FP | crack) P(crack) / [P(~FP | crack) P(crack) + P(~FP | vol) P(vol)] = (0.2)(0.762) / [(0.2)(0.762) + (0.95)(0.238)] = 0.403

and now P(vol | ~FP) = 0.597. Note: P(FP | crack) = 0.8, so P(~FP | crack) = 0.2; P(FP | vol) = 0.05, so P(~FP | vol) = 0.95.
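The three updates chain together naturally, each posterior becoming the next prior. A short sketch (the helper name update is mine) that reproduces the numbers above:

    def update(prior_crack, lik_crack, lik_vol):
        """One Bayesian update for the two-class crack / volumetric problem.

        prior_crack : current P(crack); P(vol) = 1 - prior_crack
        lik_crack   : P(evidence | crack)
        lik_vol     : P(evidence | vol)
        """
        num = lik_crack * prior_crack
        return num / (num + lik_vol * (1.0 - prior_crack))

    p = 0.5                      # start from P(crack) = 0.5
    p = update(p, 0.8, 0.05)     # (1) FP observed            -> 0.94118
    p = update(p, 0.1, 0.5)      # (2) PP observed            -> 0.76191
    p = update(p, 0.2, 0.95)     # (3) FP firmly absent (~FP) -> 0.403
    print(p)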

In all three cases we must make some decision on whether the flaw is a crack or not. One possible choice is to simply look at the probabilities and decide x belongs to class c_i ≠ c_j if and only if P(c_i | x) > P(c_j | x) for all j = 1, 2, ..., N. Since in our present example we only have two classes, we only have the two conditional probabilities P(crack | x) and P(vol | x), and since P(vol | x) = 1 - P(crack | x), our decision rule is just P(crack | x) > 1 - P(crack | x), or P(crack | x) > 0.5.

Using this simple probability decision rule, in the previous three cases we would find:
(1) P(crack | FP) = 0.941, decision: crack
(2) P(crack | PP) = 0.761, decision: crack
(3) P(crack | ~FP) = 0.401, decision: volumetric

There is no reason, however, that we need to make a decision on the conditional probabilities P(c_i | x) by themselves. We could synthesize decision functions g_i(x) from such conditional probabilities and use the g_i instead. This is the idea behind what is called Bayes decision rule:

Bayes Decision Rule: decide x belongs to class c_i ≠ c_j if and only if g_i(x) > g_j(x) for all j = 1, 2, ..., N, where g_i(x) is the decision function for class c_i.

Example: Suppose that not all decision errors are equally important. We could weight these decisions by defining the loss, l_ij, that is sustained when we decide class membership is c_i when it is in reality class c_j. Then in terms of these losses we could also define the risk that x belongs to class c_i as

R_i(x) = Σ_j l_ij P(c_j | x)

For our two class problem we would have

R_1(x) = l_11 P(c_1 | x) + l_12 P(c_2 | x)
R_2(x) = l_21 P(c_1 | x) + l_22 P(c_2 | x)

The decision rule in this case would be to decide x belongs to class c_1 if and only if R_1(x) < R_2(x) or, equivalently,

(l_11 - l_21) P(c_1 | x) < (l_22 - l_12) P(c_2 | x)

In the special case where there is no loss when we guess correctly, l_11 = l_22 = 0. If, also, it is equally costly to guess either c_1 or c_2, then l_12 = l_21 and the decision rule becomes

-l_21 P(c_1 | x) < -l_21 P(c_2 | x), or P(c_1 | x) > P(c_2 | x)

which is the simple decision rule based on conditional probabilities we discussed previously.

Now, consider our previous example again, let c_1 = crack, c_2 = volumetric flaw, and suppose we choose the following loss factors:
l_11 = -1 (a gain: if we guess cracks, which are dangerous, correctly, we should reward this decision)
l_12 = 1 (if we guess that the flaw is a crack and it is really volumetric, then there is a cost (loss), since we may do unnecessary repairs or removal from service)
l_21 = 10 (if we guess the flaw is volumetric and it is really a crack, there may be a significant loss because of a loss of safety due to misclassification)
l_22 = 0 (if we guess it is volumetric and it is, there might be no loss or gain)

In this case we find the decision rule is: decide that a crack is present if

(-11.0) P(c_1 | x) < (-1.0) P(c_2 | x)

or

P(c_1 | x) / P(c_2 | x) > 0.091

For our example then we have:
(1) P(c_1 | x_2) / P(c_2 | x_2) = 0.941 / 0.059 = 15.9 > 0.091, so decide c_1 (crack)
(2) P(c_1 | x_1) / P(c_2 | x_1) = 0.761 / 0.239 = 3.18 > 0.091, so decide c_1 (crack)
(3) P(c_1 | ~x_2) / P(c_2 | ~x_2) = 0.401 / 0.599 = 0.669 > 0.091, so decide c_1 (crack)
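A small sketch of the risk-weighted rule (the function name and the loss arguments are mine); deciding c_1 whenever R_1(x) < R_2(x) reproduces the three decisions above:

    def risk_decision(p_c1, l11, l12, l21, l22):
        """Two-class Bayes risk rule: decide c1 if R1(x) < R2(x), else c2.

        p_c1 : posterior P(c1 | x); P(c2 | x) = 1 - p_c1
        l_ij : loss sustained when deciding c_i and the true class is c_j
        """
        p_c2 = 1.0 - p_c1
        r1 = l11 * p_c1 + l12 * p_c2
        r2 = l21 * p_c1 + l22 * p_c2
        return "c1 (crack)" if r1 < r2 else "c2 (volumetric)"

    for p in (0.941, 0.761, 0.401):   # the three posteriors P(crack | evidence)
        print(p, risk_decision(p, l11=-1.0, l12=1.0, l21=10.0, l22=0.0))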

Bayes' Theorem (Odds). We can also write Bayes' Theorem in terms of odds rather than probabilities by noting that for any probability P (conditional, joint, etc.) we have the corresponding odds, O, given by

O = P / (1 - P)   (or P = O / (1 + O))

Example: O(c | x) = P(c | x) / (1 - P(c | x)). Using this definition of odds, Bayes' Theorem becomes

O(c | x) = LR · O(c)

where LR = P(x | c) / P(x | ~c) is called the likelihood ratio.

Going back to our example with x_1 = PP, x_2 = FP:

P(c_1) = 0.5, so O(c_1) = 0.5 / (1 - 0.5) = 1
P(c_2) = 0.5, so O(c_2) = 0.5 / (1 - 0.5) = 1
P(x_1 | c_1) = 0.1, so O(x_1 | c_1) = 0.1 / (1 - 0.1) = 0.11111
P(x_1 | c_2) = 0.5, so O(x_1 | c_2) = 0.5 / (1 - 0.5) = 1
P(x_2 | c_1) = 0.8, so O(x_2 | c_1) = 0.8 / (1 - 0.8) = 4
P(x_2 | c_2) = 0.05, so O(x_2 | c_2) = 0.05 / (1 - 0.05) = 0.0526

Then for our three cases:

(1) O(crack | FP) = [P(FP | crack) / P(FP | ~crack)] O(crack) = (0.8 / 0.05)(1) = 16, and P(crack | FP) = 16 / (1 + 16) = 0.941

(2) O(crack | PP) = [P(PP | crack) / P(PP | ~crack)] O(crack) = (0.1 / 0.5)(16) = 3.2, and P(crack | PP) = 3.2 / (1 + 3.2) = 0.762

(3) O(crack | ~FP) = [P(~FP | crack) / P(~FP | ~crack)] O(crack) = (0.2 / 0.95)(3.2) = 0.674, and P(crack | ~FP) = 0.674 / (1 + 0.674) = 0.403

As we see from this result, we can update the probabilities according to Bayes' Theorem by

O(c | x) = [P(x | c) / P(x | ~c)] O(c)

if the feature pattern x is observed, and by

O(c | ~x) = [P(~x | c) / P(~x | ~c)] O(c)

if the feature pattern x is not observed. We can combine these two cases as

O(c | x̂) = LR(x̂, c) O(c)

where LR(x̂, c) = P(x̂ | c) / P(x̂ | ~c), and x̂ = x if x is observed, x̂ = ~x if x is not observed.
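In code, the odds form is just a multiplication by the likelihood ratio. A sketch (helper names are mine) that replays the three crack updates in odds space:

    def odds(p):
        """O = P / (1 - P)."""
        return p / (1.0 - p)

    def prob(o):
        """P = O / (1 + O)."""
        return o / (1.0 + o)

    def odds_update(o_prior, p_xhat_given_c, p_xhat_given_not_c):
        """Bayes' theorem in odds form: O(c | x_hat) = LR(x_hat, c) * O(c)."""
        return (p_xhat_given_c / p_xhat_given_not_c) * o_prior

    o = odds(0.5)                   # O(crack) = 1
    o = odds_update(o, 0.8, 0.05)   # FP observed:     LR = 16    -> O = 16
    o = odds_update(o, 0.1, 0.5)    # PP observed:     LR = 0.2   -> O = 3.2
    o = odds_update(o, 0.2, 0.95)   # FP not observed: LR = 0.211 -> O = 0.674
    print(prob(o))                  # -> 0.403, as on the slide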

Criticisms of this Probabilistic Approach:
1. It does not include uncertainty in the evidence of the existence (or not) of the feature patterns.
2. It is difficult to assign the a priori probabilities.

To solve the first problem, we will show how to introduce uncertainty with confidence factors. To solve the second problem, we will discuss the alternative use of discriminants.

Confidence Factors. Consider Bayes' Theorem in the odds form

O(c | x̂) = LR(x̂, c) O(c)

In updating the odds, the likelihood ratio is based on being able to have firm evidence of the existence of the pattern x or not:

LR(x̂, c) = P(x̂ | c) / P(x̂ | ~c)

We can introduce uncertainty into this updating by letting a user (or a program) give a response R in the range [-1, 1], where
R = 1 corresponds to complete certainty that x is present
R = 0 corresponds to complete uncertainty that x is or is not present
R = -1 corresponds to complete certainty that x is not present

Then in updating the odds, we can replace the likelihood ratio, LR, by a function of LR and R that incorporates this uncertainty:

O(c | x̂) = f(LR, R) O(c)

There are, however, some properties that this function f should satisfy. They are:
1. If R = 1, f = LR(x, c) (if we are certain in the evidence of x, we should reduce to ordinary Bayes)
2. If R = -1, f = LR(~x, c) (if we are certain x does not exist, again reduce to ordinary Bayes)
3. If LR = 0, f = 0 (if the likelihood is zero, regardless of the uncertainty, R, the updated odds should be zero)

A popular choice that appears in the literature is to choose

f(LR, R) = |R| LR(x̂, c) + (1 - |R|)

where LR(x̂, c) = LR(x, c) if R ∈ [0, 1] and LR(x̂, c) = LR(~x, c) if R ∈ [-1, 0]. If we plot this function versus R, we see the effects of R: f varies linearly from LR(~x, c) at R = -1, through 1.0 at R = 0, to LR(x, c) at R = 1.

Although this is a simple function to use, there is a problem with it, which we can see if we plot f versus LR for different R (note that LR ∈ [0, ∞)): at LR = 0 the function f does not go to zero as we said it should (see property 3 discussed above), since f(0, R) = 1 - |R| there. To remedy that problem, we need to choose a nonlinear function.

One choice that satisfies all three properties f should have is

f(LR, R) = LR^R

Plotting f versus LR for different values of R, the curves now pass through f = 0 at LR = 0, as property 3 requires.

This gives a dependency on R that is nonlinear: plotted versus R, f(LR, R) rises from LR(~x, c) at R = -1, through 1.0 at R = 0, to LR(x, c) at R = 1.

With this choice of f, we would have

O(c | x̂) = LR^R O(c)

However, if one wants to work in terms of probabilities, not odds, we have

P(c | x̂) = LR^R P(c) / [LR^R P(c) + (1 - P(c))]

with LR = P(x̂ | c) / P(x̂ | ~c), and x̂ = x if R > 0, x̂ = ~x if R < 0.
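A sketch of the modified update O(c | x̂) = LR^R O(c). One detail is not spelled out on the slide: I assume here that |R| is used as the exponent on the R < 0 branch (together with LR(~x, c)), so that R = -1 reduces to the ordinary Bayes update for ~x, as property 2 requires; the function name is mine.

    def confidence_update(o_prior, lr_x, lr_not_x, R):
        """Odds update with an uncertain response R in [-1, 1], using f(LR, R) = LR**R.

        lr_x     : LR(x, c)  = P(x | c)  / P(x | ~c)   (used when R > 0)
        lr_not_x : LR(~x, c) = P(~x | c) / P(~x | ~c)  (used when R < 0)
        """
        if R > 0:
            f = lr_x ** R
        elif R < 0:
            f = lr_not_x ** (-R)   # assumed |R| exponent, so R = -1 gives ordinary Bayes with ~x
        else:
            f = 1.0                # complete uncertainty: odds left unchanged
        return f * o_prior

    # Crack / flashpoint example: LR(FP, crack) = 0.8/0.05 = 16, LR(~FP, crack) = 0.2/0.95
    print(confidence_update(1.0, 16.0, 0.2 / 0.95, 1.0))   # R = 1   -> 16 (ordinary Bayes)
    print(confidence_update(1.0, 16.0, 0.2 / 0.95, 0.5))   # R = 0.5 -> 4  (partial confidence)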

Bayes' theorem, even in this modified form that takes into account uncertainty in the evidence, still requires us to have a priori probability estimates, and those may be difficult to come by. How do we get around this? Consider our two class problem where we have classes (c_1, c_2) and where the pattern consists of a single feature x. According to Bayes decision rule we could decide on c_1 (or c_2) if g_1(x) > g_2(x) (or g_2(x) > g_1(x)). For example, suppose both g_1 and g_2 were unimodal, smooth distributions, with g_1(x) > g_2(x) to the left of some crossing point x_threshold and g_2(x) > g_1(x) to its right. Then we see the decision rule is really just:
class c_1 if x < x_threshold
class c_2 if x > x_threshold

Thus, if we had a way of finding x_threshold, which serves as a discriminant, we could make our decisions and not have to even consider the underlying probabilities! However, we have not really eliminated the probabilities entirely, since they ultimately determine the errors made in the decision making process. Note that in the more general multi-modal decision function case, several discriminants (e.g. x_1, x_2, x_3) may be needed to separate the regions where g_1(x) > g_2(x) from those where g_2(x) > g_1(x).

If we take the g_i(x) to be just the probability distributions P(c_i | x), then recall that Bayes decision rule says that x belongs to class c_i ≠ c_j if and only if P(c_i | x) > P(c_j | x) for all j = 1, 2, ..., N, or, equivalently, P(c_i | x) P(x) > P(c_j | x) P(x), which says that P(c_i, x) > P(c_j, x), so that also P(x | c_i) P(c_i) > P(x | c_j) P(c_j). If x is a continuous variable, then we can associate probability distributions with quantities such as p(c_i, x) and p(x | c_i), and so we expect that the discriminants are dependent on the nature of these distributions. We will now examine that relationship more closely.

Probability Distributions and Discriminants. First, consider the 1-D case where the pattern is a single feature x and where we assume the distributions are Gaussians, i.e.

p(x, c_i) = P(c_i) (2π σ_i²)^(-1/2) exp[-(x - μ_i)² / 2σ_i²]

where μ_i = mean value of x for class c_i and σ_i = standard deviation for class c_i. If we assume σ_i = σ_j = σ and P(c_i) = P(c_j), then Bayes decision rule says that x belongs to class c_i if and only if

exp[-(x - μ_i)² / 2σ²] > exp[-(x - μ_j)² / 2σ²]

or, equivalently, x belongs to class c_i if and only if

(x - μ_i)² < (x - μ_j)²   for all j = 1, 2, ..., N

This is just the basis for the nearest cluster center classification method.
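A minimal sketch of the nearest cluster center rule (the function name and the example means are hypothetical, not from the lecture):

    import numpy as np

    def nearest_mean_classify(x, class_means):
        """Assign x to the class whose mean is closest (equal-variance, equal-prior Gaussian case)."""
        return min(class_means,
                   key=lambda c: np.sum((np.asarray(x) - np.asarray(class_means[c])) ** 2))

    # Hypothetical 1-D example with two class means:
    print(nearest_mean_classify(0.4, {"c1": 0.0, "c2": 1.0}))   # -> "c1"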

Now, consider the more general case of N-dimensional features but keep the assumption of Gaussian distributions. Then

p(x, c_i) = P(c_i) (2π)^(-N/2) |Σ_i|^(-1/2) exp[-(1/2)(x - μ_i)^T Σ_i^(-1) (x - μ_i)]

where μ_i = N-component mean vector for class c_i and Σ_i = N x N covariance matrix.

Bayes decision theory says that x belongs to class c_i if and only if

P(c_i) (2π)^(-N/2) |Σ_i|^(-1/2) exp[-(1/2)(x - μ_i)^T Σ_i^(-1) (x - μ_i)] > P(c_j) (2π)^(-N/2) |Σ_j|^(-1/2) exp[-(1/2)(x - μ_j)^T Σ_j^(-1) (x - μ_j)]

Now, suppose we are on the boundary between c_i and c_j, and also suppose that Σ_i = Σ_j = σ² I, where I is the unit matrix. Then

P(c_i) exp[-(x - μ_i)^T (x - μ_i) / 2σ²] = P(c_j) exp[-(x - μ_j)^T (x - μ_j) / 2σ²]

Taking the ln of this equation, we then have

ln[P(c_i)/P(c_j)] - (x - μ_i)^T (x - μ_i)/2σ² + (x - μ_j)^T (x - μ_j)/2σ² = 0

which can be expanded out to give

ln[P(c_i)/P(c_j)] + x^T (μ_i - μ_j)/σ² + (μ_j^T μ_j - μ_i^T μ_i)/2σ² = 0

However, these are just the equations of the hyperplanes

x^T w_ij = b_ij

with

w_ij = (μ_i - μ_j)/σ²
b_ij = (μ_i^T μ_i - μ_j^T μ_j)/2σ² - ln[P(c_i)/P(c_j)]

The w_ij and the b_ij in x^T w_ij = b_ij determine the hyperplanes separating the classes and hence are discriminants. If we can find a way to determine these discriminants directly, we need not deal with the underlying probabilities that define them here. We will now examine ways in which we can find such hyperplanes (or hypersurfaces in a more general context).
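If the class means, the common variance, and the priors are known (or estimated), the hyperplane can be written down directly. A sketch under those assumptions (the function name and the example numbers are mine):

    import numpy as np

    def gaussian_hyperplane(mu_i, mu_j, sigma2, p_i=0.5, p_j=0.5):
        """Discriminant hyperplane x.w = b for two Gaussian classes with Sigma = sigma^2 I.

        w_ij = (mu_i - mu_j) / sigma^2
        b_ij = (mu_i.mu_i - mu_j.mu_j) / (2 sigma^2) - ln(P(c_i)/P(c_j))
        """
        w = (mu_i - mu_j) / sigma2
        b = (mu_i @ mu_i - mu_j @ mu_j) / (2.0 * sigma2) - np.log(p_i / p_j)
        return w, b

    # Hypothetical 2-D example: decide c_i if x.w > b, else c_j
    mu_i, mu_j = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
    w, b = gaussian_hyperplane(mu_i, mu_j, sigma2=1.0)
    x = np.array([0.8, 0.5])
    print("c_i" if x @ w > b else "c_j")   # -> c_i (x is closer to mu_i)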

Learning Discriminants with Neural Networks. Suppose that we have a two class problem and a linear discriminant is able to distinguish between observed patterns, x, of either class. Such a problem is said to be linearly separable. Geometrically, we have the situation where we can place a discriminant hyperplane x^T w = b between patterns, x, of either class. For example, for x = (x_1, x_2), the patterns observed from class c_1 lie on the side of the line where x^T w - b > 0 and the patterns observed from class c_2 lie on the side where x^T w - b < 0.

Learning to distinguish between these two classes then consists of finding the values of w, b that will separate the observed patterns. Note that we can augment the vector x = (x_1, x_2, ..., x_n) and the weight vector w = (w_1, w_2, ..., w_n) by redefining them as

w = (w_1, w_2, ..., w_n, b)
x = (x_1, x_2, ..., x_n, -1)

Then the equation of the hyperplane in terms of these augmented vectors becomes x^T w = 0.

This equation, x^T w = 0, can be related to neural network ideas, since we can view the process of making a decision c_1 or c_2 as similar to the firing (or not firing) of a neuron based on the activity level of the inputs: the inputs x_1, x_2, ..., x_n (and the constant input x_{n+1} = -1) are weighted by w_1, w_2, ..., w_n (and w_{n+1} = b) and summed to give

D = Σ_{i=1}^{n+1} w_i x_i

with output O = 1 (class c_1) if D ≥ 0 and O = -1 (class c_2) if D < 0. Now, the question is, how do we determine the unknown "weights", w, b?

Two Category Training Procedure. Given an extended weight vector w = (w_1, w_2, ..., w_n, b) and an extended feature vector x = (x_1, x_2, ..., x_n, -1), the following steps define a two class error correction algorithm.

Definition: let w_k be the weights associated with x_k, the "feature training vectors" for cases where the class of each x_k is known.
(1) Let w_1 = (0, 0, ..., 0) (actually, w_1 can be arbitrary).
(2) Given w_k, the following case rules apply:
case 1: x_k ∈ c_1 (class c_1)
  a. if x_k · w_k ≥ 0, w_{k+1} = w_k
  b. if x_k · w_k < 0, w_{k+1} = w_k + λ x_k
case 2: x_k ∈ c_2 (class c_2)
  a. if x_k · w_k < 0, w_{k+1} = w_k
  b. if x_k · w_k ≥ 0, w_{k+1} = w_k - λ x_k
where λ > 0.

Two Category Learning Theorem: if c_1 and c_2 are linearly separable and the two class training procedure is used to define w_k, then there exists an integer t ≥ 1 such that w_t linearly separates c_1 and c_2, and hence w_{t+k} = w_t for all positive k.
Ref: Hunt, E.B., Artificial Intelligence, Academic Press, 1975.

Remark: λ can be a constant or can take on particular forms. For example,

λ = α |w_k · x_k| / |x_k|²   (0 < α < 2)

can be used and the algorithm still converges. This often speeds up the convergence.
Ref: Nilsson, N., Learning Machines, McGraw Hill, 1965.
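A runnable sketch of this error-correction procedure (the function name, the +1/-1 label convention, and the outer loop that cycles repeatedly through the training set until nothing changes are my additions):

    import numpy as np

    def train_two_category(samples, labels, lam=1.0, max_passes=20):
        """Two-category error-correction training on augmented feature vectors.

        samples : augmented feature vectors x_k (last component -1)
        labels  : +1 for class c1, -1 for class c2
        Returns the learned augmented weight vector w = (w_1, ..., w_n, b).
        """
        w = np.zeros(len(samples[0]))            # step (1): start from the zero vector
        for _ in range(max_passes):
            changed = False
            for x, y in zip(np.asarray(samples, dtype=float), labels):
                D = x @ w
                if y == +1 and D < 0:            # case 1b: c1 sample on the wrong side
                    w, changed = w + lam * x, True
                elif y == -1 and D >= 0:         # case 2b: c2 sample on the wrong side
                    w, changed = w - lam * x, True
            if not changed:                      # no corrections in a full pass: separation achieved
                break
        return w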

Example: determine if an ultrasonic signal is from a crack or a volumetric flaw based on the following two features:
x_1 = "has flashpoints": 1 = "yes", -1 = "no"
x_2 = "has negative leading edge pulse": 1 = "yes", -1 = "no"

crack: x_1 = 1, x_2 = 1
volumetric (low impedance): x_1 = -1, x_2 = 1
volumetric (high impedance): x_1 = -1, x_2 = -1

Plotted in the (x_1, x_2) plane, the crack sits at (1, 1), the low impedance volumetric flaw at (-1, 1), and the high impedance volumetric flaw at (-1, -1).

The discriminant is the line D = w_1 x_1 + w_2 x_2 = 0 in the (x_1, x_2) plane, with D > 0 on one side and D < 0 on the other. For simplicity, we will take b = 0 and λ = 1, so the learning procedure is:
1. Give a training example (x_1, x_2).
2. If D ≥ 0, ask "is it a crack?" (Y or N). If D < 0, ask "is it volumetric?" (Y or N).
3. If error (N) and D ≥ 0, w_i → w_i - x_i. If error (N) and D < 0, w_i → w_i + x_i.

Suppose for this case we have the following training set:
1. x_1 = -1, x_2 = 1 (vol)
2. x_1 = 1, x_2 = 1 (crack)
3. x_1 = -1, x_2 = -1 (vol)
4. ...

Training example 1 (vol): x_1 = -1, x_2 = 1. D = 0 (since w_1 = w_2 = 0 initially), so we ask "is it a crack?"; the answer is N, an error, so
w_1 → 0 - (-1) = 1
w_2 → 0 - 1 = -1
The discriminant line D = 0 now passes through the origin with normal w = (1, -1).

Training example 2 (crack): x_1 = 1, x_2 = 1. D = (1)(1) + (-1)(1) = 0 ≥ 0, so we ask "is it a crack?"; the answer is Y, so there is no change: w = (1, -1).

Training example 3 (vol): x_1 = -1, x_2 = -1. D = (1)(-1) + (-1)(-1) = 0 ≥ 0, so we ask "is it a crack?"; the answer is N, an error, so
w_1 → 1 - (-1) = 2
w_2 → -1 - (-1) = 0
With w = (2, 0) there are no further changes: this discriminant line separates the crack from the two volumetric flaws.

Note that this classifier can also handle situations other than those on which it is trained. This "generalization" ability is a valuable property of neural nets. For example, suppose we let
x_1 = 1 "definitely has flash points", 0.5 "probably has flash points", 0 "don't know", -0.5 "probably does not have flash points", -1 "definitely does not have flash points"
and similarly for x_2. Now suppose we give our trained system an example it hasn't seen before, such as a crack where
x_1 = 0.5 ("probably has flash points")
x_2 = 0.5 ("probably has a negative leading edge pulse")
Then D = (2)(0.5) + (0)(0.5) = 1 ≥ 0, so the answer to "is it a crack?" is Y (which is correct).
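As a check, here is a self-contained sketch (array names mine) that replays the three training steps from the slides with b = 0 and λ = 1, and then repeats the generalization test:

    import numpy as np

    # Training set from the slides: label +1 = crack, -1 = volumetric
    X = np.array([[-1.0,  1.0],    # 1. volumetric (low impedance)
                  [ 1.0,  1.0],    # 2. crack
                  [-1.0, -1.0]])   # 3. volumetric (high impedance)
    y = np.array([-1, 1, -1])

    w = np.zeros(2)
    for xi, yi in zip(X, y):           # a single pass suffices here, as on the slides
        D = xi @ w
        if yi == 1 and D < 0:
            w = w + xi                 # error after asking "is it volumetric?"
        elif yi == -1 and D >= 0:
            w = w - xi                 # error after asking "is it a crack?"
    print(w)                           # -> [2. 0.], the weights found on the slides

    # Generalization: "probably has flash points", "probably has a negative leading edge pulse"
    x_new = np.array([0.5, 0.5])
    print("crack" if x_new @ w >= 0 else "volumetric")   # -> crack, which is correct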

References
Sklansky, J., and G. Wassel, Pattern Classifiers and Trainable Machines, Springer Verlag, 1981.
Pao, Y.H., Adaptive Pattern Recognition and Neural Networks, Addison Wesley, 1989.
Gale, W.A., Ed., Artificial Intelligence and Statistics, Addison Wesley, 1986.
Duda, R.O., Hart, P.E., and D.G. Stork, Pattern Classification, 2nd Ed., John Wiley, 2001.
Fukunaga, K., Statistical Pattern Recognition, Academic Press, 1990.
Webb, A., Statistical Pattern Recognition, 2nd Ed., John Wiley, 2002.
Nadler, M., and E.P. Smith, Pattern Recognition Engineering, John Wiley, 1993.