Statistical Foundations of Pattern Recognition


Statistical Foundations of Pattern Recognition. Learning objectives: Bayes' theorem, decision-making, confidence factors, discriminants, and the connection to neural nets.

Statistical Foundations of Pattern Recognition. An NDE measurement system performs feature (pattern) extraction, producing observed patterns x. Problem: How do we decide which class x belongs to, out of the possible classes c_1, c_2, c_3, ...?

Bayesian (probabilistic) approach. Let
P(c_i) = a priori probability that a pattern belongs to class c_i, regardless of the identity of the pattern
P(x) = a priori probability that a pattern is x, regardless of its class membership
P(x | c_i) = conditional probability that the pattern is x, given that it belongs to class c_i
P(c_i | x) = conditional probability that the pattern's class membership is c_i, given that the pattern is x
P(c_i, x) = the joint probability that the pattern is x and the class membership is c_i

Example: consider the case where there is one pattern value x that is observed or not, and two classes (e.g. signal or noise). Suppose ten patterns are observed (~x means x not observed; * marks the cases where x and c_1 occur together, # the cases where x and c_2 occur together):

x, c_1 *   ~x, c_1   x, c_1 *   ~x, c_2   x, c_2 #   x, c_1 *   x, c_2 #   ~x, c_1   ~x, c_1   x, c_1 *

Then
P(x) = 6/10, P(c_1) = 7/10, P(c_2) = 3/10
P(x | c_1) = 4/7, P(x | c_2) = 2/3
P(c_1 | x) = 4/6, P(c_2 | x) = 2/6
P(x, c_1) = 4/10 (see the *'s), P(x, c_2) = 2/10 (see the #'s)
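These relative-frequency estimates are easy to check by direct counting. Below is a minimal Python sketch (the variable names are mine, not the lecture's) that tabulates the ten observations and recovers the same probabilities:

    from fractions import Fraction

    # The ten observations from the slide: (feature, class), "~x" meaning x not observed.
    obs = [("x", "c1"), ("~x", "c1"), ("x", "c1"), ("~x", "c2"), ("x", "c2"),
           ("x", "c1"), ("x", "c2"), ("~x", "c1"), ("~x", "c1"), ("x", "c1")]
    n = len(obs)

    P_x    = Fraction(sum(1 for o, c in obs if o == "x"), n)                 # P(x)     = 6/10
    P_c1   = Fraction(sum(1 for o, c in obs if c == "c1"), n)                # P(c1)    = 7/10
    P_x_c1 = Fraction(sum(1 for o, c in obs if o == "x" and c == "c1"), n)   # P(x, c1) = 4/10

    print(P_x_c1 / P_c1)   # P(x | c1) = 4/7
    print(P_x_c1 / P_x)    # P(c1 | x) = 4/6 (= 2/3)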

Bayes' Theorem: P(c_i, x) = P(x | c_i) P(c_i) = P(c_i | x) P(x), or, equivalently, in an "updating form",

P(c_i | x) = P(x | c_i) P(c_i) / P(x)

where P(c_i | x) is the "new" probability of c_i, having seen x, and P(c_i) is the "old" probability of c_i.

Bayes' Theorem: P(c_i | x) = P(x | c_i) P(c_i) / P(x). Since

P(x) = Σ_j P(x | c_j) P(c_j)

we can calculate P(c_i | x) if we know the probabilities P(c_j), j = 1, 2, ..., and P(x | c_j), j = 1, 2, ...

Now, consider our previous example, where P(x) = 6/10, P(c_1) = 7/10, P(c_2) = 3/10, P(x | c_1) = 4/7, P(x | c_2) = 2/3, P(c_1 | x) = 4/6, P(c_2 | x) = 2/6, P(x, c_1) = 4/10, P(x, c_2) = 2/10. Then

P(c_1 | x) = P(x, c_1) / P(x) = P(x | c_1) P(c_1) / P(x) = (4/7)(7/10) / (6/10) = (4/10) / (6/10) = 4/6

or, in the "updating" form,

P(c_1 | x) = P(x | c_1) P(c_1) / [P(x | c_1) P(c_1) + P(x | c_2) P(c_2)] = (4/7)(7/10) / [(4/7)(7/10) + (2/3)(3/10)] = (4/10) / (6/10) = 4/6
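The updating form translates directly into a few lines of Python; this is only an illustrative sketch (the function name posterior is mine):

    def posterior(prior, likelihoods):
        """Bayes' theorem in 'updating' form: P(c_i | x) for each class c_i.

        prior       : dict of prior probabilities P(c_i)
        likelihoods : dict of conditional probabilities P(x | c_i)
        """
        evidence = sum(likelihoods[c] * prior[c] for c in prior)   # P(x) = sum_j P(x | c_j) P(c_j)
        return {c: likelihoods[c] * prior[c] / evidence for c in prior}

    # Numbers from the slide example:
    print(posterior({"c1": 7/10, "c2": 3/10}, {"c1": 4/7, "c2": 2/3}))
    # -> P(c1 | x) = 0.666... = 4/6, P(c2 | x) = 0.333... = 2/6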

As a simple example, consider trying to classify a flaw as a crack or a volumetric flaw based on these two features:
x_1: a positive leading edge pulse, PP
x_2: flash points, FP

Assume:
P(crack) = 0.5, P(volumetric) = 0.5
P(PP | crack) = 0.1 (cracks have a leading edge signal that is always negative, so unless the leading edge signal is mistakenly identified, this case is unlikely)
P(PP | volumetric) = 0.5 (low impedance (relative to host) volumetric flaws have negative leading edge pulses and high impedance volumetric flaws have positive leading edge pulses, so assume both types of volumetric flaws are equally likely)
P(FP | crack) = 0.8 (flashpoints are a feature strongly characteristic of cracks, so make this probability high)
P(FP | volumetric) = 0.05 (alternatively, make this a very low probability)

(1) Now, suppose a piece of data comes in and there is firm evidence that flashpoints (FP) exist in the measured response. Then what is the probability that the flaw is a crack?

P(crack | FP) = P(FP | crack) P(crack) / [P(FP | crack) P(crack) + P(FP | vol) P(vol)] = (0.8)(0.5) / [(0.8)(0.5) + (0.05)(0.5)] = 0.94118

Thus, we also have P(vol | FP) = 0.05882.

(2) Now, suppose another piece of data comes in with firm evidence of a positive leading edge pulse (PP). What is the new probability that the flaw is a crack?

P(crack | PP) = P(PP | crack) P(crack) / [P(PP | crack) P(crack) + P(PP | vol) P(vol)] = (0.1)(0.94118) / [(0.1)(0.94118) + (0.5)(0.05882)] = 0.76191

and, hence, P(vol | PP) = 0.23809. Note how the previous P(crack | FP) was taken as the new, a priori P(crack) in this Bayesian updating.

(3) Finally, suppose another data set comes in with firm evidence that the flashpoints (FP) do not exist. What is the probability now that the flaw is a crack?

P(crack | ~FP) = P(~FP | crack) P(crack) / [P(~FP | crack) P(crack) + P(~FP | vol) P(vol)] = (0.2)(0.762) / [(0.2)(0.762) + (0.95)(0.238)] = 0.403

and now P(vol | ~FP) = 0.597. Note: P(FP | crack) = 0.8, so P(~FP | crack) = 0.2; P(FP | vol) = 0.05, so P(~FP | vol) = 0.95.
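The three updates chain together naturally, each posterior becoming the next prior. A short sketch (the helper name update is mine) that reproduces the numbers above:

    def update(prior_crack, lik_crack, lik_vol):
        """One Bayesian update for the two-class crack / volumetric problem.

        prior_crack : current P(crack); P(vol) = 1 - prior_crack
        lik_crack   : P(evidence | crack)
        lik_vol     : P(evidence | vol)
        """
        num = lik_crack * prior_crack
        return num / (num + lik_vol * (1.0 - prior_crack))

    p = 0.5                      # start from P(crack) = 0.5
    p = update(p, 0.8, 0.05)     # (1) FP observed            -> 0.94118
    p = update(p, 0.1, 0.5)      # (2) PP observed            -> 0.76191
    p = update(p, 0.2, 0.95)     # (3) FP firmly absent (~FP) -> 0.403
    print(p)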

In all three cases we must make some decision on whether the flaw is a crack or not. One possible choice is to simply look at the probabilities and decide x belongs to class c_i ≠ c_j if and only if P(c_i | x) > P(c_j | x) for all j = 1, 2, ..., N. Since in our present example we only have two classes, we only have the two conditional probabilities P(crack | x) and P(vol | x), and since P(vol | x) = 1 - P(crack | x), our decision rule is just P(crack | x) > 1 - P(crack | x), or P(crack | x) > 0.5.

Using this simple probability decision rule, in the previous three cases we would find:
(1) P(crack | FP) = 0.941, decision: crack
(2) P(crack | PP) = 0.761, decision: crack
(3) P(crack | ~FP) = 0.401, decision: volumetric

There is no reason, however, that we need to make a decision on the conditional probabilities P(c_i | x) by themselves. We could synthesize decision functions g_i(x) from such conditional probabilities and use the g_i instead. This is the idea behind what is called Bayes decision rule:

Bayes Decision Rule: decide x belongs to class c_i ≠ c_j if and only if g_i(x) > g_j(x) for all j = 1, 2, ..., N, where g_i(x) is the decision function for class c_i.

Example: Suppose that not all decision errors are equally important. We could weight these decisions by defining the loss, l_ij, that is sustained when we decide class membership is c_i when it is in reality class c_j. Then in terms of these losses we could also define the risk that x belongs to class c_i as

R_i(x) = Σ_j l_ij P(c_j | x)

For our two class problem we would have

R_1(x) = l_11 P(c_1 | x) + l_12 P(c_2 | x)
R_2(x) = l_21 P(c_1 | x) + l_22 P(c_2 | x)

The decision rule in this case would be to decide x belongs to class c_1 if and only if R_1(x) < R_2(x) or, equivalently,

(l_11 - l_21) P(c_1 | x) < (l_22 - l_12) P(c_2 | x)

In the special case where there is no loss when we guess correctly, l_11 = l_22 = 0. If, also, it is equally costly to guess either c_1 or c_2, then l_12 = l_21 and the decision rule becomes

-l_21 P(c_1 | x) < -l_21 P(c_2 | x), or P(c_1 | x) > P(c_2 | x)

which is the simple decision rule based on conditional probabilities we discussed previously.

Now, consider our previous example again, let c_1 = crack, c_2 = volumetric flaw, and suppose we choose the following loss factors:
l_11 = -1 (a gain: if we guess cracks, which are dangerous, correctly, we should reward this decision)
l_12 = 1 (if we guess that the flaw is a crack and it is really volumetric, then there is a cost (loss), since we may do unnecessary repairs or removal from service)
l_21 = 10 (if we guess the flaw is volumetric and it is really a crack, there may be a significant loss because of a loss of safety due to misclassification)
l_22 = 0 (if we guess it is volumetric and it is, there might be no loss or gain)

In this case we find the decision rule is: decide that a crack is present if

(-11.0) P(c_1 | x) < (-1.0) P(c_2 | x)

or

P(c_1 | x) / P(c_2 | x) > 0.091

For our example then we have:
(1) P(c_1 | x_2) / P(c_2 | x_2) = 0.941 / 0.059 = 15.9 > 0.091, so decide c_1 (crack)
(2) P(c_1 | x_1) / P(c_2 | x_1) = 0.761 / 0.239 = 3.18 > 0.091, so decide c_1 (crack)
(3) P(c_1 | ~x_2) / P(c_2 | ~x_2) = 0.401 / 0.599 = 0.669 > 0.091, so decide c_1 (crack)
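A small sketch of the risk-weighted rule (the function name and the loss arguments are mine); deciding c_1 whenever R_1(x) < R_2(x) reproduces the three decisions above:

    def risk_decision(p_c1, l11, l12, l21, l22):
        """Two-class Bayes risk rule: decide c1 if R1(x) < R2(x), else c2.

        p_c1 : posterior P(c1 | x); P(c2 | x) = 1 - p_c1
        l_ij : loss sustained when deciding c_i and the true class is c_j
        """
        p_c2 = 1.0 - p_c1
        r1 = l11 * p_c1 + l12 * p_c2
        r2 = l21 * p_c1 + l22 * p_c2
        return "c1 (crack)" if r1 < r2 else "c2 (volumetric)"

    for p in (0.941, 0.761, 0.401):   # the three posteriors P(crack | evidence)
        print(p, risk_decision(p, l11=-1.0, l12=1.0, l21=10.0, l22=0.0))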

Bayes' Theorem (Odds). We can also write Bayes' Theorem in terms of odds rather than probabilities by noting that for any probability P (conditional, joint, etc.) we have the corresponding odds, O, given by

O = P / (1 - P)   (or P = O / (1 + O))

Example: O(c | x) = P(c | x) / (1 - P(c | x)). Using this definition of odds, Bayes' Theorem becomes

O(c | x) = LR · O(c)

where LR = P(x | c) / P(x | ~c) is called the likelihood ratio.

Going back to our example with x_1 = PP, x_2 = FP:

P(c_1) = 0.5, so O(c_1) = 0.5 / (1 - 0.5) = 1
P(c_2) = 0.5, so O(c_2) = 0.5 / (1 - 0.5) = 1
P(x_1 | c_1) = 0.1, so O(x_1 | c_1) = 0.1 / (1 - 0.1) = 0.11111
P(x_1 | c_2) = 0.5, so O(x_1 | c_2) = 0.5 / (1 - 0.5) = 1
P(x_2 | c_1) = 0.8, so O(x_2 | c_1) = 0.8 / (1 - 0.8) = 4
P(x_2 | c_2) = 0.05, so O(x_2 | c_2) = 0.05 / (1 - 0.05) = 0.0526

Then for our three cases:

(1) O(crack | FP) = [P(FP | crack) / P(FP | ~crack)] O(crack) = (0.8 / 0.05)(1) = 16, and P(crack | FP) = 16 / (1 + 16) = 0.941

(2) O(crack | PP) = [P(PP | crack) / P(PP | ~crack)] O(crack) = (0.1 / 0.5)(16) = 3.2, and P(crack | PP) = 3.2 / (1 + 3.2) = 0.762

(3) O(crack | ~FP) = [P(~FP | crack) / P(~FP | ~crack)] O(crack) = (0.2 / 0.95)(3.2) = 0.674, and P(crack | ~FP) = 0.674 / (1 + 0.674) = 0.403

As we see from this result, we can update the probabilities according to Bayes' Theorem by

O(c | x) = [P(x | c) / P(x | ~c)] O(c)

if the feature pattern x is observed, and by

O(c | ~x) = [P(~x | c) / P(~x | ~c)] O(c)

if the feature pattern x is not observed. We can combine these two cases as

O(c | x̂) = LR(x̂, c) O(c)

where LR(x̂, c) = P(x̂ | c) / P(x̂ | ~c), and x̂ = x if x is observed, x̂ = ~x if x is not observed.
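In code, the odds form is just a multiplication by the likelihood ratio. A sketch (helper names are mine) that replays the three crack updates in odds space:

    def odds(p):
        """O = P / (1 - P)."""
        return p / (1.0 - p)

    def prob(o):
        """P = O / (1 + O)."""
        return o / (1.0 + o)

    def odds_update(o_prior, p_xhat_given_c, p_xhat_given_not_c):
        """Bayes' theorem in odds form: O(c | x_hat) = LR(x_hat, c) * O(c)."""
        return (p_xhat_given_c / p_xhat_given_not_c) * o_prior

    o = odds(0.5)                   # O(crack) = 1
    o = odds_update(o, 0.8, 0.05)   # FP observed:     LR = 16    -> O = 16
    o = odds_update(o, 0.1, 0.5)    # PP observed:     LR = 0.2   -> O = 3.2
    o = odds_update(o, 0.2, 0.95)   # FP not observed: LR = 0.211 -> O = 0.674
    print(prob(o))                  # -> 0.403, as on the slide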

Criticisms of this Probabilistic Approach:
1. It does not include uncertainty in the evidence of the existence (or not) of the feature patterns.
2. It is difficult to assign the a priori probabilities.

To solve the first problem, we will show how to introduce uncertainty with confidence factors. To solve the second problem, we will discuss the alternative use of discriminants.

Confidence Factors. Consider Bayes' Theorem in the odds form

O(c | x̂) = LR(x̂, c) O(c)

In updating the odds, the likelihood ratio is based on being able to have firm evidence of the existence of the pattern x or not:

LR(x̂, c) = P(x̂ | c) / P(x̂ | ~c)

We can introduce uncertainty into this updating by letting a user (or a program) give a response R in the range [-1, 1], where
R = 1 corresponds to complete certainty that x is present
R = 0 corresponds to complete uncertainty that x is or is not present
R = -1 corresponds to complete certainty that x is not present

Then in updating the odds, we can replace the likelihood ratio, LR, by a function of LR and R that incorporates this uncertainty:

O(c | x̂) = f(LR, R) O(c)

There are, however, some properties that this function f should satisfy. They are:
1. If R = 1, f = LR(x, c) (if we are certain in the evidence of x, we should reduce to ordinary Bayes)
2. If R = -1, f = LR(~x, c) (if we are certain x does not exist, again reduce to ordinary Bayes)
3. If LR = 0, f = 0 (if the likelihood is zero, regardless of the uncertainty, R, the updated odds should be zero)

A popular choice that appears in the literature is to choose

f(LR, R) = |R| LR(x̂, c) + (1 - |R|)

where LR(x̂, c) = LR(x, c) if R ∈ [0, 1] and LR(x̂, c) = LR(~x, c) if R ∈ [-1, 0]. If we plot this function versus R, we see the effects of R: f varies linearly from LR(~x, c) at R = -1, through 1.0 at R = 0, to LR(x, c) at R = 1.

Although this is a simple function to use, there is a problem with it, which we can see if we plot f versus LR for different R (note that LR ∈ [0, ∞)): at LR = 0 the function f does not go to zero as we said it should (see property 3 discussed above), since f(0, R) = 1 - |R| there. To remedy that problem, we need to choose a nonlinear function.

One choice that satisfies all three properties f should have is

f(LR, R) = LR^R

Plotting f versus LR for different values of R, the curves now pass through f = 0 at LR = 0, as property 3 requires.

This gives a dependency on R that is nonlinear: plotted versus R, f(LR, R) rises from LR(~x, c) at R = -1, through 1.0 at R = 0, to LR(x, c) at R = 1.

With this choice of f, we would have

O(c | x̂) = LR^R O(c)

However, if one wants to work in terms of probabilities, not odds, we have

P(c | x̂) = LR^R P(c) / [LR^R P(c) + (1 - P(c))]

with LR = P(x̂ | c) / P(x̂ | ~c), and x̂ = x if R > 0, x̂ = ~x if R < 0.
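A sketch of the modified update O(c | x̂) = LR^R O(c). One detail is not spelled out on the slide: I assume here that |R| is used as the exponent on the R < 0 branch (together with LR(~x, c)), so that R = -1 reduces to the ordinary Bayes update for ~x, as property 2 requires; the function name is mine.

    def confidence_update(o_prior, lr_x, lr_not_x, R):
        """Odds update with an uncertain response R in [-1, 1], using f(LR, R) = LR**R.

        lr_x     : LR(x, c)  = P(x | c)  / P(x | ~c)   (used when R > 0)
        lr_not_x : LR(~x, c) = P(~x | c) / P(~x | ~c)  (used when R < 0)
        """
        if R > 0:
            f = lr_x ** R
        elif R < 0:
            f = lr_not_x ** (-R)   # assumed |R| exponent, so R = -1 gives ordinary Bayes with ~x
        else:
            f = 1.0                # complete uncertainty: odds left unchanged
        return f * o_prior

    # Crack / flashpoint example: LR(FP, crack) = 0.8/0.05 = 16, LR(~FP, crack) = 0.2/0.95
    print(confidence_update(1.0, 16.0, 0.2 / 0.95, 1.0))   # R = 1   -> 16 (ordinary Bayes)
    print(confidence_update(1.0, 16.0, 0.2 / 0.95, 0.5))   # R = 0.5 -> 4  (partial confidence)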

Bayes' theorem, even in this modified form that takes into account uncertainty in the evidence, still requires us to have a priori probability estimates, and those may be difficult to come by. How do we get around this? Consider our two class problem where we have classes (c_1, c_2) and where the pattern consists of a single feature x. According to Bayes decision rule we could decide on c_1 (or c_2) if g_1(x) > g_2(x) (or g_2(x) > g_1(x)). For example, suppose both g_1 and g_2 were unimodal, smooth distributions, with g_1(x) > g_2(x) to the left of some crossing point x_threshold and g_2(x) > g_1(x) to its right. Then we see the decision rule is really just:
class c_1 if x < x_threshold
class c_2 if x > x_threshold

Thus, if we had a way of finding x_threshold, which serves as a discriminant, we could make our decisions and not have to even consider the underlying probabilities! However, we have not really eliminated the probabilities entirely, since they ultimately determine the errors made in the decision making process. Note that in the more general multi-modal decision function case, several discriminants (e.g. x_1, x_2, x_3) may be needed to separate the regions where g_1(x) > g_2(x) from those where g_2(x) > g_1(x).

If we take the g_i(x) to be just the probability distributions P(c_i | x), then recall that Bayes decision rule says that x belongs to class c_i ≠ c_j if and only if P(c_i | x) > P(c_j | x) for all j = 1, 2, ..., N, or, equivalently, P(c_i | x) P(x) > P(c_j | x) P(x), which says that P(c_i, x) > P(c_j, x), so that also P(x | c_i) P(c_i) > P(x | c_j) P(c_j). If x is a continuous variable, then we can associate probability distributions with quantities such as p(c_i, x) and p(x | c_i), and so we expect that the discriminants are dependent on the nature of these distributions. We will now examine that relationship more closely.

Probability Distributions and Discriminants. First, consider the 1-D case where the pattern is a single feature x and where we assume the distributions are Gaussians, i.e.

p(x, c_i) = P(c_i) (2π σ_i²)^(-1/2) exp[-(x - μ_i)² / 2σ_i²]

where μ_i = mean value of x for class c_i and σ_i = standard deviation for class c_i. If we assume σ_i = σ_j = σ and P(c_i) = P(c_j), then Bayes decision rule says that x belongs to class c_i if and only if

exp[-(x - μ_i)² / 2σ²] > exp[-(x - μ_j)² / 2σ²]

or, equivalently, x belongs to class c_i if and only if

(x - μ_i)² < (x - μ_j)²   for all j = 1, 2, ..., N

This is just the basis for the nearest cluster center classification method.
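A minimal sketch of the nearest cluster center rule (the function name and the example means are hypothetical, not from the lecture):

    import numpy as np

    def nearest_mean_classify(x, class_means):
        """Assign x to the class whose mean is closest (equal-variance, equal-prior Gaussian case)."""
        return min(class_means,
                   key=lambda c: np.sum((np.asarray(x) - np.asarray(class_means[c])) ** 2))

    # Hypothetical 1-D example with two class means:
    print(nearest_mean_classify(0.4, {"c1": 0.0, "c2": 1.0}))   # -> "c1"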

Now, consider the more general case of N-dimensional features but keep the assumption of Gaussian distributions. Then

p(x, c_i) = P(c_i) (2π)^(-N/2) |Σ_i|^(-1/2) exp[-(1/2)(x - μ_i)^T Σ_i^(-1) (x - μ_i)]

where μ_i = N-component mean vector for class c_i and Σ_i = N x N covariance matrix.

Bayes decision theory says that x belongs to class c_i if and only if

P(c_i) (2π)^(-N/2) |Σ_i|^(-1/2) exp[-(1/2)(x - μ_i)^T Σ_i^(-1) (x - μ_i)] > P(c_j) (2π)^(-N/2) |Σ_j|^(-1/2) exp[-(1/2)(x - μ_j)^T Σ_j^(-1) (x - μ_j)]

Now, suppose we are on the boundary between c_i and c_j, and also suppose that Σ_i = Σ_j = σ² I, where I is the unit matrix. Then

P(c_i) exp[-(x - μ_i)^T (x - μ_i) / 2σ²] = P(c_j) exp[-(x - μ_j)^T (x - μ_j) / 2σ²]

Taking the ln of this equation, we then have

ln[P(c_i)/P(c_j)] - (x - μ_i)^T (x - μ_i)/2σ² + (x - μ_j)^T (x - μ_j)/2σ² = 0

which can be expanded out to give

ln[P(c_i)/P(c_j)] + x^T (μ_i - μ_j)/σ² + (μ_j^T μ_j - μ_i^T μ_i)/2σ² = 0

However, these are just the equations of the hyperplanes

x^T w_ij = b_ij

with

w_ij = (μ_i - μ_j)/σ²
b_ij = (μ_i^T μ_i - μ_j^T μ_j)/2σ² - ln[P(c_i)/P(c_j)]

The w_ij and the b_ij in x^T w_ij = b_ij determine the hyperplanes separating the classes and hence are discriminants. If we can find a way to determine these discriminants directly, we need not deal with the underlying probabilities that define them here. We will now examine ways in which we can find such hyperplanes (or hypersurfaces in a more general context).
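If the class means, the common variance, and the priors are known (or estimated), the hyperplane can be written down directly. A sketch under those assumptions (the function name and the example numbers are mine):

    import numpy as np

    def gaussian_hyperplane(mu_i, mu_j, sigma2, p_i=0.5, p_j=0.5):
        """Discriminant hyperplane x.w = b for two Gaussian classes with Sigma = sigma^2 I.

        w_ij = (mu_i - mu_j) / sigma^2
        b_ij = (mu_i.mu_i - mu_j.mu_j) / (2 sigma^2) - ln(P(c_i)/P(c_j))
        """
        w = (mu_i - mu_j) / sigma2
        b = (mu_i @ mu_i - mu_j @ mu_j) / (2.0 * sigma2) - np.log(p_i / p_j)
        return w, b

    # Hypothetical 2-D example: decide c_i if x.w > b, else c_j
    mu_i, mu_j = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
    w, b = gaussian_hyperplane(mu_i, mu_j, sigma2=1.0)
    x = np.array([0.8, 0.5])
    print("c_i" if x @ w > b else "c_j")   # -> c_i (x is closer to mu_i)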

Learning Discriminants with Neural Networks. Suppose that we have a two class problem and a linear discriminant is able to distinguish between observed patterns, x, of either class. Such a problem is said to be linearly separable. Geometrically, we have the situation where we can place a discriminant hyperplane x^T w = b between patterns, x, of either class. For example, for x = (x_1, x_2), the patterns observed from class c_1 lie on the side of the line where x^T w - b > 0 and the patterns observed from class c_2 lie on the side where x^T w - b < 0.

Learning to distinguish between these two classes then consists of finding the values of w, b that will separate the observed patterns. Note that we can augment the vector x = (x_1, x_2, ..., x_n) and the weight vector w = (w_1, w_2, ..., w_n) by redefining them as

w = (w_1, w_2, ..., w_n, b)
x = (x_1, x_2, ..., x_n, -1)

Then the equation of the hyperplane in terms of these augmented vectors becomes x^T w = 0.

This equation, x^T w = 0, can be related to neural network ideas, since we can view the process of making a decision c_1 or c_2 as similar to the firing (or not firing) of a neuron based on the activity level of the inputs: the inputs x_1, x_2, ..., x_n (and the constant input x_{n+1} = -1) are weighted by w_1, w_2, ..., w_n (and w_{n+1} = b) and summed to give

D = Σ_{i=1}^{n+1} w_i x_i

with output O = 1 (class c_1) if D ≥ 0 and O = -1 (class c_2) if D < 0. Now, the question is, how do we determine the unknown "weights", w, b?

Two Category Training Procedure. Given an extended weight vector w = (w_1, w_2, ..., w_n, b) and an extended feature vector x = (x_1, x_2, ..., x_n, -1), the following steps define a two class error correction algorithm.

Definition: let w_k be the weights associated with x_k, the "feature training vectors" for cases where the class of each x_k is known.
(1) Let w_1 = (0, 0, ..., 0) (actually, w_1 can be arbitrary).
(2) Given w_k, the following case rules apply:
case 1: x_k ∈ c_1 (class c_1)
  a. if x_k · w_k ≥ 0, w_{k+1} = w_k
  b. if x_k · w_k < 0, w_{k+1} = w_k + λ x_k
case 2: x_k ∈ c_2 (class c_2)
  a. if x_k · w_k < 0, w_{k+1} = w_k
  b. if x_k · w_k ≥ 0, w_{k+1} = w_k - λ x_k
where λ > 0.

Two Category Learning Theorem: if c_1 and c_2 are linearly separable and the two class training procedure is used to define w_k, then there exists an integer t ≥ 1 such that w_t linearly separates c_1 and c_2, and hence w_{t+k} = w_t for all positive k.
Ref: Hunt, E.B., Artificial Intelligence, Academic Press, 1975.

Remark: λ can be a constant or can take on particular forms. For example,

λ = α |w_k · x_k| / |x_k|²   (0 < α < 2)

can be used and the algorithm still converges. This often speeds up the convergence.
Ref: Nilsson, N., Learning Machines, McGraw Hill, 1965.
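A runnable sketch of this error-correction procedure (the function name, the +1/-1 label convention, and the outer loop that cycles repeatedly through the training set until nothing changes are my additions):

    import numpy as np

    def train_two_category(samples, labels, lam=1.0, max_passes=20):
        """Two-category error-correction training on augmented feature vectors.

        samples : augmented feature vectors x_k (last component -1)
        labels  : +1 for class c1, -1 for class c2
        Returns the learned augmented weight vector w = (w_1, ..., w_n, b).
        """
        w = np.zeros(len(samples[0]))            # step (1): start from the zero vector
        for _ in range(max_passes):
            changed = False
            for x, y in zip(np.asarray(samples, dtype=float), labels):
                D = x @ w
                if y == +1 and D < 0:            # case 1b: c1 sample on the wrong side
                    w, changed = w + lam * x, True
                elif y == -1 and D >= 0:         # case 2b: c2 sample on the wrong side
                    w, changed = w - lam * x, True
            if not changed:                      # no corrections in a full pass: separation achieved
                break
        return w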

Example: determine if an ultrasonic signal is from a crack or a volumetric flaw based on the following two features:
x_1 = "has flashpoints": 1 = "yes", -1 = "no"
x_2 = "has negative leading edge pulse": 1 = "yes", -1 = "no"

crack: x_1 = 1, x_2 = 1
volumetric (low impedance): x_1 = -1, x_2 = 1
volumetric (high impedance): x_1 = -1, x_2 = -1

Plotted in the (x_1, x_2) plane, the crack sits at (1, 1), the low impedance volumetric flaw at (-1, 1), and the high impedance volumetric flaw at (-1, -1).

The discriminant is the line D = w_1 x_1 + w_2 x_2 = 0 in the (x_1, x_2) plane, with D > 0 on one side and D < 0 on the other. For simplicity, we will take b = 0 and λ = 1, so the learning procedure is:
1. Give a training example (x_1, x_2).
2. If D ≥ 0, ask "is it a crack?" (Y or N). If D < 0, ask "is it volumetric?" (Y or N).
3. If error (N) and D ≥ 0, w_i → w_i - x_i. If error (N) and D < 0, w_i → w_i + x_i.

Suppose for this case we have the following training set:
1. x_1 = -1, x_2 = 1 (vol)
2. x_1 = 1, x_2 = 1 (crack)
3. x_1 = -1, x_2 = -1 (vol)
4. ...

Training example 1 (vol): x_1 = -1, x_2 = 1. D = 0 (since w_1 = w_2 = 0 initially), so we ask "is it a crack?"; the answer is N, an error, so
w_1 → 0 - (-1) = 1
w_2 → 0 - 1 = -1
The discriminant line D = 0 now passes through the origin with normal w = (1, -1).

Training example 2 (crack): x_1 = 1, x_2 = 1. D = (1)(1) + (-1)(1) = 0 ≥ 0, so we ask "is it a crack?"; the answer is Y, so there is no change: w = (1, -1).

Training example 3 (vol): x_1 = -1, x_2 = -1. D = (1)(-1) + (-1)(-1) = 0 ≥ 0, so we ask "is it a crack?"; the answer is N, an error, so
w_1 → 1 - (-1) = 2
w_2 → -1 - (-1) = 0
With w = (2, 0) there are no further changes: this discriminant line separates the crack from the two volumetric flaws.

Note that this classifier can also handle situations other than those on which it is trained. This "generalization" ability is a valuable property of neural nets. For example, suppose we let
x_1 = 1 "definitely has flash points", 0.5 "probably has flash points", 0 "don't know", -0.5 "probably does not have flash points", -1 "definitely does not have flash points"
and similarly for x_2. Now suppose we give our trained system an example it hasn't seen before, such as a crack where
x_1 = 0.5 ("probably has flash points")
x_2 = 0.5 ("probably has a negative leading edge pulse")
Then D = (2)(0.5) + (0)(0.5) = 1 ≥ 0, so the answer to "is it a crack?" is Y (which is correct).
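As a check, here is a self-contained sketch (array names mine) that replays the three training steps from the slides with b = 0 and λ = 1, and then repeats the generalization test:

    import numpy as np

    # Training set from the slides: label +1 = crack, -1 = volumetric
    X = np.array([[-1.0,  1.0],    # 1. volumetric (low impedance)
                  [ 1.0,  1.0],    # 2. crack
                  [-1.0, -1.0]])   # 3. volumetric (high impedance)
    y = np.array([-1, 1, -1])

    w = np.zeros(2)
    for xi, yi in zip(X, y):           # a single pass suffices here, as on the slides
        D = xi @ w
        if yi == 1 and D < 0:
            w = w + xi                 # error after asking "is it volumetric?"
        elif yi == -1 and D >= 0:
            w = w - xi                 # error after asking "is it a crack?"
    print(w)                           # -> [2. 0.], the weights found on the slides

    # Generalization: "probably has flash points", "probably has a negative leading edge pulse"
    x_new = np.array([0.5, 0.5])
    print("crack" if x_new @ w >= 0 else "volumetric")   # -> crack, which is correct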

References
Sklansky, J., and G. Wassel, Pattern Classifiers and Trainable Machines, Springer Verlag, 1981.
Pao, Y.H., Adaptive Pattern Recognition and Neural Networks, Addison Wesley, 1989.
Gale, W.A., Ed., Artificial Intelligence and Statistics, Addison Wesley, 1986.
Duda, R.O., Hart, P.E., and D.G. Stork, Pattern Classification, 2nd Ed., John Wiley, 2001.
Fukunaga, K., Statistical Pattern Recognition, Academic Press, 1990.
Webb, A., Statistical Pattern Recognition, 2nd Ed., John Wiley, 2002.
Nadler, M., and E.P. Smith, Pattern Recognition Engineering, John Wiley, 1993.