EEL 851: Biometrics
An Overview of Statistical Pattern Recognition
Outline
- Introduction: patterns, features, noise
- Example: problem analysis, segmentation, feature extraction, classification
- Design cycle
- Bayesian decision theory: discriminant functions, decision regions
- Minimum distance classification, nearest neighbor classification
- Parameter estimation, decision boundaries
Introduction
What is a pattern?
- An object, process, or event consisting of both deterministic and stochastic components
- A record of dynamic occurrences influenced by both deterministic and stochastic factors
- Examples: voice, images, characters
Kinds of patterns
- Visual patterns
- Temporal patterns
- Logical patterns
What is a pattern class?
- A set of patterns sharing a set of common attributes (features)
- Usually originating from the same source
Introduction
Feature
- A relevant characteristic that sets patterns apart from one another
- Data extractable through measurements or processing
- Example patterns: speech waveforms, crystals, textures, weather patterns
- Example features: age, color, height, width
Classification
- Assigning patterns to classes based on their features
Noise
- Distortions in pattern processing and/or training samples that affect the classification performance of the system
Example
- Automation in a fish-packing plant
- Sort incoming fish on a conveyor according to species using optical sensing
- Species to sort: sea bass, salmon
Problem Analysis
- Set up a camera and take some sample images
- Features? Suggested features:
  - Length
  - Lightness
  - Width
  - Number and shape of fins
  - Position of the mouth, etc.
- Explore the suggested features!
Preprocessing
Segmentation
- Isolate fish from the background
- Isolate fish from one another
Feature extraction
- Reduce the data by measuring certain features
Classifier
- Evaluates the evidence presented
- Makes the final decision
Preprocessing
[Figure]
Classification
- Select the length of the fish as a possible feature for classification
Classification
- Length turns out to be a poor feature!
- Select lightness as a possible feature instead
Threshold Decision Boundary and Cost
- What are the consequences (costs) of our decisions? Are they equal? Weighing them is the task of decision theory
- Moving the decision boundary toward smaller values of lightness reduces the number of sea bass classified as salmon, at some cost
Classification
- Adopt both the lightness and the width of the fish
- Fish x = [x1, x2]: x1 = lightness, x2 = width
Classification
- Our satisfaction is premature
- The central aim is to correctly classify novel inputs: the issue of generalization!
Classification
- Syntactic pattern recognition: recursive description using the methods of formal languages
- Structural pattern recognition: derivation of descriptions using formal rules
Pattern Recognition Systems
Invariant features
- Translation
- Rotation
- Scale
Design Cycle
- Data collection
- Feature choice
- Model choice
- Training
- Evaluation
- Computational complexity
Types of Learning
Supervised learning
- A teacher provides a category label or cost for each pattern in the training set (desired input with desired output)
- Goal: produce a correct output given a new input
- Classification: the desired output is a discrete class label
- Regression: the desired output is continuous valued
Unsupervised learning
- The system forms clusters or natural groupings of the input patterns
- Goal: build a model or find useful representations of the data
- Uses: reasoning, finding clusters, dimensionality reduction, decision making, prediction, data compression, outlier detection, aiding classification
Bayesian Decision Theory
Assumptions
- The problem is posed in probabilistic terms
- All relevant probabilities are known
Sea bass/salmon example
- The state of nature is a random variable ω: ω = ω1 (sea bass) or ω = ω2 (salmon)
- Prior probabilities: P(ω1) for sea bass, P(ω2) for salmon
- P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
Bayesian Decision Theory
Decision rule with prior information only (we cannot see the fish!)
- Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
In most cases we have more information
- Class-conditional pdf p(x|ω): p(x|ω1) and p(x|ω2) describe the difference in lightness between the populations of sea bass and salmon
Bayesian Decision Theory
[Figure: class-conditional pdfs, with x the lightness of the fish, normalized]
Bayes Formula
Given the lightness x of a fish, what is its class?
  P(ωj|x) = p(x|ωj) P(ωj) / p(x)
where, in the case of two categories,
  p(x) = Σ_{j=1..2} p(x|ωj) P(ωj)
Posterior = (Likelihood × Prior) / Evidence
- P(ωj|x): posterior
- p(x|ωj): likelihood
- P(ωj): prior
- p(x): evidence (scale factor)
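As a minimal sketch of the formula above (the Gaussian lightness models and their numbers are illustrative assumptions, not from the slides; the priors 2/3 and 1/3 match the example on a later slide):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density, used here as a class-conditional p(x|w)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical lightness models for the two species
priors = {"sea_bass": 2 / 3, "salmon": 1 / 3}             # P(w)
params = {"sea_bass": (11.0, 1.5), "salmon": (8.0, 1.5)}  # (mean, std) of lightness

def posteriors(x):
    """Bayes formula: P(w|x) = p(x|w) P(w) / p(x)."""
    joint = {w: gaussian_pdf(x, *params[w]) * priors[w] for w in priors}
    evidence = sum(joint.values())   # p(x), the normalizing scale factor
    return {w: j / evidence for w, j in joint.items()}

print(posteriors(9.0))  # the two posteriors sum to 1
```

Note that the evidence p(x) only rescales the joint terms, which is why it can be ignored when comparing classes.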
Posterior Probabilities
[Figure: posterior probabilities for the prior probabilities P(ω1) = 2/3, P(ω2) = 1/3]
Bayes Decision Rule
Decision given the posterior probabilities, for an observation x
- If P(ω1|x) > P(ω2|x), decide the true state of nature is ω1
- If P(ω1|x) < P(ω2|x), decide the true state of nature is ω2
Probability of error
- P(error|x) = P(ω1|x) if we decide ω2
- P(error|x) = P(ω2|x) if we decide ω1
Bayes Decision Rule
Minimize the probability of error
- Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
- Under the Bayes decision, P(error|x) = min[P(ω1|x), P(ω2|x)]
Bayesian Classifier
Generalization of the preceding ideas
- More than one feature
- More than two states of nature
- Actions beyond merely deciding the state of nature
- A loss function more general than the probability of error
Bayes decision rule
- An input pattern is classified into the class ωk that, for a given feature vector x, maximizes the posterior probability: P(ωk|x) ≥ P(ωj|x) for all j ≠ k
- Likelihoods p(x|ωk) come from training data measurements
- Prior probabilities P(ωk) are assumed known within the given population of sample patterns
- p(x) is the same for all class alternatives, so it can be ignored as a normalization factor
Loss Function
- Certain classification errors may be more costly than others: a false negative may be much more costly than a false positive (examples: fire alarms, medical diagnosis)
- Ω = {ω1, ω2, …, ωn}: the possible states of nature/classes
- {α1, α2, …, αn}: the possible decisions/actions
- Loss function λ(αi|ωj): the loss incurred by taking action αi when the true state of nature is ωj
- Conditional risk R(αi|x): the loss expected from taking action αi when the observed evidence is x
Minimum Error Rate Classification
- Action αi is associated with class ωi: the decision αi is correct only if the state of nature is ωi
- All errors are equally costly: the symmetric zero-one loss
  λ(αi|ωj) = 0 if i = j, 1 if i ≠ j
- Conditional risk: R(αi|x) = Σj λ(αi|ωj) P(ωj|x) = 1 − P(ωi|x)
- Minimizing the risk therefore selects the class maximizing the posterior probability, as suggested by the Bayes decision rule: minimum error rate classification
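As a sketch (the posteriors and loss matrices are made-up numbers), the conditional risk R(αi|x) = Σj λ(αi|ωj) P(ωj|x) can be computed directly; it also shows how an asymmetric loss can flip the minimum-risk decision:

```python
def conditional_risk(loss, posteriors):
    """Risk of each action: R(a_i|x) = sum_j loss[i][j] * P(w_j|x)."""
    return [sum(l_ij * p for l_ij, p in zip(row, posteriors)) for row in loss]

posteriors = [0.7, 0.3]              # P(w1|x), P(w2|x)
zero_one = [[0, 1], [1, 0]]          # zero-one loss: all errors cost 1

risks = conditional_risk(zero_one, posteriors)
print(risks)                         # [0.3, 0.7]: deciding w1 minimizes risk

# An asymmetric loss (one error 10x costlier) can flip the decision
asymmetric = [[0, 10], [1, 0]]
print(conditional_risk(asymmetric, posteriors))  # now deciding w2 is cheaper
```

With the zero-one loss the minimum-risk action is always the maximum-posterior class, matching the slide.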
Neyman-Pearson Criterion
- Minimize the overall risk subject to a constraint, e.g. ∫ R(αi|x) dx < constant
- Misclassification is limited to a given frequency
- Fish example: do not misclassify more than 1% of salmon as sea bass, and subject to that, minimize the chance of classifying sea bass as salmon
Discriminant Functions
Classifier representation
- A set of functions employed to discriminate among classes: gi(x), i = 1, …, k
- Classifier: assign x to class ωi if gi(x) ≥ gj(x) for all j ≠ i
Possible choices
- gi(x) = P(ωi|x)
- gi(x) = p(x|ωi) P(ωi)
- gi(x) = ln p(x|ωi) + ln P(ωi)
Decision Region
Effect of the decision rule
- The decision rule divides the feature space into decision regions
- Decision boundaries: surfaces in feature space where ties occur among the largest discriminant functions
Normal Density
Univariate density
- Analytically tractable; maximum entropy (for a given mean and variance); continuous valued
- Many processes are asymptotically Gaussian (central limit theorem): the aggregate effect of independent random disturbances is Gaussian
- Many patterns can be viewed as a prototype corrupted by a large number of random processes
  p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ)/σ)² ]
where µ is the mean (expected value) of x and σ² is the expected squared deviation, or variance
Normal Density
Multivariate density
- The multivariate normal density in d dimensions is
  p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) ]
where
- x = (x1, x2, …, xd)ᵗ is the feature vector
- µ = (µ1, µ2, …, µd)ᵗ is the mean vector
- Σ is the d×d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse
- Σ governs the shape of the Gaussian curve
- (x − µ)ᵗ Σ⁻¹ (x − µ) is the squared Mahalanobis distance
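A direct NumPy transcription of the density above (a sketch, not from the slides; the example point and covariance are arbitrary):

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Multivariate normal density p(x) in d dimensions."""
    d = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.inv(cov) @ diff      # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * maha_sq) / norm

mu = np.array([0.0, 0.0])
cov = np.eye(2)                       # identity covariance
p = mvn_pdf(np.array([0.0, 0.0]), mu, cov)
print(p)                              # 1/(2*pi), the peak value at the mean
```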
Normal Density
[Figure]
Discriminant Functions
Minimum error rate classification
  gi(x) = ln p(x|ωi) + ln P(ωi)
For Gaussian class-conditionals,
  gi(x) = −(1/2)(x − µi)ᵗ Σi⁻¹ (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
Case Σi = σ²I
- Features are statistically independent with the same variance σ² (diagonal covariance matrix)
- gi(x) = wiᵗ x + wi0, where wi = µi / σ² and wi0 = −µiᵗ µi / (2σ²) + ln P(ωi)
- Linear machines: the decision surfaces are pieces of hyperplanes gi(x) = gj(x)
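A minimal sketch of the linear-machine case Σi = σ²I (the two means, variance, and priors are made-up values for illustration):

```python
import numpy as np

def linear_discriminant(mu, sigma2, prior):
    """Return (w, w0) so that g(x) = w.x + w0 for the case Sigma_i = sigma^2 I."""
    w = mu / sigma2
    w0 = -mu @ mu / (2 * sigma2) + np.log(prior)
    return w, w0

mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 4.0])
sigma2 = 1.0
w1, w10 = linear_discriminant(mu1, sigma2, 0.5)
w2, w20 = linear_discriminant(mu2, sigma2, 0.5)

def classify(x):
    """Assign x to the class with the larger linear discriminant."""
    return 1 if w1 @ x + w10 >= w2 @ x + w20 else 2

print(classify(np.array([1.0, 1.0])))   # nearer mu1: class 1
print(classify(np.array([3.0, 3.0])))   # nearer mu2: class 2
```

With equal priors this reduces to minimum distance classification: the point is assigned to the nearest mean.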
Discriminant Functions
Case Σi = Σ
- The covariance matrices of all classes are identical but otherwise arbitrary
  gi(x) = −(1/2)(x − µi)ᵗ Σ⁻¹ (x − µi) + ln P(ωi)
- The hyperplane separating Ri and Rj passes through
  x0 = (1/2)(µi + µj) − [ ln(P(ωi)/P(ωj)) / ((µi − µj)ᵗ Σ⁻¹ (µi − µj)) ] (µi − µj)
- This hyperplane is generally not orthogonal to the line between the means
Discriminant Functions
Case Σi = arbitrary
- The covariance matrices are different for each category
  gi(x) = xᵗ Wi x + wiᵗ x + wi0
  where Wi = −(1/2) Σi⁻¹, wi = Σi⁻¹ µi, and
  wi0 = −(1/2) µiᵗ Σi⁻¹ µi − (1/2) ln |Σi| + ln P(ωi)
- The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids
Decision Boundary - Example
- Compute the Bayes decision boundary for two Gaussian distributions (or their discrete versions) with means and covariances
  µ1 = (3, 6)ᵗ, Σ1 = [[1/2, 0], [0, 2]]
  µ2 = (3, −2)ᵗ, Σ2 = [[2, 0], [0, 2]]
- The resulting boundary is the parabola
  x2 = 3.514 − 1.125 x1 + 0.1875 x1²
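The boundary above can be checked numerically with the quadratic discriminant from the preceding slides (a sketch; equal priors P(ω1) = P(ω2) = 1/2 are assumed, since the slide does not state them):

```python
import numpy as np

mu1, cov1 = np.array([3.0, 6.0]), np.diag([0.5, 2.0])
mu2, cov2 = np.array([3.0, -2.0]), np.diag([2.0, 2.0])

def g(x, mu, cov, prior=0.5):
    """Quadratic discriminant g(x) = -1/2 (x-mu)^t S^-1 (x-mu) - 1/2 ln|S| + ln P(w)."""
    d = x - mu
    return (-0.5 * d @ np.linalg.inv(cov) @ d
            - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))

def boundary(x1):
    """Closed form x2 = 3.514 - 1.125 x1 + 0.1875 x1^2 from the slide."""
    return 3.514 - 1.125 * x1 + 0.1875 * x1 ** 2

for x1 in (-2.0, 0.0, 3.0):
    x = np.array([x1, boundary(x1)])
    # On the boundary the two discriminants agree (up to the rounding in 3.514)
    print(x1, g(x, mu1, cov1) - g(x, mu2, cov2))
```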
Supervised Learning
Parametric approaches
- Bayesian parameter estimation
- Maximum likelihood estimation
Estimation problem
- Estimate P(ωj) and p(x|ωj)
- This is tough in high-dimensional feature spaces with small numbers of training samples
Simplifying assumptions
- Feature independence
- Independently drawn samples (the i.i.d. model)
- Assume p(x|ωj) is Gaussian, reducing the estimation problem to the parameters of a normal density
Estimation Methods
Bayesian estimation (MAP)
- Distribution parameters are random values that follow a known (e.g. Gaussian) distribution
- The behavior of the training data helps revise the parameter values
- Large training samples give better chances of refining the posterior probabilities (parameter peaking)
Maximum-likelihood (ML) estimation
- The parameters of the probabilistic distributions are fixed but unknown values: unknown constants identified from the training data
- The best estimates are the parameter values that maximize the class-conditional probability of the available samples
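As a sketch of ML estimation (the synthetic data and its true parameters are assumptions for illustration): for a univariate Gaussian class-conditional, the ML estimates are simply the sample mean and the sample variance (with a 1/n, not 1/(n-1), normalizer) of the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)   # synthetic training data

mu_hat = samples.mean()                      # ML estimate of the mean
var_hat = ((samples - mu_hat) ** 2).mean()   # ML estimate of the variance

print(mu_hat, var_hat)   # close to the true values 5.0 and 4.0
```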
Parametric Approaches
- Suffer from the curse of dimensionality
- Estimate the parameters of a known distribution from a small number of samples, using some a priori information
Non-parametric Approaches
- Parzen window pdf estimation (kernel density estimation): estimate p(x|ωj) directly from the sample patterns
- kn nearest-neighbor: directly construct the decision boundary from the training data
Statistical Approaches
[Figure]
Nearest-Neighbor Methods
- Statistical, nearest-neighbor, or memory-based methods
k-nearest-neighbor
- A new pattern is assigned the category of the plurality of its k closest neighbors
- Large k decreases the chance of undue influence by a noisy training pattern
- Small k reduces the acuity of the method
- The distance metric is usually Euclidean; in practice the features are scaled by factors aj
- Advantage: requires no training
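The rule above can be sketched directly (the training patterns are made-up fish-like feature vectors, and Euclidean distance is used without feature scaling):

```python
from collections import Counter
import math

def knn_classify(x, training, k=3):
    """Assign x the plurality class among its k nearest training patterns.

    training: list of (feature_vector, label) pairs.
    """
    neighbors = sorted(training, key=lambda t: math.dist(t[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "salmon"), ((1.2, 0.9), "salmon"),
            ((3.0, 3.1), "sea_bass"), ((2.9, 3.0), "sea_bass"),
            ((3.2, 2.8), "sea_bass")]

print(knn_classify((1.1, 1.0), training, k=3))  # "salmon"
print(knn_classify((3.0, 3.0), training, k=3))  # "sea_bass"
```

No training phase is needed: all the work happens at classification time, at the cost of storing (and searching) the whole training set.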
Nearest-Neighbor Decision Example
[Figure: classes of training patterns and a nearest-neighbor decision]
References
- Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, 2nd edition, John Wiley & Sons, 2001 (most contents)
- Nils J. Nilsson, Introduction to Machine Learning, http://ai.stanford.edu/people/nilsson/mlbook.html
- http://www.rii.ricoh.com/%7estork/dhs.html
- A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Trans. PAMI, pp. 4-37, Jan. 2000
Other useful references
- Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, NY, 1972
- Therrien, Decision Estimation and Classification, John Wiley & Sons, New York, NY, 1989
- Hertz, Krogh, and Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, MA, 1991
- Bose and Liang, Neural Network Fundamentals with Graphs, Algorithms, and Applications, McGraw-Hill, New York, NY, 1996