EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

Size: px

Start display at page:

Download "EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1"

Damian Pierce
5 years ago
Views:

1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1

2 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle Bayesian Decision Theory Discriminant Functions Decision Regions Minimum Distance Classification Nearest Neighbor Classification Parameter Estimation Decision Boundaries EEL 851 2

3 Introduction What is a Pattern? Object process of event consisting of both deterministic/stochastic components Record of dynamic occurrences influenced by both deterministic and stochastic factors Examples voice, image, characters Kind of Patterns Visual patterns Temporal patterns Logical patterns What is Pattern Class? Set of patterns sharing set of common attributes (or features) Usually originating from the same source EEL 851 3

4 Introduction Feature Relevant characteristics that make patterns apart from each other Data extractable through measurements or processing Examples Patterns Speech waveforms, crystals, textures, weather patterns Features Age, color, height, width Classifications Assigning patterns into classes based on features Noise Distortions associated with pattern processing and/or training samples that effect the classification performance of system EEL 851 4

5 Example Automation in fish packing plant Sort incoming fish on a conveyor according to species using optical sensing Sorting species Sea bass Salmon EEL 851 5

6 Problem Analysis Setup camera Take some sample images Features? Suggested features Length Lightness Width Number and shape of fins Position of mouth, etc.. Explore above suggested features! EEL 851 6

7 Preprocessing Segmentation Isolate fishes from background Isolate fishes from one another Feature extraction Reduce the data by measuring certain features Classifier Evaluates the evidence presented Makes final decision EEL 851 7

8 Preprocessing EEL 851 8

9 Classification Selecting length of fish Possible feature for classification EEL 851 9

10 Classification Length is a poor feature! Select brightness as a possible feature EEL

11 Threshold Decision Boundary Cost Consequences of our decision Equal? Task of decision theory Move the decision boundary towards smaller values of lightness, Cost Reduce number of sea brass classified as salmon EEL

12 Classification Adopt lightness and the width of fish Fish x = [x 1, x 2 ] Lightness Width EEL

13 Classification EEL

14 Classification Our satisfaction Premature Central aim Correctly classify a novel input Issue of Generalization! EEL

15 Classification Syntactic Pattern Recognition Recursive description using method of formal languages Structural Pattern Recognition Derivation of descriptions using formal rules EEL

16 Pattern Recognition Systems Invariant Features Translation Rotation Scale EEL

17 Design Cycle Data Collection Feature Choice Model Choice Training Evaluation Computational Complexity EEL

18 Design Cycle EEL

19 Types of Learning Supervised Learning Teacher Category label or cost for each label in training set Desired input Desired output Goals Produce a correct output given a new input Goals of Supervised Learning Classification Desired output Discrete class labels Regression Desired output Continuous valued Unsupervised Learning The system forms clusters or natural groupings of input pattern Goals Build a model or find useful representations of data Usage Reasoning, finding clusters, dimensionality reduction, decision making, prediction, etc. Data compression, Outlier detection, Help classification EEL

20 Bayesian Decision Theory Assumptions Problem Probabilistic terms Knowledge of relevant probabilities Sea Bass/Salmon example State of nature Random Variable ω ω 1 (sea bass) ω ω 1 (salmon) Priori probability P(ω 1 ) sea bass; P(ω 2 ) salmon P(ω 1 ) + P( ω 2 ) = 1 (exclusivity and exhaustivity) EEL

21 Bayesian Decision Theory Decision Rule Only prior information, Cannot see! Decide ω 1 if P(ω 1 ) > P(ω 2 ) otherwise decide ω 2 More info in most cases Class Conditional Info Class-conditional pdf p(x ω) p(x ω 1 ) and p(x ω 2 ) Difference in lightness between populations of sea and salmon EEL

22 Bayesian Decision Theory Class-conditional pdf x Lightness of fish Normalized EEL

23 Bayes Formula Lightness of fish x, Class? P(ω j x) = p(x ω j ). P (ω j ) / p(x) where in case of two categories p( x) j 2 = = j= 1 p( x ω ) P( ω ) j j Posterior = (Likelihood Prior) / Evidence p(ω j x) Likelihood p(x) Evidence (scale factor) EEL

24 Posterior Probabilities Prior probabilities P(ω 1 ) = 2/3 P(ω 2 ) = 1/3 EEL

25 Bayes Decision Rule Decision given posterior probabilities Observation x if P(ω 1 x) > P(ω 2 x) True state of nature = ω 1 if P(ω 1 x) < P(ω 2 x) True state of nature = ω 2 Probability of error P(error x) = P(ω 1 x) if we decide ω 2 P(error x) = P(ω 2 x) if we decide ω 1 EEL

26 Bayes Decision Rule Minimize probability of error Decide ω 1 if P(ω 1 x) > P(ω 2 x); otherwise decide ω 2 Bayes decision P(error x) = min [P(ω 1 x), P(ω 2 x)] EEL

27 Bayesian Classifier Generalization of Preceding Ideas More than one feature More than two state of nature Actions not only deciding state of nature Loss function More general than probability of error Bayes Decision Rule Input pattern C is classified into class ω k, for a given feature vector x, maximizes the posterior probability; P(ω k x) P(ω j x) for j k Likelihood conditions p(x ω k ) Training data measurements Prior probabilities P(ω k ) Supposed to known within given population of sample patterns p(x) is same for all class alternatives Ignored, normalization factor EEL

28 Loss Function Certain classification errors may be costly than others False Negative may be much more costly than False Positive Examples Fire alarm, Medical diagnosis Ω = {ω 1, ω 1, ω n } Possible set of nature/classes = {α 1,α 2, α n } Possible decisions/actions Loss Function λ(α i ω j ) Loss incurred by taking action α i when true state of nature is ω j Conditional Risk R(α i x) Loss expected by taking action α i when observed evidence is x EEL

29 Minimum Error Rate Classification Action α i is associated with class ω j All errors are equally likely Zero-One Loss Classification decision α i is correct only if state of nature is ω j Symmetrical function 0 λ( αi ω j ) = 1 Conditional Risk R(α i x) = λ(α i ω j ) P(ω j x) = 1 P(ω i x) Minimize Risk for for i = j i j Select the class maximizing the posterior probability Suggested by Bayes Decision Rule, Minimum error rate classification EEL

30 Neyman-Pearson Criterion Minimize overall risk subject to constraints R(α i x) dx < constant Misclassification limited to a frequency Fish example Do not misclassify more than 1% of salmon as bass Minimize the chance of classifying sea bass as salmon EEL

31 Discriminant Functions Classifier Representation Function employed for discriminating among classes g i (x), i = 1,,k Classifier Assign x to class ω i if g i (x) g j (x) for j i Possible Choices g i (x) = P(ω j x) g i (x) = p(x ω j )P(ω j ) g i (x) = ln {p(x ω j )} + ln {P(ω j )} EEL

32 Decision Region Effect of Decision Rule Feature space Decision regions Decision Boundaries Surface in feature space Ties occur among largest discriminant functions EEL

33 Normal Density Univariate Density Analytically tractable, Maximum entropy Continuous-valued density A lot of processes are asymptotically Gaussian Central Limit Theorem Aggregate effect of independent random disturbances Gaussian Many patterns Prototype corrupted by large number of RP P(x) = 1 exp 2π σ 1 x 2 µ σ 2, µ = mean (or expected value) of x σ 2 = expected squared deviation or variance EEL

34 Normal Density Multivariate density Multivariate normal density in d dimensions is: P(x) = (2π) 1 d / 2 Σ 1/ 2 exp 1 (x 2 µ ) t Σ 1 (x µ ) where: x = (x 1, x 2,, x d ) t feature vector µ = (µ 1, µ 2,, µ d ) t mean vector Σ = d*d covariance matrix Σ and Σ -1 are determinant and inverse respectively Σ Shape of Gaussian curve (x - µ) t Σ -1 (x - µ) Mahalanobis distance EEL

35 Normal Density P(x) = (2π) 1 d / 2 Σ 1/ 2 exp 1 (x 2 µ ) t Σ 1 (x µ ) EEL

36 Discriminant Functions Minimum Error Rate Classification g i (x) = ln p(x ω i ) + ln P(ω i ) g (x) = i 1 (x µ 2 i 1 t i ) (x µ i d ) ln2π 2 1 ln 2 Σ i + lnp( ω ) i Case Σ i = σ 2.I Features are statistically independent, same variance σ 2 Diagonal covariance matrix i t i g (x) = w x+ w 1 i0 1 t w i = 2 µ i wi0 = µ iµ i + ln P( ωi ) 2 2 σ σ Linear machines Decision surface Pieces of hyperplanes g i (x) = g j (x) EEL

37 Discriminant Functions EEL

38 Discriminant Functions EEL

39 Discriminant Functions Case Σ i = Σ Covariance matrices of all classes Identical but arbitrary g 1 t 1 x) = (x- µ i) Σ (x- µ i) + ln P( ω ) 2 i ( i Hyperplane separating R i and R j [ P( ω )/ P( ω )] 1 ln i j x 0 = ( µ i + µ j ).( µ i µ j ) 2 t 1 ( µ µ ) Σ ( µ µ ) i j i j Hyperplane separating R i and R j Not orthogonal to the line between the means EEL

40 Discriminant Functions EEL

41 Discriminant Functions EEL

42 EEL Case Σ i = arbitrary The covariance matrices are different for each category Hyperquadrics Hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids ) ( ln ln w w 2 1 W : ) ( i 1 i 0 i i i i t i i i i i i t i i t i P where w x w W x x x g ω µ µ µ + Σ Σ = = Σ Σ = = + = Discriminant Functions

43 Discriminant Functions EEL

44 Decision Boundary - Example Compute Bayes decision boundary Gaussian Distributions Means and Covariance's Discrete versions 1/ 2 0 µ 1 = = 6 µ 2 3 = = x 2 = x x1 EEL

45 Supervised Learning Parametric Approaches Bayesian parameter estimation Maximum likelihood estimation Estimation Problem Estimate P(ω j ) Estimate p(ω j x) Tough High dimensional feature spaces, small number of training samples Simplifying Assumptions Feature Independence Independently drawn samples I. I. D. model Assume that p(ω j x) is Gaussian Estimation problem Parameters of normality EEL

46 Estimation Methods Bayesian Estimation (MAP) Distribution parameters are random values that follow a known (i.e. Gaussian) distribution Behavior of training data helps in revising parameter values Large training samples Better chances of refining posterior probabilities (parameter peaking) Maximum-Likelihood (ML) Estimation Parameters of probabilistic distributions are fixed but unknown values Parameters Unknown constants, Identify using training data Best estimates of parameter values Class-conditional probabilities are maximized over the available samples EEL

47 Parametric Approaches Curse of dimensionality Estimate the parameters known distribution Smaller number of samples Some priori information EEL

48 Non-parametric Approaches Parzen window pdf estimation (KDE) Estimate p(ω j x) directly from sample patterns K n nearest-neighbor Directly construct the decision boundary based on training data EEL

49 Statistical Approaches EEL

50 Nearest-Neighbor Methods Statistical, nearest-neighbor or memory-based methods k-nearest-neighbor New pattern category Plurality of its k closest neighbor Large k Decreases the chance of undue influence by noisy training pattern Small k Reduces acuity of method Distance metric usually Euclidean In practice features are scaled a j Advantage Does not require training EEL

51 Nearest-Neighbor Decision Example Class of training patterns 4 EEL

52 References Richard O. Duda, Peter E. Hart, and David G.Stork, Pattern Classification, 2 nd edition,john Wiley & Sons, 2001 (most contents) Nils J. Nilsson, Introduction to Machine Learning, A. K. Jain, R. P. W. Duin, and J. Mao, Statistical Pattern Recognition: A Review, IEEE Trans PAMI, pp. 4-37, Jan Other Useful References Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, NY, Therrien, Decision Estimation and Classification, John-Wiley & sons, New York, NY, Hertz, Krogh, and Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, MA, Bose and Liang, Neural Network Fundamentals with Graphs, Algorithms, and Applications, McGrawll-Hill, New York, NY, 1996 EEL

Bayesian Decision Theory

Bayesian Decision Theory Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent University) 1 / 46 Bayesian