Pattern Recognition II
Michal Haindl
Faculty of Information Technology, KTI, Czech Technical University in Prague
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, Czech Republic
MI-ROZ 2011-2012/Z, January 16, 2012
European Social Fund, Prague & EU: Investing in your future

Outline - PR Basic Concepts
- Notation
- Notions
- Representation
- Classification

Notation (illustrated in the code sketch below)
- S ... pattern space
- X ... feature vector, X = [x_1, ..., x_l]
- l = dim{X} ... number of features
- {X} ... feature space
- K ... number of classes
- ω_i ... class indicator, Ω = {ω_1, ..., ω_K}
- g(X) ... discriminant function
- H ... decision boundary
- n_i = card{T_i}
- T_i ... training set for class i
- T_i ... test set for class i
- µ_i ... mean value
- θ_i ... parameters of p(X | ω_i)
- Σ_i ... covariance matrix

PR Basic Concepts
- set of patterns S = {p_1, p_2, ...} (pattern space)
- pattern recognition: a partition S = ∪_{k=1}^K S_k with S_i ∩ S_j = ∅ for i ≠ j
- classification - the assignment of an object (pattern) to one of several prespecified categories Ω = {ω_1, ..., ω_K}
- Repeated observation of the same pattern should produce the same class.
- Two different patterns should give rise to two different classes.
- A slight distortion of a pattern should produce a small displacement of its representation.
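The following minimal sketch only illustrates the notation above on hypothetical synthetic data (all values, class locations and names are invented); it computes n_i, µ_i and Σ_i for each class training set T_i.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2                                       # number of classes
l = 2                                       # number of features, l = dim{X}

# hypothetical training sets T_i, one array of feature vectors X per class
T = {
    1: rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, l)),   # class omega_1
    2: rng.normal(loc=[3.0, 3.0], scale=1.0, size=(60, l)),   # class omega_2
}

for i, T_i in T.items():
    n_i = len(T_i)                          # n_i = card{T_i}
    mu_i = T_i.mean(axis=0)                 # mean value mu_i
    Sigma_i = np.cov(T_i, rowvar=False)     # covariance matrix Sigma_i
    print(f"omega_{i}: n_i={n_i}, mu_i={mu_i.round(2)}")
```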

Notions 2
- classifier - a set of decision rules that partition a feature space into disjoint subspaces (sorts patterns into categories or classes)
- sequential classifier - a K-class problem solved as K - 1 two-class problems
- hierarchical classifier - decision rule in tree form; each terminal node contains the assigned class
- parallel classifier - a K-class problem solved as K(K - 1)/2 two-class problems
- separable classes - their class regions do not overlap
- decision rule - assigns one class on the basis of the unit's features

Notions 3
- discriminant function - a scalar function g(X) whose domain is usually the measurement space and whose range is usually the real numbers
- selection of g(X): X → ω_i if g_i(X) > g_j(X), j = 1, ..., K; j ≠ i
- assign X into class ω_j if g_j(X) = max_k {g_k(X)}

Discriminant Function
- the discriminant function is not unique (multiply or add C > 0, or replace by f(g_i(X)) with f a monotonically increasing function)
- e.g. identical minimum-error-rate discriminant functions (checked numerically in the sketch below):
  g_j(X) = P(ω_j | X) = p(X | ω_j) P(ω_j) / Σ_{i=1}^K p(X | ω_i) P(ω_i)
  g_j(X) = p(X | ω_j) P(ω_j)
  g_j(X) = log p(X | ω_j) + log P(ω_j)
- linear discriminant function: g(X) = a_0 + Σ_{j=1}^l a_j x_j

Notions 4
- decision boundary between ω_i, ω_j: H = {X ∈ M : g_i(X) = g_j(X)}; on H the classification is not uniquely defined
- hyperplane decision boundary - the decision boundary of a linear discriminant function (linear decision rule):
  (a_{i,0} - a_{j,0}) + Σ_{k=1}^l (a_{i,k} - a_{j,k}) x_k = 0
- non-linear decision boundary, e.g. piecewise linear:
  g_j(X) = max_{i=1,...,n_j} {}^i g_j(X),   {}^i g_j(X) = {}^i a_0 + Σ_k {}^i a_k x_k
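A small numerical check of the equivalence stated above, assuming Gaussian class-conditional densities with hypothetical parameters: the posterior, the product p(X | ω_j) P(ω_j), and its logarithm all select the same class.

```python
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical two-class problem with Gaussian class-conditional densities
priors = np.array([0.4, 0.6])                       # P(omega_j)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), 1.5 * np.eye(2)]

X = np.array([1.0, 0.5])                            # pattern to classify

lik = np.array([multivariate_normal.pdf(X, m, C) for m, C in zip(means, covs)])

g_posterior = lik * priors / np.sum(lik * priors)   # g_j(X) = P(omega_j | X)
g_product = lik * priors                            # g_j(X) = p(X|omega_j) P(omega_j)
g_log = np.log(lik) + np.log(priors)                # g_j(X) = log p(X|omega_j) + log P(omega_j)

# all three equivalent discriminant functions select the same class
assert g_posterior.argmax() == g_product.argmax() == g_log.argmax()
print("assigned class: omega_%d" % (g_product.argmax() + 1))
```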

Notions 5
- feature vector - functions of the initial measurement variables of an object (pattern), or some subset of the initial measurement pattern variables; the input for the classifier, X = [x_1, ..., x_l]
- feature space {X}
- feature selection - determination of the most discriminative pattern measurements (features)
- feature extraction - a mapping from the original l-dimensional measurement space into an l'-dimensional feature space, l' < l

Notions 6
- dichotomy decision rule - K = 2, g(X) = g_1(X) - g_2(X)
- supervised classification - T = ∪_{k=1}^K T_k, training set partition known
- unsupervised classification - T, training set partition unknown
  - number of classes K known
  - number of classes K unknown

Notions 7
- quadratic discriminant function:
  g(X) = a_0 + Σ_{i=1}^l Σ_{j=i}^l a_{ij} x_i x_j + Σ_{j=1}^l a_j x_j
- general discriminant function - two equivalent options:
  1. non-linear discriminant function, e.g. g(X) = a_1 ln(x_1) + a_2 x_3^2 (+ linearisation)
  2. non-linear mapping Φ : S^n → S^m combined with a linear classifier, Φ(X) = [Φ_1(X), ..., Φ_m(X)], g_j(X) = Σ_{k=1}^m a_{j,k} Φ_k(X)

Notions 8
- ground truth - known classification of some patterns
- training set - T = {(X_j, ω_j)}
- test set - T = {(X_j, ω_j)}
- a classifier is learning if its iterative training procedure increases the classification accuracy after every few iterations
- parametric learning - the discriminant function (the conditional density p(X | ω_i)) is assumed to be known except for some unknown parameters, e.g. the multivariate normal density
  p(X | ω_i) = (2π)^{-l/2} |Σ_i|^{-1/2} exp{ -1/2 (X - µ_i)^T Σ_i^{-1} (X - µ_i) }
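A minimal sketch of parametric learning under the multivariate normal model above: the unknown parameters µ_i, Σ_i are estimated from a hypothetical training set and plugged into the density formula (the data and numbers are invented for illustration).

```python
import numpy as np

def gaussian_density(X, mu, Sigma):
    """p(X | omega_i) for the multivariate normal model above."""
    l = len(mu)
    diff = X - mu
    norm = (2 * np.pi) ** (-l / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# hypothetical labelled training samples for one class
rng = np.random.default_rng(1)
T_i = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=200)

mu_hat = T_i.mean(axis=0)               # estimate of mu_i
Sigma_hat = np.cov(T_i, rowvar=False)   # estimate of Sigma_i

print(gaussian_density(np.array([1.0, -1.0]), mu_hat, Sigma_hat))
```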

Notions 9
- nonparametric learning - no special functional form of the conditional probability distribution p(X | ω_i) is assumed (distribution-free), e.g. the nearest-neighbour rule or an LS fit to an empirical cumulative distribution (see the 1-NN sketch below)
- supervised learning - learning from a labelled training set
- unsupervised learning (learning without a teacher) - learning from an unlabelled training set
- adaptive learning - learning in a changing environment

Notions 10 - Environment
- system - an object with inputs u(t) and outputs y(t)
- dynamic system: ẋ(t) = F_1(x(t), u(t)) (differential state equation), y(t) = F_2(x(t), u(t)) (algebraic output equation)
- automaton - a system with a countable number of states
- pattern - an object whose inputs/outputs are meaningless
- identification - mathematical model building
- identifier - e.g. an adaptive / learning system
- system control - appropriate input signals to guarantee the required output
- controller (regulator)
- pattern recognition: pattern → representation X → class ω_i

Representation
- object representation:
  - featured - numerical, e.g. X = [1.2, 23.34]
  - syntactic (structural) - symbolic, X = [, ]

Featured Representation
- feature vector (given object measurements, an object image): X = [x_1, ..., x_l]^T, X ∈ R^l
- statistical PR - the measurement patterns have the form of a vector; each component is a measurement of a particular quality, property, or characteristic of the unit
- syntactic PR - the measurement patterns have the form of sentences from the language of a phrase-structure grammar
- structural PR - the measured unit is encoded in terms of its parts, their mutual relationships and their properties
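As an example of the distribution-free case mentioned under Notions 9, here is a minimal sketch of the nearest-neighbour (1-NN) rule; the labelled training samples are hypothetical.

```python
import numpy as np

def nearest_neighbour(X, train_X, train_omega):
    """Assign X the class of its nearest labelled training sample (1-NN rule)."""
    d = np.linalg.norm(train_X - X, axis=1)    # Euclidean distances to all samples
    return train_omega[np.argmin(d)]

# hypothetical labelled training set T = {(X_j, omega_j)}
train_X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.1], [2.8, 3.4]])
train_omega = np.array([1, 1, 2, 2])

print(nearest_neighbour(np.array([2.9, 2.9]), train_X, train_omega))   # -> 2
```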

Syntactic Representation
- formal language - analogous representation
- terminal symbol - an elementary primitive property
- word (a chain of terminal symbols) - pattern representation
- formal language - a set of patterns, not unique for a given pattern set
- substitution rules - rules for generating words from terminal symbols
- grammar - substitution rules + a set of terminal symbols, usually recursive; each class ω_i corresponds to a grammar
- syntactic analysis (pattern recognition) - the assignment of an object (word X) to one of several prespecified grammars (a toy sketch follows below)

Statistical PR (PR difficulty increases from the parametric, labelled case to the nonparametric, unlabelled case)
- p(X | ω_i) known - parametric case
  - labeled training samples
  - unlabeled training samples
- p(X | ω_i) unknown - nonparametric case
  - labeled training samples
  - unlabeled training samples

Classification steps:
1. data capturing
2. preprocessing (data normalisation)
3. training & test set selection
4. feature selection
5. classification
6. postprocessing - thematic map filtering
7. probability of error estimation
8. interpretation

Supervised Classification
- statistical - based on decision theory, assumed knowledge of the conditional pdf p(X | ω_j)
- deterministic - no assumptions about the conditional pdf, deterministic discrimination function
Statistical PR steps:
1. determination of the number of classes of interest K
2. training set selection and its partition into single-class subsets T = ∪_{k=1}^K T_k
3. determination of a classifier
4. feature selection
5. estimation of classifier parameters from training set data
6. classification of new patterns
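A toy illustration of syntactic analysis: here regular expressions stand in for the prespecified phrase-structure grammars, and the terminal symbols and grammars are entirely hypothetical.

```python
import re

# hypothetical grammars over terminal symbols 'a' and 'b' (elementary primitives)
grammars = {
    "omega_1": re.compile(r"^(ab)+$"),   # class 1: alternating primitives
    "omega_2": re.compile(r"^a+b+$"),    # class 2: a run of a's followed by b's
}

def syntactic_analysis(word):
    """Assign the word (terminal-symbol chain) to the first grammar that generates it."""
    for cls, grammar in grammars.items():
        if grammar.match(word):
            return cls
    return "reject"

print(syntactic_analysis("ababab"))   # -> omega_1
print(syntactic_analysis("aaabbb"))   # -> omega_2
```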

Deterministic Classification
- geometric classification problem - determination of decision hypersurfaces
- knowledge of p(X | ω_j) not required
- determination of decision hypersurfaces is feasible in practice only for hyperplanes
- easy implementation

Geometric Classification
- linear discriminant function, w_0 the threshold weight (-w_0 < W^T X ⇒ X ∈ S_1):
  g(X) = w_0 + W^T X = a^T y
- K = 2: g(X) > 0 ⇒ X ∈ S_1, otherwise X ∈ S_2
- g(X) = 0 defines the decision surface, y = [1, x_1, ..., x_n]^T
- training samples {^1y_j : ^1y_j ∈ T_1}, {^2y_j : ^2y_j ∈ T_2}
- learning - find a such that a^T y_j > 0 for all ^1y_j and all ^2y_j (for class 2 the condition is a^T(-^2y_j) > 0)
- the solution is not unique; a margin has to be introduced to avoid convergence to a limit point on the boundary, i.e. a^T y_j ≥ b > 0
- minimisation of the criterion function J(a, b) = 1/2 ‖Ya - b‖^2, where Y = [^1y_1, ..., ^2y_1, ...] is an m × (n+1) matrix, subject to the constraint b > 0

Gradient Descent Procedure
- ∇_a J = Y^T (Ya - b)
- ∇_b J = -(Ya - b)
- for a given b: a = (Y^T Y)^{-1} Y^T b

Ho-Kashyap Algorithm (determination of b; a NumPy sketch follows below)
1. b(0) > 0, otherwise arbitrary
2. a(i) = (Y^T Y)^{-1} Y^T b(i)
3. b(i+1) = b(i) + ρ [e(i) + |e(i)|]  (thresholding: only the positive error components increase b)
where e(i) = Ya(i) - b(i) = -∇_b J(i) and 0 < ρ < 1;
convergence in a finite number of steps for linearly separable training samples

Unsupervised Classification
- training set partition unknown
- statistical - estimation of unknown quantities in the mixture density p(X) = Σ_{k=1}^K P(ω_k) p(X | ω_k) from T
  - if some of {K, P(ω_k), p(X | ω_k), k = 1, ..., K} are unknown, no general solution is known
  - for known {K, P(ω_k)} and a known form of p(X | ω_k) a solution exists
- deterministic - no assumptions about the pdf, deterministic discrimination function based on similarity measures, clustering
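A minimal NumPy sketch of the Ho-Kashyap iteration above, run on a hypothetical linearly separable two-class set; as in the learning condition a^T(-^2y_j) > 0, samples of class 2 enter with negated sign. This is only an illustration, not the lecture's reference implementation.

```python
import numpy as np

def ho_kashyap(X1, X2, rho=0.5, n_iter=100):
    """Seek a weight vector a with Y a = b > 0 for the sign-normalised augmented samples."""
    # augmented vectors y = [1, x_1, ..., x_n]^T; class-2 samples are negated
    Y = np.vstack([np.hstack([np.ones((len(X1), 1)), X1]),
                   -np.hstack([np.ones((len(X2), 1)), X2])])
    b = np.ones(len(Y))                # b(0) > 0, otherwise arbitrary
    Y_pinv = np.linalg.pinv(Y)         # equals (Y^T Y)^{-1} Y^T for full column rank
    for _ in range(n_iter):
        a = Y_pinv @ b                 # a(i) = (Y^T Y)^{-1} Y^T b(i)
        e = Y @ a - b                  # e(i) = Y a(i) - b(i)
        b = b + rho * (e + np.abs(e))  # only positive error components increase b
    return a

X1 = np.array([[0.0, 0.0], [0.2, 0.5], [0.5, 0.1]])   # class S_1
X2 = np.array([[2.0, 2.0], [2.5, 1.8], [1.8, 2.4]])   # class S_2
a = ho_kashyap(X1, X2)
print(a)   # g(X) = a^T [1, x_1, x_2]: positive for S_1, negative for S_2
```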

Unsupervised Classification
- (determination of the number of classes of interest K)
- training set selection; its partition into single-class subsets T = ∪_{k=1}^K T_k is unknown
- determination of a classifier
- feature selection
- estimation of classifier parameters from training set data
- classification of new patterns
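To make the deterministic, similarity-based route (clustering) mentioned under Unsupervised Classification concrete, here is a minimal k-means sketch on a hypothetical unlabelled training set with K assumed known; the data and parameters are invented for illustration.

```python
import numpy as np

def k_means(X, K, n_iter=50, seed=0):
    """Partition unlabelled samples into K clusters by nearest-centre assignment."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # initial cluster centres
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                            # nearest-centre decision rule
        centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centres[k] for k in range(K)])
    return labels, centres

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, size=(40, 2)),        # hypothetical unlabelled
               rng.normal([4, 4], 0.5, size=(40, 2))])       # training set T
labels, centres = k_means(X, K=2)
print(centres.round(2))
```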