Pattern Recognition II
Michal Haindl
Faculty of Information Technology, KTI, Czech Technical University in Prague
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, Czech Republic
MI-ROZ 2011-2012/Z, January 16, 2012
European Social Fund, Prague & EU: Investing in your future

Outline - PR Basic Concepts
- Notation
- Notions
- Representation
- Classification

Notation (illustrated in the code sketch below)
- S ... pattern space
- X ... feature vector, X = [x_1, ..., x_l]
- l = dim{X} ... number of features
- {X} ... feature space
- K ... number of classes
- ω_i ... class indicator, Ω = {ω_1, ..., ω_K}
- g(X) ... discriminant function
- H ... decision boundary
- n_i = card{T_i}
- T_i ... training set for class i
- T_i ... test set for class i
- µ_i ... mean value
- θ_i ... parameters of p(X | ω_i)
- Σ_i ... covariance matrix

PR Basic Concepts
- set of patterns S = {p_1, p_2, ...} (pattern space)
- pattern recognition: a partition S = ∪_{k=1}^K S_k with S_i ∩ S_j = ∅ for i ≠ j
- classification - the assignment of an object (pattern) to one of several prespecified categories Ω = {ω_1, ..., ω_K}
- Repeated observation of the same pattern should produce the same class.
- Two different patterns should give rise to two different classes.
- A slight distortion of a pattern should produce a small displacement of its representation.
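The following minimal sketch only illustrates the notation above on hypothetical synthetic data (all values, class locations and names are invented); it computes n_i, µ_i and Σ_i for each class training set T_i.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2                                       # number of classes
l = 2                                       # number of features, l = dim{X}

# hypothetical training sets T_i, one array of feature vectors X per class
T = {
    1: rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, l)),   # class omega_1
    2: rng.normal(loc=[3.0, 3.0], scale=1.0, size=(60, l)),   # class omega_2
}

for i, T_i in T.items():
    n_i = len(T_i)                          # n_i = card{T_i}
    mu_i = T_i.mean(axis=0)                 # mean value mu_i
    Sigma_i = np.cov(T_i, rowvar=False)     # covariance matrix Sigma_i
    print(f"omega_{i}: n_i={n_i}, mu_i={mu_i.round(2)}")
```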

Notions 2
- classifier - a set of decision rules that partition a feature space into disjoint subspaces (sorts patterns into categories or classes)
- sequential classifier - a K-class problem solved as K - 1 two-class problems
- hierarchical classifier - decision rule in tree form; each terminal node contains the assigned class
- parallel classifier - a K-class problem solved as K(K - 1)/2 two-class problems
- separable classes - their class regions do not overlap
- decision rule - assigns one class on the basis of the unit's features

Notions 3
- discriminant function - a scalar function g(X) whose domain is usually the measurement space and whose range is usually the real numbers
- selection of g(X): X → ω_i if g_i(X) > g_j(X), j = 1, ..., K; j ≠ i
- assign X into class ω_j if g_j(X) = max_k {g_k(X)}

Discriminant Function
- the discriminant function is not unique (multiply or add C > 0, or replace by f(g_i(X)) with f a monotonically increasing function)
- e.g. identical minimum-error-rate discriminant functions (checked numerically in the sketch below):
  g_j(X) = P(ω_j | X) = p(X | ω_j) P(ω_j) / Σ_{i=1}^K p(X | ω_i) P(ω_i)
  g_j(X) = p(X | ω_j) P(ω_j)
  g_j(X) = log p(X | ω_j) + log P(ω_j)
- linear discriminant function: g(X) = a_0 + Σ_{j=1}^l a_j x_j

Notions 4
- decision boundary between ω_i, ω_j: H = {X ∈ M : g_i(X) = g_j(X)}; on H the classification is not uniquely defined
- hyperplane decision boundary - the decision boundary of a linear discriminant function (linear decision rule):
  (a_{i,0} - a_{j,0}) + Σ_{k=1}^l (a_{i,k} - a_{j,k}) x_k = 0
- non-linear decision boundary, e.g. piecewise linear:
  g_j(X) = max_{i=1,...,n_j} {}^i g_j(X),   {}^i g_j(X) = {}^i a_0 + Σ_k {}^i a_k x_k
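A small numerical check of the equivalence stated above, assuming Gaussian class-conditional densities with hypothetical parameters: the posterior, the product p(X | ω_j) P(ω_j), and its logarithm all select the same class.

```python
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical two-class problem with Gaussian class-conditional densities
priors = np.array([0.4, 0.6])                       # P(omega_j)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), 1.5 * np.eye(2)]

X = np.array([1.0, 0.5])                            # pattern to classify

lik = np.array([multivariate_normal.pdf(X, m, C) for m, C in zip(means, covs)])

g_posterior = lik * priors / np.sum(lik * priors)   # g_j(X) = P(omega_j | X)
g_product = lik * priors                            # g_j(X) = p(X|omega_j) P(omega_j)
g_log = np.log(lik) + np.log(priors)                # g_j(X) = log p(X|omega_j) + log P(omega_j)

# all three equivalent discriminant functions select the same class
assert g_posterior.argmax() == g_product.argmax() == g_log.argmax()
print("assigned class: omega_%d" % (g_product.argmax() + 1))
```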

Notions 5
- feature vector - functions of the initial measurement variables of an object (pattern), or some subset of the initial measurement pattern variables; the input for the classifier, X = [x_1, ..., x_l]
- feature space {X}
- feature selection - determination of the most discriminative pattern measurements (features)
- feature extraction - a mapping from the original l-dimensional measurement space into an l'-dimensional feature space, l' < l

Notions 6
- dichotomy decision rule - K = 2, g(X) = g_1(X) - g_2(X)
- supervised classification - T = ∪_{k=1}^K T_k, training set partition known
- unsupervised classification - T, training set partition unknown
  - number of classes K known
  - number of classes K unknown

Notions 7
- quadratic discriminant function:
  g(X) = a_0 + Σ_{i=1}^l Σ_{j=i}^l a_{ij} x_i x_j + Σ_{j=1}^l a_j x_j
- general discriminant function - two equivalent options:
  1. non-linear discriminant function, e.g. g(X) = a_1 ln(x_1) + a_2 x_3^2 (+ linearisation)
  2. non-linear mapping Φ : S^n → S^m combined with a linear classifier, Φ(X) = [Φ_1(X), ..., Φ_m(X)], g_j(X) = Σ_{k=1}^m a_{j,k} Φ_k(X)

Notions 8
- ground truth - known classification of some patterns
- training set - T = {(X_j, ω_j)}
- test set - T = {(X_j, ω_j)}
- a classifier is learning if its iterative training procedure increases the classification accuracy after every few iterations
- parametric learning - the discriminant function (the conditional density p(X | ω_i)) is assumed to be known except for some unknown parameters, e.g. the multivariate normal density
  p(X | ω_i) = (2π)^{-l/2} |Σ_i|^{-1/2} exp{ -1/2 (X - µ_i)^T Σ_i^{-1} (X - µ_i) }
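A minimal sketch of parametric learning under the multivariate normal model above: the unknown parameters µ_i, Σ_i are estimated from a hypothetical training set and plugged into the density formula (the data and numbers are invented for illustration).

```python
import numpy as np

def gaussian_density(X, mu, Sigma):
    """p(X | omega_i) for the multivariate normal model above."""
    l = len(mu)
    diff = X - mu
    norm = (2 * np.pi) ** (-l / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# hypothetical labelled training samples for one class
rng = np.random.default_rng(1)
T_i = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=200)

mu_hat = T_i.mean(axis=0)               # estimate of mu_i
Sigma_hat = np.cov(T_i, rowvar=False)   # estimate of Sigma_i

print(gaussian_density(np.array([1.0, -1.0]), mu_hat, Sigma_hat))
```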

Notions 9
- nonparametric learning - no special functional form of the conditional probability distribution p(X | ω_i) is assumed (distribution-free), e.g. the nearest-neighbour rule or an LS fit to an empirical cumulative distribution (see the 1-NN sketch below)
- supervised learning - learning from a labelled training set
- unsupervised learning (learning without a teacher) - learning from an unlabelled training set
- adaptive learning - learning in a changing environment

Notions 10 - Environment
- system - an object with inputs u(t) and outputs y(t)
- dynamic system: ẋ(t) = F_1(x(t), u(t)) (differential state equation), y(t) = F_2(x(t), u(t)) (algebraic output equation)
- automaton - a system with a countable number of states
- pattern - an object whose inputs/outputs are meaningless
- identification - mathematical model building
- identifier - e.g. an adaptive / learning system
- system control - appropriate input signals to guarantee the required output
- controller (regulator)
- pattern recognition: pattern → representation X → class ω_i

Representation
- object representation:
  - featured - numerical, e.g. X = [1.2, 23.34]
  - syntactic (structural) - symbolic, X = [, ]

Featured Representation
- feature vector (given object measurements, an object image): X = [x_1, ..., x_l]^T, X ∈ R^l
- statistical PR - the measurement patterns have the form of a vector; each component is a measurement of a particular quality, property, or characteristic of the unit
- syntactic PR - the measurement patterns have the form of sentences from the language of a phrase-structure grammar
- structural PR - the measured unit is encoded in terms of its parts, their mutual relationships and their properties
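As an example of the distribution-free case mentioned under Notions 9, here is a minimal sketch of the nearest-neighbour (1-NN) rule; the labelled training samples are hypothetical.

```python
import numpy as np

def nearest_neighbour(X, train_X, train_omega):
    """Assign X the class of its nearest labelled training sample (1-NN rule)."""
    d = np.linalg.norm(train_X - X, axis=1)    # Euclidean distances to all samples
    return train_omega[np.argmin(d)]

# hypothetical labelled training set T = {(X_j, omega_j)}
train_X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.1], [2.8, 3.4]])
train_omega = np.array([1, 1, 2, 2])

print(nearest_neighbour(np.array([2.9, 2.9]), train_X, train_omega))   # -> 2
```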

Syntactic Representation
- formal language - analogous representation
- terminal symbol - an elementary primitive property
- word (a chain of terminal symbols) - pattern representation
- formal language - a set of patterns, not unique for a given pattern set
- substitution rules - rules for generating words from terminal symbols
- grammar - substitution rules + a set of terminal symbols, usually recursive; each class ω_i corresponds to a grammar
- syntactic analysis (pattern recognition) - the assignment of an object (word X) to one of several prespecified grammars (a toy sketch follows below)

Statistical PR (PR difficulty increases from the parametric, labelled case to the nonparametric, unlabelled case)
- p(X | ω_i) known - parametric case
  - labeled training samples
  - unlabeled training samples
- p(X | ω_i) unknown - nonparametric case
  - labeled training samples
  - unlabeled training samples

Classification steps:
1. data capturing
2. preprocessing (data normalisation)
3. training & test set selection
4. feature selection
5. classification
6. postprocessing - thematic map filtering
7. probability of error estimation
8. interpretation

Supervised Classification
- statistical - based on decision theory, assumed knowledge of the conditional pdf p(X | ω_j)
- deterministic - no assumptions about the conditional pdf, deterministic discrimination function
Statistical PR steps:
1. determination of the number of classes of interest K
2. training set selection and its partition into single-class subsets T = ∪_{k=1}^K T_k
3. determination of a classifier
4. feature selection
5. estimation of classifier parameters from training set data
6. classification of new patterns
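A toy illustration of syntactic analysis: here regular expressions stand in for the prespecified phrase-structure grammars, and the terminal symbols and grammars are entirely hypothetical.

```python
import re

# hypothetical grammars over terminal symbols 'a' and 'b' (elementary primitives)
grammars = {
    "omega_1": re.compile(r"^(ab)+$"),   # class 1: alternating primitives
    "omega_2": re.compile(r"^a+b+$"),    # class 2: a run of a's followed by b's
}

def syntactic_analysis(word):
    """Assign the word (terminal-symbol chain) to the first grammar that generates it."""
    for cls, grammar in grammars.items():
        if grammar.match(word):
            return cls
    return "reject"

print(syntactic_analysis("ababab"))   # -> omega_1
print(syntactic_analysis("aaabbb"))   # -> omega_2
```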

Deterministic Classification
- geometric classification problem - determination of decision hypersurfaces
- knowledge of p(X | ω_j) not required
- determination of decision hypersurfaces is feasible in practice only for hyperplanes
- easy implementation

Geometric Classification
- linear discriminant function, w_0 the threshold weight (-w_0 < W^T X ⇒ X ∈ S_1):
  g(X) = w_0 + W^T X = a^T y
- K = 2: g(X) > 0 ⇒ X ∈ S_1, otherwise X ∈ S_2
- g(X) = 0 defines the decision surface, y = [1, x_1, ..., x_n]^T
- training samples {^1y_j : ^1y_j ∈ T_1}, {^2y_j : ^2y_j ∈ T_2}
- learning - find a such that a^T y_j > 0 for all ^1y_j and all ^2y_j (for class 2 the condition is a^T(-^2y_j) > 0)
- the solution is not unique; a margin has to be introduced to avoid convergence to a limit point on the boundary, i.e. a^T y_j ≥ b > 0
- minimisation of the criterion function J(a, b) = 1/2 ‖Ya - b‖^2, where Y = [^1y_1, ..., ^2y_1, ...] is an m × (n+1) matrix, subject to the constraint b > 0

Gradient Descent Procedure
- ∇_a J = Y^T (Ya - b)
- ∇_b J = -(Ya - b)
- for a given b: a = (Y^T Y)^{-1} Y^T b

Ho-Kashyap Algorithm (determination of b; a NumPy sketch follows below)
1. b(0) > 0, otherwise arbitrary
2. a(i) = (Y^T Y)^{-1} Y^T b(i)
3. b(i+1) = b(i) + ρ [e(i) + |e(i)|]  (thresholding: only the positive error components increase b)
where e(i) = Ya(i) - b(i) = -∇_b J(i) and 0 < ρ < 1;
convergence in a finite number of steps for linearly separable training samples

Unsupervised Classification
- training set partition unknown
- statistical - estimation of unknown quantities in the mixture density p(X) = Σ_{k=1}^K P(ω_k) p(X | ω_k) from T
  - if some of {K, P(ω_k), p(X | ω_k), k = 1, ..., K} are unknown, no general solution is known
  - for known {K, P(ω_k)} and a known form of p(X | ω_k) a solution exists
- deterministic - no assumptions about the pdf, deterministic discrimination function based on similarity measures, clustering
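A minimal NumPy sketch of the Ho-Kashyap iteration above, run on a hypothetical linearly separable two-class set; as in the learning condition a^T(-^2y_j) > 0, samples of class 2 enter with negated sign. This is only an illustration, not the lecture's reference implementation.

```python
import numpy as np

def ho_kashyap(X1, X2, rho=0.5, n_iter=100):
    """Seek a weight vector a with Y a = b > 0 for the sign-normalised augmented samples."""
    # augmented vectors y = [1, x_1, ..., x_n]^T; class-2 samples are negated
    Y = np.vstack([np.hstack([np.ones((len(X1), 1)), X1]),
                   -np.hstack([np.ones((len(X2), 1)), X2])])
    b = np.ones(len(Y))                # b(0) > 0, otherwise arbitrary
    Y_pinv = np.linalg.pinv(Y)         # equals (Y^T Y)^{-1} Y^T for full column rank
    for _ in range(n_iter):
        a = Y_pinv @ b                 # a(i) = (Y^T Y)^{-1} Y^T b(i)
        e = Y @ a - b                  # e(i) = Y a(i) - b(i)
        b = b + rho * (e + np.abs(e))  # only positive error components increase b
    return a

X1 = np.array([[0.0, 0.0], [0.2, 0.5], [0.5, 0.1]])   # class S_1
X2 = np.array([[2.0, 2.0], [2.5, 1.8], [1.8, 2.4]])   # class S_2
a = ho_kashyap(X1, X2)
print(a)   # g(X) = a^T [1, x_1, x_2]: positive for S_1, negative for S_2
```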

Unsupervised Classification
- (determination of the number of classes of interest K)
- training set selection; its partition into single-class subsets T = ∪_{k=1}^K T_k is unknown
- determination of a classifier
- feature selection
- estimation of classifier parameters from training set data
- classification of new patterns
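To make the deterministic, similarity-based route (clustering) mentioned under Unsupervised Classification concrete, here is a minimal k-means sketch on a hypothetical unlabelled training set with K assumed known; the data and parameters are invented for illustration.

```python
import numpy as np

def k_means(X, K, n_iter=50, seed=0):
    """Partition unlabelled samples into K clusters by nearest-centre assignment."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # initial cluster centres
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                            # nearest-centre decision rule
        centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centres[k] for k in range(K)])
    return labels, centres

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, size=(40, 2)),        # hypothetical unlabelled
               rng.normal([4, 4], 0.5, size=(40, 2))])       # training set T
labels, centres = k_means(X, K=2)
print(centres.round(2))
```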