Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation


Lecture 15. Pattern Classification (I): Statistical Formulation
(C) 2001 by Yu Hen Hu

Outline
- Statistical Pattern Recognition
- Maximum Posterior Probability (MAP) Classifier
- Maximum Likelihood (ML) Classifier
- k-Nearest Neighbor Classifier
- MLP Classifier

An Example
Consider classifying eggs into 3 categories with labels: medium, large, or jumbo. The classification is based on the weight W and length L of each egg.
Decision rules:
1. If W < 10 g and L < 3 cm, then the egg is medium.
2. If W > 20 g and L > 5 cm, then the egg is jumbo.
3. Otherwise, the egg is large.
Three components in a pattern classifier: the category (target) label, the features, and the decision rule.
[Figure: eggs plotted in the W-L feature plane.]
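A minimal sketch of these decision rules in Python; the thresholds come from the example above, and the function name is just illustrative:

```python
def classify_egg(weight_g: float, length_cm: float) -> str:
    """Rule-based egg classifier implementing the three decision rules above."""
    if weight_g < 10 and length_cm < 3:
        return "medium"
    if weight_g > 20 and length_cm > 5:
        return "jumbo"
    return "large"

print(classify_egg(8, 2.5))    # medium
print(classify_egg(25, 6.0))   # jumbo
print(classify_egg(15, 4.0))   # large
```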

Statistical Pattern Classification
The objective of statistical pattern classification is to derive an optimal decision rule from a given set of training samples. The decision rule is optimal in the sense that it minimizes a cost function, called the expected risk, incurred in making classification decisions. This is a learning problem!
Assumptions:
1. Features are given. The feature selection problem must be solved separately, and the training samples are randomly drawn from a population.
2. Target labels are given. Each sample is assumed to be assigned a specific, unique label by nature, and the labels of the training samples are known.

Pattern Classification Problem
Let X be the feature space and C = {c_i, 1 ≤ i ≤ M} be the set of M class labels. For each x ∈ X, it is assumed that nature assigns a class label c ∈ C according to some probabilistic rule. Randomly draw a feature vector x from X:
- P(c_i) is the a priori probability that c = c_i, without referring to x.
- P(c_i | x) is the a posteriori probability that c = c_i given the value of x.
- p(x | c_i) is the conditional probability (a.k.a. likelihood function) that x will assume its value given that it is drawn from class c_i.
- p(x) is the marginal probability that x will assume its value without referring to which class it belongs to.
Using Bayes' rule, we have p(x | c_i) P(c_i) = P(c_i | x) p(x), so that

  P(c_i | x) = p(x | c_i) P(c_i) / p(x),  where  p(x) = Σ_{i=1}^{M} p(x | c_i) P(c_i).
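A small numerical illustration of Bayes' rule; the prior and likelihood values below are made up for the example:

```python
import numpy as np

# Hypothetical two-class problem: priors P(c_i) and likelihoods p(x | c_i)
# evaluated at one particular observation x.
priors = np.array([0.7, 0.3])        # P(c_1), P(c_2)
likelihoods = np.array([0.2, 0.6])   # p(x | c_1), p(x | c_2)

evidence = np.sum(likelihoods * priors)        # p(x) = sum_i p(x | c_i) P(c_i)
posteriors = likelihoods * priors / evidence   # P(c_i | x) by Bayes' rule

print(posteriors)        # [0.4375 0.5625]
print(posteriors.sum())  # 1.0
```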

Decision Function and Probability of Misclassification
Given a sample x ∈ X, the objective of statistical pattern classification is to design a decision rule g(x) ∈ C that assigns a label to x. If g(x) = c, the naturally assigned class label, then it is a correct classification; otherwise it is a misclassification.
Define a 0-1 loss function:

  l(x, g(x)) = 0 if g(x) = c, and 1 if g(x) ≠ c.

Given that g(x) = c_{i*}, the expected loss is

  E[l(x, g(x) = c_{i*})] = 0 · P(c = c_{i*} | x) + 1 · P(c ≠ c_{i*} | x) = 1 − P(c_{i*} | x).

Hence the probability of misclassification for the specific decision g(x) = c_{i*} is 1 − P(c_{i*} | x). Clearly, to minimize the probability of misclassification for a given x, the best choice is to choose g(x) = c_{i*} if P(c_{i*} | x) > P(c_i | x) for all i ≠ i*.
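A tiny numeric check of this argument, using made-up posterior values for a three-class problem:

```python
import numpy as np

# Hypothetical posteriors P(c_i | x) for one observation x and three classes.
posteriors = np.array([0.2, 0.5, 0.3])

# Expected 0-1 loss of deciding g(x) = c_i is 1 - P(c_i | x).
expected_loss = 1.0 - posteriors
print(expected_loss)  # [0.8 0.5 0.7]

# The decision with minimum expected loss is the maximum-posterior class.
print(np.argmin(expected_loss) == np.argmax(posteriors))  # True
```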

MAP: Maximum A Posteriori Classifier
The MAP classifier stipulates that the classifier that minimizes the probability of misclassification should choose g(x) = c_{i*} if P(c_{i*} | x) > P(c_i | x) for all i ≠ i*. This is an optimal decision rule. Unfortunately, in real-world applications it is often difficult to estimate P(c_i | x) directly.
Fortunately, to realize the optimal MAP decision rule, one can instead estimate a discriminant function G_i(x) such that for any x ∈ X and all i ≠ i*,

  G_{i*}(x) > G_i(x)  iff  P(c_{i*} | x) > P(c_i | x).

G_i(x) can be an approximation of P(c_i | x) or any function satisfying the above relationship.
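A minimal sketch of the MAP decision in code, assuming the discriminant values G_i(x) (posteriors, or any scores preserving their ordering) are already available as an array; the function name is illustrative:

```python
import numpy as np

def map_decision(discriminants: np.ndarray) -> int:
    """Return the index i* of the class with the largest discriminant G_i(x).

    `discriminants` holds G_1(x), ..., G_M(x) for a single sample x; these can
    be posteriors P(c_i | x) or any scores that preserve their ordering.
    """
    return int(np.argmax(discriminants))

# Example with hypothetical posterior values for one sample.
print(map_decision(np.array([0.1, 0.7, 0.2])))  # 1 (zero-based index of c_2)
```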

Maximum Likelihood Classifier
Using Bayes' rule, P(c_i | x) = p(x | c_i) P(c_i) / p(x). Hence the MAP decision rule can be expressed as: g(x) = c_{i*} if P(c_{i*}) p(x | c_{i*}) > P(c_i) p(x | c_i) for all i ≠ i*.
If the a priori probabilities are unknown, we may assume P(c_i) = 1/M for all i. In that case, maximizing the likelihood p(x | c_i) is equivalent to maximizing the posterior P(c_i | x).
The likelihood function p(x | c_i) may assume a univariate Gaussian model, that is, p(x | c_i) ~ N(µ_i, σ_i), where µ_i and σ_i can be estimated from the samples in {x; c = c_i}. The a priori probability can be estimated as

  P(c_i) ≈ #{x; c = c_i} / |X|,

i.e., the fraction of training samples that belong to class c_i.
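A sketch of such an ML classifier with a univariate Gaussian likelihood per class, assuming a scalar feature and labels stored in NumPy arrays (the function and variable names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_ml(x: np.ndarray, y: np.ndarray):
    """Estimate (mu_i, sigma_i) for each class from labeled scalar samples."""
    classes = np.unique(y)
    params = {c: (x[y == c].mean(), x[y == c].std(ddof=0)) for c in classes}
    return classes, params

def classify_ml(x_new: float, classes, params):
    """Pick the class whose Gaussian likelihood p(x | c_i) is largest."""
    likelihoods = [norm.pdf(x_new, loc=params[c][0], scale=params[c][1])
                   for c in classes]
    return classes[int(np.argmax(likelihoods))]

# Toy data: class 0 centered near 0, class 1 centered near 5.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
classes, params = fit_gaussian_ml(x, y)
print(classify_ml(4.2, classes, params))  # expected: 1
```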

Nearest-Neighbor Classifier
Let {y(1), ..., y(n)} ⊂ X be n samples that have already been classified. Given a new sample x, the NN decision rule chooses g(x) = c_i if the nearest sample y(i*), defined by

  ||y(i*) − x|| = min over 1 ≤ j ≤ n of ||y(j) − x||,

is labeled with c_i. As n → ∞, the probability of misclassification of the NN classifier is at most twice the probability of misclassification of the optimal (MAP) classifier.
The k-nearest neighbor classifier examines the k nearest classified samples and assigns x to the majority class among them.
Implementation problem: it requires large storage to keep ALL the training samples.
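A compact k-NN sketch in Python, assuming a Euclidean distance metric and training samples and labels stored in NumPy arrays (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x: np.ndarray, train_x: np.ndarray, train_y: np.ndarray, k: int = 3):
    """Classify x by majority vote among its k nearest training samples."""
    distances = np.linalg.norm(train_x - x, axis=1)  # Euclidean distance to every stored sample
    nearest = np.argsort(distances)[:k]              # indices of the k closest samples
    return Counter(train_y[nearest]).most_common(1)[0][0]

# Toy 2-D example.
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
train_y = np.array(["A", "A", "B", "B"])
print(knn_classify(np.array([4.9, 5.1]), train_x, train_y, k=3))  # "B"
```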

MLP Classifier
Each output of an MLP is used to approximate the a posteriori probability P(c_i | x) directly. The classification decision then amounts to assigning the feature vector to the class whose corresponding MLP output is maximum. During training, the classification labels (1-of-N encoding) are presented as target values (rather than the true, but unknown, a posteriori probabilities).
Denote y_i(x, W) as the i-th output of the MLP and t_i as the corresponding target value (0 or 1) during training. The squared error decomposes as

  E{[t_i − y_i(x, W)]^2}
    = E{[t_i − E(t_i | x)]^2} + 2 E{[t_i − E(t_i | x)][E(t_i | x) − y_i(x, W)]} + E{[E(t_i | x) − y_i(x, W)]^2}
    = E{[t_i − E(t_i | x)]^2} + E{[E(t_i | x) − y_i(x, W)]^2},

since the cross term vanishes. The first term does not depend on W, so minimizing the training error drives y_i(x, W) toward E(t_i | x) = P(c_i | x).
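As an illustration of this idea (not the network from the lecture), one can train scikit-learn's MLPRegressor, which uses a squared-error criterion, on 1-of-N targets; its outputs then act as posterior estimates and the decision is the largest output:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy 2-class data: class 0 near (0, 0), class 1 near (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
labels = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

# 1-of-N (one-hot) targets, as on the slide.
T = np.eye(2)[labels]

# Squared-error training drives each output toward E(t_i | x) = P(c_i | x).
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, T)

outputs = mlp.predict(np.array([[2.8, 3.1], [0.2, -0.4]]))  # approximate posteriors (not exactly normalized)
print(outputs)
print(outputs.argmax(axis=1))  # predicted classes: [1 0]
```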