The big picture

Vincent Claveau, IRISA - CNRS (slides from E. Kijak, INSA Rennes)

Notations
Classes: C = {ω_i, i = 1, ..., C}. Training set S of size m, composed of m_i points (x, ω_i) per class ω_i. Representation space: R^d (= d numeric features).

Problem to solve
Assign a class among the C classes to any point x ∈ R^d, with the only knowledge being the training set, i.e. find the most probable class given x: P(ω_i | x).

Bayesian inductive principle
Choose the most probable hypothesis given S. We suppose that it is possible to define a probability distribution over the hypotheses; the expert knowledge about the task is expressed through the a priori distribution over H; the training set is then considered as information modifying this distribution over H; we finally choose the most a posteriori probable h: Maximum A Posteriori (MAP).

Learning a classifier: Bayes formula
How to compute P(ω_i | x)? P(ω_i | x) is the a posteriori probability of ω_i given x. Idea: Bayes formula
    P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)
where p(x | ω_i) is the probability density of class ω_i at the point x, also known as the likelihood (Fr: vraisemblance), and P(ω_i) is the a priori probability of class ω_i.

Maximum A Posteriori (MAP) rule
The MAP rule h* assigns to a point x the class ω* which has the highest a posteriori probability of having generated x: h* chooses the class ω* = ArgMax_i P(ω_i | x). We look for the hypothesis h that is the most probable given the observation x, that is, a posteriori. Another way to express the MAP rule:
    ω* = ArgMax_i p(x | ω_i) P(ω_i)
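To make the MAP rule concrete, here is a minimal Python sketch, assuming two hypothetical 1-D Gaussian classes whose priors, means and variances are purely illustrative (they do not come from the slides); since p(x) is the same for every class, it can be dropped from the ArgMax.

    import numpy as np

    # Hypothetical priors P(omega_i) and 1-D Gaussian parameters (mu_i, sigma_i) for p(x | omega_i).
    priors = {"omega_1": 0.6, "omega_2": 0.4}
    params = {"omega_1": (0.0, 1.0), "omega_2": (3.0, 2.0)}

    def likelihood(x, mu, sigma):
        """1-D Gaussian class-conditional density p(x | omega_i)."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def map_class(x):
        """MAP rule: omega* = ArgMax_i p(x | omega_i) P(omega_i); p(x) cancels out of the ArgMax."""
        scores = {w: likelihood(x, *params[w]) * priors[w] for w in priors}
        return max(scores, key=scores.get)

    print(map_class(0.5))  # omega_1 wins in this region
    print(map_class(4.0))  # omega_2 wins in this region

Dividing each score by their sum would give the actual posteriors P(ω_i | x), but the ArgMax is unchanged.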

Simple example of the Maximum A Posteriori (MAP) rule
Classify a person as a boy or a girl, based on training data. Points x are described by the height, the weight, the hair length, ... of the person. What are the a priori probabilities? The likelihoods?

Theoretical property
This rule is optimal: among all possible classification rules, it is the one with the smallest error probability (= real risk):
    err(h*) = min_h ∫_{R^d} P^h_err(x) dx
where P^h_err(x) is the probability that x is wrongly classified by the rule h. The value err(h*) is called the Bayesian classification error. The MAP rule is also called the minimal error rule since it minimizes the number of classification errors.

Maximum likelihood rule
If all the classes have the same a priori probability, the Maximum A Posteriori rule becomes the Maximum Likelihood (ML) rule:
    ω* = ArgMax_i p(x | ω_i)
This rule selects the class ω_i for which the observation x is the most probable, that is, the state of the world most able to generate the event x. The simple idea here is that the observation x is not fortuitous and was highly probable under the state of the world h (hypothesis).

Naive case
Naive = independent features: we suppose that the features {a_1, ..., a_d} are independent. Then p(x | ω) can be further decomposed into p(a_1 = v_1x | ω) ... p(a_d = v_dx | ω), thus
    p(x | ω) = ∏_{i=1}^{d} p(a_i = v_ix | ω)
The resulting classifier is the Naive Bayes classifier (a sketch is given below). In practice, for most problems, the features are not independent (e.g. weight / height), but even when the assumption does not hold, Naive Bayes yields good results.

Learning a classifier: separative surfaces
The separative surface between ω_i and ω_j is the set of points that have equal a posteriori probabilities of belonging to ω_i or ω_j. Its equation is:
    P(ω_i | x) = P(ω_j | x)  ⟺  p(x | ω_i) P(ω_i) / p(x) = p(x | ω_j) P(ω_j) / p(x)  ⟺  p(x | ω_i) P(ω_i) = p(x | ω_j) P(ω_j)
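As announced in the naive case above, here is a hedged sketch of the product decomposition applied to the boy/girl example: each feature (height in cm, weight in kg, hair length in cm) is modelled by an independent 1-D Gaussian, and every parameter and test point below is invented for illustration.

    import numpy as np

    # Illustrative per-class priors and per-feature Gaussian parameters (mu, sigma).
    class_params = {
        "boy":  {"prior": 0.5, "mu": np.array([178.0, 72.0, 5.0]),  "sigma": np.array([7.0, 9.0, 4.0])},
        "girl": {"prior": 0.5, "mu": np.array([165.0, 58.0, 25.0]), "sigma": np.array([6.0, 8.0, 12.0])},
    }

    def naive_bayes_map(x):
        """MAP with independent features: ArgMax_omega P(omega) * prod_i p(a_i = x_i | omega)."""
        best, best_score = None, -np.inf
        for label, p in class_params.items():
            # work in log space to avoid underflow when multiplying many densities
            log_lik = -0.5 * np.sum(((x - p["mu"]) / p["sigma"]) ** 2) \
                      - np.sum(np.log(p["sigma"] * np.sqrt(2 * np.pi)))
            score = np.log(p["prior"]) + log_lik
            if score > best_score:
                best, best_score = label, score
        return best

    print(naive_bayes_map(np.array([180.0, 75.0, 3.0])))   # -> "boy" with these toy parameters
    print(naive_bayes_map(np.array([163.0, 55.0, 30.0])))  # -> "girl"

The log-space formulation is only a numerical convenience: maximizing the log of P(ω) ∏_i p(a_i | ω) gives the same ArgMax.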

Learning a classifier: how to get the probabilities?
The problem would be easy to solve if the P(ω_i) and the p(x | ω_i) were known.
P(ω_i): the a priori probabilities of the classes are either supposed equal, or estimated from their frequencies in the training set.
p(x | ω_i): for each class, we face the problem of estimating a density from a finite number of observations.

Estimating the a priori probabilities
If no relevant information is available, they are supposed equal:
    P(ω_i) = 1 / C
or, if the training set is supposed representative, we use the class frequencies in this set:
    P(ω_i) = m_i / m
or we use an estimation in-between (Laplace formula):
    P(ω_i) = (m_i + M/C) / (m + M)
where M is an arbitrary constant. This formula is used when m_i is small, i.e. when the estimations m_i / m are not precise. M represents a virtual augmentation of the number of examples, for which we suppose the classes are equiprobable. (See the sketch below.)

Estimating a probability density
Parametric methods: we suppose that the p(x | ω_i) have a certain analytical form; e.g., if they are supposed Gaussian, estimating their mean and covariance is enough, and the probability that an example x belongs to a certain class can then be computed directly from its coordinates (feature values).
Non-parametric methods: the densities p(x | ω_i) are estimated locally at the point x by looking at the training examples around this point; these methods are implemented by two well-known techniques: Parzen windows (kernels, Fr: noyaux) and k-nearest-neighbors (Fr: K plus proches voisins).

Reminder
Let E[x] denote the expectation (Fr: espérance) of the random variable x. The mean (Fr: moyenne) of a probability density p in R^d is a d-dimensional vector defined as µ = E[x]. The j-th component of µ is:
    µ(j) = E[x_j] = ∫_R x_j p(x_j) dx_j
Its covariance matrix is Q = E[(x − µ)(x − µ)^T], with Q(j, k) = E[(x_j − µ(j))(x_k − µ(k))].

Reminder
A Gaussian probability distribution is defined by its mean vector µ and its covariance matrix Q. For each class ω_i:
d = 1: Q_i is a scalar σ_i^2 (the variance) and
    p(x | ω_i) = 1 / (σ_i √(2π)) · exp(−(x − µ_i)^2 / (2 σ_i^2))
d > 1:
    p(x | ω_i) = 1 / ((2π)^{d/2} |Q_i|^{1/2}) · exp(−½ (x − µ_i)^T Q_i^{-1} (x − µ_i))
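A small sketch of the two estimation steps just described, the Laplace-style prior and the multivariate Gaussian density; the class counts used in the usage lines are illustrative.

    import numpy as np

    def estimate_priors(counts, M=0):
        """A priori probabilities from class counts m_i; M > 0 gives the Laplace-style estimate
        P(omega_i) = (m_i + M/C) / (m + M); M = 0 falls back to the plain frequencies."""
        counts = np.asarray(counts, dtype=float)
        C, m = len(counts), counts.sum()
        return (counts + M / C) / (m + M)

    def gaussian_density(x, mu, Q):
        """Multivariate Gaussian density p(x | omega_i) with mean mu and covariance Q."""
        d = len(mu)
        diff = x - mu
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Q))
        return norm * np.exp(-0.5 * diff @ np.linalg.inv(Q) @ diff)

    print(estimate_priors([3, 7], M=0))    # frequency estimate: [0.3, 0.7]
    print(estimate_priors([3, 7], M=10))   # pulled toward equiprobability: [0.4, 0.6]
    print(gaussian_density(np.array([0.0, 0.0]), np.array([0.0, 0.0]), np.eye(2)))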

Reminder (determinants and inverses)
Determinant det(Q), also noted |Q|, of a 2-dimensional matrix:
    det([[a, b], [c, d]]) = a·d − b·c
Inverse A^{-1}:
    A^{-1} = (1 / det(A)) · com(A)^T
e.g. for A = [[a, b], [c, d]]: com(A) = [[d, −c], [−b, a]], com(A)^T = [[d, −b], [−c, a]], and
    A^{-1} = 1/(a·d − b·c) · [[d, −b], [−c, a]]

Estimation of Gaussian classes
A maximum likelihood estimation maximizes the probability of observing the training data. For the class ω_i, the m_i training points are noted {x_1, ..., x_j, ..., x_{m_i}}. It is known that the maximum likelihood estimates of the mean µ_i and of the covariance matrix Q_i are:
    µ_i = (1 / m_i) Σ_{l=1}^{m_i} x_l
    Q_i = (1 / m_i) Σ_{l=1}^{m_i} (x_l − µ_i)(x_l − µ_i)^T

Separative surfaces of Gaussian classes
The set of points where the probabilities of belonging to the two classes ω_i and ω_j are equal is given (with equal a priori probabilities) by:
    1/((2π)^{d/2} |Q_i|^{1/2}) · exp(−½ (x − µ_i)^T Q_i^{-1} (x − µ_i)) = 1/((2π)^{d/2} |Q_j|^{1/2}) · exp(−½ (x − µ_j)^T Q_j^{-1} (x − µ_j))
After simplification, we get a quadratic form:
    x^T Φ x + x^T φ + α = 0
The matrix Φ, the vector φ and the scalar α only depend on µ_i, µ_j, Q_i, Q_j.

Example with two dimensions (in R^2)
Training set (each point given as a column vector):
    ω_1: (1, 1), (0, 4), (3, 3), (4, 0)
    ω_2: (4, 0), (7, 1), (8, 4), (5, 3)
Considering that the two classes are Gaussian, what is the equation of the separative surface? (A sketch of the computation is given below.)
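A sketch of the computation asked for in the example: maximum likelihood estimates of µ_i and Q_i from the two training sets above, then one way to obtain the coefficients Φ, φ, α of the quadratic separative surface, assuming equal a priori probabilities (the expansion of the log-densities is mine; the slides only state the resulting form).

    import numpy as np

    # Training points of the two-dimensional example (one row per point).
    X1 = np.array([[1, 1], [0, 4], [3, 3], [4, 0]], dtype=float)  # class omega_1
    X2 = np.array([[4, 0], [7, 1], [8, 4], [5, 3]], dtype=float)  # class omega_2

    def ml_estimates(X):
        """Maximum likelihood mean and covariance (division by m_i, not m_i - 1)."""
        mu = X.mean(axis=0)
        diff = X - mu
        return mu, diff.T @ diff / len(X)

    mu1, Q1 = ml_estimates(X1)  # mu1 = [2., 2.], matching the correction below
    mu2, Q2 = ml_estimates(X2)  # mu2 = [6., 2.]

    # Equal priors: equate the two Gaussian log-densities and expand to get
    # x^T Phi x + x^T phi + alpha = 0, positive on the omega_1 side.
    A1, A2 = np.linalg.inv(Q1), np.linalg.inv(Q2)
    Phi = 0.5 * (A2 - A1)
    phi = A1 @ mu1 - A2 @ mu2
    alpha = 0.5 * (mu2 @ A2 @ mu2 - mu1 @ A1 @ mu1) \
            + 0.5 * np.log(np.linalg.det(Q2) / np.linalg.det(Q1))

    x = mu1
    print(x @ Phi @ x + x @ phi + alpha)  # > 0: mu1 lies on the omega_1 side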

Example with two dimensions: correction
det M = a·d − b·c; µ_1 = (2, 2)^T, Q_1 = ...; µ_2 = (6, 2)^T, Q_2 = ...; p(x | ω_1) = |Q_1| ...

A more complex case: modelling with a mixture of Gaussians
Mixture of K Gaussians:
    p(x | ω_i) = Σ_{k=1}^{K} α_k · 1/((2π)^{d/2} |Q_k|^{1/2}) · exp(−½ (x − µ_k)^T Q_k^{-1} (x − µ_k)),   with Σ_{k=1}^{K} α_k = 1
For each class ω_i, we estimate every parameter: the mean of each Gaussian {µ_1, ..., µ_K}, the covariance of each Gaussian {Q_1, ..., Q_K}, and the mixture weights {α_1, ..., α_K}, with an EM (Expectation-Maximization) algorithm (see the sketch below).

A simplified case: naive Bayesian classification
In that case, the naive hypothesis (as previously defined) means the features are not correlated: each class has a diagonal covariance matrix. The probability of observing x^T = (x_1, ..., x_d) for a point of any class ω_i is then the product of the probability of observing a_1 = x_1 for this class, the probability of observing a_2 = x_2 for this class, and so on. Thus, by definition:
    ω* = ArgMax_{i ∈ {1,...,C}} P(ω_i) ∏_{j=1}^{d} p(x_j | ω_i)
Each value p(x_j | ω_i) is estimated by counting in an interval (one-dimensional histogram).

Non-parametric estimation
Let x be a point whose class is unknown. We estimate the probability densities around x and then apply the Bayesian classification. For each class ω_i we have the same problem: we have m training points that we suppose to have been drawn independently (Fr: tirages indépendants) in R^d according to an unknown density p(x | ω_i). How can one estimate p(x | ω) at point x from these m training points?

How to do it
We define around x a region R_m of volume V_m and we count the number k_m of points of the training set that are in this region. Estimate of p(x | ω) for a sample of size m:
    p_m(x | ω) = (k_m / m) / V_m
where V_m is the volume of the region R_m considered. When m increases, this estimator converges to p(x | ω) if:
    lim V_m = 0,   lim k_m = ∞,   lim (k_m / m) = 0
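As mentioned above for the mixture-of-Gaussians model, the parameters {α_k, µ_k, Q_k} can be estimated with EM; below is a minimal numpy sketch of that loop for the data of one class (the initialisation, iteration count and small regularisation term are my choices, not prescriptions from the slides).

    import numpy as np

    def fit_gmm_em(X, K, n_iter=100, seed=0):
        """Fit a K-component Gaussian mixture to X (m x d) with a basic EM loop."""
        rng = np.random.default_rng(seed)
        m, d = X.shape
        # Initialisation: K random points as means, shared covariance, uniform weights.
        mu = X[rng.choice(m, K, replace=False)]              # (K, d)
        Q = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)   # (K, d, d)
        alpha = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-step: responsibilities r[l, k] proportional to alpha_k * p_k(x_l).
            r = np.zeros((m, K))
            for k in range(K):
                diff = X - mu[k]
                Qinv = np.linalg.inv(Q[k])
                norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Q[k]))
                r[:, k] = alpha[k] * norm * np.exp(-0.5 * np.sum(diff @ Qinv * diff, axis=1))
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate alpha_k, mu_k, Q_k from the responsibilities.
            Nk = r.sum(axis=0)
            alpha = Nk / m
            mu = (r.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Q[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        return alpha, mu, Q

    # Illustrative usage on synthetic 2-D data drawn from two well-separated blobs.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)), rng.normal([5, 5], 1.5, (200, 2))])
    alpha, mu, Q = fit_gmm_em(X, K=2)
    print(alpha, mu, sep="\n")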

Non-parametric estimation: explanation
Probability P_m that x falls into the region R_m:
    P_m = ∫_{R_m} p(x | ω) dx
If the m sample points are i.i.d. samples from p(x | ω), the probability that k_m among them fall into R_m is binomial:
    C(m, k_m) · P_m^{k_m} · (1 − P_m)^{m − k_m}
From this distribution, we know that the expectation of k_m is m·P_m, so k_m / m is an estimator of P_m. If V_m is small enough for p(x | ω) to be essentially constant over R_m, then:
    P_m = ∫_{R_m} p(x | ω) dx ≈ p(x | ω) · V_m   and so   p(x | ω) ≈ (k_m / m) / V_m
Illustration: points are independent draws from a certain distribution in R^2 whose density is higher at point A than at point B. For the same volume around A and B, k_m is respectively 6 and 1. To get k_m = 6 around B, one has to increase the volume.

Non-parametric Bayesian learning
The density p(x | ω) is estimated by the proportion of examples belonging to class ω around x. There are two solutions:
Parzen windows (Fr: fenêtres de Parzen): subdivision of the space into balls of fixed radius ρ centered on x. Let N(x) be the number of points of class ω contained in the ball; then p_m(x | ω) ∝ N(x) / ρ^d.
K-nearest neighbors (Fr: K plus proches voisins), or KNN: form balls with a variable radius ρ_K(x) but containing exactly K (fixed) points from the training set (the KNN of x); then p_m(x | ω) ∝ K / ρ_K(x)^d.

Parzen windows: estimating with kernels
This technique is more generally described by:
    p_m(x | ω) = (1/m) Σ_{i=1}^{m} (1/V_m) κ(x, x_i)
The function κ(x, x_i) is centered on x_i and decreases as x gets farther from x_i; its integral is the volume V_m. For example, κ may be a rectangle with varying length/width (constant area), or a Gaussian with varying variance (constant integral).
Figure: estimating the density with the Parzen windows method. There are 4 training points in a 1-dimensional space. The density (plain line) is computed as the sum of the windows centered on each point. Here, the window is narrow (h is small): the density is not smooth.
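A small sketch of the Parzen estimator p_m(x | ω) = (1/m) Σ_i (1/V_m) κ(x, x_i) in one dimension, with both the rectangular and the Gaussian kernel; the four training points are hypothetical stand-ins for those of the figure.

    import numpy as np

    def parzen_density(x, samples, h, kernel="gaussian"):
        """Parzen-window estimate at query points x from 1-D training samples; h plays the role of V_m."""
        x = np.atleast_1d(x).astype(float)[:, None]         # (n, 1) query points
        u = (x - samples[None, :]) / h                      # scaled distance to every training point
        if kernel == "rectangular":
            k = (np.abs(u) <= 0.5).astype(float)            # window of width h around each sample
        else:
            k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian window of standard deviation h
        return k.mean(axis=1) / h                           # (1/m) * sum_i kappa / V_m

    samples = np.array([1.0, 2.0, 4.5, 7.0])                # hypothetical 1-D training points
    grid = np.linspace(0.0, 8.0, 9)
    print(parzen_density(grid, samples, h=0.5))             # narrow window: spiky estimate
    print(parzen_density(grid, samples, h=2.0))             # wider window: smoother estimate

Each estimate integrates to 1 because every window, rectangular or Gaussian, has integral V_m = h.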

Parzen windows: examples
Figure (rectangular kernels): same estimation with a greater h; the density is smoother.
Figure (Gaussian kernels): same estimation with Gaussian kernels; the density is very smooth.

Parzen windows: computational issues
    p_m(x) = (1/m) Σ_{i=1}^{m} (1/V_m) κ(x, x_i)
To avoid computing the sum over m terms, one can use a kernel function for κ, i.e. a function such that there exists, in an n-dimensional space, a function Φ with κ(x, y) = ⟨Φ(x), Φ(y)⟩. Then:
    p_m(x) = (1/m) Σ_{i=1}^{m} (1/V_m) κ(x, x_i) = (1/(m V_m)) Σ_{i=1}^{m} ⟨Φ(x), Φ(x_i)⟩ = (1/(m V_m)) ⟨Φ(x), Σ_{i=1}^{m} Φ(x_i)⟩
Σ_{i=1}^{m} Φ(x_i) is pre-computed once and for all; only one scalar product in n dimensions remains.

K-nearest neighbors algorithm
    Begin
      for each example (y, ω) in the training set do
        compute the distance D(y, x) between y and x
      end for
      among the K nearest points of x, count the number of occurrences of each class
      assign to x the most frequent class found
    End
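A runnable Python version of the K-nearest-neighbors pseudocode above, using the Euclidean distance for D and a majority vote among the K nearest points; the training points and labels in the usage lines are made up.

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, y_train, K=3):
        """Assign to x the most frequent class among its K nearest training points."""
        dists = np.linalg.norm(X_train - x, axis=1)    # distance D(y, x) to every training example
        nearest = np.argsort(dists)[:K]                # indices of the K nearest points
        votes = Counter(y_train[i] for i in nearest)   # occurrences of each class among them
        return votes.most_common(1)[0][0]

    X_train = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])
    y_train = np.array(["omega_1", "omega_1", "omega_1", "omega_2", "omega_2", "omega_2"])
    print(knn_classify(np.array([0.3, 0.4]), X_train, y_train, K=3))   # -> omega_1
    print(knn_classify(np.array([5.2, 5.1]), X_train, y_train, K=3))   # -> omega_2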

K-nearest neighbors: decision
Figures: decision regions with 1-NN and 3-NN for 2 classes, and with 1-NN and 3-NN for 3 classes.

K-NN: validity (1)
The K-NN decision rule approximates the Bayesian one, since it implicitly makes a comparative estimation of the probability density of the classes occurring in the neighborhood of x and then chooses the most probable. Suppose that among the m training points, m_i are of class ω_i, and that among the K nearest neighbors of x there are K_{m_i} examples of class ω_i. Then:
    p_m(x | ω_i) = (K_{m_i} / m_i) / V_m
Moreover, m_i / m is an estimator of P(ω_i), the a priori probability of class ω_i; thus one can write m_i / m = P_m(ω_i).

K-NN: validity (2)
From that, we have:
    K_{m_i} = p_m(x | ω_i) · P_m(ω_i) · m · V_m
Consequently, the class maximizing K_{m_i} also maximizes p_m(x | ω_i) · P_m(ω_i), and so, by the Bayes rule, it also maximizes P_m(ω_i | x) · p(x). This class is coherent with the Bayesian classification rule, since it maximizes P_m(ω_i | x).

K-NN: validity (3)
To complete this, we need to show that the method fulfills the requirements expressed previously. For a fixed K and growing m, for each class we have V_m → 0 and k_m / m → 0. The probability of error E_KNN of the KNN rule converges toward the Bayesian one when m increases.

K-NN in practice: choosing K
Diverse practical and theoretical considerations lead to this heuristic:
    K ≈ √(m / C)
where m / C is the average number of training points per class. It is worth noting that d, the dimension of the representation space, does not appear in this formula. (A short usage snippet follows.)
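A short usage snippet for the heuristic above, reusing the knn_classify sketch from the previous block; m and C are illustrative counts.

    import math

    m, C = 600, 3                                     # illustrative: 600 training points, 3 classes
    K = max(1, round(math.sqrt(m / C)))               # heuristic K ~ sqrt(m / C); here K = 14
    # label = knn_classify(x, X_train, y_train, K=K)  # plug the chosen K into the classifier sketch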

K-NN in practice: ties
What to do in case of ties (Fr: égalité)? One can choose a higher value for K, but the tie may persist; another good solution is to decide randomly which class to assign; yet another is to weight the votes of the neighbors by their distance to the point (see the sketch below).

Separative surfaces for K-NN: Voronoi areas
The Voronoi area of a point is the part of R^d in which each point is closer to this example than to any other; it is the intersection of m − 1 half-spaces, defined by the perpendicular bisector hyperplanes between this example and each other one. For K = 1, the separative surface between two classes is the surface separating the two volumes obtained as the union of the Voronoi areas of the examples of each class.
Figure: set of points with their Voronoi areas (K = 1).
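One way to implement the third tie-breaking option, weighting the neighbors' votes by their distance, as a hedged variant of the earlier knn_classify sketch; X_train and y_train are assumed to be defined as before.

    import numpy as np
    from collections import defaultdict

    def knn_classify_weighted(x, X_train, y_train, K=3, eps=1e-12):
        """K-NN vote where each neighbor counts for 1/distance, which breaks most ties."""
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:K]
        scores = defaultdict(float)
        for i in nearest:
            scores[y_train[i]] += 1.0 / (dists[i] + eps)   # closer neighbors weigh more
        return max(scores, key=scores.get)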