The big picture
Vincent Claveau, IRISA - CNRS; slides from E. Kijak, INSA Rennes

Notations
- classes: C = {ω_i, i = 1, ..., C}
- training set S of size m, composed of m_i points (x, ω_i) per class ω_i
- representation space: R^d (= d numeric features)

Problem to solve
- assign a class among C to any point x ∈ R^d, with the only knowledge of the training set
- = what is the most probable class given x: P(ω_i|x)

Bayesian inductive principle
Choose the most probable hypothesis given S:
- we suppose that it is possible to define a probability distribution over the hypotheses
- the expert knowledge about the task is expressed through the a priori distribution over H
- the training set is thus considered as information modifying this distribution over H
- we then choose the most a posteriori probable h: Maximum A Posteriori (MAP)

Bayes formula
How to compute P(ω_i|x)? P(ω_i|x) is the a posteriori probability of ω_i given x.
Idea: Bayes formula
    P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x)
- p(x|ω_i) is the probability density of the class ω_i at the point x, also known as the likelihood (Fr: vraisemblance)
- P(ω_i) is the a priori probability of the class

Maximum A Posteriori (MAP) rule
The MAP rule h* assigns to a point x the class ω* which has the highest a posteriori probability of generating x:
    h* chooses the class ω* = ArgMax_i P(ω_i|x)
We look for the hypothesis h that is the most probable given the observation x, that is, a posteriori. Since p(x) does not depend on the class, another way to express the MAP rule is:
    ω* = ArgMax_i p(x|ω_i) P(ω_i)
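To make the rule concrete, here is a minimal Python sketch of MAP classification for a hypothetical 1-D problem with two classes; the Gaussian likelihoods and the prior values are illustrative assumptions, not part of the course material.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: two classes with known 1-D Gaussian likelihoods.
priors = {"w1": 0.7, "w2": 0.3}                              # P(omega_i)
likelihoods = {"w1": norm(0.0, 1.0), "w2": norm(3.0, 1.5)}   # p(x|omega_i)

def map_rule(x):
    # omega* = ArgMax_i p(x|omega_i) P(omega_i); p(x) cancels out
    return max(priors, key=lambda w: likelihoods[w].pdf(x) * priors[w])

print(map_rule(1.0))   # -> "w1": its posterior dominates at x = 1.0
```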
Simple example
Classify a person as a boy or a girl, based on training data.
- the x are described by the size, the weight, the hair length... of the person
- a priori probabilities? likelihoods?

Theoretical property
This rule is optimal: among all the possible classification rules, it is the one with the smallest error probability (= real risk):
    err(h*) = min_h ∫_{R^d} P_err^h(x) dx
P_err^h(x) is the probability that x is wrongly classified by the rule h. The value err(h*) is called the Bayesian classification error. This rule is also called the minimal error rule since it minimizes the number of classification errors.

Maximum likelihood rule
If all the classes have the same a priori probability, then the Maximum A Posteriori rule is called the Maximum Likelihood (ML) rule:
    ω* = ArgMax_i p(x|ω_i)
This rule selects the class ω_i for which the observation x is the most probable, that is, the state of the world the most able to generate the event x. The simple idea here is that the observation x is not fortuitous and was highly probable under the state of the world h (hypothesis).

Naive case
Naive = independent features:
- we suppose that the features {a_1, ..., a_d} are independent
- then p(x|ω) can be further decomposed into p(a_1 = v_1x|ω) × ... × p(a_d = v_dx|ω)
- thus: p(x|ω) = ∏_{i=1}^d p(a_i = v_ix|ω)
- the resulting classifier is the Naive Bayes classifier

In practice:
- for most problems, the features are not independent (e.g. weight / height)
- but, even when the independence assumption does not hold, Naive Bayes yields good results (a sketch is given after this slide)

Separative surfaces
The separative surface between ω_i and ω_j is the locus of the points that have equal a posteriori probabilities of belonging to ω_i or to ω_j. The equation of the separative surface between ω_i and ω_j is:
    P(ω_i|x) = P(ω_j|x)
    p(x|ω_i) P(ω_i) / p(x) = p(x|ω_j) P(ω_j) / p(x)
    p(x|ω_i) P(ω_i) = p(x|ω_j) P(ω_j)
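As referenced above, here is a minimal Gaussian Naive Bayes sketch in Python; the toy height/weight data (echoing the boy/girl example) and the variance floor are assumptions for illustration.

```python
import numpy as np

# Gaussian Naive Bayes: each feature is modeled by an independent
# 1-D Gaussian per class (diagonal covariance assumption).
def fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(omega_c) = m_c / m
                     Xc.mean(axis=0),         # per-feature means
                     Xc.var(axis=0) + 1e-9)   # per-feature variances (floored)
    return params

def predict(params, x):
    def log_posterior(c):
        prior, mu, var = params[c]
        # log P(omega_c) + sum_j log p(x_j | omega_c)
        return (np.log(prior)
                - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var))
    return max(params, key=log_posterior)

# toy height (cm) / weight (kg) data; 0 = girl, 1 = boy
X = np.array([[160.0, 50.0], [170.0, 60.0], [180.0, 80.0], [175.0, 75.0]])
y = np.array([0, 0, 1, 1])
print(predict(fit(X, y), np.array([172.0, 65.0])))
```

Working in log space avoids numerical underflow when the product ∏_j p(x_j|ω) involves many small densities.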
How to get the probabilities?
The problem would be easy to solve if the P(ω_i) and the p(x|ω_i) were known:
- P(ω_i): the a priori probabilities of the classes are either supposed equal, or estimated from their frequencies in the training set
- p(x|ω_i): for each class, we face the problem of estimating a density from a finite number of observations

Estimating the a priori probabilities
- if no relevant information is available, they are supposed equal, with P(ω_i) = 1/C
- or, if the training set is supposed representative, we use the class frequencies in this set: P(ω_i) = m_i / m
- or we use an estimation in-between (Laplace formula):
    P(ω_i) = (m_i + M/C) / (m + M)
  where M is an arbitrary constant. This formula is used when m is small, i.e. when the estimations m_i/m are not precise. M represents a virtual augmentation of the number of examples, for which we suppose the classes are equiprobable.

Estimating a probability density
- parametric methods: we suppose that the p(x|ω_i) have a certain analytical form; e.g., if they are supposed Gaussian, estimating their mean and covariance is enough; the probability that an example x belongs to a certain class can then be computed directly from its coordinates (the values of its features)
- non-parametric methods: the densities p(x|ω_i) are estimated locally at the point x by looking at the training examples around this point; these methods are implemented by 2 well-known techniques:
  - Parzen windows (kernels; Fr: noyaux)
  - k-nearest neighbors (Fr: K-plus proches voisins)

Reminder
Let us note E[x] the expectation (Fr: espérance) of the random variable x.
The mean (Fr: moyenne) of a probability density p in R^d is a d-dimensional vector defined as: µ = E[x].
The j-th component of µ is: µ(j) = E[x_j] = ∫_R x_j p(x_j) dx_j.
Its covariance matrix is: Q = E[(x − µ)(x − µ)^T], with: Q(j,k) = E[(x_j − µ(j))(x_k − µ(k))].

Reminder
A Gaussian probability distribution is defined by its mean vector µ and its covariance matrix Q. For each class ω_i:
- d = 1: Q is a scalar σ² (the variance):
    p(x|ω_i) = 1/(σ √(2π)) · exp(−(x − µ_i)² / (2σ²))
- d > 1:
    p(x|ω_i) = 1/((2π)^{d/2} |Q_i|^{1/2}) · exp(−(1/2) (x − µ_i)^T Q_i^{-1} (x − µ_i))
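To make the d > 1 formula concrete, here is a minimal sketch evaluating a multivariate Gaussian density at a point; the mean and covariance values are illustrative assumptions (the result can be cross-checked with scipy.stats.multivariate_normal.pdf).

```python
import numpy as np

def gaussian_density(x, mu, Q):
    """p(x|omega_i) for a d-dimensional Gaussian with mean mu and covariance Q."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Q)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Q) @ diff)

mu = np.array([2.0, 2.0])                 # assumed mean
Q = np.array([[2.5, -1.5], [-1.5, 2.5]])  # assumed covariance
print(gaussian_density(np.array([1.0, 1.0]), mu, Q))
```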
Reminder
Determinant det(Q), also noted |Q|; for a 2-dimensional matrix:
    det([a b; c d]) = a·d − b·c
Inverse A^{-1}:
    A^{-1} = (1/det(A)) · com(A)^T
e.g.: A = [a b; c d], com(A) = [d −c; −b a], com(A)^T = [d −b; −c a],
    A^{-1} = (1/(ad − bc)) · [d −b; −c a]

ML estimation for Gaussian classes
A maximum likelihood estimation maximizes the probability of observing the training data. For the class ω_i, the m_i training points are noted {x_1, ..., x_l, ..., x_{m_i}}. It is known that the maximum likelihood estimations of the mean µ_i and of the covariance matrix Q_i are computed by:
    µ_i = (1/m_i) Σ_{l=1}^{m_i} x_l
    Q_i = (1/m_i) Σ_{l=1}^{m_i} (x_l − µ_i)(x_l − µ_i)^T

Separative surfaces of Gaussian classes
The locus of the points where the probabilities of belonging to 2 classes ω_i and ω_j are equal is:
    (|Q_i|^{-1/2} / (2π)^{d/2}) exp(−(1/2)(x − µ_i)^T Q_i^{-1} (x − µ_i)) = (|Q_j|^{-1/2} / (2π)^{d/2}) exp(−(1/2)(x − µ_j)^T Q_j^{-1} (x − µ_j))
After simplification (taking logarithms), we get a quadratic form:
    x^T Φ x + x^T φ + α = 0
The matrix Φ, the vector φ and the scalar α only depend on µ_i, µ_j, Q_i, Q_j.

Example with two dimensions
Training set:
    ω_1: (1, 1), (0, 4), (3, 3), (4, 0)
    ω_2: (4, 0), (7, 1), (8, 4), (5, 3)
Considering that the two classes are Gaussian, what is the equation of the separative surface?
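The following sketch carries out the computation for this example under the single-Gaussian-per-class assumption: ML estimates of the means and covariances, then the coefficients Φ, φ, α of the quadratic separative surface, obtained by taking the log of the equal-likelihood equation. It can be used to check the correction that follows.

```python
import numpy as np

# Training points of the two-dimensional example
X1 = np.array([[1, 1], [0, 4], [3, 3], [4, 0]], dtype=float)
X2 = np.array([[4, 0], [7, 1], [8, 4], [5, 3]], dtype=float)

def ml_estimates(X):
    mu = X.mean(axis=0)
    D = X - mu
    Q = D.T @ D / len(X)   # ML covariance: divide by m_i, not m_i - 1
    return mu, Q

mu1, Q1 = ml_estimates(X1)   # mu1 = (2, 2)
mu2, Q2 = ml_estimates(X2)   # mu2 = (6, 2)

# x^T Phi x + x^T phi + alpha = 0, from equating the log-likelihoods:
# -1/2 (x-mu_i)^T Qi^-1 (x-mu_i) - 1/2 ln|Qi|  =  same expression for class j
Q1i, Q2i = np.linalg.inv(Q1), np.linalg.inv(Q2)
Phi = -0.5 * (Q1i - Q2i)
phi = Q1i @ mu1 - Q2i @ mu2
alpha = (-0.5 * (mu1 @ Q1i @ mu1 - mu2 @ Q2i @ mu2)
         - 0.5 * np.log(np.linalg.det(Q1) / np.linalg.det(Q2)))
print(Phi, phi, alpha)
```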
Example with two dimensions: correction
    det(M) = ad − bc
    µ_1 = (2, 2)^T, Q_1 = ...
    µ_2 = (6, 2)^T, Q_2 = ...
    p(x|ω_1) = ...

A more complex case: modeling with a mixture of Gaussians
Mixture of K Gaussians:
    p(x|ω_i) = Σ_{k=1}^{K} α_k · (|Q_k|^{-1/2} / (2π)^{d/2}) · exp(−(1/2)(x − µ_k)^T Q_k^{-1} (x − µ_k))
with Σ_{k=1}^{K} α_k = 1.
For each class ω_i, we estimate every parameter:
- the mean of each Gaussian: {µ_1, ..., µ_K}
- the covariance of each Gaussian: {Q_1, ..., Q_K}
- the mixture weights: {α_1, ..., α_K}
with an EM (Expectation-Maximization) algorithm.

A simplified case: the naive Bayesian classification
In that case, the naive hypothesis (as previously defined) means the features are not correlated: each class has a diagonal covariance matrix. The probability of observing x^T = (x_1, ..., x_d) for a point of any class ω_i is then the product of the probability of observing a_1 = x_1 for this class, times the probability of observing a_2 = x_2 for this class, and so on. Thus, by definition:
    ω* = ArgMax_{i ∈ {1,...,C}} P(ω_i) ∏_{j=1}^{d} p(x_j|ω_i)
Each value p(x_j|ω_i) is estimated by counting in an interval (a monodimensional histogram).

Non-parametric estimation
- let x be a point whose class is unknown
- we estimate the probability densities around x and then apply the Bayesian classification
- for each class ω_i, we have the same problem: we have m_i training points that we suppose to have been drawn independently (Fr: tirages indépendants) in R^d according to an unknown density p(x|ω_i)
- how does one estimate p(x|ω) at a point x from these m training points?

How to do it
We define around x a region R_m of volume V_m and we count the number k_m of points of the training set that fall in this region. Estimating p(x|ω) from a sample of size m:
    p_m(x|ω) = (k_m / m) / V_m
where V_m is the volume of the region R_m considered. When m increases, this estimator converges to p(x|ω) if:
    lim V_m = 0,   lim k_m = ∞,   lim (k_m / m) = 0
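A minimal sketch of this counting estimator, assuming a Euclidean ball of radius ρ as the region R_m; the sample, the dimension and the radius are illustrative.

```python
import numpy as np
from math import gamma, pi

def density_estimate(x, X, rho):
    """p_m(x) = (k_m / m) / V_m, with R_m a Euclidean ball of radius rho."""
    m, d = X.shape
    k_m = np.sum(np.linalg.norm(X - x, axis=1) <= rho)  # points inside the ball
    V_m = pi ** (d / 2) / gamma(d / 2 + 1) * rho ** d   # volume of a d-ball
    return (k_m / m) / V_m

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))   # m = 1000 points drawn from N(0, I) in R^2
# the true density at the origin is 1/(2*pi) ~ 0.159
print(density_estimate(np.zeros(2), X, rho=0.3))
```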
Explanation
- probability P_m that x falls into the region R_m: P_m = ∫_{R_m} p(x|ω) dx
- if the m sample points are i.i.d. draws from p(x|ω), the probability that k_m among them fall into R_m is binomial:
    C(m, k_m) · P_m^{k_m} · (1 − P_m)^{m − k_m}
- from this distribution, we know that the expectation of k_m is m·P_m, so k_m/m is an estimator of P_m
- if V_m is small enough for p(x|ω) to be nearly constant over R_m, then we have:
    P_m = ∫_{R_m} p(x|ω) dx ≈ p(x|ω) · V_m
  and so p(x|ω) ≈ (k_m/m) / V_m

The points are independent draws w.r.t. a certain distribution in R², whose density is higher at point A than at point B. For the same volume around A and B, k_m is respectively 6 and 1. To get k_m = 6 around B, one has to augment the volume.

Non-parametric Bayesian learning
The density p(x|ω) is estimated by the proportion of examples belonging to class ω around x. There are 2 solutions:
- Parzen windows (Fr: fenêtres de Parzen): subdivision of the space into balls of fixed radius ρ centered on x; let N(x) be the number of points of class ω contained in the ball:
    p_m(x|ω) ∝ N(x) / ρ^d
- K-nearest neighbors (Fr: K-plus proches voisins), or Knn: form balls with a variable radius ρ_K(x) but containing exactly K (fixed) points from the training set (the K nearest neighbors of x):
    p_m(x|ω) ∝ K / ρ_K(x)^d

Parzen windows: estimating with kernels
This technique is more generally described by:
    p_m(x|ω) = (1/m) Σ_{i=1}^{m} (1/V_m) κ(x, x_i)
The function κ(x, x_i) is centered on x_i and decreases as x gets far from x_i; its integral is the volume V_m. For example, κ may be a rectangle with varying length/width (constant surface), or a Gaussian with varying variance (constant integral).

Example (rectangular kernels)
Estimating the density with the Parzen windows method: there are 4 training points, in a 1-dimensional space. The density (plain line) is computed as the sum of the windows centered on each point. Here, the window is narrow (h is small): the density is not smooth.
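A minimal 1-D sketch in the spirit of this example; the four point positions and the bandwidth h are illustrative assumptions.

```python
import numpy as np

points = np.array([1.0, 2.0, 4.5, 5.0])   # 4 training points in 1-D
h = 0.5                                    # window width (bandwidth)

def parzen(x, kernel="rect"):
    u = (x - points[:, None]) / h                        # shape (m, len(x))
    if kernel == "rect":
        k = (np.abs(u) <= 0.5).astype(float)             # rectangular window
    else:
        k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return k.sum(axis=0) / (len(points) * h)             # (1/m) sum (1/h) kappa

xs = np.linspace(0, 6, 121)
# with a narrow h the rectangular estimate is spiky; the Gaussian one is smoother
print(parzen(xs).max(), parzen(xs, "gauss").max())
```

Increasing h smooths the estimate, as the next two examples illustrate.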
Example (rectangular kernels)
Same estimation with a greater h: the density is more smoothed.

Example (Gaussian kernels)
Same estimation with Gaussian kernels: the density is very smoothed.

Parzen windows: computing problems
    p_m = (1/m) Σ_{i=1}^{m} (1/V_m) κ(x, x_i)
To avoid computing the sum over m terms, one can use a kernel function for κ, i.e. a function such that there exists, in an n-dimensional space, a function Φ with:
    κ(x, y) = ⟨Φ(x), Φ(y)⟩
Then:
    p_m = (1/m) Σ_{i=1}^{m} (1/V_m) κ(x, x_i)
        = (1/m) (1/V_m) Σ_{i=1}^{m} ⟨Φ(x), Φ(x_i)⟩
        = (1/m) (1/V_m) ⟨Φ(x), Σ_{i=1}^{m} Φ(x_i)⟩
Σ_{i=1}^{m} Φ(x_i) is pre-computed once and for all; only one scalar product in n dimensions remains.

K-nearest neighbors algorithm
    Begin
      for each example (y, ω) in the training set do
        compute the distance D(y, x) between y and x
      end for
      among the K nearest points of x, compute the number of occurrences of each class
      assign to x the most frequent class found
    End
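A direct Python transcription of this pseudocode, assuming Euclidean distance and a small illustrative training set:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=3):
    # compute the distance D(y, x) between each training point y and x
    distances = np.linalg.norm(X_train - x, axis=1)
    # among the K nearest points of x, count the occurrences of each class
    nearest = np.argsort(distances)[:K]
    votes = Counter(y_train[i] for i in nearest)
    # assign to x the most frequent class found
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [0.0, 4.0], [3.0, 3.0],
                    [4.0, 0.0], [7.0, 1.0], [8.0, 4.0]])
y_train = np.array([1, 1, 1, 2, 2, 2])
print(knn_classify(np.array([5.0, 2.0]), X_train, y_train, K=3))
```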
[Figures: decision with 1-NN and 3-NN for 2 classes; decision with 1-NN and 3-NN for 3 classes.]

K-NN: validity (1)
The K-NN decision rule approximates the Bayesian one, since it implicitly makes a comparative estimation of the probability density of the classes occurring in the neighborhood of x, and then chooses the most probable one.
Let us suppose that, among the m training points, m_i are of class ω_i, and that among the K nearest neighbors of x there are K_{m_i} examples of class ω_i. Then:
    p_m(x|ω_i) = (K_{m_i} / m_i) / V_m
Since m_i/m is an estimator of P(ω_i), the a priori probability of the class ω_i, one can write: m_i/m = P_m(ω_i).

K-NN: validity (2)
From that, we have:
    K_{m_i} = p_m(x|ω_i) · P_m(ω_i) · m · V_m
Consequently, the class maximizing K_{m_i} also maximizes p_m(x|ω_i) · P_m(ω_i) and so, by the Bayes rule, it also maximizes P_m(ω_i|x) · p(x). This class is coherent with the Bayesian classification rule since it maximizes P_m(ω_i|x). (A numerical sketch follows below.)

K-NN: validity (3)
To complete that, we need to demonstrate that this method fulfills the requirements expressed previously. For a fixed K and a growing m, for each class we have:
    V_m → 0 and k_m/m → 0
The probability of error E_KNN of the K-NN rule converges toward the Bayesian one when m increases.

K-NN: in practice
Choosing K: diverse practical and theoretical considerations lead to this heuristic:
    K ≈ √(m/C)
where m/C is the average number of training points per class. It is worth noting that d, the dimension of the representation space, does not appear in this formula.
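The validity argument above suggests estimating the a posteriori probability directly as the fraction K_{m_i}/K of the K neighbors belonging to each class (this reading is implied by, not stated in, the slides). A minimal sketch with assumed toy data:

```python
import numpy as np
from collections import Counter

def knn_posteriors(x, X_train, y_train, K):
    # P_m(omega_i | x) ~ K_{m_i} / K: fraction of the K neighbors of class i
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:K]
    counts = Counter(y_train[i] for i in nearest)
    return {c: counts[c] / K for c in np.unique(y_train)}

X_train = np.array([[1.0, 1.0], [0.0, 4.0], [3.0, 3.0],
                    [4.0, 0.0], [7.0, 1.0], [8.0, 4.0]])
y_train = np.array([1, 1, 1, 2, 2, 2])
print(knn_posteriors(np.array([5.0, 2.0]), X_train, y_train, K=3))
```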
K-NN: in practice
What to do in case of ties (Fr: égalité)?
- choose a higher value for K; but the tie may persist
- another solution may be to decide randomly which class to assign
- another solution is to weight the votes of the neighbors by their distance to the point (see the sketch at the end of this section)

Separative surfaces for K-NN
Voronoi area:
- the Voronoi area of an example is the part of R^d in which each point is closer to this example than to any other
- it is the intersection of m − 1 half-spaces, defined by the bisector hyperplanes between this example and every other one
- for K = 1, the separative surface between 2 classes is the surface separating the two volumes obtained by the union of the Voronoi areas of the examples of each class

[Figure: set of points with their Voronoi areas (K = 1).]
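A minimal sketch of the distance-weighted vote mentioned above, using inverse-distance weights (one common choice among several; the data are illustrative):

```python
import numpy as np
from collections import defaultdict

def weighted_knn(x, X_train, y_train, K=3, eps=1e-9):
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:K]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (d[i] + eps)   # closer neighbors weigh more
    return max(scores, key=scores.get)

X_train = np.array([[1.0, 1.0], [3.0, 3.0], [4.0, 0.0], [7.0, 1.0]])
y_train = np.array([1, 1, 2, 2])
print(weighted_knn(np.array([3.5, 1.5]), X_train, y_train, K=3))
```

Weighting breaks most ties, since two classes rarely accumulate exactly the same total weight.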