Clustering. CM226: Machine Learning for Bioinformatics. Fall 2016. Sriram Sankararaman. Acknowledgments: Fei Sha, Ameet Talwalkar.



Administration
HW 1 due on Monday. Post on CCLE if you have questions.
No class on October 19. OH moved to Monday 1pm.

Supervised versus Unsupervised Learning
Supervised: learning from labeled observations. Labels teach the algorithm to learn a mapping from observations to labels. Examples: classification, regression. Topic of the last three lectures.
Unsupervised: learning from unlabeled observations. The learning algorithm must find latent structure from the features alone. This can be a goal in itself (discover hidden patterns, exploratory analysis) or a means to an end (preprocessing for a supervised task). Examples: clustering; dimensionality reduction (transform an initial feature representation into a more concise representation); modeling a complex distribution from simple distributions.

Outline
Clustering: K-means; Gaussian mixture models
EM Algorithm
Model selection

Clustering Setup
Given D = {x_n}_{n=1}^N and K, we want to output
{µ_k}_{k=1}^K: the prototypes of the clusters
A(x_n) ∈ {1, 2, ..., K}: the cluster membership, i.e., the cluster ID assigned to x_n
Toy example: cluster the data into two clusters. [Figure: scatter plots (a) unclustered and (i) clustered into two groups.]
Definition: group data points so that points within a group are more similar than points across groups.

K-means example
[Figure: panels (a)-(i) showing successive K-means iterations on a 2-D toy dataset, alternating cluster assignments and prototype updates until convergence.]

K-means clustering
Intuition: data points assigned to cluster k should be close to µ_k, the prototype.
Distortion measure (clustering objective function, cost function):
J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||_2^2
where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k.
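To make the objective concrete, here is a minimal numpy sketch (function and variable names are my own, not from the lecture) that evaluates the distortion J for a given hard assignment and set of prototypes.

```python
import numpy as np

def distortion(X, assignments, mu):
    """J = sum_n ||x_n - mu_{A(x_n)}||_2^2 for hard assignments."""
    # X: (N, d) data; assignments: (N,) cluster IDs in {0, ..., K-1}; mu: (K, d) prototypes
    diffs = X - mu[assignments]          # each point minus its own prototype
    return float(np.sum(diffs ** 2))

# toy check: two well-separated 1-D clusters
X = np.array([[0.1], [0.2], [4.0], [4.1]])
mu = np.array([[0.15], [4.05]])
print(distortion(X, np.array([0, 0, 1, 1]), mu))   # small J for the natural assignment
print(distortion(X, np.array([1, 1, 0, 0]), mu))   # much larger J for the swapped one
```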

Algorithm
Minimize the distortion measure by alternating optimization between {r_nk} and {µ_k}.
Step 0: Initialize {µ_k} to some values.
Step 1: Keeping the current value of {µ_k} fixed, minimize J over {r_nk}, which leads to the cluster assignment rule
r_nk = 1 if k = argmin_j ||x_n − µ_j||_2^2, and r_nk = 0 otherwise.
Step 2: Keeping the current value of {r_nk} fixed, minimize J over {µ_k}, which leads to the following rule for updating the prototypes of the clusters:
µ_k = Σ_n r_nk x_n / Σ_n r_nk
Step 3: Determine whether to stop or return to Step 1.
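A compact numpy sketch of the alternating procedure (my own names; initialization, ties, and empty clusters are handled in the simplest possible way):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate Step 1 (assignment) and Step 2 (prototype update) until assignments stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # Step 0: init prototypes at random data points
    assignments = np.full(len(X), -1)
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest prototype
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        new_assignments = d2.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):           # Step 3: stop when nothing changes
            break
        assignments = new_assignments
        # Step 2: recompute each prototype as the mean of its assigned points
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:
                mu[k] = members.mean(axis=0)
    return mu, assignments
```

For example, kmeans(np.array([[0.1], [0.2], [4.0], [4.1]]), 2) converges in a couple of iterations to prototypes near 0.15 and 4.05.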

Remarks
The prototype µ_k is the mean of the data points assigned to cluster k, hence "K-means".
The procedure reduces J in both Step 1 and Step 2, and thus makes improvements on each iteration.
There is no guarantee we find the global solution; the quality of the local optimum depends on the initial values chosen at Step 0 (k-means++ is a neat approximation algorithm).

Probabilistic interpretation of clustering?
We can impose a probabilistic interpretation on our intuition that points stay close to their cluster centers. How can we model p(x) to reflect this?
[Figure (b): 2-D scatter plot of the data.]
The data points seem to form 3 clusters.
We cannot model p(x) with simple and known distributions. E.g., the data is not a Gaussian because we have 3 distinct concentrated regions.

Gaussian mixture models: intuition
[Figure (a): the same data, colored by region.]
We can model each region with a distinct distribution. It is common to use Gaussians, i.e., Gaussian mixture models (GMMs) or mixtures of Gaussians (MoGs).
We don't know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components! We need to learn them all from our unlabeled data D = {x_n}_{n=1}^N.

Gaussian mixture models: formal definition
A Gaussian mixture model has the following density function for x:
p(x) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)
K: the number of Gaussians; they are called (mixture) components.
µ_k and Σ_k: mean and covariance matrix of the k-th component.
π_k: mixture weights; they represent how much each component contributes to the final distribution (priors). They satisfy two properties: ∀k, π_k > 0, and Σ_k π_k = 1. These properties ensure p(x) is a properly normalized probability density function.
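As a sanity check on the definition, the density can be evaluated directly; a minimal scipy sketch with illustrative (made-up) parameter values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, covs):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

# a two-component mixture in 2-D (parameters chosen only for illustration)
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([0.0, 0.0]), pis, mus, covs))
```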

GMM as the marginal distribution of a joint distribution
Consider the following joint distribution
p(x, z) = p(z) p(x | z)
where z is a discrete random variable taking values between 1 and K, i.e., a multinomial random variable.
Observed variables: x. Hidden or latent variables: z.
Denote π_k = p(z = k), and assume the conditional distributions are Gaussian:
p(x | z = k) = N(x | µ_k, Σ_k)
Then the marginal distribution of x is
p(x) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)
namely, the Gaussian mixture model.
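The latent-variable view suggests ancestral sampling: draw z from the mixing weights, then x from the chosen Gaussian. A small sketch with the same illustrative parameters as above:

```python
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def sample_gmm(n):
    """Ancestral sampling: z ~ Categorical(pi), then x | z = k ~ N(mu_k, Sigma_k)."""
    zs = rng.choice(len(pis), size=n, p=pis)                             # hidden component labels
    xs = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in zs])
    return xs, zs

X, Z = sample_gmm(5)
print(Z)   # latent assignments (unobserved in the clustering setting)
print(X)   # observed data
```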

GMMs: example
[Figures (a), (b): the data colored by region, and the corresponding mixture density.]
The conditional distributions of x given z (representing color) are
p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)
The marginal distribution is thus
p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)

Parameter estimation for Gaussian mixture models
The parameters in GMMs are θ = {π_k, µ_k, Σ_k}_{k=1}^K. To estimate them, consider the simple (and unrealistic) case first.
We have labels z: if we assume z_n is observed for every x_n, then our estimation problem is easier to solve. In fact, it is a supervised learning problem. Our training data is augmented: {x_n, z_n}_{n=1}^N instead of {x_n}_{n=1}^N, where z_n denotes the region that x_n comes from. The former is the complete data and the latter the incomplete data.
How can we learn our parameters? The maximum likelihood estimate is obtained by maximizing the complete log-likelihood
θ* = argmax_θ LL_c(θ) = argmax_θ Σ_n log P(x_n, z_n)

Parameter estimation for GMMs: complete data
The complete likelihood is decomposable:
LL_c(θ) = Σ_n log P(x_n, z_n) = Σ_n log P(z_n) p(x_n | z_n)
Introduce a binary variable z_nk ∈ {0, 1} to indicate whether z_n = k. Then
P(z_n) = Π_{k=1}^K P(z_n = k)^{1{z_n = k}} = Π_{k=1}^K P(z_n = k)^{z_nk}
P(x_n | z_n) = Π_{k=1}^K P(x_n | z_n = k)^{1{z_n = k}} = Π_{k=1}^K P(x_n | z_n = k)^{z_nk}
We then have
LL_c(θ) = Σ_n log Π_k [P(z_n = k) p(x_n | z_n = k)]^{z_nk}

Parameter estimation for GMMs: complete data
LL_c(θ) = Σ_n log Π_k [P(z_n = k) p(x_n | z_n = k)]^{z_nk}
= Σ_n Σ_k z_nk log [P(z_n = k) p(x_n | z_n = k)]
= Σ_n Σ_k z_nk [log P(z_n = k) + log p(x_n | z_n = k)]
We use the dummy variables z_nk to denote all the possible cluster assignment values for x_n; the complete data specifies these values.

Parameter estimation for GMMs: complete data
From our previous discussion, we have
LL_c(θ) = Σ_n Σ_k z_nk [log π_k + log N(x_n | µ_k, Σ_k)]
Regrouping, we have
LL_c(θ) = Σ_k Σ_n z_nk log π_k + Σ_k {Σ_n z_nk log N(x_n | µ_k, Σ_k)}
The term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as an exercise) that the MLE is:
π_k = Σ_n z_nk / Σ_{k'} Σ_n z_nk'
µ_k = (1 / Σ_n z_nk) Σ_n z_nk x_n
Σ_k = (1 / Σ_n z_nk) Σ_n z_nk (x_n − µ_k)(x_n − µ_k)^T
What's the intuition?
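When the labels are observed, these MLEs are just per-component counts, means, and covariances; a minimal numpy sketch (names are mine, not from the lecture):

```python
import numpy as np

def complete_data_mle(X, z, K):
    """MLE of (pi_k, mu_k, Sigma_k) when the component label z_n of every x_n is observed."""
    N = len(X)
    pis, mus, covs = [], [], []
    for k in range(K):
        Xk = X[z == k]                        # all points whose label is k
        pis.append(len(Xk) / N)               # fraction of points in component k
        mus.append(Xk.mean(axis=0))           # their mean
        covs.append(np.cov(Xk.T, bias=True))  # their covariance (bias=True gives the 1/N_k MLE)
    return np.array(pis), np.array(mus), np.array(covs)
```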

Intuition
Since z_nk is binary, the previous solution is nothing but:
For π_k: count the number of data points whose z_n is k and divide by the total number of data points (note that Σ_n Σ_k z_nk = N).
For µ_k: take all the data points whose z_n is k and compute their mean.
For Σ_k: take all the data points whose z_n is k and compute their covariance matrix.
This intuition is going to help us develop an algorithm for estimating θ when we do not know z_n (incomplete data).

Parameter estimation for GMMs: incomplete data
When z_n is not given, we can guess it via the posterior probability
p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n) = p(x_n | z_n = k) p(z_n = k) / Σ_{k'=1}^K p(x_n | z_n = k') p(z_n = k')
To compute the posterior probability, we need to know the parameters θ!
Let's pretend we know the value of the parameters, so we can compute the posterior probability. How is that going to help us?
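Given (pretend-known) parameters, the posterior is a normalized product of prior weight and Gaussian likelihood; a short sketch computing all r_nk at once:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, covs):
    """r_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_k' pi_k' N(x_n | mu_k', Sigma_k')."""
    K = len(pis)
    # unnormalized posterior: prior weight times Gaussian likelihood, one column per component
    R = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
                         for k in range(K)])
    return R / R.sum(axis=1, keepdims=True)   # normalize each row over the K components
```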

Estimation with soft r_nk
We define r_nk = p(z_n = k | x_n).
Recall that z_nk should be binary; r_nk is a soft assignment of x_n to the k-th component. Each x_n is assigned to a component fractionally, according to r_nk.
Use the soft assignments in the complete log-likelihood:
l(θ) = Σ_k Σ_n r_nk log π_k + Σ_k {Σ_n r_nk log N(x_n | µ_k, Σ_k)}
Using the soft r_nk in LL_c, we get the same expressions for the MLE!
π_k = Σ_n r_nk / Σ_{k'} Σ_n r_nk'
µ_k = (1 / Σ_n r_nk) Σ_n r_nk x_n
Σ_k = (1 / Σ_n r_nk) Σ_n r_nk (x_n − µ_k)(x_n − µ_k)^T
But remember, we're cheating by using θ to compute r_nk!
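With soft r_nk in place of the binary z_nk, the updates become weighted counts, weighted means, and weighted covariances; a sketch that consumes the responsibility matrix from the previous snippet:

```python
import numpy as np

def soft_mle(X, R):
    """Weighted MLE, with R[n, k] = r_nk the soft responsibilities."""
    N, d = X.shape
    Nk = R.sum(axis=0)                                      # effective number of points per component
    pis = Nk / N                                            # pi_k = sum_n r_nk / N
    mus = (R.T @ X) / Nk[:, None]                           # mu_k = (1/N_k) sum_n r_nk x_n
    covs = np.empty((R.shape[1], d, d))
    for k in range(R.shape[1]):
        diff = X - mus[k]
        covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return pis, mus, covs
```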

Iterative procedure
We can alternate between estimating r_nk and using the estimated r_nk to compute the parameters (same idea as with K-means!):
Step 0: initialize θ with some values (random or otherwise)
Step 1: compute r_nk using the current θ
Step 2: update θ using the just-computed r_nk
Step 3: go back to Step 1
Questions: Is this procedure reasonable, i.e., are we optimizing a sensible criterion? Will this procedure converge?
The answers lie in the EM algorithm, a powerful procedure for model estimation with unknown (hidden) data.

GMMs and K-means
GMMs provide a probabilistic interpretation for K-means. GMMs reduce to K-means under the following assumptions (in which case EM for GMM parameter estimation simplifies to K-means):
Assume all Gaussians have σ²I covariance matrices.
Further assume σ → 0, so we only need to estimate the µ_k, i.e., the means.
K-means is often called hard GMM, or GMMs are called soft K-means.
The posterior γ_nk (the r_nk above) provides a probabilistic assignment of x_n to cluster k.
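To see the K-means limit numerically, shrink a shared spherical variance σ² and watch the responsibilities collapse onto the nearest mean; a tiny 1-D sketch with made-up numbers:

```python
import numpy as np

x = 1.0                                    # a point lying between two component means
mus = np.array([0.0, 3.0])
for sigma in [2.0, 0.5, 0.1, 0.01]:
    # equal-weight spherical Gaussians: r_k is proportional to exp(-(x - mu_k)^2 / (2 sigma^2))
    logits = -(x - mus) ** 2 / (2 * sigma ** 2)
    r = np.exp(logits - logits.max())      # subtract the max for numerical stability
    r /= r.sum()
    print(sigma, np.round(r, 4))           # tends to the hard assignment [1, 0] as sigma -> 0
```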

Outline
Clustering: K-means; Gaussian mixture models
EM Algorithm
Model selection

EM algorithm: motivation and setup
As a general procedure, EM is used to estimate parameters for probabilistic models with hidden/latent variables.
Suppose the model is given by a joint distribution with marginal p(x | θ) = Σ_z p(x, z | θ), where x is the observed random variable and z is hidden.
We are given data containing only the observed variable, D = {x_n}, where the corresponding hidden variable values z_n are not included. Our goal is to obtain the maximum likelihood estimate of θ. Namely, we choose
θ* = argmax_θ LL(θ) = argmax_θ Σ_n log p(x_n | θ) = argmax_θ Σ_n log Σ_{z_n} p(x_n, z_n | θ)
The objective function LL(θ) is called the incomplete log-likelihood.

Expected (complete) log-likelihood
θ* = argmax_θ Σ_n log Σ_{z_n} p(x_n, z_n | θ)
The difficulty with the incomplete log-likelihood is that it needs to sum over all possible values that z_n can take and then take a logarithm:
LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ)
This log-sum format makes the computation intractable.

Expected (complete) log-likelihood
If we knew the z_n (complete data setting), optimizing the likelihood would be easy. Instead, the EM algorithm uses a clever trick to change this into a sum-log form:
Q_q(θ) = Σ_n E_{z_n ∼ q(z_n)} log P(x_n, z_n | θ) = Σ_n Σ_{z_n} q(z_n) log P(x_n, z_n | θ)
which is called the expected (complete) log-likelihood (with respect to q(z)). Here q(z) is a distribution over z. Note that Q_q(θ) takes the sum-log form, which turns out to be tractable.

Examples
Consider the previous model where x could come from 3 regions. We can choose q(z) to be any valid distribution; different choices lead to different Q_q(θ). Note that z here represents the different colors.
q(z = k) = 1/3 for any of the 3 colors. This gives rise to
Q_q(θ) = Σ_n (1/3) [log P(x_n, red | θ) + log P(x_n, blue | θ) + log P(x_n, green | θ)]
q(z = k) = 1/2 for red and blue, 0 for green. This gives rise to
Q_q(θ) = Σ_n (1/2) [log P(x_n, red | θ) + log P(x_n, blue | θ)]

Which q(z) to choose?
We will choose a special q(z) = p(z | x; θ'), i.e., the posterior probability of z. We define
Q(θ; θ') = Q_{q = p(z | x; θ')}(θ) = Σ_n E_{z_n ∼ p(z_n | x_n; θ')} [log P(x_n, z_n | θ)]

EM algorithm
We alternate between estimating r_nk and using the estimated r_nk to compute the parameters (same idea as with K-means!):
Step 0: Initialize t ← 0 and θ^(0) with some values (random or otherwise).
Repeat:
Step 1 (E-step): Compute r_nk^(t) = p(z_n = k | x_n, θ^(t)) using the current θ^(t).
Step 2 (M-step): θ^(t+1) ← argmax_θ Q(θ; θ^(t)), i.e., update θ using the just-computed r_nk^(t).
Step 3: t ← t + 1. Go back to Step 1.
until converged.

EM for GMM
Recall the complete log-likelihood
LL_c(θ) = Σ_n Σ_k z_nk [log π_k + log N(x_n | µ_k, Σ_k)]
The Q function for the GMM is
Q(θ; θ^(t)) = E_{θ^(t)} [Σ_n Σ_k z_nk {log π_k + log N(x_n | µ_k, Σ_k)}]
= Σ_n Σ_k E_{θ^(t)}[z_nk] {log π_k + log N(x_n | µ_k, Σ_k)}
= Σ_n Σ_k r_nk^(t) {log π_k + log N(x_n | µ_k, Σ_k)}

EM for GMM
Regrouping,
Q(θ; θ^(t)) = Σ_k Σ_n r_nk^(t) log π_k + Σ_k {Σ_n r_nk^(t) log N(x_n | µ_k, Σ_k)}
We have recovered the parameter estimation algorithm for GMMs discussed previously! Maximizing Q gives the M-step updates to the parameters:
π_k^(t+1) = Σ_n r_nk^(t) / Σ_{k'} Σ_n r_nk'^(t)
µ_k^(t+1) = (1 / Σ_n r_nk^(t)) Σ_n r_nk^(t) x_n
Σ_k^(t+1) = (1 / Σ_n r_nk^(t)) Σ_n r_nk^(t) (x_n − µ_k^(t+1))(x_n − µ_k^(t+1))^T
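Putting the E- and M-steps together gives the full EM loop for a GMM. A compact sketch (my own function, with deliberately simple initialization and only a small ridge as a safeguard against singular covariances):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: alternate responsibilities (E-step) and weighted MLE (M-step)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)   # init means at random data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: r_nk proportional to pi_k N(x_n | mu_k, Sigma_k)
        R = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
                             for k in range(K)])
        ll = np.log(R.sum(axis=1)).sum()          # incomplete log-likelihood at the current parameters
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted counts, means, and covariances
        Nk = R.sum(axis=0)
        pis = Nk / N
        mus = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, covs, ll
```

The log-likelihood computed at the top of each iteration is non-decreasing across iterations, which is the EM guarantee discussed later in the lecture.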

Why does EM work?
EM: construct a lower bound on LL(θ) (E-step) and optimize it (M-step).
If we define q(z) as a distribution over z, then
LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ) = Σ_n log Σ_{z_n} q(z_n) [p(x_n, z_n | θ) / q(z_n)]
≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ) / q(z_n)]
The last step follows from Jensen's inequality, i.e., f(EX) ≥ E f(X) for a concave function f.

Which q(z) to choose?
LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ) = Σ_n log Σ_{z_n} q(z_n) [p(x_n, z_n | θ) / q(z_n)] ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ) / q(z_n)]
The lower bound we derived for LL(θ) holds for all choices of q(·).
We want a tight lower bound, and given some current estimate θ_t, we will pick q(·) such that our lower bound holds with equality at θ_t.
Choose q(z_n) ∝ p(x_n, z_n | θ_t)! Since q(·) is a distribution, we have
q(z_n) = p(x_n, z_n | θ_t) / Σ_k p(x_n, z_n = k | θ_t) = p(x_n, z_n | θ_t) / p(x_n | θ_t) = p(z_n | x_n; θ_t)
This is the posterior distribution of z_n given x_n and θ_t.
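A quick numerical check of the tightness claim on a tiny 1-D, two-component mixture with made-up parameters: with q(z_n) set to the posterior at θ_t, the lower bound equals LL(θ_t) exactly.

```python
import numpy as np
from scipy.stats import norm

# made-up 1-D, two-component mixture playing the role of theta_t
pis = np.array([0.4, 0.6])
mus = np.array([0.0, 4.0])
sds = np.array([1.0, 1.5])
x = np.array([-0.5, 0.3, 3.8, 5.1])                        # a few observed points

joint = pis * norm.pdf(x[:, None], loc=mus, scale=sds)     # p(x_n, z_n = k | theta_t), shape (N, K)
LL = np.log(joint.sum(axis=1)).sum()                       # incomplete log-likelihood

q = joint / joint.sum(axis=1, keepdims=True)               # posterior p(z_n = k | x_n; theta_t)
bound = (q * np.log(joint / q)).sum()                      # sum_n sum_z q(z) log[p(x, z)/q(z)]

print(LL, bound)   # equal up to floating point: the bound is tight at theta_t
```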

E and M Steps
Our simplified expression:
LL(θ_t) = Σ_n Σ_{z_n} p(z_n | x_n; θ_t) log [p(x_n, z_n | θ_t) / p(z_n | x_n; θ_t)]
E-Step: For all n, compute q(z_n) = p(z_n | x_n; θ_t).
Why is this called the E-Step? Because we can view it as computing the expected (complete) log-likelihood:
Q(θ | θ_t) = Σ_n Σ_{z_n} p(z_n | x_n; θ_t) log p(x_n, z_n | θ) = Σ_n E_q log p(x_n, z_n | θ)
M-Step: Maximize Q(θ | θ_t), i.e., θ_{t+1} = argmax_θ Q(θ | θ_t).

Iterative and monotonic improvement
We can show that LL(θ_{t+1}) ≥ LL(θ_t). Recall that we chose q(·) in the E-step such that
LL(θ_t) = Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ_t) / q(z_n)]
In the M-step, θ_{t+1} is chosen to maximize the right-hand side as a function of θ. Since the lower bound holds for every θ,
LL(θ_{t+1}) ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ_{t+1}) / q(z_n)] ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ_t) / q(z_n)] = LL(θ_t)
which proves the desired result.
Note: the EM procedure converges, but only to a local optimum. Run the algorithm with random initializations and pick the best solution.

Bayesian viewpoint
We can impose a prior on θ. This reduces over-fitting. We can then compute the maximum a posteriori (MAP) estimate, which is equivalent to adding a regularizer:
θ̂ = argmax_θ log P(D | θ) + log P(θ)
EM can be extended to this setting.

Outline
Clustering: K-means; Gaussian mixture models
EM Algorithm
Model selection

Model selection: choosing K
How well does the model fit the data for different K?
Problem: the model will fit better with larger K.
We can instead evaluate the fit on independent data not used in training.

Model selection: choosing K
Bayesian solution: the prior controls model complexity:
K* = argmax_K P(D | K) = argmax_K ∫ P(D | θ) P(θ | K) dθ
It is challenging to evaluate this integral.
Other approximate methods to control model complexity:
Akaike Information Criterion (AIC): K* = argmax_K 2 LL(θ̂) − 2k
Bayesian Information Criterion (BIC): K* = argmax_K 2 LL(θ̂) − k log(n)
θ̂: the MLE; k: the number of parameters; n: the number of data points.
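If scikit-learn is available, its GaussianMixture exposes AIC and BIC directly (in the lower-is-better convention, equivalent to maximizing 2 LL − 2k and 2 LL − k log n as above); a small sketch on synthetic data where the criteria typically favor K = 3:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: three well-separated 2-D blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0.0, 0.0], [4.0, 0.0], [0.0, 4.0])])

for K in range(1, 7):
    gm = GaussianMixture(n_components=K, random_state=0).fit(X)
    # sklearn's aic/bic are reported in the "lower is better" convention
    print(K, round(gm.aic(X), 1), round(gm.bic(X), 1))
```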

Model selection: choosing K
Bayesian nonparametrics: put a prior on K! More details in future lectures.

Summary
Unsupervised learning: finding structure in data without labels.
Clustering: K-means; GMMs as a probabilistic model for K-means.
Inference on hidden or latent variables: more challenging than supervised learning.
EM algorithm: a principled approach to inference in these models.
