Clustering. CM226: Machine Learning for Bioinformatics. Fall 2016. Sriram Sankararaman. Acknowledgments: Fei Sha, Ameet Talwalkar.
Administration. HW 1 due on Monday. Post on CCLE if you have questions. No class on October 19. OH moved to Monday 1pm.
Supervised versus Unsupervised Learning. Supervised: learning from labeled observations. Labels teach the algorithm to learn a mapping from observations to labels. Classification, Regression. Topic of the last three lectures. Unsupervised: learning from unlabeled observations. The learning algorithm must find latent structure from the features alone. Can be a goal in itself (discover hidden patterns, exploratory analysis) or a means to an end (preprocessing for a supervised task). Examples: Clustering; Dimensionality Reduction (transform an initial feature representation into a more concise representation); Modeling a complex distribution from simple distributions.
Outline: Clustering; K-means; Gaussian mixture models; EM Algorithm; Model selection.
Clustering. Setup: given D = {x_n}_{n=1}^N and K, we want to output {μ_k}_{k=1}^K, the prototypes of the clusters, and A(x_n) ∈ {1, 2, ..., K}, the cluster membership, i.e., the cluster ID assigned to x_n. Toy example: cluster data into two clusters. [Figure: toy data, panels (a) and (i), before and after clustering.] Definition: group data points so that points within a group are more similar than points across groups.
K-means example. [Figure: panels (a) through (i), showing successive K-means iterations on the toy data.]
K-means clustering. Intuition: data points assigned to cluster k should be close to μ_k, the prototype. Distortion measure (clustering objective function, cost function):

J = Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − μ_k||²_2,

where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k.
Algorithm. Minimize the distortion measure by alternating optimization between {r_nk} and {μ_k}. Step 0: Initialize {μ_k} to some values. Step 1: Holding the current value of {μ_k} fixed, minimize J over {r_nk}, which leads to the following cluster assignment rule: r_nk = 1 if k = arg min_j ||x_n − μ_j||²_2, and 0 otherwise. Step 2: Holding the current value of {r_nk} fixed, minimize J over {μ_k}, which leads to the following rule to update the prototypes of the clusters: μ_k = Σ_n r_nk x_n / Σ_n r_nk. Step 3: Determine whether to stop or return to Step 1.
Remarks. The prototype μ_k is the mean of the data points assigned to cluster k, hence "K-means". The procedure reduces J in both Step 1 and Step 2 and thus makes an improvement on each iteration. There is no guarantee that we find the global solution; the quality of the local optimum depends on the initial values chosen at Step 0 (k-means++ is a neat approximation algorithm).
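The steps above can be sketched in a few lines of NumPy (a minimal illustration, not code from the lecture; the empty-cluster guard and the convergence test are my own choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: alternate assignments (Step 1) and mean updates (Step 2)."""
    rng = np.random.default_rng(seed)
    # Step 0: initialize prototypes to K distinct random data points
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest prototype (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
        assign = d2.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points
        # (empty clusters keep their old prototype, a pragmatic guard)
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                           else mu[k] for k in range(K)])
        # Step 3: stop once the prototypes no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = d2[np.arange(len(X)), assign].sum()   # distortion measure at convergence
    return mu, assign, J
```

Each iteration can only lower J, so on any fixed data set the loop terminates.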
Probabilistic interpretation of clustering? We can impose a probabilistic interpretation of our intuition that points stay close to their cluster centers. How can we model p(x) to reflect this? [Figure (b): scatter plot of the data.] The data points seem to form 3 clusters, so we cannot model p(x) with simple, known distributions. E.g., the data is not a Gaussian, because we have 3 distinct concentrated regions.
Gaussian mixture models: intuition. [Figure (a): the same data, colored by region.] We can model each region with a distinct distribution. It is common to use Gaussians, i.e., Gaussian mixture models (GMMs) or mixtures of Gaussians (MoGs). We don't know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components! We need to learn them all from our unlabeled data D = {x_n}_{n=1}^N.
Gaussian mixture models: formal definition. A Gaussian mixture model has the following density function for x:

p(x) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k),

where K is the number of Gaussians (called the mixture components), μ_k and Σ_k are the mean and covariance matrix of the k-th component, and the π_k are the mixture weights: they represent how much each component contributes to the final distribution (priors). They satisfy two properties: ∀k, π_k > 0, and Σ_k π_k = 1. These properties ensure that p(x) is a properly normalized probability density function.
GMM as the marginal distribution of a joint distribution. Consider the following joint distribution: p(x, z) = p(z) p(x | z), where z is a discrete random variable taking values between 1 and K, i.e., a multinomial random variable. Observed variables: x. Hidden or latent variables: z. Denote π_k = p(z = k), and assume the conditional distributions are Gaussian: p(x | z = k) = N(x | μ_k, Σ_k). Then the marginal distribution of x is

p(x) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k),

namely, the Gaussian mixture model.
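The mixture density and the latent-variable sampling view can both be written out directly (a NumPy sketch under the definitions above; the function names are mine):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(x | mu, Sigma) for a single point x (1-D array)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def gmm_density(x, pis, mus, Sigmas):
    """Marginal density p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Sample via the latent-variable view: draw z ~ pi, then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z
```

Marginalizing `sample_gmm` over the discarded z recovers exactly the density that `gmm_density` evaluates.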
GMMs: example. [Figure (a): data colored red, blue, and green.] The conditional distributions between x and z (representing color) are p(x | z = red) = N(x | μ_1, Σ_1), p(x | z = blue) = N(x | μ_2, Σ_2), p(x | z = green) = N(x | μ_3, Σ_3). [Figure (b): the resulting mixture density.] The marginal distribution is thus p(x) = p(red) N(x | μ_1, Σ_1) + p(blue) N(x | μ_2, Σ_2) + p(green) N(x | μ_3, Σ_3).
Parameter estimation for Gaussian mixture models. The parameters in GMMs are θ = {π_k, μ_k, Σ_k}_{k=1}^K. To estimate them, consider the simple (and unrealistic) case first: we have the labels z_n. If we assume z_n is observed for every x_n, then our estimation problem is easier to solve; in fact, it is a supervised learning problem. Our training data is augmented: D' = {x_n, z_n}_{n=1}^N versus D = {x_n}_{n=1}^N, where z_n denotes the region that x_n comes from. D' is the complete data and D the incomplete data. How can we learn our parameters? The maximum likelihood estimate is obtained by maximizing the complete log-likelihood: θ* = arg max_θ LL_c(θ) = arg max_θ Σ_n log P(x_n, z_n).
Parameter estimation for GMMs: complete data. The complete likelihood is decomposable: LL_c(θ) = Σ_n log P(x_n, z_n) = Σ_n log P(z_n) p(x_n | z_n). Introduce a binary variable z_nk ∈ {0, 1} to indicate whether z_n = k. Then

P(z_n) = Π_{k=1}^K P(z_n = k)^{1{z_n = k}} = Π_{k=1}^K P(z_n = k)^{z_nk},
P(x_n | z_n) = Π_{k=1}^K P(x_n | z_n = k)^{1{z_n = k}} = Π_{k=1}^K P(x_n | z_n = k)^{z_nk},

and we then have LL_c(θ) = Σ_n log Π_k (P(z_n = k) p(x_n | z_n = k))^{z_nk}.
Parameter estimation for GMMs: complete data. The complete likelihood is decomposable:

LL_c(θ) = Σ_n log Π_k (P(z_n = k) p(x_n | z_n = k))^{z_nk}
        = Σ_n Σ_k z_nk log (P(z_n = k) p(x_n | z_n = k))
        = Σ_n Σ_k z_nk [log P(z_n = k) + log p(x_n | z_n = k)].

We use the dummy variable z_n to denote all the possible cluster-assignment values for x_n; D' specifies this value in the complete-data setting.
Parameter estimation for GMMs: complete data. From our previous discussion, we have LL_c(θ) = Σ_n Σ_k z_nk [log π_k + log N(x_n | μ_k, Σ_k)]. Regrouping, we have

LL_c(θ) = Σ_k {Σ_n z_nk log π_k} + Σ_k {Σ_n z_nk log N(x_n | μ_k, Σ_k)},

where the term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as an exercise) that the MLE is:

π_k = Σ_n z_nk / N, μ_k = Σ_n z_nk x_n / Σ_n z_nk, Σ_k = Σ_n z_nk (x_n − μ_k)(x_n − μ_k)^T / Σ_n z_nk.

What's the intuition?
Intuition. Since z_nk is binary, the previous solution is nothing but: For π_k, count the number of data points whose z_n is k and divide by the total number of data points (note that Σ_n Σ_k z_nk = N). For μ_k, take all the data points whose z_n is k and compute their mean. For Σ_k, take all the data points whose z_n is k and compute their covariance matrix. This intuition is going to help us develop an algorithm for estimating θ when we do not know z_n (incomplete data).
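The counting intuition translates directly into code (a sketch assuming the labels z are given as integers 0..K-1; the helper name is mine):

```python
import numpy as np

def mle_complete(X, z, K):
    """Complete-data MLE for a GMM: per-cluster counts, means, and covariances."""
    N = len(X)
    pis, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                            # all points whose label is k
        pis.append(len(Xk) / N)                   # fraction of points with z_n = k
        mu = Xk.mean(axis=0)                      # their mean
        mus.append(mu)
        diff = Xk - mu
        Sigmas.append(diff.T @ diff / len(Xk))    # their (biased, MLE) covariance
    return np.array(pis), np.array(mus), np.array(Sigmas)
```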
Parameter estimation for GMMs: incomplete data. When z_n is not given, we can guess it via the posterior probability:

p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n) = p(x_n | z_n = k) p(z_n = k) / Σ_{k'=1}^K p(x_n | z_n = k') p(z_n = k').

To compute the posterior probability, we need to know the parameters θ! Let's pretend we know the value of the parameters, so we can compute the posterior probability. How is that going to help us?
Estimation with soft r_nk. We define r_nk = p(z_n = k | x_n). Recall that z_nk should be binary; r_nk is a soft assignment of x_n to the k-th component, and each x_n is assigned to each component fractionally according to r_nk. Using the soft assignments in the complete log-likelihood,

l(θ) = Σ_k {Σ_n r_nk log π_k} + Σ_k {Σ_n r_nk log N(x_n | μ_k, Σ_k)},

we get the same expression for the MLE:

π_k = Σ_n r_nk / N, μ_k = Σ_n r_nk x_n / Σ_n r_nk, Σ_k = Σ_n r_nk (x_n − μ_k)(x_n − μ_k)^T / Σ_n r_nk.

But remember, we're cheating by using θ to compute r_nk!
Iterative procedure. We can alternate between estimating r_nk and using the estimated r_nk to compute the parameters (the same idea as with K-means!). Step 0: initialize θ with some values (random or otherwise). Step 1: compute r_nk using the current θ. Step 2: update θ using the just-computed r_nk. Step 3: go back to Step 1. Questions: Is this procedure reasonable, i.e., are we optimizing a sensible criterion? Will this procedure converge? The answers lie in the EM algorithm, a powerful procedure for model estimation with unknown data.
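Putting the two steps together gives the full procedure (a minimal NumPy sketch of EM for a GMM, not code from the lecture; the initialization scheme and the small ridge added to the covariances are my own choices, and no log-space stabilization is done):

```python
import numpy as np

def em_gmm(X, K, n_iters=50, seed=0):
    """EM sketch for a GMM: alternate responsibilities (E) and weighted MLEs (M)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: uniform weights, random data points as means, shared data covariance
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)].astype(float)
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        # Step 1 (E-step): responsibilities r_nk = p(z_n = k | x_n, theta)
        r = np.empty((N, K))
        for k in range(K):
            diff = X - mus[k]
            inv = np.linalg.inv(Sigmas[k])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[k]))
            mahal = np.einsum('ni,ij,nj->n', diff, inv, diff)
            r[:, k] = pis[k] * np.exp(-0.5 * mahal) / norm
        r /= r.sum(axis=1, keepdims=True)
        # Step 2 (M-step): the complete-data MLE formulas with soft counts
        Nk = r.sum(axis=0)
        pis = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas, r
```

Replacing each soft r_nk with a hard 0/1 assignment recovers exactly the K-means alternation from earlier in the lecture.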
GMMs and K-means. GMMs provide a probabilistic interpretation of K-means. GMMs reduce to K-means under the following assumptions (in which case EM for GMM parameter estimation simplifies to K-means): assume all Gaussians have σ²I covariance matrices; further assume σ → 0, so we only need to estimate the μ_k, i.e., the means. K-means is often called "hard" GMM, or GMMs are called "soft" K-means. The posterior γ_nk provides a probabilistic assignment of x_n to cluster k.
Outline: Clustering; K-means; Gaussian mixture models; EM Algorithm; Model selection.
EM algorithm: motivation and setup. As a general procedure, EM is used to estimate parameters of probabilistic models with hidden/latent variables. Suppose the model is given by a joint distribution with marginal p(x | θ) = Σ_z p(x, z | θ), where x is the observed random variable and z is hidden. We are given data containing only the observed variable, D = {x_n}, where the corresponding hidden values z_n are not included. Our goal is to obtain the maximum likelihood estimate of θ. Namely, we choose

θ* = arg max_θ LL(θ) = arg max_θ Σ_n log p(x_n | θ) = arg max_θ Σ_n log Σ_{z_n} p(x_n, z_n | θ).

The objective function LL(θ) is called the incomplete log-likelihood.
Expected (complete) log-likelihood. θ* = arg max_θ Σ_n log Σ_{z_n} p(x_n, z_n | θ). The difficulty with the incomplete log-likelihood is that it needs to sum over all possible values that z_n can take, and then take a logarithm: LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ). This log-sum format makes the computation intractable.
Expected (complete) log-likelihood. If we knew the z_n (complete-data setting), optimizing the likelihood would be easy. Instead, the EM algorithm uses a clever trick to change this into sum-log form:

Q_q(θ) = Σ_n E_{z_n ~ q(z_n)} log P(x_n, z_n | θ) = Σ_n Σ_{z_n} q(z_n) log P(x_n, z_n | θ),

which is called the expected (complete) log-likelihood (with respect to q(z)); q(z) is a distribution over z. Note that Q_q(θ) takes the form of sum-log, which turns out to be tractable.
Examples. Consider the previous model where x could come from 3 regions. We can choose q(z) to be any valid distribution; this will lead to different Q_q(θ). Note that z here represents the different colors. Choosing q(z = k) = 1/3 for any of the 3 colors gives rise to Q_q(θ) = (1/3) Σ_n [log P(x_n, red | θ) + log P(x_n, blue | θ) + log P(x_n, green | θ)]. Choosing q(z = k) = 1/2 for red and blue, 0 for green, gives rise to Q_q(θ) = (1/2) Σ_n [log P(x_n, red | θ) + log P(x_n, blue | θ)].
Which q(z) to choose? We will choose a special q(z) = p(z | x; θ'), i.e., the posterior probability of z. We define Q(θ; θ') = Q_{q = p(z | x; θ')}(θ) = Σ_n E_{z_n ~ p(z_n | x_n; θ')} [log P(x_n, z_n | θ)].
EM algorithm. We alternate between estimating r_nk and using the estimated r_nk to compute the parameters (the same idea as with K-means!). Step 0: Initialize t ← 0 and θ^(0) with some values (random or otherwise). Repeat: Step 1 (E-step): compute r_nk^(t) = p(z_n = k | x_n, θ^(t)) using the current θ^(t). Step 2 (M-step): θ^(t+1) ← arg max_θ Q(θ; θ^(t)), i.e., update θ using the just-computed r_nk. Step 3: t ← t + 1; go back to Step 1. Stop when converged.
EM for GMM. Recall the complete log-likelihood: LL_c(θ) = Σ_n Σ_k z_nk [log π_k + log N(x_n | μ_k, Σ_k)]. The Q function for the GMM is

Q(θ; θ^(t)) = E_{θ^(t)} [Σ_n Σ_k z_nk {log π_k + log N(x_n | μ_k, Σ_k)}]
            = Σ_n Σ_k E_{θ^(t)}[z_nk] {log π_k + log N(x_n | μ_k, Σ_k)}
            = Σ_n Σ_k r_nk^(t) {log π_k + log N(x_n | μ_k, Σ_k)}.
EM for GMM. Regrouping,

Q(θ; θ^(t)) = Σ_k {Σ_n r_nk^(t) log π_k} + Σ_k {Σ_n r_nk^(t) log N(x_n | μ_k, Σ_k)}.

We have recovered the parameter estimation algorithm for GMMs discussed previously! This gives the M-step updates to the parameters:

π_k^(t+1) = Σ_n r_nk^(t) / N, μ_k^(t+1) = Σ_n r_nk^(t) x_n / Σ_n r_nk^(t), Σ_k^(t+1) = Σ_n r_nk^(t) (x_n − μ_k^(t+1))(x_n − μ_k^(t+1))^T / Σ_n r_nk^(t).
Why does EM work? EM constructs a lower bound on LL(θ) (E-step) and optimizes it (M-step). If we define q(z) as a distribution over z, then

LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ) = Σ_n log Σ_{z_n} q(z_n) [p(x_n, z_n | θ) / q(z_n)] ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ) / q(z_n)].

The last step follows from Jensen's inequality, i.e., f(E[X]) ≥ E[f(X)] for a concave function f.
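Jensen's bound, and the fact that the posterior choice of q makes it tight, can be checked numerically for a single data point with two latent values (the joint values p(x, z = k | θ) below are made up purely for illustration):

```python
import numpy as np

# p[k] stands for the joint p(x, z=k | theta) at one data point (assumed toy values)
p = np.array([0.2, 0.5])

# Any distribution q over z gives a lower bound on log sum_z p(x, z | theta)
q = np.array([0.3, 0.7])
lhs = np.log(p.sum())               # the log-sum term of LL(theta)
rhs = (q * np.log(p / q)).sum()     # Jensen lower bound with this q
assert lhs >= rhs

# Choosing q* proportional to p, i.e., the posterior p(z | x; theta),
# makes p/q* constant, so Jensen holds with equality and the bound is tight.
q_star = p / p.sum()
assert np.isclose(np.log(p.sum()), (q_star * np.log(p / q_star)).sum())
```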
Which q(z) to choose? LL(θ) = Σ_n log Σ_{z_n} p(x_n, z_n | θ) ≥ Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ) / q(z_n)]. The lower bound we derived for LL(θ) holds for all choices of q(·). We want a tight lower bound, and given some current estimate θ^t, we will pick q(·) such that the lower bound holds with equality at θ^t. Choose q(z_n) ∝ p(x_n, z_n | θ^t)! Since q(·) is a distribution, we have

q(z_n) = p(x_n, z_n | θ^t) / Σ_k p(x_n, z_n = k | θ^t) = p(x_n, z_n | θ^t) / p(x_n | θ^t) = p(z_n | x_n; θ^t).

This is the posterior distribution of z_n given x_n and θ^t.
E and M Steps. Our simplified expression:

LL(θ^t) = Σ_n Σ_{z_n} p(z_n | x_n; θ^t) log [p(x_n, z_n | θ^t) / p(z_n | x_n; θ^t)].

E-Step: for all n, compute q(z_n) = p(z_n | x_n; θ^t). Why is this called the E-Step? Because we can view it as computing the expected (complete) log-likelihood:

Q(θ | θ^t) = Σ_n Σ_{z_n} p(z_n | x_n; θ^t) log p(x_n, z_n | θ) = E_q Σ_n log p(x_n, z_n | θ).

M-Step: maximize Q(θ | θ^t), i.e., θ^{t+1} = arg max_θ Q(θ | θ^t).
Iterative and monotonic improvement. We can show that LL(θ^{t+1}) ≥ LL(θ^t). Recall that we chose q(·) in the E-step such that LL(θ^t) = Σ_n Σ_{z_n} q(z_n) log [p(x_n, z_n | θ^t) / q(z_n)]. In the M-step, however, θ^{t+1} is chosen to maximize the right-hand side of this equation, thus proving our desired result. Note: the EM procedure converges, but only to a local optimum. Run the algorithm with random initializations and pick the best solution.
Bayesian viewpoint. We can impose a prior on θ; this reduces over-fitting. We can then compute the maximum a posteriori (MAP) estimate, which is equivalent to adding a regularizer: θ̂ = arg max_θ log P(D | θ) + log P(θ). EM can be extended to this setting.
Outline: Clustering; K-means; Gaussian mixture models; EM Algorithm; Model selection.
Model selection: choosing K. How well does the model fit the data for different K? Problem: the model will always fit better with larger K. We can instead evaluate the fit on independent data not used in training.
Model selection: choosing K. Bayesian solution: the prior controls model complexity: K* = arg max_K P(D | K), where P(D | K) = ∫ P(D | θ) P(θ | K) dθ. This integral is challenging to evaluate. Other approximate methods control model complexity: the Akaike Information Criterion, K* = arg max_K 2 LL(θ̂) − 2k, and the Bayesian Information Criterion, K* = arg max_K 2 LL(θ̂) − k log(N), where θ̂ is the MLE and k is the number of parameters.
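The AIC and BIC scores are trivial to compute once the MLE log-likelihood is available (a sketch; the free-parameter count for a full-covariance GMM is standard, but the helper names are mine):

```python
import numpy as np

def n_gmm_params(K, d):
    """Free parameters of a K-component, d-dimensional, full-covariance GMM:
    (K-1) mixture weights + K*d means + K*d*(d+1)/2 covariance entries."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def aic(loglik, k):
    """Akaike Information Criterion (pick the K that maximizes this)."""
    return 2 * loglik - 2 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion; penalizes parameters more heavily as n grows."""
    return 2 * loglik - k * np.log(n)
```

Since log(n) > 2 once n is larger than about 7, BIC penalizes extra components more than AIC on any realistically sized data set.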
Model selection: choosing K. Bayesian nonparametrics: put a prior on K! More details in future lectures.
Summary. Unsupervised learning: finding structure in data without labels. Clustering: K-means; GMMs, a probabilistic model for K-means. Inference on hidden or latent variables: more challenging than supervised learning. The EM algorithm: a principled approach to inference in these models.
More informationEECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1
EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum
More informationAgnostic Learning and Concentration Inequalities
ECE901 Sprig 2004 Statistical Regularizatio ad Learig Theory Lecture: 7 Agostic Learig ad Cocetratio Iequalities Lecturer: Rob Nowak Scribe: Aravid Kailas 1 Itroductio 1.1 Motivatio I the last lecture
More informationOutline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression
REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 12
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig
More informationEconomics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters
More informationLecture 11 and 12: Basic estimation theory
Lecture ad 2: Basic estimatio theory Sprig 202 - EE 94 Networked estimatio ad cotrol Prof. Kha March 2 202 I. MAXIMUM-LIKELIHOOD ESTIMATORS The maximum likelihood priciple is deceptively simple. Louis
More informationProbability and MLE.
10-701 Probability ad MLE http://www.cs.cmu.edu/~pradeepr/701 (brief) itro to probability Basic otatios Radom variable - referrig to a elemet / evet whose status is ukow: A = it will rai tomorrow Domai
More informationDistributional Similarity Models (cont.)
Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical Last Time EM Clusterig Soft versio of K-meas clusterig Iput: m dimesioal objects X = {
More informationDirection: This test is worth 250 points. You are required to complete this test within 50 minutes.
Term Test October 3, 003 Name Math 56 Studet Number Directio: This test is worth 50 poits. You are required to complete this test withi 50 miutes. I order to receive full credit, aswer each problem completely
More informationSeunghee Ye Ma 8: Week 5 Oct 28
Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value
More informationIntroduction to Machine Learning DIS10
CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig
More informationCS284A: Representations and Algorithms in Molecular Biology
CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by
More informationDistributional Similarity Models (cont.)
Distributioal Similarity Models (cot.) Regia Barzilay EECS Departmet MIT October 19, 2004 Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical
More informationExponential Families and Bayesian Inference
Computer Visio Expoetial Families ad Bayesia Iferece Lecture Expoetial Families A expoetial family of distributios is a d-parameter family f(x; havig the followig form: f(x; = h(xe g(t T (x B(, (. where
More informationOptimally Sparse SVMs
A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but
More informationLet us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.
Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,
More informationSupport vector machine revisited
6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector
More information5.1 Review of Singular Value Decomposition (SVD)
MGMT 69000: Topics i High-dimesioal Data Aalysis Falll 06 Lecture 5: Spectral Clusterig: Overview (cotd) ad Aalysis Lecturer: Jiamig Xu Scribe: Adarsh Barik, Taotao He, September 3, 06 Outlie Review of
More informationMachine Learning for Data Science (CS 4786)
Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm
More informationIntroductory statistics
CM9S: Machie Learig for Bioiformatics Lecture - 03/3/06 Itroductory statistics Lecturer: Sriram Sakararama Scribe: Sriram Sakararama We will provide a overview of statistical iferece focussig o the key
More information5. Likelihood Ratio Tests
1 of 5 7/29/2009 3:16 PM Virtual Laboratories > 9. Hy pothesis Testig > 1 2 3 4 5 6 7 5. Likelihood Ratio Tests Prelimiaries As usual, our startig poit is a radom experimet with a uderlyig sample space,
More informationResampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.
Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator
More informationREGRESSION WITH QUADRATIC LOSS
REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d
More informationThe Bayesian Learning Framework. Back to Maximum Likelihood. Naïve Bayes. Simple Example: Coin Tosses. Given a generative model
Back to Maximum Likelihood Give a geerative model f (x, y = k) =π k f k (x) Usig a geerative modellig approach, we assume a parametric form for f k (x) =f (x; k ) ad compute the MLE θ of θ =(π k, k ) k=
More informationLecture 7: Density Estimation: k-nearest Neighbor and Basis Approach
STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.
More informationUnsupervised Learning 2001
Usupervised Learig 2001 Lecture 3: The EM Algorithm Zoubi Ghahramai zoubi@gatsby.ucl.ac.uk Carl Edward Rasmusse edward@gatsby.ucl.ac.uk Gatsby Computatioal Neurosciece Uit MSc Itelliget Systems, Computer
More information6.3 Testing Series With Positive Terms
6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial
More informationChapter 8: Estimating with Confidence
Chapter 8: Estimatig with Cofidece Sectio 8.2 The Practice of Statistics, 4 th editio For AP* STARNES, YATES, MOORE Chapter 8 Estimatig with Cofidece 8.1 Cofidece Itervals: The Basics 8.2 8.3 Estimatig
More informationLecture 12: September 27
36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.
More informationProbabilistic Unsupervised Learning
Statistical Data Miig ad Machie Learig Hilary Term 2016 Dio Sejdiovic Departmet of Statistics Oxford Slides ad other materials available at: http://www.stats.ox.ac.u/~sejdiov/sdmml Probabilistic Methods
More informationLecture 9: Boosting. Akshay Krishnamurthy October 3, 2017
Lecture 9: Boostig Akshay Krishamurthy akshay@csumassedu October 3, 07 Recap Last week we discussed some algorithmic aspects of machie learig We saw oe very powerful family of learig algorithms, amely
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS
MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak
More informationFactor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis
Lecture 10: Factor Aalysis ad Pricipal Compoet Aalysis Sam Roweis February 9, 2004 Whe we assume that the subspace is liear ad that the uderlyig latet variable has a Gaussia distributio we get a model
More information4.1 Data processing inequality
ECE598: Iformatio-theoretic methods i high-dimesioal statistics Sprig 206 Lecture 4: Total variatio/iequalities betwee f-divergeces Lecturer: Yihog Wu Scribe: Matthew Tsao, Feb 8, 206 [Ed. Mar 22] Recall
More informationMIDTERM 3 CALCULUS 2. Monday, December 3, :15 PM to 6:45 PM. Name PRACTICE EXAM SOLUTIONS
MIDTERM 3 CALCULUS MATH 300 FALL 08 Moday, December 3, 08 5:5 PM to 6:45 PM Name PRACTICE EXAM S Please aswer all of the questios, ad show your work. You must explai your aswers to get credit. You will
More informationQuestion 1: The magnetic case
September 6, 018 Corell Uiversity, Departmet of Physics PHYS 337, Advace E&M, HW # 4, due: 9/19/018, 11:15 AM Questio 1: The magetic case I class, we skipped over some details, so here you are asked to
More informationDirection: This test is worth 150 points. You are required to complete this test within 55 minutes.
Term Test 3 (Part A) November 1, 004 Name Math 6 Studet Number Directio: This test is worth 10 poits. You are required to complete this test withi miutes. I order to receive full credit, aswer each problem
More informationBertrand s Postulate
Bertrad s Postulate Lola Thompso Ross Program July 3, 2009 Lola Thompso (Ross Program Bertrad s Postulate July 3, 2009 1 / 33 Bertrad s Postulate I ve said it oce ad I ll say it agai: There s always a
More informationLecture 14: Graph Entropy
15-859: Iformatio Theory ad Applicatios i TCS Sprig 2013 Lecture 14: Graph Etropy March 19, 2013 Lecturer: Mahdi Cheraghchi Scribe: Euiwoog Lee 1 Recap Bergma s boud o the permaet Shearer s Lemma Number
More informationECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization
ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where
More informationMath 113 Exam 3 Practice
Math Exam Practice Exam 4 will cover.-., 0. ad 0.. Note that eve though. was tested i exam, questios from that sectios may also be o this exam. For practice problems o., refer to the last review. This
More informationRecurrence Relations
Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial(-)); } Let t be the umber of multiplicatios eeded to calculate factorial(). The
More informationSequences and Series of Functions
Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges
More informationLecture 13: Maximum Likelihood Estimation
ECE90 Sprig 007 Statistical Learig Theory Istructor: R. Nowak Lecture 3: Maximum Likelihood Estimatio Summary of Lecture I the last lecture we derived a risk (MSE) boud for regressio problems; i.e., select
More informationBoosting. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 1, / 32
Boostig Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machie Learig Algorithms March 1, 2017 1 / 32 Outlie 1 Admiistratio 2 Review of last lecture 3 Boostig Professor Ameet Talwalkar CS260
More informationEcon 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara
Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationECE 901 Lecture 13: Maximum Likelihood Estimation
ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered
More informationGoodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)
Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................
More informationMath 113 Exam 3 Practice
Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you
More informationThe picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled
1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how
More informationTopics Machine learning: lecture 2. Review: the learning problem. Hypotheses and estimation. Estimation criterion cont d. Estimation criterion
.87 Machie learig: lecture Tommi S. Jaakkola MIT CSAIL tommi@csail.mit.edu Topics The learig problem hypothesis class, estimatio algorithm loss ad estimatio criterio samplig, empirical ad epected losses
More informationL = n i, i=1. dp p n 1
Exchageable sequeces ad probabilities for probabilities 1996; modified 98 5 21 to add material o mutual iformatio; modified 98 7 21 to add Heath-Sudderth proof of de Fietti represetatio; modified 99 11
More informationRandom Variables, Sampling and Estimation
Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig
More informationStat410 Probability and Statistics II (F16)
Some Basic Cocepts of Statistical Iferece (Sec 5.) Suppose we have a rv X that has a pdf/pmf deoted by f(x; θ) or p(x; θ), where θ is called the parameter. I previous lectures, we focus o probability problems
More informationMaximum Likelihood Estimation
Chapter 9 Maximum Likelihood Estimatio 9.1 The Likelihood Fuctio The maximum likelihood estimator is the most widely used estimatio method. This chapter discusses the most importat cocepts behid maximum
More informationRegression and generalization
Regressio ad geeralizatio CE-717: Machie Learig Sharif Uiversity of Techology M. Soleymai Fall 2016 Curve fittig: probabilistic perspective Describig ucertaity over value of target variable as a probability
More information10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random
Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),
More information1 Duality revisited. AM 221: Advanced Optimization Spring 2016
AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R
More informationStat 421-SP2012 Interval Estimation Section
Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible
More informationRegression with quadratic loss
Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,
More informationINF Introduction to classifiction Anne Solberg Based on Chapter 2 ( ) in Duda and Hart: Pattern Classification
INF 4300 90 Itroductio to classifictio Ae Solberg ae@ifiuioo Based o Chapter -6 i Duda ad Hart: atter Classificatio 90 INF 4300 Madator proect Mai task: classificatio You must implemet a classificatio
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio
More informationIntroduction to Artificial Intelligence CAP 4601 Summer 2013 Midterm Exam
Itroductio to Artificial Itelligece CAP 601 Summer 013 Midterm Exam 1. Termiology (7 Poits). Give the followig task eviromets, eter their properties/characteristics. The properties/characteristics of the
More informationStatistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.
Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized
More informationReview Questions, Chapters 8, 9. f(y) = 0, elsewhere. F (y) = f Y(1) = n ( e y/θ) n 1 1 θ e y/θ = n θ e yn
Stat 366 Lab 2 Solutios (September 2, 2006) page TA: Yury Petracheko, CAB 484, yuryp@ualberta.ca, http://www.ualberta.ca/ yuryp/ Review Questios, Chapters 8, 9 8.5 Suppose that Y, Y 2,..., Y deote a radom
More informationLecture 19: Convergence
Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may
More information1 Review of Probability & Statistics
1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5
More informationGrouping 2: Spectral and Agglomerative Clustering. CS 510 Lecture #16 April 2 nd, 2014
Groupig 2: Spectral ad Agglomerative Clusterig CS 510 Lecture #16 April 2 d, 2014 Groupig (review) Goal: Detect local image features (SIFT) Describe image patches aroud features SIFT, SURF, HoG, LBP, Group
More informationACO Comprehensive Exam 9 October 2007 Student code A. 1. Graph Theory
1. Graph Theory Prove that there exist o simple plaar triagulatio T ad two distict adjacet vertices x, y V (T ) such that x ad y are the oly vertices of T of odd degree. Do ot use the Four-Color Theorem.
More information6.867 Machine learning
6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples
More informationHypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance
Hypothesis Testig Empirically evaluatig accuracy of hypotheses: importat activity i ML. Three questios: Give observed accuracy over a sample set, how well does this estimate apply over additioal samples?
More information6.867 Machine learning, lecture 7 (Jaakkola) 1
6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit
More informationStudy the bias (due to the nite dimensional approximation) and variance of the estimators
2 Series Methods 2. Geeral Approach A model has parameters (; ) where is ite-dimesioal ad is oparametric. (Sometimes, there is o :) We will focus o regressio. The fuctio is approximated by a series a ite
More informationLecture 2 October 11
Itroductio to probabilistic graphical models 203/204 Lecture 2 October Lecturer: Guillaume Oboziski Scribes: Aymeric Reshef, Claire Verade Course webpage: http://www.di.es.fr/~fbach/courses/fall203/ 2.
More informationEstimation for Complete Data
Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of
More informationKurskod: TAMS11 Provkod: TENB 21 March 2015, 14:00-18:00. English Version (no Swedish Version)
Kurskod: TAMS Provkod: TENB 2 March 205, 4:00-8:00 Examier: Xiagfeg Yag (Tel: 070 2234765). Please aswer i ENGLISH if you ca. a. You are allowed to use: a calculator; formel -och tabellsamlig i matematisk
More informationExercises Advanced Data Mining: Solutions
Exercises Advaced Data Miig: Solutios Exercise 1 Cosider the followig directed idepedece graph. 5 8 9 a) Give the factorizatio of P (X 1, X 2,..., X 9 ) correspodig to this idepedece graph. P (X) = 9 P
More informationTable 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab
Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet
More information