Outline. CSCI-567: Machine Learning (Spring 2019) Outline. Prof. Victor Adamchik. Mar. 26, 2019
Transcription
CSCI-567: Machine Learning (Spring 2019)
Prof. Victor Adamchik, University of Southern California
March 26, 2019

Outline

1. Gaussian mixture models
   - Motivation and Model
   - EM algorithm
   - EM applied to GMMs
2. Density estimation
3. Naive Bayes Revisited

Gaussian mixture models

A Gaussian mixture model (GMM) is a probabilistic approach to clustering.

We want to come up with a probabilistic model $p$ that explains how the data is generated. We model each region with a Gaussian distribution. To generate a point, we first randomly pick one of the Gaussian models, then draw a point according to this Gaussian.
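GMM: formal definition

A GMM has the following density function:

$$p(x) = \sum_{k=1}^{K} \omega_k \, N(x \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} \omega_k \frac{1}{\sqrt{(2\pi)^D |\Sigma_k|}} \, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}$$

where

- $K$: the number of Gaussian components (same as the number of clusters we want)
- $\mu_k$ and $\Sigma_k$: mean and covariance matrix of the $k$-th Gaussian
- $\omega_1, \ldots, \omega_K$: mixture weights; they represent how much each component contributes to the final distribution. They satisfy two properties: $\forall k,\ \omega_k > 0$, and $\sum_k \omega_k = 1$.

An example

The conditional distributions are

$$p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1), \quad p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2), \quad p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)$$

Here $z$ is the hidden (latent) variable. The marginal distribution is

$$p(x) = p(\text{red})\, N(x \mid \mu_1, \Sigma_1) + p(\text{blue})\, N(x \mid \mu_2, \Sigma_2) + p(\text{green})\, N(x \mid \mu_3, \Sigma_3)$$

Learning GMMs

Learning a GMM means finding all the parameters $\theta = \{\omega_k, \mu_k, \Sigma_k\}_{k=1}^{K}$. How do we learn these parameters? An obvious attempt is maximum-likelihood estimation (MLE): find

$$\arg\max_\theta \ln \prod_{n=1}^{N} p(x_n ; \theta) = \arg\max_\theta \sum_{n=1}^{N} \ln p(x_n ; \theta) \triangleq \arg\max_\theta P(\theta)$$

The problem is intractable in general (it is a non-concave problem), and there is also a latent variable. One solution is to still apply GD/SGD, but a much more effective approach is the Expectation Maximization (EM) algorithm.

Preview of EM for learning GMMs

Step 0. Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.

Step 1 (E-Step). Update the soft assignment (fixing parameters):

$$\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k \, N(x_n \mid \mu_k, \Sigma_k)$$

Step 2 (M-Step). Update the model parameters (fixing assignments):

$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \qquad \mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}, \qquad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$$

Step 3. Return to Step 1 if not converged.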
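The density and the two-stage generative story above (pick a component, then draw from its Gaussian) can be sketched in a few lines. This is a minimal one-dimensional illustration, not code from the lecture: the function names and the three-component parameter values are assumptions chosen only for the example.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x) = sum_k w_k * N(x | mu_k, var_k)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def sample_gmm(n, weights, means, variances, rng):
    """Draw n points: pick a component z with probability w_z, then x ~ N(mu_z, var_z)."""
    points = []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights)[0]
        points.append(rng.gauss(means[z], math.sqrt(variances[z])))
    return points

# Hypothetical three-component mixture (illustrative values only).
weights = [0.5, 0.3, 0.2]
means = [-2.0, 0.0, 3.0]
variances = [0.5, 1.0, 0.8]

rng = random.Random(0)
data = sample_gmm(5, weights, means, variances, rng)
density_at_zero = gmm_pdf(0.0, weights, means, variances)
```

Because the weights are positive and sum to one, `gmm_pdf` integrates to one, which is what makes the mixture a valid density.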
EM algorithm

In general, EM is a heuristic to solve MLE with latent variables (not just for GMMs), i.e. to find the maximizer of

$$P(\theta) = \sum_{n=1}^{N} \ln p(x_n ; \theta)$$

- $\theta$ is the parameters of a general probabilistic model
- the $x_n$'s are observed random variables
- the $z_n$'s are latent variables

Again, directly solving the objective is intractable.

EM algorithm

EM is a general algorithm for dealing with hidden data.

- EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
- EM is much simpler than gradient methods: there is no need to choose a step size.
- EM is an iterative algorithm with two steps:
  - E-step: fill in hidden values using inference
  - M-step: apply the standard MLE method to the completed data
- We will prove that EM never decreases the likelihood and converges to a local optimum (in general, a stationary point) of the likelihood.

High level idea

Keep maximizing a lower bound of $P$ that is more manageable.

Derivation of EM

Finding the lower bound of $P$:

$$\ln p(x ; \theta) = \ln \frac{p(x, z ; \theta)}{p(z \mid x ; \theta)} \quad \text{(true for any } z\text{)}$$

$$= E_{z \sim q}\left[ \ln \frac{p(x, z ; \theta)}{p(z \mid x ; \theta)} \right] \quad \text{(true for any distribution } q\text{)}$$

Let us recall the definition of expectation,

$$E_{z \sim q}[f(z)] = \sum_z q(z) f(z)$$

and entropy,

$$H(q) = -E_{z \sim q}[\ln q(z)] = -\sum_z q(z) \ln q(z)$$
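Derivation of EM

Finding the lower bound of $P$:

$$\ln p(x ; \theta) = \ln \frac{p(x, z ; \theta)}{p(z \mid x ; \theta)} \quad \text{(true for any } z\text{)}$$

$$= E_{z \sim q}\left[ \ln \frac{p(x, z ; \theta)}{p(z \mid x ; \theta)} \right] \quad \text{(true for any distribution } q\text{)}$$

$$= E_{z \sim q}[\ln p(x, z ; \theta)] - E_{z \sim q}[\ln q(z)] - E_{z \sim q}\left[ \ln \frac{p(z \mid x ; \theta)}{q(z)} \right]$$

$$= E_{z \sim q}[\ln p(x, z ; \theta)] + H(q) - E_{z \sim q}\left[ \ln \frac{p(z \mid x ; \theta)}{q(z)} \right] \quad (H \text{ is entropy})$$

$$\geq E_{z \sim q}[\ln p(x, z ; \theta)] + H(q) - \ln E_{z \sim q}\left[ \frac{p(z \mid x ; \theta)}{q(z)} \right] \quad \text{(Jensen's inequality)}$$

Jensen's inequality

Claim: $E[\ln X] \leq \ln E[X]$.

Proof (for $X$ uniform over positive values $x_1, \ldots, x_N$). By definition, $E[X] = \frac{1}{N}(x_1 + x_2 + \cdots + x_N)$. Then

$$E[\ln X] = \frac{1}{N}(\ln x_1 + \ln x_2 + \cdots + \ln x_N) = \frac{1}{N} \ln \prod_{n=1}^{N} x_n = \ln \left( \prod_{n=1}^{N} x_n \right)^{1/N}$$

So the claim is equivalent to

$$\left( \prod_{n=1}^{N} x_n \right)^{1/N} \leq \frac{1}{N} \sum_{n=1}^{N} x_n$$

This is the AGM (arithmetic-geometric mean) inequality. For $N = 2$, it is just $\sqrt{x_1 x_2} \leq \frac{x_1 + x_2}{2}$.

Derivation of EM

After applying Jensen's inequality, we obtain

$$\ln p(x ; \theta) \geq E_{z \sim q}[\ln p(x, z ; \theta)] + H(q) - \ln E_{z \sim q}\left[ \frac{p(z \mid x ; \theta)}{q(z)} \right]$$

Next, we observe that

$$E_{z \sim q}\left[ \frac{p(z \mid x ; \theta)}{q(z)} \right] = \sum_z q(z) \frac{p(z \mid x ; \theta)}{q(z)} = \sum_z p(z \mid x ; \theta) = 1$$

It follows that

$$\ln p(x ; \theta) \geq E_{z \sim q}[\ln p(x, z ; \theta)] + H(q)$$

Alternately maximize the lower bound

We have found a lower bound for the log-likelihood function:

$$P(\theta) = \sum_{n=1}^{N} \ln p(x_n ; \theta) \geq \sum_{n=1}^{N} \left( E_{z_n \sim q_n}[\ln p(x_n, z_n ; \theta)] + H(q_n) \right) = F(\theta, \{q_n\})$$

This holds for any $\{q_n\}$, so how do we choose? Naturally, the one that maximizes the lower bound (i.e. the tightest lower bound)!

This is similar to K-means: we will alternately maximize $F$ over $\{q_n\}$ and $\theta$.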
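A quick numeric check can make the claim $E[\ln X] \leq \ln E[X]$ concrete. The sketch below compares the two sides for a uniform distribution over a small set of positive values; the helper names and the sample values are assumptions made for illustration.

```python
import math

def mean_of_logs(values):
    """E[ln X] for X uniform over `values` (all values must be positive)."""
    return sum(math.log(v) for v in values) / len(values)

def log_of_mean(values):
    """ln E[X] for X uniform over `values`."""
    return math.log(sum(values) / len(values))

xs = [0.5, 2.0, 8.0]                 # hypothetical positive values
lhs = mean_of_logs(xs)               # equals ln(geometric mean)
rhs = log_of_mean(xs)                # equals ln(arithmetic mean)
geometric_mean = math.exp(lhs)
arithmetic_mean = math.exp(rhs)

# Equality holds when all values coincide.
equal_values = [3.0, 3.0, 3.0]
gap_when_equal = log_of_mean(equal_values) - mean_of_logs(equal_values)
```

Exponentiating both sides recovers the geometric and arithmetic means, so the comparison `lhs <= rhs` is the AGM inequality in logarithmic form.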
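Pictorial explanation

$P(\theta)$ is non-concave, but $F(\theta, \{q_n^t\})$ often is concave and easy to maximize.

Maximizing over $\{q_n\}$

Fix $\theta^t$, and maximize $F$ over $\{q_n\}$:

$$\arg\max_{q_n} F(\theta^t, \{q_n\}) = \arg\max_{q_n} E_{z_n \sim q_n}\left[ \ln p(x_n, z_n ; \theta^t) \right] + H(q_n)$$

$$= \arg\max_{q_n} \sum_{k=1}^{K} \left( q_n(k) \ln p(x_n, z_n = k ; \theta^t) - q_n(k) \ln q_n(k) \right)$$

subject to the conditions $q_n(k) \geq 0$ and $\sum_k q_n(k) = 1$.

Next, write down the Lagrangian and then apply the KKT conditions.

Maximizing over $\{q_n\}$

The solution to

$$\arg\max_{q_n} E_{z_n \sim q_n}\left[ \ln p(x_n, z_n ; \theta^t) \right] + H(q_n)$$

is (you have to verify it by yourself)

$$q_n^t(z_n = k) = p(z_n = k \mid x_n ; \theta^t)$$

i.e., the posterior distribution of $z_n$ given $x_n$ and $\theta^t$.

So at $\theta^t$, we have found the tightest lower bound $F(\theta, \{q_n^t\})$:

- $F(\theta, \{q_n^t\}) \leq P(\theta)$ for all $\theta$
- $F(\theta^t, \{q_n^t\}) = P(\theta^t)$

Maximizing over $\theta$

Fix $\{q_n^t\}$ and maximize over $\theta$ (note that $H(q_n^t)$ is independent of $\theta$):

$$\arg\max_\theta F(\theta, \{q_n^t\}) = \arg\max_\theta \sum_{n=1}^{N} E_{z_n \sim q_n^t}[\ln p(x_n, z_n ; \theta)] \triangleq \arg\max_\theta Q(\theta ; \theta^t)$$

($\{q_n^t\}$ are computed via $\theta^t$.)

$Q$ is called a complete likelihood and is usually more tractable, since the $z_n$ are not latent variables anymore.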
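The tightness claim can be checked numerically for a single observation: the bound $E_{z \sim q}[\ln p(x, z)] + H(q)$ equals $\ln p(x)$ when $q$ is the posterior and is strictly smaller for another $q$. The following is a sketch under assumed values (a two-component 1-D mixture and one data point picked for illustration).

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint(x, weights, means, variances):
    """p(x, z = k; theta) = w_k * N(x | mu_k, var_k) for every component k."""
    return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

def lower_bound(q, joint_probs):
    """F = E_q[ln p(x, z)] + H(q) = sum_k q_k * (ln p(x, k) - ln q_k)."""
    return sum(qk * (math.log(pk) - math.log(qk))
               for qk, pk in zip(q, joint_probs) if qk > 0)

# Hypothetical parameters and data point.
weights, means, variances = [0.4, 0.6], [-1.0, 2.0], [1.0, 0.5]
x = 0.3

pj = joint(x, weights, means, variances)
log_px = math.log(sum(pj))                    # ln p(x; theta)
posterior = [p / sum(pj) for p in pj]         # p(z = k | x; theta)

bound_at_posterior = lower_bound(posterior, pj)
bound_at_uniform = lower_bound([0.5, 0.5], pj)
```

The gap `log_px - bound_at_uniform` is the KL divergence from the chosen $q$ to the posterior, which is why only the posterior closes it.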
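General EM algorithm

Step 0. Initialize $\theta^1$, $t = 1$.

Step 1 (E-Step). Update the posterior of the latent variables,

$$q_n^t(\cdot) = p(\cdot \mid x_n ; \theta^t)$$

and obtain the Expectation of the complete likelihood,

$$Q(\theta ; \theta^t) = \sum_{n=1}^{N} E_{z_n \sim q_n^t}[\ln p(x_n, z_n ; \theta)]$$

Step 2 (M-Step). Update the model parameters via Maximization:

$$\theta^{t+1} \leftarrow \arg\max_\theta Q(\theta ; \theta^t)$$

Step 3. $t \leftarrow t + 1$ and return to Step 1 if not converged.

Pictorial explanation

$P(\theta)$ is non-concave, but $Q(\theta ; \theta^t)$ often is concave and easy to maximize.

$$P(\theta^{t+1}) \geq F(\theta^{t+1} ; \{q_n^t\}) \geq F(\theta^t ; \{q_n^t\}) = P(\theta^t)$$

So EM never decreases the objective value and will converge to some local maximum (similar to K-means).

Apply EM to learn GMMs

E-Step:

$$q_n^t(z_n = k) = p(z_n = k \mid x_n ; \theta^t) \propto p(z_n = k ; \theta^t)\, p(x_n \mid z_n = k ; \theta^t) = \omega_k^t \, N(x_n \mid \mu_k^t, \Sigma_k^t)$$

This computes the soft assignment $\gamma_{nk} = q_n^t(z_n = k)$, i.e. the conditional probability of $x_n$ belonging to cluster $k$.

Apply EM to learn GMMs

M-Step:

$$\arg\max_\theta Q(\theta, \theta^t) = \arg\max_\theta \sum_{n=1}^{N} E_{z_n \sim q_n^t}[\ln p(x_n, z_n ; \theta)]$$

$$= \arg\max_\theta \sum_{n=1}^{N} E_{z_n \sim q_n^t}[\ln p(z_n ; \theta) + \ln p(x_n \mid z_n ; \theta)]$$

$$= \arg\max_{\{\omega_k, \mu_k, \Sigma_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left( \ln \omega_k + \ln N(x_n \mid \mu_k, \Sigma_k) \right)$$

To find $\omega_1, \ldots, \omega_K$, solve

$$\arg\max_{\omega} \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \ln \omega_k$$

To find each $\mu_k, \Sigma_k$, solve

$$\arg\max_{\mu_k, \Sigma_k} \sum_{n=1}^{N} \gamma_{nk} \ln N(x_n \mid \mu_k, \Sigma_k)$$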
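M-Step (continued)

Solutions to the previous two problems are very natural (see slide 8). For each $k$:

$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}$$

i.e. the (weighted) fraction of examples belonging to cluster $k$;

$$\mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}$$

i.e. the (weighted) average of examples belonging to cluster $k$;

$$\Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$$

i.e. the (weighted) covariance of examples belonging to cluster $k$.

GMM: putting it together

EM for clustering:

Step 0. Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.

Step 1 (E-Step). Update the soft assignment (fixing parameters):

$$\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k \, N(x_n \mid \mu_k, \Sigma_k)$$

Step 2 (M-Step). Update the model parameters (fixing assignments):

$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \qquad \mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}, \qquad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$$

Step 3. Return to Step 1 if not converged.

Connection to K-means

K-means is in fact a special case of EM for a simplified GMM. Let $\Sigma_k = \sigma^2 I$ for some fixed $\sigma$, so only $\omega_k$ and $\mu_k$ are parameters. EM becomes K-means:

$$\arg\max_\theta \sum_{n=1}^{N} \ln p(x_n ; \theta) = \arg\max_\theta \sum_{n=1}^{N} \ln \sum_{k=1}^{K} p(z_n = k)\, N(x_n \mid \mu_k, \sigma^2 I)$$

If we assume hard assignments, $p(z_n = k) = 1$ if $k = C(n)$, then (dropping terms that do not depend on the parameters)

$$\arg\max_\theta \sum_{n=1}^{N} \ln p(x_n ; \theta) = \arg\max_{\mu, C} \sum_{n=1}^{N} \ln N(x_n \mid \mu_{C(n)}, \sigma^2 I) = \arg\max_{\mu, C} \sum_{n=1}^{N} \ln \exp\left( -\frac{1}{2\sigma^2} \|x_n - \mu_{C(n)}\|_2^2 \right) = \arg\min_{\mu, C} \sum_{n=1}^{N} \|x_n - \mu_{C(n)}\|_2^2$$

GMM is a soft version of K-means, and it provides a probabilistic interpretation of the data.

Outline

1. Gaussian mixture models
2. Density estimation
   - Parametric models
   - Nonparametric models
3. Naive Bayes Revisited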
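The E-step and M-step updates above can be put together in a short sketch. This is a one-dimensional illustration (scalar variances instead of covariance matrices), not the lecture's implementation; the synthetic data, the initial parameters, the iteration count, and the small variance floor are all assumptions added for the example.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, w, mu, var):
    """P(theta) = sum_n ln sum_k w_k N(x_n | mu_k, var_k)."""
    return sum(math.log(sum(w[k] * normal_pdf(x, mu[k], var[k]) for k in range(len(w))))
               for x in xs)

def em_step(xs, w, mu, var):
    """One EM iteration for a 1-D GMM; returns updated (w, mu, var)."""
    K, N = len(w), len(xs)
    # E-step: soft assignments gamma[n][k] proportional to w_k N(x_n | mu_k, var_k).
    gamma = []
    for x in xs:
        unnorm = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
        total = sum(unnorm)
        gamma.append([u / total for u in unnorm])
    # M-step: weighted fraction, weighted mean, weighted variance per component.
    new_w, new_mu, new_var = [], [], []
    for k in range(K):
        nk = sum(g[k] for g in gamma)
        m = sum(g[k] * x for g, x in zip(gamma, xs)) / nk
        v = sum(g[k] * (x - m) ** 2 for g, x in zip(gamma, xs)) / nk
        new_w.append(nk / N)
        new_mu.append(m)
        new_var.append(max(v, 1e-6))  # variance floor: an added safeguard, not from the slides
    return new_w, new_mu, new_var

# Synthetic data from two hypothetical clusters centered at -3 and +3.
rng = random.Random(0)
xs = [rng.gauss(-3.0, 1.0) for _ in range(150)] + [rng.gauss(3.0, 1.0) for _ in range(150)]

w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]   # Step 0: initialization
history = [log_likelihood(xs, w, mu, var)]
for _ in range(30):                                 # fixed iteration budget for simplicity
    w, mu, var = em_step(xs, w, mu, var)
    history.append(log_likelihood(xs, w, mu, var))
```

Tracking `history` is a convenient sanity check: the log-likelihood should never decrease from one iteration to the next, so a drop usually signals a bug in one of the two steps.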
Density estimation

Observe that what we have done indirectly for clustering with GMMs is:

Given a training set $x_1, \ldots, x_N$, estimate a density function $p$ that could have generated this dataset (via $x_n \overset{\text{i.i.d.}}{\sim} p$).

This is exactly the problem of density estimation, another important unsupervised learning problem.

It is useful for many downstream applications:

- we have seen clustering already, and will see more applications today
- these applications also provide a way to measure the quality of the density estimator

Parametric generative models

Parametric estimation assumes a generative model parametrized by $\theta$:

$$p(x) = p(x ; \theta)$$

Examples:

- GMM: $p(x ; \theta) = \sum_{k=1}^{K} \omega_k N(x \mid \mu_k, \Sigma_k)$ where $\theta = \{\omega_k, \mu_k, \Sigma_k\}$
- Multinomial for 1D examples with $K$ possible values: $p(x = k ; \theta) = \theta_k$, where $\theta$ is a distribution over $K$ elements.

The size of $\theta$ is independent of the training set size, so it is parametric.

Parametric methods

Again, we apply MLE to learn the parameters $\theta$:

$$\arg\max_\theta \sum_{n=1}^{N} \ln p(x_n ; \theta)$$

For some cases this is intractable, and we can use EM to approximately solve MLE (e.g. GMMs).

For some other cases this admits a simple closed-form solution (e.g. multinomial).

MLE for multinomial

$$\arg\max_\theta \sum_{n=1}^{N} \ln p(x = x_n ; \theta) = \arg\max_\theta \sum_{n=1}^{N} \ln \theta_{x_n} = \arg\max_\theta \sum_{k=1}^{K} \sum_{n : x_n = k} \ln \theta_k = \arg\max_\theta \sum_{k=1}^{K} z_k \ln \theta_k$$

where $z_k = |\{n : x_n = k\}|$ is the number of examples with value $k$.

The solution (your TA4) is simply

$$\theta_k = \frac{z_k}{N} \propto z_k$$

i.e. the fraction of examples with value $k$.
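The closed-form multinomial solution is just counting. A minimal sketch, assuming values are coded as integers $1, \ldots, K$ and using a made-up sample:

```python
import math
from collections import Counter

def multinomial_mle(samples, num_values):
    """theta_k = z_k / N, where z_k counts the samples equal to k (k = 1..num_values)."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(k, 0) / n for k in range(1, num_values + 1)]

def log_likelihood(samples, theta):
    """sum_n ln theta_{x_n}; assumes every observed value has theta_k > 0."""
    return sum(math.log(theta[x - 1]) for x in samples)

samples = [1, 3, 3, 2, 3, 1, 3, 2]           # hypothetical observations
theta = multinomial_mle(samples, 3)          # counts are 2, 2, 4 out of 8
uniform = [1.0 / 3.0] * 3
ll_mle = log_likelihood(samples, theta)
ll_uniform = log_likelihood(samples, uniform)
```

Comparing `ll_mle` with the log-likelihood of any other distribution (here the uniform one) illustrates that the empirical fractions are the maximizer.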
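Nonparametric models

Can we estimate a density without assuming a fixed generative model?

Kernel density estimation (KDE) is a common approach for nonparametric density estimation.

Here "kernel" means something different from what we have seen for "kernel function".

We focus on the 1D continuous case.

High level idea

Construct something similar to a histogram:

- for each data point, create a "hump" (via a kernel)
- sum up all the humps; more data means a higher hump

[Figure: histogram versus kernel density estimate; picture from Wikipedia]

Kernel

KDE with a kernel $K : \mathbb{R} \to \mathbb{R}$:

$$p(x) = \frac{1}{N} \sum_{n=1}^{N} K(x - x_n)$$

i.e. a sum of copies of $K$, each centered at a data point $x_n$.

Many choices for $K$ exist, for example $K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$, the standard Gaussian density.

Properties of a kernel:

- symmetry: $K(x) = K(-x)$
- $\int K(x)\,dx = 1$; this ensures $p$ is a density function.

Different kernels $K(x)$

[Figure: plots of three kernels]

- $\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$ (Gaussian)
- $\frac{1}{2} \mathbb{I}[|x| \leq 1]$ (uniform)
- $\frac{3}{4} \max\{1 - x^2, 0\}$ (Epanechnikov)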
Bandwidth

If $K(x)$ is a kernel, then for any $h > 0$

$$K_h(x) \triangleq \frac{1}{h} K\left( \frac{x}{h} \right) \quad \text{(stretching the kernel)}$$

can be used as a kernel too (verify the two properties yourself).

So general KDE is determined by both the kernel $K$ and the bandwidth $h$:

$$p(x) = \frac{1}{N} \sum_{n=1}^{N} K_h(x - x_n) = \frac{1}{Nh} \sum_{n=1}^{N} K\left( \frac{x - x_n}{h} \right)$$

- $x_n$ controls the center of each hump
- $h$ controls the width/variance of the humps

Effect of bandwidth

A larger $h$ will smooth a density. A small $h$ will yield a density that is spiky and very hard to interpret.

[Figure: KDE with a Gaussian kernel for several bandwidths; picture from Wikipedia. The gray curve is the ground truth. Red: $h = 0.05$; green: $h = 2$; black: a third bandwidth value.]

Bandwidth selection

Selecting $h$ is a deep topic:

- one can also do cross-validation based on downstream applications
- there are theoretically-motivated approaches

Find a value of $h$ that minimizes the error between the estimated density and the true density:

$$E\left[ \left( p_{KDE}(x) - p(x) \right)^2 \right] = \left( E\left[ p_{KDE}(x) - p(x) \right] \right)^2 + \mathrm{Var}\left[ p_{KDE}(x) \right]$$

This expression is an example of the bias-variance tradeoff, which we saw in an earlier lecture.

Outline

1. Gaussian mixture models
2. Density estimation
3. Naive Bayes Revisited
   - Setup and assumption
   - Connection to logistic regression
   - Generative and Discriminative Models
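The KDE formula with a bandwidth translates almost directly into code. The sketch below uses a Gaussian kernel on a made-up 1-D sample and evaluates the estimate for three bandwidths; the data values, the grid, and the function names are assumptions for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian density K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """p(x) = 1 / (N h) * sum_n K((x - x_n) / h)."""
    return sum(kernel((x - xn) / h) for xn in data) / (len(data) * h)

data = [-1.2, -0.4, 0.1, 0.3, 1.5]                  # hypothetical 1-D sample
grid = [-12.0 + 0.01 * i for i in range(2401)]      # evaluation grid on [-12, 12]

# Approximate the area under each estimate with a Riemann sum (step 0.01).
areas = {h: sum(kde(x, data, h) for x in grid) * 0.01 for h in (0.05, 0.5, 2.0)}

# Height of the estimate at a data point for a small and a large bandwidth.
spiky_height = kde(0.1, data, 0.05)
smooth_height = kde(0.1, data, 2.0)
```

Every bandwidth yields an estimate with area close to one, but the small bandwidth concentrates that area in narrow spikes at the data points, while the large one spreads it into a single smooth bump.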
Bayes optimal classifier

Suppose the data $(x, y)$ is drawn from a joint distribution $p(x, y)$. The Bayes optimal classifier is

$$f^*(x) = \arg\max_{c \in [C]} p(c \mid x)$$

i.e. predict the class with the largest conditional probability.

$p(x, y)$ is of course unknown, but we can estimate it, which is exactly a density estimation problem!

Observe that

$$p(x, y) = p(y)\, p(x \mid y)$$

To estimate $p(x \mid y = c)$ for some $c \in [C]$, we are doing density estimation using the data with label $y = c$.

Discrete features

For a label $c \in [C]$,

$$p(y = c) = \frac{|\{n : y_n = c\}|}{N}$$

For each possible value $k$ of a discrete feature $d$,

$$p(x_d = k \mid y = c) = \frac{|\{n : x_{nd} = k, y_n = c\}|}{|\{n : y_n = c\}|}$$

Continuous features

If the feature is continuous, we can do parametric estimation, e.g. via a Gaussian:

$$p(x_d = x \mid y = c) = \frac{1}{\sqrt{2\pi}\,\sigma_{cd}} \exp\left( -\frac{(x - \mu_{cd})^2}{2\sigma_{cd}^2} \right)$$

where $\mu_{cd}$ and $\sigma_{cd}^2$ are the empirical mean and variance of feature $d$ among all examples with label $c$;

or nonparametric estimation, e.g. via a kernel $K$ and bandwidth $h$:

$$p(x_d = x \mid y = c) = \frac{1}{|\{n : y_n = c\}|} \sum_{n : y_n = c} K_h(x - x_{nd})$$

How to predict?

Using the Naive Bayes assumption

$$p(x \mid y = c) = \prod_{d=1}^{D} p(x_d \mid y = c)$$

the prediction for a new example $x$ is

$$\arg\max_{c \in [C]} p(y = c \mid x) = \arg\max_{c \in [C]} \frac{p(x \mid y = c)\, p(y = c)}{p(x)}$$

$$= \arg\max_{c \in [C]} \left( p(y = c) \prod_{d=1}^{D} p(x_d \mid y = c) \right)$$

$$= \arg\max_{c \in [C]} \left( \ln p(y = c) + \sum_{d=1}^{D} \ln p(x_d \mid y = c) \right)$$
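For discrete features, the estimates above are ratios of counts, and prediction sums their logarithms. The following is a sketch on a tiny made-up dataset; the feature values, labels, and function names are assumptions, and no smoothing is applied, so an unseen feature value gives a class a score of negative infinity.

```python
import math
from collections import Counter, defaultdict

def fit_discrete_nb(X, y):
    """Count class occurrences and per-class, per-feature value occurrences."""
    class_counts = Counter(y)
    feature_counts = defaultdict(Counter)     # (class, feature index) -> value counts
    for x, c in zip(X, y):
        for d, value in enumerate(x):
            feature_counts[(c, d)][value] += 1
    return class_counts, feature_counts

def predict(class_counts, feature_counts, x):
    """argmax_c ln p(y = c) + sum_d ln p(x_d | y = c), using count-based estimates."""
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / n)
        for d, value in enumerate(x):
            count = feature_counts[(c, d)][value]
            if count == 0:                    # value never seen with class c: probability zero
                score = -math.inf
                break
            score += math.log(count / n_c)
        if score > best_score:
            best, best_score = c, score
    return best                               # None if every class has probability zero

# Hypothetical data: (weather, temperature) -> activity.
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool"), ("sunny", "cool")]
y = ["out", "out", "in", "in", "out"]

class_counts, feature_counts = fit_discrete_nb(X, y)
pred_a = predict(class_counts, feature_counts, ("sunny", "mild"))
pred_b = predict(class_counts, feature_counts, ("rainy", "cool"))
```

Working in log space avoids multiplying many small probabilities; in practice one would usually add smoothing to the counts, which is outside what the slides cover.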
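Naive Bayes

For discrete features, plugging in the previous MLE estimates gives

$$\arg\max_{c \in [C]} p(y = c \mid x) = \arg\max_{c \in [C]} \left( \ln p(y = c) + \sum_{d=1}^{D} \ln p(x_d \mid y = c) \right)$$

$$= \arg\max_{c \in [C]} \left( \ln |\{n : y_n = c\}| + \sum_{d=1}^{D} \ln \frac{|\{n : x_{nd} = x_d, y_n = c\}|}{|\{n : y_n = c\}|} \right)$$

Naive Bayes

For continuous features with a Gaussian model,

$$\arg\max_{c \in [C]} p(y = c \mid x) = \arg\max_{c \in [C]} \left( \ln p(y = c) + \sum_{d=1}^{D} \ln p(x_d \mid y = c) \right)$$

$$= \arg\max_{c \in [C]} \left( \ln |\{n : y_n = c\}| + \sum_{d=1}^{D} \ln \left( \frac{1}{\sqrt{2\pi}\,\sigma_{cd}} \exp\left( -\frac{(x_d - \mu_{cd})^2}{2\sigma_{cd}^2} \right) \right) \right)$$

$$= \arg\max_{c \in [C]} \left( \ln |\{n : y_n = c\}| - \sum_{d=1}^{D} \left( \ln \sigma_{cd} + \frac{(x_d - \mu_{cd})^2}{2\sigma_{cd}^2} \right) \right)$$

Connection to logistic regression

Let us fix the variance for each feature to be $\sigma$ (i.e. it is not a parameter of the model any more). Then the prediction becomes

$$\arg\max_{c \in [C]} p(y = c \mid x) = \arg\max_{c \in [C]} \left( \ln |\{n : y_n = c\}| - \sum_{d=1}^{D} \left( \ln \sigma + \frac{(x_d - \mu_{cd})^2}{2\sigma^2} \right) \right)$$

$$= \arg\max_{c \in [C]} \left( \ln |\{n : y_n = c\}| - \frac{\|x\|_2^2}{2\sigma^2} - \sum_{d=1}^{D} \frac{\mu_{cd}^2}{2\sigma^2} + \sum_{d=1}^{D} \frac{\mu_{cd}}{\sigma^2} x_d \right)$$

$$= \arg\max_{c \in [C]} \left( w_{c0} + \sum_{d=1}^{D} w_{cd} x_d \right) = \arg\max_{c \in [C]} w_c^T x \quad \text{(linear classifier!)}$$

where we denote $w_{c0} = \ln |\{n : y_n = c\}| - \sum_{d=1}^{D} \frac{\mu_{cd}^2}{2\sigma^2}$ and $w_{cd} = \frac{\mu_{cd}}{\sigma^2}$, and $x$ is augmented with a constant 1 to absorb $w_{c0}$. The terms $D \ln \sigma$ and $\frac{\|x\|_2^2}{2\sigma^2}$ do not depend on $c$ and can be dropped.

Connection to logistic regression

You can verify that

$$p(y = c \mid x) \propto e^{w_c^T x}$$

This is exactly the softmax function, the same model we used for a probabilistic interpretation of logistic regression!

So what is different then? They learn the parameters in different ways:

- both via MLE, one on $p(y = c \mid x)$, the other on $p(x, y)$
- the solutions are different: logistic regression has no closed form, while naive Bayes admits a simple closed form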
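The Gaussian variant can be sketched the same way: fit a prior plus a per-feature empirical mean and variance for each class, then score a new example with the log of the prior and the per-feature log densities. The toy data and function names below are assumptions for illustration, and the sketch assumes every class has non-zero variance in every feature.

```python
import math

def fit_gaussian_nb(X, y):
    """Return {class: (prior, per-feature means, per-feature variances)}."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n_c, D = len(rows), len(rows[0])
        means = [sum(r[d] for r in rows) / n_c for d in range(D)]
        # Empirical (maximum-likelihood) variance; assumed to be non-zero here.
        variances = [sum((r[d] - means[d]) ** 2 for r in rows) / n_c for d in range(D)]
        model[c] = (n_c / len(X), means, variances)
    return model

def predict(model, x):
    """argmax_c ln p(y = c) + sum_d ln N(x_d | mu_cd, var_cd)."""
    def score(c):
        prior, means, variances = model[c]
        s = math.log(prior)
        for xd, m, v in zip(x, means, variances):
            s += -0.5 * math.log(2.0 * math.pi * v) - (xd - m) ** 2 / (2.0 * v)
        return s
    return max(model, key=score)

# Hypothetical two-class, two-feature dataset.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
pred_near_class0 = predict(model, [1.1, 2.0])
pred_near_class1 = predict(model, [3.1, 0.6])
```

Training is a single pass of averaging, which reflects the closed-form MLE solution mentioned above, in contrast to the iterative optimization logistic regression needs.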
Two different modeling paradigms

Suppose the training data is from an unknown joint probabilistic model $p(x, y)$. There are two kinds of classification models in machine learning: generative models and discriminative models.

            Discriminative model          Generative model
Example     logistic regression           naive Bayes
Model       conditional p(y | x)          joint p(x, y) (might have the same p(y | x))
Learning    MLE                           MLE
Accuracy    usually better for large N    usually better for small N
Remark                                    more flexible, can generate data after learning

Generative model vs. discriminative model

Differences in assuming models for the data:

- the generative approach requires that we specify the model for the joint distribution (such as Naive Bayes), and thus maximize the joint likelihood $\sum_n \log p(x_n, y_n)$
- the discriminative approach requires only specifying a model for the conditional distribution (such as logistic regression), and thus maximize the conditional likelihood $\sum_n \log p(y_n \mid x_n)$

Sometimes, modeling by the discriminative approach is easier. Sometimes, parameter estimation by the generative approach is easier.

Determining sex (man or woman) based on measurements

[Figure: scatter plot of weight versus height; red = female, blue = male]

Example: Generative approach

Propose a model of the joint distribution of ($x$ = height, $y$ = sex).

[Figure: our data, weight versus height; red = female, blue = male]

Intuition: we will model how heights vary (according to a Gaussian) in each sub-population (male and female).

Note: This is similar to Naive Bayes for detecting spam emails.
14 Model of the joit distributio Parameter estimatio px, y = pypx y p = 2πσ e x µ 2 2πσ2 e x µ 2 2 p 2 2σ 2 if y = 2σ 2 2 if y = 2 where p + p 2 = represets two prior probabilities that x is give the label or 2 respectively. px y is assumed to be Gaussias. weight red = female, blue=male height Likelihood of the traiig data D = {x, y } N = with y {, 2} log P D = log px, y = log p e x µ 2 2σ 2 2πσ :y = + :y =2 log p 2 e x µ 2 2 2σ 2 2 2πσ2 Maximize the likelihood fuctio p, p 2, µ, µ 2, σ, σ2 = log P D March 26, / 57 March 26, / 57 Decisio boudary Example of oliear decisio boudary The decisio boudary betwee two classes is defied by py = x py = 2 x which is equivalet to px y = py = px y = 2py = 2 Namely, 2 0 Parabolic Boudary x µ 2 2σ 2 log 2πσ + log p x µ 2 2 2σ 2 2 log 2πσ 2 + log p 2 2 It is quadratic i x. It follows for some a, b ad c, that The decisio boudary is ot liear! ax 2 + bx + c 0 March 26, / Note: the boudary is characterized by a quadratic fuctio, givig rise to the shape of parabolic curve. March 26, / 57
A special case

What if we assume the two Gaussians have the same variance? We will get a linear decision boundary.

From the previous slide:

$$-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \log \sqrt{2\pi}\,\sigma_1 + \log p_1 \geq -\frac{(x - \mu_2)^2}{2\sigma_2^2} - \log \sqrt{2\pi}\,\sigma_2 + \log p_2$$

Setting $\sigma_1 = \sigma_2$, the quadratic terms cancel and we obtain

$$b x + c \geq 0$$

Note: equal variances across two different categories could be a very strong assumption. For example, the plot suggests that the male population has slightly bigger variance (i.e., a bigger ellipse) than the female population.
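Expanding the squares in the decision inequality gives explicit coefficients: $a = \frac{1}{2\sigma_2^2} - \frac{1}{2\sigma_1^2}$, $b = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}$, and $c = \frac{\mu_2^2}{2\sigma_2^2} - \frac{\mu_1^2}{2\sigma_1^2} + \log\frac{\sigma_2}{\sigma_1} + \log\frac{p_1}{p_2}$. The sketch below computes them and shows that $a$ vanishes when the variances are equal; the height-like parameter values and the function names are assumptions chosen only for illustration.

```python
import math

def log_joint(x, prior, mu, sigma):
    """log( p(y) p(x | y) ) for a 1-D Gaussian class-conditional density."""
    return (math.log(prior) - math.log(math.sqrt(2.0 * math.pi) * sigma)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))

def boundary_coefficients(p1, mu1, s1, p2, mu2, s2):
    """Coefficients of a x^2 + b x + c = log_joint(class 1) - log_joint(class 2)."""
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2.0 * s2 ** 2) - mu1 ** 2 / (2.0 * s1 ** 2)
         + math.log(s2 / s1) + math.log(p1 / p2))
    return a, b, c

# Hypothetical parameters (heights in cm): different variances -> quadratic boundary.
a, b, c = boundary_coefficients(0.5, 165.0, 6.0, 0.5, 178.0, 8.0)

# Same variance in both classes -> the quadratic term vanishes (linear boundary).
a_eq, b_eq, c_eq = boundary_coefficients(0.5, 165.0, 7.0, 0.5, 178.0, 7.0)
threshold = -c_eq / b_eq    # with equal priors this is the midpoint of the two means
```

With equal variances and equal priors the rule reduces to comparing $x$ against the midpoint of the two class means, which is the simplest instance of the linear boundary described above.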
Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machie Learig (Fall 2014) Drs. Sha & Liu {feisha,yaliu.cs}@usc.edu October 9, 2014 Drs. Sha & Liu ({feisha,yaliu.cs}@usc.edu) CSCI567 Machie Learig (Fall 2014) October 9, 2014 1 / 49 Outlie Admiistratio
More informationSTAT Homework 2 - Solutions
STAT-36700 Homework - Solutios Fall 08 September 4, 08 This cotais solutios for Homework. Please ote that we have icluded several additioal commets ad approaches to the problems to give you better isight.
More informationStatistical and Mathematical Methods DS-GA 1002 December 8, Sample Final Problems Solutions
Statistical ad Mathematical Methods DS-GA 00 December 8, 05. Short questios Sample Fial Problems Solutios a. Ax b has a solutio if b is i the rage of A. The dimesio of the rage of A is because A has liearly-idepedet
More information3/8/2016. Contents in latter part PATTERN RECOGNITION AND MACHINE LEARNING. Dynamical Systems. Dynamical Systems. Linear Dynamical Systems
Cotets i latter part PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA Liear Dyamical Systems What is differet from HMM? Kalma filter Its stregth ad limitatio Particle Filter Its simple
More informationChapter 12 EM algorithms The Expectation-Maximization (EM) algorithm is a maximum likelihood method for models that have hidden variables eg. Gaussian
Chapter 2 EM algorithms The Expectatio-Maximizatio (EM) algorithm is a maximum likelihood method for models that have hidde variables eg. Gaussia Mixture Models (GMMs), Liear Dyamic Systems (LDSs) ad Hidde
More information1 Review of Probability & Statistics
1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5
More information1.010 Uncertainty in Engineering Fall 2008
MIT OpeCourseWare http://ocw.mit.edu.00 Ucertaity i Egieerig Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu.terms. .00 - Brief Notes # 9 Poit ad Iterval
More informationMachine Learning Theory (CS 6783)
Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT
More informationREGRESSION WITH QUADRATIC LOSS
REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d
More informationLecture 19: Convergence
Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may
More informationLecture Note 8 Point Estimators and Point Estimation Methods. MIT Spring 2006 Herman Bennett
Lecture Note 8 Poit Estimators ad Poit Estimatio Methods MIT 14.30 Sprig 2006 Herma Beett Give a parameter with ukow value, the goal of poit estimatio is to use a sample to compute a umber that represets
More informationECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization
ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where
More informationFACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures
FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 5 STATISTICS II. Mea ad stadard error of sample data. Biomial distributio. Normal distributio 4. Samplig 5. Cofidece itervals
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationOutline. L7: Probability Basics. Probability. Probability Theory. Bayes Law for Diagnosis. Which Hypothesis To Prefer? p(a,b) = p(b A) " p(a)
Outlie L7: Probability Basics CS 344R/393R: Robotics Bejami Kuipers. Bayes Law 2. Probability distributios 3. Decisios uder ucertaity Probability For a propositio A, the probability p(a is your degree
More informationExpectation and Variance of a random variable
Chapter 11 Expectatio ad Variace of a radom variable The aim of this lecture is to defie ad itroduce mathematical Expectatio ad variace of a fuctio of discrete & cotiuous radom variables ad the distributio
More information6.867 Machine learning, lecture 7 (Jaakkola) 1
6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit
More informationAda Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities
CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We
More informationExponential Families and Bayesian Inference
Computer Visio Expoetial Families ad Bayesia Iferece Lecture Expoetial Families A expoetial family of distributios is a d-parameter family f(x; havig the followig form: f(x; = h(xe g(t T (x B(, (. where
More information1 Review and Overview
DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,
More informationSolution of Final Exam : / Machine Learning
Solutio of Fial Exam : 10-701/15-781 Machie Learig Fall 2004 Dec. 12th 2004 Your Adrew ID i capital letters: Your full ame: There are 9 questios. Some of them are easy ad some are more difficult. So, if
More informationLet us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.
Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More information1 Review and Overview
CS9T/STATS3: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #6 Scribe: Jay Whag ad Patrick Cho October 0, 08 Review ad Overview Recall i the last lecture that for ay family of scalar fuctios F, we
More informationBig Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates.
5. Data, Estimates, ad Models: quatifyig the accuracy of estimates. 5. Estimatig a Normal Mea 5.2 The Distributio of the Normal Sample Mea 5.3 Normal data, cofidece iterval for, kow 5.4 Normal data, cofidece
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationPattern Classification
Patter Classificatio All materials i these slides were tae from Patter Classificatio (d ed) by R. O. Duda, P. E. Hart ad D. G. Stor, Joh Wiley & Sos, 000 with the permissio of the authors ad the publisher
More informationSTAT Homework 1 - Solutions
STAT-36700 Homework 1 - Solutios Fall 018 September 11, 018 This cotais solutios for Homework 1. Please ote that we have icluded several additioal commets ad approaches to the problems to give you better
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationGoodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)
Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................
More informationMachine Learning. Logistic Regression -- generative verses discriminative classifier. Le Song /15-781, Spring 2008
Machie Learig 070/578 Srig 008 Logistic Regressio geerative verses discrimiative classifier Le Sog Lecture 5 Setember 4 0 Based o slides from Eric Xig CMU Readig: Cha. 3..34 CB Geerative vs. Discrimiative
More informationSupport Vector Machines and Kernel Methods
Support Vector Machies ad Kerel Methods Daiel Khashabi Fall 202 Last Update: September 26, 206 Itroductio I Support Vector Machies the goal is to fid a separator betwee data which has the largest margi,
More informationDimensionality Reduction vs. Clustering
Dimesioality Reductio vs. Clusterig Lecture 9: Cotiuous Latet Variable Models Sam Roweis Traiig such factor models (e.g. FA, PCA, ICA) is called dimesioality reductio. You ca thik of this as (o)liear regressio
More informationDistributional Similarity Models (cont.)
Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical Last Time EM Clusterig Soft versio of K-meas clusterig Iput: m dimesioal objects X = {
More information