Semi-Supervised Learning
1 Semi-Supervised Learning
Consider the problem of Prepositional Phrase Attachment: "buy car with money" vs. "buy car with wheel". There are several ways to generate features. Given the limited representation, we can assume that all possible conjunctions of the 4 attributes are used (15 features in each example). Assume we will use naïve Bayes for learning to decide between [n, v]. Examples are: (x_1, x_2, ..., x_n, [n, v]).
2 Using naïve Bayes
To use naïve Bayes, we need to use the data to estimate: P(n), P(v), P(x_1|n), P(x_1|v), P(x_2|n), P(x_2|v), ..., P(x_n|n), P(x_n|v).
Then, given an example (x_1, x_2, ..., x_n, ?), compare:
P(n|x) ~ P(n) P(x_1|n) P(x_2|n) ... P(x_n|n)   and   P(v|x) ~ P(v) P(x_1|v) P(x_2|v) ... P(x_n|v)
3 Using naïve Bayes
After seeing 10 examples, we have:
P(n) = 0.5; P(v) = 0.5
P(x_1|n) = 0.75; P(x_2|n) = 0.5; P(x_3|n) = 0.5; P(x_4|n) = 0.5
P(x_1|v) = 0.25; P(x_2|v) = 0.25; P(x_3|v) = 0.75; P(x_4|v) = 0.5
Then, given an example x = (1, 0, 0, 0), we have:
P_n(x) = 0.5 · 0.75 · 0.5 · 0.5 · 0.5 = 3/64
P_v(x) = 0.5 · 0.25 · 0.75 · 0.25 · 0.5 = 3/256
Now, assume that in addition to the 10 labeled examples, we also have 100 unlabeled examples. Will that help?
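A quick numeric check of the two scores above (a minimal sketch; the helper function and variable names are illustrative, not from the slides):

```python
# Naive Bayes scores for x = (1, 0, 0, 0) using the estimates above.
# p_n[i] = P(x_i = 1 | n), p_v[i] = P(x_i = 1 | v); priors P(n) = P(v) = 0.5.
p_n = [0.75, 0.5, 0.5, 0.5]
p_v = [0.25, 0.25, 0.75, 0.5]

def nb_score(prior, cond, x):
    """prior * prod_i P(x_i | label), with P(x_i = 0 | label) = 1 - P(x_i = 1 | label)."""
    score = prior
    for p, xi in zip(cond, x):
        score *= p if xi == 1 else (1 - p)
    return score

x = (1, 0, 0, 0)
print(nb_score(0.5, p_n, x))   # 0.046875   = 3/64
print(nb_score(0.5, p_v, x))   # 0.01171875 = 3/256
```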
4 Using naïve Bayes
For example, what can be done with the example (1, 0, 0, 0)? We have an estimate for its label. But can we use it to improve the classifier (that is, the estimation of the probabilities that we will use in the future)?
Option 1: We can make predictions and believe them, or some of them (based on what?).
Option 2: We can assume the example x = (1, 0, 0, 0) is
an n-labeled example with probability P_n(x) / (P_n(x) + P_v(x)), and
a v-labeled example with probability P_v(x) / (P_n(x) + P_v(x)).
Estimation of probabilities does not require working with integers!
5 Using Unlabeled Data
The discussion suggests several algorithms:
1. Use a threshold. Choose the examples labeled with high confidence. Label them [n, v]. Retrain.
2. Use fractional examples. Label the examples with fractional labels [p of n, (1-p) of v]. Retrain.
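A sketch of the second algorithm (fractional labels) for the naïve Bayes setting, using soft counts; the function names and the unsmoothed estimates are illustrative assumptions, not the slides' notation:

```python
# One round of "label with fractional labels, then retrain" for binary features.
# labeled: list of (x, y) with y in {"n", "v"}; unlabeled: list of x; x is a 0/1 tuple.

def estimate(labeled, weighted_unlabeled, n_feats):
    """Re-estimate P(label) and P(x_i = 1 | label) from hard counts plus soft counts."""
    counts = {"n": 0.0, "v": 0.0}
    feat = {"n": [0.0] * n_feats, "v": [0.0] * n_feats}
    examples = [(x, {"n": 1.0 if y == "n" else 0.0, "v": 1.0 if y == "v" else 0.0})
                for x, y in labeled] + weighted_unlabeled
    for x, w in examples:
        for lab in ("n", "v"):
            counts[lab] += w[lab]
            for i, xi in enumerate(x):
                feat[lab][i] += w[lab] * xi
    total = counts["n"] + counts["v"]
    return ({lab: counts[lab] / total for lab in counts},
            {lab: [feat[lab][i] / counts[lab] for i in range(n_feats)] for lab in feat})

def posterior_n(x, priors, cond):
    """P(n | x) under the current naive Bayes estimates."""
    def score(lab):
        s = priors[lab]
        for p, xi in zip(cond[lab], x):
            s *= p if xi == 1 else (1 - p)
        return s
    sn, sv = score("n"), score("v")
    return sn / (sn + sv)

def one_iteration(labeled, unlabeled, n_feats):
    priors, cond = estimate(labeled, [], n_feats)        # train on labeled data only
    weighted = []
    for x in unlabeled:
        p = posterior_n(x, priors, cond)
        weighted.append((x, {"n": p, "v": 1 - p}))        # "p of n, (1-p) of v"
    return estimate(labeled, weighted, n_feats)           # retrain with the soft counts
```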
6 Comments on Unlabeled Data
Both algorithms suggested can be used iteratively. Both algorithms can be used with other classifiers, not only naïve Bayes. The only requirement is a robust confidence measure in the classification.
There are other approaches to semi-supervised learning: see the included papers (co-training; Yarowsky's Decision List/Bootstrapping algorithm; graph-based algorithms that assume similar examples have similar labels, etc.).
What happens if instead of 10 labeled examples we start with 0 labeled examples? Make a guess; continue as above; this is a version of EM.
7 EM
EM is a class of algorithms used to estimate a probability distribution in the presence of missing attributes. Using it requires an assumption on the underlying probability distribution. The algorithm can be very sensitive to this assumption and to the starting point (that is, the initial guess of parameters). In general, it is known to converge to a local maximum of the likelihood function.
8 Three Coin Example
We observe a series of coin tosses generated in the following way. A person has three coins:
Coin 0: probability of Head is α
Coin 1: probability of Head is p
Coin 2: probability of Head is q
Consider the following coin-tossing scenarios:
9 Estimation Problems
Scenario I: Toss one of the coins four times, observing HHTH. Question: Which coin is more likely to have produced this sequence?
Scenario II: Toss Coin 0. If Head, toss Coin 1; otherwise toss Coin 2. We observe the sequences HHHHT, THTHT, HHHHT, HHTTH, produced by Coin 0, Coin 1 and Coin 2, and we see which coin produced each sequence. Question: Estimate the most likely values for p, q (the probability of H in each coin) and the probability α of using each of the coins.
Scenario III: Toss Coin 0. If Head, toss Coin 1; otherwise toss Coin 2. We observe the sequences HHHT, HTHT, HHHT, HTTH, produced by Coin 1 and/or Coin 2, but the outcome of Coin 0 is hidden. Question: Estimate the most likely values for p, q and α.
There is no known analytical solution to this problem (in the general setting); that is, it is not known how to compute the values of the parameters so as to maximize the likelihood of the data.
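To make Scenario III concrete, here is one way such data could be simulated (a sketch; the parameter values α = 0.6, p = 0.8, q = 0.3 and the sequence length are made up for illustration):

```python
import random

def toss_sequence(alpha, p, q, m):
    """Scenario III generator: a hidden toss of Coin 0 picks which coin produces all
    m visible tosses; the outcome of Coin 0 itself is never recorded."""
    use_coin_1 = random.random() < alpha              # hidden choice
    prob_head = p if use_coin_1 else q
    return "".join("H" if random.random() < prob_head else "T" for _ in range(m))

random.seed(0)
data = [toss_sequence(alpha=0.6, p=0.8, q=0.3, m=4) for _ in range(6)]
print(data)   # only the H/T strings are observed, not which coin produced them
```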
10 Key Intuition (1)
If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which from Coin 2, there would be no problem. Recall that the simple estimation is the ML estimation: assume that you toss a (p, 1-p) coin m times and get k Heads and m-k Tails.
log P(D|p) = log [ p^k (1-p)^(m-k) ] = k log p + (m-k) log(1-p)
To maximize, set the derivative with respect to p equal to 0:
d log P(D|p) / dp = k/p - (m-k)/(1-p) = 0
Solving this for p gives p = k/m.
11 Key Intuition (2)
If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which from Coin 2, there would be no problem. Instead, use an iterative approach for estimating the parameters:
Guess the probability that a given data point came from Coin 1 or Coin 2; generate fictional labels, weighted according to this probability.
Now compute the most likely values of the parameters [recall the naïve Bayes example].
Compute the likelihood of the data given this model.
Re-estimate the initial parameter setting: set the parameters to maximize the likelihood of the data.
(Labels → Model Parameters → Likelihood of the data)
This process can be iterated and can be shown to converge to a local maximum of the likelihood function.
12 EM Algorithm (Coins) - I
We will assume (for a minute) that we know the parameters p, q, α and use them to estimate which coin each observation came from (Problem 1). Then we will use this label estimation of the observed tosses to estimate the most likely parameters, and so on...
Notation: n data points; in each one: m tosses, h_i heads.
What is the probability that the i-th data point came from Coin 1?
STEP 1 (Expectation Step):
P_i = P(Coin 1 | D_i) = P(D_i | Coin 1) P(Coin 1) / P(D_i)
    = α p^{h_i} (1-p)^{m-h_i} / [ α p^{h_i} (1-p)^{m-h_i} + (1-α) q^{h_i} (1-q)^{m-h_i} ]
13 EM Algorithm (Coins) - II
Now we would like to compute the likelihood of the data and find the parameters that maximize it. We will maximize the log likelihood of the data (n data points): LL = Σ_{i=1..n} log P(D_i | p, q, α).
But one of the variables (the coin's name) is hidden. We can marginalize:
LL = Σ_{i=1..n} log Σ_{y=0,1} P(D_i, y | p, q, α)
   = Σ_{i=1..n} log Σ_{y=0,1} P(D_i | p, q, α) P(y | D_i, p, q, α)
   = Σ_{i=1..n} log E_y P(D_i | p, q, α)
   ≥ Σ_{i=1..n} E_y log P(D_i | p, q, α)
where the inequality is due to Jensen's inequality; we maximize a lower bound on the likelihood.
The sum inside the log makes the direct ML solution difficult. Since the latent variable y is not observed, we cannot use the complete-data log likelihood directly. Instead, we use the expectation of the complete-data log likelihood under the posterior distribution of the latent variable to approximate log P(D | p, q, α): we think of the likelihood log P(D_i | p, q, α) as a random variable that depends on the value y of the coin in the i-th data point, and instead of maximizing the LL we maximize the expectation of this random variable (over the coin's name). [Justified using Jensen's inequality, as above.]
14 EM Algorithm (Coins) - III
We maximize the expectation of this random variable (over the coin's name):
E[LL] = E[ Σ_{i=1..n} log P(D_i | p, q, α) ] = Σ_{i=1..n} E[ log P(D_i | p, q, α) ]
      = Σ_{i=1..n} [ P_i log P(D_i, 1 | p, q, α) + (1-P_i) log P(D_i, 0 | p, q, α) - P_i log P_i - (1-P_i) log(1-P_i) ]
(The last two entropy terms do not matter when we maximize over the parameters.)
This is due to the linearity of expectation and the definition of the random variable:
log P(D_i, y | p, q, α) = log P(D_i, 1 | p, q, α) with probability P_i, and log P(D_i, 0 | p, q, α) with probability (1-P_i).
15 EM Algorithm (Coins) - IV
Explicitly, we get:
E[ log P(D | p, q, α) ] = Σ_i [ P_i log P(1, D_i | p, q, α) + (1-P_i) log P(0, D_i | p, q, α) ]
  = Σ_i [ P_i log( α p^{h_i} (1-p)^{m-h_i} ) + (1-P_i) log( (1-α) q^{h_i} (1-q)^{m-h_i} ) ]
  = Σ_i [ P_i ( log α + h_i log p + (m-h_i) log(1-p) ) + (1-P_i) ( log(1-α) + h_i log q + (m-h_i) log(1-q) ) ]
16 EM Algorithm (Coins) - V
Finally, to find the most likely parameters α, p, q, we set the derivatives to zero.
STEP 2 (Maximization Step). (Sanity check: think of the weighted fictional points.)
dE/dα = Σ_{i=1..n} [ P_i/α - (1-P_i)/(1-α) ] = 0   ⇒   α = (1/n) Σ_{i=1..n} P_i
dE/dp = Σ_{i=1..n} P_i [ h_i/p - (m-h_i)/(1-p) ] = 0   ⇒   p = Σ_i P_i h_i / ( m Σ_i P_i )
dE/dq = Σ_{i=1..n} (1-P_i) [ h_i/q - (m-h_i)/(1-q) ] = 0   ⇒   q = Σ_i (1-P_i) h_i / ( m Σ_i (1-P_i) )
When computing the derivatives, notice that P_i here is a constant; it was computed using the current parameters in the E step.
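Putting the E-step (slide 12) and this M-step together gives a complete procedure. Below is a minimal Python sketch; the data set, the starting point, and the fixed iteration count are illustrative assumptions, not part of the slides:

```python
def em_coins(data, alpha, p, q, iters=50):
    """data: list of (h, m) pairs; h heads observed out of m tosses per data point."""
    for _ in range(iters):
        # E-step: P_i = posterior probability that data point i came from Coin 1.
        post = []
        for h, m in data:
            a = alpha * (p ** h) * ((1 - p) ** (m - h))
            b = (1 - alpha) * (q ** h) * ((1 - q) ** (m - h))
            post.append(a / (a + b))
        # M-step: re-estimate alpha, p, q from the soft (fractional) counts.
        n = len(data)
        alpha = sum(post) / n
        p = sum(P * h for P, (h, m) in zip(post, data)) / sum(P * m for P, (h, m) in zip(post, data))
        q = sum((1 - P) * h for P, (h, m) in zip(post, data)) / sum((1 - P) * m for P, (h, m) in zip(post, data))
    return alpha, p, q

data = [(3, 4), (2, 4), (3, 4), (2, 4)]   # HHHT, HTHT, HHHT, HTTH as (heads, tosses)
print(em_coins(data, alpha=0.5, p=0.6, q=0.4))
```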
17 Models with Hidden Variables
18 EM: General Setting
The EM algorithm is a general-purpose algorithm for finding the maximum likelihood estimate in latent variable models. In the E-step, we fill in the latent variables using the posterior, and in the M-step, we maximize the expected complete log likelihood with respect to that posterior distribution.
Let D = (x_1, ..., x_N) be the observed data, let Z denote the hidden random variables (we are not committing to any particular model), and let µ be the model parameters. Then
µ* = argmax_µ p(x | µ) = argmax_µ Σ_z p(x, z | µ) = argmax_µ Σ_z p(z | µ) p(x | z, µ)
The joint p(x, z | µ) is the complete-data likelihood; its log is called the complete log likelihood.
19 EM: General Setting (2)
To derive the EM objective function, we re-write the likelihood by multiplying by q(z)/q(z), where q(z) is an arbitrary distribution over the random variable z:
log p(x | µ) = log Σ_z p(x, z | µ) = log Σ_z p(z | µ) p(x | z, µ) q(z)/q(z)
            = log E_q [ p(z | µ) p(x | z, µ) / q(z) ]
            ≥ E_q log [ p(z | µ) p(x | z, µ) / q(z) ]
where the inequality is due to Jensen's inequality applied to the concave function log. (Jensen's inequality for convex functions: E[f(x)] ≥ f(E[x]); but log is concave, so E[log x] ≤ log E[x].)
We get the objective:
L(µ, q) = E_q [ log p(z | µ) ] + E_q [ log p(x | z, µ) ] - E_q [ log q(z) ]
The last component is an entropy term; it is also possible to write the objective so that it includes a KL divergence (a distance function between distributions) between q(z) and p(z | x, µ).
20 EM: General Setting (3)
EM now proceeds iteratively, as a coordinate-ascent algorithm on L(µ, q), where we choose q = p(z | x, µ). At the t-th step, we have q^(t) and µ^(t).
E-Step: update the posterior q while holding µ^(t) fixed: q^(t+1) = argmax_q L(q, µ^(t)) = p(z | x, µ^(t)).
M-Step: update the model parameters to maximize the expected complete log-likelihood: µ^(t+1) = argmax_µ L(q^(t+1), µ).
Other q's can be chosen [Samdani & Roth 2012] to give other EM algorithms. Specifically, you can choose a q that picks the most likely z in the E-step and then continues to estimate the parameters (called Truncated EM, or Hard EM). (Think back to the semi-supervised case.)
To wrap it up, with the right q:
L(µ, q) = E_q log [ p(z | µ) p(x | z, µ) / q(z) ] = Σ_z p(z | x, µ) log [ p(x, z | µ) / p(z | x, µ) ]
        = Σ_z p(z | x, µ) log [ p(x, z | µ) p(x | µ) / p(z, x | µ) ]
        = Σ_z p(z | x, µ) log p(x | µ) = log p(x | µ) Σ_z p(z | x, µ) = log p(x | µ)
So, by maximizing the objective function, we are also maximizing the log likelihood function.
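The hard (truncated) variant differs from standard EM only in what it does with the posterior P_i in the E-step; a toy illustration of that single design choice (the function names are made up):

```python
# Soft vs. hard ("truncated") treatment of a posterior P_i in the E-step.
def soft_assignment(p_i):
    return p_i                          # standard EM: keep the full fractional posterior

def hard_assignment(p_i):
    return 1.0 if p_i >= 0.5 else 0.0   # hard EM: commit to the most likely hidden value

print(soft_assignment(0.7), hard_assignment(0.7))   # 0.7 1.0
```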
21 The General EM Procedure
(Diagram: the loop alternating between the E-step and the M-step.)
22 EM Summary (so far)
EM is a general procedure for learning in the presence of unobserved variables. We have shown how to use it in order to estimate the most likely density function for a mixture of (Bernoulli) distributions.
EM is an iterative algorithm that can be shown to converge to a local maximum of the likelihood function. It depends on assuming a family of probability distributions; in this sense, it is a family of algorithms. The update rules you derive depend on the model assumed.
It has been shown to be quite useful in practice, when the assumptions made on the probability distribution are correct, but can fail otherwise.
23 EM Summary (so far)
EM is a general procedure for learning in the presence of unobserved variables. The (family of) probability distributions is known; the problem is to estimate its parameters.
In the presence of hidden variables, we can often think about it as a mixture-of-distributions problem: the participating distributions are known, and we need to estimate the parameters of the distributions and the mixture policy.
Our previous example: a mixture of Bernoulli distributions.
24 Example: K-Means Algorithm
K-means is a clustering algorithm. We are given data points, known to be sampled independently from a mixture of k Normal distributions, with means µ_i, i = 1, ..., k, and the same standard deviation σ.
(Figure: the density p(x) plotted against x, a mixture of k Gaussian bumps.)
25 Example: K-Means Algorithm
First, notice that if we knew that all the data points were taken from a normal distribution with mean µ, finding its most likely value would be easy:
p(x | µ) ∝ exp[ -(1/(2σ²)) (x - µ)² ]
Given many data points, D = {x_1, ..., x_m}:
ln L(D | µ) = ln p(D | µ) ∝ - Σ_i (1/(2σ²)) (x_i - µ)²
Maximizing the log-likelihood is equivalent to minimizing the sum of squares:
µ_ML = argmin_µ Σ_i (x_i - µ)²
Setting the derivative with respect to µ to zero, the minimal point, that is, the most likely mean, is µ = (1/m) Σ_i x_i.
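A quick numeric check that the sample mean minimizes the sum of squared deviations (the data values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
mus = np.linspace(0.0, 8.0, 801)                       # candidate means on a fine grid
sse = [np.sum((x - mu) ** 2) for mu in mus]            # sum of squared deviations for each
print(mus[int(np.argmin(sse))], x.mean())              # both 3.5: the sample mean wins
```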
26 A Mixture of Distributions
As in the coin example, the problem is that the data is sampled from a mixture of k different normal distributions, and we do not know, for a given data point x_i, which one it was sampled from.
Assume that we observe data point x_i; what is the probability that it was sampled from the distribution with mean µ_j?
P_ij = P(µ_j | x_i) = P(x_i | µ_j) P(µ_j) / Σ_{n=1..k} P(x_i | µ_n) P(µ_n)
     = (1/k) exp[ -(1/(2σ²)) (x_i - µ_j)² ] / Σ_{n=1..k} (1/k) exp[ -(1/(2σ²)) (x_i - µ_n)² ]
27 A Mixture of Distributions
As in the coin example, the problem is that the data is sampled from a mixture of k different normal distributions, and we do not know, for each data point x_i, which one it was sampled from.
For a data point x_i, define k binary hidden variables, z_i1, z_i2, ..., z_ik, such that z_ij = 1 iff x_i is sampled from the j-th distribution.
E[z_ij] = 1 · P(x_i was sampled from µ_j) + 0 · P(x_i was not sampled from µ_j) = P_ij
(Recall: E[Y] = Σ_i y_i P(Y = y_i), and E[X + Y] = E[X] + E[Y].)
28 Example: K-Means Algorithms
Expectation (here the hypothesis is h = (µ_1, ..., µ_k)):
p(Y_i | h) = p(x_i, z_i1, ..., z_ik | h) ∝ exp[ -(1/(2σ²)) Σ_j z_ij (x_i - µ_j)² ]
Computing the likelihood given the observed data D = {x_1, ..., x_m} and the hypothesis h (without the constant coefficient):
ln P(Y | h) = - Σ_{i=1..m} (1/(2σ²)) Σ_j z_ij (x_i - µ_j)²
E[ ln P(Y | h) ] = E[ - Σ_{i=1..m} (1/(2σ²)) Σ_j z_ij (x_i - µ_j)² ] = - Σ_{i=1..m} (1/(2σ²)) Σ_j E[z_ij] (x_i - µ_j)²
29 Example: K-Means Algorithms
Maximization: maximizing
Q(h | h') = - Σ_{i=1..m} Σ_j E[z_ij] (x_i - µ_j)²
with respect to µ_j, we get:
dQ/dµ_j = C Σ_{i=1..m} E[z_ij] (x_i - µ_j) = 0
which yields:
µ_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij]
30 Summary: K-Means Algorithms
Given a set D = {x_1, ..., x_m} of data points, guess initial parameters µ_1, µ_2, ..., µ_k.
Compute (for all i, j)
p_ij = E[z_ij] = exp[ -(1/(2σ²)) (x_i - µ_j)² ] / Σ_{n=1..k} exp[ -(1/(2σ²)) (x_i - µ_n)² ]
and a new set of means:
µ_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij]
Repeat to convergence.
Notice that this algorithm will find the best k means in the sense of minimizing the sum of squared distances.
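A sketch of this soft k-means / EM procedure in Python, keeping the slide's assumptions of a shared, fixed σ and updating only the means; the synthetic data, σ = 1, and the fixed iteration count are illustrative choices:

```python
import numpy as np

def soft_kmeans(x, k, sigma=1.0, iters=100, seed=0):
    """'Soft k-means' from the slide: shared fixed variance; only the means are updated."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)   # initial guess for the means
    for _ in range(iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        logits = -((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: each mean is the E[z_ij]-weighted average of the points
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return mu

x = np.concatenate([np.random.normal(-3, 1, 200), np.random.normal(3, 1, 200)])
print(soft_kmeans(x, k=2))   # roughly [-3, 3] (order may vary)
```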
31 Summary: EM
EM is a general procedure for learning in the presence of unobserved variables. We have shown how to use it in order to estimate the most likely density function for a mixture of probability distributions.
EM is an iterative algorithm that can be shown to converge to a local maximum of the likelihood function; thus, it might require many restarts. It depends on assuming a family of probability distributions.
It has been shown to be quite useful in practice, when the assumptions made on the probability distribution are correct, but can fail otherwise.
As an example, we derived an important clustering algorithm, the k-means algorithm, and showed how to use it to estimate the most likely density function for a mixture of probability distributions.
32 More Thoughts about EM
Training: a sample of data points, (x_0, x_1, ..., x_n) ∈ {0,1}^(n+1).
Task: predict the value of x_0, given assignments to all n other variables.
33 More Thoughts about EM
Assume that a set of data points x ∈ {0,1}^(n+1) is generated as follows:
Postulate a hidden variable Z with k values, 1 ≤ z ≤ k, each taken with probability α_z, where Σ_{z=1..k} α_z = 1.
Having randomly chosen a value z for the hidden variable, we choose the value x_i for each observable X_i to be 1 with probability p_i^z and 0 otherwise [i = 0, 1, 2, ..., n].
Training: a sample of data points, (x_0, x_1, ..., x_n) ∈ {0,1}^(n+1).
Task: predict the value of x_0, given assignments to all n other variables.
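A small sketch of sampling from this generative model, to make the hidden-variable structure concrete; the specific parameter values are made up:

```python
import numpy as np

def sample_dataset(alphas, P, num_examples, seed=0):
    """alphas: k mixture weights for the hidden Z; P[z][i] = P(x_i = 1 | Z = z).
    Returns a (num_examples, n+1) 0/1 array; column 0 is x_0."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P)
    z = rng.choice(len(alphas), size=num_examples, p=alphas)   # hidden value per example
    return (rng.random((num_examples, P.shape[1])) < P[z]).astype(int)

# Toy parameters (made up): k = 2 hidden values, n + 1 = 4 observable bits.
data = sample_dataset(alphas=[0.4, 0.6],
                      P=[[0.9, 0.8, 0.1, 0.2],
                         [0.1, 0.2, 0.9, 0.8]],
                      num_examples=5)
print(data)
```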
34 More Thoughts about EM
Two options:
Parametric: estimate the model using EM. Once a model is known, use it to make predictions. Problem: we cannot use EM directly without an additional assumption on the way the data is generated.
Non-Parametric: learn x_0 directly as a function of the other variables. Problem: which function to try and learn? It turns out that x_0 is a linear function of the other variables when k = 2 (what does that mean?).
When k is known, the EM approach performs well; if an incorrect value is assumed, the estimation fails, and the linear methods (e.g., Perceptron) perform better [Grove & Roth 2001].
Another important distinction to attend to is the fact that, once you have estimated all the parameters with EM, you can answer many prediction problems, e.g., p(x_0, x_7, ... | x_1, x_2, ..., x_n), while with a Perceptron (say) you need to learn a separate model for each prediction problem.
35 EM
36 The EM Algorithm
Algorithm:
Guess initial values for the hypothesis h = (µ_1, µ_2, ..., µ_k).
Expectation: calculate Q(h', h) = E[ log P(Y | h') | h, X ] using the current hypothesis h and the observed data X.
Maximization: replace the current hypothesis h by the h' that maximizes the Q function (the expected complete log likelihood): set h = h' such that Q(h', h) is maximal.
Repeat: estimate the Expectation again.
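The procedure maps onto a short generic loop; the callback-style interface and the fixed iteration count (instead of a convergence test) are illustrative choices, not part of the slides:

```python
def em(data, h0, e_step, m_step, iters=100):
    """Generic EM loop matching the slide: e_step(h, data) returns the expected
    sufficient statistics (the 'filled-in' hidden values); m_step(stats, data)
    returns the h' that maximizes the expected complete-data log likelihood Q(h', h)."""
    h = h0
    for _ in range(iters):
        stats = e_step(h, data)     # Expectation: use the current h to fill in hidden variables
        h = m_step(stats, data)     # Maximization: pick the h' maximizing Q(h', h)
    return h
```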
More information