Semi-Supervised Learning


Semi-Supervised Learning

Consider the problem of Prepositional Phrase Attachment: "buy car with money" vs. "buy car with wheel". There are several ways to generate features. Given the limited representation, we can assume that all possible conjunctions of the 4 attributes are used (15 features in each example). Assume we will use naive Bayes for learning to decide between the two labels [n, v]. Examples are: (x1, x2, ..., xn, [n, v]).

Using naive Bayes

To use naive Bayes, we need to use the data to estimate:
P(n), P(v)
P(x1|n), P(x1|v)
P(x2|n), P(x2|v)
...
P(xn|n), P(xn|v)
Then, given an example (x1, x2, ..., xn, ?), compare:
P(n|x) ∝ P(n) P(x1|n) P(x2|n) ... P(xn|n)
and
P(v|x) ∝ P(v) P(x1|v) P(x2|v) ... P(xn|v)
(both are divided by the same P(x), which we can ignore for the comparison).

Using naive Bayes

After seeing 10 examples, we have:
P(n) = 0.5; P(v) = 0.5
P(x1|n) = 0.75; P(x2|n) = 0.5; P(x3|n) = 0.5; P(x4|n) = 0.5
P(x1|v) = 0.25; P(x2|v) = 0.25; P(x3|v) = 0.75; P(x4|v) = 0.5
Then, given an example x = (1000), we have:
P_n(x) = 0.5 · 0.75 · 0.5 · 0.5 · 0.5 = 3/64
P_v(x) = 0.5 · 0.25 · 0.75 · 0.25 · 0.5 = 3/256
Now, assume that in addition to the 10 labeled examples, we also have 100 unlabeled examples. Will that help?
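The arithmetic above is easy to reproduce. Below is a minimal sketch (not from the slides) that recomputes P_n(x) and P_v(x) for x = (1000) with the estimates listed above; the names prior, p_given and joint_score are just illustrative.

```python
# Minimal sketch of the naive Bayes computation above, using the slide's estimates.
prior = {"n": 0.5, "v": 0.5}
# p_given[label][i] is the estimate of P(x_{i+1} = 1 | label)
p_given = {"n": [0.75, 0.5, 0.5, 0.5],
           "v": [0.25, 0.25, 0.75, 0.5]}

def joint_score(x, label):
    """P(label) * prod_i P(x_i | label) for a binary feature vector x."""
    score = prior[label]
    for xi, p1 in zip(x, p_given[label]):
        score *= p1 if xi == 1 else (1.0 - p1)
    return score

x = (1, 0, 0, 0)
print(joint_score(x, "n"))   # 3/64  = 0.046875
print(joint_score(x, "v"))   # 3/256 = 0.01171875
```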

Using naive Bayes

For example, what can be done with the example (1000)? We have an estimate for its label. But, can we use it to improve the classifier (that is, the estimation of the probabilities that we will use in the future)?
Option 1: We can make predictions, and believe them. Or believe only some of them (based on what?).
Option 2: We can assume the example x = (1000) is:
an n-labeled example with probability P_n(x)/(P_n(x) + P_v(x))
a v-labeled example with probability P_v(x)/(P_n(x) + P_v(x))
Estimation of probabilities does not require working with integers!

Using Unlabeled Data

The discussion suggests several algorithms (sketched in code below):
1. Use a threshold. Choose examples labeled with high confidence. Label them [n, v]. Retrain.
2. Use fractional examples. Label the examples with fractional labels [p of n, (1-p) of v]. Retrain.
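As a rough sketch (assuming a counting-based naive Bayes that can be retrained from weighted examples; the helper name pseudo_label and its arguments are illustrative, not from the lecture), the two options might look like this:

```python
# Hypothetical helper: turn unlabeled examples into (example, label, weight)
# triples, either by thresholding (option 1) or with fractional labels (option 2).
# score_n / score_v are any scoring functions, e.g. joint_score(., "n") above.
def pseudo_label(unlabeled, score_n, score_v, mode="soft", threshold=0.9):
    out = []
    for x in unlabeled:
        pn, pv = score_n(x), score_v(x)
        p_n = pn / (pn + pv)              # posterior estimate P(n | x)
        if mode == "hard":                # option 1: keep only confident examples
            if p_n >= threshold:
                out.append((x, "n", 1.0))
            elif p_n <= 1.0 - threshold:
                out.append((x, "v", 1.0))
        else:                             # option 2: fractional labels
            out.append((x, "n", p_n))
            out.append((x, "v", 1.0 - p_n))
    return out    # retrain naive Bayes from these weighted examples
```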

Comments on Unlabeled Data

Both algorithms suggested can be used iteratively. Both algorithms can be used with other classifiers, not only naive Bayes. The only requirement is a robust confidence measure on the classification.
There are other approaches to semi-supervised learning: see the included papers (co-training; Yarowsky's Decision List/Bootstrapping algorithm; graph-based algorithms that assume similar examples have similar labels; etc.).
What happens if instead of 10 labeled examples we start with 0 labeled examples? Make a guess; continue as above; this gives a version of EM.

EM

EM is a class of algorithms used to estimate a probability distribution in the presence of missing attributes. Using it requires an assumption on the underlying probability distribution. The algorithm can be very sensitive to this assumption and to the starting point (that is, the initial guess of parameters). In general, it is known to converge to a local maximum of the likelihood function.

Three Coin Example

We observe a series of coin tosses generated in the following way. A person has three coins:
Coin 0: probability of Head is α
Coin 1: probability of Head is p
Coin 2: probability of Head is q
Consider the following coin-tossing scenarios:

Estimation Problems

Scenario I: Toss one of the coins four times, observing HHTH.
Question: Which coin is more likely to have produced this sequence?

Scenario II: Toss Coin 0. If Head, toss Coin 1; otherwise toss Coin 2. We observe the sequences HHHHT, THTHT, HHHHT, HHTTH produced by Coin 0, Coin 1 and Coin 2.
Question: Estimate the most likely values for p, q (the probability of H for each coin) and the probability of using each of the coins (α).

Scenario III: Toss Coin 0. If Head, toss Coin 1; otherwise toss Coin 2. We observe the sequences HHHT, HTHT, HHHT, HTTH produced by Coin 1 and/or Coin 2, but the outcome of Coin 0 is hidden. (Diagram: Coin 0 selects which coin produces the 1st toss, 2nd toss, ..., nth toss.)
Question: Estimate the most likely values for p, q and α.
There is no known analytical solution to this problem (in the general setting). That is, it is not known how to compute the values of the parameters so as to maximize the likelihood of the data.
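Scenario I is just a likelihood comparison. The sketch below (with made-up values for p and q, since the slides leave them unspecified) computes the likelihood of HHTH under each coin:

```python
# Scenario I: which coin better explains HHTH? (p and q are assumed values.)
def seq_likelihood(seq, p_head):
    like = 1.0
    for toss in seq:
        like *= p_head if toss == "H" else (1.0 - p_head)
    return like

p, q = 0.7, 0.4                   # assumed biases for Coin 1 and Coin 2
print(seq_likelihood("HHTH", p))  # 0.7 * 0.7 * 0.3 * 0.7 = 0.1029
print(seq_likelihood("HHTH", q))  # 0.4 * 0.4 * 0.6 * 0.4 = 0.0384
# Under these parameters, Coin 1 is the more likely source.
```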

Key Intuition (1)

If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which came from Coin 2, there would be no problem: the simple ML estimation applies.
Assume that you toss a (p, 1-p) coin m times and get k Heads and m-k Tails.
log P(D|p) = log [p^k (1-p)^(m-k)] = k log p + (m-k) log(1-p)
To maximize, set the derivative w.r.t. p equal to 0:
d log P(D|p)/dp = k/p - (m-k)/(1-p) = 0
Solving this for p gives: p = k/m

Key Intuition (2)

If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which came from Coin 2, there would be no problem. Instead, use an iterative approach for estimating the parameters:
Guess the probability that a given data point came from Coin 1 or Coin 2; generate fictional labels, weighted according to this probability.
Now, compute the most likely values of the parameters [recall the naive Bayes example].
Compute the likelihood of the data given this model.
Re-estimate the initial parameter setting: set the parameters to maximize the likelihood of the data.
(Labels → Model Parameters → Likelihood of the data)
This process can be iterated and can be shown to converge to a local maximum of the likelihood function.

EM Algorithm (Coins) - I

We will assume (for a minute) that we know the parameters p, q, α and use them to estimate which coin each observed sequence came from (Problem 1). Then, we will use this label estimation of the observed tosses to estimate the most likely parameters, and so on...
Notation: n data points; in each one, m tosses with h heads.
What is the probability that the i-th data point came from Coin 1?
STEP 1 (Expectation Step): (here h = h_i)
P_1^i = P(Coin 1 | D^i) = P(D^i | Coin 1) P(Coin 1) / P(D^i)
      = α p^h (1-p)^(m-h) / [α p^h (1-p)^(m-h) + (1-α) q^h (1-q)^(m-h)]
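As a sanity check, the E step above is a one-line computation. A small sketch (variable names are mine, not the lecture's):

```python
# E step from the formula above: posterior probability that a data point with
# h heads out of m tosses came from Coin 1, given the current parameters.
def e_step(h, m, p, q, alpha):
    like_coin1 = alpha * (p ** h) * ((1 - p) ** (m - h))
    like_coin2 = (1 - alpha) * (q ** h) * ((1 - q) ** (m - h))
    return like_coin1 / (like_coin1 + like_coin2)   # P_1^i

print(e_step(h=3, m=4, p=0.8, q=0.4, alpha=0.5))    # about 0.73 for these guesses
```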

EM Algorithm (Coins) - II

Now, we would like to compute the likelihood of the data and find the parameters that maximize it. We will maximize the log likelihood of the data (n data points):
LL = Σ_{i=1..n} log P(D_i | p, q, α)
But one of the variables - the coin's name - is hidden. We can marginalize:
LL = Σ_{i=1..n} log Σ_{y=0,1} P(D_i, y | p, q, α)
   = Σ_{i=1..n} log Σ_{y=0,1} P(D_i | p, q, α) P(y | D_i, p, q, α)
   = Σ_{i=1..n} log E_y [P(D_i | p, q, α)]
   ≥ Σ_{i=1..n} E_y [log P(D_i | p, q, α)],
where the inequality is due to Jensen's inequality; we maximize a lower bound on the likelihood. The difficulty with the original expression is that the sum is inside the log, making the ML solution hard to compute.
Since the latent variable y is not observed, we cannot use the complete-data log likelihood directly. Instead, we use the expectation of the complete-data log likelihood under the posterior distribution of the latent variable to approximate log P(D | p, q, α). That is, we think of the likelihood log P(D_i | p, q, α) as a random variable that depends on the value y of the coin for the i-th data point. Therefore, instead of maximizing the LL we will maximize the expectation of this random variable (over the coin's name). [Justified using Jensen's inequality, above and in the general setting later.]

EM Algorithm (Coins) - III

We maximize the expectation of this random variable (over the coin's name):
E[LL] = E[Σ_{i=1..n} log P(D_i | p, q, α)] = Σ_{i=1..n} E[log P(D_i | p, q, α)]
      = Σ_{i=1..n} P_1^i log P(D_i, 1 | p, q, α) + (1 - P_1^i) log P(D_i, 0 | p, q, α)
        [- P_1^i log P_1^i - (1 - P_1^i) log(1 - P_1^i)]   (does not matter when we maximize)
This is due to the linearity of the expectation and the random variable definition:
log P(D_i, y | p, q, α) = log P(D_i, 1 | p, q, α) with probability P_1^i
                          log P(D_i, 0 | p, q, α) with probability (1 - P_1^i)

EM Algorithm (Coins) - IV

Explicitly, we get:
E[log P(D | p, q, α)]
  ≈ Σ_i P_1^i log P(1, D_i | p, q, α) + (1 - P_1^i) log P(0, D_i | p, q, α)
  = Σ_i P_1^i log(α p^{h_i} (1-p)^{m-h_i}) + (1 - P_1^i) log((1-α) q^{h_i} (1-q)^{m-h_i})
  = Σ_i P_1^i (log α + h_i log p + (m - h_i) log(1-p)) + (1 - P_1^i)(log(1-α) + h_i log q + (m - h_i) log(1-q))

EM Algorithm (Coins) - V

Finally, to find the most likely parameters α, p, q, we set the derivatives of this expectation to zero.
STEP 2 (Maximization Step): (sanity check: think of the weighted fictional points)
dE/dα = Σ_{i=1..n} [P_1^i/α - (1 - P_1^i)/(1-α)] = 0          ⇒  α = (1/n) Σ_{i=1..n} P_1^i
dE/dp = Σ_{i=1..n} P_1^i [h_i/p - (m - h_i)/(1-p)] = 0        ⇒  p = Σ_i P_1^i h_i / (m Σ_i P_1^i)
dE/dq = Σ_{i=1..n} (1 - P_1^i) [h_i/q - (m - h_i)/(1-q)] = 0  ⇒  q = Σ_i (1 - P_1^i) h_i / (m Σ_i (1 - P_1^i))
When computing the derivatives, notice that P_1^i is a constant here; it was computed using the current parameters in the E step.
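Putting the E and M steps together gives the full EM loop for this model. The sketch below uses the closed-form updates above; the data (heads counts for HHHT, HTHT, HHHT, HTTH) matches Scenario III, while the starting point and iteration count are arbitrary choices:

```python
# EM for the two-coin mixture, combining the E and M steps derived above.
def em_coins(heads, m, p, q, alpha, iters=50):
    n = len(heads)
    for _ in range(iters):
        # E step: P_1^i for every data point
        P1 = []
        for h in heads:
            a = alpha * (p ** h) * ((1 - p) ** (m - h))
            b = (1 - alpha) * (q ** h) * ((1 - q) ** (m - h))
            P1.append(a / (a + b))
        # M step: closed-form updates from the zero-derivative conditions
        alpha = sum(P1) / n
        p = sum(P1[i] * heads[i] for i in range(n)) / (m * sum(P1))
        q = sum((1 - P1[i]) * heads[i] for i in range(n)) / (m * sum(1 - w for w in P1))
    return p, q, alpha

# heads counts for the observed sequences HHHT, HTHT, HHHT, HTTH (m = 4 tosses each)
print(em_coins(heads=[3, 2, 3, 2], m=4, p=0.8, q=0.4, alpha=0.5))
```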

Models with Hidden Variables

EM: General Setting

The EM algorithm is a general purpose algorithm for finding the maximum likelihood estimate in latent variable models. In the E-step, we "fill in" the latent variables using the posterior, and in the M-step, we maximize the expected complete log likelihood, where the expectation is taken with respect to that posterior distribution.
Let D = (x_1, ..., x_N) be the observed data, and let Z denote the hidden random variables. (We are not committing to any particular model.) Let µ be the model parameters. Then
µ* = argmax_µ p(x | µ) = argmax_µ Σ_z p(x, z | µ) = argmax_µ Σ_z p(z | µ) p(x | z, µ)
The term p(x, z | µ) inside the sum is the complete(-data) likelihood; its logarithm is the complete log likelihood.

EM: General Setting (2)

To derive the EM objective function, we re-write the log likelihood by multiplying inside by q(z)/q(z), where q(z) is an arbitrary distribution for the random variable z:
log p(x | µ) = log Σ_z p(x, z | µ) = log Σ_z p(z | µ) p(x | z, µ) q(z)/q(z)
             = log E_q [p(z | µ) p(x | z, µ) / q(z)]
             ≥ E_q log [p(z | µ) p(x | z, µ) / q(z)],
where the inequality is due to Jensen's inequality applied to the concave function log. (Jensen's inequality for convex functions: E[f(x)] ≥ f(E[x]); log is concave, so E[log(x)] ≤ log(E[x]).)
We get the objective:
L(µ, q) = E_q [log p(z | µ)] + E_q [log p(x | z, µ)] - E_q [log q(z)]
The last component is an entropy term; it is also possible to write the objective so that it includes a KL divergence (a distance function between distributions) between q(z) and p(z | x, µ).
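For reference, the KL form mentioned above is the standard decomposition (written here in the same notation; nothing beyond the definitions already on the slide is assumed):
log p(x | µ) = L(µ, q) + KL(q(z) || p(z | x, µ)),   where KL(q || p) = Σ_z q(z) log [q(z)/p(z)] ≥ 0.
Since the KL term is non-negative, L(µ, q) is a lower bound on log p(x | µ), and the bound is tight exactly when q(z) = p(z | x, µ), which is why the E-step sets q to the posterior.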

EM: General Setting (3)

EM now proceeds iteratively, as a coordinate ascent algorithm on L(µ, q), where we choose q = p(z | x, µ). At the t-th step, we have q^(t) and µ^(t).
E-step: update the posterior q, while holding µ^(t) fixed: q^(t+1) = argmax_q L(q, µ^(t)) = p(z | x, µ^(t)).
M-step: update the model parameters to maximize the expected complete log-likelihood function: µ^(t+1) = argmax_µ L(q^(t+1), µ).
(Other q's can be chosen [Samdani & Roth, 2012] to give other EM algorithms. Specifically, you can choose a q that selects the single most likely z in the E-step, and then continue to estimate the parameters; this is called Truncated EM, or Hard EM. Think back to the semi-supervised case.)
To wrap it up, with the right q:
L(µ, q) = E_q log [p(z | µ) p(x | z, µ) / q(z)]
        = Σ_z p(z | x, µ) log [p(x, z | µ) / p(z | x, µ)]
        = Σ_z p(z | x, µ) log [p(x, z | µ) p(x | µ) / p(z, x | µ)]
        = Σ_z p(z | x, µ) log p(x | µ)
        = log p(x | µ) Σ_z p(z | x, µ) = log p(x | µ)
So, by maximizing the objective function, we are also maximizing the log likelihood function.

The General EM Procedure

(Diagram: iterate between the E step and the M step until convergence.)

EM Summary (so far)

EM is a general procedure for learning in the presence of unobserved variables. We have shown how to use it in order to estimate the most likely density function for a mixture of (Bernoulli) distributions.
EM is an iterative algorithm that can be shown to converge to a local maximum of the likelihood function.
It depends on assuming a family of probability distributions. In this sense, it is a family of algorithms; the update rules you derive depend on the model assumed.
It has been shown to be quite useful in practice, when the assumptions made on the probability distribution are correct, but can fail otherwise.

EM Summary (so far)

EM is a general procedure for learning in the presence of unobserved variables. The (family of) probability distributions is known; the problem is to estimate its parameters.
In the presence of hidden variables, we can often think about the problem as a mixture of distributions: the participating distributions are known, and we need to estimate both the parameters of the distributions and the mixture policy.
Our previous example: a mixture of Bernoulli distributions.

Example: K-Means Algorithm

K-means is a clustering algorithm. We are given data points, known to be sampled independently from a mixture of k Normal distributions, with means µ_i, i = 1, ..., k, and the same standard deviation σ. (Figure: the mixture density p(x) as a function of x.)

Example: K-Means Algorithm

First, notice that if we knew that all the data points were taken from a normal distribution with mean µ, finding its most likely value would be easy:
p(x | µ) ∝ exp[-(1/2σ²)(x - µ)²]
Given many data points, D = {x_1, ..., x_m}:
ln L(D | µ) = ln P(D | µ) = Σ_i -(1/2σ²)(x_i - µ)²
Maximizing the log-likelihood is equivalent to minimizing:
µ_ML = argmin_µ Σ_i (x_i - µ)²
Setting the derivative with respect to µ to zero, we get that the minimizing point, that is, the most likely mean, is
µ = (1/m) Σ_i x_i

A Mixture of Distributions

As in the coin example, the problem is that the data is sampled from a mixture of k different normal distributions, and we do not know, for a given data point x_i, from which distribution it was sampled.
Assume that we observe data point x_i; what is the probability that it was sampled from the distribution with mean µ_j?
P_ij = P(µ_j | x_i) = P(x_i | µ_j) P(µ_j) / P(x_i)
     = (1/k) P(x_i | µ_j) / Σ_{n=1..k} (1/k) P(x_i | µ_n)
     = exp[-(1/2σ²)(x_i - µ_j)²] / Σ_{n=1..k} exp[-(1/2σ²)(x_i - µ_n)²]

A Mixture of Distributions

As in the coin example, the problem is that the data is sampled from a mixture of k different normal distributions, and we do not know, for each data point x_i, from which distribution it was sampled.
For a data point x_i, define k binary hidden variables z_i1, z_i2, ..., z_ik, such that z_ij = 1 iff x_i is sampled from the j-th distribution, and 0 otherwise. Then
E[z_ij] = 1 · P(x_i was sampled from µ_j) + 0 · P(x_i was not sampled from µ_j) = P_ij
(Recall: E[Y] = Σ_y y P(Y = y); E[X + Y] = E[X] + E[Y].)

Example: K-Means Algorithms

Expectation: (here the hypothesis is h = (µ_1, µ_2, ..., µ_k))
p(y | h) = p(x_i, z_i1, ..., z_ik | h) ∝ exp[-(1/2σ²) Σ_j z_ij (x_i - µ_j)²]
Computing the likelihood given the observed data D = {x_1, ..., x_m} and the hypothesis h (without the constant coefficient):
ln P(Y | h) = Σ_{i=1..m} -(1/2σ²) Σ_j z_ij (x_i - µ_j)²
E[ln P(Y | h)] = E[Σ_{i=1..m} -(1/2σ²) Σ_j z_ij (x_i - µ_j)²]
               = Σ_{i=1..m} -(1/2σ²) Σ_j E[z_ij] (x_i - µ_j)²

Example: K-Means Algorithms

Maximization: Maximizing
Q(h | h') = Σ_{i=1..m} -(1/2σ²) Σ_j E[z_ij] (x_i - µ_j)²
with respect to µ_j, we get:
dQ/dµ_j = C Σ_{i=1..m} E[z_ij] (x_i - µ_j) = 0
which yields:
µ_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij]

Summary: K-Means Algorithms

Given a set D = {x_1, ..., x_m} of data points:
Guess initial parameters µ_1, µ_2, ..., µ_k.
Compute (for all i, j)
p_ij = E[z_ij] = exp[-(1/2σ²)(x_i - µ_j)²] / Σ_{n=1..k} exp[-(1/2σ²)(x_i - µ_n)²]
and a new set of means:
µ_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij]
Repeat to convergence.
Notice that this algorithm will find the best k means in the sense of minimizing the sum of squared distances.
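A compact sketch of this soft k-means loop (EM for a mixture of k Gaussians with equal mixture weights and a shared, fixed σ); it assumes numpy, and the data below is made up:

```python
import numpy as np

def soft_kmeans(x, k, sigma=1.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)               # shape (m,)
    mu = rng.choice(x, size=k, replace=False)    # initial guess for the k means
    for _ in range(iters):
        # E step: E[z_ij], the responsibility of mean j for point i
        logits = -((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2)   # (m, k)
        logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: means become responsibility-weighted averages
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# two well-separated clusters; the recovered means should be roughly 0 and 5
data = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(5, 1, 100)])
print(soft_kmeans(data, k=2))
```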

Summary: EM

EM is a general procedure for learning in the presence of unobserved variables. We have shown how to use it in order to estimate the most likely density function for a mixture of probability distributions.
EM is an iterative algorithm that can be shown to converge to a local maximum of the likelihood function; thus, it might require many restarts.
It depends on assuming a family of probability distributions. It has been shown to be quite useful in practice, when the assumptions made on the probability distribution are correct, but can fail otherwise.
As an example, we have derived an important clustering algorithm, the k-means algorithm, by using EM to estimate the most likely density function for a mixture of distributions.

More Thoughts about EM

Training: a sample of data points, (x_0, x_1, ..., x_n) ∈ {0,1}^(n+1).
Task: predict the value of x_0, given assignments to all n other variables.

More Thoughts about EM

Assume that a set of data points x ∈ {0,1}^(n+1) is generated as follows:
Postulate a hidden variable Z with k values, 1 ≤ z ≤ k, where value z occurs with probability α_z, Σ_{z=1..k} α_z = 1.
Having randomly chosen a value z for the hidden variable, we choose the value x_i for each observable X_i to be 1 with probability p_i^z and 0 otherwise [i = 0, 1, 2, ..., n].
Training: a sample of data points, (x_0, x_1, ..., x_n) ∈ {0,1}^(n+1).
Task: predict the value of x_0, given assignments to all n other variables.
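To make the generative story concrete, here is a minimal sampler for it (all parameter values below are made up for illustration):

```python
import numpy as np

# Draw z with probability alpha[z], then draw each bit x_i ~ Bernoulli(p[z][i]).
rng = np.random.default_rng(0)
k, n = 2, 4                                  # k hidden values; observables x_0 .. x_n
alpha = np.array([0.6, 0.4])                 # P(Z = z); sums to 1
p = np.array([[0.9, 0.8, 0.1, 0.2, 0.7],     # row 0: P(x_i = 1 | first hidden value)
              [0.1, 0.2, 0.9, 0.8, 0.3]])    # row 1: P(x_i = 1 | second hidden value)

def sample(num):
    zs = rng.choice(k, size=num, p=alpha)                    # hidden value for each example
    return (rng.random((num, n + 1)) < p[zs]).astype(int)    # shape (num, n+1)

print(sample(5))   # five sampled (x_0, ..., x_n) vectors
```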

More Thoughts about EM

Two options:
Parametric: estimate the model using EM. Once a model is known, use it to make predictions. Problem: we cannot use EM directly without an additional assumption on the way the data is generated.
Non-parametric: learn x_0 directly as a function of the other variables. Problem: which function should we try to learn? x_0 turns out to be a linear function of the other variables when k = 2 (what does that mean?).
When k is known, the EM approach performs well; if an incorrect value is assumed, the estimation fails, and the linear methods (e.g., Perceptron) perform better [Grove & Roth 2001].
Another important distinction to attend to is the fact that, once you have estimated all the parameters with EM, you can answer many prediction problems, e.g., p(x_0, x_7, ..., x_8 | x_1, x_2, ..., x_n), while with Perceptron (say) you need to learn a separate model for each prediction problem.

EM

The EM Algorithm

Algorithm:
Guess initial values for the hypothesis h = (µ_1, µ_2, ..., µ_k).
Expectation: Calculate Q(h', h) = E[log P(Y | h') | h, X] using the current hypothesis h and the observed data X.
Maximization: Replace the current hypothesis h by the h' that maximizes the Q function (the expected log likelihood): set h = h' such that Q(h', h) is maximal.
Repeat: estimate the Expectation again.