Notes on Machine Learning for 16.410 and 16.413
(Notes adapted from Tom Mitchell and Andrew Moore.)

Choosing Hypotheses

Generally we want the most probable hypothesis given the training data: the maximum a posteriori (MAP) hypothesis h_MAP. By Bayes' rule,

  P(h|D) = P(D|h) P(h) / P(D)

  h_MAP = argmax_{h in H} P(h|D)
        = argmax_{h in H} P(D|h) P(h) / P(D)
        = argmax_{h in H} P(D|h) P(h)

(the denominator P(D) is the same for every h, so it can be dropped from the argmax). If we assume P(h_i) = P(h_j) for all i, j, then we can instead choose the maximum likelihood (ML) hypothesis

  h_ML = argmax_{h_i in H} P(D|h_i)

Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability

     P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis h_MAP with the highest posterior probability

     h_MAP = argmax_{h in H} P(h|D)

Relation to Concept Learning

Consider our usual concept learning task: instance space X, hypothesis space H, training examples D, and the List-then-Eliminate learning algorithm (which outputs the set of hypotheses in the version space VS_{H,D}). Choose

  P(D|h) = 1 if h is consistent with D, and 0 otherwise

and choose P(h) to be the uniform distribution, P(h) = 1/|H| for all h in H.
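The brute-force MAP learner above can be sketched directly. A minimal example, using a hypothetical space of threshold classifiers on the integers 0..10 with the consistent/uniform choices from the concept-learning setup (all names here are illustrative, not from the notes):

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis: argmax_h P(D|h) P(h).

    P(D) is the same for every h, so it can be dropped from the argmax.
    """
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

# Toy concept-learning task: hypothesis h classifies x as positive iff x >= h.
hypotheses = list(range(11))
prior = lambda h: 1.0 / len(hypotheses)      # uniform P(h) = 1/|H|

def likelihood(data, h):
    # P(D|h) = 1 if h is consistent with every example, else 0
    return 1.0 if all((x >= h) == label for x, label in data) else 0.0

data = [(2, False), (7, True), (5, True)]
h_map = brute_force_map(hypotheses, prior, likelihood, data)
```

With this data every threshold in {3, 4, 5} is consistent, so each gets the same (maximal) posterior, exactly the version-space behavior described above.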
Then

  P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise.

Maximum-Likelihood Parameter Learning

If the prior over hypotheses is uniform, then there is a standard method for maximum-likelihood parameter learning:

1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.

(By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize.) To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the resulting expression to zero.

Discrete models: assume the data set d consists of N independent draws from a binomial population with unknown parameter θ, that is,

  P(d|θ) = ∏_{j=1}^N P(d_j|θ) = θ^c (1−θ)^ℓ

where c instances have positive label, ℓ = N − c have negative label, and we want to learn θ. Then

  L(d|θ) = log P(d|θ) = Σ_{j=1}^N log P(d_j|θ) = c log θ + ℓ log(1−θ)

  dL(d|θ)/dθ = c/θ − ℓ/(1−θ) = 0   ⟹   θ = c/(c+ℓ) = c/N

Minimum Description Length Principle

MDL: prefer the hypothesis h that minimizes the combined description length of the hypothesis and of the data given the hypothesis:

  h_MAP = argmax_h P(D|h) P(h)
        = argmax_h [log_2 P(D|h) + log_2 P(h)]
        = argmin_h [−log_2 P(D|h) − log_2 P(h)]
        = h_MDL

Here −log_2 P(h) is the length of h under an optimal code, and −log_2 P(D|h) is the length of D given h under an optimal code.
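The binomial result θ = c/N can be checked numerically: the closed-form estimate beats every other value of θ on a fine grid of candidates. A small sketch with made-up counts:

```python
import math

def log_likelihood(theta, c, l):
    # L(d|theta) = c log(theta) + l log(1 - theta)
    return c * math.log(theta) + l * math.log(1 - theta)

c, l = 7, 3                      # 7 positive and 3 negative examples (illustrative)
theta_ml = c / (c + l)           # closed form from the derivation: theta = c/N

# Grid search over (0, 1) agrees with the closed form:
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, c, l))
```

Both routes give θ = 0.7, i.e., the empirical fraction of positive labels.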
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP). Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification.

Bayes Optimal Classifier

  argmax_{v_j in V} Σ_{h_i in H} P(v_j|h_i) P(h_i|D)

Gibbs Classifier

The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses. Gibbs algorithm:

1. Choose one hypothesis at random, according to P(h|D).
2. Use this to classify the new instance.

Surprising fact: assume target concepts are drawn at random from H according to the priors on H. Then:

  E[error_Gibbs] ≤ 2 E[error_BayesOptimal]

So, given a correct, uniform prior distribution over H: pick any hypothesis from the version space with uniform probability, and its expected error is no worse than twice that of the Bayes optimal classifier.
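The two-step Gibbs algorithm is short enough to sketch. A minimal version, assuming (as an illustration, not from the notes) a version space of threshold classifiers with the uniform posterior described above:

```python
import random

def gibbs_classify(x, hypotheses, posterior, rng=random):
    """Gibbs algorithm: sample one h ~ P(h|D), then classify x with it."""
    weights = [posterior(h) for h in hypotheses]
    h = rng.choices(hypotheses, weights=weights, k=1)[0]
    return h(x)

# Hypothetical version space: classifiers h_t(x) = (x >= t) for t in {3, 4, 5},
# with uniform posterior P(h|D) = 1/|VS| as in the notes.
version_space = [lambda x, t=t: x >= t for t in (3, 4, 5)]
posterior = lambda h: 1.0 / len(version_space)

rng = random.Random(0)
votes = [gibbs_classify(6, version_space, posterior, rng) for _ in range(100)]
```

Here every hypothesis in the version space classifies x = 6 as positive, so the sampled classifier always agrees; on instances where the version space disagrees, the Gibbs prediction is random with the stated 2x error bound.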
Learning a Real-Valued Function

[Figure: target function f, maximum likelihood hypothesis h_ML, and noisy training examples with error e.]

Consider any real-valued target function f, with training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:

  d_i = f(x_i) + e_i

and e_i is a random variable (noise) drawn independently for each x_i according to a Gaussian distribution with mean 0 and variance σ². Then

  h_ML = argmax_h p(D|h)
       = argmax_h ∏_{i=1}^m p(d_i|h)
       = argmax_h ∏_{i=1}^m (1/√(2πσ²)) e^{−(1/2)((d_i − h(x_i))/σ)²}

Maximize the natural log of this instead:

  h_ML = argmax_h Σ_{i=1}^m [ ln(1/√(2πσ²)) − (1/2)((d_i − h(x_i))/σ)² ]
       = argmax_h Σ_{i=1}^m −((d_i − h(x_i))/σ)²
       = argmax_h Σ_{i=1}^m −(d_i − h(x_i))²
       = argmin_h Σ_{i=1}^m (d_i − h(x_i))²

So the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors.
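The equivalence above (Gaussian log-likelihood and sum of squared errors pick the same hypothesis) can be demonstrated on synthetic data. A sketch, assuming a made-up target f(x) = 2x and a hypothesis class of pure slopes h(x) = w·x:

```python
import math
import random

rng = random.Random(0)
# Noisy samples of f(x) = 2x, Gaussian noise with mean 0, sigma = 0.5
xs = [i / 10 for i in range(20)]
ds = [2 * x + rng.gauss(0, 0.5) for x in xs]

def sse(w, xs, ds):
    # sum of squared errors of h(x) = w * x
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_lik(w, xs, ds, sigma=0.5):
    # Gaussian log-likelihood of the data under h(x) = w * x
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (d - w * x) ** 2 / (2 * sigma**2) for x, d in zip(xs, ds))

grid = [i / 100 for i in range(100, 300)]     # candidate slopes 1.00 .. 2.99
w_sse = min(grid, key=lambda w: sse(w, xs, ds))
w_ml = max(grid, key=lambda w: log_lik(w, xs, ds))
```

Since the log-likelihood is a constant minus SSE/(2σ²), the minimizer of SSE and the maximizer of the likelihood coincide, and both land near the true slope 2.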
Naive Bayes Classifier

Assume the data has attributes a_1, a_2, ..., a_n and possible labels v_j. The Naive Bayes assumption:

  P(a_1, a_2, ..., a_n | v_j) = ∏_i P(a_i|v_j)

Assume target function f : X → V, where each instance x is described by attributes a_1, a_2, ..., a_n. The most probable value of f(x) is

  v_MAP = argmax_{v_j} P(v_j | a_1, a_2, ..., a_n)

which gives

  v_MAP = argmax_{v_j} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)
        = argmax_{v_j} P(a_1, a_2, ..., a_n | v_j) P(v_j)

Naive Bayes classifier:

  v_NB = argmax_{v_j} P(v_j) ∏_i P(a_i|v_j)

Naive Bayes Algorithm

Naive_Bayes_Learn(examples):
  For each target value v_j:
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a:
      P̂(a_i|v_j) ← estimate P(a_i|v_j)

Naive Bayes: Subtleties

1. The conditional independence assumption P(a_1, ..., a_n | v_j) = ∏_i P(a_i|v_j) is often violated ... but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(v_j|x) to be correct; we need only that

  argmax_{v_j} P̂(v_j) ∏_i P̂(a_i|v_j) = argmax_{v_j} P(v_j) P(a_1, ..., a_n | v_j)

2. Naive Bayes posteriors are often unrealistically close to 1 or 0.

3. What if none of the training instances with target value v_j have attribute value a_i? Then P̂(a_i|v_j) = 0, and P̂(v_j) ∏_i P̂(a_i|v_j) = 0. The typical solution is a Bayesian estimate for P̂(a_i|v_j):

  P̂(a_i|v_j) ← (n_c + m·p) / (n + m)

where n is the number of training examples for which v = v_j, n_c is the number of examples for which v = v_j and a = a_i, p is a prior estimate for P̂(a_i|v_j), and m is the weight given to the prior (i.e., the number of virtual examples).
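The learner, classifier, and the zero-count fix can all be sketched together. A minimal version with a tiny made-up weather-style data set (attribute names and data are illustrative only):

```python
from collections import Counter, defaultdict

def train_nb(examples, m=1.0):
    """examples: list of (attribute_tuple, label).

    Uses the m-estimate (n_c + m*p)/(n + m), with p = 1/#values for each
    attribute, so unseen attribute values never zero out the product.
    """
    label_counts = Counter(label for _, label in examples)
    n_attrs = len(examples[0][0])
    value_sets = [set(x[i] for x, _ in examples) for i in range(n_attrs)]
    attr_counts = defaultdict(Counter)
    for x, label in examples:
        for i, a in enumerate(x):
            attr_counts[(i, label)][a] += 1

    def classify(x):
        # v_NB = argmax_v P(v) * prod_i P(a_i | v)
        best, best_score = None, float("-inf")
        for v, n in label_counts.items():
            score = n / len(examples)
            for i, a in enumerate(x):
                p = 1.0 / len(value_sets[i])
                score *= (attr_counts[(i, v)][a] + m * p) / (n + m)
            if score > best_score:
                best, best_score = v, score
        return best

    return classify

# Hypothetical data: (Outlook, Wind) -> Play?
data = [(("sunny", "weak"), "yes"), (("sunny", "strong"), "no"),
        (("rain", "weak"), "yes"), (("rain", "strong"), "no"),
        (("sunny", "weak"), "yes")]
classify = train_nb(data)
```

Note that ("rain", "strong") is classified "no" even though P̂(strong|yes) would be zero without smoothing; the m-estimate keeps the product finite for every label.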
Bayesian Belief Networks

Recall that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

  (∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write P(X|Y, Z) = P(X|Z). Naive Bayes uses conditional independence to justify

  P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)

Bayesian Belief Network

[Figure: network with nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, with the conditional probability table for Campfire given Storm (S) and BusTourGroup (B):]

          S,B    S,¬B   ¬S,B   ¬S,¬B
   C      0.4    0.1    0.8    0.2
  ¬C      0.6    0.9    0.2    0.8

The network represents a set of conditional independence assertions: each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. The network uses these independence assertions to represent the joint probability distribution, e.g., P(Storm, BusTourGroup, ..., ForestFire), over all variables compactly:

  P(y_1, ..., y_n) = ∏_{i=1}^n P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. So the joint distribution is fully defined by the graph, plus the tables P(y_i | Parents(Y_i)).

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others? The Bayes net contains all the information needed for this inference. If only one variable has an unknown value, it is easy to infer it.
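The factorization P(y_1, ..., y_n) = ∏_i P(y_i | Parents(Y_i)) can be evaluated directly from the tables. A sketch over the Storm/BusTourGroup/Campfire fragment, using the Campfire CPT from the figure; the 0.5/0.5 priors on the root nodes are assumed for illustration, not taken from the notes:

```python
# Root priors (assumed, not given in the notes)
P_storm = {True: 0.5, False: 0.5}
P_bus = {True: 0.5, False: 0.5}

# P(Campfire = T | Storm, BusTourGroup), from the figure's CPT
P_campfire = {
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm, BusTourGroup, Campfire) = prod_i P(y_i | Parents(Y_i))."""
    p_c = P_campfire[(storm, bus)]
    if not campfire:
        p_c = 1 - p_c
    return P_storm[storm] * P_bus[bus] * p_c

# The eight joint entries form a valid distribution (they sum to 1):
total = sum(joint(s, b, c) for s in (True, False)
            for b in (True, False) for c in (True, False))
```

For example, P(Storm=T, BusTourGroup=T, Campfire=T) = 0.5 · 0.5 · 0.4 = 0.1 under the assumed priors.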
In the general case, the problem is NP-hard. In practice, inference can succeed in many cases: exact inference methods work well for some network structures, and Monte Carlo methods simulate the network randomly to calculate approximate solutions.

Learning of Bayesian Networks

There are several variants of this learning task: the network structure might be known or unknown, and the training examples might provide values of all network variables, or just some. If the structure is known and we observe all variables, then it is as easy as training a Naive Bayes classifier.

Gradient Ascent for Bayes Nets

Suppose the structure is known and the variables are partially observable; e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning or Campfire. Let w_ijk denote one entry in the conditional probability table for variable Y_i in the network:

  w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of values)

e.g., if Y_i = Campfire, then u_ik might be ⟨Storm = T, BusTourGroup = F⟩. Perform gradient ascent by repeatedly:

1. Updating all w_ijk using the training data D:

     w_ijk ← w_ijk + η Σ_{d in D} P_h(y_ij, u_ik | d) / w_ijk

2. Then renormalizing the w_ijk to ensure Σ_j w_ijk = 1 and 0 ≤ w_ijk ≤ 1.
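One update-and-renormalize step can be sketched for a single CPT. The expected-count values below are made up purely for illustration; in a real run they would come from inference over the training data:

```python
def gradient_ascent_step(w, expected_counts, eta=0.1):
    """One step of the update w_ijk += eta * sum_d P_h(y_ij, u_ik | d) / w_ijk,
    followed by renormalization so each row (one parent context u_ik) sums to 1.

    w[k][j] = P(Y_i = y_ij | Parents(Y_i) = u_ik)
    expected_counts[k][j] = sum over d of P_h(y_ij, u_ik | d)
    """
    new_w = [[wj + eta * cj / wj for wj, cj in zip(row, crow)]
             for row, crow in zip(w, expected_counts)]
    # renormalize within each conditioning context u_ik
    return [[wj / sum(row) for wj in row] for row in new_w]

# Hypothetical CPT: binary variable, two parent contexts.
w = [[0.5, 0.5], [0.3, 0.7]]
expected_counts = [[0.8, 0.2], [0.1, 0.9]]   # illustrative numbers only
w2 = gradient_ascent_step(w, expected_counts)
```

After the step each row still sums to 1, and probability mass has shifted toward the values with larger expected counts, as the gradient direction dictates.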
Unsupervised Learning: Expectation Maximization (EM)

[Figure: a density p(x) over x formed by a mixture of k Gaussians.]

Each instance x is generated by:

1. Choosing one of the k Gaussians with uniform probability.
2. Generating an instance at random according to that Gaussian.

EM for Estimating k Means

Given: instances from X generated by a mixture of k Gaussian distributions, with unknown means μ_1, ..., μ_k of the k Gaussians, and we don't know which instance x_i was generated by which Gaussian.

Determine: maximum likelihood estimates of μ_1, ..., μ_k.

Think of the full description of each instance (for k = 2) as y_i = ⟨x_i, z_i1, z_i2⟩, where z_ij is 1 if x_i was generated by the jth Gaussian; x_i is observable, z_ij is unobservable.

EM Algorithm: pick a random initial h = ⟨μ_1, μ_2⟩, then iterate:

E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨μ_1, μ_2⟩ holds:

  E[z_ij] = p(x = x_i | μ = μ_j) / Σ_{n=1}^2 p(x = x_i | μ = μ_n)
          = e^{−(x_i − μ_j)²/(2σ²)} / Σ_{n=1}^2 e^{−(x_i − μ_n)²/(2σ²)}
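The E step formula translates directly into code. A minimal sketch with an assumed σ = 1 and a small made-up data set:

```python
import math

def e_step(xs, mus, sigma=1.0):
    """E[z_ij] = exp(-(x_i - mu_j)^2 / (2 sigma^2)), normalized over j."""
    resp = []
    for x in xs:
        ws = [math.exp(-(x - mu) ** 2 / (2 * sigma**2)) for mu in mus]
        total = sum(ws)
        resp.append([w / total for w in ws])
    return resp

xs = [0.0, 0.2, 3.8, 4.0]          # two clusters near 0 and near 4
resp = e_step(xs, [0.1, 3.9])
```

Each row of `resp` sums to 1, and points near a mean get nearly all of their responsibility from that Gaussian.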
M step: Calculate a new maximum likelihood hypothesis h' = ⟨μ'_1, μ'_2⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above:

  μ_j ← Σ_{i=1}^m E[z_ij] x_i / Σ_{i=1}^m E[z_ij]

Then replace h = ⟨μ_1, μ_2⟩ by h' = ⟨μ'_1, μ'_2⟩.

EM Algorithm

EM converges to a local maximum likelihood h and provides estimates of the hidden variables z_ij. In fact, the local maximum is of E[ln P(Y|h)], where Y is the complete (observable plus unobservable) data; the expected value is taken over the possible values of the unobserved variables in Y.

General EM Problem

Given: observed data X = {x_1, ..., x_m}, unobserved data Z = {z_1, ..., z_m}, and a parameterized probability distribution P(Y|h), where Y = {y_1, ..., y_m} is the full data, y_i = x_i ∪ z_i, and h are the parameters.

Determine: the h that (locally) maximizes E[ln P(Y|h)].

General EM Method

Define a likelihood function Q(h'|h) which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z:

  Q(h'|h) ← E[ln P(Y|h') | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

  Q(h'|h) ← E[ln P(Y|h') | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:

  h ← argmax_{h'} Q(h'|h)
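Combining the E and M steps gives the full two-means EM loop. A sketch, assuming σ = 1, k = 2, and made-up data with clusters near 0 and 5:

```python
import math

def em_two_means(xs, mu, sigma=1.0, iters=20):
    """EM for k = 2 means: alternate E[z_ij] with the M step
    mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]."""
    mu = list(mu)
    for _ in range(iters):
        # E step: responsibilities under the current hypothesis
        resp = []
        for x in xs:
            ws = [math.exp(-(x - m) ** 2 / (2 * sigma**2)) for m in mu]
            s = sum(ws)
            resp.append([w / s for w in ws])
        # M step: responsibility-weighted means
        mu = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
              for j in range(2)]
    return mu

xs = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
mu = em_two_means(xs, (1.0, 4.0))
```

From the initial guess ⟨1.0, 4.0⟩ the means move to (a local maximum near) the true cluster centers 0 and 5, illustrating the convergence claim above.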