Notes on Machine Learning for 16.410 and 16.413

(Notes adapted from Tom Mitchell and Andrew Moore.)

Choosing Hypotheses

Generally we want the most probable hypothesis given the training data: the maximum a posteriori (MAP) hypothesis $h_{MAP}$. By Bayes' theorem,

$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$

$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$

If we assume $P(h_i) = P(h_j)$ for all $i, j$, then we can instead choose the maximum likelihood (ML) hypothesis

$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$

Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
   $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$
2. Output the hypothesis $h_{MAP}$ with the highest posterior probability
   $h_{MAP} = \arg\max_{h \in H} P(h \mid D)$
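As a concrete illustration of this brute-force procedure, here is a minimal Python sketch (the threshold hypotheses and training data are invented for the example) that scores every hypothesis in a small finite H by $P(D \mid h)P(h)$, normalizes by P(D), and returns the hypothesis with the highest posterior. Here $P(D \mid h)$ is taken to be 1 when h is consistent with D and 0 otherwise, as in the concept-learning setting discussed next.

```python
# Brute-force MAP learner over a small finite hypothesis space.
# P(D|h) is 1 if h labels every training example correctly, 0 otherwise.

def brute_force_map(hypotheses, prior, data):
    """hypotheses: name -> predicate, prior: name -> P(h),
    data: list of (x, label). Returns (h_MAP, posterior dict)."""
    unnormalized = {}
    for name, h in hypotheses.items():
        likelihood = 1.0 if all(h(x) == y for x, y in data) else 0.0
        unnormalized[name] = likelihood * prior[name]
    p_data = sum(unnormalized.values())            # P(D) = sum_h P(D|h) P(h)
    posterior = {name: v / p_data for name, v in unnormalized.items()}
    h_map = max(posterior, key=posterior.get)      # argmax_h P(h|D)
    return h_map, posterior

# Toy hypothesis space of threshold rules over integers (invented example).
H = {"x>2": lambda x: x > 2, "x>4": lambda x: x > 4, "x>6": lambda x: x > 6}
prior = {name: 1.0 / len(H) for name in H}         # uniform P(h) = 1/|H|
D = [(1, False), (3, True), (5, True)]             # only "x>2" is consistent
print(brute_force_map(H, prior, D))                # -> ('x>2', ...)
```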

Relation to Concept Learning

Consider our usual concept learning task: an instance space X, a hypothesis space H, training examples D, and the List-then-Eliminate learning algorithm (which outputs the set of hypotheses in the version space $VS_{H,D}$).

Choose $P(D \mid h) = 1$ if h is consistent with D, and $P(D \mid h) = 0$ otherwise. Choose P(h) to be the uniform distribution $P(h) = \frac{1}{|H|}$ for all h in H. Then

$P(h \mid D) = \frac{1}{|VS_{H,D}|}$ if h is consistent with D, and 0 otherwise.

Maximum-Likelihood Parameter Learning

If the prior over hypotheses is uniform, there is a standard method for maximum-likelihood parameter learning:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.

(By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize.) To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the resulting expression to zero.

Discrete models: assume the data set d consists of N independent draws from a binomial population with unknown parameter θ, where c instances have a positive label and l = N - c have a negative label, and we want to learn θ:

$P(d \mid \theta) = \prod_{j=1}^{N} P(d_j \mid \theta) = \theta^{c}(1-\theta)^{l}$

$L(d \mid \theta) = \log P(d \mid \theta) = \sum_{j=1}^{N} \log P(d_j \mid \theta) = c \log\theta + l \log(1-\theta)$

$\frac{dL(d \mid \theta)}{d\theta} = \frac{c}{\theta} - \frac{l}{1-\theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c+l} = \frac{c}{N}$

Minimum Description Length Principle

MDL: prefer the hypothesis h that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis:

$h_{MDL} = \arg\max_{h \in H} P(D \mid h)\,P(h) = \arg\max_{h \in H} \left[\log_2 P(D \mid h) + \log_2 P(h)\right] = \arg\min_{h \in H} \left[-\log_2 P(D \mid h) - \log_2 P(h)\right]$

where $-\log_2 P(h)$ is the length of h under an optimal code, and $-\log_2 P(D \mid h)$ is the length of D given h under an optimal code.
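Returning to the binomial example above, here is a small sketch (the counts are made up) that compares the closed-form estimate θ = c/N with a brute-force grid maximization of the log likelihood $c \log\theta + l \log(1-\theta)$; the two agree up to the grid resolution.

```python
# ML estimate of the binomial parameter theta, checked against a numeric
# maximization of the log likelihood L(d|theta) = c*log(theta) + l*log(1-theta).
import math

def log_likelihood(theta, c, l):
    return c * math.log(theta) + l * math.log(1.0 - theta)

c, l = 7, 3                      # 7 positive labels, 3 negative labels (N = 10)
closed_form = c / (c + l)        # theta = c / N from setting dL/dtheta = 0

# Numerical check: evaluate L on a fine grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
numeric = max(grid, key=lambda t: log_likelihood(t, c, l))

print(f"closed form: {closed_form:.3f}, grid maximum: {numeric:.3f}")
```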

Most Probable Classification of New Instances

So far we have sought the most probable hypothesis given the data D (i.e., $h_{MAP}$). Given a new instance x, what is its most probable classification? Note that $h_{MAP}(x)$ is not necessarily the most probable classification.

Bayes Optimal Classifier

$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$

Gibbs Classifier

The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses. Gibbs algorithm:
1. Choose one hypothesis at random, according to $P(h \mid D)$.
2. Use this hypothesis to classify the new instance.

Surprising fact: assume target concepts are drawn at random from H according to the priors on H. Then
$E[error_{Gibbs}] \le 2\,E[error_{BayesOptimal}]$

So, given a correct, uniform prior distribution over H, picking any hypothesis from the version space with uniform probability has expected error no worse than twice that of the Bayes optimal classifier.
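A minimal sketch of both classifiers over a toy posterior (the three hypotheses, their posteriors, and their predictions are invented): the Bayes optimal classifier combines the posterior-weighted votes of all hypotheses, while the Gibbs classifier samples a single hypothesis according to $P(h \mid D)$. The example also shows why $h_{MAP}(x)$ need not be the most probable classification.

```python
# Bayes optimal classification vs. the Gibbs classifier (toy sketch).
import random

# Three hypotheses with posteriors P(h|D) and their predicted label for a
# fixed new instance x (all values invented for illustration).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

# Bayes optimal: weight each hypothesis's vote by its posterior.
votes = {}
for h, p in posterior.items():
    votes[prediction[h]] = votes.get(prediction[h], 0.0) + p
bayes_optimal = max(votes, key=votes.get)   # "-" wins with weight 0.6,
                                            # even though h_MAP = h1 says "+"

# Gibbs: sample one hypothesis according to P(h|D) and use its prediction.
sampled_h = random.choices(list(posterior), weights=list(posterior.values()))[0]
gibbs = prediction[sampled_h]

print("Bayes optimal:", bayes_optimal, " Gibbs:", gibbs)
```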

Learning a Real-Valued Function

[Figure: noisy training examples of a real-valued target function f, together with the maximum likelihood hypothesis $h_{ML}$.]

Consider any real-valued target function f. The training examples are pairs $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value:
$d_i = f(x_i) + e_i$
and $e_i$ is a random noise variable drawn independently for each $x_i$ from a Gaussian distribution with mean 0.

$h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$

Maximize the natural log of this instead:

$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[\ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2\right] = \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2 = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors.
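As a sanity check on this equivalence, the sketch below fits the slope of a one-parameter model $d \approx w x$ to invented noisy data in two ways: a closed-form least-squares estimate and a direct grid search over the Gaussian log likelihood (with σ = 1). The two estimates coincide up to the grid resolution.

```python
# Sketch: with Gaussian noise, maximizing the data likelihood is the same as
# minimizing the sum of squared errors (data and model are made up).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
d = 2.0 * x + rng.normal(0.0, 1.0, size=x.shape)   # d_i = f(x_i) + e_i

# Least-squares estimate of the slope (closed form for this 1-parameter model).
w_lsq = np.sum(x * d) / np.sum(x * x)

# Direct search over candidate slopes, maximizing the Gaussian log likelihood
# sum_i [ -0.5 * ((d_i - w*x_i)/sigma)^2 ]  (constants dropped, sigma = 1).
candidates = np.linspace(0.0, 4.0, 4001)
log_lik = [-0.5 * np.sum((d - w * x) ** 2) for w in candidates]
w_ml = candidates[int(np.argmax(log_lik))]

print(f"least squares: {w_lsq:.3f}, max likelihood: {w_ml:.3f}")  # ~equal
```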

Naive Bayes Classifier

Assume the data has attributes $a_1, a_2, \ldots, a_n$ and possible labels $v_j$. The Naive Bayes assumption:
$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$

Assume a target function $f : X \to V$, where each instance x is described by attributes $a_1, a_2, \ldots, a_n$. The most probable value of f(x) is
$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$
which gives
$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$

Naive Bayes classifier:
$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$

Naive Bayes Algorithm

Naive_Bayes_Learn(examples):
  For each target value $v_j$: $\hat{P}(v_j) \leftarrow$ estimate $P(v_j)$
  For each attribute value $a_i$ of each attribute a: $\hat{P}(a_i \mid v_j) \leftarrow$ estimate $P(a_i \mid v_j)$

Naive Bayes: Subtleties

1. The conditional independence assumption is often violated:
$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
...but it works surprisingly well anyway. Note that we don't need the estimated posteriors $\hat{P}(v_j \mid x)$ to be correct; we need only that
$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n \mid v_j)$

2. Naive Bayes posteriors are often unrealistically close to 1 or 0.

3. What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then $\hat{P}(a_i \mid v_j) = 0$, and so $\hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = 0$.

The typical solution is a Bayesian estimate for $\hat{P}(a_i \mid v_j)$:
$\hat{P}(a_i \mid v_j) \leftarrow \frac{n_c + m p}{n + m}$
where n is the number of training examples for which $v = v_j$, $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$, p is a prior estimate for $\hat{P}(a_i \mid v_j)$, and m is the weight given to the prior (i.e., the number of virtual examples).
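A compact sketch of such a learner for categorical attributes, using the m-estimate above with p = 1/k (k the number of values of the attribute) and m = 1; the tiny dataset and attribute names are invented.

```python
# Minimal Naive Bayes learner with the m-estimate (n_c + m*p) / (n + m).
from collections import Counter, defaultdict

def train_naive_bayes(examples, m=1.0):
    """examples: list of (attribute_dict, label). Returns (priors, cond)."""
    labels = Counter(label for _, label in examples)
    priors = {v: n / len(examples) for v, n in labels.items()}
    # Collect the possible values of each attribute to set the prior p = 1/k.
    values = defaultdict(set)
    for attrs, _ in examples:
        for a, val in attrs.items():
            values[a].add(val)
    cond = {}   # cond[(attribute, value, label)] = P_hat(a_i | v_j)
    for v_j, n in labels.items():
        for a, vals in values.items():
            p = 1.0 / len(vals)
            for val in vals:
                n_c = sum(1 for attrs, lab in examples
                          if lab == v_j and attrs[a] == val)
                cond[(a, val, v_j)] = (n_c + m * p) / (n + m)
    return priors, cond

def classify(priors, cond, attrs):
    def score(v_j):
        s = priors[v_j]                      # P_hat(v_j)
        for a, val in attrs.items():
            s *= cond[(a, val, v_j)]         # times prod_i P_hat(a_i | v_j)
        return s
    return max(priors, key=score)

data = [({"sky": "sunny", "wind": "strong"}, "play"),
        ({"sky": "sunny", "wind": "weak"}, "play"),
        ({"sky": "rainy", "wind": "strong"}, "stay")]
priors, cond = train_naive_bayes(data)
print(classify(priors, cond, {"sky": "rainy", "wind": "weak"}))
```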

Bayesian Belief Networks

Recall that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
$(\forall x_i, y_j, z_k)\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$
More compactly, we write $P(X \mid Y, Z) = P(X \mid Z)$.

Naive Bayes uses conditional independence to justify
$P(X, Y \mid Z) = P(X \mid Y, Z)\,P(Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$

Bayesian Belief Network

[Figure: a Bayesian belief network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, with the conditional probability table for Campfire given Storm (S) and BusTourGroup (B):]

              S,B    S,¬B   ¬S,B   ¬S,¬B
  Campfire    0.4    0.1    0.8    0.2
  ¬Campfire   0.6    0.9    0.2    0.8

The network represents a set of conditional independence assertions: each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. The network uses these independence assertions to represent the joint probability distribution, e.g., P(Storm, BusTourGroup, ..., ForestFire), over all variables compactly:
$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$
where $Parents(Y_i)$ denotes the immediate predecessors of $Y_i$ in the graph. So the joint distribution is fully defined by the graph plus the tables $P(y_i \mid Parents(Y_i))$.
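To make the factorization concrete, here is a minimal Python sketch restricted to the Storm/BusTourGroup/Campfire fragment of the network. The Campfire table is the one shown above; the priors on Storm and BusTourGroup are invented placeholders just so the example runs.

```python
# Joint probability from a Bayes net as the product of P(y_i | Parents(Y_i)).

parents = {"Storm": [], "BusTourGroup": [], "Campfire": ["Storm", "BusTourGroup"]}

cpt = {
    "Storm":        {(): 0.3},                   # invented prior P(Storm=T)
    "BusTourGroup": {(): 0.5},                   # invented prior P(Bus=T)
    "Campfire": {(True, True): 0.4, (True, False): 0.1,
                 (False, True): 0.8, (False, False): 0.2},  # table above
}

def prob(var, value, assignment):
    """P(var = value | parent values taken from `assignment`)."""
    key = tuple(assignment[p] for p in parents[var])
    p_true = cpt[var][key]
    return p_true if value else 1.0 - p_true

def joint(assignment):
    """P(y_1, ..., y_n) = prod_i P(y_i | Parents(Y_i))."""
    p = 1.0
    for var, value in assignment.items():
        p *= prob(var, value, assignment)
    return p

print(joint({"Storm": True, "BusTourGroup": False, "Campfire": True}))
# = P(Storm) * P(not Bus) * P(Campfire | Storm, not Bus) = 0.3 * 0.5 * 0.1
```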

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others? The Bayes net contains all the information needed for this inference. If only one variable has an unknown value, it is easy to infer it. In the general case, the problem is NP-hard. In practice, inference can succeed in many cases: exact inference methods work well for some network structures, and Monte Carlo methods simulate the network randomly to calculate approximate solutions.

Learning of Bayesian Networks

There are several variants of this learning task: the network structure might be known or unknown, and the training examples might provide values of all network variables, or only some. If the structure is known and we observe all variables, then learning is as easy as training a Naive Bayes classifier.

Gradient Ascent for Bayes Nets

Suppose the structure is known and the variables are partially observable; e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning or Campfire. Let $w_{ijk}$ denote one entry in the conditional probability table for variable $Y_i$ in the network:
$w_{ijk} = P(Y_i = y_{ij} \mid Parents(Y_i) = \text{the list } u_{ik} \text{ of values})$
For example, if $Y_i = Campfire$, then $u_{ik}$ might be $\langle Storm = T, BusTourGroup = F \rangle$.

Perform gradient ascent by repeatedly
1. updating all $w_{ijk}$ using the training data D:
$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$
2. then renormalizing the $w_{ijk}$ to assure that $\sum_j w_{ijk} = 1$ and $0 \le w_{ijk} \le 1$.
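The update-and-renormalize loop can be sketched as below. Computing $P_h(y_{ij}, u_{ik} \mid d)$ requires Bayes-net inference, which is not shown here; the `placeholder_posterior` function is a stand-in that returns made-up values, and the CPT entries in the example are likewise invented.

```python
# Sketch of one gradient-ascent update on CPT entries w_ijk, followed by
# renormalization so each CPT row sums to 1.

def placeholder_posterior(w, i, j, k, d):
    """Stand-in for P_h(Y_i = y_ij, Parents(Y_i) = u_ik | d). Replace with a
    real inference routine; here it just returns fixed dummy values."""
    return 0.3 if j == "T" else 0.2

def gradient_ascent_step(w, data, eta=0.01, posterior=placeholder_posterior):
    # w[i][k][j] is the CPT entry for variable i, parent configuration k,
    # value j, i.e. w_ijk = P(Y_i = y_ij | Parents(Y_i) = u_ik).
    for i in w:
        for k in w[i]:
            # 1. Gradient update for every entry in this CPT row.
            for j in w[i][k]:
                grad = sum(posterior(w, i, j, k, d) / w[i][k][j] for d in data)
                w[i][k][j] += eta * grad
            # 2. Renormalize the row so that sum_j w_ijk = 1.
            total = sum(w[i][k].values())
            for j in w[i][k]:
                w[i][k][j] /= total
    return w

# Tiny made-up CPT for one variable with two values and two parent configs.
w = {"Campfire": {("S=T", "B=F"): {"T": 0.5, "F": 0.5},
                  ("S=F", "B=F"): {"T": 0.5, "F": 0.5}}}
print(gradient_ascent_step(w, data=[None, None]))
```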

Unsupervised Learning: Expectation Maximization (EM)

[Figure: a mixture density p(x) over x generated by k Gaussians.]

Each instance x is generated by
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian

EM for Estimating k Means

Given: instances from X generated by a mixture of k Gaussian distributions, with unknown means $\mu_1, \ldots, \mu_k$ of the k Gaussians; we don't know which instance $x_i$ was generated by which Gaussian.
Determine: maximum likelihood estimates of $\mu_1, \ldots, \mu_k$.

Think of the full description of each instance as $y_i = \langle x_i, z_{i1}, z_{i2} \rangle$, where $z_{ij}$ is 1 if $x_i$ was generated by the jth Gaussian; $x_i$ is observable, $z_{ij}$ is unobservable.

EM Algorithm (illustrated for k = 2): pick a random initial $h = \langle \mu_1, \mu_2 \rangle$, then iterate.

E step: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds:
$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$
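A minimal sketch of this E step for two Gaussians (the data points and means are made up; σ is assumed known and shared, as in the derivation above):

```python
# E step for the two-Gaussian mixture: compute E[z_ij], the probability that
# instance x_i came from Gaussian j, under the current means.
import math

def e_step(xs, mus, sigma=1.0):
    """Return one row [E[z_i1], E[z_i2], ...] per instance x_i."""
    expectations = []
    for x in xs:
        weights = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
        total = sum(weights)
        expectations.append([w / total for w in weights])
    return expectations

xs = [0.1, 0.3, 3.9, 4.2]          # made-up 1-D instances
mus = [0.0, 4.0]                   # current hypothesis h = <mu_1, mu_2>
for x, row in zip(xs, e_step(xs, mus)):
    print(x, [round(r, 3) for r in row])
```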

M step: Calculate a new maximum likelihood hypothesis $h' = \langle \mu_1', \mu_2' \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated above. Then replace $h = \langle \mu_1, \mu_2 \rangle$ by $h' = \langle \mu_1', \mu_2' \rangle$:
$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$

The EM algorithm converges to a local maximum likelihood hypothesis h and provides estimates of the hidden variables $z_{ij}$. In fact, the local maximum is of $E[\ln P(Y \mid h)]$, where Y is the complete (observable plus unobservable variables) data and the expected value is taken over the possible values of the unobserved variables in Y.

General EM Problem

Given: observed data $X = \{x_1, \ldots, x_m\}$, unobserved data $Z = \{z_1, \ldots, z_m\}$, and a parameterized probability distribution $P(Y \mid h)$, where $Y = \{y_1, \ldots, y_m\}$ is the full data with $y_i = x_i \cup z_i$ and h are the parameters.
Determine: the h that (locally) maximizes $E[\ln P(Y \mid h)]$.

General EM Method

Define a likelihood function $Q(h' \mid h)$, which calculates $Y = X \cup Z$ using the observed X and the current parameters h to estimate Z:
$Q(h' \mid h) \leftarrow E[\ln P(Y \mid h') \mid h, X]$

EM Algorithm:
Estimation (E) step: Calculate $Q(h' \mid h)$ using the current hypothesis h and the observed data X to estimate the probability distribution over Y.
$Q(h' \mid h) \leftarrow E[\ln P(Y \mid h') \mid h, X]$
Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function.
$h \leftarrow \arg\max_{h'} Q(h' \mid h)$
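Putting the E and M steps together gives the full two-mean EM loop. This is a sketch on made-up one-dimensional data with unit variance; the initialization (smallest and largest data point) is a simple heuristic rather than anything prescribed by the notes.

```python
# Full EM loop for the two-mean problem: alternate the E step above with the
# M step  mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij].
import math, random

def e_step(xs, mus, sigma=1.0):
    out = []
    for x in xs:
        w = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
        s = sum(w)
        out.append([wi / s for wi in w])
    return out

def m_step(xs, ez):
    k = len(ez[0])
    return [sum(ez[i][j] * xs[i] for i in range(len(xs))) /
            sum(ez[i][j] for i in range(len(xs))) for j in range(k)]

random.seed(0)
# Made-up data: a mixture of two unit-variance Gaussians with means 0 and 4.
xs = ([random.gauss(0, 1) for _ in range(50)] +
      [random.gauss(4, 1) for _ in range(50)])

mus = [min(xs), max(xs)]                     # simple initial hypothesis
for _ in range(25):                          # iterate E and M steps
    ez = e_step(xs, mus)
    mus = m_step(xs, ez)
print([round(mu, 2) for mu in mus])          # should end up near 0 and 4
```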