Introduction to Bayesian Learning. Machine Learning Fall 2018

What we have seen so far. What does it mean to learn? Mistake-driven learning: learning by counting (and bounding) the number of mistakes. PAC learnability: sample complexity and bounds on errors on unseen examples. We saw various learning algorithms and analyzed them under these models of learnability. In all cases, the algorithm outputs a function that produces a label y for a given input x.

Coming up: another way of thinking about what it means to learn, namely Bayesian learning, and different learning algorithms in this regime: Naïve Bayes and logistic regression.

Today's lecture: Bayesian learning; maximum a posteriori and maximum likelihood estimation; two examples of maximum likelihood estimation (the binomial distribution and the normal distribution).

Probabilistic Learning. There are two different notions of probabilistic learning. (1) Learning probabilistic concepts: the learned concept is a function c: X → [0,1], where c(x) may be interpreted as the probability that the label 1 is assigned to x; the learning theory that we have studied before is applicable, with some extensions. (2) Bayesian learning: using a probabilistic criterion for selecting a hypothesis; the hypothesis itself can be deterministic (e.g., a Boolean function), but the criterion for selecting it is probabilistic.

Bayesian Learning: The basics. Goal: to find the best hypothesis from some space H of hypotheses, using the observed data D. Define "best" as the most probable hypothesis in H. In order to do that, we need to assume a probability distribution over the class H, and we also need to know something about the relation between the observed data and the hypotheses. As we will see in the coming lectures, we can be Bayesian about other things too, e.g., the parameters of the model.

Bayesian methods have multiple roles. They provide practical learning algorithms that combine prior knowledge with observed data, guiding the model towards something we already know. They also provide a conceptual framework for evaluating other learners, and tools for analyzing learning.

Bayes Theorem: P(Y | X) = P(X | Y) P(Y) / P(X). This is shorthand for: for all x, y, P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x).

Bayes Theorem. Posterior probability: what is the probability of Y given that X is observed? Likelihood: what is the likelihood of observing X given a specific Y? Prior probability: what is our belief in Y before we see X? In short: Posterior ∝ Likelihood × Prior.
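As a concrete illustration, here is a minimal sketch (not from the slides; the distributions are made-up assumptions) that computes a posterior from a prior and a likelihood table by normalizing over the values of Y:

```python
# Minimal Bayes-rule sketch: compute P(Y | X = x) from an assumed prior P(Y)
# and an assumed likelihood P(X = x | Y). The numbers are illustrative only.
prior = {"y0": 0.7, "y1": 0.3}          # P(Y = y)
likelihood = {"y0": 0.2, "y1": 0.9}     # P(X = x | Y = y) for the observed x

evidence = sum(likelihood[y] * prior[y] for y in prior)               # P(X = x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}   # Bayes rule
print(posterior)   # {'y0': 0.341..., 'y1': 0.658...}
```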

Probability Refresher. Product rule: P(A, B) = P(A | B) P(B) = P(B | A) P(A). Sum rule: P(A ∨ B) = P(A) + P(B) − P(A, B). Events A, B are independent if P(A, B) = P(A) P(B); equivalently, P(A | B) = P(A) and P(B | A) = P(B). Theorem of total probability: for mutually exclusive events A_1, A_2, ..., A_n (i.e., A_i ∩ A_j = ∅) with Σ_i P(A_i) = 1, we have P(B) = Σ_i P(B | A_i) P(A_i).

Bayesian Learning. Given a dataset D, we want to find the best hypothesis h. What does "best" mean? Bayesian learning uses P(h | D), the conditional probability of a hypothesis given the data, to define "best".

Bayesian Learning. Posterior probability: what is the probability that h is the true hypothesis, given that the data D is observed? Key insight: both h and D are events. D is the event that we observed this particular dataset, and h is the event that h is the true hypothesis. So we can apply Bayes rule here.

Bayesian Learning. Posterior probability: what is the probability that h is the true hypothesis, given that the data D is observed? Likelihood: what is the probability that this data point (an example or an entire dataset) is observed, given that the hypothesis is h? We also need P(D): what is the probability that the data D is observed, independent of any knowledge about the hypothesis? Prior probability of h: background knowledge. What do we expect the hypothesis to be even before we see any data? For example, in the absence of any other information, maybe the uniform distribution.
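In symbols, these pieces combine via Bayes rule (a standard restatement of the quantities named above):

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}, \qquad \text{i.e.}\quad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}.$$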

Next: maximum a posteriori and maximum likelihood estimation.

Choosing a hypothesis. Given some data, find the most probable hypothesis: the maximum a posteriori hypothesis h_MAP. The posterior is proportional to likelihood × prior. If we assume that the prior is uniform, i.e., P(h_i) = P(h_j) for all h_i, h_j, this simplifies to the maximum likelihood hypothesis h_ML. It is often computationally easier to maximize the log likelihood.
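Written out explicitly (standard definitions, consistent with the description above):

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h),$$

$$h_{ML} = \arg\max_{h \in H} P(D \mid h) = \arg\max_{h \in H} \log P(D \mid h).$$

The denominator P(D) does not depend on h, so it can be dropped; with a uniform prior, P(h) can be dropped as well, which yields h_ML.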

Brute force MAP learner. Input: data D and a hypothesis set H. 1. Calculate the posterior probability P(h | D) for each h ∈ H. (This is difficult to compute, except for the most simple hypothesis spaces.) 2. Output the hypothesis with the highest posterior probability.
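As a sketch of what the brute-force procedure looks like on a toy problem (the hypothesis space, prior, and data below are illustrative assumptions, not from the lecture):

```python
# Brute-force MAP over a small finite hypothesis space. Each hypothesis is a candidate
# failure probability p for i.i.d. Bernoulli trials; data is a list of 0/1 outcomes (1 = faulty).
import math

data = [1] * 20 + [0] * 80                    # assumed observations: 20 faulty, 80 working
hypotheses = [0.1, 0.2, 0.3, 0.5]             # assumed candidate values of p
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform prior over H

def log_likelihood(p, data):
    # log P(D | p) for i.i.d. Bernoulli trials with failure probability p
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

# log-posterior up to an additive constant: log P(D | h) + log P(h)
log_post = {h: log_likelihood(h, data) + math.log(prior[h]) for h in hypotheses}
h_map = max(log_post, key=log_post.get)
print(h_map)   # 0.2 for this data (with a uniform prior, MAP coincides with ML)
```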

Next: two examples of maximum likelihood estimation, starting with Bernoulli trials.

Maximum Likelihood estimation (MLE). What we need in order to define learning: 1. A hypothesis space H. 2. A model that says how the data D is generated given a hypothesis h.

Example 1: Bernoulli trials. The CEO of a startup hires you for your first consulting job. CEO: "My company makes light bulbs. I need to know the probability that they are faulty." You: "Sure, I can help you out. Are they all identical?" CEO: "Yes!" You: "Excellent. I know how to help. We need to experiment."

Faulty lightbulbs. The experiment: try out 100 lightbulbs; 80 work, 20 don't. You: "The probability is P(failure) = 0.2." CEO: "But how do you know?" You: "Because..."

Bernoulli trials. P(failure) = p, P(success) = 1 − p. Each trial is i.i.d.: independent and identically distributed. You have seen D = {80 work, 20 don't}. What is the most likely value of p for this observation? The likelihood is P(D | p) = C(100, 80) p^20 (1 − p)^80, and the most likely value is argmax_p P(D | p) = argmax_p C(100, 80) p^20 (1 − p)^80.
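A quick numerical sanity check (a sketch, assuming the Bernoulli model above): evaluate the log-likelihood of 20 failures and 80 successes on a grid of candidate values of p and take the best one.

```python
# Numerical check of the Bernoulli MLE. The binomial coefficient is constant in p,
# so it is omitted from the log-likelihood.
import math

failures, successes = 20, 80

def loglik(p):
    # log P(D | p), dropping the constant binomial coefficient
    return failures * math.log(p) + successes * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]   # candidate p values in (0, 1)
p_mle = max(grid, key=loglik)
print(p_mle)   # 0.2, i.e. failures / (failures + successes)
```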

The learning algorithm. Say you have a Work and b Not-Work outcomes. Then
p_best = argmax_p P(D | p)
= argmax_p log P(D | p)   (the log likelihood)
= argmax_p log [ C(a + b, b) p^b (1 − p)^a ]
= argmax_p [ b log p + a log(1 − p) ].
Calculus 101: set the derivative to zero, which gives p_best = b / (a + b).
The model we assumed is Bernoulli. You could assume a different model! Next we will consider other models and see how to learn their parameters.
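The calculus step, spelled out:

$$\frac{d}{dp}\Big[b \log p + a \log(1-p)\Big] = \frac{b}{p} - \frac{a}{1-p} = 0 \quad\Longrightarrow\quad b(1-p) = ap \quad\Longrightarrow\quad p_{best} = \frac{b}{a+b}.$$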

Next: the second example of maximum likelihood estimation, the normal distribution.

Maximum Likelihood estimation (MLE), recap. What we need in order to define learning: 1. A hypothesis space H. 2. A model that says how the data D is generated given a hypothesis h.

Example 2: Maximum likelihood and least squares. Suppose H consists of real-valued functions: inputs are vectors x ∈ R^d and the output is a real number y ∈ R. Suppose the training data is generated as follows: an input x_i is drawn randomly (say, uniformly at random); the true function f is applied to get f(x_i); this value is then perturbed by noise e_i, drawn independently from an unknown Gaussian with zero mean. Say we have m training examples (x_i, y_i) generated by this process.
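A small sketch of this generative process (purely illustrative assumptions: here f is taken to be linear and the noise level is fixed, neither of which the setup requires):

```python
# Generate synthetic data y_i = f(x_i) + e_i with zero-mean Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 3, 100, 0.5
w_true = rng.normal(size=d)                  # assume the true f is linear: f(x) = w_true . x
X = rng.uniform(-1.0, 1.0, size=(m, d))      # inputs drawn uniformly at random
noise = rng.normal(0.0, sigma, size=m)       # noise e_i ~ N(0, sigma^2), drawn independently
y = X @ w_true + noise                       # observed labels
```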

Example: Maximum likelihood and least squares. Suppose we have a hypothesis h. We want to know the probability that a particular label y_i was generated by this hypothesis as h(x_i). The error for this example is y_i − h(x_i). We believe that this error comes from a Gaussian distribution with mean 0 and standard deviation σ. So we can compute the probability of observing one data point (x_i, y_i) if it were generated using the function h.

Example: Maximum likelihood and least squares. The probability of observing one data point (x_i, y_i), if it were generated using the function h; each example in our dataset D = {(x_i, y_i)} is generated independently by this process, so the probabilities multiply.
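Under the Gaussian noise assumption, these two quantities are:

$$P(y_i \mid x_i, h) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - h(x_i))^2}{2\sigma^2}\right), \qquad P(D \mid h) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - h(x_i))^2}{2\sigma^2}\right).$$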

Example: Maximum likelihood and least squares. Our goal is to find the most likely hypothesis, h_ML = argmax_h P(D | h). How do we maximize this expression? Answer: take the logarithm to simplify. The terms involving σ can then be dropped, because we assumed that the standard deviation is a constant.
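Carrying out this maximization (the constant terms and the positive factor 1/(2σ²) do not change the argmax):

$$h_{ML} = \arg\max_h \log P(D \mid h) = \arg\max_h \sum_{i=1}^{m} \left[-\frac{(y_i - h(x_i))^2}{2\sigma^2} - \tfrac{1}{2}\log(2\pi\sigma^2)\right] = \arg\min_h \sum_{i=1}^{m} \big(y_i - h(x_i)\big)^2.$$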

Example: Maximum likelihood and least squares. The most likely hypothesis is the one that minimizes the sum of squared errors. If we consider the set of linear functions as our hypothesis space, h(x_i) = w^T x_i, this is the probabilistic explanation for least squares regression.
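A minimal end-to-end sketch of this connection (the data-generating weights and noise level are illustrative assumptions): fitting w by ordinary least squares is exactly the Gaussian maximum likelihood estimate for a linear hypothesis.

```python
# Least squares as Gaussian MLE for linear hypotheses h(x) = w^T x.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])                  # assumed "true" linear function
y = X @ w_true + rng.normal(0.0, 0.5, size=100)      # labels with zero-mean Gaussian noise

w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)         # argmin_w sum_i (y_i - w^T x_i)^2
print(w_ml)                                          # approximately recovers w_true
```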

Linear regression: two perspectives. Loss minimization perspective: we want to minimize the difference between our predictions and the observed labels, measured by the squared loss; that is, minimize the total squared loss. Bayesian perspective: we believe that the errors are normally distributed with zero mean and a fixed variance, and we find the linear regressor using the maximum likelihood principle.

Today's lecture: Summary. Bayesian learning is another way to ask: what is the best hypothesis for a dataset? We saw two answers to the question, maximum a posteriori (MAP) and maximum likelihood estimation (MLE), and two examples of maximum likelihood estimation: the binomial distribution and the normal distribution. You should be able to apply both MAP and MLE to simple problems.