Bayesian Learning (II)

Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen
Bayesian Learning (II)
Niels Landwehr

Overview
- Probabilities, expected values, variance
- Basic concepts of Bayesian learning
  - MAP hypothesis and regularized loss
  - Bayesian model averaging
- (Bayesian) parameter estimation for probability distributions
- Bayesian linear regression, naive Bayes

Conceptual Model for Learning
Many machine learning methods are based on probabilistic considerations. We want to learn models of the form y = f(x) from training data L = (x_1, y_1), ..., (x_n, y_n).
Conceptual model of the data-generating process:
- Someone draws the true model f (equivalently, its parameters θ) from the "prior" distribution p(f). f is not known, but p(f) reflects prior knowledge (which models are the most probable?).
- Training inputs x_i are drawn (independently of θ).
- Class labels y_i are drawn from p(y_i | x_i, θ).
Learning question: given L and p(θ), what is the most likely true model? Try to (approximately) reconstruct θ.

Bayes Rule
Bayes rule:
p(X | Y) = p(Y | X) p(X) / p(Y)
The proof is simple:
p(X | Y) = p(X, Y) / p(Y)           (definition of the conditional distribution)
         = p(Y | X) p(X) / p(Y)     (product rule)
Important basic knowledge for machine learning: Bayes rule allows the inference of model probabilities from the probabilities of observations.

Bayes Rule
Model probability given data and prior knowledge:
p(model | data) = p(data | model) p(model) / p(data) ∝ p(data | model) p(model)
- Likelihood p(data | model): how probable is the data, under the assumption that model is the true model?
- Prior p(model): how probable is a model, a priori?
- p(data) is constant; it is independent of the model.

Maximum a Posteriori Hypothesis
The most likely model given the data (w parameterizes the model f_w(x)):
f_MAP = argmax_{f_w} p(f_w | L)
      = argmax_{f_w} p(L | f_w) p(f_w) / p(L)          (application of Bayes rule)
      = argmax_{f_w} p(L | f_w) p(f_w)
      = argmax_{f_w} [ log p(L | f_w) + log p(f_w) ]
      = argmin_{f_w} [ −log p(L | f_w) − log p(f_w) ]
The optimization criterion consists of a log-likelihood term, −log p(L | f_w), and a log-prior term, −log p(f_w).

Log-Likelihood
How likely are the data given the model, log p(L | f_w)? Assumptions: data points are independent, and label y_i does not depend on x_j for j ≠ i.
log p(L | f_w) = log [ p(y_1, ..., y_n | x_1, ..., x_n, f_w) p(x_1, ..., x_n | f_w) ]    (product rule, given f_w)
             = log [ p(y_1, ..., y_n | x_1, ..., x_n, f_w) p(x_1, ..., x_n) ]            (the inputs x_1, ..., x_n are independent of the model f_w)
             = log p(y_1, ..., y_n | x_1, ..., x_n, f_w) + const
             = Σ_{i=1}^n log p(y_i | f_w, x_1, ..., x_n) + const
             = Σ_{i=1}^n log p(y_i | f_w, x_i) + const      (y_i depends only on x_i; the constant is independent of f_w)
How do we model p(y_i | f_w, x_i)?

Log-Likelihood
Assumption for modeling p(y_i | f_w, x_i): a special exponential distribution based on a loss function. The probability that f_w generates label y_i from x_i decreases exponentially in ℓ(f_w(x_i), y_i):
p(y_i | f_w, x_i) = (1/Z) exp(−ℓ(f_w(x_i), y_i))      (Z is a normalizer)
The loss function ℓ(f_w(x_i), y_i) measures the distance between f_w(x_i) and y_i; for example, ℓ(f_w(x_i), y_i) = 0 if f_w(x_i) = y_i and c otherwise.
Model assumptions used in the negative log-likelihood:
−Σ_i log p(y_i | f_w, x_i) = Σ_i ℓ(f_w(x_i), y_i) + Σ_i log Z = Σ_i ℓ(f_w(x_i), y_i) + const      (the constant is independent of f_w)
The negative log-likelihood corresponds to a loss term!
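
To make the correspondence between loss and negative log-likelihood concrete, here is a minimal numerical sketch (not from the slides). It assumes a Gaussian noise model p(y_i | f_w, x_i) = N(y_i; w·x_i, 1), for which the loss is the squared error ℓ(f_w(x_i), y_i) = ½(w·x_i − y_i)²; the toy data and the use of NumPy/SciPy are illustration choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(size=20)        # toy regression data (assumed)

def neg_log_likelihood(w):
    # -sum_i log p(y_i | f_w, x_i) under the Gaussian model N(y_i; w*x_i, 1)
    return -np.sum(norm.logpdf(y, loc=w * x, scale=1.0))

def sum_of_losses(w):
    # sum_i l(f_w(x_i), y_i) with squared loss 0.5 * (w*x_i - y_i)^2
    return 0.5 * np.sum((w * x - y) ** 2)

# The difference is n * log Z (here Z = sqrt(2*pi)), a constant independent of w.
for w in [0.0, 1.0, 2.0]:
    print(w, neg_log_likelihood(w) - sum_of_losses(w))
```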

A Priori Probability (Prior)
A distribution over models = a distribution over model parameters w ∈ R^m. Assumption: the model parameters are normally distributed with mean μ = 0, i.e., we prefer models with small attribute weights:
p(f_w) = N(w | 0, σ²I) = (1 / √(2πσ²))^m exp(−(1/(2σ²)) ||w||²)
[Figure: Gaussian prior density over the model parameters w_1, w_2, centered at 0.]
Model assumptions used in the negative log-prior:
−log p(f_w) = (1/(2σ²)) ||w||² + const      (the constant is independent of f_w)
Negative log-prior = regularizer!

A Posteriori Probability (Posterior)
The most likely model given prior knowledge and data:
f_MAP = argmax_{f_w} p(f_w | L)
      = argmin_w [ −log p(L | f_w) − log p(f_w) ]
      = argmin_w [ Σ_i ℓ(f_w(x_i), y_i) + λ ||w||² ],    with λ = 1/(2σ²)
An argmin over a regularized loss function! What is the justification for this optimization criterion? It yields the most likely hypothesis (the MAP hypothesis).
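
A minimal sketch (an illustration under stated assumptions, not the lecture's own code) of the MAP estimate as regularized loss minimization: with squared loss and the Gaussian prior from the previous slide, the criterion Σ_i ℓ(f_w(x_i), y_i) + λ||w||² with λ = 1/(2σ²) is exactly ridge regression, which has a closed-form solution. The synthetic data, the prior variance, and the loss scaling (ℓ = ½ squared error) are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
X = rng.normal(size=(n, m))
w_true = np.array([1.0, -2.0, 0.5])                  # "true" parameters of the toy data (assumed)
y = X @ w_true + rng.normal(size=n)

sigma2 = 10.0                                        # prior variance sigma^2 (assumed)
lam = 1.0 / (2.0 * sigma2)                           # lambda = 1 / (2 sigma^2)

# MAP estimate: argmin_w  sum_i 0.5 * (w^T x_i - y_i)^2 + lam * ||w||^2
# Setting the gradient to zero gives (X^T X + 2*lam*I) w = X^T y.
w_map = np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(m), X.T @ y)
print(w_map)                                         # close to w_true for this amount of data
```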

Overview
- Probabilities, expected values, variance
- Basic concepts of Bayesian learning
  - MAP hypothesis and regularized loss
  - Bayesian model averaging
- (Bayesian) parameter estimation for probability distributions
- Bayesian linear regression, naive Bayes

Learning and Prediction
Previously, the learning problem was separated from prediction:
- Learning: f_MAP = argmax_{f_w} p(f_w | L)      (the most likely model given the data)
- Prediction: x ↦ f_MAP(x) for a new test instance x      (prediction of the MAP model)
If we must commit ourselves to a single model, then the MAP model is a sensible choice. However, the actual goal is the prediction of a class! It is better not to commit to a single model, but instead to search directly for the optimal prediction.

Learning and Prediction: Example
Model space with 4 models: H = {f_1, f_2, f_3, f_4}; binary classification problem, Y = {0, 1}; training data L. We compute the a posteriori probabilities of the models:
p(f_1 | L) = 0.3,  p(f_2 | L) = 0.25,  p(f_3 | L) = 0.25,  p(f_4 | L) = 0.2
The MAP model is f_1 = argmax_{f_i} p(f_i | L).

Learning and Prediction: Example
Each model f_i is a probabilistic classifier; for binary classification, p(y = 1 | x, f_i) ∈ [0, 1]. For example, logistic regression (a linear model):
- Parameter vector: w
- Decision function: f_w(x) = wᵀx
- Logistic function: σ(z) = 1 / (1 + exp(−z))
- Class probability: p(y = 1 | x, w) = σ(wᵀx)
[Figure: the logistic curve, plotting p(y = 1 | x, w) against the decision function value wᵀx.]
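
The class probability of such a linear probabilistic classifier is a one-liner; this is a small sketch of the formulas on this slide (the parameter vector and test instance are made-up values).

```python
import numpy as np

def sigmoid(z):
    # logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def p_y1(x, w):
    # class probability of logistic regression: p(y=1 | x, w) = sigma(w^T x)
    return sigmoid(w @ x)

w = np.array([1.5, -0.5])     # example parameter vector (assumed)
x = np.array([0.8, 0.3])      # example test instance (assumed)
print(p_y1(x, w))             # a probability in (0, 1)
```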

Learning and Prediction: Example
We want to classify a new test instance x:
p(y = 1 | x, f_1) = 0.6,  p(y = 1 | x, f_2) = 0.1,  p(y = 1 | x, f_3) = 0.2,  p(y = 1 | x, f_4) = 0.3
The classification given by the MAP model f_1 is y = 1. However (by the computation rules of probability!):
p(y = 1 | x, L) = Σ_{i=1}^4 p(y = 1, f_i | x, L)                  (sum rule)
               = Σ_{i=1}^4 p(y = 1 | f_i, x, L) p(f_i | x, L)      (product rule)
               = Σ_{i=1}^4 p(y = 1 | f_i, x) p(f_i | L)            (independence)
               = 0.6·0.3 + 0.1·0.25 + 0.2·0.25 + 0.3·0.2
               = 0.315
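
The averaged prediction from this slide can be reproduced directly; this snippet uses exactly the posterior probabilities p(f_i | L) and the per-model class probabilities p(y = 1 | x, f_i) given above.

```python
# p(f_1|L), ..., p(f_4|L) and p(y=1 | x, f_1), ..., p(y=1 | x, f_4) from the example
posterior = [0.3, 0.25, 0.25, 0.2]
p_y1_given_model = [0.6, 0.1, 0.2, 0.3]

p_y1 = sum(py * pf for py, pf in zip(p_y1_given_model, posterior))
print(p_y1)   # 0.315

# The MAP model f_1 alone predicts y = 1 (since 0.6 > 0.5), but the
# model-averaged probability of y = 1 is only 0.315 < 0.5, so the
# Bayes-optimal prediction is y = 0.
```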

Learning and Prediction: Example
If the goal is prediction, we should use p(y = 1 | x, L): do not commit to a single model as long as there is still uncertainty about the models. This is the fundamental idea behind Bayesian learning/prediction!

Bayesian Learning and Prediction
Problem setting: prediction.
- Given: training data L, a new test instance x.
- Sought: the distribution over labels y for the given x: p(y | x, L).
Bayesian prediction: y* = argmax_y p(y | x, L). It minimizes the risk of an incorrect prediction and is also called the Bayes-optimal decision or the Bayes hypothesis.

Bayesian Learning and Prediction
Computation of the Bayesian prediction (Bayesian model averaging):
y* = argmax_y p(y | x, L)
   = argmax_y ∫ p(y, θ | x, L) dθ                    (sum rule)
   = argmax_y ∫ p(y | θ, x, L) p(θ | x, L) dθ        (product rule)
   = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
Bayesian learning averages the predictions over all models θ: p(y | θ, x) is the prediction given the model, and the posterior p(θ | L) over models provides the weighting (how well a model fits the training data).

Bayesian Learning and Prediction
Is Bayesian prediction practical?
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
Bayesian model averaging implicitly averages over infinitely many models. How can this be computed? A closed-form solution is only sometimes available.
Contrast this with decision tree learning: find a model that fits the data well, then make predictions for new instances based on this model. There is a separation between learning a model and using it for prediction.
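
When no closed-form solution exists, a common workaround (not discussed on this slide) is to approximate the integral by a finite average over models drawn from the posterior. The sketch below is hypothetical: it assumes posterior samples θ⁽¹⁾, ..., θ⁽ᴹ⁾ are already available (e.g., from MCMC or from a closed-form Gaussian posterior) and uses the logistic model p(y = 1 | θ, x) = σ(θᵀx) from the earlier slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bayes_predict(x, theta_samples):
    # Monte Carlo approximation of p(y=1 | x, L):
    #   (1/M) * sum_m p(y=1 | theta_m, x)   with theta_m ~ p(theta | L)
    p_y1 = np.mean([sigmoid(theta @ x) for theta in theta_samples])
    return int(p_y1 > 0.5), p_y1          # argmax over y in {0, 1}, and the averaged probability

# Hypothetical posterior samples for a 2-dimensional parameter vector.
rng = np.random.default_rng(2)
theta_samples = rng.normal(loc=[1.0, -0.5], scale=0.3, size=(1000, 2))
x = np.array([0.4, 1.2])
print(bayes_predict(x, theta_samples))
```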

Bayesian Learning and Prediction
How is the Bayes hypothesis calculated?
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
We need:
1) The probability of a class label given the model, p(y | θ, x). This follows from the model definition, e.g., for the linear probabilistic classifier (logistic regression): p(y = 1 | x, θ) = σ(θᵀx).
[Figure: the logistic curve, plotting p(y = 1 | x, θ) against the decision function value θᵀx.]

Bayesian Learning and Prediction
How is the Bayes hypothesis calculated?
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
We need:
2) The probability of a model given the data, the a posteriori probability p(θ | L). It is calculated via Bayes rule.

Bayesian Learning and Prediction
Computation of the a posteriori distribution over models via Bayes' theorem (posterior ∝ likelihood × prior):
p(θ | L) = p(L | θ) p(θ) / p(L) = (1/Z) p(L | θ) p(θ)
- p(θ | L): posterior, the a posteriori distribution
- p(L | θ): likelihood; how well does the model fit the data?
- p(θ): prior, the a priori distribution
- p(L) = Z: normalization constant

Bayes Rule
Needed: the likelihood p(L | θ). The labels y_1, ..., y_N are generated depending only on the model θ and the data point x_i. How probable would the training data be if θ were the correct model? How well does the model fit the data?
With L = (x_1, y_1), ..., (x_N, y_N):
p(L | θ) = p(y_1, ..., y_N | x_1, ..., x_N, θ) p(x_1, ..., x_N | θ)
        = p(y_1, ..., y_N | x_1, ..., x_N, θ) p(x_1, ..., x_N)      (the inputs x_1, ..., x_N are independent of the model θ)
        = (1/Z) p(y_1, ..., y_N | x_1, ..., x_N, θ)                  (the factor p(x_1, ..., x_N) is constant in θ and written as 1/Z)
        = (1/Z) Π_{i=1}^N p(y_i | x_i, θ)
The individual factors p(y_i | x_i, θ) follow from the model definition (for example, logistic regression).

Bayes Rule
Needed: the prior p(θ). How probable is model θ before we have seen any training data? Assumptions about p(θ) come from data-independent prior knowledge about the problem.
Linear model example: ||θ||² should be as small as possible.

Bayes Rule
Needed: the prior p(θ). How probable is model θ before we have seen any training data? Assumptions about p(θ) come from data-independent prior knowledge about the problem.
Decision tree learning example: small trees are often better than complex trees, so the learning algorithm prefers small trees.

Summary of Bayesian/MAP/ML Hypotheses
To minimize the risk of an incorrect decision, choose the Bayesian prediction:
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
Problem: in many cases there is no closed-form solution, and integration over all models is impractical.
Maximum a posteriori (MAP) hypothesis: choose
θ_MAP = argmax_θ p(θ | L),    y* = argmax_y p(y | x, θ_MAP)
This corresponds to decision tree learning: find the best model from the data and classify only with this model.

Summary of Bayesian/MAP/ML Hypotheses
To determine the MAP hypothesis we must be able to compute the posterior (likelihood × prior). This is not possible if no prior knowledge (prior) exists.
Maximum likelihood (ML) hypothesis: choose
θ_ML = argmax_θ p(L | θ),    y* = argmax_y p(y | x, θ_ML)
It is based only on the observations in L, with no prior knowledge, and is prone to overfitting the data.

Overview
- Probabilities, expected values, variance
- Basic concepts of Bayesian learning
- (Bayesian) parameter estimation for probability distributions
- Bayesian linear regression, naive Bayes

Estimating the Distribution's Parameters
Often we can assume that the data comes from a distribution of a specified form:
- e.g., a binomial distribution for N coin flips,
- e.g., a Gaussian distribution for body size, IQ, ...
These distributions are parameterized:
- binomial distribution: the parameter μ is the probability of heads,
- Gaussian distribution: the parameters μ, σ are the mean and the standard deviation.
The true probabilities/parameters are never known. What conclusions can we draw about the true probabilities given the data?

Estimating the Distribution's Parameters
Problem: estimating the distribution's parameters.
- Given: a parameterized family of distributions (e.g., binomial, Gaussian) with parameter vector θ.
- Given: data L, expressed as a random variable.
- Sought: the a posteriori distribution p(θ | L), or the maximum a posteriori estimate θ* = argmax_θ p(θ | L).
Applying Bayes rule: p(θ | L) = p(L | θ) p(θ) / p(L)

Estimation for Binomially Distributed Data
Example: coin flips, with estimated parameter θ = μ. A coin is flipped N times; the data L consists of N_h heads and N_t tails. What is the best estimate of θ given L?
Bayes equation:
p(θ | L) = p(L | θ) p(θ) / p(L)
- p(θ | L): a posteriori distribution over the parameters; characterizes probable parameter values and the remaining uncertainty.
- p(L | θ): likelihood; how likely are N_h heads and N_t tails given parameter θ?
- p(θ): a priori distribution over the parameters, representing prior knowledge.
- p(L): probability of the data; only serves as a normalizer.

Estimation for Binomially Distributed Data
Likelihood of the data p(L | θ), where θ = μ is the probability of heads. The likelihood is binomially distributed:
p(L | θ) = p(N_h, N_t | θ) = Bin(N_h | N, θ) = (N choose N_h) θ^{N_h} (1 − θ)^{N_t},    with N = N_h + N_t
This is the probability of seeing N_h heads and N_t tails in N coin flips given the coin parameter θ.
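
As a quick check (not part of the slides), the binomial likelihood can be evaluated directly, e.g., with SciPy; the counts and the candidate parameter below are made-up illustration values.

```python
from math import comb
from scipy.stats import binom

N_h, N_t = 7, 3              # example counts of heads and tails (assumed)
N = N_h + N_t
theta = 0.6                  # candidate coin parameter (assumed)

# p(L | theta) = Bin(N_h | N, theta) = C(N, N_h) * theta^N_h * (1 - theta)^N_t
manual = comb(N, N_h) * theta**N_h * (1.0 - theta)**N_t
print(manual, binom.pmf(N_h, N, theta))   # both approx. 0.215
```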

Estimation for Binomially Distributed Data
What is the prior p(θ) for the coin-flipping example?
1st attempt: no prior knowledge, i.e., a uniform prior
p(θ) = 1 for 0 ≤ θ ≤ 1, and 0 otherwise
Example: data L = (tails, tails, tails). MAP model:
θ* = argmax_{θ∈[0,1]} p(θ | L)
   = argmax_{θ∈[0,1]} p(L | θ) p(θ) / p(L)
   = argmax_{θ∈[0,1]} p(L | θ)
   = argmax_{θ∈[0,1]} (3 choose 0) θ^0 (1 − θ)^3
   = 0
Inference: the coin will never land on heads. Bad: this overfits the data.
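
A tiny grid-search sketch (illustration only) reproduces this degenerate estimate: with three observed tails and a flat prior, the posterior is proportional to the likelihood, and its maximizer on [0, 1] is θ = 0.

```python
import numpy as np

N_h, N_t = 0, 3                          # data L = (tails, tails, tails)
thetas = np.linspace(0.0, 1.0, 1001)     # grid over the parameter space [0, 1]

# Uniform prior: the posterior is proportional to theta^N_h * (1 - theta)^N_t
# (the binomial coefficient is constant in theta).
likelihood = thetas**N_h * (1.0 - thetas)**N_t
theta_map = thetas[np.argmax(likelihood)]
print(theta_map)                         # 0.0 -> "the coin will never land on heads"
```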