Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen
Bayesian Learning (II)
Niels Landwehr
Overview
- Probabilities, expected values, variance
- Basic concepts of Bayesian learning
  - MAP hypothesis and regularized loss
  - Bayesian Model Averaging
- (Bayesian) parameter estimation for probability distributions
- Bayesian linear regression, naive Bayes
Conceptual Model for Learning
Many machine learning methods are based on probabilistic considerations. We want to learn models of the form y = f_θ(x) from training data L = {(x_1, y_1), …, (x_n, y_n)}.
Conceptual model of the data-generating process:
- Someone draws the true model parameter θ from the (prior) distribution p(θ). θ is not known, but p(θ) reflects prior knowledge (which models are most probable?).
- Training inputs x_i are drawn (independently of θ).
- Class labels y_i are drawn from p(y_i | x_i, θ).
Learning question: given L and p(θ), what is the most likely true model? Try to (approximately) reconstruct θ.
Bayes Rule
Bayes rule: p(X | Y) = p(Y | X) p(X) / p(Y)
The proof is simple:
p(X | Y) = p(X, Y) / p(Y)          (definition of the conditional distribution)
         = p(Y | X) p(X) / p(Y)    (product rule)
Important basic knowledge for machine learning: it allows the inference of model probabilities given the probabilities of observations.
Bayes Rule
Model probability given data and prior knowledge:
p(model | data) = p(data | model) p(model) / p(data) ∝ p(data | model) p(model)
- Likelihood p(data | model): how probable is the data, under the assumption that model is the true model?
- Prior p(model): how probable is a model, a priori?
- p(data) is constant; it is independent of the model.
Maximum a Posteriori Hypothesis
Most likely model given the data (w parameterizes the model f_w(x)):
f_MAP = argmax_{f_w} p(f_w | L)
      = argmax_{f_w} p(L | f_w) p(f_w) / p(L)        (application of Bayes rule)
      = argmax_{f_w} p(L | f_w) p(f_w)
      = argmax_{f_w} [ log p(L | f_w) + log p(f_w) ]
      = argmin_{f_w} [ - log p(L | f_w) - log p(f_w) ]
The optimization criterion consists of the log-likelihood log p(L | f_w) and the log-prior log p(f_w).
Log-Likelihood
How likely are the data given the model? Assumptions: data points are independent; label y_i does not depend on x_j for j ≠ i; the input x_1, …, x_n is independent of the model f_w.
log p(L | f_w) = log [ p(y_1, …, y_n | x_1, …, x_n, f_w) p(x_1, …, x_n | f_w) ]    (product rule, given f_w)
              = log [ p(y_1, …, y_n | x_1, …, x_n, f_w) p(x_1, …, x_n) ]           (inputs independent of f_w)
              = log p(y_1, …, y_n | x_1, …, x_n, f_w) + const                      (constant, independent of f_w)
              = Σ_{i=1}^n log p(y_i | f_w, x_1, …, x_n) + const                    (independence of data points)
              = Σ_{i=1}^n log p(y_i | f_w, x_i) + const                            (y_i depends only on x_i)
How do we model p(y_i | f_w, x_i)?
Log-Likelihood
Assumption for modeling p(y_i | f_w, x_i): an exponential distribution based on a loss function. The probability that f_w generates label y_i from x_i decreases exponentially in the loss l(f_w(x_i), y_i):
p(y_i | f_w, x_i) = (1/Z) exp( - l(f_w(x_i), y_i) )      (Z: normalizer)
The loss function l(f_w(x_i), y_i) measures the distance between f_w(x_i) and y_i: l(f_w(x_i), y_i) = 0 if f_w(x_i) = y_i, and l grows the further f_w(x_i) is from y_i.
Under this model assumption, the negative log-likelihood is
- Σ_i log p(y_i | f_w, x_i) = Σ_i l(f_w(x_i), y_i) + Σ_i log Z = Σ_i l(f_w(x_i), y_i) + const      (constant, independent of f_w)
The negative log-likelihood corresponds to a loss term!
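The loss-likelihood link above can be checked numerically. A minimal sketch, assuming the squared loss l(f, y) = (f - y)^2 and the matching normalizer Z = sqrt(pi) (neither is specified on the slide); with these choices p(y | f) = (1/Z) exp(-l(f, y)) is a Gaussian density and the negative log-likelihood equals the loss plus a constant:

```python
import math

# Assumption (not from the slide): squared loss and Z = sqrt(pi), under
# which p(y | f) = (1/Z) exp(-l(f, y)) is a Gaussian density.
def p(f, y, Z=math.sqrt(math.pi)):
    return math.exp(-(f - y) ** 2) / Z

f, y = 1.2, 0.7
nll = -math.log(p(f, y))    # negative log-likelihood
loss = (f - y) ** 2         # squared loss l(f_w(x_i), y_i)
# nll equals loss + log Z, i.e. the loss up to a constant
```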
A Priori Probability (Prior)
A distribution over models = a distribution over model parameters w ∈ R^m.
Assumption: the model parameter is normally distributed with mean μ = 0; we prefer models with small attribute weights:
p(f_w) = N(w | 0, σ² I) = (1 / (2πσ²)^{m/2}) exp( - ‖w‖² / (2σ²) )
Under this model assumption, the negative log-prior is
- log p(f_w) = (1 / (2σ²)) ‖w‖² + const      (constant, independent of f_w)
The negative log-prior is a regularizer!
[Figure: density of the Gaussian prior over the weights (w_1, w_2), centered at 0]
A Posteriori Probability (Posterior)
Most likely model given prior knowledge and data:
f_MAP = argmax_{f_w} p(f_w | L)
      = argmin_w [ - log p(L | f_w) - log p(f_w) ]
      = argmin_w [ Σ_i l(f_w(x_i), y_i) + λ ‖w‖² ]      with λ = 1 / (2σ²)
An argmin over a regularized loss function! This justifies the optimization criterion: it yields the most likely hypothesis (MAP hypothesis).
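The correspondence can be made concrete in code. A minimal sketch, assuming the squared loss and the Gaussian prior from the previous slide (the toy data, dimensions, and σ² are invented for illustration); with squared loss, the MAP estimate is exactly ridge regression:

```python
import numpy as np

# Toy data (invented for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # inputs x_i as rows
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)

sigma2 = 1.0                            # prior variance sigma^2 (assumption)
lam = 1.0 / (2.0 * sigma2)              # lambda = 1 / (2 sigma^2), as on the slide

# MAP estimate: argmin_w  sum_i (w^T x_i - y_i)^2 + lam * ||w||^2
# For squared loss this has the closed form (X^T X + lam I)^{-1} X^T y.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

Setting the gradient of the regularized objective to zero gives the linear system solved in the last line.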
Overview
- Probabilities, expected values, variance
- Basic concepts of Bayesian learning
  - MAP hypothesis and regularized loss
  - Bayesian Model Averaging
- (Bayesian) parameter estimation for probability distributions
- Bayesian linear regression, naive Bayes
Learning and Prediction
Previously: the learning problem was separated from prediction (x is a new test instance):
- Learning: f_MAP = argmax_{f_w} p(f_w | L)      (most likely model given the data)
- Prediction: x ↦ f_MAP(x)                       (prediction of the MAP model)
If we must commit ourselves to a single model, then the MAP model is a sensible choice. However, the actual goal is the prediction of a class! It can be better not to commit to a model, but instead to search directly for the optimal prediction.
Learning and Prediction: Example
- Model space with 4 models: H = {f_1, f_2, f_3, f_4}
- Binary classification problem, Y = {0, 1}
- Training data L
We compute the a posteriori probabilities of the models:
p(f_1 | L) = 0.3,  p(f_2 | L) = 0.25,  p(f_3 | L) = 0.25,  p(f_4 | L) = 0.2
The MAP model is f_1 = argmax_{f_i} p(f_i | L).
Learning and Prediction: Example
Each model f_i is a probabilistic classifier: for binary classification, p(y = 1 | x, f_i) ∈ [0, 1].
E.g., logistic regression (a linear model):
- Parameter vector: w
- Decision function: f_w(x) = w^T x
- Logistic function: σ(z) = 1 / (1 + exp(-z))
- Class probability: p(y = 1 | x, w) = σ(w^T x)
[Figure: logistic regression, p(y = 1 | x, w) plotted against the decision function value w^T x]
Learning and Prediction: Example
We want to classify a new test sample x:
p(y = 1 | x, f_1) = 0.6,  p(y = 1 | x, f_2) = 0.1,  p(y = 1 | x, f_3) = 0.2,  p(y = 1 | x, f_4) = 0.3
Classification given by the MAP model f_1: y = 1.
However (by the computation rules of probability!):
p(y = 1 | x, L) = Σ_{i=1}^4 p(y = 1, f_i | x, L)                  (sum rule)
               = Σ_{i=1}^4 p(y = 1 | f_i, x, L) p(f_i | x, L)     (product rule)
               = Σ_{i=1}^4 p(y = 1 | f_i, x) p(f_i | L)           (independence)
               = 0.6·0.3 + 0.1·0.25 + 0.2·0.25 + 0.3·0.2 = 0.315
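The averaged probability from this example can be reproduced in a few lines (the numbers are the ones given above):

```python
# p(f_i | L) and p(y = 1 | x, f_i) for the four models from the slide
posteriors = [0.3, 0.25, 0.25, 0.2]
p_y1_given_f = [0.6, 0.1, 0.2, 0.3]

# p(y = 1 | x, L) = sum_i p(y = 1 | f_i, x) * p(f_i | L)
p_y1 = sum(p * post for p, post in zip(p_y1_given_f, posteriors))
print(round(p_y1, 3))  # 0.315
```

Since 0.315 < 0.5, the model average predicts y = 0, while the MAP model alone predicts y = 1.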
Learning and Prediction: Example
If the goal is prediction, we should use p(y = 1 | x, L): do not commit to a single model as long as there is still uncertainty about the models. In the example, p(y = 1 | x, L) = 0.315 < 0.5, so the averaged prediction is y = 0, while the MAP model alone predicts y = 1.
This is the fundamental idea behind Bayesian learning/prediction!
Bayesian Learning and Prediction
Problem setting: prediction.
- Given: training data L, a new test instance x
- Sought: the distribution over labels y for a given x: p(y | x, L)
Bayesian prediction: y* = argmax_y p(y | x, L)
It minimizes the risk of an incorrect prediction; it is also called the Bayes-optimal decision or the Bayes hypothesis.
Bayesian Learning and Prediction
Computation of the Bayesian prediction (Bayesian model averaging):
y* = argmax_y p(y | x, L)
   = argmax_y ∫ p(y, θ | x, L) dθ                 (sum rule)
   = argmax_y ∫ p(y | θ, x, L) p(θ | x, L) dθ     (product rule)
   = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
This is Bayesian learning: an average of the predictions p(y | θ, x) over all models θ, weighted by the posterior p(θ | L), i.e., by how well a model fits the training data.
Bayesian Learning and Prediction
Is Bayesian prediction practical?
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
Bayesian model averaging implicitly averages over infinitely many models. How can this be computed? Only sometimes is it practical to obtain a closed-form solution.
Contrast this with, e.g., decision tree learning: find one model that fits the data well, then give predictions for new instances based on this model. There, learning a model and using it for prediction are separate steps.
Bayesian Learning and Prediction
How is the Bayes hypothesis calculated?
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
We need:
1) The probability of a class label given a model, p(y | θ, x). This follows from the model definition, e.g., for the linear probabilistic classifier (logistic regression): p(y = 1 | x, θ) = σ(θ^T x).
[Figure: p(y = 1 | x, θ) plotted against the decision function value θ^T x]
Bayesian Learning and Prediction
How is the Bayes hypothesis calculated?
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
We need:
2) The probability of a model given the data, the a posteriori probability p(θ | L). It is calculated via Bayes rule.
Bayesian Learning and Prediction
Computation of the a posteriori distribution over models via Bayes rule: posterior ∝ likelihood × prior.
p(θ | L) = p(L | θ) p(θ) / p(L) = (1/Z) p(L | θ) p(θ)
- p(θ | L): posterior, the a posteriori distribution
- p(L | θ): likelihood; how well does the model fit the data?
- p(θ): prior, the a priori distribution
- p(L) = Z: normalization constant
Bayes Rule
Needed: the likelihood p(L | θ) for L = {(x_1, y_1), …, (x_N, y_N)}. How probable would the training data be if θ were the correct model? How well does the model fit the data? The labels y_1, …, y_N are generated depending only on the model θ and the respective data point x_i.
p(L | θ) = p(y_1, …, y_N | x_1, …, x_N, θ) p(x_1, …, x_N | θ)
         = p(y_1, …, y_N | x_1, …, x_N, θ) p(x_1, …, x_N)     (inputs independent of θ)
         = (1/Z) p(y_1, …, y_N | x_1, …, x_N, θ)              (p(x_1, …, x_N) is constant in θ)
         = (1/Z) Π_{i=1}^N p(y_i | x_i, θ)                     (labels independent given θ and x_i)
The factors p(y_i | x_i, θ) follow from the model definition (for example, logistic regression).
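As a sketch of how likelihood and prior combine into a posterior, consider a 1-d logistic-regression model evaluated on a parameter grid. The toy data, the grid, and the Gaussian prior are all invented for illustration; none of them is specified on the slide:

```python
import numpy as np

# Invented 1-d data: negative x tend to have label 0, positive x label 1
x = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])

thetas = np.linspace(-5.0, 5.0, 201)    # grid of candidate models theta
prior = np.exp(-thetas ** 2 / 2.0)      # unnormalized Gaussian prior (assumption)

def likelihood(theta):
    """prod_i p(y_i | x_i, theta) for logistic regression p(y=1|x) = sigma(theta*x)."""
    p1 = 1.0 / (1.0 + np.exp(-theta * x))
    return np.prod(np.where(y == 1, p1, 1.0 - p1))

# posterior ∝ likelihood * prior; normalizing over the grid is the 1/Z step
post = np.array([likelihood(t) for t in thetas]) * prior
post /= post.sum()
```

The resulting `post` array approximates p(θ | L) on the grid; its mass concentrates on positive θ, the models that separate the data correctly.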
Bayes Rule
Needed: the prior p(θ). How probable is model θ before we have seen any training data? Assumptions about p(θ) come from data-independent prior knowledge about the problem.
Linear model example: ‖θ‖² should be as small as possible.
Bayes Rule
Needed: the prior p(θ). How probable is model θ before we have seen any training data? Assumptions about p(θ) come from data-independent prior knowledge about the problem.
Decision tree learning example: small trees are often better than complex trees, so the learning algorithm prefers small trees.
Summary of Bayesian/MAP/ML Hypotheses
To minimize the risk of an incorrect decision, choose the Bayesian prediction:
y* = argmax_y p(y | x, L) = argmax_y ∫ p(y | θ, x) p(θ | L) dθ
Problem: in many cases there is no closed-form solution, and integration over all models is impractical.
Maximum a posteriori (MAP) hypothesis: choose
θ_MAP = argmax_θ p(θ | L),   y* = argmax_y p(y | x, θ_MAP)
This corresponds to, e.g., decision tree learning: find the best model from the data, then classify only with this model.
Summary of Bayesian/MAP/ML Hypotheses
To compute the MAP hypothesis we must be able to compute the posterior (likelihood × prior). This is not possible if no prior knowledge (prior) exists.
Maximum likelihood (ML) hypothesis:
θ_ML = argmax_θ p(L | θ),   y* = argmax_y p(y | x, θ_ML)
It is based only on the observations in L, with no prior knowledge, and is prone to overfitting the data.
Overview
- Probabilities, expected values, variance
- Basic concepts of Bayesian learning
- (Bayesian) parameter estimation for probability distributions
- Bayesian linear regression, naive Bayes
Estimating the Distribution's Parameters
Often we can assume that the data comes from a specific family of distributions:
- e.g., a binomial distribution for N coin flips
- e.g., a Gaussian distribution for body height, IQ, …
These distributions are parameterized:
- binomial distribution: the parameter μ is the probability of heads
- Gaussian distribution: parameters μ, σ for the mean and standard deviation
The true probabilities/parameters are never known. What conclusions can we draw about the true parameters given the data?
Estimating the Distribution's Parameters
Problem: estimating the distribution's parameters.
- Given: a parameterized family of distributions (e.g., binomial, Gaussian) with parameter vector θ
- Given: data L, expressed as a random variable
- Goal: the a posteriori distribution p(θ | L), or respectively the maximum a posteriori estimate θ* = argmax_θ p(θ | L)
Applying Bayes rule: p(θ | L) = p(L | θ) p(θ) / p(L)
Binomially Distributed Data Estimation
Example: coin flips; the estimated parameter is θ = μ.
- A coin is flipped N times.
- Data L: N_h times heads, N_t times tails.
What is the best estimate of θ given L? Bayes rule:
p(θ | L) = p(L | θ) p(θ) / p(L)
- p(θ | L): a posteriori distribution over parameters; characterizes probable parameter values and the remaining uncertainty
- p(L | θ): likelihood; how likely are N_h heads and N_t tails given parameter θ?
- p(θ): a priori distribution over parameters, representing prior knowledge
- p(L): probability of the data; only serves as a normalizer
Binomially Distributed Data Estimation
Likelihood of the data: p(L | θ), where θ = μ is the probability of heads.
The likelihood is binomially distributed:
p(L | θ) = p(N_h, N_t | θ) = Bin(N_h | N, θ) = (N choose N_h) θ^{N_h} (1 - θ)^{N_t},   N = N_h + N_t
This is the probability of seeing N_h heads and N_t tails in N coin flips given coin parameter θ.
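The binomial likelihood can be written down directly. A minimal sketch; the example counts (7 heads, 3 tails) are invented:

```python
from math import comb

def binomial_likelihood(n_heads, n_tails, theta):
    """p(L | theta) = (N choose N_h) * theta^N_h * (1 - theta)^N_t."""
    n = n_heads + n_tails
    return comb(n, n_heads) * theta ** n_heads * (1.0 - theta) ** n_tails

# e.g., 7 heads and 3 tails: as a function of theta, this peaks at 0.7
print(binomial_likelihood(7, 3, 0.7))
```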
Binomially Distributed Data Estimation
What is the prior p(θ) for the coin-flipping example?
First attempt: no prior knowledge, i.e., a uniform prior:
p(θ) = 1 for 0 ≤ θ ≤ 1, and 0 otherwise
Example: data L = (tails, tails, tails). MAP model:
θ* = argmax_{θ ∈ [0,1]} p(θ | L)
   = argmax_{θ ∈ [0,1]} p(L | θ) p(θ) / p(L)
   = argmax_{θ ∈ [0,1]} p(L | θ)
   = argmax_{θ ∈ [0,1]} (3 choose 0) θ^0 (1 - θ)^3 = 0
Inference: the coin will never land on heads. Bad: this overfits the data.
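The overfitting effect above can be reproduced with a simple grid search (a sketch; the grid resolution is arbitrary):

```python
# L = (tails, tails, tails): likelihood p(L | theta) = (3 choose 0) * theta^0 * (1 - theta)^3.
# With the uniform prior, the MAP estimate equals the maximum-likelihood estimate.
thetas = [i / 1000.0 for i in range(1001)]   # grid over [0, 1]

def likelihood(theta):
    return (1.0 - theta) ** 3

theta_map = max(thetas, key=likelihood)
print(theta_map)  # 0.0: the estimate says the coin never lands heads
```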