Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr
Remember: Normal Distribution
Distribution over $x \in \mathbb{R}$. Density function with parameters $\mu$ (mean) and $\sigma^2$ (variance):
$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Remember: Multivariate Normal Distribution
Distribution over vectors $\mathbf{x} \in \mathbb{R}^D$. Density function with parameters $\boldsymbol{\mu} \in \mathbb{R}^D$ (mean vector) and $\boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}$ (covariance matrix):
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$
Example for $D = 2$: density and a sample from the distribution.
Overview
Basic concepts of Bayesian learning.
Linear regression: maximum-likelihood model, maximum-a-posteriori model, posterior distribution over models, Bayesian prediction, predictive distribution.
Linear classification (logistic regression): predictive distribution, maximum-likelihood model, maximum-a-posteriori model, Bayesian prediction.
Nonlinear models: Gaussian processes.
Statistics & Machine Learning
Machine learning is tightly related to inductive statistics. Two areas in statistics:
Descriptive statistics: description and examination of the properties of data (mean values, variances, differences between populations).
Inductive statistics: What conclusions can be drawn from data about the underlying reality? Explanations for observations, model building, relationships and patterns in the data.
Frequentist vs. Bayesian Probabilities
Frequentist probabilities describe the chance of occurrence of an intrinsically stochastic event (e.g., a coin toss). They are defined as limits of relative frequencies of possible outcomes in a repeatable experiment. If one throws a fair coin 1000 times, it will land on heads about 500 times. In 1 gram of Potassium-40, around 260,000 nuclei decay per second.
Frequentist vs. Bayesian Probabilities
Bayesian, subjective probabilities: here, the uncertainty is attributed to a lack of information. How likely is it that suspect X killed the victim? New information (e.g., fingerprints) can change these subjective probabilities.
The Bayesian view is more important in machine learning. The frequentist and Bayesian perspectives are mathematically equivalent; in Bayesian statistics, probabilities are just used to model different things (lack of information).
Bayesian Statistics
Thomas Bayes (1702–1761): "An essay towards solving a problem in the doctrine of chances", published posthumously in 1764. The work of Bayes laid the foundations for inductive statistics. Bayesian probabilities offer an important perspective on uncertainty and probability.
Bayesian Probability in Machine Learning
Model building: find an explanation for observations. What is the most likely model? Trade-off between prior knowledge (a priori distribution over models) and evidence (data, observations).
Bayesian perspective: evidence (data) changes the subjective probability of models (explanations); a posteriori model probability; MAP hypothesis.
Bayesian Model of Learning
Nature conducts a random experiment and determines the parameter $\theta$ according to $P(\theta)$. A system with parameter $\theta$ generates observations $\mathbf{y} = f_\theta(\mathbf{X})$.
Bayesian inference inverts this process: given these assumptions about how $\mathbf{y}$ is generated, and given the observed values $\mathbf{y}$ for input matrix $\mathbf{X}$: What is the most likely true value of $\theta$? What is the most likely value $y^*$ for a new input $\mathbf{x}^*$?
Bayesian Model of Learning
Nature randomly determines $\theta$ according to the a priori ("prior") probability $P(\theta)$; the system with parameter $\theta$ then generates observations $\mathbf{y}$ for input $\mathbf{X}$.
Bayes equation: $P(\theta \mid \mathbf{X}, \mathbf{y}) = \frac{P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)}{P(\mathbf{y} \mid \mathbf{X})}$
$P(\mathbf{y} \mid \mathbf{X}, \theta)$: likelihood of observing $\mathbf{y}$ given $\mathbf{X}$ when the model parameter is $\theta$. $P(\mathbf{y} \mid \mathbf{X})$: probability of observing $\mathbf{y}$ given $\mathbf{X}$, independent of $\theta$. $P(\theta \mid \mathbf{X}, \mathbf{y})$: a posteriori ("posterior") probability that $\theta$ is the correct parameter given the observations $\mathbf{y}, \mathbf{X}$.
Bayesian Model of Learning
Maximum-likelihood (ML) model: $\theta_{ML} = \arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta)$ (maximizes the likelihood).
Maximum-a-posteriori (MAP) model: $\theta_{MAP} = \arg\max_\theta P(\theta \mid \mathbf{y}, \mathbf{X}) = \arg\max_\theta \frac{P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)}{P(\mathbf{y} \mid \mathbf{X})} = \arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)$ (maximizes the posterior): posterior $\propto$ likelihood $\times$ prior.
Bayesian Model of Learning
Most likely value $y^*$ for a new input $\mathbf{x}^*$ (Bayes-optimal decision): $y^* = \arg\max_y P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$
Predictive distribution: $P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \int P(y, \theta \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})\, d\theta = \int P(y \mid \mathbf{x}^*, \theta)\, P(\theta \mid \mathbf{y}, \mathbf{X})\, d\theta$
Bayesian model averaging: often computationally infeasible, but has a closed-form solution in some cases.
Linear Regression Models
Training data: input matrix $\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}$, target vector $\mathbf{y} = (y_1, \ldots, y_n)^T$.
Model $f: X \rightarrow Y$ with $f_\theta(\mathbf{x}) = \mathbf{x}^T \theta$.
Probabilistic Linear Regression
Assumption 1: Nature generates the parameter $\theta$ of a linear function $f_\theta(\mathbf{x}) = \mathbf{x}^T \theta$ according to $p(\theta)$.
Assumption 2: Given inputs $\mathbf{X}$, nature generates outputs $\mathbf{y}$: $y_i = f_\theta(\mathbf{x}_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$. Hence $p(y_i \mid \mathbf{x}_i, \theta) = \mathcal{N}(y_i \mid \mathbf{x}_i^T \theta, \sigma^2)$ and $p(\mathbf{y} \mid \mathbf{X}, \theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\theta, \sigma^2 \mathbf{I})$.
In reality, we have $\mathbf{y}, \mathbf{X}$ and want to make inferences about $\theta$.
Maximum Likelihood for Linear Regression
Maximum-likelihood (ML) model (training instances are independent):
$\theta_{ML} = \arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta) = \arg\max_\theta \prod_{i=1}^n \mathcal{N}(y_i \mid \mathbf{x}_i^T \theta, \sigma^2)$
$= \arg\max_\theta \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} (y_i - \mathbf{x}_i^T \theta)^2\right)$
$= \arg\min_\theta \sum_{i=1}^n (y_i - \mathbf{x}_i^T \theta)^2$
The log is a monotonic transformation ($\log e^x = x$), so $\arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta) = \arg\max_\theta \log P(\mathbf{y} \mid \mathbf{X}, \theta)$; terms that are constant in $\theta$ can be dropped. This is unregularized linear regression with squared loss.
Maximum Likelihood for Linear Regression
Maximum-likelihood (ML) model: $\theta_{ML} = \arg\min_\theta \sum_{i=1}^n (y_i - \mathbf{x}_i^T \theta)^2$
Known as the least-squares method in statistics. Setting the derivative to zero gives the closed-form solution $\theta_{ML} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$. Inversion of $\mathbf{X}^T \mathbf{X}$ is numerically unstable.
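The closed-form ML solution can be sketched in a few lines of NumPy. The data below is purely illustrative; `np.linalg.lstsq` solves the least-squares problem directly, which avoids explicitly inverting $\mathbf{X}^T \mathbf{X}$:

```python
import numpy as np

# Hypothetical toy data: n = 50 instances, m = 3 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=50)

# theta_ML = argmin_theta ||y - X theta||^2; lstsq solves this without
# forming (X^T X)^{-1}, which is the numerically unstable part.
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_ml)
```

With low noise, the estimate recovers the generating parameter closely.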
Maximum Likelihood for Linear Regression
The maximum-likelihood model is based only on the data; it is independent of any prior knowledge or domain assumptions. Calculating the maximum-likelihood model is numerically unstable. Regularized least squares works much better in practice.
MAP for Linear Regression
Maximum-a-posteriori (MAP) model (training instances are independent):
$\theta_{MAP} = \arg\max_\theta P(\theta \mid \mathbf{y}, \mathbf{X}) = \arg\max_\theta \frac{P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)}{P(\mathbf{y} \mid \mathbf{X})} = \arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)$
$= \arg\max_\theta \log P(\mathbf{y} \mid \mathbf{X}, \theta) + \log P(\theta) = \arg\max_\theta \sum_{i=1}^n \log P(y_i \mid \mathbf{x}_i, \theta) + \log P(\theta)$
$= \arg\max_\theta \sum_{i=1}^n \log \mathcal{N}(y_i \mid \mathbf{x}_i^T \theta, \sigma^2) + \log P(\theta)$
MAP for Linear Regression: Prior
Nature generates the parameter $\theta$ of the linear model $f_\theta(\mathbf{x}) = \mathbf{x}^T \theta$ according to $p(\theta)$. For convenience, assume
$p(\theta) = \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I}) = \frac{1}{(2\pi)^{m/2} \sigma_p^m} \exp\left(-\frac{1}{2\sigma_p^2} \theta^T \theta\right)$
$\sigma_p^2$ controls the strength of the prior.
MAP for Linear Regression
Maximum-a-posteriori (MAP) model:
$\theta_{MAP} = \arg\max_\theta \sum_{i=1}^n \log \mathcal{N}(y_i \mid \mathbf{x}_i^T \theta, \sigma^2) + \log P(\theta)$
$= \arg\max_\theta \sum_{i=1}^n \left( \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} (y_i - \mathbf{x}_i^T \theta)^2 \right) + \log \frac{1}{(2\pi\sigma_p^2)^{m/2}} - \frac{1}{2\sigma_p^2} \theta^T \theta$
$= \arg\min_\theta \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mathbf{x}_i^T \theta)^2 + \frac{1}{2\sigma_p^2} \theta^T \theta = \arg\min_\theta \sum_{i=1}^n (y_i - \mathbf{x}_i^T \theta)^2 + \underbrace{\frac{\sigma^2}{\sigma_p^2}}_{\lambda} \theta^T \theta$
All terms that are constant in $\theta$ can be dropped. This is $\ell_2$-regularized linear regression with squared loss (ridge regression).
MAP for Linear Regression
Maximum-a-posteriori (MAP) model: $\theta_{MAP} = \arg\min_\theta \sum_{i=1}^n (y_i - \mathbf{x}_i^T \theta)^2 + \frac{\sigma^2}{\sigma_p^2} \theta^T \theta$, with $\lambda = \frac{\sigma^2}{\sigma_p^2}$.
Same optimization criterion as ridge regression. Analytic solution (see lecture on ridge regression): $\theta_{MAP} = \left(\mathbf{X}^T \mathbf{X} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}\right)^{-1} \mathbf{X}^T \mathbf{y}$
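The analytic MAP solution is a one-liner; a minimal sketch (the hyperparameter defaults are illustrative, and `np.linalg.solve` is used instead of forming the matrix inverse):

```python
import numpy as np

def theta_map(X, y, sigma=0.5, sigma_p=1.0):
    """MAP estimate for linear regression with Gaussian noise (variance
    sigma^2) and Gaussian prior N(0, sigma_p^2 I), i.e. ridge regression
    with lambda = sigma^2 / sigma_p^2."""
    lam = sigma**2 / sigma_p**2
    m = X.shape[1]
    # Solve (X^T X + lam I) theta = X^T y rather than inverting the matrix.
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
```

Increasing `sigma_p` weakens the prior and moves the solution toward the ML estimate; decreasing it shrinks $\theta$ toward zero.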
Posterior for Linear Regression
Posterior distribution of $\theta$ given $\mathbf{y}, \mathbf{X}$:
$P(\theta \mid \mathbf{y}, \mathbf{X}) = \frac{P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)}{P(\mathbf{y} \mid \mathbf{X})} = \frac{1}{Z} P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta) = \frac{1}{Z} \mathcal{N}(\mathbf{y} \mid \mathbf{X}\theta, \sigma^2 \mathbf{I})\, \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I})$
The normal distribution is conjugate to itself: the product of two normal densities is proportional to a normal density.
Posterior for Linear Regression
$P(\theta \mid \mathbf{y}, \mathbf{X}) = \frac{1}{Z} \mathcal{N}(\mathbf{y} \mid \mathbf{X}\theta, \sigma^2 \mathbf{I})\, \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I}) = \mathcal{N}(\theta \mid \bar{\mu}, \mathbf{A}^{-1})$
with $\bar{\mu} = \left(\mathbf{X}^T \mathbf{X} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}\right)^{-1} \mathbf{X}^T \mathbf{y}$ and $\mathbf{A} = \sigma^{-2} \mathbf{X}^T \mathbf{X} + \sigma_p^{-2} \mathbf{I}$.
The mean of the posterior is the MAP solution: $\bar{\mu} = \theta_{MAP}$.
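The posterior mean and covariance can be computed directly from these formulas; a minimal sketch with illustrative hyperparameter defaults:

```python
import numpy as np

def posterior(X, y, sigma=0.5, sigma_p=1.0):
    """Posterior N(theta | mu_bar, A^{-1}) for Bayesian linear regression
    with noise variance sigma^2 and prior N(0, sigma_p^2 I)."""
    m = X.shape[1]
    A = X.T @ X / sigma**2 + np.eye(m) / sigma_p**2   # posterior precision
    mu_bar = np.linalg.solve(A, X.T @ y) / sigma**2   # posterior mean = theta_MAP
    return mu_bar, np.linalg.inv(A)                   # mean, covariance
```

Note that $\bar{\mu} = \sigma^{-2} \mathbf{A}^{-1} \mathbf{X}^T \mathbf{y}$, which is algebraically the same as the ridge-regression form of $\theta_{MAP}$ above.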
Example: MAP Solution for Regression
Training data: $\mathbf{x}_1 = (2, 3, 0)^T$, $y_1 = 2$; $\mathbf{x}_2 = (4, 3, 2)^T$, $y_2 = 3$; $\mathbf{x}_3 = (0, 1, 2)^T$, $y_3 = 4$.
Matrix notation (adding constant attribute $x_0 = 1$):
$\mathbf{X} = \begin{pmatrix} 1 & 2 & 3 & 0 \\ 1 & 4 & 3 & 2 \\ 1 & 0 & 1 & 2 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix}$
Example: MAP Solution for Regression
Choose the variance of the prior $\sigma_p = 1$ and noise parameter $\sigma = 0.5$, so $\frac{\sigma^2}{\sigma_p^2} = 0.25$. Compute:
$\theta_{MAP} = \left(\mathbf{X}^T \mathbf{X} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}\right)^{-1} \mathbf{X}^T \mathbf{y} \approx \begin{pmatrix} 0.7975 \\ -0.5598 \\ 0.7543 \\ 1.1217 \end{pmatrix}$
Example: MAP Solution for Regression
Predictions of the model on the training data:
$\hat{\mathbf{y}} = \mathbf{X}\, \theta_{MAP} = \begin{pmatrix} 1 & 2 & 3 & 0 \\ 1 & 4 & 3 & 2 \\ 1 & 0 & 1 & 2 \end{pmatrix} \begin{pmatrix} 0.7975 \\ -0.5598 \\ 0.7543 \\ 1.1217 \end{pmatrix} \approx \begin{pmatrix} 1.9408 \\ 3.0646 \\ 3.7952 \end{pmatrix}$
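The worked example can be reproduced numerically in a few lines:

```python
import numpy as np

# Training data from the example; constant attribute x_0 = 1 already added.
X = np.array([[1., 2., 3., 0.],
              [1., 4., 3., 2.],
              [1., 0., 1., 2.]])
y = np.array([2., 3., 4.])

lam = 0.25  # sigma^2 / sigma_p^2 with sigma = 0.5, sigma_p = 1
theta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
y_hat = X @ theta
print(np.round(theta, 4))   # MAP parameter vector
print(np.round(y_hat, 4))   # predictions on the training data
```

Because there are four parameters and only three training points, the fit is nearly exact; the regularizer $\lambda = 0.25$ causes the small residuals.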
Posterior and Regularized Loss Function
MAP model:
$\theta_{MAP} = \arg\max_\theta P(\theta \mid \mathbf{y}, \mathbf{X}) = \arg\max_\theta \frac{P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)}{P(\mathbf{y} \mid \mathbf{X})} = \arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta)$
$= \arg\max_\theta \sum_{i=1}^n \log P(y_i \mid \mathbf{x}_i, \theta) + \log P(\theta)$
$= \arg\min_\theta \underbrace{\sum_{i=1}^n (y_i - \mathbf{x}_i^T \theta)^2}_{\text{loss function (likelihood)}} + \underbrace{\frac{\sigma^2}{\sigma_p^2} \theta^T \theta}_{\text{regularizer (prior)}}$
Sequential Learning
Training examples arrive sequentially. Each training example $(\mathbf{x}_i, y_i)$ changes the prior $p_{i-1}(\theta)$ into the posterior $p_{i-1}(\theta \mid y_i, \mathbf{x}_i)$, which becomes the new prior $p_i(\theta)$:
$P(\theta \mid \mathbf{y}, \mathbf{X}) = \frac{1}{Z} P_0(\theta)\, P(\mathbf{y} \mid \mathbf{X}, \theta) = \frac{1}{Z} P_0(\theta) \prod_{i=1}^n P(y_i \mid \mathbf{x}_i, \theta)$
$= \frac{1}{Z} P_0(\theta)\, P(y_1 \mid \mathbf{x}_1, \theta)\, P(y_2 \mid \mathbf{x}_2, \theta) \cdots P(y_n \mid \mathbf{x}_n, \theta)$, where $P_1(\theta) \propto P_0(\theta)\, P(y_1 \mid \mathbf{x}_1, \theta)$, $P_2(\theta) \propto P_1(\theta)\, P(y_2 \mid \mathbf{x}_2, \theta)$, and so on.
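One sequential update step can be sketched as follows for the linear-Gaussian model (the noise level default is illustrative); processing the examples one at a time reproduces the batch posterior exactly:

```python
import numpy as np

def sequential_update(mu, Sigma, x, y, sigma=0.5):
    """Combine the current Gaussian prior N(mu, Sigma) over theta with one
    observation (x, y) of the linear model y = x^T theta + noise, noise
    variance sigma^2, yielding the Gaussian posterior."""
    P = np.linalg.inv(Sigma)                  # prior precision
    P_new = P + np.outer(x, x) / sigma**2     # posterior precision
    Sigma_new = np.linalg.inv(P_new)
    mu_new = Sigma_new @ (P @ mu + x * y / sigma**2)
    return mu_new, Sigma_new
```

Each call plays the role of one arrow in the chain $P_0(\theta) \rightarrow P_1(\theta) \rightarrow \cdots \rightarrow P_n(\theta)$.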
Example: Sequential Learning [from Chris Bishop, Pattern Recognition and Machine Learning]
$f_\theta(x) = \theta_0 + \theta_1 x$ (one-dimensional regression). Sequential update: start with the prior $P_0(\theta)$; models are sampled from $P_0(\theta)$.
Example: Sequential Learning
$f_\theta(x) = \theta_0 + \theta_1 x$ (one-dimensional regression). Sequential update:
Instance $(x_1, y_1)$: likelihood $P(y_1 \mid x_1, \theta)$; posterior $P_1(\theta) \propto P_0(\theta)\, P(y_1 \mid x_1, \theta)$; sample models from $P_1(\theta)$.
Instance $(x_2, y_2)$: likelihood $P(y_2 \mid x_2, \theta)$; posterior $P_2(\theta) \propto P_1(\theta)\, P(y_2 \mid x_2, \theta)$; sample models from $P_2(\theta)$.
Instance $(x_n, y_n)$: likelihood $P(y_n \mid x_n, \theta)$; posterior $P_n(\theta) \propto P_{n-1}(\theta)\, P(y_n \mid x_n, \theta)$; sample models from $P_n(\theta)$.
Learning and Prediction
So far, we have always separated learning from prediction.
Learning: $\theta^* = \arg\min_\theta R(\mathbf{y}, \mathbf{X}, \theta) + \Omega(\theta)$. Prediction: $y^* = f_{\theta^*}(\mathbf{x}^*)$.
For instance, in MAP linear regression, learning is $\theta_{MAP} = \arg\max_\theta P(\theta \mid \mathbf{y}, \mathbf{X})$, and prediction is $y^* = \theta_{MAP}^T \mathbf{x}^*$.
Learning and Prediction
So far, we have always separated learning from prediction, and there are good reasons to do this: learning can require processing massive amounts of training data, which can take a long time, while predictions may have to be made in real time.
However, sometimes, when relatively little data is available and an accurate prediction is worth waiting for, one can directly search for the best possible prediction: $y^* = \arg\max_y P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$, the most likely $y$ for a new input $\mathbf{x}^*$ given training data $\mathbf{y}, \mathbf{X}$.
Bayes-Optimal Prediction
Bayes-optimal decision: the most likely value $y^*$ for a new input $\mathbf{x}^*$ is $y^* = \arg\max_y P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$.
Predictive distribution for a new input $\mathbf{x}^*$ given the training data:
$P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \int P(y, \theta \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})\, d\theta$ (sum rule)
$= \int P(y \mid \theta, \mathbf{x}^*, \mathbf{y}, \mathbf{X})\, P(\theta \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})\, d\theta$ (product rule)
$= \int P(y \mid \mathbf{x}^*, \theta)\, P(\theta \mid \mathbf{y}, \mathbf{X})\, d\theta$ (independence assumptions from the data generation process)
Bayesian model averaging: the Bayes-optimal decision is made by a weighted average over all values of the model parameters. In general, there is no single model in the model space that always makes the Bayes-optimal decision.
Bayes-Optimal Prediction
The Bayes-optimal decision is made by a weighted average over all model parameters: $P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \int P(y \mid \mathbf{x}^*, \theta)\, P(\theta \mid \mathbf{y}, \mathbf{X})\, d\theta$
The prediction of the MAP model is only the prediction made by the single most likely model $\theta_{MAP}$: predictive distribution $P(y \mid \mathbf{x}^*, \theta_{MAP})$; most likely prediction $f_{\theta_{MAP}}(\mathbf{x}^*) = \arg\max_y P(y \mid \mathbf{x}^*, \theta_{MAP})$. The MAP model thus approximates the weighted average by its element with the highest weight.
Bayes-Optimal Prediction
The Bayes-optimal decision is made by a weighted average over all model parameters: $P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \int P(y \mid \mathbf{x}^*, \theta)\, P(\theta \mid \mathbf{y}, \mathbf{X})\, d\theta$
Integration over the space of all model parameters is generally not possible. In some cases, there is a closed-form solution; in other cases, approximate numerical integration may be possible.
Predictive Distribution for Linear Regression
$P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \int P(y \mid \mathbf{x}^*, \theta)\, P(\theta \mid \mathbf{y}, \mathbf{X})\, d\theta = \int \mathcal{N}(y \mid \theta^T \mathbf{x}^*, \sigma^2)\, \mathcal{N}(\theta \mid \bar{\mu}, \mathbf{A}^{-1})\, d\theta = \mathcal{N}(y \mid \bar{\mu}^T \mathbf{x}^*, \sigma^2 + \mathbf{x}^{*T} \mathbf{A}^{-1} \mathbf{x}^*)$
with $\bar{\mu} = \left(\mathbf{X}^T \mathbf{X} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}\right)^{-1} \mathbf{X}^T \mathbf{y}$ and $\mathbf{A} = \sigma^{-2} \mathbf{X}^T \mathbf{X} + \sigma_p^{-2} \mathbf{I}$.
Bayes-optimal prediction: $y^* = \arg\max_y P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \bar{\mu}^T \mathbf{x}^*$
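The closed-form predictive distribution is easy to compute; a minimal sketch with illustrative hyperparameter defaults, reusing the posterior quantities $\bar{\mu}$ and $\mathbf{A}$:

```python
import numpy as np

def predictive(X, y, x_star, sigma=0.5, sigma_p=1.0):
    """Mean and variance of the Gaussian predictive distribution
    N(y | mu_bar^T x*, sigma^2 + x*^T A^{-1} x*) for Bayesian linear
    regression."""
    m = X.shape[1]
    A = X.T @ X / sigma**2 + np.eye(m) / sigma_p**2   # posterior precision
    mu_bar = np.linalg.solve(A, X.T @ y) / sigma**2   # posterior mean
    mean = mu_bar @ x_star
    var = sigma**2 + x_star @ np.linalg.solve(A, x_star)
    return mean, var
```

The variance always exceeds the noise floor $\sigma^2$; the extra term $\mathbf{x}^{*T} \mathbf{A}^{-1} \mathbf{x}^*$ is the model uncertainty, which grows away from the training data.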
Predictive Distribution: Confidence Band
Bayesian regression not only yields the prediction $y^* = \bar{\mu}^T \mathbf{x}^*$, but a distribution over $y$, $\mathcal{N}(y \mid \bar{\mu}^T \mathbf{x}^*, \sigma^2 + \mathbf{x}^{*T} \mathbf{A}^{-1} \mathbf{x}^*)$, and therefore a confidence band (e.g., 95% confidence).
Linear Classification
Training data: $\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}$, $\mathbf{y} = (y_1, \ldots, y_n)^T$.
Decision function: $f_\theta(\mathbf{x}) = \mathbf{x}^T \theta$. Predictive distribution: $P(y \mid \mathbf{x}, \theta) = \sigma(\mathbf{x}^T \theta)$. Linear classifier: $y_\theta(\mathbf{x}) = \arg\max_y P(y \mid \mathbf{x}, \theta)$.
Linear Classification: Predictive Distribution
For binary classification, $y \in \{-1, +1\}$. Predictive distribution given the parameter $\theta$ of the linear model:
$P(y = +1 \mid \mathbf{x}, \theta) = \sigma(\mathbf{x}^T \theta) = \frac{1}{1 + e^{-\mathbf{x}^T \theta}}$, $\quad P(y = -1 \mid \mathbf{x}, \theta) = 1 - P(y = +1 \mid \mathbf{x}, \theta)$
The sigmoid function maps $(-\infty, +\infty) \rightarrow [0, 1]$.
Logistic Regression
For binary classification, $y \in \{-1, +1\}$. Predictive distribution given the parameter $\theta$ of the linear model:
$P(y = +1 \mid \mathbf{x}, \theta) = \sigma(\mathbf{x}^T \theta) = \frac{1}{1 + e^{-\mathbf{x}^T \theta}}$, $\quad P(y = -1 \mid \mathbf{x}, \theta) = 1 - \frac{1}{1 + e^{-\mathbf{x}^T \theta}} = \frac{1}{1 + e^{\mathbf{x}^T \theta}}$
Written jointly for both classes: $P(y \mid \mathbf{x}, \theta) = \sigma(y\, \mathbf{x}^T \theta) = \frac{1}{1 + e^{-y\, \mathbf{x}^T \theta}}$
Classification function: $y_\theta(\mathbf{x}) = \arg\max_y P(y \mid \mathbf{x}, \theta)$. Called logistic regression even though it is a classification model.
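The joint form $P(y \mid \mathbf{x}, \theta) = \sigma(y\, \mathbf{x}^T \theta)$ is a two-liner; a minimal sketch:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def p_y_given_x(y, x, theta):
    """P(y | x, theta) = sigma(y * x^T theta) for y in {-1, +1}."""
    return sigmoid(y * (x @ theta))
```

Since $\sigma(a) + \sigma(-a) = 1$, the two class probabilities automatically sum to one.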
Logistic Regression
Decision boundary: $P(y = +1 \mid \mathbf{x}, \theta) = P(y = -1 \mid \mathbf{x}, \theta) = 0.5$:
$0.5 = \sigma(\mathbf{x}^T \theta) \iff 2 = 1 + e^{-\mathbf{x}^T \theta} \iff 1 = e^{-\mathbf{x}^T \theta} \iff \mathbf{x}^T \theta = 0$
The decision boundary $\mathbf{x}^T \theta = 0$ is a hyperplane in input space.
Logistic Regression
For multi-class classification, $\Theta = (\theta_1, \ldots, \theta_k)$. Generalize the sigmoid to the softmax function:
$P(y \mid \mathbf{x}, \Theta) = \frac{e^{\mathbf{x}^T \theta_y}}{\sum_{y'} e^{\mathbf{x}^T \theta_{y'}}}$
The normalizer ensures that $\sum_y P(y \mid \mathbf{x}, \Theta) = 1$. Classification function: $y_\Theta(\mathbf{x}) = \arg\max_y P(y \mid \mathbf{x}, \Theta)$. Called multi-class logistic regression even though it is a classification model.
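The softmax can be sketched as follows (the parameter shapes are an illustrative convention: one column $\theta_y$ per class):

```python
import numpy as np

def softmax_probs(x, Theta):
    """P(y | x, Theta) for multi-class logistic regression; Theta holds one
    parameter column theta_y per class. Subtracting the maximum score
    before exponentiating keeps the computation numerically stable without
    changing the result."""
    scores = x @ Theta          # one score x^T theta_y per class
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()
```

The returned vector sums to one, and its argmax coincides with the class of the highest score $\mathbf{x}^T \theta_y$.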
Logistic Regression: ML Model
Maximum-likelihood model:
$\theta_{ML} = \arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta) = \arg\max_\theta \prod_{i=1}^n \frac{1}{1 + e^{-y_i \mathbf{x}_i^T \theta}} = \arg\min_\theta \sum_{i=1}^n \log\left(1 + e^{-y_i \mathbf{x}_i^T \theta}\right)$
No analytic solution; numeric optimization, for instance, using (stochastic) gradient descent.
Logistic Regression: ML Model
Maximum-likelihood model: $\theta_{ML} = \arg\min_\theta \sum_{i=1}^n \log\left(1 + e^{-y_i \mathbf{x}_i^T \theta}\right)$
Gradient:
$\frac{\partial}{\partial \theta} \sum_{i=1}^n \log\left(1 + e^{-y_i \mathbf{x}_i^T \theta}\right) = \sum_{i=1}^n \frac{e^{-y_i \mathbf{x}_i^T \theta}}{1 + e^{-y_i \mathbf{x}_i^T \theta}} \left(-y_i \mathbf{x}_i\right) = -\sum_{i=1}^n y_i \mathbf{x}_i \left(1 - \sigma(y_i \mathbf{x}_i^T \theta)\right)$
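Plugging this gradient into plain batch gradient descent gives a minimal training loop (the learning rate and step count are illustrative, not tuned values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logreg_ml(X, y, lr=0.1, steps=1000):
    """Maximum-likelihood logistic regression (labels in {-1, +1}) via
    batch gradient descent on the negative log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # gradient: -sum_i y_i x_i (1 - sigma(y_i x_i^T theta))
        margins = y * (X @ theta)
        grad = -(X.T @ (y * (1.0 - sigmoid(margins))))
        theta -= lr * grad
    return theta
```

On linearly separable data, the loss can be driven arbitrarily close to zero and $\|\theta\|$ keeps growing; the MAP variant below fixes this with its regularizer.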
Logistic Regression: MAP Model
Maximum-a-posteriori model with prior $P(\theta) = \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I})$:
$\theta_{MAP} = \arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta)\, P(\theta) = \arg\max_\theta \prod_{i=1}^n \frac{1}{1 + e^{-y_i \mathbf{x}_i^T \theta}}\, \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I})$
$= \arg\min_\theta \sum_{i=1}^n \log\left(1 + e^{-y_i \mathbf{x}_i^T \theta}\right) - \log \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I}) = \arg\min_\theta \sum_{i=1}^n \log\left(1 + e^{-y_i \mathbf{x}_i^T \theta}\right) + \frac{1}{2\sigma_p^2} \theta^T \theta$
No analytic solution; numeric optimization, for instance, using (stochastic) gradient descent.
Logistic Regression: MAP Model
Maximum-a-posteriori model with prior $P(\theta) = \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I})$:
$\theta_{MAP} = \arg\min_\theta \sum_{i=1}^n \log\left(1 + e^{-y_i \mathbf{x}_i^T \theta}\right) + \frac{1}{2\sigma_p^2} \theta^T \theta$
Gradient: $-\sum_{i=1}^n y_i \mathbf{x}_i \left(1 - \sigma(y_i \mathbf{x}_i^T \theta)\right) + \frac{1}{\sigma_p^2} \theta$
Bayes-Optimal Prediction for Classification
Predictive distribution given the data:
$P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \int P(y \mid \mathbf{x}^*, \theta)\, P(\theta \mid \mathbf{y}, \mathbf{X})\, d\theta = \int \frac{1}{1 + e^{-y\, \mathbf{x}^{*T} \theta}}\, P(\theta \mid \mathbf{y}, \mathbf{X})\, d\theta$
No closed-form solution for logistic regression. Possible to approximate by sampling from the posterior. Standard approximation: use only the MAP model instead of integrating over the model space.
Nonlinear Regression
Limitation of the models discussed so far: only a linear dependency between $\mathbf{x}$ and $f(\mathbf{x})$ can be expressed. Now: nonlinear models.
Feature Mappings and Kernels
Use a mapping $\phi$ to embed instances $\mathbf{x} \in X$ in a higher-dimensional feature space. A linear model in the higher-dimensional space corresponds to a nonlinear model in the input space $X$.
Representer theorem: the model $f_\theta(\mathbf{x}) = \theta^T \phi(\mathbf{x})$ has a representation $f_\alpha(\mathbf{x}) = \sum_{i=1}^n \alpha_i \underbrace{\phi(\mathbf{x}_i)^T \phi(\mathbf{x})}_{= k(\mathbf{x}_i, \mathbf{x})}$
The feature mapping $\phi(\mathbf{x})$ does not have to be computed; only the kernel function $k(\mathbf{x}_i, \mathbf{x})$ is evaluated. The feature mapping can therefore be high- or even infinite-dimensional.
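A common concrete choice is the Gaussian (RBF) kernel, whose feature space is infinite-dimensional; a minimal sketch (the bandwidth parameter `gamma` is an illustrative choice):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) between the rows
    of A and B. Evaluates inner products phi(a)^T phi(b) in an
    infinite-dimensional feature space without computing phi."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negatives
```

The resulting kernel matrix is symmetric, has ones on the diagonal, and all entries in $(0, 1]$.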
Generalized Linear Regression (Finite-Dimensional Case)
Assumption 1: Nature generates the parameter $\theta$ of a linear function $f_\theta(\mathbf{x}) = \phi(\mathbf{x})^T \theta$ according to $p(\theta) = \mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I})$.
Assumption 2: Inputs are $\mathbf{X}$ with feature representation $\Phi$; row $i$ of $\Phi$ is the row vector $\phi(\mathbf{x}_i)^T$. Nature generates outputs $y_i = f_\theta(\mathbf{x}_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$: $p(y_i \mid \mathbf{x}_i, \theta) = \mathcal{N}(y_i \mid \phi(\mathbf{x}_i)^T \theta, \sigma^2)$, $p(\mathbf{y} \mid \mathbf{X}, \theta) = \mathcal{N}(\mathbf{y} \mid \Phi\theta, \sigma^2 \mathbf{I})$.
Generalized Linear Regression
Generalized linear model: $f_\theta(\mathbf{x}) = \phi(\mathbf{x})^T \theta$, $\bar{\mathbf{y}} = \Phi\theta$. The parameter $\theta$ is governed by the normal distribution $\mathcal{N}(\theta \mid \mathbf{0}, \sigma_p^2 \mathbf{I})$; therefore the output vector $\bar{\mathbf{y}}$ is also normally distributed:
Mean: $E[\bar{\mathbf{y}}] = E[\Phi\theta] = \Phi E[\theta] = \mathbf{0}$. Covariance: $E[\bar{\mathbf{y}}\bar{\mathbf{y}}^T] = \Phi E[\theta\theta^T] \Phi^T = \sigma_p^2 \Phi\Phi^T = \sigma_p^2 \mathbf{K}$, with $K_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)$.
Generalized Linear Regression (General Case)
Data generation assumptions: Given inputs $\mathbf{X}$, nature generates target values $\bar{\mathbf{y}} \sim \mathcal{N}(\bar{\mathbf{y}} \mid \mathbf{0}, \sigma_p^2 \mathbf{K})$. Then, nature generates observations $y_i = \bar{y}_i + \varepsilon_i$ with noise $\varepsilon_i \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$.
Bayesian inference: determine the predictive distribution $P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$ for a new test instance.
Reminder: Linear Regression
Bayes-optimal prediction: $y^* = \arg\max_y P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \bar{\mu}^T \mathbf{x}^*$ with $\bar{\mu} = \left(\mathbf{X}^T \mathbf{X} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}\right)^{-1} \mathbf{X}^T \mathbf{y}$.
Number of parameters $\theta_j$ = number of attributes in $\mathbf{x}$.
Generalized Linear Regression
The mean of the predictive distribution $P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$ has the form
$y^* = \sum_{i=1}^n \alpha_i k(\mathbf{x}_i, \mathbf{x}^*)$ with $\alpha = \left(\Phi\Phi^T + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}\right)^{-1} \mathbf{y} = \left(\mathbf{K} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}\right)^{-1} \mathbf{y}$
Number of parameters $\alpha_i$ = number of training instances.
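The kernelized predictive mean can be sketched in a few lines. The Gaussian kernel bandwidth and the hyperparameter defaults are illustrative choices, not values from the lecture:

```python
import numpy as np

def rbf(a, b, ell=0.1):
    # Gaussian kernel with illustrative length scale ell
    return np.exp(-(a - b)**2 / (2 * ell**2))

def gp_predict(x_train, y_train, x_star, sigma=0.1, sigma_p=1.0):
    """Mean of the predictive distribution for kernelized (GP-style)
    one-dimensional regression: y* = sum_i alpha_i k(x_i, x*),
    with alpha = (K + sigma^2/sigma_p^2 I)^{-1} y."""
    K = rbf(x_train[:, None], x_train[None, :])
    reg = (sigma**2 / sigma_p**2) * np.eye(len(x_train))
    alpha = np.linalg.solve(K + reg, y_train)
    k_star = rbf(x_train, x_star)    # kernel values k(x_i, x*)
    return k_star @ alpha
```

Only kernel evaluations appear; the (here infinite-dimensional) feature mapping $\phi$ is never computed.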
Example: Nonlinear Regression
Generating nonlinear data by $y_i = \sin(2\pi x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2)$, $x_i \in [0, 1]$. Nonlinear kernel $k(x, x') = \exp(-\|x - x'\|^2)$.
How do the predictive distribution and the posterior over models look?
Predictive Distribution
Figure: predictive distribution $f(x)$ for $N = 1, 2, 4,$ and $25$ data points; true function $y = \sin(2\pi x)$.
Models Sampled from the Posterior
Figure: models sampled from the posterior for $N = 1, 2, 4,$ and $25$ data points.
Summary
Linear regression: maximum-likelihood model $\arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta)$; maximum-a-posteriori model $\arg\max_\theta P(\theta \mid \mathbf{y}, \mathbf{X})$; posterior distribution over models $P(\theta \mid \mathbf{y}, \mathbf{X})$; Bayesian prediction, predictive distribution $\arg\max_y P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$.
Linear classification (logistic regression): predictive distribution $P(y \mid \mathbf{x}, \theta)$; maximum-likelihood model $\arg\max_\theta P(\mathbf{y} \mid \mathbf{X}, \theta)$; maximum-a-posteriori model $\arg\max_\theta P(\theta \mid \mathbf{y}, \mathbf{X})$; Bayesian prediction $\arg\max_y P(y \mid \mathbf{x}^*, \mathbf{y}, \mathbf{X})$.
Nonlinear models: Gaussian processes.