Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr


Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr

Remember: Normal Distribution Distribution over x ∈ ℝ. Density function with parameters µ (mean) and σ² (variance): N(x | µ, σ²) = 1/√(2πσ²) exp(−(x − µ)²/(2σ²)). [Figure: density of the normal distribution] 2

Remember: Multivariate Normal Distribution Distribution over vectors x ∈ ℝ^D. Density function with parameters µ ∈ ℝ^D (mean vector) and Σ ∈ ℝ^(D×D) (covariance matrix): N(x | µ, Σ) = 1/((2π)^(D/2) |Σ|^(1/2)) exp(−½ (x − µ)^T Σ⁻¹ (x − µ)). Example D = 2: [Figure: density and a sample from the distribution] 3
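As a quick check of this density formula, here is a minimal NumPy sketch; the two-dimensional mean, covariance, and evaluation point are made-up illustration values:

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density N(x | mu, Sigma) of a D-dimensional normal distribution."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Example with D = 2 (parameters chosen arbitrarily for illustration)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_density(np.array([0.3, -0.2]), mu, Sigma))

# Drawing samples from the distribution, as in the example plot
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=5)
```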

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 4

Statistics & Machine Learning Machine learning: tightly related to inductive statistics. Two areas in Statistics: Descriptive Statistics: description and examination of the properties of data. Mean values Variances Difference between Populations Inductive Statistics: What conclusions can be drawn from data about the underlying reality? Explanations for observations Model building Relationships and patterns in the data 5

Frequentist vs. Bayesian Probabilities Frequentist probabilities describe the possibility of an occurrence of an intrinsically stochastic event (e.g., a coin toss). Defined as limits of relative frequencies of possible outcomes in a repeatable experiment. If one throws a fair coin 1000 times, it will land on heads about 500 times. In 1 gram of Potassium-40, around 260,000 nuclei decay per second. 6

Frequentist vs. Bayesian Probabilities Bayesian subjective probabilities Here, the uncertainty is attributed to a lack of information. How likely is it that suspect X killed the victim? New information (e.g., fingerprints) can change these subjective probabilities. The Bayesian view is more important in machine learning. Frequentist and Bayesian perspectives are mathematically equivalent; in Bayesian statistics, probabilities are just used to model different things (lack of information). 7

Bayesian Statistics Thomas Bayes, 1702-1761. An Essay towards solving a Problem in the Doctrine of Chances, published in 1764. The work of Bayes laid the foundations for inductive statistics. Bayesian probabilities offer an important perspective on uncertainty and probability. 8

Bayesian Probability in Machine Learning Model building: find an explanation for observations. What is the most likely model? Trade-off between Prior knowledge (a priori distribution over models), Evidence (data, observations). Bayesian Perspective: Evidence (data) changes the subjective probability for models (explanation), A posteriori model probability, MAP hypothesis. 9

Bayesian Model of Learning Nature conducts a random experiment and determines θ. A system with parameter θ generates observations y = f_θ(X). Bayesian inference inverts this process: Given these assumptions about how y is generated, and given the observed values of y for input matrix X, what is the most likely true value of θ? What is the most likely value y for a new input x? [Diagram: nature randomly determines θ according to P(θ) → system with parameter θ → observations y for input X] 10

Bayesian Model of Learning [Diagram: nature randomly determines θ according to P(θ) → system with parameter θ → observations y for input X] Likelihood P(y | X, θ): probability of observing y given X when the model parameter is θ. A priori ("prior") probability P(θ) of nature choosing θ. Bayes equation: P(θ | X, y) = P(y | X, θ) P(θ) / P(y | X). P(y | X): probability of observing y given X; independent of θ. P(θ | X, y): a posteriori ("posterior") probability that θ is the correct parameter given observations y, X. 11

Bayesian Model of Learning Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) (the likelihood). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) (the a posteriori, "posterior", distribution). 12

Bayesian Model of Learning Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ). Posterior ∝ likelihood × prior. 13

Bayesian Model of Learning Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ). Most likely value y* for a new input x* (Bayes-optimal decision): y* = arg max_y P(y | x*, y, X). 14

Bayesian Model of Learning Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ). Most likely value y* for a new input x* (Bayes-optimal decision): y* = arg max_y P(y | x*, y, X), with predictive distribution P(y | x*, y, X) = ∫ P(y, θ | x*, y, X) dθ = ∫ P(y | x*, θ) P(θ | y, X) dθ (Bayesian model averaging). Often computationally infeasible, but has a closed-form solution in some cases. 15

Linear Regression Models Training data: X = (x_11 … x_1m; …; x_n1 … x_nm), y = (y_1, …, y_n)^T. Model f_θ: X → Y with f_θ(x) = x^T θ. 16

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 17

Probabilistic Linear Regression Assumption 1: Nature generates the parameter θ of a linear function f_θ(x) = x^T θ according to p(θ). [Figure: linear function f_θ(x_i) = x_i^T θ evaluated at an input x_i] 18

Probabilistic Linear Regression Assumption 1: Nature generates the parameter θ of a linear function f_θ(x) = x^T θ according to p(θ). Assumption 2: Given inputs X, nature generates outputs y: y_i = f_θ(x_i) + ε_i with ε_i ~ N(ε | 0, σ²). Hence p(y_i | x_i, θ) = N(y_i | x_i^T θ, σ²) and p(y | X, θ) = N(y | Xθ, σ²I). In reality, we have y, X and want to make inferences about θ. 19

Maximum Likelihood for Linear Regression Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1..n} N(y_i | x_i^T θ, σ²), because training instances are independent. 20

Maximum Likelihood for Linear Regression Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1..n} N(y_i | x_i^T θ, σ²) = arg max_θ ∏_{i=1..n} 1/√(2πσ²) exp(−(y_i − x_i^T θ)²/(2σ²)). 21

Maximum Likelihood for Linear Regression Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1..n} N(y_i | x_i^T θ, σ²) = arg max_θ ∏_{i=1..n} 1/√(2πσ²) exp(−(y_i − x_i^T θ)²/(2σ²)). Log is a monotonic transformation: arg max_θ P(y | X, θ) = arg max_θ log P(y | X, θ). Also, terms that are constant in θ can be dropped. 22

Maximum Likelihood for Linear Regression Maximum-likelihood (ML) model (using log e^x = x): θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1..n} N(y_i | x_i^T θ, σ²) = arg max_θ ∏_{i=1..n} 1/√(2πσ²) exp(−(y_i − x_i^T θ)²/(2σ²)) = arg min_θ Σ_{i=1..n} (y_i − x_i^T θ)². This is unregularized linear regression with squared loss. Log is a monotonic transformation; terms that are constant in θ can be dropped. 23

Maximum Likelihood for Linear Regression Maximum-likelihood (ML) model: θ_ML = arg min_θ Σ_{i=1..n} (y_i − x_i^T θ)². Known as the least-squares method in statistics. Setting the derivative to zero gives the closed-form solution θ_ML = (X^T X)⁻¹ X^T y. Inversion of X^T X can be numerically unstable. 24
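A minimal NumPy sketch of the least-squares/ML solution on made-up toy data; solving the normal equations (or using an SVD-based least-squares routine) avoids forming the explicit inverse of X^T X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.normal(size=(n, m))                  # input matrix, one row per instance
theta_true = np.array([1.0, -2.0, 0.5])      # made-up "true" parameter
y = X @ theta_true + rng.normal(scale=0.5, size=n)   # y_i = x_i^T theta + noise

# Closed-form ML solution theta_ML = (X^T X)^{-1} X^T y,
# computed by solving the normal equations instead of inverting X^T X.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent and numerically more robust: least squares via SVD.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```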

Maximum Likelihood for Linear Regression The maximum-likelihood model is based only on the data; it is independent of any prior knowledge or domain assumptions. Calculating the maximum-likelihood model can be numerically unstable. Regularized least squares works much better in practice. 25

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 26

MAP for Linear Regression Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X). 27

MAP for Linear Regression Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ) = arg max_θ log P(y | X, θ) + log P(θ) = arg max_θ Σ_{i=1..n} log P(y_i | x_i, θ) + log P(θ) = arg max_θ Σ_{i=1..n} log N(y_i | x_i^T θ, σ²) + log P(θ), because training instances are independent. 28

MAP for Linear Regression: Prior Nature generates the parameter θ of the linear model f_θ(x) = x^T θ according to p(θ). For convenience, assume p(θ) = N(θ | 0, σ_p² I) = 1/(2πσ_p²)^(m/2) exp(−θ^T θ/(2σ_p²)). The variance σ_p² controls the strength of the prior. [Figure: prior densities p(θ) for different values of σ_p²] 29

MAP for Linear Regression Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ Σ_{i=1..n} log N(y_i | x_i^T θ, σ²) + log P(θ) = arg max_θ Σ_{i=1..n} [log 1/√(2πσ²) − (y_i − x_i^T θ)²/(2σ²)] + log 1/(2πσ_p²)^(m/2) − θ^T θ/(2σ_p²). All terms that are constant in θ can be dropped. 30

MAP for Linear Regression Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ Σ_{i=1..n} log N(y_i | x_i^T θ, σ²) + log P(θ) = arg max_θ Σ_{i=1..n} [log 1/√(2πσ²) − (y_i − x_i^T θ)²/(2σ²)] + log 1/(2πσ_p²)^(m/2) − θ^T θ/(2σ_p²) = arg min_θ Σ_{i=1..n} (y_i − x_i^T θ)²/(2σ²) + θ^T θ/(2σ_p²) = arg min_θ Σ_{i=1..n} (y_i − x_i^T θ)² + (σ²/σ_p²) θ^T θ, with λ = σ²/σ_p². This is l2-regularized linear regression with squared loss (ridge regression). 31

MAP for Linear Regression Maximum-a-posteriori (MAP) model: θ_MAP = arg min_θ Σ_{i=1..n} (y_i − x_i^T θ)² + (σ²/σ_p²) θ^T θ, with λ = σ²/σ_p². Same optimization criterion as ridge regression. Analytic solution (see lecture on ridge regression): θ_MAP = (X^T X + (σ²/σ_p²) I)⁻¹ X^T y. 32

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 33

Posterior for Linear Regression Posterior distribution of θ given y, X: P(θ | y, X) = P(y | X, θ) P(θ) / P(y | X) = (1/Z) P(y | X, θ) P(θ). 34

Posterior for Linear Regression Posterior distribution of θ given y, X: P(θ | y, X) = P(y | X, θ) P(θ) / P(y | X) = (1/Z) P(y | X, θ) P(θ) = (1/Z) N(y | Xθ, σ²I) N(θ | 0, σ_p² I). The normal distribution is conjugate to itself: as a function of θ, the product N(y | Xθ, σ²I) N(θ | 0, σ_p² I) is again proportional to a normal distribution. 35

Posterior for Linear Regression Posterior distribution of θ given y, X: P(θ | y, X) = P(y | X, θ) P(θ) / P(y | X) = (1/Z) N(y | Xθ, σ²I) N(θ | 0, σ_p² I) = N(θ | θ̄, A⁻¹), with θ̄ = (X^T X + (σ²/σ_p²) I)⁻¹ X^T y and A = σ⁻² X^T X + σ_p⁻² I (the posterior covariance is A⁻¹). The mean θ̄ of the posterior equals the MAP model θ_MAP. 36

Example: MAP solution, regression Training data: x_1 = (2, 3, 0)^T, x_2 = (4, 3, 2)^T, x_3 = (0, 1, 2)^T, y_1 = 2, y_2 = 3, y_3 = 4. Matrix notation (adding constant attribute x_0 = 1): X = (1 2 3 0; 1 4 3 2; 1 0 1 2), y = (2, 3, 4)^T. 37

Example: MAP solution, regression Choose variance of prior σ_p = 1 and noise parameter σ = 0.5, so σ²/σ_p² = 0.25. Compute θ̄ = (X^T X + (σ²/σ_p²) I)⁻¹ X^T y = (X^T X + 0.25 I)⁻¹ X^T y ≈ (0.7975, −0.5598, 0.7543, 1.1217)^T. 38

Example: MAP solution, regression Predictions of the model on the training data: ŷ = Xθ̄ ≈ (1.9408, 3.0646, 3.7952)^T. 39
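The worked example can be reproduced with a few lines of NumPy; this sketch simply plugs the slide's data into the closed-form MAP solution:

```python
import numpy as np

# Training data with constant attribute x_0 = 1 prepended (as on the slide).
X = np.array([[1., 2., 3., 0.],
              [1., 4., 3., 2.],
              [1., 0., 1., 2.]])
y = np.array([2., 3., 4.])

sigma_p = 1.0                    # standard deviation of the prior
sigma = 0.5                      # noise standard deviation
lam = sigma**2 / sigma_p**2      # = 0.25

# MAP solution theta_MAP = (X^T X + lam I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(theta_map)      # approx. [ 0.7975 -0.5598  0.7543  1.1217]
print(X @ theta_map)  # approx. [ 1.9408  3.0646  3.7952]
```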

Posterior and Regularized Loss Function MAP model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ) = arg max_θ log P(y | X, θ) + log P(θ) = arg max_θ Σ_{i=1..n} log P(y_i | x_i, θ) + log P(θ) = arg min_θ Σ_{i=1..n} (y_i − x_i^T θ)² + (σ²/σ_p²) θ^T θ. Likelihood ↔ loss function; prior ↔ regularizer. 40

Sequential Learning Training examples arrive sequentially. Each training example (x_i, y_i) changes the prior P_{i−1}(θ) into the posterior P_{i−1}(θ | y_i, x_i), which becomes the new prior P_i(θ): P(θ | y, X) = (1/Z) P_0(θ) P(y | X, θ) = (1/Z) P_0(θ) ∏_{i=1..n} P(y_i | x_i, θ) = (1/Z) P_0(θ) P(y_1 | x_1, θ) P(y_2 | x_2, θ) P(y_3 | x_3, θ) … P(y_n | x_n, θ), with intermediate posteriors P_1(θ), P_2(θ), P_3(θ), … 41

Example: Sequential Learning [from Chris Bishop, Pattern Recognition and Machine Learning] f_θ(x) = θ_0 + θ_1 x (one-dimensional regression). Sequential update: prior P_0(θ). [Figure: prior P_0(θ) over (θ_0, θ_1) and functions sampled from P_0(θ)] 42

Example: Sequential Learning f_θ(x) = θ_0 + θ_1 x (one-dimensional regression). Sequential update: prior P_0(θ), first instance (x_1, y_1). [Figure: likelihood P(y_1 | x_1, θ) and prior P_0(θ)] 43

Example: Sequential Learning f_θ(x) = θ_0 + θ_1 x (one-dimensional regression). Sequential update: P_1(θ) ∝ P_0(θ) P(y_1 | x_1, θ). [Figure: likelihood P(y_1 | x_1, θ), posterior P_1(θ), and functions sampled from P_1(θ)] 44

Example: Sequential Learning f_θ(x) = θ_0 + θ_1 x (one-dimensional regression). Sequential update: P_2(θ) ∝ P_1(θ) P(y_2 | x_2, θ). [Figure: likelihood P(y_2 | x_2, θ), posterior P_2(θ), and functions sampled from P_2(θ)] 45

Example: Sequential Learning f_θ(x) = θ_0 + θ_1 x (one-dimensional regression). Sequential update: P_n(θ) ∝ P_{n−1}(θ) P(y_n | x_n, θ). [Figure: likelihood P(y_n | x_n, θ), posterior P_n(θ), and functions sampled from P_n(θ)] 46
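A minimal sketch of this sequential updating for Bayesian linear regression with the conjugate Gaussian prior; the noise level, prior width, and true parameters are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2       # noise standard deviation (assumed for the sketch)
sigma_p = 1.0     # prior standard deviation (assumed for the sketch)

theta_true = np.array([-0.3, 0.5])   # f_theta(x) = theta_0 + theta_1 x

# Prior P_0(theta) = N(theta | 0, sigma_p^2 I), represented by mean and covariance
mean = np.zeros(2)
cov = sigma_p**2 * np.eye(2)

for _ in range(20):
    x = rng.uniform(-1, 1)
    phi = np.array([1.0, x])                       # features (1, x)
    y = theta_true @ phi + rng.normal(scale=sigma)

    # One conjugate Gaussian update: the old posterior acts as the new prior.
    prec_old = np.linalg.inv(cov)
    prec_new = prec_old + np.outer(phi, phi) / sigma**2
    cov = np.linalg.inv(prec_new)
    mean = cov @ (prec_old @ mean + phi * y / sigma**2)

print(mean)   # approaches theta_true as more examples arrive
samples = rng.multivariate_normal(mean, cov, size=5)   # models sampled from P_n(theta)
```

After all n updates, mean and cov coincide with the batch posterior N(θ | θ̄, A⁻¹) from the previous slides.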

Learning and Prediction So far, we have always separated learning from prediction. Learning: θ̂ = arg min_θ R(y, X, θ) + Ω(θ) (loss plus regularizer). Prediction: y* = f_θ̂(x*). For instance, in MAP linear regression, learning is θ_MAP = arg max_θ P(θ | y, X), and prediction is y* = θ_MAP^T x*. 47

Learning and Prediction So far, we have always separated learning from prediction, and there are good reasons to do this: Learning can require processing massive amounts of training data, which can take a long time. Predictions may have to be made in real time. However, sometimes, when relatively few data are available and an accurate prediction is worth waiting for, one can directly search for the best possible prediction: y* = arg max_y P(y | x*, y, X), the most likely y for a new input x* given the training data y, X. 48

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 49

Bayes-Optimal Prediction Bayes-optimal decision: most likely value y* for a new input x*: y* = arg max_y P(y | x*, y, X). Predictive distribution for a new input x* given the training data: P(y | x*, y, X) = ∫ P(y, θ | x*, y, X) dθ (sum rule) = ∫ P(y | θ, x*, y, X) P(θ | x*, y, X) dθ (product rule) = ∫ P(y | x*, θ) P(θ | y, X) dθ (independence assumptions from the data generation process). This is Bayesian model averaging: the Bayes-optimal decision is made by a weighted sum over all values of the model parameters. In general, there is no single model in the model space that always makes the Bayes-optimal decision. 50

Bayes-Optimal Prediction The Bayes-optimal decision is made by a weighted sum over all model parameters: P(y | x*, y, X) = ∫ P(y | x*, θ) P(θ | y, X) dθ. The prediction of the MAP model is only the prediction made by the single most likely model θ_MAP: predictive distribution P(y | x*, θ_MAP), most likely prediction f_MAP(x*) = arg max_y P(y | x*, θ_MAP). The MAP model θ_MAP is an approximation of this weighted sum by its element with the highest weight. 51

Bayes-Optimal Prediction The Bayes-optimal decision is made by a weighted sum over all model parameters: P(y | x*, y, X) = ∫ P(y | x*, θ) P(θ | y, X) dθ. Integration over the space of all model parameters is generally not tractable. In some cases, there is a closed-form solution. In other cases, approximate numerical integration may be possible. 52

Predictive Distribution for Linear Regression Predictive distribution for linear regression: P(y | x*, y, X) = ∫ P(y | x*, θ) P(θ | y, X) dθ = ∫ N(y | x*^T θ, σ²) N(θ | θ̄, A⁻¹) dθ = N(y | θ̄^T x*, σ² + x*^T A⁻¹ x*), with θ̄ = (X^T X + (σ²/σ_p²) I)⁻¹ X^T y and A = σ⁻² X^T X + σ_p⁻² I. Bayes-optimal prediction: y* = arg max_y P(y | x*, y, X) = θ̄^T x*. 53

Predictive Distribution: Confidence Band Bayesian regression not only yields the prediction y* = θ̄^T x*, but a distribution N(y | θ̄^T x*, σ² + x*^T A⁻¹ x*) over y and therefore a confidence band. [Figure: predictive mean y* = θ̄^T x* with, e.g., a 95% confidence band] 54
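A minimal NumPy sketch of the predictive mean, predictive variance, and an approximate 95% interval, on made-up toy data and with assumed values for σ and σ_p:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, sigma_p = 0.5, 1.0             # noise and prior std. dev. (assumed)
n, m = 30, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])   # (1, x) features
theta_true = np.array([0.3, -1.2])    # made-up "true" parameter
y = X @ theta_true + rng.normal(scale=sigma, size=n)

# Posterior N(theta | theta_bar, A^{-1}) with
#   A = X^T X / sigma^2 + I / sigma_p^2
#   theta_bar = (X^T X + (sigma^2/sigma_p^2) I)^{-1} X^T y
A = X.T @ X / sigma**2 + np.eye(m) / sigma_p**2
theta_bar = np.linalg.solve(X.T @ X + (sigma**2 / sigma_p**2) * np.eye(m), X.T @ y)

# Predictive distribution for a new input x*: N(y | theta_bar^T x*, sigma^2 + x*^T A^{-1} x*)
x_star = np.array([1.0, 0.7])
mean_star = theta_bar @ x_star
var_star = sigma**2 + x_star @ np.linalg.solve(A, x_star)
lo, hi = mean_star - 1.96 * np.sqrt(var_star), mean_star + 1.96 * np.sqrt(var_star)
print(mean_star, (lo, hi))            # prediction with approx. 95% confidence interval
```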

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 55

Linear Classification Training data: X = (x_11 … x_1m; …; x_n1 … x_nm), y = (y_1, …, y_n)^T. Decision function: f_θ(x) = x^T θ. Predictive distribution: P(y | x, θ) = σ(x^T θ). Linear classifier: y_θ(x) = arg max_y P(y | x, θ). 56

Linear Classification: Predictive Distribution For binary classification, y ∈ {−1, +1}. Predictive distribution given the parameters θ of the linear model: P(y = +1 | x, θ) = σ(x^T θ) = 1/(1 + e^(−x^T θ)), P(y = −1 | x, θ) = 1 − P(y = +1 | x, θ). The sigmoid function maps (−∞, +∞) to [0, 1]. [Figure: sigmoid p(y = +1 | x, θ) as a function of x^T θ] 57

Logistic Regression For binary classification, y ∈ {−1, +1}. Predictive distribution given the parameters θ of the linear model: P(y = +1 | x, θ) = σ(x^T θ) = 1/(1 + e^(−x^T θ)), P(y = −1 | x, θ) = 1 − 1/(1 + e^(−x^T θ)) = 1/(1 + e^(x^T θ)). Written jointly for both classes: P(y | x, θ) = σ(y x^T θ) = 1/(1 + e^(−y x^T θ)). Classification function: y_θ(x) = arg max_y P(y | x, θ). Called logistic regression even though it is a classification model. 58

Logistic Regression Decision boundary: P(y = +1 | x, θ) = P(y = −1 | x, θ) = 0.5. 0.5 = σ(x^T θ) ⟺ 2 = 1 + e^(−x^T θ) ⟺ 1 = e^(−x^T θ) ⟺ 0 = x^T θ. The decision boundary is a hyperplane in input space. [Figure: hyperplane x^T θ = 0 and p(y = +1 | x, θ)] 59

Logistic Regression For multi-class classification: θ = (θ_1, …, θ_k). Generalize the sigmoid function to the softmax function: P(y | x, θ) = e^(x^T θ_y) / Σ_{y'} e^(x^T θ_{y'}). Classification function: y_θ(x) = arg max_y P(y | x, θ). The normalizer ensures that Σ_y P(y | x, θ) = 1. Called multi-class logistic regression even though it is a classification model. 60
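A small sketch of the softmax predictive distribution for k classes; the weight vectors and the input are arbitrary illustration values:

```python
import numpy as np

def softmax_predict(x, Theta):
    """P(y | x, Theta) with one weight vector Theta[k] per class."""
    scores = Theta @ x                 # x^T theta_y for every class y
    scores -= scores.max()             # numerical stability; does not change the result
    p = np.exp(scores)
    return p / p.sum()                 # normalizer makes the probabilities sum to 1

Theta = np.array([[ 1.0, -0.5],        # theta_1, ..., theta_k (arbitrary values)
                  [ 0.2,  0.8],
                  [-1.0,  0.3]])
x = np.array([0.5, 1.5])
p = softmax_predict(x, Theta)
print(p, p.argmax())                   # class probabilities and arg-max prediction
```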

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 61

Logistic Regression: ML Model Maximum-likelihood model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1..n} 1/(1 + e^(−y_i x_i^T θ)) = arg min_θ −Σ_{i=1..n} log 1/(1 + e^(−y_i x_i^T θ)) = arg min_θ Σ_{i=1..n} log(1 + e^(−y_i x_i^T θ)). No analytic solution; numeric optimization, for instance, using (stochastic) gradient descent. 62

Logistic Regression: ML Model Maximum-likelihood model: θ_ML = arg min_θ Σ_{i=1..n} log(1 + e^(−y_i x_i^T θ)). Gradient: ∂/∂θ Σ_{i=1..n} log(1 + e^(−y_i x_i^T θ)) = Σ_{i=1..n} [e^(−y_i x_i^T θ) / (1 + e^(−y_i x_i^T θ))] (−y_i x_i) = −Σ_{i=1..n} y_i x_i (1 − σ(y_i x_i^T θ)). 63
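A minimal gradient-descent sketch for the maximum-likelihood logistic regression model using the gradient above; the toy data, step size, and iteration count are illustrative choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, m = 200, 2
X = rng.normal(size=(n, m))
theta_true = np.array([2.0, -1.0])     # made-up "true" parameter
y = np.where(rng.uniform(size=n) < sigmoid(X @ theta_true), 1, -1)   # labels in {-1, +1}

theta = np.zeros(m)
eta = 0.1                              # step size (arbitrary choice)
for _ in range(500):
    margins = y * (X @ theta)
    # Gradient of sum_i log(1 + exp(-y_i x_i^T theta)):
    #   -sum_i y_i x_i (1 - sigma(y_i x_i^T theta))
    grad = -(X * (y * (1 - sigmoid(margins)))[:, None]).sum(axis=0)
    theta -= eta * grad / n            # gradient descent step

print(theta)                           # estimate of theta_ML
```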

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 64

Logistic Regression: MAP Model Maximum-a-posteriori model with prior P(θ) = N(θ | 0, σ_p² I): θ_MAP = arg max_θ P(y | X, θ) P(θ) = arg max_θ ∏_{i=1..n} 1/(1 + e^(−y_i x_i^T θ)) · N(θ | 0, σ_p² I) = arg min_θ −Σ_{i=1..n} log 1/(1 + e^(−y_i x_i^T θ)) − log N(θ | 0, σ_p² I) = arg min_θ Σ_{i=1..n} log(1 + e^(−y_i x_i^T θ)) + θ^T θ/(2σ_p²). No analytic solution; numeric optimization, for instance, using (stochastic) gradient descent. 65

Logistic Regression: MAP Model Maximum-a-posteriori model with prior P(θ) = N(θ | 0, σ_p² I): θ_MAP = arg min_θ Σ_{i=1..n} log(1 + e^(−y_i x_i^T θ)) + θ^T θ/(2σ_p²). Gradient: ∂/∂θ [Σ_{i=1..n} log(1 + e^(−y_i x_i^T θ)) + θ^T θ/(2σ_p²)] = −Σ_{i=1..n} y_i x_i (1 − σ(y_i x_i^T θ)) + θ/σ_p². 66

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 67

Bayes-Optimal Prediction for Classification Predictive distribution given the data: P(y | x*, y, X) = ∫ P(y | θ, x*) P(θ | y, X) dθ = ∫ 1/(1 + e^(−y x*^T θ)) P(θ | y, X) dθ, with posterior P(θ | y, X) ∝ P(y | X, θ) N(θ | 0, σ_p² I). No closed-form solution for logistic regression. Possible to approximate by sampling from the posterior. Standard approximation: use only the MAP model instead of integrating over the model space. 68
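A sketch of the sampling-based approximation mentioned above. It assumes posterior samples are available; here they are drawn from a hypothetical Gaussian approximation of the posterior (e.g., a Laplace approximation around θ_MAP, with made-up numbers), purely to illustrate the averaging step:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Hypothetical Gaussian approximation of the posterior P(theta | y, X),
# e.g. from a Laplace approximation around theta_MAP (values made up).
theta_map = np.array([1.2, -0.7])
post_cov = np.array([[0.05, 0.01],
                     [0.01, 0.08]])

x_star = np.array([0.5, 1.0])

# Monte Carlo estimate of P(y = +1 | x*, y, X) = E_theta[ sigmoid(x*^T theta) ]
thetas = rng.multivariate_normal(theta_map, post_cov, size=5000)
p_plus = sigmoid(thetas @ x_star).mean()

# MAP "plug-in" approximation for comparison: use only the single most likely model.
p_plus_map = sigmoid(theta_map @ x_star)
print(p_plus, p_plus_map)
```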

Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 69

Nonlinear Regression Limitation of the models discussed so far: only a linear dependency between x and f_θ(x). [Figure: linear model vs. nonlinear model] Now: nonlinear models. 70

Feature Mappings and Kernels Use a mapping φ to embed instances x ∈ X in a higher-dimensional feature space. Find a linear model in the higher-dimensional space; it corresponds to a nonlinear model in the input space X. Representer theorem: the model f_θ(x) = θ^T φ(x) has a representation f_α(x) = Σ_{i=1..n} α_i φ(x_i)^T φ(x) = Σ_{i=1..n} α_i k(x_i, x). The feature mapping φ(x) does not have to be computed; only the kernel function k(x_i, x) is evaluated. The feature mapping φ(x) can therefore be high- or even infinite-dimensional. 71
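To make the kernel idea concrete, here is a tiny sketch of an RBF (Gaussian) kernel, the kernel matrix K with K_ij = k(x_i, x_j), and a kernelized prediction f_α(x); the kernel choice, bandwidth, and coefficients are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2): an inner product in an implicit feature space."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

X = np.array([[0.0], [0.5], [1.0]])          # three one-dimensional training inputs
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])   # kernel (Gram) matrix

# A kernelized model f_alpha(x) = sum_i alpha_i k(x_i, x) never needs phi(x) explicitly.
alpha = np.array([0.2, -0.1, 0.4])           # illustrative coefficients
x_new = np.array([0.3])
f_value = sum(a * rbf_kernel(xi, x_new) for a, xi in zip(alpha, X))
```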

Generalized Linear Regression (Finite-Dimensional Case) Assumption 1: Nature generates the parameter θ of a linear function f_θ(x) = φ(x)^T θ according to p(θ) = N(θ | 0, σ_p² I). Assumption 2: Inputs are X with feature representation Φ; line i of Φ contains the row vector φ(x_i)^T. Nature generates outputs y: y_i = f_θ(x_i) + ε_i with ε_i ~ N(ε | 0, σ²). Hence p(y_i | x_i, θ) = N(y_i | φ(x_i)^T θ, σ²) and p(y | X, θ) = N(y | Φθ, σ²I). 72

Generalized Linear Regression Generalized linear model: f_θ(x) = φ(x)^T θ, y = Φθ. The parameter θ is governed by the normal distribution N(θ | 0, σ_p² I). Therefore the output vector y is also normally distributed. Mean value: E[y] = E[Φθ] = Φ E[θ] = 0. Covariance: E[yy^T] = Φ E[θθ^T] Φ^T = σ_p² ΦΦ^T = σ_p² K, with K_ij = φ(x_i)^T φ(x_j) = k(x_i, x_j). Remember: ΦΦ^T, built from the rows φ(x_1)^T, …, φ(x_n)^T, is the kernel matrix K. 73

Generalized Linear Regression (General Case) Data generation assumptions: Given inputs X, nature generates noise-free target values ỹ ~ N(ỹ | 0, σ_p² K). Then, nature generates observations y_i = ỹ_i + ε_i with noise ε_i ~ N(ε | 0, σ²). Bayesian inference: determine the predictive distribution P(y | x*, y, X) for a new test instance. 74

Reminder: Linear Regression Bayes-optimal prediction: y* = arg max_y P(y | x*, y, X) = θ̄^T x*, with θ̄ = (X^T X + (σ²/σ_p²) I)⁻¹ X^T y. Number of parameters θ_j = number of attributes of x. 75

Generalized Linear Regression The mean of the predictive distribution P(y | x*, y, X) has the form y* = Σ_{i=1..n} α_i k(x_i, x*), with α = (ΦΦ^T + (σ²/σ_p²) I)⁻¹ y = (K + (σ²/σ_p²) I)⁻¹ y. Number of parameters α_i = number of training instances. 76

Example: nonlinear regression Generate nonlinear data by y_i = sin(2π x_i) + ε_i with ε_i ~ N(ε | 0, σ²) and x_i ∈ [0, 1]. Nonlinear kernel: k(x, x′) = exp(−‖x − x′‖²). What do the predictive distribution and the posterior over models look like? 77
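A minimal sketch of kernel regression in the spirit of this example: data from sin(2πx) plus noise, an RBF kernel, and the coefficient formula α = (K + (σ²/σ_p²) I)⁻¹ y from the previous slide; the kernel bandwidth and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, sigma_p = 0.1, 1.0                     # noise and prior scale (assumed)

def rbf_kernel(A, B, gamma=10.0):
    """Kernel matrix with entries exp(-gamma * (a - b)^2) for one-dimensional inputs."""
    return np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)

# Training data: y = sin(2 pi x) + noise, x in [0, 1]
n = 25
x_train = rng.uniform(0, 1, size=n)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=sigma, size=n)

# Coefficients alpha = (K + (sigma^2 / sigma_p^2) I)^{-1} y
K = rbf_kernel(x_train, x_train)
alpha = np.linalg.solve(K + (sigma**2 / sigma_p**2) * np.eye(n), y_train)

# Mean of the predictive distribution at new inputs: y* = sum_i alpha_i k(x_i, x*)
x_test = np.linspace(0, 1, 100)
y_pred = rbf_kernel(x_test, x_train) @ alpha
```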

Predictive distribution [Figure: predictive distribution f(x) with confidence band for N = 1, 2, 4, 25 data points; true function y = sin(2πx)] 78

Models sampled from posterior [Figure: functions sampled from the posterior for N = 1, 2, 4, 25 data points] 79

Summary Linear regression: maximum-likelihood model arg max_θ P(y | X, θ), maximum-a-posteriori model arg max_θ P(θ | y, X), posterior distribution over models P(θ | y, X), Bayesian prediction with predictive distribution arg max_y P(y | x*, y, X). Linear classification (logistic regression): predictive distribution P(y | x, θ), maximum-likelihood model arg max_θ P(y | X, θ), maximum-a-posteriori model arg max_θ P(θ | y, X), Bayesian prediction arg max_y P(y | x*, y, X). Nonlinear models: Gaussian processes. 80