Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart

Machine Learning
Bayesian Regression & Classification
learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural Networks
Marc Toussaint, U Stuttgart

Learning as Inference
The parametric view:
P(β | Data) = P(Data | β) P(β) / P(Data)
The function space view:
P(f | Data) = P(Data | f) P(f) / P(Data)
Today:
- Bayesian (Kernel) Ridge Regression = Gaussian Process (GP)
- Bayesian (Kernel) Logistic Regression = GP classification
- Bayesian Neural Networks (briefly)
2/24

Beyond learning about specific Bayesian learning methods: understand the relations between
- loss/error ↔ neg-log likelihood
- regularization ↔ neg-log prior
- cost (reg. + loss) ↔ neg-log posterior
3/24

Ridge regression as Bayesian inference
We have random variables X_1:n, Y_1:n, β. We observe data D = {(x_i, y_i)}_{i=1..n} and want to compute P(β | D).
Let's assume:
- P(X) is arbitrary
- P(β) is Gaussian: β ~ N(0, σ²/λ), i.e. P(β) ∝ exp{-λ/(2σ²) ||β||²}
- P(Y | X, β) is Gaussian: y = x^T β + ε, ε ~ N(0, σ²)
(Graphical model: observed x_i, y_i for i = 1:n, with latent parameter β.)
4/24
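A minimal data-generation sketch of this model (not from the slides), in Python/NumPy, assuming 1-D inputs with a bias feature and hypothetical values for σ and λ: β is drawn from the Gaussian prior and the observations follow y = x^T β + ε.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 2            # n observations, d features (assumed toy sizes)
sigma, lam = 0.5, 1.0   # noise std sigma and regularizer lambda (assumed values)

# Prior over parameters: beta ~ N(0, sigma^2/lambda * I)
beta_true = rng.normal(0.0, sigma / np.sqrt(lam), size=d)

# Arbitrary input distribution P(X): here uniform inputs plus a bias feature
X = np.column_stack([np.ones(n), rng.uniform(-2.0, 2.0, size=n)])

# Observation model: y = x^T beta + eps, eps ~ N(0, sigma^2)
y = X @ beta_true + rng.normal(0.0, sigma, size=n)
```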

Ridge regression as Bayesian inference
Bayes' Theorem:
P(β | D) = P(D | β) P(β) / P(D)
P(β | x_1:n, y_1:n) = (1/Z) Π_{i=1}^n P(y_i | β, x_i) · P(β)
P(D | β) is a product of independent likelihoods for each observation (x_i, y_i).
Using the Gaussian expressions:
P(β | D) = (1/Z') Π_{i=1}^n exp{-(y_i - x_i^T β)² / (2σ²)} · exp{-λ ||β||² / (2σ²)}
-log P(β | D) = 1/(2σ²) [ Σ_{i=1}^n (y_i - x_i^T β)² + λ ||β||² ] + log Z'
⇒ -log P(β | D) ∝ L^ridge(β)
1st insight: The neg-log posterior -log P(β | D) is, up to constants, proportional to the cost function L^ridge(β)!
5/24

Ridge regression as Bayesian inference
Let us compute P(β | D) explicitly:
P(β | D) = (1/Z) Π_{i=1}^n exp{-(y_i - x_i^T β)² / (2σ²)} · exp{-λ ||β||² / (2σ²)}
         = (1/Z) exp{-1/(2σ²) Σ_i (y_i - x_i^T β)²} · exp{-λ ||β||² / (2σ²)}
         = (1/Z) exp{-1/(2σ²) [ (y - Xβ)^T (y - Xβ) + λ β^T β ]}
         = (1/Z) exp{-1/2 [ 1/σ² y^T y + 1/σ² β^T (X^T X + λI) β - 2/σ² β^T X^T y ]}
         = N(β | β̂, Σ)
This is a Gaussian with covariance and mean
Σ = σ² (X^T X + λI)^{-1},   β̂ = 1/σ² Σ X^T y = (X^T X + λI)^{-1} X^T y
2nd insight: The mean β̂ is exactly the classical argmin_β L^ridge(β).
3rd insight: The Bayesian inference approach not only gives a mean/optimal β̂, but also a variance Σ of that estimate!
6/24
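A hedged NumPy sketch of these two formulas (continuing the toy data above, with assumed λ and σ): it computes β̂ and Σ; the posterior mean coincides with the classical ridge estimator.

```python
import numpy as np

def bayesian_ridge_posterior(X, y, lam, sigma):
    """Posterior N(beta | beta_hat, Sigma) for Bayesian ridge regression."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    beta_hat = np.linalg.solve(A, X.T @ y)   # posterior mean = classical ridge argmin
    Sigma = sigma**2 * np.linalg.inv(A)      # posterior covariance
    return beta_hat, Sigma

# beta_hat, Sigma = bayesian_ridge_posterior(X, y, lam=1.0, sigma=0.5)
```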

Predicting with an uncertain β
Suppose we want to make a prediction at x*. We can compute the predictive distribution over a new observation y* at x*:
P(y* | x*, D) = ∫_β P(y* | x*, β) P(β | D) dβ
              = ∫_β N(y* | φ(x*)^T β, σ²) N(β | β̂, Σ) dβ
              = N(y* | φ(x*)^T β̂, σ² + φ(x*)^T Σ φ(x*))
Note: P(f(x) | D) = N(f(x) | φ(x)^T β̂, φ(x)^T Σ φ(x)), without the σ².
So y* is Gaussian distributed around the mean prediction φ(x*)^T β̂.
(Figure from Bishop, p. 176.)
7/24
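A small follow-up sketch under the same assumptions: the predictive mean φ(x*)^T β̂ and variance σ² + φ(x*)^T Σ φ(x*) at a query point (drop the σ² term to get the distribution of f(x*) rather than y*).

```python
import numpy as np

def predictive(phi_star, beta_hat, Sigma, sigma):
    """Gaussian predictive distribution P(y* | x*, D) for Bayesian ridge regression."""
    mean = phi_star @ beta_hat
    var_f = phi_star @ Sigma @ phi_star   # uncertainty of f(x*)
    var_y = var_f + sigma**2              # plus observation noise for y*
    return mean, var_y

# mean, var = predictive(np.array([1.0, 0.3]), beta_hat, Sigma, sigma=0.5)
# Error bars on the prediction: mean +/- 2*np.sqrt(var)
```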

Wrapup of Bayesian Ridge regression
1st insight: The neg-log posterior -log P(β | D) is, up to constants, proportional to the cost function L^ridge(β)!
This is a very common relation: optimization costs correspond to neg-log probabilities; probabilities correspond to exp-neg costs.
2nd insight: The mean β̂ is exactly the classical argmin_β L^ridge(β).
More generally, the most likely parameter argmax_β P(β | D) is also the least-cost parameter argmin_β L(β). In the Gaussian case, mean and most-likely coincide.
3rd insight: The Bayesian inference approach not only gives a mean/optimal β̂, but also a variance Σ of that estimate!
This is a core benefit of the Bayesian view: it naturally provides a probability distribution over predictions ("error bars"), not only a single prediction.
8/24

Kernelized Bayesian Ridge Regression
As in the classical case, we can consider arbitrary features φ(x) .. or directly use a kernel k(x, x'):
P(f(x) | D) = N(f(x) | φ(x)^T β̂, φ(x)^T Σ φ(x))
φ(x)^T β̂ = φ(x)^T X^T (XX^T + λI)^{-1} y = κ(x)^T (K + λI)^{-1} y
φ(x)^T Σ φ(x) = σ² φ(x)^T (X^T X + λI)^{-1} φ(x)
              = σ²/λ φ(x)^T φ(x) - σ²/λ φ(x)^T X^T (XX^T + λI_n)^{-1} X φ(x)
              = σ²/λ k(x, x) - σ²/λ κ(x)^T (K + λI_n)^{-1} κ(x)
The expression for φ(x)^T β̂: as on slide 02:24. Last lines: Woodbury identity
(A + UBV)^{-1} = A^{-1} - A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}   with A = λI
In standard conventions λ = σ², i.e. P(β) = N(β | 0, 1).
Regularization: scale the covariance function (or features).
9/24
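A hedged sketch of the kernelized posterior (the kernel choice here, a squared-exponential, is illustrative and not prescribed by the slides): posterior mean κ(x)^T (K + λI)^{-1} y and covariance σ²/λ [k(x, x) - κ(x)^T (K + λI)^{-1} κ(x)] at test inputs.

```python
import numpy as np

def sq_exp_kernel(a, b, ell=1.0):
    """Squared-exponential kernel matrix k(a_i, b_j); an illustrative choice."""
    d2 = np.sum((a[:, None, :] - b[None, :, :])**2, axis=-1)
    return np.exp(-0.5 * d2 / ell**2)

def kernel_ridge_posterior(X, y, X_star, lam, sigma, kernel=sq_exp_kernel):
    """Posterior mean and covariance of f at X_star for kernelized Bayesian ridge regression."""
    n = len(X)
    K = kernel(X, X)                                  # K_ij = k(x_i, x_j)
    kap = kernel(X, X_star)                           # kappa(x*)_i = k(x_i, x*)
    A = K + lam * np.eye(n)
    mean = kap.T @ np.linalg.solve(A, y)              # kappa^T (K + lam I)^{-1} y
    cov = (sigma**2 / lam) * (kernel(X_star, X_star) - kap.T @ np.linalg.solve(A, kap))
    return mean, cov                                  # the diagonal of cov gives pointwise variances
```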

Kernelized Bayesian Ridge Regression is equivalent to Gaussian Processes
(see also Welling: "Kernel Ridge Regression" Lecture Notes; Rasmussen & Williams, sections 2.1 & 6.2; Bishop, sections 3.3.3 & 6)
As we have the equations already, I skip further math details. (See Rasmussen & Williams.)
10/24

Gaussian Processes
The function space view:
P(f | Data) = P(Data | f) P(f) / P(Data)
Gaussian Processes define a probability distribution over functions. A function is an infinite-dimensional thing, so how could we define a Gaussian distribution over functions?
For every finite set {x_1, .., x_M}, the function values f(x_1), .., f(x_M) are Gaussian distributed with mean and covariance
⟨f(x_i)⟩ = μ(x_i)   (often zero)
⟨[f(x_i) - μ(x_i)][f(x_j) - μ(x_j)]⟩ = k(x_i, x_j)
Here, k(·, ·) is called the covariance function.
Second, Gaussian Processes define an observation probability
P(y | x, f) = N(y | f(x), σ²)
11/24

Gaussian Processes (from Rasmussen & Williams) 12/24

GP: different covariance functions
(figures from Rasmussen & Williams)
These are examples from the γ-exponential covariance function
k(x, x') = exp{ -(|x - x'| / ℓ)^γ }
13/24
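A minimal sketch of such prior draws (not from the slides): build the covariance matrix of the γ-exponential kernel on a grid of 1-D inputs and sample function values from the resulting finite-dimensional Gaussian, as in the GP definition above; the length scale ℓ and the grid are assumed toy values.

```python
import numpy as np

def gamma_exp_kernel(x, ell=1.0, gamma=2.0):
    """Covariance matrix of the gamma-exponential kernel on 1-D inputs x."""
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-(d / ell)**gamma)

rng = np.random.default_rng(0)
xs = np.linspace(-5.0, 5.0, 200)

for gamma in (1.0, 1.5, 2.0):        # rougher to smoother sample paths
    K = gamma_exp_kernel(xs, ell=1.0, gamma=gamma)
    # Sample f(x_1), .., f(x_M) ~ N(0, K); the jitter keeps K numerically positive definite.
    f = rng.multivariate_normal(np.zeros(len(xs)), K + 1e-8 * np.eye(len(xs)))
```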

GP: derivative observations (from Rasmussen & Williams) 14/24

Bayesian Kernel Ridge Regression = Gaussian Process
GPs have become a standard regression method.
If exact GP inference is not efficient enough, many approximations exist, e.g. sparse and pseudo-input GPs.
15/24

Bayesian (Ridge) Logistic Regression 16/24

Bayesian Logistic Regression
f now defines a logistic probability over y ∈ {0, 1}:
P(X) = arbitrary
P(β) is Gaussian, P(β) ∝ exp{-λ ||β||²}
P(Y=1 | X, β) = σ(β^T φ(x))
Recall
L^logistic(β) = -Σ_{i=1}^n log p(y_i | x_i) + λ ||β||²
Again, the parameter posterior is
P(β | D) ∝ P(D | β) P(β) ∝ exp{-L^logistic(β)}
17/24

Bayesian Logistic Regression
Use the Laplace approximation (2nd-order Taylor of L) at β* = argmin_β L(β):
L(β) ≈ L(β*) + g^T δ + 1/2 δ^T H δ,   δ := β - β*,  g := ∇L(β*),  H := ∇²L(β*)
P(β | D) ∝ exp{-g^T δ - 1/2 δ^T H δ} = N[δ | -g, H] = N(δ | -H^{-1} g, H^{-1}) = N(β | β*, H^{-1})
(because g = 0 at the minimum β*; N[· | a, A] denotes a Gaussian in canonical form with precision A)
Then the predictive distribution of the discriminative function is also Gaussian!
P(f(x) | D) = ∫_β P(f(x) | β) P(β | D) dβ
            = ∫_β N(f(x) | φ(x)^T β, 0) N(β | β*, H^{-1}) dβ
            = N(f(x) | φ(x)^T β*, φ(x)^T H^{-1} φ(x)) =: N(f(x) | f*, s²)
The predictive distribution over the label y ∈ {0, 1}:
P(y(x)=1 | D) = ∫ σ(f(x)) P(f(x) | D) df(x) ≈ ϕ( (1 + s² π/8)^{-1/2} f* )
where the approximation replaces σ by the probit function ϕ(x) = ∫_{-∞}^x N(z | 0, 1) dz.
18/24
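A hedged sketch of this construction for feature-based Bayesian logistic regression (optimizer choice, data shapes, and helper names are assumptions, not from the slides): find β* by minimizing L^logistic, take H as its Hessian at β*, and evaluate the probit-approximated class probability.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def laplace_logistic(Phi, y, lam):
    """Laplace approximation N(beta | beta_star, H^{-1}) for Bayesian logistic regression."""
    n, d = Phi.shape

    def L(beta):  # L^logistic = -sum_i log p(y_i | x_i) + lam * ||beta||^2, for y in {0, 1}
        f = Phi @ beta
        return np.sum(np.logaddexp(0.0, -(2 * y - 1) * f)) + lam * beta @ beta

    beta_star = minimize(L, np.zeros(d)).x
    p = 1.0 / (1.0 + np.exp(-Phi @ beta_star))
    W = p * (1.0 - p)
    H = Phi.T @ (Phi * W[:, None]) + 2.0 * lam * np.eye(d)   # Hessian of L at beta_star
    return beta_star, H

def predict_prob(phi_star, beta_star, H):
    """Probit-approximated P(y=1 | x*, D) ~= phi((1 + s^2*pi/8)^{-1/2} f*)."""
    f_star = phi_star @ beta_star
    s2 = phi_star @ np.linalg.solve(H, phi_star)
    return norm.cdf(f_star / np.sqrt(1.0 + s2 * np.pi / 8.0))
```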

Kernelized Bayesian Logistic Regression
As with Kernel Logistic Regression, the MAP discriminative function f* can be found by iterating the Newton method, which is equivalent to iterating GP estimation on a re-weighted data set.
The rest is as above.
19/24

Kernel Bayesian Logistic Regression is equivalent to Gaussian Process Classification
GP classification has become a standard classification method whenever the prediction needs to be a meaningful probability that takes the model uncertainty into account.
20/24

Bayesian Neural Networks 21/24

General non-linear models
Above we always assumed f(x) = φ(x)^T β (or kernelized). Bayesian learning also works for non-linear function models f(x, β).
Regression case:
P(X) is arbitrary
P(β) is Gaussian: β ~ N(0, σ²/λ), i.e. P(β) ∝ exp{-λ/(2σ²) ||β||²}
P(Y | X, β) is Gaussian: y = f(x, β) + ε, ε ~ N(0, σ²)
22/24

General non-linear models
To compute P(β | D) we first compute the most likely
β* = argmin_β L(β) = argmax_β P(β | D)
Use the Laplace approximation around β*: a 2nd-order Taylor expansion of f(x, β) and then of L(β) yields a Gaussian estimate of P(β | D).
Neural Networks:
The Gaussian prior P(β) = N(β | 0, σ²/λ) is called weight decay!
This pushes sigmoids to be in the linear region.
23/24
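A rough sketch of the same recipe for a generic non-linear model f(x, β) (all specifics, the finite-difference Hessian, and the tiny example model are assumptions, not from the slides): minimize L(β) including the weight-decay term that corresponds to the Gaussian prior, then approximate P(β | D) ≈ N(β*, H^{-1}).

```python
import numpy as np
from scipy.optimize import minimize

def laplace_nonlinear(f, X, y, d, lam, sigma):
    """Laplace approximation for a non-linear regression model f(X, beta)."""
    def L(beta):  # neg-log posterior up to constants: squared error + weight decay
        resid = y - f(X, beta)
        return (resid @ resid + lam * beta @ beta) / (2.0 * sigma**2)

    beta_star = minimize(L, np.zeros(d)).x

    # Crude numerical Hessian of L at beta_star (forward finite differences).
    eps = 1e-4
    H = np.zeros((d, d))
    E = eps * np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (L(beta_star + E[i] + E[j]) - L(beta_star + E[i])
                       - L(beta_star + E[j]) + L(beta_star)) / eps**2
    return beta_star, H  # P(beta | D) is approximated by N(beta | beta_star, H^{-1})

# Hypothetical one-hidden-unit "network": f(X, b) = b[2] * tanh(b[0]*X + b[1])
# beta_star, H = laplace_nonlinear(lambda X, b: b[2]*np.tanh(b[0]*X + b[1]), X, y, d=3, lam=1.0, sigma=0.5)
```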

Conclusions
Probabilistic inference is a very powerful concept!
- Inferring about the world given data
- Learning, decision making, and reasoning can be viewed as forms of (probabilistic) inference
We introduced Bayes' Theorem as the fundamental form of probabilistic inference.
Marrying Bayes with (Kernel) Ridge (Logistic) regression yields Gaussian Processes and Gaussian Process classification.
24/24