Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)


Prof. Daniel Cremers 2. Regression (cont.)

Regression with MLE (Rep.)
Assume that $t$ is affected by Gaussian noise: $t = f(x, w) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Thus, we have
$$p(t \mid x, w, \sigma) = \mathcal{N}(t;\, f(x, w), \sigma^2)$$

Maximum A-Posteriori Estimation
So far, we searched for parameters $w$ that maximize the data likelihood. Now we assume a Gaussian prior $p(w) = \mathcal{N}(w;\, 0, \sigma_2^2 I_M)$. Using this, we can compute the posterior (Bayes' rule):
$$p(w \mid x, t) \propto p(t \mid w, x)\, p(w) \qquad \text{(posterior} \propto \text{likelihood} \times \text{prior)}$$
Maximizing this posterior gives the Maximum A-Posteriori (MAP) estimate. Strictly,
$$p(w \mid x, t, \sigma_1, \sigma_2) = \frac{p(t \mid x, w, \sigma_1)\, p(w \mid \sigma_2)}{\int p(t \mid x, w, \sigma_1)\, p(w \mid \sigma_2)\, dw}$$
but the denominator is independent of $w$, and we only want to maximize over $w$.
Taking the negative logarithm of the posterior shows that the MAP estimate corresponds to a regularized error minimization with $\lambda = (\sigma_1 / \sigma_2)^2$.

Summary: MAP Estimation
To summarize, we have the following optimization problem:
$$J(w) = \frac{1}{2}\sum_{n=1}^{N}\left(w^T\phi(x_n) - t_n\right)^2 + \frac{\lambda}{2}\, w^T w, \qquad \phi(x_n) \in \mathbb{R}^M$$
The same in vector notation:
$$J(w) = \frac{1}{2}\, w^T\Phi^T\Phi\, w - w^T\Phi^T t + \frac{1}{2}\, t^T t + \frac{\lambda}{2}\, w^T w, \qquad t \in \mathbb{R}^N,\ \Phi \in \mathbb{R}^{N\times M} \text{ (feature matrix)}$$
And the solution is
$$w = (\lambda I_M + \Phi^T\Phi)^{-1}\Phi^T t$$
where $I_M$ is the identity matrix of size $M \times M$.
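As an illustration, this closed-form MAP solution can be computed in a few lines. A minimal sketch, assuming toy sinusoidal data and Gaussian basis functions (both my own choices, not from the slides):

```python
import numpy as np

# Toy data: noisy samples of a sinusoid (illustrative choice, not from the slides)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Gaussian basis functions phi_j(x) = exp(-(x - c_j)^2 / (2 s^2))
centers = np.linspace(0.0, 1.0, 9)
s = 0.1
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))  # N x M feature matrix

# MAP / regularized least squares: w = (lambda*I_M + Phi^T Phi)^{-1} Phi^T t
lam = 0.1
M = Phi.shape[1]
w_map = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_map)
```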

MLE and MAP
The benefit of MAP over MLE is that the prediction is less sensitive to overfitting: even with only little data, the model predicts well. This is achieved by using prior information, i.e. model assumptions that are not based on any observations (= data).
But: both methods only give the most likely model; there is no notion of uncertainty yet.
Idea 1: Find a distribution over model parameters (the parameter posterior).
Idea 2: Use that distribution to estimate prediction uncertainty (the predictive distribution).

When Bayes Meets Gauß
Theorem: If we are given
I. $p(x) = \mathcal{N}(x \mid \mu, \Sigma_1)$
II. $p(y \mid x) = \mathcal{N}(y \mid Ax + b, \Sigma_2)$ (linear dependency on $x$)
then it follows (properties of Gaussians):
III. $p(y) = \mathcal{N}(y \mid A\mu + b, \Sigma_2 + A\Sigma_1 A^T)$
IV. $p(x \mid y) = \mathcal{N}(x \mid \Sigma(A^T\Sigma_2^{-1}(y - b) + \Sigma_1^{-1}\mu), \Sigma)$, where $\Sigma = (\Sigma_1^{-1} + A^T\Sigma_2^{-1}A)^{-1}$
This is the Linear Gaussian Model. See Bishop's book for the proof!

When Bayes Meets Gauß
Thus: when using the Bayesian approach, we can do even more than MLE and MAP by using these formulae. This means: if the prior and the likelihood are Gaussian, then the posterior and the normalizer are also Gaussian, and we can compute them in closed form. This gives us a natural way to compute uncertainty!

The Posterior Distribution
Remember Bayes' rule:
$$p(w \mid x, t) \propto p(t \mid w, x)\, p(w) \qquad \text{(posterior} \propto \text{likelihood} \times \text{prior)}$$
With our theorem, we can compute the posterior in closed form (and not just its maximum)! The posterior is also a Gaussian, and its mean is the MAP solution.

The Posterior Distribution
We have
$$p(w) = \mathcal{N}(w;\, 0, \sigma_2^2 I_M) \quad \text{and} \quad p(t \mid w, x) = \mathcal{N}(t;\, \Phi w, \sigma_1^2 I_N)$$
From this and IV. we get the posterior covariance
$$\Sigma = \left(\sigma_2^{-2} I_M + \sigma_1^{-2}\,\Phi^T\Phi\right)^{-1} = \sigma_1^2\left(\frac{\sigma_1^2}{\sigma_2^2} I_M + \Phi^T\Phi\right)^{-1}$$
and the mean
$$\mu = \sigma_1^{-2}\, \Sigma\, \Phi^T t$$
So the entire posterior distribution is $p(w \mid t, x) = \mathcal{N}(w;\, \mu, \Sigma)$.

The Predictive Distribution
We obtain the predictive distribution by integrating over all possible model parameters:
$$p(t^* \mid x^*, x, t) = \int \underbrace{p(t^* \mid x^*, w)}_{\text{new data likelihood}}\ \underbrace{p(w \mid x, t)}_{\text{parameter posterior}}\, dw$$
This distribution can be computed in closed form, because both terms on the RHS are Gaussian. From above we have $p(w \mid t, x) = \mathcal{N}(w;\, \mu, \Sigma)$, where
$$\mu = \sigma_1^{-2}\, \Sigma\, \Phi^T t \quad \text{and} \quad \Sigma = \sigma_1^2\left(\frac{\sigma_1^2}{\sigma_2^2} I_M + \Phi^T\Phi\right)^{-1}$$
Note that $\mu = (\lambda I_M + \Phi^T\Phi)^{-1}\Phi^T t = w_{\text{MAP}}$, i.e. the posterior mean is the MAP solution.

The Predictive Distribution
Using formula III. from above (linear Gaussian model),
$$p(t^* \mid x^*, x, t) = \int \mathcal{N}(t^*;\, \phi(x^*)^T w, \sigma_1^2)\, \mathcal{N}(w;\, \mu, \Sigma)\, dw = \mathcal{N}(t^*;\, \phi(x^*)^T\mu, \sigma_N^2(x^*))$$
where
$$\sigma_N^2(x^*) = \sigma_1^2 + \phi(x^*)^T \Sigma\, \phi(x^*)$$
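These formulae translate directly into code. A sketch, assuming toy sinusoidal data, Gaussian basis functions, and illustrative values for the noise and prior standard deviations $\sigma_1$ and $\sigma_2$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Gaussian basis functions (illustrative choice)
centers = np.linspace(0.0, 1.0, 9)
s = 0.1
def features(x):
    return np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

Phi = features(x)                       # N x M feature matrix
sigma1, sigma2 = 0.2, 1.0               # noise std and prior std (assumed values)

# Posterior: Sigma = (sigma2^-2 I + sigma1^-2 Phi^T Phi)^-1, mu = sigma1^-2 Sigma Phi^T t
M = Phi.shape[1]
Sigma = np.linalg.inv(np.eye(M) / sigma2 ** 2 + Phi.T @ Phi / sigma1 ** 2)
mu = (Sigma @ Phi.T @ t) / sigma1 ** 2  # equals the MAP solution

# Predictive distribution at a new input x*
x_star = 0.5
phi_star = features(x_star)[0]
mean_star = phi_star @ mu
var_star = sigma1 ** 2 + phi_star @ Sigma @ phi_star
print(mean_star, var_star)
```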

The Predictive Distribution: Examples
Sinusoidal data, 9 Gaussian basis functions, with 1, 2, 4, and 25 data points, respectively. [Figures from C.M. Bishop: each shows the predictive distribution and some samples from the posterior.]

Summary
Regression can be expressed as a least-squares problem.
To avoid overfitting, we need to introduce a regularisation term with an additional parameter λ.
Regression without regularisation is equivalent to Maximum Likelihood Estimation; regression with regularisation is Maximum A-Posteriori estimation.
When using Gaussian priors (and Gaussian noise), all computations can be done analytically.
This gives a closed form of the parameter posterior and the predictive distribution.

Prof. Daniel Cremers 3. Kernel Methods

Motivation
Usually, learning algorithms assume that some kind of feature function is given; reasoning is then done on a feature vector of a given (finite) length.
But: some objects are hard to represent with a fixed-size feature vector, e.g. text documents, molecular structures, or evolutionary trees.
Idea: use a way of measuring similarity without the need for features, e.g. the edit distance for strings. This is what we will call a kernel function.

Dual Representation
Many problems can be expressed using a dual formulation. Example (linear regression):
$$J(w) = \frac{1}{2}\sum_{n=1}^{N}\left(w^T\phi(x_n) - t_n\right)^2 + \frac{\lambda}{2}\, w^T w, \qquad \phi(x_n) \in \mathbb{R}^M$$
If we write this in vector form, we get
$$J(w) = \frac{1}{2}\, w^T\Phi^T\Phi\, w - w^T\Phi^T t + \frac{1}{2}\, t^T t + \frac{\lambda}{2}\, w^T w, \qquad t \in \mathbb{R}^N,\ \Phi \in \mathbb{R}^{N\times M}$$
and the solution is
$$w = (\Phi^T\Phi + \lambda I_M)^{-1}\Phi^T t$$
However, we can express this result in a different way using the matrix inversion lemma
$$(A + BCD)^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1}$$
which gives
$$w = \Phi^T(\Phi\Phi^T + \lambda I_N)^{-1} t =: \Phi^T a$$
where $a$ are the dual variables. Plugging $w = \Phi^T a$ into $J(w)$ gives
$$J(a) = \frac{1}{2}\, a^T\Phi\Phi^T\Phi\Phi^T a - a^T\Phi\Phi^T t + \frac{1}{2}\, t^T t + \frac{\lambda}{2}\, a^T\Phi\Phi^T a, \qquad K := \Phi\Phi^T$$
This is called the dual formulation. Note that $a \in \mathbb{R}^N$ while $w \in \mathbb{R}^M$. The solution to the dual problem is
$$a = (K + \lambda I_N)^{-1} t$$
This we can use to make predictions:
$$f(x^*) = w^T\phi(x^*) = a^T\Phi\,\phi(x^*) = k(x^*)^T(K + \lambda I_N)^{-1} t$$
(now $x^*$ is unknown and $a$ is given from training)

Dual Representation
$$f(x^*) = k(x^*)^T(K + \lambda I_N)^{-1} t$$
where
$$k(x^*) = \begin{pmatrix} \phi(x_1)^T\phi(x^*) \\ \vdots \\ \phi(x_N)^T\phi(x^*) \end{pmatrix}, \qquad K = \begin{pmatrix} \phi(x_1)^T\phi(x_1) & \cdots & \phi(x_1)^T\phi(x_N) \\ \vdots & \ddots & \vdots \\ \phi(x_N)^T\phi(x_1) & \cdots & \phi(x_N)^T\phi(x_N) \end{pmatrix}$$
Thus, $f$ is expressed only in terms of dot products between different pairs of $\phi(x)$, i.e. in terms of the kernel function $k(x_i, x_j) = \phi(x_i)^T\phi(x_j)$.

Representation using the Kernel
$$f(x^*) = k(x^*)^T(K + \lambda I_N)^{-1} t$$
Now we have to invert a matrix of size $N \times N$; before, it was $M \times M$ with $M < N$. But: by expressing everything with the kernel function, we can deal with very high-dimensional or even infinite-dimensional feature spaces!
Idea: don't use features at all, but simply define a similarity function directly as the kernel!
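A minimal sketch of prediction in this dual (kernelized) form, assuming a Gaussian kernel and toy 1-D data (both illustrative choices):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.2):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

lam = 1e-2
K = gaussian_kernel(x, x)                          # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(x)), t)   # dual variables a = (K + lambda I_N)^-1 t

x_star = np.linspace(0.0, 1.0, 5)
k_star = gaussian_kernel(x, x_star)                # k(x*)_n = k(x_n, x*)
f_star = k_star.T @ a                              # f(x*) = k(x*)^T (K + lambda I_N)^-1 t
print(f_star)
```

Note that neither training nor prediction ever constructs a feature vector; only kernel evaluations are needed.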

Constructing Kernels
The straightforward way to define a kernel function is to first find a basis function $\phi(x)$ and to define
$$k(x_i, x_j) = \phi(x_i)^T\phi(x_j)$$
This means $k$ is an inner product in some space $\mathcal{H}$, i.e.:
1. Symmetry: $k(x_i, x_j) = \langle \phi(x_j), \phi(x_i)\rangle = \langle \phi(x_i), \phi(x_j)\rangle$
2. Linearity: $\langle a(\phi(x_i) + z), \phi(x_j)\rangle = a\langle \phi(x_i), \phi(x_j)\rangle + a\langle z, \phi(x_j)\rangle$
3. Positive definiteness: $\langle \phi(x_i), \phi(x_i)\rangle \ge 0$, with equality iff $\phi(x_i) = 0$
Can we find conditions for $k$ under which there is a (possibly infinite-dimensional) basis function into a space $\mathcal{H}$, in which $k$ is an inner product?

Constructing Kernels
Theorem (Mercer): If $k$ is
1. symmetric, i.e. $k(x_i, x_j) = k(x_j, x_i)$, and
2. positive definite, i.e. the Gram matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N) \end{pmatrix}$$
is positive definite, then there exists a mapping $\phi(x)$ into a feature space $\mathcal{H}$ so that $k$ can be expressed as an inner product in $\mathcal{H}$.
This means we don't need to find $\phi(x)$ explicitly! We can directly work with $k$. (Kernel Trick)

Constructing Kernels
Finding valid kernels from scratch is hard, but a number of rules exist to create a new valid kernel $k$ from given kernels $k_1$ and $k_2$. For example (following Bishop):
$$c\, k_1(x, x') \ (c > 0), \quad k_1(x, x') + k_2(x, x'), \quad k_1(x, x')\, k_2(x, x'), \quad \exp(k_1(x, x')), \quad x^T A x'$$
where $A$ is positive semidefinite and symmetric.

Examples of Valid Kernels
Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + c)^d$, with $c > 0$, $d \in \mathbb{N}$
Gaussian kernel: $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
Kernel for sets: $k(A_1, A_2) = 2^{|A_1 \cap A_2|}$
Matérn kernel: $k(r) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}\, r}{l}\right)$, where $r = \|x_i - x_j\|$, $\nu > 0$, $l > 0$, and $K_\nu$ is a modified Bessel function
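A sketch of the polynomial and Gaussian kernels above, together with a numerical check (via eigenvalues) that their Gram matrices are positive semidefinite, as Mercer's condition requires; the sample points are arbitrary:

```python
import numpy as np

def polynomial_kernel(a, b, c=1.0, d=3):
    return (a @ b + c) ** d

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))               # arbitrary sample points in R^2

for kernel in (polynomial_kernel, gaussian_kernel):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
    eigvals = np.linalg.eigvalsh(K)
    print(kernel.__name__, "min eigenvalue:", eigvals.min())   # >= 0 (up to numerical error)
```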

A Simple Example
Define a kernel function as $k(x, x') = (x^T x')^2$ with $x, x' \in \mathbb{R}^2$. This can be written as:
$$(x_1 x'_1 + x_2 x'_2)^2 = x_1^2 x_1'^2 + 2 x_1 x'_1 x_2 x'_2 + x_2^2 x_2'^2 = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)(x_1'^2, x_2'^2, \sqrt{2}\, x'_1 x'_2)^T = \phi(x)^T\phi(x')$$
It can be shown that this holds in general for $k(x_i, x_j) = (x_i^T x_j)^d$.
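A quick numerical check of this identity (the two test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, x') = (x^T x')^2 in R^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, x_prime = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print((x @ x_prime) ** 2, phi(x) @ phi(x_prime))   # both give the same value
```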

Visualization of the Example
[Figure: in the original input space the decision boundary is an ellipse; in the feature space it becomes a hyperplane.]

Application Examples
Kernel methods can be applied to many different problems, e.g.: density estimation (unsupervised learning), regression, Principal Component Analysis (PCA), and classification.
The most important kernel methods are Support Vector Machines and Gaussian Processes.

Kernelization
Many existing algorithms can be converted into kernel methods; this process is called kernelization.
Idea: express similarities of data points in terms of an inner product (dot product), then replace all occurrences of that inner product by the kernel function. This is called the kernel trick.

Example: Nearest Neighbor
The NN classifier selects the label of the nearest neighbor in Euclidean distance:
$$\|x_i - x_j\|^2 = x_i^T x_i + x_j^T x_j - 2\, x_i^T x_j$$
We can now replace the dot products by a valid Mercer kernel and obtain
$$d(x_i, x_j)^2 = k(x_i, x_i) + k(x_j, x_j) - 2\, k(x_i, x_j)$$
This is a kernelized nearest-neighbor classifier; we do not explicitly compute feature vectors!
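A minimal sketch of such a kernelized 1-NN classifier; the Gaussian kernel and the tiny labeled dataset are assumptions chosen for illustration:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_nn_predict(x_query, X_train, y_train, k=gaussian_kernel):
    # d(x, x_i)^2 = k(x, x) + k(x_i, x_i) - 2 k(x, x_i)
    d2 = [k(x_query, x_query) + k(xi, xi) - 2 * k(x_query, xi) for xi in X_train]
    return y_train[int(np.argmin(d2))]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
y_train = np.array([0, 0, 1])
print(kernel_nn_predict(np.array([3.5, 3.8]), X_train, y_train))   # -> 1
```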

Back to Linear Regression (Rep.)
We had the primal and the dual formulation:
$$J(w) = \frac{1}{2}\, w^T\Phi^T\Phi\, w - w^T\Phi^T t + \frac{1}{2}\, t^T t + \frac{\lambda}{2}\, w^T w$$
with the dual solution $a = (K + \lambda I_N)^{-1} t$. This we can use to make predictions (MAP):
$$f(x^*) = w^T\phi(x^*) = a^T\Phi\,\phi(x^*) = k(x^*)^T(K + \lambda I_N)^{-1} t$$

Observations
We have found a way to predict function values of $y$ for new input points $x^*$. As we used regularized regression, we can equivalently find the predictive distribution by marginalizing out the parameters $w$.
Questions: Can we find a closed form for that distribution? How can we model the uncertainty of our prediction? Can we use that for classification?

Gaussian Marginals and Conditionals
First, we need some formulae. Assume we have two variables $x_a$ and $x_b$ that are jointly Gaussian distributed, i.e. $\mathcal{N}(x \mid \mu, \Sigma)$ with
$$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$$
Then the conditional distribution is $p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Sigma_{a|b})$, where
$$\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b) \quad \text{and} \quad \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \ \text{(Schur complement)}$$
The marginal is $p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})$.
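These formulae map directly to code; a small sketch conditioning the first two components of an arbitrary 3-dimensional Gaussian on the last one:

```python
import numpy as np

# Arbitrary jointly Gaussian example: x = (x_a, x_b) with dim(x_a) = 2, dim(x_b) = 1
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a, b = slice(0, 2), slice(2, 3)

x_b = np.array([2.5])                                            # observed value of x_b
Sbb_inv = np.linalg.inv(Sigma[b, b])
mu_cond = mu[a] + Sigma[a, b] @ Sbb_inv @ (x_b - mu[b])          # mu_{a|b}
Sigma_cond = Sigma[a, a] - Sigma[a, b] @ Sbb_inv @ Sigma[b, a]   # Schur complement
print(mu_cond, Sigma_cond)
```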

Gaussian Marginals and Conditionals
Main idea of the proof for the conditional (using the inverse of block matrices):
$$\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -\Sigma_{bb}^{-1}\Sigma_{ba} & I \end{pmatrix} \begin{pmatrix} (\Sigma/\Sigma_{bb})^{-1} & 0 \\ 0 & \Sigma_{bb}^{-1} \end{pmatrix} \begin{pmatrix} I & -\Sigma_{ab}\Sigma_{bb}^{-1} \\ 0 & I \end{pmatrix}$$
where $\Sigma/\Sigma_{bb} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ is the Schur complement. The lower block corresponds to a quadratic form that depends only on $x_b$ and can be identified with the marginal $p(x_b)$; the rest can be identified with the conditional normal distribution $p(x_a \mid x_b)$. (For details see, e.g., Bishop or Murphy.)

Definition
Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. The number of random variables can be infinite! This means: a GP is a Gaussian distribution over functions!
To specify a GP we need:
mean function: $m(x) = \mathbb{E}[y(x)]$
covariance function: $k(x_1, x_2) = \mathbb{E}[(y(x_1) - m(x_1))(y(x_2) - m(x_2))]$

Example
[Figure: green line — sinusoidal data source; blue circles — data points with Gaussian noise; red line — mean function of the Gaussian process.]

How Can We Handle Infinity?
Idea: split the (infinite) number of random variables into a finite and an infinite subset:
$$x = \begin{pmatrix} x_f \\ x_i \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \mu_f \\ \mu_i \end{pmatrix}, \begin{pmatrix} \Sigma_f & \Sigma_{fi} \\ \Sigma_{fi}^T & \Sigma_i \end{pmatrix}\right)$$
where $x_f$ is the finite part and $x_i$ the infinite part. From the marginalization property we get
$$p(x_f) = \int p(x_f, x_i)\, dx_i = \mathcal{N}(x_f \mid \mu_f, \Sigma_f)$$
This means we can use finite vectors.

A Simple Example
In Bayesian linear regression, we had $y(x) = \phi(x)^T w$ with prior probability $w \sim \mathcal{N}(0, \Sigma_p)$. This means:
$$\mathbb{E}[y(x)] = \phi(x)^T\mathbb{E}[w] = 0$$
$$\mathbb{E}[y(x_1)\, y(x_2)] = \phi(x_1)^T\mathbb{E}[w w^T]\,\phi(x_2) = \phi(x_1)^T\Sigma_p\,\phi(x_2)$$
Any number of function values $y(x_1), \dots, y(x_N)$ is jointly Gaussian with zero mean. The covariance function of this process is $k(x_1, x_2) = \phi(x_1)^T\Sigma_p\,\phi(x_2)$. In general, any valid kernel function can be used.
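A small sketch verifying this covariance empirically: draw many weight vectors $w$ from the prior and compare the sample covariance of $y(x_1)$ and $y(x_2)$ with $\phi(x_1)^T\Sigma_p\,\phi(x_2)$ (the basis functions and prior covariance below are illustrative assumptions):

```python
import numpy as np

centers = np.linspace(0.0, 1.0, 5)
def phi(x):
    return np.exp(-(x - centers) ** 2 / (2 * 0.2 ** 2))   # Gaussian basis functions

Sigma_p = np.eye(len(centers))                             # prior covariance of w (assumed identity)
rng = np.random.default_rng(4)
W = rng.multivariate_normal(np.zeros(len(centers)), Sigma_p, size=200_000)

x1, x2 = 0.3, 0.7
y1, y2 = W @ phi(x1), W @ phi(x2)
print(np.mean(y1 * y2))                  # empirical E[y(x1) y(x2)]
print(phi(x1) @ Sigma_p @ phi(x2))       # k(x1, x2) = phi(x1)^T Sigma_p phi(x2)
```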

The Covariance Function
The most used covariance function (kernel) is
$$k(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{1}{2l^2}(x_p - x_q)^2\right) + \sigma_n^2\,\delta_{pq}$$
with signal variance $\sigma_f^2$, length scale $l$, and noise variance $\sigma_n^2$. It is known as the squared exponential, radial basis function, or Gaussian kernel. Other possibilities exist, e.g. the exponential kernel
$$k(x_p, x_q) = \exp(-|x_p - x_q|)$$
which is used in the Ornstein-Uhlenbeck process.

Sampling from a GP
Just as we can sample from a Gaussian distribution, we can also generate samples from a GP. Every sample will then be a function! Process:
1. Choose a number of input points $x_1, \dots, x_M$
2. Compute the covariance matrix $K$ where $K_{ij} = k(x_i, x_j)$
3. Generate a random Gaussian vector $y \sim \mathcal{N}(0, K)$
4. Plot the values $x_1, \dots, x_M$ versus $y_1, \dots, y_M$
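A sketch of this sampling procedure, using the squared exponential kernel from the previous slide (the input range and hyperparameters are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def squared_exponential(xp, xq, sigma_f=1.0, length=0.3):
    return sigma_f ** 2 * np.exp(-(xp - xq) ** 2 / (2 * length ** 2))

# 1. Choose input points
x = np.linspace(0.0, 5.0, 200)
# 2. Compute the covariance matrix K_ij = k(x_i, x_j) (small jitter for numerical stability)
K = squared_exponential(x[:, None], x[None, :]) + 1e-8 * np.eye(len(x))
# 3. Generate random Gaussian vectors y ~ N(0, K)
rng = np.random.default_rng(5)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
# 4. Plot x versus y -- every sample is a function
for y in samples:
    plt.plot(x, y)
plt.show()
```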

Sampling from a GP
[Figure: sample functions drawn using the squared exponential kernel and the exponential kernel.]