Bayesian Linear Regression. Sargur Srihari

Bayesian Linear Regression. Sargur Srihari, srihari@cedar.buffalo.edu

Topics in Bayesian Regression: recall of maximum likelihood linear regression, the parameter (posterior) distribution, the predictive distribution, and the equivalent kernel.

Linear Regression: model complexity M. Polynomial regression uses the model y(x,w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j. [Figure: red curves show the best fits with M = 0, 1, 3, 9 to N = 10 points drawn from sin(2πx); M = 0 and M = 1 are poor representations of sin(2πx), M = 3 gives the best fit, and M = 9 over-fits, again representing sin(2πx) poorly.]

Max Likelihood Regression. Given an input vector x and basis functions {φ_1(x),..,φ_M(x)}, the model is y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x). A common choice is radial basis functions φ_j(x) = exp{ −(1/2)(x − μ_j)^T Σ^{-1} (x − μ_j) }. The maximum likelihood objective with examples {x_1,..,x_N} (equivalent to the mean squared error objective) is E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2, and the regularized MSE (λ is the regularization coefficient) is E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w. The closed-form ML solution is w_ML = (Φ^T Φ)^{-1} Φ^T t and the regularized solution is w_ML = (λI + Φ^T Φ)^{-1} Φ^T t, where Φ is the N×M design matrix with elements Φ_{nj} = φ_j(x_n) and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudoinverse of Φ. Gradient descent instead iterates w^{(τ+1)} = w^{(τ)} − η ∇E, with ∇E = −Σ_{n=1}^{N} { t_n − w^{(τ)T} φ(x_n) } φ(x_n) for the unregularized objective and ∇E = −Σ_{n=1}^{N} { t_n − w^{(τ)T} φ(x_n) } φ(x_n) + λ w^{(τ)} for the regularized version.
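
A minimal Python sketch of the two closed-form solutions above (not from the slides; the helper name design_matrix, the Gaussian-basis width, and the data-generation choices are illustrative assumptions):

```python
# Sketch: closed-form ML and regularized least-squares solutions
# w_ML = (Phi^T Phi)^{-1} Phi^T t  and  w_reg = (lambda I + Phi^T Phi)^{-1} Phi^T t.
import numpy as np

def design_matrix(x, centers, s=0.2):
    """Gaussian radial basis functions plus a bias column (illustrative choice)."""
    Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)
    return np.hstack([np.ones((len(x), 1)), Phi])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=25)   # noisy targets

Phi = design_matrix(x, centers=np.linspace(-1, 1, 9))
lam = 0.1                                                 # regularization coefficient

w_ml  = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)           # unregularized ML solution
w_reg = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
```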

Shortcomings of MLE. Maximum likelihood estimation of the parameters w does not address the model complexity M (how many basis functions to use). Complexity is controlled by data size: more data allows a better fit without overfitting. Regularization also controls overfitting (λ controls its effect): E(w) = E_D(w) + λ E_W(w), where E_D(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 and E_W(w) = (1/2) w^T w. But M and the choice of the φ_j are still important. M can be determined with a holdout set, but that is wasteful of data. Model complexity and over-fitting are better handled using the Bayesian approach.

Bayesian Linear Regression. Using Bayes rule, the posterior is proportional to Likelihood × Prior: p(w|t) = p(t|w) p(w) / p(t), where p(t|w) is the likelihood of the observed data and p(w) is the prior distribution over the parameters. We will look at a normal distribution for the prior p(w) and a likelihood p(t|w) that is a product of Gaussians based on the noise model, and conclude that the posterior is also Gaussian.

Gaussian Prior over Parameters. Assume a multivariate Gaussian prior for w (which has components w_0,..,w_M): p(w) = N(w | m_0, S_0) with mean m_0 and covariance matrix S_0. If we choose S_0 = α^{-1} I, the variances of the weights are all equal to α^{-1} and the covariances are zero, i.e., p(w) has zero mean (m_0 = 0) and is isotropic over the weights (same variance in every direction). [Figure: contours of such a prior in (w_0, w_1) space.]

Likelihood of Data is Gaussian. Assume a noise precision parameter β and t = y(x,w) + ε, where ε is Gaussian noise, so that p(t | x,w,β) = N(t | y(x,w), β^{-1}). Note that the output t is a scalar. The likelihood of t = {t_1,..,t_N} is then p(t | X,w,β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}). This is the probability of the target data t given the parameters w and inputs X = {x_1,..,x_N}. Because the noise is Gaussian, the likelihood p(t|w) is also Gaussian.

Posterior Distribution is also Gaussian. The prior p(w) = N(w | m_0, S_0) is Gaussian, and the likelihood comes from Gaussian noise: p(t | X,w,β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}). It follows that the posterior p(w|t) is also Gaussian. Proof: use a standard result for Gaussians. If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian. Let p(w) = N(w | μ, Λ^{-1}) and p(t|w) = N(t | Aw + b, L^{-1}). Then the marginal is p(t) = N(t | Aμ + b, L^{-1} + A Λ^{-1} A^T) and the conditional is p(w|t) = N(w | Σ{A^T L(t − b) + Λμ}, Σ), where Σ = (Λ + A^T L A)^{-1}.

Exact Form of the Posterior Distribution. We have p(w) = N(w | m_0, S_0) and p(t | X,w,β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}). The posterior is also Gaussian and can be written directly as p(w|t) = N(w | m_N, S_N), where the mean of the posterior is m_N = S_N (S_0^{-1} m_0 + β Φ^T t) and the covariance of the posterior is given by S_N^{-1} = S_0^{-1} + β Φ^T Φ. Here Φ is the N×M design matrix with elements Φ_{nj} = φ_j(x_n). [Figure: prior p(w|α) = N(w | 0, α^{-1} I) and posterior in (w_0, w_1) weight space for scalar input x and y(x,w) = w_0 + w_1 x.]
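
A minimal Python sketch of this posterior update (not from the slides; the straight-line data generation, seed, and hyperparameter values are illustrative assumptions):

```python
# Sketch: posterior mean and covariance for Bayesian linear regression with
# prior N(w | m0, S0) and noise precision beta.
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """m_N = S_N (S0^{-1} m0 + beta Phi^T t),  S_N^{-1} = S0^{-1} + beta Phi^T Phi."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

# Example: straight-line model y = w0 + w1*x with an isotropic prior.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 20)
Phi = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
alpha, beta = 2.0, 25.0
mN, SN = posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2) / alpha, beta=beta)
```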

Properties of the Posterior. 1. Since the posterior p(w|t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean; thus the maximum-posterior weight vector is w_MAP = m_N. 2. For an infinitely broad prior S_0 = α^{-1} I with precision α → 0, the mean m_N reduces to the maximum likelihood value, i.e., the mean is the solution vector w_ML = (Φ^T Φ)^{-1} Φ^T t. 3. If N = 0, the posterior reverts to the prior. 4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points.

Choosing a Simple Gaussian Prior. For y(x,w) = w_0 + w_1 x, choose a zero-mean (m_0 = 0), isotropic (equal-variance) Gaussian prior with a single precision parameter α: p(w|α) = N(w | 0, α^{-1} I). The corresponding posterior distribution is p(w|t) = N(w | m_N, S_N), where m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ. Note: β is the noise precision and α is the precision of the parameters w in the prior. [Figure: posterior in (w_0, w_1) space; with infinitely many samples the posterior collapses to a point estimate.]

Equivalence to MLE with Regularization. Since p(t | X,w,β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) and p(w|α) = N(w | 0, α^{-1} I), the log of the posterior is ln p(w|t) = −(β/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 − (α/2) w^T w + const. Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w, i.e., with the addition of the quadratic regularization term w^T w with λ = α/β.
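
A minimal Python sketch checking this equivalence numerically (not from the slides; the data and hyperparameters are illustrative assumptions):

```python
# Sketch: the posterior mean under a zero-mean isotropic Gaussian prior equals
# the ridge-regression solution with lambda = alpha / beta.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 30)
Phi = np.column_stack([np.ones_like(x), x])

alpha, beta = 2.0, 25.0
lam = alpha / beta

# Bayesian posterior mean: m_N = beta S_N Phi^T t, S_N^{-1} = alpha I + beta Phi^T Phi
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularized (ridge) least squares: w = (lambda I + Phi^T Phi)^{-1} Phi^T t
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)

assert np.allclose(mN, w_ridge)   # the two solutions coincide
```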

Bayesian Linear Regression Example (Straight-Line Fit). Single input variable x, single target variable t. The goal is to fit the linear model y(x,w) = w_0 + w_1 x, i.e., to recover w = [w_0, w_1] given the samples.

Data Generation. Synthetic data are generated from f(x,w) = w_0 + w_1 x with parameter values w_0 = −0.3 and w_1 = 0.5. First choose x_n from the uniform distribution U(x | −1, 1), then evaluate f(x_n, w) and add Gaussian noise with standard deviation 0.2 to obtain the target t_n. The noise precision parameter is β = (1/0.2)^2 = 25. For the prior over w we choose α = 2.0, with p(w|α) = N(w | 0, α^{-1} I).

Sampling p(w) and p(w|t). Each sample of w represents a straight line y(x,w) = w_0 + w_1 x in data space, and the distribution over lines is modified as examples arrive. [Figure: six samples drawn from the prior p(w) with no examples observed, and six samples drawn from the posterior p(w|t) after two examples.] The goal of Bayesian linear regression is to determine p(w|t).

Sequential Bayesian Learning. Since there are only two parameters, we can plot the prior and posterior distributions in parameter space and examine the sequential update of the posterior, one data point at a time. [Figure: rows correspond to (i) before any data point is observed, (ii) after the first data point (x_1, t_1), (iii) after the second data point, and (iv) after twenty data points. Left column: the likelihood p(t|x,w) for the latest point alone, as a function of w, with the true parameter value marked by a white cross; the band represents values of (w_0, w_1) corresponding to straight lines passing near that data point. Middle column: the prior/posterior p(w); multiplying the previous posterior by the likelihood gives the new posterior p(w|t). Right column: six sample regression functions y(x,w) with w drawn from the current posterior, shown against the observed data points.] With infinitely many points the posterior becomes a delta function centered at the true parameters.
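
A minimal Python sketch of this sequential updating for the straight-line example (not from the slides; the seed and loop structure are illustrative assumptions):

```python
# Sketch: sequential Bayesian updating. The posterior after each data point
# serves as the prior for the next point.
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
w_true = np.array([-0.3, 0.5])

# Start from the prior N(0, alpha^{-1} I).
m = np.zeros(2)
S = np.eye(2) / alpha

for n in range(20):
    x_n = rng.uniform(-1, 1)
    t_n = w_true @ np.array([1.0, x_n]) + rng.normal(0, 0.2)
    phi = np.array([1.0, x_n])                        # phi(x_n) = [1, x_n]

    # One-point update: S_new^{-1} = S^{-1} + beta phi phi^T,
    #                   m_new = S_new (S^{-1} m + beta t_n phi)
    S_inv = np.linalg.inv(S)
    S_new = np.linalg.inv(S_inv + beta * np.outer(phi, phi))
    m = S_new @ (S_inv @ m + beta * phi * t_n)
    S = S_new

print(m)   # approaches w_true = [-0.3, 0.5] as more points are observed
```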

Generalization of the Gaussian Prior. The Gaussian prior over parameters is p(w|α) = N(w | 0, α^{-1} I), and maximization of the posterior ln p(w|t) is equivalent to minimization of the sum-of-squares error E(w) = (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w. A more general prior yields the lasso and its variations: p(w|α) = [ (q/2) (α/2)^{1/q} (1/Γ(1/q)) ]^M exp( −(α/2) Σ_{j=1}^{M} |w_j|^q ), where q = 2 corresponds to the Gaussian (and q = 1 to the lasso). It corresponds to minimization of the regularized error function (1/2) Σ_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q.

Predictive Distribution. Usually we are not interested in the value of w itself but in predicting t for a new value of x, i.e., p(t | t, x, X), written p(t|t) with the conditioning variables X and x left out for convenience. Marginalizing over the parameter variable w is the standard Bayesian approach. By the sum rule of probability, p(t) = ∫ p(t,w) dw = ∫ p(t|w) p(w) dw, so we can write p(t|t) = ∫ p(t|w) p(w|t) dw.

Predictive Distribution with α, β, X, t. We can predict t for a new value of x using p(t|t) = ∫ p(t|w) p(w|t) dw. With explicit dependence on the prior parameter α, the noise parameter β, and the training-set targets t: p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw (the conditioning variables X and x are again left out for convenience; this is the sum rule p(t) = Σ_w p(t|w) p(w) in integral form). Here the conditional of the target t given the weights w is p(t | x,w,β) = N(t | y(x,w), β^{-1}), and the posterior of the weights is p(w|t) = N(w | m_N, S_N) with m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ. The right-hand side is a convolution of two Gaussian distributions, whose result is the Gaussian p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x).
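
A minimal Python sketch of the predictive mean and variance (not from the slides; the straight-line setup, seed, and query point are illustrative assumptions):

```python
# Sketch: predictive distribution p(t | x) = N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x)).
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Return (mean, variance) of the predictive distribution at one input."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x
    return mean, var

# Straight-line posterior (m_N, S_N) as in the earlier sketches:
alpha, beta = 2.0, 25.0
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 20)
Phi = np.column_stack([np.ones_like(x), x])
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

mean, var = predictive(np.array([1.0, 0.3]), mN, SN, beta)   # prediction at x = 0.3
```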

Variance of the Predictive Distribution. The predictive distribution is a Gaussian p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x). Since the noise process and the distribution of w are independent Gaussians, their variances are additive: the first term is the noise in the data, and the second is the uncertainty associated with the parameters w, where S_N is the covariance of the posterior p(w|t), with S_N^{-1} = α I + β Φ^T Φ. Since σ_{N+1}^2(x) ≤ σ_N^2(x), the predictive distribution becomes narrower as the number of samples increases. As N → ∞, the second term goes to zero and the variance of the predictive distribution arises solely from the additive noise governed by β.

Example of the Predictive Distribution. Data are generated from sin(2πx). The model uses nine Gaussian basis functions: y(x,w) = Σ_{j=0}^{8} w_j φ_j(x) = w^T φ(x), with φ_j(x) = exp( −(x − μ_j)^2 / (2 s^2) ). The predictive distribution is p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x), m_N = β S_N Φ^T t, S_N^{-1} = α I + β Φ^T Φ, and α and β come from the assumptions p(w|α) = N(w | 0, α^{-1} I) and p(t | x,w,β) = N(t | y(x,w), β^{-1}). [Figure: plot of p(t|x) for one dataset, showing the mean of the predictive distribution (red) and one standard deviation around it (pink).]
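
A minimal Python sketch of this example (not from the slides; the basis-function centers and width, data size, and seed are illustrative assumptions):

```python
# Sketch: predictive mean and standard deviation for sin(2*pi*x) data with
# nine Gaussian basis functions.
import numpy as np

def phi(x, centers, s=0.1):
    """Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)
Phi = phi(x, centers)

SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

x_test = np.linspace(0, 1, 100)
Phi_test = phi(x_test, centers)
pred_mean = Phi_test @ mN                                          # m_N^T phi(x)
pred_std = np.sqrt(1.0 / beta + np.sum(Phi_test @ SN * Phi_test, axis=1))
```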

Predictive Distribution Variance. The Bayesian prediction is p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x), with Gaussian prior over parameters p(w|α) = N(w | 0, α^{-1} I), Gaussian noise model p(t | x,w,β) = N(t | y(x,w), β^{-1}), and S_N^{-1} = α I + β Φ^T Φ built from the design matrix Φ. [Figure: using data from sin(2πx), the mean of the Gaussian predictive distribution and one standard deviation around it are shown for N = 1, 2, 4, and 25 observed points.] The predictive variance σ_N^2(x) is smallest in the neighborhood of the data points, and the uncertainty decreases as more data points are observed. The plot only shows the point-wise predictive variance; to show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x,w).

Plots of Functions y(x,w). Draw samples of w from the posterior distribution p(w|t) = N(w | m_N, S_N) and plot the corresponding functions y(x,w) = w^T φ(x). This shows the covariance between predictions at different values of x: for a given sampled function and a pair of inputs x, x', the values y(x), y(x') are related through the equivalent kernel k(x,x'), which in turn is determined by the samples. [Figure: sampled functions for N = 1, 2, 4, and 25 data points.]
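
A minimal Python sketch of drawing such functions for the straight-line model (not from the slides; the setup repeats the earlier illustrative assumptions):

```python
# Sketch: draw weight vectors from the posterior N(m_N, S_N) and evaluate
# the corresponding sampled regression functions y(x, w).
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, 20)
Phi = np.column_stack([np.ones_like(x), x])

SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

w_samples = rng.multivariate_normal(mN, SN, size=6)      # six samples of w
x_grid = np.linspace(-1, 1, 100)
Y = np.column_stack([np.ones_like(x_grid), x_grid]) @ w_samples.T   # one column per sampled line
```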

Disadvantage of a Local Basis. The predictive distribution, assuming a Gaussian prior p(w|α) = N(w | 0, α^{-1} I) and Gaussian noise t = y(x,w) + ε with p(t | x,w,β) = N(t | y(x,w), β^{-1}), is p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x) and S_N^{-1} = α I + β Φ^T Φ. With localized basis functions, e.g., Gaussians, at regions away from the basis-function centers the contribution of the second term of the variance σ_N^2(x) goes to zero, leaving only the noise contribution 1/β. The model thus becomes very confident outside the region occupied by the basis functions. This problem is avoided by the alternative Bayesian approach of Gaussian Processes.

Dealing with Unknown β. If both w and β are treated as unknown, we can introduce a conjugate prior distribution p(w, β) given by a Gaussian-gamma distribution, of the form p(μ, λ) = N(μ | μ_0, (βλ)^{-1}) Gam(λ | a, b). In this case the predictive distribution is a Student's t-distribution: St(x | μ, λ, ν) = [Γ(ν/2 + 1/2) / Γ(ν/2)] [λ/(πν)]^{1/2} [1 + λ(x − μ)^2/ν]^{−ν/2 − 1/2}.

The Mean of p(w|t) has a Kernel Interpretation. The regression function is y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x). If we take a Bayesian approach with Gaussian prior p(w) = N(w | m_0, S_0), then the posterior is p(w|t) = N(w | m_N, S_N), where m_N = S_N (S_0^{-1} m_0 + β Φ^T t) and S_N^{-1} = S_0^{-1} + β Φ^T Φ. With a zero-mean isotropic prior p(w|α) = N(w | 0, α^{-1} I), these become m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ. The posterior mean β S_N Φ^T t has a kernel interpretation, which sets the stage for kernel methods and Gaussian processes.

Equivalent Kernel. The posterior mean of w (with m_0 = 0) is m_N = β S_N Φ^T t, where S_N^{-1} = S_0^{-1} + β Φ^T Φ, S_0 is the covariance matrix of the prior p(w), β is the noise precision, and Φ is the design matrix that depends on the samples. Substituting this mean value into the regression function y(x,w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x), the mean of the predictive distribution at a point x is y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n = Σ_{n=1}^{N} k(x, x_n) t_n, where k(x, x') = β φ(x)^T S_N φ(x') is the equivalent kernel. Thus the mean of the predictive distribution is a linear combination of the training-set target variables t_n. Note: the equivalent kernel depends on the input values x_n from the dataset because they appear in S_N.
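
A minimal Python sketch of the equivalent kernel and the linear-smoother view of the predictive mean (not from the slides; the Gaussian basis and data are illustrative assumptions):

```python
# Sketch: equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x'), and the
# predictive mean as the linear smoother sum_n k(x, x_n) t_n.
import numpy as np

def gauss_basis(x, centers, s=0.1):
    return np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(7)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)
Phi = gauss_basis(x, centers)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

def k(x1, x2):
    """Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * gauss_basis(x1, centers) @ SN @ gauss_basis(x2, centers).T

x_star = 0.5
smoother_pred = (k(x_star, x) @ t).item()        # sum_n k(x*, x_n) t_n
direct_pred = (gauss_basis(x_star, centers) @ mN).item()
assert np.isclose(smoother_pred, direct_pred)    # both give the predictive mean
```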

Kernel Function. Regression functions such as y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n, with k(x, x') = β φ(x)^T S_N φ(x') and S_N^{-1} = S_0^{-1} + β Φ^T Φ, which take a linear combination of the training-set target values, are known as linear smoothers. They depend on the input values x_n from the dataset, since these appear in the definition of S_N.

Example of the Kernel for a Gaussian Basis. The equivalent kernel is k(x, x') = β φ(x)^T S_N φ(x'), with S_N^{-1} = S_0^{-1} + β Φ^T Φ and Gaussian basis functions φ_j(x) = exp( −(x − μ_j)^2 / (2 s^2) ). The kernel is used directly in regression: the mean of the predictive distribution is y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n, obtained by forming a weighted combination of the target values in which data points close to x are given higher weight than points further removed from x. [Figure: plot of k(x, x') as a function of x and x', peaking when x = x', together with three slices showing the behavior of k(x, x') for three fixed values of x; the kernels are localized around x. The dataset used to generate the kernel comprised 200 values of x equally spaced over (−1, 1).]

Equivalent Kernel for a Polynomial Basis. For basis functions φ_j(x) = x^j, the equivalent kernel k(x, x') = β φ(x)^T S_N φ(x'), with S_N^{-1} = S_0^{-1} + β Φ^T Φ, again gives data points close to x higher weight than points further removed from x. [Figure: k(x, x') plotted as a function of x' for x = 0.] It is a localized function of x' even though the corresponding basis functions are nonlocal.

Equivalent Kernel for a Sigmoidal Basis. For basis functions φ_j(x) = σ((x − μ_j)/s) with σ(a) = 1/(1 + exp(−a)), the equivalent kernel k(x, x') = β φ(x)^T S_N φ(x') is again a localized function of x' even though the corresponding basis functions are nonlocal.

Covariance between y(x) and y(x'). An important insight: the value of the kernel function between two points is directly related to the covariance between their function values: cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')] = φ(x)^T S_N φ(x') = β^{-1} k(x, x'), where we have used p(w|t) = N(w | m_N, S_N) and k(x, x') = β φ(x)^T S_N φ(x'). From the form of the equivalent kernel, the predictive means at nearby points y(x), y(x') will be highly correlated, while for more distant pairs the correlation is smaller. The kernel captures this covariance.

Predictive Plot vs. Posterior Plots. The predictive distribution lets us visualize the point-wise uncertainty in the predictions, governed by p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)) with σ_N^2(x) = 1/β + φ(x)^T S_N φ(x). By drawing samples from the posterior p(w|t) and plotting the corresponding functions y(x,w), we visualize the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the equivalent kernel.

Directly Specifying the Kernel Function. The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression: instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can directly define a kernel function and use it to make predictions for a new input x given the observation set. This leads to a practical framework for regression (and classification) called Gaussian Processes.

Summing Kernel Values over the Samples. The equivalent kernel defines the weights by which the training-set target values are combined to make a prediction at x, and it can be shown that these weights sum to one: Σ_{n=1}^{N} k(x, x_n) = 1 for all values of x. This result can be shown intuitively: since y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n, the summation is equivalent to considering the predictive mean ŷ(x) for a set of target data in which t_n = 1 for all n. Provided the basis functions are linearly independent, that N > M, and that one of the basis functions is constant (corresponding to the bias parameter), we can fit the training data exactly, and hence ŷ(x) = 1.
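
A minimal Python sketch checking this numerically (not from the slides; the basis, dataset size, and query point are illustrative assumptions, and with a finite prior precision α the sum is only approximately one):

```python
# Sketch: the equivalent-kernel weights sum to (approximately) one when the
# basis includes a constant term and N > M.
import numpy as np

rng = np.random.default_rng(8)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 8)

def basis(x):
    x = np.atleast_1d(x)
    gauss = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))
    return np.hstack([np.ones((len(x), 1)), gauss])      # constant + 8 Gaussians (M = 9)

x = rng.uniform(0, 1, 200)                               # N = 200 > M
Phi = basis(x)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)

weights = beta * basis(0.37) @ SN @ Phi.T                # k(x*, x_n) for x* = 0.37
print(weights.sum())                                     # close to 1 (exactly 1 only as alpha -> 0)
```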

Kernel Function Properties. The equivalent kernel can be positive or negative: although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training-set target variables. The equivalent kernel also satisfies an important property shared by kernel functions in general: it can be expressed as an inner product with respect to a vector ψ(x) of nonlinear functions, k(x, z) = ψ(x)^T ψ(z), where ψ(x) = β^{1/2} S_N^{1/2} φ(x) and k(x, x') = β φ(x)^T S_N φ(x').
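
A minimal Python sketch verifying this inner-product form (not from the slides; the matrix square root via eigendecomposition and the data are illustrative assumptions):

```python
# Sketch: k(x, z) = psi(x)^T psi(z) with psi(x) = beta^{1/2} S_N^{1/2} phi(x).
import numpy as np

rng = np.random.default_rng(9)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)

def phi(x):
    return np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))

x = rng.uniform(0, 1, 25)
Phi = phi(x)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)

# Symmetric square root S_N^{1/2} via eigendecomposition (S_N is symmetric positive definite).
vals, vecs = np.linalg.eigh(SN)
SN_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def psi(x):
    return np.sqrt(beta) * phi(x) @ SN_half              # psi(x) = beta^{1/2} S_N^{1/2} phi(x)

x1, x2 = 0.2, 0.7
k_direct = beta * (phi(x1) @ SN @ phi(x2).T).item()      # beta phi(x)^T S_N phi(z)
k_inner = (psi(x1) @ psi(x2).T).item()                   # psi(x)^T psi(z)
assert np.isclose(k_direct, k_inner)
```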