CS 7140: Advanced Machine Learning


Lecture 3: Gaussian Processes (17 Jan, 2018)
Instructor: Jan-Willem van de Meent (j.vandemeent@northeastern.edu)
Scribes: Mo Han (han.m@husky.neu.edu), Guillem Reus Muns (reusmuns.g@husky.neu.edu), Somanshu Singh (singh.som@husky.neu.edu)

1 Gaussian Processes

In ridge regression, our goal is to compute a point estimate, which is the expected value of the posterior predictive distribution given previous observations. When we additionally want a confidence interval for predictions, we can use Gaussian processes, which compute the full posterior predictive distribution p(f | y) for a new input x using the previous inputs x_{1:N} and labels y, where f(x) is the function value at input x.

1.1 Formal View: Non-parametric Distribution on Functions

In a prediction problem, the new input x should be a known constant value or vector. From a formal point of view, however, before we receive the new input x we do not know its exact value given only the previous inputs and outputs; the new input is a variable, and the prediction is a function of this variable. Our goal here is therefore to calculate a non-parametric distribution p(f | y) on functions f(x), which has infinite degrees of freedom. Furthermore, for each new input we can obtain a new posterior distribution based on the previous data. The posterior distribution p(f | y) can be defined using Bayes' rule,

    p(f \mid \vec{y}) = \frac{p(\vec{y} \mid f)\, p(f)}{p(\vec{y})}.    (1)

In Gaussian processes, we assume the prior to be a Gaussian distribution over functions of the input x,

    f \sim \mathrm{GP}(\mu(\vec{x}), k(\vec{x}, \vec{x}')),    (2)

where \mu(\vec{x}) is a mean function and k(\vec{x}, \vec{x}') is a covariance function. The likelihood of the existing labels is given by

    y_n \mid f \sim \mathrm{Norm}(f(\vec{x}_n), \sigma^2).    (3)

In many machine learning applications, knowledge about the true underlying mechanism behind the data-generating process is limited. Instead, one relies on generic smoothness assumptions; for example, we might wish that for two inputs x and x' that are close, the corresponding outputs y and y' should be similar. Many generic techniques in machine learning can be viewed as encoding different characteristics of smoothness. The kernel function, used as the covariance, represents the smoothness between two inputs: the closer the inputs, the higher the covariance, and the more similar their corresponding outputs.
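
Although the formal object f has infinitely many degrees of freedom, in practice we only ever evaluate it at finitely many inputs (as Section 1.2 below makes precise). The following minimal numpy sketch (not part of the original notes; the zero mean, squared-exponential kernel, and input grid are illustrative assumptions) draws sample functions from a GP prior on a finite grid:

    import numpy as np

    def sq_exp_kernel(x1, x2, lengthscale=1.0):
        """Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
        diff = x1[:, None] - x2[None, :]
        return np.exp(-0.5 * (diff / lengthscale) ** 2)

    # Evaluate the prior at a finite set of inputs: f ~ Norm(0, k(X, X)).
    x = np.linspace(-5, 5, 100)
    K = sq_exp_kernel(x, x)
    jitter = 1e-8 * np.eye(len(x))            # numerical stabilization for the Cholesky factor
    L = np.linalg.cholesky(K + jitter)

    rng = np.random.default_rng(0)
    samples = L @ rng.standard_normal((len(x), 3))   # three draws from the GP prior
    print(samples.shape)                              # (100, 3): three sample functions on the grid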

1.2 Practical View: Generalization of the Multivariate Normal

From a formal point of view, the function value f is a random variable. In absence of further assumptions, a function on R^n has an uncountably infinite number of degrees of freedom. However, in practice we will only ever need to reason about the function values at a finite set of inputs. For any set of inputs X = [\vec{x}_1, ..., \vec{x}_N], a Gaussian process defines a joint distribution on function values that is a multivariate Gaussian

    \vec{f} \sim \mathrm{Norm}(\mu(X), k(X, X)).    (4)

Here we use \vec{f} and \mu(X) as shorthands for the vectors of function values

    \vec{f} := (f(\vec{x}_1), \ldots, f(\vec{x}_N)),    (5)

    \mu(X) := (\mu(\vec{x}_1), \ldots, \mu(\vec{x}_N)),    (6)

and k(X, X) is similarly a shorthand for the covariance matrix

    k(X, X) := \begin{bmatrix} k(\vec{x}_1, \vec{x}_1) & \cdots & k(\vec{x}_1, \vec{x}_N) \\ \vdots & \ddots & \vdots \\ k(\vec{x}_N, \vec{x}_1) & \cdots & k(\vec{x}_N, \vec{x}_N) \end{bmatrix}.    (7)

1.3 Regression with the Predictive Distribution

Suppose that we have observed a vector of values \vec{y} = (y_1, \ldots, y_N) distributed as

    \vec{y} \mid \vec{f} \sim \mathrm{Norm}(\vec{f}, \sigma^2 I).    (8)

The goal of Gaussian process regression is to reason about function values \vec{f}^* := (f(\vec{x}^*_1), \ldots, f(\vec{x}^*_M)) at some new set of inputs X^* = [\vec{x}^*_1, \ldots, \vec{x}^*_M]. The predictive distribution on \vec{f}^* is

    p(\vec{f}^* \mid \vec{y}) = \frac{p(\vec{y}, \vec{f}^*)}{p(\vec{y})} = \frac{\int p(\vec{y} \mid \vec{f})\, p(\vec{f}^*, \vec{f})\, d\vec{f}}{p(\vec{y})}.    (9)

In this equation, the joint distribution p(\vec{f}, \vec{f}^*) is a multivariate Gaussian on function values at N + M points with mean and covariance matrix

    \begin{bmatrix} \vec{f} \\ \vec{f}^* \end{bmatrix} \sim \mathrm{Norm}\left( \begin{bmatrix} \mu(X) \\ \mu(X^*) \end{bmatrix}, \begin{bmatrix} k(X, X) & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{bmatrix} \right),    (10)

where the notation k(X, X^*) is used to refer to an N x M (sub-)matrix

    k(X, X^*) := \begin{bmatrix} k(\vec{x}_1, \vec{x}^*_1) & \cdots & k(\vec{x}_1, \vec{x}^*_M) \\ \vdots & \ddots & \vdots \\ k(\vec{x}_N, \vec{x}^*_1) & \cdots & k(\vec{x}_N, \vec{x}^*_M) \end{bmatrix},    (11)

and k(X^*, X) and k(X^*, X^*) are defined analogously as M x N and M x M (sub-)matrices. When we assume Gaussian distributed additive noise,

    \vec{y} = \vec{f} + \vec{\epsilon}, \qquad \vec{\epsilon} \sim \mathrm{Norm}(0, \sigma^2 I),    (12)

the distribution on \vec{y} is once again a multivariate Gaussian

    \vec{y} \sim \mathrm{Norm}(\mu(X), k(X, X) + \sigma^2 I).    (13)

If we substitute \vec{y} for \vec{f} in (10), then we see that the joint distribution p(\vec{y}, \vec{f}^*) is

    \begin{bmatrix} \vec{y} \\ \vec{f}^* \end{bmatrix} \sim \mathrm{Norm}\left( \begin{bmatrix} \mu(X) \\ \mu(X^*) \end{bmatrix}, \begin{bmatrix} k(X, X) + \sigma^2 I & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{bmatrix} \right).    (14)

Given the joint distribution p(\vec{y}, \vec{f}^*) we can make use of standard Gaussian identities, which state that for any joint distribution of the form

    p(\vec{\alpha}, \vec{\beta}) = \mathrm{Norm}\left( \begin{bmatrix} \vec{\alpha} \\ \vec{\beta} \end{bmatrix};\, \begin{bmatrix} \vec{a} \\ \vec{b} \end{bmatrix},\, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right),    (15)

the conditional distribution is once more a multivariate normal with mean and covariance

    p(\vec{\alpha} \mid \vec{\beta}) = \mathrm{Norm}(\vec{\alpha};\; \vec{a} + C B^{-1}(\vec{\beta} - \vec{b}),\; A - C B^{-1} C^\top),    (16)

and the marginal distributions are likewise multivariate Gaussians of the form

    p(\vec{\alpha}) = \mathrm{Norm}(\vec{\alpha}; \vec{a}, A), \qquad p(\vec{\beta}) = \mathrm{Norm}(\vec{\beta}; \vec{b}, B).    (17)

When we substitute the form of the joint (14) into these identities, we obtain a predictive distribution in which the predictive mean and covariance are

    \vec{f}^* \mid \vec{y} \sim \mathrm{Norm}(\vec{f}^*;\; \mu^*(X^*),\; k^*(X^*, X^*)),    (18)

    \mu^*(X^*) = \mu(X^*) + k(X^*, X) \left( k(X, X) + \sigma^2 I \right)^{-1} (\vec{y} - \mu(X)),    (19)

    k^*(X^*, X^*) = k(X^*, X^*) - k(X^*, X) \left( k(X, X) + \sigma^2 I \right)^{-1} k(X, X^*).    (20)

In practice we can always pre-process our data into a detrended form by defining \vec{y}' = \vec{y} - \mu(X) and f'(\vec{x}) = f(\vec{x}) - \mu(\vec{x}). For this reason, software implementations of Gaussian processes typically assume a zero-mean distribution

    \begin{bmatrix} \vec{y} \\ \vec{f}^* \end{bmatrix} \sim \mathrm{Norm}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\, \begin{bmatrix} k(X, X) + \sigma^2 I & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{bmatrix} \right),    (21)

for which the predictive mean simplifies to

    \mu^*(X^*) = k(X^*, X) \left( k(X, X) + \sigma^2 I \right)^{-1} \vec{y}.    (22)

When we want to perform regression with a non-zero mean, we can simply pre-process the data to compute \vec{y}', perform regression for the zero-mean function values \vec{f}'^*, and finally add the mean back in to define the non-zero-mean function values \vec{f}^* := \vec{f}'^* + \mu(X^*).
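
The computations in equations (20)-(22) reduce to a few lines of linear algebra. The following minimal numpy sketch (not part of the original notes; the squared-exponential kernel, noise level, and toy data are illustrative assumptions) implements zero-mean GP regression:

    import numpy as np

    def sq_exp_kernel(x1, x2, lengthscale=1.0):
        diff = x1[:, None] - x2[None, :]
        return np.exp(-0.5 * (diff / lengthscale) ** 2)

    def gp_predict(x_train, y_train, x_test, kernel, noise_var=0.1):
        """Zero-mean GP regression: predictive mean, eq. (22), and covariance, eq. (20)."""
        K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))   # k(X, X) + sigma^2 I
        K_star = kernel(x_test, x_train)                                   # k(X*, X)
        K_ss = kernel(x_test, x_test)                                      # k(X*, X*)
        alpha = np.linalg.solve(K, y_train)                # (k(X, X) + sigma^2 I)^{-1} y
        mean = K_star @ alpha                              # predictive mean
        cov = K_ss - K_star @ np.linalg.solve(K, K_star.T) # predictive covariance
        return mean, cov

    # Illustrative data (not from the lecture): noisy observations of a sine function.
    rng = np.random.default_rng(1)
    x_train = np.linspace(-3, 3, 20)
    y_train = np.sin(x_train) + 0.3 * rng.standard_normal(20)
    x_test = np.linspace(-5, 5, 100)

    mean, cov = gp_predict(x_train, y_train, x_test, sq_exp_kernel)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # pointwise predictive standard deviation
    print(mean[:3], std[:3])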

2 Relationship to Kernel Ridge Regression

In the previous set of lecture notes, we considered kernel ridge regression, which computes the expected function value f^* := f(\vec{x}^*) at a previously unseen input \vec{x}^*, conditioned on previously observed values \vec{y},

    f^* = \mathbb{E}[Y^* \mid Y = \vec{y}],    (23)

under the assumption of a linear regression model with a Gaussian prior on the weights,

    \vec{y} = \vec{f} + \vec{\epsilon}, \qquad \vec{f} = \sum_{d=1}^{D} w_d\, \phi_d(X), \qquad \vec{w} \sim N(0, s^2 I), \qquad \vec{\epsilon} \sim N(0, \sigma^2 I).    (24)

In the normal formulation of kernel ridge regression, we define a kernel function

    k(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i)^\top \phi(\vec{x}_j),    (25)

and compute the expected function value in terms of the kernel evaluations

    f^* = k(\vec{x}^*, X) \left( k(X, X) + \frac{\sigma^2}{s^2} I \right)^{-1} \vec{y}.    (26)

Note that this expression is identical to the expression for the predictive mean in (22), with two minor distinctions. The first is that the expression above predicts a single function value f^* = f(\vec{x}^*), rather than a vector of function values \vec{f}^* = f(X^*). The second is that the expression above contains a regularization constant \lambda = \sigma^2 / s^2, whereas equation (22) only contains the term \sigma^2. The reason for this is that in Gaussian process regression, we implicitly absorb the variance of the weights into the definition of the kernel function. To see what we mean by "absorb", let us consider the more general case of a prior on weights with a full covariance matrix S_0, rather than the diagonal covariance s^2 I,

    \vec{w} \sim N(0, S_0).    (27)

For this prior, the predictive mean in ridge regression is

    f^* = \phi(\vec{x}^*)^\top S_0\, \phi(X)^\top \left( \phi(X)\, S_0\, \phi(X)^\top + \sigma^2 I \right)^{-1} \vec{y}.    (28)

Note that this expression reduces to the one in equation (26) when S_0 = s^2 I. We now also see that when we define the kernel

    k_0(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i)^\top S_0\, \phi(\vec{x}_j) = \sum_{d,e=1}^{D} \phi_d(\vec{x}_i)\, S_{0,d,e}\, \phi_e(\vec{x}_j),    (29)

we can rewrite equation (28) as

    f^* = k_0(\vec{x}^*, X) \left( k_0(X, X) + \sigma^2 I \right)^{-1} \vec{y},    (30)

which recovers the predictive mean in equation (22). In other words, kernel ridge regression is equivalent to computing the mean in Gaussian process regression when we employ a kernel of the form in equation (29).
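
To illustrate this equivalence numerically, here is a small sketch (not from the original notes; the polynomial features, variances, and toy data are assumptions chosen for the demonstration) comparing the kernel ridge prediction (26) with the GP predictive mean (22) computed under the rescaled kernel (29) with S_0 = s^2 I:

    import numpy as np

    rng = np.random.default_rng(2)

    # Explicit feature map (illustrative): polynomial features phi(x) = (1, x, x^2).
    def phi(x):
        return np.stack([np.ones_like(x), x, x ** 2], axis=1)

    x_train = np.linspace(-2, 2, 15)
    y_train = 1.0 - 0.5 * x_train + 0.2 * x_train ** 2 + 0.1 * rng.standard_normal(15)
    x_test = np.array([-1.5, 0.0, 2.5])

    s2, sigma2 = 2.0, 0.1     # prior weight variance s^2 and noise variance sigma^2

    # Kernel ridge regression, eq. (26): k(x_i, x_j) = phi(x_i)^T phi(x_j), lambda = sigma^2 / s^2.
    K = phi(x_train) @ phi(x_train).T
    k_star = phi(x_test) @ phi(x_train).T
    f_ridge = k_star @ np.linalg.solve(K + (sigma2 / s2) * np.eye(len(x_train)), y_train)

    # GP predictive mean, eq. (22), with the weight variance absorbed into the kernel, eq. (29):
    # k0(x_i, x_j) = s^2 phi(x_i)^T phi(x_j).
    K0 = s2 * K
    k0_star = s2 * k_star
    f_gp = k0_star @ np.linalg.solve(K0 + sigma2 * np.eye(len(x_train)), y_train)

    print(np.allclose(f_ridge, f_gp))   # True: the two predictions coincide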

Figure 1: Estimated functions for different values of λ, going from small values (green curves) to large values (red curves).

The relationship between Gaussian processes and kernel ridge regression has consequences for how we should think of regularization in both of these methods. As explained in the previous lecture, λ is the parameter used to tune the amount of regularization, making the estimated functions smoother or less smooth. Taking Fig. 1 as an example, we can see how high values of λ (high regularization) considerably smooth the estimated functions, whereas the exact opposite can be observed for small values of λ (low regularization). In the examples of regression considered previously in class, the parameter λ was tuned explicitly as

    \lambda = \frac{\sigma^2}{s^2}.    (31)

Once a kernel-based regression is applied, however, the kernel function that is employed intrinsically implies a choice of λ,

    \lambda = \frac{\sigma^2}{s^2},    (32)

where s^2 is now determined by the scale of the kernel rather than tuned directly. Looking back at Eq. (29), we see that choosing the kernel implies choosing S_0 and, in turn, choosing λ. This means that when we are performing kernelized regression with respect to some explicit set of features \phi_d(\vec{x}), we need to define a kernel k(\vec{x}, \vec{x}') = s^2 \phi(\vec{x})^\top \phi(\vec{x}') in order to obtain results that are fully equivalent to performing ridge regression. More generally speaking, any choice of kernel in Gaussian process regression implies regularization in the form of smoothness assumptions that are encoded by the kernel.

3 Kernel Hyperparameters

The degree of smoothness imposed by a kernel is controlled by the kernel hyperparameters. A couple of examples of different kernels will be presented in order to show how the choice of their parameters affects smoothness.

The squared-exponential kernel is defined by the following expression:

    k(x, x') = \exp\!\left( -\frac{(x - x')^2}{2 l^2} \right).    (33)

Figure 2: Squared-exponential kernel for different values of l. The mean posterior predictive function is plotted for three different length scales ("too long", "about right", "too short"; the blue curve corresponds to optimizing the marginal likelihood). An almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this. (Slide by Carl Edward Rasmussen.)

Fig. 2 shows different estimated functions for different values of l. We can see how the chosen kernel impacts the regularization of the model. For this particular kernel, larger values of l mean stronger regularization.

The Matérn kernel is defined by the following expression:

    k_v(\vec{x}, \vec{x}') = \sigma_f^2\, \frac{2^{1-v}}{\Gamma(v)} \left( \frac{\sqrt{2v}\, \|\vec{x} - \vec{x}'\|}{l} \right)^{v} K_v\!\left( \frac{\sqrt{2v}\, \|\vec{x} - \vec{x}'\|}{l} \right), \qquad v = \tfrac{1}{2}, \tfrac{3}{2}, \tfrac{5}{2}, \ldots    (34)

where \Gamma(v) and K_v are the Gamma and (modified) Bessel functions respectively.

Figure 3: Matérn kernel for different values of l (effect of changing the length scale for a Matérn-5/2 kernel).

Observing Fig. 3, we can see again how the kernel function is the one responsible for tuning the regularization.
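
As a quick illustration of the role of the length scale (not part of the original notes; it assumes scipy is available for the Gamma and Bessel functions, and the grid of inputs is arbitrary), the sketch below implements the squared-exponential kernel (33) and the Matérn kernel (34), and prints how quickly the squared-exponential covariance between neighbouring inputs decays for different values of l:

    import numpy as np
    from scipy.special import gamma, kv   # Gamma function and modified Bessel function K_v

    def sq_exp_kernel(x1, x2, lengthscale=1.0):
        """Squared-exponential kernel, eq. (33)."""
        diff = x1[:, None] - x2[None, :]
        return np.exp(-0.5 * (diff / lengthscale) ** 2)

    def matern_kernel(x1, x2, lengthscale=1.0, v=1.5, sigma_f=1.0):
        """Matern kernel, eq. (34); v = 1/2, 3/2, 5/2, ..."""
        r = np.abs(x1[:, None] - x2[None, :])
        scaled = np.sqrt(2 * v) * r / lengthscale
        scaled = np.where(scaled == 0.0, 1e-12, scaled)   # K_v(0) is singular; use a tiny offset
        return sigma_f ** 2 * (2 ** (1 - v) / gamma(v)) * scaled ** v * kv(v, scaled)

    # Shorter length scales give a covariance that decays faster with distance,
    # i.e. weaker regularization and wigglier sample functions.
    x = np.linspace(0, 5, 6)
    for l in (0.25, 1.0, 4.0):
        print(l, np.round(sq_exp_kernel(x, x, lengthscale=l)[0, 1], 3))
    print(np.round(matern_kernel(x, x, lengthscale=1.0)[0, 1], 3))   # Matern-3/2 at distance 1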

4 Basic Kernels and Combinations of Kernels

Depending on the problem that is being tackled, it may be of interest to use a kernel whose properties reflect our prior knowledge about the problem domain. See Fig. 4 for a few examples and their main properties.

Figure 4: Examples of basic kernels: squared-exponential (SE, local variation), periodic (Per, repeating structure), and linear (Lin, linear functions), together with samples from the corresponding GP priors. (From Duvenaud, 2014, Figure 2.1.)

Given a particular problem, we may want to specify some critical properties of the function's behaviour by using a combination of kernels. Kernel functions have a set of properties that allow combining them in multiple ways (see the code sketch below). The most important ones are:

Sum: k(\vec{x}, \vec{x}') = k_1(\vec{x}, \vec{x}') + k_2(\vec{x}, \vec{x}')

Product: k(\vec{x}, \vec{x}') = k_1(\vec{x}, \vec{x}')\, k_2(\vec{x}, \vec{x}')

Product spaces: For any \vec{z} = (\vec{x}, \vec{y}): k(\vec{z}, \vec{z}') = k_1(\vec{x}, \vec{x}') + k_2(\vec{y}, \vec{y}') and k(\vec{z}, \vec{z}') = k_1(\vec{x}, \vec{x}')\, k_2(\vec{y}, \vec{y}')

Vertical rescaling: k(\vec{x}, \vec{x}') = a(\vec{x})\, k_1(\vec{x}, \vec{x}')\, a(\vec{x}'), for any function a(\vec{x})

In Fig. 5, different kernel combinations and the properties they give rise to can be observed.

Figure 5: Examples of different combinations of kernels: Lin x Lin (quadratic functions), SE x Per (locally periodic), Lin x SE (increasing variation), Lin x Per (growing amplitude). (From Duvenaud, 2014, Figure 2.2.)
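
As a minimal illustration of these closure properties (not from the original notes; the particular kernels and parameter values are assumptions), the following sketch builds sum and product kernels from basic SE, periodic, and linear kernels and checks that the resulting Gram matrix is a valid covariance:

    import numpy as np

    def sq_exp(x1, x2, l=1.0):
        return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / l) ** 2)

    def periodic(x1, x2, l=1.0, p=2.0):
        d = np.abs(x1[:, None] - x2[None, :])
        return np.exp(-2.0 * np.sin(np.pi * d / p) ** 2 / l ** 2)

    def linear(x1, x2, c=0.0):
        return (x1[:, None] - c) * (x2[None, :] - c)

    # Sums and products of kernels are again valid kernels.
    def k_sum(k1, k2):
        return lambda x1, x2: k1(x1, x2) + k2(x1, x2)

    def k_prod(k1, k2):
        return lambda x1, x2: k1(x1, x2) * k2(x1, x2)

    locally_periodic = k_prod(periodic, sq_exp)   # Per x SE: repeating structure that varies locally
    quadratic = k_prod(linear, linear)            # Lin x Lin: prior over quadratic functions

    x = np.linspace(-3, 3, 50)
    K = locally_periodic(x, x)
    eigvals = np.linalg.eigvalsh(K + 1e-8 * np.eye(len(x)))
    print(eigvals.min() >= 0)                                          # positive semi-definite
    print(np.allclose(quadratic(x, x), (x[:, None] * x[None, :]) ** 2))  # Lin x Lin in closed form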

5 Inner Product

To motivate the concept of an inner product, think of vectors in R^2 and R^3 as arrows with initial point at the origin. The length of a vector x in R^2 or R^3 is called the norm of x, denoted by ||x||. Thus for x = (x_1, x_2) ∈ R^2, we have ||x|| = \sqrt{x_1^2 + x_2^2}. Similarly, if x = (x_1, x_2, x_3) ∈ R^3, then ||x|| = \sqrt{x_1^2 + x_2^2 + x_3^2}. More generally, we define the norm of x = (x_1, ..., x_n) ∈ R^n by

    \|x\| = \sqrt{x_1^2 + \ldots + x_n^2}.    (35)

The norm is not linear on R^n. To inject linearity into the discussion, we introduce the dot product.

Definition 1. For x, y ∈ R^n, the dot product of x and y, denoted x · y, is defined by

    x \cdot y := x_1 y_1 + \ldots + x_n y_n,    (36)

where x := (x_1, ..., x_n) and y := (y_1, ..., y_n).

The dot product on R^n has the following properties:
- x · x ≥ 0 for all x ∈ R^n;
- x · x = 0 if and only if x = 0;
- for y ∈ R^n fixed, the map from R^n to R that sends x ∈ R^n to x · y is linear;
- x · y = y · x for all x, y ∈ R^n.

An inner product is a generalization of the dot product to all kinds of vector spaces, not just real vector spaces.

Definition 2. An inner product on V is a function that maps each ordered pair (u, v) of elements of V to a scalar ⟨u, v⟩ ∈ F and has the following properties:
- positivity: ⟨v, v⟩ ≥ 0 for all v ∈ V;
- definiteness: ⟨v, v⟩ = 0 if and only if v = 0;
- additivity in the first slot: ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩ for all u, v, w ∈ V;
- homogeneity in the first slot: ⟨λu, v⟩ = λ⟨u, v⟩ for all λ ∈ F and all u, v ∈ V;
- conjugate symmetry: ⟨u, v⟩ = \overline{⟨v, u⟩} for all u, v ∈ V;

where V is any vector space and F is any scalar field (real or complex).
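
A tiny numerical check of the dot product properties in Definition 1 (not part of the original notes; the random vectors are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    x, y, z = rng.standard_normal((3, 5))   # three arbitrary vectors in R^5
    a, b = 2.0, -1.5

    print(np.dot(x, x) >= 0)                                 # positivity: x . x >= 0
    print(np.isclose(np.dot(a * x + b * z, y),
                     a * np.dot(x, y) + b * np.dot(z, y)))   # linearity in the first slot
    print(np.isclose(np.dot(x, y), np.dot(y, x)))            # symmetry: x . y = y . x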

5.1 Examples: Inner Product

1. The Euclidean inner product on F^n is defined by

    \langle (w_1, \ldots, w_n), (z_1, \ldots, z_n) \rangle = w_1 z_1 + \ldots + w_n z_n.    (37)

2. An inner product can be defined on the vector space of continuous real-valued functions on the interval [a, b] by

    \langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx.    (38)

An inner product space is a vector space on which an inner product is defined.

6 Hilbert Spaces and Kernels

6.1 Hilbert Spaces

Any vector space V with an inner product ⟨·, ·⟩ defines a norm by ||v|| = \sqrt{⟨v, v⟩}. Similarly, it also defines a metric, d(x, y) = ||x − y||. A sequence of elements {v_n} in V is called a Cauchy sequence if, for every positive real number ε, there is a positive integer N such that for all m, n > N,

    \|v_m - v_n\| < \epsilon.    (39)

An inner product space is also called a pre-Hilbert space.

Definition 3. An inner product space H is called a Hilbert space if it is a complete metric space, i.e., if {h_n} is a Cauchy sequence in H, then there exists h ∈ H with

    \|h - h_n\| \to 0 \quad \text{as } n \to \infty.    (40)

6.2 Kernels

Definition 4. A function k : X × X → R is a kernel if there is a function φ : X → H such that for all x, x' ∈ X,

    k(x, x') = \langle \phi(x), \phi(x') \rangle_H.    (41)

Definition 5. Let K denote the scalar field of either the real or the complex numbers. Let K^N be the set of all sequences of scalars

    (x_n)_{n \in \mathbb{N}}, \quad x_n \in K.    (42)

A sequence space is defined as any linear subspace of K^N for which vector addition and scalar multiplication are defined by

    (x_n)_{n \in \mathbb{N}} + (y_n)_{n \in \mathbb{N}} := (x_n + y_n)_{n \in \mathbb{N}},    (43)

    \alpha (x_n)_{n \in \mathbb{N}} := (\alpha x_n)_{n \in \mathbb{N}}.    (44)

Definition 6. \ell^2 is the subspace of K^N consisting of all sequences x = (x_n) satisfying

    \sum_{n=1}^{\infty} |x_n|^2 < \infty.    (45)

Given a sequence \{\phi_d(x)\}_{d \geq 1} in \ell^2, where \phi_d : X \to R is the d-th coordinate,

    k(x, x') := \sum_{d=1}^{\infty} \phi_d(x)\, \phi_d(x') = \langle \phi(x), \phi(x') \rangle_H.    (46)
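
To make equation (46) concrete, here is a small sketch (not from the original notes; the particular feature sequence is an illustrative assumption chosen so that it lies in ℓ²) that builds a kernel from a truncated feature expansion and checks that the resulting Gram matrix is symmetric and positive semi-definite:

    import numpy as np

    # Illustrative feature sequence phi_d(x) = sin(d x) / d, chosen so that
    # (phi_d(x))_{d>=1} lies in l^2; we truncate the sum in eq. (46) at D terms.
    def features(x, D=200):
        d = np.arange(1, D + 1)
        return np.sin(np.outer(x, d)) / d

    def kernel(x1, x2, D=200):
        # k(x, x') ~= sum_{d=1}^{D} phi_d(x) phi_d(x') = <phi(x), phi(x')>
        return features(x1, D) @ features(x2, D).T

    x = np.array([0.3, 1.0, 2.5])
    K = kernel(x, x)
    print(np.allclose(K, K.T))                       # Gram matrix is symmetric
    print(np.all(np.linalg.eigvalsh(K) >= -1e-12))   # and positive semi-definite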

6.3 Positive Definiteness

Theorem 1. If H is a Hilbert space, φ : X → H is a feature map, and X is a non-empty set, then

    k(x, x') := \langle \phi(x), \phi(x') \rangle_H \quad \text{is positive definite.}    (47)

6.4 Reproducing Kernels

Definition 7. Suppose that H is a Hilbert space of functions f : X → R. Then H is a Reproducing Kernel Hilbert Space (RKHS) when

    k(\cdot, x) \in H, \quad \forall x \in X,    (48)

    \langle f(\cdot), k(\cdot, x) \rangle_H = f(x).    (49)

This is also called the kernel trick, or the reproducing property.

Theorem 2 (Moore-Aronszajn). Suppose K is a symmetric, positive definite kernel on a set X. Then there is a unique Hilbert space of functions on X for which K is a reproducing kernel.

6.5 Function Space Equivalence Classes

There is an equivalence between reproducing kernels, positive definite functions, and Hilbert function spaces with bounded point evaluation. Every reproducing kernel is also a positive definite function and is associated with a Hilbert function space with bounded point evaluation.

6.6 Example: XOR

We can see what a reproducing kernel Hilbert space is using a simple XOR example. Let us consider a feature space and feature map as defined below,

    \phi : \mathbb{R}^2 \to \mathbb{R}^3,    (50)

    \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},    (51)

    \phi(\vec{x}) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix},    (52)

with the kernel defined as

    k(\vec{x}, \vec{y}) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}^\top \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix} = \langle \phi(\vec{x}), \phi(\vec{y}) \rangle_{\mathbb{R}^3}.    (53)

This feature space is a Hilbert space H because an inner product is defined on it, given by the dot product. Let us now define a function of the features x_1, x_2, x_1 x_2 of \vec{x} as

    f(\vec{x}) = w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 = \langle \vec{w}, \phi(\vec{x}) \rangle_{\mathbb{R}^3}.    (54)

This function is a member of a space of functions mapping from X = \mathbb{R}^2 to \mathbb{R}. We can define an equivalent representation for f,

    f(\cdot) = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix},    (55)

where the notation f(\cdot) refers to the function itself and the notation f(x) \in \mathbb{R} refers to the function evaluated at a particular point. Then we can write

    f(x) = f(\cdot)^\top \phi(x),    (56)

    f(x) := \langle f(\cdot), \phi(x) \rangle_H.    (57)

In other words, the evaluation of f at x can be written as an inner product in feature space. Moreover, we can express the kernel function in terms of the feature map, using the same convention as above, as

    k(\cdot, \vec{y}) = \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix} = \phi(\vec{y}),    (58)

where w_1 = y_1, w_2 = y_2, and w_3 = y_1 y_2. Due to symmetry we can also write

    \langle k(\cdot, \vec{x}), \phi(\vec{y}) \rangle = x_1 y_1 + x_2 y_2 + (x_1 x_2)(y_1 y_2) = k(\vec{x}, \vec{y}).    (59)

In other words, \phi(\vec{x}) = k(\cdot, \vec{x}) and \phi(\vec{y}) = k(\cdot, \vec{y}). This way of writing the feature map is called the canonical feature map.

Reproducing property: f(\vec{x}) = \langle \vec{w}, \phi(\vec{x}) \rangle_{\mathbb{R}^3} = \langle f(\cdot), \phi(\vec{x}) \rangle_H, with \phi(\vec{x}) = k(\cdot, \vec{x}).

Here, we use positive definite kernels to define functions on X. The space of such functions is known as a reproducing kernel Hilbert space.
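
The XOR example above is easy to check numerically. The following sketch (not part of the original notes; the input points and coefficients are arbitrary) implements the feature map (52), the kernel (53), and verifies the evaluation and reproducing properties (54) and (59):

    import numpy as np

    def phi(x):
        """Feature map from the XOR example, eq. (52): (x1, x2, x1*x2)."""
        return np.array([x[0], x[1], x[0] * x[1]])

    def k(x, y):
        """Kernel k(x, y) = <phi(x), phi(y)>, eq. (53)."""
        return phi(x) @ phi(y)

    x = np.array([0.5, -2.0])
    y = np.array([1.5, 3.0])
    w = np.array([0.2, -0.7, 1.1])   # coefficients of f(.) in eq. (55)

    f_x = w @ phi(x)                 # f evaluated at x, eq. (54)
    print(np.isclose(f_x, 0.2 * x[0] - 0.7 * x[1] + 1.1 * x[0] * x[1]))   # True

    # Reproducing property: the canonical feature map is phi(y) = k(., y),
    # so <k(., x), k(., y)> = k(x, y), eq. (59).
    print(np.isclose(phi(x) @ phi(y), k(x, y)))   # True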

6.7 Implications of RKHS for Gaussian Processes

Suppose that we have a random function f that is distributed according to a Gaussian process prior

    f \sim \mathrm{GP}(0, k(\vec{x}, \vec{x}')).    (60)

Since k(\vec{x}, \vec{x}') is a positive definite kernel, f lies in an RKHS. To see this, note that we can express the GP posterior as

    f \mid \vec{y} \sim \mathrm{GP}(\mu^*(\vec{x}), k^*(\vec{x}, \vec{x}')),    (61)

where the posterior mean is defined as

    \mu^*(\cdot) = k(\cdot, X) \left( k(X, X) + \sigma^2 I \right)^{-1} \vec{y}    (62)

               = \sum_{n,m} k(\cdot, \vec{x}_n) \left[ \left( k(X, X) + \sigma^2 I \right)^{-1} \right]_{nm} y_m.    (63)

In other words, we can express the mean function of the posterior as

    \mu^*(\cdot) = \sum_n k(\cdot, \vec{x}_n)\, w_n,    (64)

    w_n = \sum_m \left[ \left( k(X, X) + \sigma^2 I \right)^{-1} \right]_{nm} y_m.    (65)

In summary, the choice of kernel in a Gaussian process implies a choice of RKHS, which constrains GP regression to solutions that lie inside this RKHS.
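
As a closing illustration (not part of the original notes; the kernel, noise variance, and toy data are assumptions), the following sketch computes the weights w_n from equation (65) and evaluates the posterior mean function (64) at arbitrary inputs, making explicit that the GP posterior mean is a finite combination of kernel functions centred at the training inputs:

    import numpy as np

    def sq_exp_kernel(x1, x2, lengthscale=1.0):
        diff = x1[:, None] - x2[None, :]
        return np.exp(-0.5 * (diff / lengthscale) ** 2)

    rng = np.random.default_rng(4)
    x_train = np.linspace(-3, 3, 20)
    y_train = np.sin(x_train) + 0.3 * rng.standard_normal(20)
    sigma2 = 0.1

    # Weights w_n of the kernel expansion, eq. (65).
    K = sq_exp_kernel(x_train, x_train)
    w = np.linalg.solve(K + sigma2 * np.eye(len(x_train)), y_train)

    # Posterior mean evaluated at arbitrary points, eq. (64): mu*(x) = sum_n k(x, x_n) w_n.
    def posterior_mean(x_new):
        return sq_exp_kernel(x_new, x_train) @ w

    print(posterior_mean(np.array([-1.0, 0.0, 1.0])))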