Reproducing Kernel Hilbert Spaces for Penalized Regression: A tutorial


Alvaro Nosedal-Sanchez (a), Curtis B. Storlie (b), Thomas C.M. Lee (c), Ronald Christensen (d)
(a) Indiana University of Pennsylvania, (b) Los Alamos National Laboratory, (c) University of California, Davis, (d) University of New Mexico

December 8, 2010

Abstract

Penalized regression procedures have become very popular ways to estimate complex functions. The smoothing spline, for example, is the solution of a minimization problem in a functional space. If such a minimization problem is posed on a reproducing kernel Hilbert space (RKHS), the solution is guaranteed to exist, is unique, and has a very simple form. There are excellent books and articles about RKHSs and their applications in statistics; however, this existing literature is very dense. Our intention is to provide a friendly reference for a reader approaching this subject for the first time. This paper begins with a simple problem, a system of linear equations, and then gives an intuitive motivation for reproducing kernels. Armed with the intuition gained from our first examples, we take the reader from vector spaces, to Banach spaces, and to RKHSs. Finally, we present some statistical estimation problems that can be solved using the mathematical machinery discussed. After reading this tutorial, the reader will be ready to study more advanced texts and articles about the subject, such as Wahba (1990) or Gu (2002).

1 Introduction

Penalized regression procedures have become a very popular approach to estimating complex functions (Wahba 1990, Eubank 1999, Hastie, Tibshirani & Friedman 2001). These procedures use an estimator that is defined as the solution to a minimization problem. In any minimization problem, there are the following questions: Does the solution exist? If yes, is the solution unique? How can we find it? If the problem is posed in the reproducing kernel Hilbert space (RKHS) framework that we discuss below, then the solution is guaranteed to exist, it is unique, and it takes a particularly simple form. Reproducing kernel Hilbert spaces and reproducing kernels play a central role in penalized regression. In the first section of this tutorial we present, and solve, simple problems while gently introducing key concepts. The remainder of the paper is organized as follows. Section 2 takes the reader from a basic understanding of fields through Banach spaces and Hilbert spaces.

In Section 3, we provide elementary theory of RKHSs along with some examples. Section 4 discusses penalized regression with RKHSs. Two specific examples involving ridge regression and smoothing splines are given, with R code to solidify the concepts. Some methods for smoothing parameter selection are briefly mentioned at the end. Section 5 contains some closing remarks.

1.1 Why should we care about Reproducing Kernel Hilbert Spaces?

Before introducing new concepts, we present some simple illustrations of the tools used to solve problems in the RKHS framework. Consider solving the system of linear equations

x_1 + x_3 = 0,   (1)
x_2 = 1.   (2)

Clearly the real-valued solutions to this system are the vectors x^t = (-α, 1, α) for α ∈ R. Suppose we want to find the smallest solution. Under the usual squared norm ||x||^2 = x_1^2 + x_2^2 + x_3^2, the smallest solution is x_s^t = (0, 1, 0).

Now consider a more general problem. For a given p × n matrix R and n × 1 matrix η, solve

R^t x = η,   (3)

where R^t is the transpose of R, x and the columns of R, say R_k, k = 1, 2, ..., n, are all in R^p, and η ∈ R^n. We wish to find the solution x_s that minimizes the norm ||x|| = sqrt(x^t x). We solve the problem using concepts that extend to RKHSs.

A solution x* (not necessarily a minimum norm solution) exists whenever η ∈ C(R^t). Here C(R^t) denotes the column space of R^t. Given one solution x*, all solutions x must satisfy R^t x = R^t x*, or R^t(x - x*) = 0. The vector x* can be written uniquely as x* = x_0 + x_1 with x_0 ∈ C(R) and x_1 ∈ C(R)^⊥, where C(R)^⊥ is the orthogonal complement of C(R). Clearly, x_0 is a solution because R^t(x* - x_0) = R^t x_1 = 0. In fact, x_0 is both the unique solution in C(R) and the minimum norm solution. If x̃_0 is any other solution in C(R), then R^t(x_0 - x̃_0) = 0, so we have both (x_0 - x̃_0) ∈ C(R) and (x_0 - x̃_0) ⊥ C(R), two sets whose intersection is only the 0 vector. Thus x_0 - x̃_0 = 0 and x_0 = x̃_0. In other words, every solution x has the same x_0 vector. Finally, x_0 is also the minimum norm solution because the arbitrary solution x has x^t x = x_0^t x_0 + x_1^t x_1 ≥ x_0^t x_0. We have established the existence of a unique, minimum norm solution in C(R) that can be written as

x_s ≡ x_0 = Rξ = Σ_{k=1}^n ξ_k R_k,   (4)

for some ξ_k, k = 1, ..., n.
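The small system above can also be solved numerically. The following R sketch is ours, not part of the original text; it forms R and η for the system (1)-(2) and solves the normal equations obtained by substituting (4) into (3), i.e. R^t R ξ = η:

# Minimum norm solution of R^t x = eta for the system (1)-(2)
R.mat <- cbind(c(1, 0, 1),               # column R_1
               c(0, 1, 0))               # column R_2
eta   <- c(0, 1)
xi    <- solve(t(R.mat) %*% R.mat, eta)  # solves R^t R xi = eta; gives xi = (0, 1)
x.s   <- R.mat %*% xi                    # x_s = R xi = (0, 1, 0), the minimum norm solution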

To find x_s explicitly, write x_s = Rξ and the defining equation (3) becomes

R^t R ξ = η,   (5)

which is just a system of linear equations. Even if there exist multiple solutions ξ, Rξ is unique.

Now we use this framework to find the smallest solution to the system of equations (1) and (2). In the general framework we have x^t = (x_1, x_2, x_3), η^t = (0, 1), R_1^t = (1, 0, 1), R_2^t = (0, 1, 0). We know that the solution has the form (4) and we also know that we have to solve a system of equations given by (5). In this case, the system of equations is

2ξ_1 + 0ξ_2 = 0,
0ξ_1 + 1ξ_2 = 1.

The solution to the system is (ξ_1, ξ_2) = (0, 1), which implies that our solution to the original problem is x_s = 0R_1 + 1R_2 = (0, 1, 0)^t, as expected.

Virtually the same methods can be used to solve a similar problem in any inner-product space Ω. As discussed later, an inner product ⟨·, ·⟩ assigns real numbers to pairs of vectors. For given vectors R_k ∈ Ω and numbers η_k ∈ R, find x ∈ Ω such that

⟨R_k, x⟩ = η_k, k = 1, 2, ..., n,   (6)

for which the norm of x, ||x|| = sqrt(⟨x, x⟩), is minimal. The solution has the form

x_s = Σ_{k=1}^n ξ_k R_k,   (7)

with the ξ_k satisfying the linear equations

Σ_{k=1}^n ⟨R_i, R_k⟩ ξ_k = η_i, i = 1, ..., n.

For a formal proof see Máté (1990). In RKHS applications, vectors are typically functions. We now apply this result to the interpolating spline problem.

Interpolating Splines. Suppose we want to find a function f(t) that interpolates between the points (t_k, η_k), k = 0, 1, 2, ..., n, where η_0 = 0 and 0 = t_0 < t_1 < ... < t_n = 1. We restrict attention to functions f ∈ F where

F = {f : f is absolutely continuous on [0, 1], f(0) = 0, f' ∈ L_2[0, 1]}.

Throughout, f^(m) denotes the m-th derivative of f, with f' ≡ f^(1) and f'' ≡ f^(2). The restriction that η_0 = f(0) = 0 is not really necessary, but simplifies the presentation. We want to find the smoothest function f(t) that satisfies f(t_k) = η_k, k = 1, ..., n. Defining an inner product on F by

⟨f, g⟩ = ∫_0^1 f'(x) g'(x) dx

implies a norm over the space F that is small for smooth functions. To address the interpolation problem, note that the functions R_k(s) ≡ min(s, t_k), k = 1, 2, ..., n, have R_k(0) = 0 and the property that ⟨R_k, f⟩ = f(t_k), because

⟨f, R_k⟩ = ∫_0^1 f'(s) R_k'(s) ds = ∫_0^{t_k} f'(s) · 1 ds + ∫_{t_k}^1 f'(s) · 0 ds = ∫_0^{t_k} f'(s) ds = f(t_k) - f(0) = f(t_k).

Thus, an interpolator f satisfies a system of equations like (6), namely

f(t_k) = ⟨R_k, f⟩ = η_k, k = 1, ..., n,   (8)

and by (7), the smoothest function f (minimum norm) that satisfies the requirements has the form

f̂(t) = Σ_{k=1}^n ξ_k R_k(t).

The ξ_j's are the solutions to the system of real linear equations obtained by substituting f̂ into (8),

Σ_{j=1}^n ⟨R_k, R_j⟩ ξ_j = η_k, k = 1, 2, ..., n.

Note that

⟨R_k, R_j⟩ = R_j(t_k) = R_k(t_j) = min(t_k, t_j),

and define the function

R(s, t) = min(s, t),

which turns out to be a reproducing kernel.

Numerical example. Given points f(t_i) = η_i, say, f(0) = 0, f(.1) = .1, f(.25) = 1, f(.5) = 2, f(.75) = 1.5, and f(1) = 1.75, we find

arg min_{f ∈ F} ||f||^2 = ∫_0^1 f'(x)^2 dx.

The system of equations is

.1ξ_1 + .1ξ_2 + .1ξ_3 + .1ξ_4 + .1ξ_5 = .1
.1ξ_1 + .25ξ_2 + .25ξ_3 + .25ξ_4 + .25ξ_5 = 1
.1ξ_1 + .25ξ_2 + .5ξ_3 + .5ξ_4 + .5ξ_5 = 2
.1ξ_1 + .25ξ_2 + .5ξ_3 + .75ξ_4 + .75ξ_5 = 1.5
.1ξ_1 + .25ξ_2 + .5ξ_3 + .75ξ_4 + 1ξ_5 = 1.75

The solution is ξ = (-5, 2, 6, -3, 1)^t, which implies that our function is

f̂(t) = -5R_1(t) + 2R_2(t) + 6R_3(t) - 3R_4(t) + 1R_5(t)   (9)
     = -5R(t, t_1) + 2R(t, t_2) + 6R(t, t_3) - 3R(t, t_4) + 1R(t, t_5)

or, adding the slopes for t > t_i and finding the intercepts,

f̂(t) = t          for 0 ≤ t ≤ .1
       6t - .5     for .1 ≤ t ≤ .25
       4t          for .25 ≤ t ≤ .5
       -2t + 3     for .5 ≤ t ≤ .75
       t + .75     for .75 ≤ t ≤ 1.

This is the linear interpolating spline, as can be seen graphically in Figure 1.
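This computation is easy to reproduce numerically. The short R sketch below is ours (the names t.k, eta, K and xi are not from the paper); it builds the Gram matrix ⟨R_k, R_j⟩ = min(t_k, t_j), solves for ξ, and evaluates the interpolant:

# Linear interpolating spline for the numerical example, via the kernel R(s,t) = min(s,t)
t.k <- c(.1, .25, .5, .75, 1)      # knots t_1, ..., t_5
eta <- c(.1, 1, 2, 1.5, 1.75)      # interpolation values eta_1, ..., eta_5
K   <- outer(t.k, t.k, pmin)       # Gram matrix with entries min(t_k, t_j)
xi  <- solve(K, eta)               # coefficients; gives (-5, 2, 6, -3, 1)
f.hat <- function(t) colSums(xi * outer(t.k, t, pmin))  # f.hat(t) = sum_k xi_k * min(t, t_k)
f.hat(t.k)                         # reproduces eta: .1, 1, 2, 1.5, 1.75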

For this illustration we restricted f so that f(0) = 0. This was only for convenience of presentation. It can be shown that the form of the solution remains the same with any shift to the function, so that in general the solution takes the form f̂(t) = ξ_0 + Σ_{j=1}^n ξ_j R_j(t), where ξ_0 = η_0. The key points are (i) the elements R_i that allow us to express a function evaluated at a point as an inner-product constraint, and (ii) the restriction to functions in F. F is a very special function space, a reproducing kernel Hilbert space, and R_i is determined by a reproducing kernel R.

Ultimately, our goal is to address more complicated regression problems like the Linear Smoothing Spline Problem. Consider simple regression data (x_i, y_i), i = 1, ..., n, and finding the function that minimizes

(1/n) Σ_{i=1}^n {y_i - f(x_i)}^2 + λ ∫_0^1 f'(x)^2 dx.   (10)

If f(x) is restricted to be in some class of functions F, minimizing only the first term gives least squares estimation within F. If F contains functions with f(x_i) = y_i for all i, such functions minimize the first term but are typically very unsmooth, i.e., they have a large second term. The second penalty term is minimized by a horizontal line, but that rarely has a small first term. As we will see in Section 4, for suitable F the minimizer takes the form

f̂(x) = ξ_0 + Σ_{k=1}^n ξ_k R_k(x),

where the R_k's are known functions and the ξ_k's are coefficients found by solving a system of linear equations. This produces a linear smoothing spline.

If our goal is only to derive the solution to the linear smoothing spline problem with one predictor variable, RKHS theory is overkill. The value of RKHS theory lies in its generality. The linear spline penalty can be replaced by any other penalty with an associated inner product, and the x_i's can be vectors in R^p. Using RKHS results, we can solve the general problem of finding the minimizer of

(1/n) Σ_{i=1}^n (y_i - f(x_i))^2 + λ J(f)

for a general functional J that corresponds to a squared norm in a subspace. See Wahba (1990) or Gu (2002) for a full treatment of this approach. We now present an introduction to this theory.

2 Vector, Banach and Hilbert Spaces

This section presents background material required for the formal development of the RKHS framework.

2.1 Vector Spaces

A vector space is a set that contains elements called vectors and supports two kinds of operations: addition of vectors and multiplication by scalars. The scalars are drawn from some field (the real numbers in the rest of this article) and the vector space is said to be a vector space over that field. Formally, a set V is a vector space over a field F if there exists a structure {V, F, +, ·, 0_v} consisting of V, F, a vector addition operation +, a scalar multiplication ·, and an identity element 0_v ∈ V. This structure must obey the following axioms for any u, v, w ∈ V and a, b ∈ F:

Associative Law: (u + v) + w = u + (v + w).
Commutative Law: u + v = v + u.
Inverse Law: there exists s ∈ V such that u + s = 0_v. (Write s = -u.)
Identity Laws: 0_v + u = u. 1 · u = u.
Distributive Laws: a · (b · u) = (a · b) · u. (a + b) · u = a · u + b · u. a · (u + v) = a · u + a · v.

We write 0 for 0_v ∈ V and write u + (-v) as u - v. Any subset of a vector space that is closed under vector addition and scalar multiplication is called a subspace.

The simplest example of a vector space is R itself, which is a vector space over R. Vector addition and scalar multiplication are just addition and multiplication on R. For more on vector spaces and the other topics to follow in this section see, for example, Naylor & Sell (1982), Young (1988), Máté (1990) and Rustagi (1994).

2.2 Banach Spaces

Definition. A norm on a vector space V, denoted by || · ||, is a nonnegative real valued function satisfying the following properties for all u, v ∈ V and all a ∈ R:

1. Non-negative: ||u|| ≥ 0.
2. Strictly positive: ||u|| = 0 implies u = 0.
3. Homogeneous: ||au|| = |a| ||u||.
4. Triangle inequality: ||u + v|| ≤ ||u|| + ||v||.

Definition. A vector space is called a normed vector space when a norm is defined on the space.

Definition. A sequence {v_n} in a normed vector space V is said to converge to v ∈ V if

lim_{n→∞} ||v_n - v|| = 0.

Definition. A sequence {v_n} ⊂ V is called a Cauchy sequence if for any given ϵ > 0, there exists an integer N such that ||v_m - v_n|| < ϵ whenever m, n ≥ N.

Convergence of sequences in normed vector spaces follows the same general idea as sequences of real numbers, except that the distance between two elements of the space is measured by the norm of the difference between the two elements.

Definition (Banach Space). A normed vector space V is called complete if every Cauchy sequence in V converges to an element of V. A complete normed vector space is called a Banach space.

Example 2.1. R with the absolute value norm ||x|| ≡ |x| is a complete, normed vector space over R, and is thus a Banach space.

Example 2.2. Let x = (x_1, ..., x_n)^t be a point in R^n. The l_p norm on R^n is defined by

||x||_p = [ Σ_{i=1}^n |x_i|^p ]^{1/p}

for 1 ≤ p < ∞. One can verify properties 1-4 at the beginning of this section for each p, validating that ||x||_p is a norm on R^n.

2.3 Hilbert Spaces

A Hilbert space is a Banach space in which the norm is defined by an inner product (also called a dot product), which we define below. We typically denote Hilbert spaces by H. For elements u, v ∈ H, write the inner product of u and v either as ⟨u, v⟩_H or, when it is clear by context that the inner product is taking place in H, as ⟨u, v⟩. If H is a vector space over F, the result of the inner product is an element of F. We have F = R, so the result of an inner product will be a real number. The inner product operation must satisfy four properties for all u, v, w ∈ H and all a ∈ F.

1. Associative: ⟨au, v⟩ = a⟨u, v⟩.
2. Commutative: ⟨u, v⟩ = ⟨v, u⟩.
3. Distributive: ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩.
4. Positive Definite: ⟨u, u⟩ ≥ 0, with equality holding only if u = 0.

Definition. A vector space with an inner product defined on it is called an inner-product space.

The norm of an element u in an inner-product space is taken as ||u|| = ⟨u, u⟩^{1/2}. Two vectors are said to be orthogonal if their inner product is 0, and two sets of vectors are said to be orthogonal if every vector in one is orthogonal to every vector in the other. The set of all vectors orthogonal to a subspace is called the orthogonal complement of the subspace. A complete inner-product space is called a Hilbert space.

Example 2.3. R^n with inner product defined by

⟨u, v⟩ ≡ u^t v = Σ_{i=1}^n u_i v_i

is a Hilbert space. For any positive definite matrix A, ⟨u, v⟩_A ≡ u^t A v also defines a valid inner product.

Example 2.4. Let L_2(a, b) be the vector space of all functions defined on the interval (a, b) that are square integrable, and define the inner product

⟨f, g⟩ ≡ ∫_a^b f(x) g(x) dx.

The inner-product space L_2(a, b) is well known to be complete; see de Barra (1981). Thus L_2(a, b) is a Hilbert space.

3 Reproducing Kernel Hilbert Spaces (RKHSs)

Hilbert spaces that display certain properties on certain linear operators are reproducing kernel Hilbert spaces.

Definition. A function T mapping a vector space X into another vector space Y is called a linear operator if T(λ_1 x_1 + λ_2 x_2) = λ_1 T(x_1) + λ_2 T(x_2) for any x_1, x_2 ∈ X and any λ_1, λ_2 ∈ R. Any m × n matrix A maps vectors in R^n into vectors in R^m via Ax = y and is linear.

Definition. The operator T : X → Y mapping a Banach space into a Banach space is continuous at x_0 ∈ X if and only if for every ϵ > 0 there exists δ = δ(ϵ) > 0 such that for every x with ||x - x_0|| < δ we have ||Tx - Tx_0|| < ϵ. Linear operators are continuous everywhere if they are continuous at 0.

Definition. A real valued function defined on a vector space is called a functional.

A 1 × n matrix defines a linear functional on R^n.

Example 3.1. Let S be the set of bounded real valued continuous functions {f(x)} defined on the real line. Then S is a vector space with the usual + and · operations for functions. Some functionals on S are ϕ(f) = ∫_a^b f(x) dx and ϕ_a(f) = f(a) for some fixed a.

A functional of particular importance is the evaluation functional.

Definition. Let V be a vector space of functions defined from E into R. For any t ∈ E, denote by e_t the evaluation functional at the point t, i.e., for g ∈ V, the mapping is e_t(g) = g(t).

Clearly, evaluation functionals are linear operators. In a Hilbert space (or any normed vector space) of functions, the notion of pointwise convergence is related to the continuity of the evaluation functionals. The following are equivalent for a normed vector space H of real valued functions.

(i) The evaluation functionals are continuous for all t ∈ E.
(ii) If f_n, f ∈ H and ||f_n - f|| → 0, then f_n(t) → f(t) for every t ∈ E.
(iii) For every t ∈ E there exists K_t > 0 such that |f(t)| ≤ K_t ||f|| for all f ∈ H.

Here (ii) is the definition of (i). See Máté (1990) for a proof of (iii).

To define a reproducing kernel, we need the famous Riesz Representation Theorem.

Theorem. Let H be a Hilbert space and let ϕ be a continuous linear functional on H. Then there is one and only one vector g ∈ H such that

ϕ(f) = ⟨f, g⟩, for all f ∈ H.

The vector g is sometimes called the representation of ϕ. However, ϕ and g are different objects: ϕ is a linear functional on H and g is a vector in H. For a proof of this theorem see Naylor & Sell (1982) or Máté (1990).

For H = R^p, vectors can be viewed as functions from the set E = {1, 2, ..., p} into R. An evaluation functional is e_i(x) = x_i. The representation of this linear functional is the indicator vector e_i that is 0 everywhere except for a 1 in the ith place. Then x_i = e_i(x) = x^t e_i. In fact, the entire representation theorem is well known in R^p because for ϕ(x) to be a linear functional there must exist a vector ϕ_0 such that ϕ(x) = ϕ_0^t x.

An element of a set of functions, say f, is sometimes denoted f(·) to be explicit that the elements are functions, whereas f(t) is the value of f(·) evaluated at t ∈ E.

Applying the Riesz Representation Theorem to a Hilbert space H of real valued functions in which evaluation functionals are continuous, for every t ∈ E there is a unique symmetric function R : E × E → R with R(·, t) ∈ H the representation of e_t, so that

f(t) = e_t(f) = ⟨f(·), R(·, t)⟩_H, for all f ∈ H.

The function R is called a reproducing kernel (r.k.) and f(t) = ⟨f(·), R(·, t)⟩ is called the reproducing property of R. In particular, by the reproducing property,

R(s, t) = ⟨R(·, t), R(·, s)⟩.

In Section 1 we found the r.k. for an interpolating spline problem. Other examples follow shortly.

Definition. A Hilbert space H of functions defined on E is called a reproducing kernel Hilbert space if all evaluation functionals are continuous.

The projection principle for a RKHS. Consider the connection between the reproducing kernel R of the RKHS H and the reproducing kernel R_0 for a subspace H_0 ⊂ H. Suppose H is an RKHS with r.k. R, H_0 is a closed subspace of H, and let H_1 be the orthogonal complement of H_0. Then any vector f ∈ H can be written uniquely as f = f_0 + f_1 with f_0 ∈ H_0 and f_1 ∈ H_1. More particularly, R(·, t) = R_0(·, t) + R_1(·, t) with R_0(·, t) ∈ H_0 and R_1(·, t) ∈ H_1 if and only if R_0 is the r.k. of H_0 and R_1 is the r.k. of H_1. For a proof see Gu (2002).

3.1 Examples of Reproducing Kernel Hilbert Spaces

Example 3.2. Consider the space of all constant functionals over x = (x_1, x_2, ..., x_p)^t ∈ R^p,

H = {f_θ : f_θ(x) = θ, θ ∈ R},

with ⟨f_θ, f_λ⟩ = θλ. (For simplicity, think of p = 1.) Since R^p is a Hilbert space, so is H. H has continuous evaluation functionals, so it is an RKHS and has a unique reproducing kernel. To find the r.k., observe that R(·, x) ∈ H, so it is a constant for any x. Write R(x) ≡ R(·, x). By the representation theorem and the defined inner product,

θ = f_θ(x) = ⟨f_θ(·), R(·, x)⟩ = θR(x)

for any x and θ. This implies that R(x) ≡ 1, so that R(·, x) = R(x) ≡ 1 and R(·, ·) ≡ 1.

Example 3.3. Consider all linear functionals over x ∈ R^p passing through the origin,

H = {f_θ : f_θ(x) = θ^t x, θ ∈ R^p}.

Define ⟨f_θ, f_λ⟩ = θ^t λ = θ_1 λ_1 + θ_2 λ_2 + ... + θ_p λ_p. The kernel R must satisfy

f_θ(x) = ⟨f_θ(·), R(·, x)⟩

for all θ and any x.

Since R(·, x) ∈ H, R(v, x) = u^t v for some u that depends on x, i.e., R(·, x) = f_{u(x)}(·), so R(v, x) = u(x)^t v. By our definition of H we have

θ^t x = f_θ(x) = ⟨f_θ(·), R(·, x)⟩ = ⟨f_θ(·), f_{u(x)}(·)⟩ = θ^t u(x),

so we need u(x) such that for any θ and x we have θ^t x = θ^t u(x). It follows that u(x) = x. For example, taking θ to be the indicator vector e_i implies that u_i(x) = x_i for every i = 1, ..., p. We now have R(·, x) = f_x(·), so that

R(x̃, x) = x̃^t x = x̃_1 x_1 + x̃_2 x_2 + ... + x̃_p x_p.

Example 3.4. Now consider all affine functionals in R^p,

H = {f_θ : f_θ(x) = θ_0 + θ_1 x_1 + ... + θ_p x_p, θ ∈ R^{p+1}},

with ⟨f_θ, f_λ⟩ = θ_0 λ_0 + θ_1 λ_1 + ... + θ_p λ_p. The subspace

H_0 = {f_θ ∈ H : θ_0 ∈ R, 0 = θ_1 = ... = θ_p}

has the orthogonal complement

H_1 = {f_θ ∈ H : 0 = θ_0}.

For practical purposes, H_0 is the space of constant functionals from Example 3.2 and H_1 is the space of linear functionals from Example 3.3. Note that the inner product on H, when applied to vectors in H_0 and H_1, respectively, reduces to the inner products used in Examples 3.2 and 3.3. Write H as H = H_0 ⊕ H_1, where ⊕ denotes the direct sum of two vector spaces. For two subspaces A and B contained in a vector space C, the direct sum is the space D = {a + b : a ∈ A, b ∈ B}. Any elements d_1, d_2 ∈ D can be written as a_1 + b_1 and a_2 + b_2, respectively, for some a_1, a_2 ∈ A and b_1, b_2 ∈ B. When the two subspaces are orthogonal, as in our example, those decompositions are unique and the inner product between d_1 and d_2 is ⟨d_1, d_2⟩ = ⟨a_1, a_2⟩ + ⟨b_1, b_2⟩. For more information about direct sum decomposition see Berlinet & Thomas-Agnan (2004) or Gu (2002), for example.

We have already derived the r.k.'s for H_0 and H_1 (call them R_0 and R_1, respectively) in Examples 3.2 and 3.3. Applying the projection principle, the r.k. for H is the sum of R_0 and R_1, i.e.,

R(x̃, x) = 1 + x̃^t x.

Example 3.5. Denote by V the collection of functions f with f'' ∈ L_2[0, 1] and consider the subspace

W_2^0 = {f(x) ∈ V : f, f' absolutely continuous and f(0) = f'(0) = 0}.

Define an inner product on W_2^0 as

⟨f, g⟩ = ∫_0^1 f''(t) g''(t) dt.   (11)

Below we demonstrate that for f ∈ W_2^0 and any s, f(s) can be written as

f(s) = ∫_0^1 (s - u)_+ f''(u) du,   (12)

where (a)_+ equals a for a > 0 and 0 for a ≤ 0. Given any arbitrary and fixed s ∈ [0, 1],

∫_0^1 (s - u)_+ f''(u) du = ∫_0^s (s - u) f''(u) du.

Integrating by parts,

∫_0^s (s - u) f''(u) du = (s - s) f'(s) - (s - 0) f'(0) + ∫_0^s f'(u) du = ∫_0^s f'(u) du,

and applying the Fundamental Theorem of Calculus to the last term,

∫_0^s f'(u) du = f(s) - f(0) = f(s).

Since the r.k. of the space W_2^0 must satisfy f(s) = ⟨f(·), R(·, s)⟩, from (11) and (12) we see that R(·, s) is a function such that

d^2 R(u, s) / du^2 = (s - u)_+.

We also know that R(·, s) ∈ W_2^0, so using R(s, t) = ⟨R(·, t), R(·, s)⟩,

R(s, t) = ∫_0^1 (t - u)_+ (s - u)_+ du = max(s, t) min^2(s, t) / 2 - min^3(s, t) / 6.

For further examples of RKHSs with various inner products, see Berlinet & Thomas-Agnan (2004).
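The closed form just derived is easy to check numerically; the following R sketch is ours, not part of the original text, and compares the integral definition of the kernel with the closed-form expression at one pair of points:

# Numerical check of the r.k. of W_2^0: integral form vs. closed form
R.closed <- function(s, t) max(s, t) * min(s, t)^2 / 2 - min(s, t)^3 / 6
R.integral <- function(s, t)
  integrate(function(u) pmax(t - u, 0) * pmax(s - u, 0), lower = 0, upper = 1)$value
s <- 0.3; t <- 0.7
c(R.integral(s, t), R.closed(s, t))   # both approximately 0.027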

4 Penalized Regression with RKHSs

Nonparametric regression is a powerful approach for solving many modern problems, such as computer modeling, image processing, and environmental monitoring, to name a few. The nonparametric regression model is given by

y_i = f(x_i) + ϵ_i, i = 1, 2, ..., n,

where f is an unknown regression function and the ϵ_i are independent error terms. We start this section with two common examples of penalized regression: ridge regression and smoothing splines.

Ridge Regression. In the classical linear regression setting y_i = x_i^t β + ϵ_i, the ridge regression estimator β̂_R proposed by Hoerl & Kennard (1970a) minimizes

min_β Σ_{i=1}^n ( y_i - β_0 - Σ_{j=1}^p β_j x_ij )^2 + λ Σ_{j=1}^p β_j^2,   (13)

where x_ij is the ith observation of the jth component. The resulting estimate is biased but can reduce the variance relative to least squares estimates. The tuning parameter λ is a constant that controls the trade-off between bias and variance in β̂_R, and is often selected by some form of cross validation; see Section 4.4.
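For readers who want to connect (13) with familiar formulas before the RKHS treatment in Section 4.3, the sketch below is ours, not from the paper. It leaves the intercept unpenalized by centering and then applies the standard closed form for the centered problem, β̂ = (X^t X + λI)^{-1} X^t y:

# Direct ridge regression solution of (13), with an unpenalized intercept (our sketch)
ridge.direct <- function(X, y, lambda){
  Xc <- scale(X, center = TRUE, scale = FALSE)          # center the predictors
  yc <- y - mean(y)                                     # center the response
  p  <- ncol(X)
  beta  <- solve(t(Xc) %*% Xc + lambda * diag(p), t(Xc) %*% yc)
  beta0 <- mean(y) - sum(colMeans(X) * beta)            # recover the intercept
  c(beta0, beta)
}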

Smoothing Splines. Smoothing splines are among the most popular methods for the estimation of f, due to their good empirical performance and sound theoretical support. It is often assumed, without loss of generality, that the domain of f is [0, 1]. With f^(m) the m-th derivative of f, a smoothing spline estimate f̂ is the unique minimizer of

Σ_{i=1}^n {y_i - f(x_i)}^2 + λ ∫_0^1 [ f^(m)(x) ]^2 dx.   (14)

The minimization of (14) is implicitly over functions with square integrable m-th derivatives. The first term of (14) encourages the fitted f to be close to the data, while the second term penalizes the roughness of f. The smoothing parameter λ, usually pre-specified, controls the trade-off between the two conflicting goals. The special case of m = 1 reduces to the linear smoothing spline problem from (10). In practice it is common to choose m = 2, in which case the minimizer f̂_λ of (14) is called a cubic smoothing spline. As λ → ∞, f̂_λ approaches the least squares simple linear regression line, while as λ → 0, f̂_λ approaches the minimum curvature interpolant.

4.1 Solving the General Penalized Regression Problem

We now review a general framework to minimize (13), (14) and many other similar problems, cf. (O'Sullivan, Yandell & Raynor 1986, Lin & Zhang 2006, Storlie, Bondell, Reich & Zhang 2010, Storlie, Bondell & Reich 2010, Gu & Qiu 1993). The data model is

y_i = f(x_i) + ϵ_i, i = 1, 2, ..., n,   (15)

where the ϵ_i are error terms and f ∈ V, a given vector space of functions on a set E. An estimate of f is obtained by minimizing

(1/n) Σ_{i=1}^n {y_i - f(x_i)}^2 + λ J(f)   (16)

over f ∈ V, where J is a penalty functional that must satisfy several restrictions and that helps to define the vector space V. We require (i) that J(f) ≥ 0 for any f in some vector space Ṽ; (ii) that the null set N = {f ∈ Ṽ : J(f) = 0} be a subspace, i.e., that it be closed under vector addition and scalar multiplication; (iii) that for f_N ∈ N and f ∈ Ṽ, we have J(f_N + f) = J(f); and (iv) that there exists an RKHS H contained in Ṽ for which the inner product satisfies ⟨f, f⟩ = J(f). This condition forces the intersection of N and H to contain only the zero vector. Define a finite dimensional subspace of N, say N_0, with a basis of known functions, say {ϕ_1, ..., ϕ_M}, and M ≤ n. In applications N is often finite dimensional, and we can simply take N_0 = N. In the minimization problem, we restrict attention to f ∈ V where V ≡ N_0 ⊕ H.

For Example 3.5, Ṽ consists of functions in L_2[0, 1] with finite values of J(f) ≡ ∫_0^1 [f''(t)]^2 dt. J(f) satisfies our four conditions with H = W_2^0. The linear functions f(x) = a + bx are in N, so we can take ϕ_1(x) ≡ 1 and ϕ_2(x) = x. Note that equation (11) defines an inner product on W_2^0 but does not define an inner product on all of Ṽ, because nonzero functions can have a zero inner product with themselves, hence the nontrivial nature of N.

The key result is that the minimizer of (16) is a linear combination of known functions involving the reproducing kernel on H. This fact will allow us to find the coefficients of the linear combination by solving a quadratic minimization problem similar to those in standard linear models.

Representation Theorem. The minimizer f̂_λ of equation (16) has the form

f̂_λ(x) = Σ_{j=1}^M d_j ϕ_j(x) + Σ_{i=1}^n c_i R(x_i, x),   (17)

where R(s, t) is the r.k. for H.

An informal proof is given below; see Wahba (1990) or Gu (2002) for a formal proof. Since we are working in V, clearly a minimizer f̂ must have f̂ = f̂_0 + f̂_1 with f̂_0 ∈ N_0 and f̂_1 ∈ H. We want to establish that f̂_1(·) = Σ_{i=1}^n c_i R(x_i, ·). To simplify notation, write

f̂_R(·) ≡ Σ_{i=1}^n c_i R(x_i, ·).   (18)

Decompose H as H = H_R ⊕ H_R^⊥, where H_R ≡ span{R(x_i, ·), i = 1, ..., n} and H_R^⊥ is its orthogonal complement in H, so that f̂_1(·) = f̂_R(·) + η(·) with η(·) ∈ H_R^⊥. By orthogonality and the reproducing property of the r.k., 0 = ⟨R(x_i, ·), η(·)⟩ = η(x_i). We now establish the representation theorem. Using our assumptions about J,

(1/n) Σ_{i=1}^n {y_i - f̂(x_i)}^2 + λ J(f̂)
  = (1/n) Σ_{i=1}^n {y_i - f̂_0(x_i) - f̂_1(x_i)}^2 + λ J(f̂_0 + f̂_1)
  = (1/n) Σ_{i=1}^n {y_i - f̂_0(x_i) - f̂_1(x_i)}^2 + λ J(f̂_1)
  = (1/n) Σ_{i=1}^n {y_i - f̂_0(x_i) - f̂_R(x_i) - η(x_i)}^2 + λ J(f̂_R + η).

Because η(x_i) = 0, and using orthogonality within H,

(1/n) Σ_{i=1}^n {y_i - f̂(x_i)}^2 + λ J(f̂)
  = (1/n) Σ_{i=1}^n {y_i - f̂_0(x_i) - f̂_R(x_i)}^2 + λ [ J(f̂_R) + J(η) ]
  ≥ (1/n) Σ_{i=1}^n {y_i - f̂_0(x_i) - f̂_R(x_i)}^2 + λ J(f̂_R).

Clearly, any η ≠ 0 makes the inequality strict, so minimizers have η = 0 and f̂ = f̂_0 + f̂_R, with the last inequality an equality.

Once we know that the minimizer takes the form (17), we can find the coefficients of the linear combination by solving a quadratic minimization problem similar to those in standard linear models. This occurs because we can write J(f̂) = J(f̂_R) as a quadratic form in c = (c_1, ..., c_n)^t. Define Σ as the n × n matrix whose i, j entry is Σ_ij = R(x_i, x_j). Now, using the reproducing property of R, write

J(f̂_R) = ⟨ Σ_{i=1}^n c_i R(x_i, ·), Σ_{j=1}^n c_j R(x_j, ·) ⟩ = Σ_{i=1}^n Σ_{j=1}^n c_i c_j R(x_i, x_j) = c^t Σ c.

Define the observation vector y = [y_1, ..., y_n]^t, and let T be the n × M matrix with ij-th entry defined by T_ij = ϕ_j(x_i). The minimization of (16) then takes the form

min_{c,d} (1/n) ||y - (Td + Σc)||^2 + λ c^t Σ c.   (19)

To solve (19) we define the following matrices,

Q_{n×(n+M)} = [ T_{n×M}  Σ_{n×n} ],   γ_{(n+M)×1} = [ d_{M×1} ; c_{n×1} ],   S_{(n+M)×(n+M)} = [ 0_{M×M}  0_{M×n} ; 0_{n×M}  Σ_{n×n} ].

Now (19) becomes

min_γ (1/n) ||y - Qγ||^2 + λ γ^t S γ.   (20)

The minimization in (20) is the same as a generalized ridge regression. Taking derivatives with respect to γ, we have

(Q^t Q + λS) γ̂ = Q^t y,

which requires solving a system of n + M equations to find γ̂. For analytical purposes, we can write γ̂ as

γ̂ = (Q^t Q + λS)^- Q^t y.   (21)

As long as T is of full column rank, then f̂_λ is unique. If the x_i are not unique, then γ̂ is not unique as defined, but f̂_λ is still unique. One could simply use only the unique x_i in the definition of f̂_R in (18) to ensure that γ̂ is unique in that case as well.

Alternatively, γ̂ can be obtained as the generalized least squares estimate from fitting the linear model

[ y ; 0 ] = [ T  Σ ; 0_{n×M}  I_{n×n} ] [ d ; c ] + e,   Cov(e) = [ I_{n×n}  0_{n×n} ; 0_{n×n}  λ^{-1} Σ^{-1}_{n×n} ].
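In R, once T, Σ and y have been constructed, (20)-(21) amount to a single linear solve. The sketch below is ours (the paper's full worked code is in Appendix A); it uses solve() and therefore assumes Q^t Q + λS is nonsingular, whereas the appendix uses a pseudo-inverse:

# Generalized ridge solve for gamma.hat in (21); Tmat is n x M, Sigma is n x n
fit.penalized <- function(Tmat, Sigma, y, lambda){
  n <- nrow(Sigma); M <- ncol(Tmat)
  Q <- cbind(Tmat, Sigma)                        # Q = [T  Sigma]
  S <- matrix(0, n + M, n + M)
  S[(M+1):(n+M), (M+1):(n+M)] <- Sigma           # Sigma in the lower right block of S
  gamma.hat <- solve(t(Q) %*% Q + lambda * S, t(Q) %*% y)
  list(d.hat = gamma.hat[1:M],                   # null space coefficients
       c.hat = gamma.hat[-(1:M)],                # kernel coefficients
       f.hat = Q %*% gamma.hat)                  # fitted values at the observed x_i
}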

For clarity, we have restricted our attention to minimizing (16), which incorporates squared error loss between the observations and the unknown function evaluations. The representation theorem holds for more general loss functions (e.g., those from logistic or Poisson regression); see Wahba (1990) and Gu (2002).

4.2 General Solution Applied to Cubic Smoothing Spline

Consider again the regression problem y_i = f(x_i) + ϵ_i, i = 1, 2, ..., n, where x_i ∈ [0, 1] and ϵ_i ~ N(0, σ^2). We focus on the cubic smoothing spline solution to this problem. That is, we find a function that minimizes

Σ_{i=1}^n {y_i - f(x_i)}^2 + λ ∫_0^1 f''(x)^2 dx.

As discussed in Subsection 4.1, N is the space of linear functions from (23) with p = 1 and a basis of ϕ_1(x) = 1, ϕ_2(x) = x. In this case N happens to be an RKHS, but in general it is not even necessary to define an inner product on N. H = W_2^0 comes from Example 3.5. We know that the reproducing kernel for H is

R(s, t) = ∫_0^1 (t - u)_+ (s - u)_+ du = max(s, t) min^2(s, t) / 2 - min^3(s, t) / 6,

so the solution has the form

f̂(x) = d̂_1 + d̂_2 x + Σ_{i=1}^n ĉ_i R(x_i, x).   (22)

From (21) we have

(Q^t Q + λS)^{-1} Q^t y = [ d̂ ; ĉ ].

Appendix A.2 provides a demonstration with R code of fitting the cubic smoothing spline using the solution in (22) on some actual data. The demonstration also includes searching for the best value of the tuning parameter λ, which is briefly discussed in Section 4.4.

4.3 General Solution Applied to Ridge Regression

We now solve the linear ridge regression problem of minimizing (13) with the RKHS approach detailed above. Although the RKHS framework is not necessary to solve the ridge regression problem, it serves as a good illustration of the RKHS machinery. To put the ridge regression problem in the framework of (16), consider

Ṽ = {f(x) = β_0 + Σ_{j=1}^p β_j x_j}   (23)

with the penalty function

J(f) = Σ_{j=1}^p β_j^2.

Notice that by letting Ṽ = V_0 ⊕ H, where V_0 is the RKHS from Example 3.2 and H is from Example 3.3, we have J(f) = ⟨f_1, f_1⟩_H. The dimension of V_0 = N = N_0 is M = 1, and ϕ_1(x) = 1. From (17) the solution takes the form

f̂_λ(x) = d̂_1 + Σ_{i=1}^n ĉ_i R(x_i, x)   (24)

with

(Q^t Q + λS)^{-1} Q^t y = [ d̂_1 ; ĉ ]   (25)

as given by (21). In this case it is more familiar to write the solution as

f̂_λ(x) = β̂_0 + β̂_1 x_1 + β̂_2 x_2 + ... + β̂_p x_p.

This can be done by recalling from Example 3.3 that R(x_i, x) = x_i1 x_1 + x_i2 x_2 + ... + x_ip x_p. Substituting into (24) gives

f̂(x) = d̂_1 + ( Σ_{i=1}^n ĉ_i x_i1 ) x_1 + ( Σ_{i=1}^n ĉ_i x_i2 ) x_2 + ... + ( Σ_{i=1}^n ĉ_i x_ip ) x_p,

which implies that

β̂_0 = d̂_1 and β̂_j = Σ_{i=1}^n ĉ_i x_ij, for j = 1, 2, ..., p.

In Appendix A.1, we provide a demonstration of this solution on some actual data using the statistical software R.

Both of the previous examples illustrate a typical case in which Ṽ is constructed as the direct sum of a RKHS V_0 and a RKHS H whose intersection is the zero vector, Ṽ = V_0 ⊕ H. Because the intersection is zero, any f ∈ Ṽ can be written uniquely as f = f_0 + f_1 with f_0 ∈ V_0 and f_1 ∈ H. The importance of this is that if the penalty functional J happens to satisfy J(f) = ⟨f_1, f_1⟩_H, then all of our assumptions about J hold immediately with N = V_0, and the solution in (21) can be applied. As a further aside, when V_0 is itself a RKHS, then Ṽ is an RKHS under the inner product ⟨f, g⟩_Ṽ ≡ ⟨f_0, g_0⟩_{V_0} + ⟨f_1, g_1⟩_H. Note that V_0 and H are orthogonal under this inner product. This orthogonal decomposition of Ṽ is also closely related to additive models and, more generally, smoothing spline ANOVA models (Gu 2002).

4.4 Choosing the Degree of Smoothness

With the penalized regression procedures described above, the choice of the smoothing parameter λ is an important issue. There are many methods available for this task, e.g., visual inspection of the fit, cross-validation, generalized MLE, and generalized cross-validation (GCV). For the examples given in the appendix, we use the GCV approach, which works as follows for any generalized ridge regression solution, such as those in (25) and (22). Suppose that an estimate admits the following closed-form expression:

β̂ = (Q^t Q + λS)^{-1} Q^t y.

The GCV choice of λ for this generalized ridge estimate is the minimizer of

V(λ) = (1/n) ||(I - A(λ)) y||^2 / [ (1/n) Trace{I - A(λ)} ]^2,

where A(λ) = Q (Q^t Q + λS)^{-1} Q^t. The goal of GCV is to find the optimal λ so that the resulting β̂ has the smallest mean squared error. For more details about GCV and other methods of finding λ, see Golub, Heath & Wahba (1979), Allen (1974), Wecker & Ansley (1983) and Wahba (1990).

5 Concluding Remarks

We have given several examples motivating the utility of the RKHS approach to penalized regression problems. We reviewed the building blocks necessary to define an RKHS and presented several key results about these spaces. Finally, we used the results to illustrate estimation for ridge regression and the cubic smoothing spline problems, providing transparent R code to enhance understanding.

References

Allen, D. (1974), The relationship between variable selection and data augmentation and a method for prediction, Technometrics 16.

Berlinet, A. & Thomas-Agnan, C. (2004), Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer Academic Publishers, Norwell, MA.

de Barra, G. (1981), Measure Theory and Integration, Horwood Publishing, West Sussex, UK.

Eubank, R. (1999), Nonparametric Regression and Spline Smoothing, Marcel Dekker, New York, NY.

Golub, G., Heath, M. & Wahba, G. (1979), Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21.

Gu, C. (2002), Smoothing Spline ANOVA Models, Springer-Verlag, New York, NY.

Gu, C. & Qiu, C. (1993), Smoothing spline density estimation: Theory, Annals of Statistics 21.

Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, New York, NY.

Hoerl, A. & Kennard, R. (1970a), Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12.

Lin, Y. & Zhang, H. (2006), Component selection and smoothing in smoothing spline analysis of variance models, Annals of Statistics 34.

Máté, L. (1990), Hilbert Space Methods in Science and Engineering, Taylor & Francis, Oxford, UK.

Naylor, A. & Sell, G. (1982), Linear Operator Theory in Engineering and Science, Springer, New York, NY.

O'Sullivan, F., Yandell, B. & Raynor, W. (1986), Automatic smoothing of regression functions in generalized linear models, Journal of the American Statistical Association 81.

Rustagi, J. (1994), Optimization Techniques in Statistics, Academic Press, London, UK.

Storlie, C., Bondell, H. & Reich, B. (2010), A locally adaptive penalty for estimation of functions with varying roughness, Journal of Computational and Graphical Statistics 19.

Storlie, C., Bondell, H., Reich, B. & Zhang, H. (2010), Surface estimation, variable selection, and the nonparametric oracle property, Statistica Sinica (to appear).

Wahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 59, SIAM, Philadelphia, PA.

Wecker, W. & Ansley, C. (1983), The signal extraction approach to nonlinear regression and spline smoothing, Journal of the American Statistical Association 78.

Young, N. (1988), An Introduction to Hilbert Space, Cambridge University Press, Cambridge, UK.

A Examples using R

A.1 RKHS solution applied to Ridge Regression

We consider a sample of size n = 20, (y_1, y_2, y_3, ..., y_20), from the model y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ϵ_i, where β_0 = 2, β_1 = 3, β_2 = 0 and ϵ_i has a N(0, .25^2) distribution. The distribution of the covariates, x_1 and x_2, is uniform on [0, 1]^2, so that x_2 is uninformative. The following lines of code generate the data.

###### Data ########
set.seed(3)
n <- 20
x1 <- runif(n)
x2 <- runif(n)
X <- matrix(c(x1, x2), ncol = 2)      # design matrix
y <- 2 + 3*x1 + rnorm(n, sd = .25)

Below we give a function ridge.regression to solve the ridge regression problem using the RKHS framework we discussed in Section 4.3.

##### function to find the (pseudo) inverse of a matrix ####
my.inv <- function(X, eps = 1e-12){
  eig.X <- eigen(X, symmetric = TRUE)
  P <- eig.X[[2]]                     # eigenvectors
  lambda <- eig.X[[1]]                # eigenvalues
  ind <- lambda > eps
  lambda[ind] <- 1/lambda[ind]
  lambda[!ind] <- 0
  ans <- P %*% diag(lambda, nrow = length(lambda)) %*% t(P)
  return(ans)
}

###### Reproducing Kernel #########
rk <- function(s, t){
  p <- length(s)
  rk <- 0
  for (i in 1:p){
    rk <- s[i]*t[i] + rk
  }
  return(rk)
}

##### Gramm matrix #######
get.gramm <- function(X){
  n <- dim(X)[1]
  Gramm <- matrix(0, n, n)            # initializes Gramm matrix
  # i = index for rows
  # j = index for columns
  Gramm <- as.matrix(Gramm)           # Gramm matrix
  for (i in 1:n){
    for (j in 1:n){
      Gramm[i, j] <- rk(X[i, ], X[j, ])
    }
  }
  return(Gramm)
}

ridge.regression <- function(X, y, lambda){
  Gramm <- get.gramm(X)                        # Gramm matrix (n x n)
  n <- dim(X)[1]                               # n = length of y
  J <- matrix(1, n, 1)                         # vector of ones, dim n x 1
  Q <- cbind(J, Gramm)                         # design matrix
  m <- 1                                       # dimension of the null space of the penalty
  S <- matrix(0, n + m, n + m)                 # initialize S
  S[(m+1):(n+m), (m+1):(n+m)] <- Gramm         # non-zero part of S
  M <- (t(Q) %*% Q + lambda*S)
  M.inv <- my.inv(M)                           # inverse of M
  gamma.hat <- crossprod(M.inv, crossprod(Q, y))
  f.hat <- Q %*% gamma.hat
  A <- Q %*% M.inv %*% t(Q)
  tr.A <- sum(diag(A))                         # trace of hat matrix
  rss <- t(y - f.hat) %*% (y - f.hat)          # residual sum of squares
  gcv <- n*rss/(n - tr.A)^2                    # obtain GCV score
  return(list(f.hat = f.hat, gamma.hat = gamma.hat, gcv = gcv))
}

A simple direct search for the GCV optimal smoothing parameter can be made as follows:

# Plot of GCV
lambda <- 1e-8
V <- rep(0, 40)
for (i in 1:40){
  V[i] <- ridge.regression(X, y, lambda)$gcv   # obtain GCV score
  lambda <- lambda*1.5                         # increase lambda
}
index <- (1:40)
plot(1.5^(index - 1)*1e-8, V, type = "l", main = "GCV score", lwd = 2,
     xlab = "lambda", ylab = "GCV")            # plot score

The GCV plot produced by this code is displayed in Figure 2. Now, by following Section 4.3, β̂_0, β̂_1 and β̂_2 can be obtained as follows.

i <- (1:40)[V == min(V)]                             # extract index of min(V)
opt.mod <- ridge.regression(X, y, 1.5^(i - 1)*1e-8)  # fit optimal model

### finding beta.0, beta.1 and beta.2 ##########
gamma.hat <- opt.mod$gamma.hat
beta.hat.0 <- opt.mod$gamma.hat[1]                   # intercept
beta.hat <- gamma.hat[2:21, ] %*% X                  # slope and noise term coefficients

The resulting estimates are: β̂_0 = , β̂_1 = and β̂_2 = .

A.2 RKHS solution applied to Cubic Smoothing Spline

We consider a sample of size n = 50, (y_1, y_2, y_3, ..., y_50), from the model y_i = sin(2πx_i) + ϵ_i, where ϵ_i has a N(0, .2^2) distribution. The following code generates x and y.

###### Data ########
set.seed(3)
n <- 50
x <- matrix(runif(n), nrow = n, ncol = 1)
x.star <- matrix(sort(x), nrow = n, ncol = 1)   # sorted x, used by plot
y <- sin(2*pi*x.star) + rnorm(n, sd = .2)

Below we give a function to find the cubic smoothing spline using the RKHS framework we discussed in Section 4.2. We also provide a graph of our estimate along with the true function and data.

#### Reproducing Kernel for <f,g> = int_0^1 f''(x)g''(x)dx #####
rk.1 <- function(s, t){
  return((1/2)*(min(s, t)^2)*max(s, t) - (1/6)*(min(s, t))^3)
}

get.gramm.1 <- function(X){
  n <- dim(X)[1]
  Gramm <- matrix(0, n, n)            # initializes Gramm matrix
  # i = index for rows
  # j = index for columns
  Gramm <- as.matrix(Gramm)           # Gramm matrix
  for (i in 1:n){
    for (j in 1:n){
      Gramm[i, j] <- rk.1(X[i, ], X[j, ])
    }
  }
  return(Gramm)
}

smoothing.spline <- function(X, y, lambda){
  Gramm <- get.gramm.1(X)                      # Gramm matrix (n x n)
  n <- dim(X)[1]                               # n = length of y
  J <- matrix(1, n, 1)                         # vector of ones, dim n x 1
  T <- cbind(J, X)                             # matrix with a basis for the null space of the penalty
  Q <- cbind(T, Gramm)                         # design matrix
  m <- dim(T)[2]                               # dimension of the null space of the penalty
  S <- matrix(0, n + m, n + m)                 # initialize S
  S[(m+1):(n+m), (m+1):(n+m)] <- Gramm         # non-zero part of S
  M <- (t(Q) %*% Q + lambda*S)
  M.inv <- my.inv(M)                           # inverse of M
  gamma.hat <- crossprod(M.inv, crossprod(Q, y))
  f.hat <- Q %*% gamma.hat
  A <- Q %*% M.inv %*% t(Q)
  tr.A <- sum(diag(A))                         # trace of hat matrix
  rss <- t(y - f.hat) %*% (y - f.hat)          # residual sum of squares
  gcv <- n*rss/(n - tr.A)^2                    # obtain GCV score
  return(list(f.hat = f.hat, gamma.hat = gamma.hat, gcv = gcv))
}

A simple direct search for the GCV optimal smoothing parameter can be made as follows:

### Now we have to find an optimal lambda using GCV...
### Plot of GCV
lambda <- 1e-8
V <- rep(0, 60)
for (i in 1:60){
  V[i] <- smoothing.spline(x.star, y, lambda)$gcv   # obtain GCV score
  lambda <- lambda*1.5                              # increase lambda
}
plot(1:60, V, type = "l", main = "GCV score", xlab = "i")   # plot score

i <- (1:60)[V == min(V)]                                    # extract index of min(V)
opt.mod.2 <- smoothing.spline(x.star, y, 1.5^(i - 1)*1e-8)  # fit optimal model

# Graph (Cubic Spline)
plot(x.star, opt.mod.2$f.hat, type = "l", lty = 2, lwd = 2, col = "blue",
     xlab = "x", ylim = c(-2.5, 1.5), xlim = c(-0.1, 1.1), ylab = "response",
     main = "Cubic Spline")                         # predictions
lines(x.star, sin(2*pi*x.star), lty = 1, lwd = 2)   # true function
legend(-0.1, -1.5, c("predictions", "true"), lty = c(2, 1), bty = "n",
       lwd = c(2, 2), col = c("blue", "black"))
points(x.star, y)                                   # data

This graph is plotted in Figure 3.
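The appendix code evaluates the fitted spline only at the observed x values. As a small extension (ours, not from the original paper; the names predict.ss and x.grid are hypothetical), the coefficients in opt.mod.2$gamma.hat can be plugged into (22) to predict at new points and overlay a smooth curve on the plot above:

# Evaluate the fitted cubic smoothing spline at new x values using (22)
predict.ss <- function(gamma.hat, X, x.new){
  d <- gamma.hat[1:2]                       # null space coefficients d.hat
  c.hat <- gamma.hat[-(1:2)]                # kernel coefficients c.hat
  sapply(x.new, function(x0){
    d[1] + d[2]*x0 + sum(c.hat * sapply(X, function(xi) rk.1(xi, x0)))
  })
}
x.grid <- seq(0, 1, length = 200)
lines(x.grid, predict.ss(opt.mod.2$gamma.hat, x.star, x.grid), col = "red")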

Figure 1: Linear Interpolating Spline

Figure 2: GCV Score

Figure 3: Cubic Spline


More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,

More information

LECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel

LECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel LECTURE NOTES on ELEMENTARY NUMERICAL METHODS Eusebius Doedel TABLE OF CONTENTS Vector and Matrix Norms 1 Banach Lemma 20 The Numerical Solution of Linear Systems 25 Gauss Elimination 25 Operation Count

More information

A Brief Outline of Math 355

A Brief Outline of Math 355 A Brief Outline of Math 355 Lecture 1 The geometry of linear equations; elimination with matrices A system of m linear equations with n unknowns can be thought of geometrically as m hyperplanes intersecting

More information

Representer Theorem. By Grace Wahba and Yuedong Wang. Abstract. The representer theorem plays an outsized role in a large class of learning problems.

Representer Theorem. By Grace Wahba and Yuedong Wang. Abstract. The representer theorem plays an outsized role in a large class of learning problems. Representer Theorem By Grace Wahba and Yuedong Wang Abstract The representer theorem plays an outsized role in a large class of learning problems. It provides a means to reduce infinite dimensional optimization

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 11 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,, a n, b are given real

More information

UNBOUNDED OPERATORS ON HILBERT SPACES. Let X and Y be normed linear spaces, and suppose A : X Y is a linear map.

UNBOUNDED OPERATORS ON HILBERT SPACES. Let X and Y be normed linear spaces, and suppose A : X Y is a linear map. UNBOUNDED OPERATORS ON HILBERT SPACES EFTON PARK Let X and Y be normed linear spaces, and suppose A : X Y is a linear map. Define { } Ax A op = sup x : x 0 = { Ax : x 1} = { Ax : x = 1} If A

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

LINEAR ALGEBRA W W L CHEN

LINEAR ALGEBRA W W L CHEN LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008. This chapter is available free to all individuals, on the understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,

More information

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x = Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.

More information

Elementary linear algebra

Elementary linear algebra Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS

SPRING 2006 PRELIMINARY EXAMINATION SOLUTIONS SPRING 006 PRELIMINARY EXAMINATION SOLUTIONS 1A. Let G be the subgroup of the free abelian group Z 4 consisting of all integer vectors (x, y, z, w) such that x + 3y + 5z + 7w = 0. (a) Determine a linearly

More information

MAT Linear Algebra Collection of sample exams

MAT Linear Algebra Collection of sample exams MAT 342 - Linear Algebra Collection of sample exams A-x. (0 pts Give the precise definition of the row echelon form. 2. ( 0 pts After performing row reductions on the augmented matrix for a certain system

More information

,, rectilinear,, spherical,, cylindrical. (6.1)

,, rectilinear,, spherical,, cylindrical. (6.1) Lecture 6 Review of Vectors Physics in more than one dimension (See Chapter 3 in Boas, but we try to take a more general approach and in a slightly different order) Recall that in the previous two lectures

More information

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved Fundamentals of Linear Algebra Marcel B. Finan Arkansas Tech University c All Rights Reserved 2 PREFACE Linear algebra has evolved as a branch of mathematics with wide range of applications to the natural

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak, scribe: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters,

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations March 3, 203 6-6. Systems of Linear Equations Matrices and Systems of Linear Equations An m n matrix is an array A = a ij of the form a a n a 2 a 2n... a m a mn where each a ij is a real or complex number.

More information

Systems of Linear Equations

Systems of Linear Equations Systems of Linear Equations Math 108A: August 21, 2008 John Douglas Moore Our goal in these notes is to explain a few facts regarding linear systems of equations not included in the first few chapters

More information

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Linear Algebra A Brief Reminder Purpose. The purpose of this document

More information

x 1 x 2. x 1, x 2,..., x n R. x n

x 1 x 2. x 1, x 2,..., x n R. x n WEEK In general terms, our aim in this first part of the course is to use vector space theory to study the geometry of Euclidean space A good knowledge of the subject matter of the Matrix Applications

More information

(i) [7 points] Compute the determinant of the following matrix using cofactor expansion.

(i) [7 points] Compute the determinant of the following matrix using cofactor expansion. Question (i) 7 points] Compute the determinant of the following matrix using cofactor expansion 2 4 2 4 2 Solution: Expand down the second column, since it has the most zeros We get 2 4 determinant = +det

More information

Vectors. January 13, 2013

Vectors. January 13, 2013 Vectors January 13, 2013 The simplest tensors are scalars, which are the measurable quantities of a theory, left invariant by symmetry transformations. By far the most common non-scalars are the vectors,

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

QUADRATIC MAJORIZATION 1. INTRODUCTION

QUADRATIC MAJORIZATION 1. INTRODUCTION QUADRATIC MAJORIZATION JAN DE LEEUW 1. INTRODUCTION Majorization methods are used extensively to solve complicated multivariate optimizaton problems. We refer to de Leeuw [1994]; Heiser [1995]; Lange et

More information

Introduction - Motivation. Many phenomena (physical, chemical, biological, etc.) are model by differential equations. f f(x + h) f(x) (x) = lim

Introduction - Motivation. Many phenomena (physical, chemical, biological, etc.) are model by differential equations. f f(x + h) f(x) (x) = lim Introduction - Motivation Many phenomena (physical, chemical, biological, etc.) are model by differential equations. Recall the definition of the derivative of f(x) f f(x + h) f(x) (x) = lim. h 0 h Its

More information

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms.

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. Vector Spaces Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. For each two vectors a, b ν there exists a summation procedure: a +

More information

Theorems. Least squares regression

Theorems. Least squares regression Theorems In this assignment we are trying to classify AML and ALL samples by use of penalized logistic regression. Before we indulge on the adventure of classification we should first explain the most

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K. R. MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND Second Online Version, December 1998 Comments to the author at krm@maths.uq.edu.au Contents 1 LINEAR EQUATIONS

More information

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1 Lectures - Week 11 General First Order ODEs & Numerical Methods for IVPs In general, nonlinear problems are much more difficult to solve than linear ones. Unfortunately many phenomena exhibit nonlinear

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

TOOLS FROM HARMONIC ANALYSIS

TOOLS FROM HARMONIC ANALYSIS TOOLS FROM HARMONIC ANALYSIS BRADLY STADIE Abstract. The Fourier transform can be thought of as a map that decomposes a function into oscillatory functions. In this paper, we will apply this decomposition

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters, transforms,

More information

MATH 590: Meshfree Methods

MATH 590: Meshfree Methods MATH 590: Meshfree Methods Chapter 2 Part 3: Native Space for Positive Definite Kernels Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Fall 2014 fasshauer@iit.edu MATH

More information

MAT2342 : Introduction to Applied Linear Algebra Mike Newman, fall Projections. introduction

MAT2342 : Introduction to Applied Linear Algebra Mike Newman, fall Projections. introduction MAT4 : Introduction to Applied Linear Algebra Mike Newman fall 7 9. Projections introduction One reason to consider projections is to understand approximate solutions to linear systems. A common example

More information

Learning gradients: prescriptive models

Learning gradients: prescriptive models Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan

More information

Hilbert Spaces. Contents

Hilbert Spaces. Contents Hilbert Spaces Contents 1 Introducing Hilbert Spaces 1 1.1 Basic definitions........................... 1 1.2 Results about norms and inner products.............. 3 1.3 Banach and Hilbert spaces......................

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

Chapter 1: Systems of linear equations and matrices. Section 1.1: Introduction to systems of linear equations

Chapter 1: Systems of linear equations and matrices. Section 1.1: Introduction to systems of linear equations Chapter 1: Systems of linear equations and matrices Section 1.1: Introduction to systems of linear equations Definition: A linear equation in n variables can be expressed in the form a 1 x 1 + a 2 x 2

More information

Classification 1: Linear regression of indicators, linear discriminant analysis

Classification 1: Linear regression of indicators, linear discriminant analysis Classification 1: Linear regression of indicators, linear discriminant analysis Ryan Tibshirani Data Mining: 36-462/36-662 April 2 2013 Optional reading: ISL 4.1, 4.2, 4.4, ESL 4.1 4.3 1 Classification

More information

LECTURE 1: SOURCES OF ERRORS MATHEMATICAL TOOLS A PRIORI ERROR ESTIMATES. Sergey Korotov,

LECTURE 1: SOURCES OF ERRORS MATHEMATICAL TOOLS A PRIORI ERROR ESTIMATES. Sergey Korotov, LECTURE 1: SOURCES OF ERRORS MATHEMATICAL TOOLS A PRIORI ERROR ESTIMATES Sergey Korotov, Institute of Mathematics Helsinki University of Technology, Finland Academy of Finland 1 Main Problem in Mathematical

More information