A Simple Regression Problem

R. M. Castro

March 23, 2

In this brief note a simple regression problem will be introduced, illustrating clearly the bias-variance tradeoff. Let

\[ Y_i = f(x_i) + W_i, \qquad i = 1, \ldots, n, \]

where $x_i = i/n$, $f : [0,1] \to \mathbb{R}$ is a function, and the $W_i$'s are independent random variables such that $\mathbb{E}[W_i] = 0$ and $\mathbb{E}[W_i^2] = \sigma^2 < \infty$. The object of interest is the function $f$. Using the data $\{Y_i\}_{i=1}^n$ we want to construct an estimate $\hat{f}_n$ that is good, in the sense that the squared $L_2$ distance

\[ \|\hat{f}_n - f\|^2 = \int_0^1 (\hat{f}_n(t) - f(t))^2 \, dt \]

is small (note that the above is a random quantity). In particular we want to minimize the expected risk $\mathbb{E}[\|\hat{f}_n - f\|^2]$.

In order to characterize the expected risk we need further assumptions on the function $f$; namely, we assume it is Lipschitz smooth. Formally, we assume $f \in F_L$, where

\[ F_L = \{ f : [0,1] \to \mathbb{R} \;:\; |f(s) - f(t)| \le L|t - s|, \ \forall t, s \in [0,1] \}, \]

and $L > 0$ is a constant. Notice that such functions are continuous, but not necessarily differentiable. An example of such a function is depicted in Figure 1.

Our approach will use piecewise constant functions, in what is usually referred to as a regressogram (this is the regression analogue of the histogram). Let $m \equiv m_n \in \mathbb{N}$ and define the class of piecewise constant functions

\[ F_m = \Big\{ f : f(t) = \sum_{j=1}^{m} c_j \, \mathbf{1}\Big\{ t \in \Big[\tfrac{j-1}{m}, \tfrac{j}{m}\Big) \Big\}, \ c_j \in \mathbb{R} \Big\}. \]

The set $F_m$ is a linear space consisting of functions that are constant on the intervals $I_j = [\frac{j-1}{m}, \frac{j}{m})$, $j = 1, \ldots, m$.
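To make the observation model concrete, here is a minimal Python sketch (not part of the original note) that draws samples $Y_i = f(x_i) + W_i$ at the design points $x_i = i/n$. The Gaussian noise and the particular Lipschitz test function are illustrative assumptions; the note only requires zero-mean noise with finite variance.

```python
import numpy as np

def sample_data(f, n, sigma=0.5, seed=None):
    """Draw (x_i, Y_i), i = 1..n, from Y_i = f(x_i) + W_i with x_i = i/n.
    The note only assumes E[W_i] = 0 and E[W_i^2] = sigma^2; Gaussian noise
    is used here purely for illustration."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, n + 1) / n                 # fixed design points i/n
    y = f(x) + sigma * rng.standard_normal(n)   # independent zero-mean noise
    return x, y

# A Lipschitz (continuous, but not everywhere differentiable) test function.
f = lambda t: np.abs(t - 0.5)
x, y = sample_data(f, n=60, sigma=0.1, seed=0)
```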
Figure 1: Example of a Lipschitz function (blue), and corresponding observations (red). The red dots correspond to $(i/n, Y_i)$, $i = 1, \ldots, n$.

Clearly if $m$ is large we can approximate almost any bounded function arbitrarily well. For notational ease we will drop the subscript $n$ in $m_n$, and use simply $m$.

We are going to use a bias-variance decomposition. First let's define our estimator: it is going to be simply the average of the data in each one of the intervals $I_j$. A way to motivate this estimator is as follows. Our goal is to minimize $\mathbb{E}[\|\hat{f}_n - f\|^2]$, but obviously we cannot compute this expectation. Let's consider instead an empirical surrogate for it, namely

\[ \hat{R}_n(f') = \frac{1}{n} \sum_{i=1}^{n} (f'(x_i) - Y_i)^2, \]

where $f'$ is an arbitrary function. Now let $f' \in F_m$, so that we can write it as

\[ f'(t) = \sum_{j=1}^{m} c_j \, \mathbf{1}\{ t \in I_j \}, \]

where $c_j \in \mathbb{R}$. Define

\[ N_j = \{ i : x_i \in I_j \}. \]

We can rewrite $\hat{R}_n(f')$ as

\[ \hat{R}_n(f') = \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} c_j \, \mathbf{1}\{ x_i \in I_j \} - Y_i \Big)^2 = \frac{1}{n} \sum_{j=1}^{m} \sum_{i \in N_j} (c_j - Y_i)^2. \]

Define the estimator

\[ \hat{f}_n = \arg\min_{f' \in F_m} \hat{R}_n(f'). \tag{1} \]
Then

\[ \hat{f}_n(t) = \sum_{j=1}^{m} \hat{c}_j \, \mathbf{1}\{ t \in I_j \}, \qquad \text{where} \quad \hat{c}_j = \frac{1}{|N_j|} \sum_{i \in N_j} Y_i, \tag{2} \]

and $|N_j|$ denotes the number of elements in $N_j$. Notice that $|N_j|$ is always greater than zero provided $m < n$. We will assume this throughout the entire document.

Exercise 1 Prove that the solution of (1) is given by $\hat{f}_n(t) = \sum_{j=1}^{m} \hat{c}_j \mathbf{1}\{ t \in I_j \}$, where the $\hat{c}_j$'s are given by (2).

Define also $\bar{f} \in F_m$, the expected value of $\hat{f}_n$:

\[ \bar{f}(t) = \mathbb{E}[\hat{f}_n(t)] = \sum_{j=1}^{m} \bar{c}_j \, \mathbf{1}\{ t \in I_j \}, \qquad \bar{c}_j = \frac{1}{|N_j|} \sum_{i \in N_j} f(x_i). \]

We are ready to do our bias-variance decomposition:

\begin{align*}
\mathbb{E}[\|\hat{f}_n - f\|^2]
&= \mathbb{E}[\|\hat{f}_n - \bar{f} + \bar{f} - f\|^2] \\
&= \mathbb{E}[\|\hat{f}_n - \bar{f}\|^2] + \|\bar{f} - f\|^2 + 2\,\mathbb{E}[\langle \hat{f}_n - \bar{f}, \bar{f} - f \rangle] \\
&= \mathbb{E}[\|\hat{f}_n - \bar{f}\|^2] + \|\bar{f} - f\|^2 + 2\,\langle \mathbb{E}[\hat{f}_n] - \bar{f}, \bar{f} - f \rangle \\
&= \mathbb{E}[\|\hat{f}_n - \bar{f}\|^2] + \|\bar{f} - f\|^2,
\end{align*}

where the final step follows from the fact that $\mathbb{E}[\hat{f}_n(t)] = \bar{f}(t)$. So the expected risk is decomposed into two terms: the first is the variance (or estimation error), and the second is the squared bias (or approximation error). Now we just need
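The estimator (2) is straightforward to compute. The sketch below (a Python illustration, not from the note; the helper names are mine) bins the design points into the intervals $I_j$ and averages the responses in each bin, which is exactly the minimizer over $F_m$ established in Exercise 1.

```python
import numpy as np

def regressogram(x, y, m):
    """Fit (2): c_hat[j] is the average of the Y_i with x_i in
    I_j = [(j-1)/m, j/m); the last interval is taken closed at 1."""
    bins = np.minimum(np.floor(x * m).astype(int), m - 1)  # 0-based bin of each x_i
    counts = np.bincount(bins, minlength=m)
    assert counts.min() > 0, "every bin must be non-empty (e.g. m < n with x_i = i/n)"
    return np.bincount(bins, weights=y, minlength=m) / counts

def predict(c_hat, t):
    """Evaluate the piecewise-constant estimate f_hat at the points t."""
    m = len(c_hat)
    return c_hat[np.minimum(np.floor(np.asarray(t) * m).astype(int), m - 1)]
```

With the fixed design $x_i = i/n$ and $m < n$, each interval $I_j$ has length $1/m > 1/n$ and therefore contains at least one design point, so the cell averages are well defined.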
to evaluate each one of these terms. Let's start with the bias term:

\begin{align*}
\|\bar{f} - f\|^2
&= \int_0^1 (\bar{f}(t) - f(t))^2 \, dt \\
&= \sum_{j=1}^{m} \int_{I_j} (\bar{c}_j - f(t))^2 \, dt \\
&= \sum_{j=1}^{m} \int_{I_j} \Big( \frac{1}{|N_j|} \sum_{i \in N_j} f(x_i) - f(t) \Big)^2 dt \\
&\le \sum_{j=1}^{m} \int_{I_j} \Big( \frac{1}{|N_j|} \sum_{i \in N_j} |f(x_i) - f(t)| \Big)^2 dt \\
&\le \sum_{j=1}^{m} \int_{I_j} \Big( \frac{L}{m} \Big)^2 dt \\
&= \frac{L^2}{m^2},
\end{align*}

where we used the fact that for $t \in I_j$ and $i \in N_j$ we have $|x_i - t| \le 1/m$, so the Lipschitz condition gives $|f(x_i) - f(t)| \le L/m$. So we see that if $m$ is large (provided it is smaller than $n$) the bias term goes to zero. In other words, we can approximate a Lipschitz smooth function arbitrarily
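As a sanity check on the bound $\|\bar{f} - f\|^2 \le L^2/m^2$, one can approximate the integral numerically for one particular Lipschitz function. The function, constants, and grid sizes below are illustrative choices of mine, not from the note.

```python
import numpy as np

L, n, m = 1.0, 600, 10
f = lambda t: np.abs(t - 0.5)                 # Lipschitz with constant L = 1
i = np.arange(1, n + 1)
bins = np.minimum(i * m // n, m - 1)          # bin of x_i = i/n (exact integer arithmetic)
counts = np.bincount(bins, minlength=m)
c_bar = np.bincount(bins, weights=f(i / n), minlength=m) / counts  # c_bar_j

t = (np.arange(100_000) + 0.5) / 100_000      # midpoint grid avoids bin boundaries
f_bar = c_bar[(t * m).astype(int)]            # evaluate f_bar on the grid
bias_sq = np.mean((f_bar - f(t)) ** 2)        # Riemann sum for ||f_bar - f||^2
```

For this piecewise-linear $f$ the computed squared bias is well below the worst-case bound $L^2/m^2 = 0.01$, as expected.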
well with a piecewise constant function. Now for the variance:

\begin{align*}
\mathbb{E}[\|\hat{f}_n - \bar{f}\|^2]
&= \mathbb{E}\Big[ \int_0^1 (\hat{f}_n(t) - \bar{f}(t))^2 \, dt \Big] \\
&= \mathbb{E}\Big[ \sum_{j=1}^{m} \int_{I_j} (\hat{c}_j - \bar{c}_j)^2 \, dt \Big] \\
&= \sum_{j=1}^{m} \frac{1}{m} \, \mathbb{E}\big[ (\hat{c}_j - \bar{c}_j)^2 \big].
\end{align*}

Now,

\begin{align*}
\mathbb{E}\big[ (\hat{c}_j - \bar{c}_j)^2 \big]
&= \mathbb{E}\Big[ \Big( \frac{1}{|N_j|} \sum_{i \in N_j} (Y_i - f(x_i)) \Big)^2 \Big] \\
&= \mathbb{E}\Big[ \Big( \frac{1}{|N_j|} \sum_{i \in N_j} W_i \Big)^2 \Big] \\
&= \frac{\sigma^2}{|N_j|}.
\end{align*}

Now notice that $|N_j| \ge \lfloor n/m \rfloor$, where $\lfloor x \rfloor$ denotes the largest integer $k$ such that $k \le x$. Therefore

\[ \mathbb{E}[\|\hat{f}_n - \bar{f}\|^2] \le \sum_{j=1}^{m} \frac{1}{m} \frac{\sigma^2}{\lfloor n/m \rfloor} = \frac{\sigma^2}{\lfloor n/m \rfloor} \le \frac{\sigma^2 m}{n} \Big( \frac{n}{n - m} \Big). \]
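The variance computation can also be checked by simulation. The sketch below is an idealized illustration (my assumptions: Gaussian noise, and $n$ divisible by $m$ so that every bin holds exactly $n/m$ design points); in that case $\sum_j \frac{1}{m} \frac{\sigma^2}{|N_j|} = \frac{\sigma^2 m}{n}$, and the Monte Carlo average of $\|\hat{f}_n - \bar{f}\|^2$ should be close to it.

```python
import numpy as np

# Idealized check of E||f_hat - f_bar||^2 = sum_j (1/m) sigma^2/|N_j|:
# each bin is assumed to hold exactly n/m points, so the sum equals sigma^2*m/n.
rng = np.random.default_rng(1)
n, m, sigma, reps = 600, 10, 0.5, 2000
per_bin = n // m
total = 0.0
for _ in range(reps):
    w = sigma * rng.standard_normal(n)           # the noise variables W_i
    c_err = w.reshape(m, per_bin).mean(axis=1)   # c_hat_j - c_bar_j = average noise in bin j
    total += np.mean(c_err ** 2)                 # sum_j (1/m) (c_hat_j - c_bar_j)^2
mc = total / reps
theory = sigma**2 * m / n
```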
So, as long as $m \le cn$, with $0 < c < 1$, then

\[ \mathbb{E}[\|\hat{f}_n - \bar{f}\|^2] \le \frac{\sigma^2}{1 - c} \frac{m}{n}, \]

so the variance term is essentially proportional to $m/n$. In words, this means the variance term is proportional to the number of model parameters $m$ divided by the amount of data $n$. Combining everything we have

\[ \mathbb{E}[\|\hat{f}_n - f\|^2] \le \frac{\sigma^2}{1 - c} \frac{m}{n} + \frac{L^2}{m^2} = O\Big( \max\Big\{ \frac{1}{m^2}, \frac{m}{n} \Big\} \Big), \tag{3} \]

where we make use of the big-$O$ notation. (The notation $x_n = O(y_n)$, which reads "$x_n$ is big-$O$ of $y_n$", or "$x_n$ is of the order of $y_n$ as $n$ goes to infinity", means that $x_n \le C y_n$, where $C$ is a positive constant and $y_n$ is a non-negative sequence.)

At this point it becomes clear that there is an optimal choice for $m$: if $m$ is small then the squared bias term $O(1/m^2)$ is going to be large, but the variance term $O(m/n)$ is going to be small, and vice-versa. These two conflicting goals provide a tradeoff that directs our choice of $m$ (as a function of $n$).

In Figure 2 we depict this tradeoff. In Figure 2(a) we considered a large $m$ value, and we see that the approximation of $f$ by a function in the class $F_m$ can be very accurate (that is, our estimate will have a small bias), but when we use the measured data our estimate looks very bad (high variance). On the other hand, as illustrated in Figure 2(b), using a very small $m$ allows our estimator to get very close to the best approximating function in the class $F_m$, so we have a low-variance estimator, but the bias of our estimator (the difference between $\bar{f}$ and $f$) is quite considerable.

Figure 2: Approximation and estimation of $f$ (in blue). The function $\bar{f}$ is depicted in green and the function $\hat{f}_n$ is depicted in red. In (a) a large value of $m$ is used, and in (b) a small value of $m$.
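The tradeoff in (3) can be made tangible by computing both terms for a range of $m$ values: the squared bias by numerical integration, and the variance term exactly as $\sum_j \frac{\sigma^2}{m |N_j|}$. The function and constants below are illustrative assumptions of mine; with $n = 600$ the risk over this grid is smallest for $m$ near $n^{1/3} \approx 8$, consistent with the balance derived next.

```python
import numpy as np

L, sigma, n = 1.0, 0.5, 600
f = lambda t: np.abs(t - 0.5)                 # Lipschitz test function, L = 1
t = (np.arange(100_000) + 0.5) / 100_000      # midpoint integration grid on [0, 1]

def expected_risk(m):
    """Squared bias ||f_bar - f||^2 plus the exact variance sum_j sigma^2/(m*|N_j|)."""
    i = np.arange(1, n + 1)
    bins = np.minimum(i * m // n, m - 1)      # bin of x_i = i/n
    counts = np.bincount(bins, minlength=m)
    c_bar = np.bincount(bins, weights=f(i / n), minlength=m) / counts
    bias_sq = np.mean((c_bar[(t * m).astype(int)] - f(t)) ** 2)
    variance = np.sum(sigma**2 / counts) / m
    return bias_sq + variance

# Small m: large bias, small variance.  Large m: the reverse.
risks = {m: expected_risk(m) for m in (1, 3, 8, 30, 100, 300)}
```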
We need to balance the two terms on the right-hand side of (3) in order to maximize the rate of decay (with $n$) of the expected risk. This implies that $m^{-2} \asymp m/n$, therefore $m \asymp n^{1/3}$, and the mean squared error (MSE) is

\[ \mathbb{E}[\|\hat{f}_n - f\|^2] = O(n^{-2/3}). \]

It is interesting to note that the rate of decay of the MSE we obtain with this strategy cannot be further improved by using more sophisticated estimation techniques. In fact, we have the following minimax lower bound:

\[ \inf_{\hat{f}_n} \sup_{f \in F_L} \mathbb{E}[\|\hat{f}_n - f\|^2] \ge c(L, \sigma^2)\, n^{-2/3}, \]

where $c(L, \sigma^2) > 0$, and the infimum is taken over all possible estimators (i.e., all measurable functions of the data). Also, rather surprisingly, we are considering classes of models $F_m$ that are actually not Lipschitz; therefore our estimator of $f$ is not a Lipschitz function, unlike $f$ itself.

Exercise 2 Suppose that the true regression function $f$ was not really a Lipschitz smooth function, but instead a piecewise Lipschitz function. These are functions that are composed of a finite number of pieces that are Lipschitz. An example of such a function is

\[ g(t) = f_1(t)\,\mathbf{1}\{t \in [0, 1/3]\} + f_2(t)\,\mathbf{1}\{t \in (1/3, 1/2)\} + f_3(t)\,\mathbf{1}\{t \in [1/2, 1]\}, \]

where $f_1, f_2, f_3 \in F_L$. Let $G(M, L, R)$ denote the class of bounded piecewise Lipschitz functions. Each piece belongs to the class $F_L$, there are at most $M$ pieces, and any function $f \in G(M, L, R)$ is bounded, in the sense that $|f(x)| \le R$ for all $x \in [0, 1]$. Study the performance of the above estimator when $f \in G(M, L, R)$. Identify the best rate of error decay of the estimator risk.

Exercise 3 Suppose you want to remove noise from an image (e.g., a medical image). An image can be thought of as a function $f : [0,1]^2 \to \mathbb{R}$. Let's suppose it satisfies a two-dimensional Lipschitz condition

\[ |f(x_1, y_1) - f(x_2, y_2)| \le L \max(|x_1 - x_2|, |y_1 - y_2|), \qquad \forall x_1, y_1, x_2, y_2 \in [0, 1]. \]

1. Do you think this is a good model for images? Why or why not?

2. Assume $n$, the number of samples you get from the function, is the square of an integer, therefore $\sqrt{n} \in \mathbb{N}$.
Let $f$ be a function satisfying the above condition and let the observation model be

\[ Y_{ij} = f(i/\sqrt{n}, j/\sqrt{n}) + W_{ij}, \qquad i, j \in \{1, \ldots, \sqrt{n}\}, \]

where, as before, the noise variables are mutually independent and again $\mathbb{E}[W_{ij}] = 0$ and $\mathbb{E}[W_{ij}^2] = \sigma^2 < \infty$. Using a similar approach to the one in class, construct an estimator $\hat{f}_n$ of $f$. Using this procedure, what is the best rate of convergence attainable when $f$ is a two-dimensional Lipschitz function?
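A natural starting point for Exercise 3 is the two-dimensional analogue of the regressogram: partition $[0,1]^2$ into $m \times m$ square cells and average the observations falling in each cell. The sketch below is one possible construction, not a worked solution (the helper names and the requirement $m \le \sqrt{n}$ are my assumptions); the rate analysis is left to the exercise.

```python
import numpy as np

def regressogram_2d(Y, m):
    """Cell averages over an m-by-m partition of [0,1]^2.
    Y is a k-by-k array of samples Y_ij taken at (i/k, j/k), k = sqrt(n); m <= k."""
    k = Y.shape[0]
    b = np.minimum(np.arange(1, k + 1) * m // k, m - 1)  # 1-d cell index of each grid line
    C = np.zeros((m, m))   # per-cell sums
    N = np.zeros((m, m))   # per-cell counts
    for a in range(k):
        for c in range(k):
            C[b[a], b[c]] += Y[a, c]
            N[b[a], b[c]] += 1
    return C / N           # cell averages, the 2-d analogue of the c_hat_j

def predict_2d(c_hat, x, y):
    """Evaluate the piecewise-constant estimate at a point (x, y) in [0,1]^2."""
    m = c_hat.shape[0]
    return c_hat[min(int(x * m), m - 1), min(int(y * m), m - 1)]
```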