A Simple Regression Problem

R. M. Castro, March 23

In this brief note a simple regression problem will be introduced, illustrating clearly the bias-variance tradeoff. Let

$$Y_i = f(x_i) + W_i, \qquad i = 1, \ldots, n,$$

where $x_i = i/n$, $f : [0,1] \to \mathbb{R}$ is a function, and the $W_i$'s are independent random variables such that $\mathbb{E}[W_i] = 0$ and $\mathbb{E}[W_i^2] = \sigma^2 < \infty$. The object of interest is the function $f$. Using the data $\{Y_i\}$ we want to construct an estimate $\hat{f}_n$ that is good, in the sense that the squared $L_2$ distance

$$\|\hat{f}_n - f\|^2 = \int_0^1 \big(\hat{f}_n(t) - f(t)\big)^2 \, dt$$

is small (note that the above is a random quantity). In particular we want to minimize the expected risk $\mathbb{E}[\|\hat{f}_n - f\|^2]$.

In order to characterize the expected risk we need further assumptions on the function $f$, namely we assume it is Lipschitz smooth. Formally, we assume $f \in F_L$, where

$$F_L = \big\{ f : [0,1] \to \mathbb{R} \;:\; |f(s) - f(t)| \le L |t - s|, \ \forall\, t, s \in [0,1] \big\}$$

and $L > 0$ is a constant. Notice that such functions are continuous, but not necessarily differentiable. An example of such a function is depicted in Figure 1.

Our approach will use piecewise constant functions, in what is usually referred to as a regressogram (this is the regression analogue of the histogram). Let $m_n \in \mathbb{N}$ and define the class of piecewise constant functions

$$F_{m_n} = \left\{ f : f(t) = \sum_{j=1}^{m_n} c_j \, \mathbf{1}\!\left\{ \tfrac{j-1}{m_n} \le t < \tfrac{j}{m_n} \right\}, \ c_j \in \mathbb{R} \right\}.$$

The set $F_{m_n}$ is a linear space consisting of functions that are constant on the intervals $[\tfrac{j-1}{m_n}, \tfrac{j}{m_n})$, $j = 1, \ldots, m_n$.
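To make the observation model concrete, here is a minimal Python sketch (not part of the original note); the particular Lipschitz function $f$, sample size $n$, and noise level $\sigma$ are arbitrary illustrative choices.

    import numpy as np

    # Illustrative choices (not from the note): a Lipschitz function on [0, 1],
    # n equispaced design points x_i = i/n, and i.i.d. noise with variance sigma^2.
    def f(t):
        return np.abs(t - 0.4) + 0.3 * np.sin(6 * t)   # Lipschitz with L <= 1 + 1.8

    n = 60
    sigma = 0.1
    rng = np.random.default_rng(0)

    x = np.arange(1, n + 1) / n            # design points x_i = i/n
    W = sigma * rng.standard_normal(n)     # E[W_i] = 0, E[W_i^2] = sigma^2
    Y = f(x) + W                           # observations Y_i = f(x_i) + W_i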

Figure 1: Example of a Lipschitz function (blue) and corresponding observations (red); the red dots correspond to $(i/n, Y_i)$, $i = 1, \ldots, n$.

Clearly, if $m_n$ is large we can approximate almost any bounded function arbitrarily well. For notational ease we will drop the subscript in $m_n$ and use simply $m$.

We are going to use a bias-variance decomposition. First, let's define our estimator: it is going to be simply the average of the data $Y_i$ in each one of the intervals $I_j = [\tfrac{j-1}{m}, \tfrac{j}{m})$. A way to motivate this estimator is as follows. Our goal is to minimize $\mathbb{E}[\|\hat{f}_n - f\|^2]$, but obviously we cannot compute this expectation. Let's consider instead an empirical surrogate for it, namely

$$\hat{R}_n(f') = \frac{1}{n} \sum_{i=1}^{n} \big(f'(x_i) - Y_i\big)^2,$$

where $f'$ is an arbitrary function. Now let $f' \in F_m$, so that we can write it as

$$f'(t) = \sum_{j=1}^{m} c_j \, \mathbf{1}\{t \in I_j\},$$

where $c_j \in \mathbb{R}$. Define

$$N_j = \{ i : x_i \in I_j \}.$$

We can rewrite $\hat{R}_n(f')$ as

$$\hat{R}_n(f') = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{m} c_j \, \mathbf{1}\{x_i \in I_j\} - Y_i \right)^{\!2} = \frac{1}{n} \sum_{j=1}^{m} \sum_{i \in N_j} (c_j - Y_i)^2.$$

Define the estimator

$$\hat{f}_n = \arg\min_{f' \in F_m} \hat{R}_n(f'). \tag{1}$$
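The empirical risk $\hat{R}_n$ of a piecewise constant candidate can be computed directly from the binned form above. The following sketch is my own (helper name and indexing conventions are not from the note):

    import numpy as np

    def empirical_risk(c, x, Y):
        """R_hat_n(f') = (1/n) * sum_i (f'(x_i) - Y_i)^2 for the piecewise constant
        candidate f'(t) = sum_j c[j] * 1{t in I_j}, with I_j = [(j-1)/m, j/m)."""
        m = len(c)
        # 0-based index of the interval containing x_i; x_i = 1 is sent to the last bin
        bins = np.minimum((np.asarray(x) * m).astype(int), m - 1)
        return np.mean((np.asarray(c)[bins] - np.asarray(Y)) ** 2)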

The solution of (1) is

$$\hat{f}_n(t) = \sum_{j=1}^{m} \hat{c}_j \, \mathbf{1}\{t \in I_j\}, \qquad \hat{c}_j = \frac{1}{|N_j|} \sum_{i \in N_j} Y_i, \tag{2}$$

where $|N_j|$ denotes the number of elements in $N_j$. Notice that $|N_j|$ is always greater than zero provided $m < n$. We will assume this throughout the entire document.

Exercise 1. Prove that the solution of (1) is given by $\hat{f}_n(t) = \sum_{j=1}^{m} \hat{c}_j \, \mathbf{1}\{t \in I_j\}$, where the $\hat{c}_j$'s are given by (2).

Define also $\bar{f} \in F_m$, the expected value of $\hat{f}_n$:

$$\bar{f}(t) = \mathbb{E}[\hat{f}_n(t)] = \sum_{j=1}^{m} \bar{c}_j \, \mathbf{1}\{t \in I_j\}, \qquad \bar{c}_j = \frac{1}{|N_j|} \sum_{i \in N_j} f(x_i).$$

We are ready to do our bias-variance decomposition:

$$\begin{aligned}
\mathbb{E}\big[\|\hat{f}_n - f\|^2\big] &= \mathbb{E}\big[\|\hat{f}_n - \bar{f} + \bar{f} - f\|^2\big] \\
&= \mathbb{E}\big[\|\hat{f}_n - \bar{f}\|^2\big] + \mathbb{E}\big[\|\bar{f} - f\|^2\big] + 2\, \mathbb{E}\big[\langle \hat{f}_n - \bar{f}, \bar{f} - f \rangle\big] \\
&= \mathbb{E}\big[\|\hat{f}_n - \bar{f}\|^2\big] + \|\bar{f} - f\|^2 + 2\, \big\langle \mathbb{E}[\hat{f}_n] - \bar{f}, \bar{f} - f \big\rangle \\
&= \mathbb{E}\big[\|\hat{f}_n - \bar{f}\|^2\big] + \|\bar{f} - f\|^2,
\end{aligned}$$

where the final step follows from the fact that $\mathbb{E}[\hat{f}_n(t)] = \bar{f}(t)$. So the expected risk decomposes into two terms: the first is the variance (or estimation error), and the second is the squared bias (or approximation error).
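Here is a minimal sketch of the estimator in (2) (my own code, with made-up helper names): within each interval $I_j$ the estimate is the average of the $Y_i$ whose design points fall in that interval.

    import numpy as np

    def regressogram(x, Y, m):
        """Bin averages c_hat[j] of the least-squares piecewise constant fit in (2).
        Assumes m < n, so that every interval I_j contains at least one design point."""
        x, Y = np.asarray(x), np.asarray(Y)
        bins = np.minimum((x * m).astype(int), m - 1)    # interval index of each x_i
        return np.array([Y[bins == j].mean() for j in range(m)])

    def evaluate(c_hat, t):
        """Evaluate f_hat(t) = sum_j c_hat[j] * 1{t in I_j} at points t in [0, 1]."""
        m = len(c_hat)
        return c_hat[np.minimum((np.asarray(t) * m).astype(int), m - 1)]

For the simulated data from the first sketch, evaluate(regressogram(x, Y, m), x) gives the fitted values at the design points.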

Now we just need to evaluate each one of these terms. Let's start with the bias term:

$$\begin{aligned}
\|\bar{f} - f\|^2 &= \int_0^1 \big(\bar{f}(t) - f(t)\big)^2 \, dt \\
&= \sum_{j=1}^{m} \int_{I_j} \big(\bar{f}(t) - f(t)\big)^2 \, dt \\
&= \sum_{j=1}^{m} \int_{I_j} \big(\bar{c}_j - f(t)\big)^2 \, dt \\
&= \sum_{j=1}^{m} \int_{I_j} \left( \frac{1}{|N_j|} \sum_{i \in N_j} f(x_i) - f(t) \right)^{\!2} dt \\
&= \sum_{j=1}^{m} \int_{I_j} \left( \frac{1}{|N_j|} \sum_{i \in N_j} \big( f(x_i) - f(t) \big) \right)^{\!2} dt \\
&\le \sum_{j=1}^{m} \int_{I_j} \left( \frac{1}{|N_j|} \sum_{i \in N_j} \big| f(x_i) - f(t) \big| \right)^{\!2} dt \\
&\le \sum_{j=1}^{m} \int_{I_j} \left( \frac{1}{|N_j|} \sum_{i \in N_j} L |x_i - t| \right)^{\!2} dt \\
&\le \sum_{j=1}^{m} \int_{I_j} \left( L \, \frac{1}{m} \right)^{\!2} dt \\
&= \sum_{j=1}^{m} \frac{1}{m} \left( \frac{L}{m} \right)^{\!2} = \frac{L^2}{m^2},
\end{aligned}$$

where we used the Lipschitz assumption and, in the last inequality, the fact that $x_i$ and $t$ lie in the same interval of length $1/m$, so $|x_i - t| \le 1/m$.

So we see that if $m$ is large (provided it is smaller than $n$) the bias term goes to zero. In other words, we can approximate a Lipschitz smooth function arbitrarily well with a piecewise constant function.
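This bound is easy to check numerically. The sketch below is mine (not from the note) and reuses f, x and the helper functions from the earlier sketches; it compares the squared bias with $L^2/m^2$ for the illustrative $f$, whose Lipschitz constant is at most $1 + 0.3 \cdot 6 = 2.8$:

    import numpy as np

    # Numerical check of ||f_bar - f||^2 <= L^2 / m^2 for the illustrative f above.
    L, m = 2.8, 8
    t_grid = np.linspace(0, 1, 5000, endpoint=False)
    bins_x = np.minimum((x * m).astype(int), m - 1)   # interval of each design point
    fx = f(x)
    c_bar = np.array([fx[bins_x == j].mean() for j in range(m)])   # c_bar_j in the note
    f_bar = evaluate(c_bar, t_grid)
    sq_bias = np.mean((f_bar - f(t_grid)) ** 2)       # Riemann sum for the integral
    print(sq_bias, (L / m) ** 2)                      # the first number should be smaller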

Now for the variance:

$$\begin{aligned}
\mathbb{E}\big[\|\hat{f}_n - \bar{f}\|^2\big] &= \mathbb{E}\left[ \int_0^1 \big(\hat{f}_n(t) - \bar{f}(t)\big)^2 \, dt \right] \\
&= \mathbb{E}\left[ \sum_{j=1}^{m} \int_{I_j} (\hat{c}_j - \bar{c}_j)^2 \, dt \right] \\
&= \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}\big[ (\hat{c}_j - \bar{c}_j)^2 \big] \\
&= \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}\left[ \left( \frac{1}{|N_j|} \sum_{i \in N_j} \big( Y_i - f(x_i) \big) \right)^{\!2} \right] \\
&= \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}\left[ \left( \frac{1}{|N_j|} \sum_{i \in N_j} W_i \right)^{\!2} \right] \\
&= \frac{1}{m} \sum_{j=1}^{m} \frac{\sigma^2}{|N_j|},
\end{aligned}$$

where the last step uses the fact that the $W_i$'s are independent, with mean zero and variance $\sigma^2$.

Now notice that $|N_j|$ is roughly $n/m$. In fact, if we want to be precise we can say that $|N_j| \ge \lfloor n/m \rfloor$, where $\lfloor x \rfloor$ is the largest integer $k$ such that $k \le x$. Therefore $\lfloor n/m \rfloor \ge n/m - 1$, and so

$$\mathbb{E}\big[\|\hat{f}_n - \bar{f}\|^2\big] \le \frac{1}{m} \sum_{j=1}^{m} \frac{\sigma^2}{\lfloor n/m \rfloor} = \frac{\sigma^2}{\lfloor n/m \rfloor} \le \frac{\sigma^2}{n/m - 1} = \frac{\sigma^2 m}{n} \left( \frac{n}{n - m} \right).$$

So, as long as $m < c\,n$, with $0 < c < 1$, then

$$\mathbb{E}\big[\|\hat{f}_n - \bar{f}\|^2\big] \le \frac{\sigma^2 m}{n} \cdot \frac{1}{1 - c},$$

so the variance term is essentially proportional to $m/n$. In words, this means the variance term is proportional to the number of model parameters $m$ divided by the amount of data $n$. Combining everything we have

$$\mathbb{E}\big[\|\hat{f}_n - f\|^2\big] \le \frac{\sigma^2 m}{n} \cdot \frac{1}{1 - c} + \frac{L^2}{m^2} = O\!\left( \max\left\{ \frac{1}{m^2}, \frac{m}{n} \right\} \right), \tag{3}$$

where we make use of the Big-O notation [1]. At this point it becomes clear that there is an optimal choice for $m$: if $m$ is small then the squared bias term $O(1/m^2)$ is going to be large, but the variance term $O(m/n)$ is going to be small, and vice-versa. These two conflicting goals provide a tradeoff that directs our choice of $m$ (as a function of $n$). In Figure 2 we depict this tradeoff. In Figure 2(a) we considered a large value of $m$, and we see that the approximation of $f$ by a function in the class $F_m$ can be very accurate (that is, our estimate will have a small bias), but when we use the measured data our estimate looks very bad (high variance). On the other hand, as illustrated in Figure 2(b), using a very small $m$ allows our estimator to get very close to the best approximating function in the class $F_m$, so we have a low variance estimator, but the bias of our estimator (the difference between $f$ and $\bar{f}$) is quite considerable.

Figure 2: Approximation and estimation of $f$ (in blue). The function $\bar{f}$ is depicted in green and the function $\hat{f}_n$ in red; panel (a) uses a large value of $m$ and panel (b) a small one.

[1] The notation $x_n = O(y_n)$ (which reads "$x_n$ is big-O of $y_n$", or "$x_n$ is of the order of $y_n$ as $n$ goes to infinity") means that $x_n \le C y_n$, where $C$ is a positive constant and $y_n$ is a non-negative sequence.
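The tradeoff depicted in Figure 2 can also be seen numerically. The sketch below is my own (not part of the note); it reuses f, x, n, sigma, regressogram and evaluate from the earlier sketches and approximates the expected risk by Monte Carlo for several values of $m$: for small $m$ the bias dominates, for large $m$ the variance does.

    import numpy as np

    rng = np.random.default_rng(1)
    t_grid = np.linspace(0, 1, 2000, endpoint=False)   # grid for the L2 norm
    f_grid = f(t_grid)

    def expected_risk(m, n_rep=200):
        """Monte Carlo approximation of E||f_hat_n - f||^2 for a given m."""
        risks = []
        for _ in range(n_rep):
            Y_rep = f(x) + sigma * rng.standard_normal(n)   # fresh noise draw
            f_hat = evaluate(regressogram(x, Y_rep, m), t_grid)
            risks.append(np.mean((f_hat - f_grid) ** 2))    # Riemann sum for the integral
        return np.mean(risks)

    for m in (2, 4, 8, 16, 32):
        print(m, expected_risk(m))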

We need to balance the two terms on the right-hand side of (3) in order to maximize the rate of decay (with $n$) of the expected risk. This implies that $\frac{1}{m^2} \asymp \frac{m}{n}$, therefore $m \asymp n^{1/3}$, and the Mean Squared Error (MSE) is

$$\mathbb{E}\big[\|\hat{f}_n - f\|^2\big] = O\big(n^{-2/3}\big).$$

It is interesting to note that the rate of decay of the MSE we obtain with this strategy cannot be further improved by using more sophisticated estimation techniques. In fact, we have the following minimax lower bound:

$$\inf_{\hat{f}_n} \sup_{f \in F_L} \mathbb{E}\big[\|\hat{f}_n - f\|^2\big] \ge c(L, \sigma^2)\, n^{-2/3},$$

where $c(L, \sigma^2) > 0$, and the infimum is taken over all possible estimators (i.e., all measurable functions of the data). Also, rather surprisingly, we are considering classes of models $F_m$ whose members are actually not Lipschitz, therefore our estimator of $f$ is not a Lipschitz function, unlike $f$ itself.

Exercise 2. Suppose that the true regression function $f$ was not really a Lipschitz smooth function, but instead a piecewise Lipschitz function. These are functions that are composed of a finite number of pieces that are each Lipschitz. An example of such a function is

$$g(t) = f_1(t)\, \mathbf{1}\{t \in [0, 1/3]\} + f_2(t)\, \mathbf{1}\{t \in (1/3, 1/2]\} + f_3(t)\, \mathbf{1}\{t \in (1/2, 1]\},$$

where $f_1, f_2, f_3 \in F_L$. Let $G(M, L, R)$ denote the class of bounded piecewise Lipschitz functions: each piece belongs to the class $F_L$, there are at most $M$ pieces, and any function $f \in G(M, L, R)$ is bounded in the sense that $|f(x)| \le R$ for all $x \in [0, 1]$. Study the performance of the above estimator when $f \in G(M, L, R)$. Identify the best rate of error decay of the estimator risk.

Exercise 3. Suppose you want to remove noise from an image (e.g., a medical image). An image can be thought of as a function $f : [0,1]^2 \to \mathbb{R}$. Let's suppose it satisfies a two-dimensional Lipschitz condition:

$$|f(x_1, y_1) - f(x_2, y_2)| \le L \max\big( |x_1 - x_2|, |y_1 - y_2| \big), \qquad \forall\, x_1, y_1, x_2, y_2 \in [0, 1].$$

1. Do you think this is a good model for images? Why or why not?

2. Assume $n$, the number of samples you get from the function, is the square of an integer, so that $\sqrt{n} \in \mathbb{N}$. Let $f$ be a function satisfying the above condition and let the observation model be

$$Y_{ij} = f\big(i/\sqrt{n},\, j/\sqrt{n}\big) + W_{ij}, \qquad i, j \in \{1, \ldots, \sqrt{n}\},$$

where as before the noise variables are mutually independent and again $\mathbb{E}[W_{ij}] = 0$ and $\mathbb{E}[W_{ij}^2] = \sigma^2 < \infty$. Using a similar approach to the one in class, construct an estimator $\hat{f}_n$ for $f$. Using this procedure, what is the best rate of convergence attainable when $f$ is a two-dimensional Lipschitz function?