Estimation of a regression function by maxima of minima of linear functions


Estimation of a regression function by maxima of minima of linear functions

Adil M. Bagirov¹, Conny Clausen² and Michael Kohler³

¹ School of Information Technology and Mathematical Sciences, University of Ballarat, PO Box 663, Ballarat, Victoria 3353, Australia, email: a.bagirov@ballarat.edu.au
² Department of Mathematics, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany, email: clausen@math.uni-sb.de
³ Department of Mathematics, Technische Universität Darmstadt, Schloßgartenstr. 7, D-64289 Darmstadt, Germany, email: kohler@mathematik.tu-darmstadt.de

Abstract. Estimation of a regression function from independent and identically distributed random variables is considered. Estimates are defined by minimization of the empirical L2 risk over a class of functions which are defined as maxima of minima of linear functions. Results concerning the rate of convergence of the estimates are derived. In particular it is shown that for smooth regression functions satisfying the assumption of single index models, the estimate is able to achieve (up to some logarithmic factor) the corresponding optimal one-dimensional rate of convergence. Hence under these assumptions the estimate is able to circumvent the so-called curse of dimensionality. The small sample behaviour of the estimates is illustrated by applying them to simulated data.

Key words and phrases: Adaptation, dimension reduction, nonparametric regression, rate of convergence, single index model, L2 error.

MSC 2000 subject classifications: Primary 62G08; secondary 62G05.

Running title: Estimation of a regression function

Please send correspondence and proofs to: Conny Clausen, Department of Mathematics, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany, email: clausen@math.uni-sb.de

1 Introduction

1.1 Scope of this paper.

This paper considers the problem of estimating a multivariate regression function given a sample of the underlying distribution. In applications usually no a priori information about the regression function is known, therefore it is necessary to apply nonparametric methods for this estimation problem. There are several established methods for nonparametric regression, including regression trees like CART (cf. Breiman et al. (1984)), adaptive spline fitting like MARS (cf. Friedman (1991)) and least squares neural network estimates (cf., e.g., Chapter 11 in Hastie, Tibshirani and Friedman (2001)). All these methods minimize a kind of least squares risk of the regression estimate, either heuristically over a fixed and very complex function space (as for neural networks) or over a stepwise defined, data-dependent space of piecewise constant functions or piecewise polynomials (as for CART or MARS).

In this paper we consider a rather complex function space consisting of maxima of minima of linear functions, over which we minimize a least squares risk. Since each maximum of minima of linear functions is in fact a continuous piecewise linear function, we fit a linear spline function with free knots to the data. But in contrast to MARS, we do not need heuristics to choose these free knots; instead we use advanced methods from the optimization theory of nonlinear and nonconvex functions to compute our estimate approximately in an application.

1.2 Regression estimation.

In regression analysis an R^d × R-valued random vector (X, Y) with E(Y²) < ∞ is considered, and the dependency of Y on the value of X is of interest. More precisely, the goal is to find a function f : R^d → R such that f(X) is a good approximation of Y. In the sequel we assume that the main

aim of the analysis is minimization of the mean squared prediction error or L2 risk

E{|f(X) − Y|²}.

In this case the optimal function is the so-called regression function m : R^d → R, m(x) = E{Y | X = x}. Indeed, let f : R^d → R be an arbitrary measurable function and denote the distribution of X by µ. The well-known relation

E{|f(X) − Y|²} = E{|m(X) − Y|²} + ∫ |f(x) − m(x)|² µ(dx)

(cf., e.g., Györfi et al. (2002), eq. (1.1)) implies that the regression function is the optimal predictor in view of minimization of the L2 risk:

E{|m(X) − Y|²} = min_{f : R^d → R} E{|f(X) − Y|²}.

In addition, any function f is a good predictor in the sense that its L2 risk is close to the optimal value if and only if the so-called L2 error

∫ |f(x) − m(x)|² µ(dx)

is small. This motivates measuring the error caused by using a function f instead of the regression function by the L2 error.

In applications, usually the distribution of (X, Y) (and hence also the regression function) is unknown. But often it is possible to observe a sample of the underlying distribution. This leads to the regression estimation problem. Here (X, Y), (X_1, Y_1), (X_2, Y_2), ... are independent and identically distributed random vectors. The set of data

D_n = {(X_1, Y_1), ..., (X_n, Y_n)}

is given, and the goal is to construct an estimate

m_n = m_n(·, D_n) : R^d → R

of the regression function such that the L2 error

∫ |m_n(x) − m(x)|² µ(dx)

is small. For a detailed introduction to nonparametric regression we refer the reader to the monograph Györfi et al. (2002).

1.3 Definition of the estimate.

In the sequel we will use the principle of least squares to fit maxima of minima of linear functions to the data. More precisely, let K_n ∈ N and L_{1,n}, ..., L_{K_n,n} ∈ N be parameters of the estimate and set

F_n = { f : R^d → R : f(x) = max_{k=1,...,K_n} min_{l=1,...,L_{k,n}} (a_{k,l} · x + b_{k,l}) (x ∈ R^d) for some a_{k,l} ∈ R^d, b_{k,l} ∈ R },   (1)

where a_{k,l} · x = a_{k,l}^{(1)} x^{(1)} + ... + a_{k,l}^{(d)} x^{(d)} denotes the scalar product between a_{k,l} = (a_{k,l}^{(1)}, ..., a_{k,l}^{(d)})^T and x = (x^{(1)}, ..., x^{(d)})^T. For this class of functions the estimate m̃_n is defined by

m̃_n(·) = arg min_{f ∈ F_n} (1/n) Σ_{i=1}^{n} |f(X_i) − Y_i|².   (2)

Here we assume that the minimum exists; however, we do not require that it is unique. In Section 2 we will analyze the rate of convergence of a truncated version of this least squares estimate defined by m_n = T_{β_n} m̃_n, where

T_β(z) = β if z > β,  z if −β ≤ z ≤ β,  −β if z < −β,

for some β_n ∈ R_+.

1.4 Main results.

Under a sub-Gaussian condition on the distribution of Y and for a boundedly supported distribution of X we show that the L2 error of the estimate achieves, for (p, C)-smooth regression functions with p ≤ 2 (where, roughly speaking, all partial derivatives of the regression function of order p exist), the corresponding optimal rate of convergence n^{−2p/(2p+d)}

up to some logarithmic factor. For single index models, where the regression function m satisfies in addition

m(x) = m̃(β^T x)  (x ∈ R^d)

for some univariate function m̃ and some vector β ∈ R^d, we show furthermore that our estimate achieves (up to some logarithmic factor) the one-dimensional rate of convergence n^{−2p/(2p+1)}. Hence under these assumptions the estimate is able to circumvent the curse of dimensionality.

1.5 Discussion of related results.

In multivariate nonparametric regression function estimation there is a gap between theory and practice: the established estimates like CART, MARS or least squares neural networks are based on several heuristics for computing the estimates, which makes it basically impossible to analyze their rate of convergence theoretically. However, if one defines them without these heuristics, their rate of convergence can be analyzed (this has been done for neural networks, e.g., in Barron (1993, 1994), and for CART in Kohler (1999)), but in this form the estimates cannot be computed in an application. For our estimate a similar phenomenon occurs, since we need heuristics to compute it approximately in an application. The difference to the above established estimates is that we use heuristics from advanced optimization theory, in particular from the optimization theory of nonlinear and nonconvex functions (cf., e.g., Bagirov (1999, 2002) and Bagirov and Ugon (2006)), instead of complicated heuristics from statistics for stepwise computation (as for CART or MARS) or a simple gradient descent (as for least squares neural networks).

It follows from Stone (1982) that the rates of convergence which we derive are optimal in some minimax sense up to a logarithmic factor. The idea of imposing additional restrictions on the structure of the regression function (like additivity or the assumption of the single index model) and of deriving under these assumptions better rates of convergence is due to Stone (1985, 1994).
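For illustration, a function from the class F_n of Subsection 1.3 and the truncation operator T_β can be evaluated in a few lines. This is a minimal sketch; the array names and shapes are our own choices, not from the paper:

```python
import numpy as np

def maxmin_predict(x, a, b):
    """Evaluate f(x) = max_k min_l (a_{k,l} . x + b_{k,l}) at one point x.

    a: list of (L_k, d) coefficient arrays, b: list of (L_k,) offset arrays.
    """
    return max(float(np.min(ak @ x + bk)) for ak, bk in zip(a, b))

def truncate(z, beta):
    """Truncation operator T_beta: clip values to [-beta, beta]."""
    return float(np.clip(z, -beta, beta))

# d = 1, K = 2: the first group holds two linear pieces, the second one piece
a = [np.array([[1.0], [-1.0]]), np.array([[0.0]])]
b = [np.array([0.0, 2.0]), np.array([0.5])]
print(maxmin_predict(np.array([0.8]), a, b))  # max(min(0.8, 1.2), 0.5) = 0.8
print(truncate(3.7, beta=1.0))                # 1.0
```

Each inner minimum is one concave piecewise linear piece; the outer maximum glues K_n of them into a continuous piecewise linear fit, which is exactly the kind of function the class F_n contains.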

We use a theorem of Lee, Bartlett and Williamson (1996) to derive our rate of convergence results. This approach is described in detail in Section 11.3 of Györfi et al. (2002). Below we extend this approach to unbounded data satisfying a sub-Gaussian condition by introducing new truncation arguments. In this way we are able to derive the results under similarly general assumptions on the distribution of Y as with alternative methods from empirical process theory, see, e.g., the monograph van de Geer (2000) or Kohler (2000, 2006).

Maxima of minima of linear functions have been used in regression estimation already in Beliakov and Kohler (2005). There least squares estimates are derived by minimizing the empirical L2 risk over classes of functions consisting of Lipschitz smooth functions where a bound on the Lipschitz constant is given. It is shown that the resulting estimate is in fact a maximum of minima of linear functions, where the number of minima occurring in the maximum is equal to the sample size. Additional restrictions (e.g., on the linear functions in the minima) ensure that there will be no overfitting. In contrast, the number of linear functions which we consider in this article is much smaller, and restrictions on these linear functions are therefore not necessary. This seems to be promising, because we do not fit too many parameters to the data.

In Corollary 2 we show that even for a large dimension of X the L2 error of our estimate converges to zero quickly if the regression function satisfies the structural assumption of single index models. Similar results are shown in Section 22.2 of Györfi et al. (2002). However, in contrast to the estimate defined there, our newly proposed estimate can be computed in an application, which we will demonstrate in Section 3. So the main result here is to derive this good rate of convergence for an estimate which can be computed in an application.

1.6 Notations.

The sets of natural numbers, natural numbers including zero, real numbers and non-negative real numbers are denoted by N, N_0, R and R_+, respectively. For vectors x ∈ R^d we denote by ||x|| the Euclidean norm of x and by x · y the scalar product between x and y. The least integer greater than or equal to a real number

x will be denoted by ⌈x⌉. log(x) denotes the natural logarithm of x > 0. For a function f : R^d → R,

||f||_∞ = sup_{x ∈ R^d} |f(x)|

denotes the supremum norm.

1.7 Outline of the paper.

The main theoretical result is formulated in Section 2 and proven in Section 4. In Section 3 the estimate is illustrated by applying it to simulated data.

2 Analysis of the rate of convergence of the estimate

Our first theorem gives an upper bound on the expected L2 error of our estimate.

Theorem 1. Let K_n, L_{1,n}, ..., L_{K_n,n} ∈ N with K_n · max{L_{1,n}, ..., L_{K_n,n}} ≤ n/2, and set β_n = c_1 log n for some constant c_1 > 0. Assume that the distribution of (X, Y) satisfies

E( e^{c_2 Y²} ) < ∞   (3)

for some constant c_2 > 0, and that the regression function m is bounded in absolute value. Then for the estimate m_n defined as in Subsection 1.3,

E ∫ |m_n(x) − m(x)|² µ(dx) ≤ c_3 (log n)³ Σ_{k=1}^{K_n} L_{k,n} / n + 2 E{ inf_{f ∈ F_n} (1/n) Σ_{i=1}^{n} |f(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² }

for some constant c_3 > 0, and hence also

E ∫ |m_n(x) − m(x)|² µ(dx) ≤ c_3 (log n)³ Σ_{k=1}^{K_n} L_{k,n} / n + 2 inf_{f ∈ F_n} ∫ |f(x) − m(x)|² µ(dx),   (4)

where c_3 does not depend on n, β_n or the parameters of the estimate.
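The first term of the bound (4) is of order (log n)³ Σ_k L_{k,n} / n. A quick numeric check (with c_3 = 1 as a placeholder, since the theorem only guarantees the existence of some c_3 > 0) shows how this complexity penalty decays with n for a fixed budget of linear pieces:

```python
import math

def complexity_term(n, total_pieces, c3=1.0):
    """c3 * (log n)^3 * (sum_k L_{k,n}) / n, the first term of bound (4).

    c3 = 1 is a placeholder constant; the theorem does not specify it.
    """
    return c3 * math.log(n) ** 3 * total_pieces / n

for n in (500, 5000, 50000):
    print(n, round(complexity_term(n, total_pieces=30), 3))
```

The term vanishes as n grows, so the quality of the estimate is eventually driven by the approximation error, i.e., the second term of (4).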

The condition (3) is a modified sub-Gaussian condition, and it is in particular satisfied if P_{Y|X=x} is the normal distribution N(m(x), σ²) and the regression function m is bounded. This condition allows us to consider an unbounded conditional distribution of Y. Together with an approximation result, this theorem implies the next corollary, which considers the rate of convergence of the estimate. Here it is necessary to impose smoothness conditions on the regression function.

Definition 1. Let p = k + β for some k ∈ N_0 and 0 < β ≤ 1, and let C > 0. A function m : [a, b]^d → R is called (p, C)-smooth if for every α = (α_1, ..., α_d), α_i ∈ N_0, Σ_{j=1}^{d} α_j = k, the partial derivative ∂^k m / (∂x_1^{α_1} ... ∂x_d^{α_d}) exists and satisfies

| ∂^k m / (∂x_1^{α_1} ... ∂x_d^{α_d}) (x) − ∂^k m / (∂x_1^{α_1} ... ∂x_d^{α_d}) (z) | ≤ C ||x − z||^β

for all x, z ∈ [a, b]^d.

Corollary 1. Assume that the distribution of (X, Y) satisfies X ∈ [a, b]^d a.s. for some a, b ∈ R, that the modified sub-Gaussian condition E(exp(c_2 Y²)) < ∞ is fulfilled for some constant c_2 > 0, and that m is (p, C)-smooth for some 0 < p ≤ 2 and C > 1. Set β_n = c_1 log n for some c_1 > 0,

K_n = ⌈ C^{2d/(2p+d)} ( n / (log n)³ )^{d/(2p+d)} ⌉ and L_{k,n} = L_k = 2d + 1 (k = 1, ..., K_n).

Then we have for the estimate m_n defined as above

E ∫ |m_n(x) − m(x)|² µ(dx) ≤ const · C^{2d/(2p+d)} ( (log n)³ / n )^{2p/(2p+d)}.

The above rate of convergence converges slowly to zero in case of a large dimension d of the predictor variable X (so-called curse of dimensionality). Next we present a result which shows that under structural assumptions on the regression function

(more precisely, for single index models) our estimate is able to circumvent the curse of dimensionality.

Corollary 2. Assume that the distribution of (X, Y) satisfies X ∈ [a, b]^d a.s. for some a, b ∈ R, and that the modified sub-Gaussian condition E(exp(c_2 Y²)) < ∞ is fulfilled for some constant c_2 > 0. Furthermore assume that the regression function m satisfies

m(x) = m̃(α · x)  (x ∈ R^d)

for some (p, C)-smooth function m̃ : R → R and some α ∈ R^d. Then for the estimate m_n as above, and with the setting β_n = c_1 log n for some c_1 > 0,

K_n = ⌈ C^{2/(2p+1)} ( n / (log n)³ )^{1/(2p+1)} ⌉ and L_{k,n} = L_k = 3 (k = 1, ..., K_n),

we get

E ∫ |m_n(x) − m(x)|² µ(dx) ≤ const · C^{2/(2p+1)} ( (log n)³ / n )^{2p/(2p+1)}.

Remark 1. It follows from Stone (1982) that under the conditions of Corollary 1 no estimate can achieve (in some minimax sense) a rate of convergence which converges to zero faster than n^{−2p/(2p+d)} (cf., e.g., Chapter 3 in Györfi et al. (2002)). Hence Corollary 1 implies that our estimate achieves the optimal rate of convergence up to the logarithmic factor.

Remark 2. In any application the smoothness of the regression function, measured by (p, C), is not known in advance, and hence the parameters of the estimate have to be chosen data-dependently. This can be done, e.g., by splitting of the sample: the estimate is computed for various values of the parameters on a learning sample (consisting, e.g., of the first half of the data points), and the parameters are chosen

such that the empirical L2 risk on a testing sample (consisting, e.g., of the second half of the data points) is minimized (cf., e.g., Chapter 7 in Györfi et al. (2002)). Theoretical results concerning splitting of the sample can be found in Hamers and Kohler (2003) and in Chapter 7 of Györfi et al. (2002).

3 Application to simulated data

In our applications we choose the number of linear functions considered in the maxima and the minima in a data-dependent way by splitting of the sample. We split the sample of size n into a learning sample of size n_l < n and a testing sample of size n_t = n − n_l. We use the learning sample to define, for a fixed number of linear functions K, an estimate m_{n_l,K}, and compute the empirical L2 risk of this estimate on the testing sample. Since the testing sample is independent of the learning sample, this gives us an unbiased estimate of the L2 risk of m_{n_l,K}. Then we choose K by minimizing this estimate with respect to K. In the sequel we use n ∈ {500, 5000} and n_t = n_l = n/2.

To compute the estimate for given numbers of linear functions we have to minimize

(1/n) Σ_{i=1}^{n} | max_{k=1,...,K} min_{l=1,...,L_k} (a_{k,l} · x_i + b_{k,l}) − y_i |²

for given fixed x_1, ..., x_n ∈ R^d, y_1, ..., y_n ∈ R with respect to a_{k,l} ∈ R^d, b_{k,l} ∈ R (k = 1, ..., K, l = 1, ..., L_k). Unfortunately, we cannot solve this minimization problem exactly in general. The reason is that the function to be minimized is nonsmooth and nonconvex. Depending on K and L_k it may have a large number of variables (more than a hundred even in the case of univariate data). The function has many local minima, and their number increases drastically as the number of maxima and minima functions increases. Most of the local minimizers do not provide a good approximation to the data, and

therefore one is interested in finding either a global minimizer or a minimizer which is near to a global one. Conventional methods of global optimization are not effective for minimizing such functions, since they are very time consuming and cannot solve this problem in a reasonable time. Furthermore, the function to be minimized is a very complicated nonsmooth function, and the calculation of even only one subgradient of such a function is a difficult task. Therefore subgradient-based methods of nonsmooth optimization are not effective here.

Even though we cannot solve this minimization problem exactly, we are able to compute the estimate approximately. For this we use the following properties of the function to be minimized: it is a semismooth function (cf. Mifflin (1977)); moreover, it is a smooth composition of so-called quasidifferentiable functions (see Demyanov and Rubinov (1995) for the definition of quasidifferentiable functions). Therefore we can use the discrete gradient method from Bagirov (2002) to solve it. Furthermore, it is piecewise partially separable (see Bagirov and Ugon (2006) for the definition of such functions). We use the version of the discrete gradient method described in Bagirov and Ugon (2006) for minimizing piecewise partially separable functions to solve it. The discrete gradient method is a derivative-free method, and it is especially effective for the minimization of nonsmooth and nonconvex functions when the subgradient is not available or is difficult to calculate.

A detailed description of the algorithm used to compute the estimate is given in Bagirov, Clausen and Kohler (2007). An implementation of the estimate in Fortran is available from the authors by request. In Bagirov, Clausen and Kohler (2007) the estimate is also compared to various other nonparametric regression estimates. In the sequel we will only illustrate it by applying it to a few simulated data sets. Here we define (X, Y) by

Y = m(X) + σ · ǫ,

where X is uniformly distributed on [−2, 2]^d, ǫ is standard normally distributed and independent of X, and σ ≥ 0. In Figures 1 to 4 we choose d = 1 and σ = 1, and use

four different univariate regression functions in order to define four different data sets of size n = 500. Each figure shows the true regression function together with its formula, a corresponding sample of size n = 500, and our estimate applied to this sample.

[Figure 1: Simulation with the first univariate regression function, m(x) = 2x³ − 4x (n = 500, σ = 1).]

Here the first two examples show how the max-min estimate looks for rather simple regression functions, while in the third and fourth examples the regression function has some local irregularity. Here it can be seen that our newly proposed estimate is able to adapt locally to such irregularities in the regression function.

Next we consider the case d = 2. In our fifth example we choose

m(x_1, x_2) = x_1 sin(x_1²) − x_2 sin(x_2²),

n = 5000 and σ = 0.2. Figure 5 shows the regression function and our estimate applied to a corresponding data set of sample size 5000.
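The simulated data sets just described are straightforward to generate. A small sketch under the stated design (the helper name is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sample(m, n, d, sigma):
    """Draw (X, Y) with X uniform on [-2, 2]^d and Y = m(X) + sigma * eps,
    where eps is standard normal and independent of X."""
    X = rng.uniform(-2.0, 2.0, size=(n, d))
    Y = m(X) + sigma * rng.standard_normal(n)
    return X, Y

# first univariate example: m(x) = 2x^3 - 4x, n = 500, sigma = 1
X, Y = make_sample(lambda x: 2 * x[:, 0] ** 3 - 4 * x[:, 0], n=500, d=1, sigma=1.0)
print(X.shape, Y.shape)  # (500, 1) (500,)
```

The bivariate examples follow the same pattern with d = 2, n = 5000 and σ = 0.2.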

[Figure 2: Simulation with the second univariate regression function, m(x) = 4·x·sin(x·π/2) (n = 500, σ = 1).]

In our sixth example we choose

m(x_1, x_2) = −4 + 4x_1² + 4x_2²,

and again n = 5000 and σ = 0.2. Figure 6 shows the regression function and our estimate applied to a corresponding data set of sample size 5000.

In our seventh and final example we choose

m(x_1, x_2) = 6 − 2 · min{3, 4x_1² + 4x_2²},

and again n = 5000 and σ = 0.2. Figure 7 shows the regression function and our estimate applied to a corresponding data set of sample size 5000. From the last simulation we see again that our estimate is able to adapt to the local behaviour of the regression function.
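The data-dependent choice of the number of linear functions by splitting of the sample (first paragraph of Section 3) can be sketched as follows. Since the discrete gradient fitter itself is not reproduced here, a polynomial least squares fit stands in for m_{n_l,K} (a hypothetical stand-in, solely to show the selection logic):

```python
import numpy as np

rng = np.random.default_rng(1)

def select_by_splitting(x, y, candidates, fit, predict):
    """Fit each candidate parameter K on the learning half and return the K
    minimizing the empirical L2 risk on the testing half (plus all risks)."""
    n = len(x)
    xl, yl = x[: n // 2], y[: n // 2]   # learning sample
    xt, yt = x[n // 2 :], y[n // 2 :]   # testing sample
    risks = {K: float(np.mean((predict(fit(xl, yl, K), xt) - yt) ** 2))
             for K in candidates}
    return min(risks, key=risks.get), risks

# stand-in estimator: degree-K polynomial least squares (NOT the max-min fit)
fit = lambda x, y, K: np.polyfit(x, y, K)
predict = lambda coef, x: np.polyval(coef, x)

x = rng.uniform(-2.0, 2.0, 500)
y = 2 * x**3 - 4 * x + rng.standard_normal(500)
K_best, risks = select_by_splitting(x, y, [1, 2, 3, 5, 9], fit, predict)
print(K_best, round(risks[K_best], 3))
```

Because the testing half is independent of the learning half, each entry of `risks` is an unbiased estimate of the L2 risk of the corresponding fitted model, which is exactly why minimizing over it is a sound selection rule.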

[Figure 3: Simulation with the third univariate regression function, m(x) = 8 if x > 0, m(x) = 0 if x ≤ 0 (n = 500, σ = 1).]

4 Proofs

In the proofs we need the notion of covering numbers.

Definition 2. Let x_1, ..., x_n ∈ R^d and set x_1^n = (x_1, ..., x_n). Let F be a set of functions f : R^d → R. An Lp-ǫ-cover of F on x_1^n is a finite set of functions f_1, ..., f_k : R^d → R with the property

min_{1 ≤ j ≤ k} ( (1/n) Σ_{i=1}^{n} |f(x_i) − f_j(x_i)|^p )^{1/p} < ǫ for all f ∈ F.   (5)

The Lp-ǫ-covering number N_p(ǫ, F, x_1^n) of F on x_1^n is the minimal size of an Lp-ǫ-cover of F on x_1^n. In case there exists no finite Lp-ǫ-cover of F, the Lp-ǫ-covering number of F on x_1^n is defined by N_p(ǫ, F, x_1^n) = ∞.

To get bounds on covering numbers of sets of maxima of minima of linear functions, we first show the connection between the Lp-ǫ-covering numbers of sets

[Figure 4: Simulation with the fourth univariate regression function, m(x) = 8·x − 2·x²/(0.5 + x²) (n = 500, σ = 1).]

G_1, G_2, ... and the Lp-ǫ-covering number of their maximum

max{G_1, ..., G_l} = { f : R^d → R : f(x) = max{g_1(x), ..., g_l(x)} for some g_1 ∈ G_1, ..., g_l ∈ G_l }

and of their minimum (defined analogously), respectively.

Lemma 1. Let G_1, G_2, ..., G_l be l sets of functions from R^d to R and let x_1^n = (x_1, ..., x_n) ∈ R^d × ... × R^d be fixed points in R^d. Then

N_p( ǫ, max{G_1, ..., G_l}, x_1^n ) ≤ Π_{i=1}^{l} N_p( ǫ / l^{1/p}, G_i, x_1^n )   (6)

and

N_p( ǫ, min{G_1, ..., G_l}, x_1^n ) ≤ Π_{i=1}^{l} N_p( ǫ / l^{1/p}, G_i, x_1^n ).   (7)
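The proof of Lemma 1 given next rests on the elementary pointwise bound |max_i u_i − max_i v_i| ≤ max_i |u_i − v_i| ≤ (Σ_i |u_i − v_i|^p)^{1/p}. A quick numeric sanity check of the first inequality:

```python
import random

random.seed(0)

# check |max_i u_i - max_i v_i| <= max_i |u_i - v_i| on random tuples
for _ in range(1000):
    u = [random.uniform(-5.0, 5.0) for _ in range(4)]
    v = [random.uniform(-5.0, 5.0) for _ in range(4)]
    gap = abs(max(u) - max(v))
    assert gap <= max(abs(a - b) for a, b in zip(u, v)) + 1e-12
print("inequality holds on all samples")
```

Intuitively, replacing every g_i by a nearby cover element moves the maximum by at most the largest individual perturbation, which is why a product of covers of the G_i covers their maximum.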

[Figure 5: The bivariate regression function together with our max-min estimate in the fifth example.]

Proof. Inequality (6) follows from

( (1/n) Σ_{k=1}^{n} | max_{i=1,...,l} g_i(x_k) − max_{i=1,...,l} g^{(i)}_{j_i}(x_k) |^p )^{1/p}
≤ ( (1/n) Σ_{k=1}^{n} max_{i=1,...,l} | g_i(x_k) − g^{(i)}_{j_i}(x_k) |^p )^{1/p}
≤ ( (1/n) Σ_{k=1}^{n} Σ_{i=1}^{l} | g_i(x_k) − g^{(i)}_{j_i}(x_k) |^p )^{1/p}
≤ l^{1/p} max_{i=1,...,l} ( (1/n) Σ_{k=1}^{n} | g_i(x_k) − g^{(i)}_{j_i}(x_k) |^p )^{1/p}.

Inequality (7) follows directly from (6) with

min{G_1, ..., G_l} = −max{−G_1, ..., −G_l}.

[Figure 6: The bivariate regression function together with our max-min estimate in the sixth example.]

In the next lemma we bound the L1-ǫ-covering number of a truncated version of our class F_n of functions.

Lemma 2. Let x_1^n ∈ R^d × ... × R^d and set L := max{L_{1,n}, ..., L_{K_n,n}}. Then for 0 < ǫ < β_n/2,

N_1( ǫ, T_{β_n} F_n, x_1^n ) ≤ 3 ( (6 e β_n / ǫ) · K_n L )^{(2d+2) Σ_{k=1}^{K_n} L_{k,n}}.

Proof. In the first step of the proof we show that we can involve the truncation operator into the class of functions, i.e., we show

[Figure 7: The bivariate regression function together with our max-min estimate in the seventh example.]

T_{β_n} F_n = { f : R^d → R : f(x) = max_{1≤k≤K_n} min_{1≤l≤L_{k,n}} T_{β_n}( a_{k,l} · x + b_{k,l} ) for some a_{k,l} ∈ R^d, b_{k,l} ∈ R }.   (8)

At the beginning we observe that by monotonicity of the mapping x ↦ T_β(x) the equality

T_β( max_i z_i ) = max_i T_β(z_i)   (9)

holds for real numbers z_i ∈ R (i = 1, ..., n). With min_i z_i = −max_i(−z_i) and T_β(−z) = −T_β(z) we also get

T_β( min_i z_i ) = min_i T_β(z_i),

which implies (8). Set

G = { g : R^d → R : g(x) = a · x + b for some a ∈ R^d, b ∈ R }.

From Theorem 9.4, Theorem 9.5 and inequality (10.23) in Györfi et al. (2002) we get

N_1( ǫ, T_β G, x_1^n ) ≤ 3 ( (4eβ/ǫ) · log(6eβ/ǫ) )^{d+2}.

By applying Lemma 1 we get the desired result. □

With this bound on the covering number of T_{β_n} F_n we can now start with the proof of Theorem 1.

Proof of Theorem 1. In the proof we use the following error decomposition:

∫ |m_n(x) − m(x)|² µ(dx) = E{ |m_n(X) − Y|² | D_n } − E{ |m(X) − Y|² }
= [ E{ |m_n(X) − Y|² | D_n } − E{ |m(X) − Y|² } − ( E{ |m_n(X) − T_{β_n}Y|² | D_n } − E{ |m_{β_n}(X) − T_{β_n}Y|² } ) ]
+ [ E{ |m_n(X) − T_{β_n}Y|² | D_n } − E{ |m_{β_n}(X) − T_{β_n}Y|² } − 2( (1/n) Σ_{i=1}^{n} |m_n(X_i) − T_{β_n}Y_i|² − (1/n) Σ_{i=1}^{n} |m_{β_n}(X_i) − T_{β_n}Y_i|² ) ]
+ [ 2( (1/n) Σ_{i=1}^{n} |m_n(X_i) − T_{β_n}Y_i|² − (1/n) Σ_{i=1}^{n} |m_{β_n}(X_i) − T_{β_n}Y_i|² ) − 2( (1/n) Σ_{i=1}^{n} |m_n(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² ) ]
+ [ 2( (1/n) Σ_{i=1}^{n} |m_n(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² ) ]
=: Σ_{i=1}^{4} T_{i,n},

where T_{β_n}Y is the truncated version of Y and m_{β_n} is the regression function of T_{β_n}Y, i.e.,

m_{β_n}(x) = E{ T_{β_n}Y | X = x }.

We start with bounding T_{1,n}. By using a² − b² = (a − b)(a + b) we get

T_{1,n} = E{ |m_n(X) − Y|² − |m_n(X) − T_{β_n}Y|² | D_n } − E{ |m(X) − Y|² − |m_{β_n}(X) − T_{β_n}Y|² }
= E{ (T_{β_n}Y − Y)(2m_n(X) − Y − T_{β_n}Y) | D_n } − E{ (m(X) − m_{β_n}(X) + T_{β_n}Y − Y)(m(X) + m_{β_n}(X) − Y − T_{β_n}Y) }
=: T_{5,n} + T_{6,n}.

With the Cauchy–Schwarz inequality and

I_{|Y|>β_n} ≤ exp( (c_2/2) Y² ) / exp( (c_2/2) β_n² )   (10)

it follows that

T_{5,n} ≤ ( E{ |T_{β_n}Y − Y|² } )^{1/2} ( E{ |2m_n(X) − Y − T_{β_n}Y|² | D_n } )^{1/2}
≤ ( E{ Y² I_{|Y|>β_n} } )^{1/2} ( E{ 2|2m_n(X) − T_{β_n}Y|² + 2Y² | D_n } )^{1/2}
≤ ( E{ Y² exp((c_2/2)Y²) } / exp((c_2/2)β_n²) )^{1/2} ( 2(3β_n)² + 2E{Y²} )^{1/2}.

With x ≤ exp(x) for x ∈ R we get Y² ≤ (2/c_2) exp((c_2/2)Y²), and hence E{ Y² exp((c_2/2)Y²) } is bounded by

(2/c_2) E{ exp((c_2/2)Y²) · exp((c_2/2)Y²) } = (2/c_2) E{ exp(c_2 Y²) } =: c_4,

which is less than infinity by the assumptions of the theorem. Furthermore, the second factor is bounded by (18β_n² + c_5)^{1/2}, because 2E{Y²} ≤ 2E{ (1/c_2) exp(c_2 Y²) } ≤ c_5 < ∞, which follows again as above. With the setting β_n = c_1 log n it follows for some constants c_6, c_7 > 0 that

T_{5,n} ≤ √c_4 · exp( −c_6 (log n)² ) · ( 18 c_1² (log n)² + c_5 )^{1/2} ≤ c_7 log n / n.

From the Cauchy–Schwarz inequality we get

T_{6,n} ≤ ( 2E{ |m(X) − m_{β_n}(X)|² } + 2E{ |T_{β_n}Y − Y|² } )^{1/2} ( E{ |m(X) + m_{β_n}(X) − Y − T_{β_n}Y|² } )^{1/2},

where we can bound the second factor on the right-hand side in the same way as we bounded the second factor of T_{5,n}, because by assumption m is bounded and furthermore m_{β_n} is bounded by β_n. Thus we get for some constant c_8 > 0

E{ |m(X) + m_{β_n}(X) − Y − T_{β_n}Y|² } ≤ c_8 (log n)².

Next we consider the first factor. With Jensen's inequality it follows that

E{ |m(X) − m_{β_n}(X)|² } = E{ |E{ Y − T_{β_n}Y | X }|² } ≤ E{ |Y − T_{β_n}Y|² }.

Hence we get

T_{6,n} ≤ ( 4E{ |Y − T_{β_n}Y|² } · c_8 (log n)² )^{1/2},

and therefore, with the calculations from T_{5,n}, it follows that T_{6,n} ≤ c_9 log n / n for some constant c_9 > 0. Altogether we get

T_{1,n} ≤ c_10 log n / n

for some constant c_10 > 0.

Next we consider T_{2,n}. Let t > 1/n be arbitrary. Then

P{ T_{2,n} > t } ≤ P{ ∃ f ∈ T_{β_n}F_n : E{ |f(X)/β_n − T_{β_n}Y/β_n|² } − E{ |m_{β_n}(X)/β_n − T_{β_n}Y/β_n|² } − (1/n) Σ_{i=1}^{n} ( |f(X_i)/β_n − T_{β_n}Y_i/β_n|² − |m_{β_n}(X_i)/β_n − T_{β_n}Y_i/β_n|² ) > (1/2) ( t/β_n² + E{ |f(X)/β_n − T_{β_n}Y/β_n|² } − E{ |m_{β_n}(X)/β_n − T_{β_n}Y/β_n|² } ) }.

Thus with Theorem 11.4 in Györfi et al. (2002) and

N_1( δ, { f/β_n : f ∈ F }, x_1^n ) ≤ N_1( δ β_n, F, x_1^n ),

we get for x_1^n = (x_1, ..., x_n) ∈ R^d × ... × R^d

P{ T_{2,n} > t } ≤ 14 sup_{x_1^n} N_1( t/(80β_n), T_{β_n}F_n, x_1^n ) exp( − n t / (5136 β_n²) ).

From Lemma 2 we know that, with L := max{L_{1,n}, ..., L_{K_n,n}}, for 1/n < t < 40β_n

N_1( t/(80β_n), T_{β_n}F_n, x_1^n ) ≤ 3 ( (480 e β_n² / t) · K_n L )^{(2d+2) Σ_{k=1}^{K_n} L_{k,n}} ≤ ( c_11 n³ )^{(2d+2) Σ_{k=1}^{K_n} L_{k,n}}

for some sufficiently large c_11 > 0. This inequality also holds for t ≥ 40β_n, since the right-hand side above does not depend on t and the covering number is decreasing in t. Using this we get for arbitrary ǫ_n ≥ 1/n

E T_{2,n} ≤ ǫ_n + ∫_{ǫ_n}^{∞} P{ T_{2,n} > t } dt ≤ ǫ_n + 14 ( c_11 n³ )^{(2d+2) Σ_{k=1}^{K_n} L_{k,n}} · (5136 β_n² / n) · exp( − n ǫ_n / (5136 β_n²) ),

and this expression is minimized for

ǫ_n = (5136 β_n² / n) · log( 14 ( c_11 n³ )^{(2d+2) Σ_{k=1}^{K_n} L_{k,n}} ).

Altogether we get

E T_{2,n} ≤ c_12 (log n)³ Σ_{k=1}^{K_n} L_{k,n} / n

for some sufficiently large constant c_12 > 0, which does not depend on n, β_n or the parameters of the estimate. By bounding T_{3,n} similarly to T_{1,n} we get E T_{3,n} ≤ c_13 log n / n for some large enough constant c_13 > 0, and hence we get overall

E{ Σ_{i=1}^{3} T_{i,n} } ≤ c_14 (log n)³ Σ_{k=1}^{K_n} L_{k,n} / n

for some sufficiently large constant c_14 > 0.

We finish the proof by bounding T_{4,n}. Let A_n be the event that there exists i ∈ {1, ..., n} such that |Y_i| > β_n, and let I_{A_n} be the indicator function of A_n. Then we get

E T_{4,n} ≤ 2 E{ ( (1/n) Σ_{i=1}^{n} |m_n(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² ) I_{A_n} } + 2 E{ ( (1/n) Σ_{i=1}^{n} |m_n(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² ) I_{A_n^c} } =: T_{7,n} + T_{8,n}.

With the Cauchy–Schwarz inequality we get for T_{7,n}

T_{7,n}/2 ≤ ( E{ |m_n(X_1) − Y_1|⁴ } )^{1/2} ( P(A_n) )^{1/2} ≤ ( E{ 8 m_n(X_1)⁴ + 8 Y_1⁴ } )^{1/2} ( n P{ |Y| > β_n } )^{1/2} ≤ ( E{ 8 m_n(X_1)⁴ + 8 Y_1⁴ } )^{1/2} ( n E{ exp(c_2 Y²) } / exp(c_2 β_n²) )^{1/2},

where the last inequality follows from inequality (10). With x ≤ exp(x) for x ∈ R we get

E{ Y⁴ } = E{ Y² · Y² } ≤ E{ ( (2/c_2) exp((c_2/2) Y²) )² } = (4/c_2²) E{ exp(c_2 Y²) },

which is less than infinity by condition (3) of the theorem. Furthermore m_n is bounded by β_n, and therefore the first factor is bounded by c_15 β_n² = c_16 (log n)² for some constant c_16 > 0. The second factor is bounded as follows: by the assumptions of the theorem E{ exp(c_2 Y²) } is bounded by some constant c_17 < ∞, and hence we get

( n E{ exp(c_2 Y²) } / exp(c_2 β_n²) )^{1/2} ≤ ( c_17 n exp( −c_2 c_1² (log n)² ) )^{1/2}.

Since exp( −c (log n)² ) decreases faster than any power of n for c > 0, we get altogether

T_{7,n} ≤ c_18 (log n)² / n² ≤ c_19 (log n)² / n.

With the definition of A_n^c and m̃_n defined as in (2), it follows that

T_{8,n} ≤ 2 E{ ( (1/n) Σ_{i=1}^{n} |m_n(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² ) I_{A_n^c} }
≤ 2 E{ (1/n) Σ_{i=1}^{n} |m̃_n(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² }
≤ 2 E{ inf_{f ∈ F_n} (1/n) Σ_{i=1}^{n} |f(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² },

because |T_{β_n}(z) − y| ≤ |z − y| holds for |y| ≤ β_n. Hence

E T_{4,n} ≤ c_19 (log n)² / n + 2 E{ inf_{f ∈ F_n} (1/n) Σ_{i=1}^{n} |f(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|² },

which completes the proof. □

In the sequel we will bound

inf_{f ∈ F_n} (1/n) Σ_{i=1}^{n} |f(X_i) − Y_i|² − (1/n) Σ_{i=1}^{n} |m(X_i) − Y_i|².

For this we will use the following lemma.

In the sequel we will bound
\[
\mathbf{E}\Big\{ \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n |f(X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(X_i) - Y_i|^2 \Big\}.
\]
To this end we will use the following lemma.

Lemma 3. Let $K \in \mathbb{N}$ and let $\Pi$ be a partition of $[a,b]^d$ consisting of $K$ rectangles. Assume that $f^{lin} : [a,b]^d \to \mathbb{R}$ is a piecewise polynomial of degree $M = 1$ in each coordinate with respect to $\Pi$, and assume that $f^{lin}$ is continuous. Furthermore, let $x_1, \ldots, x_n \in \mathbb{R}^d$ be fixed points. Then there exist linear functions
\[
f_{1,0}, \ldots, f_{1,2d}, \ldots, f_{K,0}, \ldots, f_{K,2d} : \mathbb{R}^d \to \mathbb{R}
\]
such that
\[
f^{lin}(z) = \max_{i=1,\ldots,K}\ \min_{k=0,\ldots,2d} f_{i,k}(z) \quad \text{for all } z \in \{x_1, \ldots, x_n\}.
\]

Proof. Since $f^{lin}$ is a piecewise polynomial of degree $1$, it is of the form
\[
f^{lin}(z) = \sum_{i=1}^K f_i^{lin}(z)\, I_{A_i}(z) = \sum_{i=1}^K \Big( \sum_{j=1}^d \alpha_{i,j}\, z^{(j)} + \alpha_{i,0} \Big) I_{A_i}(z)
\]
for some constants $\alpha_{i,j} \in \mathbb{R}$ ($i = 1, \ldots, K$, $j = 0, \ldots, d$), where $\Pi = \{A_1, \ldots, A_K\}$ is a partition of $[a,b]^d$ with $A_i = I_i^{(1)} \times \cdots \times I_i^{(d)}$ for some univariate intervals $I_i^{(j)}$ ($i = 1, \ldots, K$). We denote the left and the right endpoint of $I_i^{(j)}$ by $a_{i,j}$ and $b_{i,j}$, respectively, i.e.,
\[
I_i^{(j)} = [a_{i,j}, b_{i,j}) \quad \text{or} \quad I_i^{(j)} = [a_{i,j}, b_{i,j}].
\]
This choice is without restriction of any kind because $f^{lin}$ is continuous. Now we choose for every $i \in \{1, \ldots, K\}$
\[
f_{i,0}(x) = f_i^{lin}(x) = \sum_{j=1}^d \alpha_{i,j}\, x^{(j)} + \alpha_{i,0}.
\]
This implies that $f_{i,0}$ and the given piecewise polynomial $f^{lin}$ coincide on $A_i$ for every $i = 1, \ldots, K$. Furthermore, for $i = 1, \ldots, K$ and $j = 1, \ldots, d$ we define
\[
f_{i,2j-1}(x) = f_i^{lin}(x) + \big(x^{(j)} - a_{i,j}\big)\, \beta_{i,j},
\]
where $\beta_{i,j} \ge 0$ is such that
\[
f_{i,2j-1}(z) \le f^{lin}(z) \quad \text{for all } z = (z^{(1)}, \ldots, z^{(d)}) \in \{x_1, \ldots, x_n\} \text{ satisfying } z^{(j)} < a_{i,j}
\]
and
\[
f_{i,2j-1}(z) \ge f^{lin}(z) \quad \text{for all } z = (z^{(1)}, \ldots, z^{(d)}) \in \{x_1, \ldots, x_n\} \text{ satisfying } z^{(j)} > a_{i,j}.
\]
The above conditions are satisfied if
\[
\beta_{i,j} \ge \max_{k=1,\ldots,n;\ x_k^{(j)} \ne a_{i,j}} \frac{f^{lin}(x_k) - f_i^{lin}(x_k)}{x_k^{(j)} - a_{i,j}}.
\]
For $z^{(j)} = a_{i,j}$ we obviously have $f_{i,2j-1}(z) = f_i^{lin}(z)$. Analogously we choose
\[
f_{i,2j}(x) = f_i^{lin}(x) - \big(x^{(j)} - b_{i,j}\big)\, \gamma_{i,j},
\]
where $\gamma_{i,j} \ge 0$ is such that
\[
f_{i,2j}(z) \ge f^{lin}(z) \quad \text{for all } z \in \{x_1, \ldots, x_n\} \text{ satisfying } z^{(j)} < b_{i,j}
\]
and
\[
f_{i,2j}(z) \le f^{lin}(z) \quad \text{for all } z \in \{x_1, \ldots, x_n\} \text{ satisfying } z^{(j)} > b_{i,j}.
\]
In this case the conditions from above are satisfied if
\[
\gamma_{i,j} \ge \max_{k=1,\ldots,n;\ x_k^{(j)} \ne b_{i,j}} \frac{f_i^{lin}(x_k) - f^{lin}(x_k)}{x_k^{(j)} - b_{i,j}}.
\]
From this choice of the functions $f_{i,j}$ ($i = 1, \ldots, K$, $j = 0, \ldots, 2d$) it follows directly that
\[
\min_{k=0,\ldots,2d} f_{i,k}(z)
\begin{cases}
= f_i^{lin}(z) = f^{lin}(z) & \text{for } z \in A_i \cap \{x_1, \ldots, x_n\}, \\
\le f^{lin}(z) & \text{for } z \in \{x_1, \ldots, x_n\},
\end{cases}
\]
holds for all $i = 1, \ldots, K$, which implies the assertion. □
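For $d = 1$ the construction in the proof of Lemma 3 can be carried out explicitly: each cell contributes its own linear piece $f_{i,0}$ together with two tilted copies whose slopes $\beta_i$, $\gamma_i$ are chosen by the maxima above. The following sketch (a hypothetical piecewise linear function and hypothetical sample points, chosen only for illustration) verifies that the max-min representation reproduces $f^{lin}$ at the sample points:

```python
# Hypothetical d=1 instance of the Lemma 3 construction: represent a continuous
# piecewise linear f on [0,3] as a max of mins of linear functions on sample points.
knots = [0.0, 1.0, 2.0, 3.0]                     # cell boundaries a_i, b_i
pieces = [(1.0, 0.0), (-0.5, 1.5), (1.5, -2.5)]  # (slope, intercept) per cell

def f(x):
    """Evaluate the continuous piecewise linear function."""
    for i, (s, t) in enumerate(pieces):
        if knots[i] <= x <= knots[i + 1]:
            return s * x + t
    raise ValueError(x)

xs = [0.1, 0.4, 0.9, 1.3, 1.7, 2.2, 2.8]         # fixed sample points x_1,...,x_n

def maxmin(x):
    """max_i min_k f_{i,k}(x), with tilts beta_i, gamma_i as in the proof."""
    vals = []
    for i, (s, t) in enumerate(pieces):
        a, b = knots[i], knots[i + 1]
        def fi(z, s=s, t=t):                     # f_{i,0}: the cell's own piece
            return s * z + t
        beta = max([0.0] + [(f(z) - fi(z)) / (z - a) for z in xs if z != a])
        gamma = max([0.0] + [(fi(z) - f(z)) / (z - b) for z in xs if z != b])
        vals.append(min(fi(x),
                        fi(x) + (x - a) * beta,    # f_{i,1}
                        fi(x) - (x - b) * gamma))  # f_{i,2}
    return max(vals)

# the max-min representation matches f at every sample point
for x in xs:
    assert abs(maxmin(x) - f(x)) < 1e-9
```

Off the sample points the max-min function need not coincide with $f^{lin}$; the lemma only claims (and the sketch only checks) agreement on $\{x_1, \ldots, x_n\}$.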

Proof of Corollary 1. Lemma 3 yields
\[
\begin{split}
&\mathbf{E}\Big\{ 2\Big( \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n |f(X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(X_i) - Y_i|^2 \Big) \Big\} \\
&\le \mathbf{E}\Big\{ 2\Big( \inf_{f \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n |f(X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(X_i) - Y_i|^2 \Big) \Big\} \\
&\le 2 \inf_{f \in \mathcal{G}} \int |f(x) - m(x)|^2\, \mu(dx),
\end{split}
\]
where $\mathcal{G}$ is the set of functions which contains all continuous piecewise polynomials of degree $1$ (in each coordinate) with respect to an arbitrary partition $\Pi$ consisting of $K_n$ rectangles. Next we increase the right-hand side above by choosing $\Pi$ such that it consists of equivolume cubes. Now we can apply approximation results from spline theory, see, e.g., Schumaker (1981), Theorems 12.8 and 13.62. From this, the $(p,C)$-smoothness of $m$, and Theorem 1 we conclude, for some sufficiently large constant $c_{20} > 0$,
\[
\mathbf{E} \int |m_n(x) - m(x)|^2\, \mu(dx)
\le c_3\, \frac{(2d+1)\, K_n (\log n)^3}{n} + c_{20}\, C^2 K_n^{-2p/d}
\le c_{20}\, C^{2d/(2p+d)} \Big( \frac{(\log n)^3}{n} \Big)^{2p/(2p+d)},
\]
where the last inequality results from the choice of $K_n$. □

Proof of Corollary 2. With the assumptions on the regression function $m$, the second term on the right-hand side of inequality (4) equals
\[
\mathbf{E}\Big\{ 2\Big( \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n |f(X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(\alpha^T X_i) - Y_i|^2 \Big) \Big\},
\]
and with
\[
\tilde{\mathcal{F}} := \Big\{ \max_{k=1,\ldots,K}\ \min_{l=1,\ldots,L_k} \big( a_{k,l}\, x + b_{k,l} \big) \ :\ a_{k,l}, b_{k,l} \in \mathbb{R} \Big\}
\]
this expected value is less than or equal to
\[
\mathbf{E}\Big\{ 2\Big( \inf_{h \in \tilde{\mathcal{F}}} \frac{1}{n}\sum_{i=1}^n |h(\alpha^T X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(\alpha^T X_i) - Y_i|^2 \Big) \Big\},
\]
because for every function $h \in \tilde{\mathcal{F}}$ and every vector $\alpha \in \mathbb{R}^d$ the function
\[
f(x) = h(\alpha^T x) \quad (x \in \mathbb{R}^d)
\]
is contained in $\mathcal{F}$. Together with Lemma 3 this yields
\[
\mathbf{E}\Big\{ 2\Big( \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n |f(X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(\alpha^T X_i) - Y_i|^2 \Big) \Big\}
\]

\[
\begin{split}
&\le \mathbf{E}\Big\{ 2\Big( \inf_{h \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n |h(\alpha^T X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(\alpha^T X_i) - Y_i|^2 \Big) \Big\} \\
&\le 2 \inf_{h \in \mathcal{G}} \int |h(\alpha^T x) - m(\alpha^T x)|^2\, \mu(dx) \\
&\le 2 \inf_{h \in \mathcal{G}} \max_{x \in [a,b]^d} |h(\alpha^T x) - m(\alpha^T x)|^2 \\
&\le 2 \inf_{h \in \mathcal{G}} \max_{x \in [\hat{a}, \hat{b}]} |h(x) - m(x)|^2,
\end{split}
\]
where $\mathcal{G}$ is the set of functions from $\mathbb{R}$ to $\mathbb{R}$ which contains all piecewise polynomials of degree one with respect to a partition of $[\hat{a}, \hat{b}]$ consisting of $K_n$ intervals. Here $[\hat{a}, \hat{b}]$ is chosen such that $\alpha^T x \in [\hat{a}, \hat{b}]$ for all $x \in [a,b]^d$. Hence, again with the approximation result from spline theory, we get as in the proof of Corollary 1, for some sufficiently large constant $c_{21}$,
\[
\mathbf{E}\Big\{ 2\Big( \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n |f(X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^n |m(\alpha^T X_i) - Y_i|^2 \Big) \Big\} \le c_{21}\, C^2 K_n^{-2p}.
\]
Summarizing the above results we get by Theorem 1
\[
\mathbf{E} \int |m_n(x) - m(x)|^2\, \mu(dx)
\le c_3\, \frac{(\log n)^3 \sum_{k=1}^{K_n} L_{k,n}}{n} + c_{21}\, C^2 K_n^{-2p}
\le c_{22}\, C^{2/(2p+1)} \Big( \frac{(\log n)^3}{n} \Big)^{2p/(2p+1)}. \qquad \Box
\]

References

[1] Bagirov, A. M. (1999). Minimization methods for one class of nonsmooth functions and calculation of semi-equilibrium prices. In: A. Eberhard et al. (eds.), Progress in Optimization: Contributions from Australasia, Kluwer Academic Publishers, pp. 47–75.

[2] Bagirov, A. M. (2002). A method for minimization of quasidifferentiable functions. Optimization Methods and Software 17, pp. 31–60.

[3] Bagirov, A. M., Clausen, C., and Kohler, M. (2007). An algorithm for the estimation of a regression function by continuous piecewise linear functions. Submitted for publication.

[4] Bagirov, A. M., and Ugon, J. (2006). Piecewise partially separable functions and a derivative-free method for large-scale nonsmooth optimization. Journal of Global Optimization 35, pp. 163–195.

[5] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39, pp. 930–945.

[6] Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning 14, pp. 115–133.

[7] Beliakov, G., and Kohler, M. (2005). Estimation of regression functions by Lipschitz continuous functions. Submitted for publication.

[8] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

[9] Demyanov, V. F., and Rubinov, A. M. (1995). Constructive Nonsmooth Analysis. Peter Lang, Frankfurt am Main.

[10] Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics 19, pp. 1–141.

[11] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics, Springer.

[12] Hamers, M., and Kohler, M. (2003). A bound on the expected maximal deviations of sample averages from their means. Statistics & Probability Letters 62, pp. 137–144.

[13] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York.

[14] Kohler, M. (1999). Nonparametric estimation of piecewise smooth regression functions. Statistics & Probability Letters 43, pp. 49–55.

[15] Kohler, M. (2000). Inequalities for uniform deviations of averages from expectations with applications to nonparametric regression. Journal of Statistical Planning and Inference 89, pp. 1–23.

[16] Kohler, M. (2006). Nonparametric regression with additional measurement errors in the dependent variable. Journal of Statistical Planning and Inference 136, pp. 3339–3361.

[17] Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory 42, pp. 2118–2132.

[18] Mifflin, R. (1977). Semismooth and semiconvex functions in constrained optimization. SIAM Journal on Control and Optimization 15, pp. 957–972.

[19] Schumaker, L. (1981). Spline Functions: Basic Theory. Wiley, New York.

[20] Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics 10, pp. 1040–1053.

[21] Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics 13, pp. 689–705.

[22] Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics 22, pp. 118–184.

[23] van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.