Best approximation by linear combinations of characteristic functions of half-spaces

Paul C. Kainen
Department of Mathematics, Georgetown University, Washington, D.C. 20057-1233, USA

Věra Kůrková
Institute of Computer Science, Academy of Sciences of the Czech Republic, 182 07 Prague 8, Czech Republic

Andrew Vogt
Department of Mathematics, Georgetown University, Washington, D.C. 20057-1233, USA

Journal of Approximation Theory, Volume 122, Number 2, June 2003, 151-159.

Proposed running head: Half-space characteristic functions

Name and address of corresponding author:
Andrew Vogt, Department of Mathematics, Georgetown University, Washington, D.C. 20057-1233, USA
e-mail: vogt@math.georgetown.edu

Abstract

It is shown that for any positive integer $n$ and any function in $L_p([0,1]^d)$ with $p \in [1, \infty)$ there exists a best approximation by linear combinations of $n$ characteristic functions of half-spaces. Further, sequences of such linear combinations converging in distance to the best approximation distance have subsequences converging to the best approximation, i.e., these linear combinations are an approximatively compact set.

Keywords. Best approximation, proximinal, approximatively compact, boundedly compact, Heaviside perceptron networks, plane waves.

1 Introduction

An important type of nonlinear approximation is variable-basis approximation, where the set of approximating functions is formed by linear combinations of $n$ functions from a given set. This approximation scheme has been widely investigated: it includes splines with free nodes, trigonometric polynomials with free frequencies, sums of wavelets, and feedforward neural networks. To estimate rates of variable-basis approximation, it is helpful to study properties like existence, uniqueness, and continuity of corresponding approximation operators.

Here we investigate the existence property for one-hidden-layer Heaviside perceptron networks, i.e., approximations by linear combinations of characteristic functions of closed half-spaces. Such functions are obtained by composing the Heaviside function with affine functions. We show that for all positive integers $n$, $d$, in $L_p([0,1]^d)$ with $p \in [1, \infty)$ there exists a best approximation mapping to the set of functions computable by Heaviside perceptron networks with $n$ hidden and $d$ input units. Thus for any $p$-integrable function on $[0,1]^d$ there is a linear combination of $n$ characteristic functions of closed half-spaces that is nearest in the $L_p$-norm.

A related proposition is proved by Chui, Li, and Mhaskar in [1], where certain sequences are shown to have subsequences that converge a.e. These authors work in $\mathbb{R}^d$ rather than $[0,1]^d$ and show a.e. convergence rather than $L_p$ convergence.

2 Heaviside perceptron networks

Feedforward networks compute parametrized sets of functions dependent both on the type of computational units and their interconnections. Computational units compute functions of two vector variables: an input vector and a parameter vector. A standard type of computational unit is the perceptron. A perceptron with an activation function $\psi : \mathbb{R} \to \mathbb{R}$ (where $\mathbb{R}$ denotes the set of real numbers) computes real-valued functions on $\mathbb{R}^d \times \mathbb{R}^{d+1}$ of the form $\psi(v \cdot x + b)$, where $x \in \mathbb{R}^d$ is an input vector, $v \in \mathbb{R}^d$ is an input weight vector, and $b \in \mathbb{R}$ is a bias.
The most common activation functions are sigmoidals, i.e., functions with an ess-shaped graph. Both continuous and discontinuous sigmoidals are used. Here we study networks based on the archetypal discontinuous sigmoidal, namely, the Heaviside function $\vartheta$ defined by $\vartheta(t) = 0$ for $t < 0$ and $\vartheta(t) = 1$ for $t \ge 0$.

Let $H_d$ denote the set of functions on $[0,1]^d$ computable by Heaviside perceptrons, i.e.,

$$H_d = \{ f : [0,1]^d \to \mathbb{R} \;:\; f(x) = \vartheta(v \cdot x + b),\ v \in \mathbb{R}^d,\ b \in \mathbb{R} \}.$$

$H_d$ is the set of characteristic functions of closed half-spaces of $\mathbb{R}^d$ restricted to $[0,1]^d$, which is a subset of the set of plane waves (see, e.g., Courant and Hilbert [2, pp. 676-681]). For $A \subseteq \mathbb{R}^d$ we denote by $\chi_A$ the characteristic function of $A$, i.e., $\chi_A(x) = 1$ for $x \in A$ and $\chi_A(x) = 0$ for $x \notin A$.

The simplest type of multilayer feedforward network has one hidden layer and one linear output. Such networks with Heaviside perceptrons in the hidden layer compute functions of the form

$$\sum_{i=1}^{n} w_i \, \vartheta(v_i \cdot x + b_i),$$

where $n$ is the number of hidden units, $w_i \in \mathbb{R}$ are output weights, and $v_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$ are input weights and biases, respectively.
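The network function above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function names `heaviside` and `heaviside_network` and the sample weights are assumptions chosen for the example.

```python
from typing import Sequence


def heaviside(t: float) -> float:
    """Heaviside function: 0 for t < 0 and 1 for t >= 0."""
    return 1.0 if t >= 0 else 0.0


def heaviside_network(
    x: Sequence[float],
    w: Sequence[float],
    v: Sequence[Sequence[float]],
    b: Sequence[float],
) -> float:
    """Evaluate sum_i w[i] * heaviside(v[i] . x + b[i]) at the input x.

    Hypothetical helper: w are output weights, v input weight vectors,
    b biases, one entry of each per hidden unit.
    """
    if not (len(w) == len(v) == len(b)):
        raise ValueError("w, v and b need one entry per hidden unit")
    total = 0.0
    for w_i, v_i, b_i in zip(w, v, b):
        pre_activation = sum(vc * xc for vc, xc in zip(v_i, x)) + b_i
        total += w_i * heaviside(pre_activation)
    return total


# Illustrative parameters (assumed): d = 2 inputs, n = 2 hidden units.
# Unit 1 is the indicator of {x1 >= 0.5}, unit 2 of {x2 >= 0.5}.
W = [2.0, -1.0]
V = [[1.0, 0.0], [0.0, 1.0]]
B = [-0.5, -0.5]

if __name__ == "__main__":
    for point in [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]:
        print(point, heaviside_network(point, W, V, B))
```

With these assumed weights the network is piecewise constant on the four quadrants of the unit square cut out by the two half-space boundaries, which is the structure the proof of Theorem 3.1 exploits.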

The set of all such functions is the set of all linear combinations of $n$ elements of $H_d$ and is denoted by $\mathrm{span}_n H_d$.

It is known that for all positive integers $d$, $\bigcup_{n \in \mathbb{N}_+} \mathrm{span}_n H_d$ (where $\mathbb{N}_+$ denotes the set of all positive integers) is dense in $(C([0,1]^d), \|\cdot\|_C)$, the linear space of all continuous functions on $[0,1]^d$ with the supremum norm, as well as in $(L_p([0,1]^d), \|\cdot\|_p)$ with $p \in [1, \infty)$ (see, e.g., Mhaskar and Micchelli [10] or Leshno et al. [9]). We study best approximation in $\mathrm{span}_n H_d$ for a fixed $n$.

3 Existence of a best approximation

Existence of a best approximation has been formalized in approximation theory by the concept of proximinal set (sometimes also called existence set). A subset $M$ of a normed linear space $(X, \|\cdot\|)$ is called proximinal if for every $f \in X$ the distance $\|f - M\| = \inf_{g \in M} \|f - g\|$ is achieved for some element of $M$, i.e., $\|f - M\| = \min_{g \in M} \|f - g\|$ (Singer [13]). Clearly a proximinal subset must be closed.

A sufficient condition for proximinality of a subset $M$ of a normed linear space $(X, \|\cdot\|)$ is compactness (i.e., each sequence of elements of $M$ has a subsequence convergent to an element of $M$). Indeed, for each $f \in X$ the functional $e_{\{f\}} : M \to \mathbb{R}$ defined by $e_{\{f\}}(m) = \|m - f\|$ is continuous [13, p. 391] and hence must achieve its minimum on any compact set $M$.

Gurvits and Koiran [5] have shown that for all positive integers $d$ the set of characteristic functions of half-spaces $H_d$ is compact in $(L_p([0,1]^d), \|\cdot\|_p)$ with $p \in [1, \infty)$. This can be easily verified once the set $H_d$ is reparametrized by elements of the unit sphere $S^d$ in $\mathbb{R}^{d+1}$. Indeed, a function $\vartheta(v \cdot x + b)$, with the vector $(v_1, \dots, v_d, b) \in \mathbb{R}^{d+1}$ nonzero, is equal to $\vartheta(\hat{v} \cdot x + \hat{b})$, where $(\hat{v}_1, \dots, \hat{v}_d, \hat{b}) \in S^d$ is obtained from $(v_1, \dots, v_d, b) \in \mathbb{R}^{d+1}$ by normalization. Strictly speaking, $H_d$ is parametrized by equivalence classes in $S^d$, since different parametrizations may represent the same member of $H_d$ when restricted to $[0,1]^d$. Since $S^d$ is compact, and the quotient space formed by the equivalence classes is likewise, so is $H_d$.
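The reparametrization step can be checked numerically. The following is a minimal sketch under assumed names (`indicator`, `normalize`) and an assumed sample half-space; it only illustrates that dividing $(v, b)$ by its Euclidean norm in $\mathbb{R}^{d+1}$ leaves the half-space indicator unchanged, because a positive scaling does not change the sign of $v \cdot x + b$.

```python
import math
from typing import List, Sequence, Tuple


def indicator(x: Sequence[float], v: Sequence[float], b: float) -> float:
    """Characteristic function of the closed half-space {x : v . x + b >= 0}."""
    return 1.0 if sum(vi * xi for vi, xi in zip(v, x)) + b >= 0 else 0.0


def normalize(v: Sequence[float], b: float) -> Tuple[List[float], float]:
    """Scale the nonzero vector (v_1, ..., v_d, b) onto the unit sphere S^d."""
    norm = math.sqrt(sum(vi * vi for vi in v) + b * b)
    if norm == 0.0:
        raise ValueError("(v, b) must be nonzero")
    return [vi / norm for vi in v], b / norm


# Assumed sample parameters; the bias is chosen so that the boundary
# hyperplane stays away from the grid points used below.
v, b = [3.0, -4.0], 0.55
v_hat, b_hat = normalize(v, b)

# Compare the two parametrizations on an 11 x 11 grid in [0, 1]^2.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
agree = all(indicator(x, v, b) == indicator(x, v_hat, b_hat) for x in grid)
unit_norm_error = abs(sum(c * c for c in v_hat) + b_hat * b_hat - 1.0)

if __name__ == "__main__":
    print("same indicator on the grid:", agree)
    print("deviation of |(v_hat, b_hat)|^2 from 1:", unit_norm_error)
```

In floating point the two indicators could in principle disagree at points lying exactly on the boundary hyperplane, which is why the sample bias keeps the boundary off the grid.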
However, by extending $H_d$ into $\mathrm{span}_n H_d$ for any positive integer $n$ we lose compactness, since the norms are not bounded. Nevertheless, compactness can be replaced by a weaker property that requires only some sequences to have convergent subsequences. A subset $M$ of a normed linear space $(X, \|\cdot\|)$ is called approximatively compact if for each $f \in X$ and any sequence $\{g_i : i \in \mathbb{N}_+\}$ in $M$ such that $\lim_{i \to \infty} \|f - g_i\| = \|f - M\|$, there exists $g \in M$ such that $\{g_i : i \in \mathbb{N}_+\}$ converges subsequentially to $g$ [13, p. 368].

The following theorem shows that $\mathrm{span}_n H_d$ is approximatively compact in $L_p$-spaces. It extends a weaker result by Kůrková [8], who showed that $\mathrm{span}_n H_d$ is closed in $L_p$-spaces with $p \in (1, \infty)$.

Theorem 3.1 For every $n$, $d$ positive integers and for every $p \in [1, \infty)$, $\mathrm{span}_n H_d$ is an approximatively compact subset of $(L_p([0,1]^d), \|\cdot\|_p)$.

To prove the theorem we need the following lemma. For a set $A$, $\mathcal{P}(A)$ denotes the set of all subsets of $A$.

Lemma 3.2 Let $m$ be a positive integer, $\{a_{jk} : k \in \mathbb{N}_+,\ j = 1, \dots, m\}$ be $m$ sequences of real numbers, and $\mathcal{S} \subseteq \mathcal{P}(\{1, \dots, m\})$ be such that for each $S \in \mathcal{S}$, $\lim_{k \to \infty} \sum_{j \in S} a_{jk} = c_S$ for some $c_S \in \mathbb{R}$. Then there exist real numbers $\{a_j : j = 1, \dots, m\}$ such that for each $S \in \mathcal{S}$, $\sum_{j \in S} a_j = c_S$.
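A small worked instance may help to see what Lemma 3.2 asserts. The sketch below is an illustration under stated assumptions, not the authors' argument: the sequences `a_jk`, the family `subsets`, and the helper `solve_subset_sums` are all hypothetical. Two of the sequences diverge, yet every listed subset sum converges, and exact Gauss-Jordan elimination then finds constants $a_j$ with the same subset sums.

```python
from fractions import Fraction


def solve_subset_sums(m, subsets, limits):
    """Return [a_1, ..., a_m] with sum_{j in S} a_j = c_S for each S, or None.

    Hypothetical helper: Gauss-Jordan elimination over exact fractions,
    with every free variable set to 0. Indices j run from 1 to m.
    """
    rows = [
        [Fraction(1 if j in S else 0) for j in range(1, m + 1)] + [Fraction(c)]
        for S, c in zip(subsets, limits)
    ]
    pivots = []  # (row index, column index) of each pivot
    r = 0        # next row in which to place a pivot
    for col in range(m):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        scale = rows[r][col]
        rows[r] = [entry / scale for entry in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        pivots.append((r, col))
        r += 1
    # A zero row with nonzero right-hand side means no solution exists.
    if any(all(e == 0 for e in row[:m]) and row[m] != 0 for row in rows):
        return None
    a = [Fraction(0)] * m
    for row_index, col in pivots:
        a[col] = rows[row_index][m]
    return a


def a_jk(j, k):
    """Assumed example sequences: a_1k = k, a_2k = 1 - k, a_3k = 2 + 1/k."""
    return {1: Fraction(k), 2: Fraction(1 - k), 3: 2 + Fraction(1, k)}[j]


subsets = [{1, 2}, {3}, {1, 2, 3}]
limits = [1, 2, 3]  # limits of the subset sums as k grows, computed by hand
solution = solve_subset_sums(3, subsets, limits)

if __name__ == "__main__":
    print([str(value) for value in solution])
```

Here $a_{1k}$ and $a_{2k}$ are unbounded, so no limits of the individual sequences are available; the lemma only needs the limit vector $(c_S)$ to lie in the range of the linear map sending $(x_1, \dots, x_m)$ to its subset sums.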

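Proof. Let $p = \mathrm{card}\, \mathcal{S}$ and let $\mathcal{S} = \{S_1, \dots, S_p\}$. Define $T : \mathbb{R}^m \to \mathbb{R}^p$ by

$$T(x_1, \dots, x_m) = \Big( \sum_{j \in S_1} x_j, \ \dots, \ \sum_{j \in S_p} x_j \Big).$$

Then $T$ is linear, and hence its range is a subspace of $\mathbb{R}^p$ and so is a closed set. Since $(c_{S_1}, \dots, c_{S_p}) \in \mathrm{cl}\, T(\mathbb{R}^m) = T(\mathbb{R}^m)$, there exists $(a_1, \dots, a_m) \in \mathbb{R}^m$ with $(c_{S_1}, \dots, c_{S_p}) = T(a_1, \dots, a_m)$.

Proof of Theorem 3.1. Let $f \in L_p([0,1]^d)$ and let $\{\sum_{j=1}^n a_{jk} g_{jk} : k \in \mathbb{N}_+\}$ be a sequence of elements of $\mathrm{span}_n H_d$ such that $\lim_{k \to \infty} \|f - \sum_{j=1}^n a_{jk} g_{jk}\|_p = \|f - \mathrm{span}_n H_d\|_p$. Since $H_d$ is compact, by passing to suitable subsequences we can assume that for all $j = 1, \dots, n$, there exist $g_j \in H_d$ such that $\lim_{k \to \infty} g_{jk} = g_j$ (here and in the sequel, we use the notation $\lim_{k \to \infty}$ to mean a limit of a suitable subsequence).

We shall show that there exist real numbers $a_1, \dots, a_n$ such that

$$\|f - \mathrm{span}_n H_d\|_p = \Big\| f - \sum_{j=1}^n a_j g_j \Big\|_p. \qquad (1)$$

Then using (1) we shall show even that $\{\sum_{j=1}^n a_{jk} g_{jk} : k \in \mathbb{N}_+\}$ converges to $\sum_{j=1}^n a_j g_j$ in $\|\cdot\|_p$ subsequentially.

Decompose $\{1, \dots, n\}$ into two disjoint subsets $I$ and $J$ such that $I$ consists of those $j$ for which the sequences $\{a_{jk} : k \in \mathbb{N}_+\}$ have convergent subsequences, and $J$ of those $j$ for which the sequences $\{|a_{jk}| : k \in \mathbb{N}_+\}$ diverge. Again, by passing to suitable subsequences we can assume that for all $j \in I$, $\lim_{k \to \infty} a_{jk} = a_j$. Thus $\{\sum_{j \in I} a_{jk} g_{jk} : k \in \mathbb{N}_+\}$ converges subsequentially to $\sum_{j \in I} a_j g_j$. Set $h = f - \sum_{j \in I} a_j g_j$. Since for all $j \in I$ the chosen subsequences $\{a_{jk} : k \in \mathbb{N}_+\}$ and $\{g_{jk} : k \in \mathbb{N}_+\}$ are bounded, we have

$$\|f - \mathrm{span}_n H_d\|_p = \lim_{k \to \infty} \Big\| f - \sum_{j=1}^n a_{jk} g_{jk} \Big\|_p = \lim_{k \to \infty} \Big\| h - \sum_{j \in J} a_{jk} g_{jk} \Big\|_p.$$

Let $\mathcal{S}$ denote the set of all subsets of $J$. Decompose $\mathcal{S}$ into two disjoint subsets $\mathcal{S}_1$ and $\mathcal{S}_2$ such that $\mathcal{S}_1$ consists of those $S \in \mathcal{S}$ for which, by passage to suitable subsequences, $\lim_{k \to \infty} \sum_{j \in S} a_{jk} = c_S$ for some $c_S \in \mathbb{R}$, and $\mathcal{S}_2$ consists of those $S \in \mathcal{S}$ for which $\lim_{k \to \infty} |\sum_{j \in S} a_{jk}| = \infty$. Note that the empty set is in $\mathcal{S}_1$ with the convention $\sum_{j \in \emptyset} = 0$. Using Lemma 3.2, for all $j \in \bigcup \mathcal{S}_1$, we get $a_j \in \mathbb{R}$ such that for all $S \in \mathcal{S}_1$, $\sum_{j \in S} a_j = c_S$. For $j \in J \setminus \bigcup \mathcal{S}_1$, set $a_j = 0$.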
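Since $\sum_{j=1}^n a_j g_j \in \mathrm{span}_n H_d$, we have $\|f - \mathrm{span}_n H_d\|_p \le \|f - \sum_{j=1}^n a_j g_j\|_p$, and thus to prove (1) it is sufficient to show that $\|f - \mathrm{span}_n H_d\|_p \ge \|f - \sum_{j=1}^n a_j g_j\|_p$, or equivalently

$$\lim_{k \to \infty} \int_{[0,1]^d} \Big| h - \sum_{j \in J} a_{jk} g_{jk} \Big|^p d\mu \ \ge\ \int_{[0,1]^d} \Big| h - \sum_{j \in J} a_j g_j \Big|^p d\mu, \qquad (2)$$

where $\mu$ is Lebesgue measure on $[0,1]^d$.

To verify (2), for each $k \in \mathbb{N}_+$ we shall decompose the integration over $[0,1]^d$ into a sum of integrals over convex regions where the functions $\sum_{j \in J} a_{jk} g_{jk}$ are constant. To describe such regions, we shall define partitions of $[0,1]^d$ determined by the families of characteristic functions $\{g_{jk} : j \in J,\ k \in \mathbb{N}_+\}$ and $\{g_j : j \in J\}$. The partitions are indexed by the elements of the set $\mathcal{S}$ of all subsets of $J$. For $k \in \mathbb{N}_+$, a partition $\{T_k(S) : S \in \mathcal{S}\}$ is defined by

$$T_k(S) = \{ x \in [0,1]^d : (g_{jk}(x) = 1 \Leftrightarrow j \in S) \},$$

and similarly a partition $\{T(S) : S \in \mathcal{S}\}$ is defined by

$$T(S) = \{ x \in [0,1]^d : (g_j(x) = 1 \Leftrightarrow j \in S) \}.$$

Notice that since for all $j = 1, \dots, n$, $\lim_{k \to \infty} g_{jk} = g_j$ in $L_p([0,1]^d)$, we have $\lim_{k \to \infty} \mu(T_k(S)) = \mu(T(S))$ for all $S \in \mathcal{S}$. Indeed, the characteristic function of $T_k(S)$ equals the product $\prod_{j \in S} g_{jk} \prod_{j \in J \setminus S} (1 - g_{jk})$ and converges in $L_p([0,1]^d)$ to the characteristic function of $T(S)$, the latter equal to $\prod_{j \in S} g_j \prod_{j \in J \setminus S} (1 - g_j)$.

Using the definition of $T_k(S)$ (in particular its property guaranteeing that for all $S \in \mathcal{S}$, $T_k(S)$ is just the region where for all $j \in S$ and no other $j \in J$, $g_{jk}$ is equal to 1), we get

$$\lim_{k \to \infty} \int_{[0,1]^d} \Big| h - \sum_{j \in J} a_{jk} g_{jk} \Big|^p d\mu = \lim_{k \to \infty} \sum_{S \in \mathcal{S}} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu$$
$$= \lim_{k \to \infty} \sum_{S \in \mathcal{S}_1} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu + \lim_{k \to \infty} \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu \ \ge\ \lim_{k \to \infty} \sum_{S \in \mathcal{S}_1} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu. \qquad (3)$$

Since for all $S \in \mathcal{S}$, $\lim_{k \to \infty} \mu(T_k(S)) = \mu(T(S))$, and for all $S \in \mathcal{S}_1$, $\lim_{k \to \infty} \sum_{j \in S} a_{jk} = c_S = \sum_{j \in S} a_j$, we have

$$\lim_{k \to \infty} \sum_{S \in \mathcal{S}_1} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu = \lim_{k \to \infty} \sum_{S \in \mathcal{S}_1} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_j \Big|^p d\mu = \sum_{S \in \mathcal{S}_1} \int_{T(S)} \Big| h - \sum_{j \in S} a_j \Big|^p d\mu.$$

For all $S \in \mathcal{S}$, by the triangle inequality in $L_p(T_k(S))$,

$$\lim_{k \to \infty} \Big( \int_{T_k(S)} \Big| \sum_{j \in S} a_{jk} \Big|^p d\mu \Big)^{1/p} \le \lim_{k \to \infty} \Big( \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu \Big)^{1/p} + \Big( \int_{T_k(S)} |h|^p d\mu \Big)^{1/p}$$
$$\le \lim_{k \to \infty} \Big( \int_{[0,1]^d} \Big| h - \sum_{j \in J} a_{jk} g_{jk} \Big|^p d\mu \Big)^{1/p} + \Big( \int_{[0,1]^d} |h|^p d\mu \Big)^{1/p} = \|f - \mathrm{span}_n H_d\|_p + \|h\|_p.$$

Thus for all $S \in \mathcal{S}$, $\lim_{k \to \infty} \int_{T_k(S)} |\sum_{j \in S} a_{jk}|^p d\mu$ is finite. In particular this is true when $S \in \mathcal{S}_2$, for which $\lim_{k \to \infty} |\sum_{j \in S} a_{jk}| = \infty$, and so $\lim_{k \to \infty} \mu(T_k(S)) = 0 = \mu(T(S))$ for $S \in \mathcal{S}_2$. Thus we can replace the integration over $\bigcup_{S \in \mathcal{S}_1} T(S)$ by the integration over the whole of $[0,1]^d$, and so we obtain

$$\sum_{S \in \mathcal{S}_1} \int_{T(S)} \Big| h - \sum_{j \in S} a_j \Big|^p d\mu = \int_{[0,1]^d} \Big| h - \sum_{j \in J} a_j g_j \Big|^p d\mu,$$

which proves (2). Moreover, as a byproduct we even get that

$$\lim_{k \to \infty} \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu = 0, \qquad (4)$$

since in (3) the left hand side is equal to the right hand side (both are equal to $\|f - \mathrm{span}_n H_d\|_p^p$). So we have shown that $\mathrm{span}_n H_d$ is proximinal.

Now we shall verify that it is even approximatively compact by showing that $\{\sum_{j \in J} a_{jk} g_{jk} : k \in \mathbb{N}_+\}$ converges subsequentially to $\sum_{j \in J} a_j g_j$, or equivalently

$$\lim_{k \to \infty} \int_{[0,1]^d} \Big| \sum_{j \in J} (a_{jk} g_{jk} - a_j g_j) \Big|^p d\mu = 0. \qquad (5)$$

As above, we start by decomposing the integration into a sum of integrals over convex regions. The left hand side of (5) is equal to

$$\lim_{k \to \infty} \sum_{S \in \mathcal{S}_1} \int_{T_k(S)} \Big| \sum_{j \in S} a_{jk} - \sum_{j \in J} a_j g_j \Big|^p d\mu + \lim_{k \to \infty} \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} \Big| \sum_{j \in S} a_{jk} - \sum_{j \in J} a_j g_j \Big|^p d\mu.$$

Using the triangle inequality, (4), and $\lim_{k \to \infty} \mu(T_k(S)) = 0$ for all $S \in \mathcal{S}_2$, we get

$$\lim_{k \to \infty} \Big( \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} \Big| \sum_{j \in S} a_{jk} - \sum_{j \in J} a_j g_j \Big|^p d\mu \Big)^{1/p} \le \lim_{k \to \infty} \Big( \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} \Big| h - \sum_{j \in S} a_{jk} \Big|^p d\mu \Big)^{1/p} + \lim_{k \to \infty} \Big( \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} \Big| h - \sum_{j \in J} a_j g_j \Big|^p d\mu \Big)^{1/p}$$
$$= \lim_{k \to \infty} \Big( \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} \Big| h - \sum_{j \in J} a_j g_j \Big|^p d\mu \Big)^{1/p} = \Big( \sum_{S \in \mathcal{S}_2} \int_{T(S)} \Big| h - \sum_{j \in J} a_j g_j \Big|^p d\mu \Big)^{1/p} = 0,$$

since $\mu(T(S)) = 0$ for $S \in \mathcal{S}_2$. Thus $\lim_{k \to \infty} \sum_{S \in \mathcal{S}_2} \int_{T_k(S)} |\sum_{j \in S} a_{jk} - \sum_{j \in J} a_j g_j|^p d\mu = 0$, which implies that the left hand side of (5) is equal to

$$\lim_{k \to \infty} \sum_{S \in \mathcal{S}_1} \int_{T_k(S)} \Big| \sum_{j \in S} a_{jk} - \sum_{j \in J} a_j g_j \Big|^p d\mu = \sum_{S \in \mathcal{S}_1} \int_{T(S)} \Big| \sum_{j \in S} a_j g_j - \sum_{j \in S} a_j g_j \Big|^p d\mu = 0,$$

because $(\sum_{j \in S} a_{jk} g_{jk}) \chi_{T_k(S)} = (\sum_{j \in S} a_{jk}) \chi_{T_k(S)}$ converges to $c_S \chi_{T(S)} = (\sum_{j \in S} a_j g_j) \chi_{T(S)}$ in $L_p([0,1]^d)$.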

So $\lim_{k \to \infty} \sum_{j \in J} a_{jk} g_{jk} = \sum_{j \in J} a_j g_j$, the same is already known to be true when $J$ is replaced by $I$, and hence also $\lim_{k \to \infty} \sum_{j=1}^n a_{jk} g_{jk} = \sum_{j=1}^n a_j g_j$ subsequentially in $L_p([0,1]^d)$.

Theorem 3.1 shows that a function in $L_p([0,1]^d)$ has a best approximation among functions computable by one-hidden-layer networks with a single linear output unit and $n$ Heaviside perceptrons in the hidden layer. In other words, in the space of parameters of networks of this type, there exists a global minimum of the error functional defined as the $L_p$-distance from the function to be approximated.

Combining Theorem 3.1 with results from [7], we get the following corollary.

Corollary 3.3 In $(L_p([0,1]^d), \|\cdot\|_p)$ with $p \in (1, \infty)$, for all $n$, $d$ positive integers there exists a best approximation mapping from $L_p([0,1]^d)$ to $\mathrm{span}_n H_d$, but no such mapping is continuous.

4 Discussion

In Proposition 3.3 of [1] the authors show that any sequence $\{P_k\}$ in $\mathrm{span}_n H_d$, with the property that $\sup_k \|P_k\|_{L_1(K)} \le 1$ for every compact set $K$ in $\mathbb{R}^d$, has a subsequence converging a.e. in $\mathbb{R}^d$ to a member of $\mathrm{span}_n H_d$. Although the proof techniques in [1] do have some overlap with those used here, the results there are different. A.e. convergence need not imply $L_p$ convergence for $p \in [1, \infty)$: the sequence $P_k = k^{1/p} \chi_{[0, 1/k]}$ in $L_p(\mathbb{R}^1)$ converges a.e. but has no convergent subsequence in the $L_p$-norm. Since this sequence is bounded and has no convergent subsequences, it also illustrates that $\mathrm{span}_n H_d$ is not boundedly compact. Another example of an approximatively compact set that is not boundedly compact is any closed infinite-dimensional subspace of a uniformly convex Banach space.

Theorem 3.1 cannot be extended to perceptron networks with differentiable activation functions, e.g., the logistic sigmoid or hyperbolic tangent. For such functions, the sets $\mathrm{span}_n P_d(\psi)$ (where $P_d(\psi) = \{ f : [0,1]^d \to \mathbb{R} : f(x) = \psi(v \cdot x + b),\ v \in \mathbb{R}^d,\ b \in \mathbb{R} \}$) are not closed and hence cannot be proximinal. This was first observed by Girosi and Poggio [4] and later exploited by Leshno et al. [9] for a proof of the universal approximation property.

Theorem 3.1 does not offer any information on the error of the best approximation. Estimates in the literature (DeVore, Howard, and Micchelli [3], Pinkus [11], Pinkus [12]) that give lower bounds on such errors and depend on continuity of best approximation operators are not applicable because of Corollary 3.3.

Acknowledgment

V. Kůrková was partially supported by GA ČR grants 201/99/0092 and 201/02/0428. Collaboration of V. Kůrková and P. C. Kainen was supported in part by an NRC COBASE grant.

References

[1] C. K. Chui, X. Li, and H. N. Mhaskar, Neural networks for localized approximation. Math. of Computation 63 (1994), 607-623.

[2] R. Courant and D. Hilbert, Methods of Mathematical Physics, vol. II, Wiley, New York, 1962.
[3] R. DeVore, R. Howard, and C. Micchelli, Optimal nonlinear approximation. Manuscripta Math. 63 (1989), 469-478.
[4] F. Girosi and T. Poggio, Networks and the best approximation property. Biological Cybernetics 63 (1990), 169-176.
[5] L. Gurvits and P. Koiran, Approximation and learning of convex superpositions. J. of Computer and System Sciences 55 (1997), 161-170.
[6] P. C. Kainen, V. Kůrková, and A. Vogt, Approximation by neural networks is not continuous. Neurocomputing 29 (1999), 47-65.
[7] P. C. Kainen, V. Kůrková, and A. Vogt, Geometry and topology of continuous best and near best approximations. J. Approx. Theory 105 (2000), 252-262.
[8] V. Kůrková, Approximation of functions by perceptron networks with bounded number of hidden units. Neural Networks 8 (1995), 745-750.
[9] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation can approximate any function. Neural Networks 6 (1993), 861-867.
[10] H. N. Mhaskar and C. Micchelli, Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Math. 13 (1992), 350-373.
[11] A. Pinkus, n-width in Approximation Theory, Springer-Verlag, Berlin, 1989.
[12] A. Pinkus, Approximation theory of the MLP model in neural networks. Acta Numerica 8 (1999), 143-195.
[13] I. Singer, Best Approximation in Normed Linear Spaces by Elements of Linear Subspaces, Springer-Verlag, Berlin, 1970.