Multilayer feedforward networks are universal approximators

Multilayer feedforward networks are universal approximators. Kurt Hornik, Maxwell Stinchcombe and Halbert White (1989). Presenter: Sonia Todorova

Theoretical properties of multilayer feedforward networks

- Universal approximators: standard multilayer feedforward networks are capable of approximating any measurable function to any desired degree of accuracy.
- There are no theoretical constraints on the success of feedforward networks: any lack of success is attributable to inadequate learning, an insufficient number of hidden units, or the lack of a deterministic relationship between input and target.

Questions left open by these results:
* the rate of convergence as the number of hidden units grows
* the rate of increase of the number of hidden units as the input dimension increases, for a fixed accuracy

Main Theorems: Cybenko, 1989

Sums of the form
$$f(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^T x + \theta_j),$$
where $y_j \in \mathbb{R}^n$ and $\alpha_j, \theta_j \in \mathbb{R}$, are dense in the space of continuous functions on the unit cube if $\sigma$ is any continuous sigmoidal function.
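The theorem is an existence statement, but the density is easy to probe numerically. A minimal sketch (assuming numpy; the target $g$, the grid, and the randomly drawn inner parameters are illustrative choices, not from the paper) that fits the outer coefficients $\alpha_j$ by least squares:

```python
import numpy as np

# Target: a continuous function on the unit interval (n = 1).
g = lambda x: np.sin(2 * np.pi * x) + 0.5 * x

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # a continuous sigmoidal function

rng = np.random.default_rng(0)
N = 50                                       # number of hidden units
y = rng.normal(scale=10.0, size=N)           # inner weights y_j
theta = rng.uniform(-10.0, 10.0, size=N)     # biases theta_j

x = np.linspace(0.0, 1.0, 200)
H = sigma(np.outer(x, y) + theta)            # H[i, j] = sigma(y_j * x_i + theta_j)

# Fit the outer coefficients alpha_j by least squares.
alpha, *_ = np.linalg.lstsq(H, g(x), rcond=None)
print("max |f - g| on the grid:", np.max(np.abs(H @ alpha - g(x))))
```

Increasing N drives the grid error toward zero, in line with the density claim, though the theorem itself gives no rate.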

Main Theorems: Hornik, 1989

English: Single hidden layer ΣΠ feedforward networks can approximate any measurable function arbitrarily well, regardless of the activation function $\Psi$, the dimension of the input space $r$, and the input space environment $\mu$.

Math: For every squashing function $\Psi$, every $r$, and every probability measure $\mu$ on $(\mathbb{R}^r, B^r)$, both $\Sigma\Pi^r(\Psi)$ and $\Sigma^r(\Psi)$ are uniformly dense on compacta in $C^r$ and $\rho_\mu$-dense in $M^r$.

Here $A^r$ denotes the affine functions on $\mathbb{R}^r$, and the two network classes consist of functions of the form
$$\Sigma^r(G): \quad f(x) = \sum_{j=1}^{q} \beta_j \, G(A_j(x)), \qquad x \in \mathbb{R}^r,\ \beta_j \in \mathbb{R},\ A_j \in A^r,\ q \in \mathbb{N}$$
$$\Sigma\Pi^r(G): \quad f(x) = \sum_{j=1}^{q} \beta_j \prod_{k=1}^{l_j} G(A_{jk}(x)), \qquad x \in \mathbb{R}^r,\ \beta_j \in \mathbb{R},\ A_{jk} \in A^r,\ l_j \in \mathbb{N}$$
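As a concrete reading of these definitions, here is a minimal sketch (assuming numpy; all parameter values are illustrative) that evaluates a ΣΠ network $\sum_j \beta_j \prod_k G(A_{jk}(x))$:

```python
import numpy as np

def sigma_pi(x, beta, affines, G):
    """Evaluate f(x) = sum_j beta_j * prod_k G(A_jk(x)),
    where each affine A_jk(x) = w @ x + b is given as a (w, b) pair."""
    return sum(
        b_j * np.prod([G(w @ x + b) for w, b in unit])
        for b_j, unit in zip(beta, affines)
    )

G = lambda z: 1.0 / (1.0 + np.exp(-z))       # any squashing function
beta = [1.5, -0.7]                           # q = 2 units
affines = [                                  # l_1 = 1 and l_2 = 2 factors
    [(np.array([1.0, -2.0]), 0.3)],
    [(np.array([0.5, 0.5]), -1.0), (np.array([-1.0, 2.0]), 0.0)],
]
print(sigma_pi(np.array([0.2, 0.8]), beta, affines, G))   # r = 2 input
```

$\Sigma^r(G)$ is the special case in which every $l_j = 1$, i.e. the ordinary single hidden layer network.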

Def: A function $\Psi : \mathbb{R} \to [0, 1]$ is a squashing function if it is non-decreasing, $\lim_{\lambda \to \infty} \Psi(\lambda) = 1$, and $\lim_{\lambda \to -\infty} \Psi(\lambda) = 0$.

[Figure: plots of three squashing functions $\Psi(\lambda)$ for $\lambda \in [-2, 2]$: the indicator (step) function, the ramp function, and the cosine squasher (Gallant and White, 1988).]
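A minimal sketch (assuming numpy) of the three squashing functions named in the figure; the cosine squasher follows the Gallant and White (1988) form:

```python
import numpy as np

# Each is non-decreasing with limit 0 at -inf and limit 1 at +inf.
def indicator(lam):
    """Step function: 1{lam >= 0}."""
    return (np.asarray(lam) >= 0).astype(float)

def ramp(lam):
    """Clips lam to [0, 1], linear in between."""
    return np.clip(lam, 0.0, 1.0)

def cosine_squasher(lam):
    """Gallant and White (1988): rises from 0 to 1 on [-pi/2, pi/2]."""
    lam = np.asarray(lam, dtype=float)
    mid = 0.5 * (1.0 + np.cos(lam + 1.5 * np.pi))
    return np.where(lam <= -np.pi / 2, 0.0,
                    np.where(lam >= np.pi / 2, 1.0, mid))
```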

Main Theorems

English: There is a single hidden layer feedforward network that approximates any measurable function to any desired degree of accuracy on some compact set K.

Math: For every function $g$ in $M^r$ and every $\epsilon > 0$, there is a compact subset $K$ of $\mathbb{R}^r$ and an $f \in \Sigma^r(\Psi)$ such that $\mu(K) \geq 1 - \epsilon$ and $|f(x) - g(x)| < \epsilon$ for every $x \in K$, regardless of $\Psi$, $r$, or $\mu$.

Main Theorems

English: Any function can be matched exactly on a finite set of points by a network with a single hidden layer.

Math: Let $\{x_1, \ldots, x_n\}$ be a set of distinct points in $\mathbb{R}^r$ and let $g : \mathbb{R}^r \to \mathbb{R}$ be an arbitrary function. If $\Psi$ achieves 0 and 1, then there is a function $f \in \Sigma^r(\Psi)$ with $n$ hidden units such that $f(x_i) = g(x_i)$ for all $i$.

Main Theorems

Most results are stated for a one-dimensional output, but they can be extended to multi-output networks.

Proofs

Stone-Weierstrass Theorem: Let $A$ be an algebra of real continuous functions on a compact set $K$. If $A$ separates points in $K$ and if $A$ vanishes at no point of $K$, then the uniform closure $B$ of $A$ consists of all real continuous functions on $K$ (i.e., $A$ is $\rho_K$-dense in the space of real continuous functions on $K$).

Namely, pick a set $\{f_1, f_2, \ldots\}$ of real continuous functions defined on a compact set $K$, and add all $f_j + f_k$, $f_j \cdot f_k$, and $c f_j$ to form $A$. If for any $x \neq y$ there exists an $f \in A$ such that $f(x) \neq f(y)$, and for every $x$ there exists an $f \in A$ such that $f(x) \neq 0$, then for any given real continuous function on $K$, $A$ contains a function which approximates it arbitrarily well. The distance is measured as $\rho_K(f, g) = \sup_{x \in K} |f(x) - g(x)|$.

See: An Elementary Proof of the Stone-Weierstrass Theorem, Brosowski and Deutsch (1981)
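The canonical instance is the algebra of polynomials on an interval: it separates points via $f(x) = x$ and vanishes nowhere thanks to the constants, so it is $\rho_K$-dense in $C(K)$. A minimal numerical sketch (assuming numpy; the target function is an illustrative choice) of the sup-norm distance shrinking with degree:

```python
import numpy as np

# Polynomials on K = [0, 1] form an algebra that separates points
# (f(x) = x) and vanishes nowhere (constants), so by Stone-Weierstrass
# they are rho_K-dense in C(K).
K = np.linspace(0.0, 1.0, 1000)
target = np.exp(np.sin(3 * K))

for deg in (2, 5, 10):
    coeffs = np.polynomial.chebyshev.chebfit(K, target, deg)
    approx = np.polynomial.chebyshev.chebval(K, coeffs)
    print(deg, "rho_K distance:", np.max(np.abs(target - approx)))
```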

Proof of Theorem 2.1

Let $K \subset \mathbb{R}^r$ be any compact set. For any continuous non-constant $G$, $\Sigma\Pi^r(G)$ is an algebra on $K$, because any sum and any product of two elements is again of the same form, as are scalar multiples.

To show that $\Sigma\Pi^r(G)$ separates points on $K$: for any $x \neq y$, pick an affine $A$ such that $A(x) = a$ and $A(y) = b$ for two constants $a \neq b$ with $G(a) \neq G(b)$. Then $G(A(x)) \neq G(A(y))$. This is why it is important that $G$ be non-constant.

To show that $\Sigma\Pi^r(G)$ vanishes nowhere on $K$: pick $b \in \mathbb{R}$ such that $G(b) \neq 0$ and let $A(x) = 0 \cdot x + b$. Then $G(A(x)) = G(b) \neq 0$ for all $x \in K$.

By the Stone-Weierstrass Theorem, $\Sigma\Pi^r(G)$ is $\rho_K$-dense in the space of real continuous functions on $K$.

Lemma A1: For any finite measure $\mu$, $C^r$ is $\rho_\mu$-dense in $M^r$.

Recall: $\rho_\mu(f, g) = \inf\{\epsilon > 0 : \mu\{x : |f(x) - g(x)| > \epsilon\} < \epsilon\}$.

Proof: Pick an arbitrary $f \in M^r$ and $\epsilon > 0$. We need to show that there is a $g \in C^r$ such that $\rho_\mu(f, g) < \epsilon$. For a sufficiently large number $D$, the truncation satisfies $\int \min\{|f \cdot 1_{\{|f| < D\}} - f|, 1\}\, d\mu < \epsilon/2$. Since bounded measurable functions can be approximated in $L^1(\mu)$ by continuous ones (Lusin's theorem), there exists a continuous $g$ such that $\int |f \cdot 1_{\{|f| < D\}} - g|\, d\mu < \epsilon/2$. Thus $\int \min\{|f - g|, 1\}\, d\mu < \epsilon$; since $\mu\{x : |f - g| > \epsilon\} \leq \epsilon^{-1} \int \min\{|f - g|, 1\}\, d\mu$ for $\epsilon \leq 1$, running the same argument with $\epsilon^2$ in place of $\epsilon$ gives $\rho_\mu(f, g) < \epsilon$.

Theorem 2.5: Exact matching of functions on a finite set, proved for $r = 1$

Let $\{x_1, \ldots, x_n\}$ be a set of distinct points in $\mathbb{R}^r$ and let $g : \mathbb{R}^r \to \mathbb{R}$ be an arbitrary function. If $\Psi$ achieves 0 and 1, then there is a function $f \in \Sigma^r(\Psi)$ with $n$ hidden units such that $f(x_i) = g(x_i)$ for all $i$.

Proof (for $r = 1$): Let $\{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^1$, relabeled so that $x_1 < x_2 < \ldots < x_n$. Pick $M > 0$ such that $\Psi(-M) = 0$ and $\Psi(M) = 1$. Define $A_1$ as the constant affine function $A_1(x) = M$ and set $\beta_1 = g(x_1)$, so that $f_1(x) = \beta_1 \Psi(A_1(x)) = g(x_1)$. Inductively define the affine $A_k$ by $A_k(x_{k-1}) = -M$ and $A_k(x_k) = M$, and set $\beta_k = g(x_k) - g(x_{k-1})$. Let $f_k(x) = \sum_{j=1}^{k} \beta_j \Psi(A_j(x))$. Since $\Psi(A_k(x_i)) = 0$ for $i < k$ and $\Psi(A_k(x_i)) = 1$ for $i \geq k$, we get $f_k(x_i) = g(x_i)$ for all $i \leq k$. The desired function is $f_n$.
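The construction is directly implementable. A minimal sketch (assuming numpy; `interpolate_exact` is a hypothetical helper name, and the step squashing function and test points are illustrative) of the $r = 1$ construction:

```python
import numpy as np

def interpolate_exact(xs, g, squash, M=50.0):
    """Build f(x) = sum_j beta_j * squash(a_j * x + b_j) with n hidden
    units matching g at the n distinct points xs (Theorem 2.5, r = 1).

    Assumes squash(-M) == 0 and squash(M) == 1 (exact for a step
    function; a steep sigmoid satisfies it only approximately).
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    gx = g(xs)
    n = len(xs)
    a, b, beta = np.empty(n), np.empty(n), np.empty(n)
    a[0], b[0], beta[0] = 0.0, M, gx[0]      # A_1(x) = M, constant
    for k in range(1, n):
        # Increasing affine map with A_k(x_{k-1}) = -M and A_k(x_k) = M.
        a[k] = 2 * M / (xs[k] - xs[k - 1])
        b[k] = M - a[k] * xs[k]
        beta[k] = gx[k] - gx[k - 1]          # step up to the next value
    return lambda x: squash(a * np.asarray(x, dtype=float)[..., None] + b) @ beta

step = lambda z: (z >= 0).astype(float)      # a squashing function achieving 0 and 1
xs = np.array([0.0, 0.7, 1.3, 2.0])
f = interpolate_exact(xs, np.sin, step)
print(f(xs) - np.sin(xs))                    # ~ all zeros: exact match
```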

Universal approximation bounds for superpositions of sigmoidal functions (Barron, 1993)

For an artificial neural network with one layer of $n$ sigmoidal nodes, the integrated squared error of approximation, integrating over a bounded subset of $d$ variables, is bounded by $c_f / n$, where $c_f$ depends on a norm of the Fourier transform of the function being approximated. This rate of approximation is achieved under growth constraints on the magnitudes of the parameters of the network. The optimization of a network to achieve these bounds may proceed one node at a time. Because of the economy in the number of parameters, of order $nd$ instead of $n^d$, these approximation rates permit satisfactory estimation of functions using artificial neural networks even in moderately high-dimensional problems.
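A minimal sketch (assuming numpy) of the one-node-at-a-time idea: each round fits a single new sigmoidal node to the current residual. The random candidate search stands in for the inner optimization, and the target and data are illustrative choices, not from Barron's paper:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda z: np.tanh(z)

def greedy_fit(x, target, n_nodes, n_candidates=500):
    """Add sigmoidal nodes one at a time, each fit to the residual."""
    fit, nodes = np.zeros(len(x)), []
    for _ in range(n_nodes):
        resid = target - fit
        best = None
        for _ in range(n_candidates):        # crude inner search
            w = rng.normal(size=x.shape[1]) * 5.0
            b = rng.uniform(-5.0, 5.0)
            h = sigma(x @ w + b)
            alpha = (h @ resid) / (h @ h)    # optimal outer weight for h
            err = np.sum((resid - alpha * h) ** 2)
            if best is None or err < best[0]:
                best = (err, w, b, alpha)
        _, w, b, alpha = best
        fit = fit + alpha * sigma(x @ w + b)
        nodes.append((w, b, alpha))
    return fit, nodes

x = rng.uniform(-1.0, 1.0, size=(400, 2))    # bounded subset, d = 2
target = np.cos(3 * x[:, 0]) * x[:, 1]
for n in (1, 5, 20):
    fit, _ = greedy_fit(x, target, n)
    print(n, "nodes, mean squared error:", np.mean((target - fit) ** 2))
```

The empirical error should fall roughly in line with the $c_f / n$ rate, though this sketch makes no attempt to enforce Barron's growth constraints on the parameters.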