Deep Neural Networks and Partial Differential Equations: Approximation Theory and Structural Properties
Philipp Christian Petersen

Joint work with: Helmut Bölcskei (ETH Zürich), Philipp Grohs (University of Vienna), Joost Opschoor (ETH Zürich), Gitta Kutyniok (TU Berlin), Mones Raslan (TU Berlin), Christoph Schwab (ETH Zürich), Felix Voigtlaender (KU Eichstätt-Ingolstadt)

Today's Goal
Goal of this talk: Discuss the suitability of neural networks as an ansatz system for the solution of PDEs. Two threads:
Approximation theory:
- universal approximation,
- optimal approximation rates for all classical function spaces,
- reduced curse of dimension.
Structural properties:
- non-convex, non-closed ansatz spaces,
- parametrization not stable,
- very hard to optimize over.

Outline
- Neural networks: introduction to neural networks; approaches to solve PDEs
- Approximation theory of neural networks: classical results; optimality; high-dimensional approximation
- Structural results: convexity; closedness; stable parametrization

Neural networks
We consider neural networks as a special kind of function:
- $d = N_0 \in \mathbb{N}$: input dimension,
- $L$: number of layers,
- $\varrho : \mathbb{R} \to \mathbb{R}$: activation function,
- $T_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$, $l = 1, \dots, L$: affine-linear maps.
Then $\Phi_\varrho : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
$$\Phi_\varrho(x) = T_L(\varrho(T_{L-1}(\varrho(\dots \varrho(T_1(x)))))), \quad x \in \mathbb{R}^d,$$
is called a neural network (NN). The sequence $(d, N_1, \dots, N_L)$ is called the architecture of $\Phi_\varrho$.
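
To make the definition concrete, here is a minimal NumPy sketch of this composition of affine maps and a pointwise activation. The layer sizes and the choice of ReLU are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def relu(x):
    # Pointwise activation rho(x) = max(x, 0)
    return np.maximum(x, 0.0)

def neural_network(x, weights, biases, rho=relu):
    """Evaluate Phi_rho(x) = T_L(rho(T_{L-1}(... rho(T_1(x))))).

    weights[l], biases[l] define the affine map T_{l+1}(y) = A y + b.
    The activation is applied after every layer except the last one.
    """
    y = x
    for l, (A, b) in enumerate(zip(weights, biases)):
        y = A @ y + b                      # affine-linear map T_l
        if l < len(weights) - 1:           # no activation after T_L
            y = rho(y)
    return y

# Architecture (d, N_1, N_2, N_3) = (2, 8, 8, 1), chosen only for illustration.
rng = np.random.default_rng(0)
dims = [2, 8, 8, 1]
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]

print(neural_network(np.array([0.5, -1.0]), weights, biases))
```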

Why are neural networks interesting? - I
Deep Learning: Deep learning describes a variety of techniques based on data-driven adaptation of the affine-linear maps in a neural network.
Overwhelming success: image classification [Ren, He, Girshick, Sun; 2015], text understanding, game intelligence. Hardware design of the future!

Why are neural networks interesting? - II
Expressibility: Neural networks constitute a very powerful architecture.
Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999). Let $d \in \mathbb{N}$, $K \subset \mathbb{R}^d$ compact, $f : K \to \mathbb{R}$ continuous, and $\varrho : \mathbb{R} \to \mathbb{R}$ continuous and not a polynomial. Then for every $\varepsilon > 0$ there exists a two-layer NN $\Phi_\varrho$ with $\|f - \Phi_\varrho\|_\infty \leq \varepsilon$.
Efficient expressibility: $\mathbb{R}^M \ni \theta = (T_1, \dots, T_L) \mapsto \Phi_\varrho^\theta$ yields a parametrized system of functions. In a sense, this parametrization is optimally efficient (more on this below).

How can we apply NNs to solve PDEs?
PDE problem: For $D \subset \mathbb{R}^d$, $d \in \mathbb{N}$, find $u$ such that
$$G(x, u(x), \nabla u(x), \nabla^2 u(x)) = 0 \quad \text{for all } x \in D.$$
Approach of [Lagaris, Likas, Fotiadis; 1998]: Let $(x_i)_{i \in I} \subset D$ be collocation points and find a NN $\Phi_\varrho^\theta$ such that
$$G(x_i, \Phi_\varrho^\theta(x_i), \nabla \Phi_\varrho^\theta(x_i), \nabla^2 \Phi_\varrho^\theta(x_i)) = 0 \quad \text{for all } i \in I.$$
Standard optimization methods can be used to find the parameters $\theta$.
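
As an illustration only (not the authors' implementation), here is a minimal PyTorch sketch of this collocation idea for the toy problem u''(x) = -π² sin(πx) on (0,1) with u(0) = u(1) = 0: the PDE residual at random collocation points plus a boundary penalty is minimized with a standard optimizer. The network size, optimizer, penalty weight, and test equation are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)

# Small fully connected network u_theta: R -> R (architecture chosen arbitrarily).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def residual(x):
    # PDE residual G(x, u, u', u'') = u''(x) + pi^2 * sin(pi x) at collocation points x.
    x = x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return d2u + torch.pi**2 * torch.sin(torch.pi * x)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x_col = torch.rand(256, 1)                     # collocation points (x_i) in (0, 1)
x_bdy = torch.tensor([[0.0], [1.0]])           # boundary points

for step in range(5000):
    opt.zero_grad()
    loss = residual(x_col).pow(2).mean() + net(x_bdy).pow(2).mean()
    loss.backward()
    opt.step()

# The trained network should roughly approximate the exact solution u(x) = sin(pi x).
print(net(torch.tensor([[0.5]])).item())
```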

Approaches to solve PDEs - Examples
General framework: Deep Ritz Method [E, Yu; 2017]: NNs as trial functions; SGD naturally replaces quadrature.
High-dimensional PDEs [Sirignano, Spiliopoulos; 2017]: Let $D \subset \mathbb{R}^d$, $d \geq 100$, and find $u$ such that
$$\partial_t u(t, x) + H(u)(t, x) = 0, \quad (t, x) \in [0, T] \times \Omega, \quad \text{+ BC + IC}.$$
As the number of parameters of the NN increases, the minimizer of the associated energy approaches the true solution. No mesh generation required!
[Berner, Grohs, Hornung, Jentzen, von Wurstemberger; 2017]: Phrasing the problem as empirical risk minimization provably avoids the curse of dimension, both in the approximation problem and in the number of samples.
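
To illustrate the Deep Ritz idea (a schematic sketch only, not the implementation of [E, Yu; 2017]): for a Poisson-type problem $-\Delta u = f$ on the unit square with zero boundary values, one minimizes the energy $\int \tfrac{1}{2}|\nabla u|^2 - f u \, dx$, estimated by Monte Carlo sampling, so SGD over random sample batches plays the role of quadrature; the boundary condition is enforced by a penalty. Network size, penalty weight, and right-hand side are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)

net = torch.nn.Sequential(              # trial function u_theta: R^2 -> R
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def f(x):
    # Right-hand side of -Laplace(u) = f, here f = 2*pi^2*sin(pi x1)*sin(pi x2),
    # whose exact solution is u = sin(pi x1)*sin(pi x2).
    return 2 * torch.pi**2 * torch.sin(torch.pi * x[:, :1]) * torch.sin(torch.pi * x[:, 1:])

opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    # Monte Carlo estimate of the Ritz energy: int 0.5*|grad u|^2 - f*u dx.
    x = torch.rand(512, 2, requires_grad=True)
    u = net(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    energy = (0.5 * grad_u.pow(2).sum(dim=1, keepdim=True) - f(x) * u).mean()

    # Penalty enforcing u = 0 on the boundary of the unit square (sampled randomly).
    xb = torch.rand(128, 2)
    side = torch.randint(0, 2, (128,))
    xb[torch.arange(128), side] = torch.randint(0, 2, (128,)).float()
    penalty = net(xb).pow(2).mean()

    opt.zero_grad()
    (energy + 100.0 * penalty).backward()
    opt.step()
```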

How can we apply NNs to solve PDEs?
Deep learning and PDEs: Both approaches above are based on two ideas:
- Neural networks are highly efficient in representing solutions of PDEs, hence the complexity of the problem can be greatly reduced.
- There exist black-box methods from machine learning that solve the optimization problem.
This talk: We will show exactly how efficient the representations are, and raise doubt that the black box can produce reliable results in general.

Approximation theory of neural networks

Complexity of neural networks
Recall: $\Phi_\varrho(x) = T_L(\varrho(T_{L-1}(\varrho(\dots \varrho(T_1(x)))))), \ x \in \mathbb{R}^d$.
Each affine-linear map $T_l$ is defined by a matrix $A_l \in \mathbb{R}^{N_l \times N_{l-1}}$ and a translation $b_l \in \mathbb{R}^{N_l}$ via $T_l(x) = A_l x + b_l$.
The number of weights $W(\Phi_\varrho)$ and the number of neurons $N(\Phi_\varrho)$ are
$$W(\Phi_\varrho) = \sum_{l=1}^{L} \big( \|A_l\|_0 + \|b_l\|_0 \big) \quad \text{and} \quad N(\Phi_\varrho) = \sum_{j=0}^{L} N_j.$$
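
In code these two quantities are simply the number of nonzero entries of the affine maps and the total number of neurons across layers; a small helper, reusing the `weights` and `biases` lists from the sketch above, might look as follows.

```python
import numpy as np

def count_weights(weights, biases):
    # W(Phi) = sum over layers of the nonzero entries of A_l and b_l.
    return sum(int(np.count_nonzero(A)) + int(np.count_nonzero(b))
               for A, b in zip(weights, biases))

def count_neurons(weights):
    # N(Phi) = N_0 + N_1 + ... + N_L, read off from the matrix shapes.
    return weights[0].shape[1] + sum(A.shape[0] for A in weights)

# Example with the architecture (2, 8, 8, 1) used earlier:
rng = np.random.default_rng(0)
dims = [2, 8, 8, 1]
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]
print(count_weights(weights, biases), count_neurons(weights))
```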

Power of the architecture - Exemplary results
Given f from some class of functions, how many weights/neurons does an ε-approximating NN need to have? Not so many...
Theorem (Maiorov, Pinkus; 1999). There exists an activation function $\varrho_{\mathrm{weird}} : \mathbb{R} \to \mathbb{R}$ that is analytic and strictly increasing and satisfies $\lim_{x \to -\infty} \varrho_{\mathrm{weird}}(x) = 0$ and $\lim_{x \to \infty} \varrho_{\mathrm{weird}}(x) = 1$, such that for any $d \in \mathbb{N}$, any $f \in C([0,1]^d)$, and any $\varepsilon > 0$, there is a 3-layer $\varrho_{\mathrm{weird}}$-network $\Phi^{\varrho_{\mathrm{weird}}}_\varepsilon$ with $\|f - \Phi^{\varrho_{\mathrm{weird}}}_\varepsilon\|_{L^\infty} \leq \varepsilon$ and $N(\Phi^{\varrho_{\mathrm{weird}}}_\varepsilon) = 9d + 3$.

Power of the architecture - Exemplary results
- Barron; 1993: Approximation rate for functions with one finite Fourier moment using shallow networks with an activation function that is sigmoidal of order zero.
- Mhaskar; 1993: Let $\varrho$ be sigmoidal of order $k \geq 2$. For $f \in C^s([0,1]^d)$, we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim N(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) = L(d, s, k)$.
- Yarotsky; 2017: For $f \in C^s([0,1]^d)$ and $\varrho(x) = x_+$ (called ReLU), we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim W(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) \sim \log(n)$ (see the sketch below).
- Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer NNs.
- He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs reproduce the approximation rates of h-, p- and hp-FEM.
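
To give a flavour of how deep ReLU networks achieve such rates, here is a small NumPy illustration of the classical construction underlying Yarotsky's result (an illustration of the well-known idea, not code from the paper): compositions of the ReLU hat function $g(x) = 2x$ for $x \le 1/2$, $2(1-x)$ for $x > 1/2$ yield sawtooth functions, and the partial sums $x - \sum_{s=1}^{m} g^{\circ s}(x)/4^s$ approximate $x^2$ on $[0,1]$ with error $4^{-(m+1)}$ using only $O(m)$ layers and weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # g(x) = 2x on [0, 1/2], 2 - 2x on [1/2, 1], written with ReLUs only.
    return 2 * relu(x) - 4 * relu(x - 0.5)

def approx_square(x, m):
    """Yarotsky-style approximation of x^2 on [0, 1]:
    x - sum_{s=1}^{m} g_s(x) / 4^s, where g_s is the s-fold composition of g."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for s in range(1, m + 1):
        g = hat(g)                 # g_s = g o g_{s-1}
        out = out - g / 4.0**s
    return out

x = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 6):
    err = np.max(np.abs(approx_square(x, m) - x**2))
    print(m, err, 4.0**(-m - 1))   # observed error vs. theoretical bound 4^{-(m+1)}
```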

Lower bounds
Optimal approximation rates: Lower bounds on the required network size only exist under additional assumptions (recall the networks based on $\varrho_{\mathrm{weird}}$). Options:
(A) Place restrictions on the activation function (e.g. only consider the ReLU), thereby excluding pathological examples like $\varrho_{\mathrm{weird}}$ (→ VC-dimension bounds).
(B) Place restrictions on the weights (→ information-theoretic bounds, entropy arguments).
(C) Use still other concepts, such as continuous N-widths.

Asymptotic min-max rate distortion
Encoders and decoders: Let $\mathcal{C} \subset L^2(\mathbb{R}^d)$ and $\ell \in \mathbb{N}$:
$$\mathcal{E}^\ell := \big\{ E : \mathcal{C} \to \{0,1\}^\ell \big\}, \qquad \mathcal{D}^\ell := \big\{ D : \{0,1\}^\ell \to L^2(\mathbb{R}^d) \big\}.$$
Min-max code length:
$$L(\varepsilon, \mathcal{C}) := \min\Big\{ \ell \in \mathbb{N} : \exists\, D \in \mathcal{D}^\ell,\ E \in \mathcal{E}^\ell : \sup_{f \in \mathcal{C}} \| D(E(f)) - f \|_2 < \varepsilon \Big\}.$$
Optimal exponent:
$$\gamma^*(\mathcal{C}) := \inf\big\{ \gamma > 0 : L(\varepsilon, \mathcal{C}) = O(\varepsilon^{-\gamma}) \big\}.$$

Asymptotic min-max rate distortion
Theorem (Bölcskei, Grohs, Kutyniok, P.; 2017). Let $\mathcal{C} \subset L^2(\mathbb{R}^d)$ and $\varrho : \mathbb{R} \to \mathbb{R}$. Then, for all $\varepsilon > 0$:
$$\sup_{f \in \mathcal{C}} \ \inf_{\substack{\Phi_\varrho \text{ NN with quantised weights} \\ \|\Phi_\varrho - f\|_2 \leq \varepsilon}} W(\Phi_\varrho) \gtrsim \varepsilon^{-\gamma^*(\mathcal{C})}. \tag{1}$$
Optimal approximation/parametrization: If for $\mathcal{C} \subset L^2(\mathbb{R}^d)$ one also has $\lesssim$ in (1), then NNs approximate this function class optimally.
Versatility: It turns out that NNs achieve optimal approximation rates for many practically used function classes.

Some instances of optimal approximation
- Mhaskar; 1993: Let $\varrho$ be sigmoidal of order $k \geq 2$. For $f \in C^s([0,1]^d)$, we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim N(\Phi^\varrho_n)^{-s/d}$. We have $\gamma^*(\{ f \in C^s([0,1]^d) : \|f\| \leq 1 \}) = d/s$.
- Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer ReLU NNs. Optimal when wavelets are optimal.
- Bölcskei, Grohs, Kutyniok, P.; 2017: Networks yield optimal rates if any affine system does. Example: shearlets for cartoon-like functions.

ReLU approximation
Piecewise smooth functions: $\mathcal{E}^{\beta,d}$ denotes the d-dimensional $C^\beta$-piecewise smooth functions on $[0,1]^d$ with interfaces in $C^\beta$.
Theorem (P., Voigtlaender; 2018). Let $d \in \mathbb{N}$, $\beta \geq 0$, and $\varrho(x) = x_+$. Then
$$\sup_{f \in \mathcal{E}^{\beta,d}} \ \inf_{\substack{\Phi_\varrho \text{ NN with quantised weights} \\ \|\Phi_\varrho - f\|_2 \leq \varepsilon}} W(\Phi_\varrho) \asymp \varepsilon^{-\gamma^*(\mathcal{E}^{\beta,d})} = \varepsilon^{-2(d-1)/\beta}.$$
The optimal depth of the networks is $\sim \beta/d$.

High-dimensional approximation
Curse of dimension: To guarantee approximation with error ε of functions in $\mathcal{E}^{\beta,d}$, one requires networks with $O(\varepsilon^{-2(d-1)/\beta})$ weights.
Symmetries and invariances: Image classifiers are often:
- translation, dilation, and rotation invariant,
- invariant to small deformations,
- invariant to small changes in brightness, contrast, color.

Curse of dimension
Two-step setup: $f = \chi \circ \tau$, where
- $\tau : \mathbb{R}^D \to \mathbb{R}^d$ is a smooth, dimension-reducing feature map,
- $\chi \in \mathcal{E}^{\beta,d}$ performs classification on the low-dimensional space.
Theorem (P., Voigtlaender; 2017). Let $\varrho(x) = x_+$. There are constants $c > 0$, $L \in \mathbb{N}$ such that for any $f = \chi \circ \tau$ and any $\varepsilon \in (0, 1/2)$, there is a NN $\Phi^\varrho_\varepsilon$ with at most $L$ layers and at most $c\, \varepsilon^{-2(d-1)/\beta}$ non-zero weights such that $\|\Phi^\varrho_\varepsilon - f\|_{L^2} < \varepsilon$.
The asymptotic approximation rate depends only on d, not on D.

Compositional functions
Compositional functions [Mhaskar, Poggio; 2016]: High-dimensional functions as a dyadic composition of 2-dimensional functions, e.g. for $x \in \mathbb{R}^8$:
$$x \mapsto h^3_1\Big(h^2_1\big(h^1_1(x_1, x_2),\, h^1_2(x_3, x_4)\big),\ h^2_2\big(h^1_3(x_5, x_6),\, h^1_4(x_7, x_8)\big)\Big).$$
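
As a toy illustration of this binary-tree structure (with an arbitrary choice of two-variable constituent function, reused at every node for brevity, whereas the slide allows a different $h^l_j$ at each node), the tree above can be evaluated as follows.

```python
import numpy as np

# Arbitrary smooth two-variable constituent function, chosen only for illustration.
def h(a, b):
    return np.tanh(a + b)

def compositional(x, h2=h):
    """Evaluate a dyadic (binary-tree) composition of two-variable functions:
    level 1 pairs up the 8 inputs, level 2 pairs the 4 results, level 3 the last 2."""
    values = list(x)
    while len(values) > 1:
        values = [h2(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return values[0]

x = np.arange(1.0, 9.0) / 10.0   # x_1, ..., x_8
print(compositional(x))
```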

Extensions
Approximation with respect to Sobolev norms: ReLU NNs $\Phi$ are Lipschitz continuous. Hence, for $s \in [0,1]$, $p \geq 1$ and $f \in W^{s,p}(\Omega)$, we can measure $\|f - \Phi\|_{W^{s,p}(\Omega)}$. ReLU networks achieve the same approximation rates as h-, p- and hp-FEM [Opschoor, P., Schwab; 2019].
Convolutional neural networks: Direct correspondence between approximation by CNNs (without pooling) and approximation by fully-connected networks [P., Voigtlaender; 2018].

Optimal parametrization
Optimal parametrization:
- Neural networks yield optimal representations of many function classes relevant in PDE applications.
- Approximation is flexible, and the quality improves if low-dimensional structure is present.
PDE discretization:
- Problem complexity is drastically reduced.
- No design of an ansatz system is necessary, since NNs approximate almost every function class well.
Can neural networks really be this good?

The inconvenient structure of neural networks

Fixed-architecture networks
Goal: Fix a space of networks with prescribed shape and understand the associated set of functions.
Fixed-architecture networks: Let $d, L \in \mathbb{N}$, $N_1, \dots, N_{L-1} \in \mathbb{N}$, $\varrho : \mathbb{R} \to \mathbb{R}$. Then we denote by $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ the set of NNs with architecture $(d, N_1, \dots, N_{L-1}, 1)$.
[Figure: network diagram with d = 8, N_1 = N_2 = N_3 = 12, N_4 = 8.]

Back to the basics
Topological properties: Is $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$
- star-shaped?
- convex? approximately convex?
- closed?
Is the map $(T_1, \dots, T_L) \mapsto \Phi$ open?
Implications for optimization: If we do not have the properties above, then we can have terrible local minima, exploding weights, and very slow convergence.

Star-shapedness
Star-shapedness: $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is trivially star-shaped with center 0... but...
Proposition (P., Raslan, Voigtlaender; 2018). Let $d, L, N_1, \dots, N_{L-1} \in \mathbb{N}$ and let $\varrho : \mathbb{R} \to \mathbb{R}$ be locally Lipschitz continuous. Then the number of linearly independent centers of $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is at most $\sum_{l=1}^{L} (N_{l-1} + 1) N_l$, where $N_0 = d$.

Convexity?
Corollary (P., Raslan, Voigtlaender; 2018). Let $d, L, N_1, \dots, N_{L-1} \in \mathbb{N}$, $N_0 = d$, and let $\varrho : \mathbb{R} \to \mathbb{R}$ be locally Lipschitz continuous. If $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ contains more than $\sum_{l=1}^{L} (N_{l-1} + 1) N_l$ linearly independent functions, then $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is not convex.
From translation invariance: If $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ contains only finitely many linearly independent functions, then $\varrho$ is a finite sum of complex exponentials multiplied by polynomials.

Weak convexity?
Weak convexity: $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is almost never convex, but what about $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1) + B_\varepsilon(0)$ for a hopefully small $\varepsilon > 0$?
Theorem (P., Raslan, Voigtlaender; 2018). Let $d, L, N_1, \dots, N_{L-1} \in \mathbb{N}$, $N_0 = d$. For all commonly used activation functions, there does not exist $\varepsilon > 0$ such that $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1) + B_\varepsilon(0)$ is convex.
As a corollary, we also get that $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is usually nowhere dense.

Illustration
The set $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ has very few centers; it is scaling invariant, not approximately convex, and nowhere dense.

Closedness in L^p
Compact weights: If the activation function $\varrho$ is continuous, then a compactness argument shows that the set of networks with parameters from a compact parameter set is closed.
Theorem (P., Raslan, Voigtlaender; 2018). Let $d, L, N_1, \dots, N_{L-1} \in \mathbb{N}$, $N_0 = d$. If $\varrho$ has one of the properties below, then $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is not closed in $L^p$, $p \in (0, \infty)$:
- analytic, bounded, not constant,
- $C^1$ but not $C^\infty$,
- continuous, monotone, bounded, and $\varrho'(x_0)$ exists and is non-zero at at least one point $x_0 \in \mathbb{R}$,
- continuous, monotone, continuously differentiable outside a compact set, and $\lim_{x \to -\infty} \varrho'(x)$, $\lim_{x \to \infty} \varrho'(x)$ exist and do not coincide.

Closedness in L^∞
Theorem (P., Raslan, Voigtlaender; 2018). Let $d, L, N_1, \dots, N_{L-1} \in \mathbb{N}$, $N_0 = d$. If $\varrho$ has one of the properties below, then $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is not closed in $L^\infty$:
- analytic, bounded, not constant,
- $C^1$ but not $C^\infty$,
- $\varrho \in C^p$ and $\varrho(x) - x^p_+$ bounded, for $p \geq 1$.
ReLU: The set of two-layer ReLU NNs is closed in $L^\infty$!

Illustration
For most activation functions $\varrho$ (except the ReLU), the set $\mathcal{NN}_\varrho(d, N_1, \dots, N_{L-1}, 1)$ is star-shaped with center 0, not approximately convex, and not closed.

Stable parametrization
Continuous parametrization: It is not hard to see that if $\varrho$ is continuous, then so is the realization map $R_\varrho : (T_1, \dots, T_L) \mapsto \Phi$.
Quotient map: We can also ask whether $R_\varrho$ is a quotient map, i.e., if $\Phi^1, \Phi^2$ are NNs which are close (w.r.t. $\|\cdot\|_{\sup}$), are there parametrizations $(T^1_1, \dots, T^1_L)$ and $(T^2_1, \dots, T^2_L)$ which are close in some norm with $R_\varrho((T^1_1, \dots, T^1_L)) = \Phi^1$ and $R_\varrho((T^2_1, \dots, T^2_L)) = \Phi^2$?
Proposition (P., Raslan, Voigtlaender; 2018). Let $\varrho$ be Lipschitz continuous and not affine-linear. Then $R_\varrho$ is not a quotient map.

Consequences
No convexity: We want to solve $J(\Phi) = 0$ for an energy $J$ and a NN $\Phi$. Not only may $J$ be non-convex, but so is the set we optimize over. Similar to N-term approximation by dictionaries.
No closedness: Exploding coefficients (if $P_{\mathcal{NN}}(f) \notin \mathcal{NN}$). No low-neuron approximation.
No inverse-stable parametrization: The error term can be very small while the parametrization is far from optimal. Potentially very slow convergence.

Where to go from here?
Different networks: Special types of networks could be more robust. Convolutional neural networks are probably still too large a class [P., Voigtlaender; 2018].
Stronger norms: Stronger norms naturally help with closedness and inverse stability. An example is Sobolev training [Czarnecki, Osindero, Jaderberg, Swirszcz, Pascanu; 2017]. Many arguments in our results break down if the $W^{1,\infty}$ norm is used.

Conclusion
Approximation: NNs are a very powerful approximation tool:
- often an optimally efficient parametrization,
- can overcome the curse of dimension,
- surprisingly efficient black-box optimization.
Topological structure: NNs form an impractical set:
- non-convex,
- non-closed,
- no inverse-stable parametrization.

References:
- H. Andrade-Loarca, G. Kutyniok, O. Öktem, P. Petersen, Extraction of digital wavefront sets using applied harmonic analysis and deep neural networks, arXiv:1901.01388.
- H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen, Optimal approximation with sparsely connected deep neural networks, arXiv:1705.01714.
- J. Opschoor, P. Petersen, Ch. Schwab, Deep ReLU networks and high-order finite element methods, SAM Report, ETH Zürich, 2019.
- P. Petersen, F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Networks, 2018.
- P. Petersen, M. Raslan, F. Voigtlaender, Topological properties of the set of functions generated by neural networks of fixed size, arXiv:1806.08459.
- P. Petersen, F. Voigtlaender, Equivalence of approximation by convolutional neural networks and fully-connected networks, arXiv:1809.00973.

Thank you for your attention!