Lecture 15: Neural Networks Theory

Size: px

Start display at page:

Download "Lecture 15: Neural Networks Theory"

Lenard Cobb
6 years ago
Views:

1 EECS 598-5: Theoretical Foundations of Machine Learning Fall 25 Lecture 5: Neural Networks Theory Lecturer: Matus Telgarsky Scribes: Xintong Wang, Editors: Yuan Zhuang 5. Neural Networks Definition and Overview 5.. Neural Networks Definition Definition 5. (Neural Network). A neural network is a function class defined by a DAG as follows: Input Hidd Input # Input #2 Output Input #3 Note: A neural network is not necessarily fully connected. Input nodes: g i (x) = x i. Internal nodes: g i (x) = σ( g parts of g i wgg(x) i + w). i Typical choices of activation function σ: ReLu: σ R (r) = max{, r}. ReLu 8 6 σ r Sigmoid: σ S (r) = +exp( r). 5-

2 Lecture 5: Neural Networks Theory 5-2 Sigmoid.8.6 σ r Output: Neural networks outputs value at output node and it can be multivariate. N := {fixed DAG with varying weights} = {x f(x; w) : w R p }, where p is the total number of weights and thresholds. Remark: Currt practical neural networks have some variations Theory Overview We have three aspects of theory (for classification specific setting): Approximation/Represtation How well the function class fits the problem? Known: Neural networks can fit arbitrary continuous functions. Unknown: Are good represtations learnable? What should be the size and the number of s of a good represtation? Optimization How well you optimize risk over the function class (on a finite sample)? Little theory in this aspect: training O() size neural network is NP-hard; in practice, we use gradit desct (on a nonconvex function). Estimation Differce in risk betwe sample and distribution. There is a nice VC theory, which is very difficult. It is still unclear what this VC theory means in practice. Remark: In terms of non-classification problems, many successes of neural networks are for unsupervised learning.

3 Lecture 5: Neural Networks Theory Represtation/Approximation 5.2. Neural Network with One Internal Node Suppose we have only one internal(non-input) node in a neural network. With activation function σ(r) = [r ], we can exactly represt the indicator function for a halfspace. Specifically, the following diagram shows the hyperplane {x R d : w, x = w } and a classification achieved by the corresponding halfspace indicator x σ( w, x w ). y w x = w x Now suppose σ is bounded and continuous, with lim r σ(r) = and lim r σ(r) =. Consequtly, there exists M such that r > M, σ(r) [ ɛ, + ɛ] and r < M, σ(r) [ ɛ, ɛ] Thus, giv any τ >, the function f(x) := σ( M τ ( w, x w )) approximates the earlier halfspace indicator in the following sse: { [ ɛ, +ɛ] wh w, x w τ, f(x) [ ɛ, + ɛ] wh w, x w + τ. On the other hand, f is not controlled in any way wh w, x w < τ. This approximate halfspace indicator f is depicted as follows. y 2τ w x w = τ w x

4 Lecture 5: Neural Networks Theory Neural Network with k Halfspaces Next note how a polyhedron P can be approximated by adding another to the preceding construction. Suppose the polyhedron is specified via k halfspaces, and each of these is approximated as before with some τ (, 4k ), giving a function g i. Input Halfspace...k Output Input Output Now consider adding a final node defined as g(x) := σ( M τ ( i g i(x) (k /2))). To see that this approximates an indicator on P, consider any x which does not fall within the error region of width 2τ of any of the preceding approximate halfspace indicators. If this x also fails to land within at least one of the halfspaces, th i g i(x) (k )( + τ) < k /2 τ, thus g(x) [ ɛ, +ɛ].on the other hand, if x is in the intersection of the halfspaces, th i g i(x) k( τ) > k /2 + τ, thus g(x) [ ɛ, + ɛ]. Homework Problem: based on what we have said so far, for any continuous f : [, ] d [, ] and any ɛ >, there exists a 3 (+input ) neural network g with f(x) g(x) dx ɛ. [,] d 5.3 Estimation/Statistics The main results in this section will be VC dimsion bounds which are polynomial in the number of s L. Before giving these bounds, note that it is not clear how to specify a meaningful Rademacher complexity bound which is polynomial in the number of s. For instance, consider a fixed DAG represting the layout of a neural network, and suppose the weights w R p obey a constraint w B (which is natural for instance in the case of linear separators). Th by placing weight B L along each edge of a chain in the network, the constraint w B is obeyed, and it can be se as the Lipschitz constant of the function computed by the network and thus its Rademacher complexity are upper bounded by ( B L )L VC dimsion of Linear Threshold Network(LTN) Notations S: a fixed set of n examples.

5 Lecture 5: Neural Networks Theory 5-5 k: number of non-input nodes in the neural network. d is the total number of input nodes. p: total number of parameters in the neural network. p i is the number of parameters involving node i. p = i p i. Moreover, suppose p n. L: total number of s in the neural network. D i := All output values possible for non-input nodes up to i, meaning D i = {(g d+ (S; w), g d+2 (S; w),..., g d+i (S; w)) : w R p }. This definition is the key idea of the proof: rather than keeping track of the possible outputs of the output node of the network, the outputs of all nodes are considered. It will now be shown by induction that D i i ( ) pj. Consider the base case i = ; by Sauer s Lemma, since n p p, ( ) p D. j= Now consider some i >. Ev though S is fixed, it is no longer the case that (non-input) node i is receiving a fixed set of inputs, since it is pottial taking input from non-input nodes. However, for any fixed outputs of the parts to node i, Sauer-Shelah can once again be used (once again using n p p i ). Combining this with the inductive hypothesis, ( ) pi D i p p j possible inputs to i ) pi D i ( p i i ( ) pj. j= p j To complete the VC dimsion calculation, first note that every distinct output for the network also incremts the value of D k, Π N (n) sup S =n D k, where N dotes this class of networks. Secondly, note that D k does not actually depd on S, the upper bound holds for all S with S n. As such we have, Π N (n) sup D k S =n k ( ) pi i= p i p i k () pi = () p. To finish the calculation, it is possible to use a guess-and-check. In order to show that the VC dimsion is at most d, it suffices to show that Π N (d) < 2 d. As such, suppose n cp ln(p) for some c. By the above calculation, ln(π N (n)) p ln(ecp ln(p)) = p ln(p) + p( + ln(c) + ln(ln(p))). Since this is strictly less than cp ln(p) ln(2) for sufficitly large c, it follows that the VC dimsion is O(p ln(p)). i=

6 Lecture 5: Neural Networks Theory Further VC dimsion results (without proof) The following table summarizes worst case VC dimsion bounds giv various conditions. There are two key points here: first, the linear threshold network really did not gain much power (whereas other networks do), and secondly the choice of σ is esstial. σ r [r ] piecewise polynomial convex for x <, concave for x >, and satisfying limit properties lim r σ(r) = and lim r σ(r) = Worst case VC dimsion With one non-input node (k = ), this is perceptron, thus VC is Θ(p). In geral, the VC dimsion is Õ(p ln(p)); s did not matter much in this example! Ω(pL) and Õ(pL2 ); this also holds in the piecewise affine case (in particular for the popular choice σ(r) = max{, r}). It s possible for VC dimsion to be infinite! (Note however that other members of this class, for instance the sigmoid σ(r) = /(+ exp( r)), have VC dimsion polynomial in p and L.)

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and