Supp Materials: An Analytical Formula of Population Gradient for Two-Layered ReLU Network and its Applications in Convergence and Critical Point Analysis

Yuandong Tian, Facebook AI Research

March 8, 2017

1 Introduction

No theorem is provided.

2 Related Works

No theorem is provided.

3 Problem Definition

No theorem is provided.

4 The Analytical Formula

Here we list detailed proofs of all the theorems.

Figure 1: (a)-(b) The two cases (θ ≥ 0 and θ < 0) in Thm. 1.
4.1 One ReLU Case

Theorem 1 Suppose F(e, w) = Xᵀ D(e) D(w) X w, where e is a unit vector and X = [x_1, x_2, ..., x_N]ᵀ is the N-by-d sample matrix. If x_i ~ N(0, I), then:

    E[F(e, w)] = (N/2π) ((π − θ) w + ‖w‖ sin θ · e)    (1)

where θ ∈ [0, π] is the angle between e and w.

Proof Note that F can be written in the following form:

    F(e, w) = Σ_{i : x_iᵀe ≥ 0, x_iᵀw ≥ 0} x_i x_iᵀ w    (2)

where the x_i are the rows of X. We set up the axes related to e and w as in Fig. 1, while the remaining axes are perpendicular to this plane. We use the orthonormal basis e, e_2 = (w/‖w‖ − e cos θ)/sin θ (plus any set of bases spanning the rest of the space). Under this basis, any vector is written x = [r sin φ, r cos φ, x_3, ..., x_d], and the representations of e and w are [1, 0_{d−1}] and [‖w‖ cos θ, ‖w‖ sin θ, 0_{d−2}]. Note that here θ ∈ (−π, π]: the angle θ is positive when e chases after w, and is otherwise negative.

Now we consider the quantity R(φ_0) = E[(1/N) Σ_{i : φ_i ∈ [0, φ_0]} x_i x_iᵀ]. Taking the expectation in polar coordinates over the first two dimensions, we have:

    R(φ_0) = E[x xᵀ | φ ∈ [0, φ_0]] P[φ ∈ [0, φ_0]]
           = ∫_0^{φ_0} ∫_0^{+∞} ∫ x xᵀ p(r) p(φ) Π_{k=3}^d p(x_k) · r dr dφ dx_3 ... dx_d

where x = [r sin φ, r cos φ, x_3, ..., x_d]ᵀ, p(r) = e^{−r²/2} and p(φ) = 1/2π. Note that R(φ_0) is a d-by-d matrix. The first 2-by-2 block can be computed in closed form (note that ∫_0^{+∞} r² p(r) r dr = 2). Every off-diagonal element outside the first 2-by-2 block is zero by the symmetry of the spherical Gaussian, and every diagonal element outside that block equals P[φ_i ∈ [0, φ_0]] = φ_0/2π. Finally, we have:

    R(φ_0) = (1/4π) [ 2φ_0 − sin 2φ_0    1 − cos 2φ_0      0
                      1 − cos 2φ_0       2φ_0 + sin 2φ_0   0
                      0                  0                 2φ_0 I_{d−2} ]    (3)

           = (φ_0/2π) I_d + (1/4π) [ −sin 2φ_0      1 − cos 2φ_0   0
                                      1 − cos 2φ_0    sin 2φ_0      0
                                      0               0             0 ]    (4)

With this equation we can compute E[F(e, w)]. When θ ≥ 0, the condition {i : x_iᵀe ≥ 0, x_iᵀw ≥ 0} is equivalent to {i : φ_i ∈ [0, π − θ]} (Fig. 1(a)).
Using w = [‖w‖ cos θ, ‖w‖ sin θ, 0_{d−2}]ᵀ, we have:

    E[F(e, w)] = N R(π − θ) w    (5)
               = (N‖w‖/4π) [ 2(π−θ) + sin 2θ   1 − cos 2θ
                             1 − cos 2θ        2(π−θ) − sin 2θ ] [ cos θ
                                                                   sin θ ]    (6)
               = (N/2π) ( (π − θ) w + ‖w‖ [sin θ, 0, ..., 0]ᵀ )    (7)
               = (N/2π) ( (π − θ) w + ‖w‖ sin θ · e )    (8)

For θ < 0, the condition {i : x_iᵀe ≥ 0, x_iᵀw ≥ 0} is equivalent to {i : φ_i ∈ [−θ, π]} (Fig. 1(b)), and similarly we get

    E[F(e, w)] = N (R(π) − R(−θ)) w = (N/2π) ((π + θ) w − ‖w‖ sin θ · e)    (9)

Notice that, by abuse of notation, the θ appearing in Eqn. 1 is the absolute value, and then Eqn. 1 covers both cases.
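As a quick sanity check of Eqn. 1 (our addition, not part of the original proof), the closed form can be compared against a Monte Carlo estimate: sample x_i ~ N(0, I), average x x ᵀ w over the samples where both gates are active, and normalize by the sample count so that N drops out.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, theta = 4, 400_000, 1.0           # theta: angle between e and w
e = np.zeros(d); e[0] = 1.0
w = 2.0 * np.array([np.cos(theta), np.sin(theta)] + [0.0] * (d - 2))

X = rng.standard_normal((n, d))
active = (X @ e >= 0) & (X @ w >= 0)    # both ReLU gates on
mc = X[active].T @ (X[active] @ w) / n  # Monte Carlo estimate of E[x x^T w; gates on]

closed = ((np.pi - theta) * w + np.linalg.norm(w) * np.sin(theta) * e) / (2 * np.pi)
print(np.abs(mc - closed).max())        # small (pure Monte Carlo noise)
```

The discrepancy shrinks like 1/sqrt(n), which is a useful way to spot sign or factor errors in reconstructions of this formula.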
5 Critical Point Analysis

Recall that from Thm. 1, for F(e, w) = Xᵀ D(e) D(w) X w with e a unit vector, X = [x_1, x_2, ..., x_N]ᵀ the N-by-d sample matrix and x_i ~ N(0, I):

    E[F(e, w)] = (N/2π) ((π − θ) w + ‖w‖ sin θ · e)    (10)

where θ ∈ [0, π] is the angle between e and w. The expected gradient of g(x) = Σ_{j=1}^K σ(w_jᵀ x) with respect to w_j is the following:

    E[∇_{w_j} J] = Σ_{j'=1}^K ( E[F(e_j, w_{j'})] − E[F(e_j, w*_{j'})] )    (11)

where e_j = w_j/‖w_j‖. By solving E[∇_{w_j} J] = 0 for j = 1, ..., K, it is possible to identify all critical points of g(x). In general this system is highly nonlinear and cannot be solved efficiently; however, we show that in our particular case it is possible, since the system has interesting properties. First of all, each equation of

    E[∇_{w_j} J] = 0,  j = 1, ..., K    (12)

i.e.

    Σ_{j'=1}^K E[F(e_j, w_{j'})] = Σ_{j'=1}^K E[F(e_j, w*_{j'})],  j = 1, ..., K    (13)

is a linear combination of the w_{j'} and w*_{j'}, but with varying coefficients. We can rewrite the system as:

    E diag(a) + W B = E diag(a*) + W* B*    (14)

where:

    a = sin Θ · w̄,  a* = sin Θ* · w̄*    (15)
    B_{j'j} = π − θ_j^{j'},  B*_{j'j} = π − θ*_j^{j'}    (16)
    E = [e_1, e_2, ..., e_K]    (17)
    W = [w_1, ..., w_K],  W* = [w*_1, ..., w*_K]    (18)

Here θ_j^{j'} = ∠(w_j, w_{j'}) and θ*_j^{j'} = ∠(w_j, w*_{j'}); Θ = [θ_j^{j'}] and Θ* = [θ*_j^{j'}] are K-by-K angle matrices (sin acts entrywise); w̄ = [‖w_1‖, ..., ‖w_K‖]ᵀ and w̄* = [‖w*_1‖, ..., ‖w*_K‖]ᵀ. The common factor N/2π is dropped.

Eqn. 14 already has interesting properties. The first thing we consider is whether a critical point can fall outside the Principal Hyperplane Π, which is the subspace spanned by the ground-truth weight vectors W*. The following shows that critical points outside Π must lie in a manifold:

Lemma 1 If {w_j} is a critical point satisfying Eqn. 14, then for any orthogonal mapping R with R|_Π = Id, {R w_j} is also a critical point.

Proof First of all, since R is an orthogonal transformation, it keeps all angles and magnitudes, so a, a*, B, B*, w̄ and w̄* are invariant.
For simplicity, write the residual of Eqn. 14 as G(W) = E diag(a) + W B − E diag(a*) − W* B*, which vanishes exactly at critical points; by the invariance above, the coefficients of G are unchanged under R. Since R|_Π = Id, we have R W* = W*, and for the transformed weights the unit vectors become R e_j, so:

    G(RW) = RE diag(a) + RW B − RE diag(a*) − W* B* = R G(W) = 0    (19)

Note that for d ≥ K + 2, there always exists R ≠ Id satisfying such a condition, which yields a continuum of critical points. Further, such transformations form a Lie group. Therefore we have:

Theorem 2 If d ≥ K + 2, then any critical point satisfying Eqn. 14 that lies outside Π must lie in a manifold.

The intuition is simple. For any out-of-plane critical point, pick a transformation that satisfies the condition of the lemma and transform it to a different yet infinitesimally close critical point. Such a transformation always exists: in the (d − K)-dimensional orthogonal complement, if the dimension is odd we can always pick a rotation whose fixed axis is not aligned with any of the K weights; if it is even, there is a rotation matrix without a fixed point.
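Lemma 1 can also be checked numerically (our addition): build the population gradient from Thm. 1's formula, pick an orthogonal R that is the identity on Π = span(W*), and verify that the gradients of {R w_j} are exactly the rotated gradients of {w_j}. The helper `pop_grad` below is our own name for Eqn. 11 with the common factor dropped.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 5, 2
Ws = rng.standard_normal((d, K))        # teacher weights w*_j (columns)
W = rng.standard_normal((d, K))         # student weights w_j (columns)

def pop_grad(W, Ws):
    """(2pi/N) * E[grad_{w_j} J] for each j, via Eqn. 1 applied to Eqn. 11."""
    G = np.zeros_like(W)
    for j in range(W.shape[1]):
        e = W[:, j] / np.linalg.norm(W[:, j])
        for V, sgn in ((W, 1.0), (Ws, -1.0)):   # student terms minus teacher terms
            for jp in range(V.shape[1]):
                v = V[:, jp]; nv = np.linalg.norm(v)
                th = np.arccos(np.clip(e @ v / nv, -1.0, 1.0))
                G[:, j] += sgn * ((np.pi - th) * v + nv * np.sin(th) * e)
    return G

# Orthogonal R with R|_span(Ws) = Id: rotate two directions orthogonal to span(Ws).
Q, _ = np.linalg.qr(np.concatenate([Ws, rng.standard_normal((d, d - K))], axis=1))
u1, u2 = Q[:, K], Q[:, K + 1]
c, s = np.cos(0.7), np.sin(0.7)
R = np.eye(d) + (c - 1) * (np.outer(u1, u1) + np.outer(u2, u2)) \
    + s * (np.outer(u2, u1) - np.outer(u1, u2))

print(np.allclose(R @ Ws, Ws))                                 # R fixes the plane
print(np.allclose(pop_grad(R @ W, Ws), R @ pop_grad(W, Ws)))   # gradients rotate with R
```

Both checks print True: since the formula depends only on angles and norms, the residual at {R w_j} is R times the residual at {w_j}, exactly as in the proof.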
5.1 Characteristics within the Principal Plane

We can multiply Eqn. 14 by Eᵀ and turn it into a system that is linear in the magnitudes of the weights. Note that we have:

    EᵀE = cos Θ,  WᵀE = diag(w̄) cos Θ,  (W*)ᵀE = diag(w̄*) cos Θ*    (20)

(with cos acting entrywise). Therefore, Eqn. 14 becomes a homogeneous linear equation in the magnitudes (note that a and a* are linear in the magnitudes).    (21)

In particular, the (i, j) entries of the LHS and RHS of this equality are:

    LHS_ij = cos θ_j^i ( Σ_{k=1}^K sin θ_i^k ‖w_k‖ ) + Σ_{k=1}^K (π − θ_i^k) ‖w_k‖ cos θ_j^k    (22)
    RHS_ij = cos θ_j^i ( Σ_{k=1}^K sin θ*_i^k ‖w*_k‖ ) + Σ_{k=1}^K (π − θ*_i^k) ‖w*_k‖ cos θ*_j^k    (23)

Therefore, the following equation holds:

    M w̄ = M* w̄*    (24)

where M and M* are K²-by-K matrices. The entry m_{ij,k}, i.e. the coefficient of the k-th weight magnitude in the (i, j) entry of Eqn. 24, is defined as:

    m_{ij,k} = (π − θ_i^k) cos θ_j^k + sin θ_i^k cos θ_j^i    (25)
    m*_{ij,k} = (π − θ*_i^k) cos θ*_j^k + sin θ*_i^k cos θ_j^i    (26)

Special case on the diagonal. For a diagonal entry (i, i), cos θ_i^i = 1 and m_{ii,k} = h(θ_i^k), m*_{ii,k} = h(θ*_i^k), where

    h(θ) = (π − θ) cos θ + sin θ    (27)

Therefore, keeping only the diagonal entries, we arrive at the following subset of the constraints to be satisfied at any critical point:

    M_r w̄ = M*_r w̄*    (28)

where M_r = h(Θ) and M*_r = h(Θ*) (entrywise) are both K-by-K matrices. Note that if M_r is full rank, we can solve for w̄ from Eqn. 28 and plug it back into Eqn. 24 to check whether it is indeed a critical point.

Lemma 2 Suppose w̄* ≠ 0 (no trivial ground-truth solutions). If, for a given (Θ, Θ*), there exists a row (say the l-th) of M and M*, namely m_l and m*_l, satisfying

    m*_l − M*_rᵀ M_r^{−1} m_l > 0 (elementwise)  or  m*_l − M*_rᵀ M_r^{−1} m_l < 0 (elementwise)    (29)

then (Θ, Θ*) cannot be a critical point.

Proof Suppose that given (Θ, Θ*) we form M_r and M*_r and compute w̄ using Eqn.
28, then we have

    (w̄*)ᵀ M*_rᵀ M_r^{−1} m_l = (M*_r w̄*)ᵀ M_r^{−1} m_l = (M_r w̄)ᵀ M_r^{−1} m_l = w̄ᵀ m_l    (30)

using Eqn. 28 and the symmetry of M_r. Therefore, from the condition m*_l − M*_rᵀ M_r^{−1} m_l > 0 (elementwise) and w̄* ≥ 0 with w̄* ≠ 0, we have

    (w̄*)ᵀ (m*_l − M*_rᵀ M_r^{−1} m_l) = (w̄*)ᵀ m*_l − w̄ᵀ m_l > 0    (31)

but this contradicts the necessary condition for (Θ, Θ*) to be a critical point: the l-th row of Eqn. 24 requires w̄ᵀ m_l = (w̄*)ᵀ m*_l. Similarly we can prove the other side.

Separation property of Eqn. 29. Note that the k-th elements of both m*_l and M*_rᵀ M_r^{−1} m_l in Eqn. 29 depend only on the k-th true weight vector w*_k (and on all of {w_j}).
For m*_l, this can be seen from Eqn. 26, in which the k-th element only involves the angles θ*^k between w*_k and the {w_j}. For M*_rᵀ M_r^{−1} m_l, notice that the k-th column of M*_r (the k-th row of M*_rᵀ) is only related to w*_k and not to the other ground-truth weight vectors. This separation property makes the analysis much easier, as shown in the case K = 2. Therefore, we can consider the following function of one (rather than K) ground-truth unit weight vector e and all estimated unit vectors {e_l}:

    L_ij(e, {e_l}) = m*_ij(e) − vᵀ M_r^{−1} m_ij    (32)

where v = [h(θ*^1), h(θ*^2), ..., h(θ*^K)]ᵀ with θ*^j = ∠(e, w_j), m_ij = [m_{ij,k}]_k as in Eqn. 25, and m*_ij(e) = (π − θ*^i) cos θ*^j + sin θ*^i cos θ_j^i (like Eqn. 26 with e in place of w*_k). With this notation, the k-th element of m*_l − M*_rᵀ M_r^{−1} m_l equals L_ij(e*_k, {e_l}) for l = (i, j), where e*_k = w*_k/‖w*_k‖.

Proposition 1 L_ij(e, {e_l}) = 0 for any e = e_l, 1 ≤ l ≤ K. In addition, L_ii(e, {e_l}) ≡ 0.

Proof When e = e_l, we have θ*^k = θ_l^k, so v becomes the l-th column of M_r (which is symmetric), and vᵀ M_r^{−1} = ê_lᵀ, the unit vector whose l-th element is 1. Therefore, again with θ*^k = θ_l^k, we have:

    L_ij(e, {e_l}) = m*_ij(e_l) − m_{ij,l} = 0    (33)

For L_ii, by definition m_ii is the i-th column of M_r, so M_r^{−1} m_ii is the unit vector with only the i-th element being 1. Therefore

    vᵀ M_r^{−1} m_ii = h(θ*^i) = m*_ii(e)    (34)

so L_ii ≡ 0. 

Then the previous lemma can be written as the following:

Theorem 3 Suppose w̄* ≠ 0. If, for the given parameters W, there is a pair (i, j) such that L_ij(e*_k, {e_l}) > 0 for all 1 ≤ k ≤ K (or < 0 for all k), then W cannot be a critical point.

5.2 ReLU network with two hidden nodes (K = 2)

For K = 2 we have a 4-by-2 matrix M (the row order is (1,1), (1,2), (2,1), (2,2)). Substituting θ_1^2 = θ_2^1 = θ (the angle between w_1 and w_2) and θ_1^1 = θ_2^2 = 0 into Eqn. 25 gives:

    M = [ π                        (π − θ) cos θ + sin θ
          π cos θ                  (π − θ) + sin θ cos θ
          (π − θ) + sin θ cos θ    π cos θ
          (π − θ) cos θ + sin θ    π                      ]    (35)-(36)
Similarly, from Eqn. 26 we can write M*:

    M* = [ h(θ*_1^1)                                          h(θ*_1^2)
           (π − θ*_1^1) cos θ*_2^1 + sin θ*_1^1 cos θ        (π − θ*_1^2) cos θ*_2^2 + sin θ*_1^2 cos θ
           (π − θ*_2^1) cos θ*_1^1 + sin θ*_2^1 cos θ        (π − θ*_2^2) cos θ*_1^2 + sin θ*_2^2 cos θ
           h(θ*_2^1)                                          h(θ*_2^2) ]    (37)

In this case,

    M_r = [ π      h(θ)
            h(θ)   π    ],    M*_r = [ h(θ*_1^1)   h(θ*_1^2)
                                       h(θ*_2^1)   h(θ*_2^2) ]    (38)

Therefore, if we know θ and the four angles θ*_1^1, θ*_1^2, θ*_2^1 and θ*_2^2, we can compute M and M* and solve a linear equation to get the magnitudes of w_1 and w_2, which collectively identify the critical points. Note that M is a 4-by-2 matrix, so a critical point occurs only if the overdetermined system has a consistent (singular) structure.
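The function h drives both M_r and M*_r. As a quick numeric check (our addition): h(θ) = (π − θ) cos θ + sin θ is strictly positive on [0, π), vanishes at π, and is non-increasing (h'(θ) = −(π − θ) sin θ ≤ 0), so h(θ) < π = h(0) for θ > 0 and the K = 2 matrix M_r above is strictly diagonally dominant, hence invertible, for θ ∈ (0, π].

```python
import numpy as np

h = lambda t: (np.pi - t) * np.cos(t) + np.sin(t)

t = np.linspace(0.0, np.pi, 100_001)
pos = bool((h(t[:-1]) > 0).all())            # h > 0 on [0, pi)
dec = bool((np.diff(h(t)) <= 1e-12).all())   # non-increasing: h' = -(pi - t) sin t <= 0
print(pos, dec, abs(h(0.0) - np.pi) < 1e-12, abs(h(np.pi)) < 1e-12)  # all True
```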
Global Optimum. One special case is θ*_1^2 = θ*_2^1 = θ = π/2 and θ*_1^1 = θ*_2^2 = 0; in this case, we have:

    M = M* = [ π     1
               0     π/2
               π/2   0
               1     π   ]    (39)

and thus w̄ = w̄* is the unique solution.

When K = 2, the following conjecture is empirically correct.

Conjecture 1 If e is in the interior of Cone(e_1, e_2), then L_12(θ*^1, θ*^2, θ) > 0. If e is in the exterior, then L_12 < 0. If e is on the boundary, then L_12 = 0. The same holds for L_21.

Remark Note that L_1j can be written as follows:

    L_1j(e, {e_1, e_2}) = m*_1j(e) − [h(θ*^1), h(θ*^2)] M_r^{−1} m_1j    (40)
                       = [(π − θ*^1) e + sin θ*^1 · e_1]ᵀ e_j
                         − [α, β] [ π e_1, (π − θ) e_2 + sin θ · e_1 ]ᵀ e_j    (41)-(42)

where

    [α, β] = [α(θ*^1, θ*^2, θ), β(θ*^1, θ*^2, θ)] = [h(θ*^1), h(θ*^2)] M_r^{−1}    (43)

We know that L_11 = 0 by Proposition 1. Therefore the vector

    u ≡ (π − θ*^1) e + sin θ*^1 · e_1 − [π e_1, (π − θ) e_2 + sin θ · e_1] [α, β]ᵀ    (44)

is perpendicular to e_1. So if we compute the inner product between u and e_1⊥ (the unit vector in Π that is orthogonal to e_1, on the e_2 side), we get

    uᵀ e_1⊥ = (π − θ*^1) sin θ*^1 − [(π − θ) sin θ] β    (45)

Since e_2 = cos θ · e_1 + sin θ · e_1⊥, we have:

    L_12(e, {e_1, e_2}) = uᵀ e_2 = sin θ (uᵀ e_1⊥)    (46)

Note that Eqn. 45 is a function of two variables, θ*^1 and θ (θ*^2 is determined by θ*^1 and θ, depending on whether e is inside or outside Cone(e_1, e_2)). And we can verify it numerically.

Theorem 4 If Conjecture 1 is correct, then for a K = 2 ReLU network, (w_1, w_2) with w_1 ≠ w_2 is not a critical point if w*_1 and w*_2 are both inside Cone(w_1, w_2), or both outside it.

Proof If both w*_1 and w*_2 are inside Cone(w_1, w_2), then from Conjecture 1 we have

    L_12(θ*_1^k, θ*_2^k, θ) > 0    (47)

for k = 1, 2. Since K = 2, we can simply apply Thm. 3 to conclude that (w_1, w_2) is not a critical point. Similarly we prove the case where both w*_1 and w*_2 are outside Cone(w_1, w_2).

6 Convergence Analysis

6.1 Single ReLU case

In this subsection, we mainly deal with the following dynamics:

    E[∇_w J] = (N/2)(w − w*) + (N/2π) ( θ w* − ‖w*‖ sin θ · w/‖w‖ )    (48)
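Before analyzing Eqn. 48, here is a numeric sanity check (our addition) that (w − w*)ᵀ E[∇_w J] > 0 strictly inside the region ‖w − w*‖ < ‖w*‖ with w ≠ w*, which is exactly the statement V̇ < 0 proved in Theorem 5 below; the common factor N/2 is dropped.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
ws = rng.standard_normal(d)                      # teacher w* (arbitrary)

def g(w):
    """(2/N) * E[grad_w J], from Eqn. 48."""
    nw, nws = np.linalg.norm(w), np.linalg.norm(ws)
    th = np.arccos(np.clip(w @ ws / (nw * nws), -1.0, 1.0))
    return (w - ws) + (th * ws - nws * np.sin(th) * w / nw) / np.pi

ok = True
for _ in range(10_000):
    u = rng.standard_normal(d)
    u *= rng.uniform(0.01, 0.999) * np.linalg.norm(ws) / np.linalg.norm(u)
    w = ws + u                                   # strictly inside Omega, w != w*
    ok &= (w - ws) @ g(w) > 0                    # so dV/dt < 0 along w' = -E[grad J]
print(ok)
```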
Theorem 5 In the region {w⁰ : ‖w⁰ − w*‖ < ‖w*‖}, following the dynamics (Eqn. 48), the Lyapunov function V(w) = (1/2)‖w − w*‖² has V̇ < 0, and the system is asymptotically stable; thus w^t → w* as t → +∞.

Proof Denote Ω = {w : ‖w − w*‖ < ‖w*‖}. Note that

    V̇ = −(w − w*)ᵀ E[∇_w J] = −(N/2π) yᵀ M y    (49)

where y = [‖w‖, ‖w*‖]ᵀ and M is the following 2-by-2 matrix (θ is the angle between w and w*):

    M = [ π                                  −((2π − θ) cos θ + sin θ)/2
          −((2π − θ) cos θ + sin θ)/2         π − θ + sin θ cos θ          ]    (50)

Note that in Ω we always have θ ∈ [0, π/2), since ‖w − w*‖ < ‖w*‖ implies wᵀw* > ‖w‖²/2 > 0. In the following we show that M is positive definite for θ ∈ (0, π/2]. It suffices to show that M_11 > 0 and det(M) > 0. The first is trivial, while for the determinant a direct expansion gives:

    4 det(M) = 4π(π − θ + sin θ cos θ) − ((2π − θ) cos θ + sin θ)²    (51)
             = ((2π − θ)² − 1) sin²θ + θ sin 2θ − θ²    (52)

For θ ∈ (0, π/2] we have θ sin 2θ ≥ 0, and since sin θ ≥ 2θ/π on [0, π/2],

    ((2π − θ)² − 1) sin²θ ≥ ((3π/2)² − 1)(2/π)² θ² > θ²    (53)

so 4 det(M) > 0 and M is positive definite for θ ∈ (0, π/2]. When θ = 0, M(0) = π [1, −1; −1, 1] is positive semi-definite, with null eigenvector [1, 1]ᵀ, i.e. ‖w‖ = ‖w*‖. However, along θ = 0, the only w that satisfies ‖w‖ = ‖w*‖ is w = w*. Therefore V̇ = −(N/2π) yᵀ M y < 0 in Ω \ {w*}.

Note that although this region could be expanded to the entire open half-space H = {w : wᵀw* > 0}, it is not straightforward to prove convergence in H, since the trajectory might go outside H. On the other hand, Ω is the level set V < (1/2)‖w*‖², so a trajectory starting within Ω remains inside.

Figure 2: (a) Sampling strategy to maximize the probability of convergence. (b) Relationship between the sampling range r and the desired probability of success (1 − ɛ)/2.

Theorem 6 When K = 1, the dynamics (Eqn. 48) converges to w* with probability at least (1 − ɛ)/2, if the initial value w⁰ is sampled uniformly from B_r = {w : ‖w‖ ≤ r} with:

    r ≤ √(2π/(d+1)) · ɛ ‖w*‖    (55)

Proof Given a ball of radius r, we first compute the gap δ of the sphere cap (Fig. 2(b)).
First, for a point on the boundary sphere {‖w − w*‖ = ‖w*‖} at distance r from the origin, the law of cosines gives cos γ = r/2‖w*‖ (γ the angle to w*), so δ = r cos γ = r²/2‖w*‖. Then a sufficient condition for the probability argument to hold is to ensure that the volume V_shaded of the shaded area is at least (1 − ɛ)/2 · V_d(r), where V_d(r) is the volume of the d-dimensional ball of radius r. Since V_shaded ≥ (1/2) V_d(r) − δ V_{d−1}(r), it suffices to have:

    (1/2) V_d(r) − δ V_{d−1}(r) ≥ ((1 − ɛ)/2) V_d(r)    (56)
which gives

    δ V_{d−1}(r) / V_d(r) ≤ ɛ/2    (57)

Using δ = r²/2‖w*‖ and V_d(r) = V_d(1) r^d, we thus have:

    r ≤ ɛ (V_d(1)/V_{d−1}(1)) ‖w*‖    (58)

where V_d(1) is the volume of the d-dimensional unit ball. Since

    V_d(1) = π^{d/2} / Γ(d/2 + 1)    (59)

where Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt, we have

    V_{d−1}(1) / V_d(1) = Γ(d/2 + 1) / (√π Γ(d/2 + 1/2))    (60)

From Gautschi's inequality

    x^{1−s} < Γ(x + 1)/Γ(x + s) < (x + s)^{1−s},   x > 0, 0 < s < 1    (61)

with s = 1/2 and x = d/2 we have:

    (2/(d+1))^{1/2} < Γ(d/2 + 1/2)/Γ(d/2 + 1) < (2/d)^{1/2}    (62)

Therefore, it suffices to have

    r ≤ √(2π/(d+1)) · ɛ ‖w*‖    (63)

Note that this upper bound is tight when δ → 0 and d → +∞, since all the inequalities involved asymptotically become equalities.

6.2 Multiple ReLU case

Explanation of the symmetric dynamics. We first write down the dynamics to be studied:

    E[∇_{w_j} J] = Σ_{j'=1}^K ( E[F(e_j, w_{j'})] − E[F(e_j, w*_{j'})] )    (64)

We first define f(w_j, w_{j'}, w*_{j'}) = F(w_j/‖w_j‖, w_{j'}) − F(w_j/‖w_j‖, w*_{j'}). Therefore, the dynamics can be written as:

    E[∇_{w_j} J] = Σ_{j'} E[f(w_j, w_{j'}, w*_{j'})]    (65)

Suppose we have a finite group P = {P_j}_{j=1}^K which is a subgroup of the orthogonal group O(d), with P_1 the identity element. If w and w* have the following symmetry: w_j = P_j w_1 and w*_j = P_j w*_1, then the RHS of Eqn. 64 can be simplified as follows:

    E[∇_{w_j} J] = Σ_{j'} E[f(P_j w_1, P_{j'} w_1, P_{j'} w*_1)]
                 = Σ_{j''} E[f(P_j w_1, P_j P_{j''} w_1, P_j P_{j''} w*_1)]    ({P_j}_{j=1}^K is a group)
                 = P_j Σ_{j''} E[f(w_1, P_{j''} w_1, P_{j''} w*_1)]    (f(Pw, Pw', Pw*') = P f(w, w', w*') for orthogonal P and spherical Gaussian input)
                 = P_j E[∇_{w_1} J]    (66)
which means that if all w_j and w*_j are symmetric under the group action, then so are their expected gradients. Therefore, the trajectory {w^t} keeps such a cyclic structure. Instead of solving a system of K equations, we only need to solve one:

    E[∇_w J] = Σ_{j=1}^K E[f(w, P_j w, P_j w*)]    (67)

Theorem 7 For a bias-free two-layered ReLU network

    g(x; w) = Σ_j σ(w_jᵀ x)    (68)

that takes spherical Gaussian inputs, if the teacher's parameters {w*_j} form a set of orthonormal bases, then: (1) when the student parameters are initialized as [x_0, y_0, ..., y_0] under the basis of {w*_j}, where (x_0, y_0) ∈ Ω = {x ∈ (0, 1], y ∈ [0, 1], x > y}, the dynamics (Eqn. 64) converges to the teacher's parameters {w*_j} (i.e. (x, y) = (1, 0)); (2) when x_0 = y_0 ∈ (0, 1], it converges to a saddle point x = y = x*, where x* = (1/πK)(√(K−1) − arccos(1/√K) + π).

This theorem is quite complicated. We will start with a few lemmas and gradually come to the conclusion. First, if w⁰ = [x, y, y, ..., y] under the basis {w*_j}_{j=1}^K, then a simple computation shows that w^t also follows this pattern. Therefore, we only need to study the following 2D dynamics in x and y:

    (2π/N) E[∂_x J] = πx + (π − φ*)(K−1)y + (K−1) x sin φ* − (π − θ) − (sin θ + (K−1) sin φ) αx
    (2π/N) E[∂_y J] = πy + (π − φ*)(x + (K−2)y) + (K−1) y sin φ* − (π − φ) − (sin θ + (K−1) sin φ) αy    (69)

Here the symmetric factors (α ≡ 1/‖w_j‖, θ ≡ ∠(w_j, w*_j), φ ≡ ∠(w_j, w*_{j'}), φ* ≡ ∠(w_j, w_{j'}) for j' ≠ j) are defined as follows:

    α = (x² + (K−1)y²)^{−1/2},  cos θ = αx,  cos φ = αy,  cos φ* = α²(2xy + (K−2)y²)    (70)

Now we start a sequence of lemmas.

Lemma 3 For φ, θ and φ* defined in Eqn. 70, we have the following relations in the triangular region Ω_{ɛ0} = {(x, y) : x ≥ 0, y ≥ 0, x ≥ y + ɛ_0} (Fig. 3):
(1) φ, φ* ∈ [0, π/2] and θ ∈ [0, θ_0), where θ_0 = arccos(1/√K).
(2) 1 − cos φ* = α²(x − y)² and sin φ* = α(x − y) √(2 − α²(x − y)²).
(3) φ ≥ φ* (equality holds only when y = 0) and φ > θ.

Proof Propositions (1) and (2) follow from direct calculation. In particular, since cos θ = αx = 1/√(1 + (K−1)(y/x)²) and x > y ≥ 0, we have cos θ ∈ (1/√K, 1] and θ ∈ [0, θ_0).
For Proposition (3), φ = arccos(αy) > θ = arccos(αx) because x > y. Finally, for x > y > 0, we have

    cos φ* − cos φ = α²(2xy + (K−2)y²) − αy = αy ( α(2x + (K−2)y) − 1 ) > 0    (75)

The last inequality holds because (2x + (K−2)y)² − (x² + (K−1)y²) = 3x² + 4(K−2)xy + (K² − 5K + 5)y² > 0 for K ≥ 2 and x > y > 0, so α(2x + (K−2)y) > 1. Therefore φ > φ*. If y = 0 then cos φ = cos φ* = 0 and φ = φ*.
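The relations of Lemma 3 are easy to spot-check numerically (our addition): sample (x, y) with x > y ≥ 0, compute the angles from Eqn. 70, and verify (1)-(3).

```python
import numpy as np

rng = np.random.default_rng(3)
K = 5
ok = True
for _ in range(10_000):
    y = rng.uniform(0.0, 1.0)
    x = y + rng.uniform(1e-3, 1.0)            # x > y >= 0
    a = 1.0 / np.sqrt(x**2 + (K - 1) * y**2)  # alpha
    th  = np.arccos(np.clip(a * x, -1, 1))
    ph  = np.arccos(np.clip(a * y, -1, 1))
    phs = np.arccos(np.clip(a * a * (2*x*y + (K - 2) * y**2), -1, 1))  # phi*
    ok &= abs((1 - np.cos(phs)) - (a * (x - y))**2) < 1e-9  # relation (2)
    ok &= th < np.arccos(1 / np.sqrt(K)) + 1e-9             # relation (1)
    ok &= (phs <= ph + 1e-9) and (ph > th)                  # relation (3)
print(ok)   # True
```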
Figure 3: The region Ω_{ɛ0} considered in the analysis of Eqn. 69; the saddle point (x*, x*) lies on the diagonal x = y, and the teacher's parameters correspond to (1, 0).

Lemma 4 For the dynamics defined in Eqn. 69, there exists ɛ_0 > 0 so that the triangular region Ω_{ɛ0} = {(x, y) : x ≥ 0, y ≥ 0, x ≥ y + ɛ_0} (Fig. 3) is a convergent region. That is, the flow goes inwards on all three edges, and any trajectory starting in Ω_{ɛ0} stays inside.

Proof We discuss the three boundaries as follows:

Case 1: y = 0, 0 ≤ x ≤ 1, horizontal line. In this case, θ = 0, φ = π/2 and φ* = π/2. The y component of the dynamics on this line is:

    f_1 ≡ (2π/N) ∂_y J = (π/2)(x − 1) ≤ 0    (76)

so the flow ẏ = −∂_y J ≥ 0 points into the interior of Ω.

Case 2: x = 1, 0 ≤ y ≤ 1 − ɛ_0, vertical line. In this case α ≤ 1 and the x component of the dynamics is:

    f_2 ≡ (2π/N) ∂_x J = (K−1) [ (π − φ*) y − (α sin φ − sin φ*) ] + (θ − α sin θ)    (77)

Since α ≤ 1, we have α sin θ ≤ sin θ ≤ θ, so the second term is nonnegative. For the first term, we only need to check that (π − φ*) y − (α sin φ − sin φ*) ≥ 0. Substituting sin φ* = α(x − y)√(2 − α²(x − y)²) (Lemma 3) and sin φ = √(1 − α²y²), and using α(x − y) ≤ 1 and αy ≤ 1 (which hold at x = 1 since α ≤ 1 and x − y ≤ 1), a direct computation shows this is nonnegative. Hence f_2 ≥ 0 and the flow ẋ = −∂_x J points inwards at x = 1.

Case 3: x = y + ɛ, 0 ≤ y, diagonal line. We compute the inner product between the flow (−∂_x J, −∂_y J) and (1, −1), the inward normal of Ω_{ɛ0} at this line; up to a positive factor it equals

    f_3(y, ɛ) ≡ (2π/N) (∂_y J − ∂_x J) = (φ − θ) − ɛφ* + [ (K−1)(α sin φ − sin φ*) + α sin θ ] ɛ    (83)

Note that (φ − θ) = arccos(αy) − arccos(αx) ≥ 0 when x ≥ y, and φ* ≤ (π/2) sin φ* for φ* ∈ [0, π/2]. For y > 0, expanding sin φ* and sin φ via Lemma 3 then bounds the ɛ-terms from below.    (84)
For y = 0, αy = 0; in general y ≤ x gives α²y² ≤ 1/K, and αɛ ≤ 1. Therefore f_3 ≥ ɛα(K−1)(C_1 − ɛC_2) with constants C_1, C_2 > 0 depending only on K. With ɛ = ɛ_0 > 0 sufficiently small, f_3 > 0.

Lemma 5 (Reparametrization) Denote ɛ = x − y > 0. The terms αx, αy and αɛ involved in the trigonometric functions in Eqn. 69 have the following parametrization:

    α [x ; y] = (1/K) [ β + (K−1)√β₂ ; β − √β₂ ],   αɛ = √β₂    (85)

where β₂ = (K − β²)/(K − 1). The reverse transformation is given by β = √(K − (K−1)α²ɛ²). Here β ∈ [1, √K) and β₂ ∈ (0, 1]. In particular, the critical point (x, y) = (1, 0) corresponds to (β, ɛ) = (1, 1). As a result, all trigonometric functions in Eqn. 69 depend only on the single variable β. In particular, the following relationship is useful:

    β = cos θ + √(K−1) sin θ    (86)

Proof The transformation can be checked by simple algebraic manipulation. For example:

    α²(Ky + ɛ)² = α²( K(x² + (K−1)y²) − (K−1)ɛ² ) = K − (K−1)α²ɛ² = β²,  so  Kαy = β − αɛ = β − √β₂    (87)

To prove Eqn. 86, first notice that K cos θ = Kαx = β + (K−1)√β₂. Therefore (K cos θ − β)² = (K−1)²β₂ = (K−1)(K − β²), which gives β² − 2β cos θ + cos²θ − (K−1) sin²θ = 0. Solving this quadratic equation, and noting that θ ∈ [0, π/2] and β ≥ 1, we get:

    β = cos θ + √((K−1) sin²θ) = cos θ + √(K−1) sin θ    (88)

Lemma 6 After reparametrization (Eqn. 85), f_3(β, ɛ) ≥ 0 for ɛ ∈ (0, √β₂/β]. Furthermore, the equality holds only if (β, ɛ) = (1, 1), i.e. (y, ɛ) = (0, 1).

Proof Applying the parametrization (Eqn. 85) to Eqn. 83, and noting that αɛ = √β₂ = √β₂(β), we can write

    f_3 = h_1(β) − (φ* + (K−1) sin φ*) ɛ    (89)

where h_1(β) = (φ − θ) + √β₂ sin θ + (K−1)√β₂ sin φ (all the angles depend only on β). For fixed β, f_3 is thus monotonically decreasing in ɛ > 0. Therefore f_3(β, ɛ) ≥ f_3(β, ɛ*) for 0 < ɛ ≤ ɛ* ≡ √β₂/β. If we can prove f_3(β, ɛ*) ≥ 0, attaining zero only at the known critical point (β, ɛ) = (1, 1), the proof is complete. Denote f_3(β, ɛ*) = f_31 + f_32, where

    f_31(β, ɛ*) = (φ − θ) − ɛ*φ* + ɛ*α sin θ    (90)
    f_32(β, ɛ*) = (K−1)(α sin φ − sin φ*) ɛ*    (91)

For f_32, it suffices to prove that ɛ*(α sin φ − sin φ*) = √β₂ sin φ − (√β₂/β) sin φ* ≥ 0 (using ɛ*α = √β₂), which is equivalent to sin φ − sin φ*/β ≥ 0. But this is trivially true since φ ≥ φ* and β ≥ 1. Therefore, f_32 ≥ 0.
Note that the equality only holds when φ = φ* and β = 1, which corresponds to the horizontal line x ∈ (0, 1], y = 0. For f_31, since φ ≥ φ*, φ > θ and ɛ* ∈ (0, 1], we have:

    f_31 = ɛ*(φ − φ*) + (1 − ɛ*)(φ − θ) − ɛ*θ + √β₂ sin θ ≥ −ɛ*θ + √β₂ sin θ = ɛ*(β sin θ − θ)    (92)

using √β₂ = βɛ*. It thus reduces to showing that β sin θ − θ is nonnegative. Using Eqn. 86, we have:

    f_33(θ) = β sin θ − θ = (cos θ + √(K−1) sin θ) sin θ − θ    (93)

Note that f_33'(θ) = cos 2θ + √(K−1) sin 2θ − 1 = √K cos(2θ − θ_0) − 1, where θ_0 = arccos(1/√K). By Proposition (1) in Lemma 3, θ ∈ [0, θ_0), so cos(2θ − θ_0) ≥ cos θ_0 = 1/√K and f_33' ≥ 0. Since f_33(0) = 0, f_33 ≥ 0. Again the equality holds when θ = 0, φ = φ* and ɛ = 1, which is the critical point (β, ɛ) = (1, 1), i.e. (y, ɛ) = (0, 1).
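The reparametrization and the sign claim of Lemma 6 can be spot-checked numerically (our addition): reconstruct (x, y) from (β, ɛ) via Eqn. 85, evaluate f_3 via Eqn. 83, and confirm both Eqn. 86 and f_3 ≥ 0 for ɛ ≤ √β₂/β.

```python
import numpy as np

K = 3
ok = True
for beta in np.linspace(1.0 + 1e-6, np.sqrt(K) - 1e-6, 200):
    beta2 = (K - beta**2) / (K - 1)
    for frac in np.linspace(0.05, 1.0, 40):
        eps = frac * np.sqrt(beta2) / beta          # eps in (0, sqrt(beta2)/beta]
        a = np.sqrt(beta2) / eps                    # alpha, from alpha*eps = sqrt(beta2)
        x = (beta + (K - 1) * np.sqrt(beta2)) / (K * a)   # Eqn. 85
        y = (beta - np.sqrt(beta2)) / (K * a)
        th = np.arccos(np.clip(a * x, -1, 1))
        ph = np.arccos(np.clip(a * y, -1, 1))
        phs = np.arccos(np.clip(a * a * (2*x*y + (K - 2) * y**2), -1, 1))
        ok &= abs(beta - (np.cos(th) + np.sqrt(K - 1) * np.sin(th))) < 1e-9   # Eqn. 86
        f3 = (ph - th) - eps * phs + eps * ((K - 1) * (a * np.sin(ph) - np.sin(phs))
                                            + a * np.sin(th))                  # Eqn. 83
        ok &= f3 > -1e-9                            # Lemma 6: f3 >= 0 on this range
print(ok)
```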
Lemma 7 For the dynamics defined in Eqn. 69, the only critical point (∂_x J = 0 and ∂_y J = 0) within Ω_{ɛ0} is (y, ɛ) = (0, 1).

Proof We prove by contradiction. Suppose (β, ɛ) is a critical point other than w*. A necessary condition for this is f_3 = 0 (Eqn. 83). By Lemma 6, this requires ɛ > ɛ* = √β₂/β > 0, and then

    ɛ + Ky = β/α = (β/√β₂) ɛ > (β/√β₂) ɛ* = 1    (94)

so ɛ + Ky is strictly greater than 1. On the other hand, f_3 = 0 implies that

    (K−1)(α sin φ − sin φ*) + α sin θ = φ* − ɛ^{−1}(φ − θ)    (95)

Substituting into the y component of Eqn. 69 (with x = y + ɛ), we get:

    (2π/N) ∂_y J = (π − φ*)(ɛ + Ky − 1) + (φ − φ*) + ɛ^{−1}(φ − θ) y > 0    (96)

since π − φ* > 0, ɛ + Ky > 1, φ ≥ φ* and φ > θ. So the current point (β, ɛ) cannot be a critical point.

Lemma 8 Any trajectory in Ω_{ɛ0} converges to (y, ɛ) = (0, 1), following the dynamics defined in Eqn. 69.

Proof We have the Lyapunov function V = E[J], so that V̇ = −E[∇_w J]ᵀ E[∇_w J] ≤ 0. By Lemma 7, other than the optimal solution w*, there is no other symmetric critical point, so E[∇_w J] ≠ 0 and thus V̇ < 0. On the other hand, by Lemma 4, the triangular region Ω_{ɛ0} is convergent, in which the 2D dynamics is C¹. Therefore, any 2D solution curve ξ(t) stays within Ω_{ɛ0}. By the Poincaré-Bendixson theorem, when there is a unique critical point, the curve either converges to a limit cycle or to the critical point. However, a limit cycle is not possible since V is strictly decreasing along the curve. Therefore, ξ(t) converges to the unique critical point, which is (y, ɛ) = (0, 1), and so does the symmetric system (Eqn. 64).

Lemma 9 When x = y ∈ (0, 1], the 2D dynamics (Eqn. 69) reduces to the following 1D case:

    (2π/N) ∂_x J = πK(x − x*)    (97)

where x* = (1/πK)(√(K−1) − arccos(1/√K) + π). Furthermore, x* is a convergent critical point.

Proof The 1D system can be computed with simple algebraic manipulations (note that when x = y, φ* = 0 and θ = φ = arccos(1/√K)). Note that the 1D system is linear, with closed-form solution x_t = x* + C e^{−NKt/2}, and is thus convergent.
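The saddle value x* in Lemma 9 can be verified against the population-gradient formula (our addition): with an orthonormal teacher and all student weights equal to x*·(1, ..., 1) under the teacher basis, the expected gradient of Eqn. 64 vanishes.

```python
import numpy as np

K = 5
xs = (np.sqrt(K - 1) - np.arccos(1 / np.sqrt(K)) + np.pi) / (np.pi * K)  # x*

Ws = np.eye(K)                          # orthonormal teacher w*_j
W = np.full((K, K), xs)                 # all student columns equal x* * (1, ..., 1)

G = np.zeros((K, K))                    # (2pi/N) * E[grad_{w_j} J], via Eqns. 1 and 64
for j in range(K):
    e = W[:, j] / np.linalg.norm(W[:, j])
    for V, sgn in ((W, 1.0), (Ws, -1.0)):
        for jp in range(K):
            v = V[:, jp]; nv = np.linalg.norm(v)
            th = np.arccos(np.clip(e @ v / nv, -1.0, 1.0))
            G[:, j] += sgn * ((np.pi - th) * v + nv * np.sin(th) * e)

print(np.abs(G).max())   # ~0: x = y = x* is a critical point
```

By symmetry every column of G is the same multiple of (1, ..., 1); x* is precisely the value that zeros it out.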
Combining Lemma 8 and Lemma 9 yields Thm. 7.

7 Simulations

No theorem is provided.

8 Extension to multilayer ReLU network

Proposition 2 For a neural network with ReLU nonlinearity that uses the l2 loss to match a teacher network of the same size, the gradient inflow g_j for node j at layer c has the following form:

    g_j = Q_j (Q*_j u*_j − Q_j u_j)    (98)

where Q_j and Q*_j are N-by-N diagonal matrices. For any k ∈ [c+1], Q_k = Σ_{j∈[c]} w_{jk} D_j Q_j, and similarly for Q*_k. The gradient with respect to w_j (the parameters immediately under node j) is computed as:

    ∇_{w_j} J = X_cᵀ D_j g_j    (99)
Proof We prove by induction on layers. For the first layer, there is only one node with g = u* − u, therefore Q_j = Q*_j = I. Suppose the condition holds for all nodes j ∈ [c]. Then for node k ∈ [c+1], we have:

    g_k = Σ_j w_{jk} D_j g_j = Σ_j w_{jk} D_j Q_j (Q*_j u*_j − Q_j u_j)

Expanding u_j and u*_j in terms of the layer-(c+1) activations (u_j = D_j Σ_{k'} w_{jk'} u_{k'}, and similarly for u*_j with the teacher's weights), substituting into the sum and regrouping, then setting Q_k = Σ_j w_{jk} D_j Q_j and Q*_k = Σ_j w_{jk} D_j Q*_j (both diagonal matrices), we thus have:

    g_k = Q_k Q*_k u*_k − Q_k Q_k u_k = Q_k (Q*_k u*_k − Q_k u_k)    (100)
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational
More informationEconomics 204 Summer/Fall 2010 Lecture 10 Friday August 6, 2010
Economics 204 Summer/Fall 2010 Lecture 10 Friday August 6, 2010 Diagonalization of Symmetric Real Matrices (from Handout Definition 1 Let δ ij = { 1 if i = j 0 if i j A basis V = {v 1,..., v n } of R n
More informationMath 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.
Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,
More informationSystems of Linear ODEs
P a g e 1 Systems of Linear ODEs Systems of ordinary differential equations can be solved in much the same way as discrete dynamical systems if the differential equations are linear. We will focus here
More informationLinear Models Review
Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign
More informationInsights into the Geometry of the Gaussian Kernel and an Application in Geometric Modeling
Insights into the Geometry of the Gaussian Kernel and an Application in Geometric Modeling Master Thesis Michael Eigensatz Advisor: Joachim Giesen Professor: Mark Pauly Swiss Federal Institute of Technology
More informationL. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS
Rend. Sem. Mat. Univ. Pol. Torino Vol. 57, 1999) L. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS Abstract. We use an abstract framework to obtain a multilevel decomposition of a variety
More informationScattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions
Chapter 3 Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions 3.1 Scattered Data Interpolation with Polynomial Precision Sometimes the assumption on the
More informationDot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem.
Dot Products K. Behrend April 3, 008 Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Contents The dot product 3. Length of a vector........................
More informationThe Symmetric Space for SL n (R)
The Symmetric Space for SL n (R) Rich Schwartz November 27, 2013 The purpose of these notes is to discuss the symmetric space X on which SL n (R) acts. Here, as usual, SL n (R) denotes the group of n n
More informationReview of Some Concepts from Linear Algebra: Part 2
Review of Some Concepts from Linear Algebra: Part 2 Department of Mathematics Boise State University January 16, 2019 Math 566 Linear Algebra Review: Part 2 January 16, 2019 1 / 22 Vector spaces A set
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationARCS IN FINITE PROJECTIVE SPACES. Basic objects and definitions
ARCS IN FINITE PROJECTIVE SPACES SIMEON BALL Abstract. These notes are an outline of a course on arcs given at the Finite Geometry Summer School, University of Sussex, June 26-30, 2017. Let K denote an
More informationOctober 25, 2013 INNER PRODUCT SPACES
October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal
More informationStable Process. 2. Multivariate Stable Distributions. July, 2006
Stable Process 2. Multivariate Stable Distributions July, 2006 1. Stable random vectors. 2. Characteristic functions. 3. Strictly stable and symmetric stable random vectors. 4. Sub-Gaussian random vectors.
More informationTEST CODE: MMA (Objective type) 2015 SYLLABUS
TEST CODE: MMA (Objective type) 2015 SYLLABUS Analytical Reasoning Algebra Arithmetic, geometric and harmonic progression. Continued fractions. Elementary combinatorics: Permutations and combinations,
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More informationZweier I-Convergent Sequence Spaces
Chapter 2 Zweier I-Convergent Sequence Spaces In most sciences one generation tears down what another has built and what one has established another undoes. In mathematics alone each generation builds
More informationLinear Algebra. Session 12
Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)
More informationMath 322. Spring 2015 Review Problems for Midterm 2
Linear Algebra: Topic: Linear Independence of vectors. Question. Math 3. Spring Review Problems for Midterm Explain why if A is not square, then either the row vectors or the column vectors of A are linearly
More information1 Planar rotations. Math Abstract Linear Algebra Fall 2011, section E1 Orthogonal matrices and rotations
Math 46 - Abstract Linear Algebra Fall, section E Orthogonal matrices and rotations Planar rotations Definition: A planar rotation in R n is a linear map R: R n R n such that there is a plane P R n (through
More informationStructural and Multidisciplinary Optimization. P. Duysinx and P. Tossings
Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be
More informationBASIC NOTIONS. x + y = 1 3, 3x 5y + z = A + 3B,C + 2D, DC are not defined. A + C =
CHAPTER I BASIC NOTIONS (a) 8666 and 8833 (b) a =6,a =4 will work in the first case, but there are no possible such weightings to produce the second case, since Student and Student 3 have to end up with
More informationx 3y 2z = 6 1.2) 2x 4y 3z = 8 3x + 6y + 8z = 5 x + 3y 2z + 5t = 4 1.5) 2x + 8y z + 9t = 9 3x + 5y 12z + 17t = 7
Linear Algebra and its Applications-Lab 1 1) Use Gaussian elimination to solve the following systems x 1 + x 2 2x 3 + 4x 4 = 5 1.1) 2x 1 + 2x 2 3x 3 + x 4 = 3 3x 1 + 3x 2 4x 3 2x 4 = 1 x + y + 2z = 4 1.4)
More informationContinuous methods for numerical linear algebra problems
Continuous methods for numerical linear algebra problems Li-Zhi Liao (http://www.math.hkbu.edu.hk/ liliao) Department of Mathematics Hong Kong Baptist University The First International Summer School on
More informationChapter Two Elements of Linear Algebra
Chapter Two Elements of Linear Algebra Previously, in chapter one, we have considered single first order differential equations involving a single unknown function. In the next chapter we will begin to
More informationIntroduction to Semidefinite Programming I: Basic properties a
Introduction to Semidefinite Programming I: Basic properties and variations on the Goemans-Williamson approximation algorithm for max-cut MFO seminar on Semidefinite Programming May 30, 2010 Semidefinite
More informationA geometric proof of the spectral theorem for real symmetric matrices
0 0 0 A geometric proof of the spectral theorem for real symmetric matrices Robert Sachs Department of Mathematical Sciences George Mason University Fairfax, Virginia 22030 rsachs@gmu.edu January 6, 2011
More informationSolutions to old Exam 3 problems
Solutions to old Exam 3 problems Hi students! I am putting this version of my review for the Final exam review here on the web site, place and time to be announced. Enjoy!! Best, Bill Meeks PS. There are
More informationHints and Selected Answers to Homework Sets
Hints and Selected Answers to Homework Sets Phil R. Smith, Ph.D. March 8, 2 Test I: Systems of Linear Equations; Matrices Lesson. Here s a linear equation in terms of a, b, and c: a 5b + 7c 2 ( and here
More informationCS 246 Review of Linear Algebra 01/17/19
1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector
More informationInterior-Point Methods for Linear Optimization
Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function
More informationStability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games
Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Alberto Bressan ) and Khai T. Nguyen ) *) Department of Mathematics, Penn State University **) Department of Mathematics,
More informationCO 250 Final Exam Guide
Spring 2017 CO 250 Final Exam Guide TABLE OF CONTENTS richardwu.ca CO 250 Final Exam Guide Introduction to Optimization Kanstantsin Pashkovich Spring 2017 University of Waterloo Last Revision: March 4,
More informationON THE MATRIX EQUATION XA AX = X P
ON THE MATRIX EQUATION XA AX = X P DIETRICH BURDE Abstract We study the matrix equation XA AX = X p in M n (K) for 1 < p < n It is shown that every matrix solution X is nilpotent and that the generalized
More informationUNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems
UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction
More information1. < 0: the eigenvalues are real and have opposite signs; the fixed point is a saddle point
Solving a Linear System τ = trace(a) = a + d = λ 1 + λ 2 λ 1,2 = τ± = det(a) = ad bc = λ 1 λ 2 Classification of Fixed Points τ 2 4 1. < 0: the eigenvalues are real and have opposite signs; the fixed point
More informationMATH 423 Linear Algebra II Lecture 33: Diagonalization of normal operators.
MATH 423 Linear Algebra II Lecture 33: Diagonalization of normal operators. Adjoint operator and adjoint matrix Given a linear operator L on an inner product space V, the adjoint of L is a transformation
More informationSolving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels
Solving the 3D Laplace Equation by Meshless Collocation via Harmonic Kernels Y.C. Hon and R. Schaback April 9, Abstract This paper solves the Laplace equation u = on domains Ω R 3 by meshless collocation
More informationThroughout these notes we assume V, W are finite dimensional inner product spaces over C.
Math 342 - Linear Algebra II Notes Throughout these notes we assume V, W are finite dimensional inner product spaces over C 1 Upper Triangular Representation Proposition: Let T L(V ) There exists an orthonormal
More informationNow I switch to nonlinear systems. In this chapter the main object of study will be
Chapter 4 Stability 4.1 Autonomous systems Now I switch to nonlinear systems. In this chapter the main object of study will be ẋ = f(x), x(t) X R k, f : X R k, (4.1) where f is supposed to be locally Lipschitz
More informationAppendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS
Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS Here we consider systems of linear constraints, consisting of equations or inequalities or both. A feasible solution
More informationElementary linear algebra
Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The
More informationIn particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with
Appendix: Matrix Estimates and the Perron-Frobenius Theorem. This Appendix will first present some well known estimates. For any m n matrix A = [a ij ] over the real or complex numbers, it will be convenient
More informationAlgebra Workshops 10 and 11
Algebra Workshops 1 and 11 Suggestion: For Workshop 1 please do questions 2,3 and 14. For the other questions, it s best to wait till the material is covered in lectures. Bilinear and Quadratic Forms on
More informationSolutions to Math 53 Math 53 Practice Final
Solutions to Math 5 Math 5 Practice Final 20 points Consider the initial value problem y t 4yt = te t with y 0 = and y0 = 0 a 8 points Find the Laplace transform of the solution of this IVP b 8 points
More informationj=1 u 1jv 1j. 1/ 2 Lemma 1. An orthogonal set of vectors must be linearly independent.
Lecture Notes: Orthogonal and Symmetric Matrices Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk Orthogonal Matrix Definition. Let u = [u
More informationLinear Algebra Review
January 29, 2013 Table of contents Metrics Metric Given a space X, then d : X X R + 0 and z in X if: d(x, y) = 0 is equivalent to x = y d(x, y) = d(y, x) d(x, y) d(x, z) + d(z, y) is a metric is for all
More informationREVIEW OF DIFFERENTIAL CALCULUS
REVIEW OF DIFFERENTIAL CALCULUS DONU ARAPURA 1. Limits and continuity To simplify the statements, we will often stick to two variables, but everything holds with any number of variables. Let f(x, y) be
More informationMath 321: Linear Algebra
Math 32: Linear Algebra T Kapitula Department of Mathematics and Statistics University of New Mexico September 8, 24 Textbook: Linear Algebra,by J Hefferon E-mail: kapitula@mathunmedu Prof Kapitula, Spring
More informationInformation About Ellipses
Information About Ellipses David Eberly, Geometric Tools, Redmond WA 9805 https://www.geometrictools.com/ This work is licensed under the Creative Commons Attribution 4.0 International License. To view
More informationx 1 + x 2 2 x 1 x 2 1 x 2 2 min 3x 1 + 2x 2
Lecture 1 LPs: Algebraic View 1.1 Introduction to Linear Programming Linear programs began to get a lot of attention in 1940 s, when people were interested in minimizing costs of various systems while
More information0.1 Rational Canonical Forms
We have already seen that it is useful and simpler to study linear systems using matrices. But matrices are themselves cumbersome, as they are stuffed with many entries, and it turns out that it s best
More informationLagrange Multipliers
Optimization with Constraints As long as algebra and geometry have been separated, their progress have been slow and their uses limited; but when these two sciences have been united, they have lent each
More informationMATH 1120 (LINEAR ALGEBRA 1), FINAL EXAM FALL 2011 SOLUTIONS TO PRACTICE VERSION
MATH (LINEAR ALGEBRA ) FINAL EXAM FALL SOLUTIONS TO PRACTICE VERSION Problem (a) For each matrix below (i) find a basis for its column space (ii) find a basis for its row space (iii) determine whether
More informationMORE NOTES FOR MATH 823, FALL 2007
MORE NOTES FOR MATH 83, FALL 007 Prop 1.1 Prop 1. Lemma 1.3 1. The Siegel upper half space 1.1. The Siegel upper half space and its Bergman kernel. The Siegel upper half space is the domain { U n+1 z C
More informationThe Closed Form Reproducing Polynomial Particle Shape Functions for Meshfree Particle Methods
The Closed Form Reproducing Polynomial Particle Shape Functions for Meshfree Particle Methods by Hae-Soo Oh Department of Mathematics, University of North Carolina at Charlotte, Charlotte, NC 28223 June
More informationLecture 8: Linear Algebra Background
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected
More informationr=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J
7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured
More informationEigenvalues and Eigenvectors
/88 Chia-Ping Chen Department of Computer Science and Engineering National Sun Yat-sen University Linear Algebra Eigenvalue Problem /88 Eigenvalue Equation By definition, the eigenvalue equation for matrix
More informationConvex Analysis and Optimization Chapter 2 Solutions
Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com
More informationDS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.
DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1
More informationGrothendieck s Inequality
Grothendieck s Inequality Leqi Zhu 1 Introduction Let A = (A ij ) R m n be an m n matrix. Then A defines a linear operator between normed spaces (R m, p ) and (R n, q ), for 1 p, q. The (p q)-norm of A
More informationMath 5630: Iterative Methods for Systems of Equations Hung Phan, UMass Lowell March 22, 2018
1 Linear Systems Math 5630: Iterative Methods for Systems of Equations Hung Phan, UMass Lowell March, 018 Consider the system 4x y + z = 7 4x 8y + z = 1 x + y + 5z = 15. We then obtain x = 1 4 (7 + y z)
More informationMostow Rigidity. W. Dison June 17, (a) semi-simple Lie groups with trivial centre and no compact factors and
Mostow Rigidity W. Dison June 17, 2005 0 Introduction Lie Groups and Symmetric Spaces We will be concerned with (a) semi-simple Lie groups with trivial centre and no compact factors and (b) simply connected,
More informationTutorials in Optimization. Richard Socher
Tutorials in Optimization Richard Socher July 20, 2008 CONTENTS 1 Contents 1 Linear Algebra: Bilinear Form - A Simple Optimization Problem 2 1.1 Definitions........................................ 2 1.2
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationMAT 419 Lecture Notes Transcribed by Eowyn Cenek 6/1/2012
(Homework 1: Chapter 1: Exercises 1-7, 9, 11, 19, due Monday June 11th See also the course website for lectures, assignments, etc) Note: today s lecture is primarily about definitions Lots of definitions
More informationw T 1 w T 2. w T n 0 if i j 1 if i = j
Lyapunov Operator Let A F n n be given, and define a linear operator L A : C n n C n n as L A (X) := A X + XA Suppose A is diagonalizable (what follows can be generalized even if this is not possible -
More informationLecture 1: Review of linear algebra
Lecture 1: Review of linear algebra Linear functions and linearization Inverse matrix, least-squares and least-norm solutions Subspaces, basis, and dimension Change of basis and similarity transformations
More informationMath 1060 Linear Algebra Homework Exercises 1 1. Find the complete solutions (if any!) to each of the following systems of simultaneous equations:
Homework Exercises 1 1 Find the complete solutions (if any!) to each of the following systems of simultaneous equations: (i) x 4y + 3z = 2 3x 11y + 13z = 3 2x 9y + 2z = 7 x 2y + 6z = 2 (ii) x 4y + 3z =
More information