arxiv: v1 [cs.lg] 4 Oct 2018


Gradient descent aligns the layers of deep linear networks

Ziwei Ji, Matus Telgarsky
University of Illinois, Urbana-Champaign

Abstract

This paper establishes risk convergence and asymptotic weight matrix alignment (a form of implicit regularization) of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized $i$-th weight matrix asymptotically equals its rank-1 approximation $u_i v_i^\top$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^\top u_i| \to 1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network (the product of its weight matrices) converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.

1 Introduction

Efforts to explain the effectiveness of gradient descent in deep learning have uncovered an exciting possibility: it not only finds solutions with low error, but also biases the search for low complexity solutions which generalize well (Zhang et al., 2017; Bartlett et al., 2017; Soudry et al., 2017; Gunasekar et al., 2018).

This paper analyzes the implicit regularization of gradient descent and gradient flow on deep linear networks and linearly separable data. For strictly decreasing losses, the optimum is off at infinity, and we establish various alignment phenomena:

• For each weight matrix $W_i$, the corresponding normalized weight matrix $W_i / \|W_i\|_F$ asymptotically equals its rank-1 approximation $u_i v_i^\top$, where the Frobenius norm $\|W_i\|_F$ satisfies $\|W_i\|_F \to \infty$. In other words, $\|W_i\|_2 / \|W_i\|_F \to 1$, and asymptotically only the rank-1 approximation of each weight matrix contributes to the final predictor, a form of implicit regularization.

• Adjacent rank-1 weight matrix approximations are aligned: $|v_{i+1}^\top u_i| \to 1$.

• For the logistic loss, the first right singular vector $v_1$ of $W_1$ is aligned with the data, meaning $v_1$ converges to the unique maximum margin predictor $\bar u$ defined by the data. Moreover, the linear predictor induced by the network, $w_{\mathrm{prod}} := (W_L \cdots W_1)^\top$, is also aligned with the data, meaning $w_{\mathrm{prod}} / \|w_{\mathrm{prod}}\| \to \bar u$.

Simultaneously, this work proves that the risk is globally optimized: it asymptotes to 0. Alignment and risk convergence are proved simultaneously; the phenomena are coupled within the proofs.

The paper is organized as follows. This introduction continues with related work, notation, and assumptions in Sections 1.1 and 1.2. The analysis of gradient flow is in Section 2, and gradient descent is analyzed in Section 3. The paper closes with future directions in Section 4; a particular highlight is a preliminary experiment on CIFAR-10 which establishes empirically that a form of the alignment phenomenon occurs on the standard nonlinear network AlexNet.

[Figure 1: Visualization of main results on synthetic data with a 4-layer linear network compared to a 1-layer network (a linear predictor). (a) Margin maximization. (b) Alignment and risk minimization. Figure 1a shows the convergence of 1-layer and 4-layer networks to the same linear predictor on positive (blue) and negative (red) separable data. Figure 1b shows the alignment phenomenon in the 4-layer network, plotted against the risk. Specifically, for each layer, the ratio of spectral to Frobenius norms $\|W_i\|_2 / \|W_i\|_F$ is plotted, and converges to 1. As in the theoretical analysis, the convergence in alignment and risk occur simultaneously.]

1.1 Related work

On the implicit regularization of gradient descent, Soudry et al. (2017) show that for linear predictors and linearly separable data, the gradient descent iterates converge to the same direction as the maximum margin solution. Ji and Telgarsky (2018) further characterize such an implicit bias for general nonseparable data. Gunasekar et al. (2018) consider gradient descent on fully connected linear networks and linear convolutional networks. In particular, for the exponential loss, assuming the risk is minimized to 0 and the gradients converge in direction, they show that the whole network converges in direction to the maximum margin solution. These two assumptions are on the gradient descent process itself, and specifically the second one might be hard to interpret and justify. Compared with Gunasekar et al. (2018), this paper proves that the risk converges to 0 and the weight matrices align; moreover the proof here establishes the properties simultaneously, rather than assuming one and deriving the other. Lastly, for ReLU networks, Du et al. (2018) show that gradient flow does not change the difference between squared Frobenius norms of any two layers.

For a smooth (nonconvex) function, Lee et al. (2016) show that any strict saddle can be avoided almost surely with small step sizes. If there are only countably many saddle points and they are all strict, and if gradient descent iterates converge, then this implies (almost surely) they converge to a local minimum.
In the present work, since there is no finite local minimum, gradient descent will go to infinity and never converge, and thus these results of Lee et al. (2016) do not show that the risk converges to 0.

There has been a rich literature on linear networks. Saxe et al. (2013) analyze the learning dynamics of deep linear networks, showing that they exhibit some learning patterns similar to nonlinear networks, such as a long plateau followed by a rapid risk drop. Arora et al. (2018) show that depth can help accelerate optimization. On the landscape properties of deep linear networks, Lu and Kawaguchi (2017) and Laurent and von Brecht (2017) show that under various structural assumptions, all local optima are global. Zhou and Liang (2018) give a necessary and sufficient characterization of critical points for deep linear networks.

1.2 Notation, setting, and assumptions

Consider a data set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$, $\|x_i\| \le 1$, and $y_i \in \{-1, +1\}$. The data set is assumed to be linearly separable, i.e., there exists a unit vector $u$ which correctly classifies every data point: for any $1 \le i \le n$, $y_i \langle u, x_i \rangle > 0$. Furthermore, let $\gamma := \max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle > 0$ denote the maximum margin, and $\bar u := \arg\max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle$ denote the maximum margin solution (the solution to the hard-margin SVM).

A linear network of depth $L$ is parameterized by weight matrices $W_L, \ldots, W_1$, where $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, $d_0 = d$, and $d_L = 1$. Let $W = (W_L, \ldots, W_1)$ denote all parameters of the network. The (empirical) risk induced by the network is given by

$$R(W) = R(W_L, \ldots, W_1) = \frac{1}{n} \sum_{i=1}^n \ell\bigl(y_i W_L \cdots W_1 x_i\bigr) = \frac{1}{n} \sum_{i=1}^n \ell\bigl(\langle w_{\mathrm{prod}}, z_i \rangle\bigr),$$

where $w_{\mathrm{prod}} := (W_L \cdots W_1)^\top$, and $z_i := y_i x_i$.

The loss $\ell$ is assumed to be continuously differentiable, unbounded, and strictly decreasing to 0. Examples include the exponential loss $\ell_{\exp}(x) = e^{-x}$ and the logistic loss $\ell_{\log}(x) = \ln(1 + e^{-x})$.

Assumption 1.1. $\ell' < 0$ is continuous, $\lim_{x \to -\infty} \ell(x) = \infty$ and $\lim_{x \to \infty} \ell(x) = 0$.

This paper considers gradient flow and gradient descent, where gradient flow $\{W(t) \mid t \ge 0, t \in \mathbb{R}\}$ can be interpreted as gradient descent with infinitesimal step sizes. It starts from some $W(0)$ at $t = 0$, and proceeds as

$$\frac{dW(t)}{dt} = -\nabla R\bigl(W(t)\bigr).$$

By contrast, gradient descent $\{W(t) \mid t \ge 0, t \in \mathbb{Z}\}$ is a discrete-time process given by

$$W(t+1) = W(t) - \eta_t \nabla R\bigl(W(t)\bigr),$$

where $\eta_t$ is the step size at time $t$.

We assume that the initialization of the network is not a critical point and induces a risk no larger than the risk of the trivial linear predictor 0.

Assumption 1.2. The initialization $W(0)$ satisfies $\nabla R(W(0)) \ne 0$ and $R(W(0)) \le R(0) = \ell(0)$.

It is natural to require that the initialization is not a critical point, since otherwise gradient flow/descent will never make progress. The requirement $R(W(0)) \le R(0)$ can be easily satisfied, for example, by making $W_1(0) = 0$ and $W_L(0) \cdots W_2(0) \ne 0$. On the other hand, if $R(W(0)) > R(0)$, gradient flow/descent may never minimize the risk to 0. Proofs of those claims are given in Appendix A.

2 Results for gradient flow

In this section, we consider gradient flow. Although impractical when compared with gradient descent, gradient flow can simplify the analysis and highlight proof ideas. For convenience, we usually write $W$, $W_k$, and $w_{\mathrm{prod}}$, but they all change with (the continuous time) $t$. Only proof sketches are given here; detailed proofs are deferred to Appendix B.

2.1 Risk convergence

One key property of gradient flow is that it never increases the risk:

$$\frac{dR(W)}{dt} = \left\langle \nabla R(W), \frac{dW}{dt} \right\rangle = -\|\nabla R(W)\|^2 = -\sum_{k=1}^{L} \left\| \frac{\partial R}{\partial W_k} \right\|_F^2 \le 0. \tag{2.1}$$
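The monotonicity of the risk along gradient flow can be observed numerically. The following is a minimal sketch, not the authors' code: it approximates gradient flow by gradient descent with a small constant step on a 3-layer linear network with the logistic loss. The data generator, layer widths, step size `eta`, and the helper name `risk_and_grads` are all assumptions made for illustration; the layer gradients follow the chain rule for $R(W) = R(w_{\mathrm{prod}})$, and the initialization uses $W_1(0) = 0$ as suggested after Assumption 1.2.

```python
import numpy as np

def risk_and_grads(Ws, Z):
    """Logistic risk of the linear network Ws = [W_1, ..., W_L] and its layer-wise gradients."""
    prefix = [np.eye(Z.shape[1])]                  # prefix[k] = W_k ... W_1
    for W in Ws:
        prefix.append(W @ prefix[-1])
    suffix = [np.eye(1)]                           # built from the top layer down
    for W in reversed(Ws):
        suffix.append(suffix[-1] @ W)
    suffix = suffix[::-1]                          # suffix[k] = W_L ... W_{k+1}
    margins = Z @ prefix[-1].ravel()               # <w_prod, z_i>
    risk = np.logaddexp(0.0, -margins).mean()      # mean of ln(1 + exp(-margin))
    lprime = -np.exp(-np.logaddexp(0.0, margins))  # l'(m) = -1 / (1 + exp(m))
    g = (lprime[:, None] * Z).mean(axis=0)         # gradient of R at w_prod, shape (d,)
    grads = [suffix[k + 1].T @ g[None, :] @ prefix[k].T for k in range(len(Ws))]
    return risk, grads

# Hypothetical separable data: rows of Z are z_i = y_i x_i, separated by e_1.
rng = np.random.default_rng(0)
n, d = 40, 5
Z = rng.normal(size=(n, d))
Z[:, 0] = np.abs(Z[:, 0]) + 0.5                    # <e_1, z_i> > 0 for every i
Z /= np.linalg.norm(Z, axis=1).max()               # enforce ||z_i|| <= 1

# W_1(0) = 0 and W_3(0) W_2(0) != 0, so R(W(0)) = R(0) = ln 2.
Ws = [np.zeros((4, d)), 0.3 * rng.normal(size=(4, 4)), 0.3 * rng.normal(size=(1, 4))]

eta, risks = 1e-2, []                              # small step as a stand-in for gradient flow
for _ in range(5000):
    risk, grads = risk_and_grads(Ws, Z)
    risks.append(risk)
    Ws = [W - eta * G for W, G in zip(Ws, grads)]
risks = np.array(risks)
print("initial risk:", risks[0], "final risk:", risks[-1])
print("largest one-step increase:", np.diff(risks).max())
```

With a step this small the recorded risks should form a non-increasing sequence, which is the discrete counterpart of eq. (2.1); larger steps may break this, which is the issue Section 3 addresses.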
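We now state the main result: under Assumptions 1.1 and 1.2, gradient flow minimizes the risk, $W_k$ and $w_{\mathrm{prod}}$ all go to infinity, and the alignment phenomenon occurs.

Theorem 2.2. Under Assumptions 1.1 and 1.2, gradient flow iterates satisfy the following properties:

• $\lim_{t \to \infty} R(W) = 0$.

• For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k\|_F = \infty$.

• For any $1 \le k \le L$, letting $(u_k, v_k)$ denote the first left and right singular vectors of $W_k$,
$$\lim_{t \to \infty} \left\| \frac{W_k}{\|W_k\|_F} - u_k v_k^\top \right\|_F = 0.$$
Moreover, for any $1 \le k < L$, $\lim_{t \to \infty} |\langle v_{k+1}, u_k \rangle| = 1$. As a result,
$$\lim_{t \to \infty} \left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^{L} \|W_k\|_F}, v_1 \right\rangle \right| = 1,$$
and thus $\lim_{t \to \infty} \|w_{\mathrm{prod}}\| = \infty$.

Theorem 2.2 is proved using two lemmas, which may be of independent interest. To show the ideas, let us first introduce a little more notation. Recall that $R(W)$ denotes the empirical risk induced by the deep linear network $W$. Abusing the notation a little, for any linear predictor $w \in \mathbb{R}^d$, we also use $R(w)$ to denote the risk induced by $w$. With this notation, $R(W) = R(w_{\mathrm{prod}})$, while

$$\nabla R(w_{\mathrm{prod}}) = \frac{1}{n} \sum_{i=1}^n \ell'\bigl(\langle w_{\mathrm{prod}}, z_i \rangle\bigr) z_i = \frac{1}{n} \sum_{i=1}^n \ell'\bigl(W_L \cdots W_1 z_i\bigr) z_i$$

is in $\mathbb{R}^d$ and different from $\nabla R(W)$, which has $\sum_{k=1}^{L} d_k d_{k-1}$ entries, as given below:

$$\frac{\partial R}{\partial W_k} = W_{k+1}^\top \cdots W_L^\top \, \nabla R(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_{k-1}^\top.$$

Furthermore, for any $R > 0$, let

$$B(R) = \Bigl\{ W \Bigm| \max_{1 \le k \le L} \|W_k\|_F \le R \Bigr\}.$$

The first lemma shows that for any $R > 0$, the time spent by gradient flow in $B(R)$ is finite.

Lemma 2.3. Under Assumptions 1.1 and 1.2, for any $R > 0$, there exists a constant $\epsilon(R) > 0$, such that for any $t \ge 1$ and any $W \in B(R)$, $\|\partial R / \partial W_1\|_F \ge \epsilon(R)$. As a result, gradient flow spends a finite amount of time in $B(R)$ for any $R > 0$, and $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded.

Here is a proof sketch. If all $\|W_k\|_F$ are bounded, then $\|\nabla R(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant. Therefore, if $\|\partial R / \partial W_1\|_F = \|W_L \cdots W_2\| \, \|\nabla R(w_{\mathrm{prod}})\|$ can be arbitrarily small, then $\|W_L \cdots W_2\|$ and $\|w_{\mathrm{prod}}\|$ can also be arbitrarily small, and thus $R(W)$ can be arbitrarily close to $R(0)$. This cannot happen after $t = 1$, otherwise it will contradict Assumption 1.2 and eq. (2.1).

To proceed, we need the following properties of linear networks from prior work (Arora et al., 2018; Du et al., 2018). For any time $t \ge 0$ and any $1 \le k < L$,

$$W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{2.4}$$

To see this, just notice that

$$W_{k+1}^\top \frac{\partial R}{\partial W_{k+1}} = W_{k+1}^\top \cdots W_L^\top \, \nabla R(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_k^\top = \frac{\partial R}{\partial W_k} W_k^\top.$$

Taking the trace on both sides of eq. (2.4), we have

$$\|W_{k+1}(t)\|_F^2 - \|W_{k+1}(0)\|_F^2 = \|W_k(t)\|_F^2 - \|W_k(0)\|_F^2. \tag{2.5}$$

In other words, the difference between the squares of Frobenius norms of any two layers remains a constant. Together with Lemma 2.3, it implies that all $\|W_k\|_F$ are unbounded.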
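The conservation law in eq. (2.4) lends itself to a quick numerical check. The sketch below is illustrative only and makes several assumptions: a two-layer network, the logistic loss, hypothetical random data, and gradient descent with a small step as a stand-in for gradient flow. For discrete steps the first-order terms cancel exactly and only a second-order term in the step size remains, so the drift of $W_2^\top W_2 - W_1 W_1^\top$ over a fixed time horizon should shrink roughly in proportion to the step size.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 30, 4, 3
Z = rng.normal(size=(n, d))
Z[:, 0] = np.abs(Z[:, 0]) + 0.5            # hypothetical data, separable by e_1
Z /= np.linalg.norm(Z, axis=1).max()       # ||z_i|| <= 1

def layer_grads(W1, W2):
    """Gradients of the logistic risk of x -> W2 W1 x with respect to W1 and W2."""
    w = (W2 @ W1).ravel()
    lprime = -np.exp(-np.logaddexp(0.0, Z @ w))       # l'(m) = -1 / (1 + exp(m))
    g = (lprime[:, None] * Z).mean(axis=0)[None, :]   # gradient at w_prod as a 1 x d row
    return W2.T @ g, g @ W1.T                         # dR/dW1 (h x d), dR/dW2 (1 x h)

def balance(W1, W2):
    return W2.T @ W2 - W1 @ W1.T           # the matrix kept constant by eq. (2.4)

def drift(eta, total_time=2.0):
    """Change of the balance matrix after simulating `total_time` units with step `eta`."""
    local = np.random.default_rng(2)       # same initialization for every step size
    W1, W2 = 0.5 * local.normal(size=(h, d)), 0.5 * local.normal(size=(1, h))
    start = balance(W1, W2)
    for _ in range(int(round(total_time / eta))):
        G1, G2 = layer_grads(W1, W2)
        # identity behind eq. (2.4): W2^T dR/dW2 = dR/dW1 W1^T
        assert np.allclose(W2.T @ G2, G1 @ W1.T)
        W1, W2 = W1 - eta * G1, W2 - eta * G2
    return np.linalg.norm(balance(W1, W2) - start)

coarse, fine = drift(1e-2), drift(1e-3)
print("drift with eta=1e-2:", coarse)
print("drift with eta=1e-3:", fine)
```

Taking the trace of the balance matrix gives the scalar version in eq. (2.5), so the same experiment also illustrates that squared Frobenius norms of adjacent layers differ by an (almost) constant amount.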
However, even if $\|W_k\|_F$ are large, it does not necessarily follow that $\|w_{\mathrm{prod}}\|$ is also large. Lemma 2.6 shows that this is indeed true: for gradient flow, as $\|W_k\|_F$ become larger, adjacent layers also get more aligned to each other, which ensures that their product also has a large norm.

For $1 \le k \le L$, let $\sigma_k$, $u_k$, and $v_k$ denote the first singular value (the 2-norm), the first left singular vector, and the first right singular vector of $W_k$, respectively. Furthermore, define

$$D := \Bigl( \max_{1 \le k \le L} \|W_k(0)\|_F^2 \Bigr) - \|W_L(0)\|_F^2 + \sum_{k=1}^{L-1} \bigl\| W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0) \bigr\|_2,$$

which depends only on the initialization. If for all $1 \le k < L$ it holds that $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then $D = 0$.

Lemma 2.6. The gradient flow iterates satisfy the following properties:

• For any $1 \le k \le L$, $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$.

• For any $1 \le k < L$, $\langle v_{k+1}, u_k \rangle^2 \ge 1 - \dfrac{D + \|W_{k+1}(0)\|_2^2 + \|W_k(0)\|_2^2}{\|W_{k+1}\|_2^2}$.

• Suppose $\max_{1 \le k \le L} \|W_k\|_F \to \infty$, then $\left| \left\langle w_{\mathrm{prod}} / \prod_{k=1}^{L} \|W_k\|_F, \, v_1 \right\rangle \right| \to 1$.

The proof is based on eq. (2.4) and eq. (2.5). If $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then eq. (2.4) gives that $W_{k+1}$ and $W_k$ have the same singular values, and $W_{k+1}$'s right singular vectors and $W_k$'s left singular vectors are the same. If it is true for any two adjacent layers, since $W_L$ is a row vector, all layers have rank 1. With general initialization, we have similar results when $\|W_k\|_F$ is large enough so that the initialization is negligible. Careful calculations give the exact results in Lemma 2.6.

An interesting point is that the implicit regularization result in Lemma 2.6 helps establish risk convergence in Theorem 2.2. Specifically, by Lemma 2.6, if all layers have large norms, $\|W_L \cdots W_2\|$ will be large. If the risk is not minimized to 0, $\|\nabla R(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant, and thus $\|\partial R / \partial W_1\|_F = \|W_L \cdots W_2\| \, \|\nabla R(w_{\mathrm{prod}})\|$ will be large. Invoking eq. (2.1), Lemma 2.3 and eq. (2.5) gives a contradiction. Since the risk has no finite optimum, $\|W_k\|_F \to \infty$.

2.2 Convergence to the maximum margin solution

Here we focus on the exponential loss $\ell_{\exp}(x) = e^{-x}$ and the logistic loss $\ell_{\log}(x) = \ln(1 + e^{-x})$. In addition to risk convergence, these two losses also enable gradient descent to find the maximum margin solution.

To get such a strong convergence, we need one more assumption on the data set. Recall that $\gamma = \max_{\|u\|=1} \min_{1 \le i \le n} \langle u, z_i \rangle > 0$ denotes the maximum margin, and $\bar u$ denotes the unique maximum margin predictor which attains this margin $\gamma$. Those data points $z_i$ for which $\langle \bar u, z_i \rangle = \gamma$ are called support vectors.

Assumption 2.7. The support vectors span the whole space $\mathbb{R}^d$.
Assumption 2.7 appears in prior work (Soudry et al., 2017), and can be satisfied in many cases: for example, it is almost surely true if the number of support vectors is larger than or equal to $d$ and the data set is sampled from some density w.r.t. the Lebesgue measure. It can also be relaxed to the situation that the support vectors and the whole data set span the same space; in this case $\nabla R(w_{\mathrm{prod}})$ will never leave this space, and we can always restrict our attention to this space.

With Assumption 2.7, we can state the main theorem.

Theorem 2.8. Under Assumptions 1.2 and 2.7, for almost all data and for losses $\ell_{\exp}$ and $\ell_{\log}$, $\lim_{t \to \infty} |\langle v_1, \bar u \rangle| = 1$, where $v_1$ is the first right singular vector of $W_1$. As a result, $\lim_{t \to \infty} w_{\mathrm{prod}} / \prod_{k=1}^{L} \|W_k\|_F = \bar u$.

Theorem 2.8 relies on two structural lemmas. The first one is based on a similar almost-all argument due to Soudry et al. (2017, Lemma 8). Let $S \subseteq \{1, \ldots, n\}$ denote the set of indices of support vectors.

Lemma 2.9. Under Assumption 2.7, if the data set is sampled from some density w.r.t. the Lebesgue measure, then with probability 1,

$$\alpha := \min_{\|\xi\| = 1, \, \xi \perp \bar u} \; \max_{i \in S} \, \langle \xi, z_i \rangle > 0.$$

Let $\bar u^\perp$ denote the orthogonal complement of $\mathrm{span}(\bar u)$, and let $\Pi_\perp$ denote the projection onto $\bar u^\perp$.

Lemma 2.10. Under Assumption 2.7, for almost all data, losses $\ell_{\exp}$ and $\ell_{\log}$, and any $w \in \mathbb{R}^d$, if $\langle w, \bar u \rangle \ge 0$ and $\|\Pi_\perp w\|$ is larger than $(1 + \ln n)/\alpha$ for $\ell_{\exp}$ or $2n/(e\alpha)$ for $\ell_{\log}$, then $\langle \Pi_\perp w, \nabla R(w) \rangle \ge 0$.

With Lemma 2.9 and Lemma 2.10 in hand, we can prove Theorem 2.8. Let $\Pi_\perp W_1$ denote the projection of rows of $W_1$ onto $\bar u^\perp$. Notice that

$$\Pi_\perp w_{\mathrm{prod}} = \bigl( W_L \cdots W_2 (\Pi_\perp W_1) \bigr)^\top \quad \text{and} \quad \frac{d \|\Pi_\perp W_1\|_F^2}{dt} = -2 \bigl\langle \Pi_\perp w_{\mathrm{prod}}, \nabla R(w_{\mathrm{prod}}) \bigr\rangle.$$

If $\|\Pi_\perp W_1\|_F$ is large compared with $\|W_1\|_F$, since layers become aligned, $\|\Pi_\perp w_{\mathrm{prod}}\|$ will also be large, and then Lemma 2.10 implies that $\|\Pi_\perp W_1\|_F$ will not increase. At the same time, $\|W_1\|_F \to \infty$, and thus for large enough $t$, $\|\Pi_\perp W_1\|_F$ must be very small compared with $\|W_1\|_F$. Many details need to be handled to make this intuition exact; the proof is given in Appendix B.
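The alignment and margin statements of this section can be observed on a toy problem. The sketch below is only an illustration under stated assumptions: a hand-built data set in $\mathbb{R}^2$ whose two support vectors $(0.6, \pm 0.4)$ make the maximum margin direction equal to $e_1$ by symmetry, a 3-layer linear network with hypothetical widths, the logistic loss, and plain gradient descent with a small constant step standing in for gradient flow. The constant $D$ is computed from the initialization using the spectral norm inside the sum. Because the loop is discrete, the check on $\|W_k\|_F^2 - \|W_k\|_2^2$ allows a slack of $R(W(0))$ on top of $D$, which is an assumption of this sketch and not a statement of Lemma 2.6.

```python
import numpy as np

def risk_and_grads(Ws, Z):
    """Logistic risk of the linear network Ws = [W_1, ..., W_L] and its layer-wise gradients."""
    prefix = [np.eye(Z.shape[1])]                  # prefix[k] = W_k ... W_1
    for W in Ws:
        prefix.append(W @ prefix[-1])
    suffix = [np.eye(1)]
    for W in reversed(Ws):
        suffix.append(suffix[-1] @ W)
    suffix = suffix[::-1]                          # suffix[k] = W_L ... W_{k+1}
    margins = Z @ prefix[-1].ravel()
    risk = np.logaddexp(0.0, -margins).mean()
    lprime = -np.exp(-np.logaddexp(0.0, margins))  # l'(m) = -1 / (1 + exp(m))
    g = (lprime[:, None] * Z).mean(axis=0)
    grads = [suffix[k + 1].T @ g[None, :] @ prefix[k].T for k in range(len(Ws))]
    return risk, grads

def spec(M):
    return np.linalg.norm(M, 2)                    # spectral norm

def frob(M):
    return np.linalg.norm(M)                       # Frobenius norm

# Rows are z_i = y_i x_i; the first two rows are the support vectors (margin 0.6).
Z = np.array([[0.6, 0.4], [0.6, -0.4], [0.9, 0.1], [0.8, -0.3]])
u_bar = np.array([1.0, 0.0])                       # maximum margin direction

rng = np.random.default_rng(0)
Ws = [np.zeros((3, 2)), 0.1 * rng.normal(size=(3, 3)), 0.1 * rng.normal(size=(1, 3))]

# The constant D of Lemma 2.6, evaluated at the initialization.
D = max(frob(W) ** 2 for W in Ws) - frob(Ws[-1]) ** 2 + sum(
    spec(Ws[k] @ Ws[k].T - Ws[k + 1].T @ Ws[k + 1]) for k in range(len(Ws) - 1))

risk0, _ = risk_and_grads(Ws, Z)
eta = 0.05
for _ in range(20000):
    _, grads = risk_and_grads(Ws, Z)
    Ws = [W - eta * G for W, G in zip(Ws, grads)]
risk, _ = risk_and_grads(Ws, Z)

ratios = [spec(W) / frob(W) for W in Ws]                 # should approach 1
gaps = [frob(W) ** 2 - spec(W) ** 2 for W in Ws]         # bounded by D for gradient flow
adjacent = []
for k in range(len(Ws) - 1):
    u_k = np.linalg.svd(Ws[k])[0][:, 0]                  # first left singular vector of W_k
    v_next = np.linalg.svd(Ws[k + 1])[2][0]              # first right singular vector of W_{k+1}
    adjacent.append(abs(u_k @ v_next))
w_prod = np.linalg.multi_dot(Ws[::-1]).ravel()
cosine = w_prod @ u_bar / np.linalg.norm(w_prod)

print("risk:", risk0, "->", risk)
print("spectral/Frobenius ratios:", ratios)
print("adjacent alignments:", adjacent)
print("cosine between w_prod and u_bar:", cosine)
```

Convergence in direction is slow (the norms grow only logarithmically in time for these losses), so the cosine is expected to be close to, but not exactly, 1 after a finite number of steps.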

3 Results for gradient descent

One key property of gradient flow which is used in the previous proofs is that it never increases the risk, which is not necessarily true for gradient descent. However, for smooth losses (i.e., with Lipschitz continuous derivatives), we can design some decaying step sizes, with which gradient descent never increases the risk, and basically the same results hold as in the gradient flow case. Deferred proofs are given in Appendix C.

We make the following additional assumption on the loss, which is satisfied by the logistic loss $\ell_{\log}$.

Assumption 3.1. $\ell'$ is $\beta$-Lipschitz (i.e., $\ell$ is $\beta$-smooth), and $|\ell'| \le G$ (i.e., $\ell$ is $G$-Lipschitz).

Under Assumption 3.1, the risk is also a smooth function of $W$, if all layers are bounded.

Lemma 3.2. Suppose $\ell$ is $\beta$-smooth. For any $R \ge 1$, the risk $R$ is a $\beta(R)$-smooth function on the set $B(R) = \{ W \mid \|W_k\|_F \le R, \ 1 \le k \le L \}$, where $\beta(R) = L^2 R^{2L-2} (\beta + G)$.

Smoothness ensures that for any $W, V \in B(R)$, $R(W) - R(V) \le \langle \nabla R(V), W - V \rangle + \beta(R) \|W - V\|^2 / 2$ (see Bubeck et al., 2015, Lemma 3.4). In particular, if we choose some $R$ and set a constant step size $\eta_t = 1/\beta(R)$, then as long as $W(t+1)$ and $W(t)$ are both in $B(R)$,

$$R\bigl(W(t+1)\bigr) - R\bigl(W(t)\bigr) \le \bigl\langle \nabla R(W(t)), -\eta_t \nabla R(W(t)) \bigr\rangle + \frac{\beta(R) \eta_t^2}{2} \bigl\|\nabla R(W(t))\bigr\|^2 = -\frac{1}{2\beta(R)} \bigl\|\nabla R(W(t))\bigr\|^2 = -\frac{\eta_t}{2} \bigl\|\nabla R(W(t))\bigr\|^2. \tag{3.3}$$

In other words, the risk does not increase at this step. However, similar to gradient flow, the gradient descent iterate will eventually escape $B(R)$, which may increase the risk.

Lemma 3.4. Under Assumptions 1.1 and 1.2, suppose gradient descent is run with a constant step size $1/\beta(R)$. Then there exists a time $t$ when $W(t) \notin B(R)$, in other words, $\max_{1 \le k \le L} \|W_k(t)\|_F > R$.

Fortunately, this issue can be handled by adaptively increasing $R$ and correspondingly decreasing the step sizes, formalized in the following assumption.

Assumption 3.5. The step size $\eta_t = \min\{1/\beta(R_t), 1\}$, where $R_t$ satisfies $W(t) \in B(R_t - 1)$, and if $W(t+1) \in B(R_t - 1)$, $R_{t+1} = R_t$.
Assumption 3.5 can be satisfied by a line search, which ensures that the gradient descent update is not too aggressive and the boundary $R$ is increased properly. With the additional Assumptions 3.1 and 3.5, exactly the same theorems can be proved for gradient descent. We restate them briefly here.

Theorem 3.6. Under Assumptions 1.1, 1.2 and 3.5, gradient descent satisfies

• $\lim_{t \to \infty} R(W(t)) = 0$.

• For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k(t)\|_F = \infty$.

• $\lim_{t \to \infty} \left| \left\langle w_{\mathrm{prod}}(t) / \prod_{k=1}^{L} \|W_k(t)\|_F, \, v_1(t) \right\rangle \right| = 1$, where $v_1(t)$ is the first right singular vector of $W_1(t)$.

Theorem 3.7. Under Assumptions 1.2, 3.5 and 2.7, for the logistic loss $\ell_{\log}$ and almost all data, $\lim_{t \to \infty} |\langle v_1(t), \bar u \rangle| = 1$, and $\lim_{t \to \infty} w_{\mathrm{prod}}(t) / \prod_{k=1}^{L} \|W_k(t)\|_F = \bar u$.

Proofs of Theorems 3.6 and 3.7 are given in Appendix C, and are basically the same as the gradient flow proofs. The key difference is that an error of $\sum_{t=0}^{\infty} \eta_t^2 \|\nabla R(W(t))\|^2$ will be introduced in many parts of the proof. However, it is bounded in light of eq. (3.3):

$$\sum_{t=0}^{\infty} \eta_t^2 \bigl\|\nabla R(W(t))\bigr\|^2 \le \sum_{t=0}^{\infty} \eta_t \bigl\|\nabla R(W(t))\bigr\|^2 \le 2 R\bigl(W(0)\bigr).$$

Since all weight matrices go to infinity, such a bounded error does not matter asymptotically, and thus proofs still go through.
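One way to realize a step-size schedule in the spirit of Assumption 3.5 is sketched below for a two-layer network with the logistic loss, for which $\beta = 1/4$ and $G = 1$. This is a minimal illustration, not the procedure used by the authors: the data, the widths, the smoothness constant `beta_R` (taken as $L^2 R^{2L-2}(\beta + G)$), and in particular the rule that enlarges the radius to the current maximum norm plus one are assumptions of this sketch; the text only requires that the iterate stays in $B(R_t - 1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, h = 30, 4, 3
Z = rng.normal(size=(n, d))
Z[:, 0] = np.abs(Z[:, 0]) + 0.5            # hypothetical data, separable by e_1
Z /= np.linalg.norm(Z, axis=1).max()       # ||z_i|| <= 1

BETA, G_LIP, L = 0.25, 1.0, 2              # logistic loss: l' is 1/4-Lipschitz, |l'| <= 1

def beta_R(R):
    """Assumed smoothness constant of the risk on B(R), for R >= 1."""
    return L ** 2 * R ** (2 * L - 2) * (BETA + G_LIP)

def risk_and_grads(W1, W2):
    m = Z @ (W2 @ W1).ravel()
    lprime = -np.exp(-np.logaddexp(0.0, m))            # l'(m) = -1 / (1 + exp(m))
    g = (lprime[:, None] * Z).mean(axis=0)[None, :]    # gradient at w_prod, 1 x d
    return np.logaddexp(0.0, -m).mean(), W2.T @ g, g @ W1.T

def max_norm(W1, W2):
    return max(np.linalg.norm(W1), np.linalg.norm(W2))

W1, W2 = np.zeros((h, d)), rng.normal(size=(1, h))     # W_1(0) = 0, W_2(0) != 0
R_t = max_norm(W1, W2) + 1.0                           # so that W(0) lies in B(R_0 - 1)
risks, radii = [], []
for _ in range(3000):
    risk, G1, G2 = risk_and_grads(W1, W2)
    risks.append(risk)
    radii.append(R_t)
    eta = min(1.0 / beta_R(R_t), 1.0)                  # step size of Assumption 3.5
    W1, W2 = W1 - eta * G1, W2 - eta * G2
    if max_norm(W1, W2) > R_t - 1.0:
        R_t = max_norm(W1, W2) + 1.0                   # hypothetical enlargement rule
risks, radii = np.array(risks), np.array(radii)
print("risk:", risks[0], "->", risks[-1])
print("radius:", radii[0], "->", radii[-1])
```

Since each step moves every layer by at most 1 in Frobenius norm, consecutive iterates stay inside $B(R_t)$ and the descent inequality (3.3) applies at every step, which is why the recorded risks are expected to be non-increasing.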

[Figure 2: Risk and alignment of dense layers (the ratio $\|W_i\|_2 / \|W_i\|_F$) of (nonlinear!) AlexNet on CIFAR-10. (a) Default initialization. (b) Initialization with the same Frobenius norm. Figure 2a uses default PyTorch initialization, while Figure 2b forces initial Frobenius norms to be equal amongst dense layers.]

4 Summary and future directions

This paper rigorously proves that, for deep linear networks on linearly separable data, gradient flow and gradient descent minimize the risk to 0, align adjacent weight matrices, and align the first right singular vector of the first layer to the maximum margin solution determined by the data. There are many potential future directions; a few are as follows.

Convergence rate. This paper only proves asymptotic convergence with no convergence rate. A convergence rate would allow the algorithm to be compared to other methods which also globally optimize this objective, would also suggest ways to improve step sizes and initialization, and ideally even exhibit a sensitivity to the network architecture and suggest how it could be improved.

Nonseparable data and nonlinear networks. Real-world data is generally not linearly separable, but nonlinear deep networks can reliably decrease the risk to 0, even with random labels (Zhang et al., 2017). This seems to suggest that a nonlinear notion of separability is at play; is there some way to adapt the present analysis?

The present analysis is crucially tied to the alignment of weight matrices: alignment and risk are analyzed simultaneously. Motivated by this, consider a preliminary experiment, presented in Figure 2, where stochastic gradient descent was used to minimize the risk of a standard AlexNet on CIFAR-10 (Krizhevsky et al., 2012; Krizhevsky and Hinton, 2009).
Even though there are ReLUs, max-pooling layers, and convolutional layers, the alignment phenomenon is occurring in a reduced form on the dense layers (the last three layers of the network). Specifically, despite these weight matrices having shape $(1024, 4096)$, $(4096, 4096)$, and $(4096, 10)$, the key alignment ratios $\|W_i\|_2 / \|W_i\|_F$ are much larger than their respective lower bounds $(1024^{-1/2}, 4096^{-1/2}, 10^{-1/2})$. Two initializations were tried: default PyTorch initialization, and a Gaussian initialization forcing all initial Frobenius norms to be just 4, which is suggested by the norm preservation property in the analysis and removes noise in the weights.

Acknowledgements

The authors are grateful for support from the NSF under grant IIS. This grant allowed them to focus on research, and when combined with a gracious NVIDIA GPU grant, led to the creation of their beloved GPU machine DutchCrunch.

References

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint, 2018.

Peter Bartlett, Dylan Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. NIPS, 2017.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint, 2018.

Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint, 2018.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint, 2018.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Thomas Laurent and James von Brecht. Deep linear neural networks with arbitrary loss: All local minima are global. arXiv preprint, 2017.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint, 2016.

Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint, 2017.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint, 2013.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.

Yi Zhou and Yingbin Liang. Critical points of linear neural networks: Analytical forms and landscape properties. 2018.

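A Regarding Assumption 1.2

Suppose $W_1(0) = 0$ while $W_L(0) \cdots W_2(0) \ne 0$. First of all, $W_L(0) \cdots W_1(0) = 0$ and thus $R(W(0)) = R(0)$. Moreover,

$$\bigl\langle \nabla R(w_{\mathrm{prod}}(0)), \bar u \bigr\rangle = \frac{1}{n} \sum_{i=1}^n \ell'(0) \langle z_i, \bar u \rangle \le \ell'(0) \gamma < 0,$$

which implies $\nabla R(w_{\mathrm{prod}}(0)) \ne 0$ and $\partial R / \partial W_1 = \bigl(W_L(0) \cdots W_2(0)\bigr)^\top \nabla R(w_{\mathrm{prod}}(0))^\top \ne 0$.

On the other hand, if $R(W(0)) > R(0)$, gradient flow/descent may never minimize the risk to 0. For example, suppose the network has two layers, and both the input and output have dimension 1; the network just computes the dot product of two vectors $w_1$ and $w_2$. Consider minimizing $R(w_1, w_2) = \exp(-\langle w_1, w_2 \rangle)$. If $w_1(0) = -w_2(0) \ne 0$, then $R(w_1(0), w_2(0)) = \exp(\|w_1(0)\|^2) > \exp(0)$. It is easy to verify that for any $t$, $w_1(t) = -w_2(t)$, and $R(w_1(t), w_2(t)) \ge \exp(0) > 0$.

B Omitted proofs from Section 2

Proof of Lemma 2.3. Fix an arbitrary $R > 0$. If the claim is not true, then for any $\epsilon > 0$, there exists some $t \ge 1$ such that $\|W_k\|_F \le R$ for all $k$ while $\|\partial R / \partial W_1\|_F \le \epsilon$, which means

$$\left\| \frac{\partial R}{\partial W_1} \right\|_F = \bigl\| W_2^\top \cdots W_L^\top \nabla R(w_{\mathrm{prod}})^\top \bigr\|_F = \|W_L \cdots W_2\| \, \|\nabla R(w_{\mathrm{prod}})\| \le \epsilon.$$

Since $\|w_{\mathrm{prod}}\| \le R^L$, we have

$$\bigl\langle \nabla R(w_{\mathrm{prod}}), \bar u \bigr\rangle = \frac{1}{n} \sum_{i=1}^n \ell'\bigl(\langle w_{\mathrm{prod}}, z_i \rangle\bigr) \langle z_i, \bar u \rangle \le \frac{1}{n} \sum_{i=1}^n \ell'\bigl(\langle w_{\mathrm{prod}}, z_i \rangle\bigr) \gamma \le -M\gamma,$$

where $-M = \max_{-R^L \le x \le R^L} \ell'(x)$. Since $\ell'$ is continuous and the domain is bounded, the maximum is attained and negative, and thus $M > 0$. Therefore $\|\nabla R(w_{\mathrm{prod}})\| \ge M\gamma$, and thus $\|W_L \cdots W_2\| \le \epsilon / (M\gamma)$. Since $\|W_1\|_F \le R$, we further have $\|w_{\mathrm{prod}}\| \le \epsilon R / (M\gamma)$. In other words, after $t = 1$, $\|w_{\mathrm{prod}}\|$ may be arbitrarily small, which implies $R(w_{\mathrm{prod}})$ can be arbitrarily close to $R(0)$.

On the other hand, by Assumption 1.2, $dR(W)/dt = -\|\nabla R(W)\|^2 < 0$ at $t = 0$. This implies that $R(W(1)) < R(W(0))$, and for any $t \ge 1$, $R(W(t)) \le R(W(1)) < R(W(0)) \le R(0)$, which is a contradiction.

Since the risk is always positive, we have

$$R\bigl(W(0)\bigr) \ge \int_{0}^{\infty} \sum_{k=1}^{L} \left\| \frac{\partial R}{\partial W_k} \right\|_F^2 dt \ge \int_{0}^{\infty} \left\| \frac{\partial R}{\partial W_1} \right\|_F^2 dt \ge \int_{0}^{\infty} \left\| \frac{\partial R}{\partial W_1} \right\|_F^2 \mathbb{1}\Bigl[ \max_{1 \le k \le L} \|W_k\|_F \le R \Bigr] dt$$
$$\ge \int_{1}^{\infty} \left\| \frac{\partial R}{\partial W_1} \right\|_F^2 \mathbb{1}\Bigl[ \max_{1 \le k \le L} \|W_k\|_F \le R \Bigr] dt \ge \epsilon(R)^2 \int_{1}^{\infty} \mathbb{1}\Bigl[ \max_{1 \le k \le L} \|W_k\|_F \le R \Bigr] dt,$$

which implies gradient flow only spends a finite amount of time in $\{ W \mid \max_{1 \le k \le L} \|W_k\|_F \le R \}$. This directly implies that $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded.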
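The two-layer counterexample of Appendix A can be simulated directly. The following sketch uses a hypothetical starting vector and gradient descent with a small step as a stand-in for gradient flow on $R(w_1, w_2) = \exp(-\langle w_1, w_2 \rangle)$, started at $w_1(0) = -w_2(0) \ne 0$. The iterates should keep the symmetry $w_1(t) = -w_2(t)$ while shrinking toward the origin, so the risk decreases toward $\exp(0) = 1$ but never goes below it.

```python
import numpy as np

w1 = np.array([0.8, -0.3, 0.5])        # hypothetical starting point
w2 = -w1.copy()                        # w_1(0) = -w_2(0) != 0, so R(W(0)) > R(0) = 1
eta = 1e-3                             # small step as a stand-in for gradient flow
risks = []
for _ in range(20000):
    r = np.exp(-(w1 @ w2))             # R(w1, w2) = exp(-<w1, w2>)
    risks.append(r)
    g1, g2 = -r * w2, -r * w1          # gradients with respect to w1 and w2
    w1, w2 = w1 - eta * g1, w2 - eta * g2
risks = np.array(risks)
print("risk:", risks[0], "->", risks[-1])
print("symmetry preserved:", np.allclose(w1, -w2))
print("final norm of w1:", np.linalg.norm(w1))
```

This illustrates why Assumption 1.2 requires $R(W(0)) \le R(0)$: from this initialization the trajectory converges to the saddle point at the origin instead of driving the risk to 0.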

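Proof of Lemma 2.6. The first claim is true for $k = L$ since $W_L$ is a row vector. For any $1 \le k < L$, recall the following relation (Arora et al., 2018; Du et al., 2018):

$$W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{B.1}$$

Let $A_{k,k+1} = W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0)$. By eq. (B.1) and the definition of singular vectors and singular values, we have

$$\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} = v_{k+1}^\top W_{k+1}^\top W_{k+1} v_{k+1} + v_{k+1}^\top A_{k,k+1} v_{k+1} = \sigma_{k+1}^2 + v_{k+1}^\top A_{k,k+1} v_{k+1} \ge \sigma_{k+1}^2 - \|A_{k,k+1}\|_2. \tag{B.2}$$

Moreover, by taking the trace on both sides of eq. (B.1), we have

$$\|W_k\|_F^2 = \mathrm{tr}\bigl(W_k W_k^\top\bigr) = \mathrm{tr}\bigl(W_{k+1}^\top W_{k+1}\bigr) + \mathrm{tr}\bigl(W_k(0) W_k^\top(0)\bigr) - \mathrm{tr}\bigl(W_{k+1}^\top(0) W_{k+1}(0)\bigr) = \|W_{k+1}\|_F^2 + \|W_k(0)\|_F^2 - \|W_{k+1}(0)\|_F^2. \tag{B.3}$$

Summing eq. (B.2) and eq. (B.3) from $k$ to $L - 1$, and using $\|W_L\|_F = \|W_L\|_2$, we get

$$\|W_k\|_F^2 - \|W_k\|_2^2 \le \|W_k(0)\|_F^2 - \|W_L(0)\|_F^2 + \sum_{k'=k}^{L-1} \|A_{k',k'+1}\|_2 \le D. \tag{B.4}$$

Next we prove that singular vectors get aligned. Consider $u_k^\top W_{k+1}^\top W_{k+1} u_k$. On one hand, similarly to eq. (B.2),

$$u_k^\top W_{k+1}^\top W_{k+1} u_k = u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k + u_k^\top W_{k+1}^\top(0) W_{k+1}(0) u_k \ge u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k \ge \sigma_k^2 - \|W_k(0)\|_2^2. \tag{B.5}$$

On the other hand, it follows from the definition of singular vectors and eq. (B.4) that

$$u_k^\top W_{k+1}^\top W_{k+1} u_k = \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + u_k^\top \bigl( W_{k+1}^\top W_{k+1} - \sigma_{k+1}^2 v_{k+1} v_{k+1}^\top \bigr) u_k \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + \|W_{k+1}\|_F^2 - \|W_{k+1}\|_2^2 \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D. \tag{B.6}$$

Combining eq. (B.5) and eq. (B.6), we get

$$\sigma_k^2 \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D + \|W_k(0)\|_2^2. \tag{B.7}$$

Similar to eq. (B.5), we can get

$$\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} \ge \sigma_{k+1}^2 - \|W_{k+1}(0)\|_2^2.$$

Therefore

$$\frac{\sigma_k^2}{\sigma_{k+1}^2} \ge 1 - \frac{\|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}. \tag{B.8}$$

Combining eq. (B.7) and eq. (B.8), we finally get

$$\langle u_k, v_{k+1} \rangle^2 \ge 1 - \frac{D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}.$$

Regarding the last claim, first recall that since the difference between the squares of Frobenius norms of any two layers is a constant, $\max_{1 \le k \le L} \|W_k\|_F \to \infty$ implies $\|W_k\|_F \to \infty$ for any $k$. We further have the following.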

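• Since $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$, $\|W_k\|_2 \to \infty$ for any $k$, and $\bigl\| W_k / \|W_k\|_F - u_k v_k^\top \bigr\|_F \to 0$.

• Since $\|W_k\|_2 \to \infty$, $|\langle u_k, v_{k+1} \rangle| \to 1$.

As a result,

$$\left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^{L} \|W_k\|_F}, v_1 \right\rangle \right| = \left| \frac{W_L}{\|W_L\|_F} \cdots \frac{W_1}{\|W_1\|_F} \, v_1 \right|, \qquad \left| \frac{W_L}{\|W_L\|_F} \cdots \frac{W_1}{\|W_1\|_F} \, v_1 \right| - \bigl| u_L v_L^\top \cdots u_1 v_1^\top v_1 \bigr| \to 0, \qquad \bigl| u_L v_L^\top \cdots u_1 v_1^\top v_1 \bigr| \to 1.$$

Proof of Theorem 2.2. Suppose for some $\epsilon > 0$, $R(W) \ge \epsilon$ for any $t$. Then there exists some $1 \le j \le n$ such that $\ell(\langle w_{\mathrm{prod}}, z_j \rangle) \ge \epsilon$, and thus $\langle w_{\mathrm{prod}}, z_j \rangle \le \ell^{-1}(\epsilon)$. On the other hand, since $R(W) \le R(0) = \ell(0)$, $\ell(\langle w_{\mathrm{prod}}, z_j \rangle) \le n\ell(0)$, and thus $\langle w_{\mathrm{prod}}, z_j \rangle \ge \ell^{-1}(n\ell(0))$. Let $-M = \max_{\ell^{-1}(n\ell(0)) \le x \le \ell^{-1}(\epsilon/n)} \ell'(x) < 0$. We have for any $t$,

$$\bigl\langle \nabla R(w_{\mathrm{prod}}), \bar u \bigr\rangle = \frac{1}{n} \sum_{i=1}^n \ell'\bigl(\langle w_{\mathrm{prod}}, z_i \rangle\bigr) \langle z_i, \bar u \rangle \le \frac{1}{n} \sum_{i=1}^n \ell'\bigl(\langle w_{\mathrm{prod}}, z_i \rangle\bigr) \gamma \le \frac{1}{n} \ell'\bigl(\langle w_{\mathrm{prod}}, z_j \rangle\bigr) \gamma \le -\frac{M\gamma}{n} < 0,$$

and thus $\|\nabla R(w_{\mathrm{prod}})\| \ge M\gamma/n$.

Similar to the proof of Lemma 2.6, we can show that if $\|W_k\|_F \to \infty$, then $\bigl| \bigl\langle (W_L \cdots W_2)^\top, v_2 \bigr\rangle \bigr| / \bigl( \|W_L\|_F \cdots \|W_2\|_F \bigr) \to 1$. In other words, there exists some $C > 0$, such that when $\min_{1 \le k \le L} \|W_k\|_F > C$, $\|W_L \cdots W_2\| \ge \|W_L\|_F \cdots \|W_2\|_F / 2 > C^{L-1}/2$.

Lemma 2.3 shows that gradient flow spends a finite amount of time in $\{ W \mid \max_{1 \le k \le L} \|W_k\|_F \le R \}$ for any $R > 0$. Since the difference between the squares of Frobenius norms of any two layers is a constant, gradient flow also spends a finite amount of time in $\{ W \mid \min_{1 \le k \le L} \|W_k\|_F \le C \}$. Now we have

$$R\bigl(W(0)\bigr) \ge \int_0^\infty \sum_{k=1}^{L} \left\| \frac{\partial R}{\partial W_k} \right\|_F^2 dt \ge \int_0^\infty \left\| \frac{\partial R}{\partial W_1} \right\|_F^2 dt = \int_0^\infty \|W_L \cdots W_2\|^2 \, \|\nabla R(w_{\mathrm{prod}})\|^2 dt$$
$$\ge \int_0^\infty \|W_L \cdots W_2\|^2 \, \|\nabla R(w_{\mathrm{prod}})\|^2 \, \mathbb{1}\Bigl[ \min_{1 \le k \le L} \|W_k\|_F > C \Bigr] dt \ge \left( \frac{M\gamma}{n} \right)^2 \left( \frac{C^{L-1}}{2} \right)^2 \int_0^\infty \mathbb{1}\Bigl[ \min_{1 \le k \le L} \|W_k\|_F > C \Bigr] dt = \infty,$$

which is a contradiction. Therefore $R(W) \to 0$. This further implies $\|W_k\|_F \to \infty$, since $R(W)$ has no finite optimum. Finally, invoking Lemma 2.6 proves the final claim of Theorem 2.2.

Proof of Lemma 2.9. Soudry et al. (2017, Lemma 8) prove that, with probability 1, there are at most $d$ support vectors, and moreover, the $i$-th support vector $z_i$ has a positive dual variable $\alpha_i$, such that $\sum_{i \in S} \alpha_i z_i = \bar u$. Suppose there exists some unit vector $\xi \perp \bar u$, such that $\max_{i \in S} \langle \xi, z_i \rangle \le 0$. Since

$$\sum_{i \in S} \alpha_i \langle \xi, z_i \rangle = \Bigl\langle \xi, \sum_{i \in S} \alpha_i z_i \Bigr\rangle = \langle \xi, \bar u \rangle = 0,$$

we actually have $\langle \xi, z_i \rangle = 0$ for all $i \in S$. This is impossible under Assumption 2.7, since the support vectors span the whole space.

Proof of Lemma 2.10. For the sake of presentation, we leave out the subscript in $z_i$ and denote a data point by $z$ generally. For any data point $z$ and predictor $w$, let $z_\perp$ and $w_\perp$ denote their projections onto $\bar u^\perp$. Let $z' \in \arg\max_{i \in S} \langle -w_\perp, z_i \rangle$, and thus $\langle -w_\perp, z' \rangle \ge \alpha \|w_\perp\|$.

For $\ell_{\exp}$, we have

$$\langle w_\perp, \nabla R(w) \rangle = \sum_z \frac{1}{n} \bigl[ -\exp(-\langle w, z \rangle) \bigr] \langle w_\perp, z \rangle = \sum_z \frac{1}{n} \bigl[ -\exp(-\langle w, z \rangle) \bigr] \langle w_\perp, z_\perp \rangle$$
$$\ge \frac{1}{n} \exp(-\langle w, z' \rangle) \langle -w_\perp, z'_\perp \rangle + \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, z \rangle) \langle -w_\perp, z_\perp \rangle. \tag{B.9}$$

The first part can be lower bounded as below (recall that $\langle -w_\perp, z'_\perp \rangle = \langle -w_\perp, z' \rangle \ge \alpha \|w_\perp\|$):

$$\frac{1}{n} \exp(-\langle w, z' \rangle) \langle -w_\perp, z'_\perp \rangle = \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \exp(\langle -w_\perp, z'_\perp \rangle) \langle -w_\perp, z'_\perp \rangle \ge \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \exp(\alpha \|w_\perp\|) \, \alpha \|w_\perp\|. \tag{B.10}$$

To bound the second part, first notice that since we assume $\langle w, \bar u \rangle \ge 0$, for any $z$,

$$\langle w, z - \gamma \bar u \rangle = \langle w, z_\perp \rangle + \langle w, z - \gamma \bar u - z_\perp \rangle \ge \langle w, z_\perp \rangle = \langle w_\perp, z_\perp \rangle. \tag{B.11}$$

The reason is that every data point has margin at least $\gamma$, and thus $z - \gamma \bar u - z_\perp = c \bar u$ for some $c \ge 0$. Using eq. (B.11), we can bound the second part of eq. (B.9):

$$\sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, z \rangle) \langle -w_\perp, z_\perp \rangle = \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \exp(-\langle w, z - \gamma \bar u \rangle) \langle -w_\perp, z_\perp \rangle$$
$$\ge \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \exp(-\langle w_\perp, z_\perp \rangle) \langle -w_\perp, z_\perp \rangle \ge \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \left( -\frac{1}{e} \right) \ge \exp(-\langle w, \gamma \bar u \rangle) \left( -\frac{1}{e} \right). \tag{B.12}$$

On the third line eq. (B.11) is applied. The fourth line applies the property that $f(x) = -x e^{-x} \ge -1/e$ when $x \ge 0$.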

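Combining eq. (B.9), eq. (B.10) and eq. (B.12), we get

$$\langle w_\perp, \nabla R(w) \rangle \ge \exp(-\langle w, \gamma \bar u \rangle) \left( \frac{1}{n} \exp(\alpha \|w_\perp\|) \, \alpha \|w_\perp\| - \frac{1}{e} \right).$$

As long as $\|w_\perp\| \ge (1 + \ln n)/\alpha$, $\langle w_\perp, \nabla R(w) \rangle \ge 0$.

For $\ell_{\log}$, similar to eq. (B.9), we have

$$\langle w_\perp, \nabla R(w) \rangle \ge \frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle + \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \frac{\exp(-\langle w, z \rangle)}{1 + \exp(-\langle w, z \rangle)} \langle -w_\perp, z_\perp \rangle$$
$$\ge \frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle + \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, z \rangle) \langle -w_\perp, z_\perp \rangle. \tag{B.13}$$

The second part of eq. (B.13) can be bounded again by eq. (B.12). To bound the first part of eq. (B.13), first notice that (recall $\langle w, \bar u \rangle \ge 0$)

$$\exp(-\langle w, z' \rangle) = \exp(-\langle w, \gamma \bar u \rangle) \exp(-\langle w_\perp, z'_\perp \rangle) \le \exp(-\langle w_\perp, z'_\perp \rangle). \tag{B.14}$$

Using eq. (B.14), and recalling that $\langle -w_\perp, z'_\perp \rangle = \langle -w_\perp, z' \rangle \ge \alpha \|w_\perp\| \ge 0$, we can bound the first part of eq. (B.13) as below:

$$\frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle = \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \frac{\exp(-\langle w_\perp, z'_\perp \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle$$
$$\ge \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \frac{\exp(-\langle w_\perp, z'_\perp \rangle)}{1 + \exp(-\langle w_\perp, z'_\perp \rangle)} \langle -w_\perp, z'_\perp \rangle \ge \frac{1}{2n} \exp(-\langle w, \gamma \bar u \rangle) \langle -w_\perp, z'_\perp \rangle \ge \frac{1}{2n} \exp(-\langle w, \gamma \bar u \rangle) \, \alpha \|w_\perp\|. \tag{B.15}$$

Combining eq. (B.13), eq. (B.15) and eq. (B.12), we get

$$\langle w_\perp, \nabla R(w) \rangle \ge \exp(-\langle w, \gamma \bar u \rangle) \left( \frac{1}{2n} \alpha \|w_\perp\| - \frac{1}{e} \right).$$

As long as $\|w_\perp\| \ge 2n/(e\alpha)$, $\langle w_\perp, \nabla R(w) \rangle \ge 0$.

Proof of Theorem 2.8. Recall that

$$\frac{dW_1}{dt} = -\frac{\partial R}{\partial W_1} = -W_2^\top \cdots W_L^\top \nabla R(w_{\mathrm{prod}})^\top,$$

and thus

$$\frac{d \|W_1\|_F^2}{dt} = 2 \left\langle W_1, \frac{dW_1}{dt} \right\rangle = -2 \bigl\langle w_{\mathrm{prod}}, \nabla R(w_{\mathrm{prod}}) \bigr\rangle. \tag{B.16}$$

Let $\Pi_{\bar u}$ denote the projection onto $\mathrm{span}(\bar u)$, and let $\Pi_\perp$ denote the projection onto $\bar u^\perp$. Also let $\Pi_{\bar u} W_1$ and $\Pi_\perp W_1$ denote the projection of rows of $W_1$ onto $\mathrm{span}(\bar u)$ and $\bar u^\perp$, respectively. Notice that $\Pi_{\bar u} w_{\mathrm{prod}} = \bigl( W_L \cdots W_2 (\Pi_{\bar u} W_1) \bigr)^\top$, and $\Pi_\perp w_{\mathrm{prod}} = \bigl( W_L \cdots W_2 (\Pi_\perp W_1) \bigr)^\top$. We further have

$$\frac{d \|\Pi_\perp W_1\|_F^2}{dt} = -2 \bigl\langle \Pi_\perp w_{\mathrm{prod}}, \nabla R(w_{\mathrm{prod}}) \bigr\rangle. \tag{B.17}$$

Let $W_1 = u_1 \sigma_1 v_1^\top + S$. We have $\|S\|_2 \le \sigma_{1,2}$ and $\sigma_{1,2}^2 \le \|W_1\|_F^2 - \|W_1\|_2^2 \le D$, where $\sigma_{1,2}$ is the second singular value of $W_1$ and $D$ is the constant introduced in Lemma 2.6. Then

$$\Pi_\perp W_1 = u_1 \sigma_1 (\Pi_\perp v_1)^\top + \Pi_\perp S,$$

and

$$\|\Pi_\perp W_1\|_F \le \bigl\| u_1 \sigma_1 (\Pi_\perp v_1)^\top \bigr\|_F + \|\Pi_\perp S\|_F = \sigma_1 \|\Pi_\perp v_1\| + \|\Pi_\perp S\|_F \le \sigma_1 \|\Pi_\perp v_1\| + \sqrt{dD}.$$

It follows that

$$\|\Pi_\perp v_1\| \ge \frac{\|\Pi_\perp W_1\|_F}{\sigma_1} - \frac{\sqrt{dD}}{\sigma_1} \ge \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} - \frac{\sqrt{dD}}{\|W_1\|_2}. \tag{B.18}$$

Fix an arbitrary $\epsilon > 0$. By Theorem 2.2, we can find some $t_0$ large enough such that for any $t \ge t_0$:

1. $\sqrt{dD} / \|W_1\|_2 \le \epsilon/3$.

2. $\bigl\| w_{\mathrm{prod}} / \|W_L \cdots W_1\| - v_1 \bigr\| \le \epsilon/3$, or $\bigl\| w_{\mathrm{prod}} / \|W_L \cdots W_1\| + v_1 \bigr\| \le \epsilon/3$.

3. $\|W_L \cdots W_1\| \ge 3K/\epsilon$, where $K$ is the threshold given in Lemma 2.10, i.e., $(1 + \ln n)/\alpha$ for $\ell_{\exp}$, and $2n/(e\alpha)$ for $\ell_{\log}$.

4. $R(W) \le \ell(0)/n$, which implies $\langle w_{\mathrm{prod}}, z_i \rangle \ge 0$ for all $1 \le i \le n$. By Lemma 2.9, there always exists a support vector $z$ for which $\langle \Pi_\perp w_{\mathrm{prod}}, z \rangle \le 0$, and therefore $\langle w_{\mathrm{prod}}, \bar u \rangle \ge 0$.

Suppose for some $t \ge t_0$, $\|\Pi_\perp W_1\|_F / \|W_1\|_F \ge \epsilon$. By eq. (B.18) and bullet 1 above, $\|\Pi_\perp v_1\| \ge 2\epsilon/3$. Bullet 2 above then gives $\|\Pi_\perp w_{\mathrm{prod}}\| / \|W_L \cdots W_1\| \ge \epsilon/3$, which together with bullet 3 above implies $\|\Pi_\perp w_{\mathrm{prod}}\| \ge K$. Since also $\langle w_{\mathrm{prod}}, \bar u \rangle \ge 0$, we can apply Lemma 2.10 and get that $\langle \Pi_\perp w_{\mathrm{prod}}, \nabla R(w_{\mathrm{prod}}) \rangle \ge 0$. In light of eq. (B.17), $d \|\Pi_\perp W_1\|_F^2 / dt \le 0$. On the other hand, since after $t \ge t_0$, $\langle w_{\mathrm{prod}}, z_i \rangle \ge 0$, we have $d \|W_1\|_F^2 / dt \ge 0$ by eq. (B.16). Therefore $\|\Pi_\perp W_1\|_F / \|W_1\|_F$ will not increase, and since $\|W_1\|_F \to \infty$, it will eventually drop below $\epsilon$, and will never exceed $\epsilon$ again. Therefore,

$$\limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} \le \epsilon.$$

Since $\epsilon$ is arbitrary, we have

$$\limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} = 0,$$

and thus $\lim_{t \to \infty} |\langle v_1, \bar u \rangle| = 1$. An application of Theorem 2.2 gives the other part of Theorem 2.8.

C Omitted proofs from Section 3

Proof of Lemma 3.2. Given $W, V \in B(R)$, we need to show that $\|\nabla R(W) - \nabla R(V)\| \le \beta(R) \|W - V\|$ for some $\beta(R)$. Consider $k = 1$ first. Let $w = (W_L \cdots W_1)^\top$, and $v = (V_L \cdots V_1)^\top$. Since $|\ell'| \le G$, $\|\nabla R(w)\|, \|\nabla R(v)\| \le G$. We have

$$\left\| \frac{\partial R}{\partial W_1} - \frac{\partial R}{\partial V_1} \right\|_F = \bigl\| W_2^\top \cdots W_L^\top \nabla R(w)^\top - V_2^\top \cdots V_L^\top \nabla R(v)^\top \bigr\|_F$$
$$\le \bigl\| W_2^\top W_3^\top \cdots W_L^\top \nabla R(w)^\top - V_2^\top W_3^\top \cdots W_L^\top \nabla R(w)^\top \bigr\|_F + \bigl\| V_2^\top W_3^\top \cdots W_L^\top \nabla R(w)^\top - V_2^\top \cdots V_L^\top \nabla R(v)^\top \bigr\|_F$$
$$\le R^{L-2} G \|W_2 - V_2\|_F + \bigl\| V_2^\top W_3^\top \cdots W_L^\top \nabla R(w)^\top - V_2^\top \cdots V_L^\top \nabla R(v)^\top \bigr\|_F$$
$$\le R^{L-2} G \|W - V\| + \bigl\| V_2^\top W_3^\top \cdots W_L^\top \nabla R(w)^\top - V_2^\top \cdots V_L^\top \nabla R(v)^\top \bigr\|_F. \tag{C.1}$$

Proceeding in this way, we can get

$$\left\| \frac{\partial R}{\partial W_1} - \frac{\partial R}{\partial V_1} \right\|_F \le (L-1) R^{L-2} G \|W - V\| + R^{L-1} \|\nabla R(w) - \nabla R(v)\|. \tag{C.2}$$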

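Since $\|z_i\| \le 1$ and $\ell'$ is $\beta$-Lipschitz, we have
$$\|\nabla\mathcal{R}(w) - \nabla\mathcal{R}(v)\| \le \beta\|w - v\| \le \beta L R^{L-1}\|W - V\|, \tag{C.3}$$
where the last inequality follows from a similar one-by-one replacement procedure as in eq. (C.1). Combining eq. (C.2) and eq. (C.3), we get for $R \ge 1$,
$$\left\|\frac{\partial\mathcal{R}}{\partial W_1} - \frac{\partial\mathcal{R}}{\partial V_1}\right\|_F \le \big((L-1)R^{L-2}G + \beta L R^{2L-2}\big)\|W - V\| \le L R^{2L-2}(\beta + G)\|W - V\|.$$
The same procedure can be done for other layers, and together
$$\|\nabla\mathcal{R}(W) - \nabla\mathcal{R}(V)\| \le L^2 R^{2L-2}(\beta + G)\|W - V\|.$$

Proof of Lemma 3.4. Recall that if $W(t), W(t+1) \in B(R)$ and $\eta_t = 1/\beta(R)$,
$$\begin{aligned}
\mathcal{R}\big(W(t+1)\big) - \mathcal{R}\big(W(t)\big)
&\le \big\langle \nabla\mathcal{R}\big(W(t)\big), -\eta_t\nabla\mathcal{R}\big(W(t)\big)\big\rangle + \frac{\beta(R)\eta_t^2}{2}\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 \\
&= -\frac{1}{2\beta(R)}\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 = -\frac{\eta_t}{2}\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2.
\end{aligned}\tag{C.4}$$
Suppose $W(t) \in B(R)$ for all $t$. By Assumption. and eq. (C.4),
$$\mathcal{R}\big(W(1)\big) \le \mathcal{R}\big(W(0)\big) - \frac{1}{2\beta(R)}\big\|\nabla\mathcal{R}\big(W(0)\big)\big\|^2 < \mathcal{R}\big(W(0)\big).$$
By eq. (C.4), gradient descent never increases the risk, and thus for all $t \ge 1$, $\mathcal{R}\big(W(t)\big) \le \mathcal{R}\big(W(1)\big) < \mathcal{R}\big(W(0)\big)$. In exactly the same way as in the proof of Lemma 2.3, one can show that there exists some constant $\epsilon(R) > 0$, so that $\|\partial\mathcal{R}/\partial W_1(t)\|_F \ge \epsilon(R)$ for all $t$. Invoking eq. (C.4) again, we will get
$$\mathcal{R}\big(W(0)\big) \ge \sum_{t=0}^{\infty}\frac{1}{2\beta(R)}\,\epsilon(R)^2 = \infty,$$
which is a contradiction. Therefore $W(t)$ must go out of $B(R)$ at some time.

Next we prove Theorem 3.6 and 3.7. The proofs depend on several lemmas which are similar to the gradient flow ones. The following Lemma C.5 is similar to Lemma 2.3.

Lemma C.5. Under Assumptions. to. and 3.5, gradient descent ensures that

- $\max_{1\le k\le L}\|W_k(t)\|_F$ is unbounded.
- $\sum_{t=0}^{\infty}\eta_t = \infty$.
- For any $R > 0$, $\sum_{t:\,W(t)\in B(R)}\eta_t < \infty$.

Proof. By Assumption 3.5, we always have that $W(t) \in B(R_t)$. Since $\beta(R_t) = L^2R_t^{2L-2}(\beta + G) \ge R_t^{L-1}G$, we have for any $1 \le k \le L$,
$$\begin{aligned}
\|W_k(t+1)\|_F &\le \|W_k(t)\|_F + \eta_t\left\|\frac{\partial\mathcal{R}}{\partial W_k}(t)\right\|_F
\le \|W_k(t)\|_F + \frac{1}{\beta(R_t)}\left\|\frac{\partial\mathcal{R}}{\partial W_k}(t)\right\|_F \\
&\le \|W_k(t)\|_F + \frac{1}{\beta(R_t)}R_t^{L-1}G \le \|W_k(t)\|_F + 1.
\end{aligned}\tag{C.6}$$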

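Moreover, Lemma 3.4 shows that $R_t \to \infty$. Since $R_{t+1} = R_t$ as long as $W(t+1) \in B(R_t)$, $\max_{1\le k\le L}\|W_k(t)\|_F$ is unbounded. It then follows that for any $t$, by Cauchy-Schwarz,
$$\sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\| \le \sqrt{\sum_{\tau=0}^{t-1}\eta_\tau}\;\sqrt{\sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2}.$$
The left-hand side upper bounds $\|W_k(t)\|_F - \|W_k(0)\|_F$ for every layer, and is therefore unbounded in $t$. Since eq. (C.4) implies
$$\sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2 \le 2\mathcal{R}\big(W(0)\big) - 2\mathcal{R}\big(W(t)\big) \le 2\mathcal{R}\big(W(0)\big),$$
together we have $\sum_{t=0}^{\infty}\eta_t = \infty$.

Since under Assumptions 3.1 and 3.5 gradient descent never increases the risk, it can be shown in exactly the same way as in the proof of Lemma 2.3 that, for $W(t) \in B(R)$, $\|\partial\mathcal{R}/\partial W_1(t)\|_F \ge \epsilon(R)$ for some constant $\epsilon(R) > 0$. Invoking eq. (C.4) again, we get that $\sum_{t:\,W(t)\in B(R)}\eta_t < \infty$.

The next lemma is an analog to Lemma 2.6.

Lemma C.7. Under Assumptions. and 3.1, the gradient descent iterates satisfy the following properties:

- For any $1 \le k \le L$, $\|W_k\|_F^2 - \|W_k\|_2^2 \le D + 2\mathcal{R}\big(W(0)\big)$.
- For any $1 \le k < L$, $\langle v_{k+1}, u_k\rangle^2 \ge 1 - \dfrac{D + 3\mathcal{R}(W(0)) + \|W_{k+1}(0)\|_2^2 + \|W_k(0)\|_2^2}{\|W_{k+1}\|_2^2}$.
- Suppose $\max_{1\le k\le L}\|W_k\|_F \to \infty$, then $\Big|\Big\langle w_{\mathrm{prod}}/\prod_{k=1}^{L}\|W_k\|_F,\ v_1\Big\rangle\Big| \to 1$.

Proof. Recall that for any $W$,
$$W_{k+1}^\top\frac{\partial\mathcal{R}}{\partial W_{k+1}} = W_{k+1}^\top\cdots W_L^\top\,\nabla\mathcal{R}(w_{\mathrm{prod}})^\top\,W_1^\top\cdots W_k^\top = \frac{\partial\mathcal{R}}{\partial W_k}W_k^\top. \tag{C.8}$$
For gradient descent iterates, summing eq. (C.8) from $0$ to $t-1$, we get
$$\begin{aligned}
&W_{k+1}^\top(t)W_{k+1}(t) - W_{k+1}^\top(0)W_{k+1}(0) + \sum_{\tau=0}^{t-1}\eta_\tau^2\left(\frac{\partial\mathcal{R}}{\partial W_{k+1}}(\tau)\right)^{\!\top}\left(\frac{\partial\mathcal{R}}{\partial W_{k+1}}(\tau)\right) \\
&\qquad = W_k(t)W_k^\top(t) - W_k(0)W_k^\top(0) + \sum_{\tau=0}^{t-1}\eta_\tau^2\left(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\right)\left(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\right)^{\!\top}.
\end{aligned}\tag{C.9}$$
For any $1 \le k \le L$ and any $t$, let
$$P_k(t) = \sum_{\tau=0}^{t-1}\eta_\tau^2\left(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\right)\left(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\right)^{\!\top},$$
and
$$Q_k(t) = \sum_{\tau=0}^{t-1}\eta_\tau^2\left(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\right)^{\!\top}\left(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\right).$$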

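We have $\|P_k(t)\|_2 \le \mathrm{tr}\big(P_k(t)\big)$ and $\|Q_k(t)\|_2 \le \mathrm{tr}\big(Q_k(t)\big) = \mathrm{tr}\big(P_k(t)\big)$. Moreover, invoking eq. (C.4),
$$\begin{aligned}
\sum_{k=1}^{L}\mathrm{tr}\big(P_k(t)\big)
&= \sum_{k=1}^{L}\sum_{\tau=0}^{t-1}\eta_\tau^2\left\|\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\right\|_F^2
= \sum_{\tau=0}^{t-1}\eta_\tau^2\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2 \\
&\le \sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2
\le 2\mathcal{R}\big(W(0)\big) - 2\mathcal{R}\big(W(t)\big) \le 2\mathcal{R}\big(W(0)\big).
\end{aligned}\tag{C.10}$$
Still let $\sigma_k(t)$, $u_k(t)$ and $v_k(t)$ denote the first singular value, left singular vector and right singular vector of $W_k(t)$. We can then proceed basically in the same way as in the proof of Lemma 2.6. For example, eq. (B.) becomes
$$\sigma_k^2(t) \ge \sigma_{k+1}^2(t) - \|A_{k,k+1}(t)\|_2 - \|P_k(t)\|_2 \ge \sigma_{k+1}^2(t) - \|A_{k,k+1}(t)\|_2 - \mathrm{tr}\big(P_k(t)\big), \tag{C.11}$$
while eq. (B.3) becomes
$$\|W_k(t)\|_F^2 = \|W_{k+1}(t)\|_F^2 + \|W_k(0)\|_F^2 - \|W_{k+1}(0)\|_F^2 - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_{k+1}(t)\big). \tag{C.12}$$
Summing eq. (C.11) and eq. (C.12) from $k$ to $L-1$, and invoking eq. (C.10), we get
$$\|W_k(t)\|_F^2 - \|W_k(t)\|_2^2 \le D - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_L(t)\big) + \sum_{k'=k}^{L-1}\mathrm{tr}\big(P_{k'}(t)\big) \le D + 2\mathcal{R}\big(W(0)\big).$$
To prove that the singular vectors get aligned, we can still proceed in nearly the same way as in the proof of Lemma 2.6. Eq. (B.5) becomes
$$u_k^\top W_{k+1}^\top W_{k+1}u_k \ge \sigma_k^2 - \|W_k(0)\|_2^2 - \|Q_{k+1}(t)\|_2, \tag{C.13}$$
while eq. (B.6) becomes
$$u_k^\top W_{k+1}^\top W_{k+1}u_k \le \langle u_k, v_{k+1}\rangle^2\sigma_{k+1}^2 + D + 2\mathcal{R}\big(W(0)\big). \tag{C.14}$$
Combining eq. (C.13) and eq. (C.14),
$$\sigma_k^2 \le \langle u_k, v_{k+1}\rangle^2\sigma_{k+1}^2 + D + 2\mathcal{R}\big(W(0)\big) + \|Q_{k+1}(t)\|_2 + \|W_k(0)\|_2^2. \tag{C.15}$$
Similar to eq. (C.13), we can get
$$\sigma_k^2 \ge v_{k+1}^\top W_kW_k^\top v_{k+1} \ge \sigma_{k+1}^2 - \|W_{k+1}(0)\|_2^2 - \|P_k(t)\|_2,$$
and thus eq. (B.8) becomes
$$\frac{\sigma_k^2}{\sigma_{k+1}^2} \ge 1 - \frac{\|W_{k+1}(0)\|_2^2 + \|P_k(t)\|_2}{\sigma_{k+1}^2}. \tag{C.16}$$
Combining eq. (C.15) and eq. (C.16), we get
$$\langle u_k, v_{k+1}\rangle^2 \ge 1 - \frac{D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2 + 3\mathcal{R}\big(W(0)\big)}{\sigma_{k+1}^2}.$$
The final claim of Lemma C.7 can be proved in exactly the same way as Lemma 2.6.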
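The bookkeeping in the proof of Lemma C.7 can be checked numerically. The following is a minimal sketch, not the authors' code: it runs gradient descent with the logistic loss on a small deep linear network and accumulates $\mathrm{tr}(P_k(t)) = \sum_\tau \eta_\tau^2\|\partial\mathcal{R}/\partial W_k(\tau)\|_F^2$ for each layer. The function names `layer_gradients` and `norm_drift_check`, the synthetic data, the initialization scale, and the small constant step size (used in place of the schedule of Assumption 3.5) are all assumptions made for illustration.

```python
import numpy as np


def layer_gradients(Ws, Z):
    """Logistic risk and dR/dW_k for a deep linear net W_L ... W_1.

    Ws[0] is W_1 (d_1 x d), Ws[-1] is W_L (1 x d_{L-1});
    the rows of Z are z_i = y_i * x_i.
    """
    L, d = len(Ws), Ws[0].shape[1]
    below = [np.eye(d)]                  # below[k] = W_k ... W_1
    for W in Ws:
        below.append(W @ below[-1])
    above = [np.eye(1)]                  # above[m] = W_L ... W_{L-m+1}
    for W in reversed(Ws):
        above.append(above[-1] @ W)
    margins = Z @ below[L].ravel()       # <w_prod, z_i>
    risk = np.mean(np.logaddexp(0.0, -margins))
    # Gradient of the risk with respect to w_prod (logistic loss).
    grad_w = Z.T @ (-1.0 / (1.0 + np.exp(margins))) / len(Z)
    grads = [above[L - k].T @ (grad_w[None, :] @ below[k - 1].T)
             for k in range(1, L + 1)]
    return risk, grads


def norm_drift_check(L=3, width=4, n=20, eta=0.02, steps=2000, seed=0):
    """Run GD and track how ||W_k||_F^2 - ||W_L||_F^2 drifts from its initial value."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] >= 0, 1.0, -1.0)
    X[:, 0] += 0.5 * y                   # enforce a positive margin along e_1
    Z = y[:, None] * X
    dims = [2] + [width] * (L - 1) + [1]
    Ws = [0.3 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(L)]
    sq0 = np.array([np.sum(W ** 2) for W in Ws])
    risk0, _ = layer_gradients(Ws, Z)
    traces = np.zeros(L)                 # traces[k-1] accumulates tr(P_k(t))
    for _ in range(steps):
        _, grads = layer_gradients(Ws, Z)
        traces += eta ** 2 * np.array([np.sum(G ** 2) for G in grads])
        Ws = [W - eta * G for W, G in zip(Ws, grads)]
    sq = np.array([np.sum(W ** 2) for W in Ws])
    # Change of ||W_k||_F^2 - ||W_L||_F^2 relative to initialization.
    drift = (sq - sq0) - (sq[-1] - sq0[-1])
    return risk0, drift, traces


if __name__ == "__main__":
    risk0, drift, traces = norm_drift_check()
    print("R(W(0)):", risk0)
    print("max |drift|:", np.max(np.abs(drift)))
    print("sum_k tr(P_k):", traces.sum())
```

In this sketch the drift of the squared Frobenius norms between layers comes only from the second-order terms collected in `traces`, which is why the proof can control it by the total decrease in risk.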

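Proof of Theorem 3.6. Summing eq. (C.12), we know that for any two different layers $j > k$,
$$\|W_k(t)\|_F^2 - \|W_j(t)\|_F^2 = \|W_k(0)\|_F^2 - \|W_j(0)\|_F^2 - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_j(t)\big).$$
Recalling eq. (C.10), we know that
$$\Big|\big(\|W_k(t)\|_F^2 - \|W_j(t)\|_F^2\big) - \big(\|W_k(0)\|_F^2 - \|W_j(0)\|_F^2\big)\Big| \le 2\mathcal{R}\big(W(0)\big). \tag{C.17}$$
In other words, the difference between the squares of Frobenius norms of any two layers is still bounded. The proof then goes in the same way as the proof of Theorem..

Suppose the risk is always above $\epsilon > 0$. Then there exists some $c(\epsilon) > 0$ such that $\|\nabla\mathcal{R}(w_{\mathrm{prod}})\| \ge c(\epsilon)$. By Lemma C.7, there exists some $C$ such that if $\min_{1\le k\le L}\|W_k(t)\|_F > C$, then $\|W_L(t)\cdots W_2(t)\| \ge C^{L-1}/2$. By eq. (C.17) and Lemma C.5, $\sum_{t:\,\|W_k(t)\|_F \le C \text{ for some } k}\eta_t$ is finite. On the other hand, by Lemma C.5, $\sum_{t=0}^{\infty}\eta_t = \infty$, and thus $\sum_{t:\,\|W_k(t)\|_F > C \text{ for all } k}\eta_t = \infty$. Therefore we have, by invoking eq. (C.4),
$$2\mathcal{R}\big(W(0)\big) \ge \sum_{t=0}^{\infty}\eta_t\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 \ge \sum_{t=0}^{\infty}\eta_t\left\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\right\|_F^2 \ge \sum_{t:\,\|W_k(t)\|_F > C \text{ for all } k}\eta_t\,c(\epsilon)^2\left(\frac{C^{L-1}}{2}\right)^2 = \infty,$$
which is a contradiction. Therefore $\mathcal{R}\big(W(t)\big) \to 0$, and since it has no finite optimum, $\|W_k\|_F \to \infty$. The other results follow from Lemma C.5.

Proof of Theorem 3.7. Recall that
$$\frac{\partial\mathcal{R}}{\partial W_1} = W_2^\top\cdots W_L^\top\,\nabla\mathcal{R}(w_{\mathrm{prod}})^\top,$$
and thus
$$\begin{aligned}
\|W_1(t+1)\|_F^2 &= \|W_1(t)\|_F^2 - 2\eta_t\left\langle W_1(t), \frac{\partial\mathcal{R}}{\partial W_1}(t)\right\rangle + \eta_t^2\left\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\right\|_F^2 \\
&= \|W_1(t)\|_F^2 - 2\eta_t\big\langle w_{\mathrm{prod}}(t), \nabla\mathcal{R}\big(w_{\mathrm{prod}}(t)\big)\big\rangle + \eta_t^2\left\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\right\|_F^2.
\end{aligned}$$
If $\langle w_{\mathrm{prod}}, z_i\rangle \ge 0$ for all $i$, then $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$.

Also recall that $\Pi_\perp W_1(t)$ denotes the projection of rows of $W_1(t)$ onto $\bar u^\perp$, the orthogonal complement of $\mathrm{span}(\bar u)$. We have
$$\begin{aligned}
\|\Pi_\perp W_1(t+1)\|_F^2 &\le \|\Pi_\perp W_1(t)\|_F^2 - 2\eta_t\left\langle \Pi_\perp W_1(t), \frac{\partial\mathcal{R}}{\partial W_1}(t)\right\rangle + \eta_t^2\left\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\right\|_F^2 \\
&= \|\Pi_\perp W_1(t)\|_F^2 - 2\eta_t\big\langle \Pi_\perp w_{\mathrm{prod}}(t), \nabla\mathcal{R}\big(w_{\mathrm{prod}}(t)\big)\big\rangle + \eta_t^2\left\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\right\|_F^2.
\end{aligned}\tag{C.18}$$
Invoking eq. (C.4) again gives
$$\eta_t^2\left\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\right\|_F^2 \le \eta_t\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 \le 2\Big(\mathcal{R}\big(W(t)\big) - \mathcal{R}\big(W(t+1)\big)\Big). \tag{C.19}$$
The proof then goes in almost the same way as the proof of Theorem 2.8. For any $\epsilon > 0$, we can find some large enough time $t_0$, such that for any $t \ge t_0$ (recall $\eta_t \le 1$):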

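1. $\|\Pi_\perp W_1(t)\|_F/\|W_1(t)\|_F \ge \epsilon$ implies that $\big\langle \Pi_\perp w_{\mathrm{prod}}(t), \nabla\mathcal{R}\big(w_{\mathrm{prod}}(t)\big)\big\rangle \ge 0$.
2. $\langle w_{\mathrm{prod}}(t), z_i\rangle \ge 0$ for all $i$, and thus $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$.
3. $\|W_1(t)\|_F \ge \Big(1 + \sqrt{2\mathcal{R}(W(0))}\Big)/\epsilon$.

Suppose at some time $t_1 \ge t_0$, $\|\Pi_\perp W_1(t_1)\|_F/\|W_1(t_1)\|_F \ge \epsilon$. As long as this still holds, in light of bullet (1) above, eq. (C.18) and eq. (C.19), $\|\Pi_\perp W_1\|_F^2$ will increase by at most $2\mathcal{R}\big(W(t_1)\big) \le 2\mathcal{R}\big(W(0)\big)$. On the other hand, $\|W_1\|_F \to \infty$, and thus there exists some $t_2 > t_1$ such that $\|\Pi_\perp W_1(t_2)\|_F/\|W_1(t_2)\|_F < \epsilon$. Let $t_3$ denote the smallest time after $t_2$ such that $\|\Pi_\perp W_1(t_3)\|_F/\|W_1(t_3)\|_F \ge \epsilon$ (if it exists). Recalling that $\|W_1(t+1)\|_F \le \|W_1(t)\|_F + 1$ for any $t \ge 0$, and $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$ for any $t \ge t_0$, we have
$$\frac{\|\Pi_\perp W_1(t_3)\|_F}{\|W_1(t_3)\|_F} \le \frac{\|\Pi_\perp W_1(t_3-1)\|_F + 1}{\|W_1(t_3)\|_F} \le \frac{\|\Pi_\perp W_1(t_3-1)\|_F}{\|W_1(t_3-1)\|_F} + \frac{1}{\|W_1(t_3)\|_F} < \epsilon + \frac{1}{\|W_1(t_3)\|_F}.$$
After $t_3$, $\|\Pi_\perp W_1\|_F^2$ will increase by at most $2\mathcal{R}\big(W(0)\big)$, and thus $\|\Pi_\perp W_1\|_F$ will increase by at most $\sqrt{2\mathcal{R}(W(0))}$. Therefore, for any $t_4 \ge t_3$, as long as $\|\Pi_\perp W_1(t_4)\|_F/\|W_1(t_4)\|_F \ge \epsilon$, we have
$$\frac{\|\Pi_\perp W_1(t_4)\|_F}{\|W_1(t_4)\|_F} \le \frac{\|\Pi_\perp W_1(t_3)\|_F + \sqrt{2\mathcal{R}(W(0))}}{\|W_1(t_3)\|_F} \le \epsilon + \frac{1}{\|W_1(t_3)\|_F} + \frac{\sqrt{2\mathcal{R}(W(0))}}{\|W_1(t_3)\|_F} \le 2\epsilon,$$
since $\|W_1(t)\|_F \ge \Big(1 + \sqrt{2\mathcal{R}(W(0))}\Big)/\epsilon$ after $t_0$. In other words,
$$\limsup_{t\to\infty}\frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} \le 2\epsilon.$$
Since $\epsilon$ is arbitrary, we have
$$\limsup_{t\to\infty}\frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} = 0,$$
and thus $\lim_{t\to\infty}|\langle v_1, \bar u\rangle| = 1$.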
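The conclusion of Theorem 3.7 can be illustrated on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a two-layer linear network (chosen only to keep the run short), the logistic loss, and the two data points $z_1 = (1, 1)$, $z_2 = (1, -1)$, for which the maximum margin direction is $\bar u = (1, 0)$ and both points are support vectors. The function name `alignment_after_gd`, the small random initialization, and the constant step size (instead of the schedule of Assumption 3.5) are hypothetical choices.

```python
import numpy as np


def alignment_after_gd(steps=5000, eta=0.1, width=3, scale=0.1, seed=0):
    """Run GD on a two-layer linear net and measure alignment of W_1 with u_bar."""
    rng = np.random.default_rng(seed)
    Z = np.array([[1.0, 1.0], [1.0, -1.0]])   # rows are z_i = y_i * x_i
    u_bar = np.array([1.0, 0.0])              # maximum margin direction for Z
    W1 = scale * rng.normal(size=(width, 2))
    W2 = scale * rng.normal(size=(1, width))
    for _ in range(steps):
        w_prod = (W2 @ W1).ravel()
        margins = Z @ w_prod
        # Gradient of the logistic risk with respect to w_prod.
        grad_w = Z.T @ (-1.0 / (1.0 + np.exp(margins))) / len(Z)
        g1 = W2.T @ grad_w[None, :]           # dR/dW_1
        g2 = grad_w[None, :] @ W1.T           # dR/dW_2
        W1, W2 = W1 - eta * g1, W2 - eta * g2
    v1 = np.linalg.svd(W1)[2][0]              # first right singular vector of W_1
    # Rows of W_1 projected onto the complement of span(u_bar) keep only column 1.
    perp_ratio = np.linalg.norm(W1[:, 1]) / np.linalg.norm(W1)
    risk = np.mean(np.logaddexp(0.0, -(Z @ (W2 @ W1).ravel())))
    return abs(v1 @ u_bar), perp_ratio, risk


if __name__ == "__main__":
    align, perp_ratio, risk = alignment_after_gd()
    print("|<v_1, u_bar>|:", align)
    print("||Pi_perp W_1||_F / ||W_1||_F:", perp_ratio)
    print("final risk:", risk)
```

The quantity `perp_ratio` is the ratio $\|\Pi_\perp W_1\|_F/\|W_1\|_F$ that the proof drives to zero. Because the layer norms grow only logarithmically in the number of steps, the decay seen in such a run is slow.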
