arXiv:1810.02032v1 [cs.LG] 4 Oct 2018
Gradient descent aligns the layers of deep linear networks

Ziwei Ji, Matus Telgarsky
University of Illinois, Urbana-Champaign

Abstract

This paper establishes risk convergence and asymptotic weight matrix alignment (a form of implicit regularization) of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized $i$-th weight matrix asymptotically equals its rank-1 approximation $u_i v_i^\top$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^\top u_i| \to 1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network (the product of its weight matrices) converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.

1 Introduction

Efforts to explain the effectiveness of gradient descent in deep learning have uncovered an exciting possibility: it not only finds solutions with low error, but also biases the search for low complexity solutions which generalize well (Zhang et al., 2017; Bartlett et al., 2017; Soudry et al., 2017; Gunasekar et al., 2018).

This paper analyzes the implicit regularization of gradient descent and gradient flow on deep linear networks and linearly separable data. For strictly decreasing losses, the optimum is off at infinity, and we establish various alignment phenomena:

- For each weight matrix $W_i$, the corresponding normalized weight matrix $W_i / \|W_i\|_F$ asymptotically equals its rank-1 approximation $u_i v_i^\top$, where the Frobenius norm $\|W_i\|_F$ satisfies $\|W_i\|_F \to \infty$. In other words, $\|W_i\|_2 / \|W_i\|_F \to 1$, and asymptotically only the rank-1 approximation of each weight matrix contributes to the final predictor, a form of implicit regularization.
- Adjacent rank-1 weight matrix approximations are aligned: $|v_{i+1}^\top u_i| \to 1$.
- For the logistic loss, the first right singular vector $v_1$ of $W_1$ is aligned with the data, meaning $v_1$ converges to the unique maximum margin predictor $\bar u$ defined by the data. Moreover, the linear predictor induced by the network, $w_{\mathrm{prod}} := W_L \cdots W_1$, is also aligned with the data, meaning $w_{\mathrm{prod}} / \|w_{\mathrm{prod}}\| \to \bar u$.

Simultaneously, this work proves that the risk is globally optimized: it asymptotes to 0. Alignment and risk convergence are proved simultaneously; the phenomena are coupled within the proofs.

The paper is organized as follows. This introduction continues with related work, notation, and assumptions in Sections 1.1 and 1.2. The analysis of gradient flow is in Section 2, and gradient descent is analyzed in Section 3. The paper closes with future directions in Section 4; a particular highlight is a preliminary experiment on CIFAR-10 which establishes empirically that a form of the alignment phenomenon occurs on the standard nonlinear network AlexNet.

1.1 Related work

On the implicit regularization of gradient descent, Soudry et al. (2017) show that for linear predictors and linearly separable data, the gradient descent iterates converge to the same direction as the maximum margin
solution. Ji and Telgarsky (2018) further characterize such an implicit bias for general nonseparable data. Gunasekar et al. (2018) consider gradient descent on fully connected linear networks and linear convolutional networks. In particular, for the exponential loss, assuming the risk is minimized to 0 and the gradients converge in direction, they show that the whole network converges in direction to the maximum margin solution. These two assumptions are on the gradient descent process itself, and specifically the second one might be hard to interpret and justify. Compared with Gunasekar et al. (2018), this paper proves that the risk converges to 0 and the weight matrices align; moreover the proof here proves the properties simultaneously, rather than assuming one and deriving the other. Lastly, for ReLU networks, Du et al. (2018) show that gradient flow does not change the difference between squared Frobenius norms of any two layers.

[Figure 1: Visualization of main results on synthetic data with a 4-layer linear network compared to a 1-layer network (a linear predictor). (a) Margin maximization: the 1-layer and 4-layer networks converge to the same linear predictor on positive (blue) and negative (red) separable data. (b) Alignment and risk minimization: the alignment phenomenon in the 4-layer network, plotted against the risk. Specifically, for each layer, the ratio of spectral to Frobenius norms $\|W_i\|_2 / \|W_i\|_F$ is plotted, and converges to 1. As in the theoretical analysis, the convergence in alignment and risk occur simultaneously.]

For a smooth (nonconvex) function, Lee et al. (2016) show that any strict saddle can be avoided almost surely with small step sizes. If there are only countably many saddle points and they are all strict, and if gradient descent iterates converge, then this implies (almost surely) they converge to a local minimum. In the present work, since there is no finite local minimum, gradient descent will go to infinity and never converge, and thus these results of Lee et al. (2016) do not show that the risk converges to 0.

There has been a rich literature on linear networks. Saxe et al. (2013) analyze the learning dynamics of deep linear networks, showing that they exhibit some learning patterns similar to nonlinear networks, such as a long plateau followed by a rapid risk drop. Arora et al. (2018) show that depth can help accelerate optimization. On the landscape properties of deep linear networks, Lu and Kawaguchi (2017) and Laurent and von Brecht (2017) show that under various structural assumptions, all local optima are global. Zhou and Liang (2018) give a necessary and sufficient characterization of critical points for deep linear networks.

1.2 Notation, setting, and assumptions

Consider a data set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$, $\|x_i\| \le 1$, and $y_i \in \{-1, +1\}$. The data set is assumed to be linearly separable, i.e., there exists a unit vector $u$ which correctly classifies every data point: for any $1 \le i \le n$, $y_i \langle u, x_i \rangle > 0$. Furthermore, let $\gamma := \max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle > 0$ denote the maximum margin, and $\bar u := \arg\max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle$ denote the maximum margin solution (the solution to the hard-margin SVM).

A linear network of depth $L$ is parameterized by weight matrices $W_L, \ldots, W_1$, where $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, $d_0 = d$, and $d_L = 1$. Let $W = (W_L, \ldots, W_1)$ denote all parameters of the network. The (empirical) risk
induced by the network is given by
\[ \mathcal{R}(W) = \mathcal{R}(W_L, \ldots, W_1) = \frac{1}{n} \sum_{i=1}^n \ell(y_i W_L \cdots W_1 x_i) = \frac{1}{n} \sum_{i=1}^n \ell(\langle w_{\mathrm{prod}}, z_i \rangle), \]
where $w_{\mathrm{prod}} := (W_L \cdots W_1)^\top$, and $z_i := y_i x_i$.

The loss $\ell$ is assumed to be continuously differentiable, unbounded, and strictly decreasing to 0. Examples include the exponential loss $\ell_{\exp}(x) = e^{-x}$ and the logistic loss $\ell_{\log}(x) = \ln(1 + e^{-x})$.

Assumption 1.1. $\ell' < 0$ is continuous, $\lim_{x \to -\infty} \ell(x) = \infty$ and $\lim_{x \to \infty} \ell(x) = 0$.

This paper considers gradient flow and gradient descent, where gradient flow $\{W(t) \mid t \ge 0, t \in \mathbb{R}\}$ can be interpreted as gradient descent with infinitesimal step sizes. It starts from some $W(0)$ at $t = 0$, and proceeds as
\[ \frac{dW(t)}{dt} = -\nabla \mathcal{R}(W(t)). \]
By contrast, gradient descent $\{W(t) \mid t \ge 0, t \in \mathbb{Z}\}$ is a discrete-time process given by
\[ W(t+1) = W(t) - \eta_t \nabla \mathcal{R}(W(t)), \]
where $\eta_t$ is the step size at time $t$.

We assume that the initialization of the network is not a critical point and induces a risk no larger than the risk of the trivial linear predictor 0.

Assumption 1.2. The initialization $W(0)$ satisfies $\nabla \mathcal{R}(W(0)) \ne 0$ and $\mathcal{R}(W(0)) \le \mathcal{R}(0) = \ell(0)$.

It is natural to require that the initialization is not a critical point, since otherwise gradient flow/descent will never make progress. The requirement $\mathcal{R}(W(0)) \le \mathcal{R}(0)$ can be easily satisfied, for example, by making $W_1(0) = 0$ and $W_L(0) \cdots W_2(0) \ne 0$. On the other hand, if $\mathcal{R}(W(0)) > \mathcal{R}(0)$, gradient flow/descent may never minimize the risk to 0. Proofs of those claims are given in Appendix A.

2 Results for gradient flow

In this section, we consider gradient flow. Although impractical when compared with gradient descent, gradient flow can simplify the analysis and highlight proof ideas. For convenience, we usually use $W$, $W_k$, and $w_{\mathrm{prod}}$, but they all change with (the continuous time) $t$. Only proof sketches are given here; detailed proofs are deferred to Appendix B.

2.1 Risk convergence

One key property of gradient flow is that it never increases the risk:
\[ \frac{d\mathcal{R}(W)}{dt} = \left\langle \nabla \mathcal{R}(W), \frac{dW}{dt} \right\rangle = -\|\nabla \mathcal{R}(W)\|^2 = -\sum_{k=1}^L \left\| \frac{\partial \mathcal{R}}{\partial W_k} \right\|_F^2 \le 0. \tag{2.1} \]
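The setup above can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it uses NumPy, the logistic loss, synthetic separable data, and hypothetical helper names (`product`, `risk`, `gradients`, `gd_step`). It evaluates the risk of a deep linear network, forms the per-layer gradients from the chain rule, and takes small constant gradient descent steps as a discrete stand-in for gradient flow.

```python
import numpy as np

def logistic_loss(x):
    # ell_log(x) = ln(1 + exp(-x)), computed stably
    return np.logaddexp(0.0, -x)

def logistic_loss_grad(x):
    # ell_log'(x) = -1 / (1 + exp(x)), which is always negative
    return -0.5 * (1.0 - np.tanh(0.5 * x))

def product(Ws):
    # Ws = [W_1, ..., W_m]; returns W_m ... W_1
    P = Ws[0]
    for W in Ws[1:]:
        P = W @ P
    return P

def risk(Ws, Z):
    # Rows of Z are z_i = y_i x_i; the risk is the mean of ell(<w_prod, z_i>)
    w_prod = product(Ws).ravel()
    return logistic_loss(Z @ w_prod).mean()

def gradients(Ws, Z):
    # dR/dW_k = (W_L ... W_{k+1})^T  grad R(w_prod)^T  (W_{k-1} ... W_1)^T
    w_prod = product(Ws).ravel()
    g = (logistic_loss_grad(Z @ w_prod)[:, None] * Z).mean(axis=0)
    grads = []
    for k in range(len(Ws)):
        above = product(Ws[k + 1:]) if k + 1 < len(Ws) else np.eye(1)
        below = product(Ws[:k]) if k > 0 else np.eye(Z.shape[1])
        grads.append(above.T @ g[None, :] @ below.T)
    return grads

def gd_step(Ws, Z, eta):
    return [W - eta * G for W, G in zip(Ws, gradients(Ws, Z))]

# Synthetic separable data (an assumption for illustration), with ||x_i|| <= 1.
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
y = np.sign(X[:, 0])           # labels given by the first coordinate
Z = y[:, None] * X

dims = [d, 4, 4, 1]            # depth L = 3, output dimension d_L = 1
Ws = [0.3 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]
risks = [risk(Ws, Z)]
for _ in range(200):
    Ws = gd_step(Ws, Z, eta=0.01)
    risks.append(risk(Ws, Z))
```

With a step size this small, the recorded `risks` should be non-increasing, mirroring the monotonicity of gradient flow; larger steps can break this, which is the issue Section 3 addresses.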
We now state the main result: under Assumptions 1.1 and 1.2, gradient flow minimizes the risk, $\|W_k\|_F$ and $\|w_{\mathrm{prod}}\|$ all go to infinity, and the alignment phenomenon occurs.

Theorem 2.2. Under Assumptions 1.1 and 1.2, gradient flow iterates satisfy the following properties:
- $\lim_{t \to \infty} \mathcal{R}(W) = 0$.
- For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k\|_F = \infty$.
- For any $1 \le k \le L$, letting $(u_k, v_k)$ denote the first left and right singular vectors of $W_k$,
\[ \lim_{t \to \infty} \left\| \frac{W_k}{\|W_k\|_F} - u_k v_k^\top \right\|_F = 0. \]
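Moreover, for any $1 \le k < L$, $\lim_{t \to \infty} |\langle v_{k+1}, u_k \rangle| = 1$. As a result,
\[ \lim_{t \to \infty} \left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^L \|W_k\|_F}, v_1 \right\rangle \right| = 1, \]
and thus $\lim_{t \to \infty} \|w_{\mathrm{prod}}\| = \infty$.

Theorem 2.2 is proved using two lemmas, which may be of independent interest. To show the ideas, let us first introduce a little more notation. Recall that $\mathcal{R}(W)$ denotes the empirical risk induced by the deep linear network $W$. Abusing the notation a little, for any linear predictor $w \in \mathbb{R}^d$, we also use $\mathcal{R}(w)$ to denote the risk induced by $w$. With this notation, $\mathcal{R}(W) = \mathcal{R}(w_{\mathrm{prod}})$, while
\[ \nabla \mathcal{R}(w_{\mathrm{prod}}) = \frac{1}{n} \sum_{i=1}^n \ell'(\langle w_{\mathrm{prod}}, z_i \rangle) z_i = \frac{1}{n} \sum_{i=1}^n \ell'(W_L \cdots W_1 z_i) z_i \]
is in $\mathbb{R}^d$ and different from $\nabla \mathcal{R}(W)$, which has $\sum_{k=1}^L d_k d_{k-1}$ entries, as given below:
\[ \frac{\partial \mathcal{R}}{\partial W_k} = W_{k+1}^\top \cdots W_L^\top \, \nabla \mathcal{R}(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_{k-1}^\top. \]
Furthermore, for any $R > 0$, let
\[ B(R) = \left\{ W \;\middle|\; \max_{1 \le k \le L} \|W_k\|_F \le R \right\}. \]

The first lemma shows that for any $R > 0$, the time spent by gradient flow in $B(R)$ is finite.

Lemma 2.3. Under Assumptions 1.1 and 1.2, for any $R > 0$, there exists a constant $\epsilon(R) > 0$, such that for any $t \ge 1$ and any $W \in B(R)$, $\|\partial \mathcal{R} / \partial W_1\|_F \ge \epsilon(R)$. As a result, gradient flow spends a finite amount of time in $B(R)$ for any $R > 0$, and $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded.

Here is a proof sketch. If all $\|W_k\|_F$ are bounded, then $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant. Therefore, if $\|\partial \mathcal{R} / \partial W_1\|_F = \|W_L \cdots W_2\| \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ can be arbitrarily small, then $\|W_L \cdots W_2\|$ and $\|w_{\mathrm{prod}}\|$ can also be arbitrarily small, and thus $\mathcal{R}(W)$ can be arbitrarily close to $\mathcal{R}(0)$. This cannot happen after $t = 1$, otherwise it will contradict Assumption 1.2 and eq. (2.1).

To proceed, we need the following properties of linear networks from prior work (Arora et al., 2018; Du et al., 2018). For any time $t \ge 0$ and any $1 \le k < L$,
\[ W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{2.4} \]
To see this, just notice that
\[ W_{k+1}^\top \frac{\partial \mathcal{R}}{\partial W_{k+1}} = W_{k+1}^\top \cdots W_L^\top \, \nabla \mathcal{R}(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_k^\top = \frac{\partial \mathcal{R}}{\partial W_k} W_k^\top. \]
Taking the trace on both sides of eq. (2.4), we have
\[ \|W_{k+1}(t)\|_F^2 - \|W_{k+1}(0)\|_F^2 = \|W_k(t)\|_F^2 - \|W_k(0)\|_F^2. \tag{2.5} \]
In other words, the difference between the squares of Frobenius norms of any two layers remains a constant. Together with Lemma 2.3, it implies that all $\|W_k\|_F$ are unbounded.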
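The norm-difference invariance stated above holds exactly only for gradient flow, but it can be checked approximately with small-step gradient descent. The following is a minimal sketch under assumptions of my own choosing (NumPy, logistic loss, synthetic separable data, hypothetical helper names `chain`, `layer_grads`, `gaps`): it tracks $\|W_{k+1}\|_F^2 - \|W_k\|_F^2$ before and after training and measures how far the gaps drift.

```python
import numpy as np

def chain(mats, size):
    # Product M_m ... M_1 of [M_1, ..., M_m]; identity of the given size if empty.
    if not mats:
        return np.eye(size)
    P = mats[0]
    for M in mats[1:]:
        P = M @ P
    return P

def layer_grads(Ws, Z):
    # Per-layer gradients of the logistic risk of the deep linear network.
    d = Z.shape[1]
    w = chain(Ws, d).ravel()
    g = (-(1.0 / (1.0 + np.exp(Z @ w)))[:, None] * Z).mean(axis=0)
    out = []
    for k in range(len(Ws)):
        above = chain(Ws[k + 1:], Ws[k].shape[0])   # layers above W_{k+1}
        below = chain(Ws[:k], Ws[k].shape[1])       # layers below W_{k+1}
        out.append(above.T @ g[None, :] @ below.T)
    return out

def gaps(Ws):
    # Differences of squared Frobenius norms between adjacent layers.
    sq = [np.sum(W ** 2) for W in Ws]
    return np.array([sq[k + 1] - sq[k] for k in range(len(Ws) - 1)])

rng = np.random.default_rng(1)
n, d = 30, 3
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
Z = np.sign(X[:, 0])[:, None] * X      # separable by the first coordinate

dims = [d, 5, 4, 1]
Ws = [0.3 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]
gaps0 = gaps(Ws)

eta = 1e-3                              # small step approximates the flow
for _ in range(5000):
    Ws = [W - eta * G for W, G in zip(Ws, layer_grads(Ws, Z))]

drift = np.max(np.abs(gaps(Ws) - gaps0))
```

The residual `drift` comes purely from discretization: each step contributes a term of order $\eta^2$, which is the same error term that Section 3 controls for gradient descent.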
However, even if $\|W_k\|_F$ are large, it does not necessarily follow that $\|w_{\mathrm{prod}}\|$ is also large. Lemma 2.6 shows that this is indeed true: for gradient flow, as $\|W_k\|_F$ become larger, adjacent layers also get more aligned to each other, which ensures that their product also has a large norm.

For $1 \le k \le L$, let $\sigma_k$, $u_k$, and $v_k$ denote the first singular value (the 2-norm), the first left singular vector, and the first right singular vector of $W_k$, respectively. Furthermore, define
\[ D := \left( \max_{1 \le k \le L} \|W_k(0)\|_F^2 \right) - \|W_L(0)\|_F^2 + \sum_{k=1}^{L-1} \left\| W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0) \right\|_2, \]
which depends only on the initialization. If for all $1 \le k < L$ it holds that $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then $D = 0$.
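The quantities just defined are straightforward to compute. The sketch below is illustrative only (NumPy; the function names `top_triplet`, `alignment_report`, and `D_constant` are hypothetical): it extracts the leading singular triplet of each layer, reports the ratio $\|W_k\|_2 / \|W_k\|_F$ and the cosine $|\langle v_{k+1}, u_k \rangle|$ between adjacent layers, and evaluates $D$ for an initialization.

```python
import numpy as np

def top_triplet(W):
    # Leading singular value and the matching left/right singular vectors.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return s[0], U[:, 0], Vt[0]

def alignment_report(Ws):
    # Ws = [W_1, ..., W_L]. Returns the spectral/Frobenius ratios per layer
    # and |<v_{k+1}, u_k>| for each pair of adjacent layers.
    trip = [top_triplet(W) for W in Ws]
    ratios = [s / np.linalg.norm(W) for (s, _, _), W in zip(trip, Ws)]
    cosines = [abs(trip[k + 1][2] @ trip[k][1]) for k in range(len(Ws) - 1)]
    return ratios, cosines

def D_constant(Ws0):
    # D = max_k ||W_k(0)||_F^2 - ||W_L(0)||_F^2
    #     + sum_k || W_k(0) W_k(0)^T - W_{k+1}(0)^T W_{k+1}(0) ||_2
    fro2 = [np.sum(W ** 2) for W in Ws0]
    mismatch = sum(
        np.linalg.norm(Ws0[k] @ Ws0[k].T - Ws0[k + 1].T @ Ws0[k + 1], 2)
        for k in range(len(Ws0) - 1)
    )
    return max(fro2) - fro2[-1] + mismatch

# A balanced, perfectly aligned two-layer example: W_1 = 2 u v^T, W_2 = 2 u^T.
u = np.array([1.0, 2.0, 2.0, 0.0]) / 3.0
v = np.array([0.6, 0.8, 0.0])
balanced = [2.0 * np.outer(u, v), 2.0 * u[None, :]]
ratios, cosines = alignment_report(balanced)
D_balanced = D_constant(balanced)

# A generic random layer for contrast: its ratio lies in [1/sqrt(3), 1].
rng = np.random.default_rng(0)
random_ratio = alignment_report([rng.normal(size=(4, 3))])[0][0]
```

In the balanced example every ratio and cosine equals 1 up to rounding and `D_balanced` vanishes, which is the idealized situation the lemma below perturbs around.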
Lemma 2.6. The gradient flow iterates satisfy the following properties:
- For any $1 \le k \le L$, $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$.
- For any $1 \le k < L$, $\langle v_{k+1}, u_k \rangle^2 \ge 1 - \left( D + \|W_{k+1}(0)\|_2^2 + \|W_k(0)\|_2^2 \right) / \|W_{k+1}\|_2^2$.
- Suppose $\max_{1 \le k \le L} \|W_k\|_F \to \infty$, then $\left| \left\langle w_{\mathrm{prod}} / \prod_{k=1}^L \|W_k\|_F, v_1 \right\rangle \right| \to 1$.

The proof is based on eq. (2.4) and eq. (2.5). If $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then eq. (2.4) gives that $W_{k+1}$ and $W_k$ have the same singular values, and $W_{k+1}$'s right singular vectors and $W_k$'s left singular vectors are the same. If it is true for any two adjacent layers, since $W_L$ is a row vector, all layers have rank 1. With general initialization, we have similar results when $\|W_k\|_F$ is large enough so that the initialization is negligible. Careful calculations give the exact results in Lemma 2.6.

An interesting point is that the implicit regularization result in Lemma 2.6 helps establish risk convergence in Theorem 2.2. Specifically, by Lemma 2.6, if all layers have large norms, $\|W_L \cdots W_2\|$ will be large. If the risk is not minimized to 0, $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant, and thus $\|\partial \mathcal{R} / \partial W_1\|_F = \|W_L \cdots W_2\| \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ will be large. Invoking eq. (2.1), Lemma 2.3 and eq. (2.5) gives a contradiction. Since the risk has no finite optimum, $\|W_k\|_F \to \infty$.

2.2 Convergence to the maximum margin solution

Here we focus on the exponential loss $\ell_{\exp}(x) = e^{-x}$ and the logistic loss $\ell_{\log}(x) = \ln(1 + e^{-x})$. In addition to risk convergence, these two losses also enable gradient descent to find the maximum margin solution.

To get such a strong convergence, we need one more assumption on the data set. Recall that $\gamma = \max_{\|u\|=1} \min_{1 \le i \le n} \langle u, z_i \rangle > 0$ denotes the maximum margin, and $\bar u$ denotes the unique maximum margin predictor which attains this margin $\gamma$. Those data points $z_i$ for which $\langle \bar u, z_i \rangle = \gamma$ are called support vectors.

Assumption 2.7. The support vectors span the whole space $\mathbb{R}^d$.
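To make the margin, the maximum margin predictor, and the support vectors concrete, here is a small sketch for the special case $d = 2$. It is an illustration under assumptions, not a general solver: it searches a grid of unit vectors (so in general it only approximates $\bar u$ to grid resolution), the function names are hypothetical, and the three data points are made up so that the answer is known by construction.

```python
import numpy as np

def max_margin_2d(Z, num_angles=3600):
    # Grid search over unit vectors u = (cos t, sin t) for
    # max_u min_i <u, z_i>. Accuracy is limited by the grid spacing.
    thetas = np.linspace(0.0, 2.0 * np.pi, num_angles, endpoint=False)
    U = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
    margins = (U @ Z.T).min(axis=1)     # margin of each candidate direction
    best = int(np.argmax(margins))
    return U[best], margins[best]

def support_vectors(Z, u_bar, gamma, tol=1e-9):
    # Indices i with <u_bar, z_i> equal to the margin (up to tol).
    return [i for i, z in enumerate(Z) if abs(z @ u_bar - gamma) <= tol]

# Rows are z_i = y_i x_i (hypothetical data with ||z_i|| <= 1).
Z = np.array([[0.5, 0.25],
              [0.5, -0.25],
              [1.0, 0.0]])
u_bar, gamma = max_margin_2d(Z)
S = support_vectors(Z, u_bar, gamma)

# Assumption 2.7 asks that the support vectors span R^d.
spans_space = np.linalg.matrix_rank(Z[S]) == Z.shape[1]
```

For this data the maximizer is $\bar u = (1, 0)$ with $\gamma = 0.5$, the first two points are the support vectors, and they span $\mathbb{R}^2$, so the spanning assumption holds. For higher dimensions a hard-margin SVM solver would replace the grid search.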
Assumption 2.7 appears in prior work (Soudry et al., 2017), and can be satisfied in many cases: for example, it is almost surely true if the number of support vectors is larger than or equal to $d$ and the data set is sampled from some density w.r.t. the Lebesgue measure. It can also be relaxed to the situation that the support vectors and the whole data set span the same space; in this case $\nabla \mathcal{R}(w_{\mathrm{prod}})$ will never leave this space, and we can always restrict our attention to this space.

With Assumption 2.7, we can state the main theorem.

Theorem 2.8. Under Assumptions 1.2 and 2.7, for almost all data and for losses $\ell_{\exp}$ and $\ell_{\log}$, $\lim_{t \to \infty} |\langle v_1, \bar u \rangle| = 1$, where $v_1$ is the first right singular vector of $W_1$. As a result, $\lim_{t \to \infty} w_{\mathrm{prod}} / \prod_{k=1}^L \|W_k\|_F = \bar u$.

Theorem 2.8 relies on two structural lemmas. The first one is based on a similar almost-all argument due to Soudry et al. (2017, Lemma 8). Let $S \subset \{1, \ldots, n\}$ denote the set of indices of support vectors.

Lemma 2.9. Under Assumption 2.7, if the data set is sampled from some density w.r.t. the Lebesgue measure, then with probability 1,
\[ \alpha := \min_{\|\xi\| = 1, \ \xi \perp \bar u} \ \max_{i \in S} \langle \xi, z_i \rangle > 0. \]

Let $\bar u^\perp$ denote the orthogonal complement of $\mathrm{span}(\bar u)$, and let $\Pi_\perp$ denote the projection onto $\bar u^\perp$.

Lemma 2.10. Under Assumption 2.7, for almost all data, losses $\ell_{\exp}$ and $\ell_{\log}$, and any $w \in \mathbb{R}^d$, if $\langle w, \bar u \rangle \ge 0$ and $\|\Pi_\perp w\|$ is larger than $(1 + \ln(n)) / \alpha$ for $\ell_{\exp}$ or $2n / (e\alpha)$ for $\ell_{\log}$, then $\langle \Pi_\perp w, \nabla \mathcal{R}(w) \rangle \ge 0$.

With Lemma 2.9 and Lemma 2.10 in hand, we can prove Theorem 2.8. Let $\Pi_\perp W_1$ denote the projection of rows of $W_1$ onto $\bar u^\perp$. Notice that
\[ \Pi_\perp w_{\mathrm{prod}} = \left( W_L \cdots W_2 (\Pi_\perp W_1) \right)^\top \quad \text{and} \quad \frac{d \|\Pi_\perp W_1\|_F^2}{dt} = -2 \langle \Pi_\perp w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \rangle. \]
If $\|\Pi_\perp W_1\|_F$ is large compared with $\|W_1\|_F$, since layers become aligned, $\|\Pi_\perp w_{\mathrm{prod}}\|$ will also be large, and then Lemma 2.10 implies that $\|\Pi_\perp W_1\|_F$ will not increase. At the same time, $\|W_1\|_F \to \infty$, and thus for large enough $t$, $\|\Pi_\perp W_1\|_F$ must be very small compared with $\|W_1\|_F$. Many details need to be handled to make this intuition exact; the proof is given in Appendix B.
3 Results for gradient descent

One key property of gradient flow which is used in the previous proofs is that it never increases the risk, which is not necessarily true for gradient descent. However, for smooth losses (i.e., with Lipschitz continuous derivatives), we can design some decaying step sizes, with which gradient descent never increases the risk, and basically the same results hold as in the gradient flow case. Deferred proofs are given in Appendix C.

We make the following additional assumption on the loss, which is satisfied by the logistic loss $\ell_{\log}$.

Assumption 3.1. $\ell'$ is $\beta$-Lipschitz (i.e., $\ell$ is $\beta$-smooth), and $|\ell'| \le G$ (i.e., $\ell$ is $G$-Lipschitz).

Under Assumption 3.1, the risk is also a smooth function of $W$, if all layers are bounded.

Lemma 3.2. Suppose $\ell$ is $\beta$-smooth. For any $R \ge 1$, the risk $\mathcal{R}$ is a $\beta(R)$-smooth function on the set $B(R) = \{ W \mid \|W_k\|_F \le R, \ 1 \le k \le L \}$, where $\beta(R) = L^2 R^{2L-2} (\beta + G)$.

Smoothness ensures that for any $W, V \in B(R)$, $\mathcal{R}(W) - \mathcal{R}(V) \le \langle \nabla \mathcal{R}(V), W - V \rangle + \beta(R) \|W - V\|^2 / 2$ (see Bubeck et al., 2015, Lemma 3.4). In particular, if we choose some $R$ and set a constant step size $\eta_t = 1/\beta(R)$, then as long as $W(t+1)$ and $W(t)$ are both in $B(R)$,
\[ \mathcal{R}(W(t+1)) - \mathcal{R}(W(t)) \le \left\langle \nabla \mathcal{R}(W(t)), -\eta_t \nabla \mathcal{R}(W(t)) \right\rangle + \frac{\beta(R) \eta_t^2}{2} \|\nabla \mathcal{R}(W(t))\|^2 = -\frac{1}{2\beta(R)} \|\nabla \mathcal{R}(W(t))\|^2 = -\frac{\eta_t}{2} \|\nabla \mathcal{R}(W(t))\|^2. \tag{3.3} \]
In other words, the risk does not increase at this step. However, similar to gradient flow, the gradient descent iterate will eventually escape $B(R)$, which may increase the risk.

Lemma 3.4. Under Assumptions 1.1 and 1.2, suppose gradient descent is run with a constant step size $1/\beta(R)$. Then there exists a time $t$ when $W(t) \notin B(R)$, in other words, $\max_{1 \le k \le L} \|W_k(t)\|_F > R$.

Fortunately, this issue can be handled by adaptively increasing $R$ and correspondingly decreasing the step sizes, formalized in the following assumption.

Assumption 3.5. The step size $\eta_t = \min\{1/\beta(R_t), 1\}$, where $R_t$ satisfies $W(t) \in B(R_t - 1)$, and if $W(t+1) \in B(R_t - 1)$, $R_{t+1} = R_t$.
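One way to realize such a step size rule is sketched below. Everything here is an assumption made for illustration: the assumption is read as requiring $W(t) \in B(R_t - 1)$; the smoothness constant is taken to be $\beta(R) = L^2 R^{2L-2}(\beta + G)$ with $\beta = 1/4$ and $G = 1$ for the logistic loss; the rule "reset $R_t$ to the current maximum norm plus 2 whenever the iterate leaves $B(R_t - 1)$" is one hypothetical choice among many; and the helper names are invented.

```python
import numpy as np

def chain(mats, size):
    # Product M_m ... M_1 of [M_1, ..., M_m]; identity of the given size if empty.
    if not mats:
        return np.eye(size)
    P = mats[0]
    for M in mats[1:]:
        P = M @ P
    return P

def risk_and_grads(Ws, Z):
    # Logistic risk of the deep linear network and its per-layer gradients.
    w = chain(Ws, Z.shape[1]).ravel()
    m = Z @ w
    value = np.logaddexp(0.0, -m).mean()
    g = (-(1.0 / (1.0 + np.exp(m)))[:, None] * Z).mean(axis=0)
    grads = []
    for k in range(len(Ws)):
        above = chain(Ws[k + 1:], Ws[k].shape[0])
        below = chain(Ws[:k], Ws[k].shape[1])
        grads.append(above.T @ g[None, :] @ below.T)
    return value, grads

def beta_of_R(R, L, beta=0.25, G=1.0):
    # Assumed smoothness constant on B(R); valid for R >= 1.
    return L ** 2 * R ** (2 * L - 2) * (beta + G)

def run_gd(Ws, Z, steps):
    L = len(Ws)
    R_t = max(np.linalg.norm(W) for W in Ws) + 2.0
    history = []
    for _ in range(steps):
        max_norm = max(np.linalg.norm(W) for W in Ws)
        if max_norm > R_t - 1.0:        # iterate left B(R_t - 1): enlarge R_t
            R_t = max_norm + 2.0
        eta = min(1.0 / beta_of_R(R_t, L), 1.0)
        value, grads = risk_and_grads(Ws, Z)
        history.append((value, R_t, eta))
        Ws = [W - eta * G_k for W, G_k in zip(Ws, grads)]
    return Ws, history

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
Z = np.sign(X[:, 0])[:, None] * X
dims = [3, 4, 4, 1]
Ws0 = [0.3 * rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]
Ws_final, history = run_gd(Ws0, Z, steps=500)
```

The schedule is conservative by design: because $\beta(R_t)$ grows like $R_t^{2L-2}$, the steps shrink quickly as the weights grow, which keeps the risk non-increasing at the cost of slow progress.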
Assumption 3.5 can be satisfied by a line search, which ensures that the gradient descent update is not too aggressive and the boundary $R$ is increased properly. With the additional Assumptions 3.1 and 3.5, exactly the same theorems can be proved for gradient descent. We restate them briefly here.

Theorem 3.6. Under Assumptions 1.1, 1.2 and 3.5, gradient descent satisfies
- $\lim_{t \to \infty} \mathcal{R}(W(t)) = 0$.
- For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k(t)\|_F = \infty$.
- $\lim_{t \to \infty} \left| \left\langle w_{\mathrm{prod}}(t) / \prod_{k=1}^L \|W_k(t)\|_F, v_1(t) \right\rangle \right| = 1$, where $v_1(t)$ is the first right singular vector of $W_1(t)$.

Theorem 3.7. Under Assumptions 1.2, 3.5 and 2.7, for the logistic loss $\ell_{\log}$ and almost all data, $\lim_{t \to \infty} |\langle v_1(t), \bar u \rangle| = 1$, and $\lim_{t \to \infty} w_{\mathrm{prod}}(t) / \prod_{k=1}^L \|W_k(t)\|_F = \bar u$.

Proofs of Theorems 3.6 and 3.7 are given in Appendix C, and are basically the same as the gradient flow proofs. The key difference is that an error of $\sum_{t=0}^\infty \eta_t^2 \|\nabla \mathcal{R}(W(t))\|^2$ will be introduced in many parts of the proof. However, it is bounded in light of eq. (3.3):
\[ \sum_{t=0}^\infty \eta_t^2 \|\nabla \mathcal{R}(W(t))\|^2 \le \sum_{t=0}^\infty \eta_t \|\nabla \mathcal{R}(W(t))\|^2 \le 2 \mathcal{R}(W(0)). \]
Since all weight matrices go to infinity, such a bounded error does not matter asymptotically, and thus proofs still go through.
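The bound on the accumulated error can be unpacked as a short telescoping argument. The derivation below is a sketch that assumes the per-step decrease of eq. (3.3) holds at every step with step sizes $\eta_t \le 1$, and uses only that the risk is nonnegative; $T$ is an arbitrary horizon introduced here for the argument.

```latex
% Per-step decrease, rearranged:
\frac{\eta_t}{2}\,\|\nabla\mathcal{R}(W(t))\|^2
  \;\le\; \mathcal{R}(W(t)) - \mathcal{R}(W(t+1)).

% Summing over t = 0, ..., T-1 telescopes the right-hand side:
\sum_{t=0}^{T-1} \eta_t\,\|\nabla\mathcal{R}(W(t))\|^2
  \;\le\; 2\bigl(\mathcal{R}(W(0)) - \mathcal{R}(W(T))\bigr)
  \;\le\; 2\,\mathcal{R}(W(0)),
  \qquad\text{since } \mathcal{R}\ge 0.

% Finally, \eta_t \le 1 gives \eta_t^2 \le \eta_t, so letting T \to \infty:
\sum_{t=0}^{\infty} \eta_t^2\,\|\nabla\mathcal{R}(W(t))\|^2
  \;\le\; \sum_{t=0}^{\infty} \eta_t\,\|\nabla\mathcal{R}(W(t))\|^2
  \;\le\; 2\,\mathcal{R}(W(0)).
```

The cap $\eta_t \le 1$ in the step size rule is what makes the first inequality of the last line available.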
[Figure 2: Risk and alignment of dense layers (the ratio $\|W_i\|_2 / \|W_i\|_F$, plotted for $W_1$, $W_2$, $W_3$ alongside the risk) of (nonlinear!) AlexNet on CIFAR-10. (a) Default initialization: uses default PyTorch initialization. (b) Initialization with the same Frobenius norm: forces initial Frobenius norms to be equal amongst dense layers.]

4 Summary and future directions

This paper rigorously proves that, for deep linear networks on linearly separable data, gradient flow and gradient descent minimize the risk to 0, align adjacent weight matrices, and align the first right singular vector of the first layer to the maximum margin solution determined by the data. There are many potential future directions; a few are as follows.

Convergence rate. This paper only proves asymptotic convergence with no convergence rate. A convergence rate would allow the algorithm to be compared to other methods which also globally optimize this objective, would also suggest ways to improve step sizes and initialization, and ideally even exhibit a sensitivity to the network architecture and suggest how it could be improved.

Nonseparable data and nonlinear networks. Real-world data is generally not linearly separable, but nonlinear deep networks can reliably decrease the risk to 0, even with random labels (Zhang et al., 2017). This seems to suggest that a nonlinear notion of separability is at play; is there some way to adapt the present analysis?

The present analysis is crucially tied to the alignment of weight matrices: alignment and risk are analyzed simultaneously. Motivated by this, consider a preliminary experiment, presented in Figure 2, where stochastic gradient descent was used to minimize the risk of a standard AlexNet on CIFAR-10 (Krizhevsky et al., 2012; Krizhevsky and Hinton, 2009).
Even though there are ReLUs, max-pooling layers, and convolutional layers, the alignment phenomenon is occurring in a reduced form on the dense layers (the last three layers of the network). Specifically, despite these weight matrices having shape (1024, 4096), (4096, 4096), and (4096, 10), the key alignment ratios $\|W_i\|_2 / \|W_i\|_F$ are much larger than their respective lower bounds $(1024^{-1/2}, 4096^{-1/2}, 10^{-1/2})$. Two initializations were tried: default PyTorch initialization, and a Gaussian initialization forcing all initial Frobenius norms to be just 4, which is suggested by the norm preservation property in the analysis and removes noise in the weights.

Acknowledgements

The authors are grateful for support from the NSF under grant IIS. This grant allowed them to focus on research, and when combined with a gracious NVIDIA GPU grant, led to the creation of their beloved GPU machine DutchCrunch.
References

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint, 2018.

Peter Bartlett, Dylan Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. NIPS, 2017.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint, 2018.

Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint, 2018.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint, 2018.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

Thomas Laurent and James von Brecht. Deep linear neural networks with arbitrary loss: All local minima are global. arXiv preprint arXiv:1712.01473, 2017.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint, 2016.

Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint, 2017.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.

Yi Zhou and Yingbin Liang. Critical points of linear neural networks: Analytical forms and landscape properties.
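A Regarding Assumption 1.2

Suppose $W_1(0) = 0$ while $W_L(0) \cdots W_2(0) \ne 0$. First of all, $W_L(0) \cdots W_1(0) = 0$ and thus $\mathcal{R}(W(0)) = \mathcal{R}(0)$. Moreover,
\[ \left\langle \nabla \mathcal{R}(w_{\mathrm{prod}}(0)), \bar u \right\rangle = \frac{1}{n} \sum_{i=1}^n \ell'(0) \langle z_i, \bar u \rangle \le \ell'(0) \gamma < 0, \]
which implies $\nabla \mathcal{R}(w_{\mathrm{prod}}(0)) \ne 0$ and $\partial \mathcal{R} / \partial W_1 = \left( W_L(0) \cdots W_2(0) \right)^\top \nabla \mathcal{R}(w_{\mathrm{prod}}(0))^\top \ne 0$.

On the other hand, if $\mathcal{R}(W(0)) > \mathcal{R}(0)$, gradient flow/descent may never minimize the risk to 0. For example, suppose the network has two layers, and both the input and output have dimension 1; the network just computes the dot product of two vectors $w_1$ and $w_2$. Consider minimizing $\mathcal{R}(w_1, w_2) = \exp(-\langle w_1, w_2 \rangle)$. If $w_1(0) = -w_2(0) \ne 0$, then $\mathcal{R}(w_1(0), w_2(0)) = \exp(\|w_1(0)\|^2) > \exp(0)$. It is easy to verify that for any $t$, $w_1(t) = -w_2(t)$, and $\mathcal{R}(w_1(t), w_2(t)) \ge \exp(0) > 0$.

B Omitted proofs from Section 2

Proof of Lemma 2.3. Fix an arbitrary $R > 0$. If the claim is not true, then for any $\epsilon > 0$, there exists some $t \ge 1$ such that $\|W_k\|_F \le R$ for all $k$ while $\|\partial \mathcal{R} / \partial W_1\|_F \le \epsilon$, which means
\[ \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F = \left\| W_2^\top \cdots W_L^\top \nabla \mathcal{R}(w_{\mathrm{prod}})^\top \right\|_F = \|W_L \cdots W_2\| \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\| \le \epsilon. \]
Since $\|w_{\mathrm{prod}}\| \le R^L$, we have
\[ \left\langle \nabla \mathcal{R}(w_{\mathrm{prod}}), \bar u \right\rangle = \frac{1}{n} \sum_{i=1}^n \ell'(\langle w_{\mathrm{prod}}, z_i \rangle) \langle z_i, \bar u \rangle \le \frac{1}{n} \sum_{i=1}^n \ell'(\langle w_{\mathrm{prod}}, z_i \rangle) \gamma \le -M\gamma, \]
where $M = -\max_{-R^L \le x \le R^L} \ell'(x)$. Since $\ell'$ is continuous and the domain is bounded, the maximum is attained and negative, and thus $M > 0$. Therefore $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\| \ge M\gamma$, and thus $\|W_L \cdots W_2\| \le \epsilon / (M\gamma)$. Since $\|W_1\|_F \le R$, we further have $\|w_{\mathrm{prod}}\| \le \epsilon R / (M\gamma)$. In other words, after $t = 1$, $\|w_{\mathrm{prod}}\|$ may be arbitrarily small, which implies $\mathcal{R}(w_{\mathrm{prod}})$ can be arbitrarily close to $\mathcal{R}(0)$.

On the other hand, by Assumption 1.2, $d\mathcal{R}(W)/dt = -\|\nabla \mathcal{R}(W)\|^2 < 0$ at $t = 0$. This implies that $\mathcal{R}(W(1)) < \mathcal{R}(W(0))$, and for any $t \ge 1$, $\mathcal{R}(W(t)) \le \mathcal{R}(W(1)) < \mathcal{R}(W(0)) \le \mathcal{R}(0)$, which is a contradiction.

Since the risk is always positive, we have
\begin{align*}
\mathcal{R}(W(0)) &\ge \int_{t=0}^\infty \sum_{k=1}^L \left\| \frac{\partial \mathcal{R}}{\partial W_k} \right\|_F^2 dt \ge \int_{t=0}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 dt \\
&\ge \int_{t=0}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] dt \\
&\ge \int_{t=1}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] dt \\
&\ge \epsilon(R)^2 \int_{t=1}^\infty \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] dt,
\end{align*}
which implies gradient flow only spends a finite amount of time in $\{ W \mid \max_{1 \le k \le L} \|W_k\|_F \le R \}$. This directly implies that $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded.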
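Proof of Lemma 2.6. The first claim is true for $k = L$ since $W_L$ is a row vector. For any $1 \le k < L$, recall the following relation (Arora et al., 2018; Du et al., 2018):
\[ W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{B.1} \]
Let $A_{k,k+1} = W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0)$. By eq. (B.1) and the definition of singular vectors and singular values, we have
\begin{align*}
\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} &= v_{k+1}^\top W_{k+1}^\top W_{k+1} v_{k+1} + v_{k+1}^\top A_{k,k+1} v_{k+1} \\
&= \sigma_{k+1}^2 + v_{k+1}^\top A_{k,k+1} v_{k+1} \ge \sigma_{k+1}^2 - \|A_{k,k+1}\|_2. \tag{B.2}
\end{align*}
Moreover, by taking the trace on both sides of eq. (B.1), we have
\begin{align*}
\|W_k\|_F^2 = \mathrm{tr}\left( W_k W_k^\top \right) &= \mathrm{tr}\left( W_{k+1}^\top W_{k+1} \right) + \mathrm{tr}\left( W_k(0) W_k^\top(0) \right) - \mathrm{tr}\left( W_{k+1}^\top(0) W_{k+1}(0) \right) \\
&= \|W_{k+1}\|_F^2 + \|W_k(0)\|_F^2 - \|W_{k+1}(0)\|_F^2. \tag{B.3}
\end{align*}
Summing eq. (B.2) and eq. (B.3) from $k$ to $L-1$, we get
\[ \|W_k\|_F^2 - \|W_k\|_2^2 \le \|W_k(0)\|_F^2 - \|W_L(0)\|_F^2 + \sum_{k'=k}^{L-1} \|A_{k',k'+1}\|_2 \le D. \tag{B.4} \]

Next we prove that singular vectors get aligned. Consider $u_k^\top W_{k+1}^\top W_{k+1} u_k$. On one hand, similarly to eq. (B.2),
\begin{align*}
u_k^\top W_{k+1}^\top W_{k+1} u_k &= u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k + u_k^\top W_{k+1}^\top(0) W_{k+1}(0) u_k \\
&\ge u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k \ge \sigma_k^2 - \|W_k(0)\|_2^2. \tag{B.5}
\end{align*}
On the other hand, it follows from the definition of singular vectors and eq. (B.4) that
\begin{align*}
u_k^\top W_{k+1}^\top W_{k+1} u_k &= \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + u_k^\top \left( W_{k+1}^\top W_{k+1} - \sigma_{k+1}^2 v_{k+1} v_{k+1}^\top \right) u_k \\
&\le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + \|W_{k+1}\|_F^2 - \|W_{k+1}\|_2^2 \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D. \tag{B.6}
\end{align*}
Combining eq. (B.5) and eq. (B.6), we get
\[ \sigma_k^2 \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D + \|W_k(0)\|_2^2. \tag{B.7} \]
Similar to eq. (B.5), we can get
\[ \sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} \ge \sigma_{k+1}^2 - \|W_{k+1}(0)\|_2^2. \]
Therefore
\[ \frac{\sigma_k^2}{\sigma_{k+1}^2} \ge 1 - \frac{\|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}. \tag{B.8} \]
Combining eq. (B.7) and eq. (B.8), we finally get
\[ \langle u_k, v_{k+1} \rangle^2 \ge 1 - \frac{D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}. \]

Regarding the last claim, first recall that since the difference between the squares of Frobenius norms of any two layers is a constant, $\max_{1 \le k \le L} \|W_k\|_F \to \infty$ implies $\|W_k\|_F \to \infty$ for any $k$. We further have the following.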
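- Since $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$, $\|W_k\|_2 \to \infty$ for any $k$, and $\left\| W_k / \|W_k\|_F - u_k v_k^\top \right\|_F \to 0$.
- Since $\|W_{k+1}\|_2 \to \infty$, $|\langle u_k, v_{k+1} \rangle| \to 1$.

As a result,
\[ \left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^L \|W_k\|_F}, v_1 \right\rangle \right| = \left| \frac{W_L}{\|W_L\|_F} \cdots \frac{W_1}{\|W_1\|_F} v_1 \right|, \]
which has the same limit as
\[ \left| u_L v_L^\top \cdots u_1 v_1^\top v_1 \right| = \prod_{k=1}^{L-1} |\langle v_{k+1}, u_k \rangle| \to 1. \]

Proof of Theorem 2.2. Suppose for some $\epsilon > 0$, $\mathcal{R}(W) \ge \epsilon$ for any $t$. Then there exists some $1 \le j \le n$ such that $\ell(\langle w_{\mathrm{prod}}, z_j \rangle) \ge \epsilon$, and thus $\langle w_{\mathrm{prod}}, z_j \rangle \le \ell^{-1}(\epsilon)$. On the other hand, since $\mathcal{R}(W) \le \mathcal{R}(0) = \ell(0)$, $\ell(\langle w_{\mathrm{prod}}, z_j \rangle) \le n\ell(0)$, and thus $\langle w_{\mathrm{prod}}, z_j \rangle \ge \ell^{-1}(n\ell(0))$. Let $M = \max_{\ell^{-1}(n\ell(0)) \le x \le \ell^{-1}(\epsilon/n)} \ell'(x) < 0$, we have for any $t$,
\begin{align*}
\left\langle \nabla \mathcal{R}(w_{\mathrm{prod}}), \bar u \right\rangle &= \frac{1}{n} \sum_{i=1}^n \ell'(\langle w_{\mathrm{prod}}, z_i \rangle) \langle z_i, \bar u \rangle \le \frac{1}{n} \sum_{i=1}^n \ell'(\langle w_{\mathrm{prod}}, z_i \rangle) \gamma \\
&\le \frac{1}{n} \ell'(\langle w_{\mathrm{prod}}, z_j \rangle) \gamma \le \frac{M\gamma}{n} < 0,
\end{align*}
and thus $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\| \ge -M\gamma/n$.

Similar to the proof of Lemma 2.6, we can show that if $\|W_k\|_F \to \infty$,
\[ \left| \left\langle \frac{(W_L \cdots W_2)^\top}{\|W_L\|_F \cdots \|W_2\|_F}, v_2 \right\rangle \right| \to 1. \]
In other words, there exists some $C > 0$, such that when $\min_{1 \le k \le L} \|W_k\|_F > C$, $\|W_L \cdots W_2\| \ge \|W_L\|_F \cdots \|W_2\|_F / 2 > C^{L-1}/2$. Lemma 2.3 shows that gradient flow spends a finite amount of time in $\{ W \mid \max_{1 \le k \le L} \|W_k\|_F \le R \}$ for any $R > 0$. Since the difference between the squares of Frobenius norms of any two layers is a constant, gradient flow also spends a finite amount of time in $\{ W \mid \min_{1 \le k \le L} \|W_k\|_F \le C \}$. Now we have
\begin{align*}
\mathcal{R}(W(0)) &\ge \int_{t=0}^\infty \sum_{k=1}^L \left\| \frac{\partial \mathcal{R}}{\partial W_k} \right\|_F^2 dt \ge \int_{t=0}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 dt = \int_{t=0}^\infty \|W_L \cdots W_2\|^2 \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|^2 dt \\
&\ge \int_{t=0}^\infty \|W_L \cdots W_2\|^2 \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|^2 \, \mathbb{1}\left[ \min_{1 \le k \le L} \|W_k\|_F > C \right] dt \\
&\ge \left( \frac{M\gamma}{n} \right)^2 \left( \frac{C^{L-1}}{2} \right)^2 \int_{t=0}^\infty \mathbb{1}\left[ \min_{1 \le k \le L} \|W_k\|_F > C \right] dt = \infty,
\end{align*}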
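which is a contradiction. Therefore $\mathcal{R}(W) \to 0$. This further implies $\|W_k\|_F \to \infty$, since $\mathcal{R}(W)$ has no finite optimum. Finally, invoking Lemma 2.6 proves the final claim of Theorem 2.2.

Proof of Lemma 2.9. Soudry et al. (2017, Lemma 8) prove that, with probability 1, there are at most $d$ support vectors, and moreover, the $i$-th support vector $z_i$ has a positive dual variable $\alpha_i$, such that $\sum_{i \in S} \alpha_i z_i = \bar u$. Suppose there exists some unit vector $\xi \perp \bar u$, such that $\max_{i \in S} \langle \xi, z_i \rangle \le 0$. Since
\[ \sum_{i \in S} \alpha_i \langle \xi, z_i \rangle = \left\langle \xi, \sum_{i \in S} \alpha_i z_i \right\rangle = \langle \xi, \bar u \rangle = 0, \]
we actually have $\langle \xi, z_i \rangle = 0$ for all $i \in S$. This is impossible under Assumption 2.7, since the support vectors span the whole space.

Proof of Lemma 2.10. For the sake of presentation, we leave out the subscript in $z_i$ and denote a data point by $z$ generally. For any data point $z$ and predictor $w$, let $z_\perp$ and $w_\perp$ denote their projection onto $\bar u^\perp$. Let $z' \in \arg\max_{i \in S} \langle -w_\perp, z_i \rangle$, and thus $\langle -w_\perp, z' \rangle \ge \alpha \|w_\perp\|$. For $\ell_{\exp}$, we have
\begin{align*}
\langle w_\perp, \nabla \mathcal{R}(w) \rangle &= \sum_z \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z \rangle = \sum_z \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle \\
&\ge \frac{1}{n} \exp(-\langle w, z' \rangle) \langle -w_\perp, z'_\perp \rangle + \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle. \tag{B.9}
\end{align*}
The first part can be lower bounded as below (recall that $\langle -w_\perp, z' \rangle = \langle -w_\perp, z'_\perp \rangle \ge \alpha \|w_\perp\|$):
\begin{align*}
\frac{1}{n} \exp(-\langle w, z' \rangle) \langle -w_\perp, z'_\perp \rangle &= \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \exp(-\langle w_\perp, z'_\perp \rangle) \langle -w_\perp, z'_\perp \rangle \\
&\ge \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \exp(\alpha \|w_\perp\|) \, \alpha \|w_\perp\|. \tag{B.10}
\end{align*}
To bound the second part, first notice that since we assume $\langle w, \bar u \rangle \ge 0$, for any $z$,
\[ \langle w, z - \gamma \bar u \rangle = \langle w, z_\perp \rangle + \langle w, z - \gamma \bar u - z_\perp \rangle \ge \langle w, z_\perp \rangle = \langle w_\perp, z_\perp \rangle. \tag{B.11} \]
The reason is that every data point has margin at least $\gamma$, and thus $z - \gamma \bar u - z_\perp = c \bar u$ for some $c \ge 0$. Using eq. (B.11), we can bound the second part of eq. (B.9):
\begin{align*}
\sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle
&= \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \left[ -\exp(-\langle w, z - \gamma \bar u \rangle) \right] \langle w_\perp, z_\perp \rangle \\
&\ge \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \left[ -\exp(-\langle w_\perp, z_\perp \rangle) \right] \langle w_\perp, z_\perp \rangle \\
&\ge \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \left( -\frac{1}{e} \right) \\
&\ge \exp(-\langle w, \gamma \bar u \rangle) \left( -\frac{1}{e} \right). \tag{B.12}
\end{align*}
On the second line eq. (B.11) is applied. The third line applies the property that $f(x) = -x e^{-x} \ge -1/e$ when $x \ge 0$.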
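Combining eq. (B.9), eq. (B.10) and eq. (B.12), we get
\[ \langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge \exp(-\langle w, \gamma \bar u \rangle) \left( \frac{1}{n} \exp(\alpha \|w_\perp\|) \, \alpha \|w_\perp\| - \frac{1}{e} \right). \]
As long as $\|w_\perp\| \ge (1 + \ln(n)) / \alpha$, $\langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge 0$.

For $\ell_{\log}$, similar to eq. (B.9), we have
\begin{align*}
\langle w_\perp, \nabla \mathcal{R}(w) \rangle &\ge \frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle + \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \frac{-\exp(-\langle w, z \rangle)}{1 + \exp(-\langle w, z \rangle)} \langle w_\perp, z_\perp \rangle \\
&\ge \frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle + \sum_{\langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle. \tag{B.13}
\end{align*}
The second part of eq. (B.13) can be bounded again by eq. (B.12). To bound the first part of eq. (B.13), first notice that (recall $\langle w, \bar u \rangle \ge 0$)
\[ \exp(-\langle w, z' \rangle) = \exp(-\langle w, \gamma \bar u \rangle) \exp(-\langle w_\perp, z'_\perp \rangle) \le \exp(-\langle w_\perp, z'_\perp \rangle). \tag{B.14} \]
Using eq. (B.14), and recalling that $\langle -w_\perp, z' \rangle = \langle -w_\perp, z'_\perp \rangle \ge \alpha \|w_\perp\| \ge 0$, we can bound the first part of eq. (B.13) as below:
\begin{align*}
\frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle
&= \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \frac{\exp(-\langle w_\perp, z'_\perp \rangle)}{1 + \exp(-\langle w, z' \rangle)} \langle -w_\perp, z'_\perp \rangle \\
&\ge \frac{1}{n} \exp(-\langle w, \gamma \bar u \rangle) \frac{\exp(-\langle w_\perp, z'_\perp \rangle)}{1 + \exp(-\langle w_\perp, z'_\perp \rangle)} \langle -w_\perp, z'_\perp \rangle \\
&\ge \frac{1}{2n} \exp(-\langle w, \gamma \bar u \rangle) \langle -w_\perp, z'_\perp \rangle \\
&\ge \frac{1}{2n} \exp(-\langle w, \gamma \bar u \rangle) \, \alpha \|w_\perp\|. \tag{B.15}
\end{align*}
Combining eq. (B.13), eq. (B.15) and eq. (B.12), we get
\[ \langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge \exp(-\langle w, \gamma \bar u \rangle) \left( \frac{\alpha \|w_\perp\|}{2n} - \frac{1}{e} \right). \]
As long as $\|w_\perp\| \ge 2n / (e\alpha)$, $\langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge 0$.

Proof of Theorem 2.8. Recall that
\[ \frac{dW_1}{dt} = -\frac{\partial \mathcal{R}}{\partial W_1} = -W_2^\top \cdots W_L^\top \nabla \mathcal{R}(w_{\mathrm{prod}})^\top, \]
and thus
\[ \frac{d \|W_1\|_F^2}{dt} = 2 \left\langle W_1, \frac{dW_1}{dt} \right\rangle = -2 \langle w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \rangle. \tag{B.16} \]
Let $\Pi_{\bar u}$ denote the projection onto $\mathrm{span}(\bar u)$, and let $\Pi_\perp$ denote the projection onto $\bar u^\perp$. Also let $\Pi_{\bar u} W_1$ and $\Pi_\perp W_1$ denote the projection of rows of $W_1$ onto $\mathrm{span}(\bar u)$ and $\bar u^\perp$, respectively. Notice that $\Pi_{\bar u} w_{\mathrm{prod}} = \left( W_L \cdots W_2 (\Pi_{\bar u} W_1) \right)^\top$, and $\Pi_\perp w_{\mathrm{prod}} = \left( W_L \cdots W_2 (\Pi_\perp W_1) \right)^\top$. We further have
\[ \frac{d \|\Pi_\perp W_1\|_F^2}{dt} = -2 \langle \Pi_\perp w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \rangle. \tag{B.17} \]
Let $W_1 = u_1 \sigma_1 v_1^\top + S$. We have $\|S\|_2 = \sigma_{1,2}$ and $\sigma_{1,2}^2 \le \|W_1\|_F^2 - \|W_1\|_2^2 \le D$, where $\sigma_{1,2}$ is the second singular value of $W_1$ and $D$ is the constant introduced in Lemma 2.6. Then
\[ \Pi_\perp W_1 = u_1 \sigma_1 (\Pi_\perp v_1)^\top + \Pi_\perp S, \]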
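and
\[ \|\Pi_\perp W_1\|_F \le \left\| u_1 \sigma_1 (\Pi_\perp v_1)^\top \right\|_F + \|\Pi_\perp S\|_F = \sigma_1 \|\Pi_\perp v_1\| + \|\Pi_\perp S\|_F \le \sigma_1 \|\Pi_\perp v_1\| + \sqrt{dD}. \]
It follows that
\[ \|\Pi_\perp v_1\| \ge \frac{\|\Pi_\perp W_1\|_F}{\sigma_1} - \frac{\sqrt{dD}}{\sigma_1} \ge \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} - \frac{\sqrt{dD}}{\|W_1\|_2}. \tag{B.18} \]
Fix an arbitrary $\epsilon > 0$. By Theorem 2.2, we can find some $t_0$ large enough such that for any $t \ge t_0$:
1. $\sqrt{dD} / \|W_1\|_2 \le \epsilon/3$.
2. $\left\| w_{\mathrm{prod}} / (\|W_L\|_F \cdots \|W_1\|_F) - v_1 \right\| \le \epsilon/3$, or $\left\| w_{\mathrm{prod}} / (\|W_L\|_F \cdots \|W_1\|_F) + v_1 \right\| \le \epsilon/3$.
3. $\|W_L\|_F \cdots \|W_1\|_F \ge 3K/\epsilon$, where $K$ is the threshold given in Lemma 2.10, i.e., $(1 + \ln(n))/\alpha$ for $\ell_{\exp}$, $2n/(e\alpha)$ for $\ell_{\log}$.
4. $\mathcal{R}(W) \le \ell(0)/n$, which implies $\langle w_{\mathrm{prod}}, z_i \rangle \ge 0$ for all $1 \le i \le n$.

By Lemma 2.9, there always exists a support vector $z$ for which $\langle \Pi_\perp w_{\mathrm{prod}}, z \rangle \le 0$. Since $\langle w_{\mathrm{prod}}, z \rangle = \gamma \langle w_{\mathrm{prod}}, \bar u \rangle + \langle \Pi_\perp w_{\mathrm{prod}}, z \rangle \ge 0$ by bullet 4, it follows that $\langle w_{\mathrm{prod}}, \bar u \rangle \ge 0$.

Suppose for some $t \ge t_0$, $\|\Pi_\perp W_1\|_F / \|W_1\|_F \ge \epsilon$. By eq. (B.18) and bullet 1 above, $\|\Pi_\perp v_1\| \ge 2\epsilon/3$. Bullet 2 above then gives $\|\Pi_\perp w_{\mathrm{prod}}\| / (\|W_L\|_F \cdots \|W_1\|_F) \ge \epsilon/3$, which together with bullet 3 above implies $\|\Pi_\perp w_{\mathrm{prod}}\| \ge K$. Since also $\langle w_{\mathrm{prod}}, \bar u \rangle \ge 0$, we can apply Lemma 2.10 and get that $\langle \Pi_\perp w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \rangle \ge 0$. In light of eq. (B.17), $d\|\Pi_\perp W_1\|_F^2 / dt \le 0$. On the other hand, since after $t \ge t_0$, $\langle w_{\mathrm{prod}}, z_i \rangle \ge 0$, we have $d\|W_1\|_F^2 / dt \ge 0$ by eq. (B.16). Therefore $\|\Pi_\perp W_1\|_F / \|W_1\|_F$ will not increase, and since $\|W_1\|_F \to \infty$, it will eventually drop below $\epsilon$, and will never exceed $\epsilon$ again. Therefore,
\[ \limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} \le \epsilon. \]
Since $\epsilon$ is arbitrary, we have
\[ \limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} = 0, \]
and thus $\lim_{t \to \infty} |\langle v_1, \bar u \rangle| = 1$. An application of Theorem 2.2 gives the other part of Theorem 2.8.

C Omitted proofs from Section 3

Proof of Lemma 3.2. Given $W, V \in B(R)$, we need to show that $\|\nabla \mathcal{R}(W) - \nabla \mathcal{R}(V)\| \le \beta(R) \|W - V\|$ for some $\beta(R)$. Consider $k = 1$ first. Let $w = (W_L \cdots W_1)^\top$, and $v = (V_L \cdots V_1)^\top$. Since $|\ell'| \le G$, $\|\nabla \mathcal{R}(w)\|, \|\nabla \mathcal{R}(v)\| \le G$. We have
\begin{align*}
\left\| \frac{\partial \mathcal{R}}{\partial W_1} - \frac{\partial \mathcal{R}}{\partial V_1} \right\|_F
&= \left\| W_2^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\|_F \\
&\le \left\| W_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top \right\|_F \\
&\quad + \left\| V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\|_F \\
&\le R^{L-2} G \|W_2 - V_2\|_F + \left\| V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\|_F \\
&\le R^{L-2} G \|W - V\| + \left\| V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\|_F. \tag{C.1}
\end{align*}
Proceeding in this way, we can get
\[ \left\| \frac{\partial \mathcal{R}}{\partial W_1} - \frac{\partial \mathcal{R}}{\partial V_1} \right\|_F \le (L-1) R^{L-2} G \|W - V\| + R^{L-1} \|\nabla \mathcal{R}(w) - \nabla \mathcal{R}(v)\|. \tag{C.2} \]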
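Since $\|z_i\| \le 1$ and $\ell'$ is $\beta$-Lipschitz, we have
\[
\|\nabla R(w) - \nabla R(v)\| \le \beta \|w - v\| \le \beta L R^{L-1} \|W - V\|, \tag{C.3}
\]
where the last inequality follows from a similar one-by-one replacement procedure as in eq. (C.1). Combining eq. (C.2) and eq. (C.3), we get for $R \ge 1$,
\[
\left\| \frac{\partial R}{\partial W_1} - \frac{\partial R}{\partial V_1} \right\|_F \le \big( (L-1) R^{L-2} G + \beta L R^{2L-2} \big) \|W - V\| \le L R^{2L-2} (\beta + G) \|W - V\|.
\]
The same procedure can be done for other layers, and together
\[
\|\nabla R(W) - \nabla R(V)\| \le L^2 R^{2L-2} (\beta + G) \|W - V\|.
\]

Proof of Lemma 3.4. Recall that if $W(t), W(t+1) \in B(R)$ and $\eta_t = 1/\beta(R)$,
\[
\begin{aligned}
R\big(W(t+1)\big) - R\big(W(t)\big)
&\le \big\langle \nabla R\big(W(t)\big), -\eta_t \nabla R\big(W(t)\big) \big\rangle + \frac{\beta(R)\, \eta_t^2}{2} \big\| \nabla R\big(W(t)\big) \big\|^2 \\
&= -\frac{1}{2\beta(R)} \big\| \nabla R\big(W(t)\big) \big\|^2 = -\frac{\eta_t}{2} \big\| \nabla R\big(W(t)\big) \big\|^2.
\end{aligned}
\tag{C.4}
\]
Suppose $W(t) \in B(R)$ for all $t$. By Assumption . and eq. (C.4),
\[
R\big(W(1)\big) \le R\big(W(0)\big) - \frac{1}{2\beta(R)} \big\| \nabla R\big(W(0)\big) \big\|^2 < R\big(W(0)\big).
\]
By eq. (C.4), gradient descent never increases the risk, and thus for all $t \ge 1$, $R\big(W(t)\big) \le R\big(W(1)\big) < R\big(W(0)\big)$. In exactly the same way as in the proof of Lemma 2.3, one can show that there exists some constant $\epsilon(R) > 0$ such that $\|\partial R / \partial W_1(t)\|_F \ge \epsilon(R)$ for all $t \ge 1$. Invoking eq. (C.4) again, we get
\[
R\big(W(0)\big) \ge \sum_{t=0}^{\infty} \frac{1}{2\beta(R)} \big\| \nabla R\big(W(t)\big) \big\|^2 \ge \sum_{t=1}^{\infty} \frac{\epsilon(R)^2}{2\beta(R)} = \infty,
\]
which is a contradiction. Therefore $W(t)$ must go out of $B(R)$ at some time.

Next we prove Theorems 3.6 and 3.7. The proofs depend on several lemmas which are similar to the gradient flow ones. The following Lemma C.5 is similar to Lemma 2.3.

Lemma C.5. Under Assumptions . to . and 3.5, gradient descent ensures that:
- $\max_{1 \le k \le L} \|W_k(t)\|_F$ is unbounded.
- $\sum_{t=0}^{\infty} \eta_t = \infty$.
- For any $R > 0$, $\sum_{t : W(t) \in B(R)} \eta_t < \infty$.

Proof. By Assumption 3.5, we always have that $W(t) \in B(R_t)$. Since $\beta(R_t) = L^2 R_t^{2L-2} (\beta + G) \ge R_t^{L-1} G$, we have for any $1 \le k \le L$,
\[
\begin{aligned}
\|W_k(t+1)\|_F
&\le \|W_k(t)\|_F + \eta_t \left\| \frac{\partial R}{\partial W_k}(t) \right\|_F
 \le \|W_k(t)\|_F + \frac{1}{\beta(R_t)} \left\| \frac{\partial R}{\partial W_k}(t) \right\|_F \\
&\le \|W_k(t)\|_F + \frac{1}{\beta(R_t)} R_t^{L-1} G \le \|W_k(t)\|_F + 1.
\end{aligned}
\tag{C.6}
\]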
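Moreover, Lemma 3.4 shows that $R_t \to \infty$. Since $R_{t+1} = R_t$ as long as $W(t+1) \in B(R_t)$, $\max_{1 \le k \le L} \|W_k(t)\|_F$ is unbounded. It then follows that $\sum_{\tau=0}^{t-1} \eta_\tau \|\nabla R(W(\tau))\|$ is unbounded, and for any $t$, by Cauchy-Schwarz,
\[
\sum_{\tau=0}^{t-1} \eta_\tau \big\| \nabla R\big(W(\tau)\big) \big\| \le \sqrt{\sum_{\tau=0}^{t-1} \eta_\tau} \, \sqrt{\sum_{\tau=0}^{t-1} \eta_\tau \big\| \nabla R\big(W(\tau)\big) \big\|^2}.
\]
Since eq. (C.4) implies
\[
\sum_{\tau=0}^{t-1} \eta_\tau \big\| \nabla R\big(W(\tau)\big) \big\|^2 \le 2R\big(W(0)\big) - 2R\big(W(t)\big) \le 2R\big(W(0)\big),
\]
together we have $\sum_{t=0}^{\infty} \eta_t = \infty$.

Since under Assumptions 3. and 3.5 gradient descent never increases the risk, it can be shown in exactly the same way as in the proof of Lemma 2.3 that, for $W(t) \in B(R)$, $\|\partial R / \partial W_1(t)\|_F \ge \epsilon(R)$ for some constant $\epsilon(R) > 0$. Invoking eq. (C.4) again, we get that $\sum_{t : W(t) \in B(R)} \eta_t < \infty$.

The next lemma is an analog to Lemma 2.6.

Lemma C.7. Under Assumptions . and 3., the gradient descent iterates satisfy the following properties:
- For any $1 \le k \le L$, $\|W_k\|_F^2 - \|W_k\|_2^2 \le D + 2R\big(W(0)\big)$.
- For any $1 \le k < L$, $\langle v_{k+1}, u_k \rangle^2 \ge 1 - \big( D + 3R(W(0)) + \|W_{k+1}(0)\|_2^2 + \|W_k(0)\|_2^2 \big) / \|W_{k+1}\|_2^2$.
- Suppose $\max_{1 \le k \le L} \|W_k\|_F \to \infty$, then $\big| \big\langle w_{\mathrm{prod}} / \prod_{k=1}^{L} \|W_k\|_F, \, v_1 \big\rangle \big| \to 1$.

Proof. Recall that for any $W$,
\[
W_{k+1}^\top \frac{\partial R}{\partial W_{k+1}} = W_{k+1}^\top \cdots W_L^\top \nabla R(w_{\mathrm{prod}})^\top W_1^\top \cdots W_k^\top = \frac{\partial R}{\partial W_k} W_k^\top. \tag{C.8}
\]
For gradient descent iterates, summing eq. (C.8) from $0$ to $t-1$, we get
\[
\begin{aligned}
& W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) + \sum_{\tau=0}^{t-1} \eta_\tau^2 \left( \frac{\partial R}{\partial W_{k+1}}(\tau) \right)^{\!\top} \left( \frac{\partial R}{\partial W_{k+1}}(\tau) \right) \\
&\quad = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0) + \sum_{\tau=0}^{t-1} \eta_\tau^2 \left( \frac{\partial R}{\partial W_k}(\tau) \right) \left( \frac{\partial R}{\partial W_k}(\tau) \right)^{\!\top}.
\end{aligned}
\tag{C.9}
\]
For any $1 \le k \le L$ and any $t$, let
\[
P_k(t) = \sum_{\tau=0}^{t-1} \eta_\tau^2 \left( \frac{\partial R}{\partial W_k}(\tau) \right) \left( \frac{\partial R}{\partial W_k}(\tau) \right)^{\!\top},
\]
and
\[
Q_k(t) = \sum_{\tau=0}^{t-1} \eta_\tau^2 \left( \frac{\partial R}{\partial W_k}(\tau) \right)^{\!\top} \left( \frac{\partial R}{\partial W_k}(\tau) \right).
\]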
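We have $\|P_k(t)\|_2 \le \mathrm{tr}\big(P_k(t)\big)$ and $\|Q_k(t)\|_2 \le \mathrm{tr}\big(Q_k(t)\big) = \mathrm{tr}\big(P_k(t)\big)$. Moreover, invoking eq. (C.4) and $\eta_\tau \le 1$,
\[
\begin{aligned}
\sum_{k=1}^{L} \mathrm{tr}\big(P_k(t)\big)
&= \sum_{k=1}^{L} \sum_{\tau=0}^{t-1} \eta_\tau^2 \left\| \frac{\partial R}{\partial W_k}(\tau) \right\|_F^2
 = \sum_{\tau=0}^{t-1} \eta_\tau^2 \big\| \nabla R\big(W(\tau)\big) \big\|^2 \\
&\le \sum_{\tau=0}^{t-1} \eta_\tau \big\| \nabla R\big(W(\tau)\big) \big\|^2
 \le 2R\big(W(0)\big) - 2R\big(W(t)\big) \le 2R\big(W(0)\big).
\end{aligned}
\tag{C.10}
\]
Still let $\sigma_k(t)$, $u_k(t)$ and $v_k(t)$ denote the first singular value, left singular vector and right singular vector of $W_k(t)$. We can then proceed basically in the same way as in the proof of Lemma 2.6. For example, eq. (B.) becomes
\[
\sigma_k^2(t) \ge \sigma_{k+1}^2(t) - \|A_{k,k+1}(t)\|_2 - \|P_k(t)\|_2 \ge \sigma_{k+1}^2(t) - \|A_{k,k+1}(t)\|_2 - \mathrm{tr}\big(P_k(t)\big), \tag{C.11}
\]
while eq. (B.3) becomes
\[
\|W_k(t)\|_F^2 = \|W_{k+1}(t)\|_F^2 + \|W_k(0)\|_F^2 - \|W_{k+1}(0)\|_F^2 - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_{k+1}(t)\big). \tag{C.12}
\]
Summing eq. (C.11) and eq. (C.12) from $k$ to $L-1$, and invoking eq. (C.10), we get
\[
\|W_k(t)\|_F^2 - \|W_k(t)\|_2^2 \le D - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_L(t)\big) + \sum_{k'=k}^{L-1} \mathrm{tr}\big(P_{k'}(t)\big) \le D + 2R\big(W(0)\big).
\]
To prove that singular vectors get aligned, we can still proceed in nearly the same way as in the proof of Lemma 2.6. Eq. (B.5) becomes
\[
u_k^\top W_{k+1}^\top W_{k+1} u_k \ge \sigma_k^2 - \|W_k(0)\|_2^2 - \|Q_{k+1}(t)\|_2, \tag{C.13}
\]
while eq. (B.6) becomes
\[
u_k^\top W_{k+1}^\top W_{k+1} u_k \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D + 2R\big(W(0)\big). \tag{C.14}
\]
Combining eq. (C.13) and eq. (C.14),
\[
\sigma_k^2 \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D + 2R\big(W(0)\big) + \|Q_{k+1}(t)\|_2 + \|W_k(0)\|_2^2. \tag{C.15}
\]
Similar to eq. (C.13), we can get
\[
\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} \ge \sigma_{k+1}^2 - \|W_{k+1}(0)\|_2^2 - \|P_k(t)\|_2,
\]
and thus eq. (B.8) becomes
\[
\frac{\sigma_k^2}{\sigma_{k+1}^2} \ge 1 - \frac{\|W_{k+1}(0)\|_2^2 + \|P_k(t)\|_2}{\sigma_{k+1}^2}. \tag{C.16}
\]
Combining eq. (C.15) and eq. (C.16), we get
\[
\langle u_k, v_{k+1} \rangle^2 \ge 1 - \frac{D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2 + 3R\big(W(0)\big)}{\sigma_{k+1}^2}.
\]
The final claim of Lemma C.7 can be proved in exactly the same way as in Lemma 2.6.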
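The identity in eq. (C.8), and the related fact that $\langle W_k, \partial R/\partial W_k \rangle = \langle w_{\mathrm{prod}}, \nabla R(w_{\mathrm{prod}}) \rangle$ for every layer, can be checked numerically. The following is only a sketch under stated assumptions: the logistic loss, the layer widths, the random data `Z` and the helper names `chain`, `risk` and `layer_grads` are illustrative choices, not taken from the paper.

```python
import numpy as np

def chain(mats, dim):
    """Return mats[-1] @ ... @ mats[0], or the identity of size dim if mats is empty."""
    out = np.eye(dim)
    for M in mats:
        out = M @ out
    return out

def risk(Ws, Z):
    """Empirical logistic risk of the linear predictor w_prod = (W_L ... W_1)^T."""
    w = chain(Ws, Ws[0].shape[1]).ravel()
    return np.mean(np.log1p(np.exp(-Z @ w)))

def layer_grads(Ws, Z):
    """Return w_prod, grad R(w_prod), and the per-layer gradients dR/dW_k."""
    d = Ws[0].shape[1]
    w = chain(Ws, d).ravel()
    # l'(x) = -1 / (1 + exp(x)) for the logistic loss l(x) = log(1 + exp(-x)).
    g = Z.T @ (-1.0 / (1.0 + np.exp(Z @ w))) / len(Z)
    grads = []
    for k, W in enumerate(Ws):
        below = chain(Ws[:k], d)                # W_{k-1} ... W_1
        above = chain(Ws[k + 1:], W.shape[0])   # W_L ... W_{k+1}
        grads.append(above.T @ (below @ g)[None, :])
    return w, g, grads

# Hypothetical sizes: input dimension 5, hidden widths 4 and 3, scalar output.
rng = np.random.default_rng(0)
dims = [5, 4, 3, 1]
Ws = [0.5 * rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
Z = rng.normal(size=(8, 5))

w, g, grads = layer_grads(Ws, Z)

# Eq. (C.8): W_{k+1}^T dR/dW_{k+1} equals dR/dW_k W_k^T.
identity_pairs = [(Ws[k + 1].T @ grads[k + 1], grads[k] @ Ws[k].T)
                  for k in range(len(Ws) - 1)]

# <W_k, dR/dW_k> is the same for every layer and equals <w_prod, grad R(w_prod)>.
inner_products = [float(np.sum(W * G)) for W, G in zip(Ws, grads)]

# Central finite difference as a sanity check of the gradient formula itself.
eps = 1e-6
Wp = [W.copy() for W in Ws]
Wm = [W.copy() for W in Ws]
Wp[1][0, 0] += eps
Wm[1][0, 0] -= eps
fd_estimate = (risk(Wp, Z) - risk(Wm, Z)) / (2 * eps)

print(all(np.allclose(a, b) for a, b in identity_pairs))
print(inner_products, float(w @ g))
print(fd_estimate, grads[1][0, 0])
```

Because both sides of eq. (C.8) are the same matrix product, the check holds for any loss and any data, so the particular choices above do not matter.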
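Proof of Theorem 3.6. Summing eq. (C.12), we know that for any two different layers $j > k$,
\[
\|W_k(t)\|_F^2 - \|W_j(t)\|_F^2 = \|W_k(0)\|_F^2 - \|W_j(0)\|_F^2 - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_j(t)\big).
\]
Recalling eq. (C.10), we know that
\[
\Big| \big( \|W_k(t)\|_F^2 - \|W_j(t)\|_F^2 \big) - \big( \|W_k(0)\|_F^2 - \|W_j(0)\|_F^2 \big) \Big| \le 2R\big(W(0)\big). \tag{C.17}
\]
In other words, the difference between the squares of Frobenius norms of any two layers is still bounded. The proof then goes in the same way as the proof of Theorem .. Suppose the risk is always above $\epsilon > 0$. Then there exists some $c(\epsilon) > 0$ such that $\|\nabla R(w_{\mathrm{prod}})\| \ge c(\epsilon)$. By Lemma C.7, there exists some $C$ such that if $\min_{1 \le k \le L} \|W_k(t)\|_F > C$, then $\|W_L(t) \cdots W_2(t)\|^2 \ge C^{2L-2}/2$. By eq. (C.17) and Lemma C.5, $\sum_{t : \|W_k(t)\|_F \le C \text{ for some } k} \eta_t$ is finite. On the other hand, by Lemma C.5, $\sum_{t=0}^{\infty} \eta_t = \infty$, and thus $\sum_{t : \|W_k(t)\|_F > C \text{ for all } k} \eta_t = \infty$. Therefore we have, by invoking eq. (C.4),
\[
2R\big(W(0)\big) \ge \sum_{t=0}^{\infty} \eta_t \big\| \nabla R\big(W(t)\big) \big\|^2 \ge \sum_{t=0}^{\infty} \eta_t \left\| \frac{\partial R}{\partial W_1}(t) \right\|_F^2 \ge \sum_{t : \|W_k(t)\|_F > C \text{ for all } k} \eta_t \, \frac{c(\epsilon)^2 C^{2L-2}}{2} = \infty,
\]
which is a contradiction. Therefore $R\big(W(t)\big) \to 0$, and since the risk has no finite optimum, $\|W_k\|_F \to \infty$. The other results follow from Lemma C.5.

Proof of Theorem 3.7. Recall that
\[
\frac{\partial R}{\partial W_1} = W_2^\top \cdots W_L^\top \nabla R(w_{\mathrm{prod}})^\top,
\]
and thus
\[
\begin{aligned}
\|W_1(t+1)\|_F^2
&= \|W_1(t)\|_F^2 - 2\eta_t \left\langle W_1(t), \frac{\partial R}{\partial W_1}(t) \right\rangle + \eta_t^2 \left\| \frac{\partial R}{\partial W_1}(t) \right\|_F^2 \\
&= \|W_1(t)\|_F^2 - 2\eta_t \big\langle w_{\mathrm{prod}}(t), \nabla R\big(w_{\mathrm{prod}}(t)\big) \big\rangle + \eta_t^2 \left\| \frac{\partial R}{\partial W_1}(t) \right\|_F^2.
\end{aligned}
\]
If $\langle w_{\mathrm{prod}}, z_i \rangle \ge 0$ for all $i$, then $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$. Also recall that $\Pi_\perp W_1(t)$ denotes the projection of rows of $W_1(t)$ onto $\bar{u}^\perp$, the orthogonal complement of $\mathrm{span}(\bar{u})$. We have
\[
\begin{aligned}
\|\Pi_\perp W_1(t+1)\|_F^2
&\le \|\Pi_\perp W_1(t)\|_F^2 - 2\eta_t \left\langle \Pi_\perp W_1(t), \frac{\partial R}{\partial W_1}(t) \right\rangle + \eta_t^2 \left\| \frac{\partial R}{\partial W_1}(t) \right\|_F^2 \\
&= \|\Pi_\perp W_1(t)\|_F^2 - 2\eta_t \big\langle \Pi_\perp w_{\mathrm{prod}}(t), \nabla R\big(w_{\mathrm{prod}}(t)\big) \big\rangle + \eta_t^2 \left\| \frac{\partial R}{\partial W_1}(t) \right\|_F^2.
\end{aligned}
\tag{C.18}
\]
Invoking eq. (C.4) again, together with $\eta_t \le 1$, gives
\[
\eta_t^2 \left\| \frac{\partial R}{\partial W_1}(t) \right\|_F^2 \le \eta_t \big\| \nabla R\big(W(t)\big) \big\|^2 \le 2 \Big( R\big(W(t)\big) - R\big(W(t+1)\big) \Big). \tag{C.19}
\]
The proof then goes in almost the same way as the proof of Theorem 2.8. For any $\epsilon > 0$, we can find some large enough time $t_0$, such that for any $t \ge t_0$: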
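1. $\|\Pi_\perp W_1(t)\|_F / \|W_1(t)\|_F \ge \epsilon$ implies that $\big\langle \Pi_\perp w_{\mathrm{prod}}(t), \nabla R\big(w_{\mathrm{prod}}(t)\big) \big\rangle \ge 0$.
2. $\langle w_{\mathrm{prod}}(t), z_i \rangle \ge 0$ for all $i$, and thus $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$.
3. $\|W_1(t)\|_F \ge \Big( 1 + \sqrt{2R(W(0))} \Big) / \epsilon$.

Suppose at some time $t_1 \ge t_0$, $\|\Pi_\perp W_1(t_1)\|_F / \|W_1(t_1)\|_F \ge \epsilon$. As long as this still holds, in light of bullet 1 above, eq. (C.18) and eq. (C.19), $\|\Pi_\perp W_1\|_F^2$ will increase by at most $2R\big(W(t_1)\big) \le 2R\big(W(0)\big)$. On the other hand, $\|W_1\|_F \to \infty$, and thus there exists some $t_2 > t_1$ such that $\|\Pi_\perp W_1(t_2)\|_F / \|W_1(t_2)\|_F < \epsilon$. Let $t_3$ denote the smallest time after $t_2$ such that $\|\Pi_\perp W_1(t_3)\|_F / \|W_1(t_3)\|_F \ge \epsilon$ (if it exists). Recalling that each step changes $W_1$ by at most $1$ in Frobenius norm for any $t \ge 0$ by eq. (C.6), and that $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$ for any $t \ge t_0$, we have
\[
\frac{\|\Pi_\perp W_1(t_3)\|_F}{\|W_1(t_3)\|_F} \le \frac{\|\Pi_\perp W_1(t_3 - 1)\|_F + 1}{\|W_1(t_3)\|_F} \le \frac{\|\Pi_\perp W_1(t_3 - 1)\|_F}{\|W_1(t_3 - 1)\|_F} + \frac{1}{\|W_1(t_3)\|_F} < \epsilon + \frac{1}{\|W_1(t_3)\|_F}.
\]
After $t_3$, $\|\Pi_\perp W_1\|_F^2$ will increase by at most $2R\big(W(0)\big)$, and thus $\|\Pi_\perp W_1\|_F$ will increase by at most $\sqrt{2R(W(0))}$. Therefore, for any $t_4 \ge t_3$, as long as $\|\Pi_\perp W_1(t_4)\|_F / \|W_1(t_4)\|_F \ge \epsilon$, we have
\[
\frac{\|\Pi_\perp W_1(t_4)\|_F}{\|W_1(t_4)\|_F} \le \frac{\|\Pi_\perp W_1(t_4)\|_F}{\|W_1(t_3)\|_F} \le \frac{\|\Pi_\perp W_1(t_3)\|_F + \sqrt{2R(W(0))}}{\|W_1(t_3)\|_F} \le \epsilon + \frac{1}{\|W_1(t_3)\|_F} + \frac{\sqrt{2R(W(0))}}{\|W_1(t_3)\|_F} \le 2\epsilon,
\]
since $\|W_1(t)\|_F \ge \Big( 1 + \sqrt{2R(W(0))} \Big) / \epsilon$ after $t_0$. In other words,
\[
\limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} \le 2\epsilon.
\]
Since $\epsilon$ is arbitrary, we have
\[
\limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} = 0,
\]
and thus $\lim_{t \to \infty} |\langle v_1, \bar{u} \rangle| = 1$.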
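The behaviour described by Theorems 3.6 and 3.7 can be observed on a toy problem. The sketch below runs plain gradient descent on a three-layer linear network with the logistic loss. Everything concrete in it is an assumption made for illustration: the two data points (chosen symmetric so that the maximum margin direction is $\bar{u} = (1, 0)$), the hand-picked initialization, the constant step size `0.05` (the paper's step sizes follow Assumption 3.5 instead), and the number of iterations.

```python
import numpy as np

# Rows are z_i = y_i * x_i, with norm at most 1. By symmetry in the second
# coordinate, the maximum margin direction is u_bar = (1, 0).
Z = np.array([[0.8, 0.4], [0.8, -0.4]])
u_bar = np.array([1.0, 0.0])

def risk(w):
    """Empirical logistic risk of the linear predictor w."""
    return np.mean(np.log1p(np.exp(-Z @ w)))

def risk_grad(w):
    """Gradient of the logistic risk with respect to w."""
    return Z.T @ (-1.0 / (1.0 + np.exp(Z @ w))) / len(Z)

def w_prod(Ws):
    """Linear predictor induced by the network, (W_L ... W_1)^T."""
    out = np.eye(Ws[0].shape[1])
    for W in Ws:
        out = W @ out
    return out.ravel()

def gd_step(Ws, eta):
    """One gradient descent step on all layers, using gradients at the old iterate."""
    g = risk_grad(w_prod(Ws))
    new_Ws = []
    for k, W in enumerate(Ws):
        below = np.eye(Ws[0].shape[1])
        for M in Ws[:k]:
            below = M @ below                  # W_{k-1} ... W_1
        above = np.eye(W.shape[0])
        for M in Ws[k + 1:]:
            above = M @ above                  # W_L ... W_{k+1}
        grad_k = above.T @ (below @ g)[None, :]
        new_Ws.append(W - eta * grad_k)
    return new_Ws

# Hand-picked initialization whose risk is below the risk of the zero predictor.
Ws = [np.array([[0.6, 0.3], [0.1, 0.5]]),
      np.array([[0.9, 0.1], [0.2, 0.8]]),
      np.array([[0.5, 0.4]])]

w0 = w_prod(Ws)
risk_init = risk(w0)
cos_init = abs(w0 @ u_bar) / np.linalg.norm(w0)

for _ in range(20000):
    Ws = gd_step(Ws, eta=0.05)

w = w_prod(Ws)
risk_final = risk(w)
cos_final = abs(w @ u_bar) / np.linalg.norm(w)

# |<v_{k+1}, u_k>| for adjacent layers, from the top singular vectors.
layer_align = []
for k in range(len(Ws) - 1):
    U, _, _ = np.linalg.svd(Ws[k])
    _, _, Vh = np.linalg.svd(Ws[k + 1])
    layer_align.append(float(abs(Vh[0] @ U[:, 0])))

print("risk:", risk_init, "->", risk_final)
print("|cos(w_prod, u_bar)|:", cos_init, "->", cos_final)
print("adjacent layer alignment:", layer_align)
```

A constant step size is used only to keep the sketch short. The convergence statements above are asymptotic, so a finite run can show the trend but does not verify the limits.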
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationSummary and discussion of: Dropout Training as Adaptive Regularization
Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial
More informationA Conservation Law Method in Optimization
A Conservation Law Method in Optimization Bin Shi Florida International University Tao Li Florida International University Sundaraja S. Iyengar Florida International University Abstract bshi1@cs.fiu.edu
More information