arxiv: v1 [cs.lg] 4 Oct 2018


Gradient descent aligns the layers of deep linear networks

Ziwei Ji, Matus Telgarsky
University of Illinois, Urbana-Champaign
arXiv:1810.02032v1 [cs.LG] 4 Oct 2018

Abstract

This paper establishes risk convergence and asymptotic weight matrix alignment (a form of implicit regularization) of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk converges to 0; (ii) the normalized $i$-th weight matrix asymptotically equals its rank-1 approximation $u_i v_i^\top$; (iii) these rank-1 matrices are aligned across layers, meaning $|v_{i+1}^\top u_i| \to 1$. In the case of the logistic loss (binary cross entropy), more can be said: the linear function induced by the network (the product of its weight matrices) converges to the same direction as the maximum margin solution. This last property was identified in prior work, but only under assumptions on gradient descent which here are implied by the alignment phenomenon.

1 Introduction

Efforts to explain the effectiveness of gradient descent in deep learning have uncovered an exciting possibility: it not only finds solutions with low error, but also biases the search for low complexity solutions which generalize well (Zhang et al., 2017; Bartlett et al., 2017; Soudry et al., 2017; Gunasekar et al., 2018).

This paper analyzes the implicit regularization of gradient descent and gradient flow on deep linear networks and linearly separable data. For strictly decreasing losses, the optimum is off at infinity, and we establish various alignment phenomena:

• For each weight matrix $W_i$, the corresponding normalized weight matrix $W_i / \|W_i\|_F$ asymptotically equals its rank-1 approximation $u_i v_i^\top$, where the Frobenius norm $\|W_i\|_F$ satisfies $\|W_i\|_F \to \infty$. In other words, $\|W_i\|_2 / \|W_i\|_F \to 1$, and asymptotically only the rank-1 approximation of each weight matrix contributes to the final predictor, a form of implicit regularization.
• Adjacent rank-1 weight matrix approximations are aligned: $|v_{i+1}^\top u_i| \to 1$.
• For the logistic loss, the first right singular vector $v_1$ of $W_1$ is aligned with the data, meaning $v_1$ converges to the unique maximum margin predictor $\bar{u}$ defined by the data. Moreover, the linear predictor induced by the network, $w_{\mathrm{prod}} := (W_L \cdots W_1)^\top$, is also aligned with the data, meaning $w_{\mathrm{prod}} / \|w_{\mathrm{prod}}\| \to \bar{u}$.

Simultaneously, this work proves that the risk is globally optimized: it asymptotes to 0. Alignment and risk convergence are proved simultaneously; the phenomena are coupled within the proofs.

The paper is organized as follows. This introduction continues with related work, notation, and assumptions in Sections 1.1 and 1.2. The analysis of gradient flow is in Section 2, and gradient descent is analyzed in Section 3. The paper closes with future directions in Section 4; a particular highlight is a preliminary experiment on CIFAR-10 which establishes empirically that a form of the alignment phenomenon occurs on the standard nonlinear network AlexNet.

1.1 Related work

On the implicit regularization of gradient descent, Soudry et al. (2017) show that for linear predictors and linearly separable data, the gradient descent iterates converge to the same direction as the maximum margin solution. Ji and Telgarsky (2018) further characterize such an implicit bias for general nonseparable data. Gunasekar et al. (2018) consider gradient descent on fully connected linear networks and linear convolutional networks. In particular, for the exponential loss, assuming the risk is minimized to 0 and the gradients converge in direction, they show that the whole network converges in direction to the maximum margin solution. These two assumptions are on the gradient descent process itself, and specifically the second one might be hard to interpret and justify. Compared with Gunasekar et al. (2018), this paper proves that the risk converges to 0 and the weight matrices align; moreover the proof here establishes the properties simultaneously, rather than assuming one and deriving the other. Lastly, for ReLU networks, Du et al. (2018) show that gradient flow does not change the difference between squared Frobenius norms of any two layers.

[Figure 1: Visualization of main results on synthetic data with a 4-layer linear network compared to a 1-layer network (a linear predictor). (a) Margin maximization: convergence of the 1-layer and 4-layer networks to the same linear predictor on positive (blue) and negative (red) separable data. (b) Alignment and risk minimization: the alignment phenomenon in the 4-layer network, plotted against the risk. For each layer, the ratio of spectral to Frobenius norms $\|W_i\|_2 / \|W_i\|_F$ is plotted, and converges to 1. As in the theoretical analysis, the convergence in alignment and risk occur simultaneously.]

For a smooth (nonconvex) function, Lee et al. (2016) show that any strict saddle can be avoided almost surely with small step sizes. If there are only countably many saddle points and they are all strict, and if gradient descent iterates converge, then this implies (almost surely) they converge to a local minimum. In the present work, since there is no finite local minimum, gradient descent will go to infinity and never converge, and thus these results of Lee et al. (2016) do not show that the risk converges to 0.

There has been a rich literature on linear networks. Saxe et al. (2013) analyze the learning dynamics of deep linear networks, showing that they exhibit some learning patterns similar to nonlinear networks, such as a long plateau followed by a rapid risk drop. Arora et al. (2018) show that depth can help accelerate optimization. On the landscape properties of deep linear networks, Lu and Kawaguchi (2017) and Laurent and von Brecht (2017) show that under various structural assumptions, all local optima are global. Zhou and Liang (2018) give a necessary and sufficient characterization of critical points for deep linear networks.

1.2 Notation, setting, and assumptions

Consider a data set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$, $\|x_i\| \le 1$, and $y_i \in \{-1, +1\}$. The data set is assumed to be linearly separable, i.e., there exists a unit vector $u$ which correctly classifies every data point: for any $1 \le i \le n$, $y_i \langle u, x_i \rangle > 0$. Furthermore, let $\gamma := \max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle > 0$ denote the maximum margin, and $\bar{u} := \arg\max_{\|u\|=1} \min_{1 \le i \le n} y_i \langle u, x_i \rangle$ denote the maximum margin solution (the solution to the hard-margin SVM).

A linear network of depth $L$ is parameterized by weight matrices $W_L, \ldots, W_1$, where $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, $d_0 = d$, and $d_L = 1$. Let $W = (W_L, \ldots, W_1)$ denote all parameters of the network. The (empirical) risk induced by the network is given by

$$\mathcal{R}(W) = \mathcal{R}(W_L, \ldots, W_1) = \frac{1}{n} \sum_{i=1}^n \ell\left(y_i W_L \cdots W_1 x_i\right) = \frac{1}{n} \sum_{i=1}^n \ell\left(\langle w_{\mathrm{prod}}, z_i \rangle\right),$$

where $w_{\mathrm{prod}} := (W_L \cdots W_1)^\top$ and $z_i := y_i x_i$.

The loss $\ell$ is assumed to be continuously differentiable, unbounded, and strictly decreasing to 0. Examples include the exponential loss $\ell_{\exp}(x) = e^{-x}$ and the logistic loss $\ell_{\log}(x) = \ln(1 + e^{-x})$.

Assumption 1.1. $\ell' < 0$ is continuous, $\lim_{x \to -\infty} \ell(x) = \infty$ and $\lim_{x \to \infty} \ell(x) = 0$.

This paper considers gradient flow and gradient descent, where gradient flow $\{W(t) \mid t \ge 0, t \in \mathbb{R}\}$ can be interpreted as gradient descent with infinitesimal step sizes. It starts from some $W(0)$ at $t = 0$, and proceeds as

$$\frac{dW(t)}{dt} = -\nabla \mathcal{R}\left(W(t)\right).$$

By contrast, gradient descent $\{W(t) \mid t \ge 0, t \in \mathbb{Z}\}$ is a discrete-time process given by

$$W(t+1) = W(t) - \eta_t \nabla \mathcal{R}\left(W(t)\right),$$

where $\eta_t$ is the step size at time $t$.

We assume that the initialization of the network is not a critical point and induces a risk no larger than the risk of the trivial linear predictor 0.

Assumption 1.2. The initialization $W(0)$ satisfies $\nabla \mathcal{R}(W(0)) \ne 0$ and $\mathcal{R}(W(0)) \le \mathcal{R}(0) = \ell(0)$.

It is natural to require that the initialization is not a critical point, since otherwise gradient flow/descent will never make progress. The requirement $\mathcal{R}(W(0)) \le \mathcal{R}(0)$ can be easily satisfied, for example, by making $W_1(0) = 0$ and $W_L(0) \cdots W_2(0) \ne 0$. On the other hand, if $\mathcal{R}(W(0)) > \mathcal{R}(0)$, gradient flow/descent may never minimize the risk to 0. Proofs of those claims are given in Appendix A.

2 Results for gradient flow

In this section, we consider gradient flow. Although impractical when compared with gradient descent, gradient flow can simplify the analysis and highlight proof ideas. For convenience, we usually write $W$, $W_k$, and $w_{\mathrm{prod}}$, but they all change with (the continuous time) $t$. Only proof sketches are given here; detailed proofs are deferred to Appendix B.

2.1 Risk convergence

One key property of gradient flow is that it never increases the risk:

$$\frac{d\mathcal{R}(W)}{dt} = \left\langle \nabla \mathcal{R}(W), \frac{dW}{dt} \right\rangle = -\|\nabla \mathcal{R}(W)\|^2 = -\sum_{k=1}^L \left\| \frac{\partial \mathcal{R}}{\partial W_k} \right\|_F^2 \le 0. \tag{2.1}$$
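The setting above can be made concrete with a minimal NumPy sketch, offered as an illustration rather than the authors' code. It evaluates the logistic risk of a deep linear network, computes per-layer gradients by the chain rule, and takes one small gradient descent step. The helper names (`chain`, `risk`, `layer_gradients`), the synthetic data, the layer widths and the step size `eta` are all assumptions made for this example.

```python
import numpy as np

def chain(mats, size):
    """Product M_m ... M_1 of mats = [M_1, ..., M_m]; the identity if mats is empty."""
    P = np.eye(size)
    for M in mats:
        P = M @ P
    return P

def risk(weights, Z):
    """Logistic risk of the network; weights = [W_1, ..., W_L], rows of Z are z_i = y_i x_i."""
    w_prod = chain(weights, weights[0].shape[1]).ravel()
    return np.logaddexp(0.0, -(Z @ w_prod)).mean()   # mean of ln(1 + exp(-margin))

def layer_gradients(weights, Z):
    """Chain-rule gradients of the logistic risk with respect to every layer."""
    d = weights[0].shape[1]
    w_prod = chain(weights, d).ravel()
    dloss = -0.5 * (1.0 - np.tanh(0.5 * (Z @ w_prod)))   # derivative of the logistic loss
    g = (dloss[:, None] * Z).mean(axis=0)                # gradient with respect to w_prod
    grads = []
    for k, W in enumerate(weights):
        above = chain(weights[k + 1:], W.shape[0])       # product of the layers above (a row vector)
        below = chain(weights[:k], d)                    # product of the layers below
        grads.append(above.T @ g[None, :] @ below.T)
    return grads

# Hypothetical separable data with ||x_i|| <= 1 and a 3-layer network.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
Z = y[:, None] * X
dims = [3, 4, 4, 1]
weights = [0.5 * rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]

grads = layer_gradients(weights, Z)
eta = 1e-3                                               # small step, chosen by hand
stepped = [W - eta * G for W, G in zip(weights, grads)]
risk_before, risk_after = risk(weights, Z), risk(stepped, Z)
print(risk_before, risk_after)
```

With a step this small the discrete update behaves like gradient flow, so the risk should not increase, mirroring eq. (2.1). Larger steps do not carry that guarantee, which is the issue Section 3 addresses.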
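We now state the main result: under Assumptions 1.1 and 1.2, gradient flow minimizes the risk, $\|W_k\|_F$ and $\|w_{\mathrm{prod}}\|$ all go to infinity, and the alignment phenomenon occurs.

Theorem 2.2. Under Assumptions 1.1 and 1.2, gradient flow iterates satisfy the following properties:

• $\lim_{t \to \infty} \mathcal{R}(W) = 0$.
• For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k\|_F = \infty$.
• For any $1 \le k \le L$, letting $(u_k, v_k)$ denote the first left and right singular vectors of $W_k$,
$$\lim_{t \to \infty} \left\| \frac{W_k}{\|W_k\|_F} - u_k v_k^\top \right\|_F = 0.$$
Moreover, for any $1 \le k < L$, $\lim_{t \to \infty} |\langle v_{k+1}, u_k \rangle| = 1$. As a result,
$$\lim_{t \to \infty} \left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^L \|W_k\|_F}, v_1 \right\rangle \right| = 1,$$
and thus $\lim_{t \to \infty} \|w_{\mathrm{prod}}\| = \infty$.

Theorem 2.2 is proved using two lemmas, which may be of independent interest. To show the ideas, let us first introduce a little more notation. Recall that $\mathcal{R}(W)$ denotes the empirical risk induced by the deep linear network $W$. Abusing the notation a little, for any linear predictor $w \in \mathbb{R}^d$, we also use $\mathcal{R}(w)$ to denote the risk induced by $w$. With this notation, $\mathcal{R}(W) = \mathcal{R}(w_{\mathrm{prod}})$, while

$$\nabla \mathcal{R}(w_{\mathrm{prod}}) = \frac{1}{n} \sum_{i=1}^n \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) z_i = \frac{1}{n} \sum_{i=1}^n \ell'\left(W_L \cdots W_1 z_i\right) z_i$$

is in $\mathbb{R}^d$ and different from $\nabla \mathcal{R}(W)$, which has $\sum_{k=1}^L d_k d_{k-1}$ entries, as given below:

$$\frac{\partial \mathcal{R}}{\partial W_k} = W_{k+1}^\top \cdots W_L^\top \, \nabla \mathcal{R}(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_{k-1}^\top.$$

Furthermore, for any $R > 0$, let

$$B(R) = \left\{ W \;\middle|\; \max_{1 \le k \le L} \|W_k\|_F \le R \right\}.$$

The first lemma shows that for any $R > 0$, the time spent by gradient flow in $B(R)$ is finite.

Lemma 2.3. Under Assumptions 1.1 and 1.2, for any $R > 0$, there exists a constant $\epsilon(R) > 0$, such that for any $t \ge 1$ and any $W \in B(R)$, $\|\partial \mathcal{R} / \partial W_1\|_F \ge \epsilon(R)$. As a result, gradient flow spends a finite amount of time in $B(R)$ for any $R > 0$, and $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded.

Here is a proof sketch. If all $\|W_k\|_F$ are bounded, then $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant. Therefore, if $\|\partial \mathcal{R} / \partial W_1\|_F = \|W_L \cdots W_2\| \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ can be arbitrarily small, then $\|W_L \cdots W_2\|$ and $\|w_{\mathrm{prod}}\|$ can also be arbitrarily small, and thus $\mathcal{R}(W)$ can be arbitrarily close to $\mathcal{R}(0)$. This cannot happen after $t = 1$, otherwise it will contradict Assumption 1.2 and eq. (2.1).

To proceed, we need the following properties of linear networks from prior work (Arora et al., 2018; Du et al., 2018). For any time $t \ge 0$ and any $1 \le k < L$,

$$W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{2.4}$$

To see this, just notice that

$$W_{k+1}^\top \frac{\partial \mathcal{R}}{\partial W_{k+1}} = W_{k+1}^\top \cdots W_L^\top \, \nabla \mathcal{R}(w_{\mathrm{prod}})^\top \, W_1^\top \cdots W_k^\top = \frac{\partial \mathcal{R}}{\partial W_k} W_k^\top.$$

Taking the trace on both sides of eq. (2.4), we have

$$\|W_{k+1}(t)\|_F^2 - \|W_{k+1}(0)\|_F^2 = \|W_k(t)\|_F^2 - \|W_k(0)\|_F^2. \tag{2.5}$$

In other words, the difference between the squares of Frobenius norms of any two layers remains a constant. Together with Lemma 2.3, it implies that all $\|W_k\|_F$ are unbounded.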
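The gradient identity behind eq. (2.4) is purely algebraic, so it can be checked numerically at any point in parameter space. The sketch below does this for the logistic loss on random weights; the helper names, the random data and the layer widths are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def chain(mats, size):
    """Product M_m ... M_1 of mats = [M_1, ..., M_m]; the identity if mats is empty."""
    P = np.eye(size)
    for M in mats:
        P = M @ P
    return P

def layer_gradients(weights, Z):
    """Chain-rule gradients of the logistic risk; weights = [W_1, ..., W_L]."""
    d = weights[0].shape[1]
    w_prod = chain(weights, d).ravel()
    dloss = -0.5 * (1.0 - np.tanh(0.5 * (Z @ w_prod)))   # derivative of the logistic loss
    g = (dloss[:, None] * Z).mean(axis=0)                # gradient with respect to w_prod
    grads = []
    for k, W in enumerate(weights):
        above = chain(weights[k + 1:], W.shape[0])       # product of the layers above
        below = chain(weights[:k], d)                    # product of the layers below
        grads.append(above.T @ g[None, :] @ below.T)
    return grads

rng = np.random.default_rng(1)
Z = rng.normal(size=(15, 4))                             # hypothetical data rows z_i
dims = [4, 5, 3, 1]
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
grads = layer_gradients(weights, Z)

# For each adjacent pair, compare W_{k+1}^T dR/dW_{k+1} with dR/dW_k W_k^T.
gaps = []
for k in range(len(weights) - 1):
    lhs = weights[k + 1].T @ grads[k + 1]
    rhs = grads[k] @ weights[k].T
    gaps.append(np.abs(lhs - rhs).max())

# Taking traces gives <W_{k+1}, dR/dW_{k+1}> = <W_k, dR/dW_k>, the rate form of eq. (2.5).
inner = [np.sum(W * G) for W, G in zip(weights, grads)]
print(gaps, inner)
```

The entries of `gaps` should be at floating-point round-off level, and all entries of `inner` should coincide. Under gradient descent with a finite step the invariant in eq. (2.4) holds only up to second-order terms in the step size, which is why Section 3 has to track an extra error term.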
However, even if $\|W_k\|_F$ are large, it does not necessarily follow that $\|w_{\mathrm{prod}}\|$ is also large. Lemma 2.6 shows that $\|w_{\mathrm{prod}}\|$ does indeed become large: for gradient flow, as $\|W_k\|_F$ become larger, adjacent layers also get more aligned to each other, which ensures that their product also has a large norm.

For $1 \le k \le L$, let $\sigma_k$, $u_k$, and $v_k$ denote the first singular value (the 2-norm), the first left singular vector, and the first right singular vector of $W_k$, respectively. Furthermore, define

$$D := \left( \max_{1 \le k \le L} \|W_k(0)\|_F^2 \right) - \|W_L(0)\|_F^2 + \sum_{k=1}^{L-1} \left\| W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0) \right\|_2,$$

which depends only on the initialization. If for all $1 \le k < L$ it holds that $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then $D = 0$.
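The quantities just defined are straightforward to compute from a singular value decomposition. The following is a hedged sketch: the function names and the rank-1 "balanced" construction are assumptions made for illustration, and the formula for $D$ follows the definition above as reconstructed from this transcription.

```python
import numpy as np

def top_singular(W):
    """First singular value, first left and first right singular vectors of W."""
    U, S, Vt = np.linalg.svd(W)
    return S[0], U[:, 0], Vt[0]

def alignment_ratio(W):
    """Spectral norm over Frobenius norm; equals 1 exactly when W has rank 1."""
    return np.linalg.norm(W, 2) / np.linalg.norm(W, "fro")

def adjacent_alignment(W_lower, W_upper):
    """|<v_{k+1}, u_k>| for the layer pair (W_k, W_{k+1})."""
    _, u_lower, _ = top_singular(W_lower)
    _, _, v_upper = top_singular(W_upper)
    return abs(v_upper @ u_lower)

def initialization_constant(weights):
    """The constant D for weights = [W_1, ..., W_L]."""
    fro_sq = [np.linalg.norm(W, "fro") ** 2 for W in weights]
    mismatch = sum(
        np.linalg.norm(Wk @ Wk.T - Wn.T @ Wn, 2)
        for Wk, Wn in zip(weights[:-1], weights[1:])
    )
    return max(fro_sq) - fro_sq[-1] + mismatch

# A balanced rank-1 network: W_k = s * a_k a_{k-1}^T with unit vectors a_k.
rng = np.random.default_rng(0)
dims = [5, 4, 3, 1]
units = [rng.normal(size=n) for n in dims]
units = [a / np.linalg.norm(a) for a in units]
scale = 2.0
balanced = [scale * np.outer(units[k + 1], units[k]) for k in range(len(dims) - 1)]

D = initialization_constant(balanced)
ratios = [alignment_ratio(W) for W in balanced]
alignments = [adjacent_alignment(balanced[k], balanced[k + 1]) for k in range(len(balanced) - 1)]

# For comparison, a generic Gaussian matrix sits well below ratio 1.
generic_ratio = alignment_ratio(rng.normal(size=(4, 5)))
print(D, ratios, alignments, generic_ratio)
```

The balanced example satisfies $W_k W_k^\top = W_{k+1}^\top W_{k+1}$ for every adjacent pair, so $D$ vanishes and every ratio and alignment equals 1 up to round-off. For any nonzero matrix the ratio is at least $1/\sqrt{\min(m, n)}$, the lower bound quoted later for the AlexNet experiment.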

Lemma 2.6. The gradient flow iterates satisfy the following properties:

• For any $1 \le k \le L$, $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$.
• For any $1 \le k < L$, $\langle v_{k+1}, u_k \rangle^2 \ge 1 - \left( D + \|W_{k+1}(0)\|_2^2 + \|W_k(0)\|_2^2 \right) / \|W_{k+1}\|_2^2$.
• Suppose $\max_{1 \le k \le L} \|W_k\|_F \to \infty$, then $\left| \left\langle w_{\mathrm{prod}} / \prod_{k=1}^L \|W_k\|_F, v_1 \right\rangle \right| \to 1$.

The proof is based on eq. (2.4) and eq. (2.5). If $W_k(0) W_k^\top(0) = W_{k+1}^\top(0) W_{k+1}(0)$, then eq. (2.4) gives that $W_{k+1}$ and $W_k$ have the same singular values, and $W_{k+1}$'s right singular vectors and $W_k$'s left singular vectors are the same. If this is true for any two adjacent layers, since $W_L$ is a row vector, all layers have rank 1. With general initialization, we have similar results when $\|W_k\|_F$ is large enough so that the initialization is negligible. Careful calculations give the exact results in Lemma 2.6.

An interesting point is that the implicit regularization result in Lemma 2.6 helps establish risk convergence in Theorem 2.2. Specifically, by Lemma 2.6, if all layers have large norms, $\|W_L \cdots W_2\|$ will be large. If the risk is not minimized to 0, $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ will be lower bounded by a positive constant, and thus $\|\partial \mathcal{R} / \partial W_1\|_F = \|W_L \cdots W_2\| \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|$ will be large. Invoking eq. (2.1), Lemma 2.3 and eq. (2.5) gives a contradiction. Since the risk has no finite optimum, $\|W_k\|_F \to \infty$.

2.2 Convergence to the maximum margin solution

Here we focus on the exponential loss $\ell_{\exp}(x) = e^{-x}$ and the logistic loss $\ell_{\log}(x) = \ln(1 + e^{-x})$. In addition to risk convergence, these two losses also enable gradient descent to find the maximum margin solution.

To get such a strong convergence, we need one more assumption on the data set. Recall that $\gamma = \max_{\|u\|=1} \min_{1 \le i \le n} \langle u, z_i \rangle > 0$ denotes the maximum margin, and $\bar{u}$ denotes the unique maximum margin predictor which attains this margin $\gamma$. Those data points $z_i$ for which $\langle \bar{u}, z_i \rangle = \gamma$ are called support vectors.

Assumption 2.7. The support vectors span the whole space $\mathbb{R}^d$.
Assumption 2.7 appears in prior work (Soudry et al., 2017), and can be satisfied in many cases: for example, it is almost surely true if the number of support vectors is larger than or equal to $d$ and the data set is sampled from some density w.r.t. the Lebesgue measure. It can also be relaxed to the situation that the support vectors and the whole data set span the same space; in this case $\nabla \mathcal{R}(w_{\mathrm{prod}})$ will never leave this space, and we can always restrict our attention to this space.

With Assumption 2.7, we can state the main theorem.

Theorem 2.8. Under Assumptions 1.2 and 2.7, for almost all data and for losses $\ell_{\exp}$ and $\ell_{\log}$, $\lim_{t \to \infty} |\langle v_1, \bar{u} \rangle| = 1$, where $v_1$ is the first right singular vector of $W_1$. As a result, $\lim_{t \to \infty} w_{\mathrm{prod}} / \prod_{k=1}^L \|W_k\|_F = \bar{u}$.

Theorem 2.8 relies on two structural lemmas. The first one is based on a similar almost-all argument due to Soudry et al. (2017, Lemma 8). Let $S \subset \{1, \ldots, n\}$ denote the set of indices of support vectors.

Lemma 2.9. Under Assumption 2.7, if the data set is sampled from some density w.r.t. the Lebesgue measure, then with probability 1,

$$\alpha := \min_{\|\xi\| = 1, \, \xi \perp \bar{u}} \; \max_{i \in S} \, \langle \xi, z_i \rangle > 0.$$

Let $\bar{u}^\perp$ denote the orthogonal complement of $\mathrm{span}(\bar{u})$, and let $\Pi_\perp$ denote the projection onto $\bar{u}^\perp$.

Lemma 2.10. Under Assumption 2.7, for almost all data, losses $\ell_{\exp}$ and $\ell_{\log}$, and any $w \in \mathbb{R}^d$, if $\langle w, \bar{u} \rangle \ge 0$ and $\|\Pi_\perp w\|$ is larger than $(1 + \ln(n))/\alpha$ for $\ell_{\exp}$ or $2n/(e\alpha)$ for $\ell_{\log}$, then $\langle \Pi_\perp w, \nabla \mathcal{R}(w) \rangle \ge 0$.

With Lemma 2.9 and Lemma 2.10 in hand, we can prove Theorem 2.8. Let $\Pi_\perp W_1$ denote the projection of rows of $W_1$ onto $\bar{u}^\perp$. Notice that

$$\Pi_\perp w_{\mathrm{prod}} = \left( W_L \cdots W_2 (\Pi_\perp W_1) \right)^\top \quad \text{and} \quad \frac{d \|\Pi_\perp W_1\|_F^2}{dt} = -2 \left\langle \Pi_\perp w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \right\rangle.$$

If $\|\Pi_\perp W_1\|_F$ is large compared with $\|W_1\|_F$, since layers become aligned, $\|\Pi_\perp w_{\mathrm{prod}}\|$ will also be large, and then Lemma 2.10 implies that $\|\Pi_\perp W_1\|_F$ will not increase. At the same time, $\|W_1\|_F \to \infty$, and thus for large enough $t$, $\|\Pi_\perp W_1\|_F$ must be very small compared with $\|W_1\|_F$. Many details need to be handled to make this intuition exact; the proof is given in Appendix B.

3 Results for gradient descent

One key property of gradient flow which is used in the previous proofs is that it never increases the risk, which is not necessarily true for gradient descent. However, for smooth losses (i.e., with Lipschitz continuous derivatives), we can design some decaying step sizes, with which gradient descent never increases the risk, and basically the same results hold as in the gradient flow case. Deferred proofs are given in Appendix C.

We make the following additional assumption on the loss, which is satisfied by the logistic loss $\ell_{\log}$.

Assumption 3.1. $\ell'$ is $\beta$-Lipschitz (i.e., $\ell$ is $\beta$-smooth), and $|\ell'| \le G$ (i.e., $\ell$ is $G$-Lipschitz).

Under Assumption 3.1, the risk is also a smooth function of $W$, if all layers are bounded.

Lemma 3.2. Suppose $\ell$ is $\beta$-smooth. For any $R \ge 1$, the risk $\mathcal{R}$ is a $\beta(R)$-smooth function on the set $B(R) = \{ W \mid \|W_k\|_F \le R, \, 1 \le k \le L \}$, where $\beta(R) = L^2 R^{2L-2} (\beta + G)$.

Smoothness ensures that for any $W, V \in B(R)$, $\mathcal{R}(W) - \mathcal{R}(V) \le \langle \nabla \mathcal{R}(V), W - V \rangle + \beta(R) \|W - V\|^2 / 2$ (see Bubeck et al., 2015, Lemma 3.4). In particular, if we choose some $R$ and set a constant step size $\eta_t = 1/\beta(R)$, then as long as $W(t+1)$ and $W(t)$ are both in $B(R)$,

$$\begin{aligned}
\mathcal{R}\left(W(t+1)\right) - \mathcal{R}\left(W(t)\right) &\le \left\langle \nabla \mathcal{R}\left(W(t)\right), -\eta_t \nabla \mathcal{R}\left(W(t)\right) \right\rangle + \frac{\beta(R) \eta_t^2}{2} \left\| \nabla \mathcal{R}\left(W(t)\right) \right\|^2 \\
&= -\frac{1}{2\beta(R)} \left\| \nabla \mathcal{R}\left(W(t)\right) \right\|^2 = -\frac{\eta_t}{2} \left\| \nabla \mathcal{R}\left(W(t)\right) \right\|^2.
\end{aligned} \tag{3.3}$$

In other words, the risk does not increase at this step. However, similar to gradient flow, the gradient descent iterate will eventually escape $B(R)$, which may increase the risk.

Lemma 3.4. Under Assumptions 1.1 and 1.2, suppose gradient descent is run with a constant step size $1/\beta(R)$. Then there exists a time $t$ when $W(t) \notin B(R)$, in other words, $\max_{1 \le k \le L} \|W_k(t)\|_F > R$.

Fortunately, this issue can be handled by adaptively increasing $R$ and correspondingly decreasing the step sizes, formalized in the following assumption.

Assumption 3.5. The step size $\eta_t = \min\{1/\beta(R_t), 1\}$, where $R_t$ satisfies $W(t) \in B(R_t - 1)$, and if $W(t+1) \in B(R_t - 1)$, $R_{t+1} = R_t$.
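One way to realize a step-size rule of this kind is sketched below. It is an illustration under stated assumptions, not the authors' procedure: it takes the smoothness constant to be $\beta(R) = L^2 R^{2L-2}(\beta + G)$ (the leading constant should be checked against the published paper), it reads the radius condition as $W(t) \in B(R_t - 1)$, and the update in `next_radius` (enlarging the radius to the current maximum norm plus one) is a hypothetical choice that the assumption permits but does not prescribe.

```python
import math
import numpy as np

def smoothness(R, L, beta, G):
    """Assumed smoothness constant beta(R) = L^2 R^(2L-2) (beta + G) on B(R), for R >= 1."""
    return L ** 2 * R ** (2 * L - 2) * (beta + G)

def step_size(R_t, L, beta, G):
    """eta_t = min(1 / beta(R_t), 1)."""
    return min(1.0 / smoothness(R_t, L, beta, G), 1.0)

def next_radius(weights, R_t):
    """Keep R_t while every layer has Frobenius norm at most R_t - 1; otherwise enlarge it.

    The enlargement rule (max norm + 1) is a hypothetical choice for this sketch.
    """
    max_norm = max(np.linalg.norm(W, "fro") for W in weights)
    return R_t if max_norm <= R_t - 1.0 else max_norm + 1.0

# Logistic loss: beta = 1/4 and G = 1. Three layers, starting radius 2.
L, beta, G = 3, 0.25, 1.0
eta = step_size(2.0, L, beta, G)

small = [0.1 * np.ones((2, 2)), 0.1 * np.ones((1, 2))]
large = [3.0 * np.eye(2), np.ones((1, 2))]
R_small = next_radius(small, 2.0)   # all norms <= 1, so the radius is kept
R_large = next_radius(large, 2.0)   # the first layer has norm 3*sqrt(2), so the radius grows
print(eta, R_small, R_large)
```

Because each layer moves by at most 1 in Frobenius norm per step under this rule (see eq. (C.6) in the appendix), an iterate in $B(R_t - 1)$ stays in $B(R_t)$ after the update, which is what allows the descent inequality (3.3) to be applied with $R = R_t$.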
Assumption 3.5 can be satisfied by a line search, which ensures that the gradient descent update is not too aggressive and the boundary $R$ is increased properly.

With the additional Assumptions 3.1 and 3.5, exactly the same theorems can be proved for gradient descent. We restate them briefly here.

Theorem 3.6. Under Assumptions 1.1, 1.2 and 3.5, gradient descent satisfies:

• $\lim_{t \to \infty} \mathcal{R}(W(t)) = 0$.
• For any $1 \le k \le L$, $\lim_{t \to \infty} \|W_k(t)\|_F = \infty$.
• $\lim_{t \to \infty} \left| \left\langle w_{\mathrm{prod}}(t) / \prod_{k=1}^L \|W_k(t)\|_F, v_1(t) \right\rangle \right| = 1$, where $v_1(t)$ is the first right singular vector of $W_1(t)$.

Theorem 3.7. Under Assumptions 1.2, 3.5 and 2.7, for the logistic loss $\ell_{\log}$ and almost all data, $\lim_{t \to \infty} |\langle v_1(t), \bar{u} \rangle| = 1$, and $\lim_{t \to \infty} w_{\mathrm{prod}}(t) / \prod_{k=1}^L \|W_k(t)\|_F = \bar{u}$.

Proofs of Theorems 3.6 and 3.7 are given in Appendix C, and are basically the same as the gradient flow proofs. The key difference is that an error of $\sum_{t=0}^\infty \eta_t^2 \|\nabla \mathcal{R}(W(t))\|^2$ will be introduced in many parts of the proof. However, it is bounded in light of eq. (3.3):

$$\sum_{t=0}^\infty \eta_t^2 \left\| \nabla \mathcal{R}\left(W(t)\right) \right\|^2 \le \sum_{t=0}^\infty \eta_t \left\| \nabla \mathcal{R}\left(W(t)\right) \right\|^2 \le 2 \mathcal{R}\left(W(0)\right).$$

Since all weight matrices go to infinity, such a bounded error does not matter asymptotically, and thus the proofs still go through.

[Figure 2: Risk and alignment of dense layers (the ratio $\|W_i\|_2 / \|W_i\|_F$) of (nonlinear!) AlexNet on CIFAR-10. (a) Default initialization: default PyTorch initialization. (b) Initialization with the same Frobenius norm: initial Frobenius norms are forced to be equal amongst dense layers.]

4 Summary and future directions

This paper rigorously proves that, for deep linear networks on linearly separable data, gradient flow and gradient descent minimize the risk to 0, align adjacent weight matrices, and align the first right singular vector of the first layer to the maximum margin solution determined by the data. There are many potential future directions; a few are as follows.

Convergence rate. This paper only proves asymptotic convergence with no convergence rate. A convergence rate would allow the algorithm to be compared to other methods which also globally optimize this objective, would also suggest ways to improve step sizes and initialization, and ideally even exhibit a sensitivity to the network architecture and suggest how it could be improved.

Nonseparable data and nonlinear networks. Real-world data is generally not linearly separable, but nonlinear deep networks can reliably decrease the risk to 0, even with random labels (Zhang et al., 2017). This seems to suggest that a nonlinear notion of separability is at play; is there some way to adapt the present analysis?

The present analysis is crucially tied to the alignment of weight matrices: alignment and risk are analyzed simultaneously. Motivated by this, consider a preliminary experiment, presented in Figure 2, where stochastic gradient descent was used to minimize the risk of a standard AlexNet on CIFAR-10 (Krizhevsky et al., 2012; Krizhevsky and Hinton, 2009). Even though there are ReLUs, max-pooling layers, and convolutional layers, the alignment phenomenon is occurring in a reduced form on the dense layers (the last three layers of the network). Specifically, despite these weight matrices having shape (1024, 4096), (4096, 4096), and (4096, 10), the key alignment ratios $\|W_i\|_2 / \|W_i\|_F$ are much larger than their respective lower bounds $(1024^{-1/2}, 4096^{-1/2}, 10^{-1/2})$. Two initializations were tried: default PyTorch initialization, and a Gaussian initialization forcing all initial Frobenius norms to be just 4, which is suggested by the norm preservation property in the analysis and removes noise in the weights.

Acknowledgements

The authors are grateful for support from the NSF under an IIS grant. This grant allowed them to focus on research, and when combined with a gracious NVIDIA GPU grant, led to the creation of their beloved GPU machine DutchCrunch.

References

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint, 2018.
Peter Bartlett, Dylan Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. NIPS, 2017.
Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint, 2018.
Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint, 2018.
Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint, 2018.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
Thomas Laurent and James von Brecht. Deep linear neural networks with arbitrary loss: All local minima are global. arXiv preprint arXiv:1712.01473, 2017.
Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint, 2016.
Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint, 2017.
Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
Yi Zhou and Yingbin Liang. Critical points of linear neural networks: Analytical forms and landscape properties. 2018.

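A Regarding Assumption 1.2

Suppose $W_1(0) = 0$ while $W_L(0) \cdots W_2(0) \ne 0$. First of all, $W_L(0) \cdots W_1(0) = 0$ and thus $\mathcal{R}(W(0)) = \mathcal{R}(0)$. Moreover,

$$\left\langle \nabla \mathcal{R}\left(w_{\mathrm{prod}}(0)\right), \bar{u} \right\rangle = \frac{1}{n} \sum_{i=1}^n \ell'(0) \langle z_i, \bar{u} \rangle \le \ell'(0) \gamma < 0,$$

which implies $\nabla \mathcal{R}(w_{\mathrm{prod}}(0)) \ne 0$ and $\partial \mathcal{R} / \partial W_1 = \left( W_L(0) \cdots W_2(0) \right)^\top \nabla \mathcal{R}\left(w_{\mathrm{prod}}(0)\right)^\top \ne 0$.

On the other hand, if $\mathcal{R}(W(0)) > \mathcal{R}(0)$, gradient flow/descent may never minimize the risk to 0. For example, suppose the network has two layers, and both the input and output have dimension 1; the network just computes the dot product of two vectors $w_1$ and $w_2$. Consider minimizing $\mathcal{R}(w_1, w_2) = \exp(-\langle w_1, w_2 \rangle)$. If $w_1(0) = -w_2(0) \ne 0$, then $\mathcal{R}(w_1(0), w_2(0)) = \exp(\|w_1(0)\|^2) > \exp(0)$. It is easy to verify that for any $t$, $w_1(t) = -w_2(t)$, and $\mathcal{R}(w_1(t), w_2(t)) \ge \exp(0) > 0$.

B Omitted proofs from Section 2

Proof of Lemma 2.3. Fix an arbitrary $R > 0$. If the claim is not true, then for any $\epsilon > 0$, there exists some $t \ge 1$ such that $\|W_k\|_F \le R$ for all $k$ while $\|\partial \mathcal{R} / \partial W_1\|_F \le \epsilon$, which means

$$\left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F = \left\| W_2^\top \cdots W_L^\top \nabla \mathcal{R}(w_{\mathrm{prod}})^\top \right\|_F = \|W_L \cdots W_2\| \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\| \le \epsilon.$$

Since $\|w_{\mathrm{prod}}\| \le R^L$, we have

$$\left\langle \nabla \mathcal{R}(w_{\mathrm{prod}}), \bar{u} \right\rangle = \frac{1}{n} \sum_{i=1}^n \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) \langle z_i, \bar{u} \rangle \le \frac{1}{n} \sum_{i=1}^n \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) \gamma \le -M\gamma,$$

where $-M = \max_{-R^L \le x \le R^L} \ell'(x)$. Since $\ell'$ is continuous and the domain is bounded, the maximum is attained and negative, and thus $M > 0$. Therefore $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\| \ge M\gamma$, and thus $\|W_L \cdots W_2\| \le \epsilon / (M\gamma)$. Since $\|W_1\|_F \le R$, we further have $\|w_{\mathrm{prod}}\| \le \epsilon R / (M\gamma)$. In other words, after $t = 1$, $\|w_{\mathrm{prod}}\|$ may be arbitrarily small, which implies $\mathcal{R}(w_{\mathrm{prod}})$ can be arbitrarily close to $\mathcal{R}(0)$.

On the other hand, by Assumption 1.2, $d\mathcal{R}(W)/dt = -\|\nabla \mathcal{R}(W)\|^2 < 0$ at $t = 0$. This implies that $\mathcal{R}(W(1)) < \mathcal{R}(W(0))$, and for any $t \ge 1$, $\mathcal{R}(W(t)) \le \mathcal{R}(W(1)) < \mathcal{R}(W(0)) \le \mathcal{R}(0)$, which is a contradiction.

Since the risk is always positive, we have

$$\begin{aligned}
\mathcal{R}\left(W(0)\right) &\ge \int_{t=0}^\infty \sum_{k=1}^L \left\| \frac{\partial \mathcal{R}}{\partial W_k} \right\|_F^2 dt \ge \int_{t=0}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 dt \\
&\ge \int_{t=0}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] dt \\
&\ge \int_{t=1}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] dt \\
&\ge \epsilon(R)^2 \int_{t=1}^\infty \mathbb{1}\left[ \max_{1 \le k \le L} \|W_k\|_F \le R \right] dt,
\end{aligned}$$

which implies gradient flow only spends a finite amount of time in $\{ W \mid \max_{1 \le k \le L} \|W_k\|_F \le R \}$. This directly implies that $\max_{1 \le k \le L} \|W_k\|_F$ is unbounded.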
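The two-layer counterexample in Appendix A is easy to simulate. The sketch below runs plain gradient descent on $\exp(-\langle w_1, w_2 \rangle)$ from an initialization with $w_1(0) = -w_2(0)$; the starting vector, the step size `eta` and the number of iterations are arbitrary choices made for illustration.

```python
import numpy as np

def risk(w1, w2):
    """Exponential loss of the two-layer scalar network: exp(-<w1, w2>)."""
    return np.exp(-(w1 @ w2))

def gd_step(w1, w2, eta):
    """One gradient descent step; dR/dw1 = -R * w2 and dR/dw2 = -R * w1."""
    r = risk(w1, w2)
    return w1 + eta * r * w2, w2 + eta * r * w1

w1 = np.array([1.0, 0.5])
w2 = -w1                      # violates R(W(0)) <= R(0), since exp(||w1||^2) > 1
risks = [risk(w1, w2)]
for _ in range(2000):
    w1, w2 = gd_step(w1, w2, eta=0.01)
    risks.append(risk(w1, w2))

print(risks[0], risks[-1], np.linalg.norm(w1 + w2))
```

The antisymmetry $w_1 = -w_2$ is preserved by every update, so $\langle w_1, w_2 \rangle = -\|w_1\|^2 \le 0$ throughout and the risk can never fall below $\exp(0) = 1$. In this run both vectors shrink toward the origin, a saddle point of the objective, instead of diverging in a direction that drives the risk to 0.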

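Proof of Lemma 2.6. The first claim is true for $k = L$ since $W_L$ is a row vector. For any $1 \le k < L$, recall the following relation (Arora et al., 2018; Du et al., 2018):

$$W_{k+1}^\top(t) W_{k+1}(t) - W_{k+1}^\top(0) W_{k+1}(0) = W_k(t) W_k^\top(t) - W_k(0) W_k^\top(0). \tag{B.1}$$

Let $A_{k,k+1} = W_k(0) W_k^\top(0) - W_{k+1}^\top(0) W_{k+1}(0)$. By eq. (B.1) and the definition of singular vectors and singular values, we have

$$\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} = v_{k+1}^\top W_{k+1}^\top W_{k+1} v_{k+1} + v_{k+1}^\top A_{k,k+1} v_{k+1} = \sigma_{k+1}^2 + v_{k+1}^\top A_{k,k+1} v_{k+1} \ge \sigma_{k+1}^2 - \|A_{k,k+1}\|_2. \tag{B.2}$$

Moreover, by taking the trace on both sides of eq. (B.1), we have

$$\|W_k\|_F^2 = \mathrm{tr}\left(W_k W_k^\top\right) = \mathrm{tr}\left(W_{k+1}^\top W_{k+1}\right) + \mathrm{tr}\left(W_k(0) W_k^\top(0)\right) - \mathrm{tr}\left(W_{k+1}^\top(0) W_{k+1}(0)\right) = \|W_{k+1}\|_F^2 + \|W_k(0)\|_F^2 - \|W_{k+1}(0)\|_F^2. \tag{B.3}$$

Summing eq. (B.2) and eq. (B.3) from $k$ to $L - 1$, we get

$$\|W_k\|_F^2 - \|W_k\|_2^2 \le \|W_k(0)\|_F^2 - \|W_L(0)\|_F^2 + \sum_{k'=k}^{L-1} \|A_{k',k'+1}\|_2 \le D. \tag{B.4}$$

Next we prove that singular vectors get aligned. Consider $u_k^\top W_{k+1}^\top W_{k+1} u_k$. On one hand, similarly to eq. (B.2),

$$\begin{aligned}
u_k^\top W_{k+1}^\top W_{k+1} u_k &= u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k + u_k^\top W_{k+1}^\top(0) W_{k+1}(0) u_k \\
&\ge u_k^\top W_k W_k^\top u_k - u_k^\top W_k(0) W_k^\top(0) u_k \\
&\ge \sigma_k^2 - \|W_k(0)\|_2^2.
\end{aligned} \tag{B.5}$$

On the other hand, it follows from the definition of singular vectors and eq. (B.4) that

$$\begin{aligned}
u_k^\top W_{k+1}^\top W_{k+1} u_k &= \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + u_k^\top \left( W_{k+1}^\top W_{k+1} - \sigma_{k+1}^2 v_{k+1} v_{k+1}^\top \right) u_k \\
&\le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + \|W_{k+1}\|_F^2 - \|W_{k+1}\|_2^2 \\
&\le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D.
\end{aligned} \tag{B.6}$$

Combining eq. (B.5) and eq. (B.6), we get

$$\sigma_k^2 \le \langle u_k, v_{k+1} \rangle^2 \sigma_{k+1}^2 + D + \|W_k(0)\|_2^2. \tag{B.7}$$

Similar to eq. (B.5), we can get $\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} \ge \sigma_{k+1}^2 - \|W_{k+1}(0)\|_2^2$. Therefore

$$\frac{\sigma_k^2}{\sigma_{k+1}^2} \ge 1 - \frac{\|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}. \tag{B.8}$$

Combining eq. (B.7) and eq. (B.8), we finally get

$$\langle u_k, v_{k+1} \rangle^2 \ge 1 - \frac{D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2}{\sigma_{k+1}^2}.$$

Regarding the last claim, first recall that since the difference between the squares of Frobenius norms of any two layers is a constant, $\max_{1 \le k \le L} \|W_k\|_F \to \infty$ implies $\|W_k\|_F \to \infty$ for any $k$. We further have the following.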

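• Since $\|W_k\|_F^2 - \|W_k\|_2^2 \le D$, $\|W_k\|_2 \to \infty$ for any $k$, and $\left\| W_k / \|W_k\|_F - u_k v_k^\top \right\|_F \to 0$.
• Since $\|W_k\|_2 \to \infty$, $|\langle u_k, v_{k+1} \rangle| \to 1$.

As a result,

$$\left| \left\langle \frac{w_{\mathrm{prod}}}{\prod_{k=1}^L \|W_k\|_F}, v_1 \right\rangle \right| = \left| \frac{W_L}{\|W_L\|_F} \cdots \frac{W_1}{\|W_1\|_F} \, v_1 \right| \to \left| \left( u_L v_L^\top \right) \cdots \left( u_1 v_1^\top \right) v_1 \right| \to 1.$$

Proof of Theorem 2.2. Suppose for some $\epsilon > 0$, $\mathcal{R}(W) \ge \epsilon$ for any $t$. Then there exists some $1 \le j \le n$ such that $\ell(\langle w_{\mathrm{prod}}, z_j \rangle) \ge \epsilon$, and thus $\langle w_{\mathrm{prod}}, z_j \rangle \le \ell^{-1}(\epsilon)$. On the other hand, since $\mathcal{R}(W) \le \mathcal{R}(0) = \ell(0)$, $\ell(\langle w_{\mathrm{prod}}, z_j \rangle) \le n\ell(0)$, and thus $\langle w_{\mathrm{prod}}, z_j \rangle \ge \ell^{-1}(n\ell(0))$. Let $M = \max_{\ell^{-1}(n\ell(0)) \le x \le \ell^{-1}(\epsilon/n)} \ell'(x) < 0$. We have for any $t$,

$$\left\langle \nabla \mathcal{R}(w_{\mathrm{prod}}), \bar{u} \right\rangle = \frac{1}{n} \sum_{i=1}^n \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) \langle z_i, \bar{u} \rangle \le \frac{1}{n} \sum_{i=1}^n \ell'\left(\langle w_{\mathrm{prod}}, z_i \rangle\right) \gamma \le \frac{1}{n} \ell'\left(\langle w_{\mathrm{prod}}, z_j \rangle\right) \gamma \le \frac{M\gamma}{n} < 0,$$

and thus $\|\nabla \mathcal{R}(w_{\mathrm{prod}})\| \ge -M\gamma/n$.

Similar to the proof of Lemma 2.6, we can show that if $\|W_k\|_F \to \infty$, then $\left| \left\langle (W_L \cdots W_2)^\top, v_2 \right\rangle \right| / \left( \|W_L\|_F \cdots \|W_2\|_F \right) \to 1$. In other words, there exists some $C > 0$, such that when $\min_{1 \le k \le L} \|W_k\|_F > C$, $\|W_L \cdots W_2\| \ge \|W_L\|_F \cdots \|W_2\|_F / 2 > C^{L-1}/2$.

Lemma 2.3 shows that gradient flow spends a finite amount of time in $\{ W \mid \max_{1 \le k \le L} \|W_k\|_F \le R \}$ for any $R > 0$. Since the difference between the squares of Frobenius norms of any two layers is a constant, gradient flow also spends a finite amount of time in $\{ W \mid \min_{1 \le k \le L} \|W_k\|_F \le C \}$. Now we have

$$\begin{aligned}
\mathcal{R}\left(W(0)\right) &\ge \int_{t=0}^\infty \sum_{k=1}^L \left\| \frac{\partial \mathcal{R}}{\partial W_k} \right\|_F^2 dt \ge \int_{t=0}^\infty \left\| \frac{\partial \mathcal{R}}{\partial W_1} \right\|_F^2 dt \\
&= \int_{t=0}^\infty \|W_L \cdots W_2\|^2 \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|^2 \, dt \\
&\ge \int_{t=0}^\infty \|W_L \cdots W_2\|^2 \, \|\nabla \mathcal{R}(w_{\mathrm{prod}})\|^2 \, \mathbb{1}\left[ \min_{1 \le k \le L} \|W_k\|_F > C \right] dt \\
&\ge \left( \frac{M\gamma}{n} \right)^2 \left( \frac{C^{L-1}}{2} \right)^2 \int_{t=0}^\infty \mathbb{1}\left[ \min_{1 \le k \le L} \|W_k\|_F > C \right] dt \\
&= \infty,
\end{aligned}$$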

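which is a contradiction. Therefore $\mathcal{R}(W) \to 0$. This further implies $\|W_k\|_F \to \infty$, since $\mathcal{R}(W)$ has no finite optimum. Finally, invoking Lemma 2.6 proves the final claim of Theorem 2.2.

Proof of Lemma 2.9. Soudry et al. (2017, Lemma 8) prove that, with probability 1, there are at most $d$ support vectors, and moreover, the $i$-th support vector $z_i$ has a positive dual variable $\alpha_i$, such that $\sum_{i \in S} \alpha_i z_i = \bar{u}$. Suppose there exists some $\xi \perp \bar{u}$, such that $\max_{i \in S} \langle \xi, z_i \rangle \le 0$. Since

$$\sum_{i \in S} \alpha_i \langle \xi, z_i \rangle = \left\langle \xi, \sum_{i \in S} \alpha_i z_i \right\rangle = \langle \xi, \bar{u} \rangle = 0,$$

we actually have $\langle \xi, z_i \rangle = 0$ for all $i \in S$. This is impossible under Assumption 2.7, since the support vectors span the whole space.

Proof of Lemma 2.10. For the sake of presentation, we leave out the subscript in $z_i$ and denote a data point by $z$ generally. For any data point $z$ and predictor $w$, let $z_\perp$ and $w_\perp$ denote their projection onto $\bar{u}^\perp$. Let $z' \in \arg\max_{z \in S} \langle -w_\perp, z \rangle$, and thus $-\langle w_\perp, z' \rangle \ge \alpha \|w_\perp\|$.

For $\ell_{\exp}$, we have

$$\begin{aligned}
\langle w_\perp, \nabla \mathcal{R}(w) \rangle &= \sum_z \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z \rangle = \sum_z \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle \\
&\ge \frac{1}{n} \exp(-\langle w, z' \rangle) \left( -\langle w_\perp, z'_\perp \rangle \right) + \sum_{z : \langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle.
\end{aligned} \tag{B.9}$$

The first part can be lower bounded as below (recall that $-\langle w_\perp, z' \rangle = -\langle w_\perp, z'_\perp \rangle \ge \alpha \|w_\perp\|$):

$$\frac{1}{n} \exp(-\langle w, z' \rangle) \left( -\langle w_\perp, z'_\perp \rangle \right) = \frac{1}{n} \exp(-\langle w, \gamma\bar{u} \rangle) \exp(-\langle w_\perp, z'_\perp \rangle) \left( -\langle w_\perp, z'_\perp \rangle \right) \ge \frac{1}{n} \exp(-\langle w, \gamma\bar{u} \rangle) \exp(\alpha \|w_\perp\|) \, \alpha \|w_\perp\|. \tag{B.10}$$

To bound the second part, first notice that since we assume $\langle w, \bar{u} \rangle \ge 0$, for any $z$,

$$\langle w, z - \gamma\bar{u} \rangle = \langle w, z_\perp \rangle + \langle w, z - \gamma\bar{u} - z_\perp \rangle \ge \langle w, z_\perp \rangle = \langle w_\perp, z_\perp \rangle. \tag{B.11}$$

The reason is that every data point has margin at least $\gamma$, and thus $z - \gamma\bar{u} - z_\perp = c\bar{u}$ for some $c \ge 0$. Using eq. (B.11), we can bound the second part of eq. (B.9):

$$\begin{aligned}
\sum_{z : \langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle
&= \sum_{z : \langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma\bar{u} \rangle) \left[ -\exp(-\langle w, z - \gamma\bar{u} \rangle) \right] \langle w_\perp, z_\perp \rangle \\
&\ge \sum_{z : \langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma\bar{u} \rangle) \left[ -\exp(-\langle w_\perp, z_\perp \rangle) \right] \langle w_\perp, z_\perp \rangle \\
&\ge \sum_{z : \langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \exp(-\langle w, \gamma\bar{u} \rangle) \left( -\frac{1}{e} \right) \\
&\ge \exp(-\langle w, \gamma\bar{u} \rangle) \left( -\frac{1}{e} \right).
\end{aligned} \tag{B.12}$$

On the third line eq. (B.11) is applied. The fourth line applies the property that $f(x) = -x e^{-x} \ge -1/e$ when $x \ge 0$.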

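Combining eq. (B.9), eq. (B.10) and eq. (B.12), we get

$$\langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge \exp(-\langle w, \gamma\bar{u} \rangle) \left( \frac{1}{n} \exp(\alpha \|w_\perp\|) \, \alpha \|w_\perp\| - \frac{1}{e} \right).$$

As long as $\|w_\perp\| \ge (1 + \ln(n))/\alpha$, $\langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge 0$.

For $\ell_{\log}$, similar to eq. (B.9), we have

$$\begin{aligned}
\langle w_\perp, \nabla \mathcal{R}(w) \rangle &\ge \frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \left( -\langle w_\perp, z'_\perp \rangle \right) + \sum_{z : \langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \left[ -\frac{\exp(-\langle w, z \rangle)}{1 + \exp(-\langle w, z \rangle)} \right] \langle w_\perp, z_\perp \rangle \\
&\ge \frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \left( -\langle w_\perp, z'_\perp \rangle \right) + \sum_{z : \langle z_\perp, w_\perp \rangle \ge 0} \frac{1}{n} \left[ -\exp(-\langle w, z \rangle) \right] \langle w_\perp, z_\perp \rangle.
\end{aligned} \tag{B.13}$$

The second part of eq. (B.13) can be bounded again by eq. (B.12). To bound the first part of eq. (B.13), first notice that (recall $\langle w, \bar{u} \rangle \ge 0$)

$$\exp(-\langle w, z' \rangle) = \exp(-\langle w, \gamma\bar{u} \rangle) \exp(-\langle w_\perp, z'_\perp \rangle) \le \exp(-\langle w_\perp, z'_\perp \rangle). \tag{B.14}$$

Using eq. (B.14), and recalling that $-\langle w_\perp, z' \rangle = -\langle w_\perp, z'_\perp \rangle \ge \alpha \|w_\perp\| \ge 0$, we can bound the first part of eq. (B.13) as below:

$$\begin{aligned}
\frac{1}{n} \frac{\exp(-\langle w, z' \rangle)}{1 + \exp(-\langle w, z' \rangle)} \left( -\langle w_\perp, z'_\perp \rangle \right)
&= \frac{1}{n} \exp(-\langle w, \gamma\bar{u} \rangle) \frac{\exp(-\langle w_\perp, z'_\perp \rangle)}{1 + \exp(-\langle w, z' \rangle)} \left( -\langle w_\perp, z'_\perp \rangle \right) \\
&\ge \frac{1}{n} \exp(-\langle w, \gamma\bar{u} \rangle) \frac{\exp(-\langle w_\perp, z'_\perp \rangle)}{1 + \exp(-\langle w_\perp, z'_\perp \rangle)} \left( -\langle w_\perp, z'_\perp \rangle \right) \\
&\ge \frac{1}{2n} \exp(-\langle w, \gamma\bar{u} \rangle) \left( -\langle w_\perp, z'_\perp \rangle \right) \\
&\ge \frac{1}{2n} \exp(-\langle w, \gamma\bar{u} \rangle) \, \alpha \|w_\perp\|.
\end{aligned} \tag{B.15}$$

Combining eq. (B.13), eq. (B.15) and eq. (B.12), we get

$$\langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge \exp(-\langle w, \gamma\bar{u} \rangle) \left( \frac{\alpha \|w_\perp\|}{2n} - \frac{1}{e} \right).$$

As long as $\|w_\perp\| \ge 2n/(e\alpha)$, $\langle w_\perp, \nabla \mathcal{R}(w) \rangle \ge 0$.

Proof of Theorem 2.8. Recall that

$$\frac{dW_1}{dt} = -\frac{\partial \mathcal{R}}{\partial W_1} = -W_2^\top \cdots W_L^\top \nabla \mathcal{R}(w_{\mathrm{prod}})^\top,$$

and thus

$$\frac{d \|W_1\|_F^2}{dt} = 2 \left\langle W_1, \frac{dW_1}{dt} \right\rangle = -2 \left\langle w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \right\rangle. \tag{B.16}$$

Let $\Pi_{\bar{u}}$ denote the projection onto $\mathrm{span}(\bar{u})$, and let $\Pi_\perp$ denote the projection onto $\bar{u}^\perp$. Also let $\Pi_{\bar{u}} W_1$ and $\Pi_\perp W_1$ denote the projection of rows of $W_1$ onto $\mathrm{span}(\bar{u})$ and $\bar{u}^\perp$, respectively. Notice that $\Pi_{\bar{u}} w_{\mathrm{prod}} = \left( W_L \cdots W_2 (\Pi_{\bar{u}} W_1) \right)^\top$ and $\Pi_\perp w_{\mathrm{prod}} = \left( W_L \cdots W_2 (\Pi_\perp W_1) \right)^\top$. We further have

$$\frac{d \|\Pi_\perp W_1\|_F^2}{dt} = -2 \left\langle \Pi_\perp w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \right\rangle. \tag{B.17}$$

Let $W_1 = u_1 \sigma_1 v_1^\top + S$. We have $\|S\|_2 = \sigma_{1,2}$ and $\sigma_{1,2}^2 \le \|W_1\|_F^2 - \|W_1\|_2^2 \le D$, where $\sigma_{1,2}$ is the second singular value of $W_1$ and $D$ is the constant introduced in Lemma 2.6. Then

$$\Pi_\perp W_1 = u_1 \sigma_1 (\Pi_\perp v_1)^\top + \Pi_\perp S,$$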

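and

$$\|\Pi_\perp W_1\|_F \le \left\| u_1 \sigma_1 (\Pi_\perp v_1)^\top \right\|_F + \|\Pi_\perp S\|_F = \sigma_1 \|\Pi_\perp v_1\| + \|\Pi_\perp S\|_F \le \sigma_1 \|\Pi_\perp v_1\| + \sqrt{dD}.$$

It follows that

$$\|\Pi_\perp v_1\| \ge \frac{\|\Pi_\perp W_1\|_F}{\sigma_1} - \frac{\sqrt{dD}}{\sigma_1} \ge \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} - \frac{\sqrt{dD}}{\|W_1\|_2}. \tag{B.18}$$

Fix an arbitrary $\epsilon > 0$. By Theorem 2.2, we can find some $t_0$ large enough such that for any $t \ge t_0$:

1. $\sqrt{dD} / \|W_1\|_2 \le \epsilon/3$.
2. $\left\| w_{\mathrm{prod}} / \|W_L \cdots W_1\| - v_1 \right\| \le \epsilon/3$, or $\left\| w_{\mathrm{prod}} / \|W_L \cdots W_1\| + v_1 \right\| \le \epsilon/3$.
3. $\|W_L \cdots W_1\| \ge 3K/\epsilon$, where $K$ is the threshold given in Lemma 2.10, i.e., $(1 + \ln(n))/\alpha$ for $\ell_{\exp}$, and $2n/(e\alpha)$ for $\ell_{\log}$.
4. $\mathcal{R}(W) \le \ell(0)/n$, which implies $\langle w_{\mathrm{prod}}, z_i \rangle \ge 0$ for all $1 \le i \le n$. By Lemma 2.9, there always exists a support vector $z$ for which $\langle \Pi_\perp w_{\mathrm{prod}}, z \rangle \le 0$, and therefore $\langle w_{\mathrm{prod}}, \bar{u} \rangle \ge 0$.

Suppose for some $t \ge t_0$, $\|\Pi_\perp W_1\|_F / \|W_1\|_F \ge \epsilon$. By eq. (B.18) and item 1 above, $\|\Pi_\perp v_1\| \ge 2\epsilon/3$. Item 2 above then gives $\|\Pi_\perp w_{\mathrm{prod}}\| / \|W_L \cdots W_1\| \ge \epsilon/3$, which together with item 3 above implies $\|\Pi_\perp w_{\mathrm{prod}}\| \ge K$. Since also $\langle w_{\mathrm{prod}}, \bar{u} \rangle \ge 0$, we can apply Lemma 2.10 and get that $\langle \Pi_\perp w_{\mathrm{prod}}, \nabla \mathcal{R}(w_{\mathrm{prod}}) \rangle \ge 0$. In light of eq. (B.17), $d \|\Pi_\perp W_1\|_F^2 / dt \le 0$. On the other hand, since after $t \ge t_0$, $\langle w_{\mathrm{prod}}, z_i \rangle \ge 0$, we have $d \|W_1\|_F^2 / dt \ge 0$ by eq. (B.16). Therefore $\|\Pi_\perp W_1\|_F / \|W_1\|_F$ will not increase, and since $\|W_1\|_F \to \infty$, it will eventually drop below $\epsilon$, and will never exceed $\epsilon$ again. Therefore,

$$\limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} \le \epsilon.$$

Since $\epsilon$ is arbitrary, we have

$$\limsup_{t \to \infty} \frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} = 0,$$

and thus $\lim_{t \to \infty} |\langle v_1, \bar{u} \rangle| = 1$. An application of Theorem 2.2 gives the other part of Theorem 2.8.

C Omitted proofs from Section 3

Proof of Lemma 3.2. Given $W, V \in B(R)$, we need to show that $\|\nabla \mathcal{R}(W) - \nabla \mathcal{R}(V)\| \le \beta(R) \|W - V\|$ for some $\beta(R)$. Consider $k = 1$ first. Let $w = (W_L \cdots W_1)^\top$ and $v = (V_L \cdots V_1)^\top$. Since $|\ell'| \le G$, $\|\nabla \mathcal{R}(w)\|, \|\nabla \mathcal{R}(v)\| \le G$. We have

$$\begin{aligned}
\left\| \frac{\partial \mathcal{R}}{\partial W_1} - \frac{\partial \mathcal{R}}{\partial V_1} \right\| &= \left\| W_2^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\| \\
&\le \left\| W_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top \right\| + \left\| V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\| \\
&\le R^{L-2} G \|W_2 - V_2\| + \left\| V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\| \\
&\le R^{L-2} G \|W - V\| + \left\| V_2^\top W_3^\top \cdots W_L^\top \nabla \mathcal{R}(w)^\top - V_2^\top \cdots V_L^\top \nabla \mathcal{R}(v)^\top \right\|.
\end{aligned} \tag{C.1}$$

Proceeding in this way, we can get

$$\left\| \frac{\partial \mathcal{R}}{\partial W_1} - \frac{\partial \mathcal{R}}{\partial V_1} \right\| \le (L - 1) R^{L-2} G \|W - V\| + R^{L-1} \|\nabla \mathcal{R}(w) - \nabla \mathcal{R}(v)\|. \tag{C.2}$$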

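Since $\|z_i\| \le 1$ and $\ell'$ is $\beta$-Lipschitz, we have

$$\|\nabla\mathcal{R}(w) - \nabla\mathcal{R}(v)\| \le \beta\|w - v\| \le \beta L R^{L-1}\|W - V\|, \qquad \text{(C.3)}$$

where the last inequality follows from a similar one-by-one replacement procedure as in eq. (C.1). Combining eq. (C.2) and eq. (C.3), we get for $R \ge 1$,

$$\Big\|\frac{\partial\mathcal{R}}{\partial W_1} - \frac{\partial\mathcal{R}}{\partial V_1}\Big\| \le \big((L-1)R^{L-2}G + \beta L R^{2L-2}\big)\|W - V\| \le L R^{2L-2}(\beta + G)\|W - V\|.$$

The same procedure can be done for other layers, and together

$$\|\nabla\mathcal{R}(W) - \nabla\mathcal{R}(V)\| \le L^2 R^{2L-2}(\beta + G)\|W - V\|.$$

Proof of Lemma 3.4. Recall that if $W(t), W(t+1) \in B(R)$ and $\eta_t = 1/\beta(R)$,

$$\begin{aligned}
\mathcal{R}\big(W(t+1)\big) - \mathcal{R}\big(W(t)\big)
&\le \Big\langle \nabla\mathcal{R}\big(W(t)\big), -\eta_t\nabla\mathcal{R}\big(W(t)\big)\Big\rangle + \frac{\beta(R)\eta_t^2}{2}\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 \\
&= -\frac{1}{2\beta(R)}\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 = -\frac{\eta_t}{2}\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2. \qquad \text{(C.4)}
\end{aligned}$$

Suppose $W(t) \in B(R)$ for all $t$. By Assumption . and eq. (C.4),

$$\mathcal{R}\big(W(1)\big) \le \mathcal{R}\big(W(0)\big) - \frac{1}{2\beta(R)}\big\|\nabla\mathcal{R}\big(W(0)\big)\big\|^2 < \mathcal{R}\big(W(0)\big).$$

By eq. (C.4), gradient descent never increases the risk, and thus for all $t \ge 1$, $\mathcal{R}\big(W(t)\big) \le \mathcal{R}\big(W(1)\big) < \mathcal{R}\big(W(0)\big)$. In exactly the same way as in the proof of Lemma 2.3, one can show that there exists some constant $\epsilon(R) > 0$ so that $\|\partial\mathcal{R}/\partial W_1(t)\|_F \ge \epsilon(R)$ for all $t \ge 1$. Invoking eq. (C.4) again, we get

$$\mathcal{R}\big(W(0)\big) \ge \sum_{t=1}^{\infty} \frac{1}{2\beta(R)}\,\epsilon(R)^2 = \infty,$$

which is a contradiction. Therefore $W(t)$ must go out of $B(R)$ at some time.

Next we prove Theorems 3.6 and 3.7. The proofs depend on several lemmas which are similar to the gradient flow ones. The following Lemma C.5 is similar to Lemma 2.3.

Lemma C.5. Under Assumptions . to . and 3.5, gradient descent ensures that

- $\max_{1\le k\le L}\|W_k(t)\|_F$ is unbounded.
- $\sum_{t=0}^{\infty}\eta_t = \infty$.
- For any $R > 0$, $\sum_{t:\,W(t)\in B(R)}\eta_t < \infty$.

Proof. By Assumption 3.5, we always have that $W(t) \in B(R_t)$. Since $\beta(R_t) = L^2 R_t^{2L-2}(\beta + G) \ge R_t^{L-1}G$, we have for any $1 \le k \le L$,

$$\begin{aligned}
\|W_k(t+1)\|_F &\le \|W_k(t)\|_F + \eta_t\Big\|\frac{\partial\mathcal{R}}{\partial W_k}(t)\Big\|_F
\le \|W_k(t)\|_F + \frac{1}{\beta(R_t)}\Big\|\frac{\partial\mathcal{R}}{\partial W_k}(t)\Big\|_F \\
&\le \|W_k(t)\|_F + \frac{R_t^{L-1}G}{\beta(R_t)} \le \|W_k(t)\|_F + 1. \qquad \text{(C.6)}
\end{aligned}$$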

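Moreover, Lemma 3.4 shows that $R_t \to \infty$. Since $R_{t+1} = R_t$ as long as $W(t+1) \in B(R_t - 1)$, $\max_{1\le k\le L}\|W_k(t)\|_F$ is unbounded. It then follows that for any $t$, by Cauchy-Schwarz,

$$\Big(\sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|\Big)^2 \le \Big(\sum_{\tau=0}^{t-1}\eta_\tau\Big)\Big(\sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2\Big),$$

where the left-hand side is at least $\|W(t) - W(0)\|^2$ and is therefore unbounded in $t$. Since eq. (C.4) implies

$$\sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2 \le 2\mathcal{R}\big(W(0)\big) - 2\mathcal{R}\big(W(t)\big) \le 2\mathcal{R}\big(W(0)\big),$$

together we have $\sum_{t=0}^{\infty}\eta_t = \infty$.

Since under Assumptions 3.1 and 3.5 gradient descent never increases the risk, it can be shown in exactly the same way as in the proof of Lemma 2.3 that, for $W(t) \in B(R)$, $\|\partial\mathcal{R}/\partial W_1(t)\|_F \ge \epsilon(R)$ for some constant $\epsilon(R) > 0$. Invoking eq. (C.4) again, we get that $\sum_{t:\,W(t)\in B(R)}\eta_t < \infty$.

The next lemma is an analog of Lemma 2.6.

Lemma C.7. Under Assumptions . and 3.1, the gradient descent iterates satisfy the following properties:

- For any $1 \le k \le L$, $\|W_k\|_F^2 - \|W_k\|_2^2 \le D + 2\mathcal{R}\big(W(0)\big)$.
- For any $1 \le k < L$, $\langle v_{k+1}, u_k\rangle^2 \ge 1 - \dfrac{D + 3\mathcal{R}\big(W(0)\big) + \|W_{k+1}(0)\|_2^2 + \|W_k(0)\|_2^2}{\|W_{k+1}\|_2^2}$.
- Suppose $\max_{1\le k\le L}\|W_k\|_F \to \infty$, then $\Big|\Big\langle w_{\mathrm{prod}}\big/\prod_{k=1}^{L}\|W_k\|_F,\ v_1\Big\rangle\Big| \to 1$.

Proof. Recall that for any $W$,

$$W_{k+1}^\top\frac{\partial\mathcal{R}}{\partial W_{k+1}} = W_{k+1}^\top \cdots W_L^\top \nabla\mathcal{R}(w_{\mathrm{prod}})^\top W_1^\top \cdots W_k^\top = \frac{\partial\mathcal{R}}{\partial W_k}W_k^\top. \qquad \text{(C.8)}$$

For gradient descent iterates, summing eq. (C.8) from $0$ to $t-1$, we get

$$\begin{aligned}
& W_{k+1}^\top(t)W_{k+1}(t) - W_{k+1}^\top(0)W_{k+1}(0) + \sum_{\tau=0}^{t-1}\eta_\tau^2\Big(\frac{\partial\mathcal{R}}{\partial W_{k+1}}(\tau)\Big)^\top\Big(\frac{\partial\mathcal{R}}{\partial W_{k+1}}(\tau)\Big) \\
&\quad = W_k(t)W_k^\top(t) - W_k(0)W_k^\top(0) + \sum_{\tau=0}^{t-1}\eta_\tau^2\Big(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\Big)\Big(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\Big)^\top. \qquad \text{(C.9)}
\end{aligned}$$

For any $1 \le k \le L$ and any $t$, let

$$P_k(t) = \sum_{\tau=0}^{t-1}\eta_\tau^2\Big(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\Big)\Big(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\Big)^\top,$$

and

$$Q_k(t) = \sum_{\tau=0}^{t-1}\eta_\tau^2\Big(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\Big)^\top\Big(\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\Big).$$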

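We have $\|P_k(t)\|_2 \le \mathrm{tr}\big(P_k(t)\big)$ and $\|Q_k(t)\|_2 \le \mathrm{tr}\big(Q_k(t)\big) = \mathrm{tr}\big(P_k(t)\big)$. Moreover, invoking eq. (C.4),

$$\begin{aligned}
\sum_{k=1}^{L}\mathrm{tr}\big(P_k(t)\big)
&= \sum_{k=1}^{L}\sum_{\tau=0}^{t-1}\eta_\tau^2\Big\|\frac{\partial\mathcal{R}}{\partial W_k}(\tau)\Big\|_F^2
= \sum_{\tau=0}^{t-1}\eta_\tau^2\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2 \\
&\le \sum_{\tau=0}^{t-1}\eta_\tau\big\|\nabla\mathcal{R}\big(W(\tau)\big)\big\|^2
\le 2\mathcal{R}\big(W(0)\big) - 2\mathcal{R}\big(W(t)\big) \le 2\mathcal{R}\big(W(0)\big). \qquad \text{(C.10)}
\end{aligned}$$

Still let $\sigma_k(t)$, $u_k(t)$ and $v_k(t)$ denote the first singular value, left singular vector and right singular vector of $W_k(t)$. We can then proceed basically in the same way as in the proof of Lemma 2.6. For example, eq. (B.) becomes

$$\sigma_k^2(t) \ge \sigma_{k+1}^2(t) - \|A_{k,k+1}(t)\|_2 - \|P_k(t)\|_2 \ge \sigma_{k+1}^2(t) - \|A_{k,k+1}(t)\|_2 - \mathrm{tr}\big(P_k(t)\big), \qquad \text{(C.11)}$$

while eq. (B.3) becomes

$$\|W_k(t)\|_F^2 = \|W_{k+1}(t)\|_F^2 + \|W_k(0)\|_F^2 - \|W_{k+1}(0)\|_F^2 - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_{k+1}(t)\big). \qquad \text{(C.12)}$$

Summing eq. (C.11) and eq. (C.12) from $k$ to $L-1$, and invoking eq. (C.10), we get

$$\|W_k(t)\|_F^2 - \|W_k(t)\|_2^2 \le D - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_L(t)\big) + \sum_{k'=k}^{L-1}\mathrm{tr}\big(P_{k'}(t)\big) \le D + 2\mathcal{R}\big(W(0)\big).$$

To prove that the singular vectors get aligned, we can still proceed in nearly the same way as in the proof of Lemma 2.6. Eq. (B.5) becomes

$$u_k^\top W_{k+1}^\top W_{k+1}u_k \ge \sigma_k^2 - \|W_k(0)\|_2^2 - \|Q_{k+1}(t)\|_2, \qquad \text{(C.13)}$$

while eq. (B.6) becomes

$$u_k^\top W_{k+1}^\top W_{k+1}u_k \le \langle u_k, v_{k+1}\rangle^2\sigma_{k+1}^2 + D + 2\mathcal{R}\big(W(0)\big). \qquad \text{(C.14)}$$

Combining eq. (C.13) and eq. (C.14),

$$\sigma_k^2 \le \langle u_k, v_{k+1}\rangle^2\sigma_{k+1}^2 + D + 2\mathcal{R}\big(W(0)\big) + \|Q_{k+1}(t)\|_2 + \|W_k(0)\|_2^2. \qquad \text{(C.15)}$$

Similar to eq. (C.13), we can get

$$\sigma_k^2 \ge v_{k+1}^\top W_k W_k^\top v_{k+1} \ge \sigma_{k+1}^2 - \|W_{k+1}(0)\|_2^2 - \|P_k(t)\|_2,$$

and thus eq. (B.8) becomes

$$\frac{\sigma_k^2}{\sigma_{k+1}^2} \ge 1 - \frac{\|W_{k+1}(0)\|_2^2 + \|P_k(t)\|_2}{\sigma_{k+1}^2}. \qquad \text{(C.16)}$$

Combining eq. (C.15) and eq. (C.16), we get

$$\langle u_k, v_{k+1}\rangle^2 \ge 1 - \frac{D + \|W_k(0)\|_2^2 + \|W_{k+1}(0)\|_2^2 + 3\mathcal{R}\big(W(0)\big)}{\sigma_{k+1}^2}.$$

The final claim of Lemma C.7 can be proved in exactly the same way as Lemma 2.6.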
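The identity in eq. (C.8), on which the proof above rests, is purely algebraic, so it can be sanity-checked numerically. The following is a minimal sketch in plain Python and is not part of the paper: the helper names (`transpose`, `matmul`, `chain`, `layer_gradients`, `identity_gap`, `random_matrix`) and the layer shapes are illustrative assumptions, and the vector `g` merely stands in for $\nabla\mathcal{R}(w_{\mathrm{prod}})$, since the identity holds for any such vector.

```python
import random


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def chain(mats):
    """Product of a list of matrices in order, or None for an empty list."""
    out = None
    for m in mats:
        out = m if out is None else matmul(out, m)
    return out


def layer_gradients(weights, g):
    """dR/dW_k for k = 1..L, with g standing in for grad R(w_prod).

    weights[k - 1] holds W_k; W_1 is d_1 x d and W_L is 1 x d_{L-1}.
    Implements dR/dW_k = W_{k+1}^T ... W_L^T g^T W_1^T ... W_{k-1}^T.
    """
    g_row = [list(g)]  # g^T as a 1 x d matrix
    grads = []
    for k in range(len(weights)):
        left = chain([transpose(w) for w in weights[k + 1:]])
        right = chain([transpose(w) for w in weights[:k]])
        term = g_row if left is None else matmul(left, g_row)
        grads.append(term if right is None else matmul(term, right))
    return grads


def identity_gap(weights, g):
    """Largest entrywise gap between W_{k+1}^T dR/dW_{k+1} and dR/dW_k W_k^T."""
    grads = layer_gradients(weights, g)
    gap = 0.0
    for k in range(len(weights) - 1):
        lhs = matmul(transpose(weights[k + 1]), grads[k + 1])
        rhs = matmul(grads[k], transpose(weights[k]))
        for row_l, row_r in zip(lhs, rhs):
            for x, y in zip(row_l, row_r):
                gap = max(gap, abs(x - y))
    return gap


def random_matrix(rows, cols, rng):
    """Matrix with independent entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 3-layer network on inputs of dimension d = 3.
    weights = [random_matrix(2, 3, rng), random_matrix(2, 2, rng), random_matrix(1, 2, rng)]
    g = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    print("max gap:", identity_gap(weights, g))  # should be at rounding level
```

Because both sides of eq. (C.8) expand to the same matrix product, the reported gap should be zero up to floating-point rounding for any choice of weights and `g`.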

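Proof of Theorem 3.6. Summing eq. (C.12), we know that for any two different layers $j > k$,

$$\|W_k(t)\|_F^2 - \|W_j(t)\|_F^2 = \|W_k(0)\|_F^2 - \|W_j(0)\|_F^2 - \mathrm{tr}\big(P_k(t)\big) + \mathrm{tr}\big(Q_j(t)\big).$$

Recalling eq. (C.10), we know that

$$\Big|\big(\|W_k(t)\|_F^2 - \|W_j(t)\|_F^2\big) - \big(\|W_k(0)\|_F^2 - \|W_j(0)\|_F^2\big)\Big| \le 2\mathcal{R}\big(W(0)\big). \qquad \text{(C.17)}$$

In other words, the difference between the squares of Frobenius norms of any two layers is still bounded. The proof then goes in the same way as the proof of Theorem 2.2. Suppose the risk is always above $\epsilon > 0$. Then there exists some $c(\epsilon) > 0$ such that $\|\nabla\mathcal{R}(w_{\mathrm{prod}})\| \ge c(\epsilon)$. By Lemma C.7, there exists some $C$ such that if $\min_{1\le k\le L}\|W_k(t)\|_F > C$, then $\|W_L(t) \cdots W_2(t)\| \ge C^{L-1}/2$. By eq. (C.17) and Lemma C.5, $\sum_{t:\,\|W_k(t)\|_F \le C \text{ for some } k}\eta_t$ is finite. On the other hand, by Lemma C.5, $\sum_{t=0}^{\infty}\eta_t = \infty$, and thus $\sum_{t:\,\|W_k(t)\|_F > C \text{ for all } k}\eta_t = \infty$. Therefore we have, by invoking eq. (C.4),

$$\mathcal{R}\big(W(0)\big) \ge \sum_{t=0}^{\infty}\frac{\eta_t}{2}\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 \ge \sum_{t=0}^{\infty}\frac{\eta_t}{2}\Big\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\|_F^2 \ge \sum_{t:\,\|W_k(t)\|_F > C \text{ for all } k}\frac{\eta_t}{2}\cdot\frac{c(\epsilon)^2C^{2L-2}}{4} = \infty,$$

which is a contradiction. Therefore $\mathcal{R}\big(W(t)\big) \to 0$, and since it has no finite optimum, $\|W_k\|_F \to \infty$. The other results follow from Lemma C.5.

Proof of Theorem 3.7. Recall that

$$\frac{\partial\mathcal{R}}{\partial W_1} = W_2^\top \cdots W_L^\top \nabla\mathcal{R}(w_{\mathrm{prod}})^\top,$$

and thus

$$\begin{aligned}
\|W_1(t+1)\|_F^2 &= \|W_1(t)\|_F^2 - 2\eta_t\Big\langle W_1(t), \frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\rangle + \eta_t^2\Big\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\|_F^2 \\
&= \|W_1(t)\|_F^2 - 2\eta_t\Big\langle w_{\mathrm{prod}}(t), \nabla\mathcal{R}\big(w_{\mathrm{prod}}(t)\big)\Big\rangle + \eta_t^2\Big\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\|_F^2.
\end{aligned}$$

If $\langle w_{\mathrm{prod}}, z_i\rangle \ge 0$ for all $i$, then $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$. Also recall that $\Pi_\perp W_1(t)$ denotes the projection of rows of $W_1(t)$ onto $\bar u^\perp$, the orthogonal complement of $\mathrm{span}(\bar u)$. We have

$$\begin{aligned}
\|\Pi_\perp W_1(t+1)\|_F^2 &\le \|\Pi_\perp W_1(t)\|_F^2 - 2\eta_t\Big\langle \Pi_\perp W_1(t), \frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\rangle + \eta_t^2\Big\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\|_F^2 \\
&= \|\Pi_\perp W_1(t)\|_F^2 - 2\eta_t\Big\langle \Pi_\perp w_{\mathrm{prod}}(t), \nabla\mathcal{R}\big(w_{\mathrm{prod}}(t)\big)\Big\rangle + \eta_t^2\Big\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\|_F^2. \qquad \text{(C.18)}
\end{aligned}$$

Invoking eq. (C.4) again gives

$$\eta_t^2\Big\|\frac{\partial\mathcal{R}}{\partial W_1}(t)\Big\|_F^2 \le \eta_t\big\|\nabla\mathcal{R}\big(W(t)\big)\big\|^2 \le 2\Big(\mathcal{R}\big(W(t)\big) - \mathcal{R}\big(W(t+1)\big)\Big). \qquad \text{(C.19)}$$

The proof then goes in almost the same way as the proof of Theorem 2.8. For any $\epsilon > 0$, we can find some large enough time $t_0$, such that for any $t \ge t_0$, $\eta_t \le 1$, and

1. $\|\Pi_\perp W_1(t)\|_F/\|W_1(t)\|_F \ge \epsilon$ implies that $\big\langle \Pi_\perp w_{\mathrm{prod}}(t), \nabla\mathcal{R}\big(w_{\mathrm{prod}}(t)\big)\big\rangle \ge 0$.
2. $\langle w_{\mathrm{prod}}(t), z_i\rangle \ge 0$ for all $i$, and thus $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$.
3. $\|W_1(t)\|_F \ge \Big(1 + \sqrt{2\mathcal{R}\big(W(0)\big)}\Big)/\epsilon$.

Suppose at some time $t_1 \ge t_0$, $\|\Pi_\perp W_1(t_1)\|_F/\|W_1(t_1)\|_F \ge \epsilon$. As long as this still holds, in light of bullet (1) above, eq. (C.18) and eq. (C.19), $\|\Pi_\perp W_1\|_F^2$ will increase by at most $2\mathcal{R}\big(W(t_1)\big) \le 2\mathcal{R}\big(W(0)\big)$. On the other hand, $\|W_1\|_F \to \infty$, and thus there exists some $t_2 > t_1$ such that $\|\Pi_\perp W_1(t_2)\|_F/\|W_1(t_2)\|_F < \epsilon$. Let $t_3$ denote the smallest time after $t_2$ such that $\|\Pi_\perp W_1(t_3)\|_F/\|W_1(t_3)\|_F \ge \epsilon$ (if it exists). Recalling that $\|W_1(t+1)\|_F \le \|W_1(t)\|_F + 1$ for any $t \ge 0$, and $\|W_1(t+1)\|_F \ge \|W_1(t)\|_F$ for any $t \ge t_0$, we have

$$\frac{\|\Pi_\perp W_1(t_3)\|_F}{\|W_1(t_3)\|_F} \le \frac{\|\Pi_\perp W_1(t_3)\|_F}{\|W_1(t_3-1)\|_F} \le \frac{\|\Pi_\perp W_1(t_3-1)\|_F + 1}{\|W_1(t_3-1)\|_F} < \epsilon + \frac{1}{\|W_1(t_3-1)\|_F}.$$

After $t_3$, $\|\Pi_\perp W_1\|_F^2$ will increase by at most $2\mathcal{R}\big(W(0)\big)$, and thus $\|\Pi_\perp W_1\|_F$ will increase by at most $\sqrt{2\mathcal{R}\big(W(0)\big)}$. Therefore, for any $t_4 \ge t_3$, as long as $\|\Pi_\perp W_1(t_4)\|_F/\|W_1(t_4)\|_F \ge \epsilon$, we have

$$\frac{\|\Pi_\perp W_1(t_4)\|_F}{\|W_1(t_4)\|_F} \le \frac{\|\Pi_\perp W_1(t_4)\|_F}{\|W_1(t_3)\|_F} \le \frac{\|\Pi_\perp W_1(t_3)\|_F + \sqrt{2\mathcal{R}\big(W(0)\big)}}{\|W_1(t_3)\|_F} \le \epsilon + \frac{1}{\|W_1(t_3-1)\|_F} + \frac{\sqrt{2\mathcal{R}\big(W(0)\big)}}{\|W_1(t_3)\|_F} \le 2\epsilon,$$

since $\|W_1(t)\|_F \ge \Big(1 + \sqrt{2\mathcal{R}\big(W(0)\big)}\Big)/\epsilon$ after $t_0$. In other words,

$$\limsup_{t\to\infty}\frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} \le 2\epsilon.$$

Since $\epsilon$ is arbitrary, we have

$$\limsup_{t\to\infty}\frac{\|\Pi_\perp W_1\|_F}{\|W_1\|_F} = 0,$$

and thus $\lim_{t\to\infty}|\langle v_1, \bar u\rangle| = 1$.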
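The row-wise projections used in the proof above can be illustrated on a tiny example. The sketch below, in plain Python, is an illustration only and not the authors' code: the matrices `W1`, `W2`, the unit vector `u` (playing the role of $\bar u$) and the helper names are hypothetical. It checks two facts the proof relies on: projecting the rows of $W_1$ onto $\bar u^\perp$ and then multiplying by the later layers gives the projection of $w_{\mathrm{prod}}$ onto $\bar u^\perp$, and the squared Frobenius norm of $W_1$ splits across the two projections. It also computes the ratio $\|\Pi_\perp W_1\|_F/\|W_1\|_F$ that the proof tracks.

```python
import math


def split_rows(W, u):
    """Split each row of W into its part along the unit vector u and the orthogonal remainder."""
    along, perp = [], []
    for row in W:
        coeff = sum(r * ui for r, ui in zip(row, u))
        a = [coeff * ui for ui in u]
        along.append(a)
        perp.append([r - ai for r, ai in zip(row, a)])
    return along, perp


def frob_sq(W):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in W for x in row)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Hypothetical two-layer example: W1 is 2 x 3, W2 is 1 x 2, and u is a unit vector.
W1 = [[1.0, 2.0, 2.0], [0.0, 3.0, -1.0]]
W2 = [[2.0, -1.0]]
u = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]

W1_along, W1_perp = split_rows(W1, u)

# w_prod^T = W2 W1 as a 1 x 3 row, and its component orthogonal to u.
w_prod_row = matmul(W2, W1)
_, w_prod_perp = split_rows(w_prod_row, u)

# The same vector obtained by projecting the rows of W1 first.
via_first_layer = matmul(W2, W1_perp)

# Fraction of the Frobenius norm of W1 that lies outside span(u).
ratio = math.sqrt(frob_sq(W1_perp) / frob_sq(W1))
```

The alignment statement $\lim_{t\to\infty}|\langle v_1, \bar u\rangle| = 1$ corresponds to `ratio` tending to zero along the gradient descent iterates; this snippet evaluates it only at one fixed, made-up $W_1$.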


More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Composite Functional Gradient Learning of Generative Adversarial Models. Appendix

Composite Functional Gradient Learning of Generative Adversarial Models. Appendix A. Main theorem and its proof Appendix Theorem A.1 below, our main theorem, analyzes the extended KL-divergence for some β (0.5, 1] defined as follows: L β (p) := (βp (x) + (1 β)p(x)) ln βp (x) + (1 β)p(x)

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

OPTIMIZATION METHODS IN DEEP LEARNING

OPTIMIZATION METHODS IN DEEP LEARNING Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable

More information

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April

More information

Theory of Deep Learning IIb: Optimization Properties of SGD

Theory of Deep Learning IIb: Optimization Properties of SGD CBMM Memo No. 72 December 27, 217 Theory of Deep Learning IIb: Optimization Properties of SGD by Chiyuan Zhang 1 Qianli Liao 1 Alexander Rakhlin 2 Brando Miranda 1 Noah Golowich 1 Tomaso Poggio 1 1 Center

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Theory of Deep Learning III: explaining the non-overfitting puzzle

Theory of Deep Learning III: explaining the non-overfitting puzzle arxiv:1801.00173v1 [cs.lg] 30 Dec 2017 CBMM Memo No. 073 January 3, 2018 Theory of Deep Learning III: explaining the non-overfitting puzzle by Tomaso Poggio,, Kenji Kawaguchi, Qianli Liao, Brando Miranda,

More information

A Surprising Linear Relationship Predicts Test Performance in Deep Networks

A Surprising Linear Relationship Predicts Test Performance in Deep Networks CBMM Memo No. 91 July 26, 218 arxiv:187.9659v1 [cs.lg] 25 Jul 218 A Surprising Linear Relationship Predicts Test Performance in Deep Networks Qianli Liao 1, Brando Miranda 1, Andrzej Banburski 1, Jack

More information

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in

More information

Computational Learning Theory - Hilary Term : Learning Real-valued Functions

Computational Learning Theory - Hilary Term : Learning Real-valued Functions Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Lecture 4: Curse of Dimensionality, High Dimensional Feature Spaces, Linear Classifiers, Linear Regression, Python, and Jupyter Notebooks Peter Belhumeur Computer Science

More information

Local Affine Approximators for Improving Knowledge Transfer

Local Affine Approximators for Improving Knowledge Transfer Local Affine Approximators for Improving Knowledge Transfer Suraj Srinivas & François Fleuret Idiap Research Institute and EPFL {suraj.srinivas, francois.fleuret}@idiap.ch Abstract The Jacobian of a neural

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

A Conservation Law Method in Optimization

A Conservation Law Method in Optimization A Conservation Law Method in Optimization Bin Shi Florida International University Tao Li Florida International University Sundaraja S. Iyengar Florida International University Abstract bshi1@cs.fiu.edu

More information