Everything You Wanted to Know About the Loss Surface but Were Afraid to Ask: The Talk
Suzanna Montgomery
Texas A&M, August 21, 2018

Transcription
Plan
1. Linear models: Baldi-Hornik (1989), Kawaguchi-Lu (2016)
2. One hidden layer: Ge-Lee-Ma (2016), Mei-Montanari-Nguyen (2018), Venturi-Bandeira-Bruna (2018)
3. More than one hidden layer: Gori-Tesi (1992) + Nguyen-Hein (2017)
But First... Turns Out Skip Connections are Good

From Li et al.: "However, this good behavior is not universal; trainability of neural nets is highly dependent on network architecture choices, the design of the optimizer, variable initialization, and a variety of other considerations."

[Figure: the loss surfaces of ResNet-56 (a) without and (b) with skip connections; the vertical axis is logarithmic to show dynamic range, and filter normalization is used so sharpness/flatness can be compared between the two surfaces.]
Figure: From Visualizing the Loss Landscape of Neural Nets by Li, Xu, Taylor, Studer, Goldstein

But First... Turns Out Skip Connections are Good, continued

From the same paper: network depth has a dramatic effect on the loss surfaces of neural networks when skip connections are not used; the shallow ResNet-20-noshort still has a fairly benign landscape dominated by a region with convex contours in the center, and no dramatic non-convexity.

[Figure: (a) the loss surface of ResNet-110-noshort, 110 layers, no skip connections; (b) DenseNet, 121 layers, then the state-of-the-art network for CIFAR.]
Figure: From Visualizing the Loss Landscape of Neural Nets by Li, Xu, Taylor, Studer, Goldstein
Setup for Baldi-Hornik (1989)

Suppose $X = (x_j,\ 1 \le j \le N)$, $Y = (y_j,\ 1 \le j \le N)$, with $x_j \in \mathbb{R}^n$, $y_j \in \mathbb{R}^m$.

Define empirical (co)variances: $\Sigma_{YX} = \sum_j y_j x_j^T$, $\Sigma_{XX} = \sum_j x_j x_j^T$, $\Sigma_{YY} = \sum_j y_j y_j^T$.

Classical least squares regression
$A^* = \operatorname{argmin}_A L(A) = \|AX - Y\|_F^2 = \sum_j \|A x_j - y_j\|^2$
is a convex problem and can be solved by GD or analytically by $A^* = \Sigma_{YX} \Sigma_{XX}^{-1}$.
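The closed-form solution above is easy to sanity-check numerically. A minimal NumPy sketch (the dimensions and variable names are mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 3, 50                # input dim, output dim, number of samples

X = rng.standard_normal((n, N))   # columns are the inputs x_j
Y = rng.standard_normal((m, N))   # columns are the targets y_j

# Empirical (co)variances as defined on the slide.
Sigma_YX = Y @ X.T
Sigma_XX = X @ X.T

# Closed-form minimizer A* = Sigma_YX Sigma_XX^{-1}.
A_star = Sigma_YX @ np.linalg.inv(Sigma_XX)

# Cross-check against the generic least-squares solver:
# min_A ||A X - Y||_F^2 is min over A^T of ||X^T A^T - Y^T||_F^2.
A_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
assert np.allclose(A_star, A_lstsq)

# At the minimizer the gradient 2 (A X - Y) X^T vanishes.
assert np.linalg.norm((A_star @ X - Y) @ X.T) < 1e-8
```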
Setup for Baldi-Hornik (1989), continued

Q. What about rank-constrained least squares:
$A_k = \operatorname{argmin}_A L(A) = \|AX - Y\|_F^2$, subject to $\operatorname{rank}(A) \le k < n$?

A. $A_k$ is the projection of $A^*$ onto its top $k$ principal components.

Q. Note that if $A$ is $n \times m$, then $\operatorname{rank}(A) \le k \iff A = BC$ with $B$ of size $n \times k$ and $C$ of size $k \times m$.

So rank-constrained least squares $\iff$ linear net with one hidden layer and $L^2$ loss:
$(A^*, B^*) = \operatorname{argmin} L(A, B) = \|ABX - Y\|_F^2$,
up to the symmetry $A \mapsto AC$, $B \mapsto C^{-1}B$, $C \in GL_k$.

Key. $L(A, B)$ is not convex, so it is unclear whether it can be solved by GD.
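The answer to the first question can be checked numerically in the simplest case. Taking $X = \mathrm{Id}$ (so $\Sigma_{XX} = \mathrm{Id}$ and $A^* = Y$), the rank-constrained minimizer is the top-$k$ SVD truncation (the Eckart-Young theorem), and no factorization $BC$ does better. A sketch; the setup and names are my own:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
# With X = Id_n the loss is L(A) = ||A - Y||_F^2 and A* = Y, so the
# rank-constrained minimizer is the top-k SVD truncation of A* (Eckart-Young).
Y = rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(Y)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k
# Optimal value: the sum of the squared discarded singular values.
assert np.isclose(np.linalg.norm(A_k - Y) ** 2, np.sum(s[k:] ** 2))

# rank(A) <= k  <=>  A = BC: no product of an n-by-k and a k-by-n
# matrix gets closer to Y than A_k does.
best = np.linalg.norm(A_k - Y)
for _ in range(100):
    B = rng.standard_normal((n, k))
    C = rng.standard_normal((k, n))
    assert np.linalg.norm(B @ C - Y) >= best - 1e-12
```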
Results of Baldi-Hornik (1989)

Suppose $\Sigma_{XX}$ is invertible and $\Sigma = \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$ has full rank.

Thm. All local minima of $L(A, B)$ are global minima. Thus, GD should be fine.

Thm. The saddle points of $L(A, B)$ correspond to $AB = \operatorname{proj}_k(A^*)$, where $\operatorname{proj}_k$ is the projection onto some $k$ principal component directions of $A^*$.

The proofs are by explicit computation.
Proof Idea for Baldi-Hornik (1989)

Write $L(A, B) = \operatorname{vec}(ABX - Y)^T \operatorname{vec}(ABX - Y)$.

Use the identity $\operatorname{vec}(ABC) = (C^T \otimes A)\operatorname{vec}(B)$ to differentiate with respect to $\operatorname{vec}(A)$, $\operatorname{vec}(B)$.

Check that at a critical point $W = (A, B)$ we must have
$P_A \Sigma P_A = P_A \Sigma = \Sigma P_A$,
where $P_A = \operatorname{proj}_{\operatorname{col}(A)}$.
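The vec/Kronecker identity used in the proof can be verified directly in NumPy; note that vec must stack columns, i.e. column-major order:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

# vec() stacks columns (column-major / Fortran order), matching the
# identity vec(ABC) = (C^T kron A) vec(B).
vec = lambda M: M.flatten(order="F")
lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
assert np.allclose(lhs, rhs)
```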
Kawaguchi-Lu (2016)

Want to understand local minima of
$L(W_1, \ldots, W_d) = \|W_d \cdots W_1 X - Y\|_F^2$,
when $W_1, \ldots, W_d$ have fixed shapes.

Thm. All local minima of $L$ are global minima.

Idea of Proof:
1. Every local minimum $\{W_j\}$ can be perturbed to a local minimum $\{\hat W_j\}$ with the same value of the loss and all $\hat W_j$ of full rank.
2. Perturbations of the $\hat W_j$'s give all rank $\le \min\{\text{layer widths}\}$ perturbations of $\prod_j \hat W_j$.
3. Thus $\big(\prod_{j \ne \min} \hat W_j,\ \hat W_{\min}\big)$ is a local minimum of the Baldi-Hornik problem, hence a global minimum.
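The theorem can be illustrated on a tiny instance: plain gradient descent on a three-layer linear net, started near the identity, reaches the global minimum even though the loss is non-convex. A sketch under assumptions of my choosing (X = Id, square layers, a well-conditioned SPD target, so the global minimum value is 0):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lr, steps = 3, 3, 0.02, 8000
# X = Id and square layers: the global minimum of
# L(W_1, ..., W_d) = ||W_d ... W_1 - Y||_F^2 is exactly 0.
M = rng.standard_normal((n, n))
Y = M @ M.T / n + 0.5 * np.eye(n)     # well-conditioned SPD target
Ws = [np.eye(n) + 0.01 * rng.standard_normal((n, n)) for _ in range(d)]

def product(mats):
    # Returns W_d ... W_1 for mats = [W_1, ..., W_d] (empty product = Id).
    P = np.eye(n)
    for W in mats:
        P = W @ P
    return P

for _ in range(steps):
    R = product(Ws) - Y               # residual W_d ... W_1 - Y
    # dL/dW_j = 2 (W_d ... W_{j+1})^T R (W_{j-1} ... W_1)^T
    grads = [2 * product(Ws[j + 1:]).T @ R @ product(Ws[:j]).T
             for j in range(d)]
    for j in range(d):
        Ws[j] = Ws[j] - lr * grads[j]

final_loss = np.linalg.norm(product(Ws) - Y) ** 2
assert final_loss < 1e-8
```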
Loss for One Layer, Gaussian Inputs: Ge-Lee-Ma (2016)

Input/Output Distribution. Inputs $x_j \sim N(0, \mathrm{Id}_n)$ i.i.d. and outputs $y_j$ produced by a net with one hidden layer:
$y_j = a^T \sigma(B x_j) = \sum_{k=1}^n a_k \sigma(\langle b_k, x_j \rangle)$, with $\|b_k\| = \|a\| = 1$.

Thm. If $\|a\| = \|b_k\| = 1$, then the $L^2$ population risk $\mathbb{E}\big[|y(x, a, B) - y(x, a^*, B^*)|^2\big]$ is a sum of rank-$m$ tensor norms:
$\sum_{m \ge 0} \sigma_m^2 \Big\| \sum_{k=1}^n a_k\, b_k^{\otimes m} - \sum_{k=1}^n a_k^*\, (b_k^*)^{\otimes m} \Big\|_F^2$,
where the $\sigma_m$ are the Hermite coefficients of $\sigma$: $\sigma(t) = \sum_{m \ge 0} \frac{\sigma_m}{\sqrt{m!}} H_m(t)$.
Proof Ingredients for Ge-Lee-Ma (2016)

1. Orthogonality: $\int_{\mathbb{R}} H_n(x) H_m(x)\, e^{-x^2/2}\, \frac{dx}{\sqrt{2\pi}} = \delta_{n=m}\, n!$.

2. Sum-product formula for Hermite polynomials:
$H_n(x \cdot y) = \sum_{|p| = n} \binom{n}{p} \prod_{j=1}^d H_{p_j}(x_j)\, y_j^{p_j}$, for $x, y \in \mathbb{R}^d$, $\|y\| = 1$,
where the sum is over all multi-indices $p$.

3. Wick formula:
$\mathbb{E}\left[ \frac{H_k(v^T x)}{\sqrt{k!}} \cdot \frac{H_j(w^T x)}{\sqrt{j!}} \right] = \delta_{j=k}\, \langle v, w \rangle^k$.
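Ingredient 1 can be checked with NumPy's probabilists' Hermite module (`numpy.polynomial.hermite_e`), whose Gauss quadrature uses exactly the weight $e^{-x^2/2}$; the quadrature degree below is my choice:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

# Gauss-HermiteE quadrature: sum_i w_i f(x_i) ~ integral f(x) e^{-x^2/2} dx,
# exact for polynomials of degree < 2 * 30.
x, w = He.hermegauss(30)

def inner(n, m):
    # (1/sqrt(2 pi)) * integral He_n(x) He_m(x) e^{-x^2/2} dx
    en = np.zeros(n + 1); en[n] = 1.0    # coefficient vector picking out He_n
    em = np.zeros(m + 1); em[m] = 1.0
    vals = He.hermeval(x, en) * He.hermeval(x, em)
    return np.sum(w * vals) / np.sqrt(2 * np.pi)

for n in range(6):
    for m in range(6):
        expected = math.factorial(n) if n == m else 0.0
        assert abs(inner(n, m) - expected) < 1e-8
```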
SGD for One Layer: Mei-Montanari-Nguyen (2018)

Input/Output Distribution. Inputs $x_j \sim P$ i.i.d. and outputs $y_j$ produced by a net with one hidden layer:
$y_j = \frac{1}{N} \sum_{i=1}^N \sigma_*(x_j; \theta_i) = \frac{1}{N} \sum_{i=1}^N a_i\, \sigma(\langle x_j, w_i \rangle + b_i)$.

The $L^2$ loss $R_N(\theta) = \frac{1}{2}\mathbb{E}\big[|y - \hat y|^2\big]$ is a pair interaction in a potential:
$R(\rho) := \int V(\theta)\, d\rho^{(N)}(\theta) + \frac{1}{2} \iint U(\theta_1, \theta_2)\, d\rho^{(N)}(\theta_1)\, d\rho^{(N)}(\theta_2)$,
plus a constant, where $\rho^{(N)} = \frac{1}{N}\sum_{i=1}^N \delta_{\theta_i}$ and
$V(\theta) = -\mathbb{E}\left[y\, \sigma_*(x; \theta)\right]$, $\quad U(\theta_1, \theta_2) = \mathbb{E}\left[\sigma_*(x; \theta_1)\, \sigma_*(x; \theta_2)\right]$.

Q. What does SGD look like in $\rho$ space?
SGD for One Layer: Mei-Montanari-Nguyen (2018), continued

Thm. Consider step sizes $s_k = \epsilon\, \xi(k\epsilon)$ and write $\rho^{(N)}_k$ for the empirical measure after $k$ GD steps. Then $\rho^{(N)}_{t/\epsilon} \Rightarrow \rho_t$ as $N \to \infty$ and $\epsilon \to 0$, where $\rho_t$ evolves under gradient flow for the Wasserstein metric:
$\partial_t \rho_t = \xi(t)\, \nabla_\theta \cdot \big( \rho_t\, \nabla_\theta \Psi(\theta, \rho_t) \big)$, $\qquad \Psi(\theta, \rho) := V(\theta) + \int U(\theta, \theta')\, d\rho(\theta')$.
In particular, we have a good approximation when $k = t/\epsilon$ with $\epsilon,\, N^{-1} \ll 1/\text{input-dim}$.
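The finite-$N$ particle view makes the theorem concrete: gradient descent on the particles $\theta_i$ is a discretization of this flow, and the energy $R(\rho)$ decreases along it. A toy scalar sketch with $V$ and $U$ of my own choosing ($V(t) = t^2/2 - t$ and $U(s,t) = st$, not the potentials of any particular data distribution):

```python
import numpy as np

# Toy mean-field energy: R(rho) = int V d rho + (1/2) int int U d rho d rho
# with V(t) = t^2/2 - t and U(s, t) = s * t (illustrative choices).
N, lr, steps = 50, 0.05, 200
rng = np.random.default_rng(4)
theta = rng.standard_normal(N)          # particles = hidden units

def energy(theta):
    V = 0.5 * theta ** 2 - theta
    return V.mean() + 0.5 * theta.mean() ** 2   # (1/2) mean over pairs of U

history = [energy(theta)]
for _ in range(steps):
    # grad_theta Psi(theta_i, rho) = V'(theta_i) + int dU/d theta_1 d rho
    grad = (theta - 1.0) + theta.mean()
    theta = theta - lr * grad
    history.append(energy(theta))

# The energy decreases monotonically toward its minimum -1/4
# (all particles at theta = 1/2).
assert all(b <= a + 1e-12 for a, b in zip(history, history[1:]))
assert abs(history[-1] + 0.25) < 1e-6
```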
$L^2$ Loss in Deep and Wide Networks: Nguyen-Hein (2017)

$\sigma$ smooth, analytic, bounded
$N$ distinct training samples with $L^2$ loss

Thm. Fix a critical point. Suppose
1. some layer $k$ has $n_k \ge N - 1$ neurons, and the Hessian of the loss restricted to the parameters in layers $\ge k + 1$ is non-degenerate;
2. the weight matrices in layers $\ge k + 1$ all have full column rank.
Then this critical point is a global minimum.

Rmk. The second condition requires a pyramidal structure after layer $k$.
Proof Idea of Nguyen-Hein (2017)

Thm (Gori-Tesi 1992). If the inputs are linearly independent and the layer widths are monotonically decreasing, then any critical point at which the weight matrices have full rank is a global minimum.

Idea. Backprop relates the derivative of the loss with respect to the activations at successive layers:
$\frac{\partial L}{\partial \mathrm{Act}^{(j)}} = \frac{\partial L}{\partial \mathrm{Act}^{(j+1)}}\, \underbrace{D^{(j)} W^{(j)}}_{Z^{(j)}}$,
where $D^{(j)} = \mathrm{Diag}\big( \varphi'\big(\mathrm{preact}^{(j)}_\beta\big),\ \beta = 1, \ldots, n_j \big)$.

If $\varphi' \ne 0$ and $W^{(j)}$ has full rank, then one can do "forward prop":
$\frac{\partial L}{\partial \mathrm{Act}^{(j+1)}} = \frac{\partial L}{\partial \mathrm{Act}^{(j)}}\, Z^{(j)T} \left[ Z^{(j)} Z^{(j)T} \right]^{-1}$.

Now just transplant this condition to some intermediate layer.
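The linear-algebra step here is that when $Z^{(j)}$ has full row rank, the backprop relation can be inverted with a right pseudoinverse. A quick numerical check (the shapes, chosen to mimic decreasing widths, and the names are mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n_next, n_prev = 3, 5                        # widths n_{j+1} <= n_j (pyramidal)
Z = rng.standard_normal((n_next, n_prev))    # Z^(j) = D^(j) W^(j), full row rank a.s.

g_next = rng.standard_normal((1, n_next))    # dL/dAct^(j+1)
g_prev = g_next @ Z                          # backprop: dL/dAct^(j) = dL/dAct^(j+1) Z

# Because Z has full row rank, the step can be inverted ("forward prop"):
g_rec = g_prev @ Z.T @ np.linalg.inv(Z @ Z.T)
assert np.allclose(g_rec, g_next)
```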
Finite Intrinsic Dimension: Venturi-Bandeira-Bruna (2018)

Consider a one-hidden-layer network with no bias, $L^2$ loss, and hidden layer of width $p$:
$\Phi(x, \theta) = U \rho(Wx)$, $\qquad L(\theta) := \mathbb{E}\big[ |\Phi(X, \theta) - Y|^2 \big]$.

Consider the span of possible features:
$V := \mathrm{Span}\{\psi_w,\ w \in \mathbb{R}^n\}$, $\qquad \psi_w := \rho(\langle w, \cdot \rangle)$.

Thm. If $p \ge \dim(V)$, then there are no bad local minima. If $p \ge 2\dim(V)$, then all local (= global) minima are connected.

Ex. $\rho$ = polynomial, linear.
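For instance, for the quadratic activation $\rho(t) = t^2$ on $\mathbb{R}^n$, every feature $\psi_w(x) = \langle w, x \rangle^2 = x^T w w^T x$ is a quadratic form, so $\dim(V) = n(n+1)/2$. This is easy to confirm numerically by computing the rank of a feature matrix (the sampling setup is mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
rho = lambda t: t ** 2      # quadratic activation: dim(V) should be n(n+1)/2 = 6

# Evaluate many random features psi_w = rho(<w, .>) on many random inputs;
# the rank of this matrix is the dimension of V = span{psi_w}.
W = rng.standard_normal((40, n))    # rows are weight vectors w
X = rng.standard_normal((40, n))    # rows are inputs x
F = rho(W @ X.T)                    # F[i, j] = psi_{w_i}(x_j)
dim_V = np.linalg.matrix_rank(F)
assert dim_V == n * (n + 1) // 2
```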
Setup for Proof in Venturi-Bandeira-Bruna (2018)

The output satisfies $\Phi(x, \theta) \in V$.

$V$ is a reproducing kernel Hilbert space (RKHS):
Choose a basis $\{\psi_{w_j}\}$ with associated orthonormal basis $\{\bar\psi_{w_j}\}_j$.
Define the reproducing kernel $K_V(x, y) := \sum_j \bar\psi_{w_j}(x)\, \bar\psi_{w_j}(y)$.
Then $\varphi(x) := K_V(x, \cdot) \in V$ is the kernel of evaluation at $x$.
Then $\psi_w(x) = \langle \psi_w, \varphi(x) \rangle$.
Obtain $\Phi(x, \theta) = \langle U \psi(W), \varphi(x) \rangle$.
Linear Algebra [] 4.2 The Dot Product and Projections. In R 3 the dot product is defined by u v = u v cos θ. 2. For u = (x, y, z) and v = (x2, y2, z2), we have u v = xx2 + yy2 + zz2. 3. cos θ = u v u v,
More informationLinear Heteroencoders
Gatsby Computational Neuroscience Unit 17 Queen Square, London University College London WC1N 3AR, United Kingdom http://www.gatsby.ucl.ac.uk +44 20 7679 1176 Funded in part by the Gatsby Charitable Foundation.
More informationA Quick Tour of Linear Algebra and Optimization for Machine Learning
A Quick Tour of Linear Algebra and Optimization for Machine Learning Masoud Farivar January 8, 2015 1 / 28 Outline of Part I: Review of Basic Linear Algebra Matrices and Vectors Matrix Multiplication Operators
More informationNeural networks COMS 4771
Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a
More informationKernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444
Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationThe Loss Surface of Deep and Wide Neural Networks
Quynh Nguyen 1 Matthias Hein 1 Abstract While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible
More informationTranspose & Dot Product
Transpose & Dot Product Def: The transpose of an m n matrix A is the n m matrix A T whose columns are the rows of A. So: The columns of A T are the rows of A. The rows of A T are the columns of A. Example:
More informationChapter 4 Euclid Space
Chapter 4 Euclid Space Inner Product Spaces Definition.. Let V be a real vector space over IR. A real inner product on V is a real valued function on V V, denoted by (, ), which satisfies () (x, y) = (y,
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationNon-Convex Optimization in Machine Learning. Jan Mrkos AIC
Non-Convex Optimization in Machine Learning Jan Mrkos AIC The Plan 1. Introduction 2. Non convexity 3. (Some) optimization approaches 4. Speed and stuff? Neural net universal approximation Theorem (1989):
More informationOslo Class 2 Tikhonov regularization and kernels
RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationMATH 2331 Linear Algebra. Section 2.1 Matrix Operations. Definition: A : m n, B : n p. Example: Compute AB, if possible.
MATH 2331 Linear Algebra Section 2.1 Matrix Operations Definition: A : m n, B : n p ( 1 2 p ) ( 1 2 p ) AB = A b b b = Ab Ab Ab Example: Compute AB, if possible. 1 Row-column rule: i-j-th entry of AB:
More informationFinal Review Written by Victoria Kala SH 6432u Office Hours R 12:30 1:30pm Last Updated 11/30/2015
Final Review Written by Victoria Kala vtkala@mathucsbedu SH 6432u Office Hours R 12:30 1:30pm Last Updated 11/30/2015 Summary This review contains notes on sections 44 47, 51 53, 61, 62, 65 For your final,
More informationOR MSc Maths Revision Course
OR MSc Maths Revision Course Tom Byrne School of Mathematics University of Edinburgh t.m.byrne@sms.ed.ac.uk 15 September 2017 General Information Today JCMB Lecture Theatre A, 09:30-12:30 Mathematics revision
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationKernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kernel Bayes Rule: Nonparametric Bayesian inference with kernels Kenji Fukumizu The Institute of Statistical Mathematics NIPS 2012 Workshop Confluence between Kernel Methods and Graphical Models December
More informationBayesian Interpretations of Regularization
Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S
More informationTranspose & Dot Product
Transpose & Dot Product Def: The transpose of an m n matrix A is the n m matrix A T whose columns are the rows of A. So: The columns of A T are the rows of A. The rows of A T are the columns of A. Example:
More informationHow to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India
How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization
More informationKernels A Machine Learning Overview
Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter
More informationElectromagnetism HW 1 math review
Electromagnetism HW math review Problems -5 due Mon 7th Sep, 6- due Mon 4th Sep Exercise. The Levi-Civita symbol, ɛ ijk, also known as the completely antisymmetric rank-3 tensor, has the following properties:
More informationConvex Optimization and Modeling
Convex Optimization and Modeling Introduction and a quick repetition of analysis/linear algebra First lecture, 12.04.2010 Jun.-Prof. Matthias Hein Organization of the lecture Advanced course, 2+2 hours,
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via nonconvex optimization Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationConvolutional Neural Network Architecture
Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationBoolean Autoencoders and Hypercube Clustering Complexity
Designs, Codes and Cryptography manuscript No. (will be inserted by the editor) Boolean Autoencoders and Hypercube Clustering Complexity P. Baldi Received: 1 November 2010 / Revised: 5 June 2012 / Accepted:
More information2018 Fall 2210Q Section 013 Midterm Exam II Solution
08 Fall 0Q Section 0 Midterm Exam II Solution True or False questions points 0 0 points) ) Let A be an n n matrix. If the equation Ax b has at least one solution for each b R n, then the solution is unique
More information1. Mathematical Foundations of Machine Learning
1. Mathematical Foundations of Machine Learning 1.1 Basic Concepts Definition of Learning Definition [Mitchell (1997)] A computer program is said to learn from experience E with respect to some class of
More informationDirect Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so
More informationRANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor
RANDOM FIELDS AND GEOMETRY from the book of the same name by Robert Adler and Jonathan Taylor IE&M, Technion, Israel, Statistics, Stanford, US. ie.technion.ac.il/adler.phtml www-stat.stanford.edu/ jtaylor
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationNormalization Techniques
Normalization Techniques Devansh Arpit Normalization Techniques 1 / 39 Table of Contents 1 Introduction 2 Motivation 3 Batch Normalization 4 Normalization Propagation 5 Weight Normalization 6 Layer Normalization
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More information