Everything You Wanted to Know About the Loss Surface but Were Afraid to Ask: The Talk


Texas A&M, August 21, 2018

Plan

1. Linear models: Baldi-Hornik (1989), Kawaguchi-Lu (2016)
2. One hidden layer: Ge-Lee-Ma (2016), Mei-Montanari-Nguyen (2018), Venturi-Bandeira-Bruna (2018)
3. More than one hidden layer: Gori-Tesi (1992) + Nguyen-Hein (2017)

But First... Turns Out Skip Connections Are Good

[Figure: the loss surfaces of ResNet-56 (a) without and (b) with skip connections, from "Visualizing the Loss Landscape of Neural Nets" by Li, Xu, Taylor, Studer, Goldstein. The vertical axis is logarithmic to show dynamic range; filter normalization is used so that sharpness/flatness can be compared between the two plots.]

Network depth has a dramatic effect on the loss surface when skip connections are not used.

[Figure: (a) the loss surface of ResNet-110-noshort, 110 layers without skip connections; (b) DenseNet, 121 layers, the current state-of-the-art network for CIFAR. From the same paper by Li, Xu, Taylor, Studer, Goldstein.]

Setup for Baldi-Hornik (1989)

Suppose $X = (x_j,\ 1 \le j \le N)$ and $Y = (y_j,\ 1 \le j \le N)$ with $x_j \in \mathbb{R}^n$, $y_j \in \mathbb{R}^m$.

Define the empirical (co)variances
$$\Sigma_{YX} = \sum_j y_j x_j^T, \qquad \Sigma_{XX} = \sum_j x_j x_j^T, \qquad \Sigma_{YY} = \sum_j y_j y_j^T.$$

Classical least squares regression,
$$A^* = \operatorname*{argmin}_A L(A), \qquad L(A) = \|AX - Y\|_F^2 = \sum_j \|Ax_j - y_j\|^2,$$
is a convex problem and can be solved by GD or analytically by $A^* = \Sigma_{YX}\,\Sigma_{XX}^{-1}$.
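As a sanity check of the closed-form solution, here is a minimal numpy sketch (the synthetic data and all variable names are my own, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 5, 3, 200

# Synthetic data: columns of X are the inputs x_j, columns of Y the outputs y_j.
X = rng.normal(size=(n, N))
Y = rng.normal(size=(m, n)) @ X + 0.1 * rng.normal(size=(m, N))

# Empirical (co)variances as defined above.
Sigma_YX = Y @ X.T
Sigma_XX = X @ X.T

# Closed-form least-squares solution A* = Sigma_YX Sigma_XX^{-1}.
A_star = Sigma_YX @ np.linalg.inv(Sigma_XX)

# Compare with a generic solver for min_A ||A X - Y||_F^2.
A_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

print(np.allclose(A_star, A_lstsq))  # True, up to numerical error
```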

Setup for Baldi-Hornik (1989), continued

Q. What about rank-constrained least squares,
$$A_k^* = \operatorname*{argmin}_{\operatorname{rank}(A) \le k} L(A) = \|AX - Y\|_F^2, \qquad k < n\,?$$

A. $A_k^*$ is the projection of $A^*$ onto its top $k$ principal components.

Q. Note that if $A$ is $n \times m$, then $\operatorname{rank}(A) \le k \iff A = BC$ with $B$ of size $n \times k$ and $C$ of size $k \times m$.

So rank-constrained least squares is exactly a linear net with one hidden layer and $L^2$ loss:
$$(A^*, B^*) = \operatorname*{argmin}_{A,\,B} L(A, B) = \|ABX - Y\|_F^2,$$
unique only up to the symmetry $A \mapsto AC$, $B \mapsto C^{-1}B$ for $C \in GL_k$.

Key. $L(A, B)$ is not convex, so it is unclear whether it can be solved by GD.
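A small numerical illustration of the factorized problem (my own sketch, not from the talk): run gradient descent on $L(A,B) = \|ABX - Y\|_F^2$ from a small random initialization and compare the result with the rank-$k$ solution obtained by projecting $A^*$ onto the top $k$ eigenvectors of $\Sigma = \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$, as in Baldi-Hornik. Step size and iteration count are ad hoc.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N, k = 8, 6, 200, 3

X = rng.normal(size=(n, N))
Y = rng.normal(size=(m, n)) @ X + 0.1 * rng.normal(size=(m, N))

Sigma_XX = X @ X.T
Sigma_YX = Y @ X.T
A_star = Sigma_YX @ np.linalg.inv(Sigma_XX)       # unconstrained optimum

# Rank-k optimum: project A* onto the top-k eigenvectors of
# Sigma = Sigma_YX Sigma_XX^{-1} Sigma_XY.
Sigma = A_star @ Sigma_YX.T
_, eigvecs = np.linalg.eigh(Sigma)
U_k = eigvecs[:, -k:]                             # top-k principal directions
W_proj = U_k @ U_k.T @ A_star

# Gradient descent on the non-convex factorized loss L(A, B) = ||ABX - Y||_F^2.
A = 0.01 * rng.normal(size=(m, k))
B = 0.01 * rng.normal(size=(k, n))
lr = 2e-5
for _ in range(50_000):
    R = A @ B @ X - Y                             # residual
    A -= lr * 2 * R @ X.T @ B.T                   # dL/dA = 2 R X^T B^T
    B -= lr * 2 * A.T @ R @ X.T                   # dL/dB = 2 A^T R X^T

# The two losses should be close (GD may need more steps or a tuned step size).
print(np.linalg.norm(A @ B @ X - Y) ** 2, np.linalg.norm(W_proj @ X - Y) ** 2)
```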

Results of Baldi-Hornik (1989)

Suppose $\Sigma_{XX}$ is invertible and $\Sigma = \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ has full rank.

Thm. All local minima of $L(A, B)$ are global minima. Thus, GD should be fine.

Thm. The saddle points of $L(A, B)$ correspond to $AB = \operatorname{proj}_k(A^*)$, where $\operatorname{proj}_k$ is the projection onto some choice of $k$ principal component directions of $A^*$.

The proofs are by explicit computation.

Proof Idea of Baldi-Hornik (1989)

Write $L(A, B) = \operatorname{vec}(ABX - Y)^T \operatorname{vec}(ABX - Y)$.

Use the identity $\operatorname{vec}(ABC) = (C^T \otimes A)\operatorname{vec}(B)$ to differentiate with respect to $\operatorname{vec}(A)$ and $\operatorname{vec}(B)$.

Check that at a critical point $W = (A, B)$ we must have
$$P_A\, \Sigma\, P_A = P_A\, \Sigma = \Sigma\, P_A, \qquad \text{where } P_A = \operatorname{proj}_{\operatorname{col}(A)}.$$
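The vectorization identity is easy to check numerically; a tiny numpy sketch (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 5))
C = rng.normal(size=(5, 2))

# vec() stacks columns, matching the convention vec(ABC) = (C^T kron A) vec(B).
vec = lambda M: M.reshape(-1, order="F")

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))  # True
```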

Kawaguchi-Lu (2016)

Want to understand the local minima of
$$L(W_1, \ldots, W_d) = \|W_d \cdots W_1 X - Y\|_F^2,$$
when $W_1, \ldots, W_d$ have fixed shapes.

Thm. All local minima of $L$ are global minima.

Idea of Proof:
1. Every local minimum $\{W_j\}$ can be perturbed to a local minimum $\{\widehat{W}_j\}$ with the same value of the loss and all $\widehat{W}_j$ of full rank.
2. Perturbations of the $\widehat{W}_j$'s give all rank-$\le\min\{\text{layer widths}\}$ perturbations of the product $\prod_j \widehat{W}_j$.
3. Thus $\prod_j \widehat{W}_j$ is a local minimum of the Baldi-Hornik problem with $k = \min\{\text{layer widths}\}$, hence a global minimum.
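A quick empirical illustration of the "no bad local minima" claim for deep linear nets (my own sketch and only suggestive: the theorem concerns exact local minima, while this just shows a typical GD run reaching the unconstrained least-squares loss when no layer is narrower than the input/output dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, N, h = 5, 4, 300, 8        # h >= max(n, m), so no rank constraint binds

X = rng.normal(size=(n, N))
Y = rng.normal(size=(m, n)) @ X + 0.1 * rng.normal(size=(m, N))

# Unconstrained least-squares optimum, for reference.
A_star = (Y @ X.T) @ np.linalg.inv(X @ X.T)
best = np.linalg.norm(A_star @ X - Y) ** 2

# Three-factor linear net W3 W2 W1 trained by plain gradient descent.
W1 = rng.normal(size=(h, n)) / np.sqrt(n)
W2 = rng.normal(size=(h, h)) / np.sqrt(h)
W3 = rng.normal(size=(m, h)) / np.sqrt(h)
lr = 1e-5
for _ in range(100_000):
    H1 = W1 @ X
    H2 = W2 @ H1
    R = W3 @ H2 - Y                       # residual
    g3 = 2 * R @ H2.T
    g2 = 2 * W3.T @ R @ H1.T
    g1 = 2 * (W3 @ W2).T @ R @ X.T
    W3, W2, W1 = W3 - lr * g3, W2 - lr * g2, W1 - lr * g1

print(best, np.linalg.norm(W3 @ W2 @ W1 @ X - Y) ** 2)  # typically very close
```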

Loss for One Layer, Gaussian Inputs: Ge-Lee-Ma (2016)

Input/Output Distribution. Inputs $x_j \sim N(0, \mathrm{Id}_n)$ i.i.d., and outputs $y_j$ produced by a net with one hidden layer:
$$y_j = (a^*)^T \sigma(B^* x_j) = \sum_{k=1}^n a_k^*\, \sigma(\langle b_k^*, x_j\rangle), \qquad \|b_k^*\| = \|a^*\| = 1.$$

Thm. If $\|a\| = \|b_k\| = 1$, then the $L^2$ population risk $\mathbb{E}\big[|y(x, a, B) - y(x, a^*, B^*)|^2\big]$ is a sum of rank-$m$ tensor norms:
$$\sum_{m \ge 0} \sigma_m^2 \,\Big\|\sum_{k=1}^n a_k\, b_k^{\otimes m} - \sum_{k=1}^n a_k^*\, (b_k^*)^{\otimes m}\Big\|_F^2,$$
where the $\sigma_m$ are the Hermite coefficients of $\sigma$:
$$\sigma(t) = \sum_{m \ge 0} \frac{\sigma_m}{m!}\, H_m(t).$$
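For concreteness, the Hermite coefficients $\sigma_m$ of a given activation can be computed numerically; a sketch using Gauss-Hermite quadrature for the probabilists' Hermite polynomials (the ReLU choice and all names here are mine, not from the talk):

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_coeffs(sigma, max_m=6, deg=120):
    """sigma_m = E[sigma(Z) H_m(Z)], Z ~ N(0,1), when sigma(t) = sum_m sigma_m/m! H_m(t)."""
    x, w = He.hermegauss(deg)              # nodes/weights for the weight exp(-x^2/2)
    out = []
    for m in range(max_m + 1):
        H_m = He.hermeval(x, [0] * m + [1])
        out.append(np.sum(w * sigma(x) * H_m) / np.sqrt(2 * np.pi))
    return np.array(out)

relu = lambda t: np.maximum(t, 0.0)
print(hermite_coeffs(relu))
# Exact first values for ReLU: sigma_0 = 1/sqrt(2*pi), sigma_1 = 1/2, sigma_2 = 1/sqrt(2*pi).
```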

Proof Ingredients for Ge-Lee-Ma (2016)

1. Orthogonality:
$$\int_{\mathbb{R}} H_n(x)\, H_m(x)\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx = \delta_{n=m}\, n!.$$

2. Sum-product formula for Hermite polynomials:
$$H_n(x \cdot y) = \sum_{|p| = n} \binom{n}{p} \prod_{j=1}^d H_{p_j}(x_j)\, y_j^{p_j}, \qquad x, y \in \mathbb{R}^d,\ \|y\| = 1,$$
where the sum is over all multi-indices $p$.

3. Wick formula:
$$\mathbb{E}\left[\frac{H_k(v^T x)}{k!}\cdot\frac{H_j(w^T x)}{j!}\right] = \delta_{j=k}\, \frac{1}{k!}\, \langle v, w\rangle^k.$$
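A quick Monte Carlo check of the Wick formula for unit vectors $v, w$ (my own sketch; $H_k$ are the probabilists' Hermite polynomials, matching the orthogonality relation above, and the sample size is arbitrary):

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(4)
d, M, k, j = 6, 1_000_000, 3, 3

v = rng.normal(size=d)
v /= np.linalg.norm(v)
u = rng.normal(size=d)
u -= (u @ v) * v
u /= np.linalg.norm(u)
w = 0.8 * v + 0.6 * u                    # unit vector with <v, w> = 0.8

x = rng.normal(size=(M, d))              # x ~ N(0, Id_d)
H_k = He.hermeval(x @ v, [0] * k + [1])  # H_k(v^T x)
H_j = He.hermeval(x @ w, [0] * j + [1])  # H_j(w^T x)

lhs = np.mean(H_k / factorial(k) * H_j / factorial(j))
rhs = (v @ w) ** k / factorial(k)        # prediction for j = k
print(lhs, rhs)                          # agree up to Monte Carlo error
```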

SGD for One Layer: Mei-Montanari-Nguyen (2018)

Input/Output Distribution. Inputs $x_j \sim \mathbb{P}$ i.i.d., outputs $y_j$, and a net with one hidden layer
$$\hat{y} = \frac{1}{N}\sum_{i=1}^N \sigma(x; \theta_i) = \frac{1}{N}\sum_{i=1}^N a_i\, \sigma(\langle x, w_i\rangle + b_i).$$

The $L^2$ loss $R_N(\theta) = \frac{1}{2}\,\mathbb{E}\big[(y - \hat{y})^2\big]$ is, up to an additive constant, a pair-interaction energy in the empirical measure of the parameters:
$$R(\rho) := \int V(\theta)\, d\rho^{(N)}(\theta) + \frac{1}{2}\iint U(\theta_1, \theta_2)\, d\rho^{(N)}(\theta_1)\, d\rho^{(N)}(\theta_2),$$
where $\rho^{(N)} = \frac{1}{N}\sum_{i=1}^N \delta_{\theta_i}$ and
$$V(\theta) = -\mathbb{E}\big[y\, \sigma(x; \theta)\big], \qquad U(\theta_1, \theta_2) = \mathbb{E}\big[\sigma(x; \theta_1)\, \sigma(x; \theta_2)\big].$$

Q. What does SGD look like in $\rho$ space?

SGD for One Layer: Mei-Montanari-Nguyen (2018), continued

Thm. Consider step sizes $s_k = \epsilon\, \xi(k\epsilon)$ and write $\rho^{(N)}_k$ for the empirical measure after $k$ SGD steps. Then $\rho^{(N)}_{t/\epsilon} \Rightarrow \rho_t$ as $N \to \infty$ and $\epsilon \to 0$, where $\rho_t$ evolves under the gradient flow for the Wasserstein metric:
$$\partial_t \rho_t = \xi(t)\, \nabla_\theta \cdot \big(\rho_t\, \nabla_\theta \Psi(\theta, \rho_t)\big), \qquad \Psi(\theta, \rho) := V(\theta) + \int U(\theta, \theta')\, d\rho(\theta').$$
In particular, we have a good approximation when $k = t/\epsilon$ with $\epsilon,\, N^{-1} \ll 1/\text{input-dim}$.
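The picture behind the theorem is that each neuron $\theta_i$ is a particle, coupled to the others only through the shared residual. A toy numpy sketch of one-sample SGD on a wide two-layer net (the data model, tanh activation, and all scalings here are my own choices and do not track the theorem's step-size conventions exactly):

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 10, 500                        # input dim, number of neurons ("particles")
teacher = rng.normal(size=d) / np.sqrt(d)

def data(batch):
    x = rng.normal(size=(batch, d))
    return x, np.tanh(x @ teacher)    # toy target function

# Parameters theta_i = (a_i, w_i); the net is (1/N) sum_i a_i tanh(<x, w_i>).
a = rng.normal(size=N)
W = rng.normal(size=(N, d))

eps = 0.05                            # step size
for _ in range(20_000):
    x, y = data(1)                    # one-sample SGD
    pre = np.tanh(W @ x[0])
    resid = a @ pre / N - y[0]
    # Per-particle updates: each neuron sees only the shared scalar residual
    # (the 1/N from the loss gradient is absorbed into the step size).
    a -= eps * resid * pre
    W -= eps * resid * (a * (1.0 - pre**2))[:, None] * x[0][None, :]

xt, yt = data(2000)
pred = np.tanh(xt @ W.T) @ a / N
print(np.mean((pred - yt) ** 2), np.var(yt))   # SGD error should be much smaller
```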

$L^2$ Loss in Deep and Wide Networks: Nguyen-Hein (2017)

Assume $\sigma$ is smooth, analytic, and bounded, and that the $N$ training samples are distinct; use the $L^2$ loss.

Thm. Fix a critical point. Suppose
1. there is a layer $k$ with $n_k \ge N - 1$ neurons such that the Hessian of the loss restricted to the parameters of layers $k+1$ and above is non-degenerate;
2. the weight matrices of layers $k+1$ and above all have full column rank.
Then this critical point is a global minimum.

Rmk. The second condition requires a pyramidal structure in the layers above $k$.

Proof Idea of Nguyen-Hein (2017)

Thm. (Gori-Tesi 1992) If the inputs are linearly independent and the layer widths are monotonically decreasing, then any critical point at which the weight matrices have full rank is a global minimum.

Idea. Backprop relates the derivative of the loss with respect to the activations at successive layers:
$$\frac{\partial L}{\partial \mathrm{Act}^{(j)}} = \frac{\partial L}{\partial \mathrm{Act}^{(j+1)}}\, \underbrace{D^{(j)}\, W^{(j)}}_{Z^{(j)}}, \qquad D^{(j)} = \mathrm{Diag}\Big(\varphi'\big(\mathrm{preact}^{(j)}_\beta\big),\ \beta = 1, \ldots, n_j\Big).$$

If $\varphi' \ne 0$ and $W^{(j)}$ has full rank, then we can "forward-propagate" the gradient:
$$\frac{\partial L}{\partial \mathrm{Act}^{(j+1)}} = \frac{\partial L}{\partial \mathrm{Act}^{(j)}}\, Z^{(j)T}\big[Z^{(j)} Z^{(j)T}\big]^{-1}.$$

Now just transplant this condition to some intermediate layer.
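The forward-propagation step is pure linear algebra: if $g^{(j)} = g^{(j+1)} Z^{(j)}$ and $Z^{(j)}$ has full row rank, then $g^{(j+1)}$ is recovered by a right inverse. A small numpy check (my own sketch; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n_next, n_prev = 4, 7                        # full row rank of Z needs n_next <= n_prev

g_next = rng.normal(size=(1, n_next))        # dL/dAct^(j+1), as a row vector
Z = rng.normal(size=(n_next, n_prev))        # Z^(j) = D^(j) W^(j), full row rank a.s.

g_prev = g_next @ Z                          # backprop: dL/dAct^(j) = dL/dAct^(j+1) Z^(j)

# Recover the upstream gradient with the right inverse Z^T (Z Z^T)^{-1}.
g_recovered = g_prev @ Z.T @ np.linalg.inv(Z @ Z.T)
print(np.allclose(g_recovered, g_next))      # True
```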

Finite Intrinsic Dimension: Venturi-Bandeira-Bruna (2018)

Consider a one-hidden-layer network with no bias, $L^2$ loss, and hidden layer of width $p$:
$$\Phi(x, \theta) = U\rho(Wx), \qquad L(\theta) := \mathbb{E}\big[\|\Phi(X, \theta) - Y\|^2\big].$$

Consider the span of all possible features:
$$V := \mathrm{Span}\{\psi_w,\ w \in \mathbb{R}^n\}, \qquad \psi_w := \rho(\langle w, \cdot\rangle).$$

Thm. If $p \ge \dim(V)$, then there are no bad local minima. If $p \ge 2\dim(V)$, then all (local = global) minima are connected.

Ex. $\rho$ = polynomial, linear.
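For a polynomial activation, $\dim(V)$ is finite, and it can be estimated numerically as the rank of a feature matrix built from random hidden weights. A sketch with my own choice of activation and sizes: for $\rho(t) = t^2$ in dimension $n$, the features $\rho(\langle w, x\rangle)$ are quadratic forms in $x$, so $\dim(V) = \binom{n+1}{2}$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
rho = lambda t: t ** 2                    # quadratic activation

# Evaluate many random features psi_w at many random points x; the rank of the
# resulting matrix estimates dim V = dim Span{psi_w} (here binom(5, 2) = 10).
W = rng.normal(size=(200, n))             # rows are random hidden weights w
X = rng.normal(size=(n, 500))             # columns are random evaluation points x
features = rho(W @ X)                     # entry (i, j) = psi_{w_i}(x_j)

print(np.linalg.matrix_rank(features))    # 10 for n = 4
```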

Setup for the Proof in Venturi-Bandeira-Bruna (2018)

The output satisfies $\Phi(\cdot, \theta) \in V$, and $V$ is a reproducing kernel Hilbert space (RKHS):
- Choose a basis $\{\psi_{w_j}\}$ and define an inner product on $V$ for which $\{\psi_{w_j}\}_j$ is an ONB.
- Define the reproducing kernel $K_V(x, y) := \sum_j \psi_{w_j}(x)\, \psi_{w_j}(y)$.
- $\varphi(x) := K_V(x, \cdot) \in V$ is the kernel of evaluation at $x$.
- Then $\psi_w(x) = \langle \psi_w, \varphi(x)\rangle$.
- Obtain $\Phi(x, \theta) = \langle U\psi(W), \varphi(x)\rangle$.
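To make the last step explicit (a worked line filling in notation the slides leave implicit: $\psi(W) := (\psi_{w_1}, \ldots, \psi_{w_p})$ for the rows $w_k$ of $W$, and $u_k$ is the $k$-th column of $U$):
$$\Phi(x, \theta) = U\rho(Wx) = \sum_{k=1}^{p} u_k\, \psi_{w_k}(x) = \sum_{k=1}^{p} u_k\, \langle \psi_{w_k}, \varphi(x)\rangle = \big\langle U\psi(W),\, \varphi(x)\big\rangle,$$
so the output depends on the parameters $(U, W)$ only through the element $U\psi(W)$, which lives in a space of dimension at most $\dim(V)$ per output coordinate, and the $L^2$ loss is a convex (quadratic) function of this element.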
