Everything You Wanted to Know About the Loss Surface but Were Afraid to Ask: The Talk
Suzanna Montgomery
Texas A&M, August 21, 2018

Transcription
Plan
1. Linear models: Baldi-Hornik (1989), Kawaguchi-Lu (2016)
2. One hidden layer: Ge-Lee-Ma (2016), Mei-Montanari-Nguyen (2018), Venturi-Bandeira-Bruna (2018)
3. More than one hidden layer: Gori-Tesi (1992) + Nguyen-Hein (2017)
But First... Turns Out Skip Connections are Good

From Li et al.: "However, this good behavior is not universal; trainability of neural nets is highly dependent on network architecture choices, the design of the optimizer, variable initialization, and a variety of other considerations."

[Figure: the loss surfaces of ResNet-56 (a) without and (b) with skip connections; the vertical axis is logarithmic to show dynamic range, and filter normalization is used so sharpness/flatness can be compared between the two surfaces.]
Figure: From Visualizing the Loss Landscape of Neural Nets by Li, Xu, Taylor, Studer, Goldstein

But First... Turns Out Skip Connections are Good, continued

From the same paper: network depth has a dramatic effect on the loss surfaces of neural networks when skip connections are not used; the shallow ResNet-20-noshort still has a fairly benign landscape dominated by a region with convex contours in the center, and no dramatic non-convexity.

[Figure: (a) the loss surface of ResNet-110-noshort, 110 layers, no skip connections; (b) DenseNet, 121 layers, then the state-of-the-art network for CIFAR.]
Figure: From Visualizing the Loss Landscape of Neural Nets by Li, Xu, Taylor, Studer, Goldstein
Setup for Baldi-Hornik (1989)

Suppose $X = (x_j,\ 1 \le j \le N)$, $Y = (y_j,\ 1 \le j \le N)$, with $x_j \in \mathbb{R}^n$, $y_j \in \mathbb{R}^m$.

Define empirical (co)variances: $\Sigma_{YX} = \sum_j y_j x_j^T$, $\Sigma_{XX} = \sum_j x_j x_j^T$, $\Sigma_{YY} = \sum_j y_j y_j^T$.

Classical least squares regression
$A^* = \operatorname{argmin}_A L(A) = \|AX - Y\|_F^2 = \sum_j \|A x_j - y_j\|^2$
is a convex problem and can be solved by GD or analytically by $A^* = \Sigma_{YX} \Sigma_{XX}^{-1}$.
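The closed-form solution above is easy to sanity-check numerically. A minimal NumPy sketch (the dimensions and variable names are mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 3, 50                # input dim, output dim, number of samples

X = rng.standard_normal((n, N))   # columns are the inputs x_j
Y = rng.standard_normal((m, N))   # columns are the targets y_j

# Empirical (co)variances as defined on the slide.
Sigma_YX = Y @ X.T
Sigma_XX = X @ X.T

# Closed-form minimizer A* = Sigma_YX Sigma_XX^{-1}.
A_star = Sigma_YX @ np.linalg.inv(Sigma_XX)

# Cross-check against the generic least-squares solver:
# min_A ||A X - Y||_F^2 is min over A^T of ||X^T A^T - Y^T||_F^2.
A_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
assert np.allclose(A_star, A_lstsq)

# At the minimizer the gradient 2 (A X - Y) X^T vanishes.
assert np.linalg.norm((A_star @ X - Y) @ X.T) < 1e-8
```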
Setup for Baldi-Hornik (1989), continued

Q. What about rank-constrained least squares:
$A_k = \operatorname{argmin}_A L(A) = \|AX - Y\|_F^2$, subject to $\operatorname{rank}(A) \le k < n$?

A. $A_k$ is the projection of $A^*$ onto its top $k$ principal components.

Q. Note that if $A$ is $n \times m$, then $\operatorname{rank}(A) \le k \iff A = BC$ with $B$ of size $n \times k$ and $C$ of size $k \times m$.

So rank-constrained least squares $\iff$ linear net with one hidden layer and $L^2$ loss:
$(A^*, B^*) = \operatorname{argmin} L(A, B) = \|ABX - Y\|_F^2$,
up to the symmetry $A \mapsto AC$, $B \mapsto C^{-1}B$, $C \in GL_k$.

Key. $L(A, B)$ is not convex, so it is unclear whether it can be solved by GD.
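The answer to the first question can be checked numerically in the simplest case. Taking $X = \mathrm{Id}$ (so $\Sigma_{XX} = \mathrm{Id}$ and $A^* = Y$), the rank-constrained minimizer is the top-$k$ SVD truncation (the Eckart-Young theorem), and no factorization $BC$ does better. A sketch; the setup and names are my own:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
# With X = Id_n the loss is L(A) = ||A - Y||_F^2 and A* = Y, so the
# rank-constrained minimizer is the top-k SVD truncation of A* (Eckart-Young).
Y = rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(Y)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k
# Optimal value: the sum of the squared discarded singular values.
assert np.isclose(np.linalg.norm(A_k - Y) ** 2, np.sum(s[k:] ** 2))

# rank(A) <= k  <=>  A = BC: no product of an n-by-k and a k-by-n
# matrix gets closer to Y than A_k does.
best = np.linalg.norm(A_k - Y)
for _ in range(100):
    B = rng.standard_normal((n, k))
    C = rng.standard_normal((k, n))
    assert np.linalg.norm(B @ C - Y) >= best - 1e-12
```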
Results of Baldi-Hornik (1989)

Suppose $\Sigma_{XX}$ is invertible and $\Sigma = \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$ has full rank.

Thm. All local minima of $L(A, B)$ are global minima. Thus, GD should be fine.

Thm. The saddle points of $L(A, B)$ correspond to $AB = \operatorname{proj}_k(A^*)$, where $\operatorname{proj}_k$ is the projection onto some $k$ principal component directions of $A^*$.

The proofs are by explicit computation.
Proof Idea for Baldi-Hornik (1989)

Write $L(A, B) = \operatorname{vec}(ABX - Y)^T \operatorname{vec}(ABX - Y)$.

Use the identity $\operatorname{vec}(ABC) = (C^T \otimes A)\operatorname{vec}(B)$ to differentiate with respect to $\operatorname{vec}(A)$, $\operatorname{vec}(B)$.

Check that at a critical point $W = (A, B)$ we must have
$P_A \Sigma P_A = P_A \Sigma = \Sigma P_A$,
where $P_A = \operatorname{proj}_{\operatorname{col}(A)}$.
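The vec/Kronecker identity used in the proof can be verified directly in NumPy; note that vec must stack columns, i.e. column-major order:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

# vec() stacks columns (column-major / Fortran order), matching the
# identity vec(ABC) = (C^T kron A) vec(B).
vec = lambda M: M.flatten(order="F")
lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
assert np.allclose(lhs, rhs)
```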
Kawaguchi-Lu (2016)

Want to understand local minima of
$L(W_1, \ldots, W_d) = \|W_d \cdots W_1 X - Y\|_F^2$,
when $W_1, \ldots, W_d$ have fixed shapes.

Thm. All local minima of $L$ are global minima.

Idea of Proof:
1. Every local minimum $\{W_j\}$ can be perturbed to a local minimum $\{\hat W_j\}$ with the same value of the loss and all $\hat W_j$ of full rank.
2. Perturbations of the $\hat W_j$'s give all rank $\le \min\{\text{layer widths}\}$ perturbations of $\prod_j \hat W_j$.
3. Thus $\big(\prod_{j \ne \min} \hat W_j,\ \hat W_{\min}\big)$ is a local minimum of the Baldi-Hornik problem, hence a global minimum.
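The theorem can be illustrated on a tiny instance: plain gradient descent on a three-layer linear net, started near the identity, reaches the global minimum even though the loss is non-convex. A sketch under assumptions of my choosing (X = Id, square layers, a well-conditioned SPD target, so the global minimum value is 0):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lr, steps = 3, 3, 0.02, 8000
# X = Id and square layers: the global minimum of
# L(W_1, ..., W_d) = ||W_d ... W_1 - Y||_F^2 is exactly 0.
M = rng.standard_normal((n, n))
Y = M @ M.T / n + 0.5 * np.eye(n)     # well-conditioned SPD target
Ws = [np.eye(n) + 0.01 * rng.standard_normal((n, n)) for _ in range(d)]

def product(mats):
    # Returns W_d ... W_1 for mats = [W_1, ..., W_d] (empty product = Id).
    P = np.eye(n)
    for W in mats:
        P = W @ P
    return P

for _ in range(steps):
    R = product(Ws) - Y               # residual W_d ... W_1 - Y
    # dL/dW_j = 2 (W_d ... W_{j+1})^T R (W_{j-1} ... W_1)^T
    grads = [2 * product(Ws[j + 1:]).T @ R @ product(Ws[:j]).T
             for j in range(d)]
    for j in range(d):
        Ws[j] = Ws[j] - lr * grads[j]

final_loss = np.linalg.norm(product(Ws) - Y) ** 2
assert final_loss < 1e-8
```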
Loss for One Layer, Gaussian Inputs: Ge-Lee-Ma (2016)

Input/Output Distribution. Inputs $x_j \sim N(0, \mathrm{Id}_n)$ i.i.d. and outputs $y_j$ produced by a net with one hidden layer:
$y_j = a^T \sigma(B x_j) = \sum_{k=1}^n a_k \sigma(\langle b_k, x_j \rangle)$, with $\|b_k\| = \|a\| = 1$.

Thm. If $\|a\| = \|b_k\| = 1$, then the $L^2$ population risk $\mathbb{E}\big[|y(x, a, B) - y(x, a^*, B^*)|^2\big]$ is a sum of rank-$m$ tensor norms:
$\sum_{m \ge 0} \sigma_m^2 \Big\| \sum_{k=1}^n a_k\, b_k^{\otimes m} - \sum_{k=1}^n a_k^*\, (b_k^*)^{\otimes m} \Big\|_F^2$,
where the $\sigma_m$ are the Hermite coefficients of $\sigma$: $\sigma(t) = \sum_{m \ge 0} \frac{\sigma_m}{\sqrt{m!}} H_m(t)$.
Proof Ingredients for Ge-Lee-Ma (2016)

1. Orthogonality: $\int_{\mathbb{R}} H_n(x) H_m(x)\, e^{-x^2/2}\, \frac{dx}{\sqrt{2\pi}} = \delta_{n=m}\, n!$.

2. Sum-product formula for Hermite polynomials:
$H_n(x \cdot y) = \sum_{|p| = n} \binom{n}{p} \prod_{j=1}^d H_{p_j}(x_j)\, y_j^{p_j}$, for $x, y \in \mathbb{R}^d$, $\|y\| = 1$,
where the sum is over all multi-indices $p$.

3. Wick formula:
$\mathbb{E}\left[ \frac{H_k(v^T x)}{\sqrt{k!}} \cdot \frac{H_j(w^T x)}{\sqrt{j!}} \right] = \delta_{j=k}\, \langle v, w \rangle^k$.
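Ingredient 1 can be checked with NumPy's probabilists' Hermite module (`numpy.polynomial.hermite_e`), whose Gauss quadrature uses exactly the weight $e^{-x^2/2}$; the quadrature degree below is my choice:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

# Gauss-HermiteE quadrature: sum_i w_i f(x_i) ~ integral f(x) e^{-x^2/2} dx,
# exact for polynomials of degree < 2 * 30.
x, w = He.hermegauss(30)

def inner(n, m):
    # (1/sqrt(2 pi)) * integral He_n(x) He_m(x) e^{-x^2/2} dx
    en = np.zeros(n + 1); en[n] = 1.0    # coefficient vector picking out He_n
    em = np.zeros(m + 1); em[m] = 1.0
    vals = He.hermeval(x, en) * He.hermeval(x, em)
    return np.sum(w * vals) / np.sqrt(2 * np.pi)

for n in range(6):
    for m in range(6):
        expected = math.factorial(n) if n == m else 0.0
        assert abs(inner(n, m) - expected) < 1e-8
```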
SGD for One Layer: Mei-Montanari-Nguyen (2018)

Input/Output Distribution. Inputs $x_j \sim P$ i.i.d. and outputs $y_j$ produced by a net with one hidden layer:
$y_j = \frac{1}{N} \sum_{i=1}^N \sigma_*(x_j; \theta_i) = \frac{1}{N} \sum_{i=1}^N a_i\, \sigma(\langle x_j, w_i \rangle + b_i)$.

The $L^2$ loss $R_N(\theta) = \frac{1}{2}\mathbb{E}\big[|y - \hat y|^2\big]$ is a pair interaction in a potential:
$R(\rho) := \int V(\theta)\, d\rho^{(N)}(\theta) + \frac{1}{2} \iint U(\theta_1, \theta_2)\, d\rho^{(N)}(\theta_1)\, d\rho^{(N)}(\theta_2)$,
plus a constant, where $\rho^{(N)} = \frac{1}{N}\sum_{i=1}^N \delta_{\theta_i}$ and
$V(\theta) = -\mathbb{E}\left[y\, \sigma_*(x; \theta)\right]$, $\quad U(\theta_1, \theta_2) = \mathbb{E}\left[\sigma_*(x; \theta_1)\, \sigma_*(x; \theta_2)\right]$.

Q. What does SGD look like in $\rho$ space?
SGD for One Layer: Mei-Montanari-Nguyen (2018), continued

Thm. Consider step sizes $s_k = \epsilon\, \xi(k\epsilon)$ and write $\rho^{(N)}_k$ for the empirical measure after $k$ GD steps. Then $\rho^{(N)}_{t/\epsilon} \Rightarrow \rho_t$ as $N \to \infty$ and $\epsilon \to 0$, where $\rho_t$ evolves under gradient flow for the Wasserstein metric:
$\partial_t \rho_t = \xi(t)\, \nabla_\theta \cdot \big( \rho_t\, \nabla_\theta \Psi(\theta, \rho_t) \big)$, $\qquad \Psi(\theta, \rho) := V(\theta) + \int U(\theta, \theta')\, d\rho(\theta')$.
In particular, we have a good approximation when $k = t/\epsilon$ with $\epsilon,\, N^{-1} \ll 1/\text{input-dim}$.
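The finite-$N$ particle view makes the theorem concrete: gradient descent on the particles $\theta_i$ is a discretization of this flow, and the energy $R(\rho)$ decreases along it. A toy scalar sketch with $V$ and $U$ of my own choosing ($V(t) = t^2/2 - t$ and $U(s,t) = st$, not the potentials of any particular data distribution):

```python
import numpy as np

# Toy mean-field energy: R(rho) = int V d rho + (1/2) int int U d rho d rho
# with V(t) = t^2/2 - t and U(s, t) = s * t (illustrative choices).
N, lr, steps = 50, 0.05, 200
rng = np.random.default_rng(4)
theta = rng.standard_normal(N)          # particles = hidden units

def energy(theta):
    V = 0.5 * theta ** 2 - theta
    return V.mean() + 0.5 * theta.mean() ** 2   # (1/2) mean over pairs of U

history = [energy(theta)]
for _ in range(steps):
    # grad_theta Psi(theta_i, rho) = V'(theta_i) + int dU/d theta_1 d rho
    grad = (theta - 1.0) + theta.mean()
    theta = theta - lr * grad
    history.append(energy(theta))

# The energy decreases monotonically toward its minimum -1/4
# (all particles at theta = 1/2).
assert all(b <= a + 1e-12 for a, b in zip(history, history[1:]))
assert abs(history[-1] + 0.25) < 1e-6
```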
$L^2$ Loss in Deep and Wide Networks: Nguyen-Hein (2017)

$\sigma$ smooth, analytic, bounded
$N$ distinct training samples with $L^2$ loss

Thm. Fix a critical point. Suppose
1. some layer $k$ has $n_k \ge N - 1$ neurons, and the Hessian of the loss restricted to the parameters in layers $\ge k + 1$ is non-degenerate;
2. the weight matrices in layers $\ge k + 1$ all have full column rank.
Then this critical point is a global minimum.

Rmk. The second condition requires a pyramidal structure after layer $k$.
Proof Idea of Nguyen-Hein (2017)

Thm (Gori-Tesi 1992). If the inputs are linearly independent and the layer widths are monotonically decreasing, then any critical point at which the weight matrices have full rank is a global minimum.

Idea. Backprop relates the derivative of the loss with respect to the activations at successive layers:
$\frac{\partial L}{\partial \mathrm{Act}^{(j)}} = \frac{\partial L}{\partial \mathrm{Act}^{(j+1)}}\, \underbrace{D^{(j)} W^{(j)}}_{Z^{(j)}}$,
where $D^{(j)} = \mathrm{Diag}\big( \varphi'\big(\mathrm{preact}^{(j)}_\beta\big),\ \beta = 1, \ldots, n_j \big)$.

If $\varphi' \ne 0$ and $W^{(j)}$ has full rank, then one can do "forward prop":
$\frac{\partial L}{\partial \mathrm{Act}^{(j+1)}} = \frac{\partial L}{\partial \mathrm{Act}^{(j)}}\, Z^{(j)T} \left[ Z^{(j)} Z^{(j)T} \right]^{-1}$.

Now just transplant this condition to some intermediate layer.
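The linear-algebra step here is that when $Z^{(j)}$ has full row rank, the backprop relation can be inverted with a right pseudoinverse. A quick numerical check (the shapes, chosen to mimic decreasing widths, and the names are mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n_next, n_prev = 3, 5                        # widths n_{j+1} <= n_j (pyramidal)
Z = rng.standard_normal((n_next, n_prev))    # Z^(j) = D^(j) W^(j), full row rank a.s.

g_next = rng.standard_normal((1, n_next))    # dL/dAct^(j+1)
g_prev = g_next @ Z                          # backprop: dL/dAct^(j) = dL/dAct^(j+1) Z

# Because Z has full row rank, the step can be inverted ("forward prop"):
g_rec = g_prev @ Z.T @ np.linalg.inv(Z @ Z.T)
assert np.allclose(g_rec, g_next)
```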
Finite Intrinsic Dimension: Venturi-Bandeira-Bruna (2018)

Consider a one-hidden-layer network with no bias, $L^2$ loss, and hidden layer of width $p$:
$\Phi(x, \theta) = U \rho(Wx)$, $\qquad L(\theta) := \mathbb{E}\big[ |\Phi(X, \theta) - Y|^2 \big]$.

Consider the span of possible features:
$V := \mathrm{Span}\{\psi_w,\ w \in \mathbb{R}^n\}$, $\qquad \psi_w := \rho(\langle w, \cdot \rangle)$.

Thm. If $p \ge \dim(V)$, then there are no bad local minima. If $p \ge 2\dim(V)$, then all local (= global) minima are connected.

Ex. $\rho$ = polynomial, linear.
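For instance, for the quadratic activation $\rho(t) = t^2$ on $\mathbb{R}^n$, every feature $\psi_w(x) = \langle w, x \rangle^2 = x^T w w^T x$ is a quadratic form, so $\dim(V) = n(n+1)/2$. This is easy to confirm numerically by computing the rank of a feature matrix (the sampling setup is mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
rho = lambda t: t ** 2      # quadratic activation: dim(V) should be n(n+1)/2 = 6

# Evaluate many random features psi_w = rho(<w, .>) on many random inputs;
# the rank of this matrix is the dimension of V = span{psi_w}.
W = rng.standard_normal((40, n))    # rows are weight vectors w
X = rng.standard_normal((40, n))    # rows are inputs x
F = rho(W @ X.T)                    # F[i, j] = psi_{w_i}(x_j)
dim_V = np.linalg.matrix_rank(F)
assert dim_V == n * (n + 1) // 2
```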
Setup for Proof in Venturi-Bandeira-Bruna (2018)

The output satisfies $\Phi(x, \theta) \in V$.

$V$ is a reproducing kernel Hilbert space (RKHS):
Choose a basis $\{\psi_{w_j}\}$ with associated orthonormal basis $\{\bar\psi_{w_j}\}_j$.
Define the reproducing kernel $K_V(x, y) := \sum_j \bar\psi_{w_j}(x)\, \bar\psi_{w_j}(y)$.
Then $\varphi(x) := K_V(x, \cdot) \in V$ is the kernel of evaluation at $x$.
Then $\psi_w(x) = \langle \psi_w, \varphi(x) \rangle$.
Obtain $\Phi(x, \theta) = \langle U \psi(W), \varphi(x) \rangle$.
Linear Algebra [] 4.2 The Dot Product and Projections. In R 3 the dot product is defined by u v = u v cos θ. 2. For u = (x, y, z) and v = (x2, y2, z2), we have u v = xx2 + yy2 + zz2. 3. cos θ = u v u v,
More informationLinear Heteroencoders
Gatsby Computational Neuroscience Unit 17 Queen Square, London University College London WC1N 3AR, United Kingdom http://www.gatsby.ucl.ac.uk +44 20 7679 1176 Funded in part by the Gatsby Charitable Foundation.
More informationA Quick Tour of Linear Algebra and Optimization for Machine Learning
A Quick Tour of Linear Algebra and Optimization for Machine Learning Masoud Farivar January 8, 2015 1 / 28 Outline of Part I: Review of Basic Linear Algebra Matrices and Vectors Matrix Multiplication Operators
More informationNeural networks COMS 4771
Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a
More informationKernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444
Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationThe Loss Surface of Deep and Wide Neural Networks
Quynh Nguyen 1 Matthias Hein 1 Abstract While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible
More informationTranspose & Dot Product
Transpose & Dot Product Def: The transpose of an m n matrix A is the n m matrix A T whose columns are the rows of A. So: The columns of A T are the rows of A. The rows of A T are the columns of A. Example:
More informationChapter 4 Euclid Space
Chapter 4 Euclid Space Inner Product Spaces Definition.. Let V be a real vector space over IR. A real inner product on V is a real valued function on V V, denoted by (, ), which satisfies () (x, y) = (y,
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationNon-Convex Optimization in Machine Learning. Jan Mrkos AIC
Non-Convex Optimization in Machine Learning Jan Mrkos AIC The Plan 1. Introduction 2. Non convexity 3. (Some) optimization approaches 4. Speed and stuff? Neural net universal approximation Theorem (1989):
More informationOslo Class 2 Tikhonov regularization and kernels
RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationMATH 2331 Linear Algebra. Section 2.1 Matrix Operations. Definition: A : m n, B : n p. Example: Compute AB, if possible.
MATH 2331 Linear Algebra Section 2.1 Matrix Operations Definition: A : m n, B : n p ( 1 2 p ) ( 1 2 p ) AB = A b b b = Ab Ab Ab Example: Compute AB, if possible. 1 Row-column rule: i-j-th entry of AB:
More informationFinal Review Written by Victoria Kala SH 6432u Office Hours R 12:30 1:30pm Last Updated 11/30/2015
Final Review Written by Victoria Kala vtkala@mathucsbedu SH 6432u Office Hours R 12:30 1:30pm Last Updated 11/30/2015 Summary This review contains notes on sections 44 47, 51 53, 61, 62, 65 For your final,
More informationOR MSc Maths Revision Course
OR MSc Maths Revision Course Tom Byrne School of Mathematics University of Edinburgh t.m.byrne@sms.ed.ac.uk 15 September 2017 General Information Today JCMB Lecture Theatre A, 09:30-12:30 Mathematics revision
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationKernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kernel Bayes Rule: Nonparametric Bayesian inference with kernels Kenji Fukumizu The Institute of Statistical Mathematics NIPS 2012 Workshop Confluence between Kernel Methods and Graphical Models December
More informationBayesian Interpretations of Regularization
Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S
More informationTranspose & Dot Product
Transpose & Dot Product Def: The transpose of an m n matrix A is the n m matrix A T whose columns are the rows of A. So: The columns of A T are the rows of A. The rows of A T are the columns of A. Example:
More informationHow to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India
How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization
More informationKernels A Machine Learning Overview
Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter
More informationElectromagnetism HW 1 math review
Electromagnetism HW math review Problems -5 due Mon 7th Sep, 6- due Mon 4th Sep Exercise. The Levi-Civita symbol, ɛ ijk, also known as the completely antisymmetric rank-3 tensor, has the following properties:
More informationConvex Optimization and Modeling
Convex Optimization and Modeling Introduction and a quick repetition of analysis/linear algebra First lecture, 12.04.2010 Jun.-Prof. Matthias Hein Organization of the lecture Advanced course, 2+2 hours,
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via nonconvex optimization Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationConvolutional Neural Network Architecture
Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationBoolean Autoencoders and Hypercube Clustering Complexity
Designs, Codes and Cryptography manuscript No. (will be inserted by the editor) Boolean Autoencoders and Hypercube Clustering Complexity P. Baldi Received: 1 November 2010 / Revised: 5 June 2012 / Accepted:
More information2018 Fall 2210Q Section 013 Midterm Exam II Solution
08 Fall 0Q Section 0 Midterm Exam II Solution True or False questions points 0 0 points) ) Let A be an n n matrix. If the equation Ax b has at least one solution for each b R n, then the solution is unique
More information1. Mathematical Foundations of Machine Learning
1. Mathematical Foundations of Machine Learning 1.1 Basic Concepts Definition of Learning Definition [Mitchell (1997)] A computer program is said to learn from experience E with respect to some class of
More informationDirect Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so
More informationRANDOM FIELDS AND GEOMETRY. Robert Adler and Jonathan Taylor
RANDOM FIELDS AND GEOMETRY from the book of the same name by Robert Adler and Jonathan Taylor IE&M, Technion, Israel, Statistics, Stanford, US. ie.technion.ac.il/adler.phtml www-stat.stanford.edu/ jtaylor
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationNormalization Techniques
Normalization Techniques Devansh Arpit Normalization Techniques 1 / 39 Table of Contents 1 Introduction 2 Motivation 3 Batch Normalization 4 Normalization Propagation 5 Weight Normalization 6 Layer Normalization
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More information