Generalization. A cat that once sat on a hot stove will never again sit on a hot stove or on a cold one either. Mark Twain


1 Generalization

2 Generalization A cat that once sat on a hot stove will never again sit on a hot stove or on a cold one either. Mark Twain

3 Generalization The network input-output mapping is accurate for the training data and for test data never seen before. The network interpolates well.

4 Cause of Overfitting Poor generalization is caused by using a network that is too complex (too many neurons/parameters). To have the best performance we need to find the least complex network that can represent the data (Ockham's Razor).

5 Ockham's Razor Find the simplest model that explains the data.

6 Problem Statement
Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$
Underlying function: $t_q = g(p_q) + \varepsilon_q$
Performance function: $F(x) = E_D = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q)$

7 Poor Generalization
[Figure: overfitted network response, with the interpolation and extrapolation regions marked]

8 Good Generalization
[Figure: well-fitting network response, with the interpolation and extrapolation regions marked]

9 Extrapolation in 2-D

10 Measuring Generalization Test Set Part of the available data is set aside during the training process. After training, the network error on the test set is used as a measure of generalization ability. The test set must never be used in any way to train the network, or even to select one network from a group of candidate networks. The test set must be representative of all situations for which the network will be used.

11 Methods for Improving Generalization Pruning (removing neurons) until the performance is degraded. Growing (adding neurons) until the performance is adequate. Validation methods. Regularization.

12 Early Stopping Break up data into training, validation, and test sets. Use only the training set to compute gradients and determine weight updates. Compute the performance on the validation set at each iteration of training. Stop training when the performance on the validation set goes up for a specified number of iterations. Use the weights which achieved the lowest error on the validation set.
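A minimal NumPy sketch of this procedure is given below. The network itself is abstracted behind `grad` and `error` callables, and the data splits, learning rate and patience value are illustrative placeholders, not part of the original slides.

```python
import numpy as np

def train_with_early_stopping(grad, error, x0, train, val,
                              lr=0.01, patience=10, max_iter=10000):
    """Gradient descent with early stopping on a validation set.

    grad(x, data)  -> gradient of the training error at weights x
    error(x, data) -> scalar error of the network with weights x on a data set
    """
    x = x0.copy()
    best_x, best_val = x.copy(), error(x, val)
    since_best = 0                       # iterations since the validation error last improved
    for _ in range(max_iter):
        x = x - lr * grad(x, train)      # weight update uses only the training set
        val_err = error(x, val)          # monitor performance on the validation set
        if val_err < best_val:
            best_x, best_val, since_best = x.copy(), val_err, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error has gone up for `patience` steps
                break
    return best_x                        # weights with the lowest validation error
```

The test set stays untouched throughout; it is used only once at the end, to report the generalization error.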

13 Early Stopping Example
[Figure: F(x) for the training and validation sets versus iteration, with points a and b marked on each curve]

14 Regularization
Standard performance measure: $F = E_D$
Performance measure with regularization: $F = \beta E_D + \alpha E_W = \beta \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) + \alpha \sum_{i=1}^{n} x_i^2$
Complexity penalty: $E_W = \sum_{i=1}^{n} x_i^2$ (smaller weights mean a smoother function).
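The regularized index is straightforward to evaluate; a small sketch, with the weight vector, targets and outputs as placeholders:

```python
import numpy as np

def regularized_index(x, targets, outputs, alpha, beta):
    """F = beta * E_D + alpha * E_W: sum-of-squares error plus complexity penalty."""
    E_D = np.sum((targets - outputs) ** 2)   # sum of squared errors
    E_W = np.sum(x ** 2)                     # sum of squared weights
    return beta * E_D + alpha * E_W
```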

15 Effect of Weight Changes
[Figure: how the network response changes as individual weights ($w_{1,1}$, $w_{1,2}$, $w_{2,1}$, ...) are varied]

16 Effect of Regularization
[Figure: network fits for several values of the ratio $\alpha/\beta$, from 0 up; larger ratios give smoother responses]

17 Bayes Rule
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
P(A) - Prior probability: what we know about A before B is known.
P(A|B) - Posterior probability: what we know about A after we know the outcome of B.
P(B|A) - Conditional probability (likelihood function): describes our knowledge of the system.
P(B) - Marginal probability: a normalization factor.

18 Example Problem 1% of the population have a certain disease. A test for the disease is 80% accurate in detecting the disease in people who have it. 10% of the time the test yields a false positive. If you have a positive test, what is your probability of having the disease?

19 Bayesian Analysis
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
A - event that you have the disease. B - event that you have a positive test.
$P(A) = 0.01$, $P(B \mid A) = 0.8$
$P(B) = P(B \mid A)P(A) + P(B \mid \sim\!A)P(\sim\!A) = (0.8)(0.01) + (0.1)(0.99) = 0.107$
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{(0.8)(0.01)}{0.107} \approx 0.075$
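The same arithmetic in Python, assuming the 1% prevalence and 10% false-positive rate restored above:

```python
P_A = 0.01                    # prior probability of having the disease
P_B_given_A = 0.80            # test detects the disease 80% of the time
P_B_given_notA = 0.10         # false-positive rate

# marginal probability of a positive test
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)   # = 0.107

# posterior probability of disease given a positive test
P_A_given_B = P_B_given_A * P_A / P_B                  # ~= 0.075
print(P_A_given_B)
```

Even after a positive test, the probability of actually having the disease is only about 7.5%, because the prior probability is so small.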

20 Signal Plus Noise Example
$t = x + \varepsilon$
$f(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\varepsilon^2}{2\sigma^2}\right)$
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\exp\!\left(-\frac{x^2}{2\sigma_x^2}\right)$
$f(t \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(t-x)^2}{2\sigma^2}\right)$
$f(x \mid t) = \frac{f(t \mid x)\,f(x)}{f(t)}$
[Figure: the densities $f(x \mid t)$, $f(t \mid x)$ and $f(x)$]

21 NN Bayesian Framework (MacKay, 1992)
$P(x \mid D, \alpha, \beta, M) = \frac{P(D \mid x, \beta, M)\,P(x \mid \alpha, M)}{P(D \mid \alpha, \beta, M)}$
Posterior = Likelihood x Prior / Normalization (Evidence)
D - data set, M - neural network model, x - vector of network weights.

22 Gaussian Assumptions
Gaussian noise: $P(D \mid x, \beta, M) = \frac{1}{Z_D(\beta)}\exp(-\beta E_D)$, with $Z_D(\beta) = (2\pi\sigma_\varepsilon^2)^{N/2} = (\pi/\beta)^{N/2}$
Gaussian prior: $P(x \mid \alpha, M) = \frac{1}{Z_W(\alpha)}\exp(-\alpha E_W)$, with $Z_W(\alpha) = (2\pi\sigma_w^2)^{n/2} = (\pi/\alpha)^{n/2}$
Posterior: $P(x \mid D, \alpha, \beta, M) = \frac{\frac{1}{Z_W(\alpha) Z_D(\beta)}\exp\!\left(-(\beta E_D + \alpha E_W)\right)}{\text{Normalization factor}} = \frac{\exp(-F(x))}{Z_F(\alpha,\beta)}$, with $F = \beta E_D + \alpha E_W$
Minimize F to maximize P.

23 Optimizing Regularization Parameters
Second level of inference:
$P(\alpha, \beta \mid D, M) = \frac{P(D \mid \alpha, \beta, M)\,P(\alpha, \beta \mid M)}{P(D \mid M)}$
Evidence (from the first level of inference):
$P(D \mid \alpha, \beta, M) = \frac{P(D \mid x, \beta, M)\,P(x \mid \alpha, M)}{P(x \mid D, \alpha, \beta, M)} = \frac{\left[\frac{\exp(-\beta E_D)}{Z_D(\beta)}\right]\left[\frac{\exp(-\alpha E_W)}{Z_W(\alpha)}\right]}{\frac{\exp(-F(x))}{Z_F(\alpha,\beta)}} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)} \cdot \frac{\exp(-\beta E_D - \alpha E_W)}{\exp(-F(x))} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)}$
$Z_F(\alpha,\beta)$ is the only unknown in this expression.

24 Quadratic Approximation
Taylor series expansion about the minimum point $x^{MP}$:
$F(x) \approx F(x^{MP}) + \frac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})$, where $H = \beta\nabla^2 E_D + \alpha\nabla^2 E_W$
Substituting into the previous posterior density function:
$P(x \mid D, \alpha, \beta, M) \approx \frac{1}{Z_F}\exp\!\left(-F(x^{MP}) - \tfrac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})\right) = \left\{\frac{1}{Z_F}\exp(-F(x^{MP}))\right\}\exp\!\left(-\tfrac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})\right)$
Equate with the standard Gaussian density:
$P(x) = \frac{1}{(2\pi)^{n/2}\,\left|(H^{MP})^{-1}\right|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})\right)$
Comparing the two expressions, we have:
$Z_F(\alpha,\beta) \approx (2\pi)^{n/2}\left(\det\!\left((H^{MP})^{-1}\right)\right)^{1/2}\exp(-F(x^{MP}))$
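Numerically, this quadratic (Laplace) approximation gives $\log Z_F$ directly from $F(x^{MP})$ and the Hessian at the minimum; a sketch, assuming both are already available:

```python
import numpy as np

def log_ZF(F_min, H):
    """log Z_F(alpha, beta) under the quadratic approximation:
    Z_F ~ (2*pi)^(n/2) * det(H^{-1})^(1/2) * exp(-F(x_MP))."""
    n = H.shape[0]
    _, logdet_H = np.linalg.slogdet(H)     # log det(H); note det(H^{-1}) = 1/det(H)
    return 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet_H - F_min
```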

25 Optimum Parameters
If we make this substitution for $Z_F$ in the expression for the evidence and then take derivatives with respect to $\alpha$ and $\beta$ to locate the optimum, we find:
$\alpha^{MP} = \frac{\gamma}{2 E_W(x^{MP})}$,  $\beta^{MP} = \frac{N - \gamma}{2 E_D(x^{MP})}$
Effective number of parameters: $\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$

26 Gauss-Newton Approximation
It can be expensive to compute the Hessian matrix. Try the Gauss-Newton approximation:
$H = \nabla^2 F(x) \approx 2\beta J^T J + 2\alpha I_n$
This is readily available if the Levenberg-Marquardt algorithm is used for training.

27 Algorithm (GNBR)
0. Initialize $\alpha$, $\beta$ and the weights.
1. Take one step of Levenberg-Marquardt to minimize $F(x) = \beta E_D + \alpha E_W$.
2. Compute the effective number of parameters $\gamma = n - 2\alpha\,\mathrm{tr}(H^{-1})$, using the Gauss-Newton approximation for H.
3. Compute new estimates of the regularization parameters $\alpha = \gamma / (2 E_W)$ and $\beta = (N - \gamma) / (2 E_D)$.
4. Iterate steps 1-3 until convergence.
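A schematic NumPy version of this loop is sketched below. The Levenberg-Marquardt step, the Jacobian and the error vector are abstracted as callables (placeholders for illustration); this is a sketch of the update logic, not the exact Foresee-Hagan implementation.

```python
import numpy as np

def gnbr(x, lm_step, jacobian, errors, N, max_iter=100, tol=1e-6):
    """Gauss-Newton approximation to Bayesian Regularization (sketch).

    lm_step(x, alpha, beta) -> weights after one Levenberg-Marquardt step on F
    jacobian(x)             -> Jacobian J of the error vector w.r.t. the weights
    errors(x)               -> error vector (t_q - a_q), length N
    """
    n = x.size
    alpha, beta = 0.0, 1.0                                   # step 0: initialize
    for _ in range(max_iter):
        x_new = lm_step(x, alpha, beta)                      # step 1: one LM step on F
        J = jacobian(x_new)
        H = 2.0 * beta * (J.T @ J) + 2.0 * alpha * np.eye(n)   # Gauss-Newton Hessian of F
        gamma = n - 2.0 * alpha * np.trace(np.linalg.inv(H))   # step 2: effective # of params
        E_W = np.sum(x_new ** 2)                             # assumes nonzero weights
        E_D = np.sum(errors(x_new) ** 2)
        alpha = gamma / (2.0 * E_W)                          # step 3: new regularization
        beta = (N - gamma) / (2.0 * E_D)                     #         parameter estimates
        if np.linalg.norm(x_new - x) < tol:                  # step 4: iterate to convergence
            x = x_new
            break
        x = x_new
    return x, alpha, beta, gamma
```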

28 Checks of Performance If γ is very close to n, then the network may be too small. Add more hidden layer neurons and retrain. If the larger network has the same final γ, then the smaller network was large enough. Otherwise, increase the number of hidden neurons. If a network is sufficiently large, then a larger network will achieve comparable values for γ, E D and E W. 28

29 GNBR Example
[Figure: network fit produced by GNBR, with the final values of $\alpha$ and $\beta$]

30 Convergence of GNBR
[Figure: $E_D$ on the training and testing sets, the effective number of parameters $\gamma$, and the ratio $\alpha/\beta$, each plotted versus iteration]

31 Relationship between Early Stopping and Regularization

32 Linear Network
[Network diagram: input p (R x 1), weight matrix W (S x R), bias b (S x 1), n = Wp + b, output a (S x 1)]
$a = \mathrm{purelin}(Wp + b) = Wp + b$
$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i w^T p + b_i) = {}_i w^T p + b_i$
${}_i w = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$ (the i-th row of W, written as a column vector)

33 Performance Index
Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$ (input $p_q$, target $t_q$)
Notation: $x = \begin{bmatrix} w \\ b \end{bmatrix}$, $z = \begin{bmatrix} p \\ 1 \end{bmatrix}$, so that $a = w^T p + b = x^T z$
Mean square error: $F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2] = E_D$

34 Error Analysis
$F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]$
$F(x) = E[t^2 - 2 t x^T z + x^T z z^T x]$
$F(x) = E[t^2] - 2 x^T E[t z] + x^T E[z z^T]\, x$
$F(x) = c - 2 x^T h + x^T R x$, where $c = E[t^2]$, $h = E[t z]$, $R = E[z z^T]$
The mean square error for the linear network is a quadratic function:
$F(x) = c + d^T x + \frac{1}{2} x^T A x$, with $d = -2h$ and $A = 2R$
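For a finite training set the expectations become probability-weighted sums; a small NumPy sketch (column q of `P` holds input $p_q$, `T[q]` the scalar target, and `prob` the occurrence probabilities, all placeholders):

```python
import numpy as np

def quadratic_mse_terms(P, T, prob=None):
    """Return c = E[t^2], h = E[t z], R = E[z z^T] so that
    F(x) = c - 2 x^T h + x^T R x, with z = [p; 1] and x = [w; b]."""
    Q = P.shape[1]
    prob = np.full(Q, 1.0 / Q) if prob is None else np.asarray(prob, float)
    Z = np.vstack([P, np.ones((1, Q))])       # augment inputs with the bias entry
    c = np.sum(prob * T ** 2)
    h = (Z * (prob * T)).sum(axis=1)
    R = (Z * prob) @ Z.T
    return c, h, R
```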

35 Example
[Network diagram: two-input neuron, a = purelin(Wp + b)]
Training pair $\{p_1, t_1\}$ occurs with probability 0.75; training pair $\{p_2, t_2\}$ occurs with probability 0.25.
$F(x) = c - 2 x^T h + x^T R x = E_D$
$c = E[t^2] = (t_1)^2 (0.75) + (t_2)^2 (0.25)$
$h = E[t z] = (0.75)\, t_1 z_1 + (0.25)\, t_2 z_2$
$R = E[z z^T] = z_1 z_1^T (0.75) + z_2 z_2^T (0.25)$

36 Performance Contour
Optimum point (maximum likelihood): $x^{ML} = -A^{-1} d = R^{-1} h$
Hessian matrix: $\nabla^2 F(x) = A = 2R = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$
Eigenvalues: $|A - \lambda I| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3)$, so $\lambda_1 = 1$, $\lambda_2 = 3$
Eigenvectors: from $(A - \lambda_i I)v_i = 0$, $v_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ and $v_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$
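The eigen-analysis can be checked directly with NumPy, using the Hessian values reconstructed above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # Hessian of F(x), A = 2R
lam, V = np.linalg.eigh(A)
print(lam)                          # eigenvalues: [1. 3.]
print(V)                            # eigenvectors along the [1, -1] and [1, 1] directions
```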

37 Contour Plot of E_D
[Figure: contours of $E_D$ with the minimum at $x^{ML}$]
$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$

38 Steepest Descent Trajectory
(Here $\alpha$ denotes the learning rate.)
$x_{k+1} = x_k - \alpha g_k = x_k - \alpha(A x_k + d) = x_k - \alpha A(x_k + A^{-1} d) = x_k - \alpha A(x_k - x^{ML})$
$\phantom{x_{k+1}} = [I - \alpha A]\, x_k + \alpha A\, x^{ML} = M x_k + [I - M]\, x^{ML}$, where $M = [I - \alpha A]$
$x_1 = M x_0 + [I - M]\, x^{ML}$
$x_2 = M x_1 + [I - M]\, x^{ML} = M^2 x_0 + M[I - M]\, x^{ML} + [I - M]\, x^{ML}$
$\phantom{x_2} = M^2 x_0 + M x^{ML} - M^2 x^{ML} + x^{ML} - M x^{ML} = M^2 x_0 + [I - M^2]\, x^{ML}$
$\vdots$
$x_k = M^k x_0 + [I - M^k]\, x^{ML}$
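A quick numerical check of this closed form, for any positive-definite A and a small enough learning rate (the quadratic is $F(x) = c + d^T x + \tfrac{1}{2}x^T A x$, so $g_k = A x_k + d$; the particular A, d, x0 values below are assumed for illustration):

```python
import numpy as np

def check_sd_trajectory(A, d, x0, lr, k):
    """Verify x_k = M^k x_0 + (I - M^k) x_ML for steepest descent on a quadratic."""
    n = len(x0)
    x_ml = np.linalg.solve(A, -d)                    # minimum point: A x + d = 0
    M = np.eye(n) - lr * A
    x = x0.astype(float).copy()
    for _ in range(k):                               # explicit steepest-descent iteration
        x = x - lr * (A @ x + d)
    Mk = np.linalg.matrix_power(M, k)
    x_closed = Mk @ x0 + (np.eye(n) - Mk) @ x_ml     # closed-form expression from the slide
    return np.allclose(x, x_closed)

A = np.array([[2.0, 1.0], [1.0, 2.0]])
print(check_sd_trajectory(A, d=np.array([-2.0, -1.0]), x0=np.zeros(2), lr=0.1, k=50))  # True
```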

39 Regularization
$F(x) = E_D + \rho E_W$ (with $\rho = \alpha/\beta$), where $E_W = \frac{1}{2}(x - x_0)^T (x - x_0)$
Gradients: $\nabla E_D = A(x - x^{ML})$, $\nabla E_W = (x - x_0)$
To locate the minimum point, set the gradient to zero:
$\nabla F(x) = \nabla E_D + \rho\, \nabla E_W = A(x - x^{ML}) + \rho(x - x_0) = 0$

40
$A(x^{MAP} - x^{ML}) = -\rho(x^{MAP} - x_0) = -\rho(x^{MAP} - x^{ML} + x^{ML} - x_0) = -\rho(x^{MAP} - x^{ML}) - \rho(x^{ML} - x_0)$
$(A + \rho I)(x^{MAP} - x^{ML}) = \rho(x_0 - x^{ML})$
$x^{MAP} - x^{ML} = \rho(A + \rho I)^{-1}(x_0 - x^{ML})$
$x^{MAP} = x^{ML} + \rho(A + \rho I)^{-1} x_0 - \rho(A + \rho I)^{-1} x^{ML} = M_\rho x_0 + [I - M_\rho]\, x^{ML}$, where $M_\rho = \rho(A + \rho I)^{-1}$

41 Early Stopping vs. Regularization
Early stopping: $x_k = M^k x_0 + [I - M^k]\, x^{ML}$, with $M = [I - \alpha A]$
Regularization: $x^{MAP} = M_\rho x_0 + [I - M_\rho]\, x^{ML}$, with $M_\rho = \rho(A + \rho I)^{-1}$
Eigenvalues of $M^k$: if $z_i$ is an eigenvector of A with eigenvalue $\lambda_i$, then
$[I - \alpha A]\, z_i = z_i - \alpha A z_i = z_i - \alpha\lambda_i z_i = (1 - \alpha\lambda_i)\, z_i$, so $\mathrm{eig}(M^k) = (1 - \alpha\lambda_i)^k$
Eigenvalues of $M_\rho$: $\mathrm{eig}(M_\rho) = \frac{\rho}{\lambda_i + \rho}$

42 Reg. Parameter vs. Iteration Number
$M^k$ and $M_\rho$ have the same eigenvectors. They would be equal if their eigenvalues were equal:
$\frac{\rho}{\lambda_i + \rho} = (1 - \alpha\lambda_i)^k$
Taking logarithms: $-\log\!\left(1 + \frac{\lambda_i}{\rho}\right) = k\log(1 - \alpha\lambda_i)$
These are equal at $\lambda_i = 0$; they remain approximately equal if the slopes (derivatives with respect to $\lambda_i$) are equal:
$\frac{1}{\rho + \lambda_i} = \frac{\alpha k}{1 - \alpha\lambda_i}$, i.e., $\alpha k = \frac{1}{\rho}\cdot\frac{(1 - \alpha\lambda_i)}{(1 + \lambda_i/\rho)}$
If $\alpha\lambda_i$ and $\lambda_i/\rho$ are small, then $\alpha k \approx \frac{1}{\rho}$.
(Increasing the number of iterations is equivalent to decreasing the regularization parameter!)
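A small numeric illustration of the correspondence $\alpha k \approx 1/\rho$ (the eigenvalues, learning rate and iteration count below are arbitrary illustrative values):

```python
import numpy as np

lam = np.array([0.02, 0.05, 0.1])   # assumed eigenvalues of A (small relative to 1/alpha)
lr = 0.01                           # learning rate (alpha in the steepest-descent slides)
k = 500                             # number of steepest-descent iterations
rho = 1.0 / (lr * k)                # equivalent regularization parameter, rho = 1/(alpha*k)

eig_Mk = (1.0 - lr * lam) ** k      # eigenvalues of M^k   (early stopping)
eig_Mrho = rho / (lam + rho)        # eigenvalues of M_rho (regularization)
print(eig_Mk)                       # approximately [0.905 0.779 0.606]
print(eig_Mrho)                     # approximately [0.909 0.8   0.667]
```

The agreement is closest for the smallest eigenvalues, exactly as the derivation requires ($\alpha\lambda_i$ and $\lambda_i/\rho$ small).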

43 Example
[Network diagram: two-input neuron, a = purelin(Wp + b); training pairs $\{p_1, t_1\}$ with probability 0.75 and $\{p_2, t_2\}$ with probability 0.25]
$F(x) = E_D + \rho E_W$
$E_D = c + x^T d + \frac{1}{2} x^T A x$, $E_W = \frac{1}{2} x^T x$, with $d = -2h$, $A = 2R$
$\nabla^2 F(x) = \nabla^2 E_D + \rho\, \nabla^2 E_W = A + \rho I = \begin{bmatrix} 2 + \rho & 1 \\ 1 & 2 + \rho \end{bmatrix}$

44 [Figure: contour plots of $F(x) = E_D + \rho E_W$ for several values of ρ, including ρ = 0 and ρ = 2]

45 [Figure: movement of the minimum point $x^{MAP}$ as ρ is varied]

46 Steepest Descent Path

47 Effective Number of Parameters
$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$
$H(x) = \nabla^2 F(x) = \beta\,\nabla^2 E_D + \alpha\,\nabla^2 E_W = \beta\,\nabla^2 E_D + 2\alpha I$
If $\lambda_i$ are the eigenvalues of $\nabla^2 E_D$, then $\mathrm{tr}\{H^{-1}\} = \sum_{i=1}^{n}\frac{1}{\beta\lambda_i + 2\alpha}$
$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right) = n - \sum_{i=1}^{n}\frac{2\alpha}{\beta\lambda_i + 2\alpha} = \sum_{i=1}^{n}\frac{\beta\lambda_i}{\beta\lambda_i + 2\alpha} = \sum_{i=1}^{n}\gamma_i$, with $\gamma_i = \frac{\beta\lambda_i}{\beta\lambda_i + 2\alpha}$
The effective number of parameters will equal the number of large eigenvalues of the Hessian.
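Given the Gauss-Newton approximation $\nabla^2 E_D \approx 2 J^T J$ from earlier, γ can be computed directly from the eigenvalues; a sketch:

```python
import numpy as np

def effective_num_parameters(J, alpha, beta):
    """gamma = sum_i beta*lam_i / (beta*lam_i + 2*alpha), where lam_i are the
    eigenvalues of grad^2 E_D, approximated here by 2 J^T J (Gauss-Newton)."""
    lam = np.linalg.eigvalsh(2.0 * (J.T @ J))        # eigenvalues of grad^2 E_D
    return np.sum(beta * lam / (beta * lam + 2.0 * alpha))
```

Each term $\gamma_i$ lies between 0 and 1, so γ counts, roughly, how many eigen-directions of the Hessian are constrained by the data rather than by the prior.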
