VI. Backpropagation Neural Networks (BPNN)


1 VI. Backpropagation Neural Networks (BPNN)
Outline:
- Review of Adaline
- Newton's method
- Backpropagation algorithm: definition; derivative computation; weight/bias computation; function approximation example; network generalization issues
- Potential problems with the BPNN: momentum filter; iteration schemes review
- Generalization: regularization; early stopping
- Implementation issues
References: [Hagan], [Mathworks], NN FAQ at ftp://ftp.sas.com/pub/neural/FAQ.html, Pattern Classification, Duda & Hart, Wiley, 2001

2 Recall the Adaline (LMS) network:
Input p (R x 1); linear neuron: a = Wp + b, with W (S x R) and b (S x 1).
Restriction of the Adaline (LMS): linear activation function.
Problem solved with Adaline/LMS: given a set {p_i, t_i}, find the weights and bias which minimize the mean square error. With x = [w; b] and z = [p; 1], so that a = x^T z:
F(x) = E[e^2] = E[(t - x^T z)^2] = c - 2 x^T h + x^T R x
where R = E[z z^T], h = E[t z], and c = E[t^2].

3 Practical application: solving for the minimum of F(x) requires computing R and h, and R^{-1}.
Alternative: solve the problem iteratively using steepest descent only:
x_{k+1} = x_k - α ∇F(x_k) = x_k + 2 α e_k z_k,   e_k = t_k - a_k
LMS iteration: pick x(0); a = x_k^T z_k; e = t - a; x_{k+1} = x_k + 2 α e z_k; k = k + 1.
Extensions ==> multilayer perceptron
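A minimal Python/NumPy sketch of the LMS iteration above; the data, learning rate, and epoch count are illustrative assumptions, not values from the slides:

```python
import numpy as np

def lms_train(P, T, alpha=0.05, epochs=100):
    """Adaline/LMS: x_{k+1} = x_k + 2*alpha*e_k*z_k, with x = [w; b], z = [p; 1]."""
    R, N = P.shape
    x = np.zeros(R + 1)                     # start from x(0) = 0
    for _ in range(epochs):
        for k in range(N):
            z = np.append(P[:, k], 1.0)     # augmented input z = [p; 1]
            a = x @ z                       # network output a = x^T z
            e = T[k] - a                    # error e = t - a
            x = x + 2 * alpha * e * z       # LMS update
    return x[:-1], x[-1]                    # weights, bias

# Toy usage: learn a = 2*p1 - p2 + 0.5 from noisy samples
rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, size=(2, 200))
T = 2 * P[0] - P[1] + 0.5 + 0.01 * rng.standard_normal(200)
print(lms_train(P, T))
```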

4 Why use multi-layer structures?
Example: classes made up of subclasses (figure: Class 1 and Class 2 regions; K = 9 subclasses, M = 2 classes).

5 Example: pattern classification: the XOR gate
p_1 = [0; 0], t_1 = 0;  p_2 = [0; 1], t_2 = 1;  p_3 = [1; 0], t_3 = 1;  p_4 = [1; 1], t_4 = 0
Can it be solved with a single-layer perceptron?

6 NN block diagram (figure).

7 Note: the final network space partitioning varies as a function of the number of neurons in the hidden layer.

8 Example (figure: three first-layer neurons with outputs y_1, y_2, y_3 and biases b_1, b_2, b_3):
Assume b_1 = 0.5 and given values for b_2 and b_3.
- Plot the decision boundaries obtained assuming hard-limit (HL) activation functions are used.
- Derive the weight matrix and bias vector used for this network.
- Design the NN second layer (following the given in-class guidelines, i.e., identify its weight matrix and bias).


11 Example: multilayer perceptron (classification). Assume dark = 1 (figure).


13 Example: function approximation with a 1-2-1 network (log-sigmoid hidden layer, linear output layer):
a^1 = logsig(W^1 p + b^1),   a^2 = purelin(W^2 a^1 + b^2)
f^1(n) = 1 / (1 + e^{-n}),   f^2(n) = n
Nominal parameter values: w^1_{1,1} = 10, w^1_{2,1} = 10, b^1_1 = -10, b^1_2 = 10, w^2_{1,1} = 1, w^2_{1,2} = 1, b^2 = 0.
Figures: Example Function Approximation Network; Nominal Response of the Network of the Figure Above; Effect of Parameter Changes on Network Response.
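A small Python sketch of this 1-2-1 network's forward pass; the nominal parameter values below are the ones reconstructed above from the textbook example and should be treated as assumptions:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def net_1_2_1(p, W1, b1, W2, b2):
    """a1 = logsig(W1*p + b1), a2 = purelin(W2*a1 + b2), scalar input p."""
    a1 = logsig(W1 * p + b1)              # hidden layer: 2 log-sigmoid neurons
    return (W2 @ a1 + b2).item()          # linear output layer

# Nominal parameter values (assumed reconstruction of the textbook example)
W1 = np.array([[10.0], [10.0]]); b1 = np.array([[-10.0], [10.0]])
W2 = np.array([[1.0, 1.0]]);     b2 = np.array([[0.0]])

for p in np.linspace(-2, 2, 5):
    print(p, net_1_2_1(p, W1, b1, W2, b2))
```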

14 Backpropagation algorithm:
Three-layer feedforward network (layer sizes S^1, S^2, S^3):
a^1 = f^1(W^1 p + b^1)
a^2 = f^2(W^2 a^1 + b^2)
a^3 = f^3(W^3 a^2 + b^3)
so a^3 = f^3(W^3 f^2(W^2 f^1(W^1 p + b^1) + b^2) + b^3).
Goal: given a set of {p_i, t_i}, find the weights and biases which minimize the mean square error (performance surface) F(x) = E[(t - a)^2].
Discard the expectation operator and approximate, at the k-th sample,
F(x) ≈ (t_k - a_k)^T (t_k - a_k).

15 For a one-layer network with a purelin activation function (LMS case):
W_{k+1} = W_k + 2 α e_k p_k^T
b_{k+1} = b_k + 2 α e_k

16 How to compute the derivatives? Use SD (steepest descent). Recall:
w^m_{i,j}(k+1) = w^m_{i,j}(k) - α ∂F/∂w^m_{i,j}
b^m_i(k+1) = b^m_i(k) - α ∂F/∂b^m_i
Note: F(x) may not be expressed directly in terms of w^1_{i,j}, w^2_{i,j}, etc. We need to use the chain rule:
df(n(w))/dw = (df(n)/dn) (dn(w)/dw)
Example: f(n) = e^{3n} with n = 5w + 35, so f(n(w)) = e^{3(5w + 35)} and df/dw = 3 e^{3n} · 5.

17 Applying the chain rule:
∂F/∂w^m_{i,j} = (∂F/∂n^m_i)(∂n^m_i/∂w^m_{i,j})
∂F/∂b^m_i = (∂F/∂n^m_i)(∂n^m_i/∂b^m_i)
with n^m_i = Σ_j w^m_{i,j} a^{m-1}_j + b^m_i, so that ∂n^m_i/∂w^m_{i,j} = a^{m-1}_j and ∂n^m_i/∂b^m_i = 1.
Notation: w^m_{i,j} is the layer-m weight connecting the j-th input (output of layer m-1) to the i-th neuron; b^m_i is associated with the i-th neuron of layer m.
Updates:
w^m_{i,j}(k+1) = w^m_{i,j}(k) - α s^m_i a^{m-1}_j
b^m_i(k+1) = b^m_i(k) - α s^m_i
where s^m_i = ∂F/∂n^m_i is the sensitivity of F(·) to changes in the i-th element of the net input at layer m.

18 Expressing the weight/bias updates in matrix form. Element form:
w^m_{i,j}(k+1) = w^m_{i,j}(k) - α (∂F/∂n^m_i) a^{m-1}_j
b^m_i(k+1) = b^m_i(k) - α ∂F/∂n^m_i
Collecting the elements of layer m (weights w^m_{1,1}, ..., w^m_{S^m,R}, one row per neuron) gives
W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
where s^m = ∂F/∂n^m = [∂F/∂n^m_1, ∂F/∂n^m_2, ...]^T.

19 In matrix form:
W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
b^m_{k+1} = b^m_k - α s^m
with s^m = ∂F/∂n^m.

20 To obtain ∂F/∂n^m_i we again need the chain rule; it will involve terms of the form (∂F/∂n^{m+1}_j)(∂n^{m+1}_j/∂n^m_i).
Define the Jacobian matrix ∂n^{m+1}/∂n^m, with elements
∂n^{m+1}_i/∂n^m_j = ∂(Σ_l w^{m+1}_{i,l} a^m_l + b^{m+1}_i)/∂n^m_j = w^{m+1}_{i,j} ∂a^m_j/∂n^m_j = w^{m+1}_{i,j} f^{m'}(n^m_j)
so that
∂n^{m+1}/∂n^m = W^{m+1} F^{m'}(n^m),   where F^{m'}(n^m) = diag(f^{m'}(n^m_1), ..., f^{m'}(n^m_{S^m})).

21 s^m = ∂F/∂n^m = [∂F/∂n^m_1, ∂F/∂n^m_2, ...]^T (sensitivity of F to changes in each element of the net input at layer m).
Next, apply the chain rule for vectors:
s^m = ∂F/∂n^m = (∂n^{m+1}/∂n^m)^T ∂F/∂n^{m+1} = (W^{m+1} F^{m'}(n^m))^T s^{m+1} = F^{m'}(n^m) (W^{m+1})^T s^{m+1}.

22 We need to compute s^M at the output layer:
s^M_i = ∂F/∂n^M_i = ∂[(t - a)^T(t - a)]/∂n^M_i = ∂[Σ_j (t_j - a_j)^2]/∂n^M_i = -2 (t_i - a_i) ∂a^M_i/∂n^M_i = -2 (t_i - a_i) f^{M'}(n^M_i)
so, in matrix form,
s^M = -2 F^{M'}(n^M) (t - a).
Note: a = f(n).

23 Summary:
Start: a^0 = p
Propagate forward: a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), m = 0, 1, ..., M-1; a = a^M
Compute the output-layer sensitivity: s^M = -2 F^{M'}(n^M)(t - a)
Backpropagate the sensitivities: s^m = F^{m'}(n^m)(W^{m+1})^T s^{m+1}, m = M-1, ..., 2, 1
Update the weights and biases:
W^m(k+1) = W^m(k) - α s^m (a^{m-1})^T
b^m(k+1) = b^m(k) - α s^m
Note: we will need the derivatives of all activation functions.
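A compact Python sketch of the summary above for a 1-S^1-1 logsig/purelin network; the target function, hidden-layer size, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def train_1_s_1(P, T, S1=10, alpha=0.05, epochs=2000):
    """Incremental SDBP for a 1-S1-1 network (logsig hidden layer, purelin output)."""
    W1 = rng.uniform(-0.5, 0.5, (S1, 1)); b1 = rng.uniform(-0.5, 0.5, (S1, 1))
    W2 = rng.uniform(-0.5, 0.5, (1, S1)); b2 = rng.uniform(-0.5, 0.5, (1, 1))
    for _ in range(epochs):
        for p, t in zip(P, T):
            a0 = np.array([[p]])                    # a^0 = p
            n1 = W1 @ a0 + b1; a1 = logsig(n1)      # forward: layer 1
            a2 = W2 @ a1 + b2                       # forward: layer 2 (purelin)
            s2 = -2.0 * (t - a2)                    # s^M = -2 F'^M(n^M)(t - a), f2' = 1
            F1 = np.diagflat(a1 * (1.0 - a1))       # F'^1(n^1): d(logsig)/dn = a1(1 - a1)
            s1 = F1 @ W2.T @ s2                     # s^1 = F'^1 (W^2)^T s^2
            W2 -= alpha * s2 @ a1.T; b2 -= alpha * s2
            W1 -= alpha * s1 @ a0.T; b1 -= alpha * s1
    return W1, b1, W2, b2

P = np.linspace(-2, 2, 21)
T = 1 + np.sin(np.pi * P / 4)                       # assumed target g(p) = 1 + sin(pi*p/4)
W1, b1, W2, b2 = train_1_s_1(P, T)
fit = np.array([(W2 @ logsig(W1 * p + b1) + b2).item() for p in P])
print(np.max(np.abs(fit - T)))                      # rough fit check
```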

24 Example: function approximation.
Target: g(p) = 1 + sin(kπp/4) for a given k.
The input p drives both g(·) (producing the target t) and the 1-2-1 network (producing a); the error e = t - a is used for training.

25 The 1-2-1 network: log-sigmoid hidden layer, linear output layer:
a^1 = logsig(W^1 p + b^1),   a^2 = purelin(W^2 a^1 + b^2).

26 Initial conditions: initial values are assigned to W^1(0), b^1(0), W^2(0), b^2(0) of the 1-2-1 network.
Figure: network response for the initial values compared to the sine wave.
Sine-wave example: see the textbook.

27 What does the 1-2-1 network look like? (figure)


29 Example: function approximation:
g(p) = 1 + sin(iπp/4), p ∈ [-2, 2], for i = 1, 2, 4, 8
Figure: Function Approximation Using a 1-3-1 Network, with f^1(n) = 1/(1 + e^{-n}) and f^2(n) = n.
g(p) = 1 + sin(6πp/4), p ∈ [-2, 2]
Figure: Effect of Increasing the Number of Hidden Neurons.
Convergence issues:
g(p) = 1 + sin(πp), p ∈ [-2, 2]
Figure: Function Approximation Using a 1-3-1 Network.

30 g(p) = 1 + sin(πp), p ∈ [-2, 2]
Figure: Convergence to a Local Minimum.
Network generalization issues:
Figure: 1-2-1 Network Approximation of g(p); Figure: 1-9-1 Network Approximation of g(p).


32 Potential problems with backpropagation:
- activation functions may be nonlinear
- the performance surface is not unimodal
- convergence may be sped up with a variable learning rate: increase the step size when the performance index is flat, decrease the step size when the performance index is steep.
Possible strategy:
- If the error increases by more than a pre-defined value (typically 4-5%): the new weights are discarded and the learning rate is decreased (x 0.7).
- If the error increases by less than 4-5%: keep the new weights.
- If the error decreases: the learning rate is increased by 5%.

33 Recall:
W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
b^m_{k+1} = b^m_k - α s^m
Convergence may be sped up with the momentum filter: introduce memory and low-pass (LP) filter behavior.
Define the filtered quantity used to update W, b:
x_k = γ x_{k-1} + (1 - γ) s_k
Filter response: X(z) = γ z^{-1} X(z) + (1 - γ) S(z)  ==>  H(z) = X(z)/S(z) = (1 - γ)/(1 - γ z^{-1})
Apply the above concept to the iteration equations:
ΔW^m_k = γ ΔW^m_{k-1} - (1 - γ) α s^m (a^{m-1})^T
Δb^m_k = γ Δb^m_{k-1} - (1 - γ) α s^m
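A minimal sketch of the momentum-filtered update, assuming the increment ΔW is stored between iterations (the shapes and numbers are made up for illustration):

```python
import numpy as np

def momentum_step(dW_prev, grad_term, gamma=0.8, alpha=0.1):
    """MOBP increment: dW_k = gamma*dW_{k-1} - (1-gamma)*alpha*grad_term,
    where grad_term = s @ a_prev.T for a weight matrix (or just s for a bias)."""
    return gamma * dW_prev - (1 - gamma) * alpha * grad_term

# usage inside a training loop (illustrative shapes only)
s = np.array([[0.3], [-0.1]])          # layer sensitivities s^m
a_prev = np.array([[0.5]])             # previous-layer output a^{m-1}
dW = np.zeros((2, 1))                  # running increment, initialized to 0
for _ in range(3):
    dW = momentum_step(dW, s @ a_prev.T)
    # W = W + dW  would be applied here
    print(dW.ravel())
```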

34 Iteration Techniques
x_{k+1} = x_k + α_k p_k, where p_k is selected so that F(x_{k+1}) < F(x_k). Use a Taylor series expansion.
1. First-order expansion:
F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + ∇F(x)^T|_{x=x_k} Δx_k
==> x_{k+1} = x_k - α ∇F(x)|_{x=x_k}   (SD scheme)
2. Second-order expansion:
F(x_{k+1}) ≈ F(x_k) + ∇F(x)^T|_{x=x_k} Δx_k + (1/2) Δx_k^T A_k Δx_k
==> Δx_k = -A_k^{-1} ∇F(x)|_{x=x_k},   A_k = ∇^2 F(x)|_{x=x_k}
Leads to Newton's scheme.
Recall: potential problems with the Newton scheme (Hessian, gradient, convergence).

35 Levenberg-Marquardt Algorithm
Designed to speed up the convergence of Newton's method while reducing the computational load.
Assume F(x) = V(x)^T V(x) = Σ_i v_i^2(x). Then:
1. ∇F(x) = 2 J^T(x) V(x)
2. ∇^2 F(x) = 2 J^T(x) J(x) + 2 S(x)
Recall (Newton): x_{k+1} = x_k - [∇^2 F(x)]^{-1} ∇F(x)|_{x=x_k}
LM: x_{k+1} = x_k - [J^T(x_k) J(x_k) + μ_k I]^{-1} J^T(x_k) V(x_k)
General guidelines for μ_k: start with μ_k = 0.01; if F(x) doesn't decrease, repeat with μ_k = 10 μ_k.
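A minimal Levenberg-Marquardt sketch for a generic sum-of-squares problem F(x) = V(x)^T V(x). The finite-difference Jacobian, the toy residual function, and the step of dividing μ by 10 after a successful step are assumptions added to make the example self-contained (the slide only states the increase rule):

```python
import numpy as np

def lm_minimize(v, x0, mu=0.01, mu_scale=10.0, iters=50, eps=1e-6):
    """Levenberg-Marquardt: x <- x - (J^T J + mu I)^(-1) J^T V(x)."""
    x = np.asarray(x0, dtype=float)
    def jac(x):                                    # forward-difference Jacobian (assumption)
        v0 = v(x); J = np.zeros((v0.size, x.size))
        for j in range(x.size):
            dx = np.zeros_like(x); dx[j] = eps
            J[:, j] = (v(x + dx) - v0) / eps
        return J
    for _ in range(iters):
        V = v(x); J = jac(x); F = V @ V
        while True:
            step = np.linalg.solve(J.T @ J + mu * np.eye(x.size), J.T @ V)
            x_new = x - step
            if v(x_new) @ v(x_new) < F:            # F decreased: accept, relax mu
                x, mu = x_new, mu / mu_scale
                break
            mu *= mu_scale                         # F did not decrease: increase mu
            if mu > 1e10:
                return x
    return x

# toy usage: fit c in y = exp(c*t) from a few samples
t = np.array([0.0, 1.0, 2.0, 3.0]); y = np.exp(0.5 * t)
print(lm_minimize(lambda c: np.exp(c[0] * t) - y, x0=[0.0]))   # approaches [0.5]
```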

36 Figure: squared error surface as a function of the weight values w^1_{1,1} and w^2_{1,1}.

37 Figure: squared error surface as a function of the weight values w^1_{1,1} and w^2_{1,1}.

38 Figure: two SDBP (batch mode) trajectories on the squared error surface (w^1_{1,1} vs. w^2_{1,1}).

39 Figure: trajectory with the learning rate too large (w^1_{1,1} vs. w^2_{1,1}).

40 Momentum Backpropagation
Steepest Descent Backpropagation (SDBP):
W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
b^m_{k+1} = b^m_k - α s^m
Momentum Backpropagation (MOBP):
ΔW^m(k) = γ ΔW^m(k-1) - (1 - γ) α s^m (a^{m-1})^T
Δb^m(k) = γ Δb^m(k-1) - (1 - γ) α s^m
Figure: trajectory with γ = 0.8.

41 Variable Learning Rate
- If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (1 > ρ > 0), and the momentum coefficient γ is set to zero.
- If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has been previously set to zero, it is reset to its original value.
- If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
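A sketch of the three rules above as a single decision function; the default ζ, ρ, and η values are taken from the trajectory slide that follows:

```python
def variable_lr_step(err_old, err_new, alpha, gamma, gamma0,
                     zeta=0.04, rho=0.7, eta=1.05):
    """Returns (accept_update, alpha, gamma) according to the three rules."""
    if err_new > err_old * (1 + zeta):      # error grew by more than zeta: reject
        return False, alpha * rho, 0.0      # shrink alpha, zero the momentum
    if err_new < err_old:                   # error decreased: accept, grow alpha
        return True, alpha * eta, (gamma if gamma != 0.0 else gamma0)
    return True, alpha, gamma               # small increase: accept, keep settings

# usage: one decision after a trial weight update
print(variable_lr_step(0.50, 0.48, alpha=0.1, gamma=0.8, gamma0=0.8))
```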

42 Variable learning rate trajectory (figure: w^1_{1,1} vs. w^2_{1,1}).
Parameters: η = 1.05 (weight selection threshold), ρ = 0.7 (damping factor for the learning rate), ζ = 4% (error threshold).
Figure: squared error and learning rate vs. iteration number.

43 Conjugate Gradient
1. The first search direction is the steepest descent direction:
p_0 = -g_0,   g_k = ∇F(x)|_{x=x_k}
2. Take a step, choosing the learning rate α_k to minimize the function along the search direction:
x_{k+1} = x_k + α_k p_k
3. Select the next search direction according to p_k = -g_k + β_k p_{k-1}, where (with Δg_{k-1} = g_k - g_{k-1}):
β_k = (Δg_{k-1}^T g_k) / (Δg_{k-1}^T p_{k-1})   (Hestenes-Stiefel update)
β_k = (Δg_{k-1}^T g_k) / (g_{k-1}^T g_{k-1})    (Polak-Ribière update)
β_k = (g_k^T g_k) / (g_{k-1}^T g_{k-1})         (Fletcher-Reeves update)
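A small sketch of the three β updates and the resulting search direction; the toy gradient vectors are made up for illustration:

```python
import numpy as np

def beta_update(g_new, g_old, p_old, method="fletcher-reeves"):
    """Beta for nonlinear conjugate gradient, per the three formulas above."""
    dg = g_new - g_old
    if method == "hestenes-stiefel":
        return (dg @ g_new) / (dg @ p_old)
    if method == "polak-ribiere":
        return (dg @ g_new) / (g_old @ g_old)
    return (g_new @ g_new) / (g_old @ g_old)      # Fletcher-Reeves

def next_direction(g_new, g_old, p_old, method="fletcher-reeves"):
    return -g_new + beta_update(g_new, g_old, p_old, method) * p_old

g0 = np.array([1.0, -2.0]); p0 = -g0              # first direction: steepest descent
g1 = np.array([0.4, 0.3])                         # gradient after the line search
print(next_direction(g1, g0, p0, "polak-ribiere"))
```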

44 Figure: conjugate gradient trajectory (w^1_{1,1} vs. w^2_{1,1}).

45 Figure: Levenberg-Marquardt trajectory (w^1_{1,1} vs. w^2_{1,1}).

46 Resilient Backpropagation
BPNNs usually use sigmoid functions (tansig, logsig) as activation functions to introduce nonlinear behavior. These can cause the network to have very small gradients and the iterations to (almost) stall.
Resilient BPNN uses only the signs of the gradient components to determine the direction of the weight update; the weight change magnitudes are determined by separate update values.
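A sketch of a sign-based (Rprop-style) update; the increase/decrease factors 1.2 and 0.5 and the step-size bounds are commonly used defaults, not values from the slide:

```python
import numpy as np

def rprop_update(W, grad, grad_prev, delta, eta_plus=1.2, eta_minus=0.5,
                 delta_max=50.0, delta_min=1e-6):
    """Per-weight step sizes grow when the gradient sign is stable and shrink when it flips;
    the weight moves by -sign(gradient) * step."""
    sign_change = grad * grad_prev
    delta = np.where(sign_change > 0, np.minimum(delta * eta_plus, delta_max), delta)
    delta = np.where(sign_change < 0, np.maximum(delta * eta_minus, delta_min), delta)
    grad = np.where(sign_change < 0, 0.0, grad)        # skip the update after a sign flip
    W = W - np.sign(grad) * delta
    return W, grad, delta

W = np.array([0.5, -0.3]); delta = np.full(2, 0.1); g_prev = np.zeros(2)
for g in [np.array([0.2, -0.1]), np.array([0.15, 0.05])]:
    W, g_prev, delta = rprop_update(W, g, g_prev, delta)
print(W, delta)
```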

47 Algorithm Comparisons
It is very difficult to know which training algorithm will be the fastest for a given problem. Convergence speed depends on many factors: complexity of the problem, number of data points in the training set, number of weights and biases in the network, error goal, whether the network is being used for pattern recognition (discriminant analysis) or function approximation (regression), etc.

48 Toy Example 1: sinusoid function approximation
Network set-up: 1-5-1; activation functions: (tansig, purelin)
Number of trials: 30 with random initial weights and biases
Error threshold: MSE < 0.002
Algorithms compared (the table reports mean time (s), ratio, min. time (s), max. time (s), and std. (s) for each, measured on a Sun Sparc workstation): LM, BFG, RP, SCG, CGB, CGF, CGP, OSS, GDX.
Algorithm acronyms:
LM (trainlm) - Levenberg-Marquardt
BFG (trainbfg) - BFGS Quasi-Newton
RP (trainrp) - Resilient Backpropagation
SCG (trainscg) - Scaled Conjugate Gradient
CGB (traincgb) - Conjugate Gradient with Powell/Beale Restarts
CGF (traincgf) - Fletcher-Powell Conjugate Gradient
CGP (traincgp) - Polak-Ribière Conjugate Gradient
OSS (trainoss) - One-Step Secant
GDX (traingdx) - Variable Learning Rate Backpropagation


50 Example 2: function approximation (nonlinear regression) - Engine data set
Network set-up: 2-30-2
Network inputs: engine speed and fueling levels; network outputs: torque and emission levels.
Activation functions: (tansig, purelin)
Number of trials: 30 with random initial weights and biases
Error threshold: MSE < 0.005
Algorithms compared (the table reports mean time (s), ratio, min. time (s), max. time (s), and std. (s) for each, measured on a Sun Enterprise 4000 workstation): LM, BFG, RP, SCG, CGB, CGF, CGP, OSS, GDX.
Algorithm acronyms: same as for Toy Example 1.


52 Example 3: pattern recognition - Cancer data set
Network set-up: multilayer network with tansig activation functions in all layers.
Network inputs: clump thickness, uniformity of cell size and cell shape, amount of marginal adhesion, frequency of bare nuclei.
Network outputs: benign or malignant tumor.
Number of trials: 30 with random initial weights and biases
Error threshold: MSE < 0.012
Algorithms compared (the table reports mean time (s), ratio, min. time (s), max. time (s), and std. (s) for each, measured on a Sun Sparc workstation): CGB, RP, SCG, CGP, CGF, LM, BFG, GDX, OSS.
Algorithm acronyms: same as for Toy Example 1.


54 Other examples are available in the MATLAB Neural Network Toolbox documentation (x/nnet/backpr4.shtl).

55 EXPERIMENT CONCLUSIONS
Several algorithm characteristics can be deduced from the experiments:
- In general, on function approximation problems, for networks that contain up to a few hundred weights, the LM algorithm will have the fastest convergence. This advantage is especially noticeable if very accurate training is required. In many cases, trainlm is able to obtain lower mean square errors than any of the other algorithms tested. However, as the number of weights in the network increases, the advantage of trainlm decreases. In addition, trainlm performance is relatively poor on pattern recognition problems. The storage requirements of trainlm are larger than those of the other algorithms tested. By adjusting the mem_reduc parameter, discussed earlier, the storage requirements can be reduced, but at the cost of increased execution time.
- The trainrp function is the fastest algorithm on pattern recognition problems. However, it does not perform well on function approximation problems. Its performance also degrades as the error goal is reduced. The memory requirements for this algorithm are relatively small in comparison to the other algorithms considered.
- The conjugate gradient algorithms, in particular trainscg, seem to perform well over a wide variety of problems, particularly for networks with a large number of weights. The SCG algorithm is almost as fast as the LM algorithm on function approximation problems (faster for large networks) and is almost as fast as trainrp on pattern recognition problems. Its performance does not degrade as quickly as trainrp's performance does when the error is reduced. The conjugate gradient algorithms have relatively modest memory requirements.
- The trainbfg performance is similar to that of trainlm. It does not require as much storage as trainlm, but the computation required does increase geometrically with the size of the network, since the equivalent of a matrix inverse must be computed at each iteration.
- The variable learning rate algorithm traingdx is usually much slower than the other methods, and has about the same storage requirements as trainrp, but it can still be useful for some problems. There are certain situations in which it is better to converge more slowly. For example, when using early stopping, you may have inconsistent results if you use an algorithm that converges too quickly. You may overshoot the point at which the error on the validation set is minimized.

56 Generalization Issues
The network may be overtrained (overfitting issues) when the MSE goal on the training set is set too low.
Potential risk: the network memorizes the training examples, but doesn't learn to generalize to similar but new situations.
Consequences: very good performance on the training set, very poor performance on the testing set. (Figure: a 1-20-1 net fit to a noisy sine.)
How to prevent overfitting?
- Use a network that is not too large for the problem (the appropriate network size is difficult to guess a priori).
- Increase the training set size if possible.
- Apply regularization or early stopping.

57 Regularization
Recall that the basic performance (MSE) function is defined as
MSE = (1/N) Σ_{i=1}^{N} e_i^2 = (1/N) Σ_{i=1}^{N} (t_i - a_i)^2
The performance function is modified as
MSE_reg = γ MSE + (1 - γ) MSW,   where MSW = (1/P) Σ_{i=1}^{P} w_i^2
γ is the performance ratio and P is the number of network weights.
Consequences: MSE_reg forces the network to have smaller weights and biases, producing a smoother response that is less likely to overfit.
Drawbacks: it is difficult to estimate γ; γ too large -> overfitting problem; γ too small -> no good fit of the training data.
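A one-function sketch of the regularized performance index; the example errors, weights, and γ below are made up:

```python
import numpy as np

def msereg(errors, weights, gamma=0.9):
    """Regularized performance index: gamma*MSE + (1 - gamma)*MSW."""
    mse = np.mean(np.square(errors))          # (1/N) sum e_i^2
    msw = np.mean(np.square(weights))         # mean of the squared weights
    return gamma * mse + (1 - gamma) * msw

e = np.array([0.1, -0.3, 0.2])
w = np.array([1.5, -0.7, 0.2, 0.9])
print(msereg(e, w, gamma=0.9))
```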

58 Automated Regularization (MATLAB: trainbr)
Definition: assume the weights and biases are random variables with specific distributions. Define the new performance function
MSE_aut = α MSE + β MSW
and apply statistical concepts (Bayes' rule) to find optimum values for α and β (iterative procedure).
Figures: fit obtained with the basic MSE vs. fit obtained with MSE_aut.

59 Early Stopping (MATLAB: train with option 'val')
Definition: the training set is split into two sets:
- training subset: used to compute the network weights and biases
- validation subset: the error on the validation set is monitored during training; the validation error goes down at training onset and goes back up when the network starts to overfit the data.
Training is continued until the validation error increases for a specified number of iterations; the final weights & biases are those obtained for the minimum validation error.
Figures: fit obtained with the basic MSE vs. fit obtained with early stopping.
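A generic early-stopping loop sketch; the patience-based stopping criterion and the toy "training" below are illustrative assumptions:

```python
import numpy as np

def train_with_early_stopping(step_fn, val_error_fn, max_epochs=1000, patience=10):
    """step_fn() runs one training epoch and returns the current parameters;
    val_error_fn(params) returns the validation error. Keeps the best parameters."""
    best_params, best_err, worse = None, np.inf, 0
    for _ in range(max_epochs):
        params = step_fn()
        err = val_error_fn(params)
        if err < best_err:
            best_params, best_err, worse = params, err, 0
        else:
            worse += 1
            if worse >= patience:      # validation error rose for `patience` epochs
                break
    return best_params, best_err

# toy usage: a "network" whose validation error is minimized at epoch 30
state = {"epoch": 0}
def step():   state["epoch"] += 1; return state["epoch"]
def val(ep):  return (ep - 30) ** 2 / 900 + 0.1
print(train_with_early_stopping(step, val))
```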

60 (MATHWORKS) CONCLUSIONS
Both regularization and early stopping can ensure network generalization when properly applied.
When using Bayesian regularization, it is important to train the network until it reaches convergence. The MSE, MSW, and the effective number of parameters should reach constant values when the network has converged.
For early stopping, be careful not to use an algorithm that converges too rapidly. If you are using a fast algorithm (like trainlm), set the training parameters so that the convergence is relatively slow (e.g., set mu to a relatively large value, such as 1, and set mu_dec and mu_inc to values close to 1, such as 0.8 and 1.5, respectively). The training functions trainscg and trainrp usually work well with early stopping.
With early stopping, the choice of the validation set is also important. The validation set should be representative of all points in the training set.
With both regularization and early stopping, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance.
Based on our (MATHWORKS) experience, Bayesian regularization generally provides better generalization performance than early stopping when training function approximation networks. This is because Bayesian regularization does not require that a validation data set be separated out of the training data set; it uses all of the data. This advantage is especially noticeable when the size of the data set is small.

61 Early Stopping / Validation discussions
Data sets:
- SINE (5% N): single-cycle sine wave with Gaussian noise at the 5% level.
- SINE (2% N): single-cycle sine wave with Gaussian noise at the 2% level.
- ENGINE (ALL): engine sensor data, full data set.
- ENGINE (1/4): engine sensor data, 1/4 of the data set.
Mean squared test-set errors are compared for ES (early stopping), BR (Bayesian regularization), and ES/BR on the Engine (All), Engine (1/4), Sine (5% N), and Sine (2% N) data sets; the BR errors are on the order of 1e-3, while the ES errors are on the order of 1e-1 to 1e-2.

62 Some general design principles (from the NN FAQ)
- Data encoding issues
- Number of layers issues
- Number of neurons per layer issues
- Input variable standardization issues
- Output variable standardization issues
- Generalization error evaluation issues

63 Data encoding issues (from the NN FAQ) (figure/table of encoding examples).

64 Number of layers issues [from the NN FAQ]
You may not need any hidden layers at all. Linear and generalized linear models are useful in a wide variety of applications. And even if the function you want to learn is mildly nonlinear, you may get better generalization with a simple linear model than with a complicated nonlinear model if there is too little data or too much noise to estimate the nonlinearities accurately.
In MLPs with step/threshold/Heaviside activation functions, you need two hidden layers for full generality.
In MLPs with any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the universal approximation property. But there is no theory yet to tell you how many hidden units are needed to approximate any given function.

65 Number of neurons per layer issues [NN FAQ]
The best number of hidden units depends in a complex way on:
- the numbers of input and output units
- the number of training cases
- the amount of noise in the targets
- the complexity of the function or classification to be learned
- the architecture
- the type of hidden unit activation function
- the training algorithm
- regularization
In most situations, there is no way to determine the best number of hidden units without training several networks and estimating the generalization error of each. If you have too few hidden units, you will get high training error and high generalization error due to underfitting and high statistical bias. If you have too many hidden units, you may get low training error but still have high generalization error due to overfitting and high variance.

66 Input variable standardization issues [NN FAQ]
An input's contribution depends on its variability relative to the other inputs.
Example: Input 1 in the range [-1, 1], Input 2 in the range [0, 10,000]: Input 1's contribution will be swamped by Input 2.
Scale the inputs so that their variability reflects their importance:
* If importance is not known: scale all inputs to the same variability or the same range.
* If importance is known: scale the more important inputs so that they have larger variances/ranges (see the sketch after this slide).
Standardizing input variables has different effects on different training algorithms for MLPs. For example:
1) Steepest descent is very sensitive to scaling. The more ill-conditioned the Hessian is, the slower the convergence. Hence, scaling is an important consideration for gradient descent methods such as standard backpropagation.
2) Quasi-Newton and conjugate gradient methods begin with a steepest descent step and are therefore scale sensitive. However, they accumulate second-order information as training proceeds and hence are less scale sensitive than pure gradient descent.
3) Newton-Raphson and Gauss-Newton, if implemented correctly, are theoretically invariant under scale changes as long as none of the scaling is so extreme as to produce underflow or overflow.
4) Levenberg-Marquardt is scale invariant as long as no ridging is required. There are several different ways to implement ridging; some are scale invariant and some are not. Performance under bad scaling will depend on details of the implementation.
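A small standardization sketch (z-scoring each input, with an optional importance-based rescaling); the importance vector is a user-supplied assumption:

```python
import numpy as np

def standardize(X, importance=None):
    """Z-score each input row of X (inputs x samples); optionally rescale so that the
    standard deviation of each input reflects its relative importance."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    Z = (X - mu) / np.where(sd == 0, 1.0, sd)            # equal variability for all inputs
    if importance is not None:
        Z = Z * np.asarray(importance).reshape(-1, 1)    # more important inputs get larger spread
    return Z

X = np.vstack([np.random.uniform(-1, 1, 100), np.random.uniform(0, 10000, 100)])
print(standardize(X).std(axis=1))     # both inputs now have unit standard deviation
```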

67 Output variable standardization issues [NN FAQ]
Target output value ranges should reflect the possible neural network output values. If the target variable does not have known upper and lower bounds, do not use an output activation function with a bounded range.
Standardizing target variables is typically more a convenience for getting good initial weights than a necessity. However, if you have two or more target variables and your error function is scale-sensitive, like the usual least (mean) squares error function, then the variability of each target relative to the others can affect how well the net learns that target. If one target has a range of 0 to 1, while another target has a range of 0 to 10^6, the net will expend most of its effort learning the second target, to the possible exclusion of the first. So it is essential to rescale the targets so that their variability reflects their importance, or at least is not in inverse relation to their importance. If the targets are of equal importance, they should typically be standardized to the same range or the same standard deviation.

68 Generalization error evaluation issues [NN FAQ]
Three basic necessary (not sufficient!) conditions for generalization:
1) The network inputs contain sufficient information pertaining to the target, so that there exists a mathematical function relating correct outputs to inputs with the desired degree of accuracy (neural nets are not clairvoyant!).
2) The function which relates inputs to correct outputs must be, in some sense, smooth, i.e., a small change in the inputs should, most of the time, produce a small change in the outputs. For continuous inputs and targets, smoothness of the function implies continuity and restrictions on the first derivative over most of the input space. Some neural nets can learn discontinuities as long as the function consists of a finite number of continuous pieces. Very nonsmooth functions, such as those produced by pseudo-random number generators and encryption algorithms, cannot be generalized by neural nets. Often a nonlinear transformation of the input space can increase the smoothness of the function and improve generalization.
3) The training set must be a sufficiently large and representative subset of the set of all cases that you want to generalize to. The importance of this condition is related to the fact that there are, loosely speaking, two different types of generalization: interpolation and extrapolation. Interpolation applies to cases that are more or less surrounded by nearby training cases; everything else is extrapolation. In particular, cases that are outside the range of the training data require extrapolation. Cases inside large "holes" in the training data may also effectively require extrapolation. Interpolation can often be done reliably, but extrapolation is notoriously unreliable. Hence it is important to have sufficient training data to avoid the need for extrapolation.

69 Cross-validation and bootstrapping: schemes to evaluate generalization errors (and compare implementations). These schemes are sometimes called permutation tests because they are based on data resampling.
1) Cross-validation (resampling without replacement)
- Recommended for small datasets.
- Can be used to estimate model error or to compare different NN set-ups.
How does this work? Split the data into k (~10) subsets of equal size. Train the NN k times; each time:
- leave one of the subsets out of the training
- test the NN on the omitted subset.
When k = sample size: leave-one-out cross-validation.
The overall accuracy is the mean of all testing set accuracies (see the sketch below).
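A k-fold cross-validation sketch; the train_fn/test_fn callables and the toy majority-class "model" are placeholders, not anything from the slides:

```python
import numpy as np

def k_fold_accuracy(X, y, train_fn, test_fn, k=10, seed=0):
    """k-fold cross-validation: overall accuracy = mean of the k test accuracies."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.hstack([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])       # train with fold i left out
        accs.append(test_fn(model, X[test_idx], y[test_idx]))
    return float(np.mean(accs))

# toy usage with a trivial majority-class "model"
X = np.arange(20).reshape(-1, 1); y = (X.ravel() > 9).astype(int)
train = lambda Xt, yt: int(round(yt.mean()))
test = lambda m, Xs, ys: float(np.mean(ys == m))
print(k_fold_accuracy(X, y, train, test, k=5))
```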

70 2) Jackknife estimation
- Special case of cross-validation.
- Recommended for small datasets.
- Can be used to estimate model error or to compare different NN set-ups.
How does this work? Split the data into subsets of size equal to M-1 (for M data samples available). Train the NN on each set; each time, test the NN on the left-out sample (i.e., each testing set has only one sample). The overall accuracy is the mean of all testing set accuracies.

71 3) Bootstrapping (resampling with replacement)
[Bootstrap Methods and Permutation Tests, Hesterberg et al., W.H. Freeman and Company]

72 Bootstrapping
- Recommended for small datasets.
- Expensive to implement.
- Seems to work better than cross-validation in many cases, but not always; in such cases it is not worth the investment.
- Can be used to estimate model error or to compare different NN set-ups.
How does this work? Select k (from 50 to 2000) subsets of the data, sampled with replacement. Train the NN k times; each time:
- train on one subset
- test on another subset.
The overall accuracy is the mean of all testing set accuracies (see the sketch below).
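A bootstrap sketch; testing each replicate on the out-of-bag points is one common choice for the "other subset" mentioned above, and the toy model is a placeholder:

```python
import numpy as np

def bootstrap_accuracy(X, y, train_fn, test_fn, k=200, seed=0):
    """Resampling with replacement: train on each bootstrap sample, test on the
    held-out (out-of-bag) points, and average the k test accuracies."""
    rng = np.random.default_rng(seed)
    n, accs = len(y), []
    for _ in range(k):
        boot = rng.integers(0, n, size=n)              # sample with replacement
        oob = np.setdiff1d(np.arange(n), boot)         # points not drawn into the sample
        if oob.size == 0:
            continue
        model = train_fn(X[boot], y[boot])
        accs.append(test_fn(model, X[oob], y[oob]))
    return float(np.mean(accs))

X = np.arange(30).reshape(-1, 1); y = (X.ravel() % 2).astype(int)
train = lambda Xt, yt: int(round(yt.mean()))
test = lambda m, Xs, ys: float(np.mean(ys == m))
print(bootstrap_accuracy(X, y, train, test, k=100))
```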

73 Performance Comparison
Which technique is best? Which is more accurate?
Classifier performance assessment allows us to evaluate how well a scheme does and how it compares with other schemes. It is useful when combining decisions/outputs from several classifiers/detectors (in data fusion applications). Set it up as a hypothesis test: given two algorithms A and B,
Hypothesis H_0: for a randomly drawn set of fixed size, algorithms A and B have the same error rate.
Hypothesis H_1: for a randomly drawn set of fixed size, algorithms A and B do not have the same error rate.

74 Need to define:
Type 1 error rate: probability of incorrectly rejecting the true null hypothesis.
Type 2 error rate: probability of incorrectly accepting a false null hypothesis.
Applied to this problem, the Type 1 error rate is the probability of incorrectly detecting a difference between classifier performances when no difference exists.
Significance level α: α represents how selective (i.e., restrictive) the user wants the decision between H_0 and H_1 to be; i.e., for α = 0.05, the user is willing to accept the fact that there is a 5% chance of deciding H_0 is incorrect (or false) when it is in fact correct (or true).

75 Thus:
The larger α is, the more likely the user is to decide the claim (H_0) is incorrect when, in fact, it is correct; i.e., the user becomes more selective, as the user rejects more and more claims even though they are correct.
The smaller α is, the less likely the user is to decide the claim is incorrect when it is, in fact, correct; i.e., the user becomes less selective, as the user will reject fewer claims; however, the user will accept more and more claims which are, in fact, incorrect.

76 McNemar's Test
Define the following quantities:
n_00: number of test cases misclassified by both A and B
n_01: number of test cases misclassified by A but not by B
n_10: number of test cases misclassified by B but not by A
n_11: number of test cases misclassified by neither A nor B
Note: the total number of test cases is n = n_00 + n_01 + n_10 + n_11.
Under H_0, A and B have the same error rate, so n_01 = n_10 theoretically; the expected number of errors made by only one of the two algorithms is (n_01 + n_10)/2.

77 McNemar's test compares the observed number of errors obtained with one of the two algorithms to the expected number. Compute
z = (|n_01 - n_10| - 1)^2 / (n_01 + n_10)
It turns out that z follows a χ^2 distribution with 1 degree of freedom.
H_0 (the hypothesis that algorithms A and B have the same error rate) is rejected with significance level α (i.e., assuming we accept the α% chance of deciding H_0 is incorrect when it is, in fact, correct) when
z > χ^2_{1,1-α}
How to read the χ^2 table: χ^2_{1,0.95} = 3.84.
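A direct transcription of the statistic into a small Python helper (the 3.84 critical value is χ^2 with 1 degree of freedom at the 0.95 level, as above):

```python
def mcnemar_z(n01, n10):
    """McNemar statistic: z = (|n01 - n10| - 1)^2 / (n01 + n10).
    Compare against the chi-square(1) critical value, e.g. 3.84 at alpha = 0.05."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

def same_error_rate(n01, n10, chi2_crit=3.84):
    """True if H0 (same error rate) is NOT rejected at the chosen level."""
    return mcnemar_z(n01, n10) <= chi2_crit

print(mcnemar_z(5, 15), same_error_rate(5, 15))   # made-up counts for illustration
```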

78 Example: assume we have a problem with 9 classes and 60 test samples. The results give
n_00 = 11, n_01 = 1, n_10 = 4, n_11 = 44.
Algorithm A gives 48 correct decisions; algorithm B gives 45 correct decisions.
Are the two algorithms to be considered as having the same performance?
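Applying the statistic with counts reconstructed from this slide's totals (60 samples, 44 misclassified by neither, and 48 vs. 45 correct decisions imply n_01 = 1, n_10 = 4, n_00 = 11; treat these as a reconstruction):

```python
def mcnemar_z(n01, n10):
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

z = mcnemar_z(n01=1, n10=4)    # counts reconstructed from the slide's totals
print(z)                       # (|1 - 4| - 1)^2 / 5 = 0.8, below 3.84,
print(z > 3.84)                # so H0 (same error rate) is not rejected
```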


More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

arxiv: v2 [math.co] 3 Dec 2008

arxiv: v2 [math.co] 3 Dec 2008 arxiv:0805.2814v2 [ath.co] 3 Dec 2008 Connectivity of the Unifor Rando Intersection Graph Sion R. Blacburn and Stefanie Gere Departent of Matheatics Royal Holloway, University of London Egha, Surrey TW20

More information

Pattern Classification using Simplified Neural Networks with Pruning Algorithm

Pattern Classification using Simplified Neural Networks with Pruning Algorithm Pattern Classification using Siplified Neural Networks with Pruning Algorith S. M. Karuzzaan 1 Ahed Ryadh Hasan 2 Abstract: In recent years, any neural network odels have been proposed for pattern classification,

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

An Improved Particle Filter with Applications in Ballistic Target Tracking

An Improved Particle Filter with Applications in Ballistic Target Tracking Sensors & ransducers Vol. 72 Issue 6 June 204 pp. 96-20 Sensors & ransducers 204 by IFSA Publishing S. L. http://www.sensorsportal.co An Iproved Particle Filter with Applications in Ballistic arget racing

More information

An Introduction to Meta-Analysis

An Introduction to Meta-Analysis An Introduction to Meta-Analysis Douglas G. Bonett University of California, Santa Cruz How to cite this work: Bonett, D.G. (2016) An Introduction to Meta-analysis. Retrieved fro http://people.ucsc.edu/~dgbonett/eta.htl

More information

Efficient Filter Banks And Interpolators

Efficient Filter Banks And Interpolators Efficient Filter Banks And Interpolators A. G. DEMPSTER AND N. P. MURPHY Departent of Electronic Systes University of Westinster 115 New Cavendish St, London W1M 8JS United Kingdo Abstract: - Graphical

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Machine Learning: Fisher s Linear Discriminant. Lecture 05

Machine Learning: Fisher s Linear Discriminant. Lecture 05 Machine Learning: Fisher s Linear Discriinant Lecture 05 Razvan C. Bunescu chool of Electrical Engineering and Coputer cience bunescu@ohio.edu Lecture 05 upervised Learning ask learn an (unkon) function

More information

6.2 Grid Search of Chi-Square Space

6.2 Grid Search of Chi-Square Space 6.2 Grid Search of Chi-Square Space exaple data fro a Gaussian-shaped peak are given and plotted initial coefficient guesses are ade the basic grid search strateg is outlined an actual anual search is

More information

ACTIVE VIBRATION CONTROL FOR STRUCTURE HAVING NON- LINEAR BEHAVIOR UNDER EARTHQUAKE EXCITATION

ACTIVE VIBRATION CONTROL FOR STRUCTURE HAVING NON- LINEAR BEHAVIOR UNDER EARTHQUAKE EXCITATION International onference on Earthquae Engineering and Disaster itigation, Jaarta, April 14-15, 8 ATIVE VIBRATION ONTROL FOR TRUTURE HAVING NON- LINEAR BEHAVIOR UNDER EARTHQUAE EXITATION Herlien D. etio

More information

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are,

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are, Page of 8 Suppleentary Materials: A ultiple testing procedure for ulti-diensional pairwise coparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyaal D. Peddada S Notations

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

Chapter 1: Basics of Vibrations for Simple Mechanical Systems

Chapter 1: Basics of Vibrations for Simple Mechanical Systems Chapter 1: Basics of Vibrations for Siple Mechanical Systes Introduction: The fundaentals of Sound and Vibrations are part of the broader field of echanics, with strong connections to classical echanics,

More information

N-Point. DFTs of Two Length-N Real Sequences

N-Point. DFTs of Two Length-N Real Sequences Coputation of the DFT of In ost practical applications, sequences of interest are real In such cases, the syetry properties of the DFT given in Table 5. can be exploited to ake the DFT coputations ore

More information

Testing equality of variances for multiple univariate normal populations

Testing equality of variances for multiple univariate normal populations University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Inforation Sciences 0 esting equality of variances for ultiple univariate

More information

Figure 1: Equivalent electric (RC) circuit of a neurons membrane

Figure 1: Equivalent electric (RC) circuit of a neurons membrane Exercise: Leaky integrate and fire odel of neural spike generation This exercise investigates a siplified odel of how neurons spike in response to current inputs, one of the ost fundaental properties of

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

DETECTION OF NONLINEARITY IN VIBRATIONAL SYSTEMS USING THE SECOND TIME DERIVATIVE OF ABSOLUTE ACCELERATION

DETECTION OF NONLINEARITY IN VIBRATIONAL SYSTEMS USING THE SECOND TIME DERIVATIVE OF ABSOLUTE ACCELERATION DETECTION OF NONLINEARITY IN VIBRATIONAL SYSTEMS USING THE SECOND TIME DERIVATIVE OF ABSOLUTE ACCELERATION Masaki WAKUI 1 and Jun IYAMA and Tsuyoshi KOYAMA 3 ABSTRACT This paper shows a criteria to detect

More information

Kinematics and dynamics, a computational approach

Kinematics and dynamics, a computational approach Kineatics and dynaics, a coputational approach We begin the discussion of nuerical approaches to echanics with the definition for the velocity r r ( t t) r ( t) v( t) li li or r( t t) r( t) v( t) t for

More information

FITTING FUNCTIONS AND THEIR DERIVATIVES WITH NEURAL NETWORKS ARJPOLSON PUKRITTAYAKAMEE

FITTING FUNCTIONS AND THEIR DERIVATIVES WITH NEURAL NETWORKS ARJPOLSON PUKRITTAYAKAMEE FITTING FUNCTIONS AND THEIR DERIVATIVES WITH NEURAL NETWORKS By ARJPOLSON PUKRITTAYAKAMEE Bachelor of Engineering Chulalongkorn University Bangkok, Thailand 997 Master of Sciences Oklahoa State University

More information