Lecture 8: Neural Networks


1 Soft Control (AT 3, RMA). Lecture 8: Neural Networks. The Learning Process

2 Contents of the 8th lecture
1. Introduction to Soft Control: definition and limitations, basics of "intelligent" systems
2. Knowledge representation and knowledge processing (symbolic AI). Application: expert systems
3. Fuzzy systems: dealing with fuzzy knowledge. Application: fuzzy control
4. Connectionist systems: neural networks. Application: identification and neural control
   1. Basics
   2. Learning
5. Genetic algorithms: stochastic optimization. Application: optimization
6. Summary and literature references

3 Contents of the 7th lecture
Learning in neural networks:
- Supervised (monitored) learning: fixed learning task, given input E and output A. Example: backpropagation
- Unsupervised (unmonitored) learning: free learning task, given input E only. Example: competitive learning

4 Unsupervised learning
Learning in neural networks:
- Supervised (monitored) learning: fixed learning task, given input E and output A. Example: backpropagation
- Unsupervised (unmonitored) learning: free learning task, given input E only. Example: competitive learning
Source: Carola Huthmacher

5 Principle of competitive learning in the clustering problem
Objectives of clustering:
- differences between objects within a cluster are minimal
- differences between objects of different clusters are maximal
Learning through competition (competition principle).
Objective: each group activates exactly one output neuron (binary).

6 Architecture of a competitive learning network
[Figure: two-layer architecture. The competitive layer of m neurons produces the binary output y = (1, 0, ..., 0) ∈ B^m; the input layer of n neurons receives the input x = (x_1, ..., x_n) ∈ R^n.]

7 Processes in the competitive layer
Measure of the alignment between the input and the weight vector of neuron j:
S_j = Σ_i w_ij x_i = |w_j| |x| cos(φ)
S_j is large when the angle φ between w_j and x is small.
Winner: the neuron j with S_j > S_k for all k ≠ j.
Output: y_winner = 1, y_loser = 0 ("winner takes all")
[Figure: competitive neurons with weight vectors w_1, w_2, ..., w_n over the input x = (x_1, x_2, ..., x_n) ∈ R^n.]
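To make the competition concrete, here is a minimal sketch of the winner-takes-all forward pass (Python with NumPy; the function name and array layout are my own, not from the slides):

```python
import numpy as np

def competitive_forward(W, x):
    """Winner-takes-all output of a competitive layer.

    W holds one weight row w_j per competitive neuron; x is the input vector.
    """
    s = W @ x                  # S_j = sum_i w_ij * x_i, large when w_j aligns with x
    y = np.zeros(len(s))
    y[np.argmax(s)] = 1.0      # the winner outputs 1, all losers output 0
    return y
```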

8 Unsupervised learning algorithm
Initialization, either:
- random initial weights (normalized weight vectors), or
- vectors taken from the training inputs (normalized) as initial weights
Competitive process, learning:
- the input is a vector x
- recalculate the weights of the winner neuron j:
  w_j(t+1) = w_j(t) + η(t) [x − w_j(t)]
- η(t) is the learning rate (0.01 to 0.3) and is gradually reduced during learning
- the updated weight vector is normalized (standardized) again
Termination: when a termination criterion is fulfilled.
[Figure: the winner's weight vector w_j(t) is shifted by η(t) [x − w_j(t)] toward the input x, giving w_j(t+1).]
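A minimal training-loop sketch of this algorithm under the same assumptions (NumPy; the function name, the linear learning-rate schedule, and the default values are illustrative):

```python
import numpy as np

def competitive_learning(X, m, epochs=50, eta0=0.3, eta_min=0.01, seed=0):
    """Train a competitive layer with m neurons on the rows of X."""
    rng = np.random.default_rng(seed)
    # initialization: normalized vectors drawn from the training inputs
    W = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    for e in range(epochs):
        eta = max(eta_min, eta0 * (1 - e / epochs))  # gradually reduce the learning rate
        for x in X[rng.permutation(len(X))]:
            j = np.argmax(W @ x)                     # competition: find the winner
            W[j] += eta * (x - W[j])                 # move the winner toward the input
            W[j] /= np.linalg.norm(W[j])             # re-normalize the winner's weights
    return W
```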

9 Advantages and disadvantages
Disadvantages:
- a good initialization is difficult to find
- can be unstable
- problem: choosing the number of neurons in the competitive layer
Advantages:
- good clustering
- simple and fast algorithm
- building block for more complex networks

10 Supervised learning
Learning in neural networks:
- Supervised (monitored) learning: fixed learning task, given input E and output A. Example: backpropagation
- Unsupervised (unmonitored) learning: free learning task, given input E only. Example: competitive learning
Source: Dr. Van Bang Le

11 The backpropagation learning algorithm
History: Werbos (1974); Rumelhart, Hinton, Williams (1986).
A very important and well-known supervised learning procedure for feed-forward networks.
Idea: minimize the error function by gradient descent.
Consequences:
- backpropagation is a gradient-based procedure
- learning here is mathematics, not biological motivation!

12 Task and aims of backpropagation learning
Learning task: a set of input/output examples (training set):
L = {(x_1, t_1), ..., (x_k, t_k)}, where
- x_i = input example (input pattern)
- t_i = solution (desired output, target) for input x_i
Learning objective: each task (x, t) from L should be computed by the network with as little error as possible.

13 General approach to backpropagation learning
Subdivide the available data into:
- training data
- validation data
Train until the desired error is reached, then validate on the held-out data.
Problem: finding the optimal end point for training (underfitting vs. overfitting).
[Figure: training and validation error over training iterations; the validation error reaches a minimum at the optimal stopping point and rises again afterwards (overfitting), while stopping too early means underfitting.]
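In practice this stopping problem is handled with early stopping on the validation error. A sketch under assumed names: train_epoch (one pass of weight updates), validation_error (the error E on the validation data), and a net object exposing copy_weights/set_weights are all hypothetical placeholders:

```python
def train_with_early_stopping(net, train_data, val_data, max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_weights, bad_epochs = float("inf"), net.copy_weights(), 0
    for epoch in range(max_epochs):
        train_epoch(net, train_data)              # hypothetical: one pass of BP updates
        err = validation_error(net, val_data)     # hypothetical: E on the validation set
        if err < best_err:
            best_err, best_weights, bad_epochs = err, net.copy_weights(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:            # validation error keeps rising: overfitting
                break
    net.set_weights(best_weights)                 # roll back to the best validation point
    return net
```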

14 The backpropagation learning algorithm
Error measurement: let (x, t) ∈ L and let y be the actual output of the network for input x.
Error for the pair (x, t):
E_x,t = ½ Σ_i (t_i − y_i)² = ½ ||t − y||²
Total error:
E = Σ_{(x,t)∈L} E_x,t = Σ_{(x,t)∈L} ½ Σ_i (t_i − y_i)²
Note: the factor ½ is not essential (||t − y||² is minimal exactly when ½ ||t − y||² is minimal), but it simplifies the formulas later.
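A small sketch of these formulas (NumPy assumed; net.forward stands in for the network's forward pass and is hypothetical):

```python
import numpy as np

def pair_error(t, y):
    """E_x,t = 1/2 * sum_i (t_i - y_i)^2 for one training pair."""
    return 0.5 * np.sum((np.asarray(t) - np.asarray(y)) ** 2)

def total_error(net, L):
    """Total error E: sum of the pair errors over the whole training set L."""
    return sum(pair_error(t, net.forward(x)) for x, t in L)
```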

15 The gradient method
1. Consider the error as a function E(w) of the weights.
2. The weight vector w = (w_11, w_12, ...) corresponds to the point (w, E(w)) on the error surface.
3. Since E is differentiable, the gradient of the error surface can be computed at the point w; descend along the negative gradient by a small fraction to obtain a new weight vector w'.
4. Repeat the procedure at the point w'.
[Figure: error surface E(w) over the weights; successive points w, w' step down toward a minimum.]

16 Gradient
Let f : R^n → R be a real-valued function.
Partial derivative of f with respect to x_i: ∂f/∂x_i
Gradient of f: ∇f = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n)
∇f(x_1, ..., x_n) points in the direction of the steepest ascent of f at the point (x_1, ..., x_n).
Direction of steepest descent: −∇f
Descent in the x_i direction: −∂f/∂x_i
Example: f(x_1, x_2) = ½ x_1² + x_2, so ∇f(x_1, x_2) = (x_1, 1)

17 BP for multilayer networks
We consider multilayer networks without shortcut connections (pure feed-forward networks with connections only between successive layers).
Notation (the input x has been propagated completely through the network):
- output of neuron i: o_i
- net input of neuron j: net_j := Σ_{i: i→j} o_i w_ij
- A := {j : j is an output neuron}, the set of output neurons
- for (x, t) ∈ L, y = (o_j)_{j∈A} is then the output for input x

18 BP for multilayer networks: notation and error function
Error function:
E = Σ_{(x,t)∈L} E_x,t,  E_x,t = ½ Σ_{j∈A} (t_j − o_j)²
o_j = f(net_j), where f is the activation function of the neurons, and net_j = Σ_{i: i→j} o_i w_ij.
If f is differentiable, then E_x,t and hence E are also differentiable, and the gradient descent method can be applied!
Offline version: weight change after computation of the total error E (batch learning)
Online version: weight change immediately after computation of the current error E_x,t

19 Sigmoid as the activation function
Until now the activation function f was the step function, which is not differentiable everywhere.
An everywhere differentiable function:
s_c(x) = 1 / (1 + e^(−cx))
From now on the sigmoid function s(x) = s_1(x) is used as the activation function for all neurons.
It holds: s'(x) = s(x) (1 − s(x))
[Figure: the step function next to the sigmoids s_2 and s_1; a larger c gives a steeper curve.]
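A minimal sketch of the sigmoid and the derivative identity (NumPy assumed):

```python
import numpy as np

def sigmoid(x, c=1.0):
    """s_c(x) = 1 / (1 + exp(-c*x)); c = 1 gives the standard sigmoid s."""
    return 1.0 / (1.0 + np.exp(-c * x))

def sigmoid_prime(x):
    """s'(x) = s(x) * (1 - s(x)): the derivative reuses the forward value."""
    s = sigmoid(x)
    return s * (1.0 - s)
```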

20 The backpropagation learning algorithm: online version (1)
(1) Initialize the weights w_ij with random values.
(2) Choose a pair (x, t) ∈ L.
(3) Compute the output y for input x.
(4) Consider the error E_x,t as a function of the weights: E_x,t = ½ ||t − y||² = E_x,t(w_11, w_12, ...)
(5) Change each w_ij by a small step (learning rate η) in the direction of steepest descent of the error:
    w_ij := w_ij + η (−∂E_x,t/∂w_ij)
(6) If no termination criterion is fulfilled, repeat from (2).

21 The backpropagation learning algorithm: online version (2)
Computing ∂E_x,t/∂w_ij:
For a fixed pair (i, j), E_x,t is considered as a function of w_ij (all other weights are treated as constants in this computation).
- E_x,t depends on the network output y (i.e. on o_j, j ∈ A)
- o_j, j ∈ A, depends on the net input net_j of neuron j
- net_j depends on w_kj and o_k for all connections k → j
- and so on, backward through the layers
So ∂E_x,t/∂w_ij is determined backward through the network: backpropagation!

22 The backpropagation learning algorithm: online version (3)
Computing ∂E_x,t/∂w_ij:
Dependency: E_x,t(w_ij) depends on net_j, and net_j depends on w_ij.
Applying the chain rule:
∂E_x,t/∂w_ij = (∂E_x,t/∂net_j) · (∂net_j/∂w_ij)
where ∂net_j/∂w_ij = o_i, and the error signal is defined as δ_j := −∂E_x,t/∂net_j.

23 The backpropagation learning algorithm: online version (4)
Dependency: E_x,t(net_j) depends on o_j, and o_j depends on net_j.
Applying the chain rule:
∂E_x,t/∂net_j = (∂E_x,t/∂o_j) · (∂o_j/∂net_j), with ∂o_j/∂net_j = f'(net_j)
For the sigmoid activation function s this becomes:
f'(net_j) = s'(net_j) = s(net_j) (1 − s(net_j)) = o_j (1 − o_j)

24 The backpropagation learning algorithm: online version (5)
Computing ∂E_x,t/∂o_j:
Case 1: j is an output neuron.
∂E_x,t/∂o_j = ∂/∂o_j ( ½ Σ_{k∈A} (t_k − o_k)² ) = 2 · ½ (t_j − o_j) · (−1) = −(t_j − o_j)

25 The backpropagation learning algorithm: online version (6)
Computing ∂E_x,t/∂o_j:
Case 2: j is not an output neuron.
Dependency: o_j is fed into all successor neurons k (connections j → k), and E_x,t depends on their net inputs net_k.
Applying the chain rule:
∂E_x,t/∂o_j = Σ_{k: j→k} (∂E_x,t/∂net_k) · (∂net_k/∂o_j) = −Σ_{k: j→k} δ_k w_jk

26 The backpropagation learning algorithm: online version (7)
Summary. Error signal:
δ_j = o_j (1 − o_j) (t_j − o_j)              if j ∈ A,
δ_j = o_j (1 − o_j) Σ_{k: j→k} δ_k w_jk      otherwise.
To compute δ_j, the δ_k must already be known for all connections j → k: backpropagation!
Descent direction for w_ij: −∂E_x,t/∂w_ij = δ_j o_i
Correction for w_ij: Δw_ij = η δ_j o_i, i.e. w_ij := w_ij + η δ_j o_i

27 The backpropagation learning algorithm: online version (8)
Initialize the weights with random values.
Fix a termination criterion ε for the total error E and a maximum number of epochs e_max.

e := 1
repeat
  E := 0
  for all (x, t) ∈ L do
    compute E_x,t = ½ Σ_{j∈A} (t_j − o_j)²
    E := E + E_x,t
    compute the error signals δ_j backward, layer by layer, starting with the output layer
    w_ij := w_ij + η δ_j o_i
  endfor
  e := e + 1
until (E ≤ ε) or (e > e_max)
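Putting the pieces together, here is a compact, runnable sketch of this online version for a network with one hidden layer (NumPy assumed). The layer sizes, the appended constant-1 input acting as a bias, and the XOR demo data are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp_online(L, n_in, n_hid, n_out, eta=0.5, eps=0.01, e_max=10000, seed=0):
    """Online backpropagation for a feed-forward net with one hidden layer."""
    rng = np.random.default_rng(seed)
    # small random initial weights (see slide 30 on symmetry breaking);
    # the extra column takes a constant 1 as bias input (an assumption beyond the slides)
    W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))
    W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))
    for e in range(e_max):
        E = 0.0
        for x, t in L:
            # forward pass: o_j = s(net_j)
            xb = np.append(x, 1.0)
            o1 = sigmoid(W1 @ xb)
            o1b = np.append(o1, 1.0)
            o2 = sigmoid(W2 @ o1b)
            E += 0.5 * np.sum((t - o2) ** 2)             # E_x,t
            # backward pass: error signals delta_j, output layer first
            d2 = o2 * (1 - o2) * (t - o2)                # delta = o(1-o)(t-o)
            d1 = o1 * (1 - o1) * (W2[:, :n_hid].T @ d2)  # delta = o(1-o)*sum_k delta_k w_jk
            # weight correction: w_ij := w_ij + eta * delta_j * o_i
            W2 += eta * np.outer(d2, o1b)
            W1 += eta * np.outer(d1, xb)
        if E <= eps:                                     # termination criterion on E
            break
    return W1, W2

# usage: learn XOR (illustrative training set)
L = [(np.array([0., 0.]), np.array([0.])),
     (np.array([0., 1.]), np.array([1.])),
     (np.array([1., 0.]), np.array([1.])),
     (np.array([1., 1.]), np.array([0.]))]
W1, W2 = train_bp_online(L, n_in=2, n_hid=3, n_out=1)
```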

28 The backpropagation learning algorithm: offline version
Offline means that the error is minimized over all input data together.
In this mode the weights are modified only after the presentation of all tasks (x, t) ∈ L:
w_ij := w_ij + Δw_ij, where
Δw_ij = −η ∂E/∂w_ij = −η Σ_{(x,t)∈L} ∂E_x,t/∂w_ij = Σ_{(x,t)∈L} η δ_j(x) o_i(x)
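Relative to the online sketch above, only the inner loop changes: the weight changes are accumulated over all of L and applied once per epoch. A snippet reusing the names (W1, W2, L, sigmoid) from that sketch:

```python
# offline (batch) variant of one epoch: accumulate, then apply once
eta = 0.5
dW1 = np.zeros_like(W1)
dW2 = np.zeros_like(W2)
for x, t in L:
    xb = np.append(x, 1.0)
    o1 = sigmoid(W1 @ xb)
    o1b = np.append(o1, 1.0)
    o2 = sigmoid(W2 @ o1b)
    d2 = o2 * (1 - o2) * (t - o2)
    d1 = o1 * (1 - o1) * (W2[:, :-1].T @ d2)
    dW2 += eta * np.outer(d2, o1b)   # sum of eta * delta_j(x) * o_i(x) over all of L
    dW1 += eta * np.outer(d1, xb)
W2 += dW2                            # one weight change after presenting all tasks
W1 += dW1
```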

29 Online vs. offline
With offline learning (batch learning), each correction step optimizes the total error function (over all data): the descent follows the true gradient direction of the total error function.
With online learning, the weights are adapted immediately after the presentation of each example. The direction of adjustment does not, in general, agree with the gradient direction; but if the examples are selected in random order, the gradient is followed on average.
The online version is necessary if not all pairs (x, t) are known at the start of learning (adaptation to new data, adaptive systems), or if the offline version is too costly.

30 Problems of backpropagation: symmetry breaking
In fully connected layered feed-forward networks, the weights must not be initialized with equal values; otherwise backpropagation always assigns the same values to the weights between two layers.
Initialization w_ij = a for all i, j. After the forward phase: o_4 = o_5 = o_6, hence δ_4 = δ_5 = δ_6 and
w_14 = w_15 = w_16, w_24 = w_25 = w_26, w_34 = w_35 = w_36, w_47 = w_57 = w_67, w_48 = w_58 = w_68.
This situation recurs after every forward phase: such an initialization creates a symmetry that can never be broken!
Solution: small random values for the initial weights. The net input net_i is then almost zero for all neurons i, s'(net_i) is large, and the network adapts quickly.

31 Problems of backpropagation: local minima
As with all gradient methods, backpropagation may get stuck in a local minimum of the error surface.
[Figure: error curve E over a weight w with several local minima at w_0, w_1, w_2, w_3.]
There is no guarantee that a global minimum (optimal weights) will be found. With a growing number of connections (the dimension of the weight space is large), the error surface becomes more jagged, and landing in a local minimum becomes more likely.
Ways out:
- do not choose the learning rate too small
- try several different initializations of the weights
- experience shows that a minimum found for the concrete application is often an acceptable solution

32 Problems of backpropagation: leaving (abandoning) good minima
The size of the weight change depends on the magnitude of the gradient. A good minimum that lies in a steep valley can be skipped because the gradient magnitude there is so large; the procedure may then land in a worse minimum in the vicinity.
[Figure: error curve E over a weight w; a large step jumps out of a narrow, deep valley into a shallower, worse one.]
Ways out:
- do not choose the learning rate too large
- try several different initializations of the weights
- experience shows that a minimum found for the concrete application is often an acceptable solution

33 Problems of backpropagation: flat plateaus
On very flat parts of the error surface the gradient is small and the weights change only marginally. This requires especially many iteration steps (long training time); in the extreme case the weights do not move at all.
[Figure: error curve E over a weight w with a long flat plateau.]

34 Problems of backpropagation: oscillation
In steep ravines (gorges) the procedure can oscillate: at the edges of a steep ravine, the weight change is thrown from one side to the other, because the gradient there has the same magnitude but the opposite sign.
[Figure: error curve E over a weight w; the iterates bounce between the two walls of a ravine.]

35 Modifications of backpropagation
There are many modifications that address the problems above. All are based on heuristics: in many cases they considerably accelerate convergence, but there are also cases where the assumptions behind a heuristic do not hold and the result is worse than with the classical backpropagation procedure.
Some popular modifications (a sketch of the first one follows below):
- Momentum term (related to conjugate gradient descent): addresses the problems on flat plateaus and in steep ravines. Idea: effectively increase the learning rate on flat plateaus and reduce it in narrow valleys.
- Weight decay: large weights are neurobiologically implausible and produce steep, rugged error surfaces. The error function is modified so that, while minimizing the error, the weights are minimized at the same time (weight decay).
- Quickprop. Heuristic: a valley of the error surface (around a local minimum) can be approximately described by an upward-open parabola. Idea: jump in one step toward the vertex of the parabola (the expected minimum of the error function).
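As an illustration of the first modification, here is a momentum variant of the online weight update, again reusing the names (W1, W2, L, sigmoid) from the online sketch; the momentum factor alpha = 0.9 is a typical, assumed value:

```python
# momentum term: delta_w(t) = eta * delta_j * o_i + alpha * delta_w(t-1)
eta, alpha = 0.5, 0.9
vW1 = np.zeros_like(W1)              # previous weight changes, one buffer per layer
vW2 = np.zeros_like(W2)
for x, t in L:
    xb = np.append(x, 1.0)
    o1 = sigmoid(W1 @ xb)
    o1b = np.append(o1, 1.0)
    o2 = sigmoid(W2 @ o1b)
    d2 = o2 * (1 - o2) * (t - o2)
    d1 = o1 * (1 - o1) * (W2[:, :-1].T @ d2)
    vW2 = alpha * vW2 + eta * np.outer(d2, o1b)  # the old step carries over: bigger steps
    vW1 = alpha * vW1 + eta * np.outer(d1, xb)   # on plateaus, damping in ravines
    W2 += vW2
    W1 += vW1
```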

36 Summary and lessons from the 8th lecture
- know the basic forms of learning in neural networks: supervised and unsupervised
- know the idea of learning without a teacher, based on competitive learning
- know the idea of learning by minimizing the error (with a "teacher"); example: backpropagation
- know the backpropagation procedure and its possible problems
