Incremental Stochastic Gradient Descent


Nigel Bond
6 years ago

1 Incremental Stochastic Gradient Descent
Batch mode: gradient descent over the entire data D
  $w \leftarrow w - \eta \nabla E_D[w]$, where $E_D[w] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
Incremental mode: gradient descent over individual training examples d
  $w \leftarrow w - \eta \nabla E_d[w]$, where $E_d[w] = \frac{1}{2} (t_d - o_d)^2$
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.
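The two update modes above can be compared on a small example. The following is a minimal sketch, assuming a linear unit o = w · x with a constant first input for the bias; the toy data, the learning rate, and the function names are illustrative assumptions, not part of the slides.

```python
def predict(w, x):
    """Linear unit output o = w . x (x[0] is a constant 1 acting as the bias input)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_epoch(w, data, eta):
    """Batch mode: sum the gradient of E_D over all examples, then update once."""
    grad = [0.0] * len(w)
    for x, t in data:
        err = t - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] += -err * xi  # dE_D/dw_i = -sum_d (t_d - o_d) * x_{i,d}
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def incremental_epoch(w, data, eta):
    """Incremental mode: update the weights after every single example."""
    w = list(w)
    for x, t in data:
        err = t - predict(w, x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical noise-free data generated by t = 1 + 2*x
data = [((1.0, x), 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]

w_batch = [0.0, 0.0]
w_inc = [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_epoch(w_batch, data, eta=0.01)
    w_inc = incremental_epoch(w_inc, data, eta=0.01)
```

With a small η both weight vectors should end up close to the generating weights (1, 2); with a larger η the incremental trajectory drifts further from the batch one.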

2 Comparison: Perceptron and Gradient Descent Rule
Perceptron learning rule is guaranteed to succeed (perfectly classifying the training examples) if
  - the training examples are linearly separable
  - the learning rate η is sufficiently small
Linear unit training rules using gradient descent are guaranteed to converge to the hypothesis with minimum squared error
  - given a sufficiently small learning rate η
  - even when the training data contains noise
  - even when the training data is not separable by H
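As an illustration of the perceptron learning rule on linearly separable data, here is a minimal sketch that learns Boolean AND. The update w_i ← w_i + η (t − o) x_i is the standard perceptron training rule; the data set, the function names, and the choice η = 1 (used only so the arithmetic in this toy case stays exact) are assumptions made for the example.

```python
def perceptron_output(w, x):
    """Thresholded unit: 1 if w . x > 0, else 0 (x[0] is a constant 1 for the bias)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Repeat passes over the data until every example is classified correctly."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in data:
            o = perceptron_output(w, x)
            if o != t:
                mistakes += 1
                # Perceptron rule: move the weights toward the target output
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break  # all training examples are classified perfectly
    return w

# Boolean AND is linearly separable; each input is (bias, x1, x2)
and_data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w = train_perceptron(and_data)
```

On data that is not linearly separable (XOR, for instance) this loop would simply run until `max_epochs`, which is where the gradient descent rule for linear units becomes the better choice.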

4 Restaurant Problem: Will I wait for a table?
  - Alternate: whether there is a suitable alternative restaurant nearby
  - Bar: whether the restaurant has a comfortable bar area to wait in
  - Fri/Sat: true on Fridays and Saturdays
  - Hungry: whether we are hungry
  - Patrons: how many people are in the restaurant (None, Some, or Full)
  - Price: the restaurant's price range ($, $$, $$$)
  - Raining: whether it is raining outside
  - Reservation: whether we made a reservation
  - Type: the kind of restaurant (French, Italian, Thai, or Burger)
  - WaitEstimate: the wait estimated by the host (0-10 minutes, 10-30, 30-60, > 60)

5 Multilayer Network

8 A compromise function
Perceptron: $output = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$
Linear: $output = net = \sum_{i=0}^{n} w_i x_i$
Sigmoid (Logistic): $output = \sigma(net) = \frac{1}{1 + e^{-net}}$
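The three unit types above differ only in what they do with the net input. A minimal sketch follows; the function names are chosen for illustration.

```python
import math

def net_input(w, x):
    """net = sum_{i=0}^{n} w_i * x_i, with x[0] = 1 as the bias input."""
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_unit(w, x):
    """Hard threshold: not differentiable, so unsuitable for gradient descent."""
    return 1 if net_input(w, x) > 0 else 0

def linear_unit(w, x):
    """Differentiable, but stacked linear units still compute a linear function."""
    return net_input(w, x)

def sigmoid_unit(w, x):
    """Smooth, differentiable squashing of net into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net_input(w, x)))
```

The sigmoid is the compromise: nonlinear like the threshold, yet differentiable like the linear unit.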

9 Learning in Multilayer Networks
Same method as for Perceptrons
  - Example inputs are presented to the network
  - If the network computes an output that matches the desired output, nothing is done
  - If there is an error, the weights are adjusted to balance the error

11 BackPropagation Learning

12 Alternative Error Measures

13 Neural Network Model
[Figure: feed-forward network. Inputs (independent variables): Age = 34, Gender = 2, Stage. Weights connect the inputs to a hidden layer of summation (Σ) units, and further weights connect the hidden layer to the output (dependent variable). Output (prediction): 0.6, the probability of being alive.]

14-16 Getting an answer from a NN
[Figure: the same network repeated over three slides, showing the inputs, weights, hidden layer, and the output of 0.6.]

17 Minimizing the Error
[Figure: error surface plotted against a weight w. Labels: initial error at w_initial, negative derivative, positive change, final error at w_trained, local minimum.]

18 Representational Power (FFNN)
  - Boolean functions: 2 layers of units
  - Continuous functions: 2 layers of units (sigmoid then linear)
  - Arbitrary functions: 3 layers of units (sigmoids then linear)
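To make the Boolean case concrete, the sketch below hand-wires XOR, which no single perceptron can represent, using two layers of threshold units. The particular weights and thresholds are one illustrative choice among many, not values from the slides.

```python
def step(net):
    """Threshold unit: fires when its net input is positive."""
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    """Two layers of units: hidden OR and AND units, then one output unit."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR but not AND
```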

19 Hypothesis Space and Inductive Bias

20 Hidden Layer Representations

21 Hidden Layer Representations

25 Overfitting

26 Neural Nets for Face Recognition

27 Learning Hidden Unit Weights

28 ALVINN
Drives 70 mph on a public highway
  - Camera image: 30x32 pixels as inputs
  - 4 hidden units
  - 30 outputs for steering
[Figure: the 30x32 weights into one of the four hidden units]

29 Handwritten Character Recognition
Le Cun et al. (1989) implemented a neural network to read zip codes on hand-addressed envelopes, for sorting purposes
To identify the digits, it uses a 16x16 array of pixels as input, 3 hidden layers, and a distributed output encoding with 10 output units for the digits
  - 256 input nodes (one per pixel of the 16x16 array)
  - 10 output units (1 for the likelihood of each digit)

32 Interpreting Satellite Imagery for Automated Weather Forecasting

33 Recurrent Neural Nets

34 Neural Network Language Models
Statistical language modeling: predict the probability of the next word in a sequence
  "I was headed to Madrid, ____"
  P(____ = "Spain") = 0.5, P(____ = "but") = 0.2, etc.
Used in speech recognition, machine translation, and (recently) information extraction

35 Summary
  - Perceptrons (one-layer networks) are insufficiently expressive
  - Multi-layer networks are sufficiently expressive and can be trained by error back-propagation
  - Many applications, including speech, driving, handwritten character recognition, fraud detection, etc.

37 Local Search algorithms
In many optimization problems, the path to the goal is irrelevant; the goal state itself is the solution
In such cases, we can use local search algorithms: keep a single "current" state and try to improve it
  - Hill-climbing
  - Simulated annealing
  - Local Beam Search
  - Stochastic Beam Search
  - Genetic Algorithms

38 Genetic algorithms
  - A successor state is generated by combining two parent states
  - Start with k randomly generated states (the population)
  - A state is represented as a string over a finite alphabet (often a string of 0s and 1s)
  - Evaluation function (fitness function): higher values for better states

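39 Genetic algorithms
Fitness function: number of non-attacking pairs of queens (min = 0, max = 8 × 7/2 = 28)
Selection probability is proportional to fitness: 24/(total population fitness) = 31%, 23/(total population fitness) = 29%, etc.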
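The fitness function above can be written directly. This is a minimal sketch, assuming a state is a sequence of eight row numbers, one queen per column; that encoding and the function name are assumptions for the example.

```python
from itertools import combinations

def non_attacking_pairs(state):
    """Count pairs of queens that do not attack each other.

    state[c] is the row of the queen in column c, so two queens can only
    clash by sharing a row or a diagonal.
    """
    n = len(state)
    attacking = 0
    for c1, c2 in combinations(range(n), 2):
        same_row = state[c1] == state[c2]
        same_diagonal = abs(state[c1] - state[c2]) == c2 - c1
        if same_row or same_diagonal:
            attacking += 1
    return n * (n - 1) // 2 - attacking  # 28 pairs in total when n = 8
```

A solution to the 8-queens problem scores the maximum of 28, and a board with all queens in one row scores 0.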

40 Genetic algorithms

41 Genetic Algorithms Continued
1. Choose initial population
2. Evaluate fitness of each individual in the population
3. Repeat the following until we hit a terminating condition:
   1. Select the best-ranking individuals to reproduce
   2. Breed using crossover and mutation
   3. Evaluate the fitnesses of the offspring
   4. Replace the worst-ranked part of the population with the offspring
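The steps above can be sketched as a short loop. Everything specific here is an assumption made for illustration: the bit-string encoding, the `one_max` fitness (count of 1 bits), single-point crossover, per-bit mutation, and all parameter values.

```python
import random

def one_max(bits):
    """Hypothetical fitness function: the number of 1 bits in the string."""
    return sum(bits)

def crossover(a, b, rng):
    """Single-point crossover: prefix of one parent, suffix of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def genetic_algorithm(fitness, length=20, pop_size=20, n_parents=10,
                      generations=50, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # 1. Choose initial population
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    # 3. Repeat until the terminating condition (here: a fixed generation count)
    for _ in range(generations):
        # 2 / 3.3: evaluate fitness and rank the population, best first
        ranked = sorted(population, key=fitness, reverse=True)
        # 3.1 Select the best-ranking individuals to reproduce
        parents = ranked[:n_parents]
        # 3.2 Breed using crossover and mutation
        offspring = [mutate(crossover(*rng.sample(parents, 2), rng),
                            mutation_rate, rng)
                     for _ in range(pop_size - n_parents)]
        # 3.4 Replace the worst-ranked part of the population with the offspring
        population = parents + offspring
    return max(population, key=fitness)

best = genetic_algorithm(one_max)
```

Because the best-ranked individuals survive unchanged into the next generation, the best fitness in the population never decreases in this variant.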

42 How computers play games

43 Minimax: An Optimal Strategy

44 Minimax Algorithm: An Optimal Strategy
Choose the best move based on the MINIMAX-VALUE of the resulting states
MINIMAX-VALUE(n) =
  Utility(n), if n is a terminal state
  the maximum MINIMAX-VALUE of all possible successors of n, if it is MAX's turn
  the minimum MINIMAX-VALUE of all possible successors of n, if it is MIN's turn
