Machine Learning. Neural Networks. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012, Lecture 7, September 11, 2012. Based on slides from Eric Xing, CMU

1 Machine Learning, CSE6740/CS7641/ISYE6740, Fall 2012. Neural Networks. Le Song. Lecture 7, September 11, 2012. Based on slides from Eric Xing, CMU. Reading: Chap. 5 CB

2 Learning highly non-linear functions. $f: X \to Y$, where $f$ might be a non-linear function, $X$ is a (vector of) continuous and/or discrete variables, and $Y$ is a (vector of) continuous and/or discrete variables. Examples: the XOR gate, speech recognition.

3 Perceptron and Neural Nets. From biological neuron to artificial neuron (perceptron). [Figure: a biological neuron (dendrites, soma, axon, synapses) next to an artificial neuron in which inputs $x_1, x_2$ with weights $w_1, w_2$ feed a linear combiner, followed by a hard limiter with a threshold, producing output $Y$.] Activation function: $X = \sum_{i=1}^{n} x_i w_i$, with $Y = +1$ if $X \ge \theta$ and $Y = -1$ if $X < \theta$. Artificial neuron networks: supervised learning, gradient descent. [Figure: input signals pass through an input layer, a middle layer, and an output layer to produce output signals.]
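A minimal sketch of the artificial neuron described on this slide, assuming the hard limiter returns +1 when the weighted sum reaches the threshold and -1 otherwise. The function name `hard_limiter_neuron`, the weights, and the threshold value are illustrative choices, not taken from the lecture.

```python
def hard_limiter_neuron(inputs, weights, threshold):
    """Linear combiner followed by a hard limiter (sign-type activation)."""
    # Linear combiner: weighted sum of the inputs.
    x = sum(x_i * w_i for x_i, w_i in zip(inputs, weights))
    # Hard limiter: +1 at or above the threshold, -1 below it.
    return 1 if x >= threshold else -1


# Hypothetical weights and threshold, chosen only for illustration.
print(hard_limiter_neuron([1.0, 0.0], [0.3, 0.9], threshold=0.5))  # weighted sum is 0.3
print(hard_limiter_neuron([1.0, 1.0], [0.3, 0.9], threshold=0.5))  # weighted sum is 1.2
```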

4 Connectionist Models. Consider humans: neuron switching time ~ [value missing] second; number of neurons ~ [value missing]; connections per neuron ~ [value missing]; scene recognition time ~ 0.1 second. 100 inference steps doesn't seem like enough, which suggests much parallel computation. Properties of artificial neural nets (ANN): many neuron-like threshold switching units; many weighted interconnections among units; highly parallel, distributed processes.

5 Abdominal Pain Perceptron. [Figure: input units Male, Age, Temp, WBC, Pain Intensity, and Pain Duration connect through adjustable weights to output units Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, and Pancreatitis.]

6 Myocardial Infarction Network. [Figure: inputs Pain Duration, Pain Intensity, ECG: ST Elevation, Smoker, Age, and Male feed a network whose output, 0.8, is the probability of MI (myocardial infarction).]

7 The "Driver" Network

8 Perceptrons. [Figure: input units Cough and Headache connect through weights to output units No disease, Pneumonia, Flu, and Meningitis.] $\Delta$ rule: change weights to decrease the error, where error = what we got - what we wanted.

9 Perceptrons. Input to unit $i$: $a_i$ = measured value of variable $i$. Input to unit $j$: $a_j = \sum_i w_{ij} a_i$. Output of unit $j$: $o_j = 1 / (1 + e^{-(a_j + \theta_j)})$. [Figure: input units $i$ connected to output units $j$.]
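The two formulas on this slide can be sketched as a small function. This assumes the term added to $a_j$ inside the exponent is a per-unit offset $\theta_j$; the name `unit_output` and the example numbers are illustrative only.

```python
import math


def unit_output(activations, weights, theta):
    """Output o_j of a sigmoid unit j, given input activations a_i and weights w_ij."""
    # Input to unit j: a_j = sum_i w_ij * a_i
    a_j = sum(w * a for w, a in zip(weights, activations))
    # Output of unit j: o_j = 1 / (1 + exp(-(a_j + theta_j)))
    return 1.0 / (1.0 + math.exp(-(a_j + theta)))


# Hypothetical measured values, weights, and offset.
print(unit_output([0.2, 0.7], [1.5, -0.4], theta=0.1))
```

The output always lies strictly between 0 and 1, which is what lets it be read as a probability in the logistic regression slides.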

10 Jargon Pseudo-Correspondence. Independent variable = input variable. Dependent variable = output variable. Coefficients = weights. Estimates = targets. Logistic Regression Model (the sigmoid unit). [Figure: inputs Age = 34, Gender = 1, and Stage (independent variables $x_1, x_2, x_3$) are weighted by coefficients $a, b, c$ and summed ($\Sigma$) to give the output 0.6, the probability of being alive (dependent variable $p$, the prediction).]

11 The perceptron learning algorithm. Recall the nice property of the sigmoid function. Consider a regression problem $f: X \to Y$, for scalar $Y$. Let's maximize the conditional data likelihood.
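The slide does not spell out the property. Assuming it is the usual derivative identity $\sigma'(a) = \sigma(a)(1 - \sigma(a))$, the sketch below checks that identity against a central finite difference. The helper names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_derivative(a):
    """Closed form: sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


def numerical_derivative(f, a, h=1e-6):
    """Central finite difference, used here only as an independent check."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


for a in (-2.0, 0.0, 1.5):
    print(a, sigmoid_derivative(a), numerical_derivative(sigmoid, a))
```

Because the derivative is expressed through the unit's own output, gradient computations can reuse values from the forward pass.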

12 Gradient Descent. $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$.

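13 The perceptron learning rules. $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$.
Incremental mode: do until convergence: for each training example $d$ in $D$: 1. compute the gradient $\nabla E_d[w]$; 2. update $w \leftarrow w - \eta \nabla E_d[w]$, where $E_d[w] = \frac{1}{2}(t_d - o_d)^2$ and $\eta$ is the step size.
Batch mode: do until convergence: 1. compute the gradient $\nabla E_D[w]$; 2. update $w \leftarrow w - \eta \nabla E_D[w]$, where $E_D[w] = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$.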
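The two modes can be compared side by side in a short sketch. It assumes a single sigmoid unit, the squared error $E_d[w] = \frac{1}{2}(t_d - o_d)^2$, and a fixed step size `eta`; the toy OR data set, the constant bias input, and all function names are illustrative assumptions rather than the lecture's own example.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def output(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def grad_example(w, x, t):
    """Gradient of E_d[w] = 0.5 * (t - o)^2 for one sigmoid unit."""
    o = output(w, x)
    return [-(t - o) * o * (1.0 - o) * xi for xi in x]


def total_error(w, data):
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)


def train_incremental(w, data, eta, epochs):
    w = list(w)
    for _ in range(epochs):
        for x, t in data:  # update the weights after every example
            g = grad_example(w, x, t)
            w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


def train_batch(w, data, eta, epochs):
    w = list(w)
    for _ in range(epochs):
        g = [0.0] * len(w)
        for x, t in data:  # accumulate the gradient over all of D first
            g = [gi + gdi for gi, gdi in zip(g, grad_example(w, x, t))]
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Toy data (illustrative): logical OR, with a constant 1.0 as a bias input.
DATA = [([1.0, 0.0, 0.0], 0.0), ([1.0, 0.0, 1.0], 1.0),
        ([1.0, 1.0, 0.0], 1.0), ([1.0, 1.0, 1.0], 1.0)]
W0 = [0.0, 0.0, 0.0]
w_inc = train_incremental(W0, DATA, eta=0.5, epochs=2000)
w_bat = train_batch(W0, DATA, eta=0.5, epochs=2000)
print(total_error(W0, DATA), total_error(w_inc, DATA), total_error(w_bat, DATA))
```

The only difference between the two functions is where the weight update sits relative to the loop over examples.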

14 MLE vs MAP Maximum conditional likelihood estimate Maximum a posteriori estimate
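The two estimates can be written out as follows. This is a sketch under stated assumptions: outputs follow $y_d = f(x_d; W) + \epsilon_d$ with zero-mean Gaussian noise of variance $\sigma^2$, and for MAP the weights have independent zero-mean Gaussian priors of variance $\sigma_w^2$. The symbols $f$, $\sigma$, $\sigma_w$, and $c$ are introduced here for illustration.

```latex
\begin{align*}
W_{\mathrm{MLE}}
  &= \arg\max_W \; \ln \prod_d P(y_d \mid x_d, W)
   = \arg\min_W \; \sum_d \bigl(y_d - f(x_d; W)\bigr)^2 \\
W_{\mathrm{MAP}}
  &= \arg\max_W \; \ln \Bigl[ P(W) \prod_d P(y_d \mid x_d, W) \Bigr]
   = \arg\min_W \; \Bigl[ c \sum_i w_i^2 + \sum_d \bigl(y_d - f(x_d; W)\bigr)^2 \Bigr],
  \qquad c = \frac{\sigma^2}{\sigma_w^2}
\end{align*}
```

Under these assumptions the MAP objective is the MLE objective plus a penalty on large weights, and a tighter prior (smaller $\sigma_w^2$) means a stronger penalty.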

15 What decision surface does a perceptron define? Truth table for NAND, as x y → Z (color): 0 0 → 1; 0 1 → 1; 1 0 → 1; 1 1 → 0. A single unit with inputs $x_1, x_2$, weights $w_1, w_2$, and threshold $\theta = 0.5$ computes $f(x_1 w_1 + x_2 w_2) = y$, so the weights must satisfy $f(0 w_1 + 0 w_2) = 1$, $f(0 w_1 + 1 w_2) = 1$, $f(1 w_1 + 0 w_2) = 1$, $f(1 w_1 + 1 w_2) = 0$, where $f(a) = 1$ for $a > \theta$ and $f(a) = 0$ for $a \le \theta$. [Table: some possible values for $w_1$ and $w_2$; the values are not preserved.]

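16 What decision surface does a perceptron define? [Truth table with columns x, y, Z (color); figure label: NAND.] A single unit with inputs $x_1, x_2$, weights $w_1, w_2$, and threshold $\theta = 0.5$ computes $f(x_1 w_1 + x_2 w_2) = y$, so the weights must satisfy $f(0 w_1 + 0 w_2) = 0$, $f(0 w_1 + 1 w_2) = 1$, $f(1 w_1 + 0 w_2) = 1$, $f(1 w_1 + 1 w_2) = 0$, where $f(a) = 1$ for $a > \theta$ and $f(a) = 0$ for $a \le \theta$. [Table: some possible values for $w_1$ and $w_2$.]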
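The four constraints on this slide are those of XOR, and they cannot all hold for a single unit. Assuming $f(a) = 1$ for $a > \theta$ and 0 otherwise with $\theta = 0.5$, the second and third constraints need $w_2 > 0.5$ and $w_1 > 0.5$, while the fourth needs $w_1 + w_2 \le 0.5$. The sketch below confirms this with a grid search over candidate weights. The grid range and the AND table used for contrast are illustrative choices.

```python
def f(a, theta=0.5):
    """Threshold activation: 1 if a > theta, else 0 (theta = 0.5 is assumed)."""
    return 1 if a > theta else 0


XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}  # contrast case, not from the slide


def matches(w1, w2, table):
    """True if a single unit with weights (w1, w2) reproduces the whole table."""
    return all(f(x1 * w1 + x2 * w2) == z for (x1, x2), z in table.items())


# Candidate weights from -3.0 to 3.0 in steps of 0.1.
grid = [i / 10.0 for i in range(-30, 31)]
xor_solutions = [(w1, w2) for w1 in grid for w2 in grid if matches(w1, w2, XOR)]
and_solutions = [(w1, w2) for w1 in grid for w2 in grid if matches(w1, w2, AND)]

print(len(xor_solutions))      # prints 0: no single unit fits XOR
print(len(and_solutions) > 0)  # prints True: AND is linearly separable
```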

17 What decision surface does a perceptron define? Truth table, as x y → Z (color): 0 0 → 0; 0 1 → 1; 1 0 → 1; 1 1 → 0 (the figure label reads NAND, but these values are XOR). A network with weights $w_1, \dots, w_6$ and $\theta = 0.5$ for all units, where $f(a) = 1$ for $a > \theta$ and $f(a) = 0$ for $a \le \theta$. A possible set of values for $(w_1, w_2, w_3, w_4, w_5, w_6)$: $(0.6, -0.6, -0.7, 0.8, 1, 1)$.
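The listed weights can be checked directly. The sketch assumes a wiring the extracted text does not confirm: $w_1$ and $w_3$ feed the first hidden unit, $w_2$ and $w_4$ feed the second, and $w_5, w_6$ connect the hidden units to the output, with $\theta = 0.5$ for every unit. The name `two_layer_net` is illustrative.

```python
def f(a, theta=0.5):
    """Threshold activation shared by all units (theta = 0.5)."""
    return 1 if a > theta else 0


def two_layer_net(x, y, w=(0.6, -0.6, -0.7, 0.8, 1.0, 1.0)):
    w1, w2, w3, w4, w5, w6 = w
    # Assumed wiring: w1, w3 feed hidden unit 1; w2, w4 feed hidden unit 2.
    h1 = f(w1 * x + w3 * y)
    h2 = f(w2 * x + w4 * y)
    # Output unit combines the two hidden units through w5 and w6.
    return f(w5 * h1 + w6 * h2)


for x in (0, 1):
    for y in (0, 1):
        print(x, y, two_layer_net(x, y))
# prints:
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0
```

Each hidden unit fires for exactly one of the two mixed inputs, and the output unit fires when either hidden unit does.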

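18 Non Linear Separation. [Figure: the Cough/Headache plane divided into four regions: Meningitis (no cough, headache), Flu (cough, headache), No disease (no cough, no headache), and Pneumonia (cough, no headache), with the regions grouped into Treatment and No treatment.]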

19 Neural Network Model. [Figure: inputs Age = 34, Gender = 2, and Stage (independent variables) connect through weights to a hidden layer of summation ($\Sigma$) units, which connects through weights to the output 0.6, the probability of being alive (dependent variable, the prediction).]

20 Combined logistic models. [Figure: the same network, with inputs Age = 34, Gender = 2, and Stage; weights; hidden layer; weights; output 0.6, the probability of being alive.]

21 [Figure: the same network, with inputs Age = 34, Gender = 2, and Stage; weights; hidden layer; weights; output 0.6, the probability of being alive.]

22 [Figure: the same network, with inputs Age = 34, Gender = 1, and Stage; weights; hidden layer; weights; output 0.6, the probability of being alive.]

23 Not really, no target for hidden units... [Figure: the same network, with inputs Age = 34, Gender = 2, and Stage; weights; hidden layer of summation ($\Sigma$) units; weights; output 0.6, the probability of being alive.]

24 Perceptrons. [Figure: input units Cough and Headache connect through weights to output units No disease, Pneumonia, Flu, and Meningitis.] $\Delta$ rule: change weights to decrease the error, where error = what we got - what we wanted.

25 Hidden Units and Backpropagation

26 Backpropagation Algorithm. $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$. Initialize all weights to small random numbers. Until convergence, do: 1. Input the training example to the network and compute the network outputs. 2. For each output unit $k$, compute its error term. 3. For each hidden unit $h$, compute its error term. 4. Update each network weight $w_{i,j}$.
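The steps above can be sketched for a network with one hidden layer. The per-step formulas are not preserved in this transcript, so the sketch assumes the standard form for sigmoid units with squared error: $\delta_k = o_k(1 - o_k)(t_k - o_k)$, $\delta_h = o_h(1 - o_h)\sum_k w_{h,k}\delta_k$, and $w_{i,j} \leftarrow w_{i,j} + \eta\,\delta_j x_{i,j}$. The function names, network size, seed, learning rate, and XOR data are illustrative assumptions.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_hid, w_out):
    """Return (inputs with bias, hidden activations with bias, outputs)."""
    xb = [1.0] + list(x)  # index 0 is a constant bias input
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return xb, hb, o


def backprop_step(x, t, w_hid, w_out, eta):
    """One incremental update for E = 0.5 * sum_k (t_k - o_k)^2 (weights change in place)."""
    xb, hb, o = forward(x, w_hid, w_out)
    # Output error terms: delta_k = o_k (1 - o_k) (t_k - o_k)
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden error terms: delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k
    d_hid = []
    for h_idx in range(len(w_hid)):
        oh = hb[h_idx + 1]  # skip the bias entry
        back = sum(w_out[k][h_idx + 1] * d_out[k] for k in range(len(w_out)))
        d_hid.append(oh * (1 - oh) * back)
    # Weight updates: w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * d_out[k] * hb[i]
    for h_idx, row in enumerate(w_hid):
        for i in range(len(row)):
            row[i] += eta * d_hid[h_idx] * xb[i]


def error(data, w_hid, w_out):
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in data
                     for tk, ok in zip(t, forward(x, w_hid, w_out)[2]))


random.seed(0)
n_in, n_hid, n_out = 2, 3, 1
# Initialize all weights to small random numbers.
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

before = error(data, w_hid, w_out)
for _ in range(5000):  # fixed number of passes stands in for "until convergence"
    for x, t in data:
        backprop_step(x, t, w_hid, w_out, eta=0.5)
print(before, error(data, w_hid, w_out))
```

Both sets of error terms are computed before any weight changes, so the hidden-unit terms use the output weights from the forward pass.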

27 More on Backpropagation. It does gradient descent over the entire network weight vector. Easily generalized to arbitrary directed graphs. Will find a local, not necessarily global, error minimum; in practice it often works well (can run multiple times). Often includes weight momentum $\alpha$. Minimizes error over training examples; will it generalize well to subsequent testing examples? Training can take thousands of iterations and is very slow. Using the network after training is very fast.
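Weight momentum can be sketched as follows, assuming the common form in which each update adds a fraction $\alpha$ of the previous update: $\Delta w(n) = -\eta \nabla E + \alpha\,\Delta w(n-1)$. The function name `momentum_step`, the quadratic toy error surface, and the values of `eta` and `alpha` are illustrative assumptions.

```python
def momentum_step(w, velocity, grad, eta, alpha):
    """Return updated (w, velocity): delta_w(n) = -eta * grad + alpha * delta_w(n-1)."""
    new_velocity = [alpha * v - eta * g for v, g in zip(velocity, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_velocity)]
    return new_w, new_velocity


# Toy error surface (illustrative): E(w) = 0.5 * (w0^2 + 10 * w1^2),
# whose gradient is (w0, 10 * w1) and whose minimum is at the origin.
w, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(100):
    grad = [w[0], 10.0 * w[1]]
    w, v = momentum_step(w, v, grad, eta=0.05, alpha=0.9)
print(w)
```

Setting `alpha=0.0` recovers plain gradient descent.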

28 Learning Hidden Layer Representation A network: A target function: Can this be learned?

29 Learning Hidden Layer Representation A network: Learned hidden layer representation:

30 Training

31 Training

32 Expressive Capabilities of ANNs. Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but it might require exponentially many (in the number of inputs) hidden units. Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al 1989]. Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

33 Application: ANN for Face Reco. The model The learned hidden unit weights

34 Regression vs. Neural Networks. [Figure: a regression model and a neural network, each predicting $Y$ from $X_1, X_2, X_3$.] $Y = a(X_1) + b(X_2) + c(X_3) + d(X_1 X_2) + \dots$, with $(2^3 - 1)$ possible combinations: $X_1$, $X_2$, $X_3$, $X_1 X_2$, $X_1 X_3$, $X_2 X_3$, $X_1 X_2 X_3$.
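The count of $2^3 - 1$ combinations can be reproduced by enumerating every non-empty subset of the three variables. This is a small illustrative sketch; `interaction_terms` is a hypothetical helper name.

```python
from itertools import combinations


def interaction_terms(variables):
    """All non-empty subsets of the variables: 2^n - 1 candidate product terms."""
    return [subset
            for size in range(1, len(variables) + 1)
            for subset in combinations(variables, size)]


terms = interaction_terms(["X1", "X2", "X3"])
print(len(terms))  # 7, i.e. 2**3 - 1
print(["*".join(t) for t in terms])
# ['X1', 'X2', 'X3', 'X1*X2', 'X1*X3', 'X2*X3', 'X1*X2*X3']
```

The number of candidate terms doubles with each added variable, so listing them by hand quickly becomes impractical.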

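35 Minimizing the Error. [Figure: error surface over a weight $w$; starting from the initial error at $w_{\text{initial}}$, a negative derivative leads to a positive change in $w$, descending to the final error at $w_{\text{trained}}$, a local minimum.]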

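36 Overfitting in Neural Nets. [Figure: two plots. One shows CHD versus age, comparing an overfitted model with the real model. The other shows error versus training cycles for the holdout and training sets, with the overfitted model marked.]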

37 Alternative Error Functions. Penalize large weights. Train on target slopes as well as values. Tie together weights, e.g., in phoneme recognition.

38 Artificial neural networks: what you should know. Highly expressive non-linear functions. Highly parallel network of logistic function units. Minimizing the sum of squared training errors gives MLE estimates of network weights if we assume zero-mean Gaussian noise on output values. Minimizing the sum of squared errors plus the sum of squared weights (regularization) gives MAP estimates, assuming weight priors are zero-mean Gaussian. Gradient descent as the training procedure; how to derive your own gradient descent procedure. Hidden units discover useful representations. Local minima are the greatest problem. Overfitting, regularization, early stopping.
