Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)


 Lionel Bridges
1 Machine Learning Neural Networks (slides from Domingos, Pardo, others)
2 For this week. Reading: Chapter 4, Neural Networks (Mitchell, 1997). See Canvas. For subsequent weeks: "Scaling Learning Algorithms toward AI" and "Learning Deep Architectures for AI".
3 Human Brain
4 Neurons
5 Input-Output Transformation. Input spikes -> output spike. Spike (= a brief pulse). (Excitatory Post-Synaptic Potential)
6 Human Learning. Number of neurons: ~ Connections per neuron: ~ $10^3$ to $10^5$. Neuron switching time: ~ second. Scene recognition time: ~ 0.1 second. 100 inference steps doesn't seem like much.
7 Machine Learning Abstraction
8 Artificial Neural Networks. Typically, machine learning ANNs are very artificial, ignoring: time; space; biological learning processes. More realistic neural models exist: Hodgkin & Huxley (1952) won a Nobel prize for theirs (in 1963). Nonetheless, very artificial ANNs have been useful in many ML applications.
9 Perceptrons. The first wave in neural networks. Big in the 1960s. McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962).
10 Problem def: Perceptrons. Let $f$ be a target function from $X = \langle x_1, x_2, \dots \rangle$, where $x_i \in \{0, 1\}$, to $y \in \{0, 1\}$. Given training data $\{(X_1, y_1), (X_2, y_2), \dots\}$, learn $h(X)$, an approximation of $f(X)$.
11 A single perceptron. Inputs $x_1, \dots, x_5$ with weights $w_1, \dots, w_5$; bias input $x_0$ ($x_0 = 1$, always) with weight $w_0$. Output: $\sigma = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$.
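The threshold unit on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `perceptron_output` and the example weights are assumptions chosen for the demonstration.

```python
def perceptron_output(weights, inputs):
    """Threshold unit: return 1 if sum_i w_i * x_i > 0, else 0.

    weights[0] is the bias weight w_0; the bias input x_0 = 1 is added here.
    """
    x = [1.0] + list(inputs)
    net = sum(w * xi for w, xi in zip(weights, x))
    return 1 if net > 0 else 0


# Hypothetical weights (w_0, w_1, w_2), chosen only for illustration.
example_weights = [-1.0, 2.0, -1.0]
outputs = [perceptron_output(example_weights, x)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

Note that the comparison is strict: an input whose weighted sum is exactly 0 produces output 0.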
12 Logical Operators. [Figure: three perceptrons, each with output $\sigma = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$, with weights chosen to compute AND, NOT and OR respectively; the weight values are not preserved.]
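One possible choice of weights for the three operators is sketched below. The specific values are assumptions (many other choices work) and are not claimed to be the ones shown on the original slide.

```python
def threshold_unit(weights, inputs):
    """Return 1 if w_0 + sum_i w_i * x_i > 0, else 0 (bias input x_0 = 1)."""
    x = [1.0] + list(inputs)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Assumed weights (w_0, w_1, ...); any weights on the right side of the
# decision boundary would do.
W_AND = [-1.5, 1.0, 1.0]   # fires only when both inputs are 1
W_OR = [-0.5, 1.0, 1.0]    # fires when at least one input is 1
W_NOT = [0.5, -1.0]        # fires only when the single input is 0

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_table = [threshold_unit(W_AND, p) for p in pairs]
or_table = [threshold_unit(W_OR, p) for p in pairs]
not_table = [threshold_unit(W_NOT, (x,)) for x in (0, 1)]
print(and_table, or_table, not_table)
```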
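13 Learning Weights. Perceptron Training Rule. Gradient Descent. (Other approaches: Genetic Algorithms.) [Figure: perceptron with inputs $x_0, x_1, x_2$ and unknown weights; output $\sigma = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$.]
14 Perceptron Training Rule. Weights are modified for each training example. Update rule: $w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta (t - o) x_i$; $\eta$ is the learning rate, $t$ the target value, $o$ the perceptron output, and $x_i$ the input value.
15 Perceptron Training for NOT. Initialize: $w_0 = 0$, $w_1 = 0$. Unit: input $x_1$ with weight $w_1$ and bias $x_0$ with weight $w_0$; output $\sigma = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$. Update: $w_i \leftarrow w_i + \Delta w_i$, $\Delta w_i = \eta (t - o) x_i$. [Table: worked example with columns Start and End; entries not preserved.] Bryan Pardo, Machine Learning: EECS 349 Fall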
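The training rule can be traced for NOT with a short sketch. Assumptions not taken from the slide: both weights start at 0, the learning rate is $\eta = 1$, and the two examples are presented in the order $x_1 = 0$, then $x_1 = 1$.

```python
def predict(w, x):
    """Threshold unit; x already includes the bias input x_0 = 1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def train_perceptron(examples, n_weights, eta=1.0, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i after every example."""
    w = [0.0] * n_weights  # assumed initialization
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:
                mistakes += 1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:  # a full pass with no errors: stop
            break
    return w


# NOT: each example is ((x_0, x_1), target)
not_examples = [((1, 0), 1), ((1, 1), 0)]
w_not = train_perceptron(not_examples, n_weights=2)
print(w_not)  # [1.0, -1.0] under the assumptions above
```

Under these assumptions the weights settle after two passes, and a third pass confirms that both examples are classified correctly.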
16 What weights make XOR? [Figure: perceptron with inputs $x_0, x_1, x_2$ and unknown weights; output $\sigma = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$.] No combination of weights works. Perceptrons can only represent linearly separable functions.
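A brute-force search over a small grid of weights illustrates the claim. It is only an illustration, not a proof: the grid (values from -4 to 4 in steps of 0.5) is an arbitrary assumption, and the general result follows from linear separability rather than from any finite search.

```python
from itertools import product


def predict(w, x):
    """Threshold unit; x already includes the bias input x_0 = 1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def count_solutions(examples, grid):
    """Count weight triples (w_0, w_1, w_2) that classify every example."""
    return sum(
        all(predict(w, x) == t for x, t in examples)
        for w in product(grid, repeat=3)
    )


grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0 (assumed grid)
xor_examples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
and_examples = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

xor_solutions = count_solutions(xor_examples, grid)
and_solutions = count_solutions(and_examples, grid)
print(xor_solutions, and_solutions)
```

The same search finds many weight settings for AND, which is linearly separable, and none for XOR.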
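17 Linear Separability: OR. [Figure: examples plotted on axes $x_1$ and $x_2$, positive examples marked +.]
18 Linear Separability: AND. [Figure: examples plotted on axes $x_1$ and $x_2$, positive examples marked +.]
19 Linear Separability: XOR. [Figure: examples plotted on axes $x_1$ and $x_2$, positive examples marked +.]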
20 Perceptron Training Rule. Converges to the correct classification IF: cases are linearly separable; the learning rate is small enough. Proved by Minsky and Papert in 1969. Killed widespread interest in perceptrons till the '80s.
21 XOR. [Figure: a network of three threshold units, each with bias input $x_0$ and output $\sigma = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$, wired so that the final unit computes XOR; the weight values are not preserved.]
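One way to wire three threshold units into XOR is sketched below: two hidden units compute OR and AND, and the output unit fires when OR is on and AND is off. The weights are assumptions for illustration, since the values on the slide are not preserved.

```python
def threshold_unit(weights, inputs):
    """Return 1 if w_0 + sum_i w_i * x_i > 0, else 0 (bias input x_0 = 1)."""
    x = [1.0] + list(inputs)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Assumed weights (w_0, w_1, w_2) for each unit.
W_OR = [-0.5, 1.0, 1.0]
W_AND = [-1.5, 1.0, 1.0]
W_OUT = [-0.5, 1.0, -1.0]  # on when the OR unit fires and the AND unit does not


def xor(x1, x2):
    h_or = threshold_unit(W_OR, (x1, x2))
    h_and = threshold_unit(W_AND, (x1, x2))
    return threshold_unit(W_OUT, (h_or, h_and))


truth_table = [xor(a, b) for a in (0, 1) for b in (0, 1)]
print(truth_table)  # inputs in the order (0,0), (0,1), (1,0), (1,1)
```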
22 What's wrong with perceptrons? You can always plug multiple perceptrons together to calculate any function. BUT who decides what the weights are? Assignment of error to parental inputs becomes a problem.
23 Perceptrons use a step function. [Figure: perceptron with inputs $x_0, x_1, x_2$ and unknown weights; output $\sigma = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$.] Perceptron threshold: step function. Small changes in inputs -> either no change or a large change in output.
24 Solution: Differentiable Function. Simple linear function: $\sigma = \sum_{i=0}^{n} w_i x_i$. Varying any input a little creates a perceptible change in the output. We can now characterize how error changes with $w_i$, even in the multilayer case.
25 Measuring error for linear units. Output function: $\sigma(\vec{x}) = \vec{w} \cdot \vec{x}$. Error measure: $E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $D$ is the training data, $t_d$ the target value, and $o_d$ the linear unit output.
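26 Gradient Descent. Gradient: $\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$. Training rule: $\Delta \vec{w} = -\eta \nabla E(\vec{w})$, i.e. $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$.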
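27 Gradient Descent Rule. $\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{i,d})$. Update rule: $w_i \leftarrow w_i + \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$.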
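The update rule can be sketched for a single linear unit as follows. The data set (targets generated by $t = 1 + 2 x_1$), the learning rate and the number of epochs are assumptions picked so that the example converges; they do not come from the slides.

```python
def linear_output(w, x):
    """Linear unit: o = sum_i w_i * x_i (x includes the bias input x_0 = 1)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def batch_gradient_descent(examples, eta=0.05, epochs=2000):
    """Repeat w_i <- w_i + eta * sum_d (t_d - o_d) * x_{i,d}."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        # Accumulate the sum over all examples with the weights held fixed.
        step = [0.0] * n
        for x, t in examples:
            o = linear_output(w, x)
            for i in range(n):
                step[i] += (t - o) * x[i]
        w = [wi + eta * si for wi, si in zip(w, step)]
    return w


# Assumed noise-free data: t = 1 + 2 * x_1 for x_1 = 0, 1, 2, 3.
examples = [((1.0, float(x1)), 1.0 + 2.0 * x1) for x1 in range(4)]
w = batch_gradient_descent(examples)
print(w)  # close to [1.0, 2.0]
```

With a learning rate that is too large the same loop diverges, which is one reason the rate is usually tuned empirically.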
28 Gradient Descent for Multiple Layers. [Figure: network of linear units, each with bias input $x_0$ and output $\sigma = \sum_{i=0}^{n} w_i x_i$, connected by weights $w_{ij}$; can it compute XOR?] We can compute $\frac{\partial E}{\partial w_{ij}}$.
29 Gradient Descent vs. Perceptrons. Perceptron rule and threshold units: the learner converges on an answer ONLY IF the data is linearly separable; can't assign proper error to parent nodes. Gradient descent: (locally) minimizes error even if examples are not linearly separable; works for multilayer networks; but linear units only make linear decision surfaces (can't learn XOR even with many layers), and the step function isn't differentiable.
30 A compromise function. Perceptron: output $= 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, else $0$. Linear: output $= net = \sum_{i=0}^{n} w_i x_i$. Sigmoid (logistic): output $= \sigma(net) = \frac{1}{1 + e^{-net}}$.
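The three output functions can be compared side by side on the same net input. This is a small sketch; the function names and the sample net values are assumptions for illustration.

```python
import math


def step_unit(net):
    """Perceptron output: 1 if net > 0, else 0."""
    return 1 if net > 0 else 0


def linear_unit(net):
    """Linear output: the net input itself."""
    return net


def sigmoid_unit(net):
    """Sigmoid (logistic) output: 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + math.exp(-net))


nets = [-2.0, -0.1, 0.0, 0.1, 2.0]  # assumed sample net inputs
for net in nets:
    print(net, step_unit(net), linear_unit(net), round(sigmoid_unit(net), 3))
```

Around net = 0 the step output jumps from 0 to 1, while the sigmoid output changes only slightly; far from 0 the sigmoid saturates toward 0 or 1 like the step.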
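31 The sigmoid (logistic) unit. Has a differentiable function. Allows gradient descent. Can be used to learn nonlinear functions. [Figure: unit with inputs $x_1, x_2$ and unknown weights.] $\sigma = \frac{1}{1 + e^{-\sum_{i=0}^{n} w_i x_i}}$
32 Logistic function. [Figure: inputs (independent variables) Age = 34, Gender = 1 and Stage are multiplied by coefficients and summed ($\Sigma$); the output (prediction) is 0.6, the probability of being alive.] $\sigma = \frac{1}{1 + e^{-\sum_{i=0}^{n} w_i x_i}}$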
33 Neural Network Model. [Figure: inputs (independent variables) Age = 34, Gender = 2 and Stage; weights; hidden layer of summation units ($\Sigma$); weights; output (dependent variable, prediction) 0.6, the probability of being alive.]
34-36 Getting an answer from a NN. [Figure, three builds: the network from slide 33, with inputs Age, Gender and Stage, a hidden layer, and output 0.6, the probability of being alive.]
37 Minimizing the Error. [Figure: error surface over a weight $w$, showing the initial error at $w_{initial}$, the negative derivative and the resulting positive change in $w$, and the final error at $w_{trained}$ in a local minimum.]
38 Differentiability is key! The sigmoid is easy to differentiate: $\frac{\partial \sigma(y)}{\partial y} = \sigma(y)\,(1 - \sigma(y))$. For gradient descent on multiple layers, a little dynamic programming can help: compute errors at each output node; use these to compute errors at each hidden node; use these to compute the weight gradient.
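The derivative identity can be checked numerically with a central difference. This is a quick sanity check under assumed sample points and step size, not part of the original slides.

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def sigmoid_derivative(y):
    """Closed form from the slide: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)


def numerical_derivative(f, y, h=1e-5):
    """Central-difference approximation of f'(y)."""
    return (f(y + h) - f(y - h)) / (2.0 * h)


points = [-3.0, -1.0, 0.0, 0.5, 2.0]  # assumed sample points
max_gap = max(
    abs(sigmoid_derivative(y) - numerical_derivative(sigmoid, y))
    for y in points
)
print(max_gap)  # tiny: the two derivatives agree to many decimal places
```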
39 The Backpropagation Algorithm. For each training example $\langle \vec{x}, \vec{t} \rangle$: 1. Input instance $\vec{x}$ to the network and compute the output $o_u$ for every unit $u$ in the network. 2. For each output unit $k$, calculate its error term $\delta_k$: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$. 3. For each hidden unit $h$, calculate its error term $\delta_h$: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh}\, \delta_k$. 4. Update each network weight $w_{ji}$: $w_{ji} \leftarrow w_{ji} + \eta\, \delta_j\, x_{ji}$.
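The four steps can be sketched for a network with one hidden layer of sigmoid units. This is a teaching sketch under stated assumptions, not a production implementation: the 2-2-1 layout, the random initial weights, the single training example and $\eta = 0.1$ are all chosen for illustration, and every unit is given a bias input fixed at 1.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def forward(w_hidden, w_out, x):
    """Step 1: compute the output of every unit (x excludes the bias)."""
    x_b = [1.0] + list(x)
    o_hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x_b)))
                for row in w_hidden]
    h_b = [1.0] + o_hidden  # hidden outputs plus bias for the output layer
    o_out = [sigmoid(sum(w * hi for w, hi in zip(row, h_b)))
             for row in w_out]
    return x_b, h_b, o_out


def example_error(w_hidden, w_out, x, t):
    """E = 1/2 * sum_k (t_k - o_k)^2 for one training example."""
    _, _, o_out = forward(w_hidden, w_out, x)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o_out))


def backprop_step(w_hidden, w_out, x, t, eta):
    """Steps 2-4 for one example; returns updated copies of the weights."""
    x_b, h_b, o_out = forward(w_hidden, w_out, x)
    # Step 2: delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = [o * (1 - o) * (tk - o) for o, tk in zip(o_out, t)]
    # Step 3: delta_h = o_h (1 - o_h) sum_k w_kh delta_k
    # (index h + 1 skips the bias entry at position 0)
    delta_hidden = [
        h_b[h + 1] * (1 - h_b[h + 1])
        * sum(w_out[k][h + 1] * delta_out[k] for k in range(len(w_out)))
        for h in range(len(w_hidden))
    ]
    # Step 4: w_ji <- w_ji + eta * delta_j * x_ji
    new_w_out = [[w + eta * delta_out[k] * hi
                  for w, hi in zip(w_out[k], h_b)]
                 for k in range(len(w_out))]
    new_w_hidden = [[w + eta * delta_hidden[h] * xi
                     for w, xi in zip(w_hidden[h], x_b)]
                    for h in range(len(w_hidden))]
    return new_w_hidden, new_w_out


random.seed(0)  # assumed seed, only to make the run repeatable
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
x, t, eta = (1.0, 0.0), [1.0], 0.1  # one assumed training example

new_w_hidden, new_w_out = backprop_step(w_hidden, w_out, x, t, eta)
print(example_error(w_hidden, w_out, x, t),
      example_error(new_w_hidden, new_w_out, x, t))
```

Each weight moves by $\eta\, \delta_j\, x_{ji}$, which is $-\eta\, \partial E / \partial w_{ji}$ for the squared error of that example, so a small step lowers the error on the example just presented.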
40 Learning Weights. [Figure: the network with inputs (independent variables) Age = 34, Gender = 1 and Stage; weights; hidden layer; weights; output (dependent variable, prediction) 0.6, the probability of being alive.]
41 The fine print. Don't implement backpropagation yourself: use a package. Second-order or variable step-size optimization techniques exist. Feature normalization: it is typical to normalize inputs to lie in [0,1] (and outputs must be normalized). Problems with NN training: slow training times (though getting better); local minima.
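Normalizing a feature to [0,1] is commonly done with min-max scaling, sketched below. The helper name `min_max_normalize` and the sample ages are assumptions; a package would normally provide an equivalent routine.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


ages = [34, 50, 66]  # hypothetical values for the Age input
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.5, 1.0]
```

The minimum and maximum should be taken from the training set and reused unchanged on validation and test data.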
42 Minimizing the Error. [Figure: error surface over a weight $w$, showing the initial error at $w_{initial}$, the negative derivative and the resulting positive change in $w$, and the final error at $w_{trained}$ in a local minimum.]
43 Expressive Power of ANNs. Universal function approximator: given enough hidden units, can approximate any continuous function $f$. Need 2+ hidden units to learn XOR. Why not use millions of hidden units? Efficiency (training is slow); overfitting.
44 Overfitting. [Figure: real distribution vs. overfitted model.]
45 Combating Overfitting in Neural Nets. Many techniques; two popular ones. Early stopping (most popular): use a lot of hidden units, just don't overtrain. Cross-validation: test different architectures to choose the right number of hidden units.
46 Early Stopping. [Figure: error vs. epochs, with curve a = validation set error and curve b = training set error; the stopping criterion is at $\min(error_a)$, and beyond it the model is overfitted.]
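The stopping criterion can be sketched as a scan over per-epoch validation errors. The `patience` parameter and the error curve below are assumptions: the slide only says to stop at the minimum of the validation error, and waiting a few epochs is one common way to decide that the minimum has passed.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (best_epoch, best_error), stopping once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch, best_error


# Hypothetical validation curve: falls, bottoms out, then rises (overfitting).
validation_errors = [0.50, 0.40, 0.33, 0.30, 0.31, 0.33, 0.36, 0.40]
stop = early_stopping_epoch(validation_errors)
print(stop)  # (3, 0.3)
```

In a real training loop the weights from the best epoch would be saved and restored when the loop stops.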
47 Learning Rate? A knob you twist empirically. Important. One popular option: watch for the validation set accuracy to decrease or stabilize, then halve the learning rate.
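The halving heuristic can be sketched as follows. The threshold `min_gain` that decides when accuracy counts as "stabilized", the initial rate and the accuracy values are all assumptions for illustration.

```python
def halve_on_plateau(validation_accuracies, initial_rate=0.1, min_gain=1e-3):
    """Return the learning rate in effect after each epoch, halving it
    whenever validation accuracy fails to beat the best so far by min_gain."""
    rate = initial_rate
    best = float("-inf")
    rates = []
    for acc in validation_accuracies:
        if acc < best + min_gain:  # accuracy decreased or stabilized
            rate /= 2
        best = max(best, acc)
        rates.append(rate)
    return rates


# Hypothetical per-epoch validation accuracies.
accuracies = [0.60, 0.70, 0.75, 0.75, 0.74, 0.80]
rates = halve_on_plateau(accuracies)
print(rates)  # [0.1, 0.1, 0.1, 0.05, 0.025, 0.025]
```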
48 Modern Neural Networks (Deep Nets). Local minima in large networks are less of an issue. Early stopping is useful, but so is initializing near zero and training until almost zero training error: count on stochastic gradient descent to perform implicit regularization. Also: dropout. Many layers are now common, and so is specific structure: convolution, max pooling.
49 Summary of Neural Networks. When are Neural Networks useful? Instances are represented by attribute-value pairs, particularly when attributes are real-valued. The target function is discrete-valued, real-valued, or vector-valued. Training examples may contain errors. Fast evaluation times are necessary. When not? Fast training times are necessary. Understandability of the function is required.
50 Summary of Neural Networks Nonlinear regression technique that is trained with gradient descent. Question: How important is the biological metaphor?
51 Other Topics in Neural Nets. Batch mode vs. stochastic. Autoencoders. Neural Networks on Silicon.
52 Stochastic vs. Batch Mode Stochastic Gradient Descent
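53 Incremental vs. Batch Mode. In batch mode we minimize the error over the whole training set, $E_D(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$ (the error measure from slide 25). Same as computing $\Delta \vec{w}_D = \sum_{d \in D} \Delta \vec{w}_d$, then setting $\vec{w} = \vec{w} + \Delta \vec{w}_D$.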
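The difference between the two modes can be seen in one pass over a tiny data set with a linear unit. The two examples, the starting weights and $\eta = 0.5$ are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def delta_w(w, x, t, eta):
    """Per-example update for a linear unit: eta * (t - o) * x_i."""
    o = sum(wi * xi for wi, xi in zip(w, x))
    return [eta * (t - o) * xi for xi in x]


def batch_epoch(w, examples, eta):
    """Sum the per-example updates at fixed w, then apply them once."""
    total = [0.0] * len(w)
    for x, t in examples:
        total = [a + b for a, b in zip(total, delta_w(w, x, t, eta))]
    return [wi + d for wi, d in zip(w, total)]


def incremental_epoch(w, examples, eta):
    """Apply each per-example update immediately (stochastic mode)."""
    for x, t in examples:
        w = [wi + d for wi, d in zip(w, delta_w(w, x, t, eta))]
    return w


# Assumed examples ((x_0, x_1), t) and settings.
examples = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0)]
w0, eta = [0.0, 0.0], 0.5
w_batch = batch_epoch(w0, examples, eta)
w_incremental = incremental_epoch(w0, examples, eta)
print(w_batch, w_incremental)  # [2.0, 1.5] and [1.75, 1.25]
```

The results differ because the incremental pass computes the second update from weights already changed by the first example; with a small learning rate the two modes follow nearly the same path.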
54 Advanced Topics in Neural Nets. Batch mode vs. incremental. Autoencoders. Neural Networks on Silicon.
55 Hidden Layer Representations. Input -> hidden layer mapping: a representation of input vectors tailored to the task. Can also be exploited for dimensionality reduction: a form of unsupervised learning in which we output a more compact representation of input vectors, $\langle x_1, \dots, x_n \rangle \to \langle x'_1, \dots, x'_m \rangle$ where $m < n$. Useful for visualization, problem simplification, data compression, etc.
56 Dimensionality Reduction Model: Function to learn:
57 Dimensionality Reduction: Example
58 Dimensionality Reduction: Example
59 Dimensionality Reduction: Example
60 Dimensionality Reduction: Example
61 Advanced Topics in Neural Nets. Batch mode vs. incremental. Autoencoders. Neural Networks on Silicon.
62 Neural Networks on Silicon. Currently: simulation of continuous device physics (neural networks), on top of a digital computational model (thresholding), on top of continuous device physics (voltage). Why not skip this?
63 Example: Silicon Retina. Simulates the function of a biological retina. Single-transistor synapses adapt to luminance and temporal contrast. Modeling the retina directly on chip => requires 100x less power!
64 Example: Silicon Retina Synapses modeled with single transistors
65 Luminance Adaptation
66 Comparison with Mammal Data Real: Artificial:
67 Graphics and results taken from:
68 General NN learning in silicon? People seem more excited about / satisfied with GPUs But, that could change