Machine Learning (CSE 446): Neural Networks
Noah Smith
© 2017 University of Washington
nasmith@cs.washington.edu
November 6, 2017
Admin

No Wednesday office hours for Noah; no lecture Friday.
Classifiers We've Covered So Far

                      decision boundary?        difficult part of learning?
decision trees        piecewise, axis-aligned   greedy split decisions
K-nearest neighbors   possibly very complex     indexing training data
perceptron            linear                    iterative optimization method required
logistic regression   linear                    iterative optimization method required
naïve Bayes           linear (see A4)           none

The next methods we'll cover permit nonlinear decision boundaries.
Inspiration from Neurons

[Image of a neuron, from Wikimedia Commons.] Input signals come in through the dendrites; the output signal passes out through the axon.
Neuron-Inspired Classifiers

[Diagram: inputs x[1], ..., x[d], each multiplied by a weight parameter w[1], ..., w[d], are summed with a bias parameter b; an activation decides "fire, or not?" to produce the output ŷ.]

[Diagram: the same network, with the whole input-to-output mapping labeled as a function f.]

[Diagram: the training view for example n: input x_n, weights w, and bias b produce an activation and the classifier output ŷ = f(x_n), which is compared against the correct output y_n by a loss L_n.]
Neuron-Inspired Classifiers

[Plot of tanh(z) for z ∈ [−10, 10]; outputs lie in (−1, 1).]

Hyperbolic tangent function: tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

Generalization: apply elementwise to a vector, so that tanh : R^k → (−1, 1)^k.
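The tanh squashing function can be checked numerically; a quick sketch (not from the slides) confirming that NumPy's elementwise tanh matches the definition above and stays inside (−1, 1):

```python
import numpy as np

# np.tanh applies elementwise and matches (e^z - e^-z) / (e^z + e^-z),
# squashing any real input into the open interval (-1, 1).
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
h = np.tanh(z)
manual = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
print(h)  # near -1 for very negative z, 0 at z = 0, near +1 for very positive z
```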
Neuron-Inspired Classifiers

[Diagram: two hidden units: input x_n feeds two tanh units with weights w_1, b_1 and w_2, b_2; their outputs, weighted by v_1 and v_2, give the activation and classifier output ŷ = f(x_n), compared against the correct output y_n by a loss L_n.]

[Diagram: the vectorized view: input x_n, weight matrix W, and bias vector b feed a layer of tanh hidden units, whose outputs are weighted by v to produce ŷ = f(x_n); loss L_n against y_n.]
Two-Layer Neural Network

f(x) = sign( Σ_{h=1}^{H} v[h] · tanh(w_h · x + b[h]) )
     = sign( v · tanh(Wx + b) )

Two-layer networks allow decision boundaries that are nonlinear.

It's fairly easy to show that XOR can be simulated (recall conjunction features from the practical issues lecture on 10/18).

Theoretical result: any continuous function on a bounded region in R^d can be approximated arbitrarily well, with a finite number of hidden units.

The number of hidden units affects how complicated your decision boundary can be and how easily you will overfit.
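The XOR claim can be checked directly with hand-set weights. The particular weights below are an illustrative assumption, not learned values: hidden unit 1 acts as an OR detector, hidden unit 2 as an AND detector, and hidden unit 3 has no input connections, so its saturated tanh output serves as a constant that plays the role of an output bias (which the formula above lacks).

```python
import numpy as np

# A hand-set two-layer network f(x) = sign(v . tanh(Wx + b)) computing XOR.
# Weights chosen large enough that tanh saturates near +/-1.
W = np.array([[10.0, 10.0],   # ~ OR(x1, x2)
              [10.0, 10.0],   # ~ AND(x1, x2)
              [ 0.0,  0.0]])  # constant unit (acts as output bias)
b = np.array([-5.0, -15.0, 10.0])
v = np.array([1.0, -1.0, -0.5])  # XOR = OR minus AND, shifted negative

def f(x):
    return np.sign(v @ np.tanh(W @ x + b))

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), int(f(np.array([x1, x2]))))
# (0,1) and (1,0) map to +1; (0,0) and (1,1) map to -1
```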
Learning with a Two-Layer Network

Parameters: W ∈ R^{H×d}, b ∈ R^H, and v ∈ R^H.

If we choose a differentiable loss, then the whole function will be differentiable with respect to all parameters.

Because of the squashing function, which is not convex, the overall learning problem is not convex.

What does (stochastic) (sub)gradient descent do with non-convex functions? It finds a local minimum.

To calculate gradients, we need to use the chain rule from calculus.

Special name for (S)GD with chain rule invocations: backpropagation.
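The chain-rule gradients for the two-layer network can be worked out by hand. A sketch, under one assumed choice of differentiable loss (squared error on the pre-sign activation a = v · tanh(Wx + b)); the slides do not fix a particular loss:

```python
import numpy as np

# Backpropagation for the two-layer network: repeated chain rule by hand.
rng = np.random.default_rng(0)
H, d = 3, 4
W, b, v = rng.normal(size=(H, d)), rng.normal(size=H), rng.normal(size=H)
x, y = rng.normal(size=d), 1.0

def loss(W, b, v):
    return 0.5 * (v @ np.tanh(W @ x + b) - y) ** 2

# Forward pass, caching intermediates needed for the backward pass.
h = np.tanh(W @ x + b)        # hidden activations
a = v @ h                     # pre-sign output
# Backward pass (chain rule):
da = a - y                    # dL/da
dv = da * h                   # dL/dv    = dL/da * da/dv
db = da * v * (1.0 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
dW = np.outer(db, x)          # dL/dW[h, j] = dL/db[h] * x[j]

# One (stochastic) gradient step; for a small enough step size the loss drops.
eta = 0.01
L_new = loss(W - eta * dW, b - eta * db, v - eta * dv)
```

A finite-difference check on any single parameter is a cheap way to confirm a hand-derived gradient like this.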