Artificial Neuron (Perceptron)

Transcription:

Gradient Descent (GD)
Hantao Zhang, Deep Learning with Python
Reading: https://en.wikipedia.org/wiki/gradient_descent

Artificial Neuron (Perceptron)
<w, x> = w^T x = w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d, where x_0 = 1
Many monotonic functions can be used as the activation function f: y = f(<w, x>)
The value of the bias w_0 decides when the neuron fires.
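As a minimal sketch of this computation (the weights and inputs below are made-up values, and the step function at zero is just one common choice of f):

    import numpy as np

    def perceptron(w, x):
        """Artificial neuron: weighted sum <w, x> followed by an activation f."""
        x = np.concatenate(([1.0], x))      # prepend x_0 = 1 so w_0 acts as the bias
        a = np.dot(w, x)                    # <w, x> = w_0*1 + w_1*x_1 + ... + w_d*x_d
        return 1.0 if a > 0 else 0.0        # step activation: fire when the sum is positive

    # Example with made-up weights: w_0 = -0.5 (bias), w_1 = w_2 = 1
    w = np.array([-0.5, 1.0, 1.0])
    print(perceptron(w, np.array([0.0, 0.0])))   # 0.0
    print(perceptron(w, np.array([1.0, 0.0])))   # 1.0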

Perceptron Learning
The perceptron learns linear decision boundaries, e.g. a line w_1 x_1 + w_2 x_2 = 1 separating + examples from 0 examples. [Figure: linearly separable + and 0 points in the (x_1, x_2) plane.]
But it cannot learn x_1 xor x_2. With threshold 0.5 we would need:
x_1 = 0, x_2 = 0, output 0: w_1*0 + w_2*0 = 0 < 0.5
x_1 = 1, x_2 = 0, output 1: w_1*1 + w_2*0 = w_1 > 0.5
x_1 = 0, x_2 = 1, output 1: w_1*0 + w_2*1 = w_2 > 0.5
x_1 = 1, x_2 = 1, output 0: w_1 + w_2 < 0.5 -- impossible!

Multilayer NNs are universal function approximators
Input, output, and an arbitrary number of hidden layers.
1 hidden layer is sufficient for a DNF representation of any Boolean function - one hidden node per positive conjunct, output node set to the OR function.
2 hidden layers allow an arbitrary number of labeled clusters.
1 hidden layer is sufficient to approximate all bounded continuous functions.
1 hidden layer was the most common in practice, but recently deep networks show excellent results!

Solving the XOR Problem
Network topology: 2 hidden nodes, 1 output node. Activation function: step(x) = 1 if x > 0; 0 otherwise.
[Figure: inputs x_1, x_2 feed hidden units y_1, y_2 through weights w_11, w_21, w_12, w_22 with biases w_01, w_02; the hidden units feed the output y_3 through w_13, w_23 with bias w_03.]
Weights: w_11 = w_21 = 1, w_12 = w_22 = 1; w_01 = -1.5; w_02 = -0.5; w_03 = -0.5; w_13 = -1; w_23 = 1
y_1 = step(w_11 x_1 + w_21 x_2 + w_01)
y_2 = step(w_12 x_1 + w_22 x_2 + w_02)
y_3 = step(w_13 y_1 + w_23 y_2 + w_03)
Desired and actual output y_3 agree:
x_1 x_2 | y
 0   0  | 0
 0   1  | 1
 1   0  | 1
 1   1  | 0

Feed Forward Computation
[Figure: neural network with sigmoid activation functions - input, hidden layer, output.]
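A quick check of this network in code (a sketch using the weights reconstructed above; the value w_13 = -1 for the connection from the AND-like unit to the output is an assumption consistent with the given biases):

    def step(a):
        return 1 if a > 0 else 0

    def xor_net(x1, x2):
        # Hidden layer: y1 acts as AND(x1, x2), y2 acts as OR(x1, x2)
        y1 = step(1*x1 + 1*x2 - 1.5)
        y2 = step(1*x1 + 1*x2 - 0.5)
        # Output fires when OR is true but AND is not, i.e. XOR
        y3 = step(-1*y1 + 1*y2 - 0.5)
        return y3

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0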

Neural Net Training
Goal: y = f(x, w). Determine how to change the weights to get the correct output; a large change in a weight should produce a large change in the error.
Approach: compute the actual output y, compare it to the desired output y*, determine the effect of the weights w on the error y* - y, and adjust the weights w.

Cost (Loss, Error) Function
[Figure: neural network with sigmoid activation functions - input, hidden layer, output.]

Backpropagation
The weights are the parameters to change. Backpropagation computes the current output, then works backward to correct the error. If the network is a smooth function, use gradient descent. Linear activation functions (including the identity function) are not useful, as a combination of linear functions is still linear.

XOR Example
x_i: i-th sample input vector; w: weight vector; y_i*: desired output for the i-th sample; F: output of the neural network; s: the activation function.
Sum-of-squares error over the training samples: E = 1/2 * sum_i (y_i* - F(x_i, w))^2
We may use gradient descent to find w so that E is minimized. (From MIT 6.034 notes, Lozano-Perez.)
[Figure: the XOR network with hidden sums z_1, z_2, output sum z_3, weights w_11, w_21, w_12, w_22, w_13, w_23 and biases w_01, w_02, w_03.]
Full expression of the output in terms of the input and weights:
y_3 = F(x, w) = s(w_13 s(w_11 x_1 + w_21 x_2 + w_01) + w_23 s(w_12 x_1 + w_22 x_2 + w_02) + w_03)
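The forward expression F(x, w) and the error E can be written out directly; the sketch below (the function and variable names are my own) evaluates E over the four XOR training samples for a given weight vector:

    import numpy as np

    def s(a):                                   # sigmoid activation
        return 1.0 / (1.0 + np.exp(-a))

    def F(x1, x2, w):
        """Forward pass y_3 = F(x, w) for the 2-2-1 XOR network."""
        w11, w21, w01, w12, w22, w02, w13, w23, w03 = w
        y1 = s(w11*x1 + w21*x2 + w01)
        y2 = s(w12*x1 + w22*x2 + w02)
        return s(w13*y1 + w23*y2 + w03)

    def E(w, samples):
        """Sum-of-squares error E = 1/2 * sum_i (y_i* - F(x_i, w))^2."""
        return 0.5 * sum((y_star - F(x1, x2, w))**2 for x1, x2, y_star in samples)

    xor_samples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
    w0 = np.random.uniform(-1, 1, size=9)       # random initial weights
    print(E(w0, xor_samples))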

Gradient Descent Method
Task: find a local minimum of the function y = f(x).
Method: start at a given point and use the gradient to move toward the minimum.
The gradient is the slope of the function: dy/dx = f'(x). If x is a minimum, then f'(x) = 0.
The least of all the minimum points is called the global minimum. Every minimum is a local minimum.
[Figure: f(x) with a global maximum, an inflection point, a local minimum, and the global minimum marked.]

One Variable Function
Starting at x_0, the next point is x_1 = x_0 - η f'(x_0), where η is a positive constant called the moving (learning) rate.
Rate parameter: large enough to learn quickly, small enough to reach (but not overshoot) the target values.
If looking for a maximum, the next point is x_1 = x_0 + η f'(x_0).
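A minimal sketch of the one-variable update x <- x - η f'(x); the example function f(x) = (x - 3)^2 and the rate 0.1 are made-up choices:

    def grad_descent_1d(f_prime, x0, eta=0.1, steps=50):
        """Repeat x <- x - eta * f'(x) for a fixed number of steps."""
        x = x0
        for _ in range(steps):
            x = x - eta * f_prime(x)
        return x

    # Example: f(x) = (x - 3)^2 has f'(x) = 2*(x - 3) and its minimum at x = 3
    print(grad_descent_1d(lambda x: 2*(x - 3), x0=0.0))   # approaches 3.0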

One Variable Function
Pick a random starting point. [Figure: f(x) with the starting point marked.]

One Variable Function
Compute the gradient at the point (by calculus). [Figure: f(x) with the gradient at the current point.]

One Variable Function
Move along the parameter space in the direction of the negative gradient: η f'(x) = amount to move, η = learning rate. [Figure, repeated over two slides as the point moves downhill.]

One Variable Function
Stop when we don't move any more: η f'(x) ≈ 0. [Figure.]

Two Variable Function
f(x_1, x_2) = x_1^2 + 5 x_2^2
Partial derivatives: ∂f/∂x_1 = 2 x_1, ∂f/∂x_2 = 10 x_2
[Figure: contour plot of f over the (x_1, x_2) plane.]
The gradient descent update at the point (x_1, x_2): x_1 <- x_1 - η (2 x_1), x_2 <- x_2 - η (10 x_2)
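A sketch of this two-variable update, using the same f(x_1, x_2) = x_1^2 + 5 x_2^2 as above (the starting point and rate are made-up):

    def grad_descent_2d(x1, x2, eta=0.08, steps=100):
        """Gradient descent on f(x1, x2) = x1**2 + 5*x2**2."""
        for _ in range(steps):
            g1, g2 = 2*x1, 10*x2          # partial derivatives of f
            x1, x2 = x1 - eta*g1, x2 - eta*g2
        return x1, x2

    print(grad_descent_2d(1.0, 1.0))      # both coordinates approach 0, the minimum

Because the x_2 direction is much steeper, a learning rate close to 0.2 makes the x_2 coordinate overshoot back and forth; this is the kind of oscillation shown on the later slides.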

Two Variable Function
[Figure: surface plot of a two-variable function showing a saddle point and a local minimum.]

Multi Variable Function
First compute the partial derivatives: ∇f = (∂f/∂x_1, ..., ∂f/∂x_n).
Then, for any given point (x_1, x_2, ..., x_n), change x_i <- x_i - η ∂f/∂x_i (x_1, x_2, ..., x_n) until satisfied, or pick another point and start over.
Summary: the gradient descent method is a greedy optimization algorithm. To find a local minimum of a function, each variable takes a step proportional to the negative of the partial derivative of the function at the current point.

The Gradient Descent Algorithm
Data: x_0 ∈ R^n.
Step 0: set i = 0.
Step 1: if ∇f(x_i) ≈ 0, stop; else compute the search direction h_i = -∇f(x_i).
Step 2: compute the step size λ_i = argmin_{λ ≥ 0} f(x_i + λ h_i).
Step 3: set x_{i+1} = x_i + λ_i h_i and go to Step 1.
Various learning rates are tried in the above algorithm (in place of the exact minimization in Step 2).

Example
Given a function f(x_1, x_2) built from sine terms (the exact expression is not recoverable from the transcription; the constants 1.47, 0.34, and 1.9 survive), find the minimum when x_1 is allowed to vary from 0.5 to 1.5 and x_2 is allowed to vary from 0 to 2.
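A sketch of the algorithm above. The gradient is computed numerically, a crude grid search over λ stands in for the argmin of Step 2, and the tolerance and grid are made-up choices:

    import numpy as np

    def num_grad(f, x, h=1e-6):
        """Numerical gradient of f at x by central differences."""
        g = np.zeros_like(x)
        for j in range(len(x)):
            e = np.zeros_like(x)
            e[j] = h
            g[j] = (f(x + e) - f(x - e)) / (2*h)
        return g

    def gradient_descent(f, x0, tol=1e-6, max_iter=1000):
        x = np.asarray(x0, dtype=float)                 # Step 0
        for _ in range(max_iter):
            g = num_grad(f, x)
            if np.linalg.norm(g) < tol:                 # Step 1: stop if gradient ~ 0
                break
            h_dir = -g                                  # Step 1: search direction
            lambdas = np.linspace(0.001, 1.0, 200)      # Step 2: crude argmin over step sizes
            lam = min(lambdas, key=lambda l: f(x + l*h_dir))
            x = x + lam*h_dir                           # Step 3, then back to Step 1
        return x

    print(gradient_descent(lambda z: z[0]**2 + 5*z[1]**2, [1.0, 1.0]))  # near (0, 0)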

Gradient Descent Oscillations
We wish to descend like this. [Figure: a smooth path straight down to the minimum.]
The actual path may look like this: zig-zagging across the valley, slow to converge to the (local) optimum. [Figure.]

Lowering the learning rate = smaller steps in SGD:
- less ping-pong
- takes longer to get to the optimum

Learning Rate
[Figure.]

Picking the Learning Rate
Use grid search in log-space over small values on a tuning set: e.g., 0.01, 0.001, ...
Sometimes decrease the rate after each pass: e.g., by a factor of 1/(1 + d*t), where t is the pass number; sometimes 1/t^2.
Fancier techniques - adaptive gradient methods scale the gradient differently for each dimension (AdaGrad, Adam, ...).

Pros and Cons of Gradient Descent
Simple and often quite effective on machine learning tasks.
Often very scalable.
Only applies to smooth (differentiable) functions.
Better in general than other search methods, such as local search.
Might find a local minimum rather than a global one.
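The decay schedules mentioned above can be written down directly; a small sketch (η_0 = 0.1 and d = 0.01 are made-up values):

    def eta_inverse(eta0, d, t):
        """Decay by a factor of 1/(1 + d*t), where t is the pass number."""
        return eta0 / (1.0 + d * t)

    def eta_inverse_t2(eta0, t):
        """Alternative: decay like 1/t^2 (t >= 1)."""
        return eta0 / (t * t)

    for t in (1, 10, 100):
        print(t, eta_inverse(0.1, 0.01, t), eta_inverse_t2(0.1, t))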

Using Gradient Descent for NN
What functions are used in NNs?
Cost functions: e.g., f(x_i, w) = 1/2 (y* - y_i)^2
Activation functions: e.g., s(a) = 1/(1 + e^(-a))
Linear functions: e.g., x . w
Composed functions: e.g., sigmoid(x . w)
How do we compute derivatives with respect to w? Replace sign(x . w) with something differentiable, e.g., sigmoid(x . w). [Figure: the sign/step function.]

Computation of Derivative
The derivative of f: R -> R is a function f': R -> R given by
f'(x) = df/dx = lim_{h -> 0} (f(x + h) - f(x)) / h
if the limit exists.

Rules for Differentiation
Constant: d/dx (c) = 0
Power: d/dx (x^n) = n x^(n-1)
Sum: d/dx (u + v) = du/dx + dv/dx
Exp: d/dx (e^x) = e^x
Product: d/dx (u v) = u dv/dx + v du/dx
Log: d/dx (ln x) = 1/x
Quotient: d/dx (u/v) = (v du/dx - u dv/dx) / v^2
Chain rule: dy/dx = (dy/du)(du/dx). If y = f(g(x)) is the composite of f and g, then (f o g)'(x) = f'(g(x)) * g'(x), i.e. f' evaluated at u = g(x), times g' evaluated at x.

Example: Sigmoid function y = s(x) = 1/(1 + e^(-x))
y = 1/u, so dy/du = -1/u^2 by the quotient rule
u = 1 + v, so du/dv = 1 by the sum and power rules
v = e^w, so dv/dw = e^w by the exponential rule
w = -x, so dw/dx = -1 by the product and power rules
dy/dx = (dy/du)(du/dv)(dv/dw)(dw/dx) = (-1/u^2)(1)(e^w)(-1) = e^(-x)/u^2 = y(1 - y) = s(x)(1 - s(x))
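A quick numerical check of the result s'(x) = s(x)(1 - s(x)), using the limit definition from the previous slide with a small h (the test points are arbitrary):

    import math

    def s(x):
        return 1.0 / (1.0 + math.exp(-x))

    def finite_diff(f, x, h=1e-6):
        """Approximate f'(x) by (f(x + h) - f(x)) / h."""
        return (f(x + h) - f(x)) / h

    for x in (-2.0, 0.0, 1.5):
        print(finite_diff(s, x), s(x) * (1 - s(x)))   # the two columns agree closely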

Sigmoid Activation Function
s(u) = 1/(1 + e^(-u)); for logistic regression, P(Y | X) = 1/(1 + e^(-w . x))
Derivative of the sigmoid: ds(z)/dz = s(z)(1 - s(z))
[Figure: sigmoid curve on [-5, 5]; its derivative peaks at 0.25.]

Next: Derivative of Logistic Regression
<w, x> = w^T x = w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n, where x_0 = 1
Sigmoid function: σ(x) = 1/(1 + e^(-x)), dσ(z)/dz = σ(z)(1 - σ(z))
Logistic regression: f(x, w) = σ(<w, x>)
∇f = (∂f/∂w_1, ∂f/∂w_2, ..., ∂f/∂w_n) = ? (by the chain rule, ∂f/∂w_j = σ(<w, x>)(1 - σ(<w, x>)) x_j)
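A small sketch of that gradient; the chain-rule answer filled in above is the standard result, and the data values below are made up:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def logreg_grad(w, x):
        """Gradient of f(x, w) = sigmoid(<w, x>) with respect to w."""
        p = sigmoid(np.dot(w, x))
        return p * (1 - p) * x            # component j is sigma*(1 - sigma)*x_j

    w = np.array([0.5, -1.0, 2.0])
    x = np.array([1.0, 0.3, -0.7])        # x_0 = 1 folds the bias into w_0
    print(logreg_grad(w, x))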

Alternative Activation Functions
The logistic function is not widely used in modern NNs.
Hyperbolic tangent: t(z) = (1 - e^(-2z))/(1 + e^(-2z))
Derivative of the hyperbolic tangent: dt(z)/dz = (1 + t(z))(1 - t(z))
Like the logistic function, but shifted to the range [-1, +1].

Alternative Activation Functions
Rectified Linear Unit (ReLU): relu(a) = max(0, a)
[Figure: ReLU plot, zero for a < 0 and linear for a > 0.]

Alternative Activation Functions
Soft version of ReLU (softplus): r(x) = ln(e^x + 1)
Doesn't saturate (at one end); helps with the vanishing gradient.
Derivative of the soft ReLU: dr(x)/dx = 1/(1 + e^(-x)) = s(x)

Test Errors: sigmoid vs. tanh
[Figure from Glorot & Bengio (2010), AISTATS 2010: test error curves for depth-4 networks with sigmoid vs. tanh activations.]
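For reference, a sketch collecting the activations discussed in this section and their derivatives, written exactly as defined on the slides:

    import numpy as np

    def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))
    def d_sigmoid(z):  return sigmoid(z) * (1 - sigmoid(z))

    def tanh_act(z):   return (1 - np.exp(-2*z)) / (1 + np.exp(-2*z))
    def d_tanh(z):     return (1 + tanh_act(z)) * (1 - tanh_act(z))

    def relu(a):       return np.maximum(0.0, a)

    def softplus(x):   return np.log(np.exp(x) + 1)
    def d_softplus(x): return sigmoid(x)            # the derivative of softplus is the sigmoid

    z = np.linspace(-3, 3, 7)
    print(sigmoid(z), d_sigmoid(z), relu(z), softplus(z))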

XOR Example: Gradient of Error
F(x, w) = s(w_13 s(w_11 x_1 + w_21 x_2 + w_01) + w_23 s(w_12 x_1 + w_22 x_2 + w_02) + w_03)
E = 1/2 * sum_i (y_i* - F(x_i, w))^2
∂E/∂w_13 = sum_i (-(y_i* - y_3)) ∂y_3/∂w_13, where y_3 = s(z_3) and ∂y_3/∂w_13 = (∂s(z_3)/∂z_3)(∂z_3/∂w_13) = s(z_3)(1 - s(z_3)) y_1
If the sigmoid is used, ∂s(z_i)/∂z_i = s(z_i)(1 - s(z_i)) = y_i(1 - y_i).
For a first-layer weight: ∂y_3/∂w_11 = (∂s(z_3)/∂z_3)(∂z_3/∂y_1)(∂s(z_1)/∂z_1)(∂z_1/∂w_11) = s(z_3)(1 - s(z_3)) w_13 s(z_1)(1 - s(z_1)) x_1
[Figure: the XOR network with sums z_1, z_2, z_3 and the weights as before.]

Backprop Example: XOR
How to compute the updates for a general NN? Using the sigmoid and quadratic error (y_i = s(z_i), ds(z)/dz = s(z)(1 - s(z))), with Δ_3 = (y_3* - y_3), Δ_1 = w_13 y_3(1 - y_3) Δ_3, Δ_2 = w_23 y_3(1 - y_3) Δ_3, the updates for all w are:
w_13 <- w_13 + η y_1 y_3(1 - y_3) Δ_3
w_23 <- w_23 + η y_2 y_3(1 - y_3) Δ_3
w_03 <- w_03 + η y_3(1 - y_3) Δ_3 (* 1)
w_11 <- w_11 + η x_1 y_1(1 - y_1) Δ_1
w_21 <- w_21 + η x_2 y_1(1 - y_1) Δ_1
w_01 <- w_01 + η y_1(1 - y_1) Δ_1 (* 1)
w_12 <- w_12 + η x_1 y_2(1 - y_2) Δ_2
w_22 <- w_22 + η x_2 y_2(1 - y_2) Δ_2
w_02 <- w_02 + η y_2(1 - y_2) Δ_2 (* 1)
[Figure: the XOR network diagram, repeated.]
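Putting the updates above into a training loop; a sketch assuming online (per-sample) updates, small random initial weights, and a made-up learning rate:

    import numpy as np

    def s(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_xor(eta=0.5, epochs=20000, seed=0):
        rng = np.random.default_rng(seed)
        # Weight order: w11, w21, w01, w12, w22, w02, w13, w23, w03
        w = rng.uniform(-0.5, 0.5, size=9)
        data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
        for _ in range(epochs):
            for x1, x2, t in data:
                w11, w21, w01, w12, w22, w02, w13, w23, w03 = w
                y1 = s(w11*x1 + w21*x2 + w01)            # forward pass
                y2 = s(w12*x1 + w22*x2 + w02)
                y3 = s(w13*y1 + w23*y2 + w03)
                d3 = t - y3                              # Delta_3 = y_3* - y_3
                d1 = w13 * y3*(1 - y3) * d3              # Delta_1
                d2 = w23 * y3*(1 - y3) * d3              # Delta_2
                w += eta * np.array([
                    x1*y1*(1 - y1)*d1, x2*y1*(1 - y1)*d1, y1*(1 - y1)*d1,
                    x1*y2*(1 - y2)*d2, x2*y2*(1 - y2)*d2, y2*(1 - y2)*d2,
                    y1*y3*(1 - y3)*d3, y2*y3*(1 - y3)*d3, y3*(1 - y3)*d3,
                ])
        return w

    w = train_xor()
    # With these settings the outputs are typically close to 0, 1, 1, 0 on the four
    # XOR inputs (as noted above, gradient descent can occasionally stall in a local minimum).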