Inf2b Learning and Data
Lecture: Single layer Neural Networks (1)
(Credit: Hiroshi Shimodaira, Iain Murray and Steve Renals)
Centre for Speech Technology Research (CSTR)
School of Informatics, University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/inf2b/
https://piazza.com/ed.ac.uk/spring2019/infr08009inf2blearning
Office hours: Wednesdays at 4:00-5:00 in IF-3.04
Jan-Mar 2019
Today's Schedule

1. Discriminant functions (recap)
2. Decision boundary of linear discriminants (recap)
3. Discriminative training of linear discriminants (Perceptron)
4. Structures and decision boundaries of Perceptron
5. LSE training of linear discriminants
6. Appendix: calculus, gradient descent, linear regression
Discriminant functions (recap)

y_k(x) = ln( p(x|C_k) P(C_k) )                                             (a)
       = -(1/2) (x - µ_k)^T Σ_k^{-1} (x - µ_k) - (1/2) ln|Σ_k| + ln P(C_k) (b)
       = -(1/2) x^T Σ_k^{-1} x + µ_k^T Σ_k^{-1} x
         - (1/2) µ_k^T Σ_k^{-1} µ_k - (1/2) ln|Σ_k| + ln P(C_k)            (c)
Linear discriminants for a 2-class problem

y_1(x) = w_1^T x + w_{10}
y_2(x) = w_2^T x + w_{20}

Combined discriminant function:
y(x) = y_1(x) - y_2(x) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0

Decision:
C = 1, if y(x) >= 0
C = 2, if y(x) < 0
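To make the combined decision rule concrete, here is a minimal Python sketch; the weight and input values are invented for illustration and are not from the lecture:

```python
import numpy as np

def classify(x, w, w0):
    """Two-class linear discriminant: class 1 if y(x) >= 0, else class 2."""
    y = w @ x + w0                   # y(x) = w^T x + w_0
    return 1 if y >= 0 else 2

# Hypothetical combined weights w = w_1 - w_2 and bias w0 = w_10 - w_20
w = np.array([1.0, -2.0])
print(classify(np.array([3.0, 1.0]), w, w0=0.5))   # y(x) = 1.5 >= 0 -> class 1
```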
Decision boundary of linear discriminants

Decision boundary: y(x) = w^T x + w_0 = 0

Dimension   Decision boundary
2           line:        w_1 x_1 + w_2 x_2 + w_0 = 0
3           plane:       w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0 = 0
D           hyperplane:  (Σ_{i=1}^D w_i x_i) + w_0 = 0

NB: w is a normal vector to the hyperplane: for any two points x_a, x_b on the boundary, w^T(x_a - x_b) = 0.
Decision boundary of linear discriminant (2D)

y(x) = w_1 x_1 + w_2 x_2 + w_0 = 0
(x_2 = -(w_1/w_2) x_1 - w_0/w_2, when w_2 ≠ 0)

[Figure: the boundary line in the (x_1, x_2) plane separating regions C_1 and C_2; the normal vector w = (w_1, w_2)^T is orthogonal to the line, whose slope is -w_1/w_2 and intercept is -w_0/w_2.]
Decision boundary of linear discriminant (3D)

y(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0 = 0

[Figure: a plane in (x_1, x_2, x_3) space with normal vector w = (w_1, w_2, w_3)^T.]
Approach to linear discriminant functions

Generative models: p(x|C_k)
  Discriminant function based on Bayes decision rule:
    y_k(x) = ln p(x|C_k) + ln P(C_k)
  Gaussian pdf (model):
    y_k(x) = -(1/2)(x - µ_k)^T Σ_k^{-1} (x - µ_k) - (1/2) ln|Σ_k| + ln P(C_k)
  Equal covariance assumption:
    y_k(x) = w^T x + w_0

Why not estimate the decision boundary or P(C_k|x) directly?
  → Discriminative training / models (logistic regression, Perceptron / neural network, SVM)
Training linear discriminant functions directly

A discriminant for a two-class problem:
y(x) = y_1(x) - y_2(x) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0

[Figure: training samples of two classes in the (x_1, x_2) plane with a separating line and its normal vector w = (w_1, w_2)^T.]
Perceptron error correction algorithm

a(ẋ) = w^T x + w_0 = ẇ^T ẋ,  where ẇ = (w_0, w^T)^T and ẋ = (1, x^T)^T
Let's just use w and x to denote ẇ and ẋ from now on!

y(x) = g( a(x) ) = g( w^T x ),   g(a): activation / transfer function
where g(a) = 1 if a >= 0, and 0 if a < 0

Training set: D = {(x_1, t_1), ..., (x_N, t_N)}, where t_i ∈ {0, 1}: target value

Modify w if x_i was misclassified:
w^(new) ← w + η (t_i - y(x_i)) x_i,   (0 < η < 1): learning rate

NB: (w^(new))^T x_i = w^T x_i + η (t_i - y(x_i)) ||x_i||^2
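The bias-absorbing trick can be spelled out in code; a small sketch (names and values are my own), assuming NumPy:

```python
import numpy as np

def augment(x):
    """Prepend a constant 1 so the bias w_0 is absorbed into w."""
    return np.concatenate(([1.0], x))

def g(a):
    """Threshold activation: 1 if a >= 0, else 0."""
    return 1.0 if a >= 0 else 0.0

w = np.array([-0.5, 1.0, 1.0])        # (w_0, w_1, w_2), made-up values
x = augment(np.array([0.2, 0.1]))     # x = (1, x^T)^T
y = g(w @ x)                          # g(-0.2) = 0
```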
Geometry of Perceptron's error correction

y(x_i) = g(w^T x_i),   w^(new) ← w + η (t_i - y(x_i)) x_i   (0 < η < 1)

y(x_i)   t_i   t_i - y(x_i)
1        1      0   (correct: no change)
1        0     -1   (subtract η x_i)
0        1      1   (add η x_i)
0        0      0   (correct: no change)

Recall w^T x = |w||x| cos θ, where θ is the angle between w and x.

[Figures over three slides: a point x_i of class C_1 misclassified because w^T x_i < 0; the correction η x_i being added to w; and the resulting w^(new), whose boundary now puts x_i on the correct side.]
The Perceptron learning algorithm

Incremental (online) Perceptron algorithm:
  for i = 1, ..., N
    w ← w + η (t_i - y(x_i)) x_i

Batch Perceptron algorithm:
  v_sum = 0
  for i = 1, ..., N
    v_sum = v_sum + (t_i - y(x_i)) x_i
  w ← w + η v_sum

What about convergence?
The Perceptron learning algorithm terminates if the training samples are linearly separable.
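Both variants can be written compactly; a sketch assuming augmented inputs (leading 1) and targets t_i ∈ {0, 1}, with function names of my own choosing:

```python
import numpy as np

def g(a):
    return (a >= 0).astype(float)          # threshold activation

def perceptron_online(X, t, eta=0.1, max_epochs=100):
    """Incremental Perceptron: update after every misclassified sample."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(X, t):
            y_i = g(w @ x_i)
            if y_i != t_i:                 # update only on misclassification
                w += eta * (t_i - y_i) * x_i
                errors += 1
        if errors == 0:                    # converged: all samples correct
            break
    return w

def perceptron_batch(X, t, eta=0.1, max_epochs=100):
    """Batch Perceptron: accumulate corrections, apply once per epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        v_sum = (t - g(X @ w)) @ X         # sum_i (t_i - y(x_i)) x_i
        if not v_sum.any():                # no net correction this epoch
            break
        w += eta * v_sum
    return w
```

On linearly separable data the loops terminate; on non-separable data they run until max_epochs, which is why the convergence caveat above matters.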
Linearly separable vs linearly non-separable

[Figure: (a1), (a2) linearly separable data sets; (b) a linearly non-separable data set.]
Background of Perceptron

[Figure: a biological neuron (https://en.wikipedia.org/wiki/File:Neuron_Hand-tuned.svg) and (a) a function unit with inputs weighted by w_1, w_2, w_3 producing output y.]

1940s  Warren McCulloch and Walter Pitts: threshold logic
       Donald Hebb: Hebbian learning
1957   Frank Rosenblatt: Perceptron
Character recognition by Perceptron

[Figure: a W × H pixel grid, indexed from (1,1) to (W,H), with pixel (i,j) feeding a Perceptron for character recognition.]
Perceptron structures and decision boundaries

y(x) = g( a(x) ) = g( w^T x )
w = (w_0, w_1, ..., w_D)^T,  x = (1, x_1, ..., x_D)^T
where g(a) = 1 if a >= 0, and 0 if a < 0

[Figure: a single node with inputs 1, x_1, x_2 weighted by w_0, w_1, w_2, computing a(x) = w_0 + w_1 x_1 + w_2 x_2, together with the resulting boundary line in the (x_1, x_2) plane for the example weights given in the figure.]

NB: A one node/neuron constructs a decision boundary, which splits the input space into two regions.
Perceptron as a logical function

NOT:  x | y      OR:  x1 x2 | y     NAND:  x1 x2 | y     XOR:  x1 x2 | y
      0 | 1           0  0  | 0            0  0  | 1           0  0  | 0
      1 | 0           0  1  | 1            0  1  | 1           0  1  | 1
                      1  0  | 1            1  0  | 1           1  0  | 1
                      1  1  | 1            1  1  | 0           1  1  | 0

[Figure: the corresponding decision boundaries in the (x_1, x_2) unit square.]

Question: find the weights for each function (one possible answer is sketched below).
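One possible set of weights (my own choice, not necessarily the slide's intended answer) can be verified with a single threshold node; XOR has no solution because it is not linearly separable:

```python
import numpy as np

def node(w, x):
    """Single threshold unit on the augmented input (1, x_1, ...)."""
    return int(np.dot(w, np.concatenate(([1.0], x))) >= 0)

W_NOT  = np.array([ 0.5, -1.0])        # y = NOT x
W_OR   = np.array([-0.5,  1.0,  1.0])  # y = x1 OR x2
W_NAND = np.array([ 1.5, -1.0, -1.0])  # y = x1 NAND x2

for x in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    x = np.array(x)
    print(x, node(W_OR, x), node(W_NAND, x))
# No single node realises XOR; note XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)),
# which needs a second layer.
```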
Perceptron structures and decision boundaries (cont.)

[Figures over four slides: example Perceptron structures and the decision boundaries they produce in the (x_1, x_2) plane; each threshold node contributes one linear boundary.]
Problems with the Perceptron learning algorithm

- No training algorithm for multi-layer Perceptrons
- Non-convergence for linearly non-separable data
- Weights w are adjusted for misclassified data only (correctly classified data are not considered at all)

→ Consider not only misclassification (on training data), but also the optimality of the decision boundary:
  - Least squares error training
  - Large margin classifiers (e.g. SVM)
Training with least squares

Squared error function:
E(w) = (1/2) Σ_{n=1}^N ( w^T x_n - t_n )^2

Optimisation problem: min_w E(w)

One way to solve this is to apply gradient descent (steepest descent):
w ← w - η ∇_w E(w)
where η: step size (a small positive constant)
∇_w E(w) = ( ∂E/∂w_0, ..., ∂E/∂w_D )^T
Training with least squares (cont.)

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{n=1}^N ( w^T x_n - t_n )^2 ]
        = Σ_{n=1}^N ( w^T x_n - t_n ) ∂/∂w_i ( w^T x_n )
        = Σ_{n=1}^N ( w^T x_n - t_n ) x_{ni}

- Trainable in the linearly non-separable case
- Not robust (sensitive) against erroneous data (outliers) far away from the boundary
- More or less a linear discriminant
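The derived gradient leads directly to a training loop; a minimal sketch (step size, iteration count, and names are my own):

```python
import numpy as np

def lse_gradient(w, X, t):
    """Gradient of E(w) = 0.5 * sum_n (w^T x_n - t_n)^2, i.e. X^T (X w - t)."""
    return X.T @ (X @ w - t)

def train_lse(X, t, eta=0.01, n_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= eta * lse_gradient(w, X, t)   # w <- w - eta * grad E(w)
    return w
```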
Appendix: derivatives

Derivatives of functions of one variable:
df/dx = f'(x) = lim_{ɛ→0} [ f(x + ɛ) - f(x) ] / ɛ
e.g., f(x) = 4x^3,  f'(x) = 12x^2

Partial derivatives of functions of more than one variable:
∂f/∂x = lim_{ɛ→0} [ f(x + ɛ, y) - f(x, y) ] / ɛ
e.g., f(x, y) = x^2 y^3,  ∂f/∂x = 2x y^3
Derivative rules

Function                          Derivative
Constant:        c                0
Power:           x^n              n x^{n-1}   (e.g. x^2 → 2x)
Exponential:     e^x              e^x
Logarithm:       ln(x)            1/x
Sum rule:        f(x) + g(x)      f'(x) + g'(x)
Product rule:    f(x) g(x)        f'(x) g(x) + f(x) g'(x)
Reciprocal rule: 1/f(x)           -f'(x) / f(x)^2
Quotient rule:   f(x)/g(x)        [ f'(x) g(x) - f(x) g'(x) ] / g(x)^2
Chain rule:      f(g(x))          f'(g(x)) g'(x)
                 z = f(y), y = g(x)  ⇒  dz/dx = (dz/dy)(dy/dx)
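As a worked illustration (an example of my own) combining the product and chain rules:

```latex
\frac{d}{dx}\Bigl[x^{2} e^{3x}\Bigr]
  = 2x\,e^{3x} + x^{2}\cdot 3e^{3x}   % product rule; chain rule gives (e^{3x})' = 3e^{3x}
  = (2x + 3x^{2})\,e^{3x}
```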
Vectors of derivatives

Consider f(x), where x = (x_1, ..., x_D)^T
Notation: all partial derivatives put in a vector:
∇f(x) = ( ∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_D )^T

Example: f(x) = x_1^3 x_2^2
∇f(x) = ( 3x_1^2 x_2^2, 2x_1^3 x_2 )^T

Fact: f(x) changes most quickly in the direction ∇f(x)
Gradient descent (steepest descent)

First-order optimisation algorithm using ∇f(x)
Optimisation problem: min_x f(x)
Useful when analytic solutions (closed forms) are not available or are difficult to find

Algorithm:
1. Set an initial value x_0 and set t = 0
2. If ∇f(x_t) ≈ 0, then stop. Otherwise, do the following.
3. x_{t+1} = x_t - η ∇f(x_t) for η > 0
4. t = t + 1, and go to step 2.

Problem: stops at a local minimum (difficult to find a global minimum).
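A minimal sketch of the loop on a simple convex function (the example function, tolerance, and names are my own):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=10000):
    """Steepest descent: step against the gradient until it nearly vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:    # step 2: grad f(x_t) ~ 0 -> stop
            return x
        x = x - eta * g                # step 3: x_{t+1} = x_t - eta grad f(x_t)
    return x

# Example: f(x) = (x_1 - 1)^2 + 2 x_2^2 has gradient (2(x_1 - 1), 4 x_2)
x_min = gradient_descent(lambda x: np.array([2 * (x[0] - 1), 4 * x[1]]),
                         x0=[5.0, -3.0])
print(x_min)                           # close to (1, 0)
```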
Linear regression (one variable): least squares line fitting

Training set: D = {(x_n, t_n)}_{n=1}^N
Linear regression: t̂_n = a x_n + b
Objective function: E = Σ_{n=1}^N ( t_n - (a x_n + b) )^2
Optimisation problem: min_{a,b} E

Partial derivatives:
∂E/∂a = -2 Σ_{n=1}^N ( t_n - (a x_n + b) ) x_n
      = 2a Σ_{n=1}^N x_n^2 + 2b Σ_{n=1}^N x_n - 2 Σ_{n=1}^N t_n x_n
∂E/∂b = -2 Σ_{n=1}^N ( t_n - (a x_n + b) )
      = 2a Σ_{n=1}^N x_n + 2bN - 2 Σ_{n=1}^N t_n
Linear regression (one variable) (cont.)

Letting ∂E/∂a = 0 and ∂E/∂b = 0:

( Σ_{n=1}^N x_n^2   Σ_{n=1}^N x_n ) ( a )     ( Σ_{n=1}^N t_n x_n )
(                                 ) (   )  =  (                   )
( Σ_{n=1}^N x_n     N             ) ( b )     ( Σ_{n=1}^N t_n     )
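The 2×2 system can be solved directly; a sketch with synthetic, noise-free data (my own example):

```python
import numpy as np

def fit_line(x, t):
    """Solve the normal equations above for (a, b) in t ~ a x + b."""
    A = np.array([[np.sum(x**2), np.sum(x)],
                  [np.sum(x),    len(x)]])
    rhs = np.array([np.sum(t * x), np.sum(t)])
    return np.linalg.solve(A, rhs)     # (a, b)

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 * x + 1.0                      # targets on the line t = 2x + 1
print(fit_line(x, t))                  # approximately [2. 1.]
```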
Linear regression (multiple variables)

[Figure 3.1 of The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009: linear least squares fitting with X ∈ R^2; we seek the linear function of X that minimises the sum of squared residuals from Y.]

Training set: D = {(x_n, t_n)}_{n=1}^N, where x_n = (1, x_{n1}, ..., x_{nD})^T
Linear regression: t̂_n = w^T x_n
Objective function: E = Σ_{n=1}^N ( t_n - w^T x_n )^2
Optimisation problem: min_w E
Linear regression (multiple variables) (cont.)

E = Σ_{n=1}^N ( t_n - w^T x_n )^2
Partial derivatives: ∂E/∂w_i = -2 Σ_{n=1}^N ( t_n - w^T x_n ) x_{ni}

Vector/matrix representation (normal equations):

T = ( t_1, ..., t_N )^T,   X = ( x_1^T )   ( x_{10} ... x_{1d} )
                               (  ...  ) = (  ...              )
                               ( x_N^T )   ( x_{N0} ... x_{Nd} )

E = ( T - Xw )^T ( T - Xw )
∂E/∂w = -2 X^T ( T - Xw )

Letting ∂E/∂w = 0:
X^T ( T - Xw ) = 0  ⇒  X^T X w = X^T T  ⇒  w = ( X^T X )^{-1} X^T T
→ analytic solution if the inverse exists
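The closed form in code; np.linalg.solve (or lstsq) is preferred to an explicit inverse for numerical stability. The data are invented so that a known w is recovered exactly:

```python
import numpy as np

def fit_linear(X, T):
    """Normal equations: solve (X^T X) w = X^T T without forming an inverse."""
    return np.linalg.solve(X.T @ X, X.T @ T)

# Rows are augmented inputs x_n = (1, x_n1, ..., x_nD)^T
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
T = X @ np.array([0.5, 2.0, -1.0])     # targets generated from a known w
print(fit_linear(X, T))                # recovers [ 0.5  2.  -1. ]
# Equivalent: np.linalg.lstsq(X, T, rcond=None)[0]
```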
Summary

- Training discriminant functions directly (discriminative training)
- Perceptron training algorithm
  - Perceptron error correction algorithm
  - Least squares error + gradient descent algorithm
- Linearly separable vs linearly non-separable
- Perceptron structures and decision boundaries
- See Notes for a Perceptron with multiple output nodes