Inf2b Learning and Data
Lecture 12 (Chapter 12): Multi-layer neural networks
(Credit: Hiroshi Shimodaira, Iain Murray and Steve Renals)
Centre for Speech Technology Research (CSTR)
School of Informatics, University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/inf2b/
https://piazza.com/ed.ac.uk/spring2018/infr08009learning
Office hours: Wednesdays at 4:00-5:00 in IF-3.04
Jan-Mar 2018
Today's Schedule

1. Single-layer network with a single output node (recap)
2. Single-layer network with multiple output nodes
3. Activation functions
4. Multi-layer neural network
5. Overfitting and generalisation
6. Deep Neural Networks
Single-layer network with a single output node (recap)

Activation function: $y = g(a) = g\left(\sum_{i=0}^{D} w_i x_i\right)$, with the logistic sigmoid $g(a) = \frac{1}{1+\exp(-a)}$

Training set: $D = \{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, where $t_n \in \{0, 1\}$

Error function: $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (y_n - t_n)^2$

Optimisation problem (training): $\min_{\mathbf{w}} E(\mathbf{w})$
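To make the recap concrete, here is a minimal numpy sketch of this model; the helper names (`sigmoid`, `forward`, `sse_error`) and the convention of a fixed bias input $x_0 = 1$ in column 0 are our choices, not prescribed by the notes. Later sketches reuse these helpers and the numpy import.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(w, X):
    """y_n = g(sum_{i=0}^{D} w_i x_ni) for each row x_n of X.

    X has shape (N, D+1); column 0 holds the bias input x_0 = 1.
    """
    return sigmoid(X @ w)

def sse_error(w, X, t):
    """Sum-of-squares error E(w) = 1/2 sum_n (y_n - t_n)^2."""
    y = forward(w, X)
    return 0.5 * np.sum((y - t) ** 2)
```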
Training of single-layer neural network (recap)

Optimisation problem: $\min_{\mathbf{w}} E(\mathbf{w})$
- No analytic solution (no closed form)
- Employ an iterative method (requires initial values), e.g. gradient descent (steepest descent), Newton's method, conjugate gradient methods

Gradient descent:
  (scalar rep.) $w_i^{(\mathrm{new})} = w_i - \eta \, \frac{\partial E(\mathbf{w})}{\partial w_i}$,  $(\eta > 0)$
  (vector rep.) $\mathbf{w}^{(\mathrm{new})} = \mathbf{w} - \eta \, \nabla E(\mathbf{w})$,  $(\eta > 0)$

Online/stochastic gradient descent (cf. batch training): update the weights one pattern at a time. (See Note )
Training of the single-layer neural network (cont.)

$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} (y_n - t_n)^2 = \frac{1}{2}\sum_{n=1}^{N} (g(a_n) - t_n)^2$

$\frac{\partial E(\mathbf{w})}{\partial w_i} = \sum_{n=1}^{N} \frac{\partial E(\mathbf{w})}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w_i} = \sum_{n=1}^{N} (y_n - t_n) \, g'(a_n) \, x_{ni}$

where $y_n = g(a_n)$, $a_n = \sum_{i=0}^{D} w_i x_{ni}$, and $\frac{\partial a_n}{\partial w_i} = x_{ni}$.
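A sketch of batch gradient descent built on the derivative above, reusing `sigmoid` and `forward` from the previous sketch; the learning rate, step count and initialisation scale are illustrative values of ours. The online/stochastic variant would apply the same update one pattern at a time.

```python
def gradient(w, X, t):
    """dE/dw_i = sum_n (y_n - t_n) g'(a_n) x_ni, using g'(a) = g(a)(1 - g(a))."""
    y = forward(w, X)                 # y_n = g(a_n)
    return X.T @ ((y - t) * y * (1.0 - y))

def train_batch(X, t, eta=0.1, n_steps=1000, seed=0):
    """Gradient descent: w(new) = w - eta * dE/dw, from small random initial values."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal(X.shape[1])
    for _ in range(n_steps):
        w -= eta * gradient(w, X, t)
    return w
```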
Single-layer network with multiple output nodes

[Figure: network with inputs $x_0, \ldots, x_D$ and output nodes $y_1, \ldots, y_K$; weight $w_{ki}$ connects input $x_i$ to output node $y_k$.]

$K$ output nodes: $y_1, \ldots, y_K$. For $\mathbf{x}_n = (x_{n0}, \ldots, x_{nD})^T$,

$y_{nk} = g\left(\sum_{i=0}^{D} w_{ki} x_{ni}\right) = g(a_{nk})$, where $a_{nk} = \sum_{i=0}^{D} w_{ki} x_{ni}$
Single-layer network with multiple output nodes (cont.)

Training set: $D = \{(\mathbf{x}_1, \mathbf{t}_1), \ldots, (\mathbf{x}_N, \mathbf{t}_N)\}$, where $\mathbf{t}_n = (t_{n1}, \ldots, t_{nK})$ and $t_{nk} \in \{0, 1\}$

Error function:
$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \lVert \mathbf{y}_n - \mathbf{t}_n \rVert^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} (y_{nk} - t_{nk})^2 = \sum_{n=1}^{N} E_n$, where $E_n = \frac{1}{2}\sum_{k=1}^{K} (y_{nk} - t_{nk})^2$

Training by gradient descent: $w_{ki} = w_{ki} - \eta \, \frac{\partial E}{\partial w_{ki}}$,  $(\eta > 0)$
The derivatives of the error function (single-layer)

$E_n = \frac{1}{2}\sum_{k=1}^{K} (y_{nk} - t_{nk})^2$, with $y_{nk} = g(a_{nk})$ and $a_{nk} = \sum_{i=0}^{D} w_{ki} x_{ni}$

$\frac{\partial E_n}{\partial w_{ki}} = \frac{\partial E_n}{\partial y_{nk}} \frac{\partial y_{nk}}{\partial a_{nk}} \frac{\partial a_{nk}}{\partial w_{ki}} = (y_{nk} - t_{nk}) \, g'(a_{nk}) \, x_{ni}$
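The per-node derivative vectorises over all $K$ outputs at once. A sketch reusing `sigmoid` from the first snippet, assuming the weights are held in a (K, D+1) array `W` and the targets in an (N, K) array `T` (our layout choice):

```python
def forward_multi(W, X):
    """Y[n, k] = g(a_nk), a_nk = sum_i W[k, i] X[n, i]; returns shape (N, K)."""
    return sigmoid(X @ W.T)

def gradient_multi(W, X, T):
    """dE_n/dw_ki = (y_nk - t_nk) g'(a_nk) x_ni, summed over n into (K, D+1)."""
    Y = forward_multi(W, X)
    delta = (Y - T) * Y * (1.0 - Y)   # (y_nk - t_nk) * g'(a_nk)
    return delta.T @ X
```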
Notes on Activation functions

[Figure: single-layer network with inputs $x_0, \ldots, x_D$ and output nodes $y_1, \ldots, y_K$, as before.]

- Interpretation of output values
- Normalisation of the output values
Output of logistic sigmoid activation function

Consider a single-layer network with a single output node and a logistic sigmoid activation function:

$y = g(a) = \frac{1}{1+\exp(-a)} = \frac{1}{1+\exp\left(-\sum_{i=0}^{D} w_i x_i\right)}$

Consider a two-class problem, with classes $C_1$ and $C_2$. The posterior probability of $C_1$:

$P(C_1|\mathbf{x}) = \frac{p(\mathbf{x}|C_1)\,P(C_1)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|C_1)\,P(C_1)}{p(\mathbf{x}|C_1)\,P(C_1) + p(\mathbf{x}|C_2)\,P(C_2)} = \frac{1}{1+\exp\left(-\ln \frac{p(\mathbf{x}|C_1)\,P(C_1)}{p(\mathbf{x}|C_2)\,P(C_2)}\right)}$
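A quick numeric check of this identity: for illustrative 1-D Gaussian class conditionals (the means, variances and priors are our example numbers, and `sigmoid` is reused from the first sketch), the sigmoid of the log-odds matches the posterior computed directly from Bayes' rule.

```python
def gauss_pdf(x, mu, var):
    """1-D Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

x, P1, P2 = 1.3, 0.5, 0.5                      # test point and class priors
lik1 = gauss_pdf(x, mu=0.0, var=1.0)           # p(x | C1)
lik2 = gauss_pdf(x, mu=2.0, var=1.0)           # p(x | C2)

# Bayes' rule directly
post1 = lik1 * P1 / (lik1 * P1 + lik2 * P2)

# Sigmoid of the log-odds ln[ p(x|C1)P(C1) / (p(x|C2)P(C2)) ]
assert np.isclose(post1, sigmoid(np.log((lik1 * P1) / (lik2 * P2))))
```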
Approximation of posterior probabilities

[Figure: left, the logistic sigmoid function $g(a) = \frac{1}{1+\exp(-a)}$; right, scaled class-conditional densities $p(x|S)P(S)$ and $p(x|T)P(T)$ with $P(S) = P(T) = 0.5$, illustrating the posterior probabilities of two classes with Gaussian distributions.]
Normalisation of output nodes

Outputs with sigmoid activation function do not sum to one: $\sum_{k=1}^{K} y_k \ne 1$, where
$y_k = g(a_k) = \frac{1}{1+\exp(-a_k)}$, $a_k = \sum_{i=0}^{D} w_{ki} x_i$

Softmax activation function for $g(\cdot)$:
$y_k = \frac{\exp(a_k)}{\sum_{l=1}^{K} \exp(a_l)}$

Properties of the softmax function:
(i) $0 \le y_k \le 1$
(ii) $\sum_{k=1}^{K} y_k = 1$
(iii) differentiable
(iv) $y_k \approx P(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,P(C_k)}{\sum_{l=1}^{K} p(\mathbf{x}|C_l)\,P(C_l)}$
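A minimal softmax sketch. Subtracting `max(a)` before exponentiating is a standard numerical-stability trick not mentioned on the slide; the shift cancels in the ratio, so the outputs are unchanged.

```python
def softmax(a):
    """y_k = exp(a_k) / sum_l exp(a_l), computed stably."""
    e = np.exp(a - np.max(a))            # shift cancels in the ratio
    return e / np.sum(e)

y = softmax(np.array([2.0, 1.0, 0.1]))
assert np.all((0.0 <= y) & (y <= 1.0))   # property (i)
assert np.isclose(np.sum(y), 1.0)        # property (ii)
```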
Some questions on activation functions

Is the logistic sigmoid function necessary for a single-layer, single-output-node network?
  No, in terms of classification. We can replace it with $g(a) = a$. However, the decision boundaries can be different. (NB: a linear decision boundary (at $y = 0.5$) is formed in either case.)

What benefits are there in using the logistic sigmoid function in the case above?
  The output can be regarded as a posterior probability. Compared with a linear output node ($g(a) = a$), logistic regression normally forms a decision boundary that is more robust against noise.

What benefits are there in using nonlinear activation functions in multi-layer neural networks?
Logistic sigmoid vs a linear output node

Binary classification problem with the least squares error (LSE):

$g(a) = \frac{1}{1+\exp(-a)}$  vs  $g(a) = a$

[Figure: scatter plot of two classes with the decision boundaries obtained under each choice of $g$ (after Fig 4.4b in PRML, C. M. Bishop (2006))]
Multi-layer neural networks

Multi-layer perceptron (MLP)

Hidden-to-output weights: $w_{kj}^{(2)}$, trained by $w_{kj}^{(2)} = w_{kj}^{(2)} - \eta \, \frac{\partial E}{\partial w_{kj}^{(2)}}$

Input-to-hidden weights: $w_{ji}^{(1)}$, trained by $w_{ji}^{(1)} = w_{ji}^{(1)} - \eta \, \frac{\partial E}{\partial w_{ji}^{(1)}}$

[Figure: two-layer network with inputs $x_0, \ldots, x_D$, hidden units $z_1, \ldots, z_M$ with activation $h(\cdot)$, and output nodes $y_1, \ldots, y_K$.]
Training of MLP

1940s Warren McCulloch and Walter Pitts: threshold logic
      Donald Hebb: Hebbian learning
1957  Frank Rosenblatt: Perceptron
1969  Marvin Minsky and Seymour Papert: limitations of neural networks
1980  Kunihiko Fukushima: Neocognitron
1986  D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors" (1974, Paul Werbos)
The derivatives of the error function (two layers) (NEW)

$E_n = \frac{1}{2}\sum_{k=1}^{K} (y_{nk} - t_{nk})^2$

$y_{nk} = g(a_{nk})$, $a_{nk} = \sum_{j=0}^{M} w_{kj}^{(2)} z_{nj}$
$z_{nj} = h(b_{nj})$, $b_{nj} = \sum_{i=0}^{D} w_{ji}^{(1)} x_{ni}$

Hidden-to-output weights:
$\frac{\partial E_n}{\partial w_{kj}^{(2)}} = \frac{\partial E_n}{\partial y_{nk}} \frac{\partial y_{nk}}{\partial a_{nk}} \frac{\partial a_{nk}}{\partial w_{kj}^{(2)}} = (y_{nk} - t_{nk}) \, g'(a_{nk}) \, z_{nj}$

Input-to-hidden weights:
$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \frac{\partial E_n}{\partial z_{nj}} \frac{\partial z_{nj}}{\partial b_{nj}} \frac{\partial b_{nj}}{\partial w_{ji}^{(1)}} = \left(\sum_{k=1}^{K} (y_{nk} - t_{nk}) \, g'(a_{nk}) \, w_{kj}^{(2)}\right) h'(b_{nj}) \, x_{ni}$

where $\frac{\partial E_n}{\partial z_{nj}} = \sum_{k=1}^{K} (y_{nk} - t_{nk}) \frac{\partial y_{nk}}{\partial z_{nj}}$
Error back propagation (NEW)

$\frac{\partial E_n}{\partial w_{kj}^{(2)}} = \frac{\partial E_n}{\partial y_{nk}} \frac{\partial y_{nk}}{\partial a_{nk}} \frac{\partial a_{nk}}{\partial w_{kj}^{(2)}} = (y_{nk} - t_{nk}) \, g'(a_{nk}) \, z_{nj} = \delta_{nk}^{(2)} z_{nj}$,  where $\delta_{nk}^{(2)} = (y_{nk} - t_{nk}) \, g'(a_{nk})$

$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \frac{\partial E_n}{\partial z_{nj}} \frac{\partial z_{nj}}{\partial b_{nj}} \frac{\partial b_{nj}}{\partial w_{ji}^{(1)}} = \left(\sum_{k=1}^{K} \delta_{nk}^{(2)} w_{kj}^{(2)}\right) h'(b_{nj}) \, x_{ni}$

[Figure: two-layer network in which the error signals $\delta_{nk}^{(2)}$ propagate backwards from the output nodes through the hidden units.]
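A per-pattern sketch of these two formulas, assuming for concreteness that both $g$ and $h$ are the logistic sigmoid (so the derivatives can be computed from the unit outputs, $g'(a) = y(1-y)$); the array layout and helper names are ours, and `sigmoid` comes from the first sketch.

```python
def mlp_forward(W1, W2, x):
    """Forward pass for one pattern x with bias x[0] = 1.

    W1 is (M, D+1) input-to-hidden, W2 is (K, M+1) hidden-to-output;
    z[0] = 1 is the hidden-layer bias unit.
    """
    z = np.concatenate(([1.0], sigmoid(W1 @ x)))        # z_nj = h(b_nj)
    y = sigmoid(W2 @ z)                                 # y_nk = g(a_nk)
    return y, z

def mlp_backprop(W1, W2, x, t):
    """Per-pattern gradients dEn/dW2 and dEn/dW1 via back-propagation."""
    y, z = mlp_forward(W1, W2, x)
    delta2 = (y - t) * y * (1.0 - y)                    # (y_nk - t_nk) g'(a_nk)
    grad_W2 = np.outer(delta2, z)                       # delta2_k * z_j
    zh = z[1:]                                          # hidden outputs, no bias
    delta1 = (W2[:, 1:].T @ delta2) * zh * (1.0 - zh)   # back-propagated error
    grad_W1 = np.outer(delta1, x)                       # delta1_j * x_i
    return grad_W2, grad_W1
```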
Problems with multi-layer neural networks

Still difficult to train
- Computationally very expensive (e.g. weeks of training)
- Slow convergence ("vanishing gradients"; see the sketch below)
- Difficult to find the optimal network topology

Poor generalisation (under some conditions)
- Very good performance on the training set
- Poor performance on the test set
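On the vanishing-gradient point: for the logistic sigmoid, $g'(a) = g(a)(1-g(a)) \le 0.25$, and back-propagation multiplies the error signal by such a factor at every layer, so (assuming weights of moderate size, so the $g'$ factors dominate) the gradient shrinks geometrically with depth. A tiny illustration, reusing `sigmoid` from the first sketch:

```python
a = np.linspace(-10.0, 10.0, 1001)
g = sigmoid(a)
print(np.max(g * (1.0 - g)))   # 0.25: the largest possible sigmoid slope

# Ten stacked sigmoid layers scale the error signal by at most 0.25**10
print(0.25 ** 10)              # ~9.5e-07: the "vanishing gradient"
```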
Overfitting and generalisation

Example of curve fitting by a polynomial function:

$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{k=0}^{M} w_k x^k$

[Figure: fits of the polynomial to the same data points for $M = 1$, $M = 3$ and $M = 9$ (after Fig 1.4 in PRML, C. M. Bishop (2006))]

cf. memorising the training data
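The classic form of this demo (as in PRML Fig 1.4) fits polynomials to noisy samples of $\sin(2\pi x)$; the data generation below is our reconstruction of that setup. Training error falls as $M$ grows, but $M = 9$ interpolates the 10 points exactly, i.e. it memorises the training data and oscillates wildly between the points.

```python
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

for M in (1, 3, 9):
    w = np.polyfit(x, t, deg=M)              # least-squares fit of degree M
    E = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)
    print(M, E)                              # training error shrinks with M
```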
Breakthrough

1957  Frank Rosenblatt: Perceptron
1986  D. Rumelhart, G. Hinton, and R. Williams: Backpropagation
2006  G. Hinton et al. (U. Toronto), "Reducing the dimensionality of data with neural networks", Science.
2009  J. Schmidhuber (Swiss AI Lab IDSIA): winner at the ICDAR 2009 handwriting recognition competition
2010- many papers from U. Toronto, Microsoft, IBM, Google, ...

What are the ideas?
- Pretraining: train a single layer of feature detectors, then stack such layers to form several hidden layers
- Fine-tuning, dropout
- GPUs
- Convolutional networks (CNN), Long short-term memory (LSTM)
- Rectified linear units (ReLU)
Breakthrough (cont.)

[Figure: phone error rate (%) for speaker-independent phonetic recognition on TIMIT, plotted against year (roughly 1990-2015, with error rates between about 18% and 30%).]
Summary

- Training of single-layer network
- Activation functions
  - Approximation of posterior probabilities
  - Sigmoid function (for single output node)
  - Softmax function (for multiple output nodes)
- Training of multi-layer network with error back propagation

A very good reference: http://neuralnetworksanddeeplearning.com/