Statistical Machine Learning from Data

Multi-Layer Perceptrons
Samy Bengio, IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
bengio@idiap.ch, http://www.idiap.ch/~bengio
January 17, 2006

Artificial Neural Networks
[Figure: a single unit with inputs x_1, x_2, weights w_1, w_2, integration s = f(x; w) and transfer y = g(s).]
An ANN is a set of units (neurons) connected to each other. Each unit may have multiple inputs but has a single output. Each unit performs 2 functions:
integration: s = f(x; θ)
transfer: y = g(s)

Artificial Neural Networks: Functions
Example of an integration function: s = θ_0 + Σ_i x_i θ_i
Examples of transfer functions:
tanh: y = tanh(s)
sigmoid: y = 1 / (1 + exp(−s))
Some units receive inputs from the outside world, some generate outputs to the outside world; the other units are often named hidden. Hence, from the outside, an ANN can be viewed as a function. There are various forms of ANNs; the most popular is the Multi-Layer Perceptron (MLP).
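
To make the two functions concrete, here is a minimal sketch of a single unit in Python/NumPy (my own illustration, not code from the slides): it integrates its inputs with a weighted sum and then applies a tanh or sigmoid transfer.

```python
import numpy as np

def integrate(x, theta, theta0):
    """Integration: s = theta0 + sum_i x_i * theta_i."""
    return theta0 + np.dot(x, theta)

def sigmoid(s):
    """Sigmoid transfer: y = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

# A single unit with two inputs (weights and bias chosen arbitrarily).
x = np.array([0.5, -1.0])
theta = np.array([0.2, 0.7])
theta0 = 0.1

s = integrate(x, theta, theta0)
print("tanh unit:   ", np.tanh(s))
print("sigmoid unit:", sigmoid(s))
```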

Transfer Functions (Graphical View)
[Figure: plots of y = tanh(x), y = sigmoid(x) and y = x.]

Multi-Layer Perceptron (Graphical View)
[Figure: the inputs feed an input layer, followed by hidden layers and an output layer producing the outputs; the parameters sit on the connections between layers.]

Introduction
An MLP is a function: ŷ = MLP(x; θ)
The parameters are θ = {w^l_{i,j}, b^l_i : i, j, l}.
From now on, let x_i(p) be the i-th value of the p-th example, represented by the vector x(p) (and, when possible, let us drop p).
Each layer l (1 ≤ l ≤ M) is fully connected to the previous layer:
integration: s^l_i = b^l_i + Σ_j w^l_{i,j} y^{l−1}_j
transfer: y^l_i = tanh(s^l_i) or y^l_i = sigmoid(s^l_i) or y^l_i = s^l_i
The output of the zeroth layer contains the inputs: y^0_i = x_i
The output of the last layer M contains the outputs: ŷ_i = y^M_i
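
A minimal sketch of this forward recursion (my own illustration; it assumes a list of weight matrices W[l] and bias vectors b[l], one per layer, tanh hidden units and a linear output layer):

```python
import numpy as np

def mlp_forward(x, W, b):
    """Forward pass: y^0 = x, then s^l = b^l + W^l y^(l-1),
    y^l = tanh(s^l) for hidden layers, y^M = s^M for the linear output layer."""
    y = x                                        # y^0 contains the inputs
    for l in range(len(W)):
        s = b[l] + W[l] @ y                      # integration
        y = np.tanh(s) if l < len(W) - 1 else s  # transfer
    return y                                     # y^M contains the outputs

# Example: 2 inputs, one hidden layer of 3 tanh units, 1 linear output.
rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
b = [np.zeros(3), np.zeros(1)]
print(mlp_forward(np.array([0.8, 0.8]), W, b))
```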

Characteristics of MLPs
An MLP can approximate any continuous function. However, it needs at least 1 hidden layer (sometimes easier with 2) and enough units in each layer. Moreover, we have to find the correct values of the parameters θ, and this is an NP-complete problem!
How can we find these parameters? Answer: optimize a given criterion using a gradient method.
Note: the capacity is controlled by the number of parameters.

Separability (Graphical View)
[Figure: decision surfaces obtained with Linear, Linear+Sigmoid, Linear+Sigmoid+Linear, and Linear+Sigmoid+Linear+Sigmoid+Linear architectures.]

Objective: minimize a criterion C over a data set D_n:
C(D_n, θ) = Σ_{p=1}^{n} L(y(p), ŷ(p)), where ŷ(p) = MLP(x(p); θ)
We are searching for the best parameters θ*:
θ* = argmin_θ C(D_n, θ)
Gradient descent: an iterative procedure where, at each iteration s, we modify the parameters θ:
θ_{s+1} = θ_s − η ∂C(D_n, θ_s)/∂θ_s
where η is the learning rate. WARNING: local optima.

Back-propagation: The Basics
Chain rule: if a = f(b) and b = g(c), then
∂a/∂c = (∂a/∂b)(∂b/∂c) = f′(b) g′(c)

Back-propagation: The Basics
Sum rule: if a = f(b, c), b = g(d) and c = h(d), then
∂a/∂d = (∂a/∂b)(∂b/∂d) + (∂a/∂c)(∂c/∂d) = (∂f(b, c)/∂b) g′(d) + (∂f(b, c)/∂c) h′(d)

Back-propagation: Basics (Graphical View)
[Figure: the error measured by the criterion between targets and outputs is propagated backwards from the outputs towards the inputs, through every parameter to tune.]

Back-propagation: Criterion
First: we need to pass the gradient through the criterion. The global criterion C is:
C(D_n, θ) = Σ_{p=1}^{n} L(y(p), ŷ(p))
Example: the mean squared error (MSE) criterion:
L(y, ŷ) = Σ_{i=1}^{d} (1/2)(y_i − ŷ_i)^2
And its derivative with respect to the output ŷ_i:
∂L(y, ŷ)/∂ŷ_i = ŷ_i − y_i

Back-propagation: Last Layer
Second: the derivatives with respect to the parameters of the last layer M, where
ŷ_i = y^M_i = tanh(s^M_i) and s^M_i = b^M_i + Σ_j w^M_{i,j} y^{M−1}_j
The derivative with respect to w^M_{i,j} is:
∂ŷ_i/∂w^M_{i,j} = (∂ŷ_i/∂s^M_i)(∂s^M_i/∂w^M_{i,j}) = (1 − (y^M_i)^2) · y^{M−1}_j
And the derivative with respect to b^M_i is:
∂ŷ_i/∂b^M_i = (∂ŷ_i/∂s^M_i)(∂s^M_i/∂b^M_i) = (1 − (y^M_i)^2) · 1

Back-propagation: Other Layers
Third: the derivative with respect to the output y^l_j of a hidden layer:
∂ŷ_i/∂y^l_j = Σ_k (∂ŷ_i/∂y^{l+1}_k)(∂y^{l+1}_k/∂y^l_j)
where
∂y^{l+1}_k/∂y^l_j = (∂y^{l+1}_k/∂s^{l+1}_k)(∂s^{l+1}_k/∂y^l_j) = (1 − (y^{l+1}_k)^2) w^{l+1}_{k,j}
and, at the last layer, ∂ŷ_i/∂y^M_i = 1 and ∂ŷ_i/∂y^M_k = 0 for k ≠ i.

Back-propagation: Other Parameters
Fourth: the derivatives with respect to the parameters of hidden layer l:
∂ŷ_i/∂w^l_{j,k} = (∂ŷ_i/∂y^l_j)(∂y^l_j/∂s^l_j)(∂s^l_j/∂w^l_{j,k}) = (∂ŷ_i/∂y^l_j) (1 − (y^l_j)^2) y^{l−1}_k
∂ŷ_i/∂b^l_j = (∂ŷ_i/∂y^l_j)(∂y^l_j/∂s^l_j)(∂s^l_j/∂b^l_j) = (∂ŷ_i/∂y^l_j) (1 − (y^l_j)^2) · 1

Back-propagation: Global Algorithm
For each iteration:
1. Initialize the gradients: ∂C/∂θ_i = 0 for each θ_i
2. For each example z(p) = (x(p), y(p)):
   a. Forward phase: compute ŷ(p) = MLP(x(p); θ)
   b. Compute ∂L(y(p), ŷ(p))/∂ŷ(p)
   c. For each layer l from M down to 1:
      - compute ∂ŷ(p)/∂y^l_j
      - compute ∂y^l_j/∂w^l_{j,k} and ∂y^l_j/∂b^l_j
      - accumulate the gradients:
        ∂C/∂b^l_j += (∂L/∂ŷ(p)) (∂ŷ(p)/∂y^l_j) (∂y^l_j/∂b^l_j)
        ∂C/∂w^l_{j,k} += (∂L/∂ŷ(p)) (∂ŷ(p)/∂y^l_j) (∂y^l_j/∂w^l_{j,k})
3. Update the parameters: θ^{s+1}_i = θ^s_i − η ∂C/∂θ^s_i
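
For tanh hidden layers, a linear output layer and the MSE criterion, the whole algorithm fits in a short function. The following is a sketch of one batch iteration (my own code, not from the slides): forward phase, back-propagation of the error, gradient accumulation, then the parameter update.

```python
import numpy as np

def backprop_iteration(X, Y, W, b, eta=0.1):
    """One batch iteration: accumulate dC/dW, dC/db over all examples, then update.
    Hidden layers use tanh, the last layer is linear, the criterion is the MSE."""
    gW = [np.zeros_like(w) for w in W]
    gb = [np.zeros_like(v) for v in b]
    M = len(W)
    for x, y in zip(X, Y):
        # Forward phase: keep the output y^l of every layer.
        ys = [x]
        for l in range(M):
            s = b[l] + W[l] @ ys[-1]
            ys.append(np.tanh(s) if l < M - 1 else s)
        # Gradient of the MSE wrt the outputs: dL/dy_hat = y_hat - y
        # (equal to dL/ds^M since the output layer is linear).
        delta = ys[-1] - y
        for l in range(M - 1, -1, -1):
            gW[l] += np.outer(delta, ys[l])    # dC/dW^l
            gb[l] += delta                     # dC/db^l
            if l > 0:                          # propagate through the tanh layer below
                delta = (W[l].T @ delta) * (1.0 - ys[l] ** 2)
    # Parameter update: theta <- theta - eta * dC/dtheta.
    for l in range(M):
        W[l] -= eta * gW[l]
        b[l] -= eta * gb[l]
```

Calling this repeatedly on a small data set should make the MSE decrease, as in the worked example on the following slides.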

Back-propagation: An Example (1)
Let us start with a simple MLP: two inputs, two tanh hidden units and one linear output unit.
[Figure: the network with its initial weights and biases: 0.3, 0.6, 1.2, 1.1, 2.3, 1.3, 0.3, 0.7, 0.6.]

Back-propagation: An Example (2)
We forward one example and compute its MSE.
[Figure: inputs 0.8 and 0.8, hidden weighted sums 0.62 and 1.58, hidden outputs 0.55 and 0.92, linear output 1.07, target 0.5, MSE = 1.23.]

Back-propagation: An Example (3)
We back-propagate the gradient everywhere.
[Figure: each value and parameter is annotated with its gradient, e.g. dmse = 1.57 and ds = 1.57 at the linear output, and dy, ds, dw, db values throughout the hidden layer.]

Back-propagation: An Example (4)
We modify each parameter with learning rate 0.1.
[Figure: the updated weights and biases: 0.1, 0.77, 0.91, 1.23, 2.24, 1.19, 0.35, 0.81, 0.65.]

Back-propagation: An Example (5)
We forward the same example and compute its (smaller) MSE.
[Figure: hidden weighted sums 0.92 and 1.45, hidden outputs 0.73 and 0.89, linear output 0.24, target 0.5, MSE = 0.27.]

MLPs are Universal Approximators
It can be shown that, under reasonable assumptions, one can approximate any smooth function with an MLP having one layer of hidden units.
First intuition: let us consider a classification task.
Let us consider hard transfer functions for the hidden units:
y = step(s) = 1 if s > 0, 0 otherwise
Let us consider linear transfer functions for the output units: y = s.
First attempt:
ŷ = c + Σ_{i=1}^{N} v_i step( Σ_{j=1}^{M} w_{i,j} x_j + b_i )

Illustration: Universal Approximators
[A sequence of figures illustrating the construction.]
... but what about that?

Universal Approximation by Cosines
Let us consider simple functions of two variables, y(x_1, x_2).
Fourier decomposition: y(x_1, x_2) ≈ Σ_s A_s(x_1) cos(s x_2), where the coefficients A_s are functions of x_1.
Further Fourier decomposition: y(x_1, x_2) ≈ Σ_s Σ_l A_{s,l} cos(l x_1) cos(s x_2)
We know that cos(α) cos(β) = (1/2) cos(α + β) + (1/2) cos(α − β), hence:
y(x_1, x_2) ≈ Σ_s Σ_l A_{s,l} [ (1/2) cos(l x_1 + s x_2) + (1/2) cos(l x_1 − s x_2) ]

Universal Approximation by Cosines
The cos function can be approximated with a linear combination of step functions:
f(z) = f_0 + Σ_i (f_{i+1} − f_i) step(z − z_i)
So y(x_1, x_2) can be approximated by a linear combination of step functions whose arguments are linear combinations of x_1 and x_2, and which can themselves be approximated by tanh functions.
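
To make the step-function argument concrete, here is a small sketch (my own, with an arbitrarily chosen grid of knots z_i) that builds the staircase f(z) = f_0 + Σ_i (f_{i+1} − f_i) step(z − z_i) approximating cos on an interval:

```python
import numpy as np

def step(s):
    return (s > 0).astype(float)

# Approximate cos on [0, 2*pi] with a staircase built from step functions.
z_grid = np.linspace(0.0, 2.0 * np.pi, 20)   # knots z_i (arbitrary choice)
f = np.cos(z_grid)                           # f_i = cos(z_i)

def cos_approx(z):
    """f(z) = f_0 + sum_i (f_{i+1} - f_i) * step(z - z_i)."""
    total = np.full_like(z, f[0], dtype=float)
    for i in range(len(z_grid) - 1):
        total += (f[i + 1] - f[i]) * step(z - z_grid[i])
    return total

z = np.linspace(0.0, 2.0 * np.pi, 7)
print(np.round(cos_approx(z), 2))   # staircase values
print(np.round(np.cos(z), 2))       # true cos values
```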

ANN for Binary Classification
One output, with the target coded as {−1, 1} or {0, 1} depending on the output function of the last layer (linear, sigmoid, tanh, ...).
For a given output, the associated class corresponds to the nearest target.
How to obtain class posterior probabilities: use a sigmoid with targets {0, 1}; if the model is correctly trained (with, for instance, the MSE criterion), then
ŷ(x) = E[Y | X = x] = 1 · P(Y = 1 | X = x) + 0 · P(Y = 0 | X = x)
so the output encodes P(Y = 1 | X = x).
Note: we do not directly optimize the classification error...

ANN for Multiclass Classification
Simplest solution: one-hot encoding. One output per class, coded for instance as (0, ..., 1, ..., 0). For a given output, the associated class corresponds to the index of the maximum value in the output vector.
How to obtain class posterior probabilities: use a softmax:
ŷ_i = exp(s_i) / Σ_j exp(s_j)
Each output i then encodes P(Y = i | X = x).
Otherwise: each class corresponds to a different binary code. For example, for a 4-class problem we could use an 8-dimensional code for each class. For a given output, the associated class corresponds to the nearest code (according to a given distance). Example: Error Correcting Output Codes (ECOC).
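
A minimal sketch of the softmax output and the resulting decision (this variant subtracts the maximum before exponentiating for numerical stability, a detail not discussed on the slide):

```python
import numpy as np

def softmax(s):
    """y_i = exp(s_i) / sum_j exp(s_j), computed in a numerically stable way."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

s = np.array([1.2, -0.3, 0.5, 0.1])      # integration values of the output layer
probs = softmax(s)                        # estimates of P(Y = i | X = x)
print(np.round(probs, 3), probs.sum())    # probabilities sum to 1
print("predicted class:", int(np.argmax(probs)))
```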

Error Correcting Output Codes
Let us represent a 4-class problem with 6 bits:
class 1: 1 1 0 0 0 1
class 2: 1 0 0 0 1 0
class 3: 0 1 0 1 0 0
class 4: 0 0 1 0 0 0
We then create 6 classifiers (or 1 classifier with 6 outputs). For example, the first classifier will try to separate classes 1 and 2 from classes 3 and 4.

Error Correcting Output Codes
Given our 4-class problem represented with 6 bits:
class 1: 1 1 0 0 0 1
class 2: 1 0 0 0 1 0
class 3: 0 1 0 1 0 0
class 4: 0 0 1 0 0 0
When a new example comes, we compute the distance between the code produced by the 6 classifiers and the code of each of the 4 classes. Obtained code: 0 1 1 1 1 0. Distances (let us use the Manhattan distance): to class 1: 5, to class 2: 4, to class 3: 2, to class 4: 3. The example is therefore assigned to class 3, the nearest code.
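
The decoding step is easy to reproduce. A short sketch using the 4-class, 6-bit code above and the Manhattan distance (which coincides with the Hamming distance for binary codes):

```python
import numpy as np

codes = np.array([[1, 1, 0, 0, 0, 1],    # class 1
                  [1, 0, 0, 0, 1, 0],    # class 2
                  [0, 1, 0, 1, 0, 0],    # class 3
                  [0, 0, 1, 0, 0, 0]])   # class 4

obtained = np.array([0, 1, 1, 1, 1, 0])  # outputs of the 6 classifiers

# Manhattan distance between the obtained code and each class code.
distances = np.abs(codes - obtained).sum(axis=1)
print(distances)                                           # [5 4 2 3]
print("predicted class:", int(np.argmin(distances)) + 1)   # class 3
```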

What is a Good Error Correcting Output Code?
How do we devise a good error correcting output code? Maximize the minimum Hamming distance between any pair of code words. A good ECOC should satisfy two properties:
Row separation: code words should be well separated in Hamming distance.
Column separation: the column functions should be as uncorrelated as possible with each other.

A good book on making ANNs work: G. B. Orr and K. Müller, Neural Networks: Tricks of the Trade, Springer, 1998.
Content: stochastic gradient, initialization, learning rate and learning rate decay, weight decay.

Stochastic Gradient
The gradient descent technique presented so far is batch: first accumulate the gradient from all examples, then adjust the parameters. What if the data set is very big and contains redundancies?
Other solution: stochastic gradient descent, which adjusts the parameters after each example instead. It is stochastic because we approximate the full gradient with its estimate at each example. Nevertheless, convergence proofs exist for such methods. Moreover, it is much faster for large data sets!
Other gradient techniques: second-order methods such as conjugate gradient are good for small data sets.
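
A schematic sketch of the two update regimes (my own illustration; gradient(theta, z) is an assumed helper returning the gradient of the loss on one example, and theta is a NumPy array):

```python
def batch_gradient_descent(theta, data, eta, gradient):
    """Accumulate the gradient over all examples, then adjust the parameters once."""
    grad = sum(gradient(theta, z) for z in data)
    return theta - eta * grad

def stochastic_gradient_descent(theta, data, eta, gradient):
    """Adjust the parameters after each example, using its gradient as a noisy
    estimate of the full gradient (often much faster on large, redundant data sets)."""
    for z in data:
        theta = theta - eta * gradient(theta, z)
    return theta
```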

Initialization
How should we initialize the parameters of an ANN? One common problem: saturation. When the weighted sum is large, the output of the tanh (or sigmoid) saturates and the gradient tends towards 0.
[Figure: near the origin of the weighted sum the derivative is good; far from it the derivative is almost zero.]

Initialization
Hence, we should initialize the parameters such that the average weighted sum falls in the linear part of the transfer function (see Léon Bottou's thesis for details):
input data: normalized with zero mean and unit variance;
targets:
  regression: normalized with zero mean and unit variance;
  classification:
    output transfer function tanh: targets 0.6 and −0.6;
    output transfer function sigmoid: targets 0.8 and 0.2;
    output transfer function linear: targets 0.6 and −0.6;
parameters: uniformly distributed in [−1/fan_in, 1/fan_in].
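
A sketch of this recipe for one fully connected layer (the interval follows the reconstruction above; some references use 1/√fan_in instead, so treat the exact bound as an assumption):

```python
import numpy as np

def init_layer(fan_in, fan_out, rng=None):
    """Draw weights and biases uniformly in [-1/fan_in, 1/fan_in], so that the
    average weighted sum of normalized inputs stays in the linear part of tanh."""
    rng = rng if rng is not None else np.random.default_rng(0)
    bound = 1.0 / fan_in
    W = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    b = rng.uniform(-bound, bound, size=fan_out)
    return W, b

W, b = init_layer(fan_in=10, fan_out=5)
print(W.shape, b.shape, float(W.min()), float(W.max()))
```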

Learning Rate and Learning Rate Decay
How do we select the learning rate η? If η is too big, the optimization diverges; if η is too small, the optimization is very slow and may get stuck in local minima.
One solution: progressive decay, with an initial learning rate η_0 and a learning rate decay η_d. At each iteration s:
η(s) = η_0 / (1 + s · η_d)
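
A one-function sketch of this schedule:

```python
def learning_rate(s, eta0=1.0, eta_d=0.1):
    """Progressive decay: eta(s) = eta0 / (1 + s * eta_d)."""
    return eta0 / (1.0 + s * eta_d)

# With eta0 = 1 and eta_d = 0.1, as plotted on the next slide.
print([round(learning_rate(s), 3) for s in (0, 10, 50, 100)])  # [1.0, 0.5, 0.167, 0.091]
```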

Learning Rate Decay (Graphical View)
[Figure: plot of 1/(1 + 0.1·x) for x from 0 to 100, decaying from 1 towards 0.]

Weight Decay
One way to control the capacity: regularization. For MLPs, when the weights tend to 0, the sigmoid or tanh functions are almost linear, hence of low capacity.
Weight decay: penalize solutions with high weights and biases (in amplitude):
C(D_n, θ) = Σ_{p=1}^{n} L(y(p), ŷ(p)) + β Σ_j θ_j^2
where β controls the weight decay. It is easy to implement:
θ^{s+1}_j = θ^s_j − η ∂/∂θ^s_j [ Σ_{p=1}^{n} L(y(p), ŷ(p)) ] − η β θ^s_j
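
A sketch of this regularized update for a parameter vector (loss_grad stands for the gradient of the data term, an assumed input):

```python
import numpy as np

def weight_decay_update(theta, loss_grad, eta, beta):
    """theta <- theta - eta * dL/dtheta - eta * beta * theta: the extra term
    pulls the weights towards 0, keeping tanh/sigmoid units nearly linear."""
    return theta - eta * loss_grad - eta * beta * theta

theta = np.array([2.0, -1.5, 0.3])
grad = np.array([0.1, -0.2, 0.05])   # gradient of sum_p L(y(p), y_hat(p)) wrt theta
print(weight_decay_update(theta, grad, eta=0.1, beta=0.01))
```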

Examples of Training Criteria
Mean-squared error, for regression:
L(y, ŷ) = Σ_{i=1}^{d} (1/2)(y_i − ŷ_i)^2
Cross-entropy criterion, for classification (targets in {−1, 1}):
L(y, ŷ) = log(1 + exp(−y ŷ))
Hard version:
L(y, ŷ) = |1 − y ŷ|_+
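
A sketch of these criteria for a single output (targets in {−1, 1} for the two classification losses):

```python
import numpy as np

def mse(y, y_hat):
    """Mean-squared error: 0.5 * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

def cross_entropy(y, y_hat):
    """log(1 + exp(-y * y_hat)), for a target y in {-1, 1}."""
    return np.log1p(np.exp(-y * y_hat))

def hard_criterion(y, y_hat):
    """|1 - y * y_hat|_+ : zero once the example is classified with margin 1."""
    return max(0.0, 1.0 - y * y_hat)

print(mse([1.0, 0.0], [0.8, 0.2]))                      # regression-style loss
print(cross_entropy(1.0, 0.3), hard_criterion(1.0, 0.3))
```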

Examples of Training Criteria (Graphical View)
[Figure: plot of log(1 + exp(−x)) and 1 − x + 0.5·x·x for x from −5 to 3.]