Statistical Machine Learning from Data
Multi-Layer Perceptrons
Samy Bengio
IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
bengio@idiap.ch
January 17, 2006

Generalities

Artificial Neural Networks
[Figure: a single unit with inputs x1 and x2, weights w1 and w2, integration s = f(x; w) and output y = g(s)]
An ANN is a set of units (neurons) connected to each other.
Each unit may have multiple inputs but has one output.
Each unit performs 2 functions:
  integration: $s = f(x; \theta)$
  transfer: $y = g(s)$

Artificial Neural Networks: Functions
Example of integration function: $s = \theta_0 + \sum_i x_i \theta_i$
Examples of transfer functions:
  tanh: $y = \tanh(s)$
  sigmoid: $y = \frac{1}{1 + \exp(-s)}$
Some units receive inputs from the outside world.
Some units generate outputs to the outside world.
The other units are often named hidden.
Hence, from the outside, an ANN can be viewed as a function.
There are various forms of ANNs. The most popular is the Multi-Layer Perceptron (MLP).
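
The two functions of a unit can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `integrate`, `sigmoid` and `unit`, and the numeric values of `x` and `theta`, are hypothetical and do not come from the slides.

```python
import math

def integrate(x, theta):
    """Integration: s = theta[0] + sum_i x[i] * theta[i + 1]."""
    return theta[0] + sum(xi * ti for xi, ti in zip(x, theta[1:]))

def sigmoid(s):
    """Sigmoid transfer: y = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def unit(x, theta, transfer=math.tanh):
    """One unit: integration followed by a transfer function."""
    return transfer(integrate(x, theta))

# Hypothetical inputs and parameters, chosen only for illustration.
x = [1.0, -2.0]
theta = [0.5, 0.3, 0.1]          # theta[0] plays the role of theta_0

s = integrate(x, theta)          # 0.5 + 1.0 * 0.3 + (-2.0) * 0.1
y_tanh = unit(x, theta)          # tanh transfer
y_sig = unit(x, theta, sigmoid)  # sigmoid transfer
```

Passing the transfer function as an argument keeps the integration step unchanged when switching between tanh, sigmoid and a linear unit.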

Transfer Functions (Graphical View)
[Figure: plots of y = tanh(x), y = sigmoid(x) and y = x]

(Graphical View)
[Figure: an MLP drawn as layers of units, labelled inputs, input layer, hidden layers, output layer and outputs; the connections between units carry the parameters]

An MLP is a function: $\hat{y} = \mathrm{MLP}(x; \theta)$
The parameters: $\theta = \{w^l_{i,j}, b^l_i : \forall i, j, l\}$
From now on, let $x_i(p)$ be the $i$-th value in the $p$-th example, represented by vector $x(p)$ (and when possible, let us drop $p$).
Each layer $l$ ($1 \le l \le M$) is fully connected to the previous layer.
Integration: $s^l_i = b^l_i + \sum_j y^{l-1}_j w^l_{i,j}$
Transfer: $y^l_i = \tanh(s^l_i)$ or $y^l_i = \mathrm{sigmoid}(s^l_i)$ or $y^l_i = s^l_i$
The output of the zeroth layer contains the inputs: $y^0_i = x_i$
The output of the last layer $M$ contains the outputs: $\hat{y}_i = y^M_i$
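
The layer-by-layer definition above translates directly into a forward pass. The following is a minimal sketch, assuming each layer is stored as a hypothetical triple (weights, biases, transfer) with `weights[i][j]` standing for $w^l_{i,j}$; the parameter values are made up for illustration.

```python
import math

def layer_forward(y_prev, weights, biases, transfer):
    """Compute y^l from y^(l-1): s_i = b_i + sum_j y_j * w_ij, then y_i = transfer(s_i)."""
    out = []
    for w_i, b_i in zip(weights, biases):
        s_i = b_i + sum(y_j * w_ij for y_j, w_ij in zip(y_prev, w_i))
        out.append(transfer(s_i))
    return out

def mlp_forward(x, layers):
    """Apply every layer in turn, starting from y^0 = x."""
    y = list(x)
    for weights, biases, transfer in layers:
        y = layer_forward(y, weights, biases, transfer)
    return y

def identity(s):
    """Linear transfer: y = s."""
    return s

# Hypothetical network: 2 inputs, 2 tanh hidden units, 1 linear output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.1], math.tanh),
    ([[2.0, -1.0]], [0.3], identity),
]
y_hat = mlp_forward([1.0, 1.0], layers)
```

With these values the first hidden unit receives a weighted sum of 0 and the second a weighted sum of 2.1, so the output equals 0.3 - tanh(2.1).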

Characteristics of MLPs
An MLP can approximate any continuous function.
However, it needs to have at least 1 hidden layer (sometimes easier with 2), and enough units in each layer.
Moreover, we have to find the correct value of the parameters $\theta$.
This is an NP-complete problem!
How can we find these parameters?
Answer: optimize a given criterion using a gradient method.
Note: the capacity is controlled by the number of parameters.

Separability
[Figure: four panels titled Linear, Linear+Sigmoid, Linear+Sigmoid+Linear and Linear+Sigmoid+Linear+Sigmoid+Linear]

Objective: minimize a criterion $C$ over a set of data $D_n$:
$C(D_n, \theta) = \sum_{p=1}^{n} L(y(p), \hat{y}(p))$ where $\hat{y}(p) = \mathrm{MLP}(x(p); \theta)$
We are searching for the best parameters $\theta$:
$\theta^* = \arg\min_\theta C(D_n, \theta)$
Gradient descent: an iterative procedure where, at each iteration $s$, we modify the parameters $\theta$:
$\theta^{s+1} = \theta^s - \eta \frac{\partial C(D_n, \theta^s)}{\partial \theta^s}$
where $\eta$ is the learning rate.
WARNING: local optima.
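
The update rule can be tried on a toy criterion. The sketch below assumes a hypothetical one-parameter criterion $C(\theta) = (\theta - 3)^2$, which is not from the slides and has a single optimum, so the warning about local optima does not show up here.

```python
def gradient_descent(grad, theta0, eta, n_iter):
    """Repeat theta <- theta - eta * dC/dtheta for n_iter iterations."""
    theta = theta0
    for _ in range(n_iter):
        theta = theta - eta * grad(theta)
    return theta

def grad_c(theta):
    """Gradient of the hypothetical criterion C(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

# A moderate learning rate converges towards the minimum at theta = 3.
theta_star = gradient_descent(grad_c, theta0=0.0, eta=0.1, n_iter=100)

# A learning rate that is too large makes the iterates move away from it.
theta_bad = gradient_descent(grad_c, theta0=0.0, eta=1.1, n_iter=50)
```

On this criterion each step multiplies the distance to the optimum by (1 - 2 * eta), which is why eta = 0.1 converges and eta = 1.1 diverges.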

The Basics: Chain Rule
If $a = f(b)$ and $b = g(c)$, then
$\frac{\partial a}{\partial c} = \frac{\partial a}{\partial b} \cdot \frac{\partial b}{\partial c} = f'(b) \cdot g'(c)$
[Figure: computation graph in which c feeds b = g(c), which feeds a = f(b)]
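
The Basics: Sum Rule
If $a = f(b, c)$ and $b = g(d)$ and $c = h(d)$, then
$\frac{\partial a}{\partial d} = \frac{\partial a}{\partial b} \cdot \frac{\partial b}{\partial d} + \frac{\partial a}{\partial c} \cdot \frac{\partial c}{\partial d}$
$\frac{\partial a}{\partial d} = \frac{\partial f(b, c)}{\partial b} \cdot g'(d) + \frac{\partial f(b, c)}{\partial c} \cdot h'(d)$
[Figure: computation graph in which d feeds both b = g(d) and c = h(d), which both feed a = f(b, c)]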
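
The sum rule can be checked numerically. The sketch below assumes hypothetical functions $g(d) = \sin(d)$, $h(d) = d^2$ and $f(b, c) = b \cdot c$, chosen only for illustration, and compares the analytic derivative with a central finite difference.

```python
import math

def a_of_d(d):
    """a = f(b, c) = b * c with b = g(d) = sin(d) and c = h(d) = d^2."""
    b = math.sin(d)
    c = d * d
    return b * c

def da_dd_sum_rule(d):
    """da/dd = da/db * db/dd + da/dc * dc/dd."""
    b, c = math.sin(d), d * d
    da_db, da_dc = c, b                     # partial derivatives of f(b, c) = b * c
    db_dd, dc_dd = math.cos(d), 2.0 * d     # g'(d) and h'(d)
    return da_db * db_dd + da_dc * dc_dd

def finite_difference(fn, d, eps=1e-6):
    """Central difference approximation of the derivative of fn at d."""
    return (fn(d + eps) - fn(d - eps)) / (2.0 * eps)

d = 0.7
analytic = da_dd_sum_rule(d)
numeric = finite_difference(a_of_d, d)
```

The same kind of finite-difference comparison is a common way to test a back-propagation implementation.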

Basics (Graphical View)
[Figure: the inputs go through the network to produce outputs, which are compared with the targets by the criterion; the error is back-propagated from the criterion down to the parameter to tune]

Criterion
First: we need to pass the gradient through the criterion.
The global criterion $C$ is:
$C(D_n, \theta) = \sum_{p=1}^{n} L(y(p), \hat{y}(p))$
Example: the mean squared error criterion (MSE):
$L(y, \hat{y}) = \sum_{i=1}^{d} \frac{1}{2} (y_i - \hat{y}_i)^2$
And the derivative with respect to the output $\hat{y}_i$:
$\frac{\partial L(y, \hat{y})}{\partial \hat{y}_i} = \hat{y}_i - y_i$

Last Layer
Second: derivative wrt the parameters of the last layer $M$:
$\hat{y}_i = y^M_i = \tanh(s^M_i)$
$s^M_i = b^M_i + \sum_j y^{M-1}_j w^M_{i,j}$
Hence the derivative with respect to $w^M_{i,j}$ is:
$\frac{\partial \hat{y}_i}{\partial w^M_{i,j}} = \frac{\partial \hat{y}_i}{\partial s^M_i} \cdot \frac{\partial s^M_i}{\partial w^M_{i,j}} = (1 - (y^M_i)^2) \cdot y^{M-1}_j$
And the derivative with respect to $b^M_i$ is:
$\frac{\partial \hat{y}_i}{\partial b^M_i} = \frac{\partial \hat{y}_i}{\partial s^M_i} \cdot \frac{\partial s^M_i}{\partial b^M_i} = (1 - (y^M_i)^2) \cdot 1$

Other Layers
Third: derivative wrt the output of a hidden layer $y^l_j$:
$\frac{\partial \hat{y}_i}{\partial y^l_j} = \sum_k \frac{\partial \hat{y}_i}{\partial y^{l+1}_k} \cdot \frac{\partial y^{l+1}_k}{\partial y^l_j}$
where
$\frac{\partial y^{l+1}_k}{\partial y^l_j} = \frac{\partial y^{l+1}_k}{\partial s^{l+1}_k} \cdot \frac{\partial s^{l+1}_k}{\partial y^l_j} = (1 - (y^{l+1}_k)^2) \cdot w^{l+1}_{k,j}$
and
$\frac{\partial \hat{y}_i}{\partial y^M_i} = 1$ and $\frac{\partial \hat{y}_i}{\partial y^M_{k \ne i}} = 0$
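
Other Parameters
Fourth: derivative wrt the parameters of hidden layer $y^l_j$ (written here for a tanh transfer function, as for the last layer):
$\frac{\partial \hat{y}_i}{\partial w^l_{j,k}} = \frac{\partial \hat{y}_i}{\partial y^l_j} \cdot \frac{\partial y^l_j}{\partial w^l_{j,k}} = \frac{\partial \hat{y}_i}{\partial y^l_j} \cdot (1 - (y^l_j)^2) \cdot y^{l-1}_k$
and
$\frac{\partial \hat{y}_i}{\partial b^l_j} = \frac{\partial \hat{y}_i}{\partial y^l_j} \cdot \frac{\partial y^l_j}{\partial b^l_j} = \frac{\partial \hat{y}_i}{\partial y^l_j} \cdot (1 - (y^l_j)^2) \cdot 1$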
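
Global Algorithm
For each iteration:
1. Initialize gradients $\frac{\partial C}{\partial \theta_i} = 0$ for each $\theta_i$
2. For each example $z(p) = (x(p), y(p))$:
   1. Forward phase: compute $\hat{y}(p) = \mathrm{MLP}(x(p), \theta)$
   2. Compute $\frac{\partial L(y(p), \hat{y}(p))}{\partial \hat{y}(p)}$
   3. For each layer $l$ from $M$ to 1:
      1. Compute $\frac{\partial \hat{y}(p)}{\partial y^l_j}$
      2. Compute $\frac{\partial y^l_j}{\partial b^l_j}$ and $\frac{\partial y^l_j}{\partial w^l_{j,k}}$
      3. Accumulate gradients:
         $\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial b^l_j} + \frac{\partial L}{\partial \hat{y}(p)} \cdot \frac{\partial \hat{y}(p)}{\partial y^l_j} \cdot \frac{\partial y^l_j}{\partial b^l_j}$
         $\frac{\partial C}{\partial w^l_{j,k}} = \frac{\partial C}{\partial w^l_{j,k}} + \frac{\partial L}{\partial \hat{y}(p)} \cdot \frac{\partial \hat{y}(p)}{\partial y^l_j} \cdot \frac{\partial y^l_j}{\partial w^l_{j,k}}$
3. Update the parameters: $\theta^{s+1}_i = \theta^s_i - \eta \cdot \frac{\partial C}{\partial \theta^s_i}$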
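
The algorithm can be sketched for the smallest interesting case. The code below is an illustration under assumptions, not the author's implementation: it fixes the architecture to one tanh hidden layer and one linear output unit, uses the MSE criterion, and all names, data and initial parameter values are hypothetical.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward phase: one tanh hidden layer followed by one linear output unit."""
    h = [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
         for row, b in zip(W1, b1)]
    y_hat = b2 + sum(w * hj for w, hj in zip(W2, h))
    return h, y_hat

def backward(x, y, h, y_hat, W2):
    """Gradients of L = 0.5 * (y - y_hat)^2 with respect to every parameter."""
    d_yhat = y_hat - y                                        # dL/dy_hat
    dW2 = [d_yhat * hj for hj in h]                           # linear output unit
    db2 = d_yhat
    d_h = [d_yhat * w for w in W2]                            # back through output weights
    d_s = [dh * (1.0 - hj * hj) for dh, hj in zip(d_h, h)]    # tanh'(s) = 1 - y^2
    dW1 = [[ds * xi for xi in x] for ds in d_s]
    db1 = d_s
    return dW1, db1, dW2, db2

def train_step(data, W1, b1, W2, b2, eta):
    """One batch iteration: reset, accumulate over all examples, then update."""
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    cost = 0.0
    for x, y in data:
        h, y_hat = forward(x, W1, b1, W2, b2)
        cost += 0.5 * (y - y_hat) ** 2
        dW1, db1, dW2, db2 = backward(x, y, h, y_hat, W2)
        for j in range(len(W1)):
            for k in range(len(W1[j])):
                gW1[j][k] += dW1[j][k]
            gb1[j] += db1[j]
            gW2[j] += dW2[j]
        gb2 += db2
    W1 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, gW1)]
    b1 = [b - eta * g for b, g in zip(b1, gb1)]
    W2 = [w - eta * g for w, g in zip(W2, gW2)]
    b2 = b2 - eta * gb2
    return W1, b1, W2, b2, cost

# Hypothetical toy data set and initial parameters.
data = [([0.0, 1.0], 0.5), ([1.0, 0.0], -0.5)]
W1, b1 = [[0.3, -0.2], [0.1, 0.4]], [0.0, 0.0]
W2, b2 = [0.5, -0.3], 0.1
costs = []
for _ in range(200):
    W1, b1, W2, b2, cost = train_step(data, W1, b1, W2, b2, eta=0.1)
    costs.append(cost)
```

The list `costs` records the criterion before each update, so comparing its first and last entries shows whether the iterations reduced it.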

An Example (1)
Let us start with a simple MLP:
[Figure: a small MLP with two tanh units and one linear unit; apart from a single value 0.3, the parameter values shown in the figure are missing from this transcription]

An Example (2)
We forward one example and compute its MSE:
[Figure: same network; output of the linear unit 1.07, MSE = 1.23]

An Example (3)
We backpropagate the gradient everywhere:
[Figure: same network with output 1.07, target -0.5 and MSE = 1.23, annotated with gradients db, dw, ds and dy at every unit; the surviving values are dmse = 1.57, dy = 1.57 and dw = 0.53]

An Example (4)
We modify each parameter with learning rate 0.1:
[Figure: same network with updated parameter values, missing from this transcription]

An Example (5)
We forward the same example and compute its (smaller) MSE:
[Figure: same network; output of the linear unit 0.24, MSE = 0.27]

MLP are Universal Approximators
It can be shown that, under reasonable assumptions, one can approximate any smooth function with an MLP with one layer of hidden units.
First intuition:
Let us consider a classification task.
Let us consider hard transfer functions for hidden units:
$y = \mathrm{step}(s) = \begin{cases} 1 & \text{if } s > 0 \\ 0 & \text{otherwise} \end{cases}$
Let us consider linear transfer functions for output units: $y = s$
First attempt:
$\hat{y} = c + \sum_{i=1}^{N} v_i \cdot \mathrm{sign}\left( \sum_{j=1}^{M} x_j w_{i,j} + b_i \right)$

Illustration: Universal Approximators
[Figures: a sequence of five illustrations; the last one is captioned "... but what about that?"]

Universal Approximation by Cosines
Let us consider simple functions of two variables $y(x_1, x_2)$.
Fourier decomposition:
$y(x_1, x_2) \simeq \sum_s A_s(x_1) \cos(s x_2)$
where the coefficients $A_s$ are functions of $x_1$.
Further Fourier decomposition:
$y(x_1, x_2) \simeq \sum_s \sum_l A_{s,l} \cos(l x_1) \cos(s x_2)$
We know that $\cos(\alpha) \cos(\beta) = \frac{1}{2} \cos(\alpha + \beta) + \frac{1}{2} \cos(\alpha - \beta)$:
$y(x_1, x_2) \simeq \sum_s \sum_l A_{s,l} \left[ \frac{1}{2} \cos(l x_1 + s x_2) + \frac{1}{2} \cos(l x_1 - s x_2) \right]$

Universal Approximation by Cosines
The cos function can be approximated with linear combinations of step functions:
$f(z) = f_0 + \sum_i (f_{i+1} - f_i) \cdot \mathrm{step}(z - z_i)$
So $y(x_1, x_2)$ can be approximated by a linear combination of step functions whose arguments are linear combinations of $x_1$ and $x_2$, and which can be approximated by tanh functions.
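
The staircase formula can be checked numerically. The sketch below assumes a regular grid of knots $z_i$ on $[0, 2\pi]$ with $f_i = \cos(z_i)$; the grid, the helper names and the number of steps are hypothetical choices for illustration.

```python
import math

def step(s):
    """Hard transfer function: 1 if s > 0, 0 otherwise."""
    return 1.0 if s > 0 else 0.0

def staircase(f, z_min, z_max, n_steps):
    """Return regularly spaced knots z_i and the values f_i = f(z_i)."""
    width = (z_max - z_min) / n_steps
    knots = [z_min + i * width for i in range(n_steps + 1)]
    return knots, [f(z) for z in knots]

def approx(z, knots, values):
    """Evaluate f_0 + sum_i (f_{i+1} - f_i) * step(z - z_i)."""
    total = values[0]
    for i in range(len(knots) - 1):
        total += (values[i + 1] - values[i]) * step(z - knots[i])
    return total

def max_error(n_steps, n_test=500):
    """Largest absolute error against cos on a grid of test points in (0, 2*pi)."""
    knots, values = staircase(math.cos, 0.0, 2.0 * math.pi, n_steps)
    zs = [2.0 * math.pi * (k + 0.5) / n_test for k in range(n_test)]
    return max(abs(approx(z, knots, values) - math.cos(z)) for z in zs)

err_coarse = max_error(20)    # 20 step functions
err_fine = max_error(200)     # 200 step functions
```

Since the slope of cos never exceeds 1 in absolute value, the error of the staircase is bounded by the width of one step, so using more step functions tightens the approximation.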

ANN for Binary Classification
One output, with the target coded as $\{-1, 1\}$ or $\{0, 1\}$ depending on the last layer output function (linear, sigmoid, tanh, ...).
For a given output, the associated class corresponds to the nearest target.
How to obtain class posterior probabilities:
  use a sigmoid with targets $\{0, 1\}$
  if the model is correctly trained (with, for instance, the MSE criterion):
  $\hat{y}(x) = E[Y | X = x] = 1 \cdot P(Y = 1 | X = x) + 0 \cdot P(Y = 0 | X = x)$
  the output will thus encode $P(Y = 1 | X = x)$
Note: we do not optimize directly the classification error...

ANN for Multiclass Classification
Simplest solution: one-hot encoding
  One output per class, coded for instance as $(0, \ldots, 1, \ldots, 0)$.
  For a given output, the associated class corresponds to the index of the maximum value in the output vector.
  How to obtain class posterior probabilities: use a softmax:
  $\hat{y}_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}$
  Each output $i$ will encode $P(Y = i | X = x)$.
Otherwise: each class corresponds to a different binary code.
  For example, for a 4-class problem, we could have an 8-dim code for each class.
  For a given output, the associated class corresponds to the nearest code (according to a given distance).
  Example: Error Correcting Output Codes (ECOC)
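
A softmax output layer and the "index of the maximum" decision rule can be sketched as follows. Subtracting the largest score before exponentiating is a common numerical safeguard added here as an assumption; it is not on the slide and does not change the result. The function names and scores are hypothetical.

```python
import math

def softmax(s):
    """y_i = exp(s_i) / sum_j exp(s_j), computed with shifted scores to avoid overflow."""
    m = max(s)
    exps = [math.exp(si - m) for si in s]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(s):
    """Index of the maximum value in the softmax output vector."""
    p = softmax(s)
    return max(range(len(p)), key=lambda i: p[i])

# Hypothetical weighted sums for a 3-class problem.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
label = predict_class(scores)
```

The outputs are positive and sum to 1, which is what allows them to be read as class posterior probabilities.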

Error Correcting Output Codes
Let us represent a 4-class problem with 6 bits:
[Table: one 6-bit code word for each of class 1, class 2, class 3 and class 4; the bit values are missing from this transcription]
We then create 6 classifiers (or 1 classifier with 6 outputs).
For example: the first classifier will try to separate classes 1 and 2 from classes 3 and 4.

Error Correcting Output Codes
Given our 4-class problem represented with 6 bits:
[Table: one 6-bit code word for each of class 1, class 2, class 3 and class 4; the bit values are missing from this transcription]
When a new example comes, we compute the distance between the code obtained by the 6 classifiers and the codes of the 4 classes (the obtained code is missing from this transcription).
Distances (let us use the Manhattan distance):
  to class 1: 5
  to class 2: 4
  to class 3: 2
  to class 4: 3

What is a Good Error Correcting Output Code
How to devise a good error correcting output code?
Maximize the minimum Hamming distance between any pair of code words.
A good ECOC should satisfy two properties:
  Row separation (Hamming distance).
  Column separation: column functions should be as uncorrelated as possible with each other.
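
Decoding with the Manhattan distance and checking row separation can be sketched as follows. The code words below are hypothetical: the ones used in the slides are not available here, so `CODES` and the `obtained` vector are made up for illustration (the first bit does separate classes 1 and 2 from classes 3 and 4, as in the example above).

```python
# Hypothetical 6-bit code words for a 4-class problem (NOT the ones from the slides).
CODES = {
    1: (1, 1, 0, 0, 0, 1),
    2: (1, 0, 1, 0, 1, 0),
    3: (0, 1, 1, 1, 0, 0),
    4: (0, 0, 0, 1, 1, 1),
}

def manhattan(u, v):
    """Manhattan distance; on binary vectors it equals the Hamming distance."""
    return sum(abs(a - b) for a, b in zip(u, v))

def decode(outputs, codes=CODES):
    """Return the class whose code word is nearest to the classifier outputs."""
    distances = {c: manhattan(outputs, code) for c, code in codes.items()}
    return min(distances, key=distances.get), distances

def min_hamming_distance(codes=CODES):
    """Row separation: smallest distance between any pair of code words."""
    words = list(codes.values())
    return min(manhattan(u, v)
               for i, u in enumerate(words) for v in words[i + 1:])

# Hypothetical outputs of the 6 classifiers for a new example.
obtained = (0.9, 0.2, 0.8, 0.1, 0.3, 0.1)
label, distances = decode(obtained)

# A code word of class 1 with one flipped bit is still decoded as class 1.
corrected, _ = decode((1, 1, 0, 0, 1, 1))
```

With these code words every pair of rows differs in 4 bits, so a single wrong classifier output cannot move an example closer to another class.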

A good book to make ANNs work:
G. B. Orr and K. Müller. Neural Networks: Tricks of the Trade. Springer.
Content:
  Stochastic Gradient
  Initialization
  Learning Rate and Learning Rate Decay
  Weight Decay

Stochastic Gradient Descent
The gradient descent technique is batch: first accumulate the gradient from all examples, then adjust the parameters.
What if the data set is very big, and contains redundancies?
Other solution: stochastic gradient descent. Adjust the parameters after each example instead.
Stochastic: we approximate the full gradient with its estimate at each example.
Nevertheless, convergence proofs exist for such a method.
Moreover: much faster for large data sets!
Other gradient techniques: second order methods such as conjugate gradient, which are good for small data sets.
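
The difference with the batch procedure is only where the update happens. The sketch below assumes a hypothetical one-parameter model $\hat{y} = \theta x$ with the loss $\frac{1}{2}(y - \theta x)^2$ and noise-free data generated with $\theta = 2$; shuffling the examples at each pass is a common practice added here as an assumption, not something stated in the slides.

```python
import random

def sgd(data, theta, eta, n_epochs, rng):
    """Stochastic gradient descent: update theta after every single example."""
    for _ in range(n_epochs):
        examples = list(data)
        rng.shuffle(examples)              # visit the examples in random order
        for x, y in examples:
            grad = (theta * x - y) * x     # gradient of 0.5 * (y - theta * x)^2
            theta = theta - eta * grad     # immediate update, no accumulation
    return theta

# Hypothetical noise-free data set generated with theta = 2.
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(-10, 11)]
theta_sgd = sgd(data, theta=0.0, eta=0.1, n_epochs=50, rng=random.Random(0))
```

Each pass over the 21 examples performs 21 updates, where the batch procedure would perform one.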

Initialization
How should we initialize the parameters of an ANN?
One common problem: saturation.
When the weighted sum is big, the output of the tanh (or sigmoid) saturates, and the gradient tends towards 0.
[Figure: transfer function plotted against the weighted sum; the derivative is good in the central region and almost zero in the saturated regions]

Initialization
Hence, we should initialize the parameters such that the average weighted sum is in the linear part of the transfer function. See Léon Bottou's thesis for details.
  input data: normalized with zero mean and unit variance
  targets:
    regression: normalized with zero mean and unit variance
    classification:
      output transfer function is tanh: 0.6 and -0.6
      output transfer function is sigmoid: 0.8 and 0.2
      output transfer function is linear: 0.6 and -0.6
  parameters: uniformly distributed in $\left[ -\frac{1}{\sqrt{\text{fan in}}}, \frac{1}{\sqrt{\text{fan in}}} \right]$
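
These recommendations can be sketched for one layer and one input feature. The helper names are hypothetical, the biases are drawn from the same interval as the weights (an assumption, since the slide only says "parameters"), and `standardize` assumes the feature is not constant.

```python
import math
import random

def init_layer(fan_in, n_units, rng):
    """Draw weights and biases uniformly in [-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    bound = 1.0 / math.sqrt(fan_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
               for _ in range(n_units)]
    biases = [rng.uniform(-bound, bound) for _ in range(n_units)]
    return weights, biases

def standardize(column):
    """Normalize one input feature to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

rng = random.Random(0)                       # fixed seed for reproducibility
W, b = init_layer(fan_in=16, n_units=4, rng=rng)
z = standardize([1.0, 2.0, 3.0, 4.0])
```

With 16 inputs per unit every weight lies in [-0.25, 0.25], which keeps the initial weighted sums small for standardized inputs.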

Learning Rate and Learning Rate Decay
How to select the learning rate $\eta$?
If $\eta$ is too big: the optimization diverges.
If $\eta$ is too small: the optimization is very slow and may get stuck in local minima.
One solution: progressive decay
  initial learning rate $\eta = \eta_0$
  learning rate decay $\eta_d$
  At each iteration $s$: $\eta(s) = \frac{\eta_0}{1 + s \cdot \eta_d}$
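
The decay schedule is a one-line function. The values of `eta0` and `eta_d` below are hypothetical and only serve to show the shape of the schedule.

```python
def learning_rate(s, eta0, eta_d):
    """Progressive decay: eta(s) = eta0 / (1 + s * eta_d)."""
    return eta0 / (1.0 + s * eta_d)

# Learning rate at iterations 0, 10, 20, 30 and 40 for hypothetical settings.
schedule = [learning_rate(s, eta0=0.1, eta_d=0.1) for s in range(0, 50, 10)]
```

With these settings the learning rate is halved after 10 iterations and keeps decreasing more and more slowly afterwards.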

Learning Rate Decay (Graphical View)
[Figure: learning rate as a function of the iteration, a decreasing curve proportional to 1/(1+0.1*x)]

Weight Decay
One way to control the capacity: regularization.
For MLPs, when the weights tend to 0, sigmoid or tanh functions are almost linear, hence with low capacity.
Weight decay: penalize solutions with high weights and biases (in amplitude):
$C(D_n, \theta) = \sum_{p=1}^{n} L(y(p), \hat{y}(p)) + \frac{\beta}{2} \sum_{j=1}^{|\theta|} \theta_j^2$
where $\beta$ controls the weight decay.
Easy to implement:
$\theta^{s+1}_j = \theta^s_j - \eta \sum_{p=1}^{n} \frac{\partial L(y(p), \hat{y}(p))}{\partial \theta^s_j} - \eta \cdot \beta \cdot \theta^s_j$
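
The update with weight decay adds one term to plain gradient descent. In the sketch below, `grad_loss` stands for the accumulated gradient of the loss for each parameter; the function name and the numeric values are hypothetical.

```python
def weight_decay_step(theta, grad_loss, eta, beta):
    """theta_j <- theta_j - eta * dL/dtheta_j - eta * beta * theta_j."""
    return [t - eta * g - eta * beta * t for t, g in zip(theta, grad_loss)]

# With a zero loss gradient, every parameter simply shrinks towards 0.
shrunk = weight_decay_step([1.0, -2.0], [0.0, 0.0], eta=0.1, beta=0.5)

# With beta = 0 the update reduces to plain gradient descent.
plain = weight_decay_step([1.0, -2.0], [0.5, 0.5], eta=0.1, beta=0.0)
```

In the first call each parameter is multiplied by (1 - eta * beta) = 0.95, which is the origin of the name weight decay.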

Examples of Training Criteria
Mean-squared error, for regression:
$L(y, \hat{y}) = \sum_{i=1}^{d} \frac{1}{2} (y_i - \hat{y}_i)^2$
Cross-entropy criterion, for classification (targets $\{-1, 1\}$):
$L(y, \hat{y}) = \log(1 + \exp(-y_i \hat{y}_i))$
Hard version:
$L(y, \hat{y}) = |1 - y_i \hat{y}_i|_+$
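
The three criteria can be written as small functions. This sketch treats the two classification criteria for a single output with target in {-1, 1}; the function names are hypothetical.

```python
import math

def mse(y, y_hat):
    """Mean-squared error over the d outputs: sum_i 0.5 * (y_i - y_hat_i)^2."""
    return sum(0.5 * (yi - yh) ** 2 for yi, yh in zip(y, y_hat))

def cross_entropy(y, y_hat):
    """log(1 + exp(-y * y_hat)) for one output, with target y in {-1, 1}."""
    return math.log(1.0 + math.exp(-y * y_hat))

def hard_version(y, y_hat):
    """|1 - y * y_hat|_+ : zero once the output is on the right side with margin 1."""
    return max(0.0, 1.0 - y * y_hat)

well_classified = cross_entropy(1, 3.0)    # small loss
misclassified = cross_entropy(1, -3.0)     # large loss
```

Both classification criteria depend only on the product of the target and the output, so they decrease as the output moves further onto the correct side.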

Examples of Training Criteria
[Figure: plot of the training criteria as functions of x; legend text as extracted: log(1 + exp(-x)), 1 - x +, 0.5*x*x]
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationCSC321 Lecture 4: Learning a Classifier
CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 28 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron
More informationMultilayer Perceptrons and Backpropagation
Multilayer Perceptrons and Backpropagation Informatics 1 CG: Lecture 7 Chris Lucas School of Informatics University of Edinburgh January 31, 2017 (Slides adapted from Mirella Lapata s.) 1 / 33 Reading:
More informationDEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo
More information(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann
(Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for
More informationCSC321 Lecture 4: Learning a Classifier
CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 31 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron
More informationDeep Feedforward Networks. Seung-Hoon Na Chonbuk National University
Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationComputational Intelligence
Plan for Today Single-Layer Perceptron Computational Intelligence Winter Term 00/ Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS ) Fakultät für Informatik TU Dortmund Accelerated Learning
More informationword2vec Parameter Learning Explained
word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationVasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks
C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,
More informationIntroduction to Machine Learning
Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................
More informationARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD
ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided
More informationAn Introduction to Statistical Machine Learning - Theoretical Aspects -
An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,
More informationFeed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.
Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More informationCSCI567 Machine Learning (Fall 2018)
CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due
More informationOptimization and Gradient Descent
Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function
More informationMachine Learning and Data Mining. Linear classification. Kalev Kask
Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationIntroduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen
Neural Networks - I Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Neural Networks 1 /
More informationMulticlass Logistic Regression
Multiclass Logistic Regression Sargur. Srihari University at Buffalo, State University of ew York USA Machine Learning Srihari Topics in Linear Classification using Probabilistic Discriminative Models
More informationArtificial Neural Networks
Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks
More informationNeural Networks. Advanced data-mining. Yongdai Kim. Department of Statistics, Seoul National University, South Korea
Neural Networks Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea What is Neural Networks? One of supervised learning method using one or more hidden layer.
More information<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)
Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline
More informationNeural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann
Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable
More informationUnit III. A Survey of Neural Network Model
Unit III A Survey of Neural Network Model 1 Single Layer Perceptron Perceptron the first adaptive network architecture was invented by Frank Rosenblatt in 1957. It can be used for the classification of
More informationArtificial Neural Networks (ANN) Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso
Artificial Neural Networks (ANN) Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Fall, 2018 Outline Introduction A Brief History ANN Architecture Terminology
More informationNeural Nets Supervised learning
6.034 Artificial Intelligence Big idea: Learning as acquiring a function on feature vectors Background Nearest Neighbors Identification Trees Neural Nets Neural Nets Supervised learning y s(z) w w 0 w
More informationCOMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16
COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-
More informationNeural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications
Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationAN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009
AN INTRODUCTION TO NEURAL NETWORKS Scott Kuindersma November 12, 2009 SUPERVISED LEARNING We are given some training data: We must learn a function If y is discrete, we call it classification If it is
More informationSPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks
Topics in Machine Learning-EE 5359 Neural Networks 1 The Perceptron Output: A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension
More information