# Lecture 5: Logistic Regression. Neural Networks

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture 5 - September 21,

2 Recall: Binary classification We are given a set of data x i, y i with y i {0, 1}. The probability of a given input x to have class y = 1 can be computed as: where P (y = 1 x) = a = ln P (x, y = 1) P (x) = P (x y = 1)P (y = 1) P (x y = 0)P (y = 0) exp( a) = σ(a) σ is the sigmoid function (also called squashing ) function a is the log-odds of the data being class 1 vs. class 0 COMP-652, Lecture 5 - September 21,

3 Recall: Modelling for binary classification P (y = 1 x) = σ ( ln ) P (x y = 1)P (y = 1) P (x y = 0)P (y = 0) One approach is to model P (y) and P (x y), then use the approach above for classification (naive Bayes, Gaussian discriminants) This is called generative learning, because we can actually use the model to generate (i.e. fantasize) data Another idea is to model directly P (y x) This is called discriminative learning, because we only care about discriminating (i.e. separating) examples of the two classes. We focus on this approach today COMP-652, Lecture 5 - September 21,

4 Error function Maximum likelihood classification assumes that we will find the hypothesis that maximizes the (log) likelihood of the training data: arg max h log P ( x i, y i i=1...m h) = arg max h n log P (x i,, y i ) h) i=1 (using the usual i.i.d. assumption) But if we use discriminative learning, we have no model of the input distribution Instead, we want to maximize the conditional probability of the labels, given the inputs: m arg max log P (y i x i, h) h i=1 COMP-652, Lecture 5 - September 21,

5 The cross-entropy error function Suppose we interpret the output of the hypothesis, h(x i ), as the probability that y i = 1 Then the log-likelihood of a hypothesis h is: log L(h) = = m log P (y i x i, h) = i=1 m i=1 { log h(xi ) if y i = 1 log(1 h(x i )) if y i = 0 m y i log h(x i ) + (1 y i ) log(1 h(x i )) i=1 The cross-entropy error function is the opposite quantity: ( m ) J(h) = y i log h(x i ) + (1 y i ) log(1 h(x i )) i=1 COMP-652, Lecture 5 - September 21,

6 Logistic regression Suppose we represent the hypothesis itself as a logistic function of a linear combination of inputs: h(x = exp(w T x) This is also known as a sigmoid neuron Suppose we interpret h(x) as P (y = 1 x) Then the log-odds ratio, ( ) P (y = 1 x) ln = w T x P (y = 0 x) which is linear (nice!) We can find the optimum weight by minimizing cross-entropy (maximizing the conditional likelihood of the data) COMP-652, Lecture 5 - September 21,

7 Cross-entropy error surface for logistic function ( m ) J(h) = y i log σ(w T x i ) + (1 y i ) log(1 σ(w T x i )) i=1 Nice error surface, unique minimum, but cannot solve in closed form COMP-652, Lecture 5 - September 21,

8 Maximization procedure: Gradient ascent First we compute the gradient of log L(w) wrt w: log L(w) = 1 y i h i w (x i ) h w(x i )(1 h w (x i ))x i 1 + (1 y i ) 1 h w (x i ) h w(x i )(1 h w (x i ))x i ( 1) = i x i (y i y i h w (x i ) h w (x i ) + y i h w (x i )) = i (y i h w (x i ))x i The update rule (because we maximize) is: m w w + α log L(w) = w + α (y i h w (x i ))x i where α (0, 1) is a step-size or learning rate parameter This is called logistic regression i=1 COMP-652, Lecture 5 - September 21,

9 Another algorithm for optimization Recall Newton s method for finding the zero of a function g : R R At point w i, approximate the function by a straight line (its tangent) Solve the linear equation for where the tangent equals 0, and move the parameter to this point: w i+1 = w i g(wi ) g (w i ) COMP-652, Lecture 5 - September 21,

10 Application to machine learning Suppose for simplicity that the error function J has only one parameter We want to optimize J, so we can apply Newton s method to find the zeros of J = d dw J We obtain the iteration: w i+1 = w i J (w i ) J (w i ) Note that there is no step size parameter! This is a second-order method, because it requires computing the second derivative But, if our error function is quadratic, this will find the global optimum in one step! COMP-652, Lecture 5 - September 21,

11 Second-order methods: Multivariate setting If we have an error function J that depends on many variables, we can compute the Hessian matrix, which contains the second-order derivatives of J: H ij = 2 J w i w j The inverse of the Hessian gives the optimal learning rates The weights are updated as: w w H 1 w J This is also called Newton-Raphson method COMP-652, Lecture 5 - September 21,

12 Which method is better? Newton s method usually requires significantly fewer iterations than gradient descent Computing the Hessian requires a batch of data, so there is no natural on-line algorithm Inverting the Hessian explicitly is expensive, necessary but almost never Computing the product of a Hessian with a vector can be done in linear time (Schraudolph, 1994) COMP-652, Lecture 5 - September 21,

13 Newton-Raphson for logistic regression Leads to a nice algorithm called recursive least squares The Hessian has the form: H = Φ T RΦ where R is the diagonal matrix of h(x i )(1 h(x i )) The weight update becomes: w (Φ T RΦ) 1 Φ T R(Φw R 1 (Φw y) COMP-652, Lecture 5 - September 21,

14 Logistic vs. Gaussian Discriminant Comprehensive study done by Ng and Jordan (2002) If the Gaussian assumption is correct, as expected, Gaussian discriminant is better If the assumption is violated, Gaussian discriminant suffers from bias (unlike logistic regression) In practice, Gaussian discriminant, tends to converge quicker to a less helpful solution Note that they optimize different error function! COMP-652, Lecture 5 - September 21,

15 From neurons to networks Logistic regression can be used to learn linearly separable problems x 2 + x x x 1 (a) If the instances are not linearly separable, the function cannot be learned correctly (e.g. XOR) One solution is to provide fixed, non-linear basis functions instead of inputs (like we did with linear regression) Today: learn such non-linear functions from data (b) COMP-652, Lecture 5 - September 21,

16 Example: Logical functions of two variables Sigmoid Unit for And Function mse-curve-0.1 mse-curve-0.01 mse-curve xor-no-hidden Mean Square Error Squared Error Number of Epochs Number of epochs One sigmoid neuron can learn the AND function (left) but not the XOR function (right) In order to learn discrimination in data sets that are not linearly separable, we need networks of sigmoid units COMP-652, Lecture 5 - September 21,

17 Example: A network representing the XOR function 1 0 w30 (-3.06) Input 1 1 (6.92) w31 3 w32 (6.91) w41 (4.5) (10.28) w51 w50 (-4.86) 5 Ouput Learning Curve for XOR Function with Architecture mse-curve Input w42 (4.36) w40 (-6.8) w52 (-10.34) Squared Error Input1 Input2 o3 o4 Ouput Number of Epochs COMP-652, Lecture 5 - September 21,

18 Feed-forward neural networks Hidden Layer 1 1 i Oi=Xji Input Layer Output Layer Wji Input 1 j... Input n... 1 A collection of units (neurons) with sigmoid or sigmoid-like activations, arranged in layers Layers 0 is the input layer, and its units just copy the inputs (by convention) The last layer, K, is called output layer, since its units provide the result of the network Layers 1,... K 1 are usually called hidden layers (their presence cannot be detected from outside the network) 1 COMP-652, Lecture 5 - September 21,

19 Why this name? In feed-forward networks the outputs of units in layer k become inputs for units in layers with index > k There are no cross-connections between units in the same layer There are no backward ( recurrent ) connections from layers downstream Typically, units in layer k provide input only to units in layer k + 1 In fully connected networks, all units in layer k are connected to all units in layer k + 1 COMP-652, Lecture 5 - September 21,

20 Notation Hidden Layer Input Layer 1 i Oi=Xji Wji 1 Output Layer Input 1 j... Input n w j,i is the weight on the connection from unit i to unit j By convention, x j,0 = 1, j The output of unit j, denoted o j, is computed using a sigmoid: o j = σ(w T j x j) where w j is the vector of weights on the connections entering unit j and x j is the vector of inputs to unit j By the definition of the connections, x j,i = o i COMP-652, Lecture 5 - September 21,

21 Computing the output of the network Suppose that we want the network to make a prediction for instance x, y In a feed-forward network, this can be done in one forward pass: For layer k = 1 to K 1. Compute the output of all neurons in layer k: o j σ(w T j x j ), j Layer k 2. Copy these outputs as inputs to the next layer: x j,i o i, i Layer k, j Layer k + 1 COMP-652, Lecture 5 - September 21,

22 Expressiveness of feed-forward neural networks A single sigmoid neuron has the same representational power as a perceptron: Boolean AND, OR, NOT, but not XOR Every Boolean function can be represented by a network with a single hidden layer, but might require a number of hidden units that is exponential in the number of inputs Every bounded continuous function can be approximated with arbitrary precision by a network with one, sufficiently large hidden layer [Cybenko 1989; Hornik et al. 1989] Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]. COMP-652, Lecture 5 - September 21,

23 Learning in feed-forward neural networks Usually, the network structure (units and interconnections) is specified by the designer The learning problem is finding a good set of weights The answer: gradient descent, because the form of the hypothesis formed by the network, h w, is Differentiable! Because of the choice of sigmoid units Very complex! Hence, direct computation of the optimal weights is not possible COMP-652, Lecture 5 - September 21,

24 Backpropagation algorithm Just gradient descent over all weights in the network! We put together the two phases described above: 1. Forward pass: Compute the outputs of all units in the network, o k, k = N + 1,... N + H + 1, going in increasing order of the layers 2. Backward pass: Compute the δ k updates described before, going from k = N + H + 1 down to k = N + 1 (in decreasing order of the layers) 3. Update to all the weights in the network: w i,j w i,j + α i,j δ i x i,j For computing probabilities, drop the o(1 o) terms from the δ COMP-652, Lecture 5 - September 21,

25 Backpropagation algorithm 1. Initialize all weights to small random numbers. 2. Repeat until satisfied: (a) Pick a training example (b) Input example to the network and compute the outputs o k (c) For each output unit k, δ k o k (1 o k )(y k o k ) (d) For each hidden unit l δ l o l (1 o l ) k outputs w lk δ k (e) Update each network weight: w ij w ij + α ij δ j x ij x ij is the input from unit i into unit j (for the output neurons, these are signals received from the hidden layer neurons) α ij is the learning rate or step size COMP-652, Lecture 5 - September 21,

26 Backpropagation variations The previous version corresponds to incremental (stochastic) gradient descent An analogous batch version can be used as well: Loop through the training data, accumulating weight changes Update weights One pass through the data set is called epoch Algorithm can be easily generalized to predict probabilities, instead of minimizing sum-squared error Generalization is easy for other network structures as well. COMP-652, Lecture 5 - September 21,

27 Convergence of backpropagation Backpropagation performs gradient descent over all the parameters in the network Hence, if the learning rate is appropriate, the algorithm is guaranteed to converge to a local minimum of the cost function NOT the global minimum Can be much worse than global minimum There can be MANY local minima (Auer et al, 1997) Partial solution: restarting = train multiple nets with different initial weights. In practice, quite often the solution found is very good COMP-652, Lecture 5 - September 21,

28 Adding momentum On the t-th training sample, instead of the update: we do: w ij α ij δ j x ij w ij (t) α ij δ j x ij + β w ij (t 1) The second term is called momentum Advantages: Easy to pass small local minima Keeps the weights moving in areas where the error is flat Increases the speed where the gradient stays unchanged Disadvantages: COMP-652, Lecture 5 - September 21,

29 With too much momentum, it can get out of a global maximum! One more parameter to tune, and more chances of divergence COMP-652, Lecture 5 - September 21,

30 Choosing the learning rate Backprop is VERY sensitive to the choice of learning rate Too large divergence Too small VERY slow learning The learning rate also influences the ability to escape local optima Very often, different learning rates are used for units inn different layers It can be shown that each unit has its own optimal learning rate COMP-652, Lecture 5 - September 21,

31 Adjusting the learning rate: Delta-bar-delta Heuristic method, works best in batch mode (though there have been attempts to make it incremental) The intuition: If the gradient direction is stable, the learning rate should be increased If the gradient flips to the opposite direction the learning rate should be decreased A running average of the gradient and a separate learning rate is kept for each weight If the new gradient and the old average have the same sign, increase the learning rate by a constant amount If they have opposite sign, decay the learning rate exponentially COMP-652, Lecture 5 - September 21,

32 How large should the network be? Overfitting occurs if there are too many parameters compared to the amount of data available Choosing the number of hidden units: Too few hidden units do not allow the concept to be learned Too many lead to slow learning and overfitting If the n inputs are binary, log n is a good heuristic choice Choosing the number of layers Always start with one hidden layer Never go beyond 2 hidden layers, unless the task structure suggests something different COMP-652, Lecture 5 - September 21,

33 Overtraining in feed-forward networks Error Error versus weight updates (example 1) Training set error Validation set error Number of weight updates Traditional overfitting is concerned with the number of parameters vs. the number of instances In neural networks there is an additional phenomenon called overtraining which occurs when weights take on large magnitudes, pushing the sigmoids into saturation Effectively, as learning progresses, the network has more actual parameters Use a validation set to decide when to stop training! COMP-652, Lecture 5 - September 21,

34 Finding the right network structure Destructive methods start with a large network and then remove (prune) connections Constructive methods start with a small network (e.g. 1 hidden unit) and add units as required to reduce error COMP-652, Lecture 5 - September 21,

35 Destructive methods Simple solution: consider removing each weight in turn (by setting it to 0), and examine the effect on the error Weight decay: give each weight a chance to go to 0, unless it is needed to decrease error: where λ is a decay rate Optimal brain damage: w j = α j J w j λw j Train the network to a local optimum Approximate the saliency of each link or unit (i.e., its impact on the performance of the network), using the Hessian matrix Greedily prune the element with the lowest saliency This loop is repeated until the error starts to deteriorate COMP-652, Lecture 5 - September 21,

36 Constructive methods Dynamic node creation (Ash): Start with just one hidden unit, train using backprop If the error is still high, add another unit in the same layer and repeat Only one layer is constructed Meiosis networks (Hanson): Start with just one hidden unit, train using backprop Compute the variance of each weight during training If a unit has one or more weights of high variance, it is split into two units, and the weights are perturbed The intuition is that the new units will specialize to different functions. Cascade correlation: Add units to correlate with the residual error COMP-652, Lecture 5 - September 21,

37 Example applications Speech phoneme recognition [Waibel] and synthesis [Nettalk] Image classification [Kanade, Baluja, Rowley] Digit recognition [Bengio, Bouttou, LeCun et al - LeNet] Financial prediction Learning control [Pomerleau et al] COMP-652, Lecture 5 - September 21,

38 When to consider using neural networks Input is high-dimensional discrete or real-valued (e.g. raw sensor input) Output is discrete or real valued, or a vector of values Possibly noisy data Training time is not important Form of target function is unknown Human readability of result is not important The computation of the output based on the input has to be fast COMP-652, Lecture 5 - September 21,

### Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

### COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

### Artificial Neural Networks

Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on

### The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural

1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself

### Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

### Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

### 18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

Advanced Machine Learning Lecture 4: Deep Learning Essentials Pierre Geurts, Gilles Louppe, Louis Wehenkel 1 / 52 Outline Goal: explain and motivate the basic constructs of neural networks. From linear

### CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

### Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

+ Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

### Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

### Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### Artificial Neural Networks 2

CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

### Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### Multiclass Logistic Regression

Multiclass Logistic Regression Sargur. Srihari University at Buffalo, State University of ew York USA Machine Learning Srihari Topics in Linear Classification using Probabilistic Discriminative Models

### Statistical Machine Learning from Data

January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

### Machine Learning Lecture 5

Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

### ECE521 Lecture7. Logistic Regression

ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

### Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

### Midterm, Fall 2003

5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

### NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

### Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function

### The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is

### Learning Neural Networks

Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic

### Neural Networks and Deep Learning.

Neural Networks and Deep Learning www.cs.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts perceptrons the perceptron training rule linear separability hidden

### Deep Feedforward Networks

Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

### Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications

### CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Learning Neural Networks Classifier Short Presentation INPUT: classification data, i.e. it contains an classification (class) attribute.

### Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

### ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

### Machine Learning Lecture 7

Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

### Neural Networks. Xiaojin Zhu Computer Sciences Department University of Wisconsin, Madison. slide 1

Neural Networks Xiaoin Zhu erryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1 Terminator 2 (1991) JOHN: Can you learn? So you can be... you know. More human. Not

### Statistical NLP for the Web

Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks

### Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

### COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-

### Artificial neural networks

Artificial neural networks Chapter 8, Section 7 Artificial Intelligence, spring 203, Peter Ljunglöf; based on AIMA Slides c Stuart Russel and Peter Norvig, 2004 Chapter 8, Section 7 Outline Brains Neural

### Learning from Examples

Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble

### Links between Perceptrons, MLPs and SVMs

Links between Perceptrons, MLPs and SVMs Ronan Collobert Samy Bengio IDIAP, Rue du Simplon, 19 Martigny, Switzerland Abstract We propose to study links between three important classification algorithms:

### Machine Learning Lecture 12

Machine Learning Lecture 12 Neural Networks 30.11.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability

### Lecture 17: Neural Networks and Deep Learning

UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

### Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

### Multilayer Perceptron

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

### Machine Learning 2017

Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

### Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.

### Multivariate Analysis Techniques in HEP

Multivariate Analysis Techniques in HEP Jan Therhaag IKTP Institutsseminar, Dresden, January 31 st 2013 Multivariate analysis in a nutshell Neural networks: Defeating the black box Boosted Decision Trees:

### Introduction to Machine Learning

Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

### Neural Networks: Introduction

Neural Networks: Introduction Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others 1

### Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

### Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

### 10-701/ Machine Learning, Fall

0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

### Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

### How to do backpropagation in a brain

How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

### Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016

Neural Networks Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Outline Part 1 Introduction Feedforward Neural Networks Stochastic Gradient Descent Computational Graph

### Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

### Supervised Learning in Neural Networks

The Norwegian University of Science and Technology (NTNU Trondheim, Norway keithd@idi.ntnu.no March 7, 2011 Supervised Learning Constant feedback from an instructor, indicating not only right/wrong, but

### Learning from Data Logistic Regression

Learning from Data Logistic Regression Copyright David Barber 2-24. Course lecturer: Amos Storkey a.storkey@ed.ac.uk Course page : http://www.anc.ed.ac.uk/ amos/lfd/ 2.9.8.7.6.5.4.3.2...2.3.4.5.6.7.8.9

### 10 NEURAL NETWORKS Bio-inspired Multi-Layer Networks. Learning Objectives:

10 NEURAL NETWORKS TODO The first learning models you learned about (decision trees and nearest neighbor models) created complex, non-linear decision boundaries. We moved from there to the perceptron,

### CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! November 18, 2015 THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can t even say see ya

### ECS171: Machine Learning

ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

### Sections 18.6 and 18.7 Artificial Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs. artifical neural

### Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

### CENG 793. On Machine Learning and Optimization. Sinan Kalkan

CENG 793 On Machine Learning and Optimization Sinan Kalkan 2 Now Introduction to ML Problem definition Classes of approaches K-NN Support Vector Machines Softmax classification / logistic regression Parzen

### Lab 5: 16 th April Exercises on Neural Networks

Lab 5: 16 th April 01 Exercises on Neural Networks 1. What are the values of weights w 0, w 1, and w for the perceptron whose decision surface is illustrated in the figure? Assume the surface crosses the

### Introduction to feedforward neural networks

. Problem statement and historical context A. Learning framework Figure below illustrates the basic framework that we will see in artificial neural network learning. We assume that we want to learn a classification

### Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

### CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

### Machine Learning Lecture 14

Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### The XOR problem. Machine learning for vision. The XOR problem. The XOR problem. x 1 x 2. x 2. x 1. Fall Roland Memisevic

The XOR problem Fall 2013 x 2 Lecture 9, February 25, 2015 x 1 The XOR problem The XOR problem x 1 x 2 x 2 x 1 (picture adapted from Bishop 2006) It s the features, stupid It s the features, stupid The

### Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

### Machine Learning, Fall 2012 Homework 2

0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

### Lecture 2: Learning with neural networks

Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for

### Machine Learning (COMP-652 and ECSE-608)

Machine Learning (COMP-652 and ECSE-608) Lecturers: Guillaume Rabusseau and Riashat Islam Email: guillaume.rabusseau@mail.mcgill.ca - riashat.islam@mail.mcgill.ca Teaching assistant: Ethan Macdonald Email:

### Linear classification with logistic regression

Section 8.6. Regression and Classification with Linear Models 725 Proportion correct.9.7 Proportion correct.9.7 2 3 4 5 6 7 2 4 6 8 2 4 6 8 Number of weight updates Number of weight updates Number of weight

### 2018 EE448, Big Data Mining, Lecture 5. (Part II) Weinan Zhang Shanghai Jiao Tong University

2018 EE448, Big Data Mining, Lecture 5 Supervised Learning (Part II) Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of Supervised Learning

### Generative v. Discriminative classifiers Intuition

Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

### Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) Edmondo Trentin April 17, 2013 ANN: Definition The definition of ANN is given in 3.1 points. Indeed, an ANN is a machine that is completely specified once we define its:

### Machine Learning

Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 2, 2015 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

### Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018

10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1 Neural Networks Outline

### MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on

### Single layer NN. Neuron Model

Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent

### Neural Networks Lecture 4: Radial Bases Function Networks

Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi

### Machine Learning (COMP-652 and ECSE-608)

Machine Learning (COMP-652 and ECSE-608) Instructors: Doina Precup and Guillaume Rabusseau Email: dprecup@cs.mcgill.ca and guillaume.rabusseau@mail.mcgill.ca Teaching assistants: Tianyu Li and TBA Class

### Learning strategies for neuronal nets - the backpropagation algorithm

Learning strategies for neuronal nets - the backpropagation algorithm In contrast to the NNs with thresholds we handled until now NNs are the NNs with non-linear activation functions f(x). The most common

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

### Deep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated

Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last updated 2016-10-04 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units

### Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation

### (Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

### Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.

### Support Vector Machines and Kernel Methods

2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

### What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 Multi-layer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multi-layer networks 2 What Do Single

### Backpropagation: The Good, the Bad and the Ugly

Backpropagation: The Good, the Bad and the Ugly The Norwegian University of Science and Technology (NTNU Trondheim, Norway keithd@idi.ntnu.no October 3, 2017 Supervised Learning Constant feedback from

### Midterm: CS 6375 Spring 2015 Solutions

Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an

### Deep Learning: a gentle introduction

Deep Learning: a gentle introduction Jamal Atif jamal.atif@dauphine.fr PSL, Université Paris-Dauphine, LAMSADE February 8, 206 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, 206 / Why

### Error Backpropagation

Error Backpropagation Sargur Srihari 1 Topics in Error Backpropagation Terminology of backpropagation 1. Evaluation of Error function derivatives 2. Error Backpropagation algorithm 3. A simple example